# Recurrent Neural Networks for BoolQ Reading Comprehension

## 1. Introduction

- **Objective**: Develop a reading comprehension model using a 2-layer LSTM and a 2-layer classifier. The model will be trained end-to-end on the BoolQ dataset.
- **Task**: The BoolQ dataset involves answering yes/no questions given a passage. The goal is to predict the correct label for each question.
- **Approach**: Utilize PyTorch for building the model, and Hugging Face's datasets library to manage data.


## 2. Setup
- **Libraries**: 
  - `torch`: For building the neural network.
  - `datasets`: For loading the BoolQ dataset.
  - `transformers`: For using a pre-trained BPE tokenizer.
  - `fasttext`: To load and use FastText embeddings.
  - `numpy`, `pandas`, `matplotlib`, `seaborn`: For data manipulation and visualization.
  - `gensim`: For loading the pre-trained word embedding model.
  - `sklearn`: For metrics.
  - `wandb`: For experiment tracking.

- **Environment Configuration**:
  - Ensure reproducibility by setting random seeds where possible.
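
A minimal sketch of the seeding step, assuming a single helper (here hypothetically named `set_seed`) is called once at the start of each run. The default seed of 42 is an arbitrary placeholder, and the guarded imports are only there so the sketch also runs where `numpy` or `torch` is not installed.

```python
import random


def set_seed(seed: int = 42) -> None:
    """Seed the random number generators this project relies on."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # numpy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)  # seeds the CPU and all CUDA generators
    except ImportError:
        pass  # torch not installed; nothing to seed


set_seed(0)
first_draw = random.random()
set_seed(0)
second_draw = random.random()  # identical to first_draw after re-seeding
```

Seeding alone does not guarantee bit-identical results on GPU, which is why the plan says "where possible".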

- **Planned Correctness Tests**:
  - Use `assert` statements to check tensor dimensions, and confirm the expected shapes of inputs and outputs throughout the data pipeline.
  - Print sample outputs at different stages to validate transformations.

- **Experiment Tracking**:
  - Use `wandb` for logging experiments, including hyperparameters, metrics, and visualizations.


## 3. Preprocessing
- **Tokenization**:
  - **Decision**: Use a pre-trained Byte-Pair Encoding (BPE) tokenizer from the `transformers` library.
  - **Rationale**:
    - Using a pre-trained tokenizer simplifies the preprocessing pipeline, as the tokenizer has already been trained on a large and diverse corpus, which increases its generalization capability.
    - Pre-trained tokenizers from `transformers` are well-optimized and widely used in various NLP tasks.
    - BPE helps handle out-of-vocabulary (OOV) words by breaking them into known subword units, allowing for more robust word representations.

- **Word Embedding**:
  - **Decision**: Use the FastText API for embeddings, as it can leverage subword information to generate embeddings for words not in the vocabulary.
  - **Rationale**:
    - The FastText API is designed to handle OOV words by computing embeddings using subword units.

- **Handling Text Cleaning**:
  - **Operations**:
    - Convert text to lowercase for consistency.
    - Remove special characters and extra whitespace.
  - **Justification**: These basic cleaning steps standardize the input without over-complicating the preprocessing, while removing as little meaning as possible from the sentences. For the same reason, I chose not to remove stopwords and not to apply stemming or lemmatization.
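
The cleaning operations above could be sketched as follows. The function name `clean_text` and the exact definition of "special characters" (here: everything except lowercase letters, digits, and whitespace) are assumptions for illustration, not a fixed part of the plan.

```python
import re


def clean_text(text: str) -> str:
    """Lowercase, drop special characters, and collapse extra whitespace."""
    text = text.lower()
    # Assumption: anything that is not a-z, 0-9, or whitespace counts as "special".
    # Note that this also splits contractions, e.g. "don't" becomes "don t".
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Collapse runs of whitespace into single spaces and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


cleaned = clean_text("  Is  the Sky BLUE?!  ")  # "is the sky blue"
```

A stricter or looser character class may be preferable depending on how the tokenizer treats punctuation.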

- **Sequence Truncation and Padding**:
  - **Truncating**: Truncate sequences to a fixed length of 512 tokens.
  - **Padding**: Apply padding to make all sequences in a batch have the same length.
  - **Rationale**:
    - Limiting the sequence length to 512 tokens balances computational efficiency and context retention. This choice keeps the input size manageable while still covering most of the content in the passages. It is also a common sequence length in NLP applications, which is another reason I chose it.
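
As a rough illustration of truncation and per-batch padding on lists of token IDs: the helper name `truncate_and_pad` and the padding ID of 0 are assumptions, and in practice the tokenizer's own padding token would be used.

```python
def truncate_and_pad(batch, max_len=512, pad_id=0):
    """Truncate each sequence to max_len, then pad to the longest one in the batch."""
    truncated = [seq[:max_len] for seq in batch]
    lengths = [len(seq) for seq in truncated]
    width = max(lengths, default=0)
    padded = [seq + [pad_id] * (width - len(seq)) for seq in truncated]
    # The original lengths are returned so padded steps can be ignored later.
    return padded, lengths


padded, lengths = truncate_and_pad([[1, 2, 3], [4]], max_len=2)
# padded == [[1, 2], [4, 0]], lengths == [2, 1]
```

Padding to the longest sequence in the batch, rather than always to 512, keeps short batches cheap while still respecting the 512-token cap.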

- **Input**:
  - **Decision**: Pass the sequence of word embeddings directly to the LSTM without averaging, ensuring that each embedding retains its position in the sequence.
  - **Rationale**:
    - LSTM networks are designed to process sequential data, where each step in the sequence corresponds to a time step in the model. By feeding the LSTM with a sequence of embeddings, it can learn the dependencies between the tokens, capturing contextual information across the entire sequence.
    - Retaining the full sequence allows the LSTM's gating mechanisms to selectively remember or forget information based on the input at each step, which is crucial for understanding long-term dependencies in reading comprehension tasks.

  - **Input Preparation**:
    - Each input sequence will be tokenized and converted into a sequence of FastText word embeddings (each of dimension 300).
    - The resulting input will have the shape `(max_sequence_length, batch_size, embedding_dim)`, which is the layout PyTorch's `nn.LSTM` expects with its default `batch_first=False`. For example, the shape is `(512, 32, 300)` for a batch size of 32.
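
The shape requirement above can be checked with a small sketch. Random tensors stand in for real FastText vectors here, and the helper name `to_lstm_input` is hypothetical; only `torch.nn.utils.rnn.pad_sequence` is an actual PyTorch utility.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

EMBEDDING_DIM = 300  # FastText vector size stated in the plan


def to_lstm_input(embedded_sequences):
    """Stack variable-length (seq_len, dim) tensors into (max_len, batch, dim)."""
    lengths = torch.tensor([seq.shape[0] for seq in embedded_sequences])
    # pad_sequence defaults to batch_first=False and pads with zeros.
    batch = pad_sequence(embedded_sequences)
    assert batch.shape == (int(lengths.max()), len(embedded_sequences), EMBEDDING_DIM)
    return batch, lengths


# Stand-in data: three "sequences" of 5, 3, and 8 tokens with random embeddings.
sequences = [torch.randn(n, EMBEDDING_DIM) for n in (5, 3, 8)]
batch, lengths = to_lstm_input(sequences)
print(batch.shape)
```

The `assert` inside the helper is one instance of the planned correctness tests on tensor dimensions.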


## 4. Model Architecture
- **RNN Type**:
  - **Decision**: Use LSTM for the RNN layers.
  - **Rationale**: LSTM cells help maintain long-term dependencies through gating mechanisms, which is beneficial for reading comprehension tasks where context from the entire passage can be important for answering questions.

- **Model Configuration**:
  - **Embedding Layer**: Input dimension of 300 using FastText embeddings.
  - **RNN Layers**: Two LSTM layers with a hidden size of 128.
  - **Dropout**: Apply dropout with a rate of 0.3 between the LSTM layers for regularization.
  - **Classifier**: A two-layer fully connected network (hidden layer of size 64) with ReLU activation.

- **Loss and Optimizer**:
  - **Loss Function**: Use Binary Cross-Entropy Loss for the binary classification task.
  - **Optimizer**: Use the Adam optimizer with an initial learning rate of 0.001.
  - **Rationale**:
    - Adam is chosen for its adaptive learning rate, which can improve training stability and convergence.

- **Regularization**:
  - **Dropout**: Applied to reduce overfitting.
  - **Early Stopping**: Monitor validation loss and stop training if it does not improve for 3 consecutive epochs.
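
One way the configuration above might look in PyTorch is sketched below. Several details are assumptions rather than decisions from this plan: the class name `BoolQLSTM`, using the final hidden state of the top LSTM layer as the sequence representation, packing the padded batch so that padding does not influence that state, and using `BCEWithLogitsLoss` (binary cross-entropy applied to raw logits) instead of an explicit sigmoid followed by `BCELoss`.

```python
import torch
from torch import nn
from torch.nn.utils.rnn import pack_padded_sequence


class BoolQLSTM(nn.Module):
    def __init__(self, embedding_dim=300, hidden_size=128,
                 classifier_hidden=64, dropout=0.3):
        super().__init__()
        # Two stacked LSTM layers; dropout is applied between them.
        self.lstm = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_size,
                            num_layers=2, dropout=dropout)
        # Two-layer classifier with a ReLU in between, producing one logit.
        self.classifier = nn.Sequential(
            nn.Linear(hidden_size, classifier_hidden),
            nn.ReLU(),
            nn.Linear(classifier_hidden, 1),
        )

    def forward(self, embeddings, lengths):
        # embeddings: (seq_len, batch, embedding_dim); lengths: (batch,)
        packed = pack_padded_sequence(embeddings, lengths.cpu(),
                                      enforce_sorted=False)
        _, (h_n, _) = self.lstm(packed)
        # h_n: (num_layers, batch, hidden_size); take the top layer's state.
        return self.classifier(h_n[-1]).squeeze(-1)


model = BoolQLSTM()
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Shape check on random stand-in data: 4 sequences, up to 20 tokens each.
x = torch.randn(20, 4, 300)
lengths = torch.tensor([20, 15, 7, 3])
logits = model(x, lengths)
assert logits.shape == (4,)
```

Taking a mean over all time steps or concatenating question and passage representations are alternative designs this sketch does not cover.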


## 5. Training
- **Number of Epochs**: Train for up to 20 epochs with early stopping.
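- **Checkpointing**: Save the model checkpoint with the best validation accuracy, so that the final evaluation uses the best-performing weights rather than those of a later, possibly overfitted epoch.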
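
The stopping rule (up to 20 epochs, patience of 3 on validation loss) can be sketched independently of the model. The class and function names are hypothetical, and the validation losses are made-up numbers used only to exercise the logic.

```python
class EarlyStopping:
    """Signal a stop once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def epochs_run(val_losses, max_epochs=20, patience=3):
    """Return how many epochs would run for a given validation-loss history."""
    stopper = EarlyStopping(patience)
    history = val_losses[:max_epochs]
    for epoch, val_loss in enumerate(history, start=1):
        if stopper.step(val_loss):
            return epoch
    return len(history)


# Made-up losses: improvement stops after epoch 2, so training ends at epoch 5.
stopped_at = epochs_run([0.70, 0.60, 0.61, 0.62, 0.63, 0.50])
```

In the real training loop, `step` would be called once per epoch after validation, alongside saving the checkpoint.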

## 6. Evaluation
- **Primary Metric**:
  - **Accuracy**: Chosen as the main evaluation metric since it reflects the overall model performance in binary classification.
- **Baseline Comparison**:
  - Compare the model's accuracy against a majority class baseline (e.g., always predicting "yes") to understand the model's relative performance.
- **Error Analysis**:
  - Analyze the confusion matrix to identify patterns in misclassifications and judge the types of errors the model makes.
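
A dependency-free sketch of the baseline comparison and confusion-matrix counts follows. The labels are toy values, not BoolQ results, and the helper names are hypothetical; `sklearn.metrics` provides equivalent functionality for the actual evaluation.

```python
from collections import Counter


def majority_baseline_accuracy(y_true):
    """Accuracy obtained by always predicting the most frequent label."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)


def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives, with 1 = "yes" and 0 = "no"."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == 1:
            counts["tp" if truth == 1 else "fp"] += 1
        else:
            counts["tn" if truth == 0 else "fn"] += 1
    return counts


# Toy labels for illustration only.
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0]

baseline = majority_baseline_accuracy(y_true)  # 3 of 5 labels are "yes"
counts = confusion_counts(y_true, y_pred)
accuracy = (counts["tp"] + counts["tn"]) / len(y_true)
```

Comparing `accuracy` against `baseline` shows whether the model does more than reproduce the class prior.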


## 7. Interpretation
- **Performance Expectations**:
  - Learning from the results of Project 1, I am setting my expectations a bit lower (and more realistic) this time. I expect the LSTM to achieve an accuracy of 65-70%, hopefully beating the baseline of always predicting "yes" (accuracy of 61-63%).
