# Recurrent Neural Networks for BoolQ Reading Comprehension

## 1. Introduction

- **Objective**: Develop a reading comprehension model using a 2-layer LSTM and a 2-layer classifier. The model will be trained end-to-end on the BoolQ dataset.
- **Task**: The BoolQ dataset involves answering yes/no questions given a passage. The goal is to predict the correct label for each question.
- **Approach**: Use PyTorch to build the model and Hugging Face's `datasets` library to manage the data.


## 2. Setup
- **Libraries**: 
  - `torch`: For building the neural network.
  - `datasets`: For loading the BoolQ dataset.
  - `transformers`: For using a pre-trained BPE tokenizer.
  - `fasttext`: To load and use FastText embeddings.
  - `numpy`, `pandas`, `matplotlib`, `seaborn`: For data manipulation and visualization.
  - `gensim`: For loading the pre-trained word embedding model.
  - `sklearn`: For metrics.
  - `wandb`: For experiment tracking.

- **Planned Correctness Tests**:
  - Use `assert` statements to check tensor dimensions, and confirm the expected shapes of inputs and outputs throughout the data pipeline.
  - Print sample outputs at different stages to validate transformations.

- **Experiment Tracking**:
  - Use `wandb` for logging experiments, including hyperparameters, metrics, and visualizations.
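
The planned shape checks could be wrapped in a small helper like the one below. This is only a sketch: `assert_shape` and its `None` wildcard are hypothetical names introduced here, not part of PyTorch. Plain tuples stand in for `tensor.shape` so the example runs without any model or data.

```python
from typing import Optional, Sequence, Tuple


def assert_shape(actual: Sequence[int],
                 expected: Tuple[Optional[int], ...],
                 name: str = "tensor") -> None:
    """Raise AssertionError unless `actual` matches `expected`.

    `None` in `expected` is a wildcard for a dimension that may vary
    (for example the batch size of a last, smaller batch).
    """
    actual = tuple(actual)
    assert len(actual) == len(expected), (
        f"{name}: expected {len(expected)} dims, got {len(actual)} {actual}"
    )
    for axis, (got, want) in enumerate(zip(actual, expected)):
        assert want is None or got == want, (
            f"{name}: axis {axis} is {got}, expected {want} (full shape {actual})"
        )


# Plain tuples stand in for `tensor.shape` so the sketch runs without PyTorch.
assert_shape((1024, 32, 300), (1024, None, 300), name="lstm_input")
assert_shape((32,), (None,), name="logits")
```

In the pipeline itself, the call would presumably look like `assert_shape(batch.shape, (1024, None, 300), "lstm_input")`, since `torch.Size` behaves like a tuple.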


In [40]:
%pip install torch datasets transformers fasttext numpy pandas matplotlib seaborn gensim scikit-learn wandb

Note: you may need to restart the kernel to use updated packages.


In [41]:
from datasets import load_dataset
import gensim.downloader as api
import gensim
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download
import fasttext.util
import re
from pathlib import Path
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
import wandb
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix

Download the BoolQ dataset and split it as required by the project presentation.

In [42]:
train_data = load_dataset('google/boolq', split='train[:-1000]')
validation_data = load_dataset('google/boolq', split='train[-1000:]')
test_data = load_dataset('google/boolq', split='validation')

Have a look at the data and labels

In [43]:
test_question = train_data[5]['question']
test_passage = train_data[5]['passage']
print(train_data[5])
print(f"Number of training samples: {len(train_data)}")
print(f"Number of validation samples: {len(validation_data)}")
print(f"Number of test samples: {len(test_data)}")

train_yes_count = sum(1 for label in train_data['answer'] if label == 1)
train_no_count = sum(1 for label in train_data['answer'] if label == 0)

validation_yes_count = sum(1 for label in validation_data['answer'] if label == 1)
validation_no_count = sum(1 for label in validation_data['answer'] if label == 0)

test_yes_count = sum(1 for label in test_data['answer'] if label == 1)
test_no_count = sum(1 for label in test_data['answer'] if label == 0)

# Print the counts and ratios
print(f"Train set - Yes: {train_yes_count}, No: {train_no_count}, Ratio (y/n): {round(train_yes_count / train_no_count, 2)}")
print(f"Validation set - Yes: {validation_yes_count}, No: {validation_no_count}, Ratio (y/n): {round(validation_yes_count / validation_no_count, 2)}")
print(f"Test set - Yes: {test_yes_count}, No: {test_no_count}, Ratio (y/n): {round(test_yes_count / test_no_count, 2)}")


{'question': 'can you use oyster card at epsom station', 'answer': False, 'passage': "Epsom railway station serves the town of Epsom in Surrey. It is located off Waterloo Road and is less than two minutes' walk from the High Street. It is not in the London Oyster card zone unlike Epsom Downs or Tattenham Corner stations. The station building was replaced in 2012/2013 with a new building with apartments above the station (see end of article)."}
Number of training samples: 8427
Number of validation samples: 1000
Number of test samples: 3270
Train set - Yes: 5279, No: 3148, Ratio (y/n): 1.68
Validation set - Yes: 595, No: 405, Ratio (y/n): 1.47
Test set - Yes: 2033, No: 1237, Ratio (y/n): 1.64


Download the FastText model if it is not already in the working directory.

In [45]:
# Define the file names
model_bin = Path('cc.en.300.bin')
model_gz = Path('cc.en.300.bin.gz')

# Download the model only if the unpacked .bin file is missing.
# With if_exists='ignore', an already downloaded .gz archive is reused and unpacked.
if not model_bin.exists():
    fasttext.util.download_model('en', if_exists='ignore')

# Load the model
ft = fasttext.load_model(str(model_bin))



## 3. Preprocessing

- **Handling Text Cleaning**:
  - **Operations**:
    - Convert text to lowercase for consistency.
    - Remove special characters and extra whitespace.
  - **Reasoning**: These basic cleaning steps standardize the input without over-complicating the preprocessing, and they remove as little meaning as possible from the sentences. For the same reason, I chose not to remove stopwords and not to apply stemming or lemmatization.


In [35]:
def to_lowercase(text: str) -> str:
    return text.lower()

print(to_lowercase(test_question))
print(to_lowercase(test_passage))

can you use oyster card at epsom station
epsom railway station serves the town of epsom in surrey. it is located off waterloo road and is less than two minutes' walk from the high street. it is not in the london oyster card zone unlike epsom downs or tattenham corner stations. the station building was replaced in 2012/2013 with a new building with apartments above the station (see end of article).


In [36]:
def remove_special_characters_and_urls(text: str) -> str:
    # Remove URLs (http://, https://, or www. prefixes)
    text = re.sub(r'http[s]?://\S+|www\.\S+', '', text)

    # Replace slashes with spaces so that "2012/2013" becomes "2012 2013"
    text = text.replace('/', ' ')

    # Drop everything except ASCII letters, digits, and whitespace
    cleaned_text = re.sub(r'[^a-zA-Z0-9\s]', '', text)

    return cleaned_text

test_question_w_url = "Visit us at https://example.com for more info!"
test_passage_w_url = "Some RANDOM text With VARIETY. Check this out: www.example.org and the year is 2012/2013."

print(remove_special_characters_and_urls(test_question))
print(remove_special_characters_and_urls(test_passage))
print(remove_special_characters_and_urls(test_question_w_url))
print(remove_special_characters_and_urls(test_passage_w_url))

can you use oyster card at epsom station
Epsom railway station serves the town of Epsom in Surrey It is located off Waterloo Road and is less than two minutes walk from the High Street It is not in the London Oyster card zone unlike Epsom Downs or Tattenham Corner stations The station building was replaced in 2012 2013 with a new building with apartments above the station see end of article
Visit us at  for more info
Some RANDOM text With VARIETY Check this out  and the year is 2012 2013


In [37]:
def remove_extra_whitespace(text: str) -> str:
    return re.sub(r'\s+', ' ', text).strip()

print(remove_extra_whitespace(test_question))
print(remove_extra_whitespace(test_passage))

can you use oyster card at epsom station
Epsom railway station serves the town of Epsom in Surrey. It is located off Waterloo Road and is less than two minutes' walk from the High Street. It is not in the London Oyster card zone unlike Epsom Downs or Tattenham Corner stations. The station building was replaced in 2012/2013 with a new building with apartments above the station (see end of article).


Preprocessing function to combine all preprocessing steps.

In [38]:
def preprocessing(text: str) -> str:
    lowercase_text = to_lowercase(text)
    cleaned_text = remove_special_characters_and_urls(lowercase_text)
    prepared_text = remove_extra_whitespace(cleaned_text)
    
    return prepared_text

test_question_w_url = "Visit us at https://example.com for more info!"
test_passage_w_url = "Some RANDOM text With VARIETY. Check this out: www.example.org and the year is 2012/2013."


print(preprocessing(test_question))
print(preprocessing(test_passage))
print(preprocessing(test_question_w_url))
print(preprocessing(test_passage_w_url))


can you use oyster card at epsom station
epsom railway station serves the town of epsom in surrey it is located off waterloo road and is less than two minutes walk from the high street it is not in the london oyster card zone unlike epsom downs or tattenham corner stations the station building was replaced in 2012 2013 with a new building with apartments above the station see end of article
visit us at for more info
some random text with variety check this out and the year is 2012 2013


- **Tokenization**:
  - **Decision**: Use a pre-trained Byte-Pair Encoding (BPE) tokenizer from the `transformers` library.
  - **Reasoning**:
    - Using a pre-trained tokenizer simplifies the preprocessing pipeline, as the tokenizer has already been trained on a large and diverse corpus, which increases its generalization capability.
    - Pre-trained tokenizers from `transformers` are well-optimized and widely used in various NLP tasks.
    - BPE helps handle out-of-vocabulary (OOV) words by breaking them into known subword units, allowing for more robust word representations.

- **Sequence Truncation and Padding**:
  - **Truncating**: Truncate sequences to a fixed length of 512 tokens.
  - **Padding**: Apply padding to make all sequences in a batch have the same length.
  - **Reasoning**:
    - Limiting each sequence to 512 tokens balances computational efficiency and context retention: the input size stays manageable while still covering most of the content in the passages. 512 is also a common sequence length in NLP applications, which is why I chose it.

- **Word Embedding Lookups**:
  - **Decision**: Use the FastText API directly to obtain embeddings for tokenized words.
  - **Reasoning**:
    - The FastText API considers subword information when generating word embeddings, providing robust handling of OOV words.
    - This approach avoids having to map subword tokens directly to embeddings, which is not feasible with a traditional fixed-vocabulary embedding lookup.
  - **OOV Word Handling**:
    - Rely on FastText's built-in subword handling to generate embeddings for unknown words.

- **Input Preparation**:
  - Each input is a concatenation of the question and the passage with a total length of 1024 tokens (512 * 2). This sequence is tokenized and converted into a sequence of FastText word embeddings (each of dimension 300).
  - The resulting input has the required shape of `(max_sequence_length * 2, batch_size, embedding_dim)`, for example `(1024, 32, 300)` for a batch size of 32.
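
The truncation, padding, and concatenation arithmetic can be illustrated without the tokenizer or the embedding model. The sketch below assumes that question and passage are each brought to 512 tokens separately before being concatenated; the plan above does not spell this out. A whitespace split stands in for the BPE tokenizer, and `PAD_TOKEN`, `pad_or_truncate`, and `build_input` are hypothetical names.

```python
from typing import List

MAX_LEN = 512          # per-segment length from the plan above
PAD_TOKEN = "<pad>"    # hypothetical padding token, not taken from a real tokenizer


def pad_or_truncate(tokens: List[str], max_len: int = MAX_LEN,
                    pad_token: str = PAD_TOKEN) -> List[str]:
    """Return exactly `max_len` tokens: cut long inputs, right-pad short ones."""
    clipped = tokens[:max_len]
    return clipped + [pad_token] * (max_len - len(clipped))


def build_input(question_tokens: List[str], passage_tokens: List[str],
                max_len: int = MAX_LEN) -> List[str]:
    """Concatenate the fixed-length question and passage segments."""
    return (pad_or_truncate(question_tokens, max_len)
            + pad_or_truncate(passage_tokens, max_len))


# Whitespace splitting is only a stand-in for the BPE tokenizer.
question = "can you use oyster card at epsom station".split()
passage = ["word"] * 700   # longer than MAX_LEN, so it gets truncated
sequence = build_input(question, passage)
print(len(sequence))              # 1024
print(sequence[7], sequence[8])   # station <pad>
```

Replacing each of the 1024 tokens with a 300-dimensional vector and stacking 32 such sequences along the second axis would then give the `(1024, 32, 300)` input described above.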



## 4. Model Architecture
- **RNN Type**:
  - **Decision**: Use LSTM for the RNN layers.
  - **Rationale**: LSTM cells help maintain long-term dependencies through gating mechanisms, which is beneficial for reading comprehension tasks where context from the entire passage can be important for answering questions.

- **Model Configuration**:
  - **Embedding Layer**: Input dimension of 300 using FastText embeddings.
  - **RNN Layers**: Two LSTM layers with a hidden size of 128.
  - **Dropout**: Apply dropout with a rate of 0.3 between the LSTM layers for regularization.
  - **Classifier**: A two-layer fully connected network (hidden layer of size 64) with ReLU activation.

- **Loss and Optimizer**:
  - **Loss Function**: Use Binary Cross-Entropy Loss for the binary classification task.
  - **Optimizer**: Use the Adam optimizer with an initial learning rate of 0.001.
  - **Rationale**:
    - Adam is chosen for its adaptive learning rate, which can improve training stability and convergence.

- **Regularization**:
  - **Dropout**: Applied to reduce overfitting.
  - **Early Stopping**: Monitor validation loss and stop training if it does not improve for 3 consecutive epochs.
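
A minimal PyTorch sketch of this configuration is shown below. Several details are assumptions rather than decisions stated above: the class name `BoolQLSTM` is hypothetical, the final hidden state of the top LSTM layer is used as the sequence summary, padding positions are not masked, and `nn.BCEWithLogitsLoss` (binary cross-entropy applied to raw logits) is used in place of a separate sigmoid followed by `nn.BCELoss`.

```python
import torch
import torch.nn as nn


class BoolQLSTM(nn.Module):
    """2-layer LSTM encoder followed by a 2-layer classifier (sketch)."""

    def __init__(self, embedding_dim: int = 300, hidden_size: int = 128,
                 classifier_hidden: int = 64, dropout: float = 0.3) -> None:
        super().__init__()
        # `dropout` in nn.LSTM is applied between the stacked layers only.
        self.lstm = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_size,
                            num_layers=2, dropout=dropout)
        self.classifier = nn.Sequential(
            nn.Linear(hidden_size, classifier_hidden),
            nn.ReLU(),
            nn.Linear(classifier_hidden, 1),
        )

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (seq_len, batch_size, embedding_dim)
        _, (h_n, _) = self.lstm(embeddings)
        # Assumption: summarise the sequence by the top layer's final hidden state.
        summary = h_n[-1]                             # (batch_size, hidden_size)
        return self.classifier(summary).squeeze(-1)   # (batch_size,) logits


model = BoolQLSTM()
criterion = nn.BCEWithLogitsLoss()   # BCE on raw logits (sigmoid is applied inside)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

batch = torch.randn(20, 4, 300)      # short random stand-in for real embeddings
labels = torch.tensor([1.0, 0.0, 1.0, 1.0])
logits = model(batch)
loss = criterion(logits, labels)
loss.backward()
optimizer.step()
print(logits.shape)
```

Because the sequences are padded, the final hidden state also reflects the padding steps. Packing the batch with `torch.nn.utils.rnn.pack_padded_sequence` would be one way to avoid that, at the cost of tracking the true sequence lengths.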


## 5. Training
- **Number of Epochs**: Train for up to 20 epochs with early stopping.
- **Checkpointing**: Save the model checkpoint with the best validation accuracy, so that the evaluated model is not a later, overfitted epoch.

- **Hyperparameter Experimentation**:
  - **Learning Rate**: Test various learning rates (e.g., 0.001, 0.0005, 0.0001) to find an optimal balance between convergence speed and training stability.
  - **Batch Size**: Experiment with different batch sizes (e.g., 16, 32, 64) to optimize memory usage and training time.
  - **Dropout Rate**: Adjust dropout rates (e.g., 0.2, 0.3, 0.5) to find the optimal level of regularization.
  - **Hidden Layer Size**: Try varying the number of hidden units in the RNN and classifier layers (e.g., 64, 128, 256) to assess their impact on model capacity.
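
The early-stopping rule (stop once the validation loss has not improved for 3 consecutive epochs) can be sketched independently of the training loop. The class name `EarlyStopping`, the `min_delta` tolerance, and the loss values below are illustrative assumptions, not results from this project.

```python
class EarlyStopping:
    """Signal a stop once the validation loss fails to improve `patience` times in a row."""

    def __init__(self, patience: int = 3, min_delta: float = 0.0) -> None:
        self.patience = patience
        self.min_delta = min_delta      # hypothetical tolerance; 0.0 means any decrease counts
        self.best_loss = float("inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, epoch: int, val_loss: float) -> bool:
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation losses, purely to exercise the logic.
losses = [0.68, 0.65, 0.66, 0.67, 0.66, 0.64]
stopper = EarlyStopping(patience=3)
stopped_at = None
for epoch, val_loss in enumerate(losses):
    if stopper.step(epoch, val_loss):
        stopped_at = epoch
        break
print(stopped_at, stopper.best_epoch)   # 4 1
```

This sketch tracks the best epoch by validation loss. Since checkpointing is planned on validation accuracy, the training loop would need to track that metric separately when deciding which weights to save.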


## 6. Evaluation
- **Primary Metric**:
  - **Accuracy**: Chosen as the main evaluation metric since it reflects the overall model performance in binary classification.
- **Baseline Comparison**:
  - Compare the model's accuracy against a majority class baseline (e.g., always predicting "yes") to understand the model's relative performance.
- **Error Analysis**:
  - Analyze the confusion matrix to identify patterns in misclassifications and judge the types of errors the model makes.
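
The majority-class baseline follows directly from the class counts printed during data exploration, and the confusion-matrix cells can be counted by hand. The sketch below uses hypothetical helper names (`majority_baseline_accuracy`, `confusion_counts`); in the notebook itself, `sklearn.metrics.confusion_matrix` would normally be used instead.

```python
from typing import Dict, Sequence


def majority_baseline_accuracy(yes_count: int, no_count: int) -> float:
    """Accuracy of always predicting the more frequent class."""
    return max(yes_count, no_count) / (yes_count + no_count)


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, int]:
    """Count TP/FP/TN/FN with 1 = "yes" as the positive class."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == 1:
            counts["tp" if truth == 1 else "fp"] += 1
        else:
            counts["fn" if truth == 1 else "tn"] += 1
    return counts


# Class counts as printed in the data exploration cell.
print(round(majority_baseline_accuracy(5279, 3148), 4))   # train
print(round(majority_baseline_accuracy(595, 405), 4))     # validation
print(round(majority_baseline_accuracy(2033, 1237), 4))   # test

# Toy labels, not model output.
print(confusion_counts([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```

An always-"yes" classifier has no true negatives and no false negatives, so any model that beats the baseline must get at least some "no" questions right.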


## 7. Interpretation
- **Performance Expectations**:
  - Learning from the results of Project 1, I am setting my expectations a bit lower (more realistic) this time. I expect the LSTM to achieve an accuracy of 65-70%, hopefully beating the baseline of always predicting "yes" (accuracy of 61-63%).
