# NLP - Word Embeddings - Pascal Thürig

## Introduction
The starting point for this project is the following set of key requirements:
1. Use the BoolQ Dataset from Hugging Face
2. Use a pre-trained model for word embeddings (word2vec, GloVe or fastText)
3. Train a 2-layer classifier with ReLU non-linearity

In this project I use pre-trained word2vec embeddings and a simple 2-layer neural network for the reading comprehension task on the BoolQ dataset.
I document every decision, from preprocessing to model training and evaluation. The goal is to classify each BoolQ question-passage pair as either 'Yes' or 'No'.

## Setup
Importing necessary libraries:
- datasets
- gensim
- transformers
- numpy
- torch
- wandb
- sklearn

First, the BoolQ dataset is loaded.
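
A minimal sketch of the loading step. It assumes the dataset is published on the Hugging Face Hub under the id `google/boolq`; the helper names `load_boolq` and `split_sizes` are made up for illustration. The `datasets` import is deferred so that nothing is downloaded until `load_boolq()` is actually called.

```python
def load_boolq():
    """Download (or read from the local cache) the BoolQ dataset."""
    # Imported lazily so that split_sizes() below works without `datasets`.
    from datasets import load_dataset

    # "google/boolq" is the Hub id assumed here. Each record has the
    # fields `question` (str), `passage` (str) and `answer` (bool).
    return load_dataset("google/boolq")


def split_sizes(splits):
    """Return the number of examples per split of a dict-like object."""
    return {name: len(split) for name, split in splits.items()}
```

Calling `split_sizes(load_boolq())` is a quick way to see which splits the Hub provides before deriving the project's own train/valid/test split.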

For easy access during experiments, I like to define the hyperparameters at the top of my notebooks.
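
One possible way to keep the hyperparameters in a single place is a small frozen dataclass. Only `max_len` (512 tokens) and `embedding_dim` (300, from `word2vec-google-news-300`) come from this notebook; the class name `Config` and all other values are placeholder assumptions to be tuned.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Config:
    max_len: int = 512            # truncation length from the preprocessing plan
    embedding_dim: int = 300      # vector size of word2vec-google-news-300
    hidden_dim: int = 128         # assumed placeholder
    batch_size: int = 32          # assumed placeholder
    learning_rate: float = 1e-3   # assumed placeholder
    epochs: int = 10              # assumed placeholder
    seed: int = 42                # assumed placeholder


config = Config()
# asdict(config) gives a plain dict that can be handed to an experiment tracker.
config_dict = asdict(config)
```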

Next, the pre-trained word2vec embeddings are loaded.

In [None]:
# (down-)load word2vec - word2vec-google-news-300
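
A sketch of how the download could look with gensim's downloader API, which returns a `KeyedVectors` object. The model is large, so the import and download are deferred into the hypothetical helper `load_word2vec`. The second helper, `vocabulary_coverage`, is also an illustrative addition: it reports which share of a token list has a pre-trained vector, which is worth checking when subword tokens are looked up in a word-level vocabulary.

```python
def load_word2vec(name="word2vec-google-news-300"):
    """Download (or load from gensim's cache) the pre-trained vectors."""
    # Imported lazily: the download only starts when this function is called.
    import gensim.downloader as api

    return api.load(name)


def vocabulary_coverage(tokens, vectors):
    """Fraction of tokens that have an entry in `vectors`.

    `vectors` only needs to support `token in vectors`, which holds for
    gensim KeyedVectors as well as for a plain dict.
    """
    if not tokens:
        return 0.0
    return sum(token in vectors for token in tokens) / len(tokens)
```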

## Preprocessing

The BoolQ data will be processed in the following way:
1.  Tokenizing: Yes, the input questions and passages are tokenized with a subword tokenizer
2.  Lowercasing: Yes, the text is lowercased for simplicity and to reduce the total vocabulary size
3.  Stemming: No, words are not stemmed so that no information is lost
4.  Lemmatizing: No, not initially; I may test later whether it improves performance
5.  Stopword removal: No, stopwords are kept so that potentially critical information is not lost ([research](https://datascience.stackexchange.com/questions/31048/pros-cons-of-stop-word-removal))
6.  Removal of other words: No, no other words are removed
7.  Format cleaning: No, the dataset is already sufficiently clean, so skipping this step should not impact performance
8.  Truncation: Yes, the input text is truncated to a maximum of 512 tokens
9.  Feature selection: ???
10. Input format: ???
11. Label format: Binary labels, "yes" or "no"
12. train/valid/test splits: Given as a prerequisite of the project (66/8/26 percent)
13. Padding: Yes, the sequences are padded so that all inputs in a batch have the same length
14. Embedding: word2vec, chosen solely for simplicity because I already know it
15. Planned correctness tests: ???

**TLDR: Decisions for Preprocessing**
- Tokenizing: Yes, subword tokenizer
- Lowercasing: Yes
- Stemming: No
- Lemmatizing: No
- Stopword removal: No
- Removal of other words: No
- Format cleaning: No
- Truncation: Yes, max 512 tokens
- Feature selection: ???
- Input format: ???
- Label format: Binary
- train/valid/test splits: (66/8/26)
- Padding: Yes
- Embedding: word2vec
- Planned correctness tests: ??? 
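
For the binary label format, a small sketch of one possible encoding: BoolQ stores the answer as a boolean, and a classifier trained with cross-entropy expects integer class indices. The function names and the choice of 1 for "yes" are assumptions.

```python
def encode_label(answer: bool) -> int:
    """Map the boolean BoolQ answer to a class index (1 = yes, 0 = no)."""
    return int(answer)


def decode_label(index: int) -> str:
    """Map a predicted class index back to a readable label."""
    return ("no", "yes")[index]
```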

1. Preprocess text (lowercasing)

2. Tokenize with AutoTokenizer from Hugging Face

In [None]:
# tokenize w/ AutoTokenizer.from_pretrained("bert-base-uncased")
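
A sketch of the tokenization step. `AutoTokenizer.from_pretrained` and the tokenizer's `tokenize` method are real `transformers` calls; the wrappers `load_tokenizer` and `tokenize_pair`, and the decision to simply concatenate question and passage tokens, are assumptions.

```python
def load_tokenizer(name="bert-base-uncased"):
    """Load the subword tokenizer (downloads its vocabulary on first use)."""
    # Imported lazily so tokenize_pair() can be tried with any object
    # that offers a `tokenize(text) -> list[str]` method.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(name)


def tokenize_pair(tokenizer, question, passage):
    """Return the subword tokens of the question followed by the passage."""
    return tokenizer.tokenize(question) + tokenizer.tokenize(passage)
```

Because `bert-base-uncased` is an uncased model, its tokenizer already lowercases the text, so the output is lowercase even if step 1 is skipped.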

3. Truncate or add padding
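
Truncation and per-batch padding can be sketched on plain token lists as follows. The 512-token limit comes from the preprocessing plan; the function names and the `[PAD]` placeholder token are assumptions.

```python
def truncate(tokens, max_len=512):
    """Keep at most `max_len` tokens from the start of the sequence."""
    return tokens[:max_len]


def pad_batch(batch, pad_token="[PAD]"):
    """Pad every token list in `batch` to the length of the longest one."""
    longest = max(len(tokens) for tokens in batch)
    return [tokens + [pad_token] * (longest - len(tokens)) for tokens in batch]
```

Padding only to the longest sequence in each batch, rather than always to 512, keeps the batches as small as possible.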

4. Embed tokens using word2vec (word2vec-google-news-300)
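
A sketch of the embedding lookup. It assumes that tokens without a pre-trained vector (for example padding tokens or subword pieces) are mapped to a zero vector; that fallback and the name `embed_tokens` are illustrative choices, not something prescribed by the project.

```python
import numpy as np


def embed_tokens(tokens, vectors, dim=300):
    """Return a (len(tokens), dim) float32 array of token embeddings.

    `vectors` needs `token in vectors` and `vectors[token]`, which gensim
    KeyedVectors provide. Unknown tokens keep an all-zero row (assumption).
    """
    embedded = np.zeros((len(tokens), dim), dtype=np.float32)
    for i, token in enumerate(tokens):
        if token in vectors:
            embedded[i] = vectors[token]
    return embedded
```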

5. Create a custom BoolQ dataset class to:
    - get the data into a format compatible with the PyTorch DataLoader.
    - organize question-passage pairs with their labels and apply the preprocessing pipeline.
    - easily batch, shuffle, and load the data during training.

In [None]:
# class BoolQDataset(Dataset):
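
One way such a class could look. It assumes each example is a mapping with the keys `question`, `passage` and `answer`, and that the whole preprocessing pipeline is passed in as a single callable `encode(question, passage)` returning a `(seq_len, embedding_dim)` array; both the constructor signature and that callable are hypothetical.

```python
import torch
from torch.utils.data import Dataset


class BoolQDataset(Dataset):
    """Wraps BoolQ examples and applies the preprocessing on access."""

    def __init__(self, examples, encode):
        self.examples = examples  # indexable collection of example dicts
        self.encode = encode      # (question, passage) -> 2-D array of embeddings

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, index):
        example = self.examples[index]
        features = torch.as_tensor(
            self.encode(example["question"], example["passage"]),
            dtype=torch.float32,
        )
        # Class index for cross-entropy: 1 = yes, 0 = no.
        label = torch.tensor(int(example["answer"]), dtype=torch.long)
        return features, label
```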

6. DataLoaders as required by PyTorch
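
A sketch of a DataLoader with per-batch padding. It assumes the dataset yields `(features, label)` pairs where `features` has shape `(seq_len, embedding_dim)`; the names `collate` and `make_loader` and the returned `(padded, lengths, labels)` layout are illustrative choices.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader


def collate(batch):
    """Zero-pad variable-length sequences to the longest one in the batch."""
    features, labels = zip(*batch)
    lengths = torch.tensor([f.shape[0] for f in features])
    padded = pad_sequence(list(features), batch_first=True)  # (batch, max_len, dim)
    return padded, lengths, torch.stack(list(labels))


def make_loader(dataset, batch_size=32, shuffle=False):
    """Build a DataLoader that uses the padding collate function."""
    return DataLoader(dataset, batch_size=batch_size, shuffle=shuffle, collate_fn=collate)
```

Shuffling would typically be enabled for the training loader only, so that validation and test results stay reproducible.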

7. Initialize Weights & Biases for experiment tracking
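
A sketch of starting a tracked run. `wandb.init` with its `project`, `config` and `mode` arguments is the real API; the wrapper `start_run` and the project name `boolq-word2vec` are placeholders.

```python
def start_run(config, project="boolq-word2vec", mode="online"):
    """Start a Weights & Biases run and return the run object.

    `mode` can be "online", "offline" or "disabled"; "disabled" is handy
    for dry runs without an account or network access.
    """
    # Imported lazily so that defining the helper has no side effects.
    import wandb

    return wandb.init(project=project, config=config, mode=mode)
```

A typical pattern would be `run = start_run({"learning_rate": 1e-3})`, then `run.log({...})` once per epoch, and `run.finish()` at the end of the notebook.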

## Model

The model for this project is pre-defined as a 2-layer network with a ReLU non-linearity.

1. Create the neural network class:
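
A sketch of a 2-layer classifier with a ReLU in between. The requirements do not say how a variable-length sequence of embeddings becomes one fixed-size input, so this sketch assumes mean pooling over the real (non-padded) tokens; the class name, `hidden_dim=128` and the `(embeddings, lengths)` interface are likewise assumptions.

```python
import torch
from torch import nn


class TwoLayerClassifier(nn.Module):
    def __init__(self, embedding_dim=300, hidden_dim=128, num_classes=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embedding_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, embeddings, lengths):
        # embeddings: (batch, seq_len, embedding_dim), zero-padded
        # lengths:    (batch,) number of real tokens per sequence
        # Zero padding adds nothing to the sum, so dividing by the true
        # length gives the mean over the real tokens.
        denom = lengths.clamp(min=1).unsqueeze(1).to(embeddings.dtype)
        pooled = embeddings.sum(dim=1) / denom
        return self.net(pooled)  # raw logits, shape (batch, num_classes)
```

The model returns raw logits because `nn.CrossEntropyLoss` applies the softmax internally.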

2. Create an instance of the model and move it to the GPU
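
A short sketch of device placement with a CPU fallback, so the notebook also runs on machines without a GPU. The `nn.Sequential` here is only a stand-in for the project's model, and its layer sizes are assumptions.

```python
import torch
from torch import nn

# Use the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in for the actual classifier (layer sizes are placeholders).
model = nn.Sequential(nn.Linear(300, 128), nn.ReLU(), nn.Linear(128, 2))
model = model.to(device)
```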

3. Loss (nn.CrossEntropyLoss) and optimizer (optim.Adam)
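
A sketch of how the loss and optimizer fit together in a single update step. The stand-in model, the learning rate of `1e-3` and the random inputs are assumptions for illustration only.

```python
import torch
from torch import nn, optim

# Stand-in model; the real one would be the 2-layer classifier.
model = nn.Sequential(nn.Linear(300, 128), nn.ReLU(), nn.Linear(128, 2))

criterion = nn.CrossEntropyLoss()                     # expects logits and class indices
optimizer = optim.Adam(model.parameters(), lr=1e-3)   # learning rate is a placeholder

# One update step on a random batch of 4 pooled inputs.
inputs = torch.randn(4, 300)
labels = torch.tensor([1, 0, 1, 1])

optimizer.zero_grad()
loss = criterion(model(inputs), labels)
loss.backward()
optimizer.step()
```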

4. Training loop
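
A sketch of one training epoch. It assumes each batch is a tuple `(embeddings, lengths, labels)` and that the model is called as `model(embeddings, lengths)`; that interface and the function name are assumptions.

```python
import torch


def train_one_epoch(model, loader, criterion, optimizer, device):
    """Run one pass over `loader` and return the mean loss per example."""
    model.train()
    total_loss, total_examples = 0.0, 0
    for embeddings, lengths, labels in loader:
        embeddings = embeddings.to(device)
        lengths = lengths.to(device)
        labels = labels.to(device)

        optimizer.zero_grad()
        loss = criterion(model(embeddings, lengths), labels)
        loss.backward()
        optimizer.step()

        # Weight by batch size so a smaller last batch is averaged correctly.
        total_loss += loss.item() * labels.size(0)
        total_examples += labels.size(0)
    return total_loss / max(total_examples, 1)
```

Calling this once per epoch and logging the returned value gives the training-loss curve.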

5. Evaluation function
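
A sketch of an evaluation function that reports accuracy. As in the training sketch, the batch layout `(embeddings, lengths, labels)` and the call `model(embeddings, lengths)` are assumptions.

```python
import torch


@torch.no_grad()  # no gradients are needed during evaluation
def evaluate(model, loader, device):
    """Return the accuracy of `model` on all batches in `loader`."""
    model.eval()
    correct, total = 0, 0
    for embeddings, lengths, labels in loader:
        embeddings = embeddings.to(device)
        lengths = lengths.to(device)
        labels = labels.to(device)

        # The predicted class is the index of the largest logit.
        predictions = model(embeddings, lengths).argmax(dim=1)
        correct += (predictions == labels).sum().item()
        total += labels.size(0)
    return correct / max(total, 1)
```

The accuracy is most informative next to the majority-class baseline, since a classifier that always answers with the more frequent label can already score well above 50% on an imbalanced split.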

## Training

## Evaluation

## Finish the WandB run

## Interpretation
