# If this doesn't work I'm ending it all tonight.

## Setup

**Importing Python Packages**
To make sure the notebook is reproducible and runs without errors, I install the necessary libraries in a pip cell below.

**Data Loading and Split**
Each entry consists of a question, a passage and the answer. In total there are 12'697 entries in the dataset. Following the lecture slides, I split them into train (8427), validation (1000) and test (3270).

**Seeding for Reproducibility**
I set the random seed to 42 for reproducibility.
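
A minimal sketch of what the seeding could look like. The helper name `seed_everything` is hypothetical, and I am assuming that Python's `random`, NumPy and PyTorch are the sources of randomness used in the notebook.

```python
import random

import numpy as np
import torch


def seed_everything(seed: int = 42) -> None:
    """Seed the common sources of randomness (hypothetical helper)."""
    random.seed(seed)                 # Python's built-in RNG
    np.random.seed(seed)              # NumPy's global RNG
    torch.manual_seed(seed)           # PyTorch RNG
    torch.cuda.manual_seed_all(seed)  # all GPUs; silently ignored without CUDA


# Drawing random numbers after re-seeding should give identical tensors.
seed_everything(42)
first = torch.rand(3)
seed_everything(42)
second = torch.rand(3)
print(torch.equal(first, second))
```

Seeding alone does not guarantee bit-identical results on a GPU, since some CUDA operations are non-deterministic.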

## Preprocessing
### Tokenizer
In past projects I always did some sort of manual preprocessing of the data. In this project I deliberately refrain from any manual preprocessing and let the built-in features of the AutoTokenizer, loaded with from_pretrained("bert-base-cased"), handle the following steps for me:
- Whitespace cleanup and special character handling (e.g. emojis or phonetic pronunciations that are not in the vocabulary are mapped to the [UNK] token rather than removed)
- Case sensitivity (the cased model keeps the original casing)
- Padding and truncation (pad automatically, truncate to a maximum of 512 tokens, the number of pretrained position embeddings)

I only recently found out about this from the Hugging Face Transformers [Preprocessing Data Documentation](https://huggingface.co/transformers/v3.0.2/preprocessing.html).
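
A rough sketch of how the tokenizer call could look. The helper names `load_tokenizer` and `encode_pair` are hypothetical, and the choices `truncation="only_second"` (shorten only the passage) and `padding=False` (pad later, per batch) are my assumptions rather than settings fixed above.

```python
def load_tokenizer(name: str = "bert-base-cased"):
    """Load the pretrained tokenizer (downloads the vocabulary on first use)."""
    from transformers import AutoTokenizer  # lazy import; requires `transformers`
    return AutoTokenizer.from_pretrained(name)


def encode_pair(tokenizer, question: str, passage: str, max_length: int = 512):
    """Encode a question-passage pair, truncating only the passage if needed."""
    return tokenizer(
        question,
        passage,
        truncation="only_second",  # assumption: never cut the question
        max_length=max_length,     # 512 = number of pretrained position embeddings
        padding=False,             # assumption: padding happens later, per batch
    )


# Usage (needs network access the first time):
# tokenizer = load_tokenizer()
# encoding = encode_pair(tokenizer, "is the US a country",
#                        "The US is a country in North America.")
# print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))
```

Passing the question and the passage as two arguments lets the tokenizer insert the [SEP] token between them itself.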

### Lowercase / Case Sensitivity
Based on my feedback, I now keep case sensitivity instead of lower-casing all text. For example, lower-casing would turn the word "US" into "us" and could thus change the meaning of a sentence drastically. I also use a more efficient maximum sequence length of 21 tokens for the questions (the length of the longest question). Passages are padded dynamically to the length of the longest passage in the batch, without exceeding a total maximum length of 512 tokens, which leaves an absolute maximum of 491 tokens (512 - 21) for passages. <br>
*Source*: Feedback from Project 2 (LSTM)
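
A simplified sketch of dynamic padding within a batch. It pads the already concatenated question-passage sequence and does not show the fixed 21-token question length. The function name `collate_batch` is hypothetical, and `PAD_ID = 0` is an assumption about the [PAD] token id.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

MAX_LEN = 512  # total token budget from the text
PAD_ID = 0     # assumption: id of the [PAD] token


def collate_batch(batch):
    """Pad a list of (token_ids, label) pairs to the longest sequence in the batch."""
    ids = [torch.tensor(token_ids[:MAX_LEN], dtype=torch.long) for token_ids, _ in batch]
    labels = torch.tensor([label for _, label in batch], dtype=torch.long)
    input_ids = pad_sequence(ids, batch_first=True, padding_value=PAD_ID)
    attention_mask = (input_ids != PAD_ID).long()  # 1 for real tokens, 0 for padding
    return input_ids, attention_mask, labels


# Two dummy examples of length 4 and 6: the batch is padded to length 6, not 512.
batch = [([101, 7, 8, 102], 1), ([101, 7, 8, 9, 10, 102], 0)]
input_ids, attention_mask, labels = collate_batch(batch)
print(input_ids.shape)
print(attention_mask)
```

A function like this could be passed as `collate_fn` to a `torch.utils.data.DataLoader`, so each batch is only as wide as its longest sequence.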

### Stemming / Lemmatization
From a past lecture I took away that stemming or lemmatization is not the right choice for a reading comprehension task, because it removes valuable meaning.
I therefore do no stemming or lemmatization in my preprocessing, to keep as much information as possible in my sequences.

### Embedding Layer
In this project, the embedding layer is implemented using PyTorch's nn.Embedding class. The embeddings are trained end-to-end alongside the rest of the model, allowing them to adapt to the specific nuances of the BoolQ dataset.
- **Vocabulary Size**: Determined by the tokenizer
- **Embedding Dimension**: Set to 300 as this is widely used by large pretrained embedding models like fastText or word2vec.
- **Training**: Initialized randomly and updated during training through backpropagation.
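
A small sketch of the embedding layer under these settings. The value 28996 is my assumption for the vocabulary size of bert-base-cased (in the notebook it would come from `tokenizer.vocab_size`), and `padding_idx=0` is an assumption that keeps the [PAD] vector at zero.

```python
import torch
from torch import nn

VOCAB_SIZE = 28996   # assumption: vocabulary size of bert-base-cased
EMBEDDING_DIM = 300  # from the text
PAD_ID = 0           # assumption: id of the [PAD] token

# Randomly initialized; the weights receive gradients and are updated in training.
embedding = nn.Embedding(VOCAB_SIZE, EMBEDDING_DIM, padding_idx=PAD_ID)

token_ids = torch.randint(1, VOCAB_SIZE, (4, 16))  # dummy batch: 4 sequences of 16 tokens
embedded = embedding(token_ids)
print(embedded.shape)  # (batch_size, sequence_length, embedding_dim)
```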

### Input / Label format
In the final input each data point will have the following format:
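- Input: the tokenized question and passage, concatenated into one sequence with a [SEP] token between them
- Label: the answer as one of two classes (num_classes=2)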

## Model
### Architecture
- **Input Layer**:
    - The first layer of my model is the nn.Embedding layer, which is trained on the dataset together with the rest of the network.
    - Each input sequence consists of a concatenated question and passage with a [SEP] token between them, marking the boundary. The separator token allows the model to distinguish between the two segments.
    - The resulting shape of the input tensor after embedding is (batch_size, sequence_length, embedding_dim).
- **6-Layer Transformer Encoder**:
    - I use the PyTorch implementation of the Transformer Encoder. Its input is the output of the embedding layer with shape (batch_size, sequence_length, embedding_dim). Six layers learn contextual representations of the concatenated question-passage sequence.
- **Pooling Layer**:
    - Apply *mean pooling* across the sequence length to reduce the output from (batch_size, sequence_length, embedding_dim) to (batch_size, embedding_dim). This yields a single fixed-size vector that summarizes the entire sequence, which keeps memory use efficient when training with varying sequence lengths and gives the classifier a fixed-size input.
- **2-Layer Classifier with ReLU**
    - I implement a two-layer classifier network as defined in the project assignment. The first layer takes the output of the pooling layer with shape (batch_size, embedding_dim) as its input and produces an output of shape (batch_size, hidden_dim), followed by a ReLU for non-linearity. The second layer has an output shape of (batch_size, num_classes) with num_classes=2. The output layer uses a softmax as its activation function, which turns the two outputs into class probabilities; for two classes this is equivalent to a sigmoid applied to the difference of the two logits.
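
The architecture above could be sketched roughly as follows. The class name `BoolQTransformer`, `num_heads=6`, `hidden_dim=128` and `pad_id=0` are assumptions for illustration. Positional encodings are left out because the description does not specify any, and the mean pooling here ignores padded positions, which is my own choice.

```python
import torch
from torch import nn


class BoolQTransformer(nn.Module):
    """Embedding -> 6-layer Transformer encoder -> mean pooling -> 2-layer classifier."""

    def __init__(self, vocab_size, embedding_dim=300, num_heads=6, hidden_dim=128,
                 num_layers=6, num_classes=2, pad_id=0):
        super().__init__()
        self.pad_id = pad_id
        self.embedding = nn.Embedding(vocab_size, embedding_dim, padding_idx=pad_id)
        layer = nn.TransformerEncoderLayer(
            d_model=embedding_dim,
            nhead=num_heads,   # must divide embedding_dim (300 / 6 = 50)
            batch_first=True,  # inputs are (batch_size, sequence_length, embedding_dim)
        )
        self.encoder = nn.TransformerEncoder(
            layer, num_layers=num_layers, enable_nested_tensor=False
        )
        self.classifier = nn.Sequential(
            nn.Linear(embedding_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),  # raw logits, softmax applied outside
        )

    def forward(self, input_ids):
        padding_mask = input_ids == self.pad_id           # True where padded
        x = self.embedding(input_ids)                      # (B, T, D)
        x = self.encoder(x, src_key_padding_mask=padding_mask)
        keep = (~padding_mask).unsqueeze(-1).to(x.dtype)   # (B, T, 1)
        pooled = (x * keep).sum(dim=1) / keep.sum(dim=1).clamp(min=1.0)  # masked mean
        return self.classifier(pooled)                     # (B, num_classes)


model = BoolQTransformer(vocab_size=1000)   # small dummy vocabulary
dummy_ids = torch.randint(1, 1000, (2, 12))  # 2 sequences of 12 tokens
logits = model(dummy_ids)
probs = logits.softmax(dim=-1)               # class probabilities
print(logits.shape, probs.sum(dim=-1))
```

The model returns raw logits and applies the softmax outside, because PyTorch's `nn.CrossEntropyLoss` expects logits and applies the log-softmax itself.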

### Loss and Optimizer
For this binary classification task I'm using Binary Cross-Entropy (BCE) Loss. BCE is widely used in binary classification problems because it treats the model's output as the probability of the positive class and penalizes confident wrong predictions heavily, which makes it well suited for distinguishing between two classes. <br>
*Source*: [Binary Cross-Entropy/Log Loss for Binary Classification](https://www.geeksforgeeks.org/binary-cross-entropy-log-loss-for-binary-classification/)
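
Since the model has two softmax outputs rather than a single sigmoid output, the following sketch checks how the two formulations relate. The logits are dummy values, and the label encoding (0 = False, 1 = True) is an assumption. Cross-entropy over the two classes gives the same value as BCE on the softmax probability of the positive class.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, -1.0], [0.3, 1.2], [-0.5, 0.5]])  # dummy model outputs
labels = torch.tensor([0, 1, 1])                                 # assumption: 0 = False, 1 = True

# Cross-entropy over the two classes (applies log-softmax internally).
ce = F.cross_entropy(logits, labels)

# Binary cross-entropy on the softmax probability of the positive class.
p_true = logits.softmax(dim=-1)[:, 1]
bce = F.binary_cross_entropy(p_true, labels.float())

print(ce.item(), bce.item())
```

In practice `nn.CrossEntropyLoss` on the raw logits is the numerically safer of the two, since it never materializes the probabilities.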

For my optimizer I choose the Adam Optimizer for its adaptive learning rates and efficient handling of sparse gradients. It is well suited for deep learning tasks, provides fast convergence and has worked well in prior projects. <br>
*Source*: [Introduction to the Adam Optimizer](https://machinelearningmastery.com/adam-optimization-algorithm-for-deep-learning/)
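
A minimal sketch of a single Adam update step. The `nn.Linear` layer is a stand-in for the full network, `lr=1e-3` is an assumed value, and `nn.CrossEntropyLoss` on two logits serves as the stand-in loss.

```python
import torch
from torch import nn

torch.manual_seed(42)
model = nn.Linear(300, 2)                                   # stand-in for the full network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # assumption: example learning rate
criterion = nn.CrossEntropyLoss()                           # stand-in loss on two logits

features = torch.randn(8, 300)      # dummy pooled vectors
labels = torch.randint(0, 2, (8,))  # dummy binary labels

before = model.weight.detach().clone()
optimizer.zero_grad()               # clear gradients from the previous step
loss = criterion(model(features), labels)
loss.backward()                     # backpropagate
optimizer.step()                    # Adam update with per-parameter step sizes
print(loss.item(), (model.weight.detach() - before).abs().max().item())
```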

### Experiments
To tune my model's hyperparameters I will experiment with the following ranges:
- Learning Rate: [1e-2, 1e-7]
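
One way to cover this range is a log-spaced grid with one value per power of ten. The grid spacing and the helper name `log_spaced_learning_rates` are assumptions, since the text only states the range.

```python
def log_spaced_learning_rates(high_exp: int = -2, low_exp: int = -7):
    """Return one learning rate per power of ten, from 10^high_exp down to 10^low_exp."""
    return [10.0 ** exp for exp in range(high_exp, low_exp - 1, -1)]


learning_rates = log_spaced_learning_rates()
for lr in learning_rates:
    print(f"{lr:.0e}")
```

Sampling on a log scale spreads the trials evenly across orders of magnitude, which a linear grid between 1e-7 and 1e-2 would not do.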

### Training

### Checkpointing and Early Stopping

### Planned Correctness Tests
- Testing input shape to ensure the model receives a valid input format
- Testing output shape to verify the model produces the expected output shape
- Visually check that the loss decreases during training
- Visually check the output for signs of overfitting
- Visually check predictions using a confusion matrix
- Ensure reproducibility by setting the random seed
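
For the confusion matrix check, a dependency-free sketch could look like this. The helper name `confusion_counts` is hypothetical, the labels are dummy values, and a library routine such as scikit-learn's `confusion_matrix` would work as well.

```python
from collections import Counter


def confusion_counts(y_true, y_pred):
    """Return (TN, FP, FN, TP) for binary labels, with 1 as the positive class."""
    counts = Counter(zip(y_true, y_pred))
    return counts[(0, 0)], counts[(0, 1)], counts[(1, 0)], counts[(1, 1)]


y_true = [1, 0, 1, 1, 0, 1]  # dummy ground-truth answers
y_pred = [1, 0, 0, 1, 1, 1]  # dummy model predictions
tn, fp, fn, tp = confusion_counts(y_true, y_pred)
print(f"TN={tn} FP={fp}")
print(f"FN={fn} TP={tp}")
```

A model that always predicts the majority class would show up here as an empty row of predictions for the other class, which accuracy alone can hide.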

## Evaluation


## Interpretation
