# Abstract

# Table of Contents

# Introduction

## Background and motivation

Natural Language Processing (NLP) has come a long way with models like BERT, which help computers understand and generate human language more effectively. This progress opens up exciting possibilities for creating engaging and contextually rich stories.

The idea for this project started from a personal experience within our team. One of us had a nephew who asked for a bedtime story about "pirates in space." As Catalans who value storytelling but aren't all naturally imaginative, we struggled. Attempts to use existing generative models resulted in stories that missed the context and creativity the child wanted. This frustration highlighted a gap in current story generation tools.

Motivated by this experience, we set out to create a better way to generate fairy tales that truly capture the magic and context of the themes kids ask for. We decided to train a BERT model specifically for this purpose and use a Retrieval-Augmented Generation (RAG) system. This way, when someone requests a story, the system can understand the request, find similar tales, and generate a new, engaging story.

Our goal is to provide a tool that can make storytelling easier and more fun, ensuring that every story is as imaginative and contextually rich as the ones we cherish from our childhood.

## Objectives and scope

### Objectives

The primary objective of this project is to develop an advanced story generation system that can create contextually rich and engaging fairy tales based on user input. 

To achieve this, we divided the project into three phases:

![Phases of our project](initial_arquitecture.png)

These 3 phases have been further divided into more specific tasks:

**Create a Custom Tokenizer**:

- Develop a tokenizer specifically designed for our dataset of fairy tales to ensure accurate text processing.
  
**Train a BERT Model**: 

- Train a BERT model using the custom tokenizer and a diverse dataset of fairy tales to enable it to understand and generate narrative content effectively.

**Implement a RAG System**: 
- Develop a system that integrates a vector database for storing embeddings of fairy tales and uses these embeddings to enhance the story generation process.

**Generate Rich Fairy Tales**: 
- Use the embeddings retrieved from our database as context for an LLM to generate a richer and more engaging version of a fairy tale.

**Evaluate the System**: 
- Assess the performance of the system through qualitative and quantitative metrics to ensure it meets the desired objectives.

### Scope

The scope of this project encompasses the following key areas:


**Data Collection and Preprocessing:**

- Collect a dataset of fairy tales for training the BERT model.
- Preprocess the dataset to ensure it is suitable for training.

**Tokenizer Development:**

- Create a custom tokenizer tailored to the fairy tale dataset.

**Model Training and Development:**

- Train the BERT model on the collected dataset with two tasks that adapt it to narrative text:
  - **MLM (Masked Language Model)**
  - **NSP (Next Sentence Prediction)**

**Embedding and Vector Database:**

- Create embeddings of the fairy tales using the trained BERT model.
- Store these embeddings in a vector database to facilitate efficient retrieval.

**Retrieval-Augmented Generation System:**

- Develop the retrieval mechanism using cosine similarity to find relevant story contexts.
- Integrate the retrieval system with a language model to generate new stories based on user input.
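
As a rough illustration of the retrieval step, the sketch below ranks stories by cosine similarity to a query vector. The function names (`cosine_similarity`, `retrieve`) and the 3-dimensional toy vectors are hypothetical; a real system would use the model's embeddings and the vector database's own search.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, story_vecs, k=2):
    """Return the titles of the k stories most similar to the query."""
    ranked = sorted(
        story_vecs,
        key=lambda title: cosine_similarity(query_vec, story_vecs[title]),
        reverse=True,
    )
    return ranked[:k]

# Toy embeddings (hypothetical values, for illustration only).
stories = {
    "pirate tale": [0.9, 0.1, 0.0],
    "space tale": [0.1, 0.9, 0.0],
    "forest tale": [0.0, 0.1, 0.9],
}
query = [0.7, 0.7, 0.0]  # a request that mixes the first two themes
top = retrieve(query, stories, k=2)
```

Here the query is closest to the pirate and space stories, so those two would be passed on as context.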

**System Evaluation:**

- Conduct experiments to evaluate the retrieval accuracy and the quality of the generated stories.

**User Interface:**

- Design a user-friendly interface where users can input their story requests and receive generated stories.


# Methodology

### Data Collection

**Data Gathering**

To develop an advanced story generation system, we needed a robust and diverse dataset of fairy tales. Our data collection process involved multiple sources to ensure a comprehensive dataset. Here’s how we approached it:

1. **Kaggle**: We found several relevant datasets on Kaggle, including:
   - [Grimm’s Fairy Tales](https://www.kaggle.com/datasets/tschomacker/grimms-fairy-tales)
   - [Grimms' Brother Fairy Tale Dataset](https://www.kaggle.com/datasets/cornellius/grimms-brother-fairy-tale-dataset)

2. **Hugging Face**: We utilized datasets from Hugging Face, such as:
   - [FairytaleQA Dataset](https://huggingface.co/datasets/WorkInTheDark/FairytaleQA)
   - [FairyTales Dataset](https://huggingface.co/datasets/KyiThinNu/FairyTales)

3. **GitHub**: We accessed the FairytaleQA dataset from GitHub:
   - [FairytaleQAData](https://github.com/uci-soe/FairytaleQAData/tree/main)

4. **Web Scraping**: To supplement the datasets, we performed web scraping on various websites dedicated to fairy tales, including:
   - [Dream Little Star](https://dreamlittlestar.com/)
   - [Read the Tale](https://www.readthetale.com/)

We used web scraping tools and techniques to extract text data from these sources, carefully handling HTML parsing and cleaning the text to make it suitable for training.

By combining datasets from these sources, we compiled a rich and diverse collection of fairy tales. This dataset was then preprocessed to remove any inconsistencies and ensure it was ready for training our BERT model.

#### Dataset Analysis

To better understand the characteristics of our dataset, we performed a detailed analysis, including visualizations such as histograms. Below are some of the key insights:

- **Story Length Distribution**: The histogram below shows the distribution of story lengths in the dataset.

![Story Length Distribution](imagen3.png)

- **Vocabulary Size**: The histogram below represents the distribution of vocabulary sizes across different stories.

![Vocabulary Size Distribution](imagen2.png)

- **Average Sentence Length**: The histogram below shows the distribution of average sentence lengths in the dataset.

![Average Sentence Length](imagen1.png)

These analyses helped us to understand the dataset better and guided our preprocessing and model training steps.
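
Per-story statistics of this kind can be computed with a few lines of Python. The sketch below is only an approximation: the helper name `story_stats`, the regex-based sentence splitting and the word pattern are assumptions, not the exact procedure used for the histograms above.

```python
import re

def story_stats(text):
    """Return sentence count, vocabulary size and mean sentence length (in words)."""
    # Naive sentence split on ., ! or ? followed by whitespace (an assumption).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = re.findall(r"[A-Za-z']+", text.lower())
    return {
        "n_sentences": len(sentences),
        "vocab_size": len(set(words)),
        "avg_sentence_len": len(words) / len(sentences) if sentences else 0.0,
    }

stats = story_stats("Once upon a time there was a king. The king had a daughter!")
```

Applying such a function to every story yields the values that can then be plotted as histograms.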

#### Dataset Metrics

For training and validating our model, we have gathered a total of 1,183 fairy tales. Here are some key metrics of our dataset:

- **Total Sentences**: 128,541 sentences
- **Mean Sentences per Story**: 114 sentences
- **Total Words**: 2,631,859 words


### Tokenizer

**Byte-Pair Encoding (BPE) Tokenizer**

Initially, our team developed a Byte-Pair Encoding (BPE) tokenizer to process the fairy tales. BPE tokenization involves merging the most frequent pairs of characters or subwords iteratively until a specified vocabulary size is reached. This method helps in efficiently encoding words and subwords, which is beneficial for the training of our BERT model.

**Key steps in our BPE tokenizer implementation:**
1. **Initialization**: Define the number of merges (iterations) for the tokenization process.
2. **Vocabulary Creation**: Split words into characters and calculate the frequency of each pair of characters.
3. **Pair Merging**: Iteratively merge the most frequent pairs of characters to create subwords.

```python
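import re
from collections import defaultdict

# `vocab` maps each word, written as space-separated symbols (e.g. "f a i r y"), to its frequency.
for _ in range(self.num_merges):
    # Count how often each adjacent symbol pair occurs, weighted by word frequency.
    pairs = defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    if not pairs:
        break  # every word is already a single symbol
    best_pair = max(pairs, key=pairs.get)
    new_symbol = ''.join(best_pair)
    # Match the pair only at symbol boundaries so that partial symbols are never merged.
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(best_pair)) + r'(?!\S)')
    new_vocab = defaultdict(int)
    for word in vocab:
        new_word = pattern.sub(new_symbol, word)
        new_vocab[new_word] += vocab[word]
    vocab = new_vocab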
```

**WordPiece Tokenizer**

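After experimenting with the BPE tokenizer, we found that the WordPiece tokenizer provided better performance for our task. Like BPE, the WordPiece tokenizer builds its vocabulary iteratively by merging subword pairs. The difference is that each merge is selected with a pair score rather than by raw pair frequency alone (in the commonly described formulation, the pair frequency is divided by the product of the frequencies of its two parts), and subwords that continue a word are marked with a `##` prefix.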

**Key advantages of the WordPiece tokenizer:**
1. **Efficiency in handling rare words**: The WordPiece tokenizer can break down rare words into subwords more effectively, improving the model's ability to generalize.
2. **Handling of special characters**: The tokenizer includes special tokens for punctuation, spaces, and other characters, enhancing its ability to process diverse text formats.

**Key steps in our WordPiece tokenizer implementation:**
1. **Initialization**: Define the vocabulary size and special tokens.
2. **Word Frequency Calculation**: Count the frequency of words and characters in the text.
3. **Vocabulary Building**: Iteratively merge the highest-scoring pairs to form subwords until the vocabulary reaches the target size.


```python
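while len(self.vocab) < self.vocab_size:
    # Score every adjacent pair in the current splits and merge the best one.
    scores = self._compute_pair_scores(splits)
    best_pair = max(scores, key=scores.get)
    splits = self._merge_pair(*best_pair, splits)
    # Drop the "##" continuation prefix of the second token when joining.
    if best_pair[1].startswith("##"):
        new_token = best_pair[0] + best_pair[1][2:]
    else:
        new_token = best_pair[0] + best_pair[1]
    self.vocab.append(new_token)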
```
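
The body of `_compute_pair_scores` is not shown above. The sketch below illustrates the commonly described WordPiece score, `freq(pair) / (freq(first) * freq(second))`, on a toy corpus; the standalone function signature and the word counts are assumptions made for illustration and may differ from our actual implementation.

```python
from collections import defaultdict

def compute_pair_scores(word_freqs, splits):
    """Return (scores, pair_freqs) for all adjacent token pairs in the corpus."""
    token_freqs = defaultdict(int)
    pair_freqs = defaultdict(int)
    for word, freq in word_freqs.items():
        tokens = splits[word]
        for token in tokens:
            token_freqs[token] += freq
        for left, right in zip(tokens, tokens[1:]):
            pair_freqs[(left, right)] += freq
    # Normalizing by the parts' frequencies favors pairs whose parts rarely occur apart.
    scores = {
        pair: freq / (token_freqs[pair[0]] * token_freqs[pair[1]])
        for pair, freq in pair_freqs.items()
    }
    return scores, dict(pair_freqs)

# Toy corpus (hypothetical counts); word-internal characters carry the "##" prefix.
word_freqs = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
splits = {w: [w[0]] + ["##" + c for c in w[1:]] for w in word_freqs}

scores, pair_freqs = compute_pair_scores(word_freqs, splits)
best_by_score = max(scores, key=scores.get)          # the pair WordPiece would merge
most_frequent = max(pair_freqs, key=pair_freqs.get)  # the pair BPE would merge
```

On this corpus the two criteria disagree: the most frequent pair is `("##u", "##g")`, while the highest score goes to `("##g", "##s")`, whose parts almost always occur together.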

**Conclusion: WordPiece Tokenizer > BPE Tokenizer**

We opted for the WordPiece tokenizer over the Byte-Pair Encoding (BPE) tokenizer for several key reasons:

##### Key Differences

1. **Handling Rare Words**:
   - **BPE**: In our experiments, merges driven purely by pair frequency handled rare words less effectively.
   - **WordPiece**: Breaks down rare words into smaller, meaningful subwords, improving generalization.

2. **Special Characters**:
   - **BPE**: Our BPE implementation has no explicit handling of special characters.
   - **WordPiece**: Our implementation includes special tokens for punctuation and spaces, providing accurate text representation.

3. **Vocabulary Efficiency**:
   - **BPE**: Our implementation stops after a fixed number of merges, so the final vocabulary size is not controlled directly.
   - **WordPiece**: Builds the vocabulary until a target vocabulary size is reached.

4. **Contextual Understanding**:
   - **BPE**: May miss contextual nuances in narrative text.
   - **WordPiece**: Provides a more granular word breakdown and handling of special characters, which helps the model capture context.


In our experiments, the WordPiece tokenizer performed better at handling rare words and special characters and produced a more efficient tokenization. These advantages made it our choice for processing the fairy tale dataset.


**Tokenizer Output**

Specifically, our final tokenizer outputs the following information:

- **Token IDs**: Each token in the text is mapped to a unique integer ID from the vocabulary.
- **Special Tokens**: These are tokens added to the text to provide additional information about the structure of the input.
  - **[CLS]**: Added at the beginning of the text. Represents the entire input sequence and its embedding is used for classification tasks.
  - **[SEP]**: Separates different parts of the input (e.g., question and answer, two sentences). For single sequences, it is added at the end.
  - **[PAD]**: Used to pad sequences to the same length within a batch.
- **Attention Mask**: Indicates which tokens should be attended to and which should be ignored (due to padding).
- **Token Type IDs**: Identifies different segments in the input. For single sequences, all values are typically 0. For paired sequences, the first sequence might have all 0s and the second sequence all 1s.
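
To make these outputs concrete, the sketch below assembles them for a sentence pair that has already been converted to token IDs. The helper name `encode_pair` and the special-token IDs (`[PAD]`=0, `[CLS]`=1, `[SEP]`=2) are assumptions for illustration; the actual IDs depend on the trained vocabulary.

```python
def encode_pair(ids_a, ids_b, max_len, cls_id=1, sep_id=2, pad_id=0):
    """Build BERT-style inputs for a sentence pair from already-tokenized IDs."""
    # [CLS] sentence A [SEP] sentence B [SEP]
    input_ids = [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]
    # Segment 0 covers [CLS] + A + first [SEP]; segment 1 covers B + final [SEP].
    token_type_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
    if len(input_ids) > max_len:
        raise ValueError("pair is longer than max_len")
    attention_mask = [1] * len(input_ids)
    n_pad = max_len - len(input_ids)
    return {
        "input_ids": input_ids + [pad_id] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,
        "token_type_ids": token_type_ids + [0] * n_pad,
    }

enc = encode_pair([11, 12, 13], [21, 22], max_len=10)
```

In this example the last two positions are padding, so their attention mask entries are 0.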


### Model Architecture

#### Custom BERT

##### Visual Summary

Here's the image illustrating the BERT model architecture:

![Model Architecture](bert_model.png)

The following subsections explain each component in detail.

##### Input of our model

1. **Tokenizer Output**

**Components**:
- **Input IDs**: Unique integer IDs for each token in the text.
- **Attention Mask**: Indicates which tokens should be attended to (1) and which should be ignored (0) due to padding.
- **Segment IDs**: Identifies different segments in the input. For single sequences, all values are typically 0. For paired sequences, the first sequence might have all 0s and the second sequence all 1s.


##### Embedding Layer

The embedding layer converts input tokens into dense vectors.

```python
class EmbeddingLayer(nn.Module):
    def __init__(self, vocab_size, embed_size, seq_len=64, dropout=0.1):
        super(EmbeddingLayer, self).__init__()
        self.token_embeddings = nn.Embedding(vocab_size, embed_size, padding_idx=0)
        self.segment_embeddings = nn.Embedding(3, embed_size, padding_idx=0)
        self.position_embeddings = PositionalEmbedding(d_model=embed_size, max_len=seq_len)
        self.dropout = nn.Dropout(p=dropout)
       
    def forward(self, input_ids, segment_ids):        
        x = self.token_embeddings(input_ids) + self.position_embeddings(input_ids) + self.segment_embeddings(segment_ids)
        x = self.dropout(x)
        return x
```

**Key Steps:**

- Convert token IDs to embeddings.
- Add positional embeddings to encode the position of each token. (See next section)
- Add segment embeddings to differentiate between different segments of input.
- Apply dropout for regularization.

##### Positional Encoding

The positional encoding adds positional information to the embeddings so that the model can take the order of the tokens into account.

```python
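import math
import torch

class PositionalEmbedding(torch.nn.Module):
    def __init__(self, d_model, max_len=128):
        super().__init__()
        # Fixed (non-trainable) sinusoidal table of shape (max_len, d_model); d_model must be even.
        pe = torch.zeros(max_len, d_model).float()
        for pos in range(max_len):
            for i in range(0, d_model, 2):
                # Note: i already steps over even indices, so these exponents differ
                # from the standard 10000 ** (i / d_model) used for both sin and cos.
                pe[pos, i] = math.sin(pos / (10000 ** ((2 * i)/d_model)))
                pe[pos, i + 1] = math.cos(pos / (10000 ** ((2 * (i + 1))/d_model)))
        # Non-persistent buffer: follows the module across devices, stays out of the state dict.
        self.register_buffer("pe", pe.unsqueeze(0), persistent=False)
    
    def forward(self, x):
        # The table is returned whole, so inputs must be padded to max_len tokens.
        return self.pe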
```
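
For comparison, the standard sinusoidal formulation from the original Transformer paper can be written in vectorized form. This is a reference sketch rather than the code used in our model; the function name `sinusoidal_positions` is hypothetical and `d_model` is assumed to be even.

```python
import math
import torch

def sinusoidal_positions(max_len, d_model):
    """PE[pos, 2k] = sin(pos / 10000^(2k/d_model)); PE[pos, 2k+1] = cos of the same angle."""
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)  # (max_len, 1)
    # exp(-log(10000) * 2k / d_model) == 1 / 10000^(2k/d_model)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32) * (-math.log(10000.0) / d_model)
    )
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return pe.unsqueeze(0)  # (1, max_len, d_model), broadcastable over the batch

pe = sinusoidal_positions(max_len=128, d_model=16)
```

Each sine and cosine pair shares one frequency here, which avoids the nested Python loops and builds the table in a single pass.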

##### Transformer Encoder Block

The encoder consists of multiple encoder layers, each of which applies a self-attention mechanism followed by a feed-forward network.

```python
class EncoderLayer(torch.nn.Module):
    def __init__(self, d_model=768, heads=12, feed_forward_hidden=768 * 4, dropout=0.1):
        super(EncoderLayer, self).__init__()
        self.layernorm = torch.nn.LayerNorm(d_model)
        self.self_multihead = MultiHeadedAttention(heads, d_model)
        self.feed_forward = FeedForward(d_model, middle_dim=feed_forward_hidden)
        self.dropout = torch.nn.Dropout(dropout)

    def forward(self, embeddings, mask):
        interacted = self.dropout(self.self_multihead(embeddings, embeddings, embeddings, mask))
        interacted = self.layernorm(interacted + embeddings)
        feed_forward_out = self.dropout(self.feed_forward(interacted))
        encoded = self.layernorm(feed_forward_out + interacted)
        return encoded
```

**Key Steps:**

 - **Self-Attention:** Allows the model to focus on different parts of the input sequence.
 - **Feed-Forward Network:** Applies a fully connected feed-forward network to each token.
 - **Layer Normalization and Dropout:** Normalizes the output and applies dropout for regularization.

##### Training Tasks

**Explanation of our training tasks:**
- For our **MLM task**, the goal is to predict the tokens that we have previously masked.
- For our **NSP task**, the goal is to predict whether the second sentence of the input actually follows the first one in the original text.
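
The masking step for the MLM task can be sketched as follows. The 15% masking rate and the 80/10/10 replacement rule follow the original BERT recipe and are assumptions here, as are the function name `mask_tokens`, the token IDs and the use of 0 as the label for positions that are not predicted.

```python
import random

def mask_tokens(token_ids, mask_id, vocab_size, special_ids,
                mask_prob=0.15, ignore_label=0, rng=random):
    """Return (masked_ids, labels); labels hold the original token only where it must be predicted."""
    masked, labels = [], []
    for tok in token_ids:
        if tok in special_ids or rng.random() >= mask_prob:
            masked.append(tok)           # position left untouched
            labels.append(ignore_label)  # and excluded from the MLM loss
            continue
        labels.append(tok)               # the model must recover the original token
        roll = rng.random()
        if roll < 0.8:
            masked.append(mask_id)                    # 80%: replace with [MASK]
        elif roll < 0.9:
            masked.append(rng.randrange(vocab_size))  # 10%: replace with a random token
        else:
            masked.append(tok)                        # 10%: keep the token unchanged
    return masked, labels

# Hypothetical IDs: 0=[PAD], 1=[CLS], 2=[SEP], 3=[MASK]; regular tokens start at 10.
ids = [1] + list(range(10, 60)) + [2]
masked_ids, labels = mask_tokens(ids, mask_id=3, vocab_size=100,
                                 special_ids={0, 1, 2}, rng=random.Random(0))
```

For the NSP task, each training pair would additionally carry a binary label that states whether the second sentence is the true continuation or a randomly drawn one.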
    

###### MLM: Masked Language Model

```python
class MaskedLanguageModel(nn.Module):
    def __init__(self, hidden, vocab_size):
        super().__init__()
        self.linear = nn.Linear(hidden, vocab_size)
        self.softmax = nn.LogSoftmax(dim=-1)
    
    def forward(self, input):
        x = self.linear(input)
        x = self.softmax(x)
        return x
```

**Key Steps:**
- Apply a linear layer followed by a log-softmax over the vocabulary.

###### NSP: Next Sentence Prediction

```python
class NextSentencePrediction(nn.Module):
    def __init__(self, hidden):
        super().__init__()
        self.linear = nn.Linear(hidden, 2)
        self.softmax = nn.LogSoftmax(dim=-1)
    
    def forward(self, input):
        x = input[:, 0]
        x = self.linear(x)
        x = self.softmax(x)
        return x
```    

**Key Steps:**
- Use only the first token ([CLS]) to predict if the next sentence follows.
- Apply a linear layer followed by a log-softmax for classification.

#### Transfer Learning with DistilBERT

As a second approach for building a model that generates coherent embeddings, we opted for fine-tuning a pretrained DistilBERT model. DistilBERT is a transformer model based on the BERT architecture described above, but with 40% fewer parameters. The DistilBERT base model is trained using knowledge distillation, which compresses a larger model known as the teacher (in this case BERT) into a smaller model called the student (in this case DistilBERT)[1](https://arxiv.org/abs/1910.01108).

In the context of our project, the pretrained DistilBERT model was imported from [Hugging Face](https://huggingface.co/distilbert/distilbert-base-uncased). This model was originally trained on the BookCorpus and English Wikipedia datasets. Unlike the original BERT, its pretraining combined the masked language modelling objective with distillation losses from the BERT teacher and did not include next sentence prediction[1](https://arxiv.org/abs/1910.01108). DistilBERT follows practically the same architecture as the custom BERT model described above, but it contains 6 transformer encoder blocks and has no segment (token-type) embeddings.

To build and train our DistilBERT model, we performed transfer learning from the pretrained checkpoint so that training starts from pretrained weights in the embedding layer, the transformer encoder blocks and the MLM head. The NSP head is built from the sequence classification layers of the Hugging Face implementation, which are not part of the pretrained checkpoint and are therefore newly initialized. We then froze all weights of the base model except those of the last transformer encoder block, and trained that block together with the MLM and NSP heads. This simplifies the training process, since the number of trainable weights is drastically reduced, while the weights transferred from a more extensive pretraining process remain unmodified.

Finally, in order to use the pretrained model, the pretrained DistilBERT tokenizer (with a vocabulary size of 30522 tokens) was used to process and tokenize our dataset.

```python
import torch.nn as nn
from transformers import DistilBertForMaskedLM, DistilBertForSequenceClassification

class BERT_TL(nn.Module):
    """
    BERT Language Model - Fine-tuning DistilBERT
    Next Sentence Prediction Model + Masked Language Model
    Separated to be able to do inference to the main model
    """

    def __init__(self):
        super().__init__()
        model_MLM = DistilBertForMaskedLM.from_pretrained("distilbert-base-uncased")
        # The classification layers are not in the checkpoint, so they start from random weights.
        model_NSP = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased")

        self.bert = model_MLM.distilbert
        # Freeze the whole base model, then unfreeze only the last encoder block.
        for param in self.bert.parameters():
            param.requires_grad = False
        for param in self.bert.transformer.layer[-1].parameters():
            param.requires_grad = True

        # Both heads reuse the Hugging Face layers and stay trainable.
        self.next_sentence = nn.Sequential(model_NSP.pre_classifier, model_NSP.classifier)
        self.mask_lm = nn.Sequential(model_MLM.vocab_transform, model_MLM.vocab_layer_norm, model_MLM.vocab_projector, nn.LogSoftmax(dim=-1))
```

### Training Routine

#### Dataset split for training, validating and testing

The built dataset was split into three subsets in order to generate a train set, which will be used to compute the loss during the training step and update the model's weights through backpropagation; a validation set, which will be used to evaluate the model during training and adjust different hyperparameters correctly; and a test set, which is used to evaluate the model's generalization capabilities after performing all training steps. These splits were performed at a "text" level to avoid introducing sentences coming from the same text in different subsets, which in our case is key to obtain independent splits for performing each of the aforementioned tasks.

We performed a train-validation-test split of approximately 80%-10%-10%. The main statistics of each subset are described in the table below:

| Subset | Number of texts | Number of sentences |
|----------|----------|----------|
| **Train set** | 894 | 107326 |
| **Validation set** | 112 | 12992 |
| **Test set** | 112 | 10938 |

To ensure that each subset was representative of the whole dataset, we also tested whether the distributions of different parameters in each subset were statistically similar to those of the complete dataset, using [TODO: name the statistical test; the table with the test results is still missing]. The results of these statistical tests can be found in the following figure:

![Subset splits](split.png)


#### Hyperparameters and training methods

##### Learning rate scheduler


In our training routine we decided to use a learning rate scheduler, a mechanism that adjusts the learning rate during the training process. The learning rate determines the size of the steps the optimization algorithm takes when updating the model's weights. Adjusting it properly is crucial because it can significantly affect the training dynamics and the performance of the model. Several functions can be used to update the learning rate (such as step decay, exponential decay, reduce on plateau or cosine annealing, to name a few).

In our case, we wanted to start with a higher learning rate for an initial adjustment of the model weights and then refine the weights with a lower learning rate. We therefore increase the learning rate linearly during a fixed number of warm-up steps, and then apply an exponential decay scheduler until the fixed final learning rate is reached. Thus, both the number of warm-up steps and the final learning rate were hyperparameters that needed to be adjusted. The figure below shows an example of the implemented learning rate scheduler.

<img src="learning_Rate_scheduler.png" alt="drawing" width="600" height="400"/>
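
A schedule of this shape can be expressed as a multiplicative factor on the base learning rate. The sketch below is a minimal version; the function name `lr_factor` and the values of `warmup_steps`, `gamma` and `min_factor` are hypothetical and do not correspond to the hyperparameters used in our experiments.

```python
def lr_factor(step, warmup_steps=1000, gamma=0.999, min_factor=0.01):
    """Multiplicative factor applied to the base learning rate at a given step."""
    if step < warmup_steps:
        # Linear warm-up: the factor grows from 1/warmup_steps up to 1.0.
        return (step + 1) / warmup_steps
    # Exponential decay after the warm-up phase.
    decayed = gamma ** (step - warmup_steps + 1)
    # Floor at the final learning rate (expressed as a fraction of the base rate).
    return max(decayed, min_factor)

factors = [lr_factor(s) for s in (0, 999, 1000, 20000)]
```

In PyTorch, a function like this can be passed as the `lr_lambda` argument of `torch.optim.lr_scheduler.LambdaLR`, with `scheduler.step()` called after each optimizer step.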


##### Dropout

##### Loss functions

As it was previously explained, training our BERT model implies performing two independent tasks: predicting the correct token for each masked one in the MLM head; and predicting if the second sentence of our input is the following sentence of the same text in the NSP head. In essence, both of them are classification tasks. As such, we will need to use an individual loss function that is able to compare the predicted value of each head with its true value, and then add both values to compute the total loss for a given batch of samples: $$ Loss = Loss_{MLM} + Loss_{NSP} $$

For training the custom BERT model, we decided to use the negative log-likelihood loss for both the MLM and NSP tasks. This is why we use a logarithmic softmax as the final activation function of both heads. For fine-tuning the pretrained DistilBERT, on the other hand, we used two different loss functions: the MLM task was evaluated using the negative log-likelihood loss, while the NSP task was assessed using the cross-entropy loss applied directly to the outputs of the NSP head. We made this change because cross-entropy is the loss used for the sequence classification head in the Hugging Face DistilBERT implementation, from which our NSP head is taken; the original DistilBERT pretraining itself did not include an NSP task.
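
The two formulations are closely related: the negative log-likelihood loss applied to log-softmax outputs gives the same value as the cross-entropy loss applied to raw logits. The short check below illustrates this on random data; the batch size and the label convention are assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(4, 2)            # e.g. NSP logits for a batch of 4 sentence pairs
targets = torch.tensor([0, 1, 1, 0])  # 1 = "is the next sentence" (assumed convention)

# Option 1: log-softmax in the head, followed by NLLLoss.
nll = nn.NLLLoss()(torch.log_softmax(logits, dim=-1), targets)
# Option 2: raw logits passed to CrossEntropyLoss, which applies log-softmax internally.
ce = nn.CrossEntropyLoss()(logits, targets)
```

The practical difference therefore lies in where the log-softmax is applied, not in the quantity being optimized.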

##### Batch size and gradient accumulation

When building each batch, for either training or validation, a single masked sentence pair is drawn from each text until the batch size is reached. This ensures that all items in a batch come from different texts. The batch size was treated as a hyperparameter in order to respect local resource constraints and to balance training speed, stability and convergence.

In addition, we introduced gradient accumulation during training. Gradient accumulation is used to effectively increase the batch size without requiring more resources than are available. With this method, the gradients are accumulated over multiple mini-batches before the optimizer updates the weights, simulating a larger batch size.
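
A minimal sketch of gradient accumulation in PyTorch is shown below. The tiny linear model, the random data and `accum_steps = 4` are placeholders for illustration; only the Adam settings (betas and weight decay) mirror the values we report.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(8, 2)  # stand-in for the real model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                             betas=(0.9, 0.999), weight_decay=0.01)
loss_fn = nn.CrossEntropyLoss()
accum_steps = 4  # number of mini-batches per weight update (hypothetical value)
batches = [(torch.randn(16, 8), torch.randint(0, 2, (16,))) for _ in range(8)]

n_updates = 0
optimizer.zero_grad()
for i, (x, y) in enumerate(batches, start=1):
    # Scale the loss so the accumulated gradient matches the mean over the large batch.
    loss = loss_fn(model(x), y) / accum_steps
    loss.backward()  # gradients are added to the existing .grad tensors
    if i % accum_steps == 0:
        optimizer.step()       # one update per accum_steps mini-batches
        optimizer.zero_grad()
        n_updates += 1
```

With 8 mini-batches of 16 samples and `accum_steps = 4`, the model receives 2 updates, each based on an effective batch of 64 samples.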

##### Optimizer

The optimizer is the method used to adjust the weights of the neural network in order to minimize the loss function during training. In our case, we decided to use the Adam algorithm with fixed beta coefficients of (0.9, 0.999) and a weight decay of 0.01 for all experiments, while the learning rate was adjusted as a hyperparameter.

##### Early stopper

To avoid overfitting and a degradation of the validation performance during training, we implemented an early stopper as a regularization method. In essence, after each epoch the early stopper checks whether the validation loss is lower than the best one observed so far. If so, the best validation loss is updated, the counter is reset and training continues as scheduled. Otherwise, if the validation loss exceeds the best value by more than `min_delta`, a counter is increased, and training is stopped once the loss has not improved for a specific number of epochs (patience). The patience parameter was set to 10 epochs.

```python
class EarlyStopper:
    def __init__(self, patience=1, min_delta=0):
        self.patience = patience
        self.min_delta = min_delta
        self.counter = 0
        self.min_validation_loss = float('inf')

    def early_stop(self, validation_loss):
        if validation_loss < self.min_validation_loss:
            self.min_validation_loss = validation_loss
            self.counter = 0
        elif validation_loss > (self.min_validation_loss + self.min_delta):
            self.counter += 1
            print('*')
            if self.counter >= self.patience:
                return True
        return False
```


### Embedding Generation

#### Custom BERT

**Inference and Embedding Generation**

During inference, we use the BERT model **without** the **Next Sentence Prediction (NSP)** and **Masked Language Model (MLM)** tasks. Instead, we focus on generating meaningful embeddings for our fairy tale dataset. These embeddings capture the contextual information of the input text, which is crucial for our RAG system.

**Embedding Generation:**
- The BERT model's embedding layer and encoder layers are used to transform input text into dense vector representations.
- These embeddings encapsulate the semantic meaning and context of the input text.

**Visual Summary**

The image below illustrates the BERT model as it is used during inference:

![Model Architecture](bert_inf.png)

The code below shows the corresponding encoder-only model.

```python
class BERT(nn.Module):
    def __init__(self, vocab_size, seq_len=512, d_model=768, n_layers=12, heads=12, dropout=0.1):
        super().__init__()
        self.d_model = d_model
        self.n_layers = n_layers
        self.heads = heads
        self.feed_forward_hidden = d_model * 4
        self.embedding = EmbeddingLayer(vocab_size=vocab_size, embed_size=d_model, seq_len=seq_len, dropout=dropout)
        self.encoder_blocks = nn.ModuleList(
            [EncoderLayer(d_model, heads, d_model * 4, dropout) for _ in range(n_layers)])
    
    def forward(self, x, segment_info):
        mask = (x > 0).unsqueeze(1).repeat(1, x.size(1), 1).unsqueeze(1)
        x = self.embedding(x, segment_info)
        for encoder in self.encoder_blocks:
            x = encoder.forward(x, mask)
        return x
```
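
The model above returns one vector per token, so a pooling step is needed to obtain a single embedding per text. The sketch below uses mean pooling over non-padding tokens; this choice and the function name `mean_pool` are assumptions for illustration (taking the `[CLS]` vector is another common option).

```python
import torch

def mean_pool(hidden_states, attention_mask):
    """Average token vectors over non-padding positions, giving one vector per text."""
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)  # (batch, seq, 1)
    summed = (hidden_states * mask).sum(dim=1)                   # padding contributes zero
    counts = mask.sum(dim=1).clamp(min=1.0)                      # avoid division by zero
    return summed / counts

# Toy encoder output: batch of 1, 3 tokens, hidden size 2; the last token is padding.
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])
story_vec = mean_pool(hidden, mask)
```

The padded position is ignored, so the result is the average of the first two token vectors.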

#### Transfer Learning with DistilBERT

### Vector DB

#### Chroma DB

#### Zilliz

### LLM & prompting

### UI

# Experiments & Results

### Tensorboard for metric follow-up

### Challenges that we faced

#### Introduction

#### Token Length adaptation

In our first training iterations, the combined loss of our training tasks would become NaN at what seemed to be completely random iterations.

Sometimes it happened in the second epoch, sometimes later. This made it hard to pinpoint where the error was coming from.

**Solution**

At that point, the maximum sequence length of our inputs was 512 tokens. Most of the sentence pairs that we processed had a lower token count than that, so in most cases we added [PAD] (padding) tokens to make up the difference up to 512 tokens.

However, after an extensive search for the specific inputs that were crashing, we eventually realized that whenever the combination of the two sentences was long enough (exactly 512 tokens), no [PAD] tokens were added. In those cases, our attention mask failed to determine which parts of the input required attention and which did not.

We solved it by making sure that, whenever combining a sentence with another one produced exactly 512 tokens, we looked for another sentence that did not cause this issue. Combinations longer than 512 tokens were discarded from the start.

#### Labeling of Masked Tokens

After debugging these issues, our training finally ran without errors. However, we started to realize that our model was failing to learn properly.

**Tokenizer modification**

One of our initial approaches was to replace our tokenizer with the original BERT tokenizer, to make sure that our custom tokenizer wasn't the problem. However, after making the change, we found that the training process would still eventually stagnate.

<img src="loss_0.png" alt="Loss of our first iterations of training" style="width:50%;">

<img src="Acc_0.png" alt="Accuracy of our first iterations of training" style="width:50%;">

- **Blue & Grey** lines are the loss using the BERT tokenizer
- **Green** line used our custom tokenizer

**Some hyperparameter modifications**

We then tried to increase the size of the model and added several improvements to the training setup.

We applied the following modifications:
- Batch size increase
- Modification of segment IDs
- More epochs
- Adding warm-up and a learning rate scheduler
 

<img src="loss_1.png" alt="Loss" style="width:50%;">

<img src="Acc_1.png" alt="Accuracy" style="width:50%;">

Even though it looked much better than before, our Loss continued to stagnate, and the accuracy of our MLM task was very low, clearly indicating that the model was not yet learning.

**Solution**

We finally realized that there was a problem in the masking inside the encoder.

**Padding Mask:** In the encoder, there was a mask used for padding that was incorrectly set up. The padding mask should typically ignore the padding tokens (usually set to **False** for padding and **True** for actual content), but it was the other way around (padding was marked as **True** and content as **False**).
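
The intended convention can be illustrated with a small, self-contained example. This is not our encoder code; the token IDs and the uniform attention scores are placeholders, and `0` is assumed to be the `[PAD]` ID.

```python
import torch

input_ids = torch.tensor([[5, 8, 9, 0, 0]])  # two trailing [PAD] tokens
keep = input_ids > 0                         # True = real token, False = padding

scores = torch.zeros(1, 5, 5)                # dummy attention scores (query x key)
# Block the padded key positions for every query before the softmax.
scores = scores.masked_fill(~keep.unsqueeze(1), float("-inf"))
weights = torch.softmax(scores, dim=-1)
```

With the correct mask, every query distributes its attention over the three real tokens only. With the inverted mask, the real tokens would receive zero weight and the model would attend only to padding.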

We then set up an experiment with batch size 1 to check whether the model was able to overfit. And it did:


<img src="Acc_2.png" alt="Accuracy" style="width:30%;">

**New Metric**
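- At this point, we also decided to add a top-5 token accuracy metric, which counts a prediction as correct when the true token is among the five highest-scoring candidates.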
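
A top-k accuracy of this kind can be computed as sketched below. The function name `topk_accuracy` and the use of label `0` to mark positions that are not predicted are assumptions for illustration.

```python
import torch

def topk_accuracy(log_probs, labels, k=5, ignore_index=0):
    """Fraction of predicted positions whose true token is among the k best-scoring ones."""
    selected = labels != ignore_index                  # only masked positions carry a label
    topk = log_probs.topk(k, dim=-1).indices           # (batch, seq, k)
    hits = (topk == labels.unsqueeze(-1)).any(dim=-1)  # (batch, seq)
    # Note: returns NaN if no position is selected.
    return hits[selected].float().mean().item()

# Toy scores over a 10-token vocabulary: tokens 5..9 are always the top 5.
scores = torch.arange(10, dtype=torch.float32).repeat(1, 3, 1)  # (1, 3, 10)
labels = torch.tensor([[7, 2, 0]])  # hit, miss, ignored position
acc = topk_accuracy(scores, labels)
```

In this toy case one of the two labeled positions is a hit, so the top-5 accuracy is 0.5.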

After performing another complete training run, we saw that training no longer stagnated as much:

<img src="Acc_3.png" alt="Accuracy" style="width:30%;">
<img src="Acc_4.png" alt="Accuracy" style="width:30%;">
<img src="loss_2.png" alt="Loss" style="width:30%;">

#### NSP Continues to be an issue

At this point, our MLM task was learning pretty well, and it was all about tuning the hyperparameters.
However, we observed that our NSP (Next Sentence Prediction) task was still completely random.

- This indicated that even though our model was learning the vocabulary of our tokenizer, it did not understand the semantic meaning of the sentences well enough to tell whether one sentence followed another.

**Weighted Loss**

One of the first things we tried was to give different weights to the losses of the NSP task and the MLM task in order to push the model towards a better performance in the NSP task.

This did not really work. 

**Learning rate and other hyperparameters**

 - We tried lowering the learning rate, even though our scheduler already reduces it during training
 - We increased the hidden size of the feed-forward layer

**Dataset Issues**

One of our fears from the start was that our dataset would not be good enough for this complex task.
In the end, we realized that this was indeed the case.

Roughly 1,200 fairy tales are not enough for a model to learn a language from scratch.

At this point we were very disappointed, but we had to improvise a solution to make the model work.

**Transfer Learning**



### Description of the experiments performed

### Results of the model

# Conclusions & further exploration

## DistilBERT vs Custom BERT

# References

[1]: Sanh, Victor, et al. "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter." arXiv preprint [arXiv:1910.01108](https://arxiv.org/abs/1910.01108) (2019).