# Machine Learning for Software Analysis (MLSA)

## University of Florence -- IMT School for Advanced Studies Lucca

### Fabio Pinelli
<a href="mailto:fabio.pinelli@imtlucca.it">fabio.pinelli@imtlucca.it</a><br/>
IMT School for Advanced Studies Lucca<br/>
2025/2026<br/>
November 11, 2025

In [None]:
###########################################################
!pip install gensim==4.3.3
# The library has been archived and won't be used anymore
# # !pip install allennlp==0.9.0
!pip install flair==0.13.1
!pip install torchvision==0.18.1
# # HuggingFace

!pip uninstall -y transformers peft
!pip install transformers==4.38.0
!pip install datasets==2.18.0
!pip install peft==0.8.2
!pip install accelerate==0.30.0
###########################################################

In [None]:
try:
    import google.colab
    import requests
    url = 'https://raw.githubusercontent.com/dvgodoy/PyTorchStepByStep/master/config.py'
    r = requests.get(url, allow_redirects=True)
    open('config.py', 'wb').write(r.content)
except ModuleNotFoundError:
    pass

from config import *
config_chapter11()
# This is needed to render the plots in this chapter
from plots.chapter11 import *

In [None]:
import os
import json
import errno
import requests
import numpy as np
from copy import deepcopy
from operator import itemgetter

import torch
import torch.optim as optim
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset, Dataset

from data_generation.nlp import ALICE_URL, WIZARD_URL, download_text
from stepbystep.v4 import StepByStep
# These are the classes we built in previous class 10
from seq2seq import *

import nltk
from nltk.tokenize import sent_tokenize
nltk.download('punkt_tab')

In [None]:
import gensim
from gensim import corpora, downloader
from gensim.parsing.preprocessing import *
from gensim.utils import simple_preprocess
from gensim.models import Word2Vec

In [None]:
from flair.data import Sentence
#from flair.embeddings import ELMoEmbeddings, WordEmbeddings, \
#    TransformerWordEmbeddings, TransformerDocumentEmbeddings
from flair.embeddings import WordEmbeddings, \
    TransformerWordEmbeddings, TransformerDocumentEmbeddings

In [None]:
from datasets import load_dataset, Split
from transformers import (
    DataCollatorForLanguageModeling,
    BertModel, BertTokenizer, BertForSequenceClassification,
    DistilBertModel, DistilBertTokenizer,
    DistilBertForSequenceClassification,
    AutoModelForSequenceClassification,
    AutoModel, AutoTokenizer, AutoModelForCausalLM,
    Trainer, TrainingArguments, pipeline, TextClassificationPipeline
)
from transformers.pipelines import SUPPORTED_TASKS

# Down the Yellow Brick Rabbit Hole

![](https://github.com/dvgodoy/PyTorchStepByStep/blob/master/images/alice_dorothy.png?raw=1)

*Left: "Alice and the Baby Pig" illustration by John Tenniel, from "Alice's Adventures in Wonderland" (1865).*

*Right: "Dorothy meets the Cowardly Lion" illustration by W.W. Denslow, from "The Wonderful Wizard of Oz" (1900)*

## Classification task

- Given a sentence, we want to classify whether it comes from *Alice's Adventures in Wonderland* or *The Wonderful Wizard of Oz*.

- This is a Natural Language Processing (NLP) task.

- We explore how to handle words, which are not numbers: how to turn them into something a model can learn from, and how to use that model in various tasks.

- Tokenization, Embeddings, BERT, ChatGPT...

- Some references to understand what we do in Software Analysis with ML (you can also find them in our repository):

https://arxiv.org/pdf/2002.08155


https://arxiv.org/abs/1803.09473


https://dl.acm.org/doi/pdf/10.1145/3460348

# Building a Dataset

In [None]:
localfolder = 'texts'
download_text(ALICE_URL, localfolder)
download_text(WIZARD_URL, localfolder)

In [None]:
with open(os.path.join(localfolder, 'alice28-1476.txt'), 'r') as f:
    alice = ''.join(f.readlines()[104:3704])

with open(os.path.join(localfolder, 'wizoz10-1740.txt'), 'r') as f:
    wizard = ''.join(f.readlines()[310:5100])

In [None]:
print(alice[:500])
print('\n')
print(wizard[:500])


You can explore the downloaded text in the Colab left panel [folder icon]: the initial part of both books contains text that is useless for our purposes.

We write a cfg file listing the range of lines to keep for each book; then we identify the sentences in the books, which will make up our training dataset.

In [None]:
text_cfg = """fname,start,end
alice28-1476.txt,104,3704
wizoz10-1740.txt,310,5100"""
bytes_written = open(os.path.join(localfolder, 'lines.cfg'), 'w').write(text_cfg)

### Setting the goals

Our goal is to **transform** the books into CSV files so that they can be read as datasets.

Each CSV file will have one line for each sentence of the book, so that we can try to classify the sentences by their source.

Something like:
```
sentence,source
"dsf sdf sdfsd sdfs, sdfs: dsfsdf, sdfs", Alice
"ipopsd iops siopiopi sdoiop sdifop", Alice
"mnmnb nmmnnj jj kjkj; \"jkj jk jk, ", Alice
"qweq qweq qweqw, qweqe ,\"qweqwe w,qwe,w\" ", Alice
```

## Sentence Tokenization

You might have already met the words **token** and **tokenization**:
- A **token** is a piece of text
- To tokenize a text means to split it into pieces (tokens), obtaining a list of **tokens**

The most common kind of piece is a word, so tokenizing a text usually means splitting it into words, for instance using **white space** as a separator.

In [None]:
sentence = "I'm following the white rabbit"
tokens = sentence.split(' ')
tokens

What about *I'm*: should it be two tokens or just one?

IT DEPENDS :-)!

We will see later how to more properly perform tokenization at the word level.

For now, we are interested in sentence tokenization, since our task is to classify sentences.

To do that, we use a library called NLTK (Natural Language Toolkit) and its ```sent_tokenize()``` function.


```punkt``` is a pre-trained model for sentence tokenization.

- It's based on the Punkt Sentence Tokenizer, a popular unsupervised method for sentence boundary detection.
- Punkt itself only splits text into sentences; splitting sentences into words is done by a separate NLTK function, ```word_tokenize()```.
- Language support: punkt includes models trained for multiple languages (e.g., English, German, French).
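To see why a trained model helps with sentence boundaries, here is a minimal sketch of a naive rule-based splitter. The function name `naive_sent_split` and the sample text are made up for illustration; this is not how Punkt works internally, only a baseline that shows the problem abbreviations cause.

```python
import re

def naive_sent_split(text):
    # Illustrative rule only: split after '.', '!' or '?' when followed by whitespace
    return re.split(r'(?<=[.!?])\s+', text.strip())

sample = "Mr. Rabbit was late. Alice followed him!"
pieces = naive_sent_split(sample)
print(pieces)
```

The rule wrongly cuts after the abbreviation "Mr.", which is the kind of case a sentence boundary model is meant to handle.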

In [None]:
import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt')
corpus_alice = sent_tokenize(alice)
corpus_wizard = sent_tokenize(wizard)
len(corpus_alice), len(corpus_wizard)

A **corpus** is a structured set of **Documents**.

A document can be a **sentence**, a **paragraph**, a **tweet** or a **book**.

In our case, the document is a **sentence**, so each book is a set of **sentences** and may itself be considered a corpus...

Its plural???


**CORPORA**

In [None]:
corpus_alice[2]

This sentence includes ```\n```: the sentence tokenizer only splits sentences, it doesn't clean the text.

In [None]:
corpus_wizard[30]

The same happens here, and we also have `"`, the quotation mark.

Remember, we need to create CSV files with one sentence per line. Therefore, we need to:
- Clean the line breaks
- Define a quote char to wrap each sentence, so that the original commas or semicolons are not misinterpreted as separation chars of the CSV
- Add a second column (label) to identify the source

Then we will concatenate everything and shuffle the sentences before training a model.

We expect to have something like this:
```
\"There's a cyclone coming, Em," he called to his wife.\,wizoz10-1740.txt
```
Here the **escape** character "**\\**" is the quote char, because it is not present in either of the (novel) books.

Different choices will have to be made when we deal with code data, where the backslash is common.
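The quoting scheme can be tried out with Python's standard `csv` module. This is only a sketch on an in-memory buffer, not the function used in this notebook; the variable names are illustrative.

```python
import csv
import io

sentence = '"There\'s a cyclone coming, Em," he called to his wife.'
source = 'wizoz10-1740.txt'

# Write one header row and one data row, using the backslash as quote char
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=',', quotechar='\\',
                    quoting=csv.QUOTE_MINIMAL, lineterminator='\n')
writer.writerow(['sentence', 'source'])
writer.writerow([sentence, source])
csv_text = buffer.getvalue()
print(csv_text)

# Read it back: commas and double quotes inside the sentence survive intact
rows = list(csv.reader(io.StringIO(csv_text), delimiter=',', quotechar='\\'))
print(rows[1])
```

Because the sentence is wrapped in backslashes, its inner commas are not treated as column separators when the file is read back.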

In [None]:
def sentence_tokenize(source, quote_char='\\', sep_char=',',
                      include_header=True, include_source=True,
                      extensions=('txt',), **kwargs):
    nltk.download('punkt')
    # If source is a folder, goes through all files inside it
    # that match the desired extensions ('txt' by default)
    if os.path.isdir(source):
        filenames = [f for f in os.listdir(source)
                     if os.path.isfile(os.path.join(source, f)) and
                        os.path.splitext(f)[1][1:] in extensions]
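    elif isinstance(source, str):
        # A single file: keep its folder as source so that the
        # os.path.join(source, ...) calls below still work
        source, fname = os.path.split(source)
        filenames = [fname]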

    # If there is a configuration file, builds a dictionary with
    # the corresponding start and end lines of each text file
    config_file = os.path.join(source, 'lines.cfg')
    config = {}
    if os.path.exists(config_file):
        with open(config_file, 'r') as f:
            rows = f.readlines()

        for r in rows[1:]:
            fname, start, end = r.strip().split(',')
            config.update({fname: (int(start), int(end))})

    new_fnames = []
    # For each file of text
    for fname in filenames:
        # If there's a start and end line for that file, use it
        try:
            start, end = config[fname]
        except KeyError:
            start = None
            end = None

        # Opens the file, slices the configured lines (if any),
        # cleans line breaks and uses the sentence tokenizer
        with open(os.path.join(source, fname), 'r') as f:
            contents = (''.join(f.readlines()[slice(start, end, None)])
                        .replace('\n', ' ').replace('\r', ''))
        corpus = sent_tokenize(contents, **kwargs)

        # Builds a CSV file containing tokenized sentences
        base = os.path.splitext(fname)[0]
        new_fname = f'{base}.sent.csv'
        new_fname = os.path.join(source, new_fname)
        with open(new_fname, 'w') as f:
            # Header of the file
            if include_header:
                if include_source:
                    f.write('sentence,source\n')
                else:
                    f.write('sentence\n')
            # Writes one line for each sentence
            for sentence in corpus:
                if include_source:
                    f.write(f'{quote_char}{sentence}{quote_char}{sep_char}{fname}\n')
                else:
                    f.write(f'{quote_char}{sentence}{quote_char}\n')
        new_fnames.append(new_fname)

    # Returns list of the newly generated CSV files
    return sorted(new_fnames)

The function above:
- takes a folder and goes through the files with the right extension
- keeps only the lines listed in a ```.cfg``` file (if any)
- applies the sentence tokenizer to each file
- generates the corresponding CSV files of sentences using the configured ```quote_char``` and ```sep_char```
- names the CSV file for each original file by dropping the extension and appending ```.sent.csv``` to it.

In [None]:
new_fnames = sentence_tokenize(localfolder)
new_fnames

Each CSV file contains the sentences of a book and we will use them to build our dataset (concatenating and shuffling)

## HuggingFace's Dataset

## Loading a Dataset

Instead of the regular PyTorch ```Dataset```, we use the **Hugging Face** ```Dataset```.

To accomplish our task, we will use a pre-trained model from Hugging Face.

### Hugging Face

Hugging Face is a company and open-source platform focused on making advanced machine learning, particularly natural language processing (NLP), more accessible and easier to use. It’s widely recognized for its transformer library and model hub, which provide pre-trained models and tools to streamline the use of state-of-the-art AI models in applications.

Here are the key aspects of Hugging Face:

1. Transformers Library
Hugging Face’s transformers library is one of the most popular libraries for NLP and beyond, offering access to hundreds of pre-trained models like BERT, GPT, RoBERTa, T5, and many others.
It allows users to load and fine-tune these models for tasks like text classification, translation, summarization, question answering, text generation, and more.
The library supports multiple deep learning frameworks, primarily PyTorch and TensorFlow, and includes APIs that make it easy to switch between them.

2. Model Hub
The Hugging Face Model Hub is a repository and sharing platform for machine learning models. It hosts thousands of pre-trained models contributed by both Hugging Face and the community.
Users can find, share, and download models for various tasks (not limited to NLP, but also including computer vision, audio, and more).
The hub allows seamless model access via model IDs that can be loaded directly into the transformers library, enabling rapid experimentation and deployment.

3. Datasets Library
Hugging Face’s datasets library provides a large collection of datasets for NLP, computer vision, and audio tasks. It includes popular datasets like IMDB, SQuAD, and ImageNet.
It supports streaming, efficient data handling, and pre-processing tools, making it easier to train models on large datasets.

In [None]:
from datasets import load_dataset, Split

dataset = load_dataset(path='csv', data_files=new_fnames, quotechar='\\', split=Split.TRAIN)

In [None]:
type(dataset)

What have we done?

We created a Hugging Face dataset, specifying:
1. the type of loading script used to process the data, ```path='csv'``` (yes, the argument name is a bit misleading)
2. the files to load, ```data_files=new_fnames```
3. the quote char, ```quotechar='\\'```
4. the split we are generating (TRAIN)

### Attributes

In [None]:
dataset.features, dataset.num_columns, dataset.shape

Our dataset contains **2** columns (sentence, source) and there are 3852 sentences in it.

In [None]:
dataset[2]

In [None]:
dataset['source'][:3]  # we didn't shuffle yet

### Methods of ```Dataset```
- unique, to compute for instance, the unique sources
- map, to create new columns by using a function

In [None]:
dataset.unique('source')

In [None]:

def is_alice_label(row):
    is_alice = int(row['source'] == 'alice28-1476.txt')
    return {'labels': is_alice}

dataset = dataset.map(is_alice_label)

We create a new column for the label. Note that ```map``` returns a new dataset, which is why we assign the result back to ```dataset```.

In [None]:
dataset[2]

In [None]:
shuffled_dataset = dataset.shuffle(seed=42)

In [None]:
split_dataset = shuffled_dataset.train_test_split(test_size=0.2)
split_dataset

In [None]:
train_dataset, test_dataset = split_dataset['train'], split_dataset['test']

In [None]:
train_dataset.features

# Word Tokenization

Now we have the dataset structured with all the information we need.

But the basic building blocks of NLP tasks are words, and we need to process them...

We have already seen a simple word tokenizer that uses white space as a separator, but it doesn't work as well as we want.

In [None]:
sentence = "I'm following the white rabbit"
tokens = sentence.split(' ')
tokens

We can use another library, ```gensim```, to help us with this task.

We could use the NLTK tokenizer, but Gensim includes interesting tools that we will use later. It doesn't have a sentence tokenizer, though... and it is also a way to get to know different libraries :-)

In [None]:
from gensim.parsing.preprocessing import *

preprocess_string(sentence)

```def preprocess_string(s, filters=DEFAULT_FILTERS)```
Apply list of chosen filters to s.

Default list of filters:
```
~gensim.parsing.preprocessing.strip_tags,
~gensim.parsing.preprocessing.strip_punctuation,
~gensim.parsing.preprocessing.strip_multiple_whitespaces,
~gensim.parsing.preprocessing.strip_numeric,
~gensim.parsing.preprocessing.remove_stopwords,
~gensim.parsing.preprocessing.strip_short,
~gensim.parsing.preprocessing.stem_text.
```
**Parameters**
s : str
filters: list of functions, optional

**Returns**
list of str
    Processed strings (cleaned).



We keep the list of filters limited: lower case, tags, punctuation, multiple white spaces and numeric.

You can play with the filters on your own and check the results...

In [None]:
filters = [lambda x: x.lower(), strip_tags, strip_punctuation, strip_multiple_whitespaces, strip_numeric]
preprocess_string(sentence, filters=filters)

Another option is the ```simple_preprocess``` function, which by default limits the processing to:
- lower case
- remove tokens that are too short (fewer than 2 chars)
- remove tokens that are too long (more than 15 chars)
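The effect of these rules can be approximated in a few lines of plain Python. This is a rough sketch, not gensim's implementation: the function name `rough_simple_preprocess` is hypothetical, and the length bounds are passed as parameters (2 and 15 are assumed here as defaults).

```python
import re

def rough_simple_preprocess(text, min_len=2, max_len=15):
    # Lowercase, then keep runs of word characters that are not digits
    # (this also drops punctuation such as the apostrophe in "I'm")
    candidates = re.findall(r'[^\W\d]+', text.lower())
    # Discard tokens that are too short or too long
    return [t for t in candidates if min_len <= len(t) <= max_len]

rough_tokens = rough_simple_preprocess("I'm following the white rabbit")
print(rough_tokens)
```

Note how "I'm" disappears entirely: it is split into "i" and "m", and both pieces are too short to be kept.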

In [None]:
from gensim.utils import simple_preprocess

tokens = simple_preprocess(sentence)
tokens

## Vocabulary

Once we have the tokens, i.e. a list of words for each sentence, we can build our vocabulary: the list of unique words that appear in the text corpora.

In [None]:
sentences = train_dataset['sentence'] ## we take the column sentence
tokens = [simple_preprocess(sent) for sent in sentences] # for each sentence we extract the tokens
tokens[0]

In [None]:
'''
Once we have the tokens we can build our vocabulary, a list of unique words that appear in the text corpora.
'''

from gensim import corpora

dictionary = corpora.Dictionary(tokens)
print(dictionary)

The Dictionary object computes some specific attributes

In [None]:
dictionary.num_docs

In [None]:
dictionary.num_pos # processed words

In [None]:
dictionary.token2id

In [None]:
vocab = list(dictionary.token2id.keys())
vocab[:5]

In [None]:
#collection frequencies, how many times a given token appears in the corpora
dictionary.cfs

In [None]:
# in how many documents a token appears
dictionary.dfs

In [None]:
'''
We can convert a list of tokens into a list of their corresponding indices in
the vocabulary.
'''


sentence = 'follow the white rabbit'
new_tokens = simple_preprocess(sentence)
ids = dictionary.doc2idx(new_tokens)
print(new_tokens)
print(ids)

No matter how large our vocabulary is, there will always be a word that isn't included...

Therefore, we use special tokens:

- ```[UNK]``` for unknown words
- ```[PAD]``` to pad the short sentences

We can add them to our vocabulary.
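As a minimal sketch of what the two special tokens are for, the snippet below encodes a list of tokens with a toy vocabulary. The vocabulary, its ids, and the helper `encode_with_specials` are hypothetical and only serve the illustration.

```python
# Hypothetical toy vocabulary; the ids are made up for illustration
toy_token2id = {'[PAD]': 0, '[UNK]': 1, 'follow': 2, 'the': 3, 'white': 4, 'rabbit': 5}

def encode_with_specials(tokens, token2id, max_len):
    unk_id, pad_id = token2id['[UNK]'], token2id['[PAD]']
    # Words missing from the vocabulary are mapped to [UNK]
    ids = [token2id.get(tok, unk_id) for tok in tokens][:max_len]
    # Short sequences are filled with [PAD] up to max_len
    return ids + [pad_id] * (max_len - len(ids))

toy_ids = encode_with_specials(['follow', 'the', 'white', 'rabbit', 'neo'],
                               toy_token2id, max_len=8)
print(toy_ids)
```

The unknown word "neo" becomes id 1, and the three trailing zeros are padding.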

In [None]:
special_tokens = {'[PAD]': 0, '[UNK]': 1}
dictionary.patch_with_special_tokens(special_tokens)

Again, we could also be interested in removing rare terms as well as 'bad words'.

Gensim has a couple of methods that can help on that:
- ```filter_extremes()```
- ```filter_tokens()```




In [None]:
def get_rare_ids(dictionary, min_freq):
    rare_ids = [t[0] for t in dictionary.cfs.items() if t[1] < min_freq]
    return rare_ids

In [None]:
'''
The code below wraps everything together:
it takes a list of sentences, generates the corresponding vocabulary
and saves it
'''

def make_vocab(sentences, folder=None, special_tokens=None, vocab_size=None, min_freq=None):
    if folder is not None:
        if not os.path.exists(folder):
            os.mkdir(folder)

    # tokenizes the sentences and create a Dictionary
    tokens = [simple_preprocess(sent) for sent in sentences]
    dictionary = corpora.Dictionary(tokens)
    # keeps only the most frequent words (vocab size); note that
    # filter_extremes also applies its default no_below/no_above thresholds
    if vocab_size is not None:
        dictionary.filter_extremes(keep_n=vocab_size)
    # removes rare words (in case the vocab size still
    # includes words with low frequency)
    if min_freq is not None:
        rare_tokens = get_rare_ids(dictionary, min_freq)
        dictionary.filter_tokens(bad_ids=rare_tokens)
    # gets the whole list of tokens and frequencies
    items = dictionary.cfs.items()
    # sorts the tokens in descending order
    words = [dictionary[t[0]] for t in sorted(dictionary.cfs.items(), key=lambda t: -t[1])]
    # prepends special tokens, if any
    if special_tokens is not None:
        to_add = []
        for special_token in special_tokens:
            if special_token not in words:
                to_add.append(special_token)
        words = to_add + words

    with open(os.path.join(folder, 'vocab.txt'), 'w') as f:
        for word in words:
            f.write(f'{word}\n')

In [None]:
make_vocab(train_dataset['sentence'], 'our_vocab/', special_tokens=['[PAD]', '[UNK]', '[SEP]', '[CLS]', '[MASK]'], min_freq=2)

## HuggingFace's Tokenizer

Since we are going to use BERT, we will use the corresponding tokenizer, ```BertTokenizer```.

It standardizes, in some sense, the input for BERT.
It holds the same information as gensim's dictionary, i.e., the mapping between tokens and their ids, but it includes much more information.

In [None]:
'''
It takes a vocabulary as input
'''

from transformers import BertTokenizer

tokenizer = BertTokenizer('our_vocab/vocab.txt')

In [None]:
new_sentence = 'follow the white rabbit neo'
new_tokens = tokenizer.tokenize(new_sentence)
new_tokens

In [None]:
new_ids = tokenizer.convert_tokens_to_ids(new_tokens)
new_ids

In [None]:
new_ids = tokenizer.encode(new_sentence)
new_ids

In [None]:
tokenizer.convert_ids_to_tokens(new_ids)

In Transformer-based models, particularly BERT (Bidirectional Encoder Representations from Transformers), the [CLS] token stands for Classification token.

**Purpose of the [CLS] Token**

**Representation**: The [CLS] token is a special token added to the beginning of every input sequence in models like BERT. During training, the model learns to treat the embedding of this token as a representation of the entire sequence.

**Classification Tasks**: For tasks that involve classification (e.g., sentiment analysis, sentence classification, or Next Sentence Prediction), the final hidden state of the [CLS] token (after processing through all transformer layers) is used as a summary of the entire sequence.

**General-Purpose Embedding**: It’s often considered the “pooled” output, as it attempts to capture information from the whole sequence in a single vector. This embedding can then be passed to a classifier or other downstream layers for decision-making.


In [None]:
tokenizer.encode(new_sentence, add_special_tokens=False)

In [None]:
tokenizer(new_sentence, add_special_tokens=False, return_tensors='pt')

In [None]:
sentence1 = 'follow the white rabbit neo'
sentence2 = 'no one can be told what the matrix is'
joined_sentences = tokenizer(sentence1, sentence2)
joined_sentences

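Three pieces of information:
- ```input_ids```: OK, we know them!
- ```token_type_ids```: works as a sentence index (0 for the first sentence, 1 for the second)
- ```attention_mask```: corresponds to our source mask, marking real tokens with 1 and padding with 0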
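To make the three outputs concrete, here is a hand-written sketch of how a BERT-style pair encoding is laid out. It works on token strings instead of ids for readability; `encode_pair` and `pad_to` are hypothetical helpers, not part of the tokenizer's API.

```python
def encode_pair(tokens_a, tokens_b=None):
    # [CLS] first sentence [SEP] (second sentence [SEP])
    tokens = ['[CLS]'] + tokens_a + ['[SEP]']
    token_type_ids = [0] * len(tokens)
    if tokens_b is not None:
        tokens = tokens + tokens_b + ['[SEP]']
        token_type_ids = token_type_ids + [1] * (len(tokens_b) + 1)
    attention_mask = [1] * len(tokens)
    return tokens, token_type_ids, attention_mask

def pad_to(encoded, length):
    tokens, type_ids, mask = encoded
    extra = length - len(tokens)
    # Padding positions get attention_mask 0 so the model can ignore them
    return tokens + ['[PAD]'] * extra, type_ids + [0] * extra, mask + [0] * extra

pair = encode_pair(['follow', 'the', 'rabbit'], ['what', 'is', 'it'])
single = pad_to(encode_pair(['follow', 'the', 'rabbit']), length=len(pair[0]))
print(pair)
print(single)
```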

In [None]:

print(tokenizer.convert_ids_to_tokens(joined_sentences['input_ids']))

In [None]:
separate_sentences = tokenizer([sentence1, sentence2], padding=True)
separate_sentences

In [None]:
print(tokenizer.convert_ids_to_tokens(separate_sentences['input_ids'][0]))
print(separate_sentences['attention_mask'][0])

In [None]:
first_sentences = [sentence1, 'another first sentence']
second_sentences = [sentence2, 'a second sentence here']
batch_of_pairs = tokenizer(first_sentences, second_sentences)
first_input = tokenizer.convert_ids_to_tokens(batch_of_pairs['input_ids'][0])
second_input = tokenizer.convert_ids_to_tokens(batch_of_pairs['input_ids'][1])
print(first_input)
print(second_input)

In [None]:
tokenized_dataset = tokenizer(dataset['sentence'],
                              padding=True,
                              return_tensors='pt',
                              max_length=50,
                              truncation=True)
tokenized_dataset['input_ids']

In reality, BERT uses vectors to represent the words, using a big look-up table with the token ids as indices.

This reminds us of... **EMBEDDINGS**, in this case **WORD EMBEDDINGS**: a representation of each token (word) as a vector of numbers. The size of the vector is the dimensionality of the embeddings.

We can build the embeddings by hand or we can learn them... so let's start a beautiful trip into that.

# Before Word Embeddings

## One-Hot Encoding (OHE)

![](https://github.com/dvgodoy/PyTorchStepByStep/blob/master/images/ohe1.png?raw=1)

![](https://github.com/dvgodoy/PyTorchStepByStep/blob/master/images/ohe2.png?raw=1)

As you can imagine, a representation like this has some issues:
- large
- sparse (lots of zeros!!!)
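A minimal sketch of one-hot encoding on a five-word toy vocabulary (the names `toy_vocab`, `one_hot` and `dot` are illustrative). It also shows a further consequence of this representation: the dot product between the vectors of two different words is always zero.

```python
toy_vocab = ['the', 'small', 'is', 'barking', 'dog']

def one_hot(word, vocab):
    # A vector as long as the vocabulary, with a single 1 at the word's position
    return [1 if w == word else 0 for w in vocab]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

dog = one_hot('dog', toy_vocab)
small = one_hot('small', toy_vocab)
print(dog)
print(dot(dog, small), dot(dog, dog))
```

With a realistic vocabulary of tens of thousands of words, each vector would have that many entries and all but one would be zero.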


## Bag of Words (BoW)

Of course, we can do better. We could, for instance, sum up the corresponding OHE vectors, disregarding any underlying structure of relationships between the words.

The result is the counts of the words appearing in the text
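The equivalence between "summing one-hot vectors" and "counting words" can be checked with a short sketch. The word list and vocabulary are toy values; the sparse form at the end only mimics the (id, count) pairs that a bag-of-words representation typically stores.

```python
from collections import Counter

toy_words = ['the', 'white', 'rabbit', 'is', 'rabbit']
toy_bow_vocab = sorted(set(toy_words))  # ['is', 'rabbit', 'the', 'white']

# One one-hot vector per word, then a column-wise sum
one_hots = [[int(w == v) for v in toy_bow_vocab] for w in toy_words]
summed = [sum(column) for column in zip(*one_hots)]

# The same information as (index, count) pairs
counts = Counter(toy_words)
sparse_bow = [(idx, counts[v]) for idx, v in enumerate(toy_bow_vocab)]

print(summed)
print(sparse_bow)
```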

In [None]:
sentence = 'the white rabbit is a rabbit'
bow_tokens = simple_preprocess(sentence)
bow_tokens

In [None]:
bow = dictionary.doc2bow(bow_tokens)
bow

This approach also has several limitations:
- it represents the frequencies, nothing else
- the representations say nothing about how similar two words are: with OHE, the vectors of any two different words are orthogonal, so their similarity is always zero

Language models, instead, try to exploit the structure and the relationships between words.

## Language Models

![](https://github.com/dvgodoy/PyTorchStepByStep/blob/master/images/blank1.png?raw=1)

It is easy to fill the gap with YOU

![](https://github.com/dvgodoy/PyTorchStepByStep/blob/master/images/fill1.png?raw=1)

What about this?

![](https://github.com/dvgodoy/PyTorchStepByStep/blob/master/images/blank2.png?raw=1)


![](https://github.com/dvgodoy/PyTorchStepByStep/blob/master/images/fill2.png?raw=1)




## N-grams

![](https://github.com/dvgodoy/PyTorchStepByStep/blob/master/images/ngrams.png?raw=1)

n-grams are based on pure statistics: they fill in the blank using the most common sequence that matches the words preceding the blank.

With a large $n$, you might get good predictions, but also many cases with no prediction at all, because the exact sequence was never seen; with a small $n$, on the contrary, you may encounter many prediction errors.

This is due to the fact that we only **look back**.

Let's try to look ahead too!
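Before looking ahead, the look-back-only n-gram idea can be sketched with simple counts. The toy corpus and the helpers `train_ngrams` and `predict_next` are made up for illustration.

```python
from collections import Counter, defaultdict

toy_corpus = [
    'nice to meet you',
    'nice to meet you',
    'nice to meet him',
    'happy to see you',
]

def train_ngrams(sentences, n):
    # Maps each context of n-1 words to a counter of the words that followed it
    table = defaultdict(Counter)
    for s in sentences:
        words = s.split()
        for i in range(len(words) - n + 1):
            context, nxt = tuple(words[i:i + n - 1]), words[i + n - 1]
            table[context][nxt] += 1
    return table

def predict_next(table, context):
    options = table.get(tuple(context))
    # No prediction at all if this exact context was never seen
    return options.most_common(1)[0][0] if options else None

trigrams = train_ngrams(toy_corpus, n=3)
bigrams = train_ngrams(toy_corpus, n=2)
print(predict_next(trigrams, ['to', 'meet']))
print(predict_next(trigrams, ['to', 'greet']))
print(predict_next(bigrams, ['to']))
```

The unseen context "to greet" yields no prediction, which is the sparsity problem that grows with $n$.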

## Continuous Bag-of-Words (CBoW)

![](https://github.com/dvgodoy/PyTorchStepByStep/blob/master/images/blank_end.png?raw=1)

![](https://github.com/dvgodoy/PyTorchStepByStep/blob/master/images/blank_center.png?raw=1)

It sums up (or averages) the vectors of the context words and uses the result to predict the central word.

The vectors are not one-hot-encoded and have continuous values.
The vectors containing the continuous values are called word embeddings.

We can learn these vectors...
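To learn them, a CBoW model needs training examples made of context words and a central target word. A minimal sketch of how such pairs could be extracted is shown below; `cbow_pairs` and the window size are assumptions for illustration.

```python
def cbow_pairs(tokens, window=2):
    pairs = []
    # Only positions with a full window on both sides are used in this sketch
    for i in range(window, len(tokens) - window):
        context = tokens[i - window:i] + tokens[i + 1:i + window + 1]
        pairs.append((context, tokens[i]))
    return pairs

toy_pairs = cbow_pairs(['the', 'small', 'dog', 'is', 'barking'], window=2)
print(toy_pairs)
```

Here the context looks both back ("the", "small") and ahead ("is", "barking") to predict the word in the middle.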


# Word Embeddings

## Word2Vec

![](https://github.com/dvgodoy/PyTorchStepByStep/blob/master/images/cbow.png?raw=1)

The **target** is the central word, therefore we deal with a multiclass classification problem, where the number of classes is the size of the vocabulary.

We use the context words, or better their embedding vectors, as input. The embeddings therefore become parameters of the model.

They are randomly initialized; then, as the training progresses, their values are updated like any other weights.

**How it works**

For each pair of context words and corresponding target, the model averages the embeddings of the context and feeds the result to a linear layer that computes one logit for each word in the vocabulary.


In [None]:
class CBOW(nn.Module):
    def __init__(self, vocab_size, embedding_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_size)
        self.linear = nn.Linear(embedding_size, vocab_size)

    def forward(self, X):
        embeddings = self.embedding(X)
        bow = embeddings.mean(dim=1)
        logits = self.linear(bow)
        return logits

In [None]:
torch.manual_seed(42)
dummy_cbow = CBOW(vocab_size=5, embedding_size=3)
dummy_cbow.embedding.state_dict()

![](https://github.com/dvgodoy/PyTorchStepByStep/blob/master/images/w2v_embed.png?raw=1)

The ```nn.Embedding``` layer is a lookup table. By default, it is randomly initialized given the size of the vocabulary and the number of dimensions.

In [None]:
# tokens: ['is', 'barking']
dummy_cbow.embedding(torch.as_tensor([2, 3]))

Let's pretend we have performed tokenization and we have the indices of our vocabulary, context and target:

In [None]:
tiny_vocab = ['the', 'small', 'is', 'barking', 'dog']
context_words = ['the', 'small', 'is', 'barking']
target_words = ['dog']

In [None]:
batch_context = torch.as_tensor([[0, 1, 2, 3]]).long()
batch_target = torch.as_tensor([4]).long()

![](https://github.com/dvgodoy/PyTorchStepByStep/blob/master/images/w2v_cbow.png?raw=1)

In [None]:
cbow_features = dummy_cbow.embedding(batch_context).mean(dim=1)
cbow_features

![](https://github.com/dvgodoy/PyTorchStepByStep/blob/master/images/w2v_logits.png?raw=1)

In [None]:
logits = dummy_cbow.linear(cbow_features)
logits
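Since this is a multiclass classification problem, the logits would typically be turned into probabilities with a softmax and scored with a cross-entropy loss. The sketch below does this in plain Python on hypothetical logit values (not the ones produced by the model above), assuming the target is the word at index 4.

```python
import math

def softmax(logits):
    # Subtracting the maximum keeps the exponentials numerically stable
    shifted = [x - max(logits) for x in logits]
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_idx):
    # Negative log-probability assigned to the correct class
    return -math.log(softmax(logits)[target_idx])

toy_logits = [0.3, -0.1, 0.2, 0.0, 1.5]  # hypothetical values, one per vocabulary word
toy_probs = softmax(toy_logits)
toy_loss = cross_entropy(toy_logits, target_idx=4)
print(toy_probs)
print(toy_loss)
```

Minimizing this loss pushes up the logit of the true central word, and the gradients flow back into the embeddings of the context words.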

## What is an Embedding Anyway?

It is a representation, and each dimension of the vector corresponds to an attribute/feature.

For instance, we can describe restaurants with 3 features.

We can represent the values as numbers in the interval [-1, 1]

... and we can compute a similarity among them, for example the cosine similarity.

![](https://github.com/dvgodoy/PyTorchStepByStep/blob/master/images/rest_discrete.png?raw=1)

![](https://github.com/dvgodoy/PyTorchStepByStep/blob/master/images/rest_continuous.png?raw=1)

In [None]:
ratings = torch.as_tensor([[.7, -.4, .7],
                           [.3, .7, -.5],
                           [.9, -.55, .8],
                           [-.3, .8, .34]]).float()
sims = torch.zeros(4, 4)
for i in range(4):
    for j in range(4):
        sims[i, j] = F.cosine_similarity(ratings[i], ratings[j], dim=0)
sims

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

fig, ax = plt.subplots(figsize=(6, 6))
sns.heatmap(sims.detach().numpy(), annot=True, ax=ax)

## Pre-trained Word2Vec

In general, the individual dimensions of pre-trained word2vec embeddings do not have a clear meaning, but we can do fancy stuff with them.

We don't know if the $n$-th dimension corresponds to a particular "behaviour" of the word.

Although word2vec is a simple model, it still requires a large amount of data to train.

Luckily, someone already did the job for us: gensim, for instance, gives access to a variety of pre-trained word embedding models.

But why different models and therefore different embeddings?
Well, using different corpora produces different embeddings, since they are influenced by the **kind of language** used in the corpora.

They also depend on the model used to learn the embeddings: word2vec is one, but there are many others.

## Global Vectors (GloVe)

GloVe: Global Vectors for Word Representation

It combines the skip-gram model with co-occurrence statistics at the **global level**.

Take it for granted; if you want to know more, you can read the paper and check [https://nlp.stanford.edu/projects/glove](https://nlp.stanford.edu/projects/glove)

There are many sizes and shapes, dimensions from 25 to 300 and vocab size between 400,000 and 2,200,000 words.

In [None]:
from gensim import downloader

glove = downloader.load('glove-wiki-gigaword-50')


len(glove.key_to_index)


In [None]:
glove['alice']

We don't know the meaning of the dimensions, but we can do math with them

We can define the queen with the following equation:
king - man + woman = queen

In [None]:
synthetic_queen = glove['king'] - glove['man'] + glove['woman']

In [None]:
fig = plot_word_vectors(glove,
                        ['king', 'man', 'woman', 'synthetic', 'queen'],
                        other={'synthetic': synthetic_queen})

In [None]:
glove.similar_by_vector(synthetic_queen, topn=5)

It's pretty common that the first result corresponds to one of the words used in the embedding arithmetic (here, king), so we can exclude it from the result... and we get queen :-)

![](https://github.com/dvgodoy/PyTorchStepByStep/blob/master/images/embed_arithmetic.png?raw=1)

$$
\Large
w_{\text{king}} - w_{\text{man}}\approx w_{\text{queen}}-w_{\text{woman}} \implies w_{\text{king}} - w_{\text{man}} + w_{\text{woman}} \approx w_{\text{queen}}
$$

It is nice, and it suggests that the embeddings are effectively capturing the meaning/semantics of the words.
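The same arithmetic can be reproduced on a toy scale. The 2-D vectors below are hand-made (the two dimensions can loosely be read as "royalty" and "gender"), so this is only an illustration of the mechanics, not of real GloVe vectors; `cosine` and `analogy` are hypothetical helpers.

```python
import math

# Hand-made toy vectors, NOT taken from any pre-trained model
toy_wv = {
    'king':  [0.9, 0.9],
    'queen': [0.9, -0.9],
    'man':   [0.1, 0.9],
    'woman': [0.1, -0.9],
    'apple': [-0.8, 0.1],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def analogy(wv, positive, negative, exclude=()):
    dims = len(next(iter(wv.values())))
    # Add the "positive" vectors and subtract the "negative" ones
    query = [sum(wv[w][d] for w in positive) - sum(wv[w][d] for w in negative)
             for d in range(dims)]
    # Rank the remaining words by cosine similarity to the synthetic vector
    ranked = sorted((w for w in wv if w not in exclude),
                    key=lambda w: cosine(query, wv[w]), reverse=True)
    return query, ranked

toy_query, toy_ranked = analogy(toy_wv, ['king', 'woman'], ['man'],
                                exclude=('king', 'man', 'woman'))
print(toy_query, toy_ranked)
```

Excluding the words used in the arithmetic mirrors what we did by hand with the GloVe results.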

## Using Word Embeddings

### Vocabulary Coverage

In [None]:

vocab = list(dictionary.token2id.keys())
len(vocab)

In [None]:

unknown_words = sorted(list(set(vocab).difference(set(glove.key_to_index))))
###########################################################
print(len(unknown_words))
print(unknown_words[:5])

In [None]:
unknown_ids = [dictionary.token2id[w] for w in unknown_words if w not in ['[PAD]', '[UNK]']]
unknown_count = np.sum([dictionary.cfs[idx] for idx in unknown_ids])
unknown_count, dictionary.num_pos

In [None]:
def vocab_coverage(gensim_dict, pretrained_wv, special_tokens=('[PAD]', '[UNK]')):
    vocab = list(gensim_dict.token2id.keys())
    unknown_words = sorted(list(set(vocab).difference(set(pretrained_wv.key_to_index))))
    ###########################################################
    unknown_ids = [gensim_dict.token2id[w] for w in unknown_words if w not in special_tokens]
    unknown_count = np.sum([gensim_dict.cfs[idx] for idx in unknown_ids])
    cov = 1 - unknown_count / gensim_dict.num_pos
    return cov

In [None]:
vocab_coverage(dictionary, glove)
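
Note that `vocab_coverage()` measures coverage over word occurrences, not over distinct words. The sketch below shows the difference on a tiny example; `corpus_tokens` and `pretrained_vocab` are hypothetical stand-ins for our dictionary and for GloVe's vocabulary.

```python
from collections import Counter

# Hypothetical toy corpus and pretrained vocabulary
corpus_tokens = ['alice', 'followed', 'the', 'white', 'rabbit',
                 'the', 'jabberwock', 'the', 'rabbit']
pretrained_vocab = {'alice', 'followed', 'the', 'white', 'rabbit'}

counts = Counter(corpus_tokens)
unknown_words = sorted(w for w in counts if w not in pretrained_vocab)

# Coverage over distinct words (types)
type_coverage = 1 - len(unknown_words) / len(counts)

# Coverage over occurrences (tokens): what vocab_coverage() computes
unknown_occurrences = sum(counts[w] for w in unknown_words)
token_coverage = 1 - unknown_occurrences / len(corpus_tokens)

print(unknown_words, type_coverage, token_coverage)
```

Rare words tend to be the unknown ones, so coverage over occurrences is usually higher than coverage over distinct words.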

### Tokenizer

In [None]:
def make_vocab_from_wv(wv, folder=None, special_tokens=None):
    # Falls back to the current directory so that the path join below never receives None
    if folder is None:
        folder = '.'
    os.makedirs(folder, exist_ok=True)

    words = wv.index_to_key
    ###########################################################
    if special_tokens is not None:
        to_add = []
        for special_token in special_tokens:
            if special_token not in words:
                to_add.append(special_token)
        words = to_add + words

    with open(os.path.join(folder, 'vocab.txt'), 'w') as f:
        for word in words:
            f.write(f'{word}\n')

In [None]:
make_vocab_from_wv(glove, 'glove_vocab/', special_tokens=['[PAD]', '[UNK]'])

In [None]:
glove_tokenizer = BertTokenizer('glove_vocab/vocab.txt')

In [None]:
glove_tokenizer.encode('alice followed the white rabbit', add_special_tokens=False)

In [None]:
len(glove_tokenizer.vocab), len(glove.vectors)

The difference is due to the two special tokens, [PAD] and [UNK], which are not part of GloVe. We can add them to our embeddings as two rows of zeros, placed first to match their positions in the vocabulary.

### Special Tokens' Embeddings

In [None]:
special_embeddings = np.zeros((2, glove.vector_size))

In [None]:
extended_embeddings = np.concatenate([special_embeddings, glove.vectors], axis=0)
extended_embeddings.shape

In [None]:
alice_idx = glove_tokenizer.encode('alice', add_special_tokens=False)
np.all(extended_embeddings[alice_idx] == glove['alice'])

## Model I - GloVe + Classifier

### Data Preparation

In [None]:
train_sentences = train_dataset['sentence']
train_labels = train_dataset['labels']

test_sentences = test_dataset['sentence']
test_labels = test_dataset['labels']

In [None]:
train_ids = glove_tokenizer(train_sentences,
                            truncation=True,
                            padding=True,
                            max_length=60,
                            add_special_tokens=False,
                            return_tensors='pt')['input_ids']
train_labels = torch.as_tensor(train_labels).float().view(-1, 1)

test_ids = glove_tokenizer(test_sentences,
                           truncation=True,
                           padding=True,
                           max_length=60,
                           add_special_tokens=False,
                           return_tensors='pt')['input_ids']
test_labels = torch.as_tensor(test_labels).float().view(-1, 1)

In [None]:
train_tensor_dataset = TensorDataset(train_ids, train_labels)
generator = torch.Generator()
train_loader = DataLoader(train_tensor_dataset, batch_size=32, shuffle=True, generator=generator)
test_tensor_dataset = TensorDataset(test_ids, test_labels)
test_loader = DataLoader(test_tensor_dataset, batch_size=32)

### Pre-Trained PyTorch Embeddings

In [None]:
extended_embeddings = torch.as_tensor(extended_embeddings).float()
torch_embeddings = nn.Embedding.from_pretrained(extended_embeddings)

In [None]:
token_ids, labels = next(iter(train_loader))
token_ids

In [None]:
token_embeddings = torch_embeddings(token_ids)
token_embeddings.shape

We used the IDs to look up the embeddings. The result has three dimensions because we have 32 sentences, each padded or truncated to 60 tokens, and every token has 50 dimensions.


In [None]:
token_embeddings.mean(dim=1)

For each sentence, we can compute an embedding as an average of the word embeddings and therefore we can use it as features for a classification algorithm
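
One detail worth keeping in mind: a plain mean also averages over the all-zero [PAD] embeddings, which pulls the result towards zero for short sentences. The sketch below illustrates this on a hypothetical two-dimensional embedding table (`embedding_table`, `mean_embedding`, and `skip_padding` are names introduced here for illustration).

```python
# Hypothetical 2-dimensional embedding table; row 0 plays the role of [PAD]
embedding_table = [
    [0.0, 0.0],   # id 0: [PAD], all zeros
    [1.0, 3.0],   # id 1
    [3.0, 1.0],   # id 2
    [2.0, 2.0],   # id 3
]

def mean_embedding(token_ids, table, skip_padding=False, pad_id=0):
    if skip_padding:
        token_ids = [idx for idx in token_ids if idx != pad_id]
    vectors = [table[idx] for idx in token_ids]
    dims = len(table[0])
    return [sum(vec[d] for vec in vectors) / len(vectors) for d in range(dims)]

padded_sentence = [1, 2, 0, 0]   # two real tokens, two [PAD] tokens
plain_mean = mean_embedding(padded_sentence, embedding_table)
masked_mean = mean_embedding(padded_sentence, embedding_table, skip_padding=True)
print(plain_mean, masked_mean)
```

The plain mean is what we compute in this notebook. Recent PyTorch versions appear to offer a `padding_idx` argument in `nn.EmbeddingBag` to leave padding out of the average, which may be worth trying.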

In [None]:
'''
we can use the PyTorch implementation, nn.EmbeddingBag
'''

boe_mean = nn.EmbeddingBag.from_pretrained(extended_embeddings, mode='mean')
boe_mean(token_ids)

### Model Configuration & Training

In [None]:
extended_embeddings = torch.as_tensor(extended_embeddings).float()
boe_mean = nn.EmbeddingBag.from_pretrained(
    extended_embeddings, mode='mean'
)
torch.manual_seed(41)
model = nn.Sequential(
    # Embeddings
    boe_mean,
    # Classifier
    nn.Linear(boe_mean.embedding_dim, 128),
    nn.ReLU(),
    nn.Linear(128, 1)
)
loss_fn = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=0.01)

In [None]:
sbs_emb = StepByStep(model, loss_fn, optimizer)
sbs_emb.set_loaders(train_loader, test_loader)
sbs_emb.train(20)

In [None]:
fig = sbs_emb.plot_losses()

In [None]:
StepByStep.loader_apply(test_loader, sbs_emb.correct)

## Model II - GloVe + Transformer

The model takes an instance of a transformer encoder, a layer of pre-trained embeddings, and the desired number of outputs.

Its `forward()` method takes mini-batches of tokenized sentences, preprocesses them, encodes them, and outputs the logits.

In [None]:
class TransfClassifier(nn.Module):
    def __init__(self, embedding_layer, encoder, n_outputs):
        super().__init__()
        self.d_model = encoder.d_model
        self.n_outputs = n_outputs
        self.encoder = encoder
        self.mlp = nn.Linear(self.d_model, n_outputs)

        self.embed = embedding_layer
        self.cls_token = nn.Parameter(torch.zeros(1, 1, self.d_model))

    def preprocess(self, X):
        # N, L -> N, L, D
        src = self.embed(X)
        # Special classifier token
        # 1, 1, D -> N, 1, D
        cls_tokens = self.cls_token.expand(X.size(0), -1, -1)
        # Concatenates CLS tokens -> N, 1 + L, D
        src = torch.cat((cls_tokens, src), dim=1)
        return src

    def encode(self, source, source_mask=None):
        # Encoder generates "hidden states"
        states = self.encoder(source, source_mask)
        # Gets state from first token only: [CLS]
        cls_state = states[:, 0]  # N, D
        return cls_state

    @staticmethod
    def source_mask(X):
        cls_mask = torch.ones(X.size(0), 1).type_as(X)
        pad_mask = torch.cat((cls_mask, X > 0), dim=1).bool()
        return pad_mask.unsqueeze(1)

    def forward(self, X):
        src = self.preprocess(X)
        # Featurizer
        cls_state = self.encode(src, self.source_mask(X))
        # Classifier
        out = self.mlp(cls_state) # N, outputs
        return out
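
The `source_mask()` method relies on [PAD] having id 0, which holds here because we placed the special tokens first in the vocabulary. A minimal pure-Python sketch of the same logic, using hypothetical token ids:

```python
def source_mask(batch_ids, pad_id=0):
    # One row per sentence: the prepended [CLS] is always visible (True),
    # followed by True for real tokens and False for padding
    return [[True] + [token_id != pad_id for token_id in sentence]
            for sentence in batch_ids]

# Hypothetical token ids, padded with zeros to the same length
batch_ids = [[12, 7, 431, 0, 0],
             [5, 98, 3, 44, 21]]
mask = source_mask(batch_ids)
for row in mask:
    print(row)
```

Each row is one element longer than the sentence because of the extra position for the classifier token.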

In [None]:
torch.manual_seed(33)
# Loads the pretrained GloVe embeddings into an embedding layer
torch_embeddings = nn.Embedding.from_pretrained(extended_embeddings)
# Creates a Transformer Encoder
layer = EncoderLayer(n_heads=2, d_model=torch_embeddings.embedding_dim, ff_units=128)
encoder = EncoderTransf(layer, n_layers=1)
# Uses both layers above to build our model
model = TransfClassifier(torch_embeddings, encoder, n_outputs=1)
loss_fn = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-4)

In [None]:
sbs_transf = StepByStep(model, loss_fn, optimizer)
sbs_transf.set_loaders(train_loader, test_loader)
sbs_transf.train(10)

In [None]:
fig = sbs_transf.plot_losses()

In [None]:
StepByStep.loader_apply(test_loader, sbs_transf.correct)

### Visualizing Attention

In [None]:
sentences = ['The white rabbit and Alice ran away', 'The lion met Dorothy on the road']
inputs = glove_tokenizer(sentences, add_special_tokens=False, return_tensors='pt')['input_ids']
inputs = inputs.to(sbs_transf.device)
inputs

In [None]:
sbs_transf.model.eval()
out = sbs_transf.model(inputs)
# our model outputs logits, so we turn them into probs
torch.sigmoid(out)

In [None]:
alphas = sbs_transf.model.encoder.layers[0].self_attn_heads.alphas
alphas[:, :, 0, :].squeeze()

In [None]:
tokens = [['[CLS]'] + glove_tokenizer.tokenize(sent) for sent in sentences]
fig = plot_attention(tokens, alphas)

# Contextual Word Embeddings

## ELMo

The word *watch* has a different meaning in each of the two sentences below: it is a noun in the first and a verb in the second. A single word embedding is probably not enough... we need to consider the context, the sentence itself, to represent the word.

These are called contextual word embeddings. There is no look-up table for every combination of word and context: the embeddings are the outputs of a model :-)

ELMo also takes the context into account.

It is a two-layer bidirectional LSTM encoder using 4096 dimensions for its cell states.

The representations are character-based, so it can easily handle unknown words.

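Flair is yet another NLP library built on top of PyTorch. It offers word embeddings and document embeddings for ELMo and BERT, as well as GloVe. Since AllenNLP, which Flair relied on for ELMo, has been archived (see the install cell at the top), the cells below use Flair's own contextual embeddings (```FlairEmbeddings```) instead.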

In [None]:
watch1 = """
The Hatter was the first to break the silence. `What day of the month is it?' he said, turning to Alice:  he had taken his watch out of his pocket, and was looking at it uneasily, shaking it every now and then, and holding it to his ear.
"""

watch2 = """
Alice thought this a very curious thing, and she went nearer to watch them, and just as she came up to them she heard one of them say, `Look out now, Five!  Don't go splashing paint over me like that!
"""

sentences = [watch1, watch2]

In [None]:
from flair.data import Sentence

flair_sentences = [Sentence(s) for s in sentences]
flair_sentences[0]

In [None]:
flair_sentences[0].get_token(32)

In [None]:
flair_sentences[0].tokens[31]

In [None]:
from flair.embeddings import FlairEmbeddings
flair_emb = FlairEmbeddings('news-forward')


In [None]:
flair_emb.embed(flair_sentences)


In [None]:
token_watch1 = flair_sentences[0].tokens[31]
token_watch2 = flair_sentences[1].tokens[13]
token_watch1, token_watch2

In [None]:
token_watch1.embedding, token_watch2.embedding

In [None]:
similarity = nn.CosineSimilarity(dim=0, eps=1e-6)
similarity(token_watch1.embedding, token_watch2.embedding)

In [None]:
def get_embeddings(embeddings, sentence):
    sent = Sentence(sentence)
    embeddings.embed(sent)
    return torch.stack([token.embedding for token in sent.tokens]).float()

In [None]:
get_embeddings(flair_emb, watch1)


## GloVe

In [None]:
from flair.embeddings import WordEmbeddings
glove_embedding = WordEmbeddings('glove')

In [None]:
new_flair_sentences = [Sentence(s) for s in sentences]
glove_embedding.embed(new_flair_sentences)

In [None]:
torch.all(new_flair_sentences[0].tokens[31].embedding == new_flair_sentences[1].tokens[13].embedding)

## BERT

In [None]:
from flair.embeddings import TransformerWordEmbeddings
bert_flair = TransformerWordEmbeddings('bert-base-uncased', layers='-1')

In [None]:
embed1 = get_embeddings(bert_flair, watch1)
embed2 = get_embeddings(bert_flair, watch2)
embed2

In [None]:
bert_watch1 = embed1[31]
bert_watch2 = embed2[13]
bert_watch1, bert_watch2

In [None]:
similarity = nn.CosineSimilarity(dim=0, eps=1e-6)
similarity(bert_watch1, bert_watch2)

## Document Embeddings

In [None]:
documents = [Sentence(watch1), Sentence(watch2)]

In [None]:
from flair.embeddings import TransformerDocumentEmbeddings
bert_doc = TransformerDocumentEmbeddings('bert-base-uncased')
bert_doc.embed(documents)

In [None]:
documents[0].embedding

In [None]:
documents[0].tokens[31].embedding

In [None]:
def get_embeddings(embeddings, sentence):
    sent = Sentence(sentence)
    embeddings.embed(sent)
    if len(sent.embedding):
        return sent.embedding.float()
    else:
        return torch.stack([token.embedding for token in sent.tokens]).float()

In [None]:
get_embeddings(bert_doc, watch1)

## Model III - Preprocessing Embeddings

We need to apply ```get_embeddings()``` to every sentence, so we can use the ```map()``` method of the HF dataset. We also need the resulting embeddings to be PyTorch tensors.

### Data Preparation

In [None]:
train_dataset_doc = train_dataset.map(lambda row: {'embeddings': get_embeddings(bert_doc, row['sentence'])})
test_dataset_doc = test_dataset.map(lambda row: {'embeddings': get_embeddings(bert_doc, row['sentence'])})

In [None]:
train_dataset_doc.set_format(type='torch', columns=['embeddings', 'labels'])
test_dataset_doc.set_format(type='torch', columns=['embeddings', 'labels'])

In [None]:
train_dataset_doc['embeddings']

In [None]:
train_dataset_doc = TensorDataset(train_dataset_doc['embeddings'].float(),
                                  train_dataset_doc['labels'].view(-1, 1).float())
generator = torch.Generator()
train_loader = DataLoader(train_dataset_doc, batch_size=32, shuffle=True, generator=generator)

test_dataset_doc = TensorDataset(test_dataset_doc['embeddings'].float(),
                                 test_dataset_doc['labels'].view(-1, 1).float())
test_loader = DataLoader(test_dataset_doc, batch_size=32, shuffle=True)

### Model Configuration & Training

In [None]:
torch.manual_seed(41)
model = nn.Sequential(
    # Classifier
    nn.Linear(bert_doc.embedding_length, 3),
    nn.ReLU(),
    nn.Linear(3, 1)
)
loss_fn = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

In [None]:
sbs_doc_emb = StepByStep(model, loss_fn, optimizer)
sbs_doc_emb.set_loaders(train_loader, test_loader)
sbs_doc_emb.train(20)

In [None]:
fig = sbs_doc_emb.plot_losses()

In [None]:
StepByStep.loader_apply(test_loader, sbs_doc_emb.correct)

# BERT

**B**idirectional **E**ncoder **R**epresentations from **T**ransformers $→$ **BERT**

It is a model based on a **transformer encoder**.

It was introduced in a paper titled *BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding* (2019).

Some numbers to give you an idea:
it was trained on huge corpora: BookCorpus (800M words from 11,038 unpublished books) and English Wikipedia (2.5B words).

The base version has 12 layers, 12 attention heads, and 768 hidden dimensions, for a total of 110 million parameters.

What does this mean? That we, as individual users, don't have the computational resources to train this kind of model.

HuggingFace is at our disposal, and there are many different versions of BERT available.

What do we want to do now?

USE A PRE-TRAINED VERSION OF BERT, FINE-TUNE IT FOR OUR PURPOSES, AND EVALUATE IT AS OUR SENTENCE CLASSIFIER

In [None]:
'''
if you want to try different models without having to import their
corresponding classes, you can use HuggingFace's AutoModel

It infers the correct model class based on the name of the model you are loading
'''
from transformers import AutoModel
auto_model = AutoModel.from_pretrained('bert-base-uncased')
print(auto_model.__class__)



In [None]:
'''
Or you can import the model class directly
'''

from transformers import BertModel
bert_model = BertModel.from_pretrained('bert-base-uncased')

In [None]:
bert_model.config

We are able to recognize some of these parameters, right?
hidden_size, num_attention_heads, num_hidden_layers...

Some of them will be explained in a few minutes.

But first of all, our model needs to receive inputs and these inputs need to be **TOKENIZED**


## Tokenization

We can consider the tokenization as a pre-processing step, and since we are going to use a pre-trained BERT model, we need to use the same tokenizer that was used during the pre-training.

So, in HF each pre-trained model has its own pre-trained tokenizer as well.

Let's create our BERT tokenizer...

In [None]:
from transformers import BertTokenizer
bert_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
len(bert_tokenizer.vocab)

Only ```30522``` tokens...
In reality, these are not exactly words: they may also be **word pieces**.


Before, for words not belonging to our vocabulary, we used the special token ```[UNK]```.
This approach loses some information, since all the unknown words are replaced with the same token.

The approach used here is a little bit different.
We disassemble an unknown word into its components. For instance, the word ```inexplicably``` can be disassembled into five word pieces:

```inexplicably``` $→$ ```in + ##ex + ##pl + ##ica + ##bly```  

Every word piece after the first is prefixed with ```##``` to indicate that it doesn't stand on its own as a word.

Therefore, an unknown word becomes a concatenation of **word-pieces**
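
The splitting follows a greedy longest-match-first rule: at each position, take the longest piece found in the vocabulary and continue from where it ends. Below is a minimal sketch of that rule; `toy_vocab` is a hypothetical vocabulary just large enough for the example, and the real tokenizer adds more steps (lower-casing, punctuation splitting, and so on).

```python
def wordpiece_tokenize(word, vocab, unk_token='[UNK]'):
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        current = None
        # Tries the longest possible substring first, shrinking from the right
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = '##' + candidate  # continuation piece
            if candidate in vocab:
                current = candidate
                break
            end -= 1
        if current is None:
            return [unk_token]  # no piece matches: the whole word is unknown
        pieces.append(current)
        start = end
    return pieces

# Hypothetical toy vocabulary (NOT the real BERT vocabulary)
toy_vocab = {'in', '##ex', '##pl', '##ica', '##bly', 'rabbit', 'the'}
pieces = wordpiece_tokenize('inexplicably', toy_vocab)
print(pieces)
```

A word falls back to ```[UNK]``` only if some part of it cannot be matched by any piece, which is much rarer than a whole word being absent from the vocabulary.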

In [None]:
sentence1 = 'Alice is inexplicably following the white rabbit'
sentence2 = 'Follow the white rabbit, Neo'
tokens = bert_tokenizer(sentence1, sentence2, return_tensors='pt')
tokens

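- input_ids contains the token IDs
- token_type_ids contains the sentence index of each token (0 for the first sentence, 1 for the second)
- attention_mask marks the tokens the model should attend to (1), as opposed to padding (0)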

In [None]:
'''
We take the ids (input_ids)
and we convert them to the corresponding word pieces (tokens)
'''

print(bert_tokenizer.convert_ids_to_tokens(tokens['input_ids'][0]))

- [CLS] at the start, the classifier token
- [SEP] between the two sentences and at the end
- inexplicably got disassembled into word pieces
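
The layout of a sentence pair can be sketched in a few lines. This is only an illustration working on already split tokens (`build_pair_inputs` is a hypothetical helper, and a real tokenizer returns ids rather than strings):

```python
def build_pair_inputs(tokens_a, tokens_b):
    # [CLS] A [SEP] B [SEP]
    tokens = ['[CLS]'] + tokens_a + ['[SEP]'] + tokens_b + ['[SEP]']
    # Segment 0 covers [CLS], sentence A and its [SEP]; segment 1 covers the rest
    token_type_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    attention_mask = [1] * len(tokens)  # no padding in this example
    return tokens, token_type_ids, attention_mask

tokens_a = ['alice', 'follows', 'the', 'white', 'rabbit']
tokens_b = ['follow', 'the', 'white', 'rabbit', 'neo']
pair_tokens, pair_types, pair_mask = build_pair_inputs(tokens_a, tokens_b)
print(pair_tokens)
print(pair_types)
```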

In [None]:
'''
As for the model you can use the AutoTokenizer
to try different tokenizers without importing their classes
'''

from transformers import AutoTokenizer
auto_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
print(auto_tokenizer.__class__)

## Input Embeddings

Once the sentences are tokenized, we can use their tokens' IDs to look up the corresponding embeddings, as usual.

1. BERT is a transformer encoder, so it needs positional information. To provide it, BERT uses **position embeddings**.
   a. The position encoding we used before had fixed values for each position. The **position embeddings**, instead, are learned by the model, like any other embedding layer. The number of entries is defined by the maximum length of the sequence (```max_position_embeddings``` in the configuration above).

2. BERT adds a third embedding, the **segment embedding**, which works like a position embedding at the sentence level.

**Original Design of BERT**

- BERT was designed to handle tasks involving one sentence or two sentences.
- For single-sentence tasks (e.g., sentiment analysis), the input is just one sentence.
- For tasks requiring two sentences (e.g., next sentence prediction or sentence-pair classification), the input consists of two segments: Sentence A [SEP] Sentence B.


![](https://github.com/dvgodoy/PyTorchStepByStep/blob/master/images/bert_input_embed.png?raw=1)

In [None]:
input_embeddings = bert_model.embeddings
input_embeddings

In [None]:
token_embeddings = input_embeddings.word_embeddings
token_embeddings

30522 entries, and 768 hidden dimensions

In [None]:
input_token_emb = token_embeddings(tokens['input_ids'])
input_token_emb,input_token_emb.shape

In [None]:
position_embeddings = input_embeddings.position_embeddings
position_embeddings

In [None]:
position_ids = torch.arange(512).expand((1, -1))
position_ids

In [None]:
seq_length = tokens['input_ids'].size(1)
input_pos_emb = position_embeddings(position_ids[:, :seq_length])
input_pos_emb,input_pos_emb.shape

In [None]:
segment_embeddings = input_embeddings.token_type_embeddings
segment_embeddings

In [None]:
input_seg_emb = segment_embeddings(tokens['token_type_ids'])
input_seg_emb

BERT adds up all three embeddings and then applies layer normalization and dropout. The cell below computes only the sum, that is, the input before those last two steps.

In [None]:
input_emb = input_token_emb + input_pos_emb + input_seg_emb
input_emb
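
The sum above leaves out layer normalization. Here is a minimal sketch of that step on a single token, assuming hypothetical four-dimensional embeddings and ignoring the learned scale and shift parameters of the real layer (dropout does nothing in evaluation mode, so it is left out as well).

```python
import math

def layer_norm(vector, eps=1e-12):
    # Standardizes the features of ONE token: zero mean, unit variance
    mean = sum(vector) / len(vector)
    variance = sum((x - mean) ** 2 for x in vector) / len(vector)
    return [(x - mean) / math.sqrt(variance + eps) for x in vector]

# Hypothetical 4-dimensional embeddings for a single token
token_emb = [0.5, -1.0, 0.25, 2.0]
position_emb = [0.1, 0.1, -0.2, 0.0]
segment_emb = [0.0, 0.3, 0.0, -0.5]

summed = [t + p + s for t, p, s in zip(token_emb, position_emb, segment_emb)]
normalized = layer_norm(summed)
print(normalized)
```

In the real model this is applied independently to each of the 768-dimensional token vectors.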

## Pretraining Tasks

BERT is an autoencoding model because it is a transformer encoder and because it was trained to reconstruct sentences from corrupted inputs.

This pre-training task is called masked language modeling (MLM).

The model tries to predict a masked word/token inside a sentence, filling in the blanks as the continuous bag-of-words (CBoW) model does.

There are strategies to select which tokens get masked (the code below selects them at random with ```mlm_probability=0.15```), while the target of our encoder is the original sentence.

In particular, the loss is computed only for the randomly selected tokens; the other positions are not used to compute the loss.
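
The selection strategy used by BERT can be sketched as follows: each selected token is replaced by [MASK] 80% of the time, by a random token 10% of the time, and left unchanged the remaining 10%. This is an illustrative pure-Python version working on strings; `mask_tokens` and `toy_vocab` are hypothetical names, and `None` plays the role of the "ignore" label (HuggingFace's collator uses -100 for that).

```python
import random

def mask_tokens(tokens, vocab, mlm_probability=0.15, seed=0):
    rng = random.Random(seed)
    special = {'[CLS]', '[SEP]', '[PAD]'}
    corrupted, labels = [], []
    for token in tokens:
        if token in special or rng.random() >= mlm_probability:
            corrupted.append(token)
            labels.append(None)       # position ignored by the loss
            continue
        labels.append(token)          # the target is always the original token
        roll = rng.random()
        if roll < 0.8:
            corrupted.append('[MASK]')            # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted.append(rng.choice(vocab))   # 10%: replace with a random token
        else:
            corrupted.append(token)               # 10%: keep the original token
    return corrupted, labels

tokens = ['[CLS]', 'alice', 'is', 'following', 'the', 'white', 'rabbit', '[SEP]']
toy_vocab = ['alice', 'is', 'following', 'the', 'white', 'rabbit', 'hole', 'queen']
# A higher probability than 0.15, only to make masking visible on a short sentence
corrupted, labels = mask_tokens(tokens, toy_vocab, mlm_probability=0.5, seed=41)
print(corrupted)
print(labels)
```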

### Masked Language Model (MLM)

![](https://github.com/dvgodoy/PyTorchStepByStep/blob/master/images/bert_mlm.png?raw=1)

In [None]:
sentence = 'Alice is inexplicably following the white rabbit'
tokens = bert_tokenizer(sentence)
tokens['input_ids']

In [None]:
from transformers import DataCollatorForLanguageModeling
torch.manual_seed(41)
data_collator = DataCollatorForLanguageModeling(tokenizer=bert_tokenizer, mlm_probability=0.15)
mlm_tokens = data_collator([tokens])
mlm_tokens

In [None]:
print(bert_tokenizer.convert_ids_to_tokens(tokens['input_ids']))
print(bert_tokenizer.convert_ids_to_tokens(mlm_tokens['input_ids'][0]))

### Next Sentence Prediction (NSP)

Another pre-training task is the Next Sentence Prediction (NSP) task.
BERT was trained to predict if a second sentence is actually the next sentence in the original text or not.

In this way, the model learns the relationships between sentences.

This task takes the final hidden state of the special classifier token [CLS] as features for a classifier.
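
Training pairs for this task can be built from any ordered list of sentences: about half of the pairs use the actual next sentence and the other half a different one. The sketch below is a simplified illustration (`make_nsp_pairs` is a hypothetical helper; the label convention, 1 for "is next", is this sketch's own choice, and the original procedure samples negatives from other documents).

```python
import random

def make_nsp_pairs(sentences, seed=0):
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            # Positive example: the actual next sentence
            pairs.append((sentences[i], sentences[i + 1], 1))
        else:
            # Negative example: any sentence except the actual next one
            candidates = [j for j in range(len(sentences)) if j not in (i, i + 1)]
            pairs.append((sentences[i], sentences[rng.choice(candidates)], 0))
    return pairs

sentences = ['alice saw a rabbit', 'the rabbit had a watch',
             'alice followed it', 'she fell down a hole']
pairs = make_nsp_pairs(sentences, seed=42)
for pair in pairs:
    print(pair)
```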


![](https://github.com/dvgodoy/PyTorchStepByStep/blob/master/images/bert_nsp.png?raw=1)

In [None]:
bert_model.pooler

In [None]:
sentence1 = 'alice follows the white rabbit'
sentence2 = 'follow the white rabbit neo'
bert_tokenizer(sentence1, sentence2, return_tensors='pt')

## Outputs

In [None]:
sentence = 'And, so far as they knew, they were quite right' #train_dataset[100]['sentence']
sentence

In [None]:
tokens = bert_tokenizer(sentence,
                        padding='max_length',
                        max_length=30,
                        truncation=True,
                        return_tensors="pt")
tokens

In [None]:
bert_model.eval()
out = bert_model(input_ids=tokens['input_ids'],
                 attention_mask=tokens['attention_mask'],
                 output_attentions=True,
                 output_hidden_states=True,
                 return_dict=True)

print()
out.keys()

- ```last_hidden_state``` is returned by default and is the most important output of all: it contains the final hidden states for each and every token in the input, which can be seen as **contextual word embeddings**

  - [CLS], [SEP], and [PAD] are also included

![](https://github.com/dvgodoy/PyTorchStepByStep/blob/master/images/bert_embeddings.png?raw=1)

In [None]:
last_hidden_batch = out['last_hidden_state']
last_hidden_sentence = last_hidden_batch[0]
# Removes hidden states for [PAD] tokens using the mask
mask = tokens['attention_mask'].squeeze().bool()
embeddings = last_hidden_sentence[mask]
# Removes embeddings for the first [CLS] and last [SEP] tokens
embeddings[1:-1]

In [None]:
get_embeddings(bert_flair, sentence)

- ```hidden_states``` returns hidden states for every layer in BERT encoder architecture, including the last one, and the input embedding as well.

Therefore, there are 12 + 1 = 13 elements: one for each of the 12 layers, plus the input embeddings.

In [None]:
print(len(out['hidden_states']))
print(out['hidden_states'][0].shape)

In [None]:
(out['hidden_states'][0] == bert_model.embeddings(tokens['input_ids'])).all()

In [None]:
(out['hidden_states'][-1] == out['last_hidden_state']).all()

- ```pooler_output``` is returned by default, it's the output of the pooler given the last hidden state as its input

In [None]:
(out['pooler_output'] == bert_model.pooler(out['last_hidden_state'])).all()
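
The pooler itself is small: it takes the hidden state of the first token ([CLS]), passes it through a linear layer, and applies a tanh. A pure-Python sketch for a single sentence, with hypothetical two-dimensional values for the hidden states, weights, and biases:

```python
import math

def pooler(last_hidden_state, weight, bias):
    cls_state = last_hidden_state[0]   # hidden state of the first token, [CLS]
    pooled = []
    for row, b in zip(weight, bias):
        linear = sum(w * x for w, x in zip(row, cls_state)) + b
        pooled.append(math.tanh(linear))
    return pooled

# Hypothetical values: 3 tokens, 2 hidden dimensions
last_hidden_state = [[1.0, -1.0], [0.3, 0.7], [2.0, 2.0]]
weight = [[0.5, 0.5], [1.0, -1.0]]
bias = [0.0, 0.0]
pooled = pooler(last_hidden_state, weight, bias)
print(pooled)
```

Because of the tanh, every value of the pooled output lies between -1 and 1, and the other tokens do not contribute to it directly.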

- ```attentions``` returns the self-attention scores for each attention head in each layer of BERT's encoder:

In [None]:
print(len(out['attentions']))
print(out['attentions'][0].shape)

In [None]:
print(type(out['attentions']))

It has 12 elements, one for each layer. Each element is a tensor containing the scores for the sentences in the mini-batch (only one in our case). Those scores cover each of the 12 self-attention heads, with each head indicating how much attention each of the 30 tokens is paying to all 30 tokens, hence the shape (1, 12, 30, 30).
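
Each row of those 30 x 30 matrices is the output of a softmax, so it sums to one, and padded positions receive (practically) zero attention. A minimal sketch for one query token over four positions, using hypothetical raw scores and `-inf` for the masked position (actual implementations typically add a very large negative number instead):

```python
import math

def attention_weights(scores, mask):
    # Padded positions get -inf so that their softmax weight is exactly zero
    masked = [s if keep else float('-inf') for s, keep in zip(scores, mask)]
    highest = max(masked)
    exps = [math.exp(s - highest) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.5, 3.0]   # hypothetical raw scores of one query over 4 tokens
mask = [1, 1, 1, 0]             # the last position is [PAD]
weights = attention_weights(scores, mask)
print(weights)
```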

## Model IV - Classifying using BERT

In [None]:
class BERTClassifier(nn.Module):
    def __init__(self, bert_model, ff_units, n_outputs, dropout=0.3):
        super().__init__()
        # DistilBERT's config exposes the hidden size as `dim` (BertModel's config calls it `hidden_size`)
        self.d_model = bert_model.config.dim
        self.n_outputs = n_outputs
        self.encoder = bert_model
        self.mlp = nn.Sequential(
            nn.Linear(self.d_model, ff_units),
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(ff_units, n_outputs)
        )

    def encode(self, source, source_mask=None):
        states = self.encoder(input_ids=source,
                              attention_mask=source_mask)[0]
        cls_state = states[:, 0]
        return cls_state

    def forward(self, X):
        source_mask = (X > 0)
        # Featurizer
        cls_state = self.encode(X, source_mask)
        # Classifier
        out = self.mlp(cls_state)
        return out

Our model takes:
- an instance of a pretrained BERT model
- the number of units in the hidden layer of the classifier (```ff_units```)
- the desired number of outputs (logits), corresponding to the number of classes

Its ```forward()``` method takes a mini-batch of token IDs, encodes them using BERT, and outputs logits.

### Data Preparation

In [None]:
def tokenize_dataset(hf_dataset, sentence_field, label_field, tokenizer, **kwargs):
    sentences = hf_dataset[sentence_field]
    token_ids = tokenizer(sentences, return_tensors='pt', **kwargs)['input_ids']
    labels = torch.as_tensor(hf_dataset[label_field])
    dataset = TensorDataset(token_ids, labels)
    return dataset

In [None]:
auto_tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
tokenizer_kwargs = dict(truncation=True, padding=True, max_length=30, add_special_tokens=True)

In [None]:
train_dataset_float = train_dataset.map(lambda row: {'labels': [float(row['labels'])]})
test_dataset_float = test_dataset.map(lambda row: {'labels': [float(row['labels'])]})

train_tensor_dataset = tokenize_dataset(train_dataset_float, 'sentence', 'labels', auto_tokenizer, **tokenizer_kwargs)
test_tensor_dataset = tokenize_dataset(test_dataset_float, 'sentence', 'labels', auto_tokenizer, **tokenizer_kwargs)

generator = torch.Generator()
train_loader = DataLoader(train_tensor_dataset, batch_size=4, shuffle=True, generator=generator)
test_loader = DataLoader(test_tensor_dataset, batch_size=8)

### Model Configuration & Training

In [None]:
torch.manual_seed(41)
bert_model = AutoModel.from_pretrained("distilbert-base-uncased")
model = BERTClassifier(bert_model, 128, n_outputs=1)
loss_fn = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-5)

In [None]:
sbs_bert = StepByStep(model, loss_fn, optimizer)
sbs_bert.set_loaders(train_loader, test_loader)
sbs_bert.train(1)

In [None]:
sbs_bert.count_parameters()

In [None]:
StepByStep.loader_apply(test_loader, sbs_bert.correct)

# Fine-Tuning with HuggingFace

As we said before, there is a BERT model for every task, and we just need to fine-tune it.

HF puts a **trainer** at our disposal to do most of the fine-tuning work.

- Pre-training tasks:
  - Masked language model (```BertForMaskedLM```)
  - Next sentence prediction (```BertForNextSentencePrediction```)
- Typical tasks:
  - Sequence classification (```BertForSequenceClassification```)
  - Token classification (```BertForTokenClassification```)
  - Question answering (```BertForQuestionAnswering```)
- Others:
  - Multiple choice (```BertForMultipleChoice```)

In our case, we want to use ```DistilBERT``` for sequence classification.


## Sequence Classification (or Regression)

In [None]:
from transformers import DistilBertForSequenceClassification
torch.manual_seed(42)
bert_cls = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)

In [None]:
from transformers import AutoModelForSequenceClassification
auto_cls = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)
print(auto_cls.__class__)

We don't need to add the classifier ourselves: the sequence classification model already places a classification head on top of the output of the underlying model to produce the logits.

We have the model, so now we prepare the dataset...

We need to tokenize our HF datasets. We do it with ```map()```, which creates new columns containing the tokenized version of each sentence.

## Tokenized Dataset

In [None]:
auto_tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')
def tokenize(row):
    return auto_tokenizer(row['sentence'],
                          truncation=True,
                          padding='max_length',
                          max_length=30)

In [None]:
tokenized_train_dataset = train_dataset.map(tokenize, batched=True)
tokenized_test_dataset = test_dataset.map(tokenize, batched=True)

In [None]:
print(tokenized_train_dataset[0])

In [None]:
'''
we select only the needed columns and return them as tensors
'''

tokenized_train_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])
tokenized_test_dataset.set_format(type='torch', columns=['input_ids', 'attention_mask', 'labels'])

In [None]:
tokenized_train_dataset[0]

## Trainer

In [None]:
from transformers import Trainer
trainer = Trainer(model=bert_cls, train_dataset=tokenized_train_dataset)

In [None]:
trainer.args

In [None]:
from transformers import TrainingArguments
training_args = TrainingArguments(
    output_dir='./output', # Where to save the output files
    run_name="bert_experiment",  # Set a unique name for this run
    logging_dir="./logs",  # Directory for logging
    report_to=["none"],
    num_train_epochs=1,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=8,
    evaluation_strategy='steps',
    eval_steps=300,
    logging_steps=300,
    gradient_accumulation_steps=8,
)

Notice the batch size: it is only one, but we keep accumulating gradients for 8 steps before each update, which is a way to simulate mini-batches of size 8.
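
Why does this work? Gradients add up, so accumulating eight single-sample gradients, each scaled by 1/8, gives the same result as the gradient of the mean loss over a mini-batch of eight. A minimal check on a hypothetical one-parameter regression model (all data and names below are made up for illustration):

```python
# Hypothetical 1-D regression data for the model y_hat = w * x
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
w = 0.5

def grad_squared_error(w, x, y):
    # d/dw of (w * x - y) ** 2
    return 2 * (w * x - y) * x

# Full mini-batch: gradient of the MEAN loss over the 8 samples
full_batch_grad = sum(grad_squared_error(w, x, y) for x, y in zip(xs, ys)) / len(xs)

# Accumulation: 8 "batches" of size 1, each loss scaled by 1 / accumulation_steps
accumulation_steps = 8
accumulated_grad = 0.0
for x, y in zip(xs, ys):
    accumulated_grad += grad_squared_error(w, x, y) / accumulation_steps
# Only at this point would the optimizer take a step and reset the gradients

print(full_batch_grad, accumulated_grad)
```

The price is time rather than memory: eight forward and backward passes are needed for each parameter update.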

In [None]:
def compute_metrics(eval_pred):
    predictions = eval_pred.predictions
    labels = eval_pred.label_ids
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": (predictions == labels).mean()}

We can specify a function to compute the desired metrics, like the one above, and pass it to the Trainer instance.

In [None]:
trainer = Trainer(model=bert_cls,
                  args=training_args,
                  train_dataset=tokenized_train_dataset,
                  eval_dataset=tokenized_test_dataset,
                  compute_metrics=compute_metrics)

In [None]:
trainer.train()

In [None]:
trainer.evaluate()

We can save the model and use it later.

In [None]:
trainer.save_model('bert_alice_vs_wizard')
os.listdir('bert_alice_vs_wizard')

In [None]:
loaded_model = AutoModelForSequenceClassification.from_pretrained('bert_alice_vs_wizard')
loaded_model.device

In [None]:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
loaded_model.to(device)
loaded_model.device

## Predictions

If you remember, last time we started with a peculiar sentence.
Now we are able to classify it (we could have done it with the earlier models too; you can try).

We tokenize it, send the tokens to the right device, and then run the model on them.

In [None]:
sentence = 'Down the yellow brick rabbit hole'
tokens = auto_tokenizer(sentence, return_tensors='pt')
tokens

In [None]:
print(type(tokens))
tokens.to(loaded_model.device)

In [None]:
loaded_model.eval()
logits = loaded_model(input_ids=tokens['input_ids'], attention_mask=tokens['attention_mask'])
logits

In [None]:
logits.logits.argmax(dim=1)
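
The model returns one logit per class, and the argmax picks the predicted class. To get probabilities we can apply a softmax; with two classes, this is the same as a sigmoid over the difference between the logits, which connects it to the single-logit models we trained before. A pure-Python sketch with hypothetical logits:

```python
import math

def softmax(logits):
    highest = max(logits)
    exps = [math.exp(value - highest) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

logits = [-1.2, 2.3]   # hypothetical logits for classes 0 and 1
probabilities = softmax(logits)
predicted_class = max(range(len(logits)), key=lambda i: logits[i])

print(probabilities, predicted_class)
print(sigmoid(logits[1] - logits[0]))   # same value as probabilities[1]
```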

## Pipeline

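We can make predictions more convenient using pipelines, which handle tokenization, the model call, and the conversion of logits into labels for us.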

There are many pipelines, one for each task:
- ```TextClassificationPipeline```
- ```TextGenerationPipeline```

Every pipeline takes at least two arguments:
- a model
- a tokenizer

Now, we can make predictions using the **original sentences**

In [None]:
from transformers import TextClassificationPipeline
device_index = loaded_model.device.index if loaded_model.device.type != 'cpu' else -1
classifier = TextClassificationPipeline(model=loaded_model,
                                        tokenizer=auto_tokenizer,
                                        device=device_index)

In [None]:
classifier(['Down the Yellow Brick Rabbit Hole', 'Alice rules!'])

In [None]:
loaded_model.config.id2label = {0: 'Wizard', 1: 'Alice'}

In [None]:
classifier(['Down the Yellow Brick Rabbit Hole', 'Alice rules!'])

## More Pipelines

It is possible to use pre-trained pipelines for **typical tasks**, like sentiment analysis, without any fine-tuning.

*Check the pipeline documentation on HuggingFace.*

In [None]:
from transformers import pipeline
sentiment = pipeline('sentiment-analysis')

In [None]:
sentence = train_dataset[0]['sentence']
print(sentence)
print(sentiment(sentence))

In [None]:
from transformers.pipelines import SUPPORTED_TASKS
# UPDATED
###########################################################
# sentiment-analysis was replaced by text-classification
# in the dictionary of supported tasks
# SUPPORTED_TASKS['sentiment-analysis']
SUPPORTED_TASKS['text-classification']
###########################################################

In [None]:
SUPPORTED_TASKS

In [None]:
SUPPORTED_TASKS['text-generation']

# GPT-2

The **G**enerative **P**retrained **T**ransformer 2 is able to generate text.

It was trained to fill in the blank at the end of a sentence, effectively predicting the next word given the words before it.

This task is exactly what a Transformer decoder does, and that is what GPT-2 is: a Transformer decoder.

GPT-2 was trained on 40 GB of internet text from 8 million web pages. Its largest version (released in November 2019) has 48 layers, 25 attention heads, 1600 hidden dimensions, and 1.5 billion parameters. The `gpt2` checkpoint used below is the smallest version (12 layers, 12 attention heads, 768 hidden dimensions).
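
Predicting the next word over and over is all it takes to generate text. The loop below is a toy sketch of that idea: the lookup table `toy_model` is hypothetical and conditions only on the last token, whereas GPT-2 attends to the whole context and the pipeline may sample instead of always taking the most likely token.

```python
# Hypothetical next-token probabilities, keyed by the previous token
toy_model = {
    'alice': {'was': 0.7, 'ran': 0.3},
    'was': {'very': 0.6, 'tired': 0.4},
    'very': {'tired': 0.9, 'sleepy': 0.1},
    'tired': {'<eos>': 1.0},
}

def generate(prompt, max_length=10):
    """Greedy generation: append the most likely next token until we stop."""
    tokens = list(prompt)
    # As with max_length in the pipeline, the prompt counts towards the limit
    while len(tokens) < max_length:
        candidates = toy_model.get(tokens[-1])
        if candidates is None:
            break
        next_token = max(candidates, key=candidates.get)
        if next_token == '<eos>':
            break
        tokens.append(next_token)
    return tokens

generated = generate(['alice'])
print(' '.join(generated))
```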

In [None]:
text_generator = pipeline("text-generation")

In [None]:
text_generator.model.config.task_specific_params

In [None]:
base_text = """
Alice was beginning to get very tired of sitting by her sister on the bank,
and of having nothing to do:  once or twice she had peeped into the book her
 sister was reading, but it had no pictures or conversations in it, `and what
 is the use of a book,'thought Alice `without pictures or conversation?'
 So she was considering in her own mind (as well as she could, for the hot day
 made her feel very sleepy and stupid), whether the pleasure of making a
 daisy-chain would be worth the trouble of getting up and picking the daisies,
 when suddenly a White Rabbit with pink eyes ran close by her.
"""

In [None]:
result = text_generator(base_text, max_length=250)
print(result[0]['generated_text'])

## Hold on, we can fine-tune GPT-2 too

## Data Preparation

In [None]:
dataset = load_dataset(path='csv', data_files=['texts/alice28-1476.sent.csv'], quotechar='\\', split=Split.TRAIN)

In [None]:
shuffled_dataset = dataset.shuffle(seed=42)
split_dataset = shuffled_dataset.train_test_split(test_size=0.2, seed=42)
train_dataset, test_dataset = split_dataset['train'], split_dataset['test']

In [None]:
auto_tokenizer = AutoTokenizer.from_pretrained('gpt2')
def tokenize(row):
    return auto_tokenizer(row['sentence'])

- GPT-2 uses a different pre-trained tokenizer, based on Byte-Pair Encoding (BPE).
- We don't need padding: the goal is to generate text, and we don't want the model to continue writing after a run of padding tokens.
- We remove the `source` and `sentence` columns (below), as before.
- Then we pack the sentences together: we concatenate the tokenized inputs and split them into fixed-size blocks.
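
To give an intuition for Byte-Pair Encoding, here is a toy sketch of a single merge step: count adjacent symbol pairs over a small vocabulary and fuse the most frequent pair into a new symbol. It works on characters and a made-up word-frequency table; GPT-2's tokenizer operates on bytes and ships with its merges already learned, so this is only an illustration of the idea.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Hypothetical word frequencies, each word split into characters
words = {tuple('hat'): 3, tuple('cat'): 2, tuple('at'): 4, tuple('ha'): 1}
best_pair = most_frequent_pair(words)
new_words = merge_pair(words, best_pair)
print(best_pair, new_words)
```

Repeating this step many times builds a vocabulary of progressively longer subword units.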

In [None]:
tokenized_train_dataset = train_dataset.map(tokenize, remove_columns=['source', 'sentence'], batched=True)
tokenized_test_dataset = test_dataset.map(tokenize, remove_columns=['source', 'sentence'], batched=True)

In [None]:
list(map(len, tokenized_train_dataset[0:6]['input_ids']))

### "Packed" Dataset

![](https://github.com/dvgodoy/PyTorchStepByStep/blob/master/images/block_tokens.png?raw=1)

In [None]:
# Adapted from https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_clm.py
def group_texts(examples, block_size=128):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can
    # customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result
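
To see what packing does, here is a simplified sketch on plain lists of made-up token ids. The helper `pack` is hypothetical and handles a single column; `group_texts` above does the same for every column of the batch.

```python
from itertools import chain

def pack(sequences, block_size):
    """Concatenate the sequences and cut them into equally sized blocks."""
    flat = list(chain.from_iterable(sequences))
    # Tokens that do not fill a whole block are dropped
    usable = (len(flat) // block_size) * block_size
    return [flat[i:i + block_size] for i in range(0, usable, block_size)]

# Three "sentences" of different lengths (hypothetical token ids)
toy_ids = [[1, 2, 3], [4, 5], [6, 7, 8, 9, 10]]
blocks = pack(toy_ids, block_size=4)
print(blocks)
```

Sentence boundaries disappear: a block may start in the middle of one sentence and end in the middle of another. Note also that `map(..., batched=True)` calls the function once per batch of rows, so the remainder is dropped for each batch and not just once for the whole dataset.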

In [None]:
lm_train_dataset = tokenized_train_dataset.map(group_texts, batched=True)
lm_test_dataset = tokenized_test_dataset.map(group_texts, batched=True)
lm_train_dataset.set_format(type='torch')
lm_test_dataset.set_format(type='torch')

In [None]:
print(lm_train_dataset[0]['input_ids'])

In [None]:

len(lm_train_dataset), len(lm_test_dataset)

## Model Configuration & Training

GPT-2 is a causal language model, so we load it with `AutoModelForCausalLM`:

In [None]:
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained('gpt2')
print(model.__class__)
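
In `group_texts`, the labels are a plain copy of the input ids. This works because causal language models in `transformers` shift the labels internally, so the prediction at each position is compared with the token that follows it. The alignment can be sketched as below; the token ids are made up for illustration.

```python
# One hypothetical block of token ids
block = [464, 2635, 25498, 4966, 1969]

# After the shift, position i is used to predict the token at position i + 1
inputs = block[:-1]
targets = block[1:]

for position, target in enumerate(targets):
    context = block[:position + 1]
    print(context, '->', target)
```

The last token of a block has no target, and the first token is never predicted.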

In [None]:
model.resize_token_embeddings(len(auto_tokenizer))

In [None]:
training_args = TrainingArguments(
    output_dir='./output', # Where to save the output files
    run_name="gpt2_experiment",  # Set a unique name for this run
    logging_dir="./logs",  # Directory for logging
    report_to=["none"],
    num_train_epochs=1,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=8,
    evaluation_strategy='steps',
    eval_steps=50,
    logging_steps=50,
    gradient_accumulation_steps=4,
    prediction_loss_only=True,
)

trainer = Trainer(model=model,
                  args=training_args,
                  train_dataset=lm_train_dataset,
                  eval_dataset=lm_test_dataset)
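
With a batch size of 1 and 4 gradient accumulation steps, the optimizer updates the weights once every 4 blocks. The arithmetic below is a rough sketch: the number of devices and the number of training blocks are assumptions, and the exact step count reported by `Trainer` may differ slightly.

```python
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
num_devices = 1   # assumption: a single GPU or the CPU
num_blocks = 240  # hypothetical number of packed training blocks

# Number of blocks that contribute to a single optimizer update
effective_batch_size = (per_device_train_batch_size
                        * gradient_accumulation_steps
                        * num_devices)

# Approximate number of optimizer updates in one epoch
approx_updates_per_epoch = num_blocks // effective_batch_size
print(effective_batch_size, approx_updates_per_epoch)
```

Accumulating gradients gives the behavior of a larger batch while keeping only one block in memory at a time.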

In [None]:
trainer.train()

In [None]:
trainer.evaluate()
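
The evaluation loss of a language model is the average cross-entropy per token, and it is often reported as perplexity, which is its exponential. A minimal sketch, using a made-up loss value in place of the dictionary returned by `trainer.evaluate()`:

```python
import math

# Hypothetical metrics; in the notebook these come from trainer.evaluate()
eval_metrics = {'eval_loss': 3.0}

# Perplexity is the exponential of the average cross-entropy loss
perplexity = math.exp(eval_metrics['eval_loss'])
print(f'perplexity: {perplexity:.2f}')
```

Lower is better: a perplexity of 20 means the model is, on average, about as uncertain as if it were choosing uniformly among 20 tokens.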

## Generating Text

In [None]:
device_index = model.device.index if model.device.type != 'cpu' else -1
gpt2_gen = pipeline('text-generation', model=model, tokenizer=auto_tokenizer, device=device_index)

In [None]:
result = gpt2_gen(base_text, max_length=250)
print(result[0]['generated_text'])