I talked to Gaurav on July 11 after class and we thought of a few candidate architectures. In all of them, we will use transfer learning from something like BERT, and we want to get embeddings for:
* the overview AND
* some main features (like genres and production companies; director and actors may be harder to get an embedding for).
Then, we will concatenate the two separate embeddings and use them for clustering. It would probably be better (and more natural) to train both halves together so the embeddings fit together.

1. Multi-task learning.
2. Have the genres or the production companies as "style". This will push the embeddings to form clusters.
3. Train the overview separately (maybe as an auto-encoder) and the features separately (maybe for classification), then concatenate them.

In all cases, the network must be trained to reach a goal; the embeddings it came up with in the middle will then be used for clustering and interpolation.

> Since I will be doing transfer learning, it is better to have a warmup_rate so I do not shock the weights. So, I start with a smaller learning rate then work up to the learning rate I want.
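
As a rough illustration of the warmup idea, here is a minimal sketch of a linear warmup followed by a linear decay. The function name `lr_at_step` and the step counts are made up for this example; with the Hugging Face `Trainer`, the usual route would be the `warmup_ratio` or `warmup_steps` options of `TrainingArguments` rather than hand-written code.

```python
def lr_at_step(step, base_lr=2e-5, warmup_steps=5, total_steps=55):
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    if step < warmup_steps:
        # ramp up so the pre-trained weights are not hit by large updates
        return base_lr * step / max(1, warmup_steps)
    # after warmup, decay linearly until the last step
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# learning rate for every optimization step (hypothetical 55-step run)
schedule = [lr_at_step(s) for s in range(56)]
```

The learning rate starts at 0, peaks at `base_lr` when the warmup ends, and returns to 0 at the last step.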

[Fine tuning BERT for multi-label classification](https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/BERT/Fine_tuning_BERT_(and_friends)_for_multi_label_text_classification.ipynb#scrollTo=4wxY3x-ZZz8h)

[Getting word and sentence embeddings from BERT](https://mccormickml.com/2019/05/14/BERT-word-embeddings-tutorial/)

In [1]:
import torch
DEV_MODE = True
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")



In [2]:
from data_prep import load_movies_full_df

MIN_VOTES_PER_MOVIE = 50
NEUTRAL_RATING = 2.5
MIN_POSITIVE_VOTES_PER_USER = 20
DESIRED_COLUMNS = ['id', 'cast', 'title', 'crew',
                   'genres', 'overview', 'production_companies']

df = load_movies_full_df(
        movies_metadata_path='data/IMDB_Ratings/movies_metadata.csv',
        credits_path='data/IMDB_Ratings/credits.csv',
        n_votes=MIN_VOTES_PER_MOVIE,
        desired_columns=DESIRED_COLUMNS)



In [3]:
df.head()

| | id | cast | title | crew | genres | overview | production_companies |
|---|---|---|---|---|---|---|---|
| 0 | 862 | [{'cast_id': 14, 'character': 'Woody (voice)',... | Toy Story | [{'credit_id': '52fe4284c3a36847f8024f49', 'de... | [{'id': 16, 'name': 'Animation'}, {'id': 35, '... | Led by Woody, Andy's toys live happily in his ... | [{'name': 'Pixar Animation Studios', 'id': 3}] |
| 1 | 8844 | [{'cast_id': 1, 'character': 'Alan Parrish', '... | Jumanji | [{'credit_id': '52fe44bfc3a36847f80a7cd1', 'de... | [{'id': 12, 'name': 'Adventure'}, {'id': 14, '... | When siblings Judy and Peter discover an encha... | [{'name': 'TriStar Pictures', 'id': 559}, {'na... |
| 2 | 15602 | [{'cast_id': 2, 'character': 'Max Goldman', 'c... | Grumpier Old Men | [{'credit_id': '52fe466a9251416c75077a89', 'de... | [{'id': 10749, 'name': 'Romance'}, {'id': 35, ... | A family wedding reignites the ancient feud be... | [{'name': 'Warner Bros.', 'id': 6194}, {'name'... |
| 3 | 11862 | [{'cast_id': 1, 'character': 'George Banks', '... | Father of the Bride Part II | [{'credit_id': '52fe44959251416c75039ed7', 'de... | [{'id': 35, 'name': 'Comedy'}] | Just when George Banks has recovered from his ... | [{'name': 'Sandollar Productions', 'id': 5842}... |
| 4 | 949 | [{'cast_id': 25, 'character': 'Lt. Vincent Han... | Heat | [{'credit_id': '52fe4292c3a36847f802916d', 'de... | [{'id': 28, 'name': 'Action'}, {'id': 80, 'nam... | Obsessive master thief, Neil McCauley leads a ... | [{'name': 'Regency Enterprises', 'id': 508}, {... |


In [4]:
if DEV_MODE:
    df = df.head(100)

## Prepare for classification

We will use the genres of the movies (and maybe the production companies later) as the labels of the movie. This makes it a multi-label problem, since most movies fall under more than one genre.

> A question to answer is whether we should only keep it as the top 3 genres, or have all genres.

In [5]:
from features import get_top_n_per_feature, parse_into_python_objects

# the csv files have stringified objects to represent the cast, the crew, the genres and the production companies
# we have to parse them into python objects
df = parse_into_python_objects(df, ['cast', 'crew', 'genres', 'production_companies'])

# let's extract the top 3 genres of a movie into a list (instead of objects)
# (the production companies are not extracted here yet)
df = get_top_n_per_feature(df, [('genres', 3)])

Parsing stringified objects into Python readable objects ...
Extracting top 3 genres ...


In [6]:
df['genres'].tail(20)

80       [Action, Crime, Fantasy]
81               [Drama, Romance]
82               [Action, Comedy]
83      [Fantasy, Comedy, Family]
84        [Mystery, Crime, Drama]
85     [Action, Adventure, Drama]
86      [Action, Thriller, Drama]
87                  [Documentary]
88      [Thriller, Action, Crime]
89     [Drama, Mystery, Thriller]
90             [Action, Thriller]
91        [Action, Comedy, Crime]
92     [Action, Adventure, Drama]
93                    [Adventure]
94      [Action, Crime, Thriller]
95     [Adventure, Action, Drama]
96              [Science Fiction]
97                 [Drama, Crime]
98                [Drama, Comedy]
99    [Mystery, Horror, Thriller]
Name: genres, dtype: object

### Turn genres into one-hot-encoded columns

To classify the movies over genres, we have to turn them into one-hot encoded columns, where each column represents a genre and its value is 1 if the movie falls under this genre and 0 otherwise.

In [7]:
# one hot encoding the genres
# note: it is better for memory to use a sparse matrix, but it is not compatible with the tokenizer
# solution taken from: https://stackoverflow.com/questions/45312377/how-to-one-hot-encode-from-a-pandas-column-containing-a-list
from sklearn.preprocessing import MultiLabelBinarizer
import pandas as pd

mlb_genres = MultiLabelBinarizer()

df = df.join(
        pd.DataFrame(
            mlb_genres.fit_transform(df.pop('genres')),
            columns=mlb_genres.classes_,
            index=df.index
        )
)

In [8]:
genre_names = mlb_genres.classes_.tolist()
genre_names

['Action',
 'Adventure',
 'Animation',
 'Comedy',
 'Crime',
 'Documentary',
 'Drama',
 'Family',
 'Fantasy',
 'History',
 'Horror',
 'Music',
 'Mystery',
 'Romance',
 'Science Fiction',
 'Thriller',
 'War']

### Add Production Companies
Now that we have added the genres, we can also try to add the production companies, using the top 2 production companies of each movie.

In [10]:
mlb_prod_companies = MultiLabelBinarizer()

df = df.join(
        pd.DataFrame(
            mlb_prod_companies.fit_transform(df.pop('production_companies')),
            columns=mlb_prod_companies.classes_,
            index=df.index
        )
)

KeyError: 'genres'

In [None]:
prod_company_names = mlb_prod_companies.classes_.tolist()

### Create label indices
To ease our training, we will map our labels from strings to integers. Later on, we might want to map them back to their string format for interpretability. 

We will keep this flexible to accommodate the genres only, the production companies only, or both.

> Idea: can I do multi-task learning where I use the same input to train two-classifiers: one for the genres and one for the production companies?

In [9]:
labels = genre_names # + prod_company_names
id2label = {idx:label for idx, label in enumerate(labels)}
label2id = {label:idx for idx, label in enumerate(labels)}

### Train/Test Split

We will split the dataset into training and testing sets (roughly 80%-20%) so that we can evaluate the model on movies it has not seen during training.

In [10]:
from datasets import Dataset, DatasetDict
import numpy as np

# create a mask to split data into training and testing
msk = np.random.rand(len(df)) < 0.8

# TODO should I drop all the columns that I do not need (i.e., cast, crew, maybe title)
dataset = DatasetDict(
    train = Dataset.from_pandas(df[msk]),
    test = Dataset.from_pandas(df[~msk])
)
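
The random mask above gives a split that is only approximately 80/20 and that changes on every run. If a reproducible, exact split is wanted, one option is to shuffle the row positions with a fixed seed. This is a sketch using only the standard library; `split_indices` and the seed value are assumptions for illustration, and the resulting positions would be passed to `df.iloc[...]`.

```python
import random

def split_indices(n_rows, train_fraction=0.8, seed=42):
    """Return (train_positions, test_positions) for a seeded, exact split."""
    positions = list(range(n_rows))
    random.Random(seed).shuffle(positions)  # local RNG, global state untouched
    cut = int(n_rows * train_fraction)
    return sorted(positions[:cut]), sorted(positions[cut:])


train_pos, test_pos = split_indices(100)
# in the notebook this could feed e.g. Dataset.from_pandas(df.iloc[train_pos])
```

Calling the function again with the same seed returns the same two lists, which makes runs comparable.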

### Find the max length of tokens

The tokenizer will turn the words in a sentence into IDs. BERT can handle inputs of different lengths (up to 512 tokens for `bert-base-uncased`), but all the sequences in a batch must have the same number of tokens so they can be stacked into one tensor. Nonetheless, not all overviews have the same length.

To overcome this problem, we give our tokenizer a `max_length` that all inputs should have.
- If the overview has fewer tokens than `max_length`, we add `[PAD]` tokens to it until it reaches this `max_length`.
- If the overview has more tokens than `max_length`, we truncate it to only have as many tokens as `max_length`.

To find a good `max_length`, we will run simple statistics on the overview column of our dataset and use the 95th percentile of the overview length as our `max_length`. As a first approximation, we count whitespace-separated words rather than actual tokens.
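
Since `max_length` is measured in tokens (WordPiece usually yields more tokens than words, plus `[CLS]` and `[SEP]`), a tokenizer-based percentile may be a closer estimate than a word count. The sketch below is one way to compute it; `length_percentile` is a hypothetical helper, the overviews are toy strings, and `str.split` stands in for a real tokenizer callable such as `tokenizer.tokenize` so that the example runs on its own. It uses a nearest-rank percentile, which can differ slightly from the interpolated value pandas returns.

```python
import math

def length_percentile(texts, tokenize, q=0.95, special_tokens=2):
    """Nearest-rank percentile of tokenized lengths.

    special_tokens accounts for [CLS] and [SEP], which BERT adds per input.
    """
    lengths = sorted(len(tokenize(t)) + special_tokens for t in texts)
    rank = max(1, math.ceil(q * len(lengths)))  # 1-based nearest rank
    return lengths[rank - 1]


overviews = ["a family wedding", "toys", "a master thief leads a crew"]
max_len = length_percentile(overviews, tokenize=str.split)
```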

In [11]:
count = df['overview'].str.split().str.len()
count.describe()

count    100.000000
mean      51.180000
std       23.965202
min        7.000000
25%       31.500000
50%       52.000000
75%       67.250000
max      123.000000
Name: overview, dtype: float64

In [12]:
tokens_max_length = count.quantile(0.95)

### Encoding/tokenizing the overview

The tokenizer will turn the words in a sentence into IDs, which correspond to the IDs that the original BERT used to represent words when it was training. This is crucial for the transfer-learning to work, to ensure consistency between the representations of the words that we will use.

Moreover, we will need the labels of each class to be used as the output of the model.


In [19]:
from transformers import AutoTokenizer
import numpy as np

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def preprocess_data(examples):
  # take a batch of texts
  text = examples["overview"]
  # encode them
  encoding = tokenizer(
    text, 
    padding="max_length", 
    truncation=True, 
    max_length=int(tokens_max_length), 
    # return_tensors='pt'
  ) # .to(device)
  
  # add labels
  labels_batch = {k: examples[k] for k in examples.keys() if k in labels}
  # create numpy array of shape (batch_size, num_labels)
  labels_matrix = np.zeros((len(text), len(labels)))
  # fill numpy array
  for idx, label in enumerate(labels):
    labels_matrix[:, idx] = labels_batch[label]

  encoding["labels"] = labels_matrix.tolist()
  
  encoding["id"] = examples["id"]
  
  return encoding

In [20]:
encoded_dataset = dataset.map(
    preprocess_data, 
    batched=True, # default batch size is 1,000
    # the returned values will have a new shape, 
    # so we must drop the old columns lest we have shape mismatch problems
    remove_columns=dataset['train'].column_names 
)


In [19]:
example = encoded_dataset['train'][0]
print(example.keys())

dict_keys(['input_ids', 'token_type_ids', 'attention_mask', 'labels'])


In [20]:
tokenizer.decode(example['input_ids'])

"[CLS] led by woody, andy's toys live happily in his room until andy's birthday brings buzz lightyear onto the scene. afraid of losing his place in andy's heart, woody plots against buzz. but when circumstances separate buzz and woody from their owner, the duo eventually learns to put aside their differences. [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]"

In [21]:
example['labels']

[0.0,
 0.0,
 1.0,
 1.0,
 0.0,
 0.0,
 0.0,
 1.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0,
 0.0]

In [22]:
[id2label[idx] for idx, label in enumerate(example['labels']) if label == 1.0]

['Animation', 'Comedy', 'Family']

In [23]:
# set the format of our data to PyTorch tensors. 
# This will turn the training, validation and test sets into standard PyTorch datasets
encoded_dataset.set_format("torch")

## Define Model
We will use a pre-trained BERT model and do transfer learning to classify the overviews into genres. This is a multi-label classification problem, so we have to configure the pre-trained model accordingly, as the default sequence-classification head has only two output labels.

It is important to note that we primarily do this classification to encode some movie-specific data into the word/sentence embeddings of the overview. The final goal is to take these embeddings and run nearest-neighbor search or interpolation in the embedding space to find movie recommendations.
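
To make the end goal concrete, here is a small sketch of a cosine-similarity nearest-neighbor lookup over sentence embeddings. The three-dimensional vectors are toy values invented for the example (the real embeddings here would have 768 dimensions), and `nearest_movies` is a hypothetical helper.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def nearest_movies(query_title, embeddings, k=2):
    """Rank the other movies by cosine similarity to the query movie."""
    query = embeddings[query_title]
    scored = [(cosine_similarity(query, vec), title)
              for title, vec in embeddings.items() if title != query_title]
    return [title for _, title in sorted(scored, reverse=True)[:k]]

# toy embeddings, NOT produced by the model
toy_embeddings = {
    "Toy Story": [0.9, 0.1, 0.0],
    "Jumanji": [0.7, 0.3, 0.1],
    "Heat": [0.0, 0.2, 0.9],
}
recommendations = nearest_movies("Toy Story", toy_embeddings, k=1)
```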

Here we define a model that consists of a pre-trained base (i.e., the weights from bert-base-uncased are loaded) with a randomly initialized classification head (a linear layer) on top. One should fine-tune this head, together with the pre-trained base, on a labeled dataset.

This is also printed by the warning.

We set the `problem_type` to be "multi_label_classification", as this will make sure the appropriate loss function is used (namely [`BCEWithLogitsLoss`](https://pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html)). We also make sure the output layer has `len(labels)` output neurons, and we set the id2label and label2id mappings.
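
To see what this loss does for a single movie, here is a plain-Python sketch of binary cross-entropy on logits, averaged over the labels (which is what `BCEWithLogitsLoss` computes with its default mean reduction). `bce_with_logits` is a hypothetical helper written for illustration, not the PyTorch implementation. With all logits at 0, every label gets probability 0.5 and the loss is ln 2 ≈ 0.693, so a freshly initialized head should start out near that value.

```python
import math

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over independent labels, computed from logits."""
    total = 0.0
    for z, y in zip(logits, targets):
        # numerically stable form of -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))]
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)

# 17 genres, a movie tagged with the labels at positions 2, 3 and 7
targets = [0.0] * 17
for idx in (2, 3, 7):
    targets[idx] = 1.0

untrained_loss = bce_with_logits([0.0] * 17, targets)
```

Each genre contributes its own yes/no term, which is what lets a movie belong to several genres at once.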

In [35]:
from transformers import AutoModelForSequenceClassification

# TODO we should tell the model that we will want to extract the embeddings
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", 
                                                           problem_type="multi_label_classification", 
                                                           num_labels=len(labels),
                                                           id2label=id2label,
                                                           label2id=label2id,
                                                           output_hidden_states=True)
model.to(device)

loading configuration file https://huggingface.co/bert-base-uncased/resolve/main/config.json from cache at C:\Users\User/.cache\huggingface\transformers\3c61d016573b14f7f008c02c4e51a366c67ab274726fe2910691e2a761acf43e.37395cee442ab11005bcd270f3c34464dc1704b715b5d7d52b1a461abe3b9e4e
Model config BertConfig {
  "_name_or_path": "bert-base-uncased",
  "architectures": [
    "BertForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "Action",
    "1": "Adventure",
    "2": "Animation",
    "3": "Comedy",
    "4": "Crime",
    "5": "Documentary",
    "6": "Drama",
    "7": "Family",
    "8": "Fantasy",
    "9": "History",
    "10": "Horror",
    "11": "Music",
    "12": "Mystery",
    "13": "Romance",
    "14": "Science Fiction",
    "15": "Thriller",
    "16": "War"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3

## Train the model
We are going to train the model using HuggingFace's Trainer API. This requires us to define 2 things:

- `TrainingArguments`, which specify training hyperparameters. All options can be found in the docs. Below, for example, we specify that we want to evaluate and save the model after every epoch of training, and we set the learning rate, the batch size to use for training/evaluation, how many epochs to train for, and so on.
- a `Trainer` object (see the Transformers docs).

In [25]:
batch_size = 32
metric_name = "f1" # TODO should I use accuracy?

In [26]:
from transformers import TrainingArguments, Trainer

args = TrainingArguments(
    "bert-finetuned-sem_eval-english",
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    num_train_epochs=5,
    weight_decay=0.01,
    load_best_model_at_end=True,
    metric_for_best_model=metric_name,
    #push_to_hub=True,
)

We are also going to compute metrics while training. For this, we need to define a `compute_metrics` function, that returns a dictionary with the desired metric values.

In [27]:
from sklearn.metrics import f1_score, roc_auc_score, accuracy_score
from transformers import EvalPrediction
import torch
    
# source: https://jesusleal.io/2021/04/21/Longformer-multilabel-classification/
def multi_label_metrics(predictions, labels, threshold=0.5):
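    # first, apply sigmoid on predictions (logits), which are of shape (batch_size, num_labels)
    sigmoid = torch.nn.Sigmoid()
    probs = sigmoid(torch.Tensor(predictions))
    # next, use threshold to turn them into binary (0/1) predictions
    y_pred = np.zeros(probs.shape)
    y_pred[np.where(probs >= threshold)] = 1
    # finally, compute metrics
    y_true = labels
    f1_micro_average = f1_score(y_true=y_true, y_pred=y_pred, average='micro')
    # note: ROC AUC is computed on the thresholded predictions here;
    # passing `probs` instead would score the full ranking of the probabilities
    roc_auc = roc_auc_score(y_true, y_pred, average = 'micro')
    # note: for multi-label targets, accuracy_score is the subset accuracy
    # (a sample counts as correct only if all of its labels match)
    accuracy = accuracy_score(y_true, y_pred)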
    # return as dictionary
    metrics = {'f1': f1_micro_average,
               'roc_auc': roc_auc,
               'accuracy': accuracy}
    return metrics

def compute_metrics(p: EvalPrediction):
    preds = p.predictions[0] if isinstance(p.predictions, 
            tuple) else p.predictions
    result = multi_label_metrics(
        predictions=preds, 
        labels=p.label_ids)
    return result
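
For intuition about the numbers these metrics produce, here is a hand-rolled sketch of micro-averaged F1 and subset accuracy on a tiny example. The helper names and the toy labels are invented for illustration; in the notebook itself the scikit-learn functions above do this work.

```python
def micro_f1(y_true, y_pred):
    """Pool true/false positives over all (sample, label) cells, then compute F1."""
    pairs = [(t, p) for row_t, row_p in zip(y_true, y_pred)
             for t, p in zip(row_t, row_p)]
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def subset_accuracy(y_true, y_pred):
    """A sample is correct only if every one of its labels matches."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# two toy movies with three possible labels each
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0]]
f1 = micro_f1(y_true, y_pred)
acc = subset_accuracy(y_true, y_pred)
```

Because subset accuracy needs every label of a movie to be right, it can stay close to 0 even when micro F1 shows that many individual labels are predicted correctly.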

Let's verify a batch as well as a forward pass:

In [28]:
encoded_dataset['train'][0]['labels'].type()

'torch.FloatTensor'

In [29]:
encoded_dataset['train']['input_ids'][0]

tensor([  101,  2419,  2011, 13703,  1010,  5557,  1005,  1055, 10899,  2444,
        11361,  1999,  2010,  2282,  2127,  5557,  1005,  1055,  5798,  7545,
        12610,  2422, 29100,  3031,  1996,  3496,  1012,  4452,  1997,  3974,
         2010,  2173,  1999,  5557,  1005,  1055,  2540,  1010, 13703, 14811,
         2114, 12610,  1012,  2021,  2043,  6214,  3584, 12610,  1998, 13703,
         2013,  2037,  3954,  1010,  1996,  6829,  2776, 10229,  2000,  2404,
         4998,  2037,  5966,  1012,   102,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0])

In [36]:
#forward pass
outputs = model(
    input_ids=encoded_dataset['train']['input_ids'][0].unsqueeze(0), 
    labels=encoded_dataset['train'][0]['labels'].unsqueeze(0))
outputs

SequenceClassifierOutput(loss=tensor(0.7168, grad_fn=<BinaryCrossEntropyWithLogitsBackward0>), logits=tensor([[ 0.4976,  0.7592, -0.1099, -0.1505,  0.3372, -0.2077, -0.2145, -0.0220,
         -0.2930, -0.0139, -0.3660,  0.0062,  0.4493, -0.4902,  0.1032, -0.1988,
         -0.3099]], grad_fn=<AddmmBackward0>), hidden_states=(tensor([[[ 0.1686, -0.2858, -0.3261,  ..., -0.0276,  0.0383,  0.1640],
         [ 0.8878,  0.2431,  0.2230,  ...,  0.3238,  0.7185, -0.1581],
         [-0.0733,  0.2378,  0.6102,  ...,  0.3972, -0.2210,  0.4207],
         ...,
         [ 0.0159, -0.4712,  0.2090,  ..., -0.2394, -0.3774,  0.1976],
         [ 0.1598, -0.3782,  0.1954,  ..., -0.1481, -0.5739,  0.1019],
         [-0.0436, -0.5786,  0.5365,  ..., -0.2000, -0.4886, -0.0453]]],
       grad_fn=<NativeLayerNormBackward0>), tensor([[[ 0.0451, -0.0058, -0.2221,  ...,  0.2417, -0.1131,  0.0159],
         [ 0.7613, -0.0737, -0.2307,  ...,  0.2185,  0.9615, -0.3208],
         [-0.2301, -0.3402,  0.6632,  ...,  0.

Let's start training.

In [37]:
trainer = Trainer(
    model,
    args,
    # TODO are these datasets on GPU or CPU?
    train_dataset=encoded_dataset["train"],
    eval_dataset=encoded_dataset["test"],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

In [38]:
trainer.train()

***** Running training *****
  Num examples = 82
  Num Epochs = 5
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 55
***** Running Evaluation *****
  Num examples = 18
  Batch size = 8

Saving model checkpoint to bert-finetuned-sem_eval-english\checkpoint-11
Configuration saved in bert-finetuned-sem_eval-english\checkpoint-11\config.json


{'eval_loss': 0.5841608643531799, 'eval_f1': 0.1927710843373494, 'eval_roc_auc': 0.5334100322135297, 'eval_accuracy': 0.0, 'eval_runtime': 4.2447, 'eval_samples_per_second': 4.241, 'eval_steps_per_second': 0.707, 'epoch': 1.0}


Model weights saved in bert-finetuned-sem_eval-english\checkpoint-11\pytorch_model.bin
tokenizer config file saved in bert-finetuned-sem_eval-english\checkpoint-11\tokenizer_config.json
Special tokens file saved in bert-finetuned-sem_eval-english\checkpoint-11\special_tokens_map.json
***** Running Evaluation *****
  Num examples = 18
  Batch size = 8

Saving model checkpoint to bert-finetuned-sem_eval-english\checkpoint-22
Configuration saved in bert-finetuned-sem_eval-english\checkpoint-22\config.json


{'eval_loss': 0.5170652270317078, 'eval_f1': 0.3448275862068965, 'eval_roc_auc': 0.6087436723423838, 'eval_accuracy': 0.05555555555555555, 'eval_runtime': 6.8896, 'eval_samples_per_second': 2.613, 'eval_steps_per_second': 0.435, 'epoch': 2.0}


Model weights saved in bert-finetuned-sem_eval-english\checkpoint-22\pytorch_model.bin
tokenizer config file saved in bert-finetuned-sem_eval-english\checkpoint-22\tokenizer_config.json
Special tokens file saved in bert-finetuned-sem_eval-english\checkpoint-22\special_tokens_map.json
***** Running Evaluation *****
  Num examples = 18
  Batch size = 8

Saving model checkpoint to bert-finetuned-sem_eval-english\checkpoint-33
Configuration saved in bert-finetuned-sem_eval-english\checkpoint-33\config.json


{'eval_loss': 0.4737803041934967, 'eval_f1': 0.29629629629629634, 'eval_roc_auc': 0.5881270133456051, 'eval_accuracy': 0.1111111111111111, 'eval_runtime': 7.0915, 'eval_samples_per_second': 2.538, 'eval_steps_per_second': 0.423, 'epoch': 3.0}


Model weights saved in bert-finetuned-sem_eval-english\checkpoint-33\pytorch_model.bin
tokenizer config file saved in bert-finetuned-sem_eval-english\checkpoint-33\tokenizer_config.json
Special tokens file saved in bert-finetuned-sem_eval-english\checkpoint-33\special_tokens_map.json
***** Running Evaluation *****
  Num examples = 18
  Batch size = 8

Saving model checkpoint to bert-finetuned-sem_eval-english\checkpoint-44
Configuration saved in bert-finetuned-sem_eval-english\checkpoint-44\config.json


{'eval_loss': 0.4563661217689514, 'eval_f1': 0.1702127659574468, 'eval_roc_auc': 0.5450069028992177, 'eval_accuracy': 0.1111111111111111, 'eval_runtime': 7.3994, 'eval_samples_per_second': 2.433, 'eval_steps_per_second': 0.405, 'epoch': 4.0}


Model weights saved in bert-finetuned-sem_eval-english\checkpoint-44\pytorch_model.bin
tokenizer config file saved in bert-finetuned-sem_eval-english\checkpoint-44\tokenizer_config.json
Special tokens file saved in bert-finetuned-sem_eval-english\checkpoint-44\special_tokens_map.json
***** Running Evaluation *****
  Num examples = 18
  Batch size = 8

Saving model checkpoint to bert-finetuned-sem_eval-english\checkpoint-55
Configuration saved in bert-finetuned-sem_eval-english\checkpoint-55\config.json


{'eval_loss': 0.4487520754337311, 'eval_f1': 0.20833333333333334, 'eval_roc_auc': 0.5572020248504372, 'eval_accuracy': 0.1111111111111111, 'eval_runtime': 6.4136, 'eval_samples_per_second': 2.807, 'eval_steps_per_second': 0.468, 'epoch': 5.0}


Model weights saved in bert-finetuned-sem_eval-english\checkpoint-55\pytorch_model.bin
tokenizer config file saved in bert-finetuned-sem_eval-english\checkpoint-55\tokenizer_config.json
Special tokens file saved in bert-finetuned-sem_eval-english\checkpoint-55\special_tokens_map.json


Training completed. Do not forget to share your model on huggingface.co/models =)


Loading best model from bert-finetuned-sem_eval-english\checkpoint-22 (score: 0.3448275862068965).
                                     
100%|██████████| 55/55 [08:43<00:00,  9.52s/it]

{'train_runtime': 523.5059, 'train_samples_per_second': 0.783, 'train_steps_per_second': 0.105, 'train_loss': 0.5314006458629261, 'epoch': 5.0}





TrainOutput(global_step=55, training_loss=0.5314006458629261, metrics={'train_runtime': 523.5059, 'train_samples_per_second': 0.783, 'train_steps_per_second': 0.105, 'train_loss': 0.5314006458629261, 'epoch': 5.0})
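
The checkpoint numbers in the log can be reproduced with a little arithmetic: 82 training examples at a batch size of 8 give 11 optimization steps per epoch, so epoch 2 ends at step 22. The sketch below recomputes this and picks the epoch with the highest `eval_f1`, using values copied (and rounded) from the log above. It only mirrors what `load_best_model_at_end=True` with `metric_for_best_model` set to `"f1"` selected; it is not Trainer code.

```python
import math

# values reported in the training log
num_examples, batch_size = 82, 8
steps_per_epoch = math.ceil(num_examples / batch_size)

# eval_f1 per epoch, copied (rounded) from the training log
eval_f1 = {1: 0.1928, 2: 0.3448, 3: 0.2963, 4: 0.1702, 5: 0.2083}

# the best epoch is the one with the highest F1 score
best_epoch = max(eval_f1, key=eval_f1.get)
best_checkpoint = f"checkpoint-{best_epoch * steps_per_epoch}"
```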

Once the model is done training (i.e., fine-tuning), we should save it so that it is easy to reuse in the future.
We will use the `model.save_pretrained(<PATH_TO_STORAGE_DIRECTORY>)` method that Hugging Face provides.

In the future, if we want to retrieve this fine-tuned model, all we have to do is call `AutoModelForSequenceClassification.from_pretrained(<PATH_TO_STORAGE_DIRECTORY>)`.

In [None]:
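PATH_TO_MODEL_PICKELING_DIRECTORY = './bert_model'

import os

# create the storage directory if it does not exist yet
os.makedirs(PATH_TO_MODEL_PICKELING_DIRECTORY, exist_ok=True)

# save the fine-tuned weights and config, as described above
model.save_pretrained(PATH_TO_MODEL_PICKELING_DIRECTORY)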



## Extracting Embeddings from BERT

Now that the model is trained, we need to extract the embeddings from one of its last hidden layers (we use the second-to-last layer, see the note below).

1. Add special tokens (`[CLS]`: start and `[SEP]`: end) to each sentence.
2. Tokenize each sentence according to BERT's vocabulary.
3. Put the label with the sentence.
> Note: The first 3 steps are already done in our `encoded_dataset`
4. Set the `model` to the evaluation mode
5. Stop keeping track of the gradient
6. Pass the tokens tensor and the labels tensor to the `model` and save the output into a variable. This simply passes the inputs through one full forward pass of the model.
7. Extract the embeddings from this output variable, from the second-to-last layer. \*\*
8. Combine the embeddings of the words to get an embedding of the sentence. For one sentence, the second-to-last layer has size `(int(tokens_max_length), 768)`, and for a batch of size `batch_size` it has size `(batch_size, int(tokens_max_length), 768)`, where `768` is the number of hidden features. To combine, we average the features over the words to get a final vector of size `(batch_size, 768)`.

#### \*\* Deciding on which layer to use

> Taken from [the FAQ](https://github.com/jina-ai/clip-as-service#speech_balloon-faq) of CLIP-as-service.

1. The embeddings start out in the first layer as having no contextual information (i.e., the meaning of a word does not take into account what the word means in this specific sentence).
2. As the embeddings move deeper into the network, they pick up more and more contextual information with each layer.
3. As you approach the final layer, however, you start picking up information that is specific to BERT’s pre-training tasks (the “Masked Language Model” (MLM) and “Next Sentence Prediction” (NSP)).
    * What we want is embeddings that encode the word meaning well…
    * BERT is motivated to do this, but it is also motivated to encode anything else that would help it determine what a missing word is (MLM), or whether the second sentence came after the first (NSP).
4. The second-to-last layer is what CLIP-as-service settled on as a reasonable sweet-spot.
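
In code, choosing a layer is just indexing into `outputs.hidden_states`. For `bert-base-uncased` this tuple has 13 entries: the embedding output followed by the outputs of the 12 encoder layers, so index `-2` is the output of the 11th encoder layer. The sketch below uses strings in place of tensors to show the indexing only; `pick_layer` is a hypothetical helper.

```python
NUM_ENCODER_LAYERS = 12  # bert-base-uncased

# stand-ins for the tensors in outputs.hidden_states
hidden_states = tuple(
    ["embeddings"]
    + [f"encoder_layer_{i}" for i in range(1, NUM_ENCODER_LAYERS + 1)]
)

def pick_layer(hidden_states, index=-2):
    """Return the hidden state at the given position (default: second-to-last)."""
    return hidden_states[index]

chosen = pick_layer(hidden_states)
```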

In [70]:
# set the model to evaluation mode (this disables dropout)
model.eval()

# Run the text through BERT, and collect all of the hidden states produced
# by every layer. torch.no_grad() stops tracking gradients, which saves memory.
with torch.no_grad():

    # this evaluates the model (a.k.a. finds the outputs) for only one sentence
    # outputs = model(
    #     input_ids=encoded_dataset['train']['input_ids'][10].unsqueeze(0),
    #     labels=encoded_dataset['train'][10]['labels'].unsqueeze(0))

    # this evaluates the model (a.k.a. finds the outputs) for the first 10 sentences
    # TODO we will do the same for batches (maybe size 64)
    outputs = model(
        input_ids=encoded_dataset['train']['input_ids'][:10],
        # pass the attention mask so the [PAD] tokens are ignored by the attention layers
        attention_mask=encoded_dataset['train']['attention_mask'][:10],
        labels=encoded_dataset['train'][:10]['labels'])

    # get the features in the second-to-last layer
    # (the variable keeps its old name because the next cell uses it)
    # TODO is it better to get the features in the second last layer [-2] than the last layer [-1]
    last_layer = outputs.hidden_states[-2]

In [71]:
# to get the representation of a sentence, combine all the words embeddings (by averaging them) to become a single sentence embedding
# in other words, we want to get rid of the middle dimension (the <SENTENCE_LENGTH> dimension), which is at index 1 of the shape
# the shape is (batch_size, sentence_length, num_features_in_hidden_state)
sentence_embedding = last_layer.mean(1)
sentence_embedding.shape

torch.Size([10, 768])
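
One caveat with a plain mean over the token axis is that the `[PAD]` positions are averaged in as well. A common alternative is to weight each token by its `attention_mask` value so that only real tokens contribute. The sketch below shows the idea on nested Python lists with toy numbers; `masked_mean` is a hypothetical helper, and on tensors the same computation would be a masked sum divided by the number of real tokens.

```python
def masked_mean(token_vectors, attention_mask):
    """Average token vectors, skipping positions where the mask is 0."""
    n_real = sum(attention_mask)
    dim = len(token_vectors[0])
    if n_real == 0:
        return [0.0] * dim
    return [
        sum(vec[d] * m for vec, m in zip(token_vectors, attention_mask)) / n_real
        for d in range(dim)
    ]

# one sentence: two real tokens followed by two [PAD] positions
tokens = [[1.0, 3.0], [3.0, 5.0], [9.0, 9.0], [9.0, 9.0]]
mask = [1, 1, 0, 0]

pooled = masked_mean(tokens, mask)
# plain mean over all positions, padding included, for comparison
plain = [sum(v[d] for v in tokens) / len(tokens) for d in range(2)]
```

In this toy case the padded positions pull the plain mean away from the two real tokens, while the masked mean ignores them.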

Now that we have a vector that represents each sentence, we need to write it to a CSV file.

In [75]:
# Importing library
import csv

SENTENCE_EMBEDDING_FILE = 'sentence_embeddings.csv'

with open(SENTENCE_EMBEDDING_FILE, 'a', newline='') as f:
    writer = csv.writer(f)

    for sentence in sentence_embedding:
        writer.writerow(sentence.tolist())
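
As written, the rows carry no movie identifier, so an embedding cannot be matched back to its movie later, and append mode (`'a'`) adds duplicate rows if the cell is re-run. A possible variant, sketched with toy values, writes a header plus the movie `id` in the first column and opens the file in write mode. The helper name, the demo file name, and the two-dimensional vectors are assumptions for illustration.

```python
import csv
import os
import tempfile

def write_embeddings(path, movie_ids, embeddings):
    """Write one row per movie: its id followed by the embedding values."""
    dim = len(embeddings[0])
    with open(path, "w", newline="") as f:  # "w" avoids duplicates on re-run
        writer = csv.writer(f)
        writer.writerow(["id"] + [f"dim_{i}" for i in range(dim)])
        for movie_id, vector in zip(movie_ids, embeddings):
            writer.writerow([movie_id] + list(vector))

# toy ids and vectors, written to a temporary directory
demo_path = os.path.join(tempfile.mkdtemp(), "sentence_embeddings_demo.csv")
write_embeddings(demo_path, [862, 8844], [[0.1, 0.2], [0.3, 0.4]])

# read the file back to check its layout
with open(demo_path, newline="") as f:
    rows = list(csv.reader(f))
```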


It's best to zip this csv as it will have a large size and might be difficult to upload.

In [77]:
encoded_dataset['train'][10]

{'input_ids': tensor([  101,  2019,  2035,  1011,  2732,  3459,  4204,  2023,  8680,  2298,
          2012,  2137,  2343,  2957,  1049,  1012, 11296,  1010,  1037,  2158,
          4755,  1996,  6580,  1997,  1996,  2088,  2006,  2010,  4065,  2096,
         17773,  1996,  2969,  1011, 15615,  7670,  2306,  1012, 13912,  2010,
         11587,  2879,  9021,  1999,  2662,  2000,  1996, 16880,  2300,  5867,
          9446,  2008,  2052,  2203,  2010,  8798,  1012,   102,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0]),
 'token_type_ids': tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0