`02_transformer_model.ipynb`
============================

> *Attribution note: portions of the code in this notebook are borrowed from [another notebook](https://github.com/disinfo-detectors/tweet-turing-test/blob/main/src/05_BERT_fine_tuner.ipynb), written by one of our team members (Justin Minnion) for another class (DSCI 591/592).*

**Notes**
 - Section 0 "Setup" covers the external package dependencies (i.e. there are no `import` statements outside Section 0) for quick review, as well as loading the initial dataset from a CSV file. Section 0 should always be executed first and in its entirety.
 - Due to the large size of pretrained models, the best practice for this notebook is to restart the Jupyter kernel between model sections. That is to say, if Section 1 "Basic Transformer" has all of its cells executed with a GPU-equipped environment, it's recommended to not continue on to run other model sections without first restarting the kernel (thereby clearing the data stored in RAM/VRAM). If you have gobs of VRAM, disregard this best practice.
 - In the same theme as the prior point, for consistency and code re-use, each model section will often use a common set of variable names, e.g. "`model`" is used as the variable name for all models. Because of this, sections are not intended to be run in an intermixed manner, as the common namespace would promote negative side effects.
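
If a kernel restart is inconvenient, the large objects can sometimes be released by hand instead. The sketch below is a rough alternative, not a replacement for the restart recommended above: `free_model_memory` is a hypothetical helper (not part of this notebook), and the default names it removes are assumptions based on the shared variable names mentioned above. Other references (for example Jupyter's output cache) can keep objects alive, which is why a restart remains the more reliable option.

```python
import gc

def free_model_memory(namespace: dict, names=("model", "trainer", "tokenizer")) -> list:
    """Remove large objects from a namespace and release cached GPU memory.

    `namespace` would typically be globals() in a notebook (hypothetical usage).
    Returns the list of names that were actually removed.
    """
    removed = [name for name in names if namespace.pop(name, None) is not None]
    gc.collect()  # reclaim Python-side memory held by the dropped objects

    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # hand cached, unused VRAM back to the driver
    except ImportError:
        pass  # torch not installed; nothing GPU-side to release

    return removed
```

In a notebook this might be called as `free_model_memory(globals())` before starting the next model section.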

**Assumptions / Requirements**
 - Hardware: a CUDA-capable GPU is required (or at least "encouraged in the strongest terms") for execution of the transformer model fine-tuning. During development, maximum instantaneous VRAM usage of ~6 GB was observed, though RoBERTa models can quickly exceed that for larger batch sizes.
 - Software
   - Python version - development was performed using Python 3.11.2, though specific features added in 3.10 and 3.11 are not used. Some [PEP-585](https://www.python.org/dev/peps/pep-0585)-style type hints are used which would call for Python 3.9 at a minimum.
   - Other packages - Section 0.1.1 shows the top-level package requirements, with each imported package having numerous dependencies (not explicitly shown, but able to be resolved with `pip`).
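
As a small illustration of the PEP 585 point, the sketch below uses built-in generics (`list[str]`, `dict[str, int]`) in a function signature. The function `count_lines_by_speaker` is hypothetical and not part of this notebook. On Python 3.8 or earlier, the `def` line itself would raise a `TypeError` unless `from __future__ import annotations` is used.

```python
import sys
from collections import Counter

# PEP 585 generics in annotations are evaluated when the function is defined,
# so this assertion documents the minimum version rather than guarding the def.
assert sys.version_info >= (3, 9), "PEP 585 built-in generics need Python 3.9+"

def count_lines_by_speaker(speakers: list[str]) -> dict[str, int]:
    """Count how many lines each speaker has (hypothetical helper)."""
    return dict(Counter(speakers))

print(count_lines_by_speaker(["michael", "jim", "michael"]))
```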

# 0 - Setup

## 0.1 - Definitions

### 0.1.1 - Package Imports

In [1]:
# imports from python standard library
import re
from pathlib import Path

# data science packages
import nltk
import numpy as np
import pandas as pd
import plotly.express as px
import torch
from nltk.tokenize import word_tokenize

# huggingface packages
import evaluate
from datasets import Dataset, DatasetDict, ClassLabel
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification
from transformers import RobertaTokenizerFast, RobertaForSequenceClassification
from transformers import TrainingArguments, Trainer

### 0.1.2 - Constants

In [2]:
# file locations
DATA_DIR = Path("./data")
DATA_DIR_PROCESSED = DATA_DIR / "processed"
PROCESSED_DATA = DATA_DIR_PROCESSED / "script_data_processed.csv"

MODEL_DIR = DATA_DIR / "models"

### 0.1.3 - Options

In [3]:
pd.set_option('display.max_colwidth', None)

nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\jminn\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### 0.1.4 - Helper Functions

In [4]:
def inspect_tokens(tokenizer, encoded_text: dict):
    '''Prints the provided encoded text as its original text and as its tokenized form.
        - tokenizer is an instantiated huggingface tokenizer (sub-subclass of PreTrainedTokenizerBase)
        - encoded_text is the dict created from one element of a huggingface dataset
        '''
    vocab = tokenizer.get_vocab()
    inverse_vocab = {v: k for (k, v) in vocab.items()}

    tokens_list = [inverse_vocab[i] for i in encoded_text['input_ids']]
    tokens_list_attention = [tokens_list[i] for i in range(len(tokens_list)) if (encoded_text['attention_mask'][i] == 1)]

    print("-"*50)
    print(f"Original text:\n\t{encoded_text['text']}", end="\n\n")
    print(f"Label:\t{encoded_text['label']}", end="\n\n")
    print(f"Tokenized form:\n\t{' '.join(tokens_list)}", end="\n\n")
    print(f"Tokens as a list:\n\t{tokens_list}", end="\n\n")
    print(f"Tokens as a list, attention mask applied:\n\t{tokens_list_attention}", end="\n\n")

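def informal_test(tokenizer, model, test_line):
    '''Prints the model's predicted speaker (and the raw logits) for a single line of text.
        - tokenizer is an instantiated huggingface tokenizer
        - model is a fine-tuned huggingface sequence classification model
        - test_line is the text (str) to classify
        '''
    with torch.no_grad():
        # place the inputs on the same device as the model (GPU or CPU)
        beets_or_not = tokenizer(test_line, return_tensors='pt').to(model.device)
        result = model(**beets_or_not)
        y_hat = result.logits.argmax().item()
        print(f'{"Test Line: ":>20}', f'"{test_line}"')
        print(f'{"Predicted Speaker: ":>20}', model.config.id2label[y_hat], f"({y_hat})")
        print(f'{"Logits: ":>20}', result.logits)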
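
To make the logic of `inspect_tokens` concrete without loading a real tokenizer, here is a toy sketch of its two core steps: inverting the vocabulary to map ids back to tokens, and filtering by the attention mask. The vocabulary and ids below are made up for illustration and do not come from any pretrained tokenizer.

```python
# toy vocabulary (token -> id); a real tokenizer's get_vocab() has the same shape
toy_vocab = {"[PAD]": 0, "[CLS]": 1, "[SEP]": 2, "bears": 3, "beets": 4}
inverse_vocab = {v: k for k, v in toy_vocab.items()}

# toy encoded example, padded to length 6
encoded = {
    "input_ids":      [1, 3, 4, 2, 0, 0],
    "attention_mask": [1, 1, 1, 1, 0, 0],
}

# map each id back to its token string
tokens = [inverse_vocab[i] for i in encoded["input_ids"]]

# keep only positions the model attends to (mask == 1), i.e. drop padding
tokens_attended = [tok for tok, m in zip(tokens, encoded["attention_mask"]) if m == 1]

print(tokens)
print(tokens_attended)
```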

## 0.2 - Load Data

In [5]:
script_df = pd.read_csv(
    filepath_or_buffer=PROCESSED_DATA,
    header=0,
    index_col=0,
    encoding='utf-8'
)

In [6]:
script_df.head(3)

Unnamed: 0,season,episode,title,scene,speaker,line,directed_by,written_by,writer1,writer2,writer3
0,1,1,Pilot,1,michael,All right Jim. Your quarterlies look very good. How are things at the library?,Ken Kwapis,Ricky Gervais & Stephen Merchant and Greg Daniels,Ricky Gervais,Stephen Merchant,Greg Daniels
1,1,1,Pilot,1,jim,"Oh, I told you. I couldn't close it. So...",Ken Kwapis,Ricky Gervais & Stephen Merchant and Greg Daniels,Ricky Gervais,Stephen Merchant,Greg Daniels
2,1,1,Pilot,1,michael,"So you've come to the master for guidance? Is this what you're saying, grasshopper?",Ken Kwapis,Ricky Gervais & Stephen Merchant and Greg Daniels,Ricky Gervais,Stephen Merchant,Greg Daniels


In [7]:
# examine numeric fields
script_df.describe()

Unnamed: 0,season,episode,scene
count,54267.0,54267.0,54267.0
mean,5.538099,12.490003,4190.521606
std,2.349106,7.286262,2294.821819
min,1.0,1.0,1.0
25%,3.0,6.0,2325.0
50%,6.0,12.0,4215.0
75%,8.0,18.0,6153.0
max,9.0,28.0,8157.0


In [8]:
script_df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 54267 entries, 0 to 54266
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   season       54267 non-null  int64 
 1   episode      54267 non-null  int64 
 2   title        54267 non-null  object
 3   scene        54267 non-null  int64 
 4   speaker      54267 non-null  object
 5   line         54267 non-null  object
 6   directed_by  54267 non-null  object
 7   written_by   54267 non-null  object
 8   writer1      54267 non-null  object
 9   writer2      9816 non-null   object
 10  writer3      699 non-null    object
dtypes: int64(3), object(8)
memory usage: 29.1 MB


While the dataset isn't particularly large, we can improve performance / memory footprint if we are more prescriptive with `dtype` settings. At a minimum we should aim for no "`object`" type columns.

In [9]:
dtype_mapping = {
    'season': 'int8',
    'episode': 'int8',
    'title': 'string',
    'scene': 'int16',
    'speaker': 'string',    # could be category if we limit to top 10 speakers
    'line': 'string',
    'directed_by': 'category',
    'written_by': 'string',
    'writer1': 'category',
    'writer2': 'category',
    'writer3': 'category',
}

script_df = script_df.astype(dtype_mapping)

script_df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 54267 entries, 0 to 54266
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   season       54267 non-null  int8    
 1   episode      54267 non-null  int8    
 2   title        54267 non-null  string  
 3   scene        54267 non-null  int16   
 4   speaker      54267 non-null  string  
 5   line         54267 non-null  string  
 6   directed_by  54267 non-null  category
 7   written_by   54267 non-null  string  
 8   writer1      54267 non-null  category
 9   writer2      9816 non-null   category
 10  writer3      699 non-null    category
dtypes: category(4), int16(1), int8(2), string(4)
memory usage: 17.4 MB


# 1 - Basic Transformer

This section fine-tunes a basic transformer model with minimal customization, to establish a performance baseline among transformer-type models.

**Task**: Sequence Classification (Binary)

**Classes**: 
 - Positive (1): "Dwight" - a line is spoken by the character Dwight K. Schrute (played by Rainn Wilson).
 - Negative (0): "Not Dwight" - a line is spoken by any other character than Dwight.

**Data**:
 - `speaker` as precursor to the class label, limited to the top-10 most frequent speakers by number of lines in the dataset.
 - `line` as sequence text.

**Encoding**:
 - Tokenizer: DistilBertTokenizerFast
 - Max Sequence Length: 128
 - Padding: True
 - Truncate: True

**Pretrained Model**:
 - DistilBert (`distilbert-base-uncased`) [(link: huggingface.co)](https://huggingface.co/distilbert-base-uncased) - Intended to mimic the standard "BERTbase" model but in a smaller/faster/more efficient way.
 - Citation: Sanh et al. "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter" (2019) - [https://arxiv.org/pdf/1910.01108.pdf](https://arxiv.org/pdf/1910.01108.pdf)

**Training**:
 - Train/Test/Validation Split: 50/25/25

**Notes**:
 - Class imbalance is present (positive: 6,752; negative: 32,668; about `1:4.8` imbalance ratio).
 - Vocabulary: no modifications made to pretrained transformer's vocabulary.
 - Secondary data: no inclusion of secondary data (director/writer credits).

## 1.1 - Dataset - Convert `pandas` -> 🤗 `dataset`

In [9]:
# limit to top 10 most frequent speakers
top_10_speaker_list = script_df['speaker'].value_counts(normalize=True).nlargest(10).index.tolist()
columns_to_keep = ['speaker', 'line']

script_df_subset = script_df.loc[script_df['speaker'].isin(top_10_speaker_list), columns_to_keep]

script_df_subset

Unnamed: 0,speaker,line
0,michael,All right Jim. Your quarterlies look very good. How are things at the library?
1,jim,"Oh, I told you. I couldn't close it. So..."
2,michael,"So you've come to the master for guidance? Is this what you're saying, grasshopper?"
3,jim,"Actually, you called me in here, but yeah."
4,michael,"All right. Well, let me show you how it's done."
...,...,...
54257,kevin,"No, but maybe the reason..."
54258,oscar,You're not gay.
54260,erin,"How did you do it? How did you capture what it was really like? How we felt and how made each other laugh and how we got through the day? How did you do it? Also, how do cameras work?"
54265,jim,"I sold paper at this company for 12 years. My job was to speak to clients on the phone about quantities and types of copier paper. Even if I didn't love every minute of it, everything I have, I owe to this job. This stupid...wonderful...boring...amazing job."


In [10]:
# rename the 'line' column to be 'text'
script_df_subset = script_df_subset.rename(columns={'line': 'text'})

In [11]:
# create class label column
dwight_mask = (script_df_subset['speaker'] == 'dwight')

# new column of zeros
script_df_subset['label'] = 0

# apply the Dwight mask (as seen in the CPR scene of S05E14 "Stress Relief")
script_df_subset.loc[dwight_mask, 'label'] = 1

# adjust dtype
script_df_subset['label'] = script_df_subset['label'].astype('int8')    
    # would love to use 'category', but not implemented in 🤗 datasets

# check results
script_df_subset['label'].value_counts()

0    32668
1     6752
Name: label, dtype: int64

In [12]:
script_df_subset.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 39420 entries, 0 to 54266
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   speaker  39420 non-null  string
 1   text     39420 non-null  string
 2   label    39420 non-null  int8  
dtypes: int8(1), string(2)
memory usage: 7.0 MB


In [13]:
# finally, convert to 🤗 dataset object
#   drop 'speaker' by way of not including it
dataset_full: Dataset = Dataset.from_pandas(script_df_subset[['text', 'label']].reset_index(drop=False)) \
                    .cast_column('label', ClassLabel(names=['not_dwight', 'dwight']))

# make sure we got the class labels mapped correctly
assert (dataset_full.features['label'].str2int('dwight') == 1)

Casting the dataset:   0%|          | 0/39420 [00:00<?, ? examples/s]

In [14]:
dataset_full

Dataset({
    features: ['index', 'text', 'label'],
    num_rows: 39420
})

## 1.2 - Train/Test/Val Split

As of v2.12.0, the 🤗 Datasets implementation of `train_test_split` is limited to outputting **two** splits only (train/test), so we'll perform the split twice to obtain train, test, and validation splits.

In [15]:
# set parameters
train_size = 0.50
test_size = 0.25
valid_size = 0.25

assert sum([train_size, test_size, valid_size]) == 1.0

split_random_seed = 27  # for Weird Al fans

first_split = dataset_full.train_test_split(
    test_size=(1.0 - train_size),
    shuffle=True,
    seed=split_random_seed,
    stratify_by_column='label'
)

second_split = first_split['test'].train_test_split(
    test_size=((valid_size) / (test_size + valid_size)),
    shuffle=True,
    seed=split_random_seed,
    stratify_by_column='label'
)

ds_dict = DatasetDict({
    'train': first_split['train'],
    'test': second_split['train'],
    'valid': second_split['test']
})

ds_dict

DatasetDict({
    train: Dataset({
        features: ['index', 'text', 'label'],
        num_rows: 19710
    })
    test: Dataset({
        features: ['index', 'text', 'label'],
        num_rows: 9855
    })
    valid: Dataset({
        features: ['index', 'text', 'label'],
        num_rows: 9855
    })
})
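
The arithmetic behind the two-step split can be checked by hand. The sketch below only reproduces the fractions and the resulting row counts; it does not call `train_test_split`, and the variable names `n_holdout`, `first_holdout_fraction`, and `second_holdout_fraction` are introduced here for illustration.

```python
n_total = 39420  # rows in dataset_full
train_size, test_size, valid_size = 0.50, 0.25, 0.25

# first split: hold out everything that is not training data
first_holdout_fraction = 1.0 - train_size
n_train = round(n_total * train_size)
n_holdout = n_total - n_train

# second split: divide the holdout between test and validation;
# the fraction is relative to the holdout, not to the full dataset
second_holdout_fraction = valid_size / (test_size + valid_size)
n_valid = round(n_holdout * second_holdout_fraction)
n_test = n_holdout - n_valid

print(n_train, n_test, n_valid)
```

The key point is that the second `test_size` must be rescaled: 25% of the full dataset is 50% of the held-out half.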

In [16]:
# confirm stratified sample
num_negative = ds_dict['train'].to_pandas()['label'].value_counts()[0]
num_positive = ds_dict['train'].to_pandas()['label'].value_counts()[1]

print(f"ratio positive/negative is:\t1 to {num_negative/num_positive:0.1f}")

ratio positive/negative is:	1 to 4.8


## 1.3 - Tokenize and Encode

In [17]:
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

In [18]:
# tokenizer function
def tokenize_function(examples):
    return tokenizer(examples['text'], 
                     padding='longest', 
                     truncation=True, 
                     return_tensors='pt',
                     max_length=128)

ds_tokenized = ds_dict.map(
    tokenize_function, 
    batched=True, 
    batch_size=None)

Map:   0%|          | 0/19710 [00:00<?, ? examples/s]

Map:   0%|          | 0/9855 [00:00<?, ? examples/s]

Map:   0%|          | 0/9855 [00:00<?, ? examples/s]
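
Because `batch_size=None` hands each split to the tokenizer as a single batch, `padding='longest'` pads every example to the longest sequence in the whole split (capped by `max_length`). The toy sketch below illustrates that behavior on plain lists. It is a simplification, not the tokenizer's implementation: `pad_and_truncate` is a hypothetical helper, the ids are made up, and unlike a real tokenizer it does not preserve the trailing special token when truncating.

```python
def pad_and_truncate(batch, max_length, pad_id=0):
    """Truncate each sequence to max_length, then pad to the longest one left."""
    truncated = [seq[:max_length] for seq in batch]
    longest = max(len(seq) for seq in truncated)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in truncated]
    # 1 marks a real token, 0 marks padding
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in truncated]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

toy_batch = [[101, 7, 8, 102], [101, 7, 102], [101, 7, 8, 9, 10, 11, 102]]
out = pad_and_truncate(toy_batch, max_length=5)

for ids, mask in zip(out["input_ids"], out["attention_mask"]):
    print(ids, mask)
```

One long line in a split therefore forces padding onto every other example in it, which is why short lines end up followed by long runs of padding tokens.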

In [20]:
inspect_tokens(tokenizer, ds_tokenized['train'][27])
inspect_tokens(tokenizer, ds_tokenized['test'][42])

--------------------------------------------------
Original text:
	 Birthday time is over! Now go make up for all the work you missed when you were taking your nap.  Many happy returns. 

Label:	1

Tokenized form:
	[CLS] birthday time is over ! now go make up for all the work you missed when you were taking your nap . many happy returns . [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]

Tokens as a list:
	['[CLS]', 'birthday', 'time', 'is

## 1.4 - Model

Create model from pre-trained 🤗 transformer.

In [21]:
model = DistilBertForSequenceClassification.from_pretrained(
    'distilbert-base-uncased', 
    num_labels=2,
    id2label={idx: label for idx, label in enumerate(ds_dict['train'].features['label'].names)}
    )

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier
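
The `id2label` mapping passed above is built from the `ClassLabel` names, so index 1 corresponds to "dwight". The sketch below rebuilds it from a plain list and also derives the inverse mapping. The model configuration has a matching `label2id` field, though this notebook does not set it, so its use here is only a suggestion.

```python
names = ["not_dwight", "dwight"]  # same order as the ClassLabel names

# integer class index -> human-readable label
id2label = {idx: label for idx, label in enumerate(names)}

# inverse mapping: label -> integer class index
label2id = {label: idx for idx, label in id2label.items()}

print(id2label)
print(label2id)
```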

Setup training arguments:

In [22]:
start_time = pd.Timestamp.now().strftime(r'%Y%m%d_%H%M%S')  # yyyymmdd_hhmmss
run_name = f"basic_distilbert_{start_time}"

training_args = TrainingArguments(
    # model output
    run_name=run_name,
    output_dir=MODEL_DIR / run_name,
    save_strategy='epoch',
    save_total_limit=3,
    # training hyperparams
    num_train_epochs=5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    #gradient_accumulation_steps=4,
    #gradient_checkpointing=True,
    weight_decay=0.01,
    # evaluation during training
    evaluation_strategy='epoch',
    logging_strategy='epoch',
    log_level='warning',
)

Establish evaluation metrics:

In [23]:
# setup training / evaluation metric
#   Docs: https://huggingface.co/docs/evaluate/package_reference/main_classes#evaluate.combine
#   Each of these metrics corresponds to a script from huggingface, below are the links for each script.
#       accuracy:       https://huggingface.co/spaces/evaluate-metric/accuracy
#       f1:             https://huggingface.co/spaces/evaluate-metric/f1
#       precision:      https://huggingface.co/spaces/evaluate-metric/precision
#       recall:         https://huggingface.co/spaces/evaluate-metric/recall
metric_list = ['accuracy', 'f1', 'precision', 'recall']

metric = evaluate.combine(evaluations=metric_list)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

Finally, setup and run the 🤗 Trainer:

In [24]:
time_training_start = pd.Timestamp.now()

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=ds_tokenized['train'],
    eval_dataset=ds_tokenized['test'],
    compute_metrics=compute_metrics
)

result = trainer.train()

time_training_stop = pd.Timestamp.now()
time_training = time_training_stop - time_training_start

print("\nTraining duration:", str(time_training))



  0%|          | 0/3080 [00:00<?, ?it/s]

{'loss': 0.4387, 'learning_rate': 4e-05, 'epoch': 1.0}


  0%|          | 0/308 [00:00<?, ?it/s]

{'eval_loss': 0.4204602539539337, 'eval_accuracy': 0.8330796549974632, 'eval_f1': 0.0996168582375479, 'eval_precision': 0.6546762589928058, 'eval_recall': 0.05390995260663507, 'eval_runtime': 17.452, 'eval_samples_per_second': 564.692, 'eval_steps_per_second': 17.648, 'epoch': 1.0}
{'loss': 0.3663, 'learning_rate': 3e-05, 'epoch': 2.0}


  0%|          | 0/308 [00:00<?, ?it/s]

{'eval_loss': 0.43789076805114746, 'eval_accuracy': 0.8310502283105022, 'eval_f1': 0.30884184308841844, 'eval_precision': 0.5159500693481276, 'eval_recall': 0.22037914691943128, 'eval_runtime': 17.4079, 'eval_samples_per_second': 566.123, 'eval_steps_per_second': 17.693, 'epoch': 2.0}
{'loss': 0.2701, 'learning_rate': 2e-05, 'epoch': 3.0}


  0%|          | 0/308 [00:00<?, ?it/s]

{'eval_loss': 0.5542342662811279, 'eval_accuracy': 0.8284119736174531, 'eval_f1': 0.21531322505800468, 'eval_precision': 0.49678800856531047, 'eval_recall': 0.13744075829383887, 'eval_runtime': 17.614, 'eval_samples_per_second': 559.497, 'eval_steps_per_second': 17.486, 'epoch': 3.0}
{'loss': 0.1967, 'learning_rate': 1e-05, 'epoch': 4.0}


  0%|          | 0/308 [00:00<?, ?it/s]

{'eval_loss': 0.6433358788490295, 'eval_accuracy': 0.812886859462202, 'eval_f1': 0.3366906474820144, 'eval_precision': 0.42857142857142855, 'eval_recall': 0.2772511848341232, 'eval_runtime': 17.4646, 'eval_samples_per_second': 564.284, 'eval_steps_per_second': 17.636, 'epoch': 4.0}
{'loss': 0.1501, 'learning_rate': 0.0, 'epoch': 5.0}


  0%|          | 0/308 [00:00<?, ?it/s]

{'eval_loss': 0.7656663060188293, 'eval_accuracy': 0.8100456621004566, 'eval_f1': 0.3304721030042918, 'eval_precision': 0.41696750902527074, 'eval_recall': 0.273696682464455, 'eval_runtime': 17.5101, 'eval_samples_per_second': 562.819, 'eval_steps_per_second': 17.59, 'epoch': 5.0}
{'train_runtime': 587.8245, 'train_samples_per_second': 167.652, 'train_steps_per_second': 5.24, 'train_loss': 0.28436563417509003, 'epoch': 5.0}

Training duration: 0 days 00:09:48.068031
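
The `learning_rate` values in the log above (4e-05, 3e-05, ..., 0.0) are consistent with the `Trainer` defaults, since `training_args` sets neither a learning rate nor a scheduler: an initial rate of 5e-05 decayed linearly to zero with no warmup. The sketch below reproduces the step count and the logged rates under that assumption (and assuming a single GPU); it does not call the library's scheduler, and `linear_lr` is a hypothetical helper.

```python
import math

n_train = 19710        # rows in the training split
batch_size = 32        # per_device_train_batch_size (single GPU assumed)
num_epochs = 5
initial_lr = 5e-5      # assumed Trainer default, not set in the notebook

steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * num_epochs

def linear_lr(step: int) -> float:
    """Linearly decay from initial_lr to 0 over total_steps (no warmup)."""
    return initial_lr * max(0.0, 1.0 - step / total_steps)

print("total steps:", total_steps)
for epoch in range(1, num_epochs + 1):
    print(epoch, f"{linear_lr(epoch * steps_per_epoch):.1e}")
```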


Save the trained model:

In [25]:
trainer.save_model()    # saves to self.args.output_dir

## 1.5 - Evaluate

In [26]:
time_evaluation_start = pd.Timestamp.now()

final_metrics = {}
final_metrics['train'] = trainer.evaluate(eval_dataset=ds_tokenized['train'], metric_key_prefix='final_train')
final_metrics['test']= trainer.evaluate(eval_dataset=ds_tokenized['test'], metric_key_prefix='final_test')
final_metrics['valid'] = trainer.evaluate(eval_dataset=ds_tokenized['valid'], metric_key_prefix='validation')

time_evaluation_stop = pd.Timestamp.now()
time_evaluation = time_evaluation_stop - time_evaluation_start

print("\nEvaluation duration, what's the situation:", str(time_evaluation))

  0%|          | 0/616 [00:00<?, ?it/s]

  0%|          | 0/308 [00:00<?, ?it/s]

  0%|          | 0/308 [00:00<?, ?it/s]


Evaluation duration, what's the situation: 0 days 00:01:12.892828


In [27]:
for split in final_metrics:
    print(f"\n{split.upper():->10}{'-'*15}")
    for k, v in final_metrics[split].items():
        print(f"{v:>10.3f} - {k}")
    print("-"*25)


-----TRAIN---------------
     0.113 - final_train_loss
     0.961 - final_train_accuracy
     0.875 - final_train_f1
     0.978 - final_train_precision
     0.791 - final_train_recall
    36.321 - final_train_runtime
   542.667 - final_train_samples_per_second
    16.960 - final_train_steps_per_second
     5.000 - epoch
-------------------------

------TEST---------------
     0.766 - final_test_loss
     0.810 - final_test_accuracy
     0.330 - final_test_f1
     0.417 - final_test_precision
     0.274 - final_test_recall
    18.720 - final_test_runtime
   526.447 - final_test_samples_per_second
    16.453 - final_test_steps_per_second
     5.000 - epoch
-------------------------

-----VALID---------------
     0.775 - validation_loss
     0.806 - validation_accuracy
     0.322 - validation_f1
     0.402 - validation_precision
     0.268 - validation_recall
    17.825 - validation_runtime
   552.862 - validation_samples_per_second
    17.279 - validation_steps_per_second
     5.000 - epoch
-------------------------

## 1.6 - Discussion / Conclusions (on this attempt)

| Metrics (Train/Test/Valid)         | Accuracy              | F1 Score              | Precision             | Recall                | Fine-Tuning Time |
|------------------------------------|-----------------------|-----------------------|-----------------------|-----------------------|------------------|
| (1) Basic Transformer (DistilBERT) | 0.961 / 0.810 / 0.806 | 0.875 / 0.330 / 0.322 | 0.978 / 0.417 / 0.402 | 0.791 / 0.274 / 0.268 | 0d 0h 9m 48s     |

### 1.6.1 - Accuracy
Because of the class imbalance present in this run (1 positive to ~4.8 negative), we compare against the trivial classifier, i.e. "always call 'negative'". This trivial classifier would have an accuracy of 82.9% ( $\frac{32,668}{32,668+6,752}$ ). The test and validation accuracies (81.0% and 80.6%) both fall *below* 82.9%, which suggests the model is not performing better than the trivial classifier.
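
The baseline comparison is simple enough to verify directly. The sketch below recomputes the trivial-classifier accuracy from the class counts and compares it with the rounded test/validation accuracies from the table above; the variable names are introduced here for illustration.

```python
n_negative = 32668  # "Not Dwight" lines
n_positive = 6752   # "Dwight" lines

# accuracy of always predicting the majority (negative) class
baseline_accuracy = n_negative / (n_negative + n_positive)
imbalance_ratio = n_negative / n_positive

print(f"baseline accuracy: {baseline_accuracy:.3f}")
print(f"imbalance ratio:   1 to {imbalance_ratio:.1f}")

# rounded accuracies taken from the results table
for split, acc in {"test": 0.810, "valid": 0.806}.items():
    verdict = "beats baseline" if acc > baseline_accuracy else "below baseline"
    print(f"{split}: {acc:.3f} ({verdict})")
```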

### 1.6.2 - F1 Score
The final F1 scores differ sharply between the training set (0.875) and the test/validation sets (0.330 / 0.322). Although the test F1 score generally rose over the training epochs (from 0.100 after epoch 1 to 0.330 after epoch 5), it was not stable: it dropped from 0.309 to 0.215 between epochs 2 and 3 (at least at the per-epoch resolution shown).
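
As a reminder of how the metric is built, F1 is the harmonic mean of precision and recall. The sketch below recomputes the final test F1 from the unrounded epoch-5 test precision and recall logged during training; `f1_score` is a small stand-in written for illustration, not the 🤗 `evaluate` implementation.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# epoch-5 values from the training log (test split)
test_precision = 0.41696750902527074
test_recall = 0.273696682464455

test_f1 = f1_score(test_precision, test_recall)
print(f"{test_f1:.3f}")
```

Because the harmonic mean is pulled toward the smaller of the two values, the low recall keeps F1 well below the precision.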

### 1.6.3 - Precision / Recall
For this attempt and its Dwight/Not-Dwight task, we would likely weigh the importance of precision and recall as approximately equivalent, with a slight advantage to precision:

 - Higher precision is indicative of fewer false **positives**, more true positives, or both. While a true positive is intuitively desirable, a false positive (labeling a line's speaker as "Dwight" when the speaker is not Dwight) would be confusing to the end-user of the speaker labels.
 - Higher recall is indicative of fewer false **negatives**, more true positives, or both. A false negative in this case would represent a line spoken by Dwight being labeled as "Not-Dwight". Because our task is not attempting to classify further than "Not-Dwight", e.g. to identify which Not-Dwight speaker uttered the line, false negatives would more likely signal the need for further analysis. While outside the scope of this attempt, the downstream analysis could take the form of an ensemble approach to apply other modeling techniques (e.g. boosting) to the difficult-to-classify lines.

In short, false positives are clearly detrimental, while false negatives *may* be detrimental if downstream modeling can't differentiate the speaker. This supports the prior assertion that precision has a small advantage over recall in importance to our analysis. This also suggests that F1 score (noted above) is a valuable metric here, as it encapsulates both precision and recall.
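
The definitions above can be made concrete by counting outcomes directly. The labels and predictions in this sketch are a made-up toy example (1 = "Dwight", 0 = "Not Dwight"), not output from the model, and `precision_recall` is a hypothetical helper.

```python
def precision_recall(y_true: list[int], y_pred: list[int]) -> tuple[float, float]:
    """Compute precision and recall for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # Dwight, labeled Dwight
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # not Dwight, labeled Dwight
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # Dwight, labeled Not-Dwight
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Here one false positive halves the precision, while two missed Dwight lines cut recall to one third.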

Turning to the results of this attempt: as with the F1 score, both precision and recall show a stark difference between training and test/validation performance, suggesting the model is not generalizing well.

### 1.6.4 - Overall "Basic Transformer" Conclusion
We would not call this a successful model, most notably because its performance metrics fall significantly lower on the test/validation sets than on the training set.

# 2 - Modified Approach: Different Pretrained Model (RoBERTa)

We'll attempt to improve our model performance by starting with a "robustly optimized" (hence the "Ro") pretrained model, RoBERTa. Most of the details below will be kept the same for experimental control.

> NOTE: Differences from the "Basic Transformer" (Section 1) are noted with "`>>`" chevrons.

**Task**: Sequence Classification (Binary)

**Classes**: 
 - Positive (1): "Dwight" - a line is spoken by the character Dwight K. Schrute (played by Rainn Wilson).
 - Negative (0): "Not Dwight" - a line is spoken by any other character than Dwight.

**Data**:
 - `speaker` as precursor to the class label, limited to the top-10 most frequent speakers by number of lines in the dataset.
 - `line` as sequence text.

**Encoding**:
 - `>>` Tokenizer: RobertaTokenizerFast
 - Max Sequence Length: 128
 - Padding: True
 - Truncate: True

**Pretrained Model**:
 - `>>` RoBERTa (`roberta-base`) [(link: huggingface.co)](https://huggingface.co/roberta-base) - Intended to improve upon BERTbase with refined pretraining hyperparameters and a larger pretraining text corpus.
 - `>>` Citation: Liu et al. "RoBERTa: A Robustly Optimized BERT Pretraining Approach" (2019) - [https://arxiv.org/pdf/1907.11692.pdf](https://arxiv.org/pdf/1907.11692.pdf)

**Training**:
 - Train/Test/Validation Split: 50/25/25

**Notes**:
 - Class imbalance is present (positive: 6,752; negative: 32,668; about `1:4.8` imbalance ratio).
 - Vocabulary: no modifications made to pretrained transformer's vocabulary.
 - Secondary data: no inclusion of secondary data (director/writer credits).

## 2.1 - Dataset - Convert `pandas` -> 🤗 `dataset`

In [9]:
# limit to top 10 most frequent speakers
top_10_speaker_list = script_df['speaker'].value_counts(normalize=True).nlargest(10).index.tolist()
columns_to_keep = ['speaker', 'line']

script_df_subset = script_df.loc[script_df['speaker'].isin(top_10_speaker_list), columns_to_keep]

script_df_subset

Unnamed: 0,speaker,line
0,michael,All right Jim. Your quarterlies look very good. How are things at the library?
1,jim,"Oh, I told you. I couldn't close it. So..."
2,michael,"So you've come to the master for guidance? Is this what you're saying, grasshopper?"
3,jim,"Actually, you called me in here, but yeah."
4,michael,"All right. Well, let me show you how it's done."
...,...,...
54257,kevin,"No, but maybe the reason..."
54258,oscar,You're not gay.
54260,erin,"How did you do it? How did you capture what it was really like? How we felt and how made each other laugh and how we got through the day? How did you do it? Also, how do cameras work?"
54265,jim,"I sold paper at this company for 12 years. My job was to speak to clients on the phone about quantities and types of copier paper. Even if I didn't love every minute of it, everything I have, I owe to this job. This stupid...wonderful...boring...amazing job."


In [10]:
# rename the 'line' column to be 'text'
script_df_subset = script_df_subset.rename(columns={'line': 'text'})

In [11]:
# create class label column
dwight_mask = (script_df_subset['speaker'] == 'dwight')

# new column of zeros
script_df_subset['label'] = 0

# apply the Dwight mask (as seen in the CPR scene of S05E14 "Stress Relief")
script_df_subset.loc[dwight_mask, 'label'] = 1

# adjust dtype
script_df_subset['label'] = script_df_subset['label'].astype('int8')    
    # would love to use 'category', but not implemented in 🤗 datasets

# check results
script_df_subset['label'].value_counts()

0    32668
1     6752
Name: label, dtype: int64

In [12]:
script_df_subset.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
Int64Index: 39420 entries, 0 to 54266
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   speaker  39420 non-null  string
 1   text     39420 non-null  string
 2   label    39420 non-null  int8  
dtypes: int8(1), string(2)
memory usage: 7.0 MB


In [13]:
# finally, convert to 🤗 dataset object
#   drop 'speaker' by way of not including it
dataset_full: Dataset = Dataset.from_pandas(script_df_subset[['text', 'label']].reset_index(drop=False)) \
                    .cast_column('label', ClassLabel(names=['not_dwight', 'dwight']))

# make sure we got the class labels mapped correctly
assert (dataset_full.features['label'].str2int('dwight') == 1)

Casting the dataset:   0%|          | 0/39420 [00:00<?, ? examples/s]

In [14]:
dataset_full

Dataset({
    features: ['index', 'text', 'label'],
    num_rows: 39420
})

## 2.2 - Train/Test/Val Split

As of v2.12.0, the 🤗 Datasets implementation of `train_test_split` is limited to outputting **two** splits only (train/test), so we'll perform the split twice to obtain train, test, and validation splits.

In [15]:
# set parameters
train_size = 0.50
test_size = 0.25
valid_size = 0.25

assert sum([train_size, test_size, valid_size]) == 1.0

split_random_seed = 27  # for Weird Al fans

first_split = dataset_full.train_test_split(
    test_size=(1.0 - train_size),
    shuffle=True,
    seed=split_random_seed,
    stratify_by_column='label'
)

second_split = first_split['test'].train_test_split(
    test_size=((valid_size) / (test_size + valid_size)),
    shuffle=True,
    seed=split_random_seed,
    stratify_by_column='label'
)

ds_dict = DatasetDict({
    'train': first_split['train'],
    'test': second_split['train'],
    'valid': second_split['test']
})

ds_dict

DatasetDict({
    train: Dataset({
        features: ['index', 'text', 'label'],
        num_rows: 19710
    })
    test: Dataset({
        features: ['index', 'text', 'label'],
        num_rows: 9855
    })
    valid: Dataset({
        features: ['index', 'text', 'label'],
        num_rows: 9855
    })
})

In [16]:
# confirm stratified sample
label_counts = ds_dict['train'].to_pandas()['label'].value_counts()
num_negative = label_counts[0]
num_positive = label_counts[1]

print(f"ratio positive/negative is:\t1 to {num_negative/num_positive:0.1f}")

ratio positive/negative is:	1 to 4.8


In [34]:
ds_dict['train'].features['label'].names

['not_dwight', 'dwight']

## 2.3 - Tokenize and Encode

In [18]:
tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base')

In [19]:
# tokenizer function
def tokenize_function(examples):
    return tokenizer(examples['text'], 
                     padding='longest', 
                     truncation=True, 
                     return_tensors='pt',
                     #max_length=128
                     )

ds_tokenized = ds_dict.map(
    tokenize_function, 
    batched=True, 
    batch_size=None
)

Map:   0%|          | 0/19710 [00:00<?, ? examples/s]

Map:   0%|          | 0/9855 [00:00<?, ? examples/s]

Map:   0%|          | 0/9855 [00:00<?, ? examples/s]

In [23]:
inspect_tokens(tokenizer, ds_tokenized['train'][27])
inspect_tokens(tokenizer, ds_tokenized['test'][42])

--------------------------------------------------
Original text:
	 Birthday time is over! Now go make up for all the work you missed when you were taking your nap.  Many happy returns. 

Label:	1

Tokenized form:
	<s> ĠBirthday Ġtime Ġis Ġover ! ĠNow Ġgo Ġmake Ġup Ġfor Ġall Ġthe Ġwork Ġyou Ġmissed Ġwhen Ġyou Ġwere Ġtaking Ġyour Ġnap . Ġ ĠMany Ġhappy Ġreturns . Ġ </s> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad> <pad>

## 2.4 - Model

Create model from pre-trained 🤗 transformer.

In [35]:
model = RobertaForSequenceClassification.from_pretrained(
    'roberta-base', 
    num_labels=2,
    id2label={idx: label for idx, label in enumerate(ds_dict['train'].features['label'].names)}
)

Some weights of the model checkpoint at roberta-base were not used when initializing RobertaForSequenceClassification: ['lm_head.decoder.weight', 'lm_head.dense.weight', 'roberta.pooler.dense.bias', 'roberta.pooler.dense.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.bias']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaForSequenceClassification were not initialized from the model checkpoint at roberta-base and are newly initialized: ['classifier.out_proj.bias', 'classifi

In [38]:
model

RobertaForSequenceClassification(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=1)
      (position_embeddings): Embedding(514, 768, padding_idx=1)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): RobertaEncoder(
      (layer): ModuleList(
        (0-11): 12 x RobertaLayer(
          (attention): RobertaAttention(
            (self): RobertaSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): RobertaSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
             

In [37]:
model.config

RobertaConfig {
  "_name_or_path": "roberta-base",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "not_dwight",
    "1": "dwight"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.27.4",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 50265
}

Set up training arguments:

In [41]:
start_time = pd.Timestamp.now().strftime(r'%Y%m%d_%H%M%S')  # yyyymmdd_hhmmss
run_name = f"basic_roberta_{start_time}"

training_args = TrainingArguments(
    # model output
    run_name=run_name,
    output_dir=MODEL_DIR / run_name,
    save_strategy='epoch',
    save_total_limit=3,
    # training hyperparams
    num_train_epochs=5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    gradient_accumulation_steps=4,
    gradient_checkpointing=True,
    weight_decay=0.01,
    # evaluation during training
    evaluation_strategy='epoch',
    logging_strategy='epoch',
    log_level='warning',
)

Establish evaluation metrics:

In [42]:
# set up training / evaluation metrics
#   Docs: https://huggingface.co/docs/evaluate/package_reference/main_classes#evaluate.combine
#   Each of these metrics corresponds to a script from huggingface, below are the links for each script.
#       accuracy:       https://huggingface.co/spaces/evaluate-metric/accuracy
#       f1:             https://huggingface.co/spaces/evaluate-metric/f1
#       precision:      https://huggingface.co/spaces/evaluate-metric/precision
#       recall:         https://huggingface.co/spaces/evaluate-metric/recall
metric_list = ['accuracy', 'f1', 'precision', 'recall']

metric = evaluate.combine(evaluations=metric_list)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

Finally, set up and run the 🤗 Trainer:

In [43]:
time_training_start = pd.Timestamp.now()

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=ds_tokenized['train'],
    eval_dataset=ds_tokenized['test'],
    compute_metrics=compute_metrics
)

result = trainer.train()

time_training_stop = pd.Timestamp.now()
time_training = time_training_stop - time_training_start

print("\nTraining duration:", str(time_training))



  0%|          | 0/770 [00:00<?, ?it/s]

{'loss': 0.463, 'learning_rate': 4e-05, 'epoch': 1.0}


  0%|          | 0/308 [00:00<?, ?it/s]

{'eval_loss': 0.43414580821990967, 'eval_accuracy': 0.8292237442922374, 'eval_f1': 0.015213575190169687, 'eval_precision': 0.6190476190476191, 'eval_recall': 0.007701421800947867, 'eval_runtime': 83.2058, 'eval_samples_per_second': 118.441, 'eval_steps_per_second': 3.702, 'epoch': 1.0}
{'loss': 0.4257, 'learning_rate': 3e-05, 'epoch': 2.0}


  0%|          | 0/308 [00:00<?, ?it/s]

{'eval_loss': 0.430319219827652, 'eval_accuracy': 0.8289193302891933, 'eval_f1': 0.29396984924623115, 'eval_precision': 0.5014285714285714, 'eval_recall': 0.2079383886255924, 'eval_runtime': 82.5001, 'eval_samples_per_second': 119.454, 'eval_steps_per_second': 3.733, 'epoch': 2.0}
{'loss': 0.3791, 'learning_rate': 2e-05, 'epoch': 3.0}


  0%|          | 0/308 [00:00<?, ?it/s]

{'eval_loss': 0.43349748849868774, 'eval_accuracy': 0.8347031963470319, 'eval_f1': 0.20497803806734993, 'eval_precision': 0.5817174515235457, 'eval_recall': 0.12440758293838862, 'eval_runtime': 83.2959, 'eval_samples_per_second': 118.313, 'eval_steps_per_second': 3.698, 'epoch': 3.0}
{'loss': 0.325, 'learning_rate': 1e-05, 'epoch': 4.0}


  0%|          | 0/308 [00:00<?, ?it/s]

{'eval_loss': 0.47166863083839417, 'eval_accuracy': 0.8259766615930999, 'eval_f1': 0.2968429684296843, 'eval_precision': 0.48202396804260983, 'eval_recall': 0.21445497630331753, 'eval_runtime': 81.369, 'eval_samples_per_second': 121.115, 'eval_steps_per_second': 3.785, 'epoch': 4.0}
{'loss': 0.2817, 'learning_rate': 0.0, 'epoch': 5.0}


  0%|          | 0/308 [00:00<?, ?it/s]

{'eval_loss': 0.49859941005706787, 'eval_accuracy': 0.8242516489091831, 'eval_f1': 0.31378763866877973, 'eval_precision': 0.47368421052631576, 'eval_recall': 0.23459715639810427, 'eval_runtime': 81.8029, 'eval_samples_per_second': 120.472, 'eval_steps_per_second': 3.765, 'epoch': 5.0}
{'train_runtime': 2722.1736, 'train_samples_per_second': 36.203, 'train_steps_per_second': 0.283, 'train_loss': 0.3748962501426796, 'epoch': 5.0}

Training duration: 0 days 00:45:22.492777
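
The 770 total steps in the training progress bar follow from the batch size and gradient accumulation settings. The sketch below reproduces that count under two assumptions that are not stated in the notebook: training runs on a single device, and a trailing partial accumulation group is dropped when counting optimizer steps per epoch. The function name `total_optimizer_steps` is hypothetical.

```python
import math


def total_optimizer_steps(num_rows: int, per_device_batch_size: int,
                          grad_accum_steps: int, num_epochs: int) -> int:
    """Estimate optimizer steps for a single-device run (assumed behavior)."""
    batches_per_epoch = math.ceil(num_rows / per_device_batch_size)
    updates_per_epoch = max(batches_per_epoch // grad_accum_steps, 1)
    return updates_per_epoch * num_epochs


# values taken from the training arguments and the split sizes above
effective_batch_size = 32 * 4
train_steps = total_optimizer_steps(19710, 32, 4, 5)
eval_batches = math.ceil(9855 / 32)

print(effective_batch_size, train_steps, eval_batches)
```

Each optimizer update therefore sees an effective batch of 128 examples, and the per-epoch evaluation pass over the 9,855-row test split takes 308 batches, matching the evaluation progress bars.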


Save the trained model:

In [44]:
trainer.save_model()    # saves to self.args.output_dir

## 2.5 - Evaluate

In [45]:
time_evaluation_start = pd.Timestamp.now()

final_metrics = {}
final_metrics['train'] = trainer.evaluate(eval_dataset=ds_tokenized['train'], metric_key_prefix='final_train')
final_metrics['test'] = trainer.evaluate(eval_dataset=ds_tokenized['test'], metric_key_prefix='final_test')
final_metrics['valid'] = trainer.evaluate(eval_dataset=ds_tokenized['valid'], metric_key_prefix='validation')

time_evaluation_stop = pd.Timestamp.now()
time_evaluation = time_evaluation_stop - time_evaluation_start

print("\nEvaluation duration, what's the situation:", str(time_evaluation))

  0%|          | 0/616 [00:00<?, ?it/s]

  0%|          | 0/308 [00:00<?, ?it/s]

  0%|          | 0/308 [00:00<?, ?it/s]


Evaluation duration, what's the situation: 0 days 00:04:24.722081


In [46]:
for split in final_metrics:
    print(f"\n{split.upper():->10}{'-'*15}")
    for k, v in final_metrics[split].items():
        print(f"{v:>10.3f} - {k}")
    print("-"*25)


-----TRAIN---------------
     0.221 - final_train_loss
     0.917 - final_train_accuracy
     0.703 - final_train_f1
     0.909 - final_train_precision
     0.573 - final_train_recall
   117.271 - final_train_runtime
   168.072 - final_train_samples_per_second
     5.253 - final_train_steps_per_second
     5.000 - epoch
-------------------------

------TEST---------------
     0.499 - final_test_loss
     0.824 - final_test_accuracy
     0.314 - final_test_f1
     0.474 - final_test_precision
     0.235 - final_test_recall
    82.584 - final_test_runtime
   119.333 - final_test_samples_per_second
     3.730 - final_test_steps_per_second
     5.000 - epoch
-------------------------

-----VALID---------------
     0.495 - validation_loss
     0.825 - validation_accuracy
     0.324 - validation_f1
     0.480 - validation_precision
     0.245 - validation_recall
    64.822 - validation_runtime
   152.032 - validation_samples_per_second
     4.751 - validation_steps_per_second
     5.000 

### 2.5.1 - Informal Test

Feeding two completely made-up lines to the fine-tuned model, mostly for fun but also as a small test of the model's performance.

In [57]:
test_lines = [
    "Assistant to the regional manager of beets, Mose and mother on the farm",
    "My name is Michael Scott, paper is my business",
]

for line in test_lines:
    print("-"*50)
    informal_test(tokenizer, model, line)

--------------------------------------------------
         Test Line:  "Assistant to the regional manager of beets, Mose and mother on the farm"
 Predicted Speaker:  dwight
            Logits:  tensor([[-1.8353,  2.3127]], device='cuda:0')
--------------------------------------------------
         Test Line:  "My name is Michael Scott, paper is my business"
 Predicted Speaker:  not_dwight
            Logits:  tensor([[ 1.7975, -1.3024]], device='cuda:0')
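
To read the logits above as class probabilities, a softmax can be applied by hand. This is only a sketch: `informal_test` is defined elsewhere in the notebook and is not reproduced here, and the sketch assumes the predicted speaker is the argmax of the logits (consistent with the `np.argmax` call in `compute_metrics`). The label mapping is the `id2label` from the model config.

```python
import math

id2label = {0: "not_dwight", 1: "dwight"}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# logits copied from the informal test output above
logged_logits = [
    [-1.8353, 2.3127],
    [1.7975, -1.3024],
]

probabilities = [softmax(row) for row in logged_logits]
predicted_labels = [
    id2label[max(range(len(p)), key=p.__getitem__)] for p in probabilities
]

for label, probs in zip(predicted_labels, probabilities):
    print(label, [round(p, 3) for p in probs])
```

Both predictions come with a large logit margin, though two hand-written lines say little about overall performance.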


## 2.6 - Discussion / Conclusions (on this attempt)

| Metrics (Train/Test/Valid)         | Accuracy              | F1 Score              | Precision             | Recall                | Fine-Tuning Time |
|------------------------------------|-----------------------|-----------------------|-----------------------|-----------------------|------------------|
| (1) Basic Transformer (DistilBERT) | 0.961 / 0.810 / 0.806 | 0.875 / 0.330 / 0.322 | 0.978 / 0.417 / 0.402 | 0.791 / 0.274 / 0.268 | 0d 0h 9m 48s     |
| (2) Mod: Different PLM (RoBERTa)   | 0.917 / 0.824 / 0.825 | 0.703 / 0.314 / 0.324 | 0.909 / 0.474 / 0.480 | 0.573 / 0.235 / 0.245 | 0d 0h 45m 23s    |

### 2.6.1 - Accuracy
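Accuracy for this attempt (0.824 test / 0.825 validation) was roughly equivalent to that of the trivial classifier that always predicts `not_dwight` (about 0.83, given the 1:4.8 class ratio), so it is still not worth celebrating.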
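
As a reference point, the accuracy of the trivial (majority-class) classifier can be worked out from the class counts alone. The counts below are the `value_counts()` figures for the same 39,420-row subset, as printed in Section 3.1; the variable names are illustrative only.

```python
# class counts for the top-10-speaker subset (0 = not_dwight, 1 = dwight)
label_counts = {0: 32668, 1: 6752}

total = sum(label_counts.values())
majority_label = max(label_counts, key=label_counts.get)

# a classifier that always predicts the majority label is right this often
baseline_accuracy = label_counts[majority_label] / total

print(f"majority label: {majority_label}, baseline accuracy: {baseline_accuracy:.3f}")
```

Such a classifier never predicts the positive class, so its recall on "Dwight" lines is zero, which is why accuracy alone is a weak yardstick here.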

### 2.6.2 - F1 Score
Interestingly, compared with the basic DistilBERT model, the validation F1 score stayed approximately the same (0.322 vs. 0.324) while the train and test F1 scores decreased. During training, the test F1 score still did not appear to be stable or converging over the five epochs. Training loss did decrease steadily, so it is possible the model is undertrained; however, training accuracy (0.917) is already well above test accuracy (0.824) and the evaluation loss rose after the second epoch, so we would be concerned that further training would overfit the training data.

### 2.6.3 - Precision / Recall
Neither measure appeared stable during training, but there is a small improvement in precision over the basic model (test: 0.417 to 0.474), likely at the cost of the apparent decrease in recall (test: 0.274 to 0.235).
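
The trade-off between precision and recall is what the F1 score summarizes: it is their harmonic mean. A small sketch, using the rounded test-split figures from the table above, shows that the reported F1 values follow from the reported precision and recall (small differences come from rounding the inputs to three decimals).

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) on the test split, from the table above
reported = {
    "DistilBERT": (0.417, 0.274, 0.330),
    "RoBERTa": (0.474, 0.235, 0.314),
}

recomputed = {name: f1_score(p, r) for name, (p, r, _) in reported.items()}

for name, (p, r, f1) in reported.items():
    print(f"{name}: reported F1 {f1:.3f}, recomputed {recomputed[name]:.3f}")
```

Because the harmonic mean is pulled toward the smaller of the two inputs, the gain in precision does not make up for the drop in recall.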

### 2.6.4 - Overall "Modified Approach: Different Pretrained Model (RoBERTa)" Conclusion
This modification (swapping in a more "robustly optimized" pretrained language model) yields some improvement in precision but overall does not appear to be the winning model for our task. Fine-tuning time also increased by a factor of ~4.6 (9m 48s to 45m 23s), which is an expected but discouraging result.
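
The slowdown factor quoted above can be checked directly from the two fine-tuning times in the table. A minimal sketch using only the standard library:

```python
from datetime import timedelta

# fine-tuning times from the comparison table
distilbert_time = timedelta(minutes=9, seconds=48)
roberta_time = timedelta(minutes=45, seconds=23)

# dividing two timedeltas yields a plain float ratio
slowdown = roberta_time / distilbert_time

print(f"RoBERTa fine-tuning took {slowdown:.1f}x as long as DistilBERT")
```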

# 3 - Modified Approach: Re-Balance Dataset

Returning to our original pretrained model (DistilBERT), we'll attempt to make modifications to our dataset to try to realize some better classification performance.

> NOTE: Differences from the "Basic Transformer" (Section 1) are noted with "`>>`" chevrons.

**Task**: Sequence Classification (Binary)

**Classes**: 
 - Positive (1): "Dwight" - a line is spoken by the character Dwight K. Schrute (played by Rainn Wilson).
 - Negative (0): "Not Dwight" - a line is spoken by any other character than Dwight.

**Data**:
 - `speaker` as precursor to class label. Limited to top-10 most frequent speakers based on number of lines in dataset.
 - `line` as sequence text.
 - `>>` Class imbalance is addressed. Two techniques are applied (separately): (1) undersample the negative class; (2) oversample the positive class. More details under "Notes" below.

**Encoding**:
 - Tokenizer: DistilBertTokenizerFast
 - Max Sequence Length: 128
 - Padding: True
 - Truncate: True

**Pretrained Model**:
 - DistilBert (`distilbert-base-uncased`) [(link: huggingface.co)](https://huggingface.co/distilbert-base-uncased) - Intended to mimic the standard "BERTbase" model but in a smaller/faster/more efficient way.
 - Citation: Sanh et al. "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter" (2019) - [https://arxiv.org/pdf/1910.01108.pdf](https://arxiv.org/pdf/1910.01108.pdf)

**Training**:
 - `>>` Train/Test/Validation Split: 70/15/15 - increased training split size to balance the effect of undersampling negative class.

**Notes**:
 - `>>` Two approaches are attempted for re-balancing the training dataset.
   - (1) Undersample the negative class - There are several ways we could choose to accomplish this.
     - Choose an $N$ smaller than 10 for our "top N speakers" filter, reducing it until the number of Not-Dwight utterances is closer to the number of Dwight utterances.
     - Maintain the same $N=10$ approach but reduce the number of Not-Dwight utterances in the training dataset (i.e. "use all of Dwight's utterances, but only use XX% of the Top-10-Speaker Not-Dwight utterances").
   - (2) Oversample the positive class
     - If we had unlimited time, it would be interesting to use an approach like [Synthetic Minority Over-sampling Technique (SMOTE)](https://arxiv.org/abs/1106.1813), perhaps using numeric features derived from GloVe vectors to choose "similar" words to real Dwight lines.
     - More within the realm of NLP, we could also consider using a generative language model to generate artificial lines until we have enough Dwight lines (real+artificial) to balance the dataset.
   - For this first attempt at re-balancing, we'll go with a low-complexity approach: 
     - Double the positive class samples (2x oversample)
     - Halve the negative class samples (2x undersample)
     - This would bring us to a class balance of about 1:1.2 (positive to negative), down from roughly 1:4.8 (4.8 / 2 / 2 = 1.2).
 - Vocabulary: no modifications made to pretrained transformer's vocabulary.
 - Secondary data: no inclusion of secondary data (director/writer credits).
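
The expected effect of the low-complexity re-balancing on the training split can be estimated before running anything. The sketch below uses the class counts printed in Section 3.1 and the 70% training fraction. It assumes row counts are rounded to the nearest integer when sampling by fraction (my understanding of `pandas.DataFrame.sample(frac=...)`); the variable names are illustrative only.

```python
# class counts for the top-10-speaker subset
positive_total, negative_total = 6752, 32668
train_size = 0.70

# stratified training cut, taken per class
positive_train = round(positive_total * train_size)
negative_train = round(negative_total * train_size)

# 2x oversample of the positive class, 2x undersample of the negative class
positive_resampled = 2 * positive_train
negative_resampled = round(negative_train * 0.5)

original_ratio = negative_total / positive_total
rebalanced_ratio = negative_resampled / positive_resampled
train_rows = positive_resampled + negative_resampled

print(f"original   1 to {original_ratio:0.1f}")
print(f"rebalanced 1 to {rebalanced_ratio:0.1f} ({train_rows} training rows)")
```

Note that the oversampling here is plain duplication, so every positive training line appears exactly twice.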

## 3.1 - Dataset Prep

Rather than using the 🤗 train/test split method (which is itself modeled on the `sklearn` function of the same name), we'll perform the split manually with `pandas`.

In [10]:
# limit to top 10 most frequent speakers
top_10_speaker_list = script_df['speaker'].value_counts(normalize=True).nlargest(10).index.tolist()
columns_to_keep = ['speaker', 'line']

script_df_subset = script_df.loc[script_df['speaker'].isin(top_10_speaker_list), columns_to_keep]

# rename the 'line' column to be 'text'
script_df_subset = script_df_subset.rename(columns={'line': 'text'})

# create class label column
dwight_mask = (script_df_subset['speaker'] == 'dwight')

# new column of zeros
script_df_subset['label'] = 0

# apply the Dwight mask (as seen in the CPR scene of S05E14 "Stress Relief")
script_df_subset.loc[dwight_mask, 'label'] = 1

# adjust dtype
script_df_subset['label'] = script_df_subset['label'].astype('int8')

# check results
print(script_df_subset['label'].value_counts())

pd.concat(
    [script_df_subset.loc[script_df_subset['label'] == 0].sample(3, random_state=42),
     script_df_subset.loc[script_df_subset['label'] == 1].sample(3, random_state=42)]
)

0    32668
1     6752
Name: label, dtype: int64


|       | speaker | text | label |
|-------|---------|------|-------|
| 30002 | michael | That's, that is true. | 0 |
| 17790 | andy | All right! | 0 |
| 44715 | oscar | Un-be-liev-a-ble. | 0 |
| 50914 | dwight | We just need a pretense to talk to him. We could tell him that his mother is dying. That usually works on him. Nate. Your mother is dying. | 1 |
| 10118 | dwight | I'm not. | 1 |
| 11313 | dwight | Do you have the tools to turn a wooden mop handle into a stake? | 1 |


## 3.2 - Train/Test/Val Split

In [11]:
# set parameters
train_size = 0.70
test_size = 0.15
valid_size = 0.15

assert np.isclose(sum([train_size, test_size, valid_size]), 1.0)  # tolerant of float rounding

split_random_seed = 27  # for Weird Al fans

# stratify by `label`
positive_index = script_df_subset.loc[script_df_subset['label'] == 1].index
negative_index = script_df_subset.loc[script_df_subset['label'] == 0].index

# first cut is training set
positive_index_train = script_df_subset.loc[positive_index].sample(
    frac=train_size,
    replace=False,
    random_state=split_random_seed
    ).index

negative_index_train = script_df_subset.loc[negative_index].sample(
    frac=train_size,
    replace=False,
    random_state=split_random_seed
    ).index

# with training set excluded, take a cut of what's left for test
positive_index_test = script_df_subset.loc[positive_index].drop(index=positive_index_train).sample(
    frac=(test_size / (test_size+valid_size)),  # accounting for train sample already removed
    replace=False,
    random_state=split_random_seed
    ).index

negative_index_test = script_df_subset.loc[negative_index].drop(index=negative_index_train).sample(
    frac=(test_size / (test_size+valid_size)),  # accounting for train sample already removed
    replace=False,
    random_state=split_random_seed
    ).index

# and then anything not in training or test is left to validation
positive_index_valid = script_df_subset.loc[positive_index] \
    .drop(index=positive_index_train) \
    .drop(index=positive_index_test).index

negative_index_valid = script_df_subset.loc[negative_index] \
    .drop(index=negative_index_train) \
    .drop(index=negative_index_test).index

# grab the glue and reassemble these pieces
#   (note: training data is re-assembled last)
#   also apply `sample(frac=1.0)` to shuffle data
#   (seeded with `split_random_seed` so the shuffle is reproducible)
script_df_subset_test = pd.concat([
    script_df_subset.loc[positive_index_test],
    script_df_subset.loc[negative_index_test]
], axis='index').sample(frac=1.0, random_state=split_random_seed)

script_df_subset_valid = pd.concat([
    script_df_subset.loc[positive_index_valid],
    script_df_subset.loc[negative_index_valid]
], axis='index').sample(frac=1.0, random_state=split_random_seed)

#   apply the over/under sampling to training data as we reassemble
script_df_subset_train = pd.concat([
    # 2x oversample of positive class (each positive row appears twice)
    script_df_subset.loc[positive_index_train],
    script_df_subset.loc[positive_index_train],
    # 2x undersample of negative class
    script_df_subset.loc[negative_index_train].sample(frac=0.5, random_state=split_random_seed)
], axis='index').sample(frac=1.0, random_state=split_random_seed)

# confirm our ratios
#   we want:
#    - training data to be closer to 1:1 positive-to-negative
#    - test and validation data to be closer to original
num_negative = script_df_subset_train['label'].value_counts()[0]
num_positive = script_df_subset_train['label'].value_counts()[1]
print(f"train | ratio positive/negative is:\t1 to {num_negative/num_positive:0.1f}")

num_negative = script_df_subset_test['label'].value_counts()[0]
num_positive = script_df_subset_test['label'].value_counts()[1]
print(f"test  | ratio positive/negative is:\t1 to {num_negative/num_positive:0.1f}")

num_negative = script_df_subset_valid['label'].value_counts()[0]
num_positive = script_df_subset_valid['label'].value_counts()[1]
print(f"valid | ratio positive/negative is:\t1 to {num_negative/num_positive:0.1f}")

train | ratio positive/negative is:	1 to 1.2
test  | ratio positive/negative is:	1 to 4.8
valid | ratio positive/negative is:	1 to 4.8


In [12]:
cols_of_interest = ['text', 'label']
class_names = ['not_dwight', 'dwight']

# convert to 🤗 Dataset objects inside a DatasetDict
ds_dict = DatasetDict({
    'train': Dataset.from_pandas(
        script_df_subset_train[cols_of_interest].reset_index(drop=False)
        ).cast_column('label', ClassLabel(names=class_names)),
    'test': Dataset.from_pandas(
        script_df_subset_test[cols_of_interest].reset_index(drop=False)
        ).cast_column('label', ClassLabel(names=class_names)),
    'valid': Dataset.from_pandas(
        script_df_subset_valid[cols_of_interest].reset_index(drop=False)
        ).cast_column('label', ClassLabel(names=class_names)),
})

print(ds_dict)

Casting the dataset:   0%|          | 0/20886 [00:00<?, ? examples/s]

Casting the dataset:   0%|          | 0/5913 [00:00<?, ? examples/s]

Casting the dataset:   0%|          | 0/5913 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['index', 'text', 'label'],
        num_rows: 20886
    })
    test: Dataset({
        features: ['index', 'text', 'label'],
        num_rows: 5913
    })
    valid: Dataset({
        features: ['index', 'text', 'label'],
        num_rows: 5913
    })
})


## 3.3 - Tokenize and Encode

In [15]:
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

# tokenizer function
def tokenize_function(examples):
    return tokenizer(examples['text'], 
                     padding='longest', 
                     truncation=True, 
                     return_tensors='pt',
                     max_length=128
                     )

ds_tokenized = ds_dict.map(
    tokenize_function, 
    batched=True, 
    batch_size=None
)

# note because of the over-/under-sampling, these test indices will reference 
#   different lines than the previous cases
inspect_tokens(tokenizer, ds_tokenized['train'][999])
inspect_tokens(tokenizer, ds_tokenized['test'][42])

Map:   0%|          | 0/20886 [00:00<?, ? examples/s]

Map:   0%|          | 0/5913 [00:00<?, ? examples/s]

Map:   0%|          | 0/5913 [00:00<?, ? examples/s]

--------------------------------------------------
Original text:
	What did they do to you, Angela?

Label:	1

Tokenized form:
	[CLS] what did they do to you , angela ? [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]

Tokens as a list:
	['[CLS]', 'what', 'did', 'they', 'do', 'to', 'you', ',', 'angela', '?', '[SEP]', '[PAD]', '[PAD]', '[PA

## 3.4 - Model

In [16]:
model = DistilBertForSequenceClassification.from_pretrained(
    'distilbert-base-uncased', 
    num_labels=2,
    id2label={idx: label for idx, label in enumerate(ds_dict['train'].features['label'].names)}
    )

start_time = pd.Timestamp.now().strftime(r'%Y%m%d_%H%M%S')  # yyyymmdd_hhmmss
run_name = f"rebalanced_distilbert_{start_time}"

# setup training args
training_args = TrainingArguments(
    # model output
    run_name=run_name,
    output_dir=MODEL_DIR / run_name,
    save_strategy='epoch',
    save_total_limit=3,
    # training hyperparams
    num_train_epochs=5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    #gradient_accumulation_steps=4,
    #gradient_checkpointing=True,
    weight_decay=0.01,
    # evaluation during training
    evaluation_strategy='epoch',
    logging_strategy='epoch',
    log_level='warning',
)

# establish evaluation metrics:
#   Docs: https://huggingface.co/docs/evaluate/package_reference/main_classes#evaluate.combine
#   Each of these metrics corresponds to a script from huggingface, below are the links for each script.
#       accuracy:       https://huggingface.co/spaces/evaluate-metric/accuracy
#       f1:             https://huggingface.co/spaces/evaluate-metric/f1
#       precision:      https://huggingface.co/spaces/evaluate-metric/precision
#       recall:         https://huggingface.co/spaces/evaluate-metric/recall
metric_list = ['accuracy', 'f1', 'precision', 'recall']

metric = evaluate.combine(evaluations=metric_list)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.weight', 'vocab_transform.bias', 'vocab_layer_norm.weight', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.bias', 'pre_classi

In [17]:
print(model)
print(model.config)

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

In [19]:
time_training_start = pd.Timestamp.now()

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=ds_tokenized['train'],
    eval_dataset=ds_tokenized['test'],
    compute_metrics=compute_metrics
)

result = trainer.train()

time_training_stop = pd.Timestamp.now()
time_training = time_training_stop - time_training_start

print("\nTraining duration:", str(time_training))

# save the trained model:
trainer.save_model()    # saves to self.args.output_dir

  0%|          | 0/3265 [00:00<?, ?it/s]

{'loss': 0.3386, 'learning_rate': 4e-05, 'epoch': 1.0}


  0%|          | 0/185 [00:00<?, ?it/s]

{'eval_loss': 0.7878588438034058, 'eval_accuracy': 0.7299171317436157, 'eval_f1': 0.38788807972403216, 'eval_precision': 0.31704260651629074, 'eval_recall': 0.49950641658440276, 'eval_runtime': 10.6394, 'eval_samples_per_second': 555.766, 'eval_steps_per_second': 17.388, 'epoch': 1.0}
{'loss': 0.2694, 'learning_rate': 3e-05, 'epoch': 2.0}


  0%|          | 0/185 [00:00<?, ?it/s]

{'eval_loss': 0.7454438805580139, 'eval_accuracy': 0.7375274818197193, 'eval_f1': 0.3602638087386645, 'eval_precision': 0.30927105449398445, 'eval_recall': 0.43139190523198423, 'eval_runtime': 10.3354, 'eval_samples_per_second': 572.109, 'eval_steps_per_second': 17.9, 'epoch': 2.0}
{'loss': 0.1808, 'learning_rate': 2e-05, 'epoch': 3.0}


  0%|          | 0/185 [00:00<?, ?it/s]

{'eval_loss': 0.9581283926963806, 'eval_accuracy': 0.7498731608320649, 'eval_f1': 0.3687580025608195, 'eval_precision': 0.324812030075188, 'eval_recall': 0.42645607107601186, 'eval_runtime': 10.5839, 'eval_samples_per_second': 558.678, 'eval_steps_per_second': 17.479, 'epoch': 3.0}
{'loss': 0.136, 'learning_rate': 1e-05, 'epoch': 4.0}


  0%|          | 0/185 [00:00<?, ?it/s]

{'eval_loss': 1.237331748008728, 'eval_accuracy': 0.7471672585827837, 'eval_f1': 0.3667937314697162, 'eval_precision': 0.3212166172106825, 'eval_recall': 0.42744323790720634, 'eval_runtime': 10.5803, 'eval_samples_per_second': 558.87, 'eval_steps_per_second': 17.485, 'epoch': 4.0}
{'loss': 0.107, 'learning_rate': 0.0, 'epoch': 5.0}


  0%|          | 0/185 [00:00<?, ?it/s]

{'eval_loss': 1.3681824207305908, 'eval_accuracy': 0.7469981396922036, 'eval_f1': 0.37510442773600666, 'eval_precision': 0.32512671976828383, 'eval_recall': 0.4432379072063179, 'eval_runtime': 10.5589, 'eval_samples_per_second': 560.0, 'eval_steps_per_second': 17.521, 'epoch': 5.0}
{'train_runtime': 582.4133, 'train_samples_per_second': 179.306, 'train_steps_per_second': 5.606, 'train_loss': 0.20636464888615047, 'epoch': 5.0}

Training duration: 0 days 00:09:42.429381
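
The learning rates logged at the end of each epoch (4e-05, 3e-05, 2e-05, 1e-05, 0.0) are consistent with a linear decay from 5e-05 with no warmup. The `TrainingArguments` above do not set a learning rate or scheduler, so the sketch below assumes the library defaults (a base rate of 5e-5 with a linear schedule). The function `linear_decay_lr` is a hypothetical stand-in, not the library's scheduler.

```python
def linear_decay_lr(step: int, total_steps: int,
                    base_lr: float = 5e-5, warmup_steps: int = 0) -> float:
    """Learning rate after `step` optimizer updates under linear decay."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0.0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


total_steps = 3265                  # from the training progress bar
num_epochs = 5
steps_per_epoch = total_steps // num_epochs

# learning rate at the end of each epoch
epoch_end_lrs = [
    linear_decay_lr(epoch * steps_per_epoch, total_steps)
    for epoch in range(1, num_epochs + 1)
]

for epoch, lr in enumerate(epoch_end_lrs, start=1):
    print(f"epoch {epoch}: lr = {lr:.1e}")
```

Since the rate reaches zero at the last step, continuing training would require a new schedule and not simply more epochs on the same trainer.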


## 3.5 - Evaluate

In [20]:
time_evaluation_start = pd.Timestamp.now()

final_metrics = {}
final_metrics['train'] = trainer.evaluate(eval_dataset=ds_tokenized['train'], metric_key_prefix='final_train')
final_metrics['test'] = trainer.evaluate(eval_dataset=ds_tokenized['test'], metric_key_prefix='final_test')
final_metrics['valid'] = trainer.evaluate(eval_dataset=ds_tokenized['valid'], metric_key_prefix='validation')

time_evaluation_stop = pd.Timestamp.now()
time_evaluation = time_evaluation_stop - time_evaluation_start

print("\nEvaluation duration, what's the situation:", str(time_evaluation))

# print the metrics
for split in final_metrics:
    print(f"\n{split.upper():->10}{'-'*15}")
    for k, v in final_metrics[split].items():
        print(f"{v:>10.3f} - {k}")
    print("-"*25)

  0%|          | 0/653 [00:00<?, ?it/s]

  0%|          | 0/185 [00:00<?, ?it/s]

  0%|          | 0/185 [00:00<?, ?it/s]


Evaluation duration, what's the situation: 0 days 00:00:57.943045

-----TRAIN---------------
     0.084 - final_train_loss
     0.956 - final_train_accuracy
     0.951 - final_train_f1
     0.969 - final_train_precision
     0.934 - final_train_recall
    36.878 - final_train_runtime
   566.358 - final_train_samples_per_second
    17.707 - final_train_steps_per_second
     5.000 - epoch
-------------------------

------TEST---------------
     1.368 - final_test_loss
     0.747 - final_test_accuracy
     0.375 - final_test_f1
     0.325 - final_test_precision
     0.443 - final_test_recall
    10.498 - final_test_runtime
   563.229 - final_test_samples_per_second
    17.622 - final_test_steps_per_second
     5.000 - epoch
-------------------------

-----VALID---------------
     1.349 - validation_loss
     0.748 - validation_accuracy
     0.373 - validation_f1
     0.325 - validation_precision
     0.437 - validation_recall
    10.523 - validation_runtime
   561.907 - validation_samp

### 3.5.1 - Informal Test

Feeding two completely made-up lines to the fine-tuned model, mostly for fun but also as a small test of the model's performance.

In [21]:
test_lines = [
    "Assistant to the regional manager of beets, Mose and mother on the farm",
    "My name is Michael Scott, paper is my business",
]

for line in test_lines:
    print("-"*50)
    informal_test(tokenizer, model, line)

--------------------------------------------------
         Test Line:  "Assistant to the regional manager of beets, Mose and mother on the farm"
 Predicted Speaker:  dwight (1)
            Logits:  tensor([[-2.3220,  2.0004]], device='cuda:0')
--------------------------------------------------
         Test Line:  "My name is Michael Scott, paper is my business"
 Predicted Speaker:  not_dwight (0)
            Logits:  tensor([[ 4.8112, -3.6310]], device='cuda:0')


## 3.6 - Discussion / Conclusions (on this attempt)

| Metrics (Train/Test/Valid)            | Accuracy              | F1 Score              | Precision             | Recall                | Fine-Tuning Time |
|---------------------------------------|-----------------------|-----------------------|-----------------------|-----------------------|------------------|
| (1) Basic Transformer (DistilBERT)    | 0.961 / 0.810 / 0.806 | 0.875 / 0.330 / 0.322 | 0.978 / 0.417 / 0.402 | 0.791 / 0.274 / 0.268 | 0d 0h 9m 48s     |
| (2) Mod: Different PLM (RoBERTa)      | 0.917 / 0.824 / 0.825 | 0.703 / 0.314 / 0.324 | 0.909 / 0.474 / 0.480 | 0.573 / 0.235 / 0.245 | 0d 0h 45m 23s    |
| (3) Mod: Re-Balance Data (DistilBERT) | 0.956 / 0.747 / 0.748 | 0.951 / 0.375 / 0.373 | 0.969 / 0.325 / 0.325 | 0.934 / 0.443 / 0.437 | 0d 0h 9m 42s     |

### 3.6.1 - Accuracy
While we re-balanced the class representation in our training dataset (1:1.2 pos/neg), we kept the test and validation sets at the original ratio (1:4.8). This means the trivial classifier accuracy (always predicting the majority class) is still 82.9% for the test and validation datasets, but would be about 54.7% ( $\frac{16,334}{29,838}$ ) for the training dataset. With that in mind, the validation accuracy for this attempt (74.8%) is a step backwards from prior attempts (80.6% and 82.5%).
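
As a quick check on the baseline figures above, the trivial-classifier accuracy is just the majority-class share of each dataset. The sketch below is a minimal illustration only: the helper name `majority_baseline_accuracy` is ours (it is not part of the notebook), and the training counts assume the 16,334 negatives out of 29,838 rows quoted above.

```python
def majority_baseline_accuracy(num_positive: int, num_negative: int) -> float:
    """Accuracy of a classifier that always predicts the more common class."""
    total = num_positive + num_negative
    return max(num_positive, num_negative) / total

# original class counts (the test/validation sets keep this ratio)
baseline_original = majority_baseline_accuracy(num_positive=6_752, num_negative=32_668)

# re-balanced training set: 16,334 negatives out of 29,838 rows (assumed from the text above)
baseline_train = majority_baseline_accuracy(num_positive=29_838 - 16_334, num_negative=16_334)

print(f"original ratio:      {baseline_original:.1%}")
print(f"re-balanced (train): {baseline_train:.1%}")
```

Any model evaluated on the test or validation sets has to clear the first figure, not the second, to be doing better than guessing "not Dwight" every time.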

### 3.6.2 - F1 Score
We did see an improvement to F1 score for this attempt, with validation F1 of 0.373 (an improvement of about 0.05 over the baseline's 0.322).

### 3.6.3 - Precision / Recall
Precision decreased for this attempt (0.325 validation precision), a drop of ~0.08 from the baseline basic transformer (0.402). Recall, however, made a large jump upward to 0.437 (validation recall), an improvement of ~0.17 over the baseline (0.268). It would follow that this larger increase in recall has a larger influence on F1 score than the decrease in precision, and accordingly the F1 score increased.
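
To make the precision/recall trade-off concrete: F1 is the harmonic mean of precision and recall, so it can be recomputed from the validation values in the table in 3.6. This is a small sketch for illustration only; the helper `f1_from_precision_recall` is defined here and is not taken from the `evaluate` package.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# validation precision / recall as reported in the table in 3.6
f1_attempt_1 = f1_from_precision_recall(precision=0.402, recall=0.268)
f1_attempt_3 = f1_from_precision_recall(precision=0.325, recall=0.437)

print(f"attempt #1 validation F1: {f1_attempt_1:.3f}")
print(f"attempt #3 validation F1: {f1_attempt_3:.3f}")
```

Because a harmonic mean is pulled toward the smaller of its two inputs, raising the weaker term (recall, in attempt #1) helps F1 more than the accompanying precision drop hurts it.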

### 3.6.4 - Overall "Modified Approach: Re-Balance Dataset" Conclusion
An initial run of this attempt conducted the dataset rebalancing *prior to* the train/test/validation split, and produced significantly better (albeit tainted) results, most likely because oversampled copies of a line could then appear in more than one split. We kept the original class imbalance in the test and validation sets so they remain representative of the raw data, as we cannot dictate the class balance of future, new data. The resulting performance showed improvement for F1 and Recall at the expense of Precision and (to a greater extent) Accuracy.
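
The tainting described above can be reproduced with a toy example. This is a stdlib-only sketch, not the notebook's pandas code: the row ids, the 75/25 split, and the helper `shuffle_split` are all made up for illustration, and oversampling is modelled as simply duplicating every minority row once.

```python
import random

def shuffle_split(rows: list[int], train_frac: float, seed: int) -> tuple[list[int], list[int]]:
    """Shuffle a copy of `rows` and cut it into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

minority_rows = list(range(50))  # hypothetical row ids for the positive class

# (a) oversample BEFORE splitting: copies of one row can land in both splits
train_a, test_a = shuffle_split(minority_rows * 2, train_frac=0.75, seed=27)
leaked_before = set(train_a) & set(test_a)

# (b) split first, then oversample only the training portion
train_b, test_b = shuffle_split(minority_rows, train_frac=0.75, seed=27)
train_b = train_b * 2
leaked_after = set(train_b) & set(test_b)

print(f"rows shared by train and test, oversample-then-split: {len(leaked_before)}")
print(f"rows shared by train and test, split-then-oversample: {len(leaked_after)}")
```

In case (a) the model is scored partly on lines it has already seen during training, which inflates the metrics; case (b) keeps the evaluation splits untouched.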

# 4 - Modified Approach: Remove short lines

Returning to the basic transformer setup, this attempt discards exceptionally short lines from the dataset before fine-tuning, to see whether removing them improves performance relative to the earlier attempts.

**Task**: Sequence Classification (Binary)

**Classes**: 
 - Positive (1): "Dwight" - a line is spoken by the character Dwight K. Schrute (played by Rainn Wilson).
 - Negative (0): "Not Dwight" - a line is spoken by any other character than Dwight.

**Data**:
 - `speaker` as precursor to class label. Limited to top-10 most frequent speakers based on number of lines in dataset.
 - `line` as sequence text.
 - `>>` Lines with a word count at or lower than a given threshold ($N \le 5$) are discarded from the dataset. This is under the assumption that exceptionally short lines would be too challenging to differentiate.

**Encoding**:
 - Tokenizer: DistilBertTokenizerFast
 - Max Sequence Length: 128
 - Padding: True
 - Truncate: True

**Pretrained Model**:
 - DistilBert (`distilbert-base-uncased`) [(link: huggingface.co)](https://huggingface.co/distilbert-base-uncased) - Intended to mimic the standard "BERTbase" model but in a smaller/faster/more efficient way.
 - Citation: Sanh et al. "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter" (2019) - [https://arxiv.org/pdf/1910.01108.pdf](https://arxiv.org/pdf/1910.01108.pdf)

**Training**:
 - `>>` Train/Test/Validation Split: 70/15/15
 - `>>` Because some portion of the data is discarded for low word count, the 70/15/15 train/test/validation split from attempt #3 is used (to give the training dataset a fighting chance).

**Notes**:
 - Class imbalance is present (positive: 6,752; negative: 32,668; about `1:4.8` imbalance ratio).
   - `>>` After the low-word-count-cutoff step, the class imbalance ratio does not change significantly (new ratio is `1:4.7`).
 - Vocabulary: no modifications made to pretrained transformer's vocabulary.
 - Secondary data: no inclusion of secondary data (director/writer credits).

## 4.1 - Dataset Prep

Similar to Attempt #3, we'll perform the train/test/validation split manually with `pandas`.

In [81]:
# limit to top 10 most frequent speakers
top_10_speaker_list = script_df['speaker'].value_counts(normalize=True).nlargest(10).index.tolist()
columns_to_keep = ['speaker', 'line']

script_df_subset = script_df.loc[script_df['speaker'].isin(top_10_speaker_list), columns_to_keep]

# rename the 'line' column to be 'text'
script_df_subset = script_df_subset.rename(columns={'line': 'text'})

# create class label column
dwight_mask = (script_df_subset['speaker'] == 'dwight')

# new column of zeros
script_df_subset['label'] = 0

# apply the Dwight mask (as seen in the CPR scene of S05E14 "Stress Relief")
script_df_subset.loc[dwight_mask, 'label'] = 1

# adjust dtype
script_df_subset['label'] = script_df_subset['label'].astype('int8')

# check results
print(script_df_subset['label'].value_counts())

pd.concat(
    [script_df_subset.loc[script_df_subset['label'] == 0].sample(3, random_state=42),
     script_df_subset.loc[script_df_subset['label'] == 1].sample(3, random_state=42)]
)

0    32668
1     6752
Name: label, dtype: int64


Unnamed: 0,speaker,text,label
30002,michael,"That's, that is true.",0
17790,andy,All right!,0
44715,oscar,Un-be-liev-a-ble.,0
50914,dwight,We just need a pretense to talk to him. We could tell him that his mother is dying. That usually works on him. Nate. Your mother is dying.,1
10118,dwight,I'm not.,1
11313,dwight,Do you have the tools to turn a wooden mop handle into a stake?,1


In [82]:
# create new column for number of words (using `word_tokenize()` from `nltk`)
script_df_subset['word_count'] = script_df_subset['text'].apply(lambda text: len(word_tokenize(text)))

In [83]:
# plot setup
title = "Per-Line Word Count Distribution"
hlabel = "Word Count (via nltk.word_tokenize); range=[0,100] shown"
vlabel = "Frequency (log scale)"
legend_label = "Class"

template = 'plotly_white'
colormap = px.colors.qualitative.D3

width = 1000; height = 600; margin = 110

fig = px.histogram(
    script_df_subset,
    x='word_count',
    color='label',
    marginal='box',
    log_y=True,
    color_discrete_sequence=colormap,
    opacity=0.6,
    range_x=[0, 100],
)

fig.update_layout(
    template=template,
    title=dict(text=title, font=dict(size=20)),
    xaxis_title=dict(text=hlabel, font=dict(size=16)),
    yaxis_title=dict(text=vlabel, font=dict(size=16)),
    legend_title=dict(text=legend_label, font=dict(size=16)),
    #showlegend=False,
    margin=dict(l=margin, r=margin, t=margin, b=margin),
    width=width, 
    height=height,
    font=dict(family='Open Sans, Arial', color='black')
)

fig.add_vline(x=5, line_dash="dash", line_color="black")

fig.show()

In [84]:
# set a low "word count" cutoff threshold and drop lines with word counts falling at or below this threshold
word_count_low_cutoff = 5

index_at_or_below_cutoff = script_df_subset.loc[script_df_subset['word_count'] <= word_count_low_cutoff].index
num_at_or_below_cutoff = len(index_at_or_below_cutoff)
num_total_before_cutoff = len(script_df_subset.index)

script_df_subset = script_df_subset.drop(index=index_at_or_below_cutoff)

# check the class balance after this cutoff
num_negative = script_df_subset['label'].value_counts()[0]
num_positive = script_df_subset['label'].value_counts()[1]
print(f"ratio positive/negative is:\t1 to {num_negative/num_positive:0.1f}")

# check how much of the dataset was cut
print(f"percent of dataset cut is:\t{num_at_or_below_cutoff/num_total_before_cutoff*100:0.1f} %")

ratio positive/negative is:	1 to 4.7
percent of dataset cut is:	30.3 %


## 4.2 - Train/Test/Val Split

Keeping the pandas-based splitting approach from 3.2, but not applying the over/undersampling step.

In [87]:
# set parameters
train_size = 0.70
test_size = 0.15
valid_size = 0.15

assert abs(sum([train_size, test_size, valid_size]) - 1.0) < 1e-9  # tolerate floating-point rounding

split_random_seed = 27  # for Weird Al fans

# stratify by `label`
positive_index = script_df_subset.loc[script_df_subset['label'] == 1].index
negative_index = script_df_subset.loc[script_df_subset['label'] == 0].index

# first cut is training set
positive_index_train = script_df_subset.loc[positive_index].sample(
    frac=train_size,
    replace=False,
    random_state=split_random_seed
    ).index

negative_index_train = script_df_subset.loc[negative_index].sample(
    frac=train_size,
    replace=False,
    random_state=split_random_seed
    ).index

# with training set excluded, take a cut of what's left for test
positive_index_test = script_df_subset.loc[positive_index].drop(index=positive_index_train).sample(
    frac=(test_size / (test_size+valid_size)),  # accounting for train sample already removed
    replace=False,
    random_state=split_random_seed
    ).index

negative_index_test = script_df_subset.loc[negative_index].drop(index=negative_index_train).sample(
    frac=(test_size / (test_size+valid_size)),  # accounting for train sample already removed
    replace=False,
    random_state=split_random_seed
    ).index

# and then anything not in training or test is left to validation
positive_index_valid = script_df_subset.loc[positive_index] \
    .drop(index=positive_index_train) \
    .drop(index=positive_index_test).index

negative_index_valid = script_df_subset.loc[negative_index] \
    .drop(index=negative_index_train) \
    .drop(index=negative_index_test).index

# grab the glue and reassemble these pieces
#   apply `sample(frac=1.0)` to shuffle data
script_df_subset_test = pd.concat([
    script_df_subset.loc[positive_index_test],
    script_df_subset.loc[negative_index_test]
], axis='index').sample(frac=1.0)

script_df_subset_valid = pd.concat([
    script_df_subset.loc[positive_index_valid],
    script_df_subset.loc[negative_index_valid]
], axis='index').sample(frac=1.0)

#   re-assemble (no over/under sampling this time)
script_df_subset_train = pd.concat([
    script_df_subset.loc[positive_index_train],
    script_df_subset.loc[negative_index_train]
], axis='index').sample(frac=1.0)

# convert to 🤗 Dataset objects inside a DatasetDict
cols_of_interest = ['text', 'label']
class_names = ['not_dwight', 'dwight']

ds_dict = DatasetDict({
    'train': Dataset.from_pandas(
        script_df_subset_train[cols_of_interest].reset_index(drop=False)
        ).cast_column('label', ClassLabel(names=class_names)),
    'test': Dataset.from_pandas(
        script_df_subset_test[cols_of_interest].reset_index(drop=False)
        ).cast_column('label', ClassLabel(names=class_names)),
    'valid': Dataset.from_pandas(
        script_df_subset_valid[cols_of_interest].reset_index(drop=False)
        ).cast_column('label', ClassLabel(names=class_names)),
})

print(ds_dict)

Casting the dataset:   0%|          | 0/19236 [00:00<?, ? examples/s]

Casting the dataset:   0%|          | 0/4122 [00:00<?, ? examples/s]

Casting the dataset:   0%|          | 0/4121 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['index', 'text', 'label'],
        num_rows: 19236
    })
    test: Dataset({
        features: ['index', 'text', 'label'],
        num_rows: 4122
    })
    valid: Dataset({
        features: ['index', 'text', 'label'],
        num_rows: 4121
    })
})
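
The second-stage fraction used above, `test_size / (test_size + valid_size)`, is what turns "half of the remaining 30%" back into 15% of the whole. A rough sketch of that arithmetic follows, using a made-up row count of 1,000 and a hypothetical helper `three_way_sizes` (the notebook itself applies this per class via `DataFrame.sample`, so its exact counts depend on per-class rounding).

```python
def three_way_sizes(num_rows: int, train_size: float, test_size: float, valid_size: float):
    """Row counts produced by a two-stage train -> test -> validation cut."""
    num_train = round(num_rows * train_size)
    remainder = num_rows - num_train

    # fraction of the REMAINDER that goes to test, since train is already removed
    test_frac_of_remainder = test_size / (test_size + valid_size)
    num_test = round(remainder * test_frac_of_remainder)

    # whatever is left over becomes validation
    num_valid = remainder - num_test
    return num_train, num_test, num_valid

sizes = three_way_sizes(num_rows=1_000, train_size=0.70, test_size=0.15, valid_size=0.15)
print(sizes)
```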


## 4.3 - Tokenize and Encode

In [88]:
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

# tokenizer function
def tokenize_function(examples):
    return tokenizer(
        examples['text'],
        padding='longest',
        truncation=True,
        return_tensors='pt',
        max_length=128,
    )

ds_tokenized = ds_dict.map(
    tokenize_function, 
    batched=True, 
    batch_size=None
)

# note because of the word-count cutoff (and the unseeded shuffle in the split),
#   these indices will reference different lines than the previous cases
inspect_tokens(tokenizer, ds_tokenized['train'][27])
inspect_tokens(tokenizer, ds_tokenized['test'][42])

Map:   0%|          | 0/19236 [00:00<?, ? examples/s]

Map:   0%|          | 0/4122 [00:00<?, ? examples/s]

Map:   0%|          | 0/4121 [00:00<?, ? examples/s]

--------------------------------------------------
Original text:
	Why... are you here?

Label:	0

Tokenized form:
	[CLS] why . . . are you here ? [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]

Tokens as a list:
	['[CLS]', 'why', '.', '.', '.', 'are', 'you', 'here', '?', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]
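
The long run of `[PAD]` tokens in the output above is a consequence of the padding and truncation settings. The following is a toy sketch of what those two settings do to a batch of token-id lists. It is not how `DistilBertTokenizerFast` is implemented (special tokens such as `[CLS]`/`[SEP]` are ignored here), and `pad_and_truncate` plus the example ids are made up for illustration.

```python
def pad_and_truncate(batch: list[list[int]], max_length: int, pad_id: int = 0):
    """Truncate each sequence to `max_length`, then pad to the longest one left."""
    truncated = [ids[:max_length] for ids in batch]
    target_len = max(len(ids) for ids in truncated)

    input_ids = [ids + [pad_id] * (target_len - len(ids)) for ids in truncated]
    # attention mask: 1 for real tokens, 0 for padding
    attention_mask = [[1] * len(ids) + [0] * (target_len - len(ids)) for ids in truncated]
    return input_ids, attention_mask

# hypothetical token ids for three lines of very different lengths
batch = [[11, 12, 13], [21, 22, 23, 24, 25, 26, 27], [31]]
input_ids, attention_mask = pad_and_truncate(batch, max_length=5)

for ids, mask in zip(input_ids, attention_mask):
    print(ids, mask)
```

Since the notebook calls `map` with `batch_size=None`, each split is tokenized as one batch, so every line in a split is padded out to the length of that split's longest (possibly truncated) line.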

## 4.4 - Model

In [17]:
model = DistilBertForSequenceClassification.from_pretrained(
    'distilbert-base-uncased', 
    num_labels=2,
    id2label={idx: label for idx, label in enumerate(ds_dict['train'].features['label'].names)}
    )

start_time = pd.Timestamp.now().strftime(r'%Y%m%d_%H%M%S')  # yyyymmdd_hhmmss
run_name = f"low_cutoff_distilbert_{start_time}"

# setup training args
training_args = TrainingArguments(
    # model output
    run_name=run_name,
    output_dir=MODEL_DIR / run_name,
    save_strategy='epoch',
    save_total_limit=3,
    # training hyperparams
    num_train_epochs=5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    #gradient_accumulation_steps=4,
    #gradient_checkpointing=True,
    weight_decay=0.01,
    # evaluation during training
    evaluation_strategy='epoch',
    logging_strategy='epoch',
    log_level='warning',
)

# establish evaluation metrics:
#   Docs: https://huggingface.co/docs/evaluate/package_reference/main_classes#evaluate.combine
#   Each of these metrics corresponds to a script from huggingface, below are the links for each script.
#       accuracy:       https://huggingface.co/spaces/evaluate-metric/accuracy
#       f1:             https://huggingface.co/spaces/evaluate-metric/f1
#       precision:      https://huggingface.co/spaces/evaluate-metric/precision
#       recall:         https://huggingface.co/spaces/evaluate-metric/recall
metric_list = ['accuracy', 'f1', 'precision', 'recall']

metric = evaluate.combine(evaluations=metric_list)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.weight', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'pre_classifier.bias', 'classifier.w

In [18]:
print(model)
print(model.config)

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

In [93]:
time_training_start = pd.Timestamp.now()

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=ds_tokenized['train'],
    eval_dataset=ds_tokenized['test'],
    compute_metrics=compute_metrics
)

result = trainer.train()

time_training_stop = pd.Timestamp.now()
time_training = time_training_stop - time_training_start

print("\nTraining duration:", str(time_training))

# save the trained model:
trainer.save_model()    # saves to self.args.output_dir





  0%|          | 0/3010 [00:00<?, ?it/s]

{'loss': 0.4354, 'learning_rate': 4e-05, 'epoch': 1.0}


  0%|          | 0/129 [00:00<?, ?it/s]

{'eval_loss': 0.43434104323387146, 'eval_accuracy': 0.8141678796700631, 'eval_f1': 0.3987441130298273, 'eval_precision': 0.4584837545126354, 'eval_recall': 0.3527777777777778, 'eval_runtime': 7.3437, 'eval_samples_per_second': 561.294, 'eval_steps_per_second': 17.566, 'epoch': 1.0}
{'loss': 0.3491, 'learning_rate': 3e-05, 'epoch': 2.0}


  0%|          | 0/129 [00:00<?, ?it/s]

{'eval_loss': 0.4114859700202942, 'eval_accuracy': 0.8313925278990781, 'eval_f1': 0.35707678075855687, 'eval_precision': 0.5346260387811634, 'eval_recall': 0.26805555555555555, 'eval_runtime': 7.3564, 'eval_samples_per_second': 560.327, 'eval_steps_per_second': 17.536, 'epoch': 2.0}
{'loss': 0.2197, 'learning_rate': 2e-05, 'epoch': 3.0}


  0%|          | 0/129 [00:00<?, ?it/s]

{'eval_loss': 0.5776357054710388, 'eval_accuracy': 0.8309073265405144, 'eval_f1': 0.34430856067732835, 'eval_precision': 0.5335276967930029, 'eval_recall': 0.25416666666666665, 'eval_runtime': 7.3604, 'eval_samples_per_second': 560.025, 'eval_steps_per_second': 17.526, 'epoch': 3.0}
{'loss': 0.1277, 'learning_rate': 1e-05, 'epoch': 4.0}


  0%|          | 0/129 [00:00<?, ?it/s]

{'eval_loss': 0.6998736262321472, 'eval_accuracy': 0.8095584667637069, 'eval_f1': 0.3881527669524552, 'eval_precision': 0.4422735346358792, 'eval_recall': 0.3458333333333333, 'eval_runtime': 7.3528, 'eval_samples_per_second': 560.6, 'eval_steps_per_second': 17.544, 'epoch': 4.0}
{'loss': 0.0738, 'learning_rate': 0.0, 'epoch': 5.0}


  0%|          | 0/129 [00:00<?, ?it/s]

{'eval_loss': 0.9122729897499084, 'eval_accuracy': 0.8105288694808346, 'eval_f1': 0.37168141592920356, 'eval_precision': 0.4416826003824092, 'eval_recall': 0.32083333333333336, 'eval_runtime': 7.3592, 'eval_samples_per_second': 560.117, 'eval_steps_per_second': 17.529, 'epoch': 5.0}
{'train_runtime': 526.5755, 'train_samples_per_second': 182.652, 'train_steps_per_second': 5.716, 'train_loss': 0.2411499036110913, 'epoch': 5.0}

Training duration: 0 days 00:08:46.835119


## 4.5 - Evaluate

In [95]:
time_evaluation_start = pd.Timestamp.now()

final_metrics = {}
final_metrics['train'] = trainer.evaluate(eval_dataset=ds_tokenized['train'], metric_key_prefix='final_train')
final_metrics['test'] = trainer.evaluate(eval_dataset=ds_tokenized['test'], metric_key_prefix='final_test')
final_metrics['valid'] = trainer.evaluate(eval_dataset=ds_tokenized['valid'], metric_key_prefix='validation')

time_evaluation_stop = pd.Timestamp.now()
time_evaluation = time_evaluation_stop - time_evaluation_start

print("\nEvaluation duration, what's the situation:", str(time_evaluation))

# print the metrics
for split in final_metrics:
    print(f"\n{split.upper():->10}{'-'*15}")
    for k, v in final_metrics[split].items():
        print(f"{v:>10.3f} - {k}")
    print("-"*25)

  0%|          | 0/129 [00:00<?, ?it/s]

  0%|          | 0/129 [00:00<?, ?it/s]


Evaluation duration, what's the situation: 0 days 00:00:48.586719

-----TRAIN---------------
     0.034 - final_train_loss
     0.989 - final_train_accuracy
     0.969 - final_train_f1
     0.988 - final_train_precision
     0.951 - final_train_recall
    33.915 - final_train_runtime
   567.179 - final_train_samples_per_second
    17.750 - final_train_steps_per_second
     5.000 - epoch
-------------------------

------TEST---------------
     0.912 - final_test_loss
     0.811 - final_test_accuracy
     0.372 - final_test_f1
     0.442 - final_test_precision
     0.321 - final_test_recall
     7.339 - final_test_runtime
   561.668 - final_test_samples_per_second
    17.578 - final_test_steps_per_second
     5.000 - epoch
-------------------------

-----VALID---------------
     0.875 - validation_loss
     0.814 - validation_accuracy
     0.384 - validation_f1
     0.456 - validation_precision
     0.332 - validation_recall
     7.303 - validation_runtime
   564.250 - validation_samp

### 4.5.1 - Informal Test

Feeding two completely made-up lines to the fine-tuned model, mostly for fun but also as a small test of the model's performance.

In [96]:
test_lines = [
    "Assistant to the regional manager of beets, Mose and mother on the farm",
    "My name is Michael Scott, paper is my business",
]

for line in test_lines:
    print("-"*50)
    informal_test(tokenizer, model, line)

--------------------------------------------------
         Test Line:  "Assistant to the regional manager of beets, Mose and mother on the farm"
 Predicted Speaker:  not_dwight (0)
            Logits:  tensor([[ 0.7638, -1.0487]], device='cuda:0')
--------------------------------------------------
         Test Line:  "My name is Michael Scott, paper is my business"
 Predicted Speaker:  not_dwight (0)
            Logits:  tensor([[ 3.2214, -3.4033]], device='cuda:0')
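
The raw logits printed by `informal_test` are easier to compare across attempts once converted to probabilities. A minimal softmax sketch follows, using the logits for the beets line as printed in 3.5.1 and 4.5.1; the `softmax` helper is written here in plain Python for illustration and is not part of the notebook.

```python
import math

def softmax(logits: list[float]) -> list[float]:
    """Convert raw logits to probabilities (shifted by the max for numerical stability)."""
    largest = max(logits)
    exps = [math.exp(value - largest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]

class_names = ['not_dwight', 'dwight']

# logits for the "Assistant to the regional manager of beets..." line
logits_by_attempt = {
    'attempt #3 (re-balanced)': [-2.3220, 2.0004],
    'attempt #4 (word count cutoff)': [0.7638, -1.0487],
}

probs_by_attempt = {name: softmax(logits) for name, logits in logits_by_attempt.items()}

for name, probs in probs_by_attempt.items():
    predicted = class_names[probs.index(max(probs))]
    print(f"{name}: predicted {predicted}, P(dwight) = {probs[1]:.3f}")
```

Two hand-written lines are far too few to draw conclusions from, but the conversion does show that attempt #4 labels the beets line "not Dwight" with less certainty than attempt #3 labelled it "Dwight".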


## 4.6 - Discussion / Conclusions (on this attempt)

| Metrics (Train/Test/Valid)              | Accuracy              | F1 Score              | Precision             | Recall                | Fine-Tuning Time |
|-----------------------------------------|-----------------------|-----------------------|-----------------------|-----------------------|------------------|
| (1) Basic Transformer (DistilBERT)      | 0.961 / 0.810 / 0.806 | 0.875 / 0.330 / 0.322 | 0.978 / 0.417 / 0.402 | 0.791 / 0.274 / 0.268 | 0d 0h 9m 48s     |
| (2) Mod: Different PLM (RoBERTa)        | 0.917 / 0.824 / 0.825 | 0.703 / 0.314 / 0.324 | 0.909 / 0.474 / 0.480 | 0.573 / 0.235 / 0.245 | 0d 0h 45m 23s    |
| (3) Mod: Re-Balance Data (DistilBERT)   | 0.956 / 0.747 / 0.748 | 0.951 / 0.375 / 0.373 | 0.969 / 0.325 / 0.325 | 0.934 / 0.443 / 0.437 | 0d 0h 9m 42s     |
| (4) Mod: Word Count Cutoff (DistilBERT) | 0.989 / 0.811 / 0.814 | 0.969 / 0.372 / 0.384 | 0.988 / 0.442 / 0.456 | 0.951 / 0.321 / 0.332 | 0d 0h 8m 47s     |

### 4.6.1 - Accuracy
For validation accuracy, this attempt (0.814) fell between the basic transformer (attempt #1, 0.806) and the alternate transformer (attempt #2, 0.825). It achieved the highest training accuracy so far (0.989) but still did not beat the trivial classifier (roughly 82% at the 1:4.7 class ratio).

### 4.6.2 - F1 Score
We saw about the same F1 score as with the re-balanced dataset (attempt #3): a validation F1 of 0.384 versus 0.373.

### 4.6.3 - Precision / Recall
Compared to attempt #3, we appear to have effectively traded some recall (validation recall fell from 0.437 to 0.332) for a similar amount of precision (validation precision rose from 0.325 to 0.456). Because we're placing a slightly higher importance on precision over recall, this was a net improvement.
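
One way to quantify a preference for precision over recall is the F-beta score with beta below 1. The sketch below applies it to the validation precision/recall of attempts #3 and #4 from the table in 4.6. The choice of beta = 0.5 is our assumption for illustration (the notebook does not fix a value), and `fbeta` is a helper defined here.

```python
def fbeta(precision: float, recall: float, beta: float) -> float:
    """F-beta score; beta < 1 weights precision more heavily, beta = 1 gives F1."""
    denominator = (beta ** 2) * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator

# validation precision / recall as reported in the table in 4.6
validation = {
    'attempt #3': {'precision': 0.325, 'recall': 0.437},
    'attempt #4': {'precision': 0.456, 'recall': 0.332},
}

scores = {
    name: {
        'f1': fbeta(m['precision'], m['recall'], beta=1.0),
        'f0.5': fbeta(m['precision'], m['recall'], beta=0.5),  # beta=0.5 is an assumed weighting
    }
    for name, m in validation.items()
}

for name, s in scores.items():
    print(f"{name}: F1 = {s['f1']:.3f}, F0.5 = {s['f0.5']:.3f}")
```

Under plain F1 the two attempts are nearly tied, while a precision-weighted score separates them in favor of attempt #4.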

### 4.6.4 - Overall "Modified Approach: Remove Short Lines" Conclusion
For the low word count cutoff chosen (lines with 5 or fewer words were excluded), we ended up discarding about 30% of our data. It could be argued that these short lines would be too challenging to label programmatically and would need manual labeling instead, and as such the discarded data was not particularly valuable. This is still a key limitation worth noting, however.

Overall this was an interesting attempt in that it was fairly straightforward to set up but achieved better validation results on every metric than the base model and roughly tied the RoBERTa-based model.

# 5 - Modified Approach: Augment Vocabulary (Director/Writer Names)

The director and writer(s) of a given episode will dictate how a character's concept/persona manifests into an actor's performance (using the theatrical term, not the metric term) of that character. While there are multiple dimensions to an actor's performance, their spoken dialogue is a key component. Because we're working with purely text-based dialogue data, our data is not enriched by any applied tone or other auditory cues. Still, it's possible that a given writer and/or director may choose different ways for a character to express themselves through dialogue, so we're interested to see whether inclusion of the writer(s) and director can help differentiate within our data.

From a big picture view, it is believable that the writer(s) and director would be easily obtainable data to a tool seeking to label speakers. Such data would be available prior to the initial public release of an episode and without need for significant manual labeling. The motivation, likely naive, is that the way X writer versus Y writer chooses dialogue for Dwight has some degree of differentiability.

**Task**: Sequence Classification (Binary)

**Classes**: 
 - Positive (1): "Dwight" - a line is spoken by the character Dwight K. Schrute (played by Rainn Wilson).
 - Negative (0): "Not Dwight" - a line is spoken by any other character than Dwight.

**Data**:
 - `speaker` as precursor to class label. Limited to top-10 most frequent speakers based on number of lines in dataset.
 - `line` as sequence text.
 - `>>` `directed_by` as the director of the episode during which the line was spoken.
 - `>>` `writer1` as the first-listed writer of the episode during which the line was spoken.

**Encoding**:
 - Tokenizer: DistilBertTokenizerFast
 - Max Sequence Length: 128
 - Padding: True
 - Truncate: True

**Pretrained Model**:
 - DistilBert (`distilbert-base-uncased`) [(link: huggingface.co)](https://huggingface.co/distilbert-base-uncased) - Intended to mimic the standard "BERTbase" model but in a smaller/faster/more efficient way.
 - Citation: Sanh et al. "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter" (2019) - [https://arxiv.org/pdf/1910.01108.pdf](https://arxiv.org/pdf/1910.01108.pdf)

**Training**:
 - Train/Test/Validation Split: 50/25/25

**Notes**:
 - Class imbalance is present (positive: 6,752; negative: 32,668; about `1:4.8` imbalance ratio).
 - `>>` Vocabulary: tokenizer vocabulary modified to include director names and first-listed writer (`writer1`) names.
 - `>>` Secondary data: adding in each episode's director and first-listed writer by appending their names to the lines of dialogue.
 - `>>` The set of directors / writers is finite, so one approach for this experiment could be to apply One-Hot encoding and (ultimately) feed the text-based dialogue data alongside tabular data into a multi-modal model. This approach, which could be created using a package like [`multimodal-transformers` (github.com)](https://github.com/georgian-io/Multimodal-Toolkit), would keep the writer/director data closer to its original structured/tabular form.
 - `>>` Another approach, and the one being used for this experiment, is to append the writer's/director's name to the line of dialogue itself. This would not necessarily leverage the sequential components of our "sequence" classification task, but would at least provide a simple means of bootstrapping this secondary data into the main dataset. This approach does some mangling to the original dialogue data.
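
For contrast with the appending approach used here, the one-hot alternative mentioned above could look roughly like the following. This is only a sketch of the encoding step (no multi-modal model is built), the `one_hot` helper is hypothetical, and the three director names are taken from sample rows shown in 5.1 rather than from the full set.

```python
def one_hot(value: str, categories: list[str]) -> list[int]:
    """Return a 0/1 vector with a single 1 at the position of `value`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if category == value else 0 for category in categories]

# a small, sorted subset of director names (the real set is larger)
directors = sorted(['Jeffrey Blitz', 'Tucker Gates', 'Ken Kwapis'])

encoded = {name: one_hot(name, directors) for name in directors}

for name, vector in encoded.items():
    print(f"{vector}  {name}")
```

Each line of dialogue would then carry one such vector per categorical field alongside its text, leaving the dialogue itself unmodified.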

## 5.1 - Dataset Prep

In [10]:
# limit to top 10 most frequent speakers
top_10_speaker_list = script_df['speaker'].value_counts(normalize=True).nlargest(10).index.tolist()
columns_to_keep = ['speaker', 'line', 'directed_by', 'writer1']

script_df_subset = script_df.loc[script_df['speaker'].isin(top_10_speaker_list), columns_to_keep]

# rename the 'line' column to be 'text'
script_df_subset = script_df_subset.rename(columns={'line': 'text'})

# create class label column
dwight_mask = (script_df_subset['speaker'] == 'dwight')     # dwight_mask costs thirty-five hundred dollars

# new column of zeros
script_df_subset['label'] = 0

# apply the Dwight mask (as seen in the CPR scene of S05E14 "Stress Relief")
script_df_subset.loc[dwight_mask, 'label'] = 1

# adjust dtype
script_df_subset['label'] = script_df_subset['label'].astype('int8')

# check results
print(script_df_subset['label'].value_counts())

pd.concat(
    [script_df_subset.loc[script_df_subset['label'] == 0].sample(3, random_state=42),
     script_df_subset.loc[script_df_subset['label'] == 1].sample(3, random_state=42)]
)

0    32668
1     6752
Name: label, dtype: int64


Unnamed: 0,speaker,text,directed_by,writer1,label
30002,michael,"That's, that is true.",Jeffrey Blitz,Jason Kessler,0
17790,andy,All right!,Tucker Gates,Lee Eisenberg,0
44715,oscar,Un-be-liev-a-ble.,Brian Baumgartner,Halsted Sullivan,0
50914,dwight,We just need a pretense to talk to him. We could tell him that his mother is dying. That usually works on him. Nate. Your mother is dying.,Lee Kirk,Owen Ellickson,1
10118,dwight,I'm not.,Julian Farino,Justin Spitzer,1
11313,dwight,Do you have the tools to turn a wooden mop handle into a stake?,Joss Whedon,Brent Forrester,1


In [11]:
# get a list of director names
director_names = script_df_subset['directed_by'].unique().tolist()

# get a list of writer1 names
writer1_names = script_df_subset['writer1'].unique().tolist()

# wrap both in some brackets so they look more like special tokens (to human readers)
director_names = [f"[{name}]" for name in director_names]
writer1_names = [f"[{name}]" for name in writer1_names]

# make a function to append names to the line text
#   note this will be mapped to dataset on a per-row basis
def enrich_line_director_and_writer(row: pd.Series):
    director_name_wrapped = f"[{row['directed_by']}]"
    writer1_name_wrapped = f"[{row['writer1']}]"

    enriched_text = f"{row['text']} {director_name_wrapped} {writer1_name_wrapped}"

    return enriched_text

# map that function down the rows
script_df_subset['annotated_text'] = script_df_subset.apply(enrich_line_director_and_writer, axis='columns')

# check result
script_df_subset.sample(10)

Unnamed: 0,speaker,text,directed_by,writer1,label,annotated_text
7366,michael,"Yeah, I'm sure everyone would appreciate me treating them like they were gay.",Ken Kwapis,Greg Daniels,0,"Yeah, I'm sure everyone would appreciate me treating them like they were gay. [Ken Kwapis] [Greg Daniels]"
37943,dwight,Do the monkey face!,Greg Daniels,Robert Padnick,1,Do the monkey face! [Greg Daniels] [Robert Padnick]
20441,michael,Unbelievable! Unbelievable.,Randall Einhorn,Brent Forrester,0,Unbelievable! Unbelievable. [Randall Einhorn] [Brent Forrester]
40304,erin,I... I don't think I can do that.,Jeffrey Blitz,Paul Lieberstein,0,I... I don't think I can do that. [Jeffrey Blitz] [Paul Lieberstein]
35937,oscar,Guys its not worth it really. Guys this is not worth our time.,Charles McDougall,Halsted Sullivan,0,Guys its not worth it really. Guys this is not worth our time. [Charles McDougall] [Halsted Sullivan]
2140,michael,Well then I won't get a warrantee.,Paul Feig,Michael Schur,0,Well then I won't get a warrantee. [Paul Feig] [Michael Schur]
26607,angela,"Pam, my bag was there...",Randall Einhorn,Aaron Shure,0,"Pam, my bag was there... [Randall Einhorn] [Aaron Shure]"
3012,jim,Obviously.,Ken Kwapis,Gene Stupnitsky,0,Obviously. [Ken Kwapis] [Gene Stupnitsky]
832,michael,And many more!,Bryan Gordon,Michael Schur,0,And many more! [Bryan Gordon] [Michael Schur]
3262,jim,"Oh, those are drawings. In case the writing didn't really put a picture in your head. And there he is, in the flesh, Agent Michael Scarn. Now we know what he looks like.",Greg Daniels,Paul Lieberstein,0,"Oh, those are drawings. In case the writing didn't really put a picture in your head. And there he is, in the flesh, Agent Michael Scarn. Now we know what he looks like. [Greg Daniels] [Paul Lieberstein]"


In [12]:
# overwrite the original
script_df_subset['text'] = script_df_subset['annotated_text']

## 5.2 - Train/Test/Val Split

This section keeps the pandas-based splitting approach from Section 3.2 but does not apply the over-/under-sampling step.

In [13]:
# set parameters
train_size = 0.50
test_size = 0.25
valid_size = 0.25

assert sum([train_size, test_size, valid_size]) == 1.0

split_random_seed = 27  # for Weird Al fans

# stratify by `label`
positive_index = script_df_subset.loc[script_df_subset['label'] == 1].index
negative_index = script_df_subset.loc[script_df_subset['label'] == 0].index

# first cut is training set
positive_index_train = script_df_subset.loc[positive_index].sample(
    frac=train_size,
    replace=False,
    random_state=split_random_seed
    ).index

negative_index_train = script_df_subset.loc[negative_index].sample(
    frac=train_size,
    replace=False,
    random_state=split_random_seed
    ).index

# with training set excluded, take a cut of what's left for test
positive_index_test = script_df_subset.loc[positive_index].drop(index=positive_index_train).sample(
    frac=(test_size / (test_size+valid_size)),  # accounting for train sample already removed
    replace=False,
    random_state=split_random_seed
    ).index

negative_index_test = script_df_subset.loc[negative_index].drop(index=negative_index_train).sample(
    frac=(test_size / (test_size+valid_size)),  # accounting for train sample already removed
    replace=False,
    random_state=split_random_seed
    ).index

# and then anything not in training or test is left to validation
positive_index_valid = script_df_subset.loc[positive_index] \
    .drop(index=positive_index_train) \
    .drop(index=positive_index_test).index

negative_index_valid = script_df_subset.loc[negative_index] \
    .drop(index=negative_index_train) \
    .drop(index=negative_index_test).index

# grab the glue and reassemble these pieces
#   apply `sample(frac=1.0)` to shuffle data
script_df_subset_test = pd.concat([
    script_df_subset.loc[positive_index_test],
    script_df_subset.loc[negative_index_test]
], axis='index').sample(frac=1.0)

script_df_subset_valid = pd.concat([
    script_df_subset.loc[positive_index_valid],
    script_df_subset.loc[negative_index_valid]
], axis='index').sample(frac=1.0)

#   re-assemble (no over/under sampling this time)
script_df_subset_train = pd.concat([
    script_df_subset.loc[positive_index_train],
    script_df_subset.loc[negative_index_train]
], axis='index').sample(frac=1.0)

# convert to 🤗 Dataset objects inside a DatasetDict
cols_of_interest = ['text', 'label']
class_names = ['not_dwight', 'dwight']

ds_dict = DatasetDict({
    'train': Dataset.from_pandas(
        script_df_subset_train[cols_of_interest].reset_index(drop=False)
        ).cast_column('label', ClassLabel(names=class_names)),
    'test': Dataset.from_pandas(
        script_df_subset_test[cols_of_interest].reset_index(drop=False)
        ).cast_column('label', ClassLabel(names=class_names)),
    'valid': Dataset.from_pandas(
        script_df_subset_valid[cols_of_interest].reset_index(drop=False)
        ).cast_column('label', ClassLabel(names=class_names)),
})

print(ds_dict)

Casting the dataset:   0%|          | 0/19710 [00:00<?, ? examples/s]

Casting the dataset:   0%|          | 0/9855 [00:00<?, ? examples/s]

Casting the dataset:   0%|          | 0/9855 [00:00<?, ? examples/s]

DatasetDict({
    train: Dataset({
        features: ['index', 'text', 'label'],
        num_rows: 19710
    })
    test: Dataset({
        features: ['index', 'text', 'label'],
        num_rows: 9855
    })
    valid: Dataset({
        features: ['index', 'text', 'label'],
        num_rows: 9855
    })
})
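
As a sanity check on the row counts above, the expected size of each split can be worked out from the class counts printed in Section 5.1 (32,668 `not_dwight` lines and 6,752 `dwight` lines). The sketch below is arithmetic only: the helper name `expected_split_sizes` is ours, and it assumes each `sample(frac=...)` call keeps `round(frac * n)` rows.

```python
def expected_split_sizes(n_rows: int, train: float, test: float, valid: float) -> dict[str, int]:
    """Mirror the three-step split for one class: train first, then test, remainder to valid."""
    n_train = round(n_rows * train)
    remaining = n_rows - n_train
    n_test = round(remaining * test / (test + valid))
    n_valid = remaining - n_test
    return {"train": n_train, "test": n_test, "valid": n_valid}

class_counts = {"not_dwight": 32668, "dwight": 6752}  # from the Section 5.1 output

per_class = {name: expected_split_sizes(n, 0.50, 0.25, 0.25) for name, n in class_counts.items()}
totals = {split: sum(sizes[split] for sizes in per_class.values()) for split in ("train", "test", "valid")}

print(per_class)
print(totals)
```

Because both classes are split with the same fractions, each split keeps roughly the original 17% share of Dwight lines.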


## 5.3 - Tokenize and Encode

In [14]:
tokenizer = DistilBertTokenizerFast.from_pretrained('distilbert-base-uncased')

# add new special tokens for writer1/director
#   note: from docs, setting `special_tokens=True` should indicate to tokenizer object
#       that these director/writer names are "special" in that they shouldn't be broken into smaller pieces,
#       but not so "special" that they're treated like a [SEP]/[PAD]/[MASK]/etc. special token.
additional_special_tokens = [*director_names, *writer1_names]
tokens_added = tokenizer.add_tokens(additional_special_tokens, special_tokens=True)
print(f">> Added {tokens_added} special tokens")

# tokenizer function
def tokenize_function(examples):
    return tokenizer(
        examples['text'],
        padding='longest',
        truncation=True,
        return_tensors='pt',
        max_length=128,
    )

ds_tokenized = ds_dict.map(
    tokenize_function, 
    batched=True, 
    batch_size=None
)

# note: the splits here are shuffled and no over-/under-sampling is applied, so these
#   indices will reference different lines than in the previous sections
inspect_tokens(tokenizer, ds_tokenized['train'][200])
inspect_tokens(tokenizer, ds_tokenized['test'][42])

>> Added 82 special tokens


Map:   0%|          | 0/19710 [00:00<?, ? examples/s]

Map:   0%|          | 0/9855 [00:00<?, ? examples/s]

Map:   0%|          | 0/9855 [00:00<?, ? examples/s]

--------------------------------------------------
Original text:
	All we can do is sit and wait. [Brent Forrester] [Graham Wagner]

Label:	1

Tokenized form:
	[CLS] all we can do is sit and wait . [Brent Forrester] [Graham Wagner] [SEP] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD] [PAD]

Tokens as a list:
	['[CLS]', 'all', 'we', 'can', 'do', 'is', 'sit', 'an

## 5.4 - Model

In [15]:
model = DistilBertForSequenceClassification.from_pretrained(
    'distilbert-base-uncased', 
    num_labels=2,
    id2label={idx: label for idx, label in enumerate(ds_dict['train'].features['label'].names)}
    )

# because we added special tokens in section 5.3, have to adjust embedding layer size, too
model.resize_token_embeddings(len(tokenizer))

start_time = pd.Timestamp.now().strftime(r'%Y%m%d_%H%M%S')  # yyyymmdd_hhmmss
run_name = f"augmented_vocab_distilbert_{start_time}"

# setup training args
training_args = TrainingArguments(
    # model output
    run_name=run_name,
    output_dir=MODEL_DIR / run_name,
    save_strategy='epoch',
    save_total_limit=3,
    # training hyperparams
    num_train_epochs=5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    #gradient_accumulation_steps=4,
    #gradient_checkpointing=True,
    weight_decay=0.01,
    # evaluation during training
    evaluation_strategy='epoch',
    logging_strategy='epoch',
    log_level='warning',
)

# establish evaluation metrics:
#   Docs: https://huggingface.co/docs/evaluate/package_reference/main_classes#evaluate.combine
#   Each of these metrics corresponds to a script from huggingface, below are the links for each script.
#       accuracy:       https://huggingface.co/spaces/evaluate-metric/accuracy
#       f1:             https://huggingface.co/spaces/evaluate-metric/f1
#       precision:      https://huggingface.co/spaces/evaluate-metric/precision
#       recall:         https://huggingface.co/spaces/evaluate-metric/recall
metric_list = ['accuracy', 'f1', 'precision', 'recall']

metric = evaluate.combine(evaluations=metric_list)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.weight', 'vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_projector.bias', 'vocab_projector.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.bias', 'pre_classi

In [16]:
print(model)
print(model.config)

DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30604, 768)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin
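
The `word_embeddings` size printed above can be cross-checked with simple arithmetic. `distilbert-base-uncased` ships with a 30,522-token vocabulary, and Section 5.3 reported 82 added tokens. The variable names below are illustrative only.

```python
base_vocab_size = 30522      # vocabulary size of distilbert-base-uncased
tokens_added = 82            # reported by tokenizer.add_tokens() in Section 5.3
hidden_size = 768            # embedding width shown in the model printout

resized_vocab_size = base_vocab_size + tokens_added
new_embedding_params = tokens_added * hidden_size

print(resized_vocab_size)     # rows in the resized embedding matrix
print(new_embedding_params)   # embedding weights with no pretrained values
```

The rows added for the director/writer tokens have no pretrained values, so whatever the model learns about those names has to come from fine-tuning alone.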

In [17]:
time_training_start = pd.Timestamp.now()

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=ds_tokenized['train'],
    eval_dataset=ds_tokenized['test'],
    compute_metrics=compute_metrics
)

result = trainer.train()

time_training_stop = pd.Timestamp.now()
time_training = time_training_stop - time_training_start

print("\nTraining duration:", str(time_training))

# save the trained model:
trainer.save_model()    # saves to self.args.output_dir



  0%|          | 0/3080 [00:00<?, ?it/s]

{'loss': 0.4356, 'learning_rate': 4e-05, 'epoch': 1.0}


  0%|          | 0/308 [00:00<?, ?it/s]

{'eval_loss': 0.4125076234340668, 'eval_accuracy': 0.8356164383561644, 'eval_f1': 0.1834677419354839, 'eval_precision': 0.6148648648648649, 'eval_recall': 0.10781990521327015, 'eval_runtime': 17.7223, 'eval_samples_per_second': 556.08, 'eval_steps_per_second': 17.379, 'epoch': 1.0}
{'loss': 0.3535, 'learning_rate': 3e-05, 'epoch': 2.0}


  0%|          | 0/308 [00:00<?, ?it/s]

{'eval_loss': 0.44452229142189026, 'eval_accuracy': 0.8360223236935566, 'eval_f1': 0.22605363984674332, 'eval_precision': 0.59, 'eval_recall': 0.13981042654028436, 'eval_runtime': 18.0013, 'eval_samples_per_second': 547.459, 'eval_steps_per_second': 17.11, 'epoch': 2.0}
{'loss': 0.2446, 'learning_rate': 2e-05, 'epoch': 3.0}


  0%|          | 0/308 [00:00<?, ?it/s]

{'eval_loss': 0.5382821559906006, 'eval_accuracy': 0.8290208016235413, 'eval_f1': 0.32788193059433585, 'eval_precision': 0.5018315018315018, 'eval_recall': 0.24348341232227488, 'eval_runtime': 17.7481, 'eval_samples_per_second': 555.272, 'eval_steps_per_second': 17.354, 'epoch': 3.0}
{'loss': 0.1574, 'learning_rate': 1e-05, 'epoch': 4.0}


  0%|          | 0/308 [00:00<?, ?it/s]

{'eval_loss': 0.7310391068458557, 'eval_accuracy': 0.8122780314561137, 'eval_f1': 0.34397163120567376, 'eval_precision': 0.4284452296819788, 'eval_recall': 0.28732227488151657, 'eval_runtime': 17.6606, 'eval_samples_per_second': 558.022, 'eval_steps_per_second': 17.44, 'epoch': 4.0}
{'loss': 0.1092, 'learning_rate': 0.0, 'epoch': 5.0}


  0%|          | 0/308 [00:00<?, ?it/s]

{'eval_loss': 0.8599780797958374, 'eval_accuracy': 0.8048706240487062, 'eval_f1': 0.3278573925200979, 'eval_precision': 0.3998294970161978, 'eval_recall': 0.2778436018957346, 'eval_runtime': 17.6618, 'eval_samples_per_second': 557.985, 'eval_steps_per_second': 17.439, 'epoch': 5.0}
{'train_runtime': 597.5382, 'train_samples_per_second': 164.927, 'train_steps_per_second': 5.154, 'train_loss': 0.2600630351475307, 'epoch': 5.0}

Training duration: 0 days 00:09:57.772052
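
The progress-bar total and the logged learning rates follow from the training arguments. The sketch below reproduces them under the assumption that `Trainer` is running on a single device with its defaults (initial learning rate of 5e-5, linear decay to zero, no warmup), since Section 5.4 does not override them.

```python
import math

train_rows = 19710
batch_size = 32
num_epochs = 5
initial_lr = 5e-5   # assumed: TrainingArguments default, not set explicitly in Section 5.4

steps_per_epoch = math.ceil(train_rows / batch_size)
total_steps = steps_per_epoch * num_epochs

def linear_lr(step: int) -> float:
    """Learning rate after `step` optimizer steps under linear decay with no warmup."""
    return initial_lr * (1 - step / total_steps)

lr_by_epoch = [linear_lr(epoch * steps_per_epoch) for epoch in range(1, num_epochs + 1)]

print(steps_per_epoch, total_steps)
print(lr_by_epoch)
```

Under these assumptions the learning rate reaches zero exactly at the end of epoch 5, so changing `num_train_epochs` also changes how quickly the rate decays in the earlier epochs.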


## 5.5 - Evaluate

In [18]:
time_evaluation_start = pd.Timestamp.now()

final_metrics = {}
final_metrics['train'] = trainer.evaluate(eval_dataset=ds_tokenized['train'], metric_key_prefix='final_train')
final_metrics['test'] = trainer.evaluate(eval_dataset=ds_tokenized['test'], metric_key_prefix='final_test')
final_metrics['valid'] = trainer.evaluate(eval_dataset=ds_tokenized['valid'], metric_key_prefix='validation')

time_evaluation_stop = pd.Timestamp.now()
time_evaluation = time_evaluation_stop - time_evaluation_start

print("\nEvaluation duration, what's the situation:", str(time_evaluation))

# print the metrics
for split in final_metrics:
    print(f"\n{split.upper():->10}{'-'*15}")
    for k, v in final_metrics[split].items():
        print(f"{v:>10.3f} - {k}")
    print("-"*25)

  0%|          | 0/616 [00:00<?, ?it/s]

  0%|          | 0/308 [00:00<?, ?it/s]

  0%|          | 0/308 [00:00<?, ?it/s]


Evaluation duration, what's the situation: 0 days 00:01:10.015885

-----TRAIN---------------
     0.062 - final_train_loss
     0.977 - final_train_accuracy
     0.929 - final_train_f1
     0.977 - final_train_precision
     0.885 - final_train_recall
    34.626 - final_train_runtime
   569.230 - final_train_samples_per_second
    17.790 - final_train_steps_per_second
     5.000 - epoch
-------------------------

------TEST---------------
     0.860 - final_test_loss
     0.805 - final_test_accuracy
     0.328 - final_test_f1
     0.400 - final_test_precision
     0.278 - final_test_recall
    17.649 - final_test_runtime
   558.377 - final_test_samples_per_second
    17.451 - final_test_steps_per_second
     5.000 - epoch
-------------------------

-----VALID---------------
     0.861 - validation_loss
     0.807 - validation_accuracy
     0.334 - validation_f1
     0.409 - validation_precision
     0.283 - validation_recall
    17.703 - validation_runtime
   556.681 - validation_samp

### 5.5.1 - Informal Test

Feeding two completely made-up lines to the fine-tuned model, mostly for fun but also as a small test of the model's performance.

In [19]:
test_lines = [
    "Assistant to the regional manager of beets, Mose and mother on the farm",
    "My name is Michael Scott, paper is my business",
]

for line in test_lines:
    print("-"*50)
    informal_test(tokenizer, model, line)

--------------------------------------------------
         Test Line:  "Assistant to the regional manager of beets, Mose and mother on the farm"
 Predicted Speaker:  dwight (1)
            Logits:  tensor([[-3.3036,  2.9753]], device='cuda:0')
--------------------------------------------------
         Test Line:  "My name is Michael Scott, paper is my business"
 Predicted Speaker:  not_dwight (0)
            Logits:  tensor([[ 3.8263, -3.1304]], device='cuda:0')
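
Raw logits are hard to read as confidence levels. As a rough aid, the softmax of the logits printed above gives the class probabilities the model assigns. This is a plain-Python sketch; the `softmax` helper is ours and is not part of `informal_test`.

```python
import math

def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

class_names = ["not_dwight", "dwight"]
logits_by_line = {
    "beets line": [-3.3036, 2.9753],
    "Michael Scott line": [3.8263, -3.1304],
}

for name, logits in logits_by_line.items():
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    print(f"{name}: {class_names[best]} with p = {probs[best]:.4f}")
```

Both predictions come out above 0.99, though high confidence on two hand-written lines says little about how well the model generalizes.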


## 5.6 - Discussion / Conclusions (on this attempt)

| Metrics (Train/Test/Valid)              | Accuracy              | F1 Score              | Precision             | Recall                | Fine-Tuning Time |
|-----------------------------------------|-----------------------|-----------------------|-----------------------|-----------------------|------------------|
| (1) Basic Transformer (DistilBERT)      | 0.961 / 0.810 / 0.806 | 0.875 / 0.330 / 0.322 | 0.978 / 0.417 / 0.402 | 0.791 / 0.274 / 0.268 | 0d 0h 9m 48s     |
| (2) Mod: Different PLM (RoBERTa)        | 0.917 / 0.824 / 0.825 | 0.703 / 0.314 / 0.324 | 0.909 / 0.474 / 0.480 | 0.573 / 0.235 / 0.245 | 0d 0h 45m 23s    |
| (3) Mod: Re-Balance Data (DistilBERT)   | 0.956 / 0.747 / 0.748 | 0.951 / 0.375 / 0.373 | 0.969 / 0.325 / 0.325 | 0.934 / 0.443 / 0.437 | 0d 0h 9m 42s     |
| (4) Mod: Word Count Cutoff (DistilBERT) | 0.989 / 0.811 / 0.814 | 0.969 / 0.372 / 0.384 | 0.988 / 0.442 / 0.456 | 0.951 / 0.321 / 0.332 | 0d 0h 8m 47s     |
| (5) Mod: Augmented Vocab (DistilBERT)   | 0.977 / 0.805 / 0.807 | 0.929 / 0.328 / 0.334 | 0.977 / 0.400 / 0.409 | 0.885 / 0.278 / 0.283 | 0d 0h 9m 58s     |

### 5.6.1 - Accuracy
The final accuracy of this attempt (0.805 / 0.807 test/valid) did not meaningfully improve on the basic transformer (0.810 / 0.806), though we did note that the test accuracy briefly surpassed that of the trivial classifier early in fine-tuning (peaking at 0.836 after epoch #2). We chose not to vary the number of epochs or end training early for this attempt so as to control for that variable, but future work on hyperparameter optimization could explore this further.
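
The "trivial classifier" referenced here always predicts the majority class (`not_dwight`). Its accuracy follows directly from the class counts in Section 5.1. The sketch below works that out and compares it against the per-epoch test accuracies logged in Section 5.4 (rounded to four places).

```python
class_counts = {"not_dwight": 32668, "dwight": 6752}  # from the Section 5.1 output

# a majority-class classifier is right exactly as often as the majority class occurs
trivial_accuracy = max(class_counts.values()) / sum(class_counts.values())

# test-set accuracy after each epoch, rounded from the Section 5.4 training log
accuracy_by_epoch = {1: 0.8356, 2: 0.8360, 3: 0.8290, 4: 0.8123, 5: 0.8049}
epochs_above_trivial = [epoch for epoch, acc in accuracy_by_epoch.items() if acc > trivial_accuracy]

print(f"trivial accuracy: {trivial_accuracy:.4f}")
print(f"epochs above trivial: {epochs_above_trivial}")
```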

### 5.6.2 - F1 Score
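F1 score on the test set rose from epoch to epoch through epoch #4 (peaking at 0.344) before dipping to 0.328 after epoch #5. Its final value was essentially level with the basic transformer model: slightly below on the test set (0.328 vs. 0.330) and slightly above on the validation set (0.334 vs. 0.322).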
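
F1 is the harmonic mean of precision and recall, so it can be recomputed from the logged values. The sketch below does so for each epoch's test-set metrics from Section 5.4 (rounded to four places), which makes the trajectory across epochs easy to inspect. The `f1_score` helper is ours, not the `evaluate` implementation.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) on the test set after each epoch, from the Section 5.4 log
pr_by_epoch = {
    1: (0.6149, 0.1078),
    2: (0.5900, 0.1398),
    3: (0.5018, 0.2435),
    4: (0.4284, 0.2873),
    5: (0.3998, 0.2778),
}

for epoch, (precision, recall) in pr_by_epoch.items():
    print(f"epoch {epoch}: F1 = {f1_score(precision, recall):.3f}")
```

Precision falls while recall rises across the epochs, and because the harmonic mean is pulled toward the smaller of the two values, F1 tracks recall closely here.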

### 5.6.3 - Precision / Recall
Neither precision nor recall was exceptional in this attempt; both fell around the middle of the pack among the five approaches.

### 5.6.4 - Overall "Modified Approach: Augmented Vocab" Conclusion
This attempt appears to have some potential given further hyperparameter tuning, especially around the length of training and the learning rate schedule. Even so, including additional information in this manner could be a step towards data leakage, though the information added in this attempt was at least "easy" to obtain and would be well known during show production.