For supervised learning, I recently discovered the Hugging Face Hub via a podcast and wanted to try out its capabilities, in particular models already pre-trained on the IMDB dataset, by implementing one in the fellowship challenge. The hints for the challenge are in the README and repeated here, together with my answers, to give an idea of my approach.
Hints for the challenge: https://arxiv.org/pdf/1910.01108.pdf
1. Ask yourself why would they have selected this problem for the challenge? What are some gotchas in this domain I should know about?
Models in NLP, and thus in sentiment analysis, are most useful when they have been pre-trained on data similar to the data they should predict on. Transformer models such as BERT were pre-trained with masked language modelling and next sentence prediction on the Toronto BookCorpus and English Wikipedia.
2. What is the highest level of accuracy that others have achieved with this dataset or similar problems / datasets?
With over 215 sentiment analysis models ready to be deployed on Hugging Face, I chose DistilBERT because it is faster to compute, cheaper in processing power and smaller than the original BERT model, while preserving over 95 % of BERT's performance. I will apply the lazy predict library to benchmark it against other vanilla models.
3. What types of visualizations will help me grasp the nature of the problem / data?
In the sentiment analysis case, printing a sample of rows gives insight into the nature of the raw and/or pre-processed data, and lets me check that fillers were removed and that words were correctly converted into their base form.
4. What feature engineering might help improve the signal?
Thinking about text classification and sentiment analysis in general, human sarcasm is very hard to grasp (even for humans) and thus hard for these models to predict, as sarcasm is context dependent. One option would be to create an additional binary variable to flag sarcasm (0 if absent, 1 if present) and perhaps a string variable indicating the word that is meant sarcastically, although labelling that manually would require a lot of tedious pre-processing work and is thus not feasible here.
5. Which modeling techniques are good at capturing the types of relationships I see in this data?
Models pre-trained on the same or a similar dataset (i.e. movie reviews) will be helpful.
6. Now that I have a model, how can I be sure that I didn't introduce a bug in the code? If results are too good to be true, they probably are!
As I am new to sentiment analysis, I use the lazy predict library to get an idea of the overall accuracy of other models and to benchmark my model's performance against them.
7. What are some of the weaknesses of the model and how can the model be improved with additional work?
Diagnostic tests from linguists are sparse.
DistilBERT performs very poorly on negations in the sensitivity tests applied to it (for example, "a robin is not a bird" yields a very bad prediction). The tests were run on BERT (https://arxiv.org/pdf/1907.13528.pdf); DistilBERT is a distilled version of BERT and shares the same caveats. Future fine-tuning might prove fruitful for negations, but given the size of the training data and the model complexity, the structure of this model might simply be unable to handle negations, in which case a different model would be more adequate.
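One way to make the negation weakness measurable is a small probe that checks how often the predicted label changes when a sentence is negated. The sketch below is my own illustration, not taken from the linked paper: the function name `negation_flip_rate`, the probe sentences and the toy `keyword_classifier` are all hypothetical, and the toy classifier only exists so the example runs without downloading a model.

```python
from typing import Callable, Iterable, Tuple


def negation_flip_rate(
    classify: Callable[[str], str],
    pairs: Iterable[Tuple[str, str]],
) -> float:
    """Share of (affirmative, negated) pairs on which the predicted label changes."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one sentence pair")
    flips = sum(classify(plain) != classify(negated) for plain, negated in pairs)
    return flips / len(pairs)


# Hypothetical probe sentences; a real probe set would be much larger.
PROBE_PAIRS = [
    ("This movie is good.", "This movie is not good."),
    ("I enjoyed the acting.", "I did not enjoy the acting."),
    ("The plot makes sense.", "The plot does not make sense."),
]


def keyword_classifier(text: str) -> str:
    """Toy stand-in that ignores negation, used only to keep the sketch runnable."""
    positive_words = ("good", "enjoy", "sense")
    return "LABEL_1" if any(word in text.lower() for word in positive_words) else "LABEL_0"


# A classifier that handles negation well should flip on most pairs.
print(negation_flip_rate(keyword_classifier, PROBE_PAIRS))
```

To probe a real model, `classify` could wrap a Hugging Face pipeline, for example `lambda text: some_pipeline(text)[0]["label"]`, where `some_pipeline` is whatever sentiment pipeline has been loaded.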




1. Accessing hugging face by installing transformers
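Use pipeline to make predictions on a list of texts. Note that the cell below passes the base distilbert-base-uncased checkpoint rather than the pipeline's default sentiment model, so the classification head is newly initialised (see the warning in the output) and the scores serve only as an untrained baseline. Use Google Colab or a GPU/TPU to run the code faster.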

In [None]:
!pip --version # Check pip's version for updates

pip 21.1.3 from /usr/local/lib/python3.8/dist-packages/pip (python 3.8)


In [None]:
!pip install -q transformers
!pip install -q torch
!pip install -q tensorflow


In [None]:
from transformers import pipeline
# Loads the base checkpoint; leaving out `model` would use the pipeline default,
# distilbert-base-uncased-finetuned-sst-2-english (revision af0f99b).
sentiment_pipeline = pipeline("sentiment-analysis", model = "distilbert-base-uncased")
data = ["I love you", "I hate you"]
sentiment_pipeline(data)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_projector.bias', 'vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'classifi

[{'label': 'LABEL_1', 'score': 0.5189329981803894},
 {'label': 'LABEL_1', 'score': 0.5201061964035034}]
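Both scores sit close to 0.5 because the classification head was newly initialised, so its logits are close to zero and the softmax over two classes is nearly uniform. The sketch below illustrates this with plain Python; the logit values are hypothetical and chosen only to be small, they are not read from the model.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical small logits, as one might expect from an untrained head.
untrained_logits = [-0.04, 0.036]
probs = softmax(untrained_logits)
print(probs)

# Hypothetical well-separated logits, as one might expect after fine-tuning.
print(softmax([-2.0, 2.0]))
```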

In [None]:
# Import torch to check whether a CUDA GPU is available.
import torch
torch.cuda.is_available()  # the parentheses are needed; the bare name only returns the function object

In [None]:
# Install libraries available from huggingface and install git-lfs for model repository use.
!pip install datasets transformers huggingface_hub
!git lfs install

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.8.0-py3-none-any.whl (452 kB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting multiprocess
  Downloading multiprocess-0.70.14-py38-none-any.whl (132 kB)
Collecting xxhash
  Downloading xxhash-3.1.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1
  Downloading urllib3-1.25.11-py2.py3-none-any.whl (127 kB)
Installing collected packages: urllib3, xxhash, responses, multiprocess, datasets
  Attempting uninstall: urllib3
    Found existing installation: urllib3 1.24.3
    Uninstalling urllib3-1.24.

Error: Failed to call git rev-parse --git-dir --show-toplevel: "fatal: not a git repository (or any of the parent directories): .git\n"
Git LFS initialized.


2. Preprocess data
In sentiment analysis, preprocessing the text data is key. I load the datasets library to preprocess the data rather than doing it manually, which I aim to learn during the fellowship.

In [None]:
from datasets import load_dataset
imdb = load_dataset("imdb") # Import IMDB dataset

Downloading and preparing dataset imdb/plain_text to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1...


Generating train split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating unsupervised split:   0%|          | 0/50000 [00:00<?, ? examples/s]

Dataset imdb downloaded and prepared to /root/.cache/huggingface/datasets/imdb/plain_text/1.0.0/2fdd8b9bcadd6e7055e742a706876ba43f19faee861df134affd7a3f60fc38a1. Subsequent calls will reuse this data.


In [None]:
# Create a smaller dataset to enable faster tests and training times
smallTrainDataset = imdb["train"].shuffle(seed=42).select(range(4000))
smallTestDataset = imdb["test"].shuffle(seed=42).select(range(400))
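After subsampling it is worth checking that both classes are still roughly balanced, since accuracy is only easy to interpret on balanced data. The helper below is a minimal sketch; `label_shares` is a hypothetical name and the label list is made up. In the notebook, the input could be `smallTrainDataset["label"]`, assuming that column access returns an iterable of integers.

```python
from collections import Counter


def label_shares(labels):
    """Return the fraction of examples per label value."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in sorted(counts.items())}


# Hypothetical labels standing in for smallTrainDataset["label"].
example_labels = [0, 1, 1, 0, 1, 0, 0, 1]
print(label_shares(example_labels))
```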

In [None]:
# Using huggingface's tokenizers to preprocess data, utilising the base DistilBERT model
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

In [None]:
# Define a preprocessing function and apply it to the training and test sets with batched mapping
def preprocessFunction(examples):
    return tokenizer(examples["text"], truncation = True)

tokenizedTrain = smallTrainDataset.map(preprocessFunction, batched = True)
tokenizedTest = smallTestDataset.map(preprocessFunction, batched = True)

In [None]:
# Use a data collator that pads each batch to its longest sequence (dynamic padding) and returns PyTorch tensors, which speeds up training
from transformers import DataCollatorWithPadding
dataCollator = DataCollatorWithPadding(tokenizer = tokenizer)
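To make clear what dynamic padding means, the sketch below pads a batch of token-id lists to the longest sequence in that batch and builds the matching attention mask. It is a plain-Python illustration of the idea, not the library's implementation; `pad_batch` is a hypothetical name and the token ids are made up, apart from 101 and 102, which are the [CLS] and [SEP] ids in the uncased BERT vocabulary, and the pad id 0.

```python
def pad_batch(batch, pad_token_id=0):
    """Pad token-id lists to the longest sequence in this batch only."""
    longest = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_token_id] * (longest - len(ids)) for ids in batch]
    # 1 marks real tokens, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Hypothetical token ids for two reviews of different length.
batch = [[101, 2023, 102], [101, 2023, 3185, 2003, 102]]
padded = pad_batch(batch)
print(padded["input_ids"])
print(padded["attention_mask"])
```

Padding per batch rather than to a global maximum keeps short batches short, which is where the speed-up comes from.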

3. Training the model
Throw away the pretrained head of the DistilBERT model and replace it with a classification head, which is then fine-tuned for sentiment analysis on the IMDB data. I will use the Trainer API in Hugging Face, which is designed for transformers models such as DistilBERT.

In [None]:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_projector.bias', 'vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_projector.weight', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'classifi

In [None]:
# Define evaluation metrics, accuracy and F1 score in my case
import numpy as np
from datasets import load_metric  # deprecated in newer datasets releases in favour of the evaluate library

# Load the metrics once instead of on every evaluation call
loadAccuracy = load_metric("accuracy")
loadF1 = load_metric("f1")

def computeMetrics(evalPred):
    logits, labels = evalPred
    predictions = np.argmax(logits, axis = -1)  # most likely class per example
    accuracy = loadAccuracy.compute(predictions = predictions, references = labels)["accuracy"]
    f1 = loadF1.compute(predictions = predictions, references = labels)["f1"]
    return {"accuracy": accuracy, "f1": f1}
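For reference, this is what the two metrics compute for a binary task. The sketch below is a plain-Python version written for illustration; `accuracy_and_f1` is a hypothetical helper, not part of the datasets library, and the labels are made up.

```python
def accuracy_and_f1(predictions, references, positive_label=1):
    """Accuracy and binary F1 computed from two equal-length label lists."""
    if len(predictions) != len(references) or len(references) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    pairs = list(zip(predictions, references))
    tp = sum(p == positive_label and r == positive_label for p, r in pairs)
    fp = sum(p == positive_label and r != positive_label for p, r in pairs)
    fn = sum(p != positive_label and r == positive_label for p, r in pairs)
    correct = sum(p == r for p, r in pairs)
    # F1 is the harmonic mean of precision and recall: 2TP / (2TP + FP + FN).
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return {"accuracy": correct / len(pairs), "f1": f1}


# Hypothetical predictions and gold labels.
scores = accuracy_and_f1([1, 0, 1, 1, 0], [1, 0, 0, 1, 1])
print(scores)
```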

In [None]:
# Log in to the Hugging Face account; only needed when pushing the model to the Hub.
from huggingface_hub import notebook_login
notebook_login()


Token is valid.
Your token has been saved in your configured git credential helpers (store).
Your token has been saved to /root/.huggingface/token
Login successful


In [None]:
# Define the training arguments and define a trainer
from transformers import TrainingArguments, Trainer

repoName = "finetuning-DistilBERT-model-4000-samples"

trainingArgs = TrainingArguments(
    output_dir = repoName,
    learning_rate = 2e-5, # small step size, common for fine-tuning; smaller values converge more slowly
    per_device_train_batch_size = 16, # arbitrary choice
    per_device_eval_batch_size = 16, # arbitrary choice
    num_train_epochs = 2, # arbitrary choice; a few epochs are typical when fine-tuning
    weight_decay = 0.01, # also arbitrary
    save_strategy = "epoch", # save a checkpoint after every epoch
    push_to_hub = True, # set to False when not working with the Hugging Face Hub
)

trainer = Trainer(
    model = model,
    args = trainingArgs,
    train_dataset = tokenizedTrain,
    eval_dataset = tokenizedTest,
    tokenizer = tokenizer,
    data_collator = dataCollator,
    compute_metrics = computeMetrics,
)

PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
/content/finetuning-DistilBERT-model-4000-samples is already a clone of https://huggingface.co/gent-scholar/finetuning-DistilBERT-model-4000-samples. Make sure you pull the latest changes with `repo.git_pull()`.


In [None]:
# Fine tune the model by using the trainer function
trainer.train()

***** Running training *****
  Num examples = 4000
  Num Epochs = 2
  Instantaneous batch size per device = 16
  Total train batch size (w. parallel, distributed & accumulation) = 16
  Gradient Accumulation steps = 1
  Total optimization steps = 500
  Number of trainable parameters = 66955010
The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.


Step,Training Loss
500,0.2813


Saving model checkpoint to finetuning-DistilBERT-model-4000-samples/checkpoint-250
Configuration saved in finetuning-DistilBERT-model-4000-samples/checkpoint-250/config.json
Model weights saved in finetuning-DistilBERT-model-4000-samples/checkpoint-250/pytorch_model.bin
tokenizer config file saved in finetuning-DistilBERT-model-4000-samples/checkpoint-250/tokenizer_config.json
Special tokens file saved in finetuning-DistilBERT-model-4000-samples/checkpoint-250/special_tokens_map.json
tokenizer config file saved in finetuning-DistilBERT-model-4000-samples/tokenizer_config.json
Special tokens file saved in finetuning-DistilBERT-model-4000-samples/special_tokens_map.json
Saving model checkpoint to finetuning-DistilBERT-model-4000-samples/checkpoint-500
Configuration saved in finetuning-DistilBERT-model-4000-samples/checkpoint-500/config.json
Model weights saved in finetuning-DistilBERT-model-4000-samples/checkpoint-500/pytorch_model.bin
tokenizer config file saved in finetuning-DistilBERT

TrainOutput(global_step=500, training_loss=0.28127996826171875, metrics={'train_runtime': 409.8287, 'train_samples_per_second': 19.52, 'train_steps_per_second': 1.22, 'total_flos': 1048802349646464.0, 'train_loss': 0.28127996826171875, 'epoch': 2.0})
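As a sanity check on the training log, the step count and throughput can be reproduced from the settings by hand. The sketch below only uses numbers that appear in the output above and assumes a single device without gradient accumulation, as the log states.

```python
import math

num_examples = 4000        # "Num examples" in the training log
batch_size = 16            # per_device_train_batch_size on a single device
num_epochs = 2             # num_train_epochs
train_runtime = 409.8287   # seconds, from TrainOutput

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
samples_per_second = num_examples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(steps_per_epoch, total_steps)
print(round(samples_per_second, 2), round(steps_per_second, 2))
```

The 250 steps per epoch also explain why checkpoints are saved at steps 250 and 500 with `save_strategy = "epoch"`.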

In [None]:
# Compute evaluation metrics
trainer.evaluate()

***** Running Evaluation *****
  Num examples = 400
  Batch size = 16
The following columns in the evaluation set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`,  you can safely ignore this message.


{'eval_loss': 0.2928427457809448,
 'eval_accuracy': 0.905,
 'eval_f1': 0.9107981220657277,
 'eval_runtime': 8.9839,
 'eval_samples_per_second': 44.524,
 'eval_steps_per_second': 2.783}
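With only 400 test examples, the accuracy estimate carries noticeable uncertainty, which matters when asking whether a result is too good to be true. The sketch below computes a normal-approximation (Wald) interval; this is my own addition and assumes the test examples are independent. For 0.905 on 400 examples it gives an interval of roughly 0.88 to 0.93.

```python
import math


def accuracy_interval(accuracy, n, z=1.96):
    """Normal-approximation (Wald) confidence interval for an accuracy estimate."""
    stderr = math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy - z * stderr, accuracy + z * stderr


# eval_accuracy and number of evaluation examples from the output above.
low, high = accuracy_interval(0.905, 400)
print(round(low, 3), round(high, 3))
```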

4. Apply model to new data and look at performance


In [None]:
# Push the model to the Hugging Face Hub (optional)
trainer.push_to_hub()

Saving model checkpoint to finetuning-DistilBERT-model-4000-samples
Configuration saved in finetuning-DistilBERT-model-4000-samples/config.json
Model weights saved in finetuning-DistilBERT-model-4000-samples/pytorch_model.bin
tokenizer config file saved in finetuning-DistilBERT-model-4000-samples/tokenizer_config.json
Special tokens file saved in finetuning-DistilBERT-model-4000-samples/special_tokens_map.json
Several commits (2) will be pushed upstream.
The progress bars may be unreliable.


remote: Scanning LFS files for validity, may be slow...        
remote: LFS file scan complete.        
To https://huggingface.co/gent-scholar/finetuning-DistilBERT-model-4000-samples
   d2ba62a..e0b86ab  main -> main

Dropping the following result as it does not have all the necessary fields:
{'task': {'name': 'Text Classification', 'type': 'text-classification'}}
To https://huggingface.co/gent-scholar/finetuning-DistilBERT-model-4000-samples
   e0b86ab..641c08c  main -> main



'https://huggingface.co/gent-scholar/finetuning-DistilBERT-model-4000-samples/commit/e0b86ab9f2b3ce1cb9bf0cd9bcfe1168ecccc27c'

In [None]:
# Analyse two new movie reviews with the fine-tuned model and compare the sentiment scores with the untrained baseline at the start.
from transformers import pipeline

sentimentModel = pipeline(model = "gent-scholar/finetuning-DistilBERT-model-4000-samples")
sentimentModel(["I love this movie", "This movie sucks!"])

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--gent-scholar--finetuning-DistilBERT-model-4000-samples/snapshots/641c08c0b9eb3c171fdf53d4da1ac83997c4b040/config.json
Model config DistilBertConfig {
  "_name_or_path": "gent-scholar/finetuning-DistilBERT-model-4000-samples",
  "activation": "gelu",
  "architectures": [
    "DistilBertForSequenceClassification"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "problem_type": "single_label_classification",
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "torch_dtype": "float32",
  "transformers_version": "4.25.1",
  "vocab_size": 30522
}

loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--gent-scholar--finetuning-D

loading weights file pytorch_model.bin from cache at /root/.cache/huggingface/hub/models--gent-scholar--finetuning-DistilBERT-model-4000-samples/snapshots/641c08c0b9eb3c171fdf53d4da1ac83997c4b040/pytorch_model.bin
All model checkpoint weights were used when initializing DistilBertForSequenceClassification.

All the weights of DistilBertForSequenceClassification were initialized from the model checkpoint at gent-scholar/finetuning-DistilBERT-model-4000-samples.
If your task is similar to the task the model of the checkpoint was trained on, you can already use DistilBertForSequenceClassification for predictions without further training.


loading file vocab.txt from cache at /root/.cache/huggingface/hub/models--gent-scholar--finetuning-DistilBERT-model-4000-samples/snapshots/641c08c0b9eb3c171fdf53d4da1ac83997c4b040/vocab.txt
loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--gent-scholar--finetuning-DistilBERT-model-4000-samples/snapshots/641c08c0b9eb3c171fdf53d4da1ac83997c4b040/tokenizer.json
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at /root/.cache/huggingface/hub/models--gent-scholar--finetuning-DistilBERT-model-4000-samples/snapshots/641c08c0b9eb3c171fdf53d4da1ac83997c4b040/special_tokens_map.json
loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--gent-scholar--finetuning-DistilBERT-model-4000-samples/snapshots/641c08c0b9eb3c171fdf53d4da1ac83997c4b040/tokenizer_config.json


[{'label': 'LABEL_1', 'score': 0.9800215363502502},
 {'label': 'LABEL_0', 'score': 0.9634125232696533}]
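The generic names LABEL_0 and LABEL_1 can be mapped to readable names after the fact. The sketch below assumes the IMDB encoding, where 0 stands for a negative and 1 for a positive review; `ID2NAME` and `readable` are hypothetical names introduced for illustration.

```python
# Assumed mapping: the IMDB dataset encodes negative reviews as 0 and positive ones as 1.
ID2NAME = {"LABEL_0": "NEGATIVE", "LABEL_1": "POSITIVE"}


def readable(results, mapping=ID2NAME):
    """Replace generic pipeline labels with readable names, keeping the scores."""
    return [
        {"label": mapping.get(result["label"], result["label"]), "score": result["score"]}
        for result in results
    ]


# Output copied from the cell above.
raw = [
    {"label": "LABEL_1", "score": 0.9800215363502502},
    {"label": "LABEL_0", "score": 0.9634125232696533},
]
print(readable(raw))
```

Alternatively, the names could be stored with the model by passing `id2label` and `label2id` when loading it with `from_pretrained`, so that the pipeline returns them directly.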

5. TO-DO in future projects: Load the lazy predict library and benchmark the fine-tuned model against untuned classification models on the same IMDB dataset, using the same training and test split and size.

In [None]:
# Install lazy predict library
!pip install lazypredict

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting lazypredict
  Downloading lazypredict-0.2.12-py2.py3-none-any.whl (12 kB)
Installing collected packages: lazypredict
Successfully installed lazypredict-0.2.12


In [None]:
from lazypredict.Supervised import LazyClassifier, LazyRegressor
from sklearn.model_selection import train_test_split # not needed in my case
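The classifiers in lazy predict expect numeric feature matrices, so the reviews would first have to be turned into numbers. The sketch below shows one simple option, a bag-of-words count matrix in plain Python. It is a stand-in chosen for illustration, not something the notebook does; `build_vocabulary` and `vectorise` are hypothetical helpers and the texts are made up.

```python
from collections import Counter


def build_vocabulary(texts, max_features=1000):
    """Map the most frequent lower-cased whitespace tokens to column indices."""
    counts = Counter(token for text in texts for token in text.lower().split())
    return {token: i for i, (token, _) in enumerate(counts.most_common(max_features))}


def vectorise(texts, vocabulary):
    """Turn each text into a list of token counts, one column per vocabulary entry."""
    rows = []
    for text in texts:
        row = [0] * len(vocabulary)
        for token in text.lower().split():
            if token in vocabulary:  # tokens unseen during training are dropped
                row[vocabulary[token]] += 1
        rows.append(row)
    return rows


# Hypothetical mini-corpus standing in for the IMDB reviews.
train_texts = ["a great great movie", "a dull movie"]
vocab = build_vocabulary(train_texts)
features = vectorise(["great movie", "dull dull plot"], vocab)
print(features)
```

The vocabulary is built on the training texts only and then reused for the test texts, so that no information leaks from the test set. The resulting matrices could then be passed to `LazyClassifier.fit(X_train, X_test, y_train, y_test)`.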


In [None]:
# Load data
data = 