# Introduction to Hugging Face

Hugging Face is a popular platform for accessing state-of-the-art models, datasets, and tools for natural language processing (NLP). In this tutorial, we'll walk through the basics of using Hugging Face for common NLP tasks such as text classification, text generation, and question answering.

## **Import necessary Python libraries and modules**

First, we import the Python libraries and modules we need: pandas and NumPy for data handling, and PyTorch (`torch`) for deep learning. Tools from scikit-learn (`sklearn`) are imported later, in the cells where they are used.

In [109]:
# Basic Python modules
from collections import defaultdict

# For data manipulation and analysis
import pandas as pd
import numpy as np

# For deep learning
# https://pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html
import torch

We will use the same sentiment dataset that we used for the previous notebook.

In [110]:
datapath = 'https://raw.githubusercontent.com/leelaylay/TweetSemEval/master/dataset/train/twitter-2013train-A.txt'
data = pd.read_csv(datapath, sep = '\t', names = ['id', 'sentiment', 'text'])
data

| | id | sentiment | text |
|---|---|---|---|
| 0 | 264183816548130816 | positive | Gas by my house hit $3.39!!!! I\u2019m going t... |
| 1 | 263405084770172928 | negative | Theo Walcott is still shit\u002c watch Rafa an... |
| 2 | 262163168678248449 | negative | its not that I\u2019m a GSP fan\u002c i just h... |
| 3 | 264249301910310912 | negative | Iranian general says Israel\u2019s Iron Dome c... |
| 4 | 262682041215234048 | neutral | Tehran\u002c Mon Amour: Obama Tried to Establi... |
| ... | ... | ... | ... |
| 9679 | 103158179306807296 | positive | RT @MNFootNg It's monday and Monday Night Foot... |
| 9680 | 103157324096618497 | positive | All I know is the road for that Lomardi start ... |
| 9681 | 100259220338905089 | neutral | All Blue and White fam, we r meeting at Golden... |
| 9682 | 104230318525001729 | positive | @DariusButler28 Have a great game agaist Tam... |


In [111]:
# !pip3 install transformers

In [112]:
! pip install -U accelerate



We will again use DistilBERT.

In [113]:
from transformers import DistilBertTokenizerFast, DistilBertForSequenceClassification
from transformers import Trainer, TrainingArguments

Next, we set some configuration values.

In [114]:
model_name = 'distilbert-base-uncased'
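# Use the GPU when one is available; otherwise fall back to the CPU.
device_name = 'cuda' if torch.cuda.is_available() else 'cpu'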

# This is the maximum number of tokens in any document; the rest will be truncated.
max_length = 512

# This is the name of the directory where we'll save our model. You can name it whatever you want.
cached_model_directory_name = 'output_hf'

In [115]:
# let's split the data into train and test

from sklearn.model_selection import train_test_split
train, test = train_test_split(data, test_size = 0.3)

In [116]:
# let's encode the data again

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.fit(train['sentiment'])
train['labels'] = le.transform(train['sentiment'])
test['labels'] = le.transform(test['sentiment'])
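
`LabelEncoder` assigns integer ids to the class names in sorted order, so the three sentiment strings map to 0, 1, and 2 alphabetically. The sketch below mimics that mapping in plain Python with a few made-up labels; `label_to_id` and `id_to_label` are illustrative names, not part of scikit-learn.

```python
# Sketch of the mapping LabelEncoder builds: sorted unique labels -> 0..n-1.
# The example labels below are made up for illustration.
example_sentiments = ["positive", "negative", "negative", "neutral", "positive"]

classes = sorted(set(example_sentiments))   # ['negative', 'neutral', 'positive']
label_to_id = {label: i for i, label in enumerate(classes)}
id_to_label = {i: label for label, i in label_to_id.items()}

encoded = [label_to_id[s] for s in example_sentiments]   # analogous to le.transform(...)
decoded = [id_to_label[i] for i in encoded]              # analogous to le.inverse_transform(...)
print(encoded)
print(decoded)
```

This ordering is why classes 0, 1, and 2 in the numeric classification report correspond to negative, neutral, and positive.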

Compared to simpletransformers, Hugging Face gives us a closer look at what happens under the hood. In particular, we see more of how the text is transformed: each tweet is truncated if it has more than 512 tokens and padded if it has fewer.

The text is split into "word pieces" by the Transformers tokenizer (`DistilBertTokenizerFast` in this case, to match the DistilBERT model). Some special tokens are also added, such as **CLS** (the start token of every tweet) and **SEP** (the separator that closes each input sequence; with one tweet per input, it appears once at the end of the tweet rather than after every sentence):

In [117]:
tokenizer = DistilBertTokenizerFast.from_pretrained(model_name)
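
The special tokens, truncation, and padding described above can be illustrated with a toy sketch in plain Python. This is not the DistilBERT tokenizer: whitespace splitting stands in for word-piece tokenization, `toy_encode` is a hypothetical helper, and `max_length` is 8 instead of 512 so the result is easy to read.

```python
# Toy illustration of special tokens, truncation, and padding.
# Assumption: whitespace splitting stands in for real word-piece tokenization.
CLS, SEP, PAD = "[CLS]", "[SEP]", "[PAD]"

def toy_encode(text, max_length=8):
    """Return (tokens, attention_mask), both exactly max_length long."""
    words = text.lower().split()
    # Reserve two positions for [CLS] and [SEP]; drop whatever does not fit.
    words = words[: max_length - 2]
    tokens = [CLS] + words + [SEP]
    # 1 marks real tokens, 0 marks padding that the model should ignore.
    attention_mask = [1] * len(tokens)
    n_pad = max_length - len(tokens)
    return tokens + [PAD] * n_pad, attention_mask + [0] * n_pad

short_tokens, short_mask = toy_encode("Have a great game")
long_tokens, long_mask = toy_encode("one two three four five six seven eight nine")
print(short_tokens, short_mask)
print(long_tokens, long_mask)
```

The short text is padded up to the fixed length and the long one is cut down to it, so every example ends up with the same shape. The attention mask plays the same role as the `attention_mask` column that the real tokenizer adds to the dataset.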

We now encode our texts using the tokenizer.

In [118]:
# ! pip install datasets # a library for handling data in huggingface format

In [119]:
from datasets import Dataset

train = Dataset.from_pandas(train)
test = Dataset.from_pandas(test)

def tokenize_function(examples):
    # Pad or truncate every text to exactly max_length tokens (set above).
    return tokenizer(examples["text"], padding="max_length", truncation=True, max_length=max_length)


tokenized_train_df = train.map(tokenize_function, batched=True)
tokenized_test_df = test.map(tokenize_function, batched=True)


We now load the DistilBERT model and specify that it should use the GPU.

In [120]:
model = DistilBertForSequenceClassification.from_pretrained(model_name,
                                                            num_labels=3).to(device_name)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


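As we did with simpletransformers, we now set the training parameters: the number of epochs, the output directory for checkpoints, and `report_to='none'`, which turns off logging to external experiment trackers.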

In [121]:
training_args = TrainingArguments(
    num_train_epochs=3,              # total number of training epochs
    output_dir='./results',          # output directory
    report_to='none'
)

## **Fine-tune the DistilBERT model**

First, we define a custom evaluation function that returns the accuracy. You could modify this function to return precision, recall, F1, and/or other metrics.

In [122]:
from sklearn.metrics import accuracy_score

def compute_metrics(pred):
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
    }
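
As one way to extend the evaluation function, the sketch below also returns macro-averaged precision, recall, and F1. It is written in plain Python so that it runs without extra dependencies; in practice scikit-learn's `precision_recall_fscore_support` would be the more usual choice. The names `macro_prf` and `compute_metrics_extended`, the zero-denominator convention, and the example scores are assumptions made for illustration.

```python
from types import SimpleNamespace

def macro_prf(labels, preds, num_labels=3):
    """Macro-averaged precision, recall, and F1 over class ids 0..num_labels-1."""
    per_class = []
    for c in range(num_labels):
        tp = sum(1 for y, p in zip(labels, preds) if y == c and p == c)
        fp = sum(1 for y, p in zip(labels, preds) if y != c and p == c)
        fn = sum(1 for y, p in zip(labels, preds) if y == c and p != c)
        # Assumption: a metric with a zero denominator is counted as 0.0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_class.append((precision, recall, f1))
    # Unweighted mean over classes for each of the three metrics.
    return tuple(sum(m[i] for m in per_class) / num_labels for i in range(3))

def compute_metrics_extended(pred):
    labels = list(pred.label_ids)
    # Pick the index of the highest score in each row (one row per example).
    preds = [max(range(len(row)), key=lambda i: row[i]) for row in pred.predictions]
    accuracy = sum(y == p for y, p in zip(labels, preds)) / len(labels)
    precision, recall, f1 = macro_prf(labels, preds)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical scores for four examples, only to exercise the function.
fake_pred = SimpleNamespace(
    label_ids=[0, 1, 2, 1],
    predictions=[[2.0, 0.1, 0.3], [0.2, 1.5, 0.1], [0.9, 0.2, 0.4], [0.1, 2.2, 0.3]],
)
metrics = compute_metrics_extended(fake_pred)
print(metrics)
```

Macro averaging treats the three sentiment classes equally, which is useful here because the classes are not the same size.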

Then we create a Hugging Face `Trainer` object using the `TrainingArguments` object that we created above. We also pass our `compute_metrics` function and our training dataset to the `Trainer`; the test dataset is passed later, when we call `trainer.predict`.

In [123]:
trainer = Trainer(
    model=model,                         # the instantiated 🤗 Transformers model to be trained
    args=training_args,                  # training arguments, defined above
    train_dataset=tokenized_train_df,    # training dataset
    compute_metrics=compute_metrics      # our custom evaluation function
)

In [124]:
tokenized_train_df

Dataset({
    features: ['id', 'sentiment', 'text', 'labels', '__index_level_0__', 'input_ids', 'attention_mask'],
    num_rows: 6778
})

Time to finally fine-tune!

In [125]:
trainer.train()

Step,Training Loss
500,0.7219
1000,0.5558
1500,0.4016
2000,0.2799
2500,0.1839


Checkpoint destination directory ./results/checkpoint-500 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./results/checkpoint-1000 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./results/checkpoint-1500 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./results/checkpoint-2000 already exists and is non-empty.Saving will proceed but saved results may be invalid.
Checkpoint destination directory ./results/checkpoint-2500 already exists and is non-empty.Saving will proceed but saved results may be invalid.


TrainOutput(global_step=2544, training_loss=0.42470809273749777, metrics={'train_runtime': 998.0882, 'train_samples_per_second': 20.373, 'train_steps_per_second': 2.549, 'total_flos': 2693640120662016.0, 'train_loss': 0.42470809273749777, 'epoch': 3.0})
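
The step count in this output follows from the dataset size and the batch size. A quick check, assuming the `TrainingArguments` default of 8 examples per device per step and a single GPU (neither is set explicitly in this notebook):

```python
import math

num_train_examples = 6778   # rows in the tokenized training set
batch_size = 8              # assumption: default per-device batch size, one device
num_epochs = 3              # num_train_epochs set in TrainingArguments

# The last batch of each epoch is smaller than the others, hence the ceiling.
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(steps_per_epoch, total_steps)
```

Under these assumptions the total is 2544 steps, which matches `global_step=2544`. The loss table stops at step 2500 because the loss is logged every 500 steps, and the checkpoints are written at the same interval.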

## **Save fine-tuned model**

The following cell will save the model and its configuration files to a directory in Colab. To preserve this model for future use, you should download the model to your computer.

In [126]:
trainer.save_model(cached_model_directory_name)

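(Optional) If you've already fine-tuned and saved the model, you can reload it with the following lines instead of running fine-tuning again every time you want to evaluate. The reloaded model is wrapped in a new `Trainer` so that the `trainer.predict` call below still works.

In [127]:
# model = DistilBertForSequenceClassification.from_pretrained(cached_model_directory_name)
# trainer = Trainer(model=model, args=training_args, compute_metrics=compute_metrics)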

We can now evaluate the model by predicting the labels for the test set.

In [128]:
predicted_results = trainer.predict(tokenized_test_df)

In [129]:
predicted_labels = predicted_results.predictions.argmax(-1) # Get the highest-scoring class for each example
predicted_labels = predicted_labels.flatten().tolist()      # Flatten the predictions into a 1D list
predicted_labels[0:5]

[0, 1, 1, 2, 0]
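
`predicted_results.predictions` holds raw scores (logits), one per class, rather than probabilities. Taking the argmax of the logits picks the same class as taking the argmax of the softmax probabilities, as this small sketch with made-up logits suggests. The `softmax` and `argmax` helpers are written out here only for illustration.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    m = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])

logits = [1.2, -0.3, 0.4]                    # hypothetical scores for one tweet
probs = softmax(logits)
print(probs, argmax(logits), argmax(probs))
```

Because softmax preserves the ordering of the scores, applying it is only necessary when the probabilities themselves are of interest.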

In [130]:
from sklearn.metrics import classification_report

print(classification_report(tokenized_test_df['labels'],
                            predicted_labels))

              precision    recall  f1-score   support

           0       0.66      0.63      0.64       443
           1       0.78      0.78      0.78      1389
           2       0.76      0.77      0.77      1074

    accuracy                           0.76      2906
   macro avg       0.73      0.73      0.73      2906
weighted avg       0.76      0.76      0.76      2906



In [131]:
# with the reverse encoded labels

print(classification_report(test['sentiment'],
                            le.inverse_transform(predicted_labels)))

              precision    recall  f1-score   support

    negative       0.66      0.63      0.64       443
     neutral       0.78      0.78      0.78      1389
    positive       0.76      0.77      0.77      1074

    accuracy                           0.76      2906
   macro avg       0.73      0.73      0.73      2906
weighted avg       0.76      0.76      0.76      2906



## **Use a pretrained model for inference**

We can also apply pretrained models directly, without fine-tuning them on further data. Let's try out this model for sentiment analysis: https://huggingface.co/lxyuan/distilbert-base-multilingual-cased-sentiments-student

In [132]:
from transformers import pipeline

distilled_student_sentiment_classifier = pipeline(
    model="lxyuan/distilbert-base-multilingual-cased-sentiments-student",
    return_all_scores=True
)



In [133]:
# let's try some examples
anti_biden_tweet = "Ugh, this was true yesterday and it's also true now: Biden is an idiot"
pro_biden_tweet = "I'm voting for Biden 100%"
labels = distilled_student_sentiment_classifier([anti_biden_tweet, pro_biden_tweet])
labels

[[{'label': 'positive', 'score': 0.20064100623130798},
  {'label': 'neutral', 'score': 0.13778355717658997},
  {'label': 'negative', 'score': 0.661575436592102}],
 [{'label': 'positive', 'score': 0.5918682217597961},
  {'label': 'neutral', 'score': 0.18477307260036469},
  {'label': 'negative', 'score': 0.22335869073867798}]]

In [134]:
# get the highest scoring label
final_labels = []
for label in labels:
  label_df = pd.DataFrame(label).sort_values(by = ['score'], ascending = False).reset_index()
  print(label_df)
  print()
  final_labels.append(label_df['label'][0])

final_labels

   index     label     score
0      2  negative  0.661575
1      0  positive  0.200641
2      1   neutral  0.137784

   index     label     score
0      0  positive  0.591868
1      2  negative  0.223359
2      1   neutral  0.184773



['negative', 'positive']
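
Sorting a DataFrame works, but when only the top label is needed, Python's built-in `max` with a `key` function is a lighter alternative. A minimal sketch using the scores printed above, rounded for brevity (`example_scores` and `top_labels` are illustrative names):

```python
# Scores copied (rounded) from the pipeline output for the two example tweets.
example_scores = [
    [{"label": "positive", "score": 0.2006},
     {"label": "neutral", "score": 0.1378},
     {"label": "negative", "score": 0.6616}],
    [{"label": "positive", "score": 0.5919},
     {"label": "neutral", "score": 0.1848},
     {"label": "negative", "score": 0.2234}],
]

# For each tweet, keep the label of the dictionary with the highest score.
top_labels = [max(scores, key=lambda d: d["score"])["label"] for scores in example_scores]
print(top_labels)
```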

In [135]:
from tqdm import tqdm

labels = []
for text in tqdm(test['text']):
  labels.append(distilled_student_sentiment_classifier(text)[0])

100%|██████████| 2906/2906 [04:08<00:00, 11.69it/s]


In [136]:
final_labels = []
for label in labels:
  label_df = pd.DataFrame(label).sort_values(by = ['score'], ascending = False).reset_index()
  final_labels.append(label_df['label'][0])

final_labels[:5]

['negative', 'positive', 'negative', 'positive', 'negative']

In [137]:
print(classification_report(test['sentiment'],final_labels))

              precision    recall  f1-score   support

    negative       0.41      0.78      0.54       443
     neutral       0.54      0.01      0.03      1389
    positive       0.46      0.87      0.60      1074

    accuracy                           0.45      2906
   macro avg       0.47      0.55      0.39      2906
weighted avg       0.49      0.45      0.32      2906
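
The two summary rows of the report average the per-class scores differently: the macro average gives every class the same weight, while the weighted average weights each class by its support. The sketch below recomputes both F1 averages from the rounded per-class values in this report, so the results are approximate.

```python
# Per-class F1 and support, copied from the report for the pretrained model.
f1 = {"negative": 0.54, "neutral": 0.03, "positive": 0.60}
support = {"negative": 443, "neutral": 1389, "positive": 1074}

# Macro average: plain mean over the three classes.
macro_f1 = sum(f1.values()) / len(f1)
# Weighted average: each class counts in proportion to its number of examples.
weighted_f1 = sum(f1[c] * support[c] for c in f1) / sum(support.values())
print(round(macro_f1, 2), round(weighted_f1, 2))
```

The weighted F1 comes out lower than the macro F1 because the neutral class has both the largest support and the lowest score: with a recall of 0.01, this model almost never predicts neutral on these tweets.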

