# Sentiment Analysis with Hugging Face

Hugging Face is an open-source platform and provider of machine learning technologies. You can install their packages to access pre-built models, then either use them directly or fine-tune them (retrain them on your own dataset, leveraging the knowledge gained during the initial training). You can then host your trained models on the platform so that you can use them later on other devices and in other apps.

Please [go to the website and sign in](https://huggingface.co/) to access all the features of the platform.

[Read more about Text classification with Hugging Face](https://huggingface.co/tasks/text-classification)

Hugging Face models are deep learning based, so training them requires a lot of GPU compute. Please use [Colab](https://colab.research.google.com/), another cloud GPU provider, or a local machine with an NVIDIA GPU.

## Application of Hugging Face Text Classification Model Fine-tuning

Find below a simple example, with just `3 epochs of fine-tuning`. 

Read more about the fine-tuning concept [here](https://deeplizard.com/learn/video/5T-iXNNiwIs#:~:text=Fine%2Dtuning%20is%20a%20way,perform%20a%20second%20similar%20task.).

In [46]:
!git clone https://github.com/acheampongmaa/Natural-Language-Processing-Project.git

Cloning into 'Natural-Language-Processing-Project'...
remote: Enumerating objects: 27, done.
remote: Counting objects: 100% (27/27), done.
remote: Compressing objects: 100% (25/25), done.
remote: Total 27 (delta 3), reused 0 (delta 0), pack-reused 0
Unpacking objects: 100% (27/27), 839.14 KiB | 4.66 MiB/s, done.


In [47]:
%cd Natural-Language-Processing-Project

/content/Natural-Language-Processing-Project/Natural-Language-Processing-Project


In [48]:
# Install the necessary package to create a virtual environment
!pip3 install virtualenv

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [49]:
# Create the virtual environment venv
!virtualenv venv

created virtual environment CPython3.10.11.final.0-64 in 237ms
  creator CPython3Posix(dest=/content/Natural-Language-Processing-Project/Natural-Language-Processing-Project/venv, clear=False, no_vcs_ignore=False, global=False)
  seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/root/.local/share/virtualenv)
    added seed packages: pip==23.1.2, setuptools==67.7.2, wheel==0.40.0
  activators BashActivator,CShellActivator,FishActivator,NushellActivator,PowerShellActivator,PythonActivator


In [50]:
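# Activate the virtual environment
# Note: each `!` command runs in its own subshell, so this activation does not carry over to later cells
!source venv/bin/activate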

In [51]:
!pip install --upgrade huggingface_hub

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [54]:
!pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [55]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [56]:
import os
import pandas as pd
import numpy as np
from datasets import load_dataset
from datasets import load_metric
from sklearn.model_selection import train_test_split
from transformers import AutoTokenizer, TrainingArguments, Trainer, AutoModelForSequenceClassification
from huggingface_hub import login



In [57]:
# Disable W&B
os.environ["WANDB_DISABLED"] = "true"
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'
# Placeholder: never commit a real access token to a notebook
os.environ["HUGGINGFACE_API_KEY"] = "<your-hugging-face-token>"
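Hardcoding an access token in a notebook exposes it to anyone who can read the file. A minimal sketch of one alternative, assuming the token was stored beforehand in an environment variable outside the notebook. The `HUGGINGFACE_API_KEY` name is reused from the cell above, and the `read_hf_token` helper is hypothetical:

```python
import os

def read_hf_token(env=os.environ, name="HUGGINGFACE_API_KEY"):
    """Return the token stored under `name`, or raise if it is missing or empty."""
    token = env.get(name, "").strip()
    if not token:
        raise RuntimeError(f"Set the {name} environment variable before running this notebook.")
    return token
```

The returned value could then be passed to `login(token=...)` from `huggingface_hub` instead of typing the token into a cell.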

In [58]:
# Load the dataset and display some values
df = pd.read_csv('/content/Natural-Language-Processing-Project/zindi_challenge/data/Train.csv')



In [59]:
#checking shape of dataframe

df.shape

(10001, 4)

In [60]:
#checking data
df

Unnamed: 0,tweet_id,safe_text,label,agreement
0,CL1KWCMY,Me &amp; The Big Homie meanboy3000 #MEANBOY #M...,0.0,1.000000
1,E3303EME,I'm 100% thinking of devoting my career to pro...,1.0,1.000000
2,M4IVFSMS,"#whatcausesautism VACCINES, DO NOT VACCINATE Y...",-1.0,1.000000
3,1DR6ROZ4,I mean if they immunize my kid with something ...,-1.0,1.000000
4,J77ENIIE,Thanks to <user> Catch me performing at La Nui...,0.0,1.000000
...,...,...,...,...
9996,IU0TIJDI,Living in a time where the sperm I used to was...,1.0,1.000000
9997,WKKPCJY6,<user> <user> In spite of all measles outbrea...,1.0,0.666667
9998,ST3A265H,Interesting trends in child immunization in Ok...,0.0,1.000000
9999,6Z27IJGD,CDC Says Measles Are At Highest Levels In Deca...,0.0,1.000000


In [61]:
#checking null values
df.isna().sum()

tweet_id     0
safe_text    0
label        1
agreement    2
dtype: int64

In [62]:
#checking where the null values are located
df[df.isna().any(axis=1)]

Unnamed: 0,tweet_id,safe_text,label,agreement
4798,RQMQ0L2A,#lawandorderSVU,,
4799,I cannot believe in this day and age some pare...,1,0.666667,


In [63]:
#replacing null values
df.loc[4798, 'label']= 0.0

df.loc[4798, 'agreement']=0.0

In [64]:
#replacing null values
df.loc[4799, 'label']=1.0

df.loc[4799, 'agreement']=0.666667

df.loc[4799, 'safe_text']='I cannot believe in this day and age some pare...'

df.loc[4799, 'tweet_id']= 'SHG7JIY'

In [65]:
#recheck null values
df.isnull().sum()

tweet_id     0
safe_text    0
label        0
agreement    0
dtype: int64

In [66]:
#splitting data
train, eval = train_test_split(df, test_size=0.2, random_state=42, stratify=df['label'])

In [67]:
#checking train set
train.head()

Unnamed: 0,tweet_id,safe_text,label,agreement
1641,CQDD6QLM,"New <user> ""Hey Love"" #MMR #ManyMenRecords #Yo...",0.0,1.0
3907,5GV8NEZS,S1256 [NEW] Extends exemption from charitable ...,0.0,1.0
336,I4D043ST,<user> esp when mercury free vaccines are avai...,1.0,0.666667
6861,CKX52Y8G,"My Life, Your Entertainment #YOTC #MMR @ Exoti...",0.0,1.0
720,07S3NL2T,Baby Luna is sore from her vaccines :( #poorpuppy,0.0,0.666667


In [68]:
#checking eval set
eval.head()

Unnamed: 0,tweet_id,safe_text,label,agreement
5818,Y8PQ0BT7,So nervous... The baby's getting vaccines... (...,1.0,0.666667
7842,C9Z6JBSS,AIDS N : A malaria vaccine in children with HI...,0.0,0.666667
880,0VE4NWWQ,Measles Outbreak Hits Texas Church That Preach...,1.0,0.666667
9072,RHQRUF14,Thank you <user> for mtg with your staff. We l...,1.0,1.0
288,ZWEP2IL4,Health district offers no-cost immunizations f...,1.0,0.666667


In [69]:
#checking shape of train and eval set
print(f"new dataframe shapes: train is {train.shape}, eval is {eval.shape}")

new dataframe shapes: train is (8000, 4), eval is (2001, 4)
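The split sizes follow from the row count and `test_size=0.2`. A small worked check, assuming the evaluation share is rounded up to a whole row and the remainder goes to training, which is consistent with the shapes printed above:

```python
import math

n_rows = 10001     # rows in the full dataframe, from df.shape
test_size = 0.2    # fraction passed to train_test_split

# Assumption: the evaluation share is rounded up; the remaining rows are used for training.
n_eval = math.ceil(test_size * n_rows)
n_train = n_rows - n_eval

print(n_train, n_eval)
```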


In [70]:
# Save the split subsets
train.to_csv("/content/Natural-Language-Processing-Project/zindi_challenge/data/train_subset.csv", index=False)
eval.to_csv("/content/Natural-Language-Processing-Project/zindi_challenge/data/eval_subset.csv", index=False)

In [71]:
# Load the saved train and eval subsets
dataset = load_dataset('csv',
                        data_files={'train': '/content/Natural-Language-Processing-Project/zindi_challenge/data/train_subset.csv',
                        'eval': '/content/Natural-Language-Processing-Project/zindi_challenge/data/eval_subset.csv'}, encoding = "ISO-8859-1")

Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-85725e1e2a99799e/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1...



Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-85725e1e2a99799e/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1. Subsequent calls will reuse this data.



## Fine-tuning the distilbert-base-uncased Model

In [72]:
# Load the tokenizer that matches the pretrained model
tokenizer = AutoTokenizer.from_pretrained('distilbert-base-uncased')


In [73]:
# Tokenize the text in the 'safe_text' column with the tokenizer loaded above,
# map the label values to class ids, and remove the columns that are no longer needed

def transform_labels(label):

    label = label['label']
    num = 0
    if label == -1: #'Negative'
        num = 0
    elif label == 0: #'Neutral'
        num = 1
    elif label == 1: #'Positive'
        num = 2

    return {'labels': num}

def tokenize_data(example):
    return tokenizer(example['safe_text'], padding='max_length')

# Change the tweets to tokens that the models can exploit
dataset = dataset.map(tokenize_data, batched=True)

# Transform labels and remove the useless columns
remove_columns = ['tweet_id', 'label', 'safe_text', 'agreement']
dataset = dataset.map(transform_labels, remove_columns=remove_columns)


In [74]:
dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 8000
    })
    eval: Dataset({
        features: ['input_ids', 'attention_mask', 'labels'],
        num_rows: 2001
    })
})
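The label remapping in `transform_labels` can also be read as a lookup table. A minimal sketch of that reading follows; `LABEL_TO_ID` and `remap_label` are hypothetical names, and unlike the original function, which falls back to class 0 for any unexpected value, this version raises a `KeyError`:

```python
# Original sentiment label -> class id expected by the classification head.
LABEL_TO_ID = {-1: 0, 0: 1, 1: 2}  # negative, neutral, positive

def remap_label(example):
    """Map one dataset row's 'label' value to a 0-based 'labels' class id."""
    return {"labels": LABEL_TO_ID[example["label"]]}

# Float labels such as -1.0 hash like the integer -1, so the lookup still works.
print(remap_label({"label": -1.0}))
```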

In [75]:

# Configure the training parameters, such as `num_train_epochs`:
# the number of times the model will repeat the training loop over the dataset
training_args = TrainingArguments(
    "finetuned_distilbert_base_uncased",
    num_train_epochs=3,
    load_best_model_at_end=True,
    evaluation_strategy='epoch',
    weight_decay=0.01,
    save_strategy='epoch')

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).


In [76]:
# Load a pretrained model, specifying the number of labels in our dataset for fine-tuning
model = AutoModelForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=3)


Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_layer_norm.bias', 'vocab_transform.bias', 'vocab_projector.weight', 'vocab_transform.weight', 'vocab_layer_norm.weight', 'vocab_projector.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.weight', 'classifier.weight', 'classifi

In [77]:
#shuffling train and eval dataset
train_dataset = dataset['train'].shuffle(seed=10) 
eval_dataset = dataset['eval'].shuffle(seed=10)


In [78]:
# Create a Trainer instance from the Hugging Face transformers library to train and evaluate the model on the specified datasets

trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset, eval_dataset=eval_dataset)

In [79]:
# Launch the learning process: training 
trainer.train()



Epoch,Training Loss,Validation Loss
1,0.6434,0.582563
2,0.4594,0.614786
3,0.2847,0.824298


TrainOutput(global_step=3000, training_loss=0.4794327850341797, metrics={'train_runtime': 1390.4914, 'train_samples_per_second': 17.26, 'train_steps_per_second': 2.158, 'total_flos': 3179274264576000.0, 'train_loss': 0.4794327850341797, 'epoch': 3.0})
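The `global_step=3000` value can be reproduced from the dataset size. A short worked check, assuming a single device and the `Trainer` default batch size of 8 per device (no batch size is set in `TrainingArguments` above):

```python
import math

n_train = 8000     # training rows, from the dataset summary
batch_size = 8     # assumed: default per-device train batch size, one device
epochs = 3         # num_train_epochs

steps_per_epoch = math.ceil(n_train / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)
```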

In [90]:

# Define a compute_metrics function that computes accuracy from the model's logits during evaluation
metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
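To make the metric concrete, here is a dependency-free sketch of the same idea on toy values: take the highest-scoring class per row, then count how often it matches the reference label. The `argmax` and `accuracy` helpers and the toy numbers are illustrative only and are not part of the notebook:

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class equals the reference label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Four toy examples with three class scores each (made-up values).
toy_logits = [[2.0, 0.1, -1.0], [0.3, 0.2, 1.5], [0.0, 0.9, 0.4], [1.2, 0.8, 0.1]]
toy_labels = [0, 2, 1, 1]

print(accuracy(toy_logits, toy_labels))
```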



In [81]:
'''creating a Trainer object with the given model, training and evaluation datasets, and training arguments, and 
specifying that the evaluation metrics should be computed using the compute_metrics function'''
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)

In [82]:
# Launch the final evaluation 
trainer.evaluate()

{'eval_loss': 0.5825627446174622,
 'eval_accuracy': 0.7641179410294853,
 'eval_runtime': 35.6205,
 'eval_samples_per_second': 56.176,
 'eval_steps_per_second': 7.047}
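The `eval_loss` reported here matches the epoch 1 validation loss from the training table. That is what one would expect if `load_best_model_at_end=True` restored the checkpoint with the lowest validation loss. A small sketch of that selection, using the numbers from the table (the `history` list is just a hand-copied illustration):

```python
# (epoch, training loss, validation loss) copied from the training table
history = [
    (1, 0.6434, 0.582563),
    (2, 0.4594, 0.614786),
    (3, 0.2847, 0.824298),
]

# Pick the epoch with the lowest validation loss.
best_epoch, _, best_val_loss = min(history, key=lambda row: row[2])

print(best_epoch, best_val_loss)
```

Training loss keeps falling while validation loss rises after the first epoch, which suggests the model starts to overfit beyond epoch 1.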

In [86]:
# Log in to Hugging Face with an access token
login()


In [87]:
# Push the fine-tuned model to the Hugging Face Hub
finetuned_model = trainer.model
finetuned_model.push_to_hub("finetuned_distilbert_base_uncased")

# trainer.push_to_hub(repo_name='finetuned_albert_base_v2')


CommitInfo(commit_url='https://huggingface.co/Queensly/finetuned_distilbert_base_uncased/commit/656cdc04166ed01a1b3c110a5642c360ce63d921', commit_message='Upload DistilBertForSequenceClassification', commit_description='', oid='656cdc04166ed01a1b3c110a5642c360ce63d921', pr_url=None, pr_revision=None, pr_num=None)

In [88]:
tokenizer.push_to_hub("finetuned_distilbert_base_uncased")

CommitInfo(commit_url='https://huggingface.co/Queensly/finetuned_distilbert_base_uncased/commit/bd81baa81fe8678b8885e59708626ccf31247f5f', commit_message='Upload tokenizer', commit_description='', oid='bd81baa81fe8678b8885e59708626ccf31247f5f', pr_url=None, pr_revision=None, pr_num=None)
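Once the model and tokenizer are on the Hub, they can be loaded from another device or app. A hedged sketch follows, assuming the repository from the commit URLs above is accessible and that the model config keeps the default `LABEL_0`, `LABEL_1`, `LABEL_2` names (no custom label names were set in this notebook). The `decode_label` and `classify` helpers are hypothetical:

```python
# Mapping implied by transform_labels in this notebook: -1 -> 0, 0 -> 1, 1 -> 2.
ID_TO_SENTIMENT = {0: "negative", 1: "neutral", 2: "positive"}

def decode_label(raw_label):
    """Turn a default pipeline label such as 'LABEL_2' into a sentiment name."""
    class_id = int(raw_label.rsplit("_", 1)[-1])
    return ID_TO_SENTIMENT[class_id]

def classify(texts, repo_id="Queensly/finetuned_distilbert_base_uncased"):
    """Download the pushed model and classify texts (needs transformers and network access)."""
    # Imported lazily so decode_label stays usable without transformers installed.
    from transformers import pipeline
    classifier = pipeline("text-classification", model=repo_id)
    return [decode_label(result["label"]) for result in classifier(texts)]

print(decode_label("LABEL_2"))
```

Calling `classify(["some tweet text"])` would return one sentiment name per input text.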