# Sentiment Analysis with Hugging Face

Hugging Face is an open-source platform and provider of machine learning technologies. You can install their packages to access pre-built models, use them directly or fine-tune them (retrain them on your own dataset while leveraging the knowledge gained during the initial training), and then host your trained models on the platform so that you can use them later on other devices and apps.

Please [go to the website and sign in](https://huggingface.co/) to access all the features of the platform.

[Read more about Text classification with Hugging Face](https://huggingface.co/tasks/text-classification)

The Hugging Face models are deep-learning based, so training them requires a lot of GPU compute. Please use [Colab](https://colab.research.google.com/), another cloud GPU provider, or a local machine with an NVIDIA GPU.
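Before starting a long run, it can help to confirm that a GPU is actually visible. The following is a minimal sketch and not part of the original notebook: the helper name `gpu_available` is hypothetical, and it assumes PyTorch may or may not be installed, falling back to a simple heuristic (looking for the `nvidia-smi` tool on the PATH).

```python
import importlib.util
import shutil


def gpu_available() -> bool:
    """Best-effort check for an NVIDIA GPU (hypothetical helper)."""
    # Prefer PyTorch's own check when the package is installed.
    if importlib.util.find_spec("torch") is not None:
        import torch
        return bool(torch.cuda.is_available())
    # Otherwise fall back to a heuristic: is the NVIDIA driver tool on the PATH?
    return shutil.which("nvidia-smi") is not None


print("GPU detected" if gpu_available() else "No GPU detected, training will be slow")
```

The fallback is only a hint: finding `nvidia-smi` does not guarantee that the deep learning framework can use the GPU.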

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## Application of Hugging Face Text Classification Model Fine-tuning

Find below a simple example, with just `5 epochs of fine-tuning`.

Read more about the fine-tuning concept: [here](https://deeplizard.com/learn/video/5T-iXNNiwIs#:~:text=Fine%2Dtuning%20is%20a%20way,perform%20a%20second%20similar%20task.)

In [None]:
#torch.cuda.is_available()

In [2]:
!pip install datasets

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting datasets
  Downloading datasets-2.11.0-py3-none-any.whl (468 kB)
Collecting xxhash
  Downloading xxhash-3.2.0-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (212 kB)
Collecting huggingface-hub<1.0.0,>=0.11.0
  Downloading huggingface_hub-0.13.4-py3-none-any.whl (200 kB)
Collecting responses<0.19
  Downloading responses-0.18.0-py3-none-any.whl (38 kB)
Collecting dill<0.3.7,>=0.3.0
  Downloading dill-0.3.6-py3-none-any.whl (110 kB)

In [3]:
import os
import pandas as pd
from datasets import load_dataset
from sklearn.model_selection import train_test_split

The next cell sets the environment variable `WANDB_DISABLED` to `"true"`, which disables the Weights and Biases (W&B) integration. W&B is a third-party tool for tracking and visualizing the training progress of machine learning models; setting this variable tells the training code not to use it.

In [4]:
# Disable W&B
os.environ["WANDB_DISABLED"] = "true"
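The deprecation warning printed later in this notebook suggests controlling logging integrations through `report_to` instead of the environment variable. Here is a minimal sketch of both options side by side; the dictionary name `training_kwargs` is only illustrative, and I am assuming you would unpack it into `TrainingArguments` yourself.

```python
import os

# What the cell above does: an environment variable read by the library.
os.environ["WANDB_DISABLED"] = "true"

# Assumed alternative: keyword arguments one could pass to TrainingArguments
# to turn off every logging integration (output_dir is illustrative).
training_kwargs = {"output_dir": "./results", "report_to": "none"}

print(os.environ["WANDB_DISABLED"], training_kwargs["report_to"])
```

With the `transformers` package installed, this would be used as `TrainingArguments(**training_kwargs)`.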

In [5]:
# Load the dataset and display some values

# Define file path
file_path = "/content/sample_data/"

# Load the CSV file into a DataFrame
df = pd.read_csv(file_path + "Train.csv")




In [6]:
# A way to eliminate rows containing NaN values
df = df[~df.isna().any(axis=1)]
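To make the NaN filter above more concrete, here is a small sketch on a made-up DataFrame (the rows are invented for illustration, not taken from `Train.csv`). It also compares the boolean-mask approach with `DataFrame.dropna()`, which should give the same result for this use case.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for Train.csv (values are made up for illustration).
toy = pd.DataFrame({
    "safe_text": ["vaccines work", None, "measles outbreak"],
    "label": [1.0, 0.0, np.nan],
})

mask_version = toy[~toy.isna().any(axis=1)]  # approach used in this notebook
dropna_version = toy.dropna()                # built-in shorthand

print(mask_version)
print(mask_version.equals(dropna_version))
```

Only the first row survives, since each of the other two has a missing value in one column.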

I manually split the training set into a training subset (the data the model learns from) and an evaluation subset (the data the model uses to compute metric scores, which helps detect training problems such as [overfitting](https://www.ibm.com/cloud/learn/overfitting)).

There are multiple ways to split the dataset. Further down, you'll see two commented lines showing another one.

In [7]:
# Split the train data => {train, eval}
train, eval = train_test_split(df, test_size=0.2, random_state=42, stratify=df['label'])
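To illustrate what `stratify=df['label']` is for, here is a simplified pure-Python sketch of a stratified split. It is not scikit-learn's actual algorithm, the function name `stratified_split` is hypothetical, and the label counts are made up; the point is only that each label keeps roughly the same share in both subsets.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, eval_idx) keeping each label's share roughly equal."""
    # Group row indices by their label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    rng = random.Random(seed)
    train_idx, eval_idx = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Take the same fraction of every label for evaluation.
        n_eval = round(len(indices) * test_size)
        eval_idx.extend(indices[:n_eval])
        train_idx.extend(indices[n_eval:])
    return train_idx, eval_idx


# Toy labels: 50 negative, 30 neutral, 20 positive (made-up proportions).
labels = [-1] * 50 + [0] * 30 + [1] * 20
train_idx, eval_idx = stratified_split(labels)
print(Counter(labels[i] for i in eval_idx))
```

Without stratification, a random 20% sample could by chance contain too few examples of a rare label, which would make the evaluation scores less reliable.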

In [8]:
train.head()

Unnamed: 0,tweet_id,safe_text,label,agreement
9305,YMRMEDME,Mickey's Measles has gone international <url>,0.0,1.0
3907,5GV8NEZS,S1256 [NEW] Extends exemption from charitable ...,0.0,1.0
795,EI10PS46,<user> your ignorance on vaccines isn't just ...,1.0,0.666667
5793,OM26E6DG,Pakistan partly suspends polio vaccination pro...,0.0,1.0
3431,NBBY86FX,In other news I've gone up like 1000 mmr,0.0,1.0


In [9]:
eval.head()

Unnamed: 0,tweet_id,safe_text,label,agreement
6571,R7JPIFN7,Children's Museum of Houston to Offer Free Vac...,1.0,1.0
1754,2DD250VN,<user> no. I was properly immunized prior to t...,1.0,1.0
3325,ESEVBTFN,<user> thx for posting vaccinations are impera...,1.0,1.0
1485,S17ZU0LC,This Baby Is Exactly Why Everyone Needs To Vac...,1.0,0.666667
4175,IIN5D33V,"Meeting tonight, 8:30pm in room 322 of the stu...",1.0,1.0


In [10]:
print(f"new dataframe shapes: train is {train.shape}, eval is {eval.shape}")

new dataframe shapes: train is (7999, 4), eval is (2000, 4)


In [11]:
# Save the split subsets
train.to_csv(os.path.join(file_path, "train_subset.csv"), index=False)
eval.to_csv(os.path.join(file_path, "eval_subset.csv"), index=False)

In [12]:
# Load the CSV files into a dataset

from datasets import load_dataset

dataset = load_dataset('csv', data_files={
    'train': file_path + 'train_subset.csv',
    'eval': file_path + 'eval_subset.csv'
}, encoding='ISO-8859-1')

Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-eb6b2b34a2fc86dd/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1...


Downloading data files:   0%|          | 0/2 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/2 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

Generating eval split: 0 examples [00:00, ? examples/s]

Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-eb6b2b34a2fc86dd/0.0.0/6954658bab30a358235fa864b05cf819af0e179325c740e4bc853bcc7ec513e1. Subsequent calls will reuse this data.


  0%|          | 0/2 [00:00<?, ?it/s]

In [13]:
!pip install transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.28.1-py3-none-any.whl (7.0 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Installing collected packages: tokenizers, transformers
Successfully installed tokenizers-0.13.3 transformers-4.28.1


In [15]:
from transformers import AutoTokenizer
tokenizer1 = AutoTokenizer.from_pretrained('bert-base-cased')
#tokenizer2 = AutoTokenizer.from_pretrained('roberta-base')

In [16]:
def transform_labels(label):

    label = label['label']
    num = 0
    if label == -1: #'Negative'
        num = 0
    elif label == 0: #'Neutral'
        num = 1
    elif label == 1: #'Positive'
        num = 2

    return {'labels': num}

#def tokenize_data(example):
    #return tokenizer(example['safe_text'], padding='max_length')

# Change the tweets to tokens that the models can exploit
#dataset = dataset.map(tokenize_data, batched=True)

# Transform labels and remove the useless columns
#remove_columns = ['tweet_id', 'label', 'safe_text', 'agreement']
#dataset = dataset.map(transform_labels, remove_columns=remove_columns)

def tokenize_data1(example):
    return tokenizer1(example['safe_text'], padding='max_length')

#def tokenize_data2(example):
    #return tokenizer2(example['safe_text'], padding='max_length')

# Change the tweets to tokens that the models can exploit
dataset = dataset.map(tokenize_data1, batched=True, num_proc=4)
#dataset = dataset.map(tokenize_data2, batched=True, num_proc=4)

# Transform labels and remove the useless columns
remove_columns = ['tweet_id', 'label', 'safe_text', 'agreement']
dataset = dataset.map(transform_labels, remove_columns=remove_columns)

Map (num_proc=4):   0%|          | 0/7999 [00:00<?, ? examples/s]

Map (num_proc=4):   0%|          | 0/2000 [00:00<?, ? examples/s]

Map:   0%|          | 0/7999 [00:00<?, ? examples/s]

Map:   0%|          | 0/2000 [00:00<?, ? examples/s]
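The label transformation above shifts the raw labels (-1, 0, 1) to class indices (0, 1, 2), because the classification head expects indices starting at 0. Below is a compact sketch of the same mapping using a dictionary; the names `LABEL_TO_ID`, `ID_TO_NAME` and `to_class_id` are hypothetical, and the class names come from the comments in `transform_labels`.

```python
# Original label -> class index expected by the classification head.
LABEL_TO_ID = {-1: 0, 0: 1, 1: 2}
ID_TO_NAME = {0: "Negative", 1: "Neutral", 2: "Positive"}


def to_class_id(label: float) -> int:
    """Map a raw label (-1, 0 or 1) to a class index (0, 1 or 2)."""
    return LABEL_TO_ID[int(label)]


for raw in (-1.0, 0.0, 1.0):
    class_id = to_class_id(raw)
    print(raw, "->", class_id, ID_TO_NAME[class_id])
```

One difference worth noting: this version raises a `KeyError` on an unexpected label, whereas `transform_labels` silently falls back to class 0.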

In [17]:
dataset

DatasetDict({
    train: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 7999
    })
    eval: Dataset({
        features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
        num_rows: 2000
    })
})

In [18]:
dataset['train']

Dataset({
    features: ['input_ids', 'token_type_ids', 'attention_mask', 'labels'],
    num_rows: 7999
})
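The `input_ids` and `attention_mask` features come from the tokenizer call with `padding='max_length'`. As a rough mental model only (the real tokenizer does much more), here is a pure-Python sketch of padding; the function name `pad_batch`, the token ids and the pad id of 0 are assumptions made for illustration, and `token_type_ids` is left out.

```python
def pad_batch(batch, max_length, pad_id=0):
    """Pad token-id lists to max_length and build matching attention masks."""
    input_ids, attention_mask = [], []
    for ids in batch:
        # Sequences already at or above max_length are left unchanged.
        n_pad = max(0, max_length - len(ids))
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Made-up token ids, not real BERT vocabulary entries.
toy = pad_batch([[101, 7, 8, 102], [101, 9, 102]], max_length=6)
print(toy["input_ids"])       # [[101, 7, 8, 102, 0, 0], [101, 9, 102, 0, 0, 0]]
print(toy["attention_mask"])  # [[1, 1, 1, 1, 0, 0], [1, 1, 1, 0, 0, 0]]
```

Padding every tweet to the model's maximum length is simple but makes each batch larger than necessary, which is one reason training can be slow.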

In [19]:
from transformers import IntervalStrategy, TrainingArguments

training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy=IntervalStrategy.STEPS,  # match with save_strategy
    save_strategy=IntervalStrategy.STEPS,
    save_steps=500,
    load_best_model_at_end=True,
    num_train_epochs=5,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
)



# Configure the training parameters such as `num_train_epochs`:
# the number of times the model will repeat the training loop over the dataset
#training_args = TrainingArguments("test_trainer", num_train_epochs=3000, load_best_model_at_end=True,)

Using the `WANDB_DISABLED` environment variable is deprecated and will be removed in v5. Use the --report_to flag to control the integrations used for logging result (for instance --report_to none).
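As a quick sanity check on the training arguments defined in the cell above, the expected number of training steps can be worked out by hand. This sketch assumes a single device, no gradient accumulation, and that the last partial batch is kept; the row count is the size of the train split reported earlier.

```python
import math

num_rows = 7999    # size of the train split reported earlier
batch_size = 2     # per_device_train_batch_size
num_epochs = 5     # num_train_epochs
save_steps = 500   # save_steps

steps_per_epoch = math.ceil(num_rows / batch_size)
total_steps = steps_per_epoch * num_epochs
num_checkpoints = total_steps // save_steps

print(steps_per_epoch, total_steps, num_checkpoints)  # 4000 20000 40
```

The total of 20000 steps matches the `global_step=20000` reported by the training run further down. It also suggests that a checkpoint is written about 40 times, which can take a fair amount of disk space.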


In [20]:
from transformers import AutoModelForSequenceClassification

# Load a pretrained model, specifying the number of labels in our dataset for fine-tuning
model1 = AutoModelForSequenceClassification.from_pretrained("bert-base-cased", num_labels=3)
#model2 = AutoModelForSequenceClassification.from_pretrained('roberta-base', num_labels=3)

Downloading pytorch_model.bin:   0%|          | 0.00/436M [00:00<?, ?B/s]

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSequenceClassification were not initialized from the model checkpoint at b

In [21]:
train_dataset = dataset['train'].shuffle(seed=10) #.select(range(40000)) # to select a part
eval_dataset = dataset['eval'].shuffle(seed=10)

## Another way to split the train set: in the ranges, you must use
# range(int(num_rows * .8)) for the first 80% and range(int(num_rows * .8), num_rows) for the remaining 20%
# train_dataset = dataset['train'].shuffle(seed=10).select(range(40000))
# eval_dataset = dataset['train'].shuffle(seed=10).select(range(40000, 41000))

In [22]:
from transformers import Trainer

trainer1 = Trainer(
    model=model1, args=training_args, train_dataset=train_dataset, eval_dataset=eval_dataset
)
#trainer2 = Trainer(
#    model=model2, args=training_args, train_dataset=train_dataset, eval_dataset=eval_dataset
#)


In [24]:
# Launch the learning process: training 
trainer1.train()



Step,Training Loss,Validation Loss
500,1.0014,0.956402
1000,1.012,0.91731
1500,0.9887,0.890278
2000,1.0382,0.988791
2500,0.9797,0.975605
3000,1.0079,0.95898
3500,0.9788,1.028735
4000,1.001,0.96715
4500,1.0034,0.956286
5000,0.9894,1.004046


TrainOutput(global_step=20000, training_loss=0.9737689590454102, metrics={'train_runtime': 7031.7068, 'train_samples_per_second': 5.688, 'train_steps_per_second': 2.844, 'total_flos': 1.052322114203136e+16, 'train_loss': 0.9737689590454102, 'epoch': 5.0})

If you see a `KeyboardInterrupt` in the output above, don't worry: it only means the training was stopped manually to avoid waiting a long time for it to finish.
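The logged validation losses above hover around the same values without a clear downward trend. Since `load_best_model_at_end=True` is set, the checkpoint with the best evaluation metric should be reloaded at the end (the evaluation loss by default, as far as I understand). The sketch below simply finds the lowest validation loss among the rows shown; note that the table only covers the first 5000 of the 20000 steps, so this is illustrative only.

```python
# (step, validation loss) pairs copied from the training log above.
history = [
    (500, 0.956402), (1000, 0.91731), (1500, 0.890278), (2000, 0.988791),
    (2500, 0.975605), (3000, 0.95898), (3500, 1.028735), (4000, 0.96715),
    (4500, 0.956286), (5000, 1.004046),
]

# Pick the row with the smallest validation loss.
best_step, best_loss = min(history, key=lambda pair: pair[1])
print(f"lowest validation loss {best_loss} at step {best_step}")
```

Among the logged rows, the lowest validation loss is at step 1500.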

In [40]:
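import numpy as np
#from datasets import load_metric  # only needed if the accuracy metric below is enabled
#from sklearn.metrics import mean_squared_error

#metric = load_metric("accuracy")

def compute_metrics(eval_pred):
    # eval_pred holds the model outputs (logits) and the true labels
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit
    predictions = np.argmax(logits, axis=-1)
    # Root mean squared error between predicted and true class indices
    rmse = np.sqrt(np.mean((predictions - labels) ** 2))
    return {"rmse": rmse}
    #return metric.compute(predictions=predictions, references=labels)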
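To see what this metric computes, here is a small worked example on made-up logits (4 examples, 3 classes). The function name `rmse_from_logits` is hypothetical; it mirrors the logic of `compute_metrics` but takes the two arrays directly.

```python
import numpy as np


def rmse_from_logits(logits, labels):
    """Turn logits into class predictions, then compute RMSE against labels."""
    predictions = np.argmax(logits, axis=-1)
    return float(np.sqrt(np.mean((predictions - labels) ** 2)))


# Made-up logits for 4 examples and 3 classes.
logits = np.array([[2.0, 1.0, 0.0],
                   [0.0, 1.0, 2.0],
                   [0.0, 3.0, 1.0],
                   [1.0, 0.0, 5.0]])
labels = np.array([0, 1, 1, 2])

print(rmse_from_logits(logits, labels))  # predictions are [0, 2, 1, 2] -> 0.5
```

Because RMSE treats the class indices as ordered numbers, predicting Negative (0) for a Positive (2) tweet is penalized more than predicting Neutral (1). Accuracy, the commented-out alternative, would count both mistakes equally.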

In [41]:
trainer = Trainer(
    model=model1,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)

In [42]:
# Launch the final evaluation 
trainer.evaluate()

{'eval_loss': 0.8706203699111938,
 'eval_rmse': 0.6734240862933253,
 'eval_runtime': 61.5355,
 'eval_samples_per_second': 32.502,
 'eval_steps_per_second': 16.251}

Checkpoints of the model are automatically saved locally in `results/` (the `output_dir` set in the training arguments) during the training.
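The saved checkpoints are stored in sub-folders named `checkpoint-<step>` inside the output directory. Here is a minimal sketch for listing them in step order; the helper name `list_checkpoints` is hypothetical, and the demo uses a temporary folder with empty directories rather than real checkpoints.

```python
import re
import tempfile
from pathlib import Path


def list_checkpoints(output_dir):
    """Return checkpoint directories sorted by their step number."""
    pattern = re.compile(r"checkpoint-(\d+)$")
    found = []
    for path in Path(output_dir).iterdir():
        match = pattern.match(path.name)
        if path.is_dir() and match:
            found.append((int(match.group(1)), path))
    # Sort numerically so that checkpoint-1000 comes after checkpoint-500.
    return [path for _, path in sorted(found)]


# Demo on a temporary folder with fake (empty) checkpoint directories.
with tempfile.TemporaryDirectory() as tmp:
    for step in (1000, 500, 1500):
        (Path(tmp) / f"checkpoint-{step}").mkdir()
    names = [p.name for p in list_checkpoints(tmp)]

print(names)  # ['checkpoint-500', 'checkpoint-1000', 'checkpoint-1500']
```

On a real run, you would call it with the output directory passed to `TrainingArguments`.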

You may also upload the model to the Hugging Face platform. [Read more](https://huggingface.co/docs/hub/models-uploading)

This notebook is inspired by an article: [Fine-Tuning Bert for Tweets Classification ft. Hugging Face](https://medium.com/mlearning-ai/fine-tuning-bert-for-tweets-classification-ft-hugging-face-8afebadd5dbf)

Do not hesitate to read more and to ask questions; learning is a lifelong activity.