
Fine-tuning roadmap:

- Prepare data: Convert your balanced DataFrame into a Hugging Face Dataset with text and numeric labels
- Tokenize: Run the tokenizer over all your texts so the model can read them
- Train/test split: Split the dataset (Hugging Face has a built-in method for this)
- Load model: Load roberta-base configured for 3-class classification
- Set training arguments: Things like learning rate, batch size, number of epochs
- Train: Use Hugging Face's Trainer to fine-tune the model on your data
- Evaluate: Run predictions on your test set, then generate a classification report and a confusion matrix

In [None]:
from transformers import DataCollatorWithPadding

In [4]:
from google.colab import files
uploaded = files.upload()

Saving balanced_reviews.csv to balanced_reviews.csv


In [5]:
# Load the balanced reviews and the roberta-base tokenizer

import pandas as pd

from transformers import AutoTokenizer

df_balanced = pd.read_csv("balanced_reviews.csv")
tokenizer = AutoTokenizer.from_pretrained("roberta-base")


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [6]:
df_balanced = pd.read_csv("balanced_reviews.csv")
df_balanced.head()

Unnamed: 0,id,dateadded,dateupdated,name,asins,brand,categories,primarycategories,imageurls,keys,...,reviews.date,reviews.dateseen,reviews.rating,reviews.sourceurls,reviews.text,reviews.title,reviews.username,sourceurls,rating_sentiment,predicted_sentiment
0,AVpe7xlELJeJML43ypLz,2015-12-03T01:23:41Z,2019-04-24T02:17:42Z,AmazonBasics AA Performance Alkaline Batteries...,"B00QWO9P0O,B01IB83NZG,B00MNV8E0C",Amazonbasics,"AA,AAA,Electronics Features,Health,Electronics...",Health & Beauty,https://images-na.ssl-images-amazon.com/images...,amazonbasicsaaperformancealkalinebatteries48co...,...,2015-10-05T00:00:00.000Z,2017-06-28T00:00:00Z,2,https://www.amazon.com/product-reviews/B00QWO9...,I am all about Amazon Basics but I have ordere...,They just don't last...,ByS. Ogran,"https://www.barcodable.com/upc/841710106411,ht...",negative,negative
1,AVpf2cQm1cnluZ0-sb5y,2017-01-30T18:40:57Z,2019-02-24T04:01:52Z,Amazon Kindle Replacement Power Adapter (Fits ...,B001NIZB5M,Amazon,"Computers & Accessories,Electronics,Amazon Dev...",Electronics,https://images-na.ssl-images-amazon.com/images...,"0892685001164,amazon/a00810,amazoncom/a00810,a...",...,2011-09-23T00:00:00Z,2017-01-26T00:00:00Z,1,https://www.amazon.com/Amazon-Replacement-Adap...,Can someone explain to me why I can find hundr...,Why no Amazon statement regarding disintegrati...,Bill Clinton,https://www.barcodable.com/upc/892685001164,negative,negative
2,AVqkIhxunnc1JgDc3kg_,2017-03-06T14:59:43Z,2019-02-23T02:49:38Z,"Fire HD 8 Tablet with Alexa, 8 HD Display, 16 ...",B018T075DC,Amazon,"Fire Tablets,Tablets,All Tablets,Amazon Tablet...",Electronics,https://www.upccodesearch.com/images/barcode/0...,"amazon/b018t075dc,firehd8tabletwithalexa8hddis...",...,2017-03-06T00:00:00.000Z,"2017-04-30T00:00:00Z,2017-06-07T00:00:00Z",2,http://reviews.bestbuy.com/3545/5620410/review...,The first Fire HD8 we purchased died within 1 ...,the first Fire HD8 we purchased died,steve,http://reviews.bestbuy.com/3545/5620410/review...,negative,negative
3,AVpfw2hvilAPnD_xh0rH,2017-01-11T06:58:33Z,2019-03-09T07:13:43Z,"Fire Kids Edition Tablet, 7 Display, Wi-Fi, 16...",B018Y226XO,Amazon,"Fire Tablets,Learning Toys,Toys,Tablets,Amazon...","Toys & Games,Electronics",https://pisces.bbystatic.com/image2/BestBuy_US...,"amazon/53004754,841667103372,0841667103372,7in...",...,2017-03-16T00:00:00.000Z,"2017-04-26T00:00:00Z,2017-05-10T00:00:00Z,2017...",2,http://reviews.bestbuy.com/3545/5026100/review...,So I bought my little guys these tablets... so...,Disappointed doesn't play Netflix,Toshalee,http://www.toysrus.com/product/index.jsp?produ...,negative,negative
4,AVpgNzjwLJeJML43Kpxn,2015-10-30T08:59:32Z,2019-04-25T09:08:16Z,AmazonBasics AAA Performance Alkaline Batterie...,"B00QWO9P0O,B00LH3DMUO",Amazonbasics,"AA,AAA,Health,Electronics,Health & Household,C...",Health & Beauty,https://images-na.ssl-images-amazon.com/images...,"amazonbasics/hl002619,amazonbasicsaaaperforman...",...,2017-05-16T00:00:00.000Z,2017-08-28T00:00:00Z,1,https://www.amazon.com/product-reviews/B00QWO9...,Awful poor performance. My mouse ran for sever...,Awful poor performance. My mouse ran for sever...,ByAmazon Customer,"https://www.barcodable.com/upc/841710106442,ht...",negative,negative


In [7]:
# Map the rating-based sentiment labels to integer class IDs for the model

actual_map = {"negative": 0, "neutral": 1, "positive": 2}
df_balanced["actual_sentiment"] = df_balanced["rating_sentiment"].map(actual_map)
df_balanced.head()

Unnamed: 0,id,dateadded,dateupdated,name,asins,brand,categories,primarycategories,imageurls,keys,...,reviews.dateseen,reviews.rating,reviews.sourceurls,reviews.text,reviews.title,reviews.username,sourceurls,rating_sentiment,predicted_sentiment,actual_sentiment
0,AVpe7xlELJeJML43ypLz,2015-12-03T01:23:41Z,2019-04-24T02:17:42Z,AmazonBasics AA Performance Alkaline Batteries...,"B00QWO9P0O,B01IB83NZG,B00MNV8E0C",Amazonbasics,"AA,AAA,Electronics Features,Health,Electronics...",Health & Beauty,https://images-na.ssl-images-amazon.com/images...,amazonbasicsaaperformancealkalinebatteries48co...,...,2017-06-28T00:00:00Z,2,https://www.amazon.com/product-reviews/B00QWO9...,I am all about Amazon Basics but I have ordere...,They just don't last...,ByS. Ogran,"https://www.barcodable.com/upc/841710106411,ht...",negative,negative,0
1,AVpf2cQm1cnluZ0-sb5y,2017-01-30T18:40:57Z,2019-02-24T04:01:52Z,Amazon Kindle Replacement Power Adapter (Fits ...,B001NIZB5M,Amazon,"Computers & Accessories,Electronics,Amazon Dev...",Electronics,https://images-na.ssl-images-amazon.com/images...,"0892685001164,amazon/a00810,amazoncom/a00810,a...",...,2017-01-26T00:00:00Z,1,https://www.amazon.com/Amazon-Replacement-Adap...,Can someone explain to me why I can find hundr...,Why no Amazon statement regarding disintegrati...,Bill Clinton,https://www.barcodable.com/upc/892685001164,negative,negative,0
2,AVqkIhxunnc1JgDc3kg_,2017-03-06T14:59:43Z,2019-02-23T02:49:38Z,"Fire HD 8 Tablet with Alexa, 8 HD Display, 16 ...",B018T075DC,Amazon,"Fire Tablets,Tablets,All Tablets,Amazon Tablet...",Electronics,https://www.upccodesearch.com/images/barcode/0...,"amazon/b018t075dc,firehd8tabletwithalexa8hddis...",...,"2017-04-30T00:00:00Z,2017-06-07T00:00:00Z",2,http://reviews.bestbuy.com/3545/5620410/review...,The first Fire HD8 we purchased died within 1 ...,the first Fire HD8 we purchased died,steve,http://reviews.bestbuy.com/3545/5620410/review...,negative,negative,0
3,AVpfw2hvilAPnD_xh0rH,2017-01-11T06:58:33Z,2019-03-09T07:13:43Z,"Fire Kids Edition Tablet, 7 Display, Wi-Fi, 16...",B018Y226XO,Amazon,"Fire Tablets,Learning Toys,Toys,Tablets,Amazon...","Toys & Games,Electronics",https://pisces.bbystatic.com/image2/BestBuy_US...,"amazon/53004754,841667103372,0841667103372,7in...",...,"2017-04-26T00:00:00Z,2017-05-10T00:00:00Z,2017...",2,http://reviews.bestbuy.com/3545/5026100/review...,So I bought my little guys these tablets... so...,Disappointed doesn't play Netflix,Toshalee,http://www.toysrus.com/product/index.jsp?produ...,negative,negative,0
4,AVpgNzjwLJeJML43Kpxn,2015-10-30T08:59:32Z,2019-04-25T09:08:16Z,AmazonBasics AAA Performance Alkaline Batterie...,"B00QWO9P0O,B00LH3DMUO",Amazonbasics,"AA,AAA,Health,Electronics,Health & Household,C...",Health & Beauty,https://images-na.ssl-images-amazon.com/images...,"amazonbasics/hl002619,amazonbasicsaaaperforman...",...,2017-08-28T00:00:00Z,1,https://www.amazon.com/product-reviews/B00QWO9...,Awful poor performance. My mouse ran for sever...,Awful poor performance. My mouse ran for sever...,ByAmazon Customer,"https://www.barcodable.com/upc/841710106442,ht...",negative,negative,0
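
`Series.map` with a dict returns NaN for any value that is not a key, so a misspelled or differently cased label would silently become a missing class ID. A minimal sketch of a check, assuming a hypothetical toy frame (`toy` and its values are illustrative, not taken from the dataset):

```python
import pandas as pd

actual_map = {"negative": 0, "neutral": 1, "positive": 2}

# Hypothetical toy data; "Positive" (capitalized) is deliberately not in the map
toy = pd.DataFrame({"rating_sentiment": ["negative", "neutral", "positive", "Positive"]})
toy["actual_sentiment"] = toy["rating_sentiment"].map(actual_map)

# Unmapped values come back as NaN
missing = toy["actual_sentiment"].isna()
n_unmapped = int(missing.sum())
unmapped_values = sorted(toy.loc[missing, "rating_sentiment"].unique())
print(n_unmapped, unmapped_values)
```

Run against `df_balanced`, the same `isna().sum()` check should come back as 0 before the column is handed to the model.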


In [8]:
# Convert the pandas DataFrame to a Hugging Face Dataset, keeping only the text and label columns

from datasets import Dataset

dataset = Dataset.from_pandas(df_balanced[["reviews.text", "actual_sentiment"]])
dataset = dataset.rename_column("actual_sentiment", "labels")

In [9]:
# Tokenize the reviews; truncation=True cuts inputs at the model's maximum length

def tokenize_function(examples):
    return tokenizer(examples["reviews.text"], truncation=True)

tokenized_dataset = dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/3618 [00:00<?, ? examples/s]
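
The tokenizer call above truncates but does not pad, so every example keeps its own length; padding is left to `DataCollatorWithPadding`, which pads each batch to its longest sequence. A plain-Python sketch of that idea follows. `pad_batch` is a hypothetical helper written for illustration (not a library function), the token IDs are made up, and the pad ID of 1 assumes RoBERTa's `<pad>` token:

```python
def pad_batch(batch, pad_id=1):
    """Pad every sequence in the batch to the longest one (dynamic padding)."""
    max_len = max(len(ids) for ids in batch)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in batch]
    # 1 marks real tokens, 0 marks padding the model should ignore
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids)) for ids in batch]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Hypothetical token IDs; 0 and 2 stand in for RoBERTa's <s> and </s>
batch = [[0, 100, 200, 2], [0, 300, 2]]
padded = pad_batch(batch)
print(padded["input_ids"])       # [[0, 100, 200, 2], [0, 300, 2, 1]]
print(padded["attention_mask"])  # [[1, 1, 1, 1], [1, 1, 1, 0]]
```

Padding per batch rather than to one global length keeps short reviews from being stretched to the length of the longest review in the whole dataset.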

In [10]:
# Split the tokenized dataset 70/30 into train and test sets with Hugging Face's built-in method
# (no seed is set, so the split differs between runs)
split_tokenized_data = tokenized_dataset.train_test_split(test_size=0.3)

In [11]:
# Load roberta-base with a sequence-classification head
# num_labels=3 tells it there are three classes (negative, neutral, positive)

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=3)

RobertaForSequenceClassification LOAD REPORT from: roberta-base
Key                             | Status     | 
--------------------------------+------------+-
lm_head.bias                    | UNEXPECTED | 
roberta.embeddings.position_ids | UNEXPECTED | 
lm_head.dense.bias              | UNEXPECTED | 
lm_head.layer_norm.weight       | UNEXPECTED | 
lm_head.dense.weight            | UNEXPECTED | 
lm_head.layer_norm.bias         | UNEXPECTED | 
classifier.out_proj.weight      | MISSING    | 
classifier.dense.bias           | MISSING    | 
classifier.out_proj.bias        | MISSING    | 
classifier.dense.weight         | MISSING    | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING	:those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.
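
With only `num_labels=3`, the model config carries generic class names. One option, sketched here as an assumption rather than as what this notebook does, is to pass `id2label` and `label2id` to `from_pretrained` so the saved config records the real names. `load_model_with_label_names` is a hypothetical helper and is not called here, so the block runs without downloading anything:

```python
# Same mapping the notebook uses for actual_sentiment
label2id = {"negative": 0, "neutral": 1, "positive": 2}
id2label = {class_id: name for name, class_id in label2id.items()}

def load_model_with_label_names():
    """Hypothetical helper: load roberta-base with named classes in its config."""
    from transformers import AutoModelForSequenceClassification

    return AutoModelForSequenceClassification.from_pretrained(
        "roberta-base",
        num_labels=3,
        id2label=id2label,
        label2id=label2id,
    )

print(id2label)  # {0: 'negative', 1: 'neutral', 2: 'positive'}
```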


In [12]:
# Set training arguments: epochs, batch sizes, and per-epoch evaluation and logging
# (the learning rate is not set here, so the library default is used)

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    eval_strategy="epoch",
    logging_strategy="epoch",
)

In [14]:
from transformers import Trainer

# Pad each batch to its longest sequence (DataCollatorWithPadding was imported in the first cell)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=split_tokenized_data["train"],
    eval_dataset=split_tokenized_data["test"],
    data_collator=data_collator
)

trainer.train()

Epoch,Training Loss,Validation Loss
1,0.744552,0.690135
2,0.466455,0.539119
3,0.270097,0.495039


TrainOutput(global_step=477, training_loss=0.49370119906571425, metrics={'train_runtime': 264.5929, 'train_samples_per_second': 28.708, 'train_steps_per_second': 1.803, 'total_flos': 572656075817112.0, 'train_loss': 0.49370119906571425, 'epoch': 3.0})
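
As a sanity check, the `global_step=477` in the output above can be reproduced from the dataset size and the training arguments. This sketch assumes a single device, no gradient accumulation, and that the test split size is rounded up (which matches the 1086 test rows in the classification report):

```python
import math

n_examples = 3618   # rows reported by the Map progress bar
test_size = 0.3     # from train_test_split
batch_size = 16     # per_device_train_batch_size
epochs = 3          # num_train_epochs

n_test = math.ceil(n_examples * test_size)          # 1086
n_train = n_examples - n_test                       # 2532
steps_per_epoch = math.ceil(n_train / batch_size)   # last batch is partial
total_steps = steps_per_epoch * epochs

print(n_train, steps_per_epoch, total_steps)  # 2532 159 477
```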

In [15]:
# Run predictions on the test set for the classification report

from sklearn.metrics import classification_report, confusion_matrix

predictions = trainer.predict(split_tokenized_data["test"])


In [16]:
print(predictions)

PredictionOutput(predictions=array([[-2.7671025 , -0.62494934,  3.3911276 ],
       [ 3.7381635 , -1.2086195 , -2.3113477 ],
       [-2.8536851 , -0.29879537,  3.0665889 ],
       ...,
       [ 3.6894476 , -1.1869177 , -2.555846  ],
       [-3.0767865 , -0.04762653,  3.003581  ],
       [-2.99981   ,  1.4532255 ,  1.5120962 ]], dtype=float32), label_ids=array([2, 0, 1, ..., 0, 2, 2]), metrics={'test_loss': 0.49503856897354126, 'test_runtime': 11.1409, 'test_samples_per_second': 97.479, 'test_steps_per_second': 6.104})
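
The `predictions` array holds raw logits, not probabilities. A small sketch of turning them into class probabilities with a softmax, using the first three rows printed above (the `softmax` helper is written here for illustration):

```python
import numpy as np

# First three rows of logits copied from the output above
logits = np.array([
    [-2.7671025, -0.62494934, 3.3911276],
    [3.7381635, -1.2086195, -2.3113477],
    [-2.8536851, -0.29879537, 3.0665889],
], dtype=np.float32)

def softmax(x):
    # Subtract the row max for numerical stability
    shifted = x - x.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

probs = softmax(logits)            # each row sums to 1
pred_ids = probs.argmax(axis=1)    # same result as argmax over the logits
print(pred_ids)  # [2 0 2]
```

The printed `label_ids` begin with 2, 0, 1, so the third example is a neutral review that the model predicted as positive.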


In [20]:
import numpy as np

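# The highest logit in each row is the predicted class ID
y_pred = np.argmax(predictions.predictions, axis=1)
# True class IDs carried through from the "labels" column
y_test = predictions.label_ids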

print(classification_report(y_test, y_pred, target_names=["negative", "neutral", "positive"]))

              precision    recall  f1-score   support

    negative       0.84      0.85      0.85       382
     neutral       0.77      0.74      0.75       355
    positive       0.89      0.91      0.90       349

    accuracy                           0.83      1086
   macro avg       0.83      0.83      0.83      1086
weighted avg       0.83      0.83      0.83      1086
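
The roadmap calls for a confusion matrix, and `confusion_matrix` is imported, but it is never called. The sketch below builds the same kind of matrix by hand with NumPy so that it runs on its own; rows are true classes and columns are predicted classes, matching scikit-learn's convention. The toy labels `y_true` and `y_hat` are hypothetical, not the notebook's results:

```python
import numpy as np

labels = ["negative", "neutral", "positive"]

# Hypothetical toy labels, not the notebook's results
y_true = np.array([0, 0, 1, 1, 2, 2])
y_hat = np.array([0, 1, 1, 1, 2, 1])

cm = np.zeros((len(labels), len(labels)), dtype=int)
np.add.at(cm, (y_true, y_hat), 1)  # cm[i, j] counts true class i predicted as j

for name, row in zip(labels, cm):
    print(f"{name:>8}: {row}")
```

In the notebook itself, `confusion_matrix(y_test, y_pred)` with the existing import should give the equivalent 3x3 array for the 1086 test reviews. Since neutral has the lowest precision and recall in the report, the neutral row and column are the first place to look for off-diagonal counts.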



In [21]:
# Save the fine-tuned model and tokenizer to the same directory for later reuse
model.save_pretrained("./fine_tuned_roberta")
tokenizer.save_pretrained("./fine_tuned_roberta")

('./fine_tuned_roberta/tokenizer_config.json',
 './fine_tuned_roberta/tokenizer.json')