<font size='6'><b>Labeling with the fine-tuned model</b></font>  
The goal of this notebook is to use the previously fine-tuned model to label the whole dataset of tweets.

Because the code runs on Google Colab, the first step is to upload the model archive and the required data files.

In [1]:
from google.colab import files
uploaded = files.upload()


Saving sentiment_model_twitter-roberta-base_20250510_1019.zip to sentiment_model_twitter-roberta-base_20250510_1019.zip


The model is loaded from the fine-tuned checkpoint, while the tokenizer is the original `cardiffnlp/twitter-roberta-base` one, since fine-tuning did not modify it.


In [4]:
import zipfile
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import os

# Extract the uploaded model zip file
zip_path = "sentiment_model_twitter-roberta-base_20250510_1019.zip"
extract_dir = "/content/sentiment_model_twitter-roberta-base_20250510_1019"

with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(extract_dir)

# Verify extracted files
extracted_files = os.listdir(extract_dir)
print("Extracted files:", extracted_files)


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load model from folder
model = AutoModelForSequenceClassification.from_pretrained(
    extract_dir,
    use_safetensors=True
).to(device)


# Load the original tokenizer; it was not changed during fine-tuning
tokenizer = AutoTokenizer.from_pretrained("cardiffnlp/twitter-roberta-base")

print("Model and tokenizer successfully loaded!")

Extracted files: ['config.json', 'model.safetensors', 'merges.txt', 'special_tokens_map.json', 'vocab.json', 'tokenizer.json', 'tokenizer_config.json']




config.json:   0%|          | 0.00/565 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Model and tokenizer successfully loaded!
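
The extraction output above lists tokenizer files (`tokenizer.json`, `vocab.json`, `merges.txt`) next to the model weights, so the tokenizer could in principle be loaded from the same folder instead of being downloaded from the Hub. The sketch below is one way to choose between the two sources. `pick_tokenizer_source` is a hypothetical helper, not part of the notebook or of `transformers`, and treating these file names as sufficient for loading is an assumption.

```python
import os

def pick_tokenizer_source(model_dir, hub_name):
    """Return model_dir if it appears to hold tokenizer files, else hub_name.

    Hypothetical helper for illustration. The file names checked are the ones
    shown in the extraction output; that they are sufficient is an assumption.
    """
    present = set(os.listdir(model_dir)) if os.path.isdir(model_dir) else set()
    has_fast = "tokenizer.json" in present                 # single-file tokenizer
    has_slow = {"vocab.json", "merges.txt"} <= present      # vocabulary + BPE merges
    return model_dir if (has_fast or has_slow) else hub_name

# In the notebook this could be used as (left commented out here):
# source = pick_tokenizer_source(extract_dir, "cardiffnlp/twitter-roberta-base")
# tokenizer = AutoTokenizer.from_pretrained(source)
```

Loading from the local folder would avoid the network download and would keep the tokenizer tied to the files saved with the checkpoint. Since fine-tuning left the tokenizer unchanged, both sources should behave the same.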


In [5]:
uploaded = files.upload()


Saving Tweets to label.csv to Tweets to label.csv


Now we can use the fine-tuned model to label the remaining tweets in batches.

In [6]:
import pandas as pd
import torch
from google.colab import files
from tqdm import tqdm

df = pd.read_csv("Tweets to label.csv")

batch_size = 32
all_preds = []

# Iterate over the tweets in batches of `batch_size`
for i in tqdm(range(0, len(df), batch_size)):
    batch_texts = df["Tweets"].iloc[i:i + batch_size].tolist()
    # Pad to the longest tweet in the batch and truncate to 128 tokens
    inputs = tokenizer(batch_texts, padding=True, truncation=True, max_length=128, return_tensors="pt").to(device)

    # Inference only, so gradients are not needed
    with torch.no_grad():
        outputs = model(**inputs)
        # The predicted class is the index of the largest logit
        preds = torch.argmax(outputs.logits, dim=1)
        all_preds.extend(preds.cpu().tolist())

df["predicted_label"] = all_preds
df.to_csv("Labeled tweets final.csv", index=False)

# Download the labeled file from the Colab runtime
files.download("Labeled tweets final.csv")



100%|██████████| 649/649 [01:12<00:00,  8.92it/s]
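
The progress bar reports 649 iterations at a batch size of 32. Because only the last batch can be partial, this bounds the size of the input file: at least 648 × 32 + 1 = 20,737 and at most 649 × 32 = 20,768 tweets. The sketch below reproduces that arithmetic and the slicing pattern of the loop without the model. `iter_batches` and `row_count_bounds` are hypothetical names introduced for illustration.

```python
import math

def iter_batches(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def row_count_bounds(num_batches, batch_size):
    """Smallest and largest row counts that yield exactly num_batches batches."""
    return (num_batches - 1) * batch_size + 1, num_batches * batch_size

low, high = row_count_bounds(649, 32)
print(low, high)  # prints: 20737 20768

# Both ends of the range give 649 batches when rounded up
assert math.ceil(low / 32) == math.ceil(high / 32) == 649
```

For example, `[len(b) for b in iter_batches(list(range(70)), 32)]` gives `[32, 32, 6]`: two full batches and one partial batch.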

