<font size="6"><b>Fine-tuning DistilBert</b></font>  
The goal of this notebook is to fine-tune a classifier for sentiment analysis, using a baseline of manually annotated tweets. The tweets were selected to include all original labels (positive, negative, neutral). These were then annotated by us after deciding and explicitly stating the question we wanted to keep in mind while annotating. The pre-trained model used is DistilBert.

In [24]:
import pandas as pd 

To facilitate the annotation process, the tweets were left unedited with all hashtags and URLs. Thus, in this first part of the notebook, the annotated tweets are processed to create the final dataset used for training.

In [25]:
# Load the annotated dataset and convert it into the right format
annotation_df_1=pd.read_csv('Annotation set 1.csv')
annotation_df_2=pd.read_csv('Annotation set 2.csv')
annotation_df_1.columns = annotation_df_1.columns.str.strip()
annotation_df_2.columns = annotation_df_2.columns.str.strip()
annotation_df = pd.concat([annotation_df_1, annotation_df_2])

print(annotation_df.head())
print(annotation_df['Annotation'].value_counts())
clean_df=pd.read_csv('Cleaned data.csv')

   Serial                                             Tweets Annotation
0   13202  Now you have zero reasons to not type annotate...   Positive
1   12423                                #glycotime #chatGPT    Neutral
2    6686  Organizations need to embed ethics-oriented th...    Neutral
3    2568  Study the antifragility of gamers. Crypto pric...   Negative
4    6554  Startup Spotlight: AI Edition Today, I want to...   Positive
Annotation
Neutral     859
Negative    587
Positive    553
Name: count, dtype: int64


In [27]:
merged_df=pd.merge(clean_df,annotation_df[['Serial','Annotation']],on='Serial',how='inner')
print(merged_df.head)
print(merged_df.isnull().sum())
merged_df=merged_df.dropna()

# Replace string with numeric value as expected by the classifier 
label_map={'Positive': 2,'Neutral': 1,'Negative': 0}
merged_df['Annotation']=merged_df['Annotation'].map(label_map)

training_df=merged_df[['Tweets','Annotation']].copy()
training_df=training_df.rename(columns={'Tweets':'text','Annotation':'label'})

print(training_df['label'].value_counts())

# Balance the classes by downsampling
min_count = training_df['label'].value_counts().min()

balanced_df = (
    training_df.groupby('label')
    .sample(n=min_count, random_state=42) 
    .reset_index(drop=True)
)


print(balanced_df['label'].value_counts())

# Save dataset for training
balanced_df.to_csv('Training data.csv',index=False)



<bound method NDFrame.head of       Serial                                             Tweets Annotation
0          5  theres still lots to unpack with ai (llms) the...   Positive
1          5  theres still lots to unpack with ai (llms) the...   Positive
2         36  ai adoption accelerated during the pandemic, b...    Neutral
3         36  ai adoption accelerated during the pandemic, b...   Positive
4         47  absolutely, mastering critical thinking is key...   Positive
...      ...                                                ...        ...
1964   25062  is this the level you're sinking to? using 4ch...    Neutral
1965   25091  frequently the use of a word or two will trigg...    Neutral
1966   25107  reports of israel using ai for targeting in th...   Negative
1967   25120  using ai for israel killing machine. that is t...   Negative
1968   25143  mfs really out here using ai for a console war...   Negative

[1969 rows x 3 columns]>
Serial        0
Tweets        0
Annotation  

In order to evaluate the performance of the fine-tuned model, it is necessary to establish a suitable <b>baseline</b>.   

A robust yet elementary baseline is implemented in the following cell, utilising TF-IDF on character n-grams, followed by logistic regression. The model achieves a good level of accuracy, with an overall accuracy of  0.6512 and a overall f1 score of 0.6416. 
We will try to achieve an higher  macro averaged f1 score fine tuning DistilBert.

In [34]:
# Baseline to confront this with

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline


# Load your preprocessed data
df = pd.read_csv("Training data.csv")

# Split into train and test
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42
)

# Define pipeline 
pipeline = Pipeline([
        ("tfidf", TfidfVectorizer(analyzer="char", ngram_range=(2,6))),
        ("clf", LogisticRegression(class_weight="balanced", max_iter=1000))
    ])

# Train the model
pipeline.fit(X_train, y_train)

# Predict and evaluate
y_pred = pipeline.predict(X_test)
print(classification_report(y_test, y_pred, digits=4))



              precision    recall  f1-score   support

           0     0.7099    0.7561    0.7323       123
           1     0.6296    0.5100    0.5635       100
           2     0.5982    0.6634    0.6291       101

    accuracy                         0.6512       324
   macro avg     0.6459    0.6432    0.6416       324
weighted avg     0.6503    0.6512    0.6480       324



In [35]:
from transformers import DistilBertTokenizer, DistilBertForSequenceClassification, Trainer, TrainingArguments
from transformers import EarlyStoppingCallback
from datasets import load_dataset
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support,f1_score
import torch
import accelerate

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(accelerate.__version__)

1.5.2


In [None]:
# Let's download and fine-tune the model

dataset = load_dataset('csv', data_files='Training data.csv')
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# Tokenize function
def tokenize_function(example):
    return tokenizer(example['text'], padding='max_length', truncation=True, max_length=128)

# Apply tokenization
tokenized_dataset = dataset['train'].map(tokenize_function, batched=True)

tokenized_dataset = tokenized_dataset.rename_column("label", "labels")
tokenized_dataset.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

# Split dataset in train/validation/test splits 
train_test_split = tokenized_dataset.train_test_split(test_size=0.2, seed=42)


temp_dataset = train_test_split['test']
val_test_split = temp_dataset.train_test_split(test_size=0.5, seed=42) # Split to create validation 

# Get the final train, validation, and test sets
train_dataset = train_test_split['train']
val_dataset = val_test_split['train']
test_dataset = val_test_split['test']

# Check the sizes of each dataset
print(f"Train dataset size: {len(train_dataset)}")
print(f"Validation dataset size: {len(val_dataset)}")
print(f"Test dataset size: {len(test_dataset)}")


Generating train split: 1617 examples [00:00, 48496.52 examples/s]
Map: 100%|██████████| 1617/1617 [00:01<00:00, 969.75 examples/s] 

Train dataset size: 1293
Validation dataset size: 162
Test dataset size: 162





In [17]:
# Load pre-trained DistilBERT model
model = DistilBertForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=3)
model.to(device)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): DistilBertSdpaAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)


In [None]:
# Define training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=8,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    warmup_steps=50,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1_macro",
    no_cuda=True,
    learning_rate=1e-5
)


def compute_metrics(p):
    preds = torch.tensor(p.predictions).argmax(dim=1)
    labels = torch.tensor(p.label_ids)
    
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average=None, labels=[0, 1, 2])

    acc = accuracy_score(labels, preds)
    f1_macro = f1_score(labels, preds, average="macro")
    
    return {
        "accuracy": acc,
        "precision_negative": precision[0],
        "recall_negative": recall[0],
        "f1_negative": f1[0],
        "precision_neutral": precision[1],
        "recall_neutral": recall[1],
        "f1_neutral": f1[1],
        "precision_positive": precision[2],
        "recall_positive": recall[2],
        "f1_positive": f1[2],
        "f1_macro": f1_macro
    }


# Initialize trainer with the updated datasets
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # Using the train dataset
    eval_dataset=val_dataset,     # Using the validation dataset
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)]
)
 

  trainer = Trainer(


In [11]:
# Train the model
trainer.train()

# Evaluate the model
eval_result = trainer.evaluate()
print("Evaluation results:", eval_result)
test_results = trainer.evaluate(test_dataset)
print("Test results:", test_results)

# Save the model and tokenizer
model.save_pretrained("./sentiment_model_with_validation")
tokenizer.save_pretrained("./sentiment_model_with_validation")

Epoch,Training Loss,Validation Loss,Accuracy,Precision Negative,Recall Negative,F1 Negative,Precision Neutral,Recall Neutral,F1 Neutral,Precision Positive,Recall Positive,F1 Positive,F1 Macro
1,0.9344,0.925758,0.555556,0.505263,0.90566,0.648649,0.333333,0.065217,0.109091,0.672414,0.619048,0.644628,0.467456
2,0.7757,0.839353,0.592593,0.610169,0.679245,0.642857,0.404255,0.413043,0.408602,0.732143,0.650794,0.689076,0.580178
3,0.5849,0.782924,0.623457,0.614035,0.660377,0.636364,0.458333,0.478261,0.468085,0.77193,0.698413,0.733333,0.612594
4,0.4823,0.793994,0.685185,0.642857,0.849057,0.731707,0.642857,0.391304,0.486486,0.75,0.761905,0.755906,0.658033
5,0.3371,0.818975,0.654321,0.723404,0.641509,0.68,0.466667,0.608696,0.528302,0.8,0.698413,0.745763,0.651355
6,0.3186,0.873729,0.654321,0.727273,0.603774,0.659794,0.472727,0.565217,0.514851,0.761905,0.761905,0.761905,0.645517


Evaluation results: {'eval_loss': 0.7939937114715576, 'eval_accuracy': 0.6851851851851852, 'eval_precision_negative': 0.6428571428571429, 'eval_recall_negative': 0.8490566037735849, 'eval_f1_negative': 0.7317073170731707, 'eval_precision_neutral': 0.6428571428571429, 'eval_recall_neutral': 0.391304347826087, 'eval_f1_neutral': 0.4864864864864865, 'eval_precision_positive': 0.75, 'eval_recall_positive': 0.7619047619047619, 'eval_f1_positive': 0.7559055118110236, 'eval_f1_macro': 0.6580331051235603, 'eval_runtime': 8.1643, 'eval_samples_per_second': 19.842, 'eval_steps_per_second': 2.572, 'epoch': 6.0}
Test results: {'eval_loss': 0.807064414024353, 'eval_accuracy': 0.654320987654321, 'eval_precision_negative': 0.6176470588235294, 'eval_recall_negative': 0.7924528301886793, 'eval_f1_negative': 0.6942148760330579, 'eval_precision_neutral': 0.6923076923076923, 'eval_recall_neutral': 0.32142857142857145, 'eval_f1_neutral': 0.43902439024390244, 'eval_precision_positive': 0.6764705882352942, '

('./sentiment_model_with_validation\\tokenizer_config.json',
 './sentiment_model_with_validation\\special_tokens_map.json',
 './sentiment_model_with_validation\\vocab.txt',
 './sentiment_model_with_validation\\added_tokens.json')

The final model has a macro f1 score of  approximately 0.632. Unfortunately after different attempts we couldn't manage to further improve performance, so we settled on this model which performs worst than the baseline on the macro f1 score metric. In this last snippet of code the previously cleaned data is prepared to be labeled. 

In [5]:
tweet_df=pd.read_csv('Cleaned data.csv')

mask=~tweet_df['Serial'].isin(merged_df['Serial'])
final_tweets=tweet_df[mask]
final_tweets['Tweets'].to_csv('Tweets to label.csv')