# LABELLING - ACTIVE LEARNING

In [1]:
%pip install transformers

Note: you may need to restart the kernel to use updated packages.





In [11]:
import pandas as pd
from transformers import pipeline, AutoModelForSequenceClassification, AutoTokenizer, AutoConfig

## 1. Labelling and Finetuning functions

In [None]:
def label_data(model, tokenizer, df, round):
  '''
  Label the data with the provided model and save the labelled data
  to a CSV file. Additionally, save the 100 rows with the lowest
  RoBERTa confidence scores to a separate CSV file.

  Params:
  model - the model to be used for sentiment analysis
  tokenizer - the tokenizer to be used for sentiment analysis
  df - DataFrame with the "text" column to be labelled
  round - active learning round number (used in the output file names)

  Returns:
  df with the "roberta_label" and "roberta_score" columns added
  '''
  # Initialize the sentiment analysis pipeline
  sentiment_pipeline = pipeline("text-classification", 
                                model=model,
                                tokenizer=tokenizer,
                                device=0) 
  
  # Extract the text column of df as a list
  reviews = df["text"].tolist()

  # Predict the sentiment of each review
  print("Active Learning - Automated Labelling - Round ", round)
  print("Predicting sentiment labels of data...")

  kwargs = {'padding':True,'truncation':True,'max_length':512}
  results = sentiment_pipeline(reviews, **kwargs) 

  print("Sentiment labels predicted.")
  print("Saving labeled data to a csv files...")

  # Add the predicted sentiment label and confidence score to df
  label2id = {"positive": 1, "negative": -1, "neutral": 0}
  df["roberta_label"] = [label2id[res["label"]] for res in results]
  df["roberta_score"] = [res["score"] for res in results]

  # Save the labeled data to a csv file
  df.to_csv(f'../Data/Labelling/round{round}_roberta_labelled_all_data.csv', index=False)

  # Save 100 rows with the lowest RoBERTa confidence scores to a new CSV file
  df_low_confidence = df.nsmallest(100, 'roberta_score')
  df_low_confidence.to_csv(f'../Data/Labelling/round{round}_roberta_labelled_low_confidence.csv', index=False)
  
  print(f"Completed Round {round} - Automated Labeling")

  return df
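
`label_data` passes `device=0` to the pipeline, which assumes a CUDA GPU is available. The following is a minimal sketch of a fallback, assuming PyTorch is the backend. `pick_device` is a hypothetical helper and not part of the notebook; it relies on the pipeline's integer `device` argument treating `-1` as CPU.

```python
def pick_device():
    """Return 0 (first CUDA GPU) if one is usable, otherwise -1 (CPU)."""
    try:
        import torch  # imported lazily so the helper also works without PyTorch
    except ImportError:
        return -1
    return 0 if torch.cuda.is_available() else -1


device = pick_device()
print("Pipeline device:", "cuda:0" if device == 0 else "cpu")
```

Under these assumptions, `device=pick_device()` could replace `device=0` in the `pipeline(...)` call.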

In [None]:
def finetune(model, train_data):
  pass
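
`finetune` is left as a stub. One step a fine-tuning routine would likely need is converting the sentiment codes stored by `label_data` (-1, 0, 1) into the zero-based class indices that a sequence-classification head is trained on. The sketch below derives that mapping from an `id2label` dictionary such as `config.id2label`. The helper name and the example dictionary are assumptions for illustration; the real mapping should be read from the loaded configuration.

```python
def sentiment_to_class_id(id2label, sentiment_codes=None):
    """Map sentiment codes (-1, 0, 1) to the model's class indices."""
    if sentiment_codes is None:
        # Same label-to-code convention as in label_data
        sentiment_codes = {"positive": 1, "negative": -1, "neutral": 0}
    return {sentiment_codes[name]: int(idx) for idx, name in id2label.items()}


# Hypothetical id2label for illustration; in practice use config.id2label.
example_id2label = {0: "negative", 1: "neutral", 2: "positive"}
mapping = sentiment_to_class_id(example_id2label)

# Convert a few sentiment codes into training targets
train_labels = [mapping[code] for code in [1, -1, 0]]
print(train_labels)
```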

## 2. Active Learning Based Labelling

Active learning lets us focus manual labelling on the most informative parts of the dataset: the examples that confuse the model the most, which in this notebook are the rows with the lowest confidence scores.
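
The selection step can be sketched in isolation as least-confidence sampling: keep the rows whose top-class score is smallest. The helper name and the toy data below are made up for illustration; only the `roberta_score` column name and the use of `nsmallest` come from `label_data`.

```python
import pandas as pd


def select_for_manual_labelling(df, n=100, score_col="roberta_score"):
    """Return the n rows with the lowest confidence score, lowest first."""
    return df.nsmallest(n, score_col)


# Toy data standing in for the labelled output; the values are invented.
toy = pd.DataFrame({
    "text": ["a", "b", "c", "d"],
    "roberta_label": [1, 0, -1, 1],
    "roberta_score": [0.95, 0.41, 0.58, 0.88],
})

to_review = select_for_manual_labelling(toy, n=2)
print(to_review["text"].tolist())  # ['b', 'c']
```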

### Round 1 - Use pretrained sentiment analysis Transformer model for automated labelling

In [None]:
# Load the data
selected_data = pd.read_csv('../Data/selected_data.csv')

In [14]:
# Load the pretrained model, tokenizer, and configuration from Hugging Face
pretrained_model_name = "cardiffnlp/twitter-roberta-base-sentiment-latest"
model = AutoModelForSequenceClassification.from_pretrained(pretrained_model_name)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name)
config = AutoConfig.from_pretrained(pretrained_model_name)

Some weights of the model checkpoint at cardiffnlp/twitter-roberta-base-sentiment-latest were not used when initializing RobertaForSequenceClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


In [None]:
label_data(model = model, 
           df = selected_data, 
           tokenizer = tokenizer,
           round = 1)

Device set to use cuda:0


Active Learning - Automated Labelling - Round  1
Predicting sentiment labels of data...
Sentiment labels predicted.
Saving labeled data to a csv files...
Completed Round 1 - Automated Labeling


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["roberta_label"] = [label2id[res["label"]] for res in results]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["roberta_score"] = [res["score"] for res in results]
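
The messages above are pandas' `SettingWithCopyWarning`, raised when new columns are assigned on a DataFrame that pandas considers a slice of another one. One common way to avoid it is to work on an explicit copy. The sketch below assumes that returning a new DataFrame instead of modifying the input is acceptable; `attach_predictions` and the fake `results` list are hypothetical and stand in for the pipeline output.

```python
import pandas as pd

LABEL2ID = {"positive": 1, "negative": -1, "neutral": 0}


def attach_predictions(df, results):
    """Return a copy of df with label and score columns added."""
    out = df.copy()  # explicit copy, so the assignments do not target a slice
    out["roberta_label"] = [LABEL2ID[res["label"]] for res in results]
    out["roberta_score"] = [res["score"] for res in results]
    return out


full = pd.DataFrame({"text": ["good", "bad", "meh"], "keep": [True, True, False]})
subset = full[full["keep"]]  # a filtered slice of another DataFrame

# Fake pipeline output in the same shape as the real results
fake_results = [
    {"label": "positive", "score": 0.9},
    {"label": "negative", "score": 0.8},
]
labelled = attach_predictions(subset, fake_results)
print(labelled[["text", "roberta_label", "roberta_score"]])
```

The input frames are left untouched, so the caller has to use the returned DataFrame.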


Unnamed: 0,post_id,subreddit,post_title,post_body,number_of_comments,readable_datetime,post_author,number_of_upvotes,query,text,comment_id,comment_body,comment_author,cosine_similarity,roberta_label,roberta_score
0,1d31lxf,technology,Former OpenAI board member explains why they f...,,97,2024-05-29 06:31:18,Maxie445,84,OpenAI,Good luck to the consumers/customers who are t...,l64i9ts,Good luck to the consumers/customers who are t...,imaketrollfaces,0.717946,1,0.931254
1,1dn7dwq,OpenAI,I’m sick of waiting for chatGPT 4o Voice and I...,I’ve been religiously checking for the voice u...,368,2024-06-24 11:02:41,surfer808,45,ChatGPT vs Claude,OpenAI did a great job of showing the public t...,la0rsb1,OpenAI did a great job of showing the public t...,q_freak,0.710471,1,0.950637
2,1hiru1c,ChatGPT,OpenAI's new model is equivalent to the 175th ...,,114,2024-12-20 23:38:56,MetaKnowing,236,o3,OpenAI's new model is equivalent to the 175th ...,,,,0.708699,1,0.947382
