<a href="https://colab.research.google.com/github/cheatham1/EU-JAV-ItalianTweetStance/blob/main/Upload_a_saved_model_finetuned_BERT_model_for_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Load a finetuned model and run it

## EU-JAV: Understanding the vaccine stance of Italian tweets

We have finetuned a Transformer-based machine learning model for analysing the vaccine stance of Italian tweets. 

Two datasets were collected, and the tweets in each were labelled for stance:
* dataset A: tweets from November 2019 to June 2020
* dataset B: tweets from April to September 2021

An XLM-RoBERTa-large model was finetuned on the training sets of dataset A and dataset B.

Here we show how to load the finetuned model and test data, then run the model and plot the resulting stance classifications.


In [None]:
!pip install transformers
!pip install sentencepiece

In [None]:
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
import torch

from transformers import XLMRobertaTokenizer, XLMRobertaForSequenceClassification
from transformers import AutoConfig, AutoTokenizer, AutoModelForSequenceClassification
from transformers import pipeline

from tqdm.auto import tqdm

# Load the model

In [None]:
model_name = "FrGes/xlm-roberta-large-finetuned-EUJAV-datasetAB"

config = AutoConfig.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name, config=config)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

classifier = pipeline( task="text-classification", model=model, tokenizer=tokenizer, config=config)
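
For a model configured for single-label classification, the text-classification pipeline converts the model's raw logits into one probability per class with a softmax, then reports the most probable class as `LABEL_<n>` together with its score. The sketch below reproduces that step in plain Python. The logits are made-up illustrative values, not output from the finetuned model.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one tweet over the three stance classes
example_logits = [2.0, 0.5, -1.0]

probs = softmax(example_logits)

# Pick the class with the highest probability, as the pipeline does by default
best = max(range(len(probs)), key=probs.__getitem__)
result = {"label": f"LABEL_{best}", "score": probs[best]}
print(result)
```

The dictionary has the same shape as one element of the list returned by `classifier(...)`: a `label` string and a `score` between 0 and 1.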


# Load the Dataset
The test sets for datasets A and B are loaded directly from GitHub. Each is a comma-separated file containing the three annotators' labels, the stance label used for evaluation, and the tweet text.

In [None]:
test_data_A = pd.read_csv(
    "https://raw.githubusercontent.com/FrGes/EU-JAV/main/datasetA_test_3categories.csv",
    names=["Annotator1","Annotator2","Annotator3","label", "text","index"]
)

test_data_B = pd.read_csv(
    "https://raw.githubusercontent.com/FrGes/EU-JAV/main/datasetB_test_3categories.csv",
    names=["Annotator1","Annotator2","Annotator3","label", "text","index"]
)

# DataFrame.append was removed in pandas 2.0; pd.concat is the supported replacement
test_data = pd.concat([test_data_A, test_data_B])

print("Total test dataset: ",test_data.shape[0],": datasetA: ", test_data_A.shape[0], " datasetB:", test_data_B.shape[0])

In [None]:
test_data.head()

# Run and print Evaluation Report
The code below runs the finetuned model on the combined test sets of datasets A and B.
An evaluation report is then printed using tools from sklearn.

In [None]:
X_test = list(test_data["text"])
y_test = list(test_data["label"])

In [None]:
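# Run the classifier on all test tweets and extract the integer class id
# from each "LABEL_<n>" string in the output

y_pred = []

# Note: classifier(X_test) completes inference before tqdm starts iterating,
# so the progress bar only tracks the label parsing below
for out in tqdm(classifier(X_test)):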
  label = out["label"]
  int_label = label.replace("LABEL_", "")
  y_pred.append(int(int_label))
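
The label parsing can also be written as a small helper that fails loudly if the model ever returns a label in a different format. This is a minimal sketch: `parse_label` and `STANCE_NAMES` are hypothetical names introduced here, the class names are taken from the tick labels used in this notebook's bar chart, and the sample outputs are invented for illustration.

```python
# Class ids mapped to the stance names used in this notebook's bar chart
STANCE_NAMES = {0: "Promotional", 1: "Neutral", 2: "Discouraging"}

def parse_label(label: str) -> int:
    """Turn a default Hugging Face label such as 'LABEL_2' into the integer 2."""
    prefix = "LABEL_"
    if not label.startswith(prefix):
        raise ValueError(f"unexpected label format: {label!r}")
    return int(label[len(prefix):])

# Hypothetical pipeline outputs (illustrative values only)
sample_outputs = [
    {"label": "LABEL_0", "score": 0.91},
    {"label": "LABEL_2", "score": 0.77},
    {"label": "LABEL_1", "score": 0.64},
]

sample_pred = [parse_label(out["label"]) for out in sample_outputs]
sample_names = [STANCE_NAMES[i] for i in sample_pred]
print(sample_pred)
print(sample_names)
```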

# Plot results

In [None]:
# Import plotting libraries and the sklearn metrics used below

import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.metrics import accuracy_score, classification_report, f1_score


In [None]:
df = pd.DataFrame(list(y_pred), columns = ["label"] )
counts = df['label'].value_counts()

In [None]:
counts

In [None]:
# Plot a bar chart of the tweet stances predicted by the model

plt.figure(figsize=(10,5))
sns.barplot(x=counts.index, y=counts.values, palette = "Blues")
plt.title('Stance of Italian Tweets test datasets A and B')
plt.ylabel('counts')
plt.xticks([0, 1, 2], ['Promotional', 'Neutral', 'Discouraging'])

plt.show()
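
The tick labels in the plot assume that all three classes occur among the predictions and that the bars appear in class order 0, 1, 2. A count that makes both assumptions explicit could look like the sketch below. It uses `collections.Counter` from the standard library as a stand-in for `value_counts()`, and the predictions are made up for illustration.

```python
from collections import Counter

# Class ids mapped to the stance names used for the tick labels
STANCE_NAMES = {0: "Promotional", 1: "Neutral", 2: "Discouraging"}

# Hypothetical predictions in which class 1 never occurs
example_pred = [0, 2, 0, 0, 2]

raw_counts = Counter(example_pred)

# Report every class in id order, with 0 for classes that were never predicted
full_counts = {STANCE_NAMES[i]: raw_counts.get(i, 0) for i in sorted(STANCE_NAMES)}
print(full_counts)
```

Counting this way keeps each bar aligned with its stance name even when a class is absent from the predictions.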

In [None]:
# Accuracy and macro-averaged F1-score of the model, computed from the model predictions and the annotators' labels
acc = accuracy_score(y_test, y_pred)
f1  = f1_score(y_test, y_pred, average='macro')

print("Accuracy: ", acc, " F1-score: ",f1)

In [None]:
# Classification report showing performance of the finetuned model
print(classification_report(y_test, y_pred, digits=3))
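
To make the macro-averaged F1-score concrete, the sketch below computes it by hand: an F1-score is calculated for each class separately, and the unweighted mean of those scores is taken, so every class counts equally regardless of how many tweets it contains. The helper names and the toy labels are hypothetical and serve only as illustration.

```python
def per_class_f1(y_true, y_pred, cls):
    """F1 for one class, treating it as 'positive' and all other classes as 'negative'."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives, so F1 is taken as 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(per_class_f1(y_true, y_pred, c) for c in classes) / len(classes)

# Hypothetical annotator labels and model predictions
toy_true = [0, 0, 1, 1, 2, 2]
toy_pred = [0, 1, 1, 1, 2, 0]

toy_accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)
print("Accuracy:", toy_accuracy)
print("Per-class F1:", [per_class_f1(toy_true, toy_pred, c) for c in (0, 1, 2)])
print("Macro F1:", macro_f1(toy_true, toy_pred))
```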


In [None]:
# The annotator label and model prediction shown alongside the text
test_data.insert(loc = 4, column = "prediction", value = y_pred)
test_data.head()