# Sentiment analysis

**Author**: Andrea Cass

## 1. About this notebook

The purpose of this Google Colab notebook is to run a sentiment analysis with a cross-lingual language model called XLM-T (twitter-XLM-roBERTa-base-sentiment) on all tweets pre-processed in the notebook titled 02_Pre-processing_limited_merged. The input file is:
> *02_Pre-processed_limited_merged.csv*

The model was developed and described by Barbieri et al. (2022). Code for employing the model can be accessed here: https://huggingface.co/cardiffnlp/twitter-xlm-roberta-base-sentiment?text=%F0%9F%A4%97

**NOTE**: The model output will be a set of 3 probability scores for each tweet. 
1. The first represents the probability that the tweet carries **negative** sentiment.
2. The second represents the probability that the tweet carries **neutral** sentiment.
3. The third represents the probability that the tweet carries **positive** sentiment.
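
As a minimal sketch of how such a triple can be read, the snippet below picks the label with the highest probability. The `LABELS` list and the `top_label` helper are illustrative names and are not part of the original notebook; the example scores are the values shown for the first tweet in section 5.3.

```python
# Label order follows the list above: negative, neutral, positive
LABELS = ["negative", "neutral", "positive"]

def top_label(scores):
    """Return the label with the highest probability, and that probability."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return LABELS[best], scores[best]

example_scores = [0.0679844, 0.88380396, 0.048211697]
label, probability = top_label(example_scores)
print(label, round(probability, 3))  # neutral 0.884
```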

Goals:
* Pre-process the text by replacing every username with "@user" and every link with "http"
* Run model to predict probability scores

After the scores are predicted, the dataset is saved as a CSV file titled:
> *03_Sentiment-analysis_limited_merged.csv*

**NOTE**: This notebook was run on Google Colab because of issues with the transformers library in Jupyter Notebook. I recommend running it on Google Colab. If you run it in Jupyter Notebook or on another platform, several pieces of code may need to be altered.


## 2. Imports

In [None]:
!pip install transformers

In [None]:
!pip install datasets evaluate 

In [None]:
!pip install sentencepiece

In [4]:
from transformers import pipeline
import torch
import numpy as np
import matplotlib.pyplot as plt
from transformers import AutoTokenizer, AutoConfig, AutoModelForSequenceClassification
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from collections import defaultdict
from textwrap import wrap
from torch import nn, optim
from torch.utils.data import Dataset, DataLoader
import torch.nn.functional as F
from scipy.special import softmax
from google.colab import files
import io
import sentencepiece

## 3. Loading the data
Running the code below will prompt you to choose a file from your local drive. The file needed is the output of the previous notebook:

> 02_Pre-processed_limited_merged.csv

The file should be saved in your folder called "CASS_thesis". Select and upload the file. Then, continue running the code as usual.

In [5]:
uploaded = files.upload()

Saving 02_Pre-processed_limited_merged.csv to 02_Pre-processed_limited_merged.csv


In [7]:
# importing the dataset as dataframe

df = pd.read_csv(io.BytesIO(uploaded['02_Pre-processed_limited_merged.csv']))

## 4. Pre-processing text
The code below to pre-process tweets was provided by the creators of XLM-T. For more details, please refer to the link provided in the first section of this Notebook.

In [8]:
# Defining function to preprocess text (usernames become '@user', links become 'http')

def preprocess(text):
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)
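
To illustrate what the function does, here is a small usage sketch on a made-up tweet (the sample text is hypothetical and not taken from the dataset). The function is repeated so that the snippet runs on its own. Note that links are replaced by the placeholder `http` rather than dropped, and that a mention followed directly by punctuation is replaced together with that punctuation.

```python
def preprocess(text):
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)

# Hypothetical example tweet
sample = "@jane_doe thanks for sharing https://example.org/article with @bob!"
print(preprocess(sample))
# @user thanks for sharing http with @user
```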

In [9]:
# Converting text to string

df['text'] = df['text'].astype(str)

In [10]:
# Applying function to preprocess text and create column for the new text

df['new text'] = df['text'].apply(preprocess)

### 4.1. Viewing the dataframe

In [11]:
df

Unnamed: 0.1,Unnamed: 0,text,author_id,created_at,lang,geo.place_id,public_metrics.retweet_count,public_metrics.reply_count,public_metrics.like_count,public_metrics.quote_count,...,entities.hashtags,week,month,year,year-week,year-month,Language,date,inflow,new text
0,0,"Syrisch-orthodoxer Bischof: ""Was im Nahen Oste...",4.122038e+09,2016-04-20 22:55:08+00:00,de,06d9a7c249c59bcd,0.0,0.0,0.0,0.0,...,,16.0,4.0,2016.0,2016-16,2016-04,German,2016-04-20 00:00:00+00:00,Syrians,"Syrisch-orthodoxer Bischof: ""Was im Nahen Oste..."
1,1,"Habe schon lang nicht gehört, daß Flüchtling G...",1.179544e+09,2016-04-20 21:27:37+00:00,de,e99b714fe65be4fb,0.0,0.0,0.0,0.0,...,,16.0,4.0,2016.0,2016-16,2016-04,German,2016-04-20 00:00:00+00:00,Syrians,"Habe schon lang nicht gehört, daß Flüchtling G..."
2,2,"""Es kommen kaum noch Flüchtlinge nach Griechen...",2.246076e+08,2016-04-20 21:18:58+00:00,de,3078869807f9dd36,0.0,0.0,0.0,0.0,...,,16.0,4.0,2016.0,2016-16,2016-04,German,2016-04-20 00:00:00+00:00,Syrians,"""Es kommen kaum noch Flüchtlinge nach Griechen..."
3,3,Unsere 1. Kochshow für #Flüchtlinge. Super spi...,2.480764e+09,2016-04-20 18:25:11+00:00,de,8abc99434d4f5d28,0.0,0.0,4.0,0.0,...,"[{'start': 23, 'end': 35, 'tag': 'Flüchtlinge'}]",16.0,4.0,2016.0,2016-16,2016-04,German,2016-04-20 00:00:00+00:00,Syrians,Unsere 1. Kochshow für #Flüchtlinge. Super spi...
4,4,500 tote #Flüchtlinge im #Mittelmeer – Tragödi...,6.062653e+08,2016-04-20 16:27:28+00:00,de,e11a8b8e3771f9fa,0.0,1.0,0.0,0.0,...,"[{'start': 9, 'end': 21, 'tag': 'Flüchtlinge'}...",16.0,4.0,2016.0,2016-16,2016-04,German,2016-04-20 00:00:00+00:00,Syrians,500 tote #Flüchtlinge im #Mittelmeer – Tragödi...
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
66260,66260,"For day 1 of week 2, @AnnaMariaKonsta discusse...",1.104025e+08,2021-06-28 14:43:31+00:00,en,fcbb3c6e0a7eba22,0.0,1.0,2.0,0.0,...,"[{'start': 95, 'end': 108, 'tag': 'SocialRight...",26.0,6.0,2021.0,2021-26,2021-06,English,2021-06-28 00:00:00+00:00,Ukrainians,"For day 1 of week 2, @user discusses the Socia..."
66261,66261,"@ariadneconill Europe is racist, but in a diff...",2.521809e+09,2021-06-27 13:03:29+00:00,en,e385d4d639c6a423,0.0,1.0,6.0,0.0,...,,25.0,6.0,2021.0,2021-26,2021-06,English,2021-06-27 00:00:00+00:00,Ukrainians,"@user Europe is racist, but in a different way..."
66262,66262,"A labour of love, inspired by Middle-earth.\n\...",5.633818e+08,2021-06-27 08:37:21+00:00,en,257640324f249a73,0.0,1.0,17.0,0.0,...,,25.0,6.0,2021.0,2021-26,2021-06,English,2021-06-27 00:00:00+00:00,Ukrainians,"A labour of love, inspired by Middle-earth.\n\..."
66263,66263,@simongerman600 I must have missed the great f...,2.591892e+09,2021-06-26 08:03:22+00:00,en,000b71538f35fe46,0.0,0.0,1.0,0.0,...,,25.0,6.0,2021.0,2021-25,2021-06,English,2021-06-26 00:00:00+00:00,Ukrainians,@user I must have missed the great flight of t...


## 5. Sentiment analysis with XLM-T
The code below to load and run the model was largely provided by the creators of XLM-T. For more details, please refer to the link provided in the first section of this Notebook.

### 5.1. Loading in the model

In [12]:
MODEL = "cardiffnlp/twitter-xlm-roberta-base-sentiment"

In [None]:
tokenizer = AutoTokenizer.from_pretrained(MODEL)

In [None]:
config = AutoConfig.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

### 5.2. Running the model
**NOTE**: Running the model may take several hours. I recommend preventing your computer from going to sleep during this time.
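
If the run time is a concern, one option (not used in this notebook) might be to score tweets in batches instead of one at a time, since the tokenizer accepts a list of strings. The sketch below only shows the chunking step; the `batches` helper and the batch size of 32 are illustrative choices.

```python
def batches(items, size=32):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

tweets = [f"tweet {i}" for i in range(70)]  # placeholder texts
print([len(chunk) for chunk in batches(tweets)])  # [32, 32, 6]
```

Each chunk could then be passed to the tokenizer as a list, with the model returning one row of scores per tweet. Whether this is faster in practice depends on the hardware available in the Colab session.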

In [15]:
# Defining a function to predict scores

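def predict(new_text):
    # Tokenize a single tweet into PyTorch tensors
    encoded_input = tokenizer(new_text, return_tensors='pt', padding=True, truncation=True)
    # Inference only, so gradient tracking is switched off
    with torch.no_grad():
        output = model(**encoded_input)
    # output[0] holds the logits; [0] selects the single tweet in the batch
    scores = output[0][0].detach().numpy()
    # Convert the logits to probabilities (negative, neutral, positive)
    scores = softmax(scores, axis=-1)
    return scores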
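
The `softmax` step is what turns the model's raw output (logits) into probabilities that sum to 1. A minimal pure-Python sketch of the same calculation, using made-up logits, is shown below; the notebook itself relies on `scipy.special.softmax`, and `softmax_list` is only an illustrative name.

```python
import math

def softmax_list(logits):
    """Exponentiate each logit and normalise so the results sum to 1."""
    highest = max(logits)
    # Subtracting the maximum does not change the result but avoids overflow
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [-1.2, 0.3, 2.5]  # hypothetical values for negative, neutral, positive
probs = softmax_list(logits)
print([round(p, 3) for p in probs])
print(round(sum(probs), 6))  # 1.0
```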

In [16]:
# Applying predict function to find scores

df['scores'] = df['new text'].apply(predict)

Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
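
The warning above indicates that the tokenizer has no predefined maximum length, so `truncation=True` on its own does not shorten anything. One possible remedy is to pass `max_length` explicitly. The sketch below assumes a limit of 512 tokens, which should be checked against the model configuration before use; `encode_with_limit` is a hypothetical helper name.

```python
def encode_with_limit(tokenizer, text, max_length=512):
    """Tokenize `text`, truncating to `max_length` tokens (assumed limit)."""
    return tokenizer(
        text,
        return_tensors='pt',
        padding=True,
        truncation=True,
        max_length=max_length,  # explicit limit so that truncation takes effect
    )
```

In `predict`, the call to `tokenizer(new_text, ...)` could then be replaced with `encode_with_limit(tokenizer, new_text)`.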


### 5.3. Viewing the dataframe

In [17]:
df

Unnamed: 0.1,Unnamed: 0,text,author_id,created_at,lang,geo.place_id,public_metrics.retweet_count,public_metrics.reply_count,public_metrics.like_count,public_metrics.quote_count,...,week,month,year,year-week,year-month,Language,date,inflow,new text,scores
0,0,"Syrisch-orthodoxer Bischof: ""Was im Nahen Oste...",4.122038e+09,2016-04-20 22:55:08+00:00,de,06d9a7c249c59bcd,0.0,0.0,0.0,0.0,...,16.0,4.0,2016.0,2016-16,2016-04,German,2016-04-20 00:00:00+00:00,Syrians,"Syrisch-orthodoxer Bischof: ""Was im Nahen Oste...","[0.0679844, 0.88380396, 0.048211697]"
1,1,"Habe schon lang nicht gehört, daß Flüchtling G...",1.179544e+09,2016-04-20 21:27:37+00:00,de,e99b714fe65be4fb,0.0,0.0,0.0,0.0,...,16.0,4.0,2016.0,2016-16,2016-04,German,2016-04-20 00:00:00+00:00,Syrians,"Habe schon lang nicht gehört, daß Flüchtling G...","[0.3660313, 0.5833739, 0.050594788]"
2,2,"""Es kommen kaum noch Flüchtlinge nach Griechen...",2.246076e+08,2016-04-20 21:18:58+00:00,de,3078869807f9dd36,0.0,0.0,0.0,0.0,...,16.0,4.0,2016.0,2016-16,2016-04,German,2016-04-20 00:00:00+00:00,Syrians,"""Es kommen kaum noch Flüchtlinge nach Griechen...","[0.30297568, 0.4430802, 0.25394407]"
3,3,Unsere 1. Kochshow für #Flüchtlinge. Super spi...,2.480764e+09,2016-04-20 18:25:11+00:00,de,8abc99434d4f5d28,0.0,0.0,4.0,0.0,...,16.0,4.0,2016.0,2016-16,2016-04,German,2016-04-20 00:00:00+00:00,Syrians,Unsere 1. Kochshow für #Flüchtlinge. Super spi...,"[0.022602528, 0.076046735, 0.9013506]"
4,4,500 tote #Flüchtlinge im #Mittelmeer – Tragödi...,6.062653e+08,2016-04-20 16:27:28+00:00,de,e11a8b8e3771f9fa,0.0,1.0,0.0,0.0,...,16.0,4.0,2016.0,2016-16,2016-04,German,2016-04-20 00:00:00+00:00,Syrians,500 tote #Flüchtlinge im #Mittelmeer – Tragödi...,"[0.6982864, 0.27545568, 0.026257832]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
66260,66260,"For day 1 of week 2, @AnnaMariaKonsta discusse...",1.104025e+08,2021-06-28 14:43:31+00:00,en,fcbb3c6e0a7eba22,0.0,1.0,2.0,0.0,...,26.0,6.0,2021.0,2021-26,2021-06,English,2021-06-28 00:00:00+00:00,Ukrainians,"For day 1 of week 2, @user discusses the Socia...","[0.035507657, 0.82208896, 0.14240335]"
66261,66261,"@ariadneconill Europe is racist, but in a diff...",2.521809e+09,2021-06-27 13:03:29+00:00,en,e385d4d639c6a423,0.0,1.0,6.0,0.0,...,25.0,6.0,2021.0,2021-26,2021-06,English,2021-06-27 00:00:00+00:00,Ukrainians,"@user Europe is racist, but in a different way...","[0.8931986, 0.09380516, 0.012996121]"
66262,66262,"A labour of love, inspired by Middle-earth.\n\...",5.633818e+08,2021-06-27 08:37:21+00:00,en,257640324f249a73,0.0,1.0,17.0,0.0,...,25.0,6.0,2021.0,2021-26,2021-06,English,2021-06-27 00:00:00+00:00,Ukrainians,"A labour of love, inspired by Middle-earth.\n\...","[0.15655968, 0.5201203, 0.32332003]"
66263,66263,@simongerman600 I must have missed the great f...,2.591892e+09,2021-06-26 08:03:22+00:00,en,000b71538f35fe46,0.0,0.0,1.0,0.0,...,25.0,6.0,2021.0,2021-25,2021-06,English,2021-06-26 00:00:00+00:00,Ukrainians,@user I must have missed the great flight of t...,"[0.91320664, 0.06375945, 0.023033775]"


## 6. Downloading dataframe
The code below will download the data as a csv file with the title:
> 03_Sentiment-analysis_limited_merged.csv

Once downloaded, please ensure that you manually move this file to your folder called "CASS_thesis", as the code will not automatically do so.

In [18]:
df.to_csv('03_Sentiment-analysis_limited_merged.csv') 
files.download('03_Sentiment-analysis_limited_merged.csv')