# Sentiment analysis

**Author**: Andrea Cass

## 1. About this notebook

This Google Colab notebook runs a sentiment analysis with XLM-T (twitter-XLM-roBERTa-base-sentiment), a cross-lingual language model, on all tweets pre-processed in the notebook titled 02_Pre-processing_merged:
> *02_Pre-processed_merged.csv*

The model was developed and described by Barbieri et al. (2022). Code for employing the model can be accessed here: https://huggingface.co/cardiffnlp/twitter-xlm-roberta-base-sentiment?text=%F0%9F%A4%97

**NOTE**: The model output will be a set of 3 probability scores for each tweet. 
1. The first represents the probability that the tweet carries **negative** sentiment.
2. The second represents the probability that the tweet carries **neutral** sentiment.
3. The third represents the probability that the tweet carries **positive** sentiment.
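
To illustrate how the three scores can be read, the sketch below picks the label with the highest probability. The helper name `top_label` is hypothetical and not part of the original notebook; the label order is the one listed above, and the example scores have the same form as the model output.

```python
# Label order as described above: index 0 = negative, 1 = neutral, 2 = positive
LABELS = ["negative", "neutral", "positive"]

def top_label(scores):
    """Return the most probable label and its probability (hypothetical helper)."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return LABELS[best], scores[best]

# Three scores of the kind the model returns for one tweet
example = [0.63110495, 0.18820627, 0.18068886]
label, prob = top_label(example)
print(label, round(prob, 2))  # negative 0.63
```

Keeping all three scores, rather than only the top label, preserves information about how confident the model was.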

Goals:
* Preprocess text to replace all usernames with "@user" and all links with "http"
* Run model to predict probability scores

After scores are predicted, the dataset will be saved as a csv titled,
> *03_Sentiment-analysis_merged.csv*

**NOTE**: This notebook was run on Google Colab due to issues using the transformers library in Jupyter Notebook. I recommend that you also run it on Google Colab. If you run it in Jupyter Notebook or on another platform, several pieces of code may need to be altered.


## 2. Imports

In [None]:
!pip install transformers

In [None]:
!pip install datasets evaluate 

In [None]:
!pip install sentencepiece

In [None]:
from transformers import pipeline
import torch
import numpy as np
import matplotlib.pyplot as plt
from transformers import AutoConfig, AutoModelForMaskedLM, AutoModelForSequenceClassification
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from collections import defaultdict
from textwrap import wrap
from torch import nn, optim
from torch.utils.data import Dataset, DataLoader
import torch.nn.functional as F
from scipy.special import softmax
from google.colab import files
import io
import sentencepiece

## 3. Loading the data
Running the code below will prompt you to choose a file from your local drive. The file you need is the output of the previous notebook:

> 02_Pre-processed_merged.csv

The file should be saved in your folder called "CASS_thesis". Select and upload the file. Then, continue running the code as usual.

In [None]:
uploaded = files.upload()

Saving 02_Pre-processed_merged.csv to 02_Pre-processed_merged (2).csv


In [None]:
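# Importing the dataset as a dataframe.
# Colab may rename a re-uploaded file (for example "02_Pre-processed_merged (2).csv"),
# so the first uploaded file is read instead of a hard-coded name.

filename = next(iter(uploaded))
df = pd.read_csv(io.BytesIO(uploaded[filename]))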



## 4. Pre-processing text
The code below to pre-process tweets was provided by the creators of XLM-T. For more details, please refer to the link provided in the first section of this Notebook.

In [None]:
# Defining function to preprocess text (usernames become '@user', links become 'http')

def preprocess(text):
    new_text = []
    for t in text.split(" "):
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)

In [None]:
# Converting text to string

df['text'] = df['text'].astype(str)

In [None]:
# Applying function to preprocess text and create column for the new text

df['new text'] = df['text'].apply(preprocess)
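
As a quick check of what the function does, the sketch below repeats `preprocess` and applies it to a single made-up tweet. The sample text is an assumption for illustration only and does not come from the dataset.

```python
def preprocess(text):
    new_text = []
    for t in text.split(" "):
        # Any token starting with '@' (and longer than one character) becomes '@user'
        t = '@user' if t.startswith('@') and len(t) > 1 else t
        # Any token starting with 'http' becomes the placeholder 'http'
        t = 'http' if t.startswith('http') else t
        new_text.append(t)
    return " ".join(new_text)

# Made-up example tweet
sample = "@FrauWeh read this https://example.com/article now @"
print(preprocess(sample))  # @user read this http now @
```

Note that the whole token is replaced, so punctuation attached to a username or link (for example `@FrauWeh,`) is dropped along with it, and a lone `@` is left unchanged.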

### 4.1. Viewing the dataframe

In [None]:
df

Unnamed: 0.1,Unnamed: 0,lang,created_at,author_id,in_reply_to_user_id,text,geo.place_id,entities.hashtags,public_metrics.retweet_count,public_metrics.reply_count,...,new_created_at,week,month,year,year-week,year-month,Language,date,inflow,new text
0,0,de,2016-04-20 23:04:40+00:00,14526045,41482148,"@FrauWeh Film gesehen und nur gestaunt. Wir, a...",e11a8b8e3771f9fa,"[{'start': 126, 'end': 131, 'tag': 'OMFG'}]",0,1.0,...,2016-04-20 23:04:40,16.0,4.0,2016.0,2016-16,2016-04,German,,Syrians,"@user Film gesehen und nur gestaunt. Wir, aus ..."
1,1,de,2016-04-20 22:55:08+00:00,4122038069,,"Syrisch-orthodoxer Bischof: ""Was im Nahen Oste...",06d9a7c249c59bcd,,0,0.0,...,2016-04-20 22:55:08,16.0,4.0,2016.0,2016-16,2016-04,German,,Syrians,"Syrisch-orthodoxer Bischof: ""Was im Nahen Oste..."
2,2,de,2016-04-20 21:27:37+00:00,1179543852,,"Habe schon lang nicht gehört, daß Flüchtling G...",e99b714fe65be4fb,,0,0.0,...,2016-04-20 21:27:37,16.0,4.0,2016.0,2016-16,2016-04,German,,Syrians,"Habe schon lang nicht gehört, daß Flüchtling G..."
3,3,de,2016-04-20 21:18:58+00:00,224607633,,"""Es kommen kaum noch Flüchtlinge nach Griechen...",3078869807f9dd36,,0,0.0,...,2016-04-20 21:18:58,16.0,4.0,2016.0,2016-16,2016-04,German,,Syrians,"""Es kommen kaum noch Flüchtlinge nach Griechen..."
4,4,de,2016-04-20 20:56:48+00:00,3022904603,,"Verständlich, aber #Frankreich muss eigene Feh...",48504653e183c91c,"[{'start': 19, 'end': 30, 'tag': 'Frankreich'}...",0,0.0,...,2016-04-20 20:56:48,16.0,4.0,2016.0,2016-16,2016-04,German,,Syrians,"Verständlich, aber #Frankreich muss eigene Feh..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
68316,68316,en,2021-06-25 10:34:05+00:00,232958476,,"The ministry of immigration , runs the biggest...",37439688c6302728,,0.0,0.0,...,,25.0,6.0,2021.0,2021-25,2021-06,English,2021-06-25 00:00:00+00:00,Ukrainians,"The ministry of immigration , runs the biggest..."
68317,68317,en,2021-06-24 19:29:39+00:00,9474872,9474872.0,"@sudo_f @typo3 @felicity_brand Intellectually,...",8abc99434d4f5d28,,0.0,1.0,...,,25.0,6.0,2021.0,2021-25,2021-06,English,2021-06-24 00:00:00+00:00,Ukrainians,"@user @user @user Intellectually, it would be ..."
68318,68318,en,2021-06-24 18:33:38+00:00,980714168,2199678761.0,@Waringphilip Agree. Immigration has done me p...,c82d9e53ae03d753,,0.0,0.0,...,,25.0,6.0,2021.0,2021-25,2021-06,English,2021-06-24 00:00:00+00:00,Ukrainians,"@user Agree. Immigration has done me proud, too."
68319,68319,en,2021-06-24 11:16:36+00:00,185889479,10809412.0,@rakyll I would love to have automatic cross z...,5bcd72da50f0ee77,,0.0,0.0,...,,25.0,6.0,2021.0,2021-25,2021-06,English,2021-06-24 00:00:00+00:00,Ukrainians,@user I would love to have automatic cross zon...


## 5. Sentiment analysis with XLM-T
The code below to load and run the model was largely provided by the creators of XLM-T. For more details, please refer to the link provided in the first section of this Notebook.

### 5.1. Loading in the model

In [None]:
MODEL = "cardiffnlp/twitter-xlm-roberta-base-sentiment"

In [None]:
tokenizer = AutoTokenizer.from_pretrained(MODEL)

In [None]:
config = AutoConfig.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)


### 5.2. Running the model
**NOTE**: Running the model over the full dataset may take up to several hours. I recommend keeping your computer from going to sleep during this time.

In [None]:
# Defining a function to predict scores

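def predict(new_text):
  # Tokenize a single tweet and return PyTorch tensors
  encoded_input = tokenizer(new_text, return_tensors='pt', padding=True, truncation=True)
  # Inference only, so gradient tracking is switched off to save memory and time
  with torch.no_grad():
    output = model(**encoded_input)
  # Logits of the first (and only) tweet in the batch
  scores = output.logits[0].numpy()
  # Softmax converts the logits to probabilities: [negative, neutral, positive]
  scores = softmax(scores, axis=-1)
  return scores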
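
The softmax step in `predict` is what turns the raw model outputs (logits) into three probabilities that sum to 1. The sketch below shows the arithmetic in plain Python. The function name `softmax_plain` and the logit values are made up for illustration; the notebook itself uses `scipy.special.softmax`.

```python
import math

def softmax_plain(logits):
    """Plain-Python softmax (hypothetical stand-in for scipy.special.softmax)."""
    # Subtracting the maximum first avoids overflow and does not change the result
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up raw scores for negative, neutral, positive
probs = softmax_plain(logits)
print([round(p, 3) for p in probs])  # [0.659, 0.242, 0.099]
print(round(sum(probs), 6))          # 1.0
```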

In [None]:
# Applying predict function to find scores

df['scores'] = df['new text'].apply(predict)
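
The cell above scores one tweet per model call, which is part of why the run can be slow. A batched variant might shorten it. The following is only a sketch under assumptions: `chunks` and `predict_batched` are hypothetical helpers that are not part of the original notebook, the batch size of 32 is arbitrary, and the scores are returned as plain lists rather than NumPy arrays.

```python
def chunks(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def predict_batched(texts, tokenizer, model, batch_size=32):
    """Score texts in batches; returns one [negative, neutral, positive] list per text."""
    import torch  # imported here so that `chunks` can be used without torch installed

    all_scores = []
    for batch in chunks(texts, batch_size):
        # padding=True pads every tweet in the batch to the length of the longest one
        encoded = tokenizer(batch, return_tensors='pt', padding=True, truncation=True)
        with torch.no_grad():  # inference only, so no gradients are needed
            logits = model(**encoded).logits
        # Softmax over the last dimension gives three probabilities per tweet
        all_scores.extend(torch.softmax(logits, dim=-1).tolist())
    return all_scores

# Hypothetical usage with the objects defined in this notebook:
# df['scores'] = predict_batched(df['new text'].tolist(), tokenizer, model)
```

Scores from batched and single-tweet runs may differ slightly in the last decimal places, so it is worth comparing a few rows before relying on the batched version.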

### 5.3. Viewing the dataframe

In [None]:
df

Unnamed: 0.1,Unnamed: 0,lang,created_at,author_id,in_reply_to_user_id,text,geo.place_id,entities.hashtags,public_metrics.retweet_count,public_metrics.reply_count,...,week,month,year,year-week,year-month,Language,date,inflow,new text,scores
0,0,de,2016-04-20 23:04:40+00:00,14526045,41482148,"@FrauWeh Film gesehen und nur gestaunt. Wir, a...",e11a8b8e3771f9fa,"[{'start': 126, 'end': 131, 'tag': 'OMFG'}]",0,1.0,...,16.0,4.0,2016.0,2016-16,2016-04,German,,Syrians,"@user Film gesehen und nur gestaunt. Wir, aus ...","[0.63110495, 0.18820627, 0.18068886]"
1,1,de,2016-04-20 22:55:08+00:00,4122038069,,"Syrisch-orthodoxer Bischof: ""Was im Nahen Oste...",06d9a7c249c59bcd,,0,0.0,...,16.0,4.0,2016.0,2016-16,2016-04,German,,Syrians,"Syrisch-orthodoxer Bischof: ""Was im Nahen Oste...","[0.0679844, 0.88380396, 0.048211697]"
2,2,de,2016-04-20 21:27:37+00:00,1179543852,,"Habe schon lang nicht gehört, daß Flüchtling G...",e99b714fe65be4fb,,0,0.0,...,16.0,4.0,2016.0,2016-16,2016-04,German,,Syrians,"Habe schon lang nicht gehört, daß Flüchtling G...","[0.3660313, 0.5833739, 0.050594788]"
3,3,de,2016-04-20 21:18:58+00:00,224607633,,"""Es kommen kaum noch Flüchtlinge nach Griechen...",3078869807f9dd36,,0,0.0,...,16.0,4.0,2016.0,2016-16,2016-04,German,,Syrians,"""Es kommen kaum noch Flüchtlinge nach Griechen...","[0.30297568, 0.4430802, 0.25394407]"
4,4,de,2016-04-20 20:56:48+00:00,3022904603,,"Verständlich, aber #Frankreich muss eigene Feh...",48504653e183c91c,"[{'start': 19, 'end': 30, 'tag': 'Frankreich'}...",0,0.0,...,16.0,4.0,2016.0,2016-16,2016-04,German,,Syrians,"Verständlich, aber #Frankreich muss eigene Feh...","[0.78610265, 0.18985648, 0.024040796]"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
68316,68316,en,2021-06-25 10:34:05+00:00,232958476,,"The ministry of immigration , runs the biggest...",37439688c6302728,,0.0,0.0,...,25.0,6.0,2021.0,2021-25,2021-06,English,2021-06-25 00:00:00+00:00,Ukrainians,"The ministry of immigration , runs the biggest...","[0.75754243, 0.21522887, 0.027228728]"
68317,68317,en,2021-06-24 19:29:39+00:00,9474872,9474872.0,"@sudo_f @typo3 @felicity_brand Intellectually,...",8abc99434d4f5d28,,0.0,1.0,...,25.0,6.0,2021.0,2021-25,2021-06,English,2021-06-24 00:00:00+00:00,Ukrainians,"@user @user @user Intellectually, it would be ...","[0.053599443, 0.20743202, 0.7389685]"
68318,68318,en,2021-06-24 18:33:38+00:00,980714168,2199678761.0,@Waringphilip Agree. Immigration has done me p...,c82d9e53ae03d753,,0.0,0.0,...,25.0,6.0,2021.0,2021-25,2021-06,English,2021-06-24 00:00:00+00:00,Ukrainians,"@user Agree. Immigration has done me proud, too.","[0.018092642, 0.04827376, 0.9336336]"
68319,68319,en,2021-06-24 11:16:36+00:00,185889479,10809412.0,@rakyll I would love to have automatic cross z...,5bcd72da50f0ee77,,0.0,0.0,...,25.0,6.0,2021.0,2021-25,2021-06,English,2021-06-24 00:00:00+00:00,Ukrainians,@user I would love to have automatic cross zon...,"[0.68358207, 0.23378702, 0.08263103]"


## 6. Downloading dataframe
The code below will download the data as a csv file with the title:
> 03_Sentiment-analysis_merged.csv

Once downloaded, please ensure that you manually move this file to your folder called "CASS_thesis", as the code will not automatically do so.

In [None]:
df.to_csv('03_Sentiment-analysis_merged.csv') 
files.download('03_Sentiment-analysis_merged.csv')
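
Each entry in the `scores` column holds three values, so it is likely to be written to the csv as text rather than as numbers. A later notebook that reloads the file may therefore need to convert it back. The sketch below assumes the cell is reloaded as a bracketed string; `parse_scores` is a hypothetical helper, not part of the original code.

```python
def parse_scores(cell):
    """Turn text such as '[0.63 0.19 0.18]' or '[0.63, 0.19, 0.18]' into floats."""
    # Drop the surrounding brackets, treat commas as spaces, then split on whitespace
    return [float(x) for x in cell.strip().strip("[]").replace(",", " ").split()]

# Example of how one reloaded cell might look (assumed format)
print(parse_scores("[0.63110495 0.18820627 0.18068886]"))
# [0.63110495, 0.18820627, 0.18068886]
```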
