# War in Ukraine: Modeling Twitter Data

## Background

On February 24, 2022, Russia invaded Ukraine after months of military buildup along its borders. Putin insists on calling it a "special military operation" and punishes anyone who calls it a "war" or defies the Russian state's fabricated narrative, which claims that the Russian military is saving ethnic Russians and Ukrainians from "Nazi" officials in Ukraine.

Many observers expected Russian forces to bulldoze over Ukrainian forces in a "Blitzkrieg"-style operation. Contrary to those expectations, and especially Putin's, Ukrainian forces, with significant backing from the West, have put up tough resistance, even pushing Russian forces out of some major cities after weeks of fighting, as of early May.

This war is having ripple effects across the world. In trade, the price of grain, a major export of both Ukraine and Russia, has gone up. The energy market has been shaken up since the war began and faces a very uncertain future as the West slowly weans itself off Russian oil and gas. In geopolitics, Russia is almost certainly going to be more isolated, which is pushing it to strengthen its alliance with China, which in turn is treading carefully so as not to violate the sanctions imposed on Russia. Countries that have not been part of NATO, like Finland and Sweden, aghast at the Russian aggression, are now more eager to join the alliance. As such, we are witnessing a fundamentally changing world due to Russia's invasion of Ukraine.

The war is going to have huge impacts around the world, perhaps even ending the globalized era as we know it. It is imperative that we capture the massive amounts of data being put out as a result of this war and extract insights for future generations.

## Goals

In an effort to make a record of this historical atrocity, I am initiating a project that monitors how the Twitterverse surrounding the war in Ukraine evolves. On a regular basis, this project aims to run natural language processing tasks (e.g., word clouds, sentiment analysis, topic modeling) to help the public keep track of which aspects of the war people are discussing on Twitter and how they feel about them, essentially tracking Twitter users' changing views on the war.

As a start, this project will **perform sentiment analysis on each month** since the beginning of the war separately. One reason for this approach is to avoid exceeding the memory cap on Kaggle notebooks. If a particular month has an inordinate amount of data, that month's data can be further divided into batches. In addition to EDA and sentiment analysis, the project will **perform topic modeling** to capture which issues are the hottest among people on Twitter.

Once several notebooks are out, say up to April data, **the project will try to automate the entire process** so that the analyses get updated on a regular basis. The goal is to have multiple dashboards each of which visualizes the Twitterverse in a specific time window (e.g., month) for the public to understand easily. The outputs may be displayed on a web application, though this is further down the line of development.



## February Data - Exploratory Data Analysis and Data Cleaning

In this notebook we will do data-cleaning/wrangling and exploratory data analysis of the February tweets about the war in Ukraine.

EDA:
- Wordcloud
- Tweets per day
- Top users by tweet frequency
- Tweet length distribution

Data Cleaning:
- Read in February tweets
- Check the data types
- Check for missing data
- Check for duplicates



Import necessary libraries.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import re
import json

import os
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

!pip install emoji --upgrade
import emoji

!pip install tweet-preprocessor
import preprocessor as p

# !pip install -U spacy
# !pip install texthero
# import texthero as hero

!pip install transformers
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from transformers import BertTokenizer, TFBertForSequenceClassification
from transformers import InputExample, InputFeatures

from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

## Exploratory Data Analysis (EDA)

This project is using datasets provided by [Bwandowando on Kaggle](https://www.kaggle.com/datasets/bwandowando/ukraine-russian-crisis-twitter-dataset-1-2-m-rows).

In [None]:
all_files = []
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        full_path=os.path.join(dirname, filename)
        all_files.append(full_path)

In [None]:
# sort the files
all_files.sort()

For this notebook, we will only look at February data. Seeing the filenames above and using regex, we can grab only the February files. Subsequent notebooks will cover the remaining months.

In [None]:
# fetch all February files - filenames containing "FEB" or "202202"
feb_files = [file for file in all_files if re.search(r"FEB", file) or re.search(r"202202", file)]
feb_files
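
The two `re.search` calls above can also be folded into a single compiled pattern that is applied to the file name only, so that a directory name cannot trigger a match by accident. The following is a minimal sketch of that idea; the paths in `example_files` are made up for illustration and are not the dataset's real file names.

```python
import os
import re

# Hypothetical paths, for illustration only; the real file names may differ.
example_files = [
    "/kaggle/input/some-dataset/tweets_FEB27.csv.gzip",
    "/kaggle/input/some-dataset/tweets_MAR01.csv.gzip",
    "/kaggle/input/some-dataset/20220228_tweets.csv.gzip",
    "/kaggle/input/some-dataset/20220401_tweets.csv.gzip",
]

# One alternation covers both naming schemes: "FEB" or a "202202" date prefix.
feb_pattern = re.compile(r"FEB|202202")

# Match against the base name so that directory names cannot trigger a match.
example_feb_files = [
    path for path in example_files
    if feb_pattern.search(os.path.basename(path))
]
print(example_feb_files)
```

For later months, only the pattern would need to change, assuming the files keep following one of these two naming schemes.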

Unzip the files and concatenate them into one pandas DataFrame.

In [None]:
tmp_df_list = []
for file in feb_files:
    print(f"Reading in {file}")
    # unzip and read in the csv file as a dataframe
    tmp_df = pd.read_csv(file, compression="gzip", header=0, index_col=0)
    # append dataframe to temp list
    tmp_df_list.append(tmp_df)

print("Concatenating the DataFrames")
# concatenate the dataframes in the temp list row-wise
feb_df= pd.concat(tmp_df_list, axis=0)
print("Concatenation complete!")
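
If a month ever turns out to be too large for the notebook's memory, one option is to read each file in chunks and keep only the rows and columns that are needed. The sketch below shows the idea on a tiny in-memory CSV; the column subset, the chunk size, and the rows themselves are assumptions made for illustration, not values taken from the dataset.

```python
import io
import pandas as pd

# Tiny in-memory CSV standing in for one of the daily files (made-up rows).
csv_text = (
    "idx,userid,text,language\n"
    "0,1,hello,en\n"
    "1,2,bonjour,fr\n"
    "2,3,world,en\n"
    "3,4,hola,es\n"
    "4,5,again,en\n"
)

kept_chunks = []
# chunksize makes read_csv yield DataFrames of at most 2 rows each;
# usecols drops the columns we do not need before they are loaded.
for chunk in pd.read_csv(io.StringIO(csv_text), index_col="idx",
                         usecols=["idx", "text", "language"], chunksize=2):
    # keep only the English rows of this chunk
    kept_chunks.append(chunk[chunk["language"] == "en"])

english_df = pd.concat(kept_chunks, axis=0)
print(english_df)
```

With real files, `io.StringIO(csv_text)` would be replaced by the file path together with `compression="gzip"`, as in the cell above.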

Check the first 5 rows.

In [None]:
# show the first 5 rows of the february dataframe
feb_df.head()

Check the shape.

In [None]:
# get shape of the DataFrame
print(f"{feb_df.shape[0]} rows and {feb_df.shape[1]} columns")

Check data types.

In [None]:
feb_df.info()

- `userid`: unique number given to each user
- `username`: user-defined name on Twitter (e.g., @johndoe)
- `acctdesc`: user-made account description
- `location`: user-defined location information (e.g., where they are based)
- `following`: number of users the author of the tweet is following
- `followers`: number of users following the author of the tweet
- `totaltweets`: total number of tweets by the author of the tweet
- `usercreatedts`: timestamp of when the user account was created
- `tweetid`: unique number given to each tweet
- `tweetcreatedts`: timestamp of when the tweet was made
- `retweetcount`: number of times the tweet was retweeted
- `text`: text/content of the tweet
- `hashtags`: hashtags in the tweet
- `language`: language code of the tweet
- `coordinates`: user-defined coordinates at the time of tweeting
- `favorite_count`: number of times the tweet was favorited
- `extractedts`: timestamp of when the tweet was extracted

Change the dtypes of `usercreatedts`, `tweetcreatedts`, and `extractedts` to `datetime64` for easier operation later.

In [None]:
feb_df["usercreatedts"] = pd.to_datetime(feb_df["usercreatedts"])
feb_df["tweetcreatedts"] = pd.to_datetime(feb_df["tweetcreatedts"])
feb_df["extractedts"] = pd.to_datetime(feb_df["extractedts"])

# check dtypes
feb_df.info()

When were the earliest and latest tweets in this dataset created?

In [None]:
earliest_tweet = feb_df["tweetcreatedts"].min()
latest_tweet = feb_df["tweetcreatedts"].max()

print(f"The earliest tweet was at {earliest_tweet}, and the latest was at {latest_tweet}")

Considering that the war began in the early hours of February 24th, the earliest tweet in this dataset came only a few hours later. The latest tweet in this dataset came about half an hour before the end of February.

Visualize tweet frequency by date.

In [None]:
# get dates in the dataframe 
dates = feb_df["tweetcreatedts"].dt.day
# group tweet timestamps by date and get tweet count for each date
tweetcount_by_date = feb_df["tweetcreatedts"].groupby(dates).size()

# plot bar graph of tweet count by date
tweetcount_by_date.plot.bar();

plt.title("February Tweet Count by Date")
plt.xlabel("Tweet Date")
plt.ylabel("Tweet Count")
plt.xticks(rotation=0)
plt.show()

In the month of February, the number of tweets about the war in Ukraine peaked on the 27th, 3 days after the start of the conflict. We can see that the number of tweets jumped from 300,000 on the first day to over 400,000 on the second day and stayed at that level until decreasing on the 28th. At this point it is difficult to say why the number went down on the 28th. However, it is clear that more people started to talk about the war from the second day (the 25th) onward.
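
Grouping by `.dt.day` works here because the data covers a single month. Once the same counting is reused across months, the day of the month alone would merge, say, 28 February with 28 March. A minimal sketch of grouping by the full calendar date instead, using made-up timestamps:

```python
import pandas as pd

# Made-up timestamps spanning a month boundary.
ts = pd.Series(pd.to_datetime([
    "2022-02-28 10:00:00",
    "2022-02-28 23:30:00",
    "2022-03-28 09:15:00",
]))

# Day of month merges 28 February with 28 March ...
by_day_of_month = ts.groupby(ts.dt.day).size()
# ... while flooring to midnight keeps the full calendar date.
by_calendar_date = ts.groupby(ts.dt.floor("D")).size()

print(by_day_of_month)
print(by_calendar_date)
```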

How many languages are in this dataset?

In [None]:
print(f"There are {feb_df['language'].nunique()} unique languages in this DataFrame.")
feb_df["language"].unique()

What percentage of the tweets is in English (en)?

In [None]:
print(f"{round(feb_df.loc[feb_df['language']=='en'].shape[0]/feb_df.shape[0]*100, 2)}% of the tweets are in English.")

Plot the distribution of different languages.

In [None]:
# count tweets per language and keep the 20 most frequent
language_counts = feb_df.groupby("language").size().sort_values(ascending=False)[0:20]
language_counts.plot.bar(figsize=(12,6),
                         title="Top 20 Languages by Frequency",
                         xlabel="Language Code",
                         ylabel="Number of Tweets",
                         rot=0
                         );

plt.show()

We can see that English (en) was by far the most prevalent language in this dataset, nearing 1.2 million tweets out of 1.96 million. The second and third most prevalent languages were French and Thai, respectively. 
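
The share of every language can also be read off in one step with `value_counts(normalize=True)`, rather than dividing row counts by hand for one language at a time. A small sketch, with made-up language codes standing in for `feb_df["language"]`:

```python
import pandas as pd

# Made-up language codes standing in for feb_df["language"].
languages = pd.Series(["en", "en", "fr", "en", "und", "th", "en", "fr"])

# normalize=True returns fractions instead of raw counts, sorted descending.
language_share = languages.value_counts(normalize=True).mul(100).round(2)
print(language_share.head(3))
```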

Note that the sixth most prevalent language was "und", which is used to indicate that Twitter could not detect a language. Let's inspect these rows.

In [None]:
# pull the rows for which their language code is "und" or undefined
language_und = feb_df.loc[feb_df["language"]=="und"]
# show full length of the text without truncating (...)
pd.set_option('display.max_colwidth', None)
# show tweets
language_und["text"]

These tweets appear to have lots of hashtags but little text. There are a couple of exceptions, however. For example, there is a tweet in Croatian ("Hm… sve što se današ u Ukrajini...") ending with a hashtag in English ("#Ukraine"). Another tweet has both Ukrainian and English. Perhaps Twitter defaults to assigning "und" when a tweet contains more than one language. Regardless, we are going to limit the scope of the project to English tweets, which we will implement in code soon.

The number of English tweets jumped after the first day of the war to over 400,000 and stayed that way through the third day, the 26th, compared to slightly fewer than 300,000 tweets on the first day of the conflict, the 24th. After the third day, however, the number went down slightly on the 27th and then rebounded on the 28th. The decrease on the 27th could be explained by many factors. For one, the tweet extraction process could have started late or been cut off early. We can check this by looking at the earliest and latest tweet timestamps of each day in the DataFrame. Or the number could have dropped for other reasons that are unclear to us right now.

Find the earliest and latest `tweetcreatedts` timestamps of each day and calculate the time difference between them.

In [None]:
earliest_tweetts = feb_df["tweetcreatedts"].groupby(feb_df["tweetcreatedts"].dt.day).min()
latest_tweetts = feb_df["tweetcreatedts"].groupby(feb_df["tweetcreatedts"].dt.day).max()

print(f"Earliest tweet timestamp of each day: {earliest_tweetts} \n")
print(f"Latest tweet timestamp of each day: {latest_tweetts} \n")

print("Timespan between first tweet of the day and the last tweet for each day shown below:")
# calculate the timespan between first tweet of the day and the last tweet
latest_tweetts - earliest_tweetts

The first day had the shortest timespan of just over 17 hours between the first tweet collected and the last one. That is because the earliest timestamp for that day was at 06:48 in the morning, whereas for other days the earliest timestamps were just after midnight (00:00). 

Upon calculating the time differences, we can see that, as expected, the first and last tweet on the 24th had the shortest span of 17 hours and 11 minutes. The second shortest day was the 28th with 23 hours and 24 minutes between its first and last tweets. The 27th was the third shortest day with 23 hours and 50 minutes, but it was only 5-6 minutes shorter than the two longest days, the 25th and 26th. Although we cannot say for certain that the times between the first and last tweets could explain the decreases on the 27th and 28th, they seem to be contributing factors.
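
One way to probe whether collection time explains the differences is to divide each day's tweet count by the hours actually covered, which makes days with shorter collection windows comparable to full days. The sketch below uses made-up timestamps and is only meant to illustrate the calculation, not to reproduce the numbers above.

```python
import pandas as pd

# Made-up timestamps: day 24 is only partly covered, day 25 is covered all day.
ts = pd.Series(pd.to_datetime([
    "2022-02-24 12:00:00", "2022-02-24 18:00:00", "2022-02-24 23:59:00",
    "2022-02-25 00:01:00", "2022-02-25 06:00:00", "2022-02-25 12:00:00",
    "2022-02-25 18:00:00", "2022-02-25 23:59:00",
]))

# One pass gives the first timestamp, last timestamp, and tweet count per day.
daily = ts.groupby(ts.dt.day).agg(["min", "max", "count"])
# Hours between the first and last collected tweet of each day.
daily["hours_covered"] = (daily["max"] - daily["min"]).dt.total_seconds() / 3600
# Tweets per covered hour, comparable across days with different coverage.
daily["tweets_per_hour"] = daily["count"] / daily["hours_covered"]
print(daily[["count", "hours_covered", "tweets_per_hour"]])
```

In this toy data the partly covered day has fewer tweets in total but a higher rate per covered hour, which is the kind of pattern that would point to collection time rather than a drop in interest.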

In [None]:
min_len = feb_df["text"].str.len().min()
max_len = feb_df["text"].str.len().max()


print(f"Shortest tweet has {min_len} chars.")
print(f"Longest tweet has {max_len} chars.")

Hold on, a tweet can have 280 characters max. How could one have more than the limit?

In [None]:
# get index of the tweet that has the max length
max_len_index = feb_df["text"].str.len().idxmax()
# pull out the text of that index
feb_df.loc[max_len_index, "text"]

Upon research, mentions supposedly do not count toward the character limit when the tweet is a reply.
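
To get a rough feel for how much of a long tweet is made up of reply mentions, the leading run of `@mentions` can be stripped before measuring the length. HTML escapes such as `&amp;` may also inflate the raw character count, so the sketch below unescapes them as well. This is an approximation for exploration under those assumptions, not Twitter's actual counting rule, and the sample tweet is made up.

```python
import html
import re

# Leading run of @mentions, as found at the start of a reply.
LEADING_MENTIONS = re.compile(r"^(?:@\w+\s+)+")

def approx_visible_length(text: str) -> int:
    """Rough length of a tweet after dropping reply mentions and HTML escapes.

    This is an approximation for exploration, not Twitter's counting rule.
    """
    without_mentions = LEADING_MENTIONS.sub("", text)
    # html.unescape turns "&amp;" back into "&", so it counts as one character.
    return len(html.unescape(without_mentions))

sample = "@user_one @user_two peace &amp; love"
print(len(sample), approx_visible_length(sample))
```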

Let's check the distribution of tweet lengths.

In [None]:
tweet_len_series = feb_df["text"].str.len()
tweet_len_series.plot.hist();
plt.title("Distribution of Tweet Length")
plt.xlabel("Tweet Length (Characters)")
plt.ylabel("Frequency")
# draw a vertical line for the mean
plt.axvline(x=tweet_len_series.mean(), color="red")
# draw a vertical line for the median
plt.axvline(x=tweet_len_series.median(), color="yellow")
plt.show()

print(f"Mean: {tweet_len_series.mean()} chars")
print(f"Median: {tweet_len_series.median()} chars")
print(f"Standard deviation: {tweet_len_series.std()} chars")

The distribution is right-skewed. Most tweets appear to be below 300 characters in length. But because a few outlying tweets are anomalously long, as investigated above, the histogram has an elongated x-axis.
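
One possible way to keep a few extreme lengths from stretching the axis is to cut the data at a high quantile before plotting. A minimal sketch with made-up lengths; the 0.9 quantile is an arbitrary choice for this toy data, and a value such as 0.99 may suit the real data better.

```python
import pandas as pd

# Made-up lengths: mostly ordinary tweets plus one extreme outlier.
lengths = pd.Series([40, 80, 120, 140, 140, 200, 250, 280, 280, 950])

# Cut the axis at a high quantile so the outlier does not stretch the plot.
cutoff = lengths.quantile(0.9)
trimmed = lengths[lengths <= cutoff]
print(f"Cutoff: {cutoff} chars; kept {len(trimmed)} of {len(lengths)} tweets")
# trimmed.plot.hist(bins=20) would then draw the histogram on the shorter axis.
```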

Find the top 20 users by tweet frequency.

In [None]:
plt.figure(figsize=(10,5))
feb_df["username"].value_counts().sort_values(ascending=True)[-20:].plot.barh();
plt.title("Top 20 Users by Tweet Frequency")
plt.xlabel("Tweet Frequency")
plt.ylabel("Username")
plt.show()

The top 20 users tweeted hundreds of times during the last 4 days of February. Such frequencies raise suspicion that they are bots or institution-operated, like media companies.

Let's calculate user account ages.

In [None]:
feb_df.columns

In [None]:
feb_df[["usercreatedts", "extractedts"]].head()

In [None]:
feb_df["account_age"] = (feb_df["extractedts"]-feb_df["usercreatedts"])
# sns.histplot(feb_df["account_age"])
# plt.xlimit()
# feb_df["account_age"].head()

In [None]:
idxmin = feb_df["usercreatedts"].idxmin()
feb_df.loc[idxmin,:]

In [None]:
feb_df["account_age"].head()

In [None]:
print(feb_df["account_age"].min())
print(feb_df["account_age"].max())
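
The `account_age` values above are Timedelta objects, which are awkward to plot or compare directly. A sketch of converting them to days and approximate years, and of measuring how many accounts were very new at extraction time, follows. The timestamps are made up, and the 30-day threshold is an arbitrary assumption.

```python
import pandas as pd

# Made-up account-creation and extraction timestamps.
created = pd.Series(pd.to_datetime(["2012-02-28", "2022-02-20", "2021-02-28"]))
extracted = pd.Series(pd.to_datetime(["2022-02-28", "2022-02-28", "2022-02-28"]))

account_age = extracted - created          # Timedelta values
age_days = account_age.dt.days             # whole days as integers
age_years = age_days / 365.25              # approximate years

# Share of accounts younger than 30 days at extraction time.
share_new = (age_days < 30).mean()
print(age_days.tolist(), round(share_new, 2))
```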

## Data Cleaning

### Check for Missing Values

Check which columns have missing values.

In [None]:
feb_df.isna().any()

The `acctdesc` (account description), `location`, and `coordinates` columns have missing values.

The `acctdesc` column contains account descriptions that users share on their Twitter profiles. At the moment, we are not concerned with such information. For now, we will rely on the tweets themselves to learn which words are frequently used and what sentiments users express about the war in Ukraine. Therefore, we will drop the `acctdesc` column.

In [None]:
# drop acctdesc column
feb_df.drop("acctdesc", axis=1, inplace=True)
# confirm it has been dropped
feb_df.info()

Let's see how many rows in the `location` column are missing. 

In [None]:
# get the number of rows missing location info
missing_location_count = feb_df.loc[feb_df["location"].isna()].shape[0]
print(f"{missing_location_count} rows are missing location information.")
print(f"{round(missing_location_count/feb_df.shape[0]*100,2)}% of the rows are missing location information.")

Let's see how many rows in `coordinates` column are missing values.

In [None]:
missing_coordinates_count = feb_df.loc[feb_df["coordinates"].isna()].shape[0]
pct_missing_coordinates = round(missing_coordinates_count/feb_df.shape[0]*100,2)
print(f"{missing_coordinates_count} rows are missing coordinates.")
print(f"{pct_missing_coordinates}% of the rows are missing coordinates.")


Find the number of rows missing **both location and coordinates**.

In [None]:
missing_location_coord = feb_df.loc[(feb_df["location"].isna()) & \
                                   (feb_df["coordinates"].isna())]
print(f"{missing_location_coord.shape[0]} rows, or {missing_location_coord.shape[0]/feb_df.shape[0]*100}%, \
are missing both location and coordinates data")

A significant portion of the data is missing both location and coordinates information, which makes it difficult to impute the missing values. Given the current scope of the project, we will drop the `location` and `coordinates` columns.
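
The per-column checks above can also be condensed: `isna()` yields a True/False table, and its column-wise mean is the share of missing values in each column. A sketch on a made-up DataFrame with gaps in the same three columns:

```python
import pandas as pd

# Made-up rows with gaps in the same three columns.
toy_df = pd.DataFrame({
    "text": ["a", "b", "c", "d"],
    "acctdesc": ["bio", None, "bio", None],
    "location": [None, None, "Kyiv", None],
    "coordinates": [None, None, None, None],
})

# isna() gives a True/False table; the column-wise mean is the missing share.
missing_pct = toy_df.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct)

# Rows missing both location and coordinates at once.
both_missing = toy_df[["location", "coordinates"]].isna().all(axis=1).mean() * 100
print(f"{both_missing:.1f}% of rows lack both fields")
```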

In [None]:
feb_df.drop(["location", "coordinates"], axis=1, inplace=True)

Check if the columns have been dropped.

In [None]:
feb_df.columns

### Check for Duplicate Rows

In [None]:
# pull duplicated rows based on tweetid column because tweetid is unique to each tweet
# theoretically, there shouldn't be duplicate tweetids; otherwise, we remove such duplicate rows
# we sort values to display the duplicate tweets next to each other
feb_df.loc[feb_df.duplicated(["tweetid"],keep=False)].sort_values("tweetid").head(6)

Remove duplicate rows.

In [None]:
# by default, keep the first instance of the duplicates and drop the rest
feb_df.drop_duplicates(["tweetid"], inplace=True)

Check if any duplicates remain.

In [None]:
feb_df.duplicated(["tweetid"]).any()

In [None]:
# double check
feb_df.duplicated().any()

All duplicates have been removed.
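
Keeping the first instance is the simplest choice. If the same tweet was extracted more than once, the copies may differ in fields such as `retweetcount`, and one alternative is to keep the most recently extracted copy instead. The sketch below illustrates that alternative on made-up rows; it is not what the cells above do.

```python
import pandas as pd

# Made-up rows: tweet 101 was extracted twice with different retweet counts.
toy_df = pd.DataFrame({
    "tweetid": [101, 102, 101],
    "retweetcount": [5, 1, 9],
    "extractedts": pd.to_datetime([
        "2022-02-25 01:00:00", "2022-02-25 01:00:00", "2022-02-26 01:00:00",
    ]),
})

# Sort by extraction time, then keep the last (most recent) row per tweetid.
latest = (
    toy_df.sort_values("extractedts")
          .drop_duplicates("tweetid", keep="last")
          .sort_values("tweetid")
          .reset_index(drop=True)
)
print(latest)
```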

## Preprocessing Text Data

Tweets can contain a lot of miscellaneous information when it comes to sentiment analysis. One example would be URLs: they don't help us gauge the sentiment. Irregularities such as uppercase and lowercase letters are typically unified into all lowercase letters by convention, though exceptions are occasionally taken. Mentions and hashtags are also removed. Smileys and emojis can be useful, but we will remove them for simplicity and agility for now.

Let's look at what the text data looks like.

In [None]:
# show first 10 tweets
feb_df["text"].head(10)

As we found before, there are multiple languages. For this project, we will only look at English tweets. A future direction of the project is to explore multilingual natural language processing.

In [None]:
# select only the rows whose tweets are in English
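# .copy() gives an independent DataFrame so that adding columns later does not raise SettingWithCopyWarning
feb_df = feb_df.loc[feb_df["language"]=="en"].copy()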
print(f"{feb_df.shape[0]} rows are in English")

Reset index.

In [None]:
# reset index
feb_df.reset_index(inplace=True, drop=True)
# check
feb_df.head()

View the text data one more time.

In [None]:
feb_df["text"].head(10)

The texts have URLs, emojis, mentions, hashtags, and HTML artifacts (e.g., \n). Uppercase and lowercase letters are also mixed.

1. Lowercase everything
2. Remove URLs and HTML artifacts (e.g., &amp, \n), hashtags, mentions, digits, and emojis
3. Remove punctuations

### 1. Lowercase Everything
Convert everything to lowercase.

*Note: we could use [Texthero](https://towardsdatascience.com/how-to-vectorize-text-in-dataframes-for-nlp-tasks-3-simple-techniques-82925a5600db) to speed up this process.*

In [None]:
# Lowercase everything
feb_df["cleaned_text"] = feb_df["text"].str.lower()
# check
feb_df["cleaned_text"].head(10)

Everything has been lowercased.

### 2. Remove URLs, HTML, Hashtags, Mentions, Digits, and Emojis

In [None]:
def remove_unnecessary(text):
    # INPUT: string (tweet)
    # OUTPUT: string without URLs, mentions, hashtags, digits, and emojis (and smileys)
    p.set_options(p.OPT.URL, p.OPT.MENTION, p.OPT.HASHTAG, p.OPT.NUMBER, p.OPT.EMOJI, p.OPT.SMILEY)
    result = p.clean(text)
    return result

feb_df["cleaned_text"] = feb_df["cleaned_text"].map(remove_unnecessary)
# check
feb_df["cleaned_text"].head(10)

In [None]:
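stopwords_set = set(STOPWORDS)
# join all tweets into one string; str(feb_df['cleaned_text']) would pass only the truncated Series preview
all_text = " ".join(feb_df['cleaned_text'])
wordcloud = WordCloud(background_color='white',
                      stopwords=stopwords_set,
                      max_words=300,
                      max_font_size=40,
                      scale=2,
                      random_state=42
                     ).generate(all_text)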

print(wordcloud)
plt.imshow(wordcloud)
plt.axis('off')
plt.show()


#### Remove HTML Entities

As for removing HTML entities, the most frequent ones are "&amp" and "\n". We will replace these with empty string.

In [None]:
# replace "&amp" in tweets with empty string
# regex=False treats "&amp" as a literal string regardless of the pandas version
feb_df["cleaned_text"] = feb_df["cleaned_text"].str.replace("&amp", "", regex=False)

# replace "\n" in tweets with empty string
# may not be necessary after applying remove_unnecessary func
# feb_df["cleaned_text"] = feb_df["cleaned_text"].str.replace("\n", "")

# check
feb_df["cleaned_text"].head(10)

### 3. Remove Punctuations

In [None]:
# remove punctuations using regex
# reference: https://stackoverflow.com/questions/68641923/remove-puncts-from-pandas-dataframe
feb_df["cleaned_text"] = feb_df['cleaned_text'].str.replace(r'[^0-9a-zA-Z\s]+', '', regex=True)

# check
feb_df["cleaned_text"].head(15)

Punctuations have been removed.
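
For reference, the cleaning steps above can be approximated with the standard library alone. The sketch below swaps in plain regular expressions for `tweet-preprocessor`, so its results may differ from `p.clean` on edge cases; the patterns and the sample tweet are assumptions made for illustration.

```python
import re

URL = re.compile(r"https?://\S+|www\.\S+")
MENTION_OR_HASHTAG = re.compile(r"[@#]\w+")
HTML_ENTITY = re.compile(r"&\w+;?")
NOT_LETTER_OR_SPACE = re.compile(r"[^a-z\s]+")
WHITESPACE = re.compile(r"\s+")

def clean_tweet(text: str) -> str:
    """Regex-only approximation of the cleaning steps in this notebook."""
    text = text.lower()
    text = URL.sub(" ", text)
    text = MENTION_OR_HASHTAG.sub(" ", text)
    text = HTML_ENTITY.sub(" ", text)
    # Drops digits, punctuation, and emojis in one pass.
    text = NOT_LETTER_OR_SPACE.sub("", text)
    # Collapse the gaps left behind by the removals above.
    return WHITESPACE.sub(" ", text).strip()

sample = "@NATO Peace &amp; hope for #Ukraine!!! 100% https://example.com/x"
print(clean_tweet(sample))
```

Note that the `[^a-z\s]` pattern also drops accented and non-Latin letters, which is acceptable here only because the analysis is limited to English tweets.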

Let's save this DataFrame with `cleaned_text` as a pickle file for fast loading.

In [None]:
# save dataframe as a pickle file for later loading
feb_df.to_pickle("feb_cleaned.pkl")

## Create a WordCloud

In [None]:
# concatenate all tweets in cleaned_text column into one long string for wordcloud to accept
text = " ".join(tweet for tweet in feb_df["cleaned_text"])

# reference: https://amueller.github.io/word_cloud/generated/wordcloud.WordCloud.html

wc = WordCloud(width=800, height=400, max_words=300).generate(text)
plt.figure(figsize=(12,10))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()

Unsurprisingly, we see that words like "ukraine", "war", and "russia" are the most prominent in the wordcloud, meaning they appear most frequently in the tweets. We can also feel a sense of shock and awe from the bigram "whole world", perhaps conveying how this war has gripped the world's attention or how it has implicated many countries around the globe in one way or another. One peculiarity is that "russian people" is a frequent bigram, while "ukrainian people" is missing from the cloud. This could simply be due to text preprocessing. However, we know that Putin has publicly claimed that he is conducting this "special military operation" to save ethnic Russians in Ukraine who are allegedly being oppressed, so that could be a reason this bigram takes up a prominent space in the cloud. In fact, he has been voicing and writing outlandish theories of his own about Ukraine for many years and, of course, just prior to and during this war. A wordcloud is just an exploratory tool to see which words are pervasive in the data. We shall dive deeper into the analysis.

## Sentiment Analysis Using RoBERTa

For each tweet, the RoBERTa model will generate a score for each of negative, neutral, and positive sentiments.

As this is my first time conducting sentiment analysis, the following work relies heavily on [S Sai Suryateja's work](https://www.kaggle.com/code/ssaisuryateja/eda-and-sentiment-analysis/notebook#Sentiment-and-Emotion-Analysis). I did some research to understand the code line by line.

In [None]:
# reference: https://www.kaggle.com/code/ssaisuryateja/eda-and-sentiment-analysis#EDA

import torch

print(f"Number of GPUs: {torch.cuda.device_count()}")
# set device to cuda:0 if it's available
device = "cuda:0" if torch.cuda.is_available() else "cpu"

In [None]:
# get pretrained tokenizer from cardiffnlp repo
tokenizer = AutoTokenizer.from_pretrained("cardiffnlp/twitter-roberta-base-sentiment")
# create instance of twitter-roberta-base-sentiment classification model and move it to the selected device (GPU if available, else CPU)
model = AutoModelForSequenceClassification.from_pretrained("cardiffnlp/twitter-roberta-base-sentiment").to(device)

In [None]:
import urllib.request
import csv

labels=[] # will contain 'positive', 'neutral', 'negative'
task = 'sentiment' # our task is sentiment analysis
mapping_link = f"https://raw.githubusercontent.com/cardiffnlp/tweeteval/main/datasets/{task}/mapping.txt"
with urllib.request.urlopen(mapping_link) as f:
    html = f.read().decode('utf-8').split("\n")
    csvreader = csv.reader(html, delimiter='\t')
labels = [row[1] for row in csvreader if len(row) > 1]

In [None]:
labels

We use three labels: "negative", "neutral", and "positive".

In [None]:
from scipy.special import softmax
from tqdm import tqdm

BATCH_SIZE = 100 # number of tweets in a batch that will be passed into tokenizer

scores_all = np.empty((0,len(labels)))
# create list of all the tweets in the dataset
text_all = feb_df['cleaned_text'].to_list()
n = len(text_all) # same as number of tweets
with torch.no_grad():
    for start_idx in tqdm(range(0, n, BATCH_SIZE)):
        end_idx = min(start_idx + BATCH_SIZE, n) 
        # reference: https://huggingface.co/docs/transformers/preprocessing
        # tokenize the tweets in the batch, return pytorch ('pt') tensors
        # some tweets are shorter than the uniform tensor length needed; padding adds 0's to maintain uniform tensor length
        # some tweets are too long; truncation truncates input to maximum length accepted by model
        encoded_input = tokenizer(text_all[start_idx:end_idx], return_tensors='pt', padding=True, truncation=True).to(device)
        
        # references: https://stackoverflow.com/questions/11315010/what-do-and-before-a-variable-name-mean-in-a-function-signature
        # https://stackoverflow.com/questions/1419046/normal-arguments-vs-keyword-arguments/1419160#1419160
        output = model(**encoded_input)
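        # output[0] holds the logits; move them to the CPU and convert to a numpy array
        scores = output[0].detach().cpu().numpy()
        # convert the logits to probabilities that sum to 1 across the three labels
        scores = softmax(scores, axis=1)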
        scores_all = np.concatenate((scores_all, scores), axis=0)
        
        # delete encoded_input, output, scores for next batch
        del encoded_input, output, scores 
        # release all unoccupied cached mem 
        torch.cuda.empty_cache()

Output below is what `scores_all` looks like. Each row contains scores for negative, neutral, and positive sentiments. The higher the score, the more likely the tweet has that sentiment.
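
To make the scores concrete, the sketch below applies a row-wise softmax to made-up logits with plain NumPy. It is meant to show what the `softmax` call in the loop does to the model output, not to replace `scipy.special.softmax`; the logits and the label order are assumptions for illustration.

```python
import numpy as np

def softmax_rows(logits: np.ndarray) -> np.ndarray:
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Made-up logits for two tweets, ordered as negative, neutral, positive.
toy_logits = np.array([[2.0, 0.5, -1.0],
                       [-0.5, 0.0, 3.0]])
toy_scores = softmax_rows(toy_logits)

print(toy_scores.round(3))
print(toy_scores.sum(axis=1))      # each row sums to 1
print(toy_scores.argmax(axis=1))   # index of the highest-scoring label per tweet
```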

In [None]:
scores_all

Let's combine the scores with the existing DataFrame.

In [None]:
feb_df[labels] = pd.DataFrame(scores_all, columns=labels)
feb_df.head()

Save this DataFrame so that we don't have to run the model again, which takes a long time.

In [None]:
feb_df.to_csv("./feb_sentiment_analysis_RoBERTa_raw_values.csv", index=False)
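
As a possible first step toward interpretation, each tweet can be assigned the label with the highest score, and the labels can then be tallied. The sketch below uses made-up scores and assumes the three score columns are named `negative`, `neutral`, and `positive`, as in the `labels` list above; the `predicted` column name is hypothetical.

```python
import pandas as pd

# Made-up scores in the same layout as the three label columns.
toy_scores = pd.DataFrame({
    "negative": [0.80, 0.10, 0.30],
    "neutral":  [0.15, 0.20, 0.40],
    "positive": [0.05, 0.70, 0.30],
})

# idxmax(axis=1) returns the column name holding the largest value in each row.
toy_scores["predicted"] = toy_scores[["negative", "neutral", "positive"]].idxmax(axis=1)
# Share of tweets per predicted label.
label_share = toy_scores["predicted"].value_counts(normalize=True)
print(toy_scores)
print(label_share)
```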

[More interpretative work to come.]