# Coronavirus Tweets

Tweets gathered during the Covid-19 lockdown, or otherwise relating to Covid-19, also known as the coronavirus.

This is a text classification problem, which we tackle with the fastText library from Meta (formerly Facebook).

## Requirements

FastText requires a compiler with C++11 support, such as gcc-4.6.3 or newer, or clang-3.3 or newer. Compilation is carried out with a Makefile, so a working make is also needed. The notebook can be run online on the Google Colab platform, which already provides the compiler, or offline on a local machine that has a compiler and make installed.

In [1]:
from google.colab import drive
drive.mount('/content/gdrive/', force_remount=True)

Mounted at /content/gdrive/


In [3]:
!pip install fasttext --quiet

  Building wheel for fasttext (setup.py) ... done


In [16]:
import fasttext
import pandas as pd
import re
import os

In [10]:
base_path = '/content/gdrive/MyDrive/AI/covid_tweets'

In [11]:
!pwd
os.chdir(base_path)
!pwd

/content
/content/gdrive/MyDrive/AI/covid_tweets


In [13]:
train = pd.read_csv('Corona_NLP_train.csv', encoding='ISO-8859-1')
test = pd.read_csv('Corona_NLP_test.csv', encoding='ISO-8859-1')

In [14]:
train.head()

Unnamed: 0,UserName,ScreenName,Location,TweetAt,OriginalTweet,Sentiment
0,3799,48751,London,16-03-2020,@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i...,Neutral
1,3800,48752,UK,16-03-2020,advice Talk to your neighbours family to excha...,Positive
2,3801,48753,Vagabonds,16-03-2020,Coronavirus Australia: Woolworths to give elde...,Positive
3,3802,48754,,16-03-2020,My food stock is not the only one which is emp...,Positive
4,3803,48755,,16-03-2020,"Me, ready to go at supermarket during the #COV...",Extremely Negative


In [17]:
train['tweet_processed'] = train['OriginalTweet'].apply(lambda x : re.sub(r'[\r\n]', '', x))
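
The cell above deletes carriage returns and line feeds so that every tweet sits on a single line, which is what the fastText input format expects. Deleting them also fuses the words on either side of the break. The sketch below shows a variant that substitutes a space instead; `flatten_tweet` is a hypothetical helper, not something the notebook ran, so the results reported later were not produced with it.

```python
import re

def flatten_tweet(text: str) -> str:
    """Replace each run of CR/LF characters with one space (hypothetical helper)."""
    return re.sub(r'[\r\n]+', ' ', text)

# Deleting the break would give "Stay homeStay safe"; substituting a space keeps the words apart.
print(flatten_tweet("Stay home\r\nStay safe"))  # Stay home Stay safe
```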

In [18]:
train['Sentiment'].unique()

array(['Neutral', 'Positive', 'Extremely Negative', 'Negative',
       'Extremely Positive'], dtype=object)

In [19]:
train['sentiment_processed'] = train['Sentiment'].apply(lambda x : "_".join(x.split(" ")).lower())

In [21]:
# Build one fastText line per tweet: "__label__<sentiment> <tweet text>"
preprocessed = [
    '__label__' + label + ' ' + tweet
    for label, tweet in zip(train['sentiment_processed'], train['tweet_processed'])
]
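
To make the target format concrete, here is a minimal sketch of how a single row becomes a fastText training line. The function name `to_fasttext_line` is an assumption introduced for illustration; the label handling mirrors the `sentiment_processed` step above.

```python
def to_fasttext_line(sentiment: str, tweet: str) -> str:
    """Build one supervised fastText line (hypothetical helper).

    The sentiment is lowercased and its spaces are replaced by underscores,
    so that the whole label is a single whitespace-free token.
    """
    label = "_".join(sentiment.split(" ")).lower()
    return "__label__" + label + " " + tweet

print(to_fasttext_line("Extremely Negative", "empty shelves everywhere"))
# __label__extremely_negative empty shelves everywhere
```

The label must not contain spaces, because fastText splits each line on whitespace and treats only the tokens starting with `__label__` as labels.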

In [30]:
# The with block closes the file automatically, so no explicit close() is needed.
with open('covid_train.txt', 'w', encoding='ISO-8859-1') as f:
    for line in preprocessed:
        f.write(line)
        f.write('\n')

In [24]:
test.head()

Unnamed: 0,UserName,ScreenName,Location,TweetAt,OriginalTweet,Sentiment
0,1,44953,NYC,02-03-2020,TRENDING: New Yorkers encounter empty supermar...,Extremely Negative
1,2,44954,"Seattle, WA",02-03-2020,When I couldn't find hand sanitizer at Fred Me...,Positive
2,3,44955,,02-03-2020,Find out how you can protect yourself and love...,Extremely Positive
3,4,44956,Chicagoland,02-03-2020,#Panic buying hits #NewYork City as anxious sh...,Negative
4,5,44957,"Melbourne, Victoria",03-03-2020,#toiletpaper #dunnypaper #coronavirus #coronav...,Neutral


In [27]:
test['tweet_processed'] = test['OriginalTweet'].apply(lambda x : re.sub(r'[\r\n]', '', x))
test['sentiment_processed'] = test['Sentiment'].apply(lambda x : "_".join(x.split(" ")).lower())

In [28]:
# Same line format as the training data, built from the test set
prep_datapoints = [
    '__label__' + label + ' ' + tweet
    for label, tweet in zip(test['sentiment_processed'], test['tweet_processed'])
]

In [31]:
# The with block closes the file automatically, so no explicit close() is needed.
with open('covid_valid.txt', 'w', encoding='ISO-8859-1') as f:
    for line in prep_datapoints:
        f.write(line)
        f.write('\n')

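Next, we normalize the training and validation files to help the model: the sed command puts spaces around punctuation so that each mark becomes its own token, and tr converts the text to lowercase.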

In [33]:
!cat covid_train.txt | sed -e "s/\([.\!?,'/()]\)/ \1 /g" | tr "[:upper:]" "[:lower:]" > train.preprocessed.txt
!cat covid_valid.txt | sed -e "s/\([.\!?,'/()]\)/ \1 /g" | tr "[:upper:]" "[:lower:]" > valid.preprocessed.txt
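
For readers working without a shell, the same normalization can be approximated in Python. This is a sketch under assumptions: `normalize` is a hypothetical name, and `str.lower()` handles non-ASCII letters differently from `tr`, so the output may not match the shell pipeline byte for byte on every tweet.

```python
import re

# Same character set as the sed bracket expression: . \ ! ? , ' / ( )
PUNCT = re.compile(r"([.\\!?,'/()])")

def normalize(line: str) -> str:
    """Pad the listed punctuation marks with spaces, then lowercase (hypothetical helper)."""
    return PUNCT.sub(r" \1 ", line).lower()

print(repr(normalize("Hello, World!")))  # 'hello ,  world ! '
```

Padding produces doubled spaces where a mark was already followed by a space; this is harmless because fastText splits tokens on whitespace.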

In [40]:
len(train), len(test)

(41157, 3798)

In [39]:
!head -n 41157 train.preprocessed.txt > covid.train
!tail -n 3798 valid.preprocessed.txt > covid.valid

In [41]:
model = fasttext.train_supervised('covid.train', lr=1.0, epoch=25, wordNgrams=2, bucket=200000, dim=50, loss='ova')

We pass a preprocessed tweet from the test dataset to the newly trained model to predict its sentiment.

In [42]:
model.predict("when i couldn ' t find hand sanitizer at fred meyer ,  i turned to #amazon .  but $114 . 97 for a 2 pack of purell ?  ?  !  ! check out how  #coronavirus concerns are driving up prices .  https: /  / t . co / ygbipbflmy")

(('__label__positive',), array([0.97405267]))

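Finally, we evaluate the model on the validation file. model.test returns the number of examples, followed by precision and recall at k. With k=-1 all five labels are returned for every tweet, so recall is 1.0 and precision is 1/5 = 0.2; these figures therefore say nothing about the quality of the top prediction.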

In [45]:
model.test('covid.valid', k=-1)

(3798, 0.2, 1.0)
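
The arithmetic behind these two numbers can be reproduced with a small sketch. It assumes that precision is the number of correct predicted labels divided by all predicted labels, and recall is the number of correct predicted labels divided by all true labels, pooled over the examples; `precision_recall` and the toy data are hypothetical and do not call fastText.

```python
def precision_recall(true_labels, predicted_labels):
    """Pooled precision and recall over examples; each item is a set of labels."""
    n_correct = sum(len(t & p) for t, p in zip(true_labels, predicted_labels))
    n_predicted = sum(len(p) for p in predicted_labels)
    n_true = sum(len(t) for t in true_labels)
    return n_correct / n_predicted, n_correct / n_true

ALL_LABELS = {"extremely_negative", "negative", "neutral",
              "positive", "extremely_positive"}

# Toy ground truth: one true label per tweet, as in this dataset.
truth = [{"positive"}, {"neutral"}, {"negative"}]

# Returning every label for every tweet, as k=-1 does with no threshold.
print(precision_recall(truth, [ALL_LABELS] * 3))  # (0.2, 1.0)

# Returning only one label per tweet, here with two of three correct.
top1 = [{"positive"}, {"neutral"}, {"positive"}]
print(precision_recall(truth, top1))
```

With a single predicted label per tweet, precision and recall coincide and equal the accuracy, which is why an evaluation at k=1 is the more informative number for this task.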