<h1>Text Classification on Coronavirus tweets</h1>

Natural language processing(NLP) in simple words can be defined as the branch of Artificial Intelligence, that gives machines the ability to read, understand and make sense of human languages. Sentiment analysis is also known as opinion mining or AI emotion is the use of natural language processing techniques for analyzing texts and extracting and classifying the sentiments of these texts as positive negative or neutral.<br>
Now an interesting question about this type of project that may arise in your mind is that why sentiment analysis on COVID-19 Tweets? What is about the coronavirus tweets that would be positive?<br>
The use of social media for communication during the time of crisis has increased remarkably over the recent years. As mentioned above, analyzing social media data is important as it helps understand public sentiment. During the coronavirus pandemic, many people took to social media to express their anger, grief, or sadness while some also spread happiness and positivity. People also used social media to ask their network for help related to vaccines or hospitals during this hard time. Many issues related to this pandemic can also be solved if experts considered this social data. That’s the reason why analyzing this type of data is important to understand the overall issues faced by people.<br>
We are going to perform Text Classification on the data. The tweets have been pulled from Twitter and manual tagging has been done then.


__Importing all the necessary libraries__

In [1]:
import pandas as pd
import numpy as np
import re 
import nltk 
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from io import StringIO

from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC

from sklearn.metrics import confusion_matrix,accuracy_score,precision_score,recall_score,classification_report
import matplotlib.pyplot as plt

Matplotlib is building the font cache; this may take a moment.


__Importing Data__<br>
For this project, I have used the datasets from Kaggle.The dataset consists of 41157 tweets. There are 5 sentiments namely ‘Positive’, ‘Extremely Positive’, ‘Negative’, ‘Extremely Negative’, and ‘Neutral’ in the sentiment column. 

In [2]:
data=pd.read_csv("Corona_NLP_train.csv",encoding='latin1')
data = data[['OriginalTweet', 'Sentiment']]

The dataset consists of the ‘UserName’, ‘ScreenName’, ‘Location’, ‘TweetAt’, ‘OriginalTweet’, and ‘Sentiment’ columns. I have dropped the unnecessary columns that are not useful for what we are trying to predict.

In [3]:
data.head()

Unnamed: 0,OriginalTweet,Sentiment
0,@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i...,Neutral
1,advice Talk to your neighbours family to excha...,Positive
2,Coronavirus Australia: Woolworths to give elde...,Positive
3,My food stock is not the only one which is emp...,Positive
4,"Me, ready to go at supermarket during the #COV...",Extremely Negative


__Understanding the data__

In [4]:
data.shape

(41157, 2)

In [5]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41157 entries, 0 to 41156
Data columns (total 2 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   OriginalTweet  41157 non-null  object
 1   Sentiment      41157 non-null  object
dtypes: object(2)
memory usage: 643.2+ KB


From information above we can see that there no null values and both the columns have object datatypes.

In [6]:
data['Sentiment'].value_counts()

Positive              11422
Negative               9917
Neutral                7713
Extremely Positive     6624
Extremely Negative     5481
Name: Sentiment, dtype: int64

The distribution of positive and negative sentiments seems pretty close i.e. 28%, 24% respectively while 16% of tweets are extremely positive and 13% are extremely negative.

In [7]:
#converting the categorical target variable to numerical target variable
target_variable={'Extremely Negative':0, 'Negative':0, 'Neutral':1,
                'Positive':2, 'Extremely Positive':2}
data['Sentiment_num']=data['Sentiment'].map(lambda x:target_variable[x])
data.head()

Unnamed: 0,OriginalTweet,Sentiment,Sentiment_num
0,@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i...,Neutral,1
1,advice Talk to your neighbours family to excha...,Positive,2
2,Coronavirus Australia: Woolworths to give elde...,Positive,2
3,My food stock is not the only one which is emp...,Positive,2
4,"Me, ready to go at supermarket during the #COV...",Extremely Negative,0


There is not much difference between ‘Extreme Positive’ and ‘Positive’ and ‘Extremely Negative’ and ‘Negative’, therefore I have replaced extremely positive with positive sentiment and extremely negative as negative and mapped ‘Positive’ sentiment to 1, ‘Negative’ sentiment to 2, and ‘Neutral’ sentiment to 0. This will also help in fast processing.

<h1>Data Preprocessing<h1>

The tweets data that we get from the API is unstructured and in different languages. This is not convenient for Machine Learning or statistical analysis. Therefore, data preprocessing is an extremely important step as it affects the ability of our model to learn.

In [8]:
def remove_urls(text):
    url_remove = re.compile(r'https?://\S+|www\.\S+')
    return url_remove.sub(r'', text)
data['text']=data['OriginalTweet'].apply(lambda x:remove_urls(x))
data.head()

Unnamed: 0,OriginalTweet,Sentiment,Sentiment_num,text
0,@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i...,Neutral,1,@MeNyrbie @Phil_Gahan @Chrisitv and and
1,advice Talk to your neighbours family to excha...,Positive,2,advice Talk to your neighbours family to excha...
2,Coronavirus Australia: Woolworths to give elde...,Positive,2,Coronavirus Australia: Woolworths to give elde...
3,My food stock is not the only one which is emp...,Positive,2,My food stock is not the only one which is emp...
4,"Me, ready to go at supermarket during the #COV...",Extremely Negative,0,"Me, ready to go at supermarket during the #COV..."


In [9]:
def remove_html(text):
    html=re.compile(r'<.*?>')
    return html.sub(r'',text)
data['text']=data['text'].apply(lambda x:remove_html(x))
data.head()

Unnamed: 0,OriginalTweet,Sentiment,Sentiment_num,text
0,@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i...,Neutral,1,@MeNyrbie @Phil_Gahan @Chrisitv and and
1,advice Talk to your neighbours family to excha...,Positive,2,advice Talk to your neighbours family to excha...
2,Coronavirus Australia: Woolworths to give elde...,Positive,2,Coronavirus Australia: Woolworths to give elde...
3,My food stock is not the only one which is emp...,Positive,2,My food stock is not the only one which is emp...
4,"Me, ready to go at supermarket during the #COV...",Extremely Negative,0,"Me, ready to go at supermarket during the #COV..."


In [10]:
# Lower casing
def lower(text):
    low_text= text.lower()
    return low_text
data['text']=data['text'].apply(lambda x:lower(x))
data.head()

Unnamed: 0,OriginalTweet,Sentiment,Sentiment_num,text
0,@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i...,Neutral,1,@menyrbie @phil_gahan @chrisitv and and
1,advice Talk to your neighbours family to excha...,Positive,2,advice talk to your neighbours family to excha...
2,Coronavirus Australia: Woolworths to give elde...,Positive,2,coronavirus australia: woolworths to give elde...
3,My food stock is not the only one which is emp...,Positive,2,my food stock is not the only one which is emp...
4,"Me, ready to go at supermarket during the #COV...",Extremely Negative,0,"me, ready to go at supermarket during the #cov..."


In [11]:
# Number removal
def remove_num(text):
    remove= re.sub(r'\d+', '', text)
    return remove
data['text']=data['text'].apply(lambda x:remove_num(x))
data.head()

Unnamed: 0,OriginalTweet,Sentiment,Sentiment_num,text
0,@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i...,Neutral,1,@menyrbie @phil_gahan @chrisitv and and
1,advice Talk to your neighbours family to excha...,Positive,2,advice talk to your neighbours family to excha...
2,Coronavirus Australia: Woolworths to give elde...,Positive,2,coronavirus australia: woolworths to give elde...
3,My food stock is not the only one which is emp...,Positive,2,my food stock is not the only one which is emp...
4,"Me, ready to go at supermarket during the #COV...",Extremely Negative,0,"me, ready to go at supermarket during the #cov..."


In [12]:
#Remove mentions and hashtags
def remove_mention(x):
    text=re.sub(r'@\w+','',x)
    return text
data['text']=data['text'].apply(lambda x:remove_mention(x))

def remove_hash(x):
    text=re.sub(r'#\w+','',x)
    return text
data['text']=data['text'].apply(lambda x:remove_hash(x))

#Remove extra white space left while removing stuff
def remove_space(text):
    space_remove = re.sub(r"\s+"," ",text).strip()
    return space_remove
data['text']=data['text'].apply(lambda x:remove_space(x))
data.head()

Unnamed: 0,OriginalTweet,Sentiment,Sentiment_num,text
0,@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i...,Neutral,1,and and
1,advice Talk to your neighbours family to excha...,Positive,2,advice talk to your neighbours family to excha...
2,Coronavirus Australia: Woolworths to give elde...,Positive,2,coronavirus australia: woolworths to give elde...
3,My food stock is not the only one which is emp...,Positive,2,my food stock is not the only one which is emp...
4,"Me, ready to go at supermarket during the #COV...",Extremely Negative,0,"me, ready to go at supermarket during the outb..."


In [13]:
def punct_remove(text):
    punct = re.sub(r"[^\w\s\d]","", text)
    return punct
data['text']=data['text'].apply(lambda x:punct_remove(x))
data.head()

Unnamed: 0,OriginalTweet,Sentiment,Sentiment_num,text
0,@MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i...,Neutral,1,and and
1,advice Talk to your neighbours family to excha...,Positive,2,advice talk to your neighbours family to excha...
2,Coronavirus Australia: Woolworths to give elde...,Positive,2,coronavirus australia woolworths to give elder...
3,My food stock is not the only one which is emp...,Positive,2,my food stock is not the only one which is emp...
4,"Me, ready to go at supermarket during the #COV...",Extremely Negative,0,me ready to go at supermarket during the outbr...


In [14]:
#Remove stopwords
from nltk.corpus import stopwords
", ".join(stopwords.words('english'))
STOPWORDS = set(stopwords.words('english'))
def remove_stopwords(text):
    """custom function to remove the stopwords"""
    return " ".join([word for word in str(text).split() if word not in STOPWORDS])
data['text']=data['text'].apply(lambda x:remove_stopwords(x))

LookupError: 
**********************************************************************
  Resource [93mstopwords[0m not found.
  Please use the NLTK Downloader to obtain the resource:

  [31m>>> import nltk
  >>> nltk.download('stopwords')
  [0m
  For more information see: https://www.nltk.org/data.html

  Attempted to load [93mcorpora/stopwords[0m

  Searched in:
    - 'C:\\Users\\Dr. Jayesh/nltk_data'
    - 'C:\\Users\\Dr. Jayesh\\anaconda3\\nltk_data'
    - 'C:\\Users\\Dr. Jayesh\\anaconda3\\share\\nltk_data'
    - 'C:\\Users\\Dr. Jayesh\\anaconda3\\lib\\nltk_data'
    - 'C:\\Users\\Dr. Jayesh\\AppData\\Roaming\\nltk_data'
    - 'C:\\nltk_data'
    - 'D:\\nltk_data'
    - 'E:\\nltk_data'
**********************************************************************


After all these processing, there is a possibility to end up with empty strings.Hence,we have implemented below to take care of it.

In [None]:
data[data['text']==""]

In [15]:
#removed all the text will null string in training data
data = data[data['text']!=""]

In [16]:
data["Sentiment_num"].value_counts()

2    18044
0    15397
1     7701
Name: Sentiment_num, dtype: int64

<h1>TF-IDF<h1>

One of the most popular approaches to counting word frequencies is TF-IDF. TF-IDF stands for Term Frequency — Inverse Document Frequency. The term frequency gives us how many times a word appears in a document whereas inverse document frequency decreases the weights of the words that occur frequently across different documents. It tells us if a word is common or rare across all documents.<br>

TF-IDF is important as it helps understand the importance of a word in the documents.

In [17]:
tfidf = TfidfVectorizer(sublinear_tf=True, min_df=5,stop_words='english')

features = tfidf.fit_transform(data.text)

labels = data.Sentiment_num

print("Each of the %d tweets is represented by %d features (TF-IDF score of unigrams and bigrams)" %(features.shape))

Each of the 41142 tweets is represented by 9336 features (TF-IDF score of unigrams and bigrams)


<h1>Train-test split<h1>

As the size of data is not very large, we will assign most of it to the training data. Hence using train__test_split from Scikit learn , the data has been split into 90:10

In [18]:
xtrain, xtest, ytrain, ytest = train_test_split(features,labels,test_size=0.10,random_state=0)

<h1>Modelling and Evaluation<h1>

For this problem statement , I have implemented various classification models like Logistic Regression, Naive Bayes,Support Vector Machine and Random forest.<br>
For evaluation, out of all the metrics I have used accuracy, precision and recall.<br>
Precision: Of all the predicted positive values, how many are actually positive? This metric shows the accuracy of the model when we only look at the positive cases.<br>
Recall: Of all the actually positive values, how many are predicted to be positive? This metric shows how well the model performs on detecting the actual positive values.

<h3>Logistic Regression<h3>

The main concept of logistic regression is to use linear combinations of the observed features to estimate the particular value and the corresponding label.

In [19]:
LogReg = LogisticRegression(max_iter=500,multi_class='multinomial')
LogReg.fit(xtrain, ytrain)
pred = LogReg.predict(xtrain)
print("Confusion Matrix: \n", confusion_matrix(ytrain, pred))
print("Accuracy : ", accuracy_score(ytrain, pred))
print("Precision : ", precision_score(ytrain, pred,average='macro'))
print("Recall Score : ",recall_score(ytrain, pred,average='macro'))
print(classification_report(ytrain, pred))

Confusion Matrix: 
 [[12349   422  1094]
 [  624  5680   657]
 [  887   354 14960]]
Accuracy :  0.890944445944851
Precision :  0.8886672313370418
Recall Score :  0.8766781471670534
              precision    recall  f1-score   support

           0       0.89      0.89      0.89     13865
           1       0.88      0.82      0.85      6961
           2       0.90      0.92      0.91     16201

    accuracy                           0.89     37027
   macro avg       0.89      0.88      0.88     37027
weighted avg       0.89      0.89      0.89     37027



In [20]:
#evaluation on test data
pred_test = LogReg.predict(xtest)
print("Confusion Matrix: \n", confusion_matrix(ytest, pred_test))
print("Accuracy : ", accuracy_score(ytest, pred_test))
print("Precision : ", precision_score(ytest, pred_test,average='macro'))
print("Recall Score : ",recall_score(ytest, pred_test,average='macro'))
print(classification_report(ytest, pred_test))

Confusion Matrix: 
 [[1247   86  199]
 [ 125  491  124]
 [ 172  107 1564]]
Accuracy :  0.8024301336573512
Precision :  0.7847691910618009
Recall Score :  0.775366189415822
              precision    recall  f1-score   support

           0       0.81      0.81      0.81      1532
           1       0.72      0.66      0.69       740
           2       0.83      0.85      0.84      1843

    accuracy                           0.80      4115
   macro avg       0.78      0.78      0.78      4115
weighted avg       0.80      0.80      0.80      4115



<h3>Multinomial Naive Bayes<h3>

Multinomial Naive Bayes is one of the two classic naive Bayes variants used in text classification.Naive Bayes has been considered a success many times in case of text-classification problems. Let’s see how it performs!

In [21]:
MultNB = MultinomialNB()
MultNB.fit(xtrain, ytrain)
pred = MultNB.predict(xtrain)
print("Confusion Matrix: \n", confusion_matrix(ytrain, pred))
print("Accuracy: ", accuracy_score(ytrain, pred))
print("Precision : ", precision_score(ytrain, pred,average='macro'))
print("Recall Score : ",recall_score(ytrain, pred,average='macro'))
print(classification_report(ytrain, pred))

Confusion Matrix: 
 [[10941   100  2824]
 [ 1638  1729  3594]
 [ 1433    89 14679]]
Accuracy:  0.738623166878224
Precision :  0.792692233757592
Recall Score :  0.64784943420482
              precision    recall  f1-score   support

           0       0.78      0.79      0.78     13865
           1       0.90      0.25      0.39      6961
           2       0.70      0.91      0.79     16201

    accuracy                           0.74     37027
   macro avg       0.79      0.65      0.65     37027
weighted avg       0.77      0.74      0.71     37027



In [28]:
#evaluation on test data
pred_test = MultNB.predict(xtest)
print("Confusion Matrix: \n", confusion_matrix(ytest, pred_test))
print("Accuracy : ", accuracy_score(ytest, pred_test))
print("Precision : ", precision_score(ytest, pred_test,average='macro'))
print("Recall Score : ",recall_score(ytest, pred_test,average='macro'))
print(classification_report(ytest, pred_test))

Confusion Matrix: 
 [[1082   18  432]
 [ 207  101  432]
 [ 239   15 1589]]
Accuracy :  0.67363304981774
Precision :  0.7032082524225092
Recall Score :  0.5683113437619586
              precision    recall  f1-score   support

           0       0.71      0.71      0.71      1532
           1       0.75      0.14      0.23       740
           2       0.65      0.86      0.74      1843

    accuracy                           0.67      4115
   macro avg       0.70      0.57      0.56      4115
weighted avg       0.69      0.67      0.64      4115



<h3>Support Vector Machine<h3>

LinearSVC implements a “one-vs-the-rest” multi-class strategy, thus training n-class models. It is similar to SVC with parameter kernel=’ linear’.



In [23]:
svc_model = LinearSVC()
svc_model.fit(xtrain, ytrain)
pred = svc_model.predict(xtrain)
print("Confusion Matrix: \n", confusion_matrix(ytrain, pred))
print("Accuracy: ", accuracy_score(ytrain, pred))
print("Precision : ", precision_score(ytrain, pred,average='macro'))
print("Recall Score : ",recall_score(ytrain, pred,average='macro'))
print(classification_report(ytrain, pred))

Confusion Matrix: 
 [[12917   297   651]
 [  381  6205   375]
 [  560   264 15377]]
Accuracy:  0.9317254976098522
Precision :  0.9288776257351702
Recall Score :  0.924053417989308
              precision    recall  f1-score   support

           0       0.93      0.93      0.93     13865
           1       0.92      0.89      0.90      6961
           2       0.94      0.95      0.94     16201

    accuracy                           0.93     37027
   macro avg       0.93      0.92      0.93     37027
weighted avg       0.93      0.93      0.93     37027



In [29]:
#evaluation on test data
pred_test = svc_model.predict(xtest)
print("Confusion Matrix: \n", confusion_matrix(ytest, pred_test))
print("Accuracy : ", accuracy_score(ytest, pred_test))
print("Precision : ", precision_score(ytest, pred_test,average='macro'))
print("Recall Score : ",recall_score(ytest, pred_test,average='macro'))
print(classification_report(ytest, pred_test))

Confusion Matrix: 
 [[1250  101  181]
 [ 102  534  104]
 [ 165  115 1563]]
Accuracy :  0.8133657351154313
Precision :  0.7939246490709905
Recall Score :  0.7952074357670863
              precision    recall  f1-score   support

           0       0.82      0.82      0.82      1532
           1       0.71      0.72      0.72       740
           2       0.85      0.85      0.85      1843

    accuracy                           0.81      4115
   macro avg       0.79      0.80      0.79      4115
weighted avg       0.81      0.81      0.81      4115



<h3>Random Forest<h3>

After trying with Support Vector Machine, let’s try modeling this data using the tree-based classifier RandomForestClassifier. Random forest fits several decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.


In [25]:
random_model = RandomForestClassifier(max_depth=5, random_state=0)
random_model.fit(xtrain, ytrain)
pred = random_model.predict(xtrain)
print("Confusion Matrix: \n", confusion_matrix(ytrain, pred))
print("Accuracy: ", accuracy_score(ytrain, pred))
print("Precision : ", precision_score(ytrain, pred,average='macro'))
print("Recall Score : ",recall_score(ytrain, pred,average='macro'))
print(classification_report(ytrain, pred))

Confusion Matrix: 
 [[ 1704     0 12161]
 [   18     0  6943]
 [   74     0 16127]]
Accuracy:  0.4815675047937991
Precision :  0.46884177747473316
Recall Score :  0.3727772558875953
              precision    recall  f1-score   support

           0       0.95      0.12      0.22     13865
           1       0.00      0.00      0.00      6961
           2       0.46      1.00      0.63     16201

    accuracy                           0.48     37027
   macro avg       0.47      0.37      0.28     37027
weighted avg       0.56      0.48      0.36     37027



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [30]:
#evaluation on test data
pred_test = random_model.predict(xtest)
print("Confusion Matrix: \n", confusion_matrix(ytest, pred_test))
print("Accuracy : ", accuracy_score(ytest, pred_test))
print("Precision : ", precision_score(ytest, pred_test,average='macro'))
print("Recall Score : ",recall_score(ytest, pred_test,average='macro'))
print(classification_report(ytest, pred_test))

Confusion Matrix: 
 [[ 151    0 1381]
 [   4    0  736]
 [  13    0 1830]]
Accuracy :  0.4814094775212637
Precision :  0.45415093239390175
Recall Score :  0.3638367506340883
              precision    recall  f1-score   support

           0       0.90      0.10      0.18      1532
           1       0.00      0.00      0.00       740
           2       0.46      0.99      0.63      1843

    accuracy                           0.48      4115
   macro avg       0.45      0.36      0.27      4115
weighted avg       0.54      0.48      0.35      4115



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


<h1>Summary<h1>

After evaluating all the models the highest accuracy among these is achieved using Support Vector Machine and Logistic Regression with an accuracy of 81% and 80% respectively.The worst performer are Naive Bayes and Random Forest with 66% and 45% respectively.
Let's dig deep into high performing models.
In SVM we can see that the precision for all sentiments are above 70% which means that out of all the sentiments that the model predicted, more than 70% was correctly predicted. Similarly recall score as well is more than 70% which means that out of all the sentiments that actually predicted, the model only predicted this outcome correctly for more than 70% of those players.
