<a href="https://colab.research.google.com/github/ankitkarmakar95/Covid19_tweets_sentiment_analysis/blob/main/Covid19_tweets_sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment analysis of Covid19-related tweets using NLP and an ANN model
### In this project we build a model that reads a statement and tells whether it is a positive, negative, or neutral post. For this we use pre-labeled Twitter data collected from Kaggle.

## Importing Data and necessary modules

In [1]:
import numpy as np
import pandas as pd

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/covid tweets.csv', encoding="ISO-8859-1")
df.head()

|   | UserName | ScreenName | Location | TweetAt | OriginalTweet | Sentiment |
|---|---|---|---|---|---|---|
| 0 | 3799 | 48751 | London | 16-03-2020 | @MeNyrbie @Phil_Gahan @Chrisitv https://t.co/i... | Neutral |
| 1 | 3800 | 48752 | UK | 16-03-2020 | advice Talk to your neighbours family to excha... | Positive |
| 2 | 3801 | 48753 | Vagabonds | 16-03-2020 | Coronavirus Australia: Woolworths to give elde... | Positive |
| 3 | 3802 | 48754 |  | 16-03-2020 | My food stock is not the only one which is emp... | Positive |
| 4 | 3803 | 48755 |  | 16-03-2020 | Me, ready to go at supermarket during the #COV... | Extremely Negative |


In [4]:
#performing train test split
from sklearn.model_selection import train_test_split
train_data,test_data = train_test_split(df, test_size=0.25,random_state=56)

In [5]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 30867 entries, 26473 to 35300
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   UserName       30867 non-null  int64 
 1   ScreenName     30867 non-null  int64 
 2   Location       24432 non-null  object
 3   TweetAt        30867 non-null  object
 4   OriginalTweet  30867 non-null  object
 5   Sentiment      30867 non-null  object
dtypes: int64(2), object(4)
memory usage: 1.6+ MB


In [6]:
test_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10290 entries, 25885 to 4497
Data columns (total 6 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   UserName       10290 non-null  int64 
 1   ScreenName     10290 non-null  int64 
 2   Location       8135 non-null   object
 3   TweetAt        10290 non-null  object
 4   OriginalTweet  10290 non-null  object
 5   Sentiment      10290 non-null  object
dtypes: int64(2), object(4)
memory usage: 562.7+ KB


The `Location` column contains null values, but the model does not use that column, so no imputation is needed.

## Preprocessing

In [7]:
#preparing the predictor and target data for feeding into model
train_input= train_data['OriginalTweet'].copy()
train_label=train_data['Sentiment'].copy()

test_input= test_data['OriginalTweet'].copy()
test_label=test_data['Sentiment'].copy()

In [8]:
#getting all the unique type of labels we are working with
train_label.unique()

array(['Neutral', 'Negative', 'Positive', 'Extremely Negative',
       'Extremely Positive'], dtype=object)

**Encoding the labels**

Here we will use only three identifiers for classification:


```
-1 = Negative [include Extremely Negative also]
0 = Neutral
1 = Positive  [include Extremely Positive also]
```




In [9]:
sentiment_enco= {"Extremely Positive":1,
                 "Positive":1,
                 'Neutral':0,
                 'Negative':-1,
                 'Extremely Negative':-1
                 }
# map() returns a new Series with integer labels, so the result is assigned back
train_label = train_label.map(sentiment_enco)
test_label = test_label.map(sentiment_enco)

In [10]:
"""Remove URLs and mentions, since user names should not be an indicator of sentiment.
Keep the words from hashtags, because they can still contribute to a meaningful sentence."""

import re
mod = lambda x : re.sub('[^a-zA-Z]',' ', re.sub(r'@\S+',' ',re.sub(r'http\S+',' ',x))).lower()
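
To see what this cleaning step does, the short sketch below applies the same three substitutions, in the same order, to a made-up tweet. The sample text and the helper name `clean_tweet` are illustrative assumptions and are not part of the dataset or the notebook.

```python
import re

def clean_tweet(text):
    """Same substitutions as the `mod` lambda, written out step by step."""
    text = re.sub(r'http\S+', ' ', text)    # drop URLs
    text = re.sub(r'@\S+', ' ', text)       # drop @mentions
    text = re.sub(r'[^a-zA-Z]', ' ', text)  # turn digits, '#' and punctuation into spaces
    return text.lower()

# Hypothetical tweet, used only for illustration
sample = "@someone Stay home! https://t.co/xyz #StaySafe 2020"
cleaned = clean_tweet(sample)

# split() collapses the runs of spaces left behind by the substitutions
print(cleaned.split())
```

Only the `#` symbol of the hashtag is removed, so the hashtag word itself survives, while the mention, the link, and the digits disappear.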

In [11]:
"""First tokenize each post, then remove the stopwords, and finally lemmatize the remaining words."""

import nltk
from nltk.stem.wordnet import WordNetLemmatizer
nltk.download('wordnet')
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('omw-1.4')
wordnet = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))  # build the stopword set once instead of once per token
rmstop = lambda x : ' '.join([ wordnet.lemmatize(w) for w in nltk.word_tokenize(x) if w not in stop_words])

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


Applying the lambda functions to the dataset

In [12]:
train_input=train_input.apply(mod)
train_input=train_input.apply(rmstop)

In [13]:
test_input=test_input.apply(mod)
test_input=test_input.apply(rmstop)

## Model Building

Vectorizing the words with the TF-IDF method. Each tweet becomes a vector over the vocabulary, and each word is weighted by how often it occurs in that tweet and how rare it is across the entire dataset.
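
The weighting idea can be illustrated without any library. The block below is a minimal pure-Python sketch, assuming the smoothed IDF and L2 normalisation that scikit-learn applies by default; the three-document corpus and the helper `tfidf` are hypothetical and exist only for this illustration.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, already cleaned and lower-cased)
docs = [
    "covid case rise",
    "covid case fall",
    "covid vaccine arrive",
]

tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word occur?
doc_freq = Counter(word for doc in tokenized for word in set(doc))

# Smoothed inverse document frequency: rare words get a larger value
idf = {w: math.log((1 + n_docs) / (1 + doc_freq[w])) + 1 for w in doc_freq}

def tfidf(doc_tokens):
    """Return L2-normalised TF-IDF weights for one tokenized document."""
    tf = Counter(doc_tokens)
    raw = {w: tf[w] * idf[w] for w in tf}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {w: v / norm for w, v in raw.items()}

weights = tfidf(tokenized[2])
for word, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{word:8s} {weight:.3f}")
```

In this toy corpus "covid" occurs in every document, so it receives a lower weight than "vaccine" or "arrive", which occur in only one.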

In [14]:
from sklearn.feature_extraction.text import TfidfVectorizer
cv=TfidfVectorizer(max_features=5000)
X= cv.fit_transform(train_input).toarray()

Using a multilayer perceptron classifier to build an artificial neural network model that predicts whether a post is negative, neutral, or positive

In [15]:
from sklearn.neural_network import MLPClassifier
nn_model = MLPClassifier(hidden_layer_sizes=(50,))  # a single hidden layer with 50 units
nn_model.fit(X,train_label)

## Model Evaluation

In [16]:
#confusion matrix for prediction analysis
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

predicted_label = nn_model.predict(cv.transform(test_input))
pd.DataFrame(confusion_matrix(test_label,predicted_label),
             index=['Negative','Neutral','Positive'],columns=['Negative','Neutral','Positive'])

| Actual \ Predicted | Negative | Neutral | Positive |
|---|---|---|---|
| Negative | 2968 | 366 | 478 |
| Neutral | 316 | 1356 | 289 |
| Positive | 494 | 331 | 3692 |


In [19]:
print(classification_report(test_label,predicted_label))

              precision    recall  f1-score   support

          -1       0.79      0.78      0.78      3812
           0       0.66      0.69      0.68      1961
           1       0.83      0.82      0.82      4517

    accuracy                           0.78     10290
   macro avg       0.76      0.76      0.76     10290
weighted avg       0.78      0.78      0.78     10290



In [20]:
nn_model.score(cv.transform(test_input),test_label)

0.7790087463556852
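
As a cross-check, the per-class precision and recall and the overall accuracy can be recomputed by hand from the confusion matrix above. This is a small sketch in plain Python; the variable names are our own, and the matrix values are copied from the output shown earlier (rows are actual labels, columns are predicted labels).

```python
# Confusion matrix copied from the output above: rows = actual, columns = predicted
labels = ['Negative', 'Neutral', 'Positive']
cm = [
    [2968, 366, 478],
    [316, 1356, 289],
    [494, 331, 3692],
]

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(labels)))  # diagonal = correct predictions
accuracy = correct / total

precision = {}
recall = {}
for i, name in enumerate(labels):
    predicted_as_i = sum(cm[r][i] for r in range(len(labels)))  # column sum
    actually_i = sum(cm[i])                                     # row sum
    precision[name] = cm[i][i] / predicted_as_i
    recall[name] = cm[i][i] / actually_i
    print(f"{name:8s} precision={precision[name]:.2f} recall={recall[name]:.2f}")

print(f"accuracy={accuracy:.4f}")
```

Rounded to two decimals, these values agree with the classification report, and the accuracy matches the value returned by `nn_model.score`.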

## Miscellaneous

In [21]:
# in this cell we test arbitrary tweets to see whether the model labels them positive, negative or neutral
sr = pd.Series(["Number of covid cases started to go down recently in India #covid19"])  # put your own text here
x=sr.apply(mod)
x=x.apply(rmstop)

for i in nn_model.predict(cv.transform(x)):
  if i>0:
    print('Positive')
  elif i<0:
    print('Negative')
  else :
    print('Neutral')

Positive
