# Twitter Sentiment analysis of Demonetization in India - Group 19

This script was developed by:
Deepak Khirey
Himanshu Gupta
Aravind mannarswamy

Version Control -
V1.0 April 23 2018

Description -
1. This script takes a Twitter corpus of demonetization tweets in pickle format as input.
2. Performs data cleaning operations.
3. Compares classifiers by cross-validation accuracy.
4. Annotates the tweet corpus DataFrame with sentiment polarity using a supervised machine learning method.
5. Performs visualization in plotly.

### Import Packages

This is the import block. It makes sure all packages are imported before the rest of the script executes.
It uses widely used Python packages like pandas, numpy and scikit-learn.
We have also used some external packages for this project which need to be installed explicitly.

1. guess_language
https://pypi.org/project/guess-language/
pip install guess-language

2. tweet_preprocessor
https://pypi.org/project/tweet-preprocessor/
pip install tweet-preprocessor

3. plotly
pip install plotly

If you face an error while installing tweet-preprocessor, please refer to https://github.com/s/preprocessor/issues/16
A fixed zip, tweet-preprocessor-0.4.0.zip, is included along with the submission.
It can be installed using
pip install <path to compressed file>
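
Before running the import cell, it can help to check which of the packages are missing. The following is a minimal sketch, not part of the original script: the helper name `missing_modules` is hypothetical, and the module names are taken from the import statements in this notebook.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


# Module names as they are imported in this notebook
required = ["pandas", "numpy", "sklearn", "preprocessor", "guess_language", "plotly"]
print("Missing modules:", missing_modules(required))
```

Any name that is printed still needs to be installed with pip as described above.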

In [1]:
import pandas as pd
import pickle as pkl
import os
import preprocessor as p
import re
from guess_language import guess_language
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
import plotly
import plotly.plotly as py
import plotly.graph_objs as go
import numpy as np
plotly.tools.set_credentials_file(username='iuproject', api_key='m3AGEhqrwClGaYCf7zqV')

### Reading Corpus
The corpus is stored as pickle files in a directory structure.
All pickle files under the current directory are read, and their contents are loaded into a single DataFrame.

In [2]:
# Read pickle files and collect their records in a list
tweets = []  # empty list
for root, subdirs, files in os.walk("."):
    for filename in files:
        if filename.endswith(".pkl"):
            file_path = os.path.join(root, filename)
            with open(file_path, "rb") as pkl_file:  # the file is closed once it has been loaded
                tweets.extend(pkl.load(pkl_file))  # appending pickle file content to the list


tweet_df = pd.DataFrame.from_dict(tweets) # list of records to dataframe

tweet_df.to_csv("demonetizationtweetsraw.csv", sep='|', encoding='utf-8') # saving dataframe as csv file
tweet_df.shape

(1605240, 11)

### Data Cleaning
This is the data cleaning step. It removes duplicate tweets, retweets, non-English tweets and very short tweets, strips unnecessary text from the rest, and produces a clean corpus. This may take several minutes. Please be patient.

In [3]:
dedup_tweet_df = tweet_df.drop_duplicates(subset=['tweet_id']) # removing duplicates based on tweet_id
dedup_tweet_df = dedup_tweet_df[dedup_tweet_df['text'].str.startswith('RT') == False] # removing Retweets

dedup_tweet_df = dedup_tweet_df.replace({'# ': '#'}, regex=True) # removing space after #
dedup_tweet_df = dedup_tweet_df.replace({'@ ': '@'}, regex=True)# removing space after @
dedup_tweet_df['clean_text'] = dedup_tweet_df['text'].apply(p.clean)

cleantext = []
for text in dedup_tweet_df['clean_text']:
    text = re.sub(r'[^a-zA-Z ]', '', text) # keeping only english letters
    cleantext.append(text)
dedup_tweet_df['clean_text'] = cleantext

dedup_tweet_df['clean_text'] = dedup_tweet_df['clean_text'].str.replace(r'\W+', ' ', regex=True) # collapsing runs of non-word characters (here, repeated spaces) into one space
dedup_tweet_df['clean_text'] = dedup_tweet_df['clean_text'].str.lower() # converting all to lowercase


lng = []
for text in dedup_tweet_df['clean_text']:
    lng.append(guess_language(text)) # using guess language
dedup_tweet_df['lng'] = lng
dedup_tweet_df = dedup_tweet_df[(dedup_tweet_df['lng']=="en")==True] # keeping only english language tweets

dedup_tweet_df = dedup_tweet_df[(dedup_tweet_df['clean_text'].str.len() <= 10) == False] #removing null and short strings

dedup_tweet_df = dedup_tweet_df[['date','favorites','retweets','term','clean_text','tweet_id']] # keeping only required columns
dedup_tweet_df.to_csv("demonetizationtweetsclean.csv", sep='|', encoding='utf-8') # saving dataframe as csv file
dedup_tweet_df.shape

(389312, 6)
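
The regex part of the cleaning cell can be tried on a couple of strings without loading the corpus. This is a minimal sketch on made-up sample text; the helper name `basic_clean` is hypothetical, and the `p.clean` and `guess_language` steps are deliberately left out because they need the external packages.

```python
import re


def basic_clean(text):
    """Keep English letters and spaces, collapse repeated spaces, convert to lowercase."""
    text = re.sub(r'[^a-zA-Z ]', '', text)  # drop digits, punctuation and other characters
    text = re.sub(r'\W+', ' ', text)        # collapse runs of spaces into a single space
    return text.lower()


# Made-up examples, not taken from the corpus
samples = [
    "Great Step of   Closure of 500 and 1000 rupee notes!",
    "Ok!!! 500",
]
cleaned = [basic_clean(sample) for sample in samples]

# Same threshold as the cleaning cell: strings of 10 characters or fewer are dropped
kept = [text for text in cleaned if len(text) > 10]
print(kept)
```

Only the first sample survives, because the second one shrinks to a few characters once digits and punctuation are removed.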

### Data Analysis
Reading the training dataset, which is manually annotated for sentiment polarity. The structure of the dataset is the same as that of sentiment140.

In [4]:
mynewtrainingclean = pd.read_csv("demonetizationtrainingdata.csv", names=["polarity", "tid", "date", "query", "user", "text"], encoding="latin1")
mynewtrainingclean.head()

Unnamed: 0,polarity,tid,date,query,user,text
0,polarity,tid,date,query,user,text
1,4,796082258591432000,2016-11-08 15:09:35,No_QUERY,GROUP19,"# ModiJi doesn't give a DAMN about U, Ur RTI a..."
2,4,796071624894218000,2016-11-08 14:27:20,No_QUERY,GROUP19,great Step of Closure of 500 and 1000 rupee no...
3,2,796047865646678000,2016-11-08 12:52:55,No_QUERY,GROUP19,"# Demonetisation of Rs 500, Rs 1000 notes: DO ..."
4,2,796031200129351000,2016-11-08 11:46:42,No_QUERY,GROUP19,Will this # Demonetisation of Rs 500 & 1000 no...
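
Row 0 of the preview repeats the column names, which suggests that the CSV file has its own header line and that passing `names=` alone makes pandas read that line as data. A stray row like this would also be consistent with the "least populated class in y has only 1 members" warning printed by the classifier cells. The sketch below uses made-up rows and assumes that this is what happens; it shows how `header=0` tells `read_csv` to replace the header line of the file instead of keeping it as a record.

```python
from io import StringIO

import pandas as pd

# Made-up file content with its own header line (an assumption about the real file)
csv_text = (
    "polarity,tid,date,query,user,text\n"
    "4,1,2016-11-08 15:09:35,No_QUERY,GROUP19,sample positive text\n"
    "2,2,2016-11-08 14:27:20,No_QUERY,GROUP19,sample neutral text\n"
)
columns = ["polarity", "tid", "date", "query", "user", "text"]

# names= without header=: the header line of the file becomes the first data row
with_header_row = pd.read_csv(StringIO(csv_text), names=columns)

# names= with header=0: the header line of the file is replaced by the given names
without_header_row = pd.read_csv(StringIO(csv_text), names=columns, header=0)

print(with_header_row["polarity"].tolist())
print(without_header_row["polarity"].tolist())
```

In the first frame the polarity column contains the text "polarity" as an extra class with a single member; in the second it contains only the numeric labels.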


Executing Support Vector Machine classifier on training dataset.

In [5]:
tv = TfidfVectorizer(min_df=10,lowercase=True,stop_words='english')
X = tv.fit_transform(mynewtrainingclean['text'])
y = mynewtrainingclean['polarity']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=3057)
clf = SVC(kernel="linear", verbose=3)
clf.fit(X_train, y_train)
cv_scores = cross_val_score(clf, X, y,cv=5)
print(cv_scores)

[LibSVM]


The least populated class in y has only 1 members, which is too few. The minimum number of members in any class cannot be less than n_splits=5.



[LibSVM][LibSVM][LibSVM][LibSVM][LibSVM][0.82176092 0.89541547 0.91117479 0.91176471 0.89096126]


Executing Logistic Regression classifier on training dataset.

In [6]:
tv1 = TfidfVectorizer(min_df=10,lowercase=True,stop_words='english')
X1 = tv1.fit_transform(mynewtrainingclean['text'])
y1 = mynewtrainingclean['polarity']
X1_train, X1_test, y1_train, y1_test = train_test_split(X1, y1, test_size=0.25, random_state=3057)
clf1 = LogisticRegression()
clf1.fit(X1_train, y1_train)
cv_scores = cross_val_score(clf1, X1, y1,cv=5)
print(cv_scores)


The least populated class in y has only 1 members, which is too few. The minimum number of members in any class cannot be less than n_splits=5.



[0.84395132 0.89326648 0.89971347 0.89885222 0.88880918]


Executing K Nearest Neighbor classifier on training dataset.

In [7]:
tv2 = TfidfVectorizer(min_df=10,lowercase=True,stop_words='english')
X2 = tv2.fit_transform(mynewtrainingclean['text'])
y2 = mynewtrainingclean['polarity']
X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y2, test_size=0.25, random_state=3057)
clf2 = KNeighborsClassifier(n_neighbors=15)
clf2.fit(X2_train, y2_train)
cv_scores = cross_val_score(clf2, X2, y2,cv=5)
print(cv_scores)


The least populated class in y has only 1 members, which is too few. The minimum number of members in any class cannot be less than n_splits=5.



[0.77308518 0.79297994 0.82234957 0.8507891  0.80703013]
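
The three sets of fold scores printed above are easier to compare through their means. The following is a small sketch, not part of the original script, that copies the printed scores and averages them; the dictionary and variable names are chosen for illustration only.

```python
from statistics import mean

# Fold accuracies copied from the three cell outputs above
cv_results = {
    "Support Vector Machine": [0.82176092, 0.89541547, 0.91117479, 0.91176471, 0.89096126],
    "Logistic Regression": [0.84395132, 0.89326648, 0.89971347, 0.89885222, 0.88880918],
    "K Nearest Neighbor": [0.77308518, 0.79297994, 0.82234957, 0.8507891, 0.80703013],
}

mean_accuracy = {name: mean(scores) for name, scores in cv_results.items()}
best_classifier = max(mean_accuracy, key=mean_accuracy.get)

for name, value in mean_accuracy.items():
    print(f"{name}: {value:.3f}")
print("Highest mean accuracy:", best_classifier)
```

The gap between the Support Vector Machine and Logistic Regression is small compared with the spread between folds, while K Nearest Neighbor is clearly behind both.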


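Since the Support Vector Machine classifier had the highest mean cross-validation accuracy (about 0.886, against about 0.885 for Logistic Regression and about 0.809 for K Nearest Neighbor), we use it to predict sentiment on the entire corpus. This may take several minutes. Please be patient.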

In [8]:
tweet_tfidf = tv.transform(dedup_tweet_df['clean_text'])
tweet_pred = clf.predict(tweet_tfidf)
dedup_tweet_df["sent_pred"] = tweet_pred
dedup_tweet_df.to_csv("demonetizationtweetswithscore.csv", sep='|', encoding='utf-8') # saving dataframe as csv file

Reading the stored sentiment polarity results back from the CSV file.

In [9]:
stored_results = pd.read_csv("demonetizationtweetswithscore.csv", sep='|', encoding='utf-8') # reading the saved results back from the csv file
dedup_tweet_df.head()

Unnamed: 0,date,favorites,retweets,term,clean_text,tweet_id,sent_pred
0,2016-11-08 18:58:13,1,0,#Demonetisation,why are opposing does nodemonetization helps f...,796139793998450688,0
1,2016-11-08 18:57:24,0,0,#Demonetisation,of rs rs notes naidu had inkling of the ban,796139591216599042,0
2,2016-11-08 18:38:58,0,0,#Demonetisation,he said he should have bought gold on dhan tir...,796134950005342211,4
3,2016-11-08 18:10:43,0,0,#Demonetisation,illiterate lots are always forced to delete th...,796127842614579200,0
4,2016-11-08 18:06:38,0,0,#Demonetisation,richbm people in next hrs rushing to hospitals...,796126814745731072,0


### Visualization
We are using the plotly package for our visualizations. We registered a plotly account, which provides 25 graphs for free. We have noticed that plotly sometimes doesn't work in a secured environment behind a firewall, and that the graph limit can be reached if the script is executed multiple times.
Hence we have also created these visualizations in Tableau Public. They can be found at the link below:
https://public.tableau.com/views/SMMFinal/Sheet1?:embed=y&:display_count=yes

Visualization 1- Twitter Trends by Hashtag
Using plotly for getting Twitter volume over timeline.

In [14]:
stored_results['date_notime'] = pd.to_datetime(stored_results['date']).dt.date
data = stored_results.groupby(['date_notime','term']).size().reset_index(name='count')
terms = ['#DeMonetisation', '#Demonetisation', '#demonetisation', '#demonetization', '#Demonetization', '#DeMonetization']
line = []
for term in terms:  # one line trace per hashtag spelling
    term_data = data[data['term'] == term]
    line.append(go.Scatter(x=term_data['date_notime'], y=term_data['count'], mode = 'lines', name = term))
layout = dict(title="Tweet count by date & hashtags",yaxis=dict(title='No of tweets') )
fig = dict(data=line, layout=layout)
py.iplot(fig)

Visualization 2 -
Twitter sentiment polarity over timeline

In [11]:
stored_results['date_notime'] = pd.to_datetime(stored_results['date']).dt.date
data = stored_results.groupby(['date_notime','sent_pred']).size().reset_index(name='count')
data1 = data[data['sent_pred'] == 0]
data2 = data[data['sent_pred'] == 2]
data3 = data[data['sent_pred'] == 4]
Negative = go.Scatter(x=data1['date_notime'], y=data1['count'], mode = 'lines', name = 'Negative')
Neutral = go.Scatter(x=data2['date_notime'], y=data2['count'], mode = 'lines', name = 'Neutral')
Positive = go.Scatter(x=data3['date_notime'], y=data3['count'], mode = 'lines', name = 'Positive')
line = [Neutral, Positive, Negative]
layout = dict(title="Tweet count by date & predicted sentiment",yaxis=dict(title='No of tweets') )
fig = dict(data=line, layout=layout)
py.iplot(fig)
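
The grouping step behind these charts can be tried on a tiny frame without a plotly account. This is a minimal sketch on made-up rows; the label mapping follows the trace names used in the cell above (0 Negative, 2 Neutral, 4 Positive), and the `label` column is added only for illustration.

```python
import pandas as pd

# Made-up rows with the same column names as stored_results
sample = pd.DataFrame({
    "date": ["2016-11-08 18:58:13", "2016-11-08 18:57:24", "2016-11-09 10:00:00"],
    "sent_pred": [0, 0, 4],
})

# Drop the time part so that tweets are counted per calendar day
sample["date_notime"] = pd.to_datetime(sample["date"]).dt.date

# One row per (day, predicted label) pair with the number of tweets in it
counts = sample.groupby(["date_notime", "sent_pred"]).size().reset_index(name="count")
counts["label"] = counts["sent_pred"].map({0: "Negative", 2: "Neutral", 4: "Positive"})
print(counts)
```

Each row of `counts` corresponds to one point on one of the lines in the chart.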

Visualization 3 -
Favorites by Sentiment polarity over timeline

In [12]:
stored_results['date_notime'] = pd.to_datetime(stored_results['date']).dt.date
data = stored_results.groupby(['date_notime','sent_pred'])['favorites'].agg('sum').reset_index(name='sum')
data1 = data[data['sent_pred'] == 0]
data2 = data[data['sent_pred'] == 2]
data3 = data[data['sent_pred'] == 4]
Negative = go.Scatter(x=data1['date_notime'], y=data1['sum'], mode = 'lines', name = 'Negative')
Neutral = go.Scatter(x=data2['date_notime'], y=data2['sum'], mode = 'lines', name = 'Neutral')
Positive = go.Scatter(x=data3['date_notime'], y=data3['sum'], mode = 'lines', name = 'Positive')
line = [Neutral, Positive, Negative]
layout = dict(title="Favorites by date & predicted sentiment",yaxis=dict(title='No of favorites') )
fig = dict(data=line, layout=layout)
py.iplot(fig)

Visualization 4 -
Retweets by Sentiment polarity over timeline

In [13]:
data = stored_results.groupby(['date_notime','sent_pred'])['retweets'].agg('sum').reset_index(name='sum')
data1 = data[data['sent_pred'] == 0]
data2 = data[data['sent_pred'] == 2]
data3 = data[data['sent_pred'] == 4]
Negative = go.Scatter(x=data1['date_notime'], y=data1['sum'], mode = 'lines', name = 'Negative')
Neutral = go.Scatter(x=data2['date_notime'], y=data2['sum'], mode = 'lines', name = 'Neutral')
Positive = go.Scatter(x=data3['date_notime'], y=data3['sum'], mode = 'lines', name = 'Positive')
line = [Neutral, Positive, Negative]
layout = dict(title="Retweets by date & predicted sentiment",yaxis=dict(title='No of retweets') )
fig = dict(data=line, layout=layout)
py.iplot(fig)