## Problem 3

In this case study, you are given Twitter data collected from an anonymous Twitter handle.
Using a Naïve Bayes model, predict whether a given tweet reports a real disaster (real) or not (fake).
1 = real tweet and 0 = fake tweet

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer,TfidfTransformer

# Loading the data set
tweets = pd.read_csv("Disaster_tweets_NB.csv",encoding = "ISO-8859-1")

In [2]:
tweets

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1
...,...,...,...,...,...
7608,10869,,,Two giant cranes holding a bridge collapse int...,1
7609,10870,,,@aria_ahrary @TheTawniest The out of control w...,1
7610,10871,,,M1.94 [01:04 UTC]?5km S of Volcano Hawaii. htt...,1
7611,10872,,,Police investigating after an e-bike collided ...,1


In [3]:
tweets.shape

(7613, 5)

In [4]:
tweets['location'].isnull().sum()

2533

In [5]:
tweets['location'].nunique()

3341

location is null in 2,533 of the 7,613 rows (about 33%) and takes 3,341 distinct values in the remaining rows, so it is unlikely to give additional information. Hence we drop the column.
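The two checks above (null count and distinct count) can be computed for every column at once. A minimal sketch on a made-up toy frame; the column values here are invented for illustration and are not taken from the tweets file.

```python
import pandas as pd

# Toy frame standing in for the tweets data (values are made up)
toy = pd.DataFrame({
    "location": ["London", None, "NYC", None, "Lagos", "London"],
    "keyword": ["fire", "flood", None, "fire", "quake", "flood"],
})

# Fraction of missing values and number of distinct non-null values per column
summary = pd.DataFrame({
    "null_fraction": toy.isnull().mean(),
    "n_unique": toy.nunique(),
})
print(summary)
```

A column with a high null fraction and nearly as many distinct values as non-null rows, like location here, is a reasonable candidate to drop.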

In [15]:
# we can drop location as it does not provide any additional info
tweets.drop(['location'],axis=1,inplace=True)

In [8]:
tweets['keyword'].nunique()

221

In [7]:
# only a small number of rows have a null keyword, so we can delete them
tweets['keyword'].isnull().sum()

61

In [10]:
tweets[tweets['keyword'].isnull()].index

Int64Index([   0,    1,    2,    3,    4,    5,    6,    7,    8,    9,   10,
              11,   12,   13,   14,   15,   16,   17,   18,   19,   20,   21,
              22,   23,   24,   25,   26,   27,   28,   29,   30, 7583, 7584,
            7585, 7586, 7587, 7588, 7589, 7590, 7591, 7592, 7593, 7594, 7595,
            7596, 7597, 7598, 7599, 7600, 7601, 7602, 7603, 7604, 7605, 7606,
            7607, 7608, 7609, 7610, 7611, 7612],
           dtype='int64')

In [11]:
# Get names of indexes for which column keyword has value null
indexNames = tweets[tweets['keyword'].isnull()].index

# Delete these row indexes from dataFrame
tweets.drop(indexNames , inplace=True)

In [12]:
tweets['keyword'].isnull().any()

False
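The index lookup followed by drop can also be written as a single dropna call restricted to one column. A minimal sketch on a made-up toy frame (the rows are invented for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "keyword": [None, "ablaze", "ablaze", None],
    "text": ["a", "b", "c", "d"],
})

# subset limits the null check to 'keyword'; rows with nulls elsewhere are kept
toy_clean = toy.dropna(subset=["keyword"])
print(toy_clean.index.tolist())
```

As with the drop used in the notebook, the surviving rows keep their original index labels, which is why the cleaned tweets start at index 31.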

In [13]:
# we can drop id as it does not provide any additional info
tweets.drop(['id'],axis=1,inplace=True)

In [16]:
tweets.head()

Unnamed: 0,keyword,text,target
31,ablaze,@bbcmtd Wholesale Markets ablaze http://t.co/l...,1
32,ablaze,We always try to bring the heavy. #metal #RT h...,0
33,ablaze,#AFRICANBAZE: Breaking news:Nigeria flag set a...,1
34,ablaze,Crying out for more! Set me ablaze,0
35,ablaze,On plus side LOOK AT THE SKY LAST NIGHT IT WAS...,0


In [17]:
# cleaning data
import re
# Load the custom-built stop words (one word per line)
with open("stopwords_en.txt","r") as sw:
    stop_words = sw.read().split("\n")


In [18]:
stop_words

['a',
 "a's",
 'able',
 'about',
 'above',
 'according',
 'accordingly',
 'across',
 'actually',
 'after',
 'afterwards',
 'again',
 'against',
 "ain't",
 'all',
 'allow',
 'allows',
 'almost',
 'alone',
 'along',
 'already',
 'also',
 'although',
 'always',
 'am',
 'among',
 'amongst',
 'an',
 'and',
 'another',
 'any',
 'anybody',
 'anyhow',
 'anyone',
 'anything',
 'anyway',
 'anyways',
 'anywhere',
 'apart',
 'appear',
 'appreciate',
 'appropriate',
 'are',
 "aren't",
 'around',
 'as',
 'aside',
 'ask',
 'asking',
 'associated',
 'at',
 'available',
 'away',
 'awfully',
 'b',
 'be',
 'became',
 'because',
 'become',
 'becomes',
 'becoming',
 'been',
 'before',
 'beforehand',
 'behind',
 'being',
 'believe',
 'below',
 'beside',
 'besides',
 'best',
 'better',
 'between',
 'beyond',
 'both',
 'brief',
 'but',
 'by',
 'c',
 "c'mon",
 "c's",
 'came',
 'can',
 "can't",
 'cannot',
 'cant',
 'cause',
 'causes',
 'certain',
 'certainly',
 'changes',
 'clearly',
 'co',
 'com',
 'come',
 'c

In [19]:
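def cleaning_text(i):
    # replace every run of non-letter characters (digits, punctuation) with a space, then lowercase
    i = re.sub("[^A-Za-z]+"," ",i).lower()
    # keep only words longer than 3 characters
    w = []
    for word in i.split(" "):
        if len(word)>3:
            w.append(word)
    return (" ".join(w))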


In [20]:
# testing above function with sample text => removes punctuation, numbers and words of 3 characters or fewer
cleaning_text("Hope you are having a good week. Just checking in")

'hope having good week just checking'

In [9]:
cleaning_text("hope i can understand your feelings 123121. 123 hi how .. are you?")

'hope understand your feelings'

In [10]:
cleaning_text("Hi how are you, I am good")

'good'
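cleaning_text filters by word length only: the stop_words list loaded earlier is not used, and links are broken into letter fragments rather than removed. A hedged sketch of a variant that handles both; clean_with_stopwords and demo_stop_words are hypothetical names, and the tiny stop-word set is a stand-in for the contents of stopwords_en.txt.

```python
import re

# Hypothetical small stop-word set; the notebook's list comes from stopwords_en.txt
demo_stop_words = {"this", "that", "with", "from", "have", "just"}

def clean_with_stopwords(text, stop_words=demo_stop_words, min_len=4):
    """Lowercase, strip URLs and non-letters, drop short words and stop words."""
    text = re.sub(r"https?://\S+", " ", text)        # remove links before stripping punctuation
    text = re.sub(r"[^A-Za-z]+", " ", text).lower()   # keep letters only
    words = [w for w in text.split() if len(w) >= min_len and w not in stop_words]
    return " ".join(words)

print(clean_with_stopwords("Just got this photo from Ruby #Alaska http://t.co/abc123"))
# prints: photo ruby alaska
```

Using split() without an argument also avoids empty tokens when the text starts or ends with a space.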

In [21]:
tweets['text']

31      @bbcmtd Wholesale Markets ablaze http://t.co/l...
32      We always try to bring the heavy. #metal #RT h...
33      #AFRICANBAZE: Breaking news:Nigeria flag set a...
34                     Crying out for more! Set me ablaze
35      On plus side LOOK AT THE SKY LAST NIGHT IT WAS...
                              ...                        
7578     @jt_ruff23 @cameronhacker and I wrecked you both
7579    Three days off from work and they've pretty mu...
7580    #FX #forex #trading Cramer: Iger's 3 words tha...
7581    @engineshed Great atmosphere at the British Li...
7582    Cramer: Iger's 3 words that wrecked Disney's s...
Name: text, Length: 7552, dtype: object

In [22]:
tweets.text = tweets.text.apply(cleaning_text)

In [23]:
tweets['text']

31          bbcmtd wholesale markets ablaze http lhyxeohy
32                     always bring heavy metal http xngw
33      africanbaze breaking news nigeria flag ablaze ...
34                                     crying more ablaze
35         plus side look last night ablaze http qqsmshaj
                              ...                        
7578                      ruff cameronhacker wrecked both
7579    three days from work they pretty much been wre...
7580    forex trading cramer iger words that wrecked d...
7581    engineshed great atmosphere british lion tonig...
7582    cramer iger words that wrecked disney stock cn...
Name: text, Length: 7552, dtype: object

In [25]:
tweets.shape

(7552, 3)

In [26]:
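# no blank rows (note: cleaning_text returns "" rather than " " when no word survives, so comparing against "" would be the stricter check)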
tweets.loc[tweets.text != " ",:].shape

(7552, 3)

In [40]:
X = np.array(tweets.iloc[:,:2]) # Predictors 
Y = np.array(tweets.iloc[:,2]) # Target

In [41]:
# CountVectorizer
# Convert a collection of text documents to a matrix of token counts

# splitting data into train and test data sets 
from sklearn.model_selection import train_test_split
#tweets_train, tweets_test = train_test_split(tweets, test_size = 0.2)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2)
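The split above draws a different random partition on every run, so the accuracies reported later will vary slightly between runs. A minimal sketch of a reproducible, class-balanced split on made-up arrays; random_state=42 and the toy data are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples with 2 features each, 10 per class
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 10 + [1] * 10)

# random_state fixes the shuffle; stratify keeps the class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
print(X_tr.shape, X_te.shape, sorted(y_te.tolist()))
```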

In [43]:
X_train

array([['seismic',
        'exploration takes seismic shift gabon somalia bloomberg http bekrpjnyhs somalia'],
       ['screamed', 'mogacola zamtriossu screamed after hitting tweet'],
       ['drown',
        'this weekend nathan birthday weekend want drown yourself beer reckless things potentially'],
       ...,
       ['sirens', 'know dill pickle when taste'],
       ['outbreak',
        'person dies legionnaires disease outbreak http fjdm qhyai sebee'],
       ['thunderstorm',
      dtype=object)

In [44]:
# tokenizer for CountVectorizer: split a cleaned tweet into words on single spaces
def split_into_words(i):
    return [word for word in i.split(" ")]

In [46]:
# Defining the preparation of tweet texts into word count matrix format - Bag of Words
tweets_bow = CountVectorizer(analyzer = split_into_words).fit(tweets.text)

In [47]:
tweets_bow

CountVectorizer(analyzer=<function split_into_words at 0x000001927DF4D790>)

In [48]:
# Defining BOW for all tweets
all_tweets_bow_matrix = tweets_bow.transform(tweets.text)
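To see what the bag-of-words matrix contains, here is a small sketch with two made-up documents and a space-splitting analyzer like the one above. split_on_space and the documents are illustrative stand-ins.

```python
from sklearn.feature_extraction.text import CountVectorizer

def split_on_space(doc):
    # same idea as split_into_words: one token per space-separated word
    return doc.split(" ")

docs = ["forest fire near town", "fire fire everywhere"]
bow = CountVectorizer(analyzer=split_on_space).fit(docs)
matrix = bow.transform(docs)

print(sorted(bow.vocabulary_))   # learned vocabulary, one column per word
print(matrix.toarray())          # one row per document, entries are word counts
```

Each column corresponds to one vocabulary word in alphabetical order, so the second row has a 2 in the "fire" column.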

In [62]:
X_train[:,1]

array(['exploration takes seismic shift gabon somalia bloomberg http bekrpjnyhs somalia',
       'mogacola zamtriossu screamed after hitting tweet',
       'this weekend nathan birthday weekend want drown yourself beer reckless things potentially',
       ..., 'know dill pickle when taste',
       'person dies legionnaires disease outbreak http fjdm qhyai sebee',
      dtype=object)

In [63]:
# For training tweets
train_tweets_matrix = tweets_bow.transform(X_train[:,1])

In [64]:
# For testing tweets
test_tweets_matrix = tweets_bow.transform(X_test[:,1])

In [65]:
test_tweets_matrix.shape

(1511, 19237)

In [66]:
# Learning term weighting and normalization on all tweets
tfidf_transformer = TfidfTransformer().fit(all_tweets_bow_matrix)

In [67]:
# Preparing TFIDF for train tweets
train_tfidf = tfidf_transformer.transform(train_tweets_matrix)
train_tfidf.shape # (row, column)

(6041, 19237)

In [68]:
# Preparing TFIDF for test tweets
test_tfidf = tfidf_transformer.transform(test_tweets_matrix)
test_tfidf.shape #  (row, column)

(1511, 19237)

In [69]:
# Preparing a naive bayes model on training data set 
from sklearn.naive_bayes import MultinomialNB as MB

# Multinomial Naive Bayes
classifier_mb = MB()
classifier_mb.fit(train_tfidf, Y_train)

MultinomialNB()
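The vocabulary and the TF-IDF weights above were fitted on all tweets, including those later used for testing. One way to keep test data out of the fitting step is a scikit-learn Pipeline fitted on the training split only. A minimal sketch with made-up documents; the toy texts and step names are assumptions, and it uses CountVectorizer's default tokenizer rather than split_into_words.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up training and test documents (1 = real disaster, 0 = not)
train_docs = ["forest fire near town", "love this song",
              "flood destroys bridge", "great movie tonight"]
train_labels = [1, 0, 1, 0]
test_docs = ["fire destroys town", "unseenword"]

pipe = Pipeline([
    ("bow", CountVectorizer()),      # vocabulary learned from training docs only
    ("tfidf", TfidfTransformer()),   # IDF weights learned from training docs only
    ("nb", MultinomialNB()),
])
pipe.fit(train_docs, train_labels)
preds = pipe.predict(test_docs)      # unseen words are simply ignored at predict time
print(preds)
```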

In [70]:
# Evaluation on Test Data
test_pred_m = classifier_mb.predict(test_tfidf)
accuracy_test_m = np.mean(test_pred_m == Y_test)
accuracy_test_m

0.784910655195235

In [71]:
from sklearn.metrics import accuracy_score
accuracy_score(test_pred_m, Y_test) 

0.784910655195235

78% accuracy on test data

In [72]:
pd.crosstab(test_pred_m, Y_test)

predicted \ actual,0,1
0,779,244
1,81,407


In [None]:
Mistakes made by the model on test data (taking 1 = real as the positive class):
    
1. model predicted 81 fake tweets as real. (False Positive)
2. model predicted 244 real tweets as fake. (False Negative)
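Because pd.crosstab(test_pred_m, Y_test) puts predictions in rows and actual labels in columns, the unlabeled table is easy to misread. A small sketch on made-up label arrays that names both axes via the rownames and colnames arguments, treating 1 (real) as the positive class:

```python
import numpy as np
import pandas as pd

# Made-up labels for illustration
actual = np.array([1, 0, 1, 1, 0, 0, 1])
predicted = np.array([1, 0, 0, 1, 1, 0, 0])

table = pd.crosstab(predicted, actual, rownames=["predicted"], colnames=["actual"])
print(table)

fp = table.loc[1, 0]  # predicted real (1), actually fake (0): false positive
fn = table.loc[0, 1]  # predicted fake (0), actually real (1): false negative
print(fp, fn)
```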

In [73]:
# Training Data accuracy
train_pred_m = classifier_mb.predict(train_tfidf)
accuracy_train_m = np.mean(train_pred_m == Y_train)
accuracy_train_m

0.9036583347127959

90% accuracy on training data, about 12 points above the test accuracy, which suggests some overfitting

In [92]:
# Multinomial Naive Bayes with a non-default alpha for additive (Laplace) smoothing
# alpha = 0 means no smoothing is applied; the default alpha is 1
# smoothing avoids zero probabilities for words that never occur with a class in the training data
classifier_mb_lap = MB(alpha = 2)
classifier_mb_lap.fit(train_tfidf, Y_train)


MultinomialNB(alpha=2)
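The effect of alpha can be seen in the smoothed likelihood estimate that multinomial Naive Bayes uses, P(word | class) = (count + alpha) / (total + alpha * vocabulary size). A minimal sketch with made-up word counts for a single class; smoothed_likelihood is a hypothetical helper, not part of scikit-learn.

```python
import numpy as np

def smoothed_likelihood(word_counts, alpha):
    """P(word | class) = (count + alpha) / (total + alpha * vocabulary_size)."""
    word_counts = np.asarray(word_counts, dtype=float)
    return (word_counts + alpha) / (word_counts.sum() + alpha * word_counts.size)

counts = [3, 1, 0]   # toy counts of three words within one class
for alpha in (1, 2):
    print(alpha, smoothed_likelihood(counts, alpha))
```

A larger alpha pulls the estimates toward a uniform distribution: the unseen third word gets probability 1/7 with alpha = 1 and 0.2 with alpha = 2.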

In [94]:
# Training Data accuracy
train_pred_lap = classifier_mb_lap.predict(train_tfidf)
accuracy_train_lap = np.mean(train_pred_lap == Y_train)
accuracy_train_lap


0.8750206919384208

In [95]:
# Evaluation on Test Data after applying laplace
test_pred_lap = classifier_mb_lap.predict(test_tfidf)
accuracy_test_lap = np.mean(test_pred_lap == Y_test)
accuracy_test_lap

0.7902051621442753

79% accuracy on test data with Laplace smoothing alpha = 2. This model classifies about 79% of the test tweets correctly.

In [96]:
from sklearn.metrics import accuracy_score
accuracy_score(test_pred_lap, Y_test)

0.7902051621442753

In [97]:
#confusion matrix
pd.crosstab(test_pred_lap, Y_test)

predicted \ actual,0,1
0,803,260
1,57,391


Mistakes made by the model on test data (taking 1 = real as the positive class):
    
1. model predicted 57 fake tweets as real. (False Positive)
2. model predicted 260 real tweets as fake. (False Negative)

Though accuracy is slightly better with Laplace smoothing alpha = 2, the model with the default alpha = 1 makes fewer False Negatives (244 vs 260), while alpha = 2 makes fewer False Positives (57 vs 81).
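Accuracy alone hides this trade-off between the two kinds of mistakes. A short sketch that derives accuracy, precision and recall from the counts in the two confusion matrices above, with 1 (real) as the positive class; summarize is a hypothetical helper name.

```python
def summarize(tn, fn, fp, tp):
    """Accuracy, precision and recall from confusion-matrix counts."""
    total = tn + fn + fp + tp
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),   # share of predicted-real tweets that are real
        "recall": tp / (tp + fn),      # share of real tweets that were found
    }

# Counts copied from the two crosstab outputs
default_alpha = summarize(tn=779, fn=244, fp=81, tp=407)
alpha_two = summarize(tn=803, fn=260, fp=57, tp=391)

for name, metrics in [("alpha=1", default_alpha), ("alpha=2", alpha_two)]:
    print(name, {k: round(v, 3) for k, v in metrics.items()})
```

With these counts, alpha = 2 has the higher precision and the default alpha = 1 has the higher recall, so the better choice depends on which mistake is more costly.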