# Model the Problem

## Preprocessing the data

In [2]:
import pandas as pd
import re

In [4]:
df = pd.read_csv('tweets.csv', sep="\t")

In [5]:
df.head()

| Unnamed: 0 | created_at | screen_name | text |
|---|---|---|---|
| 0 | Fri Nov 18 23:59:58 +0000 2016 | arunprasad72 | RT @Praveen_1singh: First the stone pelting st... |
| 1 | Fri Nov 18 23:59:49 +0000 2016 | pranavkisu | RT @NewDelhiTimesIN: Is the #demonetization of... |
| 2 | Fri Nov 18 23:59:48 +0000 2016 | bablumohan | RT @scoopwhoopnews: #BREAKING Banks across Ind... |
| 3 | Fri Nov 18 23:59:37 +0000 2016 | NagrathRob | RT @DrGPradhan: .@ravishndtv of @ndtv spreadin... |
| 4 | Fri Nov 18 23:59:28 +0000 2016 | ManishPrasa | RT @YesIamSaffron: जब भी #Demonetization व् का... |


# Topic modelling (LDA - Latent Dirichlet allocation)

In natural language processing, Latent Dirichlet allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word's creation is attributable to one of the document's topics.

Original Paper on LDA - http://jmlr.org/papers/v3/blei03a.html

*Summary - We describe latent Dirichlet allocation (LDA), a generative probabilistic model for collections of discrete data such as text corpora. LDA is a three-level hierarchical Bayesian model, in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities. In the context of text modeling, the topic probabilities provide an explicit representation of a document. We present efficient approximate inference techniques based on variational methods and an EM algorithm for empirical Bayes parameter estimation. We report results in document modeling, text classification, and collaborative filtering, comparing to a mixture of unigrams model and the probabilistic LSI model.*

Here is a graphical approach to build intuition around this topic - http://www.mblondel.org/journal/2010/08/21/latent-dirichlet-allocation-in-python/

Here is a video which explains LDA - https://www.youtube.com/watch?v=ePUAZ8RG-3w

Here is a toned-down introductory article on LDA - http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/

<code>
Of course, we can’t directly observe topics; in reality all we have are documents. Topic modeling is a way of extrapolating backward from a collection of documents to infer the discourses (“topics”) that could have generated them. (The notion that documents are produced by discourses rather than authors is alien to common sense, but not alien to literary theory.) Unfortunately, there is no way to infer the topics exactly: there are too many unknowns. But pretend for a moment that we had the problem mostly solved. Suppose we knew which topic produced every word in the collection, except for this one word in document D. The word happens to be “lead,” which we’ll call word type W. How are we going to decide whether this occurrence of W belongs to topic Z?

We can’t know for sure. But one way to guess is to consider two questions. A) How often does “lead” appear in topic Z elsewhere? If “lead” often occurs in discussions of Z, then this instance of “lead” might belong to Z as well. But a word can be common in more than one topic. And we don’t want to assign “lead” to a topic about leadership if this document is mostly about heavy metal contamination. So we also need to consider B) How common is topic Z in the rest of this document?

Here’s what we’ll do. For each possible topic Z, we’ll multiply the frequency of this word type W in Z by the number of other words in document D that already belong to Z. The result will represent the probability that this word came from Z. 
</code>

<img src="../../img/ldaformula.png"/>
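
The multiplication described in the passage above can be sketched in a few lines of plain Python. This is only an illustration, not the `lda` package's implementation: the counts are made up, and the smoothing constants `alpha` and `beta` are assumptions added so that no topic ends up with a probability of exactly zero.

```python
def topic_probabilities(word_in_topic, words_in_topic, doc_in_topic,
                        vocab_size, alpha=0.1, beta=0.01):
    """Probability that one word occurrence belongs to each topic.

    word_in_topic[z]  : how often word type W is assigned to topic z elsewhere
    words_in_topic[z] : total number of words assigned to topic z
    doc_in_topic[z]   : number of other words in document D assigned to topic z
    alpha, beta       : smoothing constants (assumed values, for illustration)
    """
    weights = []
    for w_z, n_z, d_z in zip(word_in_topic, words_in_topic, doc_in_topic):
        word_freq = (w_z + beta) / (n_z + beta * vocab_size)  # question A
        topic_weight = d_z + alpha                            # question B
        weights.append(word_freq * topic_weight)
    total = sum(weights)
    # Normalize so the weights form a probability distribution over topics.
    return [w / total for w in weights]


# Hypothetical counts for the word "lead":
# topic 0 is about heavy metals, topic 1 is about leadership.
probs = topic_probabilities(
    word_in_topic=[40, 2],
    words_in_topic=[1000, 1000],
    doc_in_topic=[30, 5],
    vocab_size=500,
)
print(probs)
```

With these made-up counts almost all of the probability mass lands on topic 0, because "lead" is both frequent in that topic and the rest of the document already leans towards it.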


In [11]:
import lda
import numpy as np
import lda.datasets
import sklearn.feature_extraction.text as text

### Generating the document term matrix

In [12]:
vectorizer = text.CountVectorizer(input='content', stop_words='english', min_df=1)

In [15]:
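# Replace missing tweets with empty strings; assigning the result back avoids
# relying on in-place modification through attribute access
df['text'] = df['text'].fillna('')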

In [17]:
dtm = vectorizer.fit_transform(df.text).toarray()

In [18]:
dtm

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ..., 
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)
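
To make the contents of a document-term matrix concrete, here is a small standard-library sketch on three invented sentences. It is not what `CountVectorizer` does internally (there is no stop-word removal, and tokenization is a plain whitespace split); the names `toy_docs`, `toy_vocab` and `toy_dtm` are hypothetical and chosen so they do not clash with the notebook's variables.

```python
from collections import Counter

# Three invented "documents" standing in for tweets.
toy_docs = [
    "banks serve senior citizens",
    "black money and banks",
    "senior citizens queue at banks",
]

tokenized = [doc.split() for doc in toy_docs]

# The vocabulary is the sorted set of all distinct words; each word gets a column.
toy_vocab = sorted({word for tokens in tokenized for word in tokens})

# One row per document, one count per vocabulary word.
toy_dtm = []
for tokens in tokenized:
    counts = Counter(tokens)
    toy_dtm.append([counts[word] for word in toy_vocab])

print(toy_vocab)
for row in toy_dtm:
    print(row)
```

Most entries are zero even in this tiny example, which is why the full matrix printed above looks like a wall of zeros.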

### Loading the vocabulary

In [19]:
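# get_feature_names() was removed in recent scikit-learn releases;
# get_feature_names_out() is the replacement (available from scikit-learn 1.0)
vocab = np.array(vectorizer.get_feature_names_out())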

In [20]:
vocab[:20]

array(['00', '000', '00f8drqczr', '025', '039mojsvyi', '0480654',
       '04bevf5lre', '06', '07', '08', '09ip2atmbg', '09mitali',
       '09uv9sfd2y', '0bfujsgltb', '0bwfwuirqk', '0fan6b2wxv',
       '0fmh3umbvq', '0frhtzuyhv', '0funzj0fjo', '0i47zl55f2'], 
      dtype='<U29')

In [21]:
text = df.text

In [22]:
model = lda.LDA(n_topics=5, n_iter=500, random_state=1)

In [23]:
model.fit(dtm)

INFO:lda:n_documents: 11914
INFO:lda:vocab_size: 11019
INFO:lda:n_words: 141760
INFO:lda:n_topics: 5
INFO:lda:n_iter: 500
INFO:lda:<0> log likelihood: -1372721
INFO:lda:<10> log likelihood: -978602
INFO:lda:<20> log likelihood: -954704
INFO:lda:<30> log likelihood: -950702
INFO:lda:<40> log likelihood: -947126
INFO:lda:<50> log likelihood: -944809
INFO:lda:<60> log likelihood: -943443
INFO:lda:<70> log likelihood: -942278
INFO:lda:<80> log likelihood: -940852
INFO:lda:<90> log likelihood: -939839
INFO:lda:<100> log likelihood: -939844
INFO:lda:<110> log likelihood: -938514
INFO:lda:<120> log likelihood: -938177
INFO:lda:<130> log likelihood: -936859
INFO:lda:<140> log likelihood: -936467
INFO:lda:<150> log likelihood: -936332
INFO:lda:<160> log likelihood: -935451
INFO:lda:<170> log likelihood: -935602
INFO:lda:<180> log likelihood: -935251
INFO:lda:<190> log likelihood: -935235
INFO:lda:<200> log likelihood: -934997
INFO:lda:<210> log likelihood: -934803
INFO:lda:<220> log likelihood:

<lda.lda.LDA at 0x115b22748>

In [24]:
model.topic_word_

array([[  3.64296203e-07,   3.64296203e-07,   3.67939165e-05, ...,
          3.64296203e-07,   3.64296203e-07,   3.64296203e-07],
       [  3.99357992e-07,   5.59500547e-04,   3.99357992e-07, ...,
          3.99357992e-07,   3.99357992e-07,   8.02709564e-05],
       [  3.72686687e-07,   1.86716030e-04,   3.72686687e-07, ...,
          3.72686687e-07,   3.72686687e-07,   3.72686687e-07],
       [  7.48101351e-05,   4.22764252e-04,   2.48538655e-07, ...,
          2.48538655e-07,   2.48538655e-07,   2.48538655e-07],
       [  4.39498813e-07,   4.39498813e-07,   4.39498813e-07, ...,
          1.32289143e-04,   2.20188905e-04,   4.39498813e-07]])

In [25]:
topic_word = model.topic_word_ 

### Finding the key words that come together for each topic

In [27]:
n_top_words = 8

In [28]:
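for i, topic_dist in enumerate(topic_word):
    # argsort is ascending, so the reversed slice walks down from the most probable word.
    # Note: [:-n_top_words:-1] stops one short and yields n_top_words - 1 (here 7) words.
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-n_top_words:-1]
    print('Topic {}: {}'.format(i, ' '.join(topic_words)))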

Topic 0: demonetization rt https purpose devyanidilli mouth uses
Topic 1: demonetization rt https sardesairajdeep taken congress struggling
Topic 2: demonetization rt https पर कर रह modi
Topic 3: demonetization https rt hai modi bank nahi
Topic 4: demonetization rt https money black anna govt
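
Each topic above lists seven words even though `n_top_words` is 8. The sketch below reproduces the slicing on a made-up distribution using plain lists instead of NumPy; `toy_vocab` and `toy_dist` are hypothetical values, and `sorted(...)` stands in for `np.argsort`.

```python
n_top_words = 8

# Hypothetical vocabulary and word probabilities for a single topic.
toy_vocab = ["bank", "money", "modi", "black", "govt",
             "cash", "atm", "queue", "notes", "rbi"]
toy_dist = [0.20, 0.18, 0.15, 0.12, 0.10, 0.08, 0.07, 0.05, 0.03, 0.02]

# Indices sorted by ascending probability, like np.argsort(topic_dist).
ranked = sorted(range(len(toy_dist)), key=lambda i: toy_dist[i])

# Slice used in the cell above: the stop index is excluded, so one word is lost.
as_written = [toy_vocab[i] for i in ranked[:-n_top_words:-1]]

# Moving the stop index one step further returns exactly n_top_words words.
exact = [toy_vocab[i] for i in ranked[:-(n_top_words + 1):-1]]

print(len(as_written), as_written)
print(len(exact), exact)
```

If exactly `n_top_words` words per topic are wanted, the second form of the slice is one way to get them.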


### Finding the Topic for each Document

In [29]:
doc_topic = model.doc_topic_

In [31]:
for n in range(10):
    topic_most_pr = doc_topic[n].argmax()
    print("topic: {} , {}".format(topic_most_pr,text[n]))

topic: 4 , RT @Praveen_1singh: First the stone pelting stopped and now this!! 
What months of politics and talks couldn't do #demonetization did in da…
topic: 0 , RT @NewDelhiTimesIN: Is the #demonetization of ₹1000 &amp; ₹500 notes good for India? 

@AmitShah @OfficeOfRG @PMOIndia @BJP4India
topic: 0 , RT @scoopwhoopnews: #BREAKING Banks across India to serve only senior citizens tomorrow: NDTV
#demonetization
topic: 1 , RT @DrGPradhan: .@ravishndtv of @ndtv spreading rumours to provoke people against #demonetization &amp; PM @narendramodi 

He need mob treatmen…
topic: 2 , RT @YesIamSaffron: जब भी #Demonetization व् काली धन का इतिहास लीखा जाएगा फ़र्ज़ीवाल का नाम सबसे ऊपर काले अछर से लिखा जाएगा @ArvindKejriwal…
topic: 3 , Agree Sir reason of worry for SC could b some or all of them were affected by #demonetization or instructions by ma… https://t.co/ifqDuDbcEm
topic: 3 , @BspUp2017 @OfficeOfRG @MamataOfficial @ArvindKejriwal  #सारे_चोर_मचाये_शोर  #demonetization  #ModiFightsCorruption

# Sentiment Analysis

Sentiment analysis aims to determine the attitude of a speaker or a writer with respect to some topic or the overall contextual polarity of a document. The attitude may be his or her judgment or evaluation (see appraisal theory), affective state (that is to say, the emotional state of the author when writing), or the intended emotional communication (that is to say, the emotional effect the author wishes to have on the reader).

A basic task in sentiment analysis is classifying the polarity of a given text at the document, sentence, or feature/aspect level: whether the opinion expressed in a document, a sentence, or an entity feature/aspect is positive, negative, or neutral. We will use knowledge-based techniques, which classify text by affect categories based on the presence of unambiguous affect words such as happy, sad, afraid, and bored.
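
As a rough illustration of the knowledge-based idea, the sketch below labels a sentence by counting affect words from two tiny word lists. The lexicons and the function name `lexicon_polarity` are hypothetical, and this plain counting rule is a simplification, not the Naive Bayes classifier trained later in this notebook.

```python
import re

# Hypothetical miniature lexicons; real ones contain thousands of words.
POSITIVE = {"happy", "good", "great"}
NEGATIVE = {"sad", "afraid", "bored", "hate"}


def lexicon_polarity(sentence):
    """Label a sentence by comparing counts of positive and negative words."""
    words = re.findall(r"[\w']+", sentence.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "pos"
    if score < 0:
        return "neg"
    return "neutral"


print(lexicon_polarity("I am happy today"))
print(lexicon_polarity("I hate data science"))
print(lexicon_polarity("Banks open tomorrow"))
```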

Here is a link to the Sentiment Analysis how-to on the NLTK site - http://www.nltk.org/howto/sentiment.html

Here is an example of Sentiment Analysis on Tweets data - http://www.laurentluce.com/posts/twitter-sentiment-analysis-using-python-and-nltk/


### What is Naive Bayes algorithm?

It is a classification technique based on Bayes’ Theorem with an assumption of independence among predictors. In simple terms, a Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. For example, a fruit may be considered to be an apple if it is red, round, and about 3 inches in diameter. Even if these features depend on each other or upon the existence of the other features, all of these properties independently contribute to the probability that this fruit is an apple and that is why it is known as ‘Naive’.

A Naive Bayes model is easy to build and scales well to very large data sets. Despite its simplicity, it can perform competitively with more sophisticated classification methods, particularly on text.

Bayes' theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x) and P(x|c). Look at the equation below:

<img src="../../img/nb.png"/>
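
A small numeric example may help here. The probabilities below are invented for illustration: `c` is the class (positive or negative) and `x` is the event "the text contains a given word".

```python
def posterior(prior, likelihood, evidence):
    """Bayes' theorem: P(c|x) = P(x|c) * P(c) / P(x)."""
    return likelihood * prior / evidence


# Hypothetical probabilities, not estimated from the tweets.
p_pos = 0.6               # P(c = pos)
p_neg = 0.4               # P(c = neg)
p_word_given_pos = 0.05   # P(x | pos)
p_word_given_neg = 0.01   # P(x | neg)

# P(x) follows from the law of total probability.
p_word = p_word_given_pos * p_pos + p_word_given_neg * p_neg

p_pos_given_word = posterior(p_pos, p_word_given_pos, p_word)
p_neg_given_word = posterior(p_neg, p_word_given_neg, p_word)

print(round(p_pos_given_word, 3), round(p_neg_given_word, 3))
```

With several features, the "naive" assumption is that P(x|c) factors into a product of per-feature likelihoods, so the same calculation is repeated with that product in place of the single likelihood.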

In [32]:
from nltk.classify import NaiveBayesClassifier
import math
import collections

In [33]:
pos_features = []
neg_features = []

In [34]:
def make_full_dict(word):
    # Feature dict with a single entry marking this word as present
    return {word: True}

In [35]:
with open('../../data/postive_words.txt','r') as posFile:
    lines = posFile.readlines()
    for line in lines:
        pos_features.append([make_full_dict(line.rstrip()),'pos'])
        

In [36]:
pos_features

[[{'a+': True}, 'pos'],
 [{'abound': True}, 'pos'],
 [{'abounds': True}, 'pos'],
 [{'abundance': True}, 'pos'],
 [{'abundant': True}, 'pos'],
 [{'accessable': True}, 'pos'],
 [{'accessible': True}, 'pos'],
 [{'acclaim': True}, 'pos'],
 [{'acclaimed': True}, 'pos'],
 [{'acclamation': True}, 'pos'],
 [{'accolade': True}, 'pos'],
 [{'accolades': True}, 'pos'],
 [{'accommodative': True}, 'pos'],
 [{'accomodative': True}, 'pos'],
 [{'accomplish': True}, 'pos'],
 [{'accomplished': True}, 'pos'],
 [{'accomplishment': True}, 'pos'],
 [{'accomplishments': True}, 'pos'],
 [{'accurate': True}, 'pos'],
 [{'accurately': True}, 'pos'],
 [{'achievable': True}, 'pos'],
 [{'achievement': True}, 'pos'],
 [{'achievements': True}, 'pos'],
 [{'achievible': True}, 'pos'],
 [{'acumen': True}, 'pos'],
 [{'adaptable': True}, 'pos'],
 [{'adaptive': True}, 'pos'],
 [{'adequate': True}, 'pos'],
 [{'adjustable': True}, 'pos'],
 [{'admirable': True}, 'pos'],
 [{'admirably': True}, 'pos'],
 [{'admiration': True}, 'p

In [37]:
with open('../../data/negative_words.txt','r',encoding='utf-8') as negFile:
    lines = negFile.readlines()
    for line in lines:
        neg_features.append([make_full_dict(line.rstrip()),'neg'])

In [38]:
neg_features

[[{'2-faced': True}, 'neg'],
 [{'2-faces': True}, 'neg'],
 [{'abnormal': True}, 'neg'],
 [{'abolish': True}, 'neg'],
 [{'abominable': True}, 'neg'],
 [{'abominably': True}, 'neg'],
 [{'abominate': True}, 'neg'],
 [{'abomination': True}, 'neg'],
 [{'abort': True}, 'neg'],
 [{'aborted': True}, 'neg'],
 [{'aborts': True}, 'neg'],
 [{'abrade': True}, 'neg'],
 [{'abrasive': True}, 'neg'],
 [{'abrupt': True}, 'neg'],
 [{'abruptly': True}, 'neg'],
 [{'abscond': True}, 'neg'],
 [{'absence': True}, 'neg'],
 [{'absent-minded': True}, 'neg'],
 [{'absentee': True}, 'neg'],
 [{'absurd': True}, 'neg'],
 [{'absurdity': True}, 'neg'],
 [{'absurdly': True}, 'neg'],
 [{'absurdness': True}, 'neg'],
 [{'abuse': True}, 'neg'],
 [{'abused': True}, 'neg'],
 [{'abuses': True}, 'neg'],
 [{'abusive': True}, 'neg'],
 [{'abysmal': True}, 'neg'],
 [{'abysmally': True}, 'neg'],
 [{'abyss': True}, 'neg'],
 [{'accidental': True}, 'neg'],
 [{'accost': True}, 'neg'],
 [{'accursed': True}, 'neg'],
 [{'accusation': True}

In [39]:
len(pos_features),len(neg_features)

(8020, 4783)

In [40]:
trainFeatures = pos_features + neg_features

In [42]:
classifier = NaiveBayesClassifier.train(trainFeatures)

In [43]:
referenceSets = collections.defaultdict(set)
testSets = collections.defaultdict(set)

In [44]:
def make_full_dict_sent(words):
    # Feature dict marking every word of the sentence as present
    return {word: True for word in words}

In [45]:
import re

In [46]:
neg_test = 'I hate data science'

In [47]:
title_words = re.findall(r"[\w']+|[.,!?;]",
                         'The Daily Mail stole My Visualization, Twice')

In [48]:
title_words

['The', 'Daily', 'Mail', 'stole', 'My', 'Visualization', ',', 'Twice']
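
The pattern keeps runs of word characters and apostrophes as tokens, keeps a handful of punctuation marks as separate tokens, and silently drops everything else (for example `#` and `@`). A short sketch on two made-up inputs, with lowercasing applied as in the tweet loop further down; `TOKEN_PATTERN` and `tokenize` are hypothetical names:

```python
import re

# Same regular expression as in the cell above.
TOKEN_PATTERN = r"[\w']+|[.,!?;]"


def tokenize(sentence):
    """Lowercase the sentence and split it into word and punctuation tokens."""
    return re.findall(TOKEN_PATTERN, sentence.lower())


print(tokenize("I hate data science!"))
# ['i', 'hate', 'data', 'science', '!']
print(tokenize("What talks couldn't do #demonetization did"))
# ['what', 'talks', "couldn't", 'do', 'demonetization', 'did']
```

Lowercasing matters because the word lists used for training are lowercase, so a capitalized token would not match any trained feature.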

In [49]:
test=[]

In [50]:
test.append([make_full_dict_sent(title_words),''])

In [51]:
test

[[{',': True,
   'Daily': True,
   'Mail': True,
   'My': True,
   'The': True,
   'Twice': True,
   'Visualization': True,
   'stole': True},
  '']]

In [52]:
for i, (features, label) in enumerate(test):
    predicted = classifier.classify(features)
    print(predicted)

neg


In [54]:
for doc in df.text:
    # Tokenize the lowercased tweet, build its feature dict, and classify it
    title_words = re.findall(r"[\w']+|[.,!?;]", doc.lower())
    features = make_full_dict_sent(title_words)
    predicted = classifier.classify(features)
    print(predicted, doc)

pos RT @Praveen_1singh: First the stone pelting stopped and now this!! 
What months of politics and talks couldn't do #demonetization did in da…
pos RT @NewDelhiTimesIN: Is the #demonetization of ₹1000 &amp; ₹500 notes good for India? 

@AmitShah @OfficeOfRG @PMOIndia @BJP4India
neg RT @scoopwhoopnews: #BREAKING Banks across India to serve only senior citizens tomorrow: NDTV
#demonetization
neg RT @DrGPradhan: .@ravishndtv of @ndtv spreading rumours to provoke people against #demonetization &amp; PM @narendramodi 

He need mob treatmen…
pos RT @YesIamSaffron: जब भी #Demonetization व् काली धन का इतिहास लीखा जाएगा फ़र्ज़ीवाल का नाम सबसे ऊपर काले अछर से लिखा जाएगा @ArvindKejriwal…
neg Agree Sir reason of worry for SC could b some or all of them were affected by #demonetization or instructions by ma… https://t.co/ifqDuDbcEm
pos @BspUp2017 @OfficeOfRG @MamataOfficial @ArvindKejriwal  #सारे_चोर_मचाये_शोर  #demonetization  #ModiFightsCorruption 
https://t.co/4ElMIvgygX
pos RT @janlokpal: Reac