# Homework 3  (Due: 10/31/2018)

COEN 281, Fall 2018  
Professor Marwah

---

The objective of this HW is to implement a Naive Bayes classifier to predict whether a tweet was posted by a Republican or a Democrat politician. The training data consist of about 13K tweets collected before the 2016 US presidential election. There are about an equal number of Republican and Democrat tweets, and the tweets come from three Republican and three Democrat Twitter accounts.

To represent each tweet, we will use a model that is common in natural language processing, the 'bag of words' model. A bag of words representation of a document (a tweet here) consists of the words and their frequencies in the document. The order of the words is ignored.
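As a minimal sketch of the idea, a bag of words can be held in a `collections.Counter`. The token list below is hypothetical and is not taken from the data set.

```python
from collections import Counter

# Hypothetical, already-tokenized tweet (not taken from the data set).
tokens = ["vote", "today", "vote", "early", "today", "vote"]

# A bag of words maps each token to its frequency; word order is discarded.
bag = Counter(tokens)
print(bag)

# Two tweets with the same words in a different order get the same bag.
same_bag = Counter(["early", "vote"]) == Counter(["vote", "early"])
print(same_bag)
```

A token that never occurs simply has a count of zero, e.g. `bag["missing"]`.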

There are four main tasks.
1. Tokenization: Parsing and converting the tweets to tokens. [**This is already done for you**]
2. Feature matrix construction from the training data set
3. Learning Naive Bayes parameters, priors and likelihoods, from the feature matrix.
4. Using the learned NB model to predict the labels of the test data set (about 4K tweets).

## Tokenization
This task consists of converting each tweet into a sequence of "tokens" that can be used as features. Tokens are essentially characters and character sequences obtained by using white space as a separator. A lot of these are noise that we want to remove; some are words or other character sequences that are useful features. A Python package called *NLTK* (Natural Language Toolkit) contains several tokenizers, including one for tweets. We use that tokenizer; in addition, we do the following:
- remove stopwords. These are words that are frequently used in a language but carry little semantic information, e.g., the, an, a, this, is, was, etc.
- make all tokens lower case (this is done by the tweet tokenizer)
- remove Twitter handles (again, done by the tweet tokenizer)
- remove punctuation and http links

Finally, we "lemmatize" the tokens. That means we convert different forms of a word to a common base form, so that they can be recognized as the same word. E.g., vote, votes, voted would all be converted to vote; geese would be converted to goose, etc. (There is a less sophisticated alternative to a lemmatizer called a stemmer, which just chops words to reduce them to the same base word; it doesn't work as well as a lemmatizer and we don't use it here.) There is a good description of the NLTK tokenizer [here](https://berkeley-stat159-f17.github.io/stat159-f17/lectures/11-strings/11-nltk..html).

The output of this part is a cleaned up list of tokens for each tweet. 
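To make the cleaning steps concrete without requiring the NLTK downloads, here is a rough, pure-Python stand-in. It is only a sketch: `simple_tokenize` and the tiny `STOPWORDS` set are hypothetical, it splits on white space only, and it does no lemmatization, so its output will differ from the NLTK-based tokenizer used in this notebook.

```python
import string

# Assumption: a tiny hand-written stopword list; the notebook uses NLTK's English list.
STOPWORDS = {"the", "a", "an", "this", "is", "was", "for", "to"}

# Keep '#' so that hashtags survive, as they do in the notebook's token lists.
PUNCT = string.punctuation.replace("#", "")

def simple_tokenize(text):
    """Illustrative stand-in for the NLTK-based tokenizer (no lemmatization)."""
    tokens = []
    for raw in text.lower().split():
        if raw.startswith("@") or raw.startswith("http"):
            continue  # drop Twitter handles and links
        token = raw.strip(PUNCT)  # trim leading/trailing punctuation
        if token and token not in STOPWORDS:
            tokens.append(token)
    return tokens

print(simple_tokenize("RT @someone: This is the day to VOTE! https://example.com"))
```

The real tokenizer handles many cases this sketch does not (emoji, elongated words, contractions), which is why the notebook relies on `TweetTokenizer`.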


In [1]:
import pandas as pd
import string
import numpy as np

import nltk
#
# you may need to run the following
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/vaishnavisabhahith/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/vaishnavisabhahith/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [2]:
# The data set has two columns - screen_name and text (which is the raw tweet)

## load tweets
tweets = pd.read_csv("tweets_train.csv", na_filter=False)

## screen_name (accounts)
#  Democrats   - HillaryClinton, timkaine, TheDemocrats
#  Republicans - realDonaldTrump, mike_pence, GOP


In [3]:
tweets['screen_name'].unique()

array(['GOP', 'TheDemocrats', 'HillaryClinton', 'timkaine', 'mike_pence',
       'realDonaldTrump'], dtype=object)

In [4]:
tweets.head()

| | screen_name | text |
|---|---|---|
| 0 | GOP | RT @GOPconvention: #Oregon votes today. That m... |
| 1 | TheDemocrats | RT @DWStweets: The choice for 2016 is clear: W... |
| 2 | HillaryClinton | Trump's calling for trillion dollar tax cuts f... |
| 3 | HillaryClinton | .@TimKaine's guiding principle: the belief tha... |
| 4 | timkaine | Glad the Senate could pass a #THUD / MilCon / ... |


In [5]:
tweets.describe()

| | screen_name | text |
|---|---|---|
| count | 13000 | 13000 |
| unique | 6 | 12982 |
| top | realDonaldTrump | MAKE AMERICA GREAT AGAIN! |
| freq | 2217 | 4 |


In [6]:
# add labels
# 1 for D's
# 0 for R's
tweets['label'] = tweets['screen_name'].str.contains('TheDemocrats|HillaryClinton|timkaine', regex=True)
tweets.describe()

| | screen_name | text | label |
|---|---|---|---|
| count | 13000 | 13000 | 13000 |
| unique | 6 | 12982 | 2 |
| top | realDonaldTrump | MAKE AMERICA GREAT AGAIN! | False |
| freq | 2217 | 4 | 6554 |


The training data has 13K tweets, split almost evenly between the two classes (6,554 Republican and 6,446 Democrat).

Now we will define our tokenizer.

In [7]:
from nltk.stem import WordNetLemmatizer
#
#  Input : dataframe with a column named 'text' which contains raw tweets (one per row)
#  Output: A list of lists of tokens corresponding to the 'text' column
#
def tokenize_tweets2(tweets):
    """Given a df with tweets in the 'text' col, return the tokens as a list of lists."""

    # apply the tweet tokenizer to the 'text' column in the tweets df
    tweet_tokenizer = nltk.tokenize.TweetTokenizer(preserve_case=False, reduce_len=True, strip_handles=True)
    tokens = tweets['text'].apply(tweet_tokenizer.tokenize)
    
    # filter
    misc = ['rt', '’', '…', '—', 'u', '”', 'w', '“', '...', '️', 'http', 'https']
    # the corpus file is named 'english'; lower case also works on case-sensitive file systems
    to_remove = nltk.corpus.stopwords.words('english') + list(string.punctuation) + misc
    
    lemmatizer = WordNetLemmatizer()
    
    tokens = [[lemmatizer.lemmatize(token) for token in tw if token not in to_remove] for tw in tokens]      
    return tokens

In [8]:
all_tokens = tokenize_tweets2(tweets)
print(len(all_tokens))
all_tokens[:10]

13000


[['#oregon', 'vote', 'today', 'mean', '62', 'day', 'https://t.co/OoH9FVb7QS'],
 ['choice',
  '2016',
  'clear',
  'need',
  'another',
  'democrat',
  'white',
  'house',
  '#demdebate',
  '#wearedemocrats',
  'http://t.co/0n5g0YN46f'],
 ["trump's",
  'calling',
  'trillion',
  'dollar',
  'tax',
  'cut',
  'wall',
  'street',
  'time',
  'pay',
  'fair',
  'share',
  'https://t.co/y8vyESIOES'],
 ['guiding',
  'principle',
  'belief',
  'make',
  'difference',
  'public',
  'service',
  'https://t.co/YopSUeMqOX'],
 ['glad',
  'senate',
  'could',
  'pas',
  '#thud',
  'milcon',
  'vetaffairs',
  'approps',
  'bill',
  'solid',
  'provision',
  'virginia',
  'https://t.co/NxIgRC3hDi'],
 ['exclusive',
  'sits',
  'see',
  'sunday',
  'morning',
  '8:',
  '30a',
  'rtv',
  '6',
  'rtv',
  '6',
  'app'],
 ['chatham',
  'town',
  'council',
  'congress',
  'made',
  'strong',
  'mark',
  'community',
  'proud',
  'work',
  'together',
  'behalf',
  'va'],
 ['thank',
  'new',
  'orleans',
  

The tokenizer can still be improved, but we will go with this. 

Let's find the most common tokens. We will use every token that occurs more than 25 times as a feature.

In [9]:
from collections import Counter
# count how often each token occurs across all tweets
counts = Counter(token for tokens in all_tokens for token in tokens)
print(len(counts))
counts.most_common(20)

23459


[('hillary', 1159),
 ('trump', 1144),
 ('great', 749),
 ('clinton', 720),
 ('today', 709),
 ('make', 581),
 ('donald', 576),
 ('president', 564),
 ('day', 552),
 ('thank', 539),
 ('american', 512),
 ('new', 503),
 ('job', 503),
 ('u', 485),
 ('america', 480),
 ('people', 469),
 ('vote', 451),
 ('state', 442),
 ('get', 420),
 ('year', 415)]

In [10]:
top_words = [k for k in counts.keys() if counts.get(k) > 25]
len(top_words)

927

`top_words` are our features. Now let's construct a feature matrix from these top words.

## Feature Matrix Construction

Problem 1 (15 points) Compute feature matrix

Now we will extract the features from the training data and construct a feature matrix. The bad news is that this matrix can be very large. In our case it is about 13K x 1K, or about 13M entries x 4 bytes ≈ 52 MB, which will easily fit in the RAM of your laptop. However, the training set could easily have been 10x or 100x the current size, and the number of features 10x, in which case you would be out of luck. The good news is that this matrix is likely to be very sparse. In fact, each tweet is unlikely to contain more than 10-20 tokens, so even if this matrix becomes very large, we would be okay if we use a sparse representation.
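The size estimate can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the 4-byte entry size, the figure of 15 non-zero entries per tweet, and the coordinate-style layout (value, row index, column index) are assumptions made for illustration.

```python
# Back-of-the-envelope sizes, using the approximate figures quoted above.
n_rows, n_cols = 13_000, 1_000
bytes_per_entry = 4  # assumption: 32-bit integer counts

dense_bytes = n_rows * n_cols * bytes_per_entry
print(dense_bytes / 1e6, "MB dense")

# Assumption: about 15 non-zero entries per tweet (the text says 10-20 tokens),
# stored as three 4-byte numbers each: value, row index, column index.
nnz = n_rows * 15
sparse_bytes = nnz * 3 * bytes_per_entry
print(sparse_bytes / 1e6, "MB sparse (rough)")
```

Under these assumptions the sparse form is more than twenty times smaller, and the gap widens as the number of features grows, because the number of tokens per tweet stays roughly constant.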

In a sparse representation, only the non-zero entries and their indices are saved. SciPy provides [several formats](https://docs.scipy.org/doc/scipy-0.18.1/reference/sparse.html) for sparse matrices. In this assignment, it doesn't matter which one you use (in fact, we could have even used a dense matrix). However, since we have to sum along columns (or features), the most suitable one is the [csc (or compressed sparse column) format](https://docs.scipy.org/doc/scipy-0.18.1/reference/generated/scipy.sparse.csc_matrix.html).

To make it easier to estimate priors and likelihoods, we will construct two feature matrices, one for each of the two classes. For this, we first need to figure out how many data points are in each class.

While setting elements of a csc_matrix you may get a 'SparseEfficiencyWarning'; you can ignore that. 
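One way to avoid setting elements one at a time is to collect (row, column, count) triplets and build the matrix in a single call. The sketch below uses hypothetical toy data; `toy_tokens`, `features`, and `col_of` are illustrative names, and this is just one possible approach, not the required one.

```python
import numpy as np
from collections import Counter
from scipy.sparse import csc_matrix

# Hypothetical toy data: three tokenized tweets and a three-word feature list.
toy_tokens = [["vote", "today", "vote"], ["job", "unknown"], ["today"]]
features = ["vote", "today", "job"]
col_of = {word: j for j, word in enumerate(features)}  # constant-time column lookup

rows, cols, vals = [], [], []
for i, tweet in enumerate(toy_tokens):
    for word, n in Counter(tweet).items():
        if word in col_of:  # tokens that are not features are skipped
            rows.append(i)
            cols.append(col_of[word])
            vals.append(n)

# Build the matrix once from the collected triplets.
fmat = csc_matrix((vals, (rows, cols)), shape=(len(toy_tokens), len(features)), dtype=int)
print(fmat.toarray())

# Per-feature totals: sum down each column.
col_sums = np.asarray(fmat.sum(axis=0)).ravel()
print(col_sums)
```

Using a dictionary for the column lookup avoids the linear scan that `<list>.index()` performs on every call.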

In [11]:
num_feat = len(top_words)

# set this to the correct values
nTrainR = tweets['label'][tweets['label']==False].count()  # number of R (0) training points
nTrainD = tweets['label'][tweets['label']==True].count()  # number of D (1) training points
#nTrainR=6554
print(nTrainR)
#nTrainD=6446
print(nTrainD)
df = tweets.copy()
df_rep=tweets[tweets['label']==False]
df_dem=tweets[tweets['label']==True]
train_label=tweets['label']
#create sparse feature matrix
from scipy.sparse import csc_matrix
row_rep_count=0
col_r=np.array([])
row_r=np.array([])
data_Spsr=np.array([])

rfmat = csc_matrix((nTrainR, num_feat), dtype=int)
dfmat = csc_matrix((nTrainD, num_feat), dtype=int)

#
# populate rfmat and dfmat with the counts of the features
# Remember: all tokens are not features

# a function that might be useful is <list>.index() 
#
# Republican sparse matrix: collect (row, col, count) triplets, then build the matrix once
for rowCount, per_tweets_token in enumerate(all_tokens):
    if train_label[rowCount]:
        continue  # skip Democrat tweets in this pass
    tweet_count = Counter(per_tweets_token)
    for tokens_each in tweet_count.keys():
        if tokens_each in top_words:  # tokens that are not features are ignored
            col_r = np.append(col_r, top_words.index(tokens_each))
            row_r = np.append(row_r, row_rep_count)
            data_Spsr = np.append(data_Spsr, tweet_count[tokens_each])
    row_rep_count += 1  # move to the next row of the Republican matrix
rfmat = csc_matrix((data_Spsr,(row_r,col_r)),shape=(nTrainR,num_feat))
                
print(rfmat)    

6554
6446
  (0, 0)	1.0
  (31, 0)	1.0
  (98, 0)	1.0
  (212, 0)	1.0
  (236, 0)	1.0
  (247, 0)	1.0
  (326, 0)	1.0
  (401, 0)	1.0
  (552, 0)	1.0
  (570, 0)	1.0
  (605, 0)	1.0
  (617, 0)	1.0
  (628, 0)	1.0
  (646, 0)	1.0
  (678, 0)	1.0
  (782, 0)	1.0
  (789, 0)	1.0
  (806, 0)	1.0
  (825, 0)	2.0
  (870, 0)	1.0
  (988, 0)	1.0
  (1010, 0)	1.0
  (1037, 0)	1.0
  (1039, 0)	1.0
  (1152, 0)	1.0
  :	:
  (4454, 925)	1.0
  (4532, 925)	1.0
  (5295, 925)	1.0
  (5534, 925)	1.0
  (5721, 925)	1.0
  (5951, 925)	1.0
  (6397, 925)	1.0
  (6447, 925)	1.0
  (1385, 926)	1.0
  (1428, 926)	1.0
  (1578, 926)	1.0
  (1848, 926)	1.0
  (1908, 926)	1.0
  (2872, 926)	1.0
  (2896, 926)	1.0
  (2944, 926)	1.0
  (3470, 926)	1.0
  (3898, 926)	2.0
  (4047, 926)	1.0
  (4304, 926)	1.0
  (4438, 926)	1.0
  (5105, 926)	1.0
  (5160, 926)	1.0
  (5229, 926)	1.0
  (5686, 926)	1.0


In [12]:
# Democrat sparse matrix: collect (row, col, count) triplets, then build the matrix once

col_dem=np.array([])
row_dem=np.array([])
data_Spsdem=np.array([])
row_dem_count=0
for rowCount, per_tweets_token in enumerate(all_tokens):
    if not train_label[rowCount]:
        continue  # skip Republican tweets in this pass
    tweet_count = Counter(per_tweets_token)
    for tokens_each in tweet_count.keys():
        if tokens_each in top_words:  # tokens that are not features are ignored
            col_dem = np.append(col_dem, top_words.index(tokens_each))
            row_dem = np.append(row_dem, row_dem_count)
            data_Spsdem = np.append(data_Spsdem, tweet_count[tokens_each])
    row_dem_count += 1  # move to the next row of the Democrat matrix
dfmat = csc_matrix((data_Spsdem,(row_dem,col_dem)),shape=(nTrainD,num_feat))
print(dfmat)

  (11, 0)	1.0
  (65, 0)	1.0
  (74, 0)	1.0
  (96, 0)	1.0
  (133, 0)	1.0
  (166, 0)	1.0
  (182, 0)	1.0
  (225, 0)	1.0
  (236, 0)	1.0
  (243, 0)	1.0
  (250, 0)	1.0
  (302, 0)	1.0
  (315, 0)	1.0
  (319, 0)	1.0
  (337, 0)	1.0
  (400, 0)	1.0
  (411, 0)	1.0
  (418, 0)	1.0
  (515, 0)	1.0
  (532, 0)	1.0
  (535, 0)	1.0
  (552, 0)	1.0
  (599, 0)	1.0
  (602, 0)	1.0
  (610, 0)	1.0
  :	:
  (4646, 924)	2.0
  (1274, 925)	1.0
  (1382, 925)	1.0
  (2471, 925)	1.0
  (2659, 925)	1.0
  (3232, 925)	1.0
  (3729, 925)	1.0
  (4019, 925)	1.0
  (4873, 925)	1.0
  (4899, 925)	1.0
  (5573, 925)	1.0
  (5857, 925)	1.0
  (6072, 925)	1.0
  (1408, 926)	1.0
  (2290, 926)	1.0
  (2306, 926)	1.0
  (2313, 926)	1.0
  (2735, 926)	1.0
  (3899, 926)	1.0
  (4776, 926)	1.0
  (5192, 926)	1.0
  (5482, 926)	1.0
  (5523, 926)	1.0
  (5682, 926)	1.0
  (6215, 926)	1.0


## Learning Naive Bayes Model Parameters

Problem 2 (5 points) Compute log priors

Problem 3 (30 points) Compute log likelihoods using Laplace smoothing

Now we can compute the model parameters, that is, the likelihoods and priors for the two classes. As we discussed in class, since the probabilities can be very small numbers, we will compute log likelihoods and log priors. Also use Laplace (aka add-one) smoothing.

To sum a matrix column, you can use something like `dfmat[:,i].sum()`.
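With add-one smoothing, the likelihood of feature w in class c is estimated as (count(w, c) + 1) / (total count in c + V), where V is the number of features. The sketch below applies this to hypothetical counts (the array `counts` is not from the data set) and contrasts it with the unsmoothed estimate.

```python
import numpy as np

# Hypothetical per-feature token counts for one class (not from the data set).
counts = np.array([3, 0, 1])
V = len(counts)  # number of features

# Unsmoothed estimate: a zero count gives log(0) = -inf.
with np.errstate(divide="ignore"):
    log_lik_raw = np.log(counts / counts.sum())

# Laplace (add-one) smoothing: every feature gets a non-zero probability.
log_lik_smooth = np.log((counts + 1) / (counts.sum() + V))

print(log_lik_raw)
print(log_lik_smooth)
```

The smoothed probabilities still sum to one, because the V added to the denominator matches the V ones added across the numerators.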

In [13]:
# compute log priors
prior_rep = nTrainR/len(all_tokens)
prior_dem = nTrainD/len(all_tokens)
print("Prior(Rep) ", prior_rep)
print("Prior(Dem) ", prior_dem)
log_prior_rep = np.log(prior_rep)
log_prior_dem = np.log(prior_dem)
print("Log of Prior(Rep) ",log_prior_rep)
print("Log of Prior(Dem)",log_prior_dem)

Prior(Rep)  0.5041538461538462
Prior(Dem)  0.4958461538461538
Log of Prior(Rep)  -0.6848738071849139
Log of Prior(Dem) -0.7014895740682907


In [14]:
# denominator of the likelihood: total count of feature tokens in each class
count_rep_total = rfmat.sum()
print("Total count of rep_train_token",count_rep_total)
count_dem_total = dfmat.sum()
print("Total count of dem_train_token",count_dem_total)

Total count of rep_train_token 37783.0
Total count of dem_train_token 39188.0


In [15]:
# compute log likelihoods without Laplace smoothing (Republican)
# per-feature totals: sum each column of the sparse matrix in one call;
# a feature with a zero total gives log(0) = -inf below
frequency_sum_rep = np.asarray(rfmat.sum(axis=0)).ravel()
likelihood_given_Rep= frequency_sum_rep/count_rep_total
#print(frequency_sum_rep)
log_likelihood_given_Rep=np.log(likelihood_given_Rep)
print(log_likelihood_given_Rep)

[ -5.50266194  -4.66187876  -7.64924279  -4.71953161  -7.36156071
  -5.95464707  -7.07387864  -5.99631976  -6.2491551   -6.30550804
  -6.98426648  -6.36522728  -5.7355935          -inf  -7.76702582
  -7.90055722  -7.76702582  -6.15758791  -7.7064012   -7.01325402
  -7.64924279  -5.28734112  -7.28151801  -7.20741003  -7.36156071
  -8.34238997  -5.05067682  -7.64924279  -6.77841443  -7.90055722
  -7.01325402  -7.24377768  -8.5937044   -6.39647982  -7.20741003
  -5.82111567  -6.22212643  -7.83156434  -7.32073872  -7.76702582
  -8.0547079   -6.07370643  -6.3499598   -7.17231872  -6.90202839
  -5.96490357  -6.10879775  -6.26294843  -7.49509211  -4.44154026
  -4.71953161  -5.24129718  -6.64779425  -6.32010684  -7.07387864
  -6.98426648  -7.17231872  -5.93444436  -7.28151801  -4.74052189
  -7.97466519  -5.84826666  -5.96490357  -5.10153524  -6.49656328
  -7.36156071  -8.34238997  -5.06315099         -inf  -7.7064012
  -7.36156071  -6.33492193  -7.36156071  -8.14171927  -6.92869663
  -7.361560



In [16]:
# compute log likelihoods without Laplace smoothing (Democrat)
# per-feature totals: sum each column of the sparse matrix in one call;
# a feature with a zero total gives log(0) = -inf below
frequency_sum_dem = np.asarray(dfmat.sum(axis=0)).ravel()
likelihood_given_dem = frequency_sum_dem/count_dem_total
log_likelihood_given_dem=np.log(likelihood_given_dem)
print(log_likelihood_given_dem)

  


[ -4.88239372  -4.71249468  -6.8872464   -5.20548783  -6.83845624
  -6.06526635  -7.07961829  -5.12080474  -6.53307459  -5.11654034
  -6.62488214  -6.40173859  -5.90329702  -7.28028899  -5.42283426
  -7.58039358  -8.17823058  -5.85762699  -6.86255379  -6.9385397
  -7.80353713  -5.44032742  -6.43299113  -7.53160342  -6.96520794
  -7.74291251  -4.75012575  -6.56879267  -6.34201935  -7.39807203
  -5.51988005  -6.60583394  -6.86255379  -5.65614493  -5.7639415
  -5.98100601  -6.55077417  -7.6857541   -7.63168688  -7.74291251
  -5.53917325  -6.06526635  -6.68430556  -7.07961829  -6.19409922
  -5.57217955  -5.38316901  -5.81395192  -5.41707056  -6.03283107
  -5.46413807         -inf         -inf  -6.08748949  -7.04976533
  -6.68430556  -8.27354076  -5.55224534  -6.83845624  -8.49668431
  -7.44063164  -5.6202988   -6.14530906  -5.0996623   -6.64430022
  -7.53160342  -7.58039358  -5.09132892  -6.96520794  -7.58039358
  -7.63168688  -6.43299113  -8.37890128  -7.6857541   -6.25863774
  -8.9666879

In [17]:
# Problem 3: compute log likelihoods with Laplace smoothing (Republican)
# add one to every per-feature total, computed with a single column-wise sum
frequency_sum_rep_laplace = np.asarray(rfmat.sum(axis=0)).ravel() + 1
denr=count_rep_total+len(top_words)
likelihood_Republican_laplace = frequency_sum_rep_laplace/denr
log_likelihood_Republican_laplace=np.log(likelihood_Republican_laplace)
print(log_likelihood_Republican_laplace)

[ -5.52042813  -4.68332026  -7.61941426  -4.74080735  -7.34497742
  -5.96873339  -7.06734568  -6.00997635  -6.25978815  -6.315358
  -6.98033431  -6.3741985   -5.75166889 -10.56385324  -7.7306399
  -7.85580304  -7.7306399   -6.16940409  -7.67348149  -7.00850518
  -7.61941426  -5.30635787  -7.26801638  -7.19655741  -7.34497742
  -8.26126815  -5.0707918   -7.61941426  -6.77966361  -7.85580304
  -7.00850518  -7.23164873  -8.4844117   -6.40497016  -7.19655741
  -5.83646542  -6.2331199   -7.79126452  -7.30575671  -7.7306399
  -7.99890389  -6.08651643  -6.35916062  -7.16265586  -6.9002916
  -5.97888576  -6.12120199  -6.2733938   -7.47281079  -4.46353429
  -4.74080735  -5.26054834  -6.65183024  -6.32974674  -7.06734568
  -6.98033431  -7.16265586  -5.94873273  -7.26801638  -4.76173487
  -7.92479591  -5.86337288  -5.97888576  -5.12143553  -6.50341023
  -7.34497742  -8.26126815  -5.08321432 -10.56385324  -7.67348149
  -7.34497742  -6.34434554  -7.34497742  -8.07894659  -6.92626708
  -7.34497742  

In [18]:
# Problem 3: compute log likelihoods with Laplace smoothing (Democrat)
# add one to every per-feature total, computed with a single column-wise sum
frequency_sum_dem_laplace = np.asarray(dfmat.sum(axis=0)).ravel() + 1
den2=count_dem_total+len(top_words)
likelihood_dem_laplace = frequency_sum_dem_laplace/den2
log_likelihood_dem_laplace=np.log(likelihood_dem_laplace)
print(log_likelihood_dem_laplace)

[ -4.90241212  -4.73303755  -6.88593354  -5.2242272   -6.83830549
  -6.07771703  -7.07314508  -5.13992009  -6.5390626   -5.1356738
  -6.62921369  -6.40985087  -5.91737438  -7.2673011   -5.44045031
  -7.55498317  -8.11459896  -5.87211779  -6.86183599  -6.93594396
  -7.76629226  -5.45784205  -6.44062252  -7.50846315  -6.96191945
  -7.70913385  -4.77055999  -6.57415392  -6.35101037  -7.38062978
  -5.53691058  -6.61052156  -6.86183599  -5.67225192  -5.77922404
  -5.99433542  -6.55645434  -7.65506663  -7.60377333  -7.70913385
  -5.55608049  -6.07771703  -6.6874826   -7.07314508  -6.20505645
  -5.58887031  -5.40100858  -5.82882098  -5.43471963  -6.04562872
  -5.4815118  -10.59950561 -10.59950561  -6.09969594  -7.04415755
  -6.6874826   -8.20161034  -5.56906769  -6.83830549  -8.40228103
  -7.42145178  -5.63666098  -6.15685435  -5.11886668  -6.64826189
  -7.50846315  -7.55498317  -5.11056788  -6.96191945  -7.55498317
  -7.60377333  -6.44062252  -8.29692052  -7.65506663  -6.26877227
  -8.807746

## Prediction on Test Set

Now we have a trained Naive Bayes classifier. We will load the test data set and make the predictions. Note: If a token is not a feature, ignore it. 
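The prediction rule adds the log prior of a class to the log likelihoods of the tweet's feature tokens and picks the class with the larger total. The sketch below shows this on hypothetical parameters; `features`, `log_prior`, `log_lik`, and `predict` are illustrative names and values, not the learned model.

```python
import math

# Hypothetical model parameters for illustration only (not the learned values).
features = ["great", "job", "vote"]
log_prior = {"R": math.log(0.5), "D": math.log(0.5)}
log_lik = {
    "R": [math.log(0.6), math.log(0.3), math.log(0.1)],
    "D": [math.log(0.2), math.log(0.3), math.log(0.5)],
}
col_of = {w: j for j, w in enumerate(features)}

def predict(tokens):
    """Return the class with the larger log posterior score (ties go to 'R')."""
    scores = {}
    for c in ("R", "D"):
        score = log_prior[c]
        for tok in tokens:
            if tok in col_of:  # tokens that are not features are ignored
                score += log_lik[c][col_of[tok]]
        scores[c] = score
    return "D" if scores["D"] > scores["R"] else "R"

print(predict(["great", "job", "zzz"]))
print(predict(["vote", "vote", "job"]))
```

A repeated token contributes its log likelihood once per occurrence, which is how the token counts of the bag of words enter the score.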

Problem 4 (5 points) Load test data and tokenize

Problem 5 (30 points) Using the trained NB classifier predict the labels

Problem 6 (5 points) Calculate accuracy, recall, and precision of your predictions
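For reference, the three metrics follow from the four confusion-matrix counts: accuracy = (TP + TN) / N, precision = TP / (TP + FP), and recall = TP / (TP + FN). The sketch below assumes class 1 (Democrat) is the positive class; `binary_metrics` and the label lists are hypothetical.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, and recall with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0  # guard against no positive predictions
    recall = tp / (tp + fn) if tp + fn else 0.0     # guard against no positive labels
    return accuracy, precision, recall

# Hypothetical labels and predictions.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
acc, prec, rec = binary_metrics(y_true, y_pred)
print(acc, prec, rec)
```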


In [19]:
#Problem 4 Load test data and tokenize
tweets_test = pd.read_csv("tweets_test.csv", na_filter=False)
tweets_test['screen_name'].unique()
tweets_test.head()

| | screen_name | text |
|---|---|---|
| 0 | timkaine | My staff is hosting office hours across the Vi... |
| 1 | mike_pence | RT @GovPenceIN: Enjoyed r community convo at @... |
| 2 | realDonaldTrump | I am self-funding my campaign and am therefore... |
| 3 | timkaine | We know #safetrack will be inconvenient for co... |
| 4 | timkaine | ICYMI: @realwizkaliaa, @GeorgeMasonU student &... |


In [20]:
tweets_test['label'] = tweets_test['screen_name'].str.contains('TheDemocrats|HillaryClinton|timkaine', regex=True)
tweets_test.describe()

| | screen_name | text | label |
|---|---|---|---|
| count | 4298 | 4298 | 4298 |
| unique | 6 | 4295 | 2 |
| top | timkaine | MAKE AMERICA GREAT AGAIN! | True |
| freq | 747 | 3 | 2206 |


In [21]:
all_tokens_test = tokenize_tweets2(tweets_test)
print(len(all_tokens_test))
all_tokens_test[:10]

4298


[['staff',
  'hosting',
  'office',
  'hour',
  'across',
  'virginia',
  'next',
  'week',
  'answer',
  'question',
  'find',
  'location',
  'near',
  'https://t.co/nulOEkTOKB'],
 ['enjoyed',
  'r',
  'community',
  'convo',
  'today',
  'special',
  'thx',
  'covenant',
  'christian',
  'h',
  'demotte',
  'coming',
  'https://…'],
 ['self-funding',
  'campaign',
  'therefore',
  'controlled',
  'lobbyist',
  'special',
  'interest',
  'like',
  'lightweight',
  'rubio',
  'ted',
  'cruz'],
 ['know',
  '#safetrack',
  'inconvenient',
  'commuter',
  'safety',
  'work',
  'long',
  'overdue',
  'wmata',
  'finally',
  'stepping',
  'good',
  'sign'],
 ['icymi',
  'student',
  '#rva',
  'native',
  'share',
  'story',
  'student',
  'loan',
  'debt',
  '#inthered',
  'https://t.co/luEztEZlLE'],
 ['#stophillary',
  'event',
  'va',
  'highlighting',
  'hillary',
  "clinton's",
  'hypocrisy',
  'http://t.co/LAJw0nzCNO'],
 ['heading',
  'north',
  'carolina',
  'two',
  'big',
  'rally'

In [22]:
from collections import Counter
# count how often each token occurs across all test tweets
counts_test = Counter(token_test for tokens_test in all_tokens_test for token_test in tokens_test)
print(len(counts_test))
counts_test.most_common(20)

11175


[('trump', 390),
 ('hillary', 366),
 ('great', 242),
 ('today', 225),
 ('clinton', 223),
 ('american', 198),
 ('president', 196),
 ('make', 186),
 ('donald', 180),
 ('thank', 173),
 ('u', 168),
 ('day', 159),
 ('job', 156),
 ('america', 155),
 ('new', 153),
 ('one', 147),
 ('people', 147),
 ('vote', 142),
 ('state', 139),
 ('get', 135)]

In [23]:
#Problem 5 (30 points) Using the trained NB classifier predict the labels

map_dem_token=np.array([])
map_rep_token=np.array([])
map_prediction_list=np.array([])
tweet_log_pr_rep = 0
tweet_log_pr_dem = 0

for every_row in all_tokens_test:
    for every_token in  every_row:
        if every_token in top_words:
            tweet_log_pr_rep=tweet_log_pr_rep + log_likelihood_Republican_laplace[top_words.index(every_token)]
            tweet_log_pr_dem=tweet_log_pr_dem + log_likelihood_dem_laplace[top_words.index(every_token)]
    tweet_log_pr_rep=tweet_log_pr_rep+log_prior_rep
    tweet_log_pr_dem=tweet_log_pr_dem+log_prior_dem
    
    map_dem_token=np.append(map_dem_token,tweet_log_pr_dem)
    map_rep_token=np.append(map_rep_token,tweet_log_pr_rep)

    # MAP decision: predict Democrat (True) only when its score is strictly larger;
    # ties go to Republican (False)
    map_prediction_list=np.append(map_prediction_list, tweet_log_pr_dem>tweet_log_pr_rep)

    tweet_log_pr_rep=0
    tweet_log_pr_dem=0

#print(map_prediction_list.shape)
print(Counter(map_prediction_list))
print(len(map_prediction_list))

Counter({1.0: 2179, 0.0: 2119})
4298


In [24]:
#Problem 6 Calculate accuracy, recall, and precision of your predictions
test_label=tweets_test['label']
#print(test_label)
label_test_array = np.array([])
label_test_array = np.append(label_test_array, list(test_label))
print(label_test_array)

TP=0
TN=0
FN=0
FP=0

for row_i in range(0,len(label_test_array)):
    if label_test_array[row_i]==1 and map_prediction_list[row_i]==1:
        TP+=1
    elif label_test_array[row_i]==0 and map_prediction_list[row_i]==0:
        TN+=1
    elif label_test_array[row_i]==0 and map_prediction_list[row_i]==1:
        FP+=1
    elif label_test_array[row_i]==1 and map_prediction_list[row_i]==0:
        FN+=1    

[1. 0. 0. ... 0. 1. 0.]


In [25]:
#accuracy
accuracy = float(TP+TN)/len(label_test_array)
print(TP+TN, len(label_test_array), accuracy)
print("Accuracy is %f percent"%(accuracy*100))

3489 4298 0.8117729176361098
Accuracy is 81.177292 percent


In [26]:
#Precision =TP /(TP + FP)
print(TP /float(TP + FP))
print("precision is %f percent"%((TP /float(TP+ FP))*100))

0.8205598898577329
precision is 82.055989 percent


In [27]:
#recall=TP/(TP + FN)
print(TP/float(TP + FN))
print("recall is %f percent"%((TP/float(TP + FN))*100))

0.8105167724388033
recall is 81.051677 percent


In [28]:
#Problem 7 List the features with top ten likelihoods for each of the two classes. What is the likelihood for 
#'hillary', that is, P(hillary|class)? Is it in the top ten? How important is it in this classification problem?

likelihood_rep = {}
likelihood_dem = {}

import operator as op

for j in range(0 , len(top_words)):
    likelihood_rep[top_words[j]] = log_likelihood_Republican_laplace[j]
    likelihood_dem[top_words[j]] = log_likelihood_dem_laplace[j]
    

likelihood_rep_sort = sorted(likelihood_rep.items(), key=op.itemgetter(1), reverse = True)
for top_10_rep in likelihood_rep_sort[:10]:
    print(top_10_rep)
print("#########")

likelihood_dem_sort = sorted(likelihood_dem.items(), key=op.itemgetter(1), reverse = True)
for top_10_dem in likelihood_dem_sort[:10]:
    print(top_10_dem)

('clinton', -4.165258309036486)
('hillary', -4.187126295673067)
('great', -4.218216882743098)
('thank', -4.46353429155163)
('today', -4.683320257170993)
('day', -4.740807348088675)
('new', -4.740807348088675)
('indiana', -4.7617348681946305)
('job', -4.804951469694413)
('state', -4.804951469694413)
#########
('trump', -3.8321624829223646)
('hillary', -4.248619891473017)
('donald', -4.3476017250218675)
('president', -4.593152448586023)
('today', -4.733037551254459)
('american', -4.744433685985329)
('make', -4.770559990577549)
('u', -4.875920506235375)
('vote', -4.902412121682352)
('one', -5.074052669055972)


Problem 7 (5 points) List the features with top ten likelihoods for each of the two classes. What is the likelihood for 'hillary', that is, P(hillary|class)? Is it in the top ten? How important is it in this classification problem? 

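Solution:

The values printed above are log likelihoods (with Laplace smoothing), not probabilities. The first list is for the Republican class and the second is for the Democrat class:

- log P(hillary | Republican) = -4.1871, so P(hillary | Republican) ≈ 0.0152
- log P(hillary | Democrat) = -4.2486, so P(hillary | Democrat) ≈ 0.0143

Yes, 'hillary' is in the top ten for both classes (second in each list). Because the two likelihoods are almost the same, the token adds nearly the same amount to both class scores and contributes little to telling the classes apart. For a tweet that contains it, the MAP decision rests on the other tokens and the priors.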

Problem 8 (5 points) How important are the priors in this problem?

In [29]:
print("Prior(Rep) ", prior_rep)
print("Prior(Dem) ", prior_dem)
print("Log of Prior(Rep) ",log_prior_rep)
print("Log of Prior(Dem)",log_prior_dem)

Prior(Rep)  0.5041538461538462
Prior(Dem)  0.4958461538461538
Log of Prior(Rep)  -0.6848738071849139
Log of Prior(Dem) -0.7014895740682907


Solution:

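The priors are almost the same (0.504 for Republican vs. 0.496 for Democrat), with the Republican prior slightly higher. Their log values differ by only about 0.017, so the priors have very little influence on the predictions; they matter mainly when a tweet contains few or no feature tokens.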
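To put the size of the prior in perspective, the gap between the two log priors can be compared with the gap that a single token contributes. The sketch below uses the class counts and the two 'hillary' log likelihoods printed earlier in this notebook; the variable names are illustrative.

```python
import math

# Class counts reported earlier in the notebook.
n_rep, n_dem = 6554, 6446
n_total = n_rep + n_dem

# Difference between the two log priors.
log_prior_gap = math.log(n_rep / n_total) - math.log(n_dem / n_total)
print(round(log_prior_gap, 4))

# Difference contributed by one occurrence of 'hillary', using the
# smoothed log likelihoods from the Problem 7 output.
token_gap = -4.187126295673067 - (-4.248619891473017)
print(round(token_gap, 4))
```

Even a token whose likelihoods are nearly equal across the classes shifts the score by more than the priors do, which suggests the priors rarely decide a prediction in this data set.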

Extra credit (5 points): Compute the accuracy of the test set without Laplace smoothing and compare with the above.
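Without smoothing, a feature that never occurs in one class has a log likelihood of -inf, and a single such token drives the whole class score to -inf. The sketch below illustrates the effect on the comparison with hypothetical scores; the probabilities are made up for illustration.

```python
import math

NEG_INF = float("-inf")

# Hypothetical unsmoothed scores for one tweet: prior plus per-token log likelihoods.
score_rep = math.log(0.5) + math.log(0.01) + NEG_INF  # one token never seen in class R
score_dem = math.log(0.5) + math.log(0.02) + math.log(0.001)

print(score_rep)
print(score_dem > score_rep)  # class R is ruled out by a single unseen token

# If both classes contain an unseen token, both scores are -inf and neither
# is larger, so the decision falls to the tie-breaking rule.
both_unseen_tie = not (NEG_INF > NEG_INF) and not (NEG_INF < NEG_INF)
print(both_unseen_tie)
```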

In [30]:
map_dem_token1=np.array([])
map_rep_token1=np.array([])
map_prediction_list1=np.array([])
tweet_log_pr_rep1 = 0
tweet_log_pr_dem1 = 0

for every_row1 in all_tokens_test:
    for every_token1 in  every_row1:
        if every_token1 in top_words:
            tweet_log_pr_rep1=tweet_log_pr_rep1 + log_likelihood_given_Rep[top_words.index(every_token1)]
            tweet_log_pr_dem1=tweet_log_pr_dem1 + log_likelihood_given_dem[top_words.index(every_token1)]
    tweet_log_pr_rep1=tweet_log_pr_rep1+log_prior_rep
    tweet_log_pr_dem1=tweet_log_pr_dem1+log_prior_dem
    
    map_dem_token1=np.append(map_dem_token1,tweet_log_pr_dem1)
    map_rep_token1=np.append(map_rep_token1,tweet_log_pr_rep1)

    # MAP decision: predict Democrat (True) only when its score is strictly larger;
    # ties (including -inf vs -inf) go to Republican (False)
    map_prediction_list1=np.append(map_prediction_list1, tweet_log_pr_dem1>tweet_log_pr_rep1)

    tweet_log_pr_rep1=0
    tweet_log_pr_dem1=0

#print(map_prediction_list.shape)
print(Counter(map_prediction_list1))
print(len(map_prediction_list1))

Counter({1.0: 2171, 0.0: 2127})
4298


In [31]:
test_label1=tweets_test['label']
#print(test_label)
label_test_array1 = np.array([])
label_test_array1 = np.append(label_test_array1, list(test_label1))
print(Counter(label_test_array1))

TP1=0
TN1=0

for v in range(0,len(map_prediction_list1)):
    if label_test_array1[v]==1 and map_prediction_list1[v]==1:
        TP1=1+TP1
    elif label_test_array1[v]==0 and map_prediction_list1[v]==0:
        TN1+=1

#accuracy
print(TP1)
accuracy1 = float(TP1+TN1)/len(label_test_array1)
print(TP1+TN1, len(label_test_array1), accuracy1)
print("Accuracy is %f percent"%(accuracy1*100))

Counter({1.0: 2206, 0.0: 2092})
1781
3483 4298 0.8103769194974406
Accuracy is 81.037692 percent

Without Laplace smoothing the test accuracy is 81.04% (3483 of 4298 correct), compared with 81.18% (3489 of 4298 correct) with smoothing, so smoothing changes the outcome for only a handful of tweets here.
