<H1>INSTRUCTIONS</H1>

In this ICA, you will build various text classification models and use them to classify sentences from 2016 presidential debates according to speaker.

The ICA zip file contains three files: train, dev, and test. Each file has one document per line, tokenized and lowercased, so you don't have to do any preprocessing. Each line has the format:

      
   <font color="red"> <center>trump i fully understand .</center></font>

where the first token is the speaker's name, and the remaining tokens are the words of the document.
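
The line format described above can be parsed with a small helper. This is only a sketch: `parse_line` is a hypothetical name that does not appear in the provided code, and stripping the trailing newline is an assumption about what you want downstream.

```python
def parse_line(line):
    """Split one corpus line into (speaker, sentence)."""
    # Drop the trailing newline, then split on the first space only,
    # so the sentence keeps all of its remaining tokens.
    speaker, _, sentence = line.rstrip("\n").partition(" ")
    return speaker, sentence


speaker, sentence = parse_line("trump i fully understand .\n")
print(speaker, "|", sentence)  # prints: trump | i fully understand .
```

Using `str.partition` means a line that contains only a speaker name yields an empty sentence rather than raising an `IndexError`.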


<H2>Hands on programming with Text Classification</H2> 

A Naive Bayes classifier with 51.5% accuracy is provided.
Experiment with two new kinds of features.
<ol type="a">
<li>Extend the code provided to construct a bag of words to include an additional kind of feature besides words. You can try bigrams, prefixes, parts of speech, anything you like. Describe your new features. Report your new accuracy. Briefly write down your conclusions from this experiment. <b>(5 points)</b></li>
    <li>Do the same thing for another kind of feature. Report your new accuracy. Briefly write down your conclusions from this experiment. <b>(5 points)</b></li>
</ol>
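
One possible way to add a feature kind on top of words, rather than in place of them, is to ask `CountVectorizer` for unigrams and bigrams together with `ngram_range=(1, 2)`. The sketch below uses two made-up sentences, not the debate data, and relies on the default token pattern, which drops punctuation and one-character tokens such as "i".

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents standing in for debate sentences (hypothetical data).
docs = ["i fully understand .", "we will win , believe me ."]

# ngram_range=(1, 2) keeps single words AND adds adjacent word pairs;
# ngram_range=(2, 2) would keep the pairs only.
vect = CountVectorizer(ngram_range=(1, 2))
X = vect.fit_transform(docs)

unigrams = sorted(f for f in vect.vocabulary_ if " " not in f)
bigrams = sorted(f for f in vect.vocabulary_ if " " in f)
print(unigrams)
print(bigrams)
```

Both kinds of feature end up as columns of the same document-term matrix, so the classifier code does not need to change.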

In [1]:
import pandas as pd
import numpy as np

In [2]:
##Load Training Data
train_x=[]
train_y=[]

In [3]:
with open('train.txt', 'r') as f:
    for line in f:
        # The first token is the speaker; the rest of the line is the sentence
        temp = line.split(' ', 1)
        train_x.append(temp[1])
        train_y.append(temp[0])
       

In [4]:
np.unique(train_y)

array(['bush', 'carson', 'chafee', 'christie', 'clinton', 'cruz',
       'fiorina', 'huckabee', 'kasich', "o'malley", 'paul', 'perry',
       'rubio', 'sanders', 'trump', 'walker', 'webb'], dtype='<U8')

In [5]:
len(np.unique(train_y))

17

In [6]:
train_df = pd.DataFrame({'sentence':train_x, 'speaker':train_y})
train_df.head()

                                            sentence  speaker
0  no . i am very proud to be jewish , and being ...  sanders
1  well , that just wasn't that just wasn't the f...  clinton
2  thank you , anderson . thank you , cnn . and t...   chafee
3                . . . let us talk about issues . \n  sanders
4                thank you , bernie . thank you . \n  clinton


In [7]:
##Load test data
test_x=[]
test_y=[]

In [8]:
with open('test.txt', 'r') as f:
    for line in f:
        temp = line.split(' ', 1)
        test_x.append(temp[1])
        test_y.append(temp[0])

In [9]:
test_df = pd.DataFrame({'sentence':test_x, 'speaker':test_y})
test_df.head()

                                            sentence  speaker
0  madam secretary , when he asked me to speak . ...  sanders
1  too many lives have been destroyed because peo...  sanders
2                          . . . well and i . . . \n  sanders
3  look , the secretary is right . this is a terr...  sanders
4  well , hillary clinton , and everybody else wh...  sanders


<H3>Train Naive Bayes Classifier with the data in "train.txt" file</H3>

In [10]:
#Count of documents per class
from collections import Counter
docs_per_class = Counter(train_df['speaker'])
docs_per_class

Counter({'sanders': 501,
         'clinton': 455,
         'chafee': 18,
         "o'malley": 128,
         'webb': 28,
         'bush': 207,
         'cruz': 273,
         'trump': 637,
         'christie': 81,
         'rubio': 305,
         'kasich': 162,
         'fiorina': 90,
         'paul': 94,
         'carson': 104,
         'huckabee': 41,
         'walker': 31,
         'perry': 1})
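
The counts above are heavily skewed, so it helps to know what a trivial classifier would score. As a rough sketch, the snippet below recomputes the share of the largest class from the counts printed above; it is measured on the training counts only, so the corresponding test figure may differ.

```python
from collections import Counter

# Class counts copied from the output above.
docs_per_class = Counter({
    'sanders': 501, 'clinton': 455, 'chafee': 18, "o'malley": 128,
    'webb': 28, 'bush': 207, 'cruz': 273, 'trump': 637, 'christie': 81,
    'rubio': 305, 'kasich': 162, 'fiorina': 90, 'paul': 94,
    'carson': 104, 'huckabee': 41, 'walker': 31, 'perry': 1,
})

total = sum(docs_per_class.values())
majority_label, majority_count = docs_per_class.most_common(1)[0]
majority_share = majority_count / total

# Accuracy of always predicting the most frequent speaker on the training set.
print(majority_label, total, round(majority_share, 3))  # prints: trump 3156 0.202
```

Any model worth reporting should clear this roughly 20% floor by a wide margin.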

In [56]:
from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer(stop_words="english",ngram_range=(2,2)) 
X_train_dtm = vect.fit_transform(train_df.sentence)


In [57]:
# Mapping from each feature (here, a bigram) to its column index in the document-term matrix
vect.vocabulary_

{'proud jewish': 51641,
 'jewish jewish': 33941,
 'jewish look': 33942,
 'look father': 39264,
 'father family': 23206,
 'family wiped': 23052,
 'wiped hitler': 72572,
 'hitler holocaust': 30072,
 'holocaust know': 30129,
 'know crazy': 35582,
 'crazy radical': 14782,
 'radical extremist': 52388,
 'extremist politics': 22395,
 'politics mean': 49299,
 'mean learned': 41233,
 'learned lesson': 37058,
 'lesson tiny': 37451,
 'tiny tiny': 66596,
 'tiny child': 66595,
 'child mother': 10483,
 'mother shopping': 42902,
 'shopping people': 59136,
 'people working': 47762,
 'working stores': 73389,
 'stores numbers': 61789,
 'numbers arms': 44636,
 'arms hitler': 4560,
 'hitler concentration': 30071,
 'concentration camp': 12729,
 'camp proud': 8862,
 'jewish essential': 33940,
 'essential human': 21478,
 'just wasn': 34826,
 'wasn just': 71408,
 'wasn fact': 71400,
 'fact senator': 22648,
 'senator fact': 58519,
 'fact great': 22589,
 'great effort': 27748,
 'effort obama': 20315,
 'obama ad

In [58]:
#Train Naive Bayes Classifier
#MultinomialNB applies additive (Laplace) smoothing with alpha=1.0 by default
from sklearn.naive_bayes import MultinomialNB
nb_clf = MultinomialNB().fit(X_train_dtm, train_df.speaker)
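
To make the smoothing factor concrete, here is a small sketch that recomputes the add-one estimate log((count + 1) / (total + V)) by hand for one class and compares it with what `MultinomialNB` stores in `feature_log_prob_`. The three documents and their labels are invented for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus (hypothetical data, not taken from train.txt).
docs = ["great wall great deal", "health care plan", "great health plan"]
labels = ["trump", "clinton", "clinton"]

vect = CountVectorizer()
X = vect.fit_transform(docs)
clf = MultinomialNB(alpha=1.0).fit(X, labels)

# Sum the word counts over all documents of class "trump".
trump_rows = [i for i, y in enumerate(labels) if y == "trump"]
counts = np.asarray(X[trump_rows].sum(axis=0)).ravel()

# Add-one smoothing: add 1 to every count and V (vocabulary size) to the total.
vocab_size = X.shape[1]
manual = np.log((counts + 1.0) / (counts.sum() + vocab_size))

trump_idx = list(clf.classes_).index("trump")
print(np.allclose(manual, clf.feature_log_prob_[trump_idx]))
```

Words that never occur with a speaker still receive a small non-zero probability, which keeps a single unseen word from zeroing out the whole document score.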

<H3>Test the developed naive bayes classifier with "test.txt" file</H3>

In [59]:
X_test_dtm = vect.transform(test_df.sentence)
predicted_speaker=nb_clf.predict(X_test_dtm)

In [60]:
from sklearn import metrics
metrics.accuracy_score(test_df.speaker, predicted_speaker)

0.54

In [61]:
#Evaluate model
from sklearn import metrics
# classification_report lists classes in sorted label order, so target_names is omitted;
# passing test_df.speaker.unique() (order of first appearance) would attach the wrong name to each row
print(metrics.classification_report(test_df.speaker, predicted_speaker))

              precision    recall  f1-score   support

        bush       0.75      0.20      0.32        30
      carson       1.00      0.19      0.32        16
      chafee       0.00      0.00      0.00         2
    christie       1.00      0.25      0.40        12
     clinton       0.58      0.58      0.58        53
        cruz       0.61      0.56      0.58        34
     fiorina       1.00      0.20      0.33        10
    huckabee       0.00      0.00      0.00         7
      kasich       0.88      0.45      0.60        31
    o'malley       1.00      0.39      0.56        18
        paul       1.00      0.20      0.33         5
       rubio       0.62      0.44      0.51        41
     sanders       0.55      0.67      0.61        64
       trump       0.41      0.97      0.57        71
      walker       0.00      0.00      0.00         3
        webb       0.00      0.00      0.00         3

   micro avg       0.54      0.54      0.54       400
   macro avg       0.59   

  'precision', 'predicted', average, warn_for)
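
A per-class report is easiest to read when the row order is explicit. The sketch below, on four invented predictions, passes a sorted `labels` list to `classification_report` and sets `zero_division=0` so that classes that are never predicted get a precision of 0 without a warning. It assumes a scikit-learn version that supports `output_dict` and `zero_division`.

```python
from sklearn.metrics import classification_report

# Toy gold labels and predictions (hypothetical data).
y_true = ["trump", "clinton", "trump", "sanders"]
y_pred = ["trump", "trump", "trump", "sanders"]

# Fix the row order explicitly: every label seen in either list, sorted.
labels = sorted(set(y_true) | set(y_pred))

report = classification_report(
    y_true, y_pred, labels=labels, output_dict=True, zero_division=0
)
for name in labels:
    row = report[name]
    print(name, round(row["precision"], 2), round(row["recall"], 2), row["support"])
```

If display names are needed, `target_names` should be given in the same order as `labels`.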


#### 1: Removing English stop words (unigram features)

In [62]:
vect1 = CountVectorizer(stop_words="english") 
X_train_dtm = vect1.fit_transform(train_df.sentence)

In [63]:
# Mapping from each word to its column index in the document-term matrix
vect1.vocabulary_

{'proud': 6233,
 'jewish': 4406,
 'look': 4809,
 'father': 3161,
 'family': 3135,
 'wiped': 8692,
 'hitler': 3868,
 'holocaust': 3881,
 'know': 4548,
 'crazy': 1995,
 'radical': 6337,
 'extremist': 3089,
 'politics': 5951,
 'mean': 5001,
 'learned': 4649,
 'lesson': 4686,
 'tiny': 8005,
 'child': 1489,
 'mother': 5236,
 'shopping': 7195,
 'people': 5779,
 'working': 8730,
 'stores': 7570,
 'numbers': 5426,
 'arms': 668,
 'concentration': 1734,
 'camp': 1285,
 'essential': 2934,
 'human': 3953,
 'just': 4474,
 'wasn': 8585,
 'fact': 3105,
 'senator': 7085,
 'great': 3629,
 'effort': 2718,
 'obama': 5436,
 'administration': 334,
 'really': 6419,
 'send': 7087,
 'clear': 1569,
 'message': 5060,
 'knew': 4543,
 'children': 1492,
 'abused': 240,
 'treated': 8125,
 'terribly': 7905,
 'tried': 8145,
 'border': 1092,
 'disagreement': 2436,
 'think': 7948,
 've': 8423,
 'called': 1274,
 'counsel': 1949,
 'face': 3097,
 'kind': 4527,
 'process': 6140,
 'speaks': 7410,
 'advocates': 375,
 'right'

In [64]:
nb1_clf = MultinomialNB().fit(X_train_dtm, train_df.speaker)

In [66]:
X_test_dtm = vect1.transform(test_df.sentence)
predicted_speaker=nb1_clf.predict(X_test_dtm)

In [67]:
metrics.accuracy_score(test_df.speaker, predicted_speaker)

0.535

#### 2: Using bigrams (with stop words removed)

In [68]:
vect2 = CountVectorizer(stop_words="english",ngram_range=(2,2)) 
X_train_dtm = vect2.fit_transform(train_df.sentence)

In [69]:
# Mapping from each bigram to its column index in the document-term matrix
vect2.vocabulary_

{'proud jewish': 51641,
 'jewish jewish': 33941,
 'jewish look': 33942,
 'look father': 39264,
 'father family': 23206,
 'family wiped': 23052,
 'wiped hitler': 72572,
 'hitler holocaust': 30072,
 'holocaust know': 30129,
 'know crazy': 35582,
 'crazy radical': 14782,
 'radical extremist': 52388,
 'extremist politics': 22395,
 'politics mean': 49299,
 'mean learned': 41233,
 'learned lesson': 37058,
 'lesson tiny': 37451,
 'tiny tiny': 66596,
 'tiny child': 66595,
 'child mother': 10483,
 'mother shopping': 42902,
 'shopping people': 59136,
 'people working': 47762,
 'working stores': 73389,
 'stores numbers': 61789,
 'numbers arms': 44636,
 'arms hitler': 4560,
 'hitler concentration': 30071,
 'concentration camp': 12729,
 'camp proud': 8862,
 'jewish essential': 33940,
 'essential human': 21478,
 'just wasn': 34826,
 'wasn just': 71408,
 'wasn fact': 71400,
 'fact senator': 22648,
 'senator fact': 58519,
 'fact great': 22589,
 'great effort': 27748,
 'effort obama': 20315,
 'obama ad

In [70]:
nb2_clf = MultinomialNB().fit(X_train_dtm, train_df.speaker)

In [71]:
X_test_dtm = vect2.transform(test_df.sentence)
predicted_speaker=nb2_clf.predict(X_test_dtm)

In [72]:
metrics.accuracy_score(test_df.speaker, predicted_speaker)

0.54

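I used stop-word removal and bigrams as my two additional features. Removing English stop words from the bag of words gave 53.5% accuracy, and switching to bigrams (still with stop words removed) gave 54%, both above the provided 51.5% baseline. This suggests that pairs of adjacent words are slightly more informative about the speaker than single words, although the gain over unigrams is only 0.5 percentage points. Accuracy decreased when I tried trigrams, probably because the documents are short and most trigrams occur too rarely to be useful.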
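
A comparison like the one described above can be scripted so that every n-gram setting is scored the same way. The following is a minimal sketch on invented sentences; in the assignment the same loop could be run against the dev file, which is an assumption about workflow rather than something the notebook does. The helper name `evaluate` is hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Toy training and held-out data (hypothetical, not from the ICA files).
train_docs = [
    "we will build a great wall",
    "we will make great deals",
    "health care is a human right",
    "we need health care for all",
]
train_labels = ["trump", "trump", "sanders", "sanders"]
dev_docs = ["great wall and great deals", "health care for all people"]
dev_labels = ["trump", "sanders"]


def evaluate(ngram_range):
    """Fit Naive Bayes on n-gram counts and return held-out accuracy."""
    vect = CountVectorizer(ngram_range=ngram_range)
    clf = MultinomialNB().fit(vect.fit_transform(train_docs), train_labels)
    return accuracy_score(dev_labels, clf.predict(vect.transform(dev_docs)))


# Unigrams, unigrams plus bigrams, bigrams only, trigrams only.
results = {n: evaluate(n) for n in [(1, 1), (1, 2), (2, 2), (3, 3)]}
for ngram_range, acc in results.items():
    print(ngram_range, acc)
```

Selecting the setting on held-out data other than the test file keeps the reported test accuracy from being tuned to the test set.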

