# Lab 04

student: John Wu

In [1]:
import nltk, sys, csv, string, re, sklearn.preprocessing, sklearn.metrics
import numpy as np, pandas as pd
from sklearn import naive_bayes as NB, svm as SVM
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.pipeline import Pipeline
import IPython.display as disp

## Loading in Data

In [2]:
# input file names
trainFile = './data/train.tsv'
testFile = './data/test.tsv'
devFile = './data/dev.tsv'
varNames = ['stars','docID','text']

# read in files
train = pd.read_csv(trainFile, sep='\t', header=None, names=varNames)
dev = pd.read_csv(devFile, sep='\t', header=None, names=varNames)
test = pd.read_csv(testFile, sep='\t', header=None, names=varNames)

## (a) Study the training data

This section explores the training data set to build a better understanding of the data. We start by checking whether the classes are balanced. Since every review is rated either 2 or 4 stars, a mean rating of 3 means the two classes are equally represented.

In [3]:
train['stars'].mean()

3.0
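Because the labels take only the values 2 and 4, a mean of exactly 3 implies equal class counts. Counting the labels directly makes this explicit. The sketch below uses a small made-up label list (`stars`) rather than the lab data.

```python
from collections import Counter

# Hypothetical labels standing in for train['stars'] (values are 2 or 4 only)
stars = [2, 4, 4, 2, 2, 4]

counts = Counter(stars)                 # number of reviews per star rating
mean_rating = sum(stars) / len(stars)   # equals 3 only if the two classes tie

print(counts[2], counts[4], mean_rating)
```

On the real data frame, `train['stars'].value_counts()` would give the same per-class counts.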

To get some idea of useful features, we use `CountVectorizer` to build a document-term matrix. We set `binary=True` so that it records only presence (value 1 if the term appears in the document at least once) rather than counts.

In [4]:
binVec = CountVectorizer(tokenizer=nltk.word_tokenize, binary=True)
binTF = binVec.fit_transform(train['text'])
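To illustrate what the binary setting does, here is a minimal sketch on two made-up documents. It uses the default scikit-learn tokenizer instead of NLTK, so the tokens differ slightly from the cell above.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents (made up for illustration, not from the lab data)
docs = ["great food great service", "bad food"]

vec = CountVectorizer(binary=True)        # default tokenizer instead of NLTK
X = vec.fit_transform(docs).toarray()

# Recover the column order from the fitted vocabulary (term -> column index)
vocab = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
print(vocab)
print(X)
```

Although "great" occurs twice in the first document, its entry is 1, since only presence is recorded.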

#### Relative word frequency
In the following section, we find the terms with the biggest difference in rate of appearance between two-star and four-star reviews.

In [5]:
twoSt = (train['stars']==2).to_numpy() # idx for 2-star reviews
tfDiff = np.abs(binTF[twoSt].mean(axis=0) - binTF[~twoSt].mean(axis=0))
top20idx = tfDiff.A1.argsort()[-20:][::-1] # get last 20, descending
terms = binVec.get_feature_names_out() # actual terms of doc matrix (scikit-learn >= 1.0)
for x in top20idx: # loop over all terms in top 20 difference
    a,b = binTF[~twoSt,x].mean()*100, binTF[twoSt,x].mean()*100
    print('%s: %.2f%% (pos), %.2f%% (neg)'%(terms[x],a,b))

great: 39.60% (pos), 17.80% (neg)
was: 54.90% (pos), 75.60% (neg)
not: 42.40% (pos), 62.00% (neg)
!: 47.90% (pos), 28.40% (neg)
were: 27.60% (pos), 41.60% (neg)
n't: 45.30% (pos), 59.10% (neg)
always: 22.70% (pos), 9.80% (neg)
good: 55.70% (pos), 42.90% (neg)
did: 14.40% (pos), 27.10% (neg)
be: 32.60% (pos), 44.20% (neg)
just: 26.70% (pos), 38.10% (neg)
better: 11.70% (pos), 22.80% (neg)
delicious: 14.40% (pos), 3.30% (neg)
friendly: 17.50% (pos), 6.60% (neg)
are: 45.00% (pos), 34.20% (neg)
because: 13.80% (pos), 24.10% (neg)
ordered: 13.90% (pos), 24.10% (neg)
no: 15.50% (pos), 25.50% (neg)
bad: 7.20% (pos), 17.00% (neg)
at: 38.60% (pos), 47.10% (neg)


The terms are listed in order of disparity. As expected, words like "great", "always", "good", "delicious", and "friendly" have a high presence in positive reviews, while terms like "not" and "bad" have a high presence in negative reviews. However, there are also some counter-intuitive examples. The term "better" appears more often in negative reviews because of expressions like "maybe the next time I come in the food will be better". Similarly, while the term "like" can connote a favorable feeling, it is also used in similes, which appear in negative reviews such as "it tastes like a combo of cream cheese, american cheese and sour cream".

Terms like "was", "were", and "did" also have a higher presence in negative reviews. These words combine with others to form negative phrases like "was not" and "weren't". One unexpected result is that people are much more likely to use exclamation marks in positive reviews. The word "ordered" appears more frequently in negative reviews. Looking through the negative reviews, they often list the items ordered and describe what was wrong with them, such as "my friend ordered a virgin strawberry daiquiri and instead she got some weird smoothie with whip cream on top".
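The disparity ranking used above can be reproduced on toy data. This is a minimal sketch with a hypothetical 4-by-3 binary document-term matrix; the names `X`, `terms`, and `is_neg` are illustrative only.

```python
import numpy as np

# Hypothetical binary document-term matrix: rows are reviews, columns are terms
terms = ['great', 'was', 'the']
X = np.array([[1, 0, 1],
              [1, 1, 1],
              [0, 1, 1],
              [0, 1, 1]])
is_neg = np.array([False, False, True, True])   # True marks a 2-star review

pos_rate = X[~is_neg].mean(axis=0)   # share of positive reviews containing each term
neg_rate = X[is_neg].mean(axis=0)    # share of negative reviews containing each term
gap = np.abs(neg_rate - pos_rate)

order = np.argsort(gap)[::-1]        # largest disparity first
for i in order:
    print('%s: %.0f%% (pos), %.0f%% (neg)' % (terms[i], pos_rate[i] * 100, neg_rate[i] * 100))
```

A term that appears in every review, such as "the" here, has zero disparity and lands at the bottom of the ranking regardless of how frequent it is.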

#### Other useful characteristics

In this section, we explore a few characteristics that differ between the two types of reviews. In each cell, we break down the characteristic by star rating.

In [6]:
textLen = train['text'].str.len() # length of text
textLen.groupby(train['stars']).mean()

stars
2    720.375
4    631.283
Name: text, dtype: float64

Negative reviews are about 89 characters longer on average.

In [7]:
capPct = train['text'].str.count(r'[A-Z]')/textLen # % of chars upper case
capPct.groupby(train['stars']).mean()

stars
2    0.025294
4    0.027273
Name: text, dtype: float64

Positive reviews tend to have a slightly larger proportion of upper case letters.

In [8]:
nTwos = train['text'].str.count('2') # number of appearances of the digit 2
(nTwos/textLen).groupby(train['stars']).mean()

stars
2    0.000543
4    0.000399
Name: text, dtype: float64

2-star reviews tend to have more mentions of the number "2", likely from the reviews explicitly enumerating the score.
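The per-class summaries in this section all follow the same pattern: compute a per-review quantity, then average it within each star rating. A minimal sketch on a made-up four-review frame (`toy` is hypothetical) shows the pattern for two characteristics at once.

```python
import pandas as pd

# Hypothetical reviews standing in for the training data
toy = pd.DataFrame({
    'stars': [2, 2, 4, 4],
    'text': ['Bad. Cold food.', 'Not GOOD', 'Great!', 'Loved It'],
})

length = toy['text'].str.len()                          # characters per review
upper_share = toy['text'].str.count(r'[A-Z]') / length  # share of upper-case letters

# Average each characteristic within each star rating
summary = pd.DataFrame({'length': length, 'upper_share': upper_share})
summary = summary.groupby(toy['stars']).mean()
print(summary)
```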

## (b) Train a classifier

In this section, we build a pipeline for a naive Bayes classifier. The pipeline includes two parts:
1. **TF-IDF vectorizer**, which extracts features from input text and builds a document-term matrix based on TF-IDF values. 
  * The text is tokenized via NLTK `word_tokenize` function.
    * It is based on [Treebank tokenization](ftp://ftp.cis.upenn.edu/pub/treebank/public_html/tokenization.html) developed at UPenn.
    * It splits on whitespace and also splits contractions, e.g. "can't" -> "ca", "n't"
    * It tokenizes runs of consecutive punctuation, such as “,”, “?”, “—”, or “…”
    * Punctuation mixed with letters or digits, such as “03/20/2018”, is kept as one token, as are things like URLs and hyphenated words like “open-faced”
  * Stop words are not removed, as even simple terms like "was" and "not" appear at significantly different rates in positive and negative reviews.
  * Only the top 5K terms are retained (`max_features=5000`, which ranks terms by their frequency across the corpus). Terms with a document frequency below 5 are also removed.
  * TF-IDF weights are used for document-term matrix.
1. **Multinomial naive Bayes** model is chosen
  * Multinomial NB is chosen because the features are based on term counts, where the frequency of a term matters as well as its mere appearance (the basis of the Bernoulli naive Bayes model).
  * Laplace smoothing is used with $\alpha=1$, due to the large number of features and possibility of a term not appearing in the training set.
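To make the role of $\alpha$ concrete, the sketch below applies additive smoothing, $(N_{tc}+\alpha)/(N_c+\alpha|V|)$, to made-up per-class term totals. The function name `smoothed_likelihoods` and the counts are hypothetical; in the actual pipeline the "counts" are TF-IDF weights rather than integers.

```python
def smoothed_likelihoods(term_counts, alpha=1.0):
    """Return P(term | class) with additive (Laplace) smoothing."""
    total = sum(term_counts.values())      # N_c: total count in this class
    vocab_size = len(term_counts)          # |V|: number of distinct terms
    return {t: (n + alpha) / (total + alpha * vocab_size)
            for t, n in term_counts.items()}

# Hypothetical per-class term totals; 'delicious' never occurs in this class
neg_counts = {'bad': 3, 'was': 5, 'delicious': 0}
probs = smoothed_likelihoods(neg_counts, alpha=1.0)
print(probs)
```

Without smoothing, the unseen term would get probability zero and would zero out the likelihood of any review containing it.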

In [9]:
NB_tfidf = Pipeline([ # establish pipeline
    ('vect', TfidfVectorizer(tokenizer=nltk.word_tokenize, # vectorize
                             max_features=5000, min_df=5)),
    ('clf', NB.MultinomialNB(alpha=1)) # classify
])

NB_tfidf.fit(train['text'], train['stars']) # fit model on training set
pred_dev = NB_tfidf.predict(dev['text']) # pred based on dev set

To get a view of the underlying data, we pull the TF-IDF document-term matrix and its terms from the pipeline. Since the matrix is sparse, we only present the elements that have a nonzero value. The cell below prints the text of the first review as well as its document-matrix representation.

In [10]:
docMat = NB_tfidf.named_steps['vect'].transform(train['text'])
tfidfTerms = list(NB_tfidf.named_steps['vect'].get_feature_names_out()) # scikit-learn >= 1.0

print(train['text'][0]) # print text of first review
nonZero = docMat[0,]!=0 # idx for terms that are in first document
names = [tfidfTerms[x] for x in np.where(nonZero.todense().A1)[0]] # actual terms
firstDoc = pd.DataFrame(docMat[0,][nonZero], columns=names) # get first Doc
for n in range(0, firstDoc.shape[1], 10): # loop to print document
    disp.display(disp.HTML(firstDoc.iloc[:,n:n+10].to_html()))

So let me set the scene first, My church social group took a trip here last saturday. We are not your mothers church. The churhc is Community Church of Hope, We are the valleys largest GLBT church so when we desended upon Organ stop Pizza, in LDS land you know we look a little out of place. We had about 50 people from our church come and boy did we have fun.  There was a baptist church a couple rows down from us who didn't see it coming. Now we aren't a bunch of flamers frolicking around or anything but we do tend to get a little loud and generally have a great time. I did recognized some of the music  so I was able to sing along with those.  This is a great place to take anyone over 50.  I do think they might be washing dirtymob money or something since the business is cash only.........which I think caught a lot of people off guard including me.  The show starts at 530  so dont be late !!!!!!


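|  | `!` | `,` | `.` | `...` | 50 | a | able | about | along | and |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.181439 | 0.052572 | 0.124862 | 0.112978 | 0.0898 | 0.156293 | 0.07353 | 0.037368 | 0.076885 | 0.0328 |

|  | anyone | anything | are | around | at | be | boy | bunch | business | but |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.071641 | 0.055843 | 0.088998 | 0.051678 | 0.028452 | 0.060279 | 0.092999 | 0.091857 | 0.068932 | 0.021933 |

|  | cash | caught | church | come | coming | community | couple | did | do | dont |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.081739 | 0.104911 | 0.629466 | 0.051043 | 0.065795 | 0.104911 | 0.061112 | 0.118815 | 0.066965 | 0.084209 |

|  | down | first | from | fun | generally | get | great | group | had | have |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.050583 | 0.047145 | 0.073359 | 0.065199 | 0.08232 | 0.034248 | 0.069237 | 0.071641 | 0.028669 | 0.052001 |

|  | here | hope | i | in | including | is | it | know | land | largest |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.032635 | 0.076074 | 0.069533 | 0.022206 | 0.075684 | 0.062039 | 0.019581 | 0.046379 | 0.104911 | 0.102536 |

|  | last | late | let | little | look | lot | loud | me | might | money |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.052683 | 0.067284 | 0.060381 | 0.085246 | 0.063712 | 0.053483 | 0.073869 | 0.069344 | 0.062196 | 0.070227 |

|  | music | my | n't | not | now | of | off | only | or | our |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.072875 | 0.025224 | 0.050829 | 0.025415 | 0.050358 | 0.100891 | 0.050433 | 0.038258 | 0.070051 | 0.04104 |

|  | out | over | people | pizza | place | saturday | scene | see | set | show |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.034066 | 0.046905 | 0.093811 | 0.051199 | 0.055569 | 0.0705 | 0.100479 | 0.056056 | 0.07961 | 0.07961 |

|  | since | so | social | some | something | starts | stop | take | tend | the |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.054135 | 0.115262 | 0.104911 | 0.038056 | 0.04949 | 0.095573 | 0.065795 | 0.052341 | 0.090794 | 0.095061 |

|  | there | they | think | this | those | time | to | took | trip | upon |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.029094 | 0.025578 | 0.093456 | 0.023143 | 0.060815 | 0.034699 | 0.054798 | 0.05423 | 0.073869 | 0.082924 |

|  | us | was | we | when | which | who | with | you | your |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.047888 | 0.043959 | 0.253231 | 0.0355 | 0.038741 | 0.052683 | 0.025123 | 0.02801 | 0.043173 |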
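In the output above, "church" carries by far the largest weight (0.629), which is consistent with it being both repeated within the review and rare across the corpus. The sketch below shows how such weights arise, assuming the smoothed-IDF and L2-normalization scheme that, to my understanding, matches the `TfidfVectorizer` defaults. The helper `tfidf_vectors` and the two toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Smoothed TF-IDF with L2 normalization for pre-tokenized documents."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))          # document frequency
    idf = {t: math.log((1 + n) / (1 + d)) + 1 for t, d in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)                                      # raw term counts
        raw = {t: c * idf[t] for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))     # L2 length
        vectors.append({t: w / norm for t, w in raw.items()})
    return vectors

# Toy tokenized documents (made up for illustration)
docs = [['church', 'church', 'church', 'we', 'fun'],
        ['we', 'had', 'pizza']]
first = tfidf_vectors(docs)[0]
print(max(first, key=first.get))
```

The repeated, document-specific term dominates the first vector, while "we", which occurs in both documents, receives the smallest weight.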


Below are the predictions for the first 10 documents in the dev set.

In [11]:
for n in range(10):
    print('%s\t%d'%(dev['docID'][n],pred_dev[n]))

ZSJnW6faaNFQoqq4ALqYg	4
Rcbv11hm5AYEwZyqYwAvg	2
rkRTjhu5szaBggeFVcVJlA	4
dhmeDsQGUS1FXMLs49SWjQ	4
z9zfIMYmRRCE4ggfOIieEw	4
Xtb3pGSh39bqcozkBECw	2
DOUflAGzxLsXG6xOmR1w	2
0RxCEWURe08CTcZt95F4AQ	2
MzUg5twEcCyd0X6lBMP2Lg	2
uNlw2D5CYKk0wjNxLtYw	4


## (c) Evaluate your predictions

To evaluate the effectiveness of the classifier built in (b), the following function displays precision, recall, F1 score, and accuracy, as well as their constituent parts.

In [12]:
# Calculates various stats related to validation
def validationStats(y_Prd, y_Act, msg='', algo='naive Bayes'):
    # confusion matrix, T=true, F=false, N=negative, P=positive
    TN, FP, FN, TP = sklearn.metrics.confusion_matrix(y_Act, y_Prd).ravel()
    precision,recall = TP/(TP+FP) , TP/(TP+FN) # precision and recall
    corr,tot = TN+TP , TN+TP+FN+FP # used for accuracy calculation
    print("Using %s, %s"%(algo,msg))
    print("\tTP=%d, TN=%d, FP=%d, FN=%d"%(TP,TN,FP,FN))
    print("\tRecall: %u/%u = %.1f%%" % (TP, TP+FN, recall*100) )
    print("\tPrecision: %u/%u = %.1f%%" % (TP, TP+FP, precision*100) )
    print("\tF1 score: %.3f" % (2*precision*recall / (precision+recall)) )
    print("\tAccuracy: %u/%u = %.1f%%" % (corr,tot,corr/tot*100) )
    return (TN, FP, FN, TP)

In [13]:
validationStats(pred_dev, dev['stars'], 'TF-IDF doc vectors');

Using naive Bayes, TF-IDF doc vectors
	TP=835, TN=838, FP=162, FN=165
	Recall: 835/1000 = 83.5%
	Precision: 835/997 = 83.8%
	F1 score: 0.836
	Accuracy: 1673/2000 = 83.7%
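As a cross-check, the reported metrics can be recomputed directly from the four counts. Note that `confusion_matrix` orders labels ascending, so in `validationStats` the "positive" class is the 4-star review. The snippet below is a plain-arithmetic sketch using the counts printed above.

```python
# Confusion-matrix counts reported above for the naive Bayes model
TP, TN, FP, FN = 835, 838, 162, 165

precision = TP / (TP + FP)                      # of predicted 4-star, how many were right
recall = TP / (TP + FN)                         # of actual 4-star, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (TP + TN) / (TP + TN + FP + FN)

print('%.3f %.3f %.3f %.4f' % (precision, recall, f1, accuracy))
```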


### Analysis of classification mistakes

Below we present a few documents in the dev set that were classified incorrectly. The text of the document is first displayed, and the mean TF-IDF value of a few select terms by review type is also shown to assess why the review was misclassified. 

In [14]:
def getTermCompByClass(termList):
    ''' Compute the mean TF-IDF value (over the training set) of each term in
    termList by review type, and display the resulting table '''
    idx = [tfidfTerms.index(x) for x in termList]
    df = pd.DataFrame(docMat[:,idx].todense()).groupby(train['stars']).mean()
    df.columns = termList
    disp.display(disp.HTML(df.to_html()))

In [15]:
print(dev['text'][57])
getTermCompByClass(['not', 'sucks', 'but', "n't", 'though', 'greasy', 'time'])

Ok, I am not sure why people put down the Stratosphere, yea...yea the casino action kind of sucks....but don't go there to gamble, go there for Lucky's and Fat Tuesday's....But let's get back to Lucky's....we were here with a couple that had never been to the Strat, so we decided to go before having dinner, my honey and I decided to play some penny slot to kill time, and LO and BEHOLD...I saw a sign advertising steak and crab legs for 9.99, well, most people (NOT ME) would be scared off by that, but hell it was a hard night, I lost some moola and was looking for some cheap (but good) grub....I am after all a foodie...haha, only if you count greasy spoons. Anyways, back to Lucky's the dinner was really good, the steak was juicy, the crab legs meaty, tender and were already cut in half for you....shoot people, what more do you want, what more do you need.....Oh, but wait there's a catch, you can only order that between 7p-10p or 6-10, I forgot, but I do know it ends at 10p. Try to catch 

| stars | not | sucks | but | n't | though | greasy | time |
|---|---|---|---|---|---|---|---|
| 2 | 0.041488 | 0.001672 | 0.042361 | 0.042388 | 0.009895 | 0.00451 | 0.01925 |
| 4 | 0.024336 | 0.000762 | 0.033935 | 0.027497 | 0.006978 | 0.003108 | 0.017394 |


This review was classified as a 2, but was actually a 4. The review starts out negatively, using words like "not", "sucks", "but", and "don't". However, the initial part of the review describes the casino, not the restaurant itself. Despite being a positive review, it is interspersed with words found more often in negative reviews, like "though", "greasy", and "time" (probably from people who waited a long time).

In [16]:
print(dev['text'][974])
getTermCompByClass(['recommend', 'great', 'good', 'love'])

It has been about a month since we last visited this place.  I recommend going to the pub and not the restaraunt side.  Service was great.  We got a couple of pints and some wings.  Wings were overdone although the sauces were good.  Love the garlic parmesan wings, even though overdone.


| stars | recommend | great | good | love |
|---|---|---|---|---|
| 2 | 0.00257 | 0.01096 | 0.025437 | 0.006714 |
| 4 | 0.008005 | 0.033682 | 0.041305 | 0.014412 |


This review was misclassified as a 4, but is actually a 2. Despite the overall review being negative, it talks about positive aspects of the visit. The review contains words typically associated with positive reviews like "recommend", "great", "good", and "love". Therefore, it's easy to see why this was misclassified.

In [17]:
print(dev['text'][1437])
getTermCompByClass(['great', 'friendly', 'awesome', 'fast', 'unfortunately'])

If you want atmosphere, it's a great, great, great coffee shop.  If you want espresso, food, or fast service, unfortunately, look elsewhere.    Every visit here has had me run into friendly, talkative, awesome people, but I go to a coffee shop wanting coffee, honestly.


| stars | great | friendly | awesome | fast | unfortunately |
|---|---|---|---|---|---|
| 2 | 0.01096 | 0.004824 | 0.00211 | 0.005806 | 0.005428 |
| 4 | 0.033682 | 0.016115 | 0.007643 | 0.009526 | 0.000644 |


This review was misclassified as a 4. It is easy to see why this happened, as the reviewer used "great" three times as well as other words like "friendly", "awesome", and "fast". These positive-leaning terms outweighed the negative word "unfortunately".
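One way to locate candidate documents for this kind of error analysis is to compare predictions against the gold labels and split the mistakes by direction. The arrays below are hypothetical stand-ins for `dev['stars']` and `pred_dev`; this is a sketch, not necessarily how the examples above were picked.

```python
import numpy as np

# Hypothetical gold labels and predictions
actual = np.array([4, 2, 2, 4, 2])
predicted = np.array([4, 4, 2, 2, 4])

wrong = np.flatnonzero(actual != predicted)   # positions of all mistakes
false_pos = wrong[predicted[wrong] == 4]      # 2-star reviews predicted as 4
false_neg = wrong[predicted[wrong] == 2]      # 4-star reviews predicted as 2
print(wrong, false_pos, false_neg)
```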

## (d) Build a second classifier

For the second classifier, we explore a support vector machine (SVM) pipeline to test whether it can outperform naive Bayes. The pipeline contains:

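1. **TF-IDF vectorizer**, using the same setup as in **(b)** with 5000 features, except that the minimum document frequency is lowered from 5 to 2 (`min_df=2`)
1. **Standard Scaler**, where each feature is divided by its standard deviation so that it has unit variance. Mean centering is disabled (`with_mean=False`) so that the document-term matrix stays sparse. Putting all features on a comparable scale tends to help the SVM separate the data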
1. **Support Vector Machines** as the classifier
  * SVC is chosen to try to get a better separation of the two classes. 
  * Radial basis function, $\exp(-\gamma||x-x'||^2)$, is used as the kernel, where $\gamma$ is inversely proportional to the number of features
  * To ensure repeatability, the same random state is seeded.
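The scaling and kernel steps listed above can be sketched in a few lines of NumPy. The matrix `X` is a made-up dense stand-in for the TF-IDF features, and `rbf` is a hypothetical helper; `gamma = 1 / n_features` is what `gamma='auto'` corresponds to in scikit-learn.

```python
import numpy as np

# Hypothetical dense TF-IDF rows (3 documents, 2 features)
X = np.array([[0.0, 0.2],
              [0.4, 0.0],
              [0.8, 0.4]])

std = X.std(axis=0)          # per-feature population standard deviation
X_scaled = X / std           # no centering, so zero entries stay zero

gamma = 1.0 / X.shape[1]     # gamma='auto' corresponds to 1 / n_features

def rbf(a, b):
    """RBF kernel exp(-gamma * ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

print(X_scaled.std(axis=0))
print(rbf(X_scaled[0], X_scaled[0]), rbf(X_scaled[0], X_scaled[2]))
```

After scaling, each column has unit variance, but values are not confined to the range 0 to 1. The kernel equals 1 for identical vectors and decays toward 0 as they move apart.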

In [18]:
SVM_tfidf = Pipeline([ # establish pipeline
    ('vect', TfidfVectorizer(tokenizer=nltk.word_tokenize, 
                             max_features=5000, min_df=2) ),
    ('scl', sklearn.preprocessing.StandardScaler(copy=False, with_mean=False)),
    ('clf', SVM.SVC(gamma='auto', max_iter=-1, random_state=1, kernel='rbf'))
])

pred_dev2 = SVM_tfidf.fit(train['text'], train['stars']).predict(dev['text'])
validationStats(pred_dev2, dev['stars'], '5000 features', 'SVM');

Using SVM, 5000 features
	TP=812, TN=825, FP=175, FN=188
	Recall: 812/1000 = 81.2%
	Precision: 812/987 = 82.3%
	F1 score: 0.817
	Accuracy: 1637/2000 = 81.8%


Using the SVM model results in a slight degradation relative to naive Bayes (accuracy 81.8% vs. 83.7%).

With SVM focused on separability, we also evaluate an alternative setup in which only the 2000 most frequent terms are kept (`max_features=2000`). With this setup, we hope to improve performance, since the classes may be easier to separate in a smaller feature space.

In [19]:
SVM_tfidf_2k = Pipeline([ # establish pipeline
    ('vect', TfidfVectorizer(tokenizer=nltk.word_tokenize, 
                             max_features=2000, min_df=2) ),
    ('scl', sklearn.preprocessing.StandardScaler(copy=False, with_mean=False)),
    ('clf', SVM.SVC(gamma='auto', max_iter=-1, random_state=1, kernel='rbf'))
])

pred_dev3 = SVM_tfidf_2k.fit(train['text'], train['stars']).predict(dev['text'])
validationStats(pred_dev3, dev['stars'], '2000 features', 'SVM');

Using SVM, 2000 features
	TP=825, TN=865, FP=135, FN=175
	Recall: 825/1000 = 82.5%
	Precision: 825/960 = 85.9%
	F1 score: 0.842
	Accuracy: 1690/2000 = 84.5%


Using fewer features improves performance over the first SVM setup (accuracy 84.5% vs. 81.8%) and puts it slightly ahead of the naive Bayes model (83.7%).

## (e) Feature engineering

For this section, in addition to the bag-of-words model, we add the use of longer token sequences. In the first instance, we try bigrams in addition to unigrams using SVM. The setup is similar to those attempted in (d), but we set the maximum feature count back to 5000, since adding bigrams produces many more candidate features.

In [20]:
SVM_tfidf_bg = Pipeline([ # establish pipeline
    ('vect', TfidfVectorizer(tokenizer=nltk.word_tokenize, ngram_range=(1,2),
                             max_features=5000, min_df=2) ),
    ('scl', sklearn.preprocessing.StandardScaler(copy=False, with_mean=False)),
    ('clf', SVM.SVC(gamma='auto', max_iter=-1, random_state=1, kernel='rbf'))
])

pred_dev4 = SVM_tfidf_bg.fit(train['text'], train['stars']).predict(dev['text'])
validationStats(pred_dev4, dev['stars'], 'uni & bigram', 'SVM');

Using SVM, uni & bigram
	TP=853, TN=877, FP=123, FN=147
	Recall: 853/1000 = 85.3%
	Precision: 853/976 = 87.4%
	F1 score: 0.863
	Accuracy: 1730/2000 = 86.5%


Using bigrams in addition to unigrams lifts performance above all the classifiers used so far, improving precision and recall at the same time.
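To make the n-gram features concrete, the sketch below enumerates the unigrams and bigrams of a short token list. The helper `ngrams` is hypothetical and only mirrors what `ngram_range=(1,2)` asks the vectorizer to extract.

```python
def ngrams(tokens, n_min=1, n_max=2):
    """Return all contiguous n-grams with n_min <= n <= n_max, joined by spaces."""
    out = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(' '.join(tokens[i:i + n]))
    return out

tokens = ['was', 'not', 'good']
print(ngrams(tokens))
```

A bigram such as "not good" captures a negation that the unigrams "not" and "good" cannot express on their own, which is one plausible reason for the gain observed here.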

As another setup, we extend the n-gram range to include trigrams as well and raise the maximum feature count to 7000.

In [21]:
SVM_tfidf_bitri = Pipeline([ # establish pipeline
    ('vect', TfidfVectorizer(tokenizer=nltk.word_tokenize, ngram_range=(1,3),
                             max_features=7000, min_df=2) ),
    ('scl', sklearn.preprocessing.StandardScaler(copy=False, with_mean=False)),
    ('clf', SVM.SVC(gamma='auto', max_iter=-1, random_state=1, kernel='rbf'))
])

pred_dev5 = SVM_tfidf_bitri.fit(train['text'], train['stars']).predict(dev['text'])
validationStats(pred_dev5, dev['stars'], 'bi & trigram', 'SVM');

Using SVM, bi & trigram
	TP=841, TN=874, FP=126, FN=159
	Recall: 841/1000 = 84.1%
	Precision: 841/967 = 87.0%
	F1 score: 0.855
	Accuracy: 1715/2000 = 85.8%


Extending the feature set to trigrams does not perform as well as unigrams plus bigrams, especially on recall (84.1% vs. 85.3%).
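To compare all five setups side by side, the sketch below ranks them by F1 using the confusion counts printed in the outputs above. The dictionary labels are informal names chosen here for readability.

```python
# (TP, TN, FP, FN) as reported in the outputs above
results = {
    'NB, unigrams':             (835, 838, 162, 165),
    'SVM, 5000 unigrams':       (812, 825, 175, 188),
    'SVM, 2000 unigrams':       (825, 865, 135, 175),
    'SVM, uni + bigrams':       (853, 877, 123, 147),
    'SVM, uni + bi + trigrams': (841, 874, 126, 159),
}

def f1(tp, tn, fp, fn):
    """F1 score written directly in terms of counts (tn is unused)."""
    return 2 * tp / (2 * tp + fp + fn)

ranked = sorted(results, key=lambda name: f1(*results[name]), reverse=True)
for name in ranked:
    print('%-26s F1 = %.3f' % (name, f1(*results[name])))
```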

### Outputting predictions

So far, the classifier with the best performance is the SVM model based on unigrams and bigrams. We therefore use this model to output predictions on the test set.

In [22]:
pred_test = SVM_tfidf_bg.predict(test['text'])
with open('jwu74.tsv', 'w') as fh:
    for ID,star in zip(test['docID'], pred_test):
        fh.write('%s\t%d\n'%(ID,star))