# Lab 04

student: John Wu

In [1]:
import nltk, sys, csv, string, re, sklearn.preprocessing, sklearn.metrics
import numpy as np, pandas as pd
from sklearn import naive_bayes as NB, svm as SVM
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.pipeline import Pipeline

## Data Pre-Processing

Load the training, development, and test sets.

In [121]:
# input file names
trainFile = './data/train.tsv'
testFile = './data/test.tsv'
devFile = './data/dev.tsv'
varNames = ['stars','docID','text']

# read in files
train = pd.read_csv(trainFile, sep='\t', header=None, names=varNames)
dev = pd.read_csv(devFile, sep='\t', header=None, names=varNames)
test = pd.read_csv(testFile, sep='\t', header=None, names=varNames)

Tokenization code

In [44]:
# match a token made only of punctuation; re.escape keeps '\', ']', '^', '-' literal
punct = re.compile('^['+re.escape(string.punctuation)+']+$')
# list of stop words in English, tokenized by word_tokenize()
engStopWords =  set(nltk.word_tokenize( \
    ' '.join(nltk.corpus.stopwords.words('english')) ) ) 

def myTokenize(txt): # tokenize, no 1+ consec punct and stop words
    return [tk for tk in nltk.word_tokenize(txt) if \
        (tk not in engStopWords and not punct.match(tk))]
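
The punctuation filter can be sanity-checked in isolation. The sketch below is a minimal stand-alone version that assumes the pattern is built with `re.escape`, so that every character in `string.punctuation` (including the backslash) is treated literally inside the character class. The names `is_punct` and `drop_punct_and_stops` and the toy stop list are hypothetical; the notebook itself uses NLTK's tokenizer and stop words.

```python
import re
import string

# Escape each punctuation character so the character class matches all of
# them literally, including the backslash and the closing bracket.
PUNCT_ONLY = re.compile('^[' + re.escape(string.punctuation) + ']+$')

def is_punct(token):
    """Return True when the token consists only of punctuation characters."""
    return PUNCT_ONLY.match(token) is not None

def drop_punct_and_stops(tokens, stop_words):
    """Keep tokens that are neither stop words nor pure punctuation."""
    return [tk for tk in tokens if tk not in stop_words and not is_punct(tk)]

sample_tokens = ['the', 'food', 'was', 'great', '!', '...', '\\', "n't"]
sample_stops = {'the', 'was'}  # toy stop list, not NLTK's
kept = drop_punct_and_stops(sample_tokens, sample_stops)
print(kept)
```

Tokens such as `n't` survive because they contain letters; only tokens made entirely of punctuation are removed.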

## (a) Study the training data

This section explores the training data set to allow a better understanding of the data.

In [34]:
train['stars'].mean()

3.0

The average rating of the data set is 3.0. Since every review is either two-star or four-star, this means the training sample is balanced.
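
The link between the mean and class balance can be checked directly. With only two label values, 2 and 4, the mean equals 2 + 2p, where p is the share of four-star reviews, so a mean of 3 corresponds to p = 0.5. The sketch below uses a small hypothetical label list rather than the real training data.

```python
from collections import Counter

def class_balance(labels):
    """Return per-class counts and the mean label."""
    counts = Counter(labels)
    mean = sum(labels) / len(labels)
    return counts, mean

toy_labels = [2, 4, 4, 2, 2, 4]  # hypothetical labels, not the training set
counts, mean = class_balance(toy_labels)
share_four = counts[4] / len(toy_labels)

print(counts, mean)
# With labels restricted to {2, 4}: mean == 2 + 2 * share_four
print(mean == 2 + 2 * share_four)
```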

To get some idea of useful features, we use `CountVectorizer` to record which terms appear in each document. We set `binary=True` so that it returns binary indicators (i.e. value=1 if the term appears in the document at least once) rather than raw counts.

In [35]:
binVec = CountVectorizer(tokenizer=nltk.word_tokenize, binary=True)
binTF = binVec.fit_transform(train['text'])

#### Relative word frequency
In the following section, we find the 20 terms with the largest difference in document frequency between two-star and four-star reviews.

In [5]:
twoSt = (train['stars']==2).to_numpy() # idx for 2-star reviews
tfDiff = np.abs(binTF[twoSt].mean(axis=0) - binTF[~twoSt].mean(axis=0))
top20idx = tfDiff.A1.argsort()[-20:][::-1]
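# get_feature_names() was removed in scikit-learn 1.2; keep a list so .index() works later
terms = list(binVec.get_feature_names_out())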
for x in top20idx:
    a,b = binTF[~twoSt,x].mean()*100, binTF[twoSt,x].mean()*100
    print('%s: %.2f%% (pos), %.2f%% (neg)'%(terms[x],a,b))

great: 39.60% (pos), 17.80% (neg)
was: 54.90% (pos), 75.60% (neg)
not: 42.40% (pos), 62.00% (neg)
!: 47.90% (pos), 28.40% (neg)
were: 27.60% (pos), 41.60% (neg)
n't: 45.30% (pos), 59.10% (neg)
always: 22.70% (pos), 9.80% (neg)
good: 55.70% (pos), 42.90% (neg)
did: 14.40% (pos), 27.10% (neg)
be: 32.60% (pos), 44.20% (neg)
just: 26.70% (pos), 38.10% (neg)
better: 11.70% (pos), 22.80% (neg)
delicious: 14.40% (pos), 3.30% (neg)
friendly: 17.50% (pos), 6.60% (neg)
are: 45.00% (pos), 34.20% (neg)
because: 13.80% (pos), 24.10% (neg)
ordered: 13.90% (pos), 24.10% (neg)
no: 15.50% (pos), 25.50% (neg)
bad: 7.20% (pos), 17.00% (neg)
at: 38.60% (pos), 47.10% (neg)


The terms are listed in order of disparity. As expected, words like "great", "always", "delicious", and "friendly" appear far more often in positive reviews, and "bad" appears more often in negative reviews. However, there are also some counter-intuitive examples. The term "better" is seen more frequently in negative reviews because of expressions like "there are many others in Charlotte that are better" and "maybe the next time I come in the food will be better". Similarly, while the term "like" can connote a favorable feeling, it is also used in similes, which are present in negative reviews such as "it tastes like a combo of cream cheese, american cheese and sour cream".

Another term with an unintuitive disparity is "ordered". Negative reviews often list the items that were ordered and describe what was wrong with them, such as "my friend ordered a virgin strawberry daiquiri and instead she got some weird smoothie with whip cream on top".
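
The document-frequency comparison used in this subsection can be illustrated on toy data without scikit-learn. The sketch below computes, for each class, the fraction of documents containing a term at least once, then ranks terms by the absolute gap. The helper names and the toy reviews are hypothetical and only stand in for the real `binTF` matrix.

```python
def doc_freq(docs):
    """Fraction of documents that contain each term at least once."""
    totals = {}
    for tokens in docs:
        for term in set(tokens):  # set() makes the count binary per document
            totals[term] = totals.get(term, 0) + 1
    return {term: n / len(docs) for term, n in totals.items()}

def top_disparity(neg_docs, pos_docs, k=2):
    """Return the k terms with the largest absolute document-frequency gap."""
    neg_df, pos_df = doc_freq(neg_docs), doc_freq(pos_docs)
    vocab = set(neg_df) | set(pos_df)
    gaps = {t: abs(pos_df.get(t, 0.0) - neg_df.get(t, 0.0)) for t in vocab}
    return sorted(vocab, key=lambda t: (-gaps[t], t))[:k]

# Toy tokenized reviews (hypothetical)
toy_neg = [['food', 'was', 'bad'], ['was', 'not', 'great'], ['bad', 'service']]
toy_pos = [['great', 'food'], ['great', 'friendly', 'staff'], ['food', 'was', 'great']]

top_terms = top_disparity(toy_neg, toy_pos, k=2)
print(top_terms)
```

In this toy set, "great" and "bad" have the largest gaps, mirroring the kind of terms that lead the real list.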

#### Other useful characteristics

In this section, we explore a few surface characteristics that differ between the two types of reviews.

In [36]:
textLen = train['text'].str.len() # length of text
textLen.groupby(train['stars']).mean()

stars
2    720.375
4    631.283
Name: text, dtype: float64

Negative reviews are about 89 characters (roughly 14%) longer on average.

In [37]:
capPct = train['text'].str.count(r'[A-Z]')/textLen # % of chars upper case
capPct.groupby(train['stars']).mean()

stars
2    0.025294
4    0.027273
Name: text, dtype: float64

Positive reviews tend to have a slightly larger proportion of upper case letters.

In [40]:
nTwos = train['text'].str.count('2') # occurrences of the digit "2" per review
(nTwos/textLen).groupby(train['stars']).mean()

stars
2    0.000543
4    0.000399
Name: text, dtype: float64

Two-star reviews contain the digit "2" more often per character, likely because some reviewers state the score explicitly.
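
The three characteristics above (length, share of upper-case letters, and occurrences of "2" per character) could be bundled into one per-review function if they were to be used as features. This is only a sketch: `surface_features` is a hypothetical helper, and the notebook does not actually feed these values into any classifier.

```python
def surface_features(text):
    """Compute simple surface statistics for one review."""
    n_chars = len(text)
    if n_chars == 0:
        return {'length': 0, 'upper_frac': 0.0, 'twos_per_char': 0.0}
    # Same character set as the regex [A-Z] used in the notebook
    n_upper = sum(1 for ch in text if 'A' <= ch <= 'Z')
    return {
        'length': n_chars,
        'upper_frac': n_upper / n_chars,
        'twos_per_char': text.count('2') / n_chars,
    }

feats = surface_features('Only 2 stars. NOT great.')  # hypothetical review
print(feats)
```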

## (b) Train a classifier

In [41]:
################################################################################
NB_tfidf = Pipeline([ # establish pipeline
    ('vect', TfidfVectorizer(tokenizer=nltk.word_tokenize,
                            max_features=5000, min_df=5)),
    ('clf', NB.MultinomialNB(alpha=1))
])

NB_tfidf.fit(train['text'], train['stars']) # fit model on training set
pred_dev = NB_tfidf.predict(dev['text']) # pred based on dev set
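
For intuition about what a multinomial naive Bayes model with `alpha=1` estimates, the following is a minimal pure-Python sketch with add-alpha (Laplace) smoothing. It is not scikit-learn's implementation, and it works on raw token counts rather than TF-IDF weights; `train_nb`, `predict_nb`, and the toy documents are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, alpha=1.0):
    """Estimate log priors and smoothed log likelihoods per class."""
    vocab = {tok for doc in docs for tok in doc}
    class_docs = Counter(labels)
    term_counts = defaultdict(Counter)
    for doc, y in zip(docs, labels):
        term_counts[y].update(doc)
    model = {}
    for y in class_docs:
        # Add-alpha smoothing: every vocabulary term gets alpha pseudo-counts
        denom = sum(term_counts[y].values()) + alpha * len(vocab)
        log_prior = math.log(class_docs[y] / len(labels))
        log_like = {t: math.log((term_counts[y][t] + alpha) / denom) for t in vocab}
        model[y] = (log_prior, log_like)
    return model

def predict_nb(model, doc):
    """Return the class with the highest posterior log score."""
    def score(y):
        log_prior, log_like = model[y]
        # Terms never seen in training are skipped
        return log_prior + sum(log_like[t] for t in doc if t in log_like)
    return max(model, key=score)

toy_docs = [['great', 'food'], ['friendly', 'great'],
            ['bad', 'food'], ['bad', 'service']]
toy_labels = [4, 4, 2, 2]
model = train_nb(toy_docs, toy_labels, alpha=1.0)

pred_a = predict_nb(model, ['great', 'service'])
pred_b = predict_nb(model, ['bad', 'food'])
print(pred_a, pred_b)
```

Smoothing is what keeps "service", which never occurs in a four-star toy review, from driving that class's probability to zero.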

In [363]:
for n in range(10):
    print('%s\t%d'%(dev['docID'][n],pred_dev[n]))

ZSJnW6faaNFQoqq4ALqYg	4
Rcbv11hm5AYEwZyqYwAvg	2
rkRTjhu5szaBggeFVcVJlA	4
dhmeDsQGUS1FXMLs49SWjQ	4
z9zfIMYmRRCE4ggfOIieEw	4
Xtb3pGSh39bqcozkBECw	2
DOUflAGzxLsXG6xOmR1w	2
0RxCEWURe08CTcZt95F4AQ	2
MzUg5twEcCyd0X6lBMP2Lg	2
uNlw2D5CYKk0wjNxLtYw	4


## (c) Evaluate your predictions

In [42]:
# Calculates various stats related to validation
def validationStats(y_Prd, y_Act, msg='', algo='naive Bayes'):
    # confusion matrix, T=true, F=false, N=negative, P=positive
    TN, FP, FN, TP = sklearn.metrics.confusion_matrix(y_Act, y_Prd).ravel()
    precision,recall = TP/(TP+FP) , TP/(TP+FN) # precision and recall
    corr,tot = TN+TP , TN+TP+FN+FP # used for accuracy calculation
    print("Using %s, %s"%(algo,msg))
    print("\tTP=%d, TN=%d, FP=%d, FN=%d"%(TP,TN,FP,FN))
    print("\tRecall: %u/%u = %.1f%%" % (TP, TP+FN, recall*100) )
    print("\tPrecision: %u/%u = %.1f%%" % (TP, TP+FP, precision*100) )
    print("\tF1 score: %.3f" % (2*precision*recall / (precision+recall)) )
    print("\tAccuracy: %u/%u = %.1f%%" % (corr,tot,corr/tot*100) )
    return (TN, FP, FN, TP)

In [43]:
validationStats(pred_dev, dev['stars'], 'TF-IDF doc vectors');

Using naive Bayes, TF-IDF doc vectors
	TP=835, TN=838, FP=162, FN=165
	Recall: 835/1000 = 83.5%
	Precision: 835/997 = 83.8%
	F1 score: 0.836
	Accuracy: 1673/2000 = 83.7%
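
The reported figures can be re-derived from the four confusion-matrix counts alone, treating four-star reviews as the positive class as `validationStats` does. The helper name `summarize` is hypothetical; the counts are the ones printed above.

```python
def summarize(tp, tn, fp, fn):
    """Return precision, recall, F1 and accuracy from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f1, accuracy

# Counts from the naive Bayes run on the dev set
precision, recall, f1, accuracy = summarize(tp=835, tn=838, fp=162, fn=165)
print('precision=%.4f recall=%.4f F1=%.4f accuracy=%.4f'
      % (precision, recall, f1, accuracy))
```

Note that F1 can equivalently be written as 2TP / (2TP + FP + FN), which here is 1670 / 1997.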


In [30]:
pd.Series(binTF[:,terms.index('fast')].todense().A1).groupby(train['stars']).mean()

stars
2    0.057
4    0.087
dtype: float64

In [451]:
dev['text'][57]

"Ok, I am not sure why people put down the Stratosphere, yea...yea the casino action kind of sucks....but don't go there to gamble, go there for Lucky's and Fat Tuesday's....But let's get back to Lucky's....we were here with a couple that had never been to the Strat, so we decided to go before having dinner, my honey and I decided to play some penny slot to kill time, and LO and BEHOLD...I saw a sign advertising steak and crab legs for 9.99, well, most people (NOT ME) would be scared off by that, but hell it was a hard night, I lost some moola and was looking for some cheap (but good) grub....I am after all a foodie...haha, only if you count greasy spoons. Anyways, back to Lucky's the dinner was really good, the steak was juicy, the crab legs meaty, tender and were already cut in half for you....shoot people, what more do you want, what more do you need.....Oh, but wait there's a catch, you can only order that between 7p-10p or 6-10, I forgot, but I do know it ends at 10p. Try to catch

This review was classified as a 4, but was actually a 2. The review starts out negatively, using words like "not", "sucks", "but", and "don't". However, the initial part of the review describes the casino, not the restaurant itself. Despite being a positive review, it is interspersed with words frequently found in negative reviews like "though", "greasy", and "time" (probably from people who waited a long time).

In [29]:
dev['text'][974]

'It has been about a month since we last visited this place.  I recommend going to the pub and not the restaraunt side.  Service was great.  We got a couple of pints and some wings.  Wings were overdone although the sauces were good.  Love the garlic parmesan wings, even though overdone.'

This review was misclassified as a 4, but is actually a 2. Despite the overall review being negative, it talks about positive aspects of the visit. The review contains words typically associated with positive reviews like "recommend", "great", "good", and "love". Therefore, it's easy to see why this was misclassified.

In [23]:
dev['text'][1437]

"If you want atmosphere, it's a great, great, great coffee shop.  If you want espresso, food, or fast service, unfortunately, look elsewhere.    Every visit here has had me run into friendly, talkative, awesome people, but I go to a coffee shop wanting coffee, honestly."

This review was misclassified as a 4. It is easy to see why this happened, as the reviewer used "great" three times as well as other positive words like "friendly" and "awesome". The positive aspects of this review overwhelmed the use of negative words like "unfortunately".

## (d) Build a second classifier

In [92]:
SVM_tfidf = Pipeline([ # establish pipeline
    ('vect', TfidfVectorizer(tokenizer=nltk.word_tokenize, 
                             max_features=5000, min_df=2) ),
    ('scl', sklearn.preprocessing.StandardScaler(copy=False, with_mean=False)),
    ('clf', SVM.SVC(gamma='auto', max_iter=-1, random_state=1, kernel='rbf'))
])

pred_dev2 = SVM_tfidf.fit(train['text'], train['stars']).predict(dev['text'])
validationStats(pred_dev2, dev['stars'], '5000 features', 'SVM');

Using SVM, 5000 features
	TP=812, TN=825, FP=175, FN=188
	Recall: 812/1000 = 81.2%
	Precision: 812/987 = 82.3%
	F1 score: 0.817
	Accuracy: 1637/2000 = 81.8%
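
The `StandardScaler(with_mean=False)` step in this pipeline divides each feature column by its standard deviation without subtracting the mean, which is why zero entries stay zero and the TF-IDF matrix can remain sparse. The sketch below illustrates that behaviour on a small dense toy matrix in pure Python. It is not scikit-learn's code, and `scale_columns` is a hypothetical helper.

```python
import math

def scale_columns(rows):
    """Divide each column by its population standard deviation (no centering)."""
    n = len(rows)
    n_cols = len(rows[0])
    scales = []
    for j in range(n_cols):
        col = [row[j] for row in rows]
        mean = sum(col) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
        scales.append(std if std > 0 else 1.0)  # leave constant columns unchanged
    return [[v / s for v, s in zip(row, scales)] for row in rows]

toy = [[0.0, 2.0],
       [0.0, 0.0],
       [3.0, 0.0],
       [3.0, 2.0]]
scaled = scale_columns(toy)
print(scaled)
```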


In [94]:
################################################################################
SVM_tfidf = Pipeline([ # establish pipeline
    ('vect', TfidfVectorizer(tokenizer=nltk.word_tokenize, 
                             max_features=2000, min_df=2) ),
    ('scl', sklearn.preprocessing.StandardScaler(copy=False, with_mean=False)),
    ('clf', SVM.SVC(gamma='auto', max_iter=-1, random_state=1, kernel='rbf'))
])

pred_dev3 = SVM_tfidf.fit(train['text'], train['stars']).predict(dev['text'])
validationStats(pred_dev3, dev['stars'], '2000 features', 'SVM');

Using SVM, 2000 features
	TP=825, TN=865, FP=135, FN=175
	Recall: 825/1000 = 82.5%
	Precision: 825/960 = 85.9%
	F1 score: 0.842
	Accuracy: 1690/2000 = 84.5%


## (e) Feature engineering

In [107]:
SVM_tfidf = Pipeline([ # establish pipeline
    ('vect', TfidfVectorizer(tokenizer=nltk.word_tokenize, ngram_range=(1,2),
                             max_features=5000, min_df=2) ),
    ('scl', sklearn.preprocessing.StandardScaler(copy=False, with_mean=False)),
    ('clf', SVM.SVC(gamma='auto', max_iter=-1, random_state=1, kernel='rbf'))
])

pred_dev4 = SVM_tfidf.fit(train['text'], train['stars']).predict(dev['text'])
validationStats(pred_dev4, dev['stars'], 'uni & bigram', 'SVM');

Using SVM, uni & bigram
	TP=853, TN=877, FP=123, FN=147
	Recall: 853/1000 = 85.3%
	Precision: 853/976 = 87.4%
	F1 score: 0.863
	Accuracy: 1730/2000 = 86.5%
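
To make the `ngram_range` setting concrete, the sketch below lists the features generated from one tokenized sentence for `(1,2)` (unigrams and bigrams) and `(2,3)` (bigrams and trigrams). It mirrors the idea of joining adjacent tokens with a space but is a hypothetical stand-alone helper, not the vectorizer's own code.

```python
def ngrams(tokens, ngram_range=(1, 2)):
    """Return all n-grams with n between the two bounds, inclusive."""
    lo, hi = ngram_range
    out = []
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            out.append(' '.join(tokens[i:i + n]))
    return out

tokens = ['service', 'was', 'not', 'good']  # hypothetical tokenized text
uni_bi = ngrams(tokens, (1, 2))
bi_tri = ngrams(tokens, (2, 3))
print(uni_bi)
print(bi_tri)
```

With `(1,2)` the bigram "not good" is added on top of the single words, whereas `(2,3)` drops the single words altogether.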


In [119]:
SVM_tfidf = Pipeline([ # establish pipeline
    ('vect', TfidfVectorizer(tokenizer=nltk.word_tokenize, ngram_range=(2,3),
                             max_features=5000, min_df=2) ),
    ('scl', sklearn.preprocessing.StandardScaler(copy=False, with_mean=False)),
    ('clf', SVM.SVC(gamma='auto', max_iter=-1, random_state=1, kernel='rbf'))
])

pred_dev5 = SVM_tfidf.fit(train['text'], train['stars']).predict(dev['text'])
validationStats(pred_dev5, dev['stars'], 'bi & trigram', 'SVM');

Using SVM, bi & trigram
	TP=817, TN=831, FP=169, FN=183
	Recall: 817/1000 = 81.7%
	Precision: 817/986 = 82.9%
	F1 score: 0.823
	Accuracy: 1648/2000 = 82.4%
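
The dev-set results reported in this notebook can be put side by side by recomputing accuracy and F1 from the printed confusion counts. The sketch below only reuses those counts; the dictionary layout and helper names are hypothetical.

```python
# (TP, TN, FP, FN) as printed by validationStats for each run
runs = {
    'NB, TF-IDF':         (835, 838, 162, 165),
    'SVM, 5000 features': (812, 825, 175, 188),
    'SVM, 2000 features': (825, 865, 135, 175),
    'SVM, uni & bigram':  (853, 877, 123, 147),
    'SVM, bi & trigram':  (817, 831, 169, 183),
}

def accuracy(counts):
    tp, tn, fp, fn = counts
    return (tp + tn) / (tp + tn + fp + fn)

def f1_score(counts):
    tp, tn, fp, fn = counts
    return 2 * tp / (2 * tp + fp + fn)

for name, counts in runs.items():
    print('%-20s accuracy=%.4f F1=%.3f' % (name, accuracy(counts), f1_score(counts)))

best = max(runs, key=lambda name: accuracy(runs[name]))
print('best on dev:', best)
```

On these counts the unigram and bigram SVM has the highest dev accuracy, and the bigram and trigram variant falls below the naive Bayes baseline.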


In [125]:
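# predict on the test set with whichever SVM_tfidf pipeline was fitted last
pred_test = SVM_tfidf.predict(test['text'])
# write one "docID<TAB>stars" line per test review
with open('jwu74.tsv', 'w') as fh:
    for ID,star in zip(test['docID'], pred_test):
        fh.write('%s\t%d\n'%(ID,star))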
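
As an alternative to formatting each line by hand, the same tab-separated output could be produced with the standard `csv` module, which is already imported at the top of the notebook. This is a sketch under assumptions: `write_predictions` is a hypothetical helper, the IDs are made up, and an in-memory buffer stands in for the output file.

```python
import csv
import io

def write_predictions(rows, handle):
    """Write (docID, stars) pairs as tab-separated lines."""
    writer = csv.writer(handle, delimiter='\t', lineterminator='\n')
    for doc_id, star in rows:
        writer.writerow([doc_id, int(star)])

buf = io.StringIO()  # stands in for an open file handle
write_predictions([('docA', 4), ('docB', 2)], buf)  # hypothetical IDs
tsv_text = buf.getvalue()
print(tsv_text)
```

When writing to a real file, the `csv` documentation recommends opening it with `newline=''` so that line endings are not translated twice.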