# Sentiment Analysis on Movie Reviews

Using Logistic Regression, SGD, Naive Bayes and OneVsOne models. Each phrase carries one of five sentiment labels:

- 0 - negative

- 1 - somewhat negative

- 2 - neutral

- 3 - somewhat positive

- 4 - positive

## Load Libraries

In [140]:
import nltk
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

from sklearn.metrics import classification_report, confusion_matrix

## Load & Read Datasets

In [3]:
train = pd.read_csv('train.tsv', delimiter='\t')
test = pd.read_csv('test.tsv', delimiter='\t')

In [4]:
train.shape, test.shape

((156060, 4), (66292, 3))

In [5]:
train.head()

|   | PhraseId | SentenceId | Phrase | Sentiment |
|---|---|---|---|---|
| 0 | 1 | 1 | A series of escapades demonstrating the adage ... | 1 |
| 1 | 2 | 1 | A series of escapades demonstrating the adage ... | 2 |
| 2 | 3 | 1 | A series | 2 |
| 3 | 4 | 1 | A | 2 |
| 4 | 5 | 1 | series | 2 |


In [6]:
test.head()

|   | PhraseId | SentenceId | Phrase |
|---|---|---|---|
| 0 | 156061 | 8545 | An intermittently pleasing but mostly routine ... |
| 1 | 156062 | 8545 | An intermittently pleasing but mostly routine ... |
| 2 | 156063 | 8545 | An |
| 3 | 156064 | 8545 | intermittently pleasing but mostly routine effort |
| 4 | 156065 | 8545 | intermittently pleasing but mostly routine |


In [7]:
# unique sentiment labels
train.Sentiment.unique()

array([1, 2, 3, 4, 0])

In [8]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 156060 entries, 0 to 156059
Data columns (total 4 columns):
PhraseId      156060 non-null int64
SentenceId    156060 non-null int64
Phrase        156060 non-null object
Sentiment     156060 non-null int64
dtypes: int64(3), object(1)
memory usage: 6.0+ MB


In [9]:
train.Sentiment.value_counts()

2    79582
3    32927
1    27273
4     9206
0     7072
Name: Sentiment, dtype: int64

In [10]:
train.Sentiment.value_counts() / train.Sentiment.count()

2    0.509945
3    0.210989
1    0.174760
4    0.058990
0    0.045316
Name: Sentiment, dtype: float64
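
The class proportions above also give a useful reference point: a classifier that always predicts the most frequent label is right about half of the time. A minimal sketch, with the counts copied by hand from the `value_counts()` output (the variable names are only illustrative):

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({2: 79582, 3: 32927, 1: 27273, 4: 9206, 0: 7072})

# Always predicting the most frequent class gets this fraction right.
majority_label = counts.idxmax()
baseline_accuracy = counts.max() / counts.sum()

print(majority_label, round(baseline_accuracy, 4))
```

Any accuracy reported later in the notebook can be read against this majority-class baseline.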

## Extracting features

In order to perform machine learning on text documents, we first need to turn the text content into numerical feature vectors.

### Bag of words

The most intuitive way to do so is the bag-of-words representation:

- assign a fixed integer id to each word occurring in any document of the training set (for instance by building a dictionary from words to integer indices).

- for each document $i$, count the number of occurrences of each word $w$ and store it in $X[i, j]$ as the value of feature $j$, where $j$ is the index of word $w$ in the dictionary

*Reference: http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html*

The Bag of Words model learns a vocabulary from all of the documents, then models each document by counting the number of times each word appears.

We'll be using the *CountVectorizer* feature extractor module from *scikit-learn* to create *bag-of-words* features.
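
Before applying it to the full dataset, here is a small sketch of the idea on two made-up documents (they are not taken from the dataset). It shows the word-to-index dictionary and the resulting count matrix.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy documents (not taken from the dataset).
docs = ["a good movie", "a good good story"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

# vocabulary_ maps each word to its column index j.
print(sorted(vectorizer.vocabulary_.items(), key=lambda kv: kv[1]))

# X[i, j] holds how often word j occurs in document i.
print(X.toarray())
```

Note that with the default settings the vectorizer lowercases the text and ignores one-letter tokens such as "a", so a phrase like "A" ends up with an all-zero feature vector.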

In [12]:
X_train = train['Phrase']
y_train = train['Sentiment']

In [55]:
# Convert a collection of text documents to a matrix of token counts
count_vect = CountVectorizer() 

# Fit followed by Transform
# Learn the vocabulary dictionary and return term-document matrix
X_train_counts = count_vect.fit_transform(X_train)

In [29]:
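# Optional dense conversion; left disabled because a dense 156060 x 15240 array needs a lot of memory
# X_train_counts = X_train_counts.toarray()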

In [56]:
# 156060 rows of train data & 15240 features (one for each vocabulary word)
X_train_counts.shape

(156060, 15240)

In [20]:
# get all words in the vocabulary
# (scikit-learn releases before 1.0 use count_vect.get_feature_names() instead)
vocab = count_vect.get_feature_names_out()
print (vocab)



In [24]:
# get index of any word
count_vect.vocabulary_.get(u'100')

2

In [57]:
# Sum up the counts of each vocabulary word
dist = np.sum(X_train_counts, axis=0)
# print (dist) # matrix

dist = np.squeeze(np.asarray(dist))
print (dist) # array

# Pair each word with its total count and sort by count, highest first.
# sorted() works on any iterable, so this runs on both Python 2 and Python 3.
zipped = sorted(zip(vocab, dist), key=lambda t: t[1], reverse=True)

# For each, print the vocabulary word and the number of times it 
# appears in the training set
for tag, count in zipped:
    print (count, tag)

[ 23 179  70 ...,  15   9   5]
(51633, u'the')
(32702, u'of')
(32177, u'and')
(22761, u'to')
(13997, u'in')
(13476, u'is')
(12338, u'that')
(11734, u'it')
(8651, u'as')
(7750, u'with')
(7553, u'for')
(7051, u'its')
(6733, u'film')
(6502, u'an')
(6241, u'movie')
(5677, u'this')
(5126, u'but')
(5053, u'be')
(4893, u'on')
(4855, u'you')
(3990, u'by')
(3895, u'more')
(3827, u'his')
(3784, u'one')
(3682, u'about')
(3668, u'not')
(3593, u'at')
(3511, u'or')
(3495, u'than')
(3477, u'from')
(3401, u'all')
(3190, u'like')
(3134, u'have')
(3067, u'are')
(2969, u'has')
(2866, u'so')
(2785, u'out')
(2539, u'story')
(2438, u'rrb')
(2423, u'up')
(2373, u'who')
(2261, u'good')
(2243, u'too')
(2227, u'most')
(2172, u'into')
(2098, u'lrb')
(1999, u'if')
(1926, u'what')
(1919, u'time')
(1899, u'their')
(1888, u'no')
(1882, u'characters')
(1872, u'much')
(1813, u'comedy')
(1781, u'your')
(1769, u'can')
(1728, u'just')
(1718, u'life')
(1713, u'some')
(1674, u'does')
(1624, u'even')
(1605, u'little')
...

(172, u'appealing')
(172, u'car')
(172, u'college')
(172, u'decent')
(172, u'urban')
(172, u'warm')
(171, u'robert')
(171, u'touch')
(171, u'welcome')
(170, u'flawed')
(170, u'free')
(170, u'pretentious')
(170, u'trip')
(169, u'female')
(169, u'somewhat')
(168, u'clear')
(168, u'country')
(168, u'relationship')
(168, u'used')
(168, u'worthy')
(167, u'fast')
(167, u'knows')
(167, u'process')
(166, u'appeal')
(166, u'couple')
(166, u'ugly')
(165, u'battle')
(165, u'name')
(165, u'nice')
(165, u'seat')
(165, u'stupid')
(164, u'behind')
(164, u'david')
(164, u'fully')
(164, u'inspired')
(164, u'taking')
(163, u'century')
(163, u'epic')
(163, u'felt')
(163, u'situation')
(163, u'soul')
(163, u'use')
(162, u'based')
(162, u'called')
(162, u'contemporary')
(162, u'depth')
(162, u'gross')
(162, u'journey')
(162, u'mother')
(162, u'several')
(162, u'start')
(161, u'british')
(161, u'co')
(161, u'crowd')
(161, u'insight')
(161, u'skin')
(161, u'word')
(160, u'adaptation')
(160, u'edge')
...

(70, u'dysfunctional')
(70, u'exhilarating')
(70, u'fair')
(70, u'fake')
(70, u'features')
(70, u'flashy')
(70, u'ghost')
(70, u'loving')
(70, u'mill')
(70, u'refreshingly')
(70, u'satirical')
(70, u'stuck')
(70, u'total')
(70, u'van')
(70, u'yes')
(69, u'achieves')
(69, u'appreciate')
(69, u'awkward')
(69, u'bag')
(69, u'bang')
(69, u'blend')
(69, u'brutally')
(69, u'buy')
(69, u'capture')
(69, u'crude')
(69, u'deft')
(69, u'delivered')
(69, u'distance')
(69, u'filling')
(69, u'frida')
(69, u'heartwarming')
(69, u'killed')
(69, u'mature')
(69, u'meets')
(69, u'moviemaking')
(69, u'notes')
(69, u'notion')
(69, u'obsession')
(69, u'parts')
(69, u'quest')
(69, u'remain')
(69, u'rental')
(69, u'shelf')
(69, u'simultaneously')
(69, u'thousands')
(69, u'tortured')
(69, u'urge')
(69, u'wannabe')
(68, u'award')
(68, u'bittersweet')
(68, u'bother')
(68, u'brand')
(68, u'build')
(68, u'cat')
(68, u'childhood')
(68, u'desperate')
(68, u'essence')
(68, u'example')
(68, u'explore')
...

(53, u'passionate')
(53, u'portray')
(53, u'preposterous')
(53, u'richer')
(53, u'scattered')
(53, u'schwarzenegger')
(53, u'slip')
(53, u'speak')
(53, u'staged')
(53, u'surely')
(53, u'surrounding')
(53, u'tasty')
(53, u'test')
(53, u'thomas')
(53, u'torn')
(53, u'touches')
(53, u'wears')
(53, u'western')
(53, u'wholly')
(52, u'21st')
(52, u'22')
(52, u'boom')
(52, u'bullets')
(52, u'capacity')
(52, u'caught')
(52, u'cgi')
(52, u'changes')
(52, u'characterizations')
(52, u'cohesive')
(52, u'committed')
(52, u'competent')
(52, u'conservative')
(52, u'conspiracy')
(52, u'difference')
(52, u'downbeat')
(52, u'dramas')
(52, u'easier')
(52, u'eat')
(52, u'encounter')
(52, u'existential')
(52, u'expense')
(52, u'exploitative')
(52, u'german')
(52, u'gun')
(52, u'ham')
(52, u'legacy')
(52, u'lingering')
(52, u'literally')
(52, u'lopez')
(52, u'nevertheless')
(52, u'nicholas')
(52, u'ourselves')
(52, u'overwhelming')
(52, u'presentation')
(52, u'pulp')
(52, u'rating')
(52, u'sadly')
...

(41, u'convoluted')
(41, u'crudup')
(41, u'deadpan')
(41, u'demons')
(41, u'distant')
(41, u'engage')
(41, u'expression')
(41, u'farrelly')
(41, u'fiennes')
(41, u'frailty')
(41, u'greene')
(41, u'harder')
(41, u'hat')
(41, u'hatred')
(41, u'heady')
(41, u'hotel')
(41, u'impeccable')
(41, u'influence')
(41, u'insipid')
(41, u'intentioned')
(41, u'irresistible')
(41, u'joel')
(41, u'jumbo')
(41, u'korean')
(41, u'leading')
(41, u'magnificent')
(41, u'manufactured')
(41, u'morally')
(41, u'mysteries')
(41, u'numbingly')
(41, u'pleaser')
(41, u'polished')
(41, u'radar')
(41, u'sandra')
(41, u'seemingly')
(41, u'separate')
(41, u'shameless')
(41, u'sting')
(41, u'stylistic')
(41, u'survival')
(41, u'talky')
(41, u'ten')
(41, u'theories')
(41, u'therapy')
(41, u'throw')
(41, u'todd')
(41, u'trashy')
(41, u'understands')
(41, u'unfulfilling')
(41, u'vanity')
(41, u'walsh')
(41, u'weighty')
(41, u'whenever')
(41, u'witless')
(40, u'ai')
(40, u'alabama')
(40, u'ancient')
(40, u'anomie')
...

(34, u'damon')
(34, u'defies')
(34, u'distinguished')
(34, u'dv')
(34, u'elevate')
(34, u'enormously')
(34, u'enthralling')
(34, u'european')
(34, u'experienced')
(34, u'filming')
(34, u'fits')
(34, u'frontal')
(34, u'frustratingly')
(34, u'giddy')
(34, u'glimpses')
(34, u'grating')
(34, u'haneke')
(34, u'huppert')
(34, u'hypnotic')
(34, u'illustrates')
(34, u'impressed')
(34, u'impressions')
(34, u'includes')
(34, u'incomprehensible')
(34, u'interview')
(34, u'introduction')
(34, u'ironies')
(34, u'isolation')
(34, u'jolie')
(34, u'landscapes')
(34, u'latin')
(34, u'liberal')
(34, u'massive')
(34, u'mechanical')
(34, u'member')
(34, u'mundane')
(34, u'nifty')
(34, u'nine')
(34, u'nobody')
(34, u'optimism')
(34, u'overblown')
(34, u'overrun')
(34, u'pastiche')
(34, u'pervasive')
(34, u'praise')
(34, u'pratfalls')
(34, u'promised')
(34, u'pulpy')
(34, u'pulse')
(34, u'rah')
(34, u'recommendation')
(34, u'recommended')
(34, u'reminded')
(34, u'remotely')
(34, u'retro')
(34, u'runner')
...

(27, u'figured')
(27, u'flaccid')
(27, u'foolish')
(27, u'frequent')
(27, u'gasp')
(27, u'haunted')
(27, u'hearst')
(27, u'hundred')
(27, u'huston')
(27, u'iconoclastic')
(27, u'idiotic')
(27, u'ignore')
(27, u'illness')
(27, u'imaginable')
(27, u'impersonal')
(27, u'impulses')
(27, u'intentionally')
(27, u'interaction')
(27, u'involve')
(27, u'island')
(27, u'jackass')
(27, u'jazzy')
(27, u'joyless')
(27, u'kitten')
(27, u'lifts')
(27, u'log')
(27, u'marching')
(27, u'mars')
(27, u'mart')
(27, u'metal')
(27, u'metaphors')
(27, u'modestly')
(27, u'mouthed')
(27, u'mumbo')
(27, u'muted')
(27, u'myth')
(27, u'naipaul')
(27, u'narration')
(27, u'native')
(27, u'negotiate')
(27, u'nerves')
(27, u'nervy')
(27, u'nonsensical')
(27, u'normally')
(27, u'object')
(27, u'obsessed')
(27, u'overtly')
(27, u'palpable')
(27, u'pauline')
(27, u'paying')
(27, u'penetrating')
(27, u'performed')
(27, u'placed')
(27, u'plus')
(27, u'pompeo')
(27, u'practice')
(27, u'prevent')
(27, u'prism')
(27, u'probe')
...

(22, u'kafka')
(22, u'kicks')
(22, u'knock')
(22, u'kouyate')
(22, u'laptops')
(22, u'le')
(22, u'liking')
(22, u'limit')
(22, u'lyricism')
(22, u'macdowell')
(22, u'manifesto')
(22, u'marshall')
(22, u'masculine')
(22, u'maternal')
(22, u'mechanics')
(22, u'mission')
(22, u'modernize')
(22, u'moralizing')
(22, u'nash')
(22, u'newfoundland')
(22, u'ninety')
(22, u'nonconformist')
(22, u'obviousness')
(22, u'ol')
(22, u'oppressive')
(22, u'outbursts')
(22, u'outlandish')
(22, u'overwhelmed')
(22, u'pairing')
(22, u'palestinian')
(22, u'parental')
(22, u'passing')
(22, u'pedestal')
(22, u'perfunctory')
(22, u'pesky')
(22, u'pootie')
(22, u'possesses')
(22, u'potter')
(22, u'pressed')
(22, u'preview')
(22, u'principal')
(22, u'principals')
(22, u'profession')
(22, u'pronounce')
(22, u'psychedelic')
(22, u'ray')
(22, u'reacting')
(22, u'reducing')
(22, u'remaining')
(22, u'restraint')
(22, u'romanek')
(22, u'roughshod')
(22, u'round')
(22, u'sack')
(22, u'schools')
(22, u'screenings')
...

(18, u'mug')
(18, u'mummy')
(18, u'mush')
(18, u'musings')
(18, u'mystic')
(18, u'narrated')
(18, u'nationally')
(18, u'nationwide')
(18, u'navel')
(18, u'negative')
(18, u'nonstop')
(18, u'numerous')
(18, u'objects')
(18, u'offerings')
(18, u'ominous')
(18, u'operates')
(18, u'oprah')
(18, u'ops')
(18, u'opts')
(18, u'oral')
(18, u'orwell')
(18, u'outright')
(18, u'palaver')
(18, u'paradoxically')
(18, u'partnerships')
(18, u'passe')
(18, u'paulette')
(18, u'penchant')
(18, u'phillip')
(18, u'photos')
(18, u'physically')
(18, u'pipeline')
(18, u'plainly')
(18, u'pollution')
(18, u'portraits')
(18, u'portraying')
(18, u'prevents')
(18, u'prince')
(18, u'projection')
(18, u'proposal')
(18, u'provincial')
(18, u'psychic')
(18, u'punchy')
(18, u'quadrangle')
(18, u'rated')
(18, u'reflects')
(18, u'relying')
(18, u'remakes')
(18, u'restrictive')
(18, u'retooled')
(18, u'revulsion')
(18, u'rewarded')
(18, u'rotoscope')
(18, u'rubenesque')
(18, u'ruggero')
(18, u'ruinous')
(18, u'rustic')
...

(15, u'brio')
(15, u'brittle')
(15, u'brushes')
(15, u'bryan')
(15, u'bubble')
(15, u'cagney')
(15, u'cameos')
(15, u'campfire')
(15, u'candid')
(15, u'cannibal')
(15, u'capability')
(15, u'captured')
(15, u'cards')
(15, u'chatty')
(15, u'cheated')
(15, u'cheering')
(15, u'cheesiest')
(15, u'chillingly')
(15, u'choreographed')
(15, u'chouraqui')
(15, u'churn')
(15, u'cinematically')
(15, u'closet')
(15, u'coal')
(15, u'coast')
(15, u'cockney')
(15, u'cocktail')
(15, u'coda')
(15, u'cognizant')
(15, u'coke')
(15, u'complexly')
(15, u'composure')
(15, u'conceive')
(15, u'confessions')
(15, u'confluence')
(15, u'conspiracies')
(15, u'contemplative')
(15, u'contradiction')
(15, u'contrivance')
(15, u'corniest')
(15, u'crackers')
(15, u'crooks')
(15, u'crossed')
(15, u'cuisine')
(15, u'cussing')
(15, u'daddy')
(15, u'dampened')
(15, u'dangerously')
(15, u'dani')
(15, u'dated')
(15, u'defecates')
(15, u'defiant')
(15, u'delay')
(15, u'delighted')
(15, u'dependence')
(15, u'deserving')
...

(13, u'ace')
(13, u'activity')
(13, u'actorly')
(13, u'adopt')
(13, u'aesthetically')
(13, u'affords')
(13, u'aggressiveness')
(13, u'agitprop')
(13, u'airless')
(13, u'alan')
(13, u'alarming')
(13, u'alexander')
(13, u'alienate')
(13, u'alterations')
(13, u'amaro')
(13, u'amazement')
(13, u'anarchist')
(13, u'anew')
(13, u'animator')
(13, u'annoyance')
(13, u'anteing')
(13, u'antic')
(13, u'apocalypse')
(13, u'approached')
(13, u'arbitrary')
(13, u'archive')
(13, u'ardent')
(13, u'armchair')
(13, u'ashley')
(13, u'association')
(13, u'assumes')
(13, u'asylum')
(13, u'atrociously')
(13, u'attics')
(13, u'aurelie')
(13, u'austrian')
(13, u'avenues')
(13, u'backed')
(13, u'balloon')
(13, u'ballplayer')
(13, u'ballroom')
(13, u'bam')
(13, u'barreled')
(13, u'barris')
(13, u'begging')
(13, u'begrudge')
(13, u'besotted')
(13, u'bewildering')
(13, u'bid')
(13, u'biographical')
(13, u'blaring')
(13, u'blasphemous')
(13, u'blissfully')
(13, u'boils')
(13, u'boost')
(13, u'boston')
...

(12, u'misdemeanor')
(12, u'moist')
(12, u'mommy')
(12, u'monopoly')
(12, u'mopping')
(12, u'morbid')
(12, u'mortarboards')
(12, u'moulin')
(12, u'mouse')
(12, u'mouser')
(12, u'moviehouse')
(12, u'mulholland')
(12, u'multilayered')
(12, u'multitude')
(12, u'musker')
(12, u'nail')
(12, u'napoli')
(12, u'nationalist')
(12, u'natives')
(12, u'naturedness')
(12, u'neglecting')
(12, u'newcomers')
(12, u'niches')
(12, u'octane')
(12, u'oddity')
(12, u'omission')
(12, u'omniscient')
(12, u'oomph')
(12, u'operational')
(12, u'opportunists')
(12, u'oppressively')
(12, u'options')
(12, u'outshined')
(12, u'overachieving')
(12, u'overmanipulative')
(12, u'overstimulated')
(12, u'overview')
(12, u'owen')
(12, u'ownership')
(12, u'pablum')
(12, u'packages')
(12, u'palestinians')
(12, u'palm')
(12, u'parachutes')
(12, u'parochial')
(12, u'parsec')
(12, u'pasolini')
(12, u'pasta')
(12, u'pasts')
(12, u'pathetically')
(12, u'patiently')
(12, u'pax')
(12, u'peaks')
(12, u'peanut')
(12, u'pellington')


(11, u'transvestite')
(11, u'treatise')
(11, u'troopers')
(11, u'truckzilla')
(11, u'trumps')
(11, u'trusting')
(11, u'tryingly')
(11, u'tube')
(11, u'tummy')
(11, u'turks')
(11, u'twitchy')
(11, u'uhf')
(11, u'unapologetically')
(11, u'uncertainties')
(11, u'uncreative')
(11, u'underplayed')
(11, u'undeserved')
(11, u'unforgettably')
(11, u'unguarded')
(11, u'unhinged')
(11, u'uninflected')
(11, u'unit')
(11, u'unity')
(11, u'univac')
(11, u'unqualified')
(11, u'unrecoverable')
(11, u'unreligious')
(11, u'unsavory')
(11, u'unveil')
(11, u'unwieldy')
(11, u'upfront')
(11, u'ushered')
(11, u'valid')
(11, u'vanessa')
(11, u'vardalos')
(11, u'vega')
(11, u'vicarious')
(11, u'victorious')
(11, u'videologue')
(11, u'viewpoint')
(11, u'violinist')
(11, u'virtuous')
(11, u'vonnegut')
(11, u'vowing')
(11, u'wallflower')
(11, u'warren')
(11, u'waster')
(11, u'watches')
(11, u'weathered')
(11, u'weber')
(11, u'weil')
(11, u'weinstein')
(11, u'welty')
(11, u'westbrook')
(11, u'whitaker')
...

(9, u'arguably')
(9, u'articulates')
(9, u'artworks')
(9, u'ascertain')
(9, u'asphalt')
(9, u'aspiration')
(9, u'assigned')
(9, u'associations')
(9, u'attachment')
(9, u'attuned')
(9, u'auspicious')
(9, u'autocritique')
(9, u'avalanche')
(9, u'avengers')
(9, u'avid')
(9, u'baader')
(9, u'baffle')
(9, u'balding')
(9, u'ballesta')
(9, u'banged')
(9, u'barker')
(9, u'bastard')
(9, u'batman')
(9, u'bawdy')
(9, u'beasts')
(9, u'beause')
(9, u'bedfellows')
(9, u'bedtime')
(9, u'beers')
(9, u'bees')
(9, u'began')
(9, u'beginnings')
(9, u'believed')
(9, u'benchmark')
(9, u'bergmanesque')
(9, u'beside')
(9, u'besides')
(9, u'bespeaks')
(9, u'bestowing')
(9, u'bicentennial')
(9, u'blacked')
(9, u'blacken')
(9, u'blanchett')
(9, u'blender')
(9, u'blight')
(9, u'blockage')
(9, u'blueblood')
(9, u'boards')
(9, u'boatload')
(9, u'boffo')
(9, u'bona')
(9, u'bordering')
(9, u'bots')
(9, u'bowl')
(9, u'boyd')
(9, u'brady')
(9, u'branched')
(9, u'brawn')
(9, u'brazen')
(9, u'brazenly')
(9, u'broadside')

(8, u'concerning')
(8, u'concludes')
(8, u'confection')
(8, u'confession')
(8, u'confrontations')
(8, u'confuses')
(8, u'conjuring')
(8, u'conrad')
(8, u'consoled')
(8, u'consumerist')
(8, u'contentedly')
(8, u'continually')
(8, u'continuing')
(8, u'contours')
(8, u'contributions')
(8, u'convolution')
(8, u'cookie')
(8, u'corcuera')
(8, u'coriat')
(8, u'cornpone')
(8, u'cosa')
(8, u'couch')
(8, u'counterculture')
(8, u'courtesy')
(8, u'courtroom')
(8, u'crawling')
(8, u'crawls')
(8, u'cresting')
(8, u'cristo')
(8, u'crudely')
(8, u'crudities')
(8, u'cruelties')
(8, u'crusty')
(8, u'cursing')
(8, u'cynic')
(8, u'damsel')
(8, u'dangers')
(8, u'deadeningly')
(8, u'debilitating')
(8, u'decomposition')
(8, u'deer')
(8, u'deficit')
(8, u'definitions')
(8, u'degrading')
(8, u'deliberateness')
(8, u'demonstrated')
(8, u'dependable')
(8, u'derring')
(8, u'devaluation')
(8, u'deviant')
(8, u'devos')
(8, u'diamond')
(8, u'diatribes')
(8, u'dicey')
(8, u'differently')
(8, u'dimwits')
(8, u'dipped')
...

(7, u'bounces')
(7, u'bounds')
(7, u'bowser')
(7, u'breach')
(7, u'bullseye')
(7, u'buoy')
(7, u'burdened')
(7, u'burkinabe')
(7, u'byron')
(7, u'cadavers')
(7, u'calculus')
(7, u'cameras')
(7, u'candor')
(7, u'cantet')
(7, u'capably')
(7, u'capricious')
(7, u'captions')
(7, u'cardellini')
(7, u'careless')
(7, u'carlito')
(7, u'carousel')
(7, u'catalytic')
(7, u'celebrities')
(7, u'cheats')
(7, u'cheered')
(7, u'cheerfully')
(7, u'chimney')
(7, u'cimarron')
(7, u'clan')
(7, u'claustrophobia')
(7, u'claws')
(7, u'cleanflicks')
(7, u'cliffhanger')
(7, u'clinically')
(7, u'cloudy')
(7, u'clutch')
(7, u'clutches')
(7, u'collegiate')
(7, u'colosseum')
(7, u'coltish')
(7, u'comedically')
(7, u'comical')
(7, u'command')
(7, u'commendable')
(7, u'compassionately')
(7, u'compensated')
(7, u'concentrate')
(7, u'condone')
(7, u'conducted')
(7, u'condundrum')
(7, u'cone')
(7, u'confines')
(7, u'confining')
(7, u'conjure')
(7, u'cons')
(7, u'consistency')
(7, u'consists')
(7, u'conspiratorial')
...

(6, u'hanky')
(6, u'hanley')
(6, u'hardest')
(6, u'headline')
(6, u'hearing')
(6, u'heartland')
(6, u'heartstrings')
(6, u'hem')
(6, u'herrmann')
(6, u'hideousness')
(6, u'hints')
(6, u'hitchcockianism')
(6, u'hollowness')
(6, u'holmes')
(6, u'homages')
(6, u'hoods')
(6, u'hookers')
(6, u'hoopla')
(6, u'horde')
(6, u'horizons')
(6, u'horrifyingly')
(6, u'hush')
(6, u'hustling')
(6, u'hymn')
(6, u'icky')
(6, u'imagines')
(6, u'impatient')
(6, u'implicitly')
(6, u'imprint')
(6, u'improbabilities')
(6, u'inauspicious')
(6, u'inducingly')
(6, u'inexorably')
(6, u'inexpressible')
(6, u'inexpressive')
(6, u'infants')
(6, u'infectiously')
(6, u'infinite')
(6, u'inflicted')
(6, u'inform')
(6, u'initiation')
(6, u'injected')
(6, u'intention')
(6, u'intern')
(6, u'intolerance')
(6, u'intractable')
(6, u'inversion')
(6, u'invested')
(6, u'irrigates')
(6, u'irritatingly')
(6, u'ismail')
(6, u'jangle')
(6, u'jarecki')
(6, u'join')
(6, u'jostles')
(6, u'joyful')
(6, u'juxtapositions')
...

(5, u'unspeakable')
(5, u'unspeakably')
(5, u'untrained')
(5, u'uplifter')
(5, u'upside')
(5, u'vibrantly')
(5, u'violated')
(5, u'virtual')
(5, u'visualizing')
(5, u'vote')
(5, u'voyage')
(5, u'voyeur')
(5, u'wades')
(5, u'waged')
(5, u'wail')
(5, u'wanna')
(5, u'wanton')
(5, u'weaned')
(5, u'wending')
(5, u'whereas')
(5, u'whistle')
(5, u'whit')
(5, u'whites')
(5, u'wisp')
(5, u'wistfully')
(5, u'womanhood')
(5, u'wordy')
(5, u'wretched')
(5, u'ww')
(5, u'yeah')
(5, u'yields')
(5, u'zaza')
(5, u'zip')
(5, u'zzzzzzzzz')
(4, u'70')
(4, u'78')
(4, u'98')
(4, u'aaa')
(4, u'accompanied')
(4, u'ackerman')
(4, u'affects')
(4, u'alain')
(4, u'alfonso')
(4, u'alt')
(4, u'anarchists')
(4, u'anchor')
(4, u'annoyances')
(4, u'appetizer')
(4, u'applied')
(4, u'approximation')
(4, u'arctic')
(4, u'armenian')
(4, u'artnering')
(4, u'averse')
(4, u'ayres')
(4, u'babies')
(4, u'ballsy')
(4, u'bardem')
(4, u'barred')
(4, u'beating')
(4, u'belinsky')
(4, u'bergman')
(4, u'bjarne')
(4, u'blandly')
...

## Convert Occurrence to Frequency

Problem with occurrence count of words:
- longer documents will have higher average count values than shorter documents, even though they might talk about the same topics

Solution:
- divide the number of occurrences of each word in a document by the total number of words in the document
- new features formed by this method are called **tf** (***Term Frequencies***)

Refinement on *tf*:
- downscale weights for words that occur in many documents in the corpus and are therefore less informative than those that occur only in a smaller portion of the corpus
- this downscaling is called **tf-idf** (***Term Frequency times Inverse Document Frequency***)

Let's compute *tf* and *tf-idf* : 

In [58]:
tf_transformer = TfidfTransformer(use_idf=False).fit(X_train_counts)
X_train_tf = tf_transformer.transform(X_train_counts)

In [59]:
# 156060 rows of train data & 15240 features (one for each vocabulary word)
X_train_tf.shape

(156060, 15240)

In [69]:
# print the non-zero values of one tf-transformed feature vector (idf is switched off here)
print(X_train_tf[1:2])

  (0, 5837)	0.258198889747
  (0, 5323)	0.258198889747
  (0, 5821)	0.258198889747
  (0, 7217)	0.258198889747
  (0, 14871)	0.258198889747
  (0, 13503)	0.258198889747
  (0, 288)	0.258198889747
  (0, 13505)	0.516397779494
  (0, 3490)	0.258198889747
  (0, 4577)	0.258198889747
  (0, 9227)	0.258198889747
  (0, 11837)	0.258198889747
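
The printed values are consistent with *TfidfTransformer*'s default `norm='l2'` setting: each row of counts is divided by its Euclidean length rather than by the plain word total. The sketch below reproduces the two distinct values on a made-up row of counts (eleven words seen once and one word seen twice); the row is an assumption chosen to mirror the pattern above, not data read from the training set.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer

# One toy row of counts: eleven words seen once and one word seen twice.
counts = np.array([[1] * 11 + [2]])

tf = TfidfTransformer(use_idf=False).fit_transform(csr_matrix(counts)).toarray()

# Manual L2 normalization: divide each count by the row's Euclidean norm.
manual = counts / np.linalg.norm(counts, axis=1, keepdims=True)

print(tf[0, 0], tf[0, -1])
print(np.allclose(tf, manual))
```

Here the norm is sqrt(11 * 1 + 4) = sqrt(15), so a count of 1 becomes roughly 0.2582 and a count of 2 roughly 0.5164.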


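In the above code, we first used the *fit()* method to fit our estimator and then the *transform()* method to transform our count matrix to a *tf* representation (the idf weighting was switched off with *use_idf=False*).

These two steps can be combined using the *fit_transform()* method. Below we do so with the default settings, which also apply the idf weighting: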

In [83]:
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(156060, 15240)
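
To make the idf weighting concrete, the following sketch recomputes it by hand on a made-up count matrix and compares the result with *TfidfTransformer*. It assumes the default settings (`use_idf=True`, `smooth_idf=True`, `norm='l2'`), under which the idf of a word is `ln((1 + n) / (1 + df)) + 1`, with `n` the number of documents and `df` the number of documents containing the word.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer

# Toy count matrix: 3 documents, 3 words (not from the dataset).
counts = np.array([[2, 1, 0],
                   [1, 0, 0],
                   [1, 0, 3]])

transformer = TfidfTransformer()  # defaults: use_idf=True, smooth_idf=True, norm='l2'
tfidf = transformer.fit_transform(csr_matrix(counts)).toarray()

# Manual computation under the same defaults.
n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)               # document frequency per word
idf = np.log((1 + n_docs) / (1 + df)) + 1   # smoothed idf
weighted = counts * idf                     # scale counts by idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

print(np.round(idf, 3))
print(np.allclose(tfidf, manual))
```

The first word occurs in every document and receives the smallest idf, so its weight shrinks relative to the two rarer words.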

## Train Classifier

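We train our classifier by feeding it the tf-idf features together with the known sentiment labels. The trained classifier can then predict the sentiment value of each phrase in the test dataset. Note that the accuracy figures reported below are measured on the training data itself, so they are likely to overestimate performance on unseen phrases.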

### Naive Bayes Classifier

In [84]:
clf = MultinomialNB().fit(X_train_tfidf, y_train)

In [85]:
predicted = clf.predict(X_train_tfidf)

In [86]:
np.mean(predicted == y_train)

0.62973215429962837
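
The accuracy above is computed on the same phrases the classifier was trained on. One possible way to get a less optimistic estimate is to hold out some sentences for validation. Because many phrases are fragments of the same sentence, the sketch below keeps all phrases of a sentence on the same side of the split with `GroupShuffleSplit`. The toy lists are hypothetical stand-ins for `train['Phrase']`, `train['Sentiment']` and `train['SentenceId']`; this is an illustration, not part of the original workflow.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GroupShuffleSplit
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins for train['Phrase'], train['Sentiment'] and train['SentenceId'].
phrases = ["good movie", "good", "bad movie", "bad",
           "great story", "great", "dull story", "dull"]
labels = [3, 3, 1, 1, 3, 3, 1, 1]
sentence_ids = [1, 1, 2, 2, 3, 3, 4, 4]

# Keep all phrases of a sentence on the same side of the split.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
fit_idx, val_idx = next(splitter.split(phrases, labels, groups=sentence_ids))

model = make_pipeline(CountVectorizer(), TfidfTransformer(), MultinomialNB())
model.fit([phrases[i] for i in fit_idx], [labels[i] for i in fit_idx])

# Accuracy on phrases from sentences the model has never seen.
val_accuracy = model.score([phrases[i] for i in val_idx],
                           [labels[i] for i in val_idx])
print(val_accuracy)
```

On the real data the same pattern would apply with the three DataFrame columns in place of the toy lists.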

### Building a Pipeline

In order to make the **vectorizer => transformer => classifier** easier to work with, scikit-learn provides a **Pipeline** class that behaves like a compound classifier.

The training accuracy obtained with the Pipeline below is identical to the one obtained above with the separate steps (0.6297). The Pipeline class does not change the model; it simplifies the code by chaining tokenizing, tf-idf conversion and classification behind a single *fit()* / *predict()* interface.

In [97]:
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

In [98]:
text_clf.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        st...False,
         use_idf=True)), ('clf', MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True))])

In [99]:
predicted = text_clf.predict(X_train)

In [100]:
np.mean(predicted == y_train)

0.62973215429962837
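
The equivalence between the step-by-step route and the Pipeline can also be checked directly. The sketch below does so on a few made-up phrases (the documents and labels are illustrative only) and compares the predictions of both routes.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy training data and two unseen phrases (not from the dataset).
docs = ["good movie", "bad movie", "great story", "dull story"]
labels = [3, 1, 3, 1]
new_docs = ["good story", "dull movie"]

# Manual route: every step is fitted and applied by hand.
vect = CountVectorizer()
tfidf = TfidfTransformer()
clf = MultinomialNB()
clf.fit(tfidf.fit_transform(vect.fit_transform(docs)), labels)
manual_pred = clf.predict(tfidf.transform(vect.transform(new_docs)))

# Pipeline route: the same three steps behind one fit/predict interface.
pipe = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
pipe.fit(docs, labels)
pipe_pred = pipe.predict(new_docs)

print(np.array_equal(manual_pred, pipe_pred))
```

A practical benefit of the Pipeline is that new text only ever passes through *transform()*, never *fit_transform()*, so the vocabulary learned on the training data is reused automatically.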

Let's use the stop words filter in *CountVectorizer* and see how it affects the classifier's accuracy. Training accuracy rises slightly, from 0.6297 to 0.6332.

In [102]:
text_clf = Pipeline([
    ('vect', CountVectorizer(stop_words='english')),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

text_clf.fit(X_train, y_train)
predicted = text_clf.predict(X_train)
np.mean(predicted == y_train)

0.63321799307958482
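
It can help to look at what the filter actually removes. The sketch below compares the vocabulary with and without `stop_words='english'` on two made-up phrases. One thing to keep in mind for sentiment data is that scikit-learn's built-in English list contains negation words such as "not", so the two phrases become indistinguishable after filtering.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Two toy phrases with opposite meaning (not from the dataset).
docs = ["this is not a good movie", "this is a good movie"]

plain = CountVectorizer().fit(docs)
filtered = CountVectorizer(stop_words='english').fit(docs)

print(sorted(plain.vocabulary_))
print(sorted(filtered.vocabulary_))
print('not' in ENGLISH_STOP_WORDS)

# After filtering, both phrases map to the same count vector.
print(filtered.transform(docs).toarray())
```

Whether the filter helps on held-out data is therefore an empirical question; the small gain reported above is measured on the training set only.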

### Classification Report (precision, recall, f1-score)

In [128]:
# class labels as strings, in ascending order: ['0', '1', '2', '3', '4']
target_names = [str(label) for label in sorted(y_train.unique())]

print(classification_report(y_train, predicted, target_names=target_names))

             precision    recall  f1-score   support

          0       0.77      0.08      0.15      7072
          1       0.61      0.34      0.44     27273
          2       0.64      0.92      0.76     79582
          3       0.59      0.46      0.52     32927
          4       0.77      0.10      0.17      9206

avg / total       0.64      0.63      0.59    156060



### Confusion Matrix

In [129]:
print (confusion_matrix(y_train, predicted))

[[  580  3005  3346   137     4]
 [  136  9273 16986   868    10]
 [   30  2481 73022  4011    38]
 [    6   386 17270 15054   211]
 [    0    47  2877  5391   891]]
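
The confusion matrix and the classification report describe the same predictions: rows are true labels and columns are predicted labels, so recall is the diagonal divided by the row sums and precision is the diagonal divided by the column sums. A short sketch with the matrix copied by hand from the output above:

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true, columns: predicted).
cm = np.array([[580, 3005, 3346, 137, 4],
               [136, 9273, 16986, 868, 10],
               [30, 2481, 73022, 4011, 38],
               [6, 386, 17270, 15054, 211],
               [0, 47, 2877, 5391, 891]])

recall = cm.diagonal() / cm.sum(axis=1)     # correct / all phrases of that true class
precision = cm.diagonal() / cm.sum(axis=0)  # correct / all phrases predicted as that class
accuracy = cm.trace() / cm.sum()

print(np.round(recall, 2))
print(np.round(precision, 2))
print(round(accuracy, 4))
```

The third column is by far the largest in every row except the first and the last, which shows how strongly the classifier leans towards the neutral class.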


### Stochastic Gradient Descent (SGD) Classifier

In [139]:
text_clf = Pipeline([
    ('vect', CountVectorizer(stop_words='english')),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(loss='modified_huber', shuffle=True, penalty='l2', alpha=1e-3, random_state=42, max_iter=5, tol=None)),
])

text_clf.fit(X_train, y_train)
predicted = text_clf.predict(X_train)
np.mean(predicted == y_train)

0.54136229655260792

### Logistic Regression Classifier

In [137]:
text_clf = Pipeline([
    ('vect', CountVectorizer(stop_words='english', max_features=5000)),
    ('tfidf', TfidfTransformer()),
    ('clf', LogisticRegression())
])

text_clf.fit(X_train, y_train)
predicted = text_clf.predict(X_train)
np.mean(predicted == y_train)

0.64022811739074714

### OneVsOne Classifier

In [142]:
text_clf = Pipeline([
    ('vect', CountVectorizer(stop_words='english', max_features=5000)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsOneClassifier(LinearSVC()))
])

text_clf.fit(X_train, y_train)
predicted = text_clf.predict(X_train)
np.mean(predicted == y_train)

0.66293092400358833
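
For context, the one-vs-one strategy trains one binary classifier per pair of classes, which gives 5 * 4 / 2 = 10 classifiers for our five sentiment labels. The sketch below checks this on synthetic two-dimensional data; the data is generated only for illustration and has nothing to do with the reviews.

```python
import numpy as np
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
n_classes = 5

# Synthetic data: 20 points per class, each class around its own center.
X = np.vstack([rng.normal(loc=3 * k, scale=1.0, size=(20, 2))
               for k in range(n_classes)])
y = np.repeat(np.arange(n_classes), 20)

ovo = OneVsOneClassifier(LinearSVC()).fit(X, y)

# One fitted binary classifier per pair of classes.
print(len(ovo.estimators_), n_classes * (n_classes - 1) // 2)
```

Used on its own, *LinearSVC* handles multiple classes with a one-vs-rest scheme by default, so wrapping it in *OneVsOneClassifier* changes the multiclass strategy as well as the number of models that are trained.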

## Create Submission

In [135]:
test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 66292 entries, 0 to 66291
Data columns (total 3 columns):
PhraseId      66292 non-null int64
SentenceId    66292 non-null int64
Phrase        66292 non-null object
dtypes: int64(2), object(1)
memory usage: 2.0+ MB


In [35]:
X_test = test['Phrase']
phraseIds = test['PhraseId']
predicted = text_clf.predict(X_test)
output = pd.DataFrame(data={"PhraseId": phraseIds, "Sentiment": predicted})
# Uncomment to write the submission file (quoting=3 is csv.QUOTE_NONE)
# output.to_csv("submission.csv", index=False, quoting=3)
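
To preview what the submission file would look like without writing to disk, the CSV can be rendered into an in-memory buffer. The ids and predictions below are hypothetical stand-ins for `test['PhraseId']` and the model output.

```python
import io
import pandas as pd

# Hypothetical stand-ins for test['PhraseId'] and the model's predictions.
phrase_ids = pd.Series([156061, 156062, 156063], name="PhraseId")
predicted = [2, 3, 2]

output = pd.DataFrame({"PhraseId": phrase_ids, "Sentiment": predicted})

# Render the CSV into memory instead of a file.
buffer = io.StringIO()
output.to_csv(buffer, index=False)
print(buffer.getvalue())
```

Passing `index=False` keeps the DataFrame index out of the file, so it contains only the two columns, `PhraseId` and `Sentiment`.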