<div style="background-color: #99CD4E; text-align:center; vertical-align: middle; padding:40px 0;"> 
  <h1 style="color: white;"> *Sentiment Analyzer* </h1>
 </div>
 
 - [Original Blog](https://marcobonzanini.com/2015/01/19/sentiment-analysis-with-python-and-scikit-learn/) 
 - [Data](http://www.cs.cornell.edu/people/pabo/movie-review-data/)
 
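 This notebook takes a supervised machine learning approach: a classifier is trained on movie reviews labeled positive or negative, then evaluated on held-out reviews.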
 
 ![Supervised learning workflow](./misc/supervised-learning.png)


# Read the data for building the model

In [4]:
import os

In [5]:
data_dir = './review_polarity/txt_sentoken/'
classes = ['pos', 'neg']


In [6]:
train_data = []
train_labels = []
test_data = []
test_labels = []

In [7]:
for curr_class in classes:
    dirname = os.path.join(data_dir, curr_class)
    for fname in os.listdir(dirname):
        with open(os.path.join(dirname, fname), 'r') as f:
            content = f.read()
            if fname.startswith('cv9'):
                test_data.append(content)
                test_labels.append(curr_class)
            else:
                train_data.append(content)
                train_labels.append(curr_class)
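
The loop above holds out every review whose file name starts with `cv9` as test data and trains on the rest. Here is a minimal sketch of that splitting rule in isolation. The helper `split_by_prefix` and the sample file names are hypothetical, chosen only for illustration:

```python
def split_by_prefix(fnames, test_prefix="cv9"):
    """Partition file names into (train, test) lists by a file-name prefix."""
    train, test = [], []
    for fname in fnames:
        # Files matching the prefix are held out for testing.
        (test if fname.startswith(test_prefix) else train).append(fname)
    return train, test


# Illustrative names only; in the notebook they come from os.listdir(dirname).
sample = ["cv000_a.txt", "cv123_b.txt", "cv900_c.txt", "cv999_d.txt"]
train_files, test_files = split_by_prefix(sample)
print(train_files)
print(test_files)
```

Splitting on a fixed prefix keeps the split deterministic, so reruns of the notebook evaluate on the same held-out reviews.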

In [12]:
#train_data[0]

# Convert the train and test text into a vector space model representation

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=5,
                                 max_df = 0.8,
                                 sublinear_tf=True,
                                 use_idf=True)
train_vectors = vectorizer.fit_transform(train_data)
test_vectors = vectorizer.transform(test_data)
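
To make the vectorizer settings concrete, the following is a simplified pure-Python sketch of the weighting that `sublinear_tf=True` and `use_idf=True` select: term frequency becomes `1 + ln(tf)`, the smoothed inverse document frequency is `ln((1 + n) / (1 + df)) + 1`, and each row is L2-normalized. It assumes whitespace tokenization, which is cruder than scikit-learn's tokenizer, and the function name `tfidf` and the toy corpus are invented for this example:

```python
import math
from collections import Counter


def tfidf(docs, min_df=1, max_df=1.0):
    """Toy TF-IDF: sublinear tf, smoothed idf, L2-normalized rows."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Drop terms that are too rare (min_df, a count) or too common
    # (max_df, a fraction of the documents).
    vocab = sorted(t for t, c in df.items()
                   if c >= min_df and c <= max_df * n_docs)
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        row = []
        for term in vocab:
            tf = counts[term]
            if tf == 0:
                row.append(0.0)
                continue
            sub_tf = 1.0 + math.log(tf)
            idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0
            row.append(sub_tf * idf)
        norm = math.sqrt(sum(w * w for w in row))
        rows.append([w / norm if norm else 0.0 for w in row])
    return vocab, rows


docs = ["good good movie", "bad movie", "good plot", "bad bad plot movie"]
vocab, rows = tfidf(docs, min_df=2, max_df=0.7)
print(vocab)  # "movie" appears in 3 of 4 documents, so max_df=0.7 removes it
for row in rows:
    print([round(w, 3) for w in row])
```

In the notebook, `min_df=5` drops terms that occur in fewer than five reviews and `max_df=0.8` drops terms that occur in more than 80% of them, which removes both rare noise and near-universal words before training.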

In [14]:
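# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
terms = vectorizer.get_feature_names_out().tolist()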
terms[:-100]

['00',
 '000',
 '007',
 '10',
 '100',
 '1000',
 '101',
 '102',
 '11',
 '12',
 '13',
 '137',
 '13th',
 '14',
 '15',
 '150',
 '16',
 '16x9',
 '17',
 '17th',
 '18',
 '180',
 '18th',
 '19',
 '1912',
 '1939',
 '1940',
 '1947',
 '1950',
 '1950s',
 '1958',
 '1960',
 '1960s',
 '1961',
 '1962',
 '1963',
 '1964',
 '1966',
 '1967',
 '1968',
 '1969',
 '1970',
 '1970s',
 '1971',
 '1972',
 '1973',
 '1974',
 '1975',
 '1976',
 '1977',
 '1978',
 '1979',
 '1980',
 '1980s',
 '1981',
 '1982',
 '1983',
 '1984',
 '1985',
 '1986',
 '1987',
 '1988',
 '1989',
 '1990',
 '1990s',
 '1991',
 '1992',
 '1993',
 '1994',
 '1995',
 '1996',
 '1997',
 '1998',
 '1999',
 '19th',
 '20',
 '200',
 '2000',
 '2001',
 '20th',
 '21',
 '21st',
 '22',
 '23',
 '24',
 '25',
 '26',
 '27',
 '28',
 '29',
 '30',
 '300',
 '3000',
 '31',
 '33',
 '35',
 '36',
 '37',
 '40',
 '400',
 '42',
 '45',
 '48',
 '4th',
 '50',
 '500',
 '50s',
 '54',
 '57',
 '5th',
 '60',
 '60s',
 '666',
 '70',
 '70s',
 '75',
 '80',
 '800',
 '80s',
 '85',
 '8mm',
 '90'

In [16]:
import pandas as pd
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
df = pd.DataFrame(train_vectors.toarray(), columns=vectorizer.get_feature_names_out())

In [17]:
df.head()

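|   | 00 | 000 | 007 | 10 | 100 | 1000 | 101 | 102 | 11 | 12 | ... | zingers | zoe | zombie | zombies | zone | zoom | zooms | zorro | zucker | zwick |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.14335 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |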


In [18]:
len(train_data)

1800

In [19]:
len(test_data)

200

# Train a Logistic Regression model

In [22]:
from sklearn.linear_model import LogisticRegression

classifier_lr = LogisticRegression()

classifier_lr.fit(train_vectors, train_labels)

prediction_lr = classifier_lr.predict(test_vectors)
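
For intuition about what `predict` does in the two-class case, here is a minimal sketch of the logistic regression decision rule: a weighted sum of the TF-IDF features plus a bias is passed through the sigmoid, and the review is labeled `pos` when the resulting probability exceeds 0.5. The weights, bias, and feature values below are invented for illustration and are not the fitted values from this notebook:

```python
import math


def predict_label(weights, bias, features, classes=("neg", "pos")):
    """Return (label, probability of the second class) for one feature vector."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    prob_pos = 1.0 / (1.0 + math.exp(-score))  # sigmoid
    label = classes[1] if prob_pos > 0.5 else classes[0]
    return label, prob_pos


# Made-up weights for three hypothetical terms: "great", "boring", "plot".
weights = [1.5, -2.0, -0.3]
bias = 0.1

label_a, prob_a = predict_label(weights, bias, [0.8, 0.0, 0.2])
label_b, prob_b = predict_label(weights, bias, [0.0, 0.9, 0.4])
print(label_a, round(prob_a, 3))
print(label_b, round(prob_b, 3))
```

Because the classes are sorted alphabetically (`neg`, then `pos`), a positive weight pushes a review toward `pos` and a negative weight pushes it toward `neg`.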

# Check Model Accuracy

In [25]:
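# Note: score() is called on the training data here, so this reports training accuracy, not held-out accuracy.
print("Classification rate:", classifier_lr.score(train_vectors, train_labels))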

Classification rate: 0.977777777778


In [26]:
from sklearn.metrics import confusion_matrix
# Store the result under a separate name so the imported function is not overwritten.
cm = confusion_matrix(test_labels, prediction_lr)
cm

array([[90, 10],
       [11, 89]])
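
The matrix above can be read directly: rows are the true labels and columns are the predicted labels, both in sorted order (`neg`, then `pos`). As a small worked example, this sketch copies the four counts from the output and derives the test accuracy from them:

```python
# Counts copied from the confusion matrix output above.
# Rows: true label; columns: predicted label; order: 'neg', 'pos'.
cm = [[90, 10],
      [11, 89]]
labels = ["neg", "pos"]

correct = sum(cm[i][i] for i in range(len(labels)))  # diagonal entries
total = sum(sum(row) for row in cm)
accuracy = correct / total
print(f"test accuracy: {accuracy:.3f}")  # prints "test accuracy: 0.895"

for i, label in enumerate(labels):
    wrong = sum(cm[i]) - cm[i][i]
    print(f"true {label}: {cm[i][i]} correct, {wrong} misclassified")
```

The held-out accuracy of 0.895 is noticeably lower than the 0.978 reported earlier on the training data, which is the expected gap between fit and generalization.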

In [27]:
from sklearn.metrics import classification_report 
print(classification_report(test_labels, prediction_lr))

             precision    recall  f1-score   support

        neg       0.89      0.90      0.90       100
        pos       0.90      0.89      0.89       100

avg / total       0.90      0.90      0.89       200
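
The per-class figures in the report follow from the confusion matrix shown earlier. A short sketch, assuming the same four counts, recomputes them: precision divides the correct predictions for a class by everything predicted as that class (a column sum), recall divides by everything truly in that class (a row sum), and F1 is their harmonic mean. The helper name `per_class_metrics` is made up for this example:

```python
# Counts copied from the confusion matrix output; order: 'neg', 'pos'.
cm = [[90, 10],
      [11, 89]]
labels = ["neg", "pos"]


def per_class_metrics(cm, k):
    """Return (precision, recall, f1) for class index k."""
    tp = cm[k][k]
    predicted_k = sum(row[k] for row in cm)  # column sum
    actual_k = sum(cm[k])                    # row sum
    precision = tp / predicted_k
    recall = tp / actual_k
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


metrics = {label: per_class_metrics(cm, k) for k, label in enumerate(labels)}
for label, (p, r, f) in metrics.items():
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

Rounded to two decimals, these values agree with the `neg` and `pos` rows of the report.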



# Coefficients of the logistic regression model

In [29]:
coeffs = classifier_lr.coef_

In [32]:
positive_list = []
negative_list = []

threshold_positive = 0.7
threshold_negative = -0.7
# coeffs[0] holds one weight per term, in the same order as `terms`.
for term, coeff in zip(terms, coeffs[0]):
    if coeff > threshold_positive:
        positive_list.append(term)
    if coeff < threshold_negative:
        negative_list.append(term)
        

In [33]:
positive_list

['allows',
 'also',
 'although',
 'always',
 'american',
 'best',
 'both',
 'brilliant',
 'cameron',
 'change',
 'damon',
 'different',
 'especially',
 'excellent',
 'family',
 'fantastic',
 'fun',
 'gives',
 'great',
 'hilarious',
 'jackie',
 'job',
 'life',
 'many',
 'memorable',
 'most',
 'oscar',
 'others',
 'outstanding',
 'overall',
 'own',
 'perfect',
 'perfectly',
 'performance',
 'performances',
 'quite',
 'see',
 'seen',
 'sometimes',
 'terrific',
 'throughout',
 'true',
 'very',
 'war',
 'well',
 'will',
 'wonderful',
 'wonderfully',
 'works',
 'world',
 'yet']

In [34]:
negative_list

['any',
 'anyway',
 'attempt',
 'attempts',
 'awful',
 'bad',
 'better',
 'boring',
 'could',
 'dull',
 'even',
 'fails',
 'if',
 'lame',
 'least',
 'looks',
 'material',
 'maybe',
 'mess',
 'minute',
 'no',
 'none',
 'nothing',
 'only',
 'plot',
 'pointless',
 'poor',
 'poorly',
 'reason',
 'ridiculous',
 'script',
 'should',
 'stupid',
 'supposed',
 'terrible',
 'there',
 'tries',
 'tv',
 'unfortunately',
 'unfunny',
 'waste',
 'wasted',
 'why',
 'worse',
 'worst']
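
The threshold approach above returns the terms in alphabetical order, so the lists do not show which words carry the most weight. One possible alternative, sketched here with invented coefficients, is to sort the terms by weight and take the top few from each end. The helper `top_terms` and the sample values are hypothetical:

```python
def top_terms(terms, coefs, k=3):
    """Return (most positive, most negative) terms, strongest first."""
    ranked = sorted(zip(coefs, terms))  # ascending by coefficient
    most_negative = [term for _, term in ranked[:k]]
    most_positive = [term for _, term in ranked[::-1][:k]]
    return most_positive, most_negative


# Invented weights for illustration; in the notebook these would be
# `terms` and `coeffs[0]`.
sample_terms = ["awful", "great", "plot", "the", "wonderful", "worst"]
sample_coefs = [-1.9, 2.1, -0.8, 0.05, 1.4, -2.6]

strongest_pos, strongest_neg = top_terms(sample_terms, sample_coefs, k=2)
print(strongest_pos)
print(strongest_neg)
```

Ranking by weight also makes it easier to spot surprising entries, such as neutral-looking words with large coefficients.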

# Exercise

- Do the positive and negative words make sense?

- Train a support vector machine (SVM) classifier on the same data and measure its accuracy on the test set.

Hint code:

~~~~
from sklearn import svm

# SVC uses an RBF kernel by default.
classifier_rbf = svm.SVC()

classifier_rbf.fit(train_vectors, train_labels)

prediction_rbf = classifier_rbf.predict(test_vectors)
~~~~

<div style="background-color: #99CD4E; text-align:center; vertical-align: middle; padding:40px 0;"> 
  <h1 style="color: white;"> *The End* </h1>
 </div>