### Sentiment Analysis
We will implement a Naive Bayes classifier to perform sentiment analysis on movie reviews. We take 1000 positive and 1000 negative reviews from the NLTK `movie_reviews` corpus, combine them, and split them into training and testing datasets. We then train the classifier on the training dataset and use it to predict the sentiment of the reviews in the testing dataset.

In [78]:
import pandas as pd
import numpy as np
import nltk
from nltk import word_tokenize

In [3]:
import os
import nltk.corpus
from nltk.corpus import movie_reviews

In [23]:
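# Note: nltk.tokenize.moses is no longer shipped with recent NLTK releases;
# there, install the sacremoses package and use `from sacremoses import MosesDetokenizer`.
from nltk.tokenize.moses import MosesDetokenizer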

In [4]:
from sklearn.feature_extraction.text import CountVectorizer

In [6]:
import sklearn
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

In [7]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer

## Opening NLTK corpora

In [8]:
print(os.listdir(nltk.data.find("corpora")))

['abc', 'abc.zip', 'alpino', 'alpino.zip', 'biocreative_ppi', 'biocreative_ppi.zip', 'brown', 'brown.zip', 'brown_tei', 'brown_tei.zip', 'cess_cat', 'cess_cat.zip', 'cess_esp', 'cess_esp.zip', 'chat80', 'chat80.zip', 'city_database', 'city_database.zip', 'cmudict', 'cmudict.zip', 'comparative_sentences', 'comparative_sentences.zip', 'comtrans.zip', 'conll2000', 'conll2000.zip', 'conll2002', 'conll2002.zip', 'conll2007.zip', 'crubadan', 'crubadan.zip', 'dependency_treebank', 'dependency_treebank.zip', 'dolch', 'dolch.zip', 'europarl_raw', 'europarl_raw.zip', 'floresta', 'floresta.zip', 'framenet_v15', 'framenet_v15.zip', 'framenet_v17', 'framenet_v17.zip', 'gazetteers', 'gazetteers.zip', 'genesis', 'genesis.zip', 'gutenberg', 'gutenberg.zip', 'ieer', 'ieer.zip', 'inaugural', 'inaugural.zip', 'indian', 'indian.zip', 'jeita.zip', 'kimmo', 'kimmo.zip', 'knbc.zip', 'lin_thesaurus', 'lin_thesaurus.zip', 'machado.zip', 'mac_morpho', 'mac_morpho.zip', 'masc_tagged.zip', 'movie_reviews', 'movie

In [9]:
print(movie_reviews.categories())

['neg', 'pos']


In [15]:
print(len(movie_reviews.fileids('pos')))
print("  ")
#print(movie_reviews.fileids('pos'))

1000
  


In [14]:
print(len(movie_reviews.fileids('neg')))
print(" ")
#print(movie_reviews.fileids('neg'))

1000
 


In [17]:
pos_reviews = movie_reviews.fileids('pos')
print(len(pos_reviews))

1000


In [18]:
neg_reviews = movie_reviews.fileids('neg')
print(len(neg_reviews))

1000


In [19]:
rev = nltk.corpus.movie_reviews.words('pos/cv000_29590.txt')
rev

['films', 'adapted', 'from', 'comic', 'books', 'have', ...]

In [20]:
len(rev)

862

### Converting tokens back to a string for CountVectorizer and TF-IDF calculation
- Two ways:
    - Use a detokenizer
    - Use the join method to join all tokens into a single string
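
As a minimal sketch of the second option, the helper below joins a token list and tidies the spacing that `join` leaves around punctuation. The function name `join_tokens` and the sample tokens are hypothetical, chosen only for illustration; the clean-up rules are simple string replacements, not a full detokenizer.

```python
def join_tokens(tokens):
    """Join tokens with spaces, then remove stray spaces around punctuation."""
    text = " ".join(tokens)
    text = text.replace(" ,", ",")   # "stop ," -> "stop,"
    text = text.replace(" .", ".")   # "ok ." -> "ok."
    text = text.replace("' ", "'")   # "' t" -> "'t"
    text = text.replace(" '", "'")   # "don 't" -> "don't"
    return text


# Hypothetical sample tokens, in the style produced by movie_reviews.words()
sample = ["don", "'", "t", "stop", ",", "ok", "."]
print(join_tokens(sample))
```

For the sample above this prints `don't stop, ok.`. Parentheses and double quotes are left untouched by these rules, so they keep their surrounding spaces.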

In [24]:
detokenizer = MosesDetokenizer()

In [25]:
detokenizer.detokenize(rev, return_str = True)

'films adapted from comic books have had plenty of success, whether they \'re about superheroes ( batman, superman, spawn), or geared toward kids ( casper) or the arthouse crowd ( ghost world), but there\' s never really been a comic book like from hell before. for starters, it was created by alan moore ( and eddie campbell), who brought the medium to a whole new level in the mid \'80s with a 12 - part series called the watchmen. to say moore and campbell thoroughly researched the subject of jack the ripper would be like saying michael jackson is starting to look a little odd. the book ( or "graphic novel," if you will) is over 500 pages long and includes nearly 30 more that consist of nothing but footnotes. in other words, don\' t dismiss this film because of its source. if you can get past the whole comic book thing, you might find another stumbling block in from hell \'s directors, albert and allen hughes. getting the hughes brothers to direct this seems almost as ludicrous as casti

In [26]:
rev_list = []

In [28]:
# Rebuild each negative review as one string and tidy the spacing that
# joining the tokens leaves around commas, periods and apostrophes.
for fileid in neg_reviews:
    rev_tokens = nltk.corpus.movie_reviews.words(fileid)
    review_one_string = " ".join(rev_tokens)
    review_one_string = review_one_string.replace(" ,", ",")
    review_one_string = review_one_string.replace(" .", ".")
    review_one_string = review_one_string.replace("\' ", "'")
    review_one_string = review_one_string.replace(" \'", "'")
    rev_list.append(review_one_string)

In [33]:
# Check how many reviews have been collected so far.
# Avoid print(rev_list): dumping every review at once exceeds the notebook's
# IOPub data rate limit (NotebookApp.iopub_data_rate_limit).
len(rev_list)


In [30]:
# Append the positive reviews after the negative ones, using the same clean-up.
for fileid in pos_reviews:
    rev_tokens = nltk.corpus.movie_reviews.words(fileid)
    review_one_string = " ".join(rev_tokens)
    review_one_string = review_one_string.replace(" ,", ",")
    review_one_string = review_one_string.replace(" .", ".")
    review_one_string = review_one_string.replace("\' ", "'")
    review_one_string = review_one_string.replace(" \'", "'")
    rev_list.append(review_one_string)

In [31]:
len(rev_list)

2000

In [32]:
print(rev_list[1000])

films adapted from comic books have had plenty of success, whether they're about superheroes ( batman, superman, spawn ), or geared toward kids ( casper ) or the arthouse crowd ( ghost world ), but there's never really been a comic book like from hell before. for starters, it was created by alan moore ( and eddie campbell ), who brought the medium to a whole new level in the mid'80s with a 12 - part series called the watchmen. to say moore and campbell thoroughly researched the subject of jack the ripper would be like saying michael jackson is starting to look a little odd. the book ( or " graphic novel, " if you will ) is over 500 pages long and includes nearly 30 more that consist of nothing but footnotes. in other words, don't dismiss this film because of its source. if you can get past the whole comic book thing, you might find another stumbling block in from hell's directors, albert and allen hughes. getting the hughes brothers to direct this seems almost as ludicrous as casting c

### We will denote negative reviews with '0' and positive reviews with '1'
Creating targets

In [42]:
# create numpy array with 1000 zeroes as integers
# (the np.int alias was removed in NumPy 1.24; the built-in int is equivalent)
neg_targets = np.zeros((1000,), dtype = int)
#print(neg_targets)

In [37]:
type(neg_reviews)

list

In [38]:
len(neg_reviews)

1000

In [43]:
# create numpy array with 1000 ones as integers
pos_targets = np.ones((1000,), dtype = int)
#print(pos_targets)

In [44]:
target_list = []

In [45]:
for neg_tar in neg_targets:
    target_list.append(neg_tar)

In [46]:
for pos_tar in pos_targets:
    target_list.append(pos_tar)

In [48]:
len(target_list)

2000

In [49]:
target_list[999]

0

In [50]:
target_list[1000]

1

In [51]:
# create pandas series
y = pd.Series(target_list)

In [52]:
type(y)

pandas.core.series.Series

In [53]:
y.shape

(2000,)

In [56]:
y.head(10)

0    0
1    0
2    0
3    0
4    0
5    0
6    0
7    0
8    0
9    0
dtype: int64
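
The target vector built step by step above can also be constructed in one go. The sketch below assumes the same ordering (all negative reviews first, then all positive ones); the names `n_neg`, `n_pos`, `targets` and `y_alt` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

n_neg, n_pos = 1000, 1000  # corpus sizes reported above

# 0 = negative, 1 = positive, in the same order the reviews were appended
targets = np.concatenate([np.zeros(n_neg, dtype=int), np.ones(n_pos, dtype=int)])
y_alt = pd.Series(targets)

print(y_alt.shape)
print(y_alt[999], y_alt[1000])  # last negative label, first positive label
```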

### Using Count Vectorizer (Bag of Words)

In [79]:
count_vect = CountVectorizer(min_df = 2, lowercase = True, stop_words ='english')

In [80]:
X_count_vect = count_vect.fit_transform(rev_list)

In [81]:
type(X_count_vect)

scipy.sparse.csr.csr_matrix

In [82]:
X_count_vect.shape

(2000, 23784)

In [83]:
# On scikit-learn >= 1.0 use get_feature_names_out(); get_feature_names() was removed in 1.2
X_names = count_vect.get_feature_names()

In [84]:
X_names

['00', '000', '007', '05', '10', '100', '1000', '100m', '101', '102', ...]

In [85]:
X_Count_vect = pd.DataFrame(X_count_vect.toarray(), columns = X_names)

In [86]:
X_Count_vect.head()

   00  000  007  05  10  100  1000  100m  101  102  ...  zoom  zooming  zooms  zoot  zorg  zorro  zucker  zuko  zwick  zwigoff
0   0    0    0   0  10    0     0     0    0    0  ...     0        0      0     0     0      0       0     0      0        0
1   0    0    0   0   0    0     0     0    0    0  ...     0        0      0     0     0      0       0     0      0        0
2   0    0    0   0   0    0     0     0    0    0  ...     0        0      0     0     0      0       0     0      0        0
3   0    0    0   0   0    0     0     0    0    0  ...     0        0      0     0     0      0       0     0      0        0
4   0    0    0   0   0    0     0     0    0    0  ...     0        0      0     0     0      0       0     0      0        0


In [87]:
type(X_Count_vect)

pandas.core.frame.DataFrame

In [88]:
X_Count_vect.shape

(2000, 23784)
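
To make the bag-of-words representation concrete, here is a small sketch on a hypothetical three-document corpus (the documents and the names `toy_docs`, `toy_vect`, `toy_counts` are invented for illustration). It uses the same `min_df = 2` setting as above, so words that appear in only one document are dropped from the vocabulary.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus, not taken from movie_reviews
toy_docs = [
    "good movie good plot",
    "bad movie",
    "good acting bad plot",
]

# min_df=2 keeps only words that occur in at least two documents
toy_vect = CountVectorizer(min_df=2)
toy_counts = toy_vect.fit_transform(toy_docs)

# vocabulary_ maps each kept word to its column index; sort words by that index
vocab = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)
print(vocab)
print(toy_counts.toarray())
```

Each row is one document and each column one vocabulary word; "acting" occurs in a single document and is therefore filtered out, while "good" is counted twice in the first row.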

### Train test split

In [91]:
X_train, X_test, y_train, y_test = train_test_split(X_Count_vect, y, test_size =0.25, random_state = 5)

In [92]:
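# Optional workaround: if a column named 'fit' (a word in the bag of words) causes an error, rename it.
# Note: run this before train_test_split (and redo the split) for it to affect X_train and X_test.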
X_Count_vect = X_Count_vect.rename(columns = {'fit':'fit_features'})

In [93]:
X_train.shape

(1500, 23784)

In [94]:
y_train.shape

(1500,)

In [95]:
X_test.shape

(500, 23784)

In [96]:
y_test.shape

(500,)

In [97]:
y_train.value_counts()

1    758
0    742
dtype: int64

In [98]:
y_test.value_counts()

0    258
1    242
dtype: int64
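
The class counts above are close to balanced but not exact, because the split is purely random. If exactly balanced splits are wanted, `train_test_split` accepts a `stratify` argument. The sketch below is an optional variation, not what the notebook does; `labels` and `features` are placeholder data standing in for `y` and the feature matrix.

```python
import numpy as np
from sklearn.model_selection import train_test_split

labels = np.array([0] * 1000 + [1] * 1000)        # same layout as y above
features = np.arange(len(labels)).reshape(-1, 1)  # placeholder feature column

# stratify=labels preserves the 50/50 class ratio in both splits
_, _, lab_train, lab_test = train_test_split(
    features, labels, test_size=0.25, random_state=5, stratify=labels
)

print(np.bincount(lab_train), np.bincount(lab_test))
```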

### Using Multinomial Naive Bayes

In [101]:
clf_cv = MultinomialNB()

In [102]:
clf_cv.fit(X_train, y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [104]:
y_pred = clf_cv.predict(X_test)

In [106]:
type(y_pred)

numpy.ndarray

In [107]:
y_pred.shape

(500,)

In [108]:
print(metrics.accuracy_score(y_test, y_pred))

0.798


In [109]:
score_clf_cv = confusion_matrix(y_test, y_pred)

In [110]:
score_clf_cv

array([[213,  45],
       [ 56, 186]], dtype=int64)

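In [111]:
# scikit-learn's confusion_matrix puts true labels on rows and predicted labels
# on columns; with 0 = negative and 1 = positive, [0][0] is the true-negative count.
TN = score_clf_cv[0][0]
FP = score_clf_cv[0][1]
FN = score_clf_cv[1][0]
TP = score_clf_cv[1][1]

In [112]:
print("True positive:", TP)
print("False positive:", FP)
print("False Negative:", FN)
print("True Negative:", TN)

True positive: 186
False positive: 45
False Negative: 56
True Negative: 213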


In [113]:
print("correctly identified:", TP+TN)
print("Wrongly identified:", FN+FP)

correctly identified: 399
Wrongly identified: 101
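
Accuracy is not the only figure that can be read off the confusion matrix. As a rough sketch, the block below recomputes accuracy and also derives precision and recall for the positive class from the matrix printed above. It relies on scikit-learn's layout (rows are true labels, columns are predicted labels, labels sorted as 0 then 1); the variable names are illustrative.

```python
import numpy as np

# Count-vectorizer confusion matrix as printed above
cm = np.array([[213, 45],
               [56, 186]])

# With labels ordered [0, 1], ravel() yields tn, fp, fn, tp
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)   # of reviews predicted positive, how many were positive
recall = tp / (tp + fn)      # of truly positive reviews, how many were found

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```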


### Let us check if our accuracy changes after applying the TF-IDF vectorizer

In [114]:
tf_vect = TfidfVectorizer(min_df= 2, lowercase = True, stop_words = 'english')

In [115]:
X_TFIDF = tf_vect.fit_transform(rev_list)

In [116]:
type(X_TFIDF)

scipy.sparse.csr.csr_matrix

In [117]:
X_TFIDF.shape

(2000, 23784)
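
The matrix has the same shape as the count matrix because only the cell values change: raw counts are reweighted and each row is normalized. The following sketch reproduces that weighting by hand on a hypothetical count matrix, assuming scikit-learn's default settings (`smooth_idf=True`, `norm='l2'`, no sublinear term frequency), for which the inverse document frequency is `ln((1 + n) / (1 + df)) + 1`.

```python
import numpy as np

# Hypothetical term counts: 3 documents (rows) x 3 terms (columns)
counts = np.array([[2, 1, 0],
                   [1, 0, 1],
                   [1, 0, 0]], dtype=float)

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)                 # number of documents containing each term
idf = np.log((1 + n_docs) / (1 + df)) + 1     # smoothed inverse document frequency

tfidf = counts * idf                                    # reweight the raw counts
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)   # scale each row to unit L2 norm

print(idf)
print(tfidf.round(3))
```

The first term occurs in every document, so it receives the smallest weight (an idf of exactly 1), while the rarer terms are weighted up.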

In [118]:
# On scikit-learn >= 1.0 use get_feature_names_out(); get_feature_names() was removed in 1.2
X_TFIDF_names = tf_vect.get_feature_names()

In [119]:
X_TFIDF_names

['00', '000', '007', '05', '10', '100', '1000', '100m', '101', '102', ...]

In [122]:
X_tf = pd.DataFrame(X_TFIDF.toarray(), columns = X_TFIDF_names)

In [123]:
X_tf.head()

    00  000  007   05        10  100  1000  100m  101  102  ...  zoom  zooming  zooms  zoot  zorg  zorro  zucker  zuko  zwick  zwigoff
0  0.0  0.0  0.0  0.0  0.396904  0.0   0.0   0.0  0.0  0.0  ...   0.0      0.0    0.0   0.0   0.0    0.0     0.0   0.0    0.0      0.0
1  0.0  0.0  0.0  0.0       0.0  0.0   0.0   0.0  0.0  0.0  ...   0.0      0.0    0.0   0.0   0.0    0.0     0.0   0.0    0.0      0.0
2  0.0  0.0  0.0  0.0       0.0  0.0   0.0   0.0  0.0  0.0  ...   0.0      0.0    0.0   0.0   0.0    0.0     0.0   0.0    0.0      0.0
3  0.0  0.0  0.0  0.0       0.0  0.0   0.0   0.0  0.0  0.0  ...   0.0      0.0    0.0   0.0   0.0    0.0     0.0   0.0    0.0      0.0
4  0.0  0.0  0.0  0.0       0.0  0.0   0.0   0.0  0.0  0.0  ...   0.0      0.0    0.0   0.0   0.0    0.0     0.0   0.0    0.0      0.0


In [124]:
X_TF_train, X_TF_test, y_TF_train, y_TF_test = train_test_split(X_tf, y, test_size = 0.25, random_state = 5)

In [125]:
X_TF_train.shape

(1500, 23784)

In [126]:
X_TF_test.shape

(500, 23784)

In [127]:
y_TF_train.shape

(1500,)

In [128]:
y_TF_test.shape

(500,)

In [130]:
clf_tf = MultinomialNB()

In [131]:
clf_tf.fit(X_TF_train, y_TF_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [134]:
y_TF_pred = clf_tf.predict(X_TF_test)

In [135]:
type(y_TF_pred)

numpy.ndarray

In [136]:
y_TF_pred.shape

(500,)

In [137]:
print(metrics.accuracy_score(y_TF_test, y_TF_pred))

0.82


In [138]:
score_clf_tf = confusion_matrix(y_TF_test, y_TF_pred)

In [139]:
score_clf_tf

array([[211,  47],
       [ 43, 199]], dtype=int64)

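In [141]:
# scikit-learn's confusion_matrix puts true labels on rows and predicted labels
# on columns; with 0 = negative and 1 = positive, [0][0] is the true-negative count.
TN = score_clf_tf[0][0]
FP = score_clf_tf[0][1]
FN = score_clf_tf[1][0]
TP = score_clf_tf[1][1]

In [142]:
print("True positive:", TP)
print("False positive:", FP)
print("False Negative:", FN)
print("True Negative:", TN)

True positive: 199
False positive: 47
False Negative: 43
True Negative: 211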


In [143]:
print("correctly identified:", TP+TN)
print("Wrongly identified:", FN+FP)

correctly identified: 410
Wrongly identified: 90
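
The vectorize, fit and predict steps used in this notebook can also be chained into a single scikit-learn pipeline. The sketch below is one possible arrangement rather than the notebook's own method, and it runs on a tiny hypothetical corpus (`train_docs`, `train_labels` and the test sentences are invented for illustration). One design difference: a pipeline fits the vectorizer on the training documents only, whereas the cells above fit it on all 2000 reviews before splitting.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training data: 1 = positive, 0 = negative
train_docs = [
    "a great and wonderful film",
    "wonderful acting great story",
    "a terrible boring film",
    "boring story terrible acting",
]
train_labels = [1, 1, 0, 0]

# The vectorizer and classifier are fitted together on the training documents
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_docs, train_labels)

pred = model.predict(["great wonderful story", "terrible boring acting"])
print(pred)
```

On this toy data the first sentence is classified as positive and the second as negative, since the words "great" and "wonderful" appear only in positive training documents and "terrible" and "boring" only in negative ones.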
