### Sentiment Analysis in Python - Part 2

This is a continuation of Part 1. The data from Part 1 was saved in CSV format; the lines of code below simply read the CSV files into dataframes.

In [158]:
import numpy as np
import pandas as pd

In [159]:
train_reviews = pd.read_csv('train_reviews.csv')

In [160]:
test_reviews = pd.read_csv('test_reviews.csv')

In [161]:
train_reviews.head()

|  | Unnamed: 0 | review | label | file | tidy_review | reviews_without_stopwords | normalized |
|---|---|---|---|---|---|---|---|
| 0 | 8372 | gone in 60 sec. where do i began, it keeps you... | 1 | 6286_8.txt | gone in 60 sec where do i began it keeps you i... | gone 60 sec began keeps movie good action cool... | gone 60 sec began keep movi good action cool c... |
| 1 | 22624 | THHE2 is entertaining in that you'll laugh a l... | 0 | 7863_4.txt | thhe2 is entertaining in that youll laugh a lo... | thhe2 entertaining youll laugh lot cringe prob... | thhe2 entertain youll laugh lot cring probabl ... |
| 2 | 22032 | A young boy sees his mother getting killed and... | 0 | 732_1.txt | a young boy sees his mother getting killed and... | young boy sees mother getting killed father ha... | young boy see mother get kill father hang 20 y... |
| 3 | 798 | Spacecamp is a movie that I plan to show my Da... | 1 | 10719_10.txt | spacecamp is a movie that i plan to show my da... | spacecamp movie plan show daughter julia ann r... | spacecamp movi plan show daughter julia ann ru... |
| 4 | 14317 | I'm not going to waste my time writing an essa... | 0 | 11636_3.txt | im not going to waste my time writing an essay... | im going waste time writing essay waste time w... | im go wast time write essay wast time would li... |


In [162]:
test_reviews.head()

|  | Unnamed: 0 | review | label | file | tidy_review | reviews_without_stopwords | normalized |
|---|---|---|---|---|---|---|---|
| 0 | 24702 | Basic summary: Ipswitch used to be a community... | 0 | 9733_4.txt | basic summary ipswitch used to be a community ... | basic summary ipswitch used community witches ... | basic summari ipswitch use commun witch escap ... |
| 1 | 24288 | I have no respect for IMDb ratings anymore. I ... | 0 | 9360_1.txt | i have no respect for imdb ratings anymore i t... | respect imdb ratings anymore think bunch mormo... | respect imdb rate anymor think bunch mormon fl... |
| 2 | 2743 | I really enjoyed this movie. Yes there was dis... | 1 | 1246_9.txt | i really enjoyed this movie yes there was disr... | really enjoyed movie yes disrespect throughout... | realli enjoy movi ye disrespect throughout mov... |
| 3 | 6856 | This is a classic animated film from the carto... | 1 | 4921_10.txt | this is a classic animated film from the carto... | classic animated film cartoon series major cha... | classic anim film cartoon seri major charact g... |
| 4 | 17403 | Frankly I met real Han Su Ying before and seei... | 0 | 3163_4.txt | frankly i met real han su ying before and seei... | frankly met real han su ying seeing portrayed ... | frankli met real han su ying see portray ameri... |


Since we only need the normalized review text and the label column, we will remove the other columns. We will do the same for the test_reviews dataset.

In [148]:
train_reviews.drop([train_reviews.columns[0],'review', 'file', 'tidy_review', 'reviews_without_stopwords'], axis=1, inplace=True)

In [149]:
train_reviews.head()

|  | label | normalized |
|---|---|---|
| 0 | 1 | gone 60 sec began keep movi good action cool c... |
| 1 | 0 | thhe2 entertain youll laugh lot cring probabl ... |
| 2 | 0 | young boy see mother get kill father hang 20 y... |
| 3 | 1 | spacecamp movi plan show daughter julia ann ru... |
| 4 | 0 | im go wast time write essay wast time would li... |


In [151]:
test_reviews.drop([test_reviews.columns[0],'review', 'file', 'tidy_review', 'reviews_without_stopwords'], axis=1, inplace=True)

In [152]:
test_reviews.head()

|  | label | normalized |
|---|---|---|
| 0 | 0 | basic summari ipswitch use commun witch escap ... |
| 1 | 0 | respect imdb rate anymor think bunch mormon fl... |
| 2 | 1 | realli enjoy movi ye disrespect throughout mov... |
| 3 | 1 | classic anim film cartoon seri major charact g... |
| 4 | 0 | frankli met real han su ying see portray ameri... |
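
As a side note, the same two-column dataframe could also be obtained at read time instead of dropping columns afterwards. The sketch below uses a tiny in-memory CSV as a stand-in for train_reviews.csv (the rows and the names `csv_text`, `raw` and `slim` are made up for illustration) and assumes, as the `Unnamed: 0` column suggests, that the file was saved with its index.

```python
import io

import pandas as pd

# Made-up miniature of train_reviews.csv; the empty first header cell
# is what a CSV looks like when it was written together with its index.
csv_text = (
    ",review,label,file,tidy_review,reviews_without_stopwords,normalized\n"
    "8372,Some review text,1,6286_8.txt,some review text,review text,review text\n"
    "22624,Another review,0,7863_4.txt,another review,another review,anoth review\n"
)

# A plain read names the header-less first column "Unnamed: 0"
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.columns[0])

# usecols keeps only the columns needed for modelling
slim = pd.read_csv(io.StringIO(csv_text), usecols=["label", "normalized"])
print(slim.columns.tolist())
```

Whether this is preferable to `drop` is mostly a matter of taste; it avoids loading the long text columns that are never used.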


#### Bag of Words

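Bag of Words (BOW) is a common method in text analysis for extracting features from a text document. It counts how many times each word appears in a document and creates a tally. The BOW feature extraction technique is implemented in scikit-learn by CountVectorizer. Here we pass binary=True, so each feature records only whether a word is present in a review, not how often it appears.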

In [153]:
from sklearn.feature_extraction.text import CountVectorizer

bow_vectorizer = CountVectorizer(binary=True)
# Learn the vocabulary from the training reviews only
bow_vectorizer.fit(train_reviews['normalized'])
# Encode both sets with that same vocabulary
train_bow = bow_vectorizer.transform(train_reviews['normalized'])
test_bow = bow_vectorizer.transform(test_reviews['normalized'])
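
A toy corpus makes the difference between raw counts and binary=True flags concrete. This is only a sketch: the two short documents and the variable names are made up for illustration and are not taken from the review data.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up, already normalized "reviews"
docs = ["good movi good action", "bad movi"]

count_vec = CountVectorizer()              # raw word counts
binary_vec = CountVectorizer(binary=True)  # presence flags, as in the cell above

counts = count_vec.fit_transform(docs).toarray()
flags = binary_vec.fit_transform(docs).toarray()

# Vocabulary ordered by column index
vocab = sorted(count_vec.vocabulary_, key=count_vec.vocabulary_.get)

print(vocab)
print(counts)  # "good" is counted twice in the first document
print(flags)   # with binary=True the same cell is capped at 1
```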


Import the needed modules from scikit-learn to create the machine learning model. We will use a logistic regression model to predict whether a given review is positive or not.

In [154]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score

Use the train_test_split function to divide the train_reviews data set into train and test subsets (70% and 30% of the rows, respectively).

In [155]:
X_train, X_test, y_train, y_test = train_test_split(train_bow, train_reviews['label'], random_state=4, test_size=0.3)
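
To see what the split produces, here is a minimal sketch on made-up data (the arrays `X` and `y` and the `_tr` / `_val` names are illustrative). It also passes `stratify`, which the cell above does not use; that is an optional addition that keeps the class proportions the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples, 10 per class (made up for illustration)
X = np.arange(40).reshape(20, 2)
y = np.array([0] * 10 + [1] * 10)

X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.3, random_state=4, stratify=y
)

print(X_tr.shape, X_val.shape)  # 70% of the rows vs. 30% of the rows
print(np.bincount(y_val))       # class counts in the held-out part
```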

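Loop over candidate values of C to determine which one gives the highest accuracy. In scikit-learn, C is the inverse of the regularization strength, so smaller values mean stronger regularization.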

In [156]:
for c in [0.01, 0.05, 0.25, 0.5, 1]:
    logreg = LogisticRegression(C=c)
    logreg.fit(X_train, y_train)
    print("Accuracy for C=%s: %s"
          % (c, accuracy_score(y_test, logreg.predict(X_test))))

Accuracy for C=0.01: 0.8686666666666667
Accuracy for C=0.05: 0.8776
Accuracy for C=0.25: 0.8762666666666666
Accuracy for C=0.5: 0.8712
Accuracy for C=1: 0.8674666666666667
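
Rather than reading the best value off by eye, the scores can be compared in code. A minimal sketch, with the accuracies copied from the output above into a dict (the names `scores`, `best_c` and `spread` are ours, not part of the original notebook):

```python
# Held-out accuracies copied from the output printed above
scores = {
    0.01: 0.8686666666666667,
    0.05: 0.8776,
    0.25: 0.8762666666666666,
    0.5: 0.8712,
    1: 0.8674666666666667,
}

# Pick the C with the highest accuracy
best_c = max(scores, key=scores.get)
print("Best C: %s (accuracy %.4f)" % (best_c, scores[best_c]))

# Gap between the best and the worst candidate
spread = max(scores.values()) - min(scores.values())
print("Spread: %.4f" % spread)
```

In a rerun, the loop itself could fill such a dict instead of only printing each score.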


Based on the results above, C=0.05 gives the highest accuracy on the held-out split (0.8776), although the candidates differ by only about one percentage point. The final model below uses C=0.5, which scored 0.8712 on that split. This time we use the entire train_reviews data as the training set and the entire test_reviews data as the test set. The final model gives us an accuracy of about 86%, which is not bad.

In [157]:
# Refit on the full training set, then score on the untouched test set
final_model = LogisticRegression(C=0.5)
final_model.fit(train_bow, train_reviews['label'])
print("Final Accuracy: %s"
      % accuracy_score(test_reviews['label'], final_model.predict(test_bow)))

Final Accuracy: 0.86392
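
f1_score was imported earlier but never used. As a sketch of how it could complement accuracy, here is a toy example with made-up labels (`y_true` and `y_pred` are illustrative and are not the IMDb predictions):

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up ground truth and predictions: 3 true positives,
# 1 false negative, 1 false positive, 1 true negative
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]

acc = accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)  # harmonic mean of precision and recall

print("Accuracy: %.3f" % acc)
print("F1: %.3f" % f1)
```

On the real data the equivalent call would be f1_score(test_reviews['label'], final_model.predict(test_bow)); we have not run it here, so no value is reported.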
