# **Final Project - MSCS3806: Advanced Topics in AI and Machine Learning**


**Professor: Avid Farhoodfar**

**Student Name: Siyu Yi**

**Student ID: 277727**

**Date: December 19, 2019**

# Introduction

Sentiment Analysis is a common task in Natural Language Processing (NLP). One application of sentiment analysis is to predict whether a movie review should be classified as positive or negative using machine learning classification algorithms.

This project presents a simplified version of sentiment analysis on the movie review dataset compiled by Andrew Maas et al. in 2011 (http://ai.stanford.edu/~amaas/data/sentiment/). The dataset is built for binary sentiment classification and includes 25,000 highly polar movie reviews for training and 25,000 for testing.

@InProceedings{maas-EtAl:2011:ACL-HLT2011,
  author    = {Maas, Andrew L.  and  Daly, Raymond E.  and  Pham, Peter T.  and  Huang, Dan  and  Ng, Andrew Y.  and  Potts, Christopher},
  title     = {Learning Word Vectors for Sentiment Analysis},
  booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
  month     = {June},
  year      = {2011},
  address   = {Portland, Oregon, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {142--150},
  url       = {http://www.aclweb.org/anthology/P11-1015}
}


Because the dataset stores each category of movie reviews in a separate folder, I used the combined text files prepared by Aaron Kub on his GitHub page: https://github.com/aaronkub/machine-learning-examples/tree/master/imdb-sentiment-analysis.

Other references for this project are:
https://machinelearningmastery.com/prepare-movie-review-data-sentiment-analysis/;
https://machinelearningmastery.com/predict-sentiment-movie-reviews-using-deep-learning/.

# Read in the text and assign the sentiment value to the review.

In [None]:
import pandas as pd
pd.set_option('display.max_colwidth', 100)

#create a list where the first 12500 values are positive, and the rest of 12500 values are negative
sent = ['neg']*25000
sent[:12500]= ['pos']*12500

#read in the full test text file and assign column name of "body_text"
data_test = pd.read_csv('full_test.txt', sep='\r', header=None)
data_test.columns = ['body_text']

#add another column to match the sentiment with the review
data_test['label'] = sent
cols = ['label', 'body_text']

#we want to rearrange the order of columns so that the label is in front of the review.
data_test = data_test[cols]

data_test.head()

Unnamed: 0,label,body_text
0,pos,I went and saw this movie last night after being coaxed to by a few friends of mine. I'll admit ...
1,pos,"Actor turned director Bill Paxton follows up his promising debut, the Gothic-horror ""Frailty"", w..."
2,pos,"As a recreational golfer with some knowledge of the sport's history, I was pleased with Disney's..."
3,pos,"I saw this film in a sneak preview, and it is delightful. The cinematography is unusually creati..."
4,pos,Bill Paxton has taken the true story of the 1913 US golf open and made a film that is about much...
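
As an alternative to `pd.read_csv` with `sep='\r'`, the combined file could also be read line by line in plain Python. The sketch below assumes the file holds exactly one review per line with the positive reviews first; `load_reviews` and the small demo file are hypothetical names used only for illustration.

```python
import os
import tempfile

def load_reviews(path, n_pos):
    """Read one review per line; label the first n_pos lines 'pos', the rest 'neg'."""
    with open(path, encoding="utf-8") as f:
        reviews = [line.rstrip("\n") for line in f]
    labels = ["pos"] * n_pos + ["neg"] * (len(reviews) - n_pos)
    return list(zip(labels, reviews))

# Demo on a tiny stand-in file (assumption: the real files list 12,500 positives first).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo_reviews.txt")
    with open(demo_path, "w", encoding="utf-8") as f:
        f.write("Loved it.\nA delight.\nTerrible plot.\nFell asleep.\n")
    rows = load_reviews(demo_path, n_pos=2)

for label, text in rows:
    print(label, text)
```

The resulting list of `(label, text)` pairs can be passed to `pd.DataFrame(rows, columns=['label', 'body_text'])` to obtain the same layout as above.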


In [None]:
#Perform the same command with the training review data
import pandas as pd
pd.set_option('display.max_colwidth', 100)

#create a list where the first 12500 values are positive, and the rest of 12500 values are negative
sent = ['neg']*25000
sent[:12500]= ['pos']*12500

#read in the full training text file and assign column name of "body_text"
data_train = pd.read_csv('full_train.txt', sep='\r', header=None)
data_train.columns = ['body_text']

#add another column to match the sentiment with the review
data_train['label'] = sent
cols = ['label', 'body_text']

#we want to rearrange the order of columns so that the label is in front of the review.
data_train = data_train[cols]

data_train.head()

Unnamed: 0,label,body_text
0,pos,Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school l...
1,pos,Homelessness (or Houselessness as George Carlin stated) has been an issue for years but never a ...
2,pos,"Brilliant over-acting by Lesley Ann Warren. Best dramatic hobo lady I have ever seen, and love s..."
3,pos,"This is easily the most underrated film inn the Brooks cannon. Sure, its flawed. It does not giv..."
4,pos,This is not the typical Mel Brooks film. It was much less slapstick than most of his movies and ...


# Clean the review texts

## Remove the punctuation, special character, etc.

In [None]:
import string
import re

#because the texts are obtained from the IMDb website, they contain HTML line
#breaks such as <br /><br />. Those tags, along with hyphens and slashes, are replaced by a space.
br = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")

#First replace every <br /><br /> with a space, then keep all characters except punctuation.
def remove_punct(text):
    #apply the pattern to the whole string; looping over single characters can never match a tag
    text = br.sub(" ", text)
    text_nopunct = "".join([char for char in text if char not in string.punctuation])
    
    return text_nopunct

#data_train_clean = remove_punct(data_train['body_text'])



data_train['body_text_clean'] = data_train['body_text'].apply(lambda x: remove_punct(x.lower()))
data_train.head()

Unnamed: 0,label,body_text,body_text_clean
0,pos,Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school l...,bromwell high is a cartoon comedy it ran at the same time as some other programs about school li...
1,pos,Homelessness (or Houselessness as George Carlin stated) has been an issue for years but never a ...,homelessness or houselessness as george carlin stated has been an issue for years but never a pl...
2,pos,"Brilliant over-acting by Lesley Ann Warren. Best dramatic hobo lady I have ever seen, and love s...",brilliant over acting by lesley ann warren best dramatic hobo lady i have ever seen and love sce...
3,pos,"This is easily the most underrated film inn the Brooks cannon. Sure, its flawed. It does not giv...",this is easily the most underrated film inn the brooks cannon sure its flawed it does not give a...
4,pos,This is not the typical Mel Brooks film. It was much less slapstick than most of his movies and ...,this is not the typical mel brooks film it was much less slapstick than most of his movies and a...
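
The cleaning step can be checked on a single string. This is a minimal sketch using only the standard library; `clean_review`, `BR_PATTERN`, and the sample sentence are illustrative names and data, not part of the notebook. The pattern is applied to the whole string, since a multi-character tag such as `<br /><br />` can only match when the regex sees the full text rather than one character at a time.

```python
import re
import string

# Matches paired line-break tags as well as hyphens and slashes.
BR_PATTERN = re.compile(r"(<br\s*/><br\s*/>)|(-)|(/)")

def clean_review(text):
    """Lowercase, replace break tags/hyphens/slashes with a space, then drop punctuation."""
    text = BR_PATTERN.sub(" ", text.lower())
    return "".join(ch for ch in text if ch not in string.punctuation)

sample = "Great over-acting!<br /><br />Loved the hobo/lady scenes."
cleaned = clean_review(sample)
print(cleaned)
```

Replacing the tags before stripping punctuation matters: if punctuation were removed first, the angle brackets would disappear and the letters "br" would be left behind as a spurious token.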


## Further cleaning the text: tokenization and stopword removal
Comparing stemming and lemmatization

In [None]:
#download stopwords package from nltk
import nltk
nltk.download('stopwords')

#create variables for all stopwords in nltk
stopwords = nltk.corpus.stopwords.words('english')
ps = nltk.PorterStemmer()

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [None]:
def token(text):
  tokens = re.split(r'\W+', text)
  text = [word for word in tokens if word not in stopwords]
  return text

data_train['body_text_tokenize'] = data_train['body_text_clean'].apply(lambda x: token(x))

data_train.head()

Unnamed: 0,label,body_text,body_text_clean,body_text_tokenize
0,pos,Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school l...,bromwell high is a cartoon comedy it ran at the same time as some other programs about school li...,"[bromwell, high, cartoon, comedy, ran, time, programs, school, life, teachers, 35, years, teachi..."
1,pos,Homelessness (or Houselessness as George Carlin stated) has been an issue for years but never a ...,homelessness or houselessness as george carlin stated has been an issue for years but never a pl...,"[homelessness, houselessness, george, carlin, stated, issue, years, never, plan, help, street, c..."
2,pos,"Brilliant over-acting by Lesley Ann Warren. Best dramatic hobo lady I have ever seen, and love s...",brilliant over acting by lesley ann warren best dramatic hobo lady i have ever seen and love sce...,"[brilliant, acting, lesley, ann, warren, best, dramatic, hobo, lady, ever, seen, love, scenes, c..."
3,pos,"This is easily the most underrated film inn the Brooks cannon. Sure, its flawed. It does not giv...",this is easily the most underrated film inn the brooks cannon sure its flawed it does not give a...,"[easily, underrated, film, inn, brooks, cannon, sure, flawed, give, realistic, view, homelessnes..."
4,pos,This is not the typical Mel Brooks film. It was much less slapstick than most of his movies and ...,this is not the typical mel brooks film it was much less slapstick than most of his movies and a...,"[typical, mel, brooks, film, much, less, slapstick, movies, actually, plot, followable, leslie, ..."
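
The tokenization step can be illustrated without nltk. The sketch below assumes a small hand-written stopword set (`DEMO_STOPWORDS`, hypothetical) in place of nltk's English list, and additionally drops the empty strings that `re.split` returns when the text starts or ends with a non-word character.

```python
import re

# Hypothetical mini stopword list; the notebook uses nltk's English list instead.
DEMO_STOPWORDS = {"is", "a", "it", "at", "the", "as", "some", "other", "about"}

def tokenize(text, stopwords):
    """Split on runs of non-word characters, dropping stopwords and empty strings."""
    tokens = re.split(r"\W+", text)
    return [tok for tok in tokens if tok and tok not in stopwords]

tokens = tokenize("bromwell high is a cartoon comedy ", DEMO_STOPWORDS)
print(tokens)
```

Storing the stopwords in a `set` rather than a list makes each membership test constant time, which may help when the function is applied to thousands of reviews.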


In [None]:
def stemming(text):
  text = [ps.stem(word) for word in text]
  return text

data_train['body_text_stemmed'] = data_train['body_text_tokenize'].apply(lambda x:stemming(x))
data_train.head()

Unnamed: 0,label,body_text,body_text_clean,body_text_tokenize,body_text_stemmed
0,pos,Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school l...,bromwell high is a cartoon comedy it ran at the same time as some other programs about school li...,"[bromwell, high, cartoon, comedy, ran, time, programs, school, life, teachers, 35, years, teachi...","[bromwel, high, cartoon, comedi, ran, time, program, school, life, teacher, 35, year, teach, pro..."
1,pos,Homelessness (or Houselessness as George Carlin stated) has been an issue for years but never a ...,homelessness or houselessness as george carlin stated has been an issue for years but never a pl...,"[homelessness, houselessness, george, carlin, stated, issue, years, never, plan, help, street, c...","[homeless, houseless, georg, carlin, state, issu, year, never, plan, help, street, consid, human..."
2,pos,"Brilliant over-acting by Lesley Ann Warren. Best dramatic hobo lady I have ever seen, and love s...",brilliant over acting by lesley ann warren best dramatic hobo lady i have ever seen and love sce...,"[brilliant, acting, lesley, ann, warren, best, dramatic, hobo, lady, ever, seen, love, scenes, c...","[brilliant, act, lesley, ann, warren, best, dramat, hobo, ladi, ever, seen, love, scene, cloth, ..."
3,pos,"This is easily the most underrated film inn the Brooks cannon. Sure, its flawed. It does not giv...",this is easily the most underrated film inn the brooks cannon sure its flawed it does not give a...,"[easily, underrated, film, inn, brooks, cannon, sure, flawed, give, realistic, view, homelessnes...","[easili, underr, film, inn, brook, cannon, sure, flaw, give, realist, view, homeless, unlik, say..."
4,pos,This is not the typical Mel Brooks film. It was much less slapstick than most of his movies and ...,this is not the typical mel brooks film it was much less slapstick than most of his movies and a...,"[typical, mel, brooks, film, much, less, slapstick, movies, actually, plot, followable, leslie, ...","[typic, mel, brook, film, much, less, slapstick, movi, actual, plot, follow, lesli, ann, warren,..."


In [None]:
nltk.download('wordnet')
wn = nltk.WordNetLemmatizer()

def lemm(text):
    text = [wn.lemmatize(word) for word in text]
    return text

data_train['body_text_lemm'] = data_train['body_text_tokenize'].apply(lambda x: lemm(x))

data_train.head()

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Unnamed: 0,label,body_text,body_text_clean,body_text_tokenize,body_text_stemmed,body_text_lemm
0,pos,Bromwell High is a cartoon comedy. It ran at the same time as some other programs about school l...,bromwell high is a cartoon comedy it ran at the same time as some other programs about school li...,"[bromwell, high, cartoon, comedy, ran, time, programs, school, life, teachers, 35, years, teachi...","[bromwel, high, cartoon, comedi, ran, time, program, school, life, teacher, 35, year, teach, pro...","[bromwell, high, cartoon, comedy, ran, time, program, school, life, teacher, 35, year, teaching,..."
1,pos,Homelessness (or Houselessness as George Carlin stated) has been an issue for years but never a ...,homelessness or houselessness as george carlin stated has been an issue for years but never a pl...,"[homelessness, houselessness, george, carlin, stated, issue, years, never, plan, help, street, c...","[homeless, houseless, georg, carlin, state, issu, year, never, plan, help, street, consid, human...","[homelessness, houselessness, george, carlin, stated, issue, year, never, plan, help, street, co..."
2,pos,"Brilliant over-acting by Lesley Ann Warren. Best dramatic hobo lady I have ever seen, and love s...",brilliant over acting by lesley ann warren best dramatic hobo lady i have ever seen and love sce...,"[brilliant, acting, lesley, ann, warren, best, dramatic, hobo, lady, ever, seen, love, scenes, c...","[brilliant, act, lesley, ann, warren, best, dramat, hobo, ladi, ever, seen, love, scene, cloth, ...","[brilliant, acting, lesley, ann, warren, best, dramatic, hobo, lady, ever, seen, love, scene, cl..."
3,pos,"This is easily the most underrated film inn the Brooks cannon. Sure, its flawed. It does not giv...",this is easily the most underrated film inn the brooks cannon sure its flawed it does not give a...,"[easily, underrated, film, inn, brooks, cannon, sure, flawed, give, realistic, view, homelessnes...","[easili, underr, film, inn, brook, cannon, sure, flaw, give, realist, view, homeless, unlik, say...","[easily, underrated, film, inn, brook, cannon, sure, flawed, give, realistic, view, homelessness..."
4,pos,This is not the typical Mel Brooks film. It was much less slapstick than most of his movies and ...,this is not the typical mel brooks film it was much less slapstick than most of his movies and a...,"[typical, mel, brooks, film, much, less, slapstick, movies, actually, plot, followable, leslie, ...","[typic, mel, brook, film, much, less, slapstick, movi, actual, plot, follow, lesli, ann, warren,...","[typical, mel, brook, film, much, le, slapstick, movie, actually, plot, followable, leslie, ann,..."


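Based on these results, lemmatization appears to normalize the text better: it keeps readable words (for example, "comedy" rather than the stem "comedi"), although it occasionally mis-normalizes a token (for example, "less" becomes "le"). The following sections therefore use lemmatization only.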

# Evaluate Random forest classifier, Naive Bayes and Support vector machine

In [None]:
import re
import string
import nltk
nltk.download('stopwords')
nltk.download('wordnet')

review_train = data_train[['label','body_text']].copy()
review_test = data_test[['label','body_text']].copy()

br = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")
stopwords = nltk.corpus.stopwords.words('english')
wn = nltk.WordNetLemmatizer()

def clean_lem(text):
    text = br.sub(" ", text) # replace <br /><br />, hyphens and slashes with a space
    text = "".join([char.lower() for char in text if char not in string.punctuation]) # remove punctuation
    token = re.split(r'\W+', text) #split words
    text = [wn.lemmatize(word) for word in token if word not in stopwords] # Lemmatized

    return text


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Because the training and test sets together contain 50,000 reviews, my computer could not process the full dataset. Therefore, I randomly sampled 12,500 reviews from each of the training and test sets.

## Testing Sample

In [None]:
review_train_sample = review_train.sample(n=12500)
review_test_sample = review_test.sample(n=12500)

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
#from sklearn.model_selection import train_test_split

In [None]:
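# Applying TfidfVectorizer
# Note: scikit-learn ignores ngram_range when analyzer is a callable, so the features are the unigrams returned by clean_lem
tfidf_vect_sample = TfidfVectorizer(analyzer = clean_lem, ngram_range=(1,2))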
tfidf_vect_fit_sample = tfidf_vect_sample.fit(review_train_sample['body_text'])

tfidf_train_sample = tfidf_vect_fit_sample.transform(review_train_sample['body_text'])
tfidf_test_sample = tfidf_vect_fit_sample.transform(review_test_sample['body_text'])

X_train_vect_sample = pd.DataFrame(tfidf_train_sample.toarray())
X_test_vect_sample = pd.DataFrame(tfidf_test_sample.toarray())

Y_train_sample = review_train_sample['label'].copy()
Y_test_sample = review_test_sample['label'].copy()

X_train_vect_sample.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,...,65677,65678,65679,65680,65681,65682,65683,65684,65685,65686,65687,65688,65689,65690,65691,65692,65693,65694,65695,65696,65697,65698,65699,65700,65701,65702,65703,65704,65705,65706,65707,65708,65709,65710,65711,65712,65713,65714,65715,65716
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.053637,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.148323,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
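
The size of this dense table helps explain the memory pressure. The back-of-the-envelope estimate below uses the figures visible in the output (12,500 rows, columns numbered 0 to 65716) and assumes 8 bytes per value, i.e. float64.

```python
# Figures taken from the output above: 12,500 sampled reviews and
# columns numbered 0..65716, i.e. 65,717 features.
n_rows = 12_500
n_cols = 65_717
bytes_per_value = 8  # assumption: float64 values

dense_bytes = n_rows * n_cols * bytes_per_value
dense_gb = dense_bytes / 1e9
print(f"Dense matrix: {dense_gb:.2f} GB")
```

Under these assumptions, each of the two dense matrices (training and test) needs roughly 6.6 GB, almost all of it spent storing zeros.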


In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import precision_recall_fscore_support as score
import time

In [None]:
#Applying random forest classifier
rf_sample = RandomForestClassifier(n_estimators = 500, max_depth=None, n_jobs=-1)

start = time.time()
rf_model_sample = rf_sample.fit(X_train_vect_sample, Y_train_sample)
end = time.time()
fit_time_sample = (end - start)

start = time.time()
Y_pred_sample = rf_model_sample.predict(X_test_vect_sample)
end = time.time()
pred_time_sample = (end - start)

precision, recall, fscore, train_support = score(Y_test_sample, Y_pred_sample, pos_label='pos', average='binary')
print('Fit time: {} / Predict time: {} ---- Precision: {} / Recall: {} / Accuracy: {}'.format(
    round(fit_time_sample, 3), round(pred_time_sample, 3), round(precision, 3), round(recall, 3), round((Y_pred_sample==Y_test_sample).sum()/len(Y_pred_sample), 3)))

Fit time: 324.12 / Predict time: 4.553 ---- Precision: 0.861 / Recall: 0.852 / Accuracy: 0.857


Random forest takes a long time to fit the vectorized data: more than 5 minutes (about 324 seconds). At least the program did not crash here, as it did when I tried to analyze the full dataset.
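
One possible contributor to the crash on the full dataset is the `.toarray()` call, which expands the sparse TF-IDF output into a dense table. scikit-learn's `RandomForestClassifier`, `MultinomialNB`, and `LinearSVC` all accept SciPy sparse matrices directly, so the conversion may not be needed. The sketch below shows the idea on an invented toy corpus with the default analyzer; whether it is enough to fit all 25,000 reviews on a given machine is untested here.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy corpus, invented purely for illustration.
train_texts = ["great fun film", "wonderful acting", "boring plot", "awful dull film"]
train_labels = ["pos", "pos", "neg", "neg"]
test_texts = ["great acting", "dull boring plot"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)  # SciPy sparse matrix, no .toarray()
X_test = vectorizer.transform(test_texts)

# The classifier is fitted on the sparse matrix as-is.
model = LinearSVC(C=0.25).fit(X_train, train_labels)
predictions = model.predict(X_test)
print(sparse.issparse(X_train), list(predictions))
```

A sparse matrix stores only the non-zero entries, so its memory use grows with the number of distinct words per review rather than with the full vocabulary size.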

In [None]:
#Applying Naive Bayes classifier
mnb_sample = MultinomialNB()

start = time.time()
mnb_model_sample = mnb_sample.fit(X_train_vect_sample, Y_train_sample)
end = time.time()
fit_time_sample = (end - start)

start = time.time()
Y_pred_sample = mnb_model_sample.predict(X_test_vect_sample)
end = time.time()
pred_time_sample = (end - start)

precision, recall, fscore, train_support = score(Y_test_sample, Y_pred_sample, pos_label='pos', average='binary')
print('Fit time: {} / Predict time: {} ---- Precision: {} / Recall: {} / Accuracy: {}'.format(
    round(fit_time_sample, 3), round(pred_time_sample, 3), round(precision, 3), round(recall, 3), round((Y_pred_sample==Y_test_sample).sum()/len(Y_pred_sample), 3)))

Fit time: 2.481 / Predict time: 1.346 ---- Precision: 0.862 / Recall: 0.79 / Accuracy: 0.832


Compared with random forest, Naive Bayes's running time is negligible.
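
The precision, recall, and accuracy figures printed by these cells come from simple counts of correct and incorrect predictions. The sketch below computes them by hand on invented labels; `binary_metrics` is a hypothetical helper, not the scikit-learn function used in the notebook.

```python
def binary_metrics(y_true, y_pred, pos_label="pos"):
    """Return (precision, recall, accuracy) treating pos_label as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == pos_label and p == pos_label for t, p in pairs)  # true positives
    fp = sum(t != pos_label and p == pos_label for t, p in pairs)  # false positives
    fn = sum(t == pos_label and p != pos_label for t, p in pairs)  # false negatives
    correct = sum(t == p for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = correct / len(pairs)
    return precision, recall, accuracy

# Invented labels for illustration only.
y_true = ["pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg"]
y_pred = ["pos", "pos", "neg", "neg", "neg", "neg", "neg", "neg"]
precision, recall, accuracy = binary_metrics(y_true, y_pred)
print(precision, recall, accuracy)
```

In this toy case every review predicted positive really is positive (precision 1.0), but half of the positive reviews are missed (recall 0.5). The Naive Bayes result above shows a milder version of the same pattern: precision 0.862 against recall 0.79.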

In [None]:
#Test the regularization factor C

for c in [0.01, 0.05, 0.25, 0.5, 1]:
    svm = LinearSVC(C=c)
    svm.fit(X_train_vect_sample, Y_train_sample)
    print ("Accuracy for C=%s: %s" 
           % (c, accuracy_score(Y_test_sample, svm.predict(X_test_vect_sample))))

Accuracy for C=0.01: 0.83856
Accuracy for C=0.05: 0.86752
Accuracy for C=0.25: 0.87336
Accuracy for C=0.5: 0.8692
Accuracy for C=1: 0.86464


In [None]:
#Test the regularization factor C

for c in [0.05, 0.1, 0.15, 0.2, 0.25]:
    svm = LinearSVC(C=c)
    svm.fit(X_train_vect_sample, Y_train_sample)
    print ("Accuracy for C=%s: %s" 
           % (c, accuracy_score(Y_test_sample, svm.predict(X_test_vect_sample))))

Accuracy for C=0.05: 0.86752
Accuracy for C=0.1: 0.87184
Accuracy for C=0.15: 0.87336
Accuracy for C=0.2: 0.87304
Accuracy for C=0.25: 0.87336


In [None]:
#Test the regularization factor C

for c in [0.25, 0.30, 0.35, 0.40, 0.45, 0.5]:
    svm = LinearSVC(C=c)
    svm.fit(X_train_vect_sample, Y_train_sample)
    print ("Accuracy for C=%s: %s" 
           % (c, accuracy_score(Y_test_sample, svm.predict(X_test_vect_sample))))

Accuracy for C=0.25: 0.87336
Accuracy for C=0.3: 0.87376
Accuracy for C=0.35: 0.87264
Accuracy for C=0.4: 0.87112
Accuracy for C=0.45: 0.87016
Accuracy for C=0.5: 0.8692


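Across the three sweeps, C=0.3 gives the highest accuracy (0.87376), only 0.0004 above C=0.15 and C=0.25 (0.87336 each). The final model below uses C=0.25, which is within 0.0004 of the best value.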
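
A quick tally of the accuracies recorded in the three sweeps makes the comparison explicit. The values in the dictionary are copied from the outputs above; nothing is re-trained here.

```python
# Accuracies copied from the three sweeps above.
accuracy_by_c = {
    0.01: 0.83856, 0.05: 0.86752, 0.1: 0.87184, 0.15: 0.87336,
    0.2: 0.87304, 0.25: 0.87336, 0.3: 0.87376, 0.35: 0.87264,
    0.4: 0.87112, 0.45: 0.87016, 0.5: 0.8692, 1: 0.86464,
}

best_c = max(accuracy_by_c, key=accuracy_by_c.get)
gap = accuracy_by_c[best_c] - accuracy_by_c[0.25]
print(f"Best C: {best_c} (accuracy {accuracy_by_c[best_c]})")
print(f"Gap to C=0.25: {gap:.5f}")
```

On 12,500 test reviews, a gap of 0.0004 corresponds to 5 reviews, which is likely within sampling noise. Since these accuracies are measured on the test sample itself, a validation split taken from the training data would be a more cautious way to choose C.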

In [None]:
#Applying Support Vector Machines
svm_sample = LinearSVC(C=0.25)

start = time.time()
svm_model_sample = svm_sample.fit(X_train_vect_sample, Y_train_sample)
end = time.time()
fit_time_sample = (end - start)

start = time.time()
Y_pred_sample = svm_model_sample.predict(X_test_vect_sample)
end = time.time()
pred_time_sample = (end - start)

precision, recall, fscore, train_support = score(Y_test_sample, Y_pred_sample, pos_label='pos', average='binary')
print('Fit time: {} / Predict time: {} ---- Precision: {} / Recall: {} / Accuracy: {}'.format(
    round(fit_time_sample, 3), round(pred_time_sample, 3), round(precision, 3), round(recall, 3), round((Y_pred_sample==Y_test_sample).sum()/len(Y_pred_sample), 3)))

Fit time: 1.901 / Predict time: 0.855 ---- Precision: 0.873 / Recall: 0.874 / Accuracy: 0.873


As the results show, the Random Forest classifier yields an accuracy of 0.857;

Naive Bayes has the lowest accuracy: 0.832;

the Support Vector Machine has the highest accuracy, 0.873, with C=0.25.

As for run time, SVM is the fastest to fit (1.9 s), slightly ahead of Naive Bayes (2.5 s), while Random Forest is by far the slowest (324 s).

Therefore, SVM is the best of the three algorithms for this sentiment analysis task.

# Summary


This project is a simplified version of sentiment analysis using machine learning classification algorithms. I used three algorithms (Random Forest, Naive Bayes, and Support Vector Machine) to classify the vectorized text, but none of them achieved an accuracy above 90%.


In this project, I first cleaned the text by replacing HTML line-break tags such as `<br>` and `<br/>` with a space and removing punctuation using string.punctuation.


Next, I used the stopword list in the nltk library to eliminate words that are unlikely to have a significant impact on prediction, and lemmatized the remaining tokens.

The results for the three algorithms suggest that SVM is the best of the three: it yields the highest accuracy in the least amount of time.


A more advanced version could select features that better differentiate positive from negative reviews, for example by deleting very common words that have no significant impact on the result. This would reduce the workload of the program so that more data can be used for training and testing.
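
One simple way to drop very common words is a document-frequency filter. The sketch below is a standard-library illustration on invented token lists; `drop_common_tokens` and the 0.8 threshold are assumptions chosen for the example, not values tuned on the review data.

```python
from collections import Counter

def drop_common_tokens(tokenized_docs, max_doc_fraction=0.8):
    """Remove tokens that appear in more than max_doc_fraction of the documents."""
    doc_freq = Counter()
    for doc in tokenized_docs:
        doc_freq.update(set(doc))  # count each token once per document
    limit = max_doc_fraction * len(tokenized_docs)
    too_common = {tok for tok, count in doc_freq.items() if count > limit}
    filtered = [[tok for tok in doc if tok not in too_common] for doc in tokenized_docs]
    return filtered, too_common

# Invented token lists for illustration only.
docs = [
    ["film", "great", "acting"],
    ["film", "boring", "plot"],
    ["film", "great", "plot"],
    ["movie", "awful", "film"],
]
filtered_docs, removed = drop_common_tokens(docs, max_doc_fraction=0.8)
print(removed, filtered_docs)
```

Here "film" occurs in every document and so carries no information about sentiment. scikit-learn's `TfidfVectorizer` exposes `max_df` and `min_df` parameters that apply a similar filter while building the vocabulary.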


Further, this program could be extended to classify reviews into more categories, such as very negative, negative, neutral, positive, and very positive. These are directions in which sentiment analysis can be explored further.