<h1>Computer Vision Lecture Project</h1>
<h2>Title: Movie Review Sentiment Analysis</h2>

1. Daniel Santoso / 2201756506
2. Boban Nathaniel Seputra / 2201762540
3. Luwis Lim / 2201761771
4. Steven Odolf Yuwono / 2201758045

In [1]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC

In [2]:
df = pd.read_csv('IMDB Dataset.csv', delimiter=',')
print(df.isnull().values.any())

y = df.sentiment

print(df)

False
                                                  review sentiment
0      One of the other reviewers has mentioned that ...  positive
1      A wonderful little production. <br /><br />The...  positive
2      I thought this was a wonderful way to spend ti...  positive
3      Basically there's a family where a little boy ...  negative
4      Petter Mattei's "Love in the Time of Money" is...  positive
...                                                  ...       ...
49995  I thought this movie did a down right good job...  positive
49996  Bad plot, bad dialogue, bad acting, idiotic di...  negative
49997  I am a Catholic taught in parochial elementary...  negative
49998  I'm going to have to disagree with the previou...  negative
49999  No one expects the Star Trek movies to be high...  negative

[50000 rows x 2 columns]


In [3]:
df['review'] = df['review'].str.lower()
print(df)

                                                  review sentiment
0      one of the other reviewers has mentioned that ...  positive
1      a wonderful little production. <br /><br />the...  positive
2      i thought this was a wonderful way to spend ti...  positive
3      basically there's a family where a little boy ...  negative
4      petter mattei's "love in the time of money" is...  positive
...                                                  ...       ...
49995  i thought this movie did a down right good job...  positive
49996  bad plot, bad dialogue, bad acting, idiotic di...  negative
49997  i am a catholic taught in parochial elementary...  negative
49998  i'm going to have to disagree with the previou...  negative
49999  no one expects the star trek movies to be high...  negative

[50000 rows x 2 columns]


In [4]:
# Remove numbers and punctuation

import re

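# Raw strings avoid invalid-escape warnings for sequences such as \s and \[.
# Punctuation that is deleted outright:
REPLACE_NO_SPACE = re.compile(r"[.;:!'?,\"()\[\]]")
# Doubled <br /> tags, hyphens and slashes that are replaced by a space:
REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")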

def preprocess_reviews(reviews):
    reviews = [REPLACE_NO_SPACE.sub("", line.lower()) for line in reviews]
    reviews = [REPLACE_WITH_SPACE.sub(" ", line) for line in reviews]
    
    return reviews

df['review'] = df['review'].apply(lambda x: re.sub(r"\d+", "", x))
df['review'] = preprocess_reviews(df['review'])
print(df)

                                                  review sentiment
0      one of the other reviewers has mentioned that ...  positive
1      a wonderful little production  the filming tec...  positive
2      i thought this was a wonderful way to spend ti...  positive
3      basically theres a family where a little boy j...  negative
4      petter matteis love in the time of money is a ...  positive
...                                                  ...       ...
49995  i thought this movie did a down right good job...  positive
49996  bad plot bad dialogue bad acting idiotic direc...  negative
49997  i am a catholic taught in parochial elementary...  negative
49998  im going to have to disagree with the previous...  negative
49999  no one expects the star trek movies to be high...  negative

[50000 rows x 2 columns]
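
To see what the cleaning step does to a single review, here is a minimal sketch that applies the same two patterns to a made-up sentence. The sample text and the helper name `clean` are illustrative assumptions and are not part of the notebook.

```python
import re

# Same patterns as in the cell above, written as raw strings.
REPLACE_NO_SPACE = re.compile(r"[.;:!'?,\"()\[\]]")
REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")

def clean(text):
    """Hypothetical helper: lowercase, drop digits, then handle punctuation."""
    text = re.sub(r"\d+", "", text.lower())   # remove digits
    text = REPLACE_NO_SPACE.sub("", text)     # delete punctuation
    return REPLACE_WITH_SPACE.sub(" ", text)  # tags, '-' and '/' become spaces

sample = "A wonderful, 10/10 film!<br /><br />Well-made."
cleaned = clean(sample)
print(cleaned.split())
```

Note that `REPLACE_WITH_SPACE` only matches two consecutive `<br />` tags. A single tag would not be removed as a unit; only its slash would be replaced by a space.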


In [6]:
review = df['review']

stop_words = ['in', 'of', 'at', 'a', 'the']
ngram_vectorizer = CountVectorizer(binary=True, ngram_range=(1, 3), stop_words=stop_words)
ngram_vectorizer.fit(review)
X = ngram_vectorizer.transform(review)

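# Hold out 20% of the data as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8
)

# Split the remaining 80% again: 64% train, 16% validation overall.
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, train_size=0.8
)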

print(X_train)
print(y_train)
print(X_val)
print(y_val)
print(X_test)
print(y_test)

  (0, 40174)	1
  (0, 42082)	1
  (0, 42086)	1
  (0, 92492)	1
  (0, 93529)	1
  (0, 93537)	1
  (0, 98030)	1
  (0, 98031)	1
  (0, 144871)	1
  (0, 145024)	1
  (0, 145025)	1
  (0, 181329)	1
  (0, 181563)	1
  (0, 181566)	1
  (0, 181730)	1
  (0, 181735)	1
  (0, 212164)	1
  (0, 212845)	1
  (0, 212925)	1
  (0, 269740)	1
  (0, 270528)	1
  (0, 270537)	1
  (0, 284317)	1
  (0, 284320)	1
  (0, 324962)	1
  :	:
  (31999, 9068260)	1
  (31999, 9070608)	1
  (31999, 9076591)	1
  (31999, 9076712)	1
  (31999, 9173112)	1
  (31999, 9178387)	1
  (31999, 9178483)	1
  (31999, 9193924)	1
  (31999, 9213145)	1
  (31999, 9213248)	1
  (31999, 9214350)	1
  (31999, 9218171)	1
  (31999, 9218174)	1
  (31999, 9472208)	1
  (31999, 9472519)	1
  (31999, 9472681)	1
  (31999, 9576979)	1
  (31999, 9584044)	1
  (31999, 9584049)	1
  (31999, 9586993)	1
  (31999, 9587645)	1
  (31999, 9588083)	1
  (31999, 9588087)	1
  (31999, 9595728)	1
  (31999, 9596185)	1
27754    negative
4975     positive
48690    negative
34358    negative
45873
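
The vectorizer above turns each review into binary indicators for sequences of one to three words, after removing the listed stop words. The pure-Python sketch below illustrates that idea on one sentence. The function name `binary_ngrams` is hypothetical, and it uses a plain whitespace split instead of scikit-learn's tokenizer (which, by default, also drops single-character tokens), so it approximates `CountVectorizer` rather than reproducing it.

```python
def binary_ngrams(text, stop_words, n_min=1, n_max=3):
    """Return the set of word n-grams present in text (presence only, no counts)."""
    # Stop words are removed first, so n-grams span the remaining tokens.
    tokens = [t for t in text.split() if t not in stop_words]
    features = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            features.add(" ".join(tokens[i:i + n]))
    return features

stop_words = ['in', 'of', 'at', 'a', 'the']
features = binary_ngrams("bad plot bad dialogue in the film", stop_words)
print(sorted(features))
```

Because the result is a set, the repeated word "bad" contributes a single feature, which mirrors the effect of `binary=True`.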

In [7]:
svm = LinearSVC(C=0.1)
svm.fit(X_train, y_train)
# Accuracy on the validation split
print("Accuracy for C=%s: %s" % (0.1, accuracy_score(y_val, svm.predict(X_val))))

Accuracy for C=0.1: 0.90725


In [9]:
# Evaluate the model trained above (no retraining) on the held-out test split
print("Accuracy for C=%s: %s" % (0.1, accuracy_score(y_test, svm.predict(X_test))))

Accuracy for C=0.1: 0.9028
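
For context on the two scores, the split sizes and the number of correctly classified reviews can be worked out from the figures printed above. This is only arithmetic on the reported numbers (50,000 reviews, two successive 80/20 splits, and the two accuracies); the variable names are illustrative.

```python
n_total = 50000

# First split: 80% kept for training and validation, 20% for testing.
n_train_full = n_total * 8 // 10
n_test = n_total - n_train_full

# Second split: 80% of the remainder for training, 20% for validation.
n_train = n_train_full * 8 // 10
n_val = n_train_full - n_train

# Accuracy is the fraction of correct predictions, so count = accuracy * size.
val_correct = round(0.90725 * n_val)
test_correct = round(0.9028 * n_test)

print(n_train, n_val, n_test)
print(val_correct, test_correct)
```

The two scores differ by less than half a percentage point, which suggests the validation accuracy was a reasonable estimate of test performance for this particular random split.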
