<h1>Project Computer Vision Lecture</h1>
<h2>Title: Movie Review Sentiment Analysis</h2>

1. Daniel Santoso / 2201756506
2. Boban Nathaniel Seputra / 2201762540
3. Luwis Lim / 2201761771
4. Steven Odolf Yuwono / 2201758045

In [1]:
import pandas as pd
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC

In [2]:
df = pd.read_csv('IMDB Dataset.csv', delimiter=',')
print(df.isnull().values.any())

y = df.sentiment

print(df)

False
                                                  review sentiment
0      One of the other reviewers has mentioned that ...  positive
1      A wonderful little production. <br /><br />The...  positive
2      I thought this was a wonderful way to spend ti...  positive
3      Basically there's a family where a little boy ...  negative
4      Petter Mattei's "Love in the Time of Money" is...  positive
...                                                  ...       ...
49995  I thought this movie did a down right good job...  positive
49996  Bad plot, bad dialogue, bad acting, idiotic di...  negative
49997  I am a Catholic taught in parochial elementary...  negative
49998  I'm going to have to disagree with the previou...  negative
49999  No one expects the Star Trek movies to be high...  negative

[50000 rows x 2 columns]


In [3]:
df['review'] = df['review'].str.lower()
print(df)

                                                  review sentiment
0      one of the other reviewers has mentioned that ...  positive
1      a wonderful little production. <br /><br />the...  positive
2      i thought this was a wonderful way to spend ti...  positive
3      basically there's a family where a little boy ...  negative
4      petter mattei's "love in the time of money" is...  positive
...                                                  ...       ...
49995  i thought this movie did a down right good job...  positive
49996  bad plot, bad dialogue, bad acting, idiotic di...  negative
49997  i am a catholic taught in parochial elementary...  negative
49998  i'm going to have to disagree with the previou...  negative
49999  no one expects the star trek movies to be high...  negative

[50000 rows x 2 columns]


In [4]:
# Remove numbers and punctuation

import re

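# Raw strings keep the regex escapes (\[, \s, \-, \/) from being read as string escapes
REPLACE_NO_SPACE = re.compile(r"[.;:!'?,\"()\[\]]")
REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")

def preprocess_reviews(reviews):
    # Delete punctuation outright, then turn doubled line breaks, hyphens and slashes into spaces
    reviews = [REPLACE_NO_SPACE.sub("", line.lower()) for line in reviews]
    reviews = [REPLACE_WITH_SPACE.sub(" ", line) for line in reviews]

    return reviews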

df['review'] = df['review'].apply(lambda x: re.sub(r"\d+", "", x))
df['review'] = preprocess_reviews(df['review'])
print(df)

                                                  review sentiment
0      one of the other reviewers has mentioned that ...  positive
1      a wonderful little production  the filming tec...  positive
2      i thought this was a wonderful way to spend ti...  positive
3      basically theres a family where a little boy j...  negative
4      petter matteis love in the time of money is a ...  positive
...                                                  ...       ...
49995  i thought this movie did a down right good job...  positive
49996  bad plot bad dialogue bad acting idiotic direc...  negative
49997  i am a catholic taught in parochial elementary...  negative
49998  im going to have to disagree with the previous...  negative
49999  no one expects the star trek movies to be high...  negative

[50000 rows x 2 columns]
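
The cleaning steps from cells 3 and 4 can be traced on a single string. The sketch below is a minimal, standard-library-only illustration: `clean_review`, `DIGITS`, and the sample sentences are hypothetical names and inputs introduced here, not part of the notebook, and the steps are applied in the same order as above (lowercase, digits, punctuation, then replacements with a space).

```python
import re

# Same patterns as in the notebook, written as raw strings.
REPLACE_NO_SPACE = re.compile(r"[.;:!'?,\"()\[\]]")
REPLACE_WITH_SPACE = re.compile(r"(<br\s*/><br\s*/>)|(\-)|(\/)")
DIGITS = re.compile(r"\d+")  # hypothetical name for the inline r"\d+" pattern


def clean_review(text):
    """Apply the notebook's cleaning steps to one review string."""
    text = text.lower()
    text = DIGITS.sub("", text)            # drop numbers
    text = REPLACE_NO_SPACE.sub("", text)  # delete punctuation
    return REPLACE_WITH_SPACE.sub(" ", text)  # breaks, hyphens, slashes -> space


sample = ("A wonderful little production. <br /><br />"
          "The filming technique is very old-time-BBC.")
cleaned = clean_review(sample)
print(repr(cleaned))

# A single <br /> is not matched by the doubled pattern; only its slash is replaced.
single_break = clean_review("Good start.<br />Bad ending.")
print(repr(single_break))
```

In this sketch the doubled `<br /><br />` disappears cleanly, while a lone `<br />` leaves the fragment `<br` behind, so `br` may still reach the vectorizer as a token. If that matters, a broader pattern for line-break tags could be applied before the punctuation step.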


In [5]:
review = df['review']

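# Drop a few high-frequency function words before building n-grams
stop_words = ['in', 'of', 'at', 'a', 'the']
# Binary presence features over unigrams, bigrams and trigrams.
# Note: the vocabulary is fitted on all reviews, including those held out below.
ngram_vectorizer = CountVectorizer(binary=True, ngram_range=(1, 3), stop_words=stop_words)
X = ngram_vectorizer.fit_transform(review)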

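# Hold out 20% for testing, then 20% of the remainder for validation
# (64% train / 16% validation / 20% test). No random_state is set,
# so the split changes from run to run.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8
)

X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, train_size=0.8
)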

print(X_train)
print(y_train)
print(X_val)
print(y_val)
print(X_test)
print(y_test)

  (0, 35721)	1
  (0, 36300)	1
  (0, 36302)	1
  (0, 70717)	1
  (0, 71157)	1
  (0, 71326)	1
  (0, 87877)	1
  (0, 90556)	1
  (0, 90560)	1
  (0, 92056)	1
  (0, 92141)	1
  (0, 92492)	1
  (0, 92826)	1
  (0, 92907)	1
  (0, 99873)	1
  (0, 99926)	1
  (0, 164849)	1
  (0, 170063)	1
  (0, 170065)	1
  (0, 212164)	1
  (0, 224782)	1
  (0, 224784)	1
  (0, 240554)	1
  (0, 240578)	1
  (0, 259570)	1
  :	:
  (31999, 9218628)	1
  (31999, 9218675)	1
  (31999, 9353070)	1
  (31999, 9364347)	1
  (31999, 9364350)	1
  (31999, 9386020)	1
  (31999, 9386176)	1
  (31999, 9390058)	1
  (31999, 9390063)	1
  (31999, 9402763)	1
  (31999, 9402770)	1
  (31999, 9472208)	1
  (31999, 9477455)	1
  (31999, 9477456)	1
  (31999, 9484401)	1
  (31999, 9485119)	1
  (31999, 9485120)	1
  (31999, 9485409)	1
  (31999, 9485410)	1
  (31999, 9574955)	1
  (31999, 9575212)	1
  (31999, 9575213)	1
  (31999, 9611906)	1
  (31999, 9612477)	1
  (31999, 9612478)	1
45513    positive
39582    positive
23308    negative
32027    positive
9405     negative
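
The column indices printed above refer to entries in the n-gram vocabulary. As a rough illustration of what `binary=True`, `ngram_range=(1, 3)` and the custom stop-word list produce, the sketch below rebuilds the features for one sentence in plain Python. It is an approximation of `CountVectorizer`'s default word analyzer as I understand it (tokens of two or more word characters, stop words removed before n-grams are formed), not its actual implementation; `binary_ngrams` and the example sentence are hypothetical.

```python
import re

STOP_WORDS = {"in", "of", "at", "a", "the"}
TOKEN = re.compile(r"\b\w\w+\b")  # tokens of two or more word characters


def binary_ngrams(text, n_min=1, n_max=3):
    """Return the set of word n-grams present in text (presence, not counts)."""
    tokens = [t for t in TOKEN.findall(text.lower()) if t not in STOP_WORDS]
    features = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            features.add(" ".join(tokens[i:i + n]))
    return features


features = binary_ngrams("the plot of the movie is bad")
print(sorted(features))
```

Because stop words are removed first, an n-gram such as `plot movie` joins words that were not adjacent in the original sentence. Using a set mirrors the binary setting: a feature is either present in a review or not, however often it occurs.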

In [6]:
# Train a linear SVM and measure accuracy on the validation split
svm = LinearSVC(C=0.1)
svm.fit(X_train, y_train)
print("Accuracy for C=%s: %s" % (0.1, accuracy_score(y_val, svm.predict(X_val))))

Accuracy for C=0.1: 0.90225
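
The notebook evaluates a single value, `C=0.1`, on the validation split. One common use of a validation split is to compare several values of `C` before touching the test set. The sketch below shows that loop on a tiny made-up corpus so that it runs in seconds; the corpus, the candidate values and the names `X_toy`, `scores` and `best_c` are assumptions for illustration, and the accuracies it prints say nothing about the IMDB data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

# Hypothetical toy corpus standing in for the IMDB reviews.
positive = ["great film", "wonderful acting", "loved it", "brilliant story"]
negative = ["bad plot", "terrible acting", "hated it", "awful story"]
texts = (positive + negative) * 10
labels = (["positive"] * 4 + ["negative"] * 4) * 10

X_toy = CountVectorizer(binary=True).fit_transform(texts)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, labels, train_size=0.8, random_state=0, stratify=labels
)

# Fit one model per candidate C and record its validation accuracy.
scores = {}
for c in [0.01, 0.05, 0.1, 0.5, 1.0]:
    model = LinearSVC(C=c)
    model.fit(X_tr, y_tr)
    scores[c] = accuracy_score(y_va, model.predict(X_va))
    print("Accuracy for C=%s: %s" % (c, scores[c]))

best_c = max(scores, key=scores.get)
print("Best C on the validation split: %s" % best_c)
```

With the real data, the same loop would use `X_train`, `y_train`, `X_val` and `y_val` from the cell above, and the test split would be scored once with the chosen value.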


In [7]:
# Refit the same configuration and evaluate it on the held-out test split
svm1 = LinearSVC(C=0.1)
svm1.fit(X_train, y_train)
print("Accuracy for C=%s: %s" % (0.1, accuracy_score(y_test, svm1.predict(X_test))))

Accuracy for C=0.1: 0.901
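
The two accuracy figures are easier to compare as counts. The arithmetic below is a small sketch that assumes each `train_size=0.8` split rounds the training share down and gives the remainder to the held-out part, which agrees with the last training row index (31999) printed earlier. The correct-prediction counts are derived from the reported accuracies, not read from the notebook.

```python
import math

n_reviews = 50000

# First split: 80% train+validation, 20% test.
n_train_val = math.floor(n_reviews * 0.8)
n_test = n_reviews - n_train_val

# Second split: 80% of the remainder for training, 20% for validation.
n_train = math.floor(n_train_val * 0.8)
n_val = n_train_val - n_train
print(n_train, n_val, n_test)

# Convert the reported accuracies into numbers of correctly classified reviews.
val_correct = round(0.90225 * n_val)
test_correct = round(0.901 * n_test)
print(val_correct, "of", n_val, "validation reviews correct")
print(test_correct, "of", n_test, "test reviews correct")
```

Under these assumptions the validation and test accuracies differ by roughly a tenth of a percentage point, which suggests the validation score was a reasonable estimate of held-out performance for this run.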
