# Digital Talent Scholarship 2020
# Challenge 2 - IMDB Movie Reviews
# Data Scientist D - Group 2:

  1. Adinda Maharani - Electrical Engineering - 015192512011
  2. Garias Devara - Materials Engineering - 015192512160
  3. Johannes Silaban - Geology - 015192512155
  4. Yunita - Informatics Engineering - 01519251201

**Import the dataset from Drive**

In [2]:
from google.colab import drive
drive.mount('/content/drive/')

Mounted at /content/drive/


In [3]:
import pandas as pd
path = '/content/drive/My Drive/Colab Notebooks/datasets/IMDB_Dataset.csv'
df = pd.read_csv(path)
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


**Check the number of rows in the dataset**

In [4]:
df.shape

(50000, 2)

In [5]:
df['sentiment'].value_counts()

negative    25000
positive    25000
Name: sentiment, dtype: int64

**Identify missing data in the dataset**

In [6]:
df.isnull().sum()

review       0
sentiment    0
dtype: int64

Because the dataset is fairly large (**50,000** rows), we take a 10% random sample, i.e. **5,000** rows, to speed up modelling.

In [7]:
import random

n = 50000 # Total number of rows
s = 5000 # Desired sample size
# Randomly pick the n-s data rows to skip; row 0 (the header) is never skipped
skip = sorted(random.sample(range(1,n+1),n-s))
df2 = pd.read_csv(path, skiprows=skip)
df2.head()

Unnamed: 0,review,sentiment
0,"Probably my all-time favorite movie, a story o...",positive
1,"I remember this film,it was the first film i h...",positive
2,"What an absolutely stunning movie, if you have...",positive
3,"Taut and organically gripping, Edward Dmytryk'...",positive
4,"I bought this film at Blockbuster for $3.00, b...",negative


**Row count reduced from 50,000 to 5,000**

In [8]:
df2.shape

(5000, 2)

**Distribution of positive and negative sentiment**

In [9]:
df2['sentiment'].value_counts()

positive    2521
negative    2479
Name: sentiment, dtype: int64
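
Because `random.sample` is called without a seed, each run draws a different subset, so the class counts above will vary between runs. A minimal sketch of a repeatable variant, using only the standard library and a small in-memory CSV. The seed value `42`, the toy sizes and the helper name `sample_skip_rows` are assumptions made for illustration:

```python
import csv
import io
import random

def sample_skip_rows(n_rows, n_keep, seed):
    """Return the sorted 1-based data-row numbers to skip so that n_keep rows remain."""
    rng = random.Random(seed)  # local, seeded generator: same seed -> same sample
    return sorted(rng.sample(range(1, n_rows + 1), n_rows - n_keep))

# Toy CSV: line 0 is the header, lines 1..10 are data rows
text = "review,sentiment\n" + "".join(f"review {i},positive\n" for i in range(1, 11))

skip = set(sample_skip_rows(n_rows=10, n_keep=3, seed=42))
lines = list(csv.reader(io.StringIO(text)))
kept = [row for i, row in enumerate(lines) if i not in skip]

print(kept[0])        # the header row is always kept
print(len(kept) - 1)  # number of sampled data rows
```

The list returned by the helper can be passed as `skiprows` to `pd.read_csv` in the same way as the `skip` list in the cell above.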

# Preprocessing Data

**Download the NLTK (Natural Language Toolkit) data used to preprocess the text**

In [13]:
import nltk
nltk.download("popular")

[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/cmudict.zip.
[nltk_data]    | Downloading package gazetteers to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/gazetteers.zip.
[nltk_data]    | Downloading package genesis to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/genesis.zip.
[nltk_data]    | Downloading package gutenberg to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/gutenberg.zip.
[nltk_data]    | Downloading package inaugural to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/inaugural.zip.
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Unzipping corpora/movie_reviews.zip.
[nltk_data]    | Downloading package names to /root/nltk_data...
[nltk_data]    |   Unzipping corpora/names.zip.
[nltk_data]    | Downloading package shakespeare to /root/nltk_data...
...

True

In [14]:
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem import LancasterStemmer,WordNetLemmatizer
from nltk.tokenize import word_tokenize,sent_tokenize
from nltk.tokenize.toktok import ToktokTokenizer
from bs4 import BeautifulSoup
import spacy
import re,string,unicodedata

#Tokenization text
tokenizer=ToktokTokenizer()
#Setting English stopwords
stopword_list=nltk.corpus.stopwords.words('english')

**Strip HTML and noise from the text**

In [15]:
# Strip HTML tags
def strip_html(text):
  soup = BeautifulSoup(text, 'html.parser')
  return soup.get_text()

# Remove text enclosed in square brackets
def remove_between_square_brackets(text):
  return re.sub(r'\[[^]]*\]', '', text)

# Remove noise from the text
def denoise_text(text):
  text = strip_html(text)
  text = remove_between_square_brackets(text)
  return text

# Apply the function to the review column
df2['review'] = df2['review'].apply(denoise_text)


**Remove special characters (symbols) from the text**

In [16]:
# Define a function that removes special characters
# (remove_digits is accepted but unused: digits are kept)
def remove_special_characters(text, remove_digits=True):
    pattern=r'[^a-zA-Z0-9\s]'
    text=re.sub(pattern,'',text)
    return text

# Apply the function to the review column
df2['review']=df2['review'].apply(remove_special_characters)
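
The exact character class in this pattern matters. In ASCII, the range `A-z` also covers the six characters between `Z` and `a` (`[`, `\`, `]`, `^`, `_` and the backtick), so a class written with `A-z` leaves those symbols in the text, while `A-Z` does not. A small sketch of the difference, with a made-up input string:

```python
import re

text = "snake_case [note] ^up 100%"

loose = re.sub(r'[^a-zA-z0-9\s]', '', text)   # A-z also spans [ \ ] ^ _ `
strict = re.sub(r'[^a-zA-Z0-9\s]', '', text)  # letters, digits and whitespace only

print(loose)   # snake_case [note] ^up 100
print(strict)  # snakecase note up 100
```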

**Text stemming**

In [17]:
# Define a function for text stemming
def simple_stemmer(text):
    ps=PorterStemmer()
    text= ' '.join([ps.stem(word) for word in text.split()])
    return text
# Apply the function to the review column
df2['review']=df2['review'].apply(simple_stemmer)

**Remove stopwords**

In [18]:
# Load the English stopword set
stop=set(stopwords.words('english'))
print(stop)

# Define a function that removes stopwords
def remove_stopwords(text, is_lower_case=False):
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopword_list]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopword_list]
    filtered_text = ' '.join(filtered_tokens)
    return filtered_text

# Apply the function to the review column
df2['review']=df2['review'].apply(remove_stopwords)

{'above', 'by', 'into', 'under', 'a', 'both', "aren't", 'didn', "that'll", 'hasn', 'is', 'does', "didn't", "mustn't", 'at', 'then', 'why', 'over', 'when', 'doing', 'which', 'here', "she's", 'whom', 'through', "doesn't", 'wasn', 'again', 'her', "haven't", "shouldn't", 'has', 'your', "it's", 'my', 'those', 'you', "you'll", 'hers', 'the', 'nor', 'on', 'no', 'o', 'down', 'm', 'as', 'ain', "you'd", 'yours', 'other', 's', 'herself', "shan't", 'did', 'aren', 'now', 've', 'are', "wasn't", 'ma', "mightn't", 'will', 'what', 'we', 'during', 'out', 'each', 'ourselves', 'do', 'he', 'their', 'own', 'being', "don't", 'don', 'between', 'our', 'below', "won't", 'how', 'himself', 'who', 'if', 'him', 'themselves', 'where', 'too', 't', 'haven', 'wouldn', 'hadn', "weren't", 'further', 'theirs', 'until', 'itself', 'y', 'yourselves', "wouldn't", 'because', 'an', 'it', 'its', 'for', "isn't", 'i', 'been', 'or', "hasn't", 'once', 'there', 'weren', 'shan', 'can', 'am', 'shouldn', 'to', 'couldn', 'needn', 'only',
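
The processed reviews printed further down still contain tokens such as `thi`, `wa` and `veri`. A likely reason is the order of the steps: stemming ran before stopword removal, so `this`, `was` and `very` had already been changed into forms that are not in the stopword list. The sketch below compares both orders on one sentence. It assumes a tiny hand-written stopword set (`stopwords_demo`) so that it runs without downloading any NLTK data:

```python
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()
stopwords_demo = {"this", "was", "very"}  # hypothetical mini stopword list
text = "this movie was very good"

# Stem first, then filter: the stemmed forms no longer match the stopword list
stem_then_filter = [w for w in (ps.stem(t) for t in text.split())
                    if w not in stopwords_demo]

# Filter first, then stem: stopwords are removed before they are altered
filter_then_stem = [ps.stem(t) for t in text.split() if t not in stopwords_demo]

print(stem_then_filter)
print(filter_then_stem)
```

Running stopword removal before stemming would keep these function words out of the vocabulary; whether that improves the classifiers here would need to be measured.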

**Splitting dataset**

In [19]:
# Train dataset
train_reviews=df2.review[:4000]
train_sentiments=df2.sentiment[:4000]

# Test dataset
test_reviews=df2.review[4000:]
test_sentiments=df2.sentiment[4000:]

print(train_reviews.shape, train_sentiments.shape)
print(test_reviews.shape, test_sentiments.shape)

(4000,) (4000,)
(1000,) (1000,)


In [20]:
norm_train_reviews=df2.review[:4000]
norm_train_reviews[0]

'probabl alltim favorit movi stori selfless sacrific dedic nobl caus preachi bore never get old despit seen 15 time last 25 year paul luka perform bring tear eye bett davi one veri truli sympathet role delight kid grandma say like dressedup midget children onli make fun watch mother slow awaken happen world roof believ startl dozen thumb theyd thi movi'

In [21]:
norm_test_reviews=df2.review[4000:]
norm_test_reviews[4500]

'ill keep thi fast sweet five girl way home footbal game decid take short cut lead desert forestridden road cours noth good thing happen safe arriv destinationalright dont soon theyr hunt derang chick ha sever mental issu ensu 90 minut sheer boredomi hope never see ani actor ani movi ever scream screech voic gave headach script wa poorli written includ lot repeat phrase nonsens hyster scream one worst cheap horror flick ive ever seenand ive seen lot'

# Data Modelling

In text analysis, two representations are commonly used: \
**Bag of Words** & **TFIDF**

**Bag of Words** converts each document into a numeric vector of term counts.

**TFIDF** (Term Frequency-Inverse Document Frequency), by contrast, weights each term count by how rare the term is across the documents. In both cases, stacking the document vectors gives a document-term matrix.

For more detail, see [this page](https://www.analyticsvidhya.com/blog/2020/02/quick-introduction-bag-of-words-bow-tf-idf/)
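
To make the two representations concrete, here is a small pure-Python sketch on a made-up three-document corpus. It uses the textbook weighting `count * log(N / df)`; scikit-learn's `TfidfVectorizer` applies a smoothed IDF and L2 normalisation by default, so its numbers will differ from these:

```python
import math
from collections import Counter

docs = ["good movie", "bad movie", "good good film"]
tokenized = [d.split() for d in docs]
vocab = sorted({t for doc in tokenized for t in doc})

# Bag of Words: one row of raw term counts per document
bow = [[Counter(doc)[term] for term in vocab] for doc in tokenized]

# Document frequency: number of documents that contain each term
df = {term: sum(term in doc for doc in tokenized) for term in vocab}

# Textbook TF-IDF: count * log(N / df); rarer terms receive larger weights
n_docs = len(docs)
tfidf = [[count * math.log(n_docs / df[term]) for term, count in zip(vocab, row)]
         for row in bow]

print(vocab)  # ['bad', 'film', 'good', 'movie']
print(bow)    # [[0, 0, 1, 1], [1, 0, 0, 1], [0, 1, 2, 0]]
print([round(w, 3) for w in tfidf[2]])
```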

**Import libraries for modelling:**

In [22]:
from sklearn.linear_model import LogisticRegression,SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelBinarizer
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score

**Bag of Words Model**

In [24]:
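# Count vectorizer for the Bag of Words model
# min_df=1 keeps every term that occurs at least once
# Note: max_df=1 is an integer, so scikit-learn reads it as an absolute document count
# and drops every term found in more than one review (max_df=1.0 would keep all terms)
cv = CountVectorizer(min_df=1,max_df=1,binary=False,ngram_range=(1,3))

# Fit on and transform the train reviews
cv_train_reviews = cv.fit_transform(norm_train_reviews)

# Transform the test reviews
cv_test_reviews = cv.transform(norm_test_reviews)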

print('BOW_cv_train:',cv_train_reviews.shape)
print('BOW_cv_test:',cv_test_reviews.shape)

BOW_cv_train: (4000, 789861)
BOW_cv_test: (1000, 789861)
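
The `max_df` argument behaves differently for integers and floats: an integer is an absolute document count, a float is a proportion of the documents. With `max_df=1`, any term that appears in more than one document is discarded, which leaves only terms unique to a single review. A small sketch on a made-up corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["good movie", "bad movie", "great film"]

# Integer max_df: absolute document count, so "movie" (in 2 documents) is dropped
vocab_int = sorted(CountVectorizer(max_df=1).fit(docs).vocabulary_)

# Float max_df: proportion of documents, so 1.0 keeps every term
vocab_float = sorted(CountVectorizer(max_df=1.0).fit(docs).vocabulary_)

print(vocab_int)    # ['bad', 'film', 'good', 'great']
print(vocab_float)  # ['bad', 'film', 'good', 'great', 'movie']
```

Since a term seen in only one training review rarely recurs in the test set, this setting may explain part of the modest accuracy reported later; that is a hypothesis, not something measured here.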


**TFIDF Model**

In [25]:
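# TFIDF vectorizer
# min_df=1 keeps every term that occurs at least once
# Note: the integer max_df=1 is an absolute document count, so terms found in more than
# one review are dropped (max_df=1.0 would keep all terms)
tv=TfidfVectorizer(min_df=1,max_df=1,use_idf=True,ngram_range=(1,3))

# Fit on and transform the train reviews
tv_train_reviews=tv.fit_transform(norm_train_reviews)

# Transform the test reviews
tv_test_reviews=tv.transform(norm_test_reviews)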
print('Tfidf_train:',tv_train_reviews.shape)
print('Tfidf_test:',tv_test_reviews.shape)

Tfidf_train: (4000, 789861)
Tfidf_test: (1000, 789861)


**Labelling Sentiment Text**

In [26]:
# Binarize the sentiment labels (negative -> 0, positive -> 1)
lb=LabelBinarizer()

# Transform the sentiment data
sentiment_data=lb.fit_transform(df2['sentiment'])
print(sentiment_data.shape)

(5000, 1)
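
`LabelBinarizer` sorts the class names, so `negative` becomes 0 and `positive` becomes 1, and for two classes it returns a single column. That `(n, 1)` column shape is what triggers the `column_or_1d` warnings in the fitting cells later on. A short sketch with three made-up labels (`lb_demo` is a separate, illustrative instance):

```python
from sklearn.preprocessing import LabelBinarizer

labels = ["positive", "negative", "positive"]

lb_demo = LabelBinarizer()
encoded = lb_demo.fit_transform(labels)  # shape (3, 1) for a two-class problem

print(lb_demo.classes_)  # classes in sorted order; position = encoded value
print(encoded.ravel())   # flattened to shape (3,), the form estimators expect for y
```

Passing the flattened array to `fit` should avoid the warning without changing the fitted model.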


**Split Sentiment Data**

In [27]:
# Split the sentiment data
train_sentiments = sentiment_data[:4000]
test_sentiments = sentiment_data[4000:]
print(train_sentiments)
print(test_sentiments)

[[1]
 [1]
 [1]
 ...
 [0]
 [0]
 [1]]
[[1]
 [0]
 [1]
 [1]
 [0]
 [1]
 ...

**Logistic Regression Model** for **Bag of Words** and **TFIDF**

In [28]:
# Set up the model
lr=LogisticRegression(penalty='l2',max_iter=500,C=1,random_state=42)

# Fit the model on the Bag of Words features
lr_bow=lr.fit(cv_train_reviews,train_sentiments)
print(lr_bow)

# Fit the model on the TFIDF features
# Note: fit() returns the estimator itself, so lr_bow and lr_tfidf are the same object
# and only the TFIDF fit remains after this call
lr_tfidf=lr.fit(tv_train_reviews,train_sentiments)
print(lr_tfidf)

  y = column_or_1d(y, warn=True)


LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=500,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=42, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)


  y = column_or_1d(y, warn=True)


LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=500,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=42, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)
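
In scikit-learn, `fit` returns the estimator it was called on, so fitting one instance twice overwrites the first result instead of producing two models. A minimal sketch of the difference, using toy one-feature data and hypothetical names (`shared`, `sep_a`, `sep_b`):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X_a = np.array([[0.0], [1.0], [2.0], [3.0]])
X_b = np.array([[3.0], [2.0], [1.0], [0.0]])  # same feature, reversed order
y = np.array([0, 0, 1, 1])

# One shared instance: the second fit replaces the first
shared = LogisticRegression()
model_a = shared.fit(X_a, y)
model_b = shared.fit(X_b, y)
same_object = model_a is model_b

# Separate instances: each keeps its own coefficients
sep_a = LogisticRegression().fit(X_a, y)
sep_b = LogisticRegression().fit(X_b, y)

print(same_object)
print(sep_a.coef_[0][0] > 0, sep_b.coef_[0][0] < 0)
```

For the comparison in this notebook, that would mean one `LogisticRegression` instance per feature set, each used to predict on its own test matrix.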


In [29]:
# Predict on the Bag of Words test features
# (lr was last fitted on the TFIDF features, so that fit is the one used here)
lr_bow_predict=lr.predict(cv_test_reviews)
print(lr_bow_predict)
# Predict on the TFIDF test features
lr_tfidf_predict=lr.predict(tv_test_reviews)
print(lr_tfidf_predict)

[1 0 0 1 0 0 0 1 1 0 1 1 1 1 1 1 1 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 1 1 1
 0 1 0 1 1 1 1 0 1 1 1 1 1 1 0 0 1 0 1 0 1 0 1 1 0 0 1 0 1 0 1 0 1 1 1 0 1
 1 1 1 0 1 1 0 1 1 1 1 0 1 0 1 1 1 0 1 1 1 0 1 0 1 1 1 0 0 1 1 1 0 0 0 1 1
 0 0 1 1 1 1 1 0 0 1 1 1 0 1 1 1 0 0 1 1 0 1 0 1 0 1 1 0 1 1 0 1 1 0 1 1 0
 1 1 1 1 1 1 0 0 1 1 1 0 0 1 0 0 0 1 0 0 0 1 1 1 0 1 0 0 1 0 0 0 0 0 1 0 1
 1 0 0 1 1 0 0 1 1 0 0 0 0 1 1 1 0 1 0 1 0 1 0 0 1 1 0 1 0 0 1 0 1 1 1 1 0
 0 0 1 1 0 1 1 0 1 0 0 0 1 1 0 0 1 0 1 0 1 1 1 0 1 0 1 0 1 0 1 0 0 0 1 1 0
 1 0 1 0 1 1 1 0 0 1 0 0 1 1 0 0 0 1 1 1 0 0 0 1 0 0 0 1 1 1 1 0 0 0 1 1 1
 1 1 0 0 0 1 1 1 0 1 1 1 1 1 1 0 1 0 1 1 1 0 1 0 1 0 1 0 1 1 0 0 0 0 1 1 0
 1 1 1 0 1 1 1 0 0 0 1 0 1 0 0 1 0 1 0 1 0 1 0 1 1 0 0 0 1 0 1 1 0 0 0 0 0
 1 0 0 1 1 0 1 1 0 0 0 0 1 1 1 1 1 1 0 1 1 0 1 0 1 0 0 0 1 0 0 1 0 0 0 1 1
 0 0 1 1 1 1 0 1 1 0 0 0 1 0 1 1 1 0 0 1 1 1 0 0 0 1 0 0 1 1 0 0 1 1 1 1 1
 1 1 0 1 1 0 1 0 0 0 0 0 1 1 1 1 0 1 0 1 1 0 0 1 0 0 0 0 1 1 0 0 0 1 1 1 0
 1 1 1 0 0 0 1 1 1 0 0 1 

**Accuracy of the Logistic Regression model**

In [30]:
# Accuracy score for Bag of Words
lr_bow_score=accuracy_score(test_sentiments,lr_bow_predict)
print("Bag of Words accuracy with Logistic Regression:",lr_bow_score)

# Accuracy score for TFIDF
lr_tfidf_score=accuracy_score(test_sentiments,lr_tfidf_predict)
print("TFIDF accuracy with Logistic Regression:",lr_tfidf_score)

Bag of Words accuracy with Logistic Regression: 0.706
TFIDF accuracy with Logistic Regression: 0.677


**Classification report of the Logistic Regression model for Bag of Words**

In [31]:
# LabelBinarizer encodes negative as 0 and positive as 1, so the names follow that order
lr_bow_report=classification_report(test_sentiments,lr_bow_predict,target_names=['Negative','Positive'])
print(lr_bow_report)

              precision    recall  f1-score   support

    Negative       0.74      0.64      0.69       507
    Positive       0.68      0.77      0.72       493

    accuracy                           0.71      1000
   macro avg       0.71      0.71      0.71      1000
weighted avg       0.71      0.71      0.70      1000
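
The f1-score column is the harmonic mean of precision and recall, so it can be checked directly against the other two columns. A short sketch using the two precision/recall pairs printed in the report above (`f1_score_from` is a hypothetical helper name, and the inputs are the rounded values from the table):

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision/recall pairs as printed in the report above
row_1 = f1_score_from(0.74, 0.64)
row_2 = f1_score_from(0.68, 0.77)

print(round(row_1, 2), round(row_2, 2))  # 0.69 0.72
```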



**Classification report of the Logistic Regression model for TFIDF**

In [32]:
# LabelBinarizer encodes negative as 0 and positive as 1, so the names follow that order
lr_tfidf_report=classification_report(test_sentiments,lr_tfidf_predict,target_names=['Negative','Positive'])
print(lr_tfidf_report)

              precision    recall  f1-score   support

    Negative       0.85      0.44      0.58       507
    Positive       0.62      0.92      0.74       493

    accuracy                           0.68      1000
   macro avg       0.73      0.68      0.66      1000
weighted avg       0.73      0.68      0.66      1000



**Multinomial Naive Bayes Model** for **Bag of Words** and **TFIDF**

In [33]:
# Set up the model
mnb=MultinomialNB()

# Fit the model on the Bag of Words features
mnb_bow=mnb.fit(cv_train_reviews,train_sentiments)
print(mnb_bow)

# Fit the model on the TFIDF features
# Note: as fit() returns the estimator itself, mnb_bow and mnb_tfidf are the same object
mnb_tfidf=mnb.fit(tv_train_reviews,train_sentiments)
print(mnb_tfidf)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)


  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
