# Sentiment Analysis using XXXX and spaCy

## Dataset: [IMDB Dataset of 50K Movie Reviews (Kaggle)](https://www.kaggle.com/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews)
## By: Amir Nejad, PhD.


In [14]:
# general
import pandas as pd
import numpy as np
import random

# visualization
import matplotlib.pyplot as plt
import seaborn as sns

# NLP & text processing
import re
import string
from bs4 import BeautifulSoup
import spacy
from spacy.tokenizer import Tokenizer
from spacy.lang.en import English


In [15]:
# Settings
# pandas
pd.set_option("display.max_columns", 250)
pd.set_option("display.max_rows", 250)
# matplotlib
plt.rcParams['figure.figsize'] = (11, 8)
plt.style.use('fivethirtyeight')
# random
SEED = 42
random.seed(SEED)
# spacy
nlp = English()

In [3]:
# Inputs

# Path of the data
PATH_DATA = '../data/IMDB/imdb-dataset-of-50k-movie-reviews.zip'
# Number of test samples
test_samples = 10000

In [4]:
# read data
df_raw = pd.read_csv(PATH_DATA, compression='zip')


In [5]:
df_raw.head()


Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [6]:
print(
    f"Number of Positive Reviews: {len(df_raw[df_raw['sentiment'] =='positive'])}")
print(
    f"Number of Negative Reviews: {len(df_raw[df_raw['sentiment'] =='negative'])}")


Number of Positive Reviews: 25000
Number of Negative Reviews: 25000


# Problem Solving Steps
* Review cleaning
* Train/test split
* Model Build
* Model Train
* QC & Evaluate
* Prediction

### Cleaning the reviews data

In [53]:
def cleaner(text):
    # strip HTML tags (e.g. <br />) from the review
    text_cleaned = BeautifulSoup(text, "html.parser").get_text()
    # remove URLs
    text_cleaned = re.sub(r'http\S+', ' ', text_cleaned)
    # remove brackets, parentheses, and braces
    text_cleaned = re.sub(r"[\([{})\]]",' ', text_cleaned)
    # replace runs of any other characters outside the allowed set with a space
    text_cleaned = re.sub(r"[^0-9a-zA-Z:!?'.,]+",' ', text_cleaned)
    return text_cleaned
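
To see what the regular-expression steps in `cleaner` do on their own, here is a minimal sketch that applies the same three patterns to a made-up sentence. The helper name `strip_noise` and the sample text are hypothetical, and the BeautifulSoup pass is deliberately left out so the snippet only needs the standard library.

```python
import re

def strip_noise(text):
    # Same three regex steps as cleaner(), without the HTML-stripping pass
    text = re.sub(r'http\S+', ' ', text)              # drop URLs
    text = re.sub(r"[\([{})\]]", ' ', text)           # drop brackets and parentheses
    text = re.sub(r"[^0-9a-zA-Z:!?'.,]+", ' ', text)  # collapse everything else to one space
    return text

# Hypothetical sample text, not taken from the dataset
sample = "Great film (9/10)! See http://example.com #mustwatch"
cleaned = strip_noise(sample)
print(cleaned)  # Great film 9 10 ! See mustwatch
```

Note that the last pattern also matches whitespace, so consecutive spaces left behind by the earlier substitutions collapse into a single space, and characters such as `/` and `#` are removed.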


In [54]:
# df is a processed version of df_raw
df = df_raw.copy()
df['review'] = df['review'].apply(cleaner)


In [55]:
index = 17
# looking at some cleaned reviews (before and after)
print("Before:  "+df_raw['review'].iloc[index] + " - " * 20+' After:  ' + df['review'].iloc[index])


Before:  This movie made it into one of my top 10 most awful movies. Horrible. <br /><br />There wasn't a continuous minute where there wasn't a fight with one monster or another. There was no chance for any character development, they were too busy running from one sword fight to another. I had no emotional attachment (except to the big bad machine that wanted to destroy them) <br /><br />Scenes were blatantly stolen from other movies, LOTR, Star Wars and Matrix. <br /><br />Examples<br /><br />>The ghost scene at the end was stolen from the final scene of the old Star Wars with Yoda, Obee One and Vader. <br /><br />>The spider machine in the beginning was exactly like Frodo being attacked by the spider in Return of the Kings. (Elijah Wood is the victim in both films) and wait......it hypnotizes (stings) its victim and wraps them up.....uh hello????<br /><br />>And the whole machine vs. humans theme WAS the Matrix..or Terminator.....<br /><br />There are more examples but why waste th

In [None]:
from nltk.probability import FreqDist
from nltk.tokenize import word_tokenize
# tokenize every cleaned review and flatten the result into one list of tokens
tokenized_word = [token
                  for review in df['review'].values
                  for token in word_tokenize(review)]


In [None]:
fdist = FreqDist(tokenized_word)
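
As a rough illustration of what the frequency distribution holds, the sketch below counts tokens with `collections.Counter`, the standard-library class that NLTK's `FreqDist` builds on. The toy reviews are hypothetical and `str.split()` is only a simple stand-in for `word_tokenize`.

```python
from collections import Counter

# Hypothetical toy corpus standing in for df['review']
reviews = ["a good movie", "a bad movie", "good good fun"]

counts = Counter()
for review in reviews:
    # str.split() is a simple stand-in for nltk's word_tokenize
    counts.update(review.split())

# Most frequent token and its count
top_token, top_count = counts.most_common(1)[0]
print(top_token, top_count)  # good 3
```

Updating the counter one review at a time is one way to avoid holding every token of the corpus in a single array, which may matter with 50,000 reviews.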


In [None]:
def train_test_split(df, test_samples):

    # this function splits df into train/test sets, drawing the same number
    # of positive and negative reviews for the test data
    pos_index = list(df[df['sentiment'] == 'positive'].index)
    neg_index = list(df[df['sentiment'] == 'negative'].index)
    n = test_samples // 2
    i1 = random.sample(pos_index, n)
    i2 = random.sample(neg_index, n)
    test_index = i1 + i2
    # training df: every row that was not drawn for the test set
    # (drop avoids a slow per-row membership check against test_index)
    train_df = df.drop(index=test_index)
    # testing df
    test_df = df.loc[test_index, :]
    return train_df, test_df
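
The idea behind the balanced split can be checked on a toy example without pandas. This is only a sketch: the `labels` list and the sizes are hypothetical, and it mirrors the sampling logic rather than the DataFrame handling.

```python
import random

random.seed(42)

# Hypothetical toy labels: list position plays the role of the DataFrame index
labels = ["positive"] * 6 + ["negative"] * 6
test_samples = 4

pos_index = [i for i, label in enumerate(labels) if label == "positive"]
neg_index = [i for i, label in enumerate(labels) if label == "negative"]

# Draw the same number of test rows from each class
n = test_samples // 2
test_index = set(random.sample(pos_index, n)) | set(random.sample(neg_index, n))
# Everything that was not drawn goes to the training set
train_index = [i for i in range(len(labels)) if i not in test_index]

test_labels = [labels[i] for i in sorted(test_index)]
print(len(train_index), test_labels.count("positive"), test_labels.count("negative"))  # 8 2 2
```

Whatever the seed, the test set holds `test_samples // 2` rows per class and the train and test indices never overlap; only which rows are drawn changes.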


In [None]:
# split the dataframes
train_df, test_df = train_test_split(df, test_samples)

In [None]:
# check the shape of the dataframes
(train_df.shape, test_df.shape)