### Notebook part 1
# Data Cleaning
_By **Avi Patel**_

## Overview
This project builds a model that classifies the sentiment of a given movie review as positive or negative. Such a model can help a movie production house analyze viewer reviews from various social networking platforms.



## Business Problem
A movie production house can get the average user rating of its movies from websites such as IMDb, Google Reviews, and Yelp. Gauging overall sentiment and word of mouth on social networking sites is harder, because reviews there are free text rather than numeric ratings. Machine learning models that analyze the sentiment of a text review can help solve this problem.

In [1]:
import pandas as pd
import numpy as np
from pandas import option_context
import contractions
import unicodedata
import re

from textblob import TextBlob

#BeautifulSoup
from bs4 import BeautifulSoup

#NLTK
import nltk  
#nltk.download()
from nltk.tokenize import RegexpTokenizer


#spaCy
import spacy
from spacy.lang.en.stop_words import STOP_WORDS
spacy_nlp = spacy.load("en_core_web_lg")

# warnings
import warnings
warnings.simplefilter(action='ignore', category=Warning)

In [2]:
df = pd.read_csv('datasets/movie_reviews.csv')

In [3]:
# Drop the leftover index column written by an earlier to_csv call.
df.drop(columns=['Unnamed: 0'], inplace=True)

In [4]:
with option_context('display.max_colwidth',250):
    display(df.head())

Unnamed: 0,review,sentiment
0,"Hakuna Matata. What a wonderful phrase. Hakuna Matata. Ain't no passing phrase. The Lion King (2019) is the #1 Movie in the World. It brings a whole joy of magic of Disney and wonders, all over again. The Lion King is one of the best instant clas...",positive
1,"Obviously a second attempt at the Hugh Glass story as portrayed by Richard Harris in Man in the Wilderness (1971). But technology has come far in the last 44 years and the cinematography is breathtaking. The technology to re-create the ""big scene...",positive
2,"If some studio head, or person who was ivolved with this film ever reads this, I must apologize for my harsh comments which will ensue. I know that it is very difficult and time consuming to make a movie, whether it's good or bad. It demands time...",negative
3,"Rocknrolla rocked me big time. I fall for everything that has to do with mafias and organized crime. This time, it goes further and displays the underworld of well, everything. Mafias are no longer exclusive for narcotics, it also deals with busi...",positive
4,"Finally, harry potter has a better movie than the book",positive


In [5]:
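## functions

def count_word(text):
    # Number of space-separated chunks; consecutive spaces produce empty chunks.
    return len(text.split(" "))

def fix_accented_char(x):
    # Decompose accented characters (NFKD) and drop the non-ASCII remainder.
    x = unicodedata.normalize('NFKD', x).encode('ascii', 'ignore').decode('utf-8','ignore')
    return x

def lemmat(text):
    # Replace each token with its spaCy lemma.
    doc = spacy_nlp(text)
    text = " ".join([token.lemma_ for token in doc])
    return text

def removeWords(text, listOfWords):
    # Drop whole tokens that appear in listOfWords (substrings of longer words are kept).
    words = set(listOfWords)
    return " ".join(w for w in text.split() if w not in words)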

### Count Total Words In Review
___

In [6]:
df['total_words'] = df['review'].apply(lambda x: count_word(x))
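
A quick sketch of how `count_word` behaves at the edges, since it splits on a single space rather than on any run of whitespace. The sample strings are made up for illustration.

```python
def count_word(text):
    # Same logic as the helper above: split on a single space character.
    return len(text.split(" "))

samples = ["", "great movie", "great  movie", " great movie "]

for s in samples:
    # split(" ") keeps empty strings, so blank or padded text is over-counted;
    # split() with no argument splits on runs of whitespace instead.
    print(repr(s), count_word(s), len(s.split()))
```

An empty string still counts as one "word" here, which is one way a minimum of 1 can appear in `total_words`.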

In [7]:
with option_context('display.max_colwidth',250):
    display(df.head())

Unnamed: 0,review,sentiment,total_words
0,"Hakuna Matata. What a wonderful phrase. Hakuna Matata. Ain't no passing phrase. The Lion King (2019) is the #1 Movie in the World. It brings a whole joy of magic of Disney and wonders, all over again. The Lion King is one of the best instant clas...",positive,114
1,"Obviously a second attempt at the Hugh Glass story as portrayed by Richard Harris in Man in the Wilderness (1971). But technology has come far in the last 44 years and the cinematography is breathtaking. The technology to re-create the ""big scene...",positive,70
2,"If some studio head, or person who was ivolved with this film ever reads this, I must apologize for my harsh comments which will ensue. I know that it is very difficult and time consuming to make a movie, whether it's good or bad. It demands time...",negative,290
3,"Rocknrolla rocked me big time. I fall for everything that has to do with mafias and organized crime. This time, it goes further and displays the underworld of well, everything. Mafias are no longer exclusive for narcotics, it also deals with busi...",positive,131
4,"Finally, harry potter has a better movie than the book",positive,10


In [8]:
df.describe()

Unnamed: 0,total_words
count,253039.0
mean,217.98882
std,189.752911
min,1.0
25%,107.0
50%,159.0
75%,270.0
max,2654.0


#### Observation
___
- The shortest review has just 1 word and the longest has 2654, which inflates the standard deviation. Most reviews fall between roughly 100 and 300 words (the interquartile range is 107 to 270).

In [9]:
max_words_index = df[(df.total_words > 820)].index
#df.drop(index=max_words_index, inplace=True)

In [10]:
min_words_index = df[(df.total_words < 20)].index
#df.drop(index=min_words_index, inplace=True)

### Removing Extra White Spaces
___

In [11]:
df.review = df.review.apply(lambda x: x.strip())
df.review = df.review.apply(lambda x: re.sub(' +', ' ', x))
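
The pattern `' +'` collapses runs of spaces only. If reviews may also contain tabs or newlines (an assumption, since the full dataset is not shown), a sketch using `\s+` handles all whitespace in one pass. The helper name `normalize_whitespace` is hypothetical.

```python
import re

def normalize_whitespace(text):
    # Collapse any run of whitespace (spaces, tabs, newlines) into one space,
    # then trim both ends.
    return re.sub(r"\s+", " ", text).strip()

print(normalize_whitespace("  A   great\tmovie.\n\nLoved it.  "))
```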

### Removing Digits
___

In [12]:
df.review = df.review.apply(lambda x: ''.join([i for i in x if not i.isdigit()]))
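
A closely matching sketch with a regular expression, which avoids building a list of characters for every review. Removing digits can leave residue such as empty parentheses, which the later punctuation step cleans up. `remove_digits` is a hypothetical name, and `\d` may differ from `str.isdigit` on a few unusual Unicode characters.

```python
import re

def remove_digits(text):
    # Drop every run of decimal digits; surrounding punctuation is left untouched.
    return re.sub(r"\d+", "", text)

print(remove_digits("The Lion King (2019) is the #1 movie"))
```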

### Removing HTML Tags
___

In [13]:
df.review = df.review.apply(lambda x: BeautifulSoup(x, 'lxml').get_text())
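
For illustration only, the sketch below uses Python's built-in `html.parser` instead of BeautifulSoup to show what tag stripping does to a review. It joins the text fragments with a space so that words on either side of a tag such as `<br />` do not run together. The class and function names are hypothetical.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_html(text):
    parser = TextExtractor()
    parser.feed(text)
    parser.close()
    # Join fragments with a space, then collapse repeated whitespace.
    return " ".join(" ".join(parser.parts).split())

print(strip_html("Great movie!<br /><br />Loved it."))
```

BeautifulSoup's `get_text` accepts a `separator` argument that serves the same purpose.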

### Contraction to Expansion
___

In [14]:
df.review = df.review.apply(lambda x: contractions.fix(x))

### Removing Punctuation and Special Characters
___

In [15]:
only_words_tokenizer = RegexpTokenizer(r'\w+')

In [16]:
df.review = df.review.apply(lambda x: " ".join(only_words_tokenizer.tokenize(x)))
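
What the `\w+` pattern keeps and drops can be checked with the standard `re` module. Here `re.findall` with the same pattern is a stand-in for the NLTK tokenizer, and the sample sentence is made up.

```python
import re

def keep_word_chars(text):
    # \w matches letters, digits and underscore, so punctuation acts as a separator.
    return " ".join(re.findall(r"\w+", text))

print(keep_word_chars("It's a well-made, feel-good film!"))
```

Apostrophes and hyphens split tokens, which is a good reason to expand contractions before this step, as the notebook does.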

### Removing Accented Characters
___

In [17]:
df.review = df.review.apply(lambda x: fix_accented_char(x))
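
A small check of what `fix_accented_char` does, using a made-up phrase. NFKD decomposition separates base letters from their accents, and the ASCII round trip then drops the accents.

```python
import unicodedata

def fix_accented_char(x):
    # Decompose characters, then drop anything that is not plain ASCII.
    return unicodedata.normalize("NFKD", x).encode("ascii", "ignore").decode("utf-8", "ignore")

print(fix_accented_char("Amélie at the café"))
```

Characters that have no ASCII decomposition, such as `ß`, are removed entirely rather than transliterated.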

### Normalizing Case
___

In [18]:
df.review = df.review.apply(lambda x: x.lower())

### Removing Stop Words

In [19]:
df.review = df.review.apply(lambda x: ' '.join([w for w in x.split() if w not in STOP_WORDS]))
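
The filtering logic can be tried without loading spaCy by using a small stand-in set. The set below is an assumption for illustration and is far shorter than spaCy's `STOP_WORDS`; the helper name is hypothetical.

```python
# Stand-in stop list (hypothetical); spaCy's STOP_WORDS is much larger.
stop_words = {"the", "a", "is", "was", "this", "it", "and", "not"}

def remove_stop_words(text, stop_words):
    # Keep only the tokens that are not in the stop list.
    return " ".join(w for w in text.split() if w not in stop_words)

review = "this movie was not good and the ending is weak"
print(remove_stop_words(review, stop_words))
# Keeping negations: subtract them from the stop list before filtering.
print(remove_stop_words(review, stop_words - {"not"}))
```

When a stop list contains negations, "not good" is reduced to "good". Subtracting those words from the set first is one option if that distinction matters for sentiment.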

In [20]:
df['total_words'] = df['review'].apply(lambda x: count_word(x))

In [21]:
with option_context('display.max_colwidth',250):
    display(df.head())

Unnamed: 0,review,sentiment,total_words
0,hakuna matata wonderful phrase hakuna matata passing phrase lion king movie world brings joy magic disney wonders lion king best instant classic disney remakes generation remember king pride rock beautiful live action film like original exact lin...,positive,51
1,obviously second attempt hugh glass story portrayed richard harris man wilderness technology come far years cinematography breathtaking technology create big scene awe inspiring able watch twice let percolate watch years hold dancing wolves,positive,33
2,studio head person ivolved film reads apologize harsh comments ensue know difficult time consuming movie good bad demands time money care think combined efforts hundreds individuals despite effort goes production film like films fall aforemention...,negative,110
3,rocknrolla rocked big time fall mafias organized crime time goes displays underworld mafias longer exclusive narcotics deals business men bets clothing art etc events fast paced told eyes local robbers peculiar group plenty action think best feat...,positive,63
4,finally harry potter better movie book,positive,6


### Lemmatizing
___

In [22]:
df.review = df.review.apply(lambda x: lemmat(x))

In [23]:
with option_context('display.max_colwidth',250):
    display(df.head())

Unnamed: 0,review,sentiment,total_words
0,hakuna matata wonderful phrase hakuna matata pass phrase lion king movie world bring joy magic disney wonder lion king well instant classic disney remake generation remember king pride rock beautiful live action film like original exact line scen...,positive,51
1,obviously second attempt hugh glass story portray richard harris man wilderness technology come far year cinematography breathtake technology create big scene awe inspire able watch twice let percolate watch year hold dance wolf,positive,33
2,studio head person ivolve film read apologize harsh comment ensue know difficult time consume movie good bad demand time money care think combine effort hundred individual despite effort go production film like film fall aforementioned bad catego...,negative,110
3,rocknrolla rock big time fall mafia organize crime time go display underworld mafia long exclusive narcotic deal business man bet clothing art etc event fast pace tell eye local robber peculiar group plenty action think good feature guy ritchie s...,positive,63
4,finally harry potter well movie book,positive,6


### Removing Empty Reviews
___

In [24]:
empty = df[df.review == ''].index

In [25]:
df.drop(index=empty, inplace=True)

### Removing Reviews with Fewer than 5 Words

In [26]:
min_words_index = df[(df.total_words < 5)].index
df.drop(index=min_words_index, inplace=True)

In [27]:
df.drop(columns='total_words', inplace=True)

### Rare and Common Words

In [28]:
text = " ".join(df.review)
text = text.split()
freq_word = pd.Series(text).value_counts()

In [29]:
top_freq = list(freq_word.sort_values(ascending=False)[:100].to_dict().keys())

In [30]:
bottom_freq = list(freq_word.sort_values()[:160000].to_dict().keys())
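
The same kind of frequency table can be sketched with `collections.Counter` on a toy corpus. The reviews and the cutoffs (top 2, count of 1) are made up; the notebook itself takes the top 100 and bottom 160000 words.

```python
from collections import Counter

reviews = [
    "film good story good",
    "film bad plot",
    "good film great cast",
]

# Count every token across all reviews.
freq = Counter(" ".join(reviews).split())

# Most frequent words, highest count first.
top_freq = [word for word, _ in freq.most_common(2)]
# Words that occur only once in the whole corpus.
rare = sorted(word for word, count in freq.items() if count == 1)

print(top_freq)
print(rare)
```

Lists like these could then be passed to a helper such as `removeWords` to strip very common or very rare tokens from each review.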

In [31]:
df['total_words'] = df['review'].apply(lambda x: count_word(x))

In [32]:
with option_context('display.max_colwidth',250):
    display(df.head())

Unnamed: 0,review,sentiment,total_words
0,hakuna matata wonderful phrase hakuna matata pass phrase lion king movie world bring joy magic disney wonder lion king well instant classic disney remake generation remember king pride rock beautiful live action film like original exact line scen...,positive,51
1,obviously second attempt hugh glass story portray richard harris man wilderness technology come far year cinematography breathtake technology create big scene awe inspire able watch twice let percolate watch year hold dance wolf,positive,33
2,studio head person ivolve film read apologize harsh comment ensue know difficult time consume movie good bad demand time money care think combine effort hundred individual despite effort go production film like film fall aforementioned bad catego...,negative,110
3,rocknrolla rock big time fall mafia organize crime time go display underworld mafia long exclusive narcotic deal business man bet clothing art etc event fast pace tell eye local robber peculiar group plenty action think good feature guy ritchie s...,positive,63
4,finally harry potter well movie book,positive,6


### Renaming Labels
___

In [33]:
df.sentiment = [1 if sentiment =='positive' else 0 for sentiment in df.sentiment]
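
Note that the `else 0` branch maps every value other than `'positive'` to 0, including typos or missing labels. A hedged sketch of a stricter mapping, shown on a plain list with a hypothetical `encode_labels` helper:

```python
LABELS = {"positive": 1, "negative": 0}

def encode_labels(labels):
    # Fail loudly on anything that is not a known label.
    unknown = set(labels) - set(LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(map(str, unknown))}")
    return [LABELS[label] for label in labels]

print(encode_labels(["positive", "negative", "positive"]))
```

In pandas, `Series.map` with the same dictionary gives a comparable result and leaves unmapped values as `NaN`.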

In [34]:
df.dropna(subset=['review'], inplace=True)
df.shape

In [35]:
df.to_csv('datasets/filtered.csv',index=False)

## Next Step

In the next step we will visualize the text reviews.

That step can be found in the `data-viz.ipynb` notebook.