# Document Review Sentiment Analysis On Movie Review Data

## Preprocessing module: performs the following preprocessing steps

#### 1. Emoji replacement
#### 2. Removal of HTML tags
#### 3. Removal of URLs
#### 4. Removal of contractions
#### 5. Removal of special characters
#### 6. Removal of numbers
#### 7. Removal of duplicate words
#### 8. Removal of stopwords
#### 9. Stemming (implemented as lemmatization)
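
Because every step maps a string to a string, the steps can also be chained as one list of functions. The sketch below is only an illustration using the standard library; `to_lower`, `strip_tags`, `strip_digits` and `apply_pipeline` are hypothetical names and simplified stand-ins, not the functions defined later in this notebook.

```python
import re
from functools import reduce


def to_lower(text):
    """Lowercase the text."""
    return text.lower()


def strip_tags(text):
    """Replace anything that looks like an HTML tag with a space."""
    return re.sub(r'<[^>]+>', ' ', text)


def strip_digits(text):
    """Drop runs of digits and trim surrounding whitespace."""
    return re.sub(r'\d+', '', text).strip()


def apply_pipeline(text, steps):
    """Feed the text through each step in order."""
    return reduce(lambda acc, step: step(acc), steps, text)


print(apply_pipeline("Great<br />Movie 10", [to_lower, strip_tags, strip_digits]))
```

Keeping the order in a single list makes it easy to see, and to change, which step runs before which.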

In [7]:
import pandas as pd
import re
import os
from nltk.corpus import stopwords, words
from nltk.stem import WordNetLemmatizer
import string
from urllib.parse import urlparse

# Load Dataset

In [9]:
movie_dataset = pd.read_csv(r"C:\Users\karth\OneDrive\Desktop\Sentiment_Analysis\Dataset.csv")


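# Replacing emojis and converting the text to lower case

Each review is lowercased, tokens that start with `@` are stripped, and every emoticon found in the text is replaced by a descriptive word (for example, `:)` becomes `smile`).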

In [10]:
# Keys are written in lowercase because clean_data() lowercases the text
# before looking for emoticons; duplicate ':x' entries are listed once.
emojis = {':)': 'smile', ':-)': 'smile', ';d': 'wink', ':-e': 'vampire', ':(': 'sad', ':-(': 'sad', ':-<': 'sad',
          ':p': 'raspberry', ':o': 'surprised', ':-@': 'shocked', ':@': 'shocked', ':-$': 'confused', ':\\': 'annoyed',
          ':#': 'mute', ':x': 'mute', ':^)': 'smile', ':-&': 'confused', '$_$': 'greedy',
          '@@': 'eyeroll', ':-!': 'confused', ':d': 'smile', ':-0': 'yell', '0.o': 'confused', '<(-_-)>': 'robot',
          'd[-_-]b': 'dj', ":'-)": 'sadsmile', ';)': 'wink', ';-)': 'wink', '0:-)': 'angel', '0*-)': 'angel',
          '(:-d': 'gossip', '=^.^=': 'cat'}


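def clean_data(data):
    # Normalise to a lowercase string (non-string values such as NaN become text).
    data = str(data).lower()
    # Drop @mentions: an '@', the non-space characters after it, and one trailing space.
    data = re.sub(r"@\S+ ", r'', data)

    # Replace every emoticon with its descriptive word.
    for emoji, word in emojis.items():
        data = data.replace(emoji, word)

    return data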
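
One thing to watch in the loop above: replacements run in dictionary order, so a short emoticon such as `:-)` is replaced before a longer one that contains it (`0:-)` turns into `0smile` rather than `angel`). A possible alternative, sketched here with a small subset of the table, is a single regular-expression pass. `EMOTICONS` and `replace_emoticons` are hypothetical names, and padding the word with spaces is an assumption, not something the original function does.

```python
import re

# Small illustrative subset of the emoticon table (keys already lowercase).
EMOTICONS = {':)': 'smile', ':-)': 'smile', '0:-)': 'angel', ':(': 'sad'}

# One alternation over all keys, longest first, so the longest emoticon wins
# at any position. A single pass also means an earlier replacement can never
# corrupt a later key.
_pattern = re.compile(
    '|'.join(re.escape(key) for key in sorted(EMOTICONS, key=len, reverse=True))
)


def replace_emoticons(text):
    """Lowercase the text and replace each emoticon with its word, space-padded."""
    return _pattern.sub(lambda m: ' ' + EMOTICONS[m.group(0)] + ' ', text.lower())


print(replace_emoticons("Loved it 0:-)").split())
```

The padding keeps the replacement word from fusing with a neighbouring word, as in `great:)`.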

In [11]:
movie_dataset['review'] = movie_dataset['review'].apply(clean_data)


In [12]:
print(movie_dataset['review'][0])

one of the other reviewers has mentioned that after watching just 1 oz episode you'll be hooked. they are right, as this is exactly what happened with me.<br /><br />the first thing that struck me about oz was its brutality and unflinching scenes of violence, which set in right from the word go. trust me, this is not a show for the faint hearted or timid. this show pulls no punches with regards to drugs, sex or violence. its is hardcore, in the classic use of the word.<br /><br />it is called oz as that is the nickname given to the oswald maximum security state penitentary. it focuses mainly on emerald city, an experimental section of the prison where all the cells have glass fronts and face inwards, so privacy is not high on the agenda. em city is home to many..aryans, muslims, gangstas, latinos, christians, italians, irish and more....so scuffles, death stares, dodgy dealings and shady agreements are never far away.<br /><br />i would say the main appeal of the show is due to the fac

# Removal Of HTML Tags

In [13]:

def remove_html(review):
    regex = re.compile(r'<[^>]+>')
    return regex.sub('', review)
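
Note that substituting an empty string joins the words on either side of a tag, so `me.<br /><br />the` becomes `me.the`. If that matters for later tokenisation, one option is to substitute a space and then collapse repeated whitespace. This is a minimal sketch; `remove_html_keep_spacing` is a hypothetical name and not part of the original notebook.

```python
import re

_TAG = re.compile(r'<[^>]+>')
_SPACES = re.compile(r'\s+')


def remove_html_keep_spacing(review):
    """Replace each tag with a space, then collapse runs of whitespace."""
    without_tags = _TAG.sub(' ', review)
    return _SPACES.sub(' ', without_tags).strip()


print(remove_html_keep_spacing("happened with me.<br /><br />the first thing"))
```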


In [14]:
movie_dataset['review'] = movie_dataset['review'].apply(remove_html)
print(movie_dataset['review'])

0        one of the other reviewers has mentioned that ...
1        a wonderful little production. the filming tec...
2        i thought this was a wonderful way to spend ti...
3        basically there's a family where a little boy ...
4        petter mattei's "love in the time of money" is...
                               ...                        
49995    i thought this movie did a down right good job...
49996    bad plot, bad dialogue, bad acting, idiotic di...
49997    i am a catholic taught in parochial elementary...
49998    i'm going to have to disagree with the previou...
49999    no one expects the star trek movies to be high...
Name: review, Length: 50000, dtype: object


# Removal Of URLs

In [15]:
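def remove_url(review):
    # Keep only the whitespace-separated tokens that urlparse does not
    # recognise as having a URL scheme (such as 'http' or 'https').
    tokens = [token for token in review.split() if not urlparse(token).scheme]
    return ' '.join(tokens)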
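
`urlparse` reports a scheme for any token shaped like `word:rest`, so a token such as `spoiler:ending` would be dropped as well. A stricter check, assuming only `http`, `https`, `ftp` and `www.` prefixes should count as URLs, might look like the sketch below. `URL_SCHEMES`, `is_url` and `remove_url_strict` are hypothetical names.

```python
from urllib.parse import urlparse

URL_SCHEMES = {'http', 'https', 'ftp'}  # assumed set of schemes to treat as URLs


def is_url(token):
    """Return True when the token looks like a web address."""
    try:
        scheme = urlparse(token).scheme
    except ValueError:
        # urlparse rejects some malformed tokens; treat those as ordinary text.
        return False
    return scheme in URL_SCHEMES or token.startswith('www.')


def remove_url_strict(review):
    """Drop tokens that look like URLs and keep everything else."""
    return ' '.join(token for token in review.split() if not is_url(token))


print(remove_url_strict("see spoiler:ending at https://example.com/x now"))
```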

In [16]:
movie_dataset['review'] = movie_dataset['review'].apply(remove_url)
print(movie_dataset['review'])

0        one of the other reviewers has mentioned that ...
1        a wonderful little production. the filming tec...
2        i thought this was a wonderful way to spend ti...
3        basically there's a family where a little boy ...
4        petter mattei's "love in the time of money" is...
                               ...                        
49995    i thought this movie did a down right good job...
49996    bad plot, bad dialogue, bad acting, idiotic di...
49997    i am a catholic taught in parochial elementary...
49998    i'm going to have to disagree with the previou...
49999    no one expects the star trek movies to be high...
Name: review, Length: 50000, dtype: object


# Removal Of Contractions

Contractions are expanded to their full forms, for example "there's" becomes "there is".

In [17]:
import contractions

def remove_contractions(text):
    return contractions.fix(text)

   
movie_dataset['review'] = movie_dataset['review'].apply(remove_contractions)


In [18]:
movie_dataset['review']

0        one of the other reviewers has mentioned that ...
1        a wonderful little production. the filming tec...
2        i thought this was a wonderful way to spend ti...
3        basically there is a family where a little boy...
4        petter mattei's "love in the time of money" is...
                               ...                        
49995    i thought this movie did a down right good job...
49996    bad plot, bad dialogue, bad acting, idiotic di...
49997    i am a catholic taught in parochial elementary...
49998    i am going to have to disagree with the previo...
49999    no one expects the star trek movies to be high...
Name: review, Length: 50000, dtype: object

# Removal Of Special Characters

In [19]:
def remove_special_characters(review):
    # Delete every character listed in string.punctuation.
    return review.translate(str.maketrans('', '', string.punctuation))
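
Deleting punctuation outright fuses words that were separated only by punctuation, for example `many..aryans` becomes `manyaryans`. A possible variant maps each punctuation character to a space and then collapses whitespace. `replace_special_characters` is a hypothetical name used for this sketch only.

```python
import string

# Translation table that turns every punctuation character into a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, ' ' * len(string.punctuation))


def replace_special_characters(review):
    """Swap punctuation for spaces, then collapse repeated whitespace."""
    return ' '.join(review.translate(_PUNCT_TO_SPACE).split())


print(replace_special_characters("home to many..aryans, muslims"))
```

The trade-off is that apostrophes are split too, so `mattei's` would become `mattei s` instead of `matteis`.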


In [20]:
movie_dataset['review'] = movie_dataset['review'].apply(remove_special_characters)

In [21]:
movie_dataset['review']

0        one of the other reviewers has mentioned that ...
1        a wonderful little production the filming tech...
2        i thought this was a wonderful way to spend ti...
3        basically there is a family where a little boy...
4        petter matteis love in the time of money is a ...
                               ...                        
49995    i thought this movie did a down right good job...
49996    bad plot bad dialogue bad acting idiotic direc...
49997    i am a catholic taught in parochial elementary...
49998    i am going to have to disagree with the previo...
49999    no one expects the star trek movies to be high...
Name: review, Length: 50000, dtype: object

# Removal Of Numbers

In [24]:
def remove_digits(text):
    # Raw string so that \d is passed to the regex engine unchanged.
    return re.sub(r'\d+', '', text)
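
Removing a standalone number leaves two adjacent spaces behind ("just 1 oz" becomes "just  oz"), which later produces empty tokens when the text is split on a single space. A small sketch that also squeezes whitespace is shown below; `remove_digits_and_squeeze` is a hypothetical name.

```python
import re


def remove_digits_and_squeeze(text):
    """Remove runs of digits, then collapse the whitespace left behind."""
    without_digits = re.sub(r'\d+', '', text)
    return re.sub(r'\s+', ' ', without_digits).strip()


print(remove_digits_and_squeeze("watching just 1 oz episode"))
```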



In [25]:
movie_dataset['review'] = movie_dataset['review'].apply(remove_digits)
movie_dataset['review']

0        one of the other reviewers has mentioned that ...
1        a wonderful little production the filming tech...
2        i thought this was a wonderful way to spend ti...
3        basically there is a family where a little boy...
4        petter matteis love in the time of money is a ...
                               ...                        
49995    i thought this movie did a down right good job...
49996    bad plot bad dialogue bad acting idiotic direc...
49997    i am a catholic taught in parochial elementary...
49998    i am going to have to disagree with the previo...
49999    no one expects the star trek movies to be high...
Name: review, Length: 50000, dtype: object

# Removal Of Duplicate Words

In [26]:
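from collections import Counter

def remove_duplicates(review):
    # Counter keeps keys in first-seen order (Python 3.7+), so each word
    # survives once, at the position of its first occurrence.
    word_list = review.split(" ")
    unique_words = Counter(word_list)
    return " ".join(unique_words.keys())

movie_dataset['review'] = movie_dataset['review'].apply(remove_duplicates)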
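
The same order-preserving de-duplication can be written without counting, since only the keys are used. The sketch below relies on `dict.fromkeys` and on `split()` without arguments, which also discards empty tokens; `remove_duplicates_ordered` is a hypothetical name.

```python
def remove_duplicates_ordered(review):
    """Keep the first occurrence of each word, preserving the original order."""
    return ' '.join(dict.fromkeys(review.split()))


print(remove_duplicates_ordered("bad plot bad dialogue bad acting"))
```

Keep in mind that this step discards word frequencies, so a model trained on the result sees only which words occur, not how often.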

In [27]:
movie_dataset['review']

0        one of the other reviewers has mentioned that ...
1        a wonderful little production the filming tech...
2        i thought this was a wonderful way to spend ti...
3        basically there is a family where little boy j...
4        petter matteis love in the time of money is a ...
                               ...                        
49995    i thought this movie did a down right good job...
49996    bad plot dialogue acting idiotic directing the...
49997    i am a catholic taught in parochial elementary...
49998    i am going to have disagree with the previous ...
49999    no one expects the star trek movies to be high...
Name: review, Length: 50000, dtype: object

# Removing stop words

In [28]:
stop_words = set(stopwords.words("english"))  # a set gives fast membership tests

def stops_words(tokens):
    # Keep only the tokens that are not in the NLTK English stop word list.
    return [w for w in tokens if w not in stop_words]

In [29]:
movie_dataset['review'] = movie_dataset['review'].apply(lambda x: x.split(" "))
movie_dataset['review'] = movie_dataset['review'].apply(lambda x: stops_words(x))
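
Splitting on a single space keeps empty strings wherever two spaces were adjacent, which is why an empty token can appear in the filtered lists. The sketch below shows the filtering idea with `split()` and a tiny stand-in stop list; `DEMO_STOP_WORDS` and `filter_tokens` are hypothetical, and the notebook itself uses NLTK's English stop word list.

```python
# Tiny stand-in stop list for illustration only; not the NLTK list.
DEMO_STOP_WORDS = frozenset({'i', 'am', 'to', 'have', 'with', 'the'})


def filter_tokens(text, stop_words=DEMO_STOP_WORDS):
    """Tokenise on any whitespace and drop tokens found in the stop list."""
    return [token for token in text.split() if token not in stop_words]


print(filter_tokens("i am going to have  disagree with the previous comment"))
```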

In [30]:
movie_dataset['review']

0        [one, reviewers, mentioned, watching, , oz, ep...
1        [wonderful, little, production, filming, techn...
2        [thought, wonderful, way, spend, time, hot, su...
3        [basically, family, little, boy, jake, thinks,...
4        [petter, matteis, love, time, money, visually,...
                               ...                        
49995    [thought, movie, right, good, job, creative, o...
49996    [bad, plot, dialogue, acting, idiotic, directi...
49997    [catholic, taught, parochial, elementary, scho...
49998    [going, disagree, previous, comment, side, mal...
49999    [one, expects, star, trek, movies, high, art, ...
Name: review, Length: 50000, dtype: object

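# Stemming (Lemmatization)

This step reduces each word to its dictionary form with NLTK's `WordNetLemmatizer`, which is lemmatization rather than stemming.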

In [31]:
lemmatizer = WordNetLemmatizer()
movie_dataset['review'] = movie_dataset['review'].apply(lambda x: [lemmatizer.lemmatize(word) for word in x])
movie_dataset['review'] = movie_dataset['review'].apply(lambda x: ' '.join(x))

print("Processed Data: \n", movie_dataset.head(20))

Processed Data: 
                                                review sentiment
0   one reviewer mentioned watching  oz episode ho...  positive
1   wonderful little production filming technique ...  positive
2   thought wonderful way spend time hot summer we...  positive
3   basically family little boy jake think zombie ...  negative
4   petter matteis love time money visually stunni...  positive
5   probably alltime favorite movie story selfless...  positive
6   sure would like see resurrection dated seahunt...  positive
7   show amazing fresh  innovative idea first aire...  negative
8   encouraged positive comment film looking forwa...  negative
9   like original gut wrenching laughter movie you...  positive
10  phil alien one quirky film humour based around...  negative
11  saw movie  came recall scariest scene big bird...  negative
12  big fan boll work many enjoyed movie postal ma...  negative
13  cast played shakespeareshakespeare losti appre...  negative
14  fantastic movie th

# Saving Preprocessed Data

In [33]:
movie_dataset.to_csv(r"C:\Users\karth\OneDrive\Desktop\Sentiment_Analysis\IMDB_Dataset.csv", index = False)