# Data Preprocessing

Before building a Recurrent Neural Network (RNN), we need to preprocess the data so that it is compatible with the model. This includes:

- Tokenization
- Lowercasing all the words
- Removing punctuation and special characters
- Removing stopwords

Note: The original dataset was taken from Kaggle.

In [1]:
import numpy as np
import pandas as pd

# ML libraries for preprocessing
from sklearn.preprocessing import LabelEncoder

# NLP libraries for data preprocessing
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

In [4]:
reviews_df = pd.read_csv('data/movie_reviews.csv')
reviews_df

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
...,...,...
49995,I thought this movie did a down right good job...,positive
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative
49997,I am a Catholic taught in parochial elementary...,negative
49998,I'm going to have to disagree with the previou...,negative


Now that we have loaded the dataframe, the code below preprocesses the text: it lowercases all the words, removes punctuation and special characters, tokenizes the text, and removes stopwords.

In [5]:
# Build the stopword set once instead of on every call
STOP_WORDS = set(stopwords.words('english'))

def preprocess_text(text):
    # Lowercasing
    text = text.lower()
    # Remove non-alphabetic characters. The flags must be passed by keyword:
    # the fourth positional argument of re.sub is `count`, not `flags`.
    text = re.sub(r'[^a-zA-Z\s]', '', text, flags=re.I | re.A)
    # Tokenize text
    tokens = word_tokenize(text)
    # Remove stopwords
    filtered_tokens = [token for token in tokens if token not in STOP_WORDS]
    # Return the preprocessed tokens
    return filtered_tokens

# Apply preprocessing to each review
reviews_df['processed_reviews'] = reviews_df['review'].apply(preprocess_text)
reviews_df

Unnamed: 0,review,sentiment,processed_reviews
0,One of the other reviewers has mentioned that ...,positive,"[one, reviewers, mentioned, watching, oz, epis..."
1,A wonderful little production. <br /><br />The...,positive,"[wonderful, little, production, br, br, filmin..."
2,I thought this was a wonderful way to spend ti...,positive,"[thought, wonderful, way, spend, time, hot, su..."
3,Basically there's a family where a little boy ...,negative,"[basically, theres, family, little, boy, jake,..."
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,"[petter, matteis, love, time, money, visually,..."
...,...,...,...
49995,I thought this movie did a down right good job...,positive,"[thought, movie, right, good, job, wasnt, crea..."
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative,"[bad, plot, bad, dialogue, bad, acting, idioti..."
49997,I am a Catholic taught in parochial elementary...,negative,"[catholic, taught, parochial, elementary, scho..."
49998,I'm going to have to disagree with the previou...,negative,"[im, going, disagree, previous, comment, side,..."
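The processed output above still contains `br` tokens, which are left over from `<br />` tags in the raw reviews. Below is a minimal sketch of one way to drop such tags before the non-alphabetic filter runs. It assumes a simple regular expression is good enough for the markup in this dataset; `strip_html_tags` is a hypothetical helper, not part of the original notebook.

```python
import re

# Assumption: tags in the reviews are simple, e.g. "<br />", so a
# basic pattern is sufficient. This is not a general HTML parser.
TAG_RE = re.compile(r'<[^>]+>')

def strip_html_tags(text):
    """Replace anything that looks like an HTML tag with a space."""
    return TAG_RE.sub(' ', text)

sample = "A wonderful little production. <br /><br />The filming"
cleaned = strip_html_tags(sample)
print(cleaned)
```

Such a helper could be called at the top of `preprocess_text`, before lowercasing, so that tag names never reach the tokenizer.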


Next, we need to convert the words in the reviews to word embeddings so that the model can work with numeric input. We can do this by simply using pretrained word embeddings from GloVe.

In [9]:
def load_glove_model(glove_file_path):
    embeddings_index = {}
    with open(glove_file_path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs
    return embeddings_index

glove_path = '../glove.6B/glove.6B.50d.txt'
glove_model = load_glove_model(glove_path)
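To make the file layout that `load_glove_model` expects more concrete: each line holds a word followed by its vector components, separated by spaces. The sketch below parses two made-up lines with 3-dimensional vectors (the real `glove.6B.50d.txt` vectors have 50 dimensions); the words and numbers are invented for illustration only.

```python
import io
import numpy as np

# Two made-up lines in the GloVe text layout: a word, then its floats
fake_glove = io.StringIO(
    "movie 0.5 -1.0 0.25\n"
    "great 1.5 0.0 -0.75\n"
)

toy_index = {}
for line in fake_glove:
    values = line.split()
    # First field is the word, the rest are the vector components
    toy_index[values[0]] = np.asarray(values[1:], dtype='float32')

print(toy_index['movie'])
print(toy_index['movie'].shape)  # (3,)
```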

In [10]:
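def tokens_to_vector(tokens, model, dim=50):
    # Average the vectors of known tokens; zeros (not NaN) if none are known
    vectors = [model[token] for token in tokens if token in model]
    if not vectors:
        return np.zeros(dim, dtype='float32')
    return np.mean(vectors, axis=0)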

reviews_df['embeddings'] = reviews_df['processed_reviews'].apply(lambda tokens: tokens_to_vector(tokens, glove_model))
reviews_df

Unnamed: 0,review,sentiment,processed_reviews,embeddings
0,One of the other reviewers has mentioned that ...,positive,"[one, reviewers, mentioned, watching, oz, epis...","[0.10862178, 0.044607483, -0.0782462, -0.26707..."
1,A wonderful little production. <br /><br />The...,positive,"[wonderful, little, production, br, br, filmin...","[0.08574718, 0.2102892, -0.3049311, -0.1847553..."
2,I thought this was a wonderful way to spend ti...,positive,"[thought, wonderful, way, spend, time, hot, su...","[0.15635169, 0.12138394, -0.120738186, -0.2375..."
3,Basically there's a family where a little boy ...,negative,"[basically, theres, family, little, boy, jake,...","[0.16219215, 0.020814184, -0.1491204, -0.25186..."
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive,"[petter, matteis, love, time, money, visually,...","[0.28672162, 0.26971683, -0.096419714, -0.0721..."
...,...,...,...,...
49995,I thought this movie did a down right good job...,positive,"[thought, movie, right, good, job, wasnt, crea...","[0.12475395, 0.09159596, -0.009992485, -0.2981..."
49996,"Bad plot, bad dialogue, bad acting, idiotic di...",negative,"[bad, plot, bad, dialogue, bad, acting, idioti...","[0.00033322666, -0.08188592, -0.06321972, -0.3..."
49997,I am a Catholic taught in parochial elementary...,negative,"[catholic, taught, parochial, elementary, scho...","[0.13239665, 0.2664471, -0.29731384, -0.560151..."
49998,I'm going to have to disagree with the previou...,negative,"[im, going, disagree, previous, comment, side,...","[0.05808708, -0.05252409, -0.15382867, -0.2030..."
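The embedding step reduces each review to a single vector by averaging the vectors of its tokens and skipping tokens that are not in the GloVe vocabulary. A toy sketch of that averaging, using an invented 2-dimensional vocabulary (`toy_model` and its values are assumptions for illustration):

```python
import numpy as np

# Invented 2-dimensional "embeddings" for two words
toy_model = {
    'good': np.array([1.0, 0.0], dtype='float32'),
    'movie': np.array([0.0, 1.0], dtype='float32'),
}

tokens = ['good', 'movie', 'zzzz']  # 'zzzz' is out of vocabulary

# Keep only tokens with a known vector, then average element-wise
known = [toy_model[t] for t in tokens if t in toy_model]
review_vector = np.mean(known, axis=0)
print(review_vector)  # [0.5 0.5]
```

Note that averaging collapses the whole review into one vector, so the order of the words is not preserved in the result.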


Finally, now that we have converted the reviews into embeddings, we also need to encode the labels as integers (negative -> 0, positive -> 1) so that the model can use them.

In [None]:
encoder = LabelEncoder()
encoded_categories = encoder.fit_transform(reviews_df['sentiment'])
reviews_df['sentiment_label'] = encoded_categories
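`LabelEncoder` assigns integer codes in sorted order of the class names, which is why negative maps to 0 and positive maps to 1. The plain-Python sketch below reproduces that mapping on a few made-up labels so the assignment is explicit; it is an illustration of the behaviour, not a replacement for the cell above.

```python
# Made-up labels for illustration
labels = ['positive', 'negative', 'positive', 'negative']

# Sort the distinct class names and number them in that order
classes = sorted(set(labels))
label_map = {name: code for code, name in enumerate(classes)}
encoded = [label_map[name] for name in labels]

print(label_map)  # {'negative': 0, 'positive': 1}
print(encoded)    # [1, 0, 1, 0]
```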

In [None]:
reviews_df.drop(['processed_reviews'], axis=1, inplace=True)
reviews_df.to_csv('preprocessed_movie_reviews.csv', index=False)
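One thing to keep in mind when reloading this file: a CSV stores every cell as text, so the `embeddings` column will come back as strings rather than NumPy arrays. A possible workaround, sketched here under the assumption that a JSON list per cell is acceptable, is to serialize each vector explicitly before saving and parse it after loading. The helper names `vector_to_text` and `text_to_vector` are hypothetical.

```python
import json
import numpy as np

def vector_to_text(vec):
    """Serialize a 1-D array as a JSON list, e.g. for a CSV cell."""
    return json.dumps(vec.tolist())

def text_to_vector(text):
    """Parse a JSON list back into a float32 array."""
    return np.asarray(json.loads(text), dtype='float32')

vec = np.array([0.25, -0.5, 1.0], dtype='float32')
serialized = vector_to_text(vec)
restored = text_to_vector(serialized)
print(serialized)
print(restored)
```

With these helpers, the column could be converted with `.apply(vector_to_text)` before `to_csv` and with `.apply(text_to_vector)` after `read_csv`.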