<h1> Data Preprocessing </h1>

Preprocessing is done in a separate notebook because it runs slowly on Google Colab.
This notebook covers several preprocessing steps: removing HTML tags, removing punctuation, removing stopwords, and stemming.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import nltk
# download the stopword list non-interactively (download_shell() opens an interactive prompt)
nltk.download('stopwords')

## Dataset

The dataset used for the sentiment analysis is the 'IMDB Dataset', which is available from [Kaggle](https://www.kaggle.com/columbine/imdb-dataset-sentiment-analysis-in-csv-format).

In [None]:
# importing the dataset
df = pd.read_csv('IMDB Dataset.csv')
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


<h2> Beautiful Soup </h2>

Beautiful Soup is a web scraping library that can be used to remove HTML tags from the dataset.

In [None]:
# remove HTML tags
from bs4 import BeautifulSoup
# apply() avoids chained assignment (df['review'][i] = ...), which pandas may not write back to df
df['review'] = df['review'].apply(lambda text: BeautifulSoup(text, "lxml").get_text())
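
To see what this step does to a single review, here is a minimal sketch on a made-up string (not taken from the dataset). It assumes the built-in `html.parser` backend instead of `lxml`, so it runs without the extra dependency; `strip_html` is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

def strip_html(text):
    # Parse the markup and return only the text nodes, dropping every tag.
    return BeautifulSoup(text, "html.parser").get_text()

# Made-up sample in the style of an IMDB review with <br /> line breaks.
sample = "A wonderful little production. <br /><br />The filming is <b>unassuming</b>."
cleaned = strip_html(sample)
print(cleaned)
```

The `<br />` tags carry no text of their own, so the sentences on either side are simply joined.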

<h2>Punctuation</h2>

Punctuation is removed with a helper function that also lowercases the text. `string.punctuation` contains all ASCII punctuation characters.

In [None]:
# remove punctuation and lowercase the text
import string
def punctuation(sentence):
    # keep every character that is not punctuation, lowercased
    review = [char.lower() for char in sentence if char not in string.punctuation]
    review = ''.join(review)
    return review

df['review'] = df['review'].apply(punctuation)
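
As an alternative sketch, the same cleaning can be written with `str.translate`, which deletes all punctuation in one pass instead of testing each character in a Python loop. `strip_punctuation` is a hypothetical name, and this version is assumed, not measured, to be faster on long reviews.

```python
import string

# Translation table that maps every ASCII punctuation character to None (deletes it).
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def strip_punctuation(sentence):
    # Lowercase first, then delete punctuation in a single pass.
    return sentence.lower().translate(_PUNCT_TABLE)

print(strip_punctuation("Hello, World! It's great."))
```

Note that apostrophes are removed as well, so contractions such as "It's" become "its".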

## Encoding Text, Removing Common Words, Implicit Ratings

- The labels positive and negative are replaced with 1 and 0. Most recommender systems that produce binary outputs predict well whether a user will click a particular item. Hence, we only need to predict whether a review is positive or negative, not the exact rating a user would give the movie.

- **Purpose of a Sentiment Analysis Model:** Deep learning models commonly require large amounts of data to make accurate predictions.

    - Usually, *more data -> better model -> better predictions*

- Many review pages and websites offer large amounts of data without a target variable. Sentiment analysis lets us predict a target variable for each review, which can later be used to build better predictive models for recommendation systems or other use cases.

- Stopwords are common words that can be removed from the dataset. Generally, these words don't carry much significance.

In [None]:
# encode labels as integers; assigning the result avoids in-place modification of a column
df['sentiment'] = df['sentiment'].map({'positive': 1, 'negative': 0})

In [None]:
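# removing stopwords
# build the stopword set once instead of reloading the list for every word
stop_words = set(nltk.corpus.stopwords.words('english'))
def stopwords(sentence):
    review = [words for words in sentence.split() if words not in stop_words]
    review = ' '.join(review)
    return review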

df['review'] = df['review'].apply(stopwords)
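
The filtering logic can be illustrated without downloading the NLTK corpus. The sketch below uses a tiny hand-picked stopword set (an assumption for illustration; the real list from `nltk.corpus.stopwords.words('english')` is much longer) and a hypothetical `remove_stopwords` helper.

```python
# Tiny illustrative stopword set; NOT the NLTK English list.
DEMO_STOPWORDS = {"the", "a", "is", "of", "and", "this", "was"}

def remove_stopwords(sentence, stop_words=DEMO_STOPWORDS):
    # Keep only the tokens that are not in the stopword set.
    kept = [word for word in sentence.split() if word not in stop_words]
    return ' '.join(kept)

print(remove_stopwords("this was a wonderful way of spending the evening"))
```

Storing the stopwords in a `set` keeps each membership test constant-time, which matters when the check runs once per word across every review.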

<h2> Stemming </h2>

Converting text to TF-IDF is computationally expensive, so we would like to reduce the number of distinct words as much as possible without losing information from the data.<br>

Our dataset contains words like <b> run, runs, running </b>. Our model loses little information if all of these are generalised to <b> run </b>. This is what stemming does. The Porter stemmer used here strips suffixes by rule, so irregular forms such as <b> ran </b> are left unchanged.

In [None]:
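# stemming
from nltk import PorterStemmer
ps = PorterStemmer()
# ps.stem() expects a single word, so stem each token and rejoin
df['review'] = df['review'].apply(lambda text: ' '.join(ps.stem(word) for word in text.split()))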
df.head()

Unnamed: 0,review,sentiment
0,one reviewers mentioned watching 1 oz episode ...,1
1,wonderful little production filming technique ...,1
2,thought wonderful way spend time hot summer we...,1
3,basically theres family little boy jake thinks...,0
4,petter matteis love time money visually stunni...,1
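
A quick check on a few words shows how the Porter stemmer merges inflected forms. This is a minimal sketch; `stem_text` is a hypothetical helper, and the word list is made up for illustration.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

def stem_text(text):
    # PorterStemmer.stem() works on one word at a time, so stem each token and rejoin.
    return ' '.join(ps.stem(word) for word in text.split())

print(stem_text("run runs running"))
```

Passing a whole review to `ps.stem` treats the entire string as a single word, which is why the helper splits the text into tokens first.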


<h2> Exporting the preprocessed dataframe to a csv file </h2>

The main sentiment analysis is done in Google Colab, so the modified dataframe is saved to a CSV file.

In [None]:
#exporting csv file
df.to_csv("processed_data.csv",index=False)