# Data Preprocessing
**Dataset: IMDb 50k reviews dataset**
(https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews)

Status: done through tokenization (step 4).

In [None]:
# importing libraries
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from bs4 import BeautifulSoup
import spacy

In [None]:
# dataset loading
df = pd.read_csv("/content/drive/MyDrive/EZ Projects/IMDb Data/IMDB Dataset.csv")

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB


### 1. Handling missing values
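You can either remove the rows with missing values or impute them. Mean and median imputation only apply to numeric columns, and the mode is the usual choice for categorical ones; for free text such as `review`, dropping the row is usually the simplest option. The check below shows that this dataset has no missing values in either column.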

In [None]:
df.isnull().sum()

review       0
sentiment    0
dtype: int64
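
This dataset has nothing to drop, but a minimal sketch of how missing or blank reviews could be handled is shown here. The `toy` frame and its values are made up for illustration, and treating whitespace-only reviews as missing is an assumption rather than something the notebook does.

```python
import pandas as pd

# Toy frame standing in for the reviews DataFrame (values are made up)
toy = pd.DataFrame({
    "review": ["great film", None, "   ", "dull plot", "fine"],
    "sentiment": ["positive", "negative", "negative", "negative", None],
})

# A review counts as blank if it is missing or contains only whitespace
blank_review = toy["review"].fillna("").str.strip().eq("")

# A row without a label cannot be used for supervised training
missing_label = toy["sentiment"].isna()

# Keep only rows that have both a non-blank review and a label
clean = toy[~(blank_review | missing_label)].reset_index(drop=True)
print(clean)
```

`isnull().sum()` alone would not flag the whitespace-only row, which is why the blank check is done separately.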

### 2. Removing Duplicates

In [None]:
# df[df.duplicated(subset=['review'])]  #for specific column
df[df.duplicated()]

| | review | sentiment |
|---|---|---|
| 3537 | Quite what the producers of this appalling ada... | negative |
| 3769 | My favourite police series of all time turns t... | positive |
| 4391 | Beautiful film, pure Cassavetes style. Gena Ro... | positive |
| 6352 | If you liked the Grinch movie... go watch that... | negative |
| 6479 | I want very much to believe that the above quo... | negative |
| ... | ... | ... |
| 49912 | This is an incredible piece of drama and power... | positive |
| 49950 | This was a very brief episode that appeared in... | negative |
| 49984 | Hello it is I Derrick Cannon and I welcome you... | negative |
| 49986 | This movie is a disgrace to the Major League F... | negative |


In [None]:
df.duplicated().sum()

418

In [None]:
# removing duplicate values and reset index
df = df.drop_duplicates()
df = df.reset_index(drop=True)
df.tail(2)

| | review | sentiment |
|---|---|---|
| 49580 | I'm going to have to disagree with the previou... | negative |
| 49581 | No one expects the Star Trek movies to be high... | negative |
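
`df.duplicated()` only flags rows where both the review and the sentiment repeat. The sketch below, on a made-up `toy` frame, contrasts that with checking the `review` column alone and shows one way to spot reviews that carry conflicting labels. Whether such cases exist in the IMDb data is not checked here.

```python
import pandas as pd

# Toy frame (values are made up): "good" repeats with the same label,
# "bad" repeats with two different labels
toy = pd.DataFrame({
    "review": ["good", "good", "bad", "bad", "okay"],
    "sentiment": ["positive", "positive", "negative", "positive", "positive"],
})

# Full-row duplicates: same review AND same sentiment
full_dupes = toy.duplicated().sum()

# Duplicates by review text only, regardless of label
text_dupes = toy.duplicated(subset=["review"]).sum()

# Reviews that appear with more than one distinct label
labels_per_review = toy.groupby("review")["sentiment"].nunique()
conflicting = labels_per_review[labels_per_review > 1].index.tolist()

# Keep the first occurrence of each review text
deduped = toy.drop_duplicates(subset=["review"], keep="first").reset_index(drop=True)
print(full_dupes, text_dupes, conflicting)
```

Deduplicating on the text alone silently keeps whichever label came first, so it is worth inspecting `conflicting` before choosing that route.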


### 3. Text Cleaning

1. Lowercasing: Convert all text to lowercase to ensure consistency.
2. Removing HTML tags: The reviews contain markup such as `<br />`.
3. Removing special characters and punctuation.
4. Removing stop words: Common words (like "the," "and," etc.) that don't contribute much to the meaning of the text.
5. Stemming or lemmatization: Reducing words to their base or root form. This notebook uses lemmatization.
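
The steps above, together with the HTML tag removal these reviews need, can be sketched as a single function using only the standard library. This is an illustration, not the notebook's implementation: `clean_text` and `STOP_WORDS` are hypothetical names, the stop word list is a tiny stand-in for NLTK's, the tag removal uses a regular expression instead of BeautifulSoup, and lemmatization is left out because it needs a language model.

```python
import re

# Tiny illustrative stop word list (an assumption; the notebook uses NLTK's English list)
STOP_WORDS = {"the", "a", "an", "and", "is", "this", "it", "of", "in", "to"}

def clean_text(text: str) -> str:
    """Lowercase, strip HTML tags and punctuation, and drop stop words."""
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)       # replace HTML tags with a space
    text = re.sub(r"[^a-z0-9\s]", " ", text)   # replace punctuation/special characters with a space
    words = [w for w in text.split() if w not in STOP_WORDS]
    return " ".join(words)

sample = "This movie is GREAT!<br /><br />The acting, and the plot..."
cleaned = clean_text(sample)
print(cleaned)
```

Replacing tags and punctuation with a space, rather than deleting them, keeps neighbouring words from being glued together.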



In [None]:
# inspect a sample review before cleaning
df['review'][243]

"Yes i'll say before i start commenting, this movie is incredibly underrated.<br /><br />Sharon Stone is great in her role of Catherine Trammell as is Morrissey as Dr glass. He is an analyst sent in to evaluate her after the death of a sports star. Glass is drawn into a seductive game that Trammel uses to manipulate his mind.<br /><br />The acting was good (apart from Thewlis)<br /><br />Stone really has a talent with this role. She's slick, naughty and seductive and doesn't look a day older than she did in the first.She really impressed me(like in Casino). Morrisey was also good. He showed much vunerablitity in a role that needed it. Thewlis however was lame. He ruined his character and was over-the-top the whole way. He really sucked.<br /><br />Overall, this movie not as good the first but Stone is a hoot to watch. Just ignore Thewlis."

In [None]:
# to lower case
df_lower = df['review'].str.lower()
df['review'] = df_lower

In [None]:
df['review'][243]

"yes i'll say before i start commenting, this movie is incredibly underrated.<br /><br />sharon stone is great in her role of catherine trammell as is morrissey as dr glass. he is an analyst sent in to evaluate her after the death of a sports star. glass is drawn into a seductive game that trammel uses to manipulate his mind.<br /><br />the acting was good (apart from thewlis)<br /><br />stone really has a talent with this role. she's slick, naughty and seductive and doesn't look a day older than she did in the first.she really impressed me(like in casino). morrisey was also good. he showed much vunerablitity in a role that needed it. thewlis however was lame. he ruined his character and was over-the-top the whole way. he really sucked.<br /><br />overall, this movie not as good the first but stone is a hoot to watch. just ignore thewlis."

In [None]:
# remove HTML tags such as <br />; get_text() concatenates the remaining text with no separator
df['review'] = df['review'].apply(lambda x: BeautifulSoup(x, 'html.parser').get_text())
df['review'][243]


"yes i'll say before i start commenting, this movie is incredibly underrated.sharon stone is great in her role of catherine trammell as is morrissey as dr glass. he is an analyst sent in to evaluate her after the death of a sports star. glass is drawn into a seductive game that trammel uses to manipulate his mind.the acting was good (apart from thewlis)stone really has a talent with this role. she's slick, naughty and seductive and doesn't look a day older than she did in the first.she really impressed me(like in casino). morrisey was also good. he showed much vunerablitity in a role that needed it. thewlis however was lame. he ruined his character and was over-the-top the whole way. he really sucked.overall, this movie not as good the first but stone is a hoot to watch. just ignore thewlis."

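In the output above, the text on either side of each `<br />` is joined directly ("underrated.sharon"). In this review a punctuation mark happens to precede every tag, so the later punctuation step separates the words again, but a tag sitting directly between two words would merge them. A minimal sketch of the difference, using BeautifulSoup's `separator` argument on a made-up string:

```python
from bs4 import BeautifulSoup

# Made-up snippet with a tag directly between two words
html = "incredibly underrated<br /><br />sharon stone is great"

# Default: text nodes are concatenated with no separator
joined = BeautifulSoup(html, "html.parser").get_text()

# With a separator, a space is placed between text nodes
spaced = BeautifulSoup(html, "html.parser").get_text(separator=" ")

print(joined)
print(spaced)
```

Passing `separator=" "` in the `apply` call would be one way to avoid merged words; the outputs recorded in this notebook were produced without it.
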
In [None]:
# remove special characters and punctuations
df['review'] = df['review'].str.replace(r'[^a-zA-Z0-9\s]', ' ', regex=True)
df['review'][243]

'yes i ll say before i start commenting  this movie is incredibly underrated sharon stone is great in her role of catherine trammell as is morrissey as dr glass  he is an analyst sent in to evaluate her after the death of a sports star  glass is drawn into a seductive game that trammel uses to manipulate his mind the acting was good  apart from thewlis stone really has a talent with this role  she s slick  naughty and seductive and doesn t look a day older than she did in the first she really impressed me like in casino   morrisey was also good  he showed much vunerablitity in a role that needed it  thewlis however was lame  he ruined his character and was over the top the whole way  he really sucked overall  this movie not as good the first but stone is a hoot to watch  just ignore thewlis '
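
Replacing every non-alphanumeric character with a space also splits contractions, as in "doesn t" and "she s" above, and leaves runs of spaces. The sketch below shows the same substitution with the standard `re` module, plus a variant that deletes apostrophes first and collapses whitespace. The variant is an assumption about what might be wanted, not part of the original pipeline.

```python
import re

text = "she's slick, naughty and doesn't look a day older (like in casino)."

# Same character class as the cell above; the raw string passes \s to the regex engine unchanged
basic = re.sub(r"[^a-zA-Z0-9\s]", " ", text)

# Variant: delete apostrophes first so contractions stay in one piece,
# then replace other punctuation and collapse runs of whitespace
no_apostrophes = text.replace("'", "")
variant = re.sub(r"[^a-zA-Z0-9\s]", " ", no_apostrophes)
variant = re.sub(r"\s+", " ", variant).strip()

print(basic)
print(variant)
```

Whether the variant helps depends on the stop word list used afterwards: a list that contains "doesn" and "t" removes both halves of the split form, while "doesnt" may not be on the list at all.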

In [None]:
# removing stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
df['review'] = df['review'].apply(lambda x: ' '.join([word for word in x.split() if word.lower() not in stop_words]))
df['review'][243]

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


'yes say start commenting movie incredibly underrated sharon stone great role catherine trammell morrissey dr glass analyst sent evaluate death sports star glass drawn seductive game trammel uses manipulate mind acting good apart thewlis stone really talent role slick naughty seductive look day older first really impressed like casino morrisey also good showed much vunerablitity role needed thewlis however lame ruined character top whole way really sucked overall movie good first stone hoot watch ignore thewlis'
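
Stop word removal also drops negations. In the output above, "this movie not as good the first" has become "movie good first", which reads as praise. For sentiment tasks it may be worth keeping such words. A minimal sketch, using a small stand-in stop word set (an assumption, to avoid the NLTK download) and a hypothetical `remove_stop_words` helper:

```python
# Small stand-in for NLTK's English stop word list (an assumption for this sketch)
stop_words = {"this", "as", "the", "but", "is", "a", "to", "not", "no"}

# Negation words to keep, since they can flip the sentiment of a phrase
negations = {"not", "no"}
stop_words_keep_neg = stop_words - negations

def remove_stop_words(text, stops):
    """Drop every whitespace-separated word that appears in `stops`."""
    return " ".join(word for word in text.split() if word not in stops)

text = "overall this movie not as good the first but stone is a hoot to watch"

without_neg = remove_stop_words(text, stop_words)
with_neg = remove_stop_words(text, stop_words_keep_neg)

print(without_neg)
print(with_neg)
```

With NLTK the same idea would be `set(stopwords.words('english')) - negations`, with the exact contents of `negations` left as a modelling choice.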

In [None]:
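# Lemmatize with spaCy. The parser and NER components are not needed for lemmas, so they are disabled.
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

# nlp.pipe processes the reviews in batches, which is faster than calling nlp() once per row
df['review'] = [' '.join(token.lemma_ for token in doc) for doc in nlp.pipe(df['review'])]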

In [None]:
df['review'][243]

'yes say start comment movie incredibly underrated sharon stone great role catherine trammell morrissey dr glass analyst send evaluate death sport star glass draw seductive game trammel use manipulate mind act good apart thewlis stone really talent role slick naughty seductive look day old first really impressed like casino morrisey also good show much vunerablitity role need thewli however lame ruin character top whole way really suck overall movie good first stone hoot watch ignore thewli'

### 4. Tokenization

In [None]:
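# Tokenize the cleaned reviews. Only spaCy's tokenizer is needed here, so it is called
# directly instead of running the full pipeline.
nlp = spacy.load('en_core_web_sm')

df['tokens'] = df['review'].apply(lambda x: [token.text for token in nlp.tokenizer(x)])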



In [None]:
df.columns

Index(['review', 'sentiment', 'tokens'], dtype='object')

In [None]:
df['tokens'][243]

['yes',
 'say',
 'start',
 'comment',
 'movie',
 'incredibly',
 'underrated',
 'sharon',
 'stone',
 'great',
 'role',
 'catherine',
 'trammell',
 'morrissey',
 'dr',
 'glass',
 'analyst',
 'send',
 'evaluate',
 'death',
 'sport',
 'star',
 'glass',
 'draw',
 'seductive',
 'game',
 'trammel',
 'use',
 'manipulate',
 'mind',
 'act',
 'good',
 'apart',
 'thewlis',
 'stone',
 'really',
 'talent',
 'role',
 'slick',
 'naughty',
 'seductive',
 'look',
 'day',
 'old',
 'first',
 'really',
 'impressed',
 'like',
 'casino',
 'morrisey',
 'also',
 'good',
 'show',
 'much',
 'vunerablitity',
 'role',
 'need',
 'thewli',
 'however',
 'lame',
 'ruin',
 'character',
 'top',
 'whole',
 'way',
 'really',
 'suck',
 'overall',
 'movie',
 'good',
 'first',
 'stone',
 'hoot',
 'watch',
 'ignore',
 'thewli']
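
Because the cleaned reviews contain only lowercase words separated by single spaces, a plain whitespace split gives a comparable token list without loading a model. This is a sketch of that alternative, not a drop-in replacement: spaCy's tokenizer can still split some strings differently, so the two are not guaranteed to match on every review. The excerpt is taken from the cleaned review shown earlier.

```python
from collections import Counter

# Excerpt from the cleaned, lemmatized review 243
cleaned = "glass analyst send evaluate death sport star glass draw seductive game"

# Whitespace tokenization
tokens = cleaned.split()

# Quick sanity check on the result: token frequencies
counts = Counter(tokens)

print(tokens)
print(counts.most_common(1))
```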

### 5. Numerical Encoding

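Convert categorical variables into numerical format using techniques such as one-hot encoding or label encoding. In this dataset the only categorical variable is `sentiment`, which takes the values `positive` and `negative`.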
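
As a minimal sketch of both techniques on the `sentiment` column, using a made-up `toy` frame: the 0/1 assignment in `label_map` is a convention chosen here, not something fixed by the dataset.

```python
import pandas as pd

# Toy frame standing in for the sentiment column (values are made up)
toy = pd.DataFrame({"sentiment": ["positive", "negative", "negative", "positive"]})

# Label encoding: map each class to an integer
label_map = {"negative": 0, "positive": 1}
toy["label"] = toy["sentiment"].map(label_map)

# One-hot encoding: one indicator column per class
one_hot = pd.get_dummies(toy["sentiment"], dtype=int)

print(toy)
print(one_hot)
```

With only two classes, the single `label` column carries the same information as the two one-hot columns, so label encoding is usually enough for a binary target.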