<a href="https://www.kaggle.com/code/chwasiq0569/sentimen-analysis-on-imdb-review-dataset?scriptVersionId=123550932" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv


In [2]:
df = pd.read_csv('/kaggle/input/imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv')

In [3]:
df.head()

| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive |
| 1 | `A wonderful little production. <br /><br />The...` | positive |
| 2 | I thought this was a wonderful way to spend ti... | positive |
| 3 | Basically there's a family where a little boy ... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive |


# Steps

1 - Text Cleaning

2 - Remove HTML Tags

3 - Convert Everything to Lower Case

4 - Remove Special Characters

5 - Remove Stop Words

6 - Stemming
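To show how the steps chain together, here is a minimal standard-library sketch of the whole pipeline on a single string. The names `preprocess`, `TAG_RE`, and `DEMO_STOP_WORDS` are hypothetical, the stop-word set is a tiny stand-in for NLTK's English list, and the stemmer is a pluggable argument that defaults to a no-op, so this is an illustration rather than the notebook's exact code.

```python
import re

TAG_RE = re.compile(r"<.*?>")  # any HTML tag, matched non-greedily


def preprocess(text, stop_words, stem=lambda word: word):
    """Apply the cleaning steps in order and return a list of tokens."""
    text = TAG_RE.sub("", text)                                   # remove HTML tags
    text = text.lower()                                           # lower-case
    text = "".join(ch if ch.isalnum() else " " for ch in text)    # drop special characters
    tokens = [tok for tok in text.split() if tok not in stop_words]  # remove stop words
    return [stem(tok) for tok in tokens]                          # stem (no-op by default)


# Illustrative stop-word set only; the real NLTK list is much longer.
DEMO_STOP_WORDS = {"a", "the", "is", "this"}

tokens = preprocess("This is a <b>GREAT</b> movie!<br />", DEMO_STOP_WORDS)
print(tokens)
```

Passing a real stemmer, for example `stem=ps.stem` once a `PorterStemmer` instance exists, would complete the last step.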

In [4]:
df = df.sample(1000)  # random 1,000-review subset; no random_state, so rows differ between runs

In [5]:
df

| | review | sentiment |
|---|---|---|
| 26747 | I love and admire the Farrelly brothers! How c... | positive |
| 21071 | Once again John Madden has given us a magnific... | positive |
| 22866 | Tony Scott destroys anything that may have bee... | negative |
| 6747 | With movies like this you know you are going t... | negative |
| 3947 | I haven't seen the first two - only this one w... | negative |
| ... | ... | ... |
| 21435 | A warning to potential viewers of this experim... | positive |
| 15530 | It's a shame that Asterix and his buddy Obelix... | positive |
| 40815 | I enjoy watching people doing breakdance, espe... | positive |
| 17682 | This movie feels like a film project. As thoug... | positive |


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 26747 to 37580
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     1000 non-null   object
 1   sentiment  1000 non-null   object
dtypes: object(2)
memory usage: 23.4+ KB


In [7]:
df.shape

(1000, 2)

In [8]:
# Encode labels as integers; assigning back avoids in-place edits on a column selection
df['sentiment'] = df['sentiment'].map({ 'positive': 1, 'negative': 0 })

In [9]:
df['sentiment']

26747    1
21071    1
22866    0
6747     0
3947     0
        ..
21435    1
15530    1
40815    1
17682    1
37580    0
Name: sentiment, Length: 1000, dtype: int64

# 1 - Text Cleaning

In [10]:
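import re

TAG_RE = re.compile(r'<.*?>')  # compiled once; matches any HTML tag, non-greedily

def clean_html(text):
    # Delete every HTML tag (e.g. <br />) from the review text
    return TAG_RE.sub('', text)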


In [11]:
df['review'] = df['review'].apply(clean_html)
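One side effect of replacing tags with an empty string is that words separated only by a tag get glued together. The sketch below shows this on a made-up sentence and compares it with replacing each tag by a space; `strip_tags` and its `replacement` parameter are hypothetical names, not part of the notebook.

```python
import re

TAG_RE = re.compile(r"<.*?>")  # same pattern as the notebook's clean_html


def strip_tags(text, replacement=""):
    """Replace every HTML tag in text with the given replacement string."""
    return TAG_RE.sub(replacement, text)


sample = "great cast<br /><br />weak plot"  # invented example text

glued = strip_tags(sample)                         # tags deleted outright
spaced = " ".join(strip_tags(sample, " ").split())  # tags -> spaces, then collapse runs

print(glued)
print(spaced)
```

In the reviews shown above the `<br />` tags follow punctuation, which the later special-character step turns into a space anyway, so the difference only matters when a tag sits directly between two words.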

In [12]:
def convert_lower(text):
    return text.lower()

In [13]:
df['review'] = df['review'].apply(convert_lower)

In [14]:
df['review']

26747    i love and admire the farrelly brothers! how c...
21071    once again john madden has given us a magnific...
22866    tony scott destroys anything that may have bee...
6747     with movies like this you know you are going t...
3947     i haven't seen the first two - only this one w...
                               ...                        
15530    it's a shame that asterix and his buddy obelix...
40815    i enjoy watching people doing breakdance, espe...
17682    this movie feels like a film project. as thoug...
37580    first of all when i saw the teaser trailer for...
Name: review, Length: 1000, dtype: object

# Removing Special Characters from Reviews

In [15]:
def remove_special(text):
    # Replace every non-alphanumeric character with a space
    return ''.join(ch if ch.isalnum() else ' ' for ch in text)

In [16]:
df['review'] = df['review'].apply(remove_special)
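To see what this step does to a real review, the sketch below runs the same character-level rule on the visible start of review 3947. `replace_non_alnum` is a hypothetical stand-in that mirrors `remove_special`.

```python
def replace_non_alnum(text):
    """Swap each non-alphanumeric character for a single space."""
    return "".join(ch if ch.isalnum() else " " for ch in text)


original = "i haven't seen the first two - only this one"
cleaned = replace_non_alnum(original)
tokens = cleaned.split()

print(repr(cleaned))
print(tokens)
```

Because every character maps to exactly one character, the string length is unchanged, and an apostrophe splits a contraction into fragments such as `haven` and `t`. Neither fragment appears in the stop-word-filtered output for this review further down, which suggests the stop-word step absorbs them.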

In [17]:
import nltk

# Removing Stopwords such as 'i', 'me', 'my', 'myself', 'we', etc.

In [18]:
from nltk.corpus import stopwords

In [19]:
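STOP_WORDS = set(stopwords.words('english'))  # build once; set membership is fast

def remove_stopwords(text):
    # Split on whitespace and keep only the tokens that are not stopwords
    return [word for word in text.split() if word not in STOP_WORDS]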

In [20]:
df['review'] = df['review'].apply(remove_stopwords)
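One thing worth checking for sentiment work is that NLTK's English stop-word list includes negations such as "not" and "no", so removing stop words can erase the difference between "good" and "not good". The sketch below illustrates this with a tiny invented stop-word set; `filter_tokens`, `DEMO_STOP_WORDS`, and `NEGATIONS` are hypothetical names, and keeping negations is a possible variation rather than what the notebook does.

```python
# Tiny stand-in for stopwords.words('english'); the real list is much longer.
DEMO_STOP_WORDS = {"i", "this", "was", "not", "a", "the"}

# Words we might choose to keep because they flip the sentiment (an assumption).
NEGATIONS = {"not", "no", "nor", "never"}


def filter_tokens(text, stop_words):
    """Split on whitespace and drop every token found in stop_words."""
    return [word for word in text.split() if word not in stop_words]


review = "this was not a good movie"

without_negation = filter_tokens(review, DEMO_STOP_WORDS)
with_negation = filter_tokens(review, DEMO_STOP_WORDS - NEGATIONS)

print(without_negation)
print(with_negation)
```

Whether keeping negations helps depends on the model built on top of these tokens, so it is best treated as something to test.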

In [21]:
df

| | review | sentiment |
|---|---|---|
| 26747 | [love, admire, farrelly, brothers, come, got, ... | 1 |
| 21071 | [john, madden, given, us, magnificent, film, s... | 1 |
| 22866 | [tony, scott, destroys, anything, may, interes... | 0 |
| 6747 | [movies, like, know, going, get, usual, jokes,... | 0 |
| 3947 | [seen, first, two, one, called, primal, specie... | 0 |
| ... | ... | ... |
| 21435 | [warning, potential, viewers, experimental, fi... | 1 |
| 15530 | [shame, asterix, buddy, obelix, get, world, wi... | 1 |
| 40815 | [enjoy, watching, people, breakdance, especial... | 1 |
| 17682 | [movie, feels, like, film, project, though, fi... | 1 |


# Performing Stemming (reducing each word to its stem, e.g. 'movies' -> 'movi')

In [22]:
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()

In [23]:
def stem_words(text):
    # Stem each token; a local list avoids the shared module-level buffer
    return [ps.stem(word) for word in text]

In [24]:
df['review'] = df['review'].apply(stem_words)
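To give a feel for what a stemmer does, the sketch below implements only the plural-stripping rules of step 1a from Porter's published algorithm. It is a deliberately simplified illustration with a hypothetical function name, not NLTK's `PorterStemmer`, which applies several further steps and its own adjustments.

```python
def porter_step_1a(word):
    """Apply Porter's step 1a suffix rules, longest suffix first."""
    if word.endswith("sses"):
        return word[:-2]   # caresses -> caress
    if word.endswith("ies"):
        return word[:-2]   # movies -> movi
    if word.endswith("ss"):
        return word        # caress -> caress
    if word.endswith("s"):
        return word[:-1]   # brothers -> brother
    return word


stems = [porter_step_1a(w) for w in ["movies", "brothers", "viewers", "jokes", "movie"]]
print(stems)
```

This single step already reproduces several stems in the notebook's output, but it leaves "movie" untouched, whereas the full stemmer in the notebook returns "movi". Stems are truncated forms and need not be dictionary words.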

In [25]:
df['review']

26747    [love, admir, farrelli, brother, come, got, se...
21071    [john, madden, given, us, magnific, film, simp...
22866    [toni, scott, destroy, anyth, may, interest, r...
6747     [movi, like, know, go, get, usual, joke, conce...
3947     [seen, first, two, one, call, primal, speci, e...
                               ...                        
21435    [warn, potenti, viewer, experiment, film, natu...
15530    [shame, asterix, buddi, obelix, get, world, wi...
40815    [enjoy, watch, peopl, breakdanc, especi, well,...
17682    [movi, feel, like, film, project, though, film...
37580    [first, saw, teaser, trailer, wendi, wu, defin...
Name: review, Length: 1000, dtype: object