### Load Movie Reviews Data

In [1]:
from google.colab import drive
drive.mount('/gdrive')

Mounted at /gdrive


In [3]:
# Read the zipped TSV file into pandas from the mounted Drive. Change the path as needed.
import pandas as pd
movies_df = pd.read_table('/gdrive/My Drive/Statistical NLP AIML/labeledTrainData.tsv.zip')

In [4]:
# Number of reviews (rows) and number of columns
movies_df.shape

(25000, 3)

In [5]:
movies_df.sample(n=5)

| | id | sentiment | review |
|---|---|---|---|
| 4698 | 232_1 | 0 | This had to be one of the worst films ever. Wh... |
| 7594 | 866_2 | 0 | \Plan B\" is strictly by-the-numbers fare exce... |
| 14244 | 11159_3 | 0 | I saw this movie at a 'sneak preview' and i mu... |
| 10001 | 11458_7 | 1 | Greetings again from the darkness. Insight int... |
| 1122 | 9239_8 | 1 | As I said in my comment about the first part: ... |


#### Install and import NLTK

In [6]:
!pip install nltk --quiet

In [7]:
import nltk

In [8]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

### Stemming

In [9]:
from nltk.stem import PorterStemmer

In [10]:
# Stem each whitespace-separated word of every review in the corpus
def get_stemmed_text(corpus):
    stemmer = PorterStemmer()
    return [' '.join([stemmer.stem(word) for word in review.split()]) for review in corpus]

Let's apply stemming to the first review.

In [11]:
#first review without stemming
movies_df.loc[0, 'review']

"With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought was really cool in the eighties just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it was originally released. Some of it has subtle messages about MJ's feeling towards the press and also the obvious message of drugs are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fans would say that he made it for the fans which if true is really nice of him.<br /><br />The actual feature film bit when it finally sta

In [12]:
#Stemming for first review
get_stemmed_text([movies_df.loc[0, 'review']])

["with all thi stuff go down at the moment with MJ i'v start listen to hi music, watch the odd documentari here and there, watch the wiz and watch moonwalk again. mayb i just want to get a certain insight into thi guy who i thought wa realli cool in the eighti just to mayb make up my mind whether he is guilti or innocent. moonwalk is part biography, part featur film which i rememb go to see at the cinema when it wa origin released. some of it ha subtl messag about mj' feel toward the press and also the obviou messag of drug are bad m'kay.<br /><br />visual impress but of cours thi is all about michael jackson so unless you remot like MJ in anyway then you are go to hate thi and find it boring. some may call MJ an egotist for consent to the make of thi movi but MJ and most of hi fan would say that he made it for the fan which if true is realli nice of him.<br /><br />the actual featur film bit when it final start is onli on for 20 minut or so exclud the smooth crimin sequenc and joe pes
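The stemmed output above still contains `biography,` and `released.` unchanged, because `review.split()` leaves punctuation attached to the word and the stemmer then sees a token that ends in a comma or period. A minimal sketch of one way around this, assuming a simple regular-expression tokenizer is acceptable; the helper name `stem_tokens` and the pattern are illustrative and not part of the original notebook:

```python
import re
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def stem_tokens(text):
    """Stem tokens after separating words from punctuation (hypothetical helper)."""
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation character
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return [stemmer.stem(token) for token in tokens]

sample = "Moonwalker is part biography, part feature film."

# Whitespace splitting: "biography," keeps its comma and is left unstemmed
whitespace_stems = [stemmer.stem(word) for word in sample.split()]
# Regex tokenizing: "biography" and "," become separate tokens
regex_stems = stem_tokens(sample)

print(whitespace_stems)
print(regex_stems)
```

This pattern also splits contractions such as `i've` at the apostrophe, so it is a trade-off rather than a drop-in replacement for `split()`.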

Stemming all reviews

In [13]:
#Create a new column to hold stemmed reviews
movies_df['stemmed_review'] = get_stemmed_text(movies_df['review'].tolist())

In [14]:
movies_df.head()

| | id | sentiment | review | stemmed_review |
|---|---|---|---|---|
| 0 | 5814_8 | 1 | With all this stuff going down at the moment w... | with all thi stuff go down at the moment with ... |
| 1 | 2381_9 | 1 | \The Classic War of the Worlds\" by Timothy Hi... | \the classic war of the worlds\" by timothi hi... |
| 2 | 7759_3 | 0 | The film starts with a manager (Nicholas Bell)... | the film start with a manag (nichola bell) giv... |
| 3 | 3630_4 | 0 | It must be assumed that those who praised this... | It must be assum that those who prais thi film... |
| 4 | 9495_8 | 1 | Superbly trashy and wondrously unpretentious 8... | superbl trashi and wondrous unpretenti 80' exp... |


### Lemmatization

In [15]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [16]:
from nltk.stem import WordNetLemmatizer

In [17]:
def get_lemmatized_text(corpus):
    # lemmatize() treats every word as a noun unless a pos argument is passed
    lemmatizer = WordNetLemmatizer()
    return [' '.join([lemmatizer.lemmatize(word) for word in review.split()]) for review in corpus]

In [18]:
#Lemmatization for first review
get_lemmatized_text([movies_df.loc[0, 'review']])

["With all this stuff going down at the moment with MJ i've started listening to his music, watching the odd documentary here and there, watched The Wiz and watched Moonwalker again. Maybe i just want to get a certain insight into this guy who i thought wa really cool in the eighty just to maybe make up my mind whether he is guilty or innocent. Moonwalker is part biography, part feature film which i remember going to see at the cinema when it wa originally released. Some of it ha subtle message about MJ's feeling towards the press and also the obvious message of drug are bad m'kay.<br /><br />Visually impressive but of course this is all about Michael Jackson so unless you remotely like MJ in anyway then you are going to hate this and find it boring. Some may call MJ an egotist for consenting to the making of this movie BUT MJ and most of his fan would say that he made it for the fan which if true is really nice of him.<br /><br />The actual feature film bit when it finally start is on

In [19]:
#Create a new column to hold lemmatized reviews
movies_df['lemmatized_review'] = get_lemmatized_text(movies_df['review'].tolist())

In [20]:
movies_df.head()

| | id | sentiment | review | stemmed_review | lemmatized_review |
|---|---|---|---|---|---|
| 0 | 5814_8 | 1 | With all this stuff going down at the moment w... | with all thi stuff go down at the moment with ... | With all this stuff going down at the moment w... |
| 1 | 2381_9 | 1 | \The Classic War of the Worlds\" by Timothy Hi... | \the classic war of the worlds\" by timothi hi... | \The Classic War of the Worlds\" by Timothy Hi... |
| 2 | 7759_3 | 0 | The film starts with a manager (Nicholas Bell)... | the film start with a manag (nichola bell) giv... | The film start with a manager (Nicholas Bell) ... |
| 3 | 3630_4 | 0 | It must be assumed that those who praised this... | It must be assum that those who prais thi film... | It must be assumed that those who praised this... |
| 4 | 9495_8 | 1 | Superbly trashy and wondrously unpretentious 8... | superbl trashi and wondrous unpretenti 80' exp... | Superbly trashy and wondrously unpretentious 8... |


Either the lemmatized or the stemmed text can be used for vectorization in place of the original text.

### Vectorization

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [None]:
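# Fit on the original review text and report the vocabulary size
vect = TfidfVectorizer()
vect.fit(movies_df['review'].tolist())
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
len(vect.get_feature_names_out())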

In [None]:
# Fit on the lemmatized text and report the vocabulary size
vect = TfidfVectorizer()
vect.fit(movies_df['lemmatized_review'].tolist())
len(vect.get_feature_names_out())

In [None]:
# Fit on the stemmed text and report the vocabulary size
vect = TfidfVectorizer()
vect.fit(movies_df['stemmed_review'].tolist())
len(vect.get_feature_names_out())
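
The three cells above differ only in which column is fitted, so their vocabulary sizes can be compared directly. As a rough illustration of why stemming tends to shrink the vocabulary, the sketch below fits `TfidfVectorizer` on a made-up three-sentence corpus (not taken from the dataset) before and after Porter stemming. The helper `vocabulary_size` is hypothetical and not part of the original notebook:

```python
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

def vocabulary_size(texts):
    """Fit a default TfidfVectorizer and return the number of learned terms."""
    vect = TfidfVectorizer()
    vect.fit(texts)
    return len(vect.vocabulary_)

# Toy corpus, invented for illustration only
toy_reviews = ["I loved the movies", "she loves this movie", "they love movies"]

# Same whitespace-based stemming approach as get_stemmed_text
stemmer = PorterStemmer()
toy_stemmed = [' '.join(stemmer.stem(word) for word in review.split())
               for review in toy_reviews]

raw_size = vocabulary_size(toy_reviews)
stemmed_size = vocabulary_size(toy_stemmed)
print(raw_size, stemmed_size)
```

Inflected forms such as "loved", "loves" and "love" are expected to collapse into a single term after stemming, which is what reduces the count. How much the vocabulary shrinks on the full 25,000 reviews depends on the data and is not shown here.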