**Objective**

---


(i) Apply a Naive Bayes classifier to a dataset

(ii) Apply Linear Discriminant Analysis and compare it with the Naive Bayes classifier with respect to error, specificity and sensitivity

This notebook classifies reviews from the IMDB dataset and compares the results of Linear Discriminant Analysis (LDA) and a Naive Bayes classifier. Because the dataset is a text corpus, it is first preprocessed by cleaning, tokenization, stemming and stop-word removal. The performance of the classifier is evaluated before and after dimensionality reduction.

In [1]:
!unzip "/content/drive/MyDrive/Colab Notebooks/DS/Dataset/IMDB Dataset.csv.zip"

Archive:  /content/drive/MyDrive/Colab Notebooks/DS/Dataset/IMDB Dataset.csv.zip
  inflating: IMDB Dataset.csv        


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelBinarizer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from wordcloud import WordCloud,STOPWORDS
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize,sent_tokenize
from bs4 import BeautifulSoup
import spacy
import re,string,unicodedata
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
from nltk.tokenize.toktok import ToktokTokenizer

In [3]:
#Reading the dataset 
imdb_data = pd.read_csv("/content/IMDB Dataset.csv")
imdb_data.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [4]:
imdb_data.describe()

Unnamed: 0,review,sentiment
count,50000,50000
unique,49582,2
top,Loved today's show!!! It was a variety and not...,positive
freq,5,25000


In [5]:
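#Keep the first 1000 reviews; copy() avoids SettingWithCopyWarning on later column assignments
imdb_data = imdb_data.iloc[:1000,:].copy()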

In [6]:
imdb_data.shape

(1000, 2)

In [7]:
#sentiment count
imdb_data['sentiment'].value_counts()

positive    501
negative    499
Name: sentiment, dtype: int64

In [8]:
#split the dataset  
#train dataset
train_reviews=imdb_data.review[:800]
train_sentiments=imdb_data.sentiment[:800]
#test dataset
test_reviews=imdb_data.review[800:]
test_sentiments=imdb_data.sentiment[800:]
print(train_reviews.shape,train_sentiments.shape)
print(test_reviews.shape,test_sentiments.shape)

(800,) (800,)
(200,) (200,)


In [9]:
#Tokenization of text
tokenizer=ToktokTokenizer()
#Setting English stopwords
import nltk
nltk.download('stopwords')
stopword_list=nltk.corpus.stopwords.words('english')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [10]:
#Removing the html strips
def strip_html(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()

#Removing square brackets and the text between them
def remove_between_square_brackets(text):
    return re.sub(r'\[[^]]*\]', '', text)

#Removing the noisy text
def denoise_text(text):
    text = strip_html(text)
    text = remove_between_square_brackets(text)
    return text
#Apply function on review column
imdb_data['review']=imdb_data['review'].apply(denoise_text)

In [11]:
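#Define function for removing special characters
#(remove_digits is currently unused; digits are always kept)
def remove_special_characters(text, remove_digits=True):
    #A-Z rather than A-z, so that [ \ ] ^ _ ` are removed as well
    pattern=r'[^a-zA-Z0-9\s]'
    text=re.sub(pattern,'',text)
    return text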
#Apply function on review column
imdb_data['review']=imdb_data['review'].apply(remove_special_characters)
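
A character class for letters needs two separate ranges, `a-z` and `A-Z`. A single `A-z` range also covers the six ASCII characters that sit between `Z` and `a`, so brackets, carets and underscores would survive the cleaning step. The sketch below illustrates the difference on a made-up sample string (`sample` is hypothetical, not taken from the dataset):

```python
import re

sample = "rated_[R]^ 10/10!"

# A-z spans ASCII 65..122, which also covers [ \ ] ^ _ and `
loose = re.sub(r"[^a-zA-z0-9\s]", "", sample)

# Separate A-Z and a-z ranges keep only letters, digits and whitespace
strict = re.sub(r"[^a-zA-Z0-9\s]", "", sample)

print(repr(loose))   # 'rated_[R]^ 1010'
print(repr(strict))  # 'ratedR 1010'
```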

In [12]:
#Stemming the text with the Porter stemmer imported above
def simple_stemmer(text):
    ps=PorterStemmer()
    text= ' '.join([ps.stem(word) for word in text.split()])
    return text
#Apply function on review column
imdb_data['review']=imdb_data['review'].apply(simple_stemmer)

In [13]:
imdb_data.head()

Unnamed: 0,review,sentiment
0,one of the other review ha mention that after ...,positive
1,A wonder littl product the film techniqu is ve...,positive
2,I thought thi wa a wonder way to spend time on...,positive
3,basic there a famili where a littl boy jake th...,negative
4,petter mattei love in the time of money is a v...,positive


In [14]:
imdb_data['review'] = imdb_data['review'].str.lower()
imdb_data.head()

Unnamed: 0,review,sentiment
0,one of the other review ha mention that after ...,positive
1,a wonder littl product the film techniqu is ve...,positive
2,i thought thi wa a wonder way to spend time on...,positive
3,basic there a famili where a littl boy jake th...,negative
4,petter mattei love in the time of money is a v...,positive


In [15]:
#set stopwords to english
stop=set(stopwords.words('english'))
print(stop)

#removing the stopwords (membership is tested against the set `stop` for constant-time lookup)
def remove_stopwords(text, is_lower_case=True):
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stop]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stop]
    filtered_text = ' '.join(filtered_tokens)
    return filtered_text
#Apply function on review column
imdb_data['review']=imdb_data['review'].apply(remove_stopwords)

{'of', 'myself', 'above', 'should', 'only', 'been', 'what', 'so', 'our', 'his', 'aren', 'and', 'each', 'with', 'y', 'him', "you're", 'as', 'it', "won't", 'until', 'by', 'all', 'to', 'will', 'wasn', 'itself', 'did', 'there', 'now', 'under', 'your', 'before', 'having', "you've", 'off', 'when', 'can', 'some', 'not', 'again', 'theirs', 'are', 'both', 'isn', 'doesn', 'shan', 'doing', 'my', 'the', 'mightn', "isn't", "it's", 'didn', 'ma', 'here', 'such', 'out', 'they', "you'll", 'themselves', 'were', 'for', 'hasn', 'has', 'does', 'them', 'a', 'be', 'o', "needn't", "shan't", 'hers', 'ain', 'their', 'at', 'own', 'while', 'weren', 'against', 'herself', 'most', 'nor', 'haven', 'after', 'we', 'her', 'because', "hadn't", 'mustn', 'those', "didn't", 'had', 'this', 't', 'same', 'who', "wasn't", 'll', 'he', 'd', 'whom', 'or', 'needn', "mustn't", 'once', 'why', 'from', 'then', "doesn't", 'i', "mightn't", 'that', 'up', 'too', 'no', "haven't", 'more', 'than', 's', 'shouldn', 'she', 'hadn', 'do', 'other',

In [16]:
#normalized train reviews
norm_train_reviews=imdb_data.review[:800]
norm_train_reviews[0]

'one review ha mention watch 1 oz episod youll hook right thi exactli happen meth first thing struck oz wa brutal unflinch scene violenc set right word go trust thi show faint heart timid thi show pull punch regard drug sex violenc hardcor classic use wordit call oz nicknam given oswald maximum secur state penitentari focus mainli emerald citi experiment section prison cell glass front face inward privaci high agenda em citi home manyaryan muslim gangsta latino christian italian irish moreso scuffl death stare dodgi deal shadi agreement never far awayi would say main appeal show due fact goe show wouldnt dare forget pretti pictur paint mainstream audienc forget charm forget romanceoz doesnt mess around first episod ever saw struck nasti wa surreal couldnt say wa readi watch develop tast oz got accustom high level graphic violenc violenc injustic crook guard wholl sold nickel inmat wholl kill order get away well manner middl class inmat turn prison bitch due lack street skill prison exp

In [17]:
#Normalized test reviews
norm_test_reviews=imdb_data.review[800:]
norm_test_reviews[890]

'salva hi pal bigardo margin law dure live see panhandl car underground pitch get donat lame one give anyth salva harden petti crimin doesnt even ani redeem qualiti discov realiti show televis give idea next religion show busi prove win combin someth salva capit onh bigardo jail accident death priest wa critic duo salva show natur realiti show transform christlik figur instant success program esp nononsens woman show produc cant escap way salva pay unusu attent ultim salva victim hi success endjordi molla whose first directori job thi movi success way film satir role televis spain wa vulner type program ha seen share bizarr director felt assault view public want set hi stori point view peopl make fortun naiv audienceth ensembl cast ha good moment film mr molla like ani actor decid direct hi first featur would effect concentr pictur front camera candela pena good actress one best reason watch movi juan carlo villedo david gimenez cacho franco francescoantonio florinda chico rest respond

Bag of words model

It converts each text document into a numerical vector of term counts (a bag of words).

In [18]:
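#Count vectorizer for bag of words
#Note: an integer max_df=1 keeps only terms that occur in at most one document;
#a float max_df=1.0 would keep terms regardless of document frequency
cv=CountVectorizer(min_df=0,max_df=1,binary=False,ngram_range=(1,3))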
#transformed train reviews
cv_train_reviews=cv.fit_transform(norm_train_reviews)
#transformed test reviews
cv_test_reviews=cv.transform(norm_test_reviews)

print('BOW_cv_train:',cv_train_reviews.shape)
print('BOW_cv_test:',cv_test_reviews.shape)

BOW_cv_train: (800, 186459)
BOW_cv_test: (200, 186459)
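
The vocabulary size above depends on both the n-gram range and the document-frequency filter. As a minimal sketch on a hypothetical three-document corpus (`toy_docs` is made up for illustration), the following shows how scikit-learn treats an integer `max_df` as an absolute document count and a float `max_df` as a proportion of documents:

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["good movie", "bad movie", "good plot"]

# Integer max_df: keep terms that appear in at most 1 document
cv_int = CountVectorizer(max_df=1).fit(toy_docs)

# Float max_df: keep terms that appear in at most 100% of documents
cv_float = CountVectorizer(max_df=1.0).fit(toy_docs)

print(sorted(cv_int.vocabulary_))    # ['bad', 'plot']
print(sorted(cv_float.vocabulary_))  # ['bad', 'good', 'movie', 'plot']
```

With the integer setting, the terms shared between documents ("good", "movie") are dropped, and those are usually the terms a classifier can generalize from.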


Term Frequency-Inverse Document Frequency model (TF-IDF)

It converts text documents to a matrix of TF-IDF features.

In [19]:
#Tfidf vectorizer (integer max_df=1 restricts the vocabulary to terms found in at most one document)
tv=TfidfVectorizer(min_df=0,max_df=1,use_idf=True,ngram_range=(1,3))
#transformed train reviews
tv_train_reviews=tv.fit_transform(norm_train_reviews)
#transformed test reviews
tv_test_reviews=tv.transform(norm_test_reviews)
print('Tfidf_train:',tv_train_reviews.shape)
print('Tfidf_test:',tv_test_reviews.shape)

Tfidf_train: (800, 186459)
Tfidf_test: (200, 186459)
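
To make the TF-IDF weighting concrete, here is a small sketch that recomputes it by hand and compares the result with `TfidfVectorizer`. It assumes scikit-learn's default settings (smoothed IDF and L2 row normalization) and uses a hypothetical toy corpus rather than the review data:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

toy_docs = ["good movie", "bad movie", "good good plot"]

# Raw term counts and per-term document frequency
counts = CountVectorizer().fit_transform(toy_docs).toarray()
n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency (scikit-learn default, smooth_idf=True)
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight raw counts by idf, then scale each row to unit L2 norm
weighted = counts * idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(toy_docs).toarray()
print(np.allclose(manual, reference))  # True
```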


In [20]:
#labeling the sentiment data
lb=LabelBinarizer()
#transformed sentiment data
sentiment_data=lb.fit_transform(imdb_data['sentiment'])
print(sentiment_data.shape)

(1000, 1)
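
`LabelBinarizer` sorts the class names before encoding them, so it is worth checking which sentiment ends up as 0 and which as 1. A minimal sketch on made-up labels (`toy_labels` and `lb_demo` are illustrative names only):

```python
from sklearn.preprocessing import LabelBinarizer

toy_labels = ["positive", "negative", "positive"]

lb_demo = LabelBinarizer()
encoded = lb_demo.fit_transform(toy_labels)   # shape (3, 1) for two classes

print(lb_demo.classes_)                       # ['negative' 'positive']
print(encoded.ravel())                        # [1 0 1]
print(lb_demo.inverse_transform(encoded))     # ['positive' 'negative' 'positive']
```

Because the classes are sorted alphabetically, `negative` maps to 0 and `positive` to 1. Any list of display names passed to a report should follow that order.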


In [21]:
#Splitting the sentiment data
train_sentiments=sentiment_data[:800]
test_sentiments=sentiment_data[800:]
print(train_sentiments)
print(test_sentiments)

[[1]
 [1]
 [1]
 [0]
 [1]
 [1]
 [1]
 [0]
 [0]
 [1]
 [0]
 [0]
 [0]
 [0]
 [1]
 [0]
 [1]
 [0]
 [1]
 [0]
 [1]
 [0]
 [1]
 [0]
 [0]
 [1]
 [1]
 [0]
 [0]
 [1]
 [1]
 [1]
 [0]
 [1]
 [0]
 [0]
 [0]
 [0]
 [1]
 [0]
 [0]
 [1]
 [0]
 [0]
 [1]
 [1]
 [0]
 [0]
 [1]
 [0]
 [1]
 [1]
 [1]
 [1]
 [0]
 [0]
 [0]
 [0]
 [1]
 [1]
 [0]
 [0]
 [1]
 [0]
 [0]
 [1]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [1]
 [1]
 [0]
 [1]
 [1]
 [0]
 [0]
 [1]
 [1]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [0]
 [1]
 [0]
 [1]
 [1]
 [0]
 [1]
 [0]
 [0]
 [0]
 [1]
 [1]
 [0]
 [1]
 [1]
 [0]
 [1]
 [1]
 [0]
 [1]
 [1]
 [0]
 [0]
 [0]
 [1]
 [1]
 [1]
 [1]
 [0]
 [0]
 [0]
 [1]
 [1]
 [0]
 [0]
 [1]
 [0]
 [0]
 [0]
 [1]
 [1]
 [1]
 [1]
 [0]
 [0]
 [0]
 [0]
 [1]
 [0]
 [1]
 [0]
 [0]
 [0]
 [0]
 [1]
 [0]
 [1]
 [1]
 [1]
 [0]
 [0]
 [1]
 [0]
 [0]
 [1]
 [0]
 [0]
 [0]
 [1]
 [0]
 [1]
 [1]
 [0]
 [0]
 [0]
 [1]
 [0]
 [1]
 [0]
 [0]
 [0]
 [0]
 [0]
 [1]
 [1]
 [0]
 [0]
 [1]
 [0]
 [1]
 [0]
 [1]
 [0]
 [0]
 [0]
 [0]
 [0]
 [1]
 [0]
 [1]
 [0]
 [1]
 [1]
 [1]
 [1]
 [0]
 [0]
 [0]
 [0]
 [1]
 [0]


In [39]:
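#training the model
from sklearn.naive_bayes import GaussianNB
mnb=GaussianNB()
#fitting Gaussian Naive Bayes on bag of words (GaussianNB needs a dense array)
mnb_bow=mnb.fit(cv_train_reviews.toarray(),train_sentiments)
print(mnb_bow)
#fitting Gaussian Naive Bayes on tfidf features
#note: fit() returns the same estimator, so this refit replaces the bag-of-words fit
#and mnb, mnb_bow and mnb_tfidf all refer to the TF-IDF model
mnb_tfidf=mnb.fit(tv_train_reviews.toarray(),train_sentiments)
print(mnb_tfidf)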

  y = column_or_1d(y, warn=True)


GaussianNB()


  y = column_or_1d(y, warn=True)


GaussianNB()
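
`GaussianNB` only accepts dense arrays, which is why the cell above calls `toarray()`. As an alternative sketch (not the approach used in this notebook), `MultinomialNB`, which is already imported at the top, models term counts directly and accepts the sparse matrix as is. The toy reviews and labels below are invented for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

toy_train = [
    "great wonderful film",
    "wonderful acting great",
    "terrible boring film",
    "boring awful acting",
]
toy_labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative

vec = CountVectorizer()
X_toy = vec.fit_transform(toy_train)   # stays a sparse matrix

clf = MultinomialNB()                  # Laplace smoothing, alpha=1.0 by default
clf.fit(X_toy, toy_labels)

toy_pred = clf.predict(vec.transform(["great wonderful", "boring awful"]))
print(toy_pred)  # [1 0]
```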


In [40]:
#Predicting on the bag of words test features
mnb_bow_predict=mnb.predict(cv_test_reviews.toarray())
print(mnb_bow_predict)
#Predicting on the tfidf test features
mnb_tfidf_predict=mnb.predict(tv_test_reviews.toarray())
print(mnb_tfidf_predict)

[1 1 0 1 0 1 0 0 1 0 0 1 1 1 1 1 1 0 1 0 1 0 1 0 0 1 0 0 0 0 0 1 1 0 1 0 0
 0 1 0 1 1 1 0 1 1 1 1 0 1 0 0 1 0 0 1 1 0 0 0 1 1 0 0 1 0 1 0 1 0 1 0 0 1
 1 1 0 1 0 0 0 1 0 1 0 0 0 1 0 0 0 1 1 1 1 1 1 1 1 0 1 1 0 1 0 1 0 1 0 1 0
 1 0 1 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0 1 1 1 1 0 1 1 1 0 1 0 0 0 0 0 0 0 1 0
 1 1 1 1 1 1 1 0 0 1 0 0 0 1 1 1 0 0 0 1 1 0 1 1 0 1 1 0 0 1 0 1 0 1 1 0 0
 0 1 1 1 0 1 0 0 1 1 0 1 0 0 0]
[1 1 0 1 0 1 0 0 1 0 0 1 1 1 1 1 1 0 1 0 1 0 1 0 0 1 0 0 0 0 0 1 1 0 1 0 0
 0 1 0 1 1 1 0 1 1 1 1 0 1 0 0 1 0 0 1 1 0 0 0 1 1 0 0 1 0 1 0 1 0 1 0 0 1
 1 1 0 1 0 0 0 1 0 1 0 0 0 1 0 0 0 1 1 1 1 1 1 1 1 0 1 1 0 1 0 1 0 1 0 1 0
 1 0 1 1 1 0 0 0 1 1 0 1 1 0 0 0 0 0 1 1 1 1 0 1 1 1 0 1 0 0 0 0 0 0 0 1 0
 1 1 1 1 1 1 1 0 0 1 0 0 0 1 1 1 0 0 0 1 1 0 1 1 0 1 1 0 0 1 0 1 0 1 1 0 0
 0 1 1 1 0 1 0 0 1 1 0 1 0 0 0]


In [41]:
#Accuracy score for bag of words
mnb_bow_score=accuracy_score(test_sentiments,mnb_bow_predict)
print("mnb_bow_score :",mnb_bow_score)
#Accuracy score for tfidf features
mnb_tfidf_score=accuracy_score(test_sentiments,mnb_tfidf_predict)
print("mnb_tfidf_score :",mnb_tfidf_score)

mnb_bow_score : 0.605
mnb_tfidf_score : 0.605


In [42]:
#Classification report for bag of words
#LabelBinarizer encodes negative as 0 and positive as 1, so target_names follows that order
mnb_bow_report=classification_report(test_sentiments,mnb_bow_predict,target_names=['Negative','Positive'])
print(mnb_bow_report)
#Classification report for tfidf features
mnb_tfidf_report=classification_report(test_sentiments,mnb_tfidf_predict,target_names=['Negative','Positive'])
print(mnb_tfidf_report)

              precision    recall  f1-score   support

    Negative       0.55      0.61      0.58        89
    Positive       0.66      0.60      0.63       111

    accuracy                           0.60       200
   macro avg       0.60      0.61      0.60       200
weighted avg       0.61      0.60      0.61       200

              precision    recall  f1-score   support

    Negative       0.55      0.61      0.58        89
    Positive       0.66      0.60      0.63       111

    accuracy                           0.60       200
   macro avg       0.60      0.61      0.60       200
weighted avg       0.61      0.60      0.61       200
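
The objective asks for error, sensitivity and specificity, which the classification report shows only indirectly: recall for label 1 (the second row) is the sensitivity and recall for label 0 (the first row) is the specificity. A minimal sketch of computing them explicitly from a confusion matrix follows, using made-up toy labels. In this notebook, `test_sentiments` and `mnb_bow_predict` would presumably take the place of `y_true` and `y_pred`:

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 1 = positive, 0 = negative
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]

# With labels=[0, 1] the matrix is [[tn, fp], [fn, tp]]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

sensitivity = tp / (tp + fn)                    # true positive rate
specificity = tn / (tn + fp)                    # true negative rate
error_rate = (fp + fn) / (tp + tn + fp + fn)    # 1 - accuracy

print(sensitivity, specificity, error_rate)     # 0.75 0.75 0.25
```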



In [31]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis(n_components=1)
X_train = lda.fit_transform(cv_train_reviews.toarray(),train_sentiments)
X_test = lda.transform(cv_test_reviews.toarray())


  y = column_or_1d(y, warn=True)


In [37]:
from sklearn.naive_bayes import GaussianNB
mnb_lda = GaussianNB()

mnb_lda.fit(X_train, train_sentiments)

print("Accuracy:",mnb_lda.score(X_test, test_sentiments))

Accuracy: 0.6849999999999999


  y = column_or_1d(y, warn=True)
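
The two cells above project the features onto a single LDA component and then train Gaussian Naive Bayes on that projection. The same pattern can be sketched end to end on synthetic data. Here `make_classification` stands in for the vectorized reviews, and every name is illustrative rather than part of the notebook:

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic two-class data standing in for the vectorized reviews
X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=0, stratify=y)

# With two classes LDA can produce at most n_classes - 1 = 1 component
lda_demo = LinearDiscriminantAnalysis(n_components=1)
Z_tr = lda_demo.fit_transform(X_tr, y_tr)   # fit on training data only
Z_te = lda_demo.transform(X_te)             # reuse the fitted projection

nb_demo = GaussianNB().fit(Z_tr, y_tr)
demo_accuracy = nb_demo.score(Z_te, y_te)
print(Z_tr.shape, Z_te.shape)               # (300, 1) (100, 1)
print("accuracy:", demo_accuracy)
```

Fitting the projection on the training split only, as the notebook also does, keeps test information out of the learned direction.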


In [None]:
lda.fit(cv_train_reviews.todense(), train_sentiments)
print(f"Training Score: {lda.score(cv_train_reviews,train_sentiments)}")
print(f"Test Score: {lda.score(cv_test_reviews,test_sentiments)}")

  y = column_or_1d(y, warn=True)


Training Score: 0.885
Test Score: 0.58


**Conclusion**


The IMDB corpus used in this experiment needs preprocessing (cleaning, tokenization, stemming and stop-word removal) and conversion to a document-term matrix with a vectorizer before a classification model can be trained on it.


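Gaussian Naive Bayes reached an accuracy of 60.5 % on this dataset before dimensionality reduction, for both bag-of-words and TF-IDF features. After the features were reduced to a single LDA component, Gaussian Naive Bayes reached 68.5 %. LDA used directly as a classifier scored 88.5 % on the training set but only 58 % on the test set, which points to overfitting.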


Dimensionality reduction lowers the time and storage space required. It helps remove multicollinearity, which makes the parameters of the machine learning model easier to interpret. Data also becomes easier to visualize once it is reduced to very low dimensions such as 2D or 3D.