For my project defence I plan to analyse three different datasets with different algorithms - my plan now is to use the same algorithms and then classify with just the datasets at the same time.

### Multinomial Naive Bayes for IMDB Movie review dataset

IMDB dataset having 50K movie reviews for natural language processing or Text analytics. This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training and 25,000 for testing. So, predict the number of positive and negative reviews using either classification or deep learning algorithms. For more dataset information, please go through the following link, http://ai.stanford.edu/~amaas/data/sentiment/

In [1]:
import numpy as np   # a useful datastructure
import pandas as pd  # for data preprocessing
import pickle
import os


In [2]:
data = pd.read_csv("IMDB_Dataset.csv")
data.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


In [3]:
y = data.sentiment

In [4]:
y.head()

0    positive
1    positive
2    positive
3    negative
4    positive
Name: sentiment, dtype: object

In [5]:
label = {'positive':1, 'negative' :0}

def preprocess_y(sentiment):
    return label[sentiment]



In [6]:
y = y.apply(preprocess_y)
y.head()

0    1
1    1
2    1
3    0
4    1
Name: sentiment, dtype: int64

In [7]:
X = data.review
X.head()

0    One of the other reviewers has mentioned that ...
1    A wonderful little production. <br /><br />The...
2    I thought this was a wonderful way to spend ti...
3    Basically there's a family where a little boy ...
4    Petter Mattei's "Love in the Time of Money" is...
Name: review, dtype: object

Now for our Data Preprocessing

In [8]:
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words("english"))

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Ukachi\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Ukachi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [9]:
import re
def preprocess(review):
    #convert the tweet to lower case
    review.lower()
    #convert all urls to sting "URL"
    review = re.sub('((www\.[^\s]+)|(https?://[^\s]+))','URL',review)
    #convert all @username to "AT_USER"
    review = re.sub('@[^\s]+','AT_USER', review)
    #correct all multiple white spaces to a single white space
    review = re.sub('[\s]+', ' ', review)
    #convert "#topic" to just "topic"
    review = re.sub(r'#([^\s]+)', r'\1', review)
    tokens = word_tokenize(review)
    tokens = [w for w in tokens if not w in stop_words]
    return " ".join(tokens)

X = X.apply(preprocess)
X.head()

0    One reviewers mentioned watching 1 Oz episode ...
1    A wonderful little production . < br / > < br ...
2    I thought wonderful way spend time hot summer ...
3    Basically 's family little boy ( Jake ) thinks...
4    Petter Mattei 's `` Love Time Money '' visuall...
Name: review, dtype: object

In [10]:
import pickle
from sklearn.feature_extraction. text import TfidfVectorizer

def feature_extraction(data):
    tfv=TfidfVectorizer(sublinear_tf=True, stop_words="english")
    
    features = tfv.fit_transform(data)
    pickle.dump(tfv.vocabulary_,open("nb_feature.pkl","wb"))
    return features

data = np.array(X)
label = np.array(y)

features = feature_extraction(data)

print(features)

  (0, 75175)	0.07683718114148
  (0, 57308)	0.06624810814175679
  (0, 97762)	0.06602760412271698
  (0, 65154)	0.2509078981889345
  (0, 29979)	0.09795247296693152
  (0, 52987)	0.09506738058419874
  (0, 42657)	0.0840621839588345
  (0, 75579)	0.07345376450404376
  (0, 30834)	0.0584006628471939
  (0, 40269)	0.0581639744133865
  (0, 11951)	0.058076375863825386
  (0, 89993)	0.03916872881162975
  (0, 86227)	0.1423296497616121
  (0, 12884)	0.09343899459074295
  (0, 94168)	0.11301273835429296
  (0, 78409)	0.03850810122811244
  (0, 96683)	0.14130598712509415
  (0, 79965)	0.04717833901676
  (0, 99600)	0.10049007287848823
  (0, 92537)	0.07374873091476265
  (0, 31776)	0.0999572415413897
  (0, 40989)	0.07831441586249126
  (0, 90563)	0.10724872055245468
  (0, 71260)	0.08008369873381428
  (0, 71307)	0.09436521863985418
  :	:
  (49999, 94206)	0.11490080515017852
  (49999, 51790)	0.10313084807864072
  (49999, 45842)	0.20268533892717316
  (49999, 31177)	0.11845532639332741
  (49999, 7096)	0.12974722810563

In [11]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features, label, test_size = 0.20)

In [12]:
# Fitting Multinomial Naive Bayes classifier to the Training set
from sklearn.naive_bayes import MultinomialNB
classifier = MultinomialNB()
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_recall_fscore_support as score
cm = confusion_matrix(y_test, y_pred)
precision, recall, fscore, support = score(y_test, y_pred,average='macro')
print('Precision: {}'.format(precision))
print('Recall: {}'.format(recall))
print('Fscore: {}'.format(fscore))
print('Support: {}'.format(support))

Precision: 0.8652818462053554
Recall: 0.8651850607402429
Fscore: 0.8651885895622204
Support: None


In [13]:
# Fitting Bernoulli Naive Bayes classifier to the Training set
from sklearn.naive_bayes import BernoulliNB
classifier = BernoulliNB()
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_recall_fscore_support as score
cm = confusion_matrix(y_test, y_pred)
precision, recall, fscore, support = score(y_test, y_pred,average='macro')
print('Precision: {}'.format(precision))
print('Recall: {}'.format(recall))
print('Fscore: {}'.format(fscore))
print('Support: {}'.format(support))

Precision: 0.850153619775179
Recall: 0.8482271929087717
Fscore: 0.8480794706403763
Support: None


In [15]:
from sklearn.feature_extraction.text import TfidfVectorizer
def feature_extraction(data):
    tfv=TfidfVectorizer(sublinear_tf=True, stop_words = "english")
    features=tfv.fit_transform(data)
    pickle.dump(tfv.vocabulary_, open("svm_feature.pkl", "wb"))
    return features

data = np.array(X)
label = np.array(y)
features = feature_extraction(data)

print(features)

  (0, 75175)	0.07683718114148
  (0, 57308)	0.06624810814175679
  (0, 97762)	0.06602760412271698
  (0, 65154)	0.2509078981889345
  (0, 29979)	0.09795247296693152
  (0, 52987)	0.09506738058419874
  (0, 42657)	0.0840621839588345
  (0, 75579)	0.07345376450404376
  (0, 30834)	0.0584006628471939
  (0, 40269)	0.0581639744133865
  (0, 11951)	0.058076375863825386
  (0, 89993)	0.03916872881162975
  (0, 86227)	0.1423296497616121
  (0, 12884)	0.09343899459074295
  (0, 94168)	0.11301273835429296
  (0, 78409)	0.03850810122811244
  (0, 96683)	0.14130598712509415
  (0, 79965)	0.04717833901676
  (0, 99600)	0.10049007287848823
  (0, 92537)	0.07374873091476265
  (0, 31776)	0.0999572415413897
  (0, 40989)	0.07831441586249126
  (0, 90563)	0.10724872055245468
  (0, 71260)	0.08008369873381428
  (0, 71307)	0.09436521863985418
  :	:
  (49999, 94206)	0.11490080515017852
  (49999, 51790)	0.10313084807864072
  (49999, 45842)	0.20268533892717316
  (49999, 31177)	0.11845532639332741
  (49999, 7096)	0.12974722810563

In [16]:
# Fitting Support Vector Machine Classifier to the Training set
from sklearn.svm import SVC
svclassifier = SVC(kernel='linear')
svclassifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = svclassifier.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_recall_fscore_support as score
cm = confusion_matrix(y_test, y_pred)
precision, recall, fscore, support = score(y_test, y_pred,average='macro')
print('Precision: {}'.format(precision))
print('Recall: {}'.format(recall))
print('Fscore: {}'.format(fscore))
print('Support: {}'.format(support))

Precision: 0.896580022023722
Recall: 0.8965145860583442
Fscore: 0.8964968690302881
Support: None


In [17]:
# Fitting Logistic Regression Classifier to the Training set
from sklearn import linear_model
logReg = linear_model.LogisticRegression(solver='lbfgs', C=1000)
logClassifier = logReg.fit(X_train, y_train)

# Predicting the Test set results
y_pred = logClassifier.predict(X_test)

# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_recall_fscore_support as score
cm = confusion_matrix(y_test, y_pred)
precision, recall, fscore, support = score(y_test, y_pred,average='macro')
print('Precision: {}'.format(precision))
print('Recall: {}'.format(recall))
print('Fscore: {}'.format(fscore))
print('Support: {}'.format(support))

Precision: 0.8781164258749785
Recall: 0.8781069124276497
Fscore: 0.8780996477079819
Support: None




In [1]:
from subprocess import check_call
check_call(['dot','-Tpng','tree.dot','-o','tree.png'])

FileNotFoundError: [WinError 2] The system cannot find the file specified