# Business Question

The task is to build a model that predicts the sentiment (0 = negative, 1 = positive) of a customer's movie review. The steps are:

    Write a Function to remove punctuation and stopwords
    Process the Data
    Vectorize
    TF-IDF
    Model (Multinomial Naive Bayes)

Dataset link:
https://github.com/niteen11/data301_predictive_analytics_machine_learning/blob/main/data/imdb_labelled.txt

In [73]:
import pandas as pd

In [74]:
import numpy as np

In [75]:
email_data_df = pd.read_csv('https://raw.githubusercontent.com/niteen11/data301_predictive_analytics_machine_learning/main/data/imdb_labelled.txt', sep='\t',names=['comment','sentiment'])

In [76]:
email_data_df.head()

Unnamed: 0,comment,sentiment
0,"A very, very, very slow-moving, aimless movie ...",0
1,Not sure who was more lost - the flat characte...,0
2,Attempting artiness with black & white and cle...,0
3,Very little music or anything to speak of.,0
4,The best scene in the movie was when Gerardo i...,1


In [77]:
email_data_df.shape

(748, 2)

In [78]:
email_data_df['comment'][0]

'A very, very, very slow-moving, aimless movie about a distressed, drifting young man.  '

**Length of the longest comment**

In [79]:
max(email_data_df['comment'].apply(len))

7944

In [80]:
msg_7944 = email_data_df[email_data_df['comment'].apply(len)==7944]

**Row with longest comment**

In [81]:
msg_7944

Unnamed: 0,comment,sentiment
136,"In fact, it's hard to remember that the part ...",0


**Reviewing the longest comment, at row index 136**

In [82]:
msg_7944.comment

136     In fact, it's hard to remember that the part ...
Name: comment, dtype: object

In [83]:
msg_7944.comment.iloc[0]

' In fact, it\'s hard to remember that the part of Ray Charles is being acted, and not played by the man himself.  \t1\nRay Charles is legendary.  \t1\nRay Charles\' life provided excellent biographical material for the film, which goes well beyond being just another movie about a musician.  \t1\nHitchcock is a great director.  \t1\nIronically I mostly find his films a total waste of time to watch.  \t0\nSecondly, Hitchcock pretty much perfected the thriller and chase movie.  \t1\nIt\'s this pandering to the audience that sabotages most of his films.  \t0\nHence the whole story lacks a certain energy.  \t0\nThe plot simply rumbles on like a machine, desperately depending on the addition of new scenes.  \t0\nThere are the usual Hitchcock logic flaws.  \t0\nMishima is extremely uninteresting.  \t0\nThis is a chilly, unremarkable movie about an author living/working in a chilly abstruse culture.  \t0\nThe flat reenactments don\'t hold your attention because they are emotionally adrift and
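The embedded `\t1\n` and `\t0\n` sequences in the output above suggest that many separate reviews were merged into this one row, most likely because an unbalanced double quote made the parser treat the line breaks as part of a quoted field. A minimal sketch of that effect and one possible remedy, using a made-up three-line sample rather than the real file (the sample text is an assumption for illustration only):

```python
import csv
import io

import pandas as pd

# Hypothetical sample in the same "comment<TAB>label" layout.
# Line 1 opens a double quote that is only closed by the quote on line 3.
sample = (
    '"Great acting.  \t1\n'
    'Boring plot.  \t0\n'
    '"Loved it.  \t1\n'
)

# Default quoting: the text between the two quotes is read as one field,
# so the line breaks inside it do not start new rows.
default_df = pd.read_csv(io.StringIO(sample), sep='\t',
                         names=['comment', 'sentiment'])

# QUOTE_NONE: quote characters are treated as ordinary text,
# so every physical line becomes its own row.
literal_df = pd.read_csv(io.StringIO(sample), sep='\t',
                         names=['comment', 'sentiment'],
                         quoting=csv.QUOTE_NONE)

print(len(default_df), len(literal_df))
print(literal_df['sentiment'].tolist())
```

If the same cause applies here, passing `quoting=csv.QUOTE_NONE` to the `pd.read_csv` call in cell 75 would be worth trying, followed by a fresh look at `email_data_df.shape` and the longest comment.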

# Function to Remove Punctuation and Stopwords
### To automate the text preprocessing step

In [84]:
import string

In [85]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [86]:
from nltk.corpus import stopwords

In [87]:
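def message_text_pre_process(text_message):
  # Load the stop-word list once per call instead of once per word
  english_stopwords = set(stopwords.words('english'))
  # Drop every punctuation character, then rejoin into a single string
  remove_punct = ''.join(char for char in text_message if char not in string.punctuation)
  # Keep the whitespace-separated tokens that are not stop words (original case is preserved)
  remove_stopwords = [word for word in remove_punct.split() if word.lower() not in english_stopwords]
  return remove_stopwords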
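To make the two cleaning steps concrete, here is a small download-free sketch of the same logic. `STOP_WORDS` is a hypothetical hand-picked set standing in for NLTK's English list, and `preprocess_sketch` is an illustrative name, not part of the notebook.

```python
import string

# Hypothetical mini stop-word set; the notebook uses stopwords.words('english').
STOP_WORDS = {'a', 'very', 'about', 'the', 'and', 'of', 'was'}


def preprocess_sketch(text):
    # Step 1: drop punctuation character by character, then rejoin.
    no_punct = ''.join(ch for ch in text if ch not in string.punctuation)
    # Step 2: split on whitespace and keep tokens that are not stop words.
    return [word for word in no_punct.split() if word.lower() not in STOP_WORDS]


tokens = preprocess_sketch(
    'A very, very, very slow-moving, aimless movie about a distressed, '
    'drifting young man.  '
)
print(tokens)
# ['slowmoving', 'aimless', 'movie', 'distressed', 'drifting', 'young', 'man']
```

Two side effects are worth noticing: removing the hyphen fuses "slow-moving" into "slowmoving", and the surviving tokens keep their original case, so "Bad" and "bad" would become separate features later on.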

**Call the function above to remove punctuation and stopwords from each comment (the result is only displayed, not stored back in the DataFrame)**

In [88]:
email_data_df['comment'].apply(message_text_pre_process)

0      [slowmoving, aimless, movie, distressed, drift...
1      [sure, lost, flat, characters, audience, nearl...
2      [Attempting, artiness, black, white, clever, c...
3                       [little, music, anything, speak]
4      [best, scene, movie, Gerardo, trying, find, so...
                             ...                        
743    [got, bored, watching, Jessice, Lange, take, c...
744    [Unfortunately, virtue, films, production, wor...
745                                 [word, embarrassing]
746                                 [Exceptionally, bad]
747     [insult, ones, intelligence, huge, waste, money]
Name: comment, Length: 748, dtype: object

In [89]:
email_data_df.head()

Unnamed: 0,comment,sentiment
0,"A very, very, very slow-moving, aimless movie ...",0
1,Not sure who was more lost - the flat characte...,0
2,Attempting artiness with black & white and cle...,0
3,Very little music or anything to speak of.,0
4,The best scene in the movie was when Gerardo i...,1


# Vectorization
This step maps each distinct token in the processed comments to a column and counts how often it occurs in each comment. The result is a sparse document-term matrix of word counts.

In [90]:
from sklearn.feature_extraction.text import CountVectorizer

In [91]:
bag_of_words = CountVectorizer(analyzer=message_text_pre_process).fit(email_data_df['comment'])

In [92]:
bag_of_words_trf = bag_of_words.transform(email_data_df['comment'])
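The fit/transform pair above can be seen on a toy corpus. This is a minimal sketch: the three documents and the `whitespace_tokens` analyzer are made up for illustration and stand in for the comments and for `message_text_pre_process`.

```python
from sklearn.feature_extraction.text import CountVectorizer


def whitespace_tokens(text):
    # Hypothetical stand-in analyzer: split on whitespace only.
    return text.split()


docs = ['good movie', 'bad movie', 'good good acting']

# fit() learns the vocabulary; transform() counts tokens per document.
vectorizer = CountVectorizer(analyzer=whitespace_tokens).fit(docs)
counts = vectorizer.transform(docs)

# Column order follows the alphabetically sorted vocabulary.
vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocabulary)
print(counts.toarray())
```

Each row is one document and each column one token, so the third row records "good" twice and "acting" once.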

# TF-IDF (Transformer)
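This step reweights the bag-of-words counts by term frequency-inverse document frequency (TF-IDF), so that tokens appearing in many comments count for less than tokens concentrated in a few. Each comment is still represented by a numerical feature vector of fixed size.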

In [93]:
from sklearn.feature_extraction.text import TfidfTransformer

In [94]:
tfidf_fit = TfidfTransformer().fit(bag_of_words_trf)

In [95]:
tfidf_trf = tfidf_fit.transform(bag_of_words_trf)
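As a rough check of what the transformer computes, the sketch below reproduces scikit-learn's default weighting by hand on a toy count matrix (the matrix is made up for illustration). With the defaults `smooth_idf=True` and `norm='l2'`, the inverse document frequency is `ln((1 + n) / (1 + df)) + 1`, and each row is then scaled to unit length.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfTransformer

# Toy document-term counts: 3 documents, 4 tokens.
counts = np.array([[0, 0, 1, 1],
                   [0, 1, 0, 1],
                   [1, 0, 2, 0]])

n_docs = counts.shape[0]
doc_freq = (counts > 0).sum(axis=0)            # documents containing each token
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

weighted = counts * idf                        # term frequency times idf
manual = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

# Same computation through the library, using its default settings.
library = TfidfTransformer().fit_transform(csr_matrix(counts)).toarray()

matches = np.allclose(manual, library)
print(matches)
```

Tokens that occur in fewer documents get a larger `idf`, which is why a rare token can outweigh a frequent one after reweighting.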

# Model Building

In [96]:
from sklearn.naive_bayes import MultinomialNB

In [97]:
spam_detector_model = MultinomialNB().fit(tfidf_trf,email_data_df['sentiment'])

In [98]:
test_message = email_data_df['comment'][4]

In [99]:
# test_message = 'Winner !!! You won 10 million'

In [100]:
bag_of_words_test_message = bag_of_words.transform([test_message])

In [101]:
tfidf_test_messsge = tfidf_fit.transform(bag_of_words_test_message)

In [102]:
spam_detector_model.predict(tfidf_test_messsge)[0]

1

In [103]:
email_data_df['sentiment'][4]

1
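The three steps used for this single prediction (count, reweight, classify) can also be bundled so that raw text goes in and a label comes out. A minimal sketch using scikit-learn's `Pipeline` with a made-up four-comment training set; the step names, the toy data, and the default tokenizer (instead of `message_text_pre_process`) are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy training data: 1 = positive, 0 = negative.
train_comments = [
    'great film loved it',
    'wonderful acting great story',
    'terrible plot boring film',
    'awful boring waste',
]
train_labels = [1, 1, 0, 0]

# Each step feeds its output to the next one.
sentiment_pipeline = Pipeline([
    ('counts', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('nb', MultinomialNB()),
])
sentiment_pipeline.fit(train_comments, train_labels)

# New text is vectorized with the vocabulary learned during fit().
predicted = sentiment_pipeline.predict(['great story', 'boring waste'])
print(predicted)
```

Keeping the steps in one object avoids having to remember to call `transform` on both fitted objects, in the right order, for every new message.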

In [104]:
email_data_df.head()

Unnamed: 0,comment,sentiment
0,"A very, very, very slow-moving, aimless movie ...",0
1,Not sure who was more lost - the flat characte...,0
2,Attempting artiness with black & white and cle...,0
3,Very little music or anything to speak of.,0
4,The best scene in the movie was when Gerardo i...,1


In [105]:
prediction_for_all_messages = spam_detector_model.predict(tfidf_trf)

In [106]:
from sklearn.metrics import classification_report

In [107]:
print(classification_report(email_data_df['sentiment'],prediction_for_all_messages))

              precision    recall  f1-score   support

           0       0.98      0.98      0.98       362
           1       0.98      0.98      0.98       386

    accuracy                           0.98       748
   macro avg       0.98      0.98      0.98       748
weighted avg       0.98      0.98      0.98       748
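The report above is computed on the same rows the model was fitted on. One common way to get a less optimistic estimate is to hold out part of the data before fitting. A minimal sketch on made-up data; the toy comments, the 25% split, and the variable names are assumptions for illustration, not results from this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data: 1 = positive, 0 = negative.
comments = [
    'great film loved it', 'wonderful acting great story',
    'loved the wonderful cast', 'great fun great cast',
    'terrible plot boring film', 'awful boring waste',
    'boring story terrible acting', 'awful film waste of time',
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Hold out 25% of the rows; stratify keeps both classes in each part.
X_train, X_test, y_train, y_test = train_test_split(
    comments, labels, test_size=0.25, stratify=labels, random_state=0)

# Fit the vectorizer and the TF-IDF weights on the training part only.
vectorizer = CountVectorizer().fit(X_train)
tfidf = TfidfTransformer().fit(vectorizer.transform(X_train))
model = MultinomialNB().fit(tfidf.transform(vectorizer.transform(X_train)), y_train)

# Evaluate on rows the model has never seen.
held_out_predictions = model.predict(tfidf.transform(vectorizer.transform(X_test)))
print(classification_report(y_test, held_out_predictions, zero_division=0))
```

On the real data, `comments` and `labels` would be `email_data_df['comment']` and `email_data_df['sentiment']`, and the held-out scores would normally be expected to come out lower than the training scores.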



# Conclusion

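Our model classifies the sentiment of a movie comment correctly 98% of the time, but this figure is measured on the same 748 rows the model was trained on. It is therefore a training accuracy and likely overstates how well the model would perform on unseen comments.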