<a href="https://colab.research.google.com/github/VanessaEzeoke/Data_refresh/blob/main/Sentiment_Analytic_Model_Demo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 1. Collect the Data

In [1]:
!pip install opendatasets

Collecting opendatasets
  Downloading opendatasets-0.1.22-py3-none-any.whl (15 kB)
Installing collected packages: opendatasets
Successfully installed opendatasets-0.1.22


In [2]:
import opendatasets as od
import pandas as pd
import numpy as np
import json

### 1a - Extract the data

In [3]:
# kaggle_key= json.load(open('kaggle.json'))
# kaggle_key["username"]
# kaggle_key["key"]
od.download("https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews")

Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username: sunniev
Your Kaggle Key: ··········
Downloading imdb-dataset-of-50k-movie-reviews.zip to ./imdb-dataset-of-50k-movie-reviews


100%|██████████| 25.7M/25.7M [00:00<00:00, 124MB/s] 





In [4]:
data= pd.read_csv('imdb-dataset-of-50k-movie-reviews/IMDB Dataset.csv')
print("Data shape:",data.shape)
data.head(10)

Data shape: (50000, 2)


Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive
5,"Probably my all-time favorite movie, a story o...",positive
6,I sure would like to see a resurrection of a u...,positive
7,"This show was an amazing, fresh & innovative i...",negative
8,Encouraged by the positive comments about this...,negative
9,If you like original gut wrenching laughter yo...,positive


### 1b. Exploratory Data Analysis

For numerical data, this would entail outlier analysis, checking the distribution of each feature, and checking relationships between features (e.g. correlation).

In [5]:
#Summary of the dataset
data.describe()

Unnamed: 0,review,sentiment
count,50000,50000
unique,49582,2
top,Loved today's show!!! It was a variety and not...,positive
freq,5,25000


In [6]:
data['sentiment'].value_counts()

positive    25000
negative    25000
Name: sentiment, dtype: int64

### 1c. Train Test Split

Splitting the dataset into train and test sets. The first 40,000 reviews are used for training and the remaining 10,000 for testing.

In [7]:
#Train data
train_reviews=data.review[:40000]
train_sentiments=data.sentiment[:40000]

#test dataset
test_reviews=data.review[40000:]
test_sentiments=data.sentiment[40000:]
#validate split

print("train data shape:",train_reviews.shape,train_sentiments.shape)
print("test data shape:", test_reviews.shape,test_sentiments.shape)

train data shape: (40000,) (40000,)
test data shape: (10000,) (10000,)
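
The slice above relies on the rows being in no particular order. As a hedged alternative, here is a minimal sketch of shuffling with a fixed seed before cutting, so the split does not depend on file order. The helper name `shuffled_split` and the toy data are assumptions for illustration, not part of the notebook.

```python
import random

def shuffled_split(items, labels, train_fraction=0.8, seed=345):
    """Shuffle paired items/labels with a fixed seed, then cut once."""
    if len(items) != len(labels):
        raise ValueError("items and labels must have the same length")
    order = list(range(len(items)))
    random.Random(seed).shuffle(order)  # reproducible permutation
    cut = int(len(order) * train_fraction)
    train_idx, test_idx = order[:cut], order[cut:]
    pick = lambda seq, idx: [seq[i] for i in idx]
    return (pick(items, train_idx), pick(items, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))

# Toy stand-in for the review/sentiment columns (not the real dataset).
reviews = [f"review {i}" for i in range(10)]
sentiments = ["positive" if i % 2 == 0 else "negative" for i in range(10)]

train_r, test_r, train_s, test_s = shuffled_split(reviews, sentiments)
print(len(train_r), len(test_r))
```

Each review stays paired with its own label because both lists are indexed by the same permutation.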


## 2. Prepare the data

In [8]:
from bs4 import BeautifulSoup as bs
import re

### 2a. Clean the data

This entails stripping HTML tags, removing bracketed text and punctuation, and putting all text in lower case. For numerical data, this would involve filling or replacing missing values, scaling and normalising the data, dropping specific features, etc.


In [9]:
#Stripping the HTML tags
def strip_html(text):
    soup = bs(text, "html.parser")
    return soup.get_text()

#Removing text inside square brackets
def remove_brackets(text):
    return re.sub(r'\[[^]]*\]', '', text)

#Removing punctuation marks (digits and underscores are kept, since \w matches them)
def remove_punct(text):
    return re.sub(r'[^\w\s]','',text)

#Removing the noisy text
def clean_text(text):
    text = strip_html(text)
    text = remove_brackets(text)
    text = remove_punct(text)
    return text.lower()

In [10]:
#Apply function on review column
data['review']=data['review'].apply(clean_text)



In [11]:
data.head()

Unnamed: 0,review,sentiment
0,one of the other reviewers has mentioned that ...,positive
1,a wonderful little production the filming tech...,positive
2,i thought this was a wonderful way to spend ti...,positive
3,basically theres a family where a little boy j...,negative
4,petter matteis love in the time of money is a ...,positive
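
To make the effect of each cleaning step visible on a single string, here is a minimal sketch. It is a regex-only stand-in: the tag removal uses a crude pattern instead of BeautifulSoup, and the name `clean_text_sketch` and the sample sentence are assumptions for illustration.

```python
import re

def clean_text_sketch(text):
    """Regex-only stand-in for the notebook's clean_text (no BeautifulSoup)."""
    text = re.sub(r"<[^>]+>", "", text)     # crude HTML tag removal
    text = re.sub(r"\[[^\]]*\]", "", text)  # drop [bracketed] spans
    text = re.sub(r"[^\w\s]", "", text)     # drop punctuation; digits and _ survive
    return text.lower()

sample = "A <b>wonderful</b> film [spoiler]!! <br /><br />Rated 10/10."
cleaned = clean_text_sketch(sample)
print(cleaned.split())
```

Because punctuation is deleted rather than replaced by a space, "10/10" collapses to "1010". The same effect turns "all-time" into "alltime" in the cleaned reviews.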


### 2b. Feature engineering

This entails creating new features from existing ones. For natural language processing, this involves removing stop words, stemming, counting the most frequently occurring words and converting the words into vectors (TF-IDF). For numerical data, this would involve creating new features, e.g. computing a number of days by subtracting two dates, or one-hot encoding.


In [12]:
import nltk
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from nltk.tokenize.toktok import ToktokTokenizer
from sklearn.preprocessing import LabelBinarizer

In [13]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

Stop words

This involves identifying and removing stop words. Stop words are words that occur frequently in sentences but may carry little information for a model, e.g. "she", "he", "they", "are". They are important for structuring sentences but add little value as model features.

In [14]:
sw = stopwords.words("english")
print("the corpus has",len(sw),"stop words","\n")
print("the first 14 stop words are",sw[:14])
stop=set(sw)  # a set gives constant-time membership checks

tokenizer=ToktokTokenizer()
def remove_stopwords(text, is_lower_case=False):
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stop]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stop]
    filtered_text = ' '.join(filtered_tokens)
    return filtered_text

the corpus has 179 stop words 

the first 14 stop words are ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your']


In [15]:
data['review']=data['review'].apply(remove_stopwords)
data.head(10)

Unnamed: 0,review,sentiment
0,one reviewers mentioned watching 1 oz episode ...,positive
1,wonderful little production filming technique ...,positive
2,thought wonderful way spend time hot summer we...,positive
3,basically theres family little boy jake thinks...,negative
4,petter matteis love time money visually stunni...,positive
5,probably alltime favorite movie story selfless...,positive
6,sure would like see resurrection dated seahunt...,positive
7,show amazing fresh innovative idea 70s first a...,negative
8,encouraged positive comments film looking forw...,negative
9,like original gut wrenching laughter like movi...,positive
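
One caveat worth illustrating: the NLTK English list includes negations such as "not" and "no", and dropping them can hide the polarity of a review. The sketch below uses a hypothetical miniature stop list (`TOY_STOPWORDS`) and plain whitespace tokenisation rather than the notebook's `ToktokTokenizer`. It is only meant to show the effect, not to replace the function above.

```python
# Hypothetical miniature stop list; the real NLTK list has 179 entries.
TOY_STOPWORDS = {"i", "the", "was", "it", "this", "a", "not", "no"}
NEGATIONS = {"not", "no"}

def remove_stopwords_sketch(text, stop_words):
    """Whitespace-tokenise and drop any token found in stop_words."""
    return " ".join(tok for tok in text.split() if tok.lower() not in stop_words)

review = "this was not a good movie"
print(remove_stopwords_sketch(review, TOY_STOPWORDS))              # good movie
print(remove_stopwords_sketch(review, TOY_STOPWORDS - NEGATIONS))  # not good movie
```

Whether keeping negations helps on this dataset would need to be checked empirically.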


Stemming the text:


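This involves reducing a word to its stem by stripping suffixes while trying to retain the core meaning, e.g. "schooling" becomes "school", while "education", "educating" and "educate" all collapse to the same truncated stem. Stemming is related to, but not the same as, lemmatisation: a lemmatiser maps words to dictionary forms, whereas a stemmer applies suffix rules and may produce strings that are not real words.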

In [16]:
# Text Stemming
ps = PorterStemmer()  # create the stemmer once and reuse it for every review
def stemmer(text):
    return ' '.join([ps.stem(word) for word in text.split()])

In [17]:
data['review']=data['review'].apply(stemmer)

Splitting the data

In [None]:
#Train
norm_train_reviews=data.review[:40000]
norm_train_reviews[0]


In [None]:
#Test reviews
norm_test_reviews=data.review[40000:]
norm_test_reviews[45005]

**Term Frequency-Inverse Document Frequency model** [(TFIDF)](https://medium.com/@cmukesh8688/tf-idf-vectorizer-scikit-learn-dbc0244a911a)

It is used to convert text documents to a matrix of TF-IDF features. Each word in each review is given a weight that grows with how often the word appears in that review and shrinks with how many reviews contain it. Words that appear in few reviews are therefore weighted more heavily than words that appear everywhere. For example, the word "movie" should have a low score in this dataset because it is expected to appear very often given the context of this corpus.
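
The weighting can be sketched by hand. The version below follows scikit-learn's documented defaults (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalisation of each row). The function name `tfidf_sketch` and the three toy documents are assumptions for illustration.

```python
import math
from collections import Counter

def tfidf_sketch(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for tokens in tokenised:
        tf = Counter(tokens)
        raw = {t: count * idf[t] for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.append({t: w / norm for t, w in raw.items()})
    return rows

docs = ["movie great act", "movie bore plot", "movie fun"]
rows = tfidf_sketch(docs)
for row in rows:
    print({t: round(w, 3) for t, w in row.items()})
```

In this toy corpus "movie" occurs in every document, so within each row it receives a smaller weight than the terms unique to that document.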

In [20]:
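# Note: an integer max_df is an absolute document count, so max_df=1 keeps only
# terms that occur in at most one review; max_df=1.0 would mean "at most 100% of reviews".
tv=TfidfVectorizer(min_df=0,max_df=1,use_idf=True,ngram_range=(1,1))

#transformed train reviews
tv_train_reviews=tv.fit_transform(norm_train_reviews)
#transformed test reviews
tv_test_reviews=tv.transform(norm_test_reviews)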

print('Tfidf_train:',tv_train_reviews.shape)
print('Tfidf_test:',tv_test_reviews.shape)

Tfidf_train: (40000, 106381)
Tfidf_test: (10000, 106381)
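
The difference between an integer and a float `max_df` is easy to miss. The pure-Python sketch below mimics the documented rule (an integer is an absolute document count, a float is a proportion of documents). It is not the scikit-learn implementation, and `surviving_terms` is a hypothetical helper.

```python
from collections import Counter

def surviving_terms(docs, max_df):
    """Mimic the max_df rule: int = absolute count, float = proportion."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc.split()))
    limit = max_df if isinstance(max_df, int) else max_df * n_docs
    return sorted(t for t, d in df.items() if d <= limit)

docs = ["great movie", "bad movie", "great plot twist"]
print(surviving_terms(docs, 1))    # int: terms in at most one document
print(surviving_terms(docs, 1.0))  # float: terms in at most 100% of documents
```

With the integer setting, "great" and "movie" are discarded because they occur in two documents; only terms unique to a single document remain.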


Encoding the sentiment

In [21]:
lb=LabelBinarizer()
#transformed sentiment data
sentiment_data=lb.fit_transform(data['sentiment'])
print(sentiment_data.shape)

(50000, 1)


Here we can see that positive sentiments are converted to 1 and negative sentiments to 0.

In [22]:
sentiment_data[0:5]

array([[1],
       [1],
       [1],
       [0],
       [1]])

In [23]:
#Splitting the sentiment data to match the review split
X_train=tv_train_reviews
X_test= tv_test_reviews
y_train=sentiment_data[:40000]
y_test=sentiment_data[40000:]

## 3. Train the Model

In [24]:
!pip install catboost

Collecting catboost
  Downloading catboost-1.2.2-cp310-cp310-manylinux2014_x86_64.whl (98.7 MB)
Installing collected packages: catboost
Successfully installed catboost-1.2.2


In [25]:
from xgboost import XGBClassifier
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score,f1_score
from sklearn.linear_model import LogisticRegressionCV
from sklearn.linear_model import LogisticRegression,SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from catboost import CatBoostClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier

In [26]:

classifiers = [RandomForestClassifier(random_state=345),
               GradientBoostingClassifier(random_state=345),
               CatBoostClassifier(verbose=False, random_state=345),
               XGBClassifier(),
               LogisticRegression(),
               MultinomialNB(),
               SGDClassifier()
]

In [27]:
for classifier in classifiers:
    classifier.fit(X_train, y_train.ravel())
    y_pred = classifier.predict(X_test)
    print('\n Classfier:')
    print(classifier)
    print("\n model score: %.3f" % classifier.score(X_test, y_test))
    print('confusion matrix')
    print(confusion_matrix(y_test, y_pred))
    print('classification report')
    print(classification_report(y_test, y_pred))
    print('Accuracy : %f' % (accuracy_score(y_test, y_pred)))
    print('f1 score : %f' % (f1_score(y_test, y_pred, average='weighted')))


 Classfier:
RandomForestClassifier(random_state=345)

 model score: 0.544
confusion matrix
[[1241 3752]
 [ 806 4201]]
classification report
              precision    recall  f1-score   support

           0       0.61      0.25      0.35      4993
           1       0.53      0.84      0.65      5007

    accuracy                           0.54     10000
   macro avg       0.57      0.54      0.50     10000
weighted avg       0.57      0.54      0.50     10000

Accuracy : 0.544200
f1 score : 0.500637

 Classfier:
GradientBoostingClassifier(random_state=345)

 model score: 0.500
confusion matrix
[[4984    9]
 [4992   15]]
classification report
              precision    recall  f1-score   support

           0       0.50      1.00      0.67      4993
           1       0.62      0.00      0.01      5007

    accuracy                           0.50     10000
   macro avg       0.56      0.50      0.34     10000
weighted avg       0.56      0.50      0.34     10000

Accuracy : 0.499900





 Classfier:
LogisticRegression()

 model score: 0.547
confusion matrix
[[1286 3707]
 [ 822 4185]]
classification report
              precision    recall  f1-score   support

           0       0.61      0.26      0.36      4993
           1       0.53      0.84      0.65      5007

    accuracy                           0.55     10000
   macro avg       0.57      0.55      0.51     10000
weighted avg       0.57      0.55      0.51     10000

Accuracy : 0.547100
f1 score : 0.505746

 Classfier:
MultinomialNB()

 model score: 0.544
confusion matrix
[[4152  841]
 [3722 1285]]
classification report
              precision    recall  f1-score   support

           0       0.53      0.83      0.65      4993
           1       0.60      0.26      0.36      5007

    accuracy                           0.54     10000
   macro avg       0.57      0.54      0.50     10000
weighted avg       0.57      0.54      0.50     10000

Accuracy : 0.543700
f1 score : 0.502635

 Classfier:
SGDClassifier()




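The performance is poor: every model scores close to chance (about 0.50 to 0.55 accuracy). A likely cause is the `max_df=1` setting above, which as an integer discards every term that appears in more than one review. Next, a simple count vectorizer is tried to check whether performance improves.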


`Count Vectorizer`: This is another method of converting text into numbers (vectorization) that the algorithms can consume. It converts each document (review) into a fixed-length vector with one entry per vocabulary word, where each entry is the number of times that word occurs in the document. This uses the simple [Bag of Words model](https://en.wikipedia.org/wiki/Bag-of-words_model#:~:text=The%20bag%2Dof%2Dwords%20model%20is%20commonly%20used%20in%20methods,1954%20article%20on%20Distributional%20Structure.).
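
The bag-of-words idea can be sketched in a few lines of plain Python. This is a simplified illustration, not the `CountVectorizer` implementation: it splits on whitespace only, and the helper name `bag_of_words` is an assumption.

```python
from collections import Counter

def bag_of_words(docs):
    """Build a sorted vocabulary and one count vector per document."""
    vocab = sorted({term for doc in docs for term in doc.split()})
    index = {term: i for i, term in enumerate(vocab)}
    vectors = []
    for doc in docs:
        counts = Counter(doc.split())
        vec = [0] * len(vocab)
        for term, c in counts.items():
            vec[index[term]] = c
        vectors.append(vec)
    return vocab, vectors

vocab, vectors = bag_of_words(["good good movie", "bad movie"])
print(vocab)    # ['bad', 'good', 'movie']
print(vectors)  # [[0, 2, 1], [1, 0, 1]]
```

Every document maps to a vector of the same length (the vocabulary size), and word order is discarded.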

In [28]:
vectorizer = CountVectorizer()
X_train2 = vectorizer.fit_transform(norm_train_reviews)
X_test2 = vectorizer.transform(norm_test_reviews)
print('X_train2:',X_train2.shape)
print('X_test2:',X_test2.shape)

X_train2: (40000, 156428)
X_test2: (10000, 156428)


In [29]:
for classifier in classifiers:
    classifier.fit(X_train2, y_train.ravel())
    y_pred = classifier.predict(X_test2)
    print('\n Classfier:')
    print(classifier)
    print("\n model score: %.3f" % classifier.score(X_test2, y_test))
    print('confusion matrix')
    print(confusion_matrix(y_test, y_pred))
    print('classification report')
    print(classification_report(y_test, y_pred))
    print('Accuracy : %f' % (accuracy_score(y_test, y_pred)))
    print('f1 score : %f' % (f1_score(y_test, y_pred, average='weighted')))


 Classfier:
RandomForestClassifier(random_state=345)

 model score: 0.854
confusion matrix
[[4279  714]
 [ 749 4258]]
classification report
              precision    recall  f1-score   support

           0       0.85      0.86      0.85      4993
           1       0.86      0.85      0.85      5007

    accuracy                           0.85     10000
   macro avg       0.85      0.85      0.85     10000
weighted avg       0.85      0.85      0.85     10000

Accuracy : 0.853700
f1 score : 0.853699

 Classfier:
GradientBoostingClassifier(random_state=345)

 model score: 0.810
confusion matrix
[[3771 1222]
 [ 681 4326]]
classification report
              precision    recall  f1-score   support

           0       0.85      0.76      0.80      4993
           1       0.78      0.86      0.82      5007

    accuracy                           0.81     10000
   macro avg       0.81      0.81      0.81     10000
weighted avg       0.81      0.81      0.81     10000

Accuracy : 0.809700


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(



 Classfier:
LogisticRegression()

 model score: 0.885
confusion matrix
[[4415  578]
 [ 576 4431]]
classification report
              precision    recall  f1-score   support

           0       0.88      0.88      0.88      4993
           1       0.88      0.88      0.88      5007

    accuracy                           0.88     10000
   macro avg       0.88      0.88      0.88     10000
weighted avg       0.88      0.88      0.88     10000

Accuracy : 0.884600
f1 score : 0.884600

 Classfier:
MultinomialNB()

 model score: 0.856
confusion matrix
[[4373  620]
 [ 824 4183]]
classification report
              precision    recall  f1-score   support

           0       0.84      0.88      0.86      4993
           1       0.87      0.84      0.85      5007

    accuracy                           0.86     10000
   macro avg       0.86      0.86      0.86     10000
weighted avg       0.86      0.86      0.86     10000

Accuracy : 0.855600
f1 score : 0.855544

 Classfier:
SGDClassifier()


Selecting the best model.

The top two models are logistic regression and the SGD classifier. Hyperparameter tuning would normally follow once the best-fit candidates are known; here we simply save the top two models.
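
When comparing candidates, it can help to recompute the headline numbers directly from the confusion matrices. The sketch below assumes the `[[tn, fp], [fn, tp]]` layout that scikit-learn prints for labels 0 and 1. The helper `binary_metrics` is hypothetical, and the F1 it returns is for class 1 only, not the weighted average printed in the reports.

```python
def binary_metrics(cm):
    """cm is [[tn, fp], [fn, tp]], the layout sklearn prints for labels 0/1."""
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tn + tp) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Confusion matrices copied from the count-vectorizer run above.
reported = {
    "LogisticRegression": [[4415, 578], [576, 4431]],
    "MultinomialNB": [[4373, 620], [824, 4183]],
    "RandomForestClassifier": [[4279, 714], [749, 4258]],
}
for name, cm in reported.items():
    m = binary_metrics(cm)
    print(f"{name}: accuracy={m['accuracy']:.4f} f1={m['f1']:.4f}")
```

The recomputed accuracies match the reported values for these three models.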


## 4. Output

In [30]:
import joblib

In [33]:
lr = LogisticRegression()
lr.fit(X_train2, y_train.ravel())
y_pred = lr.predict(X_test2)
print("\n model score: %.3f" % lr.score(X_test2, y_test))
print('confusion matrix')
print(confusion_matrix(y_test, y_pred))
print('classification report')
print(classification_report(y_test, y_pred))
print('Accuracy : %f' % (accuracy_score(y_test, y_pred)))
print('f1 score : %f' % (f1_score(y_test, y_pred, average='weighted')))



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(



 model score: 0.885
confusion matrix
[[4415  578]
 [ 576 4431]]
classification report
              precision    recall  f1-score   support

           0       0.88      0.88      0.88      4993
           1       0.88      0.88      0.88      5007

    accuracy                           0.88     10000
   macro avg       0.88      0.88      0.88     10000
weighted avg       0.88      0.88      0.88     10000

Accuracy : 0.884600
f1 score : 0.884600


In [34]:
joblib.dump(lr, 'lr_sentiment.pkl')

['lr_sentiment.pkl']

In [35]:
sgd = SGDClassifier()
sgd.fit(X_train2, y_train.ravel())
y_pred = sgd.predict(X_test2)
# Score the freshly fitted sgd model (not the leftover loop variable `classifier`)
print("\n model score: %.3f" % sgd.score(X_test2, y_test))
print('confusion matrix')
print(confusion_matrix(y_test, y_pred))
print('classification report')
print(classification_report(y_test, y_pred))
print('Accuracy : %f' % (accuracy_score(y_test, y_pred)))
print('f1 score : %f' % (f1_score(y_test, y_pred, average='weighted')))


 model score: 0.879
confusion matrix
[[4366  627]
 [ 579 4428]]
classification report
              precision    recall  f1-score   support

           0       0.88      0.87      0.88      4993
           1       0.88      0.88      0.88      5007

    accuracy                           0.88     10000
   macro avg       0.88      0.88      0.88     10000
weighted avg       0.88      0.88      0.88     10000

Accuracy : 0.879400
f1 score : 0.879396


In [36]:
joblib.dump(sgd, 'sgd_sentiment.pkl')

['sgd_sentiment.pkl']

Test output of the model
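
A saved classifier cannot score raw text on its own, because it expects the columns produced by the fitted `CountVectorizer`. One possible approach, sketched here under stated assumptions, is to bundle the vectorizer and the classifier in a single scikit-learn pipeline and persist that. The toy corpus, the file name `sentiment_pipeline.pkl` and the `max_iter` value are illustrative choices, not taken from the notebook.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus, already lower-cased; labels: 1 = positive, 0 = negative.
texts = ["great movie", "wonderful film", "love great act",
         "terrible movie", "awful film", "hate bad act"]
labels = [1, 1, 1, 0, 0, 0]

# Bundling the vectorizer with the classifier keeps the vocabulary with the model.
pipeline = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
pipeline.fit(texts, labels)

# Hypothetical file name, written to a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "sentiment_pipeline.pkl")
joblib.dump(pipeline, path)

loaded = joblib.load(path)
new_texts = ["great film", "awful act"]
predictions = loaded.predict(new_texts)
print(predictions)
```

New reviews would also need the same cleaning, stop-word removal and stemming as the training data before being passed to the loaded pipeline.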