# Model Building

In the previous analysis, data preprocessing (feature engineering) and data exploration were completed. The next step is to build the model.

## Contents
1. [Loading the clean data into Pandas DataFrame](#loading)
2. [Stemming](#stem)
3. [Text Embedding](#embedding)
4. [Label Encoding](#encode)
5. [Imbalanced Dataset](#imbalance)
6. [Train Test Split](#split)
7. [Model Training](#train)
8. [Saving the models](#saving)

<a id=loading></a>
## 1. Loading the clean data into Pandas DataFrame

As seen in the previous analysis, the dataset contains more than 550,000 rows, which makes training on the full data very expensive. We have therefore selected a random subset of the data (20% of the original rows).

In [1]:
import joblib

# Load the cleaned DataFrame produced in the previous analysis.
df = joblib.load('./data_file/data.joblib')

# Keep a reproducible 20% random sample and renumber the rows from 0.
df = df.sample(frac=0.20, random_state=1).reset_index(drop=True)
df.head(5)

| | Score | Review | Helpfulness | year | Month | Date | Day | sentiment | Review character length | Review word length |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 5 | cherry pie larabar love cherry pie lara bar be... | 0.0 | 2012 | 6 | 24 | Sun | positive | 111 | 20 |
| 1 | 5 | melitta coffee melitta cafe collection blanc e... | 0.0 | 2010 | 10 | 24 | Sun | positive | 404 | 58 |
| 2 | 5 | great treat girls absolutely loved tuna heaven... | 0.0 | 2012 | 3 | 15 | Thu | positive | 133 | 22 |
| 3 | 5 | daily calming vendor fast dependable tea simpl... | 0.0 | 2012 | 3 | 27 | Tue | positive | 93 | 14 |
| 4 | 5 | best canned artichokes update lot happen coupl... | 0.666667 | 2010 | 4 | 16 | Fri | positive | 768 | 119 |
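As a small illustration of why `random_state` is passed to `sample`, the sketch below draws the same 20% subset twice from a toy frame. The frame and its contents are made up for the example; only the `sample(...).reset_index(drop=True)` call mirrors the cell above.

```python
import pandas as pd

# Toy frame standing in for the full review table (hypothetical data).
toy = pd.DataFrame({"Review": [f"review {i}" for i in range(10)]})

# The same seed selects the same rows, so the subset can be recreated later.
first = toy.sample(frac=0.20, random_state=1).reset_index(drop=True)
second = toy.sample(frac=0.20, random_state=1).reset_index(drop=True)

print(len(first))            # number of rows kept from the 10-row frame
print(first.equals(second))  # whether both draws are identical
```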


<a id=stem></a>
## 2. Stemming

Stemming strips inflectional endings from words, so that variants such as "loved" and "loves" map to a common stem. This considerably reduces the number of distinct features needed to train the model.

In [3]:
from nltk.stem import PorterStemmer
 
ps = PorterStemmer()
text_data = df['Review'].apply(lambda x: ps.stem(x))
text_data

0         cherry pie larabar love cherry pie lara bar be...
1         melitta coffee melitta cafe collection blanc e...
2         great treat girls absolutely loved tuna heaven...
3         daily calming vendor fast dependable tea simpl...
4         best canned artichokes update lot happen coupl...
                                ...                        
113686    fruit nut delite bars think wonderful love com...
113687    cats say weruvya older cats one hyperthyroidis...
113688    greatest jerky jack links sweet hot jerky gera...
113689    incredible addition baking replacing half flou...
113690    good price initially little disappointed bonsa...
Name: Review, Length: 113691, dtype: object
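One thing worth checking here: `PorterStemmer.stem` is designed for a single word, and the printed reviews above look the same as the input text, which is consistent with each review being treated as one long token. The sketch below is one possible per-word variant, assuming the cleaned reviews are whitespace-separated; the helper name `stem_tokens` is our own and not part of NLTK. Note that switching to it would change the features, so the results reported later would need to be regenerated.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

def stem_tokens(text: str) -> str:
    """Stem every whitespace-separated token and join the stems back together."""
    return " ".join(ps.stem(token) for token in text.split())

sample = "great treat girls absolutely loved"
print(ps.stem(sample))      # whole review passed as a single string
print(stem_tokens(sample))  # each word stemmed separately
```

With this helper, the cell above would become `df['Review'].apply(stem_tokens)`, and the same function would have to be applied to new text at prediction time.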

<a id=embedding></a>
## 3. Text Embedding

Text embedding is the process of converting textual data into numbers that a computer can work with. Common methods include word2vec, TF-IDF and word vectorization. Here we have selected TF-IDF because it is less computationally expensive than word vectorization, and because it weights each term by how distinctive it is across reviews. Using both unigrams and bigrams also retains some local word context.

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer
model_tf = TfidfVectorizer(max_features=6000, ngram_range=(1,2))
X = model_tf.fit_transform(text_data)
X.shape

(113691, 6000)

<a id=encode></a>
## 4. Label Encoding

The sentiment column has three classes. Converting these classes into their corresponding integers is called label encoding.

In [5]:
from sklearn.preprocessing import LabelEncoder
model_le = LabelEncoder()
model_le.fit(['neutral','positive','negative'])
y = model_le.transform(df['sentiment'])
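The integer assigned to each class is not determined by the order of the list passed to `fit`: `LabelEncoder` stores its classes in sorted order. The short sketch below prints the resulting mapping and shows how `inverse_transform` turns predicted integers back into sentiment names.

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(["neutral", "positive", "negative"])

# classes_ is sorted, so the codes follow alphabetical order of the labels.
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)

# Decode integer predictions back to their sentiment names.
decoded = le.inverse_transform([2, 0])
print(decoded)
```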

<a id=imbalance></a>
## 5. Checking if the dataset is imbalanced

In [6]:
df.groupby('sentiment')['sentiment'].value_counts()

sentiment
negative    16466
neutral      8448
positive    88777
Name: count, dtype: int64

The dataset is highly imbalanced: the positive class has roughly five times as many reviews as the negative class and ten times as many as the neutral class. This may result in a biased model, so we apply the SMOTE technique to equalize the class sizes. SMOTE works on numeric features, which is why the stemming and text-embedding steps were carried out first.

In [7]:
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X, y)
X_res.shape

(266331, 6000)
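The resampled row count can be checked by hand. With its default settings, SMOTE raises every minority class to the size of the majority class, so the expected total is the majority count times the number of classes. The sketch below repeats that arithmetic using the class counts printed above.

```python
# Class counts copied from the value_counts() output above.
counts = {"negative": 16466, "neutral": 8448, "positive": 88777}

majority = max(counts.values())
rows_after = majority * len(counts)            # every class raised to the majority size
synthetic = rows_after - sum(counts.values())  # rows created by SMOTE

print(rows_after)  # should match X_res.shape[0]
print(synthetic)   # number of synthetic rows added
```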

<a id=split></a>
## 6. Train Test Split

In [8]:
from sklearn.model_selection import train_test_split
X_train,X_test, y_train, y_test = train_test_split(X_res,y_res,random_state=0,test_size=0.2)
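Because the split above is applied to the resampled data, some synthetic SMOTE rows end up in the test set, which can make the test accuracy look more optimistic than it would be on real reviews alone. One alternative ordering, shown here only as a sketch on toy labels (the arrays are invented), is to split the original data first with `stratify` so both parts keep the original class proportions, and then oversample the training part only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels (hypothetical): 15 / 5 / 80 across classes 0, 1, 2.
y_toy = np.array([2] * 80 + [0] * 15 + [1] * 5)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify keeps the class proportions the same in the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

print(np.bincount(y_tr))  # class counts in the training part
print(np.bincount(y_te))  # class counts in the test part
# Oversampling (for example SMOTE) would then be fitted on X_tr, y_tr only.
```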

<a id=train></a>
## 7. Model Training

Here we have selected five machine learning models. Each model is trained on the training set, and its accuracy on the test set is printed afterwards.

In [9]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(n_jobs=-1,multi_class='multinomial', max_iter=1000)

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(min_samples_leaf=2,n_jobs=-1)

from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()

from sklearn.naive_bayes import BernoulliNB
br = BernoulliNB()

from sklearn.ensemble import AdaBoostClassifier
ab = AdaBoostClassifier()

models = [lr, rf, dt, br, ab]
from sklearn.metrics import accuracy_score

for model in models:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    
    print(f'Model = {model}, accuracy = {acc}')
    print('-'*100)


Model = LogisticRegression(max_iter=1000, multi_class='multinomial', n_jobs=-1), accuracy = 0.8737679989486924
----------------------------------------------------------------------------------------------------
Model = RandomForestClassifier(min_samples_leaf=2, n_jobs=-1), accuracy = 0.9535171870013329
----------------------------------------------------------------------------------------------------
Model = DecisionTreeClassifier(), accuracy = 0.8655077252332589
----------------------------------------------------------------------------------------------------
Model = BernoulliNB(), accuracy = 0.6786565791202809
----------------------------------------------------------------------------------------------------
Model = AdaBoostClassifier(), accuracy = 0.7092759119154448
----------------------------------------------------------------------------------------------------


From the results above, we conclude that the best model is <b>RandomForest</b>, with a test accuracy of about 95%.
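Accuracy is a single number, so it does not show how each sentiment class is handled. As a complement, the sketch below prints a confusion matrix and per-class precision and recall. It runs on synthetic three-class data from `make_classification`, not on the review features, so its numbers say nothing about the models above; in the notebook the same two calls would take `y_test` and the predictions of the chosen model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic three-class data standing in for the TF-IDF features (hypothetical).
X_toy, y_toy = make_classification(
    n_samples=600, n_features=20, n_informative=6, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

clf = RandomForestClassifier(min_samples_leaf=2, random_state=0).fit(X_tr, y_tr)
y_hat = clf.predict(X_te)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, y_hat)
print(cm)

# Precision, recall and F1 for each class, labelled in LabelEncoder order.
print(classification_report(y_te, y_hat, target_names=["negative", "neutral", "positive"]))
```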

<a id=saving></a>
## 8. Saving the models

We save the PorterStemmer, TF-IDF vectorizer, LabelEncoder and RandomForest objects so that they can be reused without retraining.

In [15]:
joblib.dump(ps, './models/model_stem.joblib', compress = 3) 
joblib.dump(model_tf,'./models/model_tfidf.joblib', compress=3) 
joblib.dump(model_le, './models/model_label.joblib', compress=3) 
joblib.dump(rf, './models/model_rand_fo.joblib', compress=3) 

['./models/model_rand_fo.joblib']
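At prediction time, the saved objects have to be applied in the same order as during training: vectorize the text, predict, then decode the label. The sketch below shows that round trip on a tiny invented dataset, saving to a temporary directory rather than `./models/`; the file names and training texts are placeholders, and the stemming step is left out to keep the example short.

```python
import os
import tempfile

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

# Tiny stand-in training set (hypothetical data, not the review corpus).
texts = ["love this tea", "great coffee", "awful taste", "terrible stale snack"]
labels = ["positive", "positive", "negative", "negative"]

tfidf = TfidfVectorizer().fit(texts)
le = LabelEncoder().fit(labels)
clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(tfidf.transform(texts), le.transform(labels))

with tempfile.TemporaryDirectory() as tmp:
    # Dump each fitted object, then load it back from disk.
    loaded = {}
    for name, obj in {"tfidf": tfidf, "label": le, "rf": clf}.items():
        path = os.path.join(tmp, f"model_{name}.joblib")
        joblib.dump(obj, path, compress=3)
        loaded[name] = joblib.load(path)

# Inference repeats the training-time steps: transform, predict, decode.
new_review = ["great tea"]
codes = loaded["rf"].predict(loaded["tfidf"].transform(new_review))
pred = loaded["label"].inverse_transform(codes)
print(pred)
```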

All the model-building tasks are now done. What remains is deployment.