# Modelling

The objective of this notebook is to select the model that best predicts whether a subreddit post belongs to 'r/GoogleHome' or 'r/AmazonEcho'.

A dataset that was further engineered in the EDA notebook will be used and will undergo further preprocessing using stemming and lemmatization. The processed text will then be transformed with TfidfVectorizer before training Naive Bayes and Random Forest models.

Additionally, the best algorithm, based on accuracy, will be selected and tested against data that excludes key words and phrases such as 'Google Home' and 'Amazon Echo', to name a few.

## Problem Statement

Google's smart speaker system, Google Home, was designed to compete with the popular Amazon Echo. Both products serve as a vehicle for their respective voice-activated virtual assistants, which connect to the internet.

Reddit users have used the platform as a forum to discuss their experience with the products, including troubleshooting. I have been tasked by Google's Research team to analyze customer sentiment towards Google Home from subreddit posts on 'r/GoogleHome'.

Additionally, the Research team would like to compare customer pain points between their product and Amazon Echo in order to improve the product itself. Therefore, subreddit posts from 'r/AmazonEcho' are also included in the dataset.

A classification model (Naive Bayes or Random Forest, whichever performs better) will also be built to predict whether a given set of words refers to a discussion of the Amazon Echo or the Google Home, based on selected features.

Each subreddit post is represented as a 'document' in the dataset, and therefore both terms will be used interchangeably.

**Contents**
- [Import libraries](#Import-Libraries)
- [Load data](#Load-Data)
- [Preprocessing](#Preprocessing)
- [Create multiple dataframes from different preprocessing methods](#Create-multiple-dataframes-from-different-preprocessing-methods)
- [Baseline model](#Baseline-Model)
- [Decision to only use TfidfVectorizer](#Decision-to-only-use-Tfidvectorizer)
- [Modeling using lemmatized text](#Modeling-using-lemmatized-text)
    - [Naive Bayes Model](#Naive-Bayes-Model)
    - [Random Forest Model](#Random-Forest-Model)
- [Modeling using stemmed text](#Modeling-using-stemmed-text)
    - [Naive Bayes Model](#Naive-Bayes-Model)
    - [Random Forest Model](#Random-Forest-Model)
- [Test Random Forest algorithm on data without keywords](#Test-Random-Forest-algorithm-on-data-without-keywords)
- [Include additional features to see if model improves](#Include-additional-features-to-see-if-model-improves)
- [Conclusion](#Conclusion)
- [Recommendations](#Recommendations)


Accuracy will be used as the evaluation metric. The two classes are nearly balanced, and misclassification, whether a false negative or a false positive, does not have dire consequences here, unlike in fields such as healthcare.




## Import Libraries

In [1]:
# imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from nltk.stem import WordNetLemmatizer, PorterStemmer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_extraction import text 
from sklearn.ensemble import RandomForestClassifier

## Load Data

In [2]:
subreddit_data = pd.read_csv(r'../datasets/modelling_data.csv')

In [3]:
subreddit_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5733 entries, 0 to 5732
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   sentiment_score  5733 non-null   float64
 1   word_count       5733 non-null   int64  
 2   linktype         5733 non-null   object 
 3   selftext_        5733 non-null   object 
 4   num_comments     5733 non-null   int64  
 5   subreddit_       5733 non-null   object 
dtypes: float64(1), int64(2), object(3)
memory usage: 268.9+ KB


In [4]:
subreddit_data.subreddit_.value_counts()

amazonecho    2968
googlehome    2765
Name: subreddit_, dtype: int64

## Preprocessing

In [5]:
#change y_target to binary

subreddit_data['subreddit_'] = subreddit_data['subreddit_'].apply(lambda x: 1 if x == 'googlehome' else 0)

In [6]:
#check if lambda function works 

subreddit_data.subreddit_.value_counts()

0    2968
1    2765
Name: subreddit_, dtype: int64

In [7]:
#dummify categorical columns
linktype_dummy = pd.get_dummies(data=subreddit_data['linktype'],prefix='_',drop_first=True)

#drop linktype data
subreddit_data.drop(columns='linktype',inplace=True)

In [8]:
#concat new df
subreddit_data = pd.concat([subreddit_data,linktype_dummy],axis=1)

In [9]:
subreddit_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5733 entries, 0 to 5732
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   sentiment_score  5733 non-null   float64
 1   word_count       5733 non-null   int64  
 2   selftext_        5733 non-null   object 
 3   num_comments     5733 non-null   int64  
 4   subreddit_       5733 non-null   int64  
 5   __link           5733 non-null   uint8  
 6   __none           5733 non-null   uint8  
dtypes: float64(1), int64(3), object(1), uint8(2)
memory usage: 235.3+ KB


In [10]:
#separate text for comparison
text_to_convert = subreddit_data['selftext_']

#create new df without selftext as additional features to be trained on moving forward
subreddit_data_notext = subreddit_data.drop(columns='selftext_',axis=1)

In [11]:
subreddit_data_notext.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5733 entries, 0 to 5732
Data columns (total 6 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   sentiment_score  5733 non-null   float64
 1   word_count       5733 non-null   int64  
 2   num_comments     5733 non-null   int64  
 3   subreddit_       5733 non-null   int64  
 4   __link           5733 non-null   uint8  
 5   __none           5733 non-null   uint8  
dtypes: float64(1), int64(3), uint8(2)
memory usage: 190.5 KB


In [12]:
#set y target
y = subreddit_data['subreddit_']

## Create multiple dataframes from different preprocessing methods

In [13]:
#instantiate lemmatizer
lemmatizer = WordNetLemmatizer()

#create lemmatized text df
lem_text = pd.DataFrame(subreddit_data['selftext_'])

for row in range(0,len(subreddit_data)):
    word_list = []
    for word in subreddit_data.iat[row,2].split():
        word_list.append(lemmatizer.lemmatize(word))
    lem_text.iat[row,0] = (' ').join(word_list)

#instantiate stemmer
p_stem = PorterStemmer()

#create stemmed text df
stem_text = pd.DataFrame(subreddit_data['selftext_'])

for row in range(0,len(subreddit_data)):
    word_list = []
    for word in subreddit_data.iat[row,2].split():
        word_list.append(p_stem.stem(word))
    stem_text.iat[row,0] = (' ').join(word_list)
        


In [14]:
#check if text had transformed accordingly

print('Text before transformation:')
print(f'{text_to_convert[2]}')
print('\n')
print('Text after lemmatizer:')
print(f'{lem_text.iat[2,0]}')
print('\n')
print('Text after stemmer:')
print(f'{stem_text.iat[2,0]}')

Text before transformation:
Issue with all speakers and displays. When I open in home app it just spins and doesn't load. After I tap a gear on top right for settings it tells me that I am not connected to same wifi but I absolutely am. Has anyone experienced this? Galaxy S9 and latest Home app


Text after lemmatizer:
Issue with all speaker and displays. When I open in home app it just spin and doesn't load. After I tap a gear on top right for setting it tell me that I am not connected to same wifi but I absolutely am. Has anyone experienced this? Galaxy S9 and latest Home app


Text after stemmer:
issu with all speaker and displays. when i open in home app it just spin and doesn't load. after i tap a gear on top right for set it tell me that i am not connect to same wifi but i absolut am. ha anyon experienc this? galaxi s9 and latest home app


All text data has gone through the appropriate preprocessing methods to test whether they yield different model performance. Moving forward, pipelines will be built for the respective models.
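One detail is visible in the printed text above: because each post is split on whitespace, a token such as 'displays.' keeps its trailing punctuation and passes through the lemmatizer and stemmer unchanged. The following is a minimal sketch of one possible workaround, assuming a simple regular-expression tokenizer. The `tokenize` helper and its pattern are illustrative assumptions and are not part of the original notebook.

```python
import re

def tokenize(text):
    # Hypothetical helper: keep runs of letters, digits and apostrophes,
    # so punctuation is separated from words before stemming/lemmatizing.
    return re.findall(r"[A-Za-z0-9']+", text)

sample = ("Issue with all speakers and displays. When I open in home app "
          "it just spins and doesn't load.")

whitespace_tokens = sample.split()   # what the notebook's loops iterate over
regex_tokens = tokenize(sample)

print(whitespace_tokens[5])  # 'displays.' (punctuation still attached)
print(regex_tokens[5])       # 'displays'
```

With tokens like these, each word reaches the lemmatizer or stemmer in a form it can recognise. Whether this changes the model scores would need to be tested.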

## Baseline Model

In [156]:
y.value_counts(normalize=True)

0    0.517705
1    0.482295
Name: subreddit_, dtype: float64

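If our model predicted every post to be the majority class, Amazon Echo (0), its accuracy would be 0.5177; predicting every post to be a Google Home post (1) would give 0.4823. The majority-class score of 0.5177 is the baseline that the other models will be compared against.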
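As a quick check of the baseline arithmetic, the sketch below recomputes the majority-class accuracy from the class counts reported by `value_counts()` earlier in the notebook. The dictionary keys are illustrative labels only.

```python
from collections import Counter

# Class counts taken from the value_counts() output earlier in the notebook
class_counts = Counter({"amazonecho (0)": 2968, "googlehome (1)": 2765})

total = sum(class_counts.values())
majority_class, majority_count = class_counts.most_common(1)[0]

# Accuracy of a model that always predicts the majority class
baseline_accuracy = majority_count / total

print(majority_class, round(baseline_accuracy, 4))
```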

## Decision to only use Tfidvectorizer

TfidfVectorizer was chosen as the only vectorization method to prepare our text data for our models. CountVectorizer only counts the number of times a word appears in a document, which biases the features in favour of the most frequent words. Rarer words, which could be more helpful in distinguishing between the two subreddits, end up carrying comparatively little weight.

TfidfVectorizer additionally considers how many documents in the whole corpus contain a word. Words that appear in most documents are down-weighted, which reduces the influence of very common terms, while words concentrated in fewer documents receive more weight. This will also affect keywords such as 'Google' and 'Amazon', to the extent that they appear across many posts.
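To make the down-weighting concrete, here is a small hand computation of the inverse document frequency, using the smoothed formula that scikit-learn applies by default, idf = ln((1 + n) / (1 + df)) + 1. The three-document corpus is invented for illustration, and the row normalisation that TfidfVectorizer also performs is left out of this sketch.

```python
import math

# Invented toy corpus, for illustration only
docs = [
    "google home speaker not connecting",
    "amazon echo speaker not responding",
    "google home speaker volume low",
]
tokenised = [d.split() for d in docs]
n_docs = len(tokenised)

def idf(term):
    # df = number of documents that contain the term
    df = sum(term in tokens for tokens in tokenised)
    return math.log((1 + n_docs) / (1 + df)) + 1

for term in ["speaker", "google", "volume"]:
    print(term, round(idf(term), 4))
# speaker 1.0     (appears in every document, lowest weight)
# google 1.2877   (appears in two of three documents)
# volume 1.6931   (appears in one document, highest weight)
```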

## Modeling using lemmatized text

#### Naive Bayes Model 

In [15]:
#turn lem_text into a Series
X = lem_text.squeeze()

#perform train test split for X_lem and y_target
X_train, X_test, y_train, y_test = train_test_split(X,y, stratify=y, random_state = 42)

pipe_tvec = Pipeline([
    ('tvec', TfidfVectorizer()),
    ('nb', MultinomialNB())
])

pipe_tvec_params = {
    'tvec__max_features': [2_000, 3_000, 4_000, 5_000], #number of max feature
    'tvec__stop_words': [None, 'english'],  #with or without stopword
    'tvec__ngram_range': [(1,1), (1,2)]   #word or bigram
}

# Instantiate GridSearchCV.
gs_tvec = GridSearchCV(pipe_tvec, 
                        param_grid = pipe_tvec_params, 
                        cv=5) # 5-fold cross-validation.

In [18]:
# Fit GridSearch to training data.
gs_tvec.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('tvec', TfidfVectorizer()),
                                       ('nb', MultinomialNB())]),
             param_grid={'tvec__max_features': [2000, 3000, 4000, 5000],
                         'tvec__ngram_range': [(1, 1), (1, 2)],
                         'tvec__stop_words': [None, 'english']})

In [19]:
gs_tvec.best_params_

{'tvec__max_features': 5000,
 'tvec__ngram_range': (1, 2),
 'tvec__stop_words': 'english'}

In [20]:
gs_tvec.score(X_train,y_train)

0.9530123284484764

In [21]:
gs_tvec.score(X_test,y_test)

0.8940027894002789

|             | Lemmatized  |
|-------------|-------------|
|             | Naive Bayes |
| train score | 0.9530      |
| test score  | 0.8940      |

The Naive Bayes model returned relatively good scores, well above the baseline. The train score is about 0.06 higher than the test score, which suggests only mild overfitting.
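Besides the train and test scores, a fitted `GridSearchCV` also exposes `best_score_`, the mean cross-validated accuracy of the best parameter combination, which gives a third reference point when judging overfitting. The sketch below shows this on a tiny invented corpus. The posts, labels and reduced grid are assumptions for illustration, not the notebook's data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Invented toy posts (1 = googlehome, 0 = amazonecho), for illustration only
posts = [
    "home hub will not cast music", "nest mini keeps dropping wifi",
    "assistant routine stopped working", "chromecast group audio out of sync",
    "alexa skill will not enable", "echo dot keeps dropping wifi",
    "alexa routine stopped working", "echo show screen flickers",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([("tvec", TfidfVectorizer()), ("nb", MultinomialNB())])
gs = GridSearchCV(pipe, param_grid={"tvec__ngram_range": [(1, 1), (1, 2)]}, cv=2)
gs.fit(posts, labels)

# Mean accuracy on held-out folds for the best parameters
print("cross-validated accuracy:", gs.best_score_)
# Accuracy on the same data the final model was refit on
print("accuracy on fitting data:", gs.score(posts, labels))
```

A cross-validated score close to the test score, with both below the train score, would be consistent with the mild overfitting described above.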

#### Random Forest Model

In [35]:
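#instantiate TfidfVectorizer
tvec = TfidfVectorizer(max_features=5_000,stop_words='english')

#transform lemmatized text
#get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
X_lemm_tfid = pd.DataFrame(tvec.fit_transform(lem_text.squeeze()).todense(), 
                          columns=tvec.get_feature_names_out())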

In [39]:
X_train, X_test, y_train, y_test = train_test_split(X_lemm_tfid,y, stratify=y, random_state = 42)

In [40]:
rf = RandomForestClassifier(n_estimators=100)

rf_params = {
    'n_estimators': [100, 150],
    'max_depth': [None, 1, 2, 3],
}

# Instantiate GridSearchCV.
rf_gs = GridSearchCV(rf,
                     param_grid=rf_params,
                     cv=5)


In [45]:
# Fit GridSearch to training data.
rf_gs.fit(X_train, y_train)

GridSearchCV(cv=5, estimator=RandomForestClassifier(),
             param_grid={'max_depth': [None, 1, 2, 3],
                         'n_estimators': [100, 150]})

In [46]:
rf_gs.best_params_

{'max_depth': None, 'n_estimators': 100}

In [47]:
rf_gs.score(X_train,y_train)

0.9967434287043498

In [48]:
rf_gs.score(X_test,y_test)

0.900278940027894

|             | Lemmatized  | Lemmatized    |
|-------------|-------------|---------------|
|             | Naive Bayes | Random Forest |
| train score | 0.9530      | 0.9967        |
| test score  | 0.8940      | 0.9002        |

Random Forest beat Naive Bayes on both the train and test accuracy scores, although the test improvement is small (0.9002 vs 0.8940). The Random Forest model shows overfitting: even after a grid search, the best model (100 trees, no depth limit) scores 0.9967 on the training data and is potentially fitting noise. Regardless, we will move forward and see how Random Forest fares against stemmed text.

Random Forest appears to be the better predictive algorithm at the moment.
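In the Random Forest cells above, the TfidfVectorizer is fitted on the full dataset before the train/test split, so the vocabulary and IDF weights are learned partly from test posts. A minimal sketch of an alternative is shown below: the vectorizer is placed inside a `Pipeline` so that it is refitted on the training folds only, and `min_samples_leaf` is added to the grid as one way to limit how closely each tree fits the data. The toy posts, the parameter values and the fixed `random_state` are assumptions for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Invented toy posts (1 = googlehome, 0 = amazonecho), for illustration only
posts = [
    "home hub will not cast music", "nest mini keeps dropping wifi",
    "assistant routine stopped working", "chromecast group audio out of sync",
    "alexa skill will not enable", "echo dot keeps dropping wifi",
    "alexa routine stopped working", "echo show screen flickers",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe_rf = Pipeline([
    ("tvec", TfidfVectorizer(stop_words="english")),   # fitted per training fold
    ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
])

# Hypothetical grid: depth and leaf size both restrict tree complexity
rf_pipe_params = {
    "rf__max_depth": [None, 5],
    "rf__min_samples_leaf": [1, 2],
}

gs_rf = GridSearchCV(pipe_rf, param_grid=rf_pipe_params, cv=2)
gs_rf.fit(posts, labels)
print(gs_rf.best_params_)
```

Fixing `random_state` also makes the Random Forest scores reproducible between runs, which helps when comparing small differences in accuracy.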

Next, we will try the two algorithms on stemmed text data instead.

## Modeling using stemmed text

#### Naive Bayes Model 

In [50]:
#turn stem_text into a Series
X = stem_text.squeeze()

#perform train test split for X_stem and y_target
X_train, X_test, y_train, y_test = train_test_split(X,y, stratify=y, random_state = 42)

pipe_stem_tvec = Pipeline([
    ('tvec', TfidfVectorizer()),
    ('nb', MultinomialNB())
])

pipe_stem_tvec_params = {
    'tvec__max_features': [2_000, 3_000, 4_000, 5_000],
    'tvec__stop_words': [None, 'english'],
    'tvec__ngram_range': [(1,1), (1,2)]
}

# Instantiate GridSearchCV with the stemmed-text pipeline and its parameter grid.
gs_stem_tvec = GridSearchCV(pipe_stem_tvec, 
                        param_grid = pipe_stem_tvec_params, 
                        cv=5) # 5-fold cross-validation.

In [51]:
# Fit GridSearch to training data.
gs_stem_tvec.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('tvec', TfidfVectorizer()),
                                       ('nb', MultinomialNB())]),
             param_grid={'tvec__max_features': [2000, 3000, 4000, 5000],
                         'tvec__ngram_range': [(1, 1), (1, 2)],
                         'tvec__stop_words': [None, 'english']})

In [52]:
gs_stem_tvec.best_params_

{'tvec__max_features': 5000,
 'tvec__ngram_range': (1, 2),
 'tvec__stop_words': 'english'}

In [53]:
gs_stem_tvec.score(X_train,y_train)

0.9544080018608979

In [54]:
gs_stem_tvec.score(X_test,y_test)

0.897489539748954

|             | Lemmatized  | Lemmatized    | Stemmed     |
|-------------|-------------|---------------|-------------|
|             | Naive Bayes | Random Forest | Naive Bayes |
| train score | 0.9530      | 0.9967        | 0.9544      |
| test score  | 0.8940      | 0.9002        | 0.8975      |

The Naive Bayes model returned slightly better results with stemmed text than with lemmatized text, yet still underperformed the previous Random Forest model. We will see how the Random Forest algorithm performs with stemmed text data.

#### Random Forest Model

In [157]:
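#instantiate TfidfVectorizer
tvec = TfidfVectorizer(max_features=5_000,stop_words='english')

#transform stemmed text
#get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
X_stem_tfid = pd.DataFrame(tvec.fit_transform(stem_text.squeeze()).todense(), 
                          columns=tvec.get_feature_names_out())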

In [158]:
X_train, X_test, y_train, y_test = train_test_split(X_stem_tfid,y, stratify=y, random_state = 42)

In [159]:
rf = RandomForestClassifier(n_estimators=100)

rf_params = {
    'n_estimators': [100, 150],
    'max_depth': [None, 1, 2, 3],
}

# Instantiate GridSearchCV.
rf_gs = GridSearchCV(rf,
                     param_grid=rf_params,
                     cv=5)


In [160]:
# Fit GridSearch to training data.
rf_gs.fit(X_train, y_train)

GridSearchCV(cv=5, estimator=RandomForestClassifier(),
             param_grid={'max_depth': [None, 1, 2, 3],
                         'n_estimators': [100, 150]})

In [161]:
rf_gs.best_params_

{'max_depth': None, 'n_estimators': 150}

In [162]:
rf_gs.score(X_train,y_train)

0.9965108164689462

In [163]:
rf_gs.score(X_test,y_test)

0.9037656903765691

|             | Lemmatized  | Lemmatized    | Stemmed     | Stemmed       |
|-------------|-------------|---------------|-------------|---------------|
|             | Naive Bayes | Random Forest | Naive Bayes | Random Forest |
| train score | 0.9530      | 0.9967        | 0.9544      | 0.9965        |
| test score  | 0.8940      | 0.9002        | 0.8975      | 0.9038        |

The Random Forest model on stemmed data achieved the highest test accuracy of all models. However, Random Forest is overfitting again, even with more trees (150).

On closer inspection, the train score fell by only 0.0002 (0.9967 to 0.9965) with 50 more trees, so the gap between train and test scores is essentially unchanged. Adding trees mainly stabilises the ensemble's predictions; it does not by itself limit how closely each tree fits the training data.

Ultimately, the Random Forest algorithm performs best here, with stemming as the text preprocessing method of choice, although the margins between the four models are small (all test scores fall between 0.894 and 0.904).

Next we shall use Random Forest against text data that does not have obvious keywords like 'Google' and 'Amazon' etc.

## Test Random Forest algorithm on data without keywords

In [167]:
#create list of additional words to be included in the TF-IDF stop words
additional_words = ['google','amazon','mini','dot','hub','max','echo','alexa','assistant','nest','amp','x200b']

#pass a list: recent scikit-learn versions reject a set for stop_words
stop_words = list(text.ENGLISH_STOP_WORDS.union(additional_words))

# Instantiate the transformer for text data with no key words (nkw)
tvec_nkw = TfidfVectorizer(stop_words=stop_words)

#put converted data into respective dataframe
tvec_nkw_df = pd.DataFrame(tvec_nkw.fit_transform(stem_text.squeeze()).todense(), 
                          columns=tvec_nkw.get_feature_names_out())

In [168]:
X_train, X_test, y_train, y_test = train_test_split(tvec_nkw_df,y, stratify=y, random_state = 42)

In [169]:
rf_nkw = RandomForestClassifier(n_estimators=100)


rf_params = {
    'n_estimators': [100, 150],
    'max_depth': [None, 1, 2, 3],
}

# Instantiate GridSearchCV with the no-keyword estimator.
rf_gs_nkw = GridSearchCV(rf_nkw,
                     param_grid=rf_params,
                     cv=5)

# Fit GridSearch to training data.
rf_gs_nkw.fit(X_train,y_train)

GridSearchCV(cv=5, estimator=RandomForestClassifier(),
             param_grid={'max_depth': [None, 1, 2, 3],
                         'n_estimators': [100, 150]})

In [170]:
rf_gs_nkw.best_params_

{'max_depth': None, 'n_estimators': 150}

In [171]:
print(f'Random Forest Model accuracy score with training data: {rf_gs_nkw.score(X_train,y_train)}')

Random Forest Model accuracy score with training data: 0.9988369388229821


In [172]:
print(f'Random Forest Model accuracy score with test data: {rf_gs_nkw.score(X_test,y_test)}')

Random Forest Model accuracy score with test data: 0.8068340306834031


|             | with keywords | without keywords |
|-------------|---------------|------------------|
|             | Random Forest | Random Forest    |
| train score | 0.9965        | 0.9988           |
| test score  | 0.9038        | 0.8068           |

As expected, the algorithm (best parameters: 150 trees, no depth limit) performed worse on text data without key words, with test accuracy dropping to 0.8068. One way to try to improve this model is to include additional features that were previously explored in our EDA, such as number of comments, link type and word count.
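One thing worth checking before interpreting the no-keyword score: the stop words are written in their original form, while the text being vectorized has already been stemmed. The Porter stemmer reduces 'google' to 'googl', for example, so a stop word spelled 'google' would not match the stemmed token. A minimal sketch of stemming the custom stop words with the same stemmer follows. The names `stemmed_additional_words` and `changed` are illustrative, and whether this changes the model's score would need to be tested.

```python
from nltk.stem import PorterStemmer

# Same keyword list as in the notebook
additional_words = ['google', 'amazon', 'mini', 'dot', 'hub', 'max', 'echo',
                    'alexa', 'assistant', 'nest', 'amp', 'x200b']

p_stem = PorterStemmer()

# Stem each keyword so it matches the tokens in the stemmed text
stemmed_additional_words = sorted({p_stem.stem(w) for w in additional_words})

# Keywords whose stemmed form differs from the original spelling
changed = {w: p_stem.stem(w) for w in additional_words if p_stem.stem(w) != w}
print(changed)
```

The stemmed list could then be combined with the English stop words in the same way as `additional_words`.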

## Include additional features to see if model improves

#### Random Forest Model

In [175]:
# Instantiate the transformer for text data with no key words (nkw)
tvec_nkw = TfidfVectorizer(stop_words=stop_words)

#put converted data into respective dataframe
tvec_nkw_df = pd.DataFrame(tvec_nkw.fit_transform(stem_text.squeeze()).todense(), 
                          columns=tvec_nkw.get_feature_names_out())

#concat additional data with tfidf df with no keywords
#NOTE: subreddit_data_notext still contains the target column 'subreddit_' (see In [11]),
#so the target is included among the features here
tvec_nkw_df = pd.concat([tvec_nkw_df,subreddit_data_notext],axis=1)

In [176]:
X_train, X_test, y_train, y_test = train_test_split(tvec_nkw_df,y, stratify=y, random_state = 42)

In [148]:
#random forest with no keywords and added features
rf_nkf = RandomForestClassifier(n_estimators=100)


rf_params = {
    'n_estimators': [100, 150],
    'max_depth': [None, 1, 2, 3],
}

# Instantiate GridSearchCV.
gs_rf_nkf = GridSearchCV(rf_nkf, 
                        param_grid = rf_params,
                            cv = 5)

In [149]:
gs_rf_nkf.fit(X_train,y_train)

GridSearchCV(cv=5, estimator=RandomForestClassifier(),
             param_grid={'max_depth': [None, 1, 2, 3],
                         'n_estimators': [100, 150]})

In [150]:
gs_rf_nkf.best_params_

{'max_depth': None, 'n_estimators': 150}

In [151]:
gs_rf_nkf.score(X_train,y_train)

1.0

In [152]:
gs_rf_nkf.score(X_test,y_test)

0.99860529986053

|             | with keywords | without keywords | without keywords and additional features |
|-------------|---------------|------------------|------------------------------------------|
|             | Random Forest | Random Forest    | Random Forest                            |
| train score | 0.9965        | 0.9988           | 1.0                                      |
| test score  | 0.9038        | 0.8068           | 0.9986                                   |

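Our Random Forest model, with the same best parameters (150 trees), scored almost perfectly: 1.0 on the training data and 0.9986 on the test data.

This result should be treated with caution. The additional features come from `subreddit_data_notext`, which still contains the target column `subreddit_` (see the `info()` output in In [11]). The model was therefore given the label it is supposed to predict, so the near-perfect score most likely reflects target leakage rather than the usefulness of the number of comments, link type and word count. The target column needs to be dropped from the features and the model refit before any conclusion can be drawn about these features.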
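Since `subreddit_data_notext` was created by dropping only `selftext_`, it still carries the `subreddit_` target column. The sketch below shows one way to separate the target from the additional features before concatenating them with the TF-IDF matrix. The small frames and their values are invented stand-ins for the notebook's data, and `tfidf_features` and `extra_features` are illustrative names.

```python
import pandas as pd

# Toy stand-ins for the notebook's frames (values invented for illustration)
subreddit_data_notext = pd.DataFrame({
    "sentiment_score": [0.1, -0.3, 0.5],
    "word_count": [40, 12, 75],
    "num_comments": [3, 0, 9],
    "subreddit_": [1, 0, 1],
    "__link": [0, 1, 0],
    "__none": [1, 0, 0],
})
tfidf_features = pd.DataFrame({"wifi": [0.0, 0.7, 0.2],
                               "speaker": [0.5, 0.0, 0.4]})

# Keep the target separate from the features
y = subreddit_data_notext["subreddit_"]
extra_features = subreddit_data_notext.drop(columns="subreddit_")

# Feature matrix: TF-IDF columns plus the additional features, without the target
X = pd.concat([tfidf_features, extra_features], axis=1)
print(list(X.columns))
```

Refitting the grid search on a feature matrix built this way would show how much the additional features actually contribute.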

## Conclusion

Stemming our text data resulted in slightly better accuracy scores for both our Naive Bayes and Random Forest models (test accuracy improved by about 0.004 in each case). This is likely due to the nature of our problem, where we are simply classifying which post belongs to which subreddit. Lemmatization is better suited to more complex tasks, such as language applications where sensitivity to inflection is important, which may also be why it scored slightly lower here. Differences this small should not be over-interpreted, however.

Our Random Forest model also performed better than Naive Bayes overall. A possible reason is the assumption that Naive Bayes makes: it treats features as conditionally independent of each other given the class. With text data this assumption is often violated, because certain words depend on each other, for example the words that make up bigrams or trigrams. Naive Bayes tends to be most competitive when the training set is small; with more data and a very high-dimensional feature space, a more flexible model such as Random Forest can capture interactions between words that Naive Bayes ignores.
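To illustrate the independence assumption, the sketch below implements a bare-bones multinomial Naive Bayes by hand: the score for a class is the log prior plus the sum of per-word log likelihoods, with each word treated independently of the others. The training posts are invented, and Laplace smoothing with `alpha=1.0` is an assumption made for this example.

```python
import math
from collections import Counter

# Invented toy training posts, already tokenised (1 = googlehome, 0 = amazonecho)
train = [
    (["speaker", "cast", "music"], 1),
    (["cast", "routine", "broken"], 1),
    (["skill", "routine", "broken"], 0),
    (["skill", "speaker", "music"], 0),
]

vocab = {w for tokens, _ in train for w in tokens}
class_docs = Counter(label for _, label in train)
word_counts = {c: Counter() for c in class_docs}
for tokens, label in train:
    word_counts[label].update(tokens)

def log_posterior(tokens, c, alpha=1.0):
    # Log prior of the class
    score = math.log(class_docs[c] / len(train))
    total = sum(word_counts[c].values())
    # The "naive" step: add each word's log likelihood independently
    for w in tokens:
        if w in vocab:
            score += math.log((word_counts[c][w] + alpha)
                              / (total + alpha * len(vocab)))
    return score

def predict(tokens):
    return max(class_docs, key=lambda c: log_posterior(tokens, c))

print(predict(["cast", "music"]))   # 1
print(predict(["skill", "music"]))  # 0
```

Because every word contributes separately, the model cannot express that two words matter only when they occur together, which is the kind of interaction a tree-based model can pick up.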

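Lastly, our Random Forest scored close to perfect on the test data once the additional features were added. However, because the added feature set still included the target column `subreddit_`, this score cannot be taken as evidence that the additional features improve the model; it needs to be re-evaluated with the target removed from the features.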

## Recommendations

The model was trained on data that had been cleaned and preprocessed. Posts regarded as 'spam' were removed in order to ensure accurate predictions; these made up 7% of the initial data scraped from the Google Home subreddit. Since the models tested were not trained on these types of posts, they may do poorly at predicting the right subreddit when fed posts like these.

Training the model with spam posts included may yield different results and would be interesting to test.