# Time to model!

In [1]:
import pandas as pd, numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, BaggingClassifier

In [2]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

**Read in the combined Nike and Adidas csv to begin modeling**

In [3]:
df = pd.read_csv('../data/Nike_Adidas_combined.csv')

In [4]:
# Check dataframe
df.head()

Unnamed: 0.1,Unnamed: 0,author,comments,created,score,subscribers,text,title,url,subreddit
0,0,jeremec,1,2407,18,11043,Please note that any product from a previous s...,tip for identifi nike product,https://www.reddit.com/r/Nike/comments/47fex4/...,1
1,1,lavienstyle,2,2554,8,11043,0,what thi called,https://i.redd.it/hjnoz5gqo7521.jpg,1
2,2,azndkflush,0,2554,1,11043,0,lc OW nike presto d,https://imgur.com/gallery/Aykz7YX,1
3,3,hiding_in_NJ,2,2554,49,11043,0,3D print jordan 1s by me,https://i.redd.it/j4oaab0za2521.jpg,1
4,4,nitr0h,0,2554,2,11043,http://www.jimmyjazz.com/boys/clothing/nike-bl...,need help find thi tracksuit in men,https://www.reddit.com/r/Nike/comments/a7k2nd/...,1


In [5]:
# Drop Unnamed: 0 column
df.drop('Unnamed: 0', axis=1, inplace=True)

**Check summary statistics to see if df is 100% ready to model**

In [6]:
# There are 7 null values in the 'title' column
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1991 entries, 0 to 1990
Data columns (total 9 columns):
author         1991 non-null object
comments       1991 non-null int64
created        1991 non-null int64
score          1991 non-null int64
subscribers    1991 non-null int64
text           1991 non-null object
title          1984 non-null object
url            1991 non-null object
subreddit      1991 non-null int64
dtypes: int64(5), object(4)
memory usage: 140.1+ KB


In [7]:
# Pull out rows in the dataframe with null values 
# Since there are no titles or post text for these rows, we will drop them
df[df['title'].isna()]

Unnamed: 0,author,comments,created,score,subscribers,text,title,url,subreddit
252,Tooup,3,2550,96,11043,0,,https://imgur.com/Qs7QvGK,1
308,Tooup,1,2549,2,11043,0,,https://imgur.com/yI1zLAv,1
802,heartlessCigarrette,6,2539,6,11043,0,,https://i.redd.it/9m8i4am4rbk11.jpg,1
1198,Tooup,5,2551,5,9666,0,,https://i.redd.it/rc7enslytyz11.jpg,0
1306,bredk87,8,2548,6,9666,0,,https://i.redd.it/wfhtheuud7w11.jpg,0
1963,Jdmo99,2,2532,2,9666,0,,https://i.redd.it/mbyygmph4sa11.jpg,0
1969,ghostbuster55,0,2532,1,9666,0,,https://i.redd.it/w7mldlr15sa11.jpg,0


In [8]:
# Drop all rows with null values
df.dropna(inplace=True)

In [9]:
# Clean dataframe, let's model
df.isnull().sum()

author         0
comments       0
created        0
score          0
subscribers    0
text           0
title          0
url            0
subreddit      0
dtype: int64

## Set X and y

In [10]:
X = df['title']
y = df['subreddit']

## Baseline Score

**The baseline score is the accuracy we would get by always predicting the majority class. Since the majority class is Nike (`subreddit = 1`), the baseline score is about `.502`**

In [11]:
y.value_counts(normalize=True)

1    0.502016
0    0.497984
Name: subreddit, dtype: float64
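As a small illustration of where the baseline comes from, the sketch below computes the majority-class accuracy by hand. The `labels` list is hypothetical and only mimics the `subreddit` encoding (1 = Nike, 0 = Adidas); it is not the real data.

```python
from collections import Counter

# Hypothetical labels: 1 = Nike, 0 = Adidas (same encoding as the `subreddit` column)
labels = [1, 1, 1, 0, 0, 1, 0, 1]

counts = Counter(labels)
# most_common(1) returns the single most frequent label and its count
majority_class, majority_count = counts.most_common(1)[0]

# Accuracy of a "model" that always predicts the majority class
baseline_accuracy = majority_count / len(labels)

print(majority_class, baseline_accuracy)
```

Any model we train should beat this number to be worth keeping.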

## Train Test Split

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, stratify=y)

## CountVectorize our data

In [13]:
cv = CountVectorizer(stop_words='english',      # Remove English stop words
                    lowercase=True,             # Convert all words to lowercase
                    min_df=2,                   # Ignore terms that appear in fewer than 2 titles
                    ngram_range=(1,5))          # Use n-grams of 1 to 5 words

In [14]:
cv_train = cv.fit_transform(X_train)            # Fit the CountVectorizer on the training data and transform it
cv_test = cv.transform(X_test)                  # Transform the testing data with the already-fitted CountVectorizer

In [15]:
# Create cv_train dataframe
cv_train_df = pd.DataFrame(cv_train.toarray(), columns = cv.get_feature_names())
cv_train_df.head()

Unnamed: 0,10,10 year,100,11,11 concord,12,12 bought,120,13,15,...,zebra,zipper,zne,zne hoodi,zoom,zoom kobe,zx,zx 500,zx 500 rm,zx flux
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [16]:
# Create cv_test dataframe
cv_test_df = pd.DataFrame(cv_test.toarray(), columns = cv.get_feature_names())
cv_test_df.head()

Unnamed: 0,10,10 year,100,11,11 concord,12,12 bought,120,13,15,...,zebra,zipper,zne,zne hoodi,zoom,zoom kobe,zx,zx 500,zx 500 rm,zx flux
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [17]:
print(cv_train.shape)
print(cv_test.shape)

(1488, 1602)
(496, 1602)
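Both matrices have the same number of columns because the vocabulary is learned from the training titles only and then reused for the testing titles. The sketch below imitates that behaviour in plain Python. It is a simplification, not scikit-learn's implementation: the titles are hypothetical, tokens are split on whitespace, stop words are not removed, and the helper names (`ngrams`, `fit_vocabulary`, `transform`) are made up for illustration.

```python
from collections import Counter

def ngrams(text, n_min=1, n_max=2):
    """Return all word n-grams of a lowercased, whitespace-split title."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def fit_vocabulary(titles, min_df=2):
    """Keep n-grams that appear in at least `min_df` different titles."""
    doc_freq = Counter()
    for title in titles:
        doc_freq.update(set(ngrams(title)))   # set(): count each title only once
    return sorted(term for term, df in doc_freq.items() if df >= min_df)

def transform(titles, vocabulary):
    """Count vocabulary terms per title; terms outside the vocabulary are ignored."""
    rows = []
    for title in titles:
        counts = Counter(ngrams(title))
        rows.append([counts[term] for term in vocabulary])
    return rows

# Hypothetical titles, not taken from the dataset
train_titles = ["nike air max", "adidas nmd boost", "nike air force", "adidas boost sale"]
test_titles = ["nike air jordan", "yeezy boost"]

vocab = fit_vocabulary(train_titles, min_df=2)   # learned from training titles only
train_matrix = transform(train_titles, vocab)
test_matrix = transform(test_titles, vocab)      # reuses the training vocabulary

print(vocab)
print(test_matrix)
```

Words that appear only in the testing titles (such as "jordan" or "yeezy" here) never become columns, which is why the test matrix cannot leak information into the fitted vocabulary.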


### Let's try different models and then optimize the one that works best

### Logistic Regression Model

In [18]:
lr = LogisticRegression(penalty='l1', tol=.0001, C=50, random_state=42)

lr.fit(cv_train, y_train)

print(lr.score(cv_train, y_train))
print('')
print(lr.score(cv_test, y_test))

0.9704301075268817

0.7681451612903226


In [19]:
lr = LogisticRegression(penalty='l2', tol=.0001, C=50, random_state=42)

lr.fit(cv_train, y_train)

print(lr.score(cv_train, y_train))
print('')
print(lr.score(cv_test, y_test))

0.9717741935483871

0.7620967741935484


|Logistic Regression Model|Performance (Lasso)|
|---|---|
|Training score|0.97|
|Testing score|0.768|

|Logistic Regression Model|Performance (Ridge)|
|---|---|
|Training score|0.972|
|Testing score|0.762|

The Logistic Regression model scores much higher on the training data than on the testing data under both the Lasso and Ridge penalties, which means the model is heavily overfit. In both cases the model performs well on the training data, but there is a significant drop-off in performance on the testing data.

In [20]:
# Create a dataframe of all coefficients, sorted in descending order
# `lr` is the Ridge model fit in the previous cell; positive coefficients point toward Nike (1), negative toward Adidas (0)
coef_df = pd.DataFrame(lr.coef_, columns=cv_train_df.columns).T.sort_values(by=0, ascending=False)

In [21]:
# Top 10 features that determine a Nike post
coef_df.head(10)

Unnamed: 0,0
nike,9.632286
thi tee,7.947265
swoosh,6.320781
shorts,5.749983
nikes,5.19059
vapormax,4.996777
need help thi,4.868915
jordan,4.850543
af1,4.805395
lebron,4.671448


In [22]:
# Top 10 features that determine an Adidas post
coef_df.tail(10)

Unnamed: 0,0
love,-4.292383
pants,-4.336523
id thi jacket,-4.348685
twice,-4.351185
authentic,-4.394604
adidas,-4.654403
return,-4.902218
boost,-4.96979
primeknit,-5.074217
nmd,-5.320318
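One way to read these coefficients: in logistic regression each coefficient is the change in the log-odds of the positive class (Nike) for a one-unit increase in the feature, so exponentiating it gives an odds ratio. The sketch below does this for a few values copied from the tables above. The very large magnitudes are likely related to the weak regularization (`C=50`), so treat them as a ranking of terms rather than as precise effect sizes.

```python
import math

# Coefficients copied from the coefficient tables above
coefficients = {"nike": 9.632286, "swoosh": 6.320781, "adidas": -4.654403, "nmd": -5.320318}

# exp(coefficient) = multiplicative change in the odds of "Nike" per extra occurrence
odds_ratios = {term: math.exp(coef) for term, coef in coefficients.items()}

for term, coef in coefficients.items():
    direction = "Nike" if coef > 0 else "Adidas"
    print(f"{term:>8}: coef={coef:+.3f}  odds ratio={odds_ratios[term]:.4g}  pushes toward {direction}")
```

An odds ratio above 1 pushes the prediction toward Nike, and one below 1 pushes it toward Adidas.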


### Multinomial Naive Bayes Model

In [23]:
mnb = MultinomialNB()

mnb.fit(cv_train, y_train)

print(mnb.score(cv_train, y_train))
print('')
print(mnb.score(cv_test, y_test))

0.8958333333333334

0.7661290322580645


|Multinomial Naive Bayes Model|Performance|
|---|---|
|Training score|0.896|
|Testing score|0.766|

Like the Logistic Regression models, the Multinomial Naive Bayes model is overfit. It scores lower on the training data than the Logistic Regression models and about the same on the testing data.

### Decision Tree Model

In [24]:
dt = DecisionTreeClassifier(criterion='gini', random_state=42, max_depth=1000)

dt.fit(cv_train, y_train)

print(dt.score(cv_train, y_train))
print('')
print(dt.score(cv_test, y_test))

0.9744623655913979

0.6915322580645161


|Decision Tree Model|Performance|
|---|---|
|Training score|0.974|
|Testing score|0.692|

The Decision Tree Model performs better than the Multinomial Naive Bayes model on the training set. Its testing score, however, is lower than that of the previous models, so it also suffers from overfitting.

### Bagged Tree Model

In [25]:
bag = BaggingClassifier(base_estimator=DecisionTreeClassifier(criterion='gini', random_state=42, max_depth=1000), 
                        n_estimators=100)

bag.fit(cv_train, y_train)

print(bag.score(cv_train, y_train))
print('')
print(bag.score(cv_test, y_test))

0.9744623655913979

0.7399193548387096


|Bagged Tree Model|Performance|
|---|---|
|Training score|0.974|
|Testing score|0.74|

The Bagged Tree Model is also overfit. 

### Extra Trees Model

In [26]:
et = ExtraTreesClassifier(max_depth=1000,
                          random_state=42, criterion='gini', n_estimators=200)

et.fit(cv_train, y_train)

print(et.score(cv_train, y_train))
print('')
print(et.score(cv_test, y_test))

0.9744623655913979

0.7600806451612904


|Extra Trees Model|Performance|
|---|---|
|Training score|0.974|
|Testing score|0.76|

The Extra Trees Model is also overfit.

### Let's see how these models compare to a Random Forest Model

### Random Forest Model

In [27]:
rf = RandomForestClassifier(n_estimators=200, criterion='gini', max_depth=1000, random_state=42)

rf.fit(cv_train, y_train)

print(rf.score(cv_train, y_train))
print('')
print(rf.score(cv_test, y_test))

0.9744623655913979

0.7540322580645161


|Random Forest Model|Performance|
|---|---|
|Training score|0.974|
|Testing score|0.754|

The Random Forest Model is also overfit and performs very similarly to the Extra Trees Model on both training and testing sets.
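To compare the CountVectorizer models side by side, the sketch below collects the training and testing scores printed by the cells above (rounded to four decimals) and computes the train/test gap for each model. The dictionary is typed in by hand from those outputs; nothing is re-fit here.

```python
# (train accuracy, test accuracy) as printed by the cells above
scores = {
    "Logistic Regression (Lasso)": (0.9704, 0.7681),
    "Logistic Regression (Ridge)": (0.9718, 0.7621),
    "Multinomial Naive Bayes":     (0.8958, 0.7661),
    "Decision Tree":               (0.9745, 0.6915),
    "Bagged Trees":                (0.9745, 0.7399),
    "Extra Trees":                 (0.9745, 0.7601),
    "Random Forest":               (0.9745, 0.7540),
}

# Gap between training and testing accuracy: a rough indicator of overfitting
gaps = {name: train - test for name, (train, test) in scores.items()}

best_test = max(scores, key=lambda name: scores[name][1])
smallest_gap = min(gaps, key=gaps.get)

# Print models from highest to lowest testing score
for name, (train, test) in sorted(scores.items(), key=lambda kv: kv[1][1], reverse=True):
    print(f"{name:<28} train={train:.3f}  test={test:.3f}  gap={gaps[name]:.3f}")

print("Best test score:", best_test)
print("Smallest gap:   ", smallest_gap)
```

Ranking by testing score and by gap can give different answers, which is worth keeping in mind when choosing a model to tune.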

### The best model under CountVectorizer is: Logistic Regression Model (Lasso)
**(Based on test scores)**
### Let's optimize using GridSearch

In [28]:
lr = LogisticRegression(random_state=42)
lr_params = {'penalty': ['l1', 'l2'],
         'tol': [.0001, .001, .00001],
         'C': [1.0, 10.0, 50.0]}

gs = GridSearchCV(lr, param_grid=lr_params, cv=5)

gs.fit(cv_train, y_train)
print(gs.best_score_)
print(gs.best_params_)
print('')

print(gs.score(cv_train, y_train))
print('')
print(gs.score(cv_test, y_test))

0.7869623655913979
{'C': 1.0, 'penalty': 'l2', 'tol': 0.0001}

0.9348118279569892

0.7439516129032258


After running GridSearch, the best cross-validation score is about .79. However, when we score the best GridSearch estimator on the training and testing sets, it does not outperform the original model: the testing score is about .744, compared with .768 for the original Lasso model. We expected GridSearch to improve the model by tuning the parameters we fed into it, but in this case it did not produce a better testing score.

The data we gathered is only a sample of the posts in each subreddit, so the cross-validated score is a good estimate of how the model generalizes to data it has not seen. During cross validation, the training set is repeatedly split into a smaller training set and a held-out validation set on which the model is scored.
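To get a feel for how much work the grid search does, the sketch below enumerates the parameter combinations from `lr_params` and the size of each cross-validation fold for the 1488 training rows. It only counts rows: scikit-learn uses stratified folds for classifiers, which this simplification ignores, and `fold_sizes` is a hypothetical helper written for this illustration.

```python
from itertools import product

lr_params = {"penalty": ["l1", "l2"],
             "tol": [0.0001, 0.001, 0.00001],
             "C": [1.0, 10.0, 50.0]}

# Every combination of hyperparameters that the grid search evaluates
candidates = [dict(zip(lr_params, values)) for values in product(*lr_params.values())]

def fold_sizes(n_samples, n_folds):
    """Validation-fold sizes; the first n_samples % n_folds folds get one extra row."""
    base, extra = divmod(n_samples, n_folds)
    return [base + 1 if i < extra else base for i in range(n_folds)]

n_train, n_folds = 1488, 5          # training rows and cv=5, as in the cells above
sizes = fold_sizes(n_train, n_folds)

print(len(candidates), "candidates x", n_folds, "folds =", len(candidates) * n_folds, "fits")
for i, val_size in enumerate(sizes, start=1):
    print(f"fold {i}: fit on {n_train - val_size} rows, validate on {val_size} rows")
```

The `best_score_` reported by the grid search is the mean validation accuracy across the folds for the winning combination, so each of its estimates comes from a model trained on roughly 80% of the training set.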

## TfidfVectorize our data

In [29]:
tf = TfidfVectorizer(stop_words='english',      # Remove English stop words
                    lowercase=True,             # Convert all words to lowercase
                    min_df=2,                   # Ignore terms that appear in fewer than 2 titles
                    ngram_range=(1,5))          # Use n-grams of 1 to 5 words

In [30]:
tf_train = tf.fit_transform(X_train)            # Fit the TfidfVectorizer on the training data and transform it
tf_test = tf.transform(X_test)                  # Transform the testing data with the already-fitted TfidfVectorizer
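To make the difference from raw counts concrete, here is a minimal sketch of how one TF-IDF row can be computed by hand. It assumes scikit-learn's default settings (smoothed idf, raw term counts, L2 row normalization) and uses hypothetical, already-tokenized titles; the helper names `idf` and `tfidf_row` are made up for illustration.

```python
import math
from collections import Counter

# Hypothetical, already-tokenized training titles
docs = [["nike", "air", "max"],
        ["adidas", "boost"],
        ["nike", "boost", "boost"]]

n_docs = len(docs)
# Document frequency: in how many titles each term appears
doc_freq = Counter(term for doc in docs for term in set(doc))

def idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf_row(doc):
    """Raw count times idf, then scaled so the row has unit Euclidean length."""
    counts = Counter(doc)
    weights = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

rows = [tfidf_row(doc) for doc in docs]
for row in rows:
    print({term: round(weight, 3) for term, weight in row.items()})
```

Terms that appear in fewer titles receive a larger weight than common ones, which is why the values are fractions between 0 and 1 rather than integer counts.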

In [31]:
# Create tf_train dataframe
tf_train_df = pd.DataFrame(tf_train.toarray(), columns = tf.get_feature_names())
tf_train_df.head()

Unnamed: 0,10,10 year,100,11,11 concord,12,12 bought,120,13,15,...,zebra,zipper,zne,zne hoodi,zoom,zoom kobe,zx,zx 500,zx 500 rm,zx flux
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [32]:
# Create tf_test dataframe
tf_test_df = pd.DataFrame(tf_test.toarray(), columns = tf.get_feature_names())
tf_test_df.head()

Unnamed: 0,10,10 year,100,11,11 concord,12,12 bought,120,13,15,...,zebra,zipper,zne,zne hoodi,zoom,zoom kobe,zx,zx 500,zx 500 rm,zx flux
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [33]:
print(tf_train.shape)
print(tf_test.shape)

(1488, 1602)
(496, 1602)


### Let's try different models and then optimize the one that works best

### Logistic Regression Model

In [34]:
log = LogisticRegression(penalty='l2', tol=.0001, C=50, random_state=42)

log.fit(tf_train, y_train)

print(log.score(tf_train, y_train))
print('')
print(log.score(tf_test, y_test))

0.9684139784946236

0.7620967741935484


In [35]:
log = LogisticRegression(penalty='l1', tol=.0001, C=50, random_state=42)

log.fit(tf_train, y_train)

print(log.score(tf_train, y_train))
print('')
print(log.score(tf_test, y_test))

0.9711021505376344

0.7540322580645161


|TfidfVectorizer - Logistic Regression Model|Performance (Lasso)|Performance (Ridge)|
|---|---|---|
|Training score|0.971|0.968|
|Testing score|0.754|0.762|


|CountVectorizer - Logistic Regression Model|Performance (Lasso)|Performance (Ridge)|
|---|---|---|
|Training score|0.97|0.972|
|Testing score|0.768|0.762|

The Logistic Regression model scores much higher on the training data than on the testing data under both the Lasso and Ridge penalties, which means the model is heavily overfit. In both cases the model performs well on the training data, but there is a significant drop-off in performance on the testing data.

Compared with the CountVectorizer, the Ridge training score and the Lasso testing score were lower under the TfidfVectorizer, while the Ridge testing score was unchanged.

In [36]:
# Create a dataframe of all coefficients, sorted in descending order
# `log` is the Lasso model fit in the previous cell; positive coefficients point toward Nike (1), negative toward Adidas (0)
coef_df2 = pd.DataFrame(log.coef_, columns=tf_train_df.columns).T.sort_values(by=0, ascending=False)

In [37]:
coef_df2.head(10)

Unnamed: 0,0
nike,67.448966
thi tee,29.83354
air,24.093357
thi adida,23.591298
women jacket,22.991701
jordan,22.171483
af1,21.438488
swoosh,18.586644
id thi hat,17.899171
56,15.930431


In [38]:
coef_df2.tail(10)

Unnamed: 0,0
love,-17.411038
adidas,-17.465362
pants,-18.736918
tell thi,-21.802988
appreciated,-21.860661
boost,-22.116376
uncomfort,-22.482716
ultraboost,-22.610392
nmd,-31.202116
adida,-63.892771


### Multinomial Naive Bayes Model

In [39]:
mnnb = MultinomialNB()

mnnb.fit(tf_train, y_train)

print(mnnb.score(tf_train, y_train))
print('')
print(mnnb.score(tf_test, y_test))

0.9166666666666666

0.7681451612903226


|TfidfVectorizer - Multinomial Naive Bayes Model|Performance|
|---|---|
|Training score|0.917|
|Testing score|0.768|


|CountVectorizer - Multinomial Naive Bayes Model|Performance|
|---|---|
|Training score|0.896|
|Testing score|0.766|

Although the model is still overfit under the TfidfVectorizer, both the training and testing scores were higher than under the CountVectorizer, although the testing score improved only marginally (0.768 versus 0.766).

### Decision Tree Model

In [40]:
dT = DecisionTreeClassifier(criterion='gini', random_state=42, max_depth=1000)

dT.fit(tf_train, y_train)

print(dT.score(tf_train, y_train))
print('')
print(dT.score(tf_test, y_test))

0.9744623655913979

0.7379032258064516


|TfidfVectorizer - Decision Tree Model|Performance|
|---|---|
|Training score|0.974|
|Testing score|0.738|


|CountVectorizer - Decision Tree Model|Performance|
|---|---|
|Training score|0.974|
|Testing score|0.692|

Although the model is still overfit under the TfidfVectorizer, its testing score was higher than under the CountVectorizer.

### Bagged Tree Model

In [41]:
bagged = BaggingClassifier(base_estimator=DecisionTreeClassifier(criterion='gini', random_state=42, 
                                                                 max_depth=1000), n_estimators=100)

bagged.fit(tf_train, y_train)

print(bagged.score(tf_train, y_train))
print('')
print(bagged.score(tf_test, y_test))

0.9737903225806451

0.7540322580645161


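|TfidfVectorizer - Bagged Tree Model|Performance|
|---|---|
|Training score|0.974|
|Testing score|0.754|


|CountVectorizer - Bagged Tree Model|Performance|
|---|---|
|Training score|0.974|
|Testing score|0.74|

The Bagged Tree Model scored slightly higher on the testing set under the TfidfVectorizer (0.754) than under the CountVectorizer (0.74). Because the `BaggingClassifier` itself was not given a `random_state`, these scores can vary from run to run.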

### Extra Trees Model

In [42]:
eT = ExtraTreesClassifier(max_depth=1000, random_state=42, criterion='gini', n_estimators=200)

eT.fit(tf_train, y_train)

print(eT.score(tf_train, y_train))
print('')
print(eT.score(tf_test, y_test))

0.9744623655913979

0.7560483870967742


|TfidfVectorizer - Extra Trees Model|Performance|
|---|---|
|Training score|0.974|
|Testing score|0.756|


|CountVectorizer - Extra Trees Model|Performance|
|---|---|
|Training score|0.974|
|Testing score|0.76|

The Extra Trees Model performed almost identically under both vectorizers, with a testing score of 0.756 under the TfidfVectorizer versus 0.76 under the CountVectorizer.

### Let's see how these models compare to a Random Forest Model

### Random Forest Model

In [43]:
rF = RandomForestClassifier(n_estimators=200, criterion='gini', max_depth=1000,
                           random_state=42)
rF.fit(tf_train, y_train)

print(rF.score(tf_train, y_train))
print('')
print(rF.score(tf_test, y_test))

0.9744623655913979

0.7661290322580645


|TfidfVectorizer - Random Forest Model|Performance|
|---|---|
|Training score|0.974|
|Testing score|0.766|


|CountVectorizer - Random Forest Model|Performance|
|---|---|
|Training score|0.974|
|Testing score|0.754|

Although the model is still overfit under the TfidfVectorizer, its testing score was higher than under the CountVectorizer.
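Because the testing set has only 496 posts, a single post is worth about 0.002 in accuracy, so it helps to translate the score differences back into numbers of posts. The sketch below does that using the testing scores printed by the cells above (rounded to four decimals and typed in by hand); it is a rough reading aid, not a significance test.

```python
n_test = 496   # rows in the testing set, from the shapes printed earlier

# Testing accuracy per model: (CountVectorizer, TfidfVectorizer), as printed by the cells above
test_scores = {
    "Logistic Regression (Lasso)": (0.7681, 0.7540),
    "Logistic Regression (Ridge)": (0.7621, 0.7621),
    "Multinomial Naive Bayes":     (0.7661, 0.7681),
    "Decision Tree":               (0.6915, 0.7379),
    "Bagged Trees":                (0.7399, 0.7540),
    "Extra Trees":                 (0.7601, 0.7560),
    "Random Forest":               (0.7540, 0.7661),
}

# Convert each accuracy back into a count of correctly classified testing posts
correct = {name: (round(count_acc * n_test), round(tfidf_acc * n_test))
           for name, (count_acc, tfidf_acc) in test_scores.items()}

# Positive values mean TF-IDF classified more testing posts correctly
delta_posts = {name: tfidf_n - count_n for name, (count_n, tfidf_n) in correct.items()}

for name, delta in sorted(delta_posts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<28} {delta:+d} testing posts under TF-IDF")
```

Seen this way, most of the differences between the two vectorizers amount to a handful of posts, with the single Decision Tree being the main exception.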

### The best models under TfidfVectorizer are: Random Forest Model & Multinomial Naive Bayes Model
**(Based on test and train scores)**
### Let's optimize both using GridSearch

In [44]:
# Random Forest Model GridSearch

rF = RandomForestClassifier(random_state=42)
rF_params = {
    'n_estimators': [10, 50, 100],
    'max_depth': [50, 100, 150]
}

gS = GridSearchCV(rF , param_grid=rF_params, cv=5)

gS.fit(tf_train, y_train)
print(gS.best_score_)
print(gS.best_params_)
print('')
print(gS.score(tf_train, y_train))
print('')
print(gS.score(tf_test, y_test))

0.7735215053763441
{'max_depth': 100, 'n_estimators': 50}

0.948252688172043

0.7641129032258065


After running GridSearch on the Random Forest Model, the best cross-validation score is about .77. However, when we score the best GridSearch estimator on the training and testing sets, it does not outperform the original model: the testing score is about .764, compared with .766 for the original Random Forest. We expected GridSearch to improve the model by tuning the parameters we fed into it, but in this case it did not produce a better testing score.

As before, the data we gathered is only a sample of the posts in each subreddit, so the cross-validated score is a good estimate of how the model generalizes to data it has not seen. During cross validation, the training set is repeatedly split into a smaller training set and a held-out validation set on which the model is scored.