# Project 3 Data Cleaning & EDA

## Contents:

- **[Import Libraries](#Import-Libraries)**.  
    
- **[Data Cleaning & EDA](#Data-Cleaning-Exploratory-Data-Analysis)**.  

- **[Model Preparation](#Model-Preparation)**.  

   - **[Baseline Model](#Baseline-Model)**. 
   - **[Pipeline Models](#Pipeline-Models)**. 
   - **[Logistic Regression Models](#Logistic-Regression-Models)**.  
   - **[Naive Bayes Models](#Naive-Bayes-Models)**.  
   - **[Random Forest](#Random-Forest)**.  
   - **[Extra Trees](#Extra-Trees)**.  
   
- **[Model Evaluation](#Model-Evaluation)**.  

- **[Conclusions and Recommendations](#Conclusions-and-Recommendations)**.  

- **[References](#References)**.

### Import Libraries

In [195]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np

from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import re
import seaborn as sns

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, BaggingClassifier
from sklearn.dummy import DummyClassifier
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.tree import DecisionTreeClassifier, plot_tree, export_text


from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
import time
from sklearn.pipeline import Pipeline

import matplotlib.pyplot as plt
%matplotlib inline

#### Reading csv files

In [114]:
baby_data = pd.read_csv('../data/result1.csv')

In [115]:
baby_data.shape

(1724, 32)

In [116]:
Pets_data = pd.read_csv('../data/result2.csv')

In [117]:
Pets_data.shape

(2959, 32)

Frequency of classes are similar in both data. There is no need to stratify or correct for imbalanced data

In [118]:
comb_data = pd.read_csv('../data/combined_db.csv') 

### Data cleaning & EDA

In [119]:
comb_data.head()

Unnamed: 0,subreddit,subreddit_type,author,created_utc,link_flair_text_color,link_flair_type,retrieved_on,score,subreddit_subscribers,subreddit_type.1,...,is_robot_indexable,is_self,is_video,locked,media_only,over_18,pinned,spoiler,stickied,timestamp
0,baby,public,purehoopla,1584915670,dark,text,1584915676,1,6852,public,...,False,True,False,False,False,False,False,False,False,2020-03-22
1,baby,public,Renegade626,1584933321,dark,text,1584933322,1,6855,public,...,True,True,False,False,False,False,False,False,False,2020-03-22
2,baby,public,mytrendybabystore,1584983015,dark,text,1584983020,1,6856,public,...,False,True,False,False,False,False,False,False,False,2020-03-23
3,baby,public,mytrendybabystore,1584988382,dark,text,1584988392,1,6857,public,...,False,True,False,False,False,False,False,False,False,2020-03-23
4,baby,public,kellimoxie,1585016982,dark,text,1585016988,1,6860,public,...,True,True,False,False,False,False,False,False,False,2020-03-23


checking null value

In [121]:
comb_data.isnull().sum().sum()

0

There is no null values.

In [123]:
comb_data.dtypes

subreddit                 object
subreddit_type            object
author                    object
created_utc                int64
link_flair_text_color     object
link_flair_type           object
retrieved_on               int64
score                      int64
subreddit_subscribers      int64
subreddit_type.1          object
title                     object
domain                    object
full_link                 object
url                       object
is_reddit_media_domain      bool
no_follow                   bool
send_replies                bool
can_mod_post                bool
contest_mode                bool
is_crosspostable            bool
is_meta                     bool
is_original_content         bool
is_robot_indexable          bool
is_self                     bool
is_video                    bool
locked                      bool
media_only                  bool
over_18                     bool
pinned                      bool
spoiler                     bool
stickied  

In [124]:
comb_data['subreddit'].value_counts()

Pets    2959
baby    1724
Name: subreddit, dtype: int64

Create Binary y column based on subreddit name

In [125]:
comb_data['y'] = comb_data['subreddit'].map(lambda x: 1 if x == 'baby' else 0)

comb_data.head(2)

Unnamed: 0,subreddit,subreddit_type,author,created_utc,link_flair_text_color,link_flair_type,retrieved_on,score,subreddit_subscribers,subreddit_type.1,...,is_self,is_video,locked,media_only,over_18,pinned,spoiler,stickied,timestamp,y
0,baby,public,purehoopla,1584915670,dark,text,1584915676,1,6852,public,...,True,False,False,False,False,False,False,False,2020-03-22,1
1,baby,public,Renegade626,1584933321,dark,text,1584933322,1,6855,public,...,True,False,False,False,False,False,False,False,2020-03-22,1


In [126]:
comb_data['y'].value_counts(normalize=True)

0    0.63186
1    0.36814
Name: y, dtype: float64

In [127]:
comb_data['title']

0          What It’s Like to Be (Very) Pregnant Right Now
1                       Tummy sleeper wakes several times
2        Online Resources For Kids During Home Quarantine
3               Chicken Pot Pie The Whole Family Can Make
4       I could use advice regarding a cancelled baby ...
                              ...                        
4678                            Cat begs for food nonstop
4679     Found a price sticker in Royal Canin Small Adult
4680    6 month old kitty sometimes plays with mouth o...
4681                     Remember to keep pets safe today
4682                       Getting pee smell out of couch
Name: title, Length: 4683, dtype: object

Let create a function to tokenize,Lemmatizing & Stemming data

In [128]:
def clean_text(text_to_clean):
    # subs charact in the brackets
    text_to_clean = re.sub( '[^a-zA-Z0-9]', ' ', text_to_clean)
    # subs tabs,newlines and "whitespace-like"
    text_to_clean = re.sub( '\s+', ' ', text_to_clean).strip()
    # convert to lowercase split indv words
    words = text_to_clean.lower().split() 
    #converting stop words to set
    stops = set(stopwords.words('english')) 
    # removing stop words
    meaningful_words = [w for w in words if not w in stops] 
    #Join the words back into one string separated by space, 
    # and return the result.
    return(" ".join(meaningful_words))

In [129]:
comb_data['clean_title'] = comb_data.apply(lambda x: clean_text(x['title']), axis=1)

In [130]:
comb_data['clean_url']=comb_data.apply(lambda x: clean_text(x['url']), axis=1)

In [131]:
comb_data['clean_full_link']=comb_data.apply(lambda x: clean_text(x['full_link']), axis=1)

In [177]:
# comb_data.drop_dublicates(subset = ['subreddit', 'subreddit_type'
#                     'author', 'created_utc', 'link_flair_text_color','link_flair_type', 'retrieved_on', 'score'
# 'subreddit_subscribers','subreddit_type.1','title', 'domain', 'full_link', 'url',
# 'is_reddit_media_domain','no_follow', 'send_replies', 'can_mod_post'
# 'contest_mode', 'is_crosspostable', 'is_meta', 'is_original_content'
# 'is_robot_indexable', 'is_self', 'is_video', 'locked', 'media_only',
# 'over_18', 'pinned', 'spoiler', 'stickied', 'timestamp'], keep = 'first')

In [133]:
comb_data.head(2)

Unnamed: 0,subreddit,subreddit_type,author,created_utc,link_flair_text_color,link_flair_type,retrieved_on,score,subreddit_subscribers,subreddit_type.1,...,media_only,over_18,pinned,spoiler,stickied,timestamp,y,clean_title,clean_url,clean_full_link
0,baby,public,purehoopla,1584915670,dark,text,1584915676,1,6852,public,...,False,False,False,False,False,2020-03-22,1,like pregnant right,https www reddit com r baby comments fn8hsd li...,https www reddit com r baby comments fn8hsd li...
1,baby,public,Renegade626,1584933321,dark,text,1584933322,1,6855,public,...,False,False,False,False,False,2020-03-22,1,tummy sleeper wakes several times,https www reddit com r baby comments fncyve tu...,https www reddit com r baby comments fncyve tu...


### Model Preparation 

establish X and Y

In [137]:
X = comb_data['clean_title']
y = comb_data['y']

Split data into training and testing sets

In [139]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, stratify = y)

In [None]:
Stopwords

Let's look at sklearn's stopwords.

In [141]:
print(CountVectorizer(stop_words = 'english').get_stop_words())

frozenset({'towards', 'another', 'several', 'whether', 'moreover', 'ten', 'cant', 'serious', 'though', 'nothing', 'forty', 'least', 'even', 'eg', 'up', 'thick', 'thereafter', 'them', 'eleven', 'across', 'often', 'un', 'being', 'any', 'am', 'due', 'should', 'whatever', 'there', 'no', 'something', 'nine', 'himself', 'bottom', 'without', 'per', 'which', 'our', 'de', 'onto', 'next', 'ourselves', 'had', 'those', 'about', 'some', 'via', 'each', 'mine', 'neither', 'yourselves', 'out', 'last', 'most', 'not', 'part', 'hundred', 'here', 'made', 'only', 'fill', 'find', 'but', 'once', 'between', 'itself', 'amoungst', 'when', 'be', 're', 'found', 'too', 'together', 'ever', 'mill', 'as', 'behind', 'somewhere', 'than', 'sixty', 'else', 'yours', 'still', 'also', 'beside', 'former', 'in', 'whence', 'thereby', 'whither', 'anyhow', 'of', 'its', 'then', 'everywhere', 'except', 'over', 'beyond', 'under', 'by', 'could', 'his', 'toward', 'fifteen', 'whenever', 'may', 'my', 'thru', 'for', 'will', 'whereby', '

In [142]:
cv = CountVectorizer(stop_words=['eg', 'un', 'am', 'ltd', 'etc', 'may', 'my', 'thru', 'for',
                                'from', 'inc', 'con', 'more', 'ie', ])

### Modeling


#### Baseline Model

Since this is a classification problem (and we'll be using accuracy as our metric), the baseline model is to predicted the most frequently occuring target class.The baseline accuracy is the percentage of the majority class, regardless of whether it is 1 or 0. It serves as the benchmark for our model to beat.

In [140]:
y.value_counts(normalize=True)

0    0.63186
1    0.36814
Name: y, dtype: float64

Here 1 as 'baby' and 0 as 'Pets' expect an accuracy of ~36% and ~63%. Any model performing well above this will be a significant improvement to the baseline model.

In [174]:
# instantiate DummyClassifier
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_train, y_train)

# score on test
print('Test Score:', dummy.score(X_test, y_test))

# score on train
print('Train Score:', dummy.score(X_train, y_train))

# score on cross val
print('Cross Val Score:', cross_val_score(dummy, X, y, cv =5).mean())

Test Score: 0.6319385140905209
Train Score: 0.6318337129840547
Cross Val Score: 0.6318599549389303


Our baseline model is at ~63% accuracy.
Therefore, we are looking to build a model that does better than 63%, otherwise our best bet is to blindly guess the same outcome every time.

#### Pipeline Models

In [178]:
cv = CountVectorizer(stop_words=['eg', 'un', 'am', 'ltd', 'etc', 'may', 'my', 'thru', 'for',
                      'from', 'inc', 'con', 'more', 'ie', ])
   
tvect = TfidfVectorizer()
lr = LogisticRegression()
mnb = MultinomialNB()
bnb = BernoulliNB()

Creates 2 Logistic Regression pipelines using count and tfidf vectorizers
with two stages:

In [None]:
cv_lr_pipe = Pipeline([
    ('cv', cv), #transformer
    ('lr', lr)  # estimator
])
tfidf_lr_pipe = Pipeline([
    ('tvect', tvect), #transformer
    ('lr', lr)        # estimator
])

# Creates two Naive Bayes pipelines with Multinomial and binomial NB
# with two stages:

m_nb_pipe = Pipeline([
    ('cv', cv),    #transformer
    ('mnb', mnb)   # estimator
])
b_nb_pipe = Pipeline([
    ('tvect', tvect), #transformer  
    ('bnb', bnb)      # estimator
])

 Search over the following values of hyperparameters:

In [184]:
params_1 = {
    'cv__max_features':[None,5_000,10_000],
    #'cv__min_df': [2, 3],
    #'cv__max_df': [.9, .95],
    'cv__ngram_range':[(1,1),(1,2)]
}
params_2 = {
    'tvect__max_features':[None,4_000,5_000,10_000],
    #'tvect__min_df': [2, 3],
    #'tvect__max_df': [.9, .95],
    'tvect__ngram_range':[(1,1),(1,2)]
}

 Instantiate GridSearchCV & Fit GridSearch to training data.

#### Logistic Regression Models

In [187]:
gs_cv = GridSearchCV(cv_lr_pipe, 
                  params_1,
                  cv = 5)
gs_cv.fit(X_train, y_train)
print('Best parameters:', gs_cv.best_params_)
print('Best Score:', gs_cv.best_score_)
print('Test Score:', gs_cv.score(X_test, y_test))
print('Train Score:', gs_cv.score(X_train, y_train))

Best parameters: {'cv__max_features': None, 'cv__ngram_range': (1, 2)}
Best Score: 0.9447593342330185
Test Score: 0.9538855678906917
Train Score: 0.994874715261959


In [188]:
gs_tfidf = GridSearchCV(tfidf_lr_pipe, 
                  params_2,
                  cv = 5)
gs_tfidf.fit(X_train, y_train)
print('Best parameters:', gs_tfidf.best_params_)
print('Best Score:', gs_tfidf.best_score_)
print('Test Score:', gs_tfidf.score(X_test, y_test))
print('Train Score:', gs_tfidf.score(X_train, y_train))

Best parameters: {'tvect__max_features': None, 'tvect__ngram_range': (1, 1)}
Best Score: 0.9342212658002133
Test Score: 0.9453458582408198
Train Score: 0.9806378132118451


#### Naive Bayes Models

Multinomial NB

In [169]:
gs_m_nb = GridSearchCV(m_nb_pipe, 
                  params_1,
                  cv = 5)
gs_m_nb.fit(X_train, y_train)
print('Best parameters:', gs_m_nb.best_params_)
print('Best Score:', gs_m_nb.best_score_)
print('Test Score:', gs_m_nb.score(X_test, y_test))
print('Train Score:', gs_m_nb.score(X_train, y_train))

Best parameters: {'cv__max_df': 0.9, 'cv__max_features': None, 'cv__min_df': 2, 'cv__ngram_range': (1, 1)}
Best Score: 0.9245370876949824
Test Score: 0.9291204099060631
Train Score: 0.9607061503416856


Binomial NB

In [171]:
gs_b_nb = GridSearchCV(b_nb_pipe, 
                  params_2,
                  cv = 5)
gs_b_nb.fit(X_train, y_train)
print('Best parameters:', gs_b_nb.best_params_)
print('Best Score:', gs_b_nb.best_score_)
print('Test Score:', gs_b_nb.score(X_test, y_test))
print('Train Score:', gs_b_nb.score(X_train, y_train))

Best parameters: {'tvect__max_df': 0.9, 'tvect__max_features': None, 'tvect__min_df': 2, 'tvect__ngram_range': (1, 1)}
Best Score: 0.9208370313633474
Test Score: 0.9222886421861657
Train Score: 0.9570045558086561


#### Random Forest

In [158]:
cv = CountVectorizer()

# Transform the corpus
X_train_vec = cv.fit_transform(X_train)
X_test_vec = cv.transform(X_test)

# an instance of RandomForestClassifier 
rf = RandomForestClassifier(random_state=42)

# Fiting
rf.fit(X_train_vec,y_train)

#cross val score is the best_score of model
print("Cross Val Score: ", cross_val_score(rf,X_train_vec,y_train,cv=5).mean())

#rounded down features in each split for classification problem.
print("Train Score: ", round(rf.score(X_train_vec,y_train),4))
print("Train Score: ", round(rf.score(X_test_vec,y_test),4))


Cross Val Score:  0.929947761526709
Train Score:  0.9983
Train Score:  0.9342


#### Extra Trees

In [159]:
cv = CountVectorizer()
# Transform the corpus
X_train_vec = cv.fit_transform(X_train)
X_test_vec = cv.transform(X_test)

# an instance of ExtraTreesClassifier
et = ExtraTreesClassifier(random_state=42)

# Fiting
et.fit(X_train_vec,y_train)
#cross val score is the best_score of model
print("Cross Val Score: ",cross_val_score(rf,X_train_vec,y_train,cv=5).mean())

#rounded down features in each split for classification problem.
print("Train Score: ", round(et.score(X_train_vec,y_train),4))
print("Test Score: ", round(et.score(X_test_vec,y_test),4))

Cross Val Score:  0.929947761526709
Train Score:  0.9983
Test Score:  0.9223


In [None]:
Baggin Classifier

In [None]:
Top Feature barh

In [166]:
# # instantiate StandardScalar
# sc = StandardScaler()

# # fit on X_train AND transform it
# sc_X_train = sc.fit_transform(X_train)

# # ONLY transform the X_test
# sc_X_test = sc.transform(X_test)

# # instantiate model
# knn = KNeighborsClassifier()

# # fit model
# knn.fit(sc_X_train, y_train)
# # train score
# print("Train Score: ", knn.score(sc_X_train, y_train))
# # test score
# print("Test Score: ", knn.score(sc_X_test, y_test))
# # cross val
# print("Cross Val Score: ", cross_val_score(knn, sc.transform(X), y).mean())

Using accuracy as metric, evaluate all models on the training, testing sets and cross val score/best score

| Model                           | Train Score        | Test Score         | Cross Val Score/Best Score |
|---------------------------------|--------------------|--------------------|----------------------------|
| Baseline                        | 0.6318337129840547 | 0.6319385140905209 | 0.6318599549389303         |
| CountVectorizer(estimator = lr) | 0.9812072892938497 | 0.9393680614859095 | 0.9382086539981277         |
| TfidfVectorizer(estimator = lr) | 0.9783599088838268 | 0.9436379163108455 | 0.9302334723387355         |
| MultinomialNB                   | 0.9607061503416856 | 0.9291204099060631 | 0.9245370876949824         |
| BernoulliNB                     | 0.9570045558086561 | 0.9222886421861657 | 0.9208370313633474         |
| Random Forest                   | 0.9342             | 0.9983             | 0.929947761526709          |
| Extra Trees                     | 0.9223             | 0.9983             | 0.929947761526709          |

### Model Evaluation

In [189]:
# Use best scoring model to evaluate
predictions = ...predict(X_test)
# check out confusion matrix
cm = confusion_matrix(y_test, predictions)

In [None]:
# Convert confusion matrix to dataframe
cm_df = pd.DataFrame(cm,
                    columns = ['predicted neg', 'predicted pos'],
                    index = ['actual neg', 'actual pos'])
cm_df

In [None]:
# Calculate model accuracy
accuracy =(1096+1157)/(1096+153+87+1157)
accuracy*100

### Conclusions and Recommendations

### References

- [Pushshift API](https://github.com/pushshift/api)

- https://api.pushshift.io/reddit/search/submission?subreddit=baby

- https://api.pushshift.io/reddit/search/submission?subreddit=Pets

- https://www.epochconverter.com

- https://pushshift.io/api-parameters/