# Data Compilation

In [12]:
import os
import pandas as pd
from sklearn.feature_extraction import stop_words
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB, GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix

`os` is a library that allows us to use "operating system dependent functionality." Here, we can use the `.listdir()` method to list the contents of a directory:

In [3]:
list_of_dfs = []

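try: 
    print('Lets try and pull data from the files in this folder')
    #[1:] skips the first directory entry (presumably a non-CSV file); note that os.listdir() does not guarantee any order
    for file in os.listdir('../data')[1:]:
        d = pd.read_csv(os.path.join('../data', file))
        list_of_dfs.append(d)
    print('Huzzah! Mission Complete')
except Exception as err:   #A bare except would also swallow KeyboardInterrupt
    print(f"Welp, that didn't work: {err}")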

df = pd.concat(list_of_dfs, ignore_index=True).drop_duplicates(subset = 'title')
df.shape

Lets try and pull data from the files in this folder
Huzzah! Mission Complete



(2993, 104)

If we have hundreds of CSVs in the data subfolder, we don't want to manually create hundreds of dataframes just to concatenate them. Instead, as in the cell above, we can programmatically list what's inside the data subfolder, import each file, store it, and then concatenate everything.
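
As a hedged sketch of the same idea, the helper below (the name `list_csv_paths` is my own and not part of this notebook) collects only the `.csv` files in a folder with the standard library's `pathlib`, so hidden or non-CSV files never reach `pd.read_csv`. It assumes the CSVs sit directly inside the folder, and it runs on a throwaway folder so it works anywhere.

```python
import tempfile
from pathlib import Path


def list_csv_paths(folder):
    """Return the CSV files directly inside `folder`, sorted by name.

    Hypothetical helper for illustration: filtering on the extension avoids
    relying on where non-CSV entries happen to fall in the directory listing.
    """
    return sorted(p for p in Path(folder).iterdir()
                  if p.is_file() and p.suffix.lower() == '.csv')


# Demo on a temporary folder containing two CSVs and one hidden file
with tempfile.TemporaryDirectory() as tmp:
    for name in ['b.csv', 'a.csv', '.DS_Store']:
        (Path(tmp) / name).write_text('title,subreddit\n')
    csv_names = [p.name for p in list_csv_paths(tmp)]

print(csv_names)
```

Each returned path could then be passed to `pd.read_csv` and appended to `list_of_dfs`, just like in the loop above.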

*** 
# Data Exploration

In [4]:
df.columns

Index(['all_awardings', 'allow_live_comments', 'approved_at_utc',
       'approved_by', 'archived', 'author', 'author_cakeday',
       'author_flair_background_color', 'author_flair_css_class',
       'author_flair_richtext',
       ...
       'thumbnail_width', 'title', 'total_awards_received', 'ups', 'url',
       'user_reports', 'view_count', 'visited', 'whitelist_status', 'wls'],
      dtype='object', length=104)

In [5]:
df.head(3)

| | all_awardings | allow_live_comments | approved_at_utc | approved_by | archived | author | author_cakeday | author_flair_background_color | author_flair_css_class | author_flair_richtext | ... | thumbnail_width | title | total_awards_received | ups | url | user_reports | view_count | visited | whitelist_status | wls |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [] | False | | | False | pittman66 | | | MAL | [] | ... | | Wiki Overhaul Month, Week 2: Watch Order Wiki | 0 | 67 | https://www.reddit.com/r/anime/comments/c822ra... | [] | | False | all_ads | 6 |
| 1 | [] | False | | | False | AnimeMod | | | | [] | ... | | Recommendation Tuesdays Megathread - Week of J... | 0 | 61 | https://www.reddit.com/r/anime/comments/c823rj... | [] | | False | all_ads | 6 |
| 2 | [] | True | | | False | MinecrafterPH | | #2e51a2 | MAL | [] | ... | | My Hero Academia Season 4 is reportedly listed... | 0 | 5778 | https://www.reddit.com/r/anime/comments/c8o432... | [] | | False | all_ads | 6 |


In [6]:
df['subreddit'].value_counts()

anime     2155
KDRAMA     838
Name: subreddit, dtype: int64

>The unbalanced classes may pose a problem for my model because it may struggle with assigning new data to the kdrama class. 

In [7]:
df['title'].head()

0        Wiki Overhaul Month, Week 2: Watch Order Wiki
1    Recommendation Tuesdays Megathread - Week of J...
2    My Hero Academia Season 4 is reportedly listed...
3     Dumbbell Nan Kilo Moteru? - Episode 1 discussion
4    Best Girl 6: Starting Salt in Another Contest!...
Name: title, dtype: object

In [8]:
df['title'].isnull().sum()

0

I'm going to focus on the subreddit post titles. I'll split my data with `train_test_split` so I can check how well my model handles data it hasn't seen.

In [9]:
X = df['title']
y = df['subreddit'].map(lambda cell: 1 if cell == 'anime' else 0)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, stratify = y)

*** 
# Model Comparison using NLP, Pipelines, and GridSearch
***

## Baseline Model

This is the baseline for my other models. If they do not surpass this score, then they are not worth using. 

In [11]:
y.value_counts(normalize = True).max()

0.7200133645172068
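
To make the baseline concrete, here is the same number worked out by hand from the class counts printed earlier. It is only the arithmetic behind `value_counts(normalize = True).max()`: the share of posts in the majority class, which is the accuracy I'd get by always guessing Anime.

```python
# Class counts taken from df['subreddit'].value_counts() above
class_counts = {'anime': 2155, 'KDRAMA': 838}

# Always guessing the majority class is right this share of the time
baseline_accuracy = max(class_counts.values()) / sum(class_counts.values())

print(round(baseline_accuracy, 4))  # 0.72
```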

## Logistic Regression

This is a classification model that excels on data with a binary target variable and is used for its interpretability. 

### *Count Vectorizer*

This is an NLP method that turns each distinct word in my titles into a column and counts how many times that word appears in each title. It's basically like running `.value_counts()` on every title.
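
Here is a stripped-down sketch of that idea using only the standard library. It is not how `CountVectorizer` is implemented (the real class applies its own tokenization and optional stop-word filtering), and the two titles are made up for illustration.

```python
from collections import Counter

# Two made-up titles standing in for df['title']
titles = [
    'Episode 1 discussion',
    'Best girl contest episode 2',
]

# One column per distinct lowercase word across all titles
vocabulary = sorted({word for title in titles for word in title.lower().split()})

# One row per title: how many times each vocabulary word appears in it
rows = []
for title in titles:
    counts = Counter(title.lower().split())
    rows.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in rows:
    print(row)
```

Each row is one title and each column is one word, which is the shape of matrix the models below are trained on.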

In [14]:
#Creates pipeline for CountVectorizer and Logistic Regression
pipe_cv_lr = Pipeline([('cv', CountVectorizer(stop_words = 'english', max_df = .9)),  
                ('lr', LogisticRegression(random_state = 42))])

> * stop_words are words like 'the', 'an', 'to'. I don't want to include those as columns
* max_df is the maximum share of documents a word can appear in before it is excluded; at .9, words found in more than 90% of titles are dropped
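
To illustrate how document-frequency cutoffs like `max_df` and `min_df` prune the vocabulary, here is a small sketch on a made-up corpus. The thresholds are chosen for the toy data rather than copied from my pipeline, and the filtering logic is my own simplification of what the vectorizer does.

```python
# Four made-up titles, already lowercase
titles = [
    'anime episode discussion',
    'anime best girl',
    'anime drama review',
    'drama episode review',
]
tokenized = [set(title.split()) for title in titles]
n_docs = len(tokenized)

# Document frequency: number of titles that contain each word
doc_freq = {}
for words in tokenized:
    for word in words:
        doc_freq[word] = doc_freq.get(word, 0) + 1

min_df = 2     # integer: keep words found in at least 2 titles
max_df = 0.7   # float: drop words found in more than 70% of titles

kept = sorted(word for word, count in doc_freq.items()
              if count >= min_df and count / n_docs <= max_df)
print(kept)
```

Here 'anime' is dropped for being too common (3 of 4 titles) and the one-off words are dropped for being too rare.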

In [15]:
#Hyperparameters for GridSearch to use to find best possible CountVectorizer, Log Regression hyperparameters

pipe_cv_lr_params = {
    'cv__max_features': [1500, 2000, 2500],   #max number of columns
    
    'cv__min_df': [3, 5],  #minimum number of documents a word needs to appear in to be included
    
    'cv__ngram_range': [(1,1), (1,2)],   #Accounts for context of up to two words i.e  'not good' vs 'not' or 'good'
    
    'lr__C': [.5, 1]  #penalty on coefficients increases as C decreases
    
}
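
To get a feel for how much work a grid like this asks for, the sketch below expands the same dictionary into every combination with `itertools.product`. This is only an illustration of the counting, not what `GridSearchCV` does internally.

```python
from itertools import product

# Same grid as pipe_cv_lr_params above
param_grid = {
    'cv__max_features': [1500, 2000, 2500],
    'cv__min_df': [3, 5],
    'cv__ngram_range': [(1, 1), (1, 2)],
    'lr__C': [0.5, 1],
}

# Every combination of one value per hyperparameter
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

n_folds = 3  # the cv value passed to GridSearchCV in this notebook
print(len(candidates))            # 24 candidate settings
print(len(candidates) * n_folds)  # 72 model fits during the search
```

Adding one more value to any list multiplies the number of fits, which is why I keep each list short. By default `GridSearchCV` also refits the best candidate once more on the full training set.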

In [16]:
#GridSearching to find best combination of hyperparameters for CountVectorizer
gs_cv_lr = GridSearchCV(pipe_cv_lr, param_grid=pipe_cv_lr_params, cv = 3)
gs_cv_lr.fit(X_train, y_train)
print(f'The CountVectorizer, Logistic Regression train score is {gs_cv_lr.best_score_}')
print(f'The CountVectorizer, Logistic Regression test score is {gs_cv_lr.score(X_test, y_test)}')
gs_cv_lr.best_params_



The CountVectorizer, Logistic Regression train score is 0.9295900178253119
The CountVectorizer, Logistic Regression test score is 0.9252336448598131


{'cv__max_features': 1500,
 'cv__min_df': 3,
 'cv__ngram_range': (1, 2),
 'lr__C': 1}
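
One caveat about the "train score" label in my print statements: as far as I understand the scikit-learn API, `best_score_` is the mean cross-validated score of the best hyperparameter combination, not the accuracy on the full training set. The sketch below checks this on a tiny made-up corpus; the titles, labels, and one-parameter grid are assumptions for illustration only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny made-up corpus: 1 = anime-style title, 0 = KDrama-style title
titles = ['episode discussion thread', 'season two episode announced',
          'manga adaptation episode', 'new season trailer episode',
          'studio announces episode count', 'opening theme episode',
          'drama cast announced', 'new drama teaser', 'drama finale ratings',
          'actor joins drama', 'drama review thread', 'weekend drama premiere']
labels = [1] * 6 + [0] * 6

pipe = Pipeline([('cv', CountVectorizer()),
                 ('lr', LogisticRegression(random_state=42))])
gs = GridSearchCV(pipe, param_grid={'lr__C': [0.5, 1]}, cv=3)
gs.fit(titles, labels)

# Mean score across the 3 validation folds for the winning candidate
cv_score = gs.cv_results_['mean_test_score'][gs.best_index_]
# Accuracy on the same rows the final model was refit on
train_score = gs.score(titles, labels)

print(gs.best_score_ == cv_score)
print(train_score)
```

So the "train score" lines in this notebook are really cross-validation scores, which makes them a fairer comparison against the test score than a true training accuracy would be.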

### Confusion Matrix

In order to get a good understanding of how our model will do when given new data, I'll generate a confusion matrix on my testing set. 

In [18]:
conf_matrix = pd.DataFrame(confusion_matrix(y_true=y_test,   #Actuals
                                            y_pred=gs_cv_lr.predict(X_test)),   #Generate Predictions
             columns=['Predicted Negative', 'Predicted Positive'],   #Columns are the predicted classes
             index=['Actual Negative','Actual Positive'])   #Rows are the actual classes

conf_matrix

| | Predicted Negative | Predicted Positive |
|---|---|---|
| Actual Negative | 156 | 54 |
| Actual Positive | 2 | 537 |


A confusion matrix lets us know how well our model is adapting to new data. scikit-learn puts the actual classes in the rows and the predicted classes in the columns. The positive or majority class is Anime and the negative or minority class is KDrama. Because I stratified my data when I split it, the gap between the number of true positive (*bottom right*) and true negative (*top left*) values reflects the imbalance in the amount of Anime and KDrama posts in my data. 
> * I have significantly more true positives and negatives than false positives (*top right*) and false negatives (*bottom left*), meaning that this model does a great job of assigning new posts to the correct subreddit.
* I have a good number of false positives though (54), which lets me know that we are often assigning something to the Anime class when it actually belongs to the KDrama class.
* The low number of false negatives (2) means that the model rarely assigns something to the KDrama class when it is really an anime post. 

The last two cases are probably due to the fact that our data is imbalanced: the model has seen far fewer KDrama titles, so it leans toward the majority Anime class and needs to learn more about KDrama posts before it can sort them accurately.
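
As a quick worked check, the four counts in the matrix can be turned into accuracy and per-class recall by hand. This sketch assumes scikit-learn's layout for `confusion_matrix`, where rows are the actual classes and columns are the predicted classes, with 0 = KDrama and 1 = Anime.

```python
# Counts from the confusion matrix above (rows = actual, columns = predicted)
matrix = [[156, 54],
          [2, 537]]

total = sum(sum(row) for row in matrix)
accuracy = (matrix[0][0] + matrix[1][1]) / total

# Share of each actual class that was predicted correctly
kdrama_recall = matrix[0][0] / sum(matrix[0])
anime_recall = matrix[1][1] / sum(matrix[1])

print(round(accuracy, 4))       # 0.9252
print(round(kdrama_recall, 4))  # 0.7429
print(round(anime_recall, 4))   # 0.9963
```

The accuracy matches the test score printed for this model, and the gap between the two recall values shows how much harder the minority KDrama class is for it.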

### *TFIDF Vectorizer*

Similar to Count Vectorizer, this NLP method converts my titles into numeric columns, one per word. Unlike CountVectorizer, TFIDFVectorizer assigns a float score to each word in a title based on how often it appears in that title and how rare it is across all of my documents.
> * words that appear often in one document but rarely in the rest of them will score higher (e.g. names)
* words that appear often in one document but also show up in every document will score lower (e.g. 'the')
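
The scoring idea can be sketched by hand. The version below uses a smoothed inverse document frequency, which I believe matches the default formula documented for scikit-learn's TF-IDF (before it normalizes each row), but treat it as an illustration on made-up documents rather than a reimplementation.

```python
import math

# Three made-up, already tokenized titles
docs = [
    ['the', 'naruto', 'episode'],
    ['the', 'drama', 'episode'],
    ['the', 'drama', 'finale'],
]
n_docs = len(docs)


def tf_idf(term, doc):
    """Term count in `doc` times a smoothed inverse document frequency."""
    tf = doc.count(term)
    df = sum(term in d for d in docs)  # number of documents containing term
    idf = math.log((1 + n_docs) / (1 + df)) + 1
    return tf * idf


scores = {term: round(tf_idf(term, docs[0]), 3) for term in docs[0]}
print(scores)
```

In the first title, 'naruto' (found in one document) scores highest, 'episode' (two documents) is in the middle, and 'the' (every document) scores lowest.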

In [12]:
#Creates pipeline for TFIDFVectorizer and Logistic Regression
pipe_tf_lr = Pipeline([('tf', TfidfVectorizer(stop_words = 'english', max_df = .9)),
                ('lr', LogisticRegression(random_state = 42))])

#Hyperparameters for GridSearch to use to find best possible tfVectorizer, Log Regression hyperparameters

pipe_tf_lr_params = {
    'tf__max_features': [3000, 3500, 4000],
    'tf__min_df': [3, 5],
    'tf__ngram_range': [(1,1), (1,2)],  
    'lr__C': [.5, 1]
}

In [13]:
#GridSearching to find best combination of hyperparameters for tfVectorizer
gs_tf_lr = GridSearchCV(pipe_tf_lr, param_grid=pipe_tf_lr_params, cv = 3)
gs_tf_lr.fit(X_train, y_train)
print(f'The TFIDFVectorizer, Logistic Regression train score is {gs_tf_lr.best_score_}')
print(f'The TFIDFVectorizer, Logistic Regression test score is {gs_tf_lr.score(X_test, y_test)}')
gs_tf_lr.best_params_



The TFIDFVectorizer, Logistic Regression train score is 0.9459617635583301
The TFIDFVectorizer, Logistic Regression test score is 0.9566998244587478


{'lr__C': 1,
 'tf__max_features': 3000,
 'tf__min_df': 3,
 'tf__ngram_range': (1, 2)}

**It turns out that both Logistic Regression Models perform really strongly and adapt to new data really well.** The Count Vectorizer model is pulling ahead of the TFIDF model, but this could be changed by modifying the gridsearch hyperparameters. 
> Due to its performance, I will be comparing the **Logistic Regression Count Vectorizer model (LRCV)** to the remaining models. 

***
## Multinomial Naive Bayes

This is a modeling technique that relies on Bayes' theorem to classify. It also assumes that all of our features are independent of one another, an assumption that is rarely met. Although that assumption is naive, the model often performs amazingly well regardless. We will use the Multinomial version because our column values are non-negative integers after count-vectorizing our data.
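
To show where the independence assumption enters, here is a minimal hand-rolled multinomial Naive Bayes on three made-up titles. It is a sketch with add-one smoothing, not scikit-learn's `MultinomialNB`; all names and data below are assumptions for illustration.

```python
import math
from collections import Counter

# Made-up training titles, already tokenized
train = [
    (['episode', 'discussion'], 'anime'),
    (['manga', 'episode'], 'anime'),
    (['drama', 'finale'], 'kdrama'),
]

classes = sorted({label for _, label in train})
vocab = sorted({word for words, _ in train for word in words})
word_counts = {c: Counter() for c in classes}
class_counts = Counter()
for words, label in train:
    word_counts[label].update(words)
    class_counts[label] += 1


def log_posterior(words, label, alpha=1.0):
    """log P(label) plus the sum of log P(word | label), add-alpha smoothed."""
    total_words = sum(word_counts[label].values())
    score = math.log(class_counts[label] / len(train))
    for word in words:
        if word not in vocab:
            continue  # ignore words never seen in training
        numerator = word_counts[label][word] + alpha
        # Adding one log term per word is the "naive" independence assumption
        score += math.log(numerator / (total_words + alpha * len(vocab)))
    return score


new_title = ['episode', 'finale']
prediction = max(classes, key=lambda c: log_posterior(new_title, c))
print(prediction)
```

Each word contributes its own term to the score without regard to the other words in the title, which is exactly the simplification the model's name refers to.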

> This model should give us a much better train and test score than the logistic regression model.

In [14]:
#Creates pipeline for CountVectorizer and Multinomial Naive Bayes
pipe_mnb = Pipeline([('cv', CountVectorizer(stop_words = 'english', max_df = .9)), 
                ('mnb', MultinomialNB())])

#Hyperparameters for GridSearch to use to find best possible CountVectorizer, Multinomial Naive Bayes hyperparameters

pipe_mnb_params = {
    'cv__max_features': [1500, 2000, 2500],
    'cv__min_df': [3, 5],
    'cv__ngram_range': [(1,1), (1,2)]
    
}

In [15]:
#GridSearching to find best combination of hyperparameters for CountVectorizer
gs_mnb = GridSearchCV(pipe_mnb, param_grid=pipe_mnb_params, cv = 3)
gs_mnb.fit(X_train, y_train)
print(f'The CountVectorizer, Multinomial Naive Bayes train score is {gs_mnb.best_score_}')
print(f'The CountVectorizer, Multinomial Naive Bayes test score is {gs_mnb.score(X_test, y_test)}')
gs_mnb.best_params_

The CountVectorizer, Multinomial Naive Bayes train score is 0.9541552867733125
The CountVectorizer, Multinomial Naive Bayes test score is 0.9561146869514335


{'cv__max_features': 2500, 'cv__min_df': 3, 'cv__ngram_range': (1, 2)}

The Count Vectorized Multinomial Naive Bayes model performed very strongly with an extremely high train score. It also adapts well to new data, with the test score coming in slightly higher than the train score. However, this model is less accurate and adapts to new data less well than the LRCV model. This may be because the naive independence assumption that is the trademark of Naive Bayes is causing some error. It may also be because of the hyperparameters currently set. 

*As we continue, one major factor in the performance of our models will be the hyperparameters that each model is tested on. The nature of grid searching is a guess and check, so a good estimate can lead to amazing results for one model, and poor results for another.*

***
## Gaussian Naive Bayes

This is another modeling technique that relies on Bayes' theorem to classify. It also assumes that all of our features are independent of one another, an assumption that is rarely met. Although that assumption is naive, the model often performs amazingly well regardless. We will use the Gaussian version because it models continuous features, which suits TFIDF's float scoring.

> This model should give us a much better train and test score than the logistic regression model.

In [16]:
from sklearn.base import TransformerMixin

This is a class that we will add to our pipeline to allow our Gaussian Naive Bayes(GNB) model to work. Without it, our data will not transform from a sparse matrix to a dense matrix, which is required by GNB.

**Code adapted from StackOverflow**
> https://stackoverflow.com/questions/28384680/scikit-learns-pipeline-a-sparse-matrix-was-passed-but-dense-data-is-required

In [17]:
class DenseTransformer(TransformerMixin):

    def fit(self, X, y=None, **fit_params):
        return self

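    def transform(self, X, y=None, **fit_params):
        #toarray() returns a plain ndarray; todense() returns np.matrix, which newer scikit-learn versions may reject
        return X.toarray()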

In [18]:
#Creates pipeline for TFIDFVectorizer and Gaussian Naive Bayes
pipe_gnb = Pipeline([('tf', TfidfVectorizer(stop_words = 'english', max_df = .9)),
                ('to_dense', DenseTransformer()), 
                ('gnb', GaussianNB())])

#Hyperparameters for GridSearch to use to find best possible tfVec, Gaussian Naive Bayes hyperparameters

pipe_gnb_params = {
    'tf__max_features': [3500, 4000, 4500],
    'tf__min_df': [3, 5],
    'tf__ngram_range': [(1,1), (1,2)]
    
}

In [19]:
#GridSearching to find best combination of hyperparameters for tfVectorizer
gs_gnb = GridSearchCV(pipe_gnb, param_grid=pipe_gnb_params, cv = 3)
gs_gnb.fit(X_train, y_train)
print(f'The TFIDFVectorizer, Gaussian Naive Bayes train score is {gs_gnb.best_score_}')
print(f'The TFIDFVectorizer, Gaussian Naive Bayes test score is {gs_gnb.score(X_test, y_test)}')
gs_gnb.best_params_

The TFIDFVectorizer, Gaussian Naive Bayes train score is 0.908115489660554
The TFIDFVectorizer, Gaussian Naive Bayes test score is 0.9093036863662961


{'tf__max_features': 3500, 'tf__min_df': 3, 'tf__ngram_range': (1, 2)}

As we can see, the Gaussian Naive Bayes model didn't come close to the accuracy of the LRCV model. This is not a huge surprise because the TFIDF model performed worse than the CountVectorized model, so a model based on it should be less accurate. This is still a very strong model that adapts to new data well and is not overfit. 

***
## Decision Tree

Decision trees work by splitting the data to push a Gini score toward 0. A Gini score (Gini impurity) measures how mixed a group of samples is: a group with a Gini score of 0 consists of only one class (i.e. all red birds). Left unchecked, a decision tree keeps splitting until every final group is pure, often with only a handful of samples in each, which is what leads to overfitting. With this knowledge in mind, I'll use gridsearching to find the optimal hyperparameters to alleviate that. 
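
For reference, the Gini score of a group is 1 minus the sum of the squared class proportions. A minimal sketch (the function name is my own):

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1 - sum((count / total) ** 2 for count in Counter(labels).values())


print(gini_impurity(['anime'] * 4))            # 0.0
print(gini_impurity(['anime', 'kdrama'] * 2))  # 0.5
```

A pure group scores 0, and an even 50/50 mix of two classes scores 0.5, the worst case for a binary problem like this one.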

### *Count Vectorizer*

In [20]:
pipe_dt_cv = Pipeline([('cv', CountVectorizer(stop_words = 'english', max_df = .9)),
                ('dt', DecisionTreeClassifier(max_depth = 5, random_state = 42))])

#max_depth is the number of levels/questions asked by the decision tree

pipe_dt_cv_params = {
    'cv__max_features': [1500, 2000, 2500],
    'cv__min_df': [3, 5],
    'cv__ngram_range': [(1,1), (1,2)],
    'dt__min_samples_leaf': [3, 5],   #Minimum number of samples required to be in a leaf (final group/node)
    'dt__min_samples_split': [7, 10]   #Minimum number of samples required before splitting a group/node
}

In [21]:
#GridSearching to find best combination of hyperparameters for cvVectorizer
gs_dt_cv = GridSearchCV(pipe_dt_cv, param_grid=pipe_dt_cv_params, cv = 3)
gs_dt_cv.fit(X_train, y_train)
print(f'The CountVectorizer, Decision Tree train score is {gs_dt_cv.best_score_}')
print(f'The CountVectorizer, Decision Tree test score is {gs_dt_cv.score(X_test, y_test)}')
gs_dt_cv.best_params_

The CountVectorizer, Decision Tree train score is 0.8312524385485759
The CountVectorizer, Decision Tree test score is 0.8203627852545348


{'cv__max_features': 1500,
 'cv__min_df': 3,
 'cv__ngram_range': (1, 1),
 'dt__min_samples_leaf': 3,
 'dt__min_samples_split': 7}

### *TFIDF Vectorizer*

In [22]:
pipe_dt_tf = Pipeline([('tf', TfidfVectorizer(stop_words = 'english', max_df = .9)),
                ('dt', DecisionTreeClassifier(max_depth = 5, random_state = 42))])

#max_depth is the number of levels/questions asked by the decision tree

pipe_dt_tf_params = {
    'tf__max_features': [3500, 4000, 4500],
    'tf__min_df': [3, 5],
    'tf__ngram_range': [(1,1), (1,2)],
    'dt__min_samples_leaf': [3, 5],   #Minimum number of samples required to be in a leaf (final group/node)
    'dt__min_samples_split': [7, 10]   #Minimum number of samples required before splitting a group/node
}

In [23]:
#GridSearching to find best combination of hyperparameters for tfVectorizer
gs_dt_tf = GridSearchCV(pipe_dt_tf, param_grid=pipe_dt_tf_params, cv = 3)
gs_dt_tf.fit(X_train, y_train)
print(f'The TFIDFVectorizer, Decision Tree train score is {gs_dt_tf.best_score_}')
print(f'The TFIDFVectorizer, Decision Tree test score is {gs_dt_tf.score(X_test, y_test)}')
gs_dt_tf.best_params_

The TFIDFVectorizer, Decision Tree train score is 0.831642606320718
The TFIDFVectorizer, Decision Tree test score is 0.8227033352837917


{'dt__min_samples_leaf': 3,
 'dt__min_samples_split': 7,
 'tf__max_features': 3500,
 'tf__min_df': 3,
 'tf__ngram_range': (1, 1)}

Here we can see that although we added base parameters to lower our error due to variance significantly, the decision tree model was a weak performer. **Due to the innately high risk of overfitting with this model and because our data is already unbalanced, this is not the ideal model for our dataset.**

*** 
## Random Forest 

Random Forests are aggregated decision trees that use a concept called bagging to make them very accurate. Bagging (bootstrap aggregating) draws multiple random samples from our data with replacement and fits a decision tree to each sample. Then we combine the trees' predictions, either by majority vote or by averaging their predicted probabilities. This is an ensemble model, and these models *typically* have the highest accuracy scores for three reasons. 
> * First, ensemble models take the average of multiple models, which usually results in canceling out the error in each of the individual models. 
* Second, taking the average scores of multiple models tends to result in reaching scores one model may not have been able to reach alone. This means we may be able to home in on a global best. 
* Finally, one model will most likely not be perfect since they all have their shortcomings. Aggregating the results of multiple models could create a model that exceeds the limitations of any single component. 
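
The two moving parts of bagging, resampling with replacement and combining predictions, can be sketched with the standard library. The helper names and the three "tree predictions" are hypothetical; as I understand it, scikit-learn's `RandomForestClassifier` averages predicted probabilities rather than counting hard votes, so this is a simplification.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows with replacement."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Return the label predicted by the most models."""
    return Counter(predictions).most_common(1)[0][0]


rng = random.Random(42)
rows = list(range(10))  # stand-in for ten training rows

# Each tree in the ensemble would be fit on its own bootstrap sample
samples = [bootstrap_sample(rows, rng) for _ in range(3)]

# Hypothetical predictions from three trees for a single new title
vote = majority_vote(['anime', 'kdrama', 'anime'])
print(vote)
```

Because every sample is drawn with replacement, each tree sees a slightly different version of the data, and that variety is what lets the combined vote cancel out individual mistakes.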

### *Count Vectorizer*

In [25]:
pipe_rf_cv = Pipeline([('cv', CountVectorizer(stop_words = 'english', max_df = .9)),
                ('rf', RandomForestClassifier(max_depth = 5, random_state = 42))])

pipe_rf_cv_params = {
    'cv__max_features': [2500, 3500, 4000, 4500],
    'cv__min_df': [3, 5],
    'cv__ngram_range': [(1,1), (1,2)],
    'rf__n_estimators': [50 ,75],
    'rf__max_depth': [5, 6]
}

In [26]:
#GridSearching to find best combination of hyperparameters for tfVectorizer
gs_rf_cv = GridSearchCV(pipe_rf_cv, param_grid=pipe_rf_cv_params, cv = 3)
gs_rf_cv.fit(X_train, y_train)
print(f'The CountVectorizer, Random Forest train score is {gs_rf_cv.best_score_}')
print(f'The CountVectorizer, Random Forest test score is {gs_rf_cv.score(X_test, y_test)}')
gs_rf_cv.best_params_

The CountVectorizer, Random Forest train score is 0.7631681623097932
The CountVectorizer, Random Forest test score is 0.7495611468695144


{'cv__max_features': 2500,
 'cv__min_df': 5,
 'cv__ngram_range': (1, 1),
 'rf__max_depth': 6,
 'rf__n_estimators': 75}

### *TFIDF Vectorizer*

In [28]:
pipe_rf_tf = Pipeline([('tf', TfidfVectorizer(stop_words = 'english', max_df = .9)),
                ('rf', RandomForestClassifier(max_depth = 5, random_state = 42))])

pipe_rf_tf_params = {
    'tf__max_features': [2500, 3500, 4000, 4500],
    'tf__min_df': [3, 5],
    'tf__ngram_range': [(1,1), (1,2)],
    'rf__n_estimators': [100, 125],
    'rf__max_depth': [5, 6]
}

In [29]:
#GridSearching to find best combination of hyperparameters for tfVectorizer
gs_rf_tf = GridSearchCV(pipe_rf_tf, param_grid=pipe_rf_tf_params, cv = 3)
gs_rf_tf.fit(X_train, y_train)
print(f'The TFIDFVectorizer, Random Forest train score is {gs_rf_tf.best_score_}')
print(f'The TFIDFVectorizer, Random Forest test score is {gs_rf_tf.score(X_test, y_test)}')
gs_rf_tf.best_params_

The TFIDFVectorizer, Random Forest train score is 0.7641435817401483
The TFIDFVectorizer, Random Forest test score is 0.7466354593329433


{'rf__max_depth': 6,
 'rf__n_estimators': 75,
 'tf__max_features': 2500,
 'tf__min_df': 5,
 'tf__ngram_range': (1, 1)}

Shockingly, the ensemble model yielded the worst results of all, only a few points above the baseline. This is most likely due to the hyperparameters we set: with max_depth capped at 5 or 6, each tree can only ask a handful of questions about thousands of sparse word columns. With better-tuned hyperparameters, I would expect the random forest classifier to outperform the single decision tree model. Because we are using a gridsearch and random forests have 2 more important hyperparameters in addition to those present for decision trees, I have to decide how much computer power and time I want to give to this model. This is definitely a learning experience, and it shows how the guess-and-check nature of gridsearching can lead to very interesting results. 

*** 
# Model Comparison

In [34]:
#Accuracy score
print(f'Baseline model accuracy score is {y.value_counts(normalize = True).max()}.')
print(f'The CountVectorizer, Logistic Regression train score is {gs_cv_lr.best_score_}')
print(f'The TFIDFVectorizer, Logistic Regression train score is {gs_tf_lr.best_score_}')
print(f'The CountVectorizer, Multinomial Naive Bayes train score is {gs_mnb.best_score_}')
print(f'The TFIDFVectorizer, Gaussian Naive Bayes train score is {gs_gnb.best_score_}')
print(f'The CountVectorizer, Decision Tree train score is {gs_dt_cv.best_score_}')
print(f'The TFIDFVectorizer, Decision Tree train score is {gs_dt_tf.best_score_}')
print(f'The CountVectorizer, Random Forest train score is {gs_rf_cv.best_score_}')
print(f'The TFIDFVectorizer, Random Forest train score is {gs_rf_tf.best_score_}')


Baseline model accuracy score is 0.7252377468910022.
The CountVectorizer, Logistic Regression train score is 0.9576667967225907
The TFIDFVectorizer, Logistic Regression train score is 0.9459617635583301
The CountVectorizer, Multinomial Naive Bayes train score is 0.9541552867733125
The TFIDFVectorizer, Gaussian Naive Bayes train score is 0.908115489660554
The CountVectorizer, Decision Tree train score is 0.8312524385485759
The TFIDFVectorizer, Decision Tree train score is 0.831642606320718
The CountVectorizer, Random Forest train score is 0.7631681623097932
The TFIDFVectorizer, Random Forest train score is 0.7641435817401483


# Summary

Although all of our models performed better than the baseline, I would go with the CountVectorized Logistic Regression model to sort between the Anime and KDrama subreddits. It sported the best train and test scores while not being overfit. In the future, I could likely increase the accuracy of my models by tweaking the hyperparameters each model gridsearches over. Our worst performing models were decision trees and random forests, which is probably due to our hyperparameter choices and our dataset imbalance. Our strongest performing models were the Logistic Regression models, which are well suited to a binary target variable like ours. The Naive Bayes models also performed really strongly and would be solid alternatives to the Logistic Regression models. 