# Part 2: Preprocessing & Modeling

## Imports

In [None]:
import nltk
import pandas                        as pd
import numpy                         as np
import seaborn                       as sns
import matplotlib.pyplot             as plt
import scikitplot                   as skplt
from sklearn.ensemble                import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model            import LogisticRegression
from sklearn.metrics                 import confusion_matrix, roc_auc_score, accuracy_score
from sklearn.metrics                 import precision_score, recall_score, f1_score
from sklearn.metrics                 import balanced_accuracy_score, roc_curve
from sklearn.model_selection         import GridSearchCV, train_test_split, cross_val_score
from sklearn.pipeline                import Pipeline
from sklearn.tree                    import DecisionTreeClassifier
from sklearn.svm                     import SVC
from nltk.corpus                     import stopwords
from nltk.stem                       import WordNetLemmatizer
from nltk.tokenize                   import RegexpTokenizer 
from xgboost                         import XGBClassifier
from IPython.display                 import display_html
from IPython.core.display            import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))
sns.set(style = "white", palette = "deep")
%matplotlib inline

## Table Of Contents



- [Reading In The Data](#Reading-In-The-Data)
    - [Overview](#Overview)
    - [Visuals](#Visuals)
    
    
- [Lemmatizing](#Lemmatizing)


- [Establishing The Baseline](#Establishing-The-Baseline)


- [Modeling](#Modeling)
    - [Setting The X & y variables](#Setting-The-X-and-y-variables)
    - [Running A Train-Test Split](#Running-A-Train-Test-Split)
    - [Evaluation Formulae](#Evaluation-Formulae)
    - [Logistic Regression](#Logistic-Regression)
    - [Support Vector Classifier](#Support-Vector-Classifier)
    - [Random Forest Classifier](#Random-Forest-Classifier)
    - [XGBoost Classifier](#XGBoost-Classifier)
    
    
- [Evaluation](#Evaluation)
    - [Best Model Selection](#Best-Model-Selection)
    - [Evaluation Functions](#Evaluation-Functions)
    - [Dataframes](#Dataframes)
    - [Plots](#Plots)
        - [Bar Plot](#Bar-Plot)
        - [ROC Curve](#ROC-Curve)

## Reading In The Data

### Overview

In [None]:
model_data = pd.read_csv("../Data/model_data.csv")

In [None]:
# Checking the data's head

model_data.head()

In [None]:
# Checking for null values

model_data.isnull().sum()

In [None]:
# Checking data types

model_data.info()

### Visuals

#### Functions

In [None]:
def plot_text_length_dist(text_list):
    
    # Setting the figure size
    plt.figure(figsize = (18,6))
    
    # Plotting the histogram
    sns.distplot(text_list, kde = False, color = "black",
                 bins = 60)
    
    # Setting graph parameters
    plt.title(f"Distribution Of Text Lengths", size = 18)
    plt.xlabel("Length", size = 16)
    plt.ylabel("Frequency", size = 16)
    plt.xticks(np.arange(0,23500,1500), size = 14)
    plt.yticks(size = 14)
    plt.tight_layout()
    plt.show();

In [None]:
def plot_most_frequent_authors(df, col):
    
    # Setting the figure size
    plt.figure(figsize = (20,6))
    
    # Creating the bar chart
    sns.barplot(x = df.index,
                y = col,
                data = df)
    
    # Setting graph parameters
    plt.title("Most Common Posters", size = 18)
    plt.xlabel("Reddit User", size = 16)
    plt.ylabel("Number Of Posts", size = 16)
    plt.xticks(size = 13)
    plt.yticks(size = 14);

[Top](#Table-Of-Contents)

#### Text Length

In [None]:
# Generating a list of text lengths

length_list = [len(text) for text in model_data["text"]]

plot_text_length_dist(length_list)

Most of the posts are relatively short (<2,000 characters), but there are a few that are extremely long (>20,000 characters).  Note that `len(text)` counts characters, not words.  We expected that most posts would be no longer than a few thousand characters, which is true for the majority.

####  Most Frequent Authors

In [None]:
author_count = pd.DataFrame(model_data["author"].value_counts().head(10))

plot_most_frequent_authors(df  = author_count, 
                           col = "author")

We did not really know what to expect when we plotted this graph, although it is generally the case that a few users post frequently while most barely post at all.  We would have liked to look at the number of comments by each user in both subreddits as a measure of activity, but that is beyond the scope of this project.

#### Subreddit Of Origin

In [None]:
tick_labels = ["r/Cooking", "r/AskCulinary"]

# Setting the figure size
plt.figure(figsize = (10,5))

# Plotting the graph
sns.countplot(model_data["source"])

# Setting graph parameters
plt.title("Post Origin", size = 18)
plt.xlabel("Source", size = 16)
plt.ylabel("Number Of Posts", size = 16)

# Making sure the only two ticks are 0 and 1
plt.xticks(np.arange(0,2,1), 
           labels = tick_labels, 
           size = 14)
plt.yticks(size = 14);

We were a little surprised that there are more r/AskCulinary posts because we had roughly equal numbers of pulls from each subreddit.

[Top](#Table-Of-Contents)

#### Visualizing Most Common Words

Before we start modeling, we need to know what the most frequent words in each subreddit are, because words that are common to both subreddits might make it harder for our model to tell the classes apart.

We will subset the data frame into posts from r/Cooking and r/AskCulinary and use count vectorizer to determine the most frequent words.  We will also remove stop words from the outset.

In [None]:
def plot_most_frequent_words(dataframes, titles):
    
    # The count indicates where in the subplot to go
    count = 0
    fig   = plt.figure(figsize   = (24,20),
                       facecolor = "white")
    
    # Enumerating allows for the list of titles to be referenced
    for d, dataframe in enumerate(dataframes):
        
        # Updating the location
        count += 1
        ax    = fig.add_subplot(2, 2, count)
        
        # Creating the graph
        sns.barplot(x       = 0,
                    y       = dataframe.index,
                    data    = dataframe,
                    palette = "deep")
        
        # Setting the graph parameters
        plt.title(f"Most Common Words From {titles[d]}", size = 20)
        plt.xlabel("Number Of Occurrences", size = 18)
        plt.ylabel("Word", size = 18)
        plt.xticks(size = 16)
        plt.yticks(size = 17)

In [None]:
# Instantiating the count vectorizer

vectorizer = CountVectorizer()

# Instantiating one vectorizer per subreddit, each removing English stop words

cvec_cooking     = CountVectorizer(stop_words = "english")
cvec_askculinary = CountVectorizer(stop_words = "english")

# Subsetting the dataframe

cooking     = model_data[model_data["target"] == 1]
askculinary = model_data[model_data["target"] == 0]

# Fit-transforming the vectorizer

vec_cooking     = cvec_cooking.fit_transform(cooking["text"])
vec_askculinary = cvec_askculinary.fit_transform(askculinary["text"])

In [None]:
# Saving the vectorized dfs to a new dataframe

cooking_vectorized     = pd.DataFrame(vec_cooking.toarray(), 
                                      columns = cvec_cooking.get_feature_names())

askculinary_vectorized = pd.DataFrame(vec_askculinary.toarray(), 
                                      columns = cvec_askculinary.get_feature_names())

# Getting the 15 most frequent words from each

vectorized_cooking     = pd.DataFrame(cooking_vectorized.sum().sort_values(ascending = False).head(15))
vectorized_askculinary = pd.DataFrame(askculinary_vectorized.sum().sort_values(ascending = False).head(15))

# Plotting the most common words

plot_most_frequent_words(dataframes = [vectorized_cooking, vectorized_askculinary],
                         titles     = ["r/Cooking", "r/AskCulinary"])

We can see that there are a lot of words that occur in both subreddits.  Because of that, we decided to create a list of customized stop words.  Furthermore, we noticed that we have to lemmatize or stem the text columns because there are multiple forms of the same word among the most frequent words, such as 'make' and 'making' or 'recipe' and 'recipes'.

In [None]:
# Downloading the default stopwords

nltk.download("stopwords");

# Adding our stopwords to the English set

new_stopwords = ["like", "just", "make", "cook",
                 "use", "chicken", "recipe", "sauce"]

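# Note: this rebinds the name `stopwords` from the NLTK corpus reader to a plain list,
# so re-running this cell without re-running the imports raises an AttributeError
stopwords     = stopwords.words('english')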

stopwords.extend(new_stopwords)
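
To illustrate how a customized stop word list changes what a vectorizer keeps, here is a minimal sketch.  It assumes scikit-learn's built-in `ENGLISH_STOP_WORDS` set rather than the NLTK list used above (so that no download is needed), and the toy posts and the names `toy_posts`, `extra_stopwords`, and `custom_stopwords` are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Hypothetical toy posts, for illustration only
toy_posts = ["I like to make chicken with sauce",
             "How do I keep a sauce from breaking while simmering onions",
             "Just use a recipe you like"]

# Custom words to drop on top of scikit-learn's built-in English list
extra_stopwords  = ["like", "just", "make", "cook",
                    "use", "chicken", "recipe", "sauce"]
custom_stopwords = list(ENGLISH_STOP_WORDS.union(extra_stopwords))

cvec = CountVectorizer(stop_words = custom_stopwords)
cvec.fit(toy_posts)

# The learned vocabulary no longer contains any of the custom stop words
remaining_words = sorted(cvec.vocabulary_)
print(remaining_words)
```

Only words that are in neither list survive, which is the effect we want for words that are frequent in both subreddits.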

[Top](#Table-Of-Contents)

## Lemmatizing

We felt that lemmatizing is a better option than stemming because the lemma of a word is more likely to be an actual English word than its stem: there are so many irregularities in English that rule-based stemming does not always produce a recognizable word.

In [None]:
# Instantiating the lemmatizer and tokenizer
# The tokenizer keeps only word characters (letters, digits, and underscores)

lemmatizer = WordNetLemmatizer()
tokenizer  = RegexpTokenizer(r'\w+')

# Lemmatizing each post token by token

lemmatized_posts = []

for post in model_data["text"]:
    tokens = tokenizer.tokenize(post)
    lemmas = [lemmatizer.lemmatize(token) for token in tokens]
    lemmatized_posts.append(" ".join(lemmas))
    
# Appending the lemmatized posts to the dataframe

model_data["lemmatized_text"] = lemmatized_posts

# Checking the head of the dataframe

model_data.head()

While checking the results from the cell above, we noticed that in `lemmatized_text` there are some URLs which need to be removed.  We used a regular expression to remove all URLs.

In [None]:
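# The raw string avoids an invalid escape warning
# regex = True is required by newer versions of pandas

model_data["lemmatized_text"] = model_data["lemmatized_text"].str.replace(r"http\S+", "", regex = True)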
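
One thing worth checking is the order of the two steps: a `\w+` tokenizer splits a URL into separate tokens, so a pattern like `http\S+` applied afterwards can only match the leading `http`/`https` piece.  The sketch below shows this on a hypothetical string; `tokenize` is a stand-in for `RegexpTokenizer(r'\w+')` built on the standard library, and whether the effect matters for our data depends on how many posts contain URLs.

```python
import re

# Hypothetical post text, for illustration only
post = "Great tips here https://www.example.com/knife-skills?id=42 thanks"

# Stand-in for RegexpTokenizer(r'\w+'): keep runs of word characters
def tokenize(text):
    return re.findall(r"\w+", text)

# Stripping after tokenizing: the URL has already been split into pieces,
# so the pattern only removes the leading "https" token
tokens_then_strip = re.sub(r"http\S+", "", " ".join(tokenize(post)))

# Stripping before tokenizing: the whole URL is removed in one match
strip_then_tokens = " ".join(tokenize(re.sub(r"http\S+", "", post)))

print(tokens_then_strip)
print(strip_then_tokens)
```

If URL fragments such as `www` or `com` show up among the features, removing URLs from the raw `text` column before lemmatizing is one way to avoid them.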

## Establishing The Baseline

A baseline in classification gives us a reference point for judging how well the model is performing.  The baseline is simply the percentage of occurrences of our target in the data as a whole.  In this case it will be the percentage of posts that are from r/Cooking.

If our model has an accuracy of >41.44% we know that it is better than labeling every post as r/Cooking.  A stricter baseline is to always predict the majority class, r/AskCulinary, which would be correct for the remaining 58.56% of posts.

In [None]:
round(model_data["target"].value_counts(normalize = True)*100, 2)
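
As a sketch of how the two baseline figures relate, the snippet below computes the share of each class and the accuracy of always predicting the most common class.  The labels in `toy_target` are hypothetical and are not our data.

```python
from collections import Counter

# Hypothetical labels, for illustration only: 1 = r/Cooking, 0 = r/AskCulinary
toy_target = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]

counts = Counter(toy_target)
total  = len(toy_target)

# Share of each class, as a percentage
class_share = {label: round(100 * n / total, 2) for label, n in counts.items()}

# Accuracy of always predicting the most common class
majority_label, majority_count = counts.most_common(1)[0]
majority_baseline = round(100 * majority_count / total, 2)

print(class_share)
print(majority_label, majority_baseline)
```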

## Modeling

Now that our text is in the format we want, we can begin the process of modeling.

There are a few steps we have to do before we start running models: we have to define the X and y variables and run a train-test split on the data.

### Setting The X & y variables

In [None]:
X = model_data["lemmatized_text"]
y = model_data["target"]

### Running A Train-Test Split

A train-test split is important because it allows us to reserve a portion of our data for testing, so that the model is evaluated on data it did not see during training.  In this case we want to preserve the class split, so we will stratify the data to match the distribution of the classes.

In [None]:
# The random state ensures reproducibility
# The stratify argument preserves the distribution of classes

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    random_state = 42,
                                                    stratify     = y)
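
To show what `stratify` does, here is a minimal sketch on hypothetical labels with a 40/60 class balance; the names `toy_X` and `toy_y` are illustrative and the data are not ours.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels, for illustration only: 40% ones, 60% zeros
toy_y = np.array([1] * 40 + [0] * 60)
toy_X = np.arange(len(toy_y)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(toy_X,
                                          toy_y,
                                          random_state = 42,
                                          stratify     = toy_y)

# With stratify, both splits keep roughly the same 40/60 class balance
train_share = y_tr.mean()
test_share  = y_te.mean()
print(train_share, test_share)
```

Without `stratify`, the share of each class in the test set can drift away from the share in the full data, which makes scores harder to compare against the baseline.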

Each of the four models we will use will be grid searched so that we can experiment with different combinations of hyperparameters (parameters we have to define).  Additionally, each model will be fit with a count vectorizer and a TFIDF (Term Frequency-Inverse Document Frequency) vectorizer.
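
The difference between the two vectorizers can be sketched on a hypothetical three-post corpus: the count vectorizer stores raw counts, while the TFIDF vectorizer down-weights words that appear in many posts and, with its default settings, scales each row to unit length.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, for illustration only
toy_corpus = ["sear the steak then rest the steak",
              "rest the dough overnight",
              "why did my dough collapse"]

cvec = CountVectorizer()
tvec = TfidfVectorizer()

counts = cvec.fit_transform(toy_corpus).toarray()
tfidf  = tvec.fit_transform(toy_corpus).toarray()

# Raw counts: "the" and "steak" both appear twice in the first post
the_count   = counts[0, cvec.vocabulary_["the"]]
steak_count = counts[0, cvec.vocabulary_["steak"]]
print(the_count, steak_count)

# TFIDF: "steak" outweighs "the" because "the" appears in more posts
the_weight   = tfidf[0, tvec.vocabulary_["the"]]
steak_weight = tfidf[0, tvec.vocabulary_["steak"]]
print(round(the_weight, 3), round(steak_weight, 3))

# Each TFIDF row has unit length under the default l2 normalization
row_norms = np.linalg.norm(tfidf, axis = 1)
print(row_norms)
```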

[Top](#Table-Of-Contents)

### Evaluation Formulae

A confusion matrix allows us to look at how our model actually classified our data.  It tabulates the true y values against the predicted y values so that we can have an idea of how the model performs with each class.

In [None]:
# We converted the confusion matrix to a dataframe to make it easier to read
# labels = [1, 0] puts the positive class (r/Cooking, target = 1) first
# so that the cells match the row and column names below

def create_confusion_matrix(y, y_preds):
    cm     = confusion_matrix(y, y_preds, labels = [1, 0])
    matrix = pd.DataFrame(cm, 
                          columns = ["Predicted r/Cooking", "Predicted r/AskCulinary"], 
                          index   = ["Actual r/Cooking", "Actual r/AskCulinary"])
    return matrix

One of the evaluation metrics we want to use is specificity, which is not a function we can import from sklearn.  In order to calculate this score, we will create a function based on the confusion matrix.

In [None]:
# The function creates a confusion matrix
# and then calculates the specificity,
# TN / (TN + FP), from specific cells
# sklearn orders the matrix as [[TN, FP], [FN, TP]]

def calc_specificity(y, y_hat):
    cm          = confusion_matrix(y, y_hat)
    specificity = cm[0, 0] / (cm[0, 0] + cm[0, 1])
    return specificity
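
As a sanity check on the definition, specificity is the recall of the negative class, so it can be cross-checked against `recall_score` with `pos_label = 0`.  The sketch below uses hypothetical labels and predictions; the names `toy_y` and `toy_y_hat` are illustrative.

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels and predictions, for illustration only
toy_y     = [0, 0, 0, 0, 1, 1, 1, 0]
toy_y_hat = [0, 0, 1, 0, 1, 0, 1, 0]

# For binary labels 0/1, ravel() returns the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(toy_y, toy_y_hat).ravel()
specificity    = tn / (tn + fp)

# Specificity is the recall of the negative class
negative_recall = recall_score(toy_y, toy_y_hat, pos_label = 0)

print(specificity, negative_recall)
```

Because the numerator is part of the denominator, the result can never exceed 1.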

For each model we will calculate an ROC-AUC score.  The ROC (receiver operating characteristic) shows us a binary classification model's ability to distinguish between two classes.  The curve, which will be plotted for our best model, plots the true positive rate against the false positive rate at every classification threshold.  The AUC (area under the curve) summarizes the curve in a single number that reflects how much the predicted distributions of the two classes overlap: 0.5 corresponds to random guessing and 1.0 to perfect separation.

This image from [GreyAtom](https://medium.com/greyatom/lets-learn-about-auc-roc-curve-4a94b4d88152) illustrates the AUC-ROC well:

<img src = "../Images/ROC_AUC 0.8 0.9.png" alt = "high auc_roc scores" height = "350" width = "350">

<img src = "../Images/ROC_AUC 0.5 0.7.png" alt = "low auc_roc scores" height = "350" width = "350">

Accuracy is not the most informative metric because we want to know how well the model is actually performing on both classes.  In order to do that we decided to look at five metrics in addition to general accuracy.

 
| Metric                | Definition                                                                        | Scale                           |
|:----------------------|:----------------------------------------------------------------------------------|:--------------------------------|
| **Accuracy**          | The share of all predictions that are correct                                     | 0 to 1                          |
| **Balanced Accuracy** | The average of the recall on each class                                           | 0 to 1                          |
| **Specificity**       | The share of actual negatives that are predicted correctly                        | 0 to 1                          |
| **Sensitivity**       | The share of actual positives that are predicted correctly (also known as recall) | 0 to 1                          |
| **F1 Score**          | The harmonic mean of precision & recall                                           | 0 to 1                          |
| **ROC-AUC Score**     | A measure of the model's ability to distinguish classes                           | 0 to 1 (0.5 is random guessing) |

While we are taking a holistic approach to evaluation, we believe our most important metric is the ROC-AUC score because it shows us how much our classes overlap; ideally our ROC-AUC scores will be as close to 1.0 as possible.
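
For reference, the threshold-based metrics in the table can be written in terms of the four confusion matrix cells.  These are the standard textbook definitions, with TP, TN, FP, and FN standing for true positives, true negatives, false positives, and false negatives; precision is included because the F1 score depends on it.

```latex
\begin{aligned}
\text{Accuracy} &= \frac{TP + TN}{TP + TN + FP + FN} \\
\text{Sensitivity (recall)} &= \frac{TP}{TP + FN} \\
\text{Specificity} &= \frac{TN}{TN + FP} \\
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Balanced accuracy} &= \frac{\text{Sensitivity} + \text{Specificity}}{2} \\
F_1 &= \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\end{aligned}
```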

In [None]:
# Generating the "classification" report

def generate_model_eval(y, y_hat):
    print(f"The accuracy score is         : {round(accuracy_score(y, y_hat), 5)}")
    print(f"The balanced accuracy score is: {round(balanced_accuracy_score(y, y_hat), 5)}")
    print(f"The specificity score is      : {round(calc_specificity(y, y_hat), 5)}")
    print(f"The sensitivity score is      : {round(recall_score(y, y_hat), 5)}")
    print(f"The F1 score is               : {round(f1_score(y, y_hat), 5)}")
    print(f"The ROC-AUC score is          : {round(roc_auc_score(y, y_hat), 5)}")
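
One caveat about the last line: `roc_auc_score` is given hard 0/1 predictions here.  With hard predictions the ROC curve has a single threshold, and the resulting score equals the balanced accuracy; passing predicted probabilities (for example the second column of `predict_proba`) uses the model's full ranking instead.  The sketch below illustrates the difference on hypothetical values.

```python
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

# Hypothetical labels and predicted probabilities of class 1, for illustration only
toy_y     = [0, 0, 0, 1, 1, 1]
toy_proba = [0.1, 0.3, 0.6, 0.4, 0.7, 0.9]

# Hard predictions at a 0.5 threshold
toy_preds = [int(p >= 0.5) for p in toy_proba]

auc_from_proba = roc_auc_score(toy_y, toy_proba)
auc_from_preds = roc_auc_score(toy_y, toy_preds)
balanced_acc   = balanced_accuracy_score(toy_y, toy_preds)

print(round(auc_from_proba, 4), round(auc_from_preds, 4), round(balanced_acc, 4))
```

This is also why a pipeline with `SVC()` would need `probability = True` or `decision_function` to produce a probability-based ROC-AUC.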

[Top](#Table-Of-Contents)

### Logistic Regression

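The logistic regression is very similar to the linear regression, but it passes the linear combination of the features through the logistic (sigmoid) function, which bends the line into an S-shaped curve bounded by 0 and 1.  The output is read as the probability of class 1 and is thresholded to predict either 0 or 1.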
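
A minimal sketch of that mapping, using hypothetical linear scores and a 0.5 threshold; the function name `sigmoid` is our own.

```python
import math

def sigmoid(z):
    # Logistic function: maps any real number to the interval (0, 1)
    return 1 / (1 + math.exp(-z))

# Hypothetical linear scores (intercept + coefficients * features)
linear_scores = [-4.0, -1.0, 0.0, 1.0, 4.0]

probabilities = [sigmoid(z) for z in linear_scores]
predictions   = [int(p >= 0.5) for p in probabilities]

print([round(p, 3) for p in probabilities])
print(predictions)
```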


The gridsearch will be searching hyperparameters for the count vectorizer and the TFIDF, not the logistic regression.

#### Count Vectorizer

In [None]:
# Setting up the pipeline
# The model's best parameters are shown

cvec_lr_pipe = Pipeline([("cvec", CountVectorizer()), 
                         ("log_reg", LogisticRegression())])

# Setting the pipeline hyperparameters

cvec_pipe_params = {"cvec__max_features": [125], 
                    "cvec__ngram_range" : [(1,2)], 
                    "cvec__stop_words"  : [None]}

# Instantiating the grid search

cvec_lr_gs = GridSearchCV(cvec_lr_pipe, 
                          param_grid = cvec_pipe_params, 
                          cv         = 5)

# Fitting the training data to the pipeline model

cvec_lr_gs.fit(X_train, y_train);
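
The grid above lists only the best value found for each hyperparameter.  To show what a search over several candidates looks like, here is a sketch on a hypothetical toy corpus; the candidate values and the names `toy_X`, `toy_y`, `toy_pipe`, `toy_params`, and `toy_gs` are illustrative and are not the values we searched.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy posts and labels, for illustration only
toy_X = ["what is your favorite weeknight dinner",
         "sharing my favorite chili recipe",
         "made this stew for dinner last night",
         "my favorite dinner is roast chicken",
         "why did my caramel crystallize",
         "why does my bread collapse in the oven",
         "how do I stop my custard from curdling",
         "why is my pie crust tough"]
toy_y = [1, 1, 1, 1, 0, 0, 0, 0]

toy_pipe = Pipeline([("cvec", CountVectorizer()),
                     ("log_reg", LogisticRegression())])

# Several candidate values per hyperparameter: 2 x 2 = 4 combinations
toy_params = {"cvec__max_features": [5, 20],
              "cvec__ngram_range" : [(1,1), (1,2)]}

toy_gs = GridSearchCV(toy_pipe,
                      param_grid = toy_params,
                      cv         = 2)
toy_gs.fit(toy_X, toy_y)

# The combination with the best mean cross-validated accuracy
n_candidates = len(toy_gs.cv_results_["params"])
print(n_candidates)
print(toy_gs.best_params_)
```

The `step__parameter` naming is what routes each value to the right step of the pipeline.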

In [None]:
# Generating training predictions

cvec_lr_train_preds = cvec_lr_gs.predict(X_train)

# Generating testing predictions

cvec_lr_preds       = cvec_lr_gs.predict(X_test)
cvec_lr_proba       = cvec_lr_gs.predict_proba(X_test)

In [None]:
# Training metrics

generate_model_eval(y_train, cvec_lr_train_preds)

In [None]:
# Test metrics

generate_model_eval(y_test, cvec_lr_preds)

Overall the scores are poor, although we were surprised by how consistent they are between the training and test data.  We had expected a degree of overfitting, but the consistency fits the type of model: linear models are simple and tend toward high bias rather than high variance.

Our biggest concern is the ROC-AUC score: the scores are close to 0.5, the score of random guessing, and thus the model is not doing a very good job at distinguishing between the posts from r/Cooking and r/AskCulinary.

In [None]:
# Generating a confusion matrix on the test results

create_confusion_matrix(y_test, cvec_lr_preds)

The confusion matrix contains the following:


|                     | Predicted Positive | Predicted Negative |
|:--------------------|:------------------:|:------------------:|
| **Actual Positive** | True Positive      | False Negative     |
| **Actual Negative** | False Positive     | True Negative      |


Based on the matrix, it is easy to see that the model is better at predicting posts from r/Cooking than from r/AskCulinary.  We found that surprising because there are more posts from r/AskCulinary so we expected the opposite of these results.

[Top](#Table-Of-Contents)

#### TFIDF Vectorization

In [None]:
# Setting up the pipeline
# The model's best parameters are shown

tvec_lr_pipe = Pipeline([("tvec", TfidfVectorizer()), 
                         ("log_reg", LogisticRegression())])

# Setting TFIDF pipe parameters

tvec_pipe_params = {"tvec__max_features": [650], 
                    "tvec__ngram_range" : [(1,1)], 
                    "tvec__stop_words"  : [None]}
                    
# Instantiating the grid search

tvec_lr_gs = GridSearchCV(tvec_lr_pipe, 
                          param_grid = tvec_pipe_params, 
                          cv         = 5)

# Fitting the training data to the pipeline model

tvec_lr_gs.fit(X_train, y_train);

In [None]:
# Generating training predictions

tvec_lr_train_preds = tvec_lr_gs.predict(X_train)

# Generating testing predictions

tvec_lr_preds       = tvec_lr_gs.predict(X_test) 

In [None]:
# Training metrics

generate_model_eval(y_train, tvec_lr_train_preds)

In [None]:
# Test metrics

generate_model_eval(y_test, tvec_lr_preds)

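This model with the TFIDF vectorizer performed a bit better than the same model with count vectorization.  That being said, this model is more overfit.  Additionally, the specificity of >1 does not make sense: specificity is a proportion, TN / (TN + FP), so a value above 1 signals an error in how the score is calculated rather than a property of the model.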

In [None]:
# Generating a confusion matrix on the test results

create_confusion_matrix(y_test, tvec_lr_preds)

Compared to the confusion matrix for the CVEC predictions, the numbers of true positives and true negatives have increased while the numbers of false positives and false negatives have decreased, meaning that the model's predictive power has increased; this reflects what we saw with the model's metrics.

[Top](#Table-Of-Contents)

### Support Vector Classifier

A support vector machine (in this case a classifier) is at its core a linear model.  However, instead of modeling class probabilities like a logistic regression, it seeks the boundary that separates the classes with the widest margin.  When the data are not linearly separable, it uses a kernel to implicitly map the data into a higher-dimensional space.  It then uses a line (in 2 dimensions), a plane (in 3 dimensions), or a hyperplane (in more than 3 dimensions) to delineate the data.
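
The effect of the kernel can be sketched on XOR-style toy data, which no straight line can separate.  The values of `gamma` and `C` below are illustrative and untuned, and the data are hypothetical.

```python
from sklearn.svm import SVC

# XOR-style toy data: no straight line separates the two classes
toy_X = [[0, 0], [1, 1], [0, 1], [1, 0]]
toy_y = [0, 0, 1, 1]

# gamma and C are illustrative values, not tuned
linear_svc = SVC(kernel = "linear", C = 10).fit(toy_X, toy_y)
rbf_svc    = SVC(kernel = "rbf", gamma = 2.0, C = 10).fit(toy_X, toy_y)

# Accuracy on the same four points
linear_acc = linear_svc.score(toy_X, toy_y)
rbf_acc    = rbf_svc.score(toy_X, toy_y)

print(linear_acc, rbf_acc)
```

The linear kernel cannot classify all four points correctly, while the RBF kernel can because it separates them in its implicit higher-dimensional space.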

#### Count Vectorizer

In [None]:
# Setting up the pipeline
# The model's best parameters are shown

cvec_svc_pipe = Pipeline([("cvec", CountVectorizer()), 
                         ("svc", SVC())])

# Setting the pipeline hyperparameters

cvec_pipe_params = {"cvec__max_features": [319], 
                    "cvec__ngram_range" : [(1,2)], 
                    "cvec__stop_words"  : [None],
                    "svc__C"            : [1.0],
                    "svc__kernel"       : ["rbf"],
                    "svc__gamma"        : ["auto"]}
                    
# Instantiating the grid search

cvec_svc_gs = GridSearchCV(cvec_svc_pipe, 
                           param_grid = cvec_pipe_params, 
                           cv         = 5)

# Fitting the training data to the pipeline model

cvec_svc_gs.fit(X_train, y_train);

In [None]:
# Generating training predictions

cvec_svc_train_preds = cvec_svc_gs.predict(X_train)

# Generating testing predictions

cvec_svc_preds       = cvec_svc_gs.predict(X_test) 

In [None]:
# Training metrics

generate_model_eval(y_train, cvec_svc_train_preds)

In [None]:
# Test metrics

generate_model_eval(y_test, cvec_svc_preds)

We were surprised that a support vector machine performed worse than the logistic regression: we thought that because the model seeks to make the data as linearly separable as possible, it would do a better job of distinguishing the classes.  And while it outperformed the CVEC logistic regression, it was 0.03 worse than the TVEC logistic regression.  Furthermore, this model is much more overfit: the specificity, for example, dropped by 0.344.

In [None]:
# Generating a confusion matrix on the test results

create_confusion_matrix(y_test, cvec_svc_preds)

This model predicts r/Cooking more often than r/AskCulinary, which we again found surprising.  While it did a better job with the true positives, the model overall is worse because the false negatives increased while the true negatives decreased.

[Top](#Table-Of-Contents)

#### TFIDF Vectorizer

In [None]:
# Setting up the pipeline
# The model's best parameters are shown

tvec_svc_pipe = Pipeline([("tvec", TfidfVectorizer()), 
                         ("svc", SVC())])

# Setting TFIDF pipe parameters

tvec_pipe_params = {"tvec__max_features": [1], 
                    "tvec__ngram_range" : [(1,1)], 
                    "tvec__stop_words"  : [None],
                    "svc__C"            : [1.0],
                    "svc__kernel"       : ["rbf"],
                    "svc__gamma"        : ["auto"]}
                    
# Instantiating the grid search

tvec_svc_gs = GridSearchCV(tvec_svc_pipe, 
                           param_grid = tvec_pipe_params, 
                           cv         = 5)

# Fitting the training data to the pipeline model

tvec_svc_gs.fit(X_train, y_train);

In [None]:
# Generating training predictions

tvec_svc_train_preds = tvec_svc_gs.predict(X_train)

# Generating testing predictions

tvec_svc_preds       = tvec_svc_gs.predict(X_test) 

In [None]:
# Training metrics

generate_model_eval(y_train, tvec_svc_train_preds)

In [None]:
# Test metrics

generate_model_eval(y_test, tvec_svc_preds)

Considering the performance of the SVC with count vectorization, we were surprised by how much more poorly this model performed: scores in the range of 0.2 to 0.4 are simply unacceptable.  We expected these scores to be slightly higher than the count vectorized scores because in the case of the logistic regression the scores improved with TFIDF vectorization.  One likely contributor is the `max_features` value of 1 in the grid, which leaves the classifier a single term to work with.  Nevertheless, the model is not overfit at all and in a few cases performed slightly better on the test data.

In [None]:
# Generating a confusion matrix on the test results

create_confusion_matrix(y_test, tvec_svc_preds)

This matrix is dramatically different from the previous matrix.  While the true positives increased and the false negatives decreased, the true negatives decreased and the false positives increased.  We found it interesting that the number of true positives has continually increased while the other counts have been so variable.

[Top](#Table-Of-Contents)

### Random Forest Classifier

A random forest classifier is a decision tree based classification method.  However, it has advantages over other tree based models.  Firstly, each tree is fit on a bootstrapped sample of the rows (a random sample drawn with replacement), and each split considers only a random subset of the features.  Having two levels of randomness in the model reduces the likelihood of the model being overfit on the training data, and it also makes the model less prone to the variance caused by a large number of features.
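
The two levels of randomness can be sketched with NumPy alone.  This is an illustration of the sampling idea on a hypothetical matrix, not scikit-learn's implementation; in `RandomForestClassifier` the feature subset is redrawn at every split, and the square root rule below is one common choice of subset size.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical feature matrix: 8 posts (rows) by 9 vectorized words (columns)
n_rows, n_features = 8, 9
toy_X = rng.integers(0, 5, size = (n_rows, n_features))

# Level 1: bootstrap the rows (sample with replacement, same size as the data)
row_idx   = rng.integers(0, n_rows, size = n_rows)
bootstrap = toy_X[row_idx]

# Level 2: consider only a random subset of features, here sqrt(n_features)
n_subset    = int(np.sqrt(n_features))
feature_idx = rng.choice(n_features, size = n_subset, replace = False)
candidates  = bootstrap[:, feature_idx]

print(sorted(row_idx))
print(sorted(feature_idx), candidates.shape)
```

Because rows are drawn with replacement, some posts usually appear more than once in a bootstrap sample and others not at all, which is what makes the trees differ from one another.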

#### Count Vectorizer

In [None]:
# Creating the pipeline
# The model's best parameters are shown

cvec_rf_pipe = Pipeline([("cvec", CountVectorizer()), 
                         ("rf", RandomForestClassifier(random_state = 42))])

# Setting the pipeline hyperparameters

cvec_pipe_params = {"cvec__max_features"   : [1000], 
                    "cvec__ngram_range"    : [(1,1)], 
                    "cvec__stop_words"     : [None],
                    "rf__n_estimators"     : [72],
                    "rf__min_samples_split": [6],
                    "rf__min_samples_leaf" : [2],
                    "rf__max_depth"        : [20]}

# Instantiating the grid search

cvec_rf_gs = GridSearchCV(cvec_rf_pipe, 
                          param_grid = cvec_pipe_params, 
                          cv         = 5,
                          n_jobs     = 6)

# Fitting the model to the training data

cvec_rf_gs.fit(X_train, y_train);

In [None]:
# Generating training predictions

cvec_rf_train_preds = cvec_rf_gs.predict(X_train)

# Generating testing predictions

cvec_rf_preds       = cvec_rf_gs.predict(X_test) 

In [None]:
# Training metrics

generate_model_eval(y_train, cvec_rf_train_preds)

In [None]:
# Test metrics

generate_model_eval(y_test, cvec_rf_preds)

This random forest model is significantly overfit, which we did not expect because the random forest's two-level randomness is designed to help prevent overfitting; overall this model performed poorly.  Additionally, the specificity is again >1, which is not a valid value for a proportion and points to the calculation rather than the model.

In [None]:
# Generating a confusion matrix on the test results

create_confusion_matrix(y_test, cvec_rf_preds)

Here we can see that the true positives and negatives increased while the false negatives and positives decreased.  This is an improvement over the previous model, but it is still not great by any means.

[Top](#Table-Of-Contents)

#### TFIDF Vectorizer

In [None]:
# Creating the pipeline
# The model's best parameters are shown

tvec_rf_pipe = Pipeline([("tvec", TfidfVectorizer()), 
                         ("rf", RandomForestClassifier(random_state = 42))])

# Setting the pipeline hyperparameters

tvec_pipe_params = {"tvec__max_features"   : [250], 
                    "tvec__ngram_range"    : [(1,2)], 
                    "tvec__stop_words"     : [None],
                    "rf__n_estimators"     : [30],
                    "rf__min_samples_split": [6],
                    "rf__min_samples_leaf" : [2],
                    "rf__max_depth"        : [12]}

# Instantiating the grid search

tvec_rf_gs = GridSearchCV(tvec_rf_pipe, 
                          param_grid = tvec_pipe_params, 
                          cv         = 5,
                          n_jobs     = 6)

# Fitting the model to the training data

tvec_rf_gs.fit(X_train, y_train);

In [None]:
# Generating training predictions

tvec_rf_train_preds = tvec_rf_gs.predict(X_train)

# Generating testing predictions

tvec_rf_preds       = tvec_rf_gs.predict(X_test) 

In [None]:
# Training metrics

generate_model_eval(y_train, tvec_rf_train_preds)

In [None]:
# Test metrics

generate_model_eval(y_test, tvec_rf_preds)

Again, this model is very overfit and performed poorly.  The specificity is again >1, which is not possible for a proportion and points to an error in its calculation.

In [None]:
# Generating a confusion matrix on the test results

create_confusion_matrix(y_test, tvec_rf_preds)

Interestingly here the false positives and true negatives were the same while the true positives decreased and false negatives increased: the models have continually done better at predicting the positive class (posts from r/Cooking).

[Top](#Table-Of-Contents)

### XGBoost Classifier

XGBoost is a tree-based boosting model that iteratively fits tree models on the errors of the previous model and uses gradient descent to help minimize the loss function.  Furthermore, XGBoost is much more computationally efficient than many other boosting implementations, and its tree construction can be parallelized.
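
The core idea of fitting each new tree on the errors of the ensemble so far can be sketched with plain gradient boosting on squared error.  This is not XGBoost itself (it omits XGBoost's regularization and second-order terms); it uses scikit-learn's `DecisionTreeRegressor` on hypothetical one-feature data, and the learning rate, depth, and number of rounds are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)

# Hypothetical one-feature regression data, for illustration only
toy_X = np.linspace(0, 6, 60).reshape(-1, 1)
toy_y = np.sin(toy_X).ravel() + rng.normal(0, 0.1, size = 60)

learning_rate = 0.3
prediction    = np.full_like(toy_y, toy_y.mean())
mse_history   = [np.mean((toy_y - prediction) ** 2)]

for _ in range(20):
    # For squared error, the negative gradient is simply the residual
    residuals = toy_y - prediction
    tree      = DecisionTreeRegressor(max_depth = 2).fit(toy_X, residuals)

    # Each new tree nudges the ensemble toward the remaining error
    prediction += learning_rate * tree.predict(toy_X)
    mse_history.append(np.mean((toy_y - prediction) ** 2))

print(round(mse_history[0], 4), round(mse_history[-1], 4))
```

The training error shrinks with every round, which is also why boosting models overfit easily when the number of rounds and the learning rate are not kept in check.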

#### Count Vectorizer

In [None]:
# Creating the pipeline
# The model's best parameters are shown

cvec_xgbc_pipe = Pipeline([("cvec", CountVectorizer()), 
                           ("xgbc", XGBClassifier(n_jobs                = 6,
                                                  early_stopping_rounds = 10))])

# Setting the pipeline hyperparameters

cvec_pipe_params = {"cvec__max_features"   : [200], 
                    "cvec__ngram_range"    : [(1,3)], 
                    "cvec__stop_words"     : [None],
                    "xgbc__max_depth"      : [3],
                    "xgbc__learning_rate"  : [0.04],
                    "xgbc__n_estimators"   : [175],
                    "xgbc__gamma"          : [3.0]}

# Instantiating the grid search

cvec_xgbc_gs = GridSearchCV(cvec_xgbc_pipe, 
                            param_grid = cvec_pipe_params, 
                            cv         = 5,
                            n_jobs     = 6)

# Fitting the model to the training data

cvec_xgbc_gs.fit(X_train, y_train);

In [None]:
# Generating training predictions

cvec_xgbc_train_preds = cvec_xgbc_gs.predict(X_train)

# Generating testing predictions

cvec_xgbc_preds       = cvec_xgbc_gs.predict(X_test) 

In [None]:
# Training metrics

generate_model_eval(y_train, cvec_xgbc_train_preds)

In [None]:
# Test metrics

generate_model_eval(y_test, cvec_xgbc_preds)

We had expected the XGBoost classifier to be the best model because it is a boosting model: it fits models on the errors from each model it fits.  This model is bad not only because the test scores are very low, but also because of how overfit it is.

In [None]:
# Generating a confusion matrix on the test results

create_confusion_matrix(y_test, cvec_xgbc_preds)

The matrices have been variable over previous models and this one is no different.  The true and false positives decreased while the false negatives and true negatives increased.  The trend of predicting more r/Cooking posts is present here as well.

[Top](#Table-Of-Contents)

#### TFIDF Vectorizer

In [None]:
# Creating the pipeline
# The model's best parameters are shown

tvec_xgbc_pipe = Pipeline([("tvec", TfidfVectorizer()), 
                           ("xgbc", XGBClassifier(n_jobs                = 6,
                                                  seed                  = 42,
                                                  early_stopping_rounds = 10))])

# Setting the pipeline hyperparameters

tvec_pipe_params = {"tvec__max_features"   : [525], 
                    "tvec__ngram_range"    : [(1,3)], 
                    "tvec__stop_words"     : [stopwords],
                    "xgbc__max_depth"      : [3],
                    "xgbc__learning_rate"  : [0.25],
                    "xgbc__n_estimators"   : [139],
                    "xgbc__gamma"          : [1.0]}

# Instantiating the grid search

tvec_xgbc_gs = GridSearchCV(tvec_xgbc_pipe, 
                            param_grid = tvec_pipe_params, 
                            cv         = 5,
                            n_jobs     = 6)

# Fitting the model to the training data

tvec_xgbc_gs.fit(X_train, y_train);

In [None]:
# Generating training predictions

tvec_xgbc_train_preds = tvec_xgbc_gs.predict(X_train)

# Generating testing predictions

tvec_xgbc_preds       = tvec_xgbc_gs.predict(X_test) 

In [None]:
# Training metrics

generate_model_eval(y_train, tvec_xgbc_train_preds)

In [None]:
# Test metrics

generate_model_eval(y_test, tvec_xgbc_preds)

Again we were surprised by how poorly the XGBoost model did for the same reasons: it is extremely overfit and its metric scores are poor.  Again the specificity is >1, which does not make sense for a proportion and points to an error in its calculation.

In [None]:
# Generating a confusion matrix on the test results

create_confusion_matrix(y_test, tvec_xgbc_preds)

This matrix is dramatically worse than the previous one: both kinds of negatives rose by a significant amount, while the positives dropped.  Still, the model predicted more positives than negatives.

[Top](#Table-Of-Contents)

## Evaluation

### Best Model Selection

Overall, the best model given the metric scores, especially the ROC-AUC score, is the logistic regression.  In particular, the logistic regression model with TFIDF is the best overall model.

### Evaluation Functions

In [None]:
# Plotting a bar plot of a model's scores

def plot_scores(df, column, label):
    
    # Setting the figure size
    plt.figure(figsize   = (15,5),
               facecolor = "white")
    
    # Plotting the bar plot
    sns.barplot(x    = df.index,
                y    = column,
                data = df)
    
    # Setting the baseline line
    plt.axhline(41.8898, 
                color = "black")
    
    # Setting graph parameters
    plt.title(label, size = 20)
    plt.xlabel("Model", size = 18)
    plt.ylabel("Score", size = 18)
    plt.xticks(size  = 14)
    plt.yticks(ticks = np.arange(0,110,10), 
               size  = 14)

In [None]:
def roc(model_prob, X_test, y_test, y_pred, title):
    
    # Calculating probabilities
    model_prob    = [i[1] for i in model_prob.predict_proba(X_test)]
    
    # Creating a dataframe out of the true values & probas
    model_pred_df = pd.DataFrame({"true_values": y_test,
                                  "pred_probs" : model_prob})

    # Setting threshold values    
    thresholds = np.linspace(0, 1, 500) 
    
    # Calculating the sensitivity
    def true_positive_rate(df, true_col, pred_prob_col, threshold):
        true_positive  = df[(df[true_col] == 1) & (df[pred_prob_col] >= threshold)].shape[0]
        false_negative = df[(df[true_col] == 1) & (df[pred_prob_col] < threshold)].shape[0]
        return true_positive / (true_positive + false_negative)
    
    # Calculating the false positive rate
    # (uses the same >= threshold rule as the sensitivity above)
    def false_positive_rate(df, true_col, pred_prob_col, threshold):
        true_negative  = df[(df[true_col] == 0) & (df[pred_prob_col] < threshold)].shape[0]
        false_positive = df[(df[true_col] == 0) & (df[pred_prob_col] >= threshold)].shape[0]
        return 1 - (true_negative / (true_negative + false_positive))
    
    # Calculating the sensitivity and false positive rate at each threshold
    tpr_values = [true_positive_rate(model_pred_df, "true_values", "pred_probs", prob) for prob in thresholds]
    fpr_values = [false_positive_rate(model_pred_df, 'true_values', "pred_probs", prob) for prob in thresholds]

    # Setting up the graph
    plt.figure(figsize   = (13,7),
               facecolor = "white")
    
    # Plotting the predicted
    plt.plot(fpr_values, 
             tpr_values,
             color = "darkorange",
             label = "ROC Curve")
    
    # Setting the baseline
    plt.plot(np.linspace(0, 1, 500),
             np.linspace(0, 1, 500),
             color     = "darkblue",
             label     = "Baseline")
    
    # Setting model parameters
    plt.title(title, fontsize = 18)
    plt.ylabel("Sensitivity", size = 16)
    plt.xlabel("1 - Specificity", size = 16)
    plt.xticks(size = 14)
    plt.yticks(size = 14)
    plt.legend(bbox_to_anchor = (1.04,1), 
               loc            = "upper left",
               fontsize       = 16)
    plt.tight_layout()
    
# The code was modified from code written by Matt Brems 
# during our lesson on classification metrics.

### Dataframes

In [None]:
# Baseline accuracy

baseline = [0.418898]

# Count vectorizer metrics

cvec_accuracy          = [0.65683, 0.66544, 0.66421, 0.64945]
cvec_balanced_accuracy = [0.63445, 0.63617, 0.61924, 0.61548]
cvec_specificity       = [0.60573, 0.56985, 0.42491, 0.48421]
cvec_sensitivity       = [0.4956,  0.45455, 0.34018, 0.40469]
cvec_f1_score          = [0.54781, 0.53265, 0.45941, 0.49198]
cvec_rocauc_score      = [0.63445, 0.63617, 0.61924, 0.61548]

# TFIDF vectorizer metrics

tvec_accuracy          = [0.68758, 0.60271, 0.64084, 0.65314]
tvec_balanced_accuracy = [0.66419, 0.55244, 0.59911, 0.63209]
tvec_specificity       = [0.69685, 0.25387, 0.39726, 0.60638]
tvec_sensitivity       = [0.51906, 0.24047, 0.34018, 0.50147]
tvec_f1_score          = [0.58224, 0.33676, 0.44275, 0.54808]
tvec_rocauc_score      = [0.66419, 0.55244, 0.59911, 0.63209]

In [None]:
cvec_scores = pd.DataFrame(data    = [cvec_accuracy, cvec_balanced_accuracy, 
                                      cvec_specificity, cvec_sensitivity, cvec_f1_score],
                           columns = ["Log. Reg.", "SVC", "Random Forest", "XGBoost"],
                           index   = ["Accuracy", "Balanced Accuracy", 
                                      "Specificity", "Sensitivity", "F1 Score"])


tvec_scores = pd.DataFrame(data    = [tvec_accuracy, tvec_balanced_accuracy, 
                                      tvec_specificity, tvec_sensitivity, tvec_f1_score],
                           columns = ["Log. Reg.", "SVC", "Random Forest", "XGBoost"],
                           index   = ["Accuracy", "Balanced Accuracy", 
                                      "Specificity", "Sensitivity", "F1 Score"])
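
To compare the models at a glance, the top-scoring model for each metric can be pulled out of a score table with `idxmax`.  This is a minimal sketch that assumes a dataframe shaped like `tvec_scores`; the `toy_scores` name is ours, and it holds only the first two TFIDF rows from the lists above.

```python
import pandas as pd

# Two rows copied from the TFIDF metric lists above
toy_scores = pd.DataFrame(data    = [[0.68758, 0.60271, 0.64084, 0.65314],
                                     [0.66419, 0.55244, 0.59911, 0.63209]],
                          columns = ["Log. Reg.", "SVC", "Random Forest", "XGBoost"],
                          index   = ["Accuracy", "Balanced Accuracy"])

# idxmax(axis = 1) returns the column label holding each row's maximum
best_per_metric = toy_scores.idxmax(axis = 1)
print(best_per_metric)
```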

[Top](#Table-Of-Contents)

### Plots

#### Bar Plots

In [None]:
plot_scores(df       = tvec_scores*100,
            column   = "Log. Reg.",
            label    = "Logistic Regression - TVEC")

The horizontal black line represents the baseline accuracy, which is the percentage of the data from the positive class, i.e. r/Cooking.  If the two accuracy metrics were below the baseline, it would indicate that the model performs worse than simply guessing that 41.89% of the data is from r/Cooking.  However, both accuracy metrics are greater than that percentage, so they outperformed the baseline.  That being said, the scores are not good, and we cannot say that this model is accurate even though it performed better than the other seven we ran.  Despite that, it is a good sign that the accuracy and balanced accuracy are close to each other.
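
As a rough illustration of where a baseline like this comes from, the class shares can be read off with `value_counts(normalize = True)`.  The labels below are made up for the sketch (they are not our subreddit data), and the variable names are hypothetical.  The majority-class share is shown alongside because it is another common choice of baseline.

```python
import pandas as pd

# Toy target: 1 = positive class, 0 = negative class (hypothetical values)
toy_labels = pd.Series([1, 0, 0, 1, 0, 0, 1, 0, 0, 0])

# Share of each class in the data
class_shares = toy_labels.value_counts(normalize = True)

# Share of the positive class, the style of baseline used above
positive_share = class_shares.loc[1]

# Accuracy of always predicting the most common class
majority_share = class_shares.max()

print(positive_share, majority_share)
```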


`Specificity` is the share of actual negative posts that the model classified correctly, and `Sensitivity` is the share of actual positive posts that it classified correctly.  As we said with the confusion matrices, the models did not handle the two classes equally well, which is easily seen in the gap between the two scores here.

`F1 Score` is the harmonic mean of precision and sensitivity, so it balances how many positive predictions were correct against how many actual positives were found.  The score is low, ~58%, which makes it a little difficult to interpret because it is hard to tell from the score alone which of the two is pulling it down.  However, we know from the confusion matrix that the score is low because of the negative class.
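
For reference, the F1 arithmetic can be sketched in a few lines.  The numbers are made up to show the behaviour and are not our model's values; `f1_from_rates` is a hypothetical helper, not a function used elsewhere in this notebook.

```python
def f1_from_rates(precision, recall):
    # Harmonic mean of precision and recall (sensitivity);
    # returns 0 when both are 0 to avoid dividing by zero
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Both pairs have the same arithmetic mean of 0.6
balanced_f1 = f1_from_rates(0.6, 0.6)
lopsided_f1 = f1_from_rates(0.9, 0.3)

print(round(balanced_f1, 3), round(lopsided_f1, 3))
```

The lopsided pair scores lower even though its plain average is the same, which is why a low F1 on its own does not say which of the two components is responsible.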

#### ROC Curve

In [None]:
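# Note: roc_auc_score is given hard class predictions here, so the result
# equals the balanced accuracy; passing predict_proba scores instead would
# summarize the curve across all thresholds

rocauc_score = round(roc_auc_score(y_test, tvec_lr_preds), 5)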

roc(model_prob = tvec_lr_gs,
    X_test     = X_test,
    y_test     = y_test,
    y_pred     = tvec_lr_preds,
    title      = f"ROC For TVEC Logistic Regression With A Score Of {rocauc_score}")

The ROC curve represents the relationship between the false positive rate (1 - specificity) and the true positive rate (sensitivity) as the classification threshold changes.  We can see that as the false positive rate increases the true positive rate grows, but not in a linear fashion.

The low score indicates that the model is not very good at separating the two classes, which we also saw above.  Furthermore, the curve is close to the baseline, the baseline being the line along which the model does no better than guessing which post came from which subreddit.  Thus we hoped to have our curve be as far above the baseline as possible, ideally forming a 90º angle in the top-left corner.
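
One detail worth checking when reading a ROC-AUC score is what was passed to `roc_auc_score`.  With hard 0/1 predictions the curve has a single interior point and the score reduces to the balanced accuracy, whereas predicted probabilities let it summarize every threshold.  The sketch below shows this on synthetic data; the `make_classification` dataset and all `toy_` names are assumptions for illustration, not our subreddit data or fitted models.

```python
from sklearn.datasets        import make_classification
from sklearn.linear_model    import LogisticRegression
from sklearn.metrics         import roc_auc_score, roc_curve, balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data (hypothetical stand-in)
toy_X, toy_y = make_classification(n_samples    = 500,
                                   n_features   = 10,
                                   random_state = 42)

X_tr, X_te, y_tr, y_te = train_test_split(toy_X, toy_y,
                                          random_state = 42,
                                          stratify     = toy_y)

toy_model = LogisticRegression(max_iter = 1000).fit(X_tr, y_tr)

# Hard class predictions vs. probabilities of the positive class
toy_preds = toy_model.predict(X_te)
toy_probs = toy_model.predict_proba(X_te)[:, 1]

auc_from_labels = roc_auc_score(y_te, toy_preds)
auc_from_probs  = roc_auc_score(y_te, toy_probs)
toy_bal_acc     = balanced_accuracy_score(y_te, toy_preds)

# roc_curve sweeps the thresholds for us
fpr, tpr, thresholds = roc_curve(y_te, toy_probs)

print(auc_from_labels, toy_bal_acc, auc_from_probs)
```

The first two printed values match, which is the same pattern seen in the ROC-AUC and balanced accuracy lists in the Dataframes section.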

[Top](#Table-Of-Contents)