# Project 3: Reddit Post Sorting

- **ExplainLikeImFive (ELI5)** - Explain Like I'm Five is the best forum and archive on the internet for layperson-friendly explanations. Don't Panic!
- **AskScience** - Ask a science question, get a science answer.


---

We will analyze a random collection of posts from two subreddits, **ExplainLikeImFive** and **AskScience**, to build a model that predicts whether an individual post belongs to ELI5 or AskScience. The model uses the Title and Body of each post.

**What am I hoping to achieve with this?**
> To determine whether ELI5 is distinguishable from AskScience.

**Why?**
> To see whether a subreddit focused on explaining things simply is really that different from a subreddit that explains things any way it can.

# Modeling and Conclusion

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.corpus import stopwords

from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

import warnings
warnings.filterwarnings('ignore')

## Read in Data

We will also drop the null values again, since our EDA showed this worked best.

In [2]:
ex_df = pd.read_csv('../data/ex_df.csv')
ex_df.dropna(inplace=True)

In [3]:
ask_df = pd.read_csv('../data/ask_df.csv')
ask_df.dropna(inplace=True)

In [4]:
ex_df['combo'] = ex_df['title'] + ' ' + ex_df['selftext']
ask_df['combo'] = ask_df['title'] + ' ' + ask_df['selftext']
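
Dropping nulls before building the combined column matters because concatenating strings with a missing value yields a missing value. The sketch below illustrates this on a hypothetical two-row frame (the `toy` data is made up, not from our scrape) and shows one possible alternative, filling missing bodies with an empty string, which is not what this notebook does.

```python
import pandas as pd

# Toy frame standing in for the scraped posts (hypothetical data).
toy = pd.DataFrame({
    "title": ["Why is the sky blue?", "How do magnets work?"],
    "selftext": ["Asking for my kid.", None],
})

# Plain concatenation propagates the missing body into the combined text.
naive = toy["title"] + " " + toy["selftext"]

# Filling missing bodies with an empty string keeps the title-only post.
filled = (toy["title"] + " " + toy["selftext"].fillna("")).str.strip()

print(naive.isna().tolist())
print(filled.tolist())
```

Whether title-only posts are worth keeping is a modeling decision; here we follow the EDA and drop them.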

Now let's combine our dataframes into a single dataframe for our model.

In [5]:
culmination = pd.concat([ex_df, ask_df], ignore_index=True)

## Model Setup

First let's define our feature (the combined text) and our target (the subreddit).

In [6]:
X = culmination['combo']
y = culmination['subreddit']

### Null Model
It's important to know what our baseline success rate would be without a model.

In [7]:
y.value_counts(normalize=True)

askscience           0.571501
explainlikeimfive    0.428499
Name: subreddit, dtype: float64

**Null Model Analysis**
> Our null model, flatly predicting the more common class, would be 57.15% accurate.
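
As a minimal sketch of where that baseline comes from, the snippet below computes the accuracy of always predicting the majority class. The `labels` list is hypothetical and only chosen to mirror a roughly 57/43 split; it is not our actual data.

```python
from collections import Counter

# Hypothetical labels mirroring a roughly 57/43 class split.
labels = ["askscience"] * 4 + ["explainlikeimfive"] * 3

counts = Counter(labels)
majority_class, majority_count = counts.most_common(1)[0]

# Accuracy of always predicting the majority class.
baseline_accuracy = majority_count / len(labels)
print(majority_class, round(baseline_accuracy, 4))
```

Any model we fit needs to beat this number on the test set to be worth using.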

### Train / Test Split

In [8]:
# Define training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X,y,
                                                    test_size=0.33,
                                                    stratify=y,
                                                    random_state=42)
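
To illustrate what `stratify=y` does, here is a small sketch on a hypothetical 20-post corpus with a 60/40 class split. The names `texts`, `labels`, and the `X_tr`/`X_te` variables are placeholders for this example only.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus: 12 "askscience" posts and 8 "explainlikeimfive" posts.
texts = [f"post {i}" for i in range(20)]
labels = ["askscience"] * 12 + ["explainlikeimfive"] * 8

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

# With stratification, both splits keep the original 60/40 class ratio.
print(Counter(y_tr))
print(Counter(y_te))
```

Keeping the class ratio the same in both splits means the null model baseline applies equally to the training and testing sets.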

### Stop Word Adjustments

From our EDA, we have a list of words we want removed in addition to the default list.

In [9]:
# We want to use the standard stopwords
en_stopwords = stopwords.words('english')

# We also want to remove 'eli5'; otherwise it would be a (nearly) 100% indicator
en_stopwords.append('eli5')

# From our analysis we also found some other words that would be beneficial to remove
en_stopwords.extend([
    'https', 'www', 'like', 'would', 'imgur', 'com',
    'en', 'wikipedia', 'org', 'wiki', 'x200b',
])
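
The sketch below shows, under simplified assumptions, how a custom stop word list changes the vocabulary a vectorizer learns. The two `docs` and the short `custom_stopwords` list are hypothetical stand-ins for our posts and for `en_stopwords`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical posts; the stop word list is a tiny stand-in for en_stopwords.
docs = [
    "ELI5 why is the sky blue",
    "Why would the sky look blue https www example com",
]
custom_stopwords = ["why", "is", "the", "would", "eli5", "https", "www", "com"]

vectorizer = CountVectorizer(stop_words=custom_stopwords)
vectorizer.fit(docs)

# Remaining vocabulary after lowercasing and stop word removal.
vocab = sorted(vectorizer.vocabulary_)
print(vocab)  # ['blue', 'example', 'look', 'sky']
```

Because the vectorizer lowercases text by default before filtering, the lowercase entry `eli5` also removes `ELI5` as it appears in post titles.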

## Model Selection

We want to try a few models to gauge a reasonable success rate relative to our null model and to compare the models against each other.

# Model 1: Count Vectorizer and Random Forest

**Let's set a pipeline**

In [10]:
pipe = Pipeline([
    ('cvec', CountVectorizer(stop_words = en_stopwords)),
    ('forest', RandomForestClassifier())    
])

**Now we pass in the parameters we want to gridsearch over.**

In [11]:
# Let's add our hyperparameters
pipe_params = {
    'cvec__max_features': [150_000],
    'cvec__min_df': [3],
    'cvec__max_df': [.55],
    'cvec__ngram_range': [(1,1), (1,2), (2,2)],
    'forest__n_estimators': [50],
    'forest__max_depth': [100],
}
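
A grid search tries every combination of the listed values, so the number of candidates is the product of the list lengths. As a rough sketch using only the standard library, the grid above expands as follows (the variable names here are for illustration only):

```python
from itertools import product

# Same shape as the grid above: only ngram_range has more than one option.
param_grid = {
    "cvec__max_features": [150_000],
    "cvec__min_df": [3],
    "cvec__max_df": [0.55],
    "cvec__ngram_range": [(1, 1), (1, 2), (2, 2)],
    "forest__n_estimators": [50],
    "forest__max_depth": [100],
}

# Every combination of one value per parameter is one candidate.
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]

n_folds = 5
total_fits = len(candidates) * n_folds
print(len(candidates), total_fits)  # 3 15
```

With 5-fold cross-validation this gives 15 fits, which matches the count reported when the search runs.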

**We initialize our GridSearch for our model selection.** 

In [12]:
# Instantiate GridSearchCV.
gs = GridSearchCV(
    # what object are we optimizing?
    estimator = pipe,
    
    # what parameter values are we searching?
    param_grid = pipe_params,
    
    # 5-fold cross-validation.
    cv = 5,
    
    verbose=1,
    n_jobs=-1
)

**Finally, we fit our model to our training data**

In [13]:
gs.fit(X_train, y_train)

Fitting 5 folds for each of 3 candidates, totalling 15 fits


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('cvec',
                                        CountVectorizer(stop_words=['i', 'me',
                                                                    'my',
                                                                    'myself',
                                                                    'we', 'our',
                                                                    'ours',
                                                                    'ourselves',
                                                                    'you',
                                                                    "you're",
                                                                    "you've",
                                                                    "you'll",
                                                                    "you'd",
                                                                    'your',
  

Then we analyze what the GridSearch chose.

In [14]:
# What are the best hyperparameters?
gs.best_params_

{'cvec__max_df': 0.55,
 'cvec__max_features': 150000,
 'cvec__min_df': 3,
 'cvec__ngram_range': (1, 1),
 'forest__max_depth': 100,
 'forest__n_estimators': 50}
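
Alongside `best_params_`, a fitted `GridSearchCV` also exposes `best_score_`, the mean cross-validated accuracy of the winning candidate. The sketch below shows this on a hypothetical eight-post corpus; `toy_pipe`, `toy_gs`, the documents, and the smaller settings (`n_estimators=10`, `cv=2`) are assumptions chosen so the example runs quickly, not the settings used in this notebook.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus, far smaller than the real data.
docs = [
    "why is the sky blue", "how do planes stay up",
    "why do cats purr", "how does wifi work",
    "what determines photon wavelength", "does entropy always increase",
    "what causes quantum tunneling", "is dark matter detectable",
]
labels = ["eli5"] * 4 + ["askscience"] * 4

toy_pipe = Pipeline([
    ("cvec", CountVectorizer()),
    ("forest", RandomForestClassifier(n_estimators=10, random_state=0)),
])
toy_gs = GridSearchCV(toy_pipe, {"cvec__ngram_range": [(1, 1), (1, 2)]}, cv=2)
toy_gs.fit(docs, labels)

# best_score_ is the mean cross-validated accuracy of the winning candidate,
# a less optimistic reference point than accuracy on the training data itself.
print(toy_gs.best_params_, round(toy_gs.best_score_, 3))
```

Comparing `best_score_` with the training score gives an early hint of overfitting before the test set is touched.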

In [15]:
# Score model on training set.
gs.score(X_train, y_train)

0.9764233759719088

In [16]:
# Score model on testing set.
gs.score(X_test, y_test)

0.7525458248472505

# Model 2: TfidfVectorizer and Random Forest

**Let's set a pipeline**

In [17]:
pipe2 = Pipeline([
    ('cvec', TfidfVectorizer(stop_words = en_stopwords)),
    ('forest', RandomForestClassifier())    
])
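
The difference from Model 1 is that TF-IDF down-weights terms that appear in many documents instead of using raw counts. The following is a simplified, standard-library sketch of that weighting, assuming the smoothed IDF formula and L2 normalization that scikit-learn documents as its defaults. The tokenized `docs` are hypothetical, and the helper `tfidf` is for illustration, not part of any library.

```python
import math
from collections import Counter

# Hypothetical tokenized posts.
docs = [["sky", "blue", "sky"], ["sky", "light"]]
n_docs = len(docs)

# Document frequency: in how many documents each term appears.
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Return L2-normalized TF-IDF weights for one tokenized document."""
    counts = Counter(doc)
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1, multiplied by the raw count.
    raw = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

weights = tfidf(docs[1])
print(weights)
```

In the second document both terms occur once, but "light" receives the larger weight because it appears in fewer documents than "sky".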

In [18]:
# Let's add our hyperparameters
pipe_params2 = {
    'cvec__max_features': [150_000],
    'cvec__min_df': [3],
    'cvec__max_df': [.55],
    'cvec__ngram_range': [(1,2)],
    'forest__n_estimators': [50],
    'forest__max_depth': [100],
}

In [19]:
# Instantiate GridSearchCV.
gs2 = GridSearchCV(
    # what object are we optimizing?
    estimator = pipe2,
    
    # what parameter values are we searching?
    param_grid = pipe_params2,
    
    # 5-fold cross-validation.
    cv = 5,
    
    verbose=1,
    n_jobs=-1
)

In [20]:
gs2.fit(X_train, y_train)

Fitting 5 folds for each of 1 candidates, totalling 5 fits


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('cvec',
                                        TfidfVectorizer(stop_words=['i', 'me',
                                                                    'my',
                                                                    'myself',
                                                                    'we', 'our',
                                                                    'ours',
                                                                    'ourselves',
                                                                    'you',
                                                                    "you're",
                                                                    "you've",
                                                                    "you'll",
                                                                    "you'd",
                                                                    'your',
  

In [21]:
# What are the best hyperparameters?
gs2.best_params_

{'cvec__max_df': 0.55,
 'cvec__max_features': 150000,
 'cvec__min_df': 3,
 'cvec__ngram_range': (1, 2),
 'forest__max_depth': 100,
 'forest__n_estimators': 50}

In [22]:
# Score model on training set.
gs2.score(X_train, y_train)

0.9937296212691247

In [23]:
# Score model on testing set.
gs2.score(X_test, y_test)

0.7563645621181263

# Model 3: TfidfVectorizer and KNeighborsClassifier

**Let's set a pipeline**

In [24]:
pipe3 = Pipeline([
    ('cvec', TfidfVectorizer(stop_words = en_stopwords)),
    ('knn', KNeighborsClassifier())    
])

In [25]:
# Let's add our hyperparameters
pipe_params3 = {
    'cvec__max_features': [100_000],
    'cvec__min_df': [3],
    'cvec__max_df': [.9],
    'cvec__ngram_range': [(1,2)],
    'knn__n_neighbors': [4],
    'knn__p':  [2]
}
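
The `knn__p` value is the power parameter of the Minkowski distance, which is the default metric for `KNeighborsClassifier`: `p=1` is Manhattan distance and `p=2` is Euclidean distance. A minimal sketch of the formula, with a hypothetical helper `minkowski` and made-up vectors:

```python
def minkowski(u, v, p):
    """Minkowski distance between two equal-length vectors."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1 / p)

a, b = [0.0, 3.0], [4.0, 0.0]
print(minkowski(a, b, p=1))  # 7.0 (Manhattan)
print(minkowski(a, b, p=2))  # 5.0 (Euclidean)
```

So with `p=2`, each post is classified by a vote among its 4 nearest neighbors in Euclidean distance over the TF-IDF vectors.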

In [26]:
# Instantiate GridSearchCV.
gs3 = GridSearchCV(
    # what object are we optimizing?
    estimator = pipe3,
    
    # what parameter values are we searching?
    param_grid = pipe_params3,
    
    # 5-fold cross-validation.
    cv = 5,
    
    verbose=1,
    n_jobs=-1
)

In [27]:
gs3.fit(X_train, y_train)

Fitting 5 folds for each of 1 candidates, totalling 5 fits


GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('cvec',
                                        TfidfVectorizer(stop_words=['i', 'me',
                                                                    'my',
                                                                    'myself',
                                                                    'we', 'our',
                                                                    'ours',
                                                                    'ourselves',
                                                                    'you',
                                                                    "you're",
                                                                    "you've",
                                                                    "you'll",
                                                                    "you'd",
                                                                    'your',
  

In [28]:
# What are the best hyperparameters?
gs3.best_params_

{'cvec__max_df': 0.9,
 'cvec__max_features': 100000,
 'cvec__min_df': 3,
 'cvec__ngram_range': (1, 2),
 'knn__n_neighbors': 4,
 'knn__p': 2}

In [29]:
# Score model on training set.
gs3.score(X_train, y_train)

0.800351141208929

In [30]:
# Score model on testing set.
gs3.score(X_test, y_test)

0.6975560081466395

## Null Model

In [31]:
y.value_counts(normalize=True)

askscience           0.571501
explainlikeimfive    0.428499
Name: subreddit, dtype: float64

---
# Overall Modeling Analysis
Our 3 models have varying, though similar, levels of accuracy on train and test. 

### Model 1 Analysis:
> Model 1 used CountVectorizer and RandomForestClassifier

- Training set accuracy of 97.6%
- Testing set accuracy of 75.3%

Our model is likely overfit, given the gap of roughly 22 percentage points between train and test accuracy. However, a test accuracy of about 75% is well above the 57.15% baseline and high enough for us to use. This result tells us there is possibly a discernible difference between the subreddits.

> - **Best Hyperparameters**:
> - 'cvec__max_df': 0.55,
> - 'cvec__max_features': 150000,
> - 'cvec__min_df': 3,
> - 'cvec__ngram_range': (1, 1),
> - 'forest__max_depth': 100,
> - 'forest__n_estimators': 50

### Model 2 Analysis:
> Model 2 used TfidfVectorizer and RandomForestClassifier

- Training set accuracy of 99.4%
- Testing set accuracy of 75.6%

Our model is likely overfit, given the gap of roughly 24 percentage points between train and test accuracy. However, a test accuracy of about 76% is well above the 57.15% baseline and high enough for us to use. This result tells us there is possibly a discernible difference between the subreddits.

> - **Best Hyperparameters**:
> - 'cvec__max_df': 0.55,
> - 'cvec__max_features': 150000,
> - 'cvec__min_df': 3,
> - 'cvec__ngram_range': (1, 2),
> - 'forest__max_depth': 100,
> - 'forest__n_estimators': 50

### Model 3 Analysis:
> Model 3 used TfidfVectorizer and KNeighborsClassifier

- Training set accuracy of 80.0%
- Testing set accuracy of 69.8%

Our model is still overfit, with a gap of roughly 10 percentage points between train and test accuracy, although it is less overfit than our other 2 models. If we could only use a single model, this would likely be the choice, despite its lower test accuracy. Again, about 70% is above the 57.15% baseline and high enough for us to use. This result tells us there is possibly a discernible difference between the subreddits.

> - **Best Hyperparameters**:
> - 'cvec__max_df': 0.9,
> - 'cvec__max_features': 100000,
> - 'cvec__min_df': 3,
> - 'cvec__ngram_range': (1, 2),
> - 'knn__n_neighbors': 4,
> - 'knn__p': 2
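
To make the comparison concrete, the short sketch below computes each model's train/test gap and its lift over the null model, using the scores reported above rounded to four decimal places. The dictionary layout and variable names are only for illustration.

```python
# Scores copied from the notebook output above, rounded to four places.
scores = {
    "Model 1 (CountVectorizer + RandomForest)": (0.9764, 0.7525),
    "Model 2 (TfidfVectorizer + RandomForest)": (0.9937, 0.7564),
    "Model 3 (TfidfVectorizer + KNN)": (0.8004, 0.6976),
}
baseline = 0.5715  # share of the majority class, askscience

# Gap measures overfitting; lift measures improvement over the null model.
summary = {
    name: {"gap": round(train - test, 4), "lift": round(test - baseline, 4)}
    for name, (train, test) in scores.items()
}
for name, row in summary.items():
    print(f"{name}: train-test gap {row['gap']:.4f}, lift over baseline {row['lift']:.4f}")
```

Model 3 has the smallest train/test gap, while Model 2 has the largest lift over the baseline, which is the trade-off behind the model choice discussed above.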

# Conclusion

All 3 models beat our null model by a clear margin (roughly 13 to 18 percentage points on the test set), which indicates a discernible difference between the two subreddits, AskScience and ExplainLikeImFive. Each model individually pointed toward this conclusion, and since all three agree, we are even more confident in it.