# First Model
### RANDOM FORESTS 
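- I am using a Random Forest model because it reduces the correlation between its decision trees: each tree is trained on a bootstrap sample of the rows and considers only a random subset of features at each split.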
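
As a minimal sketch of where that de-correlation comes from in scikit-learn, the two relevant settings are `bootstrap` (row resampling per tree) and `max_features` (feature subsampling per split). The synthetic data and the names `X_demo`, `y_demo`, and `rf_demo` are assumptions for illustration only and are not part of this notebook's dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only to illustrate the settings (not the subreddit data)
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)

rf_demo = RandomForestClassifier(
    n_estimators=25,
    bootstrap=True,       # each tree sees a bootstrap sample of the rows
    max_features="sqrt",  # each split considers a random subset of features
    random_state=0,
)
rf_demo.fit(X_demo, y_demo)

# The fitted forest keeps one decision tree per estimator
n_trees = len(rf_demo.estimators_)
print(n_trees)
```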

In [32]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score

In [34]:
both_subs = pd.read_csv('../data/preprocessed.csv')
both_subs.shape

(1980, 9)

In [35]:
both_subs.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1980 entries, 0 to 1979
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   title             1980 non-null   object 
 1   text              1978 non-null   object 
 2   subreddit         1980 non-null   float64
 3   name              1980 non-null   object 
 4   text_length       1980 non-null   int64  
 5   text_word_length  1980 non-null   int64  
 6   title_text        1979 non-null   object 
 7   reg_text          1979 non-null   object 
 8   spacy_text        1979 non-null   object 
dtypes: float64(1), int64(2), object(6)
memory usage: 139.3+ KB


- Only three of the 1,980 rows contain missing values, so we will drop those three rows.
- Three rows represent about 0.15% of our data, which is small enough to justify dropping them.
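
The share of affected rows can be checked directly rather than read off `info()`. The sketch below uses a hypothetical toy frame (`toy`) in place of `both_subs`; for the real data the same calculation should give 3 / 1980, or roughly 0.15%.

```python
import pandas as pd

# Hypothetical toy frame standing in for both_subs (not the real data)
toy = pd.DataFrame({
    "text": ["a", None, "c", "d"],
    "spacy_text": ["a", "b", None, "d"],
})

# Count rows that have at least one missing value in any column
rows_with_na = int(toy.isnull().any(axis=1).sum())
share = rows_with_na / len(toy)
print(f"{rows_with_na} of {len(toy)} rows ({share:.2%}) contain a missing value")
```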

In [12]:
both_subs.dropna(axis=0, inplace=True)

In [13]:
both_subs.isnull().sum()

title               0
text                0
subreddit           0
name                0
text_length         0
text_word_length    0
title_text          0
reg_text            0
spacy_text          0
dtype: int64

In [14]:
both_subs.shape

(1977, 9)

In [15]:
X = both_subs['spacy_text']
y = both_subs['subreddit']

## Checking the distribution of our target variable. 
- Because our classes are fairly balanced, we will not need to stratify.
- Our null baseline accuracy is 52.6%.
- Our baseline model accuracy is (Train: 0.969, Test: 0.925)
- Our Random Forest model will be successful if it surpasses both our null baseline and our baseline model.

In [16]:
# Null Baseline 
y.value_counts(normalize=True)

subreddit
1.0    0.52605
0.0    0.47395
Name: proportion, dtype: float64
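
The null baseline is simply the accuracy of always predicting the majority class. As a hedged sketch, scikit-learn's `DummyClassifier` reproduces that number. The labels below (`y_toy`) are made up to roughly match the class balance shown above and are not the real target column.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels with roughly the class balance shown above
y_toy = np.array([1.0] * 526 + [0.0] * 474)
X_toy = np.zeros((len(y_toy), 1))  # the dummy model ignores the features

# Always predict the most frequent class seen during fit
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_toy, y_toy)

null_accuracy = dummy.score(X_toy, y_toy)
print(round(null_accuracy, 3))
```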

In [17]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [18]:
pipe = Pipeline([
    ('tri', TfidfVectorizer()),
    ('rf', RandomForestClassifier())
])

In [19]:
pipe_params = {
    'tri__stop_words': [None, 'english'],
    'tri__ngram_range': [(1, 1), (2, 2)], 
    'tri__max_features': [None, 15000, 20000, 100000],
    'rf__n_estimators': [100, 150, 200],
    'rf__max_depth': [2, 3, 4, 5]  
}

In [20]:
pipe_gs = GridSearchCV(pipe, param_grid=pipe_params, cv=5, verbose=2)

In [21]:
pipe_gs.fit(X_train, y_train)

Fitting 5 folds for each of 192 candidates, totalling 960 fits
[CV] END rf__max_depth=2, rf__n_estimators=100, tri__max_features=None, tri__ngram_range=(1, 1), tri__stop_words=None; total time=   0.1s
[CV] END rf__max_depth=2, rf__n_estimators=100, tri__max_features=None, tri__ngram_range=(1, 1), tri__stop_words=None; total time=   0.1s
[CV] END rf__max_depth=2, rf__n_estimators=100, tri__max_features=None, tri__ngram_range=(1, 1), tri__stop_words=None; total time=   0.1s
[CV] END rf__max_depth=2, rf__n_estimators=100, tri__max_features=None, tri__ngram_range=(1, 1), tri__stop_words=None; total time=   0.1s
[CV] END rf__max_depth=2, rf__n_estimators=100, tri__max_features=None, tri__ngram_range=(1, 1), tri__stop_words=None; total time=   0.1s
[CV] END rf__max_depth=2, rf__n_estimators=100, tri__max_features=None, tri__ngram_range=(1, 1), tri__stop_words=english; total time=   0.1s
[CV] END rf__max_depth=2, rf__n_estimators=100, tri__max_features=None, tri__ngram_range=(1, 1), tri__stop
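
The "192 candidates, totalling 960 fits" line in the log follows from the size of the grid: 2 × 2 × 4 × 3 × 4 = 192 parameter combinations, each fitted on 5 folds. A small sketch using `ParameterGrid` confirms the arithmetic before committing to a long search; the variable names `n_candidates` and `n_fits` are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as pipe_params above
pipe_params = {
    'tri__stop_words': [None, 'english'],
    'tri__ngram_range': [(1, 1), (2, 2)],
    'tri__max_features': [None, 15000, 20000, 100000],
    'rf__n_estimators': [100, 150, 200],
    'rf__max_depth': [2, 3, 4, 5],
}

n_candidates = len(ParameterGrid(pipe_params))  # number of parameter combinations
n_fits = n_candidates * 5                       # one fit per combination per CV fold
print(n_candidates, n_fits)
```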

In [37]:
pipe_gs.best_estimator_

In [38]:
pipe_gs.best_params_

{'rf__max_depth': 5,
 'rf__n_estimators': 100,
 'tri__max_features': 100000,
 'tri__ngram_range': (1, 1),
 'tri__stop_words': 'english'}

In [22]:
pipe_gs.best_score_

0.8556056056056056

- The best cross-validation score for our Random Forest model (0.856) falls short of our baseline model's 0.922.

In [36]:
pipe_gs.score(X_train, y_train), pipe_gs.score(X_test, y_test)

(0.9014844804318488, 0.8747474747474747)

- Similarly, the Random Forest's accuracy scores (Train: 0.901, Test: 0.875) underperform our baseline model. We will proceed with other models to see whether they increase our accuracy.

In [43]:
pipe_gs.predict_proba(['It starts with a child meeting a lady with red lips and long fingers. She keeps asking him whether he knows what she can do with these red lips and these long fingers. The child is scared but cannot escape this lady with her red lips and long fingers. This repeats itself a couple of times, getting scarier and scarier. At last, he will ask her what she can do with those red lips and long fingers. And she will move her long fingers between her lips, making a funny sound.'])

array([[0.56087009, 0.43912991]])

In [42]:
pipe_gs.predict_proba(['  J. Dawson had two goals in life: to find a rich vein of gold and to find a bride. So far, he hadn’t had any luck either with the gold or the ladies. His smooth, eastern manners seemed rather sissy and irritating among the rough miners and rowdy residents of a wild western town. He’d courted the schoolteacher, the local farmers’ daughters, and even took to visiting a few of the other entertainers at the saloon. All to no avail.Then one day, J. Dawson’s lifeless body was found at the bottom of a cliff. He had fallen several hundred feet off the mountain, where he was prospecting for gold. He was buried in Buckskin cemetery with a small service and everyone forgot about him.  Until two days after the funeral, the sheriff found the remains of J. Dawson in the local saloon, lying in the bed of a lady of the evening that he had courted a few months back. She had been sleeping off another busy night when she awoke to find J. Dawson’s remains beside her. The sheriff calmed the hysterical woman and then took J. Dawson back to the graveyard to bury him again.Naturally, no one knew anything. The miners avowed their innocence, and the shopkeepers and businessmen claimed their ignorance. The town treated the matter as a joke, speculating privately on who had dug up poor old J. Dawson.  Three days later, J. Dawson was found at the schoolhouse. He was propped against the doorpost, a love note addressed to the teacher in his hand. After being dead a week, he was not a pretty sight. The sheriff removed the corpse a second time, and had the body buried as deeply as possible. He piled heavy stones atop the grave, and J. Dawson remained in his grave for several weeks.'])

array([[0.53483923, 0.46516077]])
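
To read these arrays, note that column `i` of `predict_proba` corresponds to `classes_[i]`, so with labels encoded as 0.0 and 1.0 the first number is the probability of class 0.0 and the second of class 1.0. Both predictions above sit close to 0.5, so the model is not confident about either story. The sketch below shows the mapping on a hypothetical toy corpus (`docs`, `labels`, and `toy_pipe` are made up for illustration and say nothing about which subreddit each label represents).

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; labels 0.0 / 1.0 mirror the encoded target column
docs = [
    "ghost in the attic",
    "haunted house at night",
    "gold rush miners",
    "prospecting for gold out west",
]
labels = [0.0, 0.0, 1.0, 1.0]

toy_pipe = Pipeline([
    ("tri", TfidfVectorizer()),
    ("rf", RandomForestClassifier(n_estimators=10, random_state=42)),
])
toy_pipe.fit(docs, labels)

proba = toy_pipe.predict_proba(["a ghost in the house"])

# Pair each probability with the class label its column belongs to
for cls, p in zip(toy_pipe.classes_, proba[0]):
    print(f"class {cls}: {p:.3f}")
```

On the fitted grid search, `pipe_gs.classes_` gives the same column ordering.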