# Second Model
### Naive Bayes 
- I chose this model because it trains and generates predictions quickly. The downside is that Naive Bayes tends to produce poorly calibrated probabilities, so the predicted probabilities will be difficult to interpret.


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

In [2]:
text = pd.read_csv('../data/preprocessed.csv')
text.shape

(1980, 9)

In [3]:
text.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1980 entries, 0 to 1979
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   title             1980 non-null   object 
 1   text              1978 non-null   object 
 2   subreddit         1980 non-null   float64
 3   name              1980 non-null   object 
 4   text_length       1980 non-null   int64  
 5   text_word_length  1980 non-null   int64  
 6   title_text        1979 non-null   object 
 7   reg_text          1979 non-null   object 
 8   spacy_text        1979 non-null   object 
dtypes: float64(1), int64(2), object(6)
memory usage: 139.3+ KB


- Only three of the 1,980 rows contain missing values, so we will drop those three rows.
- Three rows represent about 0.15% of our data (3 / 1,980), which is small enough to justify dropping them.
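
As a rough illustration of the check described above, the sketch below counts rows with at least one missing value and reports their share before dropping them. The `toy` frame and its contents are hypothetical stand-ins for `text`, not the actual data.

```python
import pandas as pd

# Hypothetical toy frame standing in for `text`; the real data is not reproduced here.
toy = pd.DataFrame({
    "title": ["a", "b", "c", "d"],
    "text": ["w", None, "y", "z"],
    "spacy_text": ["w", None, "y", None],
})

# A row counts as incomplete if any of its columns is null.
rows_with_na = toy.isnull().any(axis=1)
n_incomplete = int(rows_with_na.sum())
share = n_incomplete / len(toy)
print(f"{n_incomplete} of {len(toy)} rows ({share:.1%}) have a missing value")

# Non-mutating alternative to dropna(inplace=True): keep the result in a new frame.
toy_clean = toy.dropna(axis=0).reset_index(drop=True)
print(toy_clean.shape)
```

Counting incomplete rows with `any(axis=1)` avoids double counting a row that is missing values in several columns, which is why the per-column null counts from `info()` can sum to more than the number of rows dropped.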

In [4]:
text.dropna(axis=0, inplace=True)

In [5]:
text.isnull().sum()

title               0
text                0
subreddit           0
name                0
text_length         0
text_word_length    0
title_text          0
reg_text            0
spacy_text          0
dtype: int64

In [6]:
text.shape

(1977, 9)

In [7]:
X = text['spacy_text']
y = text['subreddit']

## Checking the distribution of our target variable
- Because our distribution is fairly even, we will not need to stratify.
- Our null baseline accuracy is 52.6% (the share of the majority class).
- Our baseline model accuracy is (Train: 0.969, Test: 0.925).
- Our Naive Bayes model will be successful if it surpasses both our null baseline and our baseline model.

In [8]:
# Null Baseline 
y.value_counts(normalize=True)

subreddit
1.0    0.52605
0.0    0.47395
Name: proportion, dtype: float64

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

In [10]:
pipe = Pipeline([
    ('cvec', CountVectorizer()),
    ('nb', MultinomialNB())
])

In [11]:
pipe_params = {
    'cvec__max_features': [None, 15000, 20000, 100000], 
    'cvec__min_df': [2, 3], 
    'cvec__max_df': [0.9, 0.95],
    'cvec__ngram_range': [(1, 1), (1, 2)] 
}
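
Before fitting, it can help to estimate how much work the grid implies. The sketch below multiplies out the option counts in the parameter grid and assumes the 5-fold cross-validation used in this notebook. The variable names `n_combinations`, `n_folds`, and `n_fits` are illustrative only.

```python
import math

# Same grid as in the notebook cell above.
pipe_params = {
    'cvec__max_features': [None, 15000, 20000, 100000],
    'cvec__min_df': [2, 3],
    'cvec__max_df': [0.9, 0.95],
    'cvec__ngram_range': [(1, 1), (1, 2)],
}

# An exhaustive grid search tries every combination of the listed values.
n_combinations = math.prod(len(values) for values in pipe_params.values())

# Each combination is fit once per fold (cv=5 is assumed here).
n_folds = 5
n_fits = n_combinations * n_folds

print(n_combinations, n_fits)  # 32 160
```

With the default `refit=True`, `GridSearchCV` performs one additional fit of the best combination on the full training set after the search.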

In [12]:
gs = GridSearchCV(pipe, 
                  param_grid=pipe_params, 
                  cv = 5)

In [13]:
gs.fit(X_train, y_train)

In [14]:
gs.best_params_

{'cvec__max_df': 0.9,
 'cvec__max_features': 15000,
 'cvec__min_df': 2,
 'cvec__ngram_range': (1, 2)}

In [15]:
gs.best_score_

0.9338793338793339

- The best cross-validated accuracy of our Naive Bayes model (0.934) is roughly 1 percentage point above our baseline model's test accuracy (0.925).

In [16]:
gs.score(X_train, y_train), gs.score(X_test, y_test)

(0.9757085020242915, 0.9373737373737374)

- Similarly, our Naive Bayes model outperforms our baseline model on both splits: Train accuracy is about 0.7 percentage points higher and Test accuracy is about 1.2 percentage points higher.
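
The comparison above is simple arithmetic on the reported scores. A minimal sketch, using only the accuracies quoted in this notebook (the dictionary names are illustrative):

```python
# Accuracies reported earlier in the notebook.
baseline = {"train": 0.969, "test": 0.925}
naive_bayes = {"train": 0.9757085020242915, "test": 0.9373737373737374}

# Difference in percentage points, rounded to one decimal place.
gains = {
    split: round((naive_bayes[split] - baseline[split]) * 100, 1)
    for split in baseline
}
print(gains)

# Train/test gap of the Naive Bayes model, a rough indicator of overfitting.
nb_gap = round((naive_bayes["train"] - naive_bayes["test"]) * 100, 1)
print(nb_gap)
```

Expressing the differences in percentage points rather than percent avoids ambiguity between absolute and relative improvement.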

# Testing Model on Stories with Elements of Both Subreddits

In [32]:
gs.predict(['If you travel to Bear Lake in Utah on a quiet day, you just might catch a glimpse of the Bear Lake Monster. The monster looks like a huge brown snake and is nearly 90 feet long. It has ears that stick out from the side of its skinny head and a mouth big enough to eat a man. According to some, it has small legs and it kind of scurries when it ventures out on land. But in the water – watch out! It can swim faster than a horse can gallop – makes a mile a minute on a good day. Sometimes the monster likes to sneak up on unwary swimmers and blow water at them. The ones it doesn’t carry off to eat, that is. A feller I heard about spotted the monster early one evening as he was walking along the lake. He tried to shoot it with his rifle. The man was a crack shot, but not one of his bullets touched that monster. It scared the heck out of him and he high tailed it home faster than you can say Jack Robinson. Left his rifle behind him and claimed the monster ate it. Sometimes, when the monster has been quiet for a while, people start saying it is gone for good. Some folks even dredge up that old tale that says how Pecos Bill heard about the Bear Lake monster and bet some cowpokes that he could wrestle that monster until it said uncle. According to them folks, the fight lasted for days and created a hurricane around Bear Lake. Finally, Bill flung that there monster over his shoulder and it flew so far it went plumb around the world and landed in Loch Ness, where it lives to this day. Course, we know better than that. The Bear Lake Monster is just hibernating-like. Keep your eyes open at dusk and maybe you’ll see it come out to feed. Just be careful swimming in the lake, or you might be its next meal!'])

array([0.])

In [33]:
gs.predict(['  J. Dawson had two goals in life: to find a rich vein of gold and to find a bride. So far, he hadn’t had any luck either with the gold or the ladies. His smooth, eastern manners seemed rather sissy and irritating among the rough miners and rowdy residents of a wild western town. He’d courted the schoolteacher, the local farmers’ daughters, and even took to visiting a few of the other entertainers at the saloon. All to no avail.Then one day, J. Dawson’s lifeless body was found at the bottom of a cliff. He had fallen several hundred feet off the mountain, where he was prospecting for gold. He was buried in Buckskin cemetery with a small service and everyone forgot about him.  Until two days after the funeral, the sheriff found the remains of J. Dawson in the local saloon, lying in the bed of a lady of the evening that he had courted a few months back. She had been sleeping off another busy night when she awoke to find J. Dawson’s remains beside her. The sheriff calmed the hysterical woman and then took J. Dawson back to the graveyard to bury him again.Naturally, no one knew anything. The miners avowed their innocence, and the shopkeepers and businessmen claimed their ignorance. The town treated the matter as a joke, speculating privately on who had dug up poor old J. Dawson.  Three days later, J. Dawson was found at the schoolhouse. He was propped against the doorpost, a love note addressed to the teacher in his hand. After being dead a week, he was not a pretty sight. The sheriff removed the corpse a second time, and had the body buried as deeply as possible. He piled heavy stones atop the grave, and J. Dawson remained in his grave for several weeks.'])

array([0.])
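
For stories that mix elements of both classes, the hard label alone hides how close the decision was. The sketch below shows one way to inspect class probabilities with `predict_proba`. The tiny corpus, its labels, and the name `toy_pipe` are hypothetical and do not reflect the real subreddit data or its label encoding.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; labels 0.0 and 1.0 are placeholders, not the real classes.
docs = [
    "ghost haunted house scream",
    "haunted grave ghost night",
    "cowboy legend tall tale lake",
    "legend monster lake folklore",
]
labels = [0.0, 0.0, 1.0, 1.0]

toy_pipe = Pipeline([
    ("cvec", CountVectorizer()),
    ("nb", MultinomialNB()),
])
toy_pipe.fit(docs, labels)

# A document that borrows vocabulary from both classes.
mixed = ["ghost monster haunted lake legend"]
proba = toy_pipe.predict_proba(mixed)

# Columns of `proba` follow the order of `classes_`.
for cls, p in zip(toy_pipe.classes_, proba[0]):
    print(f"class {cls}: {p:.3f}")
```

A fitted `GridSearchCV` with the default `refit=True` exposes the same method, so `gs.predict_proba([...])` could be applied to the two stories above. Because Naive Bayes probabilities are often pushed toward 0 or 1, especially on long documents, they are better read as a ranking signal than as calibrated confidence.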