# Testing Model
### Logistic Regression
- We will use Logistic Regression to decide which set of features to carry forward. We prepared two iterations of our text features, each filtering more aggressively than the last. We will fit one Logistic Regression model per iteration and compare their performance. Whichever iteration performs best is the one we will use for the more complex models that follow.

In [25]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score 
from sklearn.linear_model import LogisticRegression
from nltk.corpus import stopwords

In [34]:
processed = pd.read_csv('../data/preprocessed.csv')
processed.shape

(1980, 9)

- We need to redefine the custom stop words we explored in preprocessing so they can be offered as one of the stop-word options when we tune the models.

In [35]:
custom_stop = stopwords.words('english')
new_words = ['ve', 'don', 'amp', 'x200b', 'just', 'didn', 'house', 'time', 'said', 'isn']
custom_stop.extend(new_words)

In [36]:
processed.isnull().sum()

title               0
text                2
subreddit           0
name                0
text_length         0
text_word_length    0
title_text          1
reg_text            1
spacy_text          1
dtype: int64

In [33]:
processed.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1977 entries, 0 to 1979
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   title             1977 non-null   object 
 1   text              1977 non-null   object 
 2   subreddit         1977 non-null   float64
 3   name              1977 non-null   object 
 4   text_length       1977 non-null   int64  
 5   text_word_length  1977 non-null   int64  
 6   title_text        1977 non-null   object 
 7   reg_text          1977 non-null   object 
 8   spacy_text        1977 non-null   object 
dtypes: float64(1), int64(2), object(6)
memory usage: 154.5+ KB


- Only three of our 1,980 rows contain missing values, so we will drop those three rows.
- Three rows represent about 0.15% of our data, which is small enough to justify dropping them.
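
The check behind this decision can be sketched as follows. This is a minimal illustration on a toy DataFrame with made-up values (the `toy` frame and its contents are assumptions, not the real dataset); only the 3-of-1,980 figure comes from the notebook itself.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the preprocessed data (hypothetical values)
toy = pd.DataFrame({
    "text": ["a story", np.nan, "another story", "more text"],
    "spacy_text": ["story", "title", np.nan, "text"],
})

# Count rows that have at least one missing value
rows_with_na = toy.isnull().any(axis=1).sum()
share = rows_with_na / len(toy)

# Drop those rows without modifying the original frame
cleaned = toy.dropna(axis=0)
print(rows_with_na, share, cleaned.shape)

# The same arithmetic with the numbers reported in this notebook
source_share_pct = round(3 / 1980 * 100, 2)
print(source_share_pct)  # 0.15
```

Counting rows with `any(axis=1)` rather than summing the per-column nulls avoids double-counting a row that is missing values in more than one column.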

In [29]:
processed.shape

(1980, 9)

In [30]:
processed.dropna(axis=0, inplace=True)

In [31]:
processed.shape

(1977, 9)

In [8]:
processed.head()

| | title | text | subreddit | name | text_length | text_word_length | title_text | reg_text | spacy_text |
|---|---|---|---|---|---|---|---|---|---|
| 0 | My Mother's Russian Neighbour | So this story isn't mine but my mother's from ... | 0.0 | t3_15ga20a | 647 | 117 | My Mother's Russian NeighbourSo this story isn... | my mother s russian neighbourso this story isn... | Mother Russian NeighbourSo story mine mother e... |
| 1 | Im a weird virgin... | Ok so I (17f) am still a virgin. I never had a... | 0.0 | t3_15fsclg | 935 | 195 | Im a weird virgin...Ok so I (17f) am still a v... | im a weird virgin ok so i 17f am still a virgi... | weird virgin virgin have real boyfriend more t... |
| 3 | FUNNY STORY | I WENT TO A FANCY RESTAURANT AFTER CHURCH WITH... | 0.0 | t3_15er04c | 193 | 37 | FUNNY STORYI WENT TO A FANCY RESTAURANT AFTER ... | funny storyi went to a fancy restaurant after ... | STORYI go fancy RESTAURANT church MY CHURCH CL... |
| 4 | Arcade poo | As crazy as this sounds, this is a real story!... | 0.0 | t3_15c4bg1 | 1600 | 334 | Arcade pooAs crazy as this sounds, this is a r... | arcade pooas crazy as this sounds this is a re... | Arcade pooa crazy sound real story go arcade f... |
| 5 | My grandpas filter (or lack of) | I (16m) live with my grandparents for most of ... | 0.0 | t3_15c53hk | 1650 | 314 | My grandpas filter (or lack of)I (16m) live wi... | my grandpas filter or lack of i 16m live with ... | grandpas filter lack live grandparent most sum... |


## Creating Two Different Logistic Regression Models
- X_1 holds the features from our RegExp iteration and will feed our first model.
- X_2 holds the features from our Spacy iteration and will feed our second model.
- Our null baseline accuracy is only 52.6%, so we want our Logistic Regression models to score at least as well as that baseline.
- I will use the performance of my best Logistic Regression model as a benchmark for judging how well my other models perform.
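
As a hedged illustration of what a null baseline means, the sketch below uses scikit-learn's `DummyClassifier` on hypothetical labels (10 posts in class 1, 9 in class 0). This is an alternative to the `value_counts(normalize=True)` check used in this notebook, not the method used here, and `X_toy` and `y_toy` are made-up stand-ins.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical labels: 10 posts from class 1, 9 from class 0
y_toy = np.array([1] * 10 + [0] * 9)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by the dummy model

# Always predict the most frequent class, then measure accuracy
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
baseline_acc = baseline.score(X_toy, y_toy)
print(round(baseline_acc, 3))  # 0.526
```

Any real model has to beat this majority-class accuracy to show that it has learned something from the text.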

In [9]:
X_1 = processed['reg_text']
y_1 = processed['subreddit']

X_2 = processed['spacy_text']
y_2 = processed['subreddit']

In [24]:
# Null Baseline 
y_2.value_counts(normalize=True)

subreddit
1.0    0.52605
0.0    0.47395
Name: proportion, dtype: float64

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X_1, y_1, random_state=42)

In [11]:
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(X_2, y_2, random_state=42)

- I am tuning only two hyperparameters because I am mainly interested in the impact of my two feature iterations.

In [12]:
pipe = Pipeline([
    ('tri', TfidfVectorizer()),    # turn raw text into TF-IDF features
    ('lr', LogisticRegression())   # classify each post by subreddit
])

In [13]:
pipe_params = {
    'tri__stop_words': [None, 'english', custom_stop],
    'tri__ngram_range': [(1, 1), (2, 2)]
}
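
To see how many candidates this grid produces, the combinations can be counted with scikit-learn's `ParameterGrid`. This is a small sketch: the two-word list is a short placeholder standing in for `custom_stop`, and the fold count of 5 is taken from the `cv = 5` setting used for the grid searches in this notebook.

```python
from sklearn.model_selection import ParameterGrid

grid = {
    # Placeholder list stands in for the full custom_stop list
    "tri__stop_words": [None, "english", ["ve", "don"]],
    "tri__ngram_range": [(1, 1), (2, 2)],
}

# Every combination of the two hyperparameters is one candidate
n_candidates = len(ParameterGrid(grid))
n_fits = n_candidates * 5  # one fit per candidate per CV fold
print(n_candidates, n_fits)  # 6 30
```

Three stop-word options times two n-gram ranges gives 6 candidates, and 5 folds each gives 30 fits in total.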

### RegExp Model 

In [14]:
pipe_gs = GridSearchCV(pipe, param_grid=pipe_params, cv = 5, verbose=2)

In [15]:
pipe_gs.fit(X_train, y_train)

Fitting 5 folds for each of 6 candidates, totalling 30 fits
[CV] END ......tri__ngram_range=(1, 1), tri__stop_words=None; total time=   0.2s
[CV] END ......tri__ngram_range=(1, 1), tri__stop_words=None; total time=   0.2s
[CV] END ......tri__ngram_range=(1, 1), tri__stop_words=None; total time=   0.3s
[CV] END ......tri__ngram_range=(1, 1), tri__stop_words=None; total time=   0.2s
[CV] END ......tri__ngram_range=(1, 1), tri__stop_words=None; total time=   0.2s
[CV] END ...tri__ngram_range=(1, 1), tri__stop_words=english; total time=   0.2s
[CV] END ...tri__ngram_range=(1, 1), tri__stop_words=english; total time=   0.2s
[CV] END ...tri__ngram_range=(1, 1), tri__stop_words=english; total time=   0.2s
[CV] END ...tri__ngram_range=(1, 1), tri__stop_words=english; total time=   0.3s
[CV] END ...tri__ngram_range=(1, 1), tri__stop_words=english; total time=   0.2s
[CV] END tri__ngram_range=(1, 1), tri__stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "

In [16]:
pipe_gs.best_params_

{'tri__ngram_range': (1, 1),
 'tri__stop_words': ['i',
  'me',
  'my',
  'myself',
  'we',
  'our',
  'ours',
  'ourselves',
  'you',
  "you're",
  "you've",
  "you'll",
  "you'd",
  'your',
  'yours',
  'yourself',
  'yourselves',
  'he',
  'him',
  'his',
  'himself',
  'she',
  "she's",
  'her',
  'hers',
  'herself',
  'it',
  "it's",
  'its',
  'itself',
  'they',
  'them',
  'their',
  'theirs',
  'themselves',
  'what',
  'which',
  'who',
  'whom',
  'this',
  'that',
  "that'll",
  'these',
  'those',
  'am',
  'is',
  'are',
  'was',
  'were',
  'be',
  'been',
  'being',
  'have',
  'has',
  'had',
  'having',
  'do',
  'does',
  'did',
  'doing',
  'a',
  'an',
  'the',
  'and',
  'but',
  'if',
  'or',
  'because',
  'as',
  'until',
  'while',
  'of',
  'at',
  'by',
  'for',
  'with',
  'about',
  'against',
  'between',
  'into',
  'through',
  'during',
  'before',
  'after',
  'above',
  'below',
  'to',
  'from',
  'up',
  'down',
  'in',
  'out',
  'on',
  'off',
  

In [17]:
# check cross val score 
pipe_gs.best_score_

0.9197151697151696

In [18]:
pipe_gs.score(X_train, y_train), pipe_gs.score(X_test, y_test)

(0.975033738191633, 0.9232323232323232)

- Although the accuracy of our RegExp model is high, the model is somewhat overfit: test accuracy (92.3%) is about 5 percentage points below training accuracy (97.5%).
- For our RegExp model, using our custom stop words was the best option. In our Spacy model below (which scored slightly better), using no stop words was best.

### Spacy Model 

In [19]:
pipe_gs_2 = GridSearchCV(pipe, param_grid=pipe_params, cv = 5, verbose=2)

In [20]:
pipe_gs_2.fit(X_train_2, y_train_2)

Fitting 5 folds for each of 6 candidates, totalling 30 fits
[CV] END ......tri__ngram_range=(1, 1), tri__stop_words=None; total time=   0.1s
[CV] END ......tri__ngram_range=(1, 1), tri__stop_words=None; total time=   0.2s
[CV] END ......tri__ngram_range=(1, 1), tri__stop_words=None; total time=   0.1s
[CV] END ......tri__ngram_range=(1, 1), tri__stop_words=None; total time=   0.1s
[CV] END ......tri__ngram_range=(1, 1), tri__stop_words=None; total time=   0.3s
[CV] END ...tri__ngram_range=(1, 1), tri__stop_words=english; total time=   0.1s
[CV] END ...tri__ngram_range=(1, 1), tri__stop_words=english; total time=   0.1s
[CV] END ...tri__ngram_range=(1, 1), tri__stop_words=english; total time=   0.1s
[CV] END ...tri__ngram_range=(1, 1), tri__stop_words=english; total time=   0.2s
[CV] END ...tri__ngram_range=(1, 1), tri__stop_words=english; total time=   0.1s
[CV] END tri__ngram_range=(1, 1), tri__stop_words=['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "

In [21]:
pipe_gs_2.best_params_

{'tri__ngram_range': (1, 1), 'tri__stop_words': None}

In [22]:
# check cross val score 
pipe_gs_2.best_score_

0.9217353717353717

In [23]:
pipe_gs_2.score(X_train_2, y_train_2), pipe_gs_2.score(X_test_2, y_test_2)

(0.9689608636977058, 0.9252525252525252)

- It may seem unusual for a model to perform best without filtering stop words, but for this iteration Spacy had already removed nearly all stop words, making this hyperparameter largely redundant.
- Our Spacy model performs slightly better than our RegExp model. The cross-validation scores are very close (0.922 vs. 0.920), but the train and test scores show that the Spacy model is less overfit: its train-test gap is about 4.4 percentage points, compared with about 5.2 for the RegExp model.
- We will use our Spacy iteration to run two other models and look for our best accuracy.
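
The overfitting comparison above comes down to simple arithmetic on the scores reported in this notebook. A minimal sketch (the `scores` and `gaps` names are introduced here for illustration only):

```python
# (train accuracy, test accuracy) as reported by the two grid searches
scores = {
    "RegExp": (0.975033738191633, 0.9232323232323232),
    "Spacy": (0.9689608636977058, 0.9252525252525252),
}

# A larger train-test gap suggests more overfitting
gaps = {name: round(train - test, 3) for name, (train, test) in scores.items()}
print(gaps)  # {'RegExp': 0.052, 'Spacy': 0.044}
```

Both gaps are modest, and the difference between them is under one percentage point, so this comparison supports choosing the Spacy iteration but does not make it a decisive win.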