# Optimal Production Model

In [1]:
import pandas as pd
import numpy as np 
import warnings

from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier 
from sklearn.linear_model import  LogisticRegression
from sklearn.pipeline import Pipeline

from functions import gs_eval

warnings.simplefilter(action='ignore', category=FutureWarning); # stop warnings being printed

In [2]:
df = pd.read_csv('./data/stoicism_buddhism_clean.csv', lineterminator='\n')

df.head()

Unnamed: 0,title,selftext,subreddit,created_utc,name,upvote ratio,num_upvotes,combined_text,is_stoicism,word_count,contains_https,contains_emoji
0,Looking for Seneca's quote on why even bed fle...,I think it was Seneca who wrote something alon...,Stoicism,1705696000.0,t3_19aswwj,0.67,1,Looking for Senecas quote on why even bed flea...,1,64,False,False
1,READ BEFORE POSTING: r/Stoicism beginner's gui...,"Welcome to the r/Stoicism subreddit, a forum f...",Stoicism,1705694000.0,t3_19as7c7,0.76,2,READ BEFORE POSTING rStoicism beginners guide ...,1,208,True,False
2,The New Agora: Daily WWYD and light discussion...,"Welcome to the New Agora, a place for you and ...",Stoicism,1705694000.0,t3_19as6qt,0.76,2,The New Agora Daily WWYD and light discussion ...,1,237,True,False
3,My biggest life mistake was wanting to live an...,"2023 summons this the best, I didn’t want to e...",Stoicism,1705691000.0,t3_19aqv6w,0.94,27,My biggest life mistake was wanting to live an...,1,380,False,False
4,What’s your favorite way to practice gratitude...,You can mention some relevant quotes as well.,Stoicism,1705691000.0,t3_19aqp1z,1.0,3,Whats your favorite way to practice gratitude ...,1,18,False,False


# LogisticRegression Modeling

### Initial Model - Logistic Regression

In [3]:
y = df['is_stoicism']
X = df['combined_text']

y.value_counts(normalize=True) # find baseline 
# baseline accuracy is ~52%: always predicting the majority class (0, r/Buddhism); ~48% of posts are in r/Stoicism

is_stoicism
0    0.517519
1    0.482481
Name: proportion, dtype: float64
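
The baseline above is the accuracy of always predicting the most common class. A minimal sketch of that calculation, using a hypothetical toy label list rather than the real `y`:

```python
from collections import Counter


def majority_baseline(labels):
    """Return (majority_label, accuracy) for always predicting the most common label."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)


# Hypothetical toy labels: 1 = r/Stoicism, 0 = r/Buddhism
toy_labels = [0, 0, 0, 1, 1, 0, 1, 0, 1, 0]
baseline_label, baseline_acc = majority_baseline(toy_labels)
print(baseline_label, baseline_acc)
```

Any model worth keeping should beat this number; on the real data the majority class is 0 with a share of about 0.52.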

In [4]:
# Run a first simple LogisticRegression model with minimal manipulation, CountVectorizer() to process text
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.2,
                                                    stratify=y,
                                                    random_state=42)

In [5]:
cvec = CountVectorizer()
logreg = LogisticRegression(max_iter=10000, random_state=42)
X_train_cvec = cvec.fit_transform(X_train)
X_test_cvec = cvec.transform(X_test)
print(f"Train cv: {cross_val_score(logreg, X_train_cvec, y_train, cv = 5).mean()}")


Train cv: 0.9043065396395186


**Preliminary logistic regression score:**<br/>
Train cv: 0.9043065396395186 <br/>

This score serves as a reference point for comparing other models and for judging whether hyperparameter tuning helps (or not).
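
One caveat on the preliminary score: the vectorizer was fitted on all of `X_train` before cross-validation, so each validation fold contributes to the vocabulary. The effect is probably small for a bag-of-words model, but wrapping both steps in a pipeline avoids it. A minimal sketch, assuming a hypothetical toy corpus (`toy_texts`, `toy_labels`) in place of the real data:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus standing in for X_train / y_train
toy_texts = [
    "seneca wrote about the shortness of life",
    "marcus aurelius and the meditations",
    "epictetus on what is in our control",
    "a stoic view of anger and loss",
    "the buddha taught the middle way",
    "monks chanting a sutra at the temple",
    "karma and rebirth in samsara",
    "tibetan teachers on impermanence",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The vectorizer is refitted inside every fold, on that fold's training part only
pipe = make_pipeline(
    CountVectorizer(),
    LogisticRegression(max_iter=10000, random_state=42),
)
scores = cross_val_score(pipe, toy_texts, toy_labels, cv=2)
print(scores.mean())
```

The grid searches later in this notebook already follow this pattern, since they pass a `Pipeline` to `GridSearchCV`.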

In [6]:
# check coefficients 
logreg_coefficients = logreg.fit(X_train_cvec, y_train).coef_
feature_coefficients = pd.DataFrame({'Feature': cvec.get_feature_names_out(), 'Coefficient': np.exp(logreg_coefficients[0])})
feature_coefficients.describe()

Unnamed: 0,Coefficient
count,24562.0
mean,1.010994
std,0.656653
min,0.043217
25%,0.995066
50%,0.999998
75%,1.001509
max,82.918133


**Coefficient Analysis:** The coefficients are exponentiated, so they read as odds ratios. A few words in particular are doing a lot for the fit of our model: the maximum odds ratio is about 83, while the mean and median are about 1. View the top coefficients below. 
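
To make the odds ratios concrete, the sketch below applies one to a starting probability. It is illustrative only: it uses the class share of r/Stoicism as the starting point, assumes the word occurs once, and ignores the intercept and every other word in the post. The function name `updated_probability` is hypothetical.

```python
import math


def updated_probability(prior_prob, odds_ratio):
    """Multiply the prior odds by an odds ratio and convert back to a probability."""
    prior_odds = prior_prob / (1 - prior_prob)
    new_odds = prior_odds * odds_ratio
    return new_odds / (1 + new_odds)


prior = 0.482481            # share of r/Stoicism posts in the data
max_odds_ratio = 82.918133  # largest exponentiated coefficient
min_odds_ratio = 0.043217   # smallest exponentiated coefficient

p_with_max = updated_probability(prior, max_odds_ratio)
p_with_min = updated_probability(prior, min_odds_ratio)
log_odds_max = math.log(max_odds_ratio)  # the raw coefficient before np.exp

print(round(p_with_max, 3), round(p_with_min, 3), round(log_odds_max, 3))
```

An odds ratio above 1 pushes the probability towards r/Stoicism, one below 1 pushes it towards r/Buddhism, and a value near 1 (most of the vocabulary) changes almost nothing.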

In [7]:
# list largest odds coefficients
feature_coefficients.sort_values('Coefficient', ascending=False).head(50)


Unnamed: 0,Feature,Coefficient
20659,stoicism,82.918133
20649,stoic,55.413986
7507,epictetus,14.324322
20679,stoics,13.349337
19306,seneca,11.661075
13182,marcus,8.949635
13414,meditations,6.202072
2109,aurelius,5.998471
17351,quote,3.696446
19309,senecas,3.604657


Many Stoicism-related words appear among the largest coefficients: stoicism, stoic, seneca, epictetus, stoics, marcus, aurelius, momento, mori. This is to be expected: all else being equal, the presence of these words in a Reddit submission increases the odds of the text being in r/Stoicism the most. 

In [8]:
# list smallest coefficients
feature_coefficients.sort_values('Coefficient', ascending=False).tail(50)

Unnamed: 0,Feature,Coefficient
21428,tara,0.51013
21944,tibetan,0.50669
14593,nirvana,0.506174
10926,impermanent,0.505722
22356,tree,0.503715
18826,samsara,0.499726
17647,recently,0.49496
21204,sutra,0.491212
13946,monks,0.490395
22465,try,0.482939


Conversely, we see many Buddhism-related words with the lowest coefficients in the model. These are odds ratios below 1, so the presence of these words in a Reddit submission, all else being held equal, lowers the odds of the post being in the Stoicism subreddit.

In [9]:
# list top occurring words overall
X_train_df = pd.DataFrame(X_train_cvec.todense(), columns=cvec.get_feature_names_out())
X_train_df.sum().sort_values(ascending=False).head(50)

the       21576
to        19759
and       17477
of        13577
is         9670
in         9427
that       8700
it         7305
you        7000
my         6332
this       5859
for        5409
but        4535
not        4478
have       4393
be         4358
with       4275
me         4050
as         3999
are        3975
or         3878
on         3809
what       3311
if         3227
was        3015
do         2989
so         2758
about      2753
we         2647
how        2639
can        2570
im         2547
from       2461
like       2457
all        2369
your       2343
they       2147
just       2108
life       2059
by         1954
at         1880
its        1868
one        1859
people     1849
dont       1838
am         1804
would      1786
he         1779
when       1758
will       1710
dtype: int64

**Findings from initial model:** 
- Our initial model had quite a high accuracy, with a cross-validation score of 0.90. 
- The results of our coefficient analysis suggest that stemming might: 1. help the model generalise better if we encounter overfitting; and 2. improve model performance and reduce computational cost. 
- Even in the top/bottom 25 coefficients, many of the words can be grouped together by their stems, e.g. buddhist, buddha, buddhism, buddhists, buddhas.
- Try removing stop words. Given that so many of the top-occurring words are very common, non-distinctive words, stop-word removal will be explored. I will make a list of custom stop words covering typical high-frequency words (pronouns, prepositions, conjunctions) and compare it with None and the standard 'english' list to see how each affects model performance. 
- It will be interesting, later in the project, to try removing philosophers' names and other highly specific terminology to see how it affects model performance. 

### Stemming with Logistic Regression

In [10]:
# source: https://stackoverflow.com/questions/36182502/add-stemming-support-to-countvectorizer-sklearn

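def stemmed_words(doc):
    """Tokenise a document and return the Snowball stem of each token."""
    stemmer = SnowballStemmer("english")
    # return a list (not a generator) so the tokens have a length and can be reused, e.g. for n-grams
    return [stemmer.stem(word) for word in word_tokenize(doc)]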

cvec_stem = CountVectorizer(tokenizer=stemmed_words, max_features=5000)
logreg = LogisticRegression(max_iter=10000, random_state=42)
X_train_stem = cvec_stem.fit_transform(X_train)

print(f"Train cv: {cross_val_score(logreg, X_train_stem, y_train, cv = 5).mean()}")



Train cv: 0.9052939011781748


In [11]:
# view most common stemmed words
pd.DataFrame(X_train_stem.todense(), columns=cvec_stem.get_feature_names_out()).sum().sort_values(ascending=False).head(10)


the     21576
to      19759
i       19555
and     17478
of      13577
a       13312
is       9671
in       9429
it       9173
that     9068
dtype: int64

In [12]:
# check largest coefficients (raw log-odds here, not exponentiated as in the initial model)
logreg_coefficients_stem = logreg.fit(X_train_stem, y_train).coef_
feature_coefficients_stem = pd.DataFrame({'Feature': cvec_stem.get_feature_names_out(), 'Coefficient': logreg_coefficients_stem[0]})
feature_coefficients_stem.sort_values('Coefficient', ascending=False).head(10)

Unnamed: 0,Feature,Coefficient
4174,stoicism,4.600606
4172,stoic,4.318721
1526,epictetus,2.741697
3869,seneca,2.695899
2712,marcus,2.360427
407,aurelius,1.934431
499,belong,1.479674
2697,man,1.293385
992,control,1.28166
3484,quot,1.20271


In [13]:
# check smallest coefficients 
feature_coefficients_stem.sort_values('Coefficient', ascending=False).tail(10)

Unnamed: 0,Feature,Coefficient
227,altar,-1.278735
904,compass,-1.3159
199,ajahn,-1.340418
2846,monasteri,-1.340427
4378,templ,-1.591266
2524,lama,-1.62289
2477,karma,-2.096996
653,buddhism,-3.22517
654,buddhist,-3.252513
647,buddha,-3.287818


**Stemming with Logistic Regression Summary:**
- Slight improvement in the cross-validation score (0.9043 to 0.9053), suggesting stemming might be helpful, particularly if overfitting is encountered. This run also caps the vocabulary at max_features=5000, so the change is not due to stemming alone. <br/>
- However, stemming has not worked exactly as intended: many of the words with the largest and smallest coefficients remain unchanged and therefore have not been grouped, e.g. buddhism, buddhist, buddha. <br/>
- The model's coefficients for 'stoicism' and 'stoic' have increased. The table above shows raw log-odds; exponentiated, they are roughly 99.5 and 75.1, up from 82.9 and 55.4 in the initial model. <br/>
- Both the PorterStemmer and SnowballStemmer were tested. <br/>
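
The grouping behaviour described in the summary can be checked directly by stemming a handful of related words and collecting them by stem. This is a small sketch; the word list is chosen for illustration and the printed groups depend on the installed NLTK version.

```python
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

# Illustrative word list, not taken from the fitted vocabulary
words = ["stoic", "stoics", "stoicism",
         "buddha", "buddhas", "buddhist", "buddhists", "buddhism"]

# Map each word to its stem, then collect the words that share a stem
stems = {word: stemmer.stem(word) for word in words}
groups = {}
for word, stem in stems.items():
    groups.setdefault(stem, []).append(word)

for stem, members in groups.items():
    print(stem, members)
```

Plural forms are folded into their singular, but a suffix stemmer does not necessarily merge words such as 'buddha' and 'buddhism'. If that grouping matters, a lemmatizer or a hand-written mapping would be an alternative to explore.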

### Stop Word Comparison with Logistic Regression: 

In [14]:
# Custom stop words, built by iteratively viewing the most frequent words. 
# A list of common words that will not contribute to explaining or interpreting the results 

custom_sw = ['about', 'all', 'also', 'am', 'an', 'and', 'any', 'are', 'as', 'at', 'be',  'but', 'by', 'can', 'do', 
             'don', 'even', 'for', 'from', 'get', 'had', 'has', 'have', 'he', 'her', 'him', 'his', 'how', 'if', 
             'in', 'is', 'it', 'just', 'like', 'me', 'much', 'https', 'my', 'not', 'now', 'of', 'on', 'one', 'or', 'people', 
             'she', 'so', 'some', 'that', 'the', 'they', 'this', 'to', 've', 'was', 'we', 'what', 
             'when', 'which', 'who', 'will', 'with', 'would', 'you', 'your',  'their', 'other', 
             'something', 'want', 'only', 'then', 'really', 'day', 'own']

In [15]:
# test stop words 

pipe_sw = Pipeline([
    ('cvec', CountVectorizer()),
    ('logreg', LogisticRegression(max_iter=100000, random_state=42))
])

params_sw = {'cvec__stop_words': [None, 'english', custom_sw]
          }

lgrg_gs = GridSearchCV(pipe_sw,
                  param_grid = params_sw, 
                  cv = 5,
                  n_jobs=-1)

lgrg_sw = lgrg_gs.fit(X_train, y_train)

In [16]:
# use pre-defined function for evaluating models after a grid search - see functions.py

gs_eval(X_train, y_train, X_test, y_test, lgrg_sw)

Best parameters: {'cvec__stop_words': 'english'}
Best score: 0.9088445864510867
Train score:  0.9903314917127072
Test score:  0.9100946372239748


(Pipeline(steps=[('cvec', CountVectorizer(stop_words='english')),
                 ('logreg',
                  LogisticRegression(max_iter=100000, random_state=42))]),
 array([0, 0, 0, ..., 0, 0, 0]))

**Summary of stop-word testing with Logistic Regression:**
- Of the three options compared (None, 'english' and the custom list), 'english' gave the best cross-validated score. 
- The model is highly overfit: train accuracy is 0.99 against a test accuracy of 0.91. 
- The hyperparameter tuning below aims to reduce both bias (raise the test score) and variance (narrow the train/test gap). 
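
In this notebook "variance" is judged by the gap between train and test accuracy. A minimal sketch of that calculation, using the scores printed by `gs_eval` for the stop-word grid search above (the helper name `generalisation_gap` is hypothetical):

```python
def generalisation_gap(train_score, test_score):
    """Difference between train and test accuracy; larger means more overfitting."""
    return train_score - test_score


# Scores reported above for the 'english' stop-word model
scores = {
    "train": 0.9903314917127072,
    "cv": 0.9088445864510867,
    "test": 0.9100946372239748,
}

gap = generalisation_gap(scores["train"], scores["test"])
print(f"train/test gap: {gap:.4f}")
```

A gap of about 0.08 on a model that scores 0.99 on its own training data is what motivates the regularisation and vocabulary limits tried next.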

### Hyperparameter Tuning - Logistic Regression with Count Vectorizer: 

In [17]:
pipe = Pipeline([
    ('cvec', CountVectorizer(stop_words='english')),
    ('logreg', LogisticRegression(max_iter=10000, random_state=42))
])

params = {'cvec__ngram_range': [(1,2)],
          'cvec__max_df' : [0.3, 0.5, 0.7],
          'cvec__min_df' : [2],
          'logreg__C' : [0.3, 0.6, 0.9]
          }

In [18]:
logreg_gs = GridSearchCV(pipe,
                  param_grid = params, 
                  cv = 5,
                  n_jobs=-1)

logreg_gs.fit(X_train, y_train)

gs_eval(X_train, y_train, X_test, y_test,logreg_gs)

Best parameters: {'cvec__max_df': 0.3, 'cvec__min_df': 2, 'cvec__ngram_range': (1, 2), 'logreg__C': 0.3}
Best score: 0.9086475424997712
Train score:  0.9757300710339384
Test score:  0.9100946372239748


(Pipeline(steps=[('cvec',
                  CountVectorizer(max_df=0.3, min_df=2, ngram_range=(1, 2),
                                  stop_words='english')),
                 ('logreg',
                  LogisticRegression(C=0.3, max_iter=10000, random_state=42))]),
 array([0, 0, 0, ..., 0, 0, 0]))

**Results of hyperparameter tuning logreg with cvec v1:** <br/>
Best params: {'cvec__max_df': 0.9, 'cvec__min_df': 2, 'cvec__ngram_range': (1, 2), 'logreg__C': 1, 'logreg__penalty': 'l2'}<br/>
Best score: 0.9171282051282053<br/>
Train score:  0.992<br/>
Test score:  0.911402789171452<br/>

**Results of hyperparameter tuning logreg with cvec v2:**<br/>
Best params: {'cvec__max_df': 0.87, 'cvec__min_df': 2, 'cvec__ngram_range': (1, 2), 'logreg__C': 1}<br/>
Best score: 0.9171282051282053<br/>
Train score:  0.992<br/>
Test score:  0.911402789171452<br/>

**Results of hyperparameter tuning logreg with cvec v3:**<br/>
Best params: {'cvec__max_df': 0.7, 'cvec__min_df': 2, 'cvec__ngram_range': (1, 2), 'logreg__C': 0.95}<br/>
Best score: 0.9173333333333333<br/>
Train score:  0.9911794871794872<br/>
Test score:  0.9105824446267432<br/>

**Insights:**
- Several iterations of hyperparameter tuning were completed, some of which are listed above. They included tuning 'logreg__penalty', stop words, 'logreg__C', 'min_df' and more. 
- An improvement on the previous model's accuracy is observed. <br/>
- Attempted to reduce overfitting with increased regularization (lowering C); however, bias increased without much improvement in variance. <br/>
- The model is still highly overfit: below we will attempt to reduce this overfitting with TF-IDF and perhaps word stemming. <br/>

<br/>
*Note: gridsearch params have been limited from original gridsearches for computational speed purposes. <br/>
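
The computational cost mentioned in the note grows multiplicatively with every value added to the grid. A quick sketch that counts the fits for the `params` grid used above (the helper `count_fits` is hypothetical, not part of scikit-learn):

```python
def count_fits(param_grid, cv):
    """Return (number of candidate settings, number of cross-validation fits)."""
    n_candidates = 1
    for values in param_grid.values():
        n_candidates *= len(values)
    return n_candidates, n_candidates * cv


# Same grid as the CountVectorizer + LogisticRegression search above
params = {
    'cvec__ngram_range': [(1, 2)],
    'cvec__max_df': [0.3, 0.5, 0.7],
    'cvec__min_df': [2],
    'logreg__C': [0.3, 0.6, 0.9],
}

n_candidates, n_fits = count_fits(params, cv=5)
# GridSearchCV also refits the best candidate once on the full training set by default
print(n_candidates, n_fits)
```

Adding a second n-gram range and a second `min_df` value would already quadruple the number of fits, which is why the grids shown here are trimmed.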

### Hyperparameter Tuning - Logistic Regression with TF-IDF:

In [19]:
pipe2 = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', min_df=1)),
    ('logreg', LogisticRegression(max_iter=10000, random_state=42))
])

params2 = {'tfidf__ngram_range': [(1,1)],
          'tfidf__max_df' : [0.5, 0.7, 0.9],
          'tfidf__max_features' : [3500, 4000, 4500],
          'logreg__C' : [0.96, 0.97, 0.99]
          }

In [20]:
logreg_tfidf = GridSearchCV(pipe2,
                  param_grid = params2, 
                  cv = 5,
                  n_jobs=-1)

logreg_tfidf.fit(X_train, y_train)

gs_eval(X_train, y_train, X_test, y_test, logreg_tfidf)

Best parameters: {'logreg__C': 0.99, 'tfidf__max_df': 0.5, 'tfidf__max_features': 4500, 'tfidf__ngram_range': (1, 1)}
Best score: 0.9157493024605182
Train score:  0.9640883977900553
Test score:  0.9250788643533123


(Pipeline(steps=[('tfidf',
                  TfidfVectorizer(max_df=0.5, max_features=4500,
                                  stop_words='english')),
                 ('logreg',
                  LogisticRegression(C=0.99, max_iter=10000, random_state=42))]),
 array([0, 0, 0, ..., 1, 0, 0]))

**Results LogisticRegression with TF-IDF v1:** <br/>
Best parameters: {'logreg__C': 0.97, 'tfidf__max_df': 0.5, 'tfidf__max_features': 4000, 'tfidf__min_df': 1, 'tfidf__ngram_range': (1, 1)}<br/>
Best score: 0.9182958907636125<br/>
Train score:  0.9618989405052975<br/>
Test score:  0.9185667752442996<br/>

**Results LogisticRegression with TF-IDF v2:**<br/>
Best parameters: {'logreg__C': 0.99, 'tfidf__max_df': 0.5, 'tfidf__max_features': 4500, 'tfidf__ngram_range': (1, 1)}<br/>
Best score: 0.9157493024605182<br/>
Train score:  0.9640883977900553<br/>
Test score:  0.9250788643533123<br/>

**Insights:**<br/>
- Reduced variance and lower bias compared to CVEC pre-processing. <br/>
- Highest cross-validation score observed in v1. <br/>
- Performs best with an n-gram range of (1, 1), unlike CVEC, where (1, 2) was selected. <br/>
- Highest test accuracy (lowest bias on the test set) with v2. <br/>
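
For reference, the weighting that separates TF-IDF from raw counts can be sketched in plain Python. This follows scikit-learn's default formula as I understand it (raw term count times `ln((1 + n) / (1 + df)) + 1`, then L2-normalised per document). The whitespace tokenisation and the toy documents are simplifying assumptions.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenised for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + d)) + 1 for term, d in df.items()}
    rows = []
    for tokens in tokenised:
        counts = Counter(tokens)
        weights = {term: count * idf[term] for term, count in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({term: w / norm for term, w in weights.items()})
    return rows


# Hypothetical toy documents
toy_docs = [
    "the stoic accepts the present",
    "the monk accepts the sutra",
    "the present moment",
]
rows = tfidf(toy_docs)
for term, weight in sorted(rows[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:8s} {weight:.3f}")
```

Words that occur in few documents receive a larger idf, and the per-document normalisation removes the effect of post length. Both may contribute to the smaller train/test gap seen with TF-IDF.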

<br/>
*Note: gridsearch parameters inputted above have been limited from original gridsearches for computational speed. <br/>

### Logistic Regression with TF-IDF - Scoring and Stemming: 

In [21]:
pipe2_1 = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', tokenizer=stemmed_words, min_df=1)),
    ('logreg', LogisticRegression(max_iter=10000, random_state=42))
])

params2_1 = {'tfidf__ngram_range': [(1,1)],
          'tfidf__max_df' : [0.3],
          'tfidf__max_features' : [3000],
          'logreg__C' : [0.99]
          }

logreg_tfidf_1 = GridSearchCV(pipe2_1,
                  param_grid = params2_1, 
                  cv = 5,
                  n_jobs=-1, 
                  scoring='balanced_accuracy')

logreg_tfidf_1 = logreg_tfidf_1.fit(X_train, y_train)

gs_eval(X_train, y_train, X_test, y_test, logreg_tfidf_1)



Best parameters: {'logreg__C': 0.99, 'tfidf__max_df': 0.3, 'tfidf__max_features': 3000, 'tfidf__ngram_range': (1, 1)}
Best score: 0.9142485052841911




Train score:  0.9571823204419889
Test score:  0.9274447949526814


(Pipeline(steps=[('tfidf',
                  TfidfVectorizer(max_df=0.3, max_features=3000,
                                  stop_words='english',
                                  tokenizer=<function stemmed_words at 0x142e436a0>)),
                 ('logreg',
                  LogisticRegression(C=0.99, max_iter=10000, random_state=42))]),
 array([0, 0, 0, ..., 1, 0, 0]))

**Results LogisticRegression with TF-IDF v3 - scoring on balanced_accuracy:**<br/>
Best parameters: {'logreg__C': 0.99, 'tfidf__max_df': 0.3, 'tfidf__max_features': 4500, 'tfidf__ngram_range': (1, 1)}<br/>
Best score: 0.9149147950661556<br/>
Train score:  0.9640883977900553<br/>
Test score:  0.9250788643533123<br/>

**Results LogisticRegression with TF-IDF v4 - scoring on balanced_accuracy:**<br/>
Best parameters: {'logreg__C': 0.3, 'tfidf__max_df': 0.3, 'tfidf__max_features': 4500, 'tfidf__ngram_range': (1, 1)}<br/>
Best score: 0.9066780178094748<br/>
Train score:  0.9374506708760852<br/>
Test score:  0.9132492113564669<br/>

**Results LogisticRegression with TF-IDF v5 - with stemming:**<br/>
Best parameters: {'logreg__C': 0.99, 'tfidf__max_df': 0.3, 'tfidf__max_features': 3000, 'tfidf__ngram_range': (1, 1)}<br/>
Best score: 0.9142485052841911<br/>
Train score:  0.9571823204419889<br/>
Test score:  0.9274447949526814<br/>

<br/>
*Note: gridsearch params have been limited from original gridsearches for computational speed purposes. <br/>
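
Balanced accuracy, used as the scoring metric in this section, is the mean of the per-class recalls. A minimal sketch on hypothetical toy predictions, written without scikit-learn so the arithmetic is visible:

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of the recall obtained on each class."""
    recalls = []
    for cls in sorted(set(y_true)):
        indices = [i for i, label in enumerate(y_true) if label == cls]
        correct = sum(1 for i in indices if y_pred[i] == cls)
        recalls.append(correct / len(indices))
    return sum(recalls) / len(recalls)


def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical labels: 4 Stoicism posts (1) and 6 Buddhism posts (0)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

bal_acc = balanced_accuracy(y_true, y_pred)
acc = accuracy(y_true, y_pred)
print(round(bal_acc, 4), acc)
```

Because the two subreddits are split roughly 52/48, balanced accuracy and plain accuracy should stay close on this dataset, which is consistent with the similar scores reported above.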

# Random Forest Classifier 

### Random Forest Classifier with Count Vectoriser 

In [22]:
pipe_rf = Pipeline([
    ('cvec', CountVectorizer(stop_words='english')),
    ('rf', RandomForestClassifier(random_state=42))
])

params_rf = {
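            'cvec__max_df': [0.99, 1.0],  # was [0.99, None]; None is not a valid max_df and caused the failed fits reported below (1.0 disables the cut-off)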
            'cvec__min_df': [12, 15],
            'rf__max_depth' : [1, 5, 10],
            'rf__min_samples_leaf' : [1, 2],
            'rf__min_samples_split' : [10, 12]
            }

rf_gs = GridSearchCV(pipe_rf,
                  param_grid = params_rf, 
                  cv = 5,
                  n_jobs=-1)
rf_gs.fit(X_train, y_train)

120 fits failed out of a total of 240.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
120 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/AaranDaniel/anaconda3/lib/python3.11/site-packages/sklearn/model_selection/_validation.py", line 732, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/AaranDaniel/anaconda3/lib/python3.11/site-packages/sklearn/base.py", line 1151, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/AaranDaniel/anaconda3/lib/python3.11/site-packages/sklearn/pipeline.py", line 416, in fit
    Xt = self._fit(X, y, **fit_params_steps)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In [23]:
gs_eval(X_train, y_train, X_test, y_test, rf_gs)

Best parameters: {'cvec__max_df': 0.99, 'cvec__min_df': 12, 'rf__max_depth': 10, 'rf__min_samples_leaf': 2, 'rf__min_samples_split': 10}
Best score: 0.8853628665611352
Train score:  0.8942383583267561
Test score:  0.8730283911671924


(Pipeline(steps=[('cvec',
                  CountVectorizer(max_df=0.99, min_df=12, stop_words='english')),
                 ('rf',
                  RandomForestClassifier(max_depth=10, min_samples_leaf=2,
                                         min_samples_split=10,
                                         random_state=42))]),
 array([0, 0, 0, ..., 0, 0, 0]))

In [24]:
pipe_rf2 = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', ngram_range=(1,1))),
    ('rf', RandomForestClassifier(random_state=42))
])

params_rf2 = {
            'tfidf__max_df' : [0.3, 0.7],
            'tfidf__max_features' : [1000, 2000],
            'rf__n_estimators' : [100, 200],
            'rf__max_depth' : [10, 20], 
            'rf__min_samples_leaf' : [2, 5],
            'rf__min_samples_split' : [5, 10]
            }

rf_gs2 = GridSearchCV(pipe_rf2,
                  param_grid = params_rf2, 
                  cv = 5,
                  n_jobs=-1)

rf_gs2.fit(X_train, y_train)

In [25]:
gs_eval(X_train, y_train, X_test, y_test, rf_gs2)

Best parameters: {'rf__max_depth': 20, 'rf__min_samples_leaf': 2, 'rf__min_samples_split': 10, 'rf__n_estimators': 200, 'tfidf__max_df': 0.3, 'tfidf__max_features': 1000}
Best score: 0.900359429974435
Train score:  0.9287687450670876
Test score:  0.8919558359621451


(Pipeline(steps=[('tfidf',
                  TfidfVectorizer(max_df=0.3, max_features=1000,
                                  stop_words='english')),
                 ('rf',
                  RandomForestClassifier(max_depth=20, min_samples_leaf=2,
                                         min_samples_split=10, n_estimators=200,
                                         random_state=42))]),
 array([0, 0, 0, ..., 0, 0, 0]))

**Results RandomForests with CVEC v1:** <br/>
Best params: {'cvec__max_df': 0.99, 'cvec__min_df': 12, 'rf__max_depth': 12, 'rf__min_samples_leaf': 2, 'rf__min_samples_split': 12}<br/>
Best score: 0.8879357486749253<br/>
Train score:  0.9013854930725347<br/>
Test score:  0.9013854930725347<br/>

**Results RandomForests with TF-IDF v1:**<br/>
Best parameters: {'rf__min_samples_split': 15, 'tfidf__max_df': 0.7, 'tfidf__max_features': 4500, 'tfidf__ngram_range': (1, 1)}<br/>
Best score: 0.913608251275248<br/>
Train score:  0.9969437652811736<br/>
Test score:  0.9047231270358306<br/>

**Results RandomForests with TF-IDF v2:**<br/>
Best parameters: {'rf__min_samples_split': 15, 'tfidf__max_df': 0.7, 'tfidf__max_features': 4500, 'tfidf__ngram_range': (1, 1)}<br/>
Best score: 0.913608251275248<br/>
Train score:  0.9969437652811736<br/>
Test score:  0.9047231270358306<br/>

**Results RandomForests with TF-IDF v3:**<br/>
Best parameters: {'rf__max_depth': 10, 'rf__min_samples_leaf': 2, 'rf__min_samples_split': 15, 'rf__n_estimators': 200, 'tfidf__max_df': 0.7, 'tfidf__max_features': 1000}<br/>
Best score: 0.8905840293478328<br/>
Train score:  0.9019967400162999<br/>
Test score:  0.8713355048859935<br/>

**Results RandomForests with TF-IDF v4:**<br/>
Best parameters: {'rf__max_depth': 20, 'rf__min_samples_leaf': 2, 'rf__min_samples_split': 10, 'rf__n_estimators': 100, 'tfidf__max_df': 0.3, 'tfidf__max_features': 1000}<br/>
Best score: 0.8993449885917981<br/>
Train score:  0.928280358598207<br/>
Test score:  0.8908794788273615<br/>

**Insights:**<br/>
- All models performed worse with stop_words=None than with stop_words='english'.<br/>
- Several rounds of hyperparameter tuning, including over the n-gram range, were run to reduce variance without a large increase in bias.<br/>
- RandomForests with CVEC performed worse than RandomForests with TF-IDF, both in terms of variance and bias. <br/>
- In attempting to reduce variance with hyperparameter tuning, bias increased disproportionately.<br/>
- None of the Random Forests models so far outperforms the best Logistic Regression model in terms of test set accuracy or cross-validated score. <br/>

<br/>
*Note: gridsearch params have been limited from original gridsearches for computational speed purposes. <br/>

### Random Forests and TF-IDF With Stemming 

In [26]:
pipe_rf3 = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', ngram_range=(1,1), tokenizer=stemmed_words)),
    ('rf', RandomForestClassifier(random_state=42))
])

params_rf3 = {
            'tfidf__max_df' : [0.9],
            'tfidf__max_features' : [2000],
            'rf__max_depth' : [10, 20], 
            'rf__min_samples_leaf' : [2],
            'rf__min_samples_split' : [5]
            }

rf_gs3 = GridSearchCV(pipe_rf3,
                  param_grid = params_rf3, 
                  cv = 5,
                  n_jobs=-1)

rf_gs3.fit(X_train, y_train)



In [27]:
gs_eval(X_train, y_train, X_test, y_test, rf_gs3)

Best parameters: {'rf__max_depth': 20, 'rf__min_samples_leaf': 2, 'rf__min_samples_split': 5, 'tfidf__max_df': 0.9, 'tfidf__max_features': 2000}
Best score: 0.898385875141893




Train score:  0.9352801894238358
Test score:  0.9029968454258676


(Pipeline(steps=[('tfidf',
                  TfidfVectorizer(max_df=0.9, max_features=2000,
                                  stop_words='english',
                                  tokenizer=<function stemmed_words at 0x142e436a0>)),
                 ('rf',
                  RandomForestClassifier(max_depth=20, min_samples_leaf=2,
                                         min_samples_split=5,
                                         random_state=42))]),
 array([0, 0, 0, ..., 0, 0, 0]))

**Results RandomForests with TF-IDF v5 - with stemming:**<br/>
Best parameters: {'rf__max_depth': 20, 'rf__min_samples_leaf': 2, 'rf__min_samples_split': 5, 'tfidf__max_df': 0.7, 'tfidf__max_features': 2000}<br/>
Best score: 0.9009525089029987<br/>
Train score:  0.9356748224151539<br/>
Test score:  0.8998422712933754<br/>

- Reduced variance (a smaller gap between train and test scores) is observed when stemming is used. <br/>

<br/>
*Note: gridsearch params have been limited from original gridsearches for computational speed purposes. <br/>

### Random Forests and TF-IDF - Hyperparameter Tuning with scoring='balanced_accuracy'

In [28]:
pipe_rf4 = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', ngram_range=(1,1))),
    ('rf', RandomForestClassifier(random_state=42))
])

params_rf4 = {
            'tfidf__max_df' : [0.1, 0.3],
            'tfidf__max_features' : [2000],
            'rf__n_estimators' : [600],
            'rf__max_depth' : [30], 
            'rf__min_samples_leaf' : [2],
            'rf__min_samples_split' : [15]
            }

rf_gs4 = GridSearchCV(pipe_rf4,
                  param_grid = params_rf4, 
                  cv = 5,
                  n_jobs=-1, 
                  scoring='balanced_accuracy')

rf_gs4.fit(X_train, y_train)

In [29]:
gs_eval(X_train, y_train, X_test, y_test, rf_gs4)

Best parameters: {'rf__max_depth': 30, 'rf__min_samples_leaf': 2, 'rf__min_samples_split': 15, 'rf__n_estimators': 600, 'tfidf__max_df': 0.3, 'tfidf__max_features': 2000}
Best score: 0.9021214683784816
Train score:  0.9400157853196527
Test score:  0.8990536277602523


(Pipeline(steps=[('tfidf',
                  TfidfVectorizer(max_df=0.3, max_features=2000,
                                  stop_words='english')),
                 ('rf',
                  RandomForestClassifier(max_depth=30, min_samples_leaf=2,
                                         min_samples_split=15, n_estimators=600,
                                         random_state=42))]),
 array([0, 0, 0, ..., 0, 0, 0]))

**Results RandomForests with TF-IDF scoring based on 'balanced accuracy' v1:**<br/>
Best parameters: {'rf__max_depth': 30, 'rf__min_samples_leaf': 2, 'rf__min_samples_split': 15, 'rf__n_estimators': 200, 'tfidf__max_df': 0.3, 'tfidf__max_features': 2000}<br/>
Best score: 0.9020542892378233<br/>
Train score:  0.9417916337805841<br/>
Test score:  0.9006309148264984<br/>
<br/>

**Results RandomForests with TF-IDF scoring based on 'balanced accuracy' v2:**<br/>
Best parameters: {'rf__max_depth': 30, 'rf__min_samples_leaf': 2, 'rf__min_samples_split': 15, 'rf__n_estimators': 600, 'tfidf__max_df': 0.3, 'tfidf__max_features': 2000}<br/>
Best score: 0.9004197514345034<br/>
Train score:  0.9408050513022889<br/>
Test score:  0.9014195583596214<br/>


*Note: gridsearch params have been limited from original gridsearches for computational speed purposes. <br/>

###  Brief Summary

- Both CVEC and TF-IDF pre-processing were tested: TF-IDF produced the lowest-bias and lowest-variance results. <br/>
- The effect of stemming was explored; it did not significantly improve the cross-validated score, though variance decreased slightly (improved generalisation). <br/>
- Using the default 'english' stop-word list reduced bias and variance. <br/>
- The two models focused on were Logistic Regression and Random Forests. <br/>
- The best-performing models, selected on cross-validation score, are listed below. <br/>
- In the following workbook these are evaluated in more detail.<br/>
- Logistic Regression was chosen as the production model over Random Forests because of its better generalisation, lower bias on unseen data and higher cross-validation accuracy, as well as the interpretability of its coefficients, which will be explored further in the following workbook.<br/>
- Of the many Logistic Regression models tested, the TF-IDF v1 model had the highest cross-validation accuracy. <br/>
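
The selection rule described in this summary (highest cross-validation score, with the train/test gap as a check on generalisation) can be written out as a short sketch. The scores are the ones recorded in this notebook for the two finalists; the dictionary layout is an assumption made for illustration.

```python
# Recorded scores for the two finalists (see the results listed in this notebook)
candidates = [
    {"name": "LogisticRegression + TF-IDF v1",
     "cv": 0.9182958907636125, "train": 0.9618989405052975, "test": 0.9185667752442996},
    {"name": "RandomForests + TF-IDF v1",
     "cv": 0.913608251275248, "train": 0.9969437652811736, "test": 0.9047231270358306},
]

# Attach the train/test gap to each candidate as a rough overfitting measure
for model in candidates:
    model["gap"] = model["train"] - model["test"]

# Choose on cross-validation score, which was computed without touching the test set
best = max(candidates, key=lambda model: model["cv"])

for model in candidates:
    print(f"{model['name']:32s} cv={model['cv']:.4f} gap={model['gap']:.4f}")
print("selected:", best["name"])
```

Choosing on the cross-validation score rather than the test score keeps the test set as an unbiased final check.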

**Looking Ahead:**
- In the following workbook the models are evaluated and compared in more detail using balanced accuracy, F1 score, recall and precision, and their errors are analysed. 
- Visualisations of the models are created and compared in the hope of finding systematic differences between the two and improving understanding of the models. 
- Models with a restricted vocabulary are considered, in the hope of revealing differences between the two subreddits and going some way towards answering the problem statement.

### Production Model:
**LogisticRegression model, with TF-IDF v1:** <br/>
Best parameters: {'logreg__C': 0.97, 'tfidf__max_df': 0.5, 'tfidf__max_features': 4000, 'tfidf__min_df': 1, 'tfidf__ngram_range': (1, 1)}<br/>
Best score: 0.9182958907636125<br/>
Train score:  0.9618989405052975<br/>
Test score:  0.9185667752442996<br/>

#### Other:
**Best performing RandomForests model with TF-IDF v1:**<br/>
Best parameters: {'rf__min_samples_split': 15, 'tfidf__max_df': 0.7, 'tfidf__max_features': 4500, 'tfidf__ngram_range': (1, 1)}<br/>
Best score: 0.913608251275248<br/>
Train score:  0.9969437652811736<br/>
Test score:  0.9047231270358306<br/>