<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Modeling" data-toc-modified-id="Modeling-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Modeling</a></span><ul class="toc-item"><li><span><a href="#Using-the-TF-IDF-Vectorizer" data-toc-modified-id="Using-the-TF-IDF-Vectorizer-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Using the TF-IDF Vectorizer</a></span><ul class="toc-item"><li><span><a href="#And-classified-with-Logistic-Regression" data-toc-modified-id="And-classified-with-Logistic-Regression-5.1.1"><span class="toc-item-num">5.1.1&nbsp;&nbsp;</span>And classified with Logistic Regression</a></span></li><li><span><a href="#And-classified-with-Multinomial-Naive-Bayes" data-toc-modified-id="And-classified-with-Multinomial-Naive-Bayes-5.1.2"><span class="toc-item-num">5.1.2&nbsp;&nbsp;</span>And classified with Multinomial Naive Bayes</a></span></li></ul></li><li><span><a href="#With-CountVectorizer" data-toc-modified-id="With-CountVectorizer-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>With CountVectorizer</a></span><ul class="toc-item"><li><span><a href="#With-Logistic-Regression" data-toc-modified-id="With-Logistic-Regression-5.2.1"><span class="toc-item-num">5.2.1&nbsp;&nbsp;</span>With Logistic Regression</a></span></li><li><span><a href="#With-Multinomial-Naive-Bayes" data-toc-modified-id="With-Multinomial-Naive-Bayes-5.2.2"><span class="toc-item-num">5.2.2&nbsp;&nbsp;</span>With Multinomial Naive Bayes</a></span></li></ul></li></ul></li><li><span><a href="#Evaluating-the-best-model" data-toc-modified-id="Evaluating-the-best-model-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Evaluating the best model</a></span><ul class="toc-item"><li><span><a href="#Saving-Features-of-gs3" data-toc-modified-id="Saving-Features-of-gs3-6.1"><span class="toc-item-num">6.1&nbsp;&nbsp;</span>Saving Features of gs3</a></span></li><li><span><a href="#Saving-Features-from-gs2" data-toc-modified-id="Saving-Features-from-gs2-6.2"><span class="toc-item-num">6.2&nbsp;&nbsp;</span>Saving Features from gs2</a></span></li></ul></li></ul></div>

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split, GridSearchCV

# Import CountVectorizer and TFIDFVectorizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline

# Import Classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

In [3]:
data = pd.read_csv('../datasets/adhd_ocd_210313_mod.csv')

In [4]:
# Check for NaNs before beginning
data.isna().sum()

subreddit      0
content_mod    0
dtype: int64

In [6]:
# Calculate baseline to beat
data['subreddit'].value_counts(normalize=True)

1    0.51547
0    0.48453
Name: subreddit, dtype: float64

### Modeling

For create our best model, we will attempt to use two vectorizers available in sklearn:
1. TF-IDF Vectorizer
2. Count Vectorizer

It is expected that TF-IDF Vectorizer models will perform better than Count Vectorizer ones due to its ability to penalize more frequently occuring words, which from our previous section we already know that r/ADHD and r/OCD share many of.

We will also employ two classifiers:
1. Naive Bayes
2. Logistic Regression

For our purposes, a Type I and Type II error are equal in severity (neither are worse than the other). Thus, we will be optimizing our models for the highest possible accuracy while attempting to ensure that it is not overfitted on the training data.

We will find the best parameters for each of our models using GridSearchCV and making changes as needed. As with the last section, r/ADHD and r/OCD posts share a high number of words used. We expect that our models will utilise less than 50% of available features for prediction, and also omit many most frequently seen words.

For the purposes of our project -- since we are concerned about indivudual words, we will be keeping all vectorizers' ngram_ranges at (1, 1) and ignore composite words.

We will be using 25% of our datasets rows as a test set and the remaining as the training set.

Our baseline model to beat as indicated above, is .51 or 51% accuracy from predicting that every post belongs to r/OCD.

In [4]:
# Results will be saved in a table and compared at the end
results = {}

#### Using the TF-IDF Vectorizer

In [5]:
X = data['content_mod']
y = data['subreddit']

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                   test_size=.25,
                                                    random_state = 42,
                                                   stratify=y)

##### And classified with Logistic Regression

In [7]:
pipe = Pipeline([
    ('vector', TfidfVectorizer(stop_words='english')),
    ('lgr', LogisticRegression(solver='lbfgs')),
])

In [8]:
pipe_params = {
    'vector__max_features': [2500, 3000, 3500],
    'vector__min_df': [2, 3, 4],
    'vector__max_df': [.25, .3, .35],
    'lgr__max_iter': [10000],
}

In [9]:
gs = GridSearchCV(pipe, 
                  pipe_params,
                  cv=5,
                  n_jobs=-1,
                 )

In [10]:
gs.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('vector',
                                        TfidfVectorizer(stop_words='english')),
                                       ('lgr', LogisticRegression())]),
             n_jobs=-1,
             param_grid={'lgr__max_iter': [10000],
                         'vector__max_df': [0.25, 0.3, 0.35],
                         'vector__max_features': [2500, 3000, 3500],
                         'vector__min_df': [2, 3, 4]})

In [11]:
gs.best_estimator_

Pipeline(steps=[('vector',
                 TfidfVectorizer(max_df=0.3, max_features=3000, min_df=3,
                                 stop_words='english')),
                ('lgr', LogisticRegression(max_iter=10000))])

In [12]:
results.update({'gs_TF_Lgr':
               {
                   'train_score': gs.best_score_,
                   'test_score': gs.score(X_test, y_test)
               }})

##### And classified with Multinomial Naive Bayes

In [13]:
pipe2 = Pipeline([
    ('vector', TfidfVectorizer(stop_words='english')),
    ('nb', MultinomialNB()),
])

In [14]:
pipe_params2 = {
    'vector__max_features': [3000, 3500, 4000],
    'vector__min_df': [1, 2],
    'vector__max_df': [.3, .35, .4],
    'nb__alpha': [.4, .5, .6]
}

In [15]:
gs2 = GridSearchCV(pipe2, # what object are we optimizing?
                  pipe_params2, # what parameters values are we searching?
                  cv=5,
                n_jobs=-1,
                 ) # 5-fold cross-validation.

In [16]:
gs2.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('vector',
                                        TfidfVectorizer(stop_words='english')),
                                       ('nb', MultinomialNB())]),
             n_jobs=-1,
             param_grid={'nb__alpha': [0.4, 0.5, 0.6],
                         'vector__max_df': [0.3, 0.35, 0.4],
                         'vector__max_features': [3000, 3500, 4000],
                         'vector__min_df': [1, 2]})

In [17]:
gs2.best_estimator_

Pipeline(steps=[('vector',
                 TfidfVectorizer(max_df=0.4, max_features=3000, min_df=2,
                                 stop_words='english')),
                ('nb', MultinomialNB(alpha=0.4))])

In [18]:
results.update({'gs2_TF_NB':
               {
                   'train_score': gs2.best_score_,
                   'test_score': gs2.score(X_test, y_test)
               }})

#### With CountVectorizer

##### With Logistic Regression

In [19]:
pipe3 = Pipeline([
    ('vector', CountVectorizer(stop_words='english')),
    ('lgr', LogisticRegression()),
])

In [20]:
pipe_params3 = {
    'vector__max_features': [2500, 3000, 3500],
    'vector__min_df': [2, 3],
    'vector__max_df': [.3, .35, .4],
    'lgr__max_iter': [10000],
}

In [21]:
gs3 = GridSearchCV(pipe3, 
                  pipe_params3,
                  cv=5,
                   n_jobs=-1,
                 )

In [22]:
gs3.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('vector',
                                        CountVectorizer(stop_words='english')),
                                       ('lgr', LogisticRegression())]),
             n_jobs=-1,
             param_grid={'lgr__max_iter': [10000],
                         'vector__max_df': [0.3, 0.35, 0.4],
                         'vector__max_features': [2500, 3000, 3500],
                         'vector__min_df': [2, 3]})

In [23]:
gs3.best_estimator_

Pipeline(steps=[('vector',
                 CountVectorizer(max_df=0.35, max_features=3500, min_df=2,
                                 stop_words='english')),
                ('lgr', LogisticRegression(max_iter=10000))])

In [24]:
results.update({'gs3_Count_Lgr':
               {
                   'train_score': gs3.best_score_,
                   'test_score': gs3.score(X_test, y_test)
               }})

##### With Multinomial Naive Bayes

In [25]:
pipe4 = Pipeline([
    ('vector', CountVectorizer(stop_words='english')),
    ('nb', MultinomialNB()),
])

In [26]:
pipe_params4 = {
    'vector__max_features': [1000, 1500],
    'vector__min_df': [1, 2],
    'vector__max_df': [.25, .3, .35],
    'nb__alpha': [.3, .4, .5],
}

In [27]:
gs4 = GridSearchCV(pipe4, 
                  pipe_params4,
                  cv=5,
                   n_jobs=-1,
                 )

In [28]:
gs4.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('vector',
                                        CountVectorizer(stop_words='english')),
                                       ('nb', MultinomialNB())]),
             n_jobs=-1,
             param_grid={'nb__alpha': [0.3, 0.4, 0.5],
                         'vector__max_df': [0.25, 0.3, 0.35],
                         'vector__max_features': [1000, 1500],
                         'vector__min_df': [1, 2]})

In [29]:
gs4.best_estimator_

Pipeline(steps=[('vector',
                 CountVectorizer(max_df=0.3, max_features=1000,
                                 stop_words='english')),
                ('nb', MultinomialNB(alpha=0.4))])

In [30]:
results.update({'gs4_Count_NB':
               {
                   'train_score': gs4.best_score_,
                   'test_score': gs4.score(X_test, y_test)
               }})

### Evaluating the best model

In [31]:
results_df = pd.DataFrame(results).T

In [32]:
results_df['difference'] = results_df['train_score'] - results_df['test_score']

In [33]:
results_df.sort_values(by='difference', ascending=True)

Unnamed: 0,train_score,test_score,difference
gs3_Count_Lgr,0.866048,0.869464,-0.003416
gs2_TF_NB,0.899541,0.88345,0.016091
gs4_Count_NB,0.889391,0.864802,0.024589
gs_TF_Lgr,0.897194,0.871795,0.025399


All our models have easily beaten the baseline accuracy score.

As expected, the TF-IDF Vectorizer models have a slightly better accuracy than Count Vectorizer ones.

While the Countvectorizer + Logistic Regression model (gs3) has the smallest difference in scores between the train and test sets, the TF-IDF Vectorizer and Naive Bayes (gs2) model has a better score overall. We will take a closer look at them both.

#### Saving Features of gs3

The importance of words in a Logistic Regression is determined by its coefficient, with positive coefficients indicating that the word tends a given text towards being predicted as being from r/OCD, and one with a negative coefficient, r/ADHD.

In [34]:
best_model = gs3.best_estimator_.named_steps

In [36]:
best_model.lgr.coef_.ravel()

array([-0.03303111, -0.14355633, -0.11943444, ..., -0.07102766,
       -0.13362273, -0.0053784 ])

In [37]:
word_df = pd.DataFrame(zip(best_model.vector.get_feature_names(), best_model.lgr.coef_.ravel()),
                      columns = ['word', 'coefficient'])

At one end, the words with corresponding lowest coefficients tend the post towards being predicted as being from r/ADHD.

In [38]:
 word_df.sort_values(by='coefficient', ascending=True).head(20).T

Unnamed: 0,125,1232,3356,1040,622,1246,1840,2547,1684,812,2858,3113,1144,2514,2282,2695,394,540,3105,3008
word,adderall,focus,vyvanse,energy,concerta,forget,med,sadness,lack,depression,stimulant,tomorrow,fail,ritalin,process,sit,bore,class,today,task
coefficient,-1.23386,-1.08889,-0.992342,-0.897748,-0.87824,-0.814418,-0.788284,-0.759452,-0.757148,-0.748582,-0.718724,-0.694334,-0.692355,-0.68479,-0.679669,-0.674385,-0.673979,-0.672597,-0.668248,-0.661922


And on the other end, the words with corresponding highest coefficients tend the post towards being predicted as being from r/OCD.

In [39]:
word_df.sort_values(by='coefficient', ascending=False).head(20).T

Unnamed: 0,613,2016,2014,3070,2399,3163,2602,2387,1747,2018,3125,1173,2566,1719,2313,1693,945,2054,3059,3496
word,compulsion,obsession,obsess,thought,reassurance,trigger,seek,real,live,obsessive,touch,fear,scar,lexapro,prozac,lately,doubt,order,theme,zoloft
coefficient,1.40044,1.15391,0.993564,0.80983,0.798586,0.773463,0.760096,0.760044,0.759874,0.740649,0.652808,0.637671,0.637622,0.636135,0.624141,0.620851,0.612523,0.610499,0.606297,0.602314


In [40]:
# Save words against cofficients for comparison in the next section
word_df.to_csv('../datasets/words_lem_count_lgr.csv', index=False)

#### Saving Features from gs2

Unlike Logistic Regression, Naive Bayes trains a model by assigning each word feature a probability assuming that it is from r/OCD or r/ADHD.

In [41]:
best_model = gs2.best_estimator_.named_steps

In [42]:
best_model.nb.feature_log_prob_

array([[-8.69043092, -8.79874769, -6.87536537, ..., -7.90663342,
        -7.49730996, -8.1246333 ],
       [-9.36682958, -9.36682958, -7.23623583, ..., -9.36682958,
        -8.40317939, -8.29434558]])

The log probabilities above indicate the probability that we saw each corresponding word given that it was a post from r/OCD and r/ADHD respectively. We will save the top 100 of these words and analyze them in the next section.

In [43]:
# Save the log probabilities sorted from the largest to to smallest for each class.
neg_class_prob_sorted = best_model.nb.feature_log_prob_[0, :].argsort()[::-1]
pos_class_prob_sorted = best_model.nb.feature_log_prob_[1, :].argsort()[::-1]

In [44]:
# Save the top 100 feature words for the r/ADHD class with its corresponding log probabilities.
top_100_adhd_words = pd.DataFrame(
    zip(
    np.take(best_model.vector.get_feature_names(), neg_class_prob_sorted[:100]),
     np.sort(best_model.nb.feature_log_prob_[1, :])[::-1][:100]
    ),
    columns=['word','probability'])

In [45]:
top_100_adhd_words.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
word,work,day,med,medication,help,really,try,start,want,need,...,prescribe,self,stop,leave,guy,dose,minute,tip,motivation,fuck
probability,-4.82616,-5.40174,-5.43788,-5.48572,-5.48892,-5.56176,-5.61237,-5.62541,-5.6843,-5.72841,...,-6.6204,-6.62048,-6.62219,-6.6309,-6.63636,-6.63884,-6.63906,-6.64305,-6.64432,-6.64795


In [46]:
# Do the same for r/OCD.
top_100_ocd_words = pd.DataFrame(
    zip(
    np.take(best_model.vector.get_feature_names(), pos_class_prob_sorted[:100]),
     np.sort(best_model.nb.feature_log_prob_[1, :])[::-1][:100]
    ),
    columns=['word','probability'])

In [47]:
top_100_ocd_words.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
word,thought,intrusive,want,say,bad,really,happen,people,compulsion,try,...,live,die,lose,hate,enjoy,little,kind,convince,ago,watch
probability,-4.82616,-5.40174,-5.43788,-5.48572,-5.48892,-5.56176,-5.61237,-5.62541,-5.6843,-5.72841,...,-6.6204,-6.62048,-6.62219,-6.6309,-6.63636,-6.63884,-6.63906,-6.64305,-6.64432,-6.64795


In [48]:
# Save words against log probabilities for comparison in the next section
top_100_adhd_words.to_csv('../datasets/words_adhd_lem_tf_nb.csv', index=False)
top_100_ocd_words.to_csv('../datasets/words_ocd_lem_tf_nb.csv', index=False)