# Project 3

## Part 2: Modeling

Model data for fun and profit.

### 0. Imports and Preliminaries

In [1]:
# imports
import pandas as pd
import numpy as np

# preprocessing
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import MultinomialNB

# metrics
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import classification_report

# cross-validation
from sklearn.model_selection import train_test_split, cross_val_score

# pipelines, gridsearch
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# nltk - for stopwords and stemming
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk import word_tokenize

# custom
import ipynb_utils as ipyutils

In [2]:
# load data
df = pd.read_json('../data/scrapes-clean.json', orient='index')

# convert time to datetime object
df['time'] = pd.to_datetime(df['time'], format=ipyutils.DATE_FMT)

In [3]:
# check that all looks good...
df.head()

Unnamed: 0,time,title,body-text,title-cc,title-wc,body-cc,body-wc,media,comments
0,2022-09-05,Newbie questions about ascendants and borders,"I'm new to actually learning astrology, not ju...",45,6,601,107,0,0
2,2022-09-05,Astrology and cognitive dissonance,Open to anyone who wouldn't mind sharing a rec...,34,4,323,49,0,1
3,2022-09-05,what do y’all think of persona charts?,"I feel a bit skeptical of them, since I feel l...",38,7,180,29,0,0
4,2022-09-05,RESOURCE REQUEST: Videos (or articles) with ti...,I think my problem is that I don’t know the pr...,160,24,597,94,0,2
5,2022-09-05,"people who have had saturn transit their 10th,...",How did it affect your career? Did it impact y...,64,12,116,22,0,11


In [4]:
# ... and that the right datatypes are showing
df.dtypes

time         datetime64[ns]
title                object
body-text            object
title-cc              int64
title-wc              int64
body-cc               int64
body-wc               int64
media                 int64
comments              int64
dtype: object

In [5]:
df.shape

(8413, 9)

### 0.5. Problem Statement

What characteristics of a post on Reddit are most predictive of the overall interaction on a thread (as measured by number of comments)? I am looking to predict high-engagement vs. low-engagement posts, where high engagement means a comment count strictly greater than the median comment count across all posts. The task is therefore a binary classification: will a post receive more than the median number of comments?

### 1. Generate Target

In [6]:
# median comments
median = np.median(df['comments'])
median

14.0

In [7]:
# target column
df['comments_gt_median'] = (df['comments'] > median).astype(int)
df['comments_gt_median'].value_counts()

0    4271
1    4142
Name: comments_gt_median, dtype: int64

In [8]:
df['comments_gt_median'].value_counts(normalize=True)

0    0.507667
1    0.492333
Name: comments_gt_median, dtype: float64

#### Baseline
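The baseline is just about **50%** (50.8% for the majority class, low engagement), as it should be since the median is the split point between high and low engagement. It is not exactly 50% because posts with exactly the median number of comments fall into the low class.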
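
As a quick check on the baseline figure, the majority-class accuracy can be worked out directly from the class counts printed above. This is only a sketch of the arithmetic; the variable names are illustrative and not part of the notebook.

```python
# Class counts reported by value_counts() above
n_low, n_high = 4271, 4142

# A model that always predicts the majority class is right this often
baseline = max(n_low, n_high) / (n_low + n_high)
print(f"majority-class baseline: {baseline:.4f}")
```

Any model worth keeping should beat this number on the test set.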

In [9]:
# Store in obvious variable
TARGET = df['comments_gt_median']

### 1a. Split Time Column

Might want to check by month or day of week

In [10]:
df['day'] = df['time'].apply(ipyutils.get_day_of_week)

In [11]:
df['month'] = df['time'].dt.month

In [12]:
df['year'] = df['time'].dt.year

In [13]:
df[['day','month', 'year']].head()

Unnamed: 0,day,month,year
0,0,9,2022
2,0,9,2022
3,0,9,2022
4,0,9,2022
5,0,9,2022


In [14]:
# day 0 is Monday (2022-09-05 above), so 5 and 6 are Saturday and Sunday
df['weekend'] = (df['day'] >= 5).astype(int)
df['weekend'].head(10)

0     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
10    0
11    0
Name: weekend, dtype: int64
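
The rows above show day 0 for 2022-09-05, which was a Monday, so `get_day_of_week` appears to follow Python's Monday-is-0 convention (an assumption, since the helper lives in ipynb_utils.py and is not shown here). Under that convention Saturday is 5 and Sunday is 6. A small sketch using only the standard library shows which comparison flags both weekend days:

```python
from datetime import date

# One full week starting Monday 2022-09-05
week = [date(2022, 9, 5 + i) for i in range(7)]
days = [d.weekday() for d in week]           # Monday=0 ... Sunday=6

weekend_ge5 = [int(d >= 5) for d in days]    # flags Saturday and Sunday
weekend_gt5 = [int(d > 5) for d in days]     # flags Sunday only

print(days)
print(weekend_ge5)
print(weekend_gt5)
```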

In [15]:
# Doing this just to be safe as I've gotten some weird row mismatches
# later on and not sure exactly why
df.reset_index(drop=True, inplace=True)

### 2. Train-Test Split

In [16]:
col_target = 'comments_gt_median'
cols_to_drop = ['time'] # don't need this any more
X = df.drop(columns=[col_target]+cols_to_drop)
y = df[col_target]

# split to train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    stratify=y,
                                                    test_size=0.2,
                                                    random_state=1)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((6730, 12), (1683, 12), (6730,), (1683,))
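
`stratify=y` keeps the class balance the same in both splits, which matters when the baseline is computed from that balance. A minimal sketch on made-up labels (the toy arrays are assumptions for illustration only):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 1000 rows with roughly the same 51/49 class balance as the target
y_toy = np.array([0] * 510 + [1] * 490)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, test_size=0.2, random_state=1
)

# Share of the positive class in each split
print(y_tr.mean(), y_te.mean())
```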

### 3. Count Vectorize Text Fields

In [17]:
# utility functions for testing stemming - taken from course 
# materials 33-nlp-ii
def stem_tokenizer(doc):
    stemmer = PorterStemmer()
    tokens = word_tokenize(doc)
    return [stemmer.stem(t) for t in tokens]

def lemma_tokenizer(doc):
    lemmatizer = WordNetLemmatizer()
    tokens = word_tokenize(doc)
    return [lemmatizer.lemmatize(t) for t in tokens]

# I tried these on the CountVectorizer and they result in some bogus matches
# (like whitespace and punctuation). I don't have time to really look into this,
# and the scores from my tests on these were not very different from not using
# them, so I'm not going to use them this time around.

In [18]:
# get count vectorize tables
cv_params = {
    'token_pattern': ipyutils.PAT_TOKEN, # using standard CV token pattern
    'min_df': 15, # we don't want words that rarely appear
    'stop_words': stopwords.words('english'), # use nltk stopwords list
    'ngram_range': (1,3), # allow short phrases
    'tokenizer': None # tried stem_tokenizer, lemma_tokenizer - using neither
}
cv_title = CountVectorizer(**cv_params)
cv_body = CountVectorizer(**cv_params)
cv_alltext = CountVectorizer(**cv_params)

# title
train_title_cv = cv_title.fit_transform(X_train['title'])
test_title_cv = cv_title.transform(X_test['title'])

# body
train_body_cv = cv_body.fit_transform(X_train['body-text'])
test_body_cv = cv_body.transform(X_test['body-text'])

# title + body
train_alltext_cv = cv_alltext.fit_transform(X_train['title'] + ' ' + X_train['body-text'])
test_alltext_cv = cv_alltext.transform(X_test['title'] + ' ' + X_test['body-text'])

(train_title_cv.shape, test_title_cv.shape, 
train_body_cv.shape, test_body_cv.shape,
train_alltext_cv.shape, test_alltext_cv.shape)

((6730, 588),
 (1683, 588),
 (6730, 3994),
 (1683, 3994),
 (6730, 4355),
 (1683, 4355))

In [19]:
cv_title.get_feature_names_out()[-20:]

array(['week', 'well', 'western', 'whole', 'whole sign', 'within',
       'without', 'wondering', 'work', 'working', 'world', 'would',
       'wrong', 'year', 'years', 'yet', 'youtube', 'zodiac',
       'zodiac sign', 'zodiac signs'], dtype=object)

In [20]:
cv_alltext.get_feature_names_out()[-40:]

array(['wrong', 'wrote', 'www', 'www astro', 'www astro com',
       'www reddit', 'www reddit com', 'www youtube', 'www youtube com',
       'ya', 'yang', 'yeah', 'year', 'year ago', 'year half', 'years',
       'years ago', 'years old', 'yes', 'yesterday', 'yet', 'yin',
       'young', 'younger', 'youtu', 'youtube', 'youtube channel',
       'youtube com', 'youtube com watch', 'youtube video',
       'youtube videos', 'youtubers', 'zero', 'zeus', 'zodiac',
       'zodiac sign', 'zodiac signs', 'zodiacal', 'zodiacs', 'zone'],
      dtype=object)

### 3a. Random Forest Classifier

In [21]:
title_rfc = RandomForestClassifier()
gs_params = {
    'n_estimators': [200, 300],
    'min_samples_leaf': [4, 5],
    'min_samples_split': [4, 5],
    'min_impurity_decrease': [0.0001, 0.001],
    'n_jobs': [-1],
    'random_state': [1]
}

# use gridsearch this time only to check best model params (takes a long time)
gs = GridSearchCV(title_rfc, gs_params, verbose=1, n_jobs=-1)

In [22]:
# model on titles only
gs.fit(train_title_cv, y_train)
print()
print(ipyutils.score_report(gs, 
                            (train_title_cv, y_train), 
                            (test_title_cv, y_test)))

Fitting 5 folds for each of 16 candidates, totalling 80 fits

Model Train Score (best): 0.7286775631500743
Model Test Score (best): 0.6428995840760546
Model Best Estimator: RandomForestClassifier(min_impurity_decrease=0.0001, min_samples_leaf=4,
                       min_samples_split=4, n_estimators=200, n_jobs=-1,
                       random_state=1)



In [23]:
# save best params to use for later models.
# done like this with variable alias because I previously hand-copied
# the params to a separate dictionary and needed the rfc_params variable for
# that dictionary
rfc_params = gs.best_params_
rfc_params

{'min_impurity_decrease': 0.0001,
 'min_samples_leaf': 4,
 'min_samples_split': 4,
 'n_estimators': 200,
 'n_jobs': -1,
 'random_state': 1}
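
`n_jobs` and `random_state` show up in `best_params_` above only because they were put in the grid as single-value lists. An alternative, sketched here on synthetic data (the arrays and the small grid are assumptions to keep it fast), is to set fixed values on the estimator and search only the values being tuned:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X_demo = rng.poisson(1.0, size=(200, 10))          # stand-in for a count matrix
y_demo = (X_demo[:, 0] > X_demo[:, 1]).astype(int)

# Fixed settings go on the estimator; only tuned values go in the grid
search = GridSearchCV(
    RandomForestClassifier(random_state=1),
    {'n_estimators': [10, 20], 'min_samples_leaf': [2, 4]},
    cv=3,
)
search.fit(X_demo, y_demo)

# best_params_ holds only the searched keys, so fixed settings are passed again
reuse = RandomForestClassifier(random_state=1, **search.best_params_)
reuse.fit(X_demo, y_demo)
print(search.best_params_)
```

Either approach works; the trade-off is that with this one the fixed settings must be repeated when the best parameters are reused.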

In [24]:
# model on body text only - use same best params from gridsearch
body_rfc = RandomForestClassifier(**rfc_params)
body_rfc.fit(train_body_cv, y_train)
print()
print(ipyutils.score_report(body_rfc, 
                            (train_body_cv, y_train), 
                            (test_body_cv, y_test)))


Model Train Score (best): 0.7885586924219911
Model Test Score (best): 0.6601307189542484



In [25]:
# model on all text
alltext_rfc = RandomForestClassifier(**rfc_params)
alltext_rfc.fit(train_alltext_cv, y_train)
print()
print(ipyutils.score_report(alltext_rfc, 
                            (train_alltext_cv, y_train), 
                            (test_alltext_cv, y_test)))


Model Train Score (best): 0.837592867756315
Model Test Score (best): 0.7011289364230541



#### Metrics

In [26]:
# title words
print(ipyutils.metrics_report(gs.best_estimator_, y_test, test_title_cv))

              precision    recall  f1-score   support

         low       0.66      0.62      0.64       854
        high       0.63      0.67      0.65       829

    accuracy                           0.64      1683
   macro avg       0.64      0.64      0.64      1683
weighted avg       0.64      0.64      0.64      1683

True Positives: 552
True Negatives: 530
False Positives: 324
False Negatives: 277



In [27]:
# body words
print(ipyutils.metrics_report(body_rfc, y_test, test_body_cv))

              precision    recall  f1-score   support

         low       0.70      0.58      0.63       854
        high       0.63      0.75      0.68       829

    accuracy                           0.66      1683
   macro avg       0.67      0.66      0.66      1683
weighted avg       0.67      0.66      0.66      1683

True Positives: 619
True Negatives: 492
False Positives: 362
False Negatives: 210



In [28]:
# all words
print(ipyutils.metrics_report(alltext_rfc, y_test, test_alltext_cv))

              precision    recall  f1-score   support

         low       0.72      0.68      0.70       854
        high       0.69      0.72      0.70       829

    accuracy                           0.70      1683
   macro avg       0.70      0.70      0.70      1683
weighted avg       0.70      0.70      0.70      1683

True Positives: 598
True Negatives: 582
False Positives: 272
False Negatives: 231



#### Analysis of Random Forest Classifier Score

Perhaps unsurprisingly, modeling on the full text (title plus body) gave the best prediction accuracy. However, for purposes of the problem statement, the title and body are possibly best kept separate, as Reddit does not display the full body text by default, and searches only display titles. If we are looking for maximum engagement, we are more likely to reach the largest number of users via the post titles rather than the post bodies.

Test accuracy beats the 50.8% baseline by roughly 13.5 to 19.3 percentage points, depending on the text field used (64.3% for titles, 66.0% for bodies, 70.1% for all text). Since we are looking to increase engagement, the metrics concerning the quality of our positive (high-engagement) predictions are probably more important. In all three models, recall was higher for the high-engagement category, whereas precision was higher for the low-engagement category. In the alltext test, for example, 69% of the posts predicted as high-engagement really were high-engagement (precision), and the model found 72% of all high-engagement posts (recall).

The model is overfit: train accuracy is roughly 9 to 14 points above test accuracy in each case (which is probably to be expected from a decision-tree-based model).
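
The precision, recall, and accuracy figures for the all-text model can be recomputed from the confusion counts in the report above. This is just the arithmetic, with illustrative variable names:

```python
# Confusion counts for the all-text model on the test set (from the report above)
tp, tn, fp, fn = 598, 582, 272, 231

precision_high = tp / (tp + fp)   # of posts predicted high, share that were high
recall_high = tp / (tp + fn)      # of truly high posts, share that were found
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(precision_high, 2), round(recall_high, 2), round(accuracy, 2))
```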

#### Exploration of Model Results - Title Predictors

Or: what words were best predictors in the titles?

In [29]:
# title exploration - what words were best predictors in the titles?

# get predictions
title_preds_test = gs.best_estimator_.predict(test_title_cv)

# make dataframe from CountVectorizer sparse matrix
Xdf = pd.DataFrame(test_title_cv.toarray(), 
                   columns=cv_title.get_feature_names_out(),
                   index=y_test.index)

# get metrics per word (see custom script ipynb_utils.py)
wc_df = ipyutils.wc_metrics(Xdf, y_test, title_preds_test, opts=[])

# filters - high word count, high accuracy, high recall
high_wc_filt = (wc_df['total'] > wc_df['total'].quantile(0.75))
high_accuracy_filt = (wc_df['accuracy'] >= 0.75)
high_recall_filt = (wc_df['recall'] >= 0.75)

# get results
wc_df[high_wc_filt & high_accuracy_filt & high_recall_filt].sort_values(by='accuracy', ascending=False)

Unnamed: 0,total,pct,correct,incorrect,diff,tp,tn,fp,fn,accuracy,recall,specificity,precision,f1
way,21,0.269715,20,1,19,11,9,0,1,0.952381,0.916667,1.0,1.0,0.956522
node,15,0.192653,14,1,13,8,6,1,0,0.933333,1.0,0.857143,0.888889,0.941176
ascendant,18,0.231184,16,2,14,11,5,2,0,0.888889,1.0,0.714286,0.846154,0.916667
placement,27,0.346776,24,3,21,22,2,2,1,0.888889,0.956522,0.5,0.916667,0.93617
sagittarius,17,0.218341,15,2,13,13,2,2,0,0.882353,1.0,0.5,0.866667,0.928571
anyone else,19,0.244028,16,3,13,14,2,3,0,0.842105,1.0,0.4,0.823529,0.903226
really,19,0.244028,16,3,13,10,6,2,1,0.842105,0.909091,0.75,0.833333,0.869565
born,20,0.256871,16,4,12,9,7,1,3,0.8,0.75,0.875,0.9,0.818182
chiron,15,0.192653,12,3,9,8,4,2,1,0.8,0.888889,0.666667,0.8,0.842105
sun moon,15,0.192653,12,3,9,9,3,3,0,0.8,1.0,0.5,0.75,0.857143


#### Exploration of Model Results - Body Predictors

In [30]:
# body exploration - what words were best predictors in the bodies?

# get predictions
body_preds_test = body_rfc.predict(test_body_cv)

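# make dataframe from CountVectorizer sparse matrix
# NOTE: this passes the title vectorizer and title matrix, so the table below
# scores title words against the body model's predictions; for body words use
# ipyutils.df_from_cv(cv_body, test_body_cv, y_test.index)
Xdf = ipyutils.df_from_cv(cv_title, test_title_cv, y_test.index)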

# get metrics per word (see custom script ipynb_utils.py)
wc_df = ipyutils.wc_metrics(Xdf, y_test, body_preds_test, opts=[])

# filters - high word count, high accuracy, high recall
high_wc_filt = (wc_df['total'] > wc_df['total'].quantile(0.75))
high_accuracy_filt = (wc_df['accuracy'] >= 0.70)
high_recall_filt = (wc_df['recall'] >= 0.75)

# get results
wc_df[high_wc_filt & high_accuracy_filt & high_recall_filt].sort_values(by='accuracy', ascending=False)

Unnamed: 0,total,pct,correct,incorrect,diff,tp,tn,fp,fn,accuracy,recall,specificity,precision,f1
interpret,16,0.205497,14,2,12,3,11,1,1,0.875,0.75,0.916667,0.75,0.75
chiron,15,0.192653,12,3,9,7,5,1,2,0.8,0.777778,0.833333,0.875,0.823529
talk,15,0.192653,12,3,9,10,2,2,1,0.8,0.909091,0.5,0.833333,0.869565
venus,73,0.93758,58,15,43,41,17,13,2,0.794521,0.953488,0.566667,0.759259,0.845361
ascendant,18,0.231184,14,4,10,9,5,2,2,0.777778,0.818182,0.714286,0.818182,0.818182
placement,27,0.346776,21,6,15,20,1,3,3,0.777778,0.869565,0.25,0.869565,0.869565
8th,17,0.218341,13,4,9,9,4,4,0,0.764706,1.0,0.5,0.692308,0.818182
sagittarius,17,0.218341,13,4,9,11,2,2,2,0.764706,0.846154,0.5,0.846154,0.846154
houses,37,0.475212,28,9,19,12,16,7,2,0.756757,0.857143,0.695652,0.631579,0.727273
8th house,16,0.205497,12,4,8,9,3,4,0,0.75,1.0,0.428571,0.692308,0.818182


#### Analysis of Metrics

The two tables above contain compiled information on the words in the test set. Filters were created to select documents based on correct and incorrect predictions, among other metrics (see the ipynb_utils.py script). The features (word counts) were then summed to get word counts per outcome (tp, tn, fp, fn), and the remaining metrics were derived from those counts. This way the metrics can be explored on a word-by-word basis.

Both tables are sorted by accuracy (high to low) and show only words whose total count is above the 75th percentile, with recall of at least 0.75 and accuracy of at least 0.75 (titles) or 0.70 (bodies).

These are interesting views on which words contributed most to the predictions. "ascendant", "placement", "sagittarius", and "chiron" come up in both tables. "anyone else" appears only in the title table, where it would make sense as a call to action. Note that the body table was built from the title word counts (`cv_title`), so it shows title words scored against the body model's predictions rather than body words.
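
The per-word metrics described above can be sketched as follows. This is a hypothetical reconstruction, not the actual `wc_metrics` from ipynb_utils.py (which is not shown here); the function name `per_word_metrics` and the toy data are assumptions. It is consistent with the tables above, for example the "way" row: (11 + 9) / 21 gives an accuracy of 0.952 and 11 / (11 + 1) gives a recall of 0.917.

```python
import numpy as np
import pandas as pd

def per_word_metrics(word_counts, y_true, y_pred):
    """Sum each word's counts over TP/TN/FP/FN documents, then derive rates."""
    yt, yp = np.asarray(y_true), np.asarray(y_pred)
    groups = {
        'tp': (yt == 1) & (yp == 1),
        'tn': (yt == 0) & (yp == 0),
        'fp': (yt == 0) & (yp == 1),
        'fn': (yt == 1) & (yp == 0),
    }
    # Each boolean mask selects documents; summing gives per-word counts
    out = pd.DataFrame({name: word_counts[mask].sum() for name, mask in groups.items()})
    out['total'] = out[['tp', 'tn', 'fp', 'fn']].sum(axis=1)
    out['accuracy'] = (out['tp'] + out['tn']) / out['total']
    out['recall'] = out['tp'] / (out['tp'] + out['fn'])
    return out

# Toy example: four documents, two words
toy_counts = pd.DataFrame({'moon': [1, 0, 2, 1], 'sun': [0, 1, 1, 0]})
toy_metrics = per_word_metrics(toy_counts, y_true=[1, 0, 1, 0], y_pred=[1, 0, 0, 1])
print(toy_metrics)
```

Because a word's counts are summed over documents, a word that appears several times in one post is weighted more heavily than one that appears once.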

### 4. Other Classifiers Comparison

I am comparing various other classifiers to see how their general scores compare. I will select a second classifier to use as a comparison for future modeling.

#### ExtraTrees Classifier

In [31]:
title_etc = ExtraTreesClassifier()
# use same gs_params from random forest
title_etc_gs = GridSearchCV(title_etc, gs_params, verbose=1, n_jobs=-1)
title_etc_gs.fit(train_title_cv, y_train)
print(ipyutils.score_report(title_etc_gs,
                            (train_title_cv, y_train),
                            (test_title_cv, y_test)))

Fitting 5 folds for each of 16 candidates, totalling 80 fits
Model Train Score (best): 0.7242199108469539
Model Test Score (best): 0.6399286987522281
Model Best Estimator: ExtraTreesClassifier(min_impurity_decrease=0.0001, min_samples_leaf=5,
                     min_samples_split=4, n_estimators=200, n_jobs=-1,
                     random_state=1)



In [32]:
etc_params = title_etc_gs.best_params_

In [33]:
body_etc = ExtraTreesClassifier(**etc_params)
body_etc.fit(train_body_cv, y_train)
print(ipyutils.score_report(body_etc,
                            (train_body_cv, y_train),
                            (test_body_cv, y_test)))

Model Train Score (best): 0.787518573551263
Model Test Score (best): 0.6708259061200238



In [34]:
alltext_etc = ExtraTreesClassifier(**etc_params)
alltext_etc.fit(train_alltext_cv, y_train)
print(ipyutils.score_report(alltext_etc,
                            (train_alltext_cv, y_train),
                            (test_alltext_cv, y_test)))

Model Train Score (best): 0.8346210995542348
Model Test Score (best): 0.6898395721925134



#### Analysis of Extra Trees Classifier Score

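ExtraTrees performed about the same as Random Forest: slightly worse on titles (0.640 vs. 0.643 test accuracy) and all text (0.690 vs. 0.701), slightly better on body text (0.671 vs. 0.660).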

#### AdaBoost Classifier

In [35]:
# Ada Boost
ada = AdaBoostClassifier(random_state=1)
ada.fit(train_alltext_cv, y_train)
ada.score(train_alltext_cv, y_train), ada.score(test_alltext_cv, y_test)

(0.6705794947994056, 0.6102198455139631)

#### Gradient Boost Classifier

In [36]:
# Gradient Boost
gb = GradientBoostingClassifier()
gb.fit(train_alltext_cv, y_train)
gb.score(train_alltext_cv, y_train), gb.score(test_alltext_cv, y_test)

(0.7310549777117384, 0.6452762923351159)

#### K Nearest Neighbors Classifier

In [37]:
# K Neighbors
knc = KNeighborsClassifier(5)
knc.fit(train_alltext_cv, y_train)
knc.score(train_alltext_cv, y_train), knc.score(test_alltext_cv, y_test)

(0.7057949479940565, 0.5692216280451574)

#### Logistic Regression

In [38]:
# Logistic Regression
lr = LogisticRegression(max_iter=1000)
lr.fit(train_alltext_cv, y_train)
# score lr itself; scoring knc here would just repeat the KNN result above
lr.score(train_alltext_cv, y_train), lr.score(test_alltext_cv, y_test)

#### Multinomial Naive Bayes Classifier

In [73]:
# Multinomial Naive Bayes - using this one so do it for title and body text
mnb = MultinomialNB()
mnb.fit(train_title_cv, y_train)
mnb.score(train_title_cv, y_train), mnb.score(test_title_cv, y_test)

(0.6806835066864785, 0.6286393345216874)

In [74]:
print(ipyutils.metrics_report(mnb, y_test, test_title_cv))

              precision    recall  f1-score   support

         low       0.63      0.65      0.64       854
        high       0.63      0.60      0.62       829

    accuracy                           0.63      1683
   macro avg       0.63      0.63      0.63      1683
weighted avg       0.63      0.63      0.63      1683

True Positives: 501
True Negatives: 557
False Positives: 297
False Negatives: 328



In [75]:
mnb.fit(train_body_cv, y_train)
print(ipyutils.metrics_report(mnb, y_test, test_body_cv))

              precision    recall  f1-score   support

         low       0.60      0.73      0.66       854
        high       0.65      0.51      0.57       829

    accuracy                           0.62      1683
   macro avg       0.63      0.62      0.61      1683
weighted avg       0.62      0.62      0.62      1683

True Positives: 419
True Negatives: 625
False Positives: 229
False Negatives: 410



#### Analysis of Other Classifiers on Word Vectors

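Naive Bayes (0.63 on titles, 0.62 on body text) and Gradient Boost (0.65 on all text) scored similarly on the test set, though on different feature sets. AdaBoost (0.61) and K Nearest Neighbors (0.57) were weaker performers. Because Naive Bayes overfits less (a train/test gap of about 5 points on titles versus about 9 for Gradient Boost), I will use it for future model comparisons.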
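
Scoring each model inside a loop keeps the fitted model and the scored model the same object, and puts every classifier on the same feature set. A minimal sketch on synthetic count data (the data, the label rule, and the model names in the dictionary are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X_demo = rng.poisson(1.0, size=(400, 20))               # non-negative counts
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 2).astype(int)  # made-up label rule
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=1
)

models = {
    'logreg': LogisticRegression(max_iter=1000),
    'knn': KNeighborsClassifier(5),
    'mnb': MultinomialNB(),
}

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    # (train accuracy, test accuracy) for the model that was just fitted
    scores[name] = (model.score(X_tr, y_tr), model.score(X_te, y_te))

for name, (train_acc, test_acc) in scores.items():
    print(f'{name}\ttrain: {train_acc:.3f}\ttest: {test_acc:.3f}')
```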

### 4a. Other Features

There are a few other features I'd like to explore (word/character counts, for example).

Date/time features might not be appropriate here due to how Reddit works and the scraping process. Reddit no longer allows search by date, so I cannot get consecutive posts over time; instead I am collecting as many posts as I can via keyword searches. The post distribution over time in my scrapes may therefore differ from the actual post distribution over time, and there is no way to verify this with my current scraping process.

In [40]:
df.columns

Index(['time', 'title', 'body-text', 'title-cc', 'title-wc', 'body-cc',
       'body-wc', 'media', 'comments', 'comments_gt_median', 'day', 'month',
       'year', 'weekend'],
      dtype='object')

In [41]:
df.groupby('day').count() # decent spread, probably enough to be OK with here

day,time,title,body-text,title-cc,title-wc,body-cc,body-wc,media,comments,comments_gt_median,month,year,weekend
0,698,698,698,698,698,698,698,698,698,698,698,698,698
1,584,584,584,584,584,584,584,584,584,584,584,584,584
2,1195,1195,1195,1195,1195,1195,1195,1195,1195,1195,1195,1195,1195
3,1080,1080,1080,1080,1080,1080,1080,1080,1080,1080,1080,1080,1080
4,986,986,986,986,986,986,986,986,986,986,986,986,986
5,2858,2858,2858,2858,2858,2858,2858,2858,2858,2858,2858,2858,2858
6,1012,1012,1012,1012,1012,1012,1012,1012,1012,1012,1012,1012,1012


In [42]:
df.groupby('month').count() # what's with september???

month,time,title,body-text,title-cc,title-wc,body-cc,body-wc,media,comments,comments_gt_median,day,year,weekend
1,232,232,232,232,232,232,232,232,232,232,232,232,232
2,211,211,211,211,211,211,211,211,211,211,211,211,211
3,221,221,221,221,221,221,221,221,221,221,221,221,221
4,227,227,227,227,227,227,227,227,227,227,227,227,227
5,233,233,233,233,233,233,233,233,233,233,233,233,233
6,175,175,175,175,175,175,175,175,175,175,175,175,175
7,213,213,213,213,213,213,213,213,213,213,213,213,213
8,371,371,371,371,371,371,371,371,371,371,371,371,371
9,5637,5637,5637,5637,5637,5637,5637,5637,5637,5637,5637,5637,5637
10,503,503,503,503,503,503,503,503,503,503,503,503,503


#### Word- and Character-counts and Media Indicator

I will take a brute-force approach and do a quick model on a lot of different feature sets and combinations, and select ones to take a closer look at based on model scores.

In [43]:
# I'm going to test all of the following feature combinations just to see if
# they show any major differences
col_opts = [
    ['media', 'title-cc', 'body-cc', 'title-wc', 'body-wc'],
    ['media'],
    ['title-cc', 'title-wc'],
    ['body-cc', 'body-wc'],
    ['title-cc', 'title-wc', 'body-cc', 'body-wc'],
    ['day'],
    ['month'],
    ['year'],
    ['day', 'month'],
    ['month', 'year'],
    ['title-cc', 'title-wc', 'body-cc', 'body-wc', 'day'],
    ['title-cc', 'title-wc', 'body-cc', 'body-wc', 'day', 'month'],
    ['title-cc', 'title-wc', 'body-cc', 'body-wc', 'month', 'year'],
    ['media', 'title-cc', 'title-wc', 'body-cc', 'body-wc', 'month', 'year']
]

In [44]:
# loop over each feature combination and run a model
for opt in col_opts:
    xrfc = RandomForestClassifier(**rfc_params)
    xrfc.fit(X_train[opt], y_train)
    print(f'{opt}\n\ttrain: {xrfc.score(X_train[opt], y_train)}'\
          + f'\n\ttest: {xrfc.score(X_test[opt], y_test)}')

['media', 'title-cc', 'body-cc', 'title-wc', 'body-wc']
	train: 0.8239227340267459
	test: 0.6226975638740344
['media']
	train: 0.5545319465081724
	test: 0.5472370766488414
['title-cc', 'title-wc']
	train: 0.6208023774145617
	test: 0.5561497326203209
['body-cc', 'body-wc']
	train: 0.6910846953937593
	test: 0.5430778371954843
['title-cc', 'title-wc', 'body-cc', 'body-wc']
	train: 0.8460624071322437
	test: 0.5870469399881164
['day']
	train: 0.5671619613670134
	test: 0.5680332739156269
['month']
	train: 0.5221396731054978
	test: 0.5086155674390969
['year']
	train: 0.5682020802377414
	test: 0.5775401069518716
['day', 'month']
	train: 0.6022288261515601
	test: 0.5941770647653001
['month', 'year']
	train: 0.5852897473997029
	test: 0.5793226381461676
['title-cc', 'title-wc', 'body-cc', 'body-wc', 'day']
	train: 0.8527488855869242
	test: 0.6030897207367796
['title-cc', 'title-wc', 'body-cc', 'body-wc', 'day', 'month']
	train: 0.8514115898959881
	test: 0.6125965537730244
['title-cc', 'title-wc',

In [45]:
# now let's do the same but adding the Vectorized title

# make vectorized title dataframes
train_title_cv_df = ipyutils.df_from_cv(cv_title, train_title_cv, X_train.index)
test_title_cv_df = ipyutils.df_from_cv(cv_title, test_title_cv, X_test.index)

# concat all column combos to vectorized title dataframes
combo_dfs = list()
for opt in col_opts:
    # 0 is train, 1 is test
    combo_dfs.append((ipyutils.easy_concat(X_train[opt], train_title_cv_df),
                      ipyutils.easy_concat(X_test[opt], test_title_cv_df)))

# check for integrity
for cdf in combo_dfs:
    print(cdf[0].shape, cdf[1].shape)

(6730, 593) (1683, 593)
(6730, 589) (1683, 589)
(6730, 590) (1683, 590)
(6730, 590) (1683, 590)
(6730, 592) (1683, 592)
(6730, 589) (1683, 589)
(6730, 589) (1683, 589)
(6730, 589) (1683, 589)
(6730, 590) (1683, 590)
(6730, 590) (1683, 590)
(6730, 593) (1683, 593)
(6730, 594) (1683, 594)
(6730, 594) (1683, 594)
(6730, 595) (1683, 595)
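
`ipyutils.df_from_cv` and `ipyutils.easy_concat` are not shown here; from their use above they turn the sparse count matrix into a DataFrame and join it to the numeric columns. An alternative that avoids a dense frame (not what this notebook does) is `scipy.sparse.hstack`, sketched below on stand-in arrays. scikit-learn estimators such as `RandomForestClassifier` and `MultinomialNB` accept the sparse result directly.

```python
import numpy as np
from scipy import sparse

# Stand-ins: a sparse word-count matrix and two dense numeric columns
word_counts = sparse.csr_matrix(np.array([[1, 0, 2], [0, 1, 0], [3, 0, 0]]))
numeric = np.array([[45, 6], [34, 4], [38, 7]])   # e.g. title-cc, title-wc

# hstack keeps the result sparse; rows must be in the same order in both inputs
combined = sparse.hstack([sparse.csr_matrix(numeric), word_counts]).tocsr()
print(combined.shape)
```

The trade-off is that column names are lost and rows are matched by position, so the DataFrame route is easier to inspect when row order is in doubt.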


In [46]:
# run a model on each combo
for ix, cdf in enumerate(combo_dfs):
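    opt = col_opts[ix]
    # fresh classifier per feature set instead of refitting the one left over
    # from the previous loop
    xrfc = RandomForestClassifier(**rfc_params)
    xrfc.fit(cdf[0], y_train)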
    print(f'{opt}\n\ttrain:{xrfc.score(cdf[0], y_train)}\n'\
          + f'\ttest:{xrfc.score(cdf[1], y_test)}')

['media', 'title-cc', 'body-cc', 'title-wc', 'body-wc']
	train:0.7579494799405646
	test:0.6743909685086156
['media']
	train:0.7392273402674592
	test:0.6512180629827689
['title-cc', 'title-wc']
	train:0.736552748885587
	test:0.6339869281045751
['body-cc', 'body-wc']
	train:0.7401188707280832
	test:0.6530005941770648
['title-cc', 'title-wc', 'body-cc', 'body-wc']
	train:0.7527488855869242
	test:0.6589423648247178
['day']
	train:0.7378900445765231
	test:0.6446821152703506
['month']
	train:0.7323922734026745
	test:0.6428995840760546
['year']
	train:0.7421991084695394
	test:0.6648841354723708
['day', 'month']
	train:0.7445765230312036
	test:0.6642899584076055
['month', 'year']
	train:0.7411589895988113
	test:0.6619132501485443
['title-cc', 'title-wc', 'body-cc', 'body-wc', 'day']
	train:0.7557206537890044
	test:0.6654783125371361
['title-cc', 'title-wc', 'body-cc', 'body-wc', 'day', 'month']
	train:0.7610698365527488
	test:0.6714200831847891
['title-cc', 'title-wc', 'body-cc', 'body-wc', 'm

In [47]:
# month + year + word/char counts seem to have most impact. 
# What are the stats?

# re-get score on [month + year + wc/cc + title_cv]
feats_train = combo_dfs[-2][0]
feats_test = combo_dfs[-2][1]

feats_rfc = RandomForestClassifier(**rfc_params)
feats_rfc.fit(feats_train, y_train)
feats_rfc.score(feats_train, y_train), feats_rfc.score(feats_test, y_test)

(0.763447251114413, 0.6838978015448604)

In [48]:
# get report on predictions for [month + year + wc/cc + title_cv]
feats_preds_test = feats_rfc.predict(feats_test)
print(classification_report(y_test, feats_preds_test))

              precision    recall  f1-score   support

           0       0.69      0.70      0.69       854
           1       0.68      0.67      0.68       829

    accuracy                           0.68      1683
   macro avg       0.68      0.68      0.68      1683
weighted avg       0.68      0.68      0.68      1683



In [49]:
# get report on [month + year + wc/cc] only
feats_rfc = RandomForestClassifier(**rfc_params)
feats_rfc.fit(X_train[col_opts[-2]], y_train)
feats_preds_test = feats_rfc.predict(X_test[col_opts[-2]])
print(ipyutils.metrics_report(feats_rfc, y_test, X_test[col_opts[-2]]))

              precision    recall  f1-score   support

         low       0.63      0.62      0.62       854
        high       0.61      0.62      0.62       829

    accuracy                           0.62      1683
   macro avg       0.62      0.62      0.62      1683
weighted avg       0.62      0.62      0.62      1683

True Positives: 515
True Negatives: 530
False Positives: 324
False Negatives: 314



In [50]:
# Finally, run a test with the Naive Bayes classifier
feats_train = combo_dfs[-2][0]
feats_test = combo_dfs[-2][1]

feats_nb = MultinomialNB()
feats_nb.fit(feats_train, y_train)
feats_nb.score(feats_train, y_train), feats_nb.score(feats_test, y_test)

(0.525408618127786, 0.5222816399286988)

#### Analysis

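From what I've seen so far, adding vectorized text data seems to even out the model a bit, with less overfitting than when using just word- and character-count features (for the four count features, the train/test gap shrinks from about 26 points to about 9 once the title vectors are added). It also increases the test score over using just the date and word-/character-count features (0.68 versus 0.62 for month + year + counts).

Here I was only using vectorized words from the title field. Random Forest performed about as before, while Multinomial Naive Bayes dropped to 0.52 on the test set, barely above baseline.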

#### Adding Body Text

Review on Body Text, Title Text, Date (Month/Year), and Word/Character Counts.

In [51]:
# Vectorized body sets: train_body_cv, test_body_cv
# Vectorized title sets: train_title_cv, test_title_cv
# Vectorized alltext sets: train_alltext_cv, test_alltext_cv
# Train numeric set (wc/cc + month + year, no media): X_train[col_opts[-2]]

train_body_cvdf = ipyutils.df_from_cv(cv_body, train_body_cv, y_train.index)
test_body_cvdf = ipyutils.df_from_cv(cv_body, test_body_cv, y_test.index)

# using original variable name because I changed reference last minute and might
# not have time to go through and change all variables that follow
train_media_df = X_train[col_opts[-2]]
test_media_df = X_test[col_opts[-2]]

train_title_cvdf = ipyutils.df_from_cv(cv_title, train_title_cv, y_train.index)
test_title_cvdf = ipyutils.df_from_cv(cv_title, test_title_cv, y_test.index)

# get alltext for future exploration
train_alltext_cvdf = ipyutils.df_from_cv(cv_alltext, train_alltext_cv, y_train.index)
test_alltext_cvdf = ipyutils.df_from_cv(cv_alltext, test_alltext_cv, y_test.index)

In [52]:
# Finally, do one more test taking into account title words, body words, and
# the numeric features (wc/cc + month + year, held in the *_media_df variables)

# First prefix title and body words with title and body, respectively
train_title_cvdf = train_title_cvdf.add_prefix('(title) ')
train_body_cvdf = train_body_cvdf.add_prefix('(body) ')
test_title_cvdf = test_title_cvdf.add_prefix('(title) ')
test_body_cvdf = test_body_cvdf.add_prefix('(body) ')
train_title_cvdf.columns[:10]

Index(['(title) 10', '(title) 10th', '(title) 10th house', '(title) 11',
       '(title) 11th', '(title) 11th house', '(title) 12', '(title) 12th',
       '(title) 12th house', '(title) 1st'],
      dtype='object')

In [53]:
# make combination tables of numeric features (wc/cc + month + year), title words, and body words
train_megadf = pd.concat([train_media_df, train_title_cvdf, train_body_cvdf], axis=1)
test_megadf = pd.concat([test_media_df, test_title_cvdf, test_body_cvdf], axis=1)
train_megadf.shape, test_megadf.shape, y_train.shape, y_test.shape

((6730, 4588), (1683, 4588), (6730,), (1683,))
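A small illustration of why the prefixes are added before concatenating: the same token can appear in both the title and body vocabularies, and without a prefix the combined table would carry two columns with the same name. The toy frames below are made up for the example.

```python
import pandas as pd

# The token "sign" appears in both vocabularies (toy data)
title = pd.DataFrame({"sign": [1, 0]}, index=[10, 42])
body = pd.DataFrame({"sign": [2, 1]}, index=[10, 42])

# Without prefixes the combined frame has two columns both named "sign"
unprefixed = pd.concat([title, body], axis=1)
has_duplicates = unprefixed.columns.duplicated().any()

# With prefixes every column name stays unique and traceable to its source
prefixed = pd.concat(
    [title.add_prefix("(title) "), body.add_prefix("(body) ")], axis=1
)
print(has_duplicates, list(prefixed.columns))
```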

In [54]:
# run a Random Forest model
mega_rfc = RandomForestClassifier(**rfc_params)
mega_rfc.fit(train_megadf, y_train)
mega_train_preds = mega_rfc.predict(train_megadf)
mega_test_preds = mega_rfc.predict(test_megadf)
print("TRAIN\n", classification_report(y_train, mega_train_preds),
      "TEST\n", classification_report(y_test, mega_test_preds))

TRAIN
               precision    recall  f1-score   support

           0       0.84      0.87      0.86      3417
           1       0.86      0.83      0.85      3313

    accuracy                           0.85      6730
   macro avg       0.85      0.85      0.85      6730
weighted avg       0.85      0.85      0.85      6730
 TEST
               precision    recall  f1-score   support

           0       0.73      0.72      0.72       854
           1       0.71      0.72      0.72       829

    accuracy                           0.72      1683
   macro avg       0.72      0.72      0.72      1683
weighted avg       0.72      0.72      0.72      1683



In [55]:
# run a Naive Bayes model
mega_nb = MultinomialNB()
mega_nb.fit(train_megadf, y_train)
mega_train_preds = mega_nb.predict(train_megadf)
mega_test_preds = mega_nb.predict(test_megadf)
print("TRAIN\n", classification_report(y_train, mega_train_preds),
      "TEST\n", classification_report(y_test, mega_test_preds))

TRAIN
               precision    recall  f1-score   support

           0       0.54      0.84      0.66      3417
           1       0.61      0.26      0.37      3313

    accuracy                           0.56      6730
   macro avg       0.58      0.55      0.51      6730
weighted avg       0.58      0.56      0.52      6730
 TEST
               precision    recall  f1-score   support

           0       0.54      0.83      0.65       854
           1       0.59      0.26      0.36       829

    accuracy                           0.55      1683
   macro avg       0.56      0.54      0.51      1683
weighted avg       0.56      0.55      0.51      1683



In [56]:
# run Ada Boost
mega_ada = AdaBoostClassifier(n_estimators=100)
mega_ada.fit(train_megadf, y_train)
mega_train_preds = mega_ada.predict(train_megadf)
mega_test_preds = mega_ada.predict(test_megadf)
print("TRAIN\n", classification_report(y_train, mega_train_preds),
      "TEST\n", classification_report(y_test, mega_test_preds))

TRAIN
               precision    recall  f1-score   support

           0       0.72      0.72      0.72      3417
           1       0.71      0.71      0.71      3313

    accuracy                           0.72      6730
   macro avg       0.72      0.72      0.72      6730
weighted avg       0.72      0.72      0.72      6730
 TEST
               precision    recall  f1-score   support

           0       0.66      0.64      0.65       854
           1       0.64      0.66      0.65       829

    accuracy                           0.65      1683
   macro avg       0.65      0.65      0.65      1683
weighted avg       0.65      0.65      0.65      1683



#### Conclusions

Combining body text, title text, and the media indicator got me a 72% accuracy score on the test set, the highest score yet. These results came from the Random Forest classifier, which has consistently outperformed all others on this data. Precision for the high-engagement class was 71%, meaning 71% of the posts predicted as high-engagement actually were, and recall was 72%, meaning the model identified 72% of the truly high-engagement posts.
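As a quick reference for how precision and recall relate to the confusion matrix, here is a minimal sketch on made-up labels (not the notebook's predictions).

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Toy labels: 4 truly positive posts, 6 truly negative posts
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
print(tn, fp, fn, tp, precision, recall)
```

Here 3 of the 5 predicted positives are correct (precision 0.6) and 3 of the 4 actual positives are found (recall 0.75).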

In [57]:
mega_rfc.fit(train_megadf, y_train)
mega_train_preds = mega_rfc.predict(train_megadf)
mega_test_preds = mega_rfc.predict(test_megadf)

In [58]:
# make my metrics dataframe
metrics_df = ipyutils.wc_metrics(test_megadf, y_test, mega_test_preds)

# set up filters
title_cols = [c for c in test_megadf.columns if '(title)' in c]
body_cols = [c for c in test_megadf.columns if '(body)' in c]
other_cols = col_opts[0]

high_wc_filt = (metrics_df['total'] > 50)
high_accuracy_filt = (metrics_df['accuracy'] >= 0.80)
high_recall_filt = (metrics_df['recall'] >= 0.75)

In [59]:
title_metrics = metrics_df.loc[title_cols]
body_metrics = metrics_df.loc[body_cols]

In [60]:
# view results
# FULL SET
metrics_df[high_wc_filt & high_accuracy_filt].sort_values(by='accuracy', ascending=False)

Unnamed: 0,total,pct,correct,incorrect,diff,tp,tn,fp,fn,accuracy,recall,specificity,precision,f1
(body) 00,242,0.004775,238,4,234,2,236,0,4,0.983471,0.333333,1.0,1.0,0.5
(body) 30,243,0.004795,237,6,231,7,230,2,4,0.975309,0.636364,0.991379,0.777778,0.7
(body) 2019,80,0.001579,78,2,76,5,73,1,1,0.975,0.833333,0.986486,0.833333,0.833333
(body) 45,58,0.001144,56,2,54,1,55,2,0,0.965517,1.0,0.964912,0.333333,0.5
(body) 07,70,0.001381,66,4,62,2,64,0,4,0.942857,0.333333,1.0,1.0,0.5
(body) 2018,99,0.001954,93,6,87,2,91,2,4,0.939394,0.333333,0.978495,0.5,0.4
(body) heard,53,0.001046,49,4,45,29,20,2,2,0.924528,0.935484,0.909091,0.935484,0.935484
(body) native,73,0.00144,67,6,61,63,4,4,2,0.917808,0.969231,0.5,0.940299,0.954545
(body) figure,79,0.001559,72,7,65,60,12,2,5,0.911392,0.923077,0.857143,0.967742,0.944882
(body) inner,52,0.001026,47,5,42,22,25,5,0,0.903846,1.0,0.833333,0.814815,0.897959


In [61]:
# TITLE
title_acc_filt = title_metrics['accuracy'] > 0.80
title_tot_filt = title_metrics['total'] > 10
title_metrics[title_acc_filt & title_tot_filt].sort_values('accuracy', ascending=False)

Unnamed: 0,total,pct,correct,incorrect,diff,tp,tn,fp,fn,accuracy,recall,specificity,precision,f1
(title) opposition,12,0.000237,11,1,10,2,9,0,1,0.916667,0.666667,1.0,1.0,0.8
(title) things,12,0.000237,11,1,10,5,6,1,0,0.916667,1.0,0.857143,0.833333,0.909091
(title) making,12,0.000237,11,1,10,3,8,0,1,0.916667,0.75,1.0,1.0,0.857143
(title) conjunctions,11,0.000217,10,1,9,2,8,1,0,0.909091,1.0,0.888889,0.666667,0.8
(title) way,21,0.000414,19,2,17,11,8,1,1,0.904762,0.916667,0.888889,0.916667,0.916667
(title) question,20,0.000395,18,2,16,0,18,0,2,0.9,0.0,1.0,,
(title) anyone else,19,0.000375,17,2,15,14,3,2,0,0.894737,1.0,0.6,0.875,0.933333
(title) ascendant,18,0.000355,16,2,14,9,7,0,2,0.888889,0.818182,1.0,1.0,0.9
(title) interpret,16,0.000316,14,2,12,2,12,0,2,0.875,0.5,1.0,1.0,0.666667
(title) books,23,0.000454,20,3,17,1,19,0,3,0.869565,0.25,1.0,1.0,0.4


In [62]:
# BODY
body_acc_filt = body_metrics['accuracy'] > 0.82
body_tot_filt = body_metrics['total'] > 35
body_metrics[body_acc_filt & body_tot_filt].sort_values('accuracy', ascending=False)

Unnamed: 0,total,pct,correct,incorrect,diff,tp,tn,fp,fn,accuracy,recall,specificity,precision,f1
(body) 18,43,0.000849,43,0,43,6,37,0,0,1.0,1.0,1.0,1.0,1.0
(body) 00,242,0.004775,238,4,234,2,236,0,4,0.983471,0.333333,1.0,1.0,0.5
(body) 30,243,0.004795,237,6,231,7,230,2,4,0.975309,0.636364,0.991379,0.777778,0.7
(body) 2019,80,0.001579,78,2,76,5,73,1,1,0.975,0.833333,0.986486,0.833333,0.833333
(body) 45,58,0.001144,56,2,54,1,55,2,0,0.965517,1.0,0.964912,0.333333,0.5
(body) creative,49,0.000967,47,2,45,23,24,2,0,0.959184,1.0,0.923077,0.92,0.958333
(body) rising sign,37,0.00073,35,2,33,29,6,1,1,0.945946,0.966667,0.857143,0.966667,0.966667
(body) 07,70,0.001381,66,4,62,2,64,0,4,0.942857,0.333333,1.0,1.0,0.5
(body) 2018,99,0.001954,93,6,87,2,91,2,4,0.939394,0.333333,0.978495,0.5,0.4
(body) heard,53,0.001046,49,4,45,29,20,2,2,0.924528,0.935484,0.909091,0.935484,0.935484


#### Conclusions

Combining body text, title text, media indicator, and word counts resulted in the highest score, but reviewing the words with the most accurate predictions doesn't reveal a strong connection between meaningful words and engagement. For example, all 43 test posts with "18" in the body text were classified correctly, and 37 of them were low-engagement, but common sense says that's probably random circumstance.

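Other words and phrases are easier to interpret, though not always in the direction I expected. I expected "question" in a title to drive engagement, since it is a call to action, but all 20 test posts with "question" in the title were predicted as low-engagement, and 18 of them were. Titles which mention signs (e.g. "sun sign") seem to be more engaging, as do titles that mention "relationship".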

The fact that title metrics show far fewer of the "random" predictive words suggests that making inferences from a title-based model would make more sense and carry a more obvious intention. I would rather make decisions about engaging content based on the title model than the body model.

Of course, these results are taking the entire text into account - the title and body texts are being modeled together, in conjunction with word counts and media indicators. The previous models may be more accurate when looking at title and body separately.
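The per-word tables above report accuracy among posts that contain a given word, which is not the same as the word predicting high engagement. The sketch below recomputes the metric columns from the `tp`/`tn`/`fp`/`fn` counts of two rows and adds a `positive_share` value, the fraction of those posts that were actually high-engagement. The function is an illustrative assumption, not the code behind `ipyutils.wc_metrics`, and `positive_share` is not a column in the original tables.

```python
import math


def metrics_from_counts(tp, tn, fp, fn):
    """Derive summary metrics from confusion-matrix counts; NaN when undefined."""
    def ratio(num, den):
        return num / den if den else math.nan

    total = tp + tn + fp + fn
    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "accuracy": ratio(tp + tn, total),
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "precision": precision,
        "f1": ratio(2 * precision * recall, precision + recall),
        # Share of posts containing the word that were truly high-engagement
        "positive_share": ratio(tp + fn, total),
    }


# Counts copied from the "(body) 18" and "(title) question" rows above
body_18 = metrics_from_counts(tp=6, tn=37, fp=0, fn=0)
title_question = metrics_from_counts(tp=0, tn=18, fp=0, fn=2)
print(body_18)
print(title_question)
```

Both words score high on accuracy mainly because most posts containing them are low-engagement and the model predicts them as such.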

### 4b. Other Feature Explorations

Various non-modeling views on the data, including date groups.

#### Day Of The Week

In [63]:
df.groupby('day')['comments'].describe()

Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
0,698.0,37.614613,44.911348,0.0,10.0,22.0,48.0,442.0
1,584.0,28.755137,45.766167,0.0,6.0,14.0,31.0,561.0
2,1195.0,30.141423,47.413106,0.0,6.5,14.0,35.0,507.0
3,1080.0,57.019444,138.504473,0.0,9.0,25.0,64.25,3100.0
4,986.0,28.988844,53.898804,0.0,6.0,13.0,31.0,1075.0
5,2858.0,24.478307,71.773233,0.0,5.0,11.0,25.0,3138.0
6,1012.0,31.366601,44.11164,0.0,6.0,15.0,39.0,596.0


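**NOTES** Thursday (day 3) seems to be a hot day for comments, with the highest mean (57.0) and median (25). Its standard deviation (138.5) is also by far the largest, so the mean is pulled up by a few very large threads. The single largest thread (3,138 comments) was posted on a Saturday (day 5), not a Thursday.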
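The cell that creates the `day` column is not shown in this part of the notebook. Assuming it comes from pandas' `dt.dayofweek` (Monday is 0, Sunday is 6), the sketch below shows how the integer codes map to names, which makes tables like the one above easier to read. The dates are illustrative.

```python
import pandas as pd

# Three example dates: a Monday, a Thursday, and a Saturday
times = pd.to_datetime(pd.Series(["2022-09-05", "2022-09-08", "2022-09-10"]))

day_index = times.dt.dayofweek   # Monday=0 ... Sunday=6
day_name = times.dt.day_name()   # human-readable labels
print(pd.DataFrame({"day": day_index, "name": day_name}))
```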

#### Month

In [64]:
df.groupby('month')['comments'].describe() # september??

Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
month,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1,232.0,35.155172,46.704673,0.0,7.0,17.5,41.0,302.0
2,211.0,35.037915,55.930386,0.0,7.0,14.0,35.5,442.0
3,221.0,28.384615,40.605771,0.0,7.0,14.0,33.0,342.0
4,227.0,23.092511,29.126749,0.0,5.0,12.0,30.0,196.0
5,233.0,28.703863,33.165115,0.0,7.0,15.0,37.0,159.0
6,175.0,30.394286,41.814849,0.0,7.0,15.0,34.5,332.0
7,213.0,36.413146,55.160899,1.0,7.0,16.0,44.0,368.0
8,371.0,28.41779,43.926666,0.0,5.0,12.0,32.0,387.0
9,5637.0,33.625155,85.486574,0.0,6.0,14.0,35.0,3138.0
10,503.0,23.149105,38.298311,0.0,4.0,11.0,25.0,361.0


In [65]:
df[df['comments'] < 3000].groupby('month')['comments'].describe()

Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
month,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1,232.0,35.155172,46.704673,0.0,7.0,17.5,41.0,302.0
2,211.0,35.037915,55.930386,0.0,7.0,14.0,35.5,442.0
3,221.0,28.384615,40.605771,0.0,7.0,14.0,33.0,342.0
4,227.0,23.092511,29.126749,0.0,5.0,12.0,30.0,196.0
5,233.0,28.703863,33.165115,0.0,7.0,15.0,37.0,159.0
6,175.0,30.394286,41.814849,0.0,7.0,15.0,34.5,332.0
7,213.0,36.413146,55.160899,1.0,7.0,16.0,44.0,368.0
8,371.0,28.41779,43.926666,0.0,5.0,12.0,32.0,387.0
9,5635.0,32.53008,62.688941,0.0,6.0,14.0,35.0,1386.0
10,503.0,23.149105,38.298311,0.0,4.0,11.0,25.0,361.0


**NOTES** Months are pretty stable, except for two September posts with over 3,000 comments each (the September count drops from 5,637 to 5,635 once they are filtered out), at least one of them from September 2021. September is also heavily overrepresented, with over 5,500 posts with September dates.

However, medians and other percentiles are pretty steady from month to month, and monthly means stay between roughly 23 and 36 comments, so overall I'd say month of post does not significantly impact engagement.
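To illustrate why the median is the safer summary here, the toy example below shows how a single very large thread moves a group's mean while leaving its median almost unchanged. The numbers are made up, apart from reusing 3,138 as the outlier.

```python
import pandas as pd

toy = pd.DataFrame({
    "month": [8, 8, 8, 9, 9, 9, 9],
    "comments": [5, 12, 32, 6, 14, 35, 3138],
})

# Summary with the outlier included
summary = toy.groupby("month")["comments"].agg(["count", "mean", "median"])

# Same summary after dropping threads with 3,000 or more comments
trimmed = (
    toy[toy["comments"] < 3000]
    .groupby("month")["comments"]
    .agg(["count", "mean", "median"])
)
print(summary)
print(trimmed)
```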

#### Word and Character Counts

In [66]:
df[['title-wc','body-wc','title-cc','body-cc']].describe()

Unnamed: 0,title-wc,body-wc,title-cc,body-cc
count,8413.0,8413.0,8413.0,8413.0
mean,11.449186,116.413051,69.234399,694.706882
std,8.906362,306.130716,51.707936,1836.290407
min,0.0,0.0,2.0,0.0
25%,6.0,12.0,35.0,74.0
50%,9.0,43.0,55.0,255.0
75%,14.0,98.0,86.0,582.0
max,57.0,5406.0,300.0,33581.0


In [67]:
df.groupby('comments_gt_median')['title-wc'].describe()

Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
comments_gt_median,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
0,4271.0,11.485835,9.36105,1.0,5.0,9.0,14.0,57.0
1,4142.0,11.411395,8.412784,0.0,6.0,9.0,14.0,57.0


In [68]:
df.groupby('comments_gt_median')['title-cc'].describe()

Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
comments_gt_median,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
0,4271.0,70.153594,54.505113,2.0,34.0,54.0,87.0,300.0
1,4142.0,68.286577,48.643956,3.0,37.0,57.0,85.0,300.0


In [69]:
df.groupby('comments_gt_median')['body-wc'].describe()

Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
comments_gt_median,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
0,4271.0,112.572465,317.660147,0.0,14.0,43.0,92.0,5406.0
1,4142.0,120.37325,293.754575,0.0,10.0,43.0,104.0,4913.0


In [70]:
df.groupby('comments_gt_median')['body-cc'].describe()

Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
comments_gt_median,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
0,4271.0,672.4465,1905.921175,0.0,80.0,254.0,544.0,33581.0
1,4142.0,717.66055,1761.54712,0.0,62.0,256.0,614.0,28813.0


#### Conclusions

Word and character counts barely differ between the two engagement groups: the median title is 9 words and the median body is 43 words in both. There is no real conclusion to be drawn here.
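The four `describe()` calls above can be condensed into one comparison of group medians. This is a minimal sketch on toy data that reuses the column names from this notebook. The values are illustrative, not the real dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "comments_gt_median": [0, 0, 0, 1, 1, 1],
    "title-wc": [5, 9, 14, 6, 9, 14],
    "body-wc": [14, 43, 92, 10, 43, 104],
})

count_cols = ["title-wc", "body-wc"]
medians = toy.groupby("comments_gt_median")[count_cols].median()

# Difference between the high- and low-engagement groups, per column
median_gap = medians.loc[1] - medians.loc[0]
print(medians)
print(median_gap)
```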

### 5. Summary and Conclusions

Modeling was performed on various combinations of count-vectorized title and body text data, as well as with other features including:
- Media indicator: whether media (images or video) was present in the post
- Title and Body word and character counts

Model baseline was just about 50%, since the cutoff for high engagement is the median number of comments (the 50th percentile). Any model with an accuracy higher than 50% therefore fares better than baseline.
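For reference, the baseline can be computed directly from the class counts in the `comments_gt_median` tables above (4,271 low-engagement and 4,142 high-engagement posts). The sketch uses scikit-learn's `DummyClassifier`, which was not part of the original notebook, to always predict the majority class.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Class counts taken from the comments_gt_median tables
y = np.array([0] * 4271 + [1] * 4142)
X = np.zeros((len(y), 1))  # features are ignored by the dummy model

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
baseline_accuracy = baseline.score(X, y)
print(round(baseline_accuracy, 3))
```

Always predicting the larger class gives an accuracy of 4271 / 8413, or about 0.508.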

All models tested performed better than baseline, but only Random Forest Classification reached a 70% accuracy rate.

Models that focus solely on title or body vectorized text are the most interpretable. Models that combined text with other features carried far less inferential weight. If making a determination on what content to put in a Reddit post, I would go with the initial Random Forest models that used only the title text and body text, rather than a combination of both. However, as a check on a drafted post, the final Random Forest model that combines features could be useful to estimate the probability of getting higher engagement.

**Note**, however, that the final model's use of year as a feature may hurt accuracy on new posts, since most future posts are not posted in the past. I would refit the model on a feature set that does not include the year in order to retain predictive power for future posts.
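One possible way to drop year information, sketched under the assumption that it enters the model as vectorized tokens such as `(body) 2018` and `(body) 2019` (both appear in the metrics tables above). If the year is a separate column, dropping that column by name would be simpler. The toy frame and the regular expression are illustrative.

```python
import re

import pandas as pd

features = pd.DataFrame({
    "(body) 2018": [1, 0],
    "(body) 2019": [0, 2],
    "(body) rising sign": [1, 1],
    "(title) 12th house": [0, 1],
})

# Match column names that end in a four-digit year starting with 19 or 20
year_pattern = re.compile(r"\b(?:19|20)\d{2}$")
year_cols = [c for c in features.columns if year_pattern.search(c)]

features_no_year = features.drop(columns=year_cols)
print(year_cols)
print(list(features_no_year.columns))
```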

**Other Areas To Explore**

- One area I would have liked to explore further is whether there is an optimal length of time for getting a maximum number of comments on a post. This could inform how long one must wait to determine peak engagement on a post.
- Another thought: now that I have some words that look like good predictors, could I create a new model based only on these words and see whether it scores well? Or build a model on the predictions of my previous models? I feel like a boosting model would fit these criteria, though boosting models didn't score well in my original tests.
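The second idea could be prototyped with `CountVectorizer`'s `vocabulary` parameter, which restricts the features to a fixed word list. This is a minimal sketch with made-up titles and labels. The chosen words are borrowed from the title metrics table, and the choice of `LogisticRegression` is an assumption, not a recommendation from the results.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy titles and labels (1 = high engagement), for illustration only
titles = [
    "anyone else feel this way",
    "question about books",
    "ascendant in the 1st house way",
    "books to interpret a chart",
]
labels = [1, 0, 1, 0]

# Hand-picked words; ngram_range=(1, 2) lets the bigram "anyone else" match
chosen_words = ["anyone else", "ascendant", "books", "question", "way"]

pipe = Pipeline([
    ("cv", CountVectorizer(vocabulary=chosen_words, ngram_range=(1, 2))),
    ("clf", LogisticRegression()),
])
pipe.fit(titles, labels)

n_features = pipe.named_steps["cv"].transform(titles).shape[1]
print(n_features)
```

Because the word list would be chosen by looking at test-set metrics, such a model would need a fresh holdout set to be scored fairly.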

In [71]:
# END