# Models without `body`
#### Making models based on the other features.

##### TABLE OF CONTENTS
 - [Observations and Overview for Models without `body`](#Observations-and-Overview-for-Models-without-body)
 - [Import and Define our Variables for Models without `body`](#Import-and-Define-our-Variables-for-Models-without-body)
 - [Make Regression Class with GridSearch](#Make-Regression-Class-with-GridSearch)
 - [Regression Models](#Regression-Models)
 - [More Regression Models](#More-Regression-Models)
 - [Classification Models without `body`](#Classification-Models-without-body)
 - [Random Forest Models without `body`](#Random-Forest-Models-without-body)
 - [Decision Tree without `body`](#Decision-Tree-without-body)
 - [Bagging without `body`](#Bagging-without-body)
 - [Ada Boosting without `body`](#Ada-Boosting-without-body)
 - [Gradient Boosting without `body`](#Gradient-Boosting-without-body)
 - [Logistic Regression without `body`](#Logistic-Regression-without-body)
 - [Compare Train/Test Models without `body`](#Compare-Train/Test-Models-without-body)
 - [Compare Xy/new Models without `body`](#Compare-Xy/new-Models-without-body)


### Observations and Overview for Models without `body`
[(back to top)](#Models-without-body) <br />

I realized that there are about half a dozen features that are not being used. They probably do not carry much information, but they carry *some*. In this notebook I wanted to investigate these features; unfortunately, that means removing `body` from our features.

Features that I included in this Notebook:
 - author_premium
 - is_submitter
 - no_follow
 - score
 - send_replies
 - total_awards_received

Given more time I *might* have been able to figure out how to combine the predicted probabilities from this Notebook with the probabilities from the Models in the other notebooks, which focus solely on `body` (the actual comment). Perhaps something like a Voting Ensemble that weighs each probability would be helpful. I would need a way to match the indices of each model's predictions to do this. I do something similar at the end of my [Conclusion](#08_conclusion.ipynb), where I compare the Logistic Regression and Bayesian predictions of which subreddit a comment came from against the true value. But as I mentioned, by the time I had worked the process out in my head I no longer had the time to put it all together.
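
A rough sketch of that idea, assuming each model's positive-class probabilities are stored in a pandas Series indexed by the same comment index. The function `soft_vote`, the model names, the weights, and the probabilities are all made up for illustration; none of this was run against the project data.

```python
import pandas as pd


def soft_vote(prob_series, weights):
    """Weighted average of positive-class probabilities, aligned on index."""
    # The inner join keeps only the comments that every model has scored.
    aligned = pd.concat(prob_series, axis=1, join="inner")
    w = pd.Series(weights, index=aligned.columns, dtype=float)
    # Multiply each model's column by its weight, then normalise by the total weight.
    return (aligned * w).sum(axis=1) / w.sum()


# Hypothetical probabilities from a text model and a metadata model.
text_probs = pd.Series({0: 0.9, 1: 0.2, 2: 0.6}, name="text_model")
meta_probs = pd.Series({2: 0.4, 0: 0.7, 3: 0.1}, name="metadata_model")

blended = soft_vote([text_probs, meta_probs], weights=[0.75, 0.25])
labels = (blended > 0.5).astype(int)
print(blended)
print(labels)
```

Only comments 0 and 2 appear in both Series, so only those two receive a blended probability. The weights would need to be tuned on held-out data.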

For my process of using non-text features, I first tried Linear Regression and a few other Regression Models, which performed poorly at best. I tweaked my Regression class to optionally round the outcome, based on whether the predicted y-value was greater than a certain threshold. The Class I created is not perfect and likely would not have any use beyond this scenario (nor was it useful in this scenario), but it gave interesting results. I was also curious what the results would look like if I inverted them. This was not something I was seriously entertaining, just something I wanted to try to see what would happen. All I know is that it did __not__ improve anything. So I (not so quickly) abandoned the idea of using Linear Regression Models, but I left the results in so you can see what they look like.
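
The optional rounding and inversion described above can be sketched roughly as follows. This is a minimal illustration, not the code from the functions Notebook: the helper name `to_labels` and its arguments are hypothetical, and it assumes that a prediction strictly greater than the threshold maps to class 1.

```python
import numpy as np


def to_labels(y_pred, threshold=0.5, invert=False):
    """Turn continuous regression outputs into 0/1 class labels."""
    y_pred = np.asarray(y_pred, dtype=float)
    # Anything strictly above the threshold becomes class 1, the rest class 0.
    labels = (y_pred > threshold).astype(int)
    if invert:
        # Flip the classes: 0 becomes 1 and 1 becomes 0.
        labels = 1 - labels
    return labels


preds = [0.12, 0.49, 0.51, 0.97]
print(to_labels(preds))
print(to_labels(preds, invert=True))
```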

For my Classification Models with this non-text data, the results were mixed and inconsistent. Most models did better than guessing, and they did better than some of the first iterations of my Bayes Model, but neither of those statements is saying much.

Surprisingly, Logistic Regression had an excellent F1 Score and Recall Score, but when you look at the distribution of predictions it is __heavily__ biased towards the AMA subreddit, making it pretty much unusable. In fact, most of these models had a large bias in one direction or the other. The exception is Gradient Boosting, which surprisingly had a decent F1 score of 0.6523 and a bias on par with the Linear Regression Model that uses the `body` feature. I was hesitant to use it without further testing, but if I were able to merge this model and/or its predictions with the other models, Gradient Boosting would be my choice out of these Models.
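
One simple way to put a number on that kind of bias is to compare the share of predictions assigned to a class with the share of true labels in that class. The sketch below uses only the standard library. The function names and the toy labels are hypothetical and are not taken from the project data.

```python
from collections import Counter


def class_shares(labels):
    """Fraction of the labels that fall in each class."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: counts[cls] / total for cls in sorted(counts)}


def prediction_bias(y_true, y_pred, positive=1):
    """Predicted share of the positive class minus its true share."""
    predicted = class_shares(y_pred).get(positive, 0.0)
    actual = class_shares(y_true).get(positive, 0.0)
    return predicted - actual


# Toy example: balanced truth, but the model predicts class 1 most of the time.
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 1, 1, 1, 1, 0, 1, 1]
print(class_shares(y_pred))
print(prediction_bias(y_true, y_pred))
```

A value near zero means the model predicts each class about as often as it occurs. A large positive or negative value points to the kind of one-sided predictions described above, even when F1 and Recall look good.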



### Import and Define our Variables for Models without `body`
[(back to top)](#Models-without-body) <br />


In [1]:
from ipynb.fs.full.functions import *

In [2]:
# Data to create our model
dfa = pd.read_csv('../data/ama_comments.csv')
dfb = pd.read_csv('../data/ar_comments.csv')
df = pd.concat([dfa, dfb], axis=0)
df = CleanUp(df).df.copy()

In [3]:
# Model X, and y
df = df.sample(n=df.shape[0], random_state=3)

X = df.drop(columns=['subreddit_binary', 'body'])
y = df['subreddit_binary']

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=3)

In [4]:
# TEST data (not part of train/test/split)
df1 = pd.read_csv('../data/2021-04-27_1812_AMA_comments.csv')
df2 = pd.read_csv('../data/2021-04-27_1812_AskReddit_comments.csv')
df_test_pred = pd.concat([df1, df2], axis=0)
df_test_pred = CleanUp(df_test_pred).df.copy()

In [5]:
df_test_pred = df_test_pred.sample(n=df_test_pred.shape[0], random_state=3)

X_new = df_test_pred.drop(columns=['subreddit_binary', 'body'])
y_new = df_test_pred['subreddit_binary'] 

### Make Regression Class with GridSearch
[(back to top)](#Models-without-body) <br />

A quick note: I made a RegressionModel Class as well as a ClassificationModel Class to help keep this information organized and in DataFrames. Please head over to the functions Notebook to see the code for both of them.
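
For readers without the functions Notebook at hand, here is a minimal sketch of what such a wrapper could look like. It is a hypothetical stand-in (`RegressionSketch`), not the actual RegressionModel Class: it assumes the class runs a `GridSearchCV` over the pipeline, stores the test predictions, and collects a few metrics in a one-column DataFrame named after `mod_name`. The data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


class RegressionSketch:
    """Hypothetical stand-in for the notebook's RegressionModel Class."""

    def __init__(self, pipeline, X_train, X_test, y_train, y_test,
                 params, mod_name="model"):
        # Grid search over the pipeline's hyperparameters with 5-fold CV.
        self.model = GridSearchCV(pipeline, param_grid=params, cv=5)
        self.model.fit(X_train, y_train)
        self.y_pred = self.model.predict(X_test)
        mse = mean_squared_error(y_test, self.y_pred)
        # One column per model, so results can be joined with pd.concat(axis=1).
        self.df = pd.Series(
            {
                "R2 Score": r2_score(y_test, self.y_pred),
                "RMSE": mse ** 0.5,
                "MSE": mse,
                "MAE": mean_absolute_error(y_test, self.y_pred),
            },
            name=mod_name,
        ).to_frame()


# Synthetic binary target, only to show the call pattern.
rng = np.random.default_rng(3)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

sketch = RegressionSketch(
    make_pipeline(StandardScaler(), ElasticNet()),
    X_demo[:150], X_demo[150:], y_demo[:150], y_demo[150:],
    params={"elasticnet__alpha": [0.005, 0.01, 0.05, 0.1]},
    mod_name="demo eNet",
)
print(sketch.model.best_params_)
print(sketch.df)
```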

# Regression Models
[(back to top)](#Models-without-body) <br />

#### Since the features are all converted to numeric form, trying out Linear Regression
This ends badly since I will have to round the predictions... but hey, why not try it!

<h2> (gridsearch) OneHotEncoder(), KNNImputer(), StandardScaler(), PolynomialFeatures(), ElasticNet() </h2>


In [6]:
gs_enet = RegressionModel(make_pipeline(
    OneHotEncoder(),
    KNNImputer(), 
    StandardScaler(),
    PolynomialFeatures(),
#     Ridge()), 
    ElasticNet()), 
    X_train, X_test, y_train, y_test,
    params={
#         'ridge__alpha': [ 0.005, 0.01, 0.05, 0.1 ],
        'elasticnet__alpha': [ 0.005, 0.01, 0.05, 0.1 ],
        'elasticnet__l1_ratio': [ 0.025, 0.05,  0.075, 0.1, 0.2, 0.3 ]
}, verbose=3, round_y_threshold=-1, invert_y=False, mod_name='Train/Test eNet')

gs_enet.check_y(y_test, gs_enet.y_pred) # Like value_counts() but for np_array
print(gs_enet.model.best_score_)
print(gs_enet.model.best_params_)

Fitting 5 folds for each of 24 candidates, totalling 120 fits
[[   0]
 [2457]]
0.16408494376366212
{'elasticnet__alpha': 0.1, 'elasticnet__l1_ratio': 0.075}
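
The `Fitting 5 folds for each of 24 candidates, totalling 120 fits` line follows directly from the grid: 4 values of `alpha` times 6 values of `l1_ratio`, each fitted once per cross-validation fold. A small sketch of that arithmetic using only the standard library (the 5 folds are taken from the log line above):

```python
from itertools import product

params = {
    "elasticnet__alpha": [0.005, 0.01, 0.05, 0.1],
    "elasticnet__l1_ratio": [0.025, 0.05, 0.075, 0.1, 0.2, 0.3],
}
n_folds = 5  # as reported in the grid-search log

# Every combination of one value per hyperparameter is one candidate.
candidates = [dict(zip(params, combo)) for combo in product(*params.values())]
n_fits = len(candidates) * n_folds

print(len(candidates), n_fits)
```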


In [7]:
gs_enet_r = RegressionModel(make_pipeline(
    OneHotEncoder(),
    KNNImputer(), 
    StandardScaler(),
    PolynomialFeatures(),
#     Ridge()), 
    ElasticNet()), 
    X_train, X_test, y_train, y_test,
    params={
#         'ridge__alpha': [ 0.005, 0.01, 0.05, 0.1 ],
        'elasticnet__alpha': [ 0.005, 0.01, 0.05, 0.1 ],
        'elasticnet__l1_ratio': [ 0.025, 0.05,  0.075, 0.1, 0.2, 0.3 ]
}, verbose=3, round_y_threshold=0.5, invert_y=False, mod_name='Train/Test eNet Rounding')

gs_enet_r.check_y(y_test, gs_enet_r.y_pred)
print(gs_enet_r.model.best_score_)
print(gs_enet_r.model.best_params_)
# gs_enet.df

Fitting 5 folds for each of 24 candidates, totalling 120 fits
[[   0    1]
 [ 812 1645]]
0.16408494376366212
{'elasticnet__alpha': 0.1, 'elasticnet__l1_ratio': 0.075}


In [8]:
X_gs_enet = RegressionModel(make_pipeline(
    OneHotEncoder(),
    KNNImputer(), 
    StandardScaler(),
    PolynomialFeatures(),
#     Ridge()), 
    ElasticNet()), 
    X, X_new, y, y_new,
    params={
#         'ridge__alpha': [ 0.005, 0.01, 0.05, 0.1 ],
        'elasticnet__alpha': [ 0.005, 0.01, 0.05, 0.1 ],
        'elasticnet__l1_ratio': [ 0.025, 0.05,  0.075, 0.1, 0.2, 0.3 ]
}, verbose=3, round_y_threshold=-1, invert_y=False, mod_name='Xy/new eNet')

X_gs_enet.check_y(y_new, X_gs_enet.y_pred)
print(X_gs_enet.model.best_score_)
print(X_gs_enet.model.best_params_)
# X_gs_enet.df

Fitting 5 folds for each of 24 candidates, totalling 120 fits
[[   0]
 [1962]]
0.169417031503531
{'elasticnet__alpha': 0.05, 'elasticnet__l1_ratio': 0.1}


In [9]:
X_gs_enet_r = RegressionModel(make_pipeline(
    OneHotEncoder(),
    KNNImputer(), 
    StandardScaler(),
    PolynomialFeatures(),
#     Ridge()), 
    ElasticNet()), 
    X, X_new, y, y_new,
    params={
#         'ridge__alpha': [ 0.005, 0.01, 0.05, 0.1 ],
        'elasticnet__alpha': [ 0.005, 0.01, 0.05, 0.1 ],
        'elasticnet__l1_ratio': [ 0.025, 0.05,  0.075, 0.1, 0.2, 0.3 ]
}, verbose=3, round_y_threshold=0.5, invert_y=False, mod_name='Xy/new eNet Rounding')

X_gs_enet_r.check_y(y_new, X_gs_enet_r.y_pred)
print(X_gs_enet_r.model.best_score_)
print(X_gs_enet_r.model.best_params_)
# X_gs_enet.df

Fitting 5 folds for each of 24 candidates, totalling 120 fits
[[   0    1]
 [ 681 1281]]
0.169417031503531
{'elasticnet__alpha': 0.05, 'elasticnet__l1_ratio': 0.1}


In [10]:
compare_enet = pd.concat([gs_enet.df, gs_enet_r.df, X_gs_enet.df, X_gs_enet_r.df], axis=1)
compare_enet

| Metric | Train/Test eNet | Train/Test eNet Rounding | Xy/new eNet | Xy/new eNet Rounding |
| --- | --- | --- | --- | --- |
| R2 Score | 0.182128 | -0.321939 | 0.167097 | -0.388415 |
| RMSE | 0.452181 | 0.574878 | 0.456312 | 0.589148 |
| MSE | 0.204468 | 0.330484 | 0.20822 | 0.347095 |
| MAE | 0.417468 | 0.330484 | 0.423481 | 0.347095 |
| Train R2 Score | 0.166212 | -0.367708 | 0.172344 | -0.356672 |
| Train RMSE | 0.45656 | 0.584745 | 0.454878 | 0.582381 |
| Train MSE | 0.208447 | 0.341927 | 0.206914 | 0.339168 |
| Train MAE | 0.422775 | 0.341927 | 0.418257 | 0.339168 |
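
Two patterns in the Rounding columns above have a simple explanation. Once predictions are rounded to 0 or 1, every error is exactly 1, so squared error and absolute error coincide and both equal the misclassification rate (hence MSE = MAE). And since R² = 1 − MSE / Var(y), with Var(y) = p(1 − p) ≈ 0.25 for roughly balanced classes, R² ≈ 1 − 4 · error rate, which goes negative as soon as more than a quarter of the predictions are wrong. The sketch below assumes an exact 50/50 class split, which is only approximately true here:

```python
def r2_from_error_rate(error_rate, positive_share=0.5):
    """R2 of 0/1 predictions on a 0/1 target, given the misclassification rate."""
    # Population variance of a binary target with the given positive share.
    variance = positive_share * (1 - positive_share)
    return 1 - error_rate / variance


# MSE (= MAE = error rate) of the 'Train/Test eNet Rounding' column.
mse_rounded = 0.330484

r2 = r2_from_error_rate(mse_rounded)
accuracy = 1 - mse_rounded
print(round(r2, 4), round(accuracy, 4))
```

Under that assumption the formula lands very close to the reported R2 Score of -0.321939, and it implies an accuracy of roughly 0.67 for the rounded ElasticNet predictions.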


## More Regression Models
[(back to top)](#Models-without-body) <br />


<h2> (gridsearch) OneHotEncoder(), KNNImputer(), StandardScaler(), PolynomialFeatures(), LinearRegression() </h2>


In [11]:
gs_lr = RegressionModel(make_pipeline(
    OneHotEncoder(),
    KNNImputer(), 
    StandardScaler(),
    PolynomialFeatures(),
    LinearRegression()), 
    X_train, X_test, y_train, y_test,
    params={
}, verbose=3, round_y_threshold=-1, invert_y=False, mod_name='Train/Test LinReg')

gs_lr.check_y(y_test, gs_lr.y_pred)
print(gs_lr.model.best_score_)
print(gs_lr.model.best_params_)
# gs_lr.df

Fitting 5 folds for each of 1 candidates, totalling 5 fits
[[   0]
 [2457]]
-1.1572981431334531e+21
{}


In [12]:
gs_lr_r = RegressionModel(make_pipeline(
    OneHotEncoder(),
    KNNImputer(), 
    StandardScaler(),
    PolynomialFeatures(),
    LinearRegression()), 
    X_train, X_test, y_train, y_test,
    params={
}, verbose=3, round_y_threshold=0.5, invert_y=False, mod_name='Train/Test LinReg Rounding')

gs_lr_r.check_y(y_test, gs_lr_r.y_pred)
print(gs_lr_r.model.best_score_)
print(gs_lr_r.model.best_params_)
# gs_lr_r.df

Fitting 5 folds for each of 1 candidates, totalling 5 fits
[[   0    1]
 [ 821 1636]]
-1.1572981431334531e+21
{}


In [13]:
X_gs_lr = RegressionModel(make_pipeline(
    OneHotEncoder(),
    KNNImputer(), 
    StandardScaler(),
    PolynomialFeatures(),
    LinearRegression()), 
    X_train=X, X_test=X_new, y_train=y, y_test=y_new,
    params={
}, verbose=3, round_y_threshold=-1, invert_y=False, mod_name='Xy/new LinReg')

X_gs_lr.check_y(y_new, X_gs_lr.y_pred)
print(X_gs_lr.model.best_score_)
print(X_gs_lr.model.best_params_)
# X_gs_lr.df

Fitting 5 folds for each of 1 candidates, totalling 5 fits
[[   0]
 [1962]]
-0.3014890858486563
{}


In [14]:
X_gs_lr_r = RegressionModel(make_pipeline(
    OneHotEncoder(),
    KNNImputer(), 
    StandardScaler(),
    PolynomialFeatures(),
    LinearRegression()), 
    X_train=X, X_test=X_new, y_train=y, y_test=y_new,
    params={
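    # NOTE: round_y_threshold is -1 here, the same value as the unrounded 'Xy/new LinReg' run,
    # and the output below matches that run; the other Rounding runs use 0.5.
}, verbose=3, round_y_threshold=-1, invert_y=False, mod_name='Xy/new LinReg Rounding')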

X_gs_lr_r.check_y(y_new, X_gs_lr_r.y_pred)
print(X_gs_lr_r.model.best_score_)
print(X_gs_lr_r.model.best_params_)
# X_gs_lr_r.df

Fitting 5 folds for each of 1 candidates, totalling 5 fits
[[   0]
 [1962]]
-0.3014890858486563
{}


In [15]:
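# NOTE: this concatenates the Train/Test frames twice; the Xy/new frames
# (X_gs_lr.df, X_gs_lr_r.df) are not included, so the table below has duplicate columns.
compare_lr = pd.concat([gs_lr_r.df, gs_lr.df, gs_lr_r.df, gs_lr.df], axis=1)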
compare_lr

| Metric | Train/Test LinReg Rounding | Train/Test LinReg | Train/Test LinReg Rounding | Train/Test LinReg |
| --- | --- | --- | --- | --- |
| R2 Score | -0.336591 | 0.168729 | -0.336591 | 0.168729 |
| RMSE | 0.578055 | 0.45587 | 0.578055 | 0.45587 |
| MSE | 0.334147 | 0.207817 | 0.334147 | 0.207817 |
| MAE | 0.334147 | 0.413926 | 0.334147 | 0.413926 |
| Train R2 Score | -0.380191 | 0.165523 | -0.380191 | 0.165523 |
| Train RMSE | 0.587407 | 0.456748 | 0.587407 | 0.456748 |
| Train MSE | 0.345047 | 0.208619 | 0.345047 | 0.208619 |
| Train MAE | 0.345047 | 0.418083 | 0.345047 | 0.418083 |


In [16]:
compare_regression = pd.concat([compare_enet, compare_lr], axis=1)
compare_regression

| Metric | Train/Test eNet | Train/Test eNet Rounding | Xy/new eNet | Xy/new eNet Rounding | Train/Test LinReg Rounding | Train/Test LinReg | Train/Test LinReg Rounding | Train/Test LinReg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| R2 Score | 0.182128 | -0.321939 | 0.167097 | -0.388415 | -0.336591 | 0.168729 | -0.336591 | 0.168729 |
| RMSE | 0.452181 | 0.574878 | 0.456312 | 0.589148 | 0.578055 | 0.45587 | 0.578055 | 0.45587 |
| MSE | 0.204468 | 0.330484 | 0.20822 | 0.347095 | 0.334147 | 0.207817 | 0.334147 | 0.207817 |
| MAE | 0.417468 | 0.330484 | 0.423481 | 0.347095 | 0.334147 | 0.413926 | 0.334147 | 0.413926 |
| Train R2 Score | 0.166212 | -0.367708 | 0.172344 | -0.356672 | -0.380191 | 0.165523 | -0.380191 | 0.165523 |
| Train RMSE | 0.45656 | 0.584745 | 0.454878 | 0.582381 | 0.587407 | 0.456748 | 0.587407 | 0.456748 |
| Train MSE | 0.208447 | 0.341927 | 0.206914 | 0.339168 | 0.345047 | 0.208619 | 0.345047 | 0.208619 |
| Train MAE | 0.422775 | 0.341927 | 0.418257 | 0.339168 | 0.345047 | 0.418083 | 0.345047 | 0.418083 |


# Classification Models without `body`
[(back to top)](#Models-without-body) <br />


## Random Forest Models without `body`
[(back to top)](#Models-without-body) <br />

<h2> (gridsearch) RandomForestClassifier() </h2>


In [17]:
gs_rfc = ClassificationModel(make_pipeline(
    RandomForestClassifier()), 
    X_train, X_test, y_train, y_test,
    params={
        'randomforestclassifier__criterion': [ 'gini', 'entropy' ],
        'randomforestclassifier__max_depth': [ None, 1, 2, 3 ],
        'randomforestclassifier__random_state': [ 3 ]
}, verbose=3, mod_name='Train/Test RFC')

print(gs_rfc.model.best_score_)
print(gs_rfc.model.best_params_)
# gs_rfc.df

Fitting 5 folds for each of 8 candidates, totalling 40 fits
0.6611940298507462
{'randomforestclassifier__criterion': 'gini', 'randomforestclassifier__max_depth': 3, 'randomforestclassifier__random_state': 3}


In [18]:
X_gs_rfc = ClassificationModel(make_pipeline(
    RandomForestClassifier()), 
    X_train=X, X_test=X_new, y_train=y, y_test=y_new,
    params={
        'randomforestclassifier__criterion': [ 'gini', 'entropy' ],
        'randomforestclassifier__max_depth': [ None, 1, 2, 3 ],
        'randomforestclassifier__random_state': [ 3 ]
}, verbose=3, mod_name='Xy/new RFC')

print(X_gs_rfc.model.best_score_)
print(X_gs_rfc.model.best_params_)
# X_gs_rfc.df

Fitting 5 folds for each of 8 candidates, totalling 40 fits
0.666330726679247
{'randomforestclassifier__criterion': 'gini', 'randomforestclassifier__max_depth': 3, 'randomforestclassifier__random_state': 3}


In [19]:
compare_rfc = pd.concat([gs_rfc.df, X_gs_rfc.df], axis=1)
compare_rfc

| Metric | Train/Test RFC | Xy/new RFC |
| --- | --- | --- |
| F1 Score | 0.681537 | 0.732052 |
| Recall Score | 0.670732 | 0.955943 |
| Accuracy | 0.686203 | 0.651886 |
| Balanced Accuracy | 0.686222 | 0.653428 |
| Precision Score | 0.692695 | 0.593134 |
| Average Precision Score | 0.748042 | 0.72012 |
| ROC AUC Score | 0.765069 | 0.751093 |
| True Positive | 861.0 | 346.0 |
| False Negative | 366.0 | 640.0 |
| False Positive | 405.0 | 43.0 |
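
One caution when reading the three count rows above: taken literally they do not reproduce the Recall and Precision rows (861 / (861 + 366) is about 0.70, not 0.67). They do reproduce them if the rows are read in the order that scikit-learn's `confusion_matrix(y_true, y_pred).ravel()` returns for a binary problem, which is `tn, fp, fn, tp`. The check below is a sketch under that assumption, using the Train/Test RFC column and the test-set size of 2457 shown in the regression outputs. How the ClassificationModel Class actually labels the counts is not shown in this notebook.

```python
# Test-set size, from the check_y output of the Train/Test regression runs.
n_test = 2457

# Rows labelled "True Positive", "False Negative", "False Positive" in the table,
# read here as tn, fp, fn (assumption: confusion_matrix(...).ravel() order).
tn, fp, fn = 861, 366, 405
tp = n_test - tn - fp - fn  # the fourth cell is not listed in the table

recall = tp / (tp + fn)
precision = tp / (tp + fp)
accuracy = (tp + tn) / n_test

print(tp, round(recall, 6), round(precision, 6), round(accuracy, 6))
```

Under this reading the derived values agree with the Recall Score, Precision Score, and Accuracy rows of the Train/Test RFC column to the precision shown.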


## Decision Tree without `body`
[(back to top)](#Models-without-body) <br />

<h2> (gridsearch) DecisionTreeClassifier() </h2>


In [20]:
gs_tree = ClassificationModel(make_pipeline(
    DecisionTreeClassifier()), 
    X_train, X_test, y_train, y_test,
    params={
        'decisiontreeclassifier__max_depth': [ 1, 3, 5, None ],
        'decisiontreeclassifier__criterion': [ 'gini', 'entropy' ]
}, verbose=3, mod_name='Train/Test dTree')

print(gs_tree.model.best_score_)
print(gs_tree.model.best_params_)
# gs_tree.df

Fitting 5 folds for each of 8 candidates, totalling 40 fits
0.6758480325644505
{'decisiontreeclassifier__criterion': 'gini', 'decisiontreeclassifier__max_depth': 5}


In [21]:
X_gs_tree = ClassificationModel(make_pipeline(
    DecisionTreeClassifier()), 
    X_train=X, X_test=X_new, y_train=y, y_test=y_new,
    params={
        'decisiontreeclassifier__max_depth': [ 1, 3, 5, None ],
        'decisiontreeclassifier__criterion': [ 'gini', 'entropy' ]
}, verbose=3, mod_name='Xy/new dTree')

print(X_gs_tree.model.best_score_)
print(X_gs_tree.model.best_params_)
# X_gs_tree.df

Fitting 5 folds for each of 8 candidates, totalling 40 fits
0.6807782169657719
{'decisiontreeclassifier__criterion': 'gini', 'decisiontreeclassifier__max_depth': 5}


In [22]:
compare_tree = pd.concat([gs_tree.df, X_gs_tree.df], axis=1)
compare_tree

| Metric | Train/Test dTree | Xy/new dTree |
| --- | --- | --- |
| F1 Score | 0.65 | 0.595849 |
| Recall Score | 0.570732 | 0.5 |
| Accuracy | 0.692308 | 0.662589 |
| Balanced Accuracy | 0.692456 | 0.661765 |
| Precision Score | 0.754839 | 0.73716 |
| Average Precision Score | 0.731528 | 0.706244 |
| ROC AUC Score | 0.773194 | 0.74277 |
| True Positive | 999.0 | 812.0 |
| False Negative | 228.0 | 174.0 |
| False Positive | 528.0 | 488.0 |


## Bagging without `body`
[(back to top)](#Models-without-body) <br />

<h2> (gridsearch) BaggingClassifier() </h2>


In [23]:
from sklearn.svm import SVC   # Saw this in the Bagging Classifier Docstring example

In [24]:
gs_bag = ClassificationModel(make_pipeline(
    BaggingClassifier()), 
    X_train, X_test, y_train, y_test,
    params={
        'baggingclassifier__n_estimators': [ 5, 8, 10, 12 ,15 ],
        'baggingclassifier__base_estimator': [ None, SVC() ],
}, verbose=3, mod_name='Train/Test Bagging')

print(gs_bag.model.best_score_)
print(gs_bag.model.best_params_)
# gs_bag.df

Fitting 5 folds for each of 10 candidates, totalling 50 fits
0.6417910447761195
{'baggingclassifier__base_estimator': None, 'baggingclassifier__n_estimators': 15}


In [25]:
X_gs_bag = ClassificationModel(make_pipeline(
    BaggingClassifier()), 
    X_train=X, X_test=X_new, y_train=y, y_test=y_new,
    params={
        'baggingclassifier__n_estimators': [ 5, 8, 10, 12 ,15 ],
        'baggingclassifier__base_estimator': [ None, SVC() ],
}, verbose=3, mod_name='Xy/new Bagging')

print(X_gs_bag.model.best_score_)
print(X_gs_bag.model.best_params_)
# X_gs_bag.df

Fitting 5 folds for each of 10 candidates, totalling 50 fits
0.6541190052780215
{'baggingclassifier__base_estimator': None, 'baggingclassifier__n_estimators': 12}


In [26]:
compare_bag = pd.concat([gs_bag.df, X_gs_bag.df], axis=1)
compare_bag

| Metric | Train/Test Bagging | Xy/new Bagging |
| --- | --- | --- |
| F1 Score | 0.665035 | 0.62863 |
| Recall Score | 0.662602 | 0.632172 |
| Accuracy | 0.665853 | 0.62844 |
| Balanced Accuracy | 0.665857 | 0.628459 |
| Precision Score | 0.667486 | 0.625127 |
| Average Precision Score | 0.690531 | 0.66111 |
| ROC AUC Score | 0.729711 | 0.693135 |
| True Positive | 821.0 | 616.0 |
| False Negative | 406.0 | 370.0 |
| False Positive | 415.0 | 359.0 |


## Ada Boosting without `body`
[(back to top)](#Models-without-body) <br />

<h2> (gridsearch) AdaBoostClassifier() </h2>


In [27]:
gs_ada = ClassificationModel(make_pipeline(
    AdaBoostClassifier()), 
    X_train, X_test, y_train, y_test,
    params={
        'adaboostclassifier__n_estimators': [ 25, 50, 75, 100 ]
}, verbose=3, mod_name='Train/Test adaBst')

print(gs_ada.model.best_score_)
print(gs_ada.model.best_params_)
# gs_ada.df

Fitting 5 folds for each of 4 candidates, totalling 20 fits
0.67516960651289
{'adaboostclassifier__n_estimators': 25}


In [28]:
X_gs_ada = ClassificationModel(make_pipeline(
    AdaBoostClassifier()), 
    X_train=X, X_test=X_new, y_train=y, y_test=y_new,
    params={
        'adaboostclassifier__n_estimators': [ 25, 50, 75, 100 ]
}, verbose=3, mod_name='Xy/new adaBst')

print(X_gs_ada.model.best_score_)
print(X_gs_ada.model.best_params_)
# X_gs_ada.df

Fitting 5 folds for each of 4 candidates, totalling 20 fits
0.6778296174922798
{'adaboostclassifier__n_estimators': 25}


In [29]:
compare_ada = pd.concat([gs_ada.df, X_gs_ada.df], axis=1)
compare_ada

| Metric | Train/Test adaBst | Xy/new adaBst |
| --- | --- | --- |
| F1 Score | 0.665185 | 0.609256 |
| Recall Score | 0.60813 | 0.532787 |
| Accuracy | 0.693529 | 0.660041 |
| Balanced Accuracy | 0.693633 | 0.659395 |
| Precision Score | 0.734053 | 0.711354 |
| Average Precision Score | 0.745402 | 0.694213 |
| ROC AUC Score | 0.772003 | 0.733476 |
| True Positive | 956.0 | 775.0 |
| False Negative | 271.0 | 211.0 |
| False Positive | 482.0 | 456.0 |


## Gradient Boosting without `body`
[(back to top)](#Models-without-body) <br />

<h2> (gridsearch) GradientBoostingClassifier() </h2>


In [30]:
gs_gb = ClassificationModel(make_pipeline(
    GradientBoostingClassifier()), 
    X_train, X_test, y_train, y_test,
    params={
        'gradientboostingclassifier__n_estimators': [ 100 ],
        'gradientboostingclassifier__learning_rate': [ 0.1, 0.25, 0.5, 0.75, 1.0 ],
        'gradientboostingclassifier__max_depth': [ 1, 3, 5 ]
}, verbose=3, mod_name='Train/Test gBoost')

print(gs_gb.model.best_score_)
print(gs_gb.model.best_params_)
# gs_gb.df

Fitting 5 folds for each of 15 candidates, totalling 75 fits
0.6826322930800542
{'gradientboostingclassifier__learning_rate': 0.1, 'gradientboostingclassifier__max_depth': 3, 'gradientboostingclassifier__n_estimators': 100}


In [31]:
X_gs_gb = ClassificationModel(make_pipeline(
    GradientBoostingClassifier()), 
    X_train=X, X_test=X_new, y_train=y, y_test=y_new,
    params={
        'gradientboostingclassifier__n_estimators': [ 100 ],
        'gradientboostingclassifier__learning_rate': [ 0.1, 0.25, 0.5, 0.75, 1.0 ],
        'gradientboostingclassifier__max_depth': [ 1, 3, 5 ]
}, verbose=3, mod_name='Xy/new gBoost')

print(X_gs_gb.model.best_score_)
print(X_gs_gb.model.best_params_)
# X_gs_gb.df

Fitting 5 folds for each of 15 candidates, totalling 75 fits
0.6869863506583938
{'gradientboostingclassifier__learning_rate': 0.1, 'gradientboostingclassifier__max_depth': 3, 'gradientboostingclassifier__n_estimators': 100}


In [32]:
compare_gb = pd.concat([gs_gb.df, X_gs_gb.df], axis=1)
compare_gb

| Metric | Train/Test gBoost | Xy/new gBoost |
| --- | --- | --- |
| F1 Score | 0.692461 | 0.652268 |
| Recall Score | 0.705691 | 0.618852 |
| Accuracy | 0.686203 | 0.671764 |
| Balanced Accuracy | 0.686179 | 0.671495 |
| Precision Score | 0.679718 | 0.689498 |
| Average Precision Score | 0.762319 | 0.733354 |
| ROC AUC Score | 0.775948 | 0.759623 |
| True Positive | 818.0 | 714.0 |
| False Negative | 409.0 | 272.0 |
| False Positive | 362.0 | 372.0 |


## Logistic Regression without `body`
[(back to top)](#Models-without-body) <br />

<h2> (gridsearch) LogisticRegression() </h2>


In [33]:
gs_lgr = ClassificationModel(make_pipeline(
    LogisticRegression()), 
    X_train, X_test, y_train, y_test,
    params={
        'logisticregression__C': [ 0.01, 0.1, 1, 10 ],
        'logisticregression__max_iter': [ 1000 ],
        'logisticregression__penalty': [ 'l1', 'l2', 'elasticnet', 'none' ]
}, verbose=3, mod_name='Train/Test LogReg')

print(gs_lgr.model.best_score_)
print(gs_lgr.model.best_params_)
# gs_lgr.df

Fitting 5 folds for each of 16 candidates, totalling 80 fits
0.657259158751696
{'logisticregression__C': 0.01, 'logisticregression__max_iter': 1000, 'logisticregression__penalty': 'none'}
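
A note on the penalty grid above. As far as the scikit-learn documentation describes, the default `lbfgs` solver of `LogisticRegression` supports only the `l2` penalty or no penalty, so the `l1` and `elasticnet` candidates probably failed to fit and were scored as NaN. `C` also has no effect when the penalty is `'none'`, which would explain a best `C` of 0.01 paired with no penalty. Below is a hedged sketch of a grid that pairs each penalty with a solver documented to support it. The `saga` sub-grids, the `l1_ratio` values, and the `SUPPORTED` lookup are illustrative choices, not something the notebook ran.

```python
# Penalties each solver supports, summarised by hand from the scikit-learn docs.
# Assumption: check the docs of the installed version, these options have changed over time.
SUPPORTED = {
    "lbfgs": {"l2"},
    "saga": {"l1", "l2", "elasticnet"},
}

common = {
    "logisticregression__C": [0.01, 0.1, 1, 10],
    "logisticregression__max_iter": [1000],
}

# One sub-grid per penalty/solver pairing instead of a single cross product.
param_grid = [
    {**common,
     "logisticregression__solver": ["lbfgs"],
     "logisticregression__penalty": ["l2"]},
    {**common,
     "logisticregression__solver": ["saga"],
     "logisticregression__penalty": ["l1"]},
    {**common,
     "logisticregression__solver": ["saga"],
     "logisticregression__penalty": ["elasticnet"],
     "logisticregression__l1_ratio": [0.25, 0.5, 0.75]},
]


def incompatible_pairs(grid):
    """List (solver, penalty) pairs in the grid that SUPPORTED does not allow."""
    bad = []
    for sub in grid:
        for solver in sub["logisticregression__solver"]:
            for penalty in sub["logisticregression__penalty"]:
                if penalty not in SUPPORTED[solver]:
                    bad.append((solver, penalty))
    return bad


print(incompatible_pairs(param_grid))
```

A list of dicts like `param_grid` can be passed as the parameter grid to `GridSearchCV`, which then searches each sub-grid separately.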


In [34]:
X_gs_lgr = ClassificationModel(make_pipeline(
    LogisticRegression()), 
    X_train=X, X_test=X_new, y_train=y, y_test=y_new,
    params={
        'logisticregression__C': [ 0.01, 0.1, 1, 10 ],
        'logisticregression__max_iter': [ 1000 ],
        'logisticregression__penalty': [ 'l1', 'l2', 'elasticnet', 'none' ]
}, verbose=3, mod_name='Xy/new LogReg')

print(X_gs_lgr.model.best_score_)
print(X_gs_lgr.model.best_params_)
# X_gs_lgr.df

Fitting 5 folds for each of 16 candidates, totalling 80 fits
0.6599189788749712
{'logisticregression__C': 1, 'logisticregression__max_iter': 1000, 'logisticregression__penalty': 'l2'}


In [35]:
compare_lgr = pd.concat([gs_lgr.df, X_gs_lgr.df], axis=1)
compare_lgr

| Metric | Train/Test LogReg | Xy/new LogReg |
| --- | --- | --- |
| F1 Score | 0.740268 | 0.730572 |
| Recall Score | 0.943089 | 0.94877 |
| Accuracy | 0.668702 | 0.651886 |
| Balanced Accuracy | 0.668366 | 0.653391 |
| Precision Score | 0.609244 | 0.59397 |
| Average Precision Score | 0.708126 | 0.678731 |
| ROC AUC Score | 0.73125 | 0.686902 |
| True Positive | 483.0 | 353.0 |
| False Negative | 744.0 | 633.0 |
| False Positive | 70.0 | 50.0 |


## Compare Train/Test Models without `body`
[(back to top)](#Models-without-body) <br />


In [36]:
compare_train_df = pd.concat([gs_tree.df, gs_bag.df, gs_ada.df, gs_gb.df, gs_rfc.df, gs_lgr.df], axis=1)
compare_train_df

| Metric | Train/Test dTree | Train/Test Bagging | Train/Test adaBst | Train/Test gBoost | Train/Test RFC | Train/Test LogReg |
| --- | --- | --- | --- | --- | --- | --- |
| F1 Score | 0.65 | 0.665035 | 0.665185 | 0.692461 | 0.681537 | 0.740268 |
| Recall Score | 0.570732 | 0.662602 | 0.60813 | 0.705691 | 0.670732 | 0.943089 |
| Accuracy | 0.692308 | 0.665853 | 0.693529 | 0.686203 | 0.686203 | 0.668702 |
| Balanced Accuracy | 0.692456 | 0.665857 | 0.693633 | 0.686179 | 0.686222 | 0.668366 |
| Precision Score | 0.754839 | 0.667486 | 0.734053 | 0.679718 | 0.692695 | 0.609244 |
| Average Precision Score | 0.731528 | 0.690531 | 0.745402 | 0.762319 | 0.748042 | 0.708126 |
| ROC AUC Score | 0.773194 | 0.729711 | 0.772003 | 0.775948 | 0.765069 | 0.73125 |
| True Positive | 999.0 | 821.0 | 956.0 | 818.0 | 861.0 | 483.0 |
| False Negative | 228.0 | 406.0 | 271.0 | 409.0 | 366.0 | 744.0 |
| False Positive | 528.0 | 415.0 | 482.0 | 362.0 | 405.0 | 70.0 |


## Compare Xy/new Models without `body`
[(back to top)](#Models-without-body) <br />


In [37]:
compare_X_df = pd.concat([X_gs_tree.df, X_gs_bag.df, X_gs_ada.df, X_gs_gb.df, X_gs_rfc.df, X_gs_lgr.df], axis=1)
compare_X_df

| Metric | Xy/new dTree | Xy/new Bagging | Xy/new adaBst | Xy/new gBoost | Xy/new RFC | Xy/new LogReg |
| --- | --- | --- | --- | --- | --- | --- |
| F1 Score | 0.595849 | 0.62863 | 0.609256 | 0.652268 | 0.732052 | 0.730572 |
| Recall Score | 0.5 | 0.632172 | 0.532787 | 0.618852 | 0.955943 | 0.94877 |
| Accuracy | 0.662589 | 0.62844 | 0.660041 | 0.671764 | 0.651886 | 0.651886 |
| Balanced Accuracy | 0.661765 | 0.628459 | 0.659395 | 0.671495 | 0.653428 | 0.653391 |
| Precision Score | 0.73716 | 0.625127 | 0.711354 | 0.689498 | 0.593134 | 0.59397 |
| Average Precision Score | 0.706244 | 0.66111 | 0.694213 | 0.733354 | 0.72012 | 0.678731 |
| ROC AUC Score | 0.74277 | 0.693135 | 0.733476 | 0.759623 | 0.751093 | 0.686902 |
| True Positive | 812.0 | 616.0 | 775.0 | 714.0 | 346.0 | 353.0 |
| False Negative | 174.0 | 370.0 | 211.0 | 272.0 | 640.0 | 633.0 |
| False Positive | 488.0 | 359.0 | 456.0 | 372.0 | 43.0 | 50.0 |
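
To make the comparison above less eyeball-driven, the best model per metric can be pulled out with `idxmax`. The sketch below re-enters three rows of the Xy/new table by hand. Which metrics to rank on is a judgment call, and F1 alone can mislead here because it rewards the models with very high recall.

```python
import pandas as pd

# Three rows of the Xy/new comparison table, copied by hand (models as the index).
compare = pd.DataFrame(
    {
        "F1 Score": [0.595849, 0.62863, 0.609256, 0.652268, 0.732052, 0.730572],
        "Balanced Accuracy": [0.661765, 0.628459, 0.659395, 0.671495, 0.653428, 0.653391],
        "ROC AUC Score": [0.74277, 0.693135, 0.733476, 0.759623, 0.751093, 0.686902],
    },
    index=["Xy/new dTree", "Xy/new Bagging", "Xy/new adaBst",
           "Xy/new gBoost", "Xy/new RFC", "Xy/new LogReg"],
)

# For each metric (column), the model (index label) with the highest value.
best = compare.idxmax()
print(best)
```

On these numbers F1 favours the Random Forest, while Balanced Accuracy and ROC AUC both favour Gradient Boosting.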
