# Modeling

This notebook contains all steps and decisions in the modeling phase of the pipeline.

---

## The Required Imports

Below are all the modules needed to run the code cells in this notebook.

In [48]:
from itertools import count

import pandas as pd

from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, BaggingClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.svm import SVC

from sklearn.feature_selection import RFE

from wrangle import wrangle_kepler_modeling
from model import *
from preprocessing import *
from baseline import establish_classification_baseline
from evaluate import *

# We'll be using this random seed throughout.
random_seed = 24

---

## Acquire, Prepare, and Split the Data

Let's acquire, prepare, and split our data using the wrangle module.

In [2]:
train, validate, test = wrangle_kepler_modeling()
train.shape, validate.shape, test.shape

((3376, 7), (1448, 7), (1207, 7))
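To make the split proportions explicit, the shapes above can be turned into shares of the full dataset. This is only a quick arithmetic check; how `wrangle_kepler_modeling` actually performs the split is not shown in this notebook, and the names `split_rows` and `split_shares` are introduced here for illustration.

```python
# Row counts copied from the shapes printed above.
split_rows = {'train': 3376, 'validate': 1448, 'test': 1207}

# Share of the full dataset that each split holds.
total_rows = sum(split_rows.values())
split_shares = {name: round(rows / total_rows, 3) for name, rows in split_rows.items()}

print(total_rows)
print(split_shares)
```

The shares come out to roughly 56% train, 24% validate, and 20% test.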

Let's also scale the data while we're at it. Some features have extremely large values, so we will likely need scaled inputs for the algorithms that are sensitive to feature magnitude.

In [3]:
train_scaled, validate_scaled, test_scaled = scale_data(
    train,
    validate,
    test,
    train.drop(columns = 'disposition').columns
)
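The `scale_data` helper comes from the preprocessing module and is not shown here. As a rough sketch of what such a helper might do, the version below fits a scaler on train only and then applies it to all three splits, so no information from validate or test leaks into the scaling. The function name `scale_splits`, the choice of `MinMaxScaler`, and the toy data are all assumptions made for illustration, not the project's actual implementation.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

def scale_splits(train, validate, test, columns):
    """Hypothetical stand-in for scale_data: fit on train, transform every split."""
    columns = list(columns)
    scaler = MinMaxScaler().fit(train[columns])
    scaled = []
    for split in (train, validate, test):
        split = split.copy()  # leave the original DataFrames untouched
        split[columns] = scaler.transform(split[columns])
        scaled.append(split)
    return tuple(scaled)

# Toy data, made up for illustration only.
toy_train = pd.DataFrame({
    'orbital_period': [1.0, 11.0, 21.0],
    'disposition': ['CONFIRMED', 'FALSE POSITIVE', 'CONFIRMED']
})
toy_validate = pd.DataFrame({'orbital_period': [6.0], 'disposition': ['CONFIRMED']})
toy_test = pd.DataFrame({'orbital_period': [31.0], 'disposition': ['FALSE POSITIVE']})

toy_train_scaled, toy_validate_scaled, toy_test_scaled = scale_splits(
    toy_train,
    toy_validate,
    toy_test,
    ['orbital_period']
)
```

Because the scaler only sees train, values in validate or test that fall outside the train range can land outside 0 to 1, as the toy test row does here.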

---

## A Note on Evaluation

In this project we will primarily use accuracy to measure the performance of our models, since there is a fairly even split between false positives and confirmed exoplanets. We will also track recall and precision when comparing models. When two models have similar accuracy, precision will break the tie, since we want to be sure of our positive predictions. An observation predicted as a confirmed exoplanet would reasonably be prioritized for further analysis to verify the prediction, and we want to be sure that the time spent analyzing that object is not wasted.

Additionally, since we are more interested in identifying confirmed exoplanets than false positives, the disposition CONFIRMED will be our positive class and FALSE POSITIVE will be our negative class (that's going to get confusing).
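To make the three metrics concrete with CONFIRMED as the positive class, here is a minimal sketch. The `evaluate_binary` function and the toy labels are made up for illustration; the project's own `evaluate` function lives in the evaluate module and may be implemented differently.

```python
def evaluate_binary(actual, predicted, positive):
    """Return accuracy, recall, and precision, treating `positive` as the positive class."""
    pairs = list(zip(actual, predicted))
    true_positives = sum(a == positive and p == positive for a, p in pairs)
    false_positives = sum(a != positive and p == positive for a, p in pairs)
    false_negatives = sum(a == positive and p != positive for a, p in pairs)
    correct = sum(a == p for a, p in pairs)

    actual_positives = true_positives + false_negatives
    predicted_positives = true_positives + false_positives
    return {
        'accuracy': correct / len(pairs),
        # Recall: of the truly CONFIRMED objects, how many did we catch?
        'recall': true_positives / actual_positives if actual_positives else 0.0,
        # Precision: of the objects we called CONFIRMED, how many really are?
        'precision': true_positives / predicted_positives if predicted_positives else 0.0,
    }

# Toy labels, made up for illustration only.
actual = ['CONFIRMED', 'CONFIRMED', 'FALSE POSITIVE', 'FALSE POSITIVE', 'FALSE POSITIVE']
predicted = ['CONFIRMED', 'FALSE POSITIVE', 'CONFIRMED', 'FALSE POSITIVE', 'FALSE POSITIVE']

scores = evaluate_binary(actual, predicted, 'CONFIRMED')
print(scores)
```

Note that a misclassified FALSE POSITIVE disposition counts as a false positive in the metric sense, which is exactly the naming overlap mentioned above.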

## Establish a Baseline

Before we can begin building models, we need a baseline to compare them against. This lets us check that each model at least performs better than a naive prediction.

In [4]:
baseline = establish_classification_baseline(train.disposition)
baseline.shape

(3376,)

In [5]:
baseline.value_counts()

FALSE POSITIVE    3376
dtype: int64

In [6]:
# Now let's make a baseline model for validate.
# We'll use FALSE POSITIVE as the value the baseline model predicts because that was the value chosen 
# for the train baseline.
validate_baseline = pd.Series('FALSE POSITIVE', index = validate.index)

In [7]:
validate_baseline.value_counts()

FALSE POSITIVE    1448
dtype: int64

Here the baseline is a model that always predicts an observation is a false positive since that is the most common value in the target variable. Let's evaluate the baseline model's performance before we continue.

In [8]:
results = {
    **evaluate(train.disposition, baseline, 'CONFIRMED', prefix = 'train_'),
    **evaluate(validate.disposition, validate_baseline, 'CONFIRMED', prefix = 'validate_')
}

In [9]:
eval_df = append_results('Baseline', results)
eval_df

| Model | train_accuracy | train_recall | train_precision | validate_accuracy | validate_recall | validate_precision |
| --- | --- | --- | --- | --- | --- | --- |
| Baseline | 0.624111 | 0.0 | 0.0 | 0.623619 | 0.0 | 0.0 |


So our baseline has an accuracy of about 62% on both train and validate, and 0 for both recall and precision, which is expected since it never predicts the positive class.
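A small example shows why a most-frequent-class baseline behaves this way: its accuracy equals the share of the majority class, and it never produces a positive prediction. This is a sketch on made-up labels; `majority_baseline` is a hypothetical helper, not the project's `establish_classification_baseline`.

```python
from collections import Counter

def majority_baseline(labels):
    """Predict the most common label for every observation."""
    most_common_label, _ = Counter(labels).most_common(1)[0]
    return [most_common_label] * len(labels)

# Toy labels, made up for illustration: 5 false positives and 3 confirmed.
labels = ['FALSE POSITIVE'] * 5 + ['CONFIRMED'] * 3
predictions = majority_baseline(labels)

# Accuracy is simply the majority-class share.
accuracy = sum(a == p for a, p in zip(labels, predictions)) / len(labels)

# The baseline never predicts CONFIRMED, so there are no true positives.
predicted_positives = predictions.count('CONFIRMED')
true_positives = sum(a == 'CONFIRMED' and p == 'CONFIRMED' for a, p in zip(labels, predictions))
recall = true_positives / labels.count('CONFIRMED')

print(accuracy, predicted_positives, recall)
```

With no positive predictions, precision is strictly 0 divided by 0; reporting it as 0, as in the table above, is a common convention.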

---

## Feature Selection

Let's now use RFE to rank our features so that we can start with a few and add more features in the order they are ranked.

In [10]:
rfe = RFE(DecisionTreeClassifier(max_depth = 3), n_features_to_select = 2)
rfe.fit(train_scaled.drop(columns = 'disposition'), train_scaled.disposition)

RFE(estimator=DecisionTreeClassifier(max_depth=3), n_features_to_select=2)

In [11]:
pd.DataFrame({'Var': train_scaled.drop(columns = 'disposition').columns, 'Rank': rfe.ranking_})

| | Var | Rank |
| --- | --- | --- |
| 0 | transit_depth | 3 |
| 1 | planetary_radius | 1 |
| 2 | temperature | 5 |
| 3 | normalized_depth | 2 |
| 4 | orbital_period | 1 |
| 5 | transit_duration | 4 |
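The ranking can be turned into the order in which features get added. The sketch below uses the ranks reported by RFE; the names `rfe_ranks`, `ordered_features`, and `feature_sets` are introduced here for illustration. Because RFE was asked to select two features, both selected features share rank 1.

```python
# Ranks copied from the RFE output above.
rfe_ranks = {
    'transit_depth': 3,
    'planetary_radius': 1,
    'temperature': 5,
    'normalized_depth': 2,
    'orbital_period': 1,
    'transit_duration': 4,
}

# sorted() is stable, so the two rank-1 features keep their original column order.
ordered_features = sorted(rfe_ranks, key = rfe_ranks.get)

# Start with the two rank-1 features, then add one feature at a time.
feature_sets = [ordered_features[:size] for size in range(2, len(ordered_features) + 1)]

for feature_set in feature_sets:
    print(feature_set)
```

Each list in `feature_sets` could then be passed as the feature list when fitting a model, starting from the two top-ranked features.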


---

## Create Some Models

Now let's finally create some models to predict the exoplanet archive disposition. We'll start by fitting a variety of algorithms on the top two features chosen by RFE. For whichever algorithm performs best, we'll try a variety of hyper-parameters, start adding features, and then choose our best model from there.

In [12]:
algorithms = [
    DecisionTreeClassifier(max_depth = 3, random_state = random_seed),
    RandomForestClassifier(max_depth = 3, random_state = random_seed),
    AdaBoostClassifier(random_state = random_seed),
    BaggingClassifier(random_state = random_seed),
    GradientBoostingClassifier(random_state = random_seed),
    KNeighborsClassifier(),
    SGDClassifier(random_state = random_seed),
    BernoulliNB(),
    SVC(random_state = random_seed)
]

features = ['planetary_radius', 'orbital_period']
# Fit one Model per algorithm, keyed by its 1-based position in the list.
models = {
    key: Model(algorithm, train, features, 'disposition')
    for key, algorithm in enumerate(algorithms, start = 1)
}

In [13]:
models

{1: <model.Model at 0x7fc59f7f2bb0>,
 2: <model.Model at 0x7fc59ab3a340>,
 3: <model.Model at 0x7fc59ab248e0>,
 4: <model.Model at 0x7fc59ab3a790>,
 5: <model.Model at 0x7fc59ab3afa0>,
 6: <model.Model at 0x7fc59ab24fa0>,
 7: <model.Model at 0x7fc59ab249a0>,
 8: <model.Model at 0x7fc59f7eb070>,
 9: <model.Model at 0x7fc59f7cc1c0>}
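The `Model` class comes from the project's model module and is not shown in this notebook. Judging only from how it is used here (an estimator, a training DataFrame, a feature list, and a target name go in; `.model` and `.make_predictions` come out), a wrapper of that shape might look like the sketch below. `SimpleModel` and the toy data are hypothetical and not the author's implementation.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

class SimpleModel:
    """Hypothetical stand-in for the project's Model class."""

    def __init__(self, estimator, train, features, target):
        self.features = list(features)
        self.target = target
        # Fit immediately so the object is ready to predict once constructed.
        self.model = estimator.fit(train[self.features], train[target])

    def make_predictions(self, data):
        # Select the same feature columns that were used for fitting.
        predictions = self.model.predict(data[self.features])
        return pd.Series(predictions, index = data.index)

# Toy data, made up for illustration only.
toy = pd.DataFrame({
    'planetary_radius': [1.0, 2.0, 10.0, 12.0],
    'orbital_period': [5.0, 8.0, 300.0, 250.0],
    'disposition': ['CONFIRMED', 'CONFIRMED', 'FALSE POSITIVE', 'FALSE POSITIVE']
})

toy_model = SimpleModel(
    DecisionTreeClassifier(max_depth = 3, random_state = 24),
    toy,
    ['planetary_radius', 'orbital_period'],
    'disposition'
)
toy_predictions = toy_model.make_predictions(toy)
```

Keeping the feature list inside the wrapper means every later call to `make_predictions` uses the same columns, whichever split is passed in.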

In [14]:
names = [
    'Decision Tree',
    'Random Forest',
    'Ada Boost',
    'Bagging Classifier',
    'Gradient Boosting',
    'KNN',
    'SGD',
    'Bernoulli NB',
    'SVC'
]

for model, name in zip(models.values(), names):
    eval_df = append_results(
        name,
        {
            **evaluate(train.disposition, model.make_predictions(train), 'CONFIRMED', prefix = 'train_'),
            **evaluate(validate.disposition, model.make_predictions(validate), 'CONFIRMED', prefix = 'validate_')
        },
        eval_df
    )
    
eval_df

| Model | train_accuracy | train_recall | train_precision | validate_accuracy | validate_recall | validate_precision |
| --- | --- | --- | --- | --- | --- | --- |
| Baseline | 0.624111 | 0.0 | 0.0 | 0.623619 | 0.0 | 0.0 |
| Decision Tree | 0.841825 | 0.901497 | 0.736639 | 0.832873 | 0.880734 | 0.730594 |
| Random Forest | 0.846564 | 0.878645 | 0.753888 | 0.830801 | 0.833028 | 0.746711 |
| Ada Boost | 0.846564 | 0.875493 | 0.755269 | 0.828039 | 0.838532 | 0.739482 |
| Bagging Classifier | 0.986374 | 0.98818 | 0.975875 | 0.80663 | 0.766972 | 0.732049 |
| Gradient Boosting | 0.872334 | 0.8684 | 0.806735 | 0.84116 | 0.80367 | 0.780749 |
| KNN | 0.878851 | 0.866036 | 0.821375 | 0.822514 | 0.779817 | 0.756228 |
| SGD | 0.525474 | 0.116627 | 0.235294 | 0.537983 | 0.150459 | 0.284722 |
| Bernoulli NB | 0.624111 | 0.0 | 0.0 | 0.623619 | 0.0 | 0.0 |
| SVC | 0.730154 | 0.946414 | 0.587573 | 0.729972 | 0.950459 | 0.587302 |


In [15]:
eval_df.sort_values(by = 'validate_accuracy', ascending = False)

| Model | train_accuracy | train_recall | train_precision | validate_accuracy | validate_recall | validate_precision |
| --- | --- | --- | --- | --- | --- | --- |
| Gradient Boosting | 0.872334 | 0.8684 | 0.806735 | 0.84116 | 0.80367 | 0.780749 |
| Decision Tree | 0.841825 | 0.901497 | 0.736639 | 0.832873 | 0.880734 | 0.730594 |
| Random Forest | 0.846564 | 0.878645 | 0.753888 | 0.830801 | 0.833028 | 0.746711 |
| Ada Boost | 0.846564 | 0.875493 | 0.755269 | 0.828039 | 0.838532 | 0.739482 |
| KNN | 0.878851 | 0.866036 | 0.821375 | 0.822514 | 0.779817 | 0.756228 |
| Bagging Classifier | 0.986374 | 0.98818 | 0.975875 | 0.80663 | 0.766972 | 0.732049 |
| SVC | 0.730154 | 0.946414 | 0.587573 | 0.729972 | 0.950459 | 0.587302 |
| Baseline | 0.624111 | 0.0 | 0.0 | 0.623619 | 0.0 | 0.0 |
| Bernoulli NB | 0.624111 | 0.0 | 0.0 | 0.623619 | 0.0 | 0.0 |
| SGD | 0.525474 | 0.116627 | 0.235294 | 0.537983 | 0.150459 | 0.284722 |


On unseen data, Gradient Boosting has the best accuracy (0.841 on validate). Its drop-off from train accuracy (0.872) is also modest, which is good news. We'll move forward with this algorithm.
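The drop-off from train to validate can be quantified for every model by subtracting the two accuracies. The sketch below uses the accuracies from the table above; `accuracies`, `gaps`, and `largest_gap` are names introduced here for illustration.

```python
# (train_accuracy, validate_accuracy) pairs copied from the results table above.
accuracies = {
    'Decision Tree': (0.841825, 0.832873),
    'Random Forest': (0.846564, 0.830801),
    'Ada Boost': (0.846564, 0.828039),
    'Bagging Classifier': (0.986374, 0.80663),
    'Gradient Boosting': (0.872334, 0.84116),
    'KNN': (0.878851, 0.822514),
    'SGD': (0.525474, 0.537983),
    'Bernoulli NB': (0.624111, 0.623619),
    'SVC': (0.730154, 0.729972),
}

# A large positive gap suggests the model fits train much better than unseen data.
gaps = {
    name: round(train_accuracy - validate_accuracy, 6)
    for name, (train_accuracy, validate_accuracy) in accuracies.items()
}
largest_gap = max(gaps, key = gaps.get)

for name, gap in sorted(gaps.items(), key = lambda item: item[1], reverse = True):
    print(f'{name}: {gap}')
```

Gradient Boosting loses about 3 points of accuracy between train and validate, while the Bagging Classifier loses about 18, which points to overfitting in the latter.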

---

## Modifying the Hyper-Parameters

Now let's try changing the hyper-parameters for gradient boosting to see if we can get better results that way.

In [62]:
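# Let's use loops to try a variety of hyper-parameters.
# Note: scikit-learn 1.1 renamed the 'deviance' loss to 'log_loss' and later
# removed the old name, so use 'log_loss' on newer versions.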

GBC_models = {}
n = count()

for loss in ['deviance', 'exponential']:
    for n_estimators in range(100, 501, 50):
        for max_depth in range(3, 11):
            GBC_models[next(n)] = Model(
                GradientBoostingClassifier(
                    loss = loss,
                    n_estimators = n_estimators,
                    max_depth = max_depth,
                    random_state = random_seed
                ),
                train,
                features,
                'disposition'
            )

In [63]:
for index, model in enumerate(GBC_models.values()):
    eval_df = append_results(
        f'GBC_{index}',
        {
            **evaluate(train.disposition, model.make_predictions(train), 'CONFIRMED', prefix = 'train_'),
            **evaluate(validate.disposition, model.make_predictions(validate), 'CONFIRMED', prefix = 'validate_')
        },
        eval_df
    )

In [64]:
eval_df.sort_values(by = 'validate_accuracy', ascending = False)

| Model | train_accuracy | train_recall | train_precision | validate_accuracy | validate_recall | validate_precision |
| --- | --- | --- | --- | --- | --- | --- |
| GBC_104 | 0.894254 | 0.897557 | 0.833821 | 0.845304 | 0.811009 | 0.785080 |
| GBC_40 | 0.917654 | 0.921986 | 0.867309 | 0.843232 | 0.796330 | 0.789091 |
| GBC_89 | 0.908175 | 0.912530 | 0.853353 | 0.842541 | 0.805505 | 0.782531 |
| GBC_112 | 0.903436 | 0.907013 | 0.846946 | 0.841851 | 0.801835 | 0.783154 |
| GBC_24 | 0.902844 | 0.900709 | 0.849814 | 0.841160 | 0.794495 | 0.785844 |
| ... | ... | ... | ... | ... | ... | ... |
| Bagging Classifier | 0.986374 | 0.988180 | 0.975875 | 0.806630 | 0.766972 | 0.732049 |
| SVC | 0.730154 | 0.946414 | 0.587573 | 0.729972 | 0.950459 | 0.587302 |
| Bernoulli NB | 0.624111 | 0.000000 | 0.000000 | 0.623619 | 0.000000 | 0.000000 |
| Baseline | 0.624111 | 0.000000 | 0.000000 | 0.623619 | 0.000000 | 0.000000 |


It looks like GBC_104 performs slightly better on validate than the default Gradient Boosting model (0.8453 vs. 0.8412 accuracy). Let's see what its hyper-parameters were.

In [70]:
GBC_models[104].model

GradientBoostingClassifier(loss='exponential', n_estimators=300,
                           random_state=24)
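Because the models were stored under a running counter, an index can also be mapped back to its hyper-parameters by rebuilding the grid in the same order as the nested loops. A minimal sketch, assuming the loop ranges from the search cell above; `grid` is a name introduced here for illustration.

```python
from itertools import product

# product() varies its last argument fastest, matching the nesting order of the loops:
# loss (outermost), then n_estimators, then max_depth (innermost).
grid = list(product(
    ['deviance', 'exponential'],
    range(100, 501, 50),
    range(3, 11)
))

loss, n_estimators, max_depth = grid[104]
print(len(grid), loss, n_estimators, max_depth)
```

Index 104 decodes to the exponential loss with 300 estimators and a max depth of 3. The printed estimator does not show `max_depth` because 3 is the default for `GradientBoostingClassifier`, and scikit-learn's repr only lists parameters that differ from their defaults.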

In [67]:
# Let's just verify this is the right one.
evaluate(validate.disposition, GBC_models[104].make_predictions(validate), 'CONFIRMED', prefix = 'validate_')

{'validate_accuracy': 0.8453038674033149,
 'validate_recall': 0.8110091743119267,
 'validate_precision': 0.7850799289520426}