# FSM Notebook

## Summary
- In this notebook I will create three first simple models (FSMs) with:
    - Sklearn's Base Decision Tree
    - CatBoost
    - Keras

I'll evaluate all of these FSMs and use the results to decide how to move forward and how to split my time between these model types.

# First Simple Models

In [2]:
# Import statements
import keras
import pandas as pd
from catboost import CatBoostClassifier, Pool, metrics, cv
from keras.layers import Dense
from sklearn.tree import DecisionTreeClassifier

# Importing metrics function from functions.py
from functions import metrics as custom_score

In [11]:
# Load in cleaned data from last time.

# Training Data
X_train = pd.read_csv('../../Data/train/X_train_combo.csv', index_col=0)
y_train = pd.read_csv('../../Data/train/y_train_combo.csv', index_col=0)

# Testing Data
X_test = pd.read_csv('../../Data/test/X_test_combo.csv', index_col=0)
y_test = pd.read_csv('../../Data/test/y_test_combo.csv', index_col=0)


## Modeless Baseline
A modeless baseline is the accuracy we would get by always guessing the majority class of the target variable. In this case: how accurate would we be if we guessed, for every child, that the child does not have ADHD? This will be roughly the same for every split, since we used the stratify parameter when we performed the train-test split (TTS). It is important to keep this number in mind when modeling, so we can tell when a model is doing no better than guessing the majority class.

In [3]:
y_train.value_counts(normalize=True)

K2Q31A
0.0       0.898907
1.0       0.101093
dtype: float64

In [4]:
# Getting % of each class and assigning to variables
no_adhd, adhd = y_train.value_counts(normalize=True)

# Printing modeless accuracy
print(f'If we said that each child in the set did not have ADHD, we would be {no_adhd*100:.0f}% accurate')


If we said that each child in the set did not have ADHD, we would be 90% accurate
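
The same majority-class baseline can also be expressed as a model object, which makes it easy to score with the same tooling as the real models. The sketch below uses Sklearn's `DummyClassifier` on synthetic labels; `X_demo` and `y_demo` are hypothetical stand-ins, not the survey data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic stand-in labels: roughly 90% class 0, 10% class 1 (hypothetical data)
rng = np.random.default_rng(0)
y_demo = (rng.random(1000) < 0.10).astype(int)
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by this strategy

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)

# Accuracy of the baseline equals the share of the majority class
baseline_acc = baseline.score(X_demo, y_demo)
majority_share = max(np.mean(y_demo == 0), np.mean(y_demo == 1))
print(f"Baseline accuracy: {baseline_acc:.3f}")
print(f"Majority class share: {majority_share:.3f}")
```

Any model whose accuracy sits at or below this number is not adding anything over guessing the majority class.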


## FSM - Sklearn Decision Tree

To start, let's create the simplest model possible to use as a baseline for future models. I'll use a Decision Tree model from Sklearn.

In [5]:
# Instantiating Tree
FSM_DT = DecisionTreeClassifier()

# Fitting Model
FSM_DT.fit(X_train, y_train)

# Score on the training data.
custom_score(X_train, y_train, FSM_DT)

Model Results
Accuracy: 1.00
Precision: 1.00
Recall: 1.00
F1: 1.00
ROC AUC: 1.00


{'Accuracy': 1.0, 'Precision': 1.0, 'Recall': 1.0, 'F1': 1.0, 'ROCAUC': 1.0}
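
The `custom_score` helper lives in `functions.py`, which is not shown in this notebook. As a rough idea of what such a helper might look like, here is a minimal sketch built on Sklearn's metric functions. The name `score_sketch` and its body are hypothetical, not the author's code. The ROC AUC values reported in this notebook appear consistent with being computed from hard class predictions rather than predicted probabilities, so the sketch does the same; treat that as an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.tree import DecisionTreeClassifier


def score_sketch(X, y, model):
    """Hypothetical stand-in for functions.metrics; not the author's code."""
    preds = model.predict(X)
    results = {
        'Accuracy': accuracy_score(y, preds),
        'Precision': precision_score(y, preds),
        'Recall': recall_score(y, preds),
        'F1': f1_score(y, preds),
        # Assumption: AUC is computed from hard 0/1 predictions, not probabilities
        'ROCAUC': roc_auc_score(y, preds),
    }
    print('Model Results')
    for name, value in results.items():
        print(f'{name}: {value:.2f}')
    return results


# Synthetic, imbalanced demo data (hypothetical, not the survey data)
X_demo, y_demo = make_classification(n_samples=500, weights=[0.9, 0.1],
                                     random_state=0)
tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)
demo_scores = score_sketch(X_demo, y_demo, tree)
```

Scoring an unrestricted tree on its own training data, as in the demo call, reproduces the all-ones pattern seen above.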

Wow! A perfect score!
I guess we can go home. We did it, we solved ADHD.

Just kidding, let's check the score on the testing data.

In [6]:
# Predictions
FSM_DT_preds = FSM_DT.predict(X_test)

# Print metrics
custom_score(X_test, y_test, FSM_DT)

Model Results
Accuracy: 0.90
Precision: 0.48
Recall: 0.51
F1: 0.50
ROC AUC: 0.73


{'Accuracy': 0.8950663661407463,
 'Precision': 0.48259860788863107,
 'Recall': 0.5148514851485149,
 'F1': 0.49820359281437127,
 'ROCAUC': 0.7263585929504067}

In [7]:
# Checking the predictions to ensure that the model isn't guessing one class
pd.Series(FSM_DT_preds).value_counts()

0.0    7124
1.0     862
dtype: int64

## Analysis
The model is clearly overfit to the training data: every training metric is 1.00, while on the testing data accuracy (0.895) is slightly below the modeless baseline (0.899) and precision and recall are both around 0.5. This is expected of an unpruned decision tree, though.
I'd like to try a few different kinds of first models, so before iterating on this one let's create a few more FSMs.
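
To illustrate why an unpruned tree memorizes its training data, the sketch below compares an unrestricted tree with a depth-limited one on synthetic data. The dataset, the `max_depth=3` value, and the variable names are illustrative assumptions, not choices made in this project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (hypothetical, not the survey data)
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9, 0.1],
                           flip_y=0.05, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Unrestricted tree: grows until every training leaf is pure
full_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Depth-limited tree: max_depth=3 is an arbitrary illustrative value
small_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

# Compare depth and the train/test accuracy gap for both trees
for name, tree in [('full', full_tree), ('max_depth=3', small_tree)]:
    print(f'{name}: depth={tree.get_depth()}, '
          f'train acc={tree.score(X_tr, y_tr):.3f}, '
          f'test acc={tree.score(X_te, y_te):.3f}')
```

The unrestricted tree reaches perfect training accuracy because nothing stops it from splitting until each leaf is pure; limiting the depth is one of several ways to rein that in.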

## FSM - CatBoost
CatBoost is not part of Sklearn, but it is known for doing very well on categorical data like the data from this survey. Let's give it a shot and see how it does.

In [8]:
# Setting up the model
model = CatBoostClassifier(
    # Adding Accuracy as a metric
    custom_loss=[metrics.Accuracy()],
    random_seed=15,
    logging_level='Silent'
)

In [9]:
# Fitting the model to training data
model.fit(
    X_train, y_train,
    # Using X/y test as eval set
    eval_set=(X_test, y_test),
    # Plot the learning of the model
    plot=True
);

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))
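
The cell above passes the test set as the eval set. Because the eval set can influence which iteration CatBoost keeps, test scores measured on that same set may be optimistic. One hedged alternative is to carve a validation split out of the training data and keep the test set untouched. The sketch below shows only the split, on synthetic data; `X_demo`, `y_demo`, and the 20% validation size are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the training data (hypothetical shapes and values)
rng = np.random.default_rng(15)
X_demo = rng.normal(size=(1000, 5))
y_demo = (rng.random(1000) < 0.10).astype(int)

# Hold out 20% of the training data as a validation set, preserving class balance
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=15
)

# Stratification keeps the positive rate nearly identical in both splits
print(f'Positive rate, fit split: {y_fit.mean():.3f}')
print(f'Positive rate, validation split: {y_val.mean():.3f}')
```

With a split like this, `eval_set=(X_val, y_val)` could be passed to `fit` in place of the test data.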

In [10]:
# Updating model params with Logloss function
cv_params = model.get_params()
cv_params.update({
    'loss_function': metrics.Logloss()
})
# Pooling data and cross validating
cv_data = cv(
    Pool(X_train, y_train),
    cv_params,
    plot=True
)

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))

In [11]:
# Printing training and testing scores.
print("Training Scores")
custom_score(X_train, y_train, model)
print('\t')
print('Testing Scores')
custom_score(X_test, y_test, model)

Training Scores
Model Results
Accuracy: 0.97
Precision: 0.95
Recall: 0.75
F1: 0.83
ROC AUC: 0.87
	
Testing Scores
Model Results
Accuracy: 0.94
Precision: 0.74
Recall: 0.56
F1: 0.64
ROC AUC: 0.77


{'Accuracy': 0.9356373653894315,
 'Precision': 0.7401960784313726,
 'Recall': 0.5606435643564357,
 'F1': 0.6380281690140845,
 'ROCAUC': 0.769246273680029}

## Analysis
CatBoost has done quite well for a baseline model! There is, again, some overfitting occurring here: F1 drops from 0.83 on the training data to 0.64 on the testing data. But the scores are still better than the base decision tree's: precision improved sharply (0.48 to 0.74), while recall (0.51 to 0.56) and ROC AUC (0.73 to 0.77) improved more modestly. Let's move on to something totally different: a neural network.

## FSM - Keras NN

In [14]:
# Instantiating a NN
FSM_NN = keras.Sequential()

# Starting small with 30 neurons
FSM_NN.add(Dense(30, 'relu', input_shape=(422,)))

# 1 output
FSM_NN.add(Dense(1, 'sigmoid'))

# Compiling model with accuracy, precision, and recall metrics. Using "Adam" as an optimizer
FSM_NN.compile('adam', 'binary_crossentropy', metrics=['acc', 'Precision', 'Recall'])

In [15]:
# Fitting model on X_train and binarized labels
FSM_NN.fit(X_train, y_train, epochs=10, steps_per_epoch=100, validation_data=(X_test, y_test))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x1471a6a30>

In [16]:
# Getting stats for test data
NN_loss, NN_acc, NN_prec, NN_recall = FSM_NN.evaluate(X_test, y_test)



In [17]:
# Neatly printing evaluation results
print(f'Test Accuracy: {NN_acc:.2f} \n Test Precision: {NN_prec:.2f} \n Test Recall: {NN_recall:.2f}')

Test Accuracy: 0.90 
 Test Precision: 0.73 
 Test Recall: 0.06


## Analysis
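The neural network has an impressively bad recall score of 6%: it flags very few of the children who have ADHD, and its 0.90 accuracy is the same as the modeless baseline. I think that this poor neural network could use some more neurons, but it is trying its best with what it has.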
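
Besides adding neurons, one lever that is often tried when recall is low on imbalanced data is weighting the minority class more heavily during training. This notebook does not do that; the sketch below only shows how such weights could be computed with Sklearn, using hypothetical labels with a 90/10 split.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels with a 90/10 imbalance (hypothetical, mirrors the rough class split)
y_demo = np.array([0] * 900 + [1] * 100)

# 'balanced' weights are n_samples / (n_classes * class_count)
classes = np.unique(y_demo)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_demo)
class_weight = {int(c): float(w) for c, w in zip(classes, weights)}

print(class_weight)
```

A dictionary like this could be passed as the `class_weight` argument of the Keras `fit` call. Whether it would actually improve recall on this data is untested here.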

# Conclusion
All baseline models had high accuracy due to the class imbalance, but had poor recall and/or precision scores. The neural network had by far the worst recall at 6%, and the base decision tree had the worst precision at around 0.48. Moving forward, this is what I'm planning:

1. Spend a very small amount of time on the base decision tree, perhaps iterate only once or twice.

2. CatBoost looks like the way to go here, so I'll spend the most time iterating on this model.

3. Spend a moderate amount of time on the neural network, and see if it will do better than CatBoost.

I think that CatBoost has the most potential here, and I'm confident that it will do better than a base decision tree ever could. I think it will be interesting to see what happens with the neural network. I'm wondering if it will be able to do much better than CatBoost, and if it can, will it be worth the tradeoff in training time and processing power?
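
The test-set results reported above can be collected into one table for a side-by-side comparison. The values below are copied from the printed outputs, rounded to two decimals. The F1 column is recomputed from those rounded values, so it is approximate, and the notebook did not report an F1 for the neural network.

```python
import pandas as pd

# Test-set metrics copied from the outputs above (rounded to two decimals)
results = pd.DataFrame(
    {
        'Accuracy': [0.90, 0.94, 0.90],
        'Precision': [0.48, 0.74, 0.73],
        'Recall': [0.51, 0.56, 0.06],
    },
    index=['Decision Tree', 'CatBoost', 'Keras NN'],
)

# F1 recomputed from the rounded precision and recall, so it is approximate
results['F1 (approx.)'] = (
    2 * results['Precision'] * results['Recall']
    / (results['Precision'] + results['Recall'])
)

print(results.round(2))
```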

Let's start with a [decision tree](Modeling-Decision_Tree.ipynb)