# Modeling Notebook

## Summary
In this Notebook:
- A first simple model is created

# First Simple Models

In [19]:
# Import statements
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Importing metrics function from functions.py
from functions import metrics

In [7]:
# Load in cleaned data from last time.

# Training Data
X_train = pd.read_csv('../Data/train/X_train.csv', index_col=0)
y_train = pd.read_csv('../Data/train/y_train.csv', index_col=0)

# Testing Data
X_test = pd.read_csv('../Data/test/X_test.csv', index_col=0)
y_test = pd.read_csv('../Data/test/y_test.csv', index_col=0)


## Modeless Baseline
A modeless baseline is the accuracy we would get by always guessing the majority class of the target variable. In this case: how accurate would we be if we guessed, for every child, that the child does not have ADHD? This will be the same for every split, since we stratified when we performed the train-test split (TTS). It is important to keep this number in mind when modeling, so we can tell when a model is doing no better than guessing the majority class.

In [14]:
# Getting % of each class and assigning to variables
no_adhd, adhd = y_train.value_counts(normalize=True)

# Printing modeless accuracy
print(f'If we said that each child in the set did not have ADHD, we would be {no_adhd*100:.0f}% accurate')


If we said that each child in the set did not have ADHD, we would be 90% accurate
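
The same majority-class baseline can also be expressed as a model object, which makes it easy to score with the same tools as a real model. Below is a minimal sketch using Sklearn's `DummyClassifier`. It runs on synthetic labels rather than `y_train`, and the coding of 2.0 as "no ADHD" and 1.0 as "ADHD" is an assumption based on the prediction counts shown later in this notebook.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic stand-in for y_train: 90% coded 2.0 (assumed "no ADHD"),
# 10% coded 1.0 (assumed "ADHD"). Not the real survey data.
y_demo = np.array([2.0] * 90 + [1.0] * 10)
X_demo = np.zeros((len(y_demo), 1))  # features are ignored by this strategy

# Always predict the most frequent class seen during fit
dummy = DummyClassifier(strategy="most_frequent")
dummy.fit(X_demo, y_demo)

# Accuracy of the majority-class guess
baseline_acc = dummy.score(X_demo, y_demo)
print(f"Majority-class accuracy: {baseline_acc:.2f}")
```

Swapping `X_demo` and `y_demo` for the real training data should reproduce the percentage printed above, as long as the majority class is the same.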


## FSM - Sklearn Decision Tree

To start, let's create the simplest model possible to use as a baseline for future models. I'll use a Decision Tree model from Sklearn.

In [16]:
# Instantiating Tree
FSM_DT = DecisionTreeClassifier()

# Fitting Model
FSM_DT.fit(X_train, y_train)

# Score on the training data.
metrics(y_train, FSM_DT.predict(X_train))

1.0

Wow! A perfect score!
I guess we can go home. We did it, we solved ADHD.

Just kidding, let's check the score on the testing data.

In [21]:
# Predictions
FSM_DT_preds = FSM_DT.predict(X_test)

# Print metrics
metrics(y_test, FSM_DT_preds)

Accuracy: 0.90
Precision: 0.49
Recall: 0.54
F1: 0.51
ROC AUC: 0.74
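
The `metrics` helper comes from `functions.py`, which is not shown in this notebook. As a rough idea of how output like the above could be produced, here is a minimal sketch built on `sklearn.metrics`. The name `metrics_sketch`, the `pos_label` argument, and the assumption that 1.0 codes ADHD are all hypothetical and may not match the real helper.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)


def metrics_sketch(y_true, y_pred, pos_label=1.0):
    """Hypothetical stand-in for functions.metrics; prints five scores."""
    # Recode labels to 1 (positive) / 0 (negative) so every metric
    # agrees on which class counts as positive.
    y_true_bin = (np.asarray(y_true).ravel() == pos_label).astype(int)
    y_pred_bin = (np.asarray(y_pred).ravel() == pos_label).astype(int)

    scores = {
        "Accuracy": accuracy_score(y_true_bin, y_pred_bin),
        "Precision": precision_score(y_true_bin, y_pred_bin),
        "Recall": recall_score(y_true_bin, y_pred_bin),
        "F1": f1_score(y_true_bin, y_pred_bin),
        "ROC AUC": roc_auc_score(y_true_bin, y_pred_bin),
    }
    for name, value in scores.items():
        print(f"{name}: {value:.2f}")
    return scores


# Toy labels only, not the survey data
toy_true = [1.0, 1.0, 2.0, 2.0, 2.0, 2.0]
toy_pred = [1.0, 2.0, 2.0, 2.0, 2.0, 1.0]
toy_scores = metrics_sketch(toy_true, toy_pred)
```

Note that when ROC AUC is computed from hard class predictions like this, it reduces to the average of the true positive rate and the true negative rate. Passing predicted probabilities from `predict_proba` instead would give a finer-grained score.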


In [25]:
# Checking the predictions to ensure that the model isn't guessing one class
pd.Series(FSM_DT_preds).value_counts()

2.0    7104
1.0     882
dtype: int64

## Analysis
The model is clearly overfit to the training data: it scores perfectly there, but on the testing data its accuracy (0.90) merely matches the modeless baseline, and its precision is only 0.49 with a recall of 0.54. This is expected of an un-pruned decision tree, though.
I'd like to try a few different kinds of first models, so before iterating on this one, let's create a few more FSMs.
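
For reference, the gap between training and held-out performance can be checked without touching the test set by using `cross_val_score`, which is already imported above. The sketch below is only an illustration: it uses a synthetic imbalanced dataset from `make_classification` instead of the survey data, and the `max_depth` values and the F1 scoring choice are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with roughly a 90/10 class split (not the survey data)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0
)

# An un-pruned tree keeps splitting until it memorises the training set
full_tree = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo)
train_acc = full_tree.score(X_demo, y_demo)
print(f"Un-pruned tree, training accuracy: {train_acc:.2f}")

# 5-fold cross-validated F1 for an un-pruned tree and two depth-limited trees
cv_f1 = {}
for depth in (None, 3, 5):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    cv_f1[depth] = cross_val_score(
        tree, X_demo, y_demo, cv=5, scoring="f1"
    ).mean()
    print(f"max_depth={depth}: mean CV F1 = {cv_f1[depth]:.2f}")
```

Limiting `max_depth` is only one way to prune. Parameters such as `min_samples_leaf` or `ccp_alpha` could be compared in the same loop.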

## FSM - CatBoost
CatBoost is not part of Sklearn, but it is known for performing very well on categorical data like the data from this survey. Let's give it a shot and see how it does.