# Python Machine Learning: Machine Learning Walkthrough

In this notebook, we're going to execute a machine learning project from start to finish. We'll use techniques covered in the previous workshops to facilitate this process, but we'll also introduce some new concepts. Our goal is to demonstrate a basic machine learning pipeline.

## Overview of Pipeline

We'll take the following steps to develop our machine learning models:

1. **Introduce Dataset and Objectives**
    - What is the dataset?
    - What do the features tell us?
    - What is our goal in applying machine learning here? 
    - Why is classification important in this scenario?
2. **Exploratory Data Analysis and Feature Engineering**
    - Produce several plots to give us a better understanding of the data.
    - Perform feature engineering and preprocessing.
    - Create a validation dataset from the data and set it aside until the end, when we evaluate the final models.
3. **Modeling Process**
    - Determine the null accuracy and preferred performance metric.
    - Train three different models: Logistic Regression, Decision Trees and Random Forest.
4. **Evaluation and Interpretation**
    - Evaluate the model using a variety of metrics.
    - Discuss how successful we were with our modeling.

## Part 1: Introduce Dataset and Objectives

We are going to be using a Kaggle dataset called ["Personal Key Indicators of Heart Disease"](https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease).

This dataset consists of 2020 annual CDC survey data from 400,000 adults related to their health status with regard to heart disease.

Below, we provide an edited down description of the data, taken from the Kaggle data description:

### What topic does the dataset cover?

According to the CDC, heart disease is one of the leading causes of death for people of most races in the US (African Americans, American Indians and Alaska Natives, and white people). About half of all Americans (47%) have at least 1 of 3 key risk factors for heart disease: high blood pressure, high cholesterol, and smoking. Other key indicators include diabetic status, obesity (high BMI), not getting enough physical activity, or drinking too much alcohol. Detecting and preventing the factors that have the greatest impact on heart disease is very important in healthcare. Computational developments, in turn, allow the application of machine learning methods to detect "patterns" from the data that can predict a patient's condition.

### Where did the dataset come from and what treatments did it undergo?

Originally, the dataset came from the CDC and is a major part of the Behavioral Risk Factor Surveillance System (BRFSS), which conducts annual telephone surveys to gather data on the health status of U.S. residents. The most recent dataset (as of February 15, 2022) includes data from 2020. It consists of 401,958 rows and 279 columns.

The vast majority of columns are questions asked to respondents about their health status, such as "Do you have serious difficulty walking or climbing stairs?" or "Have you smoked at least 100 cigarettes in your entire life?".

In this dataset, I noticed many different factors (questions) that directly or indirectly influence heart disease, so I decided to select the most relevant variables from it and do some cleaning so that it would be usable for machine learning projects.

### What can you do with this dataset?

As described above, the original dataset of nearly 300 variables was reduced to just about 20 variables. In addition to classical EDA, this dataset can be used to apply a range of machine learning methods, most notably classifier models (logistic regression, SVM, random forest, etc.). You should treat the variable "HeartDisease" as a binary ("Yes" - respondent had heart disease; "No" - respondent had no heart disease). But note that classes are not balanced, so the classic model application approach is not advisable. Fixing the weights/undersampling should yield significantly better results. Based on the dataset, I constructed a logistic regression model and embedded it in an application you might be inspired by: https://heart-condition-checker.herokuapp.com/. Can you indicate which variables have a significant effect on the likelihood of heart disease?

### Data Dictionary

The features available in the dataset are shown in the table below. The first variable, **HeartDisease**, is the target variable. We aim to predict whether **HeartDisease** is true or false. 

| Feature     | Description |
| ----------- | ----------- |
| **HeartDisease**       | Respondents that have ever reported having coronary heart disease (CHD) or myocardial infarction (MI)    |
| **BMI**   | Body Mass Index (BMI)        |
| **Smoking** | Have you smoked at least 100 cigarettes in your entire life? |
| **AlcoholDrinking** | Heavy drinkers (adult men having more than 14 drinks per week and adult women having more than 7 drinks per week) |
| **Stroke** | Ever had a stroke? |
| **PhysicalHealth** | Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 was your physical health not good. |
| **MentalHealth** | Thinking about your mental health, for how many days during the past 30 days was your mental health not good? |
| **DiffWalking** | Do you have serious difficulty walking or climbing stairs? |
| **Sex** | Sex Assigned at Birth | 
| **AgeCategory** |  Fourteen-level age category |
| **Race** | Race and ethnicity |
| **Diabetic** | Have you ever had diabetes? |
| **PhysicalActivity** | Adults who reported doing physical activity or exercise during the past 30 days other than their regular job |
| **GenHealth** | Would you say that in general your health is...|
| **SleepTime** | On average, how many hours of sleep do you get in a 24-hour period?|
| **Asthma** | Have you ever had asthma?|
| **KidneyDisease** | Not including kidney stones, bladder infection or incontinence, were you ever told you had kidney disease? |
| **SkinCancer** | Have you ever had skin cancer? |

### What is our objective?

Our objective is to use a variety of demographic, health, and behavioral data to predict whether the patients in this dataset have or have ever had heart disease.

### Import the Dataset

We'll begin by importing the dataset and taking a look at the columns. Don't forget to refer to the data dictionary for more details.

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
sns.set(font_scale=1.5)

# Import functions from scikit-learn
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score,
                             confusion_matrix,
                             classification_report,
                             f1_score,
                             precision_score,
                             recall_score)
from sklearn.model_selection import (cross_val_score,
                                     cross_val_predict,
                                     StratifiedKFold,
                                     GridSearchCV,
                                     RandomizedSearchCV,
                                     train_test_split)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (LabelEncoder,
                                   OneHotEncoder,
                                   StandardScaler)
from sklearn.tree import DecisionTreeClassifier

We'll use `pandas` to import the dataset. Be sure to use the correct file path:

In [None]:
df = pd.read_csv("../data/heart_2020_cleaned_sample.csv")
df.head()

We have 13 categorical features, 4 continuous features, and one categorical target variable.

Notice that we do not have any null values, so we can skip imputation:

In [None]:
# How many null values per column?
df.isnull().sum()

## Part 2: Exploratory Data Analysis and Feature Engineering

Before we jump into modeling, it's important to get to know our data. This will help motivate the features we use, any additional features we construct, and how we perform preprocessing.

### Exploratory Data Analysis

Let's first get a sense of the distributions of the variables in the dataset. We'll start with the numerical data. Let's plot histograms of these features. Notice that, in some plots, we use a log-scale for the $y$-axis. Try turning it off to see how the distribution looks.

In [None]:
# Grab the numeric features
df_numeric = df.select_dtypes("number")
numeric_features = df_numeric.columns
df_numeric.head()

In [None]:
# Plot BMI
sns.histplot(data=df_numeric, x='BMI', bins=50)
plt.show()

In [None]:
# Plot PhysicalHealth
sns.histplot(data=df_numeric, x='PhysicalHealth', bins=10)
plt.yscale('log')
plt.show()

In [None]:
# Plot MentalHealth
sns.histplot(data=df_numeric, x='MentalHealth', bins=10)
plt.yscale('log')
plt.show()

In [None]:
# Plot SleepTime
sns.histplot(data=df_numeric, x='SleepTime', bins=20)
plt.yscale('log')
plt.show()

----
### Challenge 1: Correlation Plot

Calculate the correlation matrix among the numeric features, and plot it using `seaborn`. Use this [example](https://seaborn.pydata.org/examples/many_pairwise_correlations.html) as a reference.

You can create the correlation matrix with the `corr()` function, and then plot it using a `seaborn` `heatmap()`.

----

Next, let's look at the categorical features.

In [None]:
df_categorical = df.select_dtypes("object")
categorical_features = df_categorical.columns
df_categorical.head()

What is the number of unique values for each categorical feature?

In [None]:
df_categorical.nunique()

Let's plot the distributions of all the categorical features.

Note that we're doing this in a single set of subplots using `matplotlib`. There's a lot of code here - don't stress too much about the details. Instead, focus on the plot output and what the distributions of the variables look like. What do you notice?

In [None]:
# We choose 3 rows - feel free to adjust
nrows = 3
# Number of columns chosen automatically based on number of features
ncols = categorical_features.size // nrows + 1

# Create subplots using matplotlib
fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=(nrows * 9, ncols * 2.5))
# Adjust subplot spacing
plt.subplots_adjust(hspace=0.75)

# Iterate over categorical features
for idx, feature in enumerate(categorical_features):
    # Choose axis for features
    ax = axes[idx // ncols, idx % ncols]
    # Calculate proportions and plot bars
    df_categorical[feature].value_counts(normalize=True).sort_index().plot(
        kind='bar',
        ax=ax)
    # Rotate x ticks
    ax.tick_params(axis='x', rotation=0)
    # Set y limits
    ax.set_ylim([0, 1])
    # Create title for plot
    ax.set_title(feature)

# Adjustments to single plots
axes[1, 1].set_xticklabels(axes[1, 1].get_xticklabels(), rotation=40, ha='right', fontsize=14)
cur_xticks = axes[1, 2].get_xticklabels()
cur_xticks[0] = 'AI/AN'
axes[1, 2].set_xticklabels(cur_xticks, rotation=40, ha='right')
axes[1, 1].set_ylim([0, 0.12])
cur_xticks = axes[1, 3].get_xticklabels()
cur_xticks[1] = 'Borderline'
cur_xticks[3] = 'During\nPregnancy'
axes[1, 3].set_xticklabels(cur_xticks, rotation=40, ha='right')
axes[2, 0].set_xticklabels(axes[2, 0].get_xticklabels(), rotation=40, ha='right')
# Turn off unused plot
axes[-1, -1].axis(False)

plt.show()

Now let's see how these features correlate with the target variable.

For the continuous variables, for example, we can examine the distribution of each feature separately for the samples where the patient had heart disease and the samples where the patient did not have heart disease.

We can use a `seaborn` histogram with a `hue` argument to compare this directly. Pay attention to what arguments are passed into the function. What do these plots tell you?

In [None]:
sns.histplot(data=df, x='BMI', hue='HeartDisease', stat='density', bins=50, common_norm=False)
plt.xlim([0, 50])
plt.show()

In [None]:
sns.histplot(data=df, x='PhysicalHealth', hue='HeartDisease', stat='density', bins=10, common_norm=False)
plt.show()

In [None]:
sns.histplot(data=df, x='MentalHealth', hue='HeartDisease', stat='density', bins=10, common_norm=False)
plt.show()

In [None]:
sns.histplot(data=df, x='SleepTime', hue='HeartDisease', stat='density', bins=15, common_norm=False)
plt.xlim([0, 15])
plt.show()

Now, for the categorical data, we'll plot the average `HeartDisease` rate for each unique value of the variable. For example, consider how heart disease varies with smoking. Let's convert the heart disease feature into a binary label to make this easy:

In [None]:
# Create binary variable for heart disease
df_categorical['HeartDiseaseBinary'] = np.where(df_categorical['HeartDisease'] == 'Yes', 1, 0)

Now, we group the samples by smoking, and calculate the heart disease rate for each group by averaging across the `HeartDiseaseBinary` variable. We can then visualize the rates as a bar plot:

In [None]:
df_categorical.groupby("Smoking")['HeartDiseaseBinary'].mean().plot(kind = "barh")
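To see why averaging a 0/1 column gives a rate, here is a minimal sketch on a tiny, made-up table (the values are hypothetical and are not taken from the survey). Within each group, the mean of the binary column is simply the fraction of rows equal to 1:

```python
import pandas as pd

# Toy data with hypothetical values (not from the heart disease dataset)
toy = pd.DataFrame({
    "Smoking": ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"],
    "HeartDiseaseBinary": [1, 1, 0, 0, 1, 0, 0, 0],
})

# The mean of a 0/1 column within each group is the fraction of 1s in that group
rates = toy.groupby("Smoking")["HeartDiseaseBinary"].mean()
print(rates)
```

In this toy table, two of the four smokers and one of the four non-smokers have the label 1, so the rates are 0.5 and 0.25.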

Let's do this same procedure for all categorical features. Once again, don't worry too much about the code. Focus instead on what the data is telling you. What correlates with heart disease?

In [None]:
# We choose 3 rows - feel free to adjust
nrows = 3
# Number of columns chosen automatically based on number of features
categorical_predictors = [feature for feature in df_categorical.columns
                          if 'HeartDisease' not in feature]
ncols = len(categorical_predictors) // nrows + 1

# Create subplots using matplotlib
fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=(nrows * 9, ncols * 2.5))
# Adjust subplot spacing
plt.subplots_adjust(hspace=0.75)

# Iterate over categorical features
for idx, feature in enumerate(categorical_predictors):
    # Make sure we skip over the heart disease features
    if 'HeartDisease' not in feature:
        # Choose axis for features
        ax = axes[idx // ncols, idx % ncols]
        # Calculate proportions and plot bars
        df_categorical.groupby(feature)['HeartDiseaseBinary'].mean().sort_index().plot(kind='bar', ax=ax)
        # Remove x label
        ax.set_xlabel('')
        # Rotate x ticks
        ax.tick_params(axis='x', rotation=0)
        # Create title for plot
        ax.set_title(feature)

# Adjustments to single plots
axes[1, 0].set_xticklabels(axes[1, 0].get_xticklabels(), rotation=40, ha='right', fontsize=14)
cur_xticks = axes[1, 1].get_xticklabels()
cur_xticks[0] = 'AI/AN'
axes[1, 1].set_xticklabels(cur_xticks, rotation=40, ha='right')
cur_xticks = axes[1, 2].get_xticklabels()
cur_xticks[1] = 'Borderline'
cur_xticks[3] = 'During\nPregnancy'
axes[1, 2].set_xticklabels(cur_xticks, rotation=40, ha='right')
axes[1, 4].set_xticklabels(axes[1, 4].get_xticklabels(), rotation=40, ha='right')
# Turn off unused plots
axes[-1, -2].axis(False)
axes[-1, -1].axis(False)

plt.show()

### Feature Engineering and Preprocessing

After we've performed EDA and gotten a sense of what features correlate with the target variable, we need to do feature engineering and preprocessing.

Feature engineering is at the heart of machine learning, and there's no single way to do it. It is the process of constructing new features that we think might be informative about the target variable. This could mean one-hot encoding categorical data, creating interaction terms, or applying other preprocessing. These are all steps we take *prior* to fitting a model in order to make the data more suitable for prediction.

We're going to do limited feature engineering in the interest of time. However, continue thinking about what useful features may exist while working with the data. Specifically, we will:

- Adjust the age feature,
- Label encode the target variable,
- Scale the numerical features,
- One-hot encode the categorical data.

First, we'll adjust the age variable into a pseudo-continuous variable. We'll do this because the age category feature has 13 unique values, which is quite a lot for a categorical variable. Furthermore, age has an ordered structure, which we lose when we use the categorical formulation. What we'll do is replace each age value with the lower limit of the age range. We lose some information this way, but sometimes there's a cost to preprocessing.

In [None]:
# Unique age category values
df["AgeCategory"].value_counts().sort_index()

In [None]:
# Create "Age" column by taking the left number (remember, it's a string) and converting it to a float
df["Age"] = df['AgeCategory'].str[:2].astype(float)
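As a quick check of the string slicing used above, here is a small sketch on a few example labels. It assumes the age categories are formatted like `"18-24"` and `"80 or older"`, so that the first two characters are always the lower limit of the range:

```python
import pandas as pd

# Example labels, assumed to match the format of the AgeCategory column
age_categories = pd.Series(["18-24", "55-59", "80 or older"])

# Keep the first two characters of each string and convert them to floats
ages = age_categories.str[:2].astype(float)
print(ages.tolist())
```

If a label did not start with a two-digit number, the conversion to `float` would raise an error, which is a useful safeguard here.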

Now, let's remove the age category as well as the heart disease column to obtain a "design matrix". We'll also extract the heart disease column into its own dependent variable:

In [None]:
X = df.drop(["HeartDisease", "AgeCategory"], axis=1).copy()
y = df["HeartDisease"]

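Now, before we scale and one-hot encode the data, let's first split it into training and validation datasets. Why do we do this? Preprocessing rules (for example, the means and standard deviations used for scaling) should be learned from the training set only and then applied to the test (or validation) set, so that no information from the held-out data leaks into training. We'll use `sklearn`'s `train_test_split` function to perform the split: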

In [None]:
X_train_raw, X_valid_raw, y_train_raw, y_valid_raw = train_test_split(X, y, test_size=0.25, random_state=1, stratify=y)
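The `stratify=y` argument asks `train_test_split` to keep the class proportions roughly equal in both splits, which matters when one class is rare. The sketch below illustrates this on synthetic labels (a made-up 10% positive rate, not the survey data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic imbalanced labels: 10% positive (hypothetical, not the survey data)
y_toy = np.array([1] * 100 + [0] * 900)
X_toy = np.arange(1000).reshape(-1, 1)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=1, stratify=y_toy)

# Compare the fraction of positives overall, in training, and in validation
print(y_toy.mean(), y_tr.mean(), y_va.mean())
```

Try removing `stratify=y_toy` and changing `random_state` to see how much the positive rate can drift between splits.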

Before, we converted the heart disease feature into a binary feature using a `numpy` function. Now, we'll do it again, using a `scikit-learn` function. The benefit to using the `LabelEncoder` is that `scikit-learn` does all the work for us. It will also give us an object that can be applied to new data. For example, we can fit the `LabelEncoder` to the training data, and apply it to the validation data:

In [None]:
# Initialize label encoder
labeler = LabelEncoder()
# Fit and transform the target variable from the training set
y_train = labeler.fit_transform(y_train_raw)
# Transform the validation target variable
y_valid = labeler.transform(y_valid_raw)

In [None]:
# What classes did we obtain?
print(labeler.classes_)
# Confidence check: does it work?
print(labeler.transform(["No", "Yes"]))
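One more benefit of keeping the fitted encoder around is that it can map integer predictions back to the original labels. A minimal sketch with a separate, hypothetical encoder (`toy_labeler`) fitted on a handful of labels:

```python
from sklearn.preprocessing import LabelEncoder

# A separate toy encoder, so we don't touch the one fitted above
toy_labeler = LabelEncoder()
encoded = toy_labeler.fit_transform(["No", "Yes", "No", "No"])
print(encoded)

# inverse_transform maps integer codes back to the original string labels
decoded = toy_labeler.inverse_transform([0, 1, 1])
print(decoded)
```

The classes are sorted alphabetically, which is why "No" is encoded as 0 and "Yes" as 1.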

Next, to perform additional preprocessing, we're going to create a `scikit-learn` `Pipeline`. The `Pipeline` allows us to compose multiple steps into a single object, which we can then fit and apply to multiple datasets. Let's take it one step at a time.

In [None]:
# Collect all features
feature_cols = X_train_raw.columns
# Identify numeric features
numeric_cols = X_train_raw.select_dtypes("number").columns.tolist()
# Identify categorical features
categorical_cols = X_train_raw.select_dtypes("object").columns.tolist()

Every `Pipeline` is composed of steps. Each step is a tuple of two elements: the first tells us the name, and the second tells us what transformation is happening. We can create a `Pipeline` by stitching together steps via a list. We can also create a `Pipeline` by stitching together smaller `Pipeline`s.

The tricky thing about a `Pipeline` is that it applies a transformation to all the data. This won't work in cases with heterogeneous data. For example, we don't want to one-hot encode continuous features, and we don't want to standardize categorical features. So, we need one more tool: the `ColumnTransformer`. The `ColumnTransformer` lets us apply different transformations to different subsets of columns and then combines the results into a single output.

In [None]:
# Use Pipeline to create a numeric transformer: the Pipeline accepts a list of tuples
numeric_transformer = Pipeline([("scaler", StandardScaler())])

In [None]:
# Use Pipeline to create a categorical transformer
categorical_transformer = Pipeline(
    [("one_hot_encoder",
      OneHotEncoder(categories='auto', 
                    handle_unknown='error', 
                    sparse_output=False,  # use sparse=False on scikit-learn < 1.2
                    drop="first"))
    ])

In [None]:
# Now, we create the overall preprocessor with the ColumnTransformer.
# Like a Pipeline, the ColumnTransformer takes a list of steps.
# In this case, each step needs a tuple with length 3:
# 1. The name of the step.
# 2. The Pipeline to apply at that step.
# 3. The columns to apply the step to.
preprocessor = ColumnTransformer(transformers=[
    # First step: numeric features
    ("numeric", numeric_transformer, numeric_cols),
    # Second step: categorical features
    ("categorical", categorical_transformer, categorical_cols)
])
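To make the column routing concrete, here is a small sketch on a made-up table with one numeric and two categorical columns. The column names and values are hypothetical, and passing `sparse_threshold=0` is a choice made here only to force a dense `numpy` output regardless of the encoder's settings:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data with hypothetical columns and values
toy = pd.DataFrame({
    "height": [150.0, 160.0, 170.0, 180.0],
    "smoker": ["Yes", "No", "No", "Yes"],
    "health": ["Good", "Poor", "Fair", "Good"],
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        # Scale only the numeric column
        ("numeric", StandardScaler(), ["height"]),
        # One-hot encode only the categorical columns, dropping the first level
        ("categorical", OneHotEncoder(drop="first"), ["smoker", "health"]),
    ],
    # Always return a dense array
    sparse_threshold=0,
)

toy_out = toy_preprocessor.fit_transform(toy)
print(toy_out.shape)
```

The output has one scaled column for `height`, one indicator column for `smoker` (two levels, first dropped), and two indicator columns for `health` (three levels, first dropped).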

Pipelines can also include classifiers (e.g., a `LogisticRegression`). In that case, the output of the pipeline would be a trained model. For now, however, we'll simply preprocess the data using the `Pipeline`.

In [None]:
# Fit transform the train dataset using the pipeline
X_train = preprocessor.fit_transform(X_train_raw)
# Transform the validation dataset with the rules learned from the training dataset
X_valid = preprocessor.transform(X_valid_raw)

Now, we have our data preprocessed. Notice how easy, clean, and reproducible this process was. This demonstrates the value of using the `Pipeline` to conduct machine learning analyses. We can quickly and cleanly transform any new batches of data to conform to the rules established by the training dataset.

In [None]:
# View result
print(X_train.shape)
print(X_train)
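For reference, this is roughly what a `Pipeline` that ends in a classifier looks like. The sketch below uses synthetic data from `make_classification` rather than the heart disease data, and the step names (`"scaler"`, `"classifier"`) are arbitrary labels chosen for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only to illustrate the mechanics
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

# The last step is an estimator, so the whole pipeline behaves like a model
clf_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])

# fit() fits the scaler, transforms the data, then fits the classifier
clf_pipeline.fit(X_toy, y_toy)
# predict() applies the fitted scaler before calling the classifier
preds = clf_pipeline.predict(X_toy)
print(preds[:10])
```

One advantage of this design is that the scaler is re-fit inside each training fold when the pipeline is passed to a cross-validation function.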

----
### Challenge 2: Pipelines on Training vs. Validation Data

Why do we use `fit_transform` for the training data, and only `transform` for the validation data? How does this prevent data leakage?

----

----
### Challenge 3: Extending Pipelines

A common preprocessing step is to decorrelate data via Principal Components Analysis (PCA). We have an alternative workshop on PCA and other dimensionality reduction techniques, but for now, all you need to know is that it has its own transformation object in `scikit-learn`, just like the `StandardScaler`. Extend the `numeric_transformer` to have an additional step which performs PCA. Check out the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) on PCA in `scikit-learn`.

----

Notice that the output of the preprocessor is a `numpy` array with more columns than the original dataframe. This is because the one-hot encoding created some new columns. It'd be nice if we had this data in a data frame, with named columns. So, the last step we'll do is convert this back into a data frame. First, we need to access the new column names, which we can do with the `get_feature_names_out` method:

In [None]:
preprocessor.named_transformers_['categorical']['one_hot_encoder'].get_feature_names_out(categorical_cols)

In [None]:
# Access pipeline data so we can name the one-hot encoded columns
new_categorical_cols = preprocessor.named_transformers_['categorical']['one_hot_encoder'].get_feature_names_out(categorical_cols).tolist()
# Create list of column names - numeric columns don't change
column_names = numeric_cols + new_categorical_cols
# Create dataframes
X_train = pd.DataFrame(data=X_train, columns=column_names)
X_valid = pd.DataFrame(data=X_valid, columns=column_names)
X_train.head()

Voila! We have our finalized dataset and we can move on to building predictive models.

## Part 3: Modeling Process

Now the fun begins!

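But first: we need to calculate a baseline accuracy. This is the accuracy we would get by always predicting the most common class, which equals the fraction of the data belonging to that class. Since the outcome variable is coded as 0 or 1, its mean gives the fraction of positive samples, and we can obtain the baseline from that.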

----
### Challenge 4: Baseline Accuracy

Calculate the baseline accuracy of the training data. What does this baseline accuracy mean in the context of the data? What does it mean in the context of trying to predict heart disease outcomes?

----

----
### Challenge 5: False Positives and False Negatives

Accuracy is not the only thing we need to be worried about when building machine learning models. A model can be accurate, and still have issues in what samples it classifies correctly, and what samples it makes mistakes on.

For example, consider false positives and false negatives. What does each mean in this context? To be clear, if a model produces a false positive, what does that mean in the context of classifying heart disease?

If such a model were deployed in real life, which of the two - false positives or false negatives - would be more costly? Which should we be more concerned about?

----

### First Model: Logistic Regression

The first model we'll try is called logistic regression, which we've already studied in Python Machine Learning. As a reminder, logistic regression is a linear model that can be used to predict the probability of a sample falling in a certain class. Thus, it's a common model for classification.

We're going to use the `cross_val_score` function to calculate model performance across folds in the training data. The way we'll cross-validate is via the `StratifiedKFold` cross-validator.

----
### Challenge 6: Stratifying Cross-Validation

What does `StratifiedKFold` do that `KFold` does not? Why is stratifying cross-validation important, particularly in this context?

----

In [None]:
# Make list of metric functions
metrics = ['accuracy', 'precision', 'recall', 'f1']
n_folds = 5

In [None]:
# Create cross-validator
skfold = StratifiedKFold(n_folds)
# Create model
model = LogisticRegression()

# Iterate over metrics
for metric in metrics:
    cv_results = cross_val_score(model, X_train, y_train, cv=skfold, scoring=metric)
    print(f"Mean {metric} score is {cv_results.mean().round(3)}")
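The loop above re-runs cross-validation once per metric. As an alternative, `scikit-learn`'s `cross_validate` function accepts a list of metrics and computes them all in a single pass. The sketch below shows the idea on synthetic, imbalanced data from `make_classification`; the variable names are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic imbalanced data (roughly 80% / 20%), not the survey data
X_toy, y_toy = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=0)

toy_metrics = ["accuracy", "precision", "recall", "f1"]
scores = cross_validate(
    LogisticRegression(), X_toy, y_toy,
    cv=StratifiedKFold(5), scoring=toy_metrics)

# Results are stored under keys of the form "test_<metric>", one value per fold
for metric in toy_metrics:
    print(f"Mean {metric} score is {scores['test_' + metric].mean():.3f}")
```

Because each fold's model is fit only once, this is usually faster than looping over metrics with `cross_val_score`.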

----
### Challenge 7: Ridge Regression

Re-run the above analysis, but use ridge regression instead. In particular, use `RidgeClassifierCV` (the classification counterpart of `RidgeCV`) so that you can choose the best regularization penalty. 

Notice that, by doing this, we have two loops of cross-validation: an outer loop in which we report metrics, and an inner loop which we use to choose the regularization penalty. This is a common way to report generalization accuracy.

----

At first, the accuracy score of 91.6% might look pretty good. However, recall our baseline accuracy. The baseline accuracy is what we could get if we simply said *nobody* got heart disease. Does our accuracy score look good in that context? This is why establishing baseline accuracy is important, particularly in imbalanced data!

What about the precision and recall scores? What do those metrics tell you?
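To build intuition for how these metrics can disagree, consider a made-up example with 100 patients, 10 of whom have heart disease. The sketch compares a baseline that always predicts "no heart disease" with a hypothetical model that finds 2 of the 10 cases and raises 1 false alarm (these numbers are invented for illustration):

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

# 90 negatives followed by 10 positives (hypothetical labels)
y_true = np.array([0] * 90 + [1] * 10)

# Baseline: always predict the majority class (no heart disease)
y_baseline = np.zeros(100, dtype=int)

# Hypothetical model: 2 true positives and 1 false positive
y_model = np.zeros(100, dtype=int)
y_model[90:92] = 1  # correctly flags two of the ten positives
y_model[0] = 1      # incorrectly flags one negative

print("Baseline accuracy:", accuracy_score(y_true, y_baseline))
print("Baseline recall:  ", recall_score(y_true, y_baseline))
print("Model accuracy:   ", accuracy_score(y_true, y_model))
print("Model precision:  ", precision_score(y_true, y_model))
print("Model recall:     ", recall_score(y_true, y_model))
```

The model's accuracy (0.91) barely beats the baseline (0.90), and its recall of 0.2 shows that it still misses 8 of the 10 patients with heart disease.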

### Second Model: Decision Trees

Next, let's try using a decision tree. We studied these in Part 1 of Machine Learning Fundamentals. You may recall that decision trees have a wide array of *hyperparameters*, or settings in the model we set before fitting it to the data. These can include the maximum depth, the criterion used for performing a split, etc. When we first fit the decision tree, we used default parameters for all of these specified by `scikit-learn`. How can we go about *choosing* the best values instead?

In the case of ridge regression, we did a cross-validation procedure to choose the best hyperparameter. When we have many hyperparameters though, we'll need to do cross-validation across all combinations. Enter the **grid search**, which we can use to search across all combinations of hyperparameters to find the best one.

#### Grid Search for Model Selection

Grid search is a brute-force method that executes cross-validation for *all* possible combinations of hyperparameters from a set of hyperparameter ranges.

Let's consider an example. Suppose we have two hyperparameters $A$ and $B$. We don't know what values to choose for them, so we'll use a grid search to identify the best set. Grid search requires we specify hyperparameter ranges. So, let's say hyperparameter $A$ can be either of the two values $(0, 1)$, and hyperparameter $B$ can be either of the two values $(2, 3)$ (in practice, we might choose more values, but we'll use two each for simplicity). 

Grid search forms each combination of hyperparameters, and fits a model for it. We can then use the validation performance to choose the best combination across all hyperparameters. In this case, we'd consider all the following combinations:

- $A = 0$, $B = 2$
- $A = 0$, $B = 3$
- $A = 1$, $B = 2$
- $A = 1$, $B = 3$

and choose the combination that performs the best.
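The enumeration of combinations can be sketched with Python's built-in `itertools.product`. This only illustrates the idea of a grid for the two hypothetical hyperparameters $A$ and $B$; it is not how `scikit-learn` is implemented internally:

```python
from itertools import product

# Hyperparameter ranges for the hypothetical hyperparameters A and B
grid = {"A": [0, 1], "B": [2, 3]}

# Form every combination of one value per hyperparameter
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]
for combo in combinations:
    print(combo)
```

With two values for each of two hyperparameters we get $2 \times 2 = 4$ combinations. The count multiplies with every hyperparameter added to the grid.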

We can easily perform this process by using `scikit-learn`'s `GridSearchCV`. Let's take a look at how it works by running it on the `max_depth` and `min_samples_leaf` hyperparameters in a decision tree:

In [None]:
# First, we specify a parameter grid as a dictionary
param_grid = {
    "max_depth": np.arange(5, 50, 5),
    "min_samples_leaf": np.arange(20, 200, 10)
}

In [None]:
# What is the size of the parameter grid we're tuning on?
param_grid["max_depth"].shape[0] * param_grid["min_samples_leaf"].shape[0]

That's 162 different sets of parameters. That's a lot of models to fit!

Next, we pass some information into the `GridSearchCV` object (check out the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) for more details):

In [None]:
grid_dt = GridSearchCV(
    # Specify the model
    estimator=DecisionTreeClassifier(),
    # Specify the hyperparameter grid
    param_grid=param_grid,
    # What metric should we use to select for the best model?
    scoring = "accuracy",
    # How do we generate cross-validation folds?
    cv=skfold)

Finally, we can fit the grid search. Let's use the `%%time` magic command to see how long this process takes. 

Note that `%%time` is a cell magic that measures the execution time of a whole cell, whereas `%time` is a line magic that measures only the statement on its own line.

In [None]:
%%time

# Fit the grid search object on the training data
grid_dt.fit(X_train, y_train)

We've got a fitted grid search variable! Let's take a look at what we get with it. First, the best score:

In [None]:
grid_dt.best_score_

We can also get the best cross-validated parameters:

In [None]:
grid_dt.best_params_

The grid search variable is its own predictor, and we can run it on any set of samples:

In [None]:
grid_dt.predict(X_train)

You can also directly access the estimator used in the `predict()` function. This is the `best_estimator_` attribute:

In [None]:
best_dt = grid_dt.best_estimator_
type(best_dt)
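Beyond the single best result, a fitted grid search also stores the score of every combination in its `cv_results_` attribute, which converts neatly into a data frame. The sketch below uses a deliberately tiny grid on synthetic data so that it runs quickly; the grid values are arbitrary:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data and a tiny grid, used only for illustration
X_toy, y_toy = make_classification(n_samples=300, n_features=6, random_state=0)

toy_grid = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4], "min_samples_leaf": [5, 20]},
    scoring="accuracy",
    cv=3)
toy_grid.fit(X_toy, y_toy)

# One row per hyperparameter combination
results = pd.DataFrame(toy_grid.cv_results_)
print(results[["params", "mean_test_score", "rank_test_score"]]
      .sort_values("rank_test_score"))
```

Inspecting the full table is a good way to check whether the best combination is clearly better than the rest or only marginally so.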

----
### Challenge 8: Choosing a Different Scoring Metric

Run a new grid search, this time using recall as the choice of the scoring metric. What are the best parameters in this case? Are they different from before?

----

### Third Model: Random Forests

So far, our modeling hasn't yielded great results. Let's consider a different model, which is more commonly used in harder prediction problems.

This model is called the Random Forest. As you might expect from the name, a random forest is a collection of many decision trees. Specifically, it's an **ensemble model**, since it consists of an ensemble of $N$ decision trees. The $N$ trees in the forest separately make predictions, each of which counts as a vote toward the final prediction. The ensemble prediction - typically by majority voting - usually performs better than a single tree alone. This is the machine learning version of a model "greater than the sum of its parts".

![](https://upload.wikimedia.org/wikipedia/commons/7/76/Random_forest_diagram_complete.png)

There are a few important things to note about the random forest:

- Each tree in the forest is not trained on the same data and features. That would be counterproductive, because you would end up with dozens of duplicate trees.
- Instead, each tree is trained on a bootstrapped sample of the data (which contains roughly 2/3 of the unique samples), and only a random subset of the features is considered when splitting. This helps reduce the variance of the predictions.
- The individual trees are usually grown deep without pruning. Each tree may overfit on its own, but averaging over many decorrelated trees counteracts this.
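To get a feel for what a bootstrapped sample looks like, the sketch below draws one with `numpy` by sampling row indices with replacement. The sample size of 10,000 is arbitrary, and this only illustrates the resampling idea, not the internals of `scikit-learn`'s random forest:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 10_000

# A bootstrap sample draws n_samples row indices *with replacement*
bootstrap_idx = rng.integers(0, n_samples, size=n_samples)

# Some rows are drawn several times, others not at all
unique_fraction = np.unique(bootstrap_idx).size / n_samples
print(round(unique_fraction, 3))
```

For large datasets, the fraction of distinct rows in a bootstrap sample approaches $1 - 1/e \approx 0.632$, so each tree sees a somewhat different slice of the data.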

![](https://miro.medium.com/max/1240/1*EemYMyOADnT0lJWSXmTDdg.jpeg)

We are going to gloss over some of the details of random forests in order to focus on their application in this context. However, those details are important! Check out this [blog post](https://victorzhou.com/blog/intro-to-random-forests/) for a gentle introduction to random forests. For a *very* in-depth explanation of random forests, check out Chapter 15 of [The Elements of Statistical Learning](https://hastie.su.domains/Papers/ESLII.pdf).

Let's get a sense for how a random forest performs without any hyperparameter tuning. We'll use the `RandomForestClassifier` from `scikit-learn`. The `n_estimators` argument is where we specify the number of trees.

In [None]:
# Create random forest
rf = RandomForestClassifier(n_estimators=50)
# Cross-validate
cv_results = cross_val_score(rf, X_train, y_train, cv=5)
cv_results.mean()

Not much improvement. Let's bring in the grid search to see if we can improve on this result.

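Let's try varying the following hyperparameters: `n_estimators` and `min_samples_split`. Note that `min_samples_split` accepts either an integer (a minimum number of samples) or a float (a fraction of the training samples), so the `0.25` in the grid means a node must contain at least 25% of the samples before it can be split.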

In [None]:
param_grid = {
    "n_estimators": [50, 100, 200],
    "min_samples_split": [2, 5, 10, 0.25],
}

grid_rf = GridSearchCV(
    estimator=RandomForestClassifier(),
    param_grid=param_grid,
    cv=skfold,
    scoring="accuracy")

In [None]:
%%time
grid_rf.fit(X_train, y_train)

In [None]:
print(grid_rf.best_score_)
print(grid_rf.best_params_)

In [None]:
best_rf = grid_rf.best_estimator_

We did slightly better than the baseline accuracy, but not much - this is a hard problem!

----
### Challenge 9: Random Search

The downside of grid search is that it can take a long time, especially when you have a large number of hyperparameters, a complex model, and a lot of data. Grid search quickly becomes unwieldy because the number of combinations to evaluate grows multiplicatively with every hyperparameter you add.

Random search is a solution to this issue. In a random search, we evaluate model performance on only a randomly selected fraction of the hyperparameter sets. You're not guaranteed to find the best set of parameters, but random search often performs pretty well, especially when there are computational constraints.

You can do a random search in `scikit-learn` with `RandomizedSearchCV`. Choose a set of hyperparameters, and run a random search with `RandomizedSearchCV`. What is the best set of parameters?
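
If you'd like a starting point, here's a rough sketch of the API on a small synthetic dataset. The `scikit-learn` class is named `RandomizedSearchCV`. The data (`X_demo`, `y_demo`), the parameter ranges, and the choice of `n_iter=5` are all assumptions picked to keep the example fast; they are not recommendations for the heart disease data.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (an assumption; NOT the heart disease dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Distributions to sample from, instead of a fixed grid of values
param_distributions = {
    "n_estimators": randint(10, 50),       # integers from 10 to 49
    "min_samples_split": randint(2, 20),   # integers from 2 to 19
}

random_search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,          # only 5 randomly sampled combinations are evaluated
    cv=3,
    scoring="accuracy",
    random_state=0)
random_search.fit(X_demo, y_demo)

print(random_search.best_params_)
print(random_search.best_score_)
```

The key difference from `GridSearchCV` is `n_iter`: it caps the number of combinations tried, so the run time no longer depends on how many candidate values each hyperparameter has.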

----

## Part 4: Evaluation and Interpretation

It's time to bring back our validation dataset to get a stronger sense of how well our models perform on out-of-sample data.

We're going to use the logistic regression, decision tree, and random forest models we created to make predictions on the validation features and evaluate them on a variety of metrics.

In [None]:
# Refit logistic regression
lr = LogisticRegression()
lr.fit(X_train, y_train)

In [None]:
# Make predictions
lr_pred = lr.predict(X_valid)
dt_pred = best_dt.predict(X_valid)
rf_pred = best_rf.predict(X_valid)

Let's use the `classification_report` and `confusion_matrix` functions to evaluate the predictions:

In [None]:
print('Logistic Regression\n')
print(confusion_matrix(y_valid, lr_pred))
print(classification_report(y_valid, lr_pred))

In [None]:
print('Decision Tree\n')
print(confusion_matrix(y_valid, dt_pred))
print(classification_report(y_valid, dt_pred))

In [None]:
print('Random Forest\n')
print(confusion_matrix(y_valid, rf_pred))
print(classification_report(y_valid, rf_pred))

How do all the models compare to each other?
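
When comparing the reports, it may help to recall how precision and recall are derived from the confusion matrix. Here's a small worked example on made-up labels (`y_true_toy` and `y_pred_toy` are hypothetical, not model output):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 6 negatives and 4 positives
y_true_toy = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred_toy = np.array([0, 0, 0, 0, 0, 1, 1, 0, 0, 0])

# For binary labels, rows are the true class and columns the predicted class,
# so ravel() returns the counts in the order: tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

precision = tp / (tp + fp)   # of the predicted positives, how many were right?
recall = tp / (tp + fn)      # of the actual positives, how many did we find?
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(precision, recall, accuracy)  # 0.5 0.25 0.6
```

In this toy case the accuracy (0.6) is no better than always predicting the majority class, while recall shows that only 1 of the 4 positives was found. This is why accuracy alone can be misleading when the classes are imbalanced.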

### Interpretation

The model performance we obtained was OK, but not amazing. This often happens in the development of machine learning algorithms. It's useful at this point to try and interpret our models to see where they're getting signal from in order to decide what to do next. Do we need more data? Do we need better features? Do we need a better model?

First, let's take a look at the logistic regression coefficients.

In [None]:
coefs = lr.coef_[0]
coefs = pd.Series(index=column_names, data=coefs)
coefs.sort_values(ascending=False)

Which features, according to the model, are the most and least associated with heart disease? How should you interpret categorical coefficients versus numerical coefficients? What does the sign of a coefficient mean?
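
One hint for these questions: a logistic regression coefficient is the change in the log-odds of the positive class for a one-unit increase in the feature, holding the other features fixed. Exponentiating a coefficient turns it into an odds ratio, which is often easier to read. The sketch below uses hypothetical coefficient values (`toy_coefs`), not the ones fitted above:

```python
import numpy as np
import pandas as pd

# Hypothetical coefficients for illustration only (NOT fitted values)
toy_coefs = pd.Series({"feature_a": 0.7, "feature_b": 0.0, "feature_c": -0.4})

# exp(coefficient) = multiplicative change in the odds per one-unit increase
odds_ratios = np.exp(toy_coefs)
print(odds_ratios.sort_values(ascending=False))
```

An odds ratio above 1 (positive coefficient) means the feature is associated with higher odds of the positive class, below 1 (negative coefficient) with lower odds, and exactly 1 (zero coefficient) with no change. For a 0/1 indicator column, "one unit" is the switch from the reference category to that category. For a numerical column, "one unit" depends on how the feature was scaled during preprocessing.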

With tree-based models, feature importance works a little differently. We can access a `feature_importances_` attribute which captures the "importance", defined as "The (normalized) total reduction of the criterion brought by that feature." Basically, it quantifies how much the criterion we used (in our case the Gini impurity) was reduced by the splits made on that feature. Notice that these feature importances are not signed:

In [None]:
dt_fi = best_dt.feature_importances_
dt_fi = pd.Series(index=column_names, data=dt_fi)
dt_fi.sort_values(ascending=False)

In [None]:
rf_fi = best_rf.feature_importances_
rf_fi = pd.Series(index=column_names, data=rf_fi)
rf_fi.sort_values(ascending=False)

Do the two sets of feature importances differ greatly? What do they tell us about predicting heart disease?
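
To make the "reduction of the criterion" more concrete, here's a small worked example of the Gini impurity decrease for a single split. The labels and the helper function `gini` are made up for illustration; they are not part of `scikit-learn`'s API.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# Hypothetical node with 4 samples of each class, split into two children
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
left = np.array([0, 0, 0, 1])
right = np.array([0, 1, 1, 1])

# Children's impurities are weighted by the share of samples they receive
weighted_child = (len(left) * gini(left) + len(right) * gini(right)) / len(parent)
decrease = gini(parent) - weighted_child

print(gini(parent), weighted_child, decrease)  # 0.5 0.375 0.125
```

A feature's importance is built from decreases like this one: the decreases from every split that uses the feature are weighted by the fraction of samples reaching the node, summed, and then normalized so that all importances add up to 1.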

Some of the feature importances are pretty low. This could imply that we should cut those features out of the model. This may improve generalization performance, since the model is not trying to incorporate those less predictive features during training. This choice falls in the domain of *feature selection*. So, in future work, one thing we could do is retrain models without these features. We could also use regularization to implicitly do feature selection (e.g., an L1 penalty, as used in Lasso regression).
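
As a rough sketch of what regularization-based feature selection could look like, the example below fits an L1-penalized logistic regression on synthetic data and checks which coefficients survive. The data, the names `X_demo`, `y_demo`, and `l1_model`, and the value `C=0.05` are assumptions chosen for illustration; for the heart disease data you would want to tune `C` with cross-validation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 20 features, only 3 of which carry signal
X_demo, y_demo = make_classification(
    n_samples=500, n_features=20, n_informative=3, n_redundant=0,
    n_clusters_per_class=1, random_state=0)

# Put features on a common scale so the penalty treats them comparably
X_scaled = StandardScaler().fit_transform(X_demo)

# Smaller C means a stronger penalty, which pushes more coefficients to zero
l1_model = LogisticRegression(
    penalty="l1", solver="liblinear", C=0.05, random_state=0)
l1_model.fit(X_scaled, y_demo)

# Features whose coefficients were not shrunk all the way to zero
kept = np.flatnonzero(l1_model.coef_[0])
print(len(kept), "of", X_scaled.shape[1], "features kept:", kept)
```

Features with a coefficient of exactly zero are effectively dropped from the model, which is why the L1 penalty is described as doing feature selection implicitly.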

Try to think about steps you might take to improve your models!

# Machine Learning Walkthrough Recap

In this exercise we attempted to predict the onset of heart disease. We did the following:

- We familiarized ourselves with the data and its patterns by studying the data dictionary and conducting exploratory data analyses.
- We applied a number of feature engineering techniques to the data to prepare for modeling.
- We employed three different machine learning models, two of which we tuned with a grid search in order to maximize the generalization performance.
- We evaluated our models on a validation dataset to get a sense of how well they do on an out-of-sample dataset.
- We analyzed the relationship between the features and target variable by using attributes provided by the model.