In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import sklearn
print(f"Using Scikit-Learn version: {sklearn.__version__}")

Using Scikit-Learn version: 1.5.1


## 0. An end-to-end Scikit-Learn workflow

Before we get in-depth, let's quickly check out what an end-to-end Scikit-Learn workflow might look like.

Once we've seen an end-to-end workflow, we'll dive into each step a little deeper.

Specifically, we'll get hands-on with the following steps:

   1. Getting data ready (splitting into features and labels, preparing training and test sets)
   2. Choosing a model for our problem
   3. Fitting the model to the data and using it to make a prediction
   4. Evaluating the model
   5. Experimenting to improve the model
   6. Saving a model for someone else to use
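
As a rough preview, the six steps above can be sketched in a few lines. This is a minimal sketch, not the workflow we'll build below: it assumes a synthetic dataset from `make_classification` as a stand-in for real data, and the variable names (`scores`, `saved`, `loaded`) and the choice of `pickle` for saving are illustrative assumptions.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# 1. Get data ready (synthetic stand-in: 300 samples, 13 features, 2 classes)
X, y = make_classification(n_samples=300, n_features=13, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# 2. Choose a model
clf = RandomForestClassifier(n_estimators=100, random_state=42)

# 3. Fit the model to the training data and make predictions on the test data
clf.fit(X_train, y_train)
y_preds = clf.predict(X_test)

# 4. Evaluate the model (score() returns mean accuracy for classifiers)
test_accuracy = clf.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.2f}")

# 5. Experiment: try a few different numbers of trees
scores = {}
for n in (10, 50, 100):
    model = RandomForestClassifier(n_estimators=n, random_state=42)
    model.fit(X_train, y_train)
    scores[n] = model.score(X_test, y_test)
print(scores)

# 6. Save the model to bytes and load it back (a file would work the same way)
saved = pickle.dumps(clf)
loaded = pickle.loads(saved)
```

Each numbered comment maps to one of the steps in the list, and the rest of this section walks through them one at a time on real data.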


## Random Forest Classifier Workflow for Classifying Heart Disease

### 1. Get the data ready

As an example dataset, we'll import heart-disease.csv.

This file contains anonymised patient medical records and whether or not each patient has heart disease. This is a classification problem, since we're trying to predict whether something is one thing or another.

In [9]:
import pandas as pd
heart_disease = pd.read_csv("data/heart-disease.csv")
heart_disease.head()

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


Here, each row is a different patient and all columns except `target` are different patient characteristics.
The `target` column indicates whether the patient has heart disease (`target=1`) or not (`target=0`). This is our "label" column, the variable we're going to try to predict.

The rest of the columns (often called features) are what we'll use to predict the target value.

In [10]:
# Create X (all the feature columns)
X = heart_disease.drop("target", axis=1)

# Create y (the target column)
y = heart_disease["target"]

# Check the head of the features DataFrame
X.head()

,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2


In [11]:
# Check the head and value counts of the labels
y.head()

0    1
1    1
2    1
3    1
4    1
Name: target, dtype: int64

In [12]:
y.value_counts()

target
1    165
0    138
Name: count, dtype: int64
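
Raw counts can be easier to compare as proportions. The sketch below rebuilds a label Series with the same counts shown above (165 ones, 138 zeros) so it runs without the CSV file; the names `labels` and `proportions` are illustrative. On the real data, `y.value_counts(normalize=True)` would do the same thing.

```python
import pandas as pd

# Rebuild a label Series with the same class counts as the output above
labels = pd.Series([1] * 165 + [0] * 138, name="target")

# normalize=True returns the fraction of rows per class instead of the raw count
proportions = labels.value_counts(normalize=True)
print(proportions.round(3))
```

Roughly 54% of the patients are in the positive class, so the two classes are fairly balanced.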

In [13]:
y

0      1
1      1
2      1
3      1
4      1
      ..
298    0
299    0
300    0
301    0
302    0
Name: target, Length: 303, dtype: int64



One of the most important practices in machine learning is to split datasets into training and test sets.

A model trains on the training set to learn patterns, and those patterns are then evaluated on the test set.

Crucially, a model should never see the test data during training.

This is like a student studying course materials during the semester (training set) and then testing their abilities on the exam that follows (test set).

Scikit-Learn provides the `sklearn.model_selection.train_test_split` function to split datasets into training and test sets.

In [17]:
# Split the data into training and test sets (25% of rows held out for testing)
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((227, 13), (76, 13), (227,), (76,))
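
Because `train_test_split` shuffles the rows randomly by default, the exact rows in each set change from run to run. One common option, sketched below, is to pass `random_state` for a repeatable split and `stratify` to keep the class proportions similar in both sets. The stand-in arrays `X_demo` and `y_demo` are assumptions made so the example runs without the CSV file; they only mimic the shape and class counts of the heart disease data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data with the same shape and class counts as the heart disease data
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(303, 13))
y_demo = np.array([1] * 165 + [0] * 138)

# random_state fixes the shuffle; stratify keeps the class ratio in both sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(f"Positive rate, train: {y_tr.mean():.3f}, test: {y_te.mean():.3f}")
```

Running the same call again with the same `random_state` returns the same rows, which makes results easier to compare between experiments.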


### 2. Choose the model and hyperparameters

Choosing a model often depends on the type of problem you're working on.

For example, Scikit-Learn recommends different models depending on whether you're working on a classification or a regression problem.

You can see a map breaking down the different kinds of model options and recommendations in the Scikit-Learn documentation.

Scikit-Learn refers to models as "estimators". In code, a classifier is often stored in a variable named `model` or `clf` (short for classifier).

A model's hyperparameters are settings you can change to adjust it for your problem, much like knobs on an oven you can tune to cook your favourite dish.
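
As a rough sketch of what turning those knobs looks like in code, hyperparameters are usually passed as keyword arguments when the model is created. The specific values below (`n_estimators=50`, `max_depth=5`, `n_estimators=200`) and the names `default_clf` and `tuned_clf` are arbitrary choices for illustration, not recommendations for this dataset.

```python
from sklearn.ensemble import RandomForestClassifier

# Default settings (n_estimators defaults to 100, max_depth to None)
default_clf = RandomForestClassifier()

# Same model type with two hyperparameters changed at construction
tuned_clf = RandomForestClassifier(n_estimators=50, max_depth=5)

print(default_clf.n_estimators, default_clf.max_depth)  # 100 None
print(tuned_clf.n_estimators, tuned_clf.max_depth)      # 50 5

# Hyperparameters can also be changed after construction, before fitting
tuned_clf.set_params(n_estimators=200)
print(tuned_clf.n_estimators)  # 200
```

Changing a hyperparameter does not retrain anything by itself; the new setting takes effect the next time `fit()` is called.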


In [18]:
# Since we're working on a classification problem, we'll start with a RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()

We can see the current hyperparameters of a model with the `get_params()` method.