In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import sklearn
print(f"Using Scikit-Learn version: {sklearn.__version__}")

Using Scikit-Learn version: 1.5.1


## 0. An end-to-end Scikit-Learn workflow

Before we get in-depth, let's quickly check out what an end-to-end Scikit-Learn workflow might look like.

Once we've seen an end-to-end workflow, we'll dive into each step a little deeper.

Specifically, we'll get hands-on with the following steps:

   1. Getting data ready (split into features and labels, prepare train and test sets)
   2. Choosing a model for our problem
   3. Fit the model to the data and use it to make a prediction
   4. Evaluate the model
   5. Experiment to improve
   6. Save a model for someone else to use
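
As a rough preview, the six steps above can be sketched in a few lines. This is a minimal sketch rather than the exact workflow we'll build below: it assumes a synthetic dataset from `make_classification` in place of the heart disease data, and the names `model`, `bigger_model` and `saved_model` are made up for illustration.

```python
import pickle

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# 1. Get data ready (synthetic stand-in: 300 samples, 13 features, 2 classes)
X, y = make_classification(n_samples=300, n_features=13, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# 2. Choose a model
model = RandomForestClassifier(n_estimators=100, random_state=42)

# 3. Fit the model to the data and use it to make predictions
model.fit(X_train, y_train)
y_preds = model.predict(X_test)

# 4. Evaluate the model (accuracy on data it has not seen)
baseline_acc = model.score(X_test, y_test)

# 5. Experiment to improve (here: try a different hyperparameter value)
bigger_model = RandomForestClassifier(n_estimators=200, random_state=42)
bigger_model.fit(X_train, y_train)
bigger_acc = bigger_model.score(X_test, y_test)

# 6. Save the model (serialised to bytes here; write them to a file to share)
saved_model = pickle.dumps(model)
loaded_model = pickle.loads(saved_model)

print(f"Baseline accuracy: {baseline_acc:.2f}")
print(f"200-tree accuracy: {bigger_acc:.2f}")
```

The reloaded model should make the same predictions as the original, which is what makes saving useful for sharing.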


## Random Forest Classifier Workflow for Classifying Heart Disease

### 1. Get the data ready

As an example dataset, we'll import heart-disease.csv.

This file contains anonymised patient medical records and whether or not each patient has heart disease (this is a classification problem since we're trying to predict whether something is one thing or another).

In [2]:
import pandas as pd
heart_disease = pd.read_csv("data/heart-disease.csv")
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


Here, each row is a different patient and all columns except `target` are different patient characteristics.
The `target` column indicates whether the patient has heart disease (target=1) or not (target=0). This is our "label" column, the variable we're going to try and predict.

The rest of the `columns` (often called `features`) are what we'll be using to predict the target value.

In [3]:
# Create X (all the feature columns)
X = heart_disease.drop("target", axis=1)

# Create y (the target column)
y = heart_disease["target"]

# Check the head of the features DataFrame
X.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2


In [4]:
# Check the head of the labels
y.head()

0    1
1    1
2    1
3    1
4    1
Name: target, dtype: int64

In [5]:
y.value_counts()

target
1    165
0    138
Name: count, dtype: int64

In [6]:
y

0      1
1      1
2      1
3      1
4      1
      ..
298    0
299    0
300    0
301    0
302    0
Name: target, Length: 303, dtype: int64



One of the most important practices in machine learning is to split datasets into training and test sets.

As in, a model will train on the training set to learn patterns and then those patterns can be evaluated on the test set.

Crucially, a model should never see testing data during training.

This is equivalent to a student studying course materials during the semester (training set) and then testing their abilities on the following exam (testing set).

Scikit-Learn provides the `sklearn.model_selection.train_test_split` function to split datasets into training and test sets.

In [7]:
# Split the data into training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((227, 13), (76, 13), (227,), (76,))
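
Because `train_test_split` shuffles rows at random by default, the rows that land in each split change from run to run. The sketch below shows one way to make a split repeatable with `random_state` and to keep the class balance similar in both splits with `stratify`. It assumes a synthetic stand-in (`X_demo`, `y_demo`) with the same class counts as our labels; the helper `split()` and the seed value 42 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same class counts as the heart disease labels
X_demo = np.arange(303 * 2).reshape(303, 2)
y_demo = np.array([1] * 165 + [0] * 138)


def split(seed):
    """Split the demo data 75/25 with a fixed seed, preserving class balance."""
    return train_test_split(
        X_demo, y_demo, test_size=0.25, random_state=seed, stratify=y_demo
    )


X_tr_a, X_te_a, y_tr_a, y_te_a = split(seed=42)
X_tr_b, X_te_b, y_tr_b, y_te_b = split(seed=42)

# Same shapes as the real split above: 227 training rows, 76 test rows
print(X_tr_a.shape, X_te_a.shape)

# The same seed gives the same rows in the test set
print(np.array_equal(X_te_a, X_te_b))

# The fraction of positive labels is close in both splits
print(f"{y_tr_a.mean():.3f} {y_te_a.mean():.3f}")
```

Fixing the seed is handy when you want someone else to reproduce your results exactly.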


### 2. Choose the model and hyperparameters

Choosing a model often depends on the type of problem you're working on.

For example, Scikit-Learn recommends different models depending on whether you're working on a classification or regression problem.

You can see a map breaking down the different kinds of model options and recommendations in the Scikit-Learn documentation.

Scikit-Learn refers to models as "estimators". However, they are also often referred to as a model or `clf` (short for classifier).

A model's hyperparameters are settings you can change to adjust it for your problem, much like knobs on an oven you can tune to cook your favourite dish.


In [8]:
# Since we're working on a classification problem, we'll start with a RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()

We can see the current hyperparameters of a model with the `get_params()` method.

In [9]:
# View the current hyperparameters
clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'monotonic_cst': None,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}
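
To turn the knobs, hyperparameters can be passed when the model is created or changed later with `set_params()`. The sketch below is only an illustration: the name `clf_custom` and the specific values are arbitrary and not tuned for the heart disease data.

```python
from sklearn.ensemble import RandomForestClassifier

# Set hyperparameters when creating the model...
clf_custom = RandomForestClassifier(n_estimators=200, max_depth=5)

# ...or change them afterwards with set_params()
clf_custom.set_params(min_samples_leaf=2)

# Confirm the new settings
params = clf_custom.get_params()
print(params["n_estimators"], params["max_depth"], params["min_samples_leaf"])
# → 200 5 2
```

Any hyperparameter not set explicitly keeps its default value from the listing above.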


### 3. Fit the model to the data and use it to make a prediction

Fitting a model to a dataset involves passing it the data and asking it to figure out the patterns.

If there are labels (supervised learning), the model tries to work out the relationship between the data and the labels.

If there are no labels (unsupervised learning), the model tries to find patterns and group similar samples together.

Most Scikit-Learn models have the fit(X, y) method built-in, where the X parameter is the features and the y parameter is the labels.

In our case, we start by fitting a model on the training split (X_train, y_train).


In [10]:
clf.fit(X=X_train, y=y_train)
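
To make the supervised and unsupervised cases concrete, here is a small sketch of `fit()` with and without labels. It assumes synthetic data from `make_blobs` rather than our heart disease data, and uses `KMeans` as just one example of an unsupervised estimator.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 100 samples drawn around 2 centres
X_demo, y_demo = make_blobs(n_samples=100, centers=2, random_state=42)

# Supervised: fit() receives the features and the labels
supervised = RandomForestClassifier(random_state=42)
supervised.fit(X_demo, y_demo)

# Unsupervised: fit() receives the features only and groups similar samples
unsupervised = KMeans(n_clusters=2, n_init=10, random_state=42)
unsupervised.fit(X_demo)

print(supervised.classes_)        # labels the classifier learned from y_demo
print(unsupervised.labels_[:10])  # cluster assignments found without labels
```

Note that the cluster numbers chosen by `KMeans` are arbitrary and need not match the original label values.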

### Use the model to make a prediction

The whole point of training a machine learning model is to use it to make some kind of prediction in the future.

Once your model instance is trained, you can use the predict() method to predict a target value given a set of features.

In other words, use the model, along with some new, unseen and unlabelled data to predict the label.

In [11]:
# This doesn't work... incorrect shapes
array=np.array([0,2,3,4])
array.shape
y_label = clf.predict(np.array([0,2,3,4]))



ValueError: Expected 2D array, got 1D array instead:
array=[0. 2. 3. 4.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

We get a ValueError because of mismatched shapes: `predict()` expects a 2D array (samples by features) but received a 1D array.

This happens because we're trying to make predictions on data that is in a different format to the data our model was trained on.

Since our model was trained on data from X_train, predictions should be made on data in the same format and shape as X_train.

Our goal in many machine learning problems is to use patterns learned from the training data to make predictions on the test data (or future unseen data).
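
The reshape hint in the error message can be sketched as follows. This is a minimal example under stated assumptions: it trains on synthetic data with 4 features (so the 4-value array has the right width), whereas our real model expects 13 features per sample.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic training data with 4 features (an assumption for this sketch)
X_demo, y_demo = make_classification(n_samples=100, n_features=4, random_state=42)
model = RandomForestClassifier(random_state=42).fit(X_demo, y_demo)

# A single sample as a 1D array has shape (4,), which predict() rejects
single_sample = np.array([0, 2, 3, 4])

# reshape(1, -1) turns it into one row with four feature columns: shape (1, 4)
as_row = single_sample.reshape(1, -1)

prediction = model.predict(as_row)
print(as_row.shape, prediction.shape)
# → (1, 4) (1,)
```

The number of columns still has to match the number of features the model was trained on, which is why we predict on `X_test` next.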

In [12]:
# In order to predict a label, data has to be in the same shape as X_train
X_test.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
104,50,1,2,129,196,0,1,163,0,0.0,2,0,2
139,64,1,0,128,263,0,1,105,1,0.2,1,1,3
141,43,1,0,115,303,0,1,181,0,1.2,1,0,2
171,48,1,1,110,229,0,1,168,0,1.0,0,0,3
125,34,0,1,118,210,0,1,192,0,0.7,2,0,2


In [13]:
# use the model to make a prediction on the test data (further evaluation)
y_preds = clf.predict(X=X_test)


### 4. Evaluate the model

Now we've made some predictions, we can start to use some more Scikit-Learn methods to figure out how good our model is.

Each model or estimator has a built-in `score()` method.

This method measures how well the model's predictions for a set of features match the true labels.

The score() method for each model uses a standard evaluation metric to measure your model's results.

In the case of a classifier (our model), one of the most common evaluation metrics is accuracy (the fraction of correct predictions out of total predictions).

Let's check out our model's accuracy on the training set.


In [15]:
# Evaluate the model on the training set
train_acc = clf.score(X=X_train, y=y_train)
print(f"The model's accuracy on the training dataset is: {train_acc*100}%")

The model's accuracy on the training dataset is: 100.0%





Woah! Looks like our model does pretty well on the training dataset.

This is because it has already seen both the data and the labels during training.

How about the test dataset?


In [16]:
# Evaluate the model on the test set
test_acc = clf.score(X=X_test, y=y_test)
print(f"The model's accuracy on the testing dataset is: {test_acc*100:.2f}%")

The model's accuracy on the testing dataset is: 81.58%





Hmm, looks like our model's accuracy is a bit lower on the test dataset than on the training dataset.

This is quite often the case, because remember, a model has never seen the testing examples before.

There are also a number of other evaluation methods we can use for our classification models.

All of the following classification metrics come from the sklearn.metrics module:

- `classification_report(y_true, y_pred)` - Builds a text report showing various classification metrics such as precision, recall and F1-score.
- `confusion_matrix(y_true, y_pred)` - Creates a confusion matrix to compare predictions to truth labels.
- `accuracy_score(y_true, y_pred)` - Finds the accuracy score (the default metric) for a classifier.

All metrics have the following in common: they compare a model's predictions (y_pred) to truth labels (y_true).


In [18]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

# Create a classification report
print(classification_report(y_test, y_preds))

              precision    recall  f1-score   support

           0       0.79      0.79      0.79        33
           1       0.84      0.84      0.84        43

    accuracy                           0.82        76
   macro avg       0.81      0.81      0.81        76
weighted avg       0.82      0.82      0.82        76



In [20]:
# Create a confusion matrix
conf_mat = confusion_matrix(y_test, y_preds)
conf_mat

array([[26,  7],
       [ 7, 36]])

In [21]:
# Compute the accuracy score (same as the score() method for classifiers) 
accuracy_score(y_test, y_preds)

0.8157894736842105
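
To see how these metrics relate, the accuracy, precision and recall can be worked out by hand from the confusion matrix above. The sketch below hard-codes the matrix values from that output. In Scikit-Learn's layout, rows are true labels and columns are predicted labels, so for two classes the entries are `[[TN, FP], [FN, TP]]`.

```python
import numpy as np

# Confusion matrix copied from the output above
# (rows are true labels, columns are predicted labels)
conf_mat = np.array([[26, 7],
                     [7, 36]])

# Unpack in row order: true negatives, false positives,
# false negatives, true positives
tn, fp, fn, tp = conf_mat.ravel()

accuracy = (tp + tn) / conf_mat.sum()  # correct predictions / all predictions
precision_1 = tp / (tp + fp)           # of the predicted 1s, how many were right
recall_1 = tp / (tp + fn)              # of the true 1s, how many were found

print(f"accuracy={accuracy:.4f}, precision={precision_1:.2f}, recall={recall_1:.2f}")
# → accuracy=0.8158, precision=0.84, recall=0.84
```

These values agree with the `accuracy_score` result and with the row for class 1 in the classification report.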