In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

import sklearn
print(f"Using Scikit-Learn version: {sklearn.__version__}")

Using Scikit-Learn version: 1.5.1


## 0. An end-to-end Scikit-Learn workflow

Before we get in-depth, let's quickly check out what an end-to-end Scikit-Learn workflow might look like.

Once we've seen an end-to-end workflow, we'll dive into each step a little deeper.

Specifically, we'll get hands-on with the following steps:

   1. Getting data ready (split into features and labels, prepare train and test sets)
   2. Choosing a model for our problem
   3. Fit the model to the data and use it to make a prediction
   4. Evaluate the model
   5. Experiment to improve
   6. Save a model for someone else to use


## Random Forest Classifier Workflow for Classifying Heart Disease

### 1. Get the data ready

As an example dataset, we'll import heart-disease.csv.

This file contains anonymised patient medical records and whether or not each patient has heart disease (this is a classification problem since we're trying to predict whether something is one thing or another).

In [2]:
import pandas as pd
heart_disease = pd.read_csv("../data/heart-disease.csv")
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


Here, each row is a different patient and all columns except `target` are different patient characteristics.
The `target` column indicates whether the patient has heart disease (target=1) or not (target=0). This is our "label" column, the variable we're going to try to predict.

The rest of the columns (often called features) are what we'll be using to predict the target value.

In [3]:
# Create X (all the feature columns)
X = heart_disease.drop("target", axis=1)

# Create y (the target column)
y = heart_disease["target"]

# Check the head of the features DataFrame
X.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2


In [4]:
#check the head and value counts of the labels
y.head()

0    1
1    1
2    1
3    1
4    1
Name: target, dtype: int64

In [5]:
y.value_counts()

target
1    165
0    138
Name: count, dtype: int64

In [6]:
y

0      1
1      1
2      1
3      1
4      1
      ..
298    0
299    0
300    0
301    0
302    0
Name: target, Length: 303, dtype: int64



One of the most important practices in machine learning is to split datasets into training and test sets.

As in, a model will train on the training set to learn patterns and then those patterns can be evaluated on the test set.

Crucially, a model should never see testing data during training.

This is equivalent to a student studying course materials during the semester (training set) and then testing their abilities on the following exam (testing set).

Scikit-Learn provides the `sklearn.model_selection.train_test_split` function to split datasets into training and test sets.

In [7]:
#split the data into training and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.25)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((227, 13), (76, 13), (227,), (76,))
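
Because `train_test_split` shuffles the rows randomly, the exact split changes from run to run unless a seed is fixed. The sketch below uses a small synthetic dataset (an assumption for illustration, not the heart disease data) to show two optional parameters: `random_state` for a reproducible split and `stratify` to keep the class balance similar in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (an assumption for illustration): 100 samples, 3 features, 60/40 class balance
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([1] * 60 + [0] * 40)

# random_state fixes the shuffle; stratify keeps the label proportions in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())  # fraction of positive labels in each split
```

Whether stratifying is worthwhile depends on the problem; it tends to matter most when one class is much rarer than the other.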


### 2. Choose the model and hyperparameters

Choosing a model often depends on the type of problem you're working on.

For example, there are different models that Scikit-Learn recommends whether you're working on a classification or regression problem.

You can see a map breaking down the different kinds of `model` options and recommendations in the Scikit-Learn documentation.

Scikit-Learn refers to models as "estimators". However, they are often also referred to as `model` or `clf` (short for classifier).

A model's hyperparameters are settings you can change to adjust it for your problem, much like knobs on an oven you can tune to cook your favourite dish.


In [8]:
# Since we're working on a classification problem, we'll start with a RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier()

We can see the current hyperparameters of a model with the `get_params()` method.

In [9]:
# View the current hyperparameters
clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'monotonic_cst': None,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}
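
Any of the hyperparameters listed above can be changed. As a minimal sketch (the specific values here are arbitrary choices for illustration, not recommendations), they can be passed when creating the model or updated afterwards with `set_params()`:

```python
from sklearn.ensemble import RandomForestClassifier

# Set hyperparameters at construction time (values chosen arbitrarily for illustration)
clf_custom = RandomForestClassifier(n_estimators=50, max_depth=5)

# Or change them on an existing instance with set_params()
clf_custom.set_params(n_estimators=200)

params = clf_custom.get_params()
print(params["n_estimators"], params["max_depth"])
```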


### 3. Fit the model to the data and use it to make a prediction

Fitting a model to a dataset involves passing it the data and asking it to figure out the patterns.

If there are labels (supervised learning), the model tries to work out the relationship between the data and the labels.

If there are no labels (unsupervised learning), the model tries to find patterns and group similar samples together.

Most Scikit-Learn models have the fit(X, y) method built-in, where the X parameter is the features and the y parameter is the labels.

In our case, we start by fitting a model on the training split (X_train, y_train).


In [10]:
clf.fit(X=X_train, y=y_train)

### Use the model to make a prediction

The whole point of training a machine learning model is to use it to make some kind of prediction in the future.

Once your model instance is trained, you can use the predict() method to predict a target value given a set of features.

In other words, use the model, along with some new, unseen and unlabelled data to predict the label.

In [11]:
# This doesn't work... the input is a 1D array, but predict() expects a 2D array
array = np.array([0, 2, 3, 4])
array.shape
y_label = clf.predict(array)



ValueError: Expected 2D array, got 1D array instead:
array=[0. 2. 3. 4.].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

We get a ValueError (mismatched shapes), as shown in the output above.

This happens because we're trying to make predictions on data that is in a different format to the data our model was trained on.

Since our model was trained on data from X_train, predictions should be made on data in the same format and shape as X_train.

Our goal in many machine learning problems is to use patterns learned from the training data to make predictions on the test data (or future unseen data).

In [12]:
# In order to predict a label, data has to be in the same shape as X_train
X_test.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
136,60,0,2,120,178,1,1,96,0,0.0,2,0,2
222,65,1,3,138,282,1,0,174,0,1.4,1,1,2
255,45,1,0,142,309,0,0,147,1,0.0,1,3,3
279,61,1,0,138,166,0,0,125,1,3.6,1,1,2
48,53,0,2,128,216,0,0,115,0,0.0,2,0,0


In [13]:
# use the model to make a prediction on the test data (further evaluation)
y_preds = clf.predict(X=X_test)
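
If you only want a prediction for one sample, the error message's suggestion of `reshape(1, -1)` is one way to get the 2D shape `predict()` expects. The sketch below uses synthetic stand-in data from `make_classification` with 13 features (matching the 13 feature columns above); the data and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data: 200 samples with 13 features
X_demo, y_demo = make_classification(n_samples=200, n_features=13, random_state=0)
demo_clf = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_demo, y_demo)

single_sample = X_demo[0]                        # 1D array, shape (13,)
single_sample_2d = single_sample.reshape(1, -1)  # 2D array, shape (1, 13)

prediction = demo_clf.predict(single_sample_2d)  # one prediction per row
print(single_sample.shape, single_sample_2d.shape, prediction.shape)
```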


### 4. Evaluate the model

Now we've made some predictions, we can start to use some more Scikit-Learn methods to figure out how good our model is.

Each model or estimator has a built-in `score()` method.

This method measures how well the model was able to learn the patterns between the features and labels.

The score() method for each model uses a standard evaluation metric to measure your model's results.

In the case of a classifier (our model), one of the most common evaluation metrics is accuracy (the fraction of correct predictions out of total predictions).

Let's check out our model's accuracy on the training set.


In [14]:
# Evaluate the model on the training set
train_acc = clf.score(X=X_train, y= y_train)
print(f"The model's accuracy on the training dataset is: {train_acc*100}%")

The model's accuracy on the training dataset is: 100.0%





Woah! Looks like our model does pretty well on the training dataset.

This is because it has already seen both the data and the labels during training.

How about the test dataset?


In [15]:
# Evaluate the model on the test set
test_acc = clf.score(X=X_test, y= y_test)
print(f"The model's accuracy on the testing dataset is: {test_acc*100:.2f}%")

The model's accuracy on the testing dataset is: 84.21%





Hmm, looks like our model's accuracy is a bit less on the test dataset than the training dataset.

This is quite often the case, because remember, a model has never seen the testing examples before.

There are also a number of other evaluation methods we can use for our classification models.

All of the following classification metrics come from the sklearn.metrics module:

   * `classification_report(y_true, y_pred)` - Builds a text report showing various classification metrics such as precision, recall and F1-score.
   * `confusion_matrix(y_true, y_pred)` - Creates a confusion matrix to compare predictions to truth labels.
   * `accuracy_score(y_true, y_pred)` - Finds the accuracy score (the default metric) for a classifier.

All metrics have the following in common: they compare a model's predictions (y_pred) to truth labels (y_true).


In [16]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

#create a classification report
print(classification_report(y_test, y_preds))

              precision    recall  f1-score   support

           0       0.80      0.85      0.82        33
           1       0.88      0.84      0.86        43

    accuracy                           0.84        76
   macro avg       0.84      0.84      0.84        76
weighted avg       0.84      0.84      0.84        76



In [17]:
# Create a confusion matrix
conf_mat = confusion_matrix(y_test, y_preds)
conf_mat

array([[28,  5],
       [ 7, 36]])

In [18]:
# Compute the accuracy score (same as the score() method for classifiers) 
accuracy_score(y_test, y_preds)

0.8421052631578947
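
To see how these metrics relate to each other, here's a small worked example that recomputes accuracy, precision and recall by hand from the confusion matrix shown above. It assumes Scikit-Learn's usual layout, where rows are the true labels and columns are the predicted labels.

```python
import numpy as np

# Confusion matrix copied from the output above (rows = true labels, columns = predicted labels)
conf_mat = np.array([[28, 5],
                     [7, 36]])

# Accuracy = correct predictions (the diagonal) / all predictions
accuracy = np.trace(conf_mat) / conf_mat.sum()

# Precision and recall for class 1 (heart disease)
true_pos = conf_mat[1, 1]
precision_1 = true_pos / conf_mat[:, 1].sum()  # of the predicted 1s, how many were right
recall_1 = true_pos / conf_mat[1, :].sum()     # of the actual 1s, how many were found

print(f"{accuracy:.4f} {precision_1:.2f} {recall_1:.2f}")
```

The three values line up with the accuracy score (0.8421) and the class 1 row of the classification report (precision 0.88, recall 0.84).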

### 5. Experiment to improve

The first model you build is often referred to as a baseline. A baseline is often even simpler than the model we've used. For example, a baseline could be "always predict the most common value", which you then try to improve on.
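
As a rough sketch of the "predict the most common value" idea, Scikit-Learn's `DummyClassifier` can act as that kind of baseline. The labels below are rebuilt from the class counts shown earlier (165 ones, 138 zeros), and the placeholder features are an assumption since this baseline ignores them.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Labels with the same class counts as y.value_counts() above: 165 ones, 138 zeros
y_counts = np.array([1] * 165 + [0] * 138)
X_placeholder = np.zeros((len(y_counts), 1))  # features are ignored by this baseline

baseline = DummyClassifier(strategy="most_frequent").fit(X_placeholder, y_counts)
baseline_acc = baseline.score(X_placeholder, y_counts)
print(f"Most-frequent baseline accuracy: {baseline_acc * 100:.2f}%")
```

Any model worth keeping should comfortably beat this number.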

Once you've got a baseline model, like we have here, it's important to remember, this is often not the final model you'll use.

The next step in the workflow is to try and improve upon your baseline model.

How?

With one of the most important mottos in machine learning...

Experiment, experiment, experiment!

Experiments can come in many different forms.

But let's break it into two.

    1. From a model perspective.
    2. From a data perspective.

From a model perspective, this may involve things such as using a more complex model or tuning your model's hyperparameters.

From a data perspective, this may involve collecting more data or better quality data so your existing model has more of a chance to learn the patterns within.

If you're already working on an existing dataset, it's often easier to try a series of model perspective experiments first and then turn to data perspective experiments if you aren't getting the results you're looking for.

One thing you should be aware of is that if you're tuning a model's hyperparameters in a series of experiments, your results should always be cross-validated (we'll see this later on!).

Cross-validation is a way of making sure the results you're getting are consistent across different training and test splits of your data (it uses multiple versions of training and test sets) rather than just luck from how the original training and test sets happened to be created.

   * Try different hyperparameters.
   * All different parameters should be cross-validated.
       * Note: Beware of cross-validation for time series problems (as for time series, you don't want to mix samples from the future with samples from the past).
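
For the time series case mentioned in the note, one option is `sklearn.model_selection.TimeSeriesSplit`, which only ever tests on samples that come after the training samples. A minimal sketch on toy data (assumed to be sorted from oldest to newest):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# 10 time-ordered samples (toy data, assumed to be sorted from oldest to newest)
X_time = np.arange(10).reshape(-1, 1)

splitter = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in splitter.split(X_time):
    # Every training index comes before every test index
    print("train:", train_idx, "test:", test_idx)
```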

Different models you use will have different hyperparameters you can tune.

For the case of our model, the `RandomForestClassifier()`, we'll start trying different values for `n_estimators` (a measure for the number of trees in the random forest).

By default, n_estimators=100, so how about we try values from 100 to 200 and see what happens (generally more is better)?


In [19]:
# Try different numbers of estimators (trees)... (no cross-validation)
np.random.seed(42)
for i in range(100, 200 , 10):
    print(f'trying model with {i} estimators ... ')
    model = RandomForestClassifier(n_estimators=i).fit(X_train, y_train)
    print(f"Model accuracy on test set : {model.score(X_test, y_test) * 100:.2f}%")
    print("")

trying model with 100 estimators ... 
Model accuracy on test set : 85.53%

trying model with 110 estimators ... 
Model accuracy on test set : 85.53%

trying model with 120 estimators ... 
Model accuracy on test set : 84.21%

trying model with 130 estimators ... 
Model accuracy on test set : 85.53%

trying model with 140 estimators ... 
Model accuracy on test set : 84.21%

trying model with 150 estimators ... 
Model accuracy on test set : 85.53%

trying model with 160 estimators ... 
Model accuracy on test set : 86.84%

trying model with 170 estimators ... 
Model accuracy on test set : 85.53%

trying model with 180 estimators ... 
Model accuracy on test set : 85.53%

trying model with 190 estimators ... 
Model accuracy on test set : 84.21%





The metrics above were measured on a single train and test split.

Let's use `sklearn.model_selection.cross_val_score` to measure the results across 5 different train and test sets.

We can achieve this by calling `cross_val_score(model, X, y, cv=5)`.

Here, `model` is the estimator to evaluate, X is the full feature set, y is the full label set and cv is the number of train and test splits `cross_val_score` will automatically create from the data (in our case, 5 different splits, which is known as 5-fold cross-validation).
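
Before averaging, it can help to look at the individual fold scores, since `cross_val_score` returns one score per split. The sketch below uses synthetic stand-in data (an assumption, not the heart disease dataset) and a smaller forest to keep it quick.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the full feature set X and label set y
X_cv, y_cv = make_classification(n_samples=300, n_features=13, random_state=0)

cv_model = RandomForestClassifier(n_estimators=50, random_state=0)

# The estimator comes first, then the data; cv=5 gives one score per fold
fold_scores = cross_val_score(cv_model, X_cv, y_cv, cv=5)

print(fold_scores)
print(f"mean: {fold_scores.mean():.3f}, std: {fold_scores.std():.3f}")
```

A large spread between folds is a hint that a single train/test split could give a misleading picture.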


In [20]:
from sklearn.model_selection import cross_val_score

# with cross-validation
np.random.seed(42)
for i in range(100, 200, 10):
    print(f"trying model with {i} estimators...")
    model = RandomForestClassifier(n_estimators=i).fit(X_train, y_train)

    # Measure the model score on a single train/test split
    model_score = model.score(X_test, y_test)
    print(f"Model accuracy on single test set split : {model_score * 100:.2f}%")

    # Measure the mean cross-validation score across 5 different train and test splits
    cross_val_mean = np.mean(cross_val_score(model, X, y, cv=5))
    print(f"5-fold cross-validation score: {cross_val_mean * 100:.2f}%")

    print("")

trying model with 100 estimators...
Model accuracy on single test set split : 85.53%
5-fold cross-validation score: 82.15%

trying model with 110 estimators...
Model accuracy on single test set split : 86.84%
5-fold cross-validation score: 81.17%

trying model with 120 estimators...
Model accuracy on single test set split : 82.89%
5-fold cross-validation score: 83.16%

trying model with 130 estimators...
Model accuracy on single test set split : 85.53%
5-fold cross-validation score: 83.14%

trying model with 140 estimators...
Model accuracy on single test set split : 85.53%
5-fold cross-validation score: 82.48%

trying model with 150 estimators...
Model accuracy on single test set split : 85.53%
5-fold cross-validation score: 80.17%

trying model with 160 estimators...
Model accuracy on single test set split : 85.53%
5-fold cross-validation score: 80.83%

trying model with 170 estimators...
Model accuracy on single test set split : 85.53%
5-fold cross-validation score: 81.83%

trying m



Which model had the best cross-validation score?

This is usually a better indicator of a quality model than a single split accuracy score.

Rather than set up and track the results of these experiments manually, we can get Scikit-Learn to do the exploration for us.

Scikit-Learn's `sklearn.model_selection.GridSearchCV` is a way to search over a set of different hyperparameter values and *automatically* track which perform the best.

Let's test it!


In [23]:
# Another way to do it with GridSearchCV...
np.random.seed(42)
from sklearn.model_selection import GridSearchCV

# Define the parameters to search over in dictionary form
# (these can be any of your target model's hyperparameters)
param_grid = {'n_estimators': list(range(100, 200, 10))}

# Set up the grid search
grid = GridSearchCV(estimator=RandomForestClassifier(),
                    param_grid=param_grid,
                    cv=5,
                    verbose=1)

# fit the grid search to the data
grid.fit(X, y)

#find the best parameters
print(f"The best parameter values are : {grid.best_params_}")
print(f"With a score of : {grid.best_score_*100:.2f}%")


Fitting 5 folds for each of 10 candidates, totalling 50 fits
The best parameter values are : {'n_estimators': 120}
With a score of : 82.82%
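
Beyond the single best result, a fitted `GridSearchCV` keeps the scores for every candidate in its `cv_results_` attribute. Here's a hedged sketch on synthetic data with a much smaller grid (both are assumptions to keep the example fast) showing how that could be viewed as a DataFrame.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data and a small illustrative grid
X_gs, y_gs = make_classification(n_samples=200, n_features=13, random_state=0)

demo_grid = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [10, 20, 30]},
    cv=5,
)
demo_grid.fit(X_gs, y_gs)

# One row per candidate, with the mean and spread of its cross-validation scores
results = pd.DataFrame(demo_grid.cv_results_)
print(results[["param_n_estimators", "mean_test_score", "std_test_score", "rank_test_score"]])
```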



We can extract the best model/estimator with the `best_estimator_` attribute.

In [24]:
# Set the model to the best estimator
clf = grid.best_estimator_
clf

And now we've got the best cross-validated model, we can fit and score it on our original single train/test split of the data.

In [25]:
# fit the best model
clf = clf.fit(X_train, y_train)

# Find the best model scores on our single test split
# (note: this may be lower than the cross-validation score since it's only on one split of the data)
print(f'best model score on single split of the data: {clf.score(X_test, y_test)*100:.2f}%')

best model score on single split of the data: 85.53%



### 6. Save a model for someone else to use

When you've done a few experiments and you're happy with how your model is doing, you'll likely want someone else to be able to use it.

This may come in the form of a teammate or colleague trying to replicate and validate your results or through a customer using your model as part of a service or application you offer.

Saving a model also allows you to reuse it later without having to go through retraining it. Which is helpful, especially when your training times start to increase.

You can save a Scikit-Learn model using Python's built-in `pickle` module.


In [26]:
import pickle

# Save an existing model to file ("wb" stands for write binary)
with open("random_forest_model_1.pkl", "wb") as f:
    pickle.dump(model, f)


In [27]:
# Load a saved pickle model and evaluate it ("rb" stands for read binary)
with open("random_forest_model_1.pkl", "rb") as f:
    loaded_pickle_model = pickle.load(f)
print(f"Loaded pickle model prediction score: {loaded_pickle_model.score(X_test, y_test) * 100:.2f}%")

Loaded pickle model prediction score: 86.84%




For larger models, it may be more efficient to use Joblib.


In [28]:
from joblib import dump, load

# save a model using joblib
dump(model, "random_forest_model_1.joblib")

['random_forest_model_1.joblib']

In [29]:
# load a saved joblib model and evaluate it 
loaded_joblib_model = load("random_forest_model_1.joblib")
print(f"Loaded joblib model prediction score: {loaded_joblib_model.score(X_test, y_test) *100:.2f}%")

Loaded joblib model prediction score: 86.84%
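
A quick sanity check after loading is to confirm the restored model predicts exactly what the original did. The sketch below does a save/load round trip in a temporary directory; the synthetic data and the `demo_model.joblib` filename are assumptions for illustration.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_rt, y_rt = make_classification(n_samples=200, n_features=13, random_state=0)
original_model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_rt, y_rt)

# Save to a temporary directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.joblib")  # hypothetical filename
    dump(original_model, model_path)
    restored_model = load(model_path)

# A correctly restored model makes exactly the same predictions
same_predictions = np.array_equal(original_model.predict(X_rt), restored_model.predict(X_rt))
print(same_predictions)
```

Keep in mind that pickle and joblib files can execute code when loaded, so only load model files from sources you trust.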



## 1. Getting the data ready

Data doesn't always come ready to use with a Scikit-Learn machine learning model.

Four of the main steps you'll often have to take are:

  * Splitting the data into features (usually X) and labels (usually y).
  * Splitting the data into training and testing sets (and possibly a validation set).
  * Filling (also called imputing) or disregarding missing values.
  * Converting non-numerical values to numerical values (also called feature encoding).

Let's see an example.


In [30]:
# Splitting the data into X & y
heart_disease.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [31]:
# Splitting the data into features (X) and labels (y)
X = heart_disease.drop('target', axis=1)
X

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...
298,57,0,0,140,241,0,1,123,1,0.2,1,0,3
299,45,1,3,110,264,0,1,132,0,1.2,1,0,3
300,68,1,0,144,193,1,1,141,0,3.4,1,2,3
301,57,1,0,130,131,0,1,115,1,1.2,1,1,3


In [32]:
y = heart_disease['target']
y

0      1
1      1
2      1
3      1
4      1
      ..
298    0
299    0
300    0
301    0
302    0
Name: target, Length: 303, dtype: int64

Now let's split our data into training and test sets. We'll use an 80/20 split (80% of samples for training and 20% of samples for testing).

In [33]:
# splitting the data into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2) 

# check the shapes of different data splits
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((242, 13), (61, 13), (242,), (61,))

In [34]:
# 80% of data is being used for the training set (the model will learn patterns on these samples)
X.shape[0]* 0.8

242.4

In [35]:

# And 20% of the data is being used for the testing set (the model will be evaluated on these samples)
X.shape[0] * 0.2

60.6
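
The actual split sizes (242 and 61) aren't exactly 242.4 and 60.6 because sample counts have to be whole numbers. As far as I understand, with a fractional `test_size`, `train_test_split` rounds the test count up and gives the remaining samples to training. The sketch below checks that arithmetic against a real split of placeholder data with the same number of rows.

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_samples = 303  # number of rows in the heart disease dataset
test_size = 0.2

# With a fractional test_size, the test count is rounded up and the rest goes to training
expected_test = math.ceil(n_samples * test_size)
expected_train = n_samples - expected_test

# Check against an actual split of placeholder data with the same number of rows
train_part, test_part = train_test_split(np.arange(n_samples), test_size=test_size)
print(expected_train, expected_test, len(train_part), len(test_part))
```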


## 1.1 Make sure it's all numerical

Computers love numbers.

So one thing you'll often have to make sure of is that your datasets are in numerical form.

This even goes for datasets which contain non-numerical features that you may want to include in a model.

For example, if we were working with a car sales dataset, how might we turn features such as `Make` and `Colour` into numbers?

Let's figure it out.

First, we'll import the car-sales-extended.csv dataset.


In [36]:
# import extended car-sales file
car_sales = pd.read_csv("../data/car-sales-extended.csv")
car_sales

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431,4,15323
1,BMW,Blue,192714,5,19943
2,Honda,White,84714,4,28343
3,Toyota,White,154365,4,13434
4,Nissan,Blue,181577,3,14043
...,...,...,...,...,...
995,Toyota,Black,35820,4,32042
996,Nissan,White,155144,3,5716
997,Nissan,Blue,66604,4,31570
998,Honda,White,215883,4,4001


In [37]:
car_sales.dtypes

Make             object
Colour           object
Odometer (KM)     int64
Doors             int64
Price             int64
dtype: object



Notice the Make and Colour features are of `dtype=object` (they're strings), whereas the rest of the columns are of `dtype=int64`.

If we want to use the Make and Colour features in our model, we'll need to figure out how to turn them into numerical form.


In [38]:
# Split into X & y and train/test
from sklearn.model_selection import train_test_split

X = car_sales.drop("Price", axis=1)
y = car_sales["Price"]

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)

In [39]:
X_train.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors
417,Honda,Blue,37625,4
666,Honda,White,211154,4
653,BMW,Blue,40774,5
822,Nissan,White,114699,4
762,Toyota,White,147230,4


In [40]:
# Try to predict with random forest on price column (doesn't work)
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
model.fit(X_train, y_train)
model.score(X_test, y_test)

ValueError: could not convert string to float: 'Honda'

Oh no! We get another ValueError (some of the data is in string format rather than numerical format).

Machine learning models prefer to work with numbers rather than text.

So we'll have to convert the non-numerical features into numbers first.

The process of turning categorical features into numbers is often referred to as encoding.

Scikit-Learn has a fantastic in-depth guide on Encoding categorical features.

But let's look at one of the most straightforward ways to turn categorical features into numbers, one-hot encoding.

In machine learning, one-hot encoding creates one new column per category value. For each sample, the column matching its category gets a value of 1 and all the other columns get a value of 0.
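
To make the idea concrete before using Scikit-Learn's tools, here's a tiny hand-rolled version with NumPy. The colour values are toy data, and this is only a sketch of the concept rather than how `OneHotEncoder` is implemented.

```python
import numpy as np

colours = np.array(["White", "Blue", "White", "Red"])  # toy values for illustration

# One column per unique category, in sorted order: ['Blue', 'Red', 'White']
categories = np.unique(colours)

# Compare every sample against every category: 1 where they match, 0 elsewhere
one_hot_matrix = (colours[:, None] == categories[None, :]).astype(int)

print(categories)
print(one_hot_matrix)
```

Each row contains exactly one 1, in the column for that sample's colour.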


In [41]:
# Import OneHotEncoder and ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# Define the categorical features to transform
categorical_features = ["Make", "Colour", "Doors"]

# Create an instance of OneHotEncoder
one_hot = OneHotEncoder()

# Create an instance of ColumnTransformer
transformer = ColumnTransformer([("one_hot", # name
                                  one_hot, # transformer
                                  categorical_features)], # columns to transform
                                  remainder="passthrough") # what to do with the rest

# Turn the categorical features into numbers (this returns a NumPy array or sparse matrix, not a DataFrame)
transformed_X = transformer.fit_transform(X)
transformed_X

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 3.54310e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        1.00000e+00, 1.92714e+05],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 8.47140e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 6.66040e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.15883e+05],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.48360e+05]])



**Note:** You might be wondering why we considered `Doors` as a categorical variable, which is a good question considering Doors is already numerical. Well, the answer is that Doors could be either numerical or categorical. However, I've decided to go with categorical, since where I'm from, number of doors is often a different category of car. For example, you can shop for 4-door cars or shop for 5-door cars (which always confused me since where's the 5th door?). However, you could experiment with treating this value as numerical or categorical, training a model on each, and then see how each model performs.

Woah! Looks like our samples are all numerical. What did our data look like previously?


In [42]:
X.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors
0,Honda,White,35431,4
1,BMW,Blue,192714,5
2,Honda,White,84714,4
3,Toyota,White,154365,4
4,Nissan,Blue,181577,3


In [43]:
# View first transformed sample
transformed_X[0]

array([0.0000e+00, 1.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00,
       0.0000e+00, 0.0000e+00, 0.0000e+00, 1.0000e+00, 0.0000e+00,
       1.0000e+00, 0.0000e+00, 3.5431e+04])

In [44]:
# View original first sample
X.iloc[0]

Make             Honda
Colour           White
Odometer (KM)    35431
Doors                4
Name: 0, dtype: object


### 1.1.1 Numerically encoding data with pandas

Another way we can numerically encode data is directly with pandas.

We can use the `pandas.get_dummies()` (or `pd.get_dummies()` for short) function and then pass it our target columns.

In return, we'll get a one-hot encoded version of our target columns.

Let's remind ourselves of what our DataFrame looks like.


In [45]:
# View head of original DataFrame
car_sales.head()

Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431,4,15323
1,BMW,Blue,192714,5,19943
2,Honda,White,84714,4,28343
3,Toyota,White,154365,4,13434
4,Nissan,Blue,181577,3,14043


In [46]:
# One-hot encode categorical variables
categorical_variables = ["Make", "Colour", "Doors"]
dummies = pd.get_dummies(data=car_sales[categorical_variables])
dummies

Unnamed: 0,Doors,Make_BMW,Make_Honda,Make_Nissan,Make_Toyota,Colour_Black,Colour_Blue,Colour_Green,Colour_Red,Colour_White
0,4,False,True,False,False,False,False,False,False,True
1,5,True,False,False,False,False,True,False,False,False
2,4,False,True,False,False,False,False,False,False,True
3,4,False,False,False,True,False,False,False,False,True
4,3,False,False,True,False,False,True,False,False,False
...,...,...,...,...,...,...,...,...,...,...
995,4,False,False,False,True,True,False,False,False,False
996,3,False,False,True,False,False,False,False,False,True
997,4,False,False,True,False,False,True,False,False,False
998,4,False,True,False,False,False,False,False,False,True




Nice!

Notice how there's a new column for each categorical option (e.g. Make_BMW, Make_Honda, etc).

But notice how it missed the `Doors` column?

This is because Doors is already numeric, so for pd.get_dummies() to work on it, we can change it to type `object`.

By default, pd.get_dummies() also turns all of the values to bools (True or False).

We can get the returned values as 0.0 or 1.0 by setting `dtype=float`.


In [47]:
# Have to convert doors to object for dummies to work on it...
car_sales["Doors"] = car_sales["Doors"].astype(object)
dummies = pd.get_dummies(data=car_sales[["Make", "Colour", "Doors"]],
                         dtype=float)
dummies

Unnamed: 0,Make_BMW,Make_Honda,Make_Nissan,Make_Toyota,Colour_Black,Colour_Blue,Colour_Green,Colour_Red,Colour_White,Doors_3,Doors_4,Doors_5
0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
1,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
4,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
996,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0
997,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
998,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0


Now you might be wondering...

Should you use Scikit-Learn or pandas for turning data into numerical form?

And the answer is either.

But as a rule of thumb:

  * If you're performing quick data analysis and running small modelling experiments, use pandas as it's generally quite fast to get up and running.
  * If you're performing a larger scale modelling experiment or would like to put your data processing steps into a production pipeline, I'd recommend leaning towards Scikit-Learn, specifically a Scikit-Learn Pipeline (chaining together multiple estimator/modelling steps).
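
To give a feel for the Pipeline option, here's a minimal sketch that chains one-hot encoding and a random forest into a single object. The data is randomly generated to resemble the car sales columns (so the predictions are meaningless), and the step names and `handle_unknown="ignore"` setting are illustrative choices rather than requirements.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic car-sales-like data (random values, for illustration only)
rng = np.random.default_rng(42)
n = 200
cars = pd.DataFrame({
    "Make": rng.choice(["Honda", "BMW", "Nissan", "Toyota"], size=n),
    "Colour": rng.choice(["White", "Blue", "Black", "Red", "Green"], size=n),
    "Odometer (KM)": rng.integers(10_000, 250_000, size=n),
    "Doors": rng.choice([3, 4, 5], size=n),
})
price = rng.integers(2_000, 50_000, size=n)

preprocessor = ColumnTransformer(
    [("one_hot", OneHotEncoder(handle_unknown="ignore"), ["Make", "Colour", "Doors"])],
    remainder="passthrough",
)

# Encoding and modelling are chained, so fit() and predict() apply both steps
pipe = Pipeline([
    ("preprocess", preprocessor),
    ("model", RandomForestRegressor(n_estimators=20, random_state=42)),
])

X_tr, X_te, y_tr, y_te = train_test_split(cars, price, test_size=0.2, random_state=42)
pipe.fit(X_tr, y_tr)
preds = pipe.predict(X_te)
print(preds.shape)
```

One benefit of this design is that the encoder is fitted only on the training split, so nothing from the test set leaks into preprocessing.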

Since we've turned our data into numerical form, how about we try and fit our model again?

Let's recreate a train/test split except this time we'll use transformed_X instead of X.

In [48]:
np.random.seed(42)

# create train and test splits with transformed_X
X_train, X_test, y_train, y_test = train_test_split(transformed_X, y, test_size=0.2)

# Create the model instance
model = RandomForestRegressor()

# Fit the model on the numerical data (this errored before since our data wasn't fully numeric)
model.fit(X_train, y_train)

# Score the model (returns r^2 metric by default, also called coefficient of determination, higher is better)
model.score(X_test, y_test)

0.3235867221569877
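
The R^2 score (coefficient of determination) compares the model's squared error to the squared error of always predicting the mean of the true values: R^2 = 1 - SS_res / SS_tot. The sketch below computes it by hand on made-up numbers and checks it against `sklearn.metrics.r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy true values and predictions (made-up numbers for illustration)
y_true = np.array([15000.0, 20000.0, 28000.0, 13000.0, 14000.0])
y_pred = np.array([16000.0, 18000.0, 25000.0, 15000.0, 13000.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # squared error of the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of always predicting the mean
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))
```

A score of 1.0 means perfect predictions, 0.0 means no better than predicting the mean, and negative values mean worse than that.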


## 1.2 What if there were missing values in the data?

Holes in the data means holes in the patterns your machine learning model can learn.

Many machine learning models don't work well or produce errors when they're used on datasets with missing values.

A missing value can appear as a blank, as a NaN or something similar.

There are two main options when dealing with missing values:

   1. Fill them with some given or calculated value (imputation) - For example, you might fill missing values of a numerical column with the mean of all the other values. The practice of calculating or figuring out how to fill missing values in a dataset is called imputing. For a great resource on imputing missing values, I'd recommend referring to the Scikit-Learn user guide.
   2. Remove them - If a row or sample has missing values, you may opt to remove it from your dataset completely. However, this potentially results in using less data to build your model.

    Note: Dealing with missing values differs from problem to problem, meaning there's no 100% best way to fill missing values across datasets and problem types. It will often take careful experimentation and practice to figure out the best way to deal with missing values in your own datasets.
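
The two options above can be sketched on a tiny made-up DataFrame (`toy_df` and its values are assumptions for illustration only):

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value in each column
toy_df = pd.DataFrame({"Odometer (KM)": [10000.0, np.nan, 30000.0, 50000.0],
                       "Price": [20000.0, 15000.0, np.nan, 9000.0]})

# Option 1: impute, here by filling each column with its own mean
toy_filled = toy_df.fillna(toy_df.mean())

# Option 2: remove any row that contains a missing value
toy_dropped = toy_df.dropna()
```

Imputing keeps all four rows but introduces estimated values, while removing keeps only the two complete rows.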

To practice dealing with missing values, let's import a version of the car_sales dataset with several missing values (namely car-sales-extended-missing-data.csv).


In [49]:
# Import car sales dataframe with missing values
car_sales_missing = pd.read_csv("../data/car-sales-extended-missing-data.csv") # load from local directory
car_sales_missing.head(10)


Unnamed: 0,Make,Colour,Odometer (KM),Doors,Price
0,Honda,White,35431.0,4.0,15323.0
1,BMW,Blue,192714.0,5.0,19943.0
2,Honda,White,84714.0,4.0,28343.0
3,Toyota,White,154365.0,4.0,13434.0
4,Nissan,Blue,181577.0,3.0,14043.0
5,Honda,Red,42652.0,4.0,23883.0
6,Toyota,Blue,163453.0,4.0,8473.0
7,Honda,White,,4.0,20306.0
8,,White,130538.0,4.0,9374.0
9,Honda,Blue,51029.0,4.0,26683.0


In [50]:
# Get the sum of all missing values
car_sales_missing.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64



Hmm... seems there are about 50 missing values per column.

How about we try and split the data into features and labels, then convert the categorical data to numbers, then split the data into training and test and then try and fit a model on it (just like we did before)?


In [51]:
# Create features
X_missing = car_sales_missing.drop("Price", axis=1)
print(f"Number of missing X values :\n{X_missing.isna().sum()}")

Number of missing X values :
Make             49
Colour           50
Odometer (KM)    50
Doors            50
dtype: int64


In [52]:
# Create labels
y_missing = car_sales_missing["Price"]
print(f"Number of missing y values: {y_missing.isna().sum()}")

Number of missing y values: 50




Now we can convert the categorical columns into one-hot encodings (just as before).


In [53]:
# Let's convert the categorical columns to one-hot encodings (code copied from above)
# Turn the categories (Make, Colour and Doors) into numbers
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]

one_hot = OneHotEncoder()

transformer = ColumnTransformer([("one_hot", 
                                  one_hot, 
                                  categorical_features)],
                                remainder="passthrough",
                                sparse_threshold=0) # sparse_threshold=0 always returns a dense array

transformed_X_missing = transformer.fit_transform(X_missing)
transformed_X_missing

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        0.00000e+00, 3.54310e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 1.92714e+05],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        0.00000e+00, 8.47140e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 0.00000e+00,
        0.00000e+00, 6.66040e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        0.00000e+00, 2.15883e+05],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        0.00000e+00, 2.48360e+05]])

Finally, let's split the missing data samples into train and test sets and then try to fit and score a model on them.

In [54]:
# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(transformed_X_missing,
                                                    y_missing,
                                                    test_size=0.2)

# fit and score a model
model = RandomForestRegressor()
model.fit(X_train, y_train)
model.score(X_test, y_test)

ValueError: Input y contains NaN.

Ahh... damn! Another ValueError (our input data contains missing values).

`ValueError: Input y contains NaN.`

Looks like the model we're trying to use can't be fit on labels with missing values.

Missing values in the features can cause a similar problem. Depending on your Scikit-Learn version, fitting on features which contain NaNs may produce an error similar to:

`ValueError: Input X contains NaN. RandomForestRegressor does not accept missing values encoded as NaN natively...`

Looks like if we want to use RandomForestRegressor, we'll have to either fill or remove the missing values.

Note: Scikit-Learn does have a list of models which can handle NaNs or missing values in the features directly, such as `sklearn.ensemble.HistGradientBoostingClassifier` or `sklearn.ensemble.HistGradientBoostingRegressor`.

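As an experiment, you may want to try the following (the labels still can't contain NaNs, so samples with a missing Price are filtered out first):

```
from sklearn.ensemble import HistGradientBoostingRegressor

# Try a model that can handle NaNs in the features natively
# (labels can't be NaN, so keep only the samples with a known Price)
train_mask = y_train.notna().to_numpy()
test_mask = y_test.notna().to_numpy()

nan_model = HistGradientBoostingRegressor()
nan_model.fit(X_train[train_mask], y_train[train_mask])
nan_model.score(X_test[test_mask], y_test[test_mask])
```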

Let's see what values are missing again.


In [55]:
car_sales_missing.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64


## 1.2.1 Fill missing data with pandas

Let's see how we might fill missing values with pandas.

For categorical values, one of the simplest ways is to fill the missing fields with the string "missing".

We could do this for the Make and Colour features.

As for the Doors feature, we could use "missing" or we could fill it with the most common option of 4.

With the Odometer (KM) feature, we can use the mean value of all the other values in the column.

And finally, for those samples which are missing a Price value, we can remove them. Since Price is the target value, removing these samples probably causes less harm than imputing them (though you could design an experiment to test this).

In summary:

| Column/Feature | Fill missing value with |
| --- | --- |
| Make | "missing" |
| Colour | "missing" |
| Doors | 4 (most common value) |
| Odometer (KM) | mean of Odometer (KM) |
| Price (target) | NA, remove samples missing Price |

We can use the pandas method `fillna(value="missing")` to fill the missing values in a column with the string "missing", reassigning the result back to the column.
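
As an alternative to filling one column at a time, `DataFrame.fillna()` also accepts a dictionary mapping column names to fill values. The sketch below applies the summary table in a single call on a small made-up DataFrame (`toy_cars` and `fill_values` are names and values chosen for illustration, not part of the car sales dataset).

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value per column
toy_cars = pd.DataFrame({
    "Make": ["Honda", None, "Toyota", "BMW"],
    "Colour": ["White", "Blue", None, "Red"],
    "Odometer (KM)": [35431.0, 192714.0, np.nan, 84714.0],
    "Doors": [4.0, np.nan, 4.0, 5.0],
    "Price": [15323.0, 19943.0, 28343.0, np.nan],
})

# One fill value per feature column (Price is deliberately left out)
fill_values = {"Make": "missing",
               "Colour": "missing",
               "Doors": 4,
               "Odometer (KM)": toy_cars["Odometer (KM)"].mean()}

# Fill the feature columns, then remove samples missing the target
toy_cars_filled = toy_cars.fillna(value=fill_values)
toy_cars_filled = toy_cars_filled.dropna(subset=["Price"])
```

Filling column by column, as in the cells that follow, does the same thing but makes it easier to check progress after each step.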


In [56]:
# Fill the missing values in the Make column
# Note: Calling fillna(inplace=True) on a single column raises a FutureWarning in recent pandas versions
# and will no longer update the DataFrame in pandas 3.0, so we reassign the result instead.
# Old: car_sales_missing["Make"].fillna(value="missing", inplace=True)

car_sales_missing["Make"] = car_sales_missing["Make"].fillna(value="missing")

In [57]:
# Fill the missing values in the Colour column (reassigning the result rather than using inplace=True)
car_sales_missing["Colour"] = car_sales_missing["Colour"].fillna(value="missing")

In [58]:
car_sales_missing.isna().sum()

Make              0
Colour            0
Odometer (KM)    50
Doors            50
Price            50
dtype: int64



Wonderful! We're making some progress.

Now let's fill the Doors column with 4 (the most common value). In this case, this is the same as filling it with the median or mode of the Doors column.


In [59]:
# Find the most common value of the Doors column
car_sales_missing["Doors"].value_counts()

Doors
4.0    811
5.0     75
3.0     64
Name: count, dtype: int64

In [60]:
# Fill the Doors column with the most common value
car_sales_missing["Doors"] = car_sales_missing["Doors"].fillna(value=4)
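
Rather than hard-coding 4, the most common value can also be looked up programmatically. Here's a small sketch on a made-up Series (`toy_doors` is an assumption for illustration); `Series.mode()` ignores NaNs by default and returns a Series, so we take its first entry.

```python
import numpy as np
import pandas as pd

# Made-up Doors values with two missing entries
toy_doors = pd.Series([4.0, 4.0, 5.0, np.nan, 3.0, 4.0, np.nan], name="Doors")

# mode() returns a Series (there can be ties), so take the first value
most_common_doors = toy_doors.mode()[0]

# Fill the missing entries with the most common value
toy_doors_filled = toy_doors.fillna(most_common_doors)
```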

In [61]:
car_sales_missing.isna().sum()

Make              0
Colour            0
Odometer (KM)    50
Doors             0
Price            50
dtype: int64

In [62]:
# Fill the Odometer (KM) column with the mean of the column
# Old: car_sales_missing["Odometer (KM)"].fillna(value=car_sales_missing["Odometer (KM)"].mean(), inplace=True)
car_sales_missing["Odometer (KM)"] = car_sales_missing["Odometer (KM)"].fillna(value=car_sales_missing["Odometer (KM)"].mean())



How many missing values do we have now?

In [63]:
# Check the number of missing values
car_sales_missing.isna().sum()

Make              0
Colour            0
Odometer (KM)     0
Doors             0
Price            50
dtype: int64



Woohoo! That's looking a lot better.

Finally, we can remove the rows which are missing the target value Price.

    Note: Another option would be to impute the Price value with the mean or median or some other calculated value (such as by using similar cars to estimate the price), however, to keep things simple and prevent introducing too many fake labels to the data, we'll remove the samples missing a Price value.

We can remove rows with missing values in place from a pandas DataFrame with the `pandas.DataFrame.dropna(inplace=True)` method.


In [64]:
# Remove rows with missing values (at this point, only the Price column has any)
car_sales_missing.dropna(inplace=True)
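
One thing worth keeping in mind: `dropna()` with no arguments removes rows with a missing value in any column, whereas the `subset` parameter restricts the check to the columns you name. A minimal sketch on made-up data (`toy_df` is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Made-up data: row 1 is missing Doors, row 2 is missing Price
toy_df = pd.DataFrame({"Doors": [4.0, np.nan, 5.0],
                       "Price": [15323.0, 19943.0, np.nan]})

# Drops every row with a missing value in any column
dropped_any = toy_df.dropna()

# Drops only the rows which are missing a Price value
dropped_price_only = toy_df.dropna(subset=["Price"])
```

In the cell above, both approaches give the same result because Price is the only column with missing values left.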

In [65]:
# Check the number of missing values
car_sales_missing.isna().sum()

Make             0
Colour           0
Odometer (KM)    0
Doors            0
Price            0
dtype: int64

Since we removed samples missing a Price value, there are now fewer samples overall but none of them have missing values.

In [66]:
# Check the number of total samples (previously was 1000)
len(car_sales_missing)

950



Can we fit a model now?

Let's try!

First we'll create the features and labels.

Then we'll convert categorical variables into numbers via one-hot encoding.

Then we'll split the data into training and test sets just like before.

Finally, we'll try to fit a `RandomForestRegressor()` model to the newly filled data.

In [67]:
# Create features
X_missing = car_sales_missing.drop("Price", axis=1)
print(f"Number of missing X values:\n{X_missing.isna().sum()}")

# create labels
y_missing = car_sales_missing["Price"]
print(f"Number of missing y values : {y_missing.isna().sum()}")

Number of missing X values:
Make             0
Colour           0
Odometer (KM)    0
Doors            0
dtype: int64
Number of missing y values : 0


In [68]:
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

categorical_features = ["Make", "Colour", "Doors"]

one_hot = OneHotEncoder()
transformer = ColumnTransformer([("one_hot",
                                  one_hot,
                                  categorical_features)],
                                remainder="passthrough",
                                sparse_threshold=0) # sparse_threshold=0 always returns a dense array

transformed_X_missing = transformer.fit_transform(X_missing)
transformed_X_missing

array([[0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 3.54310e+04],
       [1.00000e+00, 0.00000e+00, 0.00000e+00, ..., 0.00000e+00,
        1.00000e+00, 1.92714e+05],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 8.47140e+04],
       ...,
       [0.00000e+00, 0.00000e+00, 1.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 6.66040e+04],
       [0.00000e+00, 1.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.15883e+05],
       [0.00000e+00, 0.00000e+00, 0.00000e+00, ..., 1.00000e+00,
        0.00000e+00, 2.48360e+05]])

In [69]:
# Split data into training and test sets
np.random.seed(42)
X_train, X_test, y_train, y_test = train_test_split(transformed_X_missing,
                                                    y_missing,
                                                    test_size=0.2)

# Fit and score a model
model = RandomForestRegressor()
model.fit(X_train, y_train)
model.score(X_test, y_test)

0.22011714008302485


Fantastic!!!

Looks like filling the missing values with pandas worked!

Our model can be fit to the data without issues.


## 1.2.2 Filling missing data and transforming categorical data with Scikit-Learn

Now that we've filled the missing values using pandas functions, you might be thinking, "Why pandas? I thought this was a Scikit-Learn introduction?".

Not to worry, Scikit-Learn provides a class called `sklearn.impute.SimpleImputer()` which allows us to do a similar thing.

SimpleImputer() transforms data by filling missing values with a given strategy parameter.

And we can use it to fill the missing values in our DataFrame as above.
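
Before applying it to the car sales data, here's a minimal sketch of `SimpleImputer` on a tiny made-up DataFrame (`toy_df` and its values are assumptions for illustration). It uses the `"constant"` strategy for a categorical column and the `"mean"` strategy for a numerical one.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up data with one missing value in each column
toy_df = pd.DataFrame({"Make": ["Honda", np.nan, "Toyota"],
                       "Odometer (KM)": [10000.0, np.nan, 30000.0]})

# Fill categorical values with a constant string
cat_imputer = SimpleImputer(strategy="constant", fill_value="missing")

# Fill numerical values with the mean of the column
num_imputer = SimpleImputer(strategy="mean")

# SimpleImputer expects 2D input, hence the double brackets
filled_make = cat_imputer.fit_transform(toy_df[["Make"]])
filled_odometer = num_imputer.fit_transform(toy_df[["Odometer (KM)"]])
```

Both calls return NumPy arrays rather than DataFrames by default.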

At the moment, our DataFrame has no missing values.


In [70]:
car_sales_missing.isna().sum()

Make             0
Colour           0
Odometer (KM)    0
Doors            0
Price            0
dtype: int64

Let's reimport it so it has missing values and we can fill them with Scikit-Learn.

In [71]:
# Reimport the DataFrame (so that all the missing values are back)
car_sales_missing = pd.read_csv("../data/car-sales-extended-missing-data.csv")
car_sales_missing.isna().sum()

Make             49
Colour           50
Odometer (KM)    50
Doors            50
Price            50
dtype: int64

To begin, we'll remove the rows which are missing a Price value.

In [72]:
# Drop the rows which are missing a value in the Price column
car_sales_missing.dropna(subset=["Price"], inplace=True)



Now there are no rows missing a Price value.


In [73]:
car_sales_missing.isna().sum()

Make             47
Colour           46
Odometer (KM)    48
Doors            47
Price             0
dtype: int64