# End to End Machine Learning Workflow - RandomForestClassifier

In [7]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import sklearn
print(sklearn.__version__)
sklearn.show_versions()

1.6.1

System:
    python: 3.12.7 | packaged by Anaconda, Inc. | (main, Oct  4 2024, 08:28:27) [Clang 14.0.6 ]
executable: /opt/miniconda3/bin/python
   machine: macOS-26.2-x86_64-i386-64bit

Python dependencies:
      sklearn: 1.6.1
          pip: 24.2
   setuptools: 75.1.0
        numpy: 1.26.4
        scipy: 1.15.1
       Cython: None
       pandas: 2.2.3
   matplotlib: 3.10.0
       joblib: 1.4.2
threadpoolctl: 3.5.0

Built with OpenMP: True

threadpoolctl info:
       user_api: blas
   internal_api: openblas
    num_threads: 8
         prefix: libopenblas
       filepath: /opt/miniconda3/lib/libopenblasp-r0.3.21.dylib
        version: 0.3.21
threading_layer: pthreads
   architecture: Haswell

       user_api: openmp
   internal_api: openmp
    num_threads: 16
         prefix: libomp
       filepath: /opt/miniconda3/lib/libomp.dylib
        version: None


# 1. Get the data ready

As an example dataset, we'll import heart-disease.csv.

This file contains anonymised patient medical records and whether or not each patient has heart disease (this is a classification problem since we're trying to predict whether something is one thing or another).

In [9]:
heart_disease = pd.read_csv("resources/heart-disease.csv")
heart_disease.head()

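|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|-----|-----|----|----------|------|-----|---------|---------|-------|---------|-------|----|------|--------|
| 0 | 63 | 1 | 3 | 145 | 233 | 1 | 0 | 150 | 0 | 2.3 | 0 | 0 | 1 | 1 |
| 1 | 37 | 1 | 2 | 130 | 250 | 0 | 1 | 187 | 0 | 3.5 | 0 | 0 | 2 | 1 |
| 2 | 41 | 0 | 1 | 130 | 204 | 0 | 0 | 172 | 0 | 1.4 | 2 | 0 | 2 | 1 |
| 3 | 56 | 1 | 1 | 120 | 236 | 0 | 1 | 178 | 0 | 0.8 | 2 | 0 | 2 | 1 |
| 4 | 57 | 0 | 0 | 120 | 354 | 0 | 1 | 163 | 1 | 0.6 | 2 | 0 | 2 | 1 |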


Here, each row is a different patient and all columns except target are different patient characteristics.

The target column indicates whether the patient has heart disease (target=1) or not (target=0). This is our "label" column, the variable we're going to try and predict.

The rest of the columns (often called features) are what we'll be using to predict the target value.

Note: It's a common convention to save features to a variable X and labels to a variable y.
In practice, we'd like to use the X (features) to build a predictive algorithm to predict the y (labels).

In [14]:
# Features
X = heart_disease.drop("target", axis=1)

# Label
y = heart_disease["target"]

# 2. Pick a Model

Choosing a model often depends on the type of problem you're working on.

For example, there are different models that Scikit-Learn recommends, whether you're working on a classification or regression problem.

You can see a map breaking down the [different kinds of model options and recommendations in the Scikit-Learn documentation](https://scikit-learn.org/stable/machine_learning_map.html).

Scikit-Learn refers to models as "estimators"; however, they are often also referred to as `model` or `clf` (short for classifier).

A model's hyperparameters are settings you can change to adjust it for your problem, much like knobs on an oven you can tune to cook your favourite dish.

In [22]:
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
# We can see the current hyperparameters of a model with the get_params() method.
clf.get_params()

{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'sqrt',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'monotonic_cst': None,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

We'll leave this as is for now, as Scikit-Learn models generally have good default settings.
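If you do want to change a hyperparameter, a minimal sketch looks like the following. The specific values (`n_estimators=50`, `max_depth=5`, `n_estimators=20`) and the name `clf_small` are arbitrary choices for illustration, not recommendations.

```python
from sklearn.ensemble import RandomForestClassifier

# Option 1: pass hyperparameters when creating the model
# (illustrative values only)
clf_small = RandomForestClassifier(n_estimators=50, max_depth=5)

# Option 2: change them on an existing instance with set_params()
clf_small.set_params(n_estimators=20)

# get_params() reflects the updated settings
params = clf_small.get_params()
print(params["n_estimators"], params["max_depth"])
```

Hyperparameters changed with `set_params()` take effect the next time you call `fit()`.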


One of the most important practices in machine learning is to split datasets into training and test sets.
As in, a model will train on the training set to learn patterns, and then those patterns can be evaluated on the test set.
Crucially, a model should never see testing data during training.

This is equivalent to a student studying course materials during the semester (training set) and 
then testing their abilities on the following exam (testing set).

Scikit-Learn provides the sklearn.model_selection.train_test_split function to split datasets into training and test sets.

Note: A common practice is to use an 80/20, 70/30, or 75/25 split for training/testing data. There is also a third set, known as a validation set (e.g. 70/15/15 for training/validation/test), which is used for hyperparameter tuning, but for now we'll focus on training and test sets.
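One way to get a three-way 70/15/15 split is to call `train_test_split` twice. The sketch below uses randomly generated stand-in data (not the heart disease dataset), and the `_demo` variable names are hypothetical. Passing an integer as `test_size` requests an absolute number of samples rather than a fraction.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 samples, 4 features, binary labels
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 4))
y_demo = rng.integers(0, 2, size=100)

# Step 1: hold out 15 samples (15%) as the test set
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=15, random_state=42
)

# Step 2: hold out another 15 samples from the remainder as the validation set
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_rest, y_rest, test_size=15, random_state=42
)

print(len(X_train_demo), len(X_val_demo), len(X_test_demo))  # 70 15 15
```

Setting `random_state` makes the split reproducible, which helps when comparing experiments.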


In [17]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2)

# 3. Fit the model to the data and make a prediction

Fitting a model to a dataset involves passing it the data and asking it to figure out the patterns.

If there are labels (supervised learning), the model tries to work out the relationship between the data and the labels.

If there are no labels (unsupervised learning), the model tries to find patterns and group similar samples together.

Most Scikit-Learn models have the fit(X, y) method built-in, where the X parameter is the features and the y parameter is the labels.

In our case, we start by fitting a model on the training split (X_train, y_train).

In [10]:
clf.fit(X_train, y_train)

The whole point of training a machine learning model is to use it to make some kind of prediction in the future.

Once your model instance is trained, you can use the predict() method to predict a target value given a set of features.

In other words, use the model, along with some new, unseen, and unlabelled data, to predict the label.

Note: The data you predict on should be in the same shape and format as the data you trained on.

Since our model was trained on data from X_train, predictions should be made on data in the same format and shape as X_train.
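To illustrate the shape requirement, here is a small sketch on synthetic data from `make_classification` (a stand-in, not the heart disease dataset; the names `model` and `single_sample` are hypothetical). A single sample has to be reshaped into a 2D array of shape `(1, n_features)` before being passed to `predict()`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical stand-in data with 5 features
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)
model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_demo, y_demo)

# A single sample is a 1D array of shape (5,)
single_sample = X_demo[0]

# predict() expects a 2D array of shape (n_samples, n_features),
# so reshape the single sample to (1, 5) first
single_pred = model.predict(single_sample.reshape(1, -1))
print(single_pred.shape)  # (1,)

# Passing the 1D array directly raises a ValueError
try:
    model.predict(single_sample)
    raised = False
except ValueError:
    raised = True
print(raised)  # True
```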

Our goal in many machine learning problems is to use patterns learned from the training data to make predictions on the test data (or future unseen data).

In [11]:
y_preds = clf.predict(X_test)
y_preds

array([1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0,
       1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1])

# 4. Evaluate the Model 

Now we've made some predictions, we can start to use some more Scikit-Learn methods to figure out how good our model is.

Each model or estimator has a built-in score() method.

This method measures how well the model was able to learn the patterns between the features and labels.

The score() method for each model uses a standard evaluation metric to measure your model's results.

In the case of a classifier (our model), one of the most common evaluation metrics is accuracy (the fraction of correct predictions out of total predictions).

Let's check out our model's accuracy on the training set.

In [13]:
clf.score(X_train, y_train)

1.0

The model's accuracy on the training dataset is: 100.0%
Woah! Looks like our model does pretty well on the training dataset.

This is because it has a chance to see both data and labels.

How about the test dataset?

In [14]:
clf.score(X_test, y_test)

0.819672131147541

The model's accuracy on the testing dataset is: 81.97%
Hmm, looks like our model's accuracy is a bit lower on the test dataset than on the training dataset.

This is quite often the case, because remember, a model has never seen the test examples before.

There are also several other evaluation methods we can use for our classification models.

All of the following classification metrics come from the `sklearn.metrics` module:

* `classification_report(y_true, y_pred)` - Builds a text report showing various classification metrics such as precision, recall, and F1-score.
* `confusion_matrix(y_true, y_pred)` - Create a confusion matrix to compare predictions to truth labels.
* `accuracy_score(y_true, y_pred)` - Find the accuracy score (the default metric) for a classifier.  

All metrics have the following in common: they compare a model's predictions (y_pred) to truth labels (y_true).

In [15]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

In [16]:
print(classification_report(y_test, y_preds))

              precision    recall  f1-score   support

           0       0.95      0.67      0.78        30
           1       0.75      0.97      0.85        31

    accuracy                           0.82        61
   macro avg       0.85      0.82      0.81        61
weighted avg       0.85      0.82      0.82        61



In [17]:
confusion_matrix(y_test, y_preds)

array([[20, 10],
       [ 1, 30]])

In [18]:
accuracy_score(y_test, y_preds)

0.819672131147541
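These metrics are all connected. As a worked example, the sketch below recomputes accuracy and the class 1 precision, recall, and F1-score directly from the confusion matrix values shown above (the variable names are just for illustration).

```python
import numpy as np

# Confusion matrix from the output above:
# rows are true labels (0, 1), columns are predicted labels (0, 1)
cm = np.array([[20, 10],
               [1, 30]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # correct predictions / all predictions
precision_1 = tp / (tp + fp)      # of the predicted 1s, how many were truly 1
recall_1 = tp / (tp + fn)         # of the true 1s, how many were predicted as 1
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy={accuracy:.2f} precision={precision_1:.2f} "
      f"recall={recall_1:.2f} f1={f1_1:.2f}")
# accuracy=0.82 precision=0.75 recall=0.97 f1=0.85
```

The results agree with the class 1 row of the classification report and with `accuracy_score`.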

# 5. Improve the model

The first model you build is often referred to as a baseline. A baseline is often even simpler than the model we've used; for example, it could be "always predict the most common value", which you then try to improve on.
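A "predict the most common value" baseline can be sketched with Scikit-Learn's `DummyClassifier`. The example uses synthetic stand-in data from `make_classification` rather than the heart disease dataset, and the variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Always predicts the most common class seen in the training labels
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
baseline_preds = baseline.predict(X_te)
baseline_acc = baseline.score(X_te, y_te)

print(f"Baseline accuracy: {baseline_acc:.2f}")
```

Any real model should beat this score; if it doesn't, that's a sign something is off with the data or the setup.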

Once you've got a baseline model, like we have here, it's important to remember, this is often not the final model you'll use.

The next step in the workflow is to try and improve upon your baseline model.

How?

With one of the most important mottos in machine learning...

Experiment, experiment, experiment!

Experiments can come in many different forms.

But let's break it into two.

1. From a model perspective.
2. From a data perspective.

Experimenting from a model perspective may involve things such as using a more complex model or tuning your model's hyperparameters.

Experimenting from a data perspective may involve collecting more data or better quality data so your existing model has more of a chance to learn the patterns within.

If you're already working on an existing dataset, it's often easier to try a series of model perspective experiments first and then turn to data perspective experiments if you aren't getting the results you're looking for.

One thing you should be aware of is that if you're tuning a model's hyperparameters in a series of experiments, your results should always be cross-validated (we'll see this later on!).

[Cross-validation](https://scikit-learn.org/stable/modules/cross_validation.html) is a way of making sure the results you're getting are consistent across your training and test datasets (because it uses multiple versions of training and test sets), rather than just luck because of the order in which the original training and test sets were created.

* Try different hyperparameters.
* All different parameters should be cross-validated.
    * Note: Beware of cross-validation for time series problems (as for time series, you don't want to mix samples from the future with samples from the past).
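As a minimal sketch of cross-validation, `cross_val_score` from `sklearn.model_selection` fits and scores a model on several different train/test splits. The data here is a synthetic stand-in from `make_classification`, and `n_estimators=20` and `cv=5` are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical stand-in data
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

model = RandomForestClassifier(n_estimators=20, random_state=42)

# cv=5 trains and scores the model on 5 different train/test splits
cv_scores = cross_val_score(model, X_demo, y_demo, cv=5)

print(cv_scores)
print(f"Mean accuracy: {cv_scores.mean():.3f}")
```

Comparing the mean of the five scores is usually more reliable than comparing a single test score. For ordered data, `sklearn.model_selection` also provides `TimeSeriesSplit`, which keeps each test fold later in time than its training fold.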

Different models you use will have different hyperparameters you can tune.

For the case of our model, the `RandomForestClassifier()`, we'll start by trying different values for `n_estimators` (the number of trees in the random forest).

By default, `n_estimators=100`, so how about we try values from `10` to `90` in steps of `10` and see what happens (more trees generally help, though with diminishing returns)?

In [20]:
# Try different numbers of estimators (n_estimators is a hyperparameter you can change)
np.random.seed(42)
for i in range(10, 100, 10):
    print(f"Trying model with {i} estimators...")
    clf = RandomForestClassifier(n_estimators=i).fit(X_train, y_train)
    print(f"Model accuracy on test set: {clf.score(X_test, y_test)}")
    print("")

Trying model with 10 estimators...
Model accuracy on test set: 0.7868852459016393

Trying model with 20 estimators...
Model accuracy on test set: 0.8360655737704918

Trying model with 30 estimators...
Model accuracy on test set: 0.8032786885245902

Trying model with 40 estimators...
Model accuracy on test set: 0.819672131147541

Trying model with 50 estimators...
Model accuracy on test set: 0.8360655737704918

Trying model with 60 estimators...
Model accuracy on test set: 0.8360655737704918

Trying model with 70 estimators...
Model accuracy on test set: 0.8360655737704918

Trying model with 80 estimators...
Model accuracy on test set: 0.8360655737704918

Trying model with 90 estimators...
Model accuracy on test set: 0.8360655737704918



# 6. Save a model and load it

When you've done a few experiments and you're happy with how your model is doing, you'll likely want someone else to be able to use it.

This may come in the form of a teammate or colleague trying to replicate and validate your results or through a customer using your model as part of a service or application you offer.

Saving a model also allows you to reuse it later without having to retrain it, which is especially helpful when your training times start to increase.

You can save a [Scikit-Learn model](https://scikit-learn.org/stable/model_persistence.html) using Python's built-in [pickle](https://docs.python.org/3/library/pickle.html) module.

In [22]:
import pickle

# Save trained model to file ("wb" stands for write binary mode)
with open("random_forest_model_1.pkl", "wb") as f:
    pickle.dump(clf, f)

In [23]:
# Load the saved model and score it on the test set ("rb" stands for read binary mode)
with open("random_forest_model_1.pkl", "rb") as f:
    loaded_model = pickle.load(f)
loaded_model.score(X_test, y_test)

0.8360655737704918

Loaded pickle model prediction score: 83.61%
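An alternative to `pickle` is `joblib`, which is installed alongside Scikit-Learn (it appears in the dependency list at the top of this notebook). The sketch below trains a small model on synthetic stand-in data, saves it to a temporary directory, and checks that the loaded copy makes the same predictions. The file name `random_forest_demo.joblib` is hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical stand-in data and model
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)
model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X_demo, y_demo)

# Save to a temporary directory, then load the model back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "random_forest_demo.joblib")
    joblib.dump(model, model_path)
    loaded = joblib.load(model_path)

# The loaded model should make the same predictions as the original
same_predictions = bool((loaded.predict(X_demo) == model.predict(X_demo)).all())
print(same_predictions)  # True
```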

We've covered a lot of ground fast...

Let's break things down a bit more by revisiting each section.