# Sample ML Model Training and Export

This notebook serves as an example of training a simple ML model, evaluating it, then finally exporting it to a binary file format so that we can serve our model using [Starpack](https://github.com/a360-starpack/starpack).

The sample data is publicly available with an account on [Kaggle](https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset).

----

First, we will import the necessary libraries:

1. `pandas` - Create and manipulate tabular data, such as the CSV file we will be using.
2. `sklearn` - The import name of scikit-learn, a Python library with a wide variety of machine learning algorithms and helper functions for manipulating data and training models. In this example, we use the `RandomForestClassifier` class, a random forest of decision trees, to classify patients as having heart disease or not. We also use the `train_test_split` helper function to split the dataset into a training set and a testing set for verification, and the `accuracy_score` helper function to check the accuracy of our classification model.
3. `pathlib` - Built-in (standard library) module for working with filesystem paths. We will use it to make sure we point to the right files when we load data and export the model.
4. `joblib` - A third-party library (not part of the standard library) for serializing in-memory Python objects, such as the trained model object in this example, to a binary file for later use.

Next, we will use `pathlib` to store our base directory in a variable that we reuse when reading and writing files. We will also set a random seed so that the randomized steps in this notebook (the train/test split and the random forest itself) give the same results on every run.
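As a small illustration of the path handling described above, the sketch below resolves the working directory and joins a file name onto it. The variable names `base_dir` and `data_path` are illustrative only; the notebook itself uses `BASE_DIR`.

```python
from pathlib import Path

# Resolve the current working directory to an absolute path.
base_dir = Path(".").resolve()

# The `/` operator joins path segments in an OS-independent way.
data_path = base_dir / "heart.csv"

print(data_path.is_absolute())       # True
print(data_path.name)                # heart.csv
print(data_path.parent == base_dir)  # True
```

Because the path is resolved relative to the working directory, the notebook should be run from the folder that contains the data file.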

In [None]:
!pip install -r requirements.txt

In [None]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from pathlib import Path
import joblib

BASE_DIR = Path(".").resolve()
RANDOM_SEED = 42
print(BASE_DIR)

## Load Data

In [3]:
heart_df = pd.read_csv(BASE_DIR / "heart.csv")
heart_df

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,52,1,0,125,212,0,1,168,0,1.0,2,2,3,0
1,53,1,0,140,203,1,0,155,1,3.1,0,0,3,0
2,70,1,0,145,174,0,1,125,1,2.6,0,0,3,0
3,61,1,0,148,203,0,1,161,0,0.0,2,1,3,0
4,62,0,0,138,294,1,1,106,0,1.9,1,3,2,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1020,59,1,1,140,221,0,1,164,1,0.0,2,0,2,1
1021,60,1,0,125,258,0,0,141,1,2.8,1,1,3,0
1022,47,1,0,110,275,0,0,118,1,1.0,1,1,2,0
1023,50,0,0,110,254,0,0,159,0,0.0,2,0,2,1


## Analysis

Our target variable (also referred to as `y`) will be the `target` column. To keep this example simple, we will do no data preprocessing or feature selection beyond what was already provided to us. Our explanatory variables, `X`, will be all columns except `target`. For real-world use cases, exploratory data analysis and feature engineering are highly recommended.

In [4]:
# Split data along its columns into explanatory and target variables
X = heart_df.drop('target', axis=1)
y = heart_df['target']

# Split data along its rows into training and test sets
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=RANDOM_SEED, train_size=0.6)
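To see why a fixed `random_state` makes the row split repeatable, here is a minimal sketch that uses only the standard library. The helper `seeded_split` is hypothetical and is not how `train_test_split` is implemented; it only illustrates the idea of shuffling with a seeded generator and cutting at 60%.

```python
import random


def seeded_split(rows, train_fraction, seed):
    """Shuffle a copy of `rows` with a seeded generator, then cut it in two."""
    rng = random.Random(seed)  # private generator; global random state is untouched
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(10))  # stand-in for ten row indices

train_a, test_a = seeded_split(rows, train_fraction=0.6, seed=42)
train_b, test_b = seeded_split(rows, train_fraction=0.6, seed=42)

print(len(train_a), len(test_a))                # 6 4
print(train_a == train_b and test_a == test_b)  # True
```

The specific rows that scikit-learn selects will differ from this sketch, since it uses its own shuffling logic; only the principle is the same.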

Next, we will instantiate our model object, which has not yet been trained, and train it on our `train_X` and `train_y` data.

In [5]:
model = RandomForestClassifier(random_state=RANDOM_SEED)

model.fit(train_X, train_y)
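For intuition about how a forest combines its trees into one prediction: the scikit-learn documentation describes its forests as averaging the per-tree class probabilities rather than taking a hard majority vote. The sketch below illustrates that averaging with made-up probabilities for a single patient. The function `average_vote` and the numbers are hypothetical, and this is not the library's code.

```python
def average_vote(tree_probabilities):
    """Return the class index with the highest mean probability across trees."""
    n_trees = len(tree_probabilities)
    n_classes = len(tree_probabilities[0])
    mean_probabilities = []
    for k in range(n_classes):
        class_total = sum(p[k] for p in tree_probabilities)
        mean_probabilities.append(class_total / n_trees)
    return mean_probabilities.index(max(mean_probabilities))


# Hypothetical output of three trees: (P(no disease), P(disease)).
tree_probabilities = [
    (0.9, 0.1),
    (0.4, 0.6),
    (0.3, 0.7),
]

predicted_class = average_vote(tree_probabilities)
print(predicted_class)  # 0
```

Two of the three trees lean towards class 1 here, yet the averaged probability favours class 0, which shows how averaging can differ from a simple majority vote.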

## Evaluation

Next, we'll have the model make predictions on the test set, which was withheld from training, and examine the model's accuracy.

In [6]:
predictions = model.predict(test_X)
accuracy = accuracy_score(test_y, predictions)

print(f"The accuracy of the model is {accuracy * 100}%.")

The accuracy of the model is 98.53658536585365%.
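The accuracy reported above is the fraction of test rows whose predicted label matches the true label. A minimal sketch of that calculation, with a hypothetical `accuracy` helper and made-up labels, looks like this:

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction matches the label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels and predictions, for illustration only.
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

print(accuracy(y_true, y_pred))  # 0.8
```

As a cross-check, the figure printed above equals 404/410, which would correspond to 6 misclassified rows in a 410-row test set. The test-set size is inferred from that ratio and is not stated in the notebook.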


## Export

The model performed well, so let's go ahead and export our model into a format usable by Starpack.

In [None]:
MODEL_FILENAME = BASE_DIR / 'pred_heart_disease.joblib'

joblib.dump(model, MODEL_FILENAME)

Finally, we can confirm that the model we load back in performs identically to the original.

In [8]:
imported_model = joblib.load(MODEL_FILENAME)

In [9]:
predictions_new = imported_model.predict(test_X)
accuracy_new = accuracy_score(test_y, predictions_new)

print(f"The accuracy of the imported model is {accuracy_new * 100}%.")

The accuracy of the imported model is 98.53658536585365%.


We can see that the model we loaded back in from the file performs exactly as the original in-memory model did! Refer back to the example's `README.md` for more information on how to package and deploy this model using Starpack.
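Equal accuracy is a good sign, but comparing the predictions themselves is a stricter check, because two models could reach the same accuracy while disagreeing on individual rows. The sketch below shows the idea with the standard library's `pickle` as a stand-in for `joblib`, so that it has no third-party dependencies. The "model" here is just a hypothetical dict of parameters, and the rows are made up.

```python
import pickle


def predict(params, rows):
    """Label a row 1 when its feature value exceeds the threshold, else 0."""
    return [int(row[params["feature"]] > params["threshold"]) for row in rows]


# Hypothetical "trained model": a dict of learned parameters.
params = {"feature": "chol", "threshold": 240}
rows = [{"chol": 212}, {"chol": 294}, {"chol": 254}]

before = predict(params, rows)

# Round-trip through bytes, standing in for a dump to and load from a file.
restored = pickle.loads(pickle.dumps(params))
after = predict(restored, rows)

print(before == after)  # True
```

In this notebook, the equivalent check would be an element-wise comparison of `predictions` and `predictions_new`, for example with `numpy.array_equal`.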

In [10]:
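# Save the held-out features (without the target column) to CSV,
# e.g. as sample input for scoring the exported model later.
test_X.to_csv(BASE_DIR / "heart_disease_score.csv", index=False)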