# Integrating ModelKits into Jupyter Notebook Workflows: A Practical Example

## Introduction

The Kaggle competition [Titanic - Machine Learning from Disaster](https://www.kaggle.com/c/titanic) challenges participants to build a model that uses Titanic passenger data (name, age, ticket price, etc.) to predict who survived and who died.

While this Notebook does build a solution to that problem, the primary goal isn't to create the best predictive model but to demonstrate how to leverage [KitOps ModelKits](https://www.kitops.ml) within a machine learning workflow.

Although this example is a Jupyter Notebook written in Python, the code could be used just as effectively in workflows outside a Notebook environment, and its functionality could be easily reproduced in other programming languages.

## Before You Begin

1. If you haven't already done so, [sign up for a free account with Jozu.ml](https://api.jozu.ml/signup)

2. After you log into Jozu, add a new Repository named *"titanic-survivability"*, which we'll use in this Notebook.

3. In the same directory as this Notebook--which we'll call the *Project directory*--create a `.env` file.

4. Edit the `.env` file and add an entry for your **JOZU_USERNAME**, your **JOZU_PASSWORD** and your **JOZU_NAMESPACE** (aka your **Personal Organization** name). For example:
```bash
    JOZU_USERNAME=brett@jozu.org
    JOZU_PASSWORD=my_password
    JOZU_NAMESPACE=brett
```
5. Be sure to save the changes to your `.env` file before continuing.

6. Install the [Kit CLI](https://kitops.ml/docs/cli/installation.html) on the machine where the notebook will run.
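
A quick check that all three settings are present can save a confusing login failure later. The sketch below is only an illustration: the `missing_settings` helper is hypothetical (it is not part of this project's helper modules), and it inspects a plain mapping so it can be tried without touching the real environment. After `load_dotenv()` has run, calling it with no arguments would check `os.environ` instead.

```python
import os

# The three entries the .env file is expected to provide.
REQUIRED_SETTINGS = ("JOZU_USERNAME", "JOZU_PASSWORD", "JOZU_NAMESPACE")


def missing_settings(env=os.environ):
    """Return the names of required settings that are unset or blank.

    Hypothetical helper for illustration; not part of the project's code.
    """
    return [name for name in REQUIRED_SETTINGS if not env.get(name, "").strip()]


# Example with an in-memory mapping instead of the real environment.
example_env = {
    "JOZU_USERNAME": "brett@jozu.org",
    "JOZU_PASSWORD": "",
    "JOZU_NAMESPACE": "brett",
}
print(missing_settings(example_env))  # ['JOZU_PASSWORD']
```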

## Project Setup

### Set Up Your Python Environment

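- This project was created using Python 3.12. Older versions should also work as long as they are supported by the packages listed in `requirements.txt`; note that the `sparse_output` argument used later in this Notebook requires scikit-learn 1.2 or newer, which in turn requires Python 3.8 or newer.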

- We recommend using a Python or Conda virtual environment to isolate this project's code to prevent it from affecting the system-installed Python.

- If you name your Python or Conda environment something other than ".venv" or "venv", then be sure to add the name to the `.gitignore` file. *This step assumes you'll be using `git` for version control of this project.*

### Load the Required Python Packages

In [1]:
%pip install -r requirements.txt

Note: you may need to restart the kernel to use updated packages.


## Use the Collated Datasets from Jozu Hub

### Pull the `collated-data-v1` ModelKit Version

This tagged ModelKit version contains the unprocessed data split into two separate datasets:  one to be used for model training, and the other to be used for model testing.

In [2]:
from dotenv import load_dotenv
from kitfile_helpers import import_kitfile
from modelkit_helpers import unpack_collated_data_modelkit
import os

# construct the tag name for this ModelKit's version
tag_name = "collated-data-v1"

# load the login credentials to Jozu.ml taken from environment variables stored in the .env file
load_dotenv(override=True)

# unpack the `collated-data-v1` ModelKit version to the local machine
unpack_collated_data_modelkit(user = os.getenv("JOZU_USERNAME"), 
                            passwd = os.getenv("JOZU_PASSWORD"), 
                            tag = tag_name)

kitfile = import_kitfile(print = False)

Log in successful
Unpacking to /Users/brett/Projects/Demos/notebook-with-modelkit
Unpacking config to /Users/brett/Projects/Demos/notebook-with-modelkit/Kitfile
Unpacking dataset training to data/train.csv
Unpacking dataset testing to data/test.csv
Unpacking docs to docs/README.md
Unpacking docs to docs/LICENSE
Successfully logged out from jozu.ml
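
The output above suggests a login, unpack, logout round trip through the Kit CLI. The internals of `unpack_collated_data_modelkit` are not shown in this Notebook, so the following is only a sketch of the commands such a helper might run. The function name `kit_unpack_commands` is hypothetical, the namespace and repository values are placeholders, and the flags should be verified against `kit login --help` and `kit unpack --help` for your installed version. The sketch only builds the command lines; it does not execute them.

```python
import shlex


def kit_unpack_commands(registry, namespace, repository, tag, username, target_dir="."):
    """Build the Kit CLI calls for a login / unpack / logout round trip.

    Hypothetical helper for illustration; the real helper may differ.
    """
    reference = f"{registry}/{namespace}/{repository}:{tag}"
    return [
        # --password-stdin reads the password from standard input
        # rather than taking it as a command-line argument
        ["kit", "login", registry, "-u", username, "--password-stdin"],
        # -d selects the directory the ModelKit contents are unpacked into
        ["kit", "unpack", reference, "-d", target_dir],
        ["kit", "logout", registry],
    ]


# Placeholder namespace and repository; substitute your own values.
commands = kit_unpack_commands(
    "jozu.ml", "brett", "titanic-survivability", "collated-data-v1", "brett@jozu.org"
)
for command in commands:
    print(shlex.join(command))
```

Each list could be passed to `subprocess.run`, with the password supplied through the `input` argument for the login step.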


## Explore the Project

### View the Project's Files

```bash
├── Kitfile                      # the ModelKit's Kitfile to be updated via the Notebook's workflow
└── data
    ├── test.csv                 # the testing dataset
    └── train.csv                # the training dataset
```

### View the ModelKit's Kitfile

At the heart of every [ModelKit](https://kitops.ml/docs/modelkit/intro.html) is a [Kitfile](https://kitops.ml/docs/kitfile/format.html), a YAML-formatted configuration file. The Kitfile for this project has already been created with a base set of configuration details, but we'll update it as we progress through the workflow.

Let's view the Kitfile's contents:

In [None]:
kitfile = import_kitfile(print = True)

The `manifestVersion` and `package` sections primarily include metadata about the ModelKit.

### Load the Collated Datasets from Disk

In [28]:
from pathlib import Path
import pandas as pd

# load the titanic data 
train_data, test_data = [pd.read_csv(Path("data") / filename) for filename in ("train.csv", "test.csv")]

## Data Exploration

In [None]:
train_data.head(10)

The attributes have the following meaning:
* **PassengerId**: a unique identifier for each passenger.
* **Survived**: the target; 0 means the passenger did not survive, while 1 means the passenger survived.
* **Pclass**: passenger class.
* **Name**, **Sex**, **Age**: self-explanatory.
* **SibSp**: how many siblings and spouses of the passenger were aboard the Titanic.
* **Parch**: how many children and parents of the passenger were aboard the Titanic.
* **Ticket**: ticket ID.
* **Fare**: price paid (in pounds).
* **Cabin**: the passenger's cabin number.
* **Embarked**: where the passenger boarded the Titanic.

We want to train a model that predicts which passengers **Survived** based on the values in the other attributes.

Let's see how much data is missing:

In [None]:
train_data.info()

Some observations:
- The **PassengerId** attribute can be used as the dataset's index.
- The **Name** and **Ticket** attributes may have some value, but they will be a bit tricky to convert into useful numbers that a model can consume. So for now, we will ignore them.
- The **SibSp** and **Parch** attributes can be added together to create a **NumRelatives** attribute.
- About 77% of the **Cabin** values are null, so we'll ignore that column as well.
- About 19% of the **Age** values are null, but we can replace those values with the mean of the k-nearest neighbors.
- Two of the **Embarked** values are empty, but we can replace those values with the most common value in that column.
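
The percentages above are simply the share of missing entries in each column. In the Notebook, `train_data.isna().mean()` reports the same fractions directly. As a dependency-free illustration of the arithmetic, here is a small sketch; the `null_fractions` helper is hypothetical and the toy rows are made up for the example, not taken from `train.csv`.

```python
def null_fractions(rows):
    """Return the fraction of missing (None or empty string) values per column."""
    columns = rows[0].keys()
    return {
        column: sum(row[column] in (None, "") for row in rows) / len(rows)
        for column in columns
    }


# Toy rows for illustration only; these are not taken from train.csv.
toy_rows = [
    {"Age": 22.0, "Cabin": None, "Embarked": "S"},
    {"Age": None, "Cabin": "C85", "Embarked": "C"},
    {"Age": 26.0, "Cabin": None, "Embarked": ""},
    {"Age": 35.0, "Cabin": None, "Embarked": "S"},
]
print(null_fractions(toy_rows))  # {'Age': 0.25, 'Cabin': 0.75, 'Embarked': 0.25}
```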



## Data Preparation

Let's explicitly set the **PassengerId** column as the index column:

In [31]:
train_data = train_data.set_index("PassengerId")
test_data = test_data.set_index("PassengerId")

Create the new attribute **NumRelatives** by adding the values from **SibSp** and **Parch**:

In [32]:
train_data["NumRelatives"] = train_data["SibSp"] + train_data["Parch"]
test_data["NumRelatives"] = test_data["SibSp"] + test_data["Parch"]

#### Build and Run the Data Preprocessing Pipeline

Starting with the pipeline for numerical attributes:

In [33]:
from sklearn.pipeline import Pipeline
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
        ("imputer", KNNImputer(n_neighbors=2)),
        ("scaler", StandardScaler())
    ])
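
To make the second step of this pipeline concrete: `StandardScaler` centers each column on its mean and divides by the population standard deviation, so every numerical feature ends up on a comparable scale. The sketch below reproduces that arithmetic for a single column using only the standard library. The `standard_scale` function is hypothetical, the fare values are illustrative, and the imputation step is not covered here.

```python
from statistics import fmean, pstdev


def standard_scale(values):
    """Center values on their mean and divide by the population standard deviation."""
    mean = fmean(values)
    # Fall back to 1.0 for a constant column to avoid dividing by zero.
    std = pstdev(values) or 1.0
    return [(value - mean) / std for value in values]


fares = [7.25, 71.28, 7.93, 53.10]  # illustrative numbers only
scaled = standard_scale(fares)
print([round(value, 3) for value in scaled])
```

After scaling, the column has a mean of zero and a standard deviation of one, up to floating-point rounding.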

And continuing with the pipeline for the categorical attributes:

In [34]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

cat_pipeline = Pipeline([
        ("ordinal_encoder", OrdinalEncoder()),    
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("cat_encoder", OneHotEncoder(sparse_output=False)),
    ])
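
The idea behind the categorical pipeline can be sketched without scikit-learn: fill each missing value with the column's most frequent value, then expand the column into one 0/1 column per category. The two functions below are hypothetical illustrations, not the library's implementation, and they skip the intermediate ordinal-encoding step that the pipeline uses so the imputer sees numbers rather than strings.

```python
from collections import Counter


def impute_most_frequent(values):
    """Replace None with the most common non-missing value."""
    most_common = Counter(v for v in values if v is not None).most_common(1)[0][0]
    return [most_common if v is None else v for v in values]


def one_hot(values):
    """Encode each value as a 0/1 row with one column per sorted category."""
    categories = sorted(set(values))
    return categories, [[int(v == c) for c in categories] for v in values]


embarked = ["S", "C", None, "S", "Q"]  # illustrative values only
categories, encoded = one_hot(impute_most_frequent(embarked))
print(categories)  # ['C', 'Q', 'S']
print(encoded[2])  # [0, 0, 1]
```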

Finally, let's join the numerical and categorical pipelines and run them against the training and testing data.

In [35]:
from sklearn.compose import ColumnTransformer

num_attribs = ["Age", "NumRelatives", "Fare"]
cat_attribs = ["Pclass", "Sex", "Embarked"]

preprocess_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", cat_pipeline, cat_attribs),
    ])

X_train = preprocess_pipeline.fit_transform(train_data)
X_test = preprocess_pipeline.transform(test_data)


Get the labels:

In [36]:
y_train = train_data["Survived"]

#### Export the Processed Datasets

In [37]:
import numpy as np

# Save the processed datasets to disk
np.save(Path("data") / "processed_X_train.npy", X_train)
np.save(Path("data") / "processed_X_test.npy", X_test)
y_train.to_csv(Path("data") / "processed_y_train.csv")


#### Update the Kitfile

We need to update the Kitfile's `datasets` section to include the new processed data files.

In [38]:
from kitfile_helpers import update_datasets_section

# add the test and train data sets to the Kitfile
training_data_info = {
    "name": "processed_training",
    "path": "data/processed_X_train.npy",
    "description": "Processed data to be used for model training.",
    "license": "Apache-2.0"
}

testing_data_info = {
    "name": "processed_testing",
    "path": "data/processed_X_test.npy",
    "description": "Processed data to be used for model testing.",
    "license": "Apache-2.0"
}

label_data_info = {
    "name": "processed_labels",
    "path": "data/processed_y_train.csv",
    "description": "Processed labels to be used for model training.",
    "license": "Apache-2.0"
}

datasets_info = [
    training_data_info,
    testing_data_info,
    label_data_info
]

update_datasets_section(kitfile, datasets_info, replace = False, print = False)
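
The `replace` flag presumably controls whether the new entries are added to the Kitfile's existing `datasets` list or substituted for it. The helper's implementation is not shown in this Notebook, so the sketch below is only one plausible reading of that behavior; `merge_entries` is a hypothetical function operating on plain lists of dictionaries, and the assumption that entries are matched by `name` is mine.

```python
def merge_entries(existing, new_entries, replace=False):
    """Return a new list of Kitfile entries.

    Assumed semantics (not confirmed by the project's helper):
    replace=True discards the existing entries; otherwise new entries are
    appended, and an entry whose "name" matches an existing one overwrites it.
    """
    if replace:
        return [dict(entry) for entry in new_entries]
    merged = {entry["name"]: dict(entry) for entry in existing}
    for entry in new_entries:
        merged[entry["name"]] = dict(entry)
    return list(merged.values())


existing = [
    {"name": "training", "path": "data/train.csv"},
    {"name": "testing", "path": "data/test.csv"},
]
new = [{"name": "processed_training", "path": "data/processed_X_train.npy"}]

print([entry["name"] for entry in merge_entries(existing, new)])
# ['training', 'testing', 'processed_training']
print([entry["name"] for entry in merge_entries(existing, new, replace=True)])
# ['processed_training']
```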

We need to update the Kitfile's `code` section to include the Jupyter Notebook and related code files.

In [39]:
from kitfile_helpers import update_code_section

# add the Jupyter Notebook
notebook = {
    "path": "titanic_survivability.ipynb",
    "description": "Jupyter notebook.",
    "license": "Apache-2.0"
}

requirements = {
    "path": "requirements.txt",
    "description": "Required packages for the notebook.",
    "license": "Apache-2.0"
}

kitfile_template = {
    "path": "template/Kitfile.template",
    "description": "Template for the Kitfile.",
    "license": "Apache-2.0"
}

kitfile_helpers = {
    "path": "kitfile_helpers.py",
    "description": "Python helper functions for authoring Kitfiles.",
    "license": "Apache-2.0"
}

modelkit_helpers = {
    "path": "modelkit_helpers.py",
    "description": "Python helper functions for performing ModelKit actions.",
    "license": "Apache-2.0"
}

gitignore = {
    "path": ".gitignore",
    "description": "gitignore file.",
    "license": "Apache-2.0"
}

code_info = [
    notebook,
    requirements,
    kitfile_template,
    kitfile_helpers,
    modelkit_helpers,
    gitignore
]

update_code_section(kitfile, code_info, replace = True, print = False)

#### Push this ModelKit's Version to Jozu Hub

Before pushing, we need to update the Kitfile's `package` section so that its description includes this ModelKit version's tag name, `processed-data-v5`, along with the current UTC timestamp of the update for future reference.

Then, we'll pack and push the ModelKit version tagged as `processed-data-v5` to Jozu Hub.

Finally, we'll print the contents of the updated Kitfile.

In [None]:
from kitfile_helpers import update_package_section
from modelkit_helpers import pack_and_push_modelkit

# construct the tag name for this ModelKit's version
tag_name = "processed-data-v5"

update_package_section(kitfile, tag_name, print = False)

# load the login credentials to Jozu.ml taken from environment variables stored in the .env file
load_dotenv(override=True)

# pack and push the `processed-data-v5` ModelKit version
pack_and_push_modelkit(user = os.getenv("JOZU_USERNAME"), 
                       passwd = os.getenv("JOZU_PASSWORD"), 
                       namespace = os.getenv("JOZU_NAMESPACE"),
                       registry = "jozu.ml",
                       tag = tag_name)

kitfile = import_kitfile(print = True)

With this step complete, the ModelKit's latest state, including the processed data, is available on Jozu Hub under the tag `processed-data-v5`.
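
Recording the tag name and a UTC timestamp in the package description makes it easy to tell later which workflow step produced a given version. The exact format written by `update_package_section` is not shown in this Notebook, so the sketch below uses a made-up format; `describe_version` is a hypothetical function, and the base description and date are illustrative.

```python
from datetime import datetime, timezone


def describe_version(base_description, tag_name, now=None):
    """Append the tag name and a UTC timestamp to a description string.

    Hypothetical format for illustration; the project's helper may differ.
    """
    # Normalize to UTC so the "UTC" label is always accurate.
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    stamp = now.strftime("%Y-%m-%d %H:%M:%S UTC")
    return f"{base_description} [{tag_name}, updated {stamp}]"


# A fixed, illustrative time keeps the example reproducible.
fixed_time = datetime(2024, 1, 15, 12, 30, 0, tzinfo=timezone.utc)
print(describe_version("Titanic survivability predictor", "processed-data-v5", fixed_time))
# Titanic survivability predictor [processed-data-v5, updated 2024-01-15 12:30:00 UTC]
```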

## Model Training

Let's try training a `RandomForestClassifier` to predict the survivability of the passengers.

In [None]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=100, random_state=42)
rfc.fit(X_train, y_train)

With the model trained, let's use it to make predictions on the test data:

In [42]:
y_pred = rfc.predict(X_test)

## Model Validation

Let's use the mean accuracy of 10 cross-validation folds to get an idea of how good our model is.

In [None]:
from sklearn.model_selection import cross_val_score

rfc_scores = cross_val_score(rfc, X_train, y_train, cv=10)
rfc_scores.mean()

This model performs at about 81.8% accuracy. There are a number of things we could do to try to improve our prediction accuracy--such as *feature engineering*, *trying different types of models*, and *optimizing our models' parameters*--but, for the purpose of this exercise, we'll assume we're ready to move our model into production.
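
For intuition about what the 10-fold score means: the training data is split into ten parts, the model is trained on nine and scored on the remaining one, and the ten scores are averaged. The sketch below shows only the splitting and averaging. It is a simplification: for a classifier, `cross_val_score` with an integer `cv` uses stratified folds that preserve the class balance, which this sketch ignores. The function names are hypothetical, and 891 is the number of rows in Kaggle's `train.csv`.

```python
from statistics import fmean


def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds take one additional sample each.
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def mean_accuracy(fold_scores):
    """Average the per-fold accuracy scores."""
    return fmean(fold_scores)


sizes = [len(test) for _, test in k_fold_indices(891, 10)]
print(sizes)  # [90, 89, 89, 89, 89, 89, 89, 89, 89, 89]
```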

## Export the Trained Model

Let's export our trained `RandomForestClassifier` model to a joblib-formatted file named **model.joblib**.

In [None]:
import joblib

artifact_filename = 'model.joblib'

# Save the model artifact to the local filesystem, creating the "model" directory if needed
model_path = Path() / "model" / artifact_filename
model_path.parent.mkdir(parents=True, exist_ok=True)
joblib.dump(rfc, model_path)


### Update the Kitfile

With the model exported, let's update our ModelKit's Kitfile with the configuration details for the `model` section.

In [None]:
from kitfile_helpers import update_model_section

# add the `model` section to the kitfile
model_info = {
    "name": "titanic-survivability-predictor",
    "path": str(model_path),
    "description": "RandomForestClassifier",
    "framework": "joblib",
    "license": "Apache-2.0",
    "version": "1.0"
}

update_model_section(kitfile, model_info, print = True)

### Push the ModelKit to Jozu Hub

Finally, we can tag this ModelKit version as `trained_model_v2`, pack and push it to Jozu Hub, and print the updated Kitfile.

In [None]:
# construct the tag name for this ModelKit's version
tag_name = "trained_model_v2"

update_package_section(kitfile, tag_name, print = False)

# load the login credentials to Jozu.ml taken from environment variables stored in the .env file
load_dotenv(override=True)

# pack and push the `trained_model_v2` ModelKit version
pack_and_push_modelkit(user = os.getenv("JOZU_USERNAME"), 
                       passwd = os.getenv("JOZU_PASSWORD"), 
                       namespace = os.getenv("JOZU_NAMESPACE"),
                       registry = "jozu.ml",
                       tag = tag_name)

kitfile = import_kitfile(print = True)
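
Packing and pushing both operate on a full ModelKit reference of the form `registry/namespace/repository:tag`. The internals of `pack_and_push_modelkit` are not shown in this Notebook, so the sketch below is only a guess at the Kit CLI commands it might issue. The function name `kit_pack_and_push_commands` is hypothetical, the namespace is a placeholder, login and logout are omitted, and the flags should be checked against `kit pack --help` and `kit push --help`. The sketch builds the command lines without running them.

```python
def kit_pack_and_push_commands(registry, namespace, repository, tag, context_dir="."):
    """Build the Kit CLI calls that pack a directory and push the result.

    Hypothetical helper for illustration; the real helper may differ.
    """
    reference = f"{registry}/{namespace}/{repository}:{tag}"
    return [
        # Pack the directory containing the Kitfile and tag the ModelKit.
        ["kit", "pack", context_dir, "-t", reference],
        # Push the tagged ModelKit to the remote registry.
        ["kit", "push", reference],
    ]


# Placeholder namespace; substitute the value of JOZU_NAMESPACE.
for command in kit_pack_and_push_commands(
    "jozu.ml", "brett", "titanic-survivability", "trained_model_v2"
):
    print(" ".join(command))
```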

To view your tagged ModelKits:
1. Log back into [Jozu Hub](https://jozu.ml) in your browser
2. Click on *My Repositories*
3. Click on *titanic-survivability*
4. Your tagged ModelKits will be displayed.

You can expand each one and click the "MORE DETAILS" link to view information about their contents.