# Integrating ModelKits into Jupyter Notebook Workflows: A Practical Example

## Introduction

The Kaggle competition [Titanic - Machine Learning from Disaster](https://www.kaggle.com/c/titanic) challenges participants to create a model that uses Titanic passenger data (name, age, ticket price, etc.) to predict who survived and who died.

While this Notebook does build out a solution to the problem posed, the primary goal isn't to create the best predictive model but to demonstrate how to leverage [KitOps ModelKits](https://www.kitops.ml) within a machine learning workflow.

And, though the current context applies to Jupyter Notebooks written in Python, the code provided could be used just as effectively in workflows outside of a Notebook environment. The code's functionality could also be easily reproduced in other programming languages.

## Before You Begin

1. If you haven't already done so, [sign up for a free account with Jozu.ml](https://api.jozu.ml/signup).

2. After you log into Jozu, add a new Repository named *"titanic-survivability"*, which we'll use in this Notebook.

3. In the same directory as this Notebook--which we'll call the *Project directory*--create a `.env` file.

4. Edit the `.env` file and add an entry for your **JOZU_USERNAME**, your **JOZU_PASSWORD** and your **JOZU_NAMESPACE** (aka your **Personal Organization** name). For example:
```bash
    JOZU_USERNAME=brett@jozu.org
    JOZU_PASSWORD=my_password
    JOZU_NAMESPACE=brett
```
5. Be sure to save the changes to your .env file before continuing.
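
Before running the cells that push to Jozu Hub, it can help to confirm that all three entries are actually visible to Python. The following is a minimal sketch of such a check. `missing_settings` and `REQUIRED_SETTINGS` are hypothetical names introduced here for illustration (they are not part of KitOps or python-dotenv), and the helper only inspects a mapping such as `os.environ` after the `.env` file has been loaded.

```python
import os

# The three entries this Notebook expects to find in the .env file
REQUIRED_SETTINGS = ("JOZU_USERNAME", "JOZU_PASSWORD", "JOZU_NAMESPACE")

def missing_settings(env=None) -> list:
    """Return the names of required settings that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_SETTINGS if not env.get(name, "").strip()]

# Example with an explicit mapping instead of the real environment
example_env = {
    "JOZU_USERNAME": "brett@jozu.org",
    "JOZU_PASSWORD": "",
    "JOZU_NAMESPACE": "brett",
}
print(missing_settings(example_env))  # ['JOZU_PASSWORD']
```

Calling `missing_settings()` with no argument checks the real process environment, so an empty list means all three values were found.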

## Project Setup

### Set Up Your Python Environment

- This project was created using Python 3.12, but should work for Python versions >= 3.7.

- We recommend using a Python or Conda virtual environment to isolate this project's code to prevent it from affecting the system-installed Python.

- If you name your Python or Conda environment something other than ".venv" or "venv", then be sure to add the name to the `.gitignore` file. *This step assumes you'll be using `git` for version control of this project.*

### Load the Required Python Packages

In [159]:
%pip install -r requirements.txt

Note: you may need to restart the kernel to use updated packages.


### Define Helper Functions

#### Functions for Working with Kitfiles

In [160]:
import yaml
from datetime import datetime, timezone
from pathlib import Path

# Helper function to print the contents of the python dictionary object
# representing the project's Kitfile
def print_kitfile_contents(kitfile):
    print('Kitfile Contents...')
    print('===================\n')
    print(yaml.safe_dump(kitfile, sort_keys=False))

# Helper function to read the project's Kitfile from disk into
# a python dictionary object.
# If the print flag is true, the Kitfile's contents are printed.
def import_kitfile(print = True) -> dict:
    # Path to kitfile template
    kitfile_template = Path("template") / "Kitfile.template"
    # Open the Kitfile template
    with open(kitfile_template, 'r') as file:
        # Load the contents into a Python dictionary
        kitfile = yaml.safe_load(file)
    if print:
        print_kitfile_contents(kitfile)
    return kitfile

# Helper function to export the python dictionary object 
# representing the project's Kitfile to disk.
# If the print flag is true, the Kitfile's contents are printed.
def export_kitfile(kitfile, print = True):
    # Write the Kitfile to disk, closing the file handle when done
    with open('Kitfile', 'w') as file:
        yaml.safe_dump(kitfile, file, sort_keys=False)
    if print:
        print_kitfile_contents(kitfile)

# Helper function to update the Kitfile's Datasets section with
# details about the data files to include.
# If the replace flag is True, the Datasets section will 
# be over-written; otherwise, the Datasets section will be
# updated with the additional Datasets.
# If the print flag is True, the contents of the updated
# Kitfile are printed.
def update_datasets_section(kitfile, datasets, replace = True, print = True):
    if replace:
        kitfile["datasets"] = datasets 
    else:
        # create the section first if the Kitfile doesn't have one yet
        kitfile.setdefault("datasets", []).extend(datasets)
    # save the updated Kitfile contents to disk
    export_kitfile(kitfile, print)

# Helper function to update the Kitfile's Package section with
# details about the current ModelKit, including the tag name
# to be used when pushing it, and the approximate timestamp
# when the ModelKit is to be pushed.
# If the print flag is True, the contents of the updated
# Kitfile are printed.
def update_package_section(kitfile, tag_name, print = True):
    # Get the current UTC timestamp
    current_utc_timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%S %Z")
    package_section = kitfile["package"]
    description = package_section["description"]
    description += ("\nModelKit tag: " + tag_name + " pushed at: " + current_utc_timestamp)
    package_section["description"] = description
    # save the updated Kitfile contents to disk
    export_kitfile(kitfile, print)
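
Because `update_package_section` appends to the existing description rather than replacing it, the description gains one line per tagged version as the workflow progresses. The sketch below reproduces just that append step on a plain dictionary, with no file I/O. `append_push_note` and `demo_kitfile` are hypothetical names used only for this illustration, and the timestamp is passed in explicitly so the result is reproducible.

```python
from datetime import datetime, timezone

def append_push_note(kitfile: dict, tag_name: str, when: datetime) -> dict:
    """Append a 'ModelKit tag ... pushed at ...' line to the package description."""
    stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S %Z")
    package = kitfile["package"]
    package["description"] += "\nModelKit tag: " + tag_name + " pushed at: " + stamp
    return kitfile

# A stand-in dictionary so the real `kitfile` object is left untouched
demo_kitfile = {"package": {"description": "A demo package."}}
fixed_time = datetime(2024, 10, 21, 21, 5, 15, tzinfo=timezone.utc)

append_push_note(demo_kitfile, "collated-data-v1", fixed_time)
print(demo_kitfile["package"]["description"])
# A demo package.
# ModelKit tag: collated-data-v1 pushed at: 2024-10-21 21:05:15 UTC
```

One consequence of this design is that re-running a cell that calls the helper adds another line each time, so the description is best read as a log rather than a single summary.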

#### Functions for Working with ModelKits

In [161]:
import subprocess

def kit_login(user: str, passwd: str, registry: str = "jozu.ml"):
    subprocess.run(["kit", "login", registry, "-u", user, "--password-stdin"], input=passwd, text=True)

def kit_logout(registry: str = "jozu.ml"):
    subprocess.run(["kit", "logout", registry])

def kit_pack(repo_path_with_tag: str):
    subprocess.run(["kit", "pack", ".", "-t", repo_path_with_tag])

def kit_push(repo_path_with_tag: str):
    subprocess.run(["kit", "push", repo_path_with_tag])

def pack_and_push_modelkit(user: str, passwd: str, namespace: str, 
                           registry: str = "jozu.ml", tag:str = "latest"):
    repo_path_with_tag = registry + "/" + namespace + "/" + "titanic-survivability" + ":" + tag
    kit_login(user, passwd, registry)
    kit_pack(repo_path_with_tag)
    kit_push(repo_path_with_tag)
    kit_logout(registry)
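
The helpers above call `subprocess.run` without checking the exit status, so a failed login or push would not stop the Notebook. The sketch below shows one possible way to build the `registry/namespace/repository:tag` reference and to run a command so that a non-zero exit raises an exception. `modelkit_reference` and `run_checked` are hypothetical helper names, and the demonstration deliberately runs the Python interpreter rather than the `kit` CLI so that it works without KitOps installed.

```python
import subprocess
import sys

def modelkit_reference(registry: str, namespace: str, repository: str, tag: str = "latest") -> str:
    """Build a reference of the form registry/namespace/repository:tag."""
    return f"{registry}/{namespace}/{repository}:{tag}"

def run_checked(command: list) -> subprocess.CompletedProcess:
    """Run a command and raise CalledProcessError if it exits with a non-zero status."""
    return subprocess.run(command, check=True, capture_output=True, text=True)

ref = modelkit_reference("jozu.ml", "brett", "titanic-survivability", "collated-data-v1")
print(ref)  # jozu.ml/brett/titanic-survivability:collated-data-v1

# Demonstrated with the Python interpreter so the sketch runs anywhere;
# in the Notebook the command would be something like ["kit", "push", ref].
result = run_checked([sys.executable, "-c", "print('ok')"])
print(result.stdout.strip())  # ok
```

With `check=True`, a failing command raises `subprocess.CalledProcessError`, which makes it harder to miss a push that never reached the registry.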

## Explore the Project

### View the Project's Files

```bash
├── requirements.txt             # a list of python packages required for this project
├── template
│   └── Kitfile.template         # a base Kitfile used as a starting point
├── docs
│   ├── LICENSE                  # the license file 
│   └── README.md                # the project's README file
├── titanic_survivability.ipynb  # this Jupyter Notebook
├── Kitfile                      # the ModelKit's Kitfile to be updated via the Notebook's workflow
└── data
    ├── test.csv                 # the testing dataset
    └── train.csv                # the training dataset
```

### View the ModelKit's Kitfile

At the heart of every [ModelKit](https://kitops.ml/docs/modelkit/intro.html) is a [Kitfile](https://kitops.ml/docs/kitfile/format.html), a YAML-formatted configuration file. The Kitfile for this project has already been created with a base set of configuration details, but we'll update it as we progress through the workflow.

Let's view the Kitfile's contents:

In [162]:
kitfile = import_kitfile(print=True)

Kitfile Contents...

manifestVersion: 1.0
package:
  name: Titanic-Survivability-Predictor
  version: 1.0.0
  description: A project attempting to predict passenger survivability of the Titanic
    Shipwreck.
  authors:
  - Jozu
docs:
- path: docs/README.md
  description: Important notes about the project.
- path: docs/LICENSE
  description: The license for this ModelKit
code:
- path: requirements.txt
  description: Python packages required by this example.
  license: Apache-2.0
- path: titanic_survivability.ipynb
  description: Jupyter Notebook used to train, validate, optimize and export the model.
  license: Apache-2.0
- path: template/Kitfile.template



The `manifestVersion` and `package` sections primarily include metadata about the ModelKit.  The `docs` and `code` sections contain references to their respective project files.
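
Since the `docs` and `code` sections (and later `datasets`) are lists of entries that each carry a `path`, a quick sanity check is to confirm that every referenced file exists before packing. The following is a small sketch of that idea. `missing_kitfile_paths` is a hypothetical helper, not a KitOps function, and it only covers the list-style sections shown in this Notebook.

```python
import tempfile
from pathlib import Path

def missing_kitfile_paths(kitfile: dict, root: Path) -> list:
    """List every 'path' in the docs, code and datasets sections that is absent under root."""
    missing = []
    for section in ("docs", "code", "datasets"):
        for entry in kitfile.get(section, []):
            if not (root / entry["path"]).exists():
                missing.append(entry["path"])
    return missing

# Small demonstration in a temporary directory with one real file
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "requirements.txt").write_text("pandas\n")
    demo = {
        "code": [{"path": "requirements.txt"}],
        "docs": [{"path": "docs/README.md"}],
    }
    demo_missing = missing_kitfile_paths(demo, root)

print(demo_missing)  # ['docs/README.md']
```

In the Notebook itself the call would look like `missing_kitfile_paths(kitfile, Path("."))`, where an empty list means every referenced file was found.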

## Data Collation

In this section we'll load the unprocessed data into two separate datasets:  one to be used for model training, and the other to be used for model testing. Once the unprocessed data sets have been collated, their paths will be added to the ModelKit's Kitfile, and the current state of the ModelKit will be packed and pushed to Jozu Hub.

### Load the Datasets

In [163]:
from pathlib import Path
import pandas as pd

# load the titanic data 
train_data, test_data = [pd.read_csv(Path("data") / filename) for filename in ("train.csv", "test.csv")]

### Update the Kitfile

#### Add the datasets to the Kitfile

With the data files loaded, now's a good time to update our ModelKit's Kitfile with the configuration details for the `datasets` section.

In [164]:
# add the test and train data sets to the Kitfile
training_data_info = {
    "name": "training",
    "path": "data/train.csv",
    "description": "Unprocessed data to be used for model training.",
    "license": "Apache-2.0"
}

testing_data_info = {
    "name": "testing",
    "path": "data/test.csv",
    "description": "Unprocessed data to be used for model testing.",
    "license": "Apache-2.0"
}

datasets_info = [
    training_data_info,
    testing_data_info
]

update_datasets_section(kitfile, datasets_info, print = False)

Next, let's update the Kitfile's `package` section so that its `description` includes the `tag` name of this ModelKit version, along with the current UTC timestamp of the update for future reference. The contents of the updated Kitfile will then be displayed.

In [165]:
# construct the tag name for this ModelKit's version
tag_name = "collated-data-v1"

update_package_section(kitfile, tag_name, print=True)

Kitfile Contents...

manifestVersion: 1.0
package:
  name: Titanic-Survivability-Predictor
  version: 1.0.0
  description: 'A project attempting to predict passenger survivability of the Titanic
    Shipwreck.

    ModelKit tag: collated-data-v1 pushed at: 2024-10-21 21:05:15 UTC'
  authors:
  - Jozu
docs:
- path: docs/README.md
  description: Important notes about the project.
- path: docs/LICENSE
  description: The license for this ModelKit
code:
- path: requirements.txt
  description: Python packages required by this example.
  license: Apache-2.0
- path: titanic_survivability.ipynb
  description: Jupyter Notebook used to train, validate, optimize and export the model.
  license: Apache-2.0
- path: template/Kitfile.template
datasets:
- name: training
  path: data/train.csv
  description: Unprocessed data to be used for model training.
  license: Apache-2.0
- name: testing
  path: data/test.csv
  description: Unprocessed data to be used for model testing.
  license: Apache-2.0



#### Push this ModelKit's Version to Jozu Hub

Now we can pack and push the ModelKit version tagged as `collated-data-v1`.

In [166]:

from dotenv import load_dotenv
import os

# load the login credentials to Jozu.ml taken from environment variables stored in the .env file
load_dotenv(override=True)

# pack and push the 'collated-data-v1' ModelKit version
pack_and_push_modelkit(user = os.getenv("JOZU_USERNAME"), 
                       passwd = os.getenv("JOZU_PASSWORD"), 
                       namespace = os.getenv("JOZU_NAMESPACE"),
                       registry = "jozu.ml",
                       tag = tag_name)

Log in successful
Saved configuration: sha256:cc78b2b9f7bb1cdae367b0c27bffae0a3c75004a411ca679ae4f34f234ba5a34
Already saved code layer: sha256:9de9f1238dd7a623c617cbc14e7f5eada59d28085909e9cf57279ce0ebd6a5ee
Saved code layer: sha256:63ea171856317029b88dd3c0d291aea7a415ae1ae6bfd188549eea0412f9381e
Already saved code layer: sha256:220109bcd4dc8be72b9228aefb24f69239c3308297cb4a4806705c002e910e3c
Already saved dataset layer: sha256:8af74261fda984270cc30a58277f28c3435de603ba2722e160528a0fab6d1127
Already saved dataset layer: sha256:cbd1c99ae96851ca944aeda77e1e8338c00fd639b9cdcdbe42044ec99352cb79
Already saved docs layer: sha256:7de63cef90f75c857e373367403ad0a23143068b62e5b725c02f845043df6c8c
Already saved docs layer: sha256:0b45ae1c1fe73e1bcc537625392baeb70253f551ab2b3ee384b214fbf82c2be7
Saved manifest to storage: sha256:1c99095ad1b876dba3a2178f72c0e1c4768b33fe9714e5681d998114a7c8f053
Model saved: sha256:1c99095ad1b876dba3a2178f72c0e1c4768b33fe9714e5681d998114a7c8f053
Pushing jozu.ml/brett

## Data Exploration

In [167]:
train_data.head(10)

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.075 | NaN | S |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |


The attributes have the following meaning:
* **PassengerId**: a unique identifier for each passenger
* **Survived**: the target; 0 means the passenger did not survive, while 1 means the passenger survived
* **Pclass**: passenger class
* **Name**, **Sex**, **Age**: self-explanatory
* **SibSp**: how many siblings and spouses of the passenger were aboard the Titanic
* **Parch**: how many children and parents of the passenger were aboard the Titanic
* **Ticket**: ticket ID
* **Fare**: price paid (in pounds)
* **Cabin**: passenger's cabin number
* **Embarked**: the port where the passenger boarded the Titanic

We want to train a model that predicts which passengers **Survived** based on the values in the other attributes.

Let's see how much data is missing:

In [168]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


Some observations:
- The **PassengerId** attribute may be used as the dataset's index.
- The **Name** and **Ticket** attributes may have some value, but they will be a bit tricky to convert into useful numbers that a model can consume. So for now, we will ignore them.
- The **SibSp** and **Parch** attributes may be summed to create a **NumRelatives** attribute.
- About 77% of the **Cabin** values are null, so we'll ignore that column as well.
- About 20% of the **Age** values are null, but we can replace those values with the mean of the k-nearest neighbors.
- Two of the **Embarked** values are empty, but we can replace those values with the most common value in that column.
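
The missing-value percentages in these observations follow directly from the non-null counts reported by `train_data.info()`. The short sketch below redoes that arithmetic with plain Python, using counts copied from the output above. The variable names are illustrative only.

```python
# Non-null counts copied from the train_data.info() output
total_rows = 891
non_null = {"Age": 714, "Cabin": 204, "Embarked": 889}

# Share of missing values per column
missing_share = {column: (total_rows - count) / total_rows
                 for column, count in non_null.items()}

for column, share in missing_share.items():
    print(f"{column}: {total_rows - non_null[column]} missing ({share:.1%})")
# Age: 177 missing (19.9%)
# Cabin: 687 missing (77.1%)
# Embarked: 2 missing (0.2%)
```

With the DataFrame at hand, `train_data.isna().mean()` gives the same shares for every column in one call.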



## Data Preparation

Let's explicitly set the **PassengerId** column as the index column:

In [169]:
train_data = train_data.set_index("PassengerId")
test_data = test_data.set_index("PassengerId")

Create the new attribute **NumRelatives** by adding the values from **SibSp** and **Parch**:

In [170]:
train_data["NumRelatives"] = train_data["SibSp"] + train_data["Parch"]
test_data["NumRelatives"] = test_data["SibSp"] + test_data["Parch"]

#### Build and Run the Data Preprocessing Pipeline

Starting with the pipeline for numerical attributes:

In [171]:
from sklearn.pipeline import Pipeline
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
        ("imputer", KNNImputer(n_neighbors=2)),
        ("scaler", StandardScaler())
    ])
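
To make the two steps of the numerical pipeline more concrete, here is a simplified pure-Python sketch of the idea: a missing value is filled with the mean of that feature over the `k` nearest rows, and the column is then standardized to zero mean and unit variance. This is an illustration under simplifying assumptions (only one column has gaps, and distances use the remaining columns directly); it is not scikit-learn's implementation, which uses a NaN-aware Euclidean distance. `knn_impute_column` and `standardize` are hypothetical names.

```python
import math
from statistics import mean, pstdev

def knn_impute_column(rows, col, k=2):
    """Fill None in column `col` with the mean of that column over the k nearest complete rows."""
    donors = [r for r in rows if r[col] is not None]
    filled = []
    for row in rows:
        if row[col] is not None:
            filled.append(list(row))
            continue

        # Euclidean distance measured on every column except the one being filled
        def distance(other, row=row):
            return math.sqrt(sum((row[j] - other[j]) ** 2
                                 for j in range(len(row)) if j != col))

        nearest = sorted(donors, key=distance)[:k]
        new_row = list(row)
        new_row[col] = mean(r[col] for r in nearest)
        filled.append(new_row)
    return filled

def standardize(values):
    """Scale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

# Columns: [age, fare]; the third passenger's age is unknown
rows = [[22.0, 7.25], [38.0, 71.28], [None, 8.05], [35.0, 8.46]]
filled = knn_impute_column(rows, col=0, k=2)
print(filled[2])  # [28.5, 8.05]

scaled_ages = standardize([r[0] for r in filled])
```

The two rows with the closest fares (ages 35.0 and 22.0) supply the imputed age of 28.5. In the real pipeline the scaling happens after imputation, as it does here.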

And continuing with the pipeline for the categorical attributes:

In [173]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

cat_pipeline = Pipeline([
        ("ordinal_encoder", OrdinalEncoder()),    
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("cat_encoder", OneHotEncoder(sparse_output=False)),
    ])
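
As a rough illustration of what the imputing and one-hot encoding steps do to a categorical column, the sketch below fills gaps with the most frequent value and then expands each value into 0/1 indicator columns. It is a pure-Python approximation for intuition only; it skips the ordinal-encoding step and is not how scikit-learn implements these transformers. `impute_most_frequent` and `one_hot` are hypothetical names.

```python
from collections import Counter

def impute_most_frequent(values):
    """Replace None with the most common non-missing value."""
    most_common = Counter(v for v in values if v is not None).most_common(1)[0][0]
    return [most_common if v is None else v for v in values]

def one_hot(values):
    """Return the sorted categories and one 0/1 row per value."""
    categories = sorted(set(values))
    return categories, [[1 if v == c else 0 for c in categories] for v in values]

# A toy Embarked column with one missing entry
embarked = ["S", "C", None, "S", "Q"]
categories, encoded = one_hot(impute_most_frequent(embarked))
print(categories)  # ['C', 'Q', 'S']
print(encoded[2])  # [0, 0, 1]
```

The missing entry is filled with `"S"`, the most frequent value, and therefore ends up with a 1 in the `S` column.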

Finally, let's join the numerical and categorical pipelines and run them against the training and testing data.

In [174]:
from sklearn.compose import ColumnTransformer

num_attribs = ["Age", "NumRelatives", "Fare"]
cat_attribs = ["Pclass", "Sex", "Embarked"]

preprocess_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", cat_pipeline, cat_attribs),
    ])

X_train = preprocess_pipeline.fit_transform(train_data)
X_test = preprocess_pipeline.transform(test_data)


Get the labels:

In [175]:
y_train = train_data["Survived"]

#### Export the Processed Datasets

In [176]:
import numpy as np

# Save the processed datasets to disk
np.save(Path("data") / "processed_X_train.npy", X_train)
np.save(Path("data") / "processed_X_test.npy", X_test)
y_train.to_csv(Path("data") / "processed_y_train.csv")


#### Update the Kitfile

We need to update the Kitfile's `datasets` section to include the new processed data files.

In [177]:
# add the processed train, test and label data sets to the Kitfile
training_data_info = {
    "name": "processed_training",
    "path": "data/processed_X_train.npy",
    "description": "Processed data to be used for model training.",
    "license": "Apache-2.0"
}

testing_data_info = {
    "name": "processed_testing",
    "path": "data/processed_X_test.npy",
    "description": "Processed data to be used for model testing.",
    "license": "Apache-2.0"
}

label_data_info = {
    "name": "processed_labels",
    "path": "data/processed_y_train.csv",
    "description": "Processed labels to be used for model training.",
    "license": "Apache-2.0"
}

datasets_info = [
    training_data_info,
    testing_data_info,
    label_data_info
]

update_datasets_section(kitfile, datasets_info, replace = False, print = False)
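
Because `replace = False` extends the existing list, re-running this cell would append the same three entries a second time. One possible guard, sketched below, is to merge entries keyed on their `path` so that a repeated entry replaces the earlier one instead of duplicating it. `merge_datasets` is a hypothetical helper introduced for this example.

```python
def merge_datasets(existing: list, new: list) -> list:
    """Combine dataset entries; a new entry replaces an existing one with the same path."""
    merged = {entry["path"]: entry for entry in existing}
    merged.update({entry["path"]: entry for entry in new})
    return list(merged.values())

current = [{"name": "training", "path": "data/train.csv"}]
extra = [{"name": "processed_training", "path": "data/processed_X_train.npy"}]

once = merge_datasets(current, extra)
twice = merge_datasets(once, extra)  # simulates re-running the cell
print(len(once), len(twice))  # 2 2
```

In this Notebook the merged list could then be passed with `replace = True`, for example `update_datasets_section(kitfile, merge_datasets(kitfile["datasets"], datasets_info), replace = True)`.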

We also need to update the Kitfile's `package` section with this ModelKit version's tag name and the approximate timestamp of the update, and then display the updated Kitfile.

In [178]:
# construct the tag name for this ModelKit's version
tag_name = "processed-data-v5"

update_package_section(kitfile, tag_name, print=True)

Kitfile Contents...

manifestVersion: 1.0
package:
  name: Titanic-Survivability-Predictor
  version: 1.0.0
  description: 'A project attempting to predict passenger survivability of the Titanic
    Shipwreck.

    ModelKit tag: collated-data-v1 pushed at: 2024-10-21 21:05:15 UTC

    ModelKit tag: processed-data-v5 pushed at: 2024-10-21 21:07:28 UTC'
  authors:
  - Jozu
docs:
- path: docs/README.md
  description: Important notes about the project.
- path: docs/LICENSE
  description: The license for this ModelKit
code:
- path: requirements.txt
  description: Python packages required by this example.
  license: Apache-2.0
- path: titanic_survivability.ipynb
  description: Jupyter Notebook used to train, validate, optimize and export the model.
  license: Apache-2.0
- path: template/Kitfile.template
datasets:
- name: training
  path: data/train.csv
  description: Unprocessed data to be used for model training.
  license: Apache-2.0
- name: testing
  path: data/test.csv
  description: Unp

With the data processed, we can pack and push the ModelKit's latest state with the tag name `processed-data-v5`.

In [179]:
# load the login credentials to Jozu.ml taken from environment variables stored in the .env file
load_dotenv(override=True)

# pack and push the 'processed-data-v5' ModelKit version
pack_and_push_modelkit(user = os.getenv("JOZU_USERNAME"), 
                       passwd = os.getenv("JOZU_PASSWORD"), 
                       namespace = os.getenv("JOZU_NAMESPACE"),
                       registry = "jozu.ml",
                       tag = tag_name)

Log in successful
Saved configuration: sha256:d30417b7df6486d5ac700bb207196fc224b924f3fe30ab171310636f127d24b7
Already saved code layer: sha256:9de9f1238dd7a623c617cbc14e7f5eada59d28085909e9cf57279ce0ebd6a5ee
Already saved code layer: sha256:63ea171856317029b88dd3c0d291aea7a415ae1ae6bfd188549eea0412f9381e
Already saved code layer: sha256:220109bcd4dc8be72b9228aefb24f69239c3308297cb4a4806705c002e910e3c
Already saved dataset layer: sha256:8af74261fda984270cc30a58277f28c3435de603ba2722e160528a0fab6d1127
Already saved dataset layer: sha256:cbd1c99ae96851ca944aeda77e1e8338c00fd639b9cdcdbe42044ec99352cb79
Saved dataset layer: sha256:b977d0496b3ecc638e56502bbba28dcbf14534ba8d21eae27358c9f0207a83af
Saved dataset layer: sha256:2b89b253b6ea4eec1f505d6953fc8ad4275f9c10c8eaa3ea46ec42c29bc8c2b5
Already saved dataset layer: sha256:8b10202d2fc9010b880469a826a92bf9b0c10da52eba303d654cbb3ba5efdbeb
Already saved docs layer: sha256:7de63cef90f75c857e373367403ad0a23143068b62e5b725c02f845043df6c8c
Already 

## Model Training

Let's try training a `RandomForestClassifier` to predict the survivability of the passengers.

In [None]:
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=100, random_state=42)
rfc.fit(X_train, y_train)

With the model trained, let's use it to make predictions on the test data:

In [138]:
y_pred = rfc.predict(X_test)

## Model Validation

Let's use the mean accuracy of 10 cross-validation folds to get an idea of how good our model is.

In [None]:
from sklearn.model_selection import cross_val_score

rfc_scores = cross_val_score(rfc, X_train, y_train, cv=10)
rfc_scores.mean()

This model performs at about 81.8% accuracy. There are a number of things we could do to try to improve our prediction accuracy--such as *feature engineering*, *trying different types of models*, and *optimizing our models' parameters*--but, for the purpose of this exercise, we'll assume we're ready to move our model into production.

## Export the Model

Let's export our trained `RandomForestClassifier` model to a joblib-formatted file named **model.joblib**.

In [None]:
import joblib

artifact_filename = 'model.joblib'

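# Save model artifact to local filesystem (doesn't persist),
# creating the "model" directory first in case it doesn't exist yet
model_path = Path() / "model" / artifact_filename
model_path.parent.mkdir(parents=True, exist_ok=True)
joblib.dump(rfc, model_path)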


### Update the Kitfile

With the model exported, let's update our ModelKit's Kitfile with the configuration details for the `model` section.

In [None]:
# add the exported model to the 'model' section of the Kitfile
model_info = {
    "name": "titanic-survivability-predictor",
    "path": str(model_path),
    "description": "RandomForestClassifier",
    "framework": "joblib",
    "license": "Apache-2.0",
    "version": "1.0"
}

kitfile["model"] = model_info

# save the updated Kitfile contents to disk
export_kitfile(kitfile, print=False)

# reload the Kitfile from disk and display the contents
# to make sure it was persisted correctly
with open('Kitfile', 'r') as file:
    print_kitfile_contents(yaml.safe_load(file))

## Creating and Pushing the ModelKit to Jozu Hub

Finally, we can pack and push our final ModelKit to Jozu Hub.

In [None]:
from dotenv import load_dotenv
import os

# the login credentials to Jozu.ml taken from environment variables stored in the .env file
load_dotenv(override=True)

pack_and_push_modelkit(user = os.getenv("JOZU_USERNAME"), 
                       passwd = os.getenv("JOZU_PASSWORD"), 
                       namespace = os.getenv("JOZU_NAMESPACE"),
                       registry = "jozu.ml",
                       tag = "latest")

Log back into [Jozu Hub](https://jozu.ml) in your browser, click on *My Repositories* and you should see your ModelKit tagged as "latest".