## Chassisml Example Notebooks
Welcome to the examples section for [Chassis.ml](https://chassis.ml), which contains notebooks that leverage Chassisml to auto-containerize models built using the most common machine learning frameworks. 

**NOTE:** Chassisml provides two key functionalities: 
1. Create a Docker container from your model code and push that container image to a Docker registry. This is the default behavior.
2. Should you pass valid Modzy credentials as optional parameters, Chassisml will take the container and upload it directly to the Modzy environment you specify. You will notice most of these notebooks deploy the model to one of the Modzy internal development environments.   
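
The two modes differ only in which arguments reach `publish`. As a minimal sketch, the hypothetical helper below (`build_publish_kwargs` is not part of Chassisml) collects keyword arguments using the parameter names from the `publish` call later in this notebook, and adds the Modzy ones only when an API key is supplied. All values shown are placeholders.

```python
def build_publish_kwargs(model_name, model_version, registry_user, registry_pass,
                         modzy_api_key=None, modzy_url=None,
                         modzy_sample_input_path=None):
    """Hypothetical helper: collect keyword arguments for chassis_model.publish()."""
    kwargs = {
        "model_name": model_name,
        "model_version": model_version,
        "registry_user": registry_user,
        "registry_pass": registry_pass,
    }
    # Only include the Modzy settings when credentials are supplied
    if modzy_api_key:
        kwargs.update(
            modzy_api_key=modzy_api_key,
            modzy_url=modzy_url,
            modzy_sample_input_path=modzy_sample_input_path,
        )
    return kwargs


# Placeholder values, for illustration only
docker_only = build_publish_kwargs("my-model", "0.0.1", "user", "pass")
with_modzy = build_publish_kwargs(
    "my-model", "0.0.1", "user", "pass",
    modzy_api_key="placeholder-key",
    modzy_url="https://example.com/api",
    modzy_sample_input_path="./sample.csv",
)
print(sorted(docker_only))
print(sorted(with_modzy))
```

The resulting dictionary could then be unpacked into the call, for example `chassis_model.publish(**docker_only)`.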

Can't find the framework you are looking for? Fork this repository and open a PR. We're always interested in growing this example bank!

The example model built in this notebook comes from this [Kaggle competition](https://www.kaggle.com/c/shelter-animal-outcomes). The original code for the model is available [here](https://jovian.ai/aakanksha-ns/shelter-outcome).

In [1]:
import chassisml
import torch
import getpass
import pandas as pd
import numpy as np
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from torch.utils.data import Dataset, DataLoader
import torch.optim as torch_optim
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models
from io import StringIO
from datetime import datetime

## Enter credentials
Docker Hub credentials and Modzy API key

In [2]:
dockerhub_user = getpass.getpass('docker hub username')
dockerhub_pass = getpass.getpass('docker hub password')
modzy_api_key = getpass.getpass('modzy api key')

docker hub username········
docker hub password········
modzy api key········


## Train Model 

#### Data Preprocessing

In [3]:
# load data
train = pd.read_csv('./data/animal-shelter-outcomes/train.csv')
test = pd.read_csv('./data/animal-shelter-outcomes/test.csv')

# drop irrelevant columns and stack train & test sets
train_X = train.drop(columns= ['OutcomeType', 'OutcomeSubtype', 'AnimalID'])
Y = train['OutcomeType']
test_X = test
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the frames the same way
stacked_df = pd.concat([train_X, test_X.drop(columns=['ID'])])
stacked_df = stacked_df.drop(columns=['DateTime'])

# drop columns with many null values
for col in stacked_df.columns:
    if stacked_df[col].isnull().sum() > 10000:
        stacked_df = stacked_df.drop(columns = [col])
        
# label encoding
for col in stacked_df.columns:
    if stacked_df.dtypes[col] == "object":
        stacked_df[col] = stacked_df[col].fillna("NA")
    else:
        stacked_df[col] = stacked_df[col].fillna(0)
    stacked_df[col] = LabelEncoder().fit_transform(stacked_df[col])
    
# make all variables categorical
for col in stacked_df.columns:
    stacked_df[col] = stacked_df[col].astype('category')

# split train from test again
X = stacked_df[:len(train)]
test_processed = stacked_df[len(train):]

# check that the row counts match the originals
print("train shape: ", X.shape, "original: ", train.shape)
print("test shape: ", test_processed.shape, "original: ", test.shape)

train shape:  (26729, 5) original:  (26729, 10)
test shape:  (11456, 5) original:  (11456, 8)


In [4]:
# Create small sample data csv for testing later
sample_test = test[:5]
with open("./data/animal-shelter-outcomes/sample_data.csv", "w") as sample_data:
    # `lineterminator` replaced `line_terminator` in pandas 1.5
    sample_test.to_csv(sample_data, index=False, lineterminator="\n")

In [5]:
# Assign encoding target
Y = LabelEncoder().fit_transform(Y)

# split dataset into train/val
X_train, X_val, y_train, y_val = train_test_split(X, Y, test_size=0.10, random_state=0)

#categorical embedding for columns having more than two values
embedded_cols = {n: len(col.cat.categories) for n,col in X.items() if len(col.cat.categories) > 2}
embedded_col_names = embedded_cols.keys()
embedding_sizes = [(n_categories, min(50, (n_categories+1)//2)) for _,n_categories in embedded_cols.items()]
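
The sizing rule in the last line gives each embedded column a vector about half as wide as its category count, capped at 50. A quick sketch of the arithmetic, using the category counts that the model summary later in this notebook reports (6, 46, 1678, and 411). The function name `embedding_size` is only for illustration.

```python
def embedding_size(n_categories, cap=50):
    """Half the category count (rounded up), capped at `cap`."""
    return min(cap, (n_categories + 1) // 2)


# Category counts taken from the Embedding layers printed in the model summary
for n in (6, 46, 1678, 411):
    print(n, "->", embedding_size(n))
```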

#### Create PyTorch Dataset

In [6]:
class ShelterOutcomeDataset(Dataset):
    def __init__(self, X, Y, embedded_col_names):
        X = X.copy()
        self.X1 = X.loc[:,embedded_col_names].copy().values.astype(np.int64) #categorical columns
        self.X2 = X.drop(columns=embedded_col_names).copy().values.astype(np.float32) #numerical columns
        self.y = Y
        
    def __len__(self):
        return len(self.y)
    
    def __getitem__(self, idx):
        return self.X1[idx], self.X2[idx], self.y[idx]

In [7]:
# create torch train and validation sets
train_ds = ShelterOutcomeDataset(X_train, y_train, embedded_col_names)
valid_ds = ShelterOutcomeDataset(X_val, y_val, embedded_col_names)

#### Configure CPU/GPU Resources

In [8]:
def to_device(data, device):
    """Move tensor(s) to chosen device"""
    if isinstance(data, (list,tuple)):
        return [to_device(x, device) for x in data]
    return data.to(device, non_blocking=True)

class DeviceDataLoader():
    """Wrap a dataloader to move data to a device"""
    def __init__(self, dl, device):
        self.dl = dl
        self.device = device
        
    def __iter__(self):
        """Yield a batch of data after moving it to device"""
        for b in self.dl: 
            yield to_device(b, self.device)

    def __len__(self):
        """Number of batches"""
        return len(self.dl)

#### Define Model

In [22]:
class ShelterOutcomeModel(nn.Module):
    def __init__(self, embedding_sizes, n_cont):
        super().__init__()
        self.embeddings = nn.ModuleList([nn.Embedding(categories, size) for categories,size in embedding_sizes])
        n_emb = sum(e.embedding_dim for e in self.embeddings) #length of all embeddings combined
        self.n_emb, self.n_cont = n_emb, n_cont
        self.lin1 = nn.Linear(self.n_emb + self.n_cont, 200)
        self.lin2 = nn.Linear(200, 70)
        self.lin3 = nn.Linear(70, 5)
        self.bn1 = nn.BatchNorm1d(self.n_cont)
        self.bn2 = nn.BatchNorm1d(200)
        self.bn3 = nn.BatchNorm1d(70)
        self.emb_drop = nn.Dropout(0.6)
        self.drops = nn.Dropout(0.3)
        

    def forward(self, x_cat, x_cont):
        x = [e(x_cat[:,i]) for i,e in enumerate(self.embeddings)]
        x = torch.cat(x, 1)
        x = self.emb_drop(x)
        x2 = self.bn1(x_cont)
        x = torch.cat([x, x2], 1)
        x = F.relu(self.lin1(x))
        x = self.drops(x)
        x = self.bn2(x)
        x = F.relu(self.lin2(x))
        x = self.drops(x)
        x = self.bn3(x)
        x = self.lin3(x)
        return x

In [23]:
model = ShelterOutcomeModel(embedding_sizes, 1)
model.to(torch.device("cpu"))

ShelterOutcomeModel(
  (embeddings): ModuleList(
    (0): Embedding(6, 3)
    (1): Embedding(46, 23)
    (2): Embedding(1678, 50)
    (3): Embedding(411, 50)
  )
  (lin1): Linear(in_features=127, out_features=200, bias=True)
  (lin2): Linear(in_features=200, out_features=70, bias=True)
  (lin3): Linear(in_features=70, out_features=5, bias=True)
  (bn1): BatchNorm1d(1, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (bn2): BatchNorm1d(200, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (bn3): BatchNorm1d(70, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (emb_drop): Dropout(p=0.6, inplace=False)
  (drops): Dropout(p=0.3, inplace=False)
)
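
The `in_features=127` of `lin1` can be checked by hand: the four embedding widths are concatenated with the single non-embedded column (`n_cont = 1`). A small check, with the widths copied from the summary above:

```python
embedding_dims = [3, 23, 50, 50]  # widths from the printed ModuleList
n_cont = 1                        # second argument passed to ShelterOutcomeModel

n_emb = sum(embedding_dims)        # length of all embeddings combined
lin1_in_features = n_emb + n_cont  # input width of the first linear layer
print(n_emb, lin1_in_features)
```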

#### Define Optimizer, Training, and Evaluation Functions

In [49]:
def get_optimizer(model, lr = 0.001, wd = 0.0):
    parameters = filter(lambda p: p.requires_grad, model.parameters())
    optim = torch_optim.Adam(parameters, lr=lr, weight_decay=wd)
    return optim

def train_model(model, optim, train_dl):
    model.train()
    total = 0
    sum_loss = 0
    for x1, x2, y in train_dl:
        batch = y.shape[0]
        output = model(x1, x2)
        y = y.type(torch.LongTensor)     
        loss = F.cross_entropy(output, y)   
        optim.zero_grad()
        loss.backward()
        optim.step()
        total += batch
        sum_loss += batch*(loss.item())
    return sum_loss/total

def val_loss(model, valid_dl):
    model.eval()
    total = 0
    sum_loss = 0
    correct = 0
    for x1, x2, y in valid_dl:
        current_batch_size = y.shape[0]
        out = model(x1, x2)
        y = y.type(torch.LongTensor)
        loss = F.cross_entropy(out, y)
        sum_loss += current_batch_size*(loss.item())
        total += current_batch_size
        pred = torch.max(out, 1)[1]
        correct += (pred == y).float().sum().item()
    print("valid loss %.3f and accuracy %.3f" % (sum_loss/total, correct/total))
    return sum_loss/total, correct/total

def train_loop(model, epochs, lr=0.01, wd=0.0):
    optim = get_optimizer(model, lr = lr, wd = wd)
    for i in range(epochs): 
        loss = train_model(model, optim, train_dl)
        print("training loss: ", loss)
        val_loss(model, valid_dl)
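
Both `train_model` and `val_loss` weight each batch's mean loss by its batch size before dividing by the total sample count, so a smaller final batch does not skew the epoch average. A toy illustration in plain Python, with made-up numbers that are not taken from this run:

```python
# (batch_size, mean_loss) pairs; the last batch is smaller. Values are illustrative only.
batches = [(1000, 0.90), (1000, 0.80), (56, 2.00)]

total = sum(size for size, _ in batches)
# Size-weighted average, as computed in train_model and val_loss
weighted = sum(size * loss for size, loss in batches) / total
# Unweighted average of batch means, which over-counts the small batch
naive = sum(loss for _, loss in batches) / len(batches)

print(round(weighted, 4), round(naive, 4))
```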

#### Training Loop

In [50]:
batch_size = 1000
train_dl = DataLoader(train_ds, batch_size=batch_size,shuffle=True)
valid_dl = DataLoader(valid_ds, batch_size=batch_size,shuffle=True)

In [51]:
train_dl = DeviceDataLoader(train_dl, torch.device("cpu"))
valid_dl = DeviceDataLoader(valid_dl, torch.device("cpu"))

In [52]:
train_loop(model, epochs=8, lr=0.05, wd=0.00001)

training loss:  0.9865190627807234
valid loss 0.891 and accuracy 0.623
training loss:  0.9513287070109118
valid loss 0.869 and accuracy 0.626
training loss:  0.9488261960890351
valid loss 0.883 and accuracy 0.629
training loss:  0.9475331834712851
valid loss 0.872 and accuracy 0.633
training loss:  0.9390505436135163
valid loss 0.882 and accuracy 0.634
training loss:  0.954956340084133
valid loss 0.876 and accuracy 0.631
training loss:  0.951382885214733
valid loss 0.867 and accuracy 0.634
training loss:  0.9364405162991738
valid loss 0.887 and accuracy 0.624


#### Test Set

In [53]:
test_ds = ShelterOutcomeDataset(test_processed, np.zeros(len(test_processed)), embedded_col_names)
test_dl = DataLoader(test_ds, batch_size=batch_size)

In [54]:
preds = []
with torch.no_grad():
    for x1,x2,y in test_dl:
        out = model(x1, x2)
        prob = F.softmax(out, dim=1)
        preds.append(prob)
final_probs = [item for sublist in preds for item in sublist]
print(final_probs)

[tensor([0.1058, 0.0105, 0.0704, 0.0898, 0.7235]), tensor([0.6522, 0.0013, 0.0092, 0.1625, 0.1749]), tensor([0.4965, 0.0047, 0.0270, 0.0772, 0.3946]), tensor([0.0805, 0.0098, 0.0605, 0.0759, 0.7733]), tensor([0.6050, 0.0014, 0.0096, 0.2043, 0.1796]), tensor([0.5297, 0.0027, 0.0241, 0.2458, 0.1977]), tensor([0.4016, 0.0044, 0.1556, 0.2145, 0.2238]), tensor([0.7426, 0.0027, 0.0087, 0.0343, 0.2117]), tensor([0.8174, 0.0010, 0.0031, 0.0277, 0.1508]), tensor([0.5109, 0.0035, 0.0432, 0.2310, 0.2113]), tensor([0.7712, 0.0013, 0.0043, 0.0466, 0.1767]), tensor([0.0545, 0.0068, 0.1730, 0.3637, 0.4020]), tensor([0.6431, 0.0025, 0.0185, 0.1185, 0.2173]), tensor([0.6050, 0.0014, 0.0096, 0.2043, 0.1796]), tensor([0.7283, 0.0011, 0.0051, 0.0954, 0.1701]), tensor([0.6824, 0.0018, 0.0080, 0.0882, 0.2196]), tensor([0.3286, 0.0034, 0.0824, 0.4025, 0.1832]), tensor([0.4745, 0.0020, 0.0296, 0.3170, 0.1769]), tensor([0.4504, 0.0054, 0.0487, 0.1464, 0.3491]), tensor([0.0510, 0.0124, 0.1399, 0.0361, 0.7605]),
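
Each tensor holds five class probabilities. Assuming the class order matches the label columns listed in `output_names` later in this notebook (which is the order a `LabelEncoder` produces, since it sorts the outcome strings), the most likely outcome for a row is the label at the position of the largest probability. A plain-Python sketch using the first two rows printed above. The helper `top_label` is hypothetical.

```python
labels = ["Adoption", "Died", "Euthanasia", "Return_to_owner", "Transfer"]

# First two probability rows from the printed output
rows = [
    [0.1058, 0.0105, 0.0704, 0.0898, 0.7235],
    [0.6522, 0.0013, 0.0092, 0.1625, 0.1749],
]


def top_label(probs, labels):
    """Return the label with the highest probability, together with that probability."""
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs[best]


predicted = [top_label(r, labels) for r in rows]
print(predicted)
```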

## Prepare context dict
Initialize anything here that should persist across inference runs

In [55]:
model_tabular = model
model.eval()

embedded_columns = embedded_col_names
output_names = ["ID", "Adoption", "Died", "Euthanasia", "Return_to_owner", "Transfer"]

device = torch.device('cpu')

# This will be passed to Chassis:
context = {
    "model": model_tabular,    
    "embedded_columns": embedded_columns,
    "output_column_names": output_names,
    "device": device
}

## Write process function

* Must take bytes and context dict as input
* Preprocess bytes, run inference, postprocess model output, return results

In [56]:
def preprocess(csv, embedded_cols):
    
    # drop ID and date columns
    csv = csv.drop(columns=["ID", "DateTime", "Name"])
    
    # drop columns with many null values
    for col in csv.columns:
        if csv[col].isnull().sum() > 10000:
            csv = csv.drop(columns = [col])

    # label encoding
    for col in csv.columns:
        if csv.dtypes[col] == "object":
            csv[col] = csv[col].fillna("NA")
        else:
            csv[col] = csv[col].fillna(0)
        csv[col] = LabelEncoder().fit_transform(csv[col])

    # make all variables categorical
    for col in csv.columns:
        csv[col] = csv[col].astype('category')
        
    # convert csv to dataloader
    shelter_ds = ShelterOutcomeDataset(csv, np.zeros(len(csv)), embedded_cols)
    test_dl = DataLoader(shelter_ds, batch_size=1)
        
    return test_dl
    
def postprocess(predictions_df, output_csv):
    
    # fill in output
    output_csv['Adoption'] = [float(t[0]) for t in predictions_df]
    output_csv['Died'] = [float(t[1]) for t in predictions_df]
    output_csv['Euthanasia'] = [float(t[2]) for t in predictions_df]
    output_csv['Return_to_owner'] = [float(t[3]) for t in predictions_df]
    output_csv['Transfer'] = [float(t[4]) for t in predictions_df]
    
    return output_csv
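
One thing to keep in mind about `preprocess`: it fits a new `LabelEncoder` on whatever CSV arrives at inference time. `LabelEncoder` assigns integer codes by the sorted order of the unique values it sees, so the same category can receive a different code than it did during training. The plain-Python sketch below mimics that behavior with a hypothetical `fit_codes` helper to show the mismatch, and how reusing a mapping fitted once (for example, one stored in the context dict) would avoid it. This is an illustration under those assumptions, not the approach this notebook takes.

```python
def fit_codes(values):
    """Map each unique value to its index in sorted order (as LabelEncoder does)."""
    return {v: i for i, v in enumerate(sorted(set(values)))}


train_values = ["Cat", "Dog", "Cat", "Dog"]
inference_values = ["Dog", "Dog"]  # a small batch containing only one category

train_codes = fit_codes(train_values)
refit_codes = fit_codes(inference_values)

print(train_codes["Dog"])  # code assigned during training
print(refit_codes["Dog"])  # code assigned after refitting on the inference batch

# Reusing the training-time mapping keeps the codes consistent
encoded = [train_codes[v] for v in inference_values]
print(encoded)
```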

In [69]:
def process(input_bytes,context):
    
    # preprocess
    df = pd.read_csv(StringIO(str(input_bytes, "utf-8")))
    df = preprocess(df, context["embedded_columns"])
    
    # run inference with the model stored in the context dict
    model = context["model"]
    preds = []
    with torch.no_grad():
        for x1, x2, _ in df:
            out = model(x1, x2)
            prob = F.softmax(out, dim=1)
            preds.append(prob)
    final_probs = [item for sublist in preds for item in sublist]

    # postprocess
    output_skeleton = pd.DataFrame(0, index=np.arange(len(final_probs)), columns=context["output_column_names"])
    output_skeleton["ID"] = [i+1 for i in range(len(final_probs))]
    final_output = postprocess(final_probs, output_skeleton)
    
    inference_result = final_output.to_json()

    structured_output = {
        "data": {
            "result": inference_result,
            "explanation": None,
            "drift": None,
        }
    }
    
    return structured_output

## Initialize Chassis Client
We'll use this to interact with the Chassis service

In [70]:
chassis_client = chassisml.ChassisClient("http://localhost:5000")

## Create and test Chassis model
* Requires `context` dict containing all variables which should be loaded once and persist across inferences
* Requires `process_fn` defined above

In [71]:
# create Chassis model
chassis_model = chassis_client.create_model(context=context,process_fn=process)

# test Chassis model (can pass filepath, bufferedreader, bytes, or text here):
sample_filepath = "./data/animal-shelter-outcomes/sample_data.csv"
results = chassis_model.test(sample_filepath)
print(results)

b'{"data":{"result":"{\\"ID\\":{\\"0\\":1,\\"1\\":2,\\"2\\":3,\\"3\\":4,\\"4\\":5},\\"Adoption\\":{\\"0\\":0.0205119159,\\"1\\":0.728466332,\\"2\\":0.0652435422,\\"3\\":0.0106766773,\\"4\\":0.4149729609},\\"Died\\":{\\"0\\":0.0118920719,\\"1\\":0.0022707784,\\"2\\":0.0102348896,\\"3\\":0.0128209479,\\"4\\":0.0086779036},\\"Euthanasia\\":{\\"0\\":0.0620467477,\\"1\\":0.0077749281,\\"2\\":0.0379733928,\\"3\\":0.0512745008,\\"4\\":0.0421656035},\\"Return_to_owner\\":{\\"0\\":0.0594031587,\\"1\\":0.0341950096,\\"2\\":0.012345545,\\"3\\":0.0251460969,\\"4\\":0.0532071441},\\"Transfer\\":{\\"0\\":0.8461461067,\\"1\\":0.2272929996,\\"2\\":0.8742026091,\\"3\\":0.9000817537,\\"4\\":0.4809763134}}","explanation":null,"drift":null}}'
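
Note that the returned bytes contain JSON whose `result` field is itself a JSON-encoded string, so it takes two decoding passes to reach the per-column values. A minimal sketch with the standard `json` module, using a shortened copy of the output above (only the `ID` and `Adoption` columns, first two rows):

```python
import json

# Shortened copy of the test output: ID and Adoption columns, first two rows
raw = b'{"data":{"result":"{\\"ID\\":{\\"0\\":1,\\"1\\":2},\\"Adoption\\":{\\"0\\":0.0205119159,\\"1\\":0.728466332}}","explanation":null,"drift":null}}'

outer = json.loads(raw)                      # first pass: outer envelope
inner = json.loads(outer["data"]["result"])  # second pass: nested JSON string

print(inner["Adoption"]["1"])
```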


In [74]:
# test environment and model within Chassis service, must pass filepath here:
test_env_result = chassis_model.test_env(sample_filepath)
print(test_env_result)

Starting test job... Ok!
{'model_output': 'b\'{"data":{"result":"{\\\\"ID\\\\":{\\\\"0\\\\":1,\\\\"1\\\\":2,\\\\"2\\\\":3,\\\\"3\\\\":4,\\\\"4\\\\":5},\\\\"Adoption\\\\":{\\\\"0\\\\":0.0205119159,\\\\"1\\\\":0.728466332,\\\\"2\\\\":0.0652435422,\\\\"3\\\\":0.0106766783,\\\\"4\\\\":0.4149729609},\\\\"Died\\\\":{\\\\"0\\\\":0.0118920719,\\\\"1\\\\":0.0022707784,\\\\"2\\\\":0.0102348896,\\\\"3\\\\":0.0128209479,\\\\"4\\\\":0.0086779092},\\\\"Euthanasia\\\\":{\\\\"0\\\\":0.0620467477,\\\\"1\\\\":0.0077749281,\\\\"2\\\\":0.0379733928,\\\\"3\\\\":0.0512745008,\\\\"4\\\\":0.0421656109},\\\\"Return_to_owner\\\\":{\\\\"0\\\\":0.0594031587,\\\\"1\\\\":0.0341950096,\\\\"2\\\\":0.0123455506,\\\\"3\\\\":0.0251460969,\\\\"4\\\\":0.0532071516},\\\\"Transfer\\\\":{\\\\"0\\\\":0.8461461067,\\\\"1\\\\":0.2272929847,\\\\"2\\\\":0.8742026091,\\\\"3\\\\":0.9000817537,\\\\"4\\\\":0.480976373}}","explanation":null,"drift":null}}\'\n'}


## Publish model to Modzy
Provide the model name, model version, Docker Hub credentials, and the required Modzy information (sample input path, API key, and URL).

In [75]:
MODZY_URL = "https://integration.modzy.engineering/api"

response = chassis_model.publish(
    model_name="PyTorch Tabular Data Animal Shelter Outcome Predictions",
    model_version="0.0.2",
    registry_user=dockerhub_user,
    registry_pass=dockerhub_pass,
    modzy_sample_input_path=sample_filepath,
    modzy_api_key=modzy_api_key,
    modzy_url=MODZY_URL
)

job_id = response.get('job_id')
final_status = chassis_client.block_until_complete(job_id)

Starting build job... Ok!


In [76]:
# fetch the job status once and reuse it
job_status = chassis_client.get_job_status(job_id)
if job_status["result"] is not None:
    print("New model URL: {}".format(job_status["result"]["container_url"]))
else:
    print("Chassis job failed \n\n {}".format(job_status))

New model URL: https://integration.modzy.engineering/models/pcmfspdnml/0.0.2


## Run sample job
Submit an inference job to our newly deployed model running on Modzy

In [77]:
from modzy import ApiClient

client = ApiClient(base_url=MODZY_URL, api_key=modzy_api_key)

input_name = final_status['result']['inputs'][0]['name']
model_id = final_status['result'].get("model").get("modelId")
model_version = final_status['result'].get("version")

inference_job = client.jobs.submit_file(model_id, model_version, {input_name: sample_filepath})
inference_job_result = client.results.block_until_complete(inference_job, timeout=None)
inference_job_results_json = inference_job_result.get_first_outputs()['results.json']
print(inference_job_results_json)

ApiObject({
  "data": {
    "drift": null,
    "explanation": null,
    "result": "{\"ID\":{\"0\":1,\"1\":2,\"2\":3,\"3\":4,\"4\":5},\"Adoption\":{\"0\":0.0205119159,\"1\":0.728466332,\"2\":0.0652435347,\"3\":0.0106766829,\"4\":0.4149729908},\"Died\":{\"0\":0.0118920719,\"1\":0.0022707784,\"2\":0.0102348896,\"3\":0.0128209479,\"4\":0.0086779045},\"Euthanasia\":{\"0\":0.0620467477,\"1\":0.0077749281,\"2\":0.0379733928,\"3\":0.051274512,\"4\":0.0421656109},\"Return_to_owner\":{\"0\":0.0594031587,\"1\":0.0341950096,\"2\":0.012345545,\"3\":0.0251460969,\"4\":0.0532071367},\"Transfer\":{\"0\":0.8461461067,\"1\":0.2272929847,\"2\":0.8742026091,\"3\":0.9000817537,\"4\":0.480976373}}"
  }
})


In [78]:
results_df = pd.read_json(inference_job_results_json["data"]["result"])
results_df

| | ID | Adoption | Died | Euthanasia | Return_to_owner | Transfer |
|---|---|---|---|---|---|---|
| 0 | 1 | 0.020512 | 0.011892 | 0.062047 | 0.059403 | 0.846146 |
| 1 | 2 | 0.728466 | 0.002271 | 0.007775 | 0.034195 | 0.227293 |
| 2 | 3 | 0.065244 | 0.010235 | 0.037973 | 0.012346 | 0.874203 |
| 3 | 4 | 0.010677 | 0.012821 | 0.051275 | 0.025146 | 0.900082 |
| 4 | 5 | 0.414973 | 0.008678 | 0.042166 | 0.053207 | 0.480976 |
