# MLflow - Deploying Machine Learning in Production

In this assignment you will write a script that trains models and uses `mlflow` to submit runs.

In [43]:
%%writefile ./new_data.json
{"age": {"0": 40, "1": 47},
 "balance": {"0": 580, "1": 3644},
 "campaign": {"0": 1, "1": 2},
 "contact": {"0": "unknown", "1": "unknown"},
 "day": {"0": 16, "1": 9},
 "default": {"0": "no", "1": "no"},
 "duration": {"0": 192, "1": 83},
 "education": {"0": "secondary", "1": "secondary"},
 "housing": {"0": "yes", "1": "no"},
 "job": {"0": "blue-collar", "1": "services"},
 "loan": {"0": "no", "1": "no"},
 "marital": {"0": "married", "1": "single"},
 "month": {"0": "may", "1": "jun"},
 "pdays": {"0": -1, "1": -1},
 "poutcome": {"0": "unknown", "1": "unknown"},
 "previous": {"0": 0, "1": 0}}

Overwriting ./new_data.json


In [44]:
# Imagine the data above came in from a NoSQL database or an API. Ideally, before scoring, we would add every one-hot-encoded column produced when the model was trained (even if empty) to this data, provided those column names were saved in memory or in an API/models directory.
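The idea in the comment above can be sketched with plain pandas. This is only an illustration under assumptions: `train_columns` is a hypothetical, shortened list standing in for the column names saved at training time, and the single record reuses a few values from `new_data.json`.

```python
import pandas as pd

# Hypothetical column list saved at training time (illustrative subset only).
train_columns = ["age", "balance", "housing_no", "housing_yes",
                 "marital_married", "marital_single", "marital_divorced"]

# A single incoming record, e.g. from an API payload.
incoming = pd.DataFrame({"age": [40], "balance": [580],
                         "housing": ["yes"], "marital": ["married"]})

# One-hot encode only the categories present in the payload...
encoded = pd.get_dummies(incoming, columns=["housing", "marital"], dtype=float)

# ...then add the unseen-category columns as zeros and fix the column order.
aligned = encoded.reindex(columns=train_columns, fill_value=0.0)
print(aligned)
```

Encoding first and reindexing second keeps the categories that are present (here `housing_yes` and `marital_married`) set to 1, while categories that never appear in the payload become zero-filled columns.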

In [45]:
#Load all necessary libraries
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
import pandas as pd
import numpy as np
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
import joblib
import json

# Load Dataset
bank = pd.read_csv('bank-full.csv', delimiter = ';')

# Split data between train and validation
X_train, X_test, y_train, y_test = train_test_split(bank.drop(columns = "y"), bank["y"],
                                                    test_size = 0.10, random_state = 42)

X_train = X_train.reset_index(drop = True)
X_test = X_test.reset_index(drop = True)


Question 1: Create a pre-processing function to be later used as part of the pipeline (custom transformer)

In [46]:
def transformations(df):
    # NOTE: the encoder and scaler are re-fit on every call, so the output
    # columns and the scaling depend on whatever data is passed in.
    df_copy = df.copy()  # Copy so as to not modify the original data

    # One-hot encode the categorical (object) columns
    cat_cols = df_copy.select_dtypes(['object']).columns
    onehoter = OneHotEncoder(sparse_output=False)
    encoded = onehoter.fit_transform(df_copy[cat_cols])
    onehot_cols = onehoter.get_feature_names_out(cat_cols)
    df_featurized = pd.DataFrame(encoded, columns=onehot_cols, index=df_copy.index)

    # Z-normalize the numeric columns (fit_transform already fits the scaler)
    num_cols = df_copy.select_dtypes(['integer', 'float']).columns
    znormalizer = StandardScaler()
    df_featurized[num_cols] = znormalizer.fit_transform(df_copy[num_cols])

    return df_featurized
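Because `transformations` calls `fit_transform` internally, a small batch of new data produces fewer one-hot columns than the training data did. One possible alternative, sketched here and not part of the original assignment solution, is to let a `ColumnTransformer` learn the categories and scaling once at fit time. The toy frames and the explicit column lists below are made-up stand-ins for the bank data.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the bank data (values are made up for illustration).
train = pd.DataFrame({
    "age": [30, 45, 52, 38],
    "balance": [100, 2500, -50, 900],
    "job": ["services", "blue-collar", "retired", "services"],
    "housing": ["yes", "no", "no", "yes"],
})
new = pd.DataFrame({"age": [40], "balance": [580],
                    "job": ["blue-collar"], "housing": ["yes"]})

cat_cols = ["job", "housing"]
num_cols = ["age", "balance"]

# sparse_threshold=0 asks ColumnTransformer to always return a dense array.
preprocessor = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
     ("num", StandardScaler(), num_cols)],
    sparse_threshold=0,
)

# Categories and scaling statistics are learned from the training data only...
train_features = preprocessor.fit_transform(train)
# ...and reused for new data, so the column count stays the same.
new_features = preprocessor.transform(new)
print(train_features.shape, new_features.shape)
```

With this design the scoring step would not need to rebuild or reindex columns by hand, since the fitted preprocessor carries the training-time categories with it.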

In [47]:
# Testing

'''
df_ft = transformations(X_train)
with open('./new_data.json', 'r') as f:
    data = json.load(f)
new_predictions = pd.DataFrame(data)
print(df_ft.columns)
print(transformations(new_predictions).columns)
new_predictions = new_predictions.reindex(columns=df_ft.columns).fillna(0)
print(new_predictions)
print(new_predictions.columns)
print(new_predictions.shape)
print(new_predictions.describe())
'''

Question 2: Creating a custom transformer from the previously defined function

In [48]:
pre_processing = FunctionTransformer(transformations)
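As a minimal illustration of what `FunctionTransformer` does, the sketch below wraps `np.log1p` (chosen only for the example) instead of the `transformations` function. The wrapper learns nothing during `fit`; it simply calls the wrapped function again on every `transform`.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Wrap a plain function so it exposes the fit/transform interface.
log_tf = FunctionTransformer(np.log1p)

X = np.array([[0.0, 1.0], [3.0, 7.0]])
out = log_tf.fit_transform(X)

# No state is kept from fit: new data is just passed through the same function.
X_new = np.array([[9.0, 99.0]])
out_new = log_tf.transform(X_new)
print(out)
print(out_new)
```

This statelessness is worth keeping in mind when the wrapped function itself fits an encoder or scaler, because that fitting is then repeated on whatever data reaches `transform`.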

Question 3: Create the pipeline and define its two steps: (i) pre-processing and (ii) model (logistic regression)

In [49]:
pipeline = Pipeline([
    ('pre_process', pre_processing),
    ('model', LogisticRegression())
], verbose=True)

Question 4: Call `fit` and `predict` on the pipeline to make sure that it all works. Remember to pass them the **un-processed** (original) data, since the data processing should be built into the pipeline now.

In [50]:
#Set parameters for Logistic Regression estimator ('model') inside the pipeline
pipeline.set_params(model__C=1.0,               # C: default=1.0
                    model__solver='liblinear',  # solver: {'newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'}, default='lbfgs'
                    model__max_iter=10000,      # max_iter: default=100
                    model__fit_intercept=True,  # fit_intercept: {True, False}, default=True
                    model__penalty='l2')        # penalty: {'l1', 'l2', 'elasticnet', 'none'}, default='l2'
                                                # Warning: the choice of solver depends on the penalty chosen;
                                                #          not all solvers support every type of penalty.

#Fit Training Data to Model
pipeline.fit(X_train, y_train)

#Prediction on Training and Test Data
y_train_pred = pipeline.predict(X_train)
y_test_pred = pipeline.predict(X_test)

[Pipeline] ....... (step 1 of 2) Processing pre_process, total=   0.1s
[Pipeline] ............. (step 2 of 2) Processing model, total=   0.3s
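The `model__C` style used in `set_params` above follows scikit-learn's `<step name>__<parameter>` convention. A small sketch on made-up toy data (the step names and values here are illustrative, not the assignment's) shows the parameters landing on the named step:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

toy_pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("model", LogisticRegression()),
])

# "<step name>__<parameter>" routes each value to the matching step.
toy_pipeline.set_params(model__C=0.5, model__max_iter=500)

# Tiny, clearly separable toy data (made up for illustration).
X = [[0.0], [1.0], [2.0], [3.0]]
y = ["no", "no", "yes", "yes"]

toy_pipeline.fit(X, y)
preds = toy_pipeline.predict([[0.0], [3.0]])
print(toy_pipeline.named_steps["model"].C, list(preds))
```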


Question 5: Evaluate your model by calculating the precision and recall.

In [51]:
print(y_train_pred.shape)
print(y_train.shape)
print(y_test_pred.shape)
print(y_test.shape)

(40689,)
(40689,)
(4522,)
(4522,)


In [52]:
#Create a function to evaluate the model performance using precision and recall
def eval_metrics(actual, pred):
    precision = precision_score(actual, pred, pos_label="no")
    recall = recall_score(actual, pred, pos_label="no")

    return precision, recall

#Calculation of evaluation metrics - Precision and Recall for training and validation data
(precision_train, recall_train) = eval_metrics(y_train, y_train_pred)
(precision_test, recall_test) = eval_metrics(y_test, y_test_pred)

# Print Model (Logistic Regression) parameters
params = pipeline['model'].get_params()
print()
print('Main Parameters used in logistic regression are: C={}, solver={}, max_iter={}, fit_intercept={} and penalty={}'.format(
    params['C'], params['solver'], params['max_iter'], params['fit_intercept'], params['penalty']))
# Print Evaluation Metrics for the Model (Logistic Regression)
print()
print('Precision = {:.2f} and recall = {:.2f} on the training data.'.format(precision_train, recall_train))
print('Precision = {:.2f} and recall = {:.2f} on the validation data.'.format(precision_test, recall_test))



Main Parameters used in logistic regression are: C=1.0, solver=liblinear, max_iter=10000, fit_intercept=True and penalty=l2

Precision = 0.92 and recall = 0.98 on the training data.
Precision = 0.91 and recall = 0.97 on the validation data.
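Note that `eval_metrics` passes `pos_label="no"`, so the reported precision and recall describe the "no" class. The hand-rolled sketch below, using made-up labels and a hypothetical helper `precision_recall`, shows how much the numbers can change with the choice of positive label. Returning 0.0 when a denominator is zero is an assumption of this sketch.

```python
def precision_recall(actual, pred, pos_label):
    """Compute precision and recall for one class by counting outcomes."""
    pairs = list(zip(actual, pred))
    tp = sum(a == pos_label and p == pos_label for a, p in pairs)  # true positives
    fp = sum(a != pos_label and p == pos_label for a, p in pairs)  # false positives
    fn = sum(a == pos_label and p != pos_label for a, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up labels: four "no" customers and two "yes" customers.
actual = ["no", "no", "no", "no", "yes", "yes"]
pred = ["no", "no", "no", "no", "no", "yes"]

print(precision_recall(actual, pred, "no"))   # → (0.8, 1.0)
print(precision_recall(actual, pred, "yes"))  # → (1.0, 0.5)
```

In this toy case the majority "no" class has perfect recall, while half of the "yes" cases are missed, which the "no"-class metrics alone would not reveal.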


Question 6: Save your pipeline object using `joblib` as shown [here](https://sklearn.org/modules/model_persistence.html).

In [53]:
#store 'pipeline' as pickle file using joblib
joblib.dump(pipeline, 'bb_pipeline.pkl')

['bb_pipeline.pkl']
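A quick way to check a saved pipeline is to load it back and compare predictions. The sketch below does this with a toy pipeline and a temporary file (the file name and data are made up). One caveat worth knowing: pickle stores plain Python functions by reference, so a pipeline containing `FunctionTransformer(transformations)` can only be loaded where `transformations` is importable under the same name.

```python
import os
import tempfile

import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data and pipeline (made up for illustration).
X = [[0.0], [1.0], [2.0], [3.0]]
y = ["no", "no", "yes", "yes"]
pipe = Pipeline([("scale", StandardScaler()),
                 ("model", LogisticRegression())]).fit(X, y)

# Save to a temporary file, then load it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_pipeline.pkl")
    joblib.dump(pipe, path)
    restored = joblib.load(path)

# The restored pipeline should reproduce the original predictions.
same = list(restored.predict(X)) == list(pipe.predict(X))
print(same)
```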

Question 7: Now write a **new script** for scoring: it loads the pipeline you saved in the last step, reads the data `../data/new_data.json` and converts it to a `pandas.DataFrame` object, and obtains predictions on it. The predictions should be stored as a `json` file `../data/new_preds.json`.

In [54]:
#Call and load stored 'pipeline'
pipeline = joblib.load('bb_pipeline.pkl')

#Read json file with new data and write into a pandas dataframe
with open('./new_data.json', 'r') as f:
    data = json.load(f)
new_predictions = pd.DataFrame(data)

# The columns of this new dataframe have to match what the model was trained on, so we add the 'missing' columns that one-hot encoding created on the training data
# This means adding whatever columns we got from the first transformation of the "FULL" dataset
# This also means we need to replace the NaNs, as LogReg won't accept them
# Once that's done, the pipeline should score the new data without failing during 'predict()'
# Caveat: reindexing the raw frame drops the original categorical columns (e.g. 'job'), so every one-hot column ends up filled with 0

#print(new_predictions.columns)
df_ft = transformations(X_train) # This gets the full dataset index with the onehotencoded column names
# print(df_ft.columns) # Check to see full list
new_predictions = new_predictions.reindex(columns=df_ft.columns)
new_predictions = new_predictions.fillna(0)
# print(new_predictions.columns) # Check to make sure it matches full list

#Use predict method of pipeline to score (make prediction) on new data
new_predictions['prediction'] = pipeline.predict(new_predictions)

#Write predictions of new data into a json file
new_predictions.to_json('./new_preds.json', orient='columns')
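The JSON layout used by `new_data.json` (column name, then row label, then value) corresponds to pandas' `orient='columns'`. The sketch below round-trips a shortened payload and writes out only a prediction column. The prediction values are placeholders for illustration, not model output.

```python
import json

import pandas as pd

# Shortened payload in the same column -> row-label -> value layout.
payload = '{"age": {"0": 40, "1": 47}, "job": {"0": "blue-collar", "1": "services"}}'
new_data = pd.DataFrame(json.loads(payload))

# Placeholder predictions (hypothetical values, one per row).
placeholder_preds = ["yes", "no"]

# Reuse the incoming row labels so predictions can be matched back to the records.
preds = pd.DataFrame({"prediction": placeholder_preds}, index=new_data.index)
preds_json = preds.to_json(orient="columns")
print(preds_json)
```

Keeping the row labels from the input makes it straightforward for a downstream consumer to join each prediction back to the record it belongs to.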

In [55]:
# Read json file containing predictions made for the new data and load them into a dataframe
with open('./new_preds.json', 'r') as f:
    data = json.load(f)

new_pred_dataframe= pd.DataFrame(data)

#Print predictions for each observation contained in the new_data.json file and the dataframe with the data and prediction
print(new_pred_dataframe['prediction'])
new_pred_dataframe

0    yes
1     no
Name: prediction, dtype: object


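| | job_admin. | job_blue-collar | job_entrepreneur | job_housemaid | job_management | job_retired | job_self-employed | job_services | job_student | job_technician | ... | poutcome_success | poutcome_unknown | age | balance | day | duration | campaign | pdays | previous | prediction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 40 | 580 | 16 | 192 | 1 | -1 | 0 | yes |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 47 | 3644 | 9 | 83 | 2 | -1 | 0 | no |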


Question 8: Create a new text cell in your Notebook: Complete a 50-100 word summary (or short description of your thinking in applying this week's learning to the solution) of your experience in this assignment. Include: What was your incoming experience with this model, if any? What steps did you take, and what obstacles did you encounter? How do you link this exercise to real-world machine learning problem-solving? (What steps were missing? What else do you need to learn?) This summary allows your instructor to know how you are doing and allot points for your effort in thinking and planning, and making connections to real-world work.

# Q8.

1. I had no prior experience with building pipelines or custom transformers. I saw in my 'research' that you can build some complex transformers if necessary, with a lot of internal Python class logic. I have plenty of software and systems engineering experience though (even if it may not show in these labs), so imagining this as building a model and then having an API or queue serve a request, with the model generating a response, made the most sense to me. 😅
2. Obstacles encountered: When we encode the full dataset we get extra columns. When new data comes in, since it is a limited dataset of two new data points, we don't get the full set of columns when we encode it, so 'predict' will fail when we load our pipeline. This can be corrected by adding the missing columns to the incoming data prior to feeding it to the pipeline.
3. This exercise is useful for understanding pipelines (preprocessing) and transformers, as those are the basics of most ML, as well as saving and loading a pre-trained model! This is great for when you, say, open a new session for a user and they want to interact with a fresh model. However, the model will degrade over time, so you'll probably capture inputs and outputs in your monitoring and use them to retrain it.
4. It'd be good to learn how to 'retrain' models next, and then re-serve the same model. It might be somewhat trivial, as you add the incoming data to the full dataset, but there are probably other ways too.
5. Thanks for the class! 😀
