# Operationalization of machine learning models

In this notebook, we cover some of the important themes around model operationalization. This is an extensive topic, and we do not try to be comprehensive here. Instead we learn about some essentials and look at an example of a library that makes this kind of work very easy for us: the `mlflow` library. To introduce you to the library, we go over their [own example](https://www.mlflow.org/docs/latest/tutorials-and-examples/tutorial.html) for running an experiment. But first some vocabulary:

- A **script** is some Python code we want to run, stored in a `.py` or `.ipynb` file. Usually, the script has a set of required or optional inputs that we provide (just like a Python function). In `mlflow`, we refer to these inputs as **parameters**, but do NOT confuse this term with model parameters in ML.
- A **run** is what we get when we fix the inputs of a script to some values and execute the script. In the context of ML, the script could be a training script, its "parameters" could be the hyper-parameters of the model we wish to train, and a run is when we train a model with the hyper-parameters set to some fixed values.
- As part of a run we can log the **parameters** we used, the **metrics** we calculated such as training and test accuracy, and **artifacts** such as plots, tables, or trained models we save externally for reuse later. We can refer to these as run meta-data. In addition to the meta-data we log explicitly in the code, `mlflow` also logs some of its own meta-data such as run ID or run time.
- An **experiment** is a collection of related runs. So to continue with the above example, if we execute the script several times, each time using a different set of values for the hyper-parameters, then the experiment is the collection of all such runs. After executing all the runs, we can go to our experiment to compare them in terms of accuracy, run time, or whatever **metric** is of interest.

Note that the example we provide above is a "typical" example, and this is what we show in this notebook. But in general we can be flexible in what exactly we define as an experiment. The general idea is that from run to run, we change things and later we want to see what worked and what didn't by looking at metrics or artifacts generated by the model. A machine learning project can consist of one or several experiments. It all depends on the complexity of the project, and how granularly we define individual runs. This is to some extent a matter of preference and can even be driven by business needs.
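
To make the vocabulary concrete, here is a toy model of runs and experiments in plain Python. This is only a sketch for intuition: the `Run` and `Experiment` classes, the run IDs, and the metric values are made up for illustration and are not how `mlflow` represents or stores things internally.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """One execution of a script with fixed inputs (toy model, not mlflow's class)."""
    run_id: str
    params: dict   # inputs to the script, e.g. hyper-parameters
    metrics: dict  # numbers computed during the run, e.g. test RMSE


@dataclass
class Experiment:
    """A named collection of related runs."""
    name: str
    runs: list = field(default_factory=list)

    def best_run(self, metric, lower_is_better=True):
        # compare all runs on a single metric of interest
        pick = min if lower_is_better else max
        return pick(self.runs, key=lambda run: run.metrics[metric])


# two illustrative runs of the same script with different hyper-parameters
experiment = Experiment("predict_wine_quality")
experiment.runs.append(Run("run-a", {"alpha": 0.50, "l1_ratio": 0.5}, {"rmse": 0.79}))
experiment.runs.append(Run("run-b", {"alpha": 0.25, "l1_ratio": 0.5}, {"rmse": 0.75}))

best = experiment.best_run("rmse")
print(best.run_id, best.params)
```

In practice we do not write such classes ourselves: `mlflow` records the same kind of information for us and the UI does the comparison.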

Finally of course we can do a lot of this manually. After all we know how to run scripts with different inputs, or how to save plots or models on disk. Using a **version control** tool like Git, we can also track changes to the code. So why do we need `mlflow`? The answer is simple: It takes away most of the hassle that comes with doing such things manually, and on top of that it provides us with a UI where we go to find all our runs and quickly compare them. There are other concepts in `mlflow` that we do not cover here, but we invite you to check out [their website](https://mlflow.org/).

To begin with, we create a folder to save not only the code, but also the meta-data generated by our runs. Once we begin to log runs, the project folder will be populated by such meta-data. You are advised against deleting the meta-data directly (the better way is to use the UI).

In [1]:
!pip install mlflow

Collecting mlflow
  Using cached mlflow-1.26.1-py3-none-any.whl (17.8 MB)
Collecting Flask
  Using cached Flask-2.1.2-py3-none-any.whl (95 kB)
Collecting querystring-parser
  Using cached querystring_parser-1.2.4-py2.py3-none-any.whl (7.9 kB)
Collecting gitpython>=2.1.0
  Using cached GitPython-3.1.27-py3-none-any.whl (181 kB)
Collecting databricks-cli>=0.8.7
  Downloading databricks-cli-0.16.8.tar.gz (67 kB)
     |████████████████████████████████| 67 kB 4.3 MB/s
  Preparing metadata (setup.py) ... done
Collecting docker>=4.0.0
  Using cached docker-5.0.3-py2.py3-none-any.whl (146 kB)
Collecting gunicorn
  Using cached gunicorn-20.1.0-py3-none-any.whl (79 kB)
Collecting sqlparse>=0.3.1
  Using cached sqlparse-0.4.2-py3-none-any.whl (42 kB)
Collecting prometheus-flask-exporter
  Using cached prometheus_flask_exporter-0.20.2-py3-none-any.whl (18 kB)
Collecting tabulate>=0.7.7
  Using cached tabulate-0.8.9-py3-none-any.whl (25 kB)
Collecting gitdb<5,>=4.0.1
  Usin

In [2]:
import mlflow
import pandas as pd
import os

experiment_name = "predict_wine_quality"
project_folder = 'wine'

os.makedirs(project_folder, exist_ok = True)
os.makedirs(project_folder + '/code', exist_ok = True)
os.makedirs(project_folder + '/config', exist_ok = True)

try:
    experiment_id = mlflow.create_experiment(experiment_name)
except mlflow.exceptions.MlflowException:  # raised when the experiment already exists
    experiment = mlflow.get_experiment_by_name(experiment_name)
    experiment_id = experiment.experiment_id
    
mlflow.set_experiment(experiment_name)

<Experiment: artifact_location='file:///home/jovyan/Notebooks/mlruns/1', experiment_id='1', lifecycle_stage='active', name='predict_wine_quality', tags={}>

### Exercise

Below is the script we wish to execute. A lot of the code should look familiar. Examine this script and try to point out the pieces that are new. What is the purpose of `sys.argv`? Notice how and where the `mlflow` library is used in the code. Finally, execute the script to make sure it works. There are several ways to execute a script:

- from the **command line** navigate to its folder and run `python train.py`
- from this **notebook** create a new cell and paste this `!python $project_folder/code/train.py`
- from this **notebook** create a new cell and paste this `%run $project_folder/code/train.py`

In order to execute the script, make sure you first run the cell below. Note that if you changed the name of the experiment in the cell above, you will also need to change it in the script in the cell below.

### End of exercise

In [3]:
%%writefile $project_folder/code/train.py
# The data set used in this example is from http://archive.ics.uci.edu/ml/datasets/Wine+Quality
# P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
# Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.

import os
import warnings
import sys

import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import ElasticNet
from urllib.parse import urlparse
import mlflow
import mlflow.sklearn

import logging

logging.basicConfig(level = logging.WARN)
logger = logging.getLogger(__name__)


def eval_metrics(actual, pred):
    rmse = np.sqrt(mean_squared_error(actual, pred))
    mae = mean_absolute_error(actual, pred)
    r2 = r2_score(actual, pred)
    return rmse, mae, r2


if __name__ == "__main__":
    warnings.filterwarnings("ignore")
    np.random.seed(40)

    # read the wine-quality csv file from the URL
    csv_url = (
        "http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
    )
    try:
        data = pd.read_csv(csv_url, sep = ";")
    except Exception as e:
        logger.exception(
            "Unable to download training & test CSV, check your internet connection. Error: %s", e
        )
        raise  # stop here: without the data the rest of the script cannot run

    # split the data into training and test sets. (0.75, 0.25) split.
    train, test = train_test_split(data)

    # the predicted column is "quality" which is a scalar from [3, 9]
    train_x = train.drop(["quality"], axis = 1)
    test_x = test.drop(["quality"], axis = 1)
    train_y = train[["quality"]]
    test_y = test[["quality"]]

    alpha = float(sys.argv[1]) if len(sys.argv) > 1 else 0.5
    l1_ratio = float(sys.argv[2]) if len(sys.argv) > 2 else 0.5

    mlflow.set_experiment("predict_wine_quality")
    # mlflow.autolog()
    with mlflow.start_run():
        
        run = mlflow.active_run()
        experiment = mlflow.get_experiment(run.info.experiment_id)
        print("Experiment ID: \"{}\"".format(run.info.experiment_id))
        print("Experiment name: \"{}\"".format(experiment.name))
        print("Run ID: \"{}\"".format(run.info.run_id))

        lr = ElasticNet(alpha = alpha, l1_ratio = l1_ratio, random_state = 42)
        lr.fit(train_x, train_y)

        predicted_qualities = lr.predict(test_x)

        (rmse, mae, r2) = eval_metrics(test_y, predicted_qualities)

        print("Using alpha = {:0.2f}, l1_ratio = {:0.2f} we get the following metrics:".format(alpha, l1_ratio))
        print("  metric RMSE: {:6.2f}".format(rmse))
        print("  metric MAE: {:6.2f}".format(mae))
        print("  metric R-squared: {:0.2f}".format(r2))

        mlflow.log_param("alpha", alpha)
        mlflow.log_param("l1_ratio", l1_ratio)
        mlflow.log_metric("rmse", rmse)
        mlflow.log_metric("r2", r2)
        mlflow.log_metric("mae", mae)

        tracking_url_type_store = urlparse(mlflow.get_tracking_uri()).scheme

        # model registry does not work with file store
        if tracking_url_type_store != "file":

            # register the model
            mlflow.sklearn.log_model(lr, "model", registered_model_name = "ElasticnetWineModel")
        else:
            mlflow.sklearn.log_model(lr, "model")

Overwriting wine/code/train.py


In [4]:
!python $project_folder/code/train.py

Experiment ID: "1"
Experiment name: "predict_wine_quality"
Run ID: "4b6827c35b0440c9a9e81457476cdfd7"
Using alpha = 0.50, l1_ratio = 0.50 we get the following metrics:
  metric RMSE:   0.79
  metric MAE:   0.63
  metric R-squared: 0.11


In [5]:
%run $project_folder/code/train.py

Experiment ID: "1"
Experiment name: "predict_wine_quality"
Run ID: "00973f39a40d41459fd7d5f82780943f"
Using alpha = 0.50, l1_ratio = 0.50 we get the following metrics:
  metric RMSE:   0.79
  metric MAE:   0.63
  metric R-squared: 0.11


Since we defined the above script with two inputs (what `mlflow` calls "parameters"), we can now change them to new values and execute the script again.

In [6]:
!python $project_folder/code/train.py 0.25 0.50

Experiment ID: "1"
Experiment name: "predict_wine_quality"
Run ID: "988e111f74374ef49a07085443bc8b15"
Using alpha = 0.25, l1_ratio = 0.50 we get the following metrics:
  metric RMSE:   0.75
  metric MAE:   0.58
  metric R-squared: 0.21
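
The script reads its two inputs positionally from `sys.argv` and falls back to 0.5 when one is missing. The same behavior can be sketched with the standard `argparse` module, which also gives us type checking and a `--help` message for free. The function name `parse_args` is our own choice here and is not part of the original script.

```python
import argparse


def parse_args(argv=None):
    """Parse alpha and l1_ratio, both optional and defaulting to 0.5 as in train.py."""
    parser = argparse.ArgumentParser(description="Train an ElasticNet model")
    parser.add_argument("alpha", nargs="?", type=float, default=0.5)
    parser.add_argument("l1_ratio", nargs="?", type=float, default=0.5)
    args = parser.parse_args(argv)
    return args.alpha, args.l1_ratio


print(parse_args([]))                # no inputs: both defaults
print(parse_args(["0.25"]))          # only alpha given
print(parse_args(["0.25", "0.50"]))  # both given
```

Inside a script we would call `parse_args()` with no argument, in which case `argparse` reads the values from the command line.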


Let's now define an `mlflow` project and formalize what we did above. We create a file below that defines the project with its parameters and the command to be executed. Note that file paths are specified relative to the project directory.

In [7]:
%%writefile $project_folder/MLproject
name: Wine Quality Prediction

conda_env: config/conda.yaml

entry_points:
  main:
    parameters:
      alpha: float
      l1_ratio: {type: float, default: 0.1}
    command: "python code/train.py {alpha} {l1_ratio}"

Overwriting wine/MLproject
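
The `command` entry is a template: each `{name}` placeholder is filled with the value of the parameter of the same name, using the declared default when no value is supplied. The following toy function illustrates the idea. It is a simplified sketch and not `mlflow`'s actual implementation; the name `build_command` and the convention of using `None` for "no default" are assumptions made for this example.

```python
def build_command(template, declared, supplied):
    """Fill a command template from declared parameters and user-supplied values.

    declared maps each parameter name to its default, or to None if it has none.
    """
    values = {}
    for name, default in declared.items():
        if name in supplied:
            values[name] = supplied[name]
        elif default is not None:
            values[name] = default
        else:
            raise ValueError("missing required parameter: " + name)
    return template.format(**values)


template = "python code/train.py {alpha} {l1_ratio}"
declared = {"alpha": None, "l1_ratio": 0.1}  # alpha has no default in MLproject

print(build_command(template, declared, {"alpha": 0.42}))
```

So a parameter without a default, like `alpha`, has to be supplied when we submit a run, while `l1_ratio` can be left out.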


The above file also points to a conda environment file which we create below. This file defines the Python runtime used by the experiment. So for example, as part of the experiment, we can update one of the packages listed below and execute a new run to see if the update breaks our script.

In [8]:
%%writefile $project_folder/config/conda.yaml
channels:
  - defaults
dependencies:
  - numpy=1.14.3
  - pandas=0.22.0
  - pip:
    - mlflow
    - scikit-learn==0.24.1

Overwriting wine/config/conda.yaml
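
To confirm that an environment really contains the pinned versions, we can compare each library's `__version__` attribute against the pin. Below is a minimal sketch of such a check, meant to be run from a Python session inside the environment in question. The helper names are our own, and we assume each library exposes `__version__` (true for `numpy`, `pandas` and `sklearn`). Note that the import name of scikit-learn is `sklearn`.

```python
import importlib


def module_version(module_name):
    """Return a module's __version__, or None if it is missing or cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, "__version__", None)


def check_pins(pins):
    """Compare pinned versions against what the current environment provides."""
    report = {}
    for module_name, wanted in pins.items():
        found = module_version(module_name)
        report[module_name] = (wanted, found, found == wanted)
    return report


# pins copied from config/conda.yaml, keyed by import name
pins = {"numpy": "1.14.3", "pandas": "0.22.0", "sklearn": "0.24.1"}
for module_name, (wanted, found, ok) in check_pins(pins).items():
    status = "OK" if ok else "MISMATCH"
    print("{:<8} wanted {:<8} found {:<8} {}".format(module_name, wanted, str(found), status))
```

Run from the notebook's own environment, this will most likely report mismatches, which is expected: the pins describe the environment `mlflow` creates for the project, not the one the notebook runs in.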


To execute our experiment, we use the `mlflow` command. This is very similar to the way we executed the script earlier, but instead of pointing to the script file we point to the project folder and provide the experiment name.

In [9]:
!mlflow run $project_folder --experiment-name $experiment_name -P alpha=0.42

2022/06/15 17:18:48 INFO mlflow.utils.conda: === Creating conda environment mlflow-2f52bdca12d03fd90351043b6b674318958998c5 ===
Collecting package metadata (repodata.json): done
Solving environment: done


  current version: 4.10.3
  latest version: 4.13.0

Please update conda by running

    $ conda update -n base conda



Downloading and Extracting Packages
libstdcxx-ng-11.2.0  | 4.7 MB    | ##################################### | 100% 
xz-5.2.5             | 339 KB    | ##################################### | 100% 
zlib-1.2.12          | 106 KB    | ##################################### | 100% 
tbb-2021.5.0         | 157 KB    | ##################################### | 100% 
numpy-1.14.3         | 39 KB     | ##################################### | 100% 
mkl_fft-1.0.6        | 135 KB    | ##################################### | 100% 
readline-8.1.2       | 354 KB    | ##################################### | 100% 
ncurses-6.3          | 782 KB    | ####################################

We can also run the above command from the **command line**, as we will see in the following exercise. Finally, here's some useful information about our experiment.

### Exercise

Let's look at some other examples of how `mlflow` works:

- Running the above cell should show you some additional output besides the metrics. In the output, you should see the name of the conda environment created by `mlflow`. Launch the **Anaconda prompt** and activate the environment, then launch a Python session and check that the libraries match the versions specified in the conda environment we provided above. This can be useful for debugging your code. You can also learn about various `mlflow` commands by typing `mlflow --help`, `mlflow experiments --help` etc.
- Reactivate the course environment (named `uwdatasci`), then from the **Anaconda prompt** navigate to the directory containing this notebook and run `mlflow ui`. This will launch a new tab in your browser where you should now see an entry for the experiment we just ran. The user interface (UI) is where we go to look at all the runs in our experiment. Once in the `mlflow` UI, click on one of the successful runs, scroll down to **Artifacts**, and look at the model artifact it saved for us. Example code shows how to load the artifact in a new Python session.
- Return to the training script above, and uncomment the line `mlflow.autolog()`, then comment out the lines that use `log_param` and `log_metric` that tell us what parameters and metrics we want to log. Run the cell to save the new script, then submit a new run and go to the UI to compare the results with previous runs.

![](../images/mlflow-ui-artifacts.jpg)

### End of exercise

Now let's see how we can load the model saved from one of our runs into the current Python session. To do so, we copy the line with `logged_model = ...` (see above screenshot) from the model artifacts page, and paste it below. We can then load a few rows of the wine data and use the model to get predictions.

In [10]:
logged_model = 'file:///C:/Users/sethmott/OneDrive/Documents/UW/DATASCI-530/notebooks/mlruns/2/b752c0dc989e458db3940d195d4be348/artifacts/model'
loaded_model = mlflow.pyfunc.load_model(logged_model) # load model as a PyFuncModel.
df_wine_sample = pd.read_csv('../data/wine.csv').drop(columns = ['quality', 'Class']).head() # load some data
loaded_model.predict(df_wine_sample) # predict on a pandas.DataFrame

OSError: No such file or directory: '/C:/Users/sethmott/OneDrive/Documents/UW/DATASCI-530/notebooks/mlruns/2/b752c0dc989e458db3940d195d4be348/artifacts/model'

Based on the business need, we can also go one step further and serve the model over HTTP as a **scoring service**. This makes the model behave like an application. To do so, run the next cell, and copy its **output** and run it from the command line. Note that you can only run `mlflow` commands from the **Anaconda prompt** after activating the environment that `mlflow` is installed in.

In [None]:
!echo mlflow models serve -m $logged_model -p 1234

Examine the output as you run the above command. We should see the conda environment being created before the model is served. Once the model is ready, the HTTP URL is shown as well.

The data we send to the model must be in JSON format, which is one of the most common formats that applications use to send data to each other. In this context, the data is sometimes referred to as the **payload**. Here is an example of what the data should look like in our case:

In [None]:
%%writefile $project_folder/data/input_sample.json
{"columns":["fixed acidity", "volatile acidity", "citric acid", "residual sugar", "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density", "pH", "sulphates", "alcohol"], 
 "index":[0, 1, 2, 3, 4], 
 "data":[
     [7.4,  0.7,  0.0,  1.9, 0.076, 11.0, 34.0, 0.9978, 3.51, 0.56, 9.4], 
     [7.8,  0.88, 0.0,  2.6, 0.098, 25.0, 67.0, 0.9968, 3.2,  0.68, 9.8], 
     [7.8,  0.76, 0.04, 2.3, 0.092, 15.0, 54.0, 0.997,  3.26, 0.65, 9.8], 
     [11.2, 0.28, 0.56, 1.9, 0.075, 17.0, 60.0, 0.998,  3.16, 0.58, 9.8], 
     [7.4,  0.7,  0.0,  1.9, 0.076, 11.0, 34.0, 0.9978, 3.51, 0.56, 9.4]]
}
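
The payload above has three keys: the column names, the row labels, and the row values. Here is a short sketch of how such a payload can be assembled and serialized with the standard `json` module. The helper name `to_split_payload` is our own. With `pandas`, calling `DataFrame.to_json(orient="split")` produces the same three-key layout.

```python
import json


def to_split_payload(columns, rows):
    """Build a split-oriented payload: column names, row labels, and row values."""
    for row in rows:
        if len(row) != len(columns):
            raise ValueError("every row needs exactly one value per column")
    return {
        "columns": list(columns),
        "index": list(range(len(rows))),
        "data": [list(row) for row in rows],
    }


columns = ["fixed acidity", "volatile acidity", "citric acid", "residual sugar",
           "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density",
           "pH", "sulphates", "alcohol"]
rows = [
    [7.4, 0.7, 0.0, 1.9, 0.076, 11.0, 34.0, 0.9978, 3.51, 0.56, 9.4],
    [7.8, 0.88, 0.0, 2.6, 0.098, 25.0, 67.0, 0.9968, 3.2, 0.68, 9.8],
]

payload = to_split_payload(columns, rows)
print(json.dumps(payload, indent=1))
```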

To send a request to the model, we can use the `curl` command, or any REST API client like [Postman](https://www.postman.com/). Here is what the `curl` command looks like, which you can run on Linux or on Windows using [WSL](https://docs.microsoft.com/en-us/windows/wsl/install-win10).

In [None]:
!echo curl -X POST -H "Content-Type:application/json; format=pandas-split" --data @$project_folder/data/input_sample.json http://127.0.0.1:1234/invocations

Here's the output you should see by running the above command.

    [5.422102809496764, 5.448114600770513, 5.444533999028288, 5.513957675441143, 5.422102809496764]
    
If we get errors, two possible reasons are:
- We need to first run `conda activate <environment-name>` to activate the Conda environment in which `mlflow` is installed.
- We need to navigate to the folder where the notebook is running. This is because we set up the code so that paths are specified relative to this folder. You can run `print(os.getcwd())` to see the path, and then `cd` into it.
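
The same request can also be sent from Python. The sketch below uses only the standard library and mirrors the `curl` command: same URL, same `Content-Type` header, and a JSON body. The function names are our own, and the tiny payload is a placeholder that should be replaced with the contents of `input_sample.json`. Sending only works while the model server started by `mlflow models serve` is running, so the call to `send` is left commented out.

```python
import json
import urllib.request


def build_scoring_request(payload, url="http://127.0.0.1:1234/invocations"):
    """Prepare (but do not send) a POST request carrying a split-oriented JSON payload."""
    body = json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json; format=pandas-split"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def send(request, timeout=10):
    """Send the request and decode the JSON response; needs the model server to be up."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# placeholder payload for illustration; the real model expects the 11 wine columns
payload = {"columns": ["a", "b"], "index": [0], "data": [[1.0, 2.0]]}
request = build_scoring_request(payload)
print(request.get_method(), request.full_url)
# predictions = send(request)  # uncomment once the model is being served
```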

Let's finish by pointing out two important aspects about `mlflow` here:
- Everything we did here is "local", meaning that all meta-data is being saved to a local file path, but in most production systems we use the cloud both for storage and for serving such models in production. For example, look [here](https://mlflow.org/docs/latest/models.html#deploy-a-python-function-model-on-microsoft-azure-ml) for an example of deployment in Azure. There are similar "plug-ins" for other cloud providers.
- As we saw, there are three ways to interact with `mlflow`: through the Python library, through the command line, and through the UI. Which we use depends to some extent on what we want to do. For example, to log metrics, it makes sense to use the Python library and embed `mlflow` in the code. To run experiments and serve models we used the command line and to see and compare runs we used the UI, but in most cases we can also use the Python library, so it's a matter of preference to some extent. As an example, take a look at the next cell, which returns a `DataFrame` with meta-data for runs under our experiment.

In [None]:
mlflow.search_runs(experiment_id).head()

# Assignment

In the lab, we saw how we can take a training script and use `mlflow` to submit runs. In practice, the training script isn't always clean code that is ready to use. So in this assignment, we learn to **refactor code** to train a model and prepare it for scoring. We learn about functionality in `sklearn` for **model persistence**, a fancy term for saving a model, so we can later load it and use it for prediction. We also learn how we can chain pre-processing steps and attach them to the model prediction step so we can both pre-process and score in one smooth flow.

The bulk of the code that needs to execute is already given. This should look like code that we've written throughout the course. But when moving to production, it is **highly, highly recommended** that we **refactor** the code. What this means is that we need to go over the code from start to finish and do a bunch of things (now that we have the advantage of **hindsight**):

- add **comments** in the code, for future us or (heaven forbid!) if someone else has to look at our code!
- remove **extra code** that we wrote during development for debugging purposes but no longer need, or at least comment it out
- simplify things, remove redundancies and make the code more **modular** by using functions or classes if needed
- **parametrize** the code, so you avoid **hard-coding** things that need to change, and move them as high-level parameters at the top of the code, making it easy to change things without breaking things
- create a runtime for the environment using Conda or pip, which acts as a snapshot of the Python libraries we used and pins down their versions (careful: not all packages used during training are needed during scoring, and since the scoring environment is supposed to be **lightweight**, we should identify and remove such packages)
- add **scaffolding**, this is to make sure that the code executes **gracefully** when errors happen, such as when the model expects a certain feature in the future data but it is missing for some reason (by gracefully, we mean that we use things like `try` and `except` to catch and redirect errors)
- last but not least, never stop **testing**, but testing here can mean **unit testing**, **integration testing**, and even statistical tests for **data drift** or **model drift**
- if you haven't yet (what have you been waiting for!) begin to **version control** your code using **git** or something similar
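
As an illustration of parametrizing and scaffolding, here is a small sketch of a scoring helper that fails gracefully. Everything in it is hypothetical: the feature list is a made-up subset defined once at the top, and `dummy_predict` stands in for a trained model. The point is only the pattern of checking inputs first and catching errors around the prediction.

```python
import logging

logger = logging.getLogger("scoring")

# high-level parameter, set once at the top (hypothetical subset of features)
REQUIRED_FEATURES = ["age", "balance", "job"]


def missing_features(record, required=REQUIRED_FEATURES):
    """Return the required feature names that are absent from a record."""
    return [name for name in required if name not in record]


def score_safely(record, predict):
    """Score one record, returning None instead of crashing when the input is unusable."""
    missing = missing_features(record)
    if missing:
        logger.warning("skipping record, missing features: %s", ", ".join(missing))
        return None
    try:
        return predict(record)
    except Exception:
        # log the full traceback, then hand back a sentinel the caller can check
        logger.exception("prediction failed for record %r", record)
        return None


def dummy_predict(record):
    # stand-in for a trained model, used only for illustration
    return "yes" if record["balance"] > 1000 else "no"


print(score_safely({"age": 40, "balance": 580, "job": "services"}, dummy_predict))
print(score_safely({"age": 40, "job": "services"}, dummy_predict))
```

Whether to return a sentinel, skip the record, or re-raise the error depends on how the scoring service is consumed, so treat this as one option among several.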

Of course even with hindsight, doing these things is not easy and the answers are not always available, but we do the best we can and with experience we get better at it. Every project can be viewed as a work in progress, but applying certain **best practices** can make it easier to keep improving things without breaking them. This is what **agile development** is all about. You may have noticed that all the above steps are things we do generally when we write applications. It's just that not all data scientists have a rigorous computer science background and so we tend to have looser standards in general. The above is just the beginning, not the end. Usually, depending on the type of deployment we are doing, there are specific additional steps needed.

Enough said. It's time for work! We used our knowledge of data science to write up some code to read data, pre-process it, train a model and use it to get predictions on a test data set. Here's a standard training code snippet. Examine it and run it to make sure it works.

In [None]:
import pandas as pd

bank = pd.read_csv('../data/bank-full.csv', delimiter = ';')

num_cols = bank.select_dtypes(['integer', 'float']).columns
cat_cols = bank.select_dtypes(['object']).drop(columns = "y").columns

print("Numeric columns are {}.".format(", ".join(num_cols)))
print("Categorical columns are {}.".format(", ".join(cat_cols)))

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(bank.drop(columns = "y"), bank["y"], 
                                                    test_size = 0.10, random_state = 42)

X_train = X_train.reset_index(drop = True)
X_test = X_test.reset_index(drop = True)

print("Training data has {} rows.".format(X_train.shape[0]))
print("Test data has {} rows.".format(X_test.shape[0]))

from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler

onehoter = OneHotEncoder(sparse = False)
onehoter.fit(X_train[cat_cols])
onehot_cols = onehoter.get_feature_names(cat_cols)
X_train_onehot = pd.DataFrame(onehoter.transform(X_train[cat_cols]), columns = onehot_cols)
X_test_onehot = pd.DataFrame(onehoter.transform(X_test[cat_cols]), columns = onehot_cols)

znormalizer = StandardScaler()
znormalizer.fit(X_train[num_cols])
X_train_norm = pd.DataFrame(znormalizer.transform(X_train[num_cols]), columns = num_cols)
X_test_norm = pd.DataFrame(znormalizer.transform(X_test[num_cols]), columns = num_cols)

X_train_featurized = X_train_onehot # add one-hot-encoded columns
X_test_featurized = X_test_onehot   # add one-hot-encoded columns
X_train_featurized[num_cols] = X_train_norm # add numeric columns
X_test_featurized[num_cols] = X_test_norm   # add numeric columns

del X_train_norm, X_test_norm, X_train_onehot, X_test_onehot

print("Featurized training data has {} rows and {} columns.".format(*X_train_featurized.shape))
print("Featurized test data has {} rows and {} columns.".format(*X_test_featurized.shape))

from sklearn.linear_model import LogisticRegression

logit = LogisticRegression(max_iter = 5000, solver = 'lbfgs')
logit.fit(X_train_featurized, y_train)

y_hat_train = logit.predict(X_train_featurized)
y_hat_test = logit.predict(X_test_featurized)

from sklearn.metrics import precision_score, recall_score
precision_train = precision_score(y_train, y_hat_train, pos_label = 'yes') * 100
precision_test = precision_score(y_test, y_hat_test, pos_label = 'yes') * 100

recall_train = recall_score(y_train, y_hat_train, pos_label = 'yes') * 100
recall_test = recall_score(y_test, y_hat_test, pos_label = 'yes') * 100

print("Precision = {:.0f}% and recall = {:.0f}% on the training data.".format(precision_train, recall_train))
print("Precision = {:.0f}% and recall = {:.0f}% on the test data.".format(precision_test, recall_test))

The code above trains a model on data that contains both categorical and numeric features. We normalize the numeric features and one-hot-encode the categorical features as part of pre-processing. In the above code we do this "manually"; however, as shown [here](https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer.html), we can compose data transformations and ML steps to create a **single multi-step pipeline**. The pipeline object (conveniently called `pipeline` in the docs) has `fit` and `predict` methods:
- By calling `fit`, the raw input data is first transformed into the featurized data, and then passed to the ML algorithm to train a model.
- By calling `predict`, the raw input data is first transformed into the featurized data (just like `fit`), and then used to get predictions (using the model trained when we called `fit`).

You can even create your own transformers and include them in a pipeline step, as shown [here](https://scikit-learn.org/stable/modules/preprocessing.html#custom-transformers), but we won't worry about this for this assignment.

Let's also have some data that we can use for scoring:

In [None]:
%%writefile ../data/new_data.json
{"age": {"0": 40, "1": 47},
 "balance": {"0": 580, "1": 3644},
 "campaign": {"0": 1, "1": 2},
 "contact": {"0": "unknown", "1": "unknown"},
 "day": {"0": 16, "1": 9},
 "default": {"0": "no", "1": "no"},
 "duration": {"0": 192, "1": 83},
 "education": {"0": "secondary", "1": "secondary"},
 "housing": {"0": "yes", "1": "no"},
 "job": {"0": "blue-collar", "1": "services"},
 "loan": {"0": "no", "1": "no"},
 "marital": {"0": "married", "1": "single"},
 "month": {"0": "may", "1": "jun"},
 "pdays": {"0": -1, "1": -1},
 "poutcome": {"0": "unknown", "1": "unknown"},
 "previous": {"0": 0, "1": 0}}
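
The scoring data is laid out column by column: each column name maps to a dictionary from row label to value. To show what that means, here is a small standard-library sketch that turns this layout into one record per row. It uses a two-column excerpt as an inline string, the helper name `columns_to_rows` is our own, and it assumes the row labels are numeric strings. This column-oriented layout is also what `pandas.read_json` expects by default when building a `DataFrame`.

```python
import json

# two-column excerpt of the scoring data, inlined for illustration
raw = '{"age": {"0": 40, "1": 47}, "job": {"0": "blue-collar", "1": "services"}}'


def columns_to_rows(columns):
    """Turn {column: {row_label: value}} into a list of {column: value} records."""
    labels = sorted({label for values in columns.values() for label in values}, key=int)
    return [{name: values.get(label) for name, values in columns.items()} for label in labels]


rows = columns_to_rows(json.loads(raw))
print(rows)
```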

Run through the following steps to first refactor and test code:

- **Step 1:** Compose the data processing and training steps in the above code into one pipeline as shown in the referenced doc. <span style="color:red" float:right>[15 point]</span>
- **Step 2:** Call `fit` and `predict` on the pipeline to make sure that it all works. Remember to pass them the **un-processed** (original) data, since the data processing should be built into the pipeline now. <span style="color:red" float:right>[5 point]</span>
- **Step 3:** Save your pipeline object using `joblib` as shown [here](https://sklearn.org/modules/model_persistence.html). <span style="color:red" float:right>[5 point]</span>
- **Step 4:** Now write a **new script** for scoring: it loads the pipeline you saved in the last step, reads the data `../data/new_data.json` and converts it to a `pandas.DataFrame` object, and obtains predictions on it. The predictions should be stored as a `json` file `../data/new_preds.json`. <span style="color:red" float:right>[10 point]</span>

To begin work on the assignment, it's best to first copy the training script into a cell below and modify it according to the instructions in steps 1-3. Then create a new cell and populate it with the scoring script described in step 4.

In [None]:
## modified training script goes here

In [None]:
## scoring script goes here

# End of assignment