# Build a simple ML pipeline for expense report estimation

## Introduction
This tutorial shows how to train and deploy a model that estimates expense report expenditures. The goal is a service we can call to estimate how far our expense reports are from reality.


## Set up your development environment

All the setup for your development work can be accomplished in a Python notebook.  Setup includes:

### Import packages

Import the Python packages you need in this session and display the Azure Machine Learning SDK version. If you have not already done so, ensure that [the Azure ML SDK is installed](https://docs.microsoft.com/en-us/python/api/overview/azure/ml/install?view=azure-ml-py). If you are running this from outside an Azure ML compute instance, run the following commands to install the relevant SDK libraries.

```shell
pip install azureml-core
pip install azureml-pipeline
```

In [None]:
import os
import azureml.core
from azureml.core import (
    Workspace,
    Dataset,
    Datastore,
    ComputeTarget,
    Experiment,
    ScriptRunConfig,
)
from azureml.data import DataType, OutputFileDatasetConfig
from azureml.data.datapath import DataPath
from azureml.data.dataset_consumption_config import DatasetConsumptionConfig
from azureml.pipeline.steps import PythonScriptStep
from azureml.pipeline.core import Pipeline, PipelineParameter, PipelineRun

# check core SDK version number
print("Azure ML SDK Version: ", azureml.core.VERSION)

### Connect to workspace

Create a workspace object from the existing workspace. `Workspace.from_config()` reads the file **config.json** and loads the details into an object named `workspace`.

If you do not already have a **config.json** file, uncomment and run the first code cell below to create one.

In [None]:
#workspace = Workspace.get(name="<Enter your workspace name>",
#                subscription_id="<Enter your subscription ID>",
#                resource_group="<Enter your resource group>")
#workspace.write_config(file_name="config.json")
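
For reference, the **config.json** file is a small JSON document that identifies the workspace. The sketch below checks a file for the three keys the SDK writes (`subscription_id`, `resource_group`, `workspace_name`). The helper name `missing_config_keys` and the placeholder values are illustrative assumptions, not part of the Azure ML SDK.

```python
import json
import tempfile
from pathlib import Path

# Keys found in a workspace config.json file.
REQUIRED_KEYS = {"subscription_id", "resource_group", "workspace_name"}


def missing_config_keys(path):
    """Return the required keys that are absent from a workspace config file."""
    with open(path, encoding="utf-8") as handle:
        config = json.load(handle)
    return sorted(REQUIRED_KEYS - set(config))


# Demo with placeholder values written to a temporary file (hypothetical data).
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "config.json"
    sample.write_text(
        json.dumps(
            {
                "subscription_id": "<subscription-id>",
                "resource_group": "<resource-group>",
            }
        ),
        encoding="utf-8",
    )
    missing = missing_config_keys(sample)

print(missing)
```

An empty list means the file has everything `Workspace.from_config()` needs to locate the workspace.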

In [None]:
# load workspace
workspace = Workspace.from_config()
print(
    "Workspace name: " + workspace.name,
    "Azure region: " + workspace.location,
    "Resource group: " + workspace.resource_group,
    sep="\n",
)

### Create experiment and a directory

Create an experiment to track the runs in your workspace and a directory to deliver the necessary code from your computer to the remote resource.

In [None]:
# create an ML experiment
exp = Experiment(workspace=workspace, name="ExpenseReportsPipeline")

# create a directory for the training scripts
script_folder = "./src"
os.makedirs(script_folder, exist_ok=True)

# create a temp directory for data
tmp_folder = "./tmp"
os.makedirs(tmp_folder, exist_ok=True)

### Create or attach an existing compute resource

**Creation of compute takes approximately 5 minutes.** If an AmlCompute cluster with that name already exists in your workspace, the code skips the creation process.

In [None]:
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException

# choose a name for your cluster
cluster_name = "cpu-cluster"

found = False
# Check if this compute target already exists in the workspace.
cts = workspace.compute_targets
if cluster_name in cts and cts[cluster_name].type == "AmlCompute":
    found = True
    print("Found existing compute target.")
    compute_target = cts[cluster_name]
if not found:
    print("Creating a new compute target...")
    compute_config = AmlCompute.provisioning_configuration(
        vm_size="STANDARD_D2_V2",
        max_nodes=4,
    )

    # Create the cluster.
    compute_target = ComputeTarget.create(workspace, cluster_name, compute_config)

    # Can poll for a minimum number of nodes and for a specific timeout.
    # If no min_node_count is provided, it will use the scale settings for the cluster.
    compute_target.wait_for_completion(
        show_output=True, min_node_count=None, timeout_in_minutes=10
    )
# For a more detailed view of current AmlCompute status, use get_status().
print(compute_target.get_status().serialize())

## Create the expense reports dataset

By creating a dataset, you create a reference to the data source location. If you applied any subsetting transformations to the dataset, they will be stored in the dataset as well. The data remains in its existing location, so no extra storage cost is incurred.

In [None]:
expenses_datastore = Datastore.get(workspace, datastore_name="expense_reports")

## Build an ML pipeline

### Step 1: data preparation

In step one, we will load the data and labels (expense report amounts) from the Expense Reports dataset and then send it along to the pipeline.

Here is where things get a little tricky. Our data in the original example was stored in Azure SQL Database, but for pipelines to work, we need to send **mountable** data store information, which means text files in a folder. We can't send the SQL-backed TabularDataset as an input to a pipeline step, so we'll need to make sure that the data is available in text format in Azure ML. Instead of pre-loading that data, I'll query it here and build the file-based copy as we go.

In [None]:
query = """SELECT
    er.EmployeeID,
    CONCAT(e.FirstName, ' ', e.LastName) AS EmployeeName,
    ec.ExpenseCategoryID,
    ec.ExpenseCategory,
    er.ExpenseDate,
    YEAR(er.ExpenseDate) AS ExpenseYear,
    -- Python requires FLOAT values--it does not support DECIMAL
    CAST(er.Amount AS FLOAT) AS Amount
FROM dbo.ExpenseReport er
    INNER JOIN dbo.ExpenseCategory ec
        ON er.ExpenseCategoryID = ec.ExpenseCategoryID
    INNER JOIN dbo.Employee e
        ON e.EmployeeID = er.EmployeeID
WHERE
    YEAR(er.ExpenseDate) < 2017;"""

dp = DataPath(expenses_datastore, query)

data_types = {
    'EmployeeID': DataType.to_long(),
    'EmployeeName': DataType.to_string(),
    'ExpenseCategoryID': DataType.to_long(),
    'ExpenseCategory': DataType.to_string(),
    'ExpenseDate': DataType.to_datetime('%Y-%m-%d'),
    'ExpenseYear': DataType.to_long(),
    'Amount': DataType.to_float()
}

expensereport_ds = Dataset.Tabular.from_sql_query(dp, set_column_types=data_types).to_pandas_dataframe()

Now that we have the dataset and have shaped it as a Pandas dataframe, let's write this out to CSV.  To do that, I'm writing it to a single CSV entitled `ExpenseReports.csv` in the temp folder that we created earlier.

In [None]:
local_path = tmp_folder + "/ExpenseReports.csv"
expensereport_ds.to_csv(local_path)
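
One detail worth checking before the upload: by default, pandas `to_csv()` also writes the dataframe index, which shows up as a first column with an empty header. The sketch below inspects a CSV header for that column. The helper name `has_unnamed_index_column` and the sample rows are hypothetical; whether the extra column matters depends on what `train.py` expects.

```python
import csv
import tempfile
from pathlib import Path


def has_unnamed_index_column(csv_path):
    """Return True when the first header cell is empty, as left by a written index."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle))
    return header[0] == ""


# Hand-written sample that mimics the default to_csv() layout; values are made up.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.csv"
    sample.write_text(",EmployeeID,Amount\n0,1,25.0\n", encoding="utf-8")
    print(has_unnamed_index_column(sample))
```

If the column is unwanted, passing `index=False` to `to_csv()` omits it.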

The next step is to write the contents of the `tmp` folder to our default datastore, into a folder named `ExpenseReports`.  Once we're done, this data will be in Azure Blob Storage inside a container for Azure ML, with the folder path `ExpenseReports/ExpenseReports.csv`.

In [None]:
datastore = workspace.get_default_datastore()
datastore.upload(src_dir=tmp_folder, target_path='ExpenseReports')
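
Because the upload sends everything in the source folder, it can help to preview what will land where. The following is a minimal local sketch, assuming that files keep their relative layout under `target_path` (consistent with the `ExpenseReports/ExpenseReports.csv` path described above). The function `preview_upload` is a hypothetical helper, not an SDK call.

```python
import os
import posixpath
import tempfile


def preview_upload(src_dir, target_path):
    """Map each local file under src_dir to the blob path it is expected to get."""
    mapping = {}
    for root, _dirs, files in os.walk(src_dir):
        for name in files:
            local = os.path.join(root, name)
            # Blob paths always use forward slashes, whatever the local OS uses.
            relative = os.path.relpath(local, src_dir).replace(os.sep, "/")
            mapping[local] = posixpath.join(target_path, relative)
    return mapping


# Demo against a throwaway folder that stands in for ./tmp.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "ExpenseReports.csv"), "w").close()
    targets = sorted(preview_upload(tmp, "ExpenseReports").values())

print(targets)
```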

We already have the datastore object here, so we can use it to reference the uploaded file. As a quick note, if this data were already in Azure ML, we could call the `from_delimited_files()` method directly instead of going through the intermediate steps.

In [None]:
expensereportcsv_ds = Dataset.Tabular.from_delimited_files(datastore.path('ExpenseReports/ExpenseReports.csv'))

### Step 2: train the model

Our first step in training the model is to specify what our remote compute environment will look like.  There are a few options available to us, but the two most popular methods are to specify a Conda environment and to use a pre-built environment.

Specifying a Conda environment is nice because it provides the most flexibility: you can install particular versions of libraries and customize what is on that compute machine. Azure ML will then build a Docker image whose `Dockerfile` includes the environment dependencies you specify, and set up a custom machine. For a simple example of Conda dependencies, check out `src/conda_dependencies.yml`.

Another option is to use a pre-built environment. In this case, we're going to use an environment built into Azure Machine Learning which includes `scikit-learn` and everything we will need to perform our model training.

In [None]:
from azureml.core import Environment

er_env = Environment.get(workspace=workspace, name="AzureML-sklearn-0.24-ubuntu18.04-py37-cpu")

Next, construct a `ScriptRunConfig` to configure the training run that trains a model of expense report expectations. It's going to use the compute target we specified, install the pre-configured environment we specified, and indicate that there is a file named `train.py` in the script folder we specified.

In [None]:
train_src = ScriptRunConfig(
    source_directory=script_folder,
    script="train.py",
    compute_target=compute_target,
    environment=er_env,
)
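
The contents of `train.py` are not shown in this tutorial. As a rough idea of its shape, here is a hypothetical sketch, not the actual script: it reads the named input dataset from the run context, fits a model, logs a metric, and saves the model to `outputs/model.pkl` (the path used when registering the model later). The feature columns, the choice of `LinearRegression`, and the `rmse` metric name are all illustrative assumptions.

```python
"""Hypothetical sketch of src/train.py; not the tutorial's actual script."""
import os

# Assumptions: column names come from the SQL query earlier in the tutorial;
# the feature selection and the model type are purely illustrative.
FEATURE_COLUMNS = ["ExpenseCategoryID", "ExpenseYear"]
TARGET_COLUMN = "Amount"
INPUT_NAME = "prepared_expensereport_ds"  # name given via as_named_input(...)
MODEL_PATH = "outputs/model.pkl"  # path later passed to register_model(...)


def main():
    # Imported lazily so the module can be inspected without the SDK installed.
    import joblib
    from azureml.core import Run
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    # Fetch the run context and materialize the named input as a dataframe.
    run = Run.get_context()
    df = run.input_datasets[INPUT_NAME].to_pandas_dataframe()

    X_train, X_test, y_train, y_test = train_test_split(
        df[FEATURE_COLUMNS], df[TARGET_COLUMN], test_size=0.2, random_state=0
    )
    model = LinearRegression().fit(X_train, y_train)

    # Log a metric so that it appears in the step run's metrics.
    rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
    run.log("rmse", rmse)

    # Files written under outputs/ are kept with the run.
    os.makedirs(os.path.dirname(MODEL_PATH), exist_ok=True)
    joblib.dump(model, MODEL_PATH)
```

When saved as `src/train.py`, the file would end with a call to `main()` under the usual `if __name__ == "__main__":` guard.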

Pass the run configuration details into the PythonScriptStep.

A **PythonScriptStep** is a basic, built-in step to run a Python script on a compute target. It takes a script name and, optionally, other parameters such as arguments for the script, a compute target, inputs, and outputs. If no compute target is specified, the default compute target for the workspace is used.

Here, we will pass in our flat file expense reports as a prepared dataset and then execute the training script as specified in `train_src` above.

In [None]:
train_step = PythonScriptStep(
    name="train step",
    arguments=[
        expensereportcsv_ds.as_named_input(name="prepared_expensereport_ds")
    ],
    source_directory=train_src.source_directory,
    script_name=train_src.script,
    runconfig=train_src.run_config,
)

### Build the pipeline
Once we have the steps (or steps collection), we can build the pipeline.

A pipeline is created with a list of steps and a workspace. Submit a pipeline using `submit`. When submit is called, a PipelineRun is created which in turn creates StepRun objects for each step in the workflow.

In [None]:
# build pipeline & run experiment
pipeline = Pipeline(workspace, steps=[train_step])
run = exp.submit(pipeline)

### Monitor the PipelineRun

In [None]:
run.wait_for_completion(show_output=True)

In [None]:
run.find_step_run("train step")[0].get_metrics()

## Register the input dataset and the output model

Azure Machine Learning lets you trace how your datasets are used. From the `run` object, you can find out where and how each dataset was used.

In [None]:
# get input datasets
prep_step = run.find_step_run("train step")[0]
inputs = prep_step.get_details()["inputDatasets"]
input_dataset = inputs[0]["dataset"]

# display the dataset that the step consumed
input_dataset

Register the input Expense Reports dataset with the workspace so that you can reuse it in other experiments or share it with your colleagues who have access to your workspace.

In [None]:
expensereports_ds = input_dataset.register(
    workspace=workspace,
    name="expensereports_ds",
    description="Generated expense report data from 2011-2017",
    create_new_version=True,
)
expensereports_ds

Our last step is to register the output model along with the dataset. That way, we can deploy the model as a service.

In [None]:
run.find_step_run("train step")[0].register_model(
    model_name="ExpenseReportsPipelineModel",
    model_path="outputs/model.pkl",
    datasets=[("train test data", expensereports_ds)],
)