# NYC taxi data regression

**Requirements** - In order to benefit from this tutorial, you will need:
- A basic understanding of Machine Learning
- An Azure account with an active subscription - [Create an account for free](https://azure.microsoft.com/free/?WT.mc_id=A261C142F)
- An Azure ML workspace with a compute cluster - [Configure workspace](../../configuration.ipynb)
- A Python environment
- The Azure Machine Learning Python SDK v2, installed - [install instructions](../../../README.md) (see the getting started section)

**Learning Objectives** - By the end of this tutorial, you should be able to:
- Connect to your AML workspace from the Python SDK
- Define different `CommandComponent`s using YAML
- Create a `Pipeline` that loads these components from YAML

**Motivations** - This notebook explains how to load components via the SDK and then use them to build a pipeline. Using the NYC taxi dataset, we build a pipeline with five steps: prepare data, transform data, train a model, predict results, and evaluate model performance.

# 1. Connect to Azure Machine Learning Workspace

The [workspace](https://docs.microsoft.com/en-us/azure/machine-learning/concept-workspace) is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create when you use Azure Machine Learning. In this section we will connect to the workspace in which the job will be run.

## 1.1 Import the required libraries

In [1]:
# import required libraries
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential

from azure.ai.ml import MLClient, Input
from azure.ai.ml.dsl import pipeline
from azure.ai.ml import load_component

## 1.2 Configure credential

We use `DefaultAzureCredential` to access the workspace.
`DefaultAzureCredential` should be capable of handling most Azure SDK authentication scenarios.

If it does not work for you, see these references for other available credentials: [configure credential example](../../configuration.ipynb), [azure-identity reference doc](https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity?view=azure-python).

In [2]:
try:
    credential = DefaultAzureCredential()
    # Check if given credential can get token successfully.
    credential.get_token("https://management.azure.com/.default")
except Exception as ex:
    # Fall back to InteractiveBrowserCredential if DefaultAzureCredential does not work
    credential = InteractiveBrowserCredential()

## 1.3 Get a handle to the workspace

We use a config file to connect to the workspace. The Azure ML workspace should be configured with a compute cluster. [See this notebook to configure a workspace](../../configuration.ipynb).

In [3]:
# Get a handle to the workspace
ml_client = MLClient.from_config(credential=credential)

# Retrieve an already attached Azure Machine Learning Compute.
cluster_name = "cpu-cluster"
print(ml_client.compute.get(cluster_name))

Found the config file in: /Users/deepyaman_datta/.azureml/config.json


AmlCompute({'type': 'amlcompute', 'created_on': None, 'provisioning_state': 'Succeeded', 'provisioning_errors': None, 'name': 'cpu-cluster', 'description': None, 'tags': {}, 'properties': {}, 'id': '/subscriptions/3f15b3d0-4a6d-468b-aa61-c8a757b7745a/resourceGroups/quantumhack-aml/providers/Microsoft.MachineLearningServices/workspaces/quantumhack-ws/computes/cpu-cluster', 'base_path': './', 'creation_context': None, 'serialize': <msrest.serialization.Serializer object at 0x7fd6cb78dfa0>, 'resource_id': None, 'location': 'australiaeast', 'size': 'STANDARD_DS3_V2', 'min_instances': 0, 'max_instances': 10, 'idle_time_before_scale_down': 120.0, 'identity': None, 'ssh_public_access_enabled': True, 'ssh_settings': None, 'network_settings': <azure.ai.ml.entities._compute.compute.NetworkSettings object at 0x7fd6cb78dc10>, 'tier': 'dedicated'})
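
`MLClient.from_config` reads the workspace details from a `config.json` file, as the "Found the config file in" message above shows. The following is a minimal sketch of creating such a file, assuming the three keys `subscription_id`, `resource_group`, and `workspace_name`. The values are placeholders and the `write_workspace_config` helper is hypothetical (it is not part of the SDK).

```python
import json
import tempfile
from pathlib import Path

# Placeholder values: replace them with your own subscription,
# resource group, and workspace names.
workspace_config = {
    "subscription_id": "<SUBSCRIPTION_ID>",
    "resource_group": "<RESOURCE_GROUP>",
    "workspace_name": "<WORKSPACE_NAME>",
}


def write_workspace_config(directory: Path, config: dict) -> Path:
    """Write config.json into `directory` and return the file path."""
    config_path = directory / "config.json"
    config_path.write_text(json.dumps(config, indent=4))
    return config_path


# Demo: write to a temporary directory so that no existing file is overwritten.
with tempfile.TemporaryDirectory() as tmp:
    written = write_workspace_config(Path(tmp), workspace_config)
    loaded = json.loads(written.read_text())
    print(sorted(loaded))
```

In a real project you would place the file where `from_config` can find it, for example in the notebook's directory or in an `.azureml` folder like the one in the output above.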


# 2. Build pipeline

In [4]:
%load_ext autoreload
%autoreload 2

# Load the component function from the component's Python file
from prep.prep_component import prepare_data_component

# Print the component's signature and documentation
help(prepare_data_component)

Help on function prepare_data_component in module prep.prep_component:

prepare_data_component(raw_green_data: <mldesigner._input_output.Input object at 0x7fd6cb7fff10>, raw_yellow_data: <mldesigner._input_output.Input object at 0x7fd6cb7ffe80>, prep_green_data: <mldesigner._input_output.Output object at 0x7fd6cb7ffdf0>, prep_yellow_data: <mldesigner._input_output.Output object at 0x7fd6cb9c6250>, merged_data: <mldesigner._input_output.Output object at 0x7fd6cb9c6670>)



In [5]:
parent_dir = ""

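# 1. Load components
# The prep step uses prepare_data_component (imported from Python above),
# so its YAML definition is not loaded here.
# prepare_data = load_component(path=parent_dir + "./prep.yml")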
transform_data = load_component(path=parent_dir + "./transform.yml")
train_model = load_component(path=parent_dir + "./train.yml")
predict_result = load_component(path=parent_dir + "./predict.yml")
score_data = load_component(path=parent_dir + "./score.yml")

# 2. Construct pipeline
@pipeline(default_compute="cpu-cluster", default_datastore="workspaceblobstore")
def nyc_taxi_data_regression(raw_green_data, raw_yellow_data):
    """NYC taxi data regression example."""
    prepare_sample_data = prepare_data_component(
        raw_green_data=raw_green_data,
        raw_yellow_data=raw_yellow_data,
    )
    transform_sample_data = transform_data(
        clean_data=prepare_sample_data.outputs.merged_data
    )
    train_with_sample_data = train_model(
        training_data=transform_sample_data.outputs.transformed_data
    )
    predict_with_sample_data = predict_result(
        model_input=train_with_sample_data.outputs.model_output,
        test_data=train_with_sample_data.outputs.test_data,
    )
    score_with_sample_data = score_data(
        predictions=predict_with_sample_data.outputs.predictions,
        model=train_with_sample_data.outputs.model_output,
    )
    return {
        "pipeline_job_prepped_data": prepare_sample_data.outputs.merged_data,
        "pipeline_job_transformed_data": transform_sample_data.outputs.transformed_data,
        "pipeline_job_trained_model": train_with_sample_data.outputs.model_output,
        "pipeline_job_test_data": train_with_sample_data.outputs.test_data,
        "pipeline_job_predictions": predict_with_sample_data.outputs.predictions,
        "pipeline_job_score_report": score_with_sample_data.outputs.score_report,
    }


pipeline_job = nyc_taxi_data_regression(
    Input(type="uri_file", path=parent_dir + "./data/greenTaxiData.csv"),
    Input(type="uri_file", path=parent_dir + "./data/yellowTaxiData.csv"),
)
# Demonstrate how to change a pipeline output setting
pipeline_job.outputs.pipeline_job_prepped_data.mode = "rw_mount"
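
The pipeline function above does not run the steps itself. It only wires each step's outputs to the next step's inputs, and that wiring determines the execution order. The sketch below is an illustration only: it uses the standard-library `graphlib` module (Python 3.9+) rather than the Azure ML SDK, and the step names in the dictionary are labels chosen here to mirror the components above.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps whose outputs it consumes,
# mirroring the wiring inside nyc_taxi_data_regression.
step_dependencies = {
    "prepare_data": set(),
    "transform_data": {"prepare_data"},    # consumes merged_data
    "train_model": {"transform_data"},     # consumes transformed_data
    "predict_result": {"train_model"},     # consumes model_output and test_data
    "score_data": {"predict_result", "train_model"},  # consumes predictions and model_output
}

# A topological sort yields an order in which every step runs
# only after the steps it depends on.
execution_order = list(TopologicalSorter(step_dependencies).static_order())
print(execution_order)
```

Because every step depends on the one before it, the order is fully sequential: prepare, transform, train, predict, score.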

# 3. Submit pipeline job

In [6]:
# submit job to workspace
pipeline_job = ml_client.jobs.create_or_update(
    pipeline_job, experiment_name="pipeline_samples"
)
pipeline_job

Uploading prep (0.01 MBs): 100%|██████████| 7322/7322 [00:01<00:00, 3715.15it/s]

Uploading transform_src (0.01 MBs): 100%|██████████| 5378/5378 [00:00<00:00, 12019.79it/s]



| Experiment | Name | Type | Status | Details Page |
| --- | --- | --- | --- | --- |
| pipeline_samples | tidy_plow_60hrcbp72b | pipeline | Preparing | Link to Azure Machine Learning studio |


In [8]:
print(pipeline_job)

name: tidy_plow_60hrcbp72b
display_name: nyc_taxi_data_regression
type: pipeline
inputs:
  raw_green_data:
    mode: ro_mount
    type: uri_file
    path: azureml://datastores/workspaceblobstore/paths/LocalUpload/05dc73703267d53fab9a873628538cd2/greenTaxiData.csv
  raw_yellow_data:
    mode: ro_mount
    type: uri_file
    path: azureml://datastores/workspaceblobstore/paths/LocalUpload/e6acfe9e7adcc057aff3cbe9ef6cc7df/yellowTaxiData.csv
outputs:
  pipeline_job_prepped_data:
    mode: rw_mount
    type: uri_folder
  pipeline_job_transformed_data:
    mode: rw_mount
    type: uri_folder
  pipeline_job_trained_model:
    mode: rw_mount
    type: uri_folder
  pipeline_job_test_data:
    mode: rw_mount
    type: uri_folder
  pipeline_job_predictions:
    mode: rw_mount
    type: uri_folder
  pipeline_job_score_report:
    mode: rw_mount
    type: uri_folder
status: Preparing
services:
  Tracking:
    job_service_type: Tracking
    endpoint: azureml://australiaeast.api.azureml.ms/mlflow/v1.0
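
The local CSV inputs were uploaded, and the printed job now refers to them with URIs of the form `azureml://datastores/<datastore>/paths/<path>`. The following is a minimal sketch of splitting such a URI into its datastore name and relative path, assuming only the form visible in the output above. `split_datastore_uri` and `DatastorePath` are hypothetical names and are not part of the SDK.

```python
from typing import NamedTuple


class DatastorePath(NamedTuple):
    datastore: str
    path: str


def split_datastore_uri(uri: str) -> DatastorePath:
    """Split 'azureml://datastores/<datastore>/paths/<path>' into its two parts."""
    prefix = "azureml://datastores/"
    if not uri.startswith(prefix):
        raise ValueError(f"Not a datastore URI: {uri!r}")
    datastore, separator, path = uri[len(prefix):].partition("/paths/")
    if not separator or not datastore or not path:
        raise ValueError(f"Missing datastore name or path in: {uri!r}")
    return DatastorePath(datastore, path)


# The green taxi input path copied from the printed job above.
green = split_datastore_uri(
    "azureml://datastores/workspaceblobstore/paths/"
    "LocalUpload/05dc73703267d53fab9a873628538cd2/greenTaxiData.csv"
)
print(green.datastore)
print(green.path)
```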

In [7]:
# Wait until the job completes
ml_client.jobs.stream(pipeline_job.name)

RunId: tidy_plow_60hrcbp72b
Web View: https://ml.azure.com/runs/tidy_plow_60hrcbp72b?wsid=/subscriptions/3f15b3d0-4a6d-468b-aa61-c8a757b7745a/resourcegroups/quantumhack-aml/workspaces/quantumhack-ws

Streaming logs/azureml/executionlogs.txt

[2022-06-02 23:17:16Z] Submitting 1 runs, first five are: ed8e23d6:d591fc4f-fb1e-40a8-8192-b5bb470e3ced
[2022-06-02 23:19:31Z] Completing processing run id d591fc4f-fb1e-40a8-8192-b5bb470e3ced.
[2022-06-02 23:19:31Z] Submitting 1 runs, first five are: ca9d0044:441ec010-5e8c-4d3b-9821-270cba753fe1
[2022-06-02 23:20:08Z] Completing processing run id 441ec010-5e8c-4d3b-9821-270cba753fe1.
[2022-06-02 23:20:09Z] Submitting 1 runs, first five are: 9f0e681f:92fe4ab6-7f68-40d7-b6f3-a4b87c637b9f
[2022-06-02 23:20:38Z] Completing processing run id 92fe4ab6-7f68-40d7-b6f3-a4b87c637b9f.
[2022-06-02 23:20:38Z] Submitting 1 runs, first five are: d8ef5118:e44e4843-b15a-4058-91d8-f5e9a56a5ea0
[2022-06-02 23:20:59Z] Completing processing run id e44e4843-b15a-4058-9
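
`ml_client.jobs.stream` blocks and prints logs until the job finishes. If you only need the final state, polling the job status is an alternative. The following is a minimal sketch of such a loop: `wait_for_job` is a hypothetical helper, the set of terminal statuses is an assumption to check against your SDK version, and the demo uses a fake status source so that it runs without a workspace.

```python
import time
from typing import Callable

# Assumed terminal states; verify them against the statuses your SDK version reports.
TERMINAL_STATUSES = {"Completed", "Failed", "Canceled"}


def wait_for_job(
    fetch_status: Callable[[], str],
    poll_seconds: float = 30.0,
    timeout_seconds: float = 3600.0,
) -> str:
    """Call `fetch_status` repeatedly until it returns a terminal status."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Job still in state {status!r} after {timeout_seconds} seconds"
            )
        time.sleep(poll_seconds)


# Demo with a fake status source instead of a real workspace.
fake_statuses = iter(["Preparing", "Running", "Running", "Completed"])
final_status = wait_for_job(lambda: next(fake_statuses), poll_seconds=0.0)
print(final_status)
```

Against a real workspace, the status source could be `lambda: ml_client.jobs.get(pipeline_job.name).status`, using the `ml_client` and `pipeline_job` objects from the cells above.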

# Next Steps
You can find further examples of running pipeline jobs [here](../).