Copyright (c) Microsoft Corporation. All rights reserved.

# Tutorial: Use machine learning to predict taxi fares

In this tutorial, you use machine learning in Azure Machine Learning service to create a regression model that predicts NYC taxi fare prices.
In this tutorial, you learn how to:

* Download, transform, and clean data using Azure Open Datasets
* Train an automated machine learning regression model
* Calculate model accuracy

If you don't have an Azure subscription, create a free account before you begin. Try the [free or paid version](https://aka.ms/AMLFree) of Azure Machine Learning service today.

## Prerequisites

* Complete the [setup tutorial](https://docs.microsoft.com/azure/machine-learning/service/tutorial-1st-experiment-sdk-setup) if you don't already have an Azure Machine Learning service workspace or notebook virtual machine.
* After you complete the setup tutorial, open the **tutorials/regression-automated-ml.ipynb** notebook using the same notebook server.

This tutorial is also available on [GitHub](https://github.com/Azure/MachineLearningNotebooks/tree/master/tutorials) if you wish to run it in your own [local environment](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/automated-machine-learning/README.md#setup-using-a-local-conda-environment).

## Download and prepare data

Import the necessary packages. The Open Datasets package contains a class representing each data source (`NycTlcGreen` for example) to easily filter date parameters before downloading.

In [13]:
user = "memasanz"

In [1]:
import pandas as pd
from azureml.core import Dataset
from datetime import datetime
from dateutil.relativedelta import relativedelta

Begin by creating a dataframe to hold the taxi data. Then preview the data.

In [2]:
green_taxi_dataset = Dataset.Tabular.from_parquet_files(path="https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/green_taxi_data.parquet")
green_taxi_df = green_taxi_dataset.to_pandas_dataframe()
green_taxi_df.head(10)

Unnamed: 0,vendorID,lpepPickupDatetime,lpepDropoffDatetime,passengerCount,tripDistance,puLocationId,doLocationId,pickupLongitude,pickupLatitude,dropoffLongitude,...,fareAmount,extra,mtaTax,improvementSurcharge,tipAmount,tollsAmount,ehailFee,totalAmount,tripType,__index_level_0__
0,2,2015-01-30 18:38:09,2015-01-30 19:01:49,1,1.88,,,-73.996155,40.690903,-73.964287,...,15.0,1.0,0.5,0.3,4.0,0.0,,20.8,1.0,2015-01-30 18:38:09
1,1,2015-01-17 23:21:39,2015-01-17 23:35:16,1,2.7,,,-73.978508,40.687984,-73.955116,...,11.5,0.5,0.5,0.3,2.55,0.0,,15.35,1.0,2015-01-17 23:21:39
2,2,2015-01-16 01:38:40,2015-01-16 01:52:55,1,3.54,,,-73.957787,40.721779,-73.963005,...,13.5,0.5,0.5,0.3,2.8,0.0,,17.6,1.0,2015-01-16 01:38:40
3,2,2015-01-04 17:09:26,2015-01-04 17:16:12,1,1.0,,,-73.919914,40.826023,-73.904839,...,6.5,0.0,0.5,0.3,0.0,0.0,,7.3,1.0,2015-01-04 17:09:26
4,1,2015-01-14 10:10:57,2015-01-14 10:33:30,1,5.1,,,-73.94371,40.825439,-73.982964,...,18.5,0.0,0.5,0.3,3.85,0.0,,23.15,1.0,2015-01-14 10:10:57
5,2,2015-01-19 18:10:41,2015-01-19 18:32:20,1,7.41,,,-73.940918,40.839714,-73.994339,...,24.0,0.0,0.5,0.3,4.8,0.0,,29.6,1.0,2015-01-19 18:10:41
6,2,2015-01-01 15:44:21,2015-01-01 15:50:16,1,1.03,,,-73.985718,40.685646,-73.996773,...,6.5,0.0,0.5,0.3,1.3,0.0,,8.6,1.0,2015-01-01 15:44:21
7,2,2015-01-12 08:01:21,2015-01-12 08:14:52,5,2.94,,,-73.939865,40.789822,-73.952957,...,12.5,0.0,0.5,0.3,0.0,0.0,,13.3,1.0,2015-01-12 08:01:21
8,1,2015-01-16 21:54:26,2015-01-16 22:12:39,1,3.0,,,-73.957939,40.721928,-73.926247,...,14.0,0.5,0.5,0.3,2.0,0.0,,17.3,1.0,2015-01-16 21:54:26
9,2,2015-01-06 06:34:53,2015-01-06 06:44:23,1,2.31,,,-73.943825,40.810257,-73.943062,...,10.0,0.0,0.5,0.3,2.0,0.0,,12.8,1.0,2015-01-06 06:34:53


Now that the initial data is loaded, define a function to create various time-based features from the pickup datetime field. This will create new fields for the month number, day of month, day of week, and hour of day, and will allow the model to factor in time-based seasonality. 

Use the `apply()` function on the dataframe to iteratively apply the `build_time_features()` function to each row in the taxi data.

In [3]:
def build_time_features(vector):
    # Each row arrives as a one-element Series; read it by position
    pickup_datetime = vector.iloc[0]
    month_num = pickup_datetime.month
    day_of_month = pickup_datetime.day
    day_of_week = pickup_datetime.weekday()
    hour_of_day = pickup_datetime.hour
    
    return pd.Series((month_num, day_of_month, day_of_week, hour_of_day))

green_taxi_df[["month_num", "day_of_month","day_of_week", "hour_of_day"]] = green_taxi_df[["lpepPickupDatetime"]].apply(build_time_features, axis=1)
green_taxi_df.head(10)

Unnamed: 0,vendorID,lpepPickupDatetime,lpepDropoffDatetime,passengerCount,tripDistance,puLocationId,doLocationId,pickupLongitude,pickupLatitude,dropoffLongitude,...,tipAmount,tollsAmount,ehailFee,totalAmount,tripType,__index_level_0__,month_num,day_of_month,day_of_week,hour_of_day
0,2,2015-01-30 18:38:09,2015-01-30 19:01:49,1,1.88,,,-73.996155,40.690903,-73.964287,...,4.0,0.0,,20.8,1.0,2015-01-30 18:38:09,1,30,4,18
1,1,2015-01-17 23:21:39,2015-01-17 23:35:16,1,2.7,,,-73.978508,40.687984,-73.955116,...,2.55,0.0,,15.35,1.0,2015-01-17 23:21:39,1,17,5,23
2,2,2015-01-16 01:38:40,2015-01-16 01:52:55,1,3.54,,,-73.957787,40.721779,-73.963005,...,2.8,0.0,,17.6,1.0,2015-01-16 01:38:40,1,16,4,1
3,2,2015-01-04 17:09:26,2015-01-04 17:16:12,1,1.0,,,-73.919914,40.826023,-73.904839,...,0.0,0.0,,7.3,1.0,2015-01-04 17:09:26,1,4,6,17
4,1,2015-01-14 10:10:57,2015-01-14 10:33:30,1,5.1,,,-73.94371,40.825439,-73.982964,...,3.85,0.0,,23.15,1.0,2015-01-14 10:10:57,1,14,2,10
5,2,2015-01-19 18:10:41,2015-01-19 18:32:20,1,7.41,,,-73.940918,40.839714,-73.994339,...,4.8,0.0,,29.6,1.0,2015-01-19 18:10:41,1,19,0,18
6,2,2015-01-01 15:44:21,2015-01-01 15:50:16,1,1.03,,,-73.985718,40.685646,-73.996773,...,1.3,0.0,,8.6,1.0,2015-01-01 15:44:21,1,1,3,15
7,2,2015-01-12 08:01:21,2015-01-12 08:14:52,5,2.94,,,-73.939865,40.789822,-73.952957,...,0.0,0.0,,13.3,1.0,2015-01-12 08:01:21,1,12,0,8
8,1,2015-01-16 21:54:26,2015-01-16 22:12:39,1,3.0,,,-73.957939,40.721928,-73.926247,...,2.0,0.0,,17.3,1.0,2015-01-16 21:54:26,1,16,4,21
9,2,2015-01-06 06:34:53,2015-01-06 06:44:23,1,2.31,,,-73.943825,40.810257,-73.943062,...,2.0,0.0,,12.8,1.0,2015-01-06 06:34:53,1,6,1,6
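
The same four features can also be derived without a row-by-row `apply()`. The following is a minimal sketch using the pandas `.dt` accessor; the helper name `add_time_features` and the two-row sample frame are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def add_time_features(df: pd.DataFrame, column: str = "lpepPickupDatetime") -> pd.DataFrame:
    """Return a copy of df with month, day, weekday, and hour columns derived from `column`."""
    pickup = pd.to_datetime(df[column])
    return df.assign(
        month_num=pickup.dt.month,
        day_of_month=pickup.dt.day,
        day_of_week=pickup.dt.dayofweek,  # Monday=0, same convention as datetime.weekday()
        hour_of_day=pickup.dt.hour,
    )

# Two pickup times taken from the preview above
sample = pd.DataFrame({"lpepPickupDatetime": ["2015-01-30 18:38:09", "2015-01-04 17:09:26"]})
features = add_time_features(sample)
print(features[["month_num", "day_of_month", "day_of_week", "hour_of_day"]])
```

On a frame of this size either approach is fine; the vectorized form mainly matters when the dataframe grows, because it avoids calling a Python function once per row.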


Remove some of the columns that you won't need for training or additional feature building.

In [4]:
columns_to_remove = ["lpepPickupDatetime", "lpepDropoffDatetime", "puLocationId", "doLocationId", "extra", "mtaTax",
                     "improvementSurcharge", "tollsAmount", "ehailFee", "tripType", "rateCodeID", 
                     "storeAndFwdFlag", "paymentType", "fareAmount", "tipAmount"
                    ]
for col in columns_to_remove:
    green_taxi_df.pop(col)
    
green_taxi_df.head(5)

Unnamed: 0,vendorID,passengerCount,tripDistance,pickupLongitude,pickupLatitude,dropoffLongitude,dropoffLatitude,totalAmount,__index_level_0__,month_num,day_of_month,day_of_week,hour_of_day
0,2,1,1.88,-73.996155,40.690903,-73.964287,40.679707,20.8,2015-01-30 18:38:09,1,30,4,18
1,1,1,2.7,-73.978508,40.687984,-73.955116,40.708138,15.35,2015-01-17 23:21:39,1,17,5,23
2,2,1,3.54,-73.957787,40.721779,-73.963005,40.682774,17.6,2015-01-16 01:38:40,1,16,4,1
3,2,1,1.0,-73.919914,40.826023,-73.904839,40.821404,7.3,2015-01-04 17:09:26,1,4,6,17
4,1,1,5.1,-73.94371,40.825439,-73.982964,40.767857,23.15,2015-01-14 10:10:57,1,14,2,10


### Cleanse data 

Run the `describe()` function on the new dataframe to see summary statistics for each field.

In [5]:
green_taxi_df.describe()

Unnamed: 0,vendorID,passengerCount,tripDistance,pickupLongitude,pickupLatitude,dropoffLongitude,dropoffLatitude,totalAmount,month_num,day_of_month,day_of_week,hour_of_day
count,24000.0,24000.0,24000.0,24000.0,24000.0,24000.0,24000.0,24000.0,24000.0,24000.0,24000.0,24000.0
mean,1.777625,1.373625,2.893981,-73.827403,40.68973,-73.81967,40.684436,14.892744,6.5,15.150208,3.266042,13.623458
std,0.41585,1.04618,3.072343,2.821767,1.556082,2.901199,1.599776,12.339749,3.452124,8.432627,1.965772,6.818732
min,1.0,0.0,0.0,-74.357101,0.0,-74.342766,0.0,-120.8,1.0,1.0,0.0,0.0
25%,2.0,1.0,1.05,-73.959175,40.699127,-73.966476,40.699459,8.0,3.75,8.0,2.0,9.0
50%,2.0,1.0,1.93,-73.945049,40.746754,-73.944221,40.747536,11.3,6.5,15.0,3.0,15.0
75%,2.0,1.0,3.7,-73.917089,40.80306,-73.909061,40.791526,17.8,9.25,22.0,5.0,19.0
max,2.0,8.0,154.28,0.0,41.109089,0.0,40.982826,425.0,12.0,30.0,6.0,23.0


From the summary statistics, you see that several fields have outliers or values that will reduce model accuracy. First, filter the pickup lat/long fields to be within the bounds of the Manhattan area. This filters out longer taxi trips and trips that are outliers with respect to their relationship with other features.

Additionally, filter the `tripDistance` field to be at least 0.25 miles but less than 31 miles (the haversine distance between the two lat/long pairs). This eliminates long outlier trips that have inconsistent trip cost.

Lastly, the `totalAmount` field has negative values for the taxi fares, which don't make sense in the context of our model, and the `passengerCount` field has bad data, with a minimum value of zero.

Filter out these anomalies using query functions, and then remove the last few columns unnecessary for training.

In [6]:
final_df = green_taxi_df.query("pickupLatitude>=40.53 and pickupLatitude<=40.88")
final_df = final_df.query("pickupLongitude>=-74.09 and pickupLongitude<=-73.72")
final_df = final_df.query("tripDistance>=0.25 and tripDistance<31")
final_df = final_df.query("passengerCount>0 and totalAmount>0")

columns_to_remove_for_training = ["pickupLongitude", "pickupLatitude", "dropoffLongitude", "dropoffLatitude"]
for col in columns_to_remove_for_training:
    final_df.pop(col)
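
The 31-mile limit is described above as a haversine distance. As a rough illustration of that formula, the sketch below computes the great-circle distance between the opposite corners of the latitude/longitude bounds used in the pickup filters. The function name `haversine_miles` and the Earth-radius constant are assumptions made for this example.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, an approximation

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = math.sin(d_phi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

# Opposite corners of the bounding box used in the pickup filters
diagonal = haversine_miles(40.53, -74.09, 40.88, -73.72)
print(round(diagonal, 1))
```

The diagonal of that box comes out close to 31 miles, which is consistent with the upper bound chosen for `tripDistance`.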

Call `describe()` again on the data to ensure cleansing worked as expected. 

In [7]:
final_df.describe()

Unnamed: 0,vendorID,passengerCount,tripDistance,totalAmount,month_num,day_of_month,day_of_week,hour_of_day
count,23222.0,23222.0,23222.0,23222.0,23222.0,23222.0,23222.0,23222.0
mean,1.778572,1.374688,2.956753,14.838994,6.502541,15.139437,3.274524,13.635087
std,0.415217,1.046995,2.862415,10.3636,3.453589,8.425423,1.964555,6.822877
min,1.0,1.0,0.25,0.01,1.0,1.0,0.0,0.0
25%,2.0,1.0,1.1,8.19,4.0,8.0,2.0,9.0
50%,2.0,1.0,2.0,11.75,7.0,15.0,3.0,15.0
75%,2.0,1.0,3.76,17.88,10.0,22.0,5.0,19.0
max,2.0,8.0,30.84,191.7,12.0,30.0,6.0,23.0
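
Reading the `describe()` output by eye works for a tutorial, but the same check can be expressed in code. The sketch below is one possible way to do it; the helper name `find_violations` and the sample rows are illustrative assumptions, and the rules mirror the query filters applied earlier.

```python
import pandas as pd

def find_violations(df: pd.DataFrame) -> list:
    """Return the names of cleansing rules that at least one row breaks."""
    rules = {
        "tripDistance in [0.25, 31)": (df["tripDistance"] >= 0.25) & (df["tripDistance"] < 31),
        "passengerCount > 0": df["passengerCount"] > 0,
        "totalAmount > 0": df["totalAmount"] > 0,
    }
    return [name for name, passed in rules.items() if not passed.all()]

# Tiny illustrative frame (made-up second row, not the taxi data)
sample = pd.DataFrame({
    "tripDistance": [1.88, 0.10],
    "passengerCount": [1, 2],
    "totalAmount": [20.8, 7.3],
})
print(find_violations(sample))  # ['tripDistance in [0.25, 31)']
```

Calling `find_violations(final_df)` after cleansing should return an empty list if the filters were applied as intended.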


## Configure workspace


Create a workspace object from the existing workspace. A [Workspace](https://docs.microsoft.com/python/api/azureml-core/azureml.core.workspace.workspace?view=azure-ml-py) is a class that accepts your Azure subscription and resource information. It also creates a cloud resource to monitor and track your model runs. `Workspace.from_config()` reads the file **config.json** and loads the authentication details into an object named `ws`. `ws` is used throughout the rest of the code in this tutorial.

In [8]:
from azureml.core.workspace import Workspace
ws = Workspace.from_config()

ws

Workspace.create(name='mm-machine-learning-ws-dev', subscription_id='5da07161-3770-4a4b-aa43-418cbbb627cf', resource_group='mm-machine-learning-rg')
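
`Workspace.from_config()` depends on a **config.json** file that typically holds three keys: `subscription_id`, `resource_group`, and `workspace_name`. If the call fails, a quick check of that file can help. The sketch below is a hypothetical helper (`missing_config_keys` is not part of the SDK) demonstrated against a temporary file with placeholder values.

```python
import json
import tempfile
from pathlib import Path

REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")

def missing_config_keys(config_path):
    """Return the required keys that are absent or empty in a config.json file."""
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    return [key for key in REQUIRED_KEYS if not config.get(key)]

# Demonstration with placeholder values written to a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = Path(tmp_dir) / "config.json"
    demo_path.write_text(json.dumps({
        "subscription_id": "<your-subscription-id>",
        "resource_group": "<your-resource-group>",
    }), encoding="utf-8")
    missing = missing_config_keys(demo_path)

print(missing)  # ['workspace_name']
```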

Save the prepared data as a CSV file in a local `register` folder so that it can be uploaded and registered as a dataset for later use. The code below creates the folder if it doesn't already exist.

In [16]:
import os

cwd = os.getcwd()
print(cwd)
dataset_name = user + '-ds-prepped.csv'
print(dataset_name)
dataset_dir = './register/'
os.makedirs(dataset_dir, exist_ok=True)
file_path = os.path.join(dataset_dir, dataset_name)
final_df.to_csv(file_path, index=False)

/mnt/batch/tasks/shared/LS_root/mounts/clusters/compute-python-version/code/Users/memasanz/regression-automl-nyc-taxi-data
memasanz-ds-prepped.csv


Upload the file from the local `register` folder to the `data/prepped` folder in the workspace's default datastore. The signature of the upload method is:

upload(src_dir, target_path=None, overwrite=False, show_progress=True)

In [18]:
from azureml.core.datastore import Datastore
ds = Datastore.get_default(ws)
ds.upload('register/', target_path='data/prepped', overwrite=True)

from azureml.core.dataset import Dataset
#create a dataset object from the uploaded file
#prepped_dataset = Dataset.File.from_files((ds, 'data/prepped'))
dataset = Dataset.Tabular.from_delimited_files(ds.path('data/prepped/' + dataset_name))
#register dataset
dataset.register(ws, dataset_name, create_new_version=True)

Uploading an estimated of 3 files
Uploading register/ds-prepped.csv
Uploaded register/ds-prepped.csv, 1 files out of an estimated total of 3
Uploading register/memasanz-ds-prepped.csv
Uploaded register/memasanz-ds-prepped.csv, 2 files out of an estimated total of 3
Uploading register/memasanzds-prepped.csv
Uploaded register/memasanzds-prepped.csv, 3 files out of an estimated total of 3
Uploaded 3 files


{
  "source": [
    "('workspaceblobstore', 'data/prepped/memasanz-ds-prepped.csv')"
  ],
  "definition": [
    "GetDatastoreFiles",
    "ParseDelimited",
    "DropColumns",
    "SetColumnTypes"
  ],
  "registration": {
    "id": "371b1cc7-a9f6-4b10-86fc-4ba8538b8c65",
    "name": "memasanz-ds-prepped.csv",
    "version": 1,
    "workspace": "Workspace.create(name='mm-machine-learning-ws-dev', subscription_id='5da07161-3770-4a4b-aa43-418cbbb627cf', resource_group='mm-machine-learning-rg')"
  }
}

In [19]:
#sample of consuming the dataset.

# azureml-core of version 1.0.72 or higher is required
# azureml-dataprep[pandas] of version 1.1.34 or higher is required
from azureml.core import Workspace, Dataset

subscription_id = '5da07161-3770-4a4b-aa43-418cbbb627cf'
resource_group = 'mm-machine-learning-rg'
workspace_name = 'mm-machine-learning-ws-dev'

workspace = Workspace(subscription_id, resource_group, workspace_name)

dataset = Dataset.get_by_name(workspace, name=dataset_name)
dataset.to_pandas_dataframe()

Unnamed: 0,vendorID,passengerCount,tripDistance,totalAmount,__index_level_0__,month_num,day_of_month,day_of_week,hour_of_day
0,2,1,1.88,20.80,2015-01-30 18:38:09,1,30,4,18
1,1,1,2.70,15.35,2015-01-17 23:21:39,1,17,5,23
2,2,1,3.54,17.60,2015-01-16 01:38:40,1,16,4,1
3,2,1,1.00,7.30,2015-01-04 17:09:26,1,4,6,17
4,1,1,5.10,23.15,2015-01-14 10:10:57,1,14,2,10
...,...,...,...,...,...,...,...,...,...
23217,2,1,0.42,5.30,2015-12-21 20:36:02,12,21,0,20
23218,2,1,0.32,5.80,2015-12-16 17:48:50,12,16,2,17
23219,2,1,1.80,11.16,2015-12-22 22:47:05,12,22,1,22
23220,1,1,4.00,17.75,2015-12-20 08:24:12,12,20,6,8


### Train the automatic regression model

Create an experiment object in your workspace. An experiment acts as a container for your individual runs. 

In [22]:
from azureml.core.experiment import Experiment
experiment = Experiment(ws, user + "python-regression-taxi-experiment")

### Create Training Script

In [24]:
import os
script_folder = os.path.join(os.getcwd(), "train")
print(script_folder)
os.makedirs(script_folder, exist_ok=True)

/mnt/batch/tasks/shared/LS_root/mounts/clusters/compute-python-version/code/Users/memasanz/regression-automl-nyc-taxi-data/train


### TODO: ADD PARAMETER FOR DATASET NAME

In [25]:
%%writefile $script_folder/train.py

import os
import sys
import argparse
import joblib
import pandas as pd

from azureml.core import Run
from azureml.core import Dataset
from azureml.core import Workspace

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression


def getRuntimeArgs():
    parser = argparse.ArgumentParser()
    parser.add_argument('--data-path', type=str)
    args = parser.parse_args()
    return args


def main():
    args = getRuntimeArgs()
    run = Run.get_context()

    
    dataset_dir = './dataset/'
    os.makedirs(dataset_dir, exist_ok=True)
    ws = run.experiment.workspace
    print(ws)

    dataset_lt = Dataset.get_by_name(ws, name='memasanz-ds-prepped.csv')
    
    # Load a TabularDataset & save into pandas DataFrame
    df = dataset_lt.to_pandas_dataframe()
    df.to_csv(os.path.join(dataset_dir, 'dataset.csv'), index = False)
    

    lr = model_train(df, run)

    #copying to "outputs" directory, automatically uploads it to Azure ML
    output_dir = './outputs/'
    os.makedirs(output_dir, exist_ok=True)
    joblib.dump(value=lr, filename=os.path.join(output_dir, 'model.pkl'))

def model_train(ds_df, run):

    y_raw = ds_df['totalAmount']
    X_raw = ds_df.drop('totalAmount', axis=1)

    categorical_features = X_raw.select_dtypes(include=['object']).columns
    numeric_features = X_raw.select_dtypes(include=['int64', 'float']).columns

    categorical_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='constant', fill_value="missing")),('onehotencoder', OneHotEncoder(categories='auto', sparse=False))])

    numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])

    feature_engineering_pipeline = ColumnTransformer(
        transformers=[
            ('numeric', numeric_transformer, numeric_features),
            ('categorical', categorical_transformer, categorical_features)
        ], remainder="drop")


    # Train test split
    X_train, X_test, y_train, y_test = train_test_split(X_raw, y_raw, test_size=0.2, random_state=0)

    clf = Pipeline(steps=[('preprocessor', feature_engineering_pipeline),('regr', LinearRegression())])
    clf.fit(X_train, y_train)
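
    # Capture metrics. For a regressor, score() returns R^2 (coefficient of
    # determination), so the "accuracy" values below are R^2 scores.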
    train_acc = clf.score(X_train, y_train)
    test_acc = clf.score(X_test, y_test)
    print("Training accuracy: %.3f" % train_acc)
    print("Test data accuracy: %.3f" % test_acc)

    # Log to Azure ML
    run.log('Train accuracy', train_acc)
    run.log('Test accuracy', test_acc)

    return clf

if __name__ == "__main__":
    main()

Writing /mnt/batch/tasks/shared/LS_root/mounts/clusters/compute-python-version/code/Users/memasanz/regression-automl-nyc-taxi-data/train/train.py
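
The TODO above asks for the dataset name to become a parameter instead of being hard-coded in `train.py`. One possible way to do that with `argparse` is sketched below; the argument name `--dataset-name` and the function name `get_runtime_args` are assumptions for this example, and the default simply repeats the name used in the script.

```python
import argparse

def get_runtime_args(argv=None):
    """Parse command-line arguments; argv=None falls back to sys.argv."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--dataset-name",
        type=str,
        default="memasanz-ds-prepped.csv",  # same name the script currently hard-codes
        help="Name of the registered dataset to train on",
    )
    return parser.parse_args(argv)

args = get_runtime_args(["--dataset-name", "my-ds-prepped.csv"])
print(args.dataset_name)  # my-ds-prepped.csv
```

Inside `main()`, the lookup would then become `Dataset.get_by_name(ws, name=args.dataset_name)`, and the value could be supplied at submission time through the `arguments` parameter of `ScriptRunConfig`, for example `arguments=['--dataset-name', dataset_name]`.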


### Create your compute

In [42]:
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.exceptions import ComputeTargetException
print(user)
compute_name = user + "-cluster"
print(compute_name)

# checks to see if compute target already exists in workspace, else create it
try:
    compute_target = ComputeTarget(workspace=ws, name=compute_name)
except ComputeTargetException:
    config = AmlCompute.provisioning_configuration(vm_size="STANDARD_D13",
                                                   min_nodes=0, 
                                                   max_nodes=1)

    compute_target = ComputeTarget.create(workspace=ws, name=compute_name, provisioning_configuration=config)
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=40)

memasanz
memasanz-cluster


### Create your Run Config

In [43]:
from azureml.core.conda_dependencies import CondaDependencies
dependencies = CondaDependencies()
dependencies.add_pip_package('numpy==1.17.0')
dependencies.add_pip_package('joblib==0.14.1')
dependencies.add_pip_package('scikit-learn')

#     - numpy==1.16.2
#     - scikit-learn==0.20.3
#     - scipy==1.2.1
#     - pandas==0.25.3
#     - joblib==0.13.2

#Create a Run Configuration and add this to your pythonscriptstep
from azureml.core.runconfig import RunConfiguration
run_config = RunConfiguration()
run_config.target = compute_name
run_config.environment.python.conda_dependencies = dependencies
run_config.environment.docker.enabled = True

### Select your training script and create a ScriptRunConfig
A ScriptRunConfig object packages together the environment from a RunConfiguration along with your model training script. This object can then be submitted to your experiment and model training will commence on your remote cluster. 

In this sample, the training script lives in a separate directory that is targeted for training. This separation allows a snapshot of just the relevant code to be stored with the Run in your AML workspace. The <code>train.py</code> file here accesses your registered dataset, trains a model, and saves a pickled version to the run's outputs folder. The trained model is registered in a later step.

ScriptRunConfig documentation: https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.scriptrunconfig?view=azure-ml-py

In [45]:
from azureml.core import ScriptRunConfig
src = ScriptRunConfig(source_directory='./train', script='train.py')
src.run_config = run_config

### Submit the training run
Here, the ScriptRunConfig is submitted as a run, which triggers your model training operation. The cluster you defined above is automatically spun up and the training procedure in ./train/train.py begins. That file contains all the code needed to train and save a pickled version of your trained model. The code below displays the output logs from your training job; you can also monitor training progress inside AML studio.

Note: As you iterate on your model, you should modify the code inside ./train/train.py. The model parameters there were adjusted for rapid training and should not be used for a production scenario.
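
One change worth considering when iterating on `train.py` is which metrics it logs. For a scikit-learn regressor, `score()` returns R², so the values the script logs as accuracy are R² scores. The sketch below computes R² alongside RMSE and MAE with NumPy; the function name `regression_metrics` and the three sample fares are illustrative assumptions.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return RMSE, MAE, and R^2 for two equal-length sequences of numbers."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    ss_res = float(np.sum(errors ** 2))                      # residual sum of squares
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))    # total sum of squares
    return {
        "rmse": float(np.sqrt(np.mean(errors ** 2))),
        "mae": float(np.mean(np.abs(errors))),
        "r2": 1.0 - ss_res / ss_tot,
    }

# Made-up fares and predictions, for illustration only
metrics = regression_metrics([10.0, 20.0, 30.0], [12.0, 18.0, 30.0])
print(metrics)
```

RMSE and MAE are expressed in dollars, which can make them easier to interpret for fare prediction than R² alone. Each value could be logged with `run.log(name, value)` in the same way the script logs its existing metrics.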

In [46]:
from azureml.widgets import RunDetails
run = experiment.submit(config=src)
RunDetails(run).show()
run.wait_for_completion(show_output=True)

_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', '…

RunId: memasanzpython-regression-taxi-experiment_1602615903_fcf15c6a
Web View: https://ml.azure.com/experiments/memasanzpython-regression-taxi-experiment/runs/memasanzpython-regression-taxi-experiment_1602615903_fcf15c6a?wsid=/subscriptions/5da07161-3770-4a4b-aa43-418cbbb627cf/resourcegroups/mm-machine-learning-rg/workspaces/mm-machine-learning-ws-dev

Streaming azureml-logs/20_image_build_log.txt

2020/10/13 19:05:15 Downloading source code...
2020/10/13 19:05:16 Finished downloading source code
2020/10/13 19:05:17 Creating Docker network: acb_default_network, driver: 'bridge'
2020/10/13 19:05:17 Successfully set up Docker network: acb_default_network
2020/10/13 19:05:17 Setting up Docker configuration...
2020/10/13 19:05:18 Successfully set up Docker configuration
2020/10/13 19:05:18 Logging in to registry: 0f5f6637ebe142f38aeb0dd9f78d4097.azurecr.io
2020/10/13 19:05:19 Successfully logged into 0f5f6637ebe142f38aeb0dd9f78d4097.azurecr.io
2020/10/13 19:05:19 Executing step ID: acb_ste

Removing intermediate container 461cddd104a4
 ---> 39778bca4100
Step 9/15 : ENV PATH /azureml-envs/azureml_0f43d1c5aa2e4214a3e1aac40ca0cdb5/bin:$PATH
 ---> Running in e0c665438c34
Removing intermediate container e0c665438c34
 ---> 03063a6b6cef
Step 10/15 : ENV AZUREML_CONDA_ENVIRONMENT_PATH /azureml-envs/azureml_0f43d1c5aa2e4214a3e1aac40ca0cdb5
 ---> Running in 18d09e03d6e5
Removing intermediate container 18d09e03d6e5
 ---> af83d1a7add0
Step 11/15 : ENV LD_LIBRARY_PATH /azureml-envs/azureml_0f43d1c5aa2e4214a3e1aac40ca0cdb5/lib:$LD_LIBRARY_PATH
 ---> Running in a5d1a04245f1
Removing intermediate container a5d1a04245f1
 ---> 1f797bd45baa
Step 12/15 : COPY azureml-environment-setup/spark_cache.py azureml-environment-setup/log4j.properties /azureml-environment-setup/
 ---> 61bdd4f95d3a
Step 13/15 : RUN if [ $SPARK_HOME ]; then /bin/bash -c '$SPARK_HOME/bin/spark-submit  /azureml-environment-setup/spark_cache.py'; fi
 ---> Running in 5e36f74525cd
Removing intermediate container 5e36f74525cd

{'runId': 'memasanzpython-regression-taxi-experiment_1602615903_fcf15c6a',
 'target': 'memasanz-cluster',
 'status': 'Completed',
 'startTimeUtc': '2020-10-13T19:13:35.561673Z',
 'endTimeUtc': '2020-10-13T19:15:25.18759Z',
 'properties': {'_azureml.ComputeTargetType': 'amlcompute',
  'ContentSnapshotId': '315c3960-1602-4d9a-b8ad-e456c8ba637f',
  'ProcessInfoFile': 'azureml-logs/process_info.json',
  'ProcessStatusFile': 'azureml-logs/process_status.json'},
 'inputDatasets': [{'dataset': {'id': '371b1cc7-a9f6-4b10-86fc-4ba8538b8c65'}, 'consumptionDetails': {'type': 'Reference'}}],
 'outputDatasets': [],
 'runDefinition': {'script': 'train.py',
  'command': [],
  'useAbsolutePath': False,
  'arguments': [],
  'sourceDirectoryDataStore': None,
  'framework': 'Python',
  'communicator': 'None',
  'target': 'memasanz-cluster',
  'dataReferences': {},
  'data': {},
  'outputData': {},
  'jobName': None,
  'maxRunDurationSeconds': None,
  'nodeCount': 1,
  'priority': None,
  'environment': {

### Register your model and deploy to an authenticated endpoint
Now you have a trained model that has been saved to your run outputs folder. You can register this model and deploy it to an endpoint by defining an inferencing configuration and providing a scoring script. Here the model is deployed to an Azure Container Instance, which provides an API endpoint that can be used to make predictions with your linear regression model. With key-based authentication enabled on the deployment configuration (for example, `auth_enabled=True`), a key must be provided with every request sent to the API. These keys can be rotated as needed and allow only approved users to access your endpoint.

Azure Container Instances are typically lower cost and useful for dev/test purposes during model development, though we recommend deploying to an Azure Kubernetes Service cluster for production purposes.

Below, an InferenceConfig is created that references a registered environment with the Python dependencies needed for scoring (these should match the versions used during training so that the pickled model loads correctly), along with the scoring script located at <code>./score/score.py</code>. This script loads the trained model upon initialization, transforms data submitted to the API endpoint, makes predictions with the model, and returns formatted results to the user.

<b>Note:</b> You should modify this script during development to more appropriately format your model results. 

Model registration documentation: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-deploy-and-where

Azure Container Instance documentation: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-deploy-azure-container-instance

In [48]:
import os
script_folder = os.path.join(os.getcwd(), "score")
print(script_folder)
os.makedirs(script_folder, exist_ok=True)

/mnt/batch/tasks/shared/LS_root/mounts/clusters/compute-python-version/code/Users/memasanz/regression-automl-nyc-taxi-data/score


In [49]:
%%writefile $script_folder/score.py

import json
import os
import numpy as np
import pandas as pd
import joblib
from inference_schema.schema_decorators import input_schema, output_schema
from inference_schema.parameter_types.standard_py_parameter_type import StandardPythonParameterType

def init():
    global model
    
    # Update to your model's filename
    model_filename = "model.pkl"

    # AZUREML_MODEL_DIR is injected by AML
    model_dir = os.getenv('AZUREML_MODEL_DIR')

    print("Model dir:", model_dir)
    print("Model filename:", model_filename)
    
    model_path = os.path.join(model_dir, model_filename)

    # Replace this line with your model loading code
    model = joblib.load(model_path)

# Define some sample data for automatic generation of swagger interface
input_sample = [{
 "vendorID": 1,
 "passengerCount": 1,
 "tripDistance": 4.2,
 "month_num": 1,
 "day_of_month": 4,
 "day_of_week": 1,
 "hour_of_day": 18
}]
output_sample = [18.2281]

# This will automatically unmarshall the data parameter in the HTTP request
@input_schema('data', StandardPythonParameterType(input_sample))
@output_schema(StandardPythonParameterType(output_sample))
def run(data):
    try:
        input_df = pd.DataFrame(data)
        # The regression model returns predicted fare amounts, not probabilities
        predictions = model.predict(input_df)
        
        result = {"predict_proba": predictions.tolist()}
        return result
    except Exception as e:
        error = str(e)
        return error

Writing /mnt/batch/tasks/shared/LS_root/mounts/clusters/compute-python-version/code/Users/memasanz/regression-automl-nyc-taxi-data/score/score.py


In [50]:
from azureml.core.model import Model
model_name = user + '-python-regression'
trained_model = run.register_model(model_path='outputs/model.pkl', model_name=model_name, tags={'Model Type': 'linear regression'})

In [64]:
from azureml.core.environment import Environment
from azureml.core.model import InferenceConfig

env = Environment('tutorial-env')
cd = CondaDependencies.create(pip_packages=['azureml-dataprep[pandas,fuse]>=1.1.14', 'azureml-defaults', 'inference-schema'], conda_packages = ['scikit-learn==0.22.1'])

env.python.conda_dependencies = cd

# Register environment to re-use later
env.register(workspace = ws)

{
    "databricks": {
        "eggLibraries": [],
        "jarLibraries": [],
        "mavenLibraries": [],
        "pypiLibraries": [],
        "rcranLibraries": []
    },
    "docker": {
        "arguments": [],
        "baseDockerfile": null,
        "baseImage": "mcr.microsoft.com/azureml/intelmpi2018.3-ubuntu16.04:20200821.v1",
        "baseImageRegistry": {
            "address": null,
            "password": null,
            "registryIdentity": null,
            "username": null
        },
        "enabled": false,
        "platform": {
            "architecture": "amd64",
            "os": "Linux"
        },
        "sharedVolumes": true,
        "shmSize": null
    },
    "environmentVariables": {
        "EXAMPLE_ENV_VAR": "EXAMPLE_VALUE"
    },
    "inferencingStackVersion": null,
    "name": "tutorial-env",
    "python": {
        "baseCondaEnvironment": null,
        "condaDependencies": {
            "channels": [
                "anaconda",
                "conda-forge"

In [65]:
from azureml.core.webservice import AciWebservice

aciconfig = AciWebservice.deploy_configuration(cpu_cores=1, 
                                               memory_gb=1, 
                                               tags={"data": "taxi-prepped",  "method" : "sklearn"}, 
                                               description='Predict taxi pricing')

In [66]:
model_name

'memasanz-python-regression'

In [68]:
%%time
from azureml.core.webservice import Webservice
from azureml.core.model import InferenceConfig
from azureml.core.environment import Environment
from azureml.core import Workspace
from azureml.core.model import Model

ws = Workspace.from_config()
model = Model(ws, model_name)


#myenv = Environment.get(workspace=ws, name="tutorial-env", version="1")
myenv = Environment.get(workspace=ws, name="tutorial-env", version="5")
inference_config = InferenceConfig(source_directory='./score', entry_script="score.py", environment=myenv)

service = Model.deploy(workspace=ws, 
                       name=model_name +'-srv2', 
                       models=[model], 
                       inference_config=inference_config, 
                       deployment_config=aciconfig)

service.wait_for_deployment(show_output=True)

Tips: You can try get_logs(): https://aka.ms/debugimage#dockerlog or local deployment: https://aka.ms/debugimage#debug-locally to debug if deployment takes longer than 10 minutes.
Running........................................................................................................
Succeeded
ACI service creation operation finished, operation "Succeeded"
CPU times: user 3.15 s, sys: 289 ms, total: 3.44 s
Wall time: 9min 44s


In [70]:
print('Scoring API available at: {}'.format(service.serialize()['scoringUri']))

Scoring API available at: http://d21758e0-bb0a-4b24-abad-5022856ace22.eastus2.azurecontainer.io/score
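
Once the service is running, it can be called over HTTP. The sketch below shows one way to do that with the Python standard library, assuming the request body wraps the rows in a `data` field to match the `run(data)` signature in `score.py`. The helper names `build_payload` and `score` are hypothetical, and the `Authorization` header is only needed when key authentication is enabled on the service.

```python
import json
import urllib.request

def build_payload(rows):
    """Wrap feature rows in the {"data": [...]} envelope that run(data) expects."""
    return json.dumps({"data": rows}).encode("utf-8")

def score(scoring_uri, rows, api_key=None, timeout=30):
    """POST rows to the scoring endpoint and return the decoded JSON response."""
    headers = {"Content-Type": "application/json"}
    if api_key:  # only needed when key authentication is enabled
        headers["Authorization"] = f"Bearer {api_key}"
    request = urllib.request.Request(scoring_uri, data=build_payload(rows), headers=headers)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))

# Same fields as input_sample in score.py
sample_rows = [{
    "vendorID": 1,
    "passengerCount": 1,
    "tripDistance": 4.2,
    "month_num": 1,
    "day_of_month": 4,
    "day_of_week": 1,
    "hour_of_day": 18,
}]
print(build_payload(sample_rows).decode("utf-8"))
```

A call such as `score(service.scoring_uri, sample_rows)` would send the request to the deployed service; the response shape depends on what `run()` in `score.py` returns.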
