Copyright (c) Microsoft Corporation. All rights reserved.

# Tutorial: Use automated machine learning to predict taxi fares

In this tutorial, you use automated machine learning in Azure Machine Learning service to create a regression model to predict NYC taxi fare prices. This process accepts training data and configuration settings, and automatically iterates through combinations of different feature normalization/standardization methods, models, and hyperparameter settings to arrive at the best model.

In this tutorial you learn the following tasks:

* Download, transform, and clean data using Azure Open Datasets
* Train an automated machine learning regression model
* Calculate model accuracy

If you don't have an Azure subscription, create a free account before you begin. Try the [free or paid version](https://aka.ms/AMLFree) of Azure Machine Learning service today.

## Prerequisites

* Complete the [setup tutorial](https://docs.microsoft.com/azure/machine-learning/service/tutorial-1st-experiment-sdk-setup) if you don't already have an Azure Machine Learning service workspace or notebook virtual machine.
* After you complete the setup tutorial, open the **tutorials/regression-automated-ml.ipynb** notebook using the same notebook server.

This tutorial is also available on [GitHub](https://github.com/Azure/MachineLearningNotebooks/tree/master/tutorials) if you wish to run it in your own [local environment](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/automated-machine-learning/README.md#setup-using-a-local-conda-environment).

## Download and prepare data

In [1]:
pip install azureml-opendatasets azureml-widgets

Note: you may need to restart the kernel to use updated packages.


In [2]:
from azureml.opendatasets import NycTlcGreen
import pandas as pd
from datetime import datetime
from dateutil.relativedelta import relativedelta

In [3]:
green_taxi_df = pd.DataFrame([])
start = datetime.strptime("1/1/2015","%m/%d/%Y")
end = datetime.strptime("1/31/2015","%m/%d/%Y")

for sample_month in range(12):
    temp_df_green = NycTlcGreen(start + relativedelta(months=sample_month), end + relativedelta(months=sample_month)) \
        .to_pandas_dataframe()
    # DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
    green_taxi_df = pd.concat([green_taxi_df, temp_df_green.sample(2000)])

green_taxi_df.head(10)

[Info] read from /tmp/tmpbgzy24pu/https%3A/%2Fazureopendatastorage.azurefd.net/nyctlc/green/puYear=2015/puMonth=1/part-00175-tid-4753095944193949832-fee7e113-666d-4114-9fcb-bcd3046479f3-2745-1.c000.snappy.parquet
[Info] read from /tmp/tmpjo1pke_r/https%3A/%2Fazureopendatastorage.azurefd.net/nyctlc/green/puYear=2015/puMonth=2/part-00007-tid-4753095944193949832-fee7e113-666d-4114-9fcb-bcd3046479f3-2577-1.c000.snappy.parquet
[Info] read from /tmp/tmp9fw_meie/https%3A/%2Fazureopendatastorage.azurefd.net/nyctlc/green/puYear=2015/puMonth=3/part-00133-tid-4753095944193949832-fee7e113-666d-4114-9fcb-bcd3046479f3-2703-1.c000.snappy.parquet
[Info] read from /tmp/tmpderlrfdp/https%3A/%2Fazureopendatastorage.azurefd.net/nyctlc/green/puYear=2015/puMonth=4/part-00073-tid-4753095944193949832-fee7e113-666d-4114-9fcb-bcd3046479f3-2643-1.c000.snappy.parquet
[Info] read from /tmp/tmp3kmmzduu/https%3A/%2Fazureopendatastorage.azurefd.net/nyctlc/green/puYear=2015/puMonth=5/part-00177-tid-4753095944193949832

Unnamed: 0,vendorID,lpepPickupDatetime,lpepDropoffDatetime,passengerCount,tripDistance,puLocationId,doLocationId,pickupLongitude,pickupLatitude,dropoffLongitude,...,paymentType,fareAmount,extra,mtaTax,improvementSurcharge,tipAmount,tollsAmount,ehailFee,totalAmount,tripType
1284614,2,2015-01-14 17:34:34,2015-01-14 17:42:33,1,1.19,,,-73.922218,40.760525,-73.923973,...,2,7.0,1.0,0.5,0.3,0.0,0.0,,8.8,1.0
964688,2,2015-01-23 22:46:37,2015-01-23 22:51:55,1,1.14,,,-73.938675,40.817287,-73.928116,...,2,6.0,0.5,0.5,0.3,0.0,0.0,,7.3,1.0
150724,2,2015-01-13 18:05:06,2015-01-13 18:27:38,1,2.69,,,-73.953712,40.743832,-73.945107,...,2,16.5,1.0,0.5,0.3,0.0,0.0,,18.3,1.0
98174,1,2015-01-10 21:13:26,2015-01-10 21:25:08,1,1.9,,,-73.884331,40.749508,-73.855515,...,1,9.5,0.5,0.5,0.3,1.0,0.0,,11.8,1.0
1373123,1,2015-01-07 15:30:58,2015-01-07 15:31:23,1,0.1,,,-73.916222,40.872177,-73.917351,...,2,20.0,0.0,0.0,0.0,0.0,0.0,,20.0,2.0
91546,2,2015-01-10 20:36:17,2015-01-10 20:44:17,1,2.31,,,-73.959404,40.809402,-73.976746,...,1,9.0,0.5,0.5,0.3,1.5,0.0,,11.8,1.0
969037,2,2015-01-23 23:26:43,2015-01-23 23:29:07,2,0.53,,,-73.969093,40.689613,-73.965034,...,2,4.0,0.5,0.5,0.3,0.0,0.0,,5.3,1.0
1056662,2,2015-01-30 20:04:28,2015-01-30 20:19:58,1,2.33,,,-73.941315,40.713245,-73.935471,...,1,11.5,0.5,0.5,0.3,2.56,0.0,,15.36,1.0
630226,2,2015-01-04 00:31:51,2015-01-04 00:37:25,1,1.37,,,-73.937492,40.801548,-73.940117,...,2,6.5,0.5,0.5,0.3,0.0,0.0,,7.8,1.0
274479,1,2015-01-02 18:18:56,2015-01-02 18:23:48,1,1.5,,,-73.939095,40.757286,-73.930923,...,1,6.5,1.0,0.5,0.3,0.0,0.0,,8.3,1.0


In [4]:
columns_to_remove = ["lpepDropoffDatetime", "puLocationId", "doLocationId", "extra", "mtaTax",
                     "improvementSurcharge", "tollsAmount", "ehailFee", "tripType", "rateCodeID",
                     "storeAndFwdFlag", "paymentType", "fareAmount", "tipAmount"
                    ]
for col in columns_to_remove:
    green_taxi_df.pop(col)

green_taxi_df.head(5)

Unnamed: 0,vendorID,lpepPickupDatetime,passengerCount,tripDistance,pickupLongitude,pickupLatitude,dropoffLongitude,dropoffLatitude,totalAmount
1284614,2,2015-01-14 17:34:34,1,1.19,-73.922218,40.760525,-73.923973,40.774525,8.8
964688,2,2015-01-23 22:46:37,1,1.14,-73.938675,40.817287,-73.928116,40.811798,7.3
150724,2,2015-01-13 18:05:06,1,2.69,-73.953712,40.743832,-73.945107,40.711555,18.3
98174,1,2015-01-10 21:13:26,1,1.9,-73.884331,40.749508,-73.855515,40.747711,11.8
1373123,1,2015-01-07 15:30:58,1,0.1,-73.916222,40.872177,-73.917351,40.872467,20.0


In [5]:
green_taxi_df.describe()

Unnamed: 0,vendorID,passengerCount,tripDistance,pickupLongitude,pickupLatitude,dropoffLongitude,dropoffLatitude,totalAmount
count,24000.0,24000.0,24000.0,24000.0,24000.0,24000.0,24000.0,24000.0
mean,1.783,1.367542,2.869215,-73.81503,40.683074,-73.847521,40.699964,14.702281
std,0.412211,1.045706,2.921003,2.978357,1.642359,2.524397,1.392212,11.751186
min,1.0,0.0,0.0,-74.348526,0.0,-74.187279,0.0,-499.0
25%,2.0,1.0,1.05,-73.959551,40.698855,-73.966839,40.700205,7.8
50%,2.0,1.0,1.9,-73.945068,40.746841,-73.944237,40.747692,11.3
75%,2.0,1.0,3.61,-73.9167,40.803376,-73.908926,40.792242,17.8
max,2.0,9.0,52.8,0.0,40.906429,0.0,41.062893,295.0


In [6]:
final_df = green_taxi_df.query("pickupLatitude>=40.53 and pickupLatitude<=40.88")
final_df = final_df.query("pickupLongitude>=-74.09 and pickupLongitude<=-73.72")
final_df = final_df.query("tripDistance>=0.25 and tripDistance<31")
final_df = final_df.query("passengerCount>0 and totalAmount>0")

columns_to_remove_for_training = ["pickupLongitude", "pickupLatitude", "dropoffLongitude", "dropoffLatitude"]
for col in columns_to_remove_for_training:
    final_df.pop(col)

In [7]:
final_df.describe()

Unnamed: 0,vendorID,passengerCount,tripDistance,totalAmount
count,23253.0,23253.0,23253.0,23253.0
mean,1.784501,1.372296,2.935687,14.771872
std,0.411177,1.051897,2.871841,10.4733
min,1.0,1.0,0.25,0.01
25%,2.0,1.0,1.1,8.16
50%,2.0,1.0,1.97,11.44
75%,2.0,1.0,3.7,17.8
max,2.0,6.0,29.3,210.8


In [8]:
import pandas as pd
from azureml.core import Dataset
from datetime import datetime
from dateutil.relativedelta import relativedelta

Begin by creating a dataframe to hold the taxi data. Then preview the data.

In [9]:
green_taxi_dataset = Dataset.Tabular.from_parquet_files(path="https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/green_taxi_data.parquet")
green_taxi_df = green_taxi_dataset.to_pandas_dataframe()
green_taxi_df.head(10)

Unnamed: 0,vendorID,lpepPickupDatetime,lpepDropoffDatetime,passengerCount,tripDistance,puLocationId,doLocationId,pickupLongitude,pickupLatitude,dropoffLongitude,...,fareAmount,extra,mtaTax,improvementSurcharge,tipAmount,tollsAmount,ehailFee,totalAmount,tripType,__index_level_0__
0,2,2015-01-30 18:38:09,2015-01-30 19:01:49,1,1.88,,,-73.996155,40.690903,-73.964287,...,15.0,1.0,0.5,0.3,4.0,0.0,,20.8,1.0,2015-01-30 18:38:09
1,1,2015-01-17 23:21:39,2015-01-17 23:35:16,1,2.7,,,-73.978508,40.687984,-73.955116,...,11.5,0.5,0.5,0.3,2.55,0.0,,15.35,1.0,2015-01-17 23:21:39
2,2,2015-01-16 01:38:40,2015-01-16 01:52:55,1,3.54,,,-73.957787,40.721779,-73.963005,...,13.5,0.5,0.5,0.3,2.8,0.0,,17.6,1.0,2015-01-16 01:38:40
3,2,2015-01-04 17:09:26,2015-01-04 17:16:12,1,1.0,,,-73.919914,40.826023,-73.904839,...,6.5,0.0,0.5,0.3,0.0,0.0,,7.3,1.0,2015-01-04 17:09:26
4,1,2015-01-14 10:10:57,2015-01-14 10:33:30,1,5.1,,,-73.94371,40.825439,-73.982964,...,18.5,0.0,0.5,0.3,3.85,0.0,,23.15,1.0,2015-01-14 10:10:57
5,2,2015-01-19 18:10:41,2015-01-19 18:32:20,1,7.41,,,-73.940918,40.839714,-73.994339,...,24.0,0.0,0.5,0.3,4.8,0.0,,29.6,1.0,2015-01-19 18:10:41
6,2,2015-01-01 15:44:21,2015-01-01 15:50:16,1,1.03,,,-73.985718,40.685646,-73.996773,...,6.5,0.0,0.5,0.3,1.3,0.0,,8.6,1.0,2015-01-01 15:44:21
7,2,2015-01-12 08:01:21,2015-01-12 08:14:52,5,2.94,,,-73.939865,40.789822,-73.952957,...,12.5,0.0,0.5,0.3,0.0,0.0,,13.3,1.0,2015-01-12 08:01:21
8,1,2015-01-16 21:54:26,2015-01-16 22:12:39,1,3.0,,,-73.957939,40.721928,-73.926247,...,14.0,0.5,0.5,0.3,2.0,0.0,,17.3,1.0,2015-01-16 21:54:26
9,2,2015-01-06 06:34:53,2015-01-06 06:44:23,1,2.31,,,-73.943825,40.810257,-73.943062,...,10.0,0.0,0.5,0.3,2.0,0.0,,12.8,1.0,2015-01-06 06:34:53


Remove some of the columns that you won't need for training or additional feature building. Automated machine learning will automatically handle time-based features such as `lpepPickupDatetime`.

In [10]:
columns_to_remove = ["lpepDropoffDatetime", "puLocationId", "doLocationId", "extra", "mtaTax",
                     "improvementSurcharge", "tollsAmount", "ehailFee", "tripType", "rateCodeID", 
                     "storeAndFwdFlag", "paymentType", "fareAmount", "tipAmount"
                    ]
for col in columns_to_remove:
    green_taxi_df.pop(col)
    
green_taxi_df.head(5)

Unnamed: 0,vendorID,lpepPickupDatetime,passengerCount,tripDistance,pickupLongitude,pickupLatitude,dropoffLongitude,dropoffLatitude,totalAmount,__index_level_0__
0,2,2015-01-30 18:38:09,1,1.88,-73.996155,40.690903,-73.964287,40.679707,20.8,2015-01-30 18:38:09
1,1,2015-01-17 23:21:39,1,2.7,-73.978508,40.687984,-73.955116,40.708138,15.35,2015-01-17 23:21:39
2,2,2015-01-16 01:38:40,1,3.54,-73.957787,40.721779,-73.963005,40.682774,17.6,2015-01-16 01:38:40
3,2,2015-01-04 17:09:26,1,1.0,-73.919914,40.826023,-73.904839,40.821404,7.3,2015-01-04 17:09:26
4,1,2015-01-14 10:10:57,1,5.1,-73.94371,40.825439,-73.982964,40.767857,23.15,2015-01-14 10:10:57


### Cleanse data 

Run the `describe()` function on the new dataframe to see summary statistics for each field.

In [11]:
green_taxi_df.describe()

Unnamed: 0,vendorID,passengerCount,tripDistance,pickupLongitude,pickupLatitude,dropoffLongitude,dropoffLatitude,totalAmount
count,24000.0,24000.0,24000.0,24000.0,24000.0,24000.0,24000.0,24000.0
mean,1.777625,1.373625,2.893981,-73.827403,40.68973,-73.81967,40.684436,14.892744
std,0.41585,1.04618,3.072343,2.821767,1.556082,2.901199,1.599776,12.339749
min,1.0,0.0,0.0,-74.357101,0.0,-74.342766,0.0,-120.8
25%,2.0,1.0,1.05,-73.959175,40.699127,-73.966476,40.699459,8.0
50%,2.0,1.0,1.93,-73.945049,40.746754,-73.944221,40.747536,11.3
75%,2.0,1.0,3.7,-73.917089,40.80306,-73.909061,40.791526,17.8
max,2.0,8.0,154.28,0.0,41.109089,0.0,40.982826,425.0


From the summary statistics, you see that several fields have outliers or values that will reduce model accuracy. First, filter the pickup lat/long fields to be within the bounds of the Manhattan area. This filters out longer taxi trips and trips that are outliers with respect to their relationship with other features.

Additionally, filter the `tripDistance` field to be at least 0.25 miles but less than 31 miles (the haversine distance between the two lat/long pairs). This eliminates long outlier trips that have inconsistent trip cost.
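As a rough check on the 31-mile cap, the sketch below computes the haversine (great-circle) distance between opposite corners of the pickup bounding box used in the query filters. The function name `haversine_miles` and the Earth radius constant are assumptions made for illustration; they are not part of the tutorial's code.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = radians(lon2 - lon1)
    # Haversine formula: a is the squared half-chord length between the points
    a = sin(d_phi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(d_lambda / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))


# Opposite corners of the latitude/longitude bounds used in the filters
corner_distance = haversine_miles(40.53, -74.09, 40.88, -73.72)
print(round(corner_distance, 2))
```

The result comes out close to 31 miles, which suggests the distance cap was chosen to match the diagonal of the bounding box.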

Lastly, the `totalAmount` field has negative values for the taxi fares, which don't make sense in the context of our model, and the `passengerCount` field has bad data with the minimum values being zero.

Filter out these anomalies using query functions, and then remove the last few columns unnecessary for training.

In [12]:
final_df = green_taxi_df.query("pickupLatitude>=40.53 and pickupLatitude<=40.88")
final_df = final_df.query("pickupLongitude>=-74.09 and pickupLongitude<=-73.72")
final_df = final_df.query("tripDistance>=0.25 and tripDistance<31")
final_df = final_df.query("passengerCount>0 and totalAmount>0")

columns_to_remove_for_training = ["pickupLongitude", "pickupLatitude", "dropoffLongitude", "dropoffLatitude"]
for col in columns_to_remove_for_training:
    final_df.pop(col)

Call `describe()` again on the data to ensure cleansing worked as expected. You now have a prepared and cleansed set of taxi data to use for machine learning model training.

In [13]:
final_df.describe()

Unnamed: 0,vendorID,passengerCount,tripDistance,totalAmount
count,23222.0,23222.0,23222.0,23222.0
mean,1.778572,1.374688,2.956753,14.838994
std,0.415217,1.046995,2.862415,10.3636
min,1.0,1.0,0.25,0.01
25%,2.0,1.0,1.1,8.19
50%,2.0,1.0,2.0,11.75
75%,2.0,1.0,3.76,17.88
max,2.0,8.0,30.84,191.7


## Configure workspace


Create a workspace object from the existing workspace. A [Workspace](https://docs.microsoft.com/python/api/azureml-core/azureml.core.workspace.workspace?view=azure-ml-py) is a class that accepts your Azure subscription and resource information. It also creates a cloud resource to monitor and track your model runs. `Workspace.from_config()` reads the file **config.json** and loads the authentication details into an object named `ws`. `ws` is used throughout the rest of the code in this tutorial.

In [14]:
from azureml.core.workspace import Workspace
ws = Workspace.from_config()

## Split the data into train and test sets

Split the data into training and test sets by using the `train_test_split` function in the `scikit-learn` library. This function splits the rows of `final_df` into a training set (`x_train`) for model training and a test set (`x_test`) for evaluation. Both sets still contain the label column `totalAmount` at this point; it is separated from the test set later, before predictions are made. The `test_size` parameter determines the percentage of data to allocate to testing. The `random_state` parameter sets a seed for the random generator, so that your train-test splits are deterministic.

In [15]:
from sklearn.model_selection import train_test_split

x_train, x_test = train_test_split(final_df, test_size=0.2, random_state=223)

The purpose of this step is to have data points to test the finished model that haven't been used to train the model, in order to measure true accuracy. 

In other words, a well-trained model should be able to accurately make predictions from data it hasn't already seen. You now have data prepared for auto-training a machine learning model.
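To illustrate what a seeded split does, here is a minimal pure-Python sketch. It is not scikit-learn's implementation, and `seeded_split` is a hypothetical helper; it only shows that a fixed seed makes the shuffle, and therefore the split, repeatable.

```python
import random


def seeded_split(rows, test_size=0.2, seed=223):
    """Shuffle rows with a fixed seed, then hold out a fraction for testing."""
    rng = random.Random(seed)       # private generator, so the global state is untouched
    shuffled = list(rows)           # copy, so the caller's data is not reordered
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(100))
train_a, test_a = seeded_split(rows)
train_b, test_b = seeded_split(rows)

print(len(train_a), len(test_a))
print(train_a == train_b and test_a == test_b)
```

Calling the function twice with the same seed yields identical train and test sets, which is the property `random_state` provides in the tutorial's split.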

## Automatically train a model

To automatically train a model, take the following steps:
1. Define settings for the experiment run. Attach your training data to the configuration, and modify settings that control the training process.
1. Submit the experiment for model tuning. After submitting the experiment, the process iterates through different machine learning algorithms and hyperparameter settings, adhering to your defined constraints. It chooses the best-fit model by optimizing an accuracy metric.

### Define training settings

Define the experiment parameters and model settings for training. View the full list of [settings](https://docs.microsoft.com/azure/machine-learning/service/how-to-configure-auto-train). Submitting the experiment with these default settings will take approximately 20 minutes, but if you want a shorter run time, reduce the `experiment_timeout_hours` parameter.


|Property| Value in this tutorial |Description|
|----|----|---|
|**iteration_timeout_minutes**|10|Time limit in minutes for each iteration. Increase this value for larger datasets that need more time for each iteration.|
|**experiment_timeout_hours**|0.3|Maximum amount of time in hours that all iterations combined can take before the experiment terminates.|
|**enable_early_stopping**|True|Flag to enable early termination if the score is not improving in the short term.|
|**primary_metric**| spearman_correlation | Metric that you want to optimize. The best-fit model will be chosen based on this metric.|
|**featurization**| auto | By using auto, the experiment can preprocess the input data (handling missing data, converting text to numeric, etc.)|
|**verbosity**| logging.INFO | Controls the level of logging.|
|**n_cross_validations**|5|Number of cross-validation splits to perform when validation data is not specified.|
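The primary metric in the table, `spearman_correlation`, measures how well the ranking of predicted values agrees with the ranking of actual values. The sketch below computes it in pure Python as the Pearson correlation of the ranks, using average ranks for ties. The helper names are hypothetical, and this is an illustration of the metric, not the code automated machine learning runs internally.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def spearman(actuals, predictions):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(actuals), average_ranks(predictions))


# Made-up fares: predictions are off in value but preserve the ordering
print(spearman([7.3, 8.8, 11.8, 18.3], [6.9, 9.5, 11.0, 20.1]))
```

Because the metric depends only on ordering, a model can score well on it while still being biased in absolute terms, which is why the tutorial also reports error metrics on the test set.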

In [16]:
import logging

automl_settings = {
    "iteration_timeout_minutes": 10,
    "experiment_timeout_hours": 0.3,
    "enable_early_stopping": True,
    "primary_metric": 'spearman_correlation',
    "featurization": 'auto',
    "verbosity": logging.INFO,
    "n_cross_validations": 5
}
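The `n_cross_validations` setting asks for five cross-validation splits. As an illustration of the general k-fold idea (and not necessarily how automated machine learning partitions the data internally), the sketch below divides row indices into k contiguous folds and uses each fold once for validation. `k_fold_indices` is a hypothetical helper.

```python
def k_fold_indices(n_rows, k):
    """Return k (train_indices, validation_indices) pairs covering all rows."""
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size

    splits = []
    for i, validation in enumerate(folds):
        # Training indices are every fold except the one held out
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, validation))
    return splits


splits = k_fold_indices(12, 5)
for train, validation in splits:
    print(len(train), validation)
```

Each row is used for validation exactly once and for training in the other four splits, so the reported score is an average over five held-out evaluations.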

Use your defined training settings as a `**kwargs` parameter to an `AutoMLConfig` object. Additionally, specify your training data and the type of model, which is `regression` in this case.

In [17]:
from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(task='regression',
                             debug_log='automated_ml_errors.log',
                             training_data=x_train,
                             label_column_name="totalAmount",
                             **automl_settings)

Automated machine learning pre-processing steps (feature normalization, handling missing data, converting text to numeric, etc.) become part of the underlying model. When using the model for predictions, the same pre-processing steps applied during training are applied to your input data automatically.

### Train the automatic regression model

Create an experiment object in your workspace. An experiment acts as a container for your individual runs. Pass the defined `automl_config` object to the experiment, and set `show_output` to `True` to view progress during the run.

After starting the experiment, the output shown updates live as the experiment runs. For each iteration, you see the model type, the run duration, and the training accuracy. The field `BEST` tracks the best running training score based on your metric type.

In [18]:
from azureml.core.experiment import Experiment
experiment = Experiment(ws, "Tutorial-NYCTaxi")
local_run = experiment.submit(automl_config, show_output=True)

No run_configuration provided, running on local with default configuration
Running in the active local environment.


2022-01-09:07:40:48,554 INFO     [modeling_bert.py:226] Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
2022-01-09:07:40:48,579 INFO     [modeling_xlnet.py:339] Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .


Experiment,Id,Type,Status,Details Page,Docs Page
Tutorial-NYCTaxi,AutoML_44e803e7-edd8-4ee4-8470-23b47774529a,automl,Preparing,Link to Azure Machine Learning studio,Link to Documentation


Current status: DatasetEvaluation. Gathering dataset statistics.
Current status: FeaturesGeneration. Generating features for the dataset.
Current status: DatasetFeaturization. Beginning to fit featurizers and featurize the dataset.
Current status: DatasetFeaturizationCompleted. Completed fit featurizers and featurizing the dataset.
Current status: DatasetCrossValidationSplit. Generating individually featurized CV splits.

****************************************************************************************************
DATA GUARDRAILS: 

TYPE:         Missing feature values imputation
STATUS:       PASSED
DESCRIPTION:  No feature missing values were detected in the training data.
              Learn more about missing value imputation: https://aka.ms/AutomatedMLFeaturization

****************************************************************************************************

TYPE:         High cardinality feature detection
STATUS:       PASSED
DESCRIPTION:  Your inputs were analyzed

2022-01-09:08:00:09,919 INFO     [logging_handler.py:290] Sending 2618 bytes
2022-01-09:08:00:09,921 INFO     [logging_handler.py:304] Finish uploading in 0.491675 seconds.
2022-01-09:08:02:17,381 INFO     [explanation_client.py:332] Using default datastore for uploads


## Explore the results

Explore the results of automatic training with a [Jupyter widget](https://docs.microsoft.com/python/api/azureml-widgets/azureml.widgets?view=azure-ml-py). The widget allows you to see a graph and table of all individual run iterations, along with training accuracy metrics and metadata. Additionally, you can filter on different accuracy metrics than your primary metric with the dropdown selector.

In [19]:
from azureml.widgets import RunDetails
RunDetails(local_run).show()

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…

### Retrieve the best model

Select the best model from your iterations. The `get_output` function returns the best run and the fitted model for the last fit invocation. By using the overloads on `get_output`, you can retrieve the best run and fitted model for any logged metric or a particular iteration.

In [20]:
best_run, fitted_model = local_run.get_output()
print(best_run)
print(fitted_model)

Run(Experiment: Tutorial-NYCTaxi,
Id: AutoML_44e803e7-edd8-4ee4-8470-23b47774529a_24,
Type: None,
Status: Completed)
RegressionPipeline(pipeline=Pipeline(memory=None,
                                     steps=[('datatransformer',
                                             DataTransformer(enable_dnn=False, enable_feature_sweeping=True, feature_sweeping_config={}, feature_sweeping_timeout=86400, featurization_config=None, force_text_dnn=False, is_cross_validation=True, is_onnx_compatible=False, observer=None, task='regression', working_dir='/mnt/batch/ta...
    gpu_training_param_dict={'processing_unit_type': 'cpu'}
), random_state=None, reg_alpha=1.275, reg_lambda=0.825, subsample=0.9, subsample_freq=7))], verbose=False))], weights=[0.4666666666666667, 0.06666666666666667, 0.06666666666666667, 0.06666666666666667, 0.06666666666666667, 0.06666666666666667, 0.06666666666666667, 0.06666666666666667, 0.06666666666666667]))],
                                     verbose=False),
          

### Test the best model accuracy

Use the best model to run predictions on the test data set to predict taxi fares. The function `predict` uses the best model and predicts the values of y, **trip cost**, from the `x_test` data set. Print the first 10 predicted cost values from `y_predict`.

In [21]:
y_test = x_test.pop("totalAmount")

y_predict = fitted_model.predict(x_test)
print(y_predict[:10])

[12.54276806 11.32192895 13.75267329  8.27077287 15.0324912   6.66122514
 21.84323776 14.15612195 11.51693388 14.69298837]


Calculate the root mean squared error (RMSE) of the results. Convert the `y_test` series to a list to compare to the predicted values. The function `mean_squared_error` takes two arrays of values and calculates the average squared error between them. Taking the square root of the result gives an error in the same units as the y variable, **cost**. It indicates roughly how far the taxi fare predictions are from the actual fares.

In [22]:
from sklearn.metrics import mean_squared_error
from math import sqrt

y_actual = y_test.values.flatten().tolist()
rmse = sqrt(mean_squared_error(y_actual, y_predict))
rmse

3.5336440390880193
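To make the calculation concrete, the following sketch computes the same quantity by hand on a few made-up values (not data from this run). The function name `rmse_by_hand` is hypothetical.

```python
from math import sqrt


def rmse_by_hand(actuals, predictions):
    """Square each error, average the squares, then take the square root."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actuals, predictions)]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Made-up fares in dollars: errors are -2, +2, and -3
example_rmse = rmse_by_hand([10.0, 20.0, 30.0], [12.0, 18.0, 33.0])
print(example_rmse)
```

Here the squared errors are 4, 4, and 9, so the mean squared error is 17/3 and the RMSE is its square root, roughly 2.38 dollars. Because errors are squared before averaging, large misses weigh more heavily than small ones.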

Run the following code to calculate mean absolute percent error (MAPE) by using the full `y_actual` and `y_predict` data sets. This metric calculates an absolute difference between each predicted and actual value and sums all the differences. Then it expresses that sum as a percent of the total of the actual values.

In [23]:
sum_actuals = sum_errors = 0

for actual_val, predict_val in zip(y_actual, y_predict):
    abs_error = actual_val - predict_val
    if abs_error < 0:
        abs_error = abs_error * -1

    sum_errors = sum_errors + abs_error
    sum_actuals = sum_actuals + actual_val

mean_abs_percent_error = sum_errors / sum_actuals
print("Model MAPE:")
print(mean_abs_percent_error)
print()
print("Model Accuracy:")
print(1 - mean_abs_percent_error)

Model MAPE:
0.13093205448143924

Model Accuracy:
0.8690679455185608


From the two prediction accuracy metrics, you see that the model is fairly good at predicting taxi fares from the data set's features: the RMSE is about $3.53, so predictions are typically within ± $4.00 of the actual fare, and the MAPE is approximately 13%.

The traditional machine learning model development process is highly resource-intensive, and requires significant domain knowledge and time investment to run and compare the results of dozens of models. Using automated machine learning is a great way to rapidly test many different models for your scenario.

In [24]:
from azureml.interpret import ExplanationClient

client = ExplanationClient.from_run(best_run)
engineered_explanations = client.download_model_explanation(raw=False)
print(engineered_explanations.get_feature_importance_dict())

2022-01-09:08:02:24,562 INFO     [explanation_client.py:332] Using default datastore for uploads


{'tripDistance_MeanImputer': 6.9709498391900055, '__index_level_0___ModeCatImputer_Hour': 0.3652550208143672, '__index_level_0___ModeCatImputer_DayOfWeek': 0.1999265585672325, '__index_level_0___ModeCatImputer_DayOfYear': 0.08355592060091525, '__index_level_0___ModeCatImputer_Minute': 0.04313071664462477, '__index_level_0___ModeCatImputer_Day': 0.037896085603609776, '__index_level_0___ModeCatImputer_Month': 0.03284310873407971, '__index_level_0___ModeCatImputer_Second': 0.0323609763062722, 'vendorID_ModeCatImputer_LabelEncoder': 0.006505306328075415, 'passengerCount_CharGramCountVectorizer_1': 0.0057434704648023594, 'passengerCount_CharGramCountVectorizer_6': 0.0033266301222509747, 'passengerCount_CharGramCountVectorizer_3': 0.0019003138545717725, 'lpepPickupDatetime_ModeCatImputer_Month': 0.0, '__index_level_0___ModeCatImputer_QuarterOfYear': 0.0, '__index_level_0___ModeCatImputer_WeekOfMonth': 0.0, 'lpepPickupDatetime_ModeCatImputer_Year': 0.0, 'lpepPickupDatetime_ModeCatImputer_DayO

In [25]:
from azureml.interpret import ExplanationClient

client = ExplanationClient.from_run(best_run)
raw_explanations = client.download_model_explanation(raw=True)
print(raw_explanations.get_feature_importance_dict())

2022-01-09:08:10:37,281 INFO     [explanation_client.py:332] Using default datastore for uploads


{'tripDistance': 6.9709498391900055, '__index_level_0__': 0.4700175979301062, 'passengerCount': 0.010181031873355898, 'vendorID': 0.006505306328075415, 'lpepPickupDatetime': 0.0}


In [28]:
# 'Accuracy' is a classification metric and is rejected for a regression run
# (ConfigException: "An invalid value for argument [metric] was provided.").
# Request the best run by a regression metric instead.
automl_run, fitted_model = local_run.get_output(metric='r2_score')

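In [27]:
from azureml.train.automl.runtime.automl_explain_utilities import automl_setup_model_explanations

# Separate features from the label; x_test already had "totalAmount" popped into y_test.
X_train = x_train.drop(columns=["totalAmount"])
y_train = x_train["totalAmount"]

# This is a regression experiment, so the task must be 'regression'.
automl_explainer_setup_obj = automl_setup_model_explanations(fitted_model, X=X_train,
                                                             X_test=x_test, y=y_train,
                                                             task='regression')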

## Clean up resources

Do not complete this section if you plan on running other Azure Machine Learning service tutorials.

### Stop the notebook VM

If you used a cloud notebook server, stop the VM when you are not using it to reduce cost.

1. In your workspace, select **Compute**.
1. Select the **Notebook VMs** tab in the compute page.
1. From the list, select the VM.
1. Select **Stop**.
1. When you're ready to use the server again, select **Start**.

### Delete everything

If you don't plan to use the resources you created, delete them, so you don't incur any charges.

1. In the Azure portal, select **Resource groups** on the far left.
1. From the list, select the resource group you created.
1. Select **Delete resource group**.
1. Enter the resource group name. Then select **Delete**.

You can also keep the resource group but delete a single workspace. Display the workspace properties and select **Delete**.

## Next steps

In this automated machine learning tutorial, you did the following tasks:

> * Configured a workspace and prepared data for an experiment.
> * Trained an automated regression model locally with custom parameters.
> * Explored and reviewed training results.

[Deploy your model](https://docs.microsoft.com/azure/machine-learning/service/tutorial-deploy-models-with-aml) with Azure Machine Learning service.