Copyright (c) Microsoft Corporation. All rights reserved.

# Tutorial: Use automated machine learning to predict SBA loan defaults

In this tutorial, you use automated machine learning in Azure Machine Learning service to create a classification model that predicts whether a Small Business Administration (SBA) loan defaults. This process accepts training data and configuration settings, and automatically iterates through combinations of different feature normalization/standardization methods, models, and hyperparameter settings to arrive at the best model.

In this tutorial you learn the following tasks:

* Load, transform, and clean a dataset registered in your workspace
* Train an automated machine learning classification model
* Evaluate model accuracy

If you don't have an Azure subscription, create a free account before you begin. Try the [free or paid version](https://aka.ms/AMLFree) of Azure Machine Learning service today.

## Prerequisites

* Complete the [setup tutorial](https://docs.microsoft.com/azure/machine-learning/service/tutorial-1st-experiment-sdk-setup) if you don't already have an Azure Machine Learning service workspace or notebook virtual machine.
* After you complete the setup tutorial, open the **tutorials/regression-automated-ml.ipynb** notebook using the same notebook server.

This tutorial is also available on [GitHub](https://github.com/Azure/MachineLearningNotebooks/tree/master/tutorials) if you wish to run it in your own [local environment](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/automated-machine-learning/README.md#setup-using-a-local-conda-environment).

## Download and prepare data

Import the necessary packages. The `Dataset` class from `azureml.core` is used to read datasets that are registered in the workspace.

In [1]:
import pandas as pd
from azureml.core import Dataset
from datetime import datetime
from dateutil.relativedelta import relativedelta

In [2]:
import numpy as np

In [3]:
!pip install --upgrade azureml-sdk azureml-widgets

Requirement already up-to-date: azureml-sdk in /anaconda/envs/azureml_py36/lib/python3.6/site-packages (1.20.0)
Requirement already up-to-date: azureml-widgets in /anaconda/envs/azureml_py36/lib/python3.6/site-packages (1.20.0)


In [3]:
import azureml.core
from azureml.core import Workspace

# Load the workspace from the saved config file
ws = Workspace.from_config()

print('Ready to use Azure ML {} to work with {}'.format(azureml.core.VERSION, ws.name))

Ready to use Azure ML 1.20.0 to work with cloudchronicles


In [4]:
from azureml.core import Dataset

# Get the default datastore
default_ds = ws.get_default_datastore()
default_ds

{
  "name": "workspaceblobstore",
  "container_name": "azureml-blobstore-58173a5e-80b6-4f14-8388-cd34fad101a8",
  "account_name": "cloudchronicle6143683706",
  "protocol": "https",
  "endpoint": "core.windows.net"
}

In [5]:
print("Datasets:")
for dataset_name in list(ws.datasets.keys()):
    dataset = Dataset.get_by_name(ws, dataset_name)
    print("\t", dataset.name, 'version', dataset.version)

Datasets:
	 TD-Pipeline-Created-on-12-30-2020_on_01-21-2021-Clean_Missing_Data-Cleaning_transformation-4c43cba7 version 2
	 TD-Pipeline-Created-on-12-30-2020_on_01-21-2021-Convert_to_Indicator_Values-Indicator_values_transformation-0d1e0744 version 1
	 MD-Pipeline-Created-on-12-30-2020_on_01-21-2021-Train_Model-Trained_model-a8e07745 version 3
	 sba_new version 1
	 sbadata_2 version 3
	 SBAdata version 1


In [12]:
#import dataset
#use version 3
sba_df = Dataset.get_by_name(ws, 'sbadata_2', version = 3)

sba_df = sba_df.to_pandas_dataframe()

In [13]:
# The first row of the data holds the column names: promote it to the header
new_header = sba_df.iloc[0]
# Drop that first row from the data
sba_df = sba_df[1:]
# Set the header row
sba_df.columns = new_header
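
The header promotion above can be reproduced on a small frame. The sketch below uses made-up values and adds `.copy()` and `reset_index`, which the cell above does not use; taking an explicit copy is one common way to avoid the `SettingWithCopyWarning` that in-place edits on a sliced frame can raise.

```python
import pandas as pd

# Toy frame whose first row holds the real column names (values are made up).
raw = pd.DataFrame([
    ["LoanNr_ChkDgt", "Name", "Term"],
    ["1000014003", "ABC HOBBYCRAFT", "84"],
    ["1000024006", "EXAMPLE BAKERY", "60"],
])

header = raw.iloc[0]                 # first row becomes the header
demo = raw.iloc[1:].copy()           # explicit copy, not a slice of `raw`
demo.columns = list(header)          # plain list, so the column index carries no name
demo = demo.reset_index(drop=True)   # renumber rows from 0

print(demo.columns.tolist())
print(demo.shape)
```

Passing a plain list also avoids the `name=0` label that appears on the column index when a row `Series` is assigned directly.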

In [14]:
sba_df.isnull().sum()

0
LoanNr_ChkDgt             0
Name                      3
City                     30
State                    14
Zip                       0
Bank                   1559
BankState              1566
NAICS                     0
ApprovalDate              0
ApprovalFY                0
Term                      0
NoEmp                     0
NewExist                136
CreateJob                 0
RetainedJob               0
FranchiseCode             0
UrbanRural                0
RevLineCr              4528
LowDoc                 2582
ChgOffDate           736465
DisbursementDate       2368
DisbursemnetFY            0
DisbursementGross         0
BalanceGross              0
MIS_Status             1997
ChgOffPrinGr              0
GrAppv                    0
SBA_Appv                  0
dtype: int64

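The null counts above show that several columns contain missing values. Drop the rows that are missing any of the fields used later, then check the counts again. `ChgOffDate` is left untouched because the column is removed before training.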

In [15]:
# Drop rows with nulls in the listed columns. No imputation is used: the dataset is large enough that these rows can be dropped.
sba_df.dropna(subset= ['Name','City','State','BankState','Bank', 'NewExist','RevLineCr','LowDoc','DisbursementDate','MIS_Status'], inplace = True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  


In [16]:
sba_df.isnull().sum()

0
LoanNr_ChkDgt             0
Name                      0
City                      0
State                     0
Zip                       0
Bank                      0
BankState                 0
NAICS                     0
ApprovalDate              0
ApprovalFY                0
Term                      0
NoEmp                     0
NewExist                  0
CreateJob                 0
RetainedJob               0
FranchiseCode             0
UrbanRural                0
RevLineCr                 0
LowDoc                    0
ChgOffDate           725379
DisbursementDate          0
DisbursemnetFY            0
DisbursementGross         0
BalanceGross              0
MIS_Status                0
ChgOffPrinGr              0
GrAppv                    0
SBA_Appv                  0
dtype: int64
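
`dropna(subset=...)` only inspects the listed columns, which is why `ChgOffDate` still reports nulls in the output above. A minimal sketch with made-up rows:

```python
import numpy as np
import pandas as pd

# Made-up rows: `Bank` has one missing value and `ChgOffDate` has two.
loans = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Bank": ["X", np.nan, "Z"],
    "ChgOffDate": [np.nan, "2001-01-01", np.nan],
})

# Only nulls in the listed columns cause a row to be dropped.
cleaned = loans.dropna(subset=["Name", "Bank"])

print(cleaned.isnull().sum())
```

Assigning the result to a new name, rather than using `inplace=True`, keeps the original frame available for comparison.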

In [17]:
# Remove dollar signs and commas from the currency columns
sba_df[['DisbursementGross', 'BalanceGross', 'ChgOffPrinGr', 'GrAppv', 'SBA_Appv']] = \
sba_df[['DisbursementGross', 'BalanceGross', 'ChgOffPrinGr', 'GrAppv', 'SBA_Appv']].applymap(lambda x:x.strip().replace('$','').replace(',',''))

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[k1] = value[k2]
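
The same cleanup can be written with the vectorized `.str` accessor, followed by the cast to float that the notebook performs in a later cell. This is an alternative sketch on made-up values, not the notebook's own code; the exact format of the raw strings is an assumption.

```python
import pandas as pd

# Made-up currency strings with a dollar sign, thousands separators, and trailing space.
money = pd.DataFrame({
    "GrAppv": ["$60,000.00 ", "$5,000.00 "],
    "SBA_Appv": ["$48,000.00 ", "$2,500.00 "],
})

currency_cols = ["GrAppv", "SBA_Appv"]
for col in currency_cols:
    money[col] = (
        money[col]
        .str.strip()
        .str.replace("$", "", regex=False)   # literal match, not a regex anchor
        .str.replace(",", "", regex=False)
        .astype(float)
    )

print(money.dtypes)
```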


In [19]:
sba_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 886251 entries, 1 to 899164
Data columns (total 28 columns):
LoanNr_ChkDgt        886251 non-null object
Name                 886251 non-null object
City                 886251 non-null object
State                886251 non-null object
Zip                  886251 non-null object
Bank                 886251 non-null object
BankState            886251 non-null object
NAICS                886251 non-null object
ApprovalDate         886251 non-null object
ApprovalFY           886251 non-null object
Term                 886251 non-null object
NoEmp                886251 non-null object
NewExist             886251 non-null object
CreateJob            886251 non-null object
RetainedJob          886251 non-null object
FranchiseCode        886251 non-null object
UrbanRural           886251 non-null object
RevLineCr            886251 non-null object
LowDoc               886251 non-null object
ChgOffDate           160872 non-null object
Disbursem

In [20]:
#Changing the type of other columns
sba_df = sba_df.astype({'Zip': 'str', 'NewExist': 'int64', 'UrbanRural': 'str', 'DisbursementGross': 'float', 'BalanceGross': 'float',
                          'ChgOffPrinGr': 'float', 'GrAppv': 'float', 'SBA_Appv': 'float', 'DisbursemnetFY': 'int64', 'NoEmp':'int64'})

In [21]:
sba_df['Percentage_sabapp'] = (sba_df['SBA_Appv']/sba_df['DisbursementGross'])*100

In [23]:
sba_df['LowDoc'].unique()
sba_df = sba_df.astype({'FranchiseCode':'int64'})

In [24]:
#create all the required variables
sba_df['default_status'] = np.where(sba_df['MIS_Status'] == "P I F", 0, 1)
sba_df['Industry'] = sba_df['NAICS'].astype('str').apply(lambda x:x[:2])

In [25]:
# Map the appropriate industry to each record based on the first 2 digits of the NAICS code
sba_df['Industry'] = sba_df['Industry'].map({
    '11': 'Ag/For/Fish/Hunt',
    '21': 'Min/Quar/Oil_Gas_ext',
    '22': 'Utilities',
    '23': 'Construction',
    '31': 'Manufacturing',
    '32': 'Manufacturing',
    '33': 'Manufacturing',
    '42': 'Wholesale_trade',
    '44': 'Retail_trade',
    '45': 'Retail_trade',
    '48': 'Trans/Ware',
    '49': 'Trans/Ware',
    '51': 'Information',
    '52': 'Finance/Insurance',
    '53': 'RE/Rental/Lease',
    '54': 'Prof/Science/Tech',
    '55': 'Mgmt_comp',
    '56': 'Admin_sup/Waste_Mgmt_Rem',
    '61': 'Educational',
    '62': 'Healthcare/Social_assist',
    '71': 'Arts/Entertain/Rec',
    '72': 'Accom/Food_serv',
    '81': 'Other_no_pub',
    '92': 'Public_Admin',
    '0':'Unknown'
})
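
`Series.map` with a dictionary returns `NaN` for any key that is not in the dictionary, so two-digit prefixes missing from the lookup produce nulls in `Industry`. A short sketch on illustrative codes, using a subset of the lookup above:

```python
import pandas as pd

# Subset of the sector lookup used above.
sector_names = {
    "23": "Construction",
    "44": "Retail_trade",
    "45": "Retail_trade",
    "0": "Unknown",
}

# The last code is deliberately chosen so that its prefix is not in the lookup.
naics = pd.Series(["451120", "236115", "0", "999999"])

prefix = naics.astype(str).str[:2]     # first two characters of each code
industry = prefix.map(sector_names)    # unmapped prefixes become NaN

print(industry.tolist())
print(industry.isnull().sum())
```

If unmapped prefixes are expected, `industry.fillna("Unknown")` is one way to fold them into the existing `Unknown` bucket.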

In [27]:
sba_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 886251 entries, 1 to 899164
Data columns (total 31 columns):
LoanNr_ChkDgt        886251 non-null object
Name                 886251 non-null object
City                 886251 non-null object
State                886251 non-null object
Zip                  886251 non-null object
Bank                 886251 non-null object
BankState            886251 non-null object
NAICS                886251 non-null object
ApprovalDate         886251 non-null object
ApprovalFY           886251 non-null object
Term                 886251 non-null object
NoEmp                886251 non-null int64
NewExist             886251 non-null int64
CreateJob            886251 non-null object
RetainedJob          886251 non-null object
FranchiseCode        886251 non-null int64
UrbanRural           886251 non-null object
RevLineCr            886251 non-null object
LowDoc               886251 non-null object
ChgOffDate           160872 non-null object
Disbursement

In [28]:
sba_df['IsFranchise'] = np.where(sba_df['FranchiseCode'] <= 1, 0, 1) 

In [29]:
sba_df = sba_df[(sba_df['NewExist'] == 1) | (sba_df['NewExist'] == 2)]

In [30]:
sba_df = sba_df[(sba_df['LowDoc'] == "Y") | (sba_df['LowDoc'] == "N")]

In [31]:
sba_df = sba_df[(sba_df['RevLineCr'] == 'Y') | (sba_df['RevLineCr'] == 'N')]

In [32]:
# Convert N to 0 and Y to 1
sba_df['RevLineCr'] = np.where(sba_df['RevLineCr'] == 'N',0,1)
sba_df['LowDoc'] = np.where(sba_df['LowDoc'] == 'N', 0, 1)

In [33]:
# Check that it worked
print(sba_df['RevLineCr'].unique())
print(sba_df['LowDoc'].unique())

[0 1]
[1 0]


In [34]:
# Convert ApprovalDate and DisbursementDate columns to datetime values
# ChgOffDate not changed to datetime since it is not of value and will be removed later
sba_df[['ApprovalDate', 'DisbursementDate']] = sba_df[['ApprovalDate', 'DisbursementDate']].apply(pd.to_datetime)

Now that the initial data is cleaned, derive additional features from the fiscal-year and term fields. The following cells cast `ApprovalFY` and `DisbursemnetFY` to integers, flag loans with a term of 240 months or more as backed by real estate, and flag loans that were active during the 2007-2009 recession.

In [35]:
sba_df['ApprovalFY'].unique()

array(['1997', '1980', '2006', '1998', '1999', '2000', '2001', '1971',
       '2002', '2003', '2004', '1972', '1978', '1981', '2005', '1979',
       '1982', '1983', '1984', '1985', '2007', '1986', '1987', '1973',
       '2008', '1988', '1989', '2009', '1991', '1990', '2010', '2011',
       '1992', '1993', '2012', '1994', '2013', '1975', '1976', '1974',
       '2014', '1977', '1969', '1970', '1995', '1996'], dtype=object)

In [36]:
sba_df['DisbursemnetFY'].unique()

array([1999, 1997, 1980, 1998, 2006, 2002, 2001, 2000, 2003, 1982, 2004,
       1971, 2005, 2009, 2007, 2008, 1981, 1972, 1978, 1979, 1996, 2010,
       1995, 2012, 1983, 1985, 1984, 1948, 1987, 1973, 1986, 2011, 1988,
       1989, 2013, 1990, 1991, 2014, 1992, 1993, 1994, 2020, 1974, 2028,
       1975, 1976, 1977, 1969, 1970])

In [37]:
sba_df = sba_df.astype({'ApprovalFY':'int64', 'DisbursemnetFY' : 'int64'})

In [38]:
sba_df['ApprovalFY'].unique()

array([1997, 1980, 2006, 1998, 1999, 2000, 2001, 1971, 2002, 2003, 2004,
       1972, 1978, 1981, 2005, 1979, 1982, 1983, 1984, 1985, 2007, 1986,
       1987, 1973, 2008, 1988, 1989, 2009, 1991, 1990, 2010, 2011, 1992,
       1993, 2012, 1994, 2013, 1975, 1976, 1974, 2014, 1977, 1969, 1970,
       1995, 1996])

In [39]:
# Flag loans backed by real estate (term of 240 months or more)
sba_df = sba_df.astype({'Term':'int64'})
sba_df["RealEstate"] = np.where(sba_df['Term'] >= 240, 1, 0)

In [40]:
sba_df['UrbanRural'].unique()

array(['0', '1', '2'], dtype=object)

In [41]:
# Flag loans that were active during the Great Recession (2007-2009)
sba_df['Recession'] = np.where(((2007 <= sba_df['DisbursemnetFY']) & (sba_df['DisbursemnetFY'] <= 2009)) | 
                                     ((sba_df['DisbursemnetFY'] < 2007) & (sba_df['DisbursemnetFY'] + (sba_df['Term']/12) >= 2007)), 1, 0)

In [42]:
sba_df['Recession'].unique()

array([0, 1])
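
The recession rule above combines two conditions: the loan was disbursed in fiscal years 2007 through 2009, or it was disbursed earlier and its term reaches 2007. A plain-Python sketch of the same rule on made-up loans; `active_in_recession` is a hypothetical helper name:

```python
def active_in_recession(disbursement_fy: int, term_months: int) -> int:
    """Return 1 if a loan overlaps 2007-2009 under the rule used above, else 0."""
    disbursed_in_window = 2007 <= disbursement_fy <= 2009
    # Loans disbursed before 2007 count if disbursement year plus term reaches 2007.
    runs_into_window = (
        disbursement_fy < 2007 and disbursement_fy + term_months / 12 >= 2007
    )
    return int(disbursed_in_window or runs_into_window)


# (disbursement fiscal year, term in months), all made up.
examples = [(2008, 12), (2000, 84), (2000, 60), (2010, 120)]
flags = [active_in_recession(fy, term) for fy, term in examples]
print(flags)
```

Note that under this rule a loan disbursed after 2009 is never flagged, whatever its term.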

In [43]:
# Keep only the records with a disbursement fiscal year through 2010
sba_filtered = sba_df[sba_df['DisbursemnetFY'] <= 2010]

In [44]:
sba_filtered.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 591757 entries, 1 to 899164
Data columns (total 34 columns):
LoanNr_ChkDgt        591757 non-null object
Name                 591757 non-null object
City                 591757 non-null object
State                591757 non-null object
Zip                  591757 non-null object
Bank                 591757 non-null object
BankState            591757 non-null object
NAICS                591757 non-null object
ApprovalDate         591757 non-null datetime64[ns]
ApprovalFY           591757 non-null int64
Term                 591757 non-null int64
NoEmp                591757 non-null int64
NewExist             591757 non-null int64
CreateJob            591757 non-null object
RetainedJob          591757 non-null object
FranchiseCode        591757 non-null int64
UrbanRural           591757 non-null object
RevLineCr            591757 non-null int64
LowDoc               591757 non-null int64
ChgOffDate           112654 non-null object
Disburse

In [45]:
sba_filtered['same_bankstate'] = np.where(sba_filtered['State'] == sba_filtered['BankState'],1,0)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [46]:
sba_filtered['same_bankstate'].unique()

array([0, 1])

In [47]:
sba_filtered['Daystodisbursement'] = (sba_filtered['DisbursementDate'] - sba_filtered['ApprovalDate']).dt.days

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [48]:
# Inspect records where disbursement falls on or before the approval date
sba_check = sba_filtered[sba_filtered['Daystodisbursement'] <= 0]

In [42]:
sba_check[['ApprovalDate','DisbursementDate', 'Daystodisbursement','LoanNr_ChkDgt']]

Unnamed: 0,ApprovalDate,DisbursementDate,Daystodisbursement,LoanNr_ChkDgt
482,1980-06-25,1980-05-12,-44,1003703008
591,2001-04-30,2001-04-30,0,1004375010
1102,1997-03-04,1997-03-04,0,1007554004
1549,1997-03-05,1997-02-28,-5,1010634005
1656,1997-03-05,1997-02-28,-5,1011204007
...,...,...,...,...
897224,1997-02-10,1996-04-13,-303,9952913010
898205,1997-02-20,1997-02-20,0,9975553001
898374,1997-02-21,1987-04-01,-3614,9978813002
898478,1997-02-21,1997-02-21,0,9980473006


In [49]:
#filtering the negative values out from days to disbursement
sba_filtered = sba_filtered[sba_filtered['Daystodisbursement'] >=0]
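
Subtracting two datetime columns yields a timedelta column, and `.dt.days` extracts whole days. The sketch below repeats the calculation on two date pairs from the table above plus one made-up pair, and takes an explicit `.copy()` after filtering, which is one way to avoid the `SettingWithCopyWarning` shown in the earlier outputs. The `same_day` column is hypothetical and only demonstrates an assignment on the copied frame.

```python
import pandas as pd

# First two pairs come from the table above; the third pair is made up.
dates = pd.DataFrame({
    "ApprovalDate": pd.to_datetime(["1980-06-25", "1997-03-05", "2005-01-10"]),
    "DisbursementDate": pd.to_datetime(["1980-05-12", "1997-02-28", "2005-02-09"]),
})

dates["Daystodisbursement"] = (
    dates["DisbursementDate"] - dates["ApprovalDate"]
).dt.days

# Keep non-negative gaps; .copy() makes the result an independent frame.
valid = dates[dates["Daystodisbursement"] >= 0].copy()
valid["same_day"] = (valid["Daystodisbursement"] == 0).astype(int)

print(dates["Daystodisbursement"].tolist())
```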

In [44]:
sba_filtered.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 591183 entries, 1 to 899164
Data columns (total 35 columns):
LoanNr_ChkDgt         591183 non-null object
Name                  591183 non-null object
City                  591183 non-null object
State                 591183 non-null object
Zip                   591183 non-null object
Bank                  591183 non-null object
BankState             591183 non-null object
NAICS                 591183 non-null object
ApprovalDate          591183 non-null datetime64[ns]
ApprovalFY            591183 non-null int64
Term                  591183 non-null int64
NoEmp                 591183 non-null int64
NewExist              591183 non-null int64
CreateJob             591183 non-null object
RetainedJob           591183 non-null object
FranchiseCode         591183 non-null int64
UrbanRural            591183 non-null object
RevLineCr             591183 non-null int64
LowDoc                591183 non-null int64
ChgOffDate            112587 non-

In [50]:
sba_filtered = sba_filtered.astype({'CreateJob':'int64', 'RetainedJob' : 'int64'})

In [46]:
sba_filtered['CreateJob'].describe()

count    591183.000000
mean         11.358779
std         289.491592
min           0.000000
25%           0.000000
50%           0.000000
75%           1.000000
max        8800.000000
Name: CreateJob, dtype: float64

In [47]:
sba_filtered.columns

Index(['LoanNr_ChkDgt', 'Name', 'City', 'State', 'Zip', 'Bank', 'BankState',
       'NAICS', 'ApprovalDate', 'ApprovalFY', 'Term', 'NoEmp', 'NewExist',
       'CreateJob', 'RetainedJob', 'FranchiseCode', 'UrbanRural', 'RevLineCr',
       'LowDoc', 'ChgOffDate', 'DisbursementDate', 'DisbursemnetFY',
       'DisbursementGross', 'BalanceGross', 'MIS_Status', 'ChgOffPrinGr',
       'GrAppv', 'SBA_Appv', 'default_status', 'Industry', 'IsFranchise',
       'RealEstate', 'Recession', 'same_bankstate', 'Daystodisbursement'],
      dtype='object', name=0)

In [51]:
sba_filtered.drop(columns = ['LoanNr_ChkDgt','Name','City','State','Zip','Bank','BankState','NAICS','ApprovalDate','ApprovalFY','FranchiseCode','ChgOffDate','DisbursementDate','DisbursemnetFY','MIS_Status','SBA_Appv','ChgOffPrinGr'], inplace=True)

In [52]:
sba_filtered.columns

Index(['Term', 'NoEmp', 'NewExist', 'CreateJob', 'RetainedJob', 'UrbanRural',
       'RevLineCr', 'LowDoc', 'DisbursementGross', 'BalanceGross', 'GrAppv',
       'Percentage_sabapp', 'default_status', 'Industry', 'IsFranchise',
       'RealEstate', 'Recession', 'same_bankstate', 'Daystodisbursement'],
      dtype='object', name=0)

## Configure workspace


Create a workspace object from the existing workspace. A [Workspace](https://docs.microsoft.com/python/api/azureml-core/azureml.core.workspace.workspace?view=azure-ml-py) is a class that accepts your Azure subscription and resource information. It also creates a cloud resource to monitor and track your model runs. `Workspace.from_config()` reads the file **config.json** and loads the authentication details into an object named `ws`. `ws` is used throughout the rest of the code in this tutorial.

In [53]:
from azureml.core.workspace import Workspace
ws = Workspace.from_config()

In [54]:
# Establish target and feature fields
y = sba_filtered['default_status']
X = sba_filtered.drop('default_status', axis=1)

## Split the data into train and test sets

Split the data into training and test sets by using the `train_test_split` function in the `scikit-learn` library. The function returns a training subset and a test subset for each array or dataframe passed in. The `test_size` parameter sets the fraction of the data to hold out for testing. The `random_state` parameter seeds the random generator, so that your train-test splits are deterministic. The second cell splits the full dataframe, label column included, because `AutoMLConfig` takes the training data together with the name of the label column.

In [55]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.25, random_state=223)

In [56]:
from sklearn.model_selection import train_test_split

x_train, x_test = train_test_split(sba_filtered, test_size=0.2, random_state=223)

The purpose of this step is to have data points to test the finished model that haven't been used to train the model, in order to measure true accuracy. 

In other words, a well-trained model should be able to accurately make predictions from data it hasn't already seen. You now have data prepared for auto-training a machine learning model.
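
When the label is a binary class such as `default_status`, `train_test_split` also accepts a `stratify` argument that keeps the class proportions the same in both subsets. The cells above do not use it; the sketch below shows the option on made-up labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 80 zeros and 20 ones.
labels = np.array([0] * 80 + [1] * 20)
features = np.arange(100).reshape(-1, 1)

f_train, f_test, l_train, l_test = train_test_split(
    features,
    labels,
    test_size=0.25,
    random_state=223,
    stratify=labels,   # preserve the 80/20 class ratio in both subsets
)

print(len(l_test), int(l_test.sum()))
```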

## Automatically train a model

To automatically train a model, take the following steps:
1. Define settings for the experiment run. Attach your training data to the configuration, and modify settings that control the training process.
1. Submit the experiment for model tuning. After submitting the experiment, the process iterates through different machine learning algorithms and hyperparameter settings, adhering to your defined constraints. It chooses the best-fit model by optimizing an accuracy metric.

### Define training settings

Define the experiment parameter and model settings for training. View the full list of [settings](https://docs.microsoft.com/azure/machine-learning/service/how-to-configure-auto-train). Submitting the experiment with these default settings will take approximately 20 minutes, but if you want a shorter run time, reduce the `experiment_timeout_hours` parameter.


|Property| Value in this tutorial |Description|
|----|----|---|
|**iteration_timeout_minutes**|10|Time limit in minutes for each iteration. Increase this value for larger datasets that need more time for each iteration.|
|**experiment_timeout_hours**|0.3|Maximum amount of time in hours that all iterations combined can take before the experiment terminates.|
|**enable_early_stopping**|True|Flag to enable early termination if the score is not improving in the short term.|
|**primary_metric**| AUC_weighted | Metric that you want to optimize. The best-fit model will be chosen based on this metric.|
|**featurization**| auto | By using auto, the experiment can preprocess the input data (handling missing data, converting text to numeric, etc.)|
|**verbosity**| logging.INFO | Controls the level of logging.|
|**n_cross_validations**|5|Number of cross-validation splits to perform when validation data is not specified.|

In [57]:
import logging

automl_settings = {
    "iteration_timeout_minutes": 10,
    "experiment_timeout_hours": 0.3,
    "enable_early_stopping": True,
    "primary_metric": 'AUC_weighted',
    "featurization": 'auto',
    "verbosity": logging.INFO,
    "n_cross_validations": 5
}

Use your defined training settings as a `**kwargs` parameter to an `AutoMLConfig` object. Additionally, specify your training data and the type of model, which is `classification` in this case.

In [58]:
from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(task='classification',
                             debug_log='automated_ml_errors.log',
                             training_data=x_train,
                             label_column_name="default_status",
                             **automl_settings)




Automated machine learning pre-processing steps (feature normalization, handling missing data, converting text to numeric, etc.) become part of the underlying model. When using the model for predictions, the same pre-processing steps applied during training are applied to your input data automatically.

### Train the automatic classification model

Create an experiment object in your workspace. An experiment acts as a container for your individual runs. Pass the defined `automl_config` object to the experiment, and set the output to `True` to view progress during the run. 

After starting the experiment, the output shown updates live as the experiment runs. For each iteration, you see the model type, the run duration, and the training accuracy. The field `BEST` tracks the best running training score based on your metric type.

In [None]:
from azureml.core.experiment import Experiment
experiment = Experiment(ws, "sba_loan")
local_run = experiment.submit(automl_config, show_output=True)

No run_configuration provided, running on local with default configuration
Running on local machine
Parent Run ID: AutoML_31e33585-da86-4c0e-baea-1d53c39f2372

Current status: DatasetEvaluation. Gathering dataset statistics.
Current status: FeaturesGeneration. Generating features for the dataset.
Current status: DatasetFeaturization. Beginning to fit featurizers and featurize the dataset.
Current status: DatasetFeaturizationCompleted. Completed fit featurizers and featurizing the dataset.
Current status: DatasetCrossValidationSplit. Generating individually featurized CV splits.

****************************************************************************************************
DATA GUARDRAILS: 

TYPE:         Class balancing detection
STATUS:       PASSED
DESCRIPTION:  Your inputs were analyzed, and all classes are balanced in your training data.
              Learn more about imbalanced data: https://aka.ms/AutomatedMLImbalancedData

**************************************************

## Explore the results

Explore the results of automatic training with a [Jupyter widget](https://docs.microsoft.com/python/api/azureml-widgets/azureml.widgets?view=azure-ml-py). The widget allows you to see a graph and table of all individual run iterations, along with training accuracy metrics and metadata. Additionally, you can filter on different accuracy metrics than your primary metric with the dropdown selector.

In [60]:
from azureml.widgets import RunDetails
RunDetails(local_run).show()

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…

### Retrieve the best model

Select the best model from your iterations. The `get_output` function returns the best run and the fitted model for the last fit invocation. By using the overloads on `get_output`, you can retrieve the best run and fitted model for any logged metric or a particular iteration.

In [61]:
best_run, fitted_model = local_run.get_output()
print(best_run)
print(fitted_model)

Run(Experiment: sba_loan,
Id: AutoML_dc93ea92-93ec-416a-a573-2f09a3b5bd68_1,
Type: None,
Status: Completed)
Pipeline(memory=None,
         steps=[('datatransformer',
                 DataTransformer(enable_dnn=None, enable_feature_sweeping=None,
                                 feature_sweeping_config=None,
                                 feature_sweeping_timeout=None,
                                 featurization_config=None, force_text_dnn=None,
                                 is_cross_validation=None,
                                 is_onnx_compatible=None, logger=None,
                                 observer=None, task=None, working_dir=None)),
                ('MaxAbsScaler', MaxAbsScaler(copy...
                                   colsample_bylevel=1, colsample_bynode=1,
                                   colsample_bytree=1, gamma=0,
                                   learning_rate=0.1, max_delta_step=0,
                                   max_depth=3, min_child_weight=1, mis

### Test the best model accuracy

Use the best model to run predictions on the test data set to predict loan default status. The function `predict` uses the best model and predicts the values of y, **default_status**, from the `x_test` data set. Print the first 10 predicted values from `y_predict`.

In [64]:
y_test = x_test.pop("default_status")

y_predict = fitted_model.predict(x_test)
print(y_predict[:10])

[0 1 0 0 0 1 0 0 0 0]


In [65]:
print(local_run.get_metrics())

{'precision_score_macro': 0.9853253590077333, 'recall_score_macro': 0.9937524633144076, 'AUC_micro': 0.9990718620211689, 'AUC_macro': 0.9981316560161868, 'f1_score_micro': 0.9935933482528384, 'log_loss': 0.02802761589445259, 'average_precision_score_weighted': 0.9977573460849968, 'matthews_correlation': 0.9790413632193523, 'f1_score_macro': 0.989473824514916, 'accuracy': 0.9935933482528384, 'norm_macro_recall': 0.9875049266288152, 'weighted_accuracy': 0.9935248413902121, 'average_precision_score_macro': 0.9946340754560712, 'precision_score_weighted': 0.9937051533131868, 'f1_score_weighted': 0.9936209871924658, 'precision_score_micro': 0.9935933482528384, 'AUC_weighted': 0.9981316549442159, 'average_precision_score_micro': 0.9990871998555569, 'recall_score_weighted': 0.9935933482528384, 'balanced_accuracy': 0.9937524633144076, 'recall_score_micro': 0.9935933482528384}


In [66]:
print(local_run.get_file_names())

['outputs/verifier_results.json']


In [None]:
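import json

# `service` is not defined earlier in this notebook. It is assumed to be a deployed
# web service retrieved from the workspace, for example ws.webservices['sbaloanpredict'].
# A DataFrame has no tolist() method, so convert through .values first.
test = json.dumps({"data": X_test.values.tolist()})
test = bytes(test, encoding='utf8')
y_hat = service.run(input_data=test)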

In [51]:
x_test = x_test.head(2)
x_test

Unnamed: 0,Term,NoEmp,NewExist,CreateJob,RetainedJob,UrbanRural,RevLineCr,LowDoc,DisbursementGross,BalanceGross,ChgOffPrinGr,GrAppv,SBA_Appv,default_status,Industry,IsFranchise,RealEstate,Recession,same_bankstate,Daystodisbursement
79602,36,3,1,0,3,1,0,0,51000.0,0.0,0.0,51000.0,43350.0,0,Admin_sup/Waste_Mgmt_Rem,0,0,1,0,67
821077,68,1,1,16,17,1,1,0,15000.0,0.0,13324.0,15000.0,7500.0,1,Other_no_pub,0,0,1,0,39


In [67]:
import requests
import json

# URL for the web service
scoring_uri = 'http://52.187.233.80:80/api/v1/service/sbaloanpredict/score'
# If the service is authenticated, set the key or token
key = '<your key or token>'

# One record to score, so we get one result back
data = {"data":
            [
                [
                1004285007,
                "SIMPLEX OFFICE SOLUTIONS",
                "ANAHEIM",
                "CA",
                92801,
                "CALIFORNIA BANK & TRUST",
                "CA",
                532420,
                15074,
                2001,
                36,
                1,
                1,
                0,
                0,
                1,
                0,
                "Y",
                "N", 
                17799,
                15095,
                32812,
                0,
                "P I F",
                0,
                30000,
                15000,
                0,
                0,
                0.5,
                0,
                1080,
                16175,
                0
                ]
            ]
        }
# Convert to JSON string
#x_test_list = [x_test.columns.values.tolist()] + x_test.values.tolist()
#input_data = json.dumps({"data": x_test_list})
input_data = json.dumps(data)
print(input_data)
# Set the content type
headers = {'Content-Type': 'application/json'}
# If authentication is enabled, set the authorization header
#headers['Authorization'] = f'Bearer {key}'

# Make the request and display the response
resp = requests.post(scoring_uri, input_data, headers=headers)
#resp = requests.post(scoring_uri, input_data)
print(resp.text)

{"data": [[1004285007, "SIMPLEX OFFICE SOLUTIONS", "ANAHEIM", "CA", 92801, "CALIFORNIA BANK & TRUST", "CA", 532420, 15074, 2001, 36, 1, 1, 0, 0, 1, 0, "Y", "N", 17799, 15095, 32812, 0, "P I F", 0, 30000, 15000, 0, 0, 0.5, 0, 1080, 16175, 0]]}
Length mismatch: Expected axis has 34 elements, new values have 19 elements
Help: https://go.microsoft.com/fwlink/?linkid=2146748
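
The `Length mismatch` message above has the form of the error pandas raises when a list of column names does not match the number of columns in a data frame. This suggests the scoring script expects 19 values per row, while the request sends 34. The following is a minimal sketch of a client-side check to run before posting. It assumes the expected width is 19, as inferred from the error message. The helper `check_row_lengths` and the constant `EXPECTED_COLUMNS` are illustrative names, not part of the Azure Machine Learning SDK.

```python
# Assumption: the service expects 19 values per row (inferred from the error message).
EXPECTED_COLUMNS = 19


def check_row_lengths(rows, expected=EXPECTED_COLUMNS):
    """Return (row index, row length) for every row whose length differs from `expected`."""
    return [(i, len(row)) for i, row in enumerate(rows) if len(row) != expected]


# Placeholder rows: one as wide as the payload above, one as wide as the service appears to expect.
demo_rows = [
    list(range(34)),
    list(range(19)),
]

problems = check_row_lengths(demo_rows)
print(problems)
```

An empty list means every row has the expected width. Any entries identify rows to fix before calling `requests.post`.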


In [68]:
for webservice_name in ws.webservices:
    print(webservice_name)

sbaloanpredict
loanpredictor
sbadata
sbamodel


In [None]:
import json

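# Text fields are quoted and missing values are None so that the list is valid Python.
# Assumption: currency amounts are written as plain numbers (for example, 60000.00).
x_new = [[1000014003, "ABC HOBBYCRAFT", "EVANSVILLE", "IN", 47711, "FIFTH THIRD BANK", "OH", 451120,
          "1997-02-28T00:00:00Z", 1997, 84, 4, 2, None, None, 1, None, "N", True, None,
          "1999-02-28T00:00:00Z", 60000.00, 0.00, "P I F", 0.00, 60000.00, 48000.00]]
print('Loan record: {}'.format(x_new[0]))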

# Convert the array to a serializable list in a JSON document
input_json = json.dumps({"data": x_new})

# Call the web service, passing the input data (the web service will also accept the data in binary format)
predictions = service.run(input_data = input_json)

# Get the predicted class - it'll be the first (and only) one.
predicted_classes = json.loads(predictions)
print(predicted_classes[0])

### Transparency

View the updated featurization summary.


In [None]:
custom_featurizer = fitted_model.named_steps['datatransformer']
df = custom_featurizer.get_featurization_summary()
pd.DataFrame(data=df)

In [None]:
df = custom_featurizer.get_featurization_summary(is_user_friendly=False)
pd.DataFrame(data=df)

In [None]:
df = custom_featurizer.get_stats_feature_type_summary()
pd.DataFrame(data=df)

Calculate the `root mean squared error` of the results. Convert the `y_test` dataframe to a list to compare to the predicted values. The function `mean_squared_error` takes two arrays of values and calculates the average squared error between them. Taking the square root of the result gives an error in the same units as the y variable, **cost**. It indicates roughly how far the taxi fare predictions are from the actual fares.

In [None]:
from sklearn.metrics import mean_squared_error
from math import sqrt

y_actual = y_test.values.flatten().tolist()
rmse = sqrt(mean_squared_error(y_actual, y_predict))
rmse
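
To make the calculation concrete, here is a small worked sketch that computes the same quantity by hand with the standard library only. The fare values and the function name `rmse_manual` are made up for illustration and are not taken from the tutorial's data.

```python
from math import sqrt


def rmse_manual(actual, predicted):
    """Square root of the mean of squared differences between two equal-length lists."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Illustrative fares in dollars (hypothetical values).
demo_actual = [10.0, 20.0, 30.0]
demo_predicted = [12.0, 18.0, 33.0]

# Errors are 2, -2, and 3, so the squared errors are 4, 4, and 9, and the mean is 17/3.
demo_rmse = rmse_manual(demo_actual, demo_predicted)
print(round(demo_rmse, 2))
```

Because the square root undoes the squaring, the result is in dollars, the same unit as the fares.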

Run the following code to calculate mean absolute percent error (MAPE) by using the full `y_actual` and `y_predict` data sets. This metric calculates the absolute difference between each predicted and actual value and sums all the differences. It then divides that sum by the total of the actual values, so the result is a fraction (multiply by 100 to express it as a percent).

In [None]:
sum_actuals = sum_errors = 0

for actual_val, predict_val in zip(y_actual, y_predict):
    # Accumulate the absolute error and the actual value for each prediction
    sum_errors += abs(actual_val - predict_val)
    sum_actuals += actual_val

mean_abs_percent_error = sum_errors / sum_actuals
print("Model MAPE:")
print(mean_abs_percent_error)
print()
print("Model Accuracy:")
print(1 - mean_abs_percent_error)
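
The calculation above divides the total absolute error by the total of the actual values, which weights each prediction by the size of its fare. A per-row average of percentage errors can give a noticeably different number. The sketch below compares the two on made-up values. The function names `weighted_ape` and `mean_ape` are illustrative, and the second function assumes no actual value is zero.

```python
def weighted_ape(actual, predicted):
    """Total absolute error divided by the total of the actual values."""
    total_error = sum(abs(a - p) for a, p in zip(actual, predicted))
    return total_error / sum(actual)


def mean_ape(actual, predicted):
    """Average of per-row absolute percentage errors (assumes no actual value is zero)."""
    ratios = [abs(a - p) / a for a, p in zip(actual, predicted)]
    return sum(ratios) / len(ratios)


# Hypothetical fares: both predictions are off by $5, but on very different fares.
demo_fares = [5.0, 50.0]
demo_preds = [10.0, 45.0]

demo_weighted = weighted_ape(demo_fares, demo_preds)  # (5 + 5) / 55
demo_mean = mean_ape(demo_fares, demo_preds)          # (1.0 + 0.1) / 2
print(round(demo_weighted, 4), round(demo_mean, 4))
```

The weighted form is less sensitive to large percentage misses on small fares, which is one reason to prefer it when fares vary widely.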

From the two prediction accuracy metrics, you see that the model is fairly good at predicting taxi fares from the data set's features: predictions are typically within ±$4.00 of the actual fare, an error of approximately 15%.

The traditional machine learning model development process is highly resource-intensive, and requires significant domain knowledge and time investment to run and compare the results of dozens of models. Using automated machine learning is a great way to rapidly test many different models for your scenario.

## Clean up resources

Skip this section if you plan to run other Azure Machine Learning service tutorials.

### Stop the notebook VM

If you used a cloud notebook server, stop the VM when you are not using it to reduce cost.

1. In your workspace, select **Compute**.
1. Select the **Notebook VMs** tab in the compute page.
1. From the list, select the VM.
1. Select **Stop**.
1. When you're ready to use the server again, select **Start**.

### Delete everything

If you don't plan to use the resources you created, delete them, so you don't incur any charges.

1. In the Azure portal, select **Resource groups** on the far left.
1. From the list, select the resource group you created.
1. Select **Delete resource group**.
1. Enter the resource group name. Then select **Delete**.

You can also keep the resource group but delete a single workspace. Display the workspace properties and select **Delete**.

## Next steps

In this automated machine learning tutorial, you did the following tasks:

* Configured a workspace and prepared data for an experiment.
* Trained an automated regression model locally with custom parameters.
* Explored and reviewed training results.

[Deploy your model](https://docs.microsoft.com/azure/machine-learning/service/tutorial-deploy-models-with-aml) with Azure Machine Learning service.