Copyright (c) Microsoft Corporation. All rights reserved.

# Tutorial: Use machine learning to predict taxi fares

In this tutorial, you use machine learning in Azure Machine Learning service to create a regression model that predicts NYC taxi fare prices.
You learn how to complete the following tasks:

* Download, transform, and clean data using Azure Open Datasets
* Train a linear regression model
* Calculate model accuracy

## Download and prepare data

Import the necessary packages. The taxi data is loaded with the `Dataset` class from `azureml.core`; the Open Datasets package also contains a class representing each data source (`NycTlcGreen` for example).

In [None]:
user = "memasanz"

In [None]:
import pandas as pd
from azureml.core import Dataset
from datetime import datetime
from dateutil.relativedelta import relativedelta

Begin by creating a dataframe to hold the taxi data. Then preview the data.

In [None]:
green_taxi_dataset = Dataset.Tabular.from_parquet_files(path="https://automlsamplenotebookdata.blob.core.windows.net/automl-sample-notebook-data/green_taxi_data.parquet")
green_taxi_df = green_taxi_dataset.to_pandas_dataframe()
green_taxi_df.head(10)

In [None]:
green_taxi_df.shape

Now that the initial data is loaded, define a function to create various time-based features from the pickup datetime field. This will create new fields for the month number, day of month, day of week, and hour of day, and will allow the model to factor in time-based seasonality. 

Use the `apply()` function on the dataframe to iteratively apply the `build_time_features()` function to each row in the taxi data.

In [None]:
def build_time_features(vector):
    # Each row arrives as a Series labelled by column name, so use positional access
    pickup_datetime = vector.iloc[0]
    month_num = pickup_datetime.month
    day_of_month = pickup_datetime.day
    day_of_week = pickup_datetime.weekday()
    hour_of_day = pickup_datetime.hour
    
    return pd.Series((month_num, day_of_month, day_of_week, hour_of_day))

green_taxi_df[["month_num", "day_of_month","day_of_week", "hour_of_day"]] = green_taxi_df[["lpepPickupDatetime"]].apply(build_time_features, axis=1)
green_taxi_df.head(10)
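Row-wise `apply()` can be slow on large dataframes. As an alternative sketch, the same four fields could be derived with the vectorized pandas `.dt` accessor. The three timestamps below are made-up sample values, and `sample` is a hypothetical stand-in for the taxi dataframe.

```python
import pandas as pd

# Made-up pickup times; the real values come from the lpepPickupDatetime column.
sample = pd.DataFrame({
    "lpepPickupDatetime": pd.to_datetime([
        "2016-01-04 18:30:00",
        "2016-02-14 07:05:00",
        "2016-12-31 23:59:00",
    ])
})

# The .dt accessor computes each feature for the whole column at once.
pickup = sample["lpepPickupDatetime"].dt
sample["month_num"] = pickup.month
sample["day_of_month"] = pickup.day
sample["day_of_week"] = pickup.dayofweek  # Monday=0, same convention as datetime.weekday()
sample["hour_of_day"] = pickup.hour

print(sample)
```

Both approaches should produce the same columns; the vectorized form avoids calling a Python function once per row.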

Remove some of the columns that you won't need for training or additional feature building.

In [None]:
columns_to_remove = ["lpepPickupDatetime", "lpepDropoffDatetime", "puLocationId", "doLocationId", "extra", "mtaTax",
                     "improvementSurcharge", "tollsAmount", "ehailFee", "tripType", "rateCodeID", 
                     "storeAndFwdFlag", "paymentType", "fareAmount", "tipAmount"
                    ]
for col in columns_to_remove:
    green_taxi_df.pop(col)
    
green_taxi_df.head(5)
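The loop with `pop()` above modifies the dataframe in place, one column at a time. A possible alternative is `DataFrame.drop(columns=...)`, which returns a new dataframe and can skip names that are not present. The tiny dataframe below is illustrative only.

```python
import pandas as pd

# Illustrative data; only a few of the taxi columns are reproduced here.
df = pd.DataFrame({
    "tripDistance": [1.0, 2.5],
    "fareAmount": [6.0, 11.0],
    "tipAmount": [1.0, 0.0],
    "totalAmount": [7.0, 11.0],
})

columns_to_remove = ["fareAmount", "tipAmount", "ehailFee"]

# errors="ignore" skips labels that do not exist (here, "ehailFee").
trimmed = df.drop(columns=columns_to_remove, errors="ignore")

print(trimmed.columns.tolist())
```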

### Cleanse data 

Run the `describe()` function on the new dataframe to see summary statistics for each field.

In [None]:
green_taxi_df.describe()

From the summary statistics, you can see that several fields have outliers or values that will reduce model accuracy. First, filter the lat/long fields to be within the bounds of the Manhattan area. This filters out longer taxi trips and trips that are outliers with respect to their relationship with other features.

Additionally, filter the `tripDistance` field to be at least 0.25 miles but less than 31 miles (the haversine distance between the two lat/long pairs). This eliminates very short trips and long outlier trips that have inconsistent trip cost.
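To illustrate where a limit of roughly 31 miles could come from, the sketch below computes the haversine distance between two opposite corners of the latitude/longitude bounding box used in the filter. The helper name `haversine_miles` and the Earth radius constant are assumptions for illustration; the tutorial does not show how the limit was actually derived.

```python
import math


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    earth_radius_miles = 3958.8  # assumed mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * earth_radius_miles * math.asin(math.sqrt(a))


# Opposite corners of the bounding box: (min lat, min lon) and (max lat, max lon).
corner_distance = haversine_miles(40.53, -74.09, 40.88, -73.72)
print(round(corner_distance, 1))
```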

Lastly, the `totalAmount` field has negative values for the taxi fares, which don't make sense in the context of our model, and the `passengerCount` field has bad data with the minimum values being zero.

Filter out these anomalies using query functions, and then remove the last few columns unnecessary for training.

In [None]:
final_df = green_taxi_df.query("pickupLatitude>=40.53 and pickupLatitude<=40.88")
final_df = final_df.query("pickupLongitude>=-74.09 and pickupLongitude<=-73.72")
final_df = final_df.query("tripDistance>=0.25 and tripDistance<31")
final_df = final_df.query("passengerCount>0 and totalAmount>0")

columns_to_remove_for_training = ["pickupLongitude", "pickupLatitude", "dropoffLongitude", "dropoffLatitude"]
for col in columns_to_remove_for_training:
    final_df.pop(col)
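The chained `query()` calls can also be expressed as a single boolean mask, which makes it easy to check how many rows each condition removes. The four rows below are made up so that exactly one of them passes every filter.

```python
import pandas as pd

# Made-up trips: only the first row satisfies every condition.
trips = pd.DataFrame({
    "pickupLatitude": [40.75, 41.20, 40.70, 40.60],
    "pickupLongitude": [-73.98, -73.90, -73.95, -74.00],
    "tripDistance": [2.1, 5.0, 0.1, 12.0],
    "passengerCount": [1, 2, 1, 0],
    "totalAmount": [11.5, 20.0, 4.0, 35.0],
})

# between() is inclusive on both ends, matching the >= and <= bounds in the queries.
mask = (
    trips["pickupLatitude"].between(40.53, 40.88)
    & trips["pickupLongitude"].between(-74.09, -73.72)
    & (trips["tripDistance"] >= 0.25)
    & (trips["tripDistance"] < 31)
    & (trips["passengerCount"] > 0)
    & (trips["totalAmount"] > 0)
)
kept = trips[mask]

print(f"kept {len(kept)} of {len(trips)} rows")
```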

In [None]:
final_df.shape

Call `describe()` again on the data to ensure cleansing worked as expected. 

In [None]:
final_df.describe()

## Configure workspace


Create a workspace object from the existing workspace. A [Workspace](https://docs.microsoft.com/python/api/azureml-core/azureml.core.workspace.workspace?view=azure-ml-py) is a class that accepts your Azure subscription and resource information. It also creates a cloud resource to monitor and track your model runs. `Workspace.from_config()` reads the file **config.json** and loads the authentication details into an object named `ws`. `ws` is used throughout the rest of the code in this tutorial.

In [None]:
from azureml.core.workspace import Workspace
ws = Workspace.from_config()

ws
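For reference, the **config.json** file read by `Workspace.from_config()` holds three keys. The sketch below only builds and prints such a file's contents; the angle-bracket values are placeholders, not real identifiers.

```python
import json

# Placeholder values; substitute your own subscription, resource group, and workspace.
config = {
    "subscription_id": "<your-subscription-id>",
    "resource_group": "<your-resource-group>",
    "workspace_name": "<your-workspace-name>",
}

config_text = json.dumps(config, indent=4)
print(config_text)
```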

Next, save the prepared data as a CSV file in a local `register` folder so that it can be uploaded and registered as a dataset for later use. The cell below creates the folder if it does not already exist.

In [None]:
import os

cwd = os.getcwd()
print(cwd)
dataset_name = user + '-ds-prepped.csv'
print(dataset_name)
dataset_dir = './register/'
os.makedirs(dataset_dir, exist_ok=True)
file_path = os.path.join(dataset_dir, dataset_name)
final_df.to_csv(file_path, index=False)
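Passing `index=False` keeps the pandas row index out of the CSV, so no extra unnamed column appears when the file is read back. The following sketch checks this with a small made-up dataframe written to a temporary directory; the file name is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Made-up stand-in for the prepared taxi data.
prepped = pd.DataFrame({"tripDistance": [1.2, 3.4], "totalAmount": [7.5, 15.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    dataset_dir = os.path.join(tmp_dir, "register")
    os.makedirs(dataset_dir, exist_ok=True)
    file_path = os.path.join(dataset_dir, "example-ds-prepped.csv")

    # index=False omits the row index from the file.
    prepped.to_csv(file_path, index=False)
    restored = pd.read_csv(file_path)

print(restored.columns.tolist())
```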

Upload the file from the local `register` folder to the `data/prepped` folder of the default datastore. The datastore's upload method has the following signature:

`upload(src_dir, target_path=None, overwrite=False, show_progress=True)`

In [None]:
from azureml.core.datastore import Datastore
ds = Datastore.get_default(ws)
ds.upload('register/', target_path='data/prepped', overwrite=True)

from azureml.core.dataset import Dataset
# Create a dataset object from the uploaded file
# prepped_dataset = Dataset.File.from_files((ds, 'data/prepped'))
dataset = Dataset.Tabular.from_delimited_files(ds.path('data/prepped/' + dataset_name))
# Register the dataset
dataset.register(ws, dataset_name, create_new_version=True)

In [None]:
#sample of consuming the dataset.

# azureml-core of version 1.0.72 or higher is required
# azureml-dataprep[pandas] of version 1.1.34 or higher is required
from azureml.core import Workspace, Dataset

subscription_id = 'XXXX-XXX-XXX-XXX'
resource_group = 'mm-machine-learning-rg'
workspace_name = 'mm-machine-learning-ws-dev'

workspace = Workspace(subscription_id, resource_group, workspace_name)

dataset = Dataset.get_by_name(workspace, name=dataset_name)
dataset.to_pandas_dataframe()

### Train the linear regression model

Create an experiment object in your workspace. An experiment acts as a container for your individual runs. 

In [None]:
from azureml.core.experiment import Experiment
experiment = Experiment(ws, user + "python-regression-taxi-experiment")

### Create Training Script

In [None]:
import os
script_folder = os.path.join(os.getcwd(), "train")
print(script_folder)
os.makedirs(script_folder, exist_ok=True)

### TODO: ADD PARAMETER FOR DATASET NAME

Before running the cell below, update the dataset name in `train.py` so that it **uses your user name**.

The training script trains the model and saves it to the run's `outputs` folder.
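The training script defines a `--data-path` argument but looks the dataset up by a hard-coded name. One way the TODO above might be addressed is sketched here with a hypothetical `--dataset-name` argument; the argument name, the helper name `get_runtime_args`, and the example value are assumptions, not part of the original script.

```python
import argparse


def get_runtime_args(argv=None):
    """Parse training arguments; --dataset-name is a hypothetical addition."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--data-path", type=str)
    parser.add_argument("--dataset-name", type=str,
                        default="memasanz-ds-prepped.csv")
    # argv=None makes argparse read sys.argv, as it would inside a real run.
    return parser.parse_args(argv)


args = get_runtime_args(["--dataset-name", "alice-ds-prepped.csv"])
print(args.dataset_name)
```

Inside `main()`, the parsed value could then replace the literal passed to `Dataset.get_by_name`, and the notebook could supply it through the `arguments` parameter of `ScriptRunConfig`.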

In [None]:
%%writefile $script_folder/train.py

import os
import sys
import argparse
import joblib
import pandas as pd

from azureml.core import Dataset
from azureml.core import Run

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler


def getRuntimeArgs():
    parser = argparse.ArgumentParser()
    parser.add_argument('--data-path', type=str)
    args = parser.parse_args()
    return args


def main():
    args = getRuntimeArgs()
    run = Run.get_context()

    
    dataset_dir = './dataset/'
    os.makedirs(dataset_dir, exist_ok=True)
    ws = run.experiment.workspace
    print(ws)

    dataset_lt = Dataset.get_by_name(ws, name='memasanz-ds-prepped.csv')
    
    # Load a TabularDataset & save into pandas DataFrame
    df = dataset_lt.to_pandas_dataframe()
    df.to_csv(os.path.join(dataset_dir, 'dataset.csv'), index = False)
    

    lr = model_train(df, run)

    #copying to "outputs" directory, automatically uploads it to Azure ML
    output_dir = './outputs/'
    os.makedirs(output_dir, exist_ok=True)
    joblib.dump(value=lr, filename=os.path.join(output_dir, 'model.pkl'))

def model_train(ds_df, run):

    y_raw = ds_df['totalAmount']
    X_raw = ds_df.drop('totalAmount', axis=1)

    categorical_features = X_raw.select_dtypes(include=['object']).columns
    numeric_features = X_raw.select_dtypes(include=['int64', 'float']).columns

    categorical_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='constant', fill_value="missing")),('onehotencoder', OneHotEncoder(categories='auto', sparse=False))])

    numeric_transformer = Pipeline(steps=[('scaler', StandardScaler())])

    feature_engineering_pipeline = ColumnTransformer(
        transformers=[
            ('numeric', numeric_transformer, numeric_features),
            ('categorical', categorical_transformer, categorical_features)
        ], remainder="drop")


    # Train test split
    X_train, X_test, y_train, y_test = train_test_split(X_raw, y_raw, test_size=0.2, random_state=0)

    clf = Pipeline(steps=[('preprocessor', feature_engineering_pipeline),('regr', LinearRegression())])
    clf.fit(X_train, y_train)

    # Capture metrics. For a regressor, score() returns the coefficient of
    # determination (R^2) rather than a classification accuracy.
    train_acc = clf.score(X_train, y_train)
    test_acc = clf.score(X_test, y_test)
    print("Training R^2: %.3f" % train_acc)
    print("Test R^2: %.3f" % test_acc)

    # Log to Azure ML
    run.log('Train accuracy', train_acc)
    run.log('Test accuracy', test_acc)

    return clf

if __name__ == "__main__":
    main()
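The values that the training script logs come from `score()`, which for scikit-learn regressors is the coefficient of determination (R^2). The sketch below confirms this on synthetic data by comparing `Pipeline.score` with `sklearn.metrics.r2_score`; the data and coefficients are made up and unrelated to the taxi dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem: y depends linearly on the first feature plus noise.
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 2.5 + 3.0 * X[:, 0] + rng.normal(scale=1.0, size=200)

model = Pipeline(steps=[("scaler", StandardScaler()), ("regr", LinearRegression())])
model.fit(X, y)

pipeline_score = model.score(X, y)           # delegates to LinearRegression.score
manual_r2 = r2_score(y, model.predict(X))    # explicit R^2 computation

print("score(): %.4f, r2_score: %.4f" % (pipeline_score, manual_r2))
```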

### Create your compute

In [None]:
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.exceptions import ComputeTargetException
print(user)
compute_name = user + "-cluster"
print(compute_name)

# checks to see if compute target already exists in workspace, else create it
try:
    compute_target = ComputeTarget(workspace=ws, name=compute_name)
except ComputeTargetException:
    config = AmlCompute.provisioning_configuration(vm_size="STANDARD_D13",
                                                   min_nodes=0, 
                                                   max_nodes=1)

    compute_target = ComputeTarget.create(workspace=ws, name=compute_name, provisioning_configuration=config)
    compute_target.wait_for_completion(show_output=True, min_node_count=None, timeout_in_minutes=40)

### Create your Run Config

In [None]:
from azureml.core.conda_dependencies import CondaDependencies
dependencies = CondaDependencies()
dependencies.add_pip_package('numpy==1.17.0')
dependencies.add_pip_package('joblib==0.14.1')
dependencies.add_pip_package('scikit-learn')

# Create a run configuration to attach to the ScriptRunConfig
from azureml.core.runconfig import RunConfiguration
run_config = RunConfiguration()
run_config.target = compute_name
run_config.environment.python.conda_dependencies = dependencies
run_config.environment.docker.enabled = True

### Select your training script and create a ScriptRunConfig
A ScriptRunConfig object packages together the environment from a RunConfiguration along with your model training script. This object can then be submitted to your experiment and model training will commence on your remote cluster. 

In this sample, the training script lives in a separate directory that is targeted for training. This separation allows a snapshot of just the relevant code to be stored with the run in your AML workspace. The <code>train.py</code> file here accesses your registered dataset, trains a model, and saves a pickled version to the run's outputs folder. The model itself is registered from the notebook after the run completes.

ScriptRunConfig documentation: https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.scriptrunconfig?view=azure-ml-py

In [None]:
from azureml.core import ScriptRunConfig
src = ScriptRunConfig(source_directory='./train', script='train.py')
src.run_config = run_config

### Submit the training run
Here, the ScriptRunConfig is submitted as a run, which triggers your model training operation. The cluster you defined above is automatically spun up and the training procedures outlined in ./train/train.py begin. That file contains all the code needed to train and save a pickled version of your trained model. The code below displays the output logs from your training job; you can also monitor training progress inside AML studio.

Note: As you iterate on your model, you should modify the code inside ./train/train.py. The model parameters there were adjusted for rapid training and should not be used for a production scenario.

In [None]:
from azureml.widgets import RunDetails
run = experiment.submit(config=src)
RunDetails(run).show()
run.wait_for_completion(show_output=True)

In [None]:
import os
script_folder = os.path.join(os.getcwd(), "score")
print(script_folder)
os.makedirs(script_folder, exist_ok=True)

In [None]:
%%writefile $script_folder/score.py

import json
import os
import numpy as np
import pandas as pd
import joblib
from inference_schema.schema_decorators import input_schema, output_schema
from inference_schema.parameter_types.standard_py_parameter_type import StandardPythonParameterType

def init():
    global model
    
    # Update to your model's filename
    model_filename = "model.pkl"

    # AZUREML_MODEL_DIR is injected by AML
    model_dir = os.getenv('AZUREML_MODEL_DIR')

    print("Model dir:", model_dir)
    print("Model filename:", model_filename)
    
    model_path = os.path.join(model_dir, model_filename)

    # Replace this line with your model loading code
    model = joblib.load(model_path)

# Define some sample data for automatic generation of the swagger interface.
# Values are numeric to match the column types used during training.
input_sample = [{
    "vendorID": 1,
    "passengerCount": 1,
    "tripDistance": 4.2,
    "month_num": 1,
    "day_of_month": 4,
    "day_of_week": 1,
    "hour_of_day": 18
}]
output_sample = [18.2281]

# This will automatically unmarshall the data parameter in the HTTP request
@input_schema('data', StandardPythonParameterType(input_sample))
@output_schema(StandardPythonParameterType(output_sample))
def run(data):
    try:
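        input_df = pd.DataFrame(data)
        # The model is a regressor, so predict() returns fare estimates
        predictions = model.predict(input_df)

        # Return a plain list so the response matches output_sample
        return predictions.tolist()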
    except Exception as e:
        error = str(e)
        return error
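Before deploying, the request-to-prediction path of the scoring script can be exercised locally. The sketch below mimics that path with a hypothetical stand-in model (`StubFareModel`, whose formula is invented) instead of the trained pipeline, so it shows only the data flow: JSON payload, dataframe, prediction list.

```python
import json

import pandas as pd


class StubFareModel:
    """Hypothetical stand-in for the trained pipeline; the formula is made up."""

    def predict(self, frame):
        return (2.5 + 3.0 * frame["tripDistance"]).to_numpy()


# A request body shaped like the scoring script's input sample.
payload = json.dumps({"data": [{
    "vendorID": 1, "passengerCount": 1, "tripDistance": 4.2,
    "month_num": 1, "day_of_month": 4, "day_of_week": 1, "hour_of_day": 18,
}]})

# Same steps the scoring function performs on the "data" field.
records = json.loads(payload)["data"]
input_df = pd.DataFrame(records)
predictions = StubFareModel().predict(input_df).tolist()

print(predictions)
```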

In [None]:
from azureml.core.model import Model
model_name = user + '-python-regression'
trained_model = run.register_model(model_path='outputs/model.pkl', model_name=model_name, tags={'Model Type': 'linear regression'})

In [None]:
from azureml.core.environment import Environment
from azureml.core.model import InferenceConfig

env = Environment('tutorial-env')
cd = CondaDependencies.create(pip_packages=['azureml-dataprep[pandas,fuse]>=1.1.14', 'azureml-defaults', 'inference-schema'], conda_packages = ['scikit-learn==0.22.1'])

env.python.conda_dependencies = cd

# Register environment to re-use later
env.register(workspace = ws)

### Model Deployment

You can register this model and deploy it to an endpoint by defining an inferencing configuration and providing a scoring script. Here the model is deployed to an Azure Container Instance, which provides an API endpoint that can be used to make predictions with your model. We utilize an authentication strategy here which requires a key to be provided with any requests sent to the API. These keys can be rotated as needed and allow only approved users to access your endpoint.

Azure Container Instance documentation: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-deploy-azure-container-instance

Azure Container Instances are typically lower cost and useful for dev/test purposes during model development, though we recommend deploying to an Azure Kubernetes Service cluster for production purposes.

Below, an ACI deployment configuration is created. The InferenceConfig defined in a later cell uses the registered `tutorial-env` environment and references the scoring script located at <code>./score/score.py</code>. This script loads the trained model upon initialization, transforms data submitted to the API endpoint, makes predictions with the model, and returns formatted results to the user.

In [None]:
from azureml.core.webservice import AciWebservice

# auth_enabled=True turns on key-based authentication for the endpoint
aciconfig = AciWebservice.deploy_configuration(cpu_cores=1, 
                                               memory_gb=1, 
                                               auth_enabled=True,
                                               tags={"data": "taxi-prepped",  "method" : "sklearn"}, 
                                               description='Predict taxi pricing')

In [None]:
model_name

### Register your model and deploy to an authenticated endpoint 

Model registration documentation: https://docs.microsoft.com/en-us/azure/machine-learning/how-to-deploy-and-where

In [None]:
%%time
from azureml.core.webservice import Webservice
from azureml.core.model import InferenceConfig
from azureml.core.environment import Environment
from azureml.core import Workspace
from azureml.core.model import Model

ws = Workspace.from_config()
# Look up the model registered earlier under your user-specific name
model = Model(ws, model_name)

# Omitting the version retrieves the latest registered version of the environment
myenv = Environment.get(workspace=ws, name="tutorial-env")
inference_config = InferenceConfig(source_directory='./score', entry_script="score.py", environment=myenv)

service = Model.deploy(workspace=ws, 
                       name=model_name +'-srv2', 
                       models=[model], 
                       inference_config=inference_config, 
                       deployment_config=aciconfig)

service.wait_for_deployment(show_output=True)

In [None]:
print('Scoring API available at: {}'.format(service.serialize()['scoringUri']))
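Once the service is running, a client sends a JSON body with a `data` field to the scoring URI and, when key authentication is enabled, passes the key as a bearer token. The sketch below only builds such a request with the standard library and does not send it; the URI and key are placeholders. In the notebook they would come from the deployed service, for example `service.scoring_uri` and `service.get_keys()`.

```python
import json
import urllib.request

# Placeholders; replace with the values from your deployed service.
scoring_uri = "http://example.invalid/score"
api_key = "<primary-key>"

payload = {"data": [{
    "vendorID": 1, "passengerCount": 1, "tripDistance": 4.2,
    "month_num": 1, "day_of_month": 4, "day_of_week": 1, "hour_of_day": 18,
}]}
body = json.dumps(payload).encode("utf-8")

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {api_key}",  # key-based authentication
}

request = urllib.request.Request(scoring_uri, data=body, headers=headers, method="POST")

# To call a live endpoint, uncomment the next line:
# response = urllib.request.urlopen(request)
print(request.get_method(), request.full_url)
```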