# Automated ML

TODO: Import Dependencies. In the cell below, import all the dependencies that you will need to complete the project.

In [2]:
import logging
import os
import csv

from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from sklearn import datasets
import pkg_resources

import helper
import azureml.core
from azureml.data.datapath import DataPath
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl import AutoMLConfig
from azureml.core.dataset import Dataset
from azureml.core.run import Run
from azureml.core.model import Model


from azureml.pipeline.steps import AutoMLStep

# Check core SDK version number
print("SDK version:", azureml.core.VERSION)

SDK version: 1.41.0


## Dataset

### Overview
The Titanic dataset used in this project was downloaded from [Kaggle](https://www.kaggle.com/competitions/titanic/overview). It consists of a training set with 891 records and a test set with 418 records. The training set has 12 columns; the test set has the same columns without the `Survived` label.
For each record, we want to predict whether the passenger survived the Titanic disaster. To do so, we first identify the features that are relevant to this task.

TODO: Get data. In the cell below, write code to access the data you will be using in this project. Remember that the dataset needs to be external.

In [3]:
# Load workspace from config file present at .\config.json.
ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep = '\n')

quick-starts-ws-197423
aml-quickstarts-197423
southcentralus
81cefad3-d2c9-4f77-a466-99a7f541c7bb


In [4]:
# Choose a name for experiment
experiment_name = 'Titanic'
project_folder = './titanic-project'

experiment=Experiment(ws, experiment_name)
experiment

Name,Workspace,Report Page,Docs Page
Titanic,quick-starts-ws-197423,Link to Azure Machine Learning studio,Link to Documentation


In [6]:
# Try to load the dataset from the Workspace. Otherwise, create it from the file
found = False
key = "Titanic_dataset"

if key in ws.datasets.keys():
    found = True
    dataset = ws.datasets[key]

if not found:
    # Create AML Dataset and register it into Workspace
    datastore = ws.get_default_datastore()
    datastore.upload(src_dir='data', target_path='data')
    train_data = datastore.path('data/train_modified.csv')

    dataset = Dataset.Tabular.from_delimited_files(train_data, separator=';')
    dataset = dataset.register(workspace=ws,
                               name=key,
                               description="This is the complete dataset for the capstone project.")

dataset_filtered = dataset.keep_columns(["Survived","Pclass","Sex","SibSp","Parch","Fare","Embarked","Age"])
dataset_filtered = dataset_filtered.register(workspace=ws,
                           name=key+"_filtered",
                           description="This is the filtered dataset for the capstone project " \
                           "with only those features relevant for training.")

df = dataset_filtered.to_pandas_dataframe()
df.describe()

Statistic,Survived,Pclass,SibSp,Parch,Fare,Age
count,891.0,891.0,891.0,891.0,891.0,891.0
mean,0.383838,2.308642,0.523008,0.381594,32.204208,29.520623
std,0.486592,0.836071,1.102743,0.806057,49.693429,13.399106
min,0.0,1.0,0.0,0.0,0.0,0.42
25%,0.0,2.0,0.0,0.0,7.9104,21.5
50%,0.0,3.0,0.0,0.0,14.4542,27.784794
75%,1.0,3.0,1.0,0.0,31.0,37.0
max,1.0,3.0,8.0,6.0,512.3292,80.0
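
The cell above parses a semicolon-delimited file and then keeps eight columns. As a rough illustration of those two steps without the Azure ML SDK, here is a minimal standard-library sketch. The two sample rows and the helper name `keep_columns` are invented for illustration, and the column layout of `train_modified.csv` is assumed to match the original Kaggle file.

```python
import csv
import io

# Columns kept for training, copied from the keep_columns call above.
KEEP = ["Survived", "Pclass", "Sex", "SibSp", "Parch", "Fare", "Embarked", "Age"]

# Invented sample rows in the ASSUMED layout of train_modified.csv (semicolon-delimited).
SAMPLE = (
    "PassengerId;Survived;Pclass;Name;Sex;Age;SibSp;Parch;Ticket;Fare;Cabin;Embarked\n"
    "1;0;3;Passenger A;male;22.0;1;0;T-1;7.25;;S\n"
    "2;1;1;Passenger B;female;38.0;1;0;T-2;71.28;C1;C\n"
)


def keep_columns(text, columns, delimiter=";"):
    """Parse delimited text and return each row restricted to the given columns."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [{name: row[name] for name in columns} for row in reader]


rows = keep_columns(SAMPLE, KEEP)
print(rows[0])
```

Note that this sketch leaves all values as strings, whereas the describe() output above shows that the registered dataset exposes the numeric columns as numbers.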


## AutoML Configuration

TODO: Explain why you chose the automl settings and configuration you used below.

First, we create a compute cluster if one is not yet available. Because deep learning models are enabled (`enable_dnn`), we choose a GPU VM size (`STANDARD_NC6`) for training.

In [5]:
from azureml.core.compute import AmlCompute
from azureml.core.compute import ComputeTarget
from azureml.core.compute_target import ComputeTargetException

num_nodes = 5

amlcompute_cluster_name = "ComputeClusterCapstone"

# Verify that cluster does not exist already
try:
    compute_target = ComputeTarget(workspace=ws, name=amlcompute_cluster_name)
    print('Found existing cluster, use it.')
except ComputeTargetException:
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_NC6',
                                                           vm_priority = 'lowpriority',
                                                           max_nodes=num_nodes)
    compute_target = ComputeTarget.create(ws, amlcompute_cluster_name, compute_config)

compute_target.wait_for_completion(show_output=True, min_node_count = 1, timeout_in_minutes = 10)

InProgress.
SucceededProvisioning operation finished, operation "Succeeded"
Succeeded.....................................................................................................................
AmlCompute wait for completion finished

Wait timeout has been reached
Current provisioning state of AmlCompute is "Succeeded" and current node count is "0"


In [6]:
automl_settings = {
    "max_concurrent_iterations": 5,
    "max_cores_per_iteration": -1,
    "enable_dnn": True,
    "enable_early_stopping": True,
    "validation_size": 0.2,
    "primary_metric" : 'accuracy',
    "enable_voting_ensemble": False,
    "enable_stack_ensemble": False
}

automl_config = AutoMLConfig(compute_target=compute_target,
                             task = "classification",
                             training_data=dataset_filtered,
                             label_column_name="Survived",   
                             path = project_folder,
                             featurization= 'auto',
                             debug_log = "automl_errors.log",
                             **automl_settings
                            )
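
To put the `accuracy` primary metric into context, it may help to know what a trivial model would score. The sketch below uses only the figures from the `df.describe()` output (891 rows, mean of `Survived` = 0.383838) to derive the accuracy of always predicting "did not survive". The holdout size for `validation_size = 0.2` is an estimate; how AutoML rounds the split is an assumption here.

```python
# Figures copied from the df.describe() output above.
n_rows = 891
survival_rate = 0.383838

survivors = round(n_rows * survival_rate)  # number of rows with Survived == 1
non_survivors = n_rows - survivors


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Reconstruct the label counts and score a model that always predicts 0.
y_true = [1] * survivors + [0] * non_survivors
y_majority = [0] * n_rows
baseline = accuracy(y_true, y_majority)
print(f"majority-class baseline accuracy: {baseline:.4f}")

# Estimated holdout size for validation_size=0.2 (AutoML's exact rounding is an assumption).
n_validation = round(n_rows * 0.2)
print(f"approx. validation rows: {n_validation}, training rows: {n_rows - n_validation}")
```

Any model selected by AutoML should clearly beat this baseline on the validation split to be worth deploying.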

In [7]:
ds = ws.get_default_datastore()
metrics_output_name = 'metrics_output'
best_model_output_name = 'best_model_output'

In [8]:
remote_run = experiment.submit(automl_config, show_output=True)

Submitting remote run.
No run_configuration provided, running on ComputeClusterCapstone with default configuration
Running on remote compute: ComputeClusterCapstone


Experiment,Id,Type,Status,Details Page,Docs Page
Titanic,AutoML_13f43f05-156d-42df-882f-a54fcde10c77,automl,NotStarted,Link to Azure Machine Learning studio,Link to Documentation



Current status: FeaturesGeneration. Generating features for the dataset.
Current status: DatasetFeaturizationCompleted. Completed fit featurizers and featurizing the dataset.
Current status: ModelSelection. Beginning model selection.

********************************************************************************************
DATA GUARDRAILS: 

TYPE:         Class balancing detection
STATUS:       PASSED
DESCRIPTION:  Your inputs were analyzed, and all classes are balanced in your training data.
              Learn more about imbalanced data: https://aka.ms/AutomatedMLImbalancedData

********************************************************************************************

TYPE:         Missing feature values imputation
STATUS:       PASSED
DESCRIPTION:  No feature missing values were detected in the training data.
              Learn more about missing value imputation: https://aka.ms/AutomatedMLFeaturization

***********************************************************************

## Run Details

OPTIONAL: Write about the different models trained and their performance. Why do you think some models did better than others?

In [9]:
from azureml.widgets import RunDetails
RunDetails(remote_run).show()


## Best Model

TODO: In the cell below, get the best model from the automl experiments and display all the properties of the model.



In [10]:
best_run, best_model = remote_run.get_output()
print(best_run)
print(best_model)

Run(Experiment: Titanic,
Id: AutoML_13f43f05-156d-42df-882f-a54fcde10c77_38,
Type: azureml.scriptrun,
Status: Completed)
Pipeline(steps=[('datatransformer',
                 DataTransformer(enable_feature_sweeping=True, working_dir='/mnt/batch/tasks/shared/LS_root/mounts/clusters/notebook197423/code/Users/odl_user_197423')),
                ('SparseNormalizer', Normalizer(norm='max')),
                ('XGBoostClassifier',
                 XGBoostClassifier(booster='gbtree', colsample_bytree=0.9, eta=0.2, gamma=0, max_depth=6, max_leaves=7, n_estimators=50, n_jobs=0, objective='reg:logistic', problem_info=ProblemInfo(gpu_training_param_dict={'processing_unit_type': 'gpu'}), reg_alpha=2.0833333333333335, reg_lambda=0.5208333333333334, subsample=0.6, tree_method='auto'))])


[Source](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/automated-machine-learning/classification-text-dnn/auto-ml-classification-text-dnn.ipynb)

In [11]:
model_dir = "AutoML_model"  # Local folder where the model will be stored temporarily
if not os.path.isdir(model_dir):
    os.mkdir(model_dir)

best_run.download_file("outputs/model.pkl", model_dir + "/model.pkl")

In [12]:
import sklearn
from azureml.core.resource_configuration import ResourceConfiguration
# Register the model
model_name = "Titanic_AutoML_model"
best_model = Model.register(workspace=ws,
                            model_name=model_name,
                            model_path=model_dir + "/model.pkl",
                            model_framework=Model.Framework.SCIKITLEARN,
                            model_framework_version=sklearn.__version__,
                            sample_input_dataset=dataset_filtered,
                            resource_configuration=ResourceConfiguration(cpu=1, memory_in_gb=0.5),
                            description="AutoML model to predict Titanic survivors.",
                            tags=None
)

Registering model Titanic_AutoML_model


In [13]:
print(best_model.id)

Titanic_AutoML_model:1


# Prediction
Predict values for `Survived` for the Kaggle competition.

## Model Deployment

Remember you have to deploy only one of the two models you trained but you still need to register both the models. Perform the steps in the rest of this notebook only if you wish to deploy this model.

[Source 1](https://docs.microsoft.com/de-de/python/api/overview/azure/ml/?view=azure-ml-py)
[Source 2](https://github.com/Azure/MachineLearningNotebooks/blob/master/how-to-use-azureml/deployment/deploy-to-cloud/model-register-and-deploy.ipynb)

TODO: In the cell below, create an inference config and deploy the model as a web service.

In [12]:
from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import AciWebservice, Webservice
from azureml.core.environment import Environment
from azureml.core.conda_dependencies import CondaDependencies

In [61]:
#environment = Environment("Titanic-environment")
env = Environment.get(ws, "AzureML-AutoML")
print(env)
for pip_package in ["scikit-learn"]:
    env.python.conda_dependencies.add_pip_package(pip_package)

# Update scoring script (fetch the datastore here in case it was not created above)
datastore = ws.get_default_datastore()
datastore.upload_files(['./score.py'], overwrite=True)

# Combine scoring script & environment in Inference configuration
inference_config = InferenceConfig(entry_script="score.py",
                                   source_directory=".",
                                   environment=env)

# Set deployment configuration
deployment_config = AciWebservice.deploy_configuration(cpu_cores = 1,
                                                       memory_gb = 1,
                                                       enable_app_insights=True,
                                                       auth_enabled=False)

# Define the model, inference, & deployment configuration and web service name and location to deploy
service = Model.deploy(workspace = ws,
                       name = "titanic-webservice",
                       models = [best_model],
                       inference_config = inference_config,
                       deployment_config = deployment_config,
                       overwrite=True)
service.wait_for_deployment(show_output=True)
print(service.get_logs())

Environment(Name: AzureML-AutoML,
Version: 115)
Uploading an estimated of 1 files
Uploading ./score.py
Uploaded ./score.py, 1 files out of an estimated total of 1
Uploaded 1 files
Tips: You can try get_logs(): https://aka.ms/debugimage#dockerlog or local deployment: https://aka.ms/debugimage#debug-locally to debug if deployment takes longer than 10 minutes.
Running
2022-06-03 09:25:39+00:00 Creating Container Registry if not exists.
2022-06-03 09:25:39+00:00 Registering the environment.
2022-06-03 09:25:40+00:00 Use the existing image.
2022-06-03 09:25:40+00:00 Generating deployment configuration.
2022-06-03 09:25:41+00:00 Submitting deployment to compute..
2022-06-03 09:26:14+00:00 Checking the status of deployment titanic-webservice..
2022-06-03 09:28:15+00:00 Checking the status of inference endpoint titanic-webservice.
Succeeded
ACI service creation operation finished, operation "Succeeded"
2022-06-03T09:27:30,803388000+00:00 - gunicorn/run 
2022-06-03T09:27:30,806797400+00:00 | gu
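
The deployment above points `InferenceConfig` at `score.py`, which is not shown in this notebook. Azure ML entry scripts follow an `init()` / `run(raw_data)` convention, and the sketch below only illustrates that shape. `StubModel` and its prediction rule are hypothetical placeholders, not the trained AutoML pipeline, and the real script may differ in how it loads the model and converts the input.

```python
import json


class StubModel:
    """Hypothetical stand-in for the unpickled AutoML pipeline."""

    def predict(self, rows):
        # Placeholder rule for illustration only, NOT the trained model.
        return [1 if row.get("Sex") == "female" else 0 for row in rows]


model = None


def init():
    """Called once at container start; a real script would load model.pkl here."""
    global model
    model = StubModel()


def run(raw_data):
    """Called per request with the raw JSON body; returns a JSON-serializable result."""
    try:
        payload = json.loads(raw_data)
        rows = payload["data"]
        if payload.get("method", "predict") != "predict":
            return {"error": "unsupported method"}
        return model.predict(rows)
    except Exception as exc:
        # Returning the error text makes failures visible to the caller.
        return {"error": str(exc)}


init()
request_body = json.dumps({"data": [{"Sex": "female"}, {"Sex": "male"}], "method": "predict"})
print(run(request_body))
```

A real entry script would typically convert `rows` into a pandas DataFrame before calling the model, since the AutoML pipeline expects named columns.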

TODO: In the cell below, send a request to the web service you deployed to test it.

In [25]:
service = Webservice(workspace=ws, name="titanic-webservice")
print(service)

AciWebservice(workspace=Workspace.create(name='quick-starts-ws-197423', subscription_id='81cefad3-d2c9-4f77-a466-99a7f541c7bb', resource_group='aml-quickstarts-197423'), name=titanic-webservice, image_id=None, image_digest=None, compute_type=ACI, state=Healthy, scoring_uri=http://8b4a0227-956e-4b86-8ac9-811bab24acc9.southcentralus.azurecontainer.io/score, tags=None, properties={'hasInferenceSchema': 'False', 'hasHttps': 'False'}, created_by={'userObjectId': '016836ad-a3f6-4af5-a9a5-b308069434a4', 'userPuId': '100320020175254D', 'userIdp': None, 'userAltSecId': None, 'userIss': 'https://sts.windows.net/660b3398-b80e-49d2-bc5b-ac1dc93b5254/', 'userTenantId': '660b3398-b80e-49d2-bc5b-ac1dc93b5254', 'userName': 'ODL_User 197423', 'upn': 'odl_user_197423@udacitylabs.onmicrosoft.com'})


In [35]:
import json
def predict_from_df(df):
    df_json = df.to_json(orient='records')

    input_payload = json.dumps({
        'data': json.loads(df_json),
        'method': 'predict'
    }, indent=2)
    #print(input_payload)

    output = service.run(input_payload)
    #print("Response:\n",output)
    return output
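
To make the request format concrete, here is a small sketch of the payload that `predict_from_df` builds, using plain dictionaries instead of a DataFrame. `df.to_json(orient='records')` produces a list of one object per row, so the list below has the same shape. The two records and the helper name `build_payload` are invented for illustration.

```python
import json

# Invented records with the seven feature columns used for prediction.
records = [
    {"Pclass": 3, "Sex": "male", "SibSp": 0, "Parch": 0,
     "Fare": 7.83, "Embarked": "Q", "Age": 34.5},
    {"Pclass": 1, "Sex": "female", "SibSp": 1, "Parch": 0,
     "Fare": 82.27, "Embarked": "S", "Age": 23.0},
]


def build_payload(rows):
    """Wrap row records in the envelope used by predict_from_df."""
    return json.dumps({"data": rows, "method": "predict"})


payload = build_payload(records)
decoded = json.loads(payload)
print(sorted(decoded))       # top-level keys of the request body
print(len(decoded["data"]))  # number of rows sent for scoring
```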

In [34]:
# Try to load the dataset from the Workspace. Otherwise, create it from the file
found = False
key = "Titanic_dataset_test"

if key in ws.datasets.keys(): 
    print("Found", key)
    found = True
    dataset_predict = ws.datasets[key] 

if not found:
    print("Did not find", key)
    print("Create the dataset")
    # Create AML Dataset and register it into Workspace
    datastore = ws.get_default_datastore()
    datastore.upload(src_dir='data', target_path='data')
    predict_data = datastore.path('data/test_modified.csv')

    dataset_predict = Dataset.Tabular.from_delimited_files(predict_data, separator=';')        
    dataset_predict = dataset_predict.register(workspace=ws,
                               name=key,
                               description="This is the test dataset for the capstone project.")

found = False
key_filtered = key+"_filtered"

if key_filtered in ws.datasets.keys(): 
    print("Found", key_filtered)
    found = True
    dataset_predict_filtered = ws.datasets[key_filtered] 

if not found:
    print("Did not find", key_filtered)
    print("Create the dataset")
    dataset_predict_filtered = dataset_predict.keep_columns(["Pclass","Sex","SibSp","Parch","Fare","Embarked","Age"])
    dataset_predict_filtered = dataset_predict_filtered.register(workspace=ws,
                               name=key_filtered,
                               description="This is the filtered test dataset for the capstone project " \
                               "with only those features relevant for prediction."
                               )
    
print(dataset_predict_filtered[:2])
dataframe_predict_filtered = dataset_predict_filtered.to_pandas_dataframe()

y_pred = predict_from_df(dataframe_predict_filtered)

dataset_predict_output = dataset_predict.keep_columns(["PassengerId"])
dataframe_predict_output = dataset_predict_output.to_pandas_dataframe()
dataframe_predict_output = pd.concat([dataframe_predict_output, pd.DataFrame(y_pred, columns=["Survived"])], axis=1)


datastore = ws.get_default_datastore()
dataframe_predict_output.to_csv("dataset_test_predictions_automl.csv", index=False)
dataset_predict = Dataset.Tabular.register_pandas_dataframe(dataframe_predict_output,
                                                            target=datastore,
                                                            name=key+"_filtered_predict",
                                                            description="These are the predictions for the test dataset."
                                                            )

Found Titanic_dataset_test
Found Titanic_dataset_test_filtered
Dataflow
  steps: [
    Step {
      id: 8ba00377-3543-4b30-b07b-1aae4931331b
      type: Microsoft.DPrep.GetDatastoreFilesBlock,
    },
    Step {
      id: 5b1a608e-9c75-4bca-a074-f0cc1c070fdc
      type: Microsoft.DPrep.ParseDelimitedBlock,
    },
  ]
Validating arguments.
Arguments validated.
Successfully obtained datastore reference and path.
Uploading file to managed-dataset/11033b56-872f-4213-9300-3681f4bb5e83/
Successfully uploaded file to datastore.
Creating and registering a new dataset.
Successfully created and registered a new dataset.


In [36]:
print(predict_from_df(dataframe_predict_filtered[0:10]))

[0, 0, 0, 0, 0, 0, 1, 0, 1, 0]
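
The ten predictions above can be paired with passenger IDs to form a Kaggle-style submission file with the columns `PassengerId` and `Survived`. The following standard-library sketch shows one way to do that. The ID range is a hypothetical placeholder and the helper name `to_submission_csv` is invented; the predictions are copied from the output above.

```python
import csv
import io

passenger_ids = list(range(892, 902))         # hypothetical IDs, for illustration only
predictions = [0, 0, 0, 0, 0, 0, 1, 0, 1, 0]  # copied from the cell output above


def to_submission_csv(ids, preds):
    """Return CSV text with one PassengerId,Survived row per prediction."""
    if len(ids) != len(preds):
        raise ValueError("ids and predictions must have the same length")
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["PassengerId", "Survived"])
    writer.writerows(zip(ids, preds))
    return buffer.getvalue()


submission = to_submission_csv(passenger_ids, predictions)
print(submission.splitlines()[:3])
```

The explicit length check is a deliberate choice: `pd.concat(..., axis=1)` aligns on the index and would silently fill missing rows with NaN if the service returned fewer predictions than there are passengers.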


TODO: In the cell below, print the logs of the web service and delete the service

In [38]:
!python logs.py

2022-06-03T09:27:30,803388000+00:00 - gunicorn/run 
2022-06-03T09:27:30,806797400+00:00 | gunicorn/run | 
2022-06-03T09:27:30,806912500+00:00 - iot-server/run 
2022-06-03T09:27:30,816039200+00:00 - nginx/run 
2022-06-03T09:27:30,820831000+00:00 | gunicorn/run | ###############################################
2022-06-03T09:27:30,829672700+00:00 - rsyslog/run 
2022-06-03T09:27:30,826914300+00:00 | gunicorn/run | AzureML Container Runtime Information
2022-06-03T09:27:30,864510100+00:00 | gunicorn/run | ###############################################
2022-06-03T09:27:30,878130900+00:00 | gunicorn/run | 
2022-06-03T09:27:30,892807700+00:00 | gunicorn/run | 
2022-06-03T09:27:30,922439100+00:00 | gunicorn/run | AzureML image information: openmpi3.1.2-ubuntu18.04:20220516.v1
2022-06-03T09:27:30,928814500+00:00 | gunicorn/run | 
2022-06-03T09:27:30,930806500+00:00 | gunicorn/run | 
2022-06-03T09:27:30,937325100+00:00 | gunicorn/run | PATH environment variable: /azureml-envs/azureml

In [None]:
# Delete the webservice
service.delete()

**Submission Checklist**
- I have registered the model.
- I have deployed the model with the best accuracy as a webservice.
- I have tested the webservice by sending a request to the model endpoint.
- I have deleted the webservice and shut down all the computes that I have used.
- I have taken a screenshot showing the model endpoint as active.
- The project includes a file containing the environment details.
