Author: Kevin ALBERT  

Created: April 2020  

# Automated Machine Learning
_**Classification of data lake data on remote compute with autoML and model registration**_

## Contents
1. [Introduction](#Introduction)
1. [Setup](#Setup)
1. [Train](#Train)
1. [Results](#Results)
1. [Test](#Test)
1. [Acknowledgements](#Acknowledgements)

## Introduction

Azure Data Factory writes the cleaned datasets to a Delta Lake on Azure Data Lake Storage Gen2.  
This notebook uses that data lake data and remote compute to train a classification model with AutoML.  
The example data is used to classify each record as diabetic or non-diabetic based on 8 features.  

This notebook shows how to:
1. Create an experiment
2. Configure AutoML
3. Train the model using remote compute
4. Explore the results
5. Test the fitted model

## Setup

### Import open-source packages

In [None]:
# import logging
# import os
# import random
# import re
# import lightgbm
# import pandas as pd
# import numpy as np
# import json
# import csv
# from matplotlib import pyplot as plt
# from matplotlib.pyplot import imshow
# from sklearn import datasets
# from shutil import copy2
# import seaborn as sns
# sns.set(color_codes=True)

### Import Azure Machine Learning SDK packages

In [15]:
#import azureml.core
from azureml.core import Workspace
from azureml.core.experiment import Experiment
from azureml.core import Dataset
from azureml.core import Datastore
from azureml.data.datapath import DataPath
from azureml.core.compute import ComputeTarget
from azureml.core.compute import AmlCompute
from azureml.core.compute import AksCompute
# from azureml.core.compute_target import ComputeTargetException
# from azureml.core.webservice import Webservice, AksWebservice
# from azureml.core.image import Image
# from azureml.core.model import Model
# from azureml.train.automl import AutoMLConfig
# from azureml.train.automl.run import AutoMLRun
# from azureml.widgets import RunDetails

### Workspace

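Download **config.json** from the Azure Machine Learning workspace page in the portal and place it next to this notebook (or in a parent directory) so that `Workspace.from_config()` can find it.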

In [5]:
# load the workspace
ws = Workspace.from_config()
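
`Workspace.from_config()` relies on the downloaded file to identify the workspace. As a minimal sketch, assuming the file is named `config.json` and carries the usual `subscription_id`, `resource_group` and `workspace_name` keys, a quick check before loading can give a clearer message than a failed connection. The helper name `missing_workspace_keys` is hypothetical and not part of the SDK.

```python
import json
from pathlib import Path

# Keys that the workspace config file downloaded from the portal normally carries
REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def missing_workspace_keys(path="config.json"):
    """Return the required keys that are absent or empty in the config file."""
    config = json.loads(Path(path).read_text())
    return [key for key in REQUIRED_KEYS if not config.get(key)]


# Only run the check when the file is present in the working directory
if Path("config.json").exists():
    missing = missing_workspace_keys()
    print("config.json looks complete" if not missing else f"missing keys: {missing}")
```

An empty list means the file has everything `Workspace.from_config()` needs to locate the workspace; authentication is still handled by the SDK when the workspace is loaded.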

### Experiment

In [6]:
# choose an experiment name
experiment = Experiment(ws, 'automl-classification')

### Data

Data Factory has prepared the data in stages, from /bronze to /silver to /gold and /platinum, for model training.  
**Note:** for this demonstration the files are in the /platinum folder of the Data Lake Gen2 `datalake` container:  
  * /datalake/platinum/diabetes.csv
  * /datalake/platinum/diabetes.parquet
  * copy these files from ../data/platinum/*

Register the Data Lake Gen2 storage account as a **blob container** datastore.  
**Optional:** register it manually in the ML workspace instead.

In [7]:
import os

# read the storage account key from an environment variable instead of hard-coding it
# (the variable name DATALAKE_ACCOUNT_KEY is an assumption; any secret store works)
ds = Datastore.register_azure_blob_container(workspace=ws,
                                             datastore_name="datalakestoragegen2",
                                             container_name="datalake",
                                             account_name="datalake21032020",
                                             account_key=os.environ["DATALAKE_ACCOUNT_KEY"],
                                             create_if_not_exists=False)
# list available datastores
ws.datastores

{'azureml_globaldatasets': {
   "name": "azureml_globaldatasets",
   "container_name": "globaldatasets",
   "account_name": "studioprodweummsa01",
   "protocol": "https",
   "endpoint": "core.windows.net"
 },
 'datalakestoragegen2': {
   "name": "datalakestoragegen2",
   "container_name": "datalake",
   "account_name": "datalake21032020",
   "protocol": "https",
   "endpoint": "core.windows.net"
 },
 'data_lake_gen2': <azureml.data.azure_data_lake_datastore.AzureDataLakeGen2Datastore at 0x7f52b70ca278>,
 'workspacefilestore': {
   "name": "workspacefilestore",
   "container_name": "azureml-filestore-8ffd38a4-d688-44f6-9fc7-862df920c646",
   "account_name": "machinelstorage071578f15",
   "protocol": "https",
   "endpoint": "core.windows.net"
 },
 'workspaceblobstore': {
   "name": "workspaceblobstore",
   "container_name": "azureml-blobstore-8ffd38a4-d688-44f6-9fc7-862df920c646",
   "account_name": "machinelstorage071578f15",
   "protocol": "https",
   "endpoint": "core.windows.net"
 }}

Register the file(s) as a tabular dataset.  
**Note:** do not import Delta Lake parquet file(s) directly.  
**Fix:** import a single file written with pandas instead, such as gold/*.csv or gold/*.parquet.  
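
A Delta table is a folder of Parquet part files plus a `_delta_log` directory that records which parts are current, so a dataset pointed at the raw parts can pick up files the log no longer references. A minimal sketch for spotting such folders before registering them, assuming the data has been copied or mounted to a local path (the helper name `is_delta_table` is hypothetical):

```python
from pathlib import Path


def is_delta_table(folder):
    """Return True when the folder contains a Delta transaction log directory."""
    return (Path(folder) / "_delta_log").is_dir()


# Example: check the local copy of the folder before creating a dataset from it
candidate = Path("../data/platinum")
if candidate.is_dir():
    if is_delta_table(candidate):
        print(f"{candidate}: Delta table, export a single file first")
    else:
        print(f"{candidate}: plain files")
```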

In [8]:
# load datastore
ds = Datastore.get(ws, 'datalakestoragegen2')
# show datastore settings
ds

{
  "name": "datalakestoragegen2",
  "container_name": "datalake",
  "account_name": "datalake21032020",
  "protocol": "https",
  "endpoint": "core.windows.net"
}

option 1: loading *.parquet

In [9]:
# setup parquet file(s) into a tabular dataset
ds_path = [DataPath(ds, 'platinum/diabetes.parquet')]
dataset = Dataset.Tabular.from_parquet_files(path=ds_path)
# show dataset settings
dataset

{
  "source": [
    "('datalakestoragegen2', 'platinum/diabetes.parquet')"
  ],
  "definition": [
    "GetDatastoreFiles",
    "ReadParquetFile",
    "DropColumns"
  ]
}

option 2: loading *.csv

In [10]:
# setup csv file(s) into a tabular dataset
ds_path = [DataPath(ds, 'platinum/diabetes.csv')]
dataset = Dataset.Tabular.from_delimited_files(path=ds_path)
# show dataset settings
dataset

{
  "source": [
    "('datalakestoragegen2', 'platinum/diabetes.csv')"
  ],
  "definition": [
    "GetDatastoreFiles",
    "ParseDelimited",
    "DropColumns",
    "SetColumnTypes"
  ]
}
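
The cells above print only the dataset definition, not its rows. A small sketch for checking that the load worked, assuming `dataset` is the `TabularDataset` created by option 1 or option 2 and that pandas is installed. `take` and `to_pandas_dataframe` are `TabularDataset` methods; the helper name `preview` is hypothetical.

```python
def preview(dataset, rows=5):
    """Load the first rows of a TabularDataset into a pandas DataFrame."""
    return dataset.take(rows).to_pandas_dataframe()


# Usage in the notebook, once `dataset` exists:
# preview(dataset)            # first 5 rows
# preview(dataset, rows=20)   # first 20 rows
```

Taking a few rows first avoids pulling the whole file into memory just to inspect the column names and types.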

option 3: loading a registered dataset (registered manually in the ML workspace)

In [13]:
# list available datasets
ws.datasets

{'diabetes_parquet_from_datastore_datalakegen2': DatasetRegistration(id='ac38e902-7833-4f0b-8785-062fe517b051', name='diabetes_parquet_from_datastore_datalakegen2', version=1, description='', tags={}), 'diabetes_parquet_from_realdatalake': DatasetRegistration(id='d4d5becf-50bd-453f-bd93-d2fe30648fcf', name='diabetes_parquet_from_realdatalake', version=1, description='', tags={}), 'diabetes_parquet_from_datalake': DatasetRegistration(id='ce05617f-a036-4f31-8959-24c412112747', name='diabetes_parquet_from_datalake', version=1, description='', tags={}), 'diabetes_from_datalake': DatasetRegistration(id='2b800b1c-3d2b-416e-b270-ce784fb8b832', name='diabetes_from_datalake', version=1, description='', tags={}), 'diabetes_from_blob': DatasetRegistration(id='9f1a6d66-9c35-4ae9-82b8-b4bfd2b03925', name='diabetes_from_blob', version=1, description='', tags={}), 'diabetes2': DatasetRegistration(id='2c81c692-c43c-4f03-9952-45124c0da47c', name='diabetes2', version=1, description='', tags={}), 'diabet

In [14]:
# load a registered dataset
dataset = Dataset.get_by_name(ws, 'diabetes_parquet_from_datastore_datalakegen2')
# show dataset settings
dataset

{
  "source": [
    "('datalakestoragegen2', 'platinum/diabetes.parquet')"
  ],
  "definition": [
    "GetDatastoreFiles",
    "ReadParquetFile",
    "DropColumns"
  ],
  "registration": {
    "id": "ac38e902-7833-4f0b-8785-062fe517b051",
    "name": "diabetes_parquet_from_datastore_datalakegen2",
    "version": 1,
    "workspace": "Workspace.create(name='machine_learning_workspace', subscription_id='43c1f93a-903d-4b23-a4bf-92bd7a150627', resource_group='myResourceGroup')"
  }
}
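
The manual registration in the workspace can also be done from code. A sketch, assuming `ws` and a `dataset` built with option 1 or option 2; the helper name `register_dataset` and the dataset name `diabetes_platinum` are hypothetical, while `register` with `create_new_version` is the SDK method on the dataset object.

```python
def register_dataset(ws, dataset, name, description=""):
    """Register a dataset in the workspace, adding a new version if the name exists."""
    return dataset.register(workspace=ws,
                            name=name,
                            description=description,
                            create_new_version=True)


# Usage in the notebook (the name is an illustrative placeholder):
# dataset = register_dataset(ws, dataset, 'diabetes_platinum')
```

With `create_new_version=True` a repeated run adds a new version under the same name instead of failing because the name is already taken.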

### Compute

option 1: create a training cluster  

In [17]:
# Specify a name for the compute (unique within the workspace)
compute_name = 'aml-cluster'
# Define compute configuration
compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2', # 2CPU-7MEM-100SSD
                                                       min_nodes=0,
                                                       max_nodes=4,
                                                       vm_priority='lowpriority' # {lowpriority, dedicated}
                                                      )
# Create the compute
training_cluster = ComputeTarget.create(ws, compute_name, compute_config)
training_cluster.wait_for_completion(show_output=True)

Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


option 2: load an existing training cluster

In [21]:
# load the training cluster
training_cluster = ComputeTarget(ws, name='aml-cluster')
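
Options 1 and 2 can be combined so the cell is safe to re-run: try to load the cluster and create it only when it does not exist. A sketch, assuming `ws` from the Workspace section; the helper name `get_or_create_cluster` is hypothetical, and the SDK imports are kept inside the function so that defining it does not require the SDK.

```python
def get_or_create_cluster(ws, name, vm_size="STANDARD_D2_V2", max_nodes=4):
    """Return the named compute target, creating an AmlCompute cluster if it is missing."""
    from azureml.core.compute import AmlCompute, ComputeTarget
    from azureml.core.compute_target import ComputeTargetException

    try:
        # Option 2: the cluster already exists in the workspace
        return ComputeTarget(workspace=ws, name=name)
    except ComputeTargetException:
        # Option 1: provision a new cluster with the same settings as above
        config = AmlCompute.provisioning_configuration(vm_size=vm_size,
                                                       min_nodes=0,
                                                       max_nodes=max_nodes,
                                                       vm_priority='lowpriority')
        cluster = ComputeTarget.create(ws, name, config)
        cluster.wait_for_completion(show_output=True)
        return cluster


# Usage in the notebook:
# training_cluster = get_or_create_cluster(ws, 'aml-cluster')
```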