# Automated ML

TODO: Import Dependencies. In the cell below, import all the dependencies that you will need to complete the project.

In [31]:
#!pip install azureml-sdk==1.22.0

In [32]:
import logging
import os
import csv
import joblib
import json
import requests
import pandas as pd 
import numpy as np 
import azureml.core
from azureml.core import Workspace, Experiment
from azureml.train.automl import AutoMLConfig
from azureml.core.dataset import Dataset
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.widgets import RunDetails
from azureml.core.run import Run
from azureml.core.model import InferenceConfig 
from azureml.core.webservice import AciWebservice, Webservice
from azureml.core.model import Model


print("SDK version:", azureml.core.VERSION)

SDK version: 1.22.0


## Dataset

### Overview
The primary objective was to develop an early warning system, i.e. binary classification of failed ('Target'==1) vs. survived ('Target'==0), for the US banks using their quarterly filings with the regulator. Overall, 137 failed banks and 6,877 surviving banks were used in this machine learning exercise. Historical observations from the first 4 quarters ending 2010Q3 (stored in ./data) are used to tune the model and out-of-sample testing is performed on quarterly data starting from 2010Q4 (stored in ./oos). 

### Setting up the project

In [3]:
ws = Workspace.from_config()
ws.write_config(path='.azureml')
experiment_name = 'camels-clf'
project_folder = './dmik'
exp = Experiment(workspace=ws, name=experiment_name)
run = exp.start_logging() # added and 

print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')


Workspace name: final-ws
Azure region: eastus
Subscription id: 0c66ad45-500d-48af-80d3-0039ebf1975e
Resource group: final-rgp


### Uploading the training dataset using GUI

In [33]:
dataset = ws.datasets['camels11'] 
df = dataset.to_pandas_dataframe()
#df.pop('Column2')

#len(df)
df.head()
#df.tail()

Unnamed: 0,Column2,Target,EQTA,EQTL,LLRTA,LLRGL,OEXTA,INCEMP,ROA,ROE,TDTL,TDTA,TATA
0,1252,1,0.01,0.01,0.09,0.12,0.03,-593.17,-0.14,-15.77,1.23,0.98,0.11
1,3287,1,0.08,0.19,0.0,0.01,0.02,20.9,0.0,0.05,2.21,0.91,0.36
2,5672,1,0.0,0.0,0.07,0.1,0.03,-323.52,-0.06,-27.68,1.38,0.93,0.14
3,5702,1,0.02,0.02,0.03,0.04,0.04,-153.6,-0.05,-3.12,1.2,0.9,0.09
4,8221,1,0.01,0.01,0.04,0.05,0.04,-217.89,-0.07,-6.03,1.15,0.99,0.1


### Checking for or creating appropriate `ComputeTarget`

In [5]:
cpu_cluster_name = 'final-cmp'

try:
    compute_target = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print('Existing compute target.')

except:
    print('Creating compute target.')
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2', max_nodes=4)
    compute_target = ComputeTarget.create(ws, cpu_cluster_name, compute_config)

print(compute_target.get_status())

Existing compute target.
{
  "errors": [],
  "creationTime": "2021-03-22T19:24:32.034596+00:00",
  "createdBy": {
    "userObjectId": "49e75006-b9ac-415c-9176-f83c59d4bf26",
    "userTenantId": "d689239e-c492-40c6-b391-2c5951d31d14",
    "userName": null
  },
  "modifiedTime": "2021-03-22T19:27:20.575645+00:00",
  "state": "Running",
  "vmSize": "STANDARD_DS2_V2"
}


## AutoML Configuration
### Primary metric determins configuaration
Financial metrics recorded in the last reports of the failed banks should have predictive power that is needed to forecast future failures. Due to significant class imbalances and taking into account costs accosiated with financial distress, the model should aim to maximize the recall score. In other words, accuracy is probably not the best metrics, as Type II error needs to be minimized. This is why the main focus of this classification should be on maximizing AUC, hopefully, by achieving good recall score. This is why 'norm_macro_recall' was chosen as a primary metric. Timeout and number of concurrent iterations were set conservatively to control the costs.

In [6]:
automl_settings = {
    "experiment_timeout_minutes": 20,
    "max_concurrent_iterations": 4,
    "primary_metric" : 'norm_macro_recall',
    "verbosity": logging.INFO
    }

automl_config = AutoMLConfig(
    compute_target=compute_target, 
    task = "classification",
    training_data=dataset, 
    label_column_name="Target", 
    path = project_folder,
    enable_early_stopping= True, 
    featurization= 'auto', 
    debug_log = "automl_errors.log",
    **automl_settings
    )

## Run Details

### Possible modeling choices 
Generally speaking, decision trees should work well for this task, as these models do not make any functional form assumptions, handle both categorical and continuous data well, and are easy to interpret. Tree-based models simply aim to reduce entropy at every split and are therefore very straightforward, no need to worry about missing data and scaling. They are not very stable though, as new data may produce a totally different tree, and they also tend to overfit.

Possible solution would be model averaging - employing “wisdom of the crowd”. It seems that for the present task two paths are possible: reducing variance or reducing bias. The former implies complex model, i.e. starting with a bushy, high-variance tree and resampling with replacement, what will produce a family of Random Forest models. The later implies starting with a simple model, i.e. possible a stump, high-bias classifier and learning from miss-classified instances, what will produce a family of Boosting models.


In [7]:
remote_run = exp.submit(config=automl_config, show_output=False) 
RunDetails(remote_run).show() 
remote_run.wait_for_completion(show_output=False)

Running on remote.


_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…

{'runId': 'AutoML_644989e4-1844-40f1-8b40-53c568d3c068',
 'target': 'final-cmp',
 'status': 'Completed',
 'startTimeUtc': '2021-03-24T01:16:38.628668Z',
 'endTimeUtc': '2021-03-24T01:32:56.026721Z',
 'properties': {'num_iterations': '1000',
  'training_type': 'TrainFull',
  'acquisition_function': 'EI',
  'primary_metric': 'norm_macro_recall',
  'train_split': '0',
  'acquisition_parameter': '0',
  'num_cross_validation': None,
  'target': 'final-cmp',
  'DataPrepJsonString': '{\\"training_data\\": \\"{\\\\\\"blocks\\\\\\": [{\\\\\\"id\\\\\\": \\\\\\"0d2d11b9-cd34-4682-852f-64735b51567f\\\\\\", \\\\\\"type\\\\\\": \\\\\\"Microsoft.DPrep.GetDatastoreFilesBlock\\\\\\", \\\\\\"arguments\\\\\\": {\\\\\\"datastores\\\\\\": [{\\\\\\"datastoreName\\\\\\": \\\\\\"workspaceblobstore\\\\\\", \\\\\\"path\\\\\\": \\\\\\"UI/03-23-2021_112308_UTC/camel_data_after2010Q3.csv\\\\\\", \\\\\\"resourceGroup\\\\\\": \\\\\\"final-rgp\\\\\\", \\\\\\"subscription\\\\\\": \\\\\\"0c66ad45-500d-48af-80d3-0039ebf

In [9]:
#remote_run.wait_for_completion(show_output=False)

In [10]:
print("Run Status: ",remote_run.get_status())

Run Status:  Completed


## Best Model

Get the best model from the automl experiments and display all the properties of the model.



In [11]:
print(remote_run.get_metrics())

{'experiment_status': ['DatasetEvaluation', 'FeaturesGeneration', 'DatasetFeaturization', 'DatasetFeaturizationCompleted', 'DatasetBalancing', 'DatasetCrossValidationSplit', 'ModelSelection', 'BestRunExplainModel', 'ModelExplanationDataSetSetup', 'PickSurrogateModel', 'EngineeredFeatureExplanations', 'EngineeredFeatureExplanations', 'RawFeaturesExplanations', 'RawFeaturesExplanations', 'BestRunExplainModel'], 'experiment_status_description': ['Gathering dataset statistics.', 'Generating features for the dataset.', 'Beginning to fit featurizers and featurize the dataset.', 'Completed fit featurizers and featurizing the dataset.', 'Performing class balancing sweeping', 'Generating individually featurized CV splits.', 'Beginning model selection.', 'Best run model explanations started', 'Model explanations data setup completed', 'Choosing LightGBM as the surrogate model for explanations', 'Computation of engineered features started', 'Computation of engineered features completed', 'Computa

Veiw the files that were created for the best run:

In [12]:
print(remote_run.get_file_names())

['automl_driver.py', 'definition.json', 'definition_original.json', 'outputs/verifier_results.json']


View dependencies for the best run model:

In [17]:
remote_run.get_run_sdk_dependencies()

{'azureml': '0.2.7',
 'azureml-widgets': '1.24.0',
 'azureml-train': '1.24.0',
 'azureml-train-restclients-hyperdrive': '1.24.0',
 'azureml-train-core': '1.24.0',
 'azureml-train-automl': '1.22.0',
 'azureml-train-automl-runtime': '1.22.0',
 'azureml-train-automl-client': '1.24.0',
 'azureml-tensorboard': '1.22.0',
 'azureml-telemetry': '1.24.0',
 'azureml-sdk': '1.24.0',
 'azureml-samples': '0+unknown',
 'azureml-pipeline': '1.24.0',
 'azureml-pipeline-steps': '1.24.0',
 'azureml-pipeline-core': '1.24.0',
 'azureml-opendatasets': '1.22.0',
 'azureml-model-management-sdk': '1.0.1b6.post1',
 'azureml-mlflow': '1.22.0',
 'azureml-interpret': '1.22.0',
 'azureml-explain-model': '1.22.0',
 'azureml-defaults': '1.22.0',
 'azureml-dataset-runtime': '1.24.0',
 'azureml-dataprep': '2.11.2',
 'azureml-dataprep-rslex': '1.9.1',
 'azureml-dataprep-native': '30.0.0',
 'azureml-datadrift': '1.22.0',
 'azureml-core': '1.24.0.post2',
 'azureml-contrib-services': '1.22.0',
 'azureml-contrib-server': '

In [13]:
best_run, fitted_model = remote_run.get_output()

Package:azureml-automl-runtime, training version:1.24.0, current version:1.22.0
Package:azureml-core, training version:1.24.0.post1, current version:1.22.0
Package:azureml-dataprep, training version:2.11.2, current version:2.9.1
Package:azureml-dataprep-native, training version:30.0.0, current version:29.0.0
Package:azureml-dataprep-rslex, training version:1.9.1, current version:1.7.0
Package:azureml-dataset-runtime, training version:1.24.0, current version:1.22.0
Package:azureml-defaults, training version:1.24.0, current version:1.22.0
Package:azureml-interpret, training version:1.24.0, current version:1.22.0
Package:azureml-mlflow, training version:1.24.0, current version:1.22.0
Package:azureml-pipeline-core, training version:1.24.0, current version:1.22.0
Package:azureml-telemetry, training version:1.24.0, current version:1.22.0
Package:azureml-train-automl-client, training version:1.24.0, current version:1.22.0.post1
Package:azureml-train-automl-runtime, training version:1.24.0, cu

In [15]:
#This should give the preprocessor(s) and algorithm (estimator) used
#best_run, model = parent_run.get_output()
#estimator = model.steps[-1]
estimator = fitted_model.steps[1]
estimator

('prefittedsoftvotingclassifier',
 PreFittedSoftVotingClassifier(classification_labels=None,
                               estimators=[('4',
                                            Pipeline(memory=None,
                                                     steps=[('standardscalerwrapper',
                                                             <azureml.automl.runtime.shared.model_wrappers.StandardScalerWrapper object at 0x7f9cce69f9b0>),
                                                            ('randomforestclassifier',
                                                             RandomForestClassifier(bootstrap=False,
                                                                                    ccp_alpha=0.0,
                                                                                    class_weight='balanced',
                                                                                    criterion='entropy',
                                                

In [17]:
#fitted_model._final_estimator

Save the best model

In [18]:
joblib.dump(value=fitted_model, filename="fitted_automl_model.joblib")

['fitted_automl_model.joblib']

Load the fitted model for testing

In [35]:
best_model = joblib.load('fitted_automl_model.joblib')

Fetch sample dataset, isolate Target in vector 'y'

In [34]:
sample = ds.loc[df['Target']==1].sample(100)
y = sample.pop('Target')
sample.head()

Unnamed: 0,Column2,EQTA,EQTL,LLRTA,LLRGL,OEXTA,INCEMP,ROA,ROE,TDTL,TDTA,TATA
47,24067,-0.0,-0.0,0.03,0.05,0.01,-23.04,-0.01,4.25,1.45,0.8,0.27
76,57814,-0.01,-0.01,0.04,0.04,0.01,-410.83,-0.04,4.34,1.06,0.93,0.1
61,34658,0.01,0.01,0.03,0.04,0.01,-79.4,-0.02,-3.76,1.19,0.99,0.01
41,17792,0.02,0.02,0.03,0.05,0.02,-25.44,-0.01,-0.84,1.28,0.84,0.07
15,22469,0.02,0.03,0.02,0.02,0.04,-158.0,-0.05,-2.16,1.29,0.95,0.09


Run the model to produce predictions

In [36]:
print(best_model.predict(sample))

[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]


## Model Deployment

### Register the model

In [51]:
automl_model = remote_run.register_model(model_name='automl_model.pkl')

In [41]:
#remote_run.get_file_names()

In [39]:
# Download the conda environment file produced by AutoML and create an environment
#from azureml.core.environment import Environment
#remote_run.download_file('outputs/conda_env_v_1_0_0.yml', 'conda_env.yml')
#myenv = Environment.from_conda_specification(name = 'myenv',file_path = 'conda_env.yml')

### Create inference config

In [52]:
environment = best_run.get_environment()
entry_script='inference/scoring.py'
best_run.download_file('outputs/scoring_file_v_1_0_0.py', entry_script)
inference_config = InferenceConfig(entry_script = entry_script, environment = environment) 

### Deploy the model as web service

In [53]:
deployment_config = AciWebservice.deploy_configuration(cpu_cores = 1, 
                                                    memory_gb = 1, 
                                                    auth_enabled= True, 
                                                    enable_app_insights= True)

service = Model.deploy(ws, "aciservice", [automl_model], inference_config, deployment_config)
service.wait_for_deployment(show_output = True)


WebserviceException: WebserviceException:
	Message: Service aciservice with the same name already exists, please use a different service name or delete the existing service.
	InnerException None
	ErrorResponse 
{
    "error": {
        "message": "Service aciservice with the same name already exists, please use a different service name or delete the existing service."
    }
}

Check status of the web service

In [50]:
print("Checking service status: {}".format(aci_service.state))

NameError: name 'aci_service' is not defined

If 'Healthy', get URIs

In [215]:
print("Scoring URI:\n {}".format(service.scoring_uri))
print("Swagger URI:\n {}".format(service.swagger_uri))

Scoring URI:
 http://ef644095-a149-4493-ab5e-c62cf2f7994a.eastus.azurecontainer.io/score
Swagger URI:
 http://ef644095-a149-4493-ab5e-c62cf2f7994a.eastus.azurecontainer.io/swagger.json


In [216]:
primary, secondary = service.get_keys()
print("Primary key: {},\nSecondary key: {}".format(primary, secondary)) 

Primary key: Qz2GKwMhod2SzlPx598wMPY5L8cRikRu,
Secondary key: XG4NnibDYmGRDNJouX4pfRKcJ9KpRxN9


Take a small sample of feautures of the failed banks (`'Target'=1`)

In [225]:
sample = ds.loc[df['Target']==1].sample(5)
y = sample.pop('Target')

Use this  sample to create JSON payload and headers

In [226]:
json_payload = json.dumps({'data': sample.to_dict(orient='records')})
headers = {"Content-Type": "application/json"}
headers["Authorization"] = "Bearer {}".format(primary) #{primary}"

Post payload with headers and get a response 

In [235]:
resp = requests.post(service.scoring_uri, json_payload, headers=headers)
print('\nPredicted Values:', resp.json())
print(f'\nTrue Values: ', list(y.values))


Predicted Values: {"result": [1, 1, 1, 1, 1]}

True Values:  [1, 1, 1, 1, 1]


In [236]:
print(json_payload)

{"data": [{"Column2": "182", "EQTA": 0.015013920413251787, "EQTL": 0.02451152194865504, "LLRTA": 0.022838422576785894, "LLRGL": 0.0372856975963083, "OEXTA": 0.020825832709057773, "INCEMP": -78.41095890410959, "ROA": -0.025600143117501705, "ROE": -1.7050938337801609, "TDTL": 1.4536822045036362, "TDTA": 0.8904167179131679, "TATA": 0.2226326911680848}, {"Column2": "18117", "EQTA": 0.026934538114005577, "EQTL": 0.0378484979468338, "LLRTA": 0.044843306549340274, "LLRGL": 0.06301395586135548, "OEXTA": 0.013856658624640118, "INCEMP": -260.1124694376528, "ROA": -0.033529420107377604, "ROE": -1.2448485273984624, "TDTL": 1.0720831399448, "TDTA": 0.7629381814514413, "TATA": 0.17889511695081653}, {"Column2": "35078", "EQTA": 0.013828040714855293, "EQTL": 0.016965539467323498, "LLRTA": 0.012092499648201135, "LLRGL": 0.014836214635943003, "OEXTA": 0.043895117031755713, "INCEMP": -166.67857142857142, "ROA": -0.04378254139499976, "ROE": -3.166214382632293, "TDTL": 1.0588614442577289, "TDTA": 0.8630423

In [229]:
# convert the sample records to a json data file
scoring_json = json.dumps({'data': sample.to_dict(orient='records')})
print(f'{scoring_json}')

# Set the content type
headers = {"Content-Type": "application/json"}

# set the authorization header
headers["Authorization"] = f"Bearer {primary}"

# post a request to the scoring uri
resp = requests.post(service.scoring_uri, scoring_json, headers=headers)

# print the scoring results
print('\n', resp.json())

# compare the scoring results with the corresponding y label values
print(f'\nTrue Values: {list(y.values)}')

{"data": [{"Column2": "182", "EQTA": 0.015013920413251787, "EQTL": 0.02451152194865504, "LLRTA": 0.022838422576785894, "LLRGL": 0.0372856975963083, "OEXTA": 0.020825832709057773, "INCEMP": -78.41095890410959, "ROA": -0.025600143117501705, "ROE": -1.7050938337801609, "TDTL": 1.4536822045036362, "TDTA": 0.8904167179131679, "TATA": 0.2226326911680848}, {"Column2": "18117", "EQTA": 0.026934538114005577, "EQTL": 0.0378484979468338, "LLRTA": 0.044843306549340274, "LLRGL": 0.06301395586135548, "OEXTA": 0.013856658624640118, "INCEMP": -260.1124694376528, "ROA": -0.033529420107377604, "ROE": -1.2448485273984624, "TDTL": 1.0720831399448, "TDTA": 0.7629381814514413, "TATA": 0.17889511695081653}, {"Column2": "35078", "EQTA": 0.013828040714855293, "EQTL": 0.016965539467323498, "LLRTA": 0.012092499648201135, "LLRGL": 0.014836214635943003, "OEXTA": 0.043895117031755713, "INCEMP": -166.67857142857142, "ROA": -0.04378254139499976, "ROE": -3.166214382632293, "TDTL": 1.0588614442577289, "TDTA": 0.8630423

In [49]:
aci_service.delete()

NameError: name 'aci_service' is not defined

In [55]:
# another way to test the scoring uri
print("Prediction: {}".format(service.run(scoring_json)))
print(f'True Values: {list(y.values)}')
pd.DataFrame()

Prediction: {"result": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
True Values: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


### Out-of-sample Testing

In [None]:
tn, fp, fn, tp = confusion_matrix(y_oos, predictions_oos).ravel()
print("recall: {0:.5f}".format(tp/(tp+fn)))
print("precision: {0:.5f}".format(tp/(tp+fp))) #TP / (TP + FP)

Printing the logs of the web service prior to deleting the service

In [33]:
# print the logs by calling the get_logs() function of the web service
print(f'webservice logs: \n{service.get_logs()}\n')

webservice logs: 
2021-03-22T21:06:32,392545800+00:00 - iot-server/run 
2021-03-22T21:06:32,393517400+00:00 - gunicorn/run 
2021-03-22T21:06:32,423816600+00:00 - nginx/run 
/usr/sbin/nginx: /azureml-envs/azureml_2b14f450572e78de640d54eaabed5e4d/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)
2021-03-22T21:06:32,432858200+00:00 - rsyslog/run 
/usr/sbin/nginx: /azureml-envs/azureml_2b14f450572e78de640d54eaabed5e4d/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_2b14f450572e78de640d54eaabed5e4d/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_2b14f450572e78de640d54eaabed5e4d/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_2b14f450572e78de640d54eaabed5e4d/lib/libssl.so.1.0.0: no version information available (required by /usr/sb

## Clean Up

In [143]:
service.delete()

In [144]:
sevice.state()

NameError: name 'sevice' is not defined