# Automated ML

TODO: Import Dependencies. In the cell below, import all the dependencies that you will need to complete the project.

In [147]:
import logging
import os
import csv
import joblib
import json
import requests
import pandas as pd 
import numpy as np 
from azureml.core import Workspace, Experiment
from azureml.train.automl import AutoMLConfig
from azureml.core.dataset import Dataset
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.widgets import RunDetails
from azureml.core.model import InferenceConfig 
from azureml.core.webservice import AciWebservice, Webservice
from azureml.core.model import Model


## Dataset

### Overview
The primary objective was to develop an early warning system for US banks, i.e. a binary classifier that separates failed ('Target'==1) from surviving ('Target'==0) banks using their quarterly filings with the regulator. Overall, 137 failed banks and 6,877 surviving banks were used in this machine learning exercise. Historical observations from the first 4 quarters ending 2010Q3 (stored in ./data) are used to tune the model, and out-of-sample testing is performed on quarterly data starting from 2010Q4 (stored in ./oos).
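The in-sample / out-of-sample split described above is a split by time rather than a random one. A minimal sketch of that idea follows, assuming each record carries a quarter label such as `2010Q3`; the `quarter` field, the helper names and the toy records are hypothetical and are not taken from the files in ./data or ./oos.

```python
def quarter_key(label):
    """Turn a label such as '2010Q3' into a sortable (year, quarter) tuple."""
    year, quarter = label.upper().split("Q")
    return int(year), int(quarter)


def split_by_quarter(records, last_training_quarter="2010Q3"):
    """Records up to and including the cutoff go to training, later ones to testing."""
    cutoff = quarter_key(last_training_quarter)
    train = [r for r in records if quarter_key(r["quarter"]) <= cutoff]
    oos = [r for r in records if quarter_key(r["quarter"]) > cutoff]
    return train, oos


# Toy records for illustration only (not real bank data).
records = [
    {"quarter": "2010Q2", "Target": 0},
    {"quarter": "2010Q3", "Target": 1},
    {"quarter": "2010Q4", "Target": 0},
    {"quarter": "2011Q1", "Target": 1},
]
train, oos = split_by_quarter(records)
print(len(train), "training records,", len(oos), "out-of-sample records")
```

Splitting on the reporting date keeps information from later quarters out of the tuning data, which a random split would not guarantee.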

### Setting up the project

In [148]:
ws = Workspace.from_config()
ws.write_config(path='.azureml')
experiment_name = 'camels-clf'
project_folder = './dmik'

exp = Experiment(workspace=ws, name=experiment_name)

print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')

run = exp.start_logging()

Workspace name: final-ws
Azure region: eastus
Subscription id: 0c66ad45-500d-48af-80d3-0039ebf1975e
Resource group: final-rgp


### Loading the training dataset uploaded through the GUI

In [149]:
dataset = ws.datasets['camels'] 
df = dataset.to_pandas_dataframe()
df.pop('Column2')
#len(df)
df.tail()

Unnamed: 0,Target,EQTA,EQTL,LLRTA,LLRGL,OEXTA,INCEMP,ROA,ROE,TDTL,TDTA,TATA
7015,0,0.14,0.63,0.0,0.0,0.01,80.99,0.01,0.08,3.43,0.75,0.62
7016,0,0.09,0.11,0.0,0.0,0.01,46.46,0.0,0.03,1.08,0.86,0.0
7017,0,0.13,0.18,0.0,0.0,0.01,85.78,0.02,0.13,0.98,0.73,0.06
7018,0,0.77,16.59,0.0,0.02,0.07,-143.14,-0.07,-0.09,4.92,0.23,0.14
7019,0,0.1,0.19,0.0,0.0,0.0,13.05,0.0,0.03,1.5,0.83,0.02


### Checking for or creating appropriate `ComputeTarget`

In [150]:
cpu_cluster_name = 'final-cmp'

try:
    compute_target = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print('Existing compute target.')

except ComputeTargetException:  # raised when no compute target with this name exists
    print('Creating compute target.')
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2', max_nodes=4)
    compute_target = ComputeTarget.create(ws, cpu_cluster_name, compute_config)

print(compute_target.get_status())

Existing compute target.
{
  "errors": [],
  "creationTime": "2021-03-22T19:24:32.034596+00:00",
  "createdBy": {
    "userObjectId": "49e75006-b9ac-415c-9176-f83c59d4bf26",
    "userTenantId": "d689239e-c492-40c6-b391-2c5951d31d14",
    "userName": null
  },
  "modifiedTime": "2021-03-22T19:27:20.575645+00:00",
  "state": "Running",
  "vmSize": "STANDARD_DS2_V2"
}


## AutoML Configuration
### Primary metric determines configuration
Financial metrics recorded in the last reports of failed banks should carry the predictive power needed to forecast future failures. Because the classes are heavily imbalanced and financial distress is costly, the model should aim to maximize the recall score. In other words, accuracy is probably not the best metric, as it is the Type II error (a failed bank classified as surviving) that needs to be minimized. The main focus of this classification should therefore be on maximizing AUC, ideally by achieving a good recall score, which is why 'norm_macro_recall' was chosen as the primary metric. Timeout and number of concurrent iterations were set conservatively to control costs.
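To make the choice of metric concrete, here is a minimal sketch of normalized macro recall, assuming the definition `(macro_recall - R) / (1 - R)` with `R = 1 / C` for `C` classes, as described in the Azure AutoML metric documentation. The helper names are illustrative and not part of the Azure ML SDK; the class counts are the ones quoted in the Overview.

```python
def per_class_recall(y_true, y_pred):
    """Recall of every class that occurs in y_true."""
    recalls = {}
    for label in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == label]
        hits = sum(1 for i in idx if y_pred[i] == label)
        recalls[label] = hits / len(idx)
    return recalls


def norm_macro_recall(y_true, y_pred):
    """Macro recall rescaled so that random guessing scores 0 and a perfect model 1."""
    recalls = per_class_recall(y_true, y_pred)
    n_classes = len(recalls)
    macro = sum(recalls.values()) / n_classes
    chance = 1.0 / n_classes
    return (macro - chance) / (1.0 - chance)


# 137 failed and 6,877 surviving banks, scored by a model that never predicts failure.
y_true = [1] * 137 + [0] * 6877
always_survive = [0] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, always_survive)) / len(y_true)
score = norm_macro_recall(y_true, always_survive)
print(f"accuracy: {accuracy:.4f}, norm_macro_recall: {score:.4f}")
```

The trivial classifier reaches an accuracy above 98% while its normalized macro recall is zero, which illustrates why accuracy is a poor guide on this dataset.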

In [7]:
automl_settings = {
    "experiment_timeout_minutes": 15,
    "max_concurrent_iterations": 4,
    "primary_metric" : 'norm_macro_recall'
    }

automl_config = AutoMLConfig(
    compute_target=compute_target, 
    task = "classification",
    training_data=dataset, 
    label_column_name="Target", 
    path = project_folder,
    enable_early_stopping= True, 
    featurization= 'auto', 
    debug_log = "automl_errors.log",
    **automl_settings
    )

## Run Details

### Possible modeling choices 
Generally speaking, decision trees should work well for this task, as these models make no functional-form assumptions, handle both categorical and continuous data well, and are easy to interpret. Tree-based models simply aim to reduce impurity (e.g. entropy) at every split and are therefore straightforward to apply, with little need to worry about feature scaling and, in many implementations, missing data. They are not very stable, though: new data may produce a totally different tree, and they also tend to overfit.

A possible solution is model averaging, i.e. employing the “wisdom of the crowd”. For the present task two paths seem possible: reducing variance or reducing bias. The former implies a complex model, i.e. starting with bushy, high-variance trees and resampling with replacement, which produces the Random Forest family of models. The latter implies starting with a simple, high-bias classifier, possibly a stump, and learning from misclassified instances, which produces the Boosting family of models.
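The “wisdom of the crowd” argument can be illustrated with a short calculation. The sketch below assumes voters whose errors are fully independent, which real ensemble members trained on overlapping data are not, so it shows the idealized upper end of the effect rather than what a Random Forest or Boosting model achieves in practice. The function name is illustrative.

```python
from math import comb


def majority_vote_accuracy(p, n):
    """Probability that a majority of n independent voters, each correct
    with probability p, reaches the right answer (n must be odd)."""
    if n % 2 == 0:
        raise ValueError("use an odd number of voters to avoid ties")
    k_min = n // 2 + 1  # smallest number of correct votes that still wins
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


# Weak voters that are right 60% of the time, combined in growing committees.
for n in (1, 5, 25, 101):
    print(n, round(majority_vote_accuracy(0.6, n), 3))
```

Under the independence assumption the committee improves as it grows, provided each voter is better than chance; with `p = 0.5` averaging brings no gain.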


In [8]:
remote_run = exp.submit(config=automl_config) 
RunDetails(remote_run).show() # <--use Notebook widget
remote_run.wait_for_completion(show_output=True)

Running on remote.


_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…


Current status: FeaturesGeneration. Generating features for the dataset.
Current status: ModelSelection. Beginning model selection.

****************************************************************************************************
DATA GUARDRAILS: 

TYPE:         Cross validation
STATUS:       DONE
DESCRIPTION:  Each iteration of the trained model was validated through cross-validation.
              
DETAILS:      
+---------------------------------+
|Number of folds                  |
|3                                |
+---------------------------------+

****************************************************************************************************

TYPE:         Class balancing detection
STATUS:       ALERTED
DESCRIPTION:  To decrease model bias, please cancel the current run and fix balancing problem.
              Learn more about imbalanced data: https://aka.ms/AutomatedMLImbalancedData
DETAILS:      Imbalanced data can lead to a falsely perceived positive effect of a 

{'runId': 'AutoML_dbfdcfd4-7378-4756-b9e9-d7613df9a6be',
 'target': 'final-cmp',
 'status': 'Completed',
 'startTimeUtc': '2021-03-22T19:47:48.199852Z',
 'endTimeUtc': '2021-03-22T20:28:37.532955Z',
 'properties': {'num_iterations': '1000',
  'training_type': 'TrainFull',
  'acquisition_function': 'EI',
  'primary_metric': 'norm_macro_recall',
  'train_split': '0',
  'acquisition_parameter': '0',
  'num_cross_validation': None,
  'target': 'final-cmp',
  'DataPrepJsonString': '{\\"training_data\\": \\"{\\\\\\"blocks\\\\\\": [{\\\\\\"id\\\\\\": \\\\\\"3b4f858c-e3e9-4129-891a-09db6e84409a\\\\\\", \\\\\\"type\\\\\\": \\\\\\"Microsoft.DPrep.GetDatastoreFilesBlock\\\\\\", \\\\\\"arguments\\\\\\": {\\\\\\"datastores\\\\\\": [{\\\\\\"datastoreName\\\\\\": \\\\\\"workspaceblobstore\\\\\\", \\\\\\"path\\\\\\": \\\\\\"UI/03-22-2021_074058_UTC/camel_data_after2010Q3.csv\\\\\\", \\\\\\"resourceGroup\\\\\\": \\\\\\"final-rgp\\\\\\", \\\\\\"subscription\\\\\\": \\\\\\"0c66ad45-500d-48af-80d3-0039ebf

In [120]:
print("Run Status: ",remote_run.get_status())

Run Status:  Completed


## Best Model

TODO: In the cell below, get the best model from the automl experiments and display all the properties of the model.



In [104]:
best_run, fitted_model = remote_run.get_output()


Package:azureml-automl-runtime, training version:1.24.0, current version:1.22.0
Package:azureml-core, training version:1.24.0.post1, current version:1.22.0
Package:azureml-dataprep, training version:2.11.2, current version:2.9.1
Package:azureml-dataprep-native, training version:30.0.0, current version:29.0.0
Package:azureml-dataprep-rslex, training version:1.9.1, current version:1.7.0
Package:azureml-dataset-runtime, training version:1.24.0, current version:1.22.0
Package:azureml-defaults, training version:1.24.0, current version:1.22.0
Package:azureml-interpret, training version:1.24.0, current version:1.22.0
Package:azureml-mlflow, training version:1.24.0, current version:1.22.0
Package:azureml-pipeline-core, training version:1.24.0, current version:1.22.0
Package:azureml-telemetry, training version:1.24.0, current version:1.22.0
Package:azureml-train-automl-client, training version:1.24.0, current version:1.22.0
Package:azureml-train-automl-runtime, training version:1.24.0, current 

In [105]:
fitted_model._final_estimator

PreFittedSoftVotingClassifier(classification_labels=None,
                              estimators=[('25',
                                           Pipeline(memory=None,
                                                    steps=[('standardscalerwrapper',
                                                            <azureml.automl.runtime.shared.model_wrappers.StandardScalerWrapper object at 0x7f320c1d0470>),
                                                           ('logisticregression',
                                                            LogisticRegression(C=0.040949150623804234,
                                                                               class_weight='balanced',
                                                                               dual=False,
                                                                               fit_intercept=True,
                                                                               intercept_sc...
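The `class_weight='balanced'` setting visible in the estimator above is one common response to the class imbalance flagged by the data guardrails. A minimal sketch of the heuristic scikit-learn documents for this option, `n_samples / (n_classes * count)`, applied to the class counts from the Overview; the helper name is illustrative.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n_samples / (n_classes * count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# 137 failed (1) and 6,877 surviving (0) banks, as stated in the Overview.
labels = [1] * 137 + [0] * 6877
weights = balanced_class_weights(labels)
print({label: round(w, 3) for label, w in weights.items()})
```

With these counts a misclassified failed bank costs roughly fifty times as much in the loss as a misclassified surviving bank, which pushes the fit towards higher recall on the minority class.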
             

In [151]:
# Save the best model
joblib.dump(value=fitted_model, filename="fitted_automl_model.joblib")

['fitted_automl_model.joblib']

In [157]:
m = joblib.load('fitted_automl_model.joblib')

In [180]:
ds = dataset.to_pandas_dataframe()
sample = ds.loc[ds['Target']==1].sample(100)
y = sample.pop('Target')
sample.head()

Unnamed: 0,Column2,EQTA,EQTL,LLRTA,LLRGL,OEXTA,INCEMP,ROA,ROE,TDTL,TDTA,TATA
71,57315,0.03,0.04,0.03,0.04,0.01,-9.86,-0.0,-0.11,1.1,0.92,0.03
30,57399,0.02,0.04,0.01,0.02,0.03,-144.83,-0.02,-0.96,1.57,0.94,0.07
1,3287,0.08,0.19,0.0,0.01,0.02,20.9,0.0,0.05,2.21,0.91,0.36
68,35517,0.03,0.04,0.03,0.03,0.01,-2.12,-0.0,-0.01,0.97,0.87,0.03
116,21777,0.03,0.04,0.05,0.07,0.04,-124.49,-0.04,-1.28,1.24,0.91,0.13


In [181]:
print(m.predict(sample))

[1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
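Since every row in this sample is a failed bank, the share of ones in the printed array is the recall on the sample. A small sketch of that calculation follows; the prediction list is reconstructed from the printed output by count only (98 ones and 2 zeros), so the original order is not preserved.

```python
def recall_on_positives(predictions):
    """Share of truly positive rows that the model flagged as positive.
    Assumes every row passed in has a true label of 1."""
    return sum(1 for p in predictions if p == 1) / len(predictions)


# Counts taken from the printed predictions above: 98 ones and 2 zeros.
predictions = [1] * 98 + [0] * 2
recall = recall_on_positives(predictions)
print(f"recall on the sampled failed banks: {recall:.2f}")
```

The sample is drawn from the same dataset that was used for training, so this figure is likely optimistic compared with out-of-sample performance.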


## Model Deployment

Remember you have to deploy only one of the two models you trained. Perform the steps in the rest of this notebook only if you wish to deploy this model.

TODO: In the cell below, register the model, create an inference config and deploy the model as a web service.

In [165]:

automl_model = remote_run.register_model(model_name='automl_model.pkl')


### Create inference config

In [152]:
environment = best_run.get_environment()
entry_script='inference/scoring.py'
best_run.download_file('outputs/scoring_file_v_1_0_0.py', entry_script)
inference_config = InferenceConfig(entry_script = entry_script, environment = environment) 

### Deploy the model as web service

In [182]:
deployment_config = AciWebservice.deploy_configuration(cpu_cores = 1, 
                                                    memory_gb = 1, 
                                                    auth_enabled= True, 
                                                    enable_app_insights= True)

service = Model.deploy(ws, "aciservice", [automl_model], inference_config, deployment_config)
service.wait_for_deployment(show_output = True)


WebserviceException: WebserviceException:
	Message: Service aciservice with the same name already exists, please use a different service name or delete the existing service.
	InnerException None
	ErrorResponse 
{
    "error": {
        "message": "Service aciservice with the same name already exists, please use a different service name or delete the existing service."
    }
}

In [183]:
print("Checking service status:{}".format(service.state))

Checking service status:Unhealthy


If 'Healthy', get URIs

In [135]:
print("Scoring URI: {}".format(service.scoring_uri))
print("Swagger URI: {}".format(service.swagger_uri))

Scoring URI: http://e6d3ccd2-b44e-4259-87fe-8bdc28dff4ec.eastus.azurecontainer.io/score
Swagger URI: http://e6d3ccd2-b44e-4259-87fe-8bdc28dff4ec.eastus.azurecontainer.io/swagger.json


In [140]:
primary, secondary = service.get_keys()
print("primary key: {},\nsecond key: {}".format(primary, secondary)) 

primary key: taQRhOtWcivd0YtpegeGV1w9r77N0BT7,
second key: t2ra1FhQWOBjWebWU5EiU80S2uB35K9a


Take a small sample of features of the failed banks (`'Target'==1`)

In [141]:
sample = df.loc[df['Target']==1].sample(5)
y = sample.pop('Target')

In [142]:
# convert the sample records to a JSON string
scoring_json = json.dumps({'data': sample.to_dict(orient='records')})
print(f'{scoring_json}')

# Set the content type
headers = {"Content-Type": "application/json"}

# set the authorization header
headers["Authorization"] = f"Bearer {primary}"

# post a request to the scoring uri
resp = requests.post(service.scoring_uri, scoring_json, headers=headers)

# print the scoring results
print('\n', resp.json())

# compare the scoring results with the corresponding y label values
print(f'\nTrue Values: {list(y.values)}')

{"data": [{"EQTA": 0.04184211984375433, "EQTL": 0.056076110917739876, "LLRTA": 0.030428428955314845, "LLRGL": 0.04077967281587191, "OEXTA": 0.017439122364739452, "INCEMP": -65.35555555555555, "ROA": -0.010184364351608167, "ROE": -0.24339981792601176, "TDTL": 1.2812669683257918, "TDTA": 0.9560385904645815, "TATA": 0.08062304346621603}, {"EQTA": 0.00500656159501208, "EQTL": 0.005842295708111368, "LLRTA": 0.03579929948128638, "LLRGL": 0.04177519635857157, "OEXTA": 0.0074746195782244874, "INCEMP": -63.828611898016995, "ROA": -0.01386240769007861, "ROE": -2.768847926267281, "TDTL": 1.0219569806010653, "TDTA": 0.8757671344379454, "TATA": 0.024451400822091258}, {"EQTA": 0.0071222856005424134, "EQTL": 0.011152235317035162, "LLRTA": 0.003551812295117658, "LLRGL": 0.005561507743255089, "OEXTA": 0.01529580811505135, "INCEMP": -185.07692307692307, "ROA": -0.029932260532585235, "ROE": -4.2026200873362445, "TDTL": 1.514863153793708, "TDTA": 0.967455198024421, "TATA": 0.10127330293662098}, {"EQTA": -

JSONDecodeError: Expecting value: line 1 column 1 (char 0)
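A `JSONDecodeError` at `line 1 column 1` means the response body did not start with valid JSON, for example because it was empty or an error page; the output above does not show which. A defensive sketch that inspects the status code and body before decoding is given below. The helper name is hypothetical, and the second decoding step assumes the scoring script returns a JSON-encoded string, which may not hold for every deployment.

```python
import json


def parse_scoring_response(status_code, body_text):
    """Decode a scoring response, or raise an error that includes the raw body."""
    if status_code != 200:
        raise RuntimeError(
            f"scoring request failed with HTTP {status_code}: {body_text[:200]!r}"
        )
    try:
        payload = json.loads(body_text)
        # Assumption: some scoring scripts return a JSON string inside the JSON body.
        if isinstance(payload, str):
            payload = json.loads(payload)
    except json.JSONDecodeError as err:
        raise RuntimeError(f"response is not valid JSON: {body_text[:200]!r}") from err
    return payload


print(parse_scoring_response(200, '{"result": [1, 0, 1]}'))
```

With the `requests` response object from the cell above, this would be called as `parse_scoring_response(resp.status_code, resp.text)`, so a failing call reports the HTTP status and body instead of a bare decoding error.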


In [145]:
service.delete()

No service with name aciservice found to delete.


In [55]:
# another way to test the scoring uri
print("Prediction: {}".format(service.run(scoring_json)))
print(f'True Values: {list(y.values)}')
pd.DataFrame()

Prediction: {"result": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
True Values: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


### Out-of-sample Testing

In [None]:
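from sklearn.metrics import confusion_matrix

# y_oos and predictions_oos are assumed to hold the out-of-sample labels and predictions
tn, fp, fn, tp = confusion_matrix(y_oos, predictions_oos).ravel()
print("recall: {0:.5f}".format(tp / (tp + fn)))        # TP / (TP + FN)
print("precision: {0:.5f}".format(tp / (tp + fp)))  # TP / (TP + FP)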
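The cell above was not executed, so no out-of-sample figures are available here. As an illustration of what it computes, the sketch below derives the four confusion-matrix counts and the two ratios in plain Python on toy labels. The toy arrays and helper names are hypothetical, and the zero-denominator guard is an addition that the original cell does not have.

```python
def binary_confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for labels coded as 0 (survived) and 1 (failed)."""
    tn = fp = fn = tp = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 1 and p == 0:
            fn += 1
        elif t == 0 and p == 1:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def safe_ratio(numerator, denominator):
    """Avoid ZeroDivisionError when a class or prediction never occurs."""
    return numerator / denominator if denominator else float("nan")


# Toy labels for illustration only (not the ./oos data).
y_toy = [1, 1, 1, 0, 0, 0, 0, 0]
pred_toy = [1, 1, 0, 1, 0, 0, 0, 0]

tn, fp, fn, tp = binary_confusion_counts(y_toy, pred_toy)
print("recall: {0:.5f}".format(safe_ratio(tp, tp + fn)))
print("precision: {0:.5f}".format(safe_ratio(tp, tp + fp)))
```

The guard matters for a quarter with very few failures, where a model may predict no positives at all and `tp + fp` becomes zero.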

Printing the logs of the web service prior to deleting the service

In [33]:
# print the logs by calling the get_logs() function of the web service
print(f'webservice logs: \n{service.get_logs()}\n')

webservice logs: 
2021-03-22T21:06:32,392545800+00:00 - iot-server/run 
2021-03-22T21:06:32,393517400+00:00 - gunicorn/run 
2021-03-22T21:06:32,423816600+00:00 - nginx/run 
/usr/sbin/nginx: /azureml-envs/azureml_2b14f450572e78de640d54eaabed5e4d/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)
2021-03-22T21:06:32,432858200+00:00 - rsyslog/run 
/usr/sbin/nginx: /azureml-envs/azureml_2b14f450572e78de640d54eaabed5e4d/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_2b14f450572e78de640d54eaabed5e4d/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_2b14f450572e78de640d54eaabed5e4d/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_2b14f450572e78de640d54eaabed5e4d/lib/libssl.so.1.0.0: no version information available (required by /usr/sb

## Clean Up

In [143]:
service.delete()

In [144]:
# `state` is an attribute of the service object, not a method
print(service.state)