# Automated ML

Import all the dependencies needed to complete the project.

In [5]:
import logging
import os
import csv
import joblib
import json
import requests
import pandas as pd 
import numpy as np 
from azureml.core import Workspace, Experiment
from azureml.train.automl import AutoMLConfig
from azureml.core.dataset import Dataset
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.widgets import RunDetails
from azureml.core.run import Run
from azureml.core.model import InferenceConfig 
from azureml.core.webservice import AciWebservice, Webservice
from azureml.core.model import Model


ModuleNotFoundError: No module named 'azureml.widgets'

## Dataset

### Overview
The primary objective was to develop an early warning system for US banks: a binary classifier that separates failed banks (`'Target'==1`) from surviving banks (`'Target'==0`) using their quarterly filings with the regulator. Overall, 137 failed banks and 6,877 surviving banks were used in this machine learning exercise. Historical observations from the four quarters ending 2010Q3 (stored in ./data) are used to tune the model, and out-of-sample testing is performed on quarterly data starting from 2010Q4 (stored in ./oos).
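To make the class imbalance concrete, the short sketch below uses only the two counts quoted above. It shows why a trivial classifier that always predicts "survived" would look accurate while catching no failures. The variable names are illustrative and not part of the original notebook.

```python
# Class counts quoted in the overview above.
n_failed = 137
n_survived = 6877

total = n_failed + n_survived
failed_share = n_failed / total

# A trivial baseline that always predicts "survived" (Target == 0):
# it is right for every surviving bank and wrong for every failed one.
baseline_accuracy = n_survived / total
baseline_recall = 0 / n_failed  # it never flags a failed bank

print(f"Share of failed banks:        {failed_share:.4f}")
print(f"Baseline accuracy (always 0): {baseline_accuracy:.4f}")
print(f"Baseline recall on failures:  {baseline_recall:.4f}")
```

A metric that rewards this baseline would be misleading for an early warning system, which is why the choice of primary metric matters later on.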

### Setting up the project

In [6]:
ws = Workspace.from_config()
ws.write_config(path='.azureml')
experiment_name = 'camels-clf'
project_folder = './dmik'

print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep = '\n')



UserErrorException: UserErrorException:
	Message: We could not find config.json in: /Users/dmitrymikhaylov/Documents/code/azure/udacity/camels-risk-profile-classification or in its parent directories. Please provide the full path to the config file or ensure that config.json exists in the parent directories.
	InnerException None
	ErrorResponse 
{
    "error": {
        "code": "UserError",
        "message": "We could not find config.json in: /Users/dmitrymikhaylov/Documents/code/azure/udacity/camels-risk-profile-classification or in its parent directories. Please provide the full path to the config file or ensure that config.json exists in the parent directories."
    }
}

### Loading the training dataset uploaded through the GUI

In [239]:
dataset = ws.datasets['camels'] 
df = dataset.to_pandas_dataframe()
df.pop('Column2')

len(df)
#df.head()
#df.tail()

7020

### Checking for or creating appropriate `ComputeTarget`

In [240]:
cpu_cluster_name = 'final-cmp'

try:
    compute_target = ComputeTarget(workspace=ws, name=cpu_cluster_name)
    print('Existing compute target.')

except ComputeTargetException:
    print('Creating compute target.')
    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_D2_V2', max_nodes=4)
    compute_target = ComputeTarget.create(ws, cpu_cluster_name, compute_config)

print(compute_target.get_status())

Existing compute target.
{
  "errors": [],
  "creationTime": "2021-03-22T19:24:32.034596+00:00",
  "createdBy": {
    "userObjectId": "49e75006-b9ac-415c-9176-f83c59d4bf26",
    "userTenantId": "d689239e-c492-40c6-b391-2c5951d31d14",
    "userName": null
  },
  "modifiedTime": "2021-03-22T19:27:20.575645+00:00",
  "state": "Running",
  "vmSize": "STANDARD_DS2_V2"
}


## AutoML Configuration
### Primary metric determines configuration
The financial metrics recorded in the last reports of failed banks should carry the predictive power needed to forecast future failures. Because the classes are heavily imbalanced and the costs associated with financial distress are high, the model should aim to maximize the recall score. In other words, accuracy is probably not the best metric: a model can be highly accurate while missing most failures, and it is the Type II error (a failed bank classified as a survivor) that needs to be minimized. The main focus of this classification should therefore be on maximizing AUC, ideally by achieving a good recall score, which is why 'norm_macro_recall' was chosen as the primary metric. The timeout and the number of concurrent iterations were set conservatively to control costs.
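The following sketch illustrates how normalized macro recall behaves on imbalanced labels. It assumes the definition given in the Azure ML documentation as I understand it, `(macro_recall - R) / (1 - R)` with `R = 1 / number_of_classes`; the helper functions and the toy labels are hypothetical and are not taken from the AutoML source.

```python
def recall_for_label(y_true, y_pred, label):
    """Share of samples with the given true label that were predicted as that label."""
    predictions = [p for t, p in zip(y_true, y_pred) if t == label]
    if not predictions:
        return 0.0
    return sum(1 for p in predictions if p == label) / len(predictions)


def norm_macro_recall(y_true, y_pred):
    """Macro-averaged recall rescaled so that chance level maps to 0 and perfect to 1."""
    labels = sorted(set(y_true))
    macro = sum(recall_for_label(y_true, y_pred, lab) for lab in labels) / len(labels)
    chance = 1.0 / len(labels)  # assumed R = 1 / C
    return (macro - chance) / (1.0 - chance)


def accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Toy, made-up labels: 98 survivors and 2 failures.
y_true = [0] * 98 + [1] * 2
always_zero = [0] * 100                        # never flags a failure
catches_failures = [0] * 90 + [1] * 8 + [1] * 2  # 8 false alarms, both failures caught

for name, y_pred in [("always_zero", always_zero), ("catches_failures", catches_failures)]:
    print(name, round(accuracy(y_true, y_pred), 3), round(norm_macro_recall(y_true, y_pred), 3))
```

In this toy case the "always zero" predictions have the higher accuracy but a normalized macro recall of zero, while the predictions that catch both failures score much higher on the chosen metric.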

In [241]:
automl_settings = {
    "experiment_timeout_minutes": 15,
    "max_concurrent_iterations": 4,
    "primary_metric" : 'norm_macro_recall',
    "verbosity": logging.INFO
    }

automl_config = AutoMLConfig(
    compute_target=compute_target, 
    task = "classification",
    training_data=dataset, 
    label_column_name="Target", 
    path = project_folder,
    enable_early_stopping= True, 
    featurization= 'auto', 
    debug_log = "automl_errors.log",
    **automl_settings
    )

## Run Details

### Possible modeling choices 
Generally speaking, decision trees should work well for this task: they make no functional-form assumptions, handle both categorical and continuous data well, and are easy to interpret. Tree-based models simply aim to reduce entropy at every split and are therefore very straightforward, with little need to worry about missing data or scaling. They are not very stable, though, as new data may produce a totally different tree, and they also tend to overfit.

A possible solution is model averaging, i.e. employing the “wisdom of the crowd”. For the present task two paths seem possible: reducing variance or reducing bias. The former starts with a complex model, i.e. a bushy, high-variance tree, and resamples the data with replacement, which produces the Random Forest family of models. The latter starts with a simple model, possibly a stump (a high-bias classifier), and learns from misclassified instances, which produces the Boosting family of models.
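The entropy criterion mentioned above can be sketched in a few lines. This is a minimal illustration of Shannon entropy and information gain for a single candidate split, using made-up labels; it is not the splitting code used by any of the AutoML models.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy of the parent node minus the size-weighted entropy of its two children."""
    n = len(parent)
    weighted_children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted_children


# Hypothetical node with 2 failed banks (1) and 6 survivors (0).
parent = [1, 1, 0, 0, 0, 0, 0, 0]

# A split that isolates the failures perfectly removes all entropy ...
perfect_gain = information_gain(parent, [1, 1], [0, 0, 0, 0, 0, 0])
# ... while a split that keeps the class mix unchanged in both children gains nothing.
useless_gain = information_gain(parent, [1, 0, 0, 0], [1, 0, 0, 0])

print(f"parent entropy: {entropy(parent):.4f}")
print(f"perfect split gain: {perfect_gain:.4f}")
print(f"useless split gain: {useless_gain:.4f}")
```

A tree grown greedily on this criterion keeps splitting until the leaves are nearly pure, which is the source of both its flexibility and its tendency to overfit.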


In [8]:
exp = Experiment(workspace=ws, name=experiment_name)
#run = exp.start_logging()

remote_run = exp.submit(config=automl_config) # <-from configured trial
#remote_run = exp.start_logging() #<- interactively experimenting

#Both mechanisms create a Run object. In interactive scenarios, 
#use logging methods such as log to add measurements and metrics 
#to the trial record. In configured scenarios use status methods such as 
#get_status to retrieve information about the run
status = remote_run.get_status()

run = remote_run.get_context() 

RunDetails(remote_run).show() #<- put in 'run' instead of 'remote_run'
#print(run.get_portal_url()) #<- should give same info

Running on remote.


_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…


Current status: FeaturesGeneration. Generating features for the dataset.
Current status: ModelSelection. Beginning model selection.

****************************************************************************************************
DATA GUARDRAILS: 

TYPE:         Cross validation
STATUS:       DONE
DESCRIPTION:  Each iteration of the trained model was validated through cross-validation.
              
DETAILS:      
+---------------------------------+
|Number of folds                  |
|3                                |
+---------------------------------+

****************************************************************************************************

TYPE:         Class balancing detection
STATUS:       ALERTED
DESCRIPTION:  To decrease model bias, please cancel the current run and fix balancing problem.
              Learn more about imbalanced data: https://aka.ms/AutomatedMLImbalancedData
DETAILS:      Imbalanced data can lead to a falsely perceived positive effect of a 

{'runId': 'AutoML_dbfdcfd4-7378-4756-b9e9-d7613df9a6be',
 'target': 'final-cmp',
 'status': 'Completed',
 'startTimeUtc': '2021-03-22T19:47:48.199852Z',
 'endTimeUtc': '2021-03-22T20:28:37.532955Z',
 'properties': {'num_iterations': '1000',
  'training_type': 'TrainFull',
  'acquisition_function': 'EI',
  'primary_metric': 'norm_macro_recall',
  'train_split': '0',
  'acquisition_parameter': '0',
  'num_cross_validation': None,
  'target': 'final-cmp',
  'DataPrepJsonString': '{\\"training_data\\": \\"{\\\\\\"blocks\\\\\\": [{\\\\\\"id\\\\\\": \\\\\\"3b4f858c-e3e9-4129-891a-09db6e84409a\\\\\\", \\\\\\"type\\\\\\": \\\\\\"Microsoft.DPrep.GetDatastoreFilesBlock\\\\\\", \\\\\\"arguments\\\\\\": {\\\\\\"datastores\\\\\\": [{\\\\\\"datastoreName\\\\\\": \\\\\\"workspaceblobstore\\\\\\", \\\\\\"path\\\\\\": \\\\\\"UI/03-22-2021_074058_UTC/camel_data_after2010Q3.csv\\\\\\", \\\\\\"resourceGroup\\\\\\": \\\\\\"final-rgp\\\\\\", \\\\\\"subscription\\\\\\": \\\\\\"0c66ad45-500d-48af-80d3-0039ebf

In [None]:
run_id = 'autoML_my_runID' #replace with run_ID
run = Run(exp, run_id)
RunDetails(run).show()

In [None]:
remote_run.wait_for_completion(show_output=True)
#run.wait_for_completion(show_output = True)

In [120]:
print("Run Status: ",remote_run.get_status())

Run Status:  Completed


## Best Model

Get the best model from the automl experiments and display all the properties of the model.



In [104]:
best_run, fitted_model = remote_run.get_output()


Package:azureml-automl-runtime, training version:1.24.0, current version:1.22.0
Package:azureml-core, training version:1.24.0.post1, current version:1.22.0
Package:azureml-dataprep, training version:2.11.2, current version:2.9.1
Package:azureml-dataprep-native, training version:30.0.0, current version:29.0.0
Package:azureml-dataprep-rslex, training version:1.9.1, current version:1.7.0
Package:azureml-dataset-runtime, training version:1.24.0, current version:1.22.0
Package:azureml-defaults, training version:1.24.0, current version:1.22.0
Package:azureml-interpret, training version:1.24.0, current version:1.22.0
Package:azureml-mlflow, training version:1.24.0, current version:1.22.0
Package:azureml-pipeline-core, training version:1.24.0, current version:1.22.0
Package:azureml-telemetry, training version:1.24.0, current version:1.22.0
Package:azureml-train-automl-client, training version:1.24.0, current version:1.22.0
Package:azureml-train-automl-runtime, training version:1.24.0, current 

In [105]:
fitted_model._final_estimator

PreFittedSoftVotingClassifier(classification_labels=None,
                              estimators=[('25',
                                           Pipeline(memory=None,
                                                    steps=[('standardscalerwrapper',
                                                            <azureml.automl.runtime.shared.model_wrappers.StandardScalerWrapper object at 0x7f320c1d0470>),
                                                           ('logisticregression',
                                                            LogisticRegression(C=0.040949150623804234,
                                                                               class_weight='balanced',
                                                                               dual=False,
                                                                               fit_intercept=True,
                                                                               intercept_sc...
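
The fitted estimator shown above is a soft-voting ensemble. As a rough illustration of the idea, the sketch below averages class probabilities from several models using weights and picks the class with the highest averaged probability. The probabilities and weights are made up, and the function is a simplified stand-in rather than AutoML's `PreFittedSoftVotingClassifier` implementation.

```python
def soft_vote(probabilities, weights):
    """Weighted average of per-model class probabilities for a single sample.

    probabilities: one [p_class0, p_class1, ...] list per model
    weights:       one non-negative weight per model
    """
    total_weight = sum(weights)
    n_classes = len(probabilities[0])
    averaged = [
        sum(w * p[k] for p, w in zip(probabilities, weights)) / total_weight
        for k in range(n_classes)
    ]
    predicted_class = max(range(n_classes), key=lambda k: averaged[k])
    return averaged, predicted_class


# Hypothetical outputs of three models for one bank: [P(survived), P(failed)].
model_probs = [
    [0.40, 0.60],
    [0.70, 0.30],
    [0.20, 0.80],
]
model_weights = [0.5, 0.25, 0.25]

averaged, predicted_class = soft_vote(model_probs, model_weights)
print("averaged probabilities:", [round(p, 3) for p in averaged])
print("predicted class:", predicted_class)
```

Unlike hard (majority) voting, soft voting lets a confident model outweigh two lukewarm ones, which can matter when the positive class is rare.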
             

Save the best model

In [151]:
joblib.dump(value=fitted_model, filename="fitted_automl_model.joblib")

['fitted_automl_model.joblib']

Load the fitted model for testing

In [7]:
best_model = joblib.load('fitted_automl_model.joblib')

ModuleNotFoundError: No module named 'azureml.automl.runtime'

Fetch sample dataset, isolate Target in vector 'y'

In [204]:
ds = dataset.to_pandas_dataframe()
sample = ds.loc[ds['Target']==1].sample(100)
y = sample.pop('Target')
sample.head()

Unnamed: 0,Column2,EQTA,EQTL,LLRTA,LLRGL,OEXTA,INCEMP,ROA,ROE,TDTL,TDTA,TATA
55,32185,0.04,0.04,0.02,0.02,0.01,-101.28,-0.02,-0.54,0.8,0.74,0.05
33,58104,0.02,0.03,0.01,0.02,0.05,-273.26,-0.05,-2.28,1.41,0.97,0.02
59,34242,0.01,0.02,0.03,0.04,0.01,-191.31,-0.03,-2.16,1.05,0.88,0.1
91,27259,0.06,0.07,0.04,0.05,0.01,-7.06,-0.0,-0.02,0.98,0.76,0.05
11,16730,0.01,0.01,0.03,0.04,0.05,-254.15,-0.09,-12.94,1.12,0.87,0.1


Run the model to produce predictions

In [207]:
print(best_model.predict(sample))

[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1]


## Model Deployment

### Register the model

In [208]:
automl_model = remote_run.register_model(model_name='automl_model.pkl')

### Create inference config

In [209]:
environment = best_run.get_environment()
entry_script='inference/scoring.py'
best_run.download_file('outputs/scoring_file_v_1_0_0.py', entry_script)
inference_config = InferenceConfig(entry_script = entry_script, environment = environment) 

### Deploy the model as web service

In [211]:
deployment_config = AciWebservice.deploy_configuration(cpu_cores = 1, 
                                                    memory_gb = 1, 
                                                    auth_enabled= True, 
                                                    enable_app_insights= True)

service = Model.deploy(ws, "aciservice", [automl_model], inference_config, deployment_config)
service.wait_for_deployment(show_output = True)


Tips: You can try get_logs(): https://aka.ms/debugimage#dockerlog or local deployment: https://aka.ms/debugimage#debug-locally to debug if deployment takes longer than 10 minutes.
Running..............................................................
Succeeded
ACI service creation operation finished, operation "Succeeded"


Check status of the web service

In [214]:
print("Checking service status: {}".format(service.state))

Checking service status: Healthy


If 'Healthy', get URIs

In [215]:
print("Scoring URI:\n {}".format(service.scoring_uri))
print("Swagger URI:\n {}".format(service.swagger_uri))

Scoring URI:
 http://ef644095-a149-4493-ab5e-c62cf2f7994a.eastus.azurecontainer.io/score
Swagger URI:
 http://ef644095-a149-4493-ab5e-c62cf2f7994a.eastus.azurecontainer.io/swagger.json


In [216]:
primary, secondary = service.get_keys()
print("Primary key: {},\nSecondary key: {}".format(primary, secondary)) 

Primary key: Qz2GKwMhod2SzlPx598wMPY5L8cRikRu,
Secondary key: XG4NnibDYmGRDNJouX4pfRKcJ9KpRxN9


Take a small sample of features of the failed banks (`'Target'==1`)

In [225]:
sample = ds.loc[ds['Target']==1].sample(5)
y = sample.pop('Target')

Use this sample to create the JSON payload and headers

In [226]:
json_payload = json.dumps({'data': sample.to_dict(orient='records')})
headers = {"Content-Type": "application/json"}
headers["Authorization"] = "Bearer {}".format(primary)

Post payload with headers and get a response 

In [235]:
resp = requests.post(service.scoring_uri, json_payload, headers=headers)
print('\nPredicted Values:', resp.json())
print('\nTrue Values: ', list(y.values))


Predicted Values: {"result": [1, 1, 1, 1, 1]}

True Values:  [1, 1, 1, 1, 1]
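
The request built above can be wrapped in two small helpers. This is a sketch under assumptions: the function names, the abbreviated record, and the placeholder key are hypothetical, and no network call is made. The double-quoted output above suggests the service may return a JSON-encoded string, so the parser accepts either a string or an already decoded dictionary.

```python
import json


def build_scoring_request(records, api_key):
    """Return the JSON body and headers for a key-authenticated scoring call."""
    body = json.dumps({"data": records})
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }
    return body, headers


def parse_scoring_response(decoded):
    """Extract the predictions, decoding once more if the payload is still a JSON string."""
    if isinstance(decoded, str):
        decoded = json.loads(decoded)
    return decoded["result"]


# Hypothetical, abbreviated record; a real payload carries every feature column.
records = [{"EQTA": 0.015, "ROA": -0.026}]
body, headers = build_scoring_request(records, "<primary-key>")

print(body)
print(parse_scoring_response('{"result": [1]}'))
```

With `requests` available, the pair would be sent as `requests.post(scoring_uri, body, headers=headers)` and the decoded reply passed to `parse_scoring_response`.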


In [236]:
print(json_payload)

{"data": [{"Column2": "182", "EQTA": 0.015013920413251787, "EQTL": 0.02451152194865504, "LLRTA": 0.022838422576785894, "LLRGL": 0.0372856975963083, "OEXTA": 0.020825832709057773, "INCEMP": -78.41095890410959, "ROA": -0.025600143117501705, "ROE": -1.7050938337801609, "TDTL": 1.4536822045036362, "TDTA": 0.8904167179131679, "TATA": 0.2226326911680848}, {"Column2": "18117", "EQTA": 0.026934538114005577, "EQTL": 0.0378484979468338, "LLRTA": 0.044843306549340274, "LLRGL": 0.06301395586135548, "OEXTA": 0.013856658624640118, "INCEMP": -260.1124694376528, "ROA": -0.033529420107377604, "ROE": -1.2448485273984624, "TDTL": 1.0720831399448, "TDTA": 0.7629381814514413, "TATA": 0.17889511695081653}, {"Column2": "35078", "EQTA": 0.013828040714855293, "EQTL": 0.016965539467323498, "LLRTA": 0.012092499648201135, "LLRGL": 0.014836214635943003, "OEXTA": 0.043895117031755713, "INCEMP": -166.67857142857142, "ROA": -0.04378254139499976, "ROE": -3.166214382632293, "TDTL": 1.0588614442577289, "TDTA": 0.8630423

In [229]:
# convert the sample records to a json data file
scoring_json = json.dumps({'data': sample.to_dict(orient='records')})
print(scoring_json)

# Set the content type
headers = {"Content-Type": "application/json"}

# set the authorization header
headers["Authorization"] = f"Bearer {primary}"

# post a request to the scoring uri
resp = requests.post(service.scoring_uri, scoring_json, headers=headers)

# print the scoring results
print('\n', resp.json())

# compare the scoring results with the corresponding y label values
print(f'\nTrue Values: {list(y.values)}')

{"data": [{"Column2": "182", "EQTA": 0.015013920413251787, "EQTL": 0.02451152194865504, "LLRTA": 0.022838422576785894, "LLRGL": 0.0372856975963083, "OEXTA": 0.020825832709057773, "INCEMP": -78.41095890410959, "ROA": -0.025600143117501705, "ROE": -1.7050938337801609, "TDTL": 1.4536822045036362, "TDTA": 0.8904167179131679, "TATA": 0.2226326911680848}, {"Column2": "18117", "EQTA": 0.026934538114005577, "EQTL": 0.0378484979468338, "LLRTA": 0.044843306549340274, "LLRGL": 0.06301395586135548, "OEXTA": 0.013856658624640118, "INCEMP": -260.1124694376528, "ROA": -0.033529420107377604, "ROE": -1.2448485273984624, "TDTL": 1.0720831399448, "TDTA": 0.7629381814514413, "TATA": 0.17889511695081653}, {"Column2": "35078", "EQTA": 0.013828040714855293, "EQTL": 0.016965539467323498, "LLRTA": 0.012092499648201135, "LLRGL": 0.014836214635943003, "OEXTA": 0.043895117031755713, "INCEMP": -166.67857142857142, "ROA": -0.04378254139499976, "ROE": -3.166214382632293, "TDTL": 1.0588614442577289, "TDTA": 0.8630423

In [145]:
service.delete()

No service with name aciservice found to delete.


In [55]:
# another way to test the deployed service: call it through the SDK
print("Prediction: {}".format(service.run(scoring_json)))
print(f'True Values: {list(y.values)}')

Prediction: {"result": [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
True Values: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]


### Out-of-sample Testing

In [None]:
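from sklearn.metrics import confusion_matrix

# y_oos / predictions_oos: true labels and model predictions for the ./oos data (assumed to be defined earlier)
tn, fp, fn, tp = confusion_matrix(y_oos, predictions_oos).ravel()
print("recall: {0:.5f}".format(tp/(tp+fn)))       # TP / (TP + FN)
print("precision: {0:.5f}".format(tp/(tp+fp)))    # TP / (TP + FP)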
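
For reference, the same recall and precision figures can be computed without scikit-learn. The sketch below counts the four confusion-matrix cells directly and guards against division by zero, which can occur when a small out-of-sample quarter contains no failures or no positive predictions. The labels and predictions are made up for illustration and are not results from this project.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tn, fp, fn, tp) for a binary problem, in the order used by ravel() above."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1
        elif truth == positive:
            fn += 1
        elif pred == positive:
            fp += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def safe_ratio(numerator, denominator):
    """Avoid ZeroDivisionError when a class or prediction is absent."""
    return numerator / denominator if denominator else 0.0


# Hypothetical labels and predictions.
y_demo = [1, 1, 1, 0, 0, 0, 0, 0]
pred_demo = [1, 1, 0, 1, 0, 0, 0, 0]

tn_d, fp_d, fn_d, tp_d = confusion_counts(y_demo, pred_demo)
recall_demo = safe_ratio(tp_d, tp_d + fn_d)
precision_demo = safe_ratio(tp_d, tp_d + fp_d)

print("recall: {0:.5f}".format(recall_demo))
print("precision: {0:.5f}".format(precision_demo))
```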

Printing the logs of the web service prior to deleting the service

In [33]:
# print the logs by calling the get_logs() function of the web service
print(f'webservice logs: \n{service.get_logs()}\n')

webservice logs: 
2021-03-22T21:06:32,392545800+00:00 - iot-server/run 
2021-03-22T21:06:32,393517400+00:00 - gunicorn/run 
2021-03-22T21:06:32,423816600+00:00 - nginx/run 
/usr/sbin/nginx: /azureml-envs/azureml_2b14f450572e78de640d54eaabed5e4d/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)
2021-03-22T21:06:32,432858200+00:00 - rsyslog/run 
/usr/sbin/nginx: /azureml-envs/azureml_2b14f450572e78de640d54eaabed5e4d/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_2b14f450572e78de640d54eaabed5e4d/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_2b14f450572e78de640d54eaabed5e4d/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_2b14f450572e78de640d54eaabed5e4d/lib/libssl.so.1.0.0: no version information available (required by /usr/sb

## Clean Up

In [143]:
service.delete()

In [144]:
service.state