# Automated ML


In [1]:
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.train.automl import AutoMLConfig
from azureml.core.dataset import Dataset
from azureml.core.compute import ComputeTarget
from azureml.pipeline.steps import AutoMLStep
from azureml.widgets import RunDetails
from azureml.core.model import Model, InferenceConfig
from azureml.core import Environment
from azureml.core.webservice import AciWebservice, Webservice

## Dataset

### Overview
In this project we use a dataset from a company's HR department. It contains one entry per employee, including personal information, current position, and work performance metrics.
The objective is to predict whether a given employee will receive a promotion. The dataset is highly imbalanced, with only around 5% of employees having received a promotion.

The dataset is available on Kaggle [https://www.kaggle.com/shivan118/hranalysis]. We downloaded it manually and registered it in our workspace's default datastore under the name "hr-data".
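The imbalance mentioned above can be checked with a quick count of the label values. The following is a minimal sketch on toy labels; `class_shares` and `toy_labels` are hypothetical names, and with the real data the labels would come from the `is_promoted` column of the loaded dataset.

```python
from collections import Counter


def class_shares(labels):
    """Return the fraction of rows that belongs to each class label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Toy stand-in for the is_promoted column: 1 promotion among 20 employees.
toy_labels = [0] * 19 + [1]
shares = class_shares(toy_labels)
print(shares)
```

A share for class `1` far below 0.5 is the signal that plain accuracy will be a misleading metric.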

In [2]:
ws = Workspace.from_config()

# choose a name for experiment
experiment_name = 'hr-automl'

experiment=Experiment(ws, experiment_name)

In [3]:
# get dataset by name
dataset = Dataset.get_by_name(ws, 'hr-data', version='latest')

In [4]:
# view first rows
dataset.take(5).to_pandas_dataframe()

Unnamed: 0,department,region,education,gender,recruitment_channel,no_of_trainings,age,previous_year_rating,length_of_service,KPIs_met >80%,awards_won?,avg_training_score,is_promoted
0,Sales & Marketing,region_7,Master's & above,f,sourcing,1,35,5,8,1,0,49,0
1,Operations,region_22,Bachelor's,m,other,1,30,5,4,0,0,60,0
2,Sales & Marketing,region_19,Bachelor's,m,sourcing,1,34,3,7,0,0,50,0
3,Sales & Marketing,region_23,Bachelor's,m,other,2,39,1,10,0,0,50,0
4,Technology,region_26,Bachelor's,m,other,1,45,3,2,0,0,73,0


## AutoML Configuration

We must make sure that the target column we specify has exactly the same name as in our dataset.
The problem type is classification, and as the primary metric we used the weighted AUC, since it is a good metric for imbalanced problems.
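To illustrate why AUC is preferred over accuracy here, the sketch below compares both metrics for a trivial model that never predicts a promotion. It uses a plain pairwise binary AUC written for this example, not Azure ML's own `AUC_weighted` implementation, and the labels are synthetic.

```python
def auc(y_true, scores):
    """Probability that a random positive is scored above a random negative.

    Ties count as half a win. The pairwise loop is O(n_pos * n_neg),
    which is fine for a small illustration.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else (0.5 if p == n else 0.0)
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Synthetic labels with 5% positives, similar to the promotion rate.
y_true = [0] * 95 + [1] * 5

# Trivial model: same score for everyone, always predicts "not promoted".
constant_acc = accuracy(y_true, [0] * 100)
constant_auc = auc(y_true, [0.0] * 100)

# Model that ranks every promoted employee above every other employee.
perfect_auc = auc(y_true, [0.1] * 95 + [0.9] * 5)

print(constant_acc, constant_auc, perfect_auc)
```

The trivial model reaches high accuracy purely because of the class imbalance, while its AUC stays at chance level.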

Run configurations:
- We previously created a cluster with 4 nodes. To reduce running time we allowed 3 parallel runs (number of nodes minus 1).
- We activated early stopping. This setting terminates the experiment if there is no improvement in ten consecutive iterations. By default, counting only starts after the 20th iteration.
- Because computing time is limited in the lab, we set the maximum duration of the experiment to 30 minutes.
- We allowed automatic featurization. This means that AutoML will perform imputation and one-hot encoding of categorical variables.
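The early stopping rule described in the list above can be sketched as follows. This is a simplified illustration of the idea, not Azure ML's actual logic; `should_stop`, `patience`, and `warmup` are hypothetical names chosen for this example.

```python
def should_stop(scores, patience=10, warmup=20):
    """Decide whether to stop, given one primary-metric score per iteration.

    Stops only once at least `warmup + patience` iterations have finished
    and none of the last `patience` scores beats the best earlier score.
    Higher scores are assumed to be better.
    """
    if len(scores) < warmup + patience:
        return False
    best_before = max(scores[:-patience])
    return max(scores[-patience:]) <= best_before


# 20 improving iterations followed by 10 iterations without improvement.
stalled = [0.5 + 0.01 * i for i in range(20)] + [0.6] * 10
# 30 iterations that keep improving.
improving = [0.5 + 0.01 * i for i in range(30)]

print(should_stop(stalled), should_stop(improving))
```

With these defaults the earliest possible stop is after 30 completed iterations, which matters when the experiment timeout is short.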

In [5]:
# get previously created compute target
cluster = ComputeTarget(workspace=ws, name='cluster-1')

# automl settings here
automl_settings = {
    "experiment_timeout_minutes": 30,
    "max_concurrent_iterations": 3,
    "primary_metric": "AUC_weighted"
}

# AutoML config here
automl_config = AutoMLConfig(compute_target=cluster,
                            training_data=dataset,
                            task="classification",
                            label_column_name='is_promoted', 
                            enable_early_stopping=True,
                            featurization='auto',
                            debug_log='automl_errors.log',
                            **automl_settings
                            )

In [6]:
# Submit your experiment
remote_run = experiment.submit(automl_config)

Submitting remote run.


Experiment,Id,Type,Status,Details Page,Docs Page
hr-automl,AutoML_d23e6bc6-b464-4dbb-bc48-754e8dbdad30,automl,NotStarted,Link to Azure Machine Learning studio,Link to Documentation


## Run Details

In [7]:
RunDetails(remote_run).show()

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…

_UserRunWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', '…

## Best Model

In this section we get the best run and then register the best model in case we want to deploy it. Additionally, we download it to our local outputs folder.


In [9]:
best_run = remote_run.get_best_child()
best_run

Experiment,Id,Type,Status,Details Page,Docs Page
hr-automl,AutoML_d23e6bc6-b464-4dbb-bc48-754e8dbdad30_34,azureml.scriptrun,Completed,Link to Azure Machine Learning studio,Link to Documentation


Here is the weighted AUC of the best run:

In [10]:
best_run.get_metrics()['AUC_weighted']

0.905541571394528

And here we can see the composition of the best model, which is an ensemble:

In [38]:
print(f'ensembled_iterations: {best_run.tags["ensembled_iterations"]}')
print(f'ensembled_algorithms: {best_run.tags["ensembled_algorithms"]}')
print(f'ensemble_weights: {best_run.tags["ensemble_weights"]}')


ensembled_iterations: [0, 29, 27, 23, 22, 30, 26, 17, 18]
ensembled_algorithms: ['LightGBM', 'XGBoostClassifier', 'LightGBM', 'XGBoostClassifier', 'XGBoostClassifier', 'LightGBM', 'LogisticRegression', 'ExtremeRandomTrees', 'RandomForest']
ensemble_weights: [0.3333333333333333, 0.2, 0.06666666666666667, 0.06666666666666667, 0.06666666666666667, 0.06666666666666667, 0.06666666666666667, 0.06666666666666667, 0.06666666666666667]
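To make the weights above more concrete, here is a minimal sketch of how a soft-voting ensemble could combine its base models, assuming the best run is a voting ensemble that averages predicted class probabilities. The probabilities in `probas` are made up for illustration; only the weights are taken from the run tags.

```python
# Weights copied from the ensemble_weights tag of the best run.
weights = [0.3333333333333333, 0.2] + [0.06666666666666667] * 7


def soft_vote(probas, weights):
    """Weighted average of the positive-class probabilities of each model."""
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, probas)) / total


# Hypothetical positive-class probabilities from the nine base models.
probas = [0.9, 0.8, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4]
combined = soft_vote(probas, weights)

# Predict "promoted" when the combined probability reaches 0.5.
prediction = int(combined >= 0.5)
print(round(combined, 4), prediction)
```

In this toy case the two most heavily weighted models outvote the seven others, which shows how much influence the first LightGBM iteration has in the ensemble.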


In [26]:
best_run.download_file('outputs/model.pkl', output_file_path='outputs/model.pkl')

## Model Deployment

The best AutoML model had better performance, therefore we proceed to deploy it.

We first have to register the model.

In [27]:
# register model
model = best_run.register_model('hr-auto-ml-model', 
                        description='best model found by automl for HR data', 
                        model_path='outputs/model.pkl')

To deploy an AutoML model, the easiest way is to do it from the Azure web interface. However, we will do it manually in this section.

To deploy our model we need to specify an inference configuration.
This must include an environment and a scoring script used as the entry point of our REST API.
For each run, AutoML stores information about the environment and a scoring script.
We will first download those files.

In [28]:
best_run.download_file('outputs/conda_env_v_1_0_0.yml', output_file_path='outputs/conda_env_v_1_0_0.yml')
best_run.download_file('outputs/scoring_file_v_1_0_0.py', output_file_path='outputs/scoring_file_v_1_0_0.py')
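For orientation, an Azure ML entry script is expected to define an `init()` function, called once when the service starts, and a `run()` function, called for every request. The sketch below shows that shape only. It is not the content of the downloaded `scoring_file_v_1_0_0.py`, and `_StubModel` is a hypothetical stand-in for the unpickled model.

```python
import json


class _StubModel:
    """Hypothetical stand-in for the model loaded from model.pkl."""

    def predict(self, rows):
        # Always predicts "not promoted"; a real model would use the features.
        return [0 for _ in rows]


model = None


def init():
    """Called once at service start-up: load the model into memory."""
    global model
    # A real script would deserialize model.pkl here instead of using a stub.
    model = _StubModel()


def run(raw_data):
    """Called per request with the raw JSON body; returns a JSON string."""
    try:
        rows = json.loads(raw_data)["data"]
        return json.dumps({"result": model.predict(rows)})
    except Exception as exc:
        return json.dumps({"error": str(exc)})


init()
reply = run(json.dumps({"data": [{"age": 35}]}))
print(reply)
```

Reading the downloaded scoring file before deploying is worthwhile, since it defines the exact request format the service accepts.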

We must then create an environment from the conda specification we downloaded from the best run and use it in our inference configuration. We also need a deployment configuration to specify that the model will run in ACI.

In [30]:
env = Environment.from_conda_specification(name='auto-ml-env', file_path='outputs/conda_env_v_1_0_0.yml')
inf_config = InferenceConfig(environment=env, 
                            source_directory='./outputs',
                            entry_script='scoring_file_v_1_0_0.py')
deployment_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)

Now we are able to deploy the model

In [31]:
service = Model.deploy(ws, 'hr-automl-model-service', [model], inference_config=inf_config, deployment_config=deployment_config)
service.wait_for_deployment(show_output=True)

Tips: You can try get_logs(): https://aka.ms/debugimage#dockerlog or local deployment: https://aka.ms/debugimage#debug-locally to debug if deployment takes longer than 10 minutes.
Running
2021-05-04 15:12:56+00:00 Creating Container Registry if not exists..
2021-05-04 15:13:06+00:00 Registering the environment.
2021-05-04 15:13:07+00:00 Building image..
2021-05-04 15:26:39+00:00 Generating deployment configuration.
2021-05-04 15:26:39+00:00 Submitting deployment to compute.
2021-05-04 15:26:42+00:00 Checking the status of deployment hr-automl-model-service..
2021-05-04 15:29:45+00:00 Checking the status of inference endpoint hr-automl-model-service.
Succeeded
ACI service creation operation finished, operation "Succeeded"


Now we test the service with the data of a fictitious employee.

In [33]:
import requests
import json

uri = service.scoring_uri
headers = {"Content-Type": "application/json"}
data = {
    "data": [
        {
            "department": "Finance", "region": "region_2", "education": "Bachelor's", "gender": "f",
            "recruitment_channel": "other", "no_of_trainings": 1, "age": 35, "previous_year_rating": 4,
            "length_of_service": 7, "KPIs_met >80%": 1, "awards_won?": 0, "avg_training_score": 86
        }
    ]
}

# serialize the payload to a JSON string
data = json.dumps(data)

response = requests.post(uri, data=data,  headers=headers)

print(f'Response Code: {response.status_code}')
print(f'Prediction: {response.json()}')


Response Code: 200
Prediction: {"result": [1]}
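The double quotes in the printed prediction suggest that `response.json()` returned a JSON-encoded string rather than a Python dict. A small helper like the one below could unwrap it. This is a sketch: `parse_prediction` is a hypothetical name, and the `"error"` key is an assumption about how the scoring script reports failures.

```python
import json


def parse_prediction(body):
    """Extract the list of predicted labels from a scoring response body."""
    # The service may wrap the actual JSON object in a JSON-encoded string.
    if isinstance(body, str):
        body = json.loads(body)
    # Assumption: failures are reported under an "error" key.
    if "error" in body:
        raise RuntimeError(body["error"])
    return body["result"]


from_string = parse_prediction('{"result": [1]}')
from_dict = parse_prediction({"result": [1]})
print(from_string, from_dict)
```

With the service above, this would be called as `parse_prediction(response.json())`.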


We can see the logs of the service

In [34]:
print(service.get_logs())

2021-05-04T15:29:38,611864000+00:00 - iot-server/run 
2021-05-04T15:29:38,618776400+00:00 - rsyslog/run 
2021-05-04T15:29:38,622641800+00:00 - gunicorn/run 
2021-05-04T15:29:38,639478500+00:00 - nginx/run 
/usr/sbin/nginx: /azureml-envs/azureml_f8f5ff2f983718fa04a09abf22f98303/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_f8f5ff2f983718fa04a09abf22f98303/lib/libcrypto.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_f8f5ff2f983718fa04a09abf22f98303/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_f8f5ff2f983718fa04a09abf22f98303/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
/usr/sbin/nginx: /azureml-envs/azureml_f8f5ff2f983718fa04a09abf22f98303/lib/libssl.so.1.0.0: no version information available (required by /usr/sbin/nginx)
rsyslogd