# Automated ML

Steps for creating the AutoML experiment:

1. Create Cluster
2. Download data
3. Click history generation
4. Register data
5. Configure AutoML experiment
6. Submit AutoML experiment for execution
7. Register the best performing model

Import Dependencies

In [1]:
import os
import pandas as pd
from azureml.data.dataset_factory import TabularDatasetFactory
from azureml.core.experiment import Experiment
from azureml.core.workspace import Workspace
from azureml.core.dataset import Dataset
from tqdm import tqdm
from utils import get_click_histories

from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.automl.core.featurization import FeaturizationConfig
from azureml.train.automl import AutoMLConfig

from azureml.widgets import RunDetails
import logging

## Initialize workspace
Initialize a Workspace object from the existing workspace you created in the Prerequisites step. Workspace.from_config() creates a workspace object from the details stored in config.json.

In [2]:
# set up workspace
ws = Workspace.from_config()
print('Workspace name: ' + ws.name, 
      'Azure region: ' + ws.location, 
      'Subscription id: ' + ws.subscription_id, 
      'Resource group: ' + ws.resource_group, sep='\n')

# create experiment
experiment_name = 'click-through-rate-prediction'
experiment = Experiment(ws, experiment_name)

Workspace name: quick-starts-ws-134610
Azure region: southcentralus
Subscription id: 3d1a56d2-7c81-4118-9790-f85d1acf0c77
Resource group: aml-quickstarts-134610


## Create or Attach existing AmlCompute

Use Azure ML managed compute (AmlCompute) as the remote training compute resource.

If an AmlCompute cluster with that name already exists in your workspace, this code skips the creation step.

In [3]:
cluster_name = "compute-cluster"

try:
    compute_target = ComputeTarget(workspace=ws, name=cluster_name)
    print('Found existing compute target.')
    
except ComputeTargetException:
    print('Creating a new compute target...')
    compute_config = AmlCompute.provisioning_configuration(
        vm_size='Standard_DS3_v2',                                                    
        min_nodes=2,                                                   
        max_nodes=5
    )

    # create the cluster
    compute_target = ComputeTarget.create(ws, cluster_name, compute_config)

compute_target.wait_for_completion(show_output=True)

# use get_status() to get a detailed status for the current cluster. 
print(compute_target.get_status().serialize())

Found existing compute target.
Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned
{'currentNodeCount': 2, 'targetNodeCount': 2, 'nodeStateCounts': {'preparingNodeCount': 0, 'runningNodeCount': 0, 'idleNodeCount': 2, 'unusableNodeCount': 0, 'leavingNodeCount': 0, 'preemptedNodeCount': 0}, 'allocationState': 'Steady', 'allocationStateTransitionTime': '2021-01-14T01:01:47.861000+00:00', 'errors': None, 'creationTime': '2021-01-14T00:59:24.252578+00:00', 'modifiedTime': '2021-01-14T00:59:41.029133+00:00', 'provisioningState': 'Succeeded', 'provisioningStateTransitionTime': None, 'scaleSettings': {'minNodeCount': 2, 'maxNodeCount': 5, 'nodeIdleTimeBeforeScaleDown': 'PT120S'}, 'vmPriority': 'Dedicated', 'vmSize': 'STANDARD_DS3_V2'}


## Dataset

### Overview

The dataset is from Kaggle, where it was originally used for the 2020 DIGIX CTR Prediction competition. The full dataset is 6 GB and contains 36 columns (some masked) describing advertising behavior over seven consecutive days. Specifically, the dataset includes:

* **label** - Whether the customer clicked on an advertisement
* **uid** - Unique user ID after data anonymization
* **task_id** - Unique ID of an ad task
* **adv_id** - Unique ID of an ad material
* **creat_type_cd** - Unique ID of an ad creative type
* **adv_prim_id** - Advertiser ID of an ad task
* **dev_id** - Developer ID of an ad task
* **inter_typ_cd** - Display form of an ad material
* **slot_id** - Ad slot ID
* **spread_app_id** - App ID of an ad task
* **tags** - App tag of an ad task
* **app_first_class** - App level-1 category of an ad task
* **app_second_class** - App level-2 category of an ad task
* **age** - User age
* **city** - Resident city of a user
* **city_rank** - Level of the resident city of a user
* **device_name** - Phone model used by a user
* **device_size** - Size of the phone used by a user
* **career** - User occupation
* **gender** - User gender
* **net_type** - Network status when a behavior occurs
* **residence** - Resident province of a user
* **his_app_size** - App storage size
* **his_on_shelf_time** - Release time
* **app_score** - App rating score
* **emui_dev** - EMUI version
* **list_time** - Model release time
* **device_price** - Device price
* **up_life_duration** - HUAWEI ID lifecycle
* **up_membership_grade** - Service membership level
* **membership_life_duration** - Membership lifecycle
* **consume_purchase** - Paid user tag
* **communication_onlinerate** - Active time by mobile phone
* **communication_avgonline_30d** - Daily active time by mobile phone
* **indu_name** - Ad industry information
* **pt_d** - Date when a behavior occurs

The task is to predict, for each advertisement, whether the customer will click on it in the near future.

In [4]:
found = False
train_key = "CTR-Training"
test_key = "CTR-Test"
description_text = "Ad click data for predicting whether a user clicks on an advertisement"

if train_key in ws.datasets.keys() and test_key in ws.datasets.keys():
    print("Found existing dataset, using")
    found = True
    train_dataset = ws.datasets[train_key]
    test_dataset = ws.datasets[test_key]

if not found:
    # Create AML Dataset and register it into Workspace
    print(f"Did not find existing dataset with key {train_key} and {test_key}, creating")

    # run the shell script to download the data
    os.system("sh fetch_dataset.sh")

    # read the data in chunks
    num_of_chunk = 0
    chunkSize = 10 ** 6
    ctr_data = pd.DataFrame()
    n = 7

    # from each chunk, keep up to 1,000 rows per label to form a smaller, balanced training sample
    for chunk in tqdm(
        pd.read_csv("data/train_data.csv", iterator=True, sep="|", chunksize=chunkSize)
    ):
        num_of_chunk += 1
        chunk_label_0 = chunk[chunk["label"] == 0]
        chunk_label_1 = chunk[chunk["label"] == 1]
        ctr_data = pd.concat([ctr_data, chunk_label_0.iloc[n * 1000 : (n + 1) * 1000, :]], axis=0)
        ctr_data = pd.concat([ctr_data, chunk_label_1.iloc[n * 1000 : (n + 1) * 1000, :]], axis=0)
        
    
    # create historical click through rate features
    time_features = [
        "uid",
        "task_id",
        "adv_id",
        "adv_prim_id",
        "spread_app_id",
    ]

    for feature in time_features:
        for time_window in range(1, 7):
            ctr_data = get_click_histories(ctr_data, feature, time_window)

    # drop duplicate rows and save the sampled data as a single csv file
    ctr_data.drop_duplicates().to_csv("data/ctr_data.csv", index=False)

    # remove old files
    os.system("rm data/train_data.csv")

    # upload the data to default datastore
    ds = ws.get_default_datastore()
    ds.upload(
        src_dir="./data",
        target_path="click_through_rate",
        overwrite=True,
        show_progress=True,
    )

    dataset = Dataset.Tabular.from_delimited_files(
        path=ds.path("click_through_rate/ctr_data.csv")
    )

    train_dataset, test_dataset = dataset.random_split(percentage=0.8, seed=7)

    # Register Dataset in Workspace
    train_dataset = train_dataset.register(workspace=ws, name=train_key, description=description_text)
    test_dataset = test_dataset.register(workspace=ws, name=test_key, description=description_text)

Found existing dataset, using
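
The `get_click_histories` helper used above is imported from the local `utils` module and is not shown in this notebook. As a rough illustration only, a historical click-through-rate feature for one key column and one look-back window might be computed along the following lines. The function name `add_click_history`, the output column naming, and the assumption that `pt_d` is an integer day index are all hypothetical and are not taken from the original helper.

```python
from collections import defaultdict


def add_click_history(rows, key, window, day_col="pt_d", label_col="label"):
    """Return copies of rows with a historical CTR feature for `key`.

    For each row, the feature is the mean label of rows sharing the same
    key value on the `window` days strictly before the row's own day.
    Rows with no history get None. (Hypothetical sketch, not the utils code.)
    """
    # (key value, day) -> [clicks, impressions]
    daily = defaultdict(lambda: [0, 0])
    for row in rows:
        stats = daily[(row[key], row[day_col])]
        stats[0] += row[label_col]
        stats[1] += 1

    feature_name = f"{key}_ctr_prev_{window}d"
    result = []
    for row in rows:
        clicks = impressions = 0
        # only look at earlier days so the row's own label never leaks in
        for day in range(row[day_col] - window, row[day_col]):
            day_clicks, day_impressions = daily.get((row[key], day), (0, 0))
            clicks += day_clicks
            impressions += day_impressions
        new_row = dict(row)
        new_row[feature_name] = clicks / impressions if impressions else None
        result.append(new_row)
    return result


sample = [
    {"uid": "a", "pt_d": 1, "label": 1},
    {"uid": "a", "pt_d": 1, "label": 0},
    {"uid": "a", "pt_d": 2, "label": 0},
    {"uid": "a", "pt_d": 3, "label": 1},
    {"uid": "b", "pt_d": 3, "label": 0},
]
enriched = add_click_history(sample, key="uid", window=2)
for row in enriched:
    print(row)
```

Restricting the window to strictly earlier days is a design choice in this sketch: it keeps the feature from encoding the label of the row it is attached to.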


In [5]:
label_column_name = 'label'

## AutoML Configuration

Explanation of the AutoML settings:

* `iteration_timeout_minutes` and `experiment_timeout_minutes` both control how long AutoML spends searching for the best model. Because the models found tend to perform very similarly, capping these two parameters at reasonable values helps avoid unnecessary costs
* `primary_metric` is set to "AUC_weighted" so that the reported score accounts for imbalance in the dataset (though there is very little) and reflects the true performance of the model
* `featurization` is set to "auto" so that AutoML decides which features to generate
* `verbosity` is set to `logging.INFO` so that AutoML saves all execution-related logs to the workspace for debugging purposes
* `n_cross_validations` is set to 5 so that AutoML evaluates performance across 5 different folds, which makes the performance measure more robust
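
To make the `primary_metric` choice more concrete, the sketch below computes a binary ROC AUC directly from its ranking definition: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as one half. It is an illustration in plain Python, not the implementation AutoML uses, and the helper name `roc_auc` and the sample data are made up for the example.

```python
def roc_auc(labels, scores):
    """Binary ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correctly ranked pair
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.7, 0.9]
auc = roc_auc(labels, scores)

# Duplicating every negative changes the class balance but not the AUC,
# because the metric depends only on how positives rank against negatives.
imbalanced_auc = roc_auc(labels + [0, 0, 0, 0], scores + [0.1, 0.4, 0.35, 0.8])
print(auc, imbalanced_auc)
```

This ranking view is one reason AUC-style metrics are often preferred over plain accuracy when the classes are not evenly represented.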

AutoML config

In [6]:
# AutoML settings passed to the experiment configuration
automl_settings = {
    "iteration_timeout_minutes": 10,
    "experiment_timeout_minutes": 60,
    "enable_early_stopping": True,
    "primary_metric": 'AUC_weighted',
    "featurization": "auto",
    "verbosity": logging.INFO,
    "n_cross_validations": 5
}
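
As a rough picture of what `"n_cross_validations": 5` asks for, the sketch below partitions row indices into five folds and uses each fold once for validation while training on the other four. AutoML performs its own splitting internally; the function name `k_fold_indices` and the shuffling scheme here are assumptions made for illustration.

```python
import random


def k_fold_indices(n_rows, n_folds, seed=0):
    """Yield (train_indices, validation_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # deal the shuffled indices round-robin so fold sizes differ by at most one
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


splits = list(k_fold_indices(n_rows=23, n_folds=5))
for train, validation in splits:
    print(len(train), len(validation))
```

Every row is used for validation exactly once, so a metric averaged over the folds depends less on any single lucky or unlucky split.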

In [7]:
# build the AutoML configuration for the classification task
automl_config = AutoMLConfig(
    task="classification",
    training_data=train_dataset,
    label_column_name="label",
    compute_target=compute_target,
    **automl_settings
)

In [8]:
# submit the experiment to the remote compute target
remote_run = experiment.submit(automl_config)

Running on remote.


## Run Details

The better-performing models tend to be tree-based, which suggests that the underlying prediction problem is non-linear. The best model is a stack ensemble of 3 different tree-based models; stacking allows these sub-models to compensate for each other's mistakes, which produced better performance.
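
The stacking idea can be illustrated with a toy example in plain Python. Three deliberately weak base rules each look at one feature, and a meta-learner is then fit on their combined outputs. Everything here is hypothetical: the data, the threshold rules, and the lookup-table meta-learner stand in for the tree-based models and the meta-learner that AutoML actually selected, and a real stacker would normally fit the meta-learner on out-of-fold predictions rather than on the training rows themselves.

```python
from collections import Counter, defaultdict

# Toy rows: (feature_1, feature_2, feature_3, label); values are made up
rows = [
    (0.9, 0.8, 0.2, 1),
    (0.7, 0.1, 0.9, 1),
    (0.2, 0.9, 0.8, 1),
    (0.8, 0.2, 0.1, 0),
    (0.1, 0.7, 0.3, 0),
    (0.3, 0.2, 0.9, 0),
    (0.1, 0.1, 0.1, 0),
    (0.9, 0.9, 0.9, 1),
]

# Three weak base learners, each thresholding a single feature
base_learners = [lambda r, i=i: int(r[i] > 0.5) for i in range(3)]


def accuracy(predict):
    return sum(predict(r) == r[3] for r in rows) / len(rows)


def base_outputs(row):
    return tuple(learner(row) for learner in base_learners)


# Meta-learner: for each combination of base predictions, remember the
# most common true label seen for that combination
votes = defaultdict(Counter)
for row in rows:
    votes[base_outputs(row)][row[3]] += 1
meta_table = {combo: counts.most_common(1)[0][0] for combo, counts in votes.items()}


def stacked_predict(row):
    return meta_table[base_outputs(row)]


base_accuracies = [accuracy(learner) for learner in base_learners]
stacked_accuracy = accuracy(stacked_predict)
print(base_accuracies, stacked_accuracy)
```

Each base rule misclassifies different rows, so the meta-learner can recover the correct label from their combined outputs. The accuracies are measured on the same rows the meta-learner was fit on, so this shows the mechanism and says nothing about generalization.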

Use the `RunDetails` widget to show the different experiments.

In [9]:
RunDetails(remote_run).show()
remote_run.wait_for_completion()

_AutoMLWidget(widget_settings={'childWidgetDisplay': 'popup', 'send_telemetry': False, 'log_level': 'INFO', 's…

{'runId': 'AutoML_397febcb-c02a-40de-8f9c-e2552f2390d7',
 'target': 'compute-cluster',
 'status': 'Completed',
 'startTimeUtc': '2021-01-14T01:33:52.511866Z',
 'endTimeUtc': '2021-01-14T02:56:50.765276Z',
 'properties': {'num_iterations': '1000',
  'training_type': 'TrainFull',
  'acquisition_function': 'EI',
  'primary_metric': 'AUC_weighted',
  'train_split': '0',
  'acquisition_parameter': '0',
  'num_cross_validation': '5',
  'target': 'compute-cluster',
  'DataPrepJsonString': '{\\"training_data\\": \\"{\\\\\\"blocks\\\\\\": [{\\\\\\"id\\\\\\": \\\\\\"7febc34b-2ed3-4ca4-b9b3-677a9c7a87d5\\\\\\", \\\\\\"type\\\\\\": \\\\\\"Microsoft.DPrep.GetDatastoreFilesBlock\\\\\\", \\\\\\"arguments\\\\\\": {\\\\\\"datastores\\\\\\": [{\\\\\\"datastoreName\\\\\\": \\\\\\"workspaceblobstore\\\\\\", \\\\\\"path\\\\\\": \\\\\\"click_through_rate/ctr_data.csv\\\\\\", \\\\\\"resourceGroup\\\\\\": \\\\\\"aml-quickstarts-134610\\\\\\", \\\\\\"subscription\\\\\\": \\\\\\"3d1a56d2-7c81-4118-9790-f85d1acf

## Best Model

Get the best model from the AutoML experiment and display its properties.

In [10]:
best_run, fitted_model = remote_run.get_output()

In [11]:
# register model
model = best_run.register_model(model_name='click-through-rate-predictions-AutoML', 
                                model_path='./outputs/model.pkl',
                                tags=best_run.get_metrics())

Get the best model's parameters

In [12]:
fitted_model.steps[1][1]

StackEnsembleClassifier(base_learners=[('0',
                                        Pipeline(memory=None,
                                                 steps=[('maxabsscaler',
                                                         MaxAbsScaler(copy=True)),
                                                        ('lightgbmclassifier',
                                                         LightGBMClassifier(boosting_type='gbdt',
                                                                            class_weight=None,
                                                                            colsample_bytree=1.0,
                                                                            importance_type='split',
                                                                            learning_rate=0.1,
                                                                            max_depth=-1,
                                                                            min_c

In [13]:
fitted_model.steps[1][1].get_params()

{'base_learners': None,
 'meta_learner': None,
 'training_cv_folds': None,
 '0': Pipeline(memory=None,
          steps=[('maxabsscaler', MaxAbsScaler(copy=True)),
                 ('lightgbmclassifier',
                  LightGBMClassifier(boosting_type='gbdt', class_weight=None,
                                     colsample_bytree=1.0,
                                     importance_type='split', learning_rate=0.1,
                                     max_depth=-1, min_child_samples=20,
                                     min_child_weight=0.001, min_split_gain=0.0,
                                     n_estimators=100, n_jobs=1, num_leaves=31,
                                     objective=None, random_state=None,
                                     reg_alpha=0.0, reg_lambda=0.0, silent=True,
                                     subsample=1.0, subsample_for_bin=200000,
                                     subsample_freq=0, verbose=-10))],
          verbose=False),
 '19': Pipeline(m