This notebook is the authoring script that submits a LightGBM (LGBM) training run to Azure Machine Learning.

Prerequisites:
    
    1. An Azure Machine Learning workspace
    2. The Titanic dataset from Kaggle
    3. The Kaggle dataset stored in a separate folder (this notebook reads it from data/)

Steps:
    
    1. Import Workspace and Experiment, create the compute cluster
    2. Clean and preprocess the dataset until it is nearly ready for the model. Upload the dataset and register it
    3. Prepare the RunConfiguration: all environment settings (Docker, Python packages, etc.) must be described here, and its target must point at the AmlCompute cluster
    4. Prepare script_params: a list containing all user-defined parameters that we need to pass to the training script
    5. Prepare the training script. Prepare the ScriptRunConfig. Submit the run
    6. Done

In [1]:
#Workaround for the error "Currently Dataset class is not supported by your Linux distribution."
#For Linux users, the Dataset class is only supported on the following distributions:
#Red Hat Enterprise Linux, Ubuntu (up to Ubuntu 16), Fedora, and CentOS.
from dotnetcore2 import runtime
runtime.version = ("18", "10", "0")

In [9]:
from azureml.core import Experiment, Workspace, Dataset
from azureml.core.compute import ComputeTarget, AmlCompute
from azureml.core.compute_target import ComputeTargetException
from azureml.core.run import Run
from azureml.core.runconfig import RunConfiguration
from azureml.core.conda_dependencies import CondaDependencies
from azureml.core.runconfig import DEFAULT_CPU_IMAGE

import pandas as pd
import numpy as np
import os

from sklearn.preprocessing import LabelEncoder

# Step 1: Create Workspace, environment, aml compute cluster

In [3]:
ws = Workspace.from_config('~/.azureml/config.json')
exp = Experiment(workspace = ws, name = 'titanic_lgbm')

If you run your code in unattended mode, i.e., where you can't give a user input, then we recommend to use ServicePrincipalAuthentication or MsiAuthentication.
Please refer to aka.ms/aml-notebook-auth for different authentication mechanisms in azureml-sdk.
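
The warning above suggests ServicePrincipalAuthentication for unattended runs. A minimal sketch of that option follows, assuming the service principal's credentials are exposed through environment variables. The variable names (AML_TENANT_ID, AML_CLIENT_ID, AML_CLIENT_SECRET) and both helper functions are hypothetical and not part of the original notebook; the azureml imports are deferred so the credential check can be run without the SDK installed.

```python
import os

# Hypothetical environment variable names; use whatever your secret store provides.
REQUIRED_VARS = ("AML_TENANT_ID", "AML_CLIENT_ID", "AML_CLIENT_SECRET")


def read_sp_credentials(env=os.environ):
    """Collect service principal credentials, failing early if any are missing."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {
        "tenant_id": env["AML_TENANT_ID"],
        "service_principal_id": env["AML_CLIENT_ID"],
        "service_principal_password": env["AML_CLIENT_SECRET"],
    }


def connect_workspace(config_path="~/.azureml/config.json", env=os.environ):
    """Load the workspace non-interactively with a service principal."""
    # Imported lazily so read_sp_credentials works without the SDK installed.
    from azureml.core import Workspace
    from azureml.core.authentication import ServicePrincipalAuthentication

    auth = ServicePrincipalAuthentication(**read_sp_credentials(env))
    return Workspace.from_config(path=config_path, auth=auth)
```

With the three variables set, `ws = connect_workspace()` would stand in for the `Workspace.from_config(...)` call in the cell above.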


In [4]:
#Firing up the compute target
vm_size = 'STANDARD_D2_V2'
max_nodes = 4
cluster_name = 'titanic-cluster'
try:
    aml_cluster = ComputeTarget(workspace=ws, name=cluster_name)    #Looking for existing compute cluster
except ComputeTargetException:
    amlconfig = AmlCompute.provisioning_configuration(vm_size=vm_size,  #If none exist, creating a new one
                                                     max_nodes = max_nodes)
    aml_cluster = ComputeTarget.create(ws, cluster_name, amlconfig)

aml_cluster.wait_for_completion(show_output = True)

Succeeded
AmlCompute wait for completion finished

Minimum number of nodes requested have been provisioned


# Step 2: Import the data and clean it. Register the data

In [5]:
df = pd.read_csv('data/train.csv')
df.drop(['PassengerId'], axis=1, inplace=True)

# 'Embarked' is stored as letters, so we will convert it to numbers
embarked_encoder = LabelEncoder()
embarked_encoder.fit(df['Embarked'].fillna('Null'))
 
# Creating a new column denoting whether someone came alone 
df['Alone'] = (df['SibSp'] == 0) & (df['Parch'] == 0)

# Transform 'Embarked'
df['Embarked'] = df['Embarked'].fillna('Null')  # assign back; in-place fillna on a column selection is unreliable in recent pandas
df['Embarked'] = embarked_encoder.transform(df['Embarked'])

# Transform 'Sex'
df.loc[df['Sex'] == 'female','Sex'] = 0
df.loc[df['Sex'] == 'male','Sex'] = 1
df['Sex'] = df['Sex'].astype('int8')

# Drop features that seem unusable ('PassengerId' was already dropped above)
df.drop(['Name', 'Ticket', 'Cabin'], axis=1, inplace=True)
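
To make the 'Embarked' encoding above concrete: scikit-learn's LabelEncoder assigns integer codes in sorted order of the distinct labels, so the 'Null' placeholder receives a code of its own between the port letters. The sketch below mimics that rule in plain Python on a small made-up list of values; `label_encode` is an illustrative helper, not part of the notebook or of scikit-learn.

```python
def label_encode(values):
    """Map each distinct label to its index in sorted order."""
    classes = sorted(set(values))
    mapping = {label: index for index, label in enumerate(classes)}
    return [mapping[value] for value in values], mapping


# Made-up sample: three ports plus one missing value.
ports = ["S", "C", "Q", None, "S"]
filled = ["Null" if port is None else port for port in ports]

codes, mapping = label_encode(filled)
print(mapping)  # {'C': 0, 'Null': 1, 'Q': 2, 'S': 3}
print(codes)    # [3, 0, 2, 1, 3]
```

The codes carry no ordering meaning; they only give the tree model a numeric column to split on.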

In [6]:
#Path to saved dataset
data_location = os.path.join('./data', 'titanic_cleaned.csv')

#Saving the cleaned data as csv
df.to_csv(data_location)

In [7]:
#In this cell, we upload the dataset to the Azure datastore

#First, let's get the workspace's default datastore
datastore = ws.get_default_datastore()

#Upload the csv files to that datastore
datastore.upload(src_dir = './data',                  #Source Directory
                 target_path = './data')              #Directory in blob storage where data will be stored

Uploading an estimated of 2 files
Target already exists. Skipping upload for data/titanic_cleaned.csv
Target already exists. Skipping upload for data/train.csv
Uploaded 0 files


$AZUREML_DATAREFERENCE_28445fe417894d4181260e6d863eb2a4

In [8]:
#Creating a dataset from file in blob storage
dataset = Dataset.Tabular.from_delimited_files(datastore.path(data_location))

#Registering the dataset and creating version
dataset.register(workspace = ws, name = 'titanic_cleaned', create_new_version = True)

{
  "source": [
    "('workspaceblobstore', './data/titanic_cleaned.csv')"
  ],
  "definition": [
    "GetDatastoreFiles",
    "ParseDelimited",
    "DropColumns",
    "SetColumnTypes"
  ],
  "registration": {
    "id": "9ad18a58-16d7-4318-8249-7c116c98cea1",
    "name": "titanic_cleaned",
    "version": 1,
    "workspace": "Workspace.create(name='titanic_ws', subscription_id='ea3f69e8-c36f-4fc3-8495-d53f48fcf14a', resource_group='ml_project_titanic')"
  }
}
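
Once registered, the dataset can be fetched by name from inside the training script. The contents of train_titanic.py are not shown in this notebook, so the following is only a sketch of one common pattern; the function name `load_registered_dataset` is hypothetical. It relies on `Run.get_context()` returning the submitted run, so it only works inside a run on Azure ML (an offline context has no experiment or workspace attached).

```python
def load_registered_dataset(name="titanic_cleaned"):
    """Return the registered tabular dataset as a pandas DataFrame."""
    # Imported lazily so this module can be imported without the SDK installed.
    from azureml.core import Dataset, Run

    run = Run.get_context()                  # the current submitted run
    workspace = run.experiment.workspace     # workspace the run belongs to
    dataset = Dataset.get_by_name(workspace, name=name)  # latest version by default
    return dataset.to_pandas_dataframe()
```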

# Step 3: Prepare the RunConfig file

In [17]:
runconfig = RunConfiguration()

#Configuring environment parameters
runconfig.target = aml_cluster

runconfig.environment.docker.enabled = True
runconfig.environment.docker.base_image = DEFAULT_CPU_IMAGE

packages = ['azureml-defaults', 'azureml-contrib-interpret', 'azureml-core', 
            'azureml-telemetry', 'azureml-interpret', 'sklearn-pandas', 'azureml-dataprep',
           'numpy', 'pandas', 'matplotlib', 'seaborn', 'scikit-learn', 'lightgbm', 'umap-learn', 'joblib']

runconfig.auto_prepare_environment = True
runconfig.environment.python.user_managed_dependencies = False
runconfig.environment.python.conda_dependencies = CondaDependencies.create(
    pip_packages=packages)

'enabled' is deprecated. Please use the azureml.core.runconfig.DockerConfiguration object with the 'use_docker' param instead.
'auto_prepare_environment' is deprecated and unused. It will be removed in a future release.
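
The deprecation warnings above point to `DockerConfiguration` with `use_docker` as the replacement for `environment.docker.enabled`. A hedged sketch of an equivalent setup is shown below, assuming an SDK version recent enough to accept `docker_runtime_config` on `ScriptRunConfig`. The function name and the environment name 'titanic-lgbm-env' are hypothetical, and `auto_prepare_environment` is simply omitted because the warning reports it as unused.

```python
def build_script_run_config(source_directory, script, arguments,
                            compute_target, pip_packages):
    """Build a ScriptRunConfig without the deprecated docker.enabled flag."""
    # Imported lazily so this module can be imported without the SDK installed.
    from azureml.core import Environment, ScriptRunConfig
    from azureml.core.conda_dependencies import CondaDependencies
    from azureml.core.runconfig import DEFAULT_CPU_IMAGE, DockerConfiguration

    env = Environment(name="titanic-lgbm-env")  # hypothetical environment name
    env.docker.base_image = DEFAULT_CPU_IMAGE
    env.python.user_managed_dependencies = False
    env.python.conda_dependencies = CondaDependencies.create(
        pip_packages=pip_packages)

    return ScriptRunConfig(
        source_directory=source_directory,
        script=script,
        arguments=arguments,
        compute_target=compute_target,
        environment=env,
        # Replaces runconfig.environment.docker.enabled = True
        docker_runtime_config=DockerConfiguration(use_docker=True),
    )
```

The returned object would be passed to `exp.submit(...)` in the same way as a ScriptRunConfig built from a RunConfiguration.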


# Step 4: Define the script params, define the ScriptRunConfig, run the experiment

In [21]:
#First decide which arguments to pass to the train_titanic.py script. In this example, we only pass the
#model hyperparameters
script_params = ['--boosting', 'dart',
    '--learning-rate', '0.05',
    '--drop-rate', '0.1',
]
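
The list above reaches the training script as ordinary command-line arguments. The notebook does not show train_titanic.py, so the following is only a sketch of how such a script might parse them with argparse. The defaults are assumptions chosen to match LightGBM's documented defaults, and the `params` dictionary uses LightGBM's native parameter names.

```python
import argparse


def parse_args(argv=None):
    """Parse the hyperparameters passed in through script_params."""
    parser = argparse.ArgumentParser(description="Train LightGBM on the Titanic data")
    parser.add_argument("--boosting", type=str, default="gbdt")
    parser.add_argument("--learning-rate", type=float, default=0.1)
    parser.add_argument("--drop-rate", type=float, default=0.1)
    return parser.parse_args(argv)


# Simulate the arguments that the submitted run would receive.
args = parse_args(["--boosting", "dart", "--learning-rate", "0.05", "--drop-rate", "0.1"])

# argparse turns '--learning-rate' into the attribute 'learning_rate'.
params = {
    "boosting": args.boosting,
    "learning_rate": args.learning_rate,
    "drop_rate": args.drop_rate,
}
print(params)
```

Note that every value in script_params is a string; the `type=float` arguments are what convert '0.05' and '0.1' into numbers before they reach the model.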


In [24]:
from azureml.core import ScriptRunConfig

script = 'train_titanic.py'
script_folder = os.getcwd()

src = ScriptRunConfig(
  source_directory=script_folder,
  script=script,
  run_config=runconfig,
  arguments=script_params)

run = exp.submit(src)

run.wait_for_completion(show_output = True)

RunId: titanic_lgbm_1623203998_d8febb4d
Web View: https://ml.azure.com/runs/titanic_lgbm_1623203998_d8febb4d?wsid=/subscriptions/ea3f69e8-c36f-4fc3-8495-d53f48fcf14a/resourcegroups/ml_project_titanic/workspaces/titanic_ws&tid=90e9d100-8173-4458-9c32-166e0ec3eb49

Streaming azureml-logs/55_azureml-execution-tvmps_6b710428067810f4c9a962d51ac8fb9ed4df47e00b47c5d67e33300ec9cf117f_d.txt

2021-06-09T02:03:48Z Successfully mounted a/an Blobfuse File System at /mnt/batch/tasks/shared/LS_root/jobs/titanic_ws/azureml/titanic_lgbm_1623203998_d8febb4d/mounts/workspaceblobstore
2021-06-09T02:03:48Z The vmsize standard_d2_v2 is not a GPU VM, skipping get GPU count by running nvidia-smi command.
2021-06-09T02:03:48Z Starting output-watcher...
2021-06-09T02:03:48Z IsDedicatedCompute == True, won't poll for Low Pri Preemption
2021-06-09T02:03:49Z Executing 'Copy ACR Details file' on 10.0.0.4
2021-06-09T02:03:49Z Copy ACR Details file succeeded on 10.0.0.4. Output: 
>>>   
>>>   
Login Succeeded
Using d

{'runId': 'titanic_lgbm_1623203998_d8febb4d',
 'target': 'titanic-cluster',
 'status': 'Completed',
 'startTimeUtc': '2021-06-09T02:03:47.314668Z',
 'endTimeUtc': '2021-06-09T02:05:33.788758Z',
 'properties': {'_azureml.ComputeTargetType': 'amlcompute',
  'ContentSnapshotId': '0d40c5bc-56cb-4857-9f25-9704f78a1106',
  'ProcessInfoFile': 'azureml-logs/process_info.json',
  'ProcessStatusFile': 'azureml-logs/process_status.json'},
 'inputDatasets': [{'dataset': {'id': '9ad18a58-16d7-4318-8249-7c116c98cea1'}, 'consumptionDetails': {'type': 'Reference'}}],
 'outputDatasets': [],
 'runDefinition': {'script': 'train_titanic.py',
  'command': '',
  'useAbsolutePath': False,
  'arguments': ['--boosting',
   'dart',
   '--learning-rate',
   '0.05',
   '--drop-rate',
   '0.1'],
  'sourceDirectoryDataStore': None,
  'framework': 'Python',
  'communicator': 'None',
  'target': 'titanic-cluster',
  'dataReferences': {},
  'data': {},
  'outputData': {},
  'jobName': None,
  'maxRunDurationSeconds': 

In [25]:
print(run.get_portal_url())

https://ml.azure.com/runs/titanic_lgbm_1623203998_d8febb4d?wsid=/subscriptions/ea3f69e8-c36f-4fc3-8495-d53f48fcf14a/resourcegroups/ml_project_titanic/workspaces/titanic_ws&tid=90e9d100-8173-4458-9c32-166e0ec3eb49
