## Customer Churn Prediction with Automated ML in Azure Machine Learning (Azure ML SDK v2)

This notebook walks through creating an AutoML job with the Azure ML (AML) SDK v2. Once the AutoML job completes, the best-performing model is located with MLflow and registered in the AML workspace.


This notebook uses [Telco Customer Churn data from IBM Sample Datasets](https://community.ibm.com/community/user/businessanalytics/blogs/steven-macko/2019/07/11/telco-customer-churn-1113).


### Pre-requisites
* Azure Machine Learning workspace provisioned.
* Compute cluster provisioned in the workspace. The AutoML job in this notebook targets a cluster named `cpu-cluster`.
* A training cluster with 3 nodes and 2 vCPUs per node is recommended, as this can reduce the AutoML training time. Training in this notebook takes about 12 minutes on a 3-node CPU cluster of Standard_DS11_v2 nodes (2 cores, 14 GB RAM, 28 GB disk).
* An AML compute instance is the recommended environment for running the notebook.
* The notebook can also be run locally, but requires the following dependencies to be installed locally:
 
    - Python 3.8+
    - conda
    - Azure ML Python [SDK](https://learn.microsoft.com/en-us/python/api/overview/azure/ai-ml-readme?view=azure-python) and [CLI](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-configure-cli?view=azureml-api-2&tabs=public) v2
    - The additional dependencies listed in conda_env.yml, installed with:
    
            conda env create -f conda_env.yml

    


* The AML workspace can be configured with a Private Endpoint, provided the necessary DNS changes are made as documented [here](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-custom-dns?view=azureml-api-2&tabs=azure-cli).
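
When running locally, a quick check of the environment can save a failed import later. The sketch below uses only the standard library; the module list is an assumption based on the imports used in this notebook, so adjust it to match your conda_env.yml.

```python
import importlib.util
import sys

# Import names assumed from the cells in this notebook; adjust to your conda_env.yml.
REQUIRED_MODULES = ["azure.ai.ml", "azure.identity", "mltable", "mlflow", "pandas", "sklearn"]


def module_available(name):
    """Return True if the module can be located without importing it fully."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package (for example 'azure') is missing.
        return False


def check_environment(modules=REQUIRED_MODULES, min_python=(3, 8)):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    if sys.version_info < min_python:
        problems.append(f"Python {min_python[0]}.{min_python[1]}+ required")
    problems.extend(f"missing module: {m}" for m in modules if not module_available(m))
    return problems


for problem in check_environment():
    print(problem)
```

If nothing is printed, the Python version and all listed modules were found.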

### Log in to Azure using az login

In [1]:
from azure.identity import DefaultAzureCredential, InteractiveBrowserCredential
from azure.ai.ml import MLClient
try:
    credential = DefaultAzureCredential()
    # Check if given credential can get token successfully.
    credential.get_token("https://management.azure.com/.default")
except Exception as ex:
    # Fall back to InteractiveBrowserCredential if DefaultAzureCredential does not work
    credential = InteractiveBrowserCredential()

#### Provide values for your AML workspace

    subscription_id = "YOUR_SUBSCRIPTION_ID"
    resource_group = "YOUR_RESOURCE_GROUP"
    workspace = "YOUR_WORKSPACE_NAME"

In [2]:

ml_client = None
try:
    ml_client = MLClient.from_config(credential)
except Exception as ex:
    print(ex)
    # Enter details of your Azure Machine Learning workspace
    subscription_id = "f1a8fafd-a8a3-46d8-bb5e-01deb63d275d"
    resource_group = "aml-rg"
    workspace = "aml-testws"
    ml_client = MLClient(credential, subscription_id, resource_group, workspace)

We could not find config.json in: . or in its parent directories. Please provide the full path to the config file or ensure that config.json exists in the parent directories.
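
The message above means `MLClient.from_config` found no `config.json`, so the cell fell back to the explicit workspace values. As an alternative, a `config.json` with the keys `subscription_id`, `resource_group` and `workspace_name` can be written next to the notebook. The helper below is a minimal sketch; `write_workspace_config` is a hypothetical name, not part of the SDK.

```python
import json
from pathlib import Path


def write_workspace_config(subscription_id, resource_group, workspace_name, directory="."):
    """Write a config.json that MLClient.from_config can pick up from `directory`."""
    config = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }
    path = Path(directory) / "config.json"
    path.write_text(json.dumps(config, indent=2))
    return path


# Example (placeholders, replace with your own values):
# write_workspace_config("YOUR_SUBSCRIPTION_ID", "YOUR_RESOURCE_GROUP", "YOUR_WORKSPACE_NAME")
```

With the file in place, `MLClient.from_config(credential)` should succeed without reaching the `except` branch. Keep the file out of source control if the identifiers are considered sensitive.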


Use the sample dataset WA_Fn-UseC_-Telco-Customer-Churn.csv to create the training and test datasets.
The test dataset is used later to run a scoring job with the best-performing model.
Both datasets are created from the same sample dataset, and the split is stratified so that the test dataset keeps the class balance of the original sample dataset.


Provide your own unique dataset name or use a system-generated UUID. A name of your own makes the datasets easier to locate in the AML workspace.
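
One way to get a name that is both readable and unlikely to collide is to combine a short label with a few random hex characters. The helper below is only a sketch; `make_dataset_prefix` and the 8-character suffix are assumptions for illustration, not a naming rule imposed by AML.

```python
import re
import uuid


def make_dataset_prefix(label, suffix_length=8):
    """Return '<slug>-<random hex>', e.g. a slug of the label plus a short UUID fragment."""
    # Lowercase the label and collapse anything that is not a letter or digit into '-'.
    slug = re.sub(r"[^a-z0-9]+", "-", label.lower()).strip("-")
    suffix = uuid.uuid4().hex[:suffix_length]
    return f"{slug}-{suffix}" if slug else suffix


example_name = make_dataset_prefix("Telco Churn")
print(example_name)  # starts with 'telco-churn-' followed by 8 random hex characters
```

Assigning such a value to `my_unique_dataset_name` in the next cell, instead of the plain UUID, keeps the rest of the notebook unchanged.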

In [17]:
import uuid

my_unique_dataset_name = uuid.uuid4().hex # Provide your own unique dataset name
my_unique_dataset_name

'1155c2a891714097b099ade1fee4bddb'

In [82]:
import pandas as pd
from sklearn.model_selection import train_test_split

# Load training data from CSV file
training_data = pd.read_csv('.././telcocustomerchurn/WA_Fn-UseC_-Telco-Customer-Churn.csv')

# Separate the data into features (X) and labels (y)
X = training_data.drop(columns=['Churn'])  # 'Churn' is the column containing the class labels
y = training_data['Churn']

# Split the data into training and test sets while maintaining class balance
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Combine the features and labels for training and test sets
train_data = pd.concat([X_train, y_train], axis=1)
test_data = pd.concat([X_test, y_test], axis=1)

# Save the generated test data to a new CSV file
test_data.to_csv(f'.././telcocustomerchurn/{my_unique_dataset_name}-WA_Fn-UseC_-Telco-Customer-Churn_Test.csv', index=False)

# Save the training data to a new CSV file
train_data.to_csv(f'.././telcocustomerchurn/{my_unique_dataset_name}-WA_Fn-UseC_-Telco-Customer-Churn_Train.csv', index=False)



In [None]:
test_data

In [None]:
train_data
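
To confirm that the stratified split kept the class balance, the label proportions of the two sets can be compared. The sketch below uses only the standard library and works on any iterable of labels, such as `y_train` and `y_test` from the split cell; the helper names are hypothetical.

```python
from collections import Counter


def class_proportions(labels):
    """Return a mapping of each label to its share of the total."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def max_proportion_gap(labels_a, labels_b):
    """Return the largest absolute difference in class share between two label sets."""
    pa, pb = class_proportions(labels_a), class_proportions(labels_b)
    return max((abs(pa.get(k, 0.0) - pb.get(k, 0.0)) for k in set(pa) | set(pb)), default=0.0)


# Small illustrative lists; in the notebook, pass y_train and y_test instead.
print(class_proportions(["Yes", "No", "No", "No"]))
print(max_proportion_gap(["Yes", "No"], ["Yes", "No", "No", "No"]))
```

For a stratified split, `max_proportion_gap(y_train, y_test)` should be close to zero.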

Upload the datasets to the Azure ML workspace datastore.

In [83]:
train_data_path = f'.././telcocustomerchurn/{my_unique_dataset_name}_train/{my_unique_dataset_name}-WA_Fn-UseC_-Telco-Customer-Churn_Train.csv'
test_data_path = f'.././telcocustomerchurn/{my_unique_dataset_name}_test/{my_unique_dataset_name}-WA_Fn-UseC_-Telco-Customer-Churn_Test.csv'

In [84]:
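import os
import mltable
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes


train_data_paths = [
    {'file': train_data_path},
]

# Create the MLTable folder and write the training CSV into it first,
# so the file referenced by the MLTable exists before the table is saved
os.makedirs(os.path.dirname(train_data_path), exist_ok=True)
train_data.to_csv(train_data_paths[0]['file'], index=False)

# Build the MLTable definition from the CSV and save it next to the data
train_table = mltable.from_delimited_files(train_data_paths)
train_table.save(f'.././telcocustomerchurn/{my_unique_dataset_name}_train')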



In [79]:
test_data_path

'../telcocustomerchurn/1155c2a891714097b099ade1fee4bddb_test/1155c2a891714097b099ade1fee4bddb-WA_Fn-UseC_-Telco-Customer-Churn_Test.csv'

In [86]:
import os 
test_data_paths = [
    {'file': test_data_path},
]


outdir = f'.././telcocustomerchurn/{my_unique_dataset_name}_test'
# Create the output folder (and any missing parents) if it does not exist yet
os.makedirs(outdir, exist_ok=True)
# Save the generated test data to a new CSV file
test_data.to_csv(test_data_paths[0]['file'], index=False)

In [71]:
def create_dataset(data_path, data_asset_name, data_asset_version, asset_type):
    data_asset_def = Data(
        name=data_asset_name,
        version=data_asset_version,
        description=data_asset_name,
        path=data_path,
        type=asset_type,
    )
    data_asset = None
    # Create the data asset if it doesn't already exist
    try:
        data_asset = ml_client.data.get(name=data_asset_name, version=data_asset_version)
        print(
            f"Data asset already exists. Name: {data_asset_def.name}, version: {data_asset_def.version}"
        )
    except Exception as ex:
        ml_client.data.create_or_update(data_asset_def)
        print(f"Data asset created. Name: {data_asset_def.name}, version: {data_asset_def.version}")

    data_asset = ml_client.data.get(name=data_asset_name, version=data_asset_version)
    return data_asset

Create the training dataset in MLTable format. Automated ML in AML (SDK v2) accepts tabular training data only as MLTable.
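
An MLTable asset is a folder containing the data plus a small YAML file named `MLTable` that describes how to read it. The `mltable` package generates this file in an earlier cell; the sketch below writes a comparable file by hand to show what it contains. The `ascii` encoding and the helper name `write_mltable` are assumptions for illustration.

```python
from pathlib import Path

# Minimal MLTable definition for a single comma-delimited file in the same folder.
MLTABLE_TEMPLATE = """\
paths:
  - file: ./{csv_name}
transformations:
  - read_delimited:
      delimiter: ','
      encoding: 'ascii'
"""


def write_mltable(folder, csv_name):
    """Write an MLTable file into `folder` that points at `csv_name` in the same folder."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    mltable_path = folder / "MLTable"
    mltable_path.write_text(MLTABLE_TEMPLATE.format(csv_name=csv_name))
    return mltable_path


# Example (hypothetical folder and file names):
# write_mltable("my_train_folder", "train.csv")
```

The folder, not the CSV itself, is then what gets registered as the `AssetTypes.MLTABLE` data asset.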

In [87]:
data_path = f'.././telcocustomerchurn/{my_unique_dataset_name}_train'
data_asset_name = f"{my_unique_dataset_name}-wa_telco_customer_churn_train_data"
data_asset_version = "1.0"
training_data_asset = create_dataset(data_path, data_asset_name, data_asset_version, AssetTypes.MLTABLE)

Data asset already exists. Name: 1155c2a891714097b099ade1fee4bddb-wa_telco_customer_churn_train_data, version: 1.0


Create the test dataset as a `URI_FILE` asset to be used in scoring.

In [91]:
data_path = test_data_path
data_asset_name = f"{my_unique_dataset_name}-wa_telco_customer_churn_test_data"
data_asset_version = "1.0"
test_data_asset = create_dataset(data_path, data_asset_name, data_asset_version, AssetTypes.URI_FILE)


Uploading 1155c2a891714097b099ade1fee4bddb-WA_Fn-UseC_-Te... (< 1 MB): 100%|##########| 196k/196k [00:00<00:00, 774kB/s]




Data asset created. Name: 1155c2a891714097b099ade1fee4bddb-wa_telco_customer_churn_test_data, version: 1.0


Define the AutoML classification job for the churn prediction task.
The training algorithms are restricted to a short allow-list to limit training time.
For more information on the algorithms supported in AutoML, see [the AutoML configuration documentation](https://learn.microsoft.com/en-us/azure/machine-learning/how-to-configure-auto-train?view=azureml-api-2&tabs=python).

In [94]:
from azure.ai.ml import automl, Input

training_data_input = Input(
    type=AssetTypes.MLTABLE, path=training_data_asset.path
)


# configure the classification job
classification_job = automl.classification(
    compute="cpu-cluster",
    experiment_name=f"{my_unique_dataset_name}-wa-telco-customer-churn-classification",
    training_data=training_data_input,
    target_column_name="Churn",
    primary_metric="accuracy",
    n_cross_validations=5,
    enable_model_explainability=True,
    tags={"my_custom_tag": "My custom value"}
    
)

# Limits are all optional
classification_job.set_limits(
    timeout_minutes=600, 
    trial_timeout_minutes=20, 
    max_trials=5,
    enable_early_termination=True,
)

# Training properties are optional
classification_job.set_training(
    allowed_training_algorithms=["GradientBoosting", "DecisionTree", "LightGBM" , "RandomForest"], 
    enable_onnx_compatible_models=True
)

classification_job.set_featurization(
    mode="auto",
)

In [95]:
# Submit the AutoML job
returned_job = ml_client.jobs.create_or_update(
    classification_job
)  # submit the job to the backend

print(f"Created job: {returned_job}")

# Get a URL for the status of the job
returned_job.services["Studio"].endpoint

Created job: compute: azureml:cpu-cluster
creation_context:
  created_at: '2023-09-08T21:27:32.252811+00:00'
  created_by: Anil Dwarakanath
  created_by_type: User
display_name: jolly_circle_ywmvvvb4dc
experiment_name: 1155c2a891714097b099ade1fee4bddb-wa-telco-customer-churn-classification
featurization:
  enable_dnn_featurization: false
  mode: auto
id: azureml:/subscriptions/f1a8fafd-a8a3-46d8-bb5e-01deb63d275d/resourceGroups/aml-rg/providers/Microsoft.MachineLearningServices/workspaces/aml-testws/jobs/jolly_circle_ywmvvvb4dc
limits:
  enable_early_termination: true
  max_concurrent_trials: 1
  max_cores_per_trial: -1
  max_nodes: 1
  max_trials: 5
  timeout_minutes: 600
  trial_timeout_minutes: 20
log_verbosity: info
n_cross_validations: 5
name: jolly_circle_ywmvvvb4dc
outputs: {}
primary_metric: accuracy
properties:
  azureml.git.dirty: 'True'
  mlflow.source.git.branch: main
  mlflow.source.git.commit: 41f032aa5a37018b74add5bcf52d67676731cbc2
  mlflow.source.git.repoURL: https://g

'https://ml.azure.com/runs/jolly_circle_ywmvvvb4dc?wsid=/subscriptions/f1a8fafd-a8a3-46d8-bb5e-01deb63d275d/resourcegroups/aml-rg/workspaces/aml-testws&tid=7f1290b4-3c39-4277-a63e-c577680a12cf'

Wait until the AutoML job completes before continuing from here.
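
Instead of watching the Studio page, the wait can be scripted. `ml_client.jobs.stream(returned_job.name)` blocks and streams logs until the job ends; the sketch below is a quieter alternative that polls the job status. The helper name `wait_for_job`, the polling interval and the set of terminal statuses are assumptions to adapt as needed.

```python
import time

# Statuses treated as final in this sketch.
TERMINAL_STATUSES = {"Completed", "Failed", "Canceled"}


def wait_for_job(ml_client, job_name, poll_seconds=30, sleep=time.sleep):
    """Poll ml_client.jobs.get(job_name) until the job reaches a terminal status."""
    while True:
        status = ml_client.jobs.get(job_name).status
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)


# Example usage in this notebook:
# final_status = wait_for_job(ml_client, returned_job.name)
# print(final_status)
```

Continue with the model registration only if the returned status is `Completed`.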

Use MLflow to find the best-performing model and register it.

In [96]:
import mlflow

# Obtain the tracking URL from MLClient
MLFLOW_TRACKING_URI = ml_client.workspaces.get(
    name=ml_client.workspace_name
).mlflow_tracking_uri

print(MLFLOW_TRACKING_URI)

azureml://74f7aea0-6b0e-4b75-a8e4-001d9a929102.workspace.westus.api.azureml.ms/mlflow/v1.0/subscriptions/f1a8fafd-a8a3-46d8-bb5e-01deb63d275d/resourceGroups/aml-rg/providers/Microsoft.MachineLearningServices/workspaces/aml-testws


In [97]:
# Set the MLFLOW TRACKING URI

mlflow.set_tracking_uri(MLFLOW_TRACKING_URI)

print("\nCurrent tracking uri: {}".format(mlflow.get_tracking_uri()))


Current tracking uri: azureml://74f7aea0-6b0e-4b75-a8e4-001d9a929102.workspace.westus.api.azureml.ms/mlflow/v1.0/subscriptions/f1a8fafd-a8a3-46d8-bb5e-01deb63d275d/resourceGroups/aml-rg/providers/Microsoft.MachineLearningServices/workspaces/aml-testws


In [98]:
from mlflow.tracking.client import MlflowClient
from mlflow.artifacts import download_artifacts

# Initialize MLFlow client
mlflow_client = MlflowClient()

In [99]:

# Get the parent run
mlflow_parent_run = mlflow_client.get_run(returned_job.name)
best_child_run_id = mlflow_parent_run.data.tags['automl_best_child_run_id']
# get the best child run
best_run = mlflow_client.get_run(best_child_run_id)

In [None]:
best_run
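
If the parent job has not finished, or it failed, the `automl_best_child_run_id` tag may be missing and the lookup above raises a bare `KeyError`. The sketch below wraps the lookup with a clearer error and pulls a few metrics out of the best run. The helper names and the default metric names are assumptions; the metrics actually logged depend on the job.

```python
BEST_CHILD_TAG = "automl_best_child_run_id"


def best_child_run_id(parent_tags):
    """Return the best child run id from the parent run's tags, or raise a clear error."""
    run_id = parent_tags.get(BEST_CHILD_TAG)
    if not run_id:
        raise RuntimeError(
            f"Tag '{BEST_CHILD_TAG}' not found; has the AutoML job completed successfully?"
        )
    return run_id


def pick_metrics(metrics, names=("accuracy", "AUC_weighted", "f1_score_weighted")):
    """Return only the requested metrics that are present in the run's metrics dict."""
    return {name: metrics[name] for name in names if name in metrics}


# Example usage in this notebook:
# best_id = best_child_run_id(mlflow_parent_run.data.tags)
# print(pick_metrics(best_run.data.metrics))
```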

In [100]:
from azure.ai.ml.entities import (
    ManagedOnlineEndpoint,
    ManagedOnlineDeployment,
    Model,
    Environment,
    CodeConfiguration,
    ProbeSettings,
)
from azure.ai.ml.constants import ModelType

model_name = f"{my_unique_dataset_name}-wa_telco_customer_churn_model_best"
model = Model(
    path=f"azureml://jobs/{best_run.info.run_id}/outputs/artifacts/outputs/mlflow-model/",
    name=model_name,
    description="wa_telco_customer_churn_model_best",
    type=AssetTypes.MLFLOW_MODEL,
)

# for downloaded file
# model = Model(path="artifact_downloads/outputs/model.pkl", name=model_name)

registered_model = ml_client.models.create_or_update(model)

In [101]:
# Pick the latest version of the model.
# Versions are returned as strings, so compare them numerically ("10" > "9").
latest_model = max(
    int(m.version) for m in ml_client.models.list(name=registered_model.name)
)

print(latest_model)

1


Download the AutoML artifacts for later scoring. Both local scoring and batch scoring on the compute cluster will be used.

In [103]:
import os

# Create local folder
local_dir = ".././artifact_downloads"
if not os.path.exists(local_dir):
    os.mkdir(local_dir)

In [104]:
# Download run's artifacts/outputs
local_path = download_artifacts(
    run_id=best_run.info.run_id, artifact_path="outputs", dst_path=local_dir
)
print("Artifacts downloaded in: {}".format(local_path))
print("Artifacts: {}".format(os.listdir(local_path)))

Downloading artifacts:   0%|          | 0/20 [00:00<?, ?it/s]2023/09/08 14:49:50 INFO mlflow.store.artifact.artifact_repo: The progress bar can be disabled by setting the environment variable MLFLOW_ENABLE_ARTIFACTS_PROGRESS_BAR to false
Downloading artifacts: 100%|██████████| 20/20 [00:04<00:00,  4.66it/s]


Artifacts downloaded in: c:\source\repos\aml-customer-churn-prediction\artifact_downloads\outputs
Artifacts: ['conda_env_v_1_0_0.yml', 'engineered_feature_names.json', 'env_dependencies.json', 'featurization_summary.json', 'generated_code', 'internal_cross_validated_models.pkl', 'mlflow-model', 'model.onnx', 'model.pkl', 'model_onnx.json', 'pipeline_graph.json', 'run_id.txt', 'scoring_file_pbi_v_1_0_0.py', 'scoring_file_v_1_0_0.py', 'scoring_file_v_2_0_0.py']
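
For local scoring, the `mlflow-model` folder listed above is the piece that MLflow can load. The sketch below locates that folder inside the downloaded outputs and checks that it contains the `MLmodel` descriptor file; `find_mlflow_model_dir` is a hypothetical helper name.

```python
from pathlib import Path


def find_mlflow_model_dir(outputs_dir):
    """Return the 'mlflow-model' folder under outputs_dir, verifying it holds an MLmodel file."""
    model_dir = Path(outputs_dir) / "mlflow-model"
    if not (model_dir / "MLmodel").is_file():
        raise FileNotFoundError(f"No MLmodel file found in {model_dir}")
    return model_dir


# Example usage in this notebook (assumes a local environment compatible with
# the downloaded conda_env_v_1_0_0.yml):
# model_dir = find_mlflow_model_dir(local_path)
# loaded_model = mlflow.pyfunc.load_model(str(model_dir))
# predictions = loaded_model.predict(test_data.drop(columns=["Churn"]))
```

Loading the model locally may fail if the installed package versions differ from those the model was trained with, in which case creating an environment from the downloaded conda file is the safer route.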
