# Loan Approval from Historical Data on Azure: Create Datastore & Train Linear Model

## Create an Azure Blob Datastore
Connect to the Azure Machine Learning workspace with `MLClient`.

In [None]:
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# Authenticate
credential = DefaultAzureCredential()

# Get a handle to the workspace
ml_client = MLClient(
    credential=credential,
    subscription_id='a134465f-eb15-4297-b902-3c97d4c81838',
    resource_group_name='aschultzdata',
    workspace_name='ds-ml-env',
)
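Hard-coding the subscription, resource group, and workspace name ties the notebook to a single workspace. As a minimal sketch, the identifiers could instead be read from environment variables. The variable names `AZURE_SUBSCRIPTION_ID`, `AZURE_RESOURCE_GROUP`, and `AZURE_ML_WORKSPACE` and the helper `workspace_settings_from_env` are hypothetical and not part of the Azure SDK:

```python
import os

# Hypothetical environment variable names; adjust them to your own setup.
ENV_TO_KWARG = {
    'AZURE_SUBSCRIPTION_ID': 'subscription_id',
    'AZURE_RESOURCE_GROUP': 'resource_group_name',
    'AZURE_ML_WORKSPACE': 'workspace_name',
}


def workspace_settings_from_env(environ=None):
    """Collect the MLClient keyword arguments from environment variables."""
    environ = os.environ if environ is None else environ
    missing = [name for name in ENV_TO_KWARG if not environ.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {kwarg: environ[name] for name, kwarg in ENV_TO_KWARG.items()}
```

The returned dictionary uses the same keyword names as the call above, so it could be unpacked as `MLClient(credential=credential, **workspace_settings_from_env())`.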

Create the datastore by specifying its name, description, storage account, container name, protocol, and account key.

In [None]:
from azure.ai.ml.entities import AccountKeyConfiguration, AzureBlobDatastore


store = AzureBlobDatastore(
    name='loanstatus_datastore',
    description='Datastore for Loan Status',
    account_name='dsmlenv8898281366',
    container_name='loanstatus',
    protocol='https',
    credentials=AccountKeyConfiguration(
        account_key='XXXxxxXXXxXXXXxxXXXXXxXXXXXxXxxXxXXXxXXXxXXxxxXXxxXXXxXxXXXxxXxxXXXXxxxxxXXxxxxxxXXXxXXX'
    ),
)

ml_client.create_or_update(store)

AzureBlobDatastore({'type': <DatastoreType.AZURE_BLOB: 'AzureBlob'>, 'name': 'loanstatus_datastore', 'description': 'Datastore for Loan Status', 'tags': {}, 'properties': {}, 'print_as_yaml': True, 'id': '/subscriptions/a134465f-eb15-4297-b902-3c97d4c81838/resourceGroups/aschultzdata/providers/Microsoft.MachineLearningServices/workspaces/ds-ml-env/datastores/loanstatus_datastore', 'Resource__source_path': None, 'base_path': '/mnt/batch/tasks/shared/LS_root/mounts/clusters/cpu-standardds12v2/code/Users/aschultz.data/UsedCarsCarGurus/Models/DL/MLP', 'creation_context': None, 'serialize': <msrest.serialization.Serializer object at 0x7f5450284f70>, 'credentials': {'type': 'account_key'}, 'container_name': 'loanstatus', 'account_name': 'dsmlenv8898281366', 'endpoint': 'core.windows.net', 'protocol': 'https'})

Then upload the train/test sets to the `loanstatus` container.
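Before uploading, it can help to confirm that the two files share the same columns, since the training script drops `loan_status` from both sets and applies one scaler to both. The following is a minimal sketch using only the standard library; the helper names `read_header` and `check_matching_columns` are hypothetical:

```python
import csv


def read_header(csv_path):
    """Return the column names from the first row of a CSV file."""
    with open(csv_path, newline='') as handle:
        return next(csv.reader(handle))


def check_matching_columns(train_path, test_path, target='loan_status'):
    """Raise ValueError if the files differ in columns or lack the target."""
    train_cols = read_header(train_path)
    test_cols = read_header(test_path)
    if target not in train_cols:
        raise ValueError(f"'{target}' column not found in {train_path}")
    if train_cols != test_cols:
        raise ValueError('Train and test files have different columns')
    return train_cols
```

Calling `check_matching_columns` with the local paths of the train and test files returns the shared column list, or raises before anything is uploaded.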

# Train a model
First, create a compute instance to run the notebook in Azure Machine Learning studio. Then connect to the workspace and create the compute resource (CPU or GPU) if it has not already been created.




## Create a compute cluster to run the job

In [None]:
from azure.ai.ml.entities import AmlCompute

# Define the name of the compute cluster
cpu_compute_target = 'cpu-cluster-E8s-v3'

try:
    # Examine if the cluster already exists
    cpu_cluster = ml_client.compute.get(cpu_compute_target)
    print(
        f"You already have a cluster named {cpu_compute_target}, we'll reuse it as is."
    )

except Exception:
    print('Creating a new cpu compute target...')

    # Create the Azure Machine Learning compute object with specified components
    cpu_cluster = AmlCompute(
        name=cpu_compute_target,
        # On-demand VM service
        type='amlcompute',
        # VM Family
        size='Standard_E8s_v3',
        # Minimum nodes in the cluster
        min_instances=0,
        # Nodes in the cluster
        max_instances=1,
        # Time (seconds) the node will run after the job finishes/terminates
        idle_time_before_scale_down=180,
        # Type of tier: Dedicated or LowPriority
        tier='Dedicated',
    )
    print(
        f"AMLCompute with name {cpu_cluster.name} will be created, with compute size {cpu_cluster.size}"
    )
    # Pass the object to MLClient's create_or_update method
    cpu_cluster = ml_client.compute.begin_create_or_update(cpu_cluster)

Creating a new cpu compute target...
AMLCompute with name cpu-cluster-E8s-v3 will be created, with compute size Standard_E8s_v3


## Create the job environment
The environment lists the components of the runtime and the libraries installed on the compute for training the model.

In [None]:
import os

dependencies_dir = './dependencies'
os.makedirs(dependencies_dir, exist_ok=True)

Now we can write the `conda` file into the `dependencies` directory.

In [None]:
%%writefile {dependencies_dir}/conda.yaml
name: model-env
channels:
  - conda-forge
dependencies:
  - python=3.8.5
  - numpy=1.21.6
  - pip=23.1.2
  - scikit-learn==1.1.2
  - scipy
  - pandas>=1.1,<1.2
  - pip:
    - inference-schema[numpy-support]==1.3.0
    - mlflow
    - azureml-mlflow
    - psutil==5.9.0
    - tqdm
    - ipykernel
    - matplotlib
    - seaborn
    - eli5==0.13.0
    - shap==0.41.0
    - lime

Writing ./dependencies/conda.yaml
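Several entries in this file carry no version constraint, so the resolved environment can change between builds. As a rough sketch, the unpinned entries could be listed with a few lines of plain Python. The helper `unpinned_dependencies` is hypothetical, and it assumes `dependencies:` is the last top-level key, as in the file above; a real YAML parser would be more robust:

```python
def unpinned_dependencies(conda_yaml_text):
    """List dependency entries that carry no version constraint."""
    unpinned = []
    in_dependencies = False
    for raw_line in conda_yaml_text.splitlines():
        line = raw_line.strip()
        if line == 'dependencies:':
            in_dependencies = True
            continue
        if not in_dependencies or not line.startswith('- '):
            continue
        entry = line[2:].strip()
        if entry.endswith(':'):
            # Nested section header such as "- pip:"; not a package.
            continue
        if not any(symbol in entry for symbol in '=<>'):
            unpinned.append(entry)
    return unpinned
```

The function takes the file contents as a string, for example the result of `open('./dependencies/conda.yaml').read()`.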


With the `conda.yaml` file in place, the environment can be created and registered in the workspace.

In [None]:
from azure.ai.ml.entities import Environment

custom_env_name = 'aml-loanstatus-cpu'

custom_job_env = Environment(
    name=custom_env_name,
    description='Custom environment for Loan Status linear model job',
    tags={'scikit-learn': '1.1.2'},
    conda_file=os.path.join(dependencies_dir, 'conda.yaml'),
    image='mcr.microsoft.com/azureml/openmpi3.1.2-ubuntu18.04:latest',
)
custom_job_env = ml_client.environments.create_or_update(custom_job_env)

print(
    f"Environment with name {custom_job_env.name} is registered to workspace, the environment version is {custom_job_env.version}"
)

Environment with name aml-loanstatus-cpu is registered to workspace, the environment version is 2


## Create training script
First, create the source folder that will hold the training script, `main.py`.

In [None]:
import os

train_src_dir = './src'
os.makedirs(train_src_dir, exist_ok=True)

The training script prepares the environment, reads the data, preprocesses it, trains and evaluates the model, and then saves and registers it. It imports the dependencies, sets the seed, defines the input/output arguments with `argparse`, reads the train/test sets, separates the features from the target, and scales the features with `MinMaxScaler`, which is fit on the train set only and then applied to the test set. The number of samples and features are logged with `MLflow`. The script then trains a `LogisticRegression` model with the hyperparameters passed as arguments (the best parameters from `GridSearchCV`). The `classification_report` and `confusion_matrix` plots are logged as `MLflow` artifacts, and `accuracy`, `precision`, `recall` and `f1_score` for the train/test sets are logged as metrics. Finally, the model is saved and registered.
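To make the scaling step concrete, here is a minimal pure-Python sketch of min-max scaling, where the statistics come from the train rows and are reused for the test rows. The helper names `fit_min_max` and `apply_min_max` and the toy data are illustrative only; the script itself uses scikit-learn's `MinMaxScaler`:

```python
def fit_min_max(rows):
    """Return per-column (minimum, range) computed from the training rows."""
    columns = list(zip(*rows))
    stats = []
    for column in columns:
        low, high = min(column), max(column)
        # A constant column has zero range; use 1 to avoid dividing by zero.
        stats.append((low, (high - low) or 1))
    return stats


def apply_min_max(rows, stats):
    """Scale each value with the training statistics: (x - min) / range."""
    return [
        [(value - low) / span for value, (low, span) in zip(row, stats)]
        for row in rows
    ]


train = [[10.0, 1.0], [20.0, 3.0], [30.0, 5.0]]
test = [[40.0, 2.0]]

stats = fit_min_max(train)
print(apply_min_max(train, stats))  # [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
print(apply_min_max(test, stats))   # [[1.5, 0.25]]
```

Because the test rows are scaled with the train statistics, test values can fall outside the 0 to 1 range, as the 1.5 here shows.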


In [None]:
%%writefile {train_src_dir}/main.py
import os
import random
import numpy as np
import warnings
import argparse
import pandas as pd
import mlflow
import mlflow.sklearn
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from joblib import parallel_backend
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.metrics import accuracy_score, precision_score
from sklearn.metrics import recall_score, f1_score
import matplotlib.pyplot as plt
import seaborn as sns
warnings.filterwarnings('ignore')

seed_value = 42
os.environ['LoanStatus_Linear'] = str(seed_value)
random.seed(seed_value)
np.random.seed(seed_value)

def main():
    """Main function of the script."""

    # Input and output arguments
    parser = argparse.ArgumentParser()
    parser.add_argument('--train_data', type=str,
                        help='path to input train data')
    parser.add_argument('--test_data', type=str, help='path to input test data')
    parser.add_argument('--penalty', required=False, default='l2', type=str)
    parser.add_argument('--solver', required=False, default='lbfgs', type=str)
    parser.add_argument('--max_iter', required=False, default=100, type=int)
    # C is the inverse regularization strength, a positive float
    parser.add_argument('--C', required=False, default=1.0, type=float)
    parser.add_argument('--tol', required=False, default=1e-4, type=float)
    parser.add_argument('--n_jobs', required=False, default=1, type=int)
    parser.add_argument('--registered_model_name', type=str, help='model name')
    args = parser.parse_args()

    # Start Logging
    mlflow.start_run()

    # Enable autologging
    mlflow.sklearn.autolog()

    ###################
    #<prepare the data>
    ###################
    print(" ".join(f"{k}={v}" for k, v in vars(args).items()))

    print('Input Train Data:', args.train_data)
    print('Input Test Data:', args.test_data)

    trainDF = pd.read_csv(args.train_data, low_memory=False)
    testDF = pd.read_csv(args.test_data, low_memory=False)

    # Select the target as a 1-D Series, the shape scikit-learn expects for y
    train_label = trainDF['loan_status']
    test_label = testDF['loan_status']

    train_features = trainDF.drop(columns = ['loan_status'])
    test_features = testDF.drop(columns = ['loan_status'])

    print(f"Training with data of shape {train_features.shape}")

    scaler = MinMaxScaler()
    train_features = scaler.fit_transform(train_features)
    test_features = scaler.transform(test_features)

    mlflow.log_metric('num_samples', train_features.shape[0])
    mlflow.log_metric('num_features', train_features.shape[1])

    ####################
    #</prepare the data>
    ####################

    ##################
    #<train the model>
    ##################
    # Define model
    model = LogisticRegression(penalty=args.penalty,
                               solver=args.solver,
                               max_iter=args.max_iter,
                               C=args.C,
                               tol=args.tol,
                               random_state=seed_value)

    # Fit model
    with parallel_backend('threading', n_jobs=args.n_jobs):
        model.fit(train_features, train_label)

    ##################
    #</train the model>
    ##################

    #####################
    #<evaluate the model>
    #####################
    # Predict
    train_label_pred = model.predict(train_features)
    test_label_pred = model.predict(test_features)

    clr_train = classification_report(train_label, train_label_pred,
                                      output_dict=True)
    sns.heatmap(pd.DataFrame(clr_train).iloc[:-1,:].T, annot=True)
    plt.savefig('clr_train.png')
    mlflow.log_artifact('clr_train.png')
    plt.close()

    clr_test = classification_report(test_label, test_label_pred,
                                     output_dict=True)
    sns.heatmap(pd.DataFrame(clr_test).iloc[:-1,:].T, annot=True)
    plt.savefig('clr_test.png')
    mlflow.log_artifact('clr_test.png')
    plt.close()

    cm_train = confusion_matrix(train_label, train_label_pred)
    cm_train = ConfusionMatrixDisplay(confusion_matrix=cm_train)
    cm_train.plot()
    plt.savefig('cm_train.png')
    mlflow.log_artifact('cm_train.png')
    plt.close()

    cm_test = confusion_matrix(test_label, test_label_pred)
    cm_test = ConfusionMatrixDisplay(confusion_matrix=cm_test)
    cm_test.plot()
    plt.savefig('cm_test.png')
    mlflow.log_artifact('cm_test.png')
    plt.close()

    train_accuracy = accuracy_score(train_label, train_label_pred)
    train_precision = precision_score(train_label, train_label_pred)
    train_recall = recall_score(train_label, train_label_pred)
    train_f1 = f1_score(train_label, train_label_pred)

    test_accuracy = accuracy_score(test_label, test_label_pred)
    test_precision = precision_score(test_label, test_label_pred)
    test_recall = recall_score(test_label, test_label_pred)
    test_f1 = f1_score(test_label, test_label_pred)

    mlflow.log_metric('train_accuracy', train_accuracy)
    mlflow.log_metric('train_precision', train_precision)
    mlflow.log_metric('train_recall', train_recall)
    mlflow.log_metric('train_f1', train_f1)
    mlflow.log_metric('test_accuracy', test_accuracy)
    mlflow.log_metric('test_precision', test_precision)
    mlflow.log_metric('test_recall', test_recall)
    mlflow.log_metric('test_f1', test_f1)

    #####################
    #</evaluate the model>
    #####################

    ##########################
    #<save and register model>
    ##########################
    # Registering the model to the workspace
    print('Registering the model via MLFlow')
    mlflow.sklearn.log_model(
        sk_model=model,
        registered_model_name=args.registered_model_name,
        artifact_path=args.registered_model_name,
    )

    # Saving the model to a file
    mlflow.sklearn.save_model(
        sk_model=model,
        path=os.path.join(args.registered_model_name, 'trained_model'),
    )

    ###########################
    #</save and register model>
    ###########################

    # Stop Logging
    mlflow.end_run()

if __name__ == "__main__":
    main()


Writing ./src/main.py
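The four metrics logged by the script all derive from the counts in the confusion matrix. The sketch below shows the arithmetic for a binary target, assuming the positive class is labelled `1`. The helper `binary_metrics` and the toy labels are illustrative; the script relies on the scikit-learn functions instead:

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 from paired labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {'accuracy': accuracy, 'precision': precision,
            'recall': recall, 'f1': f1}


y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Counts here: 3 true positives, 1 false positive, 1 false negative,
# 3 true negatives, so every metric works out to 0.75.
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```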


## Train the model with specified components
To train the model, submit a `command` job. The job declares the input data and the hyperparameters as inputs, then runs the training script on the specified compute resource and environment, passing those inputs as command-line arguments.

In [None]:
from azure.ai.ml import command
from azure.ai.ml import Input

registered_model_name = 'loanstatus_us_linear_model'

job = command(
    inputs=dict(
        train_data=Input(
            type='uri_file',
            path='azureml://datastores/loanstatus_datastore/paths/trainDF_US.csv',
        ),
        test_data=Input(
            type='uri_file',
            path = 'azureml://datastores/loanstatus_datastore/paths/testDF_US.csv',
        ),
        penalty='l1',
        solver='saga',
        max_iter=100000,
        C=1,
        tol=1e-06,
        n_jobs=-1,
        registered_model_name=registered_model_name,
    ),

    code='./src/',
    command='python main.py --train_data ${{inputs.train_data}} --test_data ${{inputs.test_data}} --penalty ${{inputs.penalty}} --solver ${{inputs.solver}} --max_iter ${{inputs.max_iter}} --C ${{inputs.C}} --tol ${{inputs.tol}} --n_jobs ${{inputs.n_jobs}} --registered_model_name ${{inputs.registered_model_name}}',
    environment='aml-loanstatus-cpu@latest',
    compute='cpu-cluster-E8s-v3',
    display_name='loanstatus_us_linear_best_prediction',
)
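Each `${{inputs.<name>}}` placeholder in the `command` string is replaced by the matching input value at run time, so an input that has no placeholder never reaches the script. One way to keep the two in sync is to generate the string from the input names. This is a minimal sketch that assumes every input name matches an `argparse` flag in `main.py`; the helper `build_command` is hypothetical:

```python
def build_command(script, input_names):
    """Build a command with one '--name ${{inputs.name}}' pair per input."""
    arguments = ' '.join(
        '--' + name + ' ${{inputs.' + name + '}}' for name in input_names
    )
    return 'python ' + script + ' ' + arguments


# Iterating over a dict yields its keys, so the inputs mapping can be reused.
example_inputs = {'penalty': 'l1', 'C': 1, 'tol': 1e-06}
print(build_command('main.py', example_inputs))
# python main.py --penalty ${{inputs.penalty}} --C ${{inputs.C}} --tol ${{inputs.tol}}
```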

## Submit the job
The job can then be submitted to Azure Machine Learning with the `create_or_update` method of `ml_client`; its progress can be followed in Azure Machine Learning studio.

In [None]:
ml_client.create_or_update(job)

Uploading src (0.01 MBs): 100%|██████████| 6105/6105 [00:00<00:00, 395731.86it/s]



| Experiment | Name | Type | Status | Details Page |
| --- | --- | --- | --- | --- |
| Linear | placid_calypso_gckg34lwnd | command | Starting | Link to Azure Machine Learning studio |
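The job is reported as `Starting` right after submission. To wait for it from the notebook rather than from the studio page, the logs can be streamed until the run finishes. This is a minimal sketch that assumes the `ml_client` and `job` objects defined above are passed in; the wrapper `submit_and_stream` is hypothetical, while `jobs.create_or_update`, `jobs.stream`, and `jobs.get` are methods of the `MLClient` job operations:

```python
def submit_and_stream(ml_client, job):
    """Submit a job, stream its logs until it ends, and return its final state."""
    returned_job = ml_client.jobs.create_or_update(job)
    print('Studio URL:', returned_job.studio_url)

    # Blocks while the run executes and prints its log output as it arrives.
    ml_client.jobs.stream(returned_job.name)

    # Fetch the job again so that its status reflects the finished run.
    return ml_client.jobs.get(returned_job.name)
```

A call such as `finished = submit_and_stream(ml_client, job)` followed by `print(finished.status)` would then show the final status of the run.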
