# Development

We start development on a simple machine learning model using the diabetes dataset.

## Connect to a workspace

We first connect to our Azure Machine Learning workspace. `MLClient.from_config` reads the workspace details from a JSON config file (`config.json`) in your local directory. The _ml_client_ variable is used below to interact with our workspace.

In [None]:
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient.from_config(credential=DefaultAzureCredential(), path="./")
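
For reference, the config file read by `MLClient.from_config` is a small JSON document holding three workspace identifiers. The following is a minimal sketch that writes such a file. The helper name `write_workspace_config` is hypothetical, the three values are placeholders to replace with your own, and the demo writes to a temporary directory so that it does not overwrite an existing config.

```python
import json
import tempfile
from pathlib import Path


def write_workspace_config(directory, subscription_id, resource_group, workspace_name):
    """Write a config.json with the three workspace identifiers and return its path."""
    config = {
        "subscription_id": subscription_id,
        "resource_group": resource_group,
        "workspace_name": workspace_name,
    }
    config_path = Path(directory) / "config.json"
    config_path.write_text(json.dumps(config, indent=4))
    return config_path


# demo in a temporary directory; all three values are placeholders (assumptions)
demo_dir = tempfile.mkdtemp()
config_path = write_workspace_config(
    demo_dir,
    subscription_id="<your-subscription-id>",
    resource_group="<your-resource-group>",
    workspace_name="<your-workspace-name>",
)
print(config_path.read_text())
```

To use it with the cell above, you would write the file to the directory passed as `path` (here `"./"`).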

Because we use this data repeatedly, we load it from the data asset registered in our workspace. Among other benefits, a data asset lets us reference the data by name and version rather than by its storage location.

In [None]:
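import mltable

# fetch the registered data asset by name and version
data_asset = ml_client.data.get(name="diabetes", version="1")

# mltable expects a list of {'file': <path>} dictionaries
path = {
    'file': data_asset.path
}

tbl = mltable.from_delimited_files(paths=[path])

# materialize the table as a pandas dataframe
df = tbl.to_pandas_dataframe()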

There is more than one way to load data: you can also read it directly from blob storage, or through the Datastore URI. The appeal of Datastores and data assets is that they remove credentials from the workflow.
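
To make the Datastore URI option concrete, here is a minimal sketch that assembles one from its parts. The helper name `datastore_uri` is hypothetical, every identifier is a placeholder, and the datastore name and file path are assumptions about where the CSV might live.

```python
def datastore_uri(subscription_id, resource_group, workspace_name, datastore_name, path):
    """Build an azureml:// URI that points at a file or folder in a datastore."""
    return (
        f"azureml://subscriptions/{subscription_id}"
        f"/resourcegroups/{resource_group}"
        f"/workspaces/{workspace_name}"
        f"/datastores/{datastore_name}"
        f"/paths/{path.lstrip('/')}"
    )


# placeholder identifiers; datastore name and path are assumptions
uri = datastore_uri(
    subscription_id="<your-subscription-id>",
    resource_group="<your-resource-group>",
    workspace_name="<your-workspace-name>",
    datastore_name="workspaceblobstore",
    path="data/diabetes.csv",
)
print(uri)
```

With the `azureml-fsspec` package installed, pandas can typically read such a URI directly, for example with `pd.read_csv(uri)`.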

The above is the ideal state. For now, we ingest the data from our local CSV file.

In [None]:
import pandas as pd

df = pd.read_csv('data/diabetes.csv')

df.head()

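| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |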


## Data Prep

Let's add an extra step and normalize our numeric features to the [0, 1] range.

In [None]:
from sklearn.preprocessing import MinMaxScaler

# select the numeric columns, excluding the last one (Outcome, our response variable)
numeric_cols = df.select_dtypes(include=['float64','int64']).columns[:-1]

# initialize MinMaxScaler
scaler = MinMaxScaler()

# scale only the numeric feature columns
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

df.head()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.352941 | 0.743719 | 0.590164 | 0.353535 | 0.0 | 0.500745 | 0.234415 | 0.483333 | 1 |
| 1 | 0.058824 | 0.427136 | 0.540984 | 0.292929 | 0.0 | 0.396423 | 0.116567 | 0.166667 | 0 |
| 2 | 0.470588 | 0.919598 | 0.52459 | 0.0 | 0.0 | 0.347243 | 0.253629 | 0.183333 | 1 |
| 3 | 0.058824 | 0.447236 | 0.540984 | 0.232323 | 0.111111 | 0.418778 | 0.038002 | 0.0 | 0 |
| 4 | 0.0 | 0.688442 | 0.327869 | 0.353535 | 0.198582 | 0.642325 | 0.943638 | 0.2 | 1 |
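
To see what the scaler does to a single column, the sketch below applies the min-max formula `(x - min) / (max - min)` in plain Python. The function name `min_max_scale` and the sample column are hypothetical. The sample assumes a Pregnancies-like column ranging from 0 to 17, which is consistent with the scaled values shown above but is not read from the dataset.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # constant column: there is no range to scale, so map everything to 0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# hypothetical column whose minimum is 0 and maximum is 17 (assumption)
column = [6, 1, 8, 0, 17]
scaled = min_max_scale(column)
print([round(v, 6) for v in scaled])
```

Under that assumption, 6 maps to 6/17 and 8 maps to 8/17, matching the first and third rows of the Pregnancies column.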


## Training

We train our model with XGBoost.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import xgboost as xgb
import numpy as np

# separate features (x) and target (y)
x = df.drop(columns=["Outcome"])
y = df["Outcome"]

# split the data into training (80%) and test (20%) sets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=42)

# create xgboost dmatrices; the test matrix needs no labels for prediction
dtrain = xgb.DMatrix(x_train, label=y_train)
dtest = xgb.DMatrix(x_test)

# train with the default parameters
# (the default objective is reg:squarederror, so predictions are continuous scores)
bst = xgb.train({}, dtrain)

y_pred = bst.predict(dtest)

# round the scores to binary predictions (0 or 1), i.e. threshold at 0.5
y_pred_binary = np.round(y_pred)

print(classification_report(y_test, y_pred_binary))


              precision    recall  f1-score   support

           0       0.79      0.78      0.79        99
           1       0.61      0.64      0.62        55

    accuracy                           0.73       154
   macro avg       0.70      0.71      0.71       154
weighted avg       0.73      0.73      0.73       154
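
As a reading aid for the report above, the sketch below computes precision, recall, F1, and accuracy from confusion-matrix counts. The function name `binary_metrics` is hypothetical, and the counts are an assumption: they were back-derived from the rounded figures in the report (support of 99 and 55), not taken from the notebook's actual confusion matrix.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1 (for the positive class) and overall accuracy."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# counts for class 1, inferred from the rounded report (assumption)
metrics = binary_metrics(tp=35, fp=22, fn=20, tn=77)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Swapping the roles of the two classes (`tp=77, fp=20, fn=22, tn=35`) gives the corresponding figures for class 0.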



Assuming we have finished our model, including hyperparameter tuning, we move on to the next step: productionalizing our notebook.

Convert the same notebook into a Python script (extension .py). If working in Visual Studio Code, the easiest way to do this is by clicking the three dots > Export > Python Script.
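
Outside the editor, `jupyter nbconvert --to script` does the same conversion from the command line. To show roughly what such a conversion amounts to, here is a minimal sketch that pulls the code cells out of a notebook's JSON. The function names and the in-memory demo notebook are hypothetical, and the sketch ignores details such as cell magics.

```python
import json


def notebook_to_script(notebook):
    """Return the code cells of a parsed .ipynb notebook as one Python source string."""
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        source = cell.get("source", "")
        # the notebook format stores source as a string or as a list of lines
        if isinstance(source, list):
            source = "".join(source)
        if source.strip():
            chunks.append(source.rstrip() + "\n")
    return "\n\n".join(chunks)


def convert_file(ipynb_path, py_path):
    """Read a notebook file and write its code cells to a .py file."""
    with open(ipynb_path, encoding="utf-8") as f:
        notebook = json.load(f)
    with open(py_path, "w", encoding="utf-8") as f:
        f.write(notebook_to_script(notebook))


# tiny in-memory notebook standing in for the real file (hypothetical content)
demo_notebook = {
    "cells": [
        {"cell_type": "markdown", "source": ["# Development\n"]},
        {"cell_type": "code", "source": ["import pandas as pd\n", "df = pd.read_csv('data/diabetes.csv')"]},
        {"cell_type": "code", "source": "df.head()"},
    ]
}
script = notebook_to_script(demo_notebook)
print(script)
```

The resulting script usually still needs cleanup, for example replacing display-only calls such as `df.head()` with logging or removing them.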