# Diabetes classification

This is a basic example of supervised learning classification using tabular data.

File information:

* **File name**: diabetes.csv
* **Features**:
  * **PatientID**: ID
  * **Pregnancies**: Number of times pregnant
  * **PlasmaGlucose**: Plasma glucose concentration at 2 hours in an oral glucose tolerance test
  * **DiastolicBloodPressure**: Diastolic blood pressure (mm Hg)
  * **TricepsThickness**: Triceps skin fold thickness (mm)
  * **SerumInsulin**: 2-Hour serum insulin (mu U/ml)
  * **BMI**: Body mass index (weight in kg/(height in m)^2)
  * **DiabetesPedigree**: Diabetes pedigree function
  * **Age**: Age in years
* **Target variable**:
  * **Diabetic**: 0 (not diabetic) or 1 (diabetic)
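
Before training, it can help to confirm that a file actually contains the columns listed above. The following is a minimal sketch using only the standard library. The helper `missing_columns` and the sample text are hypothetical, and the column spellings follow the names used in the training code in section 3.

```python
import csv
import io

# Column names taken from the feature list above (spelled as in the training code).
EXPECTED_COLUMNS = [
    "PatientID", "Pregnancies", "PlasmaGlucose", "DiastolicBloodPressure",
    "TricepsThickness", "SerumInsulin", "BMI", "DiabetesPedigree", "Age",
    "Diabetic",
]


def missing_columns(csv_text, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from the CSV header."""
    header = next(csv.reader(io.StringIO(csv_text)), [])
    present = {name.strip() for name in header}
    return [name for name in expected if name not in present]


# Hypothetical sample for illustration only; these are not rows from diabetes.csv.
sample = (
    "PatientID,Pregnancies,PlasmaGlucose,DiastolicBloodPressure,"
    "TricepsThickness,SerumInsulin,BMI,DiabetesPedigree,Age\n"
    "1,0,100,70,20,30,25.0,0.5,40\n"
)
print(missing_columns(sample))  # the sample header has no target column
```

An empty result means every expected column is present. A non-empty result names the columns to investigate before the feature selection step.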

## 1. Login

In [None]:
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

credential = DefaultAzureCredential()
ml_client = MLClient.from_config(credential = credential)

## 2. Prepare data

The data is stored in a storage account that is already connected to the workspace through a Datastore of type Azure Blob Storage. To simplify access, a data asset of type file (`uri_file`) is used.
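
The data asset points at the file through a datastore URI of the form `azureml://datastores/<datastore>/paths/<file>`. As a rough sketch, the URI can be assembled with a small helper. `datastore_uri` is a hypothetical function written for this notebook, not part of the Azure ML SDK.

```python
def datastore_uri(datastore_name, relative_path):
    """Build an azureml:// datastore URI (hypothetical helper, not an SDK call)."""
    if not datastore_name:
        raise ValueError("datastore_name must not be empty")
    # Drop leading slashes so the path segment is not doubled up.
    return f"azureml://datastores/{datastore_name}/paths/{relative_path.lstrip('/')}"


print(datastore_uri("workspacetabulardata", "diabetes.csv"))
```

Keeping the datastore name and the file name as separate arguments makes it easier to point the same notebook at a different file later.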

In [None]:
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes
import mltable

### 2.1. Create data asset (type file)

In [None]:
# File location
datastore_name = "workspacetabulardata"
file_name = "diabetes.csv"
path = f"azureml://datastores/{datastore_name}/paths/{file_name}"

# Data asset configuration
data_asset_name = "diabetes_tabular_file"
data_asset_version = "1.0"

my_data = Data(
    path = path,
    type = AssetTypes.URI_FILE,
    description = "Diabetes dataset file",
    name = data_asset_name,
    version = data_asset_version
)

# Create data asset
ml_client.data.create_or_update(my_data)

### 2.2. Read data asset

In [None]:
# Get data asset
data_asset = ml_client.data.get(name = data_asset_name, version = data_asset_version)

# Read data asset
path = {
    "file": data_asset.path
}

tbl = mltable.from_delimited_files(paths = [path])
df = tbl.to_pandas_dataframe()
df.head()

In [None]:
df.info()

## 3. Project code

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve

In [None]:
# Load the diabetes dataset
print("Loading Data...")
data_asset = ml_client.data.get(name = data_asset_name, version = data_asset_version)
tbl = mltable.from_delimited_files(paths = [{ "file": data_asset.path }])
diabetes = tbl.to_pandas_dataframe()

# Separate features and labels
X = diabetes[['Pregnancies', 'PlasmaGlucose', 'DiastolicBloodPressure', 'TricepsThickness', 'SerumInsulin', 'BMI', 'DiabetesPedigree', 'Age']].values
y = diabetes['Diabetic'].values

# Split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

# Set regularization rate (scikit-learn's C is the inverse of
# regularization strength, so the model below uses C = 1/reg)
reg = 0.01

# Train a logistic regression model
print('Training a logistic regression model with regularization rate of', reg)
model = LogisticRegression(C = 1/reg, solver = "liblinear").fit(X_train, y_train)

# Calculate accuracy
y_hat = model.predict(X_test)
acc = np.average(y_hat == y_test)
print("Accuracy:", acc)

# Calculate AUC from the predicted probability of the positive class
y_scores = model.predict_proba(X_test)
auc = roc_auc_score(y_test, y_scores[:, 1])
print("AUC:", auc)
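
To make the AUC value easier to interpret: for binary labels it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it by direct pair counting in plain Python. `pairwise_auc` is a hypothetical helper for illustration, and the labels and scores are made-up values, not results from this dataset.

```python
def pairwise_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Made-up labels and scores for illustration only.
print(pairwise_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 3 of 4 pairs ordered correctly
```

This quadratic loop is only meant for small inputs. On the full test set, `roc_auc_score` remains the practical choice.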