<font color='grey'>

### Step 1: Pre-processing
</font>

In [1]:
import pandas as pd
import numpy as np
# pandas-profiling was renamed to ydata-profiling; fall back to the old name if only that is installed
try:
    from ydata_profiling import ProfileReport
except ImportError:
    from pandas_profiling import ProfileReport
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from joblib import dump, load

In [None]:
# load the data

data = pd.read_csv('Diabetes.csv')
data

In [None]:
data.info()

In [None]:
data.Outcome.value_counts()

In [None]:
eda_profiling = ProfileReport(data)

In [None]:
eda_profiling

In [None]:
data.head()

In [None]:
#replace zeros with NaNs (np.nan; the np.NaN alias was removed in NumPy 2.0)
zero_cols = ['Glucose','BloodPressure','SkinThickness','Insulin','BMI']
data[zero_cols] = data[zero_cols].replace(0, np.nan)


In [None]:
data

In [None]:
data.info()

In [None]:
#function to impute the missing values with median based on Outcome class

def impute_median(data,var):
    temp = data[data[var].notnull()]
    temp = temp[[var,'Outcome']].groupby(['Outcome'])[[var]].median()
    data.loc[(data['Outcome'] == 0 ) & (data[var].isnull()), var] = temp.loc[0 ,var]
    data.loc[(data['Outcome'] == 1 ) & (data[var].isnull()), var] = temp.loc[1 ,var]
    
    return data

In [None]:
#impute values using the function
data = impute_median(data, 'Glucose')
data = impute_median(data, 'BloodPressure')
data = impute_median(data, 'SkinThickness')
data = impute_median(data, 'Insulin')
data = impute_median(data, 'BMI')


In [None]:
data.head()

In [None]:

#separate features and target as x & y
y = data['Outcome']
x = data.drop('Outcome', axis = 1)
columns = x.columns

#scale the values using a StandardScaler
scaler = StandardScaler()
scaler = scaler.fit(x)
X = scaler.transform(x)

#features DataFrame 
features = pd.DataFrame(X, columns = columns)

In [None]:
dump(scaler,'scaler.joblib')

In [None]:
features

<font color='grey'>

The first step of any machine learning problem is to analyze and explore the data. A quick way to do that is with pandas-profiling, since renamed ydata-profiling (I’ll definitely write about this super-useful package, someday. UPDATE: I did, and here’s more). A quick exploration shows that although there are no missing values, variables like blood pressure, skin thickness, glucose level, BMI, and insulin level contain a considerable number of zeros, which are not plausible measurements. So we treat those zeros as missing and impute each one with the median of that variable within the same target class.

We also scale the features using a StandardScaler, which standardizes each numeric variable to zero mean and unit variance so that all of them sit on a comparable range. We can definitely pre-process and transform more to create more relevant features; however, our focus is to demonstrate a quick end-to-end pipeline that can be further improved later.
</font>
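
To make the imputation idea concrete, here is a minimal sketch on a made-up toy table (the values below are invented for illustration and are not from Diabetes.csv). It counts the implausible zeros, marks them as missing, and fills each gap with the median of its own Outcome class. It uses `groupby(...).transform("median")` as a compact alternative to the per-column function above; that is one possible way to write it, not the only one.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the diabetes table (values are made up for illustration).
toy = pd.DataFrame({
    "Glucose": [148, 0, 183, 89, 0, 116],
    "BMI": [33.6, 26.6, 0.0, 28.1, 43.1, 25.6],
    "Outcome": [1, 0, 1, 0, 1, 0],
})

cols = ["Glucose", "BMI"]

# Count the implausible zeros before changing anything.
zero_counts = (toy[cols] == 0).sum()

# Treat zeros as missing values.
cleaned = toy.copy()
cleaned[cols] = cleaned[cols].replace(0, np.nan)

# Per-class medians, broadcast back to the original row order.
class_medians = cleaned.groupby("Outcome")[cols].transform("median")

# Fill each missing value with the median of its own Outcome class.
cleaned[cols] = cleaned[cols].fillna(class_medians)

print(zero_counts)
print(cleaned)
```

Note that the medians here depend on the target, so this trick only works where Outcome is known, i.e. on the training table.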

<font color='grey'>

### Step 2: Training
</font>

In [None]:
#split data into training and test sets
x_train, x_test, y_train, y_test = train_test_split(features, y, test_size = 0.2, random_state = 42)

#define the model
model = RandomForestClassifier(n_estimators=300, bootstrap = True, max_features = 'sqrt')

#fit model to training data
model.fit(x_train, y_train)

#predict on test data
y_pred = model.predict(x_test)

#evaluate performance
print(classification_report(y_test, y_pred))

In [None]:
dump(model,'model.joblib')
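
One of the "can be improved later" items: in the cells above the scaler is fitted on the full table before the train/test split. A possible refinement, sketched below on synthetic data from `make_classification` (an assumption, used only so the example runs without Diabetes.csv), is to wrap the scaler and the forest in a scikit-learn `Pipeline` so the scaler only ever sees the training split. The step names `"scaler"` and `"rf"` and the smaller `n_estimators` are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 8 numeric features and a binary target.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=5, random_state=42
)

# Stratify so both splits keep the same class balance.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The pipeline fits the scaler on the training split only,
# then reuses those statistics when scoring the test split.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("rf", RandomForestClassifier(n_estimators=50, max_features="sqrt", random_state=42)),
])
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
print(f"test accuracy: {test_accuracy:.3f}")
```

A fitted pipeline can also be saved with a single `dump` call, which keeps the scaler and the model together in one file.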

<font color='grey'>

### Step 3: Inference
</font>

In [None]:
pregnancies = 2
glucose = 13
bloodpressure = 30
skinthickness = 4
insulin = 5
bmi = 5
dpf = 0.55
age = 34
feat_cols = features.columns

row = [pregnancies, glucose, bloodpressure, skinthickness, insulin, bmi, dpf, age]

In [None]:
scaler = load('scaler.joblib')

In [None]:
model = load('model.joblib')

In [None]:
feat_cols

In [None]:
df = pd.DataFrame([row], columns = feat_cols)
#reuse the already-fitted scaler; fit_transform would refit it on this single row
X = scaler.transform(df)
features = pd.DataFrame(X, columns = feat_cols)
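
To round off the inference step, here is a self-contained sketch of the full round trip: fit, save, load, scale one new row, and predict. It runs on random stand-in data and writes to a temporary directory, so none of it touches the real `scaler.joblib` or `model.joblib`. The column names `Pregnancies`, `DiabetesPedigreeFunction`, and `Age` are assumed (the notebook reads its columns from `features.columns`), and the row values are arbitrary.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from joblib import dump, load
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler

# Column names are assumed; the notebook takes them from features.columns.
feat_cols = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
             "Insulin", "BMI", "DiabetesPedigreeFunction", "Age"]

# Random stand-in for the training table, only so the sketch runs on its own.
rng = np.random.default_rng(0)
train = pd.DataFrame(rng.normal(size=(200, len(feat_cols))), columns=feat_cols)
target = (train["Glucose"] > train["Glucose"].median()).astype(int)

# Training side: fit once, then persist both artifacts.
scaler = StandardScaler().fit(train)
scaled_train = pd.DataFrame(scaler.transform(train), columns=feat_cols)
model = RandomForestClassifier(n_estimators=50, random_state=42).fit(scaled_train, target)

with tempfile.TemporaryDirectory() as tmp:
    dump(scaler, os.path.join(tmp, "scaler.joblib"))
    dump(model, os.path.join(tmp, "model.joblib"))
    # Inference side: load each artifact under its own name.
    loaded_scaler = load(os.path.join(tmp, "scaler.joblib"))
    loaded_model = load(os.path.join(tmp, "model.joblib"))

# One new observation, scaled with transform (not fit_transform).
row = [2, 0.1, -0.3, 0.4, 0.5, -1.2, 0.55, 0.8]
df = pd.DataFrame([row], columns=feat_cols)
scaled_row = pd.DataFrame(loaded_scaler.transform(df), columns=feat_cols)

prediction = loaded_model.predict(scaled_row)             # predicted class, 0 or 1
probabilities = loaded_model.predict_proba(scaled_row)    # one probability per class

print(prediction, probabilities)
```

In the notebook itself, the equivalent last step would be calling `model.predict(features)` on the scaled row built above.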