## Project Goals
This project analyzes a health-related dataset focused on diabetes to understand how various health metrics and conditions relate to a diabetes diagnosis. The features range from macroscopic (e.g., gender, age) to microscopic (e.g., blood glucose level) and include common risk factors such as a history of heart disease and smoking. Following basic exploratory analysis, PCA is applied to reduce the dimensionality of the data for a clustering analysis that could potentially categorize individuals by their likelihood of developing diabetes.

## Import Section

In [None]:
from utilities import ProjectDataFrame, ProjNormalizedDF, LogisticModel, one_hot_encoder

## 1 Data Loading and Dataset Overview

In [None]:
DATA_PATH = "diabetes_prediction_dataset.csv"
INVASIVES = ['HbA1c_level', 'blood_glucose_level']
RANDOM_STATE = 42 # for reproducibility
projectdf = ProjectDataFrame(DATA_PATH, random_state=RANDOM_STATE)

In [None]:
projectdf.dataset_overview()

## 2 Categorical Features Inspection

In [None]:
projectdf.categoricals_inspection()

## 3 Numericals Inspection

In [None]:
projectdf.numericals_inspection()

## 4 Perform Normalization

### 4.1 Handle Missing Values

#### gender - drop the small number of affected samples

In [None]:
df = projectdf.get_data()

In [None]:
# drop rows with missing gender values
df = df.dropna(subset=['gender'])

#### smoking_history - imputation

In [None]:
# impute missing smoking history values with the most frequent value
most_frequent_smoking_history = df['smoking_history'].mode()[0]
df.loc[:, 'smoking_history'] = df.loc[:, 'smoking_history'].fillna(most_frequent_smoking_history)
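
To illustrate what the mode imputation above does, here is a minimal sketch on a toy column. The values in `toy` are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy data: hypothetical values, not from the project dataset
toy = pd.DataFrame(
    {"smoking_history": ["never", "former", None, "never", None, "current"]}
)

# mode() returns a Series because ties are possible; [0] takes the first mode
most_frequent = toy["smoking_history"].mode()[0]

# Replace missing entries with the most frequent category
toy["smoking_history"] = toy["smoking_history"].fillna(most_frequent)

print(toy["smoking_history"].tolist())
```

Mode imputation keeps every row but inflates the share of the most common category, which is worth keeping in mind when the fraction of missing values is large.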

#### further encoding smoking_history

In [None]:
# collapse 'former' and 'not current' into 'ever' to simplify the model
df.loc[:, 'smoking_history'] = df.loc[:, 'smoking_history'].replace('former', 'ever')
df.loc[:, 'smoking_history'] = df.loc[:, 'smoking_history'].replace('not current', 'ever')
# using one hot encoding for smoking history
df = one_hot_encoder(df, "smoking_history")
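
The implementation of the project helper `one_hot_encoder` is not shown in this notebook. As a rough sketch of the same idea, assuming it behaves similarly to `pandas.get_dummies`, the collapse-then-encode step could look like this on toy data (the `toy` frame is hypothetical):

```python
import pandas as pd

# Toy data: hypothetical values covering several smoking categories
toy = pd.DataFrame(
    {"smoking_history": ["never", "former", "not current", "ever", "current"]}
)

# Collapse the "has smoked before" categories into a single "ever" level
toy["smoking_history"] = toy["smoking_history"].replace(
    {"former": "ever", "not current": "ever"}
)

# One indicator column per remaining category (assumption: the project
# helper produces a comparable layout; its real output may differ)
encoded = pd.get_dummies(toy, columns=["smoking_history"], dtype=int)

print(sorted(encoded.columns))
```

Each row ends up with exactly one indicator set to 1, so the merged categories reduce the number of resulting columns.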

### 4.2 Normalize Categoricals and Numericals

In [None]:
projectdf.set_data(df)
projectdf_nor = projectdf.normalize()

## 5 Principal Component Analysis

In [None]:
pca = projectdf_nor.perform_pca()

In [None]:
pca.get_cumu_var_plot()
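
The internals of `get_cumu_var_plot` are not shown here. As a sketch of the quantity such a plot typically displays, the cumulative explained variance ratio can be computed with NumPy alone. The data below is synthetic and the 90% cutoff is only an example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic data (hypothetical): 200 samples, 4 features, two of them correlated
X = rng.normal(size=(200, 4))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=200)

# PCA operates on mean-centered data
Xc = X - X.mean(axis=0)

# Singular values relate to component variances: var_k = s_k**2 / (n - 1)
_, s, _ = np.linalg.svd(Xc, full_matrices=False)
explained_var = s**2 / (Xc.shape[0] - 1)
explained_ratio = explained_var / explained_var.sum()
cumulative = np.cumsum(explained_ratio)

# Smallest number of components whose cumulative ratio reaches 90%
n_components_90 = int(np.searchsorted(cumulative, 0.90) + 1)

print(cumulative, n_components_90)
```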

In [None]:
pca.get_firt_two_components_scatter_plot()

## 6 Prediction Model

### 6.1 Using Raw Features

In [None]:
# initialize model training
logistic_raw = LogisticModel(projectdf_nor.features, projectdf_nor.target)
# train the model
model = logistic_raw.logistic_training(test_size=0.2, random_state=RANDOM_STATE)
# evaluate the model
logistic_raw.evaluate(model)

> On imbalanced datasets, accuracy can be misleading. Here, the goal of the logistic regression model is to identify most of the true positive cases of diabetes (high recall) while keeping precision well above the general prevalence of diabetes in the population, which is 8.5%.
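
A small worked example shows why accuracy alone is not enough at this prevalence. The confusion-matrix counts below are hypothetical; only the 8.5% prevalence comes from the note above.

```python
def metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical cohort of 1000 people with 8.5% prevalence (85 positives).
# A classifier that predicts "no diabetes" for everyone:
always_negative = metrics(tp=0, fp=0, fn=85, tn=915)

# A hypothetical classifier that finds 60 of the 85 positives with 40 false alarms:
some_model = metrics(tp=60, fp=40, fn=25, tn=875)

print(always_negative)
print(some_model)
```

The always-negative classifier reaches 91.5% accuracy while detecting no cases at all, whereas the second classifier gains only about two points of accuracy but recovers roughly 71% of the positives at 60% precision.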

In [None]:
logistic_raw.get_model_expression(model)

### 6.2 Use Principal Components to Reduce Dimensions

#### using the first five components (cumulatively accounting for 90% of the variance)

In [None]:
pc_df = pca.get_project_principal_components_dataframe()

In [None]:
# initialize the model
logistic_pca = LogisticModel(pc_df.features.iloc[:, :5], pc_df.target)
# train the model
model = logistic_pca.logistic_training(test_size=0.2, random_state=RANDOM_STATE)
# evaluate the model
logistic_pca.evaluate(model)

In [None]:
logistic_pca.get_model_expression(model)

#### using the first two components (cumulatively accounting for 50% of the variance)

> Retaining components that explain a high cumulative variance captures most of the information in the features, but fewer components can still be effective if the variance they capture is closely related to the prediction target.
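
The point that explained variance and predictive relevance are separate properties can be illustrated with a deliberately constructed synthetic example. It is unrelated to the diabetes data: PCA ranks components by variance without looking at the target, so a low-variance component may carry most of the signal.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic features (hypothetical): one noisy and irrelevant, one small but decisive
noise = rng.normal(scale=10.0, size=n)   # high variance, unrelated to the target
signal = rng.normal(scale=1.0, size=n)   # low variance, determines the target
y = (signal > 0).astype(int)

X = np.column_stack([noise, signal])
Xc = X - X.mean(axis=0)

# Rows of vt are the principal directions, ordered by explained variance
_, s, vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ vt.T  # principal component scores

# Absolute correlation of each component with the target
corr_pc1 = abs(np.corrcoef(scores[:, 0], y)[0, 1])
corr_pc2 = abs(np.corrcoef(scores[:, 1], y)[0, 1])

print(corr_pc1, corr_pc2)
```

In this construction the second component correlates far more strongly with the target than the first, even though it explains much less variance. Comparing model performance across different numbers of components, as done in this section, is therefore more informative than relying on cumulative variance alone.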

In [None]:
# initialize the model
logistic_pca = LogisticModel(pc_df.features.iloc[:, :2], pc_df.target)
# train the model
model = logistic_pca.logistic_training(test_size=0.2, random_state=RANDOM_STATE)
# evaluate the model
logistic_pca.evaluate(model)

### 6.3 Non-invasive model

> We need to consider the availability of the features.
> HbA1c_level and blood_glucose_level require invasive medical tests, while the other features are more easily accessible.
> We therefore also built a model without these two features to serve as a basic version of the model.

In [None]:
# get the normalized data without invasive features
df_nor_noninvasive = projectdf_nor.get_data().drop(columns=INVASIVES, inplace=False)
# wrap the result in a ProjNormalizedDF object
projectdf_nor_noninvasive = ProjNormalizedDF(df_nor_noninvasive)

In [None]:
# initialize the model
logistic_noninvasive = LogisticModel(projectdf_nor_noninvasive.features, projectdf_nor_noninvasive.target)
# train the model
model = logistic_noninvasive.logistic_training(test_size=0.2, random_state=RANDOM_STATE)
# evaluate the model with a lower probability threshold (8.5%, the population prevalence)
logistic_noninvasive.evaluate(model, prob_threshhold=0.085)
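
How `evaluate` applies `prob_threshhold` internally is not shown in this notebook. A plausible reading, stated here as an assumption, is that a sample is labeled positive when its predicted probability of diabetes is at least the threshold. The sketch below uses made-up probabilities to show the effect of lowering the cutoff from the usual 0.5 to 0.085.

```python
import numpy as np

# Hypothetical predicted probabilities of the positive class
# (with scikit-learn these would typically come from model.predict_proba(X)[:, 1])
proba = np.array([0.02, 0.07, 0.09, 0.30, 0.55, 0.80])

# Default decision rule: positive when probability >= 0.5
default_pred = (proba >= 0.5).astype(int)

# Lower threshold: flags more samples as positive, trading precision for recall
low_thresh_pred = (proba >= 0.085).astype(int)

print(default_pred.tolist())
print(low_thresh_pred.tolist())
```

Lowering the threshold suits a screening-style model built on non-invasive features: more people are flagged for follow-up testing, at the cost of more false positives.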