# Resources
* https://www.kaggle.com/code/shrutimechlearn/step-by-step-diabetes-classification-knn-detailed
* https://www.kaggle.com/code/ash316/ml-from-scratch-part-2
* https://www.kaggle.com/code/pouryaayria/a-complete-ml-pipeline-tutorial-acu-86 (Promising to implement)
* https://www.kaggle.com/code/vincentlugat/pima-indians-diabetes-eda-prediction-0-906
* https://www.kaggle.com/code/faysalmiah1721758/pima-indians-diabetes

In [None]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
import pandas as pd
import numpy as np

import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('/kaggle/input/pima-indians-diabetes-database/diabetes.csv')
df.head()

# Introduction and data analysis 
Describe the problem being addressed. Provide a detailed characterization of the task dataset in terms of format, volume, quality and bias.

* Problem Description: 
> Start by introducing the problem or research question that your analysis aims to address. Clearly define the scope and objectives of your analysis. Explain why this problem is important or relevant.

* Dataset Description:

> Format: 
>> Describe the format of the dataset. Is it structured (e.g., CSV, Excel) or unstructured (e.g., text, images)? Mention the data types present (e.g., numerical, categorical). 

> Volume: 
>> Specify the size of the dataset in terms of the number of records and features (columns). Mention if it's a small, medium, or large dataset.

> Quality: 
>> Discuss the quality of the dataset. Are there missing values, outliers, or data errors? How were these issues handled (e.g., data imputation, outlier removal)?

> Bias: 
>> Address any potential biases in the dataset. Bias can arise from data collection methods, sampling, or other factors. Describe how bias was considered and handled, if applicable.

## Format:
The Pima Indians Diabetes Dataset is a structured, tabular dataset. It is available on Kaggle as a CSV (comma-separated values) file and originally comes from the National Institute of Diabetes and Digestive and Kidney Diseases. Each row represents one individual, and each column represents a health-related attribute of that individual, such as glucose level or blood pressure. Most features are integers; the exceptions are "BMI" and "DiabetesPedigreeFunction", which are floating-point values.
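The integer versus floating-point split described above can be checked programmatically. The following is a minimal sketch; the helper name `split_columns_by_dtype` and the toy values are assumptions made for illustration, not rows from the real dataset.

```python
import pandas as pd
from pandas.api.types import is_float_dtype, is_integer_dtype

def split_columns_by_dtype(frame: pd.DataFrame) -> dict:
    """Group column names by whether they hold integer or floating-point data."""
    return {
        "integer": [c for c in frame.columns if is_integer_dtype(frame[c])],
        "float": [c for c in frame.columns if is_float_dtype(frame[c])],
    }

# Toy values for illustration only (hypothetical, not taken from diabetes.csv).
toy = pd.DataFrame({
    "Glucose": [100, 120],
    "BMI": [30.5, 27.25],
    "DiabetesPedigreeFunction": [0.5, 0.25],
    "Outcome": [1, 0],
})
dtype_groups = split_columns_by_dtype(toy)
print(dtype_groups)
```

On the real data, the same call would be `split_columns_by_dtype(df)`.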

TALK ABOUT THE OBJECTIVE OF THE DATASET SOMEWHERE?

## Volume:
The dataset consists of 768 instances, which is a moderate number of subjects. Each instance comes with a set of medical measurements (including glucose level, blood pressure, and BMI) and a target variable called Outcome that indicates whether the subject has diabetes. An Outcome of 0 means the person does not have diabetes; an Outcome of 1 means the person does.
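A small helper can report the volume figures in one place: the number of instances, the number of features, and how the Outcome classes are split. This is only a sketch; the function name `summarize_volume` and the toy frame are hypothetical.

```python
import pandas as pd

def summarize_volume(frame: pd.DataFrame, target: str = "Outcome") -> dict:
    """Return row count, feature count (excluding the target) and class balance."""
    n_rows, n_cols = frame.shape
    counts = frame[target].value_counts().sort_index()
    return {
        "instances": n_rows,
        "features": n_cols - 1,  # every column except the target
        "class_counts": {int(k): int(v) for k, v in counts.items()},
        "class_share": {int(k): round(float(v) / n_rows, 3) for k, v in counts.items()},
    }

# Toy data for illustration only (not rows from the real dataset).
toy = pd.DataFrame({
    "Glucose": [100, 150, 90, 160, 110],
    "BMI": [25.0, 33.0, 22.5, 36.0, 28.0],
    "Outcome": [0, 1, 0, 1, 0],
})
summary = summarize_volume(toy)
print(summary)
```

Calling `summarize_volume(df)` on the loaded dataset would give the corresponding figures for the real data.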

## Quality:
The summary statistics raise an obvious question: can the columns "Glucose", "BloodPressure", "SkinThickness", "Insulin", and "BMI" really have a minimum value of zero? In this setting, a value of zero does not make sense. For example, a BMI of zero would imply a body weight of zero kilograms, which is impossible. We suspect that these zeros either stand in for missing values or are recording errors from manual data entry. To address this, we will consider replacing the zeros with the mean or the median of the corresponding column.
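The zero-replacement idea could look roughly like the sketch below. It uses the median and computes it only from the non-zero entries, which is an assumption on our part, so that the placeholder zeros do not pull the estimate down. The helper name `impute_zeros_with_median` and the toy values are hypothetical.

```python
import pandas as pd

# Columns where a zero is implausible, as listed in the text above.
ZERO_AS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

def impute_zeros_with_median(frame: pd.DataFrame, columns=ZERO_AS_MISSING) -> pd.DataFrame:
    """Return a copy in which zeros in `columns` are replaced by the column median.

    The median is taken over the non-zero values only.
    """
    out = frame.copy()
    for col in columns:
        if col not in out.columns:
            continue  # tolerate frames that lack some of the listed columns
        values = out[col].astype(float).mask(out[col] == 0)  # zeros become NaN
        out[col] = values.fillna(values.median())  # median() skips NaN by default
    return out

# Toy data for illustration only (not rows from the real dataset).
toy = pd.DataFrame({
    "Glucose": [100, 0, 120, 140],
    "BMI": [30.0, 25.0, 0.0, 35.0],
    "Pregnancies": [0, 1, 2, 0],  # zero is a valid value here, so it is left alone
})
cleaned = impute_zeros_with_median(toy)
print(cleaned)
```

Swapping `values.median()` for `values.mean()` gives the mean-based variant. If the data is later split into training and test sets, computing the replacement value on the training portion only would avoid leaking information from the test set.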

We used boxplots to investigate outliers in the dataset. Under the IQR (interquartile range) rule, any point below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR is treated as an outlier, and these are the points drawn as individual dots on the boxplot. Each panel is also annotated with the number of outliers for that feature.
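As a standalone illustration of the IQR rule, the sketch below returns the values of a series that fall outside the 1.5 × IQR fences. The function name `iqr_outliers` and the toy series are hypothetical; the arithmetic mirrors the outlier count used in the boxplot cell of this notebook.

```python
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the entries of `series` lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

# Toy data for illustration only: one value sits far from the rest.
toy = pd.Series([10, 12, 11, 13, 12, 11, 50], name="toy_feature")
flagged = iqr_outliers(toy)
print(flagged)
```

For the toy series, Q1 is 11 and Q3 is 12.5, so the fences are 8.75 and 14.75 and only the value 50 is flagged.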

HOW TO ADDRESS THE OUTLIER BEFORE TRAINING THE MODEL?
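One possible answer to the question above, offered as a sketch rather than as the approach this notebook settles on, is to clip each feature to its IQR fences so that extreme values are pulled back to the nearest fence instead of being dropped. The helper name `clip_to_iqr_fences` and the toy series are hypothetical.

```python
import pandas as pd

def clip_to_iqr_fences(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values to [Q1 - k*IQR, Q3 + k*IQR]; no rows are removed."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.astype(float).clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Toy data for illustration only: one value sits far from the rest.
toy = pd.Series([10, 12, 11, 13, 12, 11, 50], name="toy_feature")
clipped = clip_to_iqr_fences(toy)
print(clipped)
```

Clipping keeps all 768 instances, which matters for a dataset of this size. Other options include removing the flagged rows or choosing models that are less sensitive to outliers. Whichever is used, deriving the fences from the training split alone would keep the test split untouched.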

## Bias:
Bias may arise from the sampling process, the data collection methods, and how well the sample represents the wider population. For instance, the dataset covers one specific group of people, the Pima Indians, so the analysis and the resulting model may not be applicable to other populations. Consequently, a prediction model trained on this data should not be assumed to generalize to other, more diverse settings.

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
# Confirm that Outcome (y) contains only two distinct values
df['Outcome'].unique()

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Sort by label so the counts line up with the bar positions [0, 1]
outcome_counts = df['Outcome'].value_counts().sort_index()

total_count = outcome_counts.sum()

plt.figure(figsize=(6, 5))
bars = plt.bar(x=[0, 1], height=outcome_counts, color=['blue', 'red'], alpha=0.7)
plt.title('Frequency of Test Outcome (Positive vs. Negative)')
plt.xlabel('Outcome')
plt.ylabel('Frequency')
plt.xticks([0, 1], ['Negative Tested (0)', 'Positive Tested (1)'], rotation=0)

# Customize the labels inside the bars
for i, (count, percentage) in enumerate(zip(outcome_counts, outcome_counts / total_count * 100)):
    plt.text(i, count / 2, f'{count} subjects\n({percentage:.2f}%)', ha='center', va='center', 
             fontsize=12, fontweight='bold', color='white')

plt.show()

In [None]:
num_columns = len(df.drop(['Outcome'],axis=1).columns)
num_rows = (num_columns + 2) // 3  # ceiling division: three plots per row

fig, axes = plt.subplots(nrows=num_rows, ncols=3, figsize=(10, 6 * num_rows // 2))
axes = axes.flatten()

for i, col in enumerate(df.drop(['Outcome'],axis=1).columns):
    ax = axes[i]
    boxplot = df[col].plot(kind='box', ax=ax, sharex=False, sharey=False, 
                           meanline=True, showmeans =True, meanprops={'color': 'red', 'linestyle': '--'})
    ax.set_title(f'{col}')
    ax.set_xticks([])

    # Calculate the median and mean
    median, mean = df[col].median(), df[col].mean() 
    
    # Mean and Median Annotation
    ax.text(0.62, 0.9, f'Median: {median:.2f}', transform=ax.transAxes, 
            fontsize=9, verticalalignment='top',color='green')
    ax.text(0.62, 0.8, f'Mean: {mean:.2f}', transform=ax.transAxes, 
            fontsize=9, verticalalignment='top',color='red')

    # Outlier calculation and Annotation
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)][col]
    ax.text(0.62, 0.7, f'Outliers: {len(outliers)}', transform=ax.transAxes, fontsize=9, verticalalignment='top')

for i in range(num_columns, len(axes)):
    fig.delaxes(axes[i])

plt.tight_layout()
plt.show()


In [None]:
num_features = len(df.columns) - 1  
num_rows,num_cols = 3, 3
fig, axes = plt.subplots(num_rows, num_cols, figsize=(15, 8 * num_rows // 2))
axes = axes.flatten()

for i, column in enumerate(df.columns[:-1]):
    ax = axes[i]
    # Plot All the data points
    sns.histplot(data=df, x=column, bins=20, common_norm=False, ax=ax, 
                 legend=False, color='orange', alpha=0.2, edgecolor='none')
    
    # Plot the data points separated by hue (Outcome)
    sns.histplot(data=df, x=column, hue="Outcome", bins=20, common_norm=False, 
                 ax=ax, legend=False, palette={0: 'blue', 1: 'red'}, alpha=0.4, edgecolor='none')
    
    mean = df[column].mean()
    var = df[column].var()
    skew = df[column].skew()
    
    ax.set_title(f'{column}\n(Mean={mean:.2f}, Var={var:.2f}, Skew={skew:.2f})')
    ax.set_xlabel(column)
    ax.set_ylabel('Frequency')
    
    # Calculate the per-class mean and standard deviation
    mean_0 = df[df['Outcome'] == 0][column].mean()
    mean_1 = df[df['Outcome'] == 1][column].mean()
    std_0 = df[df['Outcome'] == 0][column].std()
    std_1 = df[df['Outcome'] == 1][column].std()

    ax.annotate(f'Mean(0): {mean_0:.2f}', xy=(0.7125, 0.85), xycoords='axes fraction', fontsize=10, color='blue')
    ax.annotate(f'Mean(1): {mean_1:.2f}', xy=(0.7125, 0.75), xycoords='axes fraction', fontsize=10, color='red')
    ax.annotate(f'Std(0): {std_0:.2f}', xy=(0.7125, 0.65), xycoords='axes fraction', fontsize=10, color='blue')
    ax.annotate(f'Std(1): {std_1:.2f}', xy=(0.7125, 0.55), xycoords='axes fraction', fontsize=10, color='red')

    # Vertical lines: overall mean (black) and per-class means (blue, red)
    ax.axvline(mean, color='black', linestyle='-.', linewidth=1.5)
    ax.axvline(mean_0, color='blue', linestyle='--', linewidth=1.5)
    ax.axvline(mean_1, color='red', linestyle='--', linewidth=1.5)

for i in range(num_features, num_rows * num_cols):
    fig.delaxes(axes[i])

custom_legend = [
    plt.Line2D([0], [0], marker='o', color='w', markerfacecolor='orange', markersize=10),
    plt.Line2D([0], [0], marker='o', color='w', markerfacecolor='blue', markersize=10),
    plt.Line2D([0], [0], marker='o', color='w', markerfacecolor='red', markersize=10)
]
legend_labels = ['All the Data', 'Negative Tested', 'Positive Tested']
legend = fig.legend(custom_legend, legend_labels, loc='lower right', bbox_to_anchor=(1.0, 0.0))
legend.set_bbox_to_anchor((0.90, 0.15))

plt.tight_layout()
plt.show()

In [None]:
num_rows, num_cols = 3, 3
fig, axes = plt.subplots(num_rows, num_cols, figsize=(15, 8 * num_rows // 2))
axes = axes.flatten()

for i, column in enumerate(df.columns[:-1]):
    ax = axes[i]

    # Plot the KDE (Kernel Density Estimation) plot for the entire dataset
    sns.kdeplot(data=df, x=column, ax=ax, color='orange', legend=False, linewidth=2, alpha=0.6)

    # Plot the KDE plot for data points separated by hue (Outcome)
    sns.kdeplot(data=df[df['Outcome'] == 0], x=column, ax=ax, color='blue', label='Negative Tested', linewidth=2, alpha=0.6)
    sns.kdeplot(data=df[df['Outcome'] == 1], x=column, ax=ax, color='red', label='Positive Tested', linewidth=2, alpha=0.6)
    
    # From the previous graph
    mean = df[column].mean()
    var = df[column].var()
    skew = df[column].skew()
    
    ax.set_title(f'{column}\n(Mean={mean:.2f}, Var={var:.2f}, Skew={skew:.2f})')
    ax.set_xlabel(column)
    ax.set_ylabel('Density')
    
    # Calculate the per-class mean and standard deviation
    mean_0 = df[df['Outcome'] == 0][column].mean()
    mean_1 = df[df['Outcome'] == 1][column].mean()
    std_0 = df[df['Outcome'] == 0][column].std()
    std_1 = df[df['Outcome'] == 1][column].std()

    ax.annotate(f'Mean(0): {mean_0:.2f}', xy=(0.7125, 0.85), xycoords='axes fraction', fontsize=10, color='blue')
    ax.annotate(f'Mean(1): {mean_1:.2f}', xy=(0.7125, 0.75), xycoords='axes fraction', fontsize=10, color='red')
    ax.annotate(f'Std(0): {std_0:.2f}', xy=(0.7125, 0.65), xycoords='axes fraction', fontsize=10, color='blue')
    ax.annotate(f'Std(1): {std_1:.2f}', xy=(0.7125, 0.55), xycoords='axes fraction', fontsize=10, color='red')

    

# Remove remaining empty subplots (every column except Outcome gets a panel)
for i in range(len(df.columns) - 1, num_rows * num_cols):
    fig.delaxes(axes[i])

handles, labels = ax.get_legend_handles_labels()
custom_legend = [
    plt.Line2D([0], [0], color='orange', lw=2, label='All the Data'),
    *handles 
]
legend_labels = ['All the Data', *labels]
legend = fig.legend(custom_legend, legend_labels, loc='lower right', bbox_to_anchor=(1.0, 0.0))
legend.set_bbox_to_anchor((0.90, 0.15))

plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(10, 8))
hmap = sns.heatmap(df.corr(), annot=True, cmap='RdBu', linewidths=3.5, linecolor='white')

# Add a title
plt.title("Pearson Correlation between Features")

hmap.set_xticklabels(hmap.get_xticklabels(), rotation=90, fontsize=9)
hmap.set_yticklabels(hmap.get_yticklabels(), fontsize=9)

# Customize the legend
cbar = hmap.collections[0].colorbar
cbar.set_label('Correlation', fontsize=10)
cbar.ax.tick_params(labelsize=10)

# Tidy the layout and render the figure
plt.tight_layout()
plt.show()

# Training a Machine Learning and Deep Learning Model

### NOTE

In [None]:
num_columns = len(df.columns)
num_rows = (num_columns + 2) // 3  # ceiling division: three plots per row

fig, axes = plt.subplots(nrows=num_rows, ncols=3, figsize=(10, 6 * num_rows // 2))
axes = axes.flatten()

for i, col in enumerate(df.columns):
    ax = axes[i]
    boxplot = df[col].plot(kind='box', ax=ax, sharex=False, sharey=False)
    ax.set_title(f'{col}')
    ax.set_xticks([])

    # Calculate and annotate the median and mean
    median = df[col].median()
    mean = df[col].mean()
    ax.text(0.62, 0.9, f'Median: {median:.2f}', transform=ax.transAxes, fontsize=9, verticalalignment='top')
    ax.text(0.62, 0.8, f'Mean: {mean:.2f}', transform=ax.transAxes, fontsize=9, verticalalignment='top')

    # Calculate and annotate the count of outliers
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr
    outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)][col]
    ax.text(0.62, 0.7, f'Outliers: {len(outliers)}', transform=ax.transAxes, fontsize=9, verticalalignment='top')

for i in range(num_columns, len(axes)):
    fig.delaxes(axes[i])

plt.tight_layout()
plt.show()

In [None]:
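import matplotlib.pyplot as plt

# Pass figsize to df.plot directly: df.plot creates its own figure, so a
# separate plt.figure() call would only leave an empty extra figure behind.
df.plot(kind='box', subplots=True, layout=(3,3), sharex=False, sharey=False, figsize=(20, 10))
plt.tight_layout()
plt.show()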

In [None]:
# Assuming your DataFrame is named 'df'
df.boxplot(figsize=(12, 6))  # You can adjust the figsize as needed
plt.title('Boxplot of Multiple Columns')
plt.xticks(rotation=45)  # Rotate x-axis labels for readability
plt.show()

In [None]:
num_columns = len(df.columns)
num_rows = (num_columns + 2) // 3  # ceiling division: three plots per row

fig, axes = plt.subplots(nrows=num_rows, ncols=3, figsize=(10, 6 * num_rows//2))
axes = axes.flatten()

for i, col in enumerate(df.columns):
    ax = axes[i]
    df[col].plot(kind='box', ax=ax, sharex=False, sharey=False)
    ax.set_title(f'{col}')
    ax.set_xticks([])

for i in range(num_columns, len(axes)):
    fig.delaxes(axes[i])

# Adjust layout and display
plt.tight_layout()
plt.show()

