<a href="https://colab.research.google.com/github/alivarastepour/diabetes_prediction/blob/master/diabetes_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Purpose of this notebook
This notebook builds a model that predicts whether a person is prone to diabetes. It also seeks a subset of features (risk factors) that predicts the risk of diabetes accurately. The weights of the optimal solution will be used in another project, where they will be applied to users' inputs in real time.

## Dataset
This notebook uses a subset of a larger survey dataset that collects uniform, state-specific data on preventive health practices and risk behaviors associated with chronic diseases, injuries, and preventable infectious diseases in the adult population. The subset used here, the file `diabetes_binary_5050split_health_indicators_BRFSS2015.csv`, is available on [Kaggle](https://www.kaggle.com/datasets/alexteboul/diabetes-health-indicators-dataset?select=diabetes_binary_5050split_health_indicators_BRFSS2015.csv).

In [5]:
import pandas as pd
import numpy as np
from google.colab import drive

In [3]:
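# Mount Google Drive so that the CSV stored there can be read from Colab.
drive.mount('/drive')
# Path of the dataset inside the mounted drive.
DATASET_ADDRESS = '/drive/MyDrive/diabetes_info.csv'
raw_dataset = pd.read_csv(DATASET_ADDRESS)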

Mounted at /drive


In [4]:
raw_dataset.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70692 entries, 0 to 70691
Data columns (total 22 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Diabetes_binary       70692 non-null  float64
 1   HighBP                70692 non-null  float64
 2   HighChol              70692 non-null  float64
 3   CholCheck             70692 non-null  float64
 4   BMI                   70692 non-null  float64
 5   Smoker                70692 non-null  float64
 6   Stroke                70692 non-null  float64
 7   HeartDiseaseorAttack  70692 non-null  float64
 8   PhysActivity          70692 non-null  float64
 9   Fruits                70692 non-null  float64
 10  Veggies               70692 non-null  float64
 11  HvyAlcoholConsump     70692 non-null  float64
 12  AnyHealthcare         70692 non-null  float64
 13  NoDocbcCost           70692 non-null  float64
 14  GenHlth               70692 non-null  float64
 15  MentHlth
[output truncated]

## The correlation matrix and its usage
A correlation matrix summarizes the pairwise linear relationship between the columns of a dataset. Each entry is a correlation coefficient between -1 and 1 (pandas computes the Pearson coefficient by default). A coefficient of 1 indicates a perfect positive correlation: the two variables increase or decrease together along a straight line. A coefficient of -1 indicates a perfect negative correlation: the two variables move in opposite directions along a straight line. A coefficient close to 0 suggests little or no linear relationship between the variables, although a nonlinear relationship may still exist.
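
To make these cases concrete, here is a minimal sketch on synthetic data (not the diabetes dataset). The frame `demo` and its columns `pos`, `neg`, and `sq` are illustrative names chosen for this example: `pos` and `neg` are exact linear functions of `x`, while `sq` depends on `x` only through its square.

```python
import numpy as np
import pandas as pd

# Synthetic values, symmetric around 0 (illustration only).
x = np.arange(-5.0, 6.0)

demo = pd.DataFrame({
    "x": x,
    "pos": 2.0 * x + 1.0,   # increasing linear function of x
    "neg": -3.0 * x + 5.0,  # decreasing linear function of x
    "sq": x ** 2,           # fully determined by x, but not linearly
})

# DataFrame.corr() computes pairwise Pearson correlations by default.
demo_corr = demo.corr()

# Correlation of every column with x:
# pos -> 1, neg -> -1, sq -> 0 (up to floating-point error).
print(demo_corr["x"].round(6))
```

The `sq` column shows why "close to 0" only rules out a linear relationship: `sq` is completely determined by `x`, yet its Pearson coefficient with `x` is 0.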

This matrix helps when searching for a good subset of features: columns whose correlation with the target has a large absolute value are natural first candidates.

In [8]:
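# Pairwise Pearson correlation between all columns (the default for DataFrame.corr()).
columns = raw_dataset.keys()
correlation = raw_dataset[columns].corr()
# Correlation of every column with the target column.
correlation["Diabetes_binary"]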

Diabetes_binary         1.000000
HighBP                  0.381516
HighChol                0.289213
CholCheck               0.115382
BMI                     0.293373
Smoker                  0.085999
Stroke                  0.125427
HeartDiseaseorAttack    0.211523
PhysActivity           -0.158666
Fruits                 -0.054077
Veggies                -0.079293
HvyAlcoholConsump      -0.094853
AnyHealthcare           0.023191
NoDocbcCost             0.040977
GenHlth                 0.407612
MentHlth                0.087029
PhysHlth                0.213081
DiffWalk                0.272646
Sex                     0.044413
Age                     0.278738
Education              -0.170481
Income                 -0.224449
Name: Diabetes_binary, dtype: float64
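
One possible way to turn this output into a candidate feature subset is to rank the columns by the absolute value of their correlation with `Diabetes_binary`. The sketch below copies the coefficients from the output above; the cutoff `THRESHOLD = 0.2` is a hypothetical value picked purely for illustration, not a choice made by this notebook.

```python
import pandas as pd

# Correlations with Diabetes_binary, copied from the output above
# (the target's correlation with itself is left out).
target_corr = pd.Series({
    "HighBP": 0.381516, "HighChol": 0.289213, "CholCheck": 0.115382,
    "BMI": 0.293373, "Smoker": 0.085999, "Stroke": 0.125427,
    "HeartDiseaseorAttack": 0.211523, "PhysActivity": -0.158666,
    "Fruits": -0.054077, "Veggies": -0.079293,
    "HvyAlcoholConsump": -0.094853, "AnyHealthcare": 0.023191,
    "NoDocbcCost": 0.040977, "GenHlth": 0.407612, "MentHlth": 0.087029,
    "PhysHlth": 0.213081, "DiffWalk": 0.272646, "Sex": 0.044413,
    "Age": 0.278738, "Education": -0.170481, "Income": -0.224449,
})

# Rank by magnitude: a strong negative correlation is as informative
# as a strong positive one.
ranked = target_corr.abs().sort_values(ascending=False)

# Hypothetical cutoff, for illustration only.
THRESHOLD = 0.2
candidates = ranked[ranked >= THRESHOLD].index.tolist()
print(candidates)
```

With this cutoff, nine columns remain, led by `GenHlth` and `HighBP`. Correlation with the target is only a first filter: it ignores redundancy between features and any nonlinear effects, so the resulting list is a starting point rather than a final selection.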