# Diabetes Prediction Case Study

## About Dataset
This dataset is originally from the National Institute of Diabetes and Digestive and Kidney
Diseases. The objective of the dataset is to diagnostically predict whether a patient has diabetes,
based on certain diagnostic measurements included in the dataset. Several constraints were placed
on the selection of these instances from a larger database. In particular, all patients here are females
at least 21 years old of Pima Indian heritage.
The CSV file contains several independent variables (medical predictor variables) and a single
dependent target variable (`Outcome`).

## Data
This dataset contains 768 patient observations, each with 9 variables: 8 predictors related to diabetes risk factors and 1 outcome.
- `Pregnancies`: Number of pregnancies
- `Glucose`: Glucose level in blood (in mg/dL)
- `BloodPressure`: Blood pressure measurement (in mm Hg)
- `SkinThickness`: Thickness of skin (in mm)
- `Insulin`: Insulin level in blood
- `BMI`: Body Mass Index
- `DiabetesPedigreeFunction`: Score summarizing the likelihood of diabetes based on family history (a unitless score rather than a percentage; values in the sample exceed 1)
- `Age`: Age of patient (in Years)
- `Outcome`: Diabetes diagnosis (1 = yes, 0 = no)


## Question of Interest
Using this dataset, we aim to answer the question: What is the likelihood of a patient having diabetes based on key diagnostic factors?

The response variable for our analysis is the diabetes outcome (a binary variable indicating whether a patient has diabetes). The explanatory variables include medical predictors such as glucose levels, blood pressure, body mass index (BMI), age, insulin levels, and the diabetes pedigree function.

### How Data Will Help 
This dataset provides specific medical measurements for 768 patients, each labeled with a diabetes outcome. By fitting a logistic regression model to these measurements, we can assess how strongly each predictor is associated with the likelihood of diabetes. This enables us to construct a model that predicts the probability of diabetes for a new patient based on their diagnostic measurements.
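
For reference, one common way to write the logistic regression model described above is shown below. This is a sketch under the assumption that all eight predictors enter linearly with no interaction terms; the symbols $p_i$, $x_{ij}$, and $\beta_j$ are notation introduced here, not taken from the dataset.

$$
\log\frac{p_i}{1 - p_i} = \beta_0 + \sum_{j=1}^{8} \beta_j x_{ij},
\qquad
p_i = P(\text{Outcome}_i = 1 \mid x_{i1}, \dots, x_{i8}) = \frac{1}{1 + \exp\!\left(-\beta_0 - \sum_{j=1}^{8} \beta_j x_{ij}\right)}
$$

Here $x_{ij}$ is the value of predictor $j$ for patient $i$, and each $\beta_j$ is the change in the log-odds of diabetes per one-unit increase in that predictor, holding the others fixed.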

### Focus: Prediction and Inference
- *Predict* the probability that a new patient has diabetes, based on the explanatory variables.
- *Infer* which variables are the most significant predictors of diabetes, providing insights into how specific health factors relate to diabetes risk.
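
The two goals above can be sketched with base R's `glm()`. The example below is a minimal illustration, not the analysis itself: it runs on simulated stand-in data (the data frame `toy`, its coefficients, and the hypothetical `new_patient` are all assumptions made so the code runs without `diabetes.csv`), using only two of the dataset's column names.

```r
# Simulated stand-in data (NOT the real diabetes.csv), so the sketch runs on its own.
# Column names mirror two of the dataset's predictors; the coefficients are arbitrary.
set.seed(42)
n <- 300
toy <- data.frame(
  Glucose = rnorm(n, mean = 120, sd = 30),
  BMI     = rnorm(n, mean = 32, sd = 7)
)
toy$Outcome <- rbinom(n, size = 1,
                      prob = plogis(-8 + 0.04 * toy$Glucose + 0.08 * toy$BMI))

# Fit a logistic regression (binomial GLM with the default logit link).
model <- glm(Outcome ~ Glucose + BMI, data = toy, family = binomial)

# Predict: estimated probability of diabetes for a hypothetical new patient.
new_patient <- data.frame(Glucose = 150, BMI = 35)
p_hat <- predict(model, newdata = new_patient, type = "response")

# Infer: coefficient table (estimate, std. error, z value, p-value) and odds ratios.
coef_table  <- summary(model)$coefficients
odds_ratios <- exp(coef(model))

print(p_hat)
print(coef_table)
print(odds_ratios)
```

With the real data, the same calls would apply after replacing `toy` with the loaded data frame and extending the formula to the predictors of interest (for example `Outcome ~ .` to use all of them).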

In [2]:
library(readr)

# Load the dataset from a local CSV file
path <- "./diabetes.csv"
diabetes <- read_csv(path)
head(diabetes)   # preview the first six rows
nrow(diabetes)   # number of observations

Rows: 768 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (9): Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, D...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


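| Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` |
| 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| 5 | 116 | 74 | 0 | 0 | 25.6 | 0.201 | 30 | 0 |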
