Instructions

Step 1: Data Exploration with Pandas

Load the Dataset:
 
Use Pandas to load the dataset into a DataFrame.
Ensure the dataset loads properly by checking the first few rows with head().
 
General Information:
 
Display the dataset's structure, including column names, data types, and memory usage.
Identify the number of missing values or zeros in the dataset.
 
Descriptive Analysis:

 
Use the describe() function to analyze:
 
Summary statistics for each column (mean, min, max, quartiles).
Look for irregularities, such as columns with unrealistic minimum or maximum values.
 
Step 2: Data Exploration with ydata-profiling

Generate a Profiling Report:
 
Use ydata-profiling to create an interactive report that includes:
 
Column descriptions (type, unique values, missing values).
Distributions for numerical columns.
Correlation matrices to identify relationships between variables.
Highlighted outliers or anomalies.
 
Analyze the Report:
 
Identify missing values in key columns such as Glucose, Insulin, and BMI.
Examine correlations between columns like Age, Glucose, and Outcome.
Note any interesting insights or patterns (e.g., higher glucose levels correlated with diabetes diagnosis).

Step 3: Summary

Document Findings:
 
Write a summary of key observations from both Pandas and the ydata-profiling report.
Mention:
 
Patterns or trends in glucose, BMI, or pregnancies.
Any notable correlations between variables.
Issues such as missing or zero values in critical columns.
 
Suggestions:
 
Recommend next steps, such as handling missing values, addressing outliers, or exploring predictive modeling with the data.


In [4]:
import pandas as pd

In [5]:
df = pd.read_csv(r"C:\Users\USER\Desktop\Gomycode\Python_class\Checkpoints\diabetes (1).csv")
df.head()

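| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |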


In [6]:
df.shape

(768, 9)

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB


In [8]:
df.columns

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')

In [9]:
df.dtypes

Pregnancies                   int64
Glucose                       int64
BloodPressure                 int64
SkinThickness                 int64
Insulin                       int64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                       int64
dtype: object

In [10]:
df.isnull().sum()

Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64
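
The summary further down also reports that there are no duplicate rows, which none of the cells here check directly. A minimal sketch of that check, assuming the five rows printed by `df.head()` as a stand-in for the full file (the name `sample` is illustrative, not from the notebook):

```python
import pandas as pd

# The five rows shown by df.head() above, used as a small stand-in dataset.
sample = pd.DataFrame({
    "Pregnancies": [6, 1, 8, 1, 0],
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
    "DiabetesPedigreeFunction": [0.627, 0.351, 0.672, 0.167, 2.288],
    "Age": [50, 31, 32, 21, 33],
    "Outcome": [1, 0, 1, 0, 1],
})

# Total count of NaN cells across all columns.
n_missing = int(sample.isnull().sum().sum())

# Number of rows that are exact copies of an earlier row.
n_duplicates = int(sample.duplicated().sum())

print(f"missing cells: {n_missing}, duplicate rows: {n_duplicates}")
```

On the real DataFrame the same two expressions would be applied to `df` instead of `sample`.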

In [11]:
df.describe()

| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 | 768.0 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.88416 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.078 | 21.0 | 0.0 |
| 25% | 1.0 | 99.0 | 62.0 | 0.0 | 0.0 | 27.3 | 0.24375 | 24.0 | 0.0 |
| 50% | 3.0 | 117.0 | 72.0 | 23.0 | 30.5 | 32.0 | 0.3725 | 29.0 | 0.0 |
| 75% | 6.0 | 140.25 | 80.0 | 32.0 | 127.25 | 36.6 | 0.62625 | 41.0 | 1.0 |
| max | 17.0 | 199.0 | 122.0 | 99.0 | 846.0 | 67.1 | 2.42 | 81.0 | 1.0 |


In [54]:
import ydata_profiling

In [56]:
ydata_profiling.ProfileReport(df)

Summarize dataset:   0%|          | 0/5 [00:01<?, ?it/s]

Generate report structure:   0%|          | 0/1 [00:00<?, ?it/s]

Render HTML:   0%|          | 0/1 [00:00<?, ?it/s]
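
Displaying the report inline works inside the notebook, but saving it to an HTML file makes it easier to share. A hedged sketch of one way to do that; the helper name `build_profile`, the report title, and the output file name are assumptions for illustration, and the `ydata-profiling` package must be installed for the function body to run:

```python
import pandas as pd


def build_profile(df: pd.DataFrame, html_path: str = "diabetes_profile.html"):
    """Build a ydata-profiling report for df and write it to html_path."""
    # Imported here so that this cell can be defined even when
    # ydata-profiling is not installed; the call itself still needs it.
    from ydata_profiling import ProfileReport

    profile = ProfileReport(df, title="Diabetes Dataset Profile")
    profile.to_file(html_path)  # writes a standalone HTML report
    return profile
```

In the notebook this would be called as `build_profile(df)`, after which the HTML file can be opened in any browser.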



In [67]:
(df['BMI'] == 0).sum()

11

In [50]:
(df['Glucose'] == 0).sum()

5

In [52]:
(df['Insulin'] == 0).sum()

374
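
The three cells above count zeros one column at a time. The same check can be run across several columns at once; the sketch below assumes the five rows from `df.head()` as sample data (the names `sample`, `zero_counts`, and `zero_share` are illustrative) and also reproduces the percentage implied by the Insulin count of 374 out of 768 rows.

```python
import pandas as pd

# Columns where a zero is physiologically implausible, taken from df.head().
sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Comparing the whole frame to 0 gives a boolean frame; summing counts the
# zeros per column and the mean gives their share of rows.
zero_counts = (sample == 0).sum()
zero_share = (sample == 0).mean() * 100

summary = pd.DataFrame({"zeros": zero_counts, "percent": zero_share.round(1)})
print(summary)

# Share of zero Insulin values in the full dataset, from the counts above.
insulin_zero_pct = round(374 / 768 * 100, 1)
print(insulin_zero_pct)
```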

## Summary Report and Findings

### Dataset Overview
The dataset consists of 768 entries and 9 columns: 8 numerical predictor variables and 1 binary target variable (Outcome, stored as int64). This analysis was conducted using pandas for data exploration and ydata-profiling for deeper insights.

### Key Findings from Pandas Data Exploration

#### Data Integrity:

- There are no missing values or duplicate rows in the dataset.
- However, the dataset contains a significant number of zero values across various numerical columns, which raises concerns about the validity of certain measurements.
  
#### Unrealistic Minimum Values:

- Glucose: A glucose level of 0 mg/dL is not compatible with life, yet 5 rows have this value. Normal fasting blood glucose levels typically range from 70 to 100 mg/dL.
- Blood Pressure: A diastolic blood pressure level of 0 mmHg is not possible. Normal diastolic values typically range from 60 to 80 mmHg.
- Skin Thickness: A skin thickness measurement of 0 mm is unrealistic. Skin thickness can vary based on factors such as age and health conditions.
- Insulin: While a very low insulin level is possible, particularly in conditions like type 1 diabetes, 48.7% of the rows (374 of 768) have an insulin value of zero, which suggests that zero is standing in for a missing measurement. The maximum of 846 is also far above the 75th percentile of 127.25, which points to outliers.
- BMI: BMI is calculated as $\text{weight (kg)} / \text{height (m)}^2$, so a BMI of 0 is not feasible: weight cannot be zero in a living person. Eleven rows have a BMI of 0.
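
Because these zeros behave like missing measurements, one common way to make that explicit is to convert them to `NaN` so that `isnull()` reports them. This is a sketch of that idea rather than a step the notebook performs; it assumes the five rows from `df.head()` as sample data, and the names `cols` and `cleaned` are illustrative.

```python
import numpy as np
import pandas as pd

sample = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BloodPressure": [72, 66, 64, 66, 40],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
})

# Columns in which a zero cannot be a real measurement.
cols = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]

# Work on a copy so the raw data stays untouched, then mark zeros as missing.
cleaned = sample.copy()
cleaned[cols] = cleaned[cols].replace(0, np.nan)

# isnull() now counts the former zeros as missing values.
missing = cleaned[cols].isnull().sum()
print(missing)
```

After this conversion, `describe()` ignores the missing entries, so the minimum of each column reflects the smallest real measurement instead of 0.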
  
### Insights from ydata-profiling

- Pregnancies: The distribution is right-skewed: the frequency falls as the number of pregnancies rises, so most individuals in the dataset have had few pregnancies (median 3, maximum 17).
  
- Glucose Levels: The distribution of glucose levels is roughly hill-shaped, with most individuals in the middle of the range; the central 50% of values lie between 99 and 140.25 mg/dL.

- Insulin Levels: Among the non-zero values, insulin shows a similar hill shape, but the overall distribution is dominated by the spike at zero and has a long right tail (median 30.5, maximum 846).
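
To put a number on how extreme the largest insulin value is, the quartiles from `df.describe()` can be plugged into the 1.5 × IQR rule. That rule is an assumption chosen here for illustration; the profiling report may flag outliers differently.

```python
# Quartiles and maximum for Insulin, copied from the df.describe() output.
q1 = 0.0       # 25th percentile
q3 = 127.25    # 75th percentile
max_insulin = 846.0

# Interquartile range and the conventional upper fence at Q3 + 1.5 * IQR.
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

# Values above the fence are treated as outliers under this rule.
is_outlier = max_insulin > upper_fence

print(f"IQR = {iqr}, upper fence = {upper_fence}, max is outlier: {is_outlier}")
```

The fence comes out at 318.125, so the maximum of 846 lies well beyond it. Note that the first quartile is 0 only because of the zero entries, so the fence would shift if those were treated as missing.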

### Correlation Analysis:

- Age vs. Insulin: The correlation chart indicates that most individuals in their 20s and 30s have an insulin level around 150, while those in their 50s and 60s have higher average insulin levels, around 300-350.
- Insulin vs. Outcome: The correlation heatmap shows only a weak correlation between insulin levels and the outcome variable.

#### Strong Correlations:

- Age and Pregnancies: A strong positive correlation exists, suggesting that as age increases, the number of pregnancies tends to increase.
- Insulin and Skin Thickness: A strong correlation indicates that higher insulin levels may be associated with increased skin thickness.
- Glucose and Outcome: A strong correlation suggests that higher glucose levels are associated with a positive diabetes outcome.
- BMI and Skin Thickness: A strong correlation indicates that higher BMI is associated with increased skin thickness.
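
The relationships listed above can also be checked numerically with pandas instead of being read off the heatmap. The sketch below uses Pearson correlation on the five rows from `df.head()`, purely to show the mechanics; five rows are far too few to support any conclusion, and the names `sample`, `corr`, and `outcome_corr` are illustrative.

```python
import pandas as pd

sample = pd.DataFrame({
    "Pregnancies": [6, 1, 8, 1, 0],
    "Glucose": [148, 85, 183, 89, 137],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
    "Age": [50, 31, 32, 21, 33],
    "Outcome": [1, 0, 1, 0, 1],
})

# Pairwise Pearson correlation between all numeric columns.
corr = sample.corr(method="pearson")

# Correlation of every feature with the target, strongest first.
outcome_corr = corr["Outcome"].drop("Outcome").sort_values(ascending=False)

print(corr.round(2))
print(outcome_corr.round(2))
```

On the full DataFrame, `df.corr()` returns the same kind of matrix, and the sign of each entry shows the direction of the relationship.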
  
### Conclusion

The presence of unrealistic minimum values, particularly for glucose, blood pressure, skin thickness, insulin, and BMI, shows that the data need cleaning and validation before further analysis. The trends observed in pregnancies, glucose, and insulin levels provide valuable information for understanding the population's health status. Furthermore, the correlation analysis highlights relationships that could inform future research and health interventions.

### Recommendations
Data Cleaning: Address the unrealistic zero values in Glucose, BloodPressure, SkinThickness, and BMI, and the high percentage of zeros in Insulin. I suggest we treat these zeros as missing and replace the Insulin zeros with a central value computed from the non-zero entries only, such as their mode or median. The mode of the full column cannot be used because it is 0 itself (374 of 768 rows), and the mean is a poor choice because of the outliers noted above.
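
A minimal sketch of the kind of imputation suggested for Insulin, assuming the five Insulin values from `df.head()` as sample data and the median of the non-zero entries as the replacement value (the median is an illustrative choice; the variable names are hypothetical):

```python
import numpy as np
import pandas as pd

# Insulin values from the first five rows of the dataset.
insulin = pd.Series([0, 0, 0, 94, 168], name="Insulin")

# The most frequent value of the raw column is 0, so filling with the
# raw mode would change nothing.
raw_mode = insulin.mode().iloc[0]

# Mark zeros as missing, then compute the median of what remains.
as_missing = insulin.replace(0, np.nan)
median_nonzero = as_missing.median()

# Fill the missing entries with that median.
imputed = as_missing.fillna(median_nonzero)

print(raw_mode, median_nonzero)
print(imputed.tolist())
```

With nearly half of the column affected, any single fill value will flatten the distribution, so it may be worth comparing results against a version of the analysis that leaves Insulin out.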
