# Introduction to Machine Learning and Data Science: Exploratory Data Analysis (EDA) with the Pima Indians Diabetes Dataset

This notebook contains exercises for an introductory course on Machine Learning and Data Science, focusing on Exploratory Data Analysis (EDA) using the Pima Indians Diabetes dataset.

We will cover essential steps in understanding, cleaning, transforming, and analyzing the data using pandas and matplotlib/seaborn. The exercises are divided into basic, intermediate, and advanced levels.

**Estimated Time:** 2 Hours

## Setup

First, we import the necessary libraries and load the dataset. The CSV file has no header row, so we pass the column names explicitly, following the names commonly used for this dataset:

1. Pregnancies
2. Glucose
3. BloodPressure
4. SkinThickness
5. Insulin
6. BMI
7. DiabetesPedigreeFunction
8. Age
9. Outcome (0 for no diabetes, 1 for diabetes)

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# URL for the Pima Indians Diabetes dataset
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
names = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']

# Load the dataset
diabetes_df = pd.read_csv(url, names=names)

# Display the first few rows to confirm loading
print(diabetes_df.head())

## Basic Exercises (Approx. 45-60 minutes)

These exercises focus on fundamental data exploration and manipulation using pandas and basic visualization.

### Exercise 1: Initial Data Inspection

1. Display the last 7 rows of the `diabetes_df` DataFrame.
2. Get a concise summary of the DataFrame using `.info()`. How many entries are there? What are the data types?
3. Generate a statistical summary of the numerical attributes using `.describe()`. Pay attention to the minimum values for columns like `Glucose`, `BloodPressure`, `SkinThickness`, `Insulin`, and `BMI`. Do you notice anything unusual?
4. Examine the unique values and their counts for the `Outcome` column using `.value_counts()`. Is the dataset balanced in terms of the outcome variable?
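
As a warm-up, here is a minimal sketch of the four inspection calls on a small hand-typed `toy_df` (illustrative values only, not the full dataset). The variable names are assumptions made for this example; adapt the calls to `diabetes_df` for the actual exercise.

```python
import pandas as pd

# Hand-typed illustrative rows; NOT the full dataset
toy_df = pd.DataFrame({
    "Glucose": [148, 85, 0, 89, 137],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
    "Outcome": [1, 0, 1, 0, 1],
})

last_rows = toy_df.tail(2)      # last 2 rows of the frame
toy_df.info()                   # prints row count, dtypes, non-null counts
summary = toy_df.describe()     # count, mean, std, min, quartiles, max
outcome_counts = toy_df["Outcome"].value_counts()  # rows per class

print(last_rows)
print(summary.loc["min"])       # a minimum of 0 for Glucose looks suspicious
print(outcome_counts)
```

Looking at the `min` row of `.describe()` is a quick way to spot values that cannot be real measurements.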

In [None]:
# Your code for Exercise 1.1


In [None]:
# Your code for Exercise 1.2


In [None]:
# Your code for Exercise 1.3


In [None]:
# Your code for Exercise 1.4


### Exercise 2: Basic Data Filtering and Selection

1. Select and display only the `Age` and `BMI` columns for the first 10 patients.
2. Filter the DataFrame to show only the patients who tested positive for diabetes (`Outcome` is 1).
3. Calculate the average `Glucose` level for patients who tested negative for diabetes (`Outcome` is 0).
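
The three selection patterns above can be sketched as follows, again on a hand-typed `toy_df` with illustrative values (all names here are assumptions for the example, not part of the exercise data).

```python
import pandas as pd

# Hand-typed illustrative rows; NOT the full dataset
toy_df = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
    "Age": [50, 31, 32, 21, 33],
    "Outcome": [1, 0, 1, 0, 1],
})

# Column selection with a list of names, then the first 3 rows
age_bmi_head = toy_df[["Age", "BMI"]].head(3)

# Boolean mask: keep only rows where Outcome equals 1
positives = toy_df[toy_df["Outcome"] == 1]

# Row filter and column selection in one .loc call, then aggregate
mean_glucose_negative = toy_df.loc[toy_df["Outcome"] == 0, "Glucose"].mean()

print(age_bmi_head)
print(positives)
print(mean_glucose_negative)
```

Using `.loc[mask, column]` keeps the filter and the column choice in a single step, which avoids chained indexing.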

In [None]:
# Your code for Exercise 2.1


In [None]:
# Your code for Exercise 2.2


In [None]:
# Your code for Exercise 2.3


### Exercise 3: Simple Visualization

1. Create a histogram for the `Age` column to visualize the age distribution of patients.
2. Generate a histogram for the `BMI` column.
3. Use seaborn's `countplot` to visualize the distribution of the `Outcome` variable (number of diabetic vs non-diabetic patients).
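
A possible shape for these plots, sketched on a hand-typed `toy_df` (illustrative values only). The figure size, bin count, and titles are arbitrary choices for this example.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hand-typed illustrative rows; NOT the full dataset
toy_df = pd.DataFrame({
    "Age": [50, 31, 32, 21, 33, 30, 26, 29],
    "Outcome": [1, 0, 1, 0, 1, 0, 1, 0],
})

# Two axes side by side: histogram on the left, count plot on the right
fig, (ax_hist, ax_count) = plt.subplots(1, 2, figsize=(10, 4))

ax_hist.hist(toy_df["Age"], bins=5, edgecolor="black")
ax_hist.set_xlabel("Age")
ax_hist.set_ylabel("Number of patients")
ax_hist.set_title("Age distribution (toy data)")

# countplot draws one bar per distinct Outcome value
sns.countplot(data=toy_df, x="Outcome", ax=ax_count)
ax_count.set_title("Outcome counts (toy data)")

fig.tight_layout()
# With the inline backend, Jupyter renders the figure at the end of the cell
```

Labeling the axes and adding a title makes the plots readable without the surrounding code.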

In [None]:
# Your code for Exercise 3.1


In [None]:
# Your code for Exercise 3.2


In [None]:
# Your code for Exercise 3.3


## Intermediate Exercises (Approx. 45-60 minutes)

These exercises delve into data cleaning, manipulation, and basic feature engineering.

### Exercise 4: Handling Missing Data (Implicit Missing Values)

As observed in Exercise 1.3, some columns have a minimum value of 0, which most likely encodes a missing measurement rather than a real reading (e.g., a blood pressure or BMI of 0 is biologically impossible).

1. For the `Glucose`, `BloodPressure`, `SkinThickness`, `Insulin`, and `BMI` columns, replace the 0 values with `NaN` (Not a Number) to explicitly mark them as missing.
2. Verify the number of missing values in these columns after the replacement using `.isnull().sum()` (or by comparing the non-null counts reported by `.info()` with the total number of rows). Which column has the most missing values now?
3. Choose an imputation strategy for the missing numerical values (e.g., replace with the mean or median of the respective column). Justify your choice.
4. Apply the chosen imputation strategy to fill the `NaN` values in the affected columns. Verify that there are no more missing values in these columns.
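
The replace, count, and impute steps can be sketched as below on a hand-typed `toy_df` (illustrative values only). Median imputation is used here purely as one possible choice, on the assumption that the columns are skewed and the median is less sensitive to extreme values than the mean; the exercise leaves the decision to you.

```python
import numpy as np
import pandas as pd

# Hand-typed illustrative rows; NOT the full dataset
toy_df = pd.DataFrame({
    "Glucose": [148, 0, 183, 89, 137],
    "Insulin": [0, 0, 94, 168, 0],
    "Pregnancies": [6, 1, 0, 1, 0],  # 0 is a valid value here, so it is left alone
})

cols_with_hidden_missing = ["Glucose", "Insulin"]

# Step 1: turn the placeholder zeros into explicit NaN values
toy_df[cols_with_hidden_missing] = toy_df[cols_with_hidden_missing].replace(0, np.nan)

# Step 2: count missing values per column
missing_counts = toy_df[cols_with_hidden_missing].isnull().sum()
print(missing_counts)

# Steps 3-4: fill each column with its own median (computed ignoring NaN)
medians = toy_df[cols_with_hidden_missing].median()
toy_df[cols_with_hidden_missing] = toy_df[cols_with_hidden_missing].fillna(medians)

print(toy_df)
```

Restricting the replacement to a named list of columns matters: applying it to the whole DataFrame would also wipe out legitimate zeros in `Pregnancies` and `Outcome`.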

In [None]:
# Your code for Exercise 4.1


In [None]:
# Your code for Exercise 4.2


In [None]:
# Your code for Exercise 4.3 (Add comments for justification)


In [None]:
# Your code for Exercise 4.4


### Exercise 5: Creating New Features (Feature Engineering)

Create a new categorical feature based on the `Age` column.

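1. Define age bins (e.g., '0-20', '21-30', '31-40', '41-50', '51+'). The youngest patient in this dataset is 21, so a '0-20' bin will stay empty; keep it only if you want to show that explicitly.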
2. Create a new column called `AgeGroup` by categorizing the `Age` values into these bins.
3. Display the value counts for the new `AgeGroup` column.
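
One way to do the binning is `pd.cut`, sketched here on a hand-typed `Age` column (illustrative values only). The bin edges are an assumption chosen to match the example labels above; with the default `right=True`, each interval includes its upper edge, so an age of 30 falls in '21-30'.

```python
import pandas as pd

# Hand-typed illustrative ages; NOT the full dataset
toy_df = pd.DataFrame({"Age": [21, 25, 30, 31, 45, 50, 51, 67]})

# Intervals are (0, 20], (20, 30], (30, 40], (40, 50], (50, inf)
bin_edges = [0, 20, 30, 40, 50, float("inf")]
bin_labels = ["0-20", "21-30", "31-40", "41-50", "51+"]

toy_df["AgeGroup"] = pd.cut(toy_df["Age"], bins=bin_edges, labels=bin_labels)

# sort_index() orders the counts by category rather than by frequency
age_group_counts = toy_df["AgeGroup"].value_counts().sort_index()
print(age_group_counts)
```

There must always be one more bin edge than there are labels, otherwise `pd.cut` raises an error.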

In [None]:
# Your code for Exercise 5.1, 5.2, and 5.3


### Exercise 6: Exploring Correlations

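1. Compute the correlation matrix for the numerical features in the DataFrame using `.corr()`. If the DataFrame still contains the categorical `AgeGroup` column from Exercise 5, pass `numeric_only=True` (pandas 1.5 or newer) or select the numeric columns first, since recent pandas versions can raise an error on non-numeric columns.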
2. Display the correlations of all numerical features with the `Outcome` variable, sorted in descending order. Which features have the strongest positive and negative correlations with diabetes outcome?
3. Use seaborn's `heatmap` to visualize the correlation matrix. Interpret the relationships between different features.
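
The first two steps can be sketched as follows on a hand-typed `toy_df` (illustrative values only, so the numbers say nothing about the real data). The example assumes pandas 1.5 or newer for the `numeric_only` argument.

```python
import pandas as pd

# Hand-typed illustrative rows; NOT the full dataset
toy_df = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137, 116],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1, 25.6],
    "Outcome": [1, 0, 1, 0, 1, 0],
})

# Pairwise Pearson correlations between all numeric columns
corr_matrix = toy_df.corr(numeric_only=True)

# Take the Outcome column, drop its trivial self-correlation of 1.0,
# and sort from most positive to most negative
outcome_corr = corr_matrix["Outcome"].drop("Outcome").sort_values(ascending=False)

print(corr_matrix.round(2))
print(outcome_corr.round(2))
```

The resulting `corr_matrix` is the object you would pass to `sns.heatmap` in step 3.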

In [None]:
# Your code for Exercise 6.1


In [None]:
# Your code for Exercise 6.2


In [None]:
# Your code for Exercise 6.3


## Advanced Exercises (Approx. 30-45 minutes, potentially carrying over into self-study)

These exercises involve more complex data transformations and analysis.

### Exercise 7: Analyzing Relationships with Outcome

Use visualizations to explore the relationship between different features and the `Outcome` (diabetes positive or negative).

1. Create a violin plot using seaborn to visualize the distribution of `Age` for both `Outcome` groups (0 and 1).
2. Create a scatter plot of `Glucose` vs `BMI`, coloring the points based on the `Outcome` variable.
3. Use seaborn's `boxplot` to compare the distribution of `BloodPressure` across different `AgeGroup` categories (created in Exercise 5).
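
A rough sketch of the first two plots on a hand-typed `toy_df` (illustrative values only). The layout and figure size are arbitrary choices for this example; the box plot in step 3 follows the same `data=`, `x=`, `y=` pattern.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hand-typed illustrative rows; NOT the full dataset
toy_df = pd.DataFrame({
    "Age": [50, 31, 32, 21, 33, 30, 26, 29],
    "Glucose": [148, 85, 183, 89, 137, 116, 78, 115],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.3],
    "Outcome": [1, 0, 1, 0, 1, 0, 1, 0],
})

fig, (ax_violin, ax_scatter) = plt.subplots(1, 2, figsize=(11, 4))

# One violin per Outcome value, showing the Age distribution in each group
sns.violinplot(data=toy_df, x="Outcome", y="Age", ax=ax_violin)

# hue colors each point by its Outcome value and adds a legend
sns.scatterplot(data=toy_df, x="Glucose", y="BMI", hue="Outcome", ax=ax_scatter)

fig.tight_layout()
# With the inline backend, Jupyter renders the figure at the end of the cell
```

Passing column names through `x`, `y`, and `hue` lets seaborn label the axes and legend automatically.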

In [None]:
# Your code for Exercise 7.1


In [None]:
# Your code for Exercise 7.2


In [None]:
# Your code for Exercise 7.3


### Exercise 8: Outlier Identification (Visualization Based)

Use box plots to visually identify potential outliers in some of the key numerical features.

1. Create box plots for `Glucose`, `BloodPressure`, `Insulin`, and `BMI`.
2. Discuss what outliers might represent in this medical dataset and how they could potentially affect machine learning models.
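
To connect the box plots to numbers: by default, matplotlib and seaborn draw whiskers out to 1.5 times the interquartile range (IQR) beyond the quartiles and show anything further out as individual points. The sketch below applies that same 1.5 × IQR rule by hand to a hand-typed series (illustrative values only); quartile conventions can differ slightly between libraries, so treat the fences as approximate.

```python
import pandas as pd

# Hand-typed illustrative values; NOT the full dataset
insulin = pd.Series([94, 168, 88, 543, 846, 175, 230, 83, 96, 235], name="Insulin")

q1 = insulin.quantile(0.25)
q3 = insulin.quantile(0.75)
iqr = q3 - q1

# Fences of the 1.5 * IQR rule
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are the points a box plot would draw separately
flagged = insulin[(insulin < lower_fence) | (insulin > upper_fence)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}")
print(f"fences: [{lower_fence}, {upper_fence}]")
print(flagged)
```

A flagged value is only a candidate: in a medical dataset it may be a measurement error or a genuinely extreme but real reading, which is exactly what step 2 asks you to discuss.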

In [None]:
# Your code for Exercise 8.1


### Exercise 9: (Optional Stretch) Further Feature Engineering or Analysis

Choose one of the following (or come up with your own idea):

1. Create a new feature that combines `BMI` and `Age` based on common health knowledge (e.g., BMI categories for different age groups).
2. Investigate the distribution of `DiabetesPedigreeFunction` and its relationship with the `Outcome`.
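
For the second option, a grouped summary is one possible starting point before plotting. The sketch below uses a hand-typed `toy_df` (illustrative values only), and the choice of statistics is an assumption for this example.

```python
import pandas as pd

# Hand-typed illustrative rows; NOT the full dataset
toy_df = pd.DataFrame({
    "DiabetesPedigreeFunction": [0.627, 0.351, 0.672, 0.167, 2.288, 0.201],
    "Outcome": [1, 0, 1, 0, 1, 0],
})

# Summary statistics of the feature, computed separately per Outcome group
dpf_by_outcome = (
    toy_df.groupby("Outcome")["DiabetesPedigreeFunction"]
    .agg(["count", "mean", "median"])
)

print(dpf_by_outcome)
```

Comparing the mean and the median within each group gives a first hint of skew, which a histogram or violin plot can then confirm.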

In [None]:
# Your code or notes for Exercise 9
