![diabetes](https://media.healthdirect.org.au/images/inline/original/type-2-diabetes-49a3ee.jpg)

<div style="border: 1px solid #4CAF50; padding: 10px; border-radius: 5px;">
    <h2 style="color: #4CAF50;">Pre-Class Assignment (PCA) Instructions</h2>
</div>



This assignment is due by **midnight on Wednesday**. The goal is to explore the diabetes dataset in preparation for the **In-Class Assignment (ICA)** on Thursday. 

### Instructions:
1. **Explore the Dataset**: Take some time to understand the data, the key features, and how the target variable (disease progression) behaves.
   
2. **Prepare for Role Playing**: Imagine you work at a company, and you’ve been tasked with analyzing and presenting this dataset to stakeholders. Consider:
   - **Who is your audience**? (e.g., older generation, Gen-Z, millennials, or a mixed audience)
   - **Invent a setting**: Is this for a drug company, a health education seminar, or another industry?
   - **What are the plot twists**? Which trends or surprising insights will make your story more engaging?
   - **Best visualizations**: Based on your audience and setting, what are the most effective ways to visualize the data? Which plots from the EDA stand out, or what additional visualizations might you create?
   - **Delivery**: How would you structure your presentation to convey the story to your audience clearly and persuasively?

3. **Read the ICA Instructions**: Make sure you review the instructions for the in-class assignment on Thursday to understand how this pre-work will fit into the larger activity. Those instructions are included with this PCA.

4. **Submit a Summary**: By Wednesday night, submit a short summary (half a page) of your observations and ideas. This should include:
   - Key insights from the dataset
   - Ideas for your role-playing scenario (audience, setting, visualizations)
   - Your name!

This summary will count toward your ICA grade. Put all of your answers into a markdown cell at the bottom of this notebook and turn that in. 

### A Brief Introduction to Diabetes:
Diabetes is a chronic condition that affects how your body turns food into energy. It occurs when the pancreas doesn't produce enough insulin or the body can’t use insulin effectively. Over time, high blood sugar levels can lead to serious health complications like heart disease, vision loss, and kidney disease. Understanding the factors that contribute to diabetes progression can help inform treatment plans and preventative measures.

In this dataset, you’ll be exploring health-related variables such as BMI, blood pressure, and glucose levels to understand their relationship with the progression of diabetes over time.
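
Before plotting, it can help to see what each column actually holds. The sketch below is one possible starting point, not part of the assignment code: the names `feature_descriptions` and `overview` are made up for illustration, and the descriptions are paraphrased from the dataset's own documentation (`data.DESCR`). It also shows that scikit-learn ships the features mean-centered and scaled, which is why BMI and blood pressure appear as small values around zero rather than in clinical units.

```python
import pandas as pd
from sklearn.datasets import load_diabetes

# as_frame=True returns pandas objects instead of NumPy arrays
data = load_diabetes(as_frame=True)
features = data.data
target = data.target

# Readable labels for the terse column names (taken from data.DESCR)
feature_descriptions = {
    "age": "age in years",
    "sex": "sex",
    "bmi": "body mass index",
    "bp": "average blood pressure",
    "s1": "tc, total serum cholesterol",
    "s2": "ldl, low-density lipoproteins",
    "s3": "hdl, high-density lipoproteins",
    "s4": "tch, total cholesterol / HDL",
    "s5": "ltg, possibly log of serum triglycerides level",
    "s6": "glu, blood sugar level",
}

# One row per feature: description plus the range of the scaled values
overview = pd.DataFrame({
    "description": pd.Series(feature_descriptions),
    "min": features.min(),
    "max": features.max(),
    "mean": features.mean(),
})

print(features.shape)       # (number of patients, number of features)
print(overview.round(3))
print(target.describe())    # disease progression one year after baseline
```

Because the features are scaled, axis values in the plots are best read as "below or above average" rather than as raw measurements.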


```python
# Import necessary libraries
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.figure_factory as ff
import plotly.graph_objs as go
from sklearn.datasets import load_diabetes

# Load the diabetes dataset and convert to a DataFrame
diabetes = load_diabetes()

diabetes_df = pd.DataFrame(data=diabetes.data, columns=diabetes.feature_names)
diabetes_df['target'] = diabetes.target

# Focus on a few key variables to keep the heatmap and pair plot readable
selected_features = ['age', 'bmi', 'bp', 's1', 'target']

# 1. Correlation Heatmap (Interactive)
# Round the coefficients so the cell annotations stay legible
correlation_matrix = np.round(diabetes_df[selected_features].corr().values, 2)
fig_heatmap = ff.create_annotated_heatmap(
    z=correlation_matrix,
    x=selected_features,
    y=selected_features,
    colorscale='Viridis'
)
fig_heatmap.update_layout(
    title="Correlation Heatmap (Interactive)",
    xaxis_title="Features",
    yaxis_title="Features"
)
fig_heatmap.show()

# 2. Interactive Scatterplot of BMI vs Target with Regression Line
# Note: trendline='ols' requires the statsmodels package to be installed
fig_scatter_bmi = px.scatter(diabetes_df, x='bmi', y='target', trendline='ols',
                             labels={'bmi':'BMI', 'target':'Disease Progression'},
                             title="Interactive Scatterplot of BMI vs Disease Progression")
fig_scatter_bmi.show()

# 3. Interactive Scatterplot of Blood Pressure vs Target with Regression Line
fig_scatter_bp = px.scatter(diabetes_df, x='bp', y='target', trendline='ols',
                            labels={'bp':'Blood Pressure', 'target':'Disease Progression'},
                            title="Interactive Scatterplot of Blood Pressure vs Disease Progression")
fig_scatter_bp.show()

# 4. Bin BMI and Blood Pressure for better categorical comparison in violin plots
bmi_intervals = pd.cut(diabetes_df['bmi'], bins=5)
bp_intervals = pd.cut(diabetes_df['bp'], bins=5)

# Record the bins in ascending order; otherwise Plotly draws the violins
# in order of first appearance in the data rather than from low to high
bmi_order = [str(interval) for interval in bmi_intervals.cat.categories]
bp_order = [str(interval) for interval in bp_intervals.cat.categories]

# Convert the binned intervals to strings for compatibility with Plotly
diabetes_df['bmi_binned'] = bmi_intervals.astype(str)
diabetes_df['bp_binned'] = bp_intervals.astype(str)

# 5. Interactive Violin Plot for Binned BMI vs Target
fig_violin_bmi = px.violin(diabetes_df, x='bmi_binned', y='target', box=True, points='all',
                           category_orders={'bmi_binned': bmi_order},
                           labels={'bmi_binned':'BMI Binned', 'target':'Disease Progression'},
                           title="Interactive Violin Plot of Binned BMI vs Disease Progression")
fig_violin_bmi.show()

# 6. Interactive Violin Plot for Binned Blood Pressure vs Target
fig_violin_bp = px.violin(diabetes_df, x='bp_binned', y='target', box=True, points='all',
                          category_orders={'bp_binned': bp_order},
                          labels={'bp_binned':'Blood Pressure Binned', 'target':'Disease Progression'},
                          title="Interactive Violin Plot of Binned Blood Pressure vs Disease Progression")
fig_violin_bp.show()

# 7. Pair Plot with Reduced Features (only numeric columns, not the binned strings)
numeric_features = selected_features

fig_splom = go.Figure(data=go.Splom(
    dimensions=[dict(label=col, values=diabetes_df[col]) for col in numeric_features],
    showupperhalf=False,  # Only show the lower half of the matrix
    diagonal_visible=False  # Hide diagonal subplots
))

fig_splom.update_layout(
    title="Scatter Plot Matrix of Selected Numeric Features",
    dragmode='select',
    width=800,
    height=800
)

fig_splom.show()
```
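
The plots above are easier to turn into a story when backed by a few numbers. The following is a minimal sketch, assuming only pandas and scikit-learn; the variable names `corr_with_target` and `summary` are illustrative and not part of the assignment code. It ranks all ten features by the strength of their Pearson correlation with the target (the heatmap covers only four of them) and gives a numeric counterpart to the BMI violin plot.

```python
import pandas as pd
from sklearn.datasets import load_diabetes

# data.frame holds the ten features plus a 'target' column
df = load_diabetes(as_frame=True).frame

# Pearson correlation of every feature with the target,
# ordered by absolute strength so negative relationships are not buried
corr_with_target = (
    df.corr()["target"]
    .drop("target")
    .sort_values(key=lambda s: s.abs(), ascending=False)
)
print(corr_with_target.round(2))

# Numeric counterpart of the BMI violin plot:
# how many patients fall in each BMI bin and how the target behaves there
bmi_bins = pd.cut(df["bmi"], bins=5)
summary = (
    df.groupby(bmi_bins, observed=True)["target"]
    .agg(["count", "median", "mean"])
)
print(summary.round(1))
```

The `count` column is worth checking before drawing conclusions from any bin: equal-width bins can leave very few patients at the extremes, so a dramatic-looking violin may rest on a handful of points.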

Next, carefully read the instructions for Thursday's ICA. Invent a few scenarios in advance so that you and your subgroup can quickly settle on one you all agree on.


Here are some hints, but you and your subgroup should be creative!

### Subgroup Scenarios for Data Storytelling

Below are five distinct scenarios that you will consider in your subgroup. Each scenario presents a unique audience and setting, requiring you to adapt your data narrative, tone, and visualizations accordingly. Your goal is to tailor your presentation to fit the needs and expectations of the audience described in your assigned scenario.

##### Scenario 1: **Insulin Production Company – Presenting to New Employees**
- **Audience**: New employees at an insulin production company who are mainly curious about diabetes data but have limited knowledge.
- **Objective**: Explain the relationship between health factors (BMI, glucose, etc.) and diabetes progression in a simple and informative way.
- **Tone**: Educational and accessible, focusing on the basics.
- **Challenge**: Keeping it engaging for a less technical audience while delivering key insights about the product (insulin) and its impact on health.

##### Scenario 2: **Soft Drink Company – Presenting to a Traditional CEO**
- **Audience**: The CEO of a major soft drink company, who is resistant to change and prefers things the way they’ve always been.
- **Objective**: Persuade the CEO to consider the impact of sugary drinks on health and how it might relate to diabetes, potentially advocating for product reformulation or marketing shifts.
- **Tone**: Convincing but respectful, using hard-hitting data to support your case.
- **Challenge**: Overcoming resistance and biases, especially when the data suggests a negative impact of sugary drinks on diabetes progression.

##### Scenario 3: **Health Education Nonprofit – Presenting to a Diverse Audience**
- **Audience**: A mixed group at a health education seminar, including young adults, parents, and senior citizens.
- **Objective**: Raise awareness about the risk factors for diabetes and the importance of lifestyle changes (diet, exercise) to prevent or manage the condition.
- **Tone**: Clear, engaging, and actionable, with an emphasis on how each demographic can take steps to improve their health.
- **Challenge**: Addressing a diverse audience with varying levels of health literacy and interest in the topic.

##### Scenario 4: **Medical Research Conference – Presenting to Experts**
- **Audience**: Doctors, researchers, and healthcare professionals attending a medical research conference.
- **Objective**: Present the data on diabetes progression in relation to key health metrics like BMI and glucose, aiming to spark discussion on new treatment approaches or clinical trials.
- **Tone**: Technical and data-driven, with a focus on cutting-edge research and clinical applications.
- **Challenge**: Providing enough depth to keep experts engaged while ensuring clarity and focus.

##### Scenario 5: **Fitness Company – Presenting to a Group of Fitness Trainers**
- **Audience**: Fitness trainers and health coaches at a company focused on fitness and wellness.
- **Objective**: Highlight the role of exercise, diet, and physical fitness in managing and preventing diabetes. Connect diabetes data to fitness practices.
- **Tone**: Motivational and informative, focusing on how trainers can use this information to support their clients.
- **Challenge**: Translating medical data into practical, fitness-related advice that trainers can apply in their day-to-day work.


**Answer**

**Name:** Ni, Zhiqiang

**Key Insights from the Dataset:**

The plots show that several features are correlated with disease progression. BMI has the strongest positive correlation with disease progression among the selected features, which means that a higher BMI is linked to greater disease progression. Blood pressure is also positively correlated with disease progression, but not as strongly as BMI. In the scatterplots, the BMI values range from about -0.1 to 0.15, while the blood pressure values range from about -0.1 to 0.1. The correlation heatmap also shows that the correlations for age tend to be smaller than those for the other features, which suggests that age is not a strong factor in disease progression.

**Role-Playing Scenario:**

My audience will be a mixed group of doctors, nurses, and people who want to be healthy. The setting will be a healthcare seminar organized by a medical company that wants to point out the main risk factors for diabetes. The example scatterplots for BMI and blood pressure are very useful for showing the relationship between these features and disease progression, and the correlation heatmap is useful for showing the relationships among multiple variables. The plots are also interactive, which lets the audience explore the data themselves. I think the BMI plot will be a surprising insight for the audience, because many people may not realize that BMI is such a strong factor in diabetes progression. I will structure my presentation by first introducing the dataset and the features it contains, then showing the plots and explaining the key takeaway from each one, and finally summarizing the main points and suggesting what people can do to reduce their risk of diabetes.

