# Exploring Health Risk Factors in Simulated Patient Data
**Author:** Allie Skinner  
**Description:** An exploratory data analysis (EDA) of patient demographics, BMI, blood pressure, and diabetes risk using a fictional public health dataset.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
df = pd.read_csv('sample_public_health_data.csv')
df.head()

## Basic Data Overview

In [None]:
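df.info()  # prints its summary directly

# A notebook cell only displays its last expression, so print each result explicitly
print(df.describe())
for col in ['Sex', 'Smoker', 'Has_Diabetes']:
    print(df[col].value_counts(), end='\n\n')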

## Visual Explorations

In [None]:
# Distribution of BMI
sns.histplot(df['BMI'], kde=True)
plt.title('Distribution of BMI')
plt.show()

In [None]:
# Diabetes prevalence by sex
sns.countplot(x='Sex', hue='Has_Diabetes', data=df)
plt.title('Diabetes by Sex')
plt.show()

In [None]:
# Smoking vs diabetes
sns.countplot(x='Smoker', hue='Has_Diabetes', data=df)
plt.title('Diabetes by Smoking Status')
plt.show()

In [None]:
# Correlation heatmap
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title('Correlation Between Numeric Features')
plt.show()
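
The heatmap can be hard to read when the question is simply which features track the outcome most closely. A minimal sketch of one way to rank them follows. It assumes `Has_Diabetes` is stored as a numeric 0/1 column (the source does not state its encoding), and it uses a small hypothetical `toy` frame with made-up values in place of the notebook's `df`.

```python
import pandas as pd

# Hypothetical stand-in rows with made-up values; in the notebook, use `df` instead.
toy = pd.DataFrame({
    'Age': [25, 40, 55, 60, 35, 70],
    'BMI': [22.0, 27.5, 31.0, 29.0, 24.0, 33.5],
    'Has_Diabetes': [0, 0, 1, 1, 0, 1],  # assumed 0/1 encoding
})

# Take the outcome's column from the correlation matrix, drop its
# self-correlation (always 1.0), and sort from strongest positive down.
target_corr = (
    toy.corr(numeric_only=True)['Has_Diabetes']
    .drop('Has_Diabetes')
    .sort_values(ascending=False)
)
print(target_corr)
```

If `Has_Diabetes` is stored as text (for example `'Yes'`/`'No'`), `numeric_only=True` would drop it from the matrix, so it would need to be mapped to 0/1 first.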

## Key Takeaways
- Most patients are non-smokers, but smokers appear to have a slightly higher diabetes rate than non-smokers.  
- BMI and age are both positively correlated with diabetes status in this dataset.  
- Diabetes prevalence does not differ much by sex in this small dataset.
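
The count plots show raw counts, so group sizes can hide or exaggerate differences in rates. One way to check the takeaways above is to compute the within-group proportions directly. The sketch below uses a hypothetical `toy` frame with arbitrary values, and the `'Yes'`/`'No'` and `'F'`/`'M'` labels are assumptions about the encoding, not something the dataset is known to use.

```python
import pandas as pd

# Hypothetical stand-in rows with arbitrary values; in the notebook, use `df` instead.
toy = pd.DataFrame({
    'Sex': ['F', 'M', 'F', 'M', 'F', 'M', 'F', 'M'],
    'Smoker': ['Yes', 'No', 'No', 'Yes', 'No', 'No', 'No', 'Yes'],
    'Has_Diabetes': [1, 0, 0, 1, 0, 1, 0, 0],
})

# normalize='index' turns each row of the table into proportions that sum
# to 1, so the Has_Diabetes == 1 column is the diabetes rate within each group.
rates_by = {
    col: pd.crosstab(toy[col], toy['Has_Diabetes'], normalize='index')
    for col in ['Smoker', 'Sex']
}

for col, table in rates_by.items():
    print(table.round(2), end='\n\n')
```

Reporting the group sizes alongside these proportions (for example with `toy['Smoker'].value_counts()`) helps judge how much weight a difference deserves in a small dataset.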

## Next Steps
- Build a logistic regression model to predict diabetes.  
- Explore interaction effects between variables (e.g., Age × Smoking).  
- Expand this dataset or find similar open health data for comparison.
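
As a starting point for the first item above, here is a minimal sketch of a logistic regression using scikit-learn, which is an assumed choice of library since the notebook does not name one. The data is synthetic and generated inside the block, the feature set (`Age`, `BMI`) is only an example, and the simulated relationship between features and outcome is illustrative rather than taken from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical); replace with the notebook's `df`.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    'Age': rng.integers(20, 80, size=n),
    'BMI': rng.normal(27, 4, size=n),
})
# Simulated outcome whose log-odds rise with age and BMI (illustrative only).
log_odds = 0.04 * (toy['Age'] - 50) + 0.15 * (toy['BMI'] - 27)
toy['Has_Diabetes'] = (rng.random(n) < 1 / (1 + np.exp(-log_odds))).astype(int)

X = toy[['Age', 'BMI']]
y = toy['Has_Diabetes']

# Hold out 25% of rows, keeping the class balance similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
coefs = pd.Series(model.coef_[0], index=X.columns)
print(f'Held-out accuracy: {accuracy:.2f}')
print(coefs)
```

Categorical columns such as `Sex` and `Smoker` would need to be encoded numerically (for example with `pd.get_dummies`) before being added to `X`, and an interaction term could be added as an extra column holding the product of two features.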