
# Problem Set: Data Ethics in Healthcare Wearable Device Data Management
## Scenario

You are part of a team managing health and fitness wearable devices that track heart rate, steps, sleep patterns, and GPS location for users across different countries. The company aims to use this data for public health studies while ensuring ethical data handling throughout its lifecycle.

You will work with a synthetic dataset representing anonymized wearable device data along with user consent status.

## Dataset Setup (Colab‑Ready)

In [1]:
import pandas as pd
import numpy as np

np.random.seed(123)
n = 1000

consent_types = ["explicit", "implicit", "none"]
countries = ["USA", "UK", "Germany", "India", "Japan"]

data = pd.DataFrame({
    "Device_ID": range(1, n+1),
    "User_Age": np.random.randint(18, 80, n),
    "Country": np.random.choice(countries, n),
    "Consent_Type": np.random.choice(consent_types, n, p=[0.65, 0.25, 0.10]),
    "Average_HeartRate": np.random.randint(50, 160, n),
    "Daily_Steps": np.random.randint(1000, 20000, n),
    "Sleep_Hours": np.round(np.random.uniform(3, 10, n), 1),
    "GPS_Location_Share": np.random.choice([0, 1], n, p=[0.7, 0.3]),
    "Health_Alert": np.random.choice([0, 1], n, p=[0.85, 0.15])
})

display(data.head())

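|   | Device_ID | User_Age | Country | Consent_Type | Average_HeartRate | Daily_Steps | Sleep_Hours | GPS_Location_Share | Health_Alert |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 63 | Japan | implicit | 71 | 11067 | 3.4 | 0 | 0 |
| 1 | 2 | 20 | India | none | 57 | 2646 | 3.1 | 1 | 0 |
| 2 | 3 | 46 | Germany | explicit | 89 | 14118 | 9.3 | 0 | 1 |
| 3 | 4 | 52 | USA | explicit | 143 | 7137 | 8.7 | 1 | 0 |
| 4 | 5 | 56 | UK | explicit | 77 | 17093 | 7.6 | 0 | 0 |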


## Exercises

### 1. Data Collection, Privacy & Consent

**Task 1.1:** Filter out any records where consent is "none".
**Task 1.2:** Discuss how different consent methods affect the dataset size and representativeness.

In [5]:
# Task 1.1: Filter out any records where consent is "none".
data_filtered = data[data['Consent_Type'] != 'none']
display(data_filtered.head())
print(f"Original dataset size: {len(data)}")
print(f"Filtered dataset size: {len(data_filtered)}")

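|   | Device_ID | User_Age | Country | Consent_Type | Average_HeartRate | Daily_Steps | Sleep_Hours | GPS_Location_Share | Health_Alert |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 63 | Japan | implicit | 71 | 11067 | 3.4 | 0 | 0 |
| 2 | 3 | 46 | Germany | explicit | 89 | 14118 | 9.3 | 0 | 1 |
| 3 | 4 | 52 | USA | explicit | 143 | 7137 | 8.7 | 1 | 0 |
| 4 | 5 | 56 | UK | explicit | 77 | 17093 | 7.6 | 0 | 0 |
| 5 | 6 | 35 | Germany | explicit | 63 | 6836 | 3.1 | 0 | 0 |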


Original dataset size: 1000
Filtered dataset size: 902
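
To make Task 1.2 more concrete, one option is to compare the country mix before and after the filter. The sketch below rebuilds the same synthetic data so it runs on its own; the helper `country_share` and the column names `all_records`, `consented`, and `shift_pp` are illustrative names, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Same synthetic data as the setup cell (copied so this block runs on its own).
np.random.seed(123)
n = 1000
consent_types = ["explicit", "implicit", "none"]
countries = ["USA", "UK", "Germany", "India", "Japan"]
data = pd.DataFrame({
    "Device_ID": range(1, n + 1),
    "User_Age": np.random.randint(18, 80, n),
    "Country": np.random.choice(countries, n),
    "Consent_Type": np.random.choice(consent_types, n, p=[0.65, 0.25, 0.10]),
    "Average_HeartRate": np.random.randint(50, 160, n),
    "Daily_Steps": np.random.randint(1000, 20000, n),
    "Sleep_Hours": np.round(np.random.uniform(3, 10, n), 1),
    "GPS_Location_Share": np.random.choice([0, 1], n, p=[0.7, 0.3]),
    "Health_Alert": np.random.choice([0, 1], n, p=[0.85, 0.15]),
})

def country_share(df):
    """Percentage of records per country (illustrative helper)."""
    return df["Country"].value_counts(normalize=True).mul(100)

data_filtered = data[data["Consent_Type"] != "none"]

# Side-by-side country shares; shift_pp is the change in percentage points.
comparison = pd.DataFrame({
    "all_records": country_share(data),
    "consented": country_share(data_filtered),
})
comparison["shift_pp"] = comparison["consented"] - comparison["all_records"]
print(comparison.round(2))
```

Large shifts would suggest that the consent filter changes who is represented. Because consent is assigned at random in this synthetic data, any shifts seen here should be small and reflect chance only.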


### 2. Implicit vs Explicit Consent

**Task 2.1:** Count the proportion of explicit, implicit, and no consent.
**Task 2.2:** Explain risks of using implicitly collected wearable device data.

In [8]:
# Task 2.1: Count the proportion of explicit, implicit, and no consent.
consent_counts = data['Consent_Type'].value_counts(normalize=True) * 100
print("Proportion of each consent type:")
print(consent_counts)

Proportion of each consent type:
Consent_Type
explicit    63.4
implicit    26.8
none         9.8
Name: proportion, dtype: float64



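How each consent type affects the dataset (Tasks 1.2 and 2.2):

*   **Explicit consent:** Yields the smallest dataset (about 63% of records here) but the soundest ethical footing. It may still be unrepresentative, because users who opt in can differ systematically from those who do not.
*   **Implicit consent:** Adds volume (about 27% of records here), but users may not know what is collected or how it is used. For heart rate, sleep, and GPS data this risks violating user expectations, and some jurisdictions require explicit consent for processing health data.
*   **No consent:** Using these records (about 10% here) is unethical, so they must be excluded from analysis.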
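
The trade-offs above can be quantified with a short, hedged sketch: it tabulates the consent mix within each country and the dataset size under three possible consent policies. The policy labels and the variable names `consent_by_country` and `sizes` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Same synthetic data as the setup cell (copied so this block runs on its own).
np.random.seed(123)
n = 1000
consent_types = ["explicit", "implicit", "none"]
countries = ["USA", "UK", "Germany", "India", "Japan"]
data = pd.DataFrame({
    "Device_ID": range(1, n + 1),
    "User_Age": np.random.randint(18, 80, n),
    "Country": np.random.choice(countries, n),
    "Consent_Type": np.random.choice(consent_types, n, p=[0.65, 0.25, 0.10]),
    "Average_HeartRate": np.random.randint(50, 160, n),
    "Daily_Steps": np.random.randint(1000, 20000, n),
    "Sleep_Hours": np.round(np.random.uniform(3, 10, n), 1),
    "GPS_Location_Share": np.random.choice([0, 1], n, p=[0.7, 0.3]),
    "Health_Alert": np.random.choice([0, 1], n, p=[0.85, 0.15]),
})

# Consent mix within each country, as row percentages.
consent_by_country = pd.crosstab(
    data["Country"], data["Consent_Type"], normalize="index"
) * 100
print(consent_by_country.round(1))

# Number of usable records under three hypothetical consent policies.
sizes = pd.Series({
    "explicit only": int((data["Consent_Type"] == "explicit").sum()),
    "explicit + implicit": int((data["Consent_Type"] != "none").sum()),
    "all records": len(data),
})
print(sizes)
```

If implicit consent were concentrated in a few countries, an explicit-only policy would shrink those countries' share of the study population, which is one way a consent policy can affect representativeness.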


### 3. Bias in Data

**Task 3.1:** Check if Health_Alert rates differ significantly by Country.
**Task 3.2:** Discuss whether the difference is due to actual health conditions or uneven device adoption rates.

In [None]:
# Task 3.1: Check if Health_Alert rates differ significantly by Country.
# We can group by country and calculate the mean of 'Health_Alert' (since it's 0 or 1)
health_alert_by_country = data.groupby('Country')['Health_Alert'].mean() * 100
print("Percentage of Health Alerts by Country:")
print(health_alert_by_country)

Percentage of Health Alerts by Country:
Country
Germany     9.615385
India      13.461538
Japan      16.161616
UK         15.873016
USA        15.228426
Name: Health_Alert, dtype: float64


The Health_Alert rates above range from about 9.6% (Germany) to about 16.2% (Japan). Several explanations are possible, and this data alone cannot tell them apart:

*   **Actual Health Conditions:** It is plausible that there are genuine differences in the prevalence of health conditions that trigger alerts across different countries due to factors like lifestyle, genetics, healthcare access, or environmental factors. If the device accurately reflects these conditions, the data might be showing real variations in health.

*   **Uneven Device Adoption Rates and User Demographics:** This is a significant potential source of bias in wearable device data. Adoption rates can vary by country because of income levels, technological infrastructure, marketing efforts, and cultural attitudes towards wearable technology. The users who adopt these devices may also differ across countries in age, socioeconomic status, or tech-savviness, and those differences can correlate with the health conditions or activity levels that trigger alerts. For instance, if adopters in one country skew older than adopters elsewhere, that country could show a higher Health_Alert rate simply because health issues are more common among older users.

*   **Reporting Bias and Algorithm Differences:** There could also be variations in how users respond to or report health issues, or even subtle differences in how the device's algorithms trigger alerts based on regional variations or software versions.

**Conclusion:** In a real-world scenario, these potential biases should be investigated before drawing conclusions about actual health conditions. That means analyzing the demographics of device users in each country and, where possible, comparing the wearable data with other public health data sources. In this synthetic dataset, Country and Health_Alert are drawn independently and every record has the same 15% alert probability, so the differences between countries are sampling noise rather than a real effect.
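
Task 3.1 asks whether the rates differ *significantly*, which the group means alone do not answer. One common option, shown here as a sketch rather than the notebook's own method, is a chi-square test of independence on the Country by Health_Alert contingency table, using `scipy.stats.chi2_contingency`. SciPy is an added dependency that the original notebook does not import.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Same synthetic data as the setup cell (copied so this block runs on its own).
np.random.seed(123)
n = 1000
consent_types = ["explicit", "implicit", "none"]
countries = ["USA", "UK", "Germany", "India", "Japan"]
data = pd.DataFrame({
    "Device_ID": range(1, n + 1),
    "User_Age": np.random.randint(18, 80, n),
    "Country": np.random.choice(countries, n),
    "Consent_Type": np.random.choice(consent_types, n, p=[0.65, 0.25, 0.10]),
    "Average_HeartRate": np.random.randint(50, 160, n),
    "Daily_Steps": np.random.randint(1000, 20000, n),
    "Sleep_Hours": np.round(np.random.uniform(3, 10, n), 1),
    "GPS_Location_Share": np.random.choice([0, 1], n, p=[0.7, 0.3]),
    "Health_Alert": np.random.choice([0, 1], n, p=[0.85, 0.15]),
})

# Counts of alert / no-alert records per country (5 rows x 2 columns).
table = pd.crosstab(data["Country"], data["Health_Alert"])
print(table)

# Null hypothesis: the alert rate does not depend on the country.
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.3f}, dof = {dof}, p-value = {p_value:.3f}")
```

A p-value above the chosen threshold (0.05 is a common convention) would mean the observed gaps are compatible with chance. Even a small p-value would not show *why* rates differ, so the bias questions discussed above would still apply.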

### 4. Data Minimization

**Task 4.1:** Create a dataset for step count analysis only, excluding health-sensitive info.

In [7]:
# Task 4.1: Create a dataset for step count analysis only, excluding health-sensitive info.
step_data = data[['Device_ID', 'Country', 'Consent_Type', 'Daily_Steps']]
display(step_data.head())
print(f"Original dataset columns: {data.columns.tolist()}")
print(f"Step data dataset columns: {step_data.columns.tolist()}")

|   | Device_ID | Country | Consent_Type | Daily_Steps |
|---|---|---|---|---|
| 0 | 1 | Japan | implicit | 11067 |
| 1 | 2 | India | none | 2646 |
| 2 | 3 | Germany | explicit | 14118 |
| 3 | 4 | USA | explicit | 7137 |
| 4 | 5 | UK | explicit | 17093 |


Original dataset columns: ['Device_ID', 'User_Age', 'Country', 'Consent_Type', 'Average_HeartRate', 'Daily_Steps', 'Sleep_Hours', 'GPS_Location_Share', 'Health_Alert']
Step data dataset columns: ['Device_ID', 'Country', 'Consent_Type', 'Daily_Steps']
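
The selection above still contains records with consent "none" and the raw Device_ID. A stricter take on minimization, offered as a sketch under stated assumptions, filters on consent first, drops Consent_Type once it has served its purpose, and replaces Device_ID with a salted hash. The `SALT` value, the `pseudonymize` helper, and the `Device_Key` column are hypothetical names introduced for illustration. A salted hash is pseudonymization, not anonymization.

```python
import hashlib

import numpy as np
import pandas as pd

# Same synthetic data as the setup cell (copied so this block runs on its own).
np.random.seed(123)
n = 1000
consent_types = ["explicit", "implicit", "none"]
countries = ["USA", "UK", "Germany", "India", "Japan"]
data = pd.DataFrame({
    "Device_ID": range(1, n + 1),
    "User_Age": np.random.randint(18, 80, n),
    "Country": np.random.choice(countries, n),
    "Consent_Type": np.random.choice(consent_types, n, p=[0.65, 0.25, 0.10]),
    "Average_HeartRate": np.random.randint(50, 160, n),
    "Daily_Steps": np.random.randint(1000, 20000, n),
    "Sleep_Hours": np.round(np.random.uniform(3, 10, n), 1),
    "GPS_Location_Share": np.random.choice([0, 1], n, p=[0.7, 0.3]),
    "Health_Alert": np.random.choice([0, 1], n, p=[0.85, 0.15]),
})

# Hypothetical placeholder; a real salt would be kept secret and stored separately.
SALT = "replace-with-secret-salt"

def pseudonymize(device_id, salt=SALT):
    """Return a truncated salted SHA-256 digest of the device ID (illustrative)."""
    digest = hashlib.sha256(f"{salt}:{device_id}".encode("utf-8")).hexdigest()
    return digest[:16]

# Keep only records with some form of consent.
consented = data[data["Consent_Type"] != "none"]

# Keep only what step analysis needs and swap the raw ID for a pseudonym.
step_data_min = (
    consented[["Device_ID", "Country", "Daily_Steps"]]
    .assign(Device_Key=lambda df: df["Device_ID"].map(pseudonymize))
    .drop(columns="Device_ID")
    .reset_index(drop=True)
)
print(step_data_min.head())
print(f"Columns kept: {step_data_min.columns.tolist()}")
```

Whether Country should stay depends on the analysis: it is needed for per-country step comparisons, but it can be dropped as well if only overall step statistics are required.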
