# EDA

In [None]:
* total rows
* number of unique participants per file
* overlap between major domains (activity, heart rate, sleep)

In [2]:
import pandas as pd
from pathlib import Path

# === Step 1: Set up paths ===
data_folder = Path("data")
periods = [
    ("3.12.16-4.11.16", "Fitabase Data 3.12.16-4.11.16"),
    ("4.12.16-5.12.16", "Fitabase Data 4.12.16-5.12.16")
]

# === Step 2: Define files to include ===
# We'll focus on datasets that contain participant-level data with "Id" columns
file_names = [
    "heartrate_seconds_merged.csv",
    "dailyActivity_merged.csv",
    "minuteSleep_merged.csv",
    "sleepDay_merged.csv",
    "minuteStepsNarrow_merged.csv",
    "minuteCaloriesNarrow_merged.csv",
    "minuteIntensitiesNarrow_merged.csv"
]

# === Step 3: Load and combine ===
datasets = {}

for file_name in file_names:
    dfs = []
    for period, folder in periods:
        path = data_folder / f"mturkfitbit_export_{period}" / folder / file_name
        if path.exists():
            df = pd.read_csv(path)
            df["Period"] = period  # Track which period the rows came from
            dfs.append(df)
        else:
            print(f"⚠️ Missing: {path.name}")
    if dfs:
        combined_df = pd.concat(dfs, ignore_index=True)
        if "Id" in combined_df.columns:
            datasets[file_name] = combined_df
        else:
            print(f"⚠️ Skipping {file_name}: no 'Id' column found.")

# === Step 4: Summary statistics ===
summary = []
for name, df in datasets.items():
    summary.append({
        "Dataset": name,
        "Rows": len(df),
        "Unique Participants": df["Id"].nunique(),
        "Columns": len(df.columns)
    })

summary_df = pd.DataFrame(summary).sort_values("Unique Participants", ascending=False)
print("📋 Participant Summary Across Datasets:\n")
print(summary_df.to_string(index=False))

# === Step 5: Domain overlap ===
ids = {name: set(df["Id"].unique()) for name, df in datasets.items()}
id_sets = {
    "Activity": ids.get("dailyActivity_merged.csv", set()),
    "HeartRate": ids.get("heartrate_seconds_merged.csv", set()),
    # Sleep uses the daily file; the minute-level file is only a fallback
    # when sleepDay_merged.csv was not loaded at all.
    "Sleep": ids.get("sleepDay_merged.csv", ids.get("minuteSleep_merged.csv", set())),
}

print("\n🤝 Overlap Summary:")
print(f"Activity only: {len(id_sets['Activity'] - (id_sets['HeartRate'] | id_sets['Sleep']))}")
print(f"HeartRate only: {len(id_sets['HeartRate'] - (id_sets['Activity'] | id_sets['Sleep']))}")
print(f"Sleep only: {len(id_sets['Sleep'] - (id_sets['HeartRate'] | id_sets['Activity']))}")
print(f"Activity ∩ HeartRate: {len(id_sets['Activity'] & id_sets['HeartRate'])}")
print(f"Activity ∩ Sleep: {len(id_sets['Activity'] & id_sets['Sleep'])}")
print(f"HeartRate ∩ Sleep: {len(id_sets['HeartRate'] & id_sets['Sleep'])}")
print(f"All three overlap: {len(id_sets['Activity'] & id_sets['HeartRate'] & id_sets['Sleep'])}")

# Optional: show detailed participant counts per period
print("\n🧩 Participants per period:")
for file_name, df in datasets.items():
    counts = df.groupby("Period")["Id"].nunique()
    print(f"{file_name}:")
    for period, n in counts.items():
        print(f"  {period}: {n}")

⚠️ Missing: sleepDay_merged.csv
📋 Participant Summary Across Datasets:

                           Dataset    Rows  Unique Participants  Columns
          dailyActivity_merged.csv    1397                   35       16
      minuteStepsNarrow_merged.csv 2770620                   35        4
   minuteCaloriesNarrow_merged.csv 2770620                   35        4
minuteIntensitiesNarrow_merged.csv 2770620                   35        4
            minuteSleep_merged.csv  387080                   25        5
               sleepDay_merged.csv     413                   24        6
      heartrate_seconds_merged.csv 3638339                   15        4

🤝 Overlap Summary:
Activity only: 8
HeartRate only: 0
Sleep only: 0
Activity ∩ HeartRate: 15
Activity ∩ Sleep: 24
HeartRate ∩ Sleep: 12
All three overlap: 12

🧩 Participants per period:
heartrate_seconds_merged.csv:
  3.12.16-4.11.16: 14
  4.12.16-5.12.16: 14
dailyActivity_merged.csv:
  3.12.16-4.11.16: 35
  4.12.16-5.12.1

# What data do we have?

* **Activity data (steps, minutes, etc.)**: 1,397 rows from **35 people**.
* **Sleep (minute-by-minute)**: 387,080 rows from **25 people**.
* **Sleep (daily totals)**: 413 rows from **24 people**.
* **Heart rate (every few seconds)**: 3,638,339 rows from **15 people**.
* The three minute-level activity files (steps, calories, intensities) each have 2,770,620 rows covering **35 people**.

# Who overlaps where?

Think of “overlap” as “the same person shows up in both groups.” Sleep here means the daily sleep file (`sleepDay_merged.csv`, 24 people), because the overlap code only falls back to the minute-level file when the daily one was not loaded.

* **Activity ∩ Heart Rate**: **15** people
* **Activity ∩ Sleep**: **24** people
* **Heart Rate ∩ Sleep**: **12** people
* **All three (Activity ∩ Heart Rate ∩ Sleep)**: **12** people

👉 Translation: activity data covers all 35 people, **heart rate is the smallest group** (15), and only **a dozen** people have *all* three types of data at the same time.

# Time periods covered

* Data comes from two back-to-back windows: **Mar 12–Apr 11, 2016** and **Apr 12–May 12, 2016**.
* Daily sleep totals are **missing** for the first period (the file isn’t there), so all daily sleep data comes from the **Apr 12–May 12** window.
* Heart-rate coverage is stable at 14 participants in each period, but with 15 unique participants overall the two groups are not identical, and it is still a **small crowd**.
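
Equal per-period counts do not guarantee that the same people appear in both windows. A minimal sketch of that check, reusing the `Id` and `Period` columns created by the loading code; the IDs below are made up for illustration and do not come from the dataset:

```python
import pandas as pd

# Toy stand-in for one combined table (e.g. heart rate) with the notebook's
# Id and Period columns. The Id values are invented.
df = pd.DataFrame({
    "Id": [101, 102, 103, 102, 103, 104],
    "Period": ["3.12.16-4.11.16"] * 3 + ["4.12.16-5.12.16"] * 3,
})

# One set of participant IDs per period, same idea as the overlap step.
ids_by_period = {period: set(g["Id"]) for period, g in df.groupby("Period")}
first = ids_by_period["3.12.16-4.11.16"]
second = ids_by_period["4.12.16-5.12.16"]

in_both = sorted(first & second)       # present in both windows
only_first = sorted(first - second)    # dropped out after the first window
only_second = sorted(second - first)   # joined in the second window

print("both:", in_both)
print("first only:", only_first)
print("second only:", only_second)
```

Any cross-period comparison would then be limited to the `in_both` group.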

# Data quality + completeness (why this matters)

* **Heart rate wear time** (how long devices were worn each day) varies a lot by person and day.
  Fewer hours worn ⇒ more gaps ⇒ harder to estimate things like “resting” heart rate.
* **Uneven sample sizes**: 35 have activity, but only 15 have HR. That makes HR-based findings **less reliable**.
* **Missing files** (like daily sleep for the first month) limit comparisons across the full 2-month span.
* **Real-life noise**: caffeine, stress, sickness, or a missed night wearing the device can mess with patterns.
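
One rough proxy for wear time is the number of distinct minutes per person per day that contain at least one heart-rate reading. The sketch below assumes the heart-rate table has `Id`, `Time`, and `Value` columns (the column names are an assumption, not shown in the output above) and uses a tiny made-up frame in place of the real file:

```python
import pandas as pd

# Toy stand-in for heartrate_seconds_merged.csv. Column names (Id, Time,
# Value) are assumed, and the readings are invented.
hr = pd.DataFrame({
    "Id": [1, 1, 1, 1, 2],
    "Time": pd.to_datetime([
        "2016-04-12 08:00:05", "2016-04-12 08:00:35",
        "2016-04-12 08:01:10", "2016-04-13 09:30:00",
        "2016-04-12 10:00:00",
    ]),
    "Value": [62, 64, 63, 70, 80],
})


def hr_wear_minutes(hr: pd.DataFrame) -> pd.DataFrame:
    """Count distinct minutes with at least one HR reading per person-day."""
    stamped = hr.assign(
        Date=hr["Time"].dt.normalize(),     # calendar day of the reading
        Minute=hr["Time"].dt.floor("min"),  # minute bucket of the reading
    )
    return (
        stamped.groupby(["Id", "Date"])["Minute"]
        .nunique()
        .rename("WearMinutes")
        .reset_index()
    )


wear = hr_wear_minutes(hr)
print(wear)
```

Counting minutes rather than raw rows keeps the measure from being inflated by bursts of several readings within the same minute.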

# What simple questions can this data answer well?

* **Activity patterns** (steps and active minutes) across **35 people** and **both months**.
* **Minute-level sleep** for **25 people**, including bed/wake patterns and nightly totals.
* **Cross-domain questions** (like “does more activity relate to more sleep?”) are possible, **but only for the overlap groups**. Best case is **12 people** with activity + sleep + heart rate together.

# What the EDA suggests (high-level takeaways)

1. **Plenty of activity data, limited heart-rate data.**
   If your question *needs* heart rate, expect small samples and cautious conclusions.
2. **Decent sleep coverage, but split across two formats** (minute-by-minute and daily totals) and **missing daily totals** for the first period.
3. **Only 12 people have the full picture** (activity + sleep + heart rate).
   That’s your “gold” group for any combined analysis; just remember it’s small.
4. **Comparisons across the two months are uneven** because not every file exists for both months.
5. **Before fancy modeling**, it’s smart to:

   * Check **per-person coverage** (how many days each person actually wore the device).
   * Decide on **inclusion rules** (e.g., “use a day only if ≥12 hours of HR wear time”).
   * Align **dates carefully** (sleep that ends on a morning belongs to that day’s HR/activity).
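
The three checks above could be sketched roughly as follows. The 12-hour threshold is the example rule from the list; the table layouts, the column names (`Date`, `WearMinutes`, `SleepStart`, `SleepEnd`, `WakeDate`), and all values are hypothetical stand-ins rather than the dataset's actual schema:

```python
import pandas as pd

MIN_WEAR_MINUTES = 12 * 60  # example rule from the text: >= 12 hours of HR wear

# Hypothetical per-person-day wear table (values are made up).
wear = pd.DataFrame({
    "Id": [1, 1, 1, 2, 2],
    "Date": pd.to_datetime(
        ["2016-04-12", "2016-04-13", "2016-04-14", "2016-04-12", "2016-04-13"]
    ),
    "WearMinutes": [900, 300, 780, 100, 720],
})

# 1) Inclusion rule: keep only days that meet the wear-time threshold.
valid_days = wear[wear["WearMinutes"] >= MIN_WEAR_MINUTES]

# 2) Per-person coverage: days recorded vs. days that pass the rule.
coverage = pd.DataFrame({
    "DaysRecorded": wear.groupby("Id")["Date"].nunique(),
    "ValidDays": valid_days.groupby("Id")["Date"].nunique(),
}).fillna(0).astype(int)

# 3) Date alignment: assign each sleep episode to the calendar day it ends
#    on, so it can be joined to that day's HR and activity.
sleep = pd.DataFrame({
    "Id": [1, 2],
    "SleepStart": pd.to_datetime(["2016-04-12 23:10", "2016-04-13 00:30"]),
    "SleepEnd": pd.to_datetime(["2016-04-13 06:45", "2016-04-13 07:15"]),
})
sleep["WakeDate"] = sleep["SleepEnd"].dt.normalize()

# Keep only sleep episodes whose wake day is a valid wear day.
aligned = sleep.merge(
    valid_days, left_on=["Id", "WakeDate"], right_on=["Id", "Date"], how="inner"
)

print(coverage)
print(aligned[["Id", "WakeDate", "WearMinutes"]])
```

In this toy data, person 1's sleep ends on a day with only 300 wear minutes, so the inner join drops it and only person 2's night survives.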

# What analyses make sense next (and what to watch out for)

* **Activity ↔ Sleep (weekly):** You can group by week per person and compare sleep hours between people with **≥150 “very active” minutes/week** vs. those with less.
  Expect **small differences** and **noisy results** unless you restrict to people with many valid weeks.
* **Sleep ↔ Heart Rate (daily):** You *can* estimate a next-day “resting-ish” HR by taking a **low percentile** of daytime HR after waking (e.g., 15th percentile).
  With only **~12–15 people**, don’t overclaim: report averages, spreads, and p-values, but keep conclusions careful.
* **Wear-time checks are crucial.** Low wear time can fake “low HR” or “no steps,” so set a **wear-time threshold** before trusting a day.
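
As a rough illustration of the first two ideas, the sketch below groups daily data by person-week and takes a low percentile of post-wake heart rate. The 150-minute cutoff and the 15th percentile come from the text above; the column names (`VeryActiveMinutes`, `TotalMinutesAsleep`, `WakeTime`, and so on) are assumed for illustration, and every value is made up:

```python
import pandas as pd

# --- Activity vs. sleep, grouped by person-week ---------------------------
# Hypothetical daily table: two people, one week each.
daily = pd.DataFrame({
    "Id": [1] * 7 + [2] * 7,
    "Date": list(pd.date_range("2016-04-11", periods=7)) * 2,
    "VeryActiveMinutes": [30, 25, 20, 30, 25, 15, 10] + [5, 0, 10, 0, 5, 0, 0],
    "TotalMinutesAsleep": [420, 430, 410, 440, 450, 480, 470]
                          + [390, 400, 380, 410, 395, 420, 405],
})

weekly = (
    daily.assign(Week=daily["Date"].dt.to_period("W"))
    .groupby(["Id", "Week"])
    .agg(
        ActiveMinutes=("VeryActiveMinutes", "sum"),
        SleepMinutes=("TotalMinutesAsleep", "mean"),
        Days=("Date", "nunique"),  # lets you drop weeks with too few days
    )
    .reset_index()
)
weekly["SleepHours"] = weekly["SleepMinutes"] / 60
weekly["HighActive"] = weekly["ActiveMinutes"] >= 150  # cutoff from the text

# Descriptive summary only; with so few people this is not a test.
by_group = weekly.groupby("HighActive")["SleepHours"].agg(["mean", "count"])

# --- "Resting-ish" HR: low percentile of readings after waking -------------
hr = pd.DataFrame({
    "Id": [1] * 6,
    "Time": pd.to_datetime([
        "2016-04-12 06:30", "2016-04-12 08:00", "2016-04-12 09:00",
        "2016-04-12 12:00", "2016-04-12 15:00", "2016-04-12 18:00",
    ]),
    "Value": [50, 60, 62, 70, 90, 110],
})
wake = pd.DataFrame({
    "Id": [1],
    "Date": pd.to_datetime(["2016-04-12"]),
    "WakeTime": pd.to_datetime(["2016-04-12 07:00"]),
})

awake = hr.assign(Date=hr["Time"].dt.normalize()).merge(wake, on=["Id", "Date"])
awake = awake[awake["Time"] >= awake["WakeTime"]]  # drop pre-wake readings
resting = (
    awake.groupby(["Id", "Date"])["Value"]
    .quantile(0.15)  # 15th percentile, as suggested in the text
    .rename("RestingIshHR")
    .reset_index()
)

print(by_group)
print(resting)
```

The `Days` column is kept so that weeks with too few recorded days can be filtered out before comparing groups, in line with the advice to restrict to people with many valid weeks.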

# Plain-English conclusion

* You’ve got **lots of activity data**, **pretty good sleep data**, and **much less heart-rate data**.
* Only **about a dozen people** have *all three* types of data at the same time, which **limits how strong** any “sleep affects heart rate” or “activity affects sleep” claims can be.
* The data is **good for learning and demonstrating methods** (grouping by week, joining datasets, making comparisons), but **not great for proving small effects**.
* Be transparent: explain the **small sample for HR**, **missing files** for the first month’s daily sleep, and the need to **filter by wear time**. That keeps your science honest and your results believable.
