# Health Data Generation Documentation

This documentation outlines the process for generating simulated health data for a user. The data spans heart rate, calories burned, and weight measurements over a specific date range.

## Overview

The script generates health data for a single user (`user_id = 1234`) from January 1, 2024, to January 30, 2024, inclusive. The data is stored in three pandas DataFrames, one per health metric: `heart_rate_df`, `calorie_df`, and `weight_df`. Each has 150 rows (30 days, 5 readings per day).

### Data Generation Steps

1. **Date and Time Setup:**
   - Generates a date range for the specified period, repeating each date 5 times for multiple daily data points.
   - Assigns fixed times (`00:00:00`, `08:00:00`, `12:00:00`, `16:00:00`, `20:00:00`) to these dates for measurement timestamps.

2. **Heart Rate Data:**
   - Generates random integer heart rate values from 60 to 119 bpm for each timestamp (`np.random.randint` excludes the upper bound of 120).

3. **Calories Data:**
   - Generates calorie expenditure data, with 90% of points between 1 and 2 calories and 10% between 2 and 9 calories.
   - Shuffles the calorie data to randomize the distribution.

4. **Weight Data:**
   - Generates random weight measurements between 62 kg and 78 kg for each timestamp.
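The value ranges in steps 2 to 4 can be checked with a short sketch. It follows the same NumPy calls as the script, but the names `n_points`, `n_low`, and `n_high` are illustrative, and the size of the 10% group is computed as the remainder rather than with a second `int(...)` call.

```python
import numpy as np

np.random.seed(0)
n_points = 30 * 5  # 30 days, 5 readings per day

# randint excludes `high`, so the values fall in 60..119
heart_rate = np.random.randint(low=60, high=120, size=n_points)

# 90% of points in [1, 2), the remaining 10% in [2, 9), then shuffled
n_low = int(n_points * 0.9)
n_high = n_points - n_low
calories = np.concatenate([
    np.random.uniform(1, 2, size=n_low),
    np.random.uniform(2, 9, size=n_high),
])
np.random.shuffle(calories)

# One uniformly drawn weight per timestamp
weight = np.random.uniform(62, 78, size=n_points)

print(heart_rate.min(), heart_rate.max())
print(len(calories), int((calories >= 2).sum()))
```

Shuffling changes only the order of the calorie values, so the number of points at or above 2 stays at `n_high`.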

### DataFrames Creation

- Creates three DataFrames for heart rate, calories, and weight. Each contains:
  - `email`: Placeholder email address (`"pps006@cancerbase.org"`).
  - `user_id`: User ID (`1234`).
  - `date`: Measurement date.
  - `time`: Measurement time.
  - Respective health metric (`heart_rate`, `calories`, `weight_kg`).
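A minimal sketch of checking that a DataFrame follows this column layout. `BASE_COLUMNS` and `missing_columns` are hypothetical names introduced for illustration; they are not part of the original script.

```python
import pandas as pd

# Columns shared by all three DataFrames; the metric column differs per frame.
BASE_COLUMNS = ["email", "user_id", "date", "time"]


def missing_columns(df, metric):
    """Return the expected columns that `df` lacks (hypothetical helper)."""
    expected = BASE_COLUMNS + [metric]
    return [col for col in expected if col not in df.columns]


sample = pd.DataFrame({
    "email": ["pps006@cancerbase.org"],
    "user_id": [1234],
    "date": pd.to_datetime(["2024-01-01"]),
    "time": ["08:00:00"],
    "heart_rate": [72],
})

print(missing_columns(sample, "heart_rate"))  # []
print(missing_columns(sample, "weight_kg"))   # ['weight_kg']
```

A check like this may be useful before passing imported data to the functions described later.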

### Dependencies

- pandas
- numpy

### Note

Users can also import similarly structured data from a database. Ensure the imported data matches the format of the generated DataFrames. Example code for reading data from a database:

```python
import pandas as pd
import sqlalchemy

# Database connection example (replace the placeholder with a real connection string)
engine = sqlalchemy.create_engine('database_connection_string')
query = "SELECT * FROM your_table_name"

# Reading data into a DataFrame
df = pd.read_sql_query(query, engine)
```

In [2]:

```python
import pandas as pd
import numpy as np

user_id = 1234
np.random.seed(0)

# Generate the date range (30 days, inclusive) and repeat each date 5 times
date_range = pd.date_range(start="2024-01-01", end="2024-01-30")
repeated_dates = np.repeat(date_range, 5)

# Fixed measurement times, tiled so that every date gets all five
new_time_samples = ["00:00:00", "08:00:00", "12:00:00", "16:00:00", "20:00:00"]
new_time_data = np.tile(new_time_samples, len(date_range))

# Random times: not used in the DataFrames below, but the draw advances the
# random state, so removing it would change the generated values
time_samples = ["20:41:00", "14:48:00", "14:47:00", "14:46:00", "14:45:00", "09:32:00", "04:48:00", "09:27:00", "02:21:00", "09:34:00"]
time_data = np.random.choice(time_samples, size=len(date_range))

# Heart rate: integers from 60 to 119 (randint excludes `high`)
heart_rate_data = np.random.randint(low=60, high=120, size=len(repeated_dates))

# Calories: 90% of points in [1, 2), 10% in [2, 9)
calories_data = np.concatenate([
    np.random.uniform(1, 2, size=int(len(repeated_dates) * 0.9)),
    np.random.uniform(2, 9, size=int(len(repeated_dates) * 0.1))
])
# Shuffle so the higher values are spread across the timestamps
np.random.shuffle(calories_data)

# Weight: uniform between 62 and 78 kg
weight_data = np.random.uniform(62, 78, size=len(repeated_dates))

# Build the three DataFrames
heart_rate_df = pd.DataFrame({
    "email": ["pps006@cancerbase.org"] * len(repeated_dates),
    "user_id": [user_id] * len(repeated_dates),
    "date": repeated_dates,
    "time": new_time_data,
    "heart_rate": heart_rate_data
})
calorie_df = pd.DataFrame({
    "email": ["pps006@cancerbase.org"] * len(repeated_dates),
    "user_id": [user_id] * len(repeated_dates),
    "date": repeated_dates,
    "time": new_time_data,
    "calories": calories_data
})
weight_df = pd.DataFrame({
    "email": ["pps006@cancerbase.org"] * len(repeated_dates),
    "user_id": [user_id] * len(repeated_dates),
    "date": repeated_dates,
    "time": new_time_data,
    "weight_kg": weight_data
})
print(heart_rate_df)
print(calorie_df)
print(weight_df)
```

```
                     email  user_id       date      time  heart_rate
0    pps006@cancerbase.org     1234 2024-01-01  00:00:00          78
1    pps006@cancerbase.org     1234 2024-01-01  08:00:00          95
2    pps006@cancerbase.org     1234 2024-01-01  12:00:00          84
3    pps006@cancerbase.org     1234 2024-01-01  16:00:00         109
4    pps006@cancerbase.org     1234 2024-01-01  20:00:00         111
..                     ...      ...        ...       ...         ...
145  pps006@cancerbase.org     1234 2024-01-30  00:00:00         106
146  pps006@cancerbase.org     1234 2024-01-30  08:00:00         102
147  pps006@cancerbase.org     1234 2024-01-30  12:00:00         111
148  pps006@cancerbase.org     1234 2024-01-30  16:00:00         116
149  pps006@cancerbase.org     1234 2024-01-30  20:00:00         100

[150 rows x 5 columns]
                     email  user_id       date      time  calories
0    pps006@cancerbase.org     1234 2024-01-01  00:00:00  1.463451
1    pps006@ca
```

# `get_mets`

This function merges heart rate, calorie, and weight data into a single DataFrame and calculates the Metabolic Equivalent of Task (METs) for each data point. METs are a standard unit of measure that quantifies the energy expenditure of physical activities.

## Parameters

- `heart_rate_df`: DataFrame containing heart rate data. It must include the columns `date`, `time`, and `heart_rate`.
- `calorie_df`: DataFrame containing calorie data. It must include the columns `date`, `time`, and `calories`.
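- `weight_df`: DataFrame containing weight data. It must include the columns `date` and `weight_kg`. Because the merge uses `date` only, each row of this DataFrame is paired with every heart rate row from the same date.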
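Since `get_mets` joins `weight_df` on `date` alone (see Process below), several weight rows per date multiply the merged rows. The sketch below shows the effect on toy data and one possible way to avoid it. Collapsing to a daily mean is an assumption made here for illustration, not something the original function does, and `daily_weight` is a hypothetical name.

```python
import pandas as pd

dates = pd.to_datetime(["2024-01-01"] * 2)
heart_rate_df = pd.DataFrame({
    "date": dates,
    "time": ["00:00:00", "08:00:00"],
    "heart_rate": [78, 95],
})
weight_df = pd.DataFrame({"date": dates, "weight_kg": [70.0, 72.0]})

# Merging on `date` alone pairs every heart rate row with every weight row
# from that date: 2 x 2 = 4 rows.
expanded = heart_rate_df.merge(weight_df, on="date", how="left")
print(len(expanded))  # 4

# Assumption: one mean weight per day is an acceptable summary.
daily_weight = weight_df.groupby("date", as_index=False)["weight_kg"].mean()
collapsed = heart_rate_df.merge(daily_weight, on="date", how="left")
print(len(collapsed))                   # 2
print(collapsed["weight_kg"].tolist())  # [71.0, 71.0]
```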

## Returns

- `merged_df`: A pandas DataFrame containing the merged heart rate, calorie, and weight data along with calculated METs values.

## Process

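1. **Data Merging:**
   - Merges `heart_rate_df` with `calorie_df` on the `date` and `time` columns using a left join, so every heart rate row is kept.
   - Merges the result with `weight_df` on the `date` column only, again using a left join. If `weight_df` has several rows per date, each timestamp is repeated once per weight row; with the generated data (five weights per date) the 150 rows become 750.

2. **METs Calculation:**
   - Converts calories to joules by multiplying by 4.184 (since 1 calorie is approximately 4.184 joules).
   - Divides joules by weight in kilograms to get a raw per-kilogram value.
   - Rescales these values so that their mode (the most frequently occurring value) becomes 1.0, which makes the `mets` column relative to the most common energy expenditure rate. When no raw value repeats, `Series.mode()` typically returns every value in ascending order, so the baseline is then the smallest value.

3. **Time Format Adjustment:**
   - Converts the `time` column to pandas datetime values using the format `%H:%M:%S`. Because only a time of day is supplied, pandas fills in the default date `1900-01-01`.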
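The calculation and time conversion can be traced on a toy DataFrame. This is a minimal sketch that mirrors the steps above; the values and the names `raw` and `baseline` are chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "calories": [1.0, 1.0, 2.0],
    "weight_kg": [70.0, 70.0, 70.0],
    "time": ["00:00:00", "08:00:00", "12:00:00"],
})

# Calories to joules, then joules per kilogram
df["joule"] = df["calories"] * 4.184
raw = df["joule"] / df["weight_kg"]

# The first two rows share the same raw value, so that value is the mode
baseline = raw.mode().iloc[0]
df["mets"] = raw / baseline

# Only a time of day is given, so pandas supplies the date 1900-01-01
df["time"] = pd.to_datetime(df["time"], format="%H:%M:%S")

print(df["mets"].round(6).tolist())  # [1.0, 1.0, 2.0]
print(df["time"].iloc[1])            # 1900-01-01 08:00:00
```

The row with twice the calories at the same weight ends up with a value of 2.0, that is, twice the baseline.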




In [9]:

```python
import pandas as pd


def get_mets(heart_rate_df, calorie_df, weight_df):
    # Left-join calories onto heart rate by date and time
    merged_df = heart_rate_df[['date', 'time', 'heart_rate']].merge(
        calorie_df[['date', 'time', 'calories']], on=['date', 'time'], how='left'
    )
    # Left-join weight by date only; several weights per date multiply the rows
    merged_df = merged_df.merge(weight_df[['date', 'weight_kg']], on='date', how='left')

    # Energy in joules, then per kilogram of body weight
    merged_df['joule'] = merged_df['calories'] * 4.184
    merged_df['mets'] = merged_df['joule'] / merged_df['weight_kg']

    if merged_df.shape[0] == 0:
        print('no merged data')
        return merged_df

    # Rescale so that the most frequent value (the mode) becomes 1.0
    times = 1.00 / merged_df['mets'].mode().iloc[0]
    merged_df['mets'] = merged_df['mets'] * times
    # Parse the time strings; pandas fills in the date 1900-01-01
    merged_df['time'] = pd.to_datetime(merged_df['time'], format='%H:%M:%S')

    return merged_df


merged_df = get_mets(heart_rate_df, calorie_df, weight_df)
merged_df
```


| | date | time | heart_rate | calories | weight_kg | joule | mets |
|---|---|---|---|---|---|---|---|
| 0 | 2024-01-01 | 1900-01-01 00:00:00 | 78 | 1.463451 | 62.565799 | 6.123079 | 1.773488 |
| 1 | 2024-01-01 | 1900-01-01 00:00:00 | 78 | 1.463451 | 68.886439 | 6.123079 | 1.610763 |
| 2 | 2024-01-01 | 1900-01-01 00:00:00 | 78 | 1.463451 | 70.160270 | 6.123079 | 1.581518 |
| 3 | 2024-01-01 | 1900-01-01 00:00:00 | 78 | 1.463451 | 70.578840 | 6.123079 | 1.572139 |
| 4 | 2024-01-01 | 1900-01-01 00:00:00 | 78 | 1.463451 | 72.902280 | 6.123079 | 1.522034 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 745 | 2024-01-30 | 1900-01-01 20:00:00 | 100 | 1.411397 | 69.287106 | 5.905284 | 1.544486 |
| 746 | 2024-01-30 | 1900-01-01 20:00:00 | 100 | 1.411397 | 68.427417 | 5.905284 | 1.563890 |
| 747 | 2024-01-30 | 1900-01-01 20:00:00 | 100 | 1.411397 | 65.974615 | 5.905284 | 1.622032 |
| 748 | 2024-01-30 | 1900-01-01 20:00:00 | 100 | 1.411397 | 70.093862 | 5.905284 | 1.526709 |


# `categorize_mets`

This function categorizes METs (Metabolic Equivalent of Task) values into activity levels and aggregates the data by day. It adds a new column to the input DataFrame to label each METs value with its corresponding activity category. Finally, it summarizes the time spent in each activity category per day.

## Parameters

- `merged_df`: DataFrame containing METs data. It must include the columns `date`, `time`, and `mets`.

## Returns

- `mets_df`: A pandas DataFrame that includes the date, the time spent in each activity level per day, total active time, and total non-sedentary time.

## Process

1. **Categorization of METs:**
   - METs values are categorized into four activity levels: `sedentary`, `lightly_active`, `fairly_active`, and `very_active`, based on their value.
   - A new column, `mets_category`, is added to `merged_df` with these labels.

2. **Aggregation:**
   - The rows are grouped by `date` and `mets_category` and counted. Each count is divided by 60, which treats every row as one minute of activity and converts the total to hours.
   - The results are reshaped into a wide format, where each activity level becomes a column, and missing categories are filled with 0.

3. **Total Active and Non-Sedentary Time Calculation:**
   - A new column, `total_active`, sums the time across all four activity levels, including `sedentary`, for each day.
   - Another column, `non-sedentary`, sums the time across the `lightly_active`, `fairly_active`, and `very_active` categories, excluding `sedentary`.

Example usage: `mets_df = categorize_mets(merged_df)`, followed by `print(mets_df.head())`.
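The categorization and aggregation steps can be sketched on a toy DataFrame. Note that this sketch swaps the row-wise labeling function for a vectorized `np.digitize` lookup with the same thresholds; that substitution, the sample values, and the names `labels`, `counts`, and `hours` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01"] * 3 + ["2024-01-02"] * 2),
    "mets": [1.0, 2.0, 7.0, 1.2, 1.4],
})

labels = ["sedentary", "lightly_active", "fairly_active", "very_active"]

# digitize returns 0 for mets < 1.5, 1 for [1.5, 3), 2 for [3, 6), 3 for >= 6
bin_index = np.digitize(df["mets"].to_numpy(), [1.5, 3.0, 6.0])
df["mets_category"] = np.array(labels)[bin_index]

# Count rows per date and category, then pivot categories into columns
counts = df.groupby(["date", "mets_category"]).size().unstack(fill_value=0)

# Each row is treated as one minute, so dividing by 60 gives hours;
# reindex adds any category that never occurred as a column of zeros
hours = (counts / 60).reindex(columns=labels, fill_value=0.0)
hours["total_active"] = hours[labels].sum(axis=1)
hours["non-sedentary"] = hours[labels[1:]].sum(axis=1)

print(hours["total_active"].round(6).tolist())   # [0.05, 0.033333]
print(hours["non-sedentary"].round(6).tolist())  # [0.033333, 0.0]
```

On the first day, three rows (one sedentary, one lightly active, one very active) give 3/60 hours in total and 2/60 hours of non-sedentary time.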


In [12]:

```python
def categorize_mets(merged_df):
    def divide_mets(mets):
        if mets < 1.5:
            return 'sedentary'
        elif mets < 3.0:
            return 'lightly_active'
        elif mets < 6.0:
            return 'fairly_active'
        else:
            return 'very_active'

    # Adds the label column to the caller's DataFrame in place
    merged_df['mets_category'] = merged_df['mets'].apply(divide_mets)

    # Count rows per date and level; dividing by 60 treats each row as one minute
    grouped = merged_df.groupby(['date', 'mets_category'])['time'].count() / 60
    mets_df = grouped.unstack().reset_index().fillna(0)

    # Make sure every level has a column, even if it never occurred
    levels = ['sedentary', 'lightly_active', 'fairly_active', 'very_active']
    for level in levels:
        if level not in mets_df.columns:
            mets_df[level] = 0

    mets_df['total_active'] = mets_df[levels].sum(axis=1)
    mets_df['non-sedentary'] = mets_df[levels[1:]].sum(axis=1)

    return mets_df


mets_df = categorize_mets(merged_df)
mets_df
```


| | date | fairly_active | lightly_active | sedentary | very_active | total_active | non-sedentary |
|---|---|---|---|---|---|---|---|
| 0 | 2024-01-01 | 0.0 | 0.166667 | 0.166667 | 0.083333 | 0.416667 | 0.25 |
| 1 | 2024-01-02 | 0.0 | 0.4 | 0.016667 | 0.0 | 0.416667 | 0.4 |
| 2 | 2024-01-03 | 0.083333 | 0.333333 | 0.0 | 0.0 | 0.416667 | 0.416667 |
| 3 | 2024-01-04 | 0.0 | 0.083333 | 0.166667 | 0.166667 | 0.416667 | 0.25 |
| 4 | 2024-01-05 | 0.0 | 0.316667 | 0.016667 | 0.083333 | 0.416667 | 0.4 |
| 5 | 2024-01-06 | 0.083333 | 0.25 | 0.083333 | 0.0 | 0.416667 | 0.333333 |
| 6 | 2024-01-07 | 0.0 | 0.183333 | 0.233333 | 0.0 | 0.416667 | 0.183333 |
| 7 | 2024-01-08 | 0.0 | 0.233333 | 0.183333 | 0.0 | 0.416667 | 0.233333 |
| 8 | 2024-01-09 | 0.0 | 0.266667 | 0.066667 | 0.083333 | 0.416667 | 0.35 |
| 9 | 2024-01-10 | 0.0 | 0.35 | 0.066667 | 0.0 | 0.416667 | 0.35 |
