# Question 1: datetime Fundamentals and Time Series Indexing

This question focuses on datetime handling and time series indexing using patient vital signs data.

## Setup

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import os

# Set random seed for reproducibility
np.random.seed(42)

# Set plotting style
plt.style.use('default')
sns.set_style('whitegrid')

# Create output directory
os.makedirs('output', exist_ok=True)

## Part 1.1: Load and Explore Data

**Note:** This dataset mimics characteristics of real healthcare data:
- **200 patients** with daily vital signs over 1 year
- **Missing visits**: patients miss approximately 5% of scheduled visits
- **Different start dates**: not all patients start monitoring on January 1st; some join later
- **Gaps in date ranges**: when you select data by date range, some patients may have no data for certain periods. This is expected.
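
The missing-visit and late-start behaviour described above can be checked per patient. The following is a minimal sketch on a hypothetical toy frame (the `toy` data and the `span` summary are illustrative assumptions, not part of the assignment data). It counts gaps only inside each patient's own first-to-last window, so visits missed after the last recorded date are not detected.

```python
import pandas as pd

# Hypothetical toy data: two patients, each skipping one day in their window.
toy = pd.DataFrame({
    "patient_id": ["P1", "P1", "P1", "P2", "P2"],
    "date": pd.to_datetime(
        ["2023-01-01", "2023-01-02", "2023-01-04", "2023-01-03", "2023-01-05"]
    ),
})

# Per patient: first visit, last visit, and number of recorded visits.
span = toy.groupby("patient_id")["date"].agg(
    first="min", last="max", n_visits="count"
)

# Days in each patient's own monitoring window, counting both endpoints.
span["expected"] = (span["last"] - span["first"]).dt.days + 1

# Gaps inside the window; a late start does not count as a missed visit.
span["missed"] = span["expected"] - span["n_visits"]

print(span)
```

The same aggregation should work on `patient_vitals` once its `date` column has been converted with `pd.to_datetime`.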

In [2]:
# Load patient vital signs data
patient_vitals = pd.read_csv('data/patient_vitals.csv')

print("Patient vitals shape:", patient_vitals.shape)
print("Patient vitals columns:", patient_vitals.columns.tolist())

# Display sample data
print("\nPatient vitals sample:")
print(patient_vitals.head())
print("\nData summary:")
print(patient_vitals.describe())

# Check date range and missing data patterns
print(f"\nDate range: {patient_vitals['date'].min()} to {patient_vitals['date'].max()}")
print(f"Unique patients: {patient_vitals['patient_id'].nunique()}")
print(f"Total records: {len(patient_vitals)}")
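# Derive the expected count from the data instead of hardcoding 200 × 365.
# This is an upper bound: it assumes every patient is scheduled on every date.
n_patients = patient_vitals['patient_id'].nunique()
n_days = patient_vitals['date'].nunique()
expected_records = n_patients * n_days
print(f"Expected records ({n_patients} patients × {n_days} days): {expected_records:,}")
print(f"Missing visits: ~{expected_records - len(patient_vitals):,} records")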

Patient vitals shape: (18250, 7)
Patient vitals columns: ['date', 'patient_id', 'temperature', 'heart_rate', 'blood_pressure_systolic', 'blood_pressure_diastolic', 'weight']

Patient vitals sample:
         date patient_id  temperature  heart_rate  blood_pressure_systolic  \
0  2023-01-01      P0001    98.389672          71                      119   
1  2023-01-02      P0001    98.492046          67                      117   
2  2023-01-03      P0001    98.790163          70                      113   
3  2023-01-04      P0001    98.635781          74                      117   
4  2023-01-05      P0001    98.051660          67                      118   

   blood_pressure_diastolic     weight  
0                        84  68.996865  
1                        82  67.720215  
2                        78  67.846825  
3                        82  67.693993  
4                        83  68.228852  

Data summary:
        temperature    heart_rate  blood_pressure_systolic  \
count  182

## Part 1.2: datetime Operations

**TODO: Perform datetime operations**

In [3]:
# Convert date column to datetime
patient_vitals['date'] = pd.to_datetime(patient_vitals['date'])

# Set datetime as index
patient_vitals = patient_vitals.set_index('date').sort_index()

# Extract year, month, day from DatetimeIndex
patient_vitals['year'] = patient_vitals.index.year
patient_vitals['month'] = patient_vitals.index.month
patient_vitals['day'] = patient_vitals.index.day

# --------------------------------------------------------
# Calculate days_since_start for each patient
# --------------------------------------------------------
# Reset index temporarily
patient_vitals_reset = patient_vitals.reset_index()

# Compute days since each patient's first recorded date
patient_vitals_reset['days_since_start'] = (
    patient_vitals_reset.groupby('patient_id')['date']
    .transform(lambda x: (x - x.min()).dt.days)
)

# Restore datetime index
patient_vitals = patient_vitals_reset.set_index('date')

# --------------------------------------------------------
# Create business day clinic schedule
# --------------------------------------------------------
# bdate_range yields Monday-Friday dates; public holidays are not excluded
clinic_dates = pd.bdate_range(
    start=patient_vitals.index.min(),
    end=patient_vitals.index.max()
)

# Daily, weekly, and monthly monitoring schedules
daily_range = pd.date_range(
    start=patient_vitals.index.min(),
    end=patient_vitals.index.max(),
    freq='D'
)

weekly_range = pd.date_range(
    start=patient_vitals.index.min(),
    end=patient_vitals.index.max(),
    freq='W-MON'
)

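# Month-end dates. An explicit MonthEnd offset avoids the FutureWarning that
# recent pandas versions emit for the deprecated 'M' alias.
monthly_range = pd.date_range(
    start=patient_vitals.index.min(),
    end=patient_vitals.index.max(),
    freq=pd.offsets.MonthEnd()
)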

# --------------------------------------------------------
# Visit pattern analysis
# --------------------------------------------------------
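# Note: these sets hold unique calendar dates, so the counts below are the
# number of distinct dates with at least one visit, not the number of visits.
patient_dates_set = set(patient_vitals.index.date)
clinic_dates_set = set(clinic_dates.date)

visits_on_clinic_days = len(patient_dates_set & clinic_dates_set)
visits_on_weekends = len(patient_dates_set) - visits_on_clinic_days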

print(f"Visits on clinic business days: {visits_on_clinic_days}")
print(f"Visits on weekends: {visits_on_weekends}")
print(f"Total unique visit dates: {len(patient_dates_set)}")

# --------------------------------------------------------
# Save datetime analysis results
# --------------------------------------------------------
datetime_analysis = patient_vitals.reset_index()[[
    'date',
    'patient_id',
    'year',
    'month',
    'day',
    'days_since_start',
    'temperature'
]]

datetime_analysis.to_csv('output/q1_datetime_analysis.csv', index=False)

datetime_analysis.head()


Visits on clinic business days: 260
Visits on weekends: 105
Total unique visit dates: 365
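
As a sanity check on the counts above, the schedule ranges can be rebuilt for a fixed span. This is a small sketch that assumes the data cover 2023-01-01 through 2023-12-31 (the `start` and `end` values are assumptions inferred from the sample rows and the 365 unique dates, not read from the file).

```python
import pandas as pd

# Assumed span of the data; adjust if the real index differs.
start, end = "2023-01-01", "2023-12-31"

business = pd.bdate_range(start, end)                # Monday-Friday dates
daily = pd.date_range(start, end, freq="D")          # every calendar day
weekly = pd.date_range(start, end, freq="W-MON")     # every Monday
monthly = pd.date_range(start, end, freq=pd.offsets.MonthEnd())  # month ends

for name, rng in [
    ("business", business),
    ("daily", daily),
    ("weekly", weekly),
    ("monthly", monthly),
]:
    print(f"{name:9s} n={len(rng):3d}  first={rng[0].date()}  last={rng[-1].date()}")
```

Under that assumption there are 260 business days and 365 calendar days, which agrees with the 260 / 105 / 365 split printed above.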


        date patient_id  year  month  day  days_since_start  temperature
0 2023-01-01      P0001  2023      1    1                 0    98.389672
1 2023-01-01      P0024  2023      1    1                 0    97.552103
2 2023-01-01      P0025  2023      1    1                 0    98.806201
3 2023-01-01      P0047  2023      1    1                 0    98.943464
4 2023-01-01      P0026  2023      1    1                 0    98.758551
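
The visit pattern analysis in this part compares sets of unique dates. If the goal is to count individual visit records instead, one option is to classify each row by day of week. The sketch below uses a hypothetical toy frame (`toy` and its values are assumptions for illustration) and treats Monday through Friday as clinic days, ignoring holidays.

```python
import pandas as pd

# Hypothetical toy index: 2023-01-06 is a Friday, 2023-01-07 a Saturday,
# and 2023-01-09 a Monday.
toy = pd.DataFrame(
    {"patient_id": ["P1", "P2", "P1", "P1", "P2"]},
    index=pd.to_datetime(
        ["2023-01-06", "2023-01-06", "2023-01-07", "2023-01-09", "2023-01-09"]
    ),
)

# dayofweek runs Monday=0 ... Sunday=6, so values below 5 are weekdays.
is_weekday = toy.index.dayofweek < 5

weekday_visits = int(is_weekday.sum())      # rows recorded Monday-Friday
weekend_visits = int((~is_weekday).sum())   # rows recorded Saturday/Sunday
unique_dates = toy.index.normalize().nunique()

print(f"Weekday visits: {weekday_visits}")
print(f"Weekend visits: {weekend_visits}")
print(f"Unique visit dates: {unique_dates}")
```

Row counts and unique-date counts answer different questions, so it helps to label printed results accordingly.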


## Part 1.3: Time Zone Handling

**TODO: Handle time zones**

In [7]:
# No explicit pytz import is needed: pandas resolves timezone names passed as strings

# --------------------------------------------------------
# Create timezone-aware datetime values
# --------------------------------------------------------

# Current time in UTC (already tz-aware)
utc_time = pd.Timestamp.now(tz='UTC')

# Convert UTC time to Eastern Time
eastern_time = utc_time.tz_convert('US/Eastern')

# --------------------------------------------------------
# Localize patient_vitals to UTC and convert to Eastern
# --------------------------------------------------------

# Localize naive timestamps to UTC
patient_vitals_tz = patient_vitals.tz_localize('UTC')

# Convert UTC → Eastern
patient_vitals_tz_eastern = patient_vitals_tz.tz_convert('US/Eastern')

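# --------------------------------------------------------
# Handle Daylight Saving Time transition
# --------------------------------------------------------
# US DST began on 2023-03-12 at 02:00 local time (07:00 UTC), so 10:00 UTC
# falls after the switch and converts to 06:00 EDT (UTC-4), not 05:00 EST.
dst_date_utc = pd.Timestamp('2023-03-12 10:00:00', tz='UTC')
dst_time_eastern = dst_date_utc.tz_convert('US/Eastern')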

# --------------------------------------------------------
# Create detailed timezone report (200+ words)
# --------------------------------------------------------

timezone_report = f"""
Timezone Handling Report for Clinical Trial Data
-------------------------------------------------

1. Original Timezone
The patient vital signs dataset originally contained datetime values without any timezone 
information, meaning they were "naive" timestamps. Naive timestamps do not contain any 
offset information and do not indicate whether the recorded time corresponds to local 
time, UTC, or any other regional standard. In clinical settings, naive timestamps can lead 
to major inconsistencies when data are collected from multiple sites or during daylight 
saving time transitions.

2. Localization Method
The first step in making the dataset timezone-aware was to assign a baseline timezone. 
Consistent with best practices in medical informatics, we localized the datetime index to 
UTC using tz_localize('UTC'). This does not shift the timestamps; instead, it simply states 
that the naive timestamps should be interpreted as UTC. UTC is recommended because it has 
no daylight saving time adjustments, eliminating ambiguity.

3. Timezone Conversion
After localizing the data to UTC, we converted the timestamps to US/Eastern using 
tz_convert('US/Eastern'). This conversion applies the correct UTC offset for each date, 
including across daylight saving time transitions. Because the source values are dates 
without a time of day, each one is interpreted as midnight UTC and therefore displays as 
the previous evening in Eastern Time.

4. DST Handling
Using UTC as the baseline timezone prevents common ambiguity issues that arise during the 
start or end of daylight saving time. For example, the timestamp {dst_date_utc} in UTC 
corresponds to {dst_time_eastern} in US/Eastern. While Eastern Time "jumps" forward during 
DST, UTC remains constant, ensuring correct time alignment across all patient records.

5. Example
UTC example: {utc_time}
Converted to Eastern: {eastern_time}

Overall, storing data in UTC and converting to local time only when needed is the safest, 
most reproducible method for handling clinical time series data.
"""
with open('output/q1_timezone_report.txt', 'w') as f:
    f.write(timezone_report)

print("Saved timezone report → output/q1_timezone_report.txt")


Saved timezone report → output/q1_timezone_report.txt
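
Two pitfalls behind the report's recommendations can be reproduced in a few lines. This is a hedged sketch, not part of the assignment solution: the sample timestamps are assumptions chosen to hit the 2023 spring-forward gap, and the exact exception class raised for a nonexistent local time depends on the pandas version, so the code catches a generic `Exception`.

```python
import pandas as pd

# Pitfall 1: localizing naive times directly to a DST-observing zone.
# 02:30 on 2023-03-12 does not exist in US/Eastern (clocks jump 02:00 -> 03:00).
naive = pd.DatetimeIndex(
    ["2023-03-12 01:30", "2023-03-12 02:30", "2023-03-12 03:30"]
)

try:
    naive.tz_localize("US/Eastern")
    raised = False
except Exception:  # exact exception type varies across pandas versions
    raised = True

# One way to resolve it: move nonexistent times forward to the next valid instant.
shifted = naive.tz_localize("US/Eastern", nonexistent="shift_forward")

# Localizing to UTC never hits this gap, because UTC has no DST.
via_utc = naive.tz_localize("UTC").tz_convert("US/Eastern")

# Pitfall 2: date-only values become midnight UTC, which is the previous
# evening in Eastern Time, so the calendar date changes after conversion.
midnight_utc = pd.Timestamp("2023-01-01", tz="UTC")
eastern = midnight_utc.tz_convert("US/Eastern")

print("Direct localization raised:", raised)
print("Shifted forward:", shifted[1])
print("Midnight UTC in Eastern:", eastern)
```

If calendar dates must be preserved for daily records, converting date-only values between zones may not be appropriate; keeping them timezone-naive or recording an explicit time of day are alternatives to consider.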


## Submission Checklist

Before moving to Question 2, verify you've created:

- [ ] `output/q1_datetime_analysis.csv` - datetime analysis results
- [ ] `output/q1_timezone_report.txt` - timezone handling report
