# Question 1: datetime Fundamentals and Time Series Indexing

This question focuses on datetime handling and time series indexing using patient vital signs data.

## Setup

In [38]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import os

# Set random seed for reproducibility
np.random.seed(42)

# Set plotting style
plt.style.use('default')
sns.set_style('whitegrid')

# Create output directory
os.makedirs('output', exist_ok=True)

## Part 1.1: Load and Explore Data

**Note:** This dataset contains realistic healthcare data characteristics:
- **200 patients** with daily vital signs over 1 year
- **Missing visits**: Patients miss approximately 5% of scheduled visits (realistic!)
- **Different start dates**: Not all patients start monitoring on January 1st (some join later)
- When selecting data by date ranges, some patients may have no data for certain periods; this is expected and realistic
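
The two characteristics above (late starts and missed visits) can be checked per patient. The following is a minimal sketch on hypothetical toy data, not the assignment's dataset; the frame `toy` and the column names `first_visit`, `last_visit`, `n_records`, `expected_days`, and `missed_visits` are illustrative assumptions. It also assumes each patient's monitoring window runs from their own first record to their own last record.

```python
import pandas as pd

# Hypothetical toy data: P0001 misses 2023-01-03, P0002 only starts on 2023-01-03.
toy = pd.DataFrame({
    "date": pd.to_datetime([
        "2023-01-01", "2023-01-02", "2023-01-04",
        "2023-01-03", "2023-01-04",
    ]),
    "patient_id": ["P0001", "P0001", "P0001", "P0002", "P0002"],
})

# One row per patient: first record, last record, and number of records.
coverage = toy.groupby("patient_id")["date"].agg(
    first_visit="min", last_visit="max", n_records="count"
)

# Days in each patient's own window (inclusive), assuming one scheduled visit per day.
coverage["expected_days"] = (
    coverage["last_visit"] - coverage["first_visit"]
).dt.days + 1
coverage["missed_visits"] = coverage["expected_days"] - coverage["n_records"]

print(coverage)
```

Measuring each patient against their own window keeps a late start from being counted as a run of missed visits.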

In [39]:
# Load patient vital signs data
patient_vitals = pd.read_csv('data/patient_vitals.csv')

print("Patient vitals shape:", patient_vitals.shape)
print("Patient vitals columns:", patient_vitals.columns.tolist())

# Display sample data
print("\nPatient vitals sample:")
print(patient_vitals.head())
print("\nData summary:")
print(patient_vitals.describe())

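# Check date range and missing data patterns
# Expected records assume every patient is scheduled on every observed date,
# so patients who start late also count toward the shortfall.
n_patients = patient_vitals['patient_id'].nunique()
n_days = patient_vitals['date'].nunique()
expected_records = n_patients * n_days
print(f"\nDate range: {patient_vitals['date'].min()} to {patient_vitals['date'].max()}")
print(f"Unique patients: {n_patients}")
print(f"Total records: {len(patient_vitals)}")
print(f"Expected records ({n_patients} patients × {n_days} days): {expected_records:,}")
print(f"Missing visits: ~{expected_records - len(patient_vitals):,} records")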

Patient vitals shape: (18250, 7)
Patient vitals columns: ['date', 'patient_id', 'temperature', 'heart_rate', 'blood_pressure_systolic', 'blood_pressure_diastolic', 'weight']

Patient vitals sample:
         date patient_id  temperature  heart_rate  blood_pressure_systolic  \
0  2023-01-01      P0001    98.389672          71                      119   
1  2023-01-02      P0001    98.492046          67                      117   
2  2023-01-03      P0001    98.790163          70                      113   
3  2023-01-04      P0001    98.635781          74                      117   
4  2023-01-05      P0001    98.051660          67                      118   

   blood_pressure_diastolic     weight  
0                        84  68.996865  
1                        82  67.720215  
2                        78  67.846825  
3                        82  67.693993  
4                        83  68.228852  

Data summary:
        temperature    heart_rate  blood_pressure_systolic  \
count  182
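
The date range in the cell above is computed on the raw `date` column, which `pd.read_csv` loads as text unless asked to parse it. Min and max still work on ISO-formatted strings because they sort chronologically, but date arithmetic needs a datetime column. A minimal sketch of parsing at load time, using a hypothetical two-row stand-in for the CSV (the `csv_text` contents are an assumption for illustration):

```python
import io

import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

# Hypothetical stand-in for data/patient_vitals.csv.
csv_text = (
    "date,patient_id,heart_rate\n"
    "2023-01-01,P0001,71\n"
    "2023-01-02,P0001,67\n"
)

# Default load: the date column stays as text.
as_text = pd.read_csv(io.StringIO(csv_text))

# parse_dates converts the named column to datetime while reading.
as_dates = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

text_is_datetime = is_datetime64_any_dtype(as_text["date"])
parsed_is_datetime = is_datetime64_any_dtype(as_dates["date"])

# Subtraction is only meaningful on the parsed column.
span_days = (as_dates["date"].max() - as_dates["date"].min()).days
print(text_is_datetime, parsed_is_datetime, span_days)
```

Converting afterwards with `pd.to_datetime`, as Part 1.2 does, gives the same result; parsing at load time simply moves the step earlier.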

## Part 1.2: datetime Operations

**TODO: Perform datetime operations**

In [40]:
# Convert date column to datetime
patient_vitals['date'] = pd.to_datetime(patient_vitals['date'])

# Set datetime column as index
patient_vitals = patient_vitals.set_index('date')

print(patient_vitals.index)

# Extract year, month, day components from datetime index
patient_vitals['year'] = patient_vitals.index.year  # Extract from index
patient_vitals['month'] = patient_vitals.index.month  # Extract from index
patient_vitals['day'] = patient_vitals.index.day  # Extract from index

# Calculate time differences (e.g., days since first measurement)
# Note: Since patients start at different times, calculate days_since_start per patient
patient_vitals_reset = patient_vitals.reset_index()
patient_vitals_reset['days_since_start'] = patient_vitals_reset.groupby('patient_id')['date'].transform(lambda x: (x - x.min()).dt.days)
patient_vitals = patient_vitals_reset.set_index('date')

# Create business day ranges for clinic visit schedules
start, end = patient_vitals.index.min(), patient_vitals.index.max()
clinic_dates = pd.bdate_range(start=start, end=end)

# Create date ranges with different frequencies
daily_range = pd.date_range(start=start, end=end, freq='D')  # Daily monitoring schedule
weekly_range = pd.date_range(start=start, end=end, freq='W-MON')  # Weekly lab test schedule (Mondays)
monthly_range = pd.date_range(start=start, end=end, freq='MS')  # Monthly checkup schedule

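# Use date ranges to analyze visit patterns
# Check which visit dates fall on clinic business days vs weekends
# Note: the sets below hold unique calendar dates, so the counts are distinct dates
# with at least one visit, not individual visit records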
patient_dates_set = set(patient_vitals.index.date)
clinic_dates_set = set(clinic_dates.date)
visits_on_clinic_days = len(patient_dates_set & clinic_dates_set)
visits_on_weekends = len(patient_dates_set) - visits_on_clinic_days
print(f"Visits on clinic business days: {visits_on_clinic_days}")
print(f"Visits on weekends: {visits_on_weekends}")
print(f"Total unique visit dates: {len(patient_dates_set)}")

# Save results as 'output/q1_datetime_analysis.csv'
# Create a DataFrame with datetime analysis results including:
# - date (datetime index or column)
# - year, month, day (extracted from datetime)
# - days_since_start (calculated time differences)
# - patient_id
# - At least one original column (e.g., temperature, heart_rate)
datetime_analysis = patient_vitals[['patient_id', 'year', 'month', 'day', 'days_since_start', 'temperature']].copy()
datetime_analysis = datetime_analysis.reset_index()
datetime_analysis.to_csv('output/q1_datetime_analysis.csv', index = False)

DatetimeIndex(['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04',
               '2023-01-05', '2023-01-06', '2023-01-07', '2023-01-08',
               '2023-01-09', '2023-01-10',
               ...
               '2023-12-22', '2023-12-23', '2023-12-24', '2023-12-25',
               '2023-12-26', '2023-12-27', '2023-12-28', '2023-12-29',
               '2023-12-30', '2023-12-31'],
              dtype='datetime64[ns]', name='date', length=18250, freq=None)
Visits on clinic business days: 260
Visits on weekends: 105
Total unique visit dates: 365
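
The counts printed above (260, 105, 365) are numbers of distinct calendar dates, because the cell compares sets of dates. If the goal is to count individual visit records instead, one option is to test the weekday of every index entry. The sketch below uses a hypothetical toy frame, and the variable names are illustrative assumptions. Like `pd.bdate_range` with default settings, it treats Monday through Friday as business days and ignores holidays.

```python
import pandas as pd

# Hypothetical toy index: three records on Friday 2023-01-06, one on Saturday 2023-01-07.
idx = pd.DatetimeIndex(
    ["2023-01-06", "2023-01-06", "2023-01-06", "2023-01-07"], name="date"
)
toy = pd.DataFrame({"patient_id": ["P0001", "P0002", "P0003", "P0001"]}, index=idx)

# dayofweek is 0 for Monday through 6 for Sunday, so values below 5 are weekdays.
is_weekday = toy.index.dayofweek < 5

# Record-level counts: every row is one visit.
weekday_records = int(is_weekday.sum())
weekend_records = int((~is_weekday).sum())

# Date-level count, comparable to the set-based approach in the cell above.
unique_weekday_dates = toy.index[is_weekday].normalize().nunique()

print(weekday_records, weekend_records, unique_weekday_dates)
```

Here three records share a single weekday date, which shows how the record-level and date-level counts can differ.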


## Part 1.3: Time Zone Handling

**TODO: Handle time zones**

In [41]:
# Create timezone-aware datetime (for multi-site clinical trials)
utc_time = pd.Timestamp.now(tz = 'UTC')  # Current time in UTC
eastern_time = utc_time.tz_convert('US/Eastern')  # Convert to US Eastern

# Convert between different timezones
# Create timezone-aware DataFrame from patient_vitals
patient_vitals_tz = patient_vitals.copy()  
patient_vitals_tz.index = patient_vitals_tz.index.tz_localize('UTC') # Localize to UTC
patient_vitals_tz_eastern = patient_vitals_tz.copy()
patient_vitals_tz_eastern.index = patient_vitals_tz_eastern.index.tz_convert('US/Eastern')  # Convert the index to Eastern time

# Handle daylight saving time transitions
# Create a datetime on the day of the US spring DST transition (2023-03-12)
# Note: Using UTC avoids DST ambiguity issues - UTC has no daylight saving time
# Best practice: Store data in UTC, convert to local timezones only when needed
# 10:00 UTC falls after the 07:00 UTC switch, so the Eastern result is in EDT (UTC-4)
dst_date_utc = pd.Timestamp('2023-03-12 10:00:00', tz='UTC')  # UTC time avoids DST issues
dst_time_eastern = dst_date_utc.tz_convert('US/Eastern')  # Convert UTC to Eastern

# Document timezone operations
# Create a report string with the following sections:
# 1. Original timezone: Describe what timezone your original data was in (or if it was naive)
# 2. Localization method: Explain how you localized the data (e.g., tz_localize('UTC'))
# 3. Conversion: Describe what timezone you converted to (e.g., 'US/Eastern')
# 4. DST handling: Document any issues or observations about daylight saving time transitions
#    Note: Explain why using UTC as the base timezone avoids DST ambiguity issues
# 5. Example: Show at least one example of a datetime before and after conversion
# Minimum length: 200 words
timezone_report = f"""
Documentation of our timezone operations:

- What timezone was your original data in?
    a. The original data was timezone-naive: the dates carried no timezone information at all.
    As a result, a time like '09:00:00' could mean 9 AM in US Eastern, US Pacific, or any other timezone.
    The data alone cannot tell us which.

- How did you localize the data?
    a. We localized the data with the 'tz_localize()' method, setting the timezone to UTC (Coordinated Universal Time).
    The localization happens here: patient_vitals_tz.index = patient_vitals_tz.index.tz_localize('UTC')

- What timezone did you convert to?
    a. We then converted from UTC to US Eastern with the 'tz_convert()' method.
    The conversion happens here: patient_vitals_tz_eastern.index = patient_vitals_tz_eastern.index.tz_convert('US/Eastern')
    Because we wanted to keep the UTC-localized data, we stored the Eastern version in a separate DataFrame of patient vitals.

- What issues did you encounter with DST? (Note: Using UTC avoids DST ambiguity)
    a. DST (Daylight Saving Time) normally complicates local time data: 'spring forward' skips an hour, so some local times never occur, and 'fall back' repeats an hour, so some local times are ambiguous.
    We did not run into either problem, because UTC does not observe DST and its offset stays constant year-round.
    Converting from UTC to another timezone, such as US Eastern, is therefore always unambiguous.

- Include at least one example showing a datetime before and after conversion
    a. An example of a datetime before and after conversion, taken a few hours after the spring DST transition on 2023-03-12:
    UTC: {dst_date_utc}
    Eastern: {dst_time_eastern}

- Explain why UTC is recommended as the base timezone for storing temporal data
    a. UTC is recommended as the base timezone for storing temporal data because it is a single, universally accepted reference with no DST shifts.
    Every stored timestamp identifies exactly one moment, and local times can be derived from it whenever they are needed.
"""

# Save results as 'output/q1_timezone_report.txt'
with open('output/q1_timezone_report.txt', 'w') as f:
    f.write(timezone_report)
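
The cell above avoids DST problems by localizing to UTC. To see what that sidesteps, the sketch below localizes two hypothetical wall-clock readings directly to `US/Eastern`: one inside the spring gap and one inside the fall overlap. The timestamps and variable names are assumptions for illustration. The exact exception class depends on the installed pandas and timezone backend, so the code only records that an error was raised.

```python
import pandas as pd

# Hypothetical naive readings around the 2023 US DST transitions.
spring_gap = pd.Timestamp("2023-03-12 02:30:00")    # skipped: clocks jump from 02:00 to 03:00
fall_overlap = pd.Timestamp("2023-11-05 01:30:00")  # repeated: clocks go back from 02:00 to 01:00

# Localizing to UTC always succeeds because UTC has no DST.
spring_utc = spring_gap.tz_localize("UTC")
fall_utc = fall_overlap.tz_localize("UTC")

# Localizing to US/Eastern with default settings raises in both cases.
raised = []
for label, ts in [("spring", spring_gap), ("fall", fall_overlap)]:
    try:
        ts.tz_localize("US/Eastern")
    except Exception as exc:
        raised.append(label)
        print(label, type(exc).__name__)

# Explicit resolution: shift a nonexistent time forward to the next valid instant.
shifted = spring_gap.tz_localize("US/Eastern", nonexistent="shift_forward")

# Explicit resolution: ambiguous=True picks the DST reading, False the standard-time one.
first_pass = fall_overlap.tz_localize("US/Eastern", ambiguous=True)
second_pass = fall_overlap.tz_localize("US/Eastern", ambiguous=False)

print(shifted, first_pass, second_pass)
```

The two readings of the fall timestamp are one hour apart, which is the ambiguity that storing data in UTC removes.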

## Submission Checklist

Before moving to Question 2, verify you've created:

- [ ] `output/q1_datetime_analysis.csv` - datetime analysis results
- [ ] `output/q1_timezone_report.txt` - timezone handling report
