# Step 1: Rhythm ItemID Discovery

This notebook identifies which `itemid` values in the `chartevents` table contain rhythm information, specifically atrial fibrillation (AF) charting.

## Approach
We'll query the `chartevents` table to find itemids that frequently contain AF-related terms in their values.

In [None]:
# Import libraries
import pandas as pd
from google.cloud import bigquery
import sys
sys.path.append('..')
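# Import only the names this notebook uses, so their origin is explicit
from config import (
    OUTPUT_PROJECT_ID,
    CHARTEVENTS_TABLE,
    D_ITEMS_TABLE,
    AF_REGEX,
    AFLUTTER_REGEX,
)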

# Initialize BigQuery client
client = bigquery.Client(project=OUTPUT_PROJECT_ID)

## Query 1: Discover Rhythm ItemIDs

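For each itemid, this query counts the charted values that match the AF or atrial flutter patterns (`AF_REGEX`, `AFLUTTER_REGEX`) and reports that count as a share of all non-null values for the itemid. It keeps itemids with more than 100 matches and returns the top 20 by match count.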

In [None]:
rhythm_discovery_query = f"""
-- Find likely rhythm itemids and how AF is charted
WITH rhythm_candidates AS (
  SELECT
    ce.itemid,
    di.label AS item_label,
    di.category AS item_category,
    COUNTIF(REGEXP_CONTAINS(LOWER(ce.value), r'{AF_REGEX}|{AFLUTTER_REGEX}')) AS af_hits,
    COUNT(*) AS total_count,
    ROUND(COUNTIF(REGEXP_CONTAINS(LOWER(ce.value), r'{AF_REGEX}|{AFLUTTER_REGEX}')) / COUNT(*) * 100, 2) AS af_percentage
  FROM `{CHARTEVENTS_TABLE}` ce
  JOIN `{D_ITEMS_TABLE}` di USING (itemid)
  WHERE ce.value IS NOT NULL
  GROUP BY ce.itemid, item_label, item_category
)
SELECT *
FROM rhythm_candidates
WHERE af_hits > 100
ORDER BY af_hits DESC
LIMIT 20;
"""

print("Running rhythm discovery query...")
print("This may take a few minutes as it scans the chartevents table...\n")

rhythm_candidates = client.query(rhythm_discovery_query).to_dataframe()
print(f"Found {len(rhythm_candidates)} rhythm-related itemids\n")
rhythm_candidates
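
To make the aggregation concrete without touching BigQuery, here is a minimal offline sketch of the same per-itemid counting on a handful of made-up rows. The two patterns and the sample rows are hypothetical stand-ins (the real `AF_REGEX` and `AFLUTTER_REGEX` live in `config.py`), and Python's `re` is used in place of BigQuery's RE2 engine, which behaves the same for simple patterns like these.

```python
import re
from collections import defaultdict

# Hypothetical stand-ins for AF_REGEX / AFLUTTER_REGEX (assumptions, not the config values)
EXAMPLE_AF_REGEX = r"atrial fib|a-?fib"
EXAMPLE_AFLUTTER_REGEX = r"atrial flutter|a-?flutter"
pattern = re.compile(f"{EXAMPLE_AF_REGEX}|{EXAMPLE_AFLUTTER_REGEX}")

# Made-up (itemid, value) rows standing in for chartevents
rows = [
    (1001, "AF (Atrial Fibrillation)"),
    (1001, "SR (Sinus Rhythm)"),
    (1001, "A Flut (Atrial Flutter)"),
    (1001, "ST (Sinus Tachycardia)"),
    (2002, "Afib noted on monitor"),
    (2002, None),
    (3003, "72"),
]

def summarize(rows):
    """Mirror the SQL: af_hits, total_count, and af_percentage per itemid."""
    stats = defaultdict(lambda: {"af_hits": 0, "total_count": 0})
    for itemid, value in rows:
        if value is None:  # WHERE ce.value IS NOT NULL
            continue
        entry = stats[itemid]
        entry["total_count"] += 1
        if pattern.search(value.lower()):  # REGEXP_CONTAINS(LOWER(ce.value), ...)
            entry["af_hits"] += 1
    return {
        itemid: {**s, "af_percentage": round(s["af_hits"] / s["total_count"] * 100, 2)}
        for itemid, s in stats.items()
    }

summary = summarize(rows)
for itemid, s in summary.items():
    print(itemid, s)
```

In this toy data, itemid 3003 holds numeric values and gets zero hits, which is the kind of item the `af_hits > 100` filter is meant to drop.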

## Query 2: Sample Values for Top ItemIDs

Let's look at the most frequent charted values for the top itemid (the one with the most AF hits) to confirm that it actually records rhythm.

In [None]:
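# Take the itemid with the most AF hits (first row, since results are ordered by af_hits DESC)
if len(rhythm_candidates) > 0:
    top_itemid = int(rhythm_candidates.iloc[0]['itemid'])  # plain int for safe SQL interpolation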
    
    sample_values_query = f"""
    SELECT 
      value,
      COUNT(*) as count
    FROM `{CHARTEVENTS_TABLE}`
    WHERE itemid = {top_itemid}
      AND value IS NOT NULL
    GROUP BY value
    ORDER BY count DESC
    LIMIT 30;
    """
    
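    print(f"Sample values for itemid {top_itemid}:")
    sample_values = client.query(sample_values_query).to_dataframe()
    # A bare expression inside an `if` block is not auto-displayed, so print explicitly
    print(sample_values.to_string(index=False))
else:
    print("No candidate itemids found; skipping sample values.")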

## Query 3: Examine All Rhythm-Related Categories

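This query counts, for each `d_items` category, how many items have a label mentioning rhythm, cardiac, ECG, or EKG. It matches on item labels only, not on charted values, so treat the result as a pointer to categories worth inspecting.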

In [None]:
category_query = f"""
SELECT 
  di.category,
  COUNT(DISTINCT di.itemid) as num_items,
  COUNTIF(REGEXP_CONTAINS(LOWER(di.label), r'rhythm|cardiac|ecg|ekg')) as rhythm_related_items
FROM `{D_ITEMS_TABLE}` di
WHERE di.category IS NOT NULL
GROUP BY di.category
HAVING rhythm_related_items > 0
ORDER BY rhythm_related_items DESC;
"""

categories = client.query(category_query).to_dataframe()
print("Categories containing rhythm-related items:")
categories
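
Because the label pattern includes broad terms such as `cardiac`, it can match items that are not rhythm charting at all. The sketch below applies the same pattern to a few hypothetical labels (invented for illustration, not read from `d_items`) to show which ones would be counted.

```python
import re

# Same pattern as the category query, applied to lowercased labels
LABEL_PATTERN = re.compile(r"rhythm|cardiac|ecg|ekg")

# Hypothetical labels for illustration only
example_labels = [
    "Heart Rhythm",
    "Cardiac Output",
    "Ectopy Type",
    "EKG Interpretation",
    "Respiratory Rate",
]

matches = {label: bool(LABEL_PATTERN.search(label.lower())) for label in example_labels}
for label, hit in matches.items():
    print(f"{label!r}: {'match' if hit else 'no match'}")
```

Here "Cardiac Output" matches even though it is a hemodynamic measurement, while "Ectopy Type" does not match despite being rhythm-adjacent, so the category counts are best read alongside the value-based results from Query 1.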

## Save Validated Rhythm ItemIDs

Based on the results above, save the validated itemids to use in subsequent analyses.

In [None]:
# Keep itemids where AF/flutter values are more than a trace and the sample is large enough
validated_itemids = rhythm_candidates[
    (rhythm_candidates['af_percentage'] > 1.0) &  # More than 1% of values are AF-related
    (rhythm_candidates['total_count'] > 1000)     # More than 1,000 charted values
]

print(f"Validated {len(validated_itemids)} rhythm itemids for AF detection:")
print(validated_itemids[['itemid', 'item_label', 'af_hits', 'total_count', 'af_percentage']])

# Save to CSV
validated_itemids.to_csv('../data/validated_rhythm_itemids.csv', index=False)
print("\nSaved to ../data/validated_rhythm_itemids.csv")

## Export ItemID List for SQL Queries

Create a list that can be used in SQL queries.

In [None]:
# Create comma-separated list for SQL
itemid_list = validated_itemids['itemid'].tolist()
print(f"RHYTHM_ITEMIDS = {itemid_list}")
print(f"\nUse this in your SQL queries: WHERE itemid IN ({', '.join(map(str, itemid_list))})")
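
If no itemid passes the thresholds, the joined string is empty and the resulting `IN ()` clause is typically rejected as a syntax error. A minimal sketch of a guard is shown below; `build_itemid_filter` is a hypothetical helper, not something defined in `config.py` or elsewhere in this project.

```python
def build_itemid_filter(itemids):
    """Return a SQL predicate restricting itemid to the given values."""
    # int() rejects anything non-numeric before it reaches the SQL string;
    # the set removes duplicates and sorting keeps the output stable
    ids = sorted({int(i) for i in itemids})
    if not ids:
        return "FALSE"  # avoids emitting an empty IN () list
    return f"itemid IN ({', '.join(map(str, ids))})"

# Toy itemids for illustration
print(build_itemid_filter([3, 1, 3]))
print(build_itemid_filter([]))
```

With the notebook's own data, this could be called as `build_itemid_filter(validated_itemids['itemid'])` and the result placed after `WHERE`.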

## Summary

This notebook has:
1. Identified itemids in chartevents that contain AF rhythm charting
2. Checked the top candidate itemid by examining its most frequent charted values
3. Saved the validated itemids for use in AF episode detection

Next step: Use these itemids to build AF timelines in `02_af_timeline_construction.ipynb`