In [43]:
%pip install -r requirements.txt

Note: you may need to restart the kernel to use updated packages.


In [44]:

from trino import dbapi
from trino.auth import OAuth2Authentication

conn = dbapi.connect(
    host="smart-cities-trino.pre-prod.cloud.vtti.vt.edu",
    port=443,
    http_scheme="https",
    auth=OAuth2Authentication(),   # <-- this is the right one
    catalog="smartcities_iceberg",  # optional default
    # schema="...",                # optional default
)

In [45]:

cur = conn.cursor()
cur.execute("SHOW SCHEMAS")
print(cur.fetchall())

Open the following URL in browser for the external authentication:
https://smart-cities-trino.pre-prod.cloud.vtti.vt.edu/oauth2/token/initiate/1ef0d9d32250a69654c2ab03667bebaf209a9e81962f545c80ee0980d09aa9a0
[['alexandria'], ['cci'], ['falls-church'], ['information_schema'], ['smart_cities_test'], ['system'], ['tables'], ['vtti']]


In [46]:
cur.execute("SHOW CATALOGS")
print(cur.fetchall())

[['smartcities_iceberg'], ['system']]


In [47]:
cur.execute("SHOW SCHEMAS FROM smartcities_iceberg")
print(cur.fetchall())

[['alexandria'], ['cci'], ['falls-church'], ['information_schema'], ['smart_cities_test'], ['system'], ['tables'], ['vtti']]


In [48]:
cur.execute("""
SELECT table_schema, table_name
FROM smartcities_iceberg.information_schema.tables
ORDER BY table_schema, table_name
LIMIT 20
""")
print(cur.fetchall())

[['alexandria', 'bsm'], ['alexandria', 'psm'], ['alexandria', 'safety-event'], ['alexandria', 'speed-distribution'], ['alexandria', 'vehicle-count'], ['alexandria', 'vru-count'], ['cci', 'bsm'], ['falls-church', 'hiresdata'], ['falls-church', 'maple_washington'], ['falls-church', 'mediantraveltimes'], ['falls-church', 'old_hiresdata'], ['falls-church', 'old_mediantraveltimes'], ['falls-church', 'old_priority_requests'], ['falls-church', 'old_safety_conflicts'], ['falls-church', 'old_safety_pedcompliance'], ['falls-church', 'old_safety_redlightrunners'], ['falls-church', 'old_safety_simpledelay'], ['falls-church', 'old_tmc'], ['falls-church', 'old_tmc_crosswalk'], ['falls-church', 'old_tmc_lanes']]


## Best Practices for Querying

### 1. Always Verify Column Names First

**CRITICAL:** Column names may not be what you expect! Run the schema verification cells in this notebook to see actual field names.

```python
# See actual columns in a table
cur.execute("SELECT * FROM smartcities_iceberg.alexandria.bsm LIMIT 1")
actual_columns = [desc[0] for desc in cur.description]
print(actual_columns)
```

**Common Issues:**
- ❌ `vehicle_id` may not exist
- ❌ `latitude`, `longitude` → Use `lat`, `lon` instead
- ✅ Always check the schema verification output first!

### 2. Use Cursor Method (Not pd.read_sql)

To avoid pandas SQLAlchemy warnings, use the cursor method:

```python
# Good: Use cursor method
cur.execute("SELECT * FROM table LIMIT 5")
columns = [desc[0] for desc in cur.description]
results = cur.fetchall()
df = pd.DataFrame(results, columns=columns)

# Avoid: pd.read_sql() triggers warning
# df = pd.read_sql("SELECT * FROM table", conn)
```

### 3. Start with SELECT *

When exploring or unsure about column names, use `SELECT *` to get all columns:

```sql
SELECT 
    from_unixtime(publish_timestamp / 1000) as timestamp,
    *
FROM smartcities_iceberg.alexandria.bsm
LIMIT 10
```

Then filter columns in pandas:
```python
df_filtered = df[['timestamp', 'lat', 'lon', 'speed']]
```

### 4. Handle publish_timestamp Correctly

`publish_timestamp` is a **bigint** (milliseconds since Unix epoch), NOT a timestamp type.

**Convert to readable timestamp:**
```sql
from_unixtime(publish_timestamp / 1000) as timestamp
```

**Filter by time:**
```sql
WHERE publish_timestamp >= to_unixtime(current_timestamp - interval '24' hour) * 1000
```

**Use in aggregations:**
```sql
date_trunc('hour', from_unixtime(publish_timestamp / 1000))
```

In [49]:
import pandas as pd

# Get all tables from all schemas
cur.execute("""
SELECT table_schema, table_name
FROM smartcities_iceberg.information_schema.tables
WHERE table_schema NOT IN ('information_schema', 'system')
ORDER BY table_schema, table_name
""")
all_tables = cur.fetchall()

tables_df = pd.DataFrame(all_tables, columns=['Schema', 'Table'])
print(f"Total tables found: {len(tables_df)}\n")
print(tables_df.to_string(index=False))

Total tables found: 51

           Schema                      Table
       alexandria                        bsm
       alexandria                        psm
       alexandria               safety-event
       alexandria         speed-distribution
       alexandria              vehicle-count
       alexandria                  vru-count
              cci                        bsm
     falls-church                  hiresdata
     falls-church           maple_washington
     falls-church          mediantraveltimes
     falls-church              old_hiresdata
     falls-church      old_mediantraveltimes
     falls-church      old_priority_requests
     falls-church       old_safety_conflicts
     falls-church   old_safety_pedcompliance
     falls-church old_safety_redlightrunners
     falls-church     old_safety_simpledelay
     falls-church                    old_tmc
     falls-church          old_tmc_crosswalk
     falls-church              old_tmc_lanes
     falls-church           saf

In [50]:
# Function to examine table structure
def describe_table(schema, table):
    cur.execute(f"""
    SELECT column_name, data_type
    FROM smartcities_iceberg.information_schema.columns
    WHERE table_schema = '{schema}' AND table_name = '{table}'
    ORDER BY ordinal_position
    """)
    columns = cur.fetchall()
    return pd.DataFrame(columns, columns=['Column', 'Data Type'])


# Let's examine some key tables from each schema
print("=" * 80)
print("ALEXANDRIA SCHEMA - BSM Table")
print("=" * 80)
print(describe_table('alexandria', 'bsm'))

ALEXANDRIA SCHEMA - BSM Table
                      Column  Data Type
0                       city    varchar
1               intersection    varchar
2                      table    varchar
3          publish_timestamp     bigint
4              location_name    varchar
5                source_type    varchar
6                vendor_name    varchar
7             vendor_version    varchar
8                       misc    varchar
9                    msg_cnt    integer
10                        id     bigint
11                  sec_mark    integer
12                       lat     double
13                       lon     double
14                      elev       real
15       accuracy_semi_major       real
16       accuracy_semi_minor       real
17      accuracy_orientation       real
18              transmission    integer
19                     speed       real
20                   heading       real
21                     angle       real
22                 accel_lon       real
23        

In [51]:
print("\n" + "=" * 80)
print("ALEXANDRIA SCHEMA - PSM Table (Pedestrian Safety Message)")
print("=" * 80)
print(describe_table('alexandria', 'psm'))

print("\n" + "=" * 80)
print("ALEXANDRIA SCHEMA - Safety Event Table")
print("=" * 80)
print(describe_table('alexandria', 'safety-event'))

print("\n" + "=" * 80)
print("ALEXANDRIA SCHEMA - Vehicle Count Table")
print("=" * 80)
print(describe_table('alexandria', 'vehicle-count'))

print("\n" + "=" * 80)
print("FALLS CHURCH SCHEMA - HiRes Data Table")
print("=" * 80)
print(describe_table('falls-church', 'hiresdata'))


ALEXANDRIA SCHEMA - PSM Table (Pedestrian Safety Message)
                  Column  Data Type
0                   city    varchar
1           intersection    varchar
2                  table    varchar
3      publish_timestamp     bigint
4          location_name    varchar
5            source_type    varchar
6            vendor_name    varchar
7         vendor_version    varchar
8                   misc    varchar
9                msg_cnt    integer
10            basic_type    integer
11                    id     bigint
12              sec_mark    integer
13                   lat     double
14                   lon     double
15                  elev       real
16   accuracy_semi_major       real
17   accuracy_semi_minor       real
18  accuracy_orientation       real
19                 speed       real
20               heading       real
21             accel_lon       real
22             accel_lat       real
23            accel_vert       real
24             accel_yaw       real
25  e

In [None]:
# Store schemas in variables for later use to ensure correct field names
print("=" * 80)
print("STORING SCHEMAS FOR KEY TABLES")
print("=" * 80)

# Get and store BSM schema
bsm_schema = describe_table('alexandria', 'bsm')
print("\nAlexandria BSM Schema:")
print(bsm_schema.to_string(index=False))
bsm_columns = bsm_schema['Column'].tolist()

# Get and store PSM schema
psm_schema = describe_table('alexandria', 'psm')
print("\n\nAlexandria PSM Schema:")
print(psm_schema.to_string(index=False))
psm_columns = psm_schema['Column'].tolist()

# Get and store Safety Event schema
safety_event_schema = describe_table('alexandria', 'safety-event')
print("\n\nAlexandria Safety Event Schema:")
print(safety_event_schema.to_string(index=False))
safety_event_columns = safety_event_schema['Column'].tolist()

# Get and store Vehicle Count schema
vehicle_count_schema = describe_table('alexandria', 'vehicle-count')
print("\n\nAlexandria Vehicle Count Schema:")
print(vehicle_count_schema.to_string(index=False))
vehicle_count_columns = vehicle_count_schema['Column'].tolist()

# Get and store Falls Church HiRes Data schema
hiresdata_schema = describe_table('falls-church', 'hiresdata')
print("\n\nFalls Church HiRes Data Schema:")
print(hiresdata_schema.to_string(index=False))
hiresdata_columns = hiresdata_schema['Column'].tolist()

print("\n" + "=" * 80)
print("Schemas stored! Use these column lists to ensure correct field names in queries.")
print("=" * 80)

In [None]:
# Helper function to check if columns exist in a table
def check_columns(columns_to_check, available_columns, table_name):
    """
    Verify that requested columns exist in the table schema
    
    Args:
        columns_to_check: List of column names you want to use
        available_columns: List of actual columns from the schema
        table_name: Name of the table (for error messages)
    
    Returns:
        tuple: (all_valid: bool, missing_columns: list)
    """
    missing = [col for col in columns_to_check if col not in available_columns]
    
    if missing:
        print(f"WARNING: The following columns do not exist in {table_name}:")
        print(f"  Missing: {missing}")
        print(f"  Available columns: {available_columns}")
        return False, missing
    else:
        print(f"✓ All columns exist in {table_name}")
        return True, []

# Example: Verify BSM columns before using them
print("Example: Checking if columns exist in BSM table:")
columns_i_want = ['publish_timestamp', 'vehicle_id', 'lat', 'lon', 'speed', 'heading']
check_columns(columns_i_want, bsm_columns, 'alexandria.bsm')

# Show common column name corrections
print("\n" + "=" * 80)
print("COMMON FIELD NAME CORRECTIONS:")
print("=" * 80)
print("BSM Table:")
print("  ✗ latitude  → ✓ lat")
print("  ✗ longitude → ✓ lon")
print("\nSafety Event Table:")
print("  ✗ latitude  → ✓ lat")
print("  ✗ longitude → ✓ lon")
print("=" * 80)

## Using Stored Schemas

The schemas for key tables are now stored in variables:
- `bsm_columns` - List of columns in alexandria.bsm
- `psm_columns` - List of columns in alexandria.psm
- `safety_event_columns` - List of columns in alexandria.safety-event
- `vehicle_count_columns` - List of columns in alexandria.vehicle-count
- `hiresdata_columns` - List of columns in falls-church.hiresdata

**Always verify column names before writing queries** to avoid errors!

In [None]:
# Example: Dynamically build a query using stored schema
def build_select_query(table_schema, table_name, columns_to_select, available_columns, limit=10):
    """
    Build a SELECT query dynamically after verifying columns exist
    
    Args:
        table_schema: Schema name (e.g., 'alexandria')
        table_name: Table name (e.g., 'bsm')
        columns_to_select: List of columns to select
        available_columns: List of available columns from schema
        limit: Number of rows to return
    
    Returns:
        str: SQL query string or None if columns are invalid
    """
    # Verify all columns exist
    is_valid, missing = check_columns(columns_to_select, available_columns, f"{table_schema}.{table_name}")
    
    if not is_valid:
        print(f"Cannot build query - missing columns: {missing}")
        return None
    
    # Build column list
    column_str = ", ".join(columns_to_select)
    
    # Build query
    query = f"""
    SELECT {column_str}
    FROM smartcities_iceberg.{table_schema}."{table_name}"
    LIMIT {limit}
    """
    
    return query

# Example: Build a safe BSM query
print("Building a query for BSM data:")
columns_needed = ['publish_timestamp', 'vehicle_id', 'lat', 'lon', 'speed']
query = build_select_query('alexandria', 'bsm', columns_needed, bsm_columns, limit=5)

if query:
    print("\nGenerated Query:")
    print(query)

In [None]:
# Let's see what the ACTUAL column names are by querying sample data
print("=" * 80)
print("VERIFYING ACTUAL COLUMN NAMES IN BSM TABLE")
print("=" * 80)

# Get a sample row to see actual column names
cur.execute("""
SELECT *
FROM smartcities_iceberg.alexandria.bsm
LIMIT 1
""")

# Get actual column names from the query result
actual_bsm_columns = [desc[0] for desc in cur.description]
print(f"\nActual columns in BSM table ({len(actual_bsm_columns)} total):")
for i, col in enumerate(actual_bsm_columns, 1):
    print(f"  {i:2d}. {col}")

# Store for comparison
print(f"\nColumns stored from schema query: {len(bsm_columns)} total")
if set(actual_bsm_columns) != set(bsm_columns):
    print("WARNING: Mismatch between schema query and actual columns!")
else:
    print("✓ Schema matches actual columns")

In [None]:
# Check Safety Event columns too
print("\n" + "=" * 80)
print("VERIFYING ACTUAL COLUMN NAMES IN SAFETY-EVENT TABLE")
print("=" * 80)

cur.execute("""
SELECT *
FROM smartcities_iceberg.alexandria."safety-event"
LIMIT 1
""")

actual_safety_columns = [desc[0] for desc in cur.description]
print(f"\nActual columns in Safety-Event table ({len(actual_safety_columns)} total):")
for i, col in enumerate(actual_safety_columns, 1):
    print(f"  {i:2d}. {col}")

print("\n" + "=" * 80)
print("STORE THESE VERIFIED COLUMNS FOR USE IN QUERIES")
print("=" * 80)
print(f"\nbsm_columns_verified = {actual_bsm_columns}")
print(f"\nsafety_event_columns_verified = {actual_safety_columns}")

# Update the stored columns with verified ones
bsm_columns = actual_bsm_columns
safety_event_columns = actual_safety_columns

## ⚠️ IMPORTANT: Workflow for Using This Notebook

**Before running the example queries below, you MUST:**

1. **Run ALL cells in order** from the beginning of the notebook
2. **Pay special attention to the schema verification cells** (above) which show you the ACTUAL column names
3. **Note the actual column names** displayed in the verification output
4. **Update queries** to use the correct column names based on what you see

### Common Issues:
- ❌ Assuming field names like `vehicle_id`, `latitude`, `longitude` 
- ✅ Use the ACTUAL field names shown in the verification cells above
- ❌ Using `timestamp` instead of `publish_timestamp` for filtering
- ✅ `publish_timestamp` is a **bigint** (milliseconds) - use conversion functions

### Safe Approach:
All example queries below now use `SELECT *` to get ALL columns. After seeing what's available, you can filter to specific columns in your DataFrame:

```python
# After running a query with SELECT *
df_filtered = df[['column1', 'column2', 'column3']]
```

In [None]:
# Function to check data availability and time range
def check_data_range(schema, table, timestamp_col='publish_timestamp'):
    try:
        query = f"""
        SELECT 
            COUNT(*) as record_count,
            from_unixtime(MIN({timestamp_col}) / 1000) as earliest_record,
            from_unixtime(MAX({timestamp_col}) / 1000) as latest_record
        FROM smartcities_iceberg.{schema}."{table}"
        """
        cur.execute(query)
        result = cur.fetchone()
        return {
            'Schema': schema,
            'Table': table,
            'Record Count': result[0] if result else 0,
            'Earliest': result[1] if result else None,
            'Latest': result[2] if result else None
        }
    except Exception as e:
        return {
            'Schema': schema,
            'Table': table,
            'Error': str(e)
        }


# Check data ranges for key tables
print("=" * 80)
print("DATA AVAILABILITY AND TIME RANGES")
print("=" * 80)

ranges = []
tables_to_check = [
    ('alexandria', 'bsm'),
    ('alexandria', 'psm'),
    ('alexandria', 'safety-event'),
    ('alexandria', 'vehicle-count'),
    ('falls-church', 'hiresdata'),
]

for schema, table in tables_to_check:
    print(f"\nChecking {schema}.{table}...")
    range_info = check_data_range(schema, table)
    ranges.append(range_info)
    if 'Error' not in range_info:
        print(f"  Records: {range_info['Record Count']:,}")
        print(
            f"  Time range: {range_info['Earliest']} to {range_info['Latest']}")
    else:
        print(f"  Error: {range_info['Error']}")

DATA AVAILABILITY AND TIME RANGES

Checking alexandria.bsm...
  Error: Could not convert '+57641-08-14 10:06:55.000 America/New_York' into the associated python type

Checking alexandria.psm...
  Error: Could not convert '+57644-02-14 14:15:42.000 America/New_York' into the associated python type

Checking alexandria.safety-event...
  Error: Could not convert '+57527-01-15 15:25:49.000 America/New_York' into the associated python type

Checking alexandria.vehicle-count...
  Error: Could not convert '+57527-02-26 06:51:24.000 America/New_York' into the associated python type

Checking falls-church.hiresdata...
  Error: TrinoUserError(type=USER_ERROR, name=SYNTAX_ERROR, message="line 6:39: mismatched input '-'. Expecting: ',', '.', 'AS', 'CROSS', 'EXCEPT', 'FETCH', 'FOR', 'FULL', 'GROUP', 'HAVING', 'INNER', 'INTERSECT', 'JOIN', 'LEFT', 'LIMIT', 'MATCH_RECOGNIZE', 'NATURAL', 'OFFSET', 'ORDER', 'RIGHT', 'TABLESAMPLE', 'UNION', 'WHERE', 'WINDOW', <EOF>, <identifier>", query_id=20251102_2012

In [53]:
# Get sample data from BSM table (using cursor to avoid pandas warning)
print("=" * 80)
print("SAMPLE DATA: Alexandria BSM (Basic Safety Message)")
print("=" * 80)

cur.execute("""
SELECT *
FROM smartcities_iceberg.alexandria.bsm
LIMIT 5
""")

columns = [desc[0] for desc in cur.description]
results = cur.fetchall()
df_bsm = pd.DataFrame(results, columns=columns)
print(df_bsm)

SAMPLE DATA: Alexandria BSM (Basic Safety Message)
         city   intersection           table  publish_timestamp  \
0  alexandria  glebe-potomac  alexandria.bsm   1762109170268000   
1  alexandria  glebe-potomac  alexandria.bsm   1762109170366000   
2  alexandria  glebe-potomac  alexandria.bsm   1762109170366000   
3  alexandria  glebe-potomac  alexandria.bsm   1762109170366000   
4  alexandria  glebe-potomac  alexandria.bsm   1762109170366000   

     location_name source_type vendor_name vendor_version  misc  msg_cnt  ...  \
0  Glebe & Potomac        ittf        derq           rev1  V2.2      127  ...   
1  Glebe & Potomac        ittf        derq           rev1  V2.2       98  ...   
2  Glebe & Potomac        ittf        derq           rev1  V2.2      125  ...   
3  Glebe & Potomac        ittf        derq           rev1  V2.2       88  ...   
4  Glebe & Potomac        ittf        derq           rev1  V2.2       23  ...   

   brake_applied_status  traction_control_status  anti_lock

In [54]:
# Get sample data from Safety Event table (using cursor to avoid pandas warning)
print("\n" + "=" * 80)
print("SAMPLE DATA: Alexandria Safety Events")
print("=" * 80)

cur.execute("""
SELECT *
FROM smartcities_iceberg.alexandria."safety-event"
LIMIT 5
""")

columns = [desc[0] for desc in cur.description]
results = cur.fetchall()
df_safety = pd.DataFrame(results, columns=columns)
print(df_safety)


SAMPLE DATA: Alexandria Safety Events
  event_type                  event_id      time_at_site       detection_area  \
0        LCV  690786324fd3ef0012dcd06b  1762082786008000            South Leg   
1        LCV  6906c2af86085d0012be2701  1762036335412000            South Leg   
2         IC  6906dde90f79ab001299157d  1762043305433000  Intersection Center   
3     NM-VRU  69078b0c86085d0012bf0dc1  1762084023089000            North Leg   
4         IC  690789dd33ed800012f8ba2e  1762083724923000            South Leg   

  camera_id direction movement      object1_class object2_class        city  \
0         0      None     None               None          None  alexandria   
1         0      None     None               None          None  alexandria   
2         0      None     None               None          None  alexandria   
3         0      None     None  Passenger Vehicle    Pedestrian  alexandria   
4         0      None     None               None          None  alexandria   


In [55]:
# Get sample data from Falls Church median travel times (using cursor to avoid pandas warning)
print("\n" + "=" * 80)
print("SAMPLE DATA: Falls Church - Median Travel Times")
print("=" * 80)

cur.execute("""
SELECT *
FROM smartcities_iceberg."falls-church".mediantraveltimes
LIMIT 5
""")

columns = [desc[0] for desc in cur.description]
results = cur.fetchall()
df_travel = pd.DataFrame(results, columns=columns)
print(df_travel)


SAMPLE DATA: Falls Church - Median Travel Times
           city data_provider                           table  \
0  falls-church     MioVision  falls-church.mediantraveltimes   
1  falls-church     MioVision  falls-church.mediantraveltimes   
2  falls-church     MioVision  falls-church.mediantraveltimes   
3  falls-church     MioVision  falls-church.mediantraveltimes   
4  falls-church     MioVision  falls-church.mediantraveltimes   

                    src_intersection_id src_intersection  \
0  1909c589-3cfa-4e8b-b759-792c667baa96      birch-broad   
1  1909c589-3cfa-4e8b-b759-792c667baa96      birch-broad   
2  1909c589-3cfa-4e8b-b759-792c667baa96      birch-broad   
3  1909c589-3cfa-4e8b-b759-792c667baa96      birch-broad   
4  1909c589-3cfa-4e8b-b759-792c667baa96      birch-broad   

                src_intersection_name  src_intersection_lat  \
0  Birch Street and West Broad Street             38.893546   
1  Birch Street and West Broad Street             38.893546   
2  Birch S

## API Summary

This smart-cities API provides access to connected vehicle and traffic management data from multiple cities:

### Available Data Types:

1. **BSM (Basic Safety Message)**: Real-time vehicle position, speed, heading from connected vehicles
2. **PSM (Pedestrian Safety Message)**: Pedestrian detection and safety data
3. **Safety Events**: Detected safety conflicts and incidents
4. **Vehicle/VRU Counts**: Traffic volume data
5. **Speed Distribution**: Speed profiles across locations
6. **High-Resolution Traffic Data**: Detailed signal performance metrics
7. **Travel Times**: Corridor travel time measurements

### Geographic Coverage:

- Alexandria, VA
- Falls Church, VA
- CCI (Center for Connected Infrastructure)
- VTTI (Virginia Tech Transportation Institute)

### Data Collection Approaches:

Run the cells below to see examples of collecting data over time periods.


## Example 1: Collecting BSM Data Over a Time Period

This example shows how to collect vehicle trajectory data (BSM) for a specific time range:


In [None]:
# Collect BSM data for the last 24 hours
# Using SELECT * to get all columns - customize after verifying column names above
# Note: publish_timestamp is a bigint (milliseconds since epoch)

cur.execute("""
SELECT 
    from_unixtime(publish_timestamp / 1000) as timestamp,
    *
FROM smartcities_iceberg.alexandria.bsm
WHERE publish_timestamp >= to_unixtime(current_timestamp - interval '24' hour) * 1000
ORDER BY publish_timestamp DESC
LIMIT 1000
""")

columns = [desc[0] for desc in cur.description]
results = cur.fetchall()
df_bsm_daily = pd.DataFrame(results, columns=columns)

print(f"Collected {len(df_bsm_daily)} BSM records from the last 24 hours")
print(f"\nColumns available: {list(df_bsm_daily.columns)}")
print(f"\nFirst few rows:")
print(df_bsm_daily.head())

# After seeing the columns, you can select specific ones like this:
# columns_i_want = ['timestamp', 'publish_timestamp', 'lat', 'lon', 'speed', 'heading']
# df_bsm_daily_filtered = df_bsm_daily[columns_i_want]

## Example 2: Collecting Safety Events Over a Date Range

This example shows how to collect safety event data between specific dates:


In [None]:
# Collect safety events for a specific date range
# Using SELECT * to get all columns - customize after verifying column names above
# Note: publish_timestamp is a bigint (milliseconds since epoch)

cur.execute("""
SELECT 
    from_unixtime(publish_timestamp / 1000) as timestamp,
    *
FROM smartcities_iceberg.alexandria."safety-event"
WHERE publish_timestamp BETWEEN 
    to_unixtime(timestamp '2024-01-01 00:00:00') * 1000 AND 
    to_unixtime(timestamp '2024-12-31 23:59:59') * 1000
ORDER BY publish_timestamp DESC
""")

columns = [desc[0] for desc in cur.description]
results = cur.fetchall()
df_safety_events = pd.DataFrame(results, columns=columns)

print(f"Collected {len(df_safety_events)} safety events")
print(f"\nColumns available: {list(df_safety_events.columns)}")

if len(df_safety_events) > 0:
    # Check if 'event_type' column exists before using it
    if 'event_type' in df_safety_events.columns:
        print("\nEvent type breakdown:")
        print(df_safety_events['event_type'].value_counts())
    print(f"\nFirst few rows:")
    print(df_safety_events.head())

## Example 3: Aggregated Traffic Data by Hour

This example shows how to aggregate traffic metrics over time periods:


In [None]:
# Aggregate vehicle counts by hour for the last 7 days
# Note: publish_timestamp is a bigint (milliseconds since epoch)
# Note: Adjust aggregation fields based on actual columns available

cur.execute("""
SELECT 
    date_trunc('hour', from_unixtime(publish_timestamp / 1000)) as hour,
    COUNT(*) as vehicle_count,
    AVG(speed) as avg_speed,
    MAX(speed) as max_speed,
    MIN(speed) as min_speed
FROM smartcities_iceberg.alexandria.bsm
WHERE publish_timestamp >= to_unixtime(current_timestamp - interval '7' day) * 1000
GROUP BY date_trunc('hour', from_unixtime(publish_timestamp / 1000))
ORDER BY hour DESC
""")

columns = [desc[0] for desc in cur.description]
results = cur.fetchall()
df_hourly_traffic = pd.DataFrame(results, columns=columns)

print(f"Collected {len(df_hourly_traffic)} hourly aggregates")
print(f"\nColumns: {list(df_hourly_traffic.columns)}")
print(df_hourly_traffic.head(10))

# Plot if matplotlib is available
try:
    import matplotlib.pyplot as plt
    
    if len(df_hourly_traffic) > 0:
        fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 8))
        
        ax1.plot(df_hourly_traffic['hour'], df_hourly_traffic['vehicle_count'])
        ax1.set_ylabel('Vehicle Count')
        ax1.set_title('Hourly Vehicle Counts - Last 7 Days')
        ax1.grid(True)
        
        ax2.plot(df_hourly_traffic['hour'], df_hourly_traffic['avg_speed'])
        ax2.set_ylabel('Average Speed')
        ax2.set_xlabel('Time')
        ax2.set_title('Average Speed by Hour')
        ax2.grid(True)
        
        plt.tight_layout()
        plt.show()
except ImportError:
    print("\nInstall matplotlib to visualize the data: %pip install matplotlib")

## Example 4: Batch Collection Over Multiple Days

This example shows how to collect data in batches over a longer time period:


In [None]:
from datetime import datetime, timedelta

# Function to collect data in daily batches (using cursor to avoid pandas warning)
def collect_data_batches(start_date, end_date, table_schema, table_name, columns_to_select='*'):
    """
    Collect data in daily batches to avoid memory issues with large datasets
    
    Args:
        start_date: Start date (string or datetime)
        end_date: End date (string or datetime)
        table_schema: Database schema name
        table_name: Table name
        columns_to_select: Columns to select (default '*' for all columns)
    
    Returns:
        List of dataframes, one per day
    """
    if isinstance(start_date, str):
        start_date = datetime.strptime(start_date, '%Y-%m-%d')
    if isinstance(end_date, str):
        end_date = datetime.strptime(end_date, '%Y-%m-%d')
    
    batches = []
    current_date = start_date
    
    while current_date <= end_date:
        next_date = current_date + timedelta(days=1)
        
        # Note: publish_timestamp is a bigint (milliseconds since epoch)
        # Using SELECT * to get all columns, or specify columns_to_select
        if columns_to_select == '*':
            select_clause = "from_unixtime(publish_timestamp / 1000) as timestamp, *"
        else:
            select_clause = f"from_unixtime(publish_timestamp / 1000) as timestamp, {columns_to_select}"
        
        query = f"""
        SELECT {select_clause}
        FROM smartcities_iceberg.{table_schema}."{table_name}"
        WHERE publish_timestamp >= to_unixtime(timestamp '{current_date.strftime('%Y-%m-%d 00:00:00')}') * 1000
          AND publish_timestamp < to_unixtime(timestamp '{next_date.strftime('%Y-%m-%d 00:00:00')}') * 1000
        """
        
        print(f"Collecting data for {current_date.strftime('%Y-%m-%d')}...")
        
        # Use cursor instead of pd.read_sql to avoid pandas warning
        cur.execute(query)
        columns = [desc[0] for desc in cur.description]
        results = cur.fetchall()
        df_batch = pd.DataFrame(results, columns=columns)
        
        print(f"  Found {len(df_batch)} records")
        
        if len(df_batch) > 0:
            batches.append(df_batch)
        
        current_date = next_date
    
    return batches

# Example: Collect safety events for a week in daily batches
print("=" * 80)
print("Collecting safety events in daily batches...")
print("=" * 80)

# Note: Using SELECT * to get all columns - adjust dates based on actual data availability
batches = collect_data_batches(
    start_date='2024-01-01',
    end_date='2024-01-07',
    table_schema='alexandria',
    table_name='safety-event',
    columns_to_select='*'  # Get all columns
)

if batches:
    # Combine all batches
    df_all = pd.concat(batches, ignore_index=True)
    print(f"\nTotal records collected: {len(df_all)}")
    print(f"\nColumns available: {list(df_all.columns)}")
    print(f"\nDate range: {df_all['timestamp'].min()} to {df_all['timestamp'].max()}")
    print(f"\nFirst few rows:")
    print(df_all.head())
else:
    print("\nNo data found in this date range")

## Tips for Data Collection

### STEP 1: Always Verify Column Names First!

**Run the schema verification cells at the beginning of this notebook to see actual column names.**

Don't assume field names! Common mistakes:
- ❌ `vehicle_id` - may not exist
- ❌ `latitude`, `longitude` - actual names are `lat`, `lon`
- ✅ Check the verification output to see real column names

### STEP 2: Understand the Data Types

**publish_timestamp is a BIGINT (Unix epoch in milliseconds)**

Since `publish_timestamp` is stored as a bigint (milliseconds since Unix epoch), you need to convert it:

**For display/conversion to readable timestamp:**
```sql
from_unixtime(publish_timestamp / 1000) as timestamp
```

**For time-based filtering:**
```sql
WHERE publish_timestamp >= to_unixtime(current_timestamp - interval '24' hour) * 1000
```

**For date range filtering:**
```sql
WHERE publish_timestamp BETWEEN 
    to_unixtime(timestamp '2024-01-01 00:00:00') * 1000 AND 
    to_unixtime(timestamp '2024-12-31 23:59:59') * 1000
```

**For aggregation:**
```sql
date_trunc('hour', from_unixtime(publish_timestamp / 1000))
```

### STEP 3: Start with SELECT * 

When exploring a new table, always start with `SELECT *` to see all available columns:

```python
cur.execute("SELECT * FROM smartcities_iceberg.alexandria.bsm LIMIT 5")
columns = [desc[0] for desc in cur.description]
results = cur.fetchall()
df = pd.DataFrame(results, columns=columns)
print(df.columns.tolist())  # See what columns are actually available
```

Then filter in pandas after seeing the data:
```python
df_filtered = df[['column1', 'column2', 'column3']]
```

### Performance Tips:
1. **Verify column names first** - Run schema verification cells
2. **Start with SELECT *** - See what's available before selecting specific columns
3. **Add LIMIT**: Always test queries with LIMIT first to check structure
4. **Use WHERE clauses**: Filter by publish_timestamp to reduce data volume
5. **Aggregate when possible**: Use GROUP BY and aggregation functions (COUNT, AVG, etc.)
6. **Batch large requests**: Use the batch collection function for multi-day/multi-week requests

### Common Time Intervals:
- Last hour: `to_unixtime(current_timestamp - interval '1' hour) * 1000`
- Last 24 hours: `to_unixtime(current_timestamp - interval '24' hour) * 1000`
- Last week: `to_unixtime(current_timestamp - interval '7' day) * 1000`
- Last month: `to_unixtime(current_timestamp - interval '30' day) * 1000`

### Key Tables for Traffic Safety Analysis:
- **alexandria.bsm**: Vehicle trajectories (position, speed, acceleration)
- **alexandria.safety-event**: Detected safety conflicts
- **alexandria.vehicle-count**: Traffic volumes
- **alexandria.speed-distribution**: Speed profiles
- **falls-church.hiresdata**: High-resolution signal data
- **falls-church.mediantraveltimes**: Travel time measurements