# Hurricane Tweet Spatiotemporal Analysis Pipeline
## Transforming Social Media Data into Time-Enabled GIS Outputs

**Author:** Claude Code  
**Date:** 2025-10-29  
**Purpose:** Convert hurricane-related tweet streams into ArcGIS Pro-compatible time-enabled spatial outputs

---

## Solution Overview

### Methodology
This pipeline implements a **multi-scale vector aggregation approach** with adaptive temporal binning:

1. **Spatial Strategy**: Aggregate tweets to state and county boundaries using spatial joins
2. **Temporal Strategy**: Adaptive binning (2-hour for Helene, 6-hour for Francine)
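3. **Entity Resolution**: Tweets are assigned to polygons by their coordinates (spatial join); GPE text mentions are carried along as a descriptive attribute rather than used for matching, and the cities reference is loaded but not used in the steps shown here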
4. **Output Format**: GeoPackage with time-enabled polygon features

### Input Data
- Hurricane Francine tweets (2,303 records, Sep 9-16, 2024)
- Hurricane Helene tweets (3,007 records, Sep 26-27, 2024)
- US State boundaries (52 features: the 50 states plus the District of Columbia and Puerto Rico)
- US County boundaries (3,222 features)
- Global cities reference database

### Output Data
- `hurricane_analysis_output.gpkg` containing:
  - `helene_states_timeseries`: State-level aggregates over time
  - `helene_counties_timeseries`: County-level aggregates over time
  - `francine_states_timeseries`: State-level aggregates over time
  - `francine_counties_timeseries`: County-level aggregates over time

---

## 1. Environment Setup and Imports

In [1]:
import geopandas as gpd
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

print("Environment Setup Complete")
print(f"GeoPandas version: {gpd.__version__}")
print(f"Pandas version: {pd.__version__}")

Environment Setup Complete
GeoPandas version: 1.1.1
Pandas version: 2.3.2


## 2. Configuration and Paths

In [2]:
# Input data paths (as specified in requirements)
DATA_DIR = Path(r'C:\Users\colto\Documents\GitHub\Tweet_project\data')

PATHS = {
    'helene_tweets': DATA_DIR / 'geojson' / 'helene.geojson',
    'francine_tweets': DATA_DIR / 'geojson' / 'francine.geojson',
    'states_shp': DATA_DIR / 'shape_files' / 'cb_2023_us_state_20m.shp',
    'counties_shp': DATA_DIR / 'shape_files' / 'cb_2023_us_county_20m.shp',
    'cities_csv': DATA_DIR / 'tables' / 'cities1000.csv'
}

# Output path
OUTPUT_PATH = Path(r'C:\Users\colto\Documents\GitHub\Tweet_project\output')
OUTPUT_PATH.mkdir(parents=True, exist_ok=True)  # also create missing parent folders
OUTPUT_GPKG = OUTPUT_PATH / 'hurricane_analysis_output.gpkg'

# Temporal binning configuration (adaptive by hurricane)
TIME_CONFIG = {
    'helene': {
        'bin_hours': 2,  # 2-hour bins for concentrated 2-day event
        'description': '2-hour intervals'
    },
    'francine': {
        'bin_hours': 6,  # 6-hour bins for 7-day event
        'description': '6-hour intervals'
    }
}

# Target CRS (match reference shapefiles)
TARGET_CRS = 'EPSG:4269'  # NAD83

print("Configuration loaded successfully")
print(f"Output will be saved to: {OUTPUT_GPKG}")

Configuration loaded successfully
Output will be saved to: C:\Users\colto\Documents\GitHub\Tweet_project\output\hurricane_analysis_output.gpkg


## 3. Data Loading Module

Load all input datasets and perform initial preprocessing.

In [3]:
def load_tweet_data(geojson_path, hurricane_name):
    """
    Load hurricane tweet GeoJSON and preprocess.
    
    Parameters:
    -----------
    geojson_path : Path
        Path to GeoJSON file
    hurricane_name : str
        Name of hurricane (for labeling)
    
    Returns:
    --------
    GeoDataFrame with standardized schema
    """
    print(f"Loading {hurricane_name} tweets from {geojson_path.name}...")
    
    # Load GeoJSON
    gdf = gpd.read_file(geojson_path)
    
    # Parse time field to datetime
    gdf['time'] = pd.to_datetime(gdf['time'])
    
    # Add hurricane label
    gdf['hurricane'] = hurricane_name
    
    # Ensure CRS is set (should be EPSG:4326 from GeoJSON)
    if gdf.crs is None:
        gdf.set_crs('EPSG:4326', inplace=True)
    
    # Reproject to target CRS
    gdf = gdf.to_crs(TARGET_CRS)
    
    # Handle missing GPE/FAC/LOC values
    for col in ['GPE', 'FAC', 'LOC']:
        if col in gdf.columns:
            gdf[col] = gdf[col].fillna('').astype(str)
    
    print(f"  Loaded {len(gdf)} tweets")
    print(f"  Time range: {gdf['time'].min()} to {gdf['time'].max()}")
    print(f"  Duration: {(gdf['time'].max() - gdf['time'].min()).total_seconds() / 3600:.1f} hours")
    
    return gdf


def load_reference_boundaries():
    """
    Load state and county boundary shapefiles.
    
    Returns:
    --------
    tuple: (states_gdf, counties_gdf)
    """
    print("Loading reference boundaries...")
    
    # Load states
    states = gpd.read_file(PATHS['states_shp'])
    print(f"  Loaded {len(states)} states (CRS: {states.crs})")
    
    # Load counties
    counties = gpd.read_file(PATHS['counties_shp'])
    print(f"  Loaded {len(counties)} counties (CRS: {counties.crs})")
    
    # Calculate areas in square kilometers for density calculations
    states['area_sqkm'] = states.to_crs('EPSG:5070').geometry.area / 1e6  # Albers Equal Area
    counties['area_sqkm'] = counties.to_crs('EPSG:5070').geometry.area / 1e6
    
    return states, counties


def load_cities_reference():
    """
    Load cities reference database (GeoNames format).
    Filter to US cities only for efficiency.
    
    Returns:
    --------
    DataFrame with city names and coordinates
    """
    print("Loading cities reference...")
    
    cities = pd.read_csv(PATHS['cities_csv'])
    
    # Filter to US cities only
    us_cities = cities[cities['country_code'] == 'US'].copy()
    
    print(f"  Loaded {len(us_cities)} US cities (filtered from {len(cities)} global)")
    
    return us_cities


# Execute data loading
print("=" * 80)
print("MODULE 1: DATA LOADING")
print("=" * 80)

helene_tweets = load_tweet_data(PATHS['helene_tweets'], 'Helene')
francine_tweets = load_tweet_data(PATHS['francine_tweets'], 'Francine')
states_ref, counties_ref = load_reference_boundaries()
cities_ref = load_cities_reference()

print("\nData loading complete!\n")

MODULE 1: DATA LOADING
Loading Helene tweets from helene.geojson...
  Loaded 3007 tweets
  Time range: 2024-09-26 02:29:25+00:00 to 2024-09-27 19:59:41+00:00
  Duration: 41.5 hours
Loading Francine tweets from francine.geojson...
  Loaded 2303 tweets
  Time range: 2024-09-09 11:00:36+00:00 to 2024-09-16 15:24:14+00:00
  Duration: 172.4 hours
Loading reference boundaries...
  Loaded 52 states (CRS: EPSG:4269)
  Loaded 3222 counties (CRS: EPSG:4269)
Loading cities reference...
  Loaded 17244 US cities (filtered from 161521 global)

Data loading complete!



## 4. Spatial Processing Module

Perform spatial joins to assign tweets to counties and states.

In [4]:
def spatial_join_tweets_to_boundaries(tweets_gdf, counties_gdf, states_gdf):
    """
    Spatially join tweets to county and state boundaries.
    
    This is the primary method for geographic assignment,
    using actual tweet coordinates rather than text mentions.
    
    Parameters:
    -----------
    tweets_gdf : GeoDataFrame
        Tweet points
    counties_gdf : GeoDataFrame
        County polygons
    states_gdf : GeoDataFrame
        State polygons
    
    Returns:
    --------
    GeoDataFrame with county and state assignments
    """
    print(f"Spatial join: {len(tweets_gdf)} tweets to boundaries...")
    
    # Ensure same CRS
    tweets_gdf = tweets_gdf.to_crs(TARGET_CRS)
    counties_gdf = counties_gdf.to_crs(TARGET_CRS)
    states_gdf = states_gdf.to_crs(TARGET_CRS)
    
    # Join to counties (gets us state too via county attributes)
    tweets_with_county = gpd.sjoin(
        tweets_gdf,
        counties_gdf[['geometry', 'NAME', 'STUSPS', 'STATE_NAME', 'GEOID']],
        how='left',
        predicate='within'
    )
    
    # Rename joined columns for clarity
    tweets_with_county = tweets_with_county.rename(columns={
        'NAME': 'county_name',
        'STUSPS': 'state_abbr',
        'STATE_NAME': 'state_name',
        'GEOID': 'county_geoid'
    })
    
    # For tweets not within any county, try state-level join
    # First, separate tweets with and without county assignments
    no_county_mask = tweets_with_county['county_name'].isna()
    no_county_indices = tweets_with_county[no_county_mask].index
    
    if len(no_county_indices) > 0:
        print(f"  {len(no_county_indices)} tweets not in any county, trying state-level join...")
        
        # Get the original tweet data for orphaned tweets (before any joins)
        orphaned_tweets = tweets_gdf.loc[no_county_indices].copy()
        
        # Join orphaned tweets to states
        orphaned_with_state = gpd.sjoin(
            orphaned_tweets,
            states_gdf[['geometry', 'NAME', 'STUSPS']],
            how='left',
            predicate='within'
        )
        
        # Update state info for these tweets in the main dataframe
        for idx in orphaned_with_state.index:
            if pd.notna(orphaned_with_state.loc[idx, 'NAME']):
                tweets_with_county.loc[idx, 'state_name'] = orphaned_with_state.loc[idx, 'NAME']
                tweets_with_county.loc[idx, 'state_abbr'] = orphaned_with_state.loc[idx, 'STUSPS']
    
    # Clean up index_right column from spatial join
    if 'index_right' in tweets_with_county.columns:
        tweets_with_county = tweets_with_county.drop(columns=['index_right'])
    
    # Report statistics
    tweets_with_state = tweets_with_county[tweets_with_county['state_name'].notna()]
    tweets_with_county_complete = tweets_with_county[tweets_with_county['county_name'].notna()]
    
    print(f"  Spatial join complete:")
    print(f"    {len(tweets_with_county_complete)} tweets matched to counties ({len(tweets_with_county_complete)/len(tweets_gdf)*100:.1f}%)")
    print(f"    {len(tweets_with_state)} tweets matched to states ({len(tweets_with_state)/len(tweets_gdf)*100:.1f}%)")
    print(f"    {len(tweets_gdf) - len(tweets_with_state)} tweets outside all boundaries")
    
    return tweets_with_county


# Execute spatial processing
print("=" * 80)
print("MODULE 2: SPATIAL PROCESSING")
print("=" * 80)

print("\nProcessing Helene tweets...")
helene_spatial = spatial_join_tweets_to_boundaries(helene_tweets, counties_ref, states_ref)

print("\nProcessing Francine tweets...")
francine_spatial = spatial_join_tweets_to_boundaries(francine_tweets, counties_ref, states_ref)

print("\nSpatial processing complete!\n")

MODULE 2: SPATIAL PROCESSING

Processing Helene tweets...
Spatial join: 3007 tweets to boundaries...
  18 tweets not in any county, trying state-level join...
  Spatial join complete:
    2989 tweets matched to counties (99.4%)
    2989 tweets matched to states (99.4%)
    18 tweets outside all boundaries

Processing Francine tweets...
Spatial join: 2303 tweets to boundaries...
  Spatial join complete:
    2303 tweets matched to counties (100.0%)
    2303 tweets matched to states (100.0%)
    0 tweets outside all boundaries

Spatial processing complete!
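
The state-level fallback in the cell above updates unmatched tweets one row at a time. The same update can be expressed as a single index-aligned assignment. The following is a minimal sketch on toy data, not the notebook's own code: the frames `tweets` and `orphans` and their values are hypothetical, and the approach assumes the tweet index is unique.

```python
import pandas as pd

# Hypothetical stand-in for the county join result: row 11 has no match yet.
tweets = pd.DataFrame(
    {
        "state_name": ["Florida", None, "Georgia"],
        "state_abbr": ["FL", None, "GA"],
    },
    index=[10, 11, 12],
)

# Hypothetical stand-in for the state-level join of the unmatched rows.
orphans = pd.DataFrame({"NAME": ["Alabama"], "STUSPS": ["AL"]}, index=[11])

# Keep only the rows where the state join actually found a polygon.
matched = orphans[orphans["NAME"].notna()]

# Index-aligned assignment: pandas matches rows by index label,
# so each column is updated in one vectorized step.
tweets.loc[matched.index, "state_name"] = matched["NAME"]
tweets.loc[matched.index, "state_abbr"] = matched["STUSPS"]

print(tweets)
```

For the 18 unmatched Helene tweets the loop is fast enough; the vectorized form mainly matters if the fallback ever has to handle many rows.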



## 5. Temporal Binning Module

Create time bins and assign tweets to temporal intervals.

In [5]:
def create_time_bins(tweets_gdf, hurricane_name, bin_hours):
    """
    Create temporal bins and assign each tweet to a bin.
    
    Parameters:
    -----------
    tweets_gdf : GeoDataFrame
        Tweets with time field
    hurricane_name : str
        Hurricane name (for reporting)
    bin_hours : int
        Hours per bin
    
    Returns:
    --------
    GeoDataFrame with bin assignments and bin metadata DataFrame
    """
    print(f"Creating {bin_hours}-hour time bins for {hurricane_name}...")
    
    # Get time range
    min_time = tweets_gdf['time'].min()
    max_time = tweets_gdf['time'].max()
    
    # Round min_time down to nearest bin interval
    start_time = min_time.floor(f'{bin_hours}h')
    
    # Round max_time up to nearest bin interval
    end_time = (max_time.ceil(f'{bin_hours}h'))
    
    # Create bin edges
    bin_edges = pd.date_range(start=start_time, end=end_time, freq=f'{bin_hours}h')
    
    print(f"  Time range: {min_time} to {max_time}")
    print(f"  Bin range: {start_time} to {end_time}")
    print(f"  Number of bins: {len(bin_edges) - 1}")
    
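    # Assign each tweet to a bin, working on a copy so the caller's frame is not mutated.
    # pd.cut uses right-closed intervals by default, so a tweet stamped exactly on an
    # interior bin edge lands in the earlier bin; include_lowest keeps the first edge.
    tweets_gdf = tweets_gdf.copy()
    tweets_gdf['time_bin_idx'] = pd.cut(
        tweets_gdf['time'],
        bins=bin_edges,
        labels=range(len(bin_edges) - 1),
        include_lowest=True
    )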
    
    # Create bin metadata table
    bins_metadata = pd.DataFrame({
        'bin_idx': range(len(bin_edges) - 1),
        'time_start': bin_edges[:-1],
        'time_end': bin_edges[1:]
    })
    
    bins_metadata['time_mid'] = bins_metadata['time_start'] + (bins_metadata['time_end'] - bins_metadata['time_start']) / 2
    
    # Join bin metadata to tweets
    tweets_gdf['time_bin_idx'] = tweets_gdf['time_bin_idx'].astype(float)  # Handle NaN
    tweets_with_bins = tweets_gdf.merge(
        bins_metadata,
        left_on='time_bin_idx',
        right_on='bin_idx',
        how='left'
    )
    
    # Report bin distribution (groupby skips empty bins, so min/mean cover non-empty bins only)
    bin_counts = tweets_with_bins.groupby('bin_idx').size()
    print(f"  Tweets per bin: min={bin_counts.min()}, max={bin_counts.max()}, mean={bin_counts.mean():.1f}")
    
    return tweets_with_bins, bins_metadata


# Execute temporal binning
print("=" * 80)
print("MODULE 3: TEMPORAL BINNING")
print("=" * 80)

print("\nBinning Helene tweets...")
helene_binned, helene_bins = create_time_bins(
    helene_spatial,
    'Helene',
    TIME_CONFIG['helene']['bin_hours']
)

print("\nBinning Francine tweets...")
francine_binned, francine_bins = create_time_bins(
    francine_spatial,
    'Francine',
    TIME_CONFIG['francine']['bin_hours']
)

print("\nTemporal binning complete!\n")

MODULE 3: TEMPORAL BINNING

Binning Helene tweets...
Creating 2-hour time bins for Helene...
  Time range: 2024-09-26 02:29:25+00:00 to 2024-09-27 19:59:41+00:00
  Bin range: 2024-09-26 02:00:00+00:00 to 2024-09-27 20:00:00+00:00
  Number of bins: 21
  Tweets per bin: min=57, max=239, mean=143.2

Binning Francine tweets...
Creating 6-hour time bins for Francine...
  Time range: 2024-09-09 11:00:36+00:00 to 2024-09-16 15:24:14+00:00
  Bin range: 2024-09-09 06:00:00+00:00 to 2024-09-16 18:00:00+00:00
  Number of bins: 30
  Tweets per bin: min=1, max=445, mean=79.4

Temporal binning complete!
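
The bin ranges and bin counts reported above follow directly from flooring the first timestamp and ceiling the last one to the bin width. The sketch below reproduces that arithmetic with the standard library only. The helper names `bin_range` and `left_closed_bin` are hypothetical, and `left_closed_bin` deliberately uses left-closed bins `[edge, next_edge)`, which differs from the right-closed intervals that `pd.cut` produces by default.

```python
import math
from datetime import datetime, timedelta, timezone


def bin_range(t_min, t_max, bin_hours):
    """Return (start, end, n_bins) with edges on multiples of bin_hours since the Unix epoch."""
    step = bin_hours * 3600
    start_s = math.floor(t_min.timestamp() / step) * step
    end_s = math.ceil(t_max.timestamp() / step) * step
    start = datetime.fromtimestamp(start_s, tz=timezone.utc)
    end = datetime.fromtimestamp(end_s, tz=timezone.utc)
    return start, end, (end_s - start_s) // step


def left_closed_bin(t, start, bin_hours):
    """Index of the left-closed bin [edge, next_edge) that contains t."""
    return (t - start) // timedelta(hours=bin_hours)


utc = timezone.utc
# Time ranges copied from the printed output above.
helene = bin_range(datetime(2024, 9, 26, 2, 29, 25, tzinfo=utc),
                   datetime(2024, 9, 27, 19, 59, 41, tzinfo=utc), 2)
francine = bin_range(datetime(2024, 9, 9, 11, 0, 36, tzinfo=utc),
                     datetime(2024, 9, 16, 15, 24, 14, tzinfo=utc), 6)

print(helene[2])    # 21
print(francine[2])  # 30

# A tweet stamped exactly on the second edge goes to bin 1 under this convention.
print(left_closed_bin(helene[0] + timedelta(hours=2), helene[0], 2))  # 1
```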



## 6. Aggregation Module

Aggregate tweets to state and county polygons by time bin.

In [6]:
def aggregate_to_states(tweets_gdf, states_ref, hurricane_name):
    """
    Aggregate tweets to state boundaries by time bin.
    
    Parameters:
    -----------
    tweets_gdf : GeoDataFrame
        Tweets with spatial and temporal assignments
    states_ref : GeoDataFrame
        State boundary polygons
    hurricane_name : str
        Hurricane name
    
    Returns:
    --------
    GeoDataFrame with aggregated state-level features
    """
    print(f"Aggregating {hurricane_name} tweets to states...")
    
    # Filter to tweets with state assignments
    tweets_with_state = tweets_gdf[tweets_gdf['state_name'].notna()].copy()
    print(f"  {len(tweets_with_state)} tweets with state assignments")
    
    # Group by state and time bin
    agg_dict = {
        'time': 'count',  # Count tweets
        'GPE': lambda x: ', '.join(sorted(set(item.strip() for s in x for item in str(s).split(',') if item.strip()))),  # Collect unique GPE mentions in a stable (sorted) order
    }
    
    state_timeseries = tweets_with_state.groupby(
        ['state_name', 'state_abbr', 'bin_idx', 'time_start', 'time_end', 'time_mid'],
        as_index=False
    ).agg(agg_dict)
    
    # Rename aggregated columns
    state_timeseries = state_timeseries.rename(columns={'time': 'tweet_count'})
    state_timeseries = state_timeseries.rename(columns={'GPE': 'mentioned_entities'})
    
    # Join with state geometries
    state_features = states_ref[['NAME', 'STUSPS', 'geometry', 'area_sqkm']].merge(
        state_timeseries,
        left_on='NAME',
        right_on='state_name',
        how='inner'
    )
    
    # Calculate density
    state_features['tweets_per_1000sqkm'] = (state_features['tweet_count'] / state_features['area_sqkm']) * 1000
    
    # Add hurricane label
    state_features['hurricane'] = hurricane_name
    
    # Convert to GeoDataFrame
    state_gdf = gpd.GeoDataFrame(state_features, geometry='geometry', crs=TARGET_CRS)
    
    # Select final columns
    final_cols = [
        'state_name', 'state_abbr', 'hurricane',
        'time_start', 'time_end', 'time_mid',
        'tweet_count', 'tweets_per_1000sqkm', 'area_sqkm',
        'mentioned_entities', 'geometry'
    ]
    state_gdf = state_gdf[final_cols]
    
    print(f"  Created {len(state_gdf)} state-time features")
    print(f"  States represented: {state_gdf['state_name'].nunique()}")
    print(f"  Time bins: {state_gdf['time_start'].nunique()}")
    
    return state_gdf


def aggregate_to_counties(tweets_gdf, counties_ref, hurricane_name):
    """
    Aggregate tweets to county boundaries by time bin.
    
    Parameters:
    -----------
    tweets_gdf : GeoDataFrame
        Tweets with spatial and temporal assignments
    counties_ref : GeoDataFrame
        County boundary polygons
    hurricane_name : str
        Hurricane name
    
    Returns:
    --------
    GeoDataFrame with aggregated county-level features
    """
    print(f"Aggregating {hurricane_name} tweets to counties...")
    
    # Filter to tweets with county assignments
    tweets_with_county = tweets_gdf[tweets_gdf['county_name'].notna()].copy()
    print(f"  {len(tweets_with_county)} tweets with county assignments")
    
    # Group by county and time bin
    agg_dict = {
        'time': 'count',
        'GPE': lambda x: ', '.join(sorted(set(item.strip() for s in x for item in str(s).split(',') if item.strip()))),  # Collect unique GPE mentions in a stable (sorted) order
    }
    
    county_timeseries = tweets_with_county.groupby(
        ['county_geoid', 'county_name', 'state_name', 'state_abbr', 'bin_idx', 'time_start', 'time_end', 'time_mid'],
        as_index=False
    ).agg(agg_dict)
    
    # Rename aggregated columns
    county_timeseries = county_timeseries.rename(columns={'time': 'tweet_count'})
    county_timeseries = county_timeseries.rename(columns={'GPE': 'mentioned_entities'})
    
    # Join with county geometries
    county_features = counties_ref[['GEOID', 'NAME', 'STATE_NAME', 'geometry', 'area_sqkm']].merge(
        county_timeseries,
        left_on='GEOID',
        right_on='county_geoid',
        how='inner'
    )
    
    # Calculate density
    county_features['tweets_per_1000sqkm'] = (county_features['tweet_count'] / county_features['area_sqkm']) * 1000
    
    # Add hurricane label
    county_features['hurricane'] = hurricane_name
    
    # Convert to GeoDataFrame
    county_gdf = gpd.GeoDataFrame(county_features, geometry='geometry', crs=TARGET_CRS)
    
    # Select final columns
    final_cols = [
        'county_name', 'state_name', 'state_abbr', 'hurricane',
        'time_start', 'time_end', 'time_mid',
        'tweet_count', 'tweets_per_1000sqkm', 'area_sqkm',
        'mentioned_entities', 'geometry'
    ]
    county_gdf = county_gdf[final_cols]
    
    print(f"  Created {len(county_gdf)} county-time features")
    # Note: county names repeat across states, so this counts distinct names, not distinct counties
    print(f"  Counties represented: {county_gdf['county_name'].nunique()}")
    print(f"  Time bins: {county_gdf['time_start'].nunique()}")
    
    return county_gdf


# Execute aggregation
print("=" * 80)
print("MODULE 4: AGGREGATION")
print("=" * 80)

print("\nAggregating Helene data...")
helene_states_agg = aggregate_to_states(helene_binned, states_ref, 'Helene')
helene_counties_agg = aggregate_to_counties(helene_binned, counties_ref, 'Helene')

print("\nAggregating Francine data...")
francine_states_agg = aggregate_to_states(francine_binned, states_ref, 'Francine')
francine_counties_agg = aggregate_to_counties(francine_binned, counties_ref, 'Francine')

print("\nAggregation complete!\n")

MODULE 4: AGGREGATION

Aggregating Helene data...
Aggregating Helene tweets to states...
  2989 tweets with state assignments
  Created 131 state-time features
  States represented: 10
  Time bins: 21
Aggregating Helene tweets to counties...
  2989 tweets with county assignments
  Created 561 county-time features
  Counties represented: 128
  Time bins: 21

Aggregating Francine data...
Aggregating Francine tweets to states...
  2303 tweets with state assignments
  Created 129 state-time features
  States represented: 10
  Time bins: 29
Aggregating Francine tweets to counties...
  2303 tweets with county assignments
  Created 381 county-time features
  Counties represented: 83
  Time bins: 29

Aggregation complete!
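
The `mentioned_entities` field is built by splitting comma-separated GPE strings, trimming whitespace, and de-duplicating the result within each polygon-time group. The sketch below isolates that logic as a plain function with a sorted output order, so the field stays identical across runs. The name `collect_entities` and the sample values are hypothetical.

```python
def collect_entities(values):
    """Split comma-separated mentions, drop blanks, and return sorted unique names."""
    names = {
        item.strip()
        for value in values
        for item in str(value).split(",")
        if item.strip()
    }
    return ", ".join(sorted(names))


# Hypothetical GPE values for one polygon-time group, including an empty string
# like the ones produced by fillna('') during loading.
sample = ["Tampa, Florida", "", "Florida", " Georgia ,Tampa"]
print(collect_entities(sample))  # Florida, Georgia, Tampa
```

A function like this could be passed to `.agg()` in place of the inline lambda, which also removes the duplication between the state and county aggregations.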



## 7. Output Generation Module

Export aggregated features to GeoPackage format for ArcGIS Pro.

In [7]:
def export_to_geopackage(layer_dict, output_path):
    """
    Export multiple layers to a single GeoPackage.
    
    Parameters:
    -----------
    layer_dict : dict
        Dictionary mapping layer names to GeoDataFrames
    output_path : Path
        Output GeoPackage path
    """
    print(f"Exporting layers to {output_path}...")
    
    # Remove existing file if present
    if output_path.exists():
        output_path.unlink()
        print(f"  Removed existing file")
    
    for layer_name, gdf in layer_dict.items():
        print(f"  Writing layer '{layer_name}': {len(gdf)} features...")
        
        # Ensure datetime columns are properly formatted
        for col in ['time_start', 'time_end', 'time_mid']:
            if col in gdf.columns:
                gdf[col] = pd.to_datetime(gdf[col])
        
        # Write to GeoPackage
        gdf.to_file(output_path, layer=layer_name, driver='GPKG')
    
    print(f"\n  Successfully created {output_path}")
    print(f"  File size: {output_path.stat().st_size / 1024 / 1024:.2f} MB")


# Execute export
print("=" * 80)
print("MODULE 5: OUTPUT GENERATION")
print("=" * 80)

output_layers = {
    'helene_states_timeseries': helene_states_agg,
    'helene_counties_timeseries': helene_counties_agg,
    'francine_states_timeseries': francine_states_agg,
    'francine_counties_timeseries': francine_counties_agg
}

export_to_geopackage(output_layers, OUTPUT_GPKG)

print("\nOutput generation complete!\n")

MODULE 5: OUTPUT GENERATION
Exporting layers to C:\Users\colto\Documents\GitHub\Tweet_project\output\hurricane_analysis_output.gpkg...
  Writing layer 'helene_states_timeseries': 131 features...
  Writing layer 'helene_counties_timeseries': 561 features...
  Writing layer 'francine_states_timeseries': 129 features...
  Writing layer 'francine_counties_timeseries': 381 features...

  Successfully created C:\Users\colto\Documents\GitHub\Tweet_project\output\hurricane_analysis_output.gpkg
  File size: 2.09 MB

Output generation complete!
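
A GeoPackage is a SQLite database, and its `gpkg_contents` table records every layer it holds. That makes it possible to confirm which layers were written without opening GIS software. The sketch below is one way to do this with the standard library; the helper name `list_gpkg_layers` is hypothetical, and the demo runs against a stand-in database that mimics only the two columns queried, not against the real output file.

```python
import sqlite3
import tempfile
from pathlib import Path


def list_gpkg_layers(gpkg_path):
    """Return the feature-layer names recorded in a GeoPackage's gpkg_contents table."""
    con = sqlite3.connect(gpkg_path)
    try:
        rows = con.execute(
            "SELECT table_name FROM gpkg_contents "
            "WHERE data_type = 'features' ORDER BY table_name"
        ).fetchall()
    finally:
        con.close()
    return [name for (name,) in rows]


# Demo on a stand-in database (assumption: only the columns used above are created).
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.gpkg"
    con = sqlite3.connect(demo)
    con.execute("CREATE TABLE gpkg_contents (table_name TEXT, data_type TEXT)")
    con.executemany(
        "INSERT INTO gpkg_contents VALUES (?, ?)",
        [
            ("helene_states_timeseries", "features"),
            ("francine_counties_timeseries", "features"),
            ("lookup_table", "attributes"),
        ],
    )
    con.commit()
    con.close()
    layers = list_gpkg_layers(demo)

print(layers)  # ['francine_counties_timeseries', 'helene_states_timeseries']
```

Calling `list_gpkg_layers(OUTPUT_GPKG)` after the export cell would be expected to return the four layer names listed in the overview.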



## 8. Summary Statistics and Verification

In [8]:
print("=" * 80)
print("SUMMARY STATISTICS")
print("=" * 80)

print("\n" + "=" * 40)
print("HELENE (Hurricane)")
print("=" * 40)

print(f"\nState-level aggregates:")
print(f"  Total features: {len(helene_states_agg)}")
print(f"  States affected: {helene_states_agg['state_name'].nunique()}")
print(f"  Time bins: {helene_states_agg['time_start'].nunique()}")
print(f"  Total tweet count: {helene_states_agg['tweet_count'].sum()}")
print(f"  Top 5 states by total tweets:")
top_helene_states = helene_states_agg.groupby('state_name')['tweet_count'].sum().sort_values(ascending=False).head(5)
for state, count in top_helene_states.items():
    print(f"    {state}: {count}")

print(f"\nCounty-level aggregates:")
print(f"  Total features: {len(helene_counties_agg)}")
print(f"  Counties affected: {helene_counties_agg['county_name'].nunique()}")
print(f"  Time bins: {helene_counties_agg['time_start'].nunique()}")
print(f"  Total tweet count: {helene_counties_agg['tweet_count'].sum()}")
print(f"  Top 5 counties by total tweets:")
top_helene_counties = helene_counties_agg.groupby(['county_name', 'state_abbr'])['tweet_count'].sum().sort_values(ascending=False).head(5)
for (county, state), count in top_helene_counties.items():
    print(f"    {county}, {state}: {count}")

print("\n" + "=" * 40)
print("FRANCINE (Hurricane)")
print("=" * 40)

print(f"\nState-level aggregates:")
print(f"  Total features: {len(francine_states_agg)}")
print(f"  States affected: {francine_states_agg['state_name'].nunique()}")
print(f"  Time bins: {francine_states_agg['time_start'].nunique()}")
print(f"  Total tweet count: {francine_states_agg['tweet_count'].sum()}")
print(f"  Top 5 states by total tweets:")
top_francine_states = francine_states_agg.groupby('state_name')['tweet_count'].sum().sort_values(ascending=False).head(5)
for state, count in top_francine_states.items():
    print(f"    {state}: {count}")

print(f"\nCounty-level aggregates:")
print(f"  Total features: {len(francine_counties_agg)}")
print(f"  Counties affected: {francine_counties_agg['county_name'].nunique()}")
print(f"  Time bins: {francine_counties_agg['time_start'].nunique()}")
print(f"  Total tweet count: {francine_counties_agg['tweet_count'].sum()}")
print(f"  Top 5 counties by total tweets:")
top_francine_counties = francine_counties_agg.groupby(['county_name', 'state_abbr'])['tweet_count'].sum().sort_values(ascending=False).head(5)
for (county, state), count in top_francine_counties.items():
    print(f"    {county}, {state}: {count}")

print("\n" + "=" * 80)
print("VERIFICATION COMPLETE")
print("=" * 80)

SUMMARY STATISTICS

HELENE (Hurricane)

State-level aggregates:
  Total features: 131
  States affected: 10
  Time bins: 21
  Total tweet count: 2989
  Top 5 states by total tweets:
    Florida: 2274
    Georgia: 446
    North Carolina: 101
    South Carolina: 70
    Tennessee: 23

County-level aggregates:
  Total features: 561
  Counties affected: 128
  Time bins: 21
  Total tweet count: 2989
  Top 5 counties by total tweets:
    Polk, FL: 1484
    Dodge, GA: 147
    Leon, FL: 142
    Hillsborough, FL: 138
    Bacon, GA: 133

FRANCINE (Hurricane)

State-level aggregates:
  Total features: 129
  States affected: 10
  Time bins: 29
  Total tweet count: 2303
  Top 5 states by total tweets:
    Louisiana: 2025
    Mississippi: 83
    Florida: 65
    Texas: 47
    Tennessee: 23

County-level aggregates:
  Total features: 381
  Counties affected: 83
  Time bins: 29
  Total tweet count: 2303
  Top 5 counties by total tweets:
    Avoyelles, LA: 1201
    Orleans, LA: 267
    Terrebonne, LA: 13
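
The headline figures printed across the modules can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers copied from the cell outputs in this notebook (tweets loaded, tweets matched to a state, tweets outside all boundaries, and non-empty time bins); the dictionary layout is an illustrative choice.

```python
# Figures copied from the printed cell outputs in this notebook.
loaded = {"Helene": 3007, "Francine": 2303}
matched = {"Helene": 2989, "Francine": 2303}
outside = {"Helene": 18, "Francine": 0}
nonempty_bins = {"Helene": 21, "Francine": 29}

summary = {}
for storm in loaded:
    # Every loaded tweet is either matched to a state or outside all boundaries.
    assert matched[storm] + outside[storm] == loaded[storm]
    pct_matched = matched[storm] / loaded[storm] * 100
    mean_per_bin = loaded[storm] / nonempty_bins[storm]
    summary[storm] = (f"{pct_matched:.1f}", f"{mean_per_bin:.1f}")
    print(f"{storm}: {pct_matched:.1f}% matched, "
          f"{mean_per_bin:.1f} tweets per non-empty bin")
```

The results agree with the match rates reported by the spatial join (99.4% and 100.0%) and the per-bin means reported by the binning step (143.2 and 79.4).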

## 9. ArcGIS Pro Usage Instructions

### Loading the Data in ArcGIS Pro

1. **Add Data to Map**
   - Open ArcGIS Pro
   - Create new Map view
   - Add Data → Browse to `hurricane_analysis_output.gpkg`
   - Select desired layer (state or county level, Helene or Francine)

2. **Enable Time**
   - Right-click layer → Properties
   - Navigate to the Time tab
   - Set **Layer Time** to `Each feature has start and end time fields` (exact labels can vary between ArcGIS Pro versions)
   - Configure:
     - **Start Time Field**: `time_start`
     - **End Time Field**: `time_end`
   - Click OK

3. **Symbolize by Tweet Count**
   - Right-click layer → Symbology
   - Choose **Graduated Colors**
   - Field: `tweet_count` (or `tweets_per_1000sqkm` for density)
   - Method: Natural Breaks (Jenks) or Quantile
   - Color Scheme: Yellow-Orange-Red (sequential)
   - Adjust class breaks as needed

4. **Activate Time Slider**
   - The time slider appears in the map view once a layer is time-enabled; its settings are on the Time tab of the ribbon
   - Configure:
     - Span: Current Time Extent
     - Step Interval: Matches your bins (2 hours for Helene, 6 hours for Francine)
   - Click Play to animate

### Interpretation

- **State-level layers**: Show broad regional patterns, good for overview
- **County-level layers**: Show detailed local hotspots
- **tweet_count**: Absolute number of tweets (strongly influenced by population and polygon size)
- **tweets_per_1000sqkm**: Area-normalized density (better for comparing polygons of different sizes; it does not correct for population)
- **mentioned_entities**: Text field showing all GPE entities mentioned in tweets for that polygon-time combination

### Recommended Analysis Workflows

1. **Identify Peak Impact Times**: Use time slider to find bins with highest counts
2. **Compare Hurricanes**: Load both Francine and Helene layers to compare spatial footprints
3. **Hotspot Analysis**: Use county-level data with the Emerging Hot Spot Analysis tool (it takes a space-time cube as input, so the layer must first be converted to one)
4. **Export Maps**: Create map series showing evolution over key time periods

---

## Technical Notes

- **CRS**: EPSG:4269 (NAD83) - matches Census boundaries
- **Temporal Binning**: Adaptive (2h for Helene, 6h for Francine)
- **Spatial Assignment**: Spatial join on tweet coordinates; the GPE field is kept as a descriptive attribute only
- **Area Calculations**: Computed in EPSG:5070 (Albers Equal Area) for accuracy
- **Density Metric**: Tweets per 1,000 sq km (scale chosen for readability)
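
As a worked example of the density metric, the sketch below applies the same formula used in the aggregation cells (tweet count divided by area in square kilometers, times 1,000). The function name and the input values are hypothetical; the guard against non-positive areas is an added assumption, not something the notebook does.

```python
def tweets_per_1000sqkm(tweet_count, area_sqkm):
    """Tweets per 1,000 square kilometers for one polygon-time feature."""
    if area_sqkm <= 0:
        raise ValueError("area_sqkm must be positive")
    return tweet_count / area_sqkm * 1000


# Hypothetical polygon: 50 tweets in a 5,000 sq km county during one time bin.
density = tweets_per_1000sqkm(50, 5000)
print(f"{density:.1f}")  # 10.0
```

Because the metric divides by area only, a small urban county and a large rural one with equal tweet counts will show very different densities.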

---

## Pipeline Complete

Output file: `hurricane_analysis_output.gpkg` contains 4 layers ready for ArcGIS Pro time-enabled visualization.

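Re-run all cells to regenerate the outputs. The notebook depends on the input files being present at the paths set in Section 2, so adjust `DATA_DIR` and `OUTPUT_PATH` when running on another machine.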