# Data Source Exploration - Phase 1

This notebook explores the available data sources for the LA Healthcare Access Mapping project.

## Objectives
1. Test API connections to verify access
2. Examine data structures and available fields
3. Download sample data for initial exploration
4. Document data quality considerations

## Setup

In [None]:
import pandas as pd
import numpy as np
import requests
import json
from pathlib import Path
import os
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Set up paths
project_root = Path.cwd().parent
data_raw = project_root / 'data' / 'raw'
data_raw.mkdir(parents=True, exist_ok=True)

print("✓ Imports successful")
print(f"Project root: {project_root}")

## 1. California Health Facilities Data

**Source**: CA Dept of Public Health via data.chhs.ca.gov  
**Dataset**: Licensed and Certified Healthcare Facility Listing  
**Last Updated**: January 28, 2026  
**Update Frequency**: Monthly

In [None]:
# Direct CSV download URL
ca_facilities_url = "https://data.chhs.ca.gov/dataset/3b5b80e8-6b8d-4715-b3c0-2699af6e72e5/resource/f0ae5731-fef8-417f-839d-54a0ed3a126e/download/health_facility_locations.csv"

print("Fetching California healthcare facilities data...")
try:
    ca_facilities = pd.read_csv(ca_facilities_url)
    print(f"✓ Downloaded {len(ca_facilities)} facilities")
    print(f"Columns: {ca_facilities.shape[1]}")
    
    # Filter to LA County
    la_facilities = ca_facilities[ca_facilities['COUNTY_NAME'].str.contains('Los Angeles', case=False, na=False)]
    print(f"✓ Filtered to {len(la_facilities)} LA County facilities")
    
    # Save to raw data
    output_file = data_raw / 'ca_health_facilities_full.csv'
    ca_facilities.to_csv(output_file, index=False)
    print(f"✓ Saved full dataset to {output_file}")
    
except Exception as e:
    print(f"Error: {e}")
    la_facilities = None

In [None]:
# Explore the data structure
if la_facilities is not None:
    print("\n=== Data Structure ===")
    print(f"Shape: {la_facilities.shape}")
    print(f"\nColumn Names:\n{list(la_facilities.columns)}")
    
    # Display first few rows
    print("\n=== Sample Data ===")
    display(la_facilities.head())
    
    # Check facility types
    if 'FACILITY_TYPE' in la_facilities.columns:
        print("\n=== Facility Types ===")
        print(la_facilities['FACILITY_TYPE'].value_counts())
    
    # Check for location data
    lat_cols = [col for col in la_facilities.columns if 'lat' in col.lower()]
    lon_cols = [col for col in la_facilities.columns if 'lon' in col.lower()]
    print("\n=== Location Columns ===")
    print(f"Potential latitude columns: {lat_cols}")
    print(f"Potential longitude columns: {lon_cols}")

In [None]:
# Check data quality
if la_facilities is not None:
    print("\n=== Data Quality Assessment ===")
    
    # Missing values
    missing = la_facilities.isnull().sum()
    missing_pct = (missing / len(la_facilities)) * 100
    missing_df = pd.DataFrame({
        'Missing Count': missing,
        'Missing %': missing_pct
    }).sort_values('Missing %', ascending=False)
    
    print("\nColumns with missing data:")
    display(missing_df[missing_df['Missing Count'] > 0].head(10))
    
    # Check for duplicate facility names
    if 'FACILITY_NAME' in la_facilities.columns:
        duplicates = la_facilities['FACILITY_NAME'].duplicated().sum()
        print(f"\nDuplicate facility names: {duplicates}")

## 2. US Census Bureau API - Test Connection

**Source**: American Community Survey 5-Year Estimates  
**Geographic Level**: Census Tracts in LA County  
**Latest**: 2020-2024 estimates

In [None]:
# Check for Census API key
census_api_key = os.getenv('CENSUS_API_KEY')

if not census_api_key:
    print("⚠ Census API key not found!")
    print("Get a free key at: https://api.census.gov/data/key_signup.html")
    print("Add to .env file: CENSUS_API_KEY=your_key_here")
else:
    print("✓ Census API key found")

In [None]:
# Test Census API with a simple query
if census_api_key:
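    # LA County FIPS: State 06, County 037
    # Note: the 2022 endpoint serves the 2018-2022 ACS 5-year estimates;
    # change the year in the URL to query a different vintage.
    base_url = "https://api.census.gov/data/2022/acs/acs5"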
    
    # Request basic variables
    variables = [
        'NAME',  # Geographic name
        'B01003_001E',  # Total population
        'B19013_001E',  # Median household income
        'B01002_001E'   # Median age
    ]
    
    params = {
        'get': ','.join(variables),
        'for': 'tract:*',  # All census tracts
        'in': 'state:06 county:037',  # LA County
        'key': census_api_key
    }
    
    try:
        print("Testing Census API...")
        response = requests.get(base_url, params=params, timeout=30)
        response.raise_for_status()
        
        data = response.json()
        census_df = pd.DataFrame(data[1:], columns=data[0])
        
        print(f"✓ Successfully retrieved {len(census_df)} census tracts")
        
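        # Convert numeric columns; unavailable ACS estimates are coded as -666666666,
        # so replace that sentinel with NaN to keep it out of summary statistics
        for col in ['B01003_001E', 'B19013_001E', 'B01002_001E']:
            census_df[col] = pd.to_numeric(census_df[col], errors='coerce').replace(-666666666, np.nan)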
        
        # Create GEOID
        census_df['GEOID'] = census_df['state'] + census_df['county'] + census_df['tract']
        
        print("\n=== Sample Census Data ===")
        display(census_df.head())
        
        # Save sample
        output_file = data_raw / 'census_sample_la_county.csv'
        census_df.to_csv(output_file, index=False)
        print(f"\n✓ Saved sample census data to {output_file}")
        
    except requests.exceptions.RequestException as e:
        print(f"Error accessing Census API: {e}")
        census_df = None
else:
    census_df = None

In [None]:
# Analyze census data
if census_df is not None:
    print("\n=== Census Data Summary ===")
    
    # Population statistics
    print(f"\nTotal LA County Population: {census_df['B01003_001E'].sum():,.0f}")
    print(f"Average tract population: {census_df['B01003_001E'].mean():,.0f}")
    print(f"Median tract population: {census_df['B01003_001E'].median():,.0f}")
    
    # Income statistics (median of tract-level medians, not the county-wide median)
    print(f"\nMedian of tract median household incomes: ${census_df['B19013_001E'].median():,.0f}")
    print(f"Income range: ${census_df['B19013_001E'].min():,.0f} - ${census_df['B19013_001E'].max():,.0f}")
    
    # Age statistics (median of tract-level median ages)
    print(f"\nMedian of tract median ages: {census_df['B01002_001E'].median():.1f} years")

## 3. Download Census TIGER Shapefiles

**Note**: TIGER shapefiles are large. We'll document the download URLs for manual download.  
**LA County**: State 06, County 037

In [None]:
# Document TIGER/Line shapefile URLs for LA County
tiger_urls = {
    'Census Tracts 2023': 'https://www2.census.gov/geo/tiger/TIGER2023/TRACT/tl_2023_06_tract.zip',
    'LA County Boundary 2023': 'https://www2.census.gov/geo/tiger/TIGER2023/COUNTY/tl_2023_us_county.zip',
    'ZCTAs 2023': 'https://www2.census.gov/geo/tiger/TIGER2023/ZCTA520/tl_2023_us_zcta520.zip'
}

print("\n=== TIGER/Line Shapefile Download URLs ===")
print("\nDownload these files manually and extract to data/external/")
for name, url in tiger_urls.items():
    print(f"\n{name}:")
    print(f"  {url}")

print("\n\nExample download commands:")
print("""\ncd data/external/
wget https://www2.census.gov/geo/tiger/TIGER2023/TRACT/tl_2023_06_tract.zip
unzip tl_2023_06_tract.zip
""")

# Note: We can also use geopandas to filter to LA County after loading California tracts
print("\n📝 Note: After downloading, filter California tracts to LA County (COUNTYFP='037')")

## 4. Data Quality Findings & Recommendations

### Healthcare Facilities Data
- ✓ **Pros**: Comprehensive, includes lat/long, updated monthly, free access
- ⚠ **Considerations**:
  - Check for missing coordinates
  - Verify facility types relevant to urgent care analysis
  - Some facilities may be duplicated across datasets
  - Need to geocode any facilities with missing coordinates
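
The coordinate checks listed above could be sketched as follows. This is a minimal sketch rather than the project's actual validation: the column names `LATITUDE` and `LONGITUDE`, the sample rows, and the bounding box are assumptions for illustration, so substitute the column names reported by the location-column check earlier in this notebook.

```python
import pandas as pd

# Approximate, deliberately generous bounding box around LA County.
# These limits are an assumption for illustration, not an official boundary.
LAT_RANGE = (32.5, 35.0)
LON_RANGE = (-119.5, -117.5)


def flag_coordinate_issues(df, lat_col="LATITUDE", lon_col="LONGITUDE"):
    """Return one row per facility with two boolean flags."""
    lat = pd.to_numeric(df[lat_col], errors="coerce")
    lon = pd.to_numeric(df[lon_col], errors="coerce")

    # A facility needs geocoding if either coordinate is missing or unparseable
    missing = lat.isna() | lon.isna()

    # A facility with coordinates far from LA County is likely mis-geocoded
    in_box = lat.between(*LAT_RANGE) & lon.between(*LON_RANGE)

    return pd.DataFrame(
        {"missing_coords": missing, "outside_box": ~missing & ~in_box},
        index=df.index,
    )


# Hypothetical sample rows, not real facility records
sample_facilities = pd.DataFrame(
    {
        "FACILITY_NAME": ["Clinic A", "Clinic B", "Clinic C"],
        "LATITUDE": [34.05, None, 37.77],
        "LONGITUDE": [-118.25, -118.30, -122.42],
    }
)

flags = flag_coordinate_issues(sample_facilities)
print(flags)
```

Rows flagged `missing_coords` are candidates for geocoding from their street address, while rows flagged `outside_box` are worth reviewing by hand before they are mapped.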

### Census Data
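- ✓ **Pros**: Official source, detailed demographics, tract-level granularity
- ⚠ **Considerations**:
  - Unavailable estimates are coded with the sentinel value -666666666, which must be converted to missing values before computing statistics
  - The latest 5-year estimates cover 2020-2024, but the test query in this notebook uses the 2022 endpoint (2018-2022 estimates); align the vintage before analysis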
  - Need to handle margin of error for ACS estimates
  - API rate limits exist (though generous)
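
One way to handle the sentinel values and margins of error noted above is sketched below. It assumes a margin-of-error column has been requested alongside each estimate (ACS variables ending in `M`, such as `B19013_001M`), and it relies on the fact that ACS margins of error are published at the 90% confidence level. The function name `clean_acs`, the `cv_pct` column, and the sample rows are hypothetical.

```python
import pandas as pd

# Sentinel named in the considerations above; extend the list if other codes appear
ACS_SENTINELS = [-666666666]

# z-score for the 90% confidence level used by published ACS margins of error
Z_90 = 1.645


def clean_acs(df, estimate_col, moe_col):
    """Replace sentinel codes with NaN and add a coefficient of variation (%)."""
    out = df.copy()
    for col in (estimate_col, moe_col):
        values = pd.to_numeric(out[col], errors="coerce")
        out[col] = values.mask(values.isin(ACS_SENTINELS))

    # Standard error = MOE / 1.645; CV = standard error / estimate * 100
    out["cv_pct"] = (out[moe_col] / Z_90) / out[estimate_col] * 100
    return out


# Hypothetical sample rows in the string format the Census API returns
sample_acs = pd.DataFrame(
    {
        "B19013_001E": ["52000", "-666666666", "87500"],
        "B19013_001M": ["4100", None, "9800"],
    }
)

cleaned = clean_acs(sample_acs, "B19013_001E", "B19013_001M")
print(cleaned)
```

A higher `cv_pct` means a less reliable tract estimate. Any cutoff for excluding or flagging tracts is a project decision and is not prescribed here.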

### Geographic Boundaries
- ✓ **Pros**: Standardized format, contains GEOIDs for joining
- ⚠ **Considerations**:
  - Large file sizes (150MB+ for CA tracts)
  - The California tract file must be filtered to LA County (COUNTYFP = '037')
  - TIGER/Line files use NAD83 (EPSG:4269); reproject all layers to a common coordinate reference system (EPSG:4326)
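
Joining census attributes to tract boundaries depends on GEOIDs that match exactly as strings. The sketch below shows one way to rebuild and validate tract GEOIDs; the helper names are hypothetical. It guards against a common pitfall: if the saved census CSV is re-read with default settings, the `state`, `county`, and `tract` columns may be parsed as integers and lose their leading zeros.

```python
# State FIPS 06 (California) + county FIPS 037 (Los Angeles)
LA_COUNTY_PREFIX = "06037"


def make_tract_geoid(state, county, tract):
    """Build an 11-character tract GEOID, restoring any lost leading zeros."""
    return f"{str(state).zfill(2)}{str(county).zfill(3)}{str(tract).zfill(6)}"


def is_la_county_tract(geoid):
    """True if the GEOID is a well-formed tract ID inside LA County."""
    return len(geoid) == 11 and geoid.isdigit() and geoid.startswith(LA_COUNTY_PREFIX)


# Components as they might look after a CSV round trip (integers, zeros dropped)
rebuilt = make_tract_geoid(6, 37, 101110)
print(rebuilt, is_la_county_tract(rebuilt))
```

Reading the CSV with `dtype=str` for these columns avoids the problem at the source. The helper is a fallback for files that have already been saved without their leading zeros.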

### Recommendations
1. **Priority**: Start with the CDPH facility data (from data.chhs.ca.gov) - most comprehensive and current
2. **Geographic Level**: Use census tracts for analysis (good balance of granularity and reliability)
3. **Census Variables**: Focus on ACS 5-year estimates for stability
4. **Data Validation**: Cross-check facility counts between sources
5. **Update Frequency**: Re-download facilities data monthly; census data annually

## 5. Next Steps

✅ **Phase 1 Complete!** You've successfully:
- Verified download access to the healthcare facilities data
- Tested Census Bureau API connection
- Downloaded sample data
- Documented data structures and quality considerations

### Ready for Phase 2: Data Collection

1. **Run data collection scripts**:
   ```bash
   python src/data_collection/fetch_facilities.py
   python src/data_collection/fetch_census_data.py
   ```

2. **Download TIGER shapefiles** (see URLs above)

3. **Create additional exploration notebooks**:
   - Facility type analysis
   - Demographic patterns
   - Initial geographic visualization

4. **Move to Phase 3**: Data cleaning and processing

In [None]:
# Summary of files created
print("\n=== Files Created ===")
for file in data_raw.glob('*'):
    if file.is_file():
        size_mb = file.stat().st_size / (1024 * 1024)
        print(f"  {file.name} ({size_mb:.2f} MB)")