# Phase 1: Data Ingestion - Scrape and Consolidate

## Overview

This notebook downloads monthly crime incident data from the OpenDataPhilly Carto API and consolidates it into a single optimized Parquet file. Run this notebook to refresh your dataset or create an initial dataset for analysis.

### Steps
1. **Scrape**: Download monthly CSVs from the OpenDataPhilly API
2. **Consolidate**: Merge all months into a single DataFrame
3. **Optimize**: Reduce file size by optimizing data types
4. **Save**: Export to Parquet format for efficient storage and loading

### Expected Output
- Processed data file: `data/processed/crime_incidents_combined.parquet`
- File size: ~100-200 MB (compressed with Parquet)
- Records: ~3.5 million (varies by API availability)

## Cell 1: Import and Configure

In [1]:
import sys
import subprocess
from pathlib import Path

# Add project root to path
PROJECT_ROOT = Path.cwd().parent.parent
sys.path.insert(0, str(PROJECT_ROOT))

print(f"Project root: {PROJECT_ROOT}")
print(f"Python version: {sys.version}")

Project root: /Users/dustinober/Projects/Crime Incidents Philadelphia
Python version: 3.14.2 (main, Dec  5 2025, 16:49:16) [Clang 17.0.0 (clang-1700.6.3.2)]
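
`Path.cwd().parent.parent` only resolves to the project root when the notebook is launched from its own folder. A minimal sketch of a more defensive lookup follows. The helper name `find_project_root` and the choice of `scripts` as the marker directory are assumptions for illustration, not part of the project.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "scripts") -> Path:
    """Walk upward from start until a directory containing marker is found.

    Hypothetical helper: the marker name is an assumption about the repo layout.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains a '{marker}' directory")
```

With this in place, `PROJECT_ROOT = find_project_root(Path.cwd())` would give the same result whether the kernel starts in the notebook folder or in the repository root.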


## Cell 2: Run Scraper

This cell calls the scraper script to download monthly crime data from the OpenDataPhilly API.

In [2]:
# Run the scraper
scraper_script = PROJECT_ROOT / "scripts" / "helper" / "scrape.py"

print("Starting data scrape from OpenDataPhilly API...")
print("This may take several minutes depending on API response times.\n")

result = subprocess.run([sys.executable, str(scraper_script)], capture_output=True, text=True)

print(result.stdout)
if result.stderr:
    print("STDERR:", result.stderr)

if result.returncode == 0:
    print("\n✓ Scrape completed successfully")
else:
    print(f"\n✗ Scrape failed with return code {result.returncode}")

Starting data scrape from OpenDataPhilly API...
This may take several minutes depending on API response times.

Downloading 2006-01 (Attempt 1)...
Successfully saved data/raw/philly_crime_data/incidents_2006_01.csv
Downloading 2006-02 (Attempt 1)...
Successfully saved data/raw/philly_crime_data/incidents_2006_02.csv
Downloading 2006-03 (Attempt 1)...
Successfully saved data/raw/philly_crime_data/incidents_2006_03.csv
Downloading 2006-04 (Attempt 1)...
Successfully saved data/raw/philly_crime_data/incidents_2006_04.csv
Downloading 2006-05 (Attempt 1)...
Successfully saved data/raw/philly_crime_data/incidents_2006_05.csv
Downloading 2006-06 (Attempt 1)...
Successfully saved data/raw/philly_crime_data/incidents_2006_06.csv
Downloading 2006-07 (Attempt 1)...
Successfully saved data/raw/philly_crime_data/incidents_2006_07.csv
Downloading 2006-08 (Attempt 1)...
Successfully saved data/raw/philly_crime_data/incidents_2006_08.csv
Downloading 2006-09 (Attempt 1)...
Successfully saved data/raw/p
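
The `(Attempt 1)` suffix in the log suggests that the scraper retries failed downloads. The source of `scrape.py` is not shown in this notebook, so the following is only a sketch of how a per-month download with retries might look. The Carto endpoint, the table name `incidents_part1_part2`, the helper names, the retry count, and the backoff delay are all assumptions to verify against the OpenDataPhilly documentation.

```python
import time
import urllib.parse
import urllib.request
from datetime import date

# Assumed endpoint and table name; verify against the OpenDataPhilly docs.
CARTO_SQL_URL = "https://phl.carto.com/api/v2/sql"
TABLE = "incidents_part1_part2"


def build_month_url(year: int, month: int) -> str:
    """Build a CSV export URL for one calendar month (hypothetical helper)."""
    start = date(year, month, 1)
    # First day of the following month, rolling over in December.
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    query = (
        f"SELECT * FROM {TABLE} "
        f"WHERE dispatch_date_time >= '{start}' "
        f"AND dispatch_date_time < '{end}'"
    )
    return CARTO_SQL_URL + "?" + urllib.parse.urlencode({"q": query, "format": "csv"})


def fetch_with_retries(url, fetch=None, max_attempts=3, base_delay=2.0):
    """Call fetch(url) until it succeeds or max_attempts is exhausted."""
    if fetch is None:
        def fetch(u):
            with urllib.request.urlopen(u, timeout=60) as resp:
                return resp.read()
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except OSError as exc:  # URLError and timeouts are OSError subclasses
            if attempt == max_attempts:
                raise
            print(f"Attempt {attempt} failed ({exc}); retrying...")
            # Exponential backoff: base_delay, 2x, 4x, ...
            time.sleep(base_delay * 2 ** (attempt - 1))
```

Passing `fetch` as a parameter keeps the retry logic testable without network access. In normal use, `fetch_with_retries(build_month_url(2006, 1))` would return the raw CSV bytes for January 2006.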

## Cell 3: Run Consolidation and Optimization

This cell consolidates all monthly CSV files into a single optimized Parquet file.

In [7]:
# Run the consolidation script
consolidate_script = PROJECT_ROOT / "scripts" / "helper" / "csv_to_parquet.py"

print("Consolidating CSV files and optimizing...")
print("This may take a few minutes for large datasets.\n")

result = subprocess.run([sys.executable, str(consolidate_script)], capture_output=True, text=True)

print(result.stdout)
if result.stderr:
    print("STDERR:", result.stderr)

if result.returncode == 0:
    print("\n✓ Consolidation completed successfully")
else:
    print(f"\n✗ Consolidation failed with return code {result.returncode}")

Consolidating CSV files and optimizing...
This may take a few minutes for large datasets.

CSV to Parquet Conversion

Found 239 CSV files to process
CSV Directory: /Users/dustinober/Projects/Crime Incidents Philadelphia/notebooks/phase_01_data_ingestion/data/raw/philly_crime_data
Output File: /Users/dustinober/Projects/Crime Incidents Philadelphia/scripts/data/processed/crime_incidents_combined.parquet

Reading CSV files...
  [1/239] Processing: incidents_2006_01.csv
  [11/239] Processing: incidents_2006_11.csv
  Chunk complete: 20 files read, combining...
  [21/239] Processing: incidents_2007_09.csv
  Chunk complete: 21 files read, combining...
  Chunk complete: 22 files read, combining...
  Chunk complete: 23 files read, combining...
  Chunk complete: 24 files read, combining...
  Chunk complete: 25 files read, combining...
  Chunk complete: 26 files read, combining...
  Chunk complete: 27 files read, combining...
  Chunk complete: 28 files read, combining...
  Chunk complete: 29 fil
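
The internals of `csv_to_parquet.py` are not shown here. As a rough illustration of what "optimizing data types" can mean in pandas, the sketch below downcasts integer columns and converts low-cardinality string columns to categoricals. The function name `optimize_dtypes`, the `max_unique_ratio` threshold, and the toy data are assumptions, not the script's actual logic.

```python
import pandas as pd


def optimize_dtypes(df: pd.DataFrame, max_unique_ratio: float = 0.5) -> pd.DataFrame:
    """Return a copy of df with more compact dtypes (illustrative only)."""
    out = df.copy()
    for col in out.columns:
        series = out[col]
        if pd.api.types.is_integer_dtype(series):
            # Pick the smallest integer type that can hold the column's values.
            out[col] = pd.to_numeric(series, downcast="integer")
        elif pd.api.types.is_string_dtype(series) or pd.api.types.is_object_dtype(series):
            # Repeated strings are stored once each when encoded as a categorical.
            if series.nunique() / max(len(series), 1) <= max_unique_ratio:
                out[col] = series.astype("category")
    return out


# Toy data only; the column names mirror two of the columns listed in Cell 4.
sample = pd.DataFrame(
    {
        "dc_dist": [1, 2, 3, 1] * 250,
        "text_general_code": ["Thefts", "Vandalism", "Fraud", "Thefts"] * 250,
    }
)
optimized = optimize_dtypes(sample)
print(sample.memory_usage(deep=True).sum(), "->", optimized.memory_usage(deep=True).sum())
```

The optimized frame could then be written with `DataFrame.to_parquet`, which requires a Parquet engine such as `pyarrow` to be installed.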

## Cell 4: Verify Output

In [4]:
import pandas as pd
from src.data import loader

# Load the consolidated data
try:
    df = loader.load_crime_data()
    print(f"✓ Successfully loaded {len(df):,} crime records")
    print(f"\nDataFrame shape: {df.shape}")
    print(f"\nColumns: {list(df.columns)}")
    print(f"\nData types:\n{df.dtypes}")
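    print("\nFirst few rows:")
    print(df.head())
    # The column list above has no 'date' column; use the parsed dispatch timestamp
    print(f"\nDate range: {df['dispatch_date_time'].min()} to {df['dispatch_date_time'].max()}")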
except Exception as e:
    print(f"✗ Error loading data: {e}")

✓ Successfully loaded 3,496,353 crime records

DataFrame shape: (3496353, 16)

Columns: ['the_geom', 'cartodb_id', 'the_geom_webmercator', 'objectid', 'dc_dist', 'psa', 'dispatch_date_time', 'dispatch_date', 'dispatch_time', 'hour', 'dc_key', 'location_block', 'ucr_general', 'text_general_code', 'point_x', 'point_y']

Data types:
the_geom                dictionary<values=string, indices=int32, order...
cartodb_id                                                 int64[pyarrow]
the_geom_webmercator    dictionary<values=string, indices=int32, order...
objectid                                                   int64[pyarrow]
dc_dist                                                    int64[pyarrow]
psa                     dictionary<values=string, indices=int8, ordere...
dispatch_date_time                         timestamp[ns, tz=UTC][pyarrow]
dispatch_date           dictionary<values=string, indices=int16, order...
dispatch_time           dictionary<values=string, indices=int16, order...
ho

## Summary

✓ **Data ingestion complete!** The consolidated dataset is ready for analysis.

### What's Next?
- Proceed to **Phase 2: Exploration** (`phase_02_exploration/01_data_overview.ipynb`) to understand the data structure and quality
- Or skip directly to later phases if you've already completed exploration

### Data Location
- **Consolidated file**: `data/processed/crime_incidents_combined.parquet`
- **Raw monthly CSVs**: `data/raw/philly_crime_data/` (one file per month, named `incidents_YYYY_MM.csv`)
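
Because the scraper log shows one `incidents_YYYY_MM.csv` file per month, a quick completeness check is to list the months that have no file. This is a minimal sketch; the helper names are hypothetical, and `raw_dir` should point at whichever directory the scraper actually wrote to.

```python
from datetime import date
from pathlib import Path


def expected_month_files(start: date, end: date) -> list[str]:
    """List incidents_YYYY_MM.csv names for every month from start to end inclusive."""
    names = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        names.append(f"incidents_{year}_{month:02d}.csv")
        # Advance one month, rolling over to January of the next year.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return names


def missing_month_files(raw_dir: Path, start: date, end: date) -> list[str]:
    """Return expected file names that are not present in raw_dir."""
    present = {p.name for p in raw_dir.glob("incidents_*.csv")}
    return [name for name in expected_month_files(start, end) if name not in present]
```

For example, `missing_month_files(raw_dir, date(2006, 1, 1), date.today())` returns an empty list when every month since January 2006 has been downloaded.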

### To Refresh Data in Future Sessions
- Simply re-run all cells in this notebook
- Or create a scheduled task to run `scripts/helper/refresh_data.py` periodically
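
The contents of `scripts/helper/refresh_data.py` are not shown in this notebook. A refresh script of this kind might simply run the two helper scripts in order and stop at the first failure, as in the sketch below. The function name `run_pipeline` and the explicit `cwd` argument are assumptions for illustration.

```python
import subprocess
import sys
from pathlib import Path


def run_pipeline(scripts, cwd=None):
    """Run each script in order with the current interpreter; stop at the first failure.

    Returns 0 if every script succeeded, otherwise the failing script's return code.
    """
    for script in scripts:
        print(f"Running {Path(script).name}...")
        result = subprocess.run(
            [sys.executable, str(script)],
            cwd=cwd,  # a fixed working directory keeps relative paths predictable
            capture_output=True,
            text=True,
        )
        print(result.stdout)
        if result.returncode != 0:
            print("STDERR:", result.stderr)
            return result.returncode
    return 0
```

Assuming `PROJECT_ROOT` is defined as in Cell 1, a call could look like `run_pipeline([PROJECT_ROOT / "scripts" / "helper" / "scrape.py", PROJECT_ROOT / "scripts" / "helper" / "csv_to_parquet.py"], cwd=PROJECT_ROOT)`. Whether the helper scripts resolve their data paths relative to the working directory is not confirmed by this notebook, so check where the output files land after the first run.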