# Phase 1: Data Ingestion - Scrape and Consolidate

## Overview

This notebook downloads monthly crime incident data from the OpenDataPhilly Carto API and consolidates it into a single optimized Parquet file. Run this notebook to refresh your dataset or create an initial dataset for analysis.

### Steps
1. **Scrape**: Download monthly CSVs from the OpenDataPhilly API
2. **Consolidate**: Merge all months into a single DataFrame
3. **Optimize**: Reduce file size by optimizing data types
4. **Save**: Export to Parquet format for efficient storage and loading
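
Steps 2 to 4 are carried out by `scripts/helper/csv_to_parquet.py`, whose contents are not shown in this notebook. As a rough illustration of what dtype optimization usually involves, a minimal pandas sketch might look like the following. The function name `optimize_dtypes`, the 50% cardinality threshold, and the approach itself are assumptions for illustration, not the project's actual code.

```python
import pandas as pd


def optimize_dtypes(df: pd.DataFrame, max_unique_ratio: float = 0.5) -> pd.DataFrame:
    """Return a copy of df with smaller dtypes where the change is lossless.

    Hypothetical helper: the real logic lives in csv_to_parquet.py.
    """
    out = df.copy()
    for name in out.columns:
        col = out[name]
        if pd.api.types.is_integer_dtype(col.dtype):
            # Pick the smallest integer type that can hold every value.
            out[name] = pd.to_numeric(col, downcast="integer")
        elif pd.api.types.is_object_dtype(col.dtype) or isinstance(col.dtype, pd.StringDtype):
            # Text columns with many repeated labels shrink a lot as categoricals.
            if len(col) > 0 and col.nunique(dropna=True) / len(col) <= max_unique_ratio:
                out[name] = col.astype("category")
    return out
```

Calling `optimize_dtypes(combined).to_parquet(path, index=False)` would then cover the Save step; `to_parquet` needs `pyarrow` or `fastparquet` installed. Float columns are left untouched here because downcasting coordinates to `float32` would lose precision.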

### Expected Output
- Processed data file: `data/processed/crime_incidents_combined.parquet`
- File size: ~100-200 MB (compressed with Parquet)
- Records: roughly 3.5 million or more (the exact count depends on what the API returns at scrape time)
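
The size estimate above can be turned into a quick sanity check after the notebook runs. The sketch below uses only the standard library; the helper names are hypothetical, and the default bounds simply mirror the 100-200 MB range quoted above.

```python
from pathlib import Path


def output_size_mb(path: Path) -> float:
    """Return the size of the file at path in megabytes (10**6 bytes)."""
    return path.stat().st_size / 1_000_000


def size_looks_plausible(path: Path, low_mb: float = 100, high_mb: float = 200) -> bool:
    """True if the file exists and its size falls inside [low_mb, high_mb]."""
    return path.is_file() and low_mb <= output_size_mb(path) <= high_mb
```

For example, `size_looks_plausible(PROJECT_ROOT / "data" / "processed" / "crime_incidents_combined.parquet")` returning `False` would suggest a partial scrape or a failed consolidation.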

## Cell 1: Import and Configure

In [None]:
import sys
import subprocess
from pathlib import Path

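# Add project root to path so that `src` is importable.
# Assumes the notebook is run from a directory two levels below the project root.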
PROJECT_ROOT = Path.cwd().parent.parent
sys.path.insert(0, str(PROJECT_ROOT))

print(f"Project root: {PROJECT_ROOT}")
print(f"Python version: {sys.version}")

## Cell 2: Run Scraper

This cell calls the scraper script to download monthly crime data from the OpenDataPhilly API.

In [None]:
# Run the scraper
scraper_script = PROJECT_ROOT / "scripts" / "helper" / "scrape.py"

print("Starting data scrape from OpenDataPhilly API...")
print("This may take several minutes depending on API response times.\n")

result = subprocess.run([sys.executable, str(scraper_script)], capture_output=True, text=True)

print(result.stdout)
if result.stderr:
    print("STDERR:", result.stderr)

if result.returncode == 0:
    print("\n✓ Scrape completed successfully")
else:
    print(f"\n✗ Scrape failed with return code {result.returncode}")
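
The internals of `scrape.py` are not shown here. To give a sense of what a month-by-month download from a Carto SQL endpoint could look like, the sketch below builds one request URL per month. The endpoint URL, the table name `incidents_part1_part2`, and the column `dispatch_date_time` are assumptions that do not come from this notebook; check them against the scraper before relying on them.

```python
from datetime import date
from urllib.parse import urlencode

# Assumed values: neither is defined anywhere in this notebook.
CARTO_SQL_URL = "https://phl.carto.com/api/v2/sql"
TABLE = "incidents_part1_part2"


def month_bounds(year: int, month: int) -> tuple[date, date]:
    """Return the first day of the month and the first day of the next month."""
    start = date(year, month, 1)
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return start, end


def month_csv_url(year: int, month: int) -> str:
    """Build a URL that asks for one month of rows as CSV (half-open date range)."""
    start, end = month_bounds(year, month)
    sql = (
        f"SELECT * FROM {TABLE} "
        f"WHERE dispatch_date_time >= '{start}' AND dispatch_date_time < '{end}'"
    )
    return f"{CARTO_SQL_URL}?{urlencode({'q': sql, 'format': 'csv'})}"
```

Using a half-open range (`>= start` and `< next month`) avoids both gaps and double-counted rows at month boundaries.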

## Cell 3: Run Consolidation and Optimization

This cell consolidates all monthly CSV files into a single optimized Parquet file.

In [None]:
# Run the consolidation script
consolidate_script = PROJECT_ROOT / "scripts" / "helper" / "csv_to_parquet.py"

print("Consolidating CSV files and optimizing...")
print("This may take a few minutes for large datasets.\n")

result = subprocess.run([sys.executable, str(consolidate_script)], capture_output=True, text=True)

print(result.stdout)
if result.stderr:
    print("STDERR:", result.stderr)

if result.returncode == 0:
    print("\n✓ Consolidation completed successfully")
else:
    print(f"\n✗ Consolidation failed with return code {result.returncode}")
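
Both subprocess cells print a failure message but let the notebook keep running, so a later cell can silently load stale data. One possible alternative, sketched here with a hypothetical helper named `run_step`, is to raise on a non-zero return code so that "Run All" stops at the failing step.

```python
import subprocess
import sys


def run_step(name: str, cmd: list[str]) -> str:
    """Run cmd, return its stdout, and raise if it exits with a non-zero code."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        # Include stderr so the notebook traceback shows why the step failed.
        raise RuntimeError(
            f"{name} failed with return code {result.returncode}:\n{result.stderr}"
        )
    return result.stdout
```

In this notebook it could be called as `print(run_step("Consolidation", [sys.executable, str(consolidate_script)]))`.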

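## Cell 4: Verify Output

This cell loads the consolidated file through the project's loader and prints basic diagnostics (shape, columns, data types, and date range).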

In [None]:
import pandas as pd
from src.data import loader

# Load the consolidated data
try:
    df = loader.load_crime_data()
    print(f"✓ Successfully loaded {len(df):,} crime records")
    print(f"\nDataFrame shape: {df.shape}")
    print(f"\nColumns: {list(df.columns)}")
    print(f"\nData types:\n{df.dtypes}")
    print("\nFirst few rows:")
    print(df.head())
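    # Guard the column lookup so a missing column is not reported as a load error
    if "date" in df.columns:
        print(f"\nDate range: {df['date'].min()} to {df['date'].max()}")
    else:
        print("\nNo 'date' column found; skipping date range")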
except Exception as e:
    print(f"✗ Error loading data: {e}")
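
Printed diagnostics are easy to skim past. As a complement, a few checks can be collected into a dictionary and asserted on. The sketch below is a hypothetical helper, not part of `src.data.loader`; the default column name `date` is taken from the cell above, and the specific checks are only suggestions.

```python
import pandas as pd


def summarize(df: pd.DataFrame, date_col: str = "date") -> dict:
    """Collect basic health indicators for a loaded incidents DataFrame."""
    if df.empty:
        raise ValueError("DataFrame is empty")
    if date_col not in df.columns:
        raise KeyError(f"missing expected column: {date_col!r}")
    # Values that cannot be parsed as dates become NaT instead of raising.
    dates = pd.to_datetime(df[date_col], errors="coerce")
    return {
        "rows": len(df),
        "date_min": dates.min(),
        "date_max": dates.max(),
        "unparseable_dates": int(dates.isna().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
    }
```

A call such as `summarize(df)` could then be followed by assertions, for instance that `unparseable_dates` is zero.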

## Summary

✓ **Data ingestion complete!** If every cell above reported ✓, the consolidated dataset is ready for analysis. If any step reported ✗, fix it and re-run before moving on.

### What's Next?
- Proceed to **Phase 2: Exploration** (`phase_02_exploration/01_data_overview.ipynb`) to understand the data structure and quality
- Or skip directly to later phases if you've already completed exploration

### Data Location
- **Consolidated file**: `data/processed/crime_incidents_combined.parquet`
- **Raw monthly CSVs**: `data/raw/` (organized by year/month)

### To Refresh Data in Future Sessions
- Simply re-run all cells in this notebook
- Or create a scheduled task to run `scripts/helper/refresh_data.py` periodically
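
Whichever refresh route is used, it can help to refresh only when the consolidated file is missing or old. The sketch below is one way to decide that from the file's modification time; the helper name `needs_refresh` and the 30-day default are assumptions, and nothing here reflects what `refresh_data.py` actually does.

```python
import time
from pathlib import Path
from typing import Optional


def needs_refresh(path: Path, max_age_days: float = 30.0, now: Optional[float] = None) -> bool:
    """True if path is missing or was last modified more than max_age_days ago."""
    if not path.is_file():
        return True
    current = time.time() if now is None else now
    age_days = (current - path.stat().st_mtime) / 86_400  # seconds per day
    return age_days > max_age_days
```

A scheduled job could call `needs_refresh(...)` on `data/processed/crime_incidents_combined.parquet` and skip the scrape when it returns `False`.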