# A1: Data Sources Setup

---

## Overview

This notebook connects to the Berkeley Open Data Portal API (Socrata), downloads zoning and building permit data, and exports it to CSV and, optionally, SQLite.

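**Inputs:** `config/berkeley_config.json`, plus an optional `BERKELEY_APP_TOKEN` environment variable (the data itself is fetched from the API)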

**Outputs:**
- `zoning_permits_YYYYMMDD.csv`
- `building_permits_YYYYMMDD.csv`
- `zoning_permits` and `building_permits` tables in the SQLite database (optional)

**Dependencies:** sodapy, pandas; python-dotenv (optional)

---

## 1. Setup & Imports

In [1]:
# Standard imports
import sys
import os
from pathlib import Path
from datetime import datetime

# Add modules to path
sys.path.insert(0, str(Path.cwd().parent.parent))

# Import our modules
from modules.data_loader import (
    get_socrata_client,
    load_permits_from_api,
    save_to_database,
    DATASETS
)

# Third-party
import pandas as pd

# Configuration
import json
with open('../../config/berkeley_config.json') as f:
    CONFIG = json.load(f)

DATA_DIR = Path(CONFIG['paths']['data_dir'])

print(f"Data directory: {DATA_DIR}")
print(f"Available datasets: {list(DATASETS.keys())}")

Data directory: /Users/johngage/berkeley-data
Available datasets: ['business_licenses', 'building_permits', 'zoning_permits', 'planning_records', 'crime_incidents', 'restaurant_inspections']


## 2. API Configuration

Get a free API token from the page below and store it as `BERKELEY_APP_TOKEN` in the `.env` file in your data directory:
https://data.cityofberkeley.info/profile/edit/developer_settings

In [13]:
# Load environment variables
try:
    from dotenv import load_dotenv
    load_dotenv(DATA_DIR / '.env')
    print("Loaded .env file")
except ImportError:
    print("Note: python-dotenv not installed (optional)")

# Read the API token from the environment (set BERKELEY_APP_TOKEN in .env)
APP_TOKEN = os.environ.get('BERKELEY_APP_TOKEN')

# Do not hardcode the token here: an assignment at this point would override
# the environment variable above and be saved with the notebook.

if APP_TOKEN:
    print(f"API token loaded: {APP_TOKEN[:8]}...")
else:
    print("WARNING: No API token found!")
    print("Get your free token at: https://data.cityofberkeley.info/profile/edit/developer_settings")

Loaded .env file
API token loaded: z1ZX3Y2j...
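
The token lookup in the cell above can be isolated into two small helpers, which makes the "missing or blank" case explicit and keeps the full token out of printed output. This is a minimal sketch, not part of `modules.data_loader`: the names `resolve_app_token` and `mask_token` are hypothetical, and only the `BERKELEY_APP_TOKEN` key comes from the notebook.

```python
import os
from typing import Mapping, Optional


def resolve_app_token(env: Mapping[str, str], key: str = "BERKELEY_APP_TOKEN") -> Optional[str]:
    """Return the token stored under `key`, or None if it is missing or blank."""
    token = env.get(key, "").strip()
    return token or None


def mask_token(token: Optional[str], visible: int = 4) -> str:
    """Show only the first few characters so logs never contain the full token."""
    if not token:
        return "<none>"
    return token[:visible] + "..."


# Hypothetical usage: read from the real environment and print a masked form.
token = resolve_app_token(os.environ)
print(f"API token: {mask_token(token)}")
```

Passing the environment in as a mapping keeps the helper easy to check with a plain dictionary instead of real environment variables.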


## 3. Available Datasets

Datasets registered in `DATASETS`, including the Berkeley Open Data Portal datasets relevant to housing:

In [14]:
# Display available datasets
print("Berkeley Open Data - Housing Related Datasets:\n")
print("="*60)

for name, dataset_id in DATASETS.items():
    info = CONFIG['api']['datasets'].get(name, {})
    desc = info.get('description', 'No description')
    print(f"{name}")
    print(f"  ID: {dataset_id}")
    print(f"  Description: {desc}")
    print()

Berkeley Open Data - Housing Related Datasets:

business_licenses
  ID: rwnf-bu3w
  Description: Active business licenses

building_permits
  ID: ydr8-5enu
  Description: Building permits

zoning_permits
  ID: vkhm-tsvp
  Description: Zoning permits

planning_records
  ID: rk4r-58ys
  Description: Planning records

crime_incidents
  ID: k2nh-s5h5
  Description: No description

restaurant_inspections
  ID: b47j-kakm
  Description: No description



## 4. Fetch Zoning Permits

Zoning permits are the first step in the housing development pipeline.

In [15]:
# Fetch zoning permits
print("Fetching Zoning Permits...")
print("="*60)

df_zoning = load_permits_from_api(
    'zoning_permits',
    limit=50000,
    app_token=APP_TOKEN
)

if df_zoning is not None:
    print(f"\nShape: {df_zoning.shape}")
    print(f"\nColumns:")
    for col in df_zoning.columns:
        print(f"  - {col}")
    
    print(f"\nSample records:")
    display(df_zoning.head(3))

Fetching Zoning Permits...
Using app token: z1ZX3Y2j...
Fetching zoning_permits from Berkeley Open Data...
Error fetching data: 403 Client Error: Forbidden
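
The `403 Forbidden` above may come from the token rather than the dataset. One way to tell the two apart is to retry the same request without a token, since Socrata portals generally accept anonymous requests at a throttled rate. The sketch below assumes that behavior; `fetch_with_anonymous_fallback` is a hypothetical helper, and `fetch` stands in for whatever callable performs the request. It is exercised here with a stub, not the real API.

```python
from typing import Callable, List, Optional


def fetch_with_anonymous_fallback(
    fetch: Callable[[Optional[str]], List[dict]],
    app_token: Optional[str],
) -> List[dict]:
    """Try the request with the token; on a 403, retry once without it."""
    try:
        return fetch(app_token)
    except Exception as exc:
        # Only retry when a token was sent and the failure looks like a 403.
        if app_token is None or "403" not in str(exc):
            raise
        print("403 with token; retrying anonymously (expect throttling)")
        return fetch(None)


# Stub that mimics the failure recorded above: any token is rejected.
def fake_fetch(token: Optional[str]) -> List[dict]:
    if token is not None:
        raise RuntimeError("403 Client Error: Forbidden")
    return [{"permit_id": "demo-1"}]


rows = fetch_with_anonymous_fallback(fake_fetch, "bad-token")
print(f"Fetched {len(rows)} row(s)")
```

With sodapy, `fetch` could be something like `lambda tok: Socrata("data.cityofberkeley.info", tok).get("vkhm-tsvp", limit=5)`. If the anonymous call succeeds, the token is the likely culprit; if it also returns 403, the dataset ID or its access settings are worth checking.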


## 5. Fetch Building Permits

Building permits are issued after zoning approval.

In [None]:
# Fetch building permits
print("Fetching Building Permits...")
print("="*60)

df_building = load_permits_from_api(
    'building_permits',
    limit=50000,
    app_token=APP_TOKEN
)

if df_building is not None:
    print(f"\nShape: {df_building.shape}")
    print(f"\nColumns:")
    for col in df_building.columns:
        print(f"  - {col}")
    
    print(f"\nSample records:")
    display(df_building.head(3))
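
Both fetch cells request up to 50,000 rows in a single call. If a dataset grows past that, or large responses time out, the same data can be pulled in pages using a limit and an offset. The sketch below shows the paging loop only; `iter_pages` and `fake_fetch_page` are hypothetical names, and whether `load_permits_from_api` already pages internally is not known from this notebook.

```python
from typing import Callable, Iterator, List


def iter_pages(
    fetch_page: Callable[[int, int], List[dict]],
    page_size: int = 1000,
    max_rows: int = 50000,
) -> Iterator[List[dict]]:
    """Yield pages until the source is exhausted or max_rows is reached."""
    offset = 0
    while offset < max_rows:
        # Never ask for more rows than the overall cap allows.
        limit = min(page_size, max_rows - offset)
        page = fetch_page(limit, offset)
        if not page:
            break
        yield page
        if len(page) < limit:
            break  # a short page means this was the last one
        offset += len(page)


# Stand-in for the API: 2,500 fake rows served by (limit, offset).
_ROWS = [{"id": i} for i in range(2500)]


def fake_fetch_page(limit: int, offset: int) -> List[dict]:
    return _ROWS[offset:offset + limit]


all_rows = [row for page in iter_pages(fake_fetch_page) for row in page]
print(f"Fetched {len(all_rows)} rows")
```

With sodapy, `fetch_page` could wrap `client.get(dataset_id, limit=limit, offset=offset, order=":id")`; a fixed sort order matters because paging over an unordered result can skip or repeat rows.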

## 6. Document Data Schemas

Examine and document the schema for each dataset.

In [None]:
def document_schema(df, name):
    """Print each column's dtype, non-null percentage, and a sample value."""
    print(f"\n{'='*60}")
    print(f"SCHEMA: {name}")
    print(f"{'='*60}")
    print(f"Records: {len(df):,}")
    print(f"Columns: {len(df.columns)}")
    print()
    
    for col in df.columns:
        dtype = df[col].dtype
        non_null = df[col].notna().sum()
        # Guard against an empty dataframe to avoid division by zero
        pct = 100 * non_null / len(df) if len(df) else 0.0
        sample = df[col].dropna().iloc[0] if non_null > 0 else 'N/A'
        if isinstance(sample, str) and len(sample) > 40:
            sample = sample[:40] + '...'
        print(f"{col}")
        print(f"  Type: {dtype}, Non-null: {pct:.0f}%")
        print(f"  Sample: {sample}")
        print()

# Document schemas
if df_zoning is not None:
    document_schema(df_zoning, 'Zoning Permits')

if df_building is not None:
    document_schema(df_building, 'Building Permits')

## 7. Export Data

Save fetched data to CSV files.

In [None]:
# Create timestamp for file names
timestamp = datetime.now().strftime('%Y%m%d')

# Export zoning permits
if df_zoning is not None:
    zoning_path = DATA_DIR / f'zoning_permits_{timestamp}.csv'
    df_zoning.to_csv(zoning_path, index=False)
    print(f"Saved: {zoning_path}")

# Export building permits
if df_building is not None:
    building_path = DATA_DIR / f'building_permits_{timestamp}.csv'
    df_building.to_csv(building_path, index=False)
    print(f"Saved: {building_path}")

print("\nData export complete!")
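
Because the export file names carry a `YYYYMMDD` stamp, a later notebook needs some way to pick the most recent file. The sketch below is one possible approach; `latest_export` is a hypothetical helper, and the demo runs against a temporary directory with empty placeholder files, not `DATA_DIR`.

```python
import tempfile
from pathlib import Path
from typing import Optional


def latest_export(data_dir: Path, prefix: str) -> Optional[Path]:
    """Return the newest `<prefix>_YYYYMMDD.csv` in data_dir, or None if absent."""
    # YYYYMMDD stamps sort chronologically as plain text, so sorting by name works.
    candidates = sorted(data_dir.glob(f"{prefix}_????????.csv"))
    return candidates[-1] if candidates else None


with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    # Create empty placeholder exports with out-of-order dates.
    for stamp in ("20240105", "20240212", "20231130"):
        (demo_dir / f"zoning_permits_{stamp}.csv").touch()

    newest = latest_export(demo_dir, "zoning_permits")
    newest_name = newest.name if newest else None
    missing = latest_export(demo_dir, "building_permits")
    print(f"Newest zoning export: {newest_name}")
```

The pattern matches exactly eight characters after the prefix, so other files sharing the prefix, such as a differently named derivative, are ignored.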

## 8. Load to Database (Optional)

Load data into SQLite for Datasette.

In [None]:
# Save to database
DB_PATH = CONFIG['paths']['database']

if df_zoning is not None:
    save_to_database(df_zoning, 'zoning_permits', DB_PATH)

if df_building is not None:
    save_to_database(df_building, 'building_permits', DB_PATH)

print(f"\nData loaded to: {DB_PATH}")
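
After a load, a quick row count per table confirms that the data actually reached SQLite. The sketch below uses only the standard-library `sqlite3` module; `table_row_counts` is a hypothetical helper, and the demo builds an in-memory database with made-up rows rather than opening `DB_PATH`.

```python
import sqlite3
from typing import Dict


def table_row_counts(conn: sqlite3.Connection) -> Dict[str, int]:
    """Map each user table in the database to its row count."""
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
            "ORDER BY name"
        )
    ]
    # Names come from sqlite_master itself, so they are quoted as identifiers.
    return {
        name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
        for name in names
    }


# In-memory stand-in for the real database, with hypothetical rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE zoning_permits (permit_id TEXT)")
conn.executemany("INSERT INTO zoning_permits VALUES (?)", [("demo-1",), ("demo-2",)])
conn.execute("CREATE TABLE building_permits (permit_id TEXT)")
counts = table_row_counts(conn)
conn.close()

for table, n in counts.items():
    print(f"{table}: {n} rows")
```

To check the real database, replace the in-memory connection with `sqlite3.connect(DB_PATH)`. A table that is missing or has zero rows points back to a failed fetch earlier in the notebook.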

---

## Summary

On a successful run, this notebook:
- Connects to the Berkeley Open Data Portal
- Downloads zoning and building permits
- Documents the data schemas
- Exports the data to CSV and, optionally, SQLite

**Next:** Run `A2_address_standardization.ipynb` to standardize addresses.