# Housing Data Workflow Notebook

A modular workflow in which each step runs in its own cell.
Run the cells in order, or skip any step you don't need.

Each step shows dataframe views and statistics for inspection.

## Quick Start
- Run the **Setup** cell first
- Then run any combination of the Step 1-4 cells, respecting each step's **Depends on** note
- Skip cells you don't want to execute
- Each cell prints its own results for inspection

## 🔧 Setup

Run this cell first to import modules and define helper functions.
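
The `_normalize_bin` helper defined in this cell turns a BIN that may arrive as an integer, a float, or a string into one canonical string. The sketch below reproduces that rule with the standard library only, so it can be tried outside the notebook. `normalize_bin_sketch` is a hypothetical name, the sample BIN is made up, and the `math.isnan` check is a simplified stand-in for `pd.isna`.

```python
import math
from typing import Optional


def normalize_bin_sketch(bin_value) -> Optional[str]:
    """Stdlib-only approximation of _normalize_bin (hypothetical helper)."""
    # Treat None and float NaN as missing; pd.isna covers more cases than this.
    if bin_value is None or (isinstance(bin_value, float) and math.isnan(bin_value)):
        return None
    try:
        # 1087654.0 and "1087654.0" both collapse to "1087654".
        return str(int(float(bin_value)))
    except (TypeError, ValueError):
        # Non-numeric text is kept after trimming; empty text counts as missing.
        value = str(bin_value).strip()
        return value or None


# Made-up BIN in several of the shapes a CSV column can produce.
samples = [1087654, 1087654.0, "1087654.0", " 1087654 ", "", float("nan"), "N/A"]
normalized = [normalize_bin_sketch(v) for v in samples]
print(normalized)
```

The first four inputs normalize to the same string, which is what lets BINs read as floats from one CSV match BINs read as text from another.
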

In [1]:
import sys
from pathlib import Path
from typing import Optional
import time

import pandas as pd

# Add current directory to path for local imports
sys.path.append(".")

# Import our workflow modules
from fetch_affordable_housing_data import update_local_data, verify_and_fetch_hpd_data
from query_ll44_funding import query_and_add_financing
from query_dob_filings import query_dob_filings
from query_co_filings import query_co_filings
from HPD_DOB_Join_On_BIN import create_separate_timelines
from create_timeline_chart import create_timeline_chart, create_financing_charts
from data_quality import quality_tracker

print("✅ All imports successful")

# Helper functions
def _normalize_bin(bin_value) -> Optional[str]:
    """Normalize BIN to a clean string."""
    if pd.isna(bin_value):
        return None
    try:
        return str(int(float(bin_value)))
    except (TypeError, ValueError):
        value = str(bin_value).strip()
        return value or None

def _write_bin_file(source_csv: Path, output_txt: Path) -> Path:
    """Extract BINs from a CSV and write them to a text file for CO searches."""
    df = pd.read_csv(source_csv)
    candidate_cols = [col for col in df.columns if col.lower() in ("bin", "bin_normalized")]
    if not candidate_cols:
        raise SystemExit(f"Could not find a BIN column in {source_csv}")

    bins = [_normalize_bin(val) for val in df[candidate_cols[0]].dropna()]
    bins = sorted({b for b in bins if b})

    output_txt.parent.mkdir(parents=True, exist_ok=True)
    output_txt.write_text("\n".join(bins))
    print(f"Wrote {len(bins)} BINs to {output_txt}")
    return output_txt

print("✅ Helper functions defined")

✅ All imports successful
✅ Helper functions defined


## 📥 Step 1: Fetch HPD Data

Load or refresh the HPD affordable housing dataset.

**Options:**
- Set `refresh_data = True` to fetch fresh data
- Set `refresh_data = False` to use existing data
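
A full refresh pulls the dataset from the NYC Open Data endpoint shown in this step's output (`https://data.cityofnewyork.us/resource/hg8x-zxpr.json`). As a rough illustration of what paging through that endpoint involves, the sketch below builds request URLs with the Socrata `$limit` and `$offset` query parameters. It makes no network calls. `build_page_urls` and the page size of 1000 are assumptions for illustration, not the behaviour of `update_local_data`.

```python
from typing import List
from urllib.parse import urlencode

ENDPOINT = "https://data.cityofnewyork.us/resource/hg8x-zxpr.json"


def build_page_urls(total_records: int, page_size: int = 1000) -> List[str]:
    """Return one URL per page needed to cover total_records rows (hypothetical helper)."""
    urls = []
    for offset in range(0, total_records, page_size):
        # $limit caps the rows per request; $offset skips rows already fetched.
        query = urlencode({"$limit": page_size, "$offset": offset}, safe="$")
        urls.append(f"{ENDPOINT}?{query}")
    return urls


# 8,604 is the local record count reported in the Step 1 output.
urls = build_page_urls(total_records=8604, page_size=1000)
print(len(urls))
print(urls[0])
print(urls[-1])
```
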

In [2]:
# Step 1 Configuration
refresh_data = False  # Set to True to fetch fresh HPD data
hpd_output_path = "data/raw/Affordable_Housing_Production_by_Building.csv"  # Output path for HPD data

print("=" * 70)
print("STEP 1: FETCH HPD DATA")
print("=" * 70)

# Start quality tracking
quality_tracker.start_processing()

if refresh_data:
    print("Fetching fresh HPD data from NYC Open Data...")
    hpd_df, hpd_csv = update_local_data(hpd_output_path)
else:
    print("Verifying local HPD data against API...")
    hpd_df, hpd_csv = verify_and_fetch_hpd_data(output_path=hpd_output_path)

if not hpd_csv.exists():
    raise SystemExit(f"HPD dataset not found at {hpd_csv}")

# Record initial dataset size
quality_tracker.analyze_hpd_data(hpd_df, "Full_HPD_Dataset")
quality_tracker.record_pipeline_stage("raw_hpd_data", len(hpd_df), "Raw HPD affordable housing dataset")

print(f"✅ Step 1 complete: {len(hpd_df):,} records loaded")
print(f"📁 Data location: {hpd_csv}")

# Display the dataframe
print("\n🔍 HPD Dataset Overview:")
print(f"Shape: {hpd_df.shape}")
print("\nColumns:")
for col in hpd_df.columns:
    print(f"  - {col}")

print("\n📊 Sample Data:")
display(hpd_df.head())

print("\n📈 Basic Statistics:")
display(hpd_df.describe(include="all"))

STEP 1: FETCH HPD DATA
Verifying local HPD data against API...
STEP 1: VERIFY AND FETCH HPD DATA
Found local HPD data file: data/raw/Affordable_Housing_Production_by_Building.csv
Local file has 8,604 records

Fetching 100 sample records from API for verification...
Fetching affordable housing data from NYC Open Data API...
Endpoint: https://data.cityofnewyork.us/resource/hg8x-zxpr.json
Fetching records 1-100...
  Retrieved 100 records (total: 100)

Completed! Retrieved 100 total records
API sample has 100 records
✅ Local data has sufficient records - assuming current
Using existing local data
✅ Step 1 complete: 8,604 records loaded
📁 Data location: data/raw/Affordable_Housing_Production_by_Building.csv

🔍 HPD Dataset Overview:
Shape: (8604, 41)

Columns:
  - Project ID
  - Project Name
  - Project Start Date
  - Project Completion Date
  - Building ID
  - Number
  - Street
  - Borough
  - Postcode
  - BBL
  - BIN
  - Community Board
  - Council District
  - Census Tract
  - NTA - Neigh

Unnamed: 0,Project ID,Project Name,Project Start Date,Project Completion Date,Building ID,Number,Street,Borough,Postcode,BBL,...,2-BR Units,3-BR Units,4-BR Units,5-BR Units,6-BR+ Units,Unknown-BR Units,Counted Rental Units,Counted Homeownership Units,All Counted Units,Total Units
0,44218,MEC E. 125TH ST. PARCEL B WEST,2018-12-31T00:00:00.000,,987329,2319,3 AVENUE,Manhattan,10035,1017907501,...,129.0,15.0,,,,,297.0,,297.0,404
1,44223,ROCHESTER SUYDAM PHASE 1,2021-06-30T00:00:00.000,,927737,335,RALPH AVENUE,Brooklyn,11233,3015560003,...,11.0,,,,,,,13.0,13.0,13
2,44223,ROCHESTER SUYDAM PHASE 1,2021-06-30T00:00:00.000,,969695,35,ROCHESTER AVENUE,Brooklyn,11233,3017090009,...,6.0,,,,,,,8.0,8.0,8
3,44223,ROCHESTER SUYDAM PHASE 1,2021-06-30T00:00:00.000,,975702,18-22,SUYDAM PLACE,Brooklyn,11233,3017090028,...,1.0,,,,,,,15.0,15.0,15
4,44223,ROCHESTER SUYDAM PHASE 1,2021-06-30T00:00:00.000,,977564,329,RALPH AVENUE,Brooklyn,11233,3015560007,...,7.0,,,,,,,10.0,10.0,10



📈 Basic Statistics:


Unnamed: 0,Project ID,Project Name,Project Start Date,Project Completion Date,Building ID,Number,Street,Borough,Postcode,BBL,...,2-BR Units,3-BR Units,4-BR Units,5-BR Units,6-BR+ Units,Unknown-BR Units,Counted Rental Units,Counted Homeownership Units,All Counted Units,Total Units
count,8604.0,8604,8604,0.0,6889.0,8604,8604,8604,6886.0,6809.0,...,5433.0,3151.0,532.0,63.0,24.0,633.0,6242.0,2473.0,8576.0,8604.0
unique,5252.0,3538,2124,,6781.0,2933,1486,5,157.0,5768.0,...,,,,,,,,,,
top,75173.0,CONFIDENTIAL,2021-06-30T00:00:00.000,,967819.0,----,----,Brooklyn,11221.0,2051410120.0,...,,,,,,,,,,
freq,114.0,1715,146,,3.0,1715,1715,3395,248.0,58.0,...,,,,,,,,,,
mean,,,,,,,,,,,...,17.777839,10.523009,5.137218,2.746032,3.75,5.805687,37.060077,24.030732,33.903568,46.695607
std,,,,,,,,,,,...,29.800153,17.621614,8.477841,3.537669,7.51954,35.639215,63.206671,70.96487,66.16285,93.375204
min,,,,,,,,,,,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
25%,,,,,,,,,,,...,2.0,1.0,1.0,1.0,1.0,1.0,5.0,1.0,2.0,3.0
50%,,,,,,,,,,,...,7.0,4.0,2.0,1.0,1.0,1.0,13.0,1.0,8.0,12.0
75%,,,,,,,,,,,...,18.0,11.0,6.0,3.5,2.25,2.0,41.0,2.0,31.0,44.0


## 💰 Step 2: Add Financing Classification

Classify projects by financing type (HPD vs Private).

**Depends on:** Step 1

**Options:**
- Set `skip_financing = True` to skip this step
- Customize output path if needed
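
This step's output reports how many project IDs have LL44 funding and labels every building row as either `HPD Financed` or `Privately Financed`. The sketch below shows one way such a label could be assigned, assuming the rule is simply "a row is HPD financed when its project ID appears in the LL44 funding records". The actual logic in `query_and_add_financing` may differ, and the rows below are illustrative.

```python
from collections import Counter

# Illustrative rows: (project_id, building_id). Project IDs 44218 and 44223
# appear in the sample output; 99999 / 111111 are hypothetical values for a
# project with no LL44 record.
building_rows = [
    (44218, 987329),
    (44223, 927737),
    (44223, 969695),
    (99999, 111111),
]

# Assumed input: the set of project IDs found in the LL44 funding data.
ll44_project_ids = {44218, 44223}


def classify(project_id: int) -> str:
    """Label a row by whether its project appears in the LL44 records."""
    return "HPD Financed" if project_id in ll44_project_ids else "Privately Financed"


labels = [classify(project_id) for project_id, _ in building_rows]
distribution = Counter(labels)
print(distribution)

# Every row gets exactly one label, so the counts sum to the row count.
assert sum(distribution.values()) == len(building_rows)
```

The same sanity check applies to the real output: 6,366 privately financed rows plus 2,238 HPD financed rows gives the 8,604 rows loaded in Step 1.
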

In [3]:
# Step 2 Configuration
skip_financing = False  # Set to True to skip financing classification
refresh_ll44 = False  # Set to True to fetch fresh LL44 data
refresh_ll44_eligibility = False  # Set to True to fetch fresh LL44 eligibility data
financing_output_path = None  # Custom output path (None = auto-generate)

print("\n" + "=" * 70)
print("STEP 2: ADD FINANCING CLASSIFICATION")
print("=" * 70)

# Handle LL44 cache refresh if requested
if refresh_ll44_eligibility:
    print("Force refreshing LL44 eligibility cache...")
    from query_ll44_funding import verify_and_fetch_ll44_eligibility_data
    ll44_eligibility_df, ll44_eligibility_path = verify_and_fetch_ll44_eligibility_data(use_existing=False)
    print(f"LL44 eligibility cache refreshed: {len(ll44_eligibility_df)} records\n")

if refresh_ll44:
    print("Force refreshing LL44 funding cache...")
    from query_ll44_funding import verify_and_fetch_ll44_data
    ll44_cache_df, ll44_cache_path = verify_and_fetch_ll44_data(use_existing=False)
    print(f"LL44 funding cache refreshed: {len(ll44_cache_df)} records\n")

if skip_financing:
    print("⏭️ Skipping financing classification as requested.")
    financing_df = hpd_df.copy()
    quality_tracker.record_pipeline_stage("after_financing_skip", len(financing_df), "Financing classification skipped")
    building_csv = hpd_csv
else:
    output_path = Path(financing_output_path) if financing_output_path else Path(
        "data/processed/Affordable_Housing_Production_by_Building_with_financing.csv"
    )
    output_path.parent.mkdir(parents=True, exist_ok=True)

    print(f"Classifying financing types -> {output_path}")
    financing_df = query_and_add_financing(str(hpd_csv), output_path=str(output_path), use_cache=not refresh_ll44, use_eligibility_cache=not refresh_ll44_eligibility)
    building_csv = output_path

    # Record dataset after financing classification
    quality_tracker.analyze_hpd_data(financing_df, "Filtered_HPD")
    quality_tracker.record_pipeline_stage("after_financing", len(financing_df), "Added LL44 financing classification")

print(f"✅ Step 2 complete: {len(financing_df):,} records with financing classification")

# Display the dataframe with financing info
print("\n🔍 Financing Classification Results:")
print(f"Shape: {financing_df.shape}")

# Check if financing columns were added
financing_cols = [col for col in financing_df.columns if "financ" in col.lower() or "ll44" in col.lower()]
print(f"\nFinancing-related columns: {financing_cols}")

print("\n📊 Sample Data with Financing:")
display(financing_df.head())

# Show financing type distribution
if financing_cols:
    for col in financing_cols:
        if col in financing_df.columns:
            print(f"\n📈 Distribution of {col}:")
            display(financing_df[col].value_counts(dropna=False))


STEP 2: ADD FINANCING CLASSIFICATION
Classifying financing types -> data/processed/Affordable_Housing_Production_by_Building_with_financing.csv
Reading HPD data to extract project IDs: data/raw/Affordable_Housing_Production_by_Building.csv
Found 5252 unique project IDs

Querying Local Law 44 funding database...
Number of project IDs to check: 5252
Checking local LL44 cache...
VERIFYING LL44 FUNDING CACHE
Found recent LL44 cache file: data/raw/ll44_funding_data.csv
File age: 0:04:43.614088
Using existing cached data
Loaded 1270 LL44 records from cache
Found 941 matching funding records in cache
Reading HPD data from: data/raw/Affordable_Housing_Production_by_Building.csv
Total HPD projects: 8604
Unique project IDs in HPD data: 5252
Project IDs with LL44 funding: 678

Financing type distribution:
  Privately Financed: 6,366 projects
  HPD Financed: 2,238 projects

Updated HPD data saved to: data/processed/Affordable_Housing_Production_by_Building_with_financing.csv
✅ Step 2 complete: 8,

Unnamed: 0,Project ID,Project Name,Project Start Date,Project Completion Date,Building ID,Number,Street,Borough,Postcode,BBL,...,3-BR Units,4-BR Units,5-BR Units,6-BR+ Units,Unknown-BR Units,Counted Rental Units,Counted Homeownership Units,All Counted Units,Total Units,Financing Type
0,44218,MEC E. 125TH ST. PARCEL B WEST,2018-12-31T00:00:00.000,,987329.0,2319,3 AVENUE,Manhattan,10035.0,1017908000.0,...,15.0,,,,,297.0,,297.0,404,HPD Financed
1,44223,ROCHESTER SUYDAM PHASE 1,2021-06-30T00:00:00.000,,927737.0,335,RALPH AVENUE,Brooklyn,11233.0,3015560000.0,...,,,,,,,13.0,13.0,13,HPD Financed
2,44223,ROCHESTER SUYDAM PHASE 1,2021-06-30T00:00:00.000,,969695.0,35,ROCHESTER AVENUE,Brooklyn,11233.0,3017090000.0,...,,,,,,,8.0,8.0,8,HPD Financed
3,44223,ROCHESTER SUYDAM PHASE 1,2021-06-30T00:00:00.000,,975702.0,18-22,SUYDAM PLACE,Brooklyn,11233.0,3017090000.0,...,,,,,,,15.0,15.0,15,HPD Financed
4,44223,ROCHESTER SUYDAM PHASE 1,2021-06-30T00:00:00.000,,977564.0,329,RALPH AVENUE,Brooklyn,11233.0,3015560000.0,...,,,,,,,10.0,10.0,10,HPD Financed



📈 Distribution of Financing Type:


Financing Type
Privately Financed    6366
HPD Financed          2238
Name: count, dtype: int64

## 🏗️ Step 3A: Query DOB Filings

Search for DOB New Building filings.

**Depends on:** Step 2

**Options:**
- Set `skip_dob = True` to use existing DOB data
- Set `use_bbl_fallback = False` to disable BBL fallback
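
`use_bbl_fallback` refers to the BBL (borough, block, lot) identifier carried in the HPD data. As background, a BBL is a 10-digit number: a 1-digit borough code, a 5-digit tax block, and a 4-digit lot. The sketch below splits one into its parts, which is the form a block-and-lot lookup would need. `split_bbl` is a hypothetical helper; how `query_dob_filings` actually performs the fallback is not shown in this notebook.

```python
from typing import NamedTuple

BOROUGHS = {1: "Manhattan", 2: "Bronx", 3: "Brooklyn", 4: "Queens", 5: "Staten Island"}


class BBL(NamedTuple):
    borough: str
    block: int
    lot: int


def split_bbl(bbl) -> BBL:
    """Split a 10-digit BBL into borough name, block, and lot (hypothetical helper)."""
    # Values read from a CSV may arrive as floats or float-like text.
    digits = str(int(float(bbl)))
    if len(digits) != 10:
        raise ValueError(f"expected a 10-digit BBL, got {bbl!r}")
    borough_code = int(digits[0])
    if borough_code not in BOROUGHS:
        raise ValueError(f"unknown borough code {borough_code} in {bbl!r}")
    return BBL(BOROUGHS[borough_code], int(digits[1:6]), int(digits[6:]))


# Both BBLs come from the sample rows in the Step 1 output.
brooklyn = split_bbl(3015560003)
manhattan = split_bbl("1017907501.0")
print(brooklyn)
print(manhattan)
```
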

In [None]:
# Step 3A Configuration
skip_dob = False  # Set to True to use existing DOB data
use_bbl_fallback = True  # Set to False to disable BBL fallback
dob_output_path = None  # Custom DOB output path

print("\n" + "=" * 70)
print("STEP 3A: QUERY DOB FILINGS")
print("=" * 70)

dob_output = Path(dob_output_path) if dob_output_path else Path(
    f"data/processed/{building_csv.stem}_dob_filings.csv"
)
dob_output.parent.mkdir(parents=True, exist_ok=True)

# Check for existing DOB files when skipping
if skip_dob:
    print("⏭️ Using existing DOB data")
    # Look for existing files
    alt_dob_path = Path(f"data/external/{building_csv.stem}_dob_filings.csv")
    if dob_output.exists():
        print(f"📁 Using existing DOB data at {dob_output}")
        dob_df = pd.read_csv(dob_output)
    elif alt_dob_path.exists():
        print(f"📁 Using existing DOB data from external folder: {alt_dob_path}")
        dob_output = alt_dob_path
        dob_df = pd.read_csv(dob_output)
    else:
        print("⚠️ No existing DOB data found")
        dob_df = None
        dob_output = None
else:
    print(f"🔍 Querying DOB APIs using {building_csv} -> {dob_output}")
    print("   This may take several minutes...")
    query_dob_filings(
        str(building_csv),
        output_path=str(dob_output),
        use_bbl_fallback=use_bbl_fallback,
    )
    print(f"✅ DOB query completed: {dob_output}")
    dob_df = pd.read_csv(dob_output)

# Display DOB data if available
if dob_df is not None:
    print(f"📊 DOB Filings Data: {dob_df.shape[0]} records")
    print("Columns:")
    for col in dob_df.columns:
        print(f"  - {col}")
    
    print("\n📊 Sample DOB Data:")
    display(dob_df.head())
    
    # Show some statistics
    if "filing_date" in dob_df.columns:
        print("\n📈 DOB Filing Date Statistics:")
        display(dob_df["filing_date"].describe())
else:
    print("⚠️ No DOB data available")

## 🏛️ Step 3B: Query Certificate of Occupancy

Search for Certificate of Occupancy filings.

**Depends on:** Step 2

**Options:**
- Set `skip_co = True` to use existing CO data
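
Before querying, the cell writes the building BINs to a text file through `_write_bin_file`: duplicates are dropped, the values are sorted, and they are written one per line. The cell then counts the BINs by splitting the file on whitespace. The sketch below walks through those steps in a temporary directory. The BIN values are hypothetical and assumed to be normalized already.

```python
import tempfile
from pathlib import Path

# Hypothetical, already-normalized BINs with one duplicate and one missing value.
raw_bins = ["3039145", "1054522", "3039145", None, "2008110"]

# Drop missing values and duplicates, then sort (string order, as in the helper).
unique_bins = sorted({b for b in raw_bins if b})

with tempfile.TemporaryDirectory() as tmp:
    bin_path = Path(tmp) / "processed" / "workflow_bins.txt"
    bin_path.parent.mkdir(parents=True, exist_ok=True)

    # One BIN per line, no trailing newline.
    bin_path.write_text("\n".join(unique_bins))

    # Counting by whitespace split recovers the number of BINs written.
    bins_on_disk = bin_path.read_text().split()

print(bins_on_disk)
print(len(bins_on_disk))
```
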

In [None]:
# Step 3B Configuration
skip_co = False  # Set to True to use existing CO data
co_output_path = None  # Custom CO output path

print("\n" + "=" * 70)
print("STEP 3B: QUERY CERTIFICATE OF OCCUPANCY")
print("=" * 70)

# Generate BIN file for CO searches
bin_output = Path("data/processed/workflow_bins.txt")
bin_file = _write_bin_file(building_csv, bin_output)

print(f"\n📋 BIN file created: {bin_file}")
print(f"Contains {len(bin_file.read_text().split())} BINs")

co_output = Path(co_output_path) if co_output_path else Path(
    f"data/processed/{bin_file.stem}_co_filings.csv"
)
co_output.parent.mkdir(parents=True, exist_ok=True)

if skip_co:
    print("⏭️ Using existing CO data")
    # Look for existing CO files
    alt_co_path = Path(f"data/external/{bin_file.stem}_co_filings.csv")
    if co_output.exists():
        print(f"📁 Using existing CO data at {co_output}")
        co_df = pd.read_csv(co_output)
    elif alt_co_path.exists():
        print(f"📁 Using existing CO data from external folder: {alt_co_path}")
        co_output = alt_co_path
        co_df = pd.read_csv(co_output)
    else:
        print("⚠️ No existing CO data found")
        co_df = None
        co_output = None
else:
    print(f"🏛️ Querying CO APIs using {bin_file} -> {co_output}")
    query_co_filings(str(bin_file), output_path=str(co_output))
    co_df = pd.read_csv(co_output)

# Display CO data if available
if co_df is not None:
    print(f"📊 Certificate of Occupancy Data: {co_df.shape[0]} records")
    print("Columns:")
    for col in co_df.columns:
        print(f"  - {col}")
    
    print("\n📊 Sample CO Data:")
    display(co_df.head())
    
    # Show some statistics
    if "issue_date" in co_df.columns:
        print("\n📈 CO Issue Date Statistics:")
        display(co_df["issue_date"].describe())
else:
    print("⚠️ No CO data available")

## 📊 Step 4: Generate Timelines and Charts

Create timeline visualizations from enriched data.

**Depends on:** Steps 2, 3A, and 3B (the cell reads `co_output`, which Step 3B sets and which may be `None`)

**Options:**
- Set `skip_join = True` to skip timeline creation
- Set `skip_charts = True` to skip chart generation
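
The timeline files are looked up next to the building CSV, using the suffixes `_hpd_financed_timeline.csv` and `_privately_financed_timeline.csv`. The cell derives those names with `str.replace(".csv", ...)`. The sketch below shows an equivalent derivation with `pathlib` that touches only the file name, which avoids surprises if `.csv` ever appears elsewhere in the path. `timeline_paths` is a hypothetical helper.

```python
from pathlib import Path
from typing import Dict


def timeline_paths(building_csv) -> Dict[str, Path]:
    """Derive the two timeline file paths that sit beside the building CSV."""
    building_csv = Path(building_csv)
    stem = building_csv.stem  # file name without the .csv extension
    return {
        "hpd": building_csv.with_name(f"{stem}_hpd_financed_timeline.csv"),
        "private": building_csv.with_name(f"{stem}_privately_financed_timeline.csv"),
    }


paths = timeline_paths(
    "data/processed/Affordable_Housing_Production_by_Building_with_financing.csv"
)
for label, path in paths.items():
    print(label, path.as_posix())
```
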

In [None]:
# Step 4 Configuration
skip_join = False   # Set to True to skip timeline creation
skip_charts = False # Set to True to skip chart generation

print("\n" + "=" * 70)
print("STEP 4: GENERATE TIMELINES AND CHARTS")
print("=" * 70)

if skip_join:
    print("⏭️ Skipping timeline join step.")
else:
    if dob_output is None or not dob_output.exists():
        print("⚠️ No DOB data available; skipping timeline creation.")
    else:
        print("🔗 Building timelines...")
        create_separate_timelines(
            str(building_csv),
            str(dob_output),
            str(co_output) if co_output else None,
        )
        
        # Load and display timeline data
        hpd_timeline = Path(str(building_csv).replace(".csv", "_hpd_financed_timeline.csv"))
        private_timeline = Path(str(building_csv).replace(".csv", "_privately_financed_timeline.csv"))
        
        if hpd_timeline.exists():
            hpd_timeline_df = pd.read_csv(hpd_timeline)
            print(f"\n📊 HPD Financed Timeline Data ({hpd_timeline_df.shape[0]} records):")
            display(hpd_timeline_df.head())
            
            # Show event type distribution
            if "event_type" in hpd_timeline_df.columns:
                print("\n📈 Event Types in HPD Timeline:")
                display(hpd_timeline_df["event_type"].value_counts())
        
        if private_timeline.exists():
            private_timeline_df = pd.read_csv(private_timeline)
            print(f"\n📊 Privately Financed Timeline Data ({private_timeline_df.shape[0]} records):")
            display(private_timeline_df.head())
            
            # Show event type distribution
            if "event_type" in private_timeline_df.columns:
                print("\n📈 Event Types in Private Timeline:")
                display(private_timeline_df["event_type"].value_counts())

if skip_charts:
    print("⏭️ Skipping chart generation.")
else:
    # Charts
    print("\n📈 Generating charts...")
    default_timeline_stem = "Affordable_Housing_Production_by_Building_with_financing"
    if Path(building_csv).name == f"{default_timeline_stem}.csv":
        create_financing_charts()
        print("✅ Created financing-specific charts")
    else:
        hpd_timeline = Path(str(building_csv).replace(".csv", "_hpd_financed_timeline.csv"))
        private_timeline = Path(str(building_csv).replace(".csv", "_privately_financed_timeline.csv"))
        
        if hpd_timeline.exists():
            create_timeline_chart(str(hpd_timeline))
            print("✅ Created HPD financed timeline chart")
        else:
            print("⚠️ No HPD financed timeline found; skipping.")

        if private_timeline.exists():
            create_timeline_chart(str(private_timeline))
            print("✅ Created privately financed timeline chart")
        else:
            print("⚠️ No privately financed timeline found; skipping.")

print("\n✅ Step 4 complete")

## 📋 Final Summary

Generate data quality report and workflow summary.

**Optional:** Run this at the end to see final statistics.
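
Because any step may have been skipped, the summary has to cope with result variables that were never defined or that were set to `None`. The cell handles this with `try`/`except NameError`. The sketch below shows an alternative that looks names up in a dictionary instead. `summarize` is a hypothetical helper, and plain lists stand in for dataframes so the example runs on its own; in the notebook one could pass `globals()`.

```python
from typing import List


def summarize(namespace: dict) -> List[str]:
    """Build summary lines from whichever step results exist (hypothetical helper)."""
    steps = [
        ("HPD Records Processed", "hpd_df", "Step 1"),
        ("Records with Financing", "financing_df", "Step 2"),
        ("DOB Filings Found", "dob_df", "Step 3A"),
        ("CO Filings Found", "co_df", "Step 3B"),
    ]
    lines = []
    for label, name, step in steps:
        if name not in namespace:
            # The step's cell was never executed.
            lines.append(f"• {label}: {step} not run")
        elif namespace[name] is None:
            # The step ran but found no data to load.
            lines.append(f"• {label}: No data")
        else:
            lines.append(f"• {label}: {len(namespace[name]):,}")
    return lines


# Stand-in state: Step 1 ran, Step 3A ran without data, Steps 2 and 3B were skipped.
demo_state = {"hpd_df": [0] * 8604, "dob_df": None}
summary_lines = summarize(demo_state)
for line in summary_lines:
    print(line)
```
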

In [None]:
print("\n" + "=" * 70)
print("📊 FINAL DATA QUALITY REPORT")
print("=" * 70)

# Generate final data quality report and Sankey diagram
quality_tracker.end_processing()
report_filename = quality_tracker.save_report_to_file("notebook_workflow")
sankey_filename = quality_tracker.generate_sankey_diagram()
quality_tracker.print_report()

print("\n🎉 WORKFLOW COMPLETED!")
print(f"📊 Data quality report: {report_filename}")
if sankey_filename:
    print(f"📊 Sankey diagram: {sankey_filename}")

# Summary of what we accomplished
print("\n📋 WORKFLOW SUMMARY:")
try:
    print(f"• HPD Records Processed: {len(hpd_df):,}")
except NameError:
    print("• HPD Records: Step 1 not run")
try:
    print(f"• Records with Financing: {len(financing_df):,}")
except NameError:
    print("• Records with Financing: Step 2 not run")
try:
    if dob_df is not None:
        print(f"• DOB Filings Found: {len(dob_df):,}")
    else:
        print("• DOB Filings: No data")
except NameError:
    print("• DOB Filings: Step 3A not run")
try:
    if co_df is not None:
        print(f"• CO Filings Found: {len(co_df):,}")
    else:
        print("• CO Filings: No data")
except NameError:
    print("• CO Filings: Step 3B not run")

print("\n✅ Notebook workflow complete!")
print("Each step showed dataframe views for inspection.")