# Update interconnection FYI data and validate against LBNL + GridStatus data

In [4]:
%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [5]:
import pandas as pd
import dbcp
from dbcp.extract.helpers import cache_gcs_archive_file_locally
from dbcp.helpers import get_sql_engine

In [6]:
pd.set_option('display.max_columns', None)

# Raw Data

In [18]:
old_fyi = dbcp.extract.fyi_queue.extract("gs://dgm-archive/interconnection.fyi/interconnection_fyi_dataset_2025-09-01.csv")
old_fyi = old_fyi["fyi_queue"]

In [4]:
new_fyi = dbcp.extract.fyi_queue.extract("gs://dgm-archive/interconnection.fyi/interconnection_fyi_dataset_2025-10-01.csv")
new_fyi = new_fyi["fyi_queue"]

## Compare max dates of raw data
Print the latest date a project entered a queue for each ISO in the old and new data. We expect the latest project date in the new data to be later than that of the old data. Notable exceptions:
* PJM: PJM [is working through a backlog of projects](https://www.utilitydive.com/news/pjm-fast-track-reliability-projects-interconnection-queue-invenergy/729311/) and isn't accepting new projects until mid 2026.

In [None]:
for power_market in old_fyi.power_market.unique():
    print(power_market)
    # Parse the queue dates for this market without assigning back into a slice,
    # which avoids pandas' SettingWithCopyWarning
    old_dates = pd.to_datetime(old_fyi.loc[old_fyi.power_market == power_market, "queue_date"])
    new_dates = pd.to_datetime(new_fyi.loc[new_fyi.power_market == power_market, "queue_date"])

    print(f" - Old max date {old_dates.max()}")
    print(f" - New max date {new_dates.max()}")
    print()
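The same comparison can be collected into one table instead of read off printed lines. The sketch below is only an illustration on toy data: the helper name `latest_queue_dates` and the `advanced` column are hypothetical, and the column names `power_market` and `queue_date` are taken from the cell above.

```python
import pandas as pd


def latest_queue_dates(old, new, market_col="power_market", date_col="queue_date"):
    """Return the latest queue date per market in each snapshot.

    The boolean `advanced` column is True when the new snapshot's latest
    date is strictly later than the old snapshot's.
    """
    old_max = pd.to_datetime(old[date_col]).groupby(old[market_col]).max()
    new_max = pd.to_datetime(new[date_col]).groupby(new[market_col]).max()
    out = pd.DataFrame({"old_max": old_max, "new_max": new_max})
    out["advanced"] = out["new_max"] > out["old_max"]
    return out


# Toy snapshots, not real queue data
old = pd.DataFrame(
    {
        "power_market": ["MISO", "MISO", "PJM"],
        "queue_date": ["2025-08-01", "2025-08-20", "2023-06-30"],
    }
)
new = pd.DataFrame(
    {
        "power_market": ["MISO", "MISO", "PJM"],
        "queue_date": ["2025-08-20", "2025-09-18", "2023-06-30"],
    }
)

summary = latest_queue_dates(old, new)
print(summary)
```

Markets where `advanced` is False are the ones to check against the known exceptions, such as PJM.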

## Compare data warehouse tables to raw data

In [8]:
engine = get_sql_engine()
with engine.connect() as con:
    fyi_locations = pd.read_sql_table("fyi_locations", con, schema="private_data_warehouse")
    fyi_projects = pd.read_sql_table("fyi_projects", con, schema="private_data_warehouse")
    fyi_res_cap = pd.read_sql_table("fyi_resource_capacity", con, schema="private_data_warehouse")

Because we deduplicate the data, some project IDs in the raw data will not appear in the data warehouse tables, but we should make sure we aren't losing an unexpectedly high number. While building the data warehouse tables we log how many projects are dropped as duplicates. Check that no table is missing many more IDs than that logged number. The locations table will be missing more IDs because the location columns contain more nulls than the capacity columns.

In [9]:
print(len(set(new_fyi.unique_id) - set(fyi_projects.project_id)))
print(len(set(new_fyi.unique_id) - set(fyi_locations.project_id)))
print(len(set(new_fyi.unique_id) - set(fyi_res_cap.project_id)))

3358
4078
3358
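One way to make this check repeatable is to compare each count against the logged number of dropped duplicates. The sketch below uses toy IDs. The helper names, the `n_logged_duplicates` value and the `tolerance` threshold are all hypothetical and would need to come from the actual ETL logs and team judgment.

```python
def count_missing_ids(raw_ids, table_ids_by_name):
    """Count raw IDs that are absent from each warehouse table."""
    raw = set(raw_ids)
    return {name: len(raw - set(ids)) for name, ids in table_ids_by_name.items()}


def tables_over_budget(missing_counts, n_logged_duplicates, tolerance=0.25):
    """Return tables missing more IDs than the logged duplicates plus a tolerance."""
    limit = n_logged_duplicates * (1 + tolerance)
    return [name for name, n in missing_counts.items() if n > limit]


# Toy example: 10 raw IDs, 2 logged duplicates (hypothetical numbers)
raw_ids = range(1, 11)
tables = {
    "fyi_projects": range(1, 9),   # missing 2 IDs
    "fyi_locations": range(1, 6),  # missing 5 IDs
}

missing = count_missing_ids(raw_ids, tables)
flagged = tables_over_budget(missing, n_logged_duplicates=2, tolerance=0.5)
print(missing)
print(flagged)
```

A flagged locations table is not necessarily a problem, since it is expected to be missing more IDs, but the size of the gap is worth a look.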


## Compare data mart tables
Compare the old and new total active capacity in regions.

### How to grab the new data
To get the new data, replace the URI in `dbcp.etl.etl_fyi_queue` with the updated GCS URI, then run `make all`. There might be some data validation errors due to small changes in the expected number of projects. If the changes seem reasonable, just update the expected value in the assertion. If they don't seem reasonable, do some digging!

Once the ETL finishes successfully, the new data is available in the database.

<!-- - download the `dev` data to compare to
- load the relevant tables

data warehouse
- check the old and new iso have a similar n and capacity
- plot total capacity


data mart:
- total capacity, n_projects and max date have all the same: caiso, ercot, pjm
- total capacity, n_projects and max date have all increased: miso, pjm, spp, nyiso, isone
- withdrawn and in service capacity have increased: miso, pjm, spp, nyiso, isone

- active capacity has changed for isos in GS_REGIONS
- how much has the active capacity changed by? -->

In [7]:
fyi_all_projects_long_format = pd.read_parquet("/app/data/output/data_mart/fyi_projects_long_format.parquet")

In [8]:
# filter for active projects
fyi_projects_long_format = fyi_all_projects_long_format[fyi_all_projects_long_format.queue_status.isin(["active"])]

### How to grab the old data
The following code grabs the latest version number for data in the development datasets then downloads the parquet file.

In [7]:
from google.cloud import bigquery

def get_bigquery_table_version(dataset_id, table_name, project_id="dbcp-dev-350818"):
    """
    Get the data version of a BigQuery table.

    The dbcp.commands.publish script generates a version number for each data release
    and adds it as a label to the BQ tables.

    Args:
        dataset_id: the BQ dataset ID
        table_name: the name of the table
        project_id: the GCP project id

    Return:
        the current DBCP version number of the requested table
    """
    client = bigquery.Client()

    table_ref = f"{project_id}.{dataset_id}.{table_name}"
    table = client.get_table(table_ref)  # Fetch table metadata

    labels = table.labels  # Get the labels dictionary
    return labels["version"]

# TODO: update this once we figure out where the long format table will land

In [4]:
from dbcp.extract.helpers import cache_gcs_archive_file_locally

table_name = "fyi_projects_long_format"
version = get_bigquery_table_version("data_mart_dev", table_name)
uri = f"gs://dgm-outputs/{version}/data_mart/{table_name}.parquet"
data_cache = "/app/data/gcp_outputs"

fyi_projects_long_format_path = cache_gcs_archive_file_locally(uri, data_cache)
# Read the cached file using the path returned above
old_fyi_projects_long = pd.read_parquet(fyi_projects_long_format_path)

## Compare to LBNL + GridStatus ISO queue data

In [10]:
engine = get_sql_engine()
with engine.connect() as con:
    iso_projects_long_format = pd.read_sql_table("iso_projects_long_format", con, schema="data_mart")

In [11]:
iso_projects_long_format.queue_status.value_counts()

active    10350
Name: queue_status, dtype: int64

In [12]:
iso_projects_long_format.resource_clean.value_counts()

Solar                    4753
Battery Storage          3593
Onshore Wind              936
Natural Gas               469
Unknown                   274
Offshore Wind              68
Other                      52
Hydro                      46
Geothermal                 34
Oil                        30
Nuclear                    28
Coal                       22
Other Storage              17
Biofuel                    11
Pumped Storage              8
Municipal Solid Waste       4
Biomass                     4
Hydrogen                    1
Name: resource_clean, dtype: int64

In [13]:
fyi_projects_long_format.resource_clean.value_counts()

Solar              4788
Battery Storage    3145
Onshore Wind        968
Natural Gas         595
Other               558
Hydro                47
Nuclear              42
Geothermal           40
Oil                  25
Offshore Wind        24
Coal                 22
Biofuel              20
Pumped Storage       16
Biomass              12
Other Storage        11
Waste Heat            4
Name: resource_clean, dtype: int64

Compare the county coverage of the datasets

In [26]:
len(fyi_projects_long_format.county_id_fips.unique()), len(iso_projects_long_format.county_id_fips.unique())

(1890, 1912)

In [22]:
from dbcp.constants import FYI_RESOURCE_DICT
clean_resources = [resource for resource, codes_dict in FYI_RESOURCE_DICT.items() if codes_dict["type"] == "Renewable"]

In [27]:
len(fyi_projects_long_format[fyi_projects_long_format.resource_clean.isin(clean_resources)].county_id_fips.unique())

1818

In [28]:
len(iso_projects_long_format[iso_projects_long_format.resource_clean.isin(clean_resources)].county_id_fips.unique())

1850
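The counts above show how many counties each dataset covers but not how much the two sets overlap. A minimal sketch of an overlap check follows, using toy FIPS codes; `county_overlap` is a hypothetical helper. Note that `Series.unique()` includes missing values in its result, so the counts above may each include one null entry, whereas this sketch drops nulls first.

```python
import pandas as pd


def county_overlap(fyi_fips, iso_fips):
    """Count counties found in both datasets and in only one of them."""
    fyi = set(pd.Series(fyi_fips).dropna())
    iso = set(pd.Series(iso_fips).dropna())
    return {
        "both": len(fyi & iso),
        "fyi_only": len(fyi - iso),
        "iso_only": len(iso - fyi),
    }


# Toy FIPS codes, including one missing value
overlap = county_overlap(["01001", "01003", None], ["01003", "01005"])
print(overlap)
```

With the real tables, the inputs would be `fyi_projects_long_format.county_id_fips` and `iso_projects_long_format.county_id_fips`.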

Compare metrics between datasets for each ISO.

In [30]:
def agg_iso_projects_long_format(df, iso_col, id_col):
    """Calculate some aggregate metrics for each ISO"""
    agg = df.groupby(iso_col).agg({id_col: "count", "capacity_mw": "sum", "date_entered_queue": "max"})
    agg = agg.rename(columns={id_col: "n_projects", "capacity_mw": "total_capacity_mw", "date_entered_queue": "max_date_entered_queue"})
    return agg

fyi_project_agg = agg_iso_projects_long_format(fyi_projects_long_format, "power_market", "project_id")
iso_project_agg = agg_iso_projects_long_format(iso_projects_long_format, "iso_region", "surrogate_id")

In [31]:
fyi_project_agg.max_date_entered_queue

power_market
AESO        2025-08-05
CAISO       2025-02-12
ERCOT       2025-08-26
ISONE       2024-12-12
MISO        2025-09-18
NYISO       2025-06-03
PJM         2023-06-30
SPP         2025-09-15
Southeast   2025-08-22
West        2025-09-10
Name: max_date_entered_queue, dtype: datetime64[ns]

In [32]:
both_project_aggs = fyi_project_agg.merge(iso_project_agg, how="outer", left_index=True, right_index=True, validate="1:1", suffixes=("_fyi", "_iso"))
both_project_aggs

Unnamed: 0,n_projects_fyi,total_capacity_mw_fyi,max_date_entered_queue_fyi,n_projects_iso,total_capacity_mw_iso,max_date_entered_queue_iso
AESO,229,47993.074,2025-08-05,,,NaT
CAISO,658,196593.869732,2025-02-12,900.0,269052.636325,2023-04-17 00:00:00
ERCOT,1851,384004.98,2025-08-26,1793.0,386328.6,2025-09-23 00:00:00
ISONE,163,38726.153824,2024-12-12,95.0,19560.3305,2024-12-12 00:00:00
MISO,1777,333040.53,2025-09-18,1813.0,337477.89,2025-10-07 04:00:00
NYISO,402,126845.27,2025-06-03,354.0,54695.38,2025-09-02 00:00:00
PJM,1799,192009.0128,2023-06-30,1608.0,131084.1638,2023-06-30 00:00:00
SPP,708,152685.537,2025-09-15,763.0,168384.879,2025-10-02 00:00:00
Southeast,806,122282.093,2025-08-22,930.0,128113.102,2024-12-19 00:00:00
West,1924,471270.1439,2025-09-10,2026.0,383993.53,2024-12-30 00:00:00


In [33]:
# Calculate the relative difference between the FYI and GS + LBNL aggregates
for col in iso_project_agg.columns:
    if pd.api.types.is_datetime64_any_dtype(iso_project_agg[col]):
        continue
    else:
        both_project_aggs[f"{col}_pct_diff"] = (both_project_aggs[f"{col}_fyi"] - both_project_aggs[f"{col}_iso"]) / both_project_aggs[f"{col}_iso"]

Ideally the difference in capacity for each region is less than 20%. We expect more capacity in FYI than in GS + LBNL because the FYI data includes more utilities. Positive differences in the table below are not too worrying; negative differences are more concerning.

CAISO is updated by LBNL annually, not by quarterly GS updates, so this difference in update frequency can likely account for much of the difference in CAISO numbers.
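The 20% rule of thumb can be applied mechanically. The sketch below recomputes the capacity difference for four regions using the totals from the merged table above and flags those outside the threshold. The variable names and the `THRESHOLD_PCT` constant are illustrative assumptions, not part of the `dbcp` package.

```python
import pandas as pd

THRESHOLD_PCT = 20  # rule-of-thumb threshold from the text above

# Capacity totals copied from the merged FYI / GS + LBNL table
capacity = pd.DataFrame(
    {
        "total_capacity_mw_fyi": [384004.98, 196593.869732, 152685.537, 126845.27],
        "total_capacity_mw_iso": [386328.6, 269052.636325, 168384.879, 54695.38],
    },
    index=["ERCOT", "CAISO", "SPP", "NYISO"],
)

# Percent difference of FYI relative to GS + LBNL
capacity["pct_diff"] = (
    (capacity["total_capacity_mw_fyi"] - capacity["total_capacity_mw_iso"])
    / capacity["total_capacity_mw_iso"]
    * 100
)

# Regions outside the threshold in either direction
flagged = capacity[capacity["pct_diff"].abs() > THRESHOLD_PCT]
# Regions where FYI has much less capacity, which is the more worrying case
shortfalls = flagged[flagged["pct_diff"] < 0]

print(flagged["pct_diff"].round(1))
print(list(shortfalls.index))
```

For these four regions the check flags CAISO and NYISO, and only CAISO is a shortfall.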

In [34]:
iso_project_agg

both_project_aggs.sort_values(by="total_capacity_mw_iso", ascending=False)[["n_projects_pct_diff", "total_capacity_mw_pct_diff"]] * 100

Unnamed: 0,n_projects_pct_diff,total_capacity_mw_pct_diff
ERCOT,3.234802,-0.601462
West,-5.034551,22.728668
MISO,-1.985659,-1.314859
CAISO,-26.888889,-26.931075
SPP,-7.208388,-9.323487
PJM,11.878109,46.477658
Southeast,-13.333333,-4.551454
NYISO,13.559322,131.91222
ISONE,71.578947,97.983126
AESO,,


## Dig deeper into project level changes for regions with big differences in capacity

Start with ISOs where the FYI capacity is less than the GS capacity.

* Were projects that are not active in FYI withdrawn recently? Vice versa?

In [35]:
from dbcp.data_mart.projects import create_long_format

# The dataframe this function returns includes all projects, active, withdrawn and operational. ERCOT only tracks active projects.
iso_all_projects_long_format = create_long_format(engine, active_projects_only=False)

In [36]:
iso_region = "SPP"

fyi_iso = fyi_all_projects_long_format.query("power_market == @iso_region")
gs_lbnl_iso = iso_all_projects_long_format.query("iso_region == @iso_region")

In [37]:
fyi_iso.queue_status.value_counts()

withdrawn      1592
active          708
operational     253
suspended         7
Name: queue_status, dtype: int64

In [38]:
gs_lbnl_iso.queue_status.value_counts()

withdrawn      1572
active          858
operational     274
suspended         7
Name: queue_status, dtype: int64

In [39]:
fyi_iso.queue_id.is_unique

True

In [40]:
fyi_iso[fyi_iso.queue_id.duplicated(keep=False)].head(5)

Unnamed: 0,state,county,project_id,queue_id,date_proposed_online,developer,power_market,interconnection_status,point_of_interconnection,project_name,date_entered_queue,queue_status,iso,utility,is_actionable,is_nearly_certain,actual_completion_date,withdrawn_date,capacity_mw,resource_clean,state_id_fips,county_id_fips,frac_locations_in_county,source,state_permitting_type,co2e_tonnes_per_year,ordinance_earliest_year_mentioned,ordinance_jurisdiction_name,ordinance_jurisdiction_type,ordinance_text,ordinance_via_reldi,ordinance_via_solar_nrel,ordinance_via_wind_nrel,ordinance_via_nrel_is_de_facto,ordinance_via_self_maintained,ordinance_is_restrictive,is_hybrid,resource_class


In [41]:
len(gs_lbnl_iso[gs_lbnl_iso.queue_id.duplicated()])

138

In [42]:
active_gs = gs_lbnl_iso[gs_lbnl_iso.queue_status == "active"]

In [43]:
not_active_fyi = fyi_iso[fyi_iso.queue_status != "active"]

In [44]:
# look at projects active in GS which are not active in FYI
not_active_fyi[not_active_fyi.queue_id.isin(active_gs.queue_id)].queue_status.value_counts()

withdrawn      109
operational      1
Name: queue_status, dtype: int64

In [44]:
# make sure projects were withdrawn recently
not_active_fyi[not_active_fyi.queue_id.isin(active_gs.queue_id)].withdrawn_date.value_counts()

2025-08-25    73
2025-08-28    19
2025-09-09     2
2025-07-29     2
2025-09-05     1
2025-09-23     1
2025-08-11     1
2025-09-17     1
2025-07-22     1
2025-08-22     1
2025-09-18     1
2025-08-29     1
2025-07-31     1
2025-08-27     1
Name: withdrawn_date, dtype: int64

In [46]:
# does this missing capacity make up the difference in total capacity?
not_active_fyi[not_active_fyi.queue_id.isin(active_gs.queue_id)].capacity_mw.sum()/active_gs.capacity_mw.sum()

0.14109784462354874
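The status checks above look in one direction at a time. As a rough sketch, a cross-tabulation of statuses for queue IDs present in both datasets shows both directions at once. `status_crosstab` is a hypothetical helper and the data is a toy example. Because the GS + LBNL table has repeated queue IDs, the sketch keeps the first row per ID, which is a simplification.

```python
import pandas as pd


def status_crosstab(fyi, gs, id_col="queue_id", status_col="queue_status"):
    """Cross-tabulate FYI status against GS + LBNL status for shared queue IDs."""
    # Keep one row per queue ID; which duplicate is kept is arbitrary here
    fyi_status = fyi.drop_duplicates(id_col)[[id_col, status_col]]
    gs_status = gs.drop_duplicates(id_col)[[id_col, status_col]]
    merged = fyi_status.merge(
        gs_status, on=id_col, how="inner", suffixes=("_fyi", "_gs"), validate="1:1"
    )
    return pd.crosstab(merged[f"{status_col}_fyi"], merged[f"{status_col}_gs"])


# Toy data: B and C are withdrawn in FYI but still active in GS
fyi = pd.DataFrame(
    {
        "queue_id": ["A", "B", "C", "D"],
        "queue_status": ["active", "withdrawn", "withdrawn", "operational"],
    }
)
gs = pd.DataFrame(
    {
        "queue_id": ["A", "B", "C", "D", "D"],
        "queue_status": ["active", "active", "active", "operational", "operational"],
    }
)

table = status_crosstab(fyi, gs)
print(table)
```

Off-diagonal cells are the projects whose status differs between the two sources.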

In [45]:
# look at projects in GS which aren't in FYI
# it is likely that these projects were dropped during the deduplication cleaning
# step in the transform. You can spot check to make sure that a different project ID with the
# same interconnection point, capacity, resource etc. is in the data
active_gs[~active_gs.queue_id.isin(fyi_iso.queue_id)].sort_values(by="capacity_mw", ascending=False).head(5)

Unnamed: 0,state,county,queue_id,is_nearly_certain,project_id,project_name,capacity_mw,developer,entity,iso_region,utility,date_proposed_online,point_of_interconnection,is_actionable,resource_clean,queue_status,date_entered_queue,actual_completion_date,withdrawn_date,interconnection_status,state_id_fips,county_id_fips,frac_locations_in_county,source,state_permitting_type,co2e_tonnes_per_year,ordinance_earliest_year_mentioned,ordinance_jurisdiction_name,ordinance_jurisdiction_type,ordinance_text,ordinance_via_reldi,ordinance_via_solar_nrel,ordinance_via_wind_nrel,ordinance_via_nrel_is_de_facto,ordinance_via_self_maintained,ordinance_is_restrictive,is_hybrid,resource_class,surrogate_id
6198,New Mexico,Roosevelt,GEN-2024-304,False,42635,,500.0,,SPP,SPP,SPS,2029-12-31 00:00:00,Crossroads 345 kV Substation,True,Solar,active,2025-03-01,NaT,NaT,DISIS STAGE,35,35041,1.0,gridstatus,Local,,,,,,False,,,,False,False,False,renewable,6198
6388,New Mexico,Chaves,GEN-2024-199,False,42825,,500.0,,SPP,SPP,SPS,2028-12-04 00:00:00,Eddy County - Crossroads 345 kV Line,True,Onshore Wind,active,2025-03-01,NaT,NaT,DISIS STAGE,35,35005,1.0,gridstatus,Local,,,,,,False,,,,False,False,False,renewable,6388
4935,Oklahoma,Texas,GEN-2016-142,True,41351,,350.0,,SPP,SPP,AEP,2025-12-31 00:00:00,Riverside 345kV Substation,False,Onshore Wind,active,2016-11-29,NaT,NaT,IA FULLY EXECUTED/ON SCHEDULE,40,40139,1.0,gridstatus,Hybrid,,,,,,False,,,,,False,False,renewable,4935
9842,Texas,Hansford,GEN-2024-046,False,46337,,345.0,,SPP,SPP,SPS,2029-12-01 00:00:00,Hitchland-Moore 230 kV line,True,Solar,active,2024-10-30,NaT,NaT,DISIS STAGE,48,48195,1.0,gridstatus,Local,,,,,,False,,,,,False,False,renewable,9842
7571,Nebraska,Cass,GEN-2024-212,False,44036,,303.0,,SPP,SPP,OPPD,2035-02-01 00:00:00,Substation 3740 345 kV,True,Natural Gas,active,2025-03-01,NaT,NaT,DISIS STAGE,31,31025,1.0,gridstatus,Hybrid,166918.128068,,,,,False,False,False,False,True,True,False,fossil,7571
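The spot check described in the comment above can be partly automated by matching on project attributes rather than queue ID. This is a sketch only: `find_lookalikes` is a hypothetical helper, the match columns are an assumption about which attributes identify a project, and the FYI row below (including its queue ID) is invented for illustration. Exact matching on the interconnection point is brittle because the two sources may spell it differently.

```python
import pandas as pd

MATCH_COLS = ["point_of_interconnection", "capacity_mw", "resource_clean"]


def find_lookalikes(missing, candidates, match_cols=MATCH_COLS, id_col="queue_id"):
    """Pair rows that share all match columns but have different queue IDs."""
    left = missing[[id_col] + match_cols].dropna(subset=match_cols)
    right = candidates[[id_col] + match_cols].dropna(subset=match_cols)
    pairs = left.merge(right, on=match_cols, suffixes=("_gs", "_fyi"))
    return pairs[pairs[f"{id_col}_gs"] != pairs[f"{id_col}_fyi"]]


# A GS project that is absent from FYI (attributes taken from the table above)
gs_missing = pd.DataFrame(
    {
        "queue_id": ["GEN-2024-304"],
        "point_of_interconnection": ["Crossroads 345 kV Substation"],
        "capacity_mw": [500.0],
        "resource_clean": ["Solar"],
    }
)
# Hypothetical FYI rows; the first shares the attributes under a different ID
fyi_rows = pd.DataFrame(
    {
        "queue_id": ["HYPOTHETICAL-001", "HYPOTHETICAL-002"],
        "point_of_interconnection": ["Crossroads 345 kV Substation", "Some Other Substation"],
        "capacity_mw": [500.0, 120.0],
        "resource_clean": ["Solar", "Onshore Wind"],
    }
)

lookalikes = find_lookalikes(gs_missing, fyi_rows)
print(lookalikes)
```

Projects with no lookalike are the ones worth tracing through the deduplication step by hand.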


# Everything below here is the wild west

1. Go through the rest of this notebook and compare to GS
2. Update validation doc
3. Look at data mart to-do list and check off
4. Look at GS update notebook and add FYI update cells.

In [46]:
iso_region = "NYISO"

fyi_iso = fyi_all_projects_long_format.query("power_market == @iso_region")
gs_lbnl_iso = iso_all_projects_long_format.query("iso_region == @iso_region")

In [47]:
non_active_gs = gs_lbnl_iso[gs_lbnl_iso.queue_status != "active"]

In [48]:
# look at projects in FYI which are not active in GS
active_in_fyi_inactive_in_gs = fyi_iso[(fyi_iso.queue_status == "active") & (fyi_iso.queue_id.isin(non_active_gs.queue_id))]

In [50]:
active_in_fyi_inactive_in_gs

Unnamed: 0,state,county,project_id,queue_id,date_proposed_online,developer,power_market,interconnection_status,point_of_interconnection,project_name,date_entered_queue,queue_status,iso,utility,is_actionable,is_nearly_certain,actual_completion_date,withdrawn_date,capacity_mw,resource_clean,state_id_fips,county_id_fips,frac_locations_in_county,source,state_permitting_type,co2e_tonnes_per_year,ordinance_earliest_year_mentioned,ordinance_jurisdiction_name,ordinance_jurisdiction_type,ordinance_text,ordinance_via_reldi,ordinance_via_solar_nrel,ordinance_via_wind_nrel,ordinance_via_nrel_is_de_facto,ordinance_via_self_maintained,ordinance_is_restrictive,is_hybrid,resource_class


In [56]:
# look at projects that are active in FYI and not in GS
active_in_fyi_not_in_gs = fyi_iso[
    (fyi_iso.queue_status == "active") & 
    (~fyi_iso.queue_id.isin(gs_lbnl_iso.queue_id)) &
    (~fyi_iso.capacity_mw.isnull())
]

In [57]:
active_in_fyi_not_in_gs.sort_values(by="capacity_mw", ascending=False)

Unnamed: 0,state,county,project_id,queue_id,date_proposed_online,developer,power_market,interconnection_status,point_of_interconnection,project_name,date_entered_queue,queue_status,iso,utility,is_actionable,is_nearly_certain,actual_completion_date,withdrawn_date,capacity_mw,resource_clean,state_id_fips,county_id_fips,frac_locations_in_county,source,state_permitting_type,co2e_tonnes_per_year,ordinance_earliest_year_mentioned,ordinance_jurisdiction_name,ordinance_jurisdiction_type,ordinance_text,ordinance_via_reldi,ordinance_via_solar_nrel,ordinance_via_wind_nrel,ordinance_via_nrel_is_de_facto,ordinance_via_self_maintained,ordinance_is_restrictive,is_hybrid,resource_class
19545,New York,Delaware,nyiso-1288,1288,2028-12-01,,NYISO,System Impact Study,Fraser 345 kV and Rainey 345 kV substations,CPNY-X,2021-11-23,active,nyiso,"NYSEG, ConEd",True,False,NaT,NaT,1300.0,Other,36.0,36025.0,1.0,fyi,Hybrid,,,,,,False,,,,,False,False,fossil
18921,,,nyiso-0631,0631,2026-05-01,Transmission Developers Inc.,NYISO,System Impact Study,Astoria Annex 345kV,NS Power Express,2017-05-02,active,nyiso,NYPA,True,False,NaT,NaT,1000.0,Other,,,1.0,fyi,,,,,,,False,False,True,True,,True,False,fossil
20287,New York,Chautauqua,nyiso-c24-292,C24-292,2030-12-01,BayWa r.e.,NYISO,,Dunkirk to Falconer Line 160 115 kV,"Loganberry Energy Storage, LLC",2024-09-27,active,nyiso,NM-NG,False,False,NaT,NaT,600.0,Battery Storage,36.0,36013.0,1.0,fyi,Hybrid,,2019.0,Portland,city,"In June 2019, the Town of Portland adopted a m...",True,False,True,True,,True,False,storage
19160,,,nyiso-0887,0887,2026-05-01,Transmission Developers Inc.,NYISO,System Impact Study,Astoria Annex 345kV,CH Uprate,2019-06-12,active,nyiso,NYPA,True,False,NaT,NaT,250.0,Other,,,1.0,fyi,,,,,,,False,False,True,True,,True,False,fossil
19977,New York,Cortland,nyiso-276,276,2026-04-01,EDF Renewables,NYISO,Facility Study,Cortland - Fenner 115kV,Homer Solar Energy Center,2008-01-30,active,nyiso,NM-NG,True,False,NaT,NaT,90.0,Solar,36.0,36023.0,1.0,fyi,Hybrid,,,,,,False,,,,,False,False,renewable


In [63]:
# see if these projects have duplicate entries under a different queue ID in GS
# it's worth checking the raw data to see if these projects get dropped during deduplication
gs_lbnl_iso[gs_lbnl_iso.project_name == "CPNY-X"]

Unnamed: 0,state,county,queue_id,is_nearly_certain,project_id,project_name,capacity_mw,developer,entity,iso_region,utility,date_proposed_online,point_of_interconnection,is_actionable,resource_clean,queue_status,date_entered_queue,actual_completion_date,withdrawn_date,interconnection_status,state_id_fips,county_id_fips,frac_locations_in_county,source,state_permitting_type,co2e_tonnes_per_year,ordinance_earliest_year_mentioned,ordinance_jurisdiction_name,ordinance_jurisdiction_type,ordinance_text,ordinance_via_reldi,ordinance_via_solar_nrel,ordinance_via_wind_nrel,ordinance_via_nrel_is_de_facto,ordinance_via_self_maintained,ordinance_is_restrictive,is_hybrid,resource_class,surrogate_id


In [71]:
# it's worth checking the most recent raw data to see if these projects get dropped during deduplication
raw_gs = pd.read_parquet("/app/data/data_cache/gridstatus/interconnection_queues/parquet/nyiso.parquet#1761671632253516")

In [81]:
# if they appear in the raw data, it's worth tracing through the ETL with the debugger
# to see where the project gets dropped
raw_gs[raw_gs["Queue ID"] == "1288"]

Unnamed: 0,Queue ID,Project Name,Interconnecting Entity,County,State,Interconnection Location,Transmission Owner,Generation Type,Capacity (MW),Summer Capacity (MW),Winter Capacity (MW),Queue Date,Status,Proposed Completion Date,Withdrawn Date,Withdrawal Comment,Actual Completion Date,Proposed In-Service Date,Proposed Initial-Sync Date,Last Updated Date,Z,S,Availability of Studies,SGIA Tender Date
111,1288,CPNY-X,,"Delaware, Queens",NY,Fraser 345 kV and Rainey 345 kV substations,"NYSEG, ConEd",DC Transmission,1300.0,1300.0,1300.0,2021-11-23,Active,2028-12-01,,,NaT,NaT,NaT,2025-01-31,"E,J",10.0,"SRIS, FS",NaT


In [76]:
active_in_fyi_not_in_gs.resource_clean.value_counts()

Other              3
Solar              1
Battery Storage    1
Name: resource_clean, dtype: int64