# Allocation of Demand based on Disaggregation and Re-aggregation of Data

This notebook disaggregates demand from the FERC-714 dataset in proportion to population from census data, then re-aggregates it to other geometries such as ReEDS balancing areas, counties, or states.

## Datasets and inputs used

1. FERC-714 form: Energy sales timeseries data for every planning area

2. 2010 US Census data: Census tract geometries, and tract-level population and characteristics

3. ReEDS balancing geometries: ReEDS geometries containing county-level data

4. US Planning Areas: Contains 97 planning area geometries

## Core functions

1. Function to find intersection of the large and small geometries

2. Function to normalize and redistribute from area to another attribute e.g. population

3. Map functions:

4. Timeseries functions: Functions for allocation
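
As a rough illustration of how the redistribution and allocation steps fit together, the sketch below splits one planning area's demand across tracts in proportion to the population that falls inside the area, assuming population is spread uniformly within each tract. The function name `allocate_by_population` and all numbers are hypothetical; this is plain Python for illustration, not the `pudl.analysis.demand_mapping` implementation.

```python
def allocate_by_population(demand, tracts):
    """Split one planning area's demand across tracts by overlapping population.

    `tracts` maps tract ID -> (population, fraction of tract area inside the area).
    """
    # Population of each tract inside the planning area, assuming uniform density
    pop_inside = {t: pop * frac for t, (pop, frac) in tracts.items()}
    total = sum(pop_inside.values())
    # Normalize so the shares sum to one, then scale the demand
    return {t: demand * p / total for t, p in pop_inside.items()}


# Hypothetical inputs: three tracts, one of them only half inside the planning area
tracts = {
    "29095013405": (2000, 1.0),
    "40109101500": (1000, 0.5),
    "27131070800": (500, 1.0),
}
allocation = allocate_by_population(300.0, tracts)
```

The same weights can be applied to every hour of a demand timeseries, since they depend only on geometry and population.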

## Disposable parts of analysis

1. Cells doing auxiliary analysis like multiple-counted areas, visualizations for state, county and census tracts 

## Intermediate and Final Datasets

Intermediate datasets are required to limit the number of time-consuming calculations. Primarily, the overlay calculation takes excessive time.
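
One way to avoid repeating a slow step such as the overlay is to cache its result on disk. The sketch below is a minimal example under that assumption; the helper `cached`, the stand-in `slow_overlay`, and the file name are hypothetical and not part of PUDL.

```python
import pickle
import tempfile
from pathlib import Path


def cached(path, compute):
    """Return the pickled result at `path`, computing and saving it on first use."""
    path = Path(path)
    if path.exists():
        with path.open("rb") as f:
            return pickle.load(f)
    result = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(result, f)
    return result


calls = []


def slow_overlay():
    """Stand-in for an expensive calculation; records each time it actually runs."""
    calls.append(1)
    return {"tract_a": 0.25}


cache_file = Path(tempfile.mkdtemp()) / "intermediate" / "overlay.pkl"
first = cached(cache_file, slow_overlay)   # computes and writes the cache
second = cached(cache_file, slow_overlay)  # reads the cache, no recomputation
```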

### Intermediate Datasets

Area mapping of planning area with census tracts.

### Final Datasets

ReEDS-aggregated demand timeseries data

## Workflow

1. There are generally two sets of geometries: one with small, non-intersecting primary geometries such as census tracts, and one with larger geometries that may intersect one another, such as planning areas.

### Disaggregation of data

1. 

### Reaggregation of data

1. 


In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
import sys
import os
import pathlib
import time
import requests
import json
import datetime
import pickle
import zipfile

import pandas as pd
import numpy as np
import scipy.stats

%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from matplotlib import colors
from matplotlib.legend import Legend
import matplotlib.patches as mpatches
import seaborn as sns

import pyproj
import geopandas as gpd
from shapely.geometry import Point
from shapely.ops import unary_union
import geopandas
import fiona
from geopandas import GeoDataFrame
import addfips

import pudl
from pudl.analysis.demand_mapping import (create_intersection_matrix,
                                          create_stacked_intersection_df,
                                          extract_multiple_tracts_demand_ratios,
                                          extract_time_series_demand_multiple_tracts,
                                          matrix_linear_scaling)

from tqdm import tqdm
tqdm.pandas()



In [3]:
sns.set()
%matplotlib inline
mpl.rcParams['figure.figsize'] = (10,4)
mpl.rcParams['figure.dpi'] = 150
pd.options.display.max_columns = 100
pd.options.display.max_rows = 100

In [4]:
import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)
handler = logging.StreamHandler(stream=sys.stdout)
log_format = '%(asctime)s [%(levelname)8s] %(name)s:%(lineno)s %(message)s'
formatter = logging.Formatter(log_format)
handler.setFormatter(formatter)
logger.handlers = [handler]

In [5]:
pudl_settings = pudl.workspace.setup.get_defaults()
pudl_settings
# Read the EIA API key from the environment rather than hard-coding it in the notebook
API_KEY_EIA = os.environ.get("API_KEY_EIA")

# Obtain non-PUDL data
Some of the data we're using for this analysis has not yet been fully integrated into PUDL, so we are managing it ad-hoc in a directory at `pudl_settings["data_dir"]/local`

## FERC Form 714
* Download the raw file if we don't have it already.
* Run the draft Extract and Transform steps on it.
* Merge the respondent ID and Planning Area Hourly Demand dataframes.

In [6]:
%%time
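def download_zip_url(url, save_path, chunk_size=128):
    """Stream the file at ``url`` to ``save_path`` in chunks of ``chunk_size`` bytes."""
    r = requests.get(url, stream=True)
    # Fail early on HTTP errors instead of saving an error page as a zip file
    r.raise_for_status()
    with save_path.open(mode='wb') as fd:
        for chunk in r.iter_content(chunk_size=chunk_size):
            fd.write(chunk)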

local_data = pathlib.Path(pudl_settings["data_dir"]) / "local"

CPU times: user 36 µs, sys: 4 µs, total: 40 µs
Wall time: 41.7 µs


In [7]:
# %%time

# ferc714_url = "https://www.ferc.gov/docs-filing/forms/form-714/data/form714-database.zip"
# ferc714_dir = local_data / "ferc714"
# ferc714_dir.mkdir(parents=True, exist_ok=True)

# ferc714_save_path = ferc714_dir / "ferc714.zip"
# if ferc714_save_path.exists():
#     logger.info("Already have FERC 714 data, not downloading.")
# else:
#     logger.info("Downloading fresh FERC 714 data.")
#     download_zip_url(ferc714_url, ferc714_save_path)

# raw_ferc714_dfs = pudl.extract.ferc714.extract(pudl_settings=pudl_settings)
# tfr_ferc714_dfs = pudl.transform.ferc714.transform(raw_ferc714_dfs)
# pa_demand_ferc714_df = pd.merge(
#     tfr_ferc714_dfs["demand_hourly_pa_ferc714"],
#     tfr_ferc714_dfs["respondent_id_ferc714"]
# )

## US Census Tract Geometries and Population
* This US census tract data comes from: http://www2.census.gov/geo/tiger/TIGER2010DP1/Profile-County_Tract.zip

In [8]:
# %%time
# esri_dir = local_data / "esri"
# esri_dir.mkdir(parents=True, exist_ok=True)
# esri_tract_path = esri_dir / "USA_Census_Tract_Boundaries/v10/tracts.gdb"
# census_tract_gdf = (
#     gpd.read_file(esri_tract_path, driver='FileGDB', layer='tracts')
#     # THIS CREATES INVALID FIPS CODES: LEADING ZEROS ARE REQUIRED.
#     .assign(STATE_FIPS=lambda x: pd.to_numeric(x.STATE_FIPS))
#     # Remove all islands and non-mainland states and territories
#     .query("STATE_FIPS<=56 & STATE_FIPS not in (2, 15, 44)")
#     # Project to US Albers conic equal-area projection
#     .to_crs("ESRI:102003")
# )
# census_tract_gdf.sample(5)

# Please don't store FIPS codes as integers
* The FIPS codes are strings because of the leading zeroes
* Turning them into integers but still calling them FIPS codes will cause bugs later, when someone expects them to be normal FIPS codes.
* If you need to use them in a numeric context, you can cast them to `int` just in the context of the comparison
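
A small plain-Python sketch of the point above, using a hypothetical list of state codes: the stored values stay as zero-padded strings, and the cast to `int` happens only inside the comparison.

```python
state_fips = ["01", "06", "02", "56", "72"]

# Round-tripping through int loses the leading zero
as_int = [int(code) for code in state_fips]
broken = [str(n) for n in as_int]

# Cast only inside the comparison; the stored codes remain strings
excluded = {2, 15}  # Alaska and Hawaii
kept = [code for code in state_fips
        if int(code) <= 56 and int(code) not in excluded]

# If integers are all you have, zero padding restores the two-digit form
restored = [f"{n:02d}" for n in as_int]
```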

In [9]:
%%time
uscb_census2010_url = "http://www2.census.gov/geo/tiger/TIGER2010DP1/Profile-County_Tract.zip"
uscb_census2010_dir = local_data / "uscb" / "census2010"
uscb_census2010_dir.mkdir(parents=True, exist_ok=True)

uscb_census2010_zipfile = uscb_census2010_dir / "census2010.zip"
uscb_census2010_gdb_dir = uscb_census2010_dir / "census2010.gdb"

if not uscb_census2010_gdb_dir.is_dir():
    logger.info("No Census GeoDB found. Downloading from US Census Bureau.")
    # Download to appropriate location
    download_zip_url(uscb_census2010_url, uscb_census2010_zipfile)
    # Unzip because we can't use zipfile paths with geopandas
    with zipfile.ZipFile(uscb_census2010_zipfile, 'r') as zip_ref:
        zip_ref.extractall(uscb_census2010_dir)
        # Grab the UUID based directory name so we can change it:
        extract_root = uscb_census2010_dir / pathlib.Path(zip_ref.filelist[0].filename).parent
    extract_root.rename(uscb_census2010_gdb_dir)
else:
    logger.info("We've already got the 2010 Census GeoDB.")

logger.info("Extracting the GeoDB into a GeoDataFrame")
census_tract_gdf = gpd.read_file(uscb_census2010_gdb_dir, driver='FileGDB', layer='Tract_2010Census_DP1')

## Creating columns for county and state level aggregation
census_tract_gdf["STATE_FIPS"] = census_tract_gdf["GEOID10"].str[:2]
census_tract_gdf["STATE_FIPS_int"] = pd.to_numeric(census_tract_gdf["GEOID10"].str[:2])
census_tract_gdf["STCOFIPS"] = census_tract_gdf["GEOID10"].str[:5]

census_tract_gdf = (
    census_tract_gdf
    # Remove all islands and non-mainland states and territories
    # (FIPS 02 = Alaska, 15 = Hawaii; 44 is Rhode Island, a mainland state, so it is kept)
    .query("STATE_FIPS_int<=56 & STATE_FIPS_int not in (2, 15)")
    # Project to US Albers conic equal-area projection
    .to_crs("ESRI:102003")
)
census_tract_gdf["SQMI"] = census_tract_gdf.area / 10 ** 6 / 1.60934 ** 2
census_tract_gdf = census_tract_gdf.rename(columns={
    "DP0010001": "POPULATION",
    "GEOID10": "FIPS"
})
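# NOTE: drop() returns a new frame; the result is not assigned here, so STATE_FIPS_int remains in census_tract_gdf
census_tract_gdf.drop("STATE_FIPS_int", axis=1)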
census_tract_gdf.sample(5)

2020-06-01 20:42:01,530 [    INFO] root:19 We've already got the 2010 Census GeoDB.
2020-06-01 20:42:01,531 [    INFO] root:21 Extracting the GeoDB into a GeoDataFrame
CPU times: user 46.9 s, sys: 849 ms, total: 47.8 s
Wall time: 47.7 s


Unnamed: 0,FIPS,NAMELSAD10,ALAND10,AWATER10,INTPTLAT10,INTPTLON10,POPULATION,DP0010002,DP0010003,DP0010004,DP0010005,DP0010006,DP0010007,DP0010008,DP0010009,DP0010010,DP0010011,DP0010012,DP0010013,DP0010014,DP0010015,DP0010016,DP0010017,DP0010018,DP0010019,DP0010020,DP0010021,DP0010022,DP0010023,DP0010024,DP0010025,DP0010026,DP0010027,DP0010028,DP0010029,DP0010030,DP0010031,DP0010032,DP0010033,DP0010034,DP0010035,DP0010036,DP0010037,DP0010038,DP0010039,DP0010040,DP0010041,DP0010042,DP0010043,DP0010044,...,DP0120015,DP0120016,DP0120017,DP0120018,DP0120019,DP0120020,DP0130001,DP0130002,DP0130003,DP0130004,DP0130005,DP0130006,DP0130007,DP0130008,DP0130009,DP0130010,DP0130011,DP0130012,DP0130013,DP0130014,DP0130015,DP0140001,DP0150001,DP0160001,DP0170001,DP0180001,DP0180002,DP0180003,DP0180004,DP0180005,DP0180006,DP0180007,DP0180008,DP0180009,DP0190001,DP0200001,DP0210001,DP0210002,DP0210003,DP0220001,DP0220002,DP0230001,DP0230002,Shape_Length,Shape_Area,geometry,STATE_FIPS,STATE_FIPS_int,STCOFIPS,SQMI
37780,29095013405,Census Tract 134.05,27013810.0,172736.0,38.868383,-94.575924,2089,258,182,136,145,178,152,112,77,69,105,108,107,134,92,80,82,45,27,921,128,86,74,79,57,49,45,38,29,48,54,53,51,38,31,33,20,8,1168,130,96,62,66,121,...,74,54,20,0,0,0,861,592,312,306,89,33,18,253,205,269,229,87,20,142,65,330,237,2.34,2.78,951,861,90,29,10,14,4,5,28,2.8,6.9,861,478,383,1000,1015,2.09,2.65,0.261734,0.002822,"MULTIPOLYGON (((123498.903 154145.756, 123504....",29,29,29095,10.496837
53603,40109101500,Census Tract 1015,1914099.0,0.0,35.4878023,-97.5032497,1656,95,75,76,107,99,142,121,88,113,140,159,178,91,51,46,23,21,31,847,41,41,50,67,48,63,60,51,56,77,85,95,43,31,15,11,7,6,809,54,34,26,40,51,...,36,35,1,42,35,7,754,349,119,185,62,32,5,132,52,405,334,174,23,160,38,161,150,2.09,3.01,988,754,234,47,1,10,22,7,147,2.7,10.2,754,340,414,796,782,2.34,1.89,0.057267,0.00019,"MULTIPOLYGON (((-136006.914 -223702.072, -1359...",40,40,40109,0.739043
35987,27131070800,Census Tract 708,9161527.0,0.0,44.2714413,-93.2773605,7296,589,545,529,497,353,473,470,451,474,527,531,457,378,297,264,205,132,124,3552,302,283,266,256,166,243,231,230,236,261,247,225,178,139,109,87,55,38,3744,287,262,263,241,187,...,0,0,0,82,38,44,2804,1910,953,1457,640,140,93,313,220,894,740,297,81,443,263,1016,760,2.57,3.1,2992,2804,188,90,0,53,11,8,26,2.4,11.6,2804,2115,689,5625,1589,2.66,2.31,0.157192,0.001033,"MULTIPOLYGON (((217705.390 762303.250, 217693....",27,27,27131,3.537303
55609,42003202200,Census Tract 2022,881778.0,0.0,40.4540547,-80.0636247,2568,158,142,173,172,180,166,139,164,176,206,210,184,145,94,96,66,65,32,1218,97,74,93,86,101,78,59,80,88,82,100,79,61,47,39,28,23,3,1350,61,68,80,86,79,...,0,0,0,0,0,0,1085,677,270,364,123,57,22,256,125,408,346,145,27,201,84,322,276,2.37,2.98,1245,1085,160,41,5,27,5,3,79,3.7,9.5,1085,698,387,1660,908,2.38,2.35,0.053674,9.4e-05,"MULTIPOLYGON (((1333836.466 443651.470, 133384...",42,42,42003,0.340457
20175,13057090902,Census Tract 909.02,18756256.0,252422.0,34.0882297,-84.4491443,13319,1222,966,788,624,478,996,1402,1377,1073,920,881,733,719,434,298,172,143,93,6389,615,502,395,308,239,455,672,708,539,457,416,306,302,181,130,77,52,35,6930,607,464,393,316,239,...,0,0,0,0,0,0,5154,3639,1840,2968,1497,186,88,485,255,1515,1207,433,45,774,224,1959,851,2.58,3.08,5407,5154,253,44,3,107,21,24,54,2.3,6.7,5154,4540,614,11600,1719,2.56,2.8,0.19648,0.001857,"MULTIPOLYGON (((1052360.163 -319621.329, 10522...",13,13,13057,7.339327


## Electricity Planning Area Geometries
* For now planning areas come from this DHS open dataset: https://hifld-geoplatform.opendata.arcgis.com/datasets/electric-planning-areas

In [10]:
import zipfile
hifld_pa_url = "https://opendata.arcgis.com/datasets/7d35521e3b2c48ab8048330e14a4d2d1_0.gdb"
hifld_dir = local_data / "hifld"
hifld_dir.mkdir(parents=True, exist_ok=True)
hifld_pa_zipfile = hifld_dir / "electric_planning_areas.gdb.zip"
hifld_pa_gdb_dir = hifld_dir / "electric_planning_areas.gdb"
if not hifld_pa_gdb_dir.is_dir():
    logger.info("No Planning Area GeoDB found. Downloading from HIFLD.")
    # Download to appropriate location
    download_zip_url(hifld_pa_url, hifld_pa_zipfile)
    # Unzip because we can't use zipfile paths with geopandas
    with zipfile.ZipFile(hifld_pa_zipfile, 'r') as zip_ref:
        zip_ref.extractall(hifld_dir)
        # Grab the UUID based directory name so we can change it:
        extract_root = hifld_dir / pathlib.Path(zip_ref.filelist[0].filename).parent
    extract_root.rename(hifld_pa_gdb_dir)
else:
    logger.info("We've already got the planning area GeoDB.")

logger.info("Extracting the GeoDB into a GeoDataFrame")
epas_gdf = pudl.transform.ferc714.electricity_planning_areas(pudl_settings)
logger.info("Dropping Planning Areas in AK and HI.")
# * 3522 = Chugach Electric Association (AK)
# * 19547 = Hawaii Electric Co 
logger.info("Reprojecting to US Albers Conic Equal Area projection.")
ak_hi_planning_area_ids = [3522, 19547]
epas_gdf = (
    epas_gdf.query("ID not in @ak_hi_planning_area_ids")
    # Project to US Albers conic equal-area projection
    .to_crs("ESRI:102003")
)

# Repair invalid geometries with a zero-width buffer
epas_gdf["geometry"] = epas_gdf.buffer(0)

2020-06-01 20:42:49,270 [    INFO] root:18 We've already got the planning area GeoDB.
2020-06-01 20:42:49,271 [    INFO] root:20 Extracting the GeoDB into a GeoDataFrame
2020-06-01 20:42:51,069 [    INFO] root:22 Dropping Planning Areas in AK and HI.
2020-06-01 20:42:51,069 [    INFO] root:25 Reprojecting to US Albers Conic Equal Area projection.


## State Annual Energy Sales for Residential, Industrial Comparison
* The sales data for 2018 comes from https://www.eia.gov/electricity/data/state/sales_annual.xlsx
* The mapping between state FIPS codes and US state abbreviations is available at https://www.nrcs.usda.gov/wps/portal/nrcs/detail/?cid=nrcs143_013696

# To add state FIPS codes, use the `addfips` package
* This is part of the `pudl-dev` conda environment.
* We're already using it to add state & county FIPS codes to the EIA 860 and 861 data

# In general, try not to make anything year-specific
* We may be focusing on a particular year right now, but in the future we'll probably want to do other years, or more years.
* The year or range of years that we're looking at is the kind of thing that should be a notebook parameter
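
A minimal sketch of what a year parameter could look like, in plain Python with hypothetical records; the names `select_years` and `reporting_years` are illustrative only.

```python
def select_years(rows, years):
    """Keep only the records whose "Year" is in `years`."""
    years = set(years)
    return [row for row in rows if row["Year"] in years]


# Hypothetical notebook parameter: change it here rather than inside each cell
reporting_years = range(2017, 2019)

sales = [
    {"Year": 2016, "State": "CO", "Total": 1.0},
    {"Year": 2017, "State": "CO", "Total": 2.0},
    {"Year": 2018, "State": "CO", "Total": 3.0},
]
selected = select_years(sales, reporting_years)
```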

In [11]:
# reporting_year = 2018
# sales_df = pd.read_excel("https://www.eia.gov/electricity/data/state/sales_annual.xlsx", skiprows=[0])
# sales_df = sales_df[sales_df["Year"]==reporting_year]
# sales_df = sales_df[~sales_df["State"].isin(["US", "HI", "AK", "DC"])]
# # print(sales_df["Industry Sector Category"].unique())
# sales_df = sales_df[sales_df["Industry Sector Category"]=="Total Electric Industry"]
# sales_df["FIPS"] = sales_df["State"].map(addfips.AddFIPS().get_state_fips)


# states_fips_lookup = pd.read_html("https://www.nrcs.usda.gov/wps/portal/nrcs/detail/?cid=nrcs143_013696")[0].iloc[:-1]
# states_fips_lookup = states_fips_lookup.astype({
#     "FIPS": "int32"
# })

# states_fips_lookup["FIPS"] = states_fips_lookup["FIPS"].map("{0:02d}".format)

# # states_fips_lookup["FIPS"] = states_fips_lookup["FIPS"].apply(lambda x: int(x))

# sales_df = sales_df[["State", "Total", "FIPS","Residential", "Commercial", "Industrial"]]

# sales_df.head()

## State Monthly Energy Sales from EIA
* The state monthly sales from EIA can be extracted using the EIA API
* A file containing the API key is required. It should be located in `PUDL_IN/data/local/eia/api_key.txt` 

In [12]:
# eia_dir = pathlib.Path(pudl_settings["data_dir"]) / "local" / "eia"

# sales_data = (
#     json.loads(
#         requests.get("http://api.eia.gov/category/?api_key="+API_KEY_EIA+"&category_id=38").text)
# )

# sales_data_urls = [a["series_id"]
#                    for a in sales_data["category"]["childseries"]
#                    if a["series_id"][-1] == "M"]

# sales_data_states_names = [a["name"][30:-23]
#                            for a in sales_data["category"]["childseries"]
#                            if a["series_id"][-1] == "M"]


# df_sales_eia = []

# for index in tqdm(range(len(sales_data_urls))):
#     df_sales_eia.append(pd.DataFrame((json.loads(requests
#             .get("http://api.eia.gov/series/?api_key="+API_KEY_EIA+"&series_id=" +
#                  sales_data_urls[index])
#             .text))["series"][0]["data"])
#                         .rename(columns={0: "utc_datetime", 1: sales_data_states_names[index]})
#                         .set_index("utc_datetime"))
    
    
# df_sales_eia = pd.concat(df_sales_eia, axis=1).reset_index()
# df_sales_eia["utc_datetime"] = pd.to_numeric(df_sales_eia["utc_datetime"])


# df_sales_eia_2018 = df_sales_eia[df_sales_eia["utc_datetime"] // 100 == 2018]
# df_sales_eia_2018 = (df_sales_eia_2018.set_index("utc_datetime")
#                      .stack()
#                      .reset_index()
#                      .rename(columns={"level_1": "State", 0: "GWh"}))

# df_sales_eia_2018["State"] = df_sales_eia_2018["State"].str.strip()


# ## Scaling Demand Time Series by a constant factor to make the total sums equal
# # df_check = df_state_sales_2018.merge(df_sales_eia_2018, how="inner")

# # df_check["GWh_adjusted"] = df_check["GWh_allocated"] * df_check["GWh"].sum() / df_check["GWh_allocated"].sum()

## NREL ReEDS Geometries
* The geometries should be located in `PUDL_IN/data/local/nrel/reeds`

In [13]:
# %%time
# reeds_path = local_data / "nrel/reeds"
# if not reeds_path.is_dir():
#     raise FileNotFoundError(
#         f"ReEDS Balancing Area geometries not found."
#         f"Expected them at {reeds_path}"
#     )
# reeds_gdf = gpd.read_file(reeds_path)
# reeds_gdf = (
#     reeds_gdf.assign(pca_num=lambda x: pd.to_numeric(x.pca.replace("^p", value="", regex=True)))
#     .query("pca_num<=134")
#     .to_crs("ESRI:102003")
# )
# reeds_gdf.sample(10)

# Dissolving census tracts to county level

In [14]:
%%time
county_gdf = census_tract_gdf[["STCOFIPS", "POPULATION", "geometry"]].dissolve(by="STCOFIPS",
                                                                             aggfunc=np.sum,
                                                                             as_index=False)

county_gdf = county_gdf[["STCOFIPS", "POPULATION", "geometry"]]

CPU times: user 2min 55s, sys: 0 ns, total: 2min 55s
Wall time: 2min 55s


# Function for complete disjoint geometries 
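
The idea behind the function in the next cell can be sketched without any GIS library by treating each geometry as a set of unit grid cells, so that set difference, intersection and union stand in for the polygon operations. This is a simplified analogue under that assumption, with hypothetical region names; every resulting piece is keyed by the set of IDs that cover it.

```python
def rect(x0, y0, x1, y1):
    """Unit grid cells covered by an axis-aligned rectangle (a stand-in for a polygon)."""
    return frozenset((x, y) for x in range(x0, x1) for y in range(y0, y1))


def complete_disjoint(regions):
    """Partition overlapping regions into disjoint pieces keyed by the covering IDs."""
    pieces = {}             # frozenset of IDs -> cells covered by exactly those IDs
    covered = frozenset()   # running union of every region processed so far
    for region_id, cells in regions.items():
        new_pieces = {}
        for ids, piece in pieces.items():
            outside = piece - cells   # part of the old piece the new region misses
            inside = piece & cells    # part the new region also covers
            if outside:
                new_pieces[ids] = outside
            if inside:
                new_pieces[ids | {region_id}] = inside
        sole = cells - covered        # part of the new region nothing else covers
        if sole:
            new_pieces[frozenset([region_id])] = sole
        pieces = new_pieces
        covered = covered | cells
    return pieces


regions = {"A": rect(0, 0, 4, 4), "B": rect(2, 0, 6, 4), "C": rect(3, 3, 5, 7)}
pieces = complete_disjoint(regions)
# Area of each piece, keyed by the sorted IDs that cover it
areas = {tuple(sorted(ids)): len(cells) for ids, cells in pieces.items()}
```

Because the pieces are disjoint, their areas sum to the area of the union, which gives a quick consistency check on the result.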

In [21]:
%%time

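def edit_id_set(row, ID):
    """Add ``ID`` to the row's ID set if the row is an intersection piece."""
    if row["geom_type"] == "geometry_new_int":
        return frozenset(list(row["ID"]) + [ID])
    else:
        return row["ID"]

def complete_disjoint_geoms(epas_gdf, num_last=np.inf):
    """Split overlapping geometries into disjoint pieces.

    Each output row carries a frozenset of the IDs whose geometries cover it.
    NOTE: ``index`` below is the row label, so this relies on labels that
    start at 0; only rows with a label below ``num_last`` are processed.
    """
    tqdm_max = min(epas_gdf.shape[0], num_last)

    for index, row in tqdm(epas_gdf[["ID", "geometry"]].iterrows(), total=tqdm_max):

        if index == 0:
            # Seed the result with the first geometry
            gdf_disjoint = pd.DataFrame(row).T
            gdf_disjoint["ID"] = gdf_disjoint["ID"].apply(lambda x: frozenset([x]))
            gdf_disjoint = GeoDataFrame(gdf_disjoint, geometry="geometry", crs=epas_gdf.crs)
            gdf_disjoint_cur_union = unary_union(gdf_disjoint["geometry"])

        elif index < tqdm_max:
            # Split every existing piece into the parts outside and inside the new geometry
            gdf_disjoint["geometry_new_diff"] = gdf_disjoint.difference(row["geometry"])
            gdf_disjoint["geometry_new_int"] = gdf_disjoint.intersection(row["geometry"])
            gdf_disjoint = gdf_disjoint.drop("geometry", axis=1)

            # Long format: one row per (ID set, piece type)
            gdf_disjoint = (gdf_disjoint
                            .set_index("ID")
                            .stack()
                            .reset_index()
                            .rename(columns={"level_1": "geom_type", 0: "geometry"}))

            # Intersection pieces are also covered by the new geometry
            gdf_disjoint["ID"] = gdf_disjoint.apply(lambda x: edit_id_set(x, row["ID"]), axis=1)

            # Add the part of the new geometry not covered by any earlier one
            # (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
            new_sole = pd.DataFrame([{
                "ID": frozenset([row["ID"]]),
                "geom_type": "geometry_new_sole",
                "geometry": row["geometry"].difference(gdf_disjoint_cur_union)
            }])
            gdf_disjoint = pd.concat([gdf_disjoint, new_sole], ignore_index=True)

            gdf_disjoint = GeoDataFrame(gdf_disjoint, geometry="geometry", crs=epas_gdf.crs)
            # Discard pieces with zero area
            gdf_disjoint = gdf_disjoint.drop("geom_type", axis=1)[(gdf_disjoint["geometry"].area != 0)]
            gdf_disjoint_cur_union = unary_union([gdf_disjoint_cur_union, row["geometry"]])

    return gdf_disjoint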
        



CPU times: user 4 µs, sys: 0 ns, total: 4 µs
Wall time: 4.77 µs


In [22]:
%%time
epas_complete_disjoint=complete_disjoint_geoms(epas_gdf, num_last=25)

95it [00:10,  8.70it/s]                        

CPU times: user 10.9 s, sys: 4.01 ms, total: 10.9 s
Wall time: 10.9 s





# Two-intersection maps

In [16]:
def disjoint_geoms(gdf):
    """Return, for each overlapping pair (ID_1, ID_2), their intersection and ID_1 minus ID_2."""
    
    gdf_intersection = geopandas.overlay(gdf[["ID", "geometry"]],
                                         gdf[["ID", "geometry"]],
                                         how="intersection")
    
    gdf_intersection = gdf_intersection[["ID_1", "ID_2", "geometry"]]
    gdf_intersection = gdf_intersection.query("ID_1 != ID_2")
    gdf_intersection.reset_index(drop=True, inplace=True)
    
    dict_pa_geom = (gdf[["ID", "geometry"]]
                    .set_index("ID")["geometry"]
                    .buffer(0)
                    .to_dict())
    
    gdf_intersection["ID1_diff_ID2"]=(gdf_intersection
                                      .progress_apply(lambda row:(dict_pa_geom[row["ID_1"]]
                                                                  .difference(dict_pa_geom[row["ID_2"]])),axis=1))
    
    gdf_intersection = (gdf_intersection
                        .rename(columns={"geometry": "geom_int",
                                         "ID1_diff_ID2": "geom_diff"})
                        .set_index(["ID_1", "ID_2"])
                        .stack()
                        .reset_index()
                        .rename(columns={"level_2": "geom_type", 0: "geometry"}))
    
    # Use the CRS of the input rather than the global epas_gdf
    gdf_intersection = GeoDataFrame(gdf_intersection, geometry="geometry", crs=gdf.crs)
                     
    return gdf_intersection

def preprocess_column_names(layers, prefixes):
    """Prefix every non-geometry column of each layer in place, e.g. ``ID`` -> ``epa_ID``."""
    
    for i, layer in enumerate(layers):
        layer.columns = [prefixes[i] + "_" + column
                         if column != "geometry"
                         else column for column in layer.columns]
        
    return

In [20]:
epas_int_diff

Unnamed: 0,ID_1,ID_2,geom_type,geometry
0,189,7801,geom_int,"MULTIPOLYGON (((966164.039 -736913.041, 965957..."
1,189,7801,geom_diff,"MULTIPOLYGON (((830200.280 -599705.078, 821175..."
2,6452,7801,geom_int,"POLYGON ((1337021.962 -646419.989, 1337021.769..."
3,6452,7801,geom_diff,"MULTIPOLYGON (((1600551.626 -1201879.335, 1600..."
4,21554,7801,geom_int,"MULTIPOLYGON (((1096188.920 -767895.690, 10957..."
...,...,...,...,...
875,229,24211,geom_diff,"MULTIPOLYGON (((-2026916.238 -179613.204, -202..."
876,6455,18445,geom_int,"MULTIPOLYGON (((1106387.403 -715843.949, 11063..."
877,6455,18445,geom_diff,"MULTIPOLYGON (((1060040.122 -696614.337, 10600..."
878,21554,18445,geom_int,"MULTIPOLYGON (((1136478.485 -713336.699, 11367..."


In [19]:
%%time
epas_int_diff = disjoint_geoms(epas_gdf)

100%|██████████| 440/440 [00:56<00:00,  7.80it/s]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  gdf_intersection["ID1_diff_ID2"]=(gdf_intersection


CPU times: user 4min 31s, sys: 1.54 s, total: 4min 33s
Wall time: 4min 33s


In [25]:
attributes={
    "county_gdf_STCOFIPS":"constant",
    "county_gdf_POPULATION":"uniform",
    "epa_ID_1":"constant",
    "epa_ID_2":"constant",
    "epa_ID": "constant",
    "epa_geom_type":"constant"
    
}

layers=[epas_int_diff, county_gdf]
# Prefixes follow the order of `layers` and must match the keys in `attributes`
prefixes=["epa", "county_gdf"]
disjoint=[True, True]

preprocess_column_names(layers, prefixes)

In [33]:
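def layer_intersection(layer1, layer2, attributes):
    """Overlay two layers, scaling "uniform" attributes by area fraction and keeping "constant" ones."""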
    
    layer1_uniforms = [k for k, v in attributes.items() if ((k in layer1.columns) and (v=="uniform"))]
    layer2_uniforms = [k for k, v in attributes.items() if ((k in layer2.columns) and (v=="uniform"))]
    
    # Keep these as lists: they are concatenated with the area column name below
    layer1_constants = [k for k, v in attributes.items() if ((k in layer1.columns) and (v=="constant"))]
    layer2_constants = [k for k, v in attributes.items() if ((k in layer2.columns) and (v=="constant"))]
    
    layer_new = geopandas.overlay(layer1, layer2)
    layer_new["area"] = layer_new.area
    
    layer1["layer1_area"] = layer1.area
    layer2["layer2_area"] = layer2.area
    layer_new["layernew_area"] = layer_new.area
    
    layer_new = (layer_new
                 .merge(layer1[layer1_constants + ["layer1_area"]])
                 .merge(layer2[layer2_constants + ["layer2_area"]]))
    
    layer_new["layer1_areafraction"] = layer_new["layernew_area"] / layer_new["layer1_area"]
    layer_new["layer2_areafraction"] = layer_new["layernew_area"] / layer_new["layer2_area"]
    
    for uniform in layer1_uniforms:
        layer_new[uniform] = layer_new[uniform] * layer_new["layer1_areafraction"]
        
    for uniform in layer2_uniforms:
        layer_new[uniform] = layer_new[uniform] * layer_new["layer2_areafraction"]
        
    del layer1["layer1_area"]
    del layer2["layer2_area"]
    del layer_new["layernew_area"]
    del layer_new["layer1_areafraction"]
    del layer_new["layer2_areafraction"]
    
    return layer_new
    
    
layer_new = layer_intersection(epas_int_diff, county_gdf, attributes)
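
The distinction between "uniform" and "constant" attributes can be seen in a toy case with two axis-aligned rectangles. The sketch below is plain Python with hypothetical values and helper names; it assumes population is spread evenly over the county, which is what scaling by area fraction implies.

```python
def overlap_area(a, b):
    """Area shared by two rectangles given as (xmin, ymin, xmax, ymax)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0) * max(height, 0)


def area(r):
    """Area of a rectangle given as (xmin, ymin, xmax, ymax)."""
    return (r[2] - r[0]) * (r[3] - r[1])


# Hypothetical county with a "uniform" attribute (population) and a "constant" one (FIPS)
county = {"bounds": (0, 0, 10, 10), "STCOFIPS": "08031", "POPULATION": 1000}
# Hypothetical planning area covering the right-hand 30% of the county
planning_area = {"bounds": (7, 0, 20, 10), "ID": 189}

fraction = overlap_area(county["bounds"], planning_area["bounds"]) / area(county["bounds"])
piece = {
    "STCOFIPS": county["STCOFIPS"],                  # constant: copied unchanged
    "ID": planning_area["ID"],                       # constant: copied unchanged
    "POPULATION": county["POPULATION"] * fraction,   # uniform: scaled by area fraction
}
```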

In [None]:
def flatten(layers, attributes, disjoint):
    
    
    for i, layer in enumerate(layers):
        
        # `disjoint` is a list of flags, and disjoint_geoms takes a single argument
        if not disjoint[i]:
            layer = disjoint_geoms(layer)
            
        else:
            pass
        
        
        if i == 0:
            layer_new = layer
            
        else:
            layer_new = layer_intersection(layer_new, layer, attributes)
            
    return layer_new