# Aggregating American Community Survey Data to arbitrary geographies

**TLDR:** This notebook provides reusable code to aggregate ACS data to any arbitrary geography, assuming you've already done the work to map Census Blocks to the aggregate geography. John Keefe has done that work for [NYC police precincts](https://johnkeefe.net/nyc-police-precinct-and-census-data), [Chicago police districts](https://johnkeefe.net/chicago-race-and-ethnicity-data-by-police-district), and [Washington, DC police zones](https://johnkeefe.net/race-and-ethnicity-data-by-washington-dc-police-zones).

-----

[John Keefe](https://johnkeefe.net/) has recently been producing demographic profiles of policing geographies ([NYC police precincts](https://johnkeefe.net/nyc-police-precinct-and-census-data) | [Chicago police districts](https://johnkeefe.net/chicago-race-and-ethnicity-data-by-police-district) | [Washington, DC police zones](https://johnkeefe.net/race-and-ethnicity-data-by-washington-dc-police-zones)). Since I'm in Chicago, I'll use "district" as the catchall for "policing geography".

He uses a common method: compare US Census block maps to GIS maps for the districts, make a crosswalk indicating which blocks are in which district, and then add up the data for those blocks. Census blocks are the smallest unit of geography used for tabulating Census data, and they are generally small enough that they fall within other geographies, although, in his process, John has had to resolve cases where districts cross through blocks. (Only a few, and he details his decisions in the posts linked above.)

Unfortunately, the Census only publishes block-level data every ten years, which means that these demographics are now ten years out of date. Also, they are limited to the data collected for the decennial census: sex, age, race, ethnicity (Hispanic/Latino or not), and a few things about housing and how people living in the same home are related to each other.

The Census publishes another data product, the American Community Survey (ACS), each year, but it doesn't tabulate data at the block level. (The ACS also includes data about education and income, among other topics beyond what's covered in the Decennial Census.) It would be great to use this more recent and more expansive data, but how can we deal with the lack of block-level data?

One might simply use John's process with block groups or tracts. Undoubtedly, there would be more cases where those Census geographies are split by districts. John's approach of researching each split and making a judgment call to assign it to a single district would be time-consuming, and would probably lead to distortions of the data.

Of course, this isn't a completely new problem. The general solution is to compute a weighting factor for each part of the split Census geography in relation to the districts or other geographies. Then, when adding up estimates from the Census geographies, allocate the estimates for split geographies based on the weighting factor.

An obvious approach to weighting is by area, but that assumes that the people and housing units behind the Census estimates are evenly distributed across the land, which is generally not the case. A somewhat better approach is to divide the population of the blocks in each split by the total population of the whole Census geography (tract or block group). As with area, this assumes that the demographics of people are evenly spread through the geography, but there's a limit to what we can do with available data, so it's a compromise we have to accept.

Since the two main things that the Census counts are people and housing units, it's also common to provide a second factor based on the ratio of housing units in each split to the whole, and to choose between the "population weight factor" and the "housing unit weight factor" based on what kind of estimates you're trying to aggregate.
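As a minimal sketch of the two weighting factors described above, here is the arithmetic for one tract split across two districts. The block counts, district labels, and the tract estimate are all made up for illustration.

```python
# Hypothetical blocks in a single tract: (district, population, housing units)
blocks = [
    ("A", 120, 50),
    ("A", 80, 30),
    ("B", 200, 20),
]

# Totals for the whole tract
tract_pop = sum(pop for _, pop, _ in blocks)
tract_hu = sum(hu for _, _, hu in blocks)

# Share of the tract's people and housing units that fall in each district
factors = {}
for district in sorted({d for d, _, _ in blocks}):
    pop = sum(p for d, p, _ in blocks if d == district)
    hu = sum(h for d, _, h in blocks if d == district)
    factors[district] = {"pfactor": pop / tract_pop, "hufactor": hu / tract_hu}

# Allocate a (made-up) tract-level estimate of people with the population factor
tract_estimate = 1000
allocated = {d: tract_estimate * f["pfactor"] for d, f in factors.items()}

print(factors)
print(allocated)
```

Note how the two factors can differ for the same split: in this toy tract, district A holds half of the people but most of the housing units, which is why the choice of factor should follow the universe of the estimate being allocated.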

*Warning:* There's a risk that using 2010 block population and housing unit data to compute weighting factors, and then using those to allocate 2018 ACS data, will also produce distortions. For example, Chicago's Robert Taylor Homes were demolished in 2007, so the 2010 base population and housing unit counts for blocks there are zero or close to zero. By 2018, those blocks have begun redeveloping. Of course, the ACS is already a survey which produces imprecise estimates, so the real lesson is simply that you should use care when writing or talking about analysis based on this data. 

In [1]:
# Basic boilerplate
import pandas as pd
import cenpy
import os
import re

# set this environment variable, or edit this cell to configure your own Census API key if necessary
CENSUS_API_KEY = os.environ.get('CENSUS_API_KEY') 



In [2]:
#
# Utility functions. You probably won't use these directly.
#

def _block_id_substr(block_id,substr_len):
    """Given a Census block identifier, return the beginning `substr_len` characters.
       Use 2 for state, 5 for (state and county), 11 for tract, or 12 for block group. 
       Given the structure of block IDs, no other value for substr_len is appropriate,
       but that's on you.
    """
    if type(block_id) != str:
        block_id = str(block_id).zfill(15) # in case they were read in as numbers instead of strings. 
    if 'US' in block_id:
        block_id = block_id.split('US')[1]
    if len(block_id) == 15 and block_id.isdigit():
        return block_id[0:substr_len]
    raise ValueError(f"Invalid block ID {block_id}")
        
def tract_id_from_block_id(block_id):
    return _block_id_substr(block_id,11)

def bg_id_from_block_id(block_id):
    return _block_id_substr(block_id,12)
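
The slicing in the utility functions relies on the fixed layout of a 15-digit block GEOID: state (2 digits), county (3), tract (6), and block (4), where the first digit of the block number identifies the block group. The following standalone sketch spells that out; `split_block_geoid` is a hypothetical helper for illustration, not part of the notebook's API.

```python
def split_block_geoid(geoid: str) -> dict:
    """Split a block GEOID into its components (illustrative helper)."""
    # Long-form IDs from the Census API look like '1000000US170318369002003'
    if "US" in geoid:
        geoid = geoid.split("US")[1]
    if len(geoid) != 15 or not geoid.isdigit():
        raise ValueError(f"Invalid block ID {geoid}")
    return {
        "state": geoid[0:2],
        "county": geoid[2:5],
        "tract": geoid[5:11],
        "block": geoid[11:15],
        "tract_geoid": geoid[:11],        # state + county + tract
        "block_group_geoid": geoid[:12],  # tract geoid + first block digit
    }

parts = split_block_geoid("1000000US170318369002003")
print(parts)
```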



## Step 0.5: Add population and housing unit counts to a block assignment file.

John Keefe's process begins with creating a "Rosetta Stone" file, which we could also call a "block assignment file". For Chicago, this file is called `chicago_2010blocks_2020policedistricts_key.csv`. His data files (e.g. `chicago_2010blocks_2020policedistricts_population.csv`) don't have housing unit estimates, and in any case, for other aggregation projects, those won't exist, so let's begin by adding these base numbers to the rosetta stone/block assignment file.

This also serves as a generalized tool which could be used to create his data file, by passing different variables as the `api_vars` argument.


In [3]:
def block_data_for_county(state, county,api_vars=None):
    """Given state and county FIPS codes, return 2010 Decennial Census data for all Census blocks in that county. By default, this function
    will return total housing units and total population from the 2010 Decennial Census, but, optionally, you can provide a different list of
    valid variables for the 2010 Decennial Census SF1 API. (DECENNIALSF12010 in cenpy terms)
    """
    predicate_xref = {
        'int': int,
        'float': float
    }
    if api_vars is None:
        api_vars = ['H001001','P001001']
    api_con = cenpy.products.APIConnection('DECENNIALSF12010',apikey=CENSUS_API_KEY)
    df = api_con.query(cols=api_vars,geo_unit='block',geo_filter={'state': state, 'county': county})    
    def concat_geoid(row):
        parts = [row.loc[c] for c in ['state', 'county', 'tract', 'block']]
        return ''.join(parts)
    df['geoid'] = df.apply(concat_geoid,result_type='expand',axis=1)
    predicates = api_con.variables['predicateType']
    # cenpy returns numeric values as strings so we need to convert them
    # use the API's predicateType value to distinguish between int and float values.
    for var in api_vars:
        if var == 'P001001': # Census API incorrectly returns this as NaN :'(
            vtype = 'int'
        else:
            vtype = predicates.loc[var]
        df[var] = df[var].astype(predicate_xref.get(vtype,'object')) 
    return df.set_index('geoid').drop(['state','county','tract','block'],axis='columns')


In [4]:
def _extract_unique_counties(id_list):
    """Given a list of tract, block group, or block IDs, return a DataFrame representing the unique set of 
    state/counties which includes all of the IDs. This can be done because the first five digits
    of these identifiers are built from the two-digit state FIPS code and the three-digit county FIPS code.
    
    The input can be a pandas Series or Index or a simple list, or anything, really, that can be made into a pandas Series.
    Note that the data in the `id_list` argument should be strings, not numbers.
    
    The returned DataFrame will have two columns, `state` and `county`.
    
    This is needed because the Census API doesn't support querying a list of tracts, block groups, or blocks. Instead, one must ask
    for "all tracts (etc.) in a given state/county".
    """
    id_series = pd.Series(id_list)
    state_series = id_series.apply(lambda x: x[0:2])
    county_series = id_series.apply(lambda x: x[2:5])
    return pd.DataFrame({'state': state_series, 'county': county_series}).drop_duplicates()
    
def add_block_data(df, vars=None):
    """Given a DataFrame `df` with an index of 15-char block geoIDs, return a dataframe with the 
    same data plus columns with the Decennial Census SF1 data for the given `vars`.  If no vars are 
    specified, return total population (P001001) and total housing units (H001001)"""
    # block queries must be by state and county, so figure out which states and counties we need to deal with.
    state_county = _extract_unique_counties(df.index)
    blocks = []
    for idx,row in state_county.iterrows():
        # pass the requested variables through; with vars=None the defaults are used
        blocks.append(block_data_for_county(row['state'], row['county'], api_vars=vars))
    block_df = pd.concat(blocks)
    return df.join(block_df)
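
To illustrate why the block IDs need to be strings and how the per-county queries are derived, here is a small standalone sketch of the same idea as `_extract_unique_counties`, using vectorized string slicing. The IDs below are sample values chosen only for illustration.

```python
import pandas as pd

# Sample block IDs; note the leading zero in the last one,
# which would be lost if the IDs were read as numbers
ids = pd.Series([
    "170318105015005",
    "170318105015000",
    "170438400001000",
    "060371011101000",
])

# The first two digits are the state FIPS code, the next three the county
state_county = pd.DataFrame(
    {"state": ids.str[0:2], "county": ids.str[2:5]}
).drop_duplicates()

# One API query would be needed per unique state/county pair
for row in state_county.itertuples(index=False):
    print(f"would query all blocks in state {row.state}, county {row.county}")
```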

### Example: add the base population and housing unit counts to the block assignment file

1. Read in John's "rosetta stone" block assignment file, `chicago_2010blocks_2020policedistricts_key.csv`. Make sure that the block IDs are treated as strings to avoid problems with states that have leading '0' in their FIPS codes.
2. Use `add_block_data` to get the data needed for the next step.

In [5]:
baf = pd.read_csv('chicago_2010blocks_2020policedistricts_key.csv',dtype={'GEOID10': 'object'})
blocks_with_data = add_block_data(baf.set_index('GEOID10'))
blocks_with_data.head()

GEOID10,dist_num,H001001,P001001
170318105015005,31,1,2
170318105015000,31,0,0
170318105023016,31,100,466
170317709014009,31,30,82
170318105012006,31,73,189


## Step 1: Create a tract-level or block-group level crosswalk with weighting factors

The `blocks_to_tracts` function assumes you have a table where each row represents a Census block, and which has at least four columns:

* Block ID: This can be either a 15-digit "short geoid" (eg _170318369002003_) or a longer block geoid of the kind returned from the Census Bureau API (eg _1000000US170318369002003_).
* Total housing units for the block
* Total population for the block
* an identifier for the geography to which you've assigned the block.

If you have the assignments but not the population and housing units, see above for how to add those values. And, technically, this can be used with only one of the two total counts, although we recommend computing both population and housing weighting factors.

More technical details to using this function: you must read your block assignment file into a `pandas DataFrame` and set the DataFrame's index to the block ID column. The other columns can have any names you want: you specify the names when you invoke this function. 

The return is a DataFrame indexed by tract and aggregate geography. Assuming that both `pop_column` and `hu_column` were specified, it contains:

* `tract` (index): the ID of a tract which is in, or partially in, the aggregate geography.
* _`[aggregate_id]`_ (index): your aggregate geography identifiers, under the same column name as in the input `DataFrame`.
* _`[hu_column]`_ and _`[pop_column]`_: the housing unit and population totals for the part of the tract within the aggregate geography, under the same names as the input columns.
* `hufactor`: a number from 0 to 1 representing the proportion of housing units in the whole tract which are in the intersection of this tract and aggregate geography (if no housing column is specified, this column will not be in the returned DataFrame).
* `pfactor`: a number from 0 to 1 representing the proportion of population in the whole tract which is in the intersection of this tract and aggregate geography (if no population column is specified, this column will not be in the returned DataFrame).


In [6]:
def blocks_to_tracts(df, group_column, pop_column=None, hu_column=None, na_to_zero=True):
    """Given a dataframe that maps blocks to some other geography, return a new data frame that contains
    weighting factors for the distribution of population and housing units for tracts related to that other 
    geography.
    
    Arguments:
    * df: a pandas DataFrame. This DataFrame *must* have a specific index, Census block identifiers. These can 
    either be 15-digit strings or "long" block identifiers, e.g. either '170318105015005' or '1000000US170318105015005'
    After the first argument, a DataFrame, the next three arguments should be the names of the columns in the given dataframe which have that data. 
    * group_column: a column of arbitrary identifiers which will be used to determine which tracts are 
                    contained or partially contained within that geography. 
    * hu_column: If not None, this should be the name of a DataFrame column containing the total number of housing units 
                 for that block. 
    * pop_column: If not None, this should be the name of a DataFrame column containing the total number of people for 
                  that block.

    An optional kwarg, `na_to_zero` can be passed to control how this function handles division-by-zero --
    that is, the cases when there are no housing units or no population in an entire tract. This would result in 
    `hufactor` or `pfactor` values of NaN, which would lead to unexpected behavior if one of those factors were 
    used in allocating estimates across split tracts. Therefore, by default, this function replaces NaN values 
    with zero. Set `na_to_zero=False` to have the returned data frame preserve NA values.

    The returned dataframe will have the following columns. The first two are part of a multi-index.
    * tract (index) - the first 11 digits of the block IDs, appearing once if the tract is completely contained 
              within the grouping geography, or more than once if blocks in the tract are split across 
              grouping geographies
    * [group_column] (index) indicating the aggregate
              geography which contains some or all of the tract
    * hufactor - (if hu_column was specified) a number from 0 to 1 indicating the fraction of housing units in the whole tract which are 
                 in the portion of the tract which is within this aggregate geography
    * [hu_column] - (if hu_column was specified) a column with the same name as the input "hu_column" with the value of the
                    sum of values for that column for the given tract or part of a tract
    * pfactor -  (if pop_column was specified) a number from 0 to 1 indicating the fraction of population in the whole tract which are 
                 in the portion of the tract which is within this aggregate geography
    * [pop_column] - (if pop_column was specified) a column with the same name as the input "pop_column" with the value of the
                     sum of values for that column for the given tract or part of a tract
                 
    This can then be used to allocate tract-level estimates across aggregate geographies based on whether 
    the estimate being aggregated is of people or of housing units. If aggregating estimates with a universe
    other than total population or total housing units, be aware that the subset in that universe may not be
    equally distributed across blocks. In theory, one could use the same method with other block-level 
    statistics (like population above or below a certain age) to get other allocation factors, but that's 
    reserved for future work.
    """
    by_group = _factorizer(df, group_column, tract_id_from_block_id, pop_column=pop_column, hu_column=hu_column, na_to_zero=na_to_zero)
    return by_group.reset_index().rename(columns={'census_geog': 'tract'}).set_index(['tract', group_column])

def blocks_to_bgs(df, group_column, pop_column=None, hu_column=None, na_to_zero=True):
    by_group = _factorizer(df, group_column, bg_id_from_block_id, pop_column=pop_column, hu_column=hu_column, na_to_zero=na_to_zero)
    return by_group.reset_index().rename(columns={'census_geog': 'block group'}).set_index(['block group', group_column])
    
def _factorizer(df, group_column, idfunc, pop_column=None, hu_column=None, na_to_zero=True):
    "Utility function representing the common behavior independent of census geography type"
    index_col_name = df.index.name
    if index_col_name is None: index_col_name = 'index' # how Pandas names anon indices
    df = df.reset_index()
    df['census_geog'] = df[index_col_name].apply(idfunc)
    sum_cols = []
    if hu_column:
        sum_cols.append(hu_column)
    if pop_column:
        sum_cols.append(pop_column)
    summed = df.groupby(['census_geog'])[sum_cols].sum()
    by_group = df.groupby(['census_geog',group_column])[sum_cols].sum()
    by_group = by_group.reset_index(group_column) # move group_column out of the index so the divisions below align on census_geog
    if hu_column:
        by_group['hufactor'] =  by_group[hu_column] / summed[hu_column]
    if pop_column:
        by_group['pfactor'] = by_group[pop_column] / summed[pop_column]
    if na_to_zero:
        by_group = by_group.fillna(0)
    return by_group
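
The division-by-zero case handled by `na_to_zero` can be seen in a small standalone sketch. This is not the notebook's `_factorizer`; it uses a `groupby`/`transform` formulation with made-up tract labels and counts to show the same arithmetic.

```python
import pandas as pd

# Hypothetical block-level rows: tract T1 is split across two districts,
# and tract T2 has no population at all
blocks = pd.DataFrame({
    "tract": ["T1", "T1", "T1", "T2", "T2"],
    "district": ["A", "A", "B", "B", "B"],
    "pop": [30, 10, 60, 0, 0],
})

# Population in each (tract, district) piece
part = blocks.groupby(["tract", "district"])["pop"].sum()

# Population of the whole tract, repeated for each of its pieces
whole = part.groupby(level="tract").transform("sum")

pfactor = part / whole               # 0 / 0 produces NaN for the empty tract
pfactor_filled = pfactor.fillna(0)   # what na_to_zero=True corresponds to

print(pfactor)
print(pfactor_filled)
```

One consequence worth keeping in mind: a tract whose factors are all zero contributes nothing to any aggregate geography, so any current ACS estimate for that tract is effectively dropped.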
    

### Example: compute weighting factors

In [7]:
# `blocks_with_data` was created in an example above
weight_factors = blocks_to_tracts(blocks_with_data,'dist_num',pop_column='P001001',hu_column='H001001')

# test to see some split tracts. 
# For tracts with non-zero factors, the factors should add up to 1
# for tracts with zero values for either/both factors, those are cases where all blocks in those tracts have
# no population or housing; there are corresponding values for the same tract 
# with a 1.0 factor which are hidden by the dataframe filter condition.
weight_factors[weight_factors['pfactor'] < 1].sort_values('tract').head()

tract,dist_num,H001001,P001001,hufactor,pfactor
17031030200,20,88,130,0.032282,0.024213
17031030200,24,2638,5239,0.967718,0.975787
17031081403,1,0,0,0.0,0.0
17031150501,16,1449,3643,0.997934,0.998629
17031150501,31,3,5,0.002066,0.001371


**Note:** If you plan to work with these geographies a lot, save this file to a CSV. When you load it, remember to set the (multi-)index to the first two columns.
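
A minimal sketch of that round trip, assuming the index levels are named `tract` and `dist_num` as in the example above. An in-memory buffer stands in for a file path, and the weights are made up. Reading the tract column as a string keeps leading zeros intact.

```python
import io
import pandas as pd

# Made-up weights with the same shape as the output of blocks_to_tracts
weights = pd.DataFrame({
    "tract": ["01001020100", "01001020100"],
    "dist_num": [1, 2],
    "pfactor": [0.25, 0.75],
}).set_index(["tract", "dist_num"])

buffer = io.StringIO()   # stands in for a path such as a local CSV file
weights.to_csv(buffer)
buffer.seek(0)

# Read the tract IDs as strings, then rebuild the two-level index
reloaded = (
    pd.read_csv(buffer, dtype={"tract": str})
    .set_index(["tract", "dist_num"])
)
print(reloaded)
```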

## Step 2: Fetch data for current ACS tables, aggregated to your geographies

(API in progress)

To fetch data, you must specify the ACS data product and variables from that product. Because we use `cenpy` "under the hood," data products are specified using `cenpy`'s codes. (You can learn more about these products from the [API documentation](https://api.census.gov/data/2018/acs/acs5.html)). One product, the _Comparison Profiles_, is not supported because it is not published at the Census tract level.

Here are some examples; other years are available.

* `ACSDT5Y2018`: Detailed Tables (>64,000 [variables](https://api.census.gov/data/2018/acs/acs5/variables.html))
* `ACSDP5Y2018`: Data Profiles (>2,400 [variables](https://api.census.gov/data/2018/acs/acs5/profile/variables.html))
* `ACSST5Y2018`: Subject Tables (>66,000 [variables](https://api.census.gov/data/2018/acs/acs5/subject/variables.html))
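
The product codes listed above follow a regular pattern: `ACS`, a two-letter table family, `5Y` for five-year estimates, and the final year of the period. The sketch below decodes that pattern; `describe_product` is a hypothetical helper, and a code that matches the pattern is not guaranteed to exist in `cenpy`.

```python
import re

# Pattern for the three supported five-year ACS product families
PRODUCT_PATTERN = re.compile(r"^ACS(DP|DT|ST)5Y(20\d{2})$")
KINDS = {
    "DT": "Detailed Tables",
    "DP": "Data Profiles",
    "ST": "Subject Tables",
}

def describe_product(code):
    """Return a readable description of a product code, or None if it doesn't match."""
    match = PRODUCT_PATTERN.match(code)
    if match is None:
        return None
    return f"{KINDS[match.group(1)]}, 5-year estimates ending {match.group(2)}"

print(describe_product("ACSDT5Y2018"))
print(describe_product("ACSCP5Y2018"))  # Comparison Profiles are not supported
```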

It's up to you to make sure that the variables passed in `var_list` are actually available in the ACS product represented by `acs_product`. Making it easier to find specific variables is outside the scope of this notebook.

Also, note that at this time, the aggregation does not handle percentages, margins of error, or median values.
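
To see why percentages can't simply be run through the same aggregation, consider this small sketch with made-up counts for two tracts that fall entirely within one district. Averaging the tract percentages gives a different answer than recomputing the percentage from the aggregated counts, which is the approach that stays consistent with the count-based aggregation.

```python
# Hypothetical tract-level counts; both tracts lie entirely in one district
tracts = [
    {"total": 1000, "with_degree": 100},  # 10% of this tract
    {"total": 100, "with_degree": 50},    # 50% of this tract
]

# Naive: average the tract-level percentages
naive_mean_pct = (
    sum(t["with_degree"] / t["total"] for t in tracts) / len(tracts) * 100
)

# Better: aggregate the counts first, then compute the percentage
total = sum(t["total"] for t in tracts)
with_degree = sum(t["with_degree"] for t in tracts)
recomputed_pct = with_degree / total * 100

print(round(naive_mean_pct, 1), round(recomputed_pct, 1))
```

In practice this means requesting the numerator and denominator estimate variables, aggregating both, and dividing afterwards.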


In [8]:
def unit_data_for_county(state, county, acs_product, var_list, census_geog='tract'):
    """Given state and county FIPS codes, return ACS data for all tracts (or block groups) in that county.
    `acs_product` is a cenpy product code such as 'ACSDT5Y2018', and `var_list` is a list of variables
    valid for that product. Set `census_geog` to 'block group' to get block groups instead of tracts.
    The returned DataFrame is indexed by the tract or block group geoid.
    """
    valid_census_geog = ['tract', 'block group']
    if census_geog not in valid_census_geog:
        raise ValueError(f"census_geog must be one of [{'|'.join(valid_census_geog)}]")
    
    predicate_xref = {
        'int': int,
        'float': float
    }

    api_con = cenpy.products.APIConnection(acs_product,apikey=CENSUS_API_KEY)
    df = api_con.query(cols=var_list,geo_unit=census_geog, geo_filter={'state': state, 'county': county})    

    geoid_cols = ['state', 'county', 'tract']
    if census_geog == 'block group':
        geoid_cols.append('block group')

    def concat_geoid(row):
        parts = [row.loc[c] for c in geoid_cols]
        return ''.join(parts)
    df['geoid'] = df.apply(concat_geoid,result_type='expand',axis=1) 
    predicates = api_con.variables['predicateType']
    # cenpy returns numeric values as strings so we need to convert them
    # use the API's predicateType value to distinguish between int and float values.
    for var in var_list:
        vtype = predicates.loc[var]
        df[var] = df[var].astype(predicate_xref.get(vtype,'object')) 
    return df.drop(geoid_cols,axis='columns').rename(columns={'geoid': census_geog}).set_index(census_geog)




VALID_ACS_PRODUCT = re.compile(r'^ACS(DP|DT|ST)5Y20\d{2}$') # raw string for the regex; doesn't actually validate that it's a valid ACS year
def _validate_cenpy_code(code):
    if not VALID_ACS_PRODUCT.match(code):
        return False
    try: # since the format doesn't test the actual year, make sure it's in Cenpy
        _ = cenpy.explorer.available().loc[code]
    except KeyError:
        return False
    return True


def aggregate_acs(acs_product, var_list, weight_factor, census_geog='tract'):
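    """Given the `cenpy` code for an ACS data product, a list of estimate variables from that product,
    and a weighting factor Series (e.g. the `pfactor` or `hufactor` column returned by `blocks_to_tracts`
    or `blocks_to_bgs`), return a DataFrame of the estimates summed to the aggregate geography."""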
    
    valid_census_geog = ['tract', 'block group']
    if census_geog not in valid_census_geog:
        raise ValueError(f"census_geog must be one of [{'|'.join(valid_census_geog)}]")
    
    # do some validation
    if type(weight_factor) is not pd.Series or len(weight_factor.index.names) != 2 or weight_factor.index.names[0] != census_geog:
        raise ValueError(f"weight_factor should be a multi-indexed pandas.Series with its first part called '{census_geog}'")
        
    if not _validate_cenpy_code(acs_product):
        raise ValueError("invalid acs_product code")

    # this is an incomplete check on valid variables in var_list
    # it will catch margin of error but not percents, they have a less obvious pattern
    if len(var_list) != len(list(filter(lambda x: x[-1] == 'E',var_list))):
        raise ValueError("invalid variables in var_list") 
        
    agg_col = weight_factor.index.names[-1] # save the name of the 'custom geography'
    state_county = _extract_unique_counties(weight_factor.index.get_level_values(census_geog))
    geographies = []
    for idx,row in state_county.iterrows():
        geographies.append(unit_data_for_county(row['state'], row['county'], acs_product, var_list, census_geog))
    
    geographies = pd.concat(geographies)
    factored = {}
    for col in var_list:
        factored[col] = geographies[col] * weight_factor
    factored = pd.DataFrame(factored)
    return factored.reset_index().drop(columns=[census_geog]).groupby(agg_col).sum()
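
The core of the aggregation is "multiply each tract estimate by its weighting factor, then sum by district". The standalone sketch below shows that step on made-up numbers. It uses an explicit `merge` rather than the index-aligned multiplication in `aggregate_acs`, so it is an equivalent formulation under that assumption rather than the same code path.

```python
import pandas as pd

# Made-up tract-level estimates for one variable
estimates = pd.DataFrame({
    "tract": ["T1", "T2"],
    "B02001_001E": [1000, 500],
})

# Made-up weights: T1 is split across districts 20 and 24, T2 is wholly in 24
weights = pd.DataFrame({
    "tract": ["T1", "T1", "T2"],
    "dist_num": [20, 24, 24],
    "pfactor": [0.25, 0.75, 1.0],
})

# Attach each tract's estimate to every (tract, district) piece
merged = weights.merge(estimates, on="tract", how="left")

# Allocate the estimate to each piece, then add up the pieces per district
merged["allocated"] = merged["B02001_001E"] * merged["pfactor"]
by_district = merged.groupby("dist_num")["allocated"].sum()

print(by_district)
```

Because the factors for each tract sum to 1, the district totals add back up to the sum of the tract estimates.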


## An example

To do something like what John did, we need to identify the variables. If you know the right table, you can get the variable list from the Census API. To keep things smaller, we'll just use table B02001: Race, equivalent to Decennial Census table P3. 

* [B02001 variables (HTML)](https://api.census.gov/data/2018/acs/acs5/groups/B02001.html)
* [B02001 variables (JSON)](https://api.census.gov/data/2018/acs/acs5/groups/B02001.json)

After reviewing that, we'll make `variables` to organize the data we want to actually aggregate, and to help us label the columns later.

B02001 is from the "Detailed Tables" dataset and we want the most recent data, so we'll use the Cenpy product code `ACSDT5Y2018`

Earlier, we created `weight_factors` telling us how tract-level population and housing units are divided across the police districts. Since data in B02001 counts people, not housing units, we pass in the `pfactor`.

In [9]:
variables = {
    'B02001_001E': "Total",
    'B02001_002E': "White alone",
    'B02001_003E': "Black or African American alone",
    'B02001_004E': "American Indian and Alaska Native alone",
    'B02001_005E': "Asian alone",
    'B02001_006E': "Native Hawaiian and Other Pacific Islander alone",
    'B02001_007E': "Some other race alone",
    'B02001_008E': "Two or more races"
}
aggregated = aggregate_acs('ACSDT5Y2018', sorted(variables.keys()), weight_factors['pfactor'])
aggregated.rename(columns=variables)

dist_num,Total,White alone,Black or African American alone,American Indian and Alaska Native alone,Asian alone,Native Hawaiian and Other Pacific Islander alone,Some other race alone,Two or more races
1,77629.26511,43213.54203,14630.182002,195.0,16301.704815,9.0,795.831092,2484.005172
2,99557.424294,22182.806855,65941.914918,267.0,6951.802179,0.0,1171.083467,3042.816876
3,73030.676642,3248.129496,67071.548905,297.769399,775.092493,0.0,516.22562,1121.910729
4,119538.02444,39892.433519,73459.014347,120.230601,292.105329,18.035817,4031.060018,1725.14481
5,73343.956343,3216.005597,68914.948881,56.0,145.0,11.0,277.001866,724.0
6,85782.999955,1465.41074,82380.182518,61.0,326.0,43.964183,496.314929,1010.127585
7,57165.536827,1886.414072,52792.652227,103.0,79.997165,0.0,1770.473363,533.0
8,250872.0,118047.0,49103.0,1029.0,3209.0,41.0,72622.0,6821.0
9,161499.84028,72435.314747,15567.22566,1294.0,31196.309976,0.0,38057.827504,2949.162393
10,110641.958556,42405.617779,35991.018397,464.563936,260.902611,14.0,29896.056142,1609.799691


### Compare aggregating tracts and aggregating block groups

In [10]:
# `blocks_with_data` was created in an example above
bg_weight_factors = blocks_to_bgs(blocks_with_data,'dist_num',pop_column='P001001',hu_column='H001001')

bg_aggregated = aggregate_acs('ACSDT5Y2018', sorted(variables.keys()), bg_weight_factors['pfactor'], 'block group')
bg_aggregated.rename(columns=variables)

dist_num,Total,White alone,Black or African American alone,American Indian and Alaska Native alone,Asian alone,Native Hawaiian and Other Pacific Islander alone,Some other race alone,Two or more races
1,77518.184824,43101.164616,14569.694895,195.0,16336.232762,9.0,799.544786,2507.547765
2,99608.251055,22207.242663,65970.095958,267.0,6951.802179,0.0,1169.293379,3042.816876
3,72123.647387,3200.132486,66110.920755,302.0,765.197821,0.0,545.2132,1200.183124
4,120488.570633,39935.604901,74469.391199,116.0,302.0,18.0,4001.0,1646.574534
5,71608.027434,3119.010619,67340.016814,56.0,145.0,11.0,242.0,695.0
6,85810.297186,1464.38448,82420.48724,61.0,326.0,44.0,484.0,1010.425466
7,57046.079369,1867.808641,52676.38857,103.0,79.995595,0.0,1785.886564,533.0
8,250806.0,117989.0,49103.0,1029.0,3209.0,41.0,72614.0,6821.0
9,161479.798731,72439.68544,15563.485361,1294.0,31176.910816,0.0,38058.606857,2947.110256
10,110655.160876,42455.537846,35976.338223,464.519934,256.677499,14.0,29874.584521,1613.502853


In [11]:
pd.options.display.float_format = "{:,.2f}".format
pct_diff = (bg_aggregated - aggregated)/bg_aggregated * 100
pct_diff.insert(0,'tract_total_pop',aggregated['B02001_001E'])
pct_diff.rename(columns=variables)

dist_num,tract_total_pop,Total,White alone,Black or African American alone,American Indian and Alaska Native alone,Asian alone,Native Hawaiian and Other Pacific Islander alone,Some other race alone,Two or more races
1,77629.27,-0.14,-0.26,-0.42,0.0,0.21,0.0,0.46,0.94
2,99557.42,0.05,0.11,0.04,0.0,0.0,,-0.15,0.0
3,73030.68,-1.26,-1.5,-1.45,1.4,-1.29,,5.32,6.52
4,119538.02,0.79,0.11,1.36,-3.65,3.28,-0.2,-0.75,-4.77
5,73343.96,-2.42,-3.11,-2.34,0.0,0.0,0.0,-14.46,-4.17
6,85783.0,0.03,-0.07,0.05,0.0,0.0,0.08,-2.54,0.03
7,57165.54,-0.21,-1.0,-0.22,0.0,-0.0,,0.86,0.0
8,250872.0,-0.03,-0.05,0.0,0.0,0.0,0.0,-0.01,0.0
9,161499.84,-0.01,0.01,-0.02,0.0,-0.06,,0.0,-0.07
10,110641.96,0.01,0.12,-0.04,-0.01,-1.65,0.0,-0.07,0.23


Above we see the percentage difference between aggregating by block group and by tract. In many cases the difference is small, but some exceed 5% and a few exceed 10%, which gives us pause. Not to mention a difference of more than 44% in district 22 for "Some other race alone"!

While I didn't check it out systematically, the larger discrepancies mostly seem to go with either small populations (like "Some other race alone") or small total populations (like district 31, which is actually areas of unincorporated Cook County for which Chicago Police have some authority.)  That said, District 22 has a big discrepancy even in total population (-5%), and the population is close to the median.
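
One way to make that check more systematic would be to flag every cell whose absolute percentage difference exceeds a threshold. The sketch below does this for a few values copied from the table above (districts 1, 4, and 5); the 5% threshold is an arbitrary choice for illustration.

```python
import pandas as pd

# A few percentage differences copied from the comparison table above
pct = pd.DataFrame(
    {
        "Total": [-0.14, 0.79, -2.42],
        "Some other race alone": [0.46, -0.75, -14.46],
    },
    index=pd.Index([1, 4, 5], name="dist_num"),
)

threshold = 5  # percent; an arbitrary cutoff for "worth a closer look"
mask = pct.abs() > threshold

# Districts with at least one large discrepancy
districts_to_review = pct.index[mask.any(axis=1)].tolist()

# The individual (district, column, value) cells that exceed the threshold
flagged = [
    (dist, col, pct.loc[dist, col])
    for dist in pct.index
    for col in pct.columns
    if mask.loc[dist, col]
]

print(districts_to_review)
print(flagged)
```

Applied to the full `pct_diff` frame, the same mask could be joined with the total population column to see whether large discrepancies really do cluster in small populations.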

I don't know of any other source we could consult to see if one is "more correct" than another. Maybe it's worth trying something where the "custom geography" is a real Census geography like a county.



### Potentially interesting rollups.

* [NYC Geographic Relationships](https://www1.nyc.gov/site/planning/planning-level/nyc-population/nyc-population-geographic-relationships.page) Neighborhood Tabulation Areas are built from Census tracts, so alignment is straightforward -- but people are interested in Community Districts. Is there an xref from NTA to CD?
* Seattle had some that were from tracts
* Chicago Community Areas of course -- also from tracts
