In [1]:
import os
import sys

sys.path.insert(1, os.path.join(os.getcwd(), os.pardir))

In [2]:
import censusdis.redistricting as crd

from censusdis.states import STATE_NJ

import divintseg as dis

import pandas as pd

## Introduction

In this notebook, we will demonstrate how to use the [`censusdis`](https://github.com/vengroff/censusdis) package 
to download some US Census data and then how to use the [`divintseg`](https://github.com/vengroff/divintseg) package 
to compute some diversity and integration metrics.

In this example, we will look at data from the towns of South Orange
and Maplewood (collectively known as SoMa) in Essex County,
NJ. 

Once you are familiar with the API and how to use it, you can easily experiment with
similar analysis of the area where you live.

## US Census API Key

The US Census API uses a key to identify callers. If you don't already have a key, you can request
one [here](https://api.census.gov/data/key_signup.html). Please put your key into this cell before 
running the notebook.

For small queries such as the ones in this demo notebook, the API seems to work without a key, so you can leave
`CENSUS_API_KEY` set to `None`. For more serious work, you will want to obtain a key.
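
If you would rather not paste a key into the notebook, one option is to read it from the environment. The sketch below is only an illustration: the helper `load_census_api_key` and the environment variable name `CENSUS_API_KEY` are our own assumptions, not something `censusdis` requires.

```python
import os


def load_census_api_key(env_var="CENSUS_API_KEY"):
    """Return the key stored in the given environment variable, or None.

    The variable name is a hypothetical convention chosen for this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    # Treat a missing or blank value as "no key", matching the None default.
    return key or None


CENSUS_API_KEY = load_census_api_key()
```

With this approach the key stays out of the saved notebook, and the value falls back to `None` when the variable is unset.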

In [3]:
CENSUS_API_KEY=None

## Basic Configuration

### Year

We can choose which census year to look at: 2000, 2010, or 2020.

In [4]:
YEAR = 2020

### Field Group

A field group is a set of fields that the US Census uses to provide data
in the various data sets it publishes. These groups cover all kinds of
topics. We are interested in demographics and we are going to be using
redistricting data, so the field groups available are those summarized 
[here](https://api.census.gov/data/2020/dec/pl/groups.html). Don't worry
if nothing on that page means anything to you right now. We'll explain it
here.

If we choose P1, then the data is grouped purely based on race, 
not taking ethnicity into account at all. If we choose P2, then the data is
first grouped by ethnicity, with people reporting Hispanic or Latino ethnicity
put into one group regardless of their race. Everyone else is then divided into
groups based on their race.

Thus, P2 has one group that P1 does not have, which is Hispanic or Latino of 
any race. In the P1 data set, people who are in the Hispanic or Latino group 
in P2 are instead classified into one of the race-based groups. 
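
To make the difference concrete, here is a toy sketch that classifies the same handful of made-up people under both schemes. The people, the tuple layout, and the helper names `p1_group` and `p2_group` are invented for illustration. Real census data arrives as counts per field, not as individual records.

```python
from collections import Counter

# Hypothetical individuals: (race, reports Hispanic or Latino ethnicity).
people = [
    ("White", False),
    ("White", True),
    ("Black or African American", False),
    ("Black or African American", True),
    ("Asian", False),
]


def p1_group(race, hispanic):
    # P1-style grouping: race only, ethnicity ignored.
    return race


def p2_group(race, hispanic):
    # P2-style grouping: ethnicity first, then race for everyone else.
    return "Hispanic or Latino" if hispanic else race


p1 = Counter(p1_group(*person) for person in people)
p2 = Counter(p2_group(*person) for person in people)

print("P1:", dict(p1))
print("P2:", dict(p2))
```

Both groupings cover the same five people. P2 simply moves the two people reporting Hispanic or Latino ethnicity out of their race-based groups and into a group of their own.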

For more information, including additional options P3 and P4, see this
additional 
[documentation](https://www.census.gov/programs-surveys/decennial-census/about/rdo/summary-files.html).

In [5]:
FIELD_GROUP = 'P2'

### Fields

Since the specific fields that exist within a field group vary by year,
we have to make a metadata query to find out what they are. This is a
simple one-liner.

In [6]:
field_names, total_field, fields_by_race = crd.metadata(YEAR, FIELD_GROUP)

Each of the fields has a name (in `field_names`), and the API also groups them into 
racial groups (in `fields_by_race`) in case we want to use that for additional 
analysis. 

For the work we are doing here, we are just going to pass some of this metadata
on to our data query. But we will take a quick look just to see the richness of the
data provided by the census for people who are multiracial.
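
One quick way to get a feel for that richness is to count how many races each field label names. The sketch below hard-codes a few labels from the 2020 P2 group so that it runs without calling the API. It assumes a label is a semicolon-separated list of races, which holds for the multiracial fields but may not hold for every field in the group. The helper `race_count` is our own.

```python
from collections import defaultdict

# A small hand-copied sample of 2020 P2 field labels.
sample_field_names = {
    "P2_030N": "White; Black or African American; Asian",
    "P2_043N": "Black or African American; Asian; Some Other Race",
    "P2_054N": "White; Black or African American; Asian; Some Other Race",
    "P2_073N": (
        "White; Black or African American; "
        "American Indian and Alaska Native; Asian; "
        "Native Hawaiian and Other Pacific Islander; Some Other Race"
    ),
}


def race_count(label):
    """Count the semicolon-separated races named in a field label."""
    return len([part for part in label.split(";") if part.strip()])


# Group the field codes by how many races their label names.
fields_by_count = defaultdict(list)
for field, label in sample_field_names.items():
    fields_by_count[race_count(label)].append(field)

for n in sorted(fields_by_count):
    print(n, fields_by_count[n])
```

The same loop could be run over the full `field_names` dictionary returned by the metadata query.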

In [7]:
field_names

{'P2_070N': 'White; American Indian and Alaska Native; Asian; Native Hawaiian and Other Pacific Islander; Some Other Race',
 'P2_071N': 'Black or African American; American Indian and Alaska Native; Asian; Native Hawaiian and Other Pacific Islander; Some Other Race',
 'P2_073N': 'White; Black or African American; American Indian and Alaska Native; Asian; Native Hawaiian and Other Pacific Islander; Some Other Race',
 'P2_060N': 'Black or African American; American Indian and Alaska Native; Asian; Native Hawaiian and Other Pacific Islander',
 'P2_030N': 'White; Black or African American; Asian',
 'P2_054N': 'White; Black or African American; Asian; Some Other Race',
 'P2_042N': 'Black or African American; Asian; Native Hawaiian and Other Pacific Islander',
 'P2_066N': 'White; Black or African American; American Indian and Alaska Native; Asian; Native Hawaiian and Other Pacific Islander',
 'P2_043N': 'Black or African American; Asian; Some Other Race',
 'P2_031N': 'White; Black or African

## South Orange and Maplewood Tracts in Essex County, NJ

We want to take a look at data from the towns of South Orange
and Maplewood (collectively known as SoMa) in Essex County,
NJ. 

### Essex County

For Essex County, we found the code `'013'` on this 
[Wikipedia page](https://en.wikipedia.org/wiki/List_of_counties_in_New_Jersey).

In [8]:
COUNTY_ESSEX_NJ = '013'

### SoMa Tracts

We found the tracts that make up the two towns by looking
at 
[this map](https://www2.census.gov/geo/maps/dc10map/tract/st34_nj/c34013_essex/DC10CT_C34013_002.pdf).
We format them as six-digit strings, following the
Census API convention.

In [9]:
tracts_soma = [f"{t:06}" for t in range(19000, 20000, 100)]
tracts_soma

['019000',
 '019100',
 '019200',
 '019300',
 '019400',
 '019500',
 '019600',
 '019700',
 '019800',
 '019900']
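
The six-digit codes follow from how tracts are labeled on census maps: a base number of up to four digits, optionally followed by a two-digit suffix after a decimal point. The helper below, `tract_code`, is a hypothetical convenience of our own (not part of `censusdis`) that converts a map label into the six-digit form. The suffixed label `190.01` in the usage note is made up to show the format and is not necessarily a real SoMa tract.

```python
def tract_code(name: str) -> str:
    """Convert a tract label such as '190' or '190.01' to a six-digit code."""
    base, _, suffix = name.partition(".")
    if not base.isdigit() or len(base) > 4:
        raise ValueError(f"unexpected tract base in {name!r}")
    if suffix and not (suffix.isdigit() and len(suffix) == 2):
        raise ValueError(f"unexpected tract suffix in {name!r}")
    # Four digits for the base, two for the suffix ('00' when absent).
    return base.zfill(4) + (suffix or "00")


# Rebuild the SoMa list from the labels 190 through 199.
tracts_from_labels = [tract_code(str(n)) for n in range(190, 200)]
print(tracts_from_labels)
```

Here `tract_code("190")` gives `'019000'`, and a suffixed label such as `tract_code("190.01")` gives `'019001'`.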

### SoMa Data Query

Now we can query the data. Let's look at the arguments to our call one by one:

- `STATE_NJ` - this is the state we are interested in.
- `YEAR` - the year we want data for. For the redistricting API, 2000, 2010, and 2020 are the three valid options.
- `'block'` - the resolution of data we want. We want a row for each block in SoMa. 
- `field_names.keys()` - these are all the fields we want data for. The available fields vary by year and group, which
   is why we made the `crd.metadata` call above to get them.
- `county=COUNTY_ESSEX_NJ` - this is a filter. We only want data for Essex County.
- `tract=tracts_soma` - this is a second filter saying that within the county we only want the specified tracts.
- `key=CENSUS_API_KEY` - our API key.

The return value will be a `pd.DataFrame` containing a row for each block (the resolution we specified). In order to make
analysis of diversity and integration at various levels of geographic aggregation easier (e.g. using the 
[divintseg](https://github.com/vengroff/divintseg) package) the identifiers of all of the nested geographies from the 
state down to the block are included in each row. In this case that means we have columns for STATE, COUNTY, TRACT, BLOCK_GROUP,
and BLOCK. After these columns, we have one column for each of the demographic fields we asked for.
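
For readers curious about what a request like this looks like at the HTTP level, the sketch below builds a Census Data API URL for the blocks of a single tract. This is our own illustration of the API's `get`/`for`/`in` query style. It is not necessarily the exact request that `censusdis` sends, and the function name `block_query_url` is hypothetical.

```python
from urllib.parse import quote, urlencode


def block_query_url(year, fields, state, county, tract, key=None):
    """Build a redistricting (dec/pl) query URL for all blocks in one tract."""
    params = [
        ("get", ",".join(fields)),
        ("for", "block:*"),
        # The nested geographies that contain the blocks we want.
        ("in", f"state:{state} county:{county} tract:{tract}"),
    ]
    if key is not None:
        params.append(("key", key))
    # Keep ':', '*' and ',' readable; spaces become %20.
    query = urlencode(params, safe=":*,", quote_via=quote)
    return f"https://api.census.gov/data/{year}/dec/pl?{query}"


print(block_query_url(2020, ["P2_030N", "P2_054N"], "34", "013", "019000"))
```

The sketch only constructs the URL string; it does not make a network call.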

In [10]:
df_soma = crd.data(
    STATE_NJ, 
    YEAR, 
    'block',
    field_names.keys(),
    county=COUNTY_ESSEX_NJ,
    tract=tracts_soma,
    key=CENSUS_API_KEY,
)

In [11]:
df_soma

,STATE,COUNTY,TRACT,BLOCK_GROUP,BLOCK,P2_070N,P2_071N,P2_073N,P2_060N,P2_030N,...,P2_024N,P2_037N,P2_025N,P2_013N,P2_018N,P2_006N,P2_007N,P2_019N,P2_008N,P2_009N
0,34,013,019400,1,1004,0,0,0,0,0,...,0,0,0,1,0,0,0,3,0,0
1,34,013,019400,1,1005,0,0,0,0,0,...,0,0,0,3,0,8,0,1,1,0
2,34,013,019400,1,1006,0,0,0,0,0,...,0,0,0,0,0,4,0,0,1,0
3,34,013,019400,1,1007,0,0,0,0,0,...,0,0,0,0,0,4,0,0,8,0
4,34,013,019400,1,1008,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
512,34,013,019900,2,2007,0,0,0,0,0,...,0,0,0,0,0,3,0,0,6,0
513,34,013,019900,2,2010,0,0,0,0,0,...,0,0,0,6,0,3,0,0,1,0
514,34,013,019900,3,3001,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
515,34,013,019900,3,3005,0,0,0,0,0,...,0,0,0,0,0,1,0,0,2,0


## Compute Diversity and Integration

Now that we have the census data telling us how many people of each group there are
in each block of SoMa, we can calculate diversity and integration for each tract,
computed over its blocks. For a detailed explanation of what we are actually calculating here, see
the [README.md](https://github.com/vengroff/divintseg/blob/main/README.md) in the 
[divintseg package](https://github.com/vengroff/divintseg).
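
As a rough guide to the numbers, here is a small from-scratch sketch based on our reading of that README. It assumes diversity is the probability that two people drawn at random from a region belong to different groups, and that integration is the population-weighted average of the diversity of the region's blocks. This is not the `divintseg` implementation, and the function names and toy counts are ours.

```python
def diversity(counts):
    """Probability that two random draws (with replacement) differ in group."""
    total = sum(counts)
    if total == 0:
        return 0.0  # assumption: an empty area contributes no diversity
    return 1.0 - sum((c / total) ** 2 for c in counts)


def region_diversity(blocks):
    """Diversity of a region, pooling the group counts of all its blocks."""
    return diversity([sum(column) for column in zip(*blocks)])


def region_integration(blocks):
    """Population-weighted average of the per-block diversity."""
    total = sum(sum(block) for block in blocks)
    if total == 0:
        return 0.0
    return sum(sum(block) / total * diversity(block) for block in blocks)


# Toy tract with two blocks and two groups: one mixed block, one uniform.
toy_tract = [[50, 50], [100, 0]]
print(region_diversity(toy_tract), region_integration(toy_tract))
```

In the toy tract, the pooled population is 150 to 50, giving a diversity of 0.375. Only the first block is mixed, so the integration is 0.25. Under these definitions integration is never higher than diversity, because grouping people into blocks can only keep or reduce the mixing seen within each block.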

In [12]:
dis.di(df_soma, by='TRACT', over='BLOCK', drop_non_numeric=True)

TRACT,diversity,integration
19000,0.59433,0.562676
19100,0.692372,0.594267
19200,0.669885,0.637017
19300,0.656123,0.617187
19400,0.43242,0.407238
19500,0.50085,0.471264
19600,0.697732,0.618226
19700,0.623099,0.565054
19800,0.582375,0.504864
19900,0.44207,0.419475
