# Exploring a Group of Variables

There are thousands of different variables available via the US Census
API. One way to navigate through them is to look through a hierarchy
of web pages starting with a year, like 2020, on a page like
https://api.census.gov/data/2020.html.

From here we can navigate down to a particular data source by following
the link in the *Group List* column of the first row, which is for the
dataset `acs/acs5`. This takes us to
https://api.census.gov/data/2020/acs/acs5/groups.html.

From there,
we can choose the group named B03002, which takes us to 
https://api.census.gov/data/2020/acs/acs5/groups/B03002.html, where we 
can see all the variables in the group. 

Some of these variables are estimates,
and some of them are annotations. Normally, we are interested in the estimates.

Every variable has a label, which is a string of components separated by
`!!`. For example, B03002_007E has the label
"Estimate!!Total:!!Not Hispanic or Latino:!!Native Hawaiian and Other Pacific Islander alone".
The `!!` separators imply a tree among the variables, from the root all the way down
to leaves. Internal nodes in the tree represent variables that are aggregates of 
those lower than them in the tree. For example, B03002_002E is the size of the 
population that is not Hispanic or Latino, regardless of race. It is the sum of
B03002_003E, B03002_004E, B03002_005E, B03002_006E, B03002_007E, B03002_008E, and B03002_009E,
which count people of different races who are not Hispanic or Latino. All of these
are leaves of the tree, except B03002_009E, which is further subdivided into
B03002_010E and B03002_011E.
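
To make the label-to-tree idea concrete, here is a minimal sketch that splits a few labels on `!!` and nests the variables accordingly. The labels are taken from the B03002 group. The helper names `build_tree` and `render` and the node layout are illustrative assumptions, not part of `censusdis`.

```python
# A handful of labels from group B03002, keyed by variable name.
LABELS = {
    "B03002_001E": "Estimate!!Total:",
    "B03002_002E": "Estimate!!Total:!!Not Hispanic or Latino:",
    "B03002_007E": (
        "Estimate!!Total:!!Not Hispanic or Latino:"
        "!!Native Hawaiian and Other Pacific Islander alone"
    ),
    "B03002_009E": "Estimate!!Total:!!Not Hispanic or Latino:!!Two or more races:",
    "B03002_010E": (
        "Estimate!!Total:!!Not Hispanic or Latino:!!Two or more races:"
        "!!Two races including Some other race"
    ),
}


def build_tree(labels):
    """Nest variables by splitting each label on '!!' (hypothetical helper)."""
    root = {"name": None, "children": {}}
    for name, label in labels.items():
        node = root
        for part in label.split("!!"):
            # Create the child node the first time we see this path component.
            node = node["children"].setdefault(part, {"name": None, "children": {}})
        # The variable lives at the node where its label ends.
        node["name"] = name
    return root


def render(node, depth=0):
    """Return one indented line per node, showing the variable name if there is one."""
    lines = []
    for part, child in node["children"].items():
        suffix = f" ({child['name']})" if child["name"] else ""
        lines.append("    " * depth + f"+ {part}{suffix}")
        lines.extend(render(child, depth + 1))
    return lines


print("\n".join(render(build_tree(LABELS))))
```

In this sketch the `Estimate` node ends up with no variable of its own, because no label stops at that component.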

All of this can get really confusing when you look at it in the tabular form
on the web page. In order to make it less confusing, the `censusdis` package
includes code to fetch and present the variable hierarchy in a more understandable
way. That's what the remainder of this notebook is about.

## Import and Configuration

In [1]:
# So we can run from within the censusdis project and find the packages we need.
import os
import sys

sys.path.append(os.path.abspath(os.path.join(os.path.curdir, os.path.pardir)))

In [2]:
import censusdis.data as ced
import censusdis.geography as cgeo
from censusdis.states import STATE_NJ

In [3]:
YEAR = 2020
DATASET = "acs/acs5"
GROUP = "B03002"

## Programmatic Access to a Group Variable Tree

We can get the whole collection of variables in tree form and print it.
This format makes it easier to see the relationships we saw in the
table at https://api.census.gov/data/2020/acs/acs5/groups/B03002.html.

We can see clearly where variables exist at internal nodes of the tree
and where they exist at the leaves. Not all internal nodes have variables
(the top-level `Estimate` node has none), but all leaves do.

In [4]:
tree = ced.variables.group_tree(DATASET, YEAR, GROUP)
print(tree)

+ Estimate
    + Total: (B03002_001E)
        + Not Hispanic or Latino: (B03002_002E)
            + White alone (B03002_003E)
            + Black or African American alone (B03002_004E)
            + American Indian and Alaska Native alone (B03002_005E)
            + Asian alone (B03002_006E)
            + Native Hawaiian and Other Pacific Islander alone (B03002_007E)
            + Some other race alone (B03002_008E)
            + Two or more races: (B03002_009E)
                + Two races including Some other race (B03002_010E)
                + Two races excluding Some other race, and three or more races (B03002_011E)
        + Hispanic or Latino: (B03002_012E)
            + White alone (B03002_013E)
            + Black or African American alone (B03002_014E)
            + American Indian and Alaska Native alone (B03002_015E)
            + Asian alone (B03002_016E)
            + Native Hawaiian and Other Pacific Islander alone (B03002_017E)
            + Some other race alone (B0300

### Accessing a Sub-Tree

Most of the time we are only interested in variables that are estimates, so 
we can look down in that part of the tree alone.

In [5]:
print(tree["Estimate"])

+ Total: (B03002_001E)
    + Not Hispanic or Latino: (B03002_002E)
        + White alone (B03002_003E)
        + Black or African American alone (B03002_004E)
        + American Indian and Alaska Native alone (B03002_005E)
        + Asian alone (B03002_006E)
        + Native Hawaiian and Other Pacific Islander alone (B03002_007E)
        + Some other race alone (B03002_008E)
        + Two or more races: (B03002_009E)
            + Two races including Some other race (B03002_010E)
            + Two races excluding Some other race, and three or more races (B03002_011E)
    + Hispanic or Latino: (B03002_012E)
        + White alone (B03002_013E)
        + Black or African American alone (B03002_014E)
        + American Indian and Alaska Native alone (B03002_015E)
        + Asian alone (B03002_016E)
        + Native Hawaiian and Other Pacific Islander alone (B03002_017E)
        + Some other race alone (B03002_018E)
        + Two or more races: (B03002_019E)
            + Two races includin

### Leaves

In many cases, we are really just interested in the leaves, because the
internal nodes of the tree contain variables that are aggregate sums of the 
subtrees below them.
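
Conceptually, finding the leaves is a depth-first walk that keeps only nodes without children. The sketch below illustrates this on a small hand-built excerpt of the tree. The tuple-based node layout and the function `leaf_variables` are assumptions made for illustration and do not reflect how `censusdis` stores its trees.

```python
# Each node is (variable_name_or_None, {child_label: node}).
# This is a hand-built excerpt of the B03002 tree, not the full group.
TREE = (None, {
    "Estimate": (None, {
        "Total:": ("B03002_001E", {
            "Not Hispanic or Latino:": ("B03002_002E", {
                "White alone": ("B03002_003E", {}),
                "Two or more races:": ("B03002_009E", {
                    "Two races including Some other race": ("B03002_010E", {}),
                    "Two races excluding Some other race, and three or more races": (
                        "B03002_011E", {}
                    ),
                }),
            }),
        }),
    }),
})


def leaf_variables(node):
    """Collect variable names at nodes that have no children, depth first."""
    name, children = node
    if not children:
        return [name] if name is not None else []
    leaves = []
    for child in children.values():
        leaves.extend(leaf_variables(child))
    return leaves


print(leaf_variables(TREE))
```

The aggregate variables B03002_001E, B03002_002E, and B03002_009E are skipped because each has at least one child.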

In [6]:
leaves = ced.variables.group_leaves(DATASET, YEAR, GROUP)
leaves

['B03002_003E',
 'B03002_004E',
 'B03002_005E',
 'B03002_006E',
 'B03002_007E',
 'B03002_008E',
 'B03002_010E',
 'B03002_011E',
 'B03002_013E',
 'B03002_014E',
 'B03002_015E',
 'B03002_016E',
 'B03002_017E',
 'B03002_018E',
 'B03002_020E',
 'B03002_021E']

Notice that the leaves we got include only estimates, whose names end in `E`.
The margins of error and annotations are left out. If we really want to see
those too, we can pass the optional argument `skip_annotations=False`.

In [7]:
all_leaves = ced.variables.group_leaves(DATASET, YEAR, GROUP, skip_annotations=False)
all_leaves

['B03002_003E',
 'B03002_003EA',
 'B03002_003M',
 'B03002_003MA',
 'B03002_004E',
 'B03002_004EA',
 'B03002_004M',
 'B03002_004MA',
 'B03002_005E',
 'B03002_005EA',
 'B03002_005M',
 'B03002_005MA',
 'B03002_006E',
 'B03002_006EA',
 'B03002_006M',
 'B03002_006MA',
 'B03002_007E',
 'B03002_007EA',
 'B03002_007M',
 'B03002_007MA',
 'B03002_008E',
 'B03002_008EA',
 'B03002_008M',
 'B03002_008MA',
 'B03002_010E',
 'B03002_010EA',
 'B03002_010M',
 'B03002_010MA',
 'B03002_011E',
 'B03002_011EA',
 'B03002_011M',
 'B03002_011MA',
 'B03002_013E',
 'B03002_013EA',
 'B03002_013M',
 'B03002_013MA',
 'B03002_014E',
 'B03002_014EA',
 'B03002_014M',
 'B03002_014MA',
 'B03002_015E',
 'B03002_015EA',
 'B03002_015M',
 'B03002_015MA',
 'B03002_016E',
 'B03002_016EA',
 'B03002_016M',
 'B03002_016MA',
 'B03002_017E',
 'B03002_017EA',
 'B03002_017M',
 'B03002_017MA',
 'B03002_018E',
 'B03002_018EA',
 'B03002_018M',
 'B03002_018MA',
 'B03002_020E',
 'B03002_020EA',
 'B03002_020M',
 'B03002_020MA',
 'B03002_0
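
The names in this longer list end in one of four suffixes. In ACS tables, `E` marks an estimate, `M` a margin of error, and `EA` and `MA` the annotations of each. Below is a minimal sketch of sorting names by suffix. The regular expression and the helper `kind_of` are illustrative assumptions, not a `censusdis` API.

```python
import re
from collections import defaultdict

# A few names copied from the output above.
NAMES = [
    "B03002_003E", "B03002_003EA", "B03002_003M", "B03002_003MA",
    "B03002_004E", "B03002_004EA", "B03002_004M", "B03002_004MA",
]

KINDS = {
    "E": "estimate",
    "EA": "estimate annotation",
    "M": "margin of error",
    "MA": "margin of error annotation",
}

# Assumed name shape: <group>_<3 digits><suffix>.
NAME_PATTERN = re.compile(r"[A-Z0-9]+_\d{3}(E|EA|M|MA)")


def kind_of(name):
    """Classify a variable name by its suffix (hypothetical helper)."""
    match = NAME_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"Unrecognized variable name: {name!r}")
    return KINDS[match.group(1)]


by_kind = defaultdict(list)
for name in NAMES:
    by_kind[kind_of(name)].append(name)

for kind, names in by_kind.items():
    print(f"{kind}: {names}")
```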

## Programmatic Access to Geographic Hierarchies

It's great to know the variables that are available, but in order to make full
use of the US Census API and the `censusdis` API around it, we have to know
something about the geography hierarchies that each dataset supports
in each year it is published. These are listed on web pages like
https://api.census.gov/data/2020/acs/acs5/geography.html, which contains a table
of the geographies supported by the ACS5 dataset we have been looking at
for the year 2020.

But again, we'd prefer to have access to this information in a Pythonic way.
The most common way to get it is by calling `censusdis.geography.geo_path_snake_specs`
as shown below. This gives us the available hierarchies using Python-friendly
snake-case names that we can use to download data.

Each item in the dictionary has a key that identifies the geography hierarchy and
a value that lists the components of the hierarchy in snake case. These are the names
that can be passed as keyword arguments to `censusdis.data.download`.

In [8]:
cgeo.geo_path_snake_specs(DATASET, YEAR)

{'010': ['us'],
 '020': ['region'],
 '030': ['division'],
 '040': ['state'],
 '050': ['state', 'county'],
 '060': ['state', 'county', 'county_subdivision'],
 '067': ['state', 'county', 'county_subdivision', 'subminor_civil_division'],
 '070': ['state', 'county', 'county_subdivision', 'place_remainder_or_part'],
 '140': ['state', 'county', 'tract'],
 '150': ['state', 'county', 'tract', 'block_group'],
 '155': ['state', 'place', 'county_or_part'],
 '160': ['state', 'place'],
 '170': ['state', 'consolidated_city'],
 '172': ['state', 'consolidated_city', 'place_or_part'],
 '230': ['state', 'alaska_native_regional_corporation'],
 '250': ['american_indian_area_alaska_native_area_hawaiian_home_land'],
 '251': ['american_indian_area_alaska_native_area_hawaiian_home_land',
  'tribal_subdivision_remainder'],
 '252': ['american_indian_area_alaska_native_area_reservation_or_statistical_entity_only'],
 '254': ['american_indian_area_off_reservation_trust_land_only_hawaiian_home_land'],
 '256': ['ame
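
Since the result is a plain dictionary, ordinary Python is enough to search it. The sketch below uses a hand-copied excerpt of the output above and finds the hierarchies that end at a given level. The function `specs_ending_with` is a hypothetical helper, not something `censusdis` provides.

```python
# Excerpt of the dictionary shown above, copied by hand.
SPECS = {
    "040": ["state"],
    "050": ["state", "county"],
    "060": ["state", "county", "county_subdivision"],
    "140": ["state", "county", "tract"],
    "150": ["state", "county", "tract", "block_group"],
    "160": ["state", "place"],
}


def specs_ending_with(specs, level):
    """Return the hierarchies whose most specific component is `level`."""
    return {key: path for key, path in specs.items() if path[-1] == level}


for key, path in specs_ending_with(SPECS, "tract").items():
    # The path components double as the keyword argument names for a download.
    print(key, "needs keyword arguments:", ", ".join(path))
```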

## Loading Data for Leaves

Once we know the variables we want, for example the leaves of group B03002 that we found
above, and we select one of the geography types and fill in the particular values we want
at each level, we can load data for them. In our case, we will use the one with the key `'610'`,
which represents the districts of the upper house of a state's legislature. The value
associated with that key is a list of the keyword arguments we can pass to `censusdis.data.download`
to specify which state and which district or districts within the state we want data for.

This call will load data for all the variables at the leaves of the group we examined earlier for
all state senate districts in NJ by using the keyword arguments
`state=STATE_NJ` and `state_legislative_district_upper_chamber="*"`.

In [9]:
ced.download(
    DATASET, YEAR, leaves, state=STATE_NJ, state_legislative_district_upper_chamber="*"
)

Unnamed: 0,STATE,STATE_LEGISLATIVE_DISTRICT_UPPER_CHAMBER,B03002_003E,B03002_004E,B03002_005E,B03002_006E,B03002_007E,B03002_008E,B03002_010E,B03002_011E,B03002_013E,B03002_014E,B03002_015E,B03002_016E,B03002_017E,B03002_018E,B03002_020E,B03002_021E
0,34,1,142196,21618,925,1963,6,371,487,4965,24460,1820,191,107,25,9969,2943,1306
1,34,2,107867,31770,547,17962,98,650,694,5588,16532,1398,694,128,16,18301,4893,1337
2,34,3,149678,33696,452,4264,24,237,358,4327,15417,931,179,22,22,6880,3862,590
3,34,4,148292,39088,216,7032,17,507,411,5723,8885,2564,287,41,0,5886,1202,1466
4,34,5,114996,42660,206,7225,18,690,142,3783,16543,2455,239,64,24,24589,5088,963
5,34,6,144747,22647,138,21829,63,1447,570,4509,11936,1762,92,40,71,9901,2583,1188
6,34,7,134612,47217,52,12108,50,871,561,7288,9159,1511,332,207,4,4596,2102,948
7,34,8,160610,22823,35,7878,113,388,561,6264,8291,1528,170,49,134,4174,2510,984
8,34,9,194162,7586,117,5480,46,418,196,4949,11273,928,12,64,0,3328,1989,183
9,34,10,188982,7378,188,5765,136,986,209,4130,11925,653,114,34,0,4483,2396,689
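
Because the internal nodes are sums of their subtrees, the aggregates can be recovered from the leaves alone. As a small arithmetic sketch, the code below adds up the leaf values for district 1 (row 0 above). The grouping of leaves under B03002_002E and B03002_012E follows the tree printed earlier. Under that aggregation the sums would be expected to match those variables and the B03002_001E total, although those three were not downloaded here, so the match is not checked.

```python
# Leaf values for NJ state senate district 1, copied from row 0 of the output above.
ROW = {
    "B03002_003E": 142196, "B03002_004E": 21618, "B03002_005E": 925,
    "B03002_006E": 1963, "B03002_007E": 6, "B03002_008E": 371,
    "B03002_010E": 487, "B03002_011E": 4965,
    "B03002_013E": 24460, "B03002_014E": 1820, "B03002_015E": 191,
    "B03002_016E": 107, "B03002_017E": 25, "B03002_018E": 9969,
    "B03002_020E": 2943, "B03002_021E": 1306,
}

# Leaves under "Not Hispanic or Latino:" (B03002_002E) in the tree.
NOT_HISPANIC_LEAVES = [
    "B03002_003E", "B03002_004E", "B03002_005E", "B03002_006E",
    "B03002_007E", "B03002_008E", "B03002_010E", "B03002_011E",
]
# Leaves under "Hispanic or Latino:" (B03002_012E) in the tree.
HISPANIC_LEAVES = [
    "B03002_013E", "B03002_014E", "B03002_015E", "B03002_016E",
    "B03002_017E", "B03002_018E", "B03002_020E", "B03002_021E",
]

not_hispanic = sum(ROW[name] for name in NOT_HISPANIC_LEAVES)
hispanic = sum(ROW[name] for name in HISPANIC_LEAVES)
total = not_hispanic + hispanic

print(not_hispanic, hispanic, total)  # 172531 40821 213352
```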


## Error Handling

What happens if we get one of the keywords wrong? We can always look 
a few cells up at the output from `cgeo.geo_path_snake_specs`, but we
also get a friendly exception message to tell us what the options are.

In [10]:
try:
    ced.download(DATASET, YEAR, leaves, state=STATE_NJ, unknown_geo="*")
except ced.CensusApiException as e:
    print(e)

Unable to match the geography specification {'state': '34', 'unknown_geo': '*'}.
Supported geographies for dataset='acs/acs5' in year=2020 are:
['us']
['region']
['division']
['state']
['state', 'county']
['state', 'county', 'county_subdivision']
['state', 'county', 'county_subdivision', 'subminor_civil_division']
['state', 'county', 'county_subdivision', 'place_remainder_or_part']
['state', 'county', 'tract']
['state', 'county', 'tract', 'block_group']
['state', 'place', 'county_or_part']
['state', 'place']
['state', 'consolidated_city']
['state', 'consolidated_city', 'place_or_part']
['state', 'alaska_native_regional_corporation']
['american_indian_area_alaska_native_area_hawaiian_home_land']
['american_indian_area_alaska_native_area_hawaiian_home_land', 'tribal_subdivision_remainder']
['american_indian_area_alaska_native_area_reservation_or_statistical_entity_only']
['american_indian_area_off_reservation_trust_land_only_hawaiian_home_land']
['american_indian_area_alaska_native_area_
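
To illustrate the kind of check that can produce a message like this, here is a minimal sketch that compares the keyword names in a call against a table of supported hierarchies. It is not the `censusdis` implementation. The function `match_geography`, the use of `ValueError`, and the shortened `SPECS` table are all assumptions for illustration.

```python
# Excerpt of supported hierarchies, copied by hand from the output above.
SPECS = {
    "040": ["state"],
    "050": ["state", "county"],
    "140": ["state", "county", "tract"],
    "160": ["state", "place"],
}


def match_geography(specs, **kwargs):
    """Return the key of the hierarchy whose components equal the keyword names.

    Raises ValueError listing the supported hierarchies when nothing matches.
    """
    wanted = sorted(kwargs)
    for key, path in specs.items():
        if sorted(path) == wanted:
            return key
    options = "\n".join(str(path) for path in specs.values())
    raise ValueError(
        f"Unable to match the geography specification {kwargs}.\n"
        f"Supported geographies are:\n{options}"
    )


print(match_geography(SPECS, state="34", county="*"))  # 050

try:
    match_geography(SPECS, state="34", unknown_geo="*")
except ValueError as e:
    print(e)
```

Comparing sorted names makes the match independent of the order in which the keyword arguments are written.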