# Exploring a Group of Variables

There are thousands of different variables available via the US Census
API. One way to navigate through them is to look through a hierarchy
of web pages starting with a year, like 2020, on a page like 
https://api.census.gov/data/2020.html. 

From here we can navigate down to a particular data source by following
the link in the *Group List* column of the first row, which is for the
dataset `acs/acs5`. This takes us to
https://api.census.gov/data/2020/acs/acs5/groups.html.

From there,
we can choose the group named B03002, which takes us to 
https://api.census.gov/data/2020/acs/acs5/groups/B03002.html, where we 
can see all the variables in the group. 

Some of these variables are estimates,
and some of them are annotations. Normally, we are interested in the estimates.

Every variable has a label, which is a string of components separated by
`!!`. For example, B03002_007E has the label
"Estimate!!Total:!!Not Hispanic or Latino:!!Native Hawaiian and Other Pacific Islander alone".
The `!!` separators imply a tree among the variables, from the root all the way down
to leaves. Internal nodes in the tree represent variables that are aggregates of 
those lower than them in the tree. For example, B03002_002E is the size of the 
population that is not Hispanic or Latino, regardless of race. It is the sum of
B03002_003E, B03002_004E, B03002_005E, B03002_006E, B03002_007E, B03002_008E, and B03002_009E,
which count people of different races who are not Hispanic or Latino. All of these
are leaves of the tree, except B03002_009E, which is further subdivided into
B03002_010E and B03002_011E.
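
To make the tree implied by the `!!` separators concrete, here is a minimal sketch that splits a handful of B03002 labels into components and nests them. The labels are taken from this group; the nested-dict layout and the `build_tree` helper are illustrative assumptions and not how `censusdis` represents the tree internally.

```python
# Labels from the B03002 group. The nested-dict layout below is an
# illustrative assumption, not the structure censusdis uses internally.
LABELS = {
    "B03002_001E": "Estimate!!Total:",
    "B03002_002E": "Estimate!!Total:!!Not Hispanic or Latino:",
    "B03002_007E": (
        "Estimate!!Total:!!Not Hispanic or Latino:"
        "!!Native Hawaiian and Other Pacific Islander alone"
    ),
    "B03002_009E": "Estimate!!Total:!!Not Hispanic or Latino:!!Two or more races:",
    "B03002_010E": (
        "Estimate!!Total:!!Not Hispanic or Latino:!!Two or more races:"
        "!!Two races including Some other race"
    ),
}


def build_tree(labels):
    """Nest variables by the `!!`-separated components of their labels."""
    root = {"variable": None, "children": {}}
    for variable, label in labels.items():
        node = root
        for component in label.split("!!"):
            # Create the child node the first time a component is seen.
            node = node["children"].setdefault(
                component, {"variable": None, "children": {}}
            )
        # The node reached by the full label holds the variable itself.
        node["variable"] = variable
    return root


tree = build_tree(LABELS)
estimate = tree["children"]["Estimate"]
total = estimate["children"]["Total:"]
not_hispanic = total["children"]["Not Hispanic or Latino:"]

print(estimate["variable"])      # None: no variable is labeled just "Estimate"
print(total["variable"])         # B03002_001E
print(sorted(not_hispanic["children"]))
```

In this sketch a node with an empty `children` dict is a leaf, and a node such as `Estimate` can exist purely as a grouping level without a variable of its own.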

All of this can be hard to follow in the tabular form on the web page.
To make it easier, the `censusdis` package includes code to fetch and
present the variable hierarchy in a more readable way. That is what the
remainder of this notebook covers.

# Import and Configuration

In [1]:
import censusdis.data as ced
from censusdis.states import STATE_NJ

In [2]:
YEAR = 2020
DATASET = "acs/acs5"
GROUP = "B03002"

# Programmatic Access to a Group Variable Tree

We can get the whole collection of variables in tree form and print it.
This format makes it easier to see the relationships we saw in the
table at https://api.census.gov/data/2020/acs/acs5/groups/B03002.html.

We can see clearly where variables exist at internal nodes of the tree
and where they exist at the leaves. Not all internal nodes have variables,
but all leaves do. For example, the root node Estimate has no variable of
its own.

In [3]:
tree = ced.variables.group_tree(DATASET, YEAR, GROUP)
print(tree)

+ Estimate
    + Total: (B03002_001E)
        + Not Hispanic or Latino: (B03002_002E)
            + White alone (B03002_003E)
            + Black or African American alone (B03002_004E)
            + American Indian and Alaska Native alone (B03002_005E)
            + Asian alone (B03002_006E)
            + Native Hawaiian and Other Pacific Islander alone (B03002_007E)
            + Some other race alone (B03002_008E)
            + Two or more races: (B03002_009E)
                + Two races including Some other race (B03002_010E)
                + Two races excluding Some other race, and three or more races (B03002_011E)
        + Hispanic or Latino: (B03002_012E)
            + White alone (B03002_013E)
            + Black or African American alone (B03002_014E)
            + American Indian and Alaska Native alone (B03002_015E)
            + Asian alone (B03002_016E)
            + Native Hawaiian and Other Pacific Islander alone (B03002_017E)
            +
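
The indented format above is straightforward to reproduce. As a rough sketch, the following renders a small hand-written fragment of the tree in the same style. The tuple layout `(name, variable, children)` and the `render` helper are hypothetical and chosen only for illustration; they are not part of `censusdis`.

```python
# Hypothetical node layout: (name, variable or None, list of child nodes).
# Only a fragment of the B03002 tree is included to keep the sketch short.
TREE = (
    "Estimate", None, [
        ("Total:", "B03002_001E", [
            ("Not Hispanic or Latino:", "B03002_002E", [
                ("White alone", "B03002_003E", []),
            ]),
            ("Hispanic or Latino:", "B03002_012E", [
                ("White alone", "B03002_013E", []),
            ]),
        ]),
    ],
)


def render(node, depth=0):
    """Return one line per node, indented four spaces per tree level."""
    name, variable, children = node
    # Nodes without a variable are printed by name only.
    suffix = f" ({variable})" if variable else ""
    lines = [f"{'    ' * depth}+ {name}{suffix}"]
    for child in children:
        lines.extend(render(child, depth + 1))
    return lines


rendered = "\n".join(render(TREE))
print(rendered)
```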

## Accessing a Sub-Tree

Most of the time we are only interested in variables that are estimates, so
we can restrict our view to that part of the tree.

In [4]:
print(tree["Estimate"])

+ Total: (B03002_001E)
    + Not Hispanic or Latino: (B03002_002E)
        + White alone (B03002_003E)
        + Black or African American alone (B03002_004E)
        + American Indian and Alaska Native alone (B03002_005E)
        + Asian alone (B03002_006E)
        + Native Hawaiian and Other Pacific Islander alone (B03002_007E)
        + Some other race alone (B03002_008E)
        + Two or more races: (B03002_009E)
            + Two races including Some other race (B03002_010E)
            + Two races excluding Some other race, and three or more races (B03002_011E)
    + Hispanic or Latino: (B03002_012E)
        + White alone (B03002_013E)
        + Black or African American alone (B03002_014E)
        + American Indian and Alaska Native alone (B03002_015E)
        + Asian alone (B03002_016E)
        + Native Hawaiian and Other Pacific Islander alone (B03002_017E)
        + Some other race alone (B03002_018E)
        + Two or more races: (B03002_019E)
            + Two races includin

## Leaves

In many cases, we are really just interested in the leaves, because the
internal nodes of the tree contain variables that are aggregate sums of the 
subtrees below them.

In [5]:
leaves = ced.variables.group_leaves(DATASET, YEAR, GROUP)
leaves

['B03002_003E',
 'B03002_004E',
 'B03002_005E',
 'B03002_006E',
 'B03002_007E',
 'B03002_008E',
 'B03002_010E',
 'B03002_011E',
 'B03002_013E',
 'B03002_014E',
 'B03002_015E',
 'B03002_016E',
 'B03002_017E',
 'B03002_018E',
 'B03002_020E',
 'B03002_021E']
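
One way to think about what makes a variable a leaf: its label is not extended by the label of any other variable. The sketch below applies that rule to a few labels from this group. The `find_leaves` helper is an illustrative assumption and may differ from how `group_leaves` is actually implemented.

```python
# A few labels from the B03002 group, enough to include both internal
# nodes and leaves.
PREFIX = "Estimate!!Total:!!Not Hispanic or Latino:"
LABELS = {
    "B03002_002E": PREFIX,
    "B03002_008E": PREFIX + "!!Some other race alone",
    "B03002_009E": PREFIX + "!!Two or more races:",
    "B03002_010E": PREFIX + "!!Two or more races:!!Two races including Some other race",
    "B03002_011E": (
        PREFIX
        + "!!Two or more races:"
        + "!!Two races excluding Some other race, and three or more races"
    ),
}


def find_leaves(labels):
    """Return variables whose label is not a `!!` prefix of another label."""
    leaves = []
    for variable, label in labels.items():
        extended = label + "!!"
        if not any(other.startswith(extended) for other in labels.values()):
            leaves.append(variable)
    return sorted(leaves)


sketch_leaves = find_leaves(LABELS)
print(sketch_leaves)  # ['B03002_008E', 'B03002_010E', 'B03002_011E']
```

B03002_002E and B03002_009E are excluded because other labels extend theirs, which matches their absence from the list returned by the library.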

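Notice that the leaves returned by `group_leaves` include only estimates,
not annotations. If we really want to see those too, we can pass the
optional argument `skip_annotations=False`. The result then also contains
the margin-of-error variables (suffix M) alongside the annotations
(suffixes EA and MA).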

In [6]:
all_leaves = ced.variables.group_leaves(DATASET, YEAR, GROUP, skip_annotations=False)
all_leaves

['B03002_003E',
 'B03002_003EA',
 'B03002_003M',
 'B03002_003MA',
 'B03002_004E',
 'B03002_004EA',
 'B03002_004M',
 'B03002_004MA',
 'B03002_005E',
 'B03002_005EA',
 'B03002_005M',
 'B03002_005MA',
 'B03002_006E',
 'B03002_006EA',
 'B03002_006M',
 'B03002_006MA',
 'B03002_007E',
 'B03002_007EA',
 'B03002_007M',
 'B03002_007MA',
 'B03002_008E',
 'B03002_008EA',
 'B03002_008M',
 'B03002_008MA',
 'B03002_010E',
 'B03002_010EA',
 'B03002_010M',
 'B03002_010MA',
 'B03002_011E',
 'B03002_011EA',
 'B03002_011M',
 'B03002_011MA',
 'B03002_013E',
 'B03002_013EA',
 'B03002_013M',
 'B03002_013MA',
 'B03002_014E',
 'B03002_014EA',
 'B03002_014M',
 'B03002_014MA',
 'B03002_015E',
 'B03002_015EA',
 'B03002_015M',
 'B03002_015MA',
 'B03002_016E',
 'B03002_016EA',
 'B03002_016M',
 'B03002_016MA',
 'B03002_017E',
 'B03002_017EA',
 'B03002_017M',
 'B03002_017MA',
 'B03002_018E',
 'B03002_018EA',
 'B03002_018M',
 'B03002_018MA',
 'B03002_020E',
 'B03002_020EA',
 'B03002_020M',
 'B03002_020MA',
 'B03002_0
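
The variable names in this longer list differ only in their suffixes. As a small sketch, the following classifies names by suffix, assuming the usual ACS convention that E marks an estimate, M a margin of error, and EA and MA the annotations of each. The `kind` helper is hypothetical and not part of `censusdis`.

```python
# Assumed ACS suffix convention; longer suffixes are checked first so that
# "EA" and "MA" are not mistaken for "E" or "M" matches on other names.
SUFFIX_KINDS = (
    ("EA", "annotation of estimate"),
    ("MA", "annotation of margin of error"),
    ("E", "estimate"),
    ("M", "margin of error"),
)


def kind(name):
    """Classify a variable name by its trailing suffix."""
    for suffix, description in SUFFIX_KINDS:
        if name.endswith(suffix):
            return description
    return "unknown"


names = ["B03002_003E", "B03002_003EA", "B03002_003M", "B03002_003MA"]
for name in names:
    print(name, "->", kind(name))

# Keeping only the estimates recovers the shorter list of leaves.
estimates_only = [name for name in names if kind(name) == "estimate"]
print(estimates_only)  # ['B03002_003E']
```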

# Loading Data for Leaves

Once we have the leaves of a group, we will often want to load
data for them. This call will load data for all the leaves in the 
group for each census tract in the state of New Jersey.

In [7]:
ced.download_detail(DATASET, YEAR, leaves, state=STATE_NJ, tract="*")

Unnamed: 0,STATE,COUNTY,TRACT,B03002_003E,B03002_004E,B03002_005E,B03002_006E,B03002_007E,B03002_008E,B03002_010E,B03002_011E,B03002_013E,B03002_014E,B03002_015E,B03002_016E,B03002_017E,B03002_018E,B03002_020E,B03002_021E
0,34,013,009200,430,797,0,0,0,11,10,69,928,81,0,0,0,639,207,0
1,34,013,009300,153,1297,0,166,0,271,0,37,1270,266,0,0,0,1367,249,82
2,34,013,009400,790,1197,0,115,0,0,0,0,1099,199,0,0,0,1573,386,430
3,34,013,009500,1040,1182,0,94,0,31,0,90,1299,586,24,0,0,1417,0,436
4,34,013,009600,377,933,62,0,0,152,0,159,1817,155,0,0,0,1456,84,206
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2176,34,039,036200,5307,0,0,240,0,73,0,14,734,0,0,0,0,60,0,54
2177,34,039,036301,4547,159,0,272,0,0,6,11,415,16,0,0,0,116,48,0
2178,34,039,036302,3409,115,0,88,0,0,0,0,149,0,0,0,0,14,0,0
2179,34,039,036400,6035,58,0,349,0,9,0,176,174,0,0,0,0,3,0,0
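
Because internal nodes are sums of the leaves below them, aggregates can be rebuilt from leaf data alone. The sketch below does this for the first tract in the table above (county 013, tract 009200), with the values copied by hand into a plain dict. The grouping lists are illustrative names, and the totals are computed here from the leaves rather than fetched from the API.

```python
# Leaf values for tract 009200 in county 013, copied from the first row
# of the table above.
ROW = {
    "B03002_003E": 430, "B03002_004E": 797, "B03002_005E": 0,
    "B03002_006E": 0, "B03002_007E": 0, "B03002_008E": 11,
    "B03002_010E": 10, "B03002_011E": 69,
    "B03002_013E": 928, "B03002_014E": 81, "B03002_015E": 0,
    "B03002_016E": 0, "B03002_017E": 0, "B03002_018E": 639,
    "B03002_020E": 207, "B03002_021E": 0,
}

# Leaves under "Not Hispanic or Latino:" and under "Hispanic or Latino:".
NOT_HISPANIC_LEAVES = [
    "B03002_003E", "B03002_004E", "B03002_005E", "B03002_006E",
    "B03002_007E", "B03002_008E", "B03002_010E", "B03002_011E",
]
HISPANIC_LEAVES = [
    "B03002_013E", "B03002_014E", "B03002_015E", "B03002_016E",
    "B03002_017E", "B03002_018E", "B03002_020E", "B03002_021E",
]

not_hispanic_total = sum(ROW[v] for v in NOT_HISPANIC_LEAVES)
hispanic_total = sum(ROW[v] for v in HISPANIC_LEAVES)
tract_total = not_hispanic_total + hispanic_total

print(not_hispanic_total)  # 1317
print(hispanic_total)      # 1855
print(tract_total)         # 3172
```

Under the sum relationship described earlier, these three numbers correspond to B03002_002E, B03002_012E, and B03002_001E for this tract.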
