# Census Data Extraction and Processing

## Purpose

This notebook shows how to programmatically acquire and then process census data. Such data can be obtained manually by visiting [data.census.gov](https://data.census.gov/cedsci/), but the format of the data, or even the act of downloading it, can be time consuming. We therefore demonstrate how to pull data from the Census in an easily reproducible way and process it into the format needed for analysis. Different uses will require different formats, so we demonstrate just one case and hope it gives you enough experience to alter the code for your own use case.

## Approach

The US Census provides [APIs](https://www.census.gov/data/developers/data-sets.htm) for programmatically accessing census data. We could use those directly, but that would require us to write a large amount of code. Instead we will leverage an available Python library called [`censusdata`](https://github.com/jtleider/censusdata) to help us obtain the data.

Then we will use some code to process it. 

**API KEY** Note: if you want to download large amounts of data or use the API frequently, you must obtain a Census API key. This is quite easy: visit the [key signup page](https://api.census.gov/data/key_signup.html) and enter your information, and a key will be sent to you.

A key is just a random string of numbers and letters that is linked to you. 

It will look something like this: `96e87430410c12340dc57c6e556edf5c25905d70`. Below (where relevant) we will highlight where you can add your key when downloading data.
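One way to keep the key out of the notebook itself is to read it from an environment variable. The sketch below assumes a variable named `CENSUS_API_KEY` (a name chosen here for illustration, not something the Census or `censusdata` requires) and builds the `key` keyword argument that `censusdata.download` accepts.

```python
import os


def census_key_kwargs(env_var="CENSUS_API_KEY"):
    """Return {"key": <value>} if the environment variable is set, else {}.

    The variable name is an assumption made for this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    return {"key": key} if key else {}


# Hypothetical usage once `geo` and `variables` are defined:
# censusdata.download("acs5", 2019, geo, variables, **census_key_kwargs())
```

If the variable is not set, the function returns an empty dictionary, so the same call works with or without a key.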

## Code

**Imports**

In [32]:
import pandas as pd
import censusdata

pd.set_option(
    "display.expand_frame_repr", False
)  # These options just help us see more of the data in the dataframe
pd.set_option("display.precision", 2)

## Case

We will obtain data from the `B05006` table which provides information on the `PLACE OF BIRTH FOR THE FOREIGN-BORN POPULATION IN THE UNITED STATES`. 

But we will provide some other examples as well.

Let's say we didn't know the table name we were looking for but knew the concept. We can use `censusdata.search` to search for and examine the fields available.

## Search for data

**Let's search for income in the detail tables, using the 5-year ACS and the 2019 estimates**

For reference on detail vs. subject tables, see the [ACS 5-year API documentation](https://www.census.gov/data/developers/data-sets/acs-5year.html).



Also note the third argument, `concept`; the other option here is `label`. With `label`, the search term is matched against the variable label, while with `concept` it is matched against the Census concept label. This is a bit confusing and not very well documented by the Census. When in doubt, try both.
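Since it is not always obvious which field holds a term, one option is a small wrapper that runs the same query against both fields. The sketch below takes the search function as an argument so it can be tried without network access; `search_both` is a name invented here and is not part of `censusdata`.

```python
def search_both(search_fn, src, year, term, **kwargs):
    """Run the same search against the "label" and "concept" fields.

    `search_fn` is expected to have the call shape used in this notebook:
    search_fn(src, year, field, term, **kwargs).
    """
    return {
        field: search_fn(src, year, field, term, **kwargs)
        for field in ("label", "concept")
    }


# Hypothetical usage with the real library:
# hits = search_both(censusdata.search, "acs5", 2019, "INCOME", tabletype="detail")
# len(hits["label"]), len(hits["concept"])
```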

In [12]:
censusdata.search("acs5", 2019, "concept", "INCOME", tabletype="detail")

[('B05010_001E',
  'RATIO OF INCOME TO POVERTY LEVEL IN THE PAST 12 MONTHS BY NATIVITY OF CHILDREN UNDER 18 YEARS IN FAMILIES AND SUBFAMILIES BY LIVING ARRANGEMENTS AND NATIVITY OF PARENTS',
  'Estimate!!Total:'),
 ('B05010_002E',
  'RATIO OF INCOME TO POVERTY LEVEL IN THE PAST 12 MONTHS BY NATIVITY OF CHILDREN UNDER 18 YEARS IN FAMILIES AND SUBFAMILIES BY LIVING ARRANGEMENTS AND NATIVITY OF PARENTS',
  'Estimate!!Total:!!Under 1.00:'),
 ('B05010_003E',
  'RATIO OF INCOME TO POVERTY LEVEL IN THE PAST 12 MONTHS BY NATIVITY OF CHILDREN UNDER 18 YEARS IN FAMILIES AND SUBFAMILIES BY LIVING ARRANGEMENTS AND NATIVITY OF PARENTS',
  'Estimate!!Total:!!Under 1.00:!!Living with two parents:'),
 ('B05010_004E',
  'RATIO OF INCOME TO POVERTY LEVEL IN THE PAST 12 MONTHS BY NATIVITY OF CHILDREN UNDER 18 YEARS IN FAMILIES AND SUBFAMILIES BY LIVING ARRANGEMENTS AND NATIVITY OF PARENTS',
  'Estimate!!Total:!!Under 1.00:!!Living with two parents:!!Both parents native'),
 ('B05010_005E',
  'RATIO OF INC

In [11]:
censusdata.search("acs5", 2019, "label", "INCOME", tabletype="detail")

[('B06010PR_002E',
  'PLACE OF BIRTH BY INDIVIDUAL INCOME IN THE PAST 12 MONTHS (IN 2019 INFLATION-ADJUSTED DOLLARS) IN PUERTO RICO',
  'Estimate!!Total:!!No income'),
 ('B06010PR_003E',
  'PLACE OF BIRTH BY INDIVIDUAL INCOME IN THE PAST 12 MONTHS (IN 2019 INFLATION-ADJUSTED DOLLARS) IN PUERTO RICO',
  'Estimate!!Total:!!With income:'),
 ('B06010PR_004E',
  'PLACE OF BIRTH BY INDIVIDUAL INCOME IN THE PAST 12 MONTHS (IN 2019 INFLATION-ADJUSTED DOLLARS) IN PUERTO RICO',
  'Estimate!!Total:!!With income:!!$1 to $9,999 or loss'),
 ('B06010PR_005E',
  'PLACE OF BIRTH BY INDIVIDUAL INCOME IN THE PAST 12 MONTHS (IN 2019 INFLATION-ADJUSTED DOLLARS) IN PUERTO RICO',
  'Estimate!!Total:!!With income:!!$10,000 to $14,999'),
 ('B06010PR_006E',
  'PLACE OF BIRTH BY INDIVIDUAL INCOME IN THE PAST 12 MONTHS (IN 2019 INFLATION-ADJUSTED DOLLARS) IN PUERTO RICO',
  'Estimate!!Total:!!With income:!!$15,000 to $24,999'),
 ('B06010PR_007E',
  'PLACE OF BIRTH BY INDIVIDUAL INCOME IN THE PAST 12 MONTHS (IN 20

Each result lists a variable ID along with its concept and label. The ID combines the table name and a variable code: `B05010_001E` --> table name `B05010` and variable code `001E`.
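As a small illustration of that naming scheme, the sketch below splits a variable ID at the underscore. The helper name `split_variable` is made up for this example.

```python
def split_variable(variable_id):
    """Split an ID such as 'B05010_001E' into (table name, variable code)."""
    table, code = variable_id.split("_", 1)
    return table, code


print(split_variable("B05010_001E"))  # ('B05010', '001E')
print(split_variable("B06010PR_002E"))  # ('B06010PR', '002E')
```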

For our use case we want info on `PLACE OF BIRTH FOR THE FOREIGN-BORN POPULATION`, so let's search for that.

In [33]:
censusdata.search(
    "acs5",
    2019,
    "label",
    "PLACE OF BIRTH FOR THE FOREIGN-BORN POPULATION IN THE UNITED STATES",
    tabletype="detail",
)

[]

We get nothing in the label, so let's try concept.

In [34]:
r = censusdata.search(
    "acs5",
    2019,
    "concept",
    "PLACE OF BIRTH FOR THE FOREIGN-BORN POPULATION IN THE UNITED STATES",
    tabletype="detail",
)
r

[('B05006_001E',
  'PLACE OF BIRTH FOR THE FOREIGN-BORN POPULATION IN THE UNITED STATES',
  'Estimate!!Total:'),
 ('B05006_002E',
  'PLACE OF BIRTH FOR THE FOREIGN-BORN POPULATION IN THE UNITED STATES',
  'Estimate!!Total:!!Europe:'),
 ('B05006_003E',
  'PLACE OF BIRTH FOR THE FOREIGN-BORN POPULATION IN THE UNITED STATES',
  'Estimate!!Total:!!Europe:!!Northern Europe:'),
 ('B05006_004E',
  'PLACE OF BIRTH FOR THE FOREIGN-BORN POPULATION IN THE UNITED STATES',
  'Estimate!!Total:!!Europe:!!Northern Europe:!!Ireland'),
 ('B05006_005E',
  'PLACE OF BIRTH FOR THE FOREIGN-BORN POPULATION IN THE UNITED STATES',
  'Estimate!!Total:!!Europe:!!Northern Europe:!!Denmark'),
 ('B05006_006E',
  'PLACE OF BIRTH FOR THE FOREIGN-BORN POPULATION IN THE UNITED STATES',
  'Estimate!!Total:!!Europe:!!Northern Europe:!!Norway'),
 ('B05006_007E',
  'PLACE OF BIRTH FOR THE FOREIGN-BORN POPULATION IN THE UNITED STATES',
  'Estimate!!Total:!!Europe:!!Northern Europe:!!Sweden'),
 ('B05006_008E',
  'PLACE OF BI

If we scroll through, it looks like all the info is from one table, `B05006`, so now we know that's the table we want. If you already know the table you want, you can skip straight to the next cell.
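Rather than scrolling, the table names can also be collected programmatically. The sketch below uses a few rows copied from the search output above in place of the full result `r`; the helper name `tables_in` is an invention for this example.

```python
def tables_in(results):
    """Return the set of table names found in (variable, concept, label) tuples."""
    return {variable.split("_")[0] for variable, _concept, _label in results}


concept = "PLACE OF BIRTH FOR THE FOREIGN-BORN POPULATION IN THE UNITED STATES"
sample_results = [
    ("B05006_001E", concept, "Estimate!!Total:"),
    ("B05006_002E", concept, "Estimate!!Total:!!Europe:"),
    ("B05006_003E", concept, "Estimate!!Total:!!Europe:!!Northern Europe:"),
]

print(tables_in(sample_results))  # {'B05006'}
```

Calling `tables_in(r)` on the real search result should show whether more than one table matched.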

**Get the table info based on table name, year, and survey**

In [17]:
table_info = censusdata.censustable("acs5", 2019, "B05006")

In [18]:
table_info

OrderedDict([('B05006_001E',
              {'label': 'Estimate!!Total:',
               'concept': 'PLACE OF BIRTH FOR THE FOREIGN-BORN POPULATION IN THE UNITED STATES',
               'predicateType': 'int'}),
             ('B05006_002E',
              {'label': 'Estimate!!Total:!!Europe:',
               'concept': 'PLACE OF BIRTH FOR THE FOREIGN-BORN POPULATION IN THE UNITED STATES',
               'predicateType': 'int'}),
             ('B05006_003E',
              {'label': 'Estimate!!Total:!!Europe:!!Northern Europe:',
               'concept': 'PLACE OF BIRTH FOR THE FOREIGN-BORN POPULATION IN THE UNITED STATES',
               'predicateType': 'int'}),
             ('B05006_004E',
              {'label': 'Estimate!!Total:!!Europe:!!Northern Europe:!!Ireland',
               'concept': 'PLACE OF BIRTH FOR THE FOREIGN-BORN POPULATION IN THE UNITED STATES',
               'predicateType': 'int'}),
             ('B05006_005E',
              {'label': 'Estimate!!Total:!!Europe:!!Nor

Above we see all the information for that table, which matches what we found in the search. The `censusdata` package also provides a helper function to make this more readable. 

In [20]:
censusdata.printtable(table_info)

Variable     | Table                          | Label                                                    | Type 
-------------------------------------------------------------------------------------------------------------------
B05006_001E  | PLACE OF BIRTH FOR THE FOREIGN | !! Estimate Total:                                       | int  
B05006_002E  | PLACE OF BIRTH FOR THE FOREIGN | !! !! Estimate Total: Europe:                            | int  
B05006_003E  | PLACE OF BIRTH FOR THE FOREIGN | !! !! !! Estimate Total: Europe: Northern Europe:        | int  
B05006_004E  | PLACE OF BIRTH FOR THE FOREIGN | !! !! !! !! Estimate Total: Europe: Northern Europe: Ire | int  
B05006_005E  | PLACE OF BIRTH FOR THE FOREIGN | !! !! !! !! Estimate Total: Europe: Northern Europe: Den | int  
B05006_006E  | PLACE OF BIRTH FOR THE FOREIGN | !! !! !! !! Estimate Total: Europe: Northern Europe: Nor | int  
B05006_007E  | PLACE OF BIRTH FOR THE FOREIGN | !! !! !! !! Estimate Total: Europe: Northern 

## Now get the geography

Census data can come in many different [geographies](https://www.census.gov/programs-surveys/geography/geographies.html) (national, state, county, tract, block group, etc.).

The `censusdata` library can also help us pick our geography and get information for it. We combine a geography with the table info and then request the data from the Census.

**Let's look for all states in the US**

In [22]:
censusdata.geographies(censusdata.censusgeo([("state", "*")]), "acs5", 2019)

{'Alabama': censusgeo((('state', '01'),)),
 'Alaska': censusgeo((('state', '02'),)),
 'Arizona': censusgeo((('state', '04'),)),
 'Arkansas': censusgeo((('state', '05'),)),
 'California': censusgeo((('state', '06'),)),
 'Colorado': censusgeo((('state', '08'),)),
 'Delaware': censusgeo((('state', '10'),)),
 'District of Columbia': censusgeo((('state', '11'),)),
 'Connecticut': censusgeo((('state', '09'),)),
 'Florida': censusgeo((('state', '12'),)),
 'Georgia': censusgeo((('state', '13'),)),
 'Idaho': censusgeo((('state', '16'),)),
 'Hawaii': censusgeo((('state', '15'),)),
 'Illinois': censusgeo((('state', '17'),)),
 'Indiana': censusgeo((('state', '18'),)),
 'Iowa': censusgeo((('state', '19'),)),
 'Kansas': censusgeo((('state', '20'),)),
 'Kentucky': censusgeo((('state', '21'),)),
 'Louisiana': censusgeo((('state', '22'),)),
 'Maine': censusgeo((('state', '23'),)),
 'Maryland': censusgeo((('state', '24'),)),
 'Massachusetts': censusgeo((('state', '25'),)),
 'Michigan': censusgeo((('stat

**Let's look for all counties in the US**

Below we have a list with different search elements. 
```
[
    ('county', '*') <-- Here we use an asterisk to indicate that we want all counties in the country
]
```

In [21]:
censusdata.geographies(censusdata.censusgeo([("county", "*")]), "acs5", 2019)

{'Fayette County, Illinois': censusgeo((('state', '17'), ('county', '051'))),
 'Logan County, Illinois': censusgeo((('state', '17'), ('county', '107'))),
 'Saline County, Illinois': censusgeo((('state', '17'), ('county', '165'))),
 'Lake County, Illinois': censusgeo((('state', '17'), ('county', '097'))),
 'Massac County, Illinois': censusgeo((('state', '17'), ('county', '127'))),
 'Cass County, Illinois': censusgeo((('state', '17'), ('county', '017'))),
 'Huntington County, Indiana': censusgeo((('state', '18'), ('county', '069'))),
 'White County, Indiana': censusgeo((('state', '18'), ('county', '181'))),
 'Jay County, Indiana': censusgeo((('state', '18'), ('county', '075'))),
 'Shelby County, Indiana': censusgeo((('state', '18'), ('county', '145'))),
 'Sullivan County, Indiana': censusgeo((('state', '18'), ('county', '153'))),
 'Tippecanoe County, Indiana': censusgeo((('state', '18'), ('county', '157'))),
 'Hamilton County, Indiana': censusgeo((('state', '18'), ('county', '057'))),
 '

**What if you needed all the tracts in a certain area?**

You can't grab all the tracts for the whole country at once because that's more data than the Census will allow you to extract in one request. But let's try to get all the tracts in Texas.

Below we have a list with different search elements. 
```
[
    ('state', '48') <-- This is the FIPS code for Texas (a state),
    ('tract', '*') <-- Here we use an asterisk to indicate that we want all tracts
]
```

In [28]:
censusdata.geographies(
    censusdata.censusgeo([("state", "48"), ("tract", "*")]), "acs5", 2019
)

{'Census Tract 133.05, Cameron County, Texas': censusgeo((('state', '48'), ('county', '061'), ('tract', '013305'))),
 'Census Tract 133.09, Cameron County, Texas': censusgeo((('state', '48'), ('county', '061'), ('tract', '013309'))),
 'Census Tract 134.02, Cameron County, Texas': censusgeo((('state', '48'), ('county', '061'), ('tract', '013402'))),
 'Census Tract 135, Cameron County, Texas': censusgeo((('state', '48'), ('county', '061'), ('tract', '013500'))),
 'Census Tract 126.13, Cameron County, Texas': censusgeo((('state', '48'), ('county', '061'), ('tract', '012613'))),
 'Census Tract 133.06, Cameron County, Texas': censusgeo((('state', '48'), ('county', '061'), ('tract', '013306'))),
 'Census Tract 102.01, Cameron County, Texas': censusgeo((('state', '48'), ('county', '061'), ('tract', '010201'))),
 'Census Tract 108, Cameron County, Texas': censusgeo((('state', '48'), ('county', '061'), ('tract', '010800'))),
 'Census Tract 105, Cameron County, Texas': censusgeo((('state', '48')
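Because tracts cannot be requested for the whole country in one call, a common workaround is to issue one request per state. The sketch below only builds the list of geography specifications; the helper name `tract_queries` is hypothetical, and the three FIPS codes are the ones that appear in the outputs above (Illinois, Indiana, Texas).

```python
def tract_queries(state_fips_codes):
    """Build one [("state", fips), ("tract", "*")] specification per state."""
    return [[("state", fips), ("tract", "*")] for fips in state_fips_codes]


queries = tract_queries(["17", "18", "48"])  # Illinois, Indiana, Texas
for spec in queries:
    print(spec)

# Hypothetical usage with the real library, one request per state:
# for spec in queries:
#     censusdata.geographies(censusdata.censusgeo(spec), "acs5", 2019)
```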

## Now download our B05006 data for all counties in the US

Which data will we request? For now I only want the estimates, not the margins of error. To do that we will grab the keys of the dictionary returned by `censusdata.censustable('acs5', 2015, 'B05006')`.

For example, look at the output of the cell below.

In [37]:
censusdata.censustable("acs5", 2015, "B05006")

OrderedDict([('B05006_001E',
              {'label': 'Total:',
               'concept': 'B05006. Place of Birth for the Foreign-Born Population in the United States',
               'predicateType': 'int'}),
             ('B05006_001M',
              {'label': 'Margin of Error for!!Total:',
               'concept': 'B05006. Place of Birth for the Foreign-Born Population in the United States',
               'predicateType': 'int'}),
             ('B05006_002E',
              {'label': 'Europe:',
               'concept': 'B05006. Place of Birth for the Foreign-Born Population in the United States',
               'predicateType': 'int'}),
             ('B05006_002M',
              {'label': 'Margin of Error for!!Europe:',
               'concept': 'B05006. Place of Birth for the Foreign-Born Population in the United States',
               'predicateType': 'int'}),
             ('B05006_003E',
              {'label': 'Europe:!!Northern Europe:',
               'concept': 'B05006. Pla

`B05006_001E` is the total number of people recorded in the Place of Birth for the Foreign-Born Population in the United States table.

Let's also grab `B05006_003E`, the number of people born in Northern Europe, and download both variables for every county in Texas (state FIPS code `48`).


In [41]:
censusdata.download(
    "acs5",
    2019,
    censusdata.censusgeo([("state", "48"), ("county", "*")]),
    ["B05006_001E", "B05006_003E"],  # <-our variables
)

Unnamed: 0,B05006_001E,B05006_003E
"San Jacinto County, Texas: Summary level: 050, state:48> county:407",1701,45
"Upshur County, Texas: Summary level: 050, state:48> county:459",1539,53
"Waller County, Texas: Summary level: 050, state:48> county:473",7181,47
"Wilson County, Texas: Summary level: 050, state:48> county:493",1841,12
"Hockley County, Texas: Summary level: 050, state:48> county:219",1940,9
...,...,...
"Brown County, Texas: Summary level: 050, state:48> county:049",1676,14
"Hall County, Texas: Summary level: 050, state:48> county:191",350,0
"Franklin County, Texas: Summary level: 050, state:48> county:159",602,0
"Frio County, Texas: Summary level: 050, state:48> county:163",3454,13


Now let's download the data for all variable estimates in the table, but not the margins of error.

To do this we can loop over the variables and make a list of just the estimates. In Python we can do this with a for loop and an if statement.

Estimates end in `E`, while margins of error end in `M`, for example `B05006_004E` vs `B05006_003M`.

See below.

In [42]:
just_estimates = []
for variable in censusdata.censustable("acs5", 2015, "B05006"):
    # Estimate IDs end in "E"; margin-of-error IDs end in "M"
    if variable.endswith("E"):
        just_estimates.append(variable)

In [43]:
just_estimates

['B05006_001E',
 'B05006_002E',
 'B05006_003E',
 'B05006_004E',
 'B05006_005E',
 'B05006_006E',
 'B05006_007E',
 'B05006_008E',
 'B05006_009E',
 'B05006_010E',
 'B05006_011E',
 'B05006_012E',
 'B05006_013E',
 'B05006_014E',
 'B05006_015E',
 'B05006_016E',
 'B05006_017E',
 'B05006_018E',
 'B05006_019E',
 'B05006_020E',
 'B05006_021E',
 'B05006_022E',
 'B05006_023E',
 'B05006_024E',
 'B05006_025E',
 'B05006_026E',
 'B05006_027E',
 'B05006_028E',
 'B05006_029E',
 'B05006_030E',
 'B05006_031E',
 'B05006_032E',
 'B05006_033E',
 'B05006_034E',
 'B05006_035E',
 'B05006_036E',
 'B05006_037E',
 'B05006_038E',
 'B05006_039E',
 'B05006_040E',
 'B05006_041E',
 'B05006_042E',
 'B05006_043E',
 'B05006_044E',
 'B05006_045E',
 'B05006_046E',
 'B05006_047E',
 'B05006_048E',
 'B05006_049E',
 'B05006_050E',
 'B05006_051E',
 'B05006_052E',
 'B05006_053E',
 'B05006_054E',
 'B05006_055E',
 'B05006_056E',
 'B05006_057E',
 'B05006_058E',
 'B05006_059E',
 'B05006_060E',
 'B05006_061E',
 'B05006_062E',
 'B05006

Another, more concise way to do this is a [list comprehension](https://www.w3schools.com/python/python_lists_comprehension.asp); see below.

In [44]:
just_estimates = [
    variable
    for variable in censusdata.censustable("acs5", 2015, "B05006")
    if variable.endswith("E")  # keep estimates, skip margins of error
]

In [45]:
just_estimates

['B05006_001E',
 'B05006_002E',
 'B05006_003E',
 'B05006_004E',
 'B05006_005E',
 'B05006_006E',
 'B05006_007E',
 'B05006_008E',
 'B05006_009E',
 'B05006_010E',
 'B05006_011E',
 'B05006_012E',
 'B05006_013E',
 'B05006_014E',
 'B05006_015E',
 'B05006_016E',
 'B05006_017E',
 'B05006_018E',
 'B05006_019E',
 'B05006_020E',
 'B05006_021E',
 'B05006_022E',
 'B05006_023E',
 'B05006_024E',
 'B05006_025E',
 'B05006_026E',
 'B05006_027E',
 'B05006_028E',
 'B05006_029E',
 'B05006_030E',
 'B05006_031E',
 'B05006_032E',
 'B05006_033E',
 'B05006_034E',
 'B05006_035E',
 'B05006_036E',
 'B05006_037E',
 'B05006_038E',
 'B05006_039E',
 'B05006_040E',
 'B05006_041E',
 'B05006_042E',
 'B05006_043E',
 'B05006_044E',
 'B05006_045E',
 'B05006_046E',
 'B05006_047E',
 'B05006_048E',
 'B05006_049E',
 'B05006_050E',
 'B05006_051E',
 'B05006_052E',
 'B05006_053E',
 'B05006_054E',
 'B05006_055E',
 'B05006_056E',
 'B05006_057E',
 'B05006_058E',
 'B05006_059E',
 'B05006_060E',
 'B05006_061E',
 'B05006_062E',
 'B05006

**Now let's request our data!**

We use the `censusdata.download` function to request the data.

In [46]:
all_counties = censusdata.download(
    "acs5", 2019, censusdata.censusgeo([("county", "*")]), just_estimates
)

Now we have a dataframe of all estimates from the `B05006` table.

In [49]:
all_counties.head()

Unnamed: 0,B05006_001E,B05006_002E,B05006_003E,B05006_004E,B05006_005E,B05006_006E,B05006_007E,B05006_008E,B05006_009E,B05006_010E,...,B05006_153E,B05006_154E,B05006_155E,B05006_156E,B05006_157E,B05006_158E,B05006_159E,B05006_160E,B05006_161E,B05006_162E
"Fayette County, Illinois: Summary level: 050, state:17> county:051",277.0,33.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"Logan County, Illinois: Summary level: 050, state:17> county:107",468.0,109.0,13.0,0.0,0.0,0.0,0.0,13.0,0.0,13.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"Saline County, Illinois: Summary level: 050, state:17> county:165",241.0,70.0,7.0,0.0,0.0,0.0,0.0,7.0,0.0,7.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"Lake County, Illinois: Summary level: 050, state:17> county:097",131398.0,25548.0,2484.0,369.0,106.0,45.0,105.0,1747.0,917.0,733.0,...,0.0,3191.0,343.0,20.0,700.0,59.0,856.0,232.0,0.0,487.0
"Massac County, Illinois: Summary level: 050, state:17> county:127",146.0,4.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,21.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Note: for large or frequent requests, you should pass your API key to the download call above as follows:
```
censusdata.download(
    "acs5",
    2019,
    censusdata.censusgeo([("county", "*")]),
    just_estimates,
    key="##########################################",
)
```
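The row labels in the `head()` output above pack the county name, summary level, and FIPS codes into one string. For joining with other county-level data, it can help to pull out the name and the five-digit county FIPS code (state code followed by county code). The sketch below parses that text with a regular expression; it assumes that `str()` of each index entry yields the format shown in the output, and `parse_county_index` is a name made up for this example.

```python
import re

# Matches text such as:
# "Fayette County, Illinois: Summary level: 050, state:17> county:051"
_GEO_PATTERN = re.compile(
    r"^(?P<name>.*): Summary level: (?P<sumlevel>\d+), "
    r"state:(?P<state>\d+)> county:(?P<county>\d+)$"
)


def parse_county_index(text):
    """Return the county name and 5-digit FIPS code from an index label."""
    match = _GEO_PATTERN.match(text)
    if match is None:
        raise ValueError(f"Unrecognised geography label: {text!r}")
    return {"name": match["name"], "fips": match["state"] + match["county"]}


row = "Fayette County, Illinois: Summary level: 050, state:17> county:051"
print(parse_county_index(row))
# {'name': 'Fayette County, Illinois', 'fips': '17051'}

# Hypothetical usage on the dataframe:
# all_counties.index.map(lambda g: parse_county_index(str(g))["fips"])
```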

**Getting Text Columns**

Our columns right now are variable codes, which are confusing to work with, so let's replace them with their text labels. We use a list comprehension and the `table_info` variable to look up the label for each code.

In [56]:
all_counties.columns = [table_info[k]["label"] for k in all_counties.columns]
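The full labels are long. If shorter names are preferred for display, one option is to keep only the last `!!`-separated segment of each label. The sketch below is an illustration only: `short_label` is a made-up helper, and short names are not guaranteed to be unique across the table, so it is worth checking for duplicates before using them as column names.

```python
def short_label(label):
    """Return the last '!!'-separated segment, without a trailing colon."""
    return label.split("!!")[-1].rstrip(":")


print(short_label("Estimate!!Total:!!Europe:!!Northern Europe:!!Ireland"))  # Ireland
print(short_label("Estimate!!Total:!!Europe:"))  # Europe
print(short_label("Estimate!!Total:"))  # Total
```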

In [58]:
all_counties.head()

Unnamed: 0,Estimate!!Total:,Estimate!!Total:!!Europe:,Estimate!!Total:!!Europe:!!Northern Europe:,Estimate!!Total:!!Europe:!!Northern Europe:!!Ireland,Estimate!!Total:!!Europe:!!Northern Europe:!!Denmark,Estimate!!Total:!!Europe:!!Northern Europe:!!Norway,Estimate!!Total:!!Europe:!!Northern Europe:!!Sweden,Estimate!!Total:!!Europe:!!Northern Europe:!!United Kingdom (inc. Crown Dependencies):,"Estimate!!Total:!!Europe:!!Northern Europe:!!United Kingdom (inc. Crown Dependencies):!!United Kingdom, excluding England and Scotland",Estimate!!Total:!!Europe:!!Northern Europe:!!United Kingdom (inc. Crown Dependencies):!!England,...,Estimate!!Total:!!Americas:!!Latin America:!!Central America:!!Other Central America,Estimate!!Total:!!Americas:!!Latin America:!!South America:,Estimate!!Total:!!Americas:!!Latin America:!!South America:!!Argentina,Estimate!!Total:!!Americas:!!Latin America:!!South America:!!Bolivia,Estimate!!Total:!!Americas:!!Latin America:!!South America:!!Brazil,Estimate!!Total:!!Americas:!!Latin America:!!South America:!!Chile,Estimate!!Total:!!Americas:!!Latin America:!!South America:!!Colombia,Estimate!!Total:!!Americas:!!Latin America:!!South America:!!Ecuador,Estimate!!Total:!!Americas:!!Latin America:!!South America:!!Guyana,Estimate!!Total:!!Americas:!!Latin America:!!South America:!!Peru
"Fayette County, Illinois: Summary level: 050, state:17> county:051",277.0,33.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"Logan County, Illinois: Summary level: 050, state:17> county:107",468.0,109.0,13.0,0.0,0.0,0.0,0.0,13.0,0.0,13.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"Saline County, Illinois: Summary level: 050, state:17> county:165",241.0,70.0,7.0,0.0,0.0,0.0,0.0,7.0,0.0,7.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"Lake County, Illinois: Summary level: 050, state:17> county:097",131398.0,25548.0,2484.0,369.0,106.0,45.0,105.0,1747.0,917.0,733.0,...,0.0,3191.0,343.0,20.0,700.0,59.0,856.0,232.0,0.0,487.0
"Massac County, Illinois: Summary level: 050, state:17> county:127",146.0,4.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,...,0.0,21.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Processing the Data

So now we have the data, but one challenge with this table is that regions, subregions, and countries are all mixed together as columns.

In [59]:
censusdata.printtable(table_info)

Variable     | Table                          | Label                                                    | Type 
-------------------------------------------------------------------------------------------------------------------
B05006_001E  | PLACE OF BIRTH FOR THE FOREIGN | !! Estimate Total:                                       | int  
B05006_002E  | PLACE OF BIRTH FOR THE FOREIGN | !! !! Estimate Total: Europe:                            | int  
B05006_003E  | PLACE OF BIRTH FOR THE FOREIGN | !! !! !! Estimate Total: Europe: Northern Europe:        | int  
B05006_004E  | PLACE OF BIRTH FOR THE FOREIGN | !! !! !! !! Estimate Total: Europe: Northern Europe: Ire | int  
B05006_005E  | PLACE OF BIRTH FOR THE FOREIGN | !! !! !! !! Estimate Total: Europe: Northern Europe: Den | int  
B05006_006E  | PLACE OF BIRTH FOR THE FOREIGN | !! !! !! !! Estimate Total: Europe: Northern Europe: Nor | int  
B05006_007E  | PLACE OF BIRTH FOR THE FOREIGN | !! !! !! !! Estimate Total: Europe: Northern 

We've written some code to help pull out our desired data subsets below. We will request data based on "level". In the printout above we see that `!!` separates groupings and subgroupings, so the number of `!!` separators in a label tells us how deep it sits: `Estimate!!Total:` is level 1, and a label with four `!!` separators is a deep subgrouping. In the cells below the helper functions we will demonstrate how this works.

In [60]:
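# Helper functions
def filter_data_by_level(
    cols=all_counties.columns, level=1, include_higher_level=False
):
    """Return the column labels that sit at the given "!!" nesting level.

    With include_higher_level=True, more deeply nested labels are kept too.
    Note: the default for `cols` is captured when this cell is run.
    """
    keep_cols = []
    for col in cols:
        depth = col.count("!!")  # number of separators = nesting depth
        if depth == level or (include_higher_level and depth > level):
            keep_cols.append(col)
    return keep_cols


def filter_data_by_terms(search_terms, cols=all_counties.columns):
    """Return the column labels containing any search term (case-insensitive)."""
    keep_cols = []
    for col in cols:
        # any() keeps each column once, even when several terms match it
        if any(term.lower() in col.lower() for term in search_terms):
            keep_cols.append(col)
    return keep_cols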
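To make the idea of a level concrete before applying it to the real columns, here is a small standalone sketch that counts the `!!` separators in a few labels taken from the table above.

```python
labels = [
    "Estimate!!Total:",
    "Estimate!!Total:!!Europe:",
    "Estimate!!Total:!!Europe:!!Northern Europe:",
    "Estimate!!Total:!!Europe:!!Northern Europe:!!Ireland",
]

# The level of a label is simply the number of "!!" separators it contains.
levels = {label: label.count("!!") for label in labels}
for label, level in levels.items():
    print(level, label)
```

The total sits at level 1, continents at level 2, subregions at level 3, and most countries at level 4.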

**Let's get the topmost level, which should be just the overall total**

In [65]:
filter_data_by_level(level=1)

['Estimate!!Total:']

**Let's get the next level, which should be continents**

In [66]:
filter_data_by_level(level=2)

['Estimate!!Total:!!Europe:',
 'Estimate!!Total:!!Asia:',
 'Estimate!!Total:!!Africa:',
 'Estimate!!Total:!!Oceania:',
 'Estimate!!Total:!!Americas:']

**And now let's get all subregions and countries**

Note the `include_higher_level` parameter: if we set it to `True`, we get the specified level plus every more deeply nested level below it (labels with more `!!` separators).

In [75]:
filter_data_by_level(level=3, include_higher_level=True)

['Estimate!!Total:!!Europe:!!Northern Europe:',
 'Estimate!!Total:!!Europe:!!Northern Europe:!!Ireland',
 'Estimate!!Total:!!Europe:!!Northern Europe:!!Denmark',
 'Estimate!!Total:!!Europe:!!Northern Europe:!!Norway',
 'Estimate!!Total:!!Europe:!!Northern Europe:!!Sweden',
 'Estimate!!Total:!!Europe:!!Northern Europe:!!United Kingdom (inc. Crown Dependencies):',
 'Estimate!!Total:!!Europe:!!Northern Europe:!!United Kingdom (inc. Crown Dependencies):!!United Kingdom, excluding England and Scotland',
 'Estimate!!Total:!!Europe:!!Northern Europe:!!United Kingdom (inc. Crown Dependencies):!!England',
 'Estimate!!Total:!!Europe:!!Northern Europe:!!United Kingdom (inc. Crown Dependencies):!!Scotland',
 'Estimate!!Total:!!Europe:!!Northern Europe:!!Other Northern Europe',
 'Estimate!!Total:!!Europe:!!Western Europe:',
 'Estimate!!Total:!!Europe:!!Western Europe:!!Austria',
 'Estimate!!Total:!!Europe:!!Western Europe:!!Belgium',
 'Estimate!!Total:!!Europe:!!Western Europe:!!France',
 'Estimate

## Filter the data

**Let's create a dataframe of just the continent totals**

In [79]:
continentsB05006 = all_counties[filter_data_by_level(level=2)]
continentsB05006

Unnamed: 0,Estimate!!Total:!!Europe:,Estimate!!Total:!!Asia:,Estimate!!Total:!!Africa:,Estimate!!Total:!!Oceania:,Estimate!!Total:!!Americas:
"Fayette County, Illinois: Summary level: 050, state:17> county:051",33.0,87.0,7.0,15.0,135.0
"Logan County, Illinois: Summary level: 050, state:17> county:107",109.0,139.0,4.0,0.0,216.0
"Saline County, Illinois: Summary level: 050, state:17> county:165",70.0,114.0,25.0,0.0,32.0
"Lake County, Illinois: Summary level: 050, state:17> county:097",25548.0,40772.0,2291.0,318.0,62469.0
"Massac County, Illinois: Summary level: 050, state:17> county:127",4.0,19.0,0.0,0.0,123.0
...,...,...,...,...,...
"Crockett County, Tennessee: Summary level: 050, state:47> county:033",14.0,48.0,10.0,0.0,553.0
"Lake County, Tennessee: Summary level: 050, state:47> county:095",10.0,18.0,0.0,0.0,66.0
"Knox County, Tennessee: Summary level: 050, state:47> county:093",3727.0,9001.0,1721.0,171.0,8122.0
"Benton County, Washington: Summary level: 050, state:53> county:005",2592.0,4773.0,250.0,67.0,13451.0


**Let's try to get all the subregions**

In [80]:
sub_regionsB05006 = all_counties[filter_data_by_level(level=3)]
sub_regionsB05006

Unnamed: 0,Estimate!!Total:!!Europe:!!Northern Europe:,Estimate!!Total:!!Europe:!!Western Europe:,Estimate!!Total:!!Europe:!!Southern Europe:,Estimate!!Total:!!Europe:!!Eastern Europe:,"Estimate!!Total:!!Europe:!!Europe, n.e.c.",Estimate!!Total:!!Asia:!!Eastern Asia:,Estimate!!Total:!!Asia:!!South Central Asia:,Estimate!!Total:!!Asia:!!South Eastern Asia:,Estimate!!Total:!!Asia:!!Western Asia:,"Estimate!!Total:!!Asia:!!Asia,n.e.c.",...,Estimate!!Total:!!Africa:!!Middle Africa:,Estimate!!Total:!!Africa:!!Northern Africa:,Estimate!!Total:!!Africa:!!Southern Africa:,Estimate!!Total:!!Africa:!!Western Africa:,"Estimate!!Total:!!Africa:!!Africa, n.e.c.",Estimate!!Total:!!Oceania:!!Australia and New Zealand Subregion:,Estimate!!Total:!!Oceania:!!Fiji,Estimate!!Total:!!Oceania:!!Micronesia,"Estimate!!Total:!!Oceania:!!Oceania, n.e.c.",Estimate!!Total:!!Americas:!!Latin America:
"Fayette County, Illinois: Summary level: 050, state:17> county:051",0.0,33.0,0.0,0.0,0.0,63.0,10.0,14.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,15.0,115.0
"Logan County, Illinois: Summary level: 050, state:17> county:107",13.0,5.0,6.0,85.0,0.0,59.0,72.0,8.0,0.0,0.0,...,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0,216.0
"Saline County, Illinois: Summary level: 050, state:17> county:165",7.0,26.0,15.0,22.0,0.0,12.0,60.0,42.0,0.0,0.0,...,0.0,8.0,0.0,17.0,0.0,0.0,0.0,0.0,0.0,32.0
"Lake County, Illinois: Summary level: 050, state:17> county:097",2484.0,2566.0,2278.0,18141.0,79.0,12525.0,17157.0,9034.0,2031.0,25.0,...,234.0,195.0,544.0,800.0,26.0,285.0,0.0,0.0,33.0,60949.0
"Massac County, Illinois: Summary level: 050, state:17> county:127",1.0,3.0,0.0,0.0,0.0,17.0,0.0,2.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,112.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
"Crockett County, Tennessee: Summary level: 050, state:47> county:033",10.0,4.0,0.0,0.0,0.0,5.0,0.0,43.0,0.0,0.0,...,0.0,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,553.0
"Lake County, Tennessee: Summary level: 050, state:47> county:095",0.0,2.0,0.0,8.0,0.0,0.0,0.0,18.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,66.0
"Knox County, Tennessee: Summary level: 050, state:47> county:093",897.0,692.0,215.0,1923.0,0.0,3189.0,2425.0,1873.0,1501.0,13.0,...,170.0,137.0,82.0,397.0,93.0,144.0,0.0,0.0,27.0,7505.0
"Benton County, Washington: Summary level: 050, state:53> county:005",487.0,439.0,58.0,1608.0,0.0,975.0,852.0,2108.0,838.0,0.0,...,0.0,78.0,31.0,6.0,26.0,67.0,0.0,0.0,0.0,12714.0


**Now let's try to get just the countries**

This will be a little harder because of the various subgroups, but with some extra code we should be able to do it.

In the previous output we see that Micronesia and Fiji sit alongside the subregions. Since these are countries, we need to add them to our countries list.

In [81]:
missing_countries = [
    "Estimate!!Total:!!Oceania:!!Fiji",
    "Estimate!!Total:!!Oceania:!!Micronesia",
]

In [86]:
countries = filter_data_by_level(level=4, include_higher_level=True) + missing_countries

In [87]:
all_counties[countries].shape

(3220, 137)

In [88]:
countries = all_counties[countries]

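The `countries` dataframe now holds the country columns. Some groupings may have snuck through because of how the Census formats its labels (for example, `United Kingdom (inc. Crown Dependencies):` sits at level 4 but is itself a grouping), so treat this as a close first pass rather than a finished list.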
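In the labels shown earlier, rows that act as groupings end with a colon (`Northern Europe:`, `United Kingdom (inc. Crown Dependencies):`), while the most detailed rows do not. Assuming that convention holds across the whole table, which we have not verified here, a filter on the trailing colon is one possible way to drop the remaining groupings. The helper name `leaf_labels` is made up for this sketch, and residual rows such as `Other Northern Europe` would still be kept.

```python
def leaf_labels(labels):
    """Keep labels that do not end with ':' (assumed to have no sub-rows)."""
    return [label for label in labels if not label.rstrip().endswith(":")]


sample = [
    "Estimate!!Total:!!Europe:!!Northern Europe:",
    "Estimate!!Total:!!Europe:!!Northern Europe:!!Ireland",
    "Estimate!!Total:!!Europe:!!Northern Europe:!!United Kingdom (inc. Crown Dependencies):",
    "Estimate!!Total:!!Europe:!!Northern Europe:!!United Kingdom (inc. Crown Dependencies):!!England",
    "Estimate!!Total:!!Europe:!!Northern Europe:!!Other Northern Europe",
    "Estimate!!Total:!!Oceania:!!Fiji",
]

for label in leaf_labels(sample):
    print(label)
```

This also picks up Fiji without listing it by hand, since its label has no trailing colon.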

## Export

We can export the data by calling `to_csv` on the dataframe. We keep the index in the file because it identifies the county for each row.

In [None]:
countries.to_csv("countries_B05006.csv")

This is but one example of extracting and processing data from the Census. It may take some manual correction or adjustment, but once you figure out what you want and how to get it, this notebook can easily be rerun.
# End