<div style="width: 38.5%;">
    <p><strong>City College of San Francisco</strong></p>
    <hr>
    <p>MATH 108 - Foundations of Data Science</p>
</div>

# Lecture 06: Census

Associated Textbook Sections: [6.3, 6.4](https://inferentialthinking.com/chapters/06/3/Example_Population_Trends.html)

<h2>Set Up the Notebook</h2>

In [1]:
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Lecture-06:-Census" data-toc-modified-id="Lecture-06:-Census-1">Lecture 06: Census</a></span><ul class="toc-item"><li><span><a href="#Set-Up-the-Notebook" data-toc-modified-id="Set-Up-the-Notebook-1.1">Set Up the Notebook</a></span></li><li><span><a href="#Table-Review" data-toc-modified-id="Table-Review-1.3">Table Review</a></span><ul class="toc-item"><li><span><a href="#Some-of-the-Table-Methods" data-toc-modified-id="Some-of-the-Table-Methods-1.3.1">Some of the Table Methods</a></span></li><li><span><a href="#Manipulating-Rows" data-toc-modified-id="Manipulating-Rows-1.3.2">Manipulating Rows</a></span></li><li><span><a href="#Demo:-Exploring-a-Welcome-Survey" data-toc-modified-id="Demo:-Exploring-a-Welcome-Survey-1.3.3">Demo: Exploring a Welcome Survey</a></span></li></ul></li><li><span><a href="#Attribute-Types" data-toc-modified-id="Attribute-Types-1.4">Attribute Types</a></span><ul class="toc-item"><li><span><a href="#Types-of-Attributes" data-toc-modified-id="Types-of-Attributes-1.4.1">Types of Attributes</a></span></li><li><span><a href="#“Numerical”-Attributes" data-toc-modified-id="“Numerical”-Attributes-1.4.2">“Numerical” Attributes</a></span></li></ul></li><li><span><a href="#Census-Data" data-toc-modified-id="Census-Data-1.5">Census Data</a></span><ul class="toc-item"><li><span><a href="#The-Decennial-Census" data-toc-modified-id="The-Decennial-Census-1.5.1">The Decennial Census</a></span></li><li><span><a href="#Census-Table-Description" data-toc-modified-id="Census-Table-Description-1.5.2">Census Table Description</a></span></li><li><span><a href="#Analyzing-Census-Data" data-toc-modified-id="Analyzing-Census-Data-1.5.3">Analyzing Census Data</a></span></li><li><span><a href="#Demo:-Census" data-toc-modified-id="Demo:-Census-1.5.4">Demo: Census</a></span></li></ul></li></ul></li></ul></div>

## Table Review

### Some of the Table Methods

* Creating and extending tables: `Table().with_column` and `Table.read_table`
* Finding the size: `num_rows` and `num_columns`
* Referring to columns: by label or by index --- column indices start at 0
* Accessing data in a column: `column` takes a label or index and returns an array
* Using array methods to work with data in columns: `item`, `sum`, `min`, `max`, and so on
* Creating new tables containing some of the original columns: `select`, `drop`

### Manipulating Rows

* `tbl.sort(column_name_or_index)` --- sorts the rows in increasing order
* `tbl.sort(column_name_or_index, descending=True)` --- sorts the rows in decreasing order
* `tbl.take(row_indices)` --- keeps the rows at the given indices; row indices start at 0
* `tbl.where(column, predicate)` --- keeps all rows for which a column's value satisfies a condition
* `tbl.where(column, value)` --- keeps all rows for which a column's value equals some particular value

Note that `tbl.where(column, value)` is the same as `tbl.where(column, are.equal_to(value))`.

### Demo: Exploring a Welcome Survey

1332 students in UC Berkeley's DATA 8 course were surveyed. Load that data and explore it using the following prompts.

In [2]:
welcome = Table.read_table('welcome_survey_v1.csv')
welcome.show(5)

Extraversion,Number of textees,Hours of sleep,Handedness,Pant leg,Sleep position
4,6,4.0,Both,I don't know,On your back
8,6,7.0,Both,I don't know,On your back
9,6,7.0,Both,I don't know,On your back
2,3,6.75,Left-handed,I don't know,On your back
7,10,7.0,Left-handed,I don't know,On your back


On average, how long do side-sleepers sleep?

In [4]:
np.average(welcome.where("Sleep position", are.containing("side")).column(2))

7.031403940886699

How many students get at least 8 hours of sleep each night (on average)?

In [5]:
# One way using predicates:
welcome.where("Hours of sleep", are.above_or_equal_to(8)).num_rows

417

In [6]:
# A second way:
np.count_nonzero(welcome.column('Hours of sleep') >= 8)

417

In [7]:
# A third way:
np.sum(welcome.column('Hours of sleep') >= 8)

417
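
The second and third ways agree because comparing an array to a number produces an array of `True`/`False` values, and `True` counts as 1 when summed. A minimal sketch with made-up hours (not the survey data):

```python
import numpy as np

# Hypothetical hours of sleep, invented for illustration only.
hours = np.array([4.0, 7.0, 8.0, 6.75, 9.5, 8.0])

# Elementwise comparison gives one boolean per entry.
at_least_8 = hours >= 8

count_a = np.count_nonzero(at_least_8)   # counts the True entries
count_b = np.sum(at_least_8)             # True adds as 1, False as 0

# The mean of a boolean array is the proportion of True entries.
proportion = np.mean(at_least_8)
```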

Create a table with only the two sleep-related columns, with names `'Hours'` and `'Position'`.

In [10]:
# One way:
renamed = welcome.relabeled("Hours of sleep", "Hours").relabeled("Sleep position", "Position")
renamed.select("Hours", "Position")

Hours,Position
4.0,On your back
7.0,On your back
7.0,On your back
6.75,On your back
7.0,On your back
8.0,On your back
10.0,On your back
5.0,On your back
5.0,On your back
5.5,On your back


In [15]:
# A second way:
renamed.drop(0, 1, 3, 4)

Hours,Position
4.0,On your back
7.0,On your back
7.0,On your back
6.75,On your back
7.0,On your back
8.0,On your back
10.0,On your back
5.0,On your back
5.0,On your back
5.5,On your back


## Attribute Types


### Types of Attributes

All values in a column of a table should have the same type and be comparable to each other in some way.
* **Numerical** --- Each value is from a numerical scale
    * Numerical measurements are ordered
    * Differences are meaningful
* **Categorical** --- Each value is from a fixed inventory
    * May or may not have an ordering
    * Categories are the same or different


### “Numerical” Attributes

Just because the values are numbers doesn't mean the variable is numerical:
* Census example has numerical `SEX` code (`0`, `1`, and `2`)
* It doesn't make sense to perform arithmetic on these "numbers", e.g. `1 - 0` or `(0+1+2)/3` are meaningless
* The variable `SEX` is still categorical, even though numbers were used for the categories
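
To make the point concrete, here is a small sketch using made-up `SEX` codes (not rows from the census file). NumPy will happily average the codes, but the result has no interpretation; counting how often each category appears is the kind of summary that makes sense for a categorical attribute.

```python
import numpy as np

# Hypothetical SEX codes for six people; 1 and 2 are labels, not quantities.
sex_codes = np.array([1, 2, 2, 1, 2, 2])

# This computes without error, but the number means nothing.
misleading_mean = np.mean(sex_codes)

# A summary that does make sense: how many rows fall in each category.
categories, counts = np.unique(sex_codes, return_counts=True)
```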

## Census Data

### The Decennial Census

* Every ten years, the Census Bureau counts how many people there are in the U.S.
* In between censuses, the Bureau estimates how many people there are each year.
* Article 1, Section 2 of the Constitution: 
> "Representatives and direct Taxes shall be apportioned among the several States ... according to their respective Numbers ..."


### Census Table Description

* Values have column-dependent interpretations
    * The `SEX` column: `1` is Male, `2` is Female
    * The `POPESTIMATE2010` column: 7/1/2010 estimate
* In this table, some rows are sums of other rows
    * The `SEX` column: `0` is Total (of Male + Female)
    * The `AGE` column: `999` is Total of all ages
* Numeric codes are often used for storage efficiency
    * Values in a column have the same type, but are not necessarily comparable (`AGE 12` vs `AGE 999`)
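
Both points can be checked with a few numbers. The sketch below uses the three `AGE` 0 values of `CENSUS2010POP` that appear in the demo output in this lecture; the short `ages` array is a hand-picked sample of `AGE` codes chosen for illustration.

```python
import numpy as np

# CENSUS2010POP for AGE 0, in the order SEX = 0 (total), 1 (male), 2 (female).
pop_2010_age_0 = np.array([3944153, 2014276, 1929877])

total = pop_2010_age_0.item(0)
male_plus_female = pop_2010_age_0.item(1) + pop_2010_age_0.item(2)
totals_match = (total == male_plus_female)   # the SEX 0 row is a sum of the other two

# A few AGE codes, including the special code 999 ("all ages").
ages = np.array([98, 99, 100, 999])

naive_max = ages.max()                        # picks up the code 999, which is not an age
largest_age_code = ages[ages != 999].max()    # exclude the total code first
```

This is why the demo filters out the total rows before comparing ages.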

### Analyzing Census Data

Analyzing census data leads to the discovery of interesting features and trends in the population.

### Demo: Census

Explore the US Census data from the [Annual Estimates of the Resident Population by Single Year of Age and Sex for the United States](https://www2.census.gov/programs-surveys/popest/technical-documentation/file-layouts/2010-2020/cc-est2020-agesex.pdf). 

(Release date: June 2021, Updated January 2022 to include April 1, 2020 estimates)

In [16]:
full = Table.read_table('https://www2.census.gov/programs-surveys/popest/datasets/2010-2020/national/asrh/nc-est2020-agesex-res.csv')
full

SEX,AGE,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,POPESTIMATE2011,POPESTIMATE2012,POPESTIMATE2013,POPESTIMATE2014,POPESTIMATE2015,POPESTIMATE2016,POPESTIMATE2017,POPESTIMATE2018,POPESTIMATE2019,POPESTIMATE2020
0,0,3944153,3944160,3951495,3963264,3926731,3931411,3954973,3984144,3963268,3882437,3826908,3762227,3735010
0,1,3978070,3978090,3957904,3966768,3978210,3943348,3949559,3973828,4003586,3981864,3897917,3842257,3773884
0,2,4096929,4096939,4090799,3971498,3980139,3993047,3960015,3967672,3992657,4021261,3996742,3911822,3853025
0,3,4119040,4119051,4111869,4102429,3983007,3992839,4007852,3976277,3984985,4009060,4035053,4009037,3921526
0,4,4063170,4063186,4077511,4122252,4112849,3994539,4006407,4022785,3992241,4000394,4021907,4045996,4017847
0,5,4056858,4056872,4064653,4087770,4132349,4123745,4007123,4020489,4038022,4007233,4012789,4032231,4054336
0,6,4066381,4066412,4073031,4075153,4097860,4142923,4135738,4020428,4034969,4052428,4019106,4022432,4040169
0,7,4030579,4030594,4043100,4083399,4085255,4108453,4154947,4148711,4034355,4048430,4063647,4027876,4029753
0,8,4046486,4046497,4025624,4053313,4093553,4096033,4120476,4167765,4162142,4047130,4059209,4071894,4034785
0,9,4148353,4148369,4125413,4035854,4063662,4104437,4107986,4133426,4181069,4175085,4058207,4067320,4078668


Select the `SEX`, `AGE`, `CENSUS2010POP`, and `POPESTIMATE2019` columns.

In [18]:
partial = full.select(0, 1, 2, 13) # could use label names instead of indices
partial.show(4)

SEX,AGE,CENSUS2010POP,POPESTIMATE2019
0,0,3944153,3762227
0,1,3978070,3842257
0,2,4096929,3911822
0,3,4119040,4009037


Relabel the 2010 and 2019 columns.

In [19]:
simple = partial.relabeled(2, '2010').relabeled(3, '2019')
simple.show(4)

SEX,AGE,2010,2019
0,0,3944153,3762227
0,1,3978070,3842257
0,2,4096929,3911822
0,3,4119040,4009037


Sort by `AGE`.

In [20]:
simple.sort("AGE")

SEX,AGE,2010,2019
0,0,3944153,3762227
1,0,2014276,1921001
2,0,1929877,1841226
0,1,3978070,3842257
1,1,2030853,1963261
2,1,1947217,1878996
0,2,4096929,3911822
1,2,2092198,2000102
2,2,2004731,1911720
0,3,4119040,4009037


In [21]:
simple.sort('AGE', descending=True)

SEX,AGE,2010,2019
0,999,308745538,328329953
1,999,151781326,161692336
2,999,156964212,166637617
0,100,53364,100442
1,100,9162,23531
2,100,44202,76911
0,99,32266,56774
1,99,6073,14486
2,99,26193,42288
0,98,45900,85867


Remove the rows where `AGE` is 999 and focus just on the combined data where the `SEX` value is 0. Drop the `SEX` column since there is only one value there.

In [22]:
simple.where("SEX", are.equal_to(0)).where("AGE", are.not_equal_to(999)).drop("SEX")

AGE,2010,2019
0,3944153,3762227
1,3978070,3842257
2,4096929,3911822
3,4119040,4009037
4,4063170,4045996
5,4056858,4032231
6,4066381,4022432
7,4030579,4027876
8,4046486,4071894
9,4148353,4067320


<footer>
    <hr>
    <p>Adapted from UC Berkeley DATA 8 course materials.</p>
    <p>This content is offered under a <a href="https://creativecommons.org/licenses/by-nc-sa/4.0/">CC Attribution Non-Commercial Share Alike</a> license.</p>
</footer>