<div style="width: 38.5%;">
    <p><strong>City College of San Francisco</strong></p>
    <hr>
    <p>MATH 108 - Foundations of Data Science</p>
</div>

# Lecture 06: Census

Associated Textbook Sections: [6.3, 6.4](https://inferentialthinking.com/chapters/06/3/Example_Population_Trends.html)

## Overview

* [Table Review](#Table-Review)
* [Exploring the Welcome Survey](#Exploring-the-Welcome-Survey)
* [Attribute Types](#Attribute-Types)
* [Census Data](#Census-Data)

## Set Up the Notebook

In [None]:
from datascience import *
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plots
plots.style.use('fivethirtyeight')

---

## Table Review

### Some of the Table Methods

* Creating and extending tables: `Table().with_column` and `Table.read_table`
* Finding the size: `num_rows` and `num_columns`
* Referring to columns: by label or by index (column indices start at 0)
* Accessing data in a column: `column` takes a label or index and returns an array
* Using array methods to work with data in columns: `item`, `sum`, `min`, `max`, and so on
* Creating new tables containing some of the original columns: `select`, `drop`

### Manipulating Rows

* `tbl.sort(column_name_or_index)` --- sorts the rows in increasing order
* `tbl.sort(column_name_or_index, descending=True)` --- sorts the rows in decreasing order
* `tbl.take(row_indices)` --- keeps the rows at the given indices; row indices start at 0
* `tbl.where(column, predicate)` --- keeps all rows for which a column's value satisfies a condition
* `tbl.where(column, value)` --- keeps all rows for which a column's value equals some particular value

Note that `tbl.where(column, value)` is the same as `tbl.where(column, are.equal_to(value))`.

### Demo: Exploring the Welcome Survey

Students in all the MATH 108 sections were surveyed. Load the survey results from `welcome_survey_sp23.csv` and explore them.

The survey questions/responses have been simplified.
    
Additional cleaning of the data might be needed.

In [None]:
welcome = Table.read_table('./data/welcome_survey_sp23.csv')
welcome.show(5)

How many students are seeking a Data Analytics Fundamentals certificate?

In [None]:
...

What is the breakdown of ages?

In [None]:
...

---

## Attribute Types


### Types of Attributes

All values in a column of a table should be of the same type and comparable to each other in some way:
* **Numerical** --- Each value is from a numerical scale
    * Numerical measurements are ordered
    * Differences are meaningful
* **Categorical** --- Each value is from a fixed inventory
    * May or may not have an ordering
    * Categories are the same or different


### “Numerical” Attributes

* Sometimes numbers represent categorical data.
    * The census example has a numerical `SEX` code (`0`, `1`, and `2`)
    * It doesn't make sense to perform arithmetic on these "numbers"; for example, `1 - 0` and `(0+1+2)/3` are meaningless
    * The variable `SEX` is still categorical, even though numbers are used for the categories

---

## Census Data

### The Decennial Census

* Every ten years, the Census Bureau counts how many people there are in the U.S.
* In between censuses, the Bureau estimates how many people there are each year.
* Article I, Section 2 of the Constitution:
> "Representatives and direct Taxes shall be apportioned among the several States ... according to their respective Numbers ..."


### Census Table Description

* Values have column-dependent interpretations
    * The `SEX` column: `1` is Male, `2` is Female
    * The `POPESTIMATE2010` column: population estimate for July 1, 2010
* In this table, some rows are sums of other rows
    * The `SEX` column: `0` is Total (of Male + Female)
    * The `AGE` column: `999` is Total of all ages
* Numeric codes are often used for storage efficiency
    * Values in a column have the same type, but are not necessarily comparable (`AGE 12` vs `AGE 999`)

### Analyzing Census Data

Analyzing census data can reveal interesting features and trends in the population.

### Demo: Census

Explore the US Census data from the [Annual Estimates of the Resident Population by Single Year of Age and Sex for the United States](https://www2.census.gov/programs-surveys/popest/technical-documentation/file-layouts/2010-2020/cc-est2020-agesex.pdf). 

(Release date: June 2021, Updated January 2022 to include April 1, 2020 estimates)

In [None]:
url = 'https://www2.census.gov/programs-surveys/popest/datasets/2010-2020/national/asrh/nc-est2020-agesex-res.csv'
full = Table.read_table(url)
full

Select the `SEX`, `AGE`, `CENSUS2010POP`, and `POPESTIMATE2019` columns.

In [None]:
partial = full.select('SEX', 'AGE', 'CENSUS2010POP', 'POPESTIMATE2019')
partial.show(4)

Relabel the 2010 and 2019 columns.

In [None]:
simple = ...
simple.show(4)

Sort by `AGE`.

In [None]:
...

In [None]:
simple.sort('AGE', ...)

Remove the rows where `AGE` is `999` and keep only the combined data, where the `SEX` value is `0`. Then drop the `SEX` column, since it now contains only one value.

In [None]:
no_999 = ...
everyone = ...
everyone

---

<footer>
    <p>Adapted from UC Berkeley DATA 8 course materials.</p>
    <p>This content is offered under a <a href="https://creativecommons.org/licenses/by-nc-sa/4.0/">CC Attribution Non-Commercial Share Alike</a> license.</p>
</footer>