# BUDS Report 6: Census Data

In [3]:
# Just run this cell to load the dependencies
from datascience import *
import numpy as np

#Plotting dependencies
import matplotlib.pyplot as plt
plt.style.use("fivethirtyeight")
%matplotlib inline

## Table of Contents
1.  <a href='#section 1'>Background Knowledge</a>

    a. <a href='#subsection 1a'>What is the Census?</a>

    b. <a href='#subsection 1b'> Why is the Census important?</a>
    

2.  <a href='#section 2'>Formulating a Question or Problem</a> 


3.  <a href='#section 3'>Acquiring and Cleaning Data</a>

<br><br>

# 1. Background Knowledge  <a id='section 1'></a>

### What is the Census? <a id='subsection 1a'></a>

The U.S. Census Bureau is a government agency, founded in 1902, that is responsible for collecting and maintaining data about the American people and economy. The 2020 Census will count every person in the United States and the 5 territories. The Constitution mandates that the country count its population every 10 years.

More information can be found at https://2020census.gov/en/what-is-2020-census.html

<div class="alert alert-info">
<b>Question:</b> Who is included in the census?
   </div>

The census tries to include everyone living in the United States in a given census year. However, some groups of people are often missed. For example, people without a permanent residence, people who live off the grid, and people who move during the survey period may not be counted.

<div class="alert alert-info">
<b>Question:</b> Take a few minutes to look up some information about the Census. Provide 3 things you found interesting and the link in which you found them.
   </div>

**Interesting Insights**
1. The census is required by the U.S. Constitution.

2. It determines the number of seats each state has in the House of Representatives.

3. It also influences the amount of funding that is spent on roads, schools, hospitals, and emergency response within an area.

**LINK:** N/A

### Why is the Census important? <a id='subsection 1b'></a>

<div class="alert alert-info">
<b>Question:</b> The Census helps lawmakers and other officials make decisions and allocate resources. What are some of the resources/decisions the Census informs?
   </div>

The census determines the number of seats a state has in the House of Representatives, which influences how much of that state's voice is heard.

# 2. Formulating a Question or Problem  <a id='section 2'></a>

Now that we have done some investigation into the census, it is time to come up with a few questions that we could investigate about the Census (*using data*)! Here are a few prompts to get you started:

> 1. Have there been changes in the population over time?
> 2. Do certain states have different demographics?
> 3. Are there trends based on gender or age that we can look at?

<div class="alert alert-info">
<b>Question:</b> From the preliminary research, think of a few questions you would ask.
   </div>

**Potential Questions**
1. How many people does a representative from California represent? 

2. How has population size changed across the United States in the past 40 years?

3. How do population growth rates differ for men versus women in the last decade?

# 3. Acquiring and Cleaning Data <a id='section 3'></a>

Let's look at the 2010 Census data so that we can start on the **Data Acquisition & Cleaning** phase of the Data Science Life Cycle.

Run the following cell to load the dataset, then work through the process of cleaning it. Cleaning data makes it more usable and easier to read. It can include identifying missing values, changing column names, or changing the data types of the elements of a column so they are easier to work with.
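As a rough illustration of two of the cleaning steps named above (spotting missing values and converting data types), here is a minimal sketch in plain Python rather than the `datascience` library. The inline sample imitates the layout of the census file, and the blank cell in it is introduced deliberately for illustration; the helper names `load_rows` and `rows_with_missing` are hypothetical.

```python
import csv
import io

# Small inline sample in the layout of the census file. The blank cell in
# the second data row is added on purpose to illustrate a missing value.
SAMPLE_CSV = """SEX,AGE,POPESTIMATE2010
0,0,3951330
0,1,
0,2,4090862
"""


def load_rows(text):
    """Parse CSV text into dicts, converting cells to int and blanks to None."""
    rows = []
    for raw in csv.DictReader(io.StringIO(text)):
        rows.append({
            name: int(value) if value and value.strip() else None
            for name, value in raw.items()
        })
    return rows


def rows_with_missing(rows):
    """Return the positions of rows that contain at least one missing value."""
    return [i for i, row in enumerate(rows) if None in row.values()]


rows = load_rows(SAMPLE_CSV)
missing = rows_with_missing(rows)
print(rows[0])
print("rows with missing values:", missing)
```

Converting the text cells to integers up front means later arithmetic on the population columns works without further conversion.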

In [4]:
# As of Jan 2017, this census file is online here: 
data = 'http://www2.census.gov/programs-surveys/popest/datasets/2010-2015/national/asrh/nc-est2015-agesex-res.csv'

full_census_table = Table.read_table(data)
full_census_table

SEX,AGE,CENSUS2010POP,ESTIMATESBASE2010,POPESTIMATE2010,POPESTIMATE2011,POPESTIMATE2012,POPESTIMATE2013,POPESTIMATE2014,POPESTIMATE2015
0,0,3944153,3944160,3951330,3963087,3926540,3931141,3949775,3978038
0,1,3978070,3978090,3957888,3966551,3977939,3942872,3949776,3968564
0,2,4096929,4096939,4090862,3971565,3980095,3992720,3959664,3966583
0,3,4119040,4119051,4111920,4102470,3983157,3992734,4007079,3974061
0,4,4063170,4063186,4077551,4122294,4112849,3994449,4005716,4020035
0,5,4056858,4056872,4064653,4087709,4132242,4123626,4006900,4018158
0,6,4066381,4066412,4073013,4074993,4097605,4142916,4135930,4019207
0,7,4030579,4030594,4043046,4083225,4084913,4108349,4155326,4148360
0,8,4046486,4046497,4025604,4053203,4093177,4095711,4120903,4167887
0,9,4148353,4148369,4125415,4035710,4063152,4104072,4108349,4133564


Only the first 10 rows of the table are displayed. Later we will see how to display the entire table; however, this is typically not useful with large tables.

A [description of the table](http://www2.census.gov/programs-surveys/popest/datasets/2010-2015/national/asrh/nc-est2015-agesex-res.pdf) appears online. The `SEX` column contains numeric codes: `0` stands for the total, `1` for male, and `2` for female. The `AGE` column contains ages in completed years, but the special value `999` marks a row that holds the total population across all ages. The rest of the columns contain estimates of the US population.
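The coding scheme described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the names `SEX_LABELS`, `ALL_AGES`, `describe`, and `single_age_rows` are hypothetical, and the sample rows carry codes only, with no population values.

```python
# Codes as described in the table documentation.
SEX_LABELS = {0: "total", 1: "male", 2: "female"}
ALL_AGES = 999  # special AGE value: the row holds the total over all ages


def describe(sex, age):
    """Turn a (SEX, AGE) code pair into a readable description."""
    sex_label = SEX_LABELS[sex]
    if age == ALL_AGES:
        return f"{sex_label}, all ages combined"
    return f"{sex_label}, age {age}"


def single_age_rows(rows):
    """Keep only rows for a single age, dropping the AGE == 999 total rows."""
    return [row for row in rows if row["AGE"] != ALL_AGES]


# Hypothetical sample of code combinations (no population values attached).
sample = [
    {"SEX": 0, "AGE": 0},
    {"SEX": 0, "AGE": 999},
    {"SEX": 1, "AGE": 0},
]

for row in sample:
    print(describe(row["SEX"], row["AGE"]))

print(len(single_age_rows(sample)))
```

Dropping the `999` rows matters when summing a population column over ages, because those rows already contain the total and would otherwise be counted twice.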

Let's take a subset of the full data set so that it is more manageable for our investigation.

In [5]:
partial_census_table = full_census_table.select('SEX', 'AGE', 'POPESTIMATE2010', 'POPESTIMATE2014')
partial_census_table

SEX,AGE,POPESTIMATE2010,POPESTIMATE2014
0,0,3951330,3949775
0,1,3957888,3949776
0,2,4090862,3959664
0,3,4111920,4007079
0,4,4077551,4005716
0,5,4064653,4006900
0,6,4073013,4135930
0,7,4043046,4155326
0,8,4025604,4120903
0,9,4125415,4108349


<div class="alert alert-info">
<b>Question:</b> What do you notice about the column names? What does the 0 stand for in the 'SEX' column?
   </div>

The column names are long and similar. In the full table, the `CENSUS2010POP` column holds the actual population count from the 2010 census, while the `POPESTIMATE` columns hold estimates of the population for each year. Populations for the years between censuses (2011 to 2019) have to be estimated because a census that counts everyone only happens every 10 years. Remember, even the census does not catch everyone, so it is only our best estimate of the population.

`SEX = 0` represents the sum of the population counts for males and females, that is, the total population of a given age.

Because we are working with Census data, it seems a bit redundant to call each column `POPESTIMATE`. To clean this up, let's change the `POPESTIMATE2010` to `2010` in order to make it easier to read and use in future analysis.

In [6]:
us_pop_relabel = partial_census_table.relabel('POPESTIMATE2010', '2010')
us_pop_relabel

SEX,AGE,2010,POPESTIMATE2014
0,0,3951330,3949775
0,1,3957888,3949776
0,2,4090862,3959664
0,3,4111920,4007079
0,4,4077551,4005716
0,5,4064653,4006900
0,6,4073013,4135930
0,7,4043046,4155326
0,8,4025604,4120903
0,9,4125415,4108349


<div class="alert alert-info">
    <b>Question:</b> Set <code>us_pop</code> to a table where both columns are relabeled. 
     <b>HINT:</b> Use the table method <code>.relabel</code>
   </div>

In [7]:
us_pop = us_pop_relabel.relabel('POPESTIMATE2014', '2014')
us_pop

SEX,AGE,2010,2014
0,0,3951330,3949775
0,1,3957888,3949776
0,2,4090862,3959664
0,3,4111920,4007079
0,4,4077551,4005716
0,5,4064653,4006900
0,6,4073013,4135930
0,7,4043046,4155326
0,8,4025604,4120903
0,9,4125415,4108349
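The two relabeling steps above follow the same pattern: drop the `POPESTIMATE` prefix and keep the year. A minimal sketch of that rule in plain Python, assuming a hypothetical helper `short_label` that is not part of the `datascience` library:

```python
PREFIX = "POPESTIMATE"


def short_label(label, prefix=PREFIX):
    """Strip the prefix from a column label; leave other labels unchanged."""
    return label[len(prefix):] if label.startswith(prefix) else label


# Column labels of the partial table before any relabeling.
labels = ["SEX", "AGE", "POPESTIMATE2010", "POPESTIMATE2014"]
new_labels = [short_label(label) for label in labels]

# Pair each old label with its new one, skipping labels that did not change.
renames = [(old, new) for old, new in zip(labels, new_labels) if old != new]
print(new_labels)
print(renames)
```

Each `(old, new)` pair in `renames` corresponds to one `.relabel(old, new)` call like the ones used in the cells above, so the same rule would extend to the other `POPESTIMATE` columns in the full table.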


We now have a table that is easy to work with. Each column of the table is an array of the same length, and so columns can be combined using arithmetic. We can use the `.column` table method to get each column as an array.

<div class="alert alert-info">
    <b>Question:</b> Calculate the population difference between 2010 and 2014 for each row. Save the result to a variable called <code>pop_diff_array</code>
   </div>

In [8]:
pop_diff_array = us_pop.column("2014") - us_pop.column('2010')
pop_diff_array

array([  -1555,   -8112, -131198, -104841,  -71835,  -57753,   62917,
        112280,   95299,  -17066,  -70120,  -28109,  -42597,   51364,
         88225,  -66206, -144693, -189854, -263085, -242373, -147187,
        104417,  328724,  484928,  451809,  221942,  247237,   97780,
        107699,  181502,  -48905,  280701,  356349,  345083,  542559,
        147447,  186512,   79984, -218592, -341956, -528085, -243668,
         14986,  240006,  211775, -275930, -452565, -452590, -374925,
       -188852, -153824,   -9581,    -396,   96027,  286313,  162886,
        302813,  400505,  388913,  461267,  368877,  314258,  190223,
        -80557,  781504,  704299,  725725,  791534,  212543,  366465,
        402861,  566098,  310125,  251396,  203518,  141878,  211994,
        106931,   78429,   58281,  -57188,    1754,   -6674,    7161,
         47346,    7934,   32174,   47480,   36745,   51635,   62750,
         80872,   64071,   65821,   46553,   25211,   23324,   14196,
         12689,    9
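To see what the array subtraction computes, the first three entries can be reproduced by hand. The sketch below uses plain Python lists and `zip` instead of arrays, with the 2010 and 2014 values for ages 0 to 2 copied from the table shown earlier; the variable names are chosen here for illustration.

```python
# 2010 and 2014 estimates for SEX = 0, ages 0, 1, and 2 (from the table above).
pop_2010 = [3951330, 3957888, 4090862]
pop_2014 = [3949775, 3949776, 3959664]

# Element-wise difference: each 2014 value minus the 2010 value in the same row.
diffs = [later - earlier for earlier, later in zip(pop_2010, pop_2014)]
print(diffs)  # [-1555, -8112, -131198]
```

These match the first three values of `pop_diff_array`. A negative entry means the estimated population of that age was smaller in 2014 than in 2010.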

### Congratulations! You've finished *Report 6: Census Data*