# INFO 3402 – Week 01: Lecture

[Brian C. Keegan, Ph.D.](http://brianckeegan.com/)  
[Assistant Professor, Department of Information Science](https://www.colorado.edu/cmci/people/information-science/brian-c-keegan)  
University of Colorado Boulder  

Copyright and distributed under an [MIT License](https://opensource.org/licenses/MIT)

## Does this notebook work?

In [None]:
1 + 1

In [None]:
print("{0} has the best haircut of any INFO faculty!".format('Brian Keegan'))

## Can you load libraries?

In [None]:
# Embed visualization outputs in the notebook
%matplotlib inline

# Import matplotlib's pyplot
import matplotlib.pyplot as plt

# Import numpy
import numpy as np

# Import pandas
import pandas as pd

# Customize pandas to display more columns than default
pd.options.display.max_columns = 500

## Can you load data?

In [None]:
# From https://data.census.gov/cedsci/table?q=DP05&g=0100000US%240400000&tid=ACSDP1Y2019.DP05&moe=false&tp=true
census_df = pd.read_csv('ACS2019_DP05_state.csv')

## What does this data look like?

In [None]:
census_df.head()

From the Census documentation:

<ol><li>An "**" entry in the margin of error column indicates that either no sample observations or too few sample observations were available to compute a standard error and thus the margin of error. A statistical test is not appropriate.</li><li>An "-" entry in the estimate column indicates that either no sample observations or too few sample observations were available to compute an estimate, or a ratio of medians cannot be calculated because one or both of the median estimates falls in the lowest interval or upper interval of an open-ended distribution, or the margin of error associated with a median was larger than the median itself.</li><li>An "-" following a median estimate means the median falls in the lowest interval of an open-ended distribution.</li><li>An "+" following a median estimate means the median falls in the upper interval of an open-ended distribution.</li><li>An "***" entry in the margin of error column indicates that the median falls in the lowest interval or upper interval of an open-ended distribution. A statistical test is not appropriate.</li><li>An "*****" entry in the margin of error column indicates that the estimate is controlled. A statistical test for sampling variability is not appropriate. </li><li>An "N" entry in the estimate and margin of error columns indicates that data for this geographic area cannot be displayed because the number of sample cases is too small.</li><li>An "(X)" means that the estimate is not applicable or not available.</li></ol>

## Exercises

The pandas `read_csv` function has many options for controlling how a file is parsed. Check out the [User Guide for CSV files](https://pandas.pydata.org/docs/user_guide/io.html#csv-text-files) or the reference for the [`read_csv` function](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html).

### Exercise 1: Fix the problem with column names being the 0th row of data

We don't need the column names like "DP05_001E" because they are hard to understand. The names in the second row, like "Estimate!!SEX AND AGE!!Total population", are more useful, but they were read in as a row of data.

Use the `read_csv` function to make the helpful names on the second row the column names for the DataFrame and assign to `dp05_df`.
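
One possible approach, sketched on a tiny made-up file rather than the real `ACS2019_DP05_state.csv`: the `header` parameter of `read_csv` takes the 0-indexed row number to use for column names. The state names, the values, and the `sample_csv` variable below are hypothetical and only mimic the two-row layout described above.

```python
import io

import pandas as pd

# Hypothetical miniature of the ACS file (made-up states and values):
# line 0 holds the codes, line 1 holds the readable labels.
sample_csv = io.StringIO(
    "GEO_ID,NAME,DP05_0001E\n"
    "id,Geographic Area Name,Estimate!!SEX AND AGE!!Total population\n"
    "0001,State A,1000\n"
    "0002,State B,4000\n"
)

# header=1 uses the second line as the column names;
# the line above it is skipped instead of being read as data.
dp05_df = pd.read_csv(sample_csv, header=1)

print(dp05_df.columns.tolist())
print(dp05_df.dtypes)
```

Because the label row is no longer treated as data, the population column can be parsed as numbers rather than strings.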

### Exercise 2: Fix the multiple kinds of NaN values

There are multiple kinds of missing data (see the Census documentation above). It's often helpful to know *why* the data is missing, but for our purposes we only care if the data is present or absent.

Use the `read_csv` function to convert these different types of missing data values into a consistent `NaN` and assign to `dp05_df`.
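
A minimal sketch of one way to do this, assuming the markers listed in the Census documentation above appear as whole cell values: pass them to the `na_values` parameter of `read_csv`. The sample data, its simplified column names, and the `census_na_values` list are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical sample with simplified column names and made-up values.
sample_csv = io.StringIO(
    "id,Estimate,Margin of Error,Median age\n"
    "A,1000,*****,38.5\n"
    "B,N,N,-\n"
    "C,2500,(X),**\n"
)

# Markers taken from the Census documentation quoted above.
census_na_values = ["**", "-", "***", "*****", "N", "(X)"]

# Any cell that exactly matches one of these strings becomes NaN.
dp05_df = pd.read_csv(sample_csv, na_values=census_na_values)

print(dp05_df)
print(dp05_df.isna().sum())
```

Note that `na_values` only matches entire cells, so a "-" or "+" appended to a median estimate would not be caught by this approach and would need separate cleaning.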

### Exercise 3: Only include column names containing "Estimate"

The American Community Survey (ACS) is a sample of the population rather than a full census, so the estimates it reports contain some sampling error. Each column is actually part of a triplet: estimate, margin of error, and percent. We only care about the columns with "Estimate" in their names for this exercise.

Use the `read_csv` function to filter the columns down to only those that contain the string "Estimate" and assign to `dp05_df`.
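
One way to sketch this: the `usecols` parameter of `read_csv` accepts a function that is called with each column name and keeps the columns for which it returns `True`. The sample file below is hypothetical (made-up states and values) and assumes the same two-row header layout as in Exercise 1.

```python
import io

import pandas as pd

# Hypothetical sample: line 0 holds codes, line 1 holds readable labels.
sample_csv = io.StringIO(
    "GEO_ID,NAME,DP05_0001E,DP05_0001M,DP05_0001PE\n"
    "id,Geographic Area Name,Estimate!!Total population,"
    "Margin of Error!!Total population,Percent!!Total population\n"
    "0001,State A,1000,50,1000\n"
    "0002,State B,4000,75,4000\n"
)

# The function passed to usecols is evaluated against the names from the header row.
dp05_df = pd.read_csv(
    sample_csv,
    header=1,
    usecols=lambda name: "Estimate" in name,
)

print(dp05_df.columns.tolist())
```

This filter also drops the geographic name column. If the state names are needed later, the condition could be widened, for example with `or name == "Geographic Area Name"`, assuming that is the label used in the real file.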

### Exercise 4: Sort by Asian population
Sort the `dp05_df` DataFrame by the "Estimate!!RACE!!Total population!!One race!!Asian" column. Which states have the largest Asian populations?
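
A minimal sketch using `sort_values`, with a small hand-built DataFrame standing in for the real `dp05_df`. The state names, the values, and the "Geographic Area Name" column are hypothetical.

```python
import pandas as pd

asian_col = "Estimate!!RACE!!Total population!!One race!!Asian"

# Hypothetical stand-in for dp05_df with made-up values.
dp05_df = pd.DataFrame({
    "Geographic Area Name": ["State A", "State B", "State C"],
    asian_col: [500, 12000, 3000],
})

# ascending=False puts the largest values first.
sorted_df = dp05_df.sort_values(asian_col, ascending=False)

print(sorted_df)
```

If the column was read in as strings (for example because of the missing-data markers from Exercise 2), it may need to be converted first, for instance with `pd.to_numeric(dp05_df[asian_col], errors="coerce")`.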

### Exercise 5: Compute and sort on a new column
States with a large Asian population may simply have large populations overall. We lost the percentage columns in Exercise 3, so let's compute the percentage of the population that is Asian. Divide the column used in Exercise 4 by the total population estimate and store the result as "Asian percentage". Sort on "Asian percentage" and show the top 5 and bottom 5 states by Asian percentage.
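
A sketch of the computation on made-up data, assuming the total population estimate lives in a column named "Estimate!!SEX AND AGE!!Total population". The state names, values, and the choice to multiply by 100 are illustrative assumptions.

```python
import pandas as pd

asian_col = "Estimate!!RACE!!Total population!!One race!!Asian"
total_col = "Estimate!!SEX AND AGE!!Total population"  # assumed column name

# Hypothetical stand-in for dp05_df with made-up values.
dp05_df = pd.DataFrame({
    "Geographic Area Name": ["State A", "State B", "State C", "State D"],
    asian_col: [50, 1200, 30, 400],
    total_col: [1000, 4000, 3000, 2000],
})

# Element-wise division; multiplying by 100 expresses the ratio as a percentage.
dp05_df["Asian percentage"] = 100 * dp05_df[asian_col] / dp05_df[total_col]

ranked_df = dp05_df.sort_values("Asian percentage", ascending=False)

# n is 2 here only because the sample is tiny; use 5 on the real data.
n = 2
top_df = ranked_df.head(n)
bottom_df = ranked_df.tail(n)

print(top_df[["Geographic Area Name", "Asian percentage"]])
print(bottom_df[["Geographic Area Name", "Asian percentage"]])
```

Sorting once and then taking `head` and `tail` avoids sorting twice in opposite directions.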