# 2020 University of Minnesota Day of Data Python Notebook
# Exploration of American Community Survey data extracted from IPUMS-USA (https://ipums.org), a product of the U of M's Institute for Social Research and Data Innovation (ISRDI)

----------------------------------------------

## Python libraries, whether from the standard library or from third-party sources, need to be imported before use. These are the libraries we'll use in this notebook:
*  Pandas is a Python library for reading and manipulating tabular data. Think "programmatic spreadsheets"
*  NumPy is a number-processing library that pandas works closely with
*  BeautifulSoup is a library that can parse misc. markup languages, including XML
*  Altair is one of Python's many data viz libraries

In [None]:
# pandas is a Python library for reading and manipulating tabular data. Think "programmatic spreadsheets"
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup as BS
import altair as alt

## Let's start by reading the data into a Pandas dataframe.
### The data file is a gzipped CSV, which Pandas can read into a dataframe via its built-in read_csv() function

In [None]:
data = pd.read_csv("../data/usa_00071.csv.gz")

# the variable HHINCOME uses 9999999 (all 9s) to mark no response, so let's change those to np.nan (a missing value, i.e. "blank")
data["HHINCOME"] = data["HHINCOME"].replace(9999999, np.nan)

data
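
To see what the sentinel replacement does, here is a minimal sketch on a made-up three-row dataframe. The income values are invented for illustration and do not come from the IPUMS extract.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; 9999999 stands in for "no response"
toy = pd.DataFrame({"HHINCOME": [52000, 9999999, 87000]})
toy["HHINCOME"] = toy["HHINCOME"].replace(9999999, np.nan)

# Missing values are skipped by default in summary statistics,
# so the sentinel no longer inflates the mean
mean_income = toy["HHINCOME"].mean()
print(toy)
print(mean_income)
```

Had the 9999999 been left in place, it would have been averaged in as if it were a real income.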

## In addition to the data, we have metadata that describes the data. This includes an XML file that maps the variables' numeric codes (how survey answers are represented in the data) to understandable labels.
### These two helper methods get label information out of a provided XML file and into a codebook dict that translates data codes to labels

In [None]:
# this method takes in information on a given variable and returns a code-to-label dictionary for that variable
def parse_var_xml(var):
    var_values = {}
    for cat in var.find_all("catgry"):
        var_values[int(cat.catvalu.text)] = cat.labl.text
        
    return var_values

# Use Beautiful Soup to parse XML and send blocks of variable info to the parse_var_xml() method
# This method returns a codebook, which is a Python dict of dicts. Each top-level key is a variable, and its value is a dict of code-to-label translations for that variable
def ipums_xml_to_var_dicts(xml_file):
    with open(xml_file, "r") as file:
        # read the whole file into one string and hand it to Beautiful Soup
        bs_content = BS(file.read(), "lxml")
    variables = bs_content.find_all("var")
    codebook = {}
    for var in variables:
        codebook[var.get("name")] = parse_var_xml(var)
    
    return codebook
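
The shape of the resulting codebook is easier to see on a tiny example. The sketch below is an assumption-laden stand-in: it uses Python's built-in xml.etree.ElementTree instead of Beautiful Soup so that it runs without extra installs, and the XML fragment, codes, and labels are invented. Only the tag names (var, catgry, catvalu, labl) mirror the ones the helpers above look for.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified codebook fragment (not the real IPUMS file)
xml_text = """
<codebook>
  <var name="TRANWORK">
    <catgry><catvalu>1</catvalu><labl>Bicycle</labl></catgry>
    <catgry><catvalu>2</catvalu><labl>Worked at home</labl></catgry>
  </var>
</codebook>
"""

root = ET.fromstring(xml_text)
toy_codebook = {}
for var in root.iter("var"):
    # one inner dict per variable: integer code -> text label
    toy_codebook[var.get("name")] = {
        int(cat.findtext("catvalu")): cat.findtext("labl")
        for cat in var.iter("catgry")
    }
print(toy_codebook)
```

The outer key is the variable name and the inner dict maps each numeric code to its label, which is the same dict-of-dicts layout the helpers return.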

## Now that the methods are defined, populating a codebook is simple: send the XML file to ipums_xml_to_var_dicts()

In [None]:
# create a dictionary of variable codes-to-labels for each variable
var_val_labels = ipums_xml_to_var_dicts("../syntax/usa_00071.xml")

## Let's have a look at one variable dictionary, TRANWORK (mode of transportation to get to work)

In [None]:
var_val_labels['TRANWORK']

## Using the var_val_labels dictionary, add a label column for every variable, named \<VARIABLE\>_lbl

In [None]:
for var in var_val_labels.keys():
    data[f"{var}_lbl"] = data[var].map(var_val_labels[var])
data
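
Here is a small sketch of what map() does with a code-to-label dict. The codes and the toy dataframe are invented for illustration.

```python
import pandas as pd

# Hypothetical codes and labels, invented for illustration
labels = {1: "Bicycle", 2: "Worked at home"}
toy = pd.DataFrame({"TRANWORK": [1, 2, 3]})
toy["TRANWORK_lbl"] = toy["TRANWORK"].map(labels)

# Code 3 has no entry in the dict, so its label comes back as NaN
print(toy)
```

Any code that is missing from the dict ends up with a blank label, so a quick check for NaN in the _lbl columns can reveal codes the codebook does not cover.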

# We can actually start doing simple visualizations [right from pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html)

## Say we wanted to plot a trend line of "Prime Age Workers" (age 25-54) in our data.
* First we would subset down to just prime age workers,
* Then count by year and plot

In [None]:
# We want to leave the data in Year-sort-order, but value_counts() tries to sort
# by, well, value counts. So we turn that sorting off.
data[data["AGE"].between(25,54)]["YEAR"].value_counts(sort=False).plot()

While it looks like there was a massive spike in prime working age folks, pay attention to the Y-axis: its bottom value is not zero. In fact, let's take care of that right now.

In [None]:
data[data["AGE"].between(25,54)]["YEAR"].value_counts(sort=False).plot(ylim=(0,125000))

That's more like it. But also very boring... 

If we want to make some more engaging visualizations we have to manipulate the data a bit.
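
One note on the trend plots above before moving on: if you would rather not rely on the order that value_counts(sort=False) happens to return, sorting by the index makes the YEAR order explicit. A minimal sketch on invented data, with the years deliberately out of order:

```python
import pandas as pd

# Hypothetical toy data with years deliberately out of order
toy = pd.DataFrame({"YEAR": [2019, 2017, 2018, 2017, 2019, 2019],
                    "AGE":  [30,   40,   25,   54,   60,   33]})
prime = toy[toy["AGE"].between(25, 54)]

# sort_index() orders the counts by YEAR regardless of row order
counts = prime["YEAR"].value_counts().sort_index()
print(counts)
```

Calling .plot() on counts would then draw the years left to right in ascending order.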

## At this point we have a dataframe in which each row represents a single person

### We want to look at modes of transportation "prime age" workers use. So, first we subset our data down to those with an EMPSTAT of 1 (working) and those between 25 and 54 inclusive (prime working age)

In [None]:
# Each condition needs its own parentheses: & binds more tightly than ==
workers = data[(data["EMPSTAT"] == 1) & data["AGE"].between(25, 54)]
workers
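
When combining conditions with &, each comparison needs its own parentheses, because Python evaluates & before ==. The sketch below uses invented rows (including a made-up EMPSTAT code of 0) to show how the two forms can differ.

```python
import pandas as pd

# Hypothetical toy data; the EMPSTAT codes here are for illustration only
toy = pd.DataFrame({"EMPSTAT": [1, 1, 0, 2],
                    "AGE":     [30, 60, 10, 40]})

# Parenthesize each condition before combining with &
mask = (toy["EMPSTAT"] == 1) & toy["AGE"].between(25, 54)
toy_workers = toy[mask]

# Without parentheses this is EMPSTAT == (1 & age_condition),
# so a row with EMPSTAT 0 and a failing age test compares 0 == False
unparenthesized = toy["EMPSTAT"] == 1 & toy["AGE"].between(25, 54)

print(toy_workers)
print(toy[unparenthesized])
```

In this toy data the unparenthesized version lets the 10-year-old with EMPSTAT 0 through, while the parenthesized mask keeps only the 30-year-old worker.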

### To explore the data visually, create a dataframe that represents aggregate summary data
### Specifically, a dataframe in which each row represents a Year and City and the columns contain labels and counts for various variables
### To take the raw data and obtain counts of each City (CITY_lbl) by Year (YEAR) by type of transportation (TRANWORK_lbl), use the Pandas crosstab() method

In [None]:
df = pd.crosstab(index=[workers["YEAR"],
                        workers["CITY_lbl"],
                        workers["TRANWORK_lbl"]],
                 columns="count")
df

## This crosstab dataframe uses YEAR, CITY_lbl, and TRANWORK_lbl as its index levels. We want them as columns, which is done with the dataframe's reset_index() method

In [None]:
df = df.reset_index()
df
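
The same two steps on a four-row toy dataframe show what the long-format counts look like. The city names "A" and "B" and the rows themselves are made up.

```python
import pandas as pd

# Hypothetical person-level rows
toy = pd.DataFrame({
    "YEAR": [2018, 2018, 2018, 2019],
    "CITY_lbl": ["A", "A", "B", "A"],
    "TRANWORK_lbl": ["Bicycle", "Bicycle", "Worked at home", "Bicycle"],
})

# Count rows for each observed YEAR / city / transportation combination
toy_counts = pd.crosstab(index=[toy["YEAR"], toy["CITY_lbl"], toy["TRANWORK_lbl"]],
                         columns="count")

# Move the three index levels back into ordinary columns
toy_counts = toy_counts.reset_index()
print(toy_counts)
```

Each row of toy_counts is one year/city/mode combination with its count, which is the shape the pivot step expects.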

## From here we need to go one step further. The data are in the format where each line shows a count for a particular type of transportation for a given year and city. We want to end up with the counts of the various transportation modes as columns, with one row per city/year
## This can be accomplished with a pivot table

In [None]:
table = df.pivot_table(index=["YEAR", "CITY_lbl"],
                       columns="TRANWORK_lbl",
                       values="count",
                       aggfunc='sum',
                       margins=True,
                       fill_value=0)
table

## A couple of pieces of cleanup:
* First we want the indexes as columns, so as before we reset_index()
* Second, we do not need the "All" row that represents the counts of all rows

In [None]:
table = table.reset_index()
# Drop the All row (YEAR=="All"); .copy() gives an independent dataframe so new columns can be added safely later
table = table[table["YEAR"]!="All"].copy()
table
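
To make the pivot and cleanup concrete, here is a sketch that starts from a tiny long-format table. The counts and city names are invented for illustration.

```python
import pandas as pd

# Hypothetical long-format counts: one row per year/city/mode
toy_counts = pd.DataFrame({
    "YEAR": [2018, 2018, 2019],
    "CITY_lbl": ["A", "B", "A"],
    "TRANWORK_lbl": ["Bicycle", "Worked at home", "Bicycle"],
    "count": [2, 1, 1],
})

# One row per year/city, one column per mode, plus an "All" total
toy_table = toy_counts.pivot_table(index=["YEAR", "CITY_lbl"],
                                   columns="TRANWORK_lbl",
                                   values="count",
                                   aggfunc="sum",
                                   margins=True,
                                   fill_value=0)

# Index levels back to columns, then drop the grand-total row
toy_table = toy_table.reset_index()
toy_table = toy_table[toy_table["YEAR"] != "All"].copy()
print(toy_table)
```

The "All" column (row totals) is kept because it is the denominator for the ratios computed next; only the "All" row is dropped.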

## To compare across cities, raw counts are not sufficient (as each city has a different number of survey respondents)
### Create new columns that represent the RATIO of a given transportation mode to the total count of responses

In [None]:
for col in df["TRANWORK_lbl"].unique():
    table[f"% {col}"] = (table[col] / table["All"])
table
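
A quick worked example of the ratio columns, using invented counts so the arithmetic is easy to check by hand:

```python
import pandas as pd

# Hypothetical wide-format counts with an "All" total per row
toy_table = pd.DataFrame({
    "YEAR": [2018, 2018],
    "CITY_lbl": ["A", "B"],
    "Bicycle": [1, 3],
    "Worked at home": [3, 1],
    "All": [4, 4],
})

# Divide each mode's count by the row total
for col in ["Bicycle", "Worked at home"]:
    toy_table[f"% {col}"] = toy_table[col] / toy_table["All"]
print(toy_table)
```

Note that the new columns hold ratios between 0 and 1 (for example 0.25), not values already multiplied by 100.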

# Now let's graph some data!

## While there are a ton of graphing libraries for Python (check out [Pyviz](https://pyviz.org/tools.html) for a fairly comprehensive list of viable options), we are going to be using [Altair](https://altair-viz.github.io/user_guide/customization.html).

## First, we can try to compare percent of working population working from home

In [None]:
line = alt.Chart(table).mark_line(interpolate="natural").encode(
    alt.X('YEAR:O'),
    alt.Y('% Worked at home:Q'),
    alt.Color('CITY_lbl'),
).properties(width=500, height=500)
line

## Well. That's a mess.
## How about we scale this back a bit...Just % Worked at home for the year 2019, and display it as a vertical bar chart

In [None]:
# Start with just one year
one_year = table[table["YEAR"]==2019]
bar = alt.Chart(one_year, title="2019").mark_bar().encode(
        alt.X('% Worked at home'),
        alt.Y('CITY_lbl:N'),
    )
bar

## Better! But, hard to compare without these ranked by value...
## Use sort="-x" on the Y axis to sort in descending order

In [None]:
# Sort based on X value
one_year = table[table["YEAR"]==2019]
bar = alt.Chart(one_year, title="2019").mark_bar().encode(
        alt.X('% Worked at home'),
        alt.Y('CITY_lbl:N', sort="-x"),
    )
bar
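
If you want to double-check the ranking the sorted chart shows, the same descending order can be computed in pandas. This is a sketch on invented ratios, not a replacement for the chart's own sort.

```python
import pandas as pd

# Hypothetical ratios for three made-up cities
toy_table = pd.DataFrame({
    "YEAR": [2019, 2019, 2019, 2018],
    "CITY_lbl": ["A", "B", "C", "A"],
    "% Worked at home": [0.05, 0.09, 0.07, 0.04],
})

one_year = toy_table[toy_table["YEAR"] == 2019]

# Largest ratio first, matching a descending sort on the X value
ranked = one_year.sort_values("% Worked at home", ascending=False)
print(ranked[["CITY_lbl", "% Worked at home"]])
```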

## Cool. Cool Cool Cool. Now that we've got 2019 displaying nicely, let's show every year with the .hconcat() multiple chart feature

In [None]:
# Multiple years
charts = alt.hconcat()
for y in table["YEAR"].unique():
    one_year = table[table["YEAR"]==y]
    bar = alt.Chart(one_year, title=str(y)).mark_bar().encode(
        alt.X('% Worked at home'),
        alt.Y('CITY_lbl:N', sort="-x"),
    ).properties(width=150)
    charts |= bar
    
charts

## Lovely! Now that we're to this point, let's do some tidying up.
* Add a title for the set of charts
* The Y-axis label CITY_lbl is unneeded
* Whoops, % Worked at home is displaying ratios, not percentages. Fix that formatting
* Finally, make the X-axis range consistent across charts by finding the max x value across all years and basing the scale on it

In [None]:
# Consistent X axis range
charts = alt.hconcat(title="City rankings for % Working at Home")
max_pct = table["% Worked at home"].max()
for y in table["YEAR"].unique():
    bar = alt.Chart(table[table["YEAR"]==y], title=str(y)).mark_bar().encode(
        alt.X(
            '% Worked at home',
            axis=alt.Axis(format="%"),
            scale=alt.Scale(domain=(0, max_pct)),
        ),
        alt.Y('CITY_lbl:N', sort="-x", title=None),
    ).properties(width=150)
    charts |= bar
    
charts
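
As a rough analogy for the two fixes above, plain Python can do the same things to a handful of numbers: render a ratio as a percentage, and find one maximum to share across years. The ratios are invented, and Python's formatter is separate from the one the chart uses, so treat this only as an illustration of the idea.

```python
# Hypothetical ratios for two years
ratios = {2018: 0.043, 2019: 0.054}

# Percent formatting multiplies by 100 and appends a % sign
labels = {year: f"{ratio:.1%}" for year, ratio in ratios.items()}

# One shared maximum keeps every chart on the same axis range
shared_max = max(ratios.values())

print(labels)
print(shared_max)
```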

## Excellent!
## Now that we're nicely cleaned up, let's highlight a city of interest using alt.condition()
## Let's take a look at the great city of Minneapolis by making its bar orange across each chart

In [None]:
# conditional coloring of one bar
charts = alt.hconcat(title="City rankings for % Working at Home (age 25-54)")
max_pct = table["% Worked at home"].max()
for y in table["YEAR"].unique():
    bar = alt.Chart(table[table["YEAR"]==y], title=str(y)).mark_bar().encode(
        alt.X(
            '% Worked at home',
            axis=alt.Axis(format="%"),
            scale=alt.Scale(domain=(0, max_pct)),
        ),
        alt.Y('CITY_lbl:N', sort="-x", title=None),
        color=alt.condition(
            alt.datum.CITY_lbl == "Minneapolis, MN",
            alt.value('orange'),
            alt.value('steelblue'),
        )
    ).properties(width=150)
    charts |= bar
    
charts


## Using the same code but pointing to % Bicycle, we can produce the same type of graphs for different variables

In [None]:
# conditional coloring of one bar
charts = alt.hconcat(title="City rankings for % Biking to work (age 25-54)")
max_pct = table["% Bicycle"].max()
for y in table["YEAR"].unique():
    bar = alt.Chart(table[table["YEAR"]==y], title=str(y)).mark_bar().encode(
        alt.X(
            '% Bicycle',
            axis=alt.Axis(format="%"),
            scale=alt.Scale(domain=(0, max_pct)),
        ),
        alt.Y('CITY_lbl:N', sort="-x", title=None),
        color=alt.condition(
            alt.datum.CITY_lbl == "Minneapolis, MN",
            alt.value('orange'),
            alt.value('steelblue'),
        )
    ).properties(width=150)
    charts |= bar
    
charts

---------------------------------------------------------