# Information Visualization I 
## School of Information, University of Michigan

## Pandas Review
This tutorial serves as a brief review of Pandas. There are a few specific transformations we will use in the infovis class. Working through this material (and looking back at past courses) should help. Each block of code contains comments that explain the Pandas operations being used. In a few places we ask you to try writing some code; a button will pop up with the answers. We have noticed that the button doesn't work with some security configurations. If that's the case for you, we've provided a file with all the answers [here](assets/pandas/pandas_tutorial_answers.txt).

For this tutorial, we downloaded some health data for you related to e-coli infections from: https://bchi.bigcitieshealth.org/indicators/1859/searches/34539. It's a nice dataset to play with to stretch our Pandas muscles.

In [1]:
#Import
import pandas as pd
import numpy as np
exec(open("tutorial_helper.py").read())

In [2]:
# Replace the following code to load the file 'assets/pandas/tutorial_data_ecoli_wide.csv'
# Call the dataframe df

# make the sample ID the index
answerButton("assets/pandas/e66327e8c15548b0bcdcdadefd3f4e30","show me...")


In [None]:
# fix this
df = pd.DataFrame()
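As a rough illustration of loading a CSV and choosing an index column, here is a minimal sketch on made-up data. The CSV text and its column names (`Sample ID`, `Place`, `Year`, `Value`) are assumptions for the example, not the contents of the tutorial file.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV file on disk; the column names are made up.
csv_text = """Sample ID,Place,Year,Value
101,"Dallas, TX",2012,1.5
102,"Boston, MA",2012,0.7
103,"Dallas, TX",2013,2.1
"""

# index_col tells read_csv which column to use as the row index
toy = pd.read_csv(io.StringIO(csv_text), index_col="Sample ID")
print(toy)
```

With a real file you would pass the file path instead of the `io.StringIO` object.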

## Let's take a look at what's inside. There are a few methods to help...

In [None]:
df.head(5) # let's look at the first five lines of the data frame

# you'll notice a set of different fields and values... some will be more useful than others

In [None]:
df.shape # we can find out how many columns and rows we have
# first number is rows, second is columns

In [None]:
df.columns # let's also just get a list of the column names

### We can delete rows or columns we don't want

`df.drop('dropme',axis=1)` is used to drop the column named dropme. The code `axis=1` tells pandas that we want to drop a column rather than a row (otherwise, we'd use `axis=0`).

When you make this call, the code will return a new dataframe, so we'll want to "grab" it. We can either create a new dataframe:

`df_clean = df.drop(...)` or just overwrite the old one:

`df = df.drop(...)`

There are a few fields we don't need, specifically: 'Indicator Category', 'Indicator', 'BCHC Requested Methodology', 'Source', 'Methods', 'Notes', '90% Confidence Level - Low', '90% Confidence Level - High', '95% Confidence Level - Low', '95% Confidence Level - High'

Delete these and save the result into a new dataframe called `df_clean`. You can do this one at a time or read the documentation to do it all at once.
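Both approaches (one column at a time, or several at once) can be sketched on a small hypothetical frame. The frame and its column names below are made up for illustration; the `columns=` keyword is an alternative to passing `axis=1`.

```python
import pandas as pd

# Hypothetical toy frame; not the tutorial data
toy = pd.DataFrame({
    "Place": ["Dallas, TX", "Boston, MA"],
    "Value": [1.5, 0.7],
    "Notes": ["a", "b"],
    "Source": ["x", "y"],
})

# Drop one column at a time by chaining calls...
one_at_a_time = toy.drop("Notes", axis=1).drop("Source", axis=1)

# ...or pass a list of column names to drop them together
all_at_once = toy.drop(columns=["Notes", "Source"])

print(all_at_once.columns.tolist())
```

Note that `toy` itself is unchanged in both cases, because `drop` returns a new dataframe.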

In [None]:
answerButton("assets/pandas/95365da784604176bf77e1ba86860a0d","show me...")

In [None]:
# fix this
df_clean = None

## Let's work with selecting some rows and columns

In [None]:
# We can extract given rows of data based on position (row count). iloc is based on the row number
df_clean.iloc[2:4]

In [None]:
# Or extract the rows based on index (in this case, they are the same). loc is based on whatever we set the index to
# in this case it's Sample ID. This next command says give me everything between sample IDs 395 and 7678
df_clean.loc[395:7678]
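One difference between the two that is easy to miss: `iloc` slices exclude the end position (like regular Python slices), while `loc` slices include the end label. A minimal sketch on a hypothetical frame whose index labels (10 through 40) are made up:

```python
import pandas as pd

# Hypothetical toy frame with integer labels as the index
toy = pd.DataFrame({"value": [1.0, 2.0, 3.0, 4.0]}, index=[10, 20, 30, 40])

by_position = toy.iloc[1:3]  # positions 1 and 2; the end position is excluded
by_label = toy.loc[20:40]    # labels 20 through 40; the end label is included

print(by_position.index.tolist())
print(by_label.index.tolist())
```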

In [None]:
# If we want to grab a specific column, we could do the following:
df_clean['Place']

# This gives us back a "Series": a one-dimensional structure that pairs each index label with a single value

In [None]:
# Now, let's say we want to pick out data from Texas. There are a few ways of doing that

# first thing first, if we run the following:

df_clean['Place'] == 'Dallas, TX'

# we will get back a series with true/false values corresponding to rows that match our search


In [None]:
# if we wanted to get the actual rows corresponding to this query, we could do:

dallas = df_clean[df_clean['Place'] == 'Dallas, TX']
dallas.head(5)
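Boolean masks like the one above can also be combined. The sketch below uses a hypothetical frame (the places and values are made up) to show two equivalent ways of matching several values: joining comparisons with `|`, or using `isin`.

```python
import pandas as pd

# Hypothetical toy frame; not the tutorial data
toy = pd.DataFrame({
    "Place": ["Dallas, TX", "Boston, MA", "Austin, TX", "Dallas, TX"],
    "value": [1.5, 0.7, 0.9, 2.1],
})

# Option 1: combine masks with | (each comparison needs its own parentheses)
either = toy[(toy["Place"] == "Dallas, TX") | (toy["Place"] == "Austin, TX")]

# Option 2: test membership in a list of accepted values
same = toy[toy["Place"].isin(["Dallas, TX", "Austin, TX"])]

print(same)
```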

In [None]:
# try to make a new data frame called texas which only has Dallas, Houston, Fort Worth (Tarrant County), 
# or San Antonio locations. There are (at least) two ways to do it

answerButton("assets/pandas/f484f6a78f7b40c4aae7fea8684ca0da","show me...")

In [None]:
# fix this
texas = None

## Group by State

We want to merge all the cities in a given state into one. The first thing we'll want to do is pull out the last two characters from the Place... this corresponds to the state. You can do this with:

`df_clean['Place'].str[-2:]`

Add a new column to df_clean that contains the two-letter state abbreviation, and call it `State`:

In [None]:
df_clean['State'] = df_clean['Place'].str[-2:]

Our next step is to use groupby to merge all the rows that are for the same state into one row per state/year pair. So for example, if we have data for California that includes "Long Beach", "San Diego" and "Los Angeles" we'd like to make just a single row that has the *mean* of all the e-coli incidents. Note that we could also do the sum or some other aggregation if we wanted. For example, if we wanted to calculate the sum by place we might do:

`df_clean.groupby(['Place']).sum()`

Try to do this for State and Year (using the mean, not the sum!) and store the result in a new data frame called `grouped_df`.

In [None]:
answerButton("assets/pandas/e6b45903889943a993fe65da25b57bb9","show me...")

In [None]:
# fix this
grouped_df = None

If you do this, you will notice that the row indices are a little bit wonky (the group keys have become the index). To reset these we can do:

In [None]:
grouped_df = grouped_df.reset_index()
grouped_df
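To see what `reset_index` is doing here, consider a minimal sketch on a hypothetical frame (the column `All` and the values are made up). Grouping on two keys produces a `MultiIndex`, and `reset_index` turns those keys back into ordinary columns. We pass `numeric_only=True` as a precaution, on the assumption that a recent pandas version is in use, where `mean` may raise an error if text columns are present.

```python
import pandas as pd

# Hypothetical toy frame; not the tutorial data
toy = pd.DataFrame({
    "State": ["CA", "CA", "TX", "TX"],
    "Year": [2012, 2012, 2012, 2013],
    "All": [1.0, 3.0, 2.0, 4.0],
})

# One row per State/Year pair; the group keys become a MultiIndex
grouped = toy.groupby(["State", "Year"]).mean(numeric_only=True)

# Move the group keys back into regular columns with a fresh 0..n-1 index
flat = grouped.reset_index()

print(flat)
```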

## Wide to Long

At this point the data is in a "wide format."  That is, we have lots of columns (one for each race) for each location/year. 

<img src="assets/pandas/wide.png" alt="wide format table" width="600">

It turns out that the visualization tools we will use do better with data in "long format." In a long format, we create a new row for each year, place, race triplet: 

<img src="assets/pandas/long.png" alt="long format table" width="300">

There are a number of ways to achieve this, but one of the easiest is the pandas "melt" function. The melt function takes, through `id_vars`, the names of the columns we want to keep fixed; every other column is "unpivoted" into rows. In our case, we want to unpivot the values from the race columns but keep state and year fixed. So we will do:

In [None]:
long_format = grouped_df.melt(id_vars=['State','Year'])
long_format.head()
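By default, `melt` names the two new columns `variable` and `value`, which is why later cells refer to `long_format['value']`. The sketch below, on a hypothetical wide frame (the `GroupA`/`GroupB` columns are made up), shows the reshaping and the optional `var_name` and `value_name` arguments for choosing other names.

```python
import pandas as pd

# Hypothetical wide-format frame; not the tutorial data
wide = pd.DataFrame({
    "State": ["CA", "TX"],
    "Year": [2012, 2012],
    "GroupA": [1.0, 2.0],
    "GroupB": [3.0, 4.0],
})

# id_vars stay as columns; every other column becomes (name, value) rows
long = wide.melt(id_vars=["State", "Year"], var_name="group", value_name="rate")

print(long)
```

Two rows and two measured columns in the wide frame become four rows in the long frame.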

Sometimes we will need to use other transformations (pivots, unstacking, etc.) but those are less common. Review the page at: https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html for more information.

In [None]:
# as a last step, let's clean the data a bit and drop the N/As
long_format = long_format.dropna()

### Analysis

Let's do a quick sanity check on the data. If we want to find the 5 highest cases we might do:

In [None]:
long_format.sort_values('value',ascending=False).head(5) # sort by the 'value' column; ascending=False puts the largest first (ascending is the default)
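As an aside, pandas also has `nlargest`, which can express the same "top n" query in one call. A minimal sketch on a hypothetical frame (states and values are made up):

```python
import pandas as pd

# Hypothetical toy frame; not the tutorial data
toy = pd.DataFrame({
    "State": ["CA", "TX", "MA", "WA"],
    "value": [0.4, 2.5, 1.1, 3.0],
})

# Sort descending and keep the first two rows...
top2_sorted = toy.sort_values("value", ascending=False).head(2)

# ...or ask directly for the two largest values
top2_nlargest = toy.nlargest(2, "value")

print(top2_nlargest)
```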

In [None]:
# we can also just get lots of summary statistics.
long_format.describe()

### Dealing with Time Series

The Year column is currently a number, but we may want to transform it into a date so we can easily calculate things like moving averages. Use the pandas `to_datetime` function to transform the year for the long_format table. Because we only have 'Year,' you'll need to use the 'format' argument. We'll just assume any measurement was taken on January 1st of that year.
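For reference, here is one possible way the `format` argument behaves, sketched on a hypothetical frame. Casting to string first is a cautious assumption on our part so that `%Y` has text to parse; it may not be strictly required.

```python
import pandas as pd

# Hypothetical toy frame; not the tutorial data
toy = pd.DataFrame({"Year": [2010, 2011, 2012], "value": [1.0, 2.0, 4.0]})

# "%Y" means "a four-digit year"; missing month and day default to January 1st
toy["Date"] = pd.to_datetime(toy["Year"].astype(str), format="%Y")

print(toy)
```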

In [None]:
answerButton("assets/pandas/23bbcc621003415f8ba401a083ac0348","show me...")

In [None]:
# fix this code
long_format['Year'] = long_format['Year']

Note that in some cases we'll want to transform numerical data (like year) to a string. This is because some visualization tools will make an *inference* about what you want to display based on the data type. If the year was numerical, the software might assume you want a continuous time series and will draw points or bars for missing years in a way that might not be desirable. 

Create a column called StringYear that holds the Year as a string:

In [None]:
answerButton("assets/pandas/b70446f2e79a4f34990aecb2c55f54d2","show me...")

In [None]:
# fix this
long_format['StringYear'] = long_format['Year']

If you've done this correctly:

`long_format['Year'].mean()` will work (but only with Pandas >= 0.25)

`long_format['StringYear'].mean()` will not
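The difference between the two types can be sketched on a hypothetical frame. The column name `YearLabel` and the use of `.dt.year` to keep only the four-digit year are our own choices for illustration; they are one of several possible conversions and not necessarily the one the exercise expects.

```python
import pandas as pd

# Hypothetical toy frame holding dates for January 1st of three years
toy = pd.DataFrame({"Year": pd.to_datetime(["2010", "2011", "2012"], format="%Y")})

# A datetime column supports arithmetic such as the mean
mean_date = toy["Year"].mean()

# A string column is just labels: here we keep only the four-digit year
toy["YearLabel"] = toy["Year"].dt.year.astype(str)

print(mean_date)
print(toy["YearLabel"].tolist())
```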

## Binning data

Let's do a really fast experiment to bin the data by value. 

`pd.cut(series,bins,labels=...)` is used for this.
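A minimal sketch of how `pd.cut` treats bin edges, using made-up values and labels. By default each bin excludes its left edge and includes its right edge, so a value equal to the lowest edge, or above the highest edge, ends up as NaN unless we ask otherwise.

```python
import pandas as pd

# Hypothetical values; not the tutorial data
values = pd.Series([0.0, 2.5, 5.0, 7.0, 12.0])

# Three edges define two bins: (0, 5] and (5, 10]
binned = pd.cut(values, bins=[0, 5, 10], labels=["low", "high"])

# include_lowest=True makes the first bin include its left edge as well
binned_incl = pd.cut(values, bins=[0, 5, 10], labels=["low", "high"],
                     include_lowest=True)

print(binned.tolist())
print(binned_incl.tolist())
```

Here 0.0 falls outside the first bin in `binned` but inside it in `binned_incl`, and 12.0 is outside both bins in either case.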

If we thought we had equal-sized bins, we might generate a range and then use the value of the bin to give a bin label to each value. For example:

In [3]:
bin_values = np.arange(0.0,100.0,5.0)
pd.cut(long_format['value'],bin_values,labels=bin_values[0:-1])  # note, we need one fewer label than bin edges

# let's add a column
long_format['fixedbins'] = pd.cut(long_format['value'],bin_values,labels=bin_values[0:-1])
long_format.sample(10)


Unfortunately, the data is very lopsided; most of the values are really low. So when we ask for the "counts" of each bin we get a skewed distribution.

In [None]:
long_format['fixedbins'].value_counts()

Instead, try to bin the values so that we have a "very low", "low", "medium", and "high" corresponding to values of 0 to .5, .5 to 1, 1 to 5, and anything higher. Put the bin labels into a "bin" column

In [None]:
answerButton("assets/pandas/f920cd4c3662417abf90a4eff03b60a2","show me...")

In [None]:
# your code goes here
long_format['bin'].value_counts()

## Creating Data Frames
It is worth reviewing different ways to create dataframes from code (instead of files). There are a number of ways of doing this including reading in dictionaries or making lists. We encourage you to review the documentation on this (easy start: https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html) 

In [None]:
# here's an example to make a simple two column data frame

data = {'fruit':['apple', 'banana', 'pineapple', 'mango'], 'price':[5, 2, 9, 10]}
pd.DataFrame(data)

In [None]:
# we're going to build the same data frame here but we're going to assume the data came to us
# in pieces

# we have an array with the first set of fruit and the prices
fruit1 = ['apple', 'banana']
sales1 = [5,2]

# and a second array of tropical fruit
tropical = ['pineapple','mango']
sales2 = [9,10]

# if we think we're going to need three data frames (one for each type of fruit and one that's the whole thing)
# we might do:
fruit_df = pd.DataFrame(list(zip(fruit1,sales1)), columns=['fruit','price'])
tropical_df = pd.DataFrame(list(zip(tropical,sales2)), columns=['fruit','price'])

# zip... if you're interested, will make pairs of matched objects (e.g., apple -> 5, banana -> 2) from 
# arrays... list will transform this into a list data structure that we can use

pd.concat([fruit_df,tropical_df]).reset_index(drop=True)

# because each table had its own index we dropped those and created a new one
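Another common starting point is a list of records, where each row arrives as its own dictionary. A minimal sketch with made-up fruit data, showing that it yields the same frame as the column-oriented dictionary:

```python
import pandas as pd

# Row-oriented input: one dictionary per row
records = [
    {"fruit": "apple", "price": 5},
    {"fruit": "banana", "price": 2},
]
from_records = pd.DataFrame(records)

# Column-oriented input: one list per column
from_dict = pd.DataFrame({"fruit": ["apple", "banana"], "price": [5, 2]})

print(from_records)
```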