<a href="https://colab.research.google.com/github/ekille/ekille.github.io/blob/master/visualizations_with_python_practice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Visualizations with Python

***

# General Notes on This Session

This is a Jupyter notebook running in Google's Colab environment that we will use to practice with some Python packages that are useful for data analysis.

You can write and execute your Python code right in the browser here. No additional setup is required.

Because of the large number of people here, our interaction during the session will be limited. If you get stuck on something, please do your best for now and I promise to help you out later.

If you get an error with the code I supplied, make sure you have *run all prior code.*

The main packages we will cover today are *pandas* / *numpy* (used for manipulating tabular and array data) and *matplotlib* / *seaborn* / *bokeh* / *plotly* (used to create graphs and animations).

We could easily spend hours on each of these packages, so we can only take a quick tour in the time we have today.

# Pandas

The pandas library is essential for data analysis in Python. It allows you to manipulate tabular data structures, such as you would find in a relational database or spreadsheet.

The name comes from "panel data" - a term used for data sets that track multiple variables over time.

Some things we'll do with pandas:
*   Load data
*   Explore that data
*   Subset the data
*   Join data sets together




# Load Data

Let's get some data first. We can load data from (and write back to) a variety of locations and formats.

To get things started, we'll load some CSV data about COVID-19 cases in the United States from a URL.


In [None]:
# it's conventional to alias pandas as pd once imported
import pandas as pd
historical_data_url='https://query.data.world/s/5amvcq2lwgrsjhrcsy7vpjglambmsq'
# pandas will read this data into a DataFrame, the typical pandas data structure
historical_covid_data=pd.read_csv(historical_data_url)
# let's see the first 10 rows - could also do tail()
historical_covid_data.head(n=10)
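Since reading and writing go both ways, here is a minimal sketch of a CSV round trip. It uses a tiny made-up data frame and an in-memory buffer rather than the real COVID file, so the column names and values are only placeholders.

```python
import io
import pandas as pd

# toy data standing in for the COVID file - names and numbers are made up
toy = pd.DataFrame({"state": ["Ohio", "Utah"], "confirmed": [120, 45]})

# write to an in-memory text buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# rewind the buffer and read it back, the same way read_csv reads a path or URL
buffer.seek(0)
round_trip = pd.read_csv(buffer)
print(round_trip)
```

Passing `index=False` keeps the row index out of the file, which is usually what you want when the index is just 0, 1, 2, and so on.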

# Explore Our Data

Now that we've loaded our data, let's learn more about it. 

In [None]:
# let's see some stats about the values in each column
historical_covid_data.describe(include="all")

# Exercise #1

Load the data from https://query.data.world/s/3haf2gay6ntrp6groaxuuo2taumrki into a data frame called *current_covid_data*.

How many confirmed cases ("confirmed") are in the median county?

In [None]:
# load the data into a new data frame called current_covid_data
current_data_url='https://query.data.world/s/3haf2gay6ntrp6groaxuuo2taumrki'
# pandas will read this data into a DataFrame, the typical pandas data structure
current_covid_data=pd.read_csv(current_data_url)
# take a look at the data to see what it looks like
# let's see the first 10 rows - could also do tail()
current_covid_data.head(n=10)

In [None]:
# how many cases are in the median county? (the median is the "50%" row of the confirmed column)
current_covid_data.describe()
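If you only want the one number, you can also ask for the median directly. This is a small sketch on made-up county data (the values are not from the real file) showing that the "50%" row of describe() and median() agree.

```python
import pandas as pd

# five made-up counties - not real data
toy = pd.DataFrame({
    "county": ["A", "B", "C", "D", "E"],
    "confirmed": [10, 200, 35, 4, 90],
})

# describe() on a single column returns a Series indexed by statistic name
summary = toy["confirmed"].describe()
median_from_describe = summary["50%"]

# median() gets the same value without the other statistics
median_direct = toy["confirmed"].median()
print(median_from_describe, median_direct)
```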


# Plotting with Matplotlib

The matplotlib package is the most commonly used way to plot data from pandas data frames and probably Python data in general.

It was inspired by and based partly upon a mathematical computing and graphics environment called MATLAB.

We're going to find the 10 states with the most COVID-19 cases and plot the number of cases over time.

In [None]:
### UNCOMMENT AND RUN THESE FIRST IF YOU DIDN'T COMPLETE EXERCISE 1###
# current_data_url='https://query.data.world/s/3haf2gay6ntrp6groaxuuo2taumrki' 
# current_covid_data=pd.read_csv(current_data_url)

In [None]:
# let's find the 10 states that have the most COVID-19 cases
# use a group by to get state totals, summing up the records for each state, and sort by decreasing number of cases, limiting to top 10
# notice that you can chain calls together and that you already know how to limit the top n with head()
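# numeric_only=True skips text columns, which newer pandas versions would otherwise try to "sum" as well
top_10_states_totals = current_covid_data.groupby('state').sum(numeric_only=True).sort_values(by='confirmed', ascending=False).head(n=10)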
top_10_states_totals # note that many numbers don't make sense here because not additive - just ignore those
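To see the group / sum / sort / head chain on something small enough to check by hand, here is a sketch with made-up rows. It picks out the one column we care about before summing, which is one way to avoid the non-additive columns mentioned above.

```python
import pandas as pd

# made-up county rows - not real data
toy = pd.DataFrame({
    "state": ["Ohio", "Ohio", "Utah", "Iowa", "Iowa"],
    "county": ["A", "B", "C", "D", "E"],
    "confirmed": [100, 50, 30, 400, 10],
})

# group by state, sum just the confirmed column, sort descending, keep the top 2
top_2 = (
    toy.groupby("state")["confirmed"]
    .sum()
    .sort_values(ascending=False)
    .head(n=2)
)
print(top_2)
```

Wrapping the chain in parentheses lets you put each step on its own line, which can be easier to read than one long line.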

In [None]:
# get that list of top 10 states to filter our other data set
# the last operation made it the index of our data frame
top_10_states_list = top_10_states_totals.index.to_list()
top_10_states_list

In [None]:
# let's create a new data frame from our historical COVID data for confirmed cases in just those states we identified above
# we'll use a new "isin" method to subset our data for just those states
# the brackets allow us to specify the rows we want in our new data frame
# also notice copy() method to give us new data frame instead of view (we'll be editing this data)
top_10_states_history = historical_covid_data[historical_covid_data.state.isin(top_10_states_list)].copy()
top_10_states_history.head()
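Here is the same filter-then-copy pattern on a toy data frame (made-up values), so you can see what isin() actually returns and why the copy leaves the original untouched when we edit the subset.

```python
import pandas as pd

# made-up rows - not real data
toy = pd.DataFrame({
    "state": ["Ohio", "Utah", "Iowa", "Ohio"],
    "cases": [1, 2, 3, 4],
})
wanted = ["Ohio", "Iowa"]

# isin() returns one True/False per row
mask = toy.state.isin(wanted)

# the brackets keep only the True rows; copy() makes the result independent
subset = toy[mask].copy()

# editing the copy does not change the original data frame
subset["cases"] = subset["cases"] * 10
print(subset)
print(toy)
```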

In [None]:
# we want to plot by date - does pandas know that date column is actually a date?
top_10_states_history.dtypes

In [None]:
# make it a date - then go back and run line above
top_10_states_history.date =  pd.to_datetime(top_10_states_history.date)
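A quick sketch of what that conversion buys us, using two made-up rows: once the column is a real date type, pandas can do date-aware things with it, such as finding the earliest date.

```python
import pandas as pd

# made-up rows with dates stored as text, the way read_csv loads them by default
toy = pd.DataFrame({"date": ["2021-01-03", "2021-01-01"], "cases": [5, 2]})
before = toy["date"].dtype

# convert the text column to real timestamps
toy["date"] = pd.to_datetime(toy["date"])
after = toy["date"].dtype
print(before, "->", after)

# date-aware operations now work
earliest = toy["date"].min()
print(earliest)
```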

In [None]:
# numpy is a library used for various numeric operations - pandas is actually built on it
import numpy as np
# pivot the data frame - each date gets a row, the states become columns, and the sum of the cases become the cell values  
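# passing 'sum' by name avoids a deprecation warning that newer pandas versions raise for aggfunc=np.sum
top_10_pivot_cases = pd.pivot_table(top_10_states_history, values='cumulative_cases', index=['date'], columns=['state'], aggfunc='sum')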
top_10_pivot_cases.tail()
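If the pivot feels abstract, here is a sketch on six made-up rows (two dates, two states, with Ohio split across two counties) that is small enough to add up in your head.

```python
import pandas as pd

# made-up long-format rows: one row per county per date
toy = pd.DataFrame({
    "date": ["2021-01-01", "2021-01-01", "2021-01-01",
             "2021-01-02", "2021-01-02", "2021-01-02"],
    "state": ["Ohio", "Ohio", "Utah", "Ohio", "Ohio", "Utah"],
    "cumulative_cases": [10, 5, 7, 12, 6, 9],
})

# each date becomes a row, each state a column, and county values are summed
pivot = pd.pivot_table(toy, values="cumulative_cases", index=["date"],
                       columns=["state"], aggfunc="sum")
print(pivot)
```

The two Ohio rows for each date collapse into a single cell, which is exactly what happens to the county rows in the real data.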


In [None]:
# need this line to create plot inside a Jupyter notebook like this one
%matplotlib inline
# conventional to import as plt - don't actually need plt reference until next code block
import matplotlib.pyplot as plt
# draw the plot
top_10_pivot_cases.plot()

In [None]:
from datetime import date
# make plot bigger with width, height in inches
plt.rcParams['figure.figsize'] = [20, 10]
# get a reference to the plot area and add a marker
top_10_pivot_cases_plot = top_10_pivot_cases.plot(marker="o")
# set the x-axis limits 
top_10_pivot_cases_plot.set_xlim(pd.Timestamp('2020-03-15'), date.today())
# add a title
top_10_pivot_cases_plot.set_title("COVID-19 Cases in Hardest Hit States")

# Exercise #2

Plot the 7-day rolling average of new cases ("new_cases_7_day_rolling_avg") since March 15th, 2020 for each of these states. Title your plot "COVID-19 7-Day Rolling Average of New Cases in Hardest Hit States."

Which state was accumulating new cases fastest in winter 2021? How about summer 2021?

In [None]:
# you can steal most of the code above for the new cases plot - you just need to make a handful of key edits
# create a pivot for new cases from the top_10_states_history data frame

# create your plot from the pivot

# set the x-axis limits 

# add a title



# Plotting with Seaborn

Seaborn is a data visualization library built on top of matplotlib. It focuses on having a simple interface and attractive defaults. Basically, it tries to expose matplotlib capabilities more easily and make things look nicer out-of-the-box.

The name comes from a character in the TV series "The West Wing." The author of the package just seems to like the show.

In [None]:
# common to import as sns - the initials of the character from that show
import seaborn as sns
# set default style, color palette, etc.
sns.set(style="white")
# create a relational plot (basically a scatterplot)
# sizes gives a relative scale on which things are drawn
splot = sns.relplot(x="lon", y="lat", hue="state", size="confirmed", 
            sizes=(20,1000), legend=None, data=current_covid_data)
splot.fig.set_size_inches(20, 12)
# focus axes on contiguous US states
# because Seaborn is matplotlib under the covers, we can use plt reference from before
plt.ylim(25, 50)
plt.xlim(-125,-65)
# add title
plt.title("Distribution of Confirmed COVID Cases in U.S.")


# Exercise #3

Re-draw the map above with the size of the markers based on the number of deaths per 100,000 inhabitants ('deaths_per_100000').

Are there any areas with a surprising (high or low) death rate?

In [None]:
# re-draw the map using deaths_per_100000 to size the markers


# Check Out Bokeh (rhymes with "okay") for Interactivity

Bokeh is another plotting library that emphasizes interactivity. It allows you to pan/zoom, save graphics to disk, and build other kinds of interactions. Check it out at https://docs.bokeh.org/en/latest/. The name refers to "aesthetic blur" in photography.

In [None]:
# going to get the differences in cases among counties onto a lower scale for drawing dot sizes - adding the .1 because otherwise many data points would be invisible
# throw this into a new column called 'scale'
current_covid_data['scale'] = (current_covid_data.confirmed / (current_covid_data.confirmed.max() - current_covid_data.confirmed.min()) + .1)
# going to bin these into 256 bins because that's how many colors I have in a palette I'm about to use
current_covid_data['color_bin'] = np.digitize(current_covid_data.confirmed, np.arange(0,256)*100)
current_covid_data.describe()
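To see what those two new columns hold, here is a sketch that runs the same two calculations on a handful of made-up case counts (not values from the real file).

```python
import numpy as np

# made-up case counts, from tiny to huge
confirmed = np.array([0, 50, 100, 2550, 1_000_000])

# same scaling as above: fraction of the full range, plus .1 so small dots stay visible
scale = confirmed / (confirmed.max() - confirmed.min()) + 0.1

# same bin edges as above: 0, 100, 200, ... 25500
edges = np.arange(0, 256) * 100

# digitize returns, for each value, the number of the bin it falls into
color_bin = np.digitize(confirmed, edges)
print(scale.round(4))
print(color_bin)
```

Notice that anything above 25,500 cases lands in the last bin (256), so the largest counties all end up with the same color.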


In [None]:
from bokeh.plotting import figure, output_notebook, show
from bokeh.models import ColumnDataSource, LinearColorMapper
from bokeh.models.tools import HoverTool

# data that will appear when I mouseover points
TOOLTIPS = [
    ("county", "@county_name"),
    ("state", "@state"),
    ("cases", "@confirmed")
]

# set up a way to map colors to values in the data set
color_bin = current_covid_data.color_bin
color_mapper = LinearColorMapper(palette='Turbo256', low=min(color_bin), high=max(color_bin))

# need to set data source for graph
source = ColumnDataSource(current_covid_data)
# set up the basic plot
# Bokeh 3.x uses width/height here (the older plot_width/plot_height arguments were removed)
p = figure(width=1000, height=600, background_fill_color = "beige", tooltips=TOOLTIPS,
           title="COVID Confirmed Case Map - Mouse Over to See County Data")
# now draw the circles for each county
p.circle(source=source,
         x='lon', y='lat', radius='scale',
         color={'field': 'color_bin', 'transform': color_mapper},
         )

# show the result
output_notebook()
show(p)

# Exercise #4
Using the Bokeh plot above, find Denver, Colorado, and see how many confirmed cases it has.

This is a no-code exercise. The point is to get familiar with Bokeh UI navigation.

# Plotly for Animations

Plotly is similar to Bokeh, but has a much better API for animations. (This is my opinion, but if you disagree, I'd love to hear why.)

Let's see how easy it is to animate our data with Plotly. We're going to look at the 7-day rolling average of new cases over time in Pennsylvania.

In [None]:
# limit to certain states just for data set size
target_states = ['Pennsylvania']
target_data = historical_covid_data[historical_covid_data.state.isin(target_states)].copy()
# limit it to just interesting dates
target_data = target_data[target_data.date.between('2021-01-01',str(date.today()))]
# now, need to get our lat/long data into same data frame as our historical data - fips_code is join basis
# loc takes a [rows,columns] approach to specifying data from a data frame
target_data = pd.merge(target_data, current_covid_data.loc[:,['fips_code','lat','lon']], on='fips_code')
# get rid of rows with null values in metric of interest
target_data = target_data[target_data.new_cases_7_day_rolling_avg.notnull()]
target_data.head()
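Here is the merge step on its own, as a sketch with two tiny made-up data frames. The column names mirror the ones above, but the values are invented.

```python
import pandas as pd

# made-up history rows: several dates per county
history = pd.DataFrame({
    "fips_code": [1, 1, 2, 3],
    "date": ["2021-01-01", "2021-01-02", "2021-01-01", "2021-01-01"],
    "new_cases": [5, 6, 7, 8],
})

# made-up county lookup: one row per county, plus a column we don't need
coords = pd.DataFrame({
    "fips_code": [1, 2],
    "lat": [40.0, 41.0],
    "lon": [-75.0, -76.0],
    "confirmed": [100, 200],
})

# loc[:, [...]] keeps all rows but only the listed columns, then we join on fips_code
merged = pd.merge(history, coords.loc[:, ["fips_code", "lat", "lon"]], on="fips_code")
print(merged)
```

By default, merge() does an inner join, so the row for county 3 (which has no match in the lookup) drops out of the result.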

In [None]:
import plotly.express as px
# create a complex animation with a single method call
fig = px.scatter(target_data, x="lon", y="lat", animation_frame="date", animation_group="fips_code",
           size="new_cases_7_day_rolling_avg", color="state", hover_name="location_name",
           size_max=100, height=700, width=1165)
# speed frame duration up from 1 second to 100 ms
fig.layout.updatemenus[0].buttons[0].args[1]["frame"]["duration"] = 100
# clunky way to update title
fig.update_layout(title={'text':'COVID Cases in PA throughout 2021'})

fig.show()

In [None]:
# now show an animated bar chart of state totals
target_states = ['Pennsylvania','California','New York','Texas','Florida']
target_data = historical_covid_data[historical_covid_data.state.isin(target_states)].copy()
# limit it to just interesting dates
target_data = target_data[target_data.date.between('2021-06-01',str(date.today()))]
fig = px.bar(target_data, x="state", y="new_cases_7_day_rolling_avg", color="state",
  animation_frame="date", range_y=[0,40000], hover_data=['location_name'])
fig.layout.updatemenus[0].buttons[0].args[1]["frame"]["duration"] = 150
fig.update_layout(title={'text':'COVID Cases Across States in Summer 2021'})
fig.show()

# Exercise #5

Animate any metric you like for some area of the country. You may need to experiment with the dimensions of the figure to make things look reasonable.

In [None]:
# have fun!