pandas (http://pandas.pydata.org/) is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

It's common to see pandas, numpy and matplotlib imported in the following way. We also have to specify that we would like generated images to be presented on this page. 

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Jupyter can be configured to do this automatically, which is useful if your notebooks will be used for similar types of data analysis.

First let's revisit the data we gathered earlier. We created a list of lists which pair up an IP address and how many times that IP address was seen in an nginx access log file.

In [None]:
ip_count = !cat access.log | awk '{print $1}' | sort | uniq -c | sort -rn
ip_count = [line.strip() for line in ip_count]
ip_count = [line.split() for line in ip_count][:10]
ip_count
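
The same tally can be done without shelling out. This is only a sketch: the log lines below are made up for illustration (they are not from the real access.log), and the names `sample_log_lines`, `hits` and `ip_count_py` are hypothetical. It assumes, like the awk command, that the IP address is the first whitespace-separated field on each line.

```python
from collections import Counter

# Hypothetical nginx-style log lines, for illustration only.
sample_log_lines = [
    '203.0.113.7 - - [10/May/2015:10:00:01 +0000] "GET / HTTP/1.1" 200 612',
    '198.51.100.2 - - [10/May/2015:10:00:02 +0000] "GET /a HTTP/1.1" 200 120',
    '203.0.113.7 - - [10/May/2015:10:00:03 +0000] "GET /b HTTP/1.1" 404 90',
    '192.0.2.9 - - [10/May/2015:10:00:04 +0000] "GET / HTTP/1.1" 200 612',
    '203.0.113.7 - - [10/May/2015:10:00:05 +0000] "GET /c HTTP/1.1" 200 300',
    '198.51.100.2 - - [10/May/2015:10:00:06 +0000] "GET /d HTTP/1.1" 200 45',
]

# Count the first field (the client IP) of every non-empty line.
hits = Counter(line.split()[0] for line in sample_log_lines if line.strip())

# Same [count, IP] shape as ip_count, limited to the ten busiest IPs.
ip_count_py = [[count, ip] for ip, count in hits.most_common(10)]
print(ip_count_py)
```

One difference from the shell version: the counts here are already integers rather than strings.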

Now we want to take this data and have pandas be able to do something with it. We begin by creating a "DataFrame" from the 'ip_count' variable. DataFrames (DF from here on) are essentially spreadsheets that pandas can do some work on.

A common idiom in pandas is to use the 'head' and 'tail' methods to get a quick peek at the first or last few rows of the DF without having to display the entire thing (especially useful if your DF is large).

In [None]:
df = pd.DataFrame(ip_count, columns=['count', 'IP'])
df.head()

In [None]:
df.tail()

Side note: these DataFrames are styled using HTML. Brandon Rhodes gave an interesting presentation at PyCon 2015 which shows how to modify IPython's core CSS to style the DF: https://github.com/brandon-rhodes/pycon-pandas-tutorial I don't understand it well enough to explain it, so I won't be using it for this presentation.

We can have pandas tell us some information about the DF, like what type of objects it's comprised of.

In [None]:
df.dtypes

Uh-oh. We won't be able to do useful work unless pandas recognizes the 'count' column as a numeric type. 

#dtypes
Let's take a brief interlude to talk about pandas dtypes: (from http://pandas.pydata.org/pandas-docs/stable/basics.html#dtypes )

In [None]:
from pandas import Timestamp, Series
dft = pd.DataFrame(dict( A = np.random.rand(3),
                         B = 1,
                         C = 'foo',
                         D = Timestamp('20010102'),
                         E = Series([1.0]*3).astype('float32'),
                         F = False,
                         G = Series([1]*3,dtype='int8')))
dft.head()

Columns with string data are represented as the 'object' dtype. We'll need to coerce the data in our DF to integers to work with them further.

In [None]:
dft.dtypes

To convert the 'count' column to integers we can 'apply' a function to it:

In [None]:
df['count'] = df['count'].apply(int)
df.dtypes
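
'apply(int)' raises an error on the first value that isn't a clean integer string. As a hedged alternative, here is a small sketch using 'pd.to_numeric' with 'errors="coerce"', which turns unparseable values into NaN so they can be dropped. The sample rows (including the 'n/a' entry) and the names `raw` and `clean` are made up for illustration.

```python
import pandas as pd

# Hypothetical rows; 'n/a' stands in for a value that cannot be parsed.
raw = pd.DataFrame(
    [['120', '203.0.113.7'], ['45', '198.51.100.2'], ['n/a', '192.0.2.9']],
    columns=['count', 'IP'],
)

# Unparseable strings become NaN instead of raising an exception.
raw['count'] = pd.to_numeric(raw['count'], errors='coerce')

# Drop the rows that failed to parse, then cast the rest to integers.
clean = raw.dropna(subset=['count']).astype({'count': int})
print(clean.dtypes)
```

Whether silently dropping bad rows is acceptable depends on the data; for a log file you trust, the strict conversion above is arguably the better default.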

#Selecting
Now we can do some fun stuff like selecting with booleans:

In [None]:
df[df['count'] > 100]

If we want to chain boolean checks we need to wrap each one in parens, since the '&' operator binds more tightly than the '>' and '<' operators. Without the parens, Python would evaluate the '&' first and then treat the rest as a chained comparison, and pandas would complain that the "truth" of a Series is ambiguous.

In [None]:
df[(df['count'] > 100) & (df['count'] < 200)]
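
To see the precedence problem in isolation, here is a minimal sketch on made-up data (the `sample` frame and its values are hypothetical). It runs the unparenthesized form inside a try/except to capture the complaint, then the parenthesized form.

```python
import pandas as pd

# Hypothetical hit counts, for illustration only.
sample = pd.DataFrame({
    'count': [250, 150, 120, 90],
    'IP': ['203.0.113.7', '198.51.100.2', '192.0.2.9', '192.0.2.44'],
})

# Without parens this parses as: count > (100 & count) < 200,
# a chained comparison that needs the truth value of a Series.
try:
    sample[sample['count'] > 100 & sample['count'] < 200]
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# With parens each comparison yields a boolean Series first,
# and '&' then combines them element by element.
in_range = sample[(sample['count'] > 100) & (sample['count'] < 200)]
print(in_range)
```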

#Plotting

For now we can make a simple plot from our DF:

In [None]:
df.plot()

Yeah, not what we wanted at all. pandas will default to using a line graph and use its internal index (the row numbers) to plot the data. pandas visualizations are not as full-featured as matplotlib but we can get pretty far with a few simple options.

Let's select the proper type of visualization for this data; in this case we'll use a horizontal bar chart ('barh'). We have to specify which column we expect to have plotted against the count data.

In [None]:
df.plot(kind='barh', x='IP')

Ok, closer but I'd like to have the IP with the most hits at the top. We can perform transformations on the data then plot that transformed data without having to save the intermediate results.

In [None]:
df.sort_values(by='count', ascending=True).plot(kind='barh', x='IP')

From here we can start styling the graph to reduce visual noise and make it more visually appealing. A simple way is to use matplotlib's built-in 'ggplot' style, which mimics the look of ggplot.

ggplot (http://ggplot.yhathq.com/) is a plotting system for Python based on R's ggplot2 and the Grammar of Graphics. It is built for making professional-looking plots quickly with minimal code.

I haven't used it much but a discussion of its goals can be found here: http://blog.yhathq.com/posts/ggplot-for-python.html Essentially, plotting with sane defaults since:

> matplotlib is powerful...but its plotting commands remain rather verbose, and its no-frills, default output looks much more like Excel circa 1993 than ggplot circa 2013. ~ Jake Vanderplas, [Matplotlib & the Future of Visualization in Python](http://jakevdp.github.io/blog/2013/03/23/matplotlib-and-the-future-of-visualization-in-python/) ([@jakevdp](https://twitter.com/jakevdp))

In [None]:
# use matplotlib's built-in 'ggplot' style to make plots automagically better
import matplotlib
matplotlib.style.use('ggplot')

# primary data and plot config
# (sort_values replaces the removed df.sort; x takes the column name)
df.sort_values(by='count', ascending=True).plot(kind='barh', x='IP', grid=False, legend=False, alpha=0.5, color='g', xlim=(0, df['count'].max() + 5))

# get plot dimensions
#xmin, xmax, ymin, ymax = plt.axis()

# plot a red line for the average of the count column
#plt.vlines(df['count'].mean(), ymin=ymin, ymax=ymax, linewidth=1.5, color='r')

# add some annotations
#plt.annotate('Average', xy=(df['count'].mean(), ymax / 3), xytext=(df['count'].mean() + 30, ymax / 2), arrowprops=dict(facecolor='black', shrink=0.05))
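
The commented-out lines above sketch an average marker. One possible way to do the same thing, assuming we work with the Axes object that 'plot' returns rather than the global 'plt' state, is shown below on made-up data. The `sample` frame is hypothetical, and the 'Agg' backend line is only there so the sketch also runs as a plain script; drop it inside a notebook.

```python
import matplotlib
matplotlib.use('Agg')  # headless backend; not needed with %matplotlib inline
import pandas as pd

# Hypothetical hit counts, for illustration only.
sample = pd.DataFrame({
    'count': [250, 150, 120, 90],
    'IP': ['203.0.113.7', '198.51.100.2', '192.0.2.9', '192.0.2.44'],
})
mean_hits = sample['count'].mean()

# DataFrame.plot returns the matplotlib Axes it drew on.
ax = sample.sort_values(by='count').plot(kind='barh', x='IP', y='count', legend=False)

# Vertical red line at the average of the count column.
ax.axvline(mean_hits, color='r', linewidth=1.5)

# Label the line, positioned relative to the current y-axis limits.
ymin, ymax = ax.get_ylim()
ax.annotate('Average', xy=(mean_hits, ymax / 3),
            xytext=(mean_hits + 30, ymax / 2),
            arrowprops=dict(facecolor='black', shrink=0.05))
```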

#Public data sets

Let's try working with some public data from the TCEQ. You can download Historical Pollutant and Weather data from here: http://www.tceq.state.tx.us/airquality/monops/historical_data.html

For this example we'll get the most recent (2006) Ozone and Carbon Monoxide data, which are in two separate files that come as Excel spreadsheets (around 4MB apiece after they are unzipped).

In [None]:
%%bash
wget http://www.tceq.texas.gov/assets/public/compliance/monops/air/ozonehist/oz_2006.zip 2> /dev/null
wget http://www.tceq.texas.gov/assets/public/compliance/monops/air/ozonehist/co_2006.zip 2> /dev/null
for f in *.zip; do unzip "$f"; done

We can read directly from .xls and .xlsx files into a DF like this:

In [None]:
ozone = pd.read_excel('file://localhost/home/steven/acpg-may2015/oz_2006.xls')
# assumes co_2006.zip unpacks to co_2006.xls, mirroring the ozone file
carbon_monoxide = pd.read_excel('file://localhost/home/steven/acpg-may2015/co_2006.xls')

For now we'll focus on the Ozone DF and get a better understanding of the data that we're working with. 

In [None]:
ozone.head(3)

In [None]:
ozone.describe()
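
For readers without the TCEQ spreadsheet at hand, here is a small sketch of what 'describe' returns. The `readings` frame and its column names are invented for illustration and do not reflect the layout of the real ozone file.

```python
import pandas as pd

# Hypothetical readings; not the real TCEQ column names or values.
readings = pd.DataFrame({
    'site': ['A', 'A', 'B', 'B'],
    'ozone_ppb': [31.0, 45.0, 28.0, 52.0],
})

# By default describe() summarizes only the numeric columns:
# count, mean, std, min, quartiles and max.
summary = readings.describe()
print(summary)
```

The non-numeric 'site' column is left out of the summary by default.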