# Public data sets

Let's try working with some public data from the TCEQ. You can download Historical Pollutant and Weather data from here: http://www.tceq.state.tx.us/airquality/monops/historical_data.html

For this example we'll get the most recent (2006) ozone and carbon monoxide data. They come in two separate files, each an Excel spreadsheet (around 4MB apiece once unzipped).

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
%%bash
wget http://www.tceq.texas.gov/assets/public/compliance/monops/air/ozonehist/oz_2006.zip 2> /dev/null
wget http://www.tceq.texas.gov/assets/public/compliance/monops/air/ozonehist/co_2006.zip 2> /dev/null
# Glob instead of parsing ls; -o overwrites without prompting if the cell is re-run
for f in *.zip; do unzip -o "$f"; done

We can read directly from .xls and .xlsx files into a DF like this:

In [None]:
# Assumes the spreadsheets were unzipped into the notebook's working directory
ozone = pd.read_excel('oz_2006.xls')
carbon_monoxide = pd.read_excel('co_2006.xls')

For now we'll focus on the Ozone DF and get a better understanding of the data that we're working with. We can get the shape of the data to see how many rows and columns we're working with:

In [None]:
ozone.shape

Let's use head() to take a peek at the data. It consists of an 'airs' number, which identifies a recording station, the date the measurement was taken, and then an ozone reading for every hour. The displayed columns are truncated, so we'll have to examine them another way.

In [None]:
ozone.head(3)

In [None]:
ozone.columns
#for col_name in ozone.columns:
#    print(col_name)

It consists of 1hr and 8hr measurements. Our goal for now will be to plot this 1hr data.

We can use a regex to get just the columns whose names match the way 1hr measurements are named.

In [None]:
ozone_1hr = ozone.filter(regex="OZ1hr")

We can see the number of columns has gone from 60 to 28:

In [None]:
ozone_1hr.shape
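
To see what `filter(regex=...)` does in isolation, here is a minimal sketch on a toy frame. The column names below are made up for illustration and are not the real TCEQ column names.

```python
import pandas as pd

# Toy frame; these column names are hypothetical, not the real TCEQ ones
toy = pd.DataFrame({
    'airs': [1, 2],
    'date': pd.to_datetime(['2006-01-01', '2006-01-02']),
    'OZ1hr_00': [10, 11],
    'OZ1hr_01': [12, 13],
    'OZ8hr_00': [20, 21],
})

# filter() keeps only the columns whose name matches the regex
toy_1hr = toy.filter(regex='OZ1hr')
print(list(toy_1hr.columns))
```

Note that 'airs' and 'date' are dropped as well, since their names don't match the pattern.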

Except we didn't match 'airs' or 'date', and we'd like to keep those columns so we know where and when the data was recorded.

In [None]:
ozone_1hr.head()

We can insert a column at a particular index (0), giving it a column name ('airs') and a source for the data (ozone['airs']):

In [None]:
ozone_1hr.insert(0, 'airs', ozone['airs'])
ozone_1hr.insert(1, 'date', ozone['date'])
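
As a small sketch of how `insert` behaves, on made-up data rather than the real spreadsheet: it changes the frame in place and returns None, so there is nothing to assign back.

```python
import pandas as pd

# Hypothetical two-row frame standing in for the ozone data
toy = pd.DataFrame({'airs': [101, 102], 'OZ1hr_00': [10, 11]})
toy_1hr = toy.filter(regex='OZ1hr')

# insert(position, column name, values) modifies toy_1hr in place
result = toy_1hr.insert(0, 'airs', toy['airs'])
print(result)
print(list(toy_1hr.columns))
```

Because the change happens in place, re-running an insert cell raises a ValueError saying the column already exists.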

Let's get some quick stats on the data:

In [None]:
ozone_1hr.describe()

As we saw in the pandas introduction, slicing a DF directly selects rows. To get a subset of columns we can index by position instead; here we drop the last four columns:

In [None]:
# .ix has been removed from pandas; .iloc does the same position-based selection
ozone_1hr = ozone_1hr.iloc[:, :-4]

In [None]:
ozone_1hr.head()
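
The difference between row slicing and positional column slicing can be sketched on a toy frame (the single-letter column names are placeholders):

```python
import pandas as pd

# Hypothetical frame with two rows and six columns
toy = pd.DataFrame([[1, 2, 3, 4, 5, 6],
                    [7, 8, 9, 10, 11, 12]], columns=list('abcdef'))

rows = toy[:1]               # plain slicing selects rows
trimmed = toy.iloc[:, :-4]   # all rows, every column except the last four
print(rows.shape, trimmed.shape)
print(list(trimmed.columns))
```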

Ideally each 'airs' would have 365 readings throughout the year, but we can see that some don't have a full dataset:

In [None]:
#ozone_1hr.groupby('airs')
ozone_1hr.groupby('airs').size().head(15)

Let's create a DF comprised of the airs and how many records they have:

In [None]:
# from http://stackoverflow.com/a/10374456
airs_count = pd.DataFrame({'count': ozone_1hr.groupby('airs').size()}).reset_index()
airs_count.head()

And another that only contains the airs which have 365 records:

In [None]:
airs_all_year = airs_count[airs_count['count'] == 365]
airs_all_year.head()

In [None]:
airs_all_year.shape

Now that we have a list of 'airs' that have 365 records, we'll select only the rows of 'ozone_1hr' whose 'airs' value appears in 'airs_all_year':

In [None]:
ozone_1hr_filtered = ozone_1hr[ozone_1hr['airs'].isin(airs_all_year['airs'])]
ozone_1hr_filtered.head()
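
The count-then-filter pattern used in the last few cells can be sketched end to end on a tiny, made-up frame, where 3 records stand in for the 365 of a full year:

```python
import pandas as pd

# Hypothetical data: station 1 has 3 records, station 2 only 2
toy = pd.DataFrame({'airs': [1, 1, 1, 2, 2], 'reading': [5, 6, 7, 8, 9]})
expected = 3  # stands in for the 365 daily records of a full year

# One row per station with its record count
counts = toy.groupby('airs').size().reset_index(name='count')
complete = counts[counts['count'] == expected]

# isin() builds a boolean mask that is True for rows from complete stations
kept = toy[toy['airs'].isin(complete['airs'])]
print(kept)
```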

In [None]:
# alias for convenience
df = ozone_1hr_filtered

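Right now pandas treats the airs numbers as integers, which will interfere when we attempt to create a plot, so we convert them to strings. Depending on your pandas version, this next cell may generate a warning (typically a SettingWithCopyWarning, because df is a filtered selection from ozone_1hr), but the assignment will succeed.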

In [None]:
df['airs'] = df['airs'].astype(str)
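
One common way to avoid that kind of warning is to take an explicit copy of the filtered frame before assigning to it. A minimal sketch on made-up data (the frame and its values are hypothetical):

```python
import pandas as pd

# Hypothetical readings for two stations
toy = pd.DataFrame({'airs': [480290032, 480290032, 1],
                    'reading': [1.0, 2.0, 3.0]})

# .copy() makes the filtered result an independent frame,
# so assigning a column to it is unambiguous
subset = toy[toy['airs'] == 480290032].copy()
subset['airs'] = subset['airs'].astype(str)
print(subset['airs'].tolist())
```

The original frame is left untouched; only the copy holds string station ids.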

This dataset has a lot of information to put into a single visualization: 47 airs locations, with 365 readings each and 24 data points every day.

Rather than try to cram the data into a single plot, I'll select a single airs location, resample to take monthly averages, and plot the first 6 hours of the day:

In [None]:
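# .ix and resample(how=...) were removed from pandas; use .iloc and .resample().mean()
station = df[df['airs'] == '480290032'].iloc[:, :8]
# Drop the string 'airs' column so only numeric readings are averaged
monthly = station.drop(columns='airs').set_index('date').resample('MS').mean()
monthly.plot(kind='line',
             title="Monthly ozone averages for airs 480290032",
             legend=True,
             figsize=(15, 10))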
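
To see what monthly resampling does on its own, here is a small sketch with synthetic daily values. The `hour_00` column name and the data are invented for illustration.

```python
import pandas as pd

# Hypothetical daily readings for one station over January and February 2006
dates = pd.date_range('2006-01-01', '2006-02-28', freq='D')
toy = pd.DataFrame({'date': dates, 'hour_00': range(len(dates))})

# 'MS' groups rows by calendar month and labels each group with the month start
monthly = toy.set_index('date').resample('MS').mean()
print(monthly)
```

Resampling needs a datetime index, which is why 'date' is moved into the index first.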