In [None]:
import pandas as pd
%matplotlib inline

# Air quality data

In [None]:
aq = pd.read_csv("air-quality-london-monthly-averages.csv")
aq.head(4)

*What does NaN mean?*

*Are there any other problems with this table?*

In [None]:
# Let's see what data types pandas thinks the columns are
aq.dtypes
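
A quick way to see how much is missing is to count the NaNs per column. The sketch below uses a tiny made-up table (the `demo` frame and its `Ozone` values are invented for illustration, not taken from the air quality file), so treat it as a pattern rather than a result.

```python
import numpy as np
import pandas as pd

# Made-up values, only for illustration (not the real air quality data)
demo = pd.DataFrame({
    "Month": ["Jan-08", "Feb-08", "Mar-08"],
    "Ozone": [30.5, np.nan, 41.0],
})

# NaN ("not a number") is how pandas marks a missing value
missing_per_column = demo.isna().sum()
print(missing_per_column)

# A numeric column with a NaN in it is still float64;
# the text column is reported as a string/object type until we convert it
print(demo.dtypes)
```

The same `isna().sum()` call should work on `aq` once it is loaded.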

## MAKING THE MONTH COLUMN INTO A DATE INDEX
Documentation for the format string: https://docs.python.org/3.5/library/datetime.html?format#strftime-strptime-behavior
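
As a small illustration of the format string, here is a sketch that parses a few made-up labels. It assumes the labels look like `Jan-08` (which is what `%b-%y` implies) and that month names are in English.

```python
import pandas as pd

# Made-up month labels in "Jan-08" style (an assumption about the file)
labels = pd.Series(["Jan-08", "Feb-08", "Dec-15"])

# %b = abbreviated month name, %y = two-digit year
parsed = pd.to_datetime(labels, format="%b-%y")
print(parsed)
```

Because the labels carry no day, each parsed date falls on the first of the month.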

In [None]:
# Test run
pd.to_datetime(aq["Month"], format="%b-%y").head(3)

In [None]:
# That looks better so make it stick
aq["Month"] = pd.to_datetime(aq["Month"],format="%b-%y")

The month column should be the *index* of this data rather than the row number, so let's fix that next.

In [None]:
# Again, test run first
aq.set_index("Month").head(2)

In [None]:
# Looks good so go for it
aq = aq.set_index("Month")
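
The "test run first, then assign" pattern works because `set_index` returns a new table and leaves the original alone. A minimal sketch with made-up numbers (the `demo` frame is hypothetical):

```python
import pandas as pd

# Made-up values, only for illustration
demo = pd.DataFrame({
    "Month": pd.to_datetime(["2010-12-01", "2011-01-01"]),
    "Ozone": [30.0, 35.0],
})

indexed = demo.set_index("Month")  # returns a new DataFrame

print(demo.index)     # still the default row numbers
print(indexed.index)  # now a DatetimeIndex built from the Month column
```

Nothing changes until the result is assigned back to a name.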

## DOING SOME ACTUAL ANALYSIS

Now that initial effort starts to pay off.

In [None]:
# What was the highest Nitric Oxide reading?
aq["London Mean Background:Nitric Oxide (ug/m3)"].max()

In [None]:
# And when was that?
aq["London Mean Background:Nitric Oxide (ug/m3)"].idxmax()
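
To make the difference between the two calls concrete: `max()` gives the largest value, while `idxmax()` gives the index label where it occurs. A sketch on a made-up series (the numbers are invented):

```python
import pandas as pd

# Made-up readings indexed by month, only for illustration
readings = pd.Series(
    [12.0, 30.5, 18.2],
    index=pd.to_datetime(["2010-11-01", "2010-12-01", "2011-01-01"]),
)

highest = readings.max()   # the largest value
when = readings.idxmax()   # the index label (here a date) of that value

print(highest, when)
```

With a date index, `idxmax()` answers "when?" directly.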

In [None]:
# How is that varying over time?
aq["London Mean Background:Nitric Oxide (ug/m3)"].plot()
# Note: the x-axis shows dates only because we made the date the index

*Notice anything about this graph?*

In [None]:
# So let's try a 12 month moving average
aq["London Mean Background:Nitric Oxide (ug/m3)"].rolling(window=12).mean().plot()

It's kind of misleading to have a $y$-axis that doesn't start from zero, though, isn't it?

In [None]:
# So let's try a 12 month moving average with a full y-axis
aq["London Mean Background:Nitric Oxide (ug/m3)"].rolling(window=12).mean().plot(ylim=(0,40)) 
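
It can help to see what `rolling(...).mean()` does to the numbers themselves. This sketch uses a window of 3 on made-up values so the arithmetic is easy to check by hand; the same idea applies to a window of 12.

```python
import pandas as pd

# Made-up values, only for illustration
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Each result is the mean of the current value and the two before it
smoothed = values.rolling(window=3).mean()
print(smoothed)
```

The first `window - 1` results are NaN because there are not yet enough values to fill the window, which is why a 12 month moving average starts 11 months later than the raw data.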

## FILTERING
There were a lot of NaNs before 2011 so let's just look at data since then.

In [None]:
aq[aq.index>="2011"].head(2)

In [None]:
# This works on individual columns (series) too
aq["London Mean Background:Nitric Oxide (ug/m3)"][aq.index>="2011"].head(3)
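
The comparison against `"2011"` is treated as a comparison against 1 January 2011, so `>` and `>=` differ by exactly that month. A sketch on a made-up series, which also shows `.loc` slicing as one possible alternative:

```python
import pandas as pd

# Made-up values, only for illustration
idx = pd.to_datetime(["2010-12-01", "2011-01-01", "2011-02-01"])
demo = pd.Series([1.0, 2.0, 3.0], index=idx)

from_2011 = demo[demo.index >= "2011"]      # includes January 2011
strictly_after = demo[demo.index > "2011"]  # drops January 2011

# With a sorted date index, label slicing gives the same rows as >=
same_with_loc = demo.loc["2011":]

print(from_2011)
print(strictly_after)
```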

In [None]:
# Back to our 12-month moving average plot
aq["London Mean Background:Nitric Oxide (ug/m3)"][aq.index>="2011"].rolling(
    window=12).mean().plot(ylim=(0,40))

## CORRELATION


First visually, using the pandas scatter plot (`kind="scatter"`):

In [None]:
aq.plot(x="London Mean Background:Nitric Oxide (ug/m3)",
        y="London Mean Background:Nitrogen Dioxide (ug/m3)",
        kind="scatter")

That looks like a positive correlation.

`seaborn` has better correlation plotting tools so let's import that.

In [None]:
import seaborn

In [None]:
# Regression is a related concept to correlation.
# x and y are passed by keyword: recent seaborn versions reject them as positional arguments
seaborn.regplot(x=aq["London Mean Background:Nitric Oxide (ug/m3)"],
                y=aq["London Mean Background:Nitrogen Dioxide (ug/m3)"])

This correlation looks fairly strong. How strong?

In [None]:
aq["London Mean Background:Nitric Oxide (ug/m3)"].corr(aq["London Mean Background:Nitrogen Dioxide (ug/m3)"])

So as expected, this is a fairly strong positive correlation.
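
For a feel of what the number means: `corr` computes the Pearson correlation by default, which runs from -1 (perfect negative) through 0 (none) to 1 (perfect positive). The sketch below uses made-up series to show both ends, and that rows with a missing value are skipped rather than breaking the calculation.

```python
import numpy as np
import pandas as pd

# Made-up values, only for illustration
a = pd.Series([1.0, 2.0, 3.0, 4.0])
b = pd.Series([2.0, 4.1, 5.9, 8.2])     # rises with a, with a little noise
c = pd.Series([8.0, 6.0, 4.0, 2.0])     # falls exactly as a rises
d = pd.Series([2.0, np.nan, 6.0, 8.0])  # exactly 2 * a, with one value missing

r_ab = a.corr(b)  # close to 1
r_ac = a.corr(c)  # -1
r_ad = a.corr(d)  # 1, computed from the three complete pairs

print(r_ab, r_ac, r_ad)
```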

# PRACTICE

1. In what year was the highest Sulphur Dioxide reading recorded?
1. What was the lowest PM10 reading? When did that happen?
1. Is there a correlation between Ozone and Nitric Oxide readings? If so, how strong is it?
1. What other options are there for the `kind` argument to `DataFrame.plot()`? What are they for?
1. What other plots can `seaborn` create?
1. Download some other data of interest. You might have to clean it up in Excel and then export it to CSV.