In [None]:
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

plt.style.use('ggplot')
plt.rcParams['figure.figsize'] = (15, 3)
plt.rcParams['font.family'] = 'sans-serif'

# Summary

By the end of this chapter, we're going to have downloaded a full year (2012) of hourly weather data for Montreal from Canada's historical weather site, and saved it to a CSV.

We'll do this by downloading it one month at a time, and then combining all the months together.

Here's the temperature every hour for 2012!

In [None]:
weather_2012_final = pd.read_csv('../data/weather_2012.csv', index_col='Date/Time')
weather_2012_final['Temp (C)'].plot(figsize=(15, 6))

# 5.1 Downloading one month of weather data

When playing with the cycling data, I wanted temperature and precipitation data to find out if people like biking when it's raining. So I went to the site for [Canadian historical weather data](http://climate.weather.gc.ca/index_e.html#access), and figured out how to get it automatically.

Here we're going to get the data for March 2012, and clean it up.

Here's a URL template you can use to get data in Montreal.

In [None]:
url_template = "http://climate.weather.gc.ca/climate_data/bulk_data_e.html?format=csv&stationID=5415&Year={year}&Month={month}&timeframe=1&submit=Download+Data"

To get the data for March 2012, we need to format it with `month=3, year=2012`.

In [None]:
# To download the data straight from the site, uncomment these two lines:
#url = url_template.format(month=3, year=2012)
#weather_mar2012 = pd.read_csv(url)

# Or read the local copy directly:
weather_mar2012 = pd.read_csv('../data/weather_032012_raw.csv')

This is super great! We can just use the same `read_csv` function as before, and just give it a URL as a filename. Awesome.
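
If you want to see what the formatted URL looks like before hitting the network, here's a minimal sketch. It only builds the string (no download happens), and the template is repeated so the snippet runs on its own.

```python
# Same template as above, repeated so this snippet is self-contained.
url_template = "http://climate.weather.gc.ca/climate_data/bulk_data_e.html?format=csv&stationID=5415&Year={year}&Month={month}&timeframe=1&submit=Download+Data"

# str.format fills the {year} and {month} placeholders by name.
url = url_template.format(month=3, year=2012)
print(url)
```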

We convert 'Date/Time (LST)' to dates and set 'Date/Time (LST)' to be the index column. Here's the resulting dataframe.

In [None]:
weather_mar2012['Date/Time (LST)'] = pd.to_datetime(weather_mar2012['Date/Time (LST)'], format = '%Y-%m-%d %H:%M:%S')
weather_mar2012 = weather_mar2012.set_index('Date/Time (LST)')

In [None]:
weather_mar2012
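
Here's the same conversion on a tiny hand-made frame, so you can see what changes. The two rows and their temperatures are made up for illustration; only the column names match the real file.

```python
import pandas as pd

# Two made-up rows in the same timestamp format as the weather file.
toy = pd.DataFrame({
    'Date/Time (LST)': ['2012-03-01 00:00:00', '2012-03-01 01:00:00'],
    'Temp (°C)': [-5.5, -5.7],
})

# Parse the strings into real timestamps, then use them as the index.
toy['Date/Time (LST)'] = pd.to_datetime(toy['Date/Time (LST)'], format='%Y-%m-%d %H:%M:%S')
toy = toy.set_index('Date/Time (LST)')

# With a DatetimeIndex we get date-aware attributes such as .hour.
print(type(toy.index))
print(toy.index.hour)
```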

Let's plot it!

In [None]:
weather_mar2012[u"Temp (°C)"].plot(figsize=(15, 5))

Notice how it goes up to 25° C in the middle there? That was a big deal. It was March, and people were wearing shorts outside. 

And I was out of town and I missed it. Still sad, humans.

The degree character ° is awkward to type, so let's rename the columns that contain it.

In [None]:
weather_mar2012.columns

In [None]:
weather_mar2012 = weather_mar2012.rename(columns = {'Temp (°C)' : 'Temp (C)','Dew Point Temp (°C)':'Dew Point Temp (C)'})
weather_mar2012.columns
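
If there were many columns with a degree sign, listing each one in `rename` would get tedious. One possible alternative (a sketch, not what the rest of this chapter uses) is to strip the character from every column name at once. The toy frame below has no rows and just borrows a few column names for illustration.

```python
import pandas as pd

# Empty frame with illustrative column names.
toy = pd.DataFrame(columns=['Temp (°C)', 'Dew Point Temp (°C)', 'Rel Hum (%)'])

# regex=False treats '°' as a literal character rather than a pattern.
toy.columns = toy.columns.str.replace('°', '', regex=False)
print(list(toy.columns))
```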

You'll notice in the summary above that there are a few columns which are either entirely empty or only have a few values in them. Let's get rid of all of those with `dropna`.

The argument `axis=1` to `dropna` means "drop columns, not rows", and `how='any'` means "drop the column if any value is null".

This is much better now -- we only have columns with real data.

In [None]:
weather_mar2012 = weather_mar2012.dropna(axis=1, how='any')
weather_mar2012[:5]
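
To make the difference between `how='any'` and `how='all'` concrete, here's a small sketch on a made-up frame with one full column, one partly empty column, and one entirely empty column.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'full': [1.0, 2.0, 3.0],
    'partly_empty': [1.0, np.nan, 3.0],
    'empty': [np.nan, np.nan, np.nan],
})

# how='any' drops a column as soon as it contains a single null.
any_dropped = toy.dropna(axis=1, how='any')
# how='all' only drops columns where every value is null.
all_dropped = toy.dropna(axis=1, how='all')

print(list(any_dropped.columns))
print(list(all_dropped.columns))
```

So `how='any'` is the stricter of the two: it keeps only columns with no missing values at all.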

The Year/Month/Day/Time columns are redundant. Let's get rid of those.

The `axis=1` argument means "Drop columns", like before. The default for operations like `dropna` and `drop` is always to operate on rows.

In [None]:
weather_mar2012 = weather_mar2012.drop(['Year', 'Month', 'Day', 'Time (LST)'], axis=1)
weather_mar2012[:5]
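
If you find `axis=1` hard to remember, `drop` also accepts a `columns=` keyword that does the same thing. A quick sketch on a made-up frame:

```python
import pandas as pd

toy = pd.DataFrame({'Year': [2012, 2012], 'Month': [3, 3], 'Temp (C)': [-5.5, -5.7]})

# Two equivalent ways of dropping columns.
by_axis = toy.drop(['Year', 'Month'], axis=1)
by_keyword = toy.drop(columns=['Year', 'Month'])

print(by_axis.equals(by_keyword))
print(list(by_keyword.columns))
```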

Awesome! We now only have the relevant columns, and it's much more manageable.

# 5.2 Plotting the temperature by hour of day

This one's just for fun -- we've already done this before, using groupby and aggregate! We will learn whether or not it gets colder at night. Well, obviously. But let's do it anyway.

In [None]:
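temperatures = weather_mar2012[[u'Temp (C)']].copy()
# head is a method, so it needs parentheses to actually show the first rows
print(temperatures.head())
# Pull the hour of day out of the DatetimeIndex so we can group on it
temperatures['Hour'] = weather_mar2012.index.hour
temperatures.groupby('Hour').aggregate('median').plot()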

So it looks like the time with the highest median temperature is 2pm. Neat.
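
If you'd rather read the warmest hour off as a number than from the plot, `idxmax` can do that. Here's a sketch on synthetic data: two made-up days whose temperature peaks at 14:00 by construction, so the numbers are not real measurements.

```python
import pandas as pd

# 48 hourly timestamps covering two days.
index = pd.date_range('2012-03-01', periods=48, freq=pd.Timedelta(hours=1))

# Synthetic temperatures: a peak at hour 14, with day two shifted up by 2 degrees.
values = [-abs(h - 14) + 2.0 * (i // 24) for i, h in enumerate(index.hour)]
temps = pd.DataFrame({'Temp (C)': values}, index=index)
temps['Hour'] = temps.index.hour

# Median temperature for each hour of the day, then the hour where it is highest.
median_by_hour = temps.groupby('Hour')['Temp (C)'].median()
warmest_hour = median_by_hour.idxmax()
print(warmest_hour)
```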

# 5.3 Add another month

Let's repeat what we've done so far, in condensed form, for the next month.

In [None]:
custom_date_parser = lambda x: pd.to_datetime(x, format='%Y-%m-%d %H:%M:%S')
# To download the data straight from the site, uncomment these two lines:
#url = url_template.format(month=4, year=2012)
#weather_apr2012 = pd.read_csv(url, index_col='Date/Time (LST)', parse_dates=True, date_parser=custom_date_parser)
# Or read the local copy directly.
# parse_dates=True tells read_csv to parse the index column as dates;
# without it the date parser is never applied and the index stays as strings.
weather_apr2012 = pd.read_csv('../data/weather_042012_raw.csv', index_col='Date/Time (LST)',
                              parse_dates=True, date_parser=custom_date_parser)
weather_apr2012 = (weather_apr2012.rename(columns={'Temp (°C)': 'Temp (C)', 'Dew Point Temp (°C)': 'Dew Point Temp (C)'})
                                  .dropna(axis=1)
                                  .drop(['Year', 'Month', 'Day', 'Time (LST)'], axis=1)
                  )
weather_apr2012

Now we can concatenate the two dataframes into one using `pd.concat`.

In [None]:
weather_2012 = pd.concat([weather_mar2012,weather_apr2012])
weather_2012
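
After concatenating, it can be worth checking that the combined index is still in time order and has no duplicate timestamps. A minimal sketch with two tiny made-up frames standing in for the end of March and the start of April:

```python
import pandas as pd

hour = pd.Timedelta(hours=1)

# Made-up values; only the shape of the index matters here.
march = pd.DataFrame({'Temp (C)': [1.0, 2.0]},
                     index=pd.date_range('2012-03-31 22:00', periods=2, freq=hour))
april = pd.DataFrame({'Temp (C)': [3.0, 4.0]},
                     index=pd.date_range('2012-04-01 00:00', periods=2, freq=hour))

# pd.concat stacks the rows in the order the frames are given.
combined = pd.concat([march, april])

print(combined.index.is_monotonic_increasing)  # timestamps in order?
print(combined.index.is_unique)                # no repeated timestamps?
```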

# 5.4 Getting the whole year of data

Okay, so what if we want the data for the whole year? 

Ideally the API would just let us download that, but I couldn't figure out a way to do it.

First, let's put our work from above into a function that gets the weather for a given month.

I noticed that there's an irritating bug where when I ask for January, it gives me data for the previous year, so we'll fix that too. [no, really. You can check =)]

In [None]:
def download_weather_month(year, month, download=False):
    """Return one month of cleaned hourly weather data, indexed by timestamp."""
    custom_date_parser = lambda x: pd.to_datetime(x, format='%Y-%m-%d %H:%M:%S')
    if download:
        # Work around the January quirk described above by asking for the following year
        if month == 1:
            year += 1
        source = f"http://climate.weather.gc.ca/climate_data/bulk_data_e.html?format=csv&stationID=5415&Year={year}&Month={month}&timeframe=1&submit=Download+Data"
    else:
        # Read the local copy instead, e.g. ../data/weather_032012_raw.csv
        source = f"../data/weather_{month:0>2}{year}_raw.csv"
    # parse_dates=True tells read_csv to parse the index column as dates
    weather_data = pd.read_csv(source, index_col='Date/Time (LST)',
                               parse_dates=True, date_parser=custom_date_parser)
    # Same cleanup as before: rename, drop columns with nulls, drop redundant columns
    weather_data = (weather_data.rename(columns={'Temp (°C)': 'Temp (C)', 'Dew Point Temp (°C)': 'Dew Point Temp (C)'})
                                .dropna(axis=1)
                                .drop(['Year', 'Month', 'Day', 'Time (LST)'], axis=1)
                   )
    return weather_data

We can test that this function does the right thing:

In [None]:
download_weather_month(2012, 1)[:5]

Now we can get all the months at once. This will take a little while to run.

In [None]:
data_by_month = [download_weather_month(2012, i) for i in range(1, 13)]

Once we have this, it's easy to concatenate all the dataframes together into one big dataframe using `pd.concat`. And now we have the whole year's data!

In [None]:
weather_2012 = pd.concat(data_by_month)
weather_2012
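
With twelve files stitched together, a quick way to spot gaps is to compare the index against every hour we expect for the year. The helper below, `missing_hours`, is a hypothetical name made up for this sketch, and the example runs on a synthetic index rather than the real data.

```python
import pandas as pd

def missing_hours(index, year):
    """Return the hourly timestamps of `year` that are absent from `index`."""
    expected = pd.date_range(f'{year}-01-01 00:00', f'{year}-12-31 23:00',
                             freq=pd.Timedelta(hours=1))
    return expected.difference(index)

# A complete synthetic index for 2012 (a leap year, so 366 * 24 hours).
full_year = pd.date_range('2012-01-01 00:00', '2012-12-31 23:00',
                          freq=pd.Timedelta(hours=1))
print(len(full_year))

# Remove one timestamp to simulate a gap, then look for it.
gappy = full_year.delete(100)
print(missing_hours(gappy, 2012))
```

On the real data you would call something like `missing_hours(weather_2012.index, 2012)`; an empty result would mean every hour is present.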

We can plot the temperature over the whole year.

In [None]:
weather_2012['Temp (C)'].plot(figsize=(15, 6))

# 5.5 Saving to a CSV

It's slow and unnecessary to download the data every time, so let's save our dataframe for later use!

In [None]:
weather_2012.to_csv('weather_2012_tmp.csv')
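
When you load a saved file back in, the index comes back as plain strings unless you ask for dates. Here's a round-trip sketch that writes to an in-memory buffer instead of a real file, with two made-up rows:

```python
import io
import pandas as pd

# Tiny made-up frame with a named datetime index, like our weather data.
frame = pd.DataFrame({'Temp (C)': [-1.5, -2.0]},
                     index=pd.date_range('2012-01-01', periods=2, freq=pd.Timedelta(hours=1)))
frame.index.name = 'Date/Time (LST)'

# Write to an in-memory buffer, then rewind it so it can be read back.
buffer = io.StringIO()
frame.to_csv(buffer)
buffer.seek(0)

# index_col picks the index column; parse_dates=True parses it as dates.
restored = pd.read_csv(buffer, index_col='Date/Time (LST)', parse_dates=True)
print(type(restored.index))
```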

And we're done!