# DATA 202 Homework 6: Data Wrangling


In [None]:
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar
import datetime

# Source Data

## Capital Bikeshare Rides Data

Download the [2011 trip data](https://s3.amazonaws.com/capitalbikeshare-data/2011-capitalbikeshare-tripdata.zip) from [Capital Bikeshare](https://www.capitalbikeshare.com/system-data). You don't need to unzip the ZIP file; pandas will read it directly:

In [None]:
rides = pd.read_csv("2011-capitalbikeshare-tripdata.zip")
rides.info()

In [None]:
rides.head()

Let's remove some columns we don't need, to save memory.

In [None]:
del rides["Start station"], rides["End station"]

## Holidays

The following code gets us a table of federal holidays. Please run it without changing it.

In [None]:
# Run this code unchanged.
holidays = pd.DataFrame({
    'date': USFederalHolidayCalendar().holidays(datetime.date(2011,1,1), datetime.date(2015,12,31)).date,
    'is_holiday': True})
holidays.head()
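
Note that this table lists only the holidays themselves, so joining it onto any other table by `date` leaves `is_holiday` missing for every ordinary day. Here is a minimal sketch of one way to handle that, using a left merge followed by `fillna`. The `days` table and the name `holidays_2011` are hypothetical stand-ins made up for illustration; they are not part of the assignment data.

```python
import datetime

import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

# Same construction as the cell above, limited to 2011 for brevity.
holidays_2011 = pd.DataFrame({
    'date': USFederalHolidayCalendar().holidays(
        datetime.date(2011, 1, 1), datetime.date(2011, 12, 31)).date,
    'is_holiday': True})

# Hypothetical stand-in for a table with one row per day.
days = pd.DataFrame({'date': [datetime.date(2011, 7, 3),
                              datetime.date(2011, 7, 4),
                              datetime.date(2011, 7, 5)]})

# A left merge keeps every day; days absent from the holiday table get NaN.
with_holidays = days.merge(holidays_2011, on='date', how='left')

# Replace the NaNs with False so the column is a clean boolean.
with_holidays['is_holiday'] = with_holidays['is_holiday'].fillna(False).astype(bool)
print(with_holidays)
```

The merge only matches if both `date` columns hold the same kind of value (here, `datetime.date` objects), so it is worth checking the types on both sides before merging.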

## Weather Data
Our main goal will be to get the hourly temperature data.

The original wranglers used a weather data source that does not seem to provide downloadable data anymore. But we can use the US government's records. They're in a cumbersome format, which will provide us an excuse to practice some **data cleaning**!

The first challenge is finding the data. Here's how we solved this hard problem:

NOAA's [Integrated Surface Database](https://www.ncdc.noaa.gov/data-access/land-based-station-data/land-based-datasets) provides weather data from all over the country. But how do we use it? There's a "Find a Station" tool, but its results are confusing to use. The [station metadata page](https://www.ncdc.noaa.gov/data-access/land-based-station-data/station-metadata) links to a [station list file](ftp://ftp.ncdc.noaa.gov/pub/data/noaa/isd-history.txt). Searching that file, it looks like the code for Reagan Airport is 724050 13743. So the 2011 file is
https://www.ncei.noaa.gov/data/global-hourly/access/2011/72405013743.csv

Poking around on that site revealed two documents that look very important:
- https://www.ncei.noaa.gov/data/global-hourly/doc/isd-format-document.pdf
- https://www.ncei.noaa.gov/data/global-hourly/doc/CSV_HELP.pdf



In [None]:
# Run this to load the file directly from the NOAA website.
# You may want to make a local copy and read it in from there instead.
weather = pd.read_csv("https://www.ncei.noaa.gov/data/global-hourly/access/2011/72405013743.csv")

In [None]:
print(len(weather))
weather.head()
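
To give a flavor of the cleaning involved, here is a minimal sketch of decoding a temperature field. It assumes the `TMP` column follows the layout in the format document: a signed value in tenths of a degree Celsius, a comma, then a quality code, with `+9999` marking a missing reading. The `sample` table is made-up data for illustration; check the assumption against the real file before relying on it.

```python
import pandas as pd

# Hypothetical rows in the assumed TMP layout:
# signed tenths of a degree Celsius, a comma, then a quality code.
sample = pd.DataFrame({
    'DATE': ['2011-01-01T00:52:00', '2011-01-01T01:52:00', '2011-01-01T02:52:00'],
    'TMP': ['+0056,1', '-0011,1', '+9999,9'],
})

# Keep the part before the comma and convert it to a number.
tenths = pd.to_numeric(sample['TMP'].str.split(',').str[0])

# Blank out the missing-value code, then undo the scaling factor of 10.
sample['temp_C'] = tenths.where(tenths != 9999) / 10
print(sample)
```

This sketch ignores the quality code entirely; whether to drop readings with suspect quality codes is a separate decision.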

# Data Wrangling

## 1. Extract `date` and `hour`

In [None]:
rides['start'] = pd.to_datetime(rides['Start date'])
rides['start'].iloc[0]

In [None]:
# Split the start timestamp into a calendar date and an hour of day (0-23).
rides['date'] = rides['start'].dt.date
rides['hour'] = rides['start'].dt.hour

## 2. Filter to include only rides by Members
When you then count rides for each `date` and `hour`, you'll end up with a Series with a hierarchical index; remember that the "get out of jail free card" is `.to_frame(name="NAME_GOES_HERE").reset_index()`.

In [None]:
# your code here
...
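
The `.to_frame(...).reset_index()` hint is easier to see on a tiny example. The sketch below uses a made-up `toy_rides` table, not the real data. The `Member type` column name and its `Member` / `Casual` values are assumptions, so check `rides.info()` for the actual names.

```python
import pandas as pd

# Hypothetical miniature of the rides table; the 'Member type' column name
# and its values are assumptions for illustration.
toy_rides = pd.DataFrame({
    'date': ['2011-01-01', '2011-01-01', '2011-01-01', '2011-01-01', '2011-01-02'],
    'hour': [0, 0, 0, 1, 0],
    'Member type': ['Member', 'Member', 'Casual', 'Member', 'Member'],
})

# Keep only the rows for members.
members = toy_rides[toy_rides['Member type'] == 'Member']

# size() counts rows per group and returns a Series indexed by (date, hour).
counts = members.groupby(['date', 'hour']).size()

# Turn the hierarchical index back into ordinary columns.
hourly = counts.to_frame(name='rides').reset_index()
print(hourly)
```

After `reset_index()`, `date` and `hour` are regular columns again, which makes the result straightforward to merge with other tables.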

## ... more data wrangling...

## Yay, we're done!

In [None]:
assert len(merged_data) > 365 * 23
assert 'date' in merged_data.columns
assert 'hour' in merged_data.columns
assert 'is_holiday' in merged_data.columns
assert 'temp_C' in merged_data.columns
assert 'rides' in merged_data.columns
assert len(merged_data.dropna()) == len(merged_data)