A few things you should keep in mind when working on assignments:

1. Fill in every place that says `YOUR CODE HERE`. Do **not** write your answer anywhere other than where it says `YOUR CODE HERE`; anything written elsewhere will be removed or overwritten by the autograder.

2. Before you submit your assignment, make sure everything runs as expected. In the menubar, select _Kernel_, then restart the kernel and run all cells (_Restart & Run All_).

3. Do not change the title (i.e. file name) of this notebook.

4. Make sure that you save your work (in the menubar, select _File_ → _Save and Checkpoint_).

5. You are allowed to submit an assignment multiple times, but only the most recent submission will be graded.

# Problem 1. Pandas Data Manipulation

In this problem, we will use the
  [groupby()](http://pandas.pydata.org/pandas-docs/stable/groupby.html)
  and [aggregate()](http://pandas.pydata.org/pandas-docs/stable/groupby.html#aggregation)
  functions in Pandas to compute and plot the number of flight cancellations
  in each month of 2001.
  
![](https://github.com/UI-DataScience/accy570-fa16/raw/master/Week13/images/cancelled.png)
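
As a warm-up, the split-then-reduce pattern behind `groupby()` and `aggregate()` can be sketched on a tiny made-up table. The `sales` DataFrame and its `store` and `amount` columns are hypothetical and unrelated to the airline data; this is only an illustration of the two calls, not the solution to the problem.

```python
import pandas as pd

# Hypothetical toy data: one row per sale (not the airline dataset).
sales = pd.DataFrame({
    "store": ["A", "B", "A", "B", "A"],
    "amount": [10, 20, 30, 40, 50],
})

# Split the rows by store, then reduce each group to a single number.
totals = sales.groupby("store").aggregate("sum")
print(totals)
```

The result is indexed by the grouping key (`store`), with one row per distinct value.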

In [None]:
%matplotlib inline

import numpy as np
import pandas as pd
import seaborn as sns

from pandas.testing import assert_frame_equal

In this problem, we use the [airline on-time performance data](http://stat-computing.org/dataexpo/2009/the-data.html). In the following code cell, we specify the encoding of the file (`latin-1`). Also note that we use the `usecols` option to read only two columns, `Month` and `Cancelled`, for faster performance and lower memory usage.

In [None]:
df = pd.read_csv('/home/data_scientist/data/2001.csv', encoding='latin-1', usecols=['Month', 'Cancelled'])
print(df.head())
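
To see what `usecols` does in isolation, here is a minimal sketch that reads a small CSV held in memory. The CSV text and the extra `DayofMonth` column are made up for illustration; `encoding` is left out because the text is already decoded.

```python
import io
import pandas as pd

# Hypothetical three-column CSV held in memory instead of a file on disk.
csv_text = "Month,DayofMonth,Cancelled\n1,15,0\n1,16,1\n2,3,0\n"

# Only the listed columns are parsed; DayofMonth is skipped entirely.
subset = pd.read_csv(io.StringIO(csv_text), usecols=["Month", "Cancelled"])
print(subset.columns.tolist())
```

Because unlisted columns are never loaded, the resulting DataFrame is smaller than one built from the full file.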

## Add up the flight cancellations by month.

- In the following code cell, write a function named `get_month_cancelled()` that groups the rows by month and then adds up the cancellations within the same month.
- `get_month_cancelled()` takes a Pandas DataFrame (`df`) which has two columns, `Month` and `Cancelled`.
  The `Month` column is an integer column with values ranging from 1 to 12.
  The `Cancelled` column is also an integer column, but it has only two values: `0` if the flight was not cancelled
  and `1` if it was. Thus, adding up all the values in the `Cancelled` column gives the total number of cancellations, because flights that were not cancelled contribute zero to the sum.
- `get_month_cancelled()` returns a Pandas DataFrame that is indexed by the **names** of the months
  and has only one column `Cancelled`, the number of flight cancellations in each month. In other words,
  when you run
  ```python
  >>> month_cancelled = get_month_cancelled(df)
  >>> print(month_cancelled)
  ```
  you should get
  ```
                 Cancelled
    January        19891
    February       17448
    March          17876
    April          11414
    May             9452
    June           15509
    July           11286
    August         13318
    September      99324
    October         6850
    November        4497
    December        4333
  ```

- If you don't set the index manually, it will consist of plain integers rather than month names.
  Use the following list to change the numbers to the names of the months.
  ```python
  ['January', 'February', 'March', 'April', 'May', 'June',
   'July', 'August', 'September', 'October', 'November', 'December']
  ```
  If you are not sure how to change the index, see [Chapter 4 of Pandas Cookbook](http://nbviewer.jupyter.org/github/jvns/pandas-cookbook/blob/master/cookbook/Chapter%204%20-%20Find%20out%20on%20which%20weekday%20people%20bike%20the%20most%20with%20groupby%20and%20aggregate.ipynb).
  In the Pandas Cookbook, the index of `weekday_counts` is originally 0, 1, 2, ...
  To change it to Monday, Tuesday, ..., the author uses the following code:
  ```python
  weekday_counts.index = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
  ```
  You can use a similar method to change the index.
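
Two of the ideas above can be checked on toy data: summing a 0/1 column counts the ones, and assigning a list to `.index` relabels the rows. The names `flags`, `counts`, and `late` are hypothetical and the numbers are invented; this is a sketch of the mechanics only, not of `get_month_cancelled()` itself.

```python
import pandas as pd

# Summing a 0/1 column counts the ones, since zeros add nothing.
flags = pd.Series([1, 0, 0, 1, 1])
n_flagged = flags.sum()

# Hypothetical result frame with a plain integer index.
counts = pd.DataFrame({"late": [4, 0, 7]})

# Assigning a list with one label per row replaces the index.
counts.index = ["Monday", "Tuesday", "Wednesday"]
print(counts)
```

The list must have exactly as many labels as the DataFrame has rows; otherwise pandas raises an error.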

In [None]:
def get_month_cancelled(df):
    # YOUR CODE HERE

In [None]:
month_cancelled = get_month_cancelled(df)
print(month_cancelled)

In [None]:
answer = pd.DataFrame(
    [19891, 17448, 17876, 11414, 9452, 15509,
     11286, 13318, 99324, 6850, 4497, 4333],
    index=['January', 'February', 'March', 'April', 'May', 'June',
           'July', 'August', 'September', 'October', 'November', 'December'],
    columns=['Cancelled']
    )

assert_frame_equal(month_cancelled, answer)

# additional test
df1 = pd.DataFrame(
    {"Month": list(range(1, 13)) * 2,
     "Cancelled": [0, 1] * 12}
)
df2 = pd.DataFrame(
    [0 if i % 2 else 2 for i in range(1, 13)],
    index=['January', 'February', 'March', 'April', 'May', 'June',
           'July', 'August', 'September', 'October', 'November', 'December'],
    columns=['Cancelled']
)
assert_frame_equal(get_month_cancelled(df1), df2)

In the following code cell, we plot the number of cancellations for each month.

In [None]:
month_cancelled.plot(kind='bar');