# BONUS

In this question, we'll do some basic processing on a real-world dataset, courtesy of Kaggle: "A Million News Headlines" from ABC, published over the *15-year period* from 2003 through 2017. You can [download the dataset and play with it yourself](https://www.kaggle.com/therohk/million-headlines/version/6) (though it's available on JupyterHub as well for this assignment).

### Part A

Write a function to read and process the input into a usable format. This function should

 - be named `parse_csv`,
 - take 1 argument: the name of the input file to read,
 - return 1 value: a dictionary, where each key is a unique date, and each value is a *list* of the *titles* of all headlines posted on that date.
 
The *input* file will be a comma-separated text file, the same one that's available on Kaggle. The very first line is a header row naming the two columns and can be safely ignored. Each subsequent line is a single news headline, preceded by its date (year, month, and day, written as eight digits). These two values are separated by a comma; no other punctuation exists.

The first four lines of the 1-million-plus-line file look like this:

```
publish_date,headline_text
20030219,aba decides against community broadcasting licence
20030219,act fire witnesses must be aware of defamation
20030219,a g calls for infrastructure protection summit
```

Again, skip the very first line. For each remaining line, separate the date from the headline and call `strip()` on both strings. If the date is not already a key in the dictionary, add it with an empty list as its value; then append the headline to the list stored under that date. Once every line in the file has been processed, return the dictionary.

**No external packages are allowed (this includes NumPy).** You can use built-in functions like `range`, `zip`, `enumerate`, and `open`. 
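The steps described above can be sketched roughly as follows. This is one possible shape, not the only acceptable one. It assumes the file is UTF-8 encoded and that every non-blank line contains at least one comma; neither assumption is stated in the assignment, and the variable name `headlines_by_date` is just illustrative.

```python
def parse_csv(filename):
    """Group headlines by date: {"YYYYMMDD": [headline, ...]}."""
    headlines_by_date = {}
    # Assumption: the file is UTF-8 encoded.
    with open(filename, "r", encoding="utf-8") as f:
        next(f, None)  # skip the header row (the default avoids an error on an empty file)
        for line in f:
            if not line.strip():
                continue  # ignore blank lines, e.g. a trailing newline
            # Split on the first comma only: date on the left, headline on the right.
            date, headline = line.split(",", 1)
            date, headline = date.strip(), headline.strip()
            # Create the list the first time a date is seen, then append.
            if date not in headlines_by_date:
                headlines_by_date[date] = []
            headlines_by_date[date].append(headline)
    return headlines_by_date
```

Only built-ins (`open`, `next`) and string methods are used, so this stays within the no-external-packages rule.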

In [None]:
import json
# Load the expected dictionary; the with-block closes the file afterwards.
with open("d.json", "r") as f:
    ytrue = json.load(f)
ypred = parse_csv("abcnews-date-text.csv")
assert ytrue == ypred

### Part B

Using your `parse_csv` function from Part A, answer the question: **how many news headlines are there in total?** Show the code you needed to run to obtain this answer, and use the text field to explain (in a paragraph that could fit in a tweet).

**No other imports are allowed.**

*Hint*: you'll want to pass `abcnews-date-text.csv` into your function as its input argument.
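As a rough illustration of the counting idea, here is a sketch on a tiny made-up dictionary with the same shape as the return value of `parse_csv`. The name `sample` and the second date are invented for the example; the real dictionary would come from calling your function on the CSV file.

```python
# Hypothetical stand-in for the dictionary returned by parse_csv.
sample = {
    "20030219": [
        "aba decides against community broadcasting licence",
        "act fire witnesses must be aware of defamation",
    ],
    "20030220": ["a g calls for infrastructure protection summit"],
}

# Each value is a list of headlines, so the total is the sum of the list lengths.
total = sum(len(titles) for titles in sample.values())
print(total)  # 3
```

Note that `len(sample)` alone would count dates, not headlines.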

In [None]:
# Please DO NOT MODIFY THIS CELL! Thank you!

### BEGIN HIDDEN TESTS

# The answer: 1103665 headings
# The code should read in the data and count the number of elements
# under each key.

### END HIDDEN TESTS

### Part C

Using your `parse_csv` function from Part A, answer the question: **in what *year* were the *most* headlines posted?** Show the code you needed to run to obtain this answer, and use the text field to explain (in a paragraph that could fit in a tweet).

**No other imports are allowed.**

*Hint*: you'll want to pass `abcnews-date-text.csv` into your function as its input argument.
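One way to approach this, sketched on a small made-up dictionary, is to take the first four characters of each date as the year and accumulate the number of headlines per year in a second dictionary. The names `sample`, `per_year`, and `busiest` are illustrative, and the sample data is invented.

```python
# Hypothetical stand-in for the dictionary returned by parse_csv.
sample = {
    "20030219": ["first headline", "second headline"],
    "20031231": ["third headline"],
    "20040101": ["fourth headline"],
}

per_year = {}
for date, titles in sample.items():
    year = date[:4]  # the date looks like YYYYMMDD, so the year is the first 4 characters
    # Add the number of headlines on this date, not 1 per date.
    per_year[year] = per_year.get(year, 0) + len(titles)

# max over the keys, ranked by their counts; on a tie it returns the first year encountered.
busiest = max(per_year, key=per_year.get)
print(busiest, per_year[busiest])  # 2003 3
```

Adding `len(titles)` rather than `1` matters here: adding `1` would count how many distinct dates fall in each year instead of how many headlines were posted.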

In [None]:
# Please DO NOT MODIFY THIS CELL! Thank you!

### BEGIN HIDDEN TESTS

# The answer: 2012 and 2016 are tied for the most with 366 headings each.
# The code will probably use another dictionary to key up the years
# from the original data and count the headings under those years.

### END HIDDEN TESTS

### Part D

Using your `parse_csv` function from Part A, answer the question: **what are the most common words for each year?** Remove the following stopwords from consideration: `["to", "for", "in", "on", "of", "the", "over", "at", "and", "a", "with", "an", "after"]`. Show the code you needed to run to obtain this answer, and use the text field to explain (in a paragraph that could fit in a tweet).

**You may use `defaultdict` and `Counter` to answer this question.** To use them, add the following statement at the beginning of your code:

`from collections import defaultdict, Counter`

The documentation for `defaultdict` is [here](https://docs.python.org/3/library/collections.html#collections.defaultdict), and for `Counter` is [here](https://docs.python.org/3/library/collections.html#collections.Counter) (both are on the same webpage, in different sections).

*Hint*: you'll want to pass `abcnews-date-text.csv` into your function as its input argument.
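The sketch below shows one possible way to combine the two helpers, using a `defaultdict` whose default value is a `Counter`, so that each year gets its own word tally. It runs on a tiny invented dictionary rather than the real data; `sample`, `words_by_year`, and `stop` are illustrative names, and splitting headlines on whitespace is an assumption about what counts as a word.

```python
from collections import defaultdict, Counter

stopwords = ["to", "for", "in", "on", "of", "the", "over", "at", "and", "a", "with", "an", "after"]

# Hypothetical stand-in for the dictionary returned by parse_csv.
sample = {
    "20030219": ["police probe fire", "police seek witnesses"],
    "20040101": ["new plan for the new year"],
}

stop = set(stopwords)                # set membership checks are faster than list checks
words_by_year = defaultdict(Counter)  # a missing year starts out as an empty Counter

for date, titles in sample.items():
    year = date[:4]
    for title in titles:
        # Assumption: words are separated by whitespace.
        words_by_year[year].update(w for w in title.split() if w not in stop)

for year in sorted(words_by_year):
    # most_common(1) returns a one-element list of (word, count) pairs.
    print(year, words_by_year[year].most_common(1))
# 2003 [('police', 2)]
# 2004 [('new', 2)]
```

Keeping one `Counter` per year means `most_common` can be called directly on each year's tally, without a separate sorting step.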

In [None]:
# I've gone ahead and imported things you can use and defined your list of stopwords.
from collections import defaultdict, Counter
stopwords = ["to", "for", "in", "on", "of", "the", "over", "at", "and", "a", "with", "an", "after"]

### BEGIN SOLUTION

### END SOLUTION

In [None]:
# Please DO NOT MODIFY THIS CELL! Thank you!

### BEGIN HIDDEN TESTS

# The answer should look something like this:

"""
2003 [('us', 2399)]
2004 [('police', 2767)]
2005 [('police', 2934)]
2006 [('police', 2458)]
2007 [('police', 3330)]
2008 [('police', 3141)]
2009 [('police', 2669)]
2010 [('interview', 2807)]
2011 [('police', 2102)]
2012 [('interview', 2438)]
2013 [('new', 2533)]
2014 [('new', 2405)]
2015 [('new', 2503)]
2016 [('man', 1820)]
2017 [('says', 1369)]
"""

# The code will again need to group the output of parse_csv
# by year, but will also have to break the headings up into
# individual words. defaultdict(int) may be used to count
# these words. Counter can then be used for each year's
# word count to pick out the top word.

### END HIDDEN TESTS