# Final assignment 
For your final assignment, you will analyze linguistic data with so-called **linguistic processors** and process the output of these processors. 

In general, the assignment consists of these main steps:
1. preprocess a number of files with linguistic processors
2. extract statistics from those annotated files
3. store the statistics in a useful format
4. present the statistics using visualizations

There are two main directions, from which you must choose one:
1. you analyze the data proposed by us
2. you propose your own project. We encourage you to discuss your ideas with us as soon as possible.

## Important dates
1. **23-12-2016 13:00**: deadline for deciding which direction you choose. If you propose your own project, it has to be approved by us before this deadline.
2. **2-2-2017 23:59**: 5-minute presentation (you will NOT receive a grade for this presentation; see below for more information)
3. **5-2-2017 23:59**: submission deadline for the final assignment

## Data (do not redistribute this data)
### everything described in this section is relevant if you choose to analyze the data proposed by us
In the folder **data**, there are 10,000 JSON files, each containing one article of the [SignalMedia corpus](http://research.signalmedia.co/newsir16/signal-dataset.html).
Let's look at an example:

In [3]:
import json

In [4]:
with open('data/4848.json') as infile:
    a_json = json.load(infile)
    print(json.dumps(a_json, indent=4, sort_keys=True))

{
    "content": "CLEVELAND, OH - AUGUST 06: Republican presidential candidates (L-R) New Jersey Gov. Chris Christie, Sen. Marco Rubio (R-FL), Ben Carson, Wisconsin Gov. Scott Walker, Donald Trump, Jeb Bush, Mike Huckabee and Sen. Ted Cruz (R-TX) take the stage at the Quicken Loans Arena August 6, 2015 in Cleveland, Ohio. \n \nScott Olson \n \nImage copyright 2015 . All rights reserved. This material may not be published, broadcast, rewritten, or redistributed.",
    "id": "5bb69bce-741e-42b3-af0a-c911bfb71330",
    "media-type": "News",
    "published": "2015-09-16T16:57:45Z",
    "source": "KSHB",
    "title": "The GOP Presidential debate BINGO card"
}


Each JSON file (containing one news article) has the following keys (information taken from the [SignalMedia corpus](http://research.signalmedia.co/newsir16/signal-dataset.html)):
* **id**: a unique identifier for the article
* **title**: the title of the article
* **content**: the textual content of the article (may occasionally contain HTML and JavaScript content)
* **source**: the name of the article source (e.g. Reuters)
* **published**: the publication date of the article
* **media-type**: either *News* or *Blog*
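
As a minimal sketch of how the fields listed above could be read in one place, the helper below opens one JSON file and returns everything except the text content. The function name `load_metadata` and the constant `EXPECTED_KEYS` are hypothetical names chosen for illustration; they are not part of the corpus or the course material. The sketch also assumes the files are UTF-8 encoded.

```python
import json

# Keys documented for each article in the list above.
EXPECTED_KEYS = {'id', 'title', 'content', 'source', 'published', 'media-type'}


def load_metadata(path):
    """Return the metadata of one article (all documented keys except 'content').

    Hypothetical helper: raises a KeyError if a documented key is missing.
    """
    # Assumption: the JSON files are UTF-8 encoded.
    with open(path, encoding='utf-8') as infile:
        article = json.load(infile)

    missing = EXPECTED_KEYS - set(article)
    if missing:
        raise KeyError('missing keys in {}: {}'.format(path, sorted(missing)))

    return {key: article[key] for key in sorted(EXPECTED_KEYS - {'content'})}
```

Checking for missing keys up front is a design choice: with 10,000 files, a clear error that names the file is easier to debug than a failure later in the pipeline.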

We will use [spaCy](https://spacy.io/) to run the linguistic processors. Please check the notebook **Topic 4** for instructions on how to install spaCy. Here is an example of how spaCy works:

In [5]:
from spacy.en import English
nlp = English()

In [6]:
with open('data/4848.json') as infile:
    a_json = json.load(infile)

In [10]:
spacy_output = nlp(a_json['content'])

In [12]:
for sent_obj in spacy_output.sents: # for each sentence
    for token_obj in sent_obj: # for each token in a sentence
        print()
        print('token', token_obj.text)
        print('type', token_obj.lemma_)
        print('part of speech', token_obj.pos_)
        print('stopword', token_obj.is_stop)
        break
    break


token CLEVELAND
type cleveland
part of speech NOUN
stopword False


To summarize, for each article you will have two sources of information:
* you have the **metadata** in the json file
* you have the **annotations** from spaCy

## Statistics
Write a function/script that takes a list/set/generator of paths to one or more JSON files in the folder **data** as input and outputs a dictionary containing the:
* average number of tokens
    * per article
* average number of types
    * per article
* average number of sentences
    * per article
* 10 longest sentences in all articles
* type-token ratio (number of types / number of tokens)
    * per article
    * overall
* frequency per source (see metadata)
* frequency per media-type (see metadata)
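
The statistics listed above could be computed along the following lines. This is only a sketch under stated assumptions, not a reference solution: it assumes the spaCy output has already been reduced to plain Python data, with each article a dictionary holding the hypothetical keys `tokens` and `sentences` next to the metadata keys `source` and `media-type`. It counts types as distinct lowercased tokens and measures sentence length in characters; both choices are assumptions that you may want to replace (for example, by lemmas and token counts).

```python
from collections import Counter


def compute_statistics(articles):
    """Compute corpus statistics from already-annotated articles.

    Assumption: each article is a dict with the hypothetical keys
    'tokens' (list of str), 'sentences' (list of str), 'source' and
    'media-type'. Types are counted as distinct lowercased tokens.
    """
    articles = list(articles)  # allow generators as input
    if not articles:
        raise ValueError('no articles given')

    token_counts = []
    type_counts = []
    sentence_counts = []
    all_types = set()
    all_sentences = []
    per_source = Counter()
    per_media_type = Counter()

    for article in articles:
        types = {token.lower() for token in article['tokens']}
        token_counts.append(len(article['tokens']))
        type_counts.append(len(types))
        sentence_counts.append(len(article['sentences']))
        all_types.update(types)
        all_sentences.extend(article['sentences'])
        per_source[article['source']] += 1
        per_media_type[article['media-type']] += 1

    n_articles = len(articles)
    total_tokens = sum(token_counts)
    return {
        'avg_tokens': total_tokens / n_articles,
        'avg_types': sum(type_counts) / n_articles,
        'avg_sentences': sum(sentence_counts) / n_articles,
        # Sentence length is measured in characters here (an assumption).
        'longest_sentences': sorted(all_sentences, key=len, reverse=True)[:10],
        # Guard against empty articles to avoid dividing by zero.
        'ttr_per_article': [
            n_types / n_tokens if n_tokens else 0.0
            for n_types, n_tokens in zip(type_counts, token_counts)
        ],
        'ttr_overall': len(all_types) / total_tokens if total_tokens else 0.0,
        'freq_per_source': dict(per_source),
        'freq_per_media_type': dict(per_media_type),
    }
```

Keeping the statistics separate from the spaCy processing makes the function easy to test on small hand-made inputs before running it on the full corpus.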
    
In addition, please write the statistics to a TSV (tab separated) file.
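
One possible way to produce the TSV file is the standard-library `csv` module with a tab delimiter. The sketch below is an illustration, not a required format: the function name `write_tsv` and the idea of one row per article are assumptions, and you are free to choose your own layout.

```python
import csv


def write_tsv(rows, fieldnames, outfile):
    """Write an iterable of dicts to an open text file as tab-separated values.

    Hypothetical helper: 'fieldnames' fixes the column order and becomes
    the header row.
    """
    writer = csv.DictWriter(outfile, fieldnames=fieldnames, delimiter='\t',
                            lineterminator='\n')
    writer.writeheader()
    writer.writerows(rows)
```

The function takes an open file object rather than a path, so it can be tested with an in-memory buffer. When writing to disk, open the file with `newline=''`, as the `csv` documentation recommends.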

## Visualizing interesting properties of the data
This part is completely open. Feel free to visualize whatever you want.
Please create at least three visualizations. Please take a look at **Topic 6** for inspiration.

## 5-minute presentation

You'll also have to present your work in the last session of the course. This is useful for several reasons. 
* First, we want you to reflect on the way you've handled your project. 
* Second, it's useful to see how other people tackle similar problems.
* Third, your fellow students' work may be useful to you in the future as well. We'd like to encourage you to check out what your classmates did.

## Preparation for the presentation

Here are some questions to help you reflect on your project. You don't need to
address all of these points in your presentation (5 minutes is *really* short!).
Just highlight the points that are most important for you.

* What did you do?
* How did you do it?
    - What modules did you use? What are they useful for?
    - What kind of data did you use? How did you get it?
    - How did you manage your project? What did your workflow look like?
* How can others use or build on your project?
* What was the greatest challenge for you?
* What took the largest amount of time? (try to keep track of this)
* What would you do differently next time?

In terms of format, you can choose any kind of presentation that you like (powerpoint, notebook, web demo, ...).

## Grading

We will consider the following questions (along with the core principles) to evaluate your final assignment:
* Does the code work?
* Does the code fulfill the requirements?
* Is the code well-documented?
* Is the code clear and understandable?
* Is the code modular?
* Is the code easily extensible?
* How scalable is the solution?
* Is the code written in accordance with [the community standards](http://pep8.org/)? (That is: PEP8)

The weighting of the grading is as follows:
<table bgcolor="#eeeeee" border=1>
<tr bgcolor="#cccccc"><td></td><td>Weight (%)</td></tr>
<tr><td bgcolor=#cccccc>Code Quality</td><td>40</td></tr>
<tr><td bgcolor=#cccccc>Statistics</td><td>30</td></tr>
<tr><td bgcolor=#cccccc>Visualizations</td><td>20</td></tr>
<tr><td bgcolor=#cccccc>Documentation</td><td>10</td></tr>
</table>