# Assignment 1 PREVIEW - Python and NLTK for Text Processing
*Submission deadline: Friday 16 March 2018, 11:00pm*

## Objectives of this assignment

In this assignment you will practise using Python packages for text processing, as a first step towards implementing real-world document processing systems.

The deadline for this assignment falls before the census date, so the assignment can serve as a diagnostic test to help you decide whether to remain in the unit or withdraw without penalty.


In [1]:
import nltk
nltk.download('punkt')
nltk.download('gutenberg')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.


True

### Word counts (1 mark)
Implement a function that returns a vector of word counts in a given text, given a vector of words. You may use third-party modules in your solution if you wish. As part of this exercise you will need to split the text into words. When you do so, note that NLTK's tokeniser works best when it takes a single sentence as its input. To tokenise a text that has multiple sentences, it is therefore best to split the text into sentences first, and then tokenise each sentence. Look at the lecture notes and exercises of the week 1 workshop for examples of how to do this.
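
The counting step can be sketched as follows. This is only an illustration, not the template's required function: the name `word_counts_from_sentences` and the sample data are hypothetical, and the input is assumed to be already split into sentences and tokenised (with NLTK this would typically be done with `nltk.sent_tokenize` followed by `nltk.word_tokenize` on each sentence).

```python
import collections

def word_counts_from_sentences(tokenised_sentences, wordlist):
    """Return how often each word in `wordlist` occurs in the tokenised sentences.

    Hypothetical helper: `tokenised_sentences` is a list of lists of tokens.
    """
    # Flatten the sentences into one stream of tokens and count them once.
    counts = collections.Counter(
        token for sentence in tokenised_sentences for token in sentence
    )
    # A Counter returns 0 for tokens it has never seen.
    return [counts[word] for word in wordlist]

# Hand-made example data. With NLTK, it could instead be produced by:
#   [nltk.word_tokenize(s) for s in nltk.sent_tokenize(text)]
tokenised = [["The", "cat", "sat", "."], ["The", "dog", "sat", "too", "."]]
print(word_counts_from_sentences(tokenised, ["The", "sat", "bird"]))  # [2, 2, 0]
```

Note that this sketch is case-sensitive; whether to normalise case is a design decision that depends on the tests in the template.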

### PoS counts (1 mark)
Implement a function that returns a vector of counts of parts of speech in a given text, given a vector of parts of speech. To determine the parts of speech, use NLTK's `pos_tag_sents` function with the `'universal'` tag set. See the lecture notes and practical exercises from week 1 for details of how to use `pos_tag_sents`.
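
As a rough sketch of the counting step only, the code below works on sentences that have already been tagged. The function name `pos_counts_from_tagged` and the hand-tagged sample are hypothetical; in the assignment the tagged sentences would come from `nltk.pos_tag_sents(tokenised_sentences, tagset='universal')`.

```python
import collections

def pos_counts_from_tagged(tagged_sentences, poslist):
    """Return how often each tag in `poslist` occurs in the tagged sentences.

    Hypothetical helper: `tagged_sentences` is a list of lists of
    (word, tag) pairs, the shape that nltk.pos_tag_sents returns.
    """
    counts = collections.Counter(
        tag for sentence in tagged_sentences for _word, tag in sentence
    )
    return [counts[pos] for pos in poslist]

# Hand-tagged example data (an assumption, not real tagger output).
tagged_sentences = [
    [("The", "DET"), ("cat", "NOUN"), ("sat", "VERB"), (".", ".")],
    [("Dogs", "NOUN"), ("bark", "VERB"), (".", ".")],
]
print(pos_counts_from_tagged(tagged_sentences, ["NOUN", "VERB", "ADJ"]))  # [2, 2, 0]
```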

### Readability (1 mark)
A popular formula to measure the readability of a document is the [Flesch reading-ease test (FRES)](https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests), which gives higher scores to texts that are easier to read. According to Wikipedia, the formula is:

![formula](https://wikimedia.org/api/rest_v1/media/math/render/svg/bd4916e193d2f96fa3b74ee258aaa6fe242e110e)

Write a function that returns the FRES of a text. To help you in this exercise, below is a simple function that you can use to approximate the number of syllables in a word. This function is based on the calculation of the word length used for the [Porter stemmer](https://tartarus.org/martin/PorterStemmer/def.txt):

In [30]:
import re
VC = re.compile('[aeiou]+[^aeiou]+', re.I)
def count_syllables(word):
    return len(VC.findall(word))
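
The formula can be applied along the following lines. This is a minimal sketch under stated assumptions: the function name `fres_from_sentences` is hypothetical, the input is assumed to be a non-empty list of tokenised sentences with punctuation tokens already removed, and the constants are those of the standard Flesch reading-ease formula (206.835, 1.015 and 84.6).

```python
import re

# Same syllable approximation as in the cell above.
VC = re.compile('[aeiou]+[^aeiou]+', re.I)

def count_syllables(word):
    return len(VC.findall(word))

def fres_from_sentences(tokenised_sentences):
    """Approximate the Flesch reading-ease score (hypothetical helper).

    Assumes at least one sentence and at least one word.
    """
    words = [w for sentence in tokenised_sentences for w in sentence]
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(tokenised_sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word

# Hand-made example: 2 sentences, 5 words, 4 approximate syllables
# ("The" counts as 0 because its vowel is not followed by a consonant).
sentences = [["The", "cat", "sat"], ["Dogs", "bark"]]
print(fres_from_sentences(sentences))  # approximately 136.62
```

Because `count_syllables` is only an approximation, scores computed this way will differ somewhat from those of tools that use a pronunciation dictionary.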

### Advanced task (2 marks)

During the last practical exercises of week 1 you were asked to identify all the cardinal numbers in a list of tokens. In this advanced task, you will need to identify all the **ordinal numbers** such as "first", "22nd", etc. We will use the Brown corpus, which, as you know, is annotated with the parts of speech. The Brown corpus tags for ordinal numbers begin with 'OD'. The following code counts all the tokens tagged as ordinal numbers in the "news" section of NLTK's Brown corpus:

In [2]:
nltk.download('brown')
tagged = nltk.corpus.brown.tagged_words(categories='news')

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


In [4]:
import collections
c = collections.Counter([t for w, t in tagged if t[:2] == 'OD'])
c

Counter({'OD': 309, 'OD-HL': 1, 'OD-TL': 30})

Implement a function that annotates all ordinal numbers with 'OD' and everything else with the empty string ''. As an example to get you started, you can reuse the following code, which uses a simple regular expression to tag every token that ends in 'st', 'nd', 'rd' or 'th'. The function correctly labels words such as 'first' and 'fifth', but it incorrectly labels words like 'tooth' and 'and':

In [5]:
import re
regexp = re.compile('.*(st|nd|rd|th)$')
def annotateOD(listoftokens):
    result = []
    for t in listoftokens:
        if regexp.match(t):
            result.append((t, 'OD'))
        else:
            result.append((t, ''))
    return result

In [6]:
annotateOD("the second tooth".split())

[('the', ''), ('second', 'OD'), ('tooth', 'OD')]
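
One way to make such a pattern more selective is to constrain what precedes the suffix. The sketch below is an illustration only, with hypothetical names `DIGIT_ORDINAL` and `is_digit_ordinal`: it accepts tokens made of digits followed by an ordinal suffix, so it handles "22nd" but deliberately ignores spelled-out ordinals such as "first", which need separate treatment.

```python
import re

# Hypothetical pattern: one or more digits, then an ordinal suffix, nothing else.
DIGIT_ORDINAL = re.compile(r'^\d+(st|nd|rd|th)$', re.I)

def is_digit_ordinal(token):
    """Return True for tokens such as '22nd' or '3rd' (digits plus suffix)."""
    return DIGIT_ORDINAL.match(token) is not None

tokens = ["22nd", "tooth", "and", "3rd", "first"]
print([t for t in tokens if is_digit_ordinal(t)])  # ['22nd', '3rd']
```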

To evaluate the function we will compute the F1 score using the following code:

In [7]:
def compute_f1(result, tagged):
    assert len(result) == len(tagged) # This is a check that the length of the result and tagged are equal
    correct = [result[i][0] for i in range(len(result)) if result[i][1][:2] == 'OD' and tagged[i][1][:2] == 'OD']
    numbers_result = [result[i][0] for i in range(len(result)) if result[i][1][:2] == 'OD']
    numbers_tagged = [tagged[i][0] for i in range(len(tagged)) if tagged[i][1][:2] == 'OD']
    if len(numbers_tagged) > 0:
        r = len(correct)/len(numbers_tagged)
    else:
        r = 0.0
    if len(numbers_result) > 0:
        p = len(correct)/len(numbers_result)
    else:
        p = 0.0
    if r + p == 0:
        return 0.0  # Avoid a division by zero when no token is correctly labelled
    return 2*r*p/(r+p)

In [8]:
words = [w for w, t in tagged]
result = annotateOD(words)
compute_f1(result, tagged)

0.11333103685842233
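
To understand a low score like this one, it helps to list the tokens that were labelled incorrectly. The following sketch is one possible way to do so, not the code from the practical exercises: the function name `error_analysis` and the small gold and predicted lists are hypothetical, made up for illustration.

```python
def error_analysis(result, tagged):
    """Return (false_positives, false_negatives) as lists of tokens.

    Hypothetical helper: `result` and `tagged` are aligned lists of
    (token, tag) pairs, as expected by compute_f1.
    """
    assert len(result) == len(tagged)
    # Labelled 'OD' by the system, but not an ordinal in the gold standard.
    false_pos = [r[0] for r, g in zip(result, tagged)
                 if r[1][:2] == 'OD' and g[1][:2] != 'OD']
    # An ordinal in the gold standard, but missed by the system.
    false_neg = [g[0] for r, g in zip(result, tagged)
                 if r[1][:2] != 'OD' and g[1][:2] == 'OD']
    return false_pos, false_neg

# Made-up gold annotations and made-up system predictions.
gold = [('the', 'AT'), ('second', 'OD'), ('tooth', 'NN'), ('and', 'CC'), ('first', 'OD')]
predicted = [('the', ''), ('second', 'OD'), ('tooth', 'OD'), ('and', 'OD'), ('first', '')]

fp, fn = error_analysis(predicted, gold)
print(fp)  # ['tooth', 'and']
print(fn)  # ['first']
```

False positives lower precision and false negatives lower recall, so inspecting both lists shows which kind of error to address first.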

Feel free to reuse any code from the practical exercises, including the code to identify false positives and false negatives. This exercise will be marked as follows.

* F1 > 0.9: 2 marks
* F1 > 0.3: 1 mark
* F1 <= 0.3: 0 marks

**Note that your code should not use any large lists of words, and should not use any part of speech taggers.**

## Submission
Submit a single Python file with the solutions to all the questions. The template provided contains all the functions defined as stubs. Make sure that you do not change the names of the functions, since the submission will use an automatic marker that relies on these exact names and argument structure. The template includes a few simple tests using [Python's doctest](https://docs.python.org/3/library/doctest.html) environment. These tests are there to help you, but note that we will use a separate set of tests when we assess your submission. It is your responsibility to run your own tests, in addition to the doctests provided.
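
For readers unfamiliar with doctest, the sketch below shows the general shape of such a test. The function `count_vowels` is a hypothetical example and is not one of the functions in the template.

```python
import doctest

def count_vowels(word):
    """Return the number of vowels in `word` (hypothetical example function).

    The lines starting with >>> are run by doctest, and the line that
    follows each one is the output it expects.

    >>> count_vowels("banana")
    3
    >>> count_vowels("sky")
    0
    """
    return sum(1 for ch in word.lower() if ch in "aeiou")

if __name__ == "__main__":
    # Runs every docstring example in this module and reports any mismatch.
    doctest.testmod()
```

The same tests can also be run from the command line with `python -m doctest yourfile.py`, where `yourfile.py` stands for the name of your own file.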

The submission must be a single Python file. Do not submit several files or a zip file, since the automarker will not be able to process such a submission.

Note that the deadline is a hard deadline and there will be a penalty of one mark per day of late submission. In addition, since the submission date is a week before the census date of 26 March 2018, late submissions might not be assessed before the census date.