# Python-programming task

In this task, you will be asked to implement a tokeniser and a function that provides simple corpus statistics.

In a separate file called `tokeniser.py`, you should define a `Tokeniser` class with the following public methods:

```python
class Tokeniser:
    def __init__(self):
        pass

    def tokenise_on_punctuation(self, text):
        pass

    def train(self, text):
        pass

    def tokenise(self, text, use_unk=False):
        pass

    def tokenise_with_count_threshold(self, text, threshold, use_unk=False):
        pass

    def tokenise_with_freq_threshold(self, text, threshold, use_unk=False):
        pass
```

The `__init__` method should initialise the tokeniser. The `tokenise_on_punctuation` method can be used without training the tokeniser. It should split the input on any kind of whitespace (including tab characters and newlines) and on punctuation marks; as the examples below show, neither the whitespace nor the punctuation appears in the output. The punctuation marks are as follows: `!"#%&'()*,-./:;?@[\]_{}¡§«¶·»¿‘’“”–—`
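The splitting rule above can be sketched with a single regular expression. This is only one possible approach, not a reference solution; the names `PUNCTUATION`, `_SPLIT_RE`, and `split_on_punctuation` are hypothetical and chosen for illustration.

```python
import re

# The punctuation marks listed in the task description.
PUNCTUATION = '!"#%&\'()*,-./:;?@[\\]_{}¡§«¶·»¿‘’“”–—'

# One character class covering every punctuation mark plus any whitespace (\s).
# re.escape keeps characters such as ']' and '\' literal inside the class.
_SPLIT_RE = re.compile('[' + re.escape(PUNCTUATION) + r'\s]+')


def split_on_punctuation(text):
    """Split text on whitespace and punctuation, dropping empty strings."""
    return [token for token in _SPLIT_RE.split(text) if token]


print(split_on_punctuation("Atlanta's recent\tprimary\nelection."))
# ['Atlanta', 's', 'recent', 'primary', 'election']
```

Filtering out empty strings matters because `re.split` yields an empty first or last element whenever the text starts or ends with a separator.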

The `train` method should take as input a corpus and prepare the tokeniser for use on new texts. The trained tokeniser can be used by invoking the last three methods. If the user tries to invoke them _before_ training the tokeniser, your code should raise a `RuntimeError` with the message `The tokeniser has not been trained yet.`.
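One way to enforce the "not trained yet" rule is to keep the training state in an attribute that starts out as `None` and to check it in a small private helper. The sketch below shows only that guard; the class name `TrainedGuardSketch`, the `_counts` attribute, and the `_require_trained` helper are assumptions made for illustration, and plain `str.split` stands in for the real punctuation-aware splitting.

```python
from collections import Counter


class TrainedGuardSketch:
    """Illustrative only: shows the training guard, not a full tokeniser."""

    def __init__(self):
        self._counts = None  # None means train() has not been called yet

    def train(self, text):
        # Assumption: whitespace splitting is a placeholder for the
        # punctuation-and-whitespace splitting the task requires.
        self._counts = Counter(text.split())

    def _require_trained(self):
        if self._counts is None:
            raise RuntimeError('The tokeniser has not been trained yet.')

    def tokenise(self, text, use_unk=False):
        self._require_trained()  # every training-dependent method starts here
        return text.split()      # placeholder body, ignores the vocabulary
```

Using `None` rather than an empty `Counter` as the initial value keeps "never trained" distinct from "trained on an empty corpus".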

The methods should work as follows:

1. `tokenise` splits the input on punctuation and whitespace and then iterates over the resulting tokens. If a token is found in the vocabulary, it is added to the output. Otherwise the behaviour is determined by the `use_unk` flag: if it is set to `True`, each unknown token should be replaced with `'UNK'`; otherwise it should be split into individual characters.
2. `tokenise_with_count_threshold` also first splits the input on punctuation and whitespace. It then checks how many times each token appeared in the training corpus. If it appeared `threshold` times or more, it is added to the output; otherwise it is treated as an unknown token in the same way as above.
3. `tokenise_with_freq_threshold` does the same thing with regard to _relative frequency_ of the current token in the training corpus. Relative frequency ranges from 0 (not found in the training corpus) to 1 (all of the tokens from the training corpus are this same token).
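The unknown-token handling shared by the three methods can be sketched as a standalone function over already-split tokens. This is a minimal illustration under stated assumptions, not the required implementation: `apply_threshold` and `relative_frequency` are hypothetical names, and `counts` is assumed to be a `collections.Counter` built from the training corpus.

```python
from collections import Counter


def apply_threshold(tokens, counts, min_count=1, use_unk=False):
    """Keep tokens seen at least min_count times; treat the rest as unknown."""
    output = []
    for token in tokens:
        if counts[token] >= min_count:  # a Counter returns 0 for unseen tokens
            output.append(token)
        elif use_unk:
            output.append('UNK')
        else:
            output.extend(token)  # extending with a string adds its characters
    return output


def relative_frequency(token, counts):
    """Share of all training tokens that are equal to this token (0 to 1)."""
    total = sum(counts.values())
    return counts[token] / total if total else 0.0


counts = Counter(['the', 'the', 'the', 'cat'])
print(apply_threshold(['the', 'cat', 'dog'], counts))
# ['the', 'cat', 'd', 'o', 'g']
print(apply_threshold(['the', 'cat', 'dog'], counts, min_count=2, use_unk=True))
# ['the', 'UNK', 'UNK']
print(relative_frequency('the', counts))
# 0.75
```

With `min_count=1` the check reduces to plain vocabulary membership, which is the behaviour described for `tokenise`. A frequency-based variant would compare `relative_frequency(token, counts)` against the threshold instead of the raw count.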

All the operations of the tokeniser should be case-sensitive, i.e. 'a' and 'A' are different tokens, as are 'this' and 'tHis'.

You can use the code below to test the behaviour of your tokeniser. When evaluating your code, we will use the same code but with other inputs. You are also encouraged to use other inputs to make sure that the code is doing what it is supposed to do.

You can implement as many additional methods of the `Tokeniser` class, auxiliary functions, or even auxiliary classes as you want. The only restrictions are that you may use only the Python standard library and that all your code must reside in the `tokeniser.py` file.

In addition to the tokeniser, in the same file you should also implement a function called `get_stats`. It should take a tokenised corpus as input and return a dictionary with the following fields:

1. `type_count`, the number of different tokens in the corpus
2. `token_count`, the total number of tokens in the corpus
3. `type_token_ratio`, the ratio between the two, a common measure of lexical variability
4. `token_count_by_length`, the number of tokens of different lengths, measured in characters, found in the corpus
5. `average_token_length`, the mean length in characters of all tokens in the corpus
6. `token_length_std`, the standard deviation of the same
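The six fields can be computed with the standard library alone. The sketch below is one possible shape for such a function, under two assumptions that the task text does not settle: the input is a non-empty list of strings, and the standard deviation is the population standard deviation (`statistics.pstdev`) rather than the sample one (`statistics.stdev`).

```python
import statistics
from collections import Counter


def get_stats(tokens):
    """Return simple corpus statistics for a non-empty list of tokens."""
    lengths = [len(token) for token in tokens]
    type_count = len(set(tokens))  # distinct tokens
    token_count = len(tokens)      # all tokens, repeats included
    return {
        'type_count': type_count,
        'token_count': token_count,
        'type_token_ratio': type_count / token_count,
        # Map each token length to the number of tokens of that length,
        # ordered by length.
        'token_count_by_length': dict(sorted(Counter(lengths).items())),
        'average_token_length': sum(lengths) / token_count,
        # Assumption: population standard deviation.
        'token_length_std': statistics.pstdev(lengths),
    }


stats = get_stats(['a', 'bb', 'a', 'ccc'])
print(stats['type_token_ratio'])       # 0.75
print(stats['token_count_by_length'])  # {1: 2, 2: 1, 3: 1}
print(stats['average_token_length'])   # 1.75
```

For a corpus of realistic size the two standard-deviation variants differ only in the later decimal places, so comparing against a reference output is the simplest way to find out which one is expected.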

See the end of this notebook for a usage example.

In [1]:
from tokeniser import Tokeniser

In [2]:
import urllib.request

raw_bytes = urllib.request.urlopen(
    'http://www.sls.hawaii.edu/bley-vroman/brown.txt')
brown_corpus = raw_bytes.read().decode('utf8').replace('\r\n', '\n')

In [3]:
brown_250 = brown_corpus[:250]

In [4]:
tokeniser = Tokeniser()

In [5]:
print(repr(
    tokeniser.tokenise_on_punctuation(brown_250)))

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', 'Atlanta', 's', 'recent', 'primary', 'election', 'produced', 'no', 'evidence', 'that', 'any', 'irregularities', 'took', 'place', 'The', 'jury', 'further', 'said', 'in', 'term', 'end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', 'which', 'had']


In [6]:
tokeniser.tokenise_on_punctuation(
    'Budget : l’effort demandé aux départements sera réduit «très significativement»,'
    ' annonce Michel Barnier')

['Budget',
 'l',
 'effort',
 'demandé',
 'aux',
 'départements',
 'sera',
 'réduit',
 'très',
 'significativement',
 'annonce',
 'Michel',
 'Barnier']

In [7]:
# Should fail at this time
tokeniser.tokenise(brown_250)

RuntimeError: The tokeniser has not been trained yet.

In [8]:
tokeniser.train(brown_corpus)

In [9]:
print(repr(tokeniser.tokenise(brown_250)))

['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', 'Atlanta', 's', 'recent', 'primary', 'election', 'produced', 'no', 'evidence', 'that', 'any', 'irregularities', 'took', 'place', 'The', 'jury', 'further', 'said', 'in', 'term', 'end', 'presentments', 'that', 'the', 'City', 'Executive', 'Committee', 'which', 'had']


In [10]:
print(repr(tokeniser.tokenise('Budget : l’effort demandé aux départements sera réduit «très significativement»,'
                              ' annonce Michel Barnier')))

['Budget', 'l', 'effort', 'd', 'e', 'm', 'a', 'n', 'd', 'é', 'aux', 'd', 'é', 'p', 'a', 'r', 't', 'e', 'm', 'e', 'n', 't', 's', 'sera', 'r', 'é', 'd', 'u', 'i', 't', 't', 'r', 'è', 's', 's', 'i', 'g', 'n', 'i', 'f', 'i', 'c', 'a', 't', 'i', 'v', 'e', 'm', 'e', 'n', 't', 'a', 'n', 'n', 'o', 'n', 'c', 'e', 'M', 'i', 'c', 'h', 'e', 'l', 'B', 'a', 'r', 'n', 'i', 'e', 'r']


In [11]:
print(repr(
    tokeniser.tokenise(
    'Budget : l’effort demandé aux départements sera réduit «très significativement», annonce Michel Barnier',
    use_unk=True)))

['Budget', 'l', 'effort', 'UNK', 'aux', 'UNK', 'sera', 'UNK', 'UNK', 'UNK', 'UNK', 'UNK', 'UNK']


In [12]:
print(repr(
    tokeniser.tokenise_with_count_threshold(brown_250, 1000)))

['The', 'F', 'u', 'l', 't', 'o', 'n', 'C', 'o', 'u', 'n', 't', 'y', 'G', 'r', 'a', 'n', 'd', 'J', 'u', 'r', 'y', 'said', 'F', 'r', 'i', 'd', 'a', 'y', 'an', 'i', 'n', 'v', 'e', 's', 't', 'i', 'g', 'a', 't', 'i', 'o', 'n', 'of', 'A', 't', 'l', 'a', 'n', 't', 'a', 's', 'r', 'e', 'c', 'e', 'n', 't', 'p', 'r', 'i', 'm', 'a', 'r', 'y', 'e', 'l', 'e', 'c', 't', 'i', 'o', 'n', 'p', 'r', 'o', 'd', 'u', 'c', 'e', 'd', 'no', 'e', 'v', 'i', 'd', 'e', 'n', 'c', 'e', 'that', 'any', 'i', 'r', 'r', 'e', 'g', 'u', 'l', 'a', 'r', 'i', 't', 'i', 'e', 's', 't', 'o', 'o', 'k', 'p', 'l', 'a', 'c', 'e', 'The', 'j', 'u', 'r', 'y', 'f', 'u', 'r', 't', 'h', 'e', 'r', 'said', 'in', 't', 'e', 'r', 'm', 'e', 'n', 'd', 'p', 'r', 'e', 's', 'e', 'n', 't', 'm', 'e', 'n', 't', 's', 'that', 'the', 'C', 'i', 't', 'y', 'E', 'x', 'e', 'c', 'u', 't', 'i', 'v', 'e', 'C', 'o', 'm', 'm', 'i', 't', 't', 'e', 'e', 'which', 'had']


In [13]:
print(repr(
    tokeniser.tokenise_with_count_threshold(brown_250, 1000, use_unk=True)))

['The', 'UNK', 'UNK', 'UNK', 'UNK', 'said', 'UNK', 'an', 'UNK', 'of', 'UNK', 's', 'UNK', 'UNK', 'UNK', 'UNK', 'no', 'UNK', 'that', 'any', 'UNK', 'UNK', 'UNK', 'The', 'UNK', 'UNK', 'said', 'in', 'UNK', 'UNK', 'UNK', 'that', 'the', 'UNK', 'UNK', 'UNK', 'which', 'had']


In [14]:
print(repr(
    tokeniser.tokenise_with_freq_threshold(brown_250, 0.00001)))

['The', 'Fulton', 'County', 'Grand', 'J', 'u', 'r', 'y', 'said', 'Friday', 'an', 'investigation', 'of', 'Atlanta', 's', 'recent', 'primary', 'election', 'produced', 'no', 'evidence', 'that', 'any', 'i', 'r', 'r', 'e', 'g', 'u', 'l', 'a', 'r', 'i', 't', 'i', 'e', 's', 'took', 'place', 'The', 'jury', 'further', 'said', 'in', 'term', 'end', 'p', 'r', 'e', 's', 'e', 'n', 't', 'm', 'e', 'n', 't', 's', 'that', 'the', 'City', 'E', 'x', 'e', 'c', 'u', 't', 'i', 'v', 'e', 'Committee', 'which', 'had']


In [15]:
print(repr(
    tokeniser.tokenise_with_freq_threshold(brown_250, 0.00001, True)))

['The', 'Fulton', 'County', 'Grand', 'UNK', 'said', 'Friday', 'an', 'investigation', 'of', 'Atlanta', 's', 'recent', 'primary', 'election', 'produced', 'no', 'evidence', 'that', 'any', 'UNK', 'took', 'place', 'The', 'jury', 'further', 'said', 'in', 'term', 'end', 'UNK', 'that', 'the', 'City', 'UNK', 'Committee', 'which', 'had']


In [16]:
raw_bytes_austen = urllib.request.urlopen(
    'https://www.gutenberg.org/cache/epub/1342/pg1342.txt')
pride_and_prejudice = raw_bytes_austen.read().decode('utf8').replace('\r\n', '\n').replace('\ufeff', '')

In [17]:
pride_and_prejudice_tokenised = tokeniser.tokenise_with_count_threshold(pride_and_prejudice, 5)

In [18]:
print(repr(
    pride_and_prejudice_tokenised[:150]))

['The', 'P', 'r', 'o', 'j', 'e', 'c', 't', 'G', 'u', 't', 'e', 'n', 'b', 'e', 'r', 'g', 'e', 'B', 'o', 'o', 'k', 'of', 'P', 'r', 'i', 'd', 'e', 'and', 'P', 'r', 'e', 'j', 'u', 'd', 'i', 'c', 'e', 'This', 'e', 'b', 'o', 'o', 'k', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'in', 'the', 'United', 'States', 'and', 'most', 'other', 'parts', 'of', 'the', 'world', 'at', 'no', 'cost', 'and', 'with', 'almost', 'no', 'restrictions', 'whatsoever', 'You', 'may', 'copy', 'it', 'give', 'it', 'away', 'or', 're', 'use', 'it', 'under', 'the', 'terms', 'of', 'the', 'P', 'r', 'o', 'j', 'e', 'c', 't', 'G', 'u', 't', 'e', 'n', 'b', 'e', 'r', 'g', 'L', 'i', 'c', 'e', 'n', 's', 'e', 'included', 'with', 'this', 'e', 'b', 'o', 'o', 'k', 'or', 'o', 'n', 'l', 'i', 'n', 'e', 'at', 'w', 'w', 'w', 'g', 'u', 't', 'e', 'n', 'b', 'e', 'r', 'g', 'o', 'r', 'g', 'If', 'you', 'are', 'not', 'located', 'in', 'the', 'United', 'States']


In [19]:
from tokeniser import get_stats

In [20]:
get_stats(pride_and_prejudice_tokenised)

{'type_count': 4841,
 'token_count': 190532,
 'type_token_ratio': 0.025407805512984695,
 'token_count_by_length': {1: 72415,
  2: 24992,
  3: 30186,
  4: 23043,
  5: 11450,
  6: 8719,
  7: 7567,
  8: 4286,
  9: 4126,
  10: 1818,
  11: 915,
  12: 738,
  13: 197,
  14: 58,
  15: 21,
  16: 1},
 'average_token_length': 3.0435254970293704,
 'token_length_std': 2.3760764980000806}