# Corpus Preprocessing for Language Modeling

- __author__: Evgeny A. Stepanov
- __e-mail__: stepanov.evgeny.a@gmail.com

This notebook is part of the Laboratory Work for [Language Understanding Systems class](http://disi.unitn.it/~riccardi/page7/page13/page13.html) of [University of Trento](https://www.unitn.it/en).

Dan Jurafsky and James H. Martin's __Speech and Language Processing__ ([3rd ed. draft](https://web.stanford.edu/~jurafsky/slp3/)) is the recommended reading.

- Section *Corpora and Counting* covers some concepts of *Chapter 2: "Regular Expressions, Text Normalization, Edit Distance"*.

__Requirements__

- [NL2SparQL4NLU](https://github.com/esrel/NL2SparQL4NLU) dataset

    - run `git clone https://github.com/esrel/NL2SparQL4NLU.git`
    

## 1. Corpora and Counting

### 1.1. Corpus

A [corpus](https://en.wikipedia.org/wiki/Text_corpus) is a collection of written or spoken texts that is used for language research. Before doing anything with a corpus, we need to know its properties:

__Corpus Properties__:
- *Format* -- how to read/load it?
- *Language* -- which tools/models can I use?
- *Annotation* -- what is it intended for?
- *Split* for __Evaluation__: (terminology varies from source to source)

| Set         | Purpose                                       |
|:------------|:----------------------------------------------|
| Training    | training model, extracting rules, etc.        |
| Development | tuning, optimization, intermediate evaluation |
| Test        | final evaluation (remains unseen)             |


#### NL2SparQL4NLU

- __Format__:

    - Utterance (sentence) per line
    - Tokenized
    - Lowercased

- __Language__: English monolingual

- __Annotation__: None (for now)

- __Split__: training & test sets

#### Exercise

- define a function to load a corpus into a list-of-lists

- load `NL2SparQL4NLU/dataset/NL2SparQL4NLU.train.utterances.txt`
- print first `2` words (tokens) of the first `10` sentences


In [1]:
trn='NL2SparQL4NLU/dataset/NL2SparQL4NLU.train.utterances.txt'
tst='NL2SparQL4NLU/dataset/NL2SparQL4NLU.test.utterances.txt'
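
One possible way to approach the loading exercise is sketched below. The function name `read_corpus` is hypothetical, and skipping empty lines is an assumption about the file rather than a documented property of the dataset. The demo writes a tiny stand-in file with made-up utterances so that it runs without the dataset.

```python
import os
import tempfile


def read_corpus(corpus_file):
    """
    Read a corpus with one whitespace-tokenized utterance per line.
    :param corpus_file: path to the corpus file
    :return: corpus as a list of lists of tokens
    """
    with open(corpus_file, encoding="utf-8") as f:
        # Assumption: empty lines are not utterances, so they are skipped.
        return [line.split() for line in f if line.strip()]


# Stand-in file with made-up utterances (not taken from the dataset).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("show me movies\nwho directed it\n")
demo = read_corpus(tmp.name)
os.remove(tmp.name)

# First 2 tokens of the first 10 sentences.
for sent in demo[:10]:
    print(sent[:2])
```

With the dataset cloned, the same function could be called as `read_corpus(trn)`.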

### 1.2. Corpus Descriptive Statistics (Counting)

*Corpus* description in terms of:

- total number of words
- total number of utterances


#### Exercise

- define a function to compute corpus descriptive statistics -- total utterance and word counts.
- compute the statistics for the __training__ and __test__ sets of NL2SparQL4NLU dataset. 
- compare the computed statistics with the reference values below.


| Metric           | Train  | Test   |
|------------------|-------:|-------:|
| Total Words      | 21,453 |  7,117 |
| Total Utterances |  3,338 |  1,084 |
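
A minimal sketch of the counting exercise, assuming the corpus is already loaded as a list-of-lists. The name `corpus_stats` and the toy corpus are hypothetical; the reference values in the table can only be reproduced on the actual dataset.

```python
def corpus_stats(corpus):
    """
    Compute total utterance and word (token) counts.
    :param corpus: corpus as a list of lists of tokens
    :return: dict with 'utterances' and 'words' counts
    """
    return {
        "utterances": len(corpus),
        "words": sum(len(sent) for sent in corpus),
    }


# Toy corpus (made up for illustration).
toy = [["show", "me", "movies"], ["who", "directed", "it"], ["movies"]]
stats = corpus_stats(toy)
print(stats)
```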


## 2. Lexicon

[Lexicon](https://en.wikipedia.org/wiki/Lexicon) is the *vocabulary* of a language. In linguistics, a lexicon is a language's inventory of lexemes.

Linguistic theories generally regard human languages as consisting of two parts: a lexicon, essentially a catalog of a language's words; and a grammar, a system of rules which allow for the combination of those words into meaningful sentences. 

*Lexicon (or Vocabulary) Size* is one of the statistics reported for corpora. While *Word Count* is the number of __tokens__, *Lexicon Size* is the number of __types__ (unique words).


### 2.1. Lexicon Size

#### Exercise

- define a function to compute a lexicon from corpus in a list-of-lists format
    - sort the list alphabetically
    
- compute the lexicon of the training set of NL2SparQL4NLU dataset
- compare its size to the reference value below.

| Metric       | Value |
|--------------|------:|
| Lexicon Size | 1,729 |
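
The lexicon exercise could be sketched as follows. The function name `compute_lexicon` and the toy corpus are hypothetical. The example also illustrates the token/type distinction: the toy corpus has 7 tokens but only 6 types.

```python
def compute_lexicon(corpus):
    """
    Compute the lexicon (unique word types) of a corpus.
    :param corpus: corpus as a list of lists of tokens
    :return: alphabetically sorted list of word types
    """
    return sorted({word for sent in corpus for word in sent})


# Toy corpus (made up for illustration): "movies" occurs twice.
toy = [["show", "me", "movies"], ["who", "directed", "it"], ["movies"]]
lexicon = compute_lexicon(toy)
print(lexicon)
print(len(lexicon))
```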


### 2.2. Frequency List

In Natural Language Processing (NLP), [a frequency list](https://en.wikipedia.org/wiki/Word_lists_by_frequency) is a list of words (word types) sorted by frequency, where frequency usually means the number of occurrences in a given corpus. The rank of a word is its position in that list.

What is a "word"?

- case sensitive counts
- case insensitive counts (our corpus is lowercased)

#### Exercise

- define a function to compute a frequency list for a corpus
- compute frequency list for the training set of NL2SparQL4NLU dataset
- report the `5` most frequent words (you can use the provided `nbest` function to get a dict of the top N items)
- compare the frequencies to the reference values below

| Word   | Frequency |
|--------|----------:|
| the    |     1,337 |
| movies |     1,126 |
| of     |       607 |
| in     |       582 |
| movie  |       564 |


In [2]:
def nbest(d, n=1):
    """
    get n max values from a dict
    :param d: input dict (values are numbers, keys are strings)
    :param n: number of values to get (int)
    :return: dict of top n key-value pairs
    """
    return dict(sorted(d.items(), key=lambda item: item[1], reverse=True)[:n])
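
A frequency list could be computed along these lines. This is a sketch that uses `collections.Counter` from the standard library; the name `compute_frequency_list` and the toy corpus are hypothetical. To stay self-contained, the top items are taken with `Counter.most_common` here, but the returned dict could equally be passed to `nbest`.

```python
from collections import Counter


def compute_frequency_list(corpus):
    """
    Count occurrences of each word type in a corpus.
    :param corpus: corpus as a list of lists of tokens
    :return: dict mapping word -> frequency
    """
    return dict(Counter(word for sent in corpus for word in sent))


# Toy corpus (made up for illustration).
toy = [["show", "me", "movies"], ["show", "the", "movies"], ["movies"]]
freqs = compute_frequency_list(toy)

# Two most frequent words as (word, frequency) pairs.
top2 = Counter(freqs).most_common(2)
print(top2)
```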

### 2.3. Lexicon Operations

It is common to process the lexicon according to the task at hand (not every transformation makes sense for all tasks). The common operations are removing words by frequency (minimum or maximum, i.e. *Frequency Cut-Off*) and removing words that appear in a specific list (i.e. *Stop Word Removal*).

In computing, [stop words](https://en.wikipedia.org/wiki/Stop_words) are words which are filtered out before or after processing of natural language data (text). Though "stop words" usually refers to the most common words in a language, there is no single universal list of stop words used by all natural language processing tools, and indeed not all tools even use such a list. Some tools specifically avoid removing these stop words to support phrase search.

Any group of words can be chosen as the stop words for a given purpose.

#### Exercises

##### Frequency Cut-Off

- define a function to compute a lexicon from a frequency list applying minimum and maximum frequency cut-offs

    - use default values for min and max
    
- using frequency list for the training set of NL2SparQL4NLU dataset
    
    - compute lexicon applying:
    
        - minimum cut-off 2 (remove words that appear less than 2 times, i.e. remove [hapax legomena](https://en.wikipedia.org/wiki/Hapax_legomenon))
        - maximum cut-off 100 (remove words that appear more than 100 times)
        - both minimum and maximum thresholds together
        
    - report size for each comparing to the reference values in the table

| Operation  | Min | Max | Size |
|------------|----:|----:|-----:|
| original   | N/A | N/A | 1729 |
| cut-off    |   2 | N/A |  950 |
| cut-off    | N/A | 100 | 1694 |
| cut-off    |   2 | 100 |  915 |
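
The cut-off exercise could be sketched as below. The name `cutoff` and its parameter names are hypothetical. Following the exercise wording, both thresholds are treated as inclusive bounds: a word is kept when its frequency is at least the minimum and at most the maximum. The defaults (1 and infinity) keep every word.

```python
def cutoff(freq_list, fmin=1, fmax=float("inf")):
    """
    Compute a lexicon from a frequency list with frequency cut-offs.
    :param freq_list: dict mapping word -> frequency
    :param fmin: words occurring fewer than fmin times are removed
    :param fmax: words occurring more than fmax times are removed
    :return: alphabetically sorted list of the remaining words
    """
    return sorted(w for w, c in freq_list.items() if fmin <= c <= fmax)


# Toy frequency list (made up for illustration).
freqs = {"the": 5, "movies": 3, "of": 2, "casablanca": 1}

no_hapax = cutoff(freqs, fmin=2)          # drops words seen only once
no_frequent = cutoff(freqs, fmax=3)       # drops words seen more than 3 times
both = cutoff(freqs, fmin=2, fmax=3)
print(no_hapax, no_frequent, both)
```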


##### Stop Word Removal

- define a function to read/load a list of words in token-per-line format (i.e. lexicon)
- load stop word list from `NL2SparQL4NLU/extras/english.stop.txt`
- using Python's built-in `set` [methods](https://docs.python.org/2/library/stdtypes.html#set):
    
    - define a function to compute overlap of two lexicons
    - define a function to apply a stopword list to a lexicon

- compare the 100 most frequent words in frequency list of the training set to the list of stopwords (report count)
- apply stopword list to the lexicon of the training set
- report size of the resulting lexicon comparing to the reference values.

| Operation       | Size |
|-----------------|-----:|
| original        | 1729 |
| no stop words   | 1529 |
| top 100 overlap |   50 |

In [3]:
swl='NL2SparQL4NLU/extras/english.stop.txt'
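
The three functions of the stop word exercise could look roughly like this. All function names are hypothetical, and the demo uses a small made-up stop word file rather than `english.stop.txt`, so the printed values say nothing about the reference table.

```python
import os
import tempfile


def read_lexicon(lexicon_file):
    """Read a token-per-line word list into a set (empty lines skipped)."""
    with open(lexicon_file, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def lexicon_overlap(lexicon_a, lexicon_b):
    """Return the words that occur in both lexicons."""
    return set(lexicon_a).intersection(lexicon_b)


def remove_stopwords(lexicon, stopwords):
    """Return the lexicon without the words in the stop word list."""
    return set(lexicon).difference(stopwords)


# Stand-in stop word file (made up for illustration).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("the\nof\nin\n")
stopwords = read_lexicon(tmp.name)
os.remove(tmp.name)

lexicon = ["casablanca", "movies", "of", "the"]
shared = lexicon_overlap(lexicon, stopwords)
filtered = remove_stopwords(lexicon, stopwords)
print(sorted(shared))
print(sorted(filtered))
```

Because iterating over a dict yields its keys, the dict returned by `nbest` can be passed directly as one of the lexicons when comparing the most frequent words to the stop word list.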

## 3. Corpus Normalization (Pre-processing)

### 3.2. Handling Unseen Words

*Out-Of-Vocabulary (OOV) words* -- tokens in the test data that are not contained in the lexicon (vocabulary).
Empirically, each OOV word results in 1.5 - 2 extra errors (more than 1 because contextual information is lost).

__*How to handle words (in test set) that were never seen in the training data?*__

Train a language model with a specific token (e.g. `<unk>`) for unknown words!

__*How to estimate probabilities of unknown words and ngrams?*__

The *simplest* approach is to replace all the words that are not in the vocabulary (lexicon) with the `<unk>` token and treat it as any other word. (For instance, applying a frequency cut-off to the lexicon makes it possible to estimate these probabilities on the training set, because the removed words become `<unk>` there.)

#### Exercise
- define a function to replace OOV words in a corpus in list-of-lists format, given a lexicon
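
A minimal sketch of OOV replacement, assuming the corpus is a list-of-lists and the lexicon is any iterable of words. The name `replace_oov`, the `unk` parameter, and the toy data are hypothetical.

```python
def replace_oov(corpus, lexicon, unk="<unk>"):
    """
    Replace words that are not in the lexicon with the unknown-word token.
    :param corpus: corpus as a list of lists of tokens
    :param lexicon: iterable of known words
    :param unk: token used for out-of-vocabulary words
    :return: new corpus as a list of lists of tokens
    """
    vocab = set(lexicon)  # set membership tests are fast
    return [[w if w in vocab else unk for w in sent] for sent in corpus]


# Toy lexicon and test utterances (made up for illustration).
lexicon = ["me", "movies", "show"]
test = [["show", "me", "casablanca"], ["movies", "by", "spielberg"]]
normalized = replace_oov(test, lexicon)
print(normalized)
```

Applying the same function to the training set with a lexicon obtained after a frequency cut-off would turn the removed words into `<unk>`, so the token is observed during training.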