# Corpus Preprocessing using Python: Solutions

- Natural Language Understanding
- Evgeny A. Stepanov
- stepanov.evgeny.a@gmail.com

__Requirements__

- [NL2SparQL4NLU](https://github.com/esrel/NL2SparQL4NLU) dataset

## Corpus Pre-processing

### Reading Corpus

In [1]:
def read_corpus(corpus_file):
    """
    read corpus into a list-of-lists, splitting sentences into tokens by space (' ')
    :param corpus_file: corpus file in sentence-per-line format (tokenized)
    :return: corpus as list of lists
    """
    # use a context manager so the file handle is closed after reading
    with open(corpus_file, 'r') as f:
        return [line.strip().split() for line in f]
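
As a quick illustration of the sentence-per-line format, the sketch below applies the same strip-and-split step to a few in-memory lines instead of a file. The sample sentences are made up for illustration and are not taken from the dataset.

```python
# Hypothetical lines in sentence-per-line format (made up, not from the dataset)
sample_lines = [
    "who plays luke on star wars\n",
    "show credits for the godfather\n",
    "\n",  # a blank line in the file
]

# Same tokenization step as read_corpus: strip the newline, split on whitespace
sample_corpus = [line.strip().split() for line in sample_lines]

for sent in sample_corpus:
    print(sent)
```

Note that a blank line yields an empty sentence (`[]`), so it may be worth filtering such lines out if the input file can contain them.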

### Adding Sentence Beginning and End Tags

In [2]:
def corpus_add_tags(corpus, bos='<s>', eos='</s>'):
    """
    add beginning-of-sentence (bos) and end-of-sentence (eos) tags
    :param corpus: corpus as list-of-lists
    :param bos: beginning-of-sentence tag
    :param eos: end-of-sentence tag
    :return: corpus with bos and eos tags added to each sentence
    """
    return [[bos] + sent + [eos] for sent in corpus]

### Handling Unknown Words
Although it is easier to handle unknown words by creating a custom lexicon file, this requires knowledge of the file formats (which we will cover later). Consequently, let's pre-process the training data using plain Python (no libraries), continuing from the previous lab.

#### Lexicon Frequency Cut-Off

##### Computing Frequency List

In [3]:
def compute_frequency_list(corpus):
    """
    create frequency list for a corpus
    :param corpus: corpus in list-of-lists format
    :return: frequency list as dict of counts
    """
    frequencies = {}
    for sentence in corpus:
        for token in sentence:
            frequencies[token] = frequencies.get(token, 0) + 1
    return frequencies
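
The same counts can also be obtained with `collections.Counter` from the standard library. This is an alternative sketch rather than the lab's solution (which deliberately avoids libraries); the function name `compute_frequency_list_counter` and the toy corpus are assumptions made for illustration.

```python
from collections import Counter


def compute_frequency_list_counter(corpus):
    """
    hypothetical alternative to compute_frequency_list using collections.Counter
    :param corpus: corpus in list-of-lists format
    :return: frequency list as Counter (a dict subclass)
    """
    return Counter(token for sentence in corpus for token in sentence)


# toy corpus, made up for illustration
toy_corpus = [['a', 'b', 'a'], ['b', 'c']]
toy_frequencies = compute_frequency_list_counter(toy_corpus)
print(dict(toy_frequencies))
```

Since `Counter` is a `dict` subclass, the result can be passed to any code that expects a plain dict of counts.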

##### Applying min and max Frequency Cut-Off

In [4]:
def cutoff(frequency_list, tf_min=1, tf_max=float('inf')):
    """
    apply min and max cutoffs to a frequency list
    :param frequency_list: frequency list of a corpus as dict
    :param tf_min: minimum token frequency for lexicon elements (below removed); default 1
    :param tf_max: maximum token frequency for lexicon elements (above removed); default infinity
    :return: lexicon as sorted list of tokens
    """
    return sorted(token for token, frequency in frequency_list.items() if tf_min <= frequency <= tf_max)
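
To make the effect of the two thresholds concrete, here is a small worked example that applies the same filter inline. The frequency list and the threshold values are made up for illustration.

```python
# toy frequency list, made up for illustration
toy_frequencies = {'the': 5, 'movie': 2, 'actor': 2, 'zardoz': 1}

# keep tokens seen at least twice and at most four times
tf_min, tf_max = 2, 4
toy_lexicon = sorted(
    token for token, frequency in toy_frequencies.items()
    if tf_min <= frequency <= tf_max
)
print(toy_lexicon)  # ['actor', 'movie']
```

Here `the` is dropped by the maximum cut-off and `zardoz` by the minimum cut-off.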

#### Removing Stop Words from Lexicon
- Not used for n-gram language modeling

In [5]:
def remove_stopwords(lexicon, stopwords):
    """
    remove stopwords from a lexicon
    :param lexicon: lexicon as a list
    :param stopwords: stopwords list
    :return: sorted difference of two lists (lexicon - stopwords)
    """
    return sorted(list(set(lexicon) - set(stopwords)))

#### Computing Lexicon from Corpus and Reading Lexicon from File

In [6]:
def read_lexicon(lexicon_file):
    """
    read lexicon into a list
    :param lexicon_file: lexicon file in token-per-line format
    :return: lexicon as a list
    """
    # use a context manager so the file handle is closed after reading
    with open(lexicon_file, 'r') as f:
        return [line.strip() for line in f]

In [7]:
def compute_lexicon(corpus):
    """
    compute lexicon of a corpus
    :param corpus: corpus as list-of-lists
    :return: sorted list of unique words
    """
    return sorted(list(set([word for sent in corpus for word in sent])))

#### Replacing Unknown Words (OOV) with `<unk>` in a Corpus

In [8]:
def corpus_replace_oov(corpus, lexicon, unk='<unk>'):
    """
    replace all tokens that are not in lexicon with unk
    :param corpus: corpus as list-of-lists
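    :param lexicon: lexicon as a list of tokens
    :param unk: unknown-word tag
    :return: processed corpus
    """
    lexicon = set(lexicon)  # set membership is O(1) on average; a list would be scanned for every token
    return [[token if token in lexicon else unk for token in sent] for sent in corpus]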
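
A toy run of the replacement step, written inline so that it is self-contained. The corpus and lexicon are made up; holding the lexicon in a `set` is an assumption of this sketch (membership tests on a set take constant time on average, whereas a list is scanned element by element).

```python
# toy corpus and lexicon, made up for illustration
toy_corpus = [
    ['<s>', 'show', 'me', 'movies', '</s>'],
    ['<s>', 'show', 'actors', '</s>'],
]
toy_lexicon = {'<s>', '</s>', 'show', 'movies'}
unk = '<unk>'

# every token outside the lexicon is mapped to the unknown-word tag
toy_unk = [[token if token in toy_lexicon else unk for token in sent]
           for sent in toy_corpus]

for sent in toy_unk:
    print(" ".join(sent))
```

The tokens `me` and `actors` are not in the toy lexicon, so both are replaced, while the sentence tags are kept because they are part of the lexicon.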

## Pre-processing Training and Test Sets

- Lexicon is computed only using training data
- Both training and test sets are augmented with BOS (`<s>`) and EOS (`</s>`) tags
- Both training and test sets have OOV words replaced with `<unk>`
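
The three points above can be sketched on a toy example before running them on the real data. Everything here is made up for illustration: the toy sentences, the helper names `add_tags` and `replace_oov`, and the minimum frequency of 2.

```python
# toy training and test sets (already tokenized), made up for illustration
toy_trn = [['show', 'movies'], ['show', 'actors'], ['list', 'movies']]
toy_tst = [['show', 'directors'], ['list', 'actors']]

bos, eos, unk = '<s>', '</s>', '<unk>'


def add_tags(corpus):
    # hypothetical helper: wrap each sentence in bos/eos tags
    return [[bos] + sent + [eos] for sent in corpus]


def replace_oov(corpus, lexicon):
    # hypothetical helper: map tokens outside the lexicon to unk
    return [[t if t in lexicon else unk for t in sent] for sent in corpus]


toy_trn_tag = add_tags(toy_trn)
toy_tst_tag = add_tags(toy_tst)

# lexicon from training data only, keeping tokens seen at least twice
counts = {}
for sent in toy_trn_tag:
    for token in sent:
        counts[token] = counts.get(token, 0) + 1
toy_lex = {token for token, count in counts.items() if count >= 2}

toy_trn_unk = replace_oov(toy_trn_tag, toy_lex)
toy_tst_unk = replace_oov(toy_tst_tag, toy_lex)

for sent in toy_trn_unk + toy_tst_unk:
    print(" ".join(sent))
```

Because of the frequency cut-off, `actors` and `list` (seen once in training) are replaced in both sets, so `<unk>` also occurs in the training data; `directors`, which never occurs in training, is replaced in the test set.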

In [9]:
trn = 'NL2SparQL4NLU/dataset/NL2SparQL4NLU.train.utterances.txt'
tst = 'NL2SparQL4NLU/dataset/NL2SparQL4NLU.test.utterances.txt'
trn_out = 'NL2SparQL4NLU.trn.data'
tst_out = 'NL2SparQL4NLU.tst.data'

trn_raw = read_corpus(trn)
trn_tag = corpus_add_tags(trn_raw)
trn_lex = cutoff(compute_frequency_list(trn_tag), tf_min=2)
trn_unk = corpus_replace_oov(trn_tag, trn_lex)

# write training data to a file
with open(trn_out, 'w') as f:
    for sent in trn_unk:
        f.write(" ".join(sent) + "\n")
        
tst_raw = read_corpus(tst)
tst_tag = corpus_add_tags(tst_raw)
tst_unk = corpus_replace_oov(tst_tag, trn_lex)

# write test data to a file
with open(tst_out, 'w') as f:
    for sent in tst_unk:
        f.write(" ".join(sent) + "\n")

## Data Analysis

### Basic Corpus Statistics

#### Sentence and Word Counts
- sentence count
- word count

In [10]:
def corpus_stats(corpus):
    """
    compute word and sentence counts of the corpus
    :param corpus: corpus as list-of-lists
    :return: sentence count, word count
    """
    return len(corpus), sum([len(sent) for sent in corpus])
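
When `corpus_stats` is applied to a corpus that already carries `<s>` and `</s>` tags, the tags are included in the word count. A minimal sketch of a variant that leaves them out follows; the function name `corpus_stats_no_tags` and the toy corpus are hypothetical.

```python
def corpus_stats_no_tags(corpus, bos='<s>', eos='</s>'):
    """
    hypothetical variant of corpus_stats that ignores sentence boundary tags
    :param corpus: corpus as list-of-lists
    :param bos: beginning-of-sentence tag
    :param eos: end-of-sentence tag
    :return: sentence count, word count without bos/eos tags
    """
    tags = {bos, eos}
    words = sum(1 for sent in corpus for token in sent if token not in tags)
    return len(corpus), words


# toy corpus, made up for illustration
toy_corpus = [['<s>', 'show', 'movies', '</s>'], ['<s>', 'list', '<unk>', '</s>']]
print(corpus_stats_no_tags(toy_corpus))  # (2, 4)
```

In this sketch `<unk>` is still counted as a word, since it stands in for a real token.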

### Lexicon Size and Overlap
- Length of lexicon list
- Lexicon overlap (e.g. with stopwords)

In [14]:
def compute_overlap(a, b):
    """
    compute overlap of two lists as set intersection
    :param a: list 1
    :param b: list 2
    :return: sorted list of overlapping elements
    """
    return sorted(list(set(a) & set(b)))
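
The two set operations used in this notebook, intersection for the overlap and difference for stop word removal, can be compared side by side on a toy example. The lexicon and stop word list below are made up for illustration.

```python
# toy lexicon and stop word list, made up for illustration
toy_lexicon = ['movies', 'the', 'show', 'of', 'actors']
toy_stopwords = ['the', 'of', 'a', 'and']

# intersection: lexicon entries that are also stop words
overlap = sorted(set(toy_lexicon) & set(toy_stopwords))
# difference: lexicon entries that remain after removing stop words
remaining = sorted(set(toy_lexicon) - set(toy_stopwords))

print(overlap)    # ['of', 'the']
print(remaining)  # ['actors', 'movies', 'show']
```

Both results are sorted because sets do not preserve the order of the input lists.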

In [18]:
swl = 'NL2SparQL4NLU/extras/english.stop.txt'
stopwords = read_lexicon(swl)
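# recompute the lexicon from the processed training corpus (it now includes the <s>, </s> and <unk> tags)
trn_lex = compute_lexicon(trn_unk)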

print(len(trn_lex))
print(len(stopwords))
print(len(compute_overlap(trn_lex, stopwords)))

print(corpus_stats(trn_unk))

953
571
144
(3338, 28129)
