ELEC-E5550 - Statistical Natural Language Processing
# SET 1: Text Preprocessing

By completing this assignment you will learn how to handle raw text data. We will explore the frequency distribution of different language units and then discuss what this knowledge can be used for.

KEYWORDS:

- Frequency distribution
- Tokenization
- Stop words removal

### Data
"The Gold-Bug" by Edgar Allan Poe. 
### Libraries
In this task you'll use:
- [NLTK](http://www.nltk.org) — a platform to work with human language data.

You are also allowed to use other libraries of your choice.

## TASK 1
## Warm up: The Gold-Bug cipher
The data used in this assignment is actually a story about the importance of letter frequencies. In the story, William Legrand deciphers a message leading to a hidden treasure by applying frequency analysis. The cipher used in the story is a substitution cipher, where each letter is replaced by a different letter or number.

Knowing the frequency of letters in a language is important not only for solving ciphers, but it also has practical applications like data compression. For example, Morse code uses the shortest symbols for the most frequent letters. 

In this warm-up task you'll need to discover for yourself the frequency distribution of single letters and of letter pairs in English.
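
Why does frequency analysis break a substitution cipher? A minimal sketch, assuming a hypothetical key that simply shifts each letter by three positions (the cipher in the story uses other symbols) and a short sample plaintext: the symbols change, but the shape of the frequency distribution stays the same.

```python
from collections import Counter
import string

# Hypothetical key: every letter is replaced by the letter 3 positions later.
alphabet = string.ascii_lowercase
key = str.maketrans(alphabet, alphabet[3:] + alphabet[:3])

plaintext = "a good glass in the bishops hostel in the devils seat"
ciphertext = plaintext.translate(key)

def frequency_profile(text):
    """Return the letter counts sorted from most to least frequent."""
    counts = Counter(ch for ch in text if ch.isalpha())
    return sorted(counts.values(), reverse=True)

# The individual symbols differ, but the sorted counts are identical.
same_profile = frequency_profile(plaintext) == frequency_profile(ciphertext)
print(ciphertext)
print(same_profile)
```

Because the profile survives encryption, the most frequent cipher symbol is a good first guess for the most frequent letter of the language.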

### Frequency distribution of letters
## 1.1
First of all, you need to load the text into the Jupyter Notebook. Create a function that reads the data located in *'/coursedata/the_gold-bug.txt'* file into a string.

HINT: you can use Python's built-in open() function and the file object's read() method

In [27]:
def read(file_name):
    """
    this function reads a .txt file into a string
    
    INPUT:
    file_name - a path to the text file
    OUTPUT
    raw_string - text file as one string
    """
    
    # YOUR CODE HERE
    # the context manager closes the file even if reading fails
    with open(file_name, "r") as f:
        raw_string = f.read()
    
    return raw_string

raw_gold_bug = read('/coursedata/the_gold-bug.txt')

In [28]:
from nose.tools import assert_equal

assert_equal(type(read('/coursedata/the_gold-bug.txt')), str)
assert_equal(len(read('/coursedata/the_gold-bug.txt')), 76483)


## 1.2
To count the letters, clean the text to leave only lowercase alphabetic characters in it.

HINT: you may use string methods 

In [29]:
def convert_to_letters(raw_string):
    """
    this function takes a raw text string and converts it into an array of lowercase letters
    
    INPUT:
    raw_string - text file in a string format
    OUTPUT:
    letters - text as a list of only letters
    """
    # YOUR CODE HERE
    # lowercase the text and keep only alphabetic characters
    letters = [ch for ch in raw_string.lower() if ch.isalpha()]
    return letters

bug_letters = convert_to_letters(raw_gold_bug)

In [30]:
assert_equal(convert_to_letters("Lalala has 3 las."), ["l","a","l","a","l","a","h","a","s","l","a","s"])
assert_equal(type(bug_letters), list)
assert_equal(len(bug_letters), 58269)


## 1.3
Count how many times each letter occurred in the story. Then, sort the letters so that the most frequent letter would appear first. Put the sorted alphabet in a string.


HINT: you may use the NLTK library or the collections module

In [31]:
import nltk
import collections
# HINT: Try nltk.FreqDist or collections.Counter

def sort_letters(letters):
    """
    this function takes a text represented as a list of lowercase letters
    and outputs: 
    1. a string of letters sorted by their frequencies
    2. a frequency dictionary of letters
    
    INPUT:
    letters - text as a list of only lowercase letters 
    OUTPUT:
    sorted_letters - a string with the alphabet sorted by frequency
    letter_dict - a frequency dictionary of letters
    """
    
    # YOUR CODE HERE
    letter_dict = dict(nltk.FreqDist(letters))
    sorted_letters = [l for _, l in reversed(sorted(zip(letter_dict.values(), letter_dict.keys())))]
    sorted_letters = "".join(sorted_letters)
    return sorted_letters, letter_dict

bug_letters_sorted, letter_dict = sort_letters(bug_letters)

In [32]:
assert_equal(sort_letters("aaabbc"), ("abc", {"a":3,"b":2,"c":1}))
assert_equal(bug_letters_sorted[0], 'e')
assert_equal(len(bug_letters_sorted), 26)
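
As an alternative to sorting zipped values and keys, the ranking can be sketched with `collections.Counter`, whose `most_common()` method returns items from the most to the least frequent. The toy input below is an assumption for illustration only.

```python
from collections import Counter

toy_letters = list("aaabbc")  # hypothetical toy input
counts = Counter(toy_letters)

# most_common() yields (letter, count) pairs in descending order of count
ranked = "".join(letter for letter, _ in counts.most_common())
print(ranked)        # abc
print(dict(counts))  # {'a': 3, 'b': 2, 'c': 1}
```

Note that `most_common()` keeps tied letters in first-seen order, whereas sorting `(count, letter)` tuples breaks ties alphabetically, so the two approaches can differ on ties.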


## 1.4
The frequencies of letters differ drastically. That means the probabilities of seeing each letter are also different.
Look at your frequency dictionary: 
* what is the probability of seeing the most frequent letter? 
* what is the probability of seeing the least frequent one?

Type your answer in the cell below. You can create an additional cell to do calculations. 

In [33]:
def mle_probability_of_letter(letter_dict, letter):
    """
    Computes the maximum likelihood estimate (i.e. relative frequency in corpus)
    of the given letter based on the given corpus statistics
    INPUT:
    letter_dict - a frequency dictionary of letters
    letter - which letter to compute probability for
    OUTPUT:
    probability - probability of the given letter
    """
    # YOUR CODE HERE
    probability = letter_dict[letter]/sum(letter_dict.values())
    return probability
    

In [34]:
from numpy.testing import assert_almost_equal

assert_almost_equal(mle_probability_of_letter({"a":5, "b":5}, "a"), 0.5)
assert_almost_equal(mle_probability_of_letter({"a":5, "b":5, "c":0, "d":10}, "a"), 0.25)

p_mfl = mle_probability_of_letter(letter_dict, bug_letters_sorted[0])
assert_almost_equal(p_mfl, 0.13125332509567694, 2)
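
The maximum likelihood estimate is just a relative frequency, so the estimates over the whole alphabet form a probability distribution. A small sketch with hypothetical toy counts shows that they sum to one:

```python
# Hypothetical toy counts, not taken from the story.
toy_counts = {"a": 5, "b": 5, "c": 0, "d": 10}
total = sum(toy_counts.values())

# MLE: divide each count by the total number of observations
toy_probs = {letter: count / total for letter, count in toy_counts.items()}

print(toy_probs)                # {'a': 0.25, 'b': 0.25, 'c': 0.0, 'd': 0.5}
print(sum(toy_probs.values()))  # 1.0
```

A letter that never occurs gets probability zero under this estimate, which is one reason smoothing is often applied in practice.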


## 1.5 
### Frequency distribution of letter combinations
Some combinations of language units are more likely than others. You'll see it in a minute.

There are 26 letters in the English alphabet. How many possible two-letter combinations are there according to combinatorics? Here the order of the letters matters and a letter may repeat. For example, an alphabet of 3 letters **a**, **b** and **c** gives 9 combinations: **aa**, **bb**, **cc**, **ab**, **ac**, **ba**, **bc**, **ca**, **cb**.

Type your answer in the cell below.

In [35]:
def number_of_combinations(number_of_letters, sequence_len):
    """
    this function takes a number of letters in an alphabet and the desired length of letter combinations
    and outputs the number of all possible letter combinations of length "sequence_len". 
    The combination can contain the same letter "sequence_len" times.
    
    INPUT:
    number_of_letters - a number of letters in an alphabet
    sequence_len - a length of the combination of letters
    OUTPUT:
    num_of_combinations - the number of all possible letter combinations
    """
    
    # YOUR CODE HERE
    num_of_combinations = number_of_letters**sequence_len
    return num_of_combinations

In [36]:
assert_equal(number_of_combinations(2,2), 4)
assert_equal(number_of_combinations(2,1), 2)
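
The count of ordered sequences with repetition can be cross-checked by brute-force enumeration. This sketch uses `itertools.product` on the three-letter toy alphabet from the example in 1.5:

```python
from itertools import product

toy_alphabet = "abc"

# every ordered pair of letters, repetition allowed
pairs = ["".join(p) for p in product(toy_alphabet, repeat=2)]

print(pairs)
print(len(pairs))  # 9
```

The enumeration grows as the alphabet size raised to the sequence length, so it is only practical for small cases.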


## 1.6

As with single letters, some sequences of language units are more probable than others, and not all two-letter combinations occur in English. This fact is used in applications such as predictive texting, where your phone suggests the word you are likely to need next.

In the following exercise, you'll need to count all combinations of two letters that appeared in the text. For this, you'll create a new function. It uses a sliding window of two letters, records what letters it sees at every step and then counts how many times it encountered each two-letter combination.

In [37]:
def count_pairs(letters):
    """
    this function takes a text represented as a list of lowercase letters
    and converts it into a sorted list of tuples, where the first element
    is a two-letter string, and the second element is the frequency of this letter pair.
    the first element of a list should be a tuple for the most frequent pair.
    
    INPUT:
    letters - text as a list of only lowercase letters 
    OUTPUT:
    pairs_sorted - a list of tuples (pair, frequency) sorted by the frequency element
    """
    
    # YOUR CODE HERE
    pairs = {}
    # slide a two-letter window over the text and count each pair
    for i in range(len(letters) - 1):
        key = letters[i] + letters[i + 1]
        pairs[key] = pairs.get(key, 0) + 1

    pairs_sorted = [(l, f) for f, l in reversed(sorted(zip(pairs.values(), pairs.keys())))]
    return pairs_sorted

bug_pairs_sorted = count_pairs(bug_letters)

In [38]:

assert_equal(type(bug_pairs_sorted), list)
assert_equal(type(bug_pairs_sorted[0]), tuple)
assert_equal(bug_pairs_sorted[0][1], 1800)
assert_equal(count_pairs("aaaabbabc")[:2], [("aa",3),("ab",2)])
assert(count_pairs("aaaabbabc")[2] in [("ba",1), ("bb", 1), ("bc", 1)])
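
The sliding window of two letters can also be sketched by zipping the sequence with itself shifted by one position. This is one possible formulation, shown here on the toy string from the tests:

```python
from collections import Counter

toy_letters = list("aaaabbabc")

# zip pairs each letter with its right-hand neighbour: (l0, l1), (l1, l2), ...
pair_counts = Counter(a + b for a, b in zip(toy_letters, toy_letters[1:]))

top_two = pair_counts.most_common(2)
print(top_two)  # [('aa', 3), ('ab', 2)]
```

A text of n letters always yields n - 1 overlapping pairs, which is a handy sanity check on the counts.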


## 1.7
Using the frequency list you created, answer the following questions in the cell below:

1. How many different two-letter combinations have you encountered in our data?
2. What is the most frequent two-letter combination in English?
3. What is the probability of seeing a pair where both letters are the same?

You can create an additional cell for calculations.

In [39]:
# write the number of different two-letter combinations
n_pairs_data = len(bug_pairs_sorted) ##FILL IN THE ANSWER
# write the most frequent two-letter combination
most_frequent_pair = bug_pairs_sorted[0][0] ##FILL IN THE ANSWER
# write the maximum likelihood estimate of the probability of seeing a pair where both letters are the same
p_same_letters = None ##FILL IN THE ANSWER

# YOUR CODE HERE
same = 0
not_same = 0
for s in bug_pairs_sorted:
    if s[0][0] == s[0][1]:
        same += s[1]
    else:
        not_same += s[1]
p_same_letters = same / (same + not_same)

In [40]:
print(len(bug_pairs_sorted))
print(bug_pairs_sorted[0][0])
print(p_same_letters)

519
th
0.03459875060067275


In [41]:
### This cell contains hidden tests for the correct answers.


## TASK 2
## Word Tokenizer
In this task, you will create a function that splits the text into more elaborate units than just letters: words.

Text data is a part of virtually any NLP application. Sometimes you're lucky and get nice, clean text instead of plain raw text, but this is not always the case. Before getting your hands dirty with your actual application, you will most probably need to preprocess the text: for instance, separate it into words and sentences or remove unwanted symbols. Different tasks require different preprocessing techniques. In this task we'll use some simple ones.

It's not trivial to separate words from a string of text. The first thing that needs to be decided is what to count as a word. Should punctuation and numbers be considered words? Should *frogs* and *frog* be considered the same word? What about *Frog*, *frog* and *FROG*? Before answering those questions, let's make sure we are on the same page and discuss some terminology.

When talking about words, we can mean several different things: lemmas, word types and word tokens.

* **Lemma** - an identifier of a set of lexical forms sharing the same stem (*run* is the lemma for *runs* and *running*), a dictionary form of a word.
* **Word type** - a distinct unit in a text (all the instances of *runs* are counted once).
* **Word token** - every instance of word occurrence (every instance of *runs* counted as a separate word token).

Thus:
* **Tokenization** - a process of separating out word tokens from text
* **Lemmatization** - a process of assigning a group of word forms their lemma, and further separating out these lemmas from text
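
The difference between word tokens and word types can be illustrated with a short sketch. The sentence is a made-up example, and whitespace splitting stands in for a real tokenizer:

```python
# Hypothetical toy sentence, split on whitespace for simplicity.
toy_tokens = "the bug bit the man and the man ran".split()

n_tokens = len(toy_tokens)      # every occurrence counts
n_types = len(set(toy_tokens))  # each distinct word counts once

print(n_tokens)  # 9
print(n_types)   # 6
```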

English has relatively few inflected word forms, so many applications can get by without lemmatization. For this reason, we'll leave it out for now and focus on tokenization instead.

Let's create a tokenizer that considers numbers and punctuation as tokens and doesn't separate hyphenated words like *dum-dum*. For that you'll need:
- regular expressions
- string operations

## 2.1
Let's start off by separating words just by whitespaces and see what happens to our dummy sentence example: *It's a dum-dum example, we'll place it here to prove a point. Also look at this number: 300.99.*

In [42]:
dumb_example = "It's a dum-dum example, we'll place it here to prove a point. Also look at this number: 300.99."

def tokenize(raw_string):
    """
    this function splits a raw text string into tokens on whitespace
    
    INPUT:
    raw_string - text file in a string format
    OUTPUT:
    space_tokenized - list of strings
    """
    # YOUR CODE HERE
    space_tokenized = raw_string.split()
    
    return space_tokenized

dum_dum_example = tokenize(dumb_example)
print(dum_dum_example)

["It's", 'a', 'dum-dum', 'example,', "we'll", 'place', 'it', 'here', 'to', 'prove', 'a', 'point.', 'Also', 'look', 'at', 'this', 'number:', '300.99.']


In [43]:
assert_equal(dum_dum_example[0], "It's")


As can be seen from the dummy example, splitting on whitespace is not enough. We get tokens like *'example,'*, *'point.'* and *'number:'*. Instead, we would like punctuation marks to be separate tokens, while keeping them inside items like prices and numbers (4.99). Thus, we need something cleverer.
## 2.2
For this purpose, you'll need to write a regular expression. It should:
- match all alphanumeric strings with a hyphen, apostrophe or period inside.

In [44]:
import re
# raw string; the hyphen goes last in the class so that it is a literal, not a range
words_with_inside = r"\w+['.-]\w+"

print(re.findall(words_with_inside, dumb_example))

["It's", 'dum-dum', "we'll", '300.99']


In [45]:
assert_equal(re.findall(words_with_inside, "It's a dum-dum example, we'll place it here to prove a point. Also look at this number: 300.99."), 
             ["It's", 
               'dum-dum', 
               "we'll", 
               '300.99'])
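
One pitfall is worth a quick check: inside a character class, a hyphen between two characters denotes a range. A class written as `['-.]` therefore covers every character from the apostrophe to the period in ASCII order, which includes `(`, `)`, `*`, `+` and `,`. The sketch below compares it with `['.-]`, where the hyphen is last and is read literally; the sample string is made up.

```python
import re

range_class = r"\w+['-.]\w+"    # hyphen in the middle: range from ' to .
literal_class = r"\w+['.-]\w+"  # hyphen last: literal hyphen

sample = "a,b dum-dum 3+4"

print(re.findall(range_class, sample))    # ['a,b', 'dum-dum', '3+4']
print(re.findall(literal_class, sample))  # ['dum-dum']
```

Both patterns give the same result on the dummy sentence, so the difference only shows up on text with commas or operators between word characters.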

## 2.3
Now, let's add a disjunction to our regex:
- match either all words with a hyphen, apostrophe or period inside OR any non-whitespace character followed by zero or more word characters.

HINT: google what a *word character* is in a regex

In [46]:
# first alternative: words with an inner ' . or -; second: any other token
words_with_inside_and_stuff = r"(\w+['.-]\w+|\S\w*)"


print(re.findall(words_with_inside_and_stuff, dumb_example))

["It's", 'a', 'dum-dum', 'example', ',', "we'll", 'place', 'it', 'here', 'to', 'prove', 'a', 'point', '.', 'Also', 'look', 'at', 'this', 'number', ':', '300.99', '.']


In [48]:
assert_equal(re.findall(words_with_inside_and_stuff, dumb_example)[-1], '.')


As you've already noticed, the process of creating a tokenizer is pretty complicated. There are many more things to be considered, and a tokenizer should be chosen in accordance with a task. For example, we might also want to capture abbreviations (U.S.A.), percentages (82%) or URLs.
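
To give a feeling for how quickly the rules pile up, here is a rough sketch of a pattern that also keeps abbreviations, percentages and URLs together. The individual rules and the sample sentence are assumptions made for illustration; they are deliberately loose and would misfire on real text (for example, a sentence ending in a single capital letter).

```python
import re

token_pattern = re.compile(r"""
    https?://\S+          # URLs, matched up to the next whitespace
  | (?:[A-Z]\.)+          # abbreviations such as U.S.A.
  | \d+(?:\.\d+)?%        # percentages such as 82%
  | \w+(?:['.-]\w+)*      # words and numbers with inner apostrophe, period or hyphen
  | \S                    # any other single non-whitespace character
""", re.VERBOSE)

sample = "The U.S.A. grew 82% according to http://www.nltk.org today."
tokens = token_pattern.findall(sample)
print(tokens)
```

The order of the alternatives matters: the regex engine tries them from left to right, so the more specific rules must come before the generic word rule.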

Luckily, several good tokenizers have already been implemented. For instance, the NLTK package provides several. 

## 2.4
Let's tokenize our text using the Treebank tokenizer. It uses regular expressions to tokenize text as in the Penn Treebank. Don't forget to lowercase the words.

In [28]:
from nltk.tokenize import TreebankWordTokenizer

def tokenize_and_lowercase(raw_string):
    """
    this function takes a raw text string and converts it into an array of lowercased word
    tokens using Penn Treebank tokenizer.
    
    INPUT:
    raw_string - text file in a string format
    OUTPUT:
    tokens - text as a list of lowercased word tokens
    """
    # YOUR CODE HERE
    tokens = raw_string.lower()
    tokens = TreebankWordTokenizer().tokenize(tokens)
    return tokens


tokenized_bug = tokenize_and_lowercase(raw_gold_bug)

In [29]:
assert_equal(len(tokenized_bug), 16290)
assert_equal(tokenized_bug[0], 'the')

## TASK 3
## Word frequencies
In this task you will explore the distribution of word frequencies.
## 3.1
Let's count how many times each word token occurred in the story. Write a function that returns a frequency dictionary.

Explore the data. What are some of the most common tokens? 

In [30]:
#HINT: simply use nltk.FreqDist or collections.Counter again.

def count_tokens(tokenized_text):
    """
    this function takes a list of tokens and converts it into a frequency dictionary
    
    INPUT:
    tokenized_text - text as a list of tokens 
    OUTPUT:
    freq_dict - frequency dictionary of tokens
    """
    
    # YOUR CODE HERE
    freq_dict = dict(nltk.FreqDist(tokenized_text))
    
    return freq_dict

freq_dict = count_tokens(tokenized_bug)

In [31]:
assert_equal(freq_dict[','], 1302)
assert_equal(len(freq_dict) , 3071)


## 3.2
As you can see, the most frequent words are not specific to Poe's story, but are pretty much the same across English texts.  

In information theory, the more likely an event is to occur, the less information it carries. Thus, if an event is not a surprise, it's simply "old news". For some natural language applications, it means that words like *to* and *the* don't tell anything important about a text. They are not helpful in recognising its topic or its author.
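
The link between probability and information can be made concrete with self-information, defined as the negative base-2 logarithm of the probability and measured in bits. The probabilities in this sketch are hypothetical round numbers, not estimates from the story.

```python
import math

def surprisal_bits(probability):
    """Self-information of an event with the given probability, in bits."""
    return -math.log2(probability)

frequent = surprisal_bits(0.5)   # an event seen half of the time
rare = surprisal_bits(1 / 1024)  # an event seen once in 1024 cases

print(frequent)  # 1.0
print(rare)      # 10.0
```

The rarer the event, the more bits it carries, which is why very frequent words contribute little to distinguishing one text from another.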


Such frequent uninformative words are called **stop words**, and, in some cases, they can simply be cleaned out from data. There exist prepared lists of such words in English. Let's remove the ones provided by NLTK.

In [32]:
from nltk.corpus import stopwords
nltk.download('stopwords')

stop_words_english = stopwords.words('english')

def remove_stop_words(tokenized_text, stop_words):
    """
    this function takes a list of tokens and removes stop words from it
    
    INPUT:
    tokenized_text - text as a list of tokens 
    stop_words - a list of words to remove
    OUTPUT:
    clean_text - text as a list of tokens with stop words removed
    """
    
    # YOUR CODE HERE
    # a set gives constant-time membership tests
    stop_set = set(stop_words)
    clean_text = [token for token in tokenized_text if token not in stop_set]

    return clean_text
    
clean_bug = remove_stop_words(tokenized_bug, stop_words_english)

[nltk_data] Downloading package stopwords to
[nltk_data]     /home/spindll1/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [33]:
assert_equal(len(clean_bug), 9211)


## 3.3
Great! Now you have a clean text that can be used in further tasks.

One last thing: what fraction of the tokens in the tokenized text were stop words?
Write your answer in the cell below.

In [34]:
# fraction of stop words in the tokenized text (number between 0 and 1)
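# YOUR CODE HERE
# stop words are the tokens that were removed, so subtract the kept share from 1
fraction_of_stop_words = 1 - len(clean_bug) / len(tokenized_bug)  # about 0.4346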

In [35]:
### This cell contains hidden tests for the correct answers.

In [36]:
print(len(clean_bug)/len(tokenized_bug))

0.565438919582566
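
Note that the ratio of cleaned tokens to all tokens is the share that was kept; the share of stop words is its complement. A small sketch with a hypothetical token list and stop word set makes the distinction explicit:

```python
# Hypothetical toy data, not the NLTK stop word list.
toy_tokens = ["the", "bug", "is", "on", "the", "tree", "."]
toy_stop_words = {"the", "is", "on"}

kept = [token for token in toy_tokens if token not in toy_stop_words]

fraction_kept = len(kept) / len(toy_tokens)  # 3 of 7 tokens survive
fraction_removed = 1 - fraction_kept         # 4 of 7 tokens were stop words

print(kept)  # ['bug', 'tree', '.']
print(fraction_kept, fraction_removed)
```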
