Computational linguistics
====================


First things first: let's style our notebook.

In [None]:
from utils import css_from_file
css_from_file('css/pawel.css')

Install required libraries
-------------------------

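1. Install nltk - ```pip install nltk```
2. Download the tokenizer models that nltk needs for sentence splitting - ```python -c "import nltk; nltk.download('punkt')"``` (newer nltk releases may ask for ```punkt_tab``` instead)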

Building and querying a simple language model
-----------------------------

A language model is, simply put, statistical information about a language.
In its simplest form it is just word occurrence statistics.
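
As a minimal sketch of such word occurrence statistics, counting words could look roughly like this. The sample text is made up for illustration, and splitting on whitespace is a simplifying assumption (proper tokenization is covered later):

```python
from collections import Counter

# Toy text, assumed for illustration only
sample = "the cat sat on the mat and the dog sat too"

# Whitespace splitting is a simplification; proper tokenization comes later
word_counts = Counter(sample.split())

# Relative frequency of each word: its count divided by the total number of words
total = sum(word_counts.values())
word_probs = {word: count / float(total) for word, count in word_counts.items()}

print(word_counts.most_common(3))
print(word_probs["the"])
```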

There are many types of language models. The most common ones are based on words, but there are also character-based models.

Here is a demo of a system trained on characters:

http://www.cs.toronto.edu/~ilya/rnn.html
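
The linked demo is a far more capable model than plain counting. The sketch below only shows what the simplest character-level statistics might look like, using a made-up string:

```python
from collections import Counter

sample = "hello world"  # toy text, assumed for illustration

# Count single characters and pairs of adjacent characters
char_counts = Counter(sample)
char_pair_counts = Counter(zip(sample, sample[1:]))

print(char_counts.most_common(3))
print(char_pair_counts[("l", "o")])
```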

Useful vocabulary
--------------------------

**Corpus** - a collection of texts, normally from one domain

**Parallel corpus** - a collection of source texts paired with their translations - very useful for translation purposes

**Tokenization** - the process of splitting text into smaller units (e.g. sentences, words)

Download corpus for analysis
-------------------

In [None]:
!wget http://norvig.com/big.txt

Read text and clean it (slightly)
--------------------

Open the corpus and do some cleaning.

**Exercises:**
1. Read the contents of the big.txt file into a variable
2. Replace every \n with a single space
3. Convert all text to lowercase
4. Print out the first 500 characters of the text

In [None]:
# put your solution here

<a>Double click to show the solution</a>
<div class="spoiler">

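# read the whole corpus into one string; the file is closed automatically
with open("big.txt") as f:
    text = f.read()
# replace newlines with spaces and lowercase everything
text = text.replace("\n", " ").lower()
print(text[:500])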

</div>

Sentence splitting
-------------------------

It is useful to split the text into sentences.

Questions:
1. Why won't a naive method like splitting on "." work very well?
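
One way to explore the question is to try the naive method on a short text. This is only a sketch; the sample text is made up and chosen to contain abbreviations and a decimal number:

```python
# Made-up example text containing abbreviations and a decimal number
sample = "Dr. Smith paid $3.50 for coffee. He met Mr. Jones at 5 p.m. today."

# Naive approach: treat every "." as a sentence boundary
naive_sentences = [part.strip() for part in sample.split(".") if part.strip()]

for part in naive_sentences:
    print(part)

# The text has 2 sentences, but the naive split produces more pieces than that
print(len(naive_sentences))
```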

There exist libraries for sentence tokenization that are smarter than that.
Let's import the sentence tokenizer from the nltk library.

In [None]:
from nltk.tokenize import sent_tokenize

# split the cleaned corpus into a list of sentences
sentences = sent_tokenize(text)
for sentence in sentences[:10]:
    print(sentence)

Word splitting (tokenization)
-----------------

Word tokenization is as important as sentence splitting, maybe even more so.

Questions:
1. Why won't a naive method of splitting on whitespace work very well?
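
To get a feel for the problem, the sketch below compares whitespace splitting with a simple regular expression on a made-up sentence. The regex is an illustrative assumption and is not how nltk tokenizes:

```python
import re

sample = "Hello, world. Don't panic!"

# Naive approach: split on whitespace; punctuation stays glued to the words
naive_tokens = sample.split()

# Slightly better sketch: words (optionally with an apostrophe) or single punctuation marks.
# This regex is an illustrative assumption, not what nltk's word_tokenize does.
regex_tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", sample)

print(naive_tokens)
print(regex_tokens)
```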

There exist tools for this as well. Let's import the word tokenization function from nltk.
In our company we use a custom tokenizer, because our domain is very specific.

In [None]:
from nltk.tokenize import word_tokenize

print(word_tokenize("Hello, world."))

**Exercise:**

1. Tokenize each sentence into words (tip: use the map function)
2. Print out the first 10 tokenized sentences

In [None]:
# put your solution here

<a>Double click to show the solution</a>
<div class="spoiler">

# list() is needed on Python 3, where map returns a lazy iterator that cannot be sliced
sentences_tokenized = list(map(word_tokenize, sentences))

for sentence in sentences_tokenized[:10]:
    print(sentence)

</div>

n-grams
---------------------

From Wikipedia (https://en.wikipedia.org/wiki/N-gram):

_In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus. When the items are words, n-grams may also be called shingles._

- 1-gram (aka unigram) - 1 word
- 2-gram (aka bigram) - 2 consecutive words
- 3-gram (aka trigram) - 3 consecutive words
- ...
- n-gram - n consecutive words
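
A sequence of N words contains N - n + 1 n-grams (for n no larger than N). As a small sketch for bigrams only, using a made-up sentence and the hypothetical names `example_words` and `example_bigrams`, consecutive pairs can be formed by zipping the word list with a copy of itself shifted by one:

```python
example_words = "the quick brown fox jumps".split()

# Pair every word with its successor to get the bigrams
example_bigrams = list(zip(example_words, example_words[1:]))

print(example_bigrams)

# 5 words give 5 - 2 + 1 = 4 bigrams
print(len(example_bigrams))
```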

Exercise
----------------

Given the text:

    A simple complete sentence consists of a single clause
    
Questions:
1. How many 1-grams, 2-grams and, in general, n-grams are in the sentence? Enumerate them all
2. Create a function ```ngrams``` that, given a list of words and n, returns a list of n-grams
3. Print out all n-grams from our sentence

In [None]:
sentence = "A simple complete sentence consists of a single clause"
words = sentence.split()

# put your solution here

<a>Double click to show the solution</a>
<div class="spoiler">

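def ngrams(words, n):
    # slide a window of length n over the words; there are len(words)-n+1 positions
    for i in range(len(words) - n + 1):
        yield tuple(words[i:i + n])

all_ngrams = 0
for n in range(1, len(words) + 1):
    print("{}-gram".format(n))
    for ngram in ngrams(words, n):
        print(ngram)
        all_ngrams += 1
    print("")

# total number of n-grams over all n
print(all_ngrams)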

</div>

Building a simple language model
-------------------

**Exercise:**

Based on our corpus, create a list of all 1-, 2-, 3-, 4- and 5-grams (not necessarily unique) and count how often each one occurs.

In [None]:
from collections import Counter

# put your solution here

<a>Double click to show the solution</a>
<div class="spoiler">

from collections import Counter

ngrams_all = []
for n in range(1, 6):
    print(n)  # progress indicator
    for sentence in sentences_tokenized:
        ngrams_all.extend(ngrams(sentence, n))

# count how many times each n-gram occurs
ngrams_freq = Counter(ngrams_all)

</div>

**Exercise:**

Print out the most frequent 1-, 2-, 3-, 4- and 5-grams (e.g. the top 10)

In [None]:
# put your solution here

<a>Double click to show the solution</a>
<div class="spoiler">

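for n in range(1, 6):
    print("most frequent {}-grams".format(n))
    # keep only n-grams of length n; most_common() is already sorted by frequency
    ngrams_freq_ = [(ngram, freq) for ngram, freq in ngrams_freq.most_common() if len(ngram) == n]
    for ngram, freq in ngrams_freq_[:10]:
        print("{} {}".format(ngram, freq))
    print("")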

</div>

Fun time - generating random sentences from the bigram model
------------------------

In [None]:
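# (ngram, frequency) pairs, most frequent first; built as lists so they can be indexed and reused
unigrams = [(ngram, freq) for ngram, freq in ngrams_freq.most_common() if len(ngram) == 1]
bigrams = [(ngram, freq) for ngram, freq in ngrams_freq.most_common() if len(ngram) == 2]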

In [None]:
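from bisect import bisect
from random import random

class WeightedSampler(object):
    """Draws items at random with probability proportional to their frequency."""

    def __init__(self, freqs):
        # freqs is a list of (item, frequency) pairs; copy it so the caller's list is untouched
        self.freqs = list(freqs)
        self.normalize_probabilities()
        self.calculate_cdf()

    def normalize_probabilities(self):
        # turn raw frequencies into probabilities that sum to 1
        s = float(sum(f for k, f in self.freqs))
        self.freqs = [(k, f / s) for k, f in self.freqs]

    def calculate_cdf(self):
        # running sum of the probabilities (cumulative distribution function)
        self.cdf = []
        total = 0.0
        for k, prob in self.freqs:
            total += prob
            self.cdf.append(total)

    def random_choice(self):
        # find the first CDF entry above a uniform random number;
        # min() guards against rounding leaving the last entry slightly below 1
        i = min(bisect(self.cdf, random()), len(self.freqs) - 1)
        return self.freqs[i][0]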

In [None]:
random_sentence = "hi".split()

while True:
    # bigrams whose first word is the last word generated so far
    bigrams_ = [t for t in bigrams if t[0][0] == random_sentence[-1]]
    if not bigrams_:
        # the corpus has no continuation for this word, so stop here
        break
    bigrams_sampler = WeightedSampler(bigrams_)
    continuation = bigrams_sampler.random_choice()
    random_sentence.append(continuation[1])
    print(" ".join(random_sentence))
    if len(random_sentence) > 15:
        break
    print("")
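
The loop above rescans all bigrams for every generated word. As an alternative sketch, not the method used in this notebook, the continuations of each word could be grouped once and then sampled with `random.choices` (available from Python 3.6). The toy corpus and the names `build_bigram_model` and `generate` are hypothetical:

```python
import random
from collections import Counter, defaultdict

def build_bigram_model(tokenized_sentences):
    """Map each word to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for tokens in tokenized_sentences:
        for first, second in zip(tokens, tokens[1:]):
            model[first][second] += 1
    return model

def generate(model, start, max_words=15):
    """Extend `start` by sampling each next word in proportion to its bigram count."""
    sentence = [start]
    # stop at max_words or when the last word was never seen with a successor
    while len(sentence) < max_words and sentence[-1] in model:
        followers = model[sentence[-1]]
        next_word = random.choices(list(followers.keys()),
                                   weights=list(followers.values()))[0]
        sentence.append(next_word)
    return sentence

# Toy corpus, assumed for illustration only
toy_sentences = [["the", "cat", "sat", "."], ["the", "dog", "sat", "down", "."]]
toy_model = build_bigram_model(toy_sentences)
print(" ".join(generate(toy_model, "the")))
```

With the real corpus, `build_bigram_model(sentences_tokenized)` would play the role of the filtered `bigrams` list.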