## Welcome to ÅF Competence Evening (Machine Learning, NLP)
This is part one of a four- or five-part series in which we'll go through **text generation**.


### Text Generation
The goal speaks for itself, but to get there we need some kind of *Language Model*.

#### Language Model
In its simplest definition, a Language Model assigns a probability to a sequence of words as a whole.

Think of the fragment `ello`: where is it heading? `hello`, or should we continue and build `mellow` perhaps? The human brain is really good at understanding the context and filling in the blanks. Depending on how much "history" we have, it is easier or harder to guess. The same applies when we estimate with Maximum Likelihood.

We can also apply this on a word level, meaning that if we have "*How are you WORD*" we would most likely guess "*WORD*" to be "*doing*".

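The conclusion is that we need to count N-grams & turn the counts into statistics, using Markov Chains or something like it. For a sequence of words $w = w_1 \dots w_k$ we get the following:  
Bigram model: $p(w) = \prod_{i=1}^{k+1} p(w_i|w_{i-1})$  
Here we read $w_0$ and $w_{k+1}$ as start and end markers, which is why the product runs to $k+1$.

To find the probability of a word given its history, we count how often the word follows that history and divide by how often the history occurs, where $c(\cdot)$ is a count in the training text:

Estimate probabilities: $p(w_i|w_{i-1})=\frac{c(w_{i-1}w_i)}{c(w_{i-1})}$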

We can expand this concept to apply to N-grams too. 
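
To make the estimate concrete, here is a minimal sketch that applies the counting formula to a made-up toy corpus. The sentence and the helper name `bigram_prob` are assumptions for illustration only, not part of the workshop code.

```python
from collections import Counter

# Toy corpus (made up for illustration), already split into words.
tokens = "how are you doing how are you today how are things".split()

# c(w_{i-1}): how often each word occurs as a history, i.e. has a successor.
history_counts = Counter(tokens[:-1])
# c(w_{i-1} w_i): how often each pair of adjacent words occurs.
bigram_counts = Counter(zip(tokens[:-1], tokens[1:]))

def bigram_prob(prev, word):
    """Maximum-likelihood estimate of p(word | prev)."""
    if history_counts[prev] == 0:
        return 0.0  # unseen history: the estimate is undefined, we return 0 here
    return bigram_counts[(prev, word)] / history_counts[prev]

print(bigram_prob("are", "you"))    # 2 of the 3 "are" are followed by "you"
print(bigram_prob("you", "doing"))  # 1 of the 2 "you" is followed by "doing"
```

Note that any pair we never saw gets probability zero, which is exactly the weakness that smoothing addresses later on.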


First we import the needed modules

In [0]:
from collections import Counter, defaultdict
from random import random
import string
import numpy as np
import spacy
import pandas as pd

nlp = spacy.load('en')
nlp.max_length=5576562
PADDING = "~"

In [0]:
# Can be exchanged for other inputs.
# e.g. https://github.com/ashwinmj/word-prediction/blob/master/eminem_songs_lyrics.txt
!wget http://cs.stanford.edu/people/karpathy/char-rnn/shakespeare_input.txt

In [0]:
print(open('shakespeare_input.txt', 'r').read()[:250])

### Training
We first define training. We want to read a file & keep a certain length of "memory" (the order, $n$).  
We'll train with order (memory) = $4$ first, and try other orders when we generate text.

### Data

Let's start off with the classic - Shakespeare. 

In [0]:
def normalize(counter):
    s = float(sum(counter.values()))
    return [(c, cnt / s) for c, cnt in counter.items()]

In [0]:
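def train_char_lm(fname, order=4):
    """Count, for every `order`-character history, which character follows it."""
    with open(fname, 'r') as f:
        data = f.read()

    lm = defaultdict(Counter)
    # Pad the start so the first real characters also get a full history.
    data = PADDING * order + data
    for i in range(len(data) - order):
        history, char = data[i:i+order], data[i+order]
        lm[history][char] += 1

    # Turn the raw counts into probabilities per history.
    return {hist: normalize(chars) for hist, chars in lm.items()}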

In [0]:
lm = train_char_lm("shakespeare_input.txt", order=4)

Let's test the Language Model (lm). 

In [0]:
lm['ello']

In [0]:
lm['Firs']

What do we learn from this?
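
The raw lists come back in the order the characters were first seen, which makes them hard to scan. A small helper can sort them by probability. This is only a sketch: `top_continuations` and the miniature `toy_lm` are hypothetical, the latter made up to have the same shape as the output of `train_char_lm`.

```python
def top_continuations(lm, history, k=5):
    """Return the k most likely next symbols after `history`, best first."""
    dist = lm.get(history, [])  # an unseen history gives an empty list
    return sorted(dist, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical miniature model: history -> list of (char, probability).
toy_lm = {"ello": [("w", 0.6), (" ", 0.1), ("!", 0.3)]}

print(top_continuations(toy_lm, "ello", k=2))
print(top_continuations(toy_lm, "xyzq"))
```

With the real model we would call `top_continuations(lm, 'ello')` instead.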

### Generating text
Now to the fun part. We want to generate text!

To generate text we'll generate one letter (character) at a time. We look at the last `order` characters of the history and sample the next letter from the distribution the model has learned for them.

In [0]:
def generate_letter(lm, history, order):
    history = history[-order:]
    dist = lm[history]
    x = random()
    for c, v in dist:
        x -= v  # walk the cumulative distribution until it passes x
        if x <= 0:
            return c
    return dist[-1][0]  # floating-point round-off can leave x slightly above 0
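
The loop in `generate_letter` draws a uniform random number and subtracts probabilities until it is used up, so each character is picked in proportion to its probability. A quick sanity check, sketched on a made-up distribution: draw many samples and compare the empirical frequencies with the probabilities. The function name `sample` and the numbers are assumptions for illustration.

```python
from collections import Counter
from random import random, choices, seed

def sample(dist):
    """Draw one symbol from a list of (symbol, probability) pairs."""
    x = random()
    for symbol, prob in dist:
        x -= prob
        if x <= 0:
            return symbol
    return dist[-1][0]  # guard against floating-point round-off

seed(0)  # fixed seed so the run is repeatable
dist = [("a", 0.5), ("b", 0.3), ("c", 0.2)]  # made-up distribution
n = 100_000
freq = Counter(sample(dist) for _ in range(n))
for symbol, prob in dist:
    print(symbol, prob, round(freq[symbol] / n, 3))

# The standard library offers the same kind of weighted draw directly:
symbols, weights = zip(*dist)
print(choices(symbols, weights=weights, k=5))
```

The empirical frequencies should land close to 0.5, 0.3 and 0.2.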

But generating letters doesn't make a text, we need something that glues this together. We need to generate the text out of the letters.

In [0]:
def generate_text(lm, order, nletters=1000):
    history = PADDING * order
    out = []
    for i in range(nletters):
        c = generate_letter(lm, history, order)
        history = history[-order:] + c
        out.append(c)
    return "".join(out)

### Order = 2
Let's give it a try!

In [0]:
lm = train_char_lm("shakespeare_input.txt", order=2)
print(generate_text(lm, 2))

### Order = 4 

In [0]:
lm = train_char_lm("shakespeare_input.txt", order=4)
print(generate_text(lm, 4))

### Order = 7

In [0]:
lm = train_char_lm("shakespeare_input.txt", order=7)
print(generate_text(lm, 7))

### Order = 10

In [0]:
lm = train_char_lm("shakespeare_input.txt", order=10)
print(generate_text(lm, 10))

### Conclusion of step 1
We find that at order = 4 we already get reasonable results, and they only get better with higher order.

However, this kind of generation cannot handle anything it has not seen: every history must occur in the training text, so at high order the model mostly reproduces passages from it. This is "so-so".  
With the famous LSTM (RNN) char2char we can generate new texts & with its memory it can remember to open and close parentheses, to know that a birthdate is often connected with a location and so on. An LSTM can remember for a long time, as we will see in the (probably) next workshop.
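
To see the "can only generate history" limitation in action, here is a self-contained sketch on a made-up toy text. The helpers `train` and `generate` are simplified stand-ins for the functions above, not the exact workshop code. It checks that every window of `order + 1` characters in the output already exists somewhere in the training text.

```python
from collections import Counter, defaultdict
from random import random, seed

def train(text, order):
    """Count which character follows each `order`-character history."""
    lm = defaultdict(Counter)
    padded = "~" * order + text
    for i in range(len(padded) - order):
        lm[padded[i:i + order]][padded[i + order]] += 1
    return lm

def generate(lm, order, n):
    """Generate up to n characters by sampling in proportion to the counts."""
    history, out = "~" * order, []
    for _ in range(n):
        counts = lm.get(history)
        if not counts:  # dead end: this history was never followed by anything
            break
        x = random() * sum(counts.values())
        for char, count in counts.items():
            x -= count
            if x <= 0:
                break
        out.append(char)
        history = (history + char)[-order:]
    return "".join(out)

seed(1)
text = "the cat sat on the mat. the cat ate the rat. "  # made-up training text
order = 3
sample = generate(train(text, order), order, 200)

windows = {sample[i:i + order + 1] for i in range(len(sample) - order)}
seen = {text[i:i + order + 1] for i in range(len(text) - order)}
print(sample)
print(windows <= seen)  # True: nothing in the output is new at this window size
```

The model can recombine pieces in a new order, but it can never produce a character sequence of length `order + 1` that it has not already seen.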


### Word by Word Text Generation
Let's turn this around and generate words instead. Will it work?  
Of course it will when we're driving the vehicle!  

We want to learn a function $P(w|h)$. Here, $w$ is a word, $h$ is an n-word history, and $P(w|h)$ stands for how likely it is to see $w$ after we've seen $h$. See the earlier explanation of the bigram model.

#### Preprocessing (over & over again!)


In [0]:
def tokenize(text):
  doc = nlp(text,  disable=['parser', 'tagger', 'ner'])
  return [str(token) for token in doc]

def preprocess(text):
  return str(text).lower()
  
def pandas_preprocess(dataframe):
  dataframe = dataframe.applymap(preprocess)
  return dataframe

#### Training word level generation language model
First, just as with character-level, we need to train the model (count words that is).  
The same technique is applied as with character-level generation, with the difference that we now count words. We also need more data for good generation.

In [0]:
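def generate_word(lm, history, order):
    history = history[-order:]
    history_key = ' '.join(history)
    dist = lm[history_key]
    x = random()
    for c, v in dist:
        x = x - v  # walk the cumulative distribution until it passes x
        if x <= 0:
            return c
    return dist[-1][0]  # floating-point round-off can leave x slightly above 0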

In [0]:
def generate_text_word(lm, order, nletters=25):
    history = [PADDING] * order
    out = []
    for i in range(nletters):
        c = generate_word(lm, history, order)
        history = history[-order:] + [c]
        out.append(c)
    return ' '.join(out)

In [0]:
def train_word_lm(fname, order=2):
    with open(fname, 'r') as f:
        data = f.read()
        words = tokenize(data)
        lm = defaultdict(Counter)
        pad = [PADDING] * order
        data = pad + words
        for i in range(len(data)-order):
            history, word = data[i:i+order], data[i+order]
            lm[' '.join(history)][word] += 1

        outlm = {hist: normalize(words) for hist, words in lm.items()}
        return outlm


In [0]:
order = 4

In [0]:
lm = train_word_lm("shakespeare_input.txt", order)

In [0]:
print(generate_text_word(lm, order))

In [0]:
# Difference to Markov Chain - https://stackoverflow.com/a/24419604
# https://blog.dataiku.com/2016/10/08/machine-learning-markov-chains-generate-clinton-trump-quotes
# With c2c add <sos> & <eos>, perhaps to word too.

## NLTK
NLTK is one of the important libraries for anyone who works with text. It contains a lot of tooling that can simplify our lives, so let's try to reimplement this using NLTK & its corpora.

Project Gutenberg is a famous archive containing about 25 000 e-books. NLTK ships a small selection of them as its `gutenberg` corpus, including Shakespeare, Jane Austen etc.

In [0]:
import nltk
nltk.download('gutenberg')
nltk.download('punkt')
from nltk.util import ngrams
from nltk.corpus import gutenberg
from nltk import FreqDist

When using NLTK corpora that are already processed and beautiful we get some bonuses: we can extract sentences, words or raw text.  
The sentences are already tokenized & ready for use, which is pretty damn awesome.

In [0]:
print(gutenberg.sents()[:10])

As our text is tokenized, let's start off by using what Gutenberg gives us. We can apply `lowercase` later in the process.

In [0]:
def train_word_lm_nltk(order=3):
  # Pad every sentence so its first and last words also get a full history.
  gut_ngrams = (ngram for sent in gutenberg.sents()
                for ngram in ngrams(sent, order, pad_left=True, pad_right=True,
                                    left_pad_symbol="BOS", right_pad_symbol="EOS"))
  ngram_prob = defaultdict(Counter)
  for ngram in gut_ngrams:
      # History = all words but the last, joined the same way as in train_word_lm.
      ngram_prob[' '.join(ngram[:-1])][ngram[-1]] += 1
      # ngram_prob[ngram[0]][ngram[2]] += 1 <-- BackOff
  outlm = {hist: normalize(words) for hist, words in ngram_prob.items()}
  return outlm
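
To see what padded n-grams look like and how each one splits into a history and a next word, here is a dependency-free sketch. `padded_ngrams` is a hypothetical helper that mimics, as far as we understand it, the padding behaviour of `nltk.util.ngrams`. It is an illustration, not the NLTK implementation.

```python
def padded_ngrams(tokens, order, bos="BOS", eos="EOS"):
    """Yield n-grams over `tokens` with order-1 padding symbols on each side."""
    padded = [bos] * (order - 1) + list(tokens) + [eos] * (order - 1)
    for i in range(len(padded) - order + 1):
        yield tuple(padded[i:i + order])

# One made-up sentence, split into (history, next word) pairs.
for ngram in padded_ngrams(["Who", "are", "you"], order=3):
    history, word = " ".join(ngram[:-1]), ngram[-1]
    print(f"{history!r} -> {word!r}")
```

With `order=3` the history has two words, so generation from such a model would start from the history `BOS BOS` rather than from the `~` padding used earlier.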

### Improvements
Okay, this is all good. We can now generate text and we can easily swap the data that we use (just a `.txt` file).  
The first improvement we can make is called _smoothing_. 

#### Smoothing
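Smoothing does just what the name suggests: we smooth the probability distribution, moving a little probability mass from what we have seen to what we have not. In other words, unseen N-grams and OOV (out-of-vocabulary) words no longer get probability zero. This is incredibly important and can help us generate much better text.  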

**Laplacian Smoothing**  
The simplest approach, very naïve. There are two variants, _add-one smoothing_ and _add-k smoothing_, where a constant (1 or k) is added to every count before normalizing.
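
A minimal sketch of add-k smoothing for a single history. The counts, the vocabulary and the helper name `add_k_prob` are made up for illustration, and we assume a closed vocabulary of known size.

```python
from collections import Counter

def add_k_prob(counts, word, vocab_size, k=1.0):
    """Add-k estimate of p(word | history) from the history's follower counts."""
    total = sum(counts.values())
    return (counts[word] + k) / (total + k * vocab_size)

# Made-up follower counts for one history, e.g. "how are":
followers = Counter({"you": 8, "they": 2})
vocab = {"you", "they", "we", "things"}  # assumed closed vocabulary

for w in sorted(vocab):
    print(w, round(add_k_prob(followers, w, len(vocab), k=1.0), 3))
```

Words that never followed the history ("we", "things") now get a small non-zero probability, and the probabilities over the vocabulary still sum to one. With `k=1.0` this is add-one smoothing.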


**Katz Back-off**  
Longer N-grams are better, but if a given N-gram doesn't exist we back off to a shorter one.

**Interpolation Smoothing**  
Combine the estimates from several N-gram orders (e.g. unigram, bigram and trigram) as a weighted sum to get the total probability.
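
A minimal sketch of interpolation between a bigram and a unigram estimate on a made-up toy corpus. The weight `lam=0.7` is an arbitrary assumption (in practice such weights are usually tuned on held-out data), and `interpolated_prob` is a hypothetical helper name.

```python
from collections import Counter

tokens = "how are you doing how are you today how are things".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens[:-1], tokens[1:]))
histories = Counter(tokens[:-1])

def interpolated_prob(prev, word, lam=0.7):
    """lam * p_bigram(word | prev) + (1 - lam) * p_unigram(word)."""
    p_uni = unigrams[word] / len(tokens)
    p_bi = bigrams[(prev, word)] / histories[prev] if histories[prev] else 0.0
    return lam * p_bi + (1 - lam) * p_uni

print(interpolated_prob("are", "you"))    # seen bigram
print(interpolated_prob("are", "doing"))  # unseen bigram, still non-zero
```

Because both component estimates sum to one over the vocabulary for a seen history, the weighted mix does too.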

**Kneser-Ney Smoothing**  
Most popular one, but hard to implement correctly.  
![alt text](https://cdn-images-1.medium.com/max/800/1*pMttoEXAH_GS9d6AtkhF2g.png)  
A very good explanation can be found in this [blog](https://medium.com/@seccon/a-simple-numerical-example-for-kneser-ney-smoothing-nlp-4600addf38b8)

#### Smoothing by UNK
Another approach could be to replace the most uncommon words with an UNK token. These could for example be names, if they're rare. UNK itself then becomes a fairly common token, and as such we might generate "*UNK was a man of honor*".
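
A minimal sketch of the UNK replacement step. The threshold `min_count=2`, the helper name `replace_rare` and the example sentence are assumptions for illustration.

```python
from collections import Counter

def replace_rare(tokens, min_count=2, unk="UNK"):
    """Replace every token seen fewer than `min_count` times with `unk`."""
    counts = Counter(tokens)
    return [tok if counts[tok] >= min_count else unk for tok in tokens]

# Made-up example where the names and a few other words occur only once.
tokens = "romeo was a man of honor . tybalt was a man of anger .".split()
print(" ".join(replace_rare(tokens)))
# UNK was a man of UNK . UNK was a man of UNK .
```

After this step we would train the word-level model on the replaced tokens, so that UNK gets its own counts like any other word.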

## Smooth by Backoff & UNK
Let's implement smoothing using Backoff & UNK-token.

In [0]:
# BackOff = if < X options for a bigram, choose the unigram prob
# UNK = find the least common words, replace them with UNK and redo.
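
One way the back-off rule from the comment above could look, sketched on a made-up toy corpus. This is the simple "too few options, use the unigram counts" rule, not full Katz back-off (which also discounts the counts). The names `train_backoff`, `next_word` and the threshold `min_options` are assumptions.

```python
from collections import Counter, defaultdict
from random import choices, seed

def train_backoff(tokens):
    """Return unigram counts and, per word, the counts of its followers."""
    unigram = Counter(tokens)
    bigram = defaultdict(Counter)
    for prev, word in zip(tokens[:-1], tokens[1:]):
        bigram[prev][word] += 1
    return unigram, bigram

def next_word(unigram, bigram, prev, min_options=2):
    """Sample from the followers of `prev`, or from the unigram counts
    if `prev` has fewer than `min_options` distinct followers."""
    counts = bigram.get(prev, Counter())
    if len(counts) < min_options:
        counts = unigram  # back off to the shorter history
    words = list(counts)
    return choices(words, weights=[counts[w] for w in words], k=1)[0]

seed(0)  # fixed seed so the run is repeatable
tokens = "how are you doing how are you today how are things".split()
unigram, bigram = train_backoff(tokens)

out = ["how"]
for _ in range(8):
    out.append(next_word(unigram, bigram, out[-1]))
print(" ".join(out))
```

Combined with UNK replacement before training, this would also give the generator something to do with words it has rarely or never seen.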

### Tips on fun things to do at home until next time
Markovify: https://github.com/jsvine/markovify

This is basically what we've done.

### Harder improvements
Create a Hidden Markov Model that also makes use of the POS.

## What's in store for future sessions?


*   Neural Networks (will improve result)
*   State-of-the-Art Neural Network using GPT-2 & transfer learning
*  Generate something really fun (Trump tweets, Rap songs or whatever we decide)
*  Deploying a model
*  (If people want to: text generation by selecting texts via Word Embedding & such. Example: Zac_the_second_bot)  



If all agree & don't have something else they'd prefer.

In [0]:
import nltk
from nltk.util import ngrams
from nltk.corpus import gutenberg

gut_ngrams = ( ngram for sent in gutenberg.sents() for ngram in ngrams(sent, 3, pad_left = True, pad_right = True, right_pad_symbol='EOS', left_pad_symbol="BOS"))
#print(list(gut_ngrams)[:5])
freq_dist = nltk.FreqDist(gut_ngrams)
print(freq_dist.keys())
kneser_ney = nltk.KneserNeyProbDist(freq_dist)

prob_sum = 0
for i in kneser_ney.samples():
    if i[0] == "Who" and i[1] == "are":
        prob_sum += kneser_ney.prob(i)
        print("{0}:{1}".format(i, kneser_ney.prob(i)))
print(prob_sum)