# Language Models
* One traditional approach
* And one deep approach

----

## What is a Language Model?

### A model which tries to assess the likelihood of language

$P(W) = P(w_1, w_2, ..., w_n)$

or

$P(w_{t+1} \mid w_{t-n+1}, ..., w_{t})$
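
The two forms are linked by the chain rule. As a short sketch (the context length $k$ is a symbol introduced here for illustration, it is not in the original notes), the joint probability factorises exactly, and an n-gram model then approximates each factor by looking at only the last $k$ words:

$$P(w_1, \ldots, w_n) = \prod_{t=1}^{n} P(w_t \mid w_1, \ldots, w_{t-1}) \approx \prod_{t=1}^{n} P(w_t \mid w_{t-k}, \ldots, w_{t-1})$$

With $k = 1$ this is a bigram model, and with $k = 2$ a trigram model.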

### Three kinds of knowledge make a sentence more or less 'likely':
* Syntax - e.g. I go home vs I home go (even a bigram model trained on very limited data can largely 'solve' syntax)
* Semantics - I go home vs I go house
* Pragmatics - I go home vs I go home and 2+2=4

### All this knowledge needs to be captured inside the pairings of words with other words!

---

## A 'traditional' language model... for Sequence Generation

#### A bigram Markov chain

* bin the words into bins of size two (consecutive pairs)
* assess all the relationships within those pairs
* these pairs are called bi-grams
* n-gram models generalise this: tri-grams, quad-grams, etc.
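
To make the counting concrete, here is a minimal sketch on a toy corpus using only the standard library. The corpus and the variable names are made up for illustration; the notebook's own pandas implementation follows in the next cells.

```python
from collections import Counter, defaultdict

# Toy corpus (hypothetical), already lower-cased and tokenised
tokens = "i go home and i go out and i stay home".split()

# Count how often each word follows each other word
pair_counts = defaultdict(Counter)
for word, next_word in zip(tokens[:-1], tokens[1:]):
    pair_counts[word][next_word] += 1

# Normalise the counts per word to get P(next_word | word)
bigram_probs = {
    word: {nxt: count / sum(followers.values()) for nxt, count in followers.items()}
    for word, followers in pair_counts.items()
}

print(bigram_probs["i"])
print(bigram_probs["go"])
```

In this corpus "i" is followed by "go" twice and by "stay" once, so the model assigns "go" two thirds of the probability mass after "i".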

In [None]:
from sklearn.datasets import fetch_20newsgroups
import pandas as pd
import numpy as np
import re

In [None]:
#this number is the count of unique (word, next word) pairs in the dataset
#note: run this cell after bi_dist has been computed below
bi_dist.shape

In [None]:
def bigram_word_distribution(data):
    """create a probability distribution over all bigrams
    
        params: data - a Bunch data object from sklearn
        returns: Word probability distribution
    """

    text = data['data']
    all_data = ' '.join([' '.join(re.findall('(?u)\\b\\w\\w*\\b',article.lower())) for article in text]).split()
    words = pd.DataFrame({'words':all_data})
    words['next_words'] = words['words'].shift(-1)
    word_distribution = words.groupby('words')['next_words'].value_counts(normalize=True)
    
    return word_distribution

In [None]:
def bigram_text_generation(seed, length, distribution):
    """seed a distribution with a seed word, and ask it to make more words
        
        params: seed - A seed word, 
                length -Length of the generated sentence
                distribution - A word probability distribution
                
        returns: generated sentence
    """
    
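    seed = seed.lower()
    try:
        for _ in range(length):
            # condition on the last generated word only
            next_words = distribution[seed.split()[-1]]
            seed += ' ' + np.random.choice(next_words.index, p=next_words.values)
        return seed
    
    except (KeyError, IndexError):
        # the seed (or a generated word) has no recorded next word
        print('Oops! Try another seed')
        return None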

### Download text data

In [None]:
data = fetch_20newsgroups(remove=['headers', 'footers'])

### Calculate the bigram probabilities

In [None]:
bi_dist = bigram_word_distribution(data)

### Generate a sentence
* if the seed is not part of the dataset, the function prints a message and returns `None`

In [None]:
seed = 'because'

In [None]:
sentence_bigram = bigram_text_generation(seed, 20, bi_dist)

In [None]:
sentence_bigram

### How can we improve it?
* We're using bigram predictions, instead we can use trigram
* Take context from the previous 2 words instead of the previous word only

In [None]:
def trigram_word_distribution(data):
    """create a probability distribution over all trigrams
    
        params: data - a Bunch data object from sklearn
        returns: [Bigram probability distribution, trigram probability distribution]
    """
    
    text = data['data']
    all_data = ' '.join([' '.join(re.findall('(?u)\\b\\w\\w*\\b',article.lower())) for article in text]).split()
    tri_gram = [' '.join([x,y]) for x,y in zip(all_data[:-1:], all_data[1::])]
    next_word = all_data[2:] + [' '] * 1
    words = pd.DataFrame({'seed_word':all_data[:-1],'gram_words':tri_gram, "next_word":next_word})
    words['seed_next_word'] = words['seed_word'].shift(-1)
    seed_word_distribution = words.groupby('seed_word')['seed_next_word'].value_counts(normalize=True)
    gram_word_distribution = words.groupby('gram_words')['next_word'].value_counts(normalize=True)
    
    return [seed_word_distribution, gram_word_distribution]

In [None]:
tri_dist = trigram_word_distribution(data)

In [None]:
def trigram_text_generation(seed, length, distribution):
    """seed a distribution with a seed word, and ask it to make more words
        
        params: seed - A seed word, 
                length - Number of words to generate after the first two
                distribution - [bigram distribution, trigram distribution]
                               as returned by trigram_word_distribution
                
        returns: generated sentence
    """
    
    seed = seed.lower()
    try:
        # the first step only has one word of context, so use the bigram distribution
        first = distribution[0][seed]
        seed += ' ' + np.random.choice(first.index, p=first.values)
        for _ in range(length):
            # every later step conditions on the last two words
            following = distribution[1][' '.join(seed.split()[-2:])]
            seed += ' ' + np.random.choice(following.index, p=following.values)
        return seed
    
    except (KeyError, IndexError):
        print('Oops! Try another seed')
        return None

In [None]:
sentence_trigram = trigram_text_generation('because', 20, tri_dist)

In [None]:
sentence_bigram

In [None]:
sentence_trigram

* Trigrams usually read better than bigrams
* Quadgrams read WAAAY better than bigrams, as long as there is enough data to estimate them

## Challenges
* Less agile: it requires a lot more data to generalise
* Preprocessing tech debt: there is no punctuation in this model, so we need to model punctuation, e.g. map it to a `<PUNK>` token and build a punctuation transition matrix
* No flexibility for abbreviations (it is != it's) or synonyms (machine learning != data science)
* The size of the dictionary grows exponentially with n
* Sentence starts follow their own distribution and should be modelled separately
* Add start and end tokens to generate separable sentences
* Add other punctuation
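
The last few points can be sketched with a small tokeniser. This is an illustration only: the token names `<s>` and `</s>`, the function name and the naive sentence split are all assumptions, not part of the notebook's pipeline.

```python
import re

START, END = "<s>", "</s>"  # hypothetical token names

def tokenize_with_boundaries(text):
    """Split text into sentences, keep punctuation as tokens, add start/end markers."""
    tokens = []
    # naive sentence split: whitespace that follows ., ! or ?
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        if not sentence:
            continue
        # a word (optionally with an apostrophe, so "it's" stays whole) or one punctuation mark
        words = re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence.lower())
        tokens.extend([START, *words, END])
    return tokens

print(tokenize_with_boundaries("I go home. It's late, isn't it?"))
```

Feeding tokens like these into the bigram counts would let the model learn which words tend to start and end a sentence, and generation could stop when the end token is sampled.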

In [None]:
import matplotlib.pyplot as plt

In [None]:
plt.hist(np.random.normal(size=1000)) #the idea is to sample from what is likely

In [None]:
plt.hist(np.random.random(size=1000)) #this is not what we want in DS
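
The same contrast can be shown for words instead of numbers. This is a minimal sketch with made-up probabilities: weighted sampling follows the next-word distribution, while uniform sampling ignores it.

```python
import random
from collections import Counter

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Toy next-word distribution after "go" (hypothetical numbers)
next_words = ["home", "out", "house"]
probs = [0.7, 0.25, 0.05]

# Weighted sampling: likely words come up often, unlikely ones rarely
weighted = Counter(rng.choices(next_words, weights=probs, k=1000))

# Uniform sampling: every word is equally likely, the probabilities are ignored
uniform = Counter(rng.choices(next_words, k=1000))

print(weighted)
print(uniform)
```

The n-gram generators in this notebook do the weighted version through the `p` argument of `np.random.choice`.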

### Drawbacks:

* This model type requires a lot of computation to train, and a lot of space to store larger models
* N-grams are a sparse representation of language: any word or word sequence not present in the training corpus has zero probability of being used
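
Both drawbacks can be put into numbers. The sketch below uses the 73,747-word vocabulary reported in the model summary later in this notebook as an illustrative size, plus some made-up toy counts; the function name is an assumption.

```python
vocab_size = 73_747  # vocabulary size reported in the model summary later in this notebook

def possible_ngrams(vocab_size, n):
    """Upper bound on distinct n-grams: every combination of n vocabulary words."""
    return vocab_size ** n

for n in (1, 2, 3, 4):
    print(f"{n}-grams: up to {possible_ngrams(vocab_size, n):.2e}")

# An unseen pair has no count, so its maximum-likelihood probability is zero
counts = {("go", "home"): 3, ("go", "out"): 1}  # toy counts (hypothetical)
total_after_go = sum(c for (w, _), c in counts.items() if w == "go")
p_go_house = counts.get(("go", "house"), 0) / total_after_go
print(p_go_house)
```

Only a tiny fraction of the possible n-grams ever appears in a corpus, which is why the table is sparse and why longer contexts need so much more data.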


---

# Better approach - Deep Language Models!
* Deep Language generation using LSTMs

In [1]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, BatchNormalization, Flatten, Bidirectional, LSTM, Dropout
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping
from tensorflow.keras.utils import to_categorical
from sklearn.datasets import fetch_20newsgroups
import numpy as np
import re

In [2]:
data = fetch_20newsgroups(remove=['headers', 'footers'])

In [3]:
# grab the text
text = data['data']

In [4]:
# create a continuous list of words
all_data = ' '.join([' '.join(re.findall('(?u)\\b[a-zA-Z]*\\b',article.lower())) for article in text]).split()

In [5]:
# work out the vocab list and size of vocab
vocab_list = sorted(list(set(all_data)))
n_vocab = len(vocab_list)

### NN and sklearn models require numbers not strings as input

In [6]:
#translate words to numbers
word_to_num = {}
num_to_word = {}
for i, word in enumerate(vocab_list):
    num_to_word[i] = word
    word_to_num[vocab_list[i]] = i

In [7]:
word_to_num['because'], num_to_word[5593]

(5593, 'because')

In [8]:
#encode the data: map each word to its integer id (the Embedding layer learns the vectors later)
embedded_data = [word_to_num[word] for word in all_data]

#### This is a feature required for NN / sklearn models - STANDARDIZED INPUT SHAPE
* each sentence will have length 10

In [9]:
#create the next word guess for each previous 10 words
X_data= []
y_data = []
seq_length=10
for i in range(len(embedded_data)-seq_length):
    X_data.append(embedded_data[i:i+seq_length])
    y_data.append(embedded_data[i+seq_length])
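
The same sliding window on a tiny sequence makes the input/target pairing easier to see. This is a sketch with made-up word ids and a window of 3 instead of 10; the helper name is an assumption.

```python
def make_windows(sequence, seq_length):
    """Return (inputs, targets): each input is seq_length items, the target is the item that follows."""
    inputs, targets = [], []
    for i in range(len(sequence) - seq_length):
        inputs.append(sequence[i:i + seq_length])
        targets.append(sequence[i + seq_length])
    return inputs, targets

toy_ids = [4, 8, 15, 16, 23, 42]  # hypothetical word ids
X_toy, y_toy = make_windows(toy_ids, seq_length=3)
for window, target in zip(X_toy, y_toy):
    print(window, "->", target)
```

A sequence of length N yields N - seq_length training examples, and neighbouring examples overlap in all but one word.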

In [10]:
# reshape the X and y data
X = np.array(X_data).reshape(len(X_data), seq_length)
y = to_categorical(y_data) #onehot encode our y data

#do not normalise X: the Embedding layer needs the raw integer word ids
#X = X / float(n_vocab)
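
One-hot encoding the targets is worth a quick back-of-the-envelope check. The sketch below assumes the matrix is stored densely as float32 and uses the sample count and softmax width reported elsewhere in this notebook; the function name is hypothetical.

```python
n_samples = 2_899_861   # rows of X, from X.shape in this notebook
n_classes = 73_747      # width of the softmax layer in the model summary

def dense_one_hot_gigabytes(n_samples, n_classes, bytes_per_value=4):
    """Approximate size of a dense one-hot matrix in gigabytes (10**9 bytes)."""
    return n_samples * n_classes * bytes_per_value / 1e9

print(f"{dense_one_hot_gigabytes(n_samples, n_classes):.0f} GB")
```

If that is more memory than is available, one option is to keep `y` as integer ids and compile the model with Keras's `sparse_categorical_crossentropy` loss, which accepts integer class labels directly. The output layer would then be sized from `n_vocab` rather than `y.shape[1]`.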

In [11]:
X.shape

(2899861, 10)

### Build the model

In [12]:
model = Sequential()
model.add(Embedding(input_dim=n_vocab, output_dim=32, input_length=seq_length))
model.add(Dropout(0.2))
model.add(Bidirectional(LSTM(128, return_sequences=True), merge_mode='sum'))
model.add(LSTM(128))
model.add(BatchNormalization(axis=-1, momentum=0.99, epsilon=1e-3, center=True, scale=True))
model.add(Dense(y.shape[1], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

#### Save our best version of the model

In [13]:
filepath = "best_weights.hdf5"
early_stop = EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=15, min_delta=0.0001)
# mode='min': a lower val_loss is better, so keep the weights with the smallest value
checkpoint = ModelCheckpoint(filepath, monitor='val_loss', verbose=1, save_best_only=True, save_weights_only=True, mode='min')
callbacks = [early_stop, checkpoint]

In [14]:
model.summary() # i know the model is big, and the size is in the 'wrong' part

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 10, 32)            2359904   
_________________________________________________________________
dropout (Dropout)            (None, 10, 32)            0         
_________________________________________________________________
bidirectional (Bidirectional (None, 10, 128)           164864    
_________________________________________________________________
lstm_1 (LSTM)                (None, 128)               131584    
_________________________________________________________________
batch_normalization (BatchNo (None, 128)               512       
_________________________________________________________________
dense (Dense)                (None, 73747)             9513363   
Total params: 12,170,227
Trainable params: 12,169,971
Non-trainable params: 256
__________________________________________
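
The parameter counts in the summary can be reproduced by hand, which shows where the size sits. This is a sketch using the standard LSTM parameter formula and the layer sizes from the model above; the helper name is an assumption.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * ((input_dim + units) * units + units)

n_vocab, embed_dim, units = 73_747, 32, 128

layer_params = {
    "embedding": n_vocab * embed_dim,
    "bidirectional": 2 * lstm_params(embed_dim, units),  # forward + backward LSTM
    "lstm_1": lstm_params(units, units),                  # merge_mode='sum' keeps the width at 128
    "batch_normalization": 4 * units,                     # gamma, beta, moving mean, moving variance
    "dense": units * n_vocab + n_vocab,                   # weights + biases
}

total = sum(layer_params.values())
for name, count in layer_params.items():
    print(f"{name:20s} {count:>10,d} {count / total:6.1%}")
print(f"{'total':20s} {total:>10,d}")
```

Most of the parameters sit in the vocabulary-sized embedding and softmax layers rather than in the recurrent layers, so shrinking the vocabulary is the most direct way to shrink the model.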

In [15]:
# fit the model
history = model.fit(X, y, epochs=20, batch_size=512, validation_split=0.2, callbacks=callbacks)

Epoch 1/20
  66/4532 [..............................] - ETA: 50:32 - loss: 10.8641 - accuracy: 0.0444

KeyboardInterrupt: 

In [None]:
# load a saved model
# filename = "weights_01.hdf5"
# model.load_weights(filename)
# model.compile(loss='categorical_crossentropy', optimizer='adam')

### For a given input string generate some new text

In [None]:
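def prepare_input(seed_input):
    """prepare a string for the LSTM: map each word to its id, shape (1, n_words, 1)"""
    
    words = seed_input.lower().split()
    try:
        return np.expand_dims(np.array([word_to_num[x] for x in words]).reshape(-1,1),axis=0)
    except KeyError as err:
        # fail loudly instead of returning a string that breaks generate_text later
        raise ValueError(f'word {err} is not in the vocabulary, please try with different words') from err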

In [None]:
def generate_text(input_string):
    """generate some new text as a string"""
    
    seed = prepare_input(input_string)
    for i in range(10):
        # predict the next word from the window of the 10 most recent words (greedy argmax),
        # then append it to the sequence
        next_word = np.argmax(model.predict(seed[:,i:,:])).reshape(1,-1,1)
        seed = np.append(seed,next_word,axis=1)

    return ' '.join([num_to_word[x] for x in seed[0,:,0]])

In [None]:
# seed has to be 10 words not 1 word
input_string = 'space jupiter mars pluto space space earth mars pluto space'
len(input_string.split())

In [None]:
generate_text(input_string)

----

#### How to improve?
* I trained it for 1 epoch, it will need around 50!
* I estimate 2-6 hours to train on a good GPU

---