# Language modelling

In this tutorial we will use two character-level language models to create dinosaur names. These models are:
* bigram model
* recurrent neural network model.

Why are they different?

![picture](https://vignette.wikia.nocookie.net/uncyclopedia/images/7/73/Hankk_the_dino.png/revision/latest?cb=20100127020302)

In [0]:
from google.colab import files
uploaded = files.upload()

Saving dinos.txt to dinos.txt


## Bigram model

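Read the dinosaur names, lowercase them, and wrap each one in a start symbol `<` and an end symbol `>`. If you do not have NLTK installed, run the `!pip install nltk` cell that follows.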

In [0]:
with open('dinos.txt') as f:
    names = ['<' + name.strip().lower() + '>' for name in f]
print(names[:10])

['<aachenosaurus>', '<aardonyx>', '<abdallahsaurus>', '<abelisaurus>', '<abrictosaurus>', '<abrosaurus>', '<abydosaurus>', '<acanthopholis>', '<achelousaurus>', '<acheroraptor>']


In [0]:
!pip install nltk



In [0]:
import nltk

**Step 1.** Compute frequency of each character

In [0]:
chars = [char  for name in names for char in name]
freq = nltk.FreqDist(chars)

In [0]:
print(list(freq.keys()))

['<', 'a', 'c', 'h', 'e', 'n', 'o', 's', 'u', 'r', '>', 'd', 'y', 'x', 'b', 'l', 'i', 't', 'p', 'v', 'm', 'g', 'f', 'j', 'k', 'w', 'z', 'q']


In [0]:
freq.most_common(10)

[('a', 2487),
 ('s', 2285),
 ('u', 2123),
 ('o', 1710),
 ('r', 1704),
 ('<', 1536),
 ('>', 1536),
 ('n', 1081),
 ('i', 944),
 ('e', 913)]

Define a function to estimate the probability of a character

In [0]:
l = sum([freq[char] for char in freq])
def unigram_prob(char):
    return freq[char] / l

In [0]:
print('p(a) = %1.4f' %unigram_prob('a'))

p(a) = 0.1160


**Step 2.** Compute the frequency of each character conditioned on the previous one

In [0]:
cfreq = nltk.ConditionalFreqDist(nltk.bigrams(chars))

In [0]:
# number of times each character occurs AFTER 'a'
cfreq['a']

FreqDist({'>': 138,
          'a': 11,
          'b': 24,
          'c': 100,
          'd': 36,
          'e': 42,
          'f': 6,
          'g': 40,
          'h': 17,
          'i': 23,
          'j': 5,
          'k': 20,
          'l': 138,
          'm': 68,
          'n': 347,
          'o': 22,
          'p': 89,
          'q': 3,
          'r': 124,
          's': 171,
          't': 204,
          'u': 791,
          'v': 30,
          'w': 6,
          'x': 12,
          'y': 12,
          'z': 8})

**Step 3.** Use MLE to estimate conditional probabilities

In [0]:
cprob = nltk.ConditionalProbDist(cfreq, nltk.MLEProbDist)

In [0]:
print('p(a | a) = %1.4f' %cprob['a'].prob('a'))
print('p(b | a) = %1.4f' %cprob['a'].prob('b'))
print('p(u | a) = %1.4f' %cprob['a'].prob('u'))

p(a | a) = 0.0044
p(b | a) = 0.0097
p(u | a) = 0.3181


We can use cprob to generate next characters
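
As a rough, dependency-free sketch of what drawing the next character from a conditional distribution amounts to: count how often each character follows the previous one, then draw in proportion to those counts. The toy name list and the helper names `bigram_counts` and `next_char` are illustrative assumptions and are not part of NLTK.

```python
import random
from collections import Counter, defaultdict

# Toy data (assumption): three names wrapped in start/end symbols.
toy_names = ['<aardonyx>', '<abelisaurus>', '<abrosaurus>']

def bigram_counts(names):
    """Count how often each character follows each other character."""
    counts = defaultdict(Counter)
    for name in names:
        for prev, nxt in zip(name, name[1:]):
            counts[prev][nxt] += 1
    return counts

def next_char(counts, prev, rng=random):
    """Draw the next character with probability count(prev, next) / count(prev)."""
    chars = list(counts[prev])
    weights = [counts[prev][c] for c in chars]
    return rng.choices(chars, weights=weights, k=1)[0]

counts = bigram_counts(toy_names)
total_after_a = sum(counts['a'].values())
# MLE estimate of p(b | a) on the toy data
p_b_given_a = counts['a']['b'] / total_after_a
print('p(b | a) = %1.4f' % p_b_given_a)
print(next_char(counts, '<'))
```

On this toy data every name starts with `a`, so the character drawn after `<` is always `a`; with the full name list the draw varies from call to call.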

### Task 1.
a. Write a function to generate a dinosaur name of **fixed** length. Use '<' as the start-of-name symbol.

b. Write a function to generate a dinosaur name of any length.

In [0]:
## let's omit the fixed-length function for now

# HW, Problem 3.1

### a) dinosaur name of any length


In [0]:
def dinosaur_name_any_length(cprob = cprob):
  # start from the start-of-name symbol and sample each next character
  # conditioned on the previous one until the end-of-name symbol appears
  new_char = "<"
  dino_name = new_char

  while new_char != ">":
    new_char = cprob[new_char].generate()
    dino_name = dino_name + new_char
  return dino_name


name = dinosaur_name_any_length(cprob = cprob)
print(name)

<gnosarusatiaurus>


### b) Add add-one smoothing. What is the difference?

In [0]:
## add-1 smoothing
cprob_add_one_smoothing = nltk.ConditionalProbDist(cfreq, nltk.LaplaceProbDist)
name = dinosaur_name_any_length(cprob = cprob_add_one_smoothing)
print(name)

<aclelodelis>


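With add-one smoothing every bigram receives a non-zero probability, i.e. P(letter 2 | letter 1) > 0 for any letters 1 and 2, so in principle the model can produce letter combinations that never occur in the training data. The price is that some probability mass moves from observed bigrams to unobserved ones.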
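
To make the effect concrete, here is a hand-rolled sketch of the add-one estimate, P(next | prev) = (count(prev, next) + 1) / (count(prev) + V), where V is the vocabulary size. It does not use NLTK; the toy counts and the function names `mle_prob` and `add_one_prob` are assumptions made for the example.

```python
from collections import Counter

# Toy counts of characters observed after 'q' (assumption, not the real data).
after_q = Counter({'u': 3})
vocab = list('abcdefghijklmnopqrstuvwxyz') + ['<', '>']

def mle_prob(counts, char):
    """Maximum-likelihood estimate: unseen continuations get probability 0."""
    return counts[char] / sum(counts.values())

def add_one_prob(counts, char, vocab_size):
    """Add-one (Laplace) estimate: every continuation gets a pseudo-count of 1."""
    return (counts[char] + 1) / (sum(counts.values()) + vocab_size)

print(mle_prob(after_q, 'a'), add_one_prob(after_q, 'a', len(vocab)))
print(mle_prob(after_q, 'u'), add_one_prob(after_q, 'u', len(vocab)))
```

The smoothed distribution still sums to one over the vocabulary, but the probability of the only observed continuation drops from 1 to 4/31.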

## Recurrent neural networks

Input:

$x_{1:n} = x_1, x_2, \ldots, x_n$, $x_i \in \mathbb{R}^{d_{in}}$

For each input $x_{1:i}$ we get an output $y_i$:

$y_i = RNN(x_{1:i})$, $y_i \in \mathbb{R}^{d_{out}}$

For the whole input sequence $x_{1:n}$:

$y_{1:n} = RNN^{*}(x_{1:n})$, $y_i \in \mathbb{R}^{d_{out}}$

$R$ is a recursive activation function with two inputs: $x_i$ and $s_{i-1}$ (state vector)

$RNN^{*}(x_{1:n}, s_0) = y_{1:n}$

$y_i = O(s_i) = g(W^{out}[s_{i} ,x_i] +b)$

$s_i = R(s_{i-1}, x_i)$

$s_i = R(s_{i-1}, x_i) = g(W^{hid}[s_{i-1} ,x_i] +b)$  -- concatenate $[s_{i-1}, x]$

$x_i \in \mathbb{R}^{d_{in}}$, $y_i \in \mathbb{R}^{ d_{out}}$, $s_i \in \mathbb{R}^{d_{hid}}$

$W^{hid} \in \mathbb{R}^{(d_{hid}+d_{in}) \times d_{hid}}$, $W^{out} \in \mathbb{R}^{(d_{hid}+d_{in}) \times d_{out}}$
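
The recurrence can be sketched in plain Python to make the shapes explicit. This is only an illustration under assumptions: $g$ is taken to be tanh for the state, the output is left as raw scores, weights are stored with one row per output unit, and the tiny sizes and the helper names `rand_matrix`, `affine` and `rnn_step` are made up for the example.

```python
import math
import random

d_in, d_hid, d_out = 3, 2, 3  # toy sizes (assumption)
rng = random.Random(0)

def rand_matrix(rows, cols):
    """Small random weights, one row per output unit."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

W_hid = rand_matrix(d_hid, d_hid + d_in)  # acts on the concatenation [s_prev, x]
b_hid = [0.0] * d_hid
W_out = rand_matrix(d_out, d_hid + d_in)  # acts on the concatenation [s, x]
b_out = [0.0] * d_out

def affine(W, b, v):
    """Compute W v + b for a list-of-rows matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]

def rnn_step(s_prev, x):
    """One step: s_i = tanh(W_hid [s_prev, x] + b), y_i = W_out [s_i, x] + b."""
    s = [math.tanh(z) for z in affine(W_hid, b_hid, s_prev + x)]
    y = affine(W_out, b_out, s + x)
    return s, y

# Run over a sequence of one-hot inputs, carrying the state forward.
xs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
s = [0.0] * d_hid
ys = []
for x in xs:
    s, y = rnn_step(s, x)
    ys.append(y)
print(len(ys), len(ys[0]), len(s))
```

The same weights are reused at every step; only the state vector changes as the sequence is consumed.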

![rnn](https://github.com/enggen/Deep-Learning-Coursera/raw/1407e19c98833d2686a0748db26b594f3102301e/Sequence%20Models/Week1/Dinosaur%20Island%20--%20Character-level%20language%20model/images/rnn.png)

We are going to create an RNN-LM using PyTorch.

In [0]:
import numpy as np
import random
import torch
import torch.nn as nn
import torch.optim as optim
import pdb
from torch.utils.data import Dataset, DataLoader

%load_ext autoreload
%autoreload 2

torch.set_printoptions(linewidth=200)

In [0]:
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
hidden_size = 50

print(device)

cuda:0


**Step 1**. Prepare the dataset

In [0]:
class DinosDataset(Dataset):
    def __init__(self):
        super().__init__()
        with open('dinos.txt') as f:
            content = f.read().lower()
            self.vocab = sorted(set(content)) + ['<', '>']
            self.vocab_size = len(self.vocab)
            self.lines = content.splitlines()
        self.ch_to_idx = {c:i for i, c in enumerate(self.vocab)}
        self.idx_to_ch = {i:c for i, c in enumerate(self.vocab)}
    
    def __getitem__(self, index):
        line = self.lines[index]
        
        ######################################
        ## teacher forcing
        x_str = '<' + line       # input characters; one-hot encoded below into a (len, vocab_size) tensor
        y_str = line + '>'       # target characters (input shifted by one); stored as (len,) class indices
        ######################################
        
        x = torch.zeros([len(x_str), self.vocab_size], dtype=torch.float)
        y = torch.empty(len(x_str), dtype=torch.long)
        for i, (x_ch, y_ch) in enumerate(zip(x_str, y_str)):
            x[i][self.ch_to_idx[x_ch]] = 1
            y[i] = self.ch_to_idx[y_ch]
        
        return x, y
    
    def __len__(self):
        return len(self.lines)

In [0]:
trn_ds = DinosDataset()
trn_dl = DataLoader(trn_ds, shuffle=True)

In [0]:
trn_ds.lines[1]

'aardonyx'

In [0]:
print(trn_ds.idx_to_ch)

{0: '\n', 1: 'a', 2: 'b', 3: 'c', 4: 'd', 5: 'e', 6: 'f', 7: 'g', 8: 'h', 9: 'i', 10: 'j', 11: 'k', 12: 'l', 13: 'm', 14: 'n', 15: 'o', 16: 'p', 17: 'q', 18: 'r', 19: 's', 20: 't', 21: 'u', 22: 'v', 23: 'w', 24: 'x', 25: 'y', 26: 'z', 27: '<', 28: '>'}


In [0]:
trn_ds.vocab_size

29

In [0]:
x, y = trn_ds[1]

In [0]:
x.shape

torch.Size([9, 29])

In [0]:
y.shape

torch.Size([9])
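
The shapes above follow from the shift between input and target. A small sketch without PyTorch, assuming a simplified 28-symbol vocabulary (the dataset above also contains the newline character, which gives 29) and a hypothetical helper `encode`:

```python
vocab = list('abcdefghijklmnopqrstuvwxyz') + ['<', '>']  # simplified (assumption)
ch_to_idx = {c: i for i, c in enumerate(vocab)}

def encode(line):
    """Return one-hot inputs and integer targets for one name."""
    x_str = '<' + line   # the model reads the start symbol first
    y_str = line + '>'   # at every step it must predict the following character
    x = []
    for ch in x_str:
        row = [0.0] * len(vocab)
        row[ch_to_idx[ch]] = 1.0
        x.append(row)
    y = [ch_to_idx[ch] for ch in y_str]
    return x, y

x, y = encode('aardonyx')
print(len(x), len(x[0]), len(y))
for x_row, y_idx in zip(x[:3], y[:3]):
    print(vocab[x_row.index(1.0)], '->', vocab[y_idx])
```

Each input row is paired with the character that comes next in the name, so an 8-letter name produces 9 input rows and 9 targets.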

**Step 2**. Define a model, loss function and optimization algorithm

In [0]:
class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)  # input to hidden matrix
        self.dropout = nn.Dropout(0.3)
        self.i2o = nn.Linear(input_size + hidden_size, output_size)  # input to output matrix
    
    def forward(self, h_prev, x):
        combined = torch.cat([h_prev, x], dim = 1) # concatenate x and h
        h = torch.tanh(self.dropout(self.i2h(combined)))
        y = self.i2o(combined)
        return h, y

In [0]:
model = RNN(trn_ds.vocab_size, hidden_size, trn_ds.vocab_size).to(device)
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=1e-2)

**Step 3**. Declare a sampling procedure

![rnn](https://github.com/enggen/Deep-Learning-Coursera/raw/1407e19c98833d2686a0748db26b594f3102301e/Sequence%20Models/Week1/Dinosaur%20Island%20--%20Character-level%20language%20model/images/dinos3.png)

In [0]:
def sample(model, max_len=50):
    model.eval()
    word_size = 0
    end_char_idx = trn_ds.ch_to_idx['>']  # end-of-name symbol
    with torch.no_grad():
        h_prev = torch.zeros([1, hidden_size], dtype=torch.float, device=device)
        x = h_prev.new_zeros([1, trn_ds.vocab_size])
        start_char_idx = trn_ds.ch_to_idx['<']
        indices = [start_char_idx]
        x[0, start_char_idx] = 1
        predicted_char_idx = start_char_idx

        while predicted_char_idx != end_char_idx and word_size != max_len:
            h_prev, y_pred = model(h_prev, x)
            y_softmax_scores = torch.softmax(y_pred, dim=1)

            # draw the next character from the predicted distribution
            idx = int(np.random.choice(np.arange(trn_ds.vocab_size), p=y_softmax_scores.cpu().numpy().ravel()))
            indices.append(idx)

            # feed the sampled character back in as the next one-hot input
            x = h_prev.new_zeros([1, trn_ds.vocab_size])
            x[0, idx] = 1

            predicted_char_idx = idx
            word_size += 1

        # force an end symbol if the length limit was hit first
        if predicted_char_idx != end_char_idx:
            indices.append(end_char_idx)
    return indices
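
The core of the sampling loop is turning raw scores into a probability distribution and drawing from it. A dependency-free sketch of that step, with made-up scores and the hypothetical helpers `softmax` and `sample_index`:

```python
import math
import random

def softmax(scores):
    """Convert raw scores to probabilities; subtract the max for numerical stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(probs, rng):
    """Draw an index in proportion to its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

scores = [2.0, 1.0, 0.1]  # made-up scores for a 3-symbol vocabulary
probs = softmax(scores)
rng = random.Random(0)
draws = [sample_index(probs, rng) for _ in range(1000)]
print([round(p, 3) for p in probs])
print([draws.count(i) for i in range(3)])
```

Sampling, unlike always taking the most likely index, lets the same model produce different names on different calls.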

In [0]:
def print_sample(sample_idxs):
    print(''.join(trn_ds.idx_to_ch[x] for x in sample_idxs))

**Step 4**. Almost done! Train the model

In [0]:
def train_one_epoch(model, loss_fn, optimizer):
    model.train()
    for line_num, (x, y) in enumerate(trn_dl):
        loss = 0
        optimizer.zero_grad()
        h_prev = torch.zeros([1, hidden_size], dtype=torch.float, device=device)
        x, y = x.to(device), y.to(device)
        for i in range(x.shape[1]):
            h_prev, y_pred = model(h_prev, x[:, i])
            loss += loss_fn(y_pred, y[:, i])
            
        loss.backward()
        optimizer.step()

        if (line_num+1) % 100 == 0:
            print_sample(sample(model))
            model.train()  # sample() switches to eval mode; switch back so dropout stays active
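
The loss accumulated in the inner loop is the sum of per-character cross-entropies, i.e. the negative log-likelihood of the name under the model. A small sketch of that quantity in plain Python: `nn.CrossEntropyLoss` works on raw scores and applies the softmax internally, whereas here the probabilities are given directly. The numbers are invented for illustration and `sequence_nll` is a hypothetical helper.

```python
import math

def sequence_nll(step_probs, targets):
    """Sum of -log p(target_t) over time steps."""
    return sum(-math.log(probs[t]) for probs, t in zip(step_probs, targets))

# Invented predicted distributions over a 3-symbol vocabulary for 2 time steps.
step_probs = [[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1]]
targets = [0, 1]

loss = sequence_nll(step_probs, targets)
print(round(loss, 4))
# The same value is -log of the product of the per-step probabilities.
print(round(-math.log(0.7 * 0.8), 4))
```

Minimising this sum is the same as maximising the probability the model assigns to the whole name.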

In [0]:
def train(model, loss_fn, optimizer, dataset='dinos', epochs=1):
    for e in range(1, epochs+1):
        print('Epoch:{}'.format(e))
        train_one_epoch(model, loss_fn, optimizer)
        print()

In [0]:
print(model)

RNN(
  (i2h): Linear(in_features=79, out_features=50, bias=True)
  (dropout): Dropout(p=0.3)
  (i2o): Linear(in_features=79, out_features=29, bias=True)
)


In [0]:
train(model, loss_fn, optimizer, epochs = 50)

Epoch:1
<nba>
<juuoat>
<oaalonasruslhs>
<lacusasnus>
<aubot>
<xerusiurus>
<tnrusasras>
<aunuaauras>
<tinasausus>
<turooourus>
<rtraseerus>
<guaascarus>
<haras>
<pugus>
<mirus>

Epoch:2
<tnotrahpapseur>
<aeroainrus>
<tatsosaurus>
<sipttrnuras>
<aaucontcrusaurus>
<ancrusturus>
<eiurysaurui>
<lamudtssas>
<marhsauius>
<yrsas>
<splushurus>
<taoocaurus>
<agrraiurus>
<tcrus>
<bururaurut>

Epoch:3
<ltrua>
<lbrucspras>
<lardsaurus>
<xtsesturus>
<jyrostudus>
<eubuptcrusaurus>
<aubosourus>
<llrus>
<snrusasras>
<alnsakgunus>
<hncucaurus>
<turhshusus>
<sualoauras>
<tbesraurus>
<aubmtiurus>

Epoch:4
<llrur>
<srostcruo>
<saurucauras>
<snrasaurus>
<surohsaurus>
<suairauros>
<tcgspaurus>
<auaivotieurus>
<rrvurjurasairus>
<eubesacrus>
<gbrantosaurus>
<ruryshurul>
<lcnucturis>
<laresaunus>
<xstasturus>

Epoch:5
<kyroseunus>
<eubuotdrusaur>
<gncuchurus>
<turcshurus>
<suaioauros>
<tceuraurus>
<auaiuosaurus>
<ruryscurul>
<lcmucnurds>
<larbsaurus>
<xssas>
<salrosuurus>
<anrapnnrus>
<curaneunanrus>
<pueutoras

# HW, Problem 3.2

In [0]:
class Neural_Language_Model(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.i2h = nn.Linear(input_size, hidden_size)  # input to hidden matrix
        self.i2o = nn.Linear(3*hidden_size, output_size)  # input to output matrix
    
    def forward(self, x):
        
        x_1 = self.i2h(x[:, 0])
        x_2 = self.i2h(x[:, 1])
        x_3 = self.i2h(x[:, 2])
        x = torch.cat([x_1, x_2, x_3], dim = 1)
        x = torch.tanh(x)
        x = self.i2o(x)
        return x
      
def train_one_epoch(model, loss_fn, optimizer):
    model.train()
    for line_num, (x, y) in enumerate(trn_dl):
        loss = 0
        optimizer.zero_grad()
        x, y = x.to(device), y.to(device)
        if x.shape[1] < 3:
          continue
        for i in range(x.shape[1] - 2):
            y_pred = model(x[:, i:i+3])
            # y is x shifted by one, so the character following x[:, i+2] is y[:, i+2]
            loss += loss_fn(y_pred, y[:, i+2])
        loss.backward()
        optimizer.step()
        
        
def train(model, loss_fn, optimizer, dataset='dinos', epochs=1):
    for e in range(1, epochs+1):
        print('Epoch:{}'.format(e))
        train_one_epoch(model, loss_fn, optimizer)
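
How a fixed-window model sees a name can be sketched by listing its (context, target) pairs. This assumes the target is the character immediately following a window of k characters; `windows` is a hypothetical helper, not part of the notebook.

```python
def windows(line, k=3):
    """Yield (context, target) pairs for a name wrapped in '<' and '>'."""
    seq = '<' + line + '>'
    for i in range(len(seq) - k):
        yield seq[i:i + k], seq[i + k]

pairs = list(windows('aardonyx'))
for context, target in pairs[:3]:
    print(context, '->', target)
print(len(pairs))
```

Unlike the RNN, nothing is carried from one window to the next: each prediction depends only on the k characters inside its window.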
        

In [0]:
model_2 = Neural_Language_Model(trn_ds.vocab_size, hidden_size, trn_ds.vocab_size).to(device)
loss_fn = nn.CrossEntropyLoss()
optimizer = optim.SGD(model_2.parameters(), lr=1e-2)

In [0]:
train(model_2, loss_fn, optimizer, epochs = 50)

Epoch:1
Epoch:2
Epoch:3
Epoch:4
Epoch:5
Epoch:6
Epoch:7
Epoch:8
Epoch:9
Epoch:10
Epoch:11
Epoch:12
Epoch:13
Epoch:14
Epoch:15
Epoch:16
Epoch:17
Epoch:18
Epoch:19
Epoch:20
Epoch:21
Epoch:22
Epoch:23
Epoch:24
Epoch:25
Epoch:26
Epoch:27
Epoch:28
Epoch:29
Epoch:30
Epoch:31
Epoch:32
Epoch:33
Epoch:34
Epoch:35
Epoch:36
Epoch:37
Epoch:38
Epoch:39
Epoch:40
Epoch:41
Epoch:42
Epoch:43
Epoch:44
Epoch:45
Epoch:46
Epoch:47
Epoch:48
Epoch:49
Epoch:50


The hyperparameters are:  
- k (the number of previous characters used as context)  
- hidden_size

 **Increasing k to ~5** yields slightly better results. Changing the hidden size to 40 or 60 makes almost no difference.

# HW, Problem 3.3

In [0]:
## generate a few dinosaur names

few = 5

for i in range(few):
  name_idx = sample(model)
  print_sample(name_idx)

<lbrcpnicaurus>
<busolmosaurus>
<tbmianodauris>
<lbrcpnicaurus>
<busolmosaurus>


### a) Key differences between neural language model and RNN model   

1. The neural language model uses a fixed-length context; the RNN does not.
2. The RNN may have a longer memory. In practice, however, without GRU or LSTM cells its memory is fairly short.
3. The neural language model has no hidden state carried over from the previous step (h_prev).
4. The RNN suffers from the vanishing gradient problem, while the neural language model does not.



### b) Number of parameters for neural language model and RNN model:  

1.  Neural Language Model:   
number of parameters of "i2h" and "i2o" = **(input_size * hidden_size + hidden_size) + (k * hidden_size * output_size + output_size) = 1500 + 4379 = 5879**   
  
  
2.  RNN:   
number of parameters of "i2h" and "i2o" = **((input_size + hidden_size) * hidden_size + hidden_size) + ((input_size + hidden_size) * output_size + output_size) = 4000 + 2320 = 6320**   

      Note:  
      -- input size = 29  
      -- hidden size = 50  
      -- output size = 29  
      -- Each layer has a bias vector with one entry per output unit  
      -- k = 3 for Neural Language Model
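
The counts can be checked with a few lines of arithmetic, assuming each fully connected layer has one weight per input-output pair plus a bias vector with one entry per output unit (which is how `nn.Linear` with `bias=True` is parameterised). The helper `linear_params` is hypothetical.

```python
def linear_params(in_features, out_features):
    """Weights plus one bias per output unit, as in a fully connected layer."""
    return in_features * out_features + out_features

input_size, hidden_size, output_size, k = 29, 50, 29, 3

# Fixed-window model: shared i2h, then i2o on k concatenated hidden vectors.
nlm = linear_params(input_size, hidden_size) + linear_params(k * hidden_size, output_size)

# RNN: i2h and i2o both read the concatenation of input and hidden state.
rnn = (linear_params(input_size + hidden_size, hidden_size)
       + linear_params(input_size + hidden_size, output_size))

print(nlm, rnn)
```

In PyTorch the same totals can be obtained from a model instance with `sum(p.numel() for p in model.parameters())`.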

# Reference

1. Sampling in  RNN: https://nlp.stanford.edu/blog/maximum-likelihood-decoding-with-rnns-the-good-the-bad-and-the-ugly/
2. Coursera course (main source): https://github.com/furkanu/deeplearning.ai-pytorch/tree/master/5-%20Sequence%20Models
3. Coursera course (main source): https://github.com/Kulbear/deep-learning-coursera/blob/master/Sequence%20Models/Dinosaurus%20Island%20--%20Character%20level%20language%20model%20final%20-%20v3.ipynb
4. LSTM: http://colah.github.io/posts/2015-08-Understanding-LSTMs/