# Deep N-grams

### Predicting the next set of characters from the previous characters

### Tasks
1. Convert text into tensors
2. Create an iterator to feed the data to the model
3. Define a GRU model with trax
4. Train the model with trax
5. Evaluate the model using perplexity
6. Make predictions with the trained model


In [14]:
import os
import trax
import trax.fastmath.numpy as np
import pickle
import numpy
import random as rnd
from trax import fastmath
from trax import layers as tl

In [15]:
dirname = '/media/andrea/Baba Yaga/BIGFOOT/CIC/PycharmProjects/PycharmProjects/semester work/Proy4/data/'

### Loading the data

The text corpus consists of several works by Shakespeare.

### Preprocessing the data
* For character generation, each character must map to a unique number
* All characters are converted to lowercase
* The ```ord``` function converts a character to a unique integer
* A generator returns batches from the dataset
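
The character-to-integer mapping can be checked with a small round trip. This is only a minimal sketch: `encode` and `decode` are hypothetical helper names used for illustration and are not part of the notebook.

```python
def encode(text):
    """Lowercase the text and map each character to its Unicode code point."""
    return [ord(c) for c in text.lower()]

def decode(ids):
    """Map code points back to characters."""
    return "".join(chr(i) for i in ids)

ids = encode("To be")
print(ids)          # [116, 111, 32, 98, 101]
print(decode(ids))  # to be
```

Because `ord` and `chr` are inverses, no vocabulary file is needed. The lowercasing step is the only part of the mapping that loses information.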

In [16]:
#dirname = 'data/'
lines = [] # storing all the lines in a variable.
for filename in os.listdir(dirname):
    with open(os.path.join(dirname, filename)) as files:
        for line in files:
            pure_line = line.strip()

            if pure_line:
                lines.append(pure_line)

In [43]:
n_lines = len(lines)
print(f"Number of lines: {n_lines}")

Number of lines: 124097


All text is converted to lowercase.

In [19]:
for i, line in enumerate(lines):
    lines[i] = line.lower()

In [44]:
eval_lines = lines[-1000:] # Create a holdout validation set
lines = lines[:-1000] # Leave the rest for training


print(f"Number for training: {len(lines)}")
print(f"Number for validation: {len(eval_lines)}")

Number for training: 123097
Number for validation: 1000


#### Line to tensor
Takes a line as input, converts each character to its integer Unicode code point, and returns a list of integers (the tensor).
A special end-of-sentence character (```EOS_int```) is appended at the end of the line.

In [21]:
def line_to_tensor(line, EOS_int=1):
    tensor = []
    for c in line:
        c_int = ord(c)
        tensor.append(c_int)
    tensor.append(EOS_int)
    return tensor


In [22]:
line_to_tensor('abc xyz')

[97, 98, 99, 32, 120, 121, 122, 1]

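#### Batch generator

Generates batches of text for training, validation, and testing.
* The generator converts the lines of text into numpy arrays padded with zeros so that they all have the same length
* Each batch is a tuple of inputs, targets (identical to the inputs), and a mask that is 0 at padding positions and 1 elsewhere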
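
The padding and masking step can be sketched for a single line as follows. `pad_and_mask` is a hypothetical helper written in plain Python for illustration; the notebook performs the same steps inline inside its generator.

```python
def pad_and_mask(token_ids, max_length, pad_id=0):
    """Right-pad a list of ints to max_length and build a 0/1 mask."""
    padded = token_ids + [pad_id] * (max_length - len(token_ids))
    # The mask is 1 for real tokens and 0 for padding positions.
    mask = [0 if t == pad_id else 1 for t in padded]
    return padded, mask

padded, mask = pad_and_mask([97, 98, 99, 1], max_length=6)
print(padded)  # [97, 98, 99, 1, 0, 0]
print(mask)    # [1, 1, 1, 1, 0, 0]
```

The mask lets the loss ignore padded positions, so short lines do not dilute the training signal.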


In [23]:
def data_generator(batch_size, max_length, data_lines, line_to_tensor=line_to_tensor, shuffle=True):
    index = 0
    cur_batch = []
    num_lines = len(data_lines)
    lines_index = [*range(num_lines)]
    if shuffle:
        rnd.shuffle(lines_index)
    while True:
        if index>=num_lines:
            index = 0
            if shuffle:
                rnd.shuffle(lines_index)
        line = data_lines[lines_index[index]]  # look up through the (possibly shuffled) index list

        if len(line)<max_length:
            cur_batch.append(line)
        index += 1

        if len(cur_batch)==batch_size:
            batch = []
            mask = []
            for li in cur_batch:
                tensor = line_to_tensor(li)
                pad = [0] * (max_length-len(tensor))
                tensor_pad = tensor+pad
                batch.append(tensor_pad)
                example_mask = [0 if i==0 else 1  for i in tensor_pad]
                mask.append(example_mask)
            batch_np_arr = np.array(batch)

            mask_np_arr = np.array(mask)

            yield batch_np_arr, batch_np_arr, mask_np_arr

            cur_batch = []

#### Testing the generator

In [None]:
tmp_lines = ['12345678901', #11
             '123456789', # 9
             '234567890', # 9
             '345678901'] # 9

# Get a batch size of 2, max length 10
tmp_data_gen = data_generator(batch_size=2,
                              max_length=10,
                              data_lines=tmp_lines,
                              shuffle=False)

# get one batch
tmp_batch = next(tmp_data_gen)

# view the batch
tmp_batch

In [25]:
import torch, jax; print(torch.cuda.is_available()); print(jax.devices())

True
[CpuDevice(id=0)]


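#### Repeating the batch generator
```itertools.cycle``` is useful when a generator eventually stops: it replays the items so that iteration can continue indefinitely. ```data_generator``` already loops forever, so wrapping it here serves only as a demonstration.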
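
As a minimal sketch of what `itertools.cycle` does with a generator that does stop, consider the hypothetical `finite_batches` generator below (it is not part of the notebook):

```python
import itertools

def finite_batches():
    """A hypothetical generator that stops after three items."""
    for batch_id in range(3):
        yield batch_id

# cycle() replays the items once the underlying generator is exhausted.
cycled = itertools.cycle(finite_batches())
first_seven = [next(cycled) for _ in range(7)]
print(first_seven)  # [0, 1, 2, 0, 1, 2, 0]
```

Note that `itertools.cycle` keeps a copy of every item it has seen in order to replay them, so wrapping a generator that never stops makes memory use grow with the number of batches consumed.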

In [None]:
import itertools

infinite_data_generator = itertools.cycle(
    data_generator(batch_size=2, max_length=10, data_lines=tmp_lines))

In [27]:
ten_lines = [next(infinite_data_generator) for _ in range(10)]
print(len(ten_lines))

10


## Model with GRU
Implementing a GRU (Gated Recurrent Unit) model.
Building the model with trax requires the following layers:
* ```tl.Serial```: applies layers one after another
* ```tl.ShiftRight```: shifts the input one position to the right, so that each position is predicted only from the characters before it
* ```tl.Embedding```: initializes the embedding with the vocabulary size and the model dimension
* ```tl.GRU```: builds a GRU with ```n_units``` units
* ```tl.Dense```: dense layer with ```n_units``` outputs
* ```tl.LogSoftmax```: log of the output probabilities
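
The effect of the right shift on a single sequence can be imitated in plain Python. This is only an illustrative sketch of the idea during training, not trax's implementation; `shift_right` is a hypothetical helper name.

```python
def shift_right(token_ids, n_positions=1, pad_id=0):
    """Prepend pad_id and drop the last tokens, keeping the length fixed."""
    keep = len(token_ids) - n_positions
    return [pad_id] * n_positions + token_ids[:keep]

targets = [97, 98, 99, 1]      # 'a', 'b', 'c', EOS
inputs = shift_right(targets)
print(inputs)                  # [0, 97, 98, 99]

# At every position the model sees only earlier characters
# and is asked to predict the current target.
for x, y in zip(inputs, targets):
    print(f"input {x:3d} -> target {y:3d}")
```

This is why the generator can yield the same array as both inputs and targets: the shift inside the model supplies the offset between them.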

In [28]:
def GRULM(vocab_size=256, d_model=512, n_layers=2, mode='train'):

    model = tl.Serial(
      tl.ShiftRight(mode=mode),  # shift inputs right so each step sees only earlier characters
      tl.Embedding(vocab_size=vocab_size, d_feature=d_model),  # character embeddings
      [tl.GRU(n_units=d_model) for _ in range(n_layers)],  # n_layers stacked GRU layers of d_model units
      tl.Dense(n_units=vocab_size),  # project back to the vocabulary size
      tl.LogSoftmax()  # log-probabilities over the vocabulary
    )
    return model

In [29]:
model = GRULM()
print(model)

Serial[
  Serial[
    ShiftRight(1)
  ]
  Embedding_256_512
  GRU_512
  GRU_512
  Dense_256
  LogSoftmax
]


In [30]:
batch_size = 32
max_length = 64

In [31]:
def n_used_lines(lines, max_length):


    n_lines = 0
    for l in lines:
        if len(l) <= max_length:
            n_lines += 1
    return n_lines

num_used_lines = n_used_lines(lines, 32)
print('Number of used lines from the dataset:', num_used_lines)
print('Batch size (a power of 2):', int(batch_size))
steps_per_epoch = int(num_used_lines/batch_size)
print('Number of steps to cover one epoch:', steps_per_epoch)

Number of used lines from the dataset: 25773
Batch size (a power of 2): 32
Number of steps to cover one epoch: 805


## Training model

In [32]:
output_dir = '/media/andrea/Baba Yaga/BIGFOOT/CIC/PycharmProjects/PycharmProjects/semester work/Proy4/model/'

In [37]:
from trax.supervised import training

def train_model(model, data_generator, batch_size=32, max_length=64, lines=lines, eval_lines=eval_lines, n_steps=100,output_dir=output_dir):
    print("Dir")
    print(output_dir)
    bare_train_generator = data_generator(batch_size=batch_size, max_length=max_length, data_lines=lines)
    infinite_train_generator =  itertools.cycle(bare_train_generator)

    bare_eval_generator = data_generator(batch_size=batch_size, max_length=max_length, data_lines=eval_lines)
    infinite_eval_generator = itertools.cycle(bare_eval_generator)

    train_task = training.TrainTask(
        labeled_data=infinite_train_generator,
        loss_layer= tl.CrossEntropyLoss(),
        optimizer=trax.optimizers.Adam(learning_rate=0.0005)
    )

    eval_task = training.EvalTask(
        labeled_data=infinite_eval_generator,
        metrics=[tl.CrossEntropyLoss(), tl.Accuracy()],
        n_eval_batches=3
    )

    training_loop = training.Loop(model,
                                  train_task,
                                  eval_tasks=[eval_task],
                                  output_dir=output_dir)

    training_loop.run(n_steps=n_steps)

    return training_loop


In [38]:
training_loop = train_model(GRULM(), data_generator)

Dir
/media/andrea/Baba Yaga/BIGFOOT/CIC/PycharmProjects/PycharmProjects/semester work/Proy4/model/





Step    100: Ran 99 train steps in 62.02 secs
Step    100: train CrossEntropyLoss |  3.35154223
Step    100: eval  CrossEntropyLoss |  2.88437454
Step    100: eval          Accuracy |  0.19368572


#### Evaluation

We use perplexity to evaluate how well the model performs.
Perplexity is defined as follows, where $N$ is the number of non-padding characters in the sequence $W$:
$$P(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i| w_1,...,w_{i-1})}}$$

$$\log P(W) = {\log\big(\sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i| w_1,...,w_{i-1})}}\big)}$$

$$ = {\log\big({\prod_{i=1}^{N} \frac{1}{P(w_i| w_1,...,w_{i-1})}}\big)^{\frac{1}{N}}}$$

$$ = {\log\big({\prod_{i=1}^{N}{P(w_i| w_1,...,w_{i-1})}}\big)^{-\frac{1}{N}}} $$
$$ = -\frac{1}{N}{\log\big({\prod_{i=1}^{N}{P(w_i| w_1,...,w_{i-1})}}\big)} $$
$$ = -\frac{1}{N}{\big({\sum_{i=1}^{N}{\log P(w_i| w_1,...,w_{i-1})}}\big)} $$
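
The equivalence between the root-of-product form and the averaged log form can be checked numerically. The probabilities below are made-up values chosen for illustration, not model outputs.

```python
import math

# Hypothetical per-character probabilities P(w_i | w_1, ..., w_{i-1}).
probs = [0.5, 0.25, 0.125, 0.5]
N = len(probs)

# Definition: N-th root of the product of inverse probabilities.
ppx_root = math.prod(1.0 / p for p in probs) ** (1.0 / N)

# Log form: negative mean of the log-probabilities, then exponentiate.
log_ppx = -sum(math.log(p) for p in probs) / N
ppx_exp = math.exp(log_ppx)

print(ppx_root, ppx_exp)  # the two values agree up to floating-point error
```

The log form is preferable in practice because the sum of log-probabilities stays well scaled for long sequences, whereas the raw product underflows.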

In [45]:
def test_model(preds, target):

    # Log-probability the model assigns to each target character
    total_log_ppx = np.sum(tl.one_hot(target, preds.shape[-1]) * preds, axis=-1)

    # 1.0 for real characters, 0.0 for padding
    non_pad = 1.0 - np.equal(target, 0)
    ppx = total_log_ppx * non_pad

    # Average over the non-padding positions only
    log_ppx = np.sum(ppx) / np.sum(non_pad)

    return -log_ppx


In [47]:
model = GRULM()
model.init_from_file('/media/andrea/Baba Yaga/BIGFOOT/CIC/PycharmProjects/PycharmProjects/semester work/Proy4/model.pkl.gz')
batch = next(data_generator(batch_size, max_length, lines, shuffle=False))
preds = model(batch[0])
log_ppx = test_model(preds, batch[1])
print('The log perplexity and perplexity of your model are respectively', log_ppx, np.exp(log_ppx))

The log perplexity and perplexity of your model are respectively 4.498601 89.89128


#### Generating language with our model
To generate new sentences, we sample from the model's output using the Gumbel distribution.
The Gumbel probability density function is defined as:
$$ f(z) = {1\over{\beta}}e^{-(z+e^{-z})} $$

where: $$ z = {(x - \mu)\over{\beta}}$$

The maximum of a sample drawn from an exponentially distributed random variable approaches the Gumbel distribution as the sample size grows asymptotically.
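
Adding independent Gumbel noise to a vector of log-probabilities and taking the argmax draws an index from the corresponding categorical distribution (the Gumbel-max trick). The sketch below checks this empirically in plain Python. The three-class distribution, the seed, and the helper name `gumbel_argmax` are assumptions made for illustration.

```python
import math
import random
from collections import Counter

def gumbel_argmax(log_probs, rng):
    """Add independent Gumbel(0, 1) noise to each log-probability; return the argmax index."""
    noisy = []
    for lp in log_probs:
        u = rng.uniform(1e-6, 1.0 - 1e-6)  # keep u away from 0 and 1 to avoid log(0)
        g = -math.log(-math.log(u))        # Gumbel(0, 1) sample via inverse transform
        noisy.append(lp + g)
    return max(range(len(noisy)), key=noisy.__getitem__)

rng = random.Random(0)  # fixed seed, chosen arbitrarily
probs = [0.7, 0.2, 0.1]
log_probs = [math.log(p) for p in probs]

n_samples = 20000
counts = Counter(gumbel_argmax(log_probs, rng) for _ in range(n_samples))
freqs = [counts[i] / n_samples for i in range(len(probs))]
print(freqs)  # each entry should land close to the matching value in probs
```

Sampling this way, rather than always taking the most likely character, is what lets the model produce different text on each call.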

In [None]:
def gumbel_sample(log_probs, temperature=1.0):
    """Gumbel sampling from a categorical distribution."""
    u = numpy.random.uniform(low=1e-6, high=1.0 - 1e-6, size=log_probs.shape)
    g = -np.log(-np.log(u))
    return np.argmax(log_probs + g * temperature, axis=-1)

def predict(num_chars, prefix):
    inp = [ord(c) for c in prefix]
    result = [c for c in prefix]
    max_len = len(prefix) + num_chars
    for _ in range(num_chars):
        cur_inp = np.array(inp + [0] * (max_len - len(inp)))
        outp = model(cur_inp[None, :])  # Add batch dim.
        next_char = gumbel_sample(outp[0, len(inp)])
        inp += [int(next_char)]

        if inp[-1] == 1:
            break  # EOS
        result.append(chr(int(next_char)))

    return "".join(result)

print(predict(32, ""))


The following texts were generated by the model, which captures dependencies between words without needing any input prefix.

In [49]:
print(predict(32, ""))
print(predict(32, ""))
print(predict(32, ""))


SAY	Not worth the fools.
But let the instrument
SIMPLE	Which are not membrayetly


In [52]:
print(predict(50,"love"))

love him once married, run, sir; our gates and places 
