<h1>Exercise 3, Practice 6 (Iris)</h1>
<p>This version of the exercise uses the following dataset: <a>https://www.kaggle.com/datasets/jeffheaton/iris-computer-vision</a>. Note that this dataset is imbalanced: it contains many images of the versicolor flower and few of the other two (setosa, virginica). We kept this version to show the effects of an imbalanced dataset on training a CNN model, and created another version that uses a different, more complete dataset (CIFAR10).</p>

<h3>Imports</h3>

In [1]:
import tensorflow as tf
from sklearn.utils.class_weight import compute_class_weight
import numpy as np
from keras.models import Sequential
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping
from keras.layers import Dense, Conv2D, MaxPooling2D, Dropout, Rescaling, RandomFlip, RandomRotation, RandomZoom, GlobalAveragePooling2D, RandomContrast
from sklearn.metrics import classification_report

<h3>Loading the data and splitting it into training and validation sets</h3>

In [2]:

# Iris dataset
IMG_SIZE = (256, 256)
BATCH_SIZE = 8

train_ds = tf.keras.utils.image_dataset_from_directory(
    "./iris",
    validation_split=0.2,  # 20% for validation
    subset="training",
    seed=123,
    image_size=IMG_SIZE,
    batch_size=BATCH_SIZE
)

valid_ds = tf.keras.utils.image_dataset_from_directory(
    "./iris",
    validation_split=0.2,
    subset="validation",  
    seed=123,
    image_size=IMG_SIZE,
    batch_size=BATCH_SIZE
)


Found 423 files belonging to 3 classes.
Using 339 files for training.
Found 423 files belonging to 3 classes.
Using 84 files for validation.


<h3>Class weight adjustments</h3>
<p>To try to reduce the effects of the data imbalance, we use compute_class_weight, which computes and returns weights based on the number of examples per class. These weights scale each example's contribution to the loss: the dominant class is down-weighted and the classes with few examples are up-weighted. They are passed to the model through the class_weight argument of fit when training starts.</p>

In [3]:

# There are many more examples of iris versicolor, so we adjust the class weights to balance them and let the model learn every class properly.
class_names = train_ds.class_names
labels = np.concatenate([y for x, y in train_ds], axis=0)
class_weights = compute_class_weight(
    class_weight="balanced",
    classes=np.unique(labels),
    y=labels
)
class_weights = dict(enumerate(class_weights))
print("Class weights:", class_weights)

Class weights: {0: 2.1320754716981134, 1: 0.5159817351598174, 2: 1.6865671641791045}
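
<p>As a rough illustration of where these numbers come from: scikit-learn's "balanced" mode computes n_samples / (n_classes * count) for each class. The sketch below reproduces the printed weights in plain Python. The per-class counts (53, 219, 67) are an assumption: they are inferred from the printed weights and the 339 training files, and are not stated anywhere in the notebook.</p>

```python
# Class counts are an ASSUMPTION: inferred from the printed weights and the
# 339 training files reported by image_dataset_from_directory.
counts = [53, 219, 67]  # classes 0, 1, 2 (class 1 is the dominant one)
n_samples = sum(counts)
n_classes = len(counts)

# "balanced" heuristic: n_samples / (n_classes * count_of_class)
balanced_weights = {i: n_samples / (n_classes * c) for i, c in enumerate(counts)}

for label, weight in balanced_weights.items():
    print(f"class {label}: count={counts[label]:3d} weight={weight:.4f}")
```

<p>A rare class gets a weight above 1 and the dominant class a weight below 1, so that each class contributes roughly the same total weight to the loss.</p>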


<h3>Building and training the model</h3>

In [4]:

# Early stopping to avoid overfitting
callback = EarlyStopping(
    patience=5, restore_best_weights=True
)

# Data augmentation layers, since we have little data
data_augmentation = tf.keras.Sequential([
    RandomFlip("horizontal"),
    RandomRotation(0.1),
    RandomZoom(0.1),
    RandomContrast(0.1),
])

model = Sequential([
    tf.keras.Input(shape=IMG_SIZE + (3,)),  # declare the input shape up front
    data_augmentation,
    Rescaling(1./255),
    Conv2D(32, (3,3), activation='relu'),
    MaxPooling2D(4,4),
    Conv2D(64, (3,3), activation='relu'),
    MaxPooling2D(2,2),
    Conv2D(128, (3,3), activation='relu'),
    MaxPooling2D(2,2),
    GlobalAveragePooling2D(),    
    Dense(128, activation='relu'),
    Dropout(0.5),
    Dense(3, activation='softmax')
])
model.compile(optimizer=Adam(learning_rate=1e-4), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
# batch_size is omitted because train_ds is already batched; callbacks expects a list
model.fit(train_ds, epochs=30, verbose=1, validation_data=valid_ds, class_weight=class_weights, callbacks=[callback])
loss, acc = model.evaluate(valid_ds)
print(f"Accuracy: {acc:.2f}")
model.summary()


Epoch 1/30
43/43 ━━━━━━━━━━━━━━━━━━━━ 10s 143ms/step - accuracy: 0.2743 - loss: 1.1047 - val_accuracy: 0.6190 - val_loss: 1.0885
Epoch 2/30
43/43 ━━━━━━━━━━━━━━━━━━━━ 5s 123ms/step - accuracy: 0.4159 - loss: 1.0957 - val_accuracy: 0.6071 - val_loss: 1.0898
Epoch 3/30
43/43 ━━━━━━━━━━━━━━━━━━━━ 10s 122ms/step - accuracy: 0.3864 - loss: 1.0958 - val_accuracy: 0.4286 - val_loss: 1.0942
Epoch 4/30
43/43 ━━━━━━━━━━━━━━━━━━━━ 5s 124ms/step - accuracy: 0.3304 - loss: 1.1067 - val_accuracy: 0.3452 - val_loss: 1.1007
Epoch 5/30
43/43 ━━━━━━━━━━━━━━━━━━━━ 5s 124ms/step - accuracy: 0.2861 - loss: 1.0976 - val_accuracy: 0.2976 - val_loss: 1.1000
Epoch 6/30
43/43 ━━━━━━━━━━━━━━━━━━━━ 10s 126ms/step - accuracy: 0.3304 - loss: 1.0951 - val_accuracy: 0.5119 - val_loss: 1.0869
Epoch 7/30
43/43 ━━━━━━

In [5]:
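# Predictions and statistics
# valid_ds reshuffles on every pass by default, so images and labels are
# collected in a single pass to keep each prediction aligned with its label
batches = list(valid_ds)
x_val = np.concatenate([x.numpy() for x, y in batches], axis=0)
y_true = np.concatenate([y.numpy() for x, y in batches], axis=0)
y_pred = model.predict(x_val, batch_size=BATCH_SIZE)
y_pred_classes = np.argmax(y_pred, axis=1)
print(classification_report(y_true, y_pred_classes))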

11/11 ━━━━━━━━━━━━━━━━━━━━ 1s 43ms/step
              precision    recall  f1-score   support

           0       0.11      0.25      0.16        16
           1       0.61      0.50      0.55        50
           2       0.12      0.06      0.08        18

    accuracy                           0.36        84
   macro avg       0.28      0.27      0.26        84
weighted avg       0.41      0.36      0.37        84



<p>These final statistics show that the model is biased: it often predicts class 1 (versicolor) for inputs that belong to another class. The other two classes (setosa, virginica) are almost ignored by the model, with recalls of 0.25 and 0.06 respectively.</p>
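
<p>One way to put the reported accuracy in context is to compare it with a majority-class baseline, that is, the accuracy of always answering with the most frequent class. The sketch below uses only the support column and the accuracy from the report above; the variable names are illustrative.</p>

```python
# Per-class support and overall accuracy, copied from the classification report
support = {0: 16, 1: 50, 2: 18}
reported_accuracy = 0.36

total = sum(support.values())
majority_class = max(support, key=support.get)

# Accuracy of a trivial classifier that always predicts the majority class
baseline_accuracy = support[majority_class] / total

print(f"majority class: {majority_class}")
print(f"baseline accuracy: {baseline_accuracy:.2f}")
print(f"model beats baseline: {reported_accuracy > baseline_accuracy}")
```

<p>On an imbalanced validation set, plain accuracy can look acceptable for a model that has learned very little, which is why the per-class precision and recall in the report are more informative here.</p>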

<h1>Exercise 3, Practice 6 (CIFAR10)</h1>
<p>This version of the exercise uses the following dataset: <a>https://www.cs.toronto.edu/~kriz/cifar.html</a>.</p>

<h3>Imports and data loading</h3>

In [6]:
import tensorflow as tf
from sklearn.utils.class_weight import compute_class_weight
import numpy as np
from keras.models import Sequential
from keras.datasets import cifar10
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping
from keras.layers import Dense, Conv2D, MaxPooling2D, Dropout, Flatten
from sklearn.metrics import classification_report

# CIFAR10 dataset, with pixel values scaled to [0, 1]
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
x_train = x_train.astype('float32') / 255.0
x_test = x_test.astype('float32') / 255.0


<h3>Building and training the model</h3>

In [7]:

callback = EarlyStopping(
    patience=5, restore_best_weights=True
)

model = Sequential([
    Conv2D(32, (3,3), activation='relu', input_shape=(32, 32, 3)),
    MaxPooling2D(2,2),
    Conv2D(64, (2,2), activation='relu'),
    MaxPooling2D(2,2),
    Flatten(),   
    Dense(128, activation='relu'),
    Dropout(0.4),
    Dense(10, activation='softmax')
])

model.compile(optimizer=Adam(learning_rate=1e-4), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(x=x_train, y=y_train, epochs=30, batch_size=16, verbose=1, validation_split=0.1, callbacks=[callback])
loss, acc = model.evaluate(x_test, y_test)
print(f"Accuracy: {acc:.2f}")
model.summary()

Epoch 1/30
2813/2813 ━━━━━━━━━━━━━━━━━━━━ 24s 8ms/step - accuracy: 0.3463 - loss: 1.8050 - val_accuracy: 0.4614 - val_loss: 1.5127
Epoch 2/30
2813/2813 ━━━━━━━━━━━━━━━━━━━━ 23s 8ms/step - accuracy: 0.4558 - loss: 1.5077 - val_accuracy: 0.5122 - val_loss: 1.3618
Epoch 3/30
2813/2813 ━━━━━━━━━━━━━━━━━━━━ 24s 8ms/step - accuracy: 0.5067 - loss: 1.3886 - val_accuracy: 0.5712 - val_loss: 1.2660
Epoch 4/30
2813/2813 ━━━━━━━━━━━━━━━━━━━━ 40s 8ms/step - accuracy: 0.5366 - loss: 1.3101 - val_accuracy: 0.5852 - val_loss: 1.1956
Epoch 5/30
2813/2813 ━━━━━━━━━━━━━━━━━━━━ 23s 8ms/step - accuracy: 0.5574 - loss: 1.2523 - val_accuracy: 0.5882 - val_loss: 1.1804
Epoch 6/30
2813/2813 ━━━━━━━━━━━━━━━━━━━━ 25s 9ms/step - accuracy: 0.5735 - loss: 1.2076 - val_accuracy: 0.6056 - val_loss: 1.1407
Epoch 7/30

<h3>Validation and statistics</h3>

In [8]:
y_pred = model.predict(x_test)
y_pred_classes = np.argmax(y_pred, axis=1)
y_true = y_test.ravel()  # flatten the (10000, 1) label array to shape (10000,)
print(classification_report(y_true, y_pred_classes))


[1m313/313[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 4ms/step
              precision    recall  f1-score   support

           0       0.70      0.78      0.74      1000
           1       0.83      0.79      0.81      1000
           2       0.58      0.60      0.59      1000
           3       0.52      0.53      0.52      1000
           4       0.64      0.66      0.65      1000
           5       0.67      0.54      0.60      1000
           6       0.76      0.79      0.77      1000
           7       0.72      0.77      0.74      1000
           8       0.80      0.81      0.80      1000
           9       0.80      0.76      0.78      1000

    accuracy                           0.70     10000
   macro avg       0.70      0.70      0.70     10000
weighted avg       0.70      0.70      0.70     10000



<h1>Exercise 4, Practice 7</h1>
<p>This section shows a vanilla RNN and then its modification to implement a GRU.</p>

In [20]:
"""
Minimal character-level Vanilla RNN model. Written by Andrej Karpathy (@karpathy)
BSD License
"""
import numpy as np

# data I/O
with open('linux_input.txt', 'r') as f: # should be simple plain text file
  data = f.read()
chars = list(set(data))
data_size, vocab_size = len(data), len(chars)
print('data has %d characters, %d unique.' % (data_size, vocab_size))
char_to_ix = { ch:i for i,ch in enumerate(chars) }
ix_to_char = { i:ch for i,ch in enumerate(chars) }

# hyperparameters
hidden_size = 100 # size of hidden layer of neurons
seq_length = 50 # number of steps to unroll the RNN for
learning_rate = 1e-2

# model parameters
Wxh = np.random.randn(hidden_size, vocab_size)*0.01 # input to hidden
Whh = np.random.randn(hidden_size, hidden_size)*0.01 # hidden to hidden
Why = np.random.randn(vocab_size, hidden_size)*0.01 # hidden to output
bh = np.zeros((hidden_size, 1)) # hidden bias
by = np.zeros((vocab_size, 1)) # output bias

def lossFun(inputs, targets, hprev):
  """
  inputs,targets are both list of integers.
  hprev is Hx1 array of initial hidden state
  returns the loss, gradients on model parameters, and last hidden state
  """
  xs, hs, ys, ps = {}, {}, {}, {}
  hs[-1] = np.copy(hprev)
  loss = 0
  # forward pass
  for t in range(len(inputs)):
    xs[t] = np.zeros((vocab_size,1)) # encode in 1-of-k representation
    xs[t][inputs[t]] = 1
    hs[t] = np.tanh(np.dot(Wxh, xs[t]) + np.dot(Whh, hs[t-1]) + bh) # hidden state
    ys[t] = np.dot(Why, hs[t]) + by # unnormalized log probabilities for next chars
    ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t])) # probabilities for next chars
    loss += -np.log(ps[t][targets[t],0]) # softmax (cross-entropy loss)
  # backward pass: compute gradients going backwards
  dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
  dbh, dby = np.zeros_like(bh), np.zeros_like(by)
  dhnext = np.zeros_like(hs[0])
  for t in reversed(range(len(inputs))):
    dy = np.copy(ps[t])
    dy[targets[t]] -= 1 # backprop into y. see http://cs231n.github.io/neural-networks-case-study/#grad if confused here
    dWhy += np.dot(dy, hs[t].T)
    dby += dy
    dh = np.dot(Why.T, dy) + dhnext # backprop into h
    dhraw = (1 - hs[t] * hs[t]) * dh # backprop through tanh nonlinearity
    dbh += dhraw
    dWxh += np.dot(dhraw, xs[t].T)
    dWhh += np.dot(dhraw, hs[t-1].T)
    dhnext = np.dot(Whh.T, dhraw)
  for dparam in [dWxh, dWhh, dWhy, dbh, dby]:
    np.clip(dparam, -5, 5, out=dparam) # clip to mitigate exploding gradients
  return loss, dWxh, dWhh, dWhy, dbh, dby, hs[len(inputs)-1]

def sample(h, seed_ix, n):
  """ 
  sample a sequence of integers from the model 
  h is memory state, seed_ix is seed letter for first time step
  """
  x = np.zeros((vocab_size, 1))
  x[seed_ix] = 1
  ixes = []
  for t in range(n):
    h = np.tanh(np.dot(Wxh, x) + np.dot(Whh, h) + bh)
    y = np.dot(Why, h) + by
    p = np.exp(y) / np.sum(np.exp(y))
    ix = np.random.choice(range(vocab_size), p=p.ravel())
    x = np.zeros((vocab_size, 1))
    x[ix] = 1
    ixes.append(ix)
  return ixes

n, p = 0, 0
mWxh, mWhh, mWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
mbh, mby = np.zeros_like(bh), np.zeros_like(by) # memory variables for Adagrad
smooth_loss = -np.log(1.0/vocab_size)*seq_length # loss at iteration 0
while (n < 8000):
  # prepare inputs (we're sweeping from left to right in steps seq_length long)
  if p+seq_length+1 >= len(data) or n == 0: 
    hprev = np.zeros((hidden_size,1)) # reset RNN memory
    p = 0 # go from start of data
  inputs = [char_to_ix[ch] for ch in data[p:p+seq_length]]
  targets = [char_to_ix[ch] for ch in data[p+1:p+seq_length+1]]

  # sample from the model now and then
  if n % 100 == 0:
    sample_ix = sample(hprev, inputs[0], 200)
    txt = ''.join(ix_to_char[ix] for ix in sample_ix)
    print('----\n %s \n----' % (txt, ))

  # forward seq_length characters through the net and fetch gradient
  loss, dWxh, dWhh, dWhy, dbh, dby, hprev = lossFun(inputs, targets, hprev)
  smooth_loss = smooth_loss * 0.999 + loss * 0.001
  if n % 100 == 0: print('iter %d, loss: %f' % (n, smooth_loss)) # print progress
  
  # perform parameter update with Adagrad
  for param, dparam, mem in zip([Wxh, Whh, Why, bh, by], 
                                [dWxh, dWhh, dWhy, dbh, dby], 
                                [mWxh, mWhh, mWhy, mbh, mby]):
    mem += dparam * dparam
    param += -learning_rate * dparam / np.sqrt(mem + 1e-8) # adagrad update

  p += seq_length # move data pointer
  n += 1 # iteration counter 


data has 1115394 characters, 65 unique.
----
 lBRSt.'IR$I3t TCLgS-cp!YsqWGg&.BuUcmCZsP
H-oFD3XKYK:KQjoSxLdQfvO-ReE;bu jS

:uM3DldffM:lDf.q,gm3xl;rIvaciwLwkBfdCyy':OLL'-OUnEFUhOe;xcvcx r
BvHLZZtGHewXk$Bt!phpC
lkcp&,JIQ:
P;AML AM?d
d.-:zpm-kcVecgaR 
----
iter 0, loss: 208.719363
----
 sySnycli,l isw;whslo e aeseirheathNtdo iFk pmo;ufe -eimoeo
nXcTvip.sy ds!i WOf'prnlanoiuoa,heottIelfmgyedS:r;?ueu u sanobotehsiavwu s ie irrr.wTrmlevg
prcFt ht;veh,llkWehwwrtenoy:auhnrltderr slid
stel 
----
iter 100, loss: 204.122176
----
 I misht febsprreus. y
f :endb aiy imo   aw- aiceg .iehty
W C'N ar Teuld ltee dhOd'l,S f nrrt h gtb tvacl 
bhrehhFtrets ees: bhemnucrh
Ist
Okbnt
Rdms
W
y
Jtsu  hand. nhem
Fiu hh oh-Nsah mdiy yagayat em 
----
iter 200, loss: 199.190930
----
 r IenNmf
Ahwl',s ia sanirr reSweclomte rfense yn raritc rhadni Ios'aretireoaroieen ne aytntmeotmeaowitnekd yaot  oeUr  Iu vi
d uosed b
to btI thi alln uwtseoe fouIwotlsr ,a'  or nhrhph As toes au le 
 
----
iter 300, loss: 194.347805
--

<h2>GRU implementation</h2>
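
<p>For reference, the gating equations below are transcribed from the sample function of the following cell, using the cell's own parameter names. The notation is an assumption of this sketch: \sigma is the logistic sigmoid, \odot is the element-wise product, and \tilde{h}_t is the candidate state (h_can in sample, hcs[t] in lossFun).</p>

```latex
\begin{aligned}
z_t &= \sigma\left(W_{xz} x_t + W_{hz} h_{t-1} + b_z\right) \\
r_t &= \sigma\left(W_{xr} x_t + W_{hr} h_{t-1} + b_r\right) \\
\tilde{h}_t &= \tanh\left(W_{xn} x_t + r_t \odot \left(W_{hn} h_{t-1}\right) + b_n\right) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \\
y_t &= W_{hy} h_t + b_y, \qquad p_t = \operatorname{softmax}(y_t)
\end{aligned}
```

<p>With this convention, z_t close to 0 copies the previous hidden state forward almost unchanged, and z_t close to 1 replaces it with the candidate.</p>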

In [19]:
"""
Minimal character-level GRU model, adapted from the vanilla RNN written by Andrej Karpathy (@karpathy)
BSD License
"""
import numpy as np

# data I/O
with open('linux_input.txt', 'r') as f: # should be simple plain text file
  data = f.read()
chars = list(set(data))
data_size, vocab_size = len(data), len(chars)
print('data has %d characters, %d unique.' % (data_size, vocab_size))
char_to_ix = { ch:i for i,ch in enumerate(chars) }
ix_to_char = { i:ch for i,ch in enumerate(chars) }

# hyperparameters
hidden_size = 100 # size of hidden layer of neurons
seq_length = 50 # number of steps to unroll the RNN for
learning_rate = 1e-2

# model parameters
#Wxh = np.random.randn(hidden_size, vocab_size)*0.01 # input to hidden
#Whh = np.random.randn(hidden_size, hidden_size)*0.01 # hidden to hidden
Why = np.random.randn(vocab_size, hidden_size)*0.01 # hidden to output
Wxz = np.random.randn(hidden_size, vocab_size)*0.01 # input to update gate
Whz = np.random.randn(hidden_size, hidden_size)*0.01 # hidden to update gate
Wxn = np.random.randn(hidden_size, vocab_size)*0.01 # input to candidate state
Whn = np.random.randn(hidden_size, hidden_size)*0.01 # hidden to candidate state
Wxr = np.random.randn(hidden_size, vocab_size)*0.01 # input to reset gate
Whr = np.random.randn(hidden_size, hidden_size)*0.01 # hidden to reset gate
br = np.zeros((hidden_size, 1)) # reset gate bias
bn = np.zeros((hidden_size, 1)) # candidate state bias
bz = np.zeros((hidden_size, 1)) # update gate bias
#bh = np.zeros((hidden_size, 1)) # hidden bias
by = np.zeros((vocab_size, 1)) # output bias

def sigmoid(x):
    return 1/(1+np.exp(-x))

def lossFun(inputs, targets, hprev):
  """
  inputs,targets are both list of integers.
  hprev is Hx1 array of initial hidden state
  returns the loss, gradients on model parameters, and last hidden state
  """
  xs, hs, zs, rs, ys, ps, hcs = {}, {}, {}, {}, {}, {}, {}
  hs[-1] = np.copy(hprev)
  loss = 0
  # forward pass
  for t in range(len(inputs)):

    xs[t] = np.zeros((vocab_size,1)) # encode in 1-of-k representation
    xs[t][inputs[t]] = 1
    zs[t] = sigmoid(np.dot(Wxz, xs[t]) + np.dot(Whz, hs[t - 1]) + bz) # update gate
    rs[t] = sigmoid(np.dot(Wxr, xs[t]) + np.dot(Whr, hs[t - 1]) + br) # reset gate (bias added outside the dot product, as in sample)
    hcs[t] = np.tanh(np.dot(Wxn, xs[t]) + rs[t] * np.dot(Whn, hs[t - 1]) + bn) # hidden candidate state
    hs[t] = (1 - zs[t]) * hs[t - 1] + zs[t] * hcs[t] # hidden state
    ys[t] = np.dot(Why, hs[t]) + by # unnormalized log probabilities for next chars
    ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t])) # probabilities for next chars

    loss += -np.log(ps[t][targets[t],0]) # softmax (cross-entropy loss)
  # backward pass: compute gradients going backwards
  dWxz, dWhz, dbz = np.zeros_like(Wxz), np.zeros_like(Whz), np.zeros_like(bz)
  # (r) Reset gate
  dWxr, dWhr, dbr = np.zeros_like(Wxr), np.zeros_like(Whr), np.zeros_like(br)
  # (n) Candidate state
  dWxn, dWhn, dbn = np.zeros_like(Wxn), np.zeros_like(Whn), np.zeros_like(bn)
  dWhy, dby = np.zeros_like(Why), np.zeros_like(by)

  dhnext = np.zeros_like(hs[0])
  for t in reversed(range(len(inputs))):
    dy = np.copy(ps[t])
    dy[targets[t]] -= 1 # backprop into y. see http://cs231n.github.io/neural-networks-case-study/#grad if confused here
    dWhy += np.dot(dy, hs[t].T)
    dby += dy
    dh = np.dot(Why.T, dy) + dhnext # backprop into h
    dhc = dh * zs[t]
    dz = dh * (hcs[t] - hs[t - 1])
    dh_prev_h = dh * (1 - zs[t])
    #dhraw = (1 - hs[t] * hs[t]) * dh # backprop through tanh nonlinearity
    dzraw = (zs[t] * (1 - zs[t])) * dz
    dhcraw = (1 - hcs[t]**2) * dhc

    #dbh += dhraw
    dbn += dhcraw
    dWxn += np.dot(dhcraw, xs[t].T)

    dr = dhcraw * np.dot(Whn, hs[t - 1])
    dh_prev_hcs = np.dot(Whn.T, dhcraw * rs[t])
    dWhn += np.dot(dhcraw * rs[t], hs[t - 1].T)

    dbz += dzraw
    dWxz += np.dot(dzraw, xs[t].T)

    dh_prev_z = np.dot(Whz.T, dzraw)
    dWhz += np.dot(dzraw, hs[t - 1].T)

    drraw = (rs[t] * (1 - rs[t])) * dr
    dbr += drraw
    dWxr += np.dot(drraw, xs[t].T)

    dh_prev_r = np.dot(Whr.T, drraw) 
    dWhr += np.dot(drraw, hs[t - 1].T)

    dhnext = dh_prev_h + dh_prev_hcs + dh_prev_z + dh_prev_r
  for dparam in [dWxz, dWhz, dbz, dWxr, dWhr, dbr, dWxn, dWhn, dbn, dWhy, dby]:
    np.clip(dparam, -5, 5, out=dparam) # clip to mitigate exploding gradients
  return loss, dWxz, dWhz, dbz, dWxr, dWhr, dbr, dWxn, dWhn, dbn, dWhy, dby, hs[len(inputs)-1]

def sample(h, seed_ix, n):
  """ 
  sample a sequence of integers from the model 
  h is memory state, seed_ix is seed letter for first time step
  """
  x = np.zeros((vocab_size, 1))
  x[seed_ix] = 1
  ixes = []
  for t in range(n):
    r = sigmoid(np.dot(Wxr, x) + np.dot(Whr, h) + br) # reset gate: how much of the previous state feeds the candidate
    z = sigmoid(np.dot(Wxz, x) + np.dot(Whz, h) + bz) # update gate: how much of the candidate replaces the previous state
    h_can = np.tanh(np.dot(Wxn, x) + r * np.dot(Whn, h) + bn)
    h = (1 - z) * h + z * h_can
    y = np.dot(Why, h) + by
    p = np.exp(y) / np.sum(np.exp(y))
    ix = np.random.choice(range(vocab_size), p=p.ravel())
    x = np.zeros((vocab_size, 1))
    x[ix] = 1
    ixes.append(ix)
  return ixes

n, p = 0, 0
mWxz, mWhz, mbz = np.zeros_like(Wxz), np.zeros_like(Whz), np.zeros_like(bz)
# (r) Reset gate
mWxr, mWhr, mbr = np.zeros_like(Wxr), np.zeros_like(Whr), np.zeros_like(br)
# (n) Candidate state
mWxn, mWhn, mbn = np.zeros_like(Wxn), np.zeros_like(Whn), np.zeros_like(bn)

# (Output)
mWhy, mby = np.zeros_like(Why), np.zeros_like(by)
smooth_loss = -np.log(1.0/vocab_size)*seq_length # loss at iteration 0
while (n < 8000):
  # prepare inputs (we're sweeping from left to right in steps seq_length long)
  if p+seq_length+1 >= len(data) or n == 0: 
    hprev = np.zeros((hidden_size,1)) # reset RNN memory
    p = 0 # go from start of data
  inputs = [char_to_ix[ch] for ch in data[p:p+seq_length]]
  targets = [char_to_ix[ch] for ch in data[p+1:p+seq_length+1]]

  # sample from the model now and then
  if n % 100 == 0:
    sample_ix = sample(hprev, inputs[0], 200)
    txt = ''.join(ix_to_char[ix] for ix in sample_ix)
    print('----\n %s \n----' % (txt, ))

  # forward seq_length characters through the net and fetch gradient
  loss, dWxz, dWhz, dbz, dWxr, dWhr, dbr, dWxn, dWhn, dbn, dWhy, dby, hprev = lossFun(inputs, targets, hprev)  
  smooth_loss = smooth_loss * 0.999 + loss * 0.001
  if n % 100 == 0: print('iter %d, loss: %f' % (n, smooth_loss)) # print progress
  
  # perform parameter update with Adagrad
  for param, dparam, mem in zip([Wxz, Whz, bz, Wxr, Whr, br, Wxn, Whn, bn, Why, by],
                                [dWxz, dWhz, dbz, dWxr, dWhr, dbr, dWxn, dWhn, dbn, dWhy, dby],
                                [mWxz, mWhz, mbz, mWxr, mWhr, mbr, mWxn, mWhn, mbn, mWhy, mby]):
    mem += dparam * dparam
    param += -learning_rate * dparam / np.sqrt(mem + 1e-8) # adagrad update

  p += seq_length # move data pointer
  n += 1 # iteration counter 


data has 1115394 characters, 65 unique.
----
 ntedIDEgHCtl$X,zgx?NMdcL?qnq'?UKPdgphBr:?zEXAiZbl-UFWgGIeO.litVfC,3:cgXLRukgfhw YxwwBR,XtgCU
xaCvDxqoXTXDNSb.;,Q?,MzNsvEtcAJC'u vHxTTe-e.KZq?C :&3Xkgm:YKbLdWDwnAInP'ZG3UtNsYl:nbnesy$dUFbH
gt!lsltZRyOc 
----
iter 0, loss: 208.719361
----
 lrMttl c eL  iv c
atieen 'tomUeiota wbe
hesu bhu daNabrImyu
iah eIi'eSdeioushgiwe
t sBnnedim
eh,atot tnt twrh.: totlenai'pahdrca ite mp
?s,elunnk fwfnnCeus
aha;emenmprtoyAEao frarat saarhmh  mtmwTitdy 
----
iter 100, loss: 204.138705
----
 d,y or: tult gotr afhe ths, BiH ced mouthany flteo vMoFi,r oe,
sa moutm ,et Cnre,'u'Uscposk

OumUt nd,m te'var dvrpe ehs y chlplr; tho'ay;
gil weg, masc ro tt
lfiuhs , n, imo,uc:hsTvont rithdClOr fo'  
----
iter 200, loss: 199.027806
----
 ltoc tolin i'lt  bSnle, houdennchar'  htmite-s samre outhe fhaoeg gowesa :o an se thyr
d bfiis eo, oornided bencolic ah oanauinndeeo woiuhd
 oren sevm'eoes ousmrelmoby wo
es, Oy ron lhnrhioeno
nt?gsat 
----
iter 300, loss: 193.855362
--

<p>In this case, with seq_length = 50, the GRU implementation does not stall in reducing its loss the way the plain RNN does. This makes sense: the vanilla RNN struggles with long-range dependencies, whereas the GRU can keep learning.</p>
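
<p>The sketch below gives a rough, hedged illustration of why long sequences are hard for the vanilla RNN. It pushes a gradient back through 50 steps with the same dhnext recursion used in the vanilla lossFun and records its norm. The weights use the notebook's initialization scale, and the random pre-activations are a hypothetical stand-in for real inputs, so this shows the behaviour near initialization only, not that of the trained model.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, steps = 100, 50  # same hidden_size and seq_length as the cells above
Whh = rng.standard_normal((hidden_size, hidden_size)) * 0.01  # notebook's init scale

# Forward pass; the random term is a HYPOTHETICAL stand-in for Wxh @ x + bh
h = np.zeros((hidden_size, 1))
hs = []
for _ in range(steps):
    h = np.tanh(Whh @ h + 0.1 * rng.standard_normal((hidden_size, 1)))
    hs.append(h)

# Backward pass: dhraw = (1 - h*h) * dh, then dhnext = Whh.T @ dhraw
dh = np.ones((hidden_size, 1))
norms = []
for h in reversed(hs):
    dh = Whh.T @ ((1 - h * h) * dh)
    norms.append(float(np.linalg.norm(dh)))

print("gradient norm after  1 step : %.3e" % norms[0])
print("gradient norm after 25 steps: %.3e" % norms[24])
print("gradient norm after 50 steps: %.3e" % norms[49])
```

<p>Every step multiplies the gradient by Whh.T and by the tanh derivative, so the signal from distant characters shrinks quickly. In the GRU cell, the term dh_prev_h = dh * (1 - zs[t]) carries part of the gradient back without passing through a weight matrix, which is one plausible reason it keeps learning at this sequence length.</p>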

<p>Now, if we lower seq_length to 25, the RNN should behave better.</p>