# Práctico 2

En este práctico se aplicarán los conocimientos adquiridos de redes avanzadas.
Basándose en lo aprendido con el práctico 1, esperamos que apliquen la misma
metodología en la creación de un pipeline de trabajo para el entrenamiento y
evaluación de modelos de redes neuronales avanzados en tareas que les
conciernan.

El problema a elegir quedará a su criterio. Algunas posibilidades son:

1. Tareas de PLN (pueden aplicarse a redes convolucionales o recurrentes):
    - Análisis de sentimiento sobre los datos del práctico 1.
    - Análisis de sentimiento sobre el conjunto de datos de IMDB (disponible en
      `keras.datasets`).
    - Clasificación de textos del [20 Newsgroup](http://qwone.com/~jason/20Newsgroups/).
2. Tareas de imágenes:
    - Reconocimiento de imágenes del CIFAR-10 (disponible en `keras.datasets`)
      usando convolucionales.
    - Análisis de videos de [YouTube](https://ai.googleblog.com/2016/09/announcing-youtube-8m-large-and-diverse.html)
      usando recurrentes.

Estos son solo algunos ejemplos. Pueden explorar otros conjuntos de datos y ver
cómo aplicar las distintas redes (o incluso combinaciones de las mismas). Si
van a trabajar en algo de imágenes busquen clasificación de imágenes a color
para aprovechas los canales en las redes convolucionales.  A la hora de
entregar sus trabajos dejen un link al conjunto de datos trabajado.

In [3]:
#download the data
import keras
from keras.datasets import imdb 
top_words = 5000 
(X_train, y_train), (X_test, y_test) = imdb.load_data(num_words=top_words)

In [4]:
y_train

array([1, 0, 0, ..., 0, 1, 0])

In [5]:
X_train

array([list([1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 2, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 2, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 2, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 2, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 2, 19, 178, 32]),
       list([1, 194, 1153, 194, 2, 78, 228, 5, 6, 1463, 4369,

In [6]:
#reverse lookup
INDEX_FROM = 1
word_to_id = keras.datasets.imdb.get_word_index()
word_to_id = {k:(v+INDEX_FROM) for k,v in word_to_id.items()}
word_to_id["<PAD>"] = 0
word_to_id["<START>"] = 1
word_to_id["<UNK>"] = 2
id_to_word = {value:key for key,value in word_to_id.items()}
print(' '.join(id_to_word[id] for id in X_train[0] ))

<START> was not for it's self joke professional disappointment see already pretending their staged a every so found of his movies it's third plot good episodes <UNK> in who guess wasn't of doesn't a again plot find <UNK> poor let her a again vegas trouble with fight like that oh a big good for to watching essentially but was not a fat centers turn a not well how this for it's self like bad as that natural a not with starts with this for david movie <UNK> of only moments this br special br films of a sell <UNK> for guess their childish an a man this for like musical of his ever more so while there his feelings an to not this role be get when of was others for people <UNK> br a character love <UNK> as found a <UNK> is turner of upon so well it's self fine have early seeing if is a <UNK> social that watch him a sex as plays could by suffering time have through to long <UNK> movie a music not on scene fine have guess of i'm all <UNK> movie more so be whole its his watch a music see for lik

In [7]:
#one hot encode your documents
from numpy import array
from keras.preprocessing.text import one_hot
docs = ['Gut gemacht',
        'Gute arbeit',
        'Super idee',
        'Perfekt erledigt',
        'exzellent',
        'naja',
        'Schwache arbeit.',
        'Nicht gut',
        'Miese arbeit.',
        'Hätte es besser machen können.']
# integer encode the documents
vocab_size = 50
encoded_docs = [one_hot(d, vocab_size) for d in docs]
print(encoded_docs)

[[12, 31], [7, 8], [44, 28], [44, 17], [44], [46], [45, 8], [21, 12], [38, 8], [22, 2, 29, 17, 49]]


In [8]:
# Truncate and pad the review sequences 
from keras.preprocessing import sequence 
max_review_length = 500 
X_train = sequence.pad_sequences(X_train, maxlen=max_review_length) 
X_test = sequence.pad_sequences(X_test, maxlen=max_review_length) 

In [9]:
X_train

array([[   0,    0,    0, ...,   19,  178,   32],
       [   0,    0,    0, ...,   16,  145,   95],
       [   0,    0,    0, ...,    7,  129,  113],
       ...,
       [   0,    0,    0, ...,    4, 3586,    2],
       [   0,    0,    0, ...,   12,    9,   23],
       [   0,    0,    0, ...,  204,  131,    9]], dtype=int32)

# Model1 with recurrent layer

In [10]:
from keras.models import Sequential
from keras.layers import Embedding
from keras.layers import LSTM
from keras.layers import Dense

# Build the model 
embedding_vector_length = 32 
model = Sequential() 
model.add(Embedding(top_words, embedding_vector_length, input_length=max_review_length)) 
model.add(LSTM(100)) 
model.add(Dense(1, activation='sigmoid')) 
model.compile(loss='binary_crossentropy',optimizer='adam', metrics=['accuracy']) 
print(model.summary()) 

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 500, 32)           160000    
_________________________________________________________________
lstm_1 (LSTM)                (None, 100)               53200     
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 101       
Total params: 213,301
Trainable params: 213,301
Non-trainable params: 0
_________________________________________________________________
None


In [112]:
print(model.layers[0].get_weights()[0].shape)


(5000, 32)


In [110]:
model.layers[0].get_weights()[0]

array([[ 0.06251945,  0.01849978,  0.01251933, ...,  0.05105105,
         0.03683813, -0.00451667],
       [-0.09376787,  0.01084303,  0.11799113, ..., -0.02341145,
        -0.10103051,  0.02137175],
       [-0.00684326, -0.00661166, -0.00053332, ..., -0.00932615,
        -0.02350529, -0.01377067],
       ...,
       [-0.03681023, -0.00687249, -0.014835  , ..., -0.05329632,
        -0.04246052, -0.0339723 ],
       [-0.03253951, -0.00426907,  0.01919474, ...,  0.00326748,
        -0.01933559,  0.03650987],
       [-0.02083049, -0.01888216,  0.04749423, ..., -0.02236629,
        -0.04648605,  0.0078646 ]], dtype=float32)

In [111]:
model.layers[0].get_config()

{'activity_regularizer': None,
 'batch_input_shape': (None, 500),
 'dtype': 'float32',
 'embeddings_constraint': None,
 'embeddings_initializer': {'class_name': 'RandomUniform',
  'config': {'maxval': 0.05, 'minval': -0.05, 'seed': None}},
 'embeddings_regularizer': None,
 'input_dim': 5000,
 'input_length': 500,
 'mask_zero': False,
 'name': 'embedding_1',
 'output_dim': 32,
 'trainable': True}

In [113]:
model.layers[1].get_config()

{'activation': 'tanh',
 'activity_regularizer': None,
 'bias_constraint': None,
 'bias_initializer': {'class_name': 'Zeros', 'config': {}},
 'bias_regularizer': None,
 'dropout': 0.0,
 'go_backwards': False,
 'implementation': 1,
 'kernel_constraint': None,
 'kernel_initializer': {'class_name': 'VarianceScaling',
  'config': {'distribution': 'uniform',
   'mode': 'fan_avg',
   'scale': 1.0,
   'seed': None}},
 'kernel_regularizer': None,
 'name': 'lstm_1',
 'recurrent_activation': 'hard_sigmoid',
 'recurrent_constraint': None,
 'recurrent_dropout': 0.0,
 'recurrent_initializer': {'class_name': 'Orthogonal',
  'config': {'gain': 1.0, 'seed': None}},
 'recurrent_regularizer': None,
 'return_sequences': False,
 'return_state': False,
 'stateful': False,
 'trainable': True,
 'unit_forget_bias': True,
 'units': 100,
 'unroll': False,
 'use_bias': True}

In [114]:
model.layers[2].get_config()

{'activation': 'sigmoid',
 'activity_regularizer': None,
 'bias_constraint': None,
 'bias_initializer': {'class_name': 'Zeros', 'config': {}},
 'bias_regularizer': None,
 'kernel_constraint': None,
 'kernel_initializer': {'class_name': 'VarianceScaling',
  'config': {'distribution': 'uniform',
   'mode': 'fan_avg',
   'scale': 1.0,
   'seed': None}},
 'kernel_regularizer': None,
 'name': 'dense_1',
 'trainable': True,
 'units': 1,
 'use_bias': True}

In [11]:
#Train the model
model.fit(X_train, y_train, validation_data=(X_test, y_test), nb_epoch=3, batch_size=64) 

  from ipykernel import kernelapp as app


Train on 25000 samples, validate on 25000 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7fbc41271908>

In [133]:
from keras.models import load_model

model.save('Practico2_IMDB_recurrent.h5')

In [13]:
#Evaluate the model
scores = model.evaluate(X_test, y_test, verbose=0) 
print("Accuracy: %.2f%%" % (scores[1]*100))

Accuracy: 84.25%


In [115]:
#predict sentiment from reviews
bad = "this movie was terrible and bad"
good = "i really liked the movie and had fun"

for review in [good,bad]:
    tmp = []
    for word in review.split(" "):
        tmp.append(word_to_id[word])
    tmp_padded = sequence.pad_sequences([tmp], maxlen=max_review_length) 
    print("%s. Sentiment: %s" % (review,model.predict(array([tmp_padded][0]))[0][0]))
# i really liked the movie and had fun. Sentiment: 0.715537
# this movie was terrible and bad. Sentiment: 0.0353295

i really liked the movie and had fun. Sentiment: 0.88983554
this movie was terrible and bad. Sentiment: 0.8909906


# Model2 with convolutional layer


In [126]:
from keras.layers.convolutional import Conv1D
from keras.layers.convolutional import MaxPooling1D
from keras.layers import Flatten
import numpy

seed = 7
numpy.random.seed(seed)

max_words = 500

model2 = Sequential()
model2.add(Embedding(top_words, 32, input_length=max_words))
model2.add(Conv1D(filters=32, kernel_size=3, padding='same', activation='relu'))
model2.add(MaxPooling1D(pool_size=2))
model2.add(Flatten())
model2.add(Dense(250, activation='relu'))
model2.add(Dense(1, activation='sigmoid'))
model2.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model2.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_6 (Embedding)      (None, 500, 32)           160000    
_________________________________________________________________
conv1d_4 (Conv1D)            (None, 500, 32)           3104      
_________________________________________________________________
max_pooling1d_3 (MaxPooling1 (None, 250, 32)           0         
_________________________________________________________________
flatten_2 (Flatten)          (None, 8000)              0         
_________________________________________________________________
dense_4 (Dense)              (None, 250)               2000250   
_________________________________________________________________
dense_5 (Dense)              (None, 1)                 251       
Total params: 2,163,605
Trainable params: 2,163,605
Non-trainable params: 0
_________________________________________________________________


In [143]:
model2.layers[1].get_config()

{'activation': 'relu',
 'activity_regularizer': None,
 'bias_constraint': None,
 'bias_initializer': {'class_name': 'Zeros', 'config': {}},
 'bias_regularizer': None,
 'data_format': 'channels_last',
 'dilation_rate': (1,),
 'filters': 32,
 'kernel_constraint': None,
 'kernel_initializer': {'class_name': 'VarianceScaling',
  'config': {'distribution': 'uniform',
   'mode': 'fan_avg',
   'scale': 1.0,
   'seed': None}},
 'kernel_regularizer': None,
 'kernel_size': (3,),
 'name': 'conv1d_4',
 'padding': 'same',
 'strides': (1,),
 'trainable': True,
 'use_bias': True}

In [127]:
# Fit the model
model2.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=2, batch_size=128, verbose=2)
# Final evaluation of the model
scores = model2.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Train on 25000 samples, validate on 25000 samples
Epoch 1/2
 - 11s - loss: 0.4343 - acc: 0.7738 - val_loss: 0.2758 - val_acc: 0.8845
Epoch 2/2
 - 11s - loss: 0.2072 - acc: 0.9192 - val_loss: 0.2947 - val_acc: 0.8753
Accuracy: 87.53%


In [128]:
model2.save('Practico2_IMDB_convolutional.h5')

In [131]:
#predict sentiment from reviews
bad = "this movie was terrible and bad"
good = "i really liked the movie and had fun"

for review in [good,bad]:
    tmp = []
    for word in review.split(" "):
        tmp.append(word_to_id[word])
    tmp_padded = sequence.pad_sequences([tmp], maxlen=max_review_length) 
    print("%s. Sentiment: %s" % (review,model2.predict(array([tmp_padded][0]))[0][0]))
# i really liked the movie and had fun. Sentiment: 0.715537
# this movie was terrible and bad. Sentiment: 0.0353295

i really liked the movie and had fun. Sentiment: 0.52189475
this movie was terrible and bad. Sentiment: 0.5327182
