<a href="https://colab.research.google.com/github/daniellevin98/DeepLearning/blob/master/Assignment_2_Daniel_Levin.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Deep Learning with Python - Assignment 2 - Daniel Levin

# 1 Text Generation

###### Verify GPU

In [14]:
import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found') 
print('Found GPU at: {}'.format(device_name))

Found GPU at: /device:GPU:0


## Goal of the assignment


We configure a deep learning network that learns a character-level language model from an input text. The trained network can then generate text one character at a time. 'frankenstein.txt' is the default input text. Since some of the texts uploaded on Ilias are in German, I also tried training on a few of them. I couldn't see any difference in performance between German and English, which makes sense given that all the network learns is the statistics of its input data; the result therefore depends heavily on the training text. I also considered training on several texts at once but decided against it for clarity of evaluation: the model imitates the style of the text it is trained on, so with a mixed training input it would be unclear which style or lexicon it is imitating. Mixing texts might be useful for other goals, but here I limit the model to one specific text to make its assessment easier to follow.

As a side note, for this type of learning problem it is quite challenging to come up with a meaningful evaluation method, especially when comparing two languages.

The text on which the network will be trained is chosen here:

In [15]:
TEXT_FILE = 'frankenstein.txt'

###### Import libraries

In [16]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os

import keras
from keras import models, layers, metrics, optimizers
from sklearn.model_selection import train_test_split

# Data preprocessing
###### Importing the text for training
We read the text from the file configured in `TEXT_FILE` into one string. The text is lowercased, since case adds little to a character-level language model while roughly doubling the number of letters in the alphabet, which spreads the training data over more symbols and makes the model harder to learn. Newline characters are kept because we want to preserve the sequential layout of the text: the network should generate text with the same layout as the text it was trained on.

In [17]:
with open(TEXT_FILE, 'r') as f:
    text = f.read().lower()

print('Corpus length:', len(text))


Corpus length: 421607


The corpus has more than 400,000 characters, which should be enough for training the network.

Now we extract sequences of 80 (`maxlen`) characters each, with a new sequence starting every 3 (`step`) characters. This code fragment is identical to the one shown in Chollet's chapter 8.1.

In [18]:
# the maximal length of characters of each sequence
maxlen = 80

# a new sequence will be sampled every `step` characters
step = 3

# extracted sequences
sentences = []

# the list of 'follow-up' characters - target labels
next_chars = []

for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
    
print('Number of sequences: ', len(sentences))

Number of sequences:  140509
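
To make the windowing concrete, here is a minimal sketch on a toy string. The helper name `make_windows` and the toy values are illustrative assumptions and not part of the assignment; the loop itself mirrors the cell above.

```python
def make_windows(text, maxlen, step):
    """Return (windows, next_chars) using the same loop as the cell above."""
    windows, targets = [], []
    for i in range(0, len(text) - maxlen, step):
        windows.append(text[i: i + maxlen])
        targets.append(text[i + maxlen])
    return windows, targets


# Toy settings chosen only so the result is easy to read.
toy_windows, toy_targets = make_windows('hello world', maxlen=4, step=3)
for window, target in zip(toy_windows, toy_targets):
    print(repr(window), '->', repr(target))
```

The same arithmetic explains the count reported above: with 421607 characters, `maxlen = 80` and `step = 3`, the loop runs (421607 - 80) / 3 = 140509 times.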


Now we're building a list of unique characters in the text in order to index characters.

In [19]:
# list of unique characters in the text
chars = sorted(list(set(text)))
print('Unique characters: ', len(chars))

# Dictionary mapping characters to their indices according to their order in `chars` list
char_indices = dict((char, i) for i, char in enumerate(chars))

Unique characters:  52


Next let's one-hot encode the characters into binary matrices.

In [20]:
# `np.bool` was removed in NumPy 1.24; the builtin `bool` is the equivalent dtype
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for j, char in enumerate(sentence):
        x[i, j, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1
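
As a small illustration of the encoding, the sketch below one-hot encodes a four-character toy string and decodes it again. The names `toy_text`, `toy_chars` and `toy_indices` are made up for this example; the scheme is the same as in the cell above.

```python
import numpy as np

toy_text = 'abca'
toy_chars = sorted(set(toy_text))                      # ['a', 'b', 'c']
toy_indices = {char: i for i, char in enumerate(toy_chars)}

# One row per character position, one column per alphabet symbol.
one_hot = np.zeros((len(toy_text), len(toy_chars)), dtype=bool)
for position, char in enumerate(toy_text):
    one_hot[position, toy_indices[char]] = True

print(one_hot.astype(int))

# Decoding: argmax over the alphabet axis recovers the original characters.
decoded = ''.join(toy_chars[i] for i in one_hot.argmax(axis=1))
print(decoded)
```

For the real data, `x` holds 140509 × 80 × 52 = 584,517,440 boolean entries, which at one byte per entry is roughly 0.58 GB.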

# Building the network
As a baseline let's take the very straightforward architecture proposed by Chollet in chapter 8.1: one `LSTM` layer followed by a `Dense` classifier with a softmax over all possible characters. The only thing I varied at this stage is the number of units in the LSTM layer. The difference is hard to see, but judging subjectively from the generated samples, 256 units gave the best result.

In [38]:
model = models.Sequential()
model.add(layers.LSTM(256, input_shape=(maxlen, len(chars))))
model.add(layers.Dense(len(chars), activation='softmax'))

Our targets are one-hot encoded, so we use the `categorical_crossentropy` loss at compilation. I tried the RMSprop, Adam and SGD optimizers. Adam gave a slightly better run time, so I went with it. Note that Adam does not save memory compared with the others: it stores two moving averages per parameter, which is more optimizer state than RMSprop or plain SGD keep.

In [41]:
model.compile(loss='categorical_crossentropy', optimizer=optimizers.Adam())
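
To see why this loss fits one-hot targets, the sketch below computes the cross-entropy by hand for a three-symbol alphabet. The function and the toy distributions are illustrative assumptions, not Keras internals; with a one-hot target the sum reduces to the negative log-probability assigned to the correct character.

```python
import math


def categorical_crossentropy(one_hot_target, predicted_probs):
    """Cross-entropy between a one-hot target and a predicted distribution."""
    return -sum(t * math.log(p)
                for t, p in zip(one_hot_target, predicted_probs))


target = [0, 1, 0]            # the true next character is symbol 1
confident = [0.1, 0.8, 0.1]   # the model puts 0.8 on the correct symbol
unsure = [0.4, 0.2, 0.4]      # the model puts only 0.2 on it

print(categorical_crossentropy(target, confident))  # equals -log(0.8)
print(categorical_crossentropy(target, unsure))     # equals -log(0.2)
```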

# Training the language model and sampling from it
Here we implement the `sample` function exactly as in Chollet's chapter: it reweights the predicted probability distribution for a given temperature and then draws a character index from it.

In [42]:
def sample(preds, temperature=1.0):
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
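
The effect of the temperature is easier to see without the random draw. The sketch below isolates the reweighting step of `sample` in a helper called `reweight` (a name introduced only for this illustration) and applies it to a toy distribution.

```python
import numpy as np


def reweight(preds, temperature):
    """Return the temperature-reweighted distribution used inside `sample`."""
    logits = np.log(np.asarray(preds, dtype='float64')) / temperature
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum()


toy_preds = [0.5, 0.3, 0.2]
for temperature in (0.2, 1.0, 2.0):
    print(temperature, np.round(reweight(toy_preds, temperature), 3))
```

A temperature of 1.0 leaves the distribution unchanged, values below 1.0 concentrate the mass on the most likely character, and values above 1.0 flatten the distribution so that unlikely characters are sampled more often.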

Now we are ready to implement the main loop, where we generate text by repeatedly sampling the next character from the reweighted distribution and appending it to the current text.

First let's fit the model. I chose a fairly large batch size to save training time. I tried smaller values and couldn't see a difference in performance.

In [43]:
epochs = 60
batch_size = 1024

model.fit(x, y,
          batch_size=batch_size,
          epochs=epochs)

Epoch 1/60
...
Epoch 60/60


I decided not to monitor how the model converges over the epochs. Instead I'll focus on temperature tuning and see how the resulting generator behaves.


Second, let's implement a function `generate_text` that generates and prints the next `length` predicted characters given a `text` and a `start_index`. Sampling follows the distribution predicted by `model`, reweighted with the given `temperature`.

In [44]:
import random
import sys

def generate_text(text, start_index, temperature=1.0, length=300):
    
    # the window in the text we're currently looking at
    generated_text = text[start_index: start_index + maxlen]
    
    print('--- Generating with seed: "' + generated_text + '"\n')

    print('--- Fragment generated by the model\n')
    
    sys.stdout.write(generated_text)

    # we generate `length` characters
    for i in range(length):
        sampled = np.zeros((1, maxlen, len(chars)))
        for t, char in enumerate(generated_text):
            sampled[0, t, char_indices[char]] = 1.

        # predict the next character
        preds = model.predict(sampled, verbose=0)[0]
        next_index = sample(preds, temperature)
        next_char = chars[next_index]

        # update the current generated text by appending the sampled char and moving it one char right
        generated_text = generated_text[1:] + next_char

        sys.stdout.write(next_char)
        sys.stdout.flush()
    print()

Now let's tune the model by using different temperatures and observing the outcome. I'll create a function that gets as input an array of temperatures to be tried out.

In [45]:
def generate_texts(temperatures):
    for temperature in temperatures:
        print('###########\nTemperature: ' + str(temperature))
        # Select a text seed at random
        start_index = random.randint(0, len(text) - maxlen - 1)
        generate_text(text, start_index, temperature=temperature)

temperatures = np.arange(0.1, 1.5, 0.2)
generate_texts(temperatures)

###########
Temperature: 0.1
--- Generating with seed: "moment, instead of venting my sensations in
exclamations and agony, i did not ru"

--- Fragment generated by the model

moment, instead of venting my sensations in
exclamations and agony, i did not ruty tore his to entire to the starms
and despair.  but not in my consume that i had and at the contral
proside that it was the sumplore of my sould stranger and
production the old man and string for the hopes of the heavess and
particlly the sprief streat of the sident sariture which i had and
a thou
###########
Temperature: 0.30000000000000004
--- Generating with seed: "esire to
become an actor in the busy scene where so many admirable qualities
wer"

--- Fragment generated by the model

esire to
become an actor in the busy scene where so many admirable qualities
were as a surdons of mine suffered by elizabeth as the deep and magnas
on the even for a long toor and there was a triess of his father,
the eaght of the death of the same sc
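
The label `0.30000000000000004` in the output above comes from floating-point accumulation in `np.arange`. A minimal sketch of one way to get clean labels is to round the array to one decimal place before printing; this is an optional cosmetic change, not something the generation itself depends on.

```python
import numpy as np

raw = np.arange(0.1, 1.5, 0.2)
rounded = np.round(raw, 1)

print([str(t) for t in raw])       # some labels carry floating-point noise
print([str(t) for t in rounded])   # labels rounded to one decimal place
```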

As we can see from the generated texts, high temperature values result in quite creative text with non-existent words that still comply with the phonotactic constraints of the language. The layout of the text is very similar to the original: punctuation marks and newlines are placed plausibly. In contrast, low temperature values produce very predictable and repetitive structures, where words are very similar to those in the training text but punctuation marks are overused and newline characters are not used at all.

At this point the temperatures 0.7 and 0.9 seem to be the most balanced.

# Choosing an optimal network architecture
To choose an architecture that performs better than the one we just tested, we should focus on the disadvantages of the first one. The most significant one that came to my mind is that an LSTM is tightly coupled to a strictly sequential view of language. Natural language is more general: it allows a finite number of recursive patterns that are usually built up into hierarchies. This leads to an idea similar to the logic behind convolutional networks.

Intuitively, convnets would help since they make the network learn local patterns as well as hierarchical relationships between those patterns. This may be very handy for text generation as an addition to strictly sequential LSTMs. Learning patterns independently of where they appear should help the network generalize better. At the same time we should be aware of overfitting. It can be partly addressed by adding a MaxPooling layer and choosing a higher temperature at generation time. As a first layer, before feeding the convnet, we would use an embedding, which serves our purpose as a preprocessing step that creates an abstract representation of each character. It would work better if we generated text word by word, but overall it still gives some additional information about the distribution and mutual behaviour of the characters.
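
The idea of detecting a local pattern regardless of its position can be illustrated with a hand-written 1D convolution in NumPy. This is only a sketch under stated assumptions: the helpers `conv1d_valid` and `one_hot`, the toy alphabet, and the hand-set filter for the trigram 'the' are all made up for the example, whereas a real `Conv1D` layer would learn its filter weights during training.

```python
import numpy as np


def conv1d_valid(sequence, kernel):
    """Slide `kernel` (width, channels) over `sequence` (length, channels)."""
    width = kernel.shape[0]
    positions = sequence.shape[0] - width + 1
    return np.array([np.sum(sequence[i: i + width] * kernel)
                     for i in range(positions)])


def one_hot(text, alphabet):
    """Encode `text` as a (len(text), len(alphabet)) one-hot matrix."""
    encoded = np.zeros((len(text), len(alphabet)))
    for position, char in enumerate(text):
        encoded[position, alphabet.index(char)] = 1.0
    return encoded


alphabet = 'abcdefghijklmnopqrstuvwxyz '
# Hand-set filter that responds most strongly to the trigram 'the'.
the_filter = one_hot('the', alphabet)

for text in ('the cat', 'see the'):
    response = conv1d_valid(one_hot(text, alphabet), the_filter)
    # Global max pooling keeps the strongest response, wherever it occurred.
    print(repr(text), 'argmax:', int(response.argmax()), 'max:', response.max())
```

The pooled maximum is the same for both strings even though the trigram sits at different positions, which is the position independence described above.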

After some struggling I realised that combining a convolutional network with an LSTM layer is not so obvious. It seemed to require importing an additional module from the keras package. Another option is to hard-code this behaviour in NumPy. I'm not sure how to do that, so I decided the hassle isn't worth it and that I can still mitigate the problem discussed above by introducing an Embedding layer before the LSTM layers.

After some experimenting, this is the architecture that I found most successful.


In [76]:

cmodel = models.Sequential()
cmodel.add(layers.LSTM(128, return_sequences=True, input_shape=(maxlen, len(chars))))
cmodel.add(layers.LSTM(128, return_sequences=False))
cmodel.add(layers.Dense(len(chars), activation='softmax'))

cmodel.summary()
# compiling the network
cmodel.compile(loss='categorical_crossentropy', optimizer=optimizers.Adam())

# training
c_epochs = 60
c_batch_size = 1024

cmodel.fit(x, y,
           epochs=c_epochs,
           batch_size=c_batch_size)




Model: "sequential_52"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_41 (LSTM)               (None, 80, 128)           92672     
_________________________________________________________________
lstm_42 (LSTM)               (None, 128)               131584    
_________________________________________________________________
dense_21 (Dense)             (None, 52)                6708      
Total params: 230,964
Trainable params: 230,964
Non-trainable params: 0
_________________________________________________________________
Epoch 1/60
...
Epoch 32/60
[output truncated]

<tensorflow.python.keras.callbacks.History at 0x7f484f8ce5f8>
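
The parameter counts in the model summary above can be checked by hand. The sketch below uses the standard parameter formulas for an LSTM layer (four gates, each with input weights, recurrent weights and a bias) and for a dense layer; the helper names are introduced only for this check.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """One weight per input-output pair plus one bias per output unit."""
    return input_dim * units + units


vocab_size = 52  # number of unique characters reported earlier

first_lstm = lstm_params(vocab_size, 128)
second_lstm = lstm_params(128, 128)
classifier = dense_params(128, vocab_size)

print(first_lstm, second_lstm, classifier)
print('total:', first_lstm + second_lstm + classifier)
```

This reproduces 92672, 131584 and 6708, for a total of 230,964. By the same formulas the single-layer 256-unit baseline has 329,780 parameters, so the stacked model is deeper but not larger.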

In [77]:
generate_texts(temperatures)

###########
Temperature: 0.1
--- Generating with seed: " praises bestowed on
her i received as made to a possession of my own.  we calle"

--- Fragment generated by the model

 praises bestowed on
her i received as made to a possession of my own.  we called with the
happiness of the latter and wele and the sun be any one consented the
same scare of the surrow reternity of the same science and among the
treatory was to come conversent the expetion of the works of the
eveness of the death of the morner which i had and a suge parion of
dear creatored an
###########
Temperature: 0.30000000000000004
--- Generating with seed: "boy, you will never see your father again; you must come with me.'

"'hideous mo"

--- Fragment generated by the model

boy, you will never see your father again; you must come with me.'

"'hideous mountains and presently to my exerte to me.  i had contented that
i was unound bestowed to the nect as it sarries that she sould her have
been the crimes of the events which

These results are the best that I could achieve. For low temperatures we can see a decent grammatical structure in the sentences, and verbs are used more frequently. The results for high temperatures are similar to those of the initial model: made-up words with realistic punctuation. A possible explanation for why this model seems to perform better is that it is deeper (two stacked LSTM layers instead of one) and hence has learned more advanced underlying statistical structures. It has probably learnt some grammar just by looking at sequences of characters, which is very impressive. This seems plausible because I could observe the same tendency for other texts, in particular for German ones.

# 2 Extra Credit: Literature and Computation

I will refer to the fragment that is, in my opinion, the most relevant:

"
It's really what people call intuition and make such a fuss about. 
Intuition is like reading a word without having to spell it out. 
A child can't do that, because it has had so little experience. 
But a grown-up person knows the word because he's seen it often before.
"

This passage strongly reminds me of what we're trying to achieve when we learn a language model by means of a neural network. We're trying to make the network acquire some kind of intuition about the language it's trained on, just like a child absorbs natural language by being constantly exposed to linguistic input. So in terms of statistics, humans and networks are similar: the more data they get, the better they learn the model. The one difference (an enormous one, unfortunately for machine learning) is that a human being is able to take these statistics a few steps further than the most sophisticated and computationally advanced network as of today. A human is able to interpret such data and infer, in a very natural and so far poorly understood way, how to apply this knowledge to an unlimited number of real and abstract situations. Machines are getting better at this but are still far behind us. So the phrase "reading a word without having to spell it out" means that a person doesn't need a numerical (or any other) encoding of a word to be able to perceive it. Machines, in contrast, are helpless without a decent encoding.

Another parallel to our assignment lies in comparing a child to a neural network in its first epochs. Neither knows much about the language yet; both are just gathering first observations, that is, statistics and context. When the child grows up, its knowledge is more like that of a powerful neural network that contains a comprehensive language model under the hood. But obviously, as I already mentioned, deep learning as we know it today is quite limited and cannot go beyond the limits set by statistics, even if a human cannot clearly explain what exactly is missing.