##  Text generation by Markov chain

In [1]:
import csv

Markov chain is a probabalistic model in which the probability of each event depends only on the state attained in the previous event. [markovify](https://github.com/jsvine/markovify) is a library for text generation by Markov chain.

Use "pip install markovify" to install markovify

In [2]:
import markovify

Download [Dataset.csv](https://drive.google.com/file/d/1raxIkJZ4lMTvgTB8eYjxu4Ux8aNL-x0s/view?usp=sharing) composed of sarcastic and serious headlines for the news. The csv-file consists of two columns. "headline" column contains texts of headlines. "is_sarcastic" column contain 0 if the hiadline is serious and 1 otherwise.

Read dataset 

In [3]:
texts_serious = [] # list of serious hiadline texts
texts_sarcastic = [] # list of sarcastic hiadline texts
with open('Dataset.csv', encoding='utf-8') as f:
    # Rread csv-file by DictReader from csv library
    reader = csv.DictReader(f) 
    for line in reader:
        # read texts of headline
        headline = line['headline'].strip()
        # read sarcasticity of headline
        is_sarcastic = int(line['is_sarcastic'].strip())
        if is_sarcastic:
            texts_sarcastic.append(headline)
        else:
            texts_serious.append(headline)
print('Found {} sarcastic texts in Dataset.csv'.format(len(texts_sarcastic)))
print('Found {} serious texts in Dataset.csv'.format(len(texts_serious)))

Found 11724 sarcastic texts in Dataset.csv
Found 14985 serious texts in Dataset.csv


Merge headlines into one text

In [4]:
text_0 = '\n'.join(texts_serious)
text_1 = '\n'.join(texts_sarcastic)

Create Markov chain model for serious headlines generation

In [5]:
#your code here
serious_text_model = markovify.NewlineText(texts_serious, state_size = 2)

Generate 20 serious headlines

In [6]:
#your code here
for i in range(20):
    print(serious_text_model.make_sentence())

this simple reform would get a little disappointing
britain to send elite army unit to investigate post-election surge in the minority
7 styling tricks to make your meetings more positive leader
spanish government threatens to fire bad teachers
the good life in prison
simone biles pulls the perfect day in office. that matters.
huffpost rise: what you need to have a bias against gay men in the shadow of a tiger looks like: faith instead of arresting panhandlers, albuquerque's giving them jobs
demi lovato and wilmer valderrama decide to give book on earth
alligator and python locked in death duel on golf course from the iran deal
watch dirt bikers get a little disappointing
how many women does it mean when we talk about abortion
pelosi throws cold water on gay marriage and families
bayer offers to buy clippers for $2 billion
obama takes shots at bank drive-thru
who needs 10 fps when you realize someone you thought was a baby with gorgeous photo
chuck grassley introduces donald trump is g

Create model and generate 20 sarcastic headlines

In [7]:
sarcastic_text_model = markovify.NewlineText(texts_sarcastic, state_size = 2)
for i in range(20):
    print(sarcastic_text_model.make_sentence())

experts advise against throwing laptop across office even though it's what prince would have tipped gun control debate
rich white people away
government-publications enthusiast makes pilgrimage to facebook headquarters to lay readers
google unveils new self-defense vase for smashing onto head of illinois nepotist party
rat fancy magazine fails to reach climax
tv show has another season in her office before leaving
obama a little ad space to host 1998 ill-will games
time traveler from the strokes accused of wartime non-profiteering
either jay leno reconsiders retirement after georgia woman sets google alert for kevin costner
hire of local moron gives nation hope for saving planet lies with people who save napkins from takeout order presumes man eats anything like human being terrific news for grover cleveland fans
wheelchair-bound student would have been proud of unsuccessful child
sxsw as cool and as real as it slowly lowers itself into warm arctic water
nation demands more slow-motion

## Text generation by Variation Autoencoder

This part of tutorial based on [Text generation with a Variational Autoencoder](https://nicgian.github.io/text-generation-vae/) article. For sentence generation we use Variational Autoencoder (VAE) neural network model that is an extension seq2seq model. Originally VAE was described in [Auto-Encoding Variational Bayes](https://arxiv.org/pdf/1312.6114.pdf) paper. The idea behind Variational Autoencoder is that we impose predefined disribution (e.g., normal distribution) on the latent state formed by encoder. On the one hand this restriction alow us to sample random vectors from normal distribution and generate arbitrary sentences. On the othe hand this restriction form very dense well differentiated space without holes

In [8]:
import tensorflow as tf
import tensorflow_addons as tfa


from keras.layers import Bidirectional, Dense, Embedding, \
Input, Lambda, LSTM, RepeatVector, TimeDistributed, Layer, Activation, Dropout
from keras.preprocessing.sequence import pad_sequences
from keras.layers.advanced_activations import ELU
from keras.preprocessing.text import Tokenizer
from keras.callbacks import ModelCheckpoint
from keras.optimizers import Adam
from keras import backend as K
from keras.models import Model

from scipy import spatial
import pandas as pd
import numpy as np
import codecs
import random
import csv
import os

Dowload [GloVe](http://nlp.stanford.edu/data/glove.6B.zip) pretrained word embedding vectors

In [9]:
MAX_SEQUENCE_LENGTH = 15 # Max text length in tokens

MAX_NB_WORDS = 20000 # Max words in dictionary
EMBEDDING_DIM = 50 # Dimensionality of GloVe vectors 

In [10]:
#!wget http://nlp.stanford.edu/data/glove.6B.zip

In [11]:
#!unzip glove.6b.zip

Create sentence tokenizer and two dictionaries: word_to_id and id_to_word

For tokenisation we use [Tokenizer](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer) from keras library.

In [12]:

def tokenize( lang,MAX_SEQUENCE_LENGTH):
        # lang = list of sentences in a language
        word_to_id={}
        id_to_word={}
        # print(len(lang), "example sentence: {}".format(lang[0]))
        lang_tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='',oov_token='<OOV>')#, 
        lang_tokenizer.fit_on_texts(lang)
        print('nb',lang_tokenizer.num_words)
        ## tf.keras.preprocessing.text.Tokenizer.texts_to_sequences converts string (w1, w2, w3, ......, wn) 
        ## to a list of correspoding integer ids of words (id_w1, id_w2, id_w3, ...., id_wn)
        tensor = lang_tokenizer.texts_to_sequences(lang) 

        ## tf.keras.preprocessing.sequence.pad_sequences takes argument a list of integer id sequences 
        ## and pads the sequences to match the longest sequences in the given input
        tensor = tf.keras.preprocessing.sequence.pad_sequences(tensor,maxlen=MAX_SEQUENCE_LENGTH, padding='post')
        
        for word, i in lang_tokenizer.word_index.items():
            word_to_id[word]=i
            id_to_word[i]=word
            
        
        return tensor, lang_tokenizer,word_to_id,id_to_word
    
#or ##########################    
texts = texts_serious + texts_sarcastic
tokenizer = Tokenizer(MAX_NB_WORDS)
tokenizer.fit_on_texts(texts)
word_to_id = tokenizer.word_index 
id_to_word = {v: k for k, v in word_to_id.items()}

Tokenize sarcastic texts and create tensor composed of tokens indexes. If a sentence shorter than MAX_SEQUENCE_LENGTH we pad it. If a sentence longer than MAX_SEQUENCE_LENGTH we cut it.


In [13]:
sequences = tokenizer.texts_to_sequences(texts_sarcastic)
data_1 = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
print('Data tensor shape:', data_1.shape)
NB_WORDS = (min(tokenizer.num_words, len(word_to_id)) + 1 ) #+1 for zero padding
data_1_val = data_1[-1440:] #select 6000 sentences as validation data
data_1 = data_1[:-1440]

Data tensor shape: (11724, 15)


In [14]:

tensor, lang_tokenizer,word_to_id,id_to_word=tokenize(texts_sarcastic,MAX_SEQUENCE_LENGTH)


nb None


In [15]:
NB_WORDS = (len(word_to_id)) + 1 

Define batch generator to train a neural network

For padding sentences to max length we use [pad_sequences](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence/pad_sequences)

In [42]:
# def sent_generator(TRAIN_DATA_FILE, batchsize):
#     # Create iterator that reads dataset file batch by batch 
#     reader = pd.read_csv(TRAIN_DATA_FILE, chunksize=batchsize, iterator=True)
#     for batch in reader:
#         # Read a column that contains headlines
#         #text=reader.headline.str.strip().tolist()
#         text= batch['headline'].str.strip().tolist()
#         # Tokenize texts and create padded tensor composed of tokens indexes
        
#         tensor, lang_tokenizer,word_to_id,id_to_word=tokenize(text,MAX_SEQUENCE_LENGTH)
        
#         # Return input-target pairs
#         yield [tensor, tensor]

In [44]:
def sent_generator(TRAIN_DATA_FILE, batchsize):
    # Create iterator that reads dataset file batch by batch 
    #your code here
    reader = pd.read_csv(TRAIN_DATA_FILE, chunksize=batchsize, iterator=True)
    for df in reader:
        # Read a column that contains headlines
        headlines = df['headline'].str.strip().tolist()
        # Tokenize texts and create padded tensor composed of tokens indexes
        sequences = tokenizer.texts_to_sequences(headlines)
        data_train = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
        # Return input-target pairs
        yield [data_train, data_train]

Load pretrained GloVe vectors described in [GloVe: Global Vectors for Word Representation](https://nlp.stanford.edu/pubs/glove.pdf) paper

In [17]:
root_folder='C:/Users/User/PycharmProjects/yandex/NLP class/Y-Data-NLP/Assignment 2/glove.6B/'

In [18]:
embeddings_index = {}
with open(root_folder+'glove.6B.50d.txt', encoding='utf-8') as f:
    # read rows from file line by line
    for line in f:
        #print(line)
        values = line.split()
        word = values[0] # Get word
        #print(word)
        coefs = np.asarray(values[1:], dtype='float32') # Get elements of word's vector
        #print(coefs)
        embeddings_index[word] = coefs
        

print('Found %s word vectors.' % len(embeddings_index))

Found 400000 word vectors.


Create matrix from embedding vectors. Any row of the matrix is a word's vector. We get words from the dictionary word_to_id defined earlier.

In [22]:
# #NB_WORDS=len(embeddings_index)
# glove_embedding_matrix = np.zeros((NB_WORDS, EMBEDDING_DIM)) # Create empty matrix (max number of tokens, dimension of the embedding vectors)
# for word, i in word_to_id.items():
#     if i < NB_WORDS:
#         if embedding_vector is not None:
#             # words not found in embedding index will be the word embedding of 'unk'.
#             glove_embedding_matrix[i] = embedding_vector
#         else:
#             glove_embedding_matrix[i] = embeddings_index.get('unk')

# # compute number of words which there aren't in the GloVe vectors
# print('Null word embeddings: %d' % np.sum(np.sum(glove_embedding_matrix, axis=1) == 0))

In [21]:
glove_embedding_matrix = np.zeros((NB_WORDS, EMBEDDING_DIM))
for word, i in word_to_id.items():
    if i < NB_WORDS:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            # words not found in embedding index will be the word embedding of 'unk'.
            glove_embedding_matrix[i] = embedding_vector
        else:
            glove_embedding_matrix[i] = embeddings_index.get('unk')
print('Null word embeddings: %d' % np.sum(np.sum(glove_embedding_matrix, axis=1) == 0))

Null word embeddings: 1


Define parameters of the net

In [24]:
batch_size = 64
max_len = MAX_SEQUENCE_LENGTH
emb_dim = EMBEDDING_DIM
latent_dim = 32 # dimensionality of the hidden state in encoder and decoder RNN's
intermediate_dim = 96 # dimensionality of variational space into which we map encoder's hidden state
epsilon_std = 1.0 # standard deviation of gaussian noise
act = ELU() # activation function of projection layer

In [25]:
glove_embedding_matrix.shape

(21590, 50)

Encoder of the variational autoencoder. It based on bidirectional LSTM

We use following layers: [Input](https://www.tensorflow.org/api_docs/python/tf/keras/Input), [Embedding](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding), [Bidirectional](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Bidirectional), [LSTM](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM), [Dropout](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dropout), [Dense](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense), [ELU](https://www.tensorflow.org/api_docs/python/tf/keras/layers/ELU)

In [26]:
 # Input layer fo the net. 
# Write an embedding layer for the input sequences of indexes. 
# Use pretrained word embeddings as a embedding layer
# weights and don't update these weights
from tensorflow.keras.layers import Embedding,Bidirectional,LSTM,Dense,Input
x = Input(batch_shape=(None, max_len))
x_embed = Embedding(NB_WORDS, emb_dim, weights=[glove_embedding_matrix],
                            input_length=max_len, trainable=False)(x)
# Bidirectional LSTM encoder

h = Bidirectional(LSTM(intermediate_dim, return_sequences=False, recurrent_dropout=0.2), merge_mode='concat')(x_embed)
h = Dropout(0.2)(h) # Dropout for the BiLSTM layer to avoid overfitting 

# Fully-connected layer to map encoder hidden state into variational space
h = Dense(1, activation="linear")(h)
h = act(h)

h = Dropout(0.2)(h) # Dropout for the fully-connected layer to avoid overfitting 
z_mean = Dense(latent_dim)(h) # Fully-connected layer to map variational space into means space 
z_log_var = Dense(latent_dim)(h) # Fully-connected layer to map variational space into standard deviations space 

The mechanism for sampling hidden vectors from variational space

To apply it to our model we use [Lambda](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Lambda) layer

In [27]:
def sampling(args):
    # Vectors from means space and standard deviations space respectively
    z_mean, z_log_var = args
    # Sample random vectors from normal distribution 
    # with mean=0 and std=epsilon_std
    sample=np.random.normal(loc=0.0, scale=epsilon_std, size=(batch_size, latent_dim))
    
    
    # Get new hidden state for decoder using vectors from means,
    # standard deviations and normal random spaces
    return z_mean + K.exp(z_log_var / 2) * sample

# Get hidden states for the decoder
z = Lambda(sampling, output_shape=(latent_dim,))([z_mean, z_log_var])

Define decoder of the autoencoder

For this we use following layers: [RepeatVector](https://www.tensorflow.org/api_docs/python/tf/keras/layers/RepeatVector), [LSTM](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM), [TimeDistributed](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TimeDistributed), [Dense](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense)

In [29]:
# # Repeat the hidden state vector to form input sequence for decoder
# repeated_context = RepeatVector(max_len)
# # Decoder LSTM
# decoder_h = LSTM(intermediate_dim, return_sequences=True, recurrent_dropout=0.2)
# # Layer for mapping from hidden satates space to the space of dimension equal to size of vocabulary
# decoder_mean = TimeDistributed(Dense(NB_WORDS, activation='linear'))
# # Generated sequence
# h_decoded = decoder_h(repeated_context(z))
# # Decode every time step vector of the decoded sequence into space of dimension equal to size of vocabulary
# x_decoded_mean = decoder_mean(h_decoded)

In [33]:

# Repeat the hidden state vector to form input sequence for decoder
repeated_context = RepeatVector(max_len)
# Decoder LSTM
decoder_h = LSTM(intermediate_dim, return_sequences=True, recurrent_dropout=0.2)
# Layer for mapping from hidden satates space to the space of dimension equal to size of vocabulary
decoder_mean = tf.keras.layers.TimeDistributed(Dense(NB_WORDS, activation='linear'))
 
# Generated sequence
h_decoded = decoder_h(repeated_context(z))
# Decode every time step vector of the decoded sequence into space of dimension equal to size of vocabulary
x_decoded_mean = decoder_mean(h_decoded)

Define layer for loss computing

In [34]:
def zero_loss(y_true, y_pred):
    # Return tensor filled with ones with shape equal generated sequence shape
    return K.zeros_like(y_pred)

In [35]:
class CustomVariationalLayer(Layer):
    def __init__(self, **kwargs):
        self.is_placeholder = True
        super(CustomVariationalLayer, self).__init__(**kwargs)
        # Create tensor (batch_size, max_sequence_len) filled with ones to consider all elements of generated sequence 
        self.target_weights = tf.constant(np.ones((batch_size, max_len)),tf.float32)

    def vae_loss(self, x, x_decoded_mean):
        # Get tensor with similar shape as x
        labels = tf.cast(x, tf.int32)
        # Compute sequence reconstruction loss
        xent_loss = K.sum(tfa.seq2seq.sequence_loss(x_decoded_mean, labels,
                                                    weights=self.target_weights,
                                                    average_across_timesteps=False,
                                                    average_across_batch=False), axis=-1)
        # Compute KL-divergence as Variational loss 
        kl_loss = -0.5 * K.sum(1 + z_log_var - K.square(z_mean) - K.exp(z_log_var), axis=-1)
        # Composite loss (reconstruction loss + Variational loss)
        return K.mean(xent_loss + kl_loss)

    def call(self, inputs):
        x = inputs[0] # input sequence
        x_decoded_mean = inputs[1] # reconstructed sequence
        print(x.shape, x_decoded_mean.shape)
        loss = self.vae_loss(x, x_decoded_mean) # Compute loss of the model
        self.add_loss(loss, inputs=inputs)
        # we don't use this output, but it has to have the correct shape
        return K.ones_like(x)

Assemble the model

To define model we use [Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model) from tensorflow.

To train model employ [Adam](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adam) optimization algorithm

In [37]:
# Create custom layer for loss computing
loss_layer = CustomVariationalLayer()([x, x_decoded_mean])

vae = Model(x, [loss_layer])
# Use Adam optimizer with learning rate = 0.01
opt = Adam(learning_rate=0.01)
vae.compile(optimizer='adam', loss=[zero_loss])
# Show model structure
vae.summary()

(None, 15) (64, 15, 21590)
Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
Total params: 0
Trainable params: 0
Non-trainable params: 0
_________________________________________________________________


Checkpoint function to save states of our model during training

In [38]:
def create_model_checkpoint(model_name):
    filepath = model_name + ".h5"
    directory = os.path.dirname(filepath)
    try:
        # Check if directory exists
        os.stat(directory)
    except:
        # If directory doesn't exist, create the directory
        os.mkdir(directory)
    # Save model states
    checkpointer = ModelCheckpoint(filepath=filepath, verbose=1, save_best_only=False)
    return checkpointer

# Create model checkpointer
checkpointer = create_model_checkpoint(r'C:/Users/User/PycharmProjects/yandex/NLP class/Y-Data-NLP/Assignment 2/vae_seq2seq')

Train model, test model after each epoch and save model's state

In [45]:
nb_epoch = 8 # number of epochs for model training
n_steps = 28000 / batch_size # Number of steps per epoch
for counter in range(nb_epoch):
    print('-------epoch: ', counter, '-------')
    # Train model. Test and save model after every epoch
    vae.fit_generator(sent_generator('Dataset.csv', batch_size),
                      steps_per_epoch=n_steps, epochs=1, callbacks=[checkpointer],
                      validation_data=(data_1_val, data_1_val))
vae.save(r'vae_lstmFull32dim96hid.h5')

-------epoch:  0 -------


ValueError: in user code:

    c:\users\user\appdata\local\programs\python\python37\lib\site-packages\keras\engine\training.py:830 train_function  *
        return step_function(self, iterator)
    c:\users\user\appdata\local\programs\python\python37\lib\site-packages\keras\engine\training.py:813 run_step  *
        outputs = model.train_step(data)
    c:\users\user\appdata\local\programs\python\python37\lib\site-packages\keras\engine\training.py:770 train_step  *
        y_pred = self(x, training=True)
    c:\users\user\appdata\local\programs\python\python37\lib\site-packages\keras\engine\base_layer.py:989 __call__  *
        input_spec.assert_input_compatibility(self.input_spec, inputs, self.name)
    c:\users\user\appdata\local\programs\python\python37\lib\site-packages\keras\engine\input_spec.py:197 assert_input_compatibility  *
        raise ValueError('Layer ' + layer_name + ' expects ' +

    ValueError: Layer model_1 expects 1 input(s), but it received 2 input tensors. Inputs received: [<tf.Tensor 'IteratorGetNext:0' shape=(None, None) dtype=int32>, <tf.Tensor 'IteratorGetNext:1' shape=(None, None) dtype=int32>]


Assemble encoder and decoder for sentence generation sampled from variational space

In [None]:
# Make separate encoder to encode input sentence
encoder = Model(x, z_mean)
# Input layer for decoder to decode vectors sampled from variational space 
decoder_input = Input(shape=(latent_dim,))
# Apply LSTM to decode hidden vector into sequence
_h_decoded = decoder_h(repeated_context(decoder_input))
# Decode every time step vector of the decoded sequence into space of dimension equal to size of vocabulary
_x_decoded_mean = decoder_mean(_h_decoded)
# Apply softmax to get most probable token
_x_decoded_mean = Activation('softmax')(_x_decoded_mean)
# Make decoderfor sempled sentences
generator = Model(decoder_input, _x_decoded_mean)

Generate sentence

In [23]:
# Dictionary maps indexes to words
index2word = {v: k for k, v in word_to_id.items()}
# Fit sentences from validation set into encoder
sent_encoded = encoder.predict(data_1_val, batch_size=16)
# Decode encoded sentences
x_test_reconstructed = generator.predict(sent_encoded)

sent_idx = 400
# Get words indexes with highest probability for the 500th sentence from validation set
reconstructed_indexes = np.apply_along_axis(np.argmax, 1, x_test_reconstructed[sent_idx])
# Map indexes of generated sentence to words
word_list = list(np.vectorize(index2word.get)(reconstructed_indexes))
word_list = ' '.join([w for w in word_list if w])
print(word_list)
# Map indexes of input sentence to words
original_sent = list(np.vectorize(index2word.get)(data_1_val[sent_idx]))
original_sent = ' '.join([w for w in original_sent if w])
print(original_sent)

NameError: name 'encoder' is not defined