##  Text generation by Markov chain

In [3]:
import csv

A Markov chain is a probabilistic model in which the probability of each event depends only on the state attained in the previous event. [markovify](https://github.com/jsvine/markovify) is a library for generating text with Markov chains.
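To make the definition concrete, here is a minimal sketch of a first-order, word-level Markov chain in plain Python. It is only an illustration of the idea and not how markovify is implemented; the names `build_chain`, `generate`, `BEGIN` and `END` and the three toy headlines are made up for this example.

```python
import random
from collections import defaultdict

BEGIN, END = "<s>", "</s>"  # hypothetical boundary markers for this sketch


def build_chain(lines):
    """Map each word to the list of words observed right after it."""
    chain = defaultdict(list)
    for line in lines:
        words = [BEGIN] + line.split() + [END]
        for current, following in zip(words, words[1:]):
            chain[current].append(following)
    return chain


def generate(chain, rng, max_words=30):
    """Walk the chain from BEGIN; the next word depends only on the current one."""
    word, output = BEGIN, []
    while len(output) < max_words:
        word = rng.choice(chain[word])
        if word == END:
            break
        output.append(word)
    return " ".join(output)


headlines = [
    "area man wins local award",
    "local man loses award",
    "area woman wins award",
]
chain = build_chain(headlines)
print(generate(chain, random.Random(0)))
```

Because followers are stored with repetition, `rng.choice` picks each next word in proportion to how often it followed the current word in the training lines.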

Use "pip install markovify" to install markovify.

In [1]:
import markovify

Download [Dataset.csv](https://drive.google.com/file/d/1raxIkJZ4lMTvgTB8eYjxu4Ux8aNL-x0s/view?usp=sharing), which contains sarcastic and serious news headlines. The csv-file consists of two columns. The "headline" column contains the headline text. The "is_sarcastic" column contains 0 if the headline is serious and 1 if it is sarcastic.

Read dataset 

In [2]:
import pandas as pd

In [4]:
df = pd.read_json("/Users/Adam/workspace/yandex/Y-Data/2nd Semester/NLP/Assignment 2/Sarcasm_Headlines_Dataset.json", lines=True)
df.drop(columns="article_link",inplace=True)
df.head(3)
df.to_csv("Dataset.csv")

In [5]:
df.head()

Unnamed: 0,headline,is_sarcastic
0,former versace store clerk sues over secret 'b...,0
1,the 'roseanne' revival catches up to our thorn...,0
2,mom starting to fear son's web series closest ...,1
3,"boehner just wants wife to listen, not come up...",1
4,j.k. rowling wishes snape happy birthday in th...,0


In [6]:
texts_serious = [] # list of serious headline texts
texts_sarcastic = [] # list of sarcastic headline texts
with open('Dataset.csv', encoding='utf-8') as f:
    # Read the csv-file with DictReader from the csv library
    reader = csv.DictReader(f) 
    for line in reader:
        # read the text of the headline
        headline = line['headline'].strip()
        # read the sarcasm label of the headline (0 = serious, 1 = sarcastic)
        is_sarcastic = int(line['is_sarcastic'].strip())
        if is_sarcastic:
            texts_sarcastic.append(headline)
        else:
            texts_serious.append(headline)
print('Found {} sarcastic texts in Dataset.csv'.format(len(texts_sarcastic)))
print('Found {} serious texts in Dataset.csv'.format(len(texts_serious)))

Found 11724 sarcastic texts in Dataset.csv
Found 14985 serious texts in Dataset.csv


Merge headlines into one text

In [7]:
text_0 = '\n'.join(texts_serious)
text_1 = '\n'.join(texts_sarcastic)

Create Markov chain model for serious headlines generation

In [8]:
serious_model = markovify.NewlineText(text_0)

Generate 20 serious headlines

In [9]:
for i in range(20):
    print(serious_model.make_sentence())

butch lesbians open up about her son's tv death
apparently andrew wk is a curvy style icon
recovery nonprofits stem the tide on inequality and climate change
pamela wright's son was shot dead after southwest airlines flight diverted after ebola virus and sen. barbara boxer can teach your kids will never connect with others
busy philipps consoles michelle williams on 10th anniversary of dr. gunn's death, thank an abortion
swiping right on a crowd of looters in ferguson let high-profile journalists go while charging regular folks with crimes
let us down this year than in 2015
why are bi men less likely to hurt its business
syrian rebels to exit aleppo as syrian rebels claim to break your heart is hurting immigrant victims of louisiana floods
keep the real world
what i wish i'd learned about business from making art
why i want my wife to know on december 21
matthew mcconaughey once faked an australian accent for an end to down syndrome is the internet still thinks efforts to gut post-cris

Create model and generate 20 sarcastic headlines

In [10]:
sarcastic_model = markovify.NewlineText(text_1)
for i in range(20):
    print(sarcastic_model.make_sentence())

child slavery gives area activist something to make sure pressing exposed penis against intern doesn't constitute sexual harassment ignored to save relationship
area woman to celebrate rosh hashasha or something on lawn
4 billion years to try some more shit
al franken tearfully announces intention to step down as leader of droogs amidst allegations of biggest sexual harassment apologies
gene wilder's career in self-storage
everyone in family sure who else is doing it
local man hates being put on phone told her parents hate her
court rules meryl streep unable to remove mrs. butterworth from table until wife arrives
report suggests stalin was just elaborate plan to ramble on about chick he banged last night
iran ready to go until 2016 election
wildlife cleaning volunteer stuck with the girl from headlines
monsanto develops hardier strain of soy flu traced to trouble-making genie
rush limbaugh tucks shirt back on dog
the thinkable happens to local hospital for royal blood transfusion
obam

## Text generation by Variational Autoencoder

This part of the tutorial is based on the article [Text generation with a Variational Autoencoder](https://nicgian.github.io/text-generation-vae/). For sentence generation we use a Variational Autoencoder (VAE), a neural network model that extends the seq2seq model. The VAE was originally described in the paper [Auto-Encoding Variational Bayes](https://arxiv.org/pdf/1312.6114.pdf). The idea behind the Variational Autoencoder is that we impose a predefined distribution (e.g., a normal distribution) on the latent state formed by the encoder. On the one hand, this restriction allows us to sample random vectors from the normal distribution and generate arbitrary sentences. On the other hand, this restriction forms a very dense, well-differentiated space without holes.
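The restriction on the latent state can be written as a training objective. The formula below is the standard VAE loss in the notation of the Auto-Encoding Variational Bayes paper, given here as a reading aid. The symbols are assumptions of this note: $x$ is a sentence, $z$ its latent vector, $q_\phi$ the encoder, $p_\theta$ the decoder, and $\mu_j$, $\log\sigma_j^2$ the encoder outputs for latent dimension $j$ of $d$.

```latex
\mathcal{L}(\theta, \phi; x)
  = -\,\mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
  + D_{\mathrm{KL}}\big(q_\phi(z \mid x)\,\|\,\mathcal{N}(0, I)\big),
\qquad
D_{\mathrm{KL}} = -\tfrac{1}{2} \sum_{j=1}^{d} \big(1 + \log \sigma_j^2 - \mu_j^2 - \sigma_j^2\big)
```

The first term rewards faithful reconstruction of the sentence; the second pulls the encoder's distribution toward the standard normal prior.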

In [11]:
import tensorflow as tf
import tensorflow_addons as tfa


from keras.layers import Bidirectional, Dense, Embedding, \
Input, Lambda, LSTM, RepeatVector, TimeDistributed, Layer, Activation, Dropout
from keras.preprocessing.sequence import pad_sequences
from keras.layers.advanced_activations import ELU
from keras.preprocessing.text import Tokenizer
from keras.callbacks import ModelCheckpoint
from keras.optimizers import Adam
from keras import backend as K
from keras.models import Model

from scipy import spatial
import pandas as pd
import numpy as np
import codecs
import random
import csv
import os

Using TensorFlow backend.


Download [GloVe](http://nlp.stanford.edu/data/glove.6B.zip) pretrained word embedding vectors

In [12]:
# !wget http://nlp.stanford.edu/data/glove.6B.zip

In [13]:
# !unzip glove.6B.zip

In [14]:
MAX_SEQUENCE_LENGTH = 15 # Max text length in tokens
MAX_NB_WORDS = 20000 # Max words in dictionary
EMBEDDING_DIM = 50 # Dimensionality of GloVe vectors 

Create sentence tokenizer and two dictionaries: word_to_id and id_to_word

For tokenisation we use [Tokenizer](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/text/Tokenizer) from keras library.

In [15]:
texts = texts_serious + texts_sarcastic
tokenizer = Tokenizer(MAX_NB_WORDS)
tokenizer.fit_on_texts(texts)
word_to_id = tokenizer.word_index 
id_to_word = {v: k for k, v in word_to_id.items()}
print(f'Corpus has {len(word_to_id)} unique tokens')

Corpus has 29656 unique tokens
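The word index holds every word in the corpus, ranked by frequency with ids starting at 1 (0 is kept for padding). That is why it can be larger than `MAX_NB_WORDS`: to our understanding the limit is applied only when texts are converted to sequences. The plain-Python sketch below imitates that behaviour in a simplified form (whitespace splitting, ties broken alphabetically); `build_word_index` and `to_sequence` are hypothetical helpers, not Keras functions.

```python
from collections import Counter


def build_word_index(texts):
    """Assign ids 1, 2, 3, ... to words in order of decreasing frequency."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


def to_sequence(text, word_index, num_words=None):
    """Convert text to ids, dropping unknown words and ids >= num_words."""
    ids = (word_index.get(word) for word in text.lower().split())
    return [i for i in ids if i is not None and (num_words is None or i < num_words)]


texts = ["area man wins award", "area woman wins", "area man loses"]
word_index = build_word_index(texts)
print(word_index)
print(to_sequence("area man wins gold award", word_index, num_words=4))
```

In the second call "gold" is dropped because it is unknown, and "award" is dropped because its id is not below `num_words`.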


Tokenize the sarcastic texts and create a tensor of token indexes. If a sentence is shorter than MAX_SEQUENCE_LENGTH, we pad it. If a sentence is longer than MAX_SEQUENCE_LENGTH, we cut it.

In [107]:
12*120

1440

In [108]:
sequences = tokenizer.texts_to_sequences(texts_sarcastic)
data_1 = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
print('Data tensor shape:', data_1.shape)
NB_WORDS = (min(tokenizer.num_words, len(word_to_id)) + 1 ) #+1 for zero padding
data_1_val = data_1[-1440:] # select the last 1440 sentences (120 batches of 12) as validation data

Data tensor shape: (11724, 15)
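
With its default arguments, `pad_sequences` pads and truncates at the front of a sequence (`padding='pre'`, `truncating='pre'`), so a long headline loses its first tokens, not its last ones. A plain-Python sketch of that default behaviour, with `pad_pre` as a hypothetical helper name:

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences with `value`; keep the last `maxlen` ids of long ones."""
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]                             # truncate from the front
        padded.append([value] * (maxlen - len(kept)) + kept)   # pad at the front
    return padded


print(pad_pre([[5, 8], [1, 2, 3, 4, 5, 6]], maxlen=4))
# prints [[0, 0, 5, 8], [3, 4, 5, 6]]
```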


Define batch generator to train a neural network

For padding sentences to max length we use [pad_sequences](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence/pad_sequences)

In [109]:
def sent_generator(TRAIN_DATA_FILE, batchsize):
    # Create an iterator that reads the dataset file in chunks of batchsize rows
    reader = pd.read_csv(TRAIN_DATA_FILE, chunksize=batchsize, iterator=True)
    for df in reader:
        # Read a column that contains headlines
        headlines = df['headline'].str.strip().tolist()
        # Tokenize texts and create padded tensor composed of tokens indexes
        sequences = tokenizer.texts_to_sequences(headlines)
        data_train = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)
        # Return input-target pairs
        yield [data_train, data_train]
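
The chunked reader stops when the file ends, and its last chunk can be shorter than the batch size, which matters here because the model's sampling and loss layers use a fixed `batch_size`. A sketch of a generator that restarts at the top of the file and skips the short final batch is shown below, using only the `csv` module. `batches_forever` is a hypothetical name, the demo file is made-up data, and the batches are raw strings that would still need to be tokenized and padded.

```python
import csv
import itertools
import os
import tempfile


def batches_forever(path, batch_size, column="headline"):
    """Yield lists of `batch_size` headlines, rereading the file when it ends."""
    while True:
        produced = False
        with open(path, encoding="utf-8", newline="") as f:
            reader = csv.DictReader(f)
            while True:
                batch = [row[column].strip()
                         for row in itertools.islice(reader, batch_size)]
                if len(batch) < batch_size:
                    break  # drop the incomplete tail so every batch has the same size
                produced = True
                yield batch
        if not produced:
            raise ValueError("file holds fewer rows than one batch")


# Tiny demo file with five headlines (made-up data).
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False,
                                 encoding="utf-8", newline="") as tmp:
    writer = csv.writer(tmp)
    writer.writerow(["headline", "is_sarcastic"])
    writer.writerows([[f"headline {i}", i % 2] for i in range(5)])

gen = batches_forever(tmp.name, batch_size=2)
first_five = [next(gen) for _ in range(5)]
print(first_five)
gen.close()          # closes the file that the generator still holds open
os.remove(tmp.name)
```

With five rows and a batch size of two, the fifth row is never yielded; the generator wraps around to the first row instead.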

Load pretrained GloVe vectors described in [GloVe: Global Vectors for Word Representation](https://nlp.stanford.edu/pubs/glove.pdf) paper

In [110]:
embeddings_index = {}
with open('glove.6B.50d.txt', encoding='utf-8') as f:
    # read rows from file line by line
    for line in f:
        values = line.split()
        word = values[0] # Get word
        coefs = np.asarray(values[1:], dtype='float32') # Get elements of word's vector
        embeddings_index[word] = coefs

print('Found %s word vectors.' % len(embeddings_index))

Found 400000 word vectors.


Create a matrix from the embedding vectors. Each row of the matrix is a word's vector. We get the words from the dictionary word_to_id defined earlier.

In [111]:
glove_embedding_matrix = np.zeros((NB_WORDS, EMBEDDING_DIM)) # Create empty matrix (max number of tokens, dimension of the embedding vectors)
for word, i in word_to_id.items():
    if i < NB_WORDS:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            glove_embedding_matrix[i] = embedding_vector
        else:
            # words not found in the embedding index get the word embedding of 'unk'
            glove_embedding_matrix[i] = embeddings_index.get('unk')
# count the rows that are still all zeros (row 0 is reserved for padding)
print('Null word embeddings: %d' % np.sum(np.sum(glove_embedding_matrix, axis=1) == 0))

Null word embeddings: 1


Define parameters of the net

In [112]:
batch_size = 12
max_len = MAX_SEQUENCE_LENGTH
emb_dim = EMBEDDING_DIM
latent_dim = 32 # dimensionality of the variational (latent) space into which we map the encoder's hidden state
intermediate_dim = 96 # dimensionality of the hidden state in the encoder and decoder RNNs
epsilon_std = 1.0 # standard deviation of gaussian noise
act = ELU() # activation function of projection layer

Encoder of the variational autoencoder. It is based on a bidirectional LSTM.

We use the following layers: [Input](https://www.tensorflow.org/api_docs/python/tf/keras/Input), [Embedding](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding), [Bidirectional](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Bidirectional), [LSTM](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM), [Dropout](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dropout), [Dense](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense), [ELU](https://www.tensorflow.org/api_docs/python/tf/keras/layers/ELU)

In [113]:
x = Input(batch_shape=(None, max_len)) # Input layer of the net.
# Embedding layer for the input sequences of indexes.
# Use the pretrained word embeddings as the embedding layer weights and don't update these weights

x_embed = Embedding(NB_WORDS, emb_dim, weights=[glove_embedding_matrix],
                            input_length=max_len, trainable=False)(x)

# Bidirectional LSTM encoder

h = Bidirectional(LSTM(intermediate_dim, return_sequences=False, recurrent_dropout=0.2), merge_mode='concat')(x_embed)

h = Dropout(0.2)(h) # Dropout for the BiLSTM layer to avoid overfitting 

# Fully-connected layer to map encoder hidden state into variational space

h = Dense(intermediate_dim, activation='linear')(h)
h = act(h)

h = Dropout(0.2)(h) # Dropout for the fully-connected layer to avoid overfitting 
z_mean = Dense(latent_dim)(h) # Fully-connected layer that outputs the means of the latent distribution
z_log_var = Dense(latent_dim)(h) # Fully-connected layer that outputs the log-variances of the latent distribution

The mechanism for sampling hidden vectors from variational space

To apply it to our model we use [Lambda](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Lambda) layer

In [114]:
def sampling(args):
    # Mean and log-variance vectors produced by the encoder
    z_mean, z_log_var = args
    # Sample noise from a normal distribution with mean=0 and std=epsilon_std.
    # The shape is fixed to (batch_size, latent_dim), so every batch must hold exactly batch_size rows
    epsilon = K.random_normal(shape=(batch_size, latent_dim), mean=0.,
                              stddev=epsilon_std)

    # Shift and scale the noise; exp(z_log_var / 2) is the standard deviation
    return z_mean + K.exp(z_log_var / 2) * epsilon

# Get hidden states for the decoder
z = Lambda(sampling, output_shape=(latent_dim,))([z_mean, z_log_var])
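
The sampling step is the reparameterization trick: the randomness is isolated in `epsilon`, so `z` remains a differentiable function of `z_mean` and `z_log_var`. Since the network outputs a log-variance, `exp(z_log_var / 2)` is the standard deviation. A plain-Python sketch of the same arithmetic for a single latent vector follows; `sample_z` is a hypothetical helper and `epsilon_std` is assumed to be 1.

```python
import math
import random


def sample_z(z_mean, z_log_var, rng):
    """Return z_mean + sigma * epsilon with epsilon ~ N(0, 1), elementwise."""
    return [
        mean + math.exp(log_var / 2) * rng.gauss(0.0, 1.0)
        for mean, log_var in zip(z_mean, z_log_var)
    ]


rng = random.Random(42)
# log_var = log(0.25) corresponds to sigma = 0.5
samples = [sample_z([1.0], [math.log(0.25)], rng)[0] for _ in range(20000)]
mean = sum(samples) / len(samples)
std = math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))
print(round(mean, 2), round(std, 2))
```

The empirical mean and standard deviation of the samples should land close to 1.0 and 0.5, the values encoded in the inputs.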

Define decoder of the autoencoder

For this we use the following layers: [RepeatVector](https://www.tensorflow.org/api_docs/python/tf/keras/layers/RepeatVector), [LSTM](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM), [TimeDistributed](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TimeDistributed), [Dense](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense)

In [115]:
# Repeat the hidden state vector to form input sequence for decoder
repeated_context = RepeatVector(max_len)
# Decoder LSTM
decoder_h = LSTM(intermediate_dim, return_sequences=True, recurrent_dropout=0.2)
# Layer for mapping from the hidden states space to a space whose dimension equals the vocabulary size
decoder_mean = TimeDistributed(Dense(NB_WORDS, activation='linear'))
# Generated sequence
h_decoded = decoder_h(repeated_context(z))
# Decode every time step vector of the decoded sequence into space of dimension equal to size of vocabulary
x_decoded_mean = decoder_mean(h_decoded)

Define layer for loss computing

In [116]:
def zero_loss(y_true, y_pred):
    # Return a tensor filled with zeros with the same shape as the generated sequence
    return K.zeros_like(y_pred)

In [117]:
class CustomVariationalLayer(Layer):
    def __init__(self, **kwargs):
        self.is_placeholder = True
        super(CustomVariationalLayer, self).__init__(**kwargs)
        # Create tensor (batch_size, max_sequence_len) filled with ones to consider all elements of generated sequence 
        self.target_weights = tf.constant(np.ones((batch_size, max_len)), tf.float32)

    def vae_loss(self, x, x_decoded_mean):
        # Cast the input token indexes to int32 to use them as labels
        labels = tf.cast(x, tf.int32)
        # Compute sequence reconstruction loss
        xent_loss = K.sum(tfa.seq2seq.sequence_loss(x_decoded_mean, labels,
                                                    weights=self.target_weights,
                                                    average_across_timesteps=False,
                                                    average_across_batch=False), axis=-1)
        # Compute KL-divergence as Variational loss 
        kl_loss = -0.5 * K.sum(1 + z_log_var - K.square(z_mean) - K.exp(z_log_var), axis=-1)
        # Composite loss (reconstruction loss + Variational loss)
        return K.mean(xent_loss + kl_loss)

    def call(self, inputs):
        x = inputs[0] # input sequence
        x_decoded_mean = inputs[1] # reconstructed sequence
        print(x.shape, x_decoded_mean.shape)
        loss = self.vae_loss(x, x_decoded_mean) # Compute loss of the model
        self.add_loss(loss, inputs=inputs)
        # we don't use this output, but it has to have the correct shape
        return K.ones_like(x)
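
The `kl_loss` line is the closed-form KL divergence between the encoder's diagonal Gaussian and a standard normal distribution. A small plain-Python check of that formula for a single latent vector is sketched below; `kl_to_standard_normal` is a hypothetical helper, not part of the model.

```python
import math


def kl_to_standard_normal(z_mean, z_log_var):
    """KL( N(mean, exp(log_var)) || N(0, 1) ), summed over latent dimensions."""
    return -0.5 * sum(
        1 + log_var - mean ** 2 - math.exp(log_var)
        for mean, log_var in zip(z_mean, z_log_var)
    )


# Unit variance, means shifted by +1 and -1: each dimension contributes 0.5
print(kl_to_standard_normal([1.0, -1.0], [0.0, 0.0]))
```

The divergence is zero only when the mean is 0 and the log-variance is 0, which is what pushes the encoder's outputs toward the prior.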

Assemble the model

To define the model we use [Model](https://www.tensorflow.org/api_docs/python/tf/keras/Model) from tensorflow.

To train the model we employ the [Adam](https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adam) optimization algorithm.

In [118]:
# Create custom layer for loss computing
loss_layer = CustomVariationalLayer()([x, x_decoded_mean])

vae = Model(x, [loss_layer])
# Use Adam optimizer with learning rate = 0.01
opt = Adam(lr=0.01)
vae.compile(optimizer=opt, loss=[zero_loss])
# Show model structure
vae.summary()

(None, 15) (12, 15, 20001)
Model: "model_3"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_3 (InputLayer)            (None, 15)           0                                            
__________________________________________________________________________________________________
embedding_3 (Embedding)         (None, 15, 50)       1000050     input_3[0][0]                    
__________________________________________________________________________________________________
bidirectional_3 (Bidirectional) (None, 192)          112896      embedding_3[0][0]                
__________________________________________________________________________________________________
dropout_5 (Dropout)             (None, 192)          0           bidirectional_3[0][0]            
_________________________________________________________________

Checkpoint function to save states of our model during training

In [119]:
def create_model_checkpoint(model_name):
    filepath = model_name + ".h5"
    directory = os.path.dirname(filepath)
    # Create the checkpoint directory if it doesn't exist yet
    if directory:
        os.makedirs(directory, exist_ok=True)
    # Save model states
    checkpointer = ModelCheckpoint(filepath=filepath, verbose=1, save_best_only=False)
    return checkpointer

# Create model checkpointer
checkpointer = create_model_checkpoint(f'{os.path.join(os.getcwd(),"checkpoints", "vae_seq2seq")}')

Train the model, test it after each epoch and save the model's state

In [126]:
nb_epoch = 25 # number of epochs for model training
n_steps = len(data_1) // batch_size # Number of steps per epoch (must be an integer)
print(n_steps)

977


In [123]:
for counter in range(nb_epoch):
    print('-------epoch: ', counter, '-------')
    # Train model. Test and save model after every epoch
    vae.fit_generator(sent_generator('Dataset.csv', batch_size),
                      steps_per_epoch=n_steps, epochs=1, callbacks=[checkpointer],
                      validation_data=(data_1_val, data_1_val))
vae.save(r'vae_lstmFull32dim96hid.h5')

-------epoch:  0 -------
Epoch 1/1

Epoch 00001: saving model to /Users/Adam/workspace/yandex/Y-Data/2nd Semester/NLP/Assignment 2/checkpoints/vae_seq2seq.h5
-------epoch:  1 -------
Epoch 1/1

Epoch 00001: saving model to /Users/Adam/workspace/yandex/Y-Data/2nd Semester/NLP/Assignment 2/checkpoints/vae_seq2seq.h5
-------epoch:  2 -------
Epoch 1/1

Epoch 00001: saving model to /Users/Adam/workspace/yandex/Y-Data/2nd Semester/NLP/Assignment 2/checkpoints/vae_seq2seq.h5
-------epoch:  3 -------
Epoch 1/1

Epoch 00001: saving model to /Users/Adam/workspace/yandex/Y-Data/2nd Semester/NLP/Assignment 2/checkpoints/vae_seq2seq.h5
-------epoch:  4 -------
Epoch 1/1

Epoch 00001: saving model to /Users/Adam/workspace/yandex/Y-Data/2nd Semester/NLP/Assignment 2/checkpoints/vae_seq2seq.h5
-------epoch:  5 -------
Epoch 1/1

Epoch 00001: saving model to /Users/Adam/workspace/yandex/Y-Data/2nd Semester/NLP/Assignment 2/checkpoints/vae_seq2seq.h5
-------epoch:  6 -------
Epoch 1/1

Epoch 00001: sav

Assemble a separate encoder and decoder for generating sentences from vectors sampled from the variational space

In [124]:
# Make separate encoder to encode input sentence
encoder = Model(x, z_mean)
# Input layer for decoder to decode vectors sampled from variational space 
decoder_input = Input(shape=(latent_dim,))
# Apply LSTM to decode hidden vector into sequence
_h_decoded = decoder_h(repeated_context(decoder_input))
# Decode every time step vector of the decoded sequence into space of dimension equal to size of vocabulary
_x_decoded_mean = decoder_mean(_h_decoded)
# Apply softmax to turn the outputs into a probability distribution over tokens
_x_decoded_mean = Activation('softmax')(_x_decoded_mean)
# Make decoder for sampled sentences
generator = Model(decoder_input, _x_decoded_mean)

Generate sentence

In [125]:
# Dictionary maps indexes to words
index2word = {v: k for k, v in word_to_id.items()}
# Feed sentences from the validation set into the encoder
sent_encoded = encoder.predict(data_1_val, batch_size=16)
# Decode encoded sentences
x_test_reconstructed = generator.predict(sent_encoded)

sent_idx = 400
# Get the word indexes with the highest probability for the sentence at index 400 of the validation set
reconstructed_indexes = np.apply_along_axis(np.argmax, 1, x_test_reconstructed[sent_idx])
# Map indexes of generated sentence to words
word_list = list(np.vectorize(index2word.get)(reconstructed_indexes))
word_list = ' '.join([w for w in word_list if w])
print(word_list)
# Map indexes of input sentence to words
original_sent = list(np.vectorize(index2word.get)(data_1_val[sent_idx]))
original_sent = ' '.join([w for w in original_sent if w])
print(original_sent)

charlize ethics blasts 'peasant 'peasant murky murky 'russia 'russia 'russia 'russia
old friend avoided in hometown convenience store
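
Reading a sentence off the decoder output is a greedy decode: for every time step, pick the token index with the largest probability, then drop indexes that have no word, such as the padding index 0. A plain-Python sketch with a made-up probability table of 3 time steps and 4 tokens; `greedy_decode` is a hypothetical helper.

```python
def greedy_decode(step_probs, index2word):
    """Pick the most probable index at each time step and map it to a word."""
    ids = [max(range(len(probs)), key=probs.__getitem__) for probs in step_probs]
    return " ".join(index2word[i] for i in ids if i in index2word)


index2word = {1: "old", 2: "friend", 3: "avoided"}  # index 0 is padding
step_probs = [
    [0.7, 0.1, 0.1, 0.1],  # padding wins, so this step is skipped
    [0.1, 0.6, 0.2, 0.1],  # "old"
    [0.1, 0.2, 0.3, 0.4],  # "avoided"
]
print(greedy_decode(step_probs, index2word))
# prints: old avoided
```

Because each step is decoded independently of the words already chosen, nothing in this procedure discourages the same token from being selected at several consecutive steps.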
