[View in Colaboratory](https://colab.research.google.com/github/assaflehr/language-style-transfer/blob/master/notebooks/keras_nlp.ipynb)

>>[Keras accuracy/performance limitations](#scrollTo=vR_wGTR6szDb)

>>[Pretraining autoencoder or LM will surely help.](#scrollTo=vR_wGTR6szDb)

>>>[CuDNNLSTM vs LSTM](#scrollTo=vR_wGTR6szDb)

>>>[Tip to self:](#scrollTo=vR_wGTR6szDb)

>>>[For David: example of how to clone, then use from github](#scrollTo=OrKzdh71sBKm)

>[Dataset](#scrollTo=wwc6_EHFD_Ir)

>[Model definition](#scrollTo=BWVEcaeF6DBR)

>>[classifier](#scrollTo=VTobPYftYvQC)

>>[adversarial model](#scrollTo=9TBO_5dXv40K)

>[TRAINING](#scrollTo=9s0-8EJi6AG1)

>[Error analysis](#scrollTo=XvIkFOWy55ov)

>>[Error of style disc.](#scrollTo=8QaOXXjUAwVH)



## Keras accuracy/performance limitations

# Performance caution:
Time for one G training epoch (250x64 rows):

type|RNN|bidi?|time|notes
--|--|--|--|--
CPU|LSTM|Y|15 min|
GPU|LSTM|Y|75s|slower but more accurate
GPU|CuDNNLSTM|Y|42s|
GPU|CuDNNLSTM|N|35s|






## Pretraining autoencoder or LM will surely help.

See a recent review here: https://thegradient.pub/author/sebastian
and the autoencoder approach (less recommended than an LM) here: "Semi-supervised Sequence Learning" (2015), which uses one RNN for both encoder and decoder.


###   LSTM 
* CuDNNLSTM trains on GPU, but does not support the `dropout` and `recurrent_dropout` arguments, which are the SOTA regularizers. This is a CUDA-side limitation, and even native TF does not support it.
  It is hard to compare the RNN type in isolation: all of G (with dense layers and more) takes 35s on GPU; the time on CPU is still unknown.
* Make the LSTM deeper instead of wider, for linear instead of quadratic growth in computation. When doing so, add skip connections to the 2nd/3rd/... layers.
* A large vocabulary means a huge softmax, which means slow runtime. To speed up training, NCE loss replaces the softmax at train time (TF only).
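
As a rough illustration of the deeper-versus-wider point above, the sketch below counts LSTM weights, using the parameter count as a proxy for computation per timestep. `lstm_params` is a hypothetical helper, not part of Keras. The `bias_vectors=2` case assumes that CuDNNLSTM stores two bias vectors per gate, which is consistent with the 1142784 parameters reported for the bidirectional encoder in this notebook's model summary.

```python
def lstm_params(input_dim, units, bias_vectors=1):
    """Weights of one LSTM layer: 4 gates, each with an input kernel,
    a recurrent kernel and `bias_vectors` bias vectors."""
    return 4 * units * (input_dim + units) + 4 * units * bias_vectors

embedding_dim = 300  # same value as in the params cell of this notebook

base = lstm_params(embedding_dim, 256)                           # one layer of 256 units
wide = lstm_params(embedding_dim, 512)                           # one layer, doubled width
deep = lstm_params(embedding_dim, 256) + lstm_params(256, 256)   # two stacked layers of 256

# Assumption: CuDNNLSTM keeps two bias vectors; a bidirectional wrapper doubles the layer.
bidi_cudnn = 2 * lstm_params(embedding_dim, 256, bias_vectors=2)

print('1 x 256 :', base)
print('1 x 512 :', wide)
print('2 x 256 :', deep)
print('bidi CuDNNLSTM 256:', bidi_cudnn)
```

The recurrent kernel grows with `units * units`, which is why doubling the width costs more than adding a second layer of the original width.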

### Tip to self:
* Manually check the loss value on one sample (prediction vs. ground truth). By doing this, I saw that `<s>` was not given a one-hot value.
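
A minimal sketch of such a manual check, using NumPy only and a made-up toy sample (the arrays and the helper name `per_step_crossentropy` are illustrative, not taken from this notebook). A timestep whose target row is all zeros contributes zero loss, so it is easy to miss unless the row sums are checked explicitly.

```python
import numpy as np

def per_step_crossentropy(y_true, y_pred, eps=1e-7):
    """Categorical cross-entropy for each timestep of one sample (steps x vocab)."""
    y_pred = np.clip(y_pred, eps, 1.0)
    return -(y_true * np.log(y_pred)).sum(axis=-1)

# Toy sample: 3 timesteps, vocabulary of 4 tokens.
y_pred = np.array([[0.70, 0.10, 0.10, 0.10],
                   [0.10, 0.70, 0.10, 0.10],
                   [0.25, 0.25, 0.25, 0.25]])
y_true = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 0],
                   [0, 0, 0, 0]], dtype='float32')  # last step was never one-hot encoded

loss = per_step_crossentropy(y_true, y_pred)
missing = np.where(y_true.sum(axis=-1) != 1)[0]

print('loss per step:', loss)
print('steps without a one-hot target:', missing)
```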

### For David: example of how to clone, then use from github
```python
#!rm -r language-style-transfer  #remove previous github copy if needed
!git clone https://github.com/assaflehr/language-style-transfer.git
# rename the `code` directory to `lang_transfer`, which will be the package name
!mv language-style-transfer/code language-style-transfer/lang_transfer

# Add the local_modules directory to the set of paths Python uses to look for imports.
import sys
sys.path.append('language-style-transfer')

import lang_transfer   #your code here!!!
```

In [0]:
# adaptation of: https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html
# first Dataset class to load bible-data
# then copy of the model, but working with words instead of chars





# Some params

In [0]:
embedding_dim=300 #100-300 good numbers
latent_dim = 256  #for hidden unit of LSTM
bidi_encoder=True
cuddlstm=True  #with a bidi encoder, the difference in time is 20s vs 32s

batch_size=64


# Dataset

In [36]:
## NLP preprocessing for text
# It has a few parts:
# 1. Load zip files, then use glob to keep only part of them (data/*/*.txt).
# 2. Parse each row into (x,y) by passing a parser function. It can be as simple as lambda line: line, or for comma-delimited rows, lambda line: line.split(',')[4]
# 3. Tokenize: split by spaces, but also by punctuation, and be smart about it ('...' should be one token, "ain't" one token, while "then;" should be two tokens: 'then' and ';').
#    You should also build a vocabulary: keep the X most frequent words and throw away rare ones, which will be replaced by the <oov> flag.
# 4. Transform text to sequences. For words there are usually two different types: ['<s>','hello','world'] -> [0,5,6], but there is also
#    a one-hot-encoding version where 5 is actually a vector of vocabulary length full of zeros, with index 5 set to 1.
#    The one-hot encoding is used as output for text generation and has a HUGE MEMORY requirement. 100K sentences of 20 words need 2M floats = 8MB,
#    but for the one-hot encoding multiply this by the vocab size: for char encoding it's ~30, and for a good vocab of 10K words we need 80GB(!)
#    The simple, and only, way to solve this is to never keep the one-hot encoding in memory; just use a generator to make it one-hot at runtime.



from __future__ import print_function

import numpy as np
import os
import glob
import csv, json
from collections import namedtuple
from zipfile import ZipFile
from os.path import expanduser, exists
import keras
from keras import backend as K
from keras.utils.data_utils import get_file
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical



TVT = namedtuple('TVT',['train','val','test'])

class Dataset:
  #Dataset for COPY encoder-decoder
  # to David: the generator yields x,y where
  #   x1 (encoder_input): batch_size,max_words of type integer (hello world <e>)                     (64,10)
  #   x2 (decoder_input): batch_size,max_words of type integer (<s> hello world)                     (64,10)
  #   y1:                 batch_size,max_words,one-hot-encoding (offset by one: hello world <end>)   (64,10,10000)
  #   y_style:            batch_size,one-hot-of-style                                                (64,2)
  #   shapes assume batch_size=64, max_words=10, vocab_size=10000, style_vocab=2/5
  # tokenizer is currently very bad. replace it
  # vocabulary (training-only): don't use 666, use a large number (10K?)
  # Generator: iteration result ([x1,x2],[y1,y_style])
  
  # for auto-encoder pre-training without style: ([x1,x2],y1) with no y_style
  
  
  def __init__(self,unique_name,url,extract,cache_dir,pattern,skip_first,row_parser,validation_pattern=0.1,test_pattern=0.1):
    '''
    unique_name - used as the name of the downloaded dataset source (or zip) file
    url - where to download the file from
    extract - whether the file is zipped/tarred
    cache_dir - the files are downloaded under <cache_dir>/datasets/<unique_name>
    pattern - glob pattern choosing which extracted files to use, for example data*.txt.
              It must include the path inside the zip (including the zip root) and should match both train and test files
    skip_first - skip the first line of each file
    row_parser - function (file_name, line) -> (text, style)
    validation_pattern - subset glob pattern to use. If it's a float like 0.1, use it as the split fraction of each file
    test_pattern - see above
    '''
    if not extract and pattern:
      raise ValueError('pattern must be empty if extract=False: a pattern chooses a subset of the extracted files (data/*.txt), but only one file was downloaded')

    if not os.path.exists(cache_dir):
      os.makedirs(cache_dir)

    fpath=get_file(unique_name, url,extract=extract, cache_dir=cache_dir)
    print ('fpath',fpath)
    files = [fpath] if not pattern else glob.glob(f'{cache_dir}/datasets/{pattern}')

    train,val,test=[],[],[]
    for f in files:
      with open(f,encoding="latin-1") as fh:  # close the file once it is read
        lines = [row_parser(f,line.rstrip()) for line in fh]
      lines = lines[1 if skip_first else 0:][:10000]
      print ('HARDCODED MAX LINES = 10000')
      
      print (files,'#lines',len(lines),'first 3 lines')
      print (lines[0],'\n',lines[1],'\n',lines[2])
      
      if isinstance(validation_pattern,float) and isinstance(test_pattern,float):
        # split points: train=lines[:val_count], val=lines[val_count:test_count], test=lines[test_count:]
        test_count = int(len(lines)*(1-test_pattern))
        val_count =  int(len(lines)*(1-test_pattern-validation_pattern))
        print ('(val_count,test_count)',val_count,test_count)
        train+= lines[:val_count]
        val+=   lines[val_count:test_count]
        test+=  lines[test_count:]
        
    self.tvt_lines = TVT(train, val, test)
    print ('train:',len(self.tvt_lines.train),'val',len(self.tvt_lines.val),'test',len(self.tvt_lines.test))
    self.parsed= self.tvt_lines
  
  
  def fit(self):
    """ The current tokenization is quite bad: 'hello world!' becomes 2 tokens, and the second one is 'world!'.
    """
    print ('limiting num_words in Tokenizer due to MEMORY BOUNDS')  #num_words =100*1000

    
    # I use here tokenizer only to count freq. of words, then manually choose most freq. and manually split
    # this is bad. There are better ways to do it (probably library/code which do it in one line)
    print ('\nREPLACE ME . BAD TOKENIZER!!!')
    self.tokenizer = Tokenizer(num_words=100000, filters='', lower=False, split=' ', char_level=False, oov_token='<oov>')
    
    # self.parsed.train is a list , each value is tuple text_string,file_name
    self.tokenizer.fit_on_texts([x for x,style in self.parsed.train])
    self.styles = set([style for x,style in self.parsed.train])
    print ('styles',self.styles)
    self.style2index = {style:i for i,style in enumerate(self.styles)}
    self.index2style = {index:style for style,index in self.style2index.items() }
    print (self.style2index,self.index2style)
    
    #print ('\n word_index',len(self.tokenizer.word_index),'<oov>',self.tokenizer.word_index['<oov>'])
    print ('common',list(self.tokenizer.word_index.items())[:15])
    print ('uncommon',list(self.tokenizer.word_index.items())[-15:])
  
    
    num_words= 10000
    print ('CAPPING. keeping ',num_words,'of',len(self.tokenizer.word_index))
    
    num_words= min(num_words,len(self.styles)+1+len(self.tokenizer.word_index))
    print (num_words,num_words,num_words)    
    word2index = dict(list(self.tokenizer.word_index.items())[:num_words-len(self.styles)-2])
    word2index['<s>']=0  #keras tokenizer keeps 0 unused
    for i,style in enumerate(self.styles):
      word2index[style]=num_words-1-len(self.styles)+i  #if num_words=100 . [96,97,98] 
    word2index['<oov>']=num_words-1                     #<oov> is [99]
    print ('word2index',len(word2index))
    
    #FOR NOW the start and end are both ZERO. maybe not good???
    
    num_encoder_tokens = num_decoder_tokens= num_words # len(self.tokenizer.word_index)
    self.word2index = word2index
    self.index2word = {index:word for (word,index) in self.word2index.items()}
    self.MAX_SEQUENCE_LENGTH=20  #100
    
    verbose=5
    result = []
    for rows in self.parsed:

      encoder_input_data  = np.zeros( (len(rows), self.MAX_SEQUENCE_LENGTH),    dtype='float32')
      decoder_input_data  = np.zeros( (len(rows), self.MAX_SEQUENCE_LENGTH),    dtype='float32') # shifted by 1
      decoder_target_data = None #np.zeros((len(rows),  self.MAX_SEQUENCE_LENGTH, num_decoder_tokens),    dtype='float32')
      style_data          = np.zeros((len(rows),  len(self.styles)),    dtype='float32') #one-hot
      
      #input to decoder   <s> hello world
      #target of decoder: hello world <s>
      
      for i, (input_text,style) in enumerate(rows):
        input_text = input_text.split(' ') #BUG: we need to use tokenizer here!!!!
        #pad with end token  hello world <end> <end> <end>
        end_token='<s>'
        input_text += [end_token for _ in range(self.MAX_SEQUENCE_LENGTH - len(input_text)+1)]
        if verbose:
          print ('input_text',input_text)
          verbose-=1
        # out : hello  world  <end>  (MAX_SEQUENCE_LENGTH=2)  <-encoder_input+ decoder_output(but one-hot)
        #
        # in: : <s>   hello   world  <- decoder-input
         
        for t, word in enumerate(([style]+input_text)[:self.MAX_SEQUENCE_LENGTH]):
            one_hot = word2index['<oov>'] if word not in word2index else word2index[word]
            decoder_input_data[i, t ] = one_hot
            
        for t,word in enumerate(input_text[:self.MAX_SEQUENCE_LENGTH]):  #last must be <end>=<s> token
            one_hot = word2index['<oov>'] if word not in word2index else word2index[word]
            encoder_input_data[i,t]=one_hot
            #decoder_target_data[i, t, one_hot] = 1. 
            
        style_data[i,self.style2index[style]]= 1
        
      #print (decoder_target_data.sum(),len(rows)*self.MAX_SEQUENCE_LENGTH)
      #assert int(decoder_target_data.sum())==len(rows)*self.MAX_SEQUENCE_LENGTH #one-hot-encoding must always include one
      
      result.append( (encoder_input_data,decoder_input_data,decoder_target_data,style_data))

    self.result= TVT(*result)

  def one_x_as_text(self,x):
    """ 1x20 or 20 input"""
    if len(x.shape)==2: 
      assert x.shape[0] ==1  #can only work on batch of 1
      x= x[0]
    return ' '.join([self.index2word[index] for index in x])

  def one_y_as_text(self,y):
    """ 1x20x2000 or 20x2000 input, in case of first will work on y[0]"""
    if len(y.shape)==3: 
      assert y.shape[0] ==1  #can only work on batch of 1
      y=y[0]
      
    best_token = np.argmax(y,1)
    return ' '.join([self.index2word[index] for index in best_token])

  
cache_dir='cache' 
#dataset('quora_dups','http://qim.ec.quoracdn.net/quora_duplicate_questions.tsv',False,cache_dir) 
#dataset('bible4','https://codeload.github.com/keithecarlson/Zero-Shot-Style-Transfer/zip/master',extract=True,cache_dir=cache_dir
#       pattern=('Zero-Shot-Style-Transfer-master/Data/Bibles/ASV/*/*.txt','Zero-Shot-Style-Transfer-master/Data/Bibles/BBE/*/*.txt')

# row format: x,y,z,t,text where the text itself may contain commas, so re-join everything after the 4th comma
row_parser= lambda file_name,line: (','.join(line.split(',')[4:]),
                                    file_name.split('/')[-1]) #map x to x,style_file
dataset = Dataset('bible_csv','https://codeload.github.com/ashual/style-transfer/zip/master',extract=True,cache_dir=cache_dir,
                  
                  pattern='style-transfer-master/datasets/bible-corpus/t_[yb]*.csv',skip_first=True,row_parser=row_parser)    #kbd
dataset.fit()        
x_train, x_train_d, y_train,style_train = dataset.result.train
x_val, x_val_d,y_val ,style_val= dataset.result.val
x_test,x_test_d,y_test ,style_test= dataset.result.test

print ('train',x_train.shape,x_train_d.shape,y_train,style_train.shape)
print('val',x_val.shape)
print ('train in MB x,y',x_train.nbytes/1e6)


#,t_bbe,BBE,english,Bible in Basic English,,http://en.wikipedia.org/wiki/Bible_in_Basic_English,,Public Domain,
                  #,t_dby,DARBY,english,Darby English Bible,,http://en.wikipedia.org/wiki/Darby_Bible,,Public Domain,
                  #,t_kjv,KJV,english,King James Version,,http://en.wikipedia.org/wiki/King_James_Version,,Public Domain,


fpath cache/datasets/bible_csv
HARDCODED MAX LINES = 10000
['cache/datasets/style-transfer-master/datasets/bible-corpus/t_bbe.csv', 'cache/datasets/style-transfer-master/datasets/bible-corpus/t_ylt.csv'] #lines 10000 first 3 lines
('At the first God made the heaven and the earth.', 't_bbe.csv') 
 ('And the earth was waste and without form; and it was dark on the face of the deep: and the Spirit of God was moving on the face of the waters.', 't_bbe.csv') 
 ('"And God said, Let there be light: and there was light."', 't_bbe.csv')
(val_count,test_count) 8000 9000
HARDCODED MAX LINES = 10000
['cache/datasets/style-transfer-master/datasets/bible-corpus/t_bbe.csv', 'cache/datasets/style-transfer-master/datasets/bible-corpus/t_ylt.csv'] #lines 10000 first 3 lines
("In the beginning of God's preparing the heavens and the earth --", 't_ylt.csv') 
 ('"the earth hath existed waste and void, and darkness `is\' on the face of the deep, and the Spirit of God fluttering on the face of the waters,"', 
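
The preprocessing notes at the top of this cell suggest building the one-hot targets in a generator instead of keeping them in memory. The block below is a minimal sketch of that idea, assuming (as in `Dataset.fit`) that the decoder target is the encoder input in one-hot form. The name `one_hot_batches` and the toy arrays are hypothetical and not part of the notebook.

```python
import numpy as np

def one_hot_batches(encoder_input, decoder_input, vocab_size, batch_size=64):
    """Yield ([x1, x2], y1) forever. y1 is built per batch only,
    so the full (rows, max_words, vocab_size) tensor is never stored."""
    n = len(encoder_input)
    while True:
        for start in range(0, n, batch_size):
            x1 = encoder_input[start:start + batch_size]
            x2 = decoder_input[start:start + batch_size]
            # One-hot encode the encoder tokens: (batch, max_words, vocab_size)
            y1 = np.zeros(x1.shape + (vocab_size,), dtype='float32')
            rows, cols = np.indices(x1.shape)
            y1[rows, cols, x1.astype('int64')] = 1.0
            yield [x1, x2], y1

# Memory arithmetic from the notes: token indices vs. full one-hot targets.
sentences, words, vocab = 100000, 20, 10000
index_mb = sentences * words * 4 / 1e6   # float32 token indices, in MB
one_hot_gb = index_mb * vocab / 1e3      # the same data one-hot encoded, in GB
print(index_mb, 'MB as indices,', one_hot_gb, 'GB as one-hot')

# Toy usage: 2 sentences of 3 tokens, vocabulary of 10.
x = np.array([[5, 6, 0], [7, 0, 0]], dtype='float32')
inputs, target = next(one_hot_batches(x, x, vocab_size=10, batch_size=2))
print(target.shape)
```

With this approach only one batch of targets (for example 64 x 20 x 10000 floats) exists at a time.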

In [37]:
#!head -50 cache/datasets/style-transfer-master/datasets/bible-corpus/bible_version_key.csv
#!ls -lh cache/datasets/style-transfer-master/datasets/bible-corpus

for i in [2,3,4,998,999,1000,1001]: # good:[5000, 2002,3001]:
  print ('\n',i)
  for j in ['ylt','bbe']:#['ylt','kjv','wbt','asv','web','bbe','dby']:
    !sed -n -e {i}p cache/datasets/style-transfer-master/datasets/bible-corpus/t_{j}.csv

# differences: 
# very old: ylt
# very dynamic: bbe
# middle ground: the other 5 are similar (e.g. wbt)


 2
1001001,1,1,1,In the beginning of God's preparing the heavens and the earth --
1001001,1,1,1,At the first God made the heaven and the earth.

 3
1001002,1,1,2,"the earth hath existed waste and void, and darkness `is' on the face of the deep, and the Spirit of God fluttering on the face of the waters,"
1001002,1,1,2,And the earth was waste and without form; and it was dark on the face of the deep: and the Spirit of God was moving on the face of the waters.

 4
1001003,1,1,3,"and God saith, `Let light be;' and light is."
1001003,1,1,3,"And God said, Let there be light: and there was light."

 998
1034016,1,34,16,"then we have given our daughters to you, and your daughters we take to ourselves, and we have dwelt with you, and have become one people;"
1034016,1,34,16,Then we will give our daughters to you and take your daughters to us and go on living with you as one people.

 999
1034017,1,34,17,"and if ye hearken not unto us to be circumcised, then we have taken our daughter, and hav

In [38]:
#show a sample of x_train
for i in range(1150,1152):
  print ('\ntokens  :' , x_train_d[i])
  print ('as words:',[dataset.index2word[index] for index in x_train_d[i] ])
  print ('original:',dataset.parsed.train[i][0].split(' '))
  print (dataset.one_x_as_text(x_train_d[i]))
  #print (dataset.one_y_as_text(y_train[i]))


tokens  : [9.997e+03 2.170e+02 2.130e+02 4.600e+01 1.440e+02 1.240e+02 4.000e+00
 1.121e+03 2.000e+00 9.999e+03 1.000e+00 2.158e+03 7.000e+00 1.566e+03
 3.000e+00 5.920e+02 9.250e+02 5.000e+00 8.390e+02 2.950e+02]
as words: ['t_bbe.csv', '"Now', 'Joseph', 'was', 'taken', 'down', 'to', 'Egypt;', 'and', '<oov>', 'the', 'Egyptian,', 'a', 'captain', 'of', 'high', 'position', 'in', "Pharaoh's", 'house,']
original: ['"Now', 'Joseph', 'was', 'taken', 'down', 'to', 'Egypt;', 'and', 'Potiphar', 'the', 'Egyptian,', 'a', 'captain', 'of', 'high', 'position', 'in', "Pharaoh's", 'house,', 'got', 'him', 'for', 'a', 'price', 'from', 'the', 'Ishmaelites', 'who', 'had', 'taken', 'him', 'there."']
t_bbe.csv "Now Joseph was taken down to Egypt; and <oov> the Egyptian, a captain of high position in Pharaoh's house,

tokens  : [9.997e+03 6.000e+00 1.000e+00 3.500e+01 4.600e+01 1.400e+01 4.580e+02
 2.000e+00 1.000e+01 1.680e+02 5.038e+03 2.000e+00 1.000e+01 4.600e+01
 1.570e+02 5.000e+00 1.000e+00 1.390e+02
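
The sample above shows tokens such as `Egypt;` and `"Now`, where punctuation stays glued to the word because the text is split on spaces only. The block below is one possible regex-based tokenizer following the rules listed in the preprocessing notes ('...' as one token, "ain't" as one token, "then;" as two tokens). It is a sketch rather than the tokenizer used in this notebook, and `simple_tokenize` is a hypothetical name.

```python
import re

# Order matters: try the ellipsis first, then words (with inner apostrophes),
# then any single non-space, non-word character as its own token.
TOKEN_RE = re.compile(r"\.\.\.|\w+(?:'\w+)*|[^\w\s]")

def simple_tokenize(text):
    """Split text into words and punctuation tokens."""
    return TOKEN_RE.findall(text)

print(simple_tokenize("then; ain't ... Pharaoh's house,"))
```

Splitting punctuation this way should also reduce the number of rare tokens that end up as `<oov>`, although that effect was not measured here.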

# Model definition




## G

In [39]:
from keras.models import Model
from keras.layers import Input, LSTM, Dense, Embedding,CuDNNLSTM,Bidirectional,Concatenate,Dropout

# size of tokenizer indexes

num_decoder_tokens = num_encoder_tokens = len(dataset.word2index) 
print (num_decoder_tokens)



# Define an input sequence and process it.  
# input: (batch, num-of-words) of integers (not one-hot)
# each batch should be padded to a single length, but the length can differ between batches
# batch 1:   len 7,8,9,10(max)  inside the same batch, lengths are similar (less waste on GPU)
# batch 2:   len 97-100(max)    between batches, lengths are different
# order is random... sometimes a short batch, sometimes a long one

encoder_inputs = Input(shape=(None,),name='encoder_inputs')

shared_embedding = Embedding(num_encoder_tokens, 
                     embedding_dim, 
                     #weights=[word_embedding_matrix], if there is one (word2vec)
                     #trainable=False,                            
                     #input_length=MAX_SEQUENCE_LENGTH, if there is one
                     )
#see dropout discussion: https://github.com/keras-team/keras/issues/7290. iliaschalkidis 
#Dropout(noise_shape=(batch_size, 1, features))
x = shared_embedding(encoder_inputs) 
if (cuddlstm):
  encoder_lstm=CuDNNLSTM(latent_dim, return_state=True)
else:
  print ('using LSTM with dropout!')
  #need to tune the dropout values (maybe fast.ai tips); these values were just made up
  encoder_lstm=LSTM(latent_dim, return_state=True,dropout=0.3,recurrent_dropout=0.3)
if (bidi_encoder):
  encoder_lstm=Bidirectional(encoder_lstm,merge_mode='concat')
  x, forward_h, forward_c, backward_h, backward_c = encoder_lstm(x) #output,h1,c1,h2,c2
  state_h = Concatenate()([forward_h, backward_h])
  state_c = Concatenate()([forward_c, backward_c])
else:  
  x, state_h, state_c = encoder_lstm(x)

  
  
#sentence embedding LSTM: h,c  
encoder_states = [state_h, state_c]

# Set up the decoder, using `encoder_states` as initial state.
# The alignment looks similar to:
#gold decoder_outputs: [1,0,0] [0,1,0] one-hot of the below
#encoder_inputs:      hello   world <end> <end>
#decoder_input      : <style> hello world <end>   | style = bible1,bible2,(maybe generic for pretraining)
 
decoder_inputs = Input(shape=(None,),name='decoder_inputs')  

decoder_latent_dim = latent_dim*2 if bidi_encoder else latent_dim #bi-di pass merge of h1+h2, c1+c2
if (cuddlstm):
  decoder_lstm = CuDNNLSTM(decoder_latent_dim, return_sequences=True,return_state=True) #returned state used in inference
else:
  decoder_lstm = LSTM(decoder_latent_dim, return_sequences=True,return_state=True)
decoder_outputs, _, _  = decoder_lstm(shared_embedding(decoder_inputs), initial_state=encoder_states)
decoder_dense  = Dense(num_decoder_tokens, activation='softmax',name='decoder_softmax')
decoder_outputs = decoder_dense(decoder_outputs)

# Define the model that will turn
# `encoder_input_data` & `decoder_input_data` into `decoder_target_data`
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)

# Print the model summary (compilation happens later, in the compile section)
print (model.summary())







# now for the INFER models (re-arrangement of the prev one)
# Remember that the training model variables were:
#                                        decoder_outputs
#encoder   --->    encoder_states  -->   decoder_lstm  
#shared_embedding                        shared_embedding  
#encoder_inputs                          decoder_inputs                  
encoder_model = Model(encoder_inputs, encoder_states)

decoder_states_inputs = [Input(shape=(decoder_latent_dim,)), Input(shape=(decoder_latent_dim,))]

decoder_outputs2, state_h, state_c = decoder_lstm(shared_embedding(decoder_inputs), initial_state=decoder_states_inputs)

decoder_states = [state_h, state_c]
decoder_outputs2 = decoder_dense(decoder_outputs2)

decoder_model = Model(
    [decoder_inputs] + decoder_states_inputs,
    [decoder_outputs2] + decoder_states)



    


10000
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
decoder_inputs (InputLayer)     (None, None)         0                                            
__________________________________________________________________________________________________
encoder_inputs (InputLayer)     (None, None)         0                                            
__________________________________________________________________________________________________
embedding_3 (Embedding)         (None, None, 300)    3000000     encoder_inputs[0][0]             
                                                                 decoder_inputs[0][0]             
__________________________________________________________________________________________________
bidirectional_3 (Bidirectional) [(None, 512), (None, 1142784     embedding_3[0][0]                
____
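
To make the alignment between encoder input, decoder input and decoder target concrete, here is a small pure-Python sketch that mirrors the padding and shifting done in `Dataset.fit` on a toy sentence. The helper name `align` is hypothetical, and the target is shown as tokens rather than one-hot vectors.

```python
def align(style, words, max_len, end_token='<s>'):
    """Return (encoder_input, decoder_input, decoder_target) as token lists."""
    # pad with the end token: hello world <s> <s> ...
    padded = words + [end_token] * (max_len - len(words) + 1)
    encoder_input = padded[:max_len]
    # the decoder input is shifted right by one, with the style as first token
    decoder_input = ([style] + padded)[:max_len]
    # the decoder target equals the encoder input (one-hot encoded at training time)
    decoder_target = encoder_input
    return encoder_input, decoder_input, decoder_target

enc, dec, tgt = align('t_bbe.csv', ['hello', 'world'], max_len=4)
print('encoder_input :', enc)
print('decoder_input :', dec)
print('decoder_target:', tgt)
```

At every position the decoder sees the previous gold token and has to predict the current one, which is the teacher-forcing setup used for training.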

##  classifier

**Intro**

(1) We start with a regular seq2seq: encoder -> embedding -> decoder, trained with a reconstruction loss.

(2) We can build a style discriminator whose target is to classify the author style from the sentence embedding.
When training it you need to freeze the encoder and decoder parts of the model, then:
input1: sentence --frozen encoder--> embedding    (no need to run the decoder)
input2: style (one-hot), the ground-truth label
output: style (one-hot)  
The discriminator can be a simple dense-based classifier that minimizes cross-entropy.

(3) The smart part: we want to train the encoder to create an embedding which will fool the discriminator.
We freeze the discriminator weights and train the encoder-decoder similarly to (1), with an extra objective:
the loss of the discriminator should be maximized.


In [40]:

#### Classifier model #############################
# works on the embedding itself
style_concat= Concatenate(name='d__concat1')(encoder_states)

a = Dropout(0.1,name='d__dropout1')(style_concat)                          
# if your inputs have shape  (batch_size, timesteps, features) and you want the dropout mask to be the same for all timesteps, you can use noise_shape=(batch_size, 1, features).
a = Dense(100,activation='relu',name='d__dense1')(a) 
#a = keras.layers.LeakyReLU()(a) #LeakyReLU  #why leaky? see: how to train your GAN - BUT IT FAILED TO LEARN (accuracy always 0.5)
a = Dropout(0.1,name='d__dropout2')(a)
style_outputs = Dense(len(dataset.style2index),activation='softmax',name='d__dense_softmax')(a)

d = Model(encoder_inputs,style_outputs)  #style_outputs : batch , one-hot-encoding-of-style
print (d.summary())





__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
encoder_inputs (InputLayer)     (None, None)         0                                            
__________________________________________________________________________________________________
embedding_3 (Embedding)         (None, None, 300)    3000000     encoder_inputs[0][0]             
__________________________________________________________________________________________________
bidirectional_3 (Bidirectional) [(None, 512), (None, 1142784     embedding_3[0][0]                
__________________________________________________________________________________________________
concatenate_5 (Concatenate)     (None, 512)          0           bidirectional_3[0][1]            
                                                                 bidirectional_3[0][3]            
__________

## adversarial model

In [41]:
from keras import backend as K
############################# Adv model
# Note that it does not have new layers, it just combines all of them with a new loss
def inverse_categorical_crossentropy(y_true, y_pred):
  # need to implement it better: sum(1/categorical_crossentropy_per_sample)
  # if the discriminator is random on 2 styles, we expect 50%, which means a log-loss of ln(2)~0.69, so 1/0.69 ~ 1.44
  # if the discriminator is great, 99%, the log-loss is close to 0, so 1/0 is big.
  # so the expected range is GREAT~1.44 (discriminator is guessing), BAD=BIGGG
  
  return 1/(K.categorical_crossentropy(y_true, y_pred)+0.0001)
#style_outputs
adv_model = Model([encoder_inputs, decoder_inputs],[decoder_outputs,style_outputs])

print (adv_model.summary())


__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
decoder_inputs (InputLayer)     (None, None)         0                                            
__________________________________________________________________________________________________
encoder_inputs (InputLayer)     (None, None)         0                                            
__________________________________________________________________________________________________
embedding_3 (Embedding)         (None, None, 300)    3000000     encoder_inputs[0][0]             
                                                                 decoder_inputs[0][0]             
__________________________________________________________________________________________________
bidirectional_3 (Bidirectional) [(None, 512), (None, 1142784     embedding_3[0][0]                
__________
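
To get a feeling for the range of `inverse_categorical_crossentropy`, the sketch below evaluates the same formula with NumPy for a single sample, given the probability the discriminator assigns to the true style. It assumes natural-log cross-entropy, where a 50% guess gives ln(2), about 0.69. The helper name `inverse_ce` is hypothetical.

```python
import numpy as np

def inverse_ce(p_true_class, eps=1e-4):
    """1 / (cross-entropy + eps) for one sample, where p_true_class is the
    probability the discriminator assigns to the correct style."""
    return 1.0 / (-np.log(p_true_class) + eps)

for p in (0.5, 0.9, 0.99):
    print('discriminator confidence', p, '-> adversarial loss', float(inverse_ce(p)))
```

A guessing discriminator gives a loss of roughly 1.44, while a confident and correct one pushes the loss towards 1/eps, so this term can easily dominate the reconstruction loss when both have weight 1.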

# compile

In [44]:
# compile it all
def print_trainable(model):
  for layer in model.layers:
    print (layer.name,layer,layer.trainable)
    

def set_trainable(model,trainable) :
  """ set all layers of the model (ignores Input) to trainable True/False"""
  for layer in model.layers:
    if not isinstance(layer, keras.engine.topology.InputLayer):
      layer.trainable = trainable 

#optimizer = ''#clipvalue=0.5,clipnorm=1.0) #TODO: values!!!!!    
######################################
# compile all; set the trainable flags first (Keras captures them at compile time)
set_trainable(model,True)
model.trainable = True
model.compile(optimizer='adam', loss='categorical_crossentropy')  #30 epochs: loss: 0.2836 - val_loss: 0.4428
print ('\nmodel compiled with:')
print_trainable(model)


# when training d, the encoder should not change.
# impl. detail: instead of choosing layers one by one, we first set all of d to True, then override the shared part with False
set_trainable(d,True)    
set_trainable(model,False) #setting it back (its layers are part of d)
d.trainable = True
d.compile(optimizer='adam', loss='categorical_crossentropy',metrics=['accuracy'])
print ('\nd compiled with:')
print_trainable(d)

# when training adv_model, the discriminator layers should not change
set_trainable(adv_model,False)
set_trainable(model,True)  #so the d-only layers stay False
adv_model.trainable = True
adv_model.compile(optimizer='adam', 
                  loss=['categorical_crossentropy',inverse_categorical_crossentropy],
                  loss_weights=[1, 1])
print ('\nadv compiled with:')
print_trainable(adv_model)



model compiled with:
decoder_inputs <keras.engine.topology.InputLayer object at 0x7fb019293978> False
encoder_inputs <keras.engine.topology.InputLayer object at 0x7faf530894e0> False
embedding_3 <keras.layers.embeddings.Embedding object at 0x7fb0192937f0> True
bidirectional_3 <keras.layers.wrappers.Bidirectional object at 0x7fb018bc75f8> True
concatenate_5 <keras.layers.merge.Concatenate object at 0x7fb018bc7d68> True
concatenate_6 <keras.layers.merge.Concatenate object at 0x7fb018bc7e48> True
cu_dnnlstm_6 <keras.layers.cudnn_recurrent.CuDNNLSTM object at 0x7fb018bc7dd8> True
decoder_softmax <keras.layers.core.Dense object at 0x7faf53fca940> True

d compiled with:
encoder_inputs <keras.engine.topology.InputLayer object at 0x7faf530894e0> False
embedding_3 <keras.layers.embeddings.Embedding object at 0x7fb0192937f0> False
bidirectional_3 <keras.layers.wrappers.Bidirectional object at 0x7fb018bc75f8> False
concatenate_5 <keras.layers.merge.Concatenate object at 0x7fb018bc7d68> False
con

# Sample and show sample

In [43]:
max_decoder_seq_length=dataset.MAX_SEQUENCE_LENGTH
print (max_decoder_seq_length)
# TODO: now it's greedy and use argmax
# maybe to choose randomly by distribution
# and maybe to implement BEAM SEARCH
def decode_sequence(input_seq,style,verbose=False):
    assert input_seq.shape == (1, max_decoder_seq_length )
    if verbose: print ('input_seq',input_seq.shape)
    # Encode the input as state vectors.
    states_value = encoder_model.predict(input_seq)
    if verbose: print ('encoder result states_value','h',states_value[0].shape,states_value[0].mean(),'c',states_value[1].shape,states_value[1].mean())
    
    # Generate empty target sequence of length 1.
    #                      batch,word-number value is token (0 /122)
    target_seq = np.zeros((1, 1))
    # Populate the first token of the target sequence with the style token (it acts as the start token).
    target_seq[0, 0] = dataset.word2index[style]

    # Sampling loop for a batch of sequences
    # (to simplify, here we assume a batch of size 1).
    stop_condition = False
    decoded_sentence = []
    while not stop_condition:
        #start with encoder-state then change to self state
        output_tokens, h, c = decoder_model.predict( [target_seq] + states_value) 
        if verbose: print ('output_tokens',output_tokens.shape,output_tokens.mean(),'h',h.shape,'c',c.shape)

        # Sample a token
        sampled_token_index = np.argmax(output_tokens)# [0, -1, :])
        if verbose: print ('sampled_token_index',sampled_token_index.shape,output_tokens.max(),sampled_token_index)
        decoded_sentence.append(sampled_token_index)

        # Exit condition: either hit max length
        # or find stop character.
        if (sampled_token_index == dataset.word2index['<s>'] or
           len(decoded_sentence) > max_decoder_seq_length):
          stop_condition = True

        # Update the target sequence (of length 1).
        target_seq = np.zeros((1, 1))
        target_seq[0, 0] = sampled_token_index

        # Update states
        states_value = [h, c]

    res= np.array(decoded_sentence) #[dataset.index2word[index] for index in decoded_sentence]
    return res

  

def show_sample(data_type='val',teacher_forcing=False,sample_id=(),replace_style=('t_bbe.csv','t_ylt.csv'),show_other_style=True):
  """ 
    data_type - train/val/test
    teacher_forcing : default False, decode by argmax sampling as in a normal test. If True, feed the decoder input from the dataset itself
    sample_id : list of verse ids to show
    replace_style : style tokens to decode each verse into
    show_other_style : if True, also print the same verse in the other style
  """
  for i in sample_id:
    print ('#'*30,'verb',i,'#'*30)
    data={'train': dataset.result.train,'val':dataset.result.val , 'test':dataset.result.test}

    # extract the i`th row data
    one_x= data[data_type][0][i:i+1]
    if one_x.shape[0]==0:
      print ('sample out of range',data[data_type][0].shape)
      continue  # nothing to show for this id; the indexing below would fail
    one_x_d= data[data_type][1][i:i+1]
    style_as_text = dataset.index2word[one_x_d[0,0]]


    print (f'encoder_input  [{style_as_text}]:',dataset.one_x_as_text(one_x))
    if (show_other_style):
      # temporary: the same verse in the other style sits half the dataset later (+8000 train / +2000 val). TODO: handle other layouts
      other_i=i + len(data[data_type][0])//2
      other_x= data[data_type][0][other_i]
      other_style=  dataset.index2word[data[data_type][1][other_i][0]]
      print (f'encoder_input  [{other_style}]:',dataset.one_x_as_text(other_x))

    for new_style in replace_style:
      # always overwrite the style token that starts the decoder input
      one_x_d= np.copy(one_x_d)
      one_x_d[0,0]=dataset.word2index[new_style]

      if teacher_forcing:
        p = model.predict([one_x,one_x_d])
        print (f'decoder TF     [{new_style}]:',dataset.one_y_as_text(p))
      else:
        # decode with the requested style, not the verse's original one
        p = decode_sequence(one_x,new_style)
        print (f'decoder sample [{new_style}]:',dataset.one_x_as_text(p))
          
          
# please see more in the error-analysis section
#show_sample('train',sample_id=[0,1,2],teacher_forcing=True) 
#show_sample('val',sample_id=[0,1,2],teacher_forcing=True) 

    

20
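
The TODO above `decode_sequence` mentions choosing the next token randomly by its distribution instead of by argmax. A minimal sketch of that idea follows. The helper name `sample_token` and its `temperature` and `rng` parameters are assumptions made for illustration; they are not part of the notebook.

```python
import numpy as np

def sample_token(probs, temperature=1.0, rng=None):
    """Draw one token index from a probability vector (hypothetical helper)."""
    rng = np.random.default_rng() if rng is None else rng
    probs = np.asarray(probs, dtype=np.float64).ravel()
    # Rescale in log space: temperature < 1 sharpens, > 1 flattens the distribution.
    logits = np.log(probs + 1e-12) / temperature
    logits -= logits.max()  # numerical stability before exponentiating
    scaled = np.exp(logits)
    scaled /= scaled.sum()
    return int(rng.choice(len(scaled), p=scaled))

# A very low temperature behaves like argmax.
print(sample_token([0.1, 0.7, 0.2], temperature=0.01))
```

Inside `decode_sequence` this would presumably replace the `np.argmax(output_tokens)` call, fed with the last time step of `output_tokens` (the commented-out `[0, -1, :]` slice).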


# TRAINING

In [0]:
import keras
from keras.callbacks import ReduceLROnPlateau
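class LossHistory(keras.callbacks.Callback):
    """Collects the train and validation loss at the end of every epoch."""
    def __init__(self):
      super().__init__()  # let Keras initialise the callback's own attributes
      self.losses = {'loss':[],'val_loss':[]}
      
    #def on_train_begin(self, logs={}):
    #  pass  
    
    def on_epoch_end(self, epoch, logs=None):
      logs = logs or {}
      for loss in ['loss','val_loss']:
        self.losses[loss].append(logs.get(loss))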
        
np.random.seed(42)


def gen(t,batch_size,gen_type):
    """gen_type = g/d/adv"""
    x_encoder,x_decoder,y_one_hot_decoder,y_style = t
    WORDS=x_encoder.shape[1]
    while 1:
      ind=np.random.randint(len(x_encoder),size=batch_size)
      x_batch=x_encoder[ind]  # index once per batch, not once per word
      
      #build decoder target data on the fly
      one_hot = np.zeros((batch_size,WORDS,len(dataset.index2word)))
      for row in range(batch_size):
         for w in range(WORDS):  # w, so the input tuple t is not shadowed
            one_hot[row,w,int(x_batch[row,w])] = 1
      
      if gen_type=='g':
        yield ([x_batch,x_decoder[ind]],one_hot)
      elif gen_type=='d':
        yield (x_batch,y_style[ind])
      elif gen_type=='adv':
        yield ([x_batch,x_decoder[ind]],[one_hot,y_style[ind]]) #y_one_hot_decoder[ind]
      else:
        raise ValueError('gen_type unknown',gen_type)



import matplotlib.pyplot as plt 

# summarize history for loss
def plt_losses(loss_history,title,with_val=False):
  plt.plot(loss_history.losses['loss'][:])
  if with_val:
    plt.plot(loss_history.losses['val_loss'][:])
  plt.title(title)
  med=0.1 if len(loss_history.losses['loss'])<1 else np.median(loss_history.losses['loss'])
  plt.ylim(top=med+1.5)  # `ymax=` was removed in newer matplotlib releases
  plt.ylabel('loss')
  plt.xlabel('epoch')
  plt.legend(['train', 'val'] if with_val else ['train'], loc='upper right')
  

def plt_all():  
  plt.figure(figsize=(14,4))
  plt.subplot(131) #numrows, numcols, fignum
  plt_losses(loss_history,'g loss')  
  plt.subplot(132)
  plt_losses(loss_history_d,'d loss')  
  plt.subplot(133)
  plt_losses(loss_history_adv,'adv loss') 
  plt.show()


  
loss_history = LossHistory()
loss_history_d = LossHistory()
loss_history_adv = LossHistory()  


def train_g(steps,validation_steps=1):
    #set_trainable(model,True)  
    
    model.fit_generator(gen(dataset.result.train,batch_size,'g'),
                        steps,
                        validation_steps=validation_steps,
                        validation_data=gen(dataset.result.val,batch_size,'g'),
                        callbacks=[loss_history],
                        verbose=0 if steps<50 else 1,
                        #max_queue_size=10,
                        #workers=2
                       )
    #model.fit([x_train, x_train_d], y_train,

    
def train_d(steps,validation_steps=1):
  #set_trainable(d,True)  #only for warning
  #set_trainable(model,False)  
  #d.trainable=True
  d.fit_generator(gen(dataset.result.train,batch_size,'d'),
                        steps,
                        validation_steps=validation_steps,
                        validation_data=gen(dataset.result.val,batch_size,'d'),
                        verbose=0 if steps<50 else 1,
                        callbacks=[loss_history_d])

  # d.fit(x_train, style_train,

  
def train_adv(steps,validation_steps=1):
  #adv_model.fit([x_train, x_train_d], [y_train,style_train],
  #set_trainable(d,False) 
  #set_trainable(model,True)  
  adv_model.fit_generator(gen(dataset.result.train,batch_size,'adv'),
                        steps,
                        validation_steps=validation_steps,
                        validation_data=gen(dataset.result.val,batch_size,'adv'),
                        verbose=0 if steps<50 else 1,
                        callbacks=[loss_history_adv])
  
 
  
#gener=gen(dataset.result.train,10,'d')
#x1,y=next(gener)
#print ('test gen batch 2',x1.shape,y.shape,y)

#plt_all()     
#print_trainable(d)
#train_d(500,10)
#train_g(50,10)  # THIS KILLS train_d after it
#train_adv(50,10)  # THIS KILLS train_d after it
#train_d(500,10)
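
`gen` builds the one-hot decoder targets with two nested Python loops for every batch. The sketch below compares that loop with a vectorized NumPy version that yields the same array. The function names are made up for this example, and `token_ids` stands in for one batch of `x_encoder`.

```python
import numpy as np

def one_hot_loop(token_ids, vocab_size):
    """Reference version: the same double loop used in gen."""
    batch, words = token_ids.shape
    out = np.zeros((batch, words, vocab_size))
    for row in range(batch):
        for w in range(words):
            out[row, w, int(token_ids[row, w])] = 1
    return out

def one_hot_vectorized(token_ids, vocab_size):
    """Same result with a single fancy-indexing assignment."""
    batch, words = token_ids.shape
    out = np.zeros((batch, words, vocab_size))
    rows, cols = np.indices((batch, words))  # index grids of shape (batch, words)
    out[rows, cols, token_ids.astype(int)] = 1
    return out

ids = np.array([[0, 2, 1], [3, 3, 0]])
print(np.array_equal(one_hot_loop(ids, 4), one_hot_vectorized(ids, 4)))
```

With a batch of 64 rows and a large vocabulary, the vectorized form should remove most of the Python-level overhead from the generator, although the dense target array itself stays the same size.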

In [52]:
steps=50
validation_steps=1
d.fit_generator(gen(dataset.result.train,batch_size,'d'),
                        steps,
                        validation_steps=validation_steps,
                        validation_data=gen(dataset.result.val,batch_size,'d'),
                        callbacks=[loss_history_d],
                        verbose=0 if steps<50 else 1,
                       )

model.fit_generator(gen(dataset.result.train,batch_size,'g'),
                        steps,
                        validation_steps=validation_steps,
                        validation_data=gen(dataset.result.val,batch_size,'g'),
                        callbacks=[loss_history],
                        verbose=0 if steps<50 else 1,
                       )
d.fit_generator(gen(dataset.result.train,batch_size,'d'),
                        steps,
                        validation_steps=500,
                        validation_data=gen(dataset.result.val,batch_size,'d'),
                        callbacks=[loss_history_d],
                        verbose=0 if steps<50 else 1,
                       )



#set_trainable(model,False)  
#d.trainable=True
#print_trainable(d)
#train_d(50,10)
#train_g(1,10)
#train_d(50,10)


Epoch 1/1
 5/50 [==>...........................] - ETA: 1s - loss: 0.6930 - acc: 0.5281

  'Discrepancy between trainable weights and collected trainable'


Epoch 1/1
Epoch 1/1


<keras.callbacks.History at 0x7faf51e3aac8>

In [39]:


          



epoc = len(dataset.result.train[0])//batch_size  # generator steps per epoch

# start with a good pretraining of G for a few epochs


for i in range(20):
  train_g(epoc*1,validation_steps=100)  #pretrain
  l=loss_history.losses['loss'][-1]
  if l<1.0:
    print ('early break at loss',l,'epoc',i)
    break
show_sample('train',sample_id=[0,1,2],teacher_forcing=True)   #,8000+0

for i in range(20):
  train_d(epoc,validation_steps=100)
  l=loss_history_d.losses['loss'][-1]
  if l<2:
    print ('early break at loss',l,'epoc',i)
    break


small_steps=5
for e in range(10):
  print ('epocs',e)
  for i in range(int(epoc/small_steps)):
    print ('expect to wait a minute here...')
    train_d(small_steps)
    train_adv(small_steps)
  show_sample('train',sample_id=[0],teacher_forcing=True)   #,8000+0
  plt_all()
  
print ('done')


# Expect reasonable results from G on 10-word sentences once the loss is below 2. If not, continue training a bit.

Epoch 1/1
  6/250 [..............................] - ETA: 2:22 - loss: 9.1770

KeyboardInterrupt: ignored

In [11]:
train_d(epoc,validation_steps=100) # does NOT train when run after train_g
train_g(5,validation_steps=5)
train_d(epoc,validation_steps=100) # does NOT train when run after train_g

NameError: ignored

# Error analysis

In [379]:

for i in [1,2,3]:
  print ('\n','#'*10,'VAL hope for modest results','verb',i,'#'*30)
  show_sample('val',sample_id=[i],teacher_forcing=False)      
    

for i in [0,1,2,3]:
  print ('\n','#'*10,'TRAIN - expect good results','verb',i,'#'*30)
  show_sample('train',sample_id=[i],teacher_forcing=False)   
    


 ########## VAL hope for modest results verb 1 ##############################
encoder_input  [t_bbe.csv]: "Then David said, You are not to do this, my brothers, after what the Lord has given us, who has
encoder_input  [t_ylt.csv]: "And David saith, `Ye do not do so, my brethren, with that which Jehovah hath given to us, and He
decoder sample [t_bbe.csv]: "Then David said, You are not to do what my heart up on the day of my lord's ears, so that
decoder sample [t_ylt.csv]: "Then David said, You are not to do what my heart up on the day of my lord's ears, so that

 ########## VAL hope for modest results verb 2 ##############################
encoder_input  [t_bbe.csv]: Who is going to give any attention to you in this <oov> for an equal part will be given to
encoder_input  [t_ylt.csv]: "and who doth hearken to you in this thing? for as the portion of him who was brought down into
decoder sample [t_bbe.csv]: Who is not able to give honour to the children of Israel, but any other god for Is
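
The greedy samples above start well and then drift. Beam search, listed as a TODO next to `decode_sequence`, keeps several hypotheses alive instead of committing to the single best token at each step. The sketch below is a generic version under stated assumptions: `step_fn` is a hypothetical callback that returns next-token probabilities for a prefix, and `toy_step` is a made-up model for demonstration. Neither exists in the notebook, where `step_fn` would have to wrap `decoder_model.predict` and carry the LSTM states.

```python
import numpy as np

def beam_search(step_fn, start_token, end_token, beam_width=3, max_len=20):
    """step_fn(prefix) -> 1-D array of next-token probabilities (hypothetical interface)."""
    beams = [([start_token], 0.0)]  # (token list, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            log_probs = np.log(np.asarray(step_fn(tokens), dtype=np.float64) + 1e-12)
            # Only the beam_width best continuations of a beam can survive pruning.
            for tok in np.argsort(log_probs)[::-1][:beam_width]:
                candidates.append((tokens + [int(tok)], score + float(log_probs[tok])))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            (finished if tokens[-1] == end_token else beams).append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # fall back to unfinished hypotheses at max_len
    return max(finished, key=lambda c: c[1])

def toy_step(prefix):
    """Toy model over 4 tokens: 0=start, 1=end, 2 and 3 are words."""
    table = {0: [0.0, 0.0, 0.6, 0.4],
             2: [0.0, 0.34, 0.33, 0.33],
             3: [0.0, 0.99, 0.005, 0.005]}
    return table[prefix[-1]]

print(beam_search(toy_step, 0, 1, beam_width=1)[0])  # greedy path
print(beam_search(toy_step, 0, 1, beam_width=2)[0])  # wider beam
```

In the toy model the greedy path takes the locally best first word and ends with a lower total probability than the path found with a beam of 2. Raw log-probability sums favor short outputs, so a real decoder would probably want some form of length normalization.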

In [287]:



#a= model.predict([x_train[s:e], x_train_d[s:e]])
#for i in range(dataset.MAX_SEQUENCE_LENGTH):
#  best=np.argmax(a[0,i])
#  print (i,best,dataset.index2word[best],a[0,i,best],a[0,i,0])

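#NEED TO FIX CODE HERE
from keras import backend as K
from keras.losses import categorical_crossentropy
p=model.predict([x_val,x_val_d])
# Keras losses take (y_true, y_pred); summing over the words gives one score per sample
scores=K.eval(K.sum(categorical_crossentropy(K.constant(y_val), K.constant(p)),axis=1))
worse_10 = scores.argsort()[::-1][:10]

for i in range(len(worse_10)):
  bad=worse_10[i]
  print (i,'arg',bad,'score',scores[bad])
  show_sample('val',teacher_forcing=False,sample_id=[bad])  # prints by itself, returns None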
  


ValueError: ignored
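
The cell above ranks validation samples by their summed cross-entropy. The same ranking can be computed in plain NumPy, without building backend tensors. This is only a sketch: the helper name `worst_samples` is invented here, and it assumes one-hot targets and predicted probabilities of shape `(samples, words, vocab)`.

```python
import numpy as np

def worst_samples(y_true, y_pred, k=10, eps=1e-7):
    """Return indices of the k highest-loss samples and all per-sample scores."""
    y_pred = np.clip(y_pred, eps, 1.0)                   # avoid log(0)
    per_word = -(y_true * np.log(y_pred)).sum(axis=-1)   # (samples, words)
    scores = per_word.sum(axis=1)                        # (samples,)
    order = np.argsort(scores)[::-1][:k]                 # highest loss first
    return order, scores

# Three samples, two words, vocabulary of two tokens; the target is always token 0.
y_true = np.zeros((3, 2, 2))
y_true[:, :, 0] = 1
y_pred = np.empty((3, 2, 2))
for i, prob in enumerate([0.9, 0.1, 0.5]):
    y_pred[i, :, 0] = prob
    y_pred[i, :, 1] = 1 - prob

order, scores = worst_samples(y_true, y_pred, k=2)
print(order, scores)
```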

## Error of style disc.

In [438]:
### Error of style discriminator





i=100

#print (style_val[i:i+10])
#print(d.predict(x_val[i:i+10]))#, style_train,

p=1900 #switch point between style in db
print('train',d.evaluate(x_train[i:i+p],style_train[i:i+p]))
print('val',d.evaluate(x_val[i:i+p],style_val[i:i+p]))

print_trainable(d)


train [1.1920930376163597e-07, 1.0]
val [8.483208160400398, 0.47368421052631576]
-0.0010554983 b 0.0
Epoch 1/1
-0.0010554983 b 0.0
encoder_inputs <keras.engine.topology.InputLayer object at 0x7fd81e9e82e8> False
embedding_9 <keras.layers.embeddings.Embedding object at 0x7fd81e9e8198> True
cu_dnnlstm_17 <keras.layers.cudnn_recurrent.CuDNNLSTM object at 0x7fd81ecdca58> True
concatenate_16 <keras.layers.merge.Concatenate object at 0x7fd81ea1d240> False
dropout_29 <keras.layers.core.Dropout object at 0x7fd81ea1d1d0> False
d_dense1 <keras.layers.core.Dense object at 0x7fd81ea1d390> False
dropout_30 <keras.layers.core.Dropout object at 0x7fd81ea1d550> False
d_dense_softmax <keras.layers.core.Dense object at 0x7fd81ea1d128> False
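
A single accuracy figure over a slice can hide a discriminator that always predicts the same style, especially when the slice sits near the switch point between styles. A per-style breakdown makes that visible. The sketch assumes the style labels are one-hot rows and the predictions are softmax outputs; the function name is made up for this example.

```python
import numpy as np

def per_class_accuracy(y_true_one_hot, y_pred_probs):
    """Accuracy computed separately for every class present in y_true."""
    true_cls = np.asarray(y_true_one_hot).argmax(axis=1)
    pred_cls = np.asarray(y_pred_probs).argmax(axis=1)
    return {int(c): float((pred_cls[true_cls == c] == c).mean())
            for c in np.unique(true_cls)}

# A classifier that always answers "style 0" scores 50% overall on balanced data.
y_true = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])
y_pred = np.array([[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]])
print(per_class_accuracy(y_true, y_pred))
```

In the notebook this would take `style_val[i:i+p]` and `d.predict(x_val[i:i+p])` as inputs, assuming those arrays have the shapes described above.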


# FAQ and weird exceptions

* If you get exceptions related to the CUDA LSTM inside show_sample even though you are not actually using it, restart the notebook (I suspect a TF issue/bug).

In [54]:
# If d is not training (accuracy stays close to 0.5), check this: for some reason it is not trainable.
# The solution is to recompile and redefine train_d. TODO: find the source of the problem
# Keras warning seen in this case:
#  Discrepancy between trainable weights and collected trainable weights, did you set `model.trainable` without calling `model.compile` after ?
def print_d_mean():
  print(d.get_layer('d__dense_softmax').get_weights()[0].mean(),'b',d.get_layer('d__dense_softmax').get_weights()[1].mean())
print_d_mean()

#set_trainable(d,True)    
#set_trainable(model,False) #setting it back(it's parts of d)
#d.compile(optimizer='adam', loss='categorical_crossentropy',metrics=['accuracy'])


#def train_d(steps,validation_steps=1):
#set_trainable(d,True)  #only for warning
#set_trainable(model,False)
#d.trainable=True

print_d_mean()

d.fit_generator(gen(dataset.result.train,batch_size,'d'),
                      250,
                      validation_steps=50,
                      validation_data=gen(dataset.result.val,batch_size,'d'),
                      verbose=1,
                      callbacks=[loss_history_d])

print_d_mean()




0.017797627 b 0.0


  'Discrepancy between trainable weights and collected trainable'


Epoch 1/1
0.017797627 b 0.0
0.017797627 b 0.0
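
The unchanged weight means above, together with the warning text, suggest that some layers of `d` were frozen when it was last compiled. The sketch below shows how freezing one model can silently freeze layers that another model shares. `set_trainable` and `frozen_layers` are hypothetical stand-ins for the notebook's `set_trainable` and `print_trainable`, which are defined elsewhere, and `SimpleNamespace` objects replace real Keras models so the example runs without a GPU or Keras.

```python
from types import SimpleNamespace

def set_trainable(model, flag):
    """Hypothetical helper: mark a model and all of its layers (non-)trainable."""
    model.trainable = flag
    for layer in model.layers:
        layer.trainable = flag

def frozen_layers(model):
    """Names of the layers whose trainable flag is False."""
    return [layer.name for layer in model.layers if not layer.trainable]

# Stand-ins for Keras layers; the embedding is shared by both models.
embedding = SimpleNamespace(name='embedding', trainable=True)
decoder = SimpleNamespace(name='decoder', trainable=True)
d_dense = SimpleNamespace(name='d_dense1', trainable=True)
g = SimpleNamespace(layers=[embedding, decoder], trainable=True)
d = SimpleNamespace(layers=[embedding, d_dense], trainable=True)

set_trainable(g, False)   # freezing g also freezes the layer it shares with d
print(frozen_layers(d))
```

With real Keras models the flags are only picked up at compile time, as the warning quoted in the cell above says, so any change to `trainable` would need to be followed by another `compile` call on the affected model.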
