---
# PREPARATION

In [88]:
import tensorflow as tf
import tensorflow.keras.backend as K
import numpy as np
from tensorflow.keras import layers, Model
import os
from tensorflow.keras.layers import TextVectorization
import string
import re

In [89]:
def save_dataset(dataset,fileName):
  path = os.path.join('./tfDatasets/', fileName)
  tf.data.experimental.save(dataset, path)

def load_dataset(fileName):
  path = os.path.join("./tfDatasets/", fileName)
  new_dataset = tf.data.experimental.load(path,
      tf.TensorSpec(shape=(), dtype=tf.string))
  return new_dataset

---
# Building an Efficient TensorFlow Input Pipeline for Character-Level Text Generation

To load the data into the pipeline, I use `tf.data.TextLineDataset`, which yields one string element per line of the input file.

In [90]:
batch_size = 64
#raw_data_ds = tf.data.TextLineDataset(["nietzsche.txt"])
raw_data_ds = tf.data.TextLineDataset(["story.txt"])

Let's see some lines from the uploaded text:

In [91]:
for elems in raw_data_ds.take(10):
    print(elems.numpy().decode("utf-8"))

rick grew up in a troubled household . he never found good support in family , and turned to gangs . it was n't long before rick got shot in a robbery . the incident caused him to turn a new leaf .
laverne needs to prepare something for her friend 's party . she decides to bake a batch of brownies . she chooses a recipe and follows it closely . laverne tests one of the brownies to make sure it is delicious .
sarah had been dreaming of visiting europe for years . she had finally saved enough for the trip . she landed in spain and traveled east across the continent . she did n't like how different everything was .
gina was worried the cookie dough in the tube would be gross . she was very happy to find she was wrong . the cookies from the tube were as good as from scratch . gina intended to only eat 2 cookies and save the rest .
it was my final performance in marching band . i was playing the snare drum in the band . we played thriller and radar love . the performance was flawless .
i ha

---
# 3. COMBINE ALL LINES INTO A SINGLE TEXT 

Since our aim is to prepare a train dataset for a **character-level** text generator, we need to **convert the line-by-line text into char-by-char text**. 

Therefore, we first combine all line-by-line text as **a single text**:

In [92]:
# Join all lines into one string (no separator is inserted between lines)
text = "".join(elem.numpy().decode("utf-8") for elem in raw_data_ds)


print(text[:1000])

rick grew up in a troubled household . he never found good support in family , and turned to gangs . it was n't long before rick got shot in a robbery . the incident caused him to turn a new leaf .laverne needs to prepare something for her friend 's party . she decides to bake a batch of brownies . she chooses a recipe and follows it closely . laverne tests one of the brownies to make sure it is delicious .sarah had been dreaming of visiting europe for years . she had finally saved enough for the trip . she landed in spain and traveled east across the continent . she did n't like how different everything was .gina was worried the cookie dough in the tube would be gross . she was very happy to find she was wrong . the cookies from the tube were as good as from scratch . gina intended to only eat 2 cookies and save the rest .it was my final performance in marching band . i was playing the snare drum in the band . we played thriller and radar love . the performance was flawless .i had bee
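
The output above shows that consecutive stories run together (`new leaf .laverne`), because the lines are concatenated without a separator. If that is not intended, a space can be placed between lines. A minimal sketch, assuming two sample lines; the list `lines` is a hypothetical stand-in for the decoded elements of `raw_data_ds`:

```python
# Hypothetical stand-in for the decoded lines of raw_data_ds
lines = [
    "the incident caused him to turn a new leaf .",
    "laverne needs to prepare something for her friend 's party .",
]

glued = "".join(lines)    # same result as the concatenation above
spaced = " ".join(lines)  # alternative: keep one space between stories

print(glued)
print(spaced)
```

Whether the extra space matters depends on the model; it only changes which character follows the final period of each story.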

---
# 4. SPLIT THE TEXT INTO TOKENS

## Check the size of the corpus

In [93]:
print("Corpus length:", int(len(text)/1000),"K chars")

Corpus length: 516 K chars


In [94]:
chars = sorted(list(set(text)))
print("Total distinct chars:", len(chars))

Total distinct chars: 85


## Set the split parameters

We split the text into two parallel lists:
* The first list (**`input_chars`**) holds the **input data** (X): fixed-size character sequences of length **`maxlen`**.
* The second list (**`next_char`**) holds the **output data** (y): the single character that follows each input sequence.

The **`step`** parameter sets how many characters we advance between consecutive sequences. With `step` smaller than `maxlen`, neighbouring sequences overlap.

We define all these parameters below:


In [95]:
# cut the text in semi-redundant sequences of maxlen characters
maxlen = 20
step = 3
input_chars = []
next_char = []

Using the above parameters, we can split the text into **input (X)** and **output (y)** sequences:

In [96]:
for i in range(0, len(text) - maxlen, step):
    input_chars.append(text[i : i + maxlen])
    next_char.append(text[i + maxlen])

## Check the generated sequences
After splitting the text, we can check the number of sequences and see a sample input and output:

In [97]:
print("Number of sequences:", len(input_chars))
print("input X  (input_chars)  --->   output y (next_char) ")

for i in range(5):
  print( input_chars[i],"   --->  ", next_char[i])



Number of sequences: 172247
input X  (input_chars)  --->   output y (next_char) 
rick grew up in a tr    --->   o
k grew up in a troub    --->   l
rew up in a troubled    --->    
 up in a troubled ho    --->   u
 in a troubled house    --->   h
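
The windowing logic is easier to follow on a short string. The sketch below wraps the same loop in a hypothetical helper, `make_windows`; the toy text and parameter values are illustrative only.

```python
def make_windows(text, maxlen, step):
    """Return (inputs, targets): each input is `maxlen` chars long and
    each target is the single char that follows it."""
    inputs, targets = [], []
    for i in range(0, len(text) - maxlen, step):
        inputs.append(text[i : i + maxlen])
        targets.append(text[i + maxlen])
    return inputs, targets

toy_text = "hello world"  # 11 characters
X, y = make_windows(toy_text, maxlen=5, step=2)

for inp, tgt in zip(X, y):
    print(repr(inp), "--->", repr(tgt))
```

The number of windows equals `len(range(0, len(text) - maxlen, step))`, which is why a corpus of roughly 516 K characters with `step = 3` yields about 172 K sequences.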


---
# 5. CREATE X & y DATASETS

We can turn these two lists into **X and y datasets** with **`tf.data.Dataset.from_tensor_slices()`**:

In [98]:
X_train_ds_raw=tf.data.Dataset.from_tensor_slices(input_chars)
y_train_ds_raw=tf.data.Dataset.from_tensor_slices(next_char)

Let's see some input-output pairs:

In [99]:
for elem1, elem2 in zip(X_train_ds_raw.take(5),y_train_ds_raw.take(5)):
   print(elem1.numpy().decode('utf-8'),"----->", elem2.numpy().decode('utf-8'))

rick grew up in a tr -----> o
k grew up in a troub -----> l
rew up in a troubled ----->  
 up in a troubled ho -----> u
 in a troubled house -----> h


---
# 6. PREPROCESS THE TEXT

We need to process these datasets before feeding them into a model. 




## What are the preprocessing steps?

Processing each sample involves the following steps:

* **standardize** each sample (usually lowercasing + punctuation stripping):

  In this tutorial, we write a **custom standardization function** to show how to strip unwanted characters and symbols with your own code.

* **split** each sample into substrings (usually words):

  Because we work with **fixed-size character sequences** here, we write a **custom split function** that splits on characters.

* **recombine** substrings into tokens (usually n-grams):

  We keep 1-grams, so each token is a single character.

* **index tokens** (associate a unique int value with each token)

* **transform** each sample using this index, either into a vector of ints or a dense float vector.

In [100]:
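def custom_standardization(input_data):
    # Lowercase, drop HTML line breaks, replace digits and hyphens with
    # spaces, then strip all ASCII punctuation.
    lowercase     = tf.strings.lower(input_data)
    stripped_html = tf.strings.regex_replace(lowercase, "<br />", " ")
    # Raw string so that \d reaches the regex engine unchanged
    stripped_num  = tf.strings.regex_replace(stripped_html, r"[\d-]", " ")
    stripped_punc = tf.strings.regex_replace(stripped_num,
                             "[%s]" % re.escape(string.punctuation), "")
    return stripped_punc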

def char_split(input_data):
  return tf.strings.unicode_split(input_data, 'UTF-8')

def word_split(input_data):
  return tf.strings.split(input_data)
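
To see what the standardization does to a sample, the same replacements can be reproduced with Python's `re` module. This is only a plain-Python stand-in for `custom_standardization` (the name `standardize_py` is hypothetical), and `tf.strings.lower` may treat non-ASCII letters differently from `str.lower`.

```python
import re
import string

def standardize_py(s):
    """Plain-Python approximation of custom_standardization for ASCII text."""
    s = s.lower()
    s = s.replace("<br />", " ")
    # Digits and hyphens become a space
    s = re.sub(r"[\d-]", " ", s)
    # Remaining ASCII punctuation is deleted, not replaced
    s = re.sub("[%s]" % re.escape(string.punctuation), "", s)
    return s

print(repr(standardize_py("Gina ate 2 cookies .")))
print(repr(standardize_py("it was n't long")))
print(repr(standardize_py(".")))
```

Because punctuation is deleted rather than replaced, a sample that consists only of punctuation becomes an empty string.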

### Set the text vectorization parameters

* `max_features` caps the vocabulary size. The cap includes the two reserved entries for padding and `[UNK]`.
* We set an explicit `sequence_length`, since our model needs **fixed-size** input sequences.


In [101]:
# Model constants.
max_features = 96           # Number of distinct chars / words  
embedding_dim = 16             # Embedding layer output dimension
sequence_length = maxlen       # Input sequence size


### Create the text vectorization layer

* The **text vectorization layer** is initialized below.
* We use this layer to normalize, split, and map strings to integers, so we set `output_mode` to `"int"`.


In [102]:
vectorize_layer = TextVectorization(
    standardize=custom_standardization,
    max_tokens=max_features,
    split=char_split, # word_split or char_split
    output_mode="int",
    output_sequence_length=sequence_length,
)

### Adapt the Text Vectorization layer to the train dataset

Now that the **Text Vectorization layer** has been created, we can call `adapt` on a text-only dataset to create the vocabulary with indexing. 

Batching is optional here, but for very large datasets it avoids keeping spare copies of the dataset in memory.

In [103]:
vectorize_layer.adapt(X_train_ds_raw.batch(batch_size))
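
Conceptually, `adapt` counts the tokens in the dataset and assigns indices by descending frequency, after reserving index 0 for padding (`''`) and index 1 for out-of-vocabulary tokens (`'[UNK]'`). A rough plain-Python sketch of that idea follows. `build_vocab` is a hypothetical helper, not part of Keras, and its tie-breaking between equally frequent tokens is an assumption that may differ from the layer's.

```python
from collections import Counter

def build_vocab(samples, max_tokens):
    """Toy vocabulary builder: reserved tokens first, then the most
    frequent chars, capped at max_tokens entries in total."""
    counts = Counter(ch for s in samples for ch in s)
    # Sort by descending count; ties are broken alphabetically (an assumption)
    ranked = sorted(counts, key=lambda ch: (-counts[ch], ch))
    return ["", "[UNK]"] + ranked[: max_tokens - 2]

vocab = build_vocab(["a tree", "a bee"], max_tokens=6)
print(vocab)
```

With `max_tokens=6` only four real characters fit, so the rarer ones would later be mapped to `[UNK]`.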

We can take a look at the size of the vocabulary:

In [104]:
print("The size of the vocabulary (number of distinct characters): ", len(vectorize_layer.get_vocabulary()))

The size of the vocabulary (number of distinct characters):  29


Let's see the first 10 entries in the vocabulary:

In [105]:
print("The first 10 entries: ", vectorize_layer.get_vocabulary()[:10])

The first 10 entries:  ['', '[UNK]', ' ', 'e', 't', 'a', 'o', 'h', 'i', 'n']


You can access the vocabulary by using an index:

In [106]:
vectorize_layer.get_vocabulary()[3]

'e'

After preparing the **Text Vectorization layer**,  we need a helper function to **convert a given raw text to a Tensor** by using this layer:

In [107]:
def vectorize_text(text):
  text = tf.expand_dims(text, -1)
  return tf.squeeze(vectorize_layer(text))

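A simple test of the function. Characters that are not in the vocabulary (here `ö` and `ü`) map to index 1 (`[UNK]`), the `?` is removed by the standardization step, and the result is padded with 0 up to `sequence_length`: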

In [108]:
vectorize_text("Ne ister gönül?")

<tf.Tensor: shape=(20,), dtype=int64, numpy=
array([ 9,  3,  2,  8, 10,  4,  3, 11,  2, 17,  1,  9,  1, 13,  0,  0,  0,
        0,  0,  0])>
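
The mapping that `vectorize_text` performs can be imitated in a few lines of plain Python, which makes the roles of index 0 (padding) and index 1 (`[UNK]`) explicit. The helper `encode` and the tiny vocabulary are hypothetical, and the input is assumed to be already standardized.

```python
def encode(text, vocab, sequence_length):
    """Map chars to indices, using 1 for unknown chars, then truncate
    or right-pad with 0 to exactly sequence_length entries."""
    index = {ch: i for i, ch in enumerate(vocab)}
    ids = [index.get(ch, 1) for ch in text][:sequence_length]
    return ids + [0] * (sequence_length - len(ids))

toy_vocab = ["", "[UNK]", " ", "e", "t", "a"]

print(encode("eat tea", toy_vocab, 10))
print(encode("exit", toy_vocab, 5))
print(encode("", toy_vocab, 3))
```

The last call shows a sample that is empty after standardization: it becomes all padding. With the standardization used here, that is what happens to a lone punctuation mark such as `.`.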

### Apply the **Text Vectorization** to the X and y datasets

In [109]:
# Vectorize the data.
X_train_ds = X_train_ds_raw.map(vectorize_text)
y_train_ds = y_train_ds_raw.map(vectorize_text)

X_train_ds.element_spec, y_train_ds.element_spec

(TensorSpec(shape=(20,), dtype=tf.int64, name=None),
 TensorSpec(shape=(20,), dtype=tf.int64, name=None))

### Convert **y** to a single char representation

In [110]:
y_train_ds=y_train_ds.map(lambda x: x[0])

In [111]:
for elem in y_train_ds.take(1):
  print("shape: ", elem.shape, "\n next_char: ",elem.numpy())

shape:  () 
 next_char:  6


### Check the tensor dimensions to ensure that we have max-sequence size inputs and a single output:

In [112]:
X_train_ds.take(1), y_train_ds.take(1)

(<TakeDataset element_spec=TensorSpec(shape=(20,), dtype=tf.int64, name=None)>,
 <TakeDataset element_spec=TensorSpec(shape=(), dtype=tf.int64, name=None)>)

### Let's see some example pairs:

In [113]:
for (X,y) in zip(X_train_ds.take(5), y_train_ds.take(5)):
  print(X.numpy()," --> ",y.numpy())

[11  8 16 23  2 17 11  3 14  2 15 21  2  8  9  2  5  2  4 11]  -->  6
[23  2 17 11  3 14  2 15 21  2  8  9  2  5  2  4 11  6 15 22]  -->  13
[11  3 14  2 15 21  2  8  9  2  5  2  4 11  6 15 22 13  3 12]  -->  2
[ 2 15 21  2  8  9  2  5  2  4 11  6 15 22 13  3 12  2  7  6]  -->  15
[ 2  8  9  2  5  2  4 11  6 15 22 13  3 12  2  7  6 15 10  3]  -->  7


# 7. FINALIZE THE DATA PIPELINE

## Join the input (X) and output (y) values as a single dataset

In [114]:
train_ds =  tf.data.Dataset.zip((X_train_ds,y_train_ds))

## Set data pipeline optimizations
Cache the processed samples and prefetch batches asynchronously so that data preparation overlaps with training on the GPU:

In [115]:
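AUTOTUNE = tf.data.AUTOTUNE
# Cache before shuffling so that every epoch is reshuffled; calling cache()
# after shuffle() would replay the order of the first epoch.
train_ds = train_ds.cache().shuffle(buffer_size=512).batch(batch_size, drop_remainder=True).prefetch(buffer_size=AUTOTUNE)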

## Check the size of the dataset (in batches):

In [116]:
print("The size of the dataset (in batches): ", train_ds.cardinality().numpy())

The size of the dataset (in batches):  2691
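
The batch count follows directly from the number of sequences reported earlier and the batch size. A quick check in plain Python, using the values from this notebook:

```python
num_sequences = 172247  # "Number of sequences" printed earlier
batch_size = 64

full_batches, leftover = divmod(num_sequences, batch_size)
print("full batches:", full_batches)           # kept by drop_remainder=True
print("samples left out per epoch:", leftover)  # the incomplete last batch
```

Dropping the incomplete batch keeps every batch at exactly `batch_size` samples, at the cost of leaving a handful of samples out of each epoch.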


## Again, let's check the tensor dimensions of **input X** and **output y**:

In [117]:
for sample in train_ds.take(1):
  print("input (X) dimension: ", sample[0].numpy().shape, "\noutput (y) dimension: ",sample[1].numpy().shape)

input (X) dimension:  (64, 20) 
output (y) dimension:  (64,)


## Basic LSTM Model

In [118]:
# Define the model.
# An integer input for vocab indices.
inputs = tf.keras.Input(shape=(sequence_length,), dtype="int64")
# Next, we add a layer to map those vocab indices into a space of dimensionality
# 'embedding_dim'.
x = layers.Embedding(max_features, embedding_dim)(inputs)
#x = layers.Dropout(0.5)(x)
x = layers.LSTM(128, return_sequences=True)(x)
x = layers.Flatten()(x)
predictions = layers.Dense(max_features, activation='softmax')(x)
model_LSTM = tf.keras.Model(inputs, predictions, name="model_LSTM")
# Compile the model.
model_LSTM.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model_LSTM.summary())


Model: "model_LSTM"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_3 (InputLayer)        [(None, 20)]              0         
                                                                 
 embedding_2 (Embedding)     (None, 20, 16)            1536      
                                                                 
 lstm_2 (LSTM)               (None, 20, 128)           74240     
                                                                 
 flatten_2 (Flatten)         (None, 2560)              0         
                                                                 
 dense_2 (Dense)             (None, 96)                245856    
                                                                 
Total params: 321,632
Trainable params: 321,632
Non-trainable params: 0
_________________________________________________________________
None
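
The parameter counts in the summary can be reproduced by hand, which is a useful check that the layers are wired as intended. The sketch below uses the standard formulas for embedding, LSTM and dense layers together with the constants defined earlier.

```python
vocab_size, embedding_dim, units, seq_len = 96, 16, 128, 20

# Embedding: one vector of size embedding_dim per vocabulary entry
embedding_params = vocab_size * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * ((embedding_dim + units) * units + units)

# Flatten turns (seq_len, units) into seq_len * units features for Dense
flat_features = seq_len * units
dense_params = flat_features * vocab_size + vocab_size

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)
```

Most of the weights sit in the final dense layer, because `return_sequences=True` followed by `Flatten` feeds the LSTM output of all 20 time steps into it.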


In [119]:
model_LSTM.fit(train_ds, epochs=3) #, validation_data= train_ds.skip(20000)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x7f536c9c7f50>

In [120]:
model_LSTM.save("model_LSTM")



INFO:tensorflow:Assets written to: model_LSTM/assets




In [121]:
model_LSTM = tf.keras.models.load_model('model_LSTM')

In [122]:
vectorize_text("Ayrılıktan parça parça olmuş")


<tf.Tensor: shape=(20,), dtype=int64, numpy=
array([ 5, 19, 11,  1, 13,  1, 23,  4,  5,  9,  2, 21,  5, 11,  1,  5,  2,
       21,  5, 11])>

In [123]:
def sample(preds, temperature=0.2):
    # helper function to sample an index from a probability array
    preds=np.squeeze(preds)
    
    preds = np.asarray(preds).astype("float64")
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)

    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
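
The effect of `temperature` is easiest to see on a small distribution. The sketch below isolates the reweighting step of `sample` in a hypothetical helper, `apply_temperature`, and applies it to a made-up three-class probability vector.

```python
import numpy as np

def apply_temperature(preds, temperature):
    """Rescale a probability vector to p_i ** (1 / temperature), renormalized."""
    preds = np.asarray(preds, dtype="float64")
    logits = np.log(preds) / temperature
    exp_logits = np.exp(logits)
    return exp_logits / np.sum(exp_logits)

probs = [0.6, 0.3, 0.1]  # made-up model output
for t in [0.2, 1.0, 1.2]:
    print(t, np.round(apply_temperature(probs, t), 3))
```

A temperature of 1 leaves the distribution unchanged. Values below 1 sharpen it toward the most likely character, and values above 1 flatten it, so rarer characters are sampled more often.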

In [135]:
def generate_text(model, seed_original, step, seed):
    # NOTE: the `seed` argument is overwritten below and therefore ignored;
    # text is always generated from `seed_original`.
    seed = vectorize_text(seed_original)
    decode_sentence(seed.numpy().squeeze())
    for diversity in [0.2, 0.5, 1.0, 1.2]:
        print("...Diversity:", diversity)
        # Start every run from the same vectorized seed, shaped (1, sequence_length)
        seed = vectorize_text(seed_original).numpy().reshape(1, -1)
        generated = seed
        for i in range(step):
            predictions = model.predict(seed)
            # Sample the next char index; `diversity` is the sampling temperature
            next_index = sample(predictions, diversity)
            generated = np.append(generated, next_index)
            # Slide the window: keep only the last `sequence_length` indices
            seed = generated[-sequence_length:].reshape(1, sequence_length)
        decode_sentence(generated)
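
The core of `generate_text` is a sliding window: predict one index, append it, and keep only the last `sequence_length` indices as the next input. The sketch below shows that loop on its own. The function `fake_predict` is a hypothetical stand-in for `model.predict`, and greedy `argmax` replaces `sample` so that the result is deterministic.

```python
import numpy as np

VOCAB_SIZE = 5
WINDOW = 3  # plays the role of sequence_length

def fake_predict(window):
    """Stand-in for model.predict: puts all probability on
    (last index + 1) modulo VOCAB_SIZE."""
    probs = np.zeros((1, VOCAB_SIZE))
    probs[0, (window[0, -1] + 1) % VOCAB_SIZE] = 1.0
    return probs

window = np.array([[0, 1, 2]])
generated = window.copy()
for _ in range(4):
    next_index = int(np.argmax(fake_predict(window)))
    generated = np.append(generated, next_index)  # flattens to 1-D
    # Slide the window over the most recent WINDOW indices
    window = generated[-WINDOW:].reshape(1, WINDOW)

print(generated)
print(window)
```

The input to the predictor always has the same shape, no matter how long the generated sequence grows.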
    



In [136]:
def decode_sentence(encoded_sentence):
  # Look up the vocabulary once, then map each index back to its char
  vocab = vectorize_layer.get_vocabulary()
  decoded_sentence = [vocab[index] for index in encoded_sentence]
  sentence = ''.join(decoded_sentence)
  print(sentence)
  return sentence


In [139]:
seed=str(input("Enter the words:\n "))
seed=np.array(list(seed))

Enter the words:
 alice project knife broken


In [140]:
generate_text(model_LSTM, "iteration: ",100,seed)

iteration 
...Diversity: 0.2
iteration  foud of out of the was a sone the was so porect the work and the sonter the was to a lontire and th
...Diversity: 0.5
iteration ficy of to eeaun a off the torteraar the dostert of the fromect gutent th t me jorsatite ang to te
...Diversity: 1.0
iteration  ofole of ennear or ton of asdrows of ithoupe his rejearents rlast in wad in a nama a ducuuns to om 
...Diversity: 1.2
iteration  thik barcutenefes cromscatssencickt a anc tume cutonk in thenik wonkouts ouf of thillkinw wastee
