
# Text Generation using an LSTM in Keras
In this kernel we will go over how to let a network generate text in the style of Sir Arthur Conan Doyle. This kernel is heavily based on the [official Keras text generation example](https://github.com/keras-team/keras/blob/master/examples/lstm_text_generation.py). I also made [a video](https://youtu.be/QtQt1CUEE3w) on text generation using an LSTM network.

Content:
1. [Introduction](#1)
2. [Loading in data](#2)
3. [Preprocessing](#3)  
    3.1 [Map chars to integers](#3.1)  
    3.2 [Split up into subsequences](#3.2)
4. [Building model](#4)  
    4.1 [Helper Functions](#4.1)  
    4.2 [Defining callbacks and training the model](#4.2)
5. [Generate new text](#5)  
6. [Conclusion](#6)

<a id="1"></a> 
## 1. Introduction
Because the order of the characters in a text is important, we will use a recurrent neural network, which can remember its previous inputs.

In [2]:
from __future__ import print_function
from keras.callbacks import LambdaCallback
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.layers import LSTM
from keras.optimizers import RMSprop
from keras.utils import get_file
import numpy as np
import random
import sys
import io
from sklearn import metrics
from sklearn.metrics import mean_squared_error

<a id="2"></a> 
## 2. Loading in data

In [3]:
# Read the whole corpus and lowercase it; the with-block closes the file afterwards
with open('LSTMemerging-trojan.txt', 'r') as f:
    text = f.read().lower()
print('text length', len(text))

text length 4472199


In [5]:
print(text[:400])


alert tcp $home_net any -> $external_net 25 (msg:"et trojan suspicious smtp handshake outbound"; flow:established,to_server; content:"001 ruthere"; depth:11; metadata: former_category malware; reference:url,doc.emergingthreats.net/bin/view/main/2008562; classtype:unknown; sid:2008562; rev:3; metadata:created_at 2010_07_30, updated_at 2010_07_30;)

alert tcp $external_net 25 -> $home_net any (msg:


<a id="3"></a> 
## 3. Preprocessing

<a id="3.1"></a> 
### 3.1 Map chars to integers

Because we will be training on a character level, we need to associate each unique character with a number.
We are going to create two dicts: one that maps each character to an integer, and one that maps each integer back to its character.

In [6]:
chars = sorted(list(set(text)))
print('total chars: ', len(chars))

total chars:  69


In [7]:
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))
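
As a quick check of the two dictionaries, the sketch below builds them for a toy string and round-trips a word through them. The toy string and the `encode` and `decode` helpers are illustrative assumptions, not part of the original notebook.

```python
# Toy corpus standing in for `text` (illustrative only).
toy_text = "alert tcp any"

# Same construction as above: a sorted vocabulary and two lookup tables.
toy_chars = sorted(set(toy_text))
toy_char_indices = {c: i for i, c in enumerate(toy_chars)}
toy_indices_char = {i: c for i, c in enumerate(toy_chars)}


def encode(s, table):
    """Map each character of s to its integer index (hypothetical helper)."""
    return [table[c] for c in s]


def decode(ids, table):
    """Map integer indices back to a string (hypothetical helper)."""
    return "".join(table[i] for i in ids)


ids = encode("tcp", toy_char_indices)
print(ids)
print(decode(ids, toy_indices_char))
```

Because the vocabulary is sorted, the mapping is deterministic for a given corpus, so a saved model can be reused as long as the same text (or the same `chars` list) is available.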

<a id="3.2"></a> 
### 3.2 Split up into subsequences
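We slide a window of length `maxlen` over the text, moving `step` characters at a time. Each window is stored in `sentences`, and the character that immediately follows it is stored in `next_chars` as the prediction target.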

In [8]:
maxlen = 200
step = 3
sentences = []
next_chars = []
for i in range(0, len(text) - maxlen, step):
    sentences.append(text[i: i + maxlen])
    next_chars.append(text[i + maxlen])
print('nb sequences:', len(sentences))

nb sequences: 1490667


In [9]:
print(sentences[:3])
print(next_chars[:3])

['\nalert tcp $home_net any -> $external_net 25 (msg:"et trojan suspicious smtp handshake outbound"; flow:established,to_server; content:"001 ruthere"; depth:11; metadata: former_category malware; refere', 'ert tcp $home_net any -> $external_net 25 (msg:"et trojan suspicious smtp handshake outbound"; flow:established,to_server; content:"001 ruthere"; depth:11; metadata: former_category malware; reference', ' tcp $home_net any -> $external_net 25 (msg:"et trojan suspicious smtp handshake outbound"; flow:established,to_server; content:"001 ruthere"; depth:11; metadata: former_category malware; reference:ur']
['n', ':', 'l']


In [10]:
# Print length
print(len(sentences))

1490667
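
The sequence count can be sanity-checked without building the lists, since it is just the length of the `range` used in the loop. The helper names `count_windows` and `make_windows` are assumptions for illustration; the numbers are the ones printed in this notebook.

```python
def count_windows(text_len, maxlen, step):
    """Number of (window, next_char) pairs the sliding-window loop produces."""
    return len(range(0, text_len - maxlen, step))


def make_windows(text, maxlen, step):
    """Same sliding-window split as in the notebook, usable on any string."""
    windows, targets = [], []
    for i in range(0, len(text) - maxlen, step):
        windows.append(text[i:i + maxlen])
        targets.append(text[i + maxlen])
    return windows, targets


# Text length, maxlen and step as reported in this notebook.
print(count_windows(4472199, 200, 3))

# A tiny example that is easy to verify by hand.
windows, targets = make_windows("abcdefgh", maxlen=4, step=2)
print(windows, targets)
```

A smaller `step` gives more (heavily overlapping) training samples; a larger one gives fewer samples and a smaller memory footprint.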


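We need to reshape our data into a format we can pass to the Keras LSTM. The input has the shape [samples, time steps, features]: one sample per window, one time step per character, and one feature per entry in the vocabulary (a one-hot encoding).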

In [11]:
# One-hot encode each window (x) and the character that follows it (y)
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1
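
To make the [samples, time steps, features] layout concrete, here is a minimal pure-Python sketch of the same one-hot encoding on a toy example, followed by a rough size estimate for the full array. The helper names are hypothetical, and the estimate assumes one byte per boolean element, which is how NumPy stores boolean arrays.

```python
def one_hot_windows(windows, char_to_index):
    """Return nested lists shaped [samples][time steps][features]."""
    n_features = len(char_to_index)
    encoded = []
    for window in windows:
        rows = []
        for ch in window:
            row = [False] * n_features
            row[char_to_index[ch]] = True  # exactly one True per time step
            rows.append(row)
        encoded.append(rows)
    return encoded


def bool_array_bytes(n_samples, n_steps, n_features):
    """Approximate size of a dense boolean array at one byte per element."""
    return n_samples * n_steps * n_features


toy_index = {"a": 0, "b": 1, "c": 2}
toy_x = one_hot_windows(["ab", "bc"], toy_index)
print(toy_x[0])

# Sizes from this notebook: 1,490,667 windows of 200 characters over 69 symbols.
print(bool_array_bytes(1490667, 200, 69) / 1e9, "GB")
```

At roughly 20 GB, the full `x` array may not fit in memory on a typical machine. If that is a problem, a larger `step` or a subset of the text reduces it proportionally.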

In [12]:
print(x[:3])
print(y[:3])

[[[ True False False ... False False False]
  [False False False ... False False False]
  [False False False ... False False False]
  ...
  [False False False ... False False False]
  [False False False ... False False False]
  [False False False ... False False False]]

 [[False False False ... False False False]
  [False False False ... False False False]
  [False False False ... False False False]
  ...
  [False False False ... False False False]
  [False False False ... False False False]
  [False False False ... False False False]]

 [[False  True False ... False False False]
  [False False False ... False False False]
  [False False False ... False False False]
  ...
  [False False False ... False False False]
  [False False False ... False False False]
  [False False False ... False False False]]]
[[False False False False False False False False False False False False
  False False False False False False False False False False False False
  False False False False False Fals

<a id="4"></a> 
## 4. Building model
For this kernel I will use a really small LSTM network, but if you want better results, feel free to replace it with a bigger one.

In [13]:
model = Sequential()
model.add(LSTM(128, input_shape=(maxlen, len(chars))))
model.add(Dropout(0.2)) 
model.add(Dense(len(chars)))
model.add(Activation('softmax'))
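
To get a feel for how small this network is, the sketch below counts its trainable weights by hand. It assumes the standard LSTM formulation with four gates and one bias vector per gate, and a dense layer with bias; the function names are illustrative. `Dropout` and `Activation` add no weights.

```python
def lstm_params(input_dim, units):
    """Weights in a standard LSTM layer: 4 gates, each with input, recurrent and bias terms."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    """Weights in a fully connected layer with a bias vector."""
    return input_dim * units + units


vocab_size = 69   # len(chars) as reported in this notebook
lstm_units = 128  # as in the model definition

total = lstm_params(vocab_size, lstm_units) + dense_params(lstm_units, vocab_size)
print(total)
```

If the assumptions hold, this figure should agree with the total reported by `model.summary()`, which is a convenient way to double-check the architecture.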

In [14]:
optimizer = RMSprop(learning_rate=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy','mae','mse'])

<a id="4.1"></a> 
### 4.1 Helper Functions

This function comes from the Keras [lstm_text_generation example](https://github.com/keras-team/keras/blob/master/examples/lstm_text_generation.py).

It samples an index from a probability array after rescaling the probabilities with a temperature. A temperature below 1 sharpens the distribution towards the most likely character; a temperature above 1 flattens it.

In [15]:
def sample(preds, temperature=1.0):
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)
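
The effect of the temperature is easiest to see on a small distribution. The sketch below repeats only the rescaling step of `sample` in plain Python, without the random draw; `apply_temperature` and the example probabilities are assumptions made for illustration.

```python
import math


def apply_temperature(probs, temperature):
    """Rescale a probability list the same way `sample` does before drawing."""
    logits = [math.log(p) / temperature for p in probs]
    exps = [math.exp(v) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


probs = [0.5, 0.3, 0.2]
for t in [0.2, 0.5, 1.0, 1.2]:
    print(t, [round(p, 3) for p in apply_temperature(probs, t)])
```

Dividing the log-probabilities by the temperature is equivalent to raising each probability to the power `1 / temperature` and renormalizing, which is why a temperature of 1.0 leaves the distribution unchanged.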

Callback function to print text generated by our LSTM at the end of each epoch. It generates text at 4 different temperatures: [0.2, 0.5, 1.0, 1.2]. A temperature of 0.2 produces conservative text that sticks to the most likely characters, while 1.2 produces wilder guesses.

In [17]:
def on_epoch_end(epoch, logs):
    # Function invoked at end of each epoch. Prints generated text.
    print()
    print('----- Generating text after Epoch: %d' % epoch)

    start_index = random.randint(0, len(text) - maxlen - 1)
    for diversity in [0.2, 0.5, 1.0, 1.2]:
        print('----- diversity:', diversity)

        generated = ''
        sentence = text[start_index: start_index + maxlen]
        generated += sentence
        print('----- Generating with seed: "' + sentence + '"')
        sys.stdout.write(generated)

        for i in range(400):
            x_pred = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x_pred[0, t, char_indices[char]] = 1.

            preds = model.predict(x_pred, verbose=0)[0]
            next_index = sample(preds, diversity)
            next_char = indices_char[next_index]

            generated += next_char
            sentence = sentence[1:] + next_char

            sys.stdout.write(next_char)
            sys.stdout.flush()
        print()
print_callback = LambdaCallback(on_epoch_end=on_epoch_end)

<a id="4.2"></a> 
### 4.2 Defining callbacks and training the model

In [18]:
from keras.callbacks import ModelCheckpoint

filepath = "weights.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss',
                             verbose=1, save_best_only=True,
                             mode='min')

In [19]:
from keras.callbacks import ReduceLROnPlateau
reduce_lr = ReduceLROnPlateau(monitor='loss', factor=0.2,
                              patience=1, min_lr=0.001)
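
To see what these settings can do to the learning rate, the sketch below applies the reduction rule by hand: multiply by `factor`, but never go below `min_lr`. It is a simplified stand-in for the callback (it ignores `patience` and the monitoring logic), and `reduced_lr` is a hypothetical helper.

```python
def reduced_lr(current_lr, factor=0.2, min_lr=0.001):
    """Learning rate after one plateau, never going below min_lr."""
    return max(current_lr * factor, min_lr)


lr = 0.01  # starting rate passed to RMSprop in this notebook
schedule = [lr]
for _ in range(3):  # three consecutive plateaus
    lr = reduced_lr(lr)
    schedule.append(lr)
print(schedule)
```

With a starting rate of 0.01, the floor of 0.001 is reached at the second reduction, so further plateaus have no effect. The callback also only acts between epochs, so it needs a run of several epochs to trigger at all.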

In [20]:
callbacks = [print_callback, checkpoint, reduce_lr]

In [21]:
model.fit(x, y, batch_size=256, epochs=1, callbacks=callbacks)

----- Generating text after Epoch: 0
----- diversity: 0.2
----- Generating with seed: "classtype:trojan-activity; sid:2023665; rev:3; metadata:affected_product windows_xp_vista_7_8_10_server_32_64_bit, affected_product mac_osx, attack_target client_endpoint, deployment perimeter, signat"
classtype:trojan-activity; sid:2023665; rev:3; metadata:affected_product windows_xp_vista_7_8_10_server_32_64_bit, affected_product mac_osx, attack_target client_endpoint, deployment perimeter, signature_severity major, created_at 2016_07_01, malware_family abusert, updated_at 2019_09_28;)

alert udp $home_net any -> any 53 (msg:"et trojan abuse.ch ssl blacklist malicious ssl cert (minion cnc checkin"; flow:established,to_server; content:".php?maid="; http_uri; content:"&counat="; http_uri; content:"&user-agent|3a 20|mozilla|0d 0a|"; http_header; content:"post"; nocase; http_uri; content:"&per
----- diversity: 0.5
----- Generating with seed: "classtype:trojan-activity; sid:2023665; rev:3; metadata:affe

<tensorflow.python.keras.callbacks.History at 0x7fe5c0393668>

In [None]:
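# evaluate returns the loss first, then the compiled metrics (accuracy, mae, mse).
# x and y are the training data, so these are not held-out test scores.
scores = model.evaluate(x, y, verbose=0)
print('\nTraining scores:  loss={}, acc={}, mae={}'.format(*scores[:3]))


Training scores:  loss=2.0600175274515076, acc=0.4020946325097816, mae=0.033119961810506984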


<a id="5"></a> 
## 5. Generate new text


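We can generate text using the same approach as the `on_epoch_end` helper function taken from the Keras example: start from a random seed window, predict the next character, append it, and slide the window forward by one character.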

In [None]:
def generate_text(length, diversity):
    # Get random starting text
    start_index = random.randint(0, len(text) - maxlen - 1)
    generated = ''
    sentence = text[start_index: start_index + maxlen]
    generated += sentence
    for i in range(length):
        # One-hot encode the current window
        x_pred = np.zeros((1, maxlen, len(chars)))
        for t, char in enumerate(sentence):
            x_pred[0, t, char_indices[char]] = 1.

        # Predict the next character and slide the window forward by one
        preds = model.predict(x_pred, verbose=0)[0]
        next_index = sample(preds, diversity)
        next_char = indices_char[next_index]
        generated += next_char
        sentence = sentence[1:] + next_char

    return generated

#testScore = model.evaluate(preds, diversity, verbose=0)
#print ('\nTest Scores:  mse={}, mae={}'.format(*testScore))
#testScore = model.evaluate(preds, verbose=0)
#print ('\nTest Scores:  acc={}'.format(*testScore)) 
#print("Accuracy:",metrics.accuracy_score(y_test, predictions))
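
The generation loop itself does not depend on the model: it only needs something that maps the current window to a next character. The sketch below isolates that rolling-window logic with a trivial stand-in predictor, so it runs without Keras. Both `generate_with` and `echo_first` are hypothetical names introduced for illustration.

```python
def generate_with(next_char_fn, seed, length):
    """Roll a fixed-size window forward, appending one predicted character per step."""
    window = seed
    generated = seed
    for _ in range(length):
        next_char = next_char_fn(window)
        generated += next_char
        window = window[1:] + next_char  # drop the oldest character, keep the length
    return generated


def echo_first(window):
    """Stand-in predictor: repeats the oldest character in the window."""
    return window[0]


print(generate_with(echo_first, "abc", 5))
```

In the notebook, the role of `next_char_fn` is played by one-hot encoding the window, calling `model.predict`, and passing the result through `sample`.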

In [None]:
print(generate_text(5000, 0.2))

lf, and shouted out, 'you'd better not dout she mout mout so the much the wat in and the mont the whe the mout the mot so mout the mout it she wat in the mong the mome the she won she wore the the mous so the the what the whe wond and the more the the she pore the whe the she the mout the the mout the what the the the the hame the her the the whe the mone the mont the the coust the mome the the mome the the the were the the moust the moust it and the moust the could the moust whe wor her the she the moust the the the mome the mout the mout wor the whe the whe the more she the the the the wher the she the mont the she the the mong she the was the the she she the ware the sout the she the whe she the whe she the sher the the the she the mout the whe the more the the mout the the the the more the mouse the she mout the mome the wore the the the mont the what the mont she she she wat so the she the the the the mong the she the mout the the the couse the she mong the she the the more the mu

In [None]:
#testScore = model.evaluate(preds, diversity, verbose=0)
#print ('\nTest Scores:  mse={}, mae={}'.format(*testScore))

# print("##############################################################")
# print("Accuracy:",metrics.accuracy_score(y_test, predictions))
# print("Kappa Stats:",metrics.cohen_kappa_score(y_test, predictions))
# print("Precision:",metrics.precision_score(y_test, predictions))
# print("Recall:",metrics.recall_score(y_test, predictions))
# print("Mean Absolute Error:",metrics.mean_absolute_error(y, preds))
# print("Mean Squared Error:",metrics.mean_squared_error(diversity, preds))
# print("F-Measure:",metrics.recall_score(y_test, predictions))
# print("##############################################################")

<a id="6"></a> 
## 6. Conclusion


After 5 epochs our LSTM did an OK job, and I'm more than satisfied with the result.

Here are a few things you can change to get better results:

1. Add more LSTM layers.
2. Use more LSTM cells per layer.
3. Train for more than 5 epochs (25+).
4. Add dropout layers.
5. Play around with the batch size.

