# 0. Introduction

<h1>RUAK - Are you a Hegel?</h1>

> The greatest challenge to any thinker is stating the problem in a way that will allow a solution.

<a href="https://en.wikipedia.org/wiki/Bertrand_Russell">Bertrand Russell</a>
<br><br>
<h2>About the project</h2>
Philosophy is a fundamental movement of human thought. Everyone is a philosopher; the only question is what kind of philosopher you are. This project tries to answer that question.
Using natural language processing (NLP), texts by different authors are used to train a classifier.
With the help of these texts, any sentence can be assigned to the author whose writing it most resembles.
To understand how written language works and what the differences between authors are, it helps to analyse the context of the sentences. Through visualization it is easier to see structural differences such as average sentence length, word class ratio and the use of <a href="https://en.wikipedia.org/wiki/Stop_word">stop words</a>.
<br><br>
<h2>About this notebook</h2>
You can open this Jupyter notebook in Google Colab to use a GPU and have a nice platform for editing.
<br>
<a href="https://colab.research.google.com/github/googlecolab/colabtools/blob/master/notebooks/colab-github-demo.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

<br>

<h2>Information on use</h2>
<h3>Paths:</h3>
The following data needs to be loaded. Please adjust the paths accordingly (1.2.1): 
<ul>
<li><code>source_path</code> - Path which contains the text files</li>
<li><code>dataframe_path</code> - Path for loading and saving the DataFrame</li>
<li><code>word2vec_path</code> - Path to Word2Vec model</li>
<li><code>hyperband_tuner_output_path</code> - Path to hyperparameter tuner working directory</li>
<li><code>checkpoint_path</code> - Path where the checkpoints from the training process are stored</li>
<li><code>model_h5_path</code> - Path where the model (h5 format) should be stored or loaded from</li>
</ul>
<br>
<h3>Speed:</h3>
Some processes may take a while depending on the settings and the available hardware. Several settings can be changed to speed things up. The total amount of data also determines the overall speed. If possible, use a machine with a GPU, such as Google Colab.
<ul>
<li>The easiest way to speed up all processes is to switch to <code>test_mode</code> (1.2.1). This will have a strong impact on the results. Lemmatization and POS tagging are <b>not</b> disabled in <code>test_mode</code>.</li>
<li>Adjust the parameters to fit your needs (1.2.1)
  <ul>
    <li><code>epochs</code> - Iterations for training</li>
    <li><code>search_epochs</code> - Iterations for finding the best hyperparameters</li>
    <li><code>executions_per_trial</code> - Number of models that should be built and fit for each trial for robustness purposes.</li>
    <li><code>hyperband_iterations</code> - The number of times to iterate over the full Hyperband algorithm.</li>
  </ul>
</li>
<li>POS tagging - this process uses Spacy to tag every word in a sentence (4.2.1). Set <code>pos_tagging_enabled</code> to <code>False</code> to skip it.</li>
<li>Prepare values for visualization (6.1.1.) - if <code>lemmatization_enabled</code> is set to <code>True</code> the list of unique vocabulary for each author is lemmatized. This will slow down the process.</li>
</ul>
<br>
<h3>Additional information:</h3>    
<br><br>
<h2>Content</h2>

* [1. Preparations](#1)
* [2. Loading text data](#2)
* [3. Collect data and create word collections](#3)
* [4. Create and extend DataFrame](#4)
* [5. Store or load DataFrame](#5)
* [6. Visualization of data](#6)
* [7. Prepare and split](#7)
* [8. Hyperparameter tuning](#8)
* [9. Model preparation and training](#9)
* [10. Save or load model](#10)
* [11. Evaluation](#11)
* [12. TensorBoard](#12)






# 1. Preparations
**Set the language**: `english` or `german`

In [None]:
language = 'english'

Install Keras Tuner and the Spacy language model. You may need to install more dependencies if you don't run this in Google Colab.

In [None]:
!rm -rf ./logs/
import spacy.cli

!pip install nltk
!pip install -q -U keras-tuner
!pip install tensorboard

if language == 'german':
  spacy.cli.download('de_core_news_md')
elif language == 'english':
  spacy.cli.download('en_core_web_sm')
else:
  raise ValueError("'language' set to an invalid value!")

Only needed for Google Drive and Colab

In [None]:
if 'google.colab' in str(get_ipython()):
  from google.colab import drive
  drive.mount('/content/drive')

## 1.1. Imports

In [None]:
import urllib, IPython, os, datetime, re, nltk, tensorboard, operator, random
import tensorflow as tf
from tensorflow import keras
import tensorflow_hub as hub
from tensorflow.keras.layers import Dense, Embedding, LSTM, Bidirectional, Dropout, BatchNormalization 
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping, TensorBoard
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import tensorflow.keras.activations as activations
import tensorflow.keras.losses as losses
import tensorflow.keras.optimizers as optimizers
import kerastuner as kt
from tensorflow.keras.utils import plot_model
import matplotlib.pyplot as plt
from gensim.models import Word2Vec
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from wordcloud import WordCloud
from collections import Counter
import spacy
from spacy import displacy


if language == 'german':
  import de_core_news_md
  from spacy.lang.de.stop_words import STOP_WORDS
elif language == 'english':
  import en_core_web_sm
  from spacy.lang.en.stop_words import STOP_WORDS
else:
  raise ValueError("'language' set to an invalid value!")

### 1.1.1. Downloads for NLTK and Spacy

In [None]:
nltk.download('punkt')
spacy.prefer_gpu()

if language == 'german':
  nlp = de_core_news_md.load()
elif language == 'english':
  nlp = en_core_web_sm.load()
else:
  raise ValueError("'language' set to an invalid value!")

## 1.2. Magic functions and global variables

Magic functions

In [None]:
%matplotlib inline
%load_ext tensorboard

### **1.2.1. Set variables and paths** <a class="anchor" id="1-2-1"></a>

Set `session_id` for providing unique file names

In [None]:
session_id = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")  # no '/' or ':' so the id is safe in file and directory names

**This is the place where some information is needed. Please go through the steps and modify the information according to your needs.**

To free some memory after the training and validation data is created, set `auto_free_memory` to `True`.

In [None]:
auto_free_memory = False

To test the notebook, set `test_mode` to `True`. POS tagging (4.2.1.) and lemmatization (6.1.) will **not** be disabled.

In [None]:
test_mode = True

Paths used for storing and loading. They should **never** end with `/` or a file extension.

In [None]:
source_path = '/content/drive/My Drive/RUAK/input/processed'
dataframe_path = '/content/drive/My Drive/RUAK/output/dataframe'
word2vec_path = '/content/drive/My Drive/RUAK/output/embedding/w2v'
w2v_model_name = 'full_700_iter100_win7_8'
hyperband_tuner_output_path = '/content/drive/My Drive/RUAK/output/hp_tuning'
checkpoint_path = '/content/drive/My Drive/RUAK/output/training_checkpoints'
model_h5_path = '/content/drive/My Drive/RUAK/output/models'
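
The rule above (no trailing `/`, no file extension) can be checked before any files are written. The following is a minimal sketch; `check_path` is a hypothetical helper that is not part of the notebook, and it treats any suffix reported by `os.path.splitext` as a file extension.

```python
import os

def check_path(path):
    """Return True if `path` has no trailing slash and no file extension."""
    if path.endswith('/'):
        return False
    # os.path.splitext splits off the extension of the last path component.
    _, extension = os.path.splitext(path)
    return extension == ''

# Hypothetical example paths, for illustration only.
print(check_path('/content/drive/My Drive/RUAK/output/models'))    # True
print(check_path('/content/drive/My Drive/RUAK/output/models/'))   # False
print(check_path('/content/drive/My Drive/RUAK/output/model.h5'))  # False
```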

List of files to process and author names. Files should be named after the author (e.g. `plato.txt`). `file_names` should contain at least 3 files.

In [None]:
file_names = [
    'kant.txt', 
    'nietzsch.txt', 
    'platon.txt', 
    'rousseau.txt']

Parameters needed for tuning and training.

In [None]:
batch_size=40
epochs=30
search_epochs=20
early_stopping_patience=5
executions_per_trial=3
hyperband_iterations=3

## 1.3. [Stop words](https://en.wikipedia.org/wiki/Stop_word)

In [None]:
def replace_umlaut(string):
    string = string.replace('ä', 'ae')
    string = string.replace('ö', 'oe')
    string = string.replace('ü', 'ue')
    string = string.replace('Ä', 'Ae')
    string = string.replace('Ö', 'Oe')
    string = string.replace('Ü', 'Ue')
    return string.replace('ß', 'ss')

stop_words = set([replace_umlaut(word) for word in STOP_WORDS])
print(f'Stop words count: {len(stop_words)}.')
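
The chained `replace` calls can also be written as a single translation table. This is only an alternative sketch with the same mapping; `UMLAUT_TABLE` and `replace_umlaut_table` are names introduced here for illustration.

```python
# Mapping from German special characters to their ASCII transliteration.
UMLAUT_TABLE = str.maketrans({
    'ä': 'ae', 'ö': 'oe', 'ü': 'ue',
    'Ä': 'Ae', 'Ö': 'Oe', 'Ü': 'Ue',
    'ß': 'ss',
})

def replace_umlaut_table(string):
    """Transliterate umlauts and 'ß' in one pass over the string."""
    return string.translate(UMLAUT_TABLE)

print(replace_umlaut_table('Über die Größe'))  # Ueber die Groesse
```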

# 2. Loading text data

In [None]:
if len(file_names) < 3:
  raise ValueError("'file_names' should contain at least 3 files. Add more files at (1.2.1)!")

if test_mode == True:
  file_names = file_names[0:3]

for file_name in file_names:
  text_dir = tf.keras.utils.get_file(file_name, origin=f'file://{source_path}/{file_name}')

parent_dir = os.path.dirname(text_dir)

# 3. Collect data and create word collections

## 3.1. Prepare word collections

In [None]:
author_names = [name[:-4].capitalize() for name in file_names]

Function to add words to `words_without_stop_words` and `unique_words_without_stop_words`.

In [None]:
def add_words(sentence):
  for word in sentence.split():
    word = re.sub(r"[^a-zA-Z]+", "", word)
    if word == '' or len(word) == 1:
      continue
    if word.lower() not in stop_words:
      words_without_stop_words.append(word)
      unique_words_without_stop_words.add(word)

words_without_stop_words = []
unique_words_without_stop_words = set()
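
To see what `add_words` keeps, the same filtering steps can be applied to a single sentence. The sketch below is self-contained: `filter_words` is a hypothetical helper and `demo_stop_words` is a tiny made-up stop word set, not the Spacy list used by the notebook.

```python
import re

# Tiny stand-in for the real stop word list (assumption for illustration).
demo_stop_words = {'the', 'of', 'is', 'a'}

def filter_words(sentence, stop_words):
    """Strip non-letters, then drop empty tokens, single letters and stop words."""
    kept = []
    for word in sentence.split():
        word = re.sub(r"[^a-zA-Z]+", "", word)
        if len(word) <= 1:
            continue
        if word.lower() not in stop_words:
            kept.append(word)
    return kept

print(filter_words("The limits of reason, I think, is a question.", demo_stop_words))
# ['limits', 'reason', 'think', 'question']
```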

## 3.2. Extract sentences

Extract sentences from the files and create the list of labels. Adjust the language for `nltk.sent_tokenize` if needed.

In [None]:
labels = []
sentences = []

for index, file_name in enumerate(file_names):

  path = os.path.join(parent_dir, file_name)

  # Read as text (UTF-8 assumed) so the sentences do not carry bytes-literal artifacts such as "b'".
  with open(path, 'r', encoding='utf-8') as file:
    text = file.read()
    nltk_sentences = nltk.sent_tokenize(text, language=language)

    for sentence in nltk_sentences:
      sentences.append(sentence)
      labels.append(index)
      add_words(sentence)

    print(f"Sentences for {file_name} with label: {index} added.")

print(f'\n{len(sentences)} sentences found.')
print(f'{len(words_without_stop_words)} words found (excl. stop words).')
print(f'{len(unique_words_without_stop_words)} unique words found (excl. stop words).')

Collect the most common words, excluding stop words.

In [None]:
most_common = [word[0] for word in Counter(words_without_stop_words).most_common(20)]
most_common_count = {k: v for k, v in Counter(words_without_stop_words).most_common(20)}
print('Most common words:')
print(most_common)
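
`Counter.most_common(n)` returns `(word, count)` pairs sorted by descending count, so both the word list and the count dictionary can be derived from a single call. A small sketch with made-up words:

```python
from collections import Counter

# Made-up word list for illustration.
words = ['reason', 'will', 'reason', 'truth', 'will', 'reason']

top = Counter(words).most_common(2)      # list of (word, count) pairs
top_words = [word for word, _ in top]    # only the words
top_counts = dict(top)                   # word -> count

print(top_words)   # ['reason', 'will']
print(top_counts)  # {'reason': 3, 'will': 2}
```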

## 3.3. Clean data

In [None]:
def short_sentences(length):
  result = [sentence for sentence in sentences if len(sentence.split()) <= length]
  print(f'Found {len(result)} sentences with {length} words or fewer.\n')
  return result

def long_sentences(length):
  result = [sentence for sentence in sentences if len(sentence.split()) >= length]
  print(f'Found {len(result)} sentences with {length} words or more.\n')
  return result

### 3.3.1. Remove sentences

Set min and max length for sentences

In [None]:
min_length = 6
max_length = 400

Get invalid sentences

In [None]:
invalid_sentences = short_sentences(min_length) + long_sentences(max_length)
print(f'Found {len(invalid_sentences)} invalid sentences.')
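
The two filters can also be expressed as one predicate that keeps a sentence only if its word count lies strictly between the two bounds. A minimal sketch, where `is_valid_length` is a hypothetical helper and the example sentences are made up:

```python
def is_valid_length(sentence, min_length, max_length):
    """True if the word count lies strictly between min_length and max_length."""
    word_count = len(sentence.split())
    return min_length < word_count < max_length

examples = [
    "Too short.",
    "This sentence has more than six words in it.",
]
print([is_valid_length(s, 6, 400) for s in examples])  # [False, True]
```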

#### 3.3.1.1. Investigate invalid sentences
Print up to 10 random examples of `invalid_sentences`

In [None]:
for sentence in random.sample(invalid_sentences, min(10, len(invalid_sentences))):
  print(sentence)

Use Spacy [Visualizer](https://spacy.io/usage/visualizers) to show a random invalid sentence.

In [None]:
if tf.test.gpu_device_name() == '': # Needed to avoid a crash in Spacy when running on a GPU.
  doc = nlp(random.choice(invalid_sentences))
  displacy.render(doc, style="dep", jupyter=True, options={'compact': True})

### 3.3.2. Provide cleaned data

In [None]:
cleaned_labels = []
cleaned_sentences = []
invalid_sentence_set = set(invalid_sentences)  # set lookup is much faster than list lookup
print(f"'sentences' list length before removal: {len(sentences)}.")
for index, sentence in enumerate(sentences):
  if sentence not in invalid_sentence_set:
    cleaned_sentences.append(sentence)
    cleaned_labels.append(labels[index])
print(f"'sentences' list length after removal: {len(cleaned_sentences)}.")
print(f'{len(sentences) - len(cleaned_sentences)} sentences removed.')

# 4. Create and extend DataFrame
Some helper methods

In [None]:
def stop_word_ratio_fn(sentence):
  count = 0
  for word in sentence.split():
    word = re.sub(r"[^a-zA-Z]+", "", word)
    if word.lower() in stop_words:
      count += 1
  return round(count/len(sentence.split()) * 100, 2)

def stop_word_count_fn(sentence):
  count = 0
  for word in sentence.split():
    word = re.sub(r"[^a-zA-Z]+", "", word)
    if word.lower() in stop_words:
      count += 1
  return count

def mean_word_length_fn(sentence):
  return round(np.array([len(word) for word in sentence.replace('.','').split()]).mean(), 2)

def pos_count(sentence, pos):
  doc = nlp(sentence)
  return len([w.pos_ for w in doc if w.pos_ == pos])
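
As a worked example of the two stop word metrics: the ratio is the stop word count divided by the total number of whitespace-separated tokens, times 100. The sketch below uses a made-up stop word set (`demo_stop_words`) and hypothetical function names, and derives the ratio from the count so the filtering logic lives in one place.

```python
import re

demo_stop_words = {'the', 'is', 'of'}  # made-up set for illustration

def stop_word_count(sentence, stop_words):
    """Number of tokens that are stop words after stripping non-letters."""
    tokens = (re.sub(r"[^a-zA-Z]+", "", w).lower() for w in sentence.split())
    return sum(1 for token in tokens if token in stop_words)

def stop_word_ratio(sentence, stop_words):
    """Stop word share in percent, rounded to two decimals."""
    total = len(sentence.split())
    return round(stop_word_count(sentence, stop_words) / total * 100, 2)

sentence = "The will is the essence of the world."
print(stop_word_count(sentence, demo_stop_words))  # 5
print(stop_word_ratio(sentence, demo_stop_words))  # 62.5
```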

## 4.1. Create DataFrame

In [None]:
df = pd.DataFrame({'label': cleaned_labels, 'sentence': cleaned_sentences})
df.head()

Remove 90% of the rows for `test_mode`

In [None]:
if test_mode == True:
  print(f'Before drop: {df.shape}')
  df = df.drop(df.sample(frac=0.9).index)
  print(f'After drop: {df.shape}')
else:
  print('Test mode not enabled - nothing dropped.')

## 4.2. Construct new data

In [None]:
df['author'] = df['label'].map(lambda x: author_names[x])
df['word_count'] = df['sentence'].str.split().str.len()
df['mean_word_length'] = df['sentence'].map(mean_word_length_fn)
df['stop_words_ratio'] = df['sentence'].map(stop_word_ratio_fn)
df['stop_words_count'] = df['sentence'].map(stop_word_count_fn)

### 4.2.1. POS tagging <a class="anchor" id="4-2-1"></a>
Add columns and values for [POS tagging](https://en.wikipedia.org/wiki/Part-of-speech_tagging). A list of tags can be found [here](https://spacy.io/api/annotation). **This may take a while!**

In [None]:
pos_tagging_enabled = False

In [None]:
if pos_tagging_enabled == True:
  pos_tags = ['ADJ', 'ADV', 'ADP', 'AUX', 'DET', 'NUM', 'X', 'INTJ', 'CONJ',
              'CCONJ', 'SCONJ', 'PROPN', 'NOUN', 'PRON', 'PART', 'VERB']
            
  for tag in pos_tags:
    df[f'{tag}_count'] = df['sentence'].map(lambda sen: pos_count(sen, tag))
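
The loop above parses every sentence once per tag. If the tags of a sentence are available as a list, all counts can be taken in one pass with `collections.Counter`. The sketch below uses a hand-written tag list in place of a parsed Spacy document, so the tags shown are illustrative rather than real tagger output; `count_pos_tags` is a hypothetical helper.

```python
from collections import Counter

pos_tags = ['ADJ', 'NOUN', 'VERB', 'DET']  # shortened tag list for the example

def count_pos_tags(tags, wanted):
    """Count each wanted tag; tags that are not in `wanted` are ignored."""
    counts = Counter(tags)
    return {f'{tag}_count': counts.get(tag, 0) for tag in wanted}

# Hand-written tags for: "The old philosopher writes books"
tags = ['DET', 'ADJ', 'NOUN', 'VERB', 'NOUN']
print(count_pos_tags(tags, pos_tags))
# {'ADJ_count': 1, 'NOUN_count': 2, 'VERB_count': 1, 'DET_count': 1}
```

With Spacy, `tags` would be `[token.pos_ for token in nlp(sentence)]`, so each sentence is parsed only once.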

## 4.3. Preview processed DataFrame

In [None]:
df.head(df.shape[0])

# 5. Store or load DataFrame

Save DataFrame to CSV if needed.

In [None]:
if test_mode == False:
  file_path = f'{dataframe_path}/ruak_dataframe.csv'
else: 
  file_path = f'{dataframe_path}/ruak_dataframe_testing.csv'  

df.to_csv(file_path, index=False)
print(f'DataFrame saved to: {file_path}.')

Load the DataFrame from CSV if needed.

In [None]:
if test_mode == False:
  file_path = f'{dataframe_path}/ruak_dataframe.csv'
else: 
  file_path = f'{dataframe_path}/ruak_dataframe_testing.csv'  

df = pd.read_csv(file_path)
df.head()

# 6. Visualization of data

## 6.1 Prepare values for visualization

Count vocabulary

In [None]:
def vocabulary_count_fn(series, lemmatization):
  vocabulary = set()
  for sentence in series:
    if lemmatization == True:
      words = lemmatize(sentence)
    else:
      words = sentence.split()
    for word in words:
      # Strip punctuation first so that tokens like "the," are recognised as stop words.
      word = re.sub(r'[^a-zA-Z]+', '', word).lower()
      if word and word not in stop_words:
        vocabulary.add(word)
  return len(vocabulary)

[Lemmatize](https://en.wikipedia.org/wiki/Lemmatisation)

In [None]:
def lemmatize(sentence):
  words = set()
  doc = nlp(sentence)
  for word in doc:
    words.add(word.lemma_)
  return list(words)

### 6.1.1. Prepare values for visualization <a class="anchor" id="6-1-1"></a>

In [None]:
lemmatization_enabled = False

Prepare values for visualization. Enable lemmatization to get a more precise `unique_vocabulary_count`. **This will slow down the process!**

In [None]:
median_sentence_length = df.groupby('label')['word_count'].median()
median_stop_words = df.groupby('label')['stop_words_ratio'].median()
sentence_count = df.groupby('label')['sentence'].count()
unique_vocabulary_count = df.groupby('label')['sentence'].apply(lambda ser: vocabulary_count_fn(ser, lemmatization_enabled))

if 'ADJ_count' in df: # Checks if the DataFrame contains POS tagging information.
  # Define the tags before they are used; `pos_tags` is missing when the DataFrame was loaded from CSV.
  pos_tags = ['ADJ', 'ADV', 'ADP', 'AUX', 'DET', 'NUM', 'X', 'INTJ', 'CONJ', 
              'CCONJ', 'SCONJ', 'PROPN', 'NOUN', 'PRON', 'PART', 'VERB']

  pos_df = df.groupby('author')[[f'{tag}_count' for tag in pos_tags]+['word_count']] \
  .sum().apply((lambda x: x/x['word_count']*100), axis=1) \
  .drop('word_count', axis=1)

## 6.2. Draw visualization

### 6.2.1 Data distribution
The data should be equally split between authors.

In [None]:
author_counts = df['author'].value_counts()  # sorted by count, so the labels must come from its index
plt.pie(author_counts,
        explode=np.full(len(author_counts), 0.1),
        radius=2,
        autopct='%1.0f%%', 
        labels=author_counts.index,
        shadow=True,
        startangle=90,
        textprops={'size': 15})
plt.show()

### 6.2.2. Comparing authors

In [None]:
fig, axs = plt.subplots(4,1, figsize=(10,15))
fig.tight_layout(h_pad=6)

axs[0].bar(author_names, sentence_count)
axs[0].set_ylabel('Number of sentences', fontdict={'color':'gray', 'size':12})
axs[0].tick_params(axis='both', colors='gray', labelsize=12)
axs[0].grid()

axs[1].bar(author_names, median_sentence_length)
axs[1].set_ylabel('Median sentence length', fontdict={'color':'gray', 'size':12})
axs[1].tick_params(axis='both', colors='gray', labelsize=12)
axs[1].grid()

axs[2].bar(author_names, unique_vocabulary_count)
axs[2].set_ylabel('Unique vocabulary count\n(excl. stop words)', fontdict={'color':'gray', 'size':12})
axs[2].tick_params(axis='both', colors='gray', labelsize=12)
axs[2].grid()

axs[3].bar(author_names, median_stop_words)
axs[3].set_ylabel('Median stop words ratio', fontdict={'color':'gray', 'size':12})
axs[3].tick_params(axis='both', colors='gray', labelsize=12)
axs[3].grid()

plt.show()

Word classes by authors. A list of tags can be found [here](https://spacy.io/api/annotation).

In [None]:
if 'pos_df' in locals() and 'pos_tags' in locals():
  ax = pos_df.plot(kind='barh', figsize=(15,15))
  ax.set_xlabel('Percentage of word class in vocabulary', fontdict={'color':'gray', 'size':17})
  ax.set_ylabel('Author', fontdict={'color':'gray', 'size':17})
  ax.tick_params(axis='both', colors='gray', labelsize=17)
  ax.legend(pos_tags)
  ax.xaxis.set_tick_params(labeltop=True)
  ax.grid()
  plt.show()
else:
  print("'pos_df' or 'pos_tags' not available!")

### 6.2.3. Common words

In [None]:
fig, axs = plt.subplots(1,1, figsize=(30,10))

axs.bar(most_common_count.keys(), most_common_count.values(), color='orange')
axs.set_ylabel('Number of occurrences', fontdict={'color':'gray', 'size':17})
axs.tick_params(axis='both', colors='gray', labelsize=17)
axs.grid()

plt.show()

### 6.2.4. Sentences by authors
This shows the sentence structure, lemmas, and POS tags of one random sentence from each author.

In [None]:
if tf.test.gpu_device_name() == '': # Needed to avoid a crash in Spacy when running on a GPU.
  for author in author_names:
    sentences_series = df.loc[(df['author'] == author) & (df['sentence'].str.len() < 50)]['sentence']
    print(f'\nSentence by {author}:')
    doc = nlp(sentences_series.sample(n=1).values[0])
    displacy.render(doc, 
                    style="dep", 
                    jupyter=True, 
                    options={'compact': True, 'add_lemma': True})

### 6.2.5. Word cloud

In [None]:
wordcloud = WordCloud(width=5000, 
                      height=4000,
                      max_words=20,  
                      background_color ='black', 
                      stopwords = stop_words, 
                      min_font_size = 10).generate_from_frequencies(most_common_count) 

plt.figure(figsize=(20, 12), facecolor='k', edgecolor ='k') 
plt.imshow(wordcloud) 
plt.axis("off") 
plt.tight_layout(pad=0) 
plt.show() 

# 7. Prepare and split

## 7.1. Tokenize

In [None]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(df['sentence'].values)
print(f"{len(df['sentence'].values)} sentences from {len(file_names)} authors.")
print(f'{len(tokenizer.word_counts)} unique words.')

## 7.2. Encode

In [None]:
encoded_sentences = tokenizer.texts_to_sequences(df['sentence'].values)
padded_sentences = pad_sequences(encoded_sentences, padding='post')

Test the encoder

In [None]:
print(df['sentence'].values[0])
print(np.array(padded_sentences[0]))
print(tokenizer.sequences_to_texts([padded_sentences[0]]))
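
`pad_sequences(..., padding='post')` appends zeros so that every sequence has the length of the longest one. The pure-Python sketch below mirrors that behaviour for illustration only; `pad_post` is a hypothetical helper, not a Keras function, and the token ids are made up.

```python
def pad_post(sequences, value=0):
    """Right-pad every sequence with `value` up to the longest length."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [value] * (max_len - len(seq)) for seq in sequences]

encoded = [[4, 7, 1], [9], [3, 2]]   # made-up token ids
print(pad_post(encoded))
# [[4, 7, 1], [9, 0, 0], [3, 2, 0]]
```

The Keras `Tokenizer` never assigns index 0 to a word, which is why 0 can serve as the padding value and why the embedding layer in section 8 uses `mask_zero=True`.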

## 7.3. Splitting
Create training and validation data for the fitting process.

In [None]:
X_train, X_valid, y_train, y_valid = train_test_split(padded_sentences, df['label'].values, test_size=0.1)
print(f'Shape of the split X_train: {X_train.shape}')
print(f'Shape of the split y_train: {y_train.shape}')
print(f'Shape of the split X_valid: {X_valid.shape}')
print(f'Shape of the split y_valid: {y_valid.shape}')
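
Because the split is random, results differ between runs unless a seed is fixed (`train_test_split` accepts `random_state` for this). The following pure-Python sketch shows the idea of a seeded 90/10 split; `split_indices` is a hypothetical helper and not how scikit-learn implements it.

```python
import random

def split_indices(n_samples, test_size=0.1, seed=42):
    """Shuffle indices with a fixed seed and use the first share for validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_valid = max(1, round(n_samples * test_size))
    return indices[n_valid:], indices[:n_valid]

train_idx, valid_idx = split_indices(20)
print(len(train_idx), len(valid_idx))  # 18 2
```

Calling `split_indices(20)` again returns the same two index lists, which makes later comparisons between tuning runs easier to interpret.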

# 8. Hyperparameter tuning

Free some space

In [None]:
if auto_free_memory == True:
  del df
  del wordcloud
  del encoded_sentences
  del sentences
  del labels

## 8.1. Setup the hypermodel

### 8.1.1. Load the Word2Vec model
For German, the custom Word2Vec model is used.

In [None]:
def embedding_matrix_custom_model():
    model = Word2Vec.load(f'{word2vec_path}/{w2v_model_name}.model')
    embedding_matrix = np.zeros((len(model.wv.vocab), model.vector_size))
    for i in range(len(model.wv.vocab)):
        embedding_vector = model.wv[model.wv.index2word[i]]
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
    return embedding_matrix

For English, we use Wiki-words-500 provided by TensorFlow Hub.

In [None]:
def embedding_matrix_hub_model():
    hub_layer = hub.KerasLayer("https://tfhub.dev/google/Wiki-words-500/2", input_shape=[], dtype=tf.string)
    return hub_layer.get_weights()[0]

In [None]:
if language == 'german':
  embedding_matrix = embedding_matrix_custom_model()
elif language == 'english':
  embedding_matrix = embedding_matrix_hub_model()
else:
  raise ValueError("'language' set to an invalid value!")

print(f'Embedding_matrix shape: {embedding_matrix.shape}')

In [None]:
embedding_matrix.shape[1]

In [None]:
# `hub_layer` only exists inside `embedding_matrix_hub_model()`, so inspect the returned matrix instead.
embedding_matrix.shape

### 8.1.2. Define the hypermodel

In [None]:
def hypermodel(hp):

  if test_mode == True:
    hp_dense_count = hp.Int('dense_count', min_value=1, max_value=2, step=1)
    hp_embedding_trainable = hp.Choice('embedding_trainable', [False])
    hp_with_batch_normalization = hp.Choice('with_batch_normalization', [True])
    hp_lstm_units = hp.Int('lstm_units', 32, 64, step=32)
    hp_dropout = hp.Choice('dropout', [0.25])
    hp_learning_rate = hp.Choice('learning_rate', [0.001])
    hp_adam_epsilon = hp.Choice('adam_epsilon', values=[1e-08])
  else:
    hp_dense_count = hp.Int('dense_count', min_value=1, max_value=7, step=1)
    hp_embedding_trainable = hp.Choice('embedding_trainable', [True, False])
    hp_with_batch_normalization = hp.Choice('with_batch_normalization', [True, False])
    hp_lstm_units = hp.Int('lstm_units', 256, 512, step=128)
    hp_dropout = hp.Choice('dropout', [0.0, 0.1, 0.25, 0.5])
    hp_learning_rate = hp.Choice('learning_rate', [0.01, 0.001, 0.0001])
    hp_adam_epsilon = hp.Choice('adam_epsilon', values=[1e-07, 1e-08])

  model = tf.keras.Sequential()

  model.add(Embedding(len(embedding_matrix),
            output_dim=embedding_matrix.shape[1],
            weights=[embedding_matrix], 
            trainable=hp_embedding_trainable,
            mask_zero=True))

  model.add(Bidirectional(LSTM(hp_lstm_units, return_sequences=True)))
  if hp_with_batch_normalization:
    model.add(BatchNormalization())
  model.add(Dropout(hp_dropout))

  model.add(Bidirectional(LSTM(hp_lstm_units, return_sequences=True)))
  if hp_with_batch_normalization:
    model.add(BatchNormalization())
  model.add(Dropout(hp_dropout))

  model.add(Bidirectional(LSTM(hp_lstm_units, return_sequences=False)))
  if hp_with_batch_normalization:
    model.add(BatchNormalization())
  model.add(Dropout(hp_dropout))

  for i in range(hp_dense_count):

    if test_mode == True:
      hp_dense_units = hp.Int(f'dense_units{i}', 64, 128, step=64)
      hp_dense_activation = hp.Choice(f'dense_activation_{i}', values=['relu'])
    else: 
      hp_dense_units = hp.Int(f'dense_units{i}', 64, 512, step=64)
      hp_dense_activation = hp.Choice(f'dense_activation_{i}', values=['tanh', 'relu'])

    model.add(Dense(hp_dense_units, activation=hp_dense_activation))

  model.add(Dense(len(file_names), activation='softmax'))

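  # The output layer already applies softmax, so the loss receives probabilities (from_logits=False).
  model.compile(optimizer=optimizers.Adam(learning_rate=hp_learning_rate, epsilon=hp_adam_epsilon),
              loss=losses.SparseCategoricalCrossentropy(from_logits=False),
              metrics=['accuracy'])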
  return model
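
The output layer uses `softmax`, so the model emits one probability per author. Sparse categorical cross-entropy on probabilities is simply the negative log of the probability assigned to the true label. A small worked sketch with made-up scores (the function names are introduced here for illustration and are not the Keras implementations):

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]   # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def sparse_categorical_crossentropy(probabilities, true_label):
    """Negative log-probability of the true class."""
    return -math.log(probabilities[true_label])

probs = softmax([2.0, 1.0, 0.1])        # made-up scores for three authors
loss = sparse_categorical_crossentropy(probs, true_label=0)
print([round(p, 3) for p in probs])
print(round(loss, 3))
```

Passing such probabilities to a loss configured with `from_logits=True` would apply softmax a second time, so the loss setting and the output activation have to match.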

## 8.2. Run the tuner
Reduce parameters for `test_mode`

In [None]:
if test_mode == True:
  epochs=4
  search_epochs=1
  early_stopping_patience=4
  executions_per_trial=1
  hyperband_iterations=1

Set variables

In [None]:
max_epochs = epochs+5
project_name = 'RUAK'
verbose = 2
if test_mode == True:
  max_epochs = 1
  project_name = 'RUAK_testing'
  verbose = 0

Prepare the hyperband tuner

In [None]:
tuner = kt.Hyperband(hypermodel,
                     objective='val_accuracy', 
                     executions_per_trial=executions_per_trial,
                     max_epochs=max_epochs,
                     hyperband_iterations=hyperband_iterations,
                     directory=hyperband_tuner_output_path,
                     project_name=project_name,
                     overwrite=True)

Run the tuner to search for the best parameters. The results are the optimal hyperparameters (`best_hps`) and a list of `best_models`.

In [None]:
class ClearTrainingOutput(tf.keras.callbacks.Callback):
  def on_train_end(*args, **kwargs):
    IPython.display.clear_output(wait=True)

tuner.search(X_train, y_train, 
             epochs=search_epochs,
             validation_data = (X_valid, y_valid),
             callbacks = [ClearTrainingOutput(), EarlyStopping('val_accuracy', patience=1)],
             verbose=verbose)

best_hps = tuner.get_best_hyperparameters(1)[0]
best_models = tuner.get_best_models(num_models=3)

tuner.results_summary()

# 9. Model preparation and training

Get summaries of the best models and choose the model for training.

In [None]:
for model in best_models:
  model.summary()

Choose preferred model

In [None]:
chosen_model = best_models[0]
del best_models

Plot model structure

In [None]:
plot_model(chosen_model, show_shapes=True, show_layer_names=True)

## 9.1. Prepare callbacks
TensorBoard preparation

In [None]:
log_dir = os.path.join('logs', session_id)
tb_callback = TensorBoard(log_dir=log_dir, histogram_freq=1)

Create `ModelCheckpoint` and `EarlyStopping` callbacks.

In [None]:
cp_callback = ModelCheckpoint(filepath=f'{checkpoint_path}/cp.ckpt',
                              save_weights_only=True,
                              verbose=2)

es_callback = EarlyStopping('val_accuracy', patience=early_stopping_patience, restore_best_weights=True)
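
`EarlyStopping` with `patience=n` stops training once the monitored value has not improved for `n` consecutive epochs. The sketch below imitates that rule on a made-up list of validation accuracies; it is a simplification (no `min_delta`, no weight restoring) and `stopping_epoch` is a hypothetical helper.

```python
def stopping_epoch(val_accuracies, patience):
    """Return the 0-based epoch at which training would stop, or None."""
    best = float('-inf')
    epochs_without_improvement = 0
    for epoch, value in enumerate(val_accuracies):
        if value > best:
            best = value
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

history = [0.50, 0.61, 0.64, 0.63, 0.64, 0.62]   # made-up values
print(stopping_epoch(history, patience=3))  # 5
print(stopping_epoch(history, patience=5))  # None
```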

## 9.2. Model training

In [None]:
callbacks = [cp_callback, es_callback, tb_callback]
if test_mode == True:
    # Skip checkpointing in test mode.
    callbacks = [es_callback, tb_callback]
    print('Running in test mode!')
 
h = chosen_model.fit(X_train, 
                      y_train, 
                      epochs=epochs, 
                      batch_size=batch_size, 
                      validation_data=(X_valid, y_valid), 
                      callbacks=callbacks,
                      verbose=verbose)

# 10. Save or load model

## 10.1. Save model

In [None]:
if test_mode == False:
  chosen_model.save(f'{model_h5_path}/ruak_model.h5')

## 10.2. Load model

In [None]:
if test_mode == False:
  chosen_model = tf.keras.models.load_model(f'{model_h5_path}/ruak_model.h5')

## 10.3. Load weights

In [None]:
if test_mode == False:
  latest = tf.train.latest_checkpoint(checkpoint_path)  # expects the checkpoint directory, not the file prefix
  chosen_model.load_weights(latest)

# 11. Evaluation

Draw charts to compare training and validation results

In [None]:
fig, axs = plt.subplots(2,1, figsize=(8, 6))

epoch_range = range(len(h.history['accuracy']))  # separate name so the `epochs` setting is not overwritten
axs[0].plot(epoch_range, h.history['accuracy'], color='red', marker='x')
axs[0].plot(epoch_range, h.history['val_accuracy'], color='green', marker='.')
axs[0].legend(labels=['Training accuracy','Validation accuracy'])
axs[0].set_ylabel('Accuracy', fontdict={'color':'gray', 'size':12})
axs[0].tick_params(labelbottom=False)
axs[0].grid()

loss_epoch_range = range(len(h.history['loss']))  # separate name so the `epochs` setting is not overwritten
axs[1].plot(loss_epoch_range, h.history['loss'], color='red', marker='x')
axs[1].plot(loss_epoch_range, h.history['val_loss'], color='green', marker='.')
axs[1].legend(labels=['Training loss','Validation loss'])
axs[1].set_ylabel('Loss', fontdict={'color':'gray', 'size':12})
axs[1].tick_params(labelbottom=False)
axs[1].grid()

plt.show()

Show loss and accuracy

In [None]:
val_loss, val_acc = chosen_model.evaluate(X_valid, y_valid)

print(f'Validation Accuracy: {val_acc}')
print(f'Validation Loss: {val_loss}')

## 11.1. Test the model

Test the model. Add `sample_sentences` to get the probability distribution for each author.

In [None]:
sample_sentences = [
                    # Platon (label 2 with the default `file_names` order)
                    "Das ziemt uns ja auch nicht, Sokrates",
                    # Nietzsche (label 1)
                    "noch hat er seine that nicht ueberwunden.", 
                    # Kant (label 0)
                    "Die Grenzen der Ausdehnung bestimmen die Figur."
                   ]

# Add your own sentences:
# sample_sentences = ["Here you can try your own sentences.", "Let's see what kind of philosopher you are."]

In [None]:
encoded_sample_sentences = tokenizer.texts_to_sequences(sample_sentences)
padded_sample_sentences = pad_sequences(encoded_sample_sentences, maxlen=X_train.shape[1], padding='post')
predictions = chosen_model.predict(padded_sample_sentences)

# Collect the rows in a list first; DataFrame.append was removed in pandas 2.0.
rows = []
for index, prediction in enumerate(predictions):
  for i, pre in enumerate(prediction):
    rows.append({
      'sentence_number': index,
      'author': author_names[i],
      'prediction': pre,
      'sentence': sample_sentences[index]
    })
predictions_df = pd.DataFrame(rows)

predictions_df.head(predictions_df.shape[0])
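
Each row of `predictions` holds one probability per author, so the most likely author is the index of the largest value. A minimal sketch with made-up probabilities; `names` and `rows_demo` are placeholders for `author_names` and `predictions`, and `most_likely_author` is a hypothetical helper:

```python
names = ['Kant', 'Nietzsche', 'Platon']      # placeholder author names
rows_demo = [
    [0.10, 0.20, 0.70],                      # made-up probabilities
    [0.55, 0.30, 0.15],
]

def most_likely_author(probabilities, author_names):
    """Return (author, probability) for the largest probability in the row."""
    best_index = max(range(len(probabilities)), key=probabilities.__getitem__)
    return author_names[best_index], probabilities[best_index]

for row in rows_demo:
    print(most_likely_author(row, names))
# ('Platon', 0.7)
# ('Kant', 0.55)
```

With NumPy arrays the same indices are available as `predictions.argmax(axis=1)`.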

Draw bars for each sample sentence

In [None]:
fig, axs = plt.subplots(len(predictions), 1, figsize=(10,10))
fig.tight_layout(h_pad=6)
for sen_id, pre in enumerate(predictions):
  # Draw each chart once; repeated grid() calls would toggle the grid off again.
  axs[sen_id].barh(author_names, pre)
  axs[sen_id].set_title(f'Sentence {sen_id}: {sample_sentences[sen_id][0:70]}...', 
                        fontdict={'color':'gray', 'size':12, 'fontweight':'bold'})
  axs[sen_id].set_ylabel('Author', fontdict={'color':'gray', 'size':12})
  axs[sen_id].set_xlabel('Probability', fontdict={'color':'gray', 'size':12})
  axs[sen_id].tick_params(axis='both', colors='gray', labelsize=12)
  axs[sen_id].grid()

plt.show()

# 12. TensorBoard <a class="anchor" id="12"></a>

In [None]:
%tensorboard --logdir logs

Clean TensorBoard logs

In [None]:
!rm -rf ./logs/