# Project Abysima: Language Generation Experiments

This notebook experiments with generating a language using neural networks and generative deep learning. It is by no
means a production-ready system, nor is it a complete network; rather, the purpose of this experiment is to see what is
possible when creating a language.

For more information on the process and supporting research, please refer to the Linguistics Paper document found in the
`01 - Areas of Responsibility` directory.

The following source code and datasets are licensed under the Mozilla Public License v2.0. Please refer to the LICENSE
file that came with this repository for more information on what your rights are with usage and modification of this
software. If a LICENSE file is not provided, you can obtain a copy at https://www.mozilla.org/en-US/MPL/2.0/.

## Part 1: Collecting the Dataset

Before we can create networks that train on words, sniglets, and non-words, we first need to create a dataset that the
network models will be able to understand.

To do this, we will create two pools of data of equal size: a list of valid words and a list of random strings. The
list of valid words derives from the `/usr/share/dict/words` file found on UNIX and Linux systems, which contains words
from various languages. A list of basic Japanese words written in Romaji is also included in the list.

Words in the valid pool are then trimmed based on the average length of the words in that dataset, removing any words
that are longer than that length. This matters because shorter words must be padded with whitespace to a common
length, and rows dominated by padding could skew what the network learns. Additionally, words that are acronyms,
contractions, and/or fewer than three characters long are removed from the list. The invalid word dataset is generated
by an algorithm that builds strings of randomly selected letters, each between three characters and the average length
long.
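
The filtering and random-string steps described above could be sketched as follows. This is a minimal illustration and
not the contents of `create_dataset.py`: the helper names `filter_words` and `random_string`, the heuristics for
spotting acronyms and contractions, the order of the steps, and the use of a rounded mean as the length cutoff are all
assumptions.

```python
import random
from string import ascii_lowercase


def filter_words(words, min_len=3):
    """Return (kept_words, max_len) after dropping unwanted entries."""
    # Assumption: contractions contain an apostrophe (so isalpha() is False)
    # and acronyms are written entirely in uppercase.
    candidates = [
        w.lower()
        for w in words
        if w.isascii() and w.isalpha() and not w.isupper() and len(w) >= min_len
    ]
    if not candidates:
        return [], min_len
    # Assumption: the cutoff is the rounded mean length of the remaining words.
    mean_len = sum(len(w) for w in candidates) / len(candidates)
    max_len = max(min_len, round(mean_len))
    return [w for w in candidates if len(w) <= max_len], max_len


def random_string(max_len, min_len=3, rng=random):
    """Build a random lowercase string of min_len to max_len characters."""
    length = rng.randint(min_len, max_len)
    return "".join(rng.choice(ascii_lowercase) for _ in range(length))


kept, cutoff = filter_words(["NASA", "don't", "at", "apple", "tree", "extraordinary", "cat"])
invalid_word = random_string(cutoff, rng=random.Random(0))
```

Passing a seeded `random.Random` instance keeps the generated strings reproducible between runs.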

Neural networks need each entry in the dataset to be of the same length, since the entries are fundamentally rows in a
large matrix. To accomplish this, we will append space characters (`' '`) at the end of words when needed; this is
known as _padding a sequence_.
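
As a small illustration of padding, the sketch below right-pads a word with spaces to a fixed width. The function name
`pad_word` is hypothetical, and the width of 8 is an assumption taken from the eight character columns used later in
this notebook.

```python
def pad_word(word, width=8, fill=" "):
    """Right-pad a word with the fill character so it is exactly `width` long."""
    if len(word) > width:
        raise ValueError(f"{word!r} is longer than {width} characters")
    return word.ljust(width, fill)


padded = pad_word("cat")
```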

We also need to be able to indicate which words are valid and which ones were randomly generated. To do this, we will
add an extra column to the dataset that will indicate its validity by writing either "valid" or "invalid".
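
One way to picture a finished row is a list of single characters followed by the label. The sketch below assumes eight
character columns, and `make_row` is a hypothetical helper rather than a function from the project.

```python
def make_row(word, is_valid, width=8):
    """Return one dataset row: `width` character columns plus a label column."""
    if len(word) > width:
        raise ValueError(f"{word!r} does not fit in {width} columns")
    label = "valid" if is_valid else "invalid"
    # ljust pads with spaces; list() splits the padded word into characters.
    return list(word.ljust(width)) + [label]


valid_row = make_row("cat", True)
invalid_row = make_row("xqzvb", False)
```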

The dataset is then shuffled between twenty-five and fifty times so that valid and invalid words are thoroughly mixed
and neither pool ends up containing only one kind. Once this shuffling is complete, we will split the dataset into two
pools, where eighty percent (80%) of the data will go into a training pool, and twenty percent (20%) will go into a
testing pool. The training pool will be used to train the network, and the testing pool will be used to check how
accurately the trained network predicts on data it has not seen.
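
The shuffle and split could look roughly like the sketch below. The function name `shuffle_and_split`, its parameters,
and the choice to truncate the split index are assumptions for illustration, not the project's actual script.

```python
import random


def shuffle_and_split(rows, train_fraction=0.8, passes=25, seed=None):
    """Shuffle the rows several times, then split them into two pools."""
    rng = random.Random(seed)
    shuffled = list(rows)
    for _ in range(passes):
        rng.shuffle(shuffled)
    # Assumption: a fractional split index is truncated toward zero.
    split_at = int(len(shuffled) * train_fraction)
    return shuffled[:split_at], shuffled[split_at:]


train_pool, test_pool = shuffle_and_split(range(99466), seed=1)
```

With 99,466 rows, truncating 0.8 × 99,466 = 79,572.8 gives 79,572 training rows and 19,894 testing rows, which is
consistent with the shapes printed later in this notebook. A single pass of `random.shuffle` already produces a
uniformly random order, so the repeated passes mirror the description rather than add randomness.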

We then take these two pools and write them to CSV files which all of our models will be able to read. The script that
implements this process is available in `create_dataset.py` in the project's root.

In [None]:
# Import the Pandas library, which will read the CSV files that we wrote.
import pandas as pd

# Import the training and testing pools.
DF_TRAINING_POOL = pd.read_csv("../datasets/dtrain.csv")
DF_TESTING_POOL = pd.read_csv("../datasets/dtest.csv")

# Make a preview of the data frame from the training pool. Note that our features are the eight characters, and the
# target is the 'Valid' column.
DF_TRAINING_POOL.head()

In [31]:
# The data must be encoded as numbers before a neural network can train on it. We will write an encoder and apply it
# to the dataset here.
from string import ascii_lowercase
import numpy as np


def encode_features(feature) -> float:
    """Returns an encoded number that represents the data item."""
    if feature == ' ' or feature == 'invalid':
        return 0.0
    elif feature == 'valid':
        return 1.0
    else:
        return (ascii_lowercase.index(feature) + 1) / 26.0


# Convert the data frames into NumPy arrays, which will be used in the neural networks.
DATA_TRAIN = DF_TRAINING_POOL.to_numpy()
DATA_TEST = DF_TESTING_POOL.to_numpy()

# Make the mapping function that will convert the strings in the datasets into numbers with the function we defined
# earlier and map them on our datasets.
map_func = np.vectorize(encode_features)
DATA_TRAIN = map_func(DATA_TRAIN)
DATA_TEST = map_func(DATA_TEST)

DATASET_COUNT = DATA_TRAIN.shape[0] + DATA_TEST.shape[0]  # type: ignore

# Print out the sizes of the training and testing datasets.
print(f"Train shape: {DATA_TRAIN.shape}")  # type: ignore
print(f"Test shape: {DATA_TEST.shape}")  # type: ignore
print(f"Total rows: {DATASET_COUNT}")


Train shape: (79572, 9)
Test shape: (19894, 9)
Total rows: 99466
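
To make the encoding concrete, the sketch below mirrors `encode_features` in plain Python (without NumPy) and applies it
to one padded row. The names `encode_feature` and `encode_row` and the sample word are hypothetical.

```python
from string import ascii_lowercase


def encode_feature(feature):
    """Map one cell (a character or a label) to a float between 0 and 1."""
    if feature in (" ", "invalid"):
        return 0.0
    if feature == "valid":
        return 1.0
    # Letters map to their 1-based alphabet position divided by 26.
    return (ascii_lowercase.index(feature) + 1) / 26.0


def encode_row(row):
    """Encode every cell of a row."""
    return [encode_feature(cell) for cell in row]


row = list("cab".ljust(8)) + ["valid"]
encoded = encode_row(row)
# 'c' -> 3/26, 'a' -> 1/26, 'b' -> 2/26, padding -> 0.0, label 'valid' -> 1.0
```

Note that the letter `z` also encodes to 1.0, the same value as the `valid` label; the two never collide because they
live in different columns.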


In [None]:
# Split the training and testing datasets into X and Y components. X will contain all of the features, and Y will
# contain the target value.
X_train, y_train = DATA_TRAIN[:, :-1], DATA_TRAIN[:, -1]  # type: ignore
X_test, y_test = DATA_TEST[:, :-1], DATA_TEST[:, -1]  # type: ignore

print(X_train[:5])
print(y_train[:5])

## Part 2: Creating our Networks

Now that we have our dataset ready, we will begin creating neural networks that will train on the data we specify.
These neural networks are loosely modeled on how our own brains operate and will try to "learn" what makes a word valid
by repeatedly adjusting the numbers in a set of mathematical equations running in the background.

To accomplish this, we will use two existing frameworks: TensorFlow and Core ML. TensorFlow is a library created by
Google for building neural networks without writing all of the code that processes the math. Likewise, Core ML is a
framework made by Apple that lets developers use machine learning models in apps on its platforms (macOS, iOS, tvOS,
and watchOS).

For this experiment, we will design three networks:

- First, a fully-connected neural network (FCNN). In this type of network, every node in one layer is connected to
  every node in the next layer. This is the most "basic" neural network in the list.
- Next, a recurrent neural network (RNN) using the Long Short-Term Memory (LSTM) strategy. Recurrent neural networks
  operate very similarly to FCNNs in that nodes are connected. However, an RNN treats the data it receives as
  sequential, meaning that the order of the items matters. The network will perform mathematical operations and learn
  with this in mind.
- Finally, a Core ML model created with Create ML. This model is pre-trained; Create ML automatically selected the
  algorithm that it judged to work best for the dataset. The exact implementation is unknown, since Apple does not
  expose it to the developer.
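
To illustrate what "fully connected" means in practice, here is a minimal pure-Python sketch of a forward pass through
two dense layers. The weights, biases, and layer sizes are made-up values for illustration; a real framework learns
these values during training and uses optimized matrix operations instead of loops.

```python
import math


def relu(value):
    """Rectified linear unit: negative values become zero."""
    return max(0.0, value)


def sigmoid(value):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))


def dense(inputs, weights, biases, activation):
    """One fully-connected layer: every input feeds every output node."""
    outputs = []
    for node_weights, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(node_weights, inputs)) + bias
        outputs.append(activation(total))
    return outputs


# Hypothetical network: 2 inputs -> 2 hidden nodes (ReLU) -> 1 output (sigmoid).
hidden = dense([0.5, 0.25], [[1.0, -1.0], [-2.0, 0.0]], [0.0, 0.5], relu)
score = dense(hidden, [[1.0, 1.0]], [0.0], sigmoid)[0]
```

The final sigmoid keeps `score` between 0 and 1, so it can be read as how strongly the network leans toward "valid".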

In [None]:
# We will specify parameters here that will be used to train the networks we are creating. These parameters can be
# adjusted by us at any time to optimize the algorithms. These parameters are known as 'hyperparameters'.

# Specify the number of epochs the neural networks will train for. One epoch is a full pass over the training data.
KERAS_EPOCHS = 500

# Specify the batch size, which is the number of samples the networks process before each update to their weights.
# Updating once per batch, rather than once per sample, speeds up training.
KERAS_BATCHES = 256

In [None]:
# Import the Tensorflow and Keras libraries needed to make two of the networks.
from tensorflow import keras
from tensorflow.keras.layers import Dense

# Create the FCNN. This will have a first layer that maps to the number of characters in our set: in this case, 8. We
# also include a hidden layer of a different width before a final layer that narrows everything down to a single
# output.
FCNN = keras.Sequential()
FCNN.add(Dense(8, input_dim=8, activation="relu"))
FCNN.add(Dense(32, activation='relu'))
FCNN.add(Dense(1, activation="sigmoid"))

# Compile the model and use 'binary cross-entropy' as the loss, which suits a network that must answer either "yes" or
# "no". We will also use the Adam optimizer and list accuracy in our metrics for further analysis.
FCNN.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Print out a summary of the FCNN.
FCNN.summary()
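
The parameter counts that `FCNN.summary()` reports can be checked by hand: a dense layer with `n` inputs and `m` nodes
has `n * m` weights plus `m` biases. The helper `dense_params` below is a hypothetical name used only for this check.

```python
def dense_params(n_inputs, n_units):
    """Trainable parameters in a dense layer: one weight per connection plus one bias per node."""
    return n_inputs * n_units + n_units


# Layer widths of the FCNN defined above: 8 inputs -> 8 -> 32 -> 1.
layer_sizes = [8, 8, 32, 1]
counts = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total = sum(counts)
# counts is [72, 288, 33], for a total of 393 trainable parameters.
```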

In [None]:
# We will now begin training the FCNN with the training data we encoded earlier, using the hyperparameters we defined.
# To detect overfitting (memorization of the training data), we will hold out 20% of the data for validation. We will
# store the results of the training session for later analysis.
FCNN_results = FCNN.fit(X_train,
                        y_train,
                        epochs=KERAS_EPOCHS,
                        batch_size=KERAS_BATCHES,
                        verbose=1,
                        validation_split=0.2)

In [None]:
# Now, we will create the recurrent neural network in a similar fashion.
from tensorflow import keras
from tensorflow.keras.layers import LSTM, Dense, Embedding

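# Create the RNN. The first layer is an embedding that turns each of the 8 character positions into a 64-dimensional
# vector; its first argument is the vocabulary size (26 letters plus the padding character), not the number of rows in
# the dataset. An LSTM layer then reads the sequence, and a final layer narrows the result down to a single output.
# NOTE: Embedding looks up integer indices, whereas encode_features scales characters to fractions between 0 and 1, so
# the features may need converting back to indices (for example, by rounding X * 26) for the lookup to be meaningful.
RNN = keras.Sequential()
RNN.add(Embedding(27, 64, input_length=8))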
RNN.add(LSTM(4, activation='tanh'))
RNN.add(Dense(1, activation='sigmoid'))

# Compile the model and use 'binary cross-entropy' as the loss, which suits a network that must answer either "yes" or
# "no". We will also use the Adam optimizer and list accuracy in our metrics for further analysis.
RNN.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Print out a summary of the RNN.
RNN.summary()

In [None]:
# We will now begin training the RNN with the training data we encoded earlier, using the hyperparameters we defined.
# To detect overfitting (memorization of the training data), we will hold out 20% of the data for validation. We will
# store the results of the training session for later analysis.
RNN_results = RNN.fit(X_train,
                      y_train,
                      epochs=KERAS_EPOCHS,
                      batch_size=KERAS_BATCHES,
                      verbose=1,
                      validation_split=0.2)
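
As a rough check on how much work these hyperparameters imply, the sketch below estimates the number of weight updates
per epoch. It assumes that the validation split is taken off the training rows first and that the split index is
rounded down; `updates_per_epoch` is a hypothetical helper, not a Keras function.

```python
import math


def updates_per_epoch(n_rows, batch_size, validation_split):
    """Estimate weight updates per epoch after holding out validation rows."""
    # Assumption: the number of rows kept for training is rounded down.
    n_train = math.floor(n_rows * (1.0 - validation_split))
    # The last batch may be smaller than batch_size, hence the ceiling.
    return math.ceil(n_train / batch_size)


per_epoch = updates_per_epoch(79572, 256, 0.2)
total_updates = per_epoch * 500
```

Under these assumptions, each epoch performs 249 updates, or 124,500 updates over all 500 epochs for each network.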