# Project Abysima: Generating a Language with Machine Learning

This notebook experiments with generating a language using neural networks and generative deep learning.
It is by no means a production-ready system, nor is it a complete network; rather, the purpose of this experiment
is to explore what is possible when creating a language this way.

For more information on the process and supporting research, please refer to the Linguistics Paper document found in the
`01 - Areas of Responsibility` directory.

The following source code and datasets are licensed under the Mozilla Public License v2.0. Please refer to the LICENSE
file that came with this repository for more information on what your rights are with usage and modification of this
software. If a LICENSE file is not provided, you can obtain a copy at https://www.mozilla.org/en-US/MPL/2.0/.

## Part I: Initial Setup

This notebook will utilize the TensorFlow and Keras libraries to create the generative neural networks for this project,
as well as some other Python libraries to parse the data from CSV files or other file formats.

In [1]:
# Import the Keras packages we will use to create the neural network models, and to parse the dataset.
from tensorflow import keras

# Import pandas to be able to read the CSV file
import pandas as pd

# Import typing hints to make code readable.
from typing import List

## Part II: Dataset Parsing and Encoding

Before the dataset can be fed into the neural network, we first need to sanitize and prepare it for use with
the network. This also includes splitting it into training, validation, and testing sets, so that accuracy is measured
on data the network has not seen and overfitting can be detected.
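As a minimal sketch of the split described above, the following helper shuffles a list of rows and divides it into three parts. The function name `split_dataset`, the 80/10/10 default ratios, and the fixed seed are assumptions made for illustration; they are not taken from this project's code.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(
    rows: Sequence[T],
    train: float = 0.8,
    validation: float = 0.1,
    seed: int = 0,
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle rows and split them into training, validation, and testing lists.

    Hypothetical helper: the ratios and the seed are illustrative defaults.
    Whatever is left after the training and validation shares becomes the test set.
    """
    if train < 0 or validation < 0 or train + validation > 1:
        raise ValueError("train and validation must be non-negative and sum to at most 1")

    # Copy before shuffling so the caller's data is left untouched.
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)

    train_end = int(len(shuffled) * train)
    validation_end = train_end + int(len(shuffled) * validation)
    return (
        shuffled[:train_end],
        shuffled[train_end:validation_end],
        shuffled[validation_end:],
    )
```

Fixing the seed keeps the split reproducible between notebook runs, which makes results easier to compare while the network is still changing.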

### Differentiating Sounds and the Use of IPA Symbols

Given that different languages have different writing systems, and the correlation between characters and sounds may be
difficult to extrapolate, we will assume that all strings in this network are written using the International
Phonetic Alphabet (IPA). Each IPA symbol corresponds to a particular sound and can be classified by different features
with respect to how the sound is made.

Note that not all languages use every symbol available in IPA. For instance, English does not use [Ɣ].

In [2]:
# Create a blank mapping for the IPA symbols. This will be used to map various IPA symbols to numbers
# that the neural network can use, rather than Unicode characters.
IPA_SYMBOLS = {}

# Open the IPA symbols mapping file in Pandas to read the data as a CSV file data frame, and verify
# the names and features are correct in the file.
ipa_symbols_file = pd.read_csv("ipasymbs.csv")
print(ipa_symbols_file.head(8))

  IPA  Feature
0   p        1
1   t        2
2   k        3
3   ʔ        4
4   b        5
5   d        6
6   g        7
7   m        8


In [3]:
# Iterate through every row in the list and add the mapping from IPA symbol to feature.
for _, row in ipa_symbols_file.iterrows():
    IPA_SYMBOLS[row['IPA']] = row['Feature']

print(IPA_SYMBOLS.keys())

dict_keys(['p', 't', 'k', 'ʔ', 'b', 'd', 'g', 'm', 'n', 'ɲ', 'ŋ', 'ɸ', 'f', 'θ', 's', 'ʃ', 'ç', 'x', 'ħ', 'h', 'ẞ', 'v', 'ð', 'z', 'Ʒ', 'Ɣ', 'ʁ', 'ʕ', 'ts', 'tʃ', 'ʣ', 'ʤ', 'w', 'ɹ', 'j', 'l', 'r', 'ɾ', 'i', 'ü', 'ɨ', 'ɯ', 'u', 'ɪ', 'ʊ', 'e', 'ə', 'o', 'ɛ', 'ʌ', 'ɔ', 'æ', 'a', 'ɑ'])
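Since the network works with integers, its output will eventually need to be turned back into IPA symbols. A minimal sketch of that reverse lookup follows; the helper names `invert_mapping` and `features_to_ipa` are hypothetical and are not part of the notebook's code, and the sketch assumes each feature number belongs to exactly one symbol.

```python
from typing import Dict, Iterable


def invert_mapping(symbols: Dict[str, int]) -> Dict[int, str]:
    """Build a feature -> symbol lookup from a symbol -> feature mapping."""
    inverse: Dict[int, str] = {}
    for symbol, feature in symbols.items():
        # Refuse ambiguous mappings rather than silently overwriting an entry.
        if feature in inverse:
            raise ValueError(f"Feature {feature} is used by more than one symbol")
        inverse[feature] = symbol
    return inverse


def features_to_ipa(features: Iterable[int], inverse: Dict[int, str]) -> str:
    """Join the symbols for a sequence of feature integers into one string."""
    return "".join(inverse[feature] for feature in features)
```

With the mapping built above, this would be called as `features_to_ipa(some_features, invert_mapping(IPA_SYMBOLS))`.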


In [32]:
# Verify we can map IPA symbols to integers by passing in an IPA string and replacing the characters
# with feature integers.
# 
# Example: <shout> [ʃaʊt] -> [16, 53, 45, 2]
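class IPAString(str):
    """A subclass of string in which the characters are IPA symbols that correspond to features."""
    def __str__(self):
        # Wrap the plain text in square brackets, the usual notation for a phonetic transcription.
        return f"[{str.__str__(self)}]"
    def to_features(self) -> List[int]:
        # Look up one character at a time; multi-character symbols such as 'ts' are not matched.
        return [IPA_SYMBOLS[c] for c in self]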

shout_features = IPAString("ʃaʊt").to_features()
print(shout_features)

[16, 53, 45, 2]
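The mapping also contains two-character symbols such as `ts` and `tʃ`, which a character-by-character lookup cannot reach. One possible way to handle them is greedy longest-match tokenization, sketched below. The names `tokenize_ipa` and `to_features` are hypothetical, and `SAMPLE_SYMBOLS` is an illustrative subset whose values assume the features are numbered in the order printed above.

```python
from typing import Dict, List

# Illustrative subset of the symbol mapping (assumed values, see the note above).
SAMPLE_SYMBOLS: Dict[str, int] = {
    "t": 2, "s": 15, "ʃ": 16, "ts": 29, "tʃ": 30, "ʊ": 45, "a": 53,
}


def tokenize_ipa(text: str, symbols: Dict[str, int]) -> List[str]:
    """Split text into known symbols, preferring the longest match at each position."""
    longest = max(len(symbol) for symbol in symbols)
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then progressively shorter ones.
        for size in range(min(longest, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in symbols:
                tokens.append(candidate)
                i += size
                break
        else:
            raise KeyError(f"Unknown IPA symbol at position {i}: {text[i]!r}")
    return tokens


def to_features(text: str, symbols: Dict[str, int]) -> List[int]:
    """Convert an IPA string to feature integers using longest-match tokens."""
    return [symbols[token] for token in tokenize_ipa(text, symbols)]
```

Greedy matching always reads `t` followed by `ʃ` as the affricate `tʃ`, so a word in which the two sounds are meant to be separate would need an explicit boundary marker; that trade-off is a design choice of this sketch.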


In [42]:
# Rules for phonemes are defined in the following format and stored in a .rules file:
#    [ʃa] -> [ʊ]
#    [ʃa] -> [i]
#
# These rules need to be parsed and converted into a matrix that the network can use. To do this,
# some functions have been written in another file.
from phoneme_rules import read_rules_as_rows

# To demonstrate and verify the file is read correctly, a sample file will be parsed. The result
# should be a two-dimensional array that includes the left side of the rules as inputs, and the
# right side of the rules as outputs.
sample_data = read_rules_as_rows("sample.rules")

# Split the data into its inputs and outputs.
inputs = []
outputs = []
for row in sample_data:
    inputs.append(row[:-1])
    outputs.append(row[-1])
print(inputs, outputs)

[['ʃ', 'a'], ['ʃ', 'a']] ['Ʊ', 'i']
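The `phoneme_rules` module is not shown in this notebook. As a rough sketch of how rule lines in the format above could be turned into rows, the following parser uses a regular expression. The names `RULE_PATTERN`, `parse_rule_line`, and `parse_rules` are hypothetical and may differ from what `read_rules_as_rows` actually does; the sketch also assumes the left side is split into single characters and the right side is kept as one symbol.

```python
import re
from typing import List

# Matches lines such as "[ʃa] -> [ʊ]", allowing surrounding whitespace.
RULE_PATTERN = re.compile(r"^\s*\[(?P<left>[^\]]+)\]\s*->\s*\[(?P<right>[^\]]+)\]\s*$")


def parse_rule_line(line: str) -> List[str]:
    """Return the left-side characters followed by the right-side symbol."""
    match = RULE_PATTERN.match(line)
    if match is None:
        raise ValueError(f"Not a valid rule: {line!r}")
    # Assumption: every character on the left side is its own input symbol.
    return list(match.group("left")) + [match.group("right")]


def parse_rules(text: str) -> List[List[str]]:
    """Parse every non-blank line of a rules file's contents into a row."""
    return [parse_rule_line(line) for line in text.splitlines() if line.strip()]
```

Each row produced this way has the same shape the cell above expects: everything except the last element is the input, and the last element is the output.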
