<a href="https://colab.research.google.com/github/William-Hill/wakandan_name_textgenrnn/blob/master/Wakandan_TextGenRNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

We are going to use a character-based Recurrent Neural Network (RNN) to generate new Wakandan names using a dataset of existing Wakandan names.

To do this, we will combine several data science techniques with deep learning.  We will follow the OSEMN framework, which has the following steps:



1.   Obtain Data
2.   Scrub Data
3.   Explore Data
4.   Model Data
5.   Interpret Data

![OSEMN_Framework_Diagram](https://miro.medium.com/max/3870/1*eE8DP4biqtaIK3aIy1S2zA.png)



Note: Enable GPU acceleration to execute this notebook faster. In Colab: *Runtime > Change runtime type > Hardware accelerator > GPU*. If running locally, make sure your TensorFlow version is >= 1.11.

# Setup

## Install TensorFlow and other dependencies

We are going to use the `textgenrnn` library to create our model. `textgenrnn` is a Python 3 module on top of Keras/TensorFlow for creating character-level Recurrent Neural Networks (RNNs).

We will need to install `tensorflow` as `textgenrnn` depends on it.

We install `bokeh` as well. `bokeh` is a Python library for interactive visualization that targets web browsers for presentation.

In [0]:
!pip install textgenrnn tensorflow-gpu bokeh scrapy plotly==4.1.0 gpt-2-simple

## Import dependencies

Next we'll import the dependencies we just installed along with some others that we'll need for our project.

In [0]:
import requests, zipfile, io, glob, os, json
from textgenrnn import textgenrnn
from google.colab import files
from datetime import datetime
import scrapy
from scrapy.crawler import CrawlerProcess
import re


# A Quick Experiment

Before diving into the OSEMN process to build our Wakandan name generator model, let's do a quick experiment.

### Dataset from PyTorch RNN example

We are going to download an example dataset that is used in a tutorial for Facebook's Deep Learning framework, PyTorch.  This dataset is used in a tutorial similar to this one in that it demonstrates using a character level RNN to generate names.

The data is stored as a zip file.  Let's download and extract it.

In [0]:
r = requests.get("https://download.pytorch.org/tutorial/data.zip")
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall()

### Explore the data

Let's take a look at the contents of our data.

The dataset consists of 18 text files containing surnames from 18 languages of origin.

Since the names come from various languages, they are encoded as Unicode strings in order to support language specific symbols such as umlauts.  We will need to convert the Unicode strings to ASCII.

We loop over the text files and add all the names for each language to a dictionary.

In [0]:
def findFiles(path): 
  return glob.glob(path)

print(findFiles('data/names/*.txt'))


import unicodedata
import string

all_letters = string.ascii_letters + " .,;'"
n_letters = len(all_letters)


# Turn a Unicode string to plain ASCII, thanks to http://stackoverflow.com/a/518232/2809427
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
        and c in all_letters
    )

print(unicodeToAscii('Ślusàrski'))

# Build the category_lines dictionary, a list of names per language
category_lines = {}
all_categories = []
category_counts = {}

# Read a file and split into lines
def readLines(filename):
    # Use a context manager so the file handle is closed after reading
    with open(filename, encoding='utf-8') as f:
        lines = f.read().strip().split('\n')
    return [unicodeToAscii(line) for line in lines]

for filename in findFiles('data/names/*.txt'):
    category = os.path.splitext(os.path.basename(filename))[0]
    all_categories.append(category)
    lines = readLines(filename)
    category_lines[category] = lines
    category_counts[category] = len(lines)

counts = category_counts.values()
print("counts:", counts)

n_categories = len(all_categories)
print("all_categories:", all_categories)
print("category_lines:", category_lines)
print("category_counts:", category_counts)
print("spanish count:", len(category_lines["Spanish"]))
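To see why `unicodeToAscii` works, it helps to look at what NFD normalization does to a single accented character. The sketch below is illustrative only: `strip_accents` and `ALLOWED` are hypothetical names that mirror the function and `all_letters` above and are not meant to replace them.

```python
import string
import unicodedata

ALLOWED = string.ascii_letters + " .,;'"

def strip_accents(text):
    """Decompose accented characters, then drop the combining marks."""
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(
        ch for ch in decomposed
        if unicodedata.category(ch) != 'Mn' and ch in ALLOWED
    )

# NFD splits U+00E9 (e with acute) into a base letter plus a combining accent.
parts = unicodedata.normalize('NFD', '\u00e9')
print([unicodedata.name(ch) for ch in parts])

print(strip_accents('Ślusàrski'))
# Characters with no decomposition, such as 'ß', are not in ALLOWED and are dropped.
print(strip_accents('Straße'))
```

One consequence worth keeping in mind is that this approach silently removes characters that have no ASCII base letter, so some names can end up shorter than the original.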

### Visualize the data

Let's plot the data to get a better look at it.

#### Bar chart

In [0]:
from bokeh.io import show, output_file, output_notebook
from bokeh.plotting import figure
from bokeh.models import HoverTool, ColumnDataSource

output_notebook()
output_file("bars.html")

counts = list(category_counts.values())
countries = list(category_counts.keys())
source = ColumnDataSource(data=dict(countries=countries, counts=counts))

p = figure(x_range=countries, plot_height=250, title="Name Counts",
           toolbar_location=None, tools="")
p.sizing_mode = 'scale_width'

p.vbar(x='countries', top='counts', width=0.9, source=source)
p.add_tools(HoverTool(tooltips=[("Count", "@counts")]))

p.xgrid.grid_line_color = None
p.y_range.start = 0

show(p)

We can see that our data is not uniform (and has a large outlier in Russian names... hmm, interesting 🤔).
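If you would rather put a number on that imbalance than read it off the chart, a small summary like the following could help. This is a sketch: `imbalance_summary` is a hypothetical helper, and the counts in the example are made up for illustration. In the notebook you would pass `category_counts` instead.

```python
import statistics

def imbalance_summary(counts):
    """Return the largest category, its share of all names, and its ratio to the median count."""
    total = sum(counts.values())
    largest = max(counts, key=counts.get)
    median = statistics.median(counts.values())
    return largest, counts[largest] / total, counts[largest] / median

# Made-up counts for illustration; in the notebook, pass category_counts instead.
example_counts = {"LanguageA": 10, "LanguageB": 20, "LanguageC": 70}
largest, share, ratio = imbalance_summary(example_counts)
print(f"{largest}: {share:.0%} of all names, {ratio:.1f}x the median")
```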

Let's visualize our data in a different way to see if we can get more insight.

#### Choropleth

Let's plot our data on a map to get a visual breakdown of the regions represented by the dataset.

First we will need to download some country shape files to create our map.

In [0]:
r = requests.get("https://www.naturalearthdata.com/http//www.naturalearthdata.com/download/110m/cultural/ne_110m_admin_0_countries.zip")
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall()

Now let's create a CSV file with the countries in our dataset.  Each row holds a language, the ISO 3166 code of its country (or countries) of origin, and the number of names.

In [0]:
country_codes = {"Spanish": "ESP", "Japanese": "JPN", "Chinese": "CHN", "Czech": "CZE", 
                 "Portuguese": "PRT", "Scottish": "GBR", "English": "GBR", 
                 "Italian": "ITA", "Irish": "IRL", "Korean": "KOR", "German": 
                 "DEU", "Greek": "GRC", "French": "FRA", 
                 "Arabic": ["Jordan", "Palestine", "Syria", "Lebanon", "Morocco", "Mauritania", "Algeria", "Tunisia", "Libya", "Sudan", "Somalia", "Egypt", "Saudi Arabia", "Yemen", "Oman", "Qatar", "Bahrain", "Kuwait", "Comoros", "Iraq", "Djibouti", "United Arab Emirates"],
                 "Russian": "RUS", "Polish": "POL", "Vietnamese": "VNM", "Dutch": "NLD"
                }

import csv

csv_columns = ['Language','Code','Count']

try:
    with open('name_count_by_language.csv', 'w') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=csv_columns)
        writer.writeheader()
        for key, value in category_counts.items():
            writer.writerow({'Language': key, 'Code': country_codes[key], 'Count': value})
except IOError:
    print("I/O error") 
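Note that the `Arabic` entry maps to a list of country names rather than a single code, so the `Code` cell for that row ends up holding the whole list. One possible way around this, sketched below, is to write one row per country code. `expand_rows` is a hypothetical helper, and the mapping and counts here are a small made-up subset using ISO 3166-1 alpha-3 codes.

```python
import csv
import io

def expand_rows(counts, codes):
    """Yield one row per (language, country code) pair."""
    for language, count in counts.items():
        value = codes[language]
        # Accept either a single code or a list of codes.
        code_list = value if isinstance(value, list) else [value]
        for code in code_list:
            yield {'Language': language, 'Code': code, 'Count': count}

# Illustrative subset with made-up counts.
example_codes = {"Korean": "KOR", "Arabic": ["EGY", "JOR", "MAR"]}
example_counts = {"Korean": 5, "Arabic": 7}

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['Language', 'Code', 'Count'])
writer.writeheader()
rows = list(expand_rows(example_counts, example_codes))
writer.writerows(rows)
print(buffer.getvalue())
```

This design assigns the full language count to every country in the list, which is a simplification: the dataset does not say how the names are distributed across countries.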

Now we can plot the data on the map using Plotly.

In [0]:

import plotly.graph_objects as go
import pandas as pd

# df = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/2014_world_gdp_with_codes.csv')
df = pd.read_csv('name_count_by_language.csv')

fig = go.Figure(data=go.Choropleth(
    locations = df['Code'],
    z = df['Count'],
    text = df['Language'],
    colorscale = 'Blues',
    autocolorscale=False,
    reversescale=True,
    marker_line_color='darkgray',
    marker_line_width=0.5,
    colorbar_title = 'Names in Language',
))

fig.update_layout(
    title_text='Names by Language',
    geo=dict(
        showframe=False,
        showcoastlines=False,
        projection_type='equirectangular'
    ),
    annotations = [dict(
        x=0.55,
        y=0.1,
        xref='paper',
        yref='paper',
        text='Source: <a href="https://pytorch.org/tutorials/intermediate/char_rnn_generation_tutorial.html">\
            Facebook PyTorch Tutorials</a>',
        showarrow = False
    )]
)

fig.show()

### Bias in Data

You may ask why we did that little experiment.  It was to demonstrate how data can be biased.  By visualizing the data as a choropleth, we can see that virtually no names native to the continent of Africa are represented.

# Obtain Data

### Wakandan Name dataset

We need to obtain our Wakandan name dataset.  We will use this dataset to create a name that is structurally similar to a known Wakandan name. 

The dataset will be split into male Wakandan names and female Wakandan names.  The dataset is split because Wakandan names usually have a certain format depending on gender.  A male Wakandan name typically begins with a consonant, then an apostrophe, then the rest of the name (for example, T'Challa).

### Scrape data

In [0]:
class NamesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'https://marvel.fandom.com/wiki/Category:Wakandans',
        'https://marvel.fandom.com/wiki/Category:Wakandans?from=Shikona+%28Earth-161%29%0AShikona+%28Earth-161%29',
        'https://marvel.fandom.com/wiki/Category:Wakandans?from=Zatama+%28Earth-616%29%0AZatama+%28Earth-616%29'
    ]
    
    custom_settings = {
        'FEED_FORMAT':'json',
        'FEED_URI': 'names.json'
    }

    def parse(self, response):
        # Follow every character link listed on the category page.
        wakandan_name_links = response.css(
            ".category-page__member a::attr(href)").getall()
        for link in wakandan_name_links:
            yield response.follow(link, self.parse_name)

        # Follow the pagination link to the next category page, if any.
        next_page = response.css(".category-page__pagination-next a::attr('href')")
        if next_page:
            print("Found link to next page")
            url = response.urljoin(next_page[0].extract())
            yield scrapy.Request(url, self.parse)

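    def parse_name(self, response):
        # Prefer the infobox "RealName" field; default='' avoids calling
        # .strip() on None when the field is missing.
        name = response.css(
            'div[data-source="RealName"] div::text').get(default='').strip()
        if not name:
            # Fall back to the page title.
            name = response.css(
                '.page-header__title::text').get(default='').strip()
        # Drop parenthesised suffixes such as " (Earth-616)".
        name = re.sub(r" ?\([^)]+\)", "", name)
        scrape_gender = response.css('div[data-source="Gender"] a::text').get()
        yield {
            'name': name,
            'gender': scrape_gender,
        }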

        
process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(NamesSpider)
process.start() # the script will block here until the crawling is finished
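The `re.sub` call in `parse_name` is what turns a wiki page title into a bare name. The following standalone sketch shows its effect on a few titles. `clean_title` and `UNIVERSE_SUFFIX` are hypothetical names, and the sample titles are taken from the start URLs above.

```python
import re

# Same pattern as in parse_name: an optional space followed by "(...)".
UNIVERSE_SUFFIX = re.compile(r" ?\([^)]+\)")

def clean_title(title):
    """Remove parenthesised suffixes such as ' (Earth-616)' from a page title."""
    return UNIVERSE_SUFFIX.sub("", title).strip()

for raw in ["Shikona (Earth-161)", "Zatama (Earth-616)", "T'Challa"]:
    print(repr(raw), "->", repr(clean_title(raw)))
```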

# Scrub The Data

Now that we have scraped the raw dataset from the wiki, we need to pre-process, or scrub, the data to prepare it for use with the model.

We need to split the dataset into masculine names and feminine names.

In [0]:
wakandan_masculine_names = set()
wakandan_feminine_names = set()

with open('names.json') as names:
    name_data = json.load(names)

for name in name_data:
    if name['gender'] == 'Female':
        wakandan_feminine_names.add(name['name'])
    else:
        wakandan_masculine_names.add(name['name'])


print("wakandan_masculine_names:", wakandan_masculine_names)
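The loop above places every record whose gender is not `'Female'` into the masculine set, including records where the scraper found no gender at all. A possible alternative, sketched here, keeps those records in a separate bucket. `split_by_gender` is a hypothetical helper, the `'Male'` label is an assumption about what the wiki uses, and the sample records are hand-written.

```python
from collections import defaultdict

def split_by_gender(records):
    """Group unique names by gender label; missing or unexpected labels go to 'unknown'."""
    groups = defaultdict(set)
    for record in records:
        gender = record.get('gender')
        # 'Male' is an assumed label; only 'Female' appears in the cell above.
        key = gender if gender in ('Female', 'Male') else 'unknown'
        if record.get('name'):  # skip records with an empty name
            groups[key].add(record['name'])
    return groups

# Hand-written sample records in the same shape as names.json.
sample = [
    {'name': "T'Challa", 'gender': 'Male'},
    {'name': 'Shuri', 'gender': 'Female'},
    {'name': "N'Jadaka", 'gender': None},
    {'name': '', 'gender': 'Male'},
]
groups = split_by_gender(sample)
print({key: sorted(names) for key, names in groups.items()})
```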



# Explore The Data

### Visualize the data

Let's plot the raw data split by gender.

In [0]:
#TODO: Make this a pie chart instead
output_notebook()
output_file("names_raw_data.html")

counts = [len(wakandan_masculine_names), len(wakandan_feminine_names)]
gender = ["masculine", "feminine"]
source = ColumnDataSource(data=dict(gender=gender, counts=counts))

p = figure(x_range=gender, plot_height=250, title="Name Counts",
           toolbar_location=None, tools="")
p.sizing_mode = 'scale_width'

p.vbar(x='gender', top='counts', width=0.9, source=source)
p.add_tools(HoverTool(tooltips=[("Count", "@counts")]))

p.xgrid.grid_line_color = None
p.y_range.start = 0

show(p)

We can see that the data is skewed towards masculine names.  Most real world data will not be uniform so this isn't very unusual (but could be a symptom of bias).


There's still some scrubbing that we need to do before we can feed it to the model.

The dataset is split into male Wakandan names and female Wakandan names.  The dataset is split because Wakandan names usually have a certain format depending on gender.  A male Wakandan name typically begins with a consonant, then an apostrophe, then the rest of the name (for example T'Challa).  

However, there are outliers in our data. As new writers such as Ta-Nehisi Coates and Roxane Gay have expanded the Wakandan lore, the names have become less uniform.  For example, A'di is the niece of King T'Challa.  She has a name that would typically suggest a male character.  She's still more exception than rule, though.

For this project, we are going to normalize the data so that male and female names follow the established patterns, for simplicity's sake.  We will scrub away any names that don't conform to the patterns in either dataset.

In [0]:
# Keep only masculine names that contain an apostrophe
wakandan_masculine_names = [x for x in wakandan_masculine_names if "'" in x]
print("wakandan_masculine_names:", wakandan_masculine_names)
print("wakandan_masculine_names count:", len(wakandan_masculine_names))
# Keep only feminine names that do not contain an apostrophe
wakandan_feminine_names = [x for x in wakandan_feminine_names if "'" not in x]
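The filter above only checks whether a name contains an apostrophe. If you want to test the full pattern described earlier (a consonant, then an apostrophe, then the rest of the name), a regular expression is one option. This is a sketch under assumptions: `MASCULINE_PATTERN` and `looks_masculine` are hypothetical names, and the pattern treats every letter other than A, E, I, O and U as a consonant.

```python
import re

# Assumed pattern: one uppercase consonant, an apostrophe, then at least one letter.
MASCULINE_PATTERN = re.compile(r"^[B-DF-HJ-NP-TV-Z]'[A-Za-z]")

def looks_masculine(name):
    """True when the name starts with consonant + apostrophe + letter."""
    return bool(MASCULINE_PATTERN.match(name))

for candidate in ["T'Challa", "A'di", "Zatama"]:
    print(candidate, looks_masculine(candidate))
```

A stricter check like this would shrink an already small dataset further, so whether it is worth using depends on how many names survive it.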

# Model the Data

## Create `textgenrnn` object

Before we train the model, we must create an instance of the `textgenrnn` class.

We can provide a `name` argument to the class, which it will use to save the model to a file after training.

In [0]:
model_name = 'wakandan_names'   # change to set file name of resulting trained models/texts

textgen = textgenrnn(name=model_name)

## Train the model

Now we are ready to train the model on the dataset.

We use the `train_on_texts` method to train the model on our Wakandan name dataset.  `textgenrnn` comes with a pre-trained model out of the box that we could use for transfer learning, but we are going to train our model from scratch, so we pass a value of `True` to the `new_model` argument.

We are training for 60 epochs with a test generation output occurring at every 5th epoch.

We set `train_size` and `dropout` to 0.7 and 0.2, respectively, to help combat overfitting.

In [0]:
textgen.reset()
textgen.train_on_texts(wakandan_masculine_names, new_model=True, num_epochs=60,  gen_epochs=5, train_size=0.7, dropout=0.2) 
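To get a feel for what a `train_size` of 0.7 means on a dataset this small, the sketch below splits a list of names so that roughly 70% is used for fitting and the rest is held out. It is a plain-Python illustration of the idea, not `textgenrnn`'s internal implementation. `split_names` is a hypothetical helper and the names are placeholders.

```python
import random

def split_names(names, train_size=0.7, seed=0):
    """Shuffle names and split them into training and held-out lists."""
    shuffled = list(names)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_size)
    return shuffled[:cut], shuffled[cut:]

# Placeholder names for illustration only.
names = [f"name_{i}" for i in range(10)]
train, held_out = split_names(names)
print(len(train), "training names,", len(held_out), "held out")
```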

# Interpret the Data

## Generate names

Our model is now trained on the dataset and can be used to generate new names.

The model isn't perfect: it will generate some names that are already in the dataset (a sign of slight overfitting), and some names that don't quite conform to the structure of a typical Wakandan name.  But for a model that was trained on a **really small** dataset for under 5 minutes, it does pretty well!

In [0]:
textgen.generate()

## Measure Fitness

As you can see, the model sometimes generates names that already exist in the dataset.  This is a symptom of overfitting, but since our dataset is so small, it is somewhat expected.

We would still like a way to measure how good our model is at generating new names.  One way to do this is a simple brute-force measurement: invoke the model 100 times and compute the percentage of generated names that are unique.  We don't want a generated name to just be unique, though; it should also be structurally similar to the names in the dataset on which the model was trained.  We can use Levenshtein distance to measure how close a generated name is to a name in the dataset.  The similarity threshold is open to experimentation; we arbitrarily chose 70% as a starting point.
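One possible shape for such a fitness test is sketched below. It is a minimal pure-Python version under stated assumptions: similarity is defined here as one minus the Levenshtein distance divided by the length of the longer name, and `levenshtein`, `similarity` and `fitness` are hypothetical helper names, not part of `textgenrnn`.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def similarity(a, b):
    """Normalised similarity in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest

def fitness(generated, dataset, threshold=0.7):
    """Fraction of generated names that are new and close to a known name."""
    known = set(dataset)
    good = 0
    for name in generated:
        if name in known:
            continue  # an exact copy of a training name is not novel
        if any(similarity(name, other) >= threshold for other in known):
            good += 1
    return good / len(generated) if generated else 0.0

# Hand-written example; "T'Chaka" and "Xyz" stand in for model output.
print(fitness(["T'Challa", "T'Chaka", "Xyz"], ["T'Challa"]))
```

In the notebook, the list of generated names could come from something like `textgen.generate(n=100, return_as_list=True)`, assuming the installed `textgenrnn` version supports the `return_as_list` argument.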

In [0]:
#TODO: write fitness test function

In [0]:
# file_path = 'data/names/Russian.txt'

# textgen.reset()
# textgen.train_from_file(file_path, new_model=True, num_epochs=10, gen_epochs=5)

## Save results to a file

We can continuously call the `generate` function to print names to the console.

We can also generate an arbitrary number of names and write them to a file.

We use the `generate_to_file` function to do so.  We can pass in the number of names we want to generate to the `n` parameter (cringes at single letter variable name).

We could also pass in a string to the `prefix` parameter to act as a seed for generating our name.

We then use Colab's `files` API to download our file.

In [0]:
# this temperature schedule cycles between 1 very unexpected token, 1 unexpected token, 2 expected tokens, repeat.
# changing the temperature schedule can result in wildly different output!
temperature = [1.0, 0.5, 0.2, 0.2]   
prefix = None   # if you want each generated text to start with a given seed text

# if train_cfg['line_delimited']:
#   n = 1000
#   max_gen_length = 60 if model_cfg['word_level'] else 300
# else:
#   n = 1
#   max_gen_length = 2000 if model_cfg['word_level'] else 10000
  
timestring = datetime.now().strftime('%Y%m%d_%H%M%S')
gen_file = '{}_gentext_{}.txt'.format(model_name, timestring)

textgen.generate_to_file(gen_file,
                         temperature=temperature,
                         prefix=prefix,
                         n=5,
                         max_gen_length=9)
files.download(gen_file)
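To build some intuition for the temperature schedule used in the cell above, the sketch below rescales a small probability distribution at each temperature in the schedule. It is an illustration of temperature sampling in general, not `textgenrnn`'s actual code. `apply_temperature` is a hypothetical helper, the probabilities are made up, and cycling the schedule position by position is an assumption based on the comment in the cell.

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a probability distribution; low temperature sharpens it."""
    logits = [math.log(p) / temperature for p in probs]
    peak = max(logits)
    weights = [math.exp(x - peak) for x in logits]  # subtract peak for stability
    total = sum(weights)
    return [w / total for w in weights]

# Made-up next-character probabilities for illustration (all must be > 0).
probs = [0.5, 0.3, 0.2]
schedule = [1.0, 0.5, 0.2, 0.2]

for step in range(4):
    # Assumption: the schedule is cycled position by position.
    t = schedule[step % len(schedule)]
    print(step, t, [round(p, 3) for p in apply_temperature(probs, t)])
```

At a temperature of 1.0 the distribution is unchanged, while at 0.2 almost all of the probability mass moves to the most likely character, which is why low temperatures produce more predictable names.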

# Saving the model

When we trained the model, `textgenrnn` saved the weights, vocabulary, and configuration that resulted from the training to separate files.  Each of those files has a prefix of the `model_name` that we defined earlier in the code.

We can download these files so that they can be loaded into a new `textgenrnn` model.

In [0]:
files.download('{}_weights.hdf5'.format(model_name))
files.download('{}_vocab.json'.format(model_name))
files.download('{}_config.json'.format(model_name))
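Loading the downloaded files back into a fresh model might look like the sketch below. It assumes the `textgenrnn` constructor accepts `weights_path`, `vocab_path` and `config_path` keyword arguments. `model_file_paths` and `load_trained_model` are hypothetical helper names.

```python
def model_file_paths(model_name):
    """Return the three file names written during training."""
    return {
        'weights_path': '{}_weights.hdf5'.format(model_name),
        'vocab_path': '{}_vocab.json'.format(model_name),
        'config_path': '{}_config.json'.format(model_name),
    }

def load_trained_model(model_name):
    """Build a textgenrnn instance from previously saved files."""
    # Imported here so the path helper works without textgenrnn installed.
    from textgenrnn import textgenrnn
    return textgenrnn(**model_file_paths(model_name))

print(model_file_paths('wakandan_names'))
```

With the three files in the working directory, `load_trained_model('wakandan_names').generate()` should then produce names without retraining.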

# Future Work

*   Addressing non-binary names
*   Run model in a loop with different epochs and visualize results


