# Classifying Names with a Character-Level RNN

In this tutorial we try to predict the nationality of a name by processing it through a bidirectional LSTM, one character at a time. We have 18 input files, each corresponding to a different nationality. We will store the names in a dictionary of the form `{'Nationality': [name1, name2, ...], ...}`. The [`glob` module](https://docs.python.org/3/library/glob.html) is more convenient than `os.listdir` for this purpose: it filters by pattern and returns each path with its directory prefix, so the results can be passed directly to `open`.

In [35]:
from io import open
import glob
import os  # needed below for os.path.basename / os.path.splitext
import string
from functools import reduce


file_list = glob.glob('../data/names/*.txt')
print(file_list)

['../data/names/German.txt', '../data/names/Arabic.txt', '../data/names/Vietnamese.txt', '../data/names/Dutch.txt', '../data/names/Polish.txt', '../data/names/Portuguese.txt', '../data/names/Scottish.txt', '../data/names/Korean.txt', '../data/names/Irish.txt', '../data/names/Russian.txt', '../data/names/Czech.txt', '../data/names/Greek.txt', '../data/names/Italian.txt', '../data/names/Spanish.txt', '../data/names/French.txt', '../data/names/Japanese.txt', '../data/names/English.txt', '../data/names/Chinese.txt']
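To make the difference between `glob.glob` and `os.listdir` concrete, here is a minimal sketch. It does not touch the tutorial's data: the temporary directory and the three placeholder files are assumptions made up for the illustration.

```python
import glob
import os
import tempfile

# Hypothetical stand-in for ../data/names: a temp directory with three files
with tempfile.TemporaryDirectory() as data_dir:
    for fname in ('German.txt', 'Arabic.txt', 'README.md'):
        with open(os.path.join(data_dir, fname), 'w', encoding='utf-8') as f:
            f.write('placeholder\n')

    # os.listdir returns bare file names and applies no filtering
    listed = sorted(os.listdir(data_dir))
    # glob.glob keeps the directory prefix and only returns pattern matches
    globbed = sorted(glob.glob(os.path.join(data_dir, '*.txt')))

    print(listed)
    print(globbed)
```

With `os.listdir` we would have to filter on the extension and join each name with the directory ourselves before opening the file.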


Let's create the dictionary:

In [46]:
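category_lines = {}
for file_name in file_list:
    # '../data/names/German.txt' -> 'German'
    nationality = os.path.splitext(os.path.basename(file_name))[0]
    # The names contain non-ASCII characters; the files are assumed to be UTF-8
    with open(file_name, 'r', encoding='utf-8') as f:
        # One name per line, with the trailing newline removed
        category_lines[nationality] = [line.rstrip() for line in f]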

Some of the names contain accents and other non-ASCII characters, which complicates the encoding. We could convert every character to plain ASCII, as shown in the vignette, but the task may well be easier if the model can see the non-ASCII characters. Let's try this approach! We read each file in turn and extract its set of unique characters.
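For comparison, the ASCII conversion we are choosing not to use could look roughly like the sketch below. It is an illustration only, not necessarily the vignette's exact code: the function name `unicode_to_ascii` and the `ALLOWED` character set are assumptions.

```python
import string
import unicodedata

# Assumption: the characters we are willing to keep after folding
ALLOWED = set(string.ascii_letters + " .,;'-")

def unicode_to_ascii(name):
    """Strip combining accents, then drop anything outside ALLOWED."""
    # NFD splits an accented letter into a base letter plus combining marks
    decomposed = unicodedata.normalize('NFD', name)
    return ''.join(
        ch for ch in decomposed
        if unicodedata.category(ch) != 'Mn' and ch in ALLOWED
    )

print(unicode_to_ascii('Müller'))     # Muller
print(unicode_to_ascii('Ślusàrski'))  # Slusarski
print(unicode_to_ascii('Groß'))       # Gro
```

The last call shows the cost of this route: letters such as 'ß' or 'ł' have no canonical decomposition, so they are dropped rather than simplified. The rest of this section keeps the non-ASCII characters instead.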

In [26]:
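def get_unique_characters(filename):
    character_sets = []
    # Files are assumed to be UTF-8. Lines are not stripped, so the
    # trailing '\n' ends up in the character set as well.
    with open(filename, 'r', encoding='utf-8') as f:
        for name in f:
            character_sets.append(set(name.lower()))
    # The union of the per-name sets gives every character used in the file
    unique_characters = reduce(set.union, character_sets)
    return unique_characters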

Let's try with the first element of `file_list`, which contains the German names.

In [30]:
unique_german = get_unique_characters(file_list[0])
print(unique_german)
print(len(unique_german))

{'g', 'ß', 'ä', 'w', 'i', 'k', 'd', 'n', 'l', 'e', 'ü', 'x', 'q', 'o', 'ö', ' ', 'y', '\n', 'c', 'b', 'a', 'h', 'f', 'z', 's', 'm', 't', 'r', 'v', 'u', 'j', 'p'}
32
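The `'\n'` in the output above comes from iterating over raw lines without stripping them. If the newline is not wanted in the vocabulary, a variant along these lines would leave it out. This is a sketch only: `unique_characters_from_lines` is a hypothetical helper that takes an iterable of lines instead of a file name, and the sample names are made up.

```python
from functools import reduce

def unique_characters_from_lines(lines):
    """Hypothetical variant of get_unique_characters that ignores
    line endings, surrounding whitespace, and blank lines."""
    character_sets = [
        set(line.strip().lower()) for line in lines if line.strip()
    ]
    # Start from an empty set so that empty input does not raise
    return reduce(set.union, character_sets, set())

sample = ['Abel\n', 'Groß\n', 'Von Essen\n', '\n']
unique_sample = unique_characters_from_lines(sample)
print(sorted(unique_sample))
print(len(unique_sample))
```

Spaces inside a name, as in 'Von Essen', are kept; only whitespace at the two ends of a line is removed.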


Let's run this function on all files:

In [29]:
all_unique_characters = [get_unique_characters(f) for f in file_list]
all_unique_characters = reduce(set.union, all_unique_characters)
print(all_unique_characters)
print(len(all_unique_characters))

{"'", ',', 'g', 'ß', 'ä', 'w', 'i', 'k', 'ż', '-', '/', '1', 'á', 'e', 'õ', 'ł', 'ö', ' ', 'ñ', '\n', 'b', 'ã', 'a', 'é', ':', 's', 'm', 't', '\xa0', 'u', 'p', 'ú', 'ą', 'à', 'ì', 'd', 'ń', 'n', 'ê', 'l', 'ò', 'ü', 'x', 'q', 'ś', 'o', 'í', 'ç', 'ù', 'y', 'c', 'h', 'f', 'z', 'è', 'r', 'v', 'j', 'ó'}
59


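This gives us all the characters appearing in the files. We can turn this set into a dictionary that maps each character to an integer index; that index will later be the position of the 1 in the character's one-hot encoding. Note that a set has no defined order, so the indices assigned here can differ from one run to the next.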

In [34]:
character_dict = {char: ix for ix, char in enumerate(all_unique_characters)}
print(character_dict)

{"'": 0, ',': 1, 'g': 2, 'ß': 3, 'ä': 4, 'w': 5, 'i': 6, 'k': 7, 'ż': 8, '-': 9, '/': 10, '1': 11, 'á': 12, 'e': 13, 'õ': 14, 'ł': 15, 'ö': 16, ' ': 17, 'ñ': 18, '\n': 19, 'b': 20, 'ã': 21, 'a': 22, 'é': 23, ':': 24, 's': 25, 'm': 26, 't': 27, '\xa0': 28, 'u': 29, 'p': 30, 'ú': 31, 'ą': 32, 'à': 33, 'ì': 34, 'd': 35, 'ń': 36, 'n': 37, 'ê': 38, 'l': 39, 'ò': 40, 'ü': 41, 'x': 42, 'q': 43, 'ś': 44, 'o': 45, 'í': 46, 'ç': 47, 'ù': 48, 'y': 49, 'c': 50, 'h': 51, 'f': 52, 'z': 53, 'è': 54, 'r': 55, 'v': 56, 'j': 57, 'ó': 58}
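As a preview of how such a dictionary can be used, here is a minimal one-hot encoding sketch in plain Python. The small `vocabulary`, the `char_to_index` mapping, and the helpers `one_hot` and `encode_name` are all assumptions for illustration. Sorting the vocabulary before enumerating is one way to make the indices reproducible across runs.

```python
# Hypothetical small vocabulary; in the tutorial it would be all_unique_characters
vocabulary = {'a', 'b', 'e', 'l', 'ß'}

# Sorting makes the index assignment reproducible across runs
char_to_index = {char: ix for ix, char in enumerate(sorted(vocabulary))}

def one_hot(char, char_to_index):
    """Return a list of zeros with a single 1 at the character's index."""
    vector = [0] * len(char_to_index)
    vector[char_to_index[char]] = 1
    return vector

def encode_name(name, char_to_index):
    """Encode a name as a list of one-hot vectors, one per character."""
    return [one_hot(char, char_to_index) for char in name.lower()]

encoded = encode_name('Abel', char_to_index)
for char, vector in zip('abel', encoded):
    print(char, vector)
```

A character missing from the vocabulary raises a `KeyError` here, which is usually what we want while exploring the data. The nested lists can be converted to a tensor or array once a framework is chosen.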
