# Preparing the Names Data

The names data, available in the ./data/names directory of this repo, contains roughly 20,000 surnames from 18 languages of origin. In this notebook I'll load that data and split it into train and test sets. To improve the quality of the classifiers trained on these names, it helps to keep the training and test sets balanced across languages.

If you are not interested in the details of how I load, balance and split the data, you can safely skip this notebook without loss of continuity. The functions discussed here are also implemented in the [util module](https://github.com/bobflagg/classifying-names/blob/master/util.py), and in future notebooks I'll just use those functions to load and prepare the data.

## Loading the Data

The ./data/names directory contains 18 text files named “[Language].txt”. Each file contains one name per line.

In [4]:
DIRECTORY = './data/names/'

While reading the names, I'll [convert](http://stackoverflow.com/a/518232/2809427) them from Unicode to ASCII using the following function.

In [8]:
import string
import unicodedata

ALL_LETTERS = string.ascii_letters + " .,;'"

def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
        and c in ALL_LETTERS
    )
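
As a quick illustration of what this conversion keeps and drops, the sketch below repeats the function under a hypothetical snake_case name (`unicode_to_ascii`) so that it runs on its own. The sample strings are made up for illustration. Accented letters decompose into a base letter plus a combining mark, and the mark is removed; a letter with no such decomposition, like the Polish "ł", is not in ALL_LETTERS and is dropped entirely.

```python
import string
import unicodedata

ALL_LETTERS = string.ascii_letters + " .,;'"

def unicode_to_ascii(s):
    # Same logic as unicodeToAscii above, renamed so this sketch is standalone.
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn' and c in ALL_LETTERS
    )

# Accents are stripped, base letters are kept.
print(unicode_to_ascii('Ślusàrski'))   # Slusarski
# 'ł' has no canonical decomposition, so it is dropped rather than mapped to 'l'.
print(unicode_to_ascii('Wałęsa'))      # Waesa
```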

Following [Sean Robertson](https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html?highlight=lstm), I'll first build a dictionary of lists of names per language, {language: [names ...]}.

In [12]:
import os

language2names = {}
languages = []

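def readLines(filename):
    # Read one name per line and normalize each to ASCII.
    with open(filename, encoding='utf-8') as f:
        lines = f.read().strip().split('\n')
    return [unicodeToAscii(line) for line in lines]

# os.listdir returns entries in arbitrary order, so sort for a stable listing.
for fname in sorted(os.listdir(DIRECTORY)):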
    path = os.path.join(DIRECTORY, fname)
    language = os.path.splitext(os.path.basename(fname))[0]
    languages.append(language)
    names = readLines(path)
    language2names[language] = names

n_languages = len(languages)
print("%3s %12s  %4s" % ("  ", "language", "cnt"))
for i, language in enumerate(languages):
    print("%2d. %12s: %4d" % (i + 1, language, len(language2names[language])))

        language   cnt
 1.       Arabic: 2000
 2.      Chinese:  268
 3.        Czech:  519
 4.        Dutch:  297
 5.      English: 3668
 6.       French:  277
 7.       German:  724
 8.        Greek:  203
 9.        Irish:  232
10.      Italian:  709
11.     Japanese:  991
12.       Korean:   94
13.       Polish:  139
14.   Portuguese:   74
15.      Russian: 9408
16.     Scottish:  100
17.      Spanish:  298
18.   Vietnamese:   73
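
The counts above are far from uniform. As a rough way to quantify that, the following sketch computes each language's share of the data directly from the printed counts. The `counts` dictionary is copied by hand from the table; it is not produced by the notebook's code.

```python
# Counts copied from the table above.
counts = {
    'Arabic': 2000, 'Chinese': 268, 'Czech': 519, 'Dutch': 297,
    'English': 3668, 'French': 277, 'German': 724, 'Greek': 203,
    'Irish': 232, 'Italian': 709, 'Japanese': 991, 'Korean': 94,
    'Polish': 139, 'Portuguese': 74, 'Russian': 9408, 'Scottish': 100,
    'Spanish': 298, 'Vietnamese': 73,
}

total = sum(counts.values())
shares = {language: cnt / total for language, cnt in counts.items()}

print("total names:", total)
# List languages from most to least frequent with their share of the data.
for language, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print("%12s: %5.1f%%" % (language, 100 * share))
```

Russian and English together account for about 65% of the names, which is why it matters that the smaller languages are split one by one.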


## Splitting the Data

I'll use the [scikit-learn](https://scikit-learn.org/stable/index.html) [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function to split the samples for each language into train and test sets, holding out 10% of each language's names for testing.

In [29]:
from random import shuffle
from sklearn.model_selection import train_test_split

X_train = []
y_train = []

X_test = []
y_test = []

print(" %12s   #train    #test" % ("language", ))
for i, language in enumerate(languages):
    names_train, names_test = train_test_split(
        language2names[language], 
        test_size=0.10, 
        random_state=42
    )
    X_train.extend(names_train)
    y_train.extend([language] * len(names_train))

    X_test.extend(names_test)
    y_test.extend([language] * len(names_test))

    print("%12s: %8d %8d" % (language, len(names_train), len(names_test)))


print("%12s: %8d %8d" % ("All", len(X_train), len(X_test)))


     language   #train    #test
      Arabic:     1800      200
     Chinese:      241       27
       Czech:      467       52
       Dutch:      267       30
     English:     3301      367
      French:      249       28
      German:      651       73
       Greek:      182       21
       Irish:      208       24
     Italian:      638       71
    Japanese:      891      100
      Korean:       84       10
      Polish:      125       14
  Portuguese:       66        8
     Russian:     8467      941
    Scottish:       90       10
     Spanish:      268       30
  Vietnamese:       65        8
         All:    18060     2014
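
The per-language loop above can also be expressed as a single call by passing the labels to the `stratify` parameter of `train_test_split`, which keeps each class's proportion roughly equal in both sets. The sketch below uses small made-up lists (`names`, `labels`) rather than the real data, so the numbers are only illustrative; on the real data, per-language counts could differ slightly from the loop's because of rounding.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 20 names labelled 'A' and 10 labelled 'B'.
names = ['a%d' % i for i in range(20)] + ['b%d' % i for i in range(10)]
labels = ['A'] * 20 + ['B'] * 10

# stratify=labels asks for the same class proportions in train and test.
names_train, names_test, labels_train, labels_test = train_test_split(
    names, labels, test_size=0.10, stratify=labels, random_state=42
)

print(Counter(labels_train))
print(Counter(labels_test))
```

A stratified call also shuffles the samples, so the result is not grouped by language.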


We've now got a balanced split of the data into train and test sets (every language contributes about 10% of its names to the test set), but names for the same language are still grouped together.

In [33]:
print(y_train[:8])
print(y_test[:8])

['Arabic', 'Arabic', 'Arabic', 'Arabic', 'Arabic', 'Arabic', 'Arabic', 'Arabic']
['Arabic', 'Arabic', 'Arabic', 'Arabic', 'Arabic', 'Arabic', 'Arabic', 'Arabic']


This can be easily fixed by shuffling each set while keeping every name paired with its label.

In [34]:
Z = list(zip(X_train, y_train))
shuffle(Z)
X_train, y_train = zip(*Z)

Z = list(zip(X_test, y_test))
shuffle(Z)
X_test, y_test = zip(*Z)
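
The shuffle above draws from Python's global random state, so the resulting order changes from run to run. If a reproducible order is wanted, one option is a dedicated `random.Random` instance with a fixed seed. This is a minimal sketch on a handful of names; the helper name `shuffle_together` and the seed value are assumptions for illustration.

```python
import random

def shuffle_together(xs, ys, seed=42):
    # Shuffle two parallel sequences with one permutation so pairs stay aligned.
    pairs = list(zip(xs, ys))
    random.Random(seed).shuffle(pairs)
    xs_shuffled, ys_shuffled = zip(*pairs)
    return list(xs_shuffled), list(ys_shuffled)

names = ['Hofer', 'Mokeev', 'Eglin', 'Basara']
labels = ['German', 'Russian', 'English', 'Arabic']

shuffled_names, shuffled_labels = shuffle_together(names, labels)
# Every name is still paired with its original label.
print(list(zip(shuffled_names, shuffled_labels)))
```

Unlike the cell above, this helper returns lists rather than tuples.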

In [35]:
print(y_train[:8])
print(y_test[:8])

('Russian', 'German', 'Russian', 'Japanese', 'Russian', 'English', 'Russian', 'Russian')
('English', 'Dutch', 'Russian', 'Arabic', 'Czech', 'Czech', 'Japanese', 'Russian')


## Encoding Names

To feed the names into a neural network it will be convenient to regard a name as a list of characters and to map characters to integers. For this I'll create a character-to-integer mapping, along with functions for encoding and decoding names.

In [42]:
chars = ALL_LETTERS
num_chars = len(chars)

print("num_chars:", num_chars)

char2int = {ch:ii for ii, ch in enumerate(chars)}
int2char = {ii:ch for ch, ii in char2int.items()}

def encode_names(some_names):
    # Map each name to the list of integer codes of its characters.
    return [[char2int[c] for c in name] for name in some_names]
    
def decode_name(name): 
    return "".join(int2char[i] for i in name)    

def decode_names(some_names): 
    return [decode_name(name) for name in some_names]

num_chars: 57
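
To make the mapping concrete: since ALL_LETTERS starts with the lowercase letters followed by the uppercase ones, "a" maps to 0 and "A" to 26. The standalone sketch below rebuilds the two dictionaries and round-trips a single name; `encode_name` is a hypothetical single-name variant of `encode_names`.

```python
import string

ALL_LETTERS = string.ascii_letters + " .,;'"
char2int = {ch: i for i, ch in enumerate(ALL_LETTERS)}
int2char = {i: ch for ch, i in char2int.items()}

def encode_name(name):
    # Hypothetical helper: integer codes for one name.
    return [char2int[c] for c in name]

def decode_name(codes):
    return "".join(int2char[i] for i in codes)

codes = encode_name("Hofer")
print(codes)               # [33, 14, 5, 4, 17]
print(decode_name(codes))  # Hofer
```

A character outside ALL_LETTERS would raise a KeyError here, which is one reason the names are passed through unicodeToAscii first.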


Let's use these methods to build encoded versions of the data.

In [37]:
X_train_encoded = encode_names(X_train)
X_test_encoded = encode_names(X_test)

Here's a sanity check to verify the encodings are working as expected.

In [39]:
n = 10

print("Train names sanity check:")
X_train_decoded = decode_names(X_train_encoded[:n])
for i in range(n):
    print("%19s -->> %s" % (X_train_decoded[i], y_train[i]))
    
print("\nTest names sanity check:")
X_test_decoded = decode_names(X_test_encoded[:n])
for i in range(n):
    print("%19s -->> %s" % (X_test_decoded[i], y_test[i]))

Train names sanity check:
          Chernovol -->> Russian
              Hofer -->> German
         Halapkhaev -->> Russian
            Tadeshi -->> Japanese
             Mokeev -->> Russian
        Whittingham -->> English
           Tunnikov -->> Russian
           Baisarov -->> Russian
      Christodoulou -->> Greek
            Liharev -->> Russian

Test names sanity check:
              Eglin -->> English
              Simon -->> Dutch
             Onikov -->> Russian
             Basara -->> Arabic
              Rezac -->> Czech
            Wykruta -->> Czech
           Kimiyama -->> Japanese
         Bekmahanov -->> Russian
                 Ba -->> Arabic
              Tahan -->> Arabic


## Embedding Names

Names are now represented as lists of integers, but this is not a convenient representation for our classification algorithms. The fix requires two changes. The first is to represent a single letter as a “one-hot vector” of size 1 x num_chars. A one-hot vector is filled with 0s except for a 1 at the index of the current letter, e.g. "b" = <0 1 0 0 0 ...>. With this change the names are represented as lists of one-hot vectors, but these lists may have different lengths for different names. The second change fixes this by padding the lists with all-zero vectors where required so that they all have the same length.
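
Here is a small NumPy sketch of both changes, using a made-up `seq_length` of 4 so the result is easy to inspect. The padding is placed in front of the name in this sketch; the helper name `one_hot_name` is hypothetical.

```python
import string
import numpy as np

ALL_LETTERS = string.ascii_letters + " .,;'"
num_chars = len(ALL_LETTERS)
seq_length = 4  # illustrative value, not the one computed from the data

def one_hot_name(name):
    # Rows are character positions, columns are entries of ALL_LETTERS.
    matrix = np.zeros((seq_length, num_chars), dtype=np.float32)
    offset = seq_length - len(name)  # leading rows stay all-zero (padding)
    for position, ch in enumerate(name, start=offset):
        matrix[position, ALL_LETTERS.index(ch)] = 1
    return matrix

m = one_hot_name("Ba")
print(m.shape)            # (4, 57)
print(m.sum(axis=1))      # [0. 0. 1. 1.]
print(m.argmax(axis=1))   # [ 0  0 27  0]
```

Note that `argmax` returns 0 both for a padding row and for the letter "a", so a decoder also has to look at the row's maximum value to tell the two apart.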

Let's find the maximum length of a name in our data.

In [43]:
import numpy as np

seq_length = np.max([len(name) for name in X_train + X_test])
print("seq-length: %d" % seq_length)

seq-length: 19


The following functions represent a list of $n$ names as a tensor of shape $n$ x seq_length x num_chars, and convert such a tensor back into names.

In [45]:
import torch

def embed_names(some_names):
    # One-hot encode each name, left-padding with all-zero rows up to seq_length.
    some_names_encoded = encode_names(some_names)
    some_names_embedded = np.zeros((len(some_names), seq_length, num_chars), dtype=np.float32)
    for index, name in enumerate(some_names_encoded):
        offset = seq_length - len(name)
        for position, char in enumerate(name, start=offset):
            some_names_embedded[index, position, char] = 1
    return torch.from_numpy(some_names_embedded)

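def embed2name(values, indices):
    # Padding rows are all zeros, so only positions whose top value is 1 hold a letter.
    return decode_name([int(indices[i]) for i in range(seq_length) if values[i] == 1])
    
def embed2names(names):
    batch_size = names.shape[0]
    values, indices = torch.topk(names, 1)
    # Squeeze only the last (k=1) dimension so a batch of one name keeps its batch axis.
    values = values.squeeze(-1)
    indices = indices.squeeze(-1)
    return [
        embed2name(values[i], indices[i]) for i in range(batch_size)
    ]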
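
The decoding step relies on the fact that padding rows are all zeros: their largest value is 0, whereas a real letter row has a largest value of 1. The sketch below shows the same idea with NumPy's `max` and `argmax` in place of `torch.topk`, on a hand-built array. It is an analogue for illustration, not the code used in this notebook, and `decode_embedded` is a hypothetical name.

```python
import string
import numpy as np

ALL_LETTERS = string.ascii_letters + " .,;'"

def decode_embedded(matrix):
    # matrix has shape (seq_length, num_chars) with all-zero padding rows.
    values = matrix.max(axis=-1)      # 1 for letter rows, 0 for padding
    indices = matrix.argmax(axis=-1)  # column of the 1 (0 for padding rows)
    return "".join(ALL_LETTERS[i] for v, i in zip(values, indices) if v == 1)

# Hand-built example: two padding rows followed by "B" and "a".
matrix = np.zeros((4, len(ALL_LETTERS)), dtype=np.float32)
matrix[2, ALL_LETTERS.index("B")] = 1
matrix[3, ALL_LETTERS.index("a")] = 1

print(decode_embedded(matrix))  # Ba
```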

Here's another sanity check to verify our embedding behaves as expected.

In [46]:
X_train_embedded = embed_names(X_train)
print("X_train_embedded:", X_train_embedded.shape, ";", X_train_embedded.dtype)

n = 10
some_names = embed2names(X_train_embedded[:n])
for i in range(n):
    print("%19s -->> %s" % (some_names[i], y_train[i]))
    
X_test_embedded = embed_names(X_test)
print("X_test_embedded:", X_test_embedded.shape, ";", X_test_embedded.dtype)

n = 10
some_names = embed2names(X_test_embedded[:n])
for i in range(n):
    print("%19s -->> %s" % (some_names[i], y_test[i]))

X_train_embedded: torch.Size([18060, 19, 57]) ; torch.float32
          Chernovol -->> Russian
              Hofer -->> German
         Halapkhaev -->> Russian
            Tadeshi -->> Japanese
             Mokeev -->> Russian
        Whittingham -->> English
           Tunnikov -->> Russian
           Baisarov -->> Russian
      Christodoulou -->> Greek
            Liharev -->> Russian
X_test_embedded: torch.Size([2014, 19, 57]) ; torch.float32
              Eglin -->> English
              Simon -->> Dutch
             Onikov -->> Russian
             Basara -->> Arabic
              Rezac -->> Czech
            Wykruta -->> Czech
           Kimiyama -->> Japanese
         Bekmahanov -->> Russian
                 Ba -->> Arabic
              Tahan -->> Arabic


## Encoding Labels

Feeding labels into a neural network also requires an encoding, which is given below.

In [50]:
distinct_labels = list(set(y_train + y_test))
distinct_labels.sort()
num_classes = len(distinct_labels)
label2int = {l:i for i, l in enumerate(distinct_labels)}
int2label = {i:l for i, l in enumerate(distinct_labels)}

def encode_labels(some_labels): 
    encoded_labels = np.array([label2int[label] for label in some_labels], dtype=np.float32)
    return torch.from_numpy(encoded_labels)

def codes2labels(some_labels): 
    return [int2label[int(i)] for i in some_labels]

y_train_encoded = encode_labels(y_train)
y_test_encoded = encode_labels(y_test)
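
One thing to be aware of: the labels above are stored as float32. If they are later passed as class indices to a loss such as `torch.nn.CrossEntropyLoss`, PyTorch expects an integer (int64) tensor instead. Whether that applies depends on how later notebooks use the labels, so the following is only a sketch of an integer variant, shown with NumPy and a few sample labels; `encode_labels_int` is a hypothetical name.

```python
import numpy as np

labels = ['Russian', 'German', 'Russian', 'Greek']  # illustrative sample

distinct_labels = sorted(set(labels))
label2int = {l: i for i, l in enumerate(distinct_labels)}
int2label = {i: l for l, i in label2int.items()}

def encode_labels_int(some_labels):
    # Integer class indices; torch.from_numpy on this array gives a torch.int64 tensor.
    return np.array([label2int[l] for l in some_labels], dtype=np.int64)

codes = encode_labels_int(labels)
print(codes)                               # [2 0 2 1]
print([int2label[int(c)] for c in codes])
```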


Here's the expected sanity check.

In [51]:
print("encoded train labels:", y_train_encoded.shape, ";", y_train_encoded.dtype)
n = 10
some_names = embed2names(X_train_embedded[:n])
some_labels = codes2labels(y_train_encoded[:n])
for i in range(n):
    print("%19s -->> %s" % (some_names[i], some_labels[i]))

print("\nencoded test labels:", y_test_encoded.shape, ";", y_test_encoded.dtype)
n = 10
some_names = embed2names(X_test_embedded[:n])
some_labels = codes2labels(y_test_encoded[:n])
for i in range(n):
    print("%19s -->> %s" % (some_names[i], some_labels[i]))

encoded train labels: torch.Size([18060]) ; torch.float32
          Chernovol -->> Russian
              Hofer -->> German
         Halapkhaev -->> Russian
            Tadeshi -->> Japanese
             Mokeev -->> Russian
        Whittingham -->> English
           Tunnikov -->> Russian
           Baisarov -->> Russian
      Christodoulou -->> Greek
            Liharev -->> Russian

encoded test labels: torch.Size([2014]) ; torch.float32
              Eglin -->> English
              Simon -->> Dutch
             Onikov -->> Russian
             Basara -->> Arabic
              Rezac -->> Czech
            Wykruta -->> Czech
           Kimiyama -->> Japanese
         Bekmahanov -->> Russian
                 Ba -->> Arabic
              Tahan -->> Arabic
