## [NLP FROM SCRATCH: CLASSIFYING NAMES WITH A CHARACTER-LEVEL RNN](https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html#nlp-from-scratch-classifying-names-with-a-character-level-rnn)

##### We will build and train a basic character-level RNN to classify words. This tutorial, along with the following two, shows how to preprocess data for NLP modeling “from scratch”, in particular without using many of the convenience functions of torchtext, so you can see how preprocessing for NLP modeling works at a low level.

##### A character-level RNN reads a word as a sequence of characters, outputting a prediction and a “hidden state” at each step and feeding the previous hidden state into the next step. We take the final prediction as the output, i.e. the class the word belongs to.
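
The recurrence described above can be sketched in a few lines of plain Python. This is a toy illustration, not the model trained in this tutorial: the function names (`matvec`, `rnn_step`, `classify`), the tanh nonlinearity, and the small hand-picked weight matrices are all assumptions made for the example.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def rnn_step(x, h, w_xh, w_hh, w_hy):
    """One step: combine input x with the previous hidden state h.

    Returns (prediction scores, new hidden state).
    """
    pre = [a + b for a, b in zip(matvec(w_xh, x), matvec(w_hh, h))]
    h_new = [math.tanh(p) for p in pre]
    scores = matvec(w_hy, h_new)
    return scores, h_new

def classify(sequence, w_xh, w_hh, w_hy):
    """Feed inputs one at a time; return the class index of the final prediction."""
    h = [0.0] * len(w_hh)  # initial hidden state is all zeros
    scores = None
    for x in sequence:
        # The hidden state produced at this step is fed into the next one.
        scores, h = rnn_step(x, h, w_xh, w_hh, w_hy)
    return scores.index(max(scores))

# Hypothetical toy sizes: 3 input symbols, 2 hidden units, 2 classes.
W_XH = [[0.5, -0.5, 0.1], [-0.3, 0.8, 0.2]]  # hidden x input
W_HH = [[0.1, 0.0], [0.0, 0.1]]              # hidden x hidden
W_HY = [[1.0, -1.0], [-1.0, 1.0]]            # classes x hidden

one_hot_sequence = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
predicted_class = classify(one_hot_sequence, W_XH, W_HH, W_HY)
print(predicted_class)
```

Only the scores from the last step are used for the decision; the intermediate predictions are computed and then discarded.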

#### Specifically, we’ll train on a few thousand surnames from 18 languages of origin, and predict which language a name is from based on the spelling:

In [2]:
from glob import glob

In [3]:
import string

In [4]:
from tqdm import tqdm
import urllib.request  # the request submodule must be imported explicitly
from zipfile import ZipFile
import os

In [5]:
url = "https://download.pytorch.org/tutorial/data.zip"

In [6]:
home = os.environ['HOME']
data_dir = f"{home}/torch/"
tar_file = data_dir + url.split('/')[-1]

In [7]:
class TqdmUpTo(tqdm):
    def update_to(self, b=1, bsize=1, tsize=None):
        if tsize is not None:
            self.total = tsize
        self.update(b * bsize - self.n)

In [8]:
# Create the data directory (and any missing parents) if it does not exist yet
os.makedirs(data_dir, exist_ok=True)

In [9]:
with TqdmUpTo(unit='B', unit_scale=True, miniters=1, desc=tar_file) as t:
    urllib.request.urlretrieve(url=url, filename=tar_file, reporthook=t.update_to)

/home/drclab/torch/data.zip: 2.88MB [00:00, 3.54MB/s]                            


In [10]:

# Use a name other than `zip` so the built-in function is not shadowed
with ZipFile(tar_file, "r") as zf:
    zf.extractall(data_dir)

In [11]:
for r, d, files in os.walk(data_dir):
    print(r, d, files)

/home/drclab/torch/ ['data'] ['data.zip']
/home/drclab/torch/data ['names'] ['eng-fra.txt']
/home/drclab/torch/data/names [] ['Arabic.txt', 'Irish.txt', 'Japanese.txt', 'Spanish.txt', 'Vietnamese.txt', 'Korean.txt', 'Portuguese.txt', 'Greek.txt', 'Polish.txt', 'Russian.txt', 'Dutch.txt', 'Scottish.txt', 'French.txt', 'German.txt', 'Chinese.txt', 'Czech.txt', 'Italian.txt', 'English.txt']


In [12]:
glob(data_dir+"data/names/*.txt")

['/home/drclab/torch/data/names/Arabic.txt',
 '/home/drclab/torch/data/names/Irish.txt',
 '/home/drclab/torch/data/names/Japanese.txt',
 '/home/drclab/torch/data/names/Spanish.txt',
 '/home/drclab/torch/data/names/Vietnamese.txt',
 '/home/drclab/torch/data/names/Korean.txt',
 '/home/drclab/torch/data/names/Portuguese.txt',
 '/home/drclab/torch/data/names/Greek.txt',
 '/home/drclab/torch/data/names/Polish.txt',
 '/home/drclab/torch/data/names/Russian.txt',
 '/home/drclab/torch/data/names/Dutch.txt',
 '/home/drclab/torch/data/names/Scottish.txt',
 '/home/drclab/torch/data/names/French.txt',
 '/home/drclab/torch/data/names/German.txt',
 '/home/drclab/torch/data/names/Chinese.txt',
 '/home/drclab/torch/data/names/Czech.txt',
 '/home/drclab/torch/data/names/Italian.txt',
 '/home/drclab/torch/data/names/English.txt']

In [13]:
all_letters = string.ascii_letters +" .,;'"

In [15]:
n_letters =  len(all_letters)

In [16]:
n_letters

57

In [14]:
import unicodedata

In [20]:
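def unicodeToAscii(s):
    # Decompose accented characters (NFD), drop the combining marks ('Mn'),
    # and keep only characters present in all_letters
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
        and c in all_letters
    )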

In [21]:
unicodeToAscii('Ślusàrski')

'Slusarski'
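
To see why this works, it can help to look at what NFD normalization does to an accented character. The sketch below uses only the standard `unicodedata` module; the helper name `describe` is made up for illustration.

```python
import unicodedata

def describe(text):
    """List (character name, Unicode category) for each code point after NFD."""
    decomposed = unicodedata.normalize('NFD', text)
    return [(unicodedata.name(c), unicodedata.category(c)) for c in decomposed]

# NFD splits 'Ś' into a base letter plus a combining mark (category 'Mn').
for name, category in describe('Ś'):
    print(category, name)
# Lu LATIN CAPITAL LETTER S
# Mn COMBINING ACUTE ACCENT
```

Dropping every `'Mn'` code point leaves the base letters, and the `c in all_letters` check then discards anything that is still outside the allowed alphabet.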

In [36]:
def readLines(filename):
    # Read one name per line and convert each to plain ASCII
    with open(filename, encoding='utf-8') as f:
        lines = f.read().strip().split('\n')
    return [unicodeToAscii(line) for line in lines]

In [22]:
category_lines = {}
all_categories = []

In [26]:
def findFiles(path):
    return glob(path)

In [37]:
for filename in findFiles(data_dir + "data/names/*.txt"):
    category = os.path.splitext(os.path.basename(filename))[0]
    if category not in all_categories:  # stay duplicate-free if this cell is re-run
        all_categories.append(category)
    lines = readLines(filename)
    category_lines[category] = lines

In [39]:
category_lines.keys()

dict_keys(['Arabic', 'Irish', 'Japanese', 'Spanish', 'Vietnamese', 'Korean', 'Portuguese', 'Greek', 'Polish', 'Russian', 'Dutch', 'Scottish', 'French', 'German', 'Chinese', 'Czech', 'Italian', 'English'])

In [40]:
n_categories = len(all_categories)

In [41]:
n_categories

18

_____

To represent a single letter, we use a “one-hot vector” of size <1 x n_letters>. A one-hot vector is filled with 0s except for a 1 at the index of the current letter, e.g. "b" = <0 1 0 0 0 ...>.

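To make a word, we stack these one-hot vectors into a 3D tensor of size <line_length x 1 x n_letters>. The extra dimension of size 1 is there because PyTorch assumes everything is in batches; we use a batch size of 1 here.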
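
As a quick check of the encoding described above, here is a minimal torch-free sketch using nested lists. The helper names (`one_hot`, `encode_word`, `decode_word`) are hypothetical; they only mirror the shapes involved and are not the tensor-based functions defined in the cells that follow.

```python
import string

ALPHABET = string.ascii_letters + " .,;'"  # same 57 characters as all_letters

def one_hot(letter):
    """Return a list of len(ALPHABET) zeros with a single 1 at the letter's index."""
    index = ALPHABET.find(letter)
    if index < 0:
        raise ValueError(f"{letter!r} is not in the alphabet")
    vector = [0] * len(ALPHABET)
    vector[index] = 1
    return vector

def encode_word(word):
    """Shape <len(word) x 1 x len(ALPHABET)>: one single-row 'batch' per letter."""
    return [[one_hot(letter)] for letter in word]

def decode_word(encoded):
    """Invert encode_word by reading off the position of the 1 in each row."""
    return ''.join(ALPHABET[row[0].index(1)] for row in encoded)

encoded = encode_word('John')
print(len(encoded), len(encoded[0]), len(encoded[0][0]))  # 4 1 57
print(decode_word(encoded))                               # John
```

One design choice in this sketch: it raises an error for characters outside the alphabet. `str.find` returns -1 for a missing character, and using -1 directly as an index would silently set the last position instead.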

In [42]:
all_letters.find('x')

23

In [43]:
def letterToIndex(letter):
    return all_letters.find(letter)

In [46]:
import torch

In [47]:
# Just for demonstration, turn a letter into a <1 x n_letters> Tensor
def letterToTensor(letter):
    tensor = torch.zeros(1, n_letters)
    tensor[0][letterToIndex(letter)] = 1
    return tensor

In [48]:
letterToTensor('b')

tensor([[0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0.]])

In [49]:
def lineToTensor(line):
    tensor = torch.zeros(len(line), 1, n_letters)
    for li, letter in enumerate(line):
        tensor[li][0][letterToIndex(letter)] = 1
    return tensor

In [50]:
lineToTensor('John')

tensor([[[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0.]],

        [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0.]],

        [[0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0.]],

        [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0