# Lab 10: RNNs

Today we'll do some preliminary investigation of simple RNNs similar to the form of Figure 10.3 in Goodfellow et al.

## First tutorial: Predicting the language a surname comes from

Let's begin with one of the official PyTorch tutorials, which classifies surnames from 18 languages based on their character sequences (https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html).

## Runtime environment

We won't get much benefit from a GPU in today's lab: the examples are not batched, and batching them is not straightforward because words have different lengths. Since the GPU server won't help us much, we'll run locally using Jupyter.

## Surnames dataset
Get your Jupyter environment up and running, then download the dataset (https://download.pytorch.org/tutorial/data.zip) and unzip it in your project directory.

Here's code directly from the tutorial that reads the names into a dictionary of the form { language1: [name1, name2, ...], language2: ... }:

In [1]:
from __future__ import unicode_literals, print_function, division
from io import open
import glob
import os
import unicodedata
import string

In [2]:
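# Return every file path matching a glob pattern
def findFiles(path):
    return glob.glob(path)

print(findFiles('data/names/*.txt'))

# The alphabet: ASCII letters plus a few punctuation marks
all_letters = string.ascii_letters + " .,;'"
n_letters = len(all_letters)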

['data/names/Spanish.txt', 'data/names/Irish.txt', 'data/names/Chinese.txt', 'data/names/Arabic.txt', 'data/names/Vietnamese.txt', 'data/names/English.txt', 'data/names/Russian.txt', 'data/names/Portuguese.txt', 'data/names/Dutch.txt', 'data/names/Scottish.txt', 'data/names/Greek.txt', 'data/names/Czech.txt', 'data/names/Italian.txt', 'data/names/Korean.txt', 'data/names/French.txt', 'data/names/Polish.txt', 'data/names/Japanese.txt', 'data/names/German.txt']
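As a quick check on the alphabet defined above, the sketch below rebuilds `all_letters` on its own and inspects a few indices. It only assumes the same definition as the tutorial cell; nothing here needs the dataset.

```python
import string

# Same alphabet as the tutorial cell: 52 ASCII letters plus five punctuation marks.
all_letters = string.ascii_letters + " .,;'"
n_letters = len(all_letters)

print(n_letters)               # 57
print(all_letters.find('a'))   # 0  (lowercase letters come first)
print(all_letters.find('A'))   # 26 (uppercase letters follow)
print(all_letters.find('é'))   # -1 (not part of the alphabet)
```

The value of `n_letters` matters later because it fixes the length of every one-hot vector.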


In [3]:
# Turn a Unicode string to plain ASCII, thanks to https://stackoverflow.com/a/518232/2809427
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
        and c in all_letters
    )

print(unicodeToAscii('Ślusàrski'))

Slusarski
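To see why `unicodeToAscii` works, it may help to look at what NFD normalization does to a single character. The sketch below is a standalone illustration; `unicode_to_ascii` is a hypothetical snake_case copy of the tutorial function, and the sample name 'Wałęsa' is an assumption chosen to show a limitation.

```python
import string
import unicodedata

all_letters = string.ascii_letters + " .,;'"

# NFD splits an accented character into a base letter plus combining marks.
decomposed = unicodedata.normalize('NFD', 'Ś')
print([unicodedata.category(c) for c in decomposed])  # ['Lu', 'Mn']

def unicode_to_ascii(s):
    # Drop combining marks (category 'Mn') and anything outside the alphabet.
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn' and c in all_letters
    )

print(unicode_to_ascii('Ślusàrski'))  # Slusarski

# Letters with no decomposition (such as 'ł') are not mapped to a base letter;
# they are dropped entirely because they are not in all_letters.
print(unicode_to_ascii('Wałęsa'))     # Waesa
```

So the function strips accents where Unicode defines a decomposition, and silently removes characters where it does not.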


In [12]:
# Build the category_lines dictionary, a list of names per language
category_lines = {}
all_categories = []

# Read a file and split into lines
def readLines(filename):
    # Use a context manager so the file is closed after reading
    with open(filename, encoding='utf-8') as f:
        lines = f.read().strip().split('\n')
    return [unicodeToAscii(line) for line in lines]

for filename in findFiles('data/names/*.txt'):
    category = os.path.splitext(os.path.basename(filename))[0]
    all_categories.append(category)
    lines = readLines(filename)
    category_lines[category] = lines

n_categories = len(all_categories)

# test
for c in all_categories[:2]:
    print(c)
    print(category_lines[c]) 

Spanish
['Abana', 'Abano', 'Abarca', 'Abaroa', 'Abascal', 'Abasolo', 'Abel', 'Abello', 'Aberquero', 'Abreu', 'Acosta', 'Agramunt', 'Aiza', 'Alamilla', 'Albert', 'Albuquerque', 'Aldana', 'Alfaro', 'Alvarado', 'Alvarez', 'Alves', 'Amador', 'Andreu', 'Antunez', 'Aqua', 'Aquino', 'Araujo', 'Araullo', 'Araya', 'Arce', 'Arechavaleta', 'Arena', 'Aritza', 'Armando', 'Arreola', 'Arriola', 'Asis', 'Asturias', 'Avana', 'Azarola', 'Banderas', 'Barros', 'Basurto', 'Bautista', 'Bello', 'Belmonte', 'Bengochea', 'Benitez', 'Bermudez', 'Blanco', 'Blanxart', 'Bolivar', 'Bonaventura', 'Bosque', 'Bustillo', 'Busto', 'Bustos', 'Cabello', 'Cabrera', 'Campo', 'Campos', 'Capello', 'Cardona', 'Caro', 'Casales', 'Castell', 'Castellano', 'Castillion', 'Castillo', 'Castro', 'Chavarria', 'Chavez', 'Colon', 'Costa', 'Crespo', 'Cruz', 'Cuellar', 'Cuevas', "D'cruz", "D'cruze", 'De la cruz', 'De la fuente', 'Del bosque', 'De leon', 'Delgado', 'Del olmo', 'De santigo', 'Diaz', 'Dominguez', 'Duarte', 'Durante', 'Echevar

OK, try it out. You can see some results with a query like

In [14]:
print(category_lines['Italian'][:5])

['Abandonato', 'Abatangelo', 'Abatantuono', 'Abate', 'Abategiovanni']
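If the dataset is not at hand, the dictionary-building logic can still be tried on a throwaway directory. The sketch below is only an illustration: the two files and the surnames in `sample_files` are made up, and `sample_category_lines` is a hypothetical name chosen so it does not overwrite the real `category_lines`.

```python
import glob
import os
import tempfile

# Hypothetical miniature dataset; file contents are made up for illustration.
sample_files = {
    'Italian.txt': 'Abate\nRossi\n',
    'Irish.txt': 'Adam\nBrady\n',
}

sample_category_lines = {}
with tempfile.TemporaryDirectory() as tmp_dir:
    # Write the sample files into a temporary directory.
    for name, text in sample_files.items():
        with open(os.path.join(tmp_dir, name), 'w', encoding='utf-8') as f:
            f.write(text)

    for filename in sorted(glob.glob(os.path.join(tmp_dir, '*.txt'))):
        # The category is the file name without directory or extension.
        category = os.path.splitext(os.path.basename(filename))[0]
        with open(filename, encoding='utf-8') as f:
            sample_category_lines[category] = f.read().strip().split('\n')

print(sample_category_lines)
# {'Irish': ['Adam', 'Brady'], 'Italian': ['Abate', 'Rossi']}
```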


## One-hot Encoding (What is it?)

Reference: https://victorzhou.com/blog/one-hot/

One-hot encoding takes a single integer and produces a vector in which one element is 1 and all other elements are 0, such as [0, 1, 0, 0]. It suits text generation because it treats all characters as mutually orthogonal. Simply assigning each category an integer, as in the Value column below, is known as integer encoding. For machine learning, integer encoding can be problematic because it implies an ordering among categories that does not actually exist.

|Character|Value|One-Hot|
|---------|-----|-------|
|A|1|1 0 0 0 ...|
|B|2|0 1 0 0 ...|
|C|3|0 0 1 0 ...|

One-hot encoding is not limited to characters. Words, such as the names of colors, trees, and so on, can be encoded the same way.

|Word|Value|One-Hot|
|---------|-----|-------|
|Red|1|1 0 0 0 ...|
|Blue|2|0 1 0 0 ...|
|Green|3|0 0 1 0 ...|
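The difference between integer and one-hot encoding can be checked numerically. The sketch below uses the three colors from the table; the helper names `distance` and `dot` are assumptions for illustration, and only the standard library is used.

```python
import math

colors = ['Red', 'Blue', 'Green']

# Integer encoding: each category gets an arbitrary integer (1-based, as in the table).
integer_code = {c: i + 1 for i, c in enumerate(colors)}

# One-hot encoding: each category gets its own axis.
one_hot = {
    c: [1.0 if j == i else 0.0 for j in range(len(colors))]
    for i, c in enumerate(colors)
}

def distance(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def dot(u, v):
    """Dot product of two vectors."""
    return sum(a * b for a, b in zip(u, v))

# Integer codes suggest Red is "closer" to Blue than to Green.
print(abs(integer_code['Red'] - integer_code['Blue']))   # 1
print(abs(integer_code['Red'] - integer_code['Green']))  # 2

# One-hot vectors are orthogonal and equally far apart.
print(dot(one_hot['Red'], one_hot['Blue']))              # 0.0
print(distance(one_hot['Red'], one_hot['Blue'])
      == distance(one_hot['Red'], one_hot['Green']))     # True
```

The integer codes impose a spurious notion of closeness, while the one-hot vectors carry no ordering at all.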

## One-hot representation of characters
Next, we'll convert each letter in each word to a one-hot representation, for example "b" = < 0 1 0 0 0 ...>. The tensor for one line will have size <line_length x 1 x n_letters>. The first dimension indexes the characters in a given line of a data file, the second dimension is the index into the batch (we have a batch size of 1 here), and the third dimension indexes the different letters of the alphabet.

Before the PyTorch version, here is how one-hot encoding looks in two other libraries.

### Examples in other libraries

In [15]:
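# Using scikit-learn's OneHotEncoder
from sklearn.preprocessing import OneHotEncoder

# The encoder returns a sparse matrix by default; toarray() makes it dense.
# (The older sparse=False argument was renamed sparse_output in scikit-learn 1.2.)
encoder = OneHotEncoder()
print(encoder.fit_transform([['red'], ['green'], ['blue']]).toarray())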

[[0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]]


In [16]:
# Using numpy
import numpy as np

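arr = [2, 1, 0]
# Number of classes; avoid naming this `max`, which would shadow the built-in
n_classes = np.max(arr) + 1
# Row i of the identity matrix is the one-hot vector for class i
print(np.eye(n_classes)[arr])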

[[0. 0. 1.]
 [0. 1. 0.]
 [1. 0. 0.]]


### Back to the real code: using PyTorch

In [None]:
import torch

# Find letter index from all_letters, e.g. "a" = 0
def letterToIndex(letter):
    return all_letters.find(letter)

# Just for demonstration, turn a letter into a <1 x n_letters> Tensor
def letterToTensor(letter):
    tensor = torch.zeros(1, n_letters)
    tensor[0][letterToIndex(letter)] = 1
    return tensor

# Turn a line into a <line_length x 1 x n_letters>,
# or an array of one-hot letter vectors
def lineToTensor(line):
    tensor = torch.zeros(len(line), 1, n_letters)
    for li, letter in enumerate(line):
        tensor[li][0][letterToIndex(letter)] = 1
    return tensor

print(letterToTensor('J'))

print(lineToTensor('Jones').size())
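To sanity-check the shape and see how a one-hot line decodes back to text, here is a dependency-free sketch that mirrors `lineToTensor` with nested lists instead of tensors. The names `line_to_nested_list` and `nested_list_to_line` are hypothetical helpers, not part of the tutorial.

```python
import string

all_letters = string.ascii_letters + " .,;'"
n_letters = len(all_letters)

def line_to_nested_list(line):
    """Pure-Python stand-in for lineToTensor: line_length x 1 x n_letters."""
    rows = []
    for letter in line:
        one_hot = [0] * n_letters
        one_hot[all_letters.find(letter)] = 1
        rows.append([one_hot])  # the inner list is the batch dimension of size 1
    return rows

def nested_list_to_line(rows):
    """Decode by locating the single 1 in each one-hot vector."""
    return ''.join(all_letters[row[0].index(1)] for row in rows)

encoded = line_to_nested_list('Jones')
shape = (len(encoded), len(encoded[0]), len(encoded[0][0]))
print(shape)                         # (5, 1, 57)
print(nested_list_to_line(encoded))  # Jones

# Caveat: find() returns -1 for an unknown character, and index -1 is the
# last slot, so 'é' is silently encoded as the apostrophe.
print(nested_list_to_line(line_to_nested_list('é')))  # '
```

The same caveat applies to the tensor version, since negative indices also address the last element there. It should not arise for names that have passed through `unicodeToAscii`, which filters out anything not in `all_letters`.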