In [2]:
# For tips on running notebooks in Google Colab, see
# https://pytorch.org/tutorials/beginner/colab

NLP From Scratch: Generating Names with a Character-Level RNN
=============================================================

**Author**: [Sean Robertson](https://github.com/spro). Note: this notebook is modified from the original.

We are still hand-crafting a small RNN with a few linear layers. The big
difference is that instead of predicting a category after reading in all
the letters of a name, we input a category and output one letter at a time.
Recurrently predicting characters to form language (this could also be
done with words or other higher-order constructs) is often referred to
as a "language model".

Preparing the Data
------------------

The data are saved in the directory `data/names`, which contains 18 text
files named `[Language].txt`. Each file contains a bunch of names, one name
per line, mostly romanized (but we still need to convert from Unicode to
ASCII).

We'll end up with a dictionary of lists of names per language,
`{language: [names ...]}`. The generic variables "category" and
"line" (for language and name in our case) are used for later
extensibility.


In [3]:
# If you are running on your local machine, keep the line below uncommented and leave the Colab block in the next cell commented out.
f_path = 'data/names/*.txt'

In [4]:
# # If you are using Colab, uncomment all the lines below, then follow these instructions:
# # You will be asked for permission to access your Google Drive; please accept it.
# !git clone https://github.com/VictorCeballos/KAUST-AI-SS.git
# from google.colab import drive
# drive.mount('/content/drive')

# # - Locate /data/names/

# f_path = '/content/KAUST-AI-SS/Week 5 - Natural Language Processing/Labs/day7/data/names/*.txt'

# # - Providing the correct path will let you run the next block without an error.

In [5]:
from io import open
import glob
import os
import unicodedata
import string

all_letters = string.ascii_letters + " .,;'-"
n_letters = len(all_letters) + 1 # Plus EOS marker

def findFiles(path): return glob.glob(path)

# Turn a Unicode string to plain ASCII, thanks to https://stackoverflow.com/a/518232/2809427
def unicodeToAscii(s):
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
        and c in all_letters
    )

# Read a file and split into lines
def readLines(filename):
    with open(filename, encoding='utf-8') as some_file:
        return [unicodeToAscii(line.strip()) for line in some_file]

# Build the category_lines dictionary, a list of lines per category
category_lines = {}
all_categories = []
for filename in findFiles(f_path):
    category = os.path.splitext(os.path.basename(filename))[0]
    all_categories.append(category)
    lines = readLines(filename)
    category_lines[category] = lines

n_categories = len(all_categories)

if n_categories == 0:
    raise RuntimeError('Data not found. Make sure that you downloaded the data '
        'from https://download.pytorch.org/tutorial/data.zip and extracted it to '
        'the current directory.')

print('# categories:', n_categories, all_categories)
print(unicodeToAscii("O'Néàl"))

# categories: 18 ['Czech', 'German', 'Arabic', 'Japanese', 'Chinese', 'Vietnamese', 'Russian', 'French', 'Irish', 'English', 'Spanish', 'Greek', 'Italian', 'Portuguese', 'Scottish', 'Dutch', 'Korean', 'Polish']
O'Neal


Task-1: Creating the Network
============================

This network extends the last tutorial's RNN with an extra argument for
the category tensor, which is concatenated with the letter input and the
hidden state. The category tensor is a one-hot vector just like
the letter input.
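
As a minimal sketch of that concatenation, the snippet below builds one-hot tensors and joins them with a zero hidden state. The helper `one_hot` and the toy sizes are assumptions made up for illustration; they are not the notebook's real helpers or dimensions.

```python
import torch

def one_hot(index, size):
    """Return a 1 x size tensor with a single 1 at position `index` (hypothetical helper)."""
    tensor = torch.zeros(1, size)
    tensor[0][index] = 1
    return tensor

# Toy sizes chosen only for illustration (not the notebook's real values).
toy_n_categories, toy_n_letters, toy_hidden_size = 3, 5, 4

category = one_hot(1, toy_n_categories)    # e.g. the second language
letter = one_hot(2, toy_n_letters)         # e.g. the third letter
hidden = torch.zeros(1, toy_hidden_size)   # initial hidden state

# Concatenate along dim 1 so the batch dimension (size 1) is preserved.
combined = torch.cat((category, letter, hidden), 1)
print(combined.shape)  # torch.Size([1, 12])
```

The width of the combined tensor (3 + 5 + 4 = 12 here) is what the first linear layers must accept as their input dimension.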

We will interpret the output as a probability distribution over the next
letter (the network returns log-probabilities through `LogSoftmax`). When
sampling, the most likely output letter is used as the next input
letter.
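
A small sketch of how such an output could be read, assuming a made-up score vector over a four-letter toy alphabet: exponentiating the log-probabilities recovers ordinary probabilities, and `topk(1)` is one way to pick the most likely index.

```python
import torch
import torch.nn as nn

log_softmax = nn.LogSoftmax(dim=1)

# Made-up raw scores for a 4-letter toy alphabet (batch of 1).
scores = torch.tensor([[2.0, 0.5, 1.0, -1.0]])

log_probs = log_softmax(scores)   # what a LogSoftmax output layer returns
probs = log_probs.exp()           # back to ordinary probabilities

# topk(1) returns the largest value and its index along the last dimension.
top_value, top_index = log_probs.topk(1)
most_likely = top_index[0][0].item()

print(probs.sum().item())  # approximately 1.0
print(most_likely)         # 0, the index of the largest score
```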

I added a second linear layer `o2o` (after combining hidden and output)
to give it more muscle to work with. There's also a dropout layer,
which [randomly zeros parts of its
input](https://arxiv.org/abs/1207.0580) with a given probability (here
0.1) and is usually used to fuzz inputs to prevent overfitting. Here
we're using it towards the end of the network to purposely add some
chaos and increase sampling variety.
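
To illustrate what the dropout layer does, here is a short sketch on a tensor of ones. The tensor size and the seed are arbitrary choices for the example. In training mode roughly 10% of the entries are zeroed and the rest are scaled by 1 / (1 - 0.1); in evaluation mode the layer passes its input through unchanged.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # only to make the random mask repeatable
dropout = nn.Dropout(0.1)
x = torch.ones(1, 1000)

dropout.train()              # training mode: dropout is active
y_train = dropout(x)
zero_fraction = (y_train == 0).float().mean().item()
# Surviving entries are scaled by 1 / (1 - 0.1) so the expected sum is unchanged.
surviving_value = y_train.max().item()

dropout.eval()               # evaluation mode: dropout is the identity
y_eval = dropout(x)

print(zero_fraction)           # roughly 0.1
print(surviving_value)         # about 1.111
print(torch.equal(y_eval, x))  # True
```

Because dropout is only active in training mode, calling `eval()` on the model before sampling would remove the extra randomness described above.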

![](https://i.imgur.com/jzVrf7f.png)


In [6]:
import torch
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()
        self.hidden_size = hidden_size

        # Both i2h and i2o take the concatenated category, input and hidden tensors
        self.i2h = nn.Linear(n_categories + input_size + hidden_size, hidden_size)
        # i2o has the same input dimension as i2h but maps to output_size
        self.i2o = nn.Linear(n_categories + input_size + hidden_size, output_size)
        # o2o combines the new hidden state and the i2o output
        self.o2o = nn.Linear(hidden_size + output_size, output_size)
        # self.i2h = None #To Do
        # self.i2o = None #To Do
        # self.o2o = None #To Do
        self.dropout = nn.Dropout(0.1)  # randomly zeros 10% of its input during training
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, category, input, hidden):
        # Concatenate category, input and hidden along the feature dimension
        input_combined = torch.cat((category, input, hidden), 1)
        hidden = self.i2h(input_combined)
        output = self.i2o(input_combined)
        output_combined = torch.cat((hidden, output), 1)
        output = self.o2o(output_combined)
        output = self.dropout(output)
        output = self.softmax(output)
        # input_combined = None #To Do, Hint: use torch.cat
        # hidden = None #To Do
        # output = None #To Do
        # output_combined = None #To Do, Hint: use torch.cat
        # output = None #To Do
        # output = None #To Do
        # output = None #To Do
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, self.hidden_size)