This notebook is a compilation of material from the sources listed below rather than my original work. Please treat it as a tutorial, not as novel work.

References:

https://karpathy.github.io/2015/05/21/rnn-effectiveness/

https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html

https://blog.floydhub.com/a-beginners-guide-on-recurrent-neural-networks-with-pytorch/


# Task 1 - Language Classification using RNNs

In [31]:
#The data/names folder contains names in different languages. We want to build a text-classification model using RNNs
from __future__ import unicode_literals, print_function, division #needed for compatibility between 3.x and 2.x python versions
from io import open
import glob
import os
import torch

def findFiles(path):
    return glob.glob(path)

print(findFiles('data/names/*.txt'))


['data/names\\Arabic.txt', 'data/names\\Chinese.txt', 'data/names\\Czech.txt', 'data/names\\Dutch.txt', 'data/names\\English.txt', 'data/names\\French.txt', 'data/names\\German.txt', 'data/names\\Greek.txt', 'data/names\\Irish.txt', 'data/names\\Italian.txt', 'data/names\\Japanese.txt', 'data/names\\Korean.txt', 'data/names\\Polish.txt', 'data/names\\Portuguese.txt', 'data/names\\Russian.txt', 'data/names\\Scottish.txt', 'data/names\\Spanish.txt', 'data/names\\Vietnamese.txt']


The code in the next cell reads all the files above and builds a dictionary that maps each language (a string) to the list of names in that language (a list of strings), like
   {language1: [name1L1, name2L1, ...], language2: [name1L2, name2L2, ...], ...}
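
As a rough illustration of the target structure, the sketch below builds the same kind of dictionary from a small in-memory sample instead of the real files. The sample names and the `sample_files` and `category_lines_demo` variables are made up for this example and are not taken from the dataset.

```python
# Hypothetical stand-in for the contents of data/names/*.txt:
# file name -> raw text with one name per line
sample_files = {
    'English.txt': 'Smith\nJones\n',
    'French.txt': 'Dubois\nMartin\n',
}

category_lines_demo = {}
for filename, text in sample_files.items():
    # Strip the ".txt" extension to get the language name
    category = filename.rsplit('.', 1)[0]
    # One list entry per line of the file
    category_lines_demo[category] = text.strip().split('\n')

print(category_lines_demo)
```

Each key is a language and each value is the list of names read for that language, which is the shape the next cell produces for the full dataset.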

In [19]:
import unicodedata
import string

all_letters = string.ascii_letters + " .,;'"
n_letters = len(all_letters)

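def unicodeToAscii(s):
    # Decompose accented characters (NFD), then drop the combining marks
    # (category 'Mn') and anything that is not in all_letters
    return ''.join(
        c for c in unicodedata.normalize('NFD', s)
        if unicodedata.category(c) != 'Mn'
        and c in all_letters
    )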

print(unicodeToAscii('Ślusàrski'))

category_lines = {}
all_categories = []

def readLines(filename):
    # Read one file and return its names (one per line) converted to ASCII
    with open(filename, encoding='utf-8') as f:
        lines = f.read().strip().split('\n')
    return [unicodeToAscii(line) for line in lines]

for filename in findFiles('data/names/*.txt'):
    category = os.path.splitext(os.path.basename(filename))[0]
    all_categories.append(category)
    lines = readLines(filename)
    category_lines[category] = lines
    
    
n_categories = len(all_categories)

Slusarski
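
To see why `unicodeToAscii` in the cell above turns 'Ślusàrski' into 'Slusarski', it may help to look at what NFD normalization does to the string. This is a small sketch using only the standard library; the variable names are chosen for this example.

```python
import unicodedata

# 'Ślusàrski' written with escapes so the accented letters are precomposed
name = '\u015alus\u00e0rski'
decomposed = unicodedata.normalize('NFD', name)

# NFD splits each accented letter into a base letter plus a combining mark,
# so the decomposed string is longer than the original
print(len(name), len(decomposed))

# Combining marks have Unicode category 'Mn' (Mark, nonspacing)
marks = [unicodedata.name(c) for c in decomposed if unicodedata.category(c) == 'Mn']
print(marks)

# Dropping the marks leaves only the base letters
stripped = ''.join(c for c in decomposed if unicodedata.category(c) != 'Mn')
print(stripped)
```

The extra `c in all_letters` check in `unicodeToAscii` then removes any remaining character that is not part of the model's alphabet.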


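Now let's convert the input into a format our model can understand, i.e. Tensors. We use one-hot encoding: each letter becomes a vector of zeros with a single 1 at that letter's index. This is practical here because the alphabet is tiny: the 52 upper- and lower-case ASCII letters plus 5 extra characters (space, period, comma, semicolon and apostrophe) give vectors of length 57, so the one-hot vectors stay small even though they are mostly zeros.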

For more background, look up sparsity in NLP.
If you need a refresher on Tensors in PyTorch, see this: https://pytorch.org/docs/stable/tensors.html
Just remember that Tensors generalize vectors and matrices to any number of dimensions and you should be fine.

For how `enumerate` works in Python, see https://www.geeksforgeeks.org/enumerate-in-python/
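
Before moving to Tensors, here is a minimal plain-Python sketch of one-hot encoding and of `enumerate`. The `one_hot` helper is a hypothetical name introduced for this example; the notebook itself builds the equivalent vectors as PyTorch Tensors in the cells that follow.

```python
import string

all_letters = string.ascii_letters + " .,;'"

def one_hot(letter):
    # A list of zeros with a single 1 at the letter's index in all_letters
    vector = [0] * len(all_letters)
    vector[all_letters.index(letter)] = 1
    return vector

# enumerate yields (position, item) pairs, which is handy for
# filling one row per letter of a name
for position, letter in enumerate('Jo'):
    row = one_hot(letter)
    print(position, letter, row.index(1), len(row))
```

Note that `str.index` raises a `ValueError` for a character outside `all_letters`, whereas `str.find` (used in the notebook) returns -1; which behavior is preferable depends on how unknown characters should be handled.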

In [27]:
n = len(all_letters)
n

57

In [33]:
def letterToIndex(l):
    return all_letters.find(l)

def tensorFromLetter(l):  # returns a 1 x n tensor
    tensor = torch.zeros(1, n)
    tensor[0][letterToIndex(l)] = 1
    return tensor

def lineToTensorArray(line):  # returns a (number of letters) x 1 x n tensor
    tensor = torch.zeros(len(line), 1, n)
    for index, letter in enumerate(line): 
        tensor[index][0][letterToIndex(letter)] = 1
    return tensor
        
print(tensorFromLetter('J'))

print(lineToTensorArray('Jones').size())

tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
         0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
         0., 0., 0.]])
torch.Size([5, 1, 57])


Now on to creating our model. We are using RNNs here. To see how they work, head to the references listed at the start of this notebook.
Keep in mind that when creating an RNN model, we can either concatenate the input and the hidden state and pass them through a single linear layer, or multiply them by separate input-to-hidden and hidden-to-hidden weights and add the results.
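
The two options give the same hidden-state pre-activation when the single weight matrix is just the two separate matrices placed side by side. The sketch below checks this numerically in plain Python, ignoring the bias term and the nonlinearity; `matvec`, `W`, `W_xh` and `W_hh` are names chosen for this illustration and the sizes are arbitrary.

```python
import math
import random

def matvec(W, v):
    # Multiply a matrix (list of rows) by a vector (list of numbers)
    return [sum(w * x for w, x in zip(row, v)) for row in W]

random.seed(0)
input_size, hidden_size = 3, 2
x = [random.random() for _ in range(input_size)]   # current input
h = [random.random() for _ in range(hidden_size)]  # previous hidden state

# Option 1: one weight matrix applied to the concatenation [x; h]
W = [[random.random() for _ in range(input_size + hidden_size)]
     for _ in range(hidden_size)]
concat_result = matvec(W, x + h)

# Option 2: split W into input-to-hidden and hidden-to-hidden blocks,
# apply each block separately, then add the two results
W_xh = [row[:input_size] for row in W]
W_hh = [row[input_size:] for row in W]
sum_result = [a + b for a, b in zip(matvec(W_xh, x), matvec(W_hh, h))]

print(concat_result)
print(sum_result)
print(all(math.isclose(a, b) for a, b in zip(concat_result, sum_result)))
```

In PyTorch terms, the first option corresponds to a single `nn.Linear(input_size + hidden_size, hidden_size)` applied to the concatenated vector, and the second to two separate linear layers whose outputs are summed.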

In [None]:
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(RNN, self).__init__()
        
        self.hidden_size = hidden_size
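        # Assumption: input and hidden state are concatenated, as in the referenced PyTorch tutorial
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)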
        