# Homework: classify the origin of names using a character-level RNN

In this homework we will use an RNN-based model to perform classification. The goal is threefold:

1. Get more hands-on practice with the preprocessing needed to perform text classification from A to Z. No preprocessing is done for you!
2. Use embeddings and RNNs in conjunction at the character level to perform classification.
3. Write a function that takes as input a string, and outputs the name of the predicted class.

However, here are guidelines to help you through all the steps:

1. Figure out the number of classes, and map the classes to integers (or one-hot vectors). This is needed for fitting the model and training it to do classification.
2. Use the Keras tokenizer at the character level to tokenize your input into integer sequences.
3. Pad your sequences using the Keras preprocessing tools.
4. Build a model that uses, minimally, an embedding layer, an RNN (of your choice) and a dense layer to output the logits or probabilities for the target classes (name origins).
5. Fit the model and evaluate on the test set.

### Importing libraries and getting data

In [None]:
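%tensorflow_version 2.x  # Colab-only magic; remove this line outside Google Colab
import re
from glob import glob

import numpy as np
import tensorflow as tf
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from tensorflow import keras
# Import layers and the tokenizer from tensorflow.keras so they match the `keras` module used below
from tensorflow.keras.layers import Input, Dense, LSTM
from tensorflow.keras.preprocessing.text import Tokenizer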

In [None]:
# Download the data
!wget https://download.pytorch.org/tutorial/data.zip
!unzip data.zip

--2022-05-14 22:03:24--  https://download.pytorch.org/tutorial/data.zip
Resolving download.pytorch.org (download.pytorch.org)... 18.64.174.109, 18.64.174.23, 18.64.174.42, ...
Connecting to download.pytorch.org (download.pytorch.org)|18.64.174.109|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2882130 (2.7M) [application/zip]
Saving to: ‘data.zip’


2022-05-14 22:03:24 (21.6 MB/s) - ‘data.zip’ saved [2882130/2882130]

Archive:  data.zip
   creating: data/
  inflating: data/eng-fra.txt        
   creating: data/names/
  inflating: data/names/Arabic.txt   
  inflating: data/names/Chinese.txt  
  inflating: data/names/Czech.txt    
  inflating: data/names/Dutch.txt    
  inflating: data/names/English.txt  
  inflating: data/names/French.txt   
  inflating: data/names/German.txt   
  inflating: data/names/Greek.txt    
  inflating: data/names/Irish.txt    
  inflating: data/names/Italian.txt  
  inflating: data/names/Japanese.txt  
  inflating: data/names/Korean.

#### As part of preprocessing, we remove digits and certain punctuation characters, but keep spaces, apostrophes, and non-English letters to preserve the character of each language.

In [None]:
data = []
for filename in glob('data/names/*.txt'):
  origin = filename.split('/')[-1].split('.txt')[0]
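  # The name files contain accented characters, so read them explicitly as UTF-8
  with open(filename, encoding='utf-8') as f:
    for name in f:
      name = name.replace(u'\xa0', ' ')  # non-breaking space -> regular space
      # Strip whitespace, then drop digits and selected punctuation;
      # letters (accented ones included), spaces and apostrophes are kept
      data.append((re.sub(r'[/\\,\-?#@:0-9]', '', name.strip()), origin))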

names, origins = zip(*data)
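
To see what this cleaning step keeps and removes, here is a minimal, self-contained sketch that applies the same regular expression to a few made-up strings. The helper name `clean_name` and the sample inputs are illustrative assumptions, not part of the notebook.

```python
import re

def clean_name(raw):
    """Apply the same cleaning as the cell above (hypothetical helper)."""
    raw = raw.replace('\xa0', ' ')  # non-breaking space -> regular space
    return re.sub(r'[/\\,\-?#@:0-9]', '', raw.strip())

# Made-up inputs, chosen only to exercise each rule
print(clean_name("  O'Neill-2\n"))  # apostrophe kept; hyphen and digit removed
print(clean_name("De\xa0la Cruz"))  # non-breaking space normalised
print(clean_name("Müller"))         # accented letters kept
```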

Next we define the encoders. A label encoder first maps each origin to an integer, and a one-hot encoder then turns those integers into one-hot vectors.

In [None]:
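# Label-encode the language classes, then one-hot encode the integer labels
label_encoder = preprocessing.LabelEncoder()
# Note: scikit-learn >= 1.2 renames this argument to sparse_output
enc = OneHotEncoder(sparse=False)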

In [None]:
origin_int = label_encoder.fit_transform(origins)
origin_label = enc.fit_transform(origin_int.reshape(-1, 1))

#This is how each row would look after encoding is completed
origin_label[0]

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.,
       0.])

In [None]:
# Number of language classes
lang_classes = label_encoder.classes_
print("The number of unique language classes: ", len(lang_classes))
print("\nThe different languages found are as below:\n",lang_classes)

#18 languages detected

The number of unique language classes:  18

The different languages found are as below:
 ['Arabic' 'Chinese' 'Czech' 'Dutch' 'English' 'French' 'German' 'Greek'
 'Irish' 'Italian' 'Japanese' 'Korean' 'Polish' 'Portuguese' 'Russian'
 'Scottish' 'Spanish' 'Vietnamese']
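
The two encoding steps can be mimicked in plain Python to make the mapping explicit. This is only a sketch of the idea, not how scikit-learn implements it; the helper names `one_hot` and `decode` are hypothetical. Under this mapping, the row printed earlier for `origin_label[0]`, which has its 1 at index 11, corresponds to 'Korean'.

```python
# Class names copied from the output above, already in sorted order
classes = ['Arabic', 'Chinese', 'Czech', 'Dutch', 'English', 'French',
           'German', 'Greek', 'Irish', 'Italian', 'Japanese', 'Korean',
           'Polish', 'Portuguese', 'Russian', 'Scottish', 'Spanish',
           'Vietnamese']

# Label encoding: each class gets the index of its position in sorted order
class_to_int = {name: i for i, name in enumerate(sorted(classes))}

def one_hot(label):
    """Return a one-hot list with a 1.0 at the label's integer index."""
    row = [0.0] * len(class_to_int)
    row[class_to_int[label]] = 1.0
    return row

def decode(row):
    """Map a one-hot (or probability) row back to its class name."""
    return sorted(classes)[row.index(max(row))]

row = one_hot('Korean')
print(class_to_int['Korean'], decode(row))
```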


### Char-Level Tokenizing

In [None]:
t = Tokenizer(char_level=True)

In [None]:
# Fit once on the whole list of names; each character becomes a token
t.fit_on_texts(names)

print("Count of characters:",t.word_counts)

Count of characters: OrderedDict([('a', 16516), ('h', 7688), ('n', 9961), ('b', 3657), ('i', 10422), ('k', 6922), ('g', 3217), ('y', 3619), ('o', 11106), ('c', 3070), ('e', 10764), ('u', 4720), ('w', 1127), ('l', 6713), ('j', 1351), ('m', 4351), ('p', 1711), ('r', 8262), ('s', 7985), ('t', 5956), ('d', 3899), ('x', 73), ('f', 1778), ('v', 6315), ('z', 1932), (' ', 116), ('q', 98), ('ó', 13), ('á', 13), ('ú', 7), ('í', 14), ('é', 23), ("'", 87), ('à', 10), ('ñ', 6), ('ż', 2), ('ń', 1), ('ł', 1), ('ś', 3), ('ą', 1), ('ò', 3), ('ù', 1), ('ì', 1), ('è', 2), ('ã', 2), ('õ', 1), ('ü', 11), ('ä', 13), ('ö', 24), ('ß', 9), ('ê', 1), ('ç', 1)])
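
As a rough mental model of what the character-level tokenizer does, the sketch below lowercases the text, counts characters, and assigns indices from 1 upward in order of decreasing frequency, leaving 0 free for padding. It is a simplified stand-in based on my understanding of the Keras behaviour, not the library's implementation; `build_char_index`, `encode`, and the toy names are assumptions made for illustration.

```python
from collections import Counter

def build_char_index(texts):
    """Index characters from 1 upward by descending frequency (0 is left for padding)."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower())
    # sorted() is stable, so characters with equal counts keep first-seen order
    ordered = sorted(counts, key=lambda ch: -counts[ch])
    return {ch: i for i, ch in enumerate(ordered, start=1)}

def encode(text, char_index):
    """Convert a string to a list of character indices, skipping unknown characters."""
    return [char_index[ch] for ch in text.lower() if ch in char_index]

toy_names = ["Ana", "Nao"]  # made-up data
char_index = build_char_index(toy_names)
print(char_index)
print(encode("Ana", char_index))
```

In the notebook's own counts, 'a' is the most frequent character, which is consistent with it receiving index 1 in the sequences shown further down.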


Now that we have a dictionary of our characters, we can convert each name into a sequence of integer character indices, as shown below.

In [None]:
data = t.texts_to_sequences(names)
data[:2]

[[1, 8, 5], [16, 1, 4, 9]]

In [None]:
pad_data = tf.keras.preprocessing.sequence.pad_sequences(data, padding='post')
pad_data[:2]

array([[ 1,  8,  5,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0],
       [16,  1,  4,  9,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0]], dtype=int32)
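
Post-padding simply appends zeros until every sequence reaches the length of the longest one (19 in the output above). A minimal pure-Python sketch of that idea, with `pad_post` as a hypothetical helper name:

```python
def pad_post(sequences, value=0):
    """Right-pad every sequence with `value` up to the longest sequence's length."""
    maxlen = max(len(s) for s in sequences)
    return [list(s) + [value] * (maxlen - len(s)) for s in sequences]

# The first two tokenized names from the notebook
print(pad_post([[1, 8, 5], [16, 1, 4, 9]]))
```

Padding at the end pairs naturally with `mask_zero=True` in an embedding layer, since index 0 is never assigned to a real character.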

In [None]:
# Split the data into a training set (80%) and a held-out remainder (20%)
X_train,X_rest,y_train,y_rest = train_test_split(pad_data, origin_label, test_size=0.2, random_state=4)

# Split the remainder evenly into validation and test sets (10% of the data each)
X_valid,X_test,y_valid,y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=4)

In [None]:
X_train[:2]

array([[ 6,  1,  8, 14,  1,  5,  2, 11,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0],
       [22,  4, 21, 21,  3,  6,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         0,  0,  0]], dtype=int32)

In [None]:
y_train[:2]

array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.,
        0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.,
        0., 0.]])

In [None]:
print(X_train.shape)
print(y_train.shape)

(16059, 19)
(16059, 18)
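
The two-stage split yields roughly 80% / 10% / 10% of the data. The sketch below works out the resulting sizes under two assumptions that are not stated in the notebook: that the dataset holds 20074 names in total (the value consistent with the 16059 training rows above), and that the test share is rounded up with the training set taking the remainder, which is my understanding of how `train_test_split` sizes its outputs. `split_sizes` is a hypothetical helper.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumes the test share is rounded up and the training set takes the rest.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_total = 20074  # assumed total; not stated in the notebook
n_train, n_rest = split_sizes(n_total, 0.2)
n_valid, n_test = split_sizes(n_rest, 0.5)
print(n_train, n_valid, n_test)
```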


### Building Model

The model is built from an Embedding layer, two stacked LSTM layers, and a Dense softmax output layer, and is trained with the Adam optimizer.

In [None]:
tf.random.set_seed(42)

In [None]:
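# Vocabulary size = number of distinct characters + 1 for the padding index 0.
# (The run recorded below used len(X_train) = 16059 here, which only oversizes the embedding table.)
vocab_size = len(t.word_index) + 1
embed_size = 25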

In [None]:
keras.backend.clear_session()

model = keras.models.Sequential()
model.add(keras.layers.Embedding(vocab_size, embed_size, input_shape=[None], mask_zero=True))
model.add(LSTM(512, return_sequences=True, input_shape=X_train.shape[1:]))
model.add(LSTM(128, dropout=0.5))
model.add(Dense(18, activation = 'softmax'))
model.compile(loss='categorical_crossentropy', optimizer=keras.optimizers.Adam(), metrics=["accuracy"])

model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 25)          401475    
                                                                 
 lstm (LSTM)                 (None, None, 512)         1101824   
                                                                 
 lstm_1 (LSTM)               (None, 128)               328192    
                                                                 
 dense (Dense)               (None, 18)                2322      
                                                                 
Total params: 1,833,813
Trainable params: 1,833,813
Non-trainable params: 0
_________________________________________________________________
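
The parameter counts in this summary can be checked by hand. The sketch below uses the standard formulas (an LSTM has four gates, each with input weights, recurrent weights, and a bias); the function names are hypothetical. The embedding row count of 16059 is the value implied by the summary (401475 / 25). It equals the number of training rows, whereas the tokenizer found only 52 distinct characters, so the recorded embedding table is far larger than the vocabulary requires.

```python
def embedding_params(vocab_rows, embed_dim):
    """One embed_dim-sized vector per vocabulary row."""
    return vocab_rows * embed_dim

def lstm_params(input_dim, units):
    """4 gates, each with input weights, recurrent weights and a bias."""
    return 4 * ((input_dim + units + 1) * units)

def dense_params(input_dim, units):
    """Weights plus one bias per output unit."""
    return (input_dim + 1) * units

counts = [
    embedding_params(16059, 25),  # 16059 rows is implied by 401475 / 25
    lstm_params(25, 512),         # first LSTM reads the 25-dim embeddings
    lstm_params(512, 128),        # second LSTM reads the 512-dim sequence
    dense_params(128, 18),        # 18 output classes
]
print(counts, sum(counts))
```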


In [None]:
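# Resetting states only matters for stateful RNN layers; the LSTMs above are
# stateless (the default), so this callback should have no effect here.
class ResetStatesCallback(keras.callbacks.Callback):
    def on_epoch_begin(self, epoch, logs=None):
        self.model.reset_states()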

In [None]:
history = model.fit(X_train, y_train, epochs=10, validation_data=(X_valid, y_valid),callbacks=[ResetStatesCallback()])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [None]:
#Evaluating on test set
model.evaluate(X_test, y_test)



[0.5771210193634033, 0.8296812772750854]
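
`model.evaluate` returns the loss followed by the metrics passed to `compile`, so the two numbers above are the categorical cross-entropy and the accuracy on the test set. As a rough illustration of what those quantities measure, here is a plain-Python sketch on made-up probabilities; it ignores details such as the numerical clipping a real implementation applies, and the function names are hypothetical.

```python
import math

def categorical_crossentropy(y_true, y_prob):
    """Mean over samples of -log(probability assigned to the true class)."""
    losses = []
    for target, probs in zip(y_true, y_prob):
        true_idx = target.index(max(target))  # position of the 1 in the one-hot row
        losses.append(-math.log(probs[true_idx]))
    return sum(losses) / len(losses)

def accuracy(y_true, y_prob):
    """Fraction of samples whose highest-probability class is the true class."""
    hits = sum(t.index(max(t)) == p.index(max(p)) for t, p in zip(y_true, y_prob))
    return hits / len(y_true)

# Made-up example: two samples, three classes
y_true = [[1, 0, 0], [0, 1, 0]]
y_prob = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
print(categorical_crossentropy(y_true, y_prob), accuracy(y_true, y_prob))
```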

### Getting predictions

In [None]:
# For reference: the language classes, in the order the label encoder assigned their integer labels
lang_classes

array(['Arabic', 'Chinese', 'Czech', 'Dutch', 'English', 'French',
       'German', 'Greek', 'Irish', 'Italian', 'Japanese', 'Korean',
       'Polish', 'Portuguese', 'Russian', 'Scottish', 'Spanish',
       'Vietnamese'], dtype='<U10')

In [None]:
def predict_origin(input_name):
  # Pre-process the input string the same way as the training names
  assert isinstance(input_name, str)
  input_name = input_name.replace(u'\xa0', ' ')
  input_name = re.sub(r'[/\\,\-?#@:0-9]', '', input_name.strip())

  # Tokenize the name as a single text (wrapped in a list so it yields one sequence)
  seqs = t.texts_to_sequences([input_name])

  # Pad to the same length as the training data
  padded = tf.keras.preprocessing.sequence.pad_sequences(seqs, padding='post', maxlen=pad_data.shape[1])

  # Get class probabilities from the trained model
  output = model.predict(padded)

  # Highest probability among the 18 outputs, and its index
  max_val = np.amax(output, axis=1)
  max_idx = np.argmax(output, axis=1)

  # Return the language class and its probability in percent (each as a length-1 array)
  the_origin = lang_classes[max_idx]
  return the_origin, max_val*100

### Testing on random last names

In [None]:
lang, prob = predict_origin("Cha")
print("The name is {} predicted with probability of {}".format(lang, prob))

The name is ['Vietnamese'] predicted with probability of [32.05766]


In [None]:
lang, prob = predict_origin("Schmidt")
print("The name is {} predicted with probability of {}".format(lang, prob))

The name is ['German'] predicted with probability of [59.0673]


In [None]:
lang, prob = predict_origin("Trump")
print("The name is {} predicted with probability of {}".format(lang, prob))

The name is ['English'] predicted with probability of [72.211655]
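
Some of these predictions are not very confident (about 32% for "Cha"), so it can be informative to look at the few most likely classes rather than only the top one. The sketch below ranks a probability vector in plain Python; `top_k`, the class names, and the probabilities are made up for illustration and are not outputs of the model above.

```python
def top_k(probs, classes, k=3):
    """Return the k (class, probability) pairs with the highest probability."""
    ranked = sorted(zip(probs, classes), reverse=True)
    return [(name, p) for p, name in ranked[:k]]

# Made-up probabilities over three hypothetical classes
example_classes = ['A', 'B', 'C']
example_probs = [0.1, 0.6, 0.3]
print(top_k(example_probs, example_classes, k=2))
```

With the notebook's objects, the same idea would apply to one row of `model.predict(...)` together with `lang_classes`.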
