# Homework: classify the origin of names using a character-level RNN

In this homework we will use an rnn-based model to perform classification. The goal is threefold:

1. Get more hands on with the preprocessing needed to perform text classification from A to Z. No preprocessing is done for you!
2. Use embeddings and RNNs in conjunction at the character level to perform classification.
3. Write a function that takes as input a string, and outputs the name of the predicted class.

However, here are guidelines to help you through all the steps:

1. Figure out the number of classes, and map the classes to integers (or one-hot vectors). This is needed for fitting the model and training it to do classification.
2. Use the keras tokenizer at the character level to tokenize your input into integer sequences.
3. Pad your sequences using the keras preprocessing tools.
4. Build a model that uses, minimally, an embedding layer, an RNN (of your choice) and a dense layer to output the logits or probabilities for the target classes (name origins).
5. Fit the model and evaluate on the test set.

In [1]:
%tensorflow_version 2.x
import numpy as np
from glob import glob
from sklearn.model_selection import train_test_split
import tensorflow as tf

In [2]:
# Download the data
!wget https://download.pytorch.org/tutorial/data.zip
!unzip data.zip

--2022-05-16 02:49:08--  https://download.pytorch.org/tutorial/data.zip
Resolving download.pytorch.org (download.pytorch.org)... 18.64.174.109, 18.64.174.42, 18.64.174.23, ...
Connecting to download.pytorch.org (download.pytorch.org)|18.64.174.109|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2882130 (2.7M) [application/zip]
Saving to: ‘data.zip’


2022-05-16 02:49:08 (19.7 MB/s) - ‘data.zip’ saved [2882130/2882130]

Archive:  data.zip
   creating: data/
  inflating: data/eng-fra.txt        
   creating: data/names/
  inflating: data/names/Arabic.txt   
  inflating: data/names/Chinese.txt  
  inflating: data/names/Czech.txt    
  inflating: data/names/Dutch.txt    
  inflating: data/names/English.txt  
  inflating: data/names/French.txt   
  inflating: data/names/German.txt   
  inflating: data/names/Greek.txt    
  inflating: data/names/Irish.txt    
  inflating: data/names/Italian.txt  
  inflating: data/names/Japanese.txt  
  inflating: data/names/Korean.

In [54]:
data = []
for filename in glob('data/names/*.txt'):
  origin = filename.split('/')[-1].split('.txt')[0]
  names = open(filename).readlines()
  for name in names:
    data.append((name.strip(), origin))

names, origins = zip(*data)
names_train, names_test, origins_train, origins_test = train_test_split(names, origins, test_size=0.3, shuffle=True, random_state=42)


# Lets look at the data

In [55]:
for name, origin in zip(names_train[:20], origins_train[:20]):
  print(name.ljust(20), origin)

Pett                 English
Hiro                 Japanese
Khouri               Arabic
Frusher              English
Costa                Portuguese
Watts                English
Khouri               Arabic
Slapnickova          Czech
Ricchetti            Italian
Remeslo              Russian
Izumi                Japanese
Groisman             Russian
Hurrell              English
Jangel               Russian
Vitoshkin            Russian
Bissette             French
Juravkov             Russian
Hakimi               Arabic
Shalahonov           Russian
Jeleznyak            Russian


In [56]:
origins = []
for x in origins_train:
    if x not in origins:
        origins.append(x)
print(origins)

['English', 'Japanese', 'Arabic', 'Portuguese', 'Czech', 'Italian', 'Russian', 'French', 'Scottish', 'Irish', 'German', 'Greek', 'Dutch', 'Chinese', 'Vietnamese', 'Korean', 'Spanish', 'Polish']


In [61]:
#Figure out the number of classes, and map the classes to integers (or one-hot vectors). This is needed for fitting the model and training it to do classification.
from keras.preprocessing.text import Tokenizer
from tensorflow import keras

origin_train_encoded = [origins.index(origin) for origin in origins_train]
origin_test_encoded = [origins.index(origin) for origin in origins_test]

#Use the keras tokenizer at the character level to tokenize your input into integer sequences.
# Training set
tokenizer = Tokenizer(char_level=True)
tokenizer.fit_on_texts(names_train)
sequences = tokenizer.texts_to_sequences(names_train)

# Test set
sequences_1 = tokenizer.texts_to_sequences(names_test)

In [63]:
#Pad your sequences using the keras preprocessing tools.
sequences = tf.keras.preprocessing.sequence.pad_sequences(
    sequences,
    maxlen=None,
    dtype='int32',
    padding='pre',
    truncating='pre',
    value=0.0
)

sequences_1 = tf.keras.preprocessing.sequence.pad_sequences(
    sequences_1,
    maxlen=None,
    dtype='int32',
    padding='pre',
    truncating='pre',
    value=0.0
)

In [8]:
origin_train_encoded_array = np.array(origin_train_encoded)
origin_test_encoded_array = np.array(origin_test_encoded)
sequences_array = np.array(sequences)
sequences_1_array = np.array(sequences_1)

numpy.ndarray

In [73]:
embed_size = 128
model = keras.models.Sequential([
    keras.layers.Embedding(input_dim=len(origin_train_encoded), 
                           output_dim=embed_size,
                           mask_zero=True, # just ignore zeroes instead of learning it
                           input_shape=[None]),
    keras.layers.GRU(128, return_sequences=True),
    keras.layers.GRU(128),
    tf.keras.layers.Dense(len(origins), activation='softmax')
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

In [80]:
history = model.fit(sequences_array, origin_train_encoded_array, epochs=5, validation_data=(sequences_1_array, origin_test_encoded_array))


Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [81]:
def predict_origin(name):
  assert isinstance(name, str)
  tokenizer.fit_on_texts([name])
  x_new = tokenizer.texts_to_sequences([name])
  y_proba = model.predict(x_new)
  d = dict(enumerate(y_proba.flatten(), 1))
  ini_list = ['Arabic', 'Czech', 'Russian', 'English', 'Dutch', 'German', 'Spanish', 'Polish', 'Scottish', 'French', 'Italian', 'Korean', 'Vietnamese', 'Chinese', 'Japanese', 'Greek', 'Irish', 'Portuguese']
  # change keys of dictionary from probabilities to origins
  final_dict = dict(zip(ini_list, list(d.values())))
  # sort the dictionary by largest to smallest probability
  sort_final_dict = sorted(final_dict.items(), key=lambda x: x[1], reverse=True)
  for i in sort_final_dict:
	  print(i[0], i[1])
  # Print out the origin of the name:
  x = sort_final_dict[0]
  return print("The origin of {} is {}.".format(name, x[0]))

In [90]:
predict_origin("inoseke")

Japanese 0.96832454
Russian 0.028065262
Spanish 0.001280578
Czech 0.00052627915
Italian 0.000523185
Portuguese 0.00048513536
Polish 0.00032964814
Greek 0.00026979213
English 4.7200872e-05
Dutch 4.3072527e-05
Arabic 3.0994757e-05
German 2.6547756e-05
Irish 2.3635155e-05
French 1.8155122e-05
Korean 3.310234e-06
Scottish 1.2909151e-06
Chinese 7.971009e-07
Vietnamese 6.4440854e-07
The origin of inoseke is Japanese.


In [91]:
predict_origin("putin")

Russian 0.6824243
Italian 0.2246525
Czech 0.028965894
German 0.017480794
English 0.014840596
Vietnamese 0.009190067
Chinese 0.005551206
Polish 0.0035569298
Korean 0.0034208207
Irish 0.0028147467
Japanese 0.0026763615
Spanish 0.0014864838
French 0.0013024184
Portuguese 0.00092454866
Dutch 0.0003513253
Greek 0.00024308743
Scottish 0.00011644085
Arabic 1.530532e-06
The origin of putin is Russian.


In [92]:
predict_origin("mohammod")

Arabic 0.4521369
French 0.16324107
Portuguese 0.10108853
Spanish 0.060494307
Dutch 0.04124422
German 0.034854922
Japanese 0.029759597
Russian 0.025759159
Irish 0.022234743
English 0.020454599
Greek 0.016606148
Polish 0.016008899
Scottish 0.006844208
Italian 0.00446331
Czech 0.003412573
Vietnamese 0.0006495028
Chinese 0.00059616676
Korean 0.00015117408
The origin of mohammod is Arabic.
