# Natural Language Processing: Classifying the origin of names using a character-level RNN

In this task, I used an RNN-based model to classify names by their origin. The aims were:

1. Get started with the preprocessing needed to perform text classification from start to finish.
2. Use embeddings and RNNs together at the character level to perform classification.
3. Write a function that takes a string as input and outputs the name of the predicted class.

To do this, I followed these steps:

1. Decide the number of classes, and map the classes to integers (or one-hot vectors). This is needed to fit the model and train it for classification.
2. Use the Keras tokenizer at the character level to tokenize the input into integer sequences.
3. Pad the sequences using the Keras preprocessing tools.
4. Build a model that uses, at minimum, an embedding layer, an RNN, and a dense layer to output the logits or probabilities for the target classes (name origins).
5. Fit the model and evaluate it on the test set.
6. Write a function that takes a string as input and predicts the origin (as its original string value).

In [3]:
#!pip install keras-tuner

In [4]:
import numpy as np
from glob import glob
from sklearn.model_selection import train_test_split
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, GRU, Dense, Bidirectional, Dropout
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
# The package installed by `pip install keras-tuner` is imported as keras_tuner;
# the older `kerastuner` alias is deprecated.
import keras_tuner as kt
from keras_tuner import HyperModel


In [5]:
# Download the data
!wget https://download.pytorch.org/tutorial/data.zip
!unzip data.zip

--2024-06-29 03:10:00--  https://download.pytorch.org/tutorial/data.zip
Resolving download.pytorch.org (download.pytorch.org)... 13.35.35.91, 13.35.35.99, 13.35.35.55, ...
Connecting to download.pytorch.org (download.pytorch.org)|13.35.35.91|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2882130 (2.7M) [application/zip]
Saving to: ‘data.zip’


2024-06-29 03:10:01 (76.6 MB/s) - ‘data.zip’ saved [2882130/2882130]

Archive:  data.zip
   creating: data/
  inflating: data/eng-fra.txt        
   creating: data/names/
  inflating: data/names/Arabic.txt   
  inflating: data/names/Chinese.txt  
  inflating: data/names/Czech.txt    
  inflating: data/names/Dutch.txt    
  inflating: data/names/English.txt  
  inflating: data/names/French.txt   
  inflating: data/names/German.txt   
  inflating: data/names/Greek.txt    
  inflating: data/names/Irish.txt    
  inflating: data/names/Italian.txt  
  inflating: data/names/Japanese.txt  
  inflating: data/names/Korean.txt   

In [6]:
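# Read every names file; the file name (without .txt) is the origin label.
data = []
for filename in glob('data/names/*.txt'):
  origin = filename.split('/')[-1].split('.txt')[0]
  # The files contain accented characters, so read them explicitly as UTF-8.
  with open(filename, encoding='utf-8') as f:
    for line in f:
      data.append((line.strip(), origin))

names, origins = zip(*data)
# Hold out 25% of the names as a test set.
names_train, names_test, origins_train, origins_test = train_test_split(names, origins, test_size=0.25, shuffle=True, random_state=123)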

#### Look at the data

In [12]:
for name, origin in zip(names_train[:20], origins_train[:20]):
  print(name.ljust(20), origin)

Davidson             Scottish
Vyalko               Russian
Jahaev               Russian
Woo                  Korean
Abana                Spanish
Atiyeh               Arabic
Minyukov             Russian
Bachmeier            German
Gershkovitsh         Russian
Albinesku            Russian
Badyin               Russian
Androsyuk            Russian
Judasin              Russian
Velichkin            Russian
Viron                Russian
Kattan               Arabic
Ashbridge            English
Major                English
Hilton               English
Hunov                Russian
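
A quick tally of the labels helps to judge how balanced the classes are. The sketch below counts only the 20 labels printed above, so the numbers are illustrative; `sample_origins` and `origin_counts` are names introduced here, and running `Counter(origins_train)` in the notebook would give the counts for the full training set.

```python
from collections import Counter

# Labels copied from the 20 rows printed above (not the full training set).
sample_origins = [
    "Scottish", "Russian", "Russian", "Korean", "Spanish",
    "Arabic", "Russian", "German", "Russian", "Russian",
    "Russian", "Russian", "Russian", "Russian", "Russian",
    "Arabic", "English", "English", "English", "Russian",
]

origin_counts = Counter(sample_origins)

# Print each origin with its count and share of the sample, most frequent first.
for origin, count in origin_counts.most_common():
    print(f"{origin:<10} {count:>3} ({count / len(sample_origins):.0%})")
```

In this small sample, Russian accounts for 11 of the 20 names, which suggests checking the class balance before relying on overall accuracy alone.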


#### Map the classes to integers

In [9]:
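# Sort the class names so the label-to-integer mapping is the same on every run
# (iterating over a bare set of strings can change order between sessions).
origins_set = sorted(set(origins_train))
origin_to_int = {origin: i for i, origin in enumerate(origins_set)}
int_to_origin = {i: origin for origin, i in origin_to_int.items()}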

origins_train_int = [origin_to_int[origin] for origin in origins_train]
origins_test_int = [origin_to_int[origin] for origin in origins_test]
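
Step 1 mentions one-hot vectors as an alternative to integer labels. The notebook uses integer labels, which pair with the `sparse_categorical_crossentropy` loss; one-hot labels would pair with `categorical_crossentropy` instead. As a minimal sketch of the conversion, using hypothetical toy labels rather than the real `origins_train_int`:

```python
import numpy as np

# Hypothetical toy labels; in the notebook these would be origins_train_int.
toy_labels = np.array([2, 0, 1, 2])
num_classes = 3

# Row i of the identity matrix is the one-hot vector for class i,
# so indexing the identity matrix with the labels converts them all at once.
toy_one_hot = np.eye(num_classes, dtype=np.float32)[toy_labels]
print(toy_one_hot)

# argmax along each row recovers the original integer labels.
recovered_labels = toy_one_hot.argmax(axis=1)
print(recovered_labels)
```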


#### Tokenize the names at the character level

In [10]:
tokenizer = Tokenizer(char_level=True)
tokenizer.fit_on_texts(names_train)

names_train_seq = tokenizer.texts_to_sequences(names_train)
names_test_seq = tokenizer.texts_to_sequences(names_test)

max_length = max(len(seq) for seq in names_train_seq)
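
To make the tokenization step concrete, here is a rough pure-Python approximation of what a character-level tokenizer does: lowercase the text (the Keras `Tokenizer` does this by default), give each character an integer starting at 1, most frequent first, and reserve 0 for padding. The helper names and toy names are hypothetical, and the exact indices may differ from what the Keras tokenizer assigns.

```python
from collections import Counter

def build_char_index(texts):
    """Map each character to an integer, most frequent first, starting at 1."""
    counts = Counter(ch for text in texts for ch in text.lower())
    return {ch: i for i, (ch, _) in enumerate(counts.most_common(), start=1)}

def texts_to_char_sequences(texts, char_index):
    """Encode each text as a list of integers, skipping unseen characters."""
    return [
        [char_index[ch] for ch in text.lower() if ch in char_index]
        for text in texts
    ]

toy_names = ["Anna", "Ng"]  # hypothetical examples, not taken from the dataset
char_index = build_char_index(toy_names)
toy_sequences = texts_to_char_sequences(toy_names, char_index)

print(char_index)     # {'n': 1, 'a': 2, 'g': 3}
print(toy_sequences)  # [[2, 1, 1, 2], [1, 3]]
```

Characters that never appeared when the index was built are skipped here, so a test name containing a new character would yield a shorter sequence.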


#### Pad the sequences to ensure they are all of the same length

In [11]:
names_train_padded = pad_sequences(names_train_seq, maxlen=max_length, padding='post')
names_test_padded = pad_sequences(names_test_seq, maxlen=max_length, padding='post')
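
The effect of `padding='post'` can be sketched in NumPy: shorter sequences get zeros appended, and longer ones are cut to `maxlen`. The function `pad_post` below is a hypothetical stand-in, not the Keras implementation; it assumes the Keras default `truncating='pre'`, which drops items from the start of an over-long sequence. That case can arise for test names longer than the longest training name.

```python
import numpy as np

def pad_post(sequences, maxlen, value=0):
    """Pad at the end with `value`; truncate from the start if too long."""
    padded = np.full((len(sequences), maxlen), value, dtype=np.int32)
    for row, seq in enumerate(sequences):
        kept = seq[-maxlen:]  # keep the last maxlen items (truncating='pre')
        padded[row, :len(kept)] = kept
    return padded

toy_padded = pad_post([[2, 1, 1, 2], [1, 3]], maxlen=4)
print(toy_padded)

# A sequence longer than maxlen loses its leading items.
toy_truncated = pad_post([[1, 2, 3, 4, 5]], maxlen=3)
print(toy_truncated)
```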


#### Build a model using an embedding layer, recurrent layers, and a dense layer to output the probabilities for the target classes

In [18]:
# Define vocab_size
vocab_size = len(tokenizer.word_index) + 1

# Convert names_train_padded and origins_train_int to NumPy arrays
names_train_padded = np.array(names_train_padded)
origins_train_int = np.array(origins_train_int)


# Define the HyperModel
class MyHyperModel(HyperModel):
    def build(self, hp):
        model = Sequential()
        model.add(Embedding(
            input_dim=vocab_size,
            output_dim=hp.Int('embedding_dim', min_value=32, max_value=128, step=32),
            input_length=max_length
        ))
        model.add(Bidirectional(LSTM(
            units=hp.Int('lstm_units', min_value=64, max_value=256, step=64),
            return_sequences=True
        )))
        model.add(Bidirectional(GRU(
            units=hp.Int('gru_units', min_value=64, max_value=256, step=64)
        )))
        model.add(Dropout(rate=hp.Float('dropout', min_value=0.2, max_value=0.5, step=0.1)))
        model.add(Dense(len(origins_set), activation='softmax'))

        model.compile(
            optimizer='adam',
            loss='sparse_categorical_crossentropy',
            metrics=['accuracy']
        )
        return model

hypermodel = MyHyperModel()
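
The four `hp.Int` and `hp.Float` ranges above define a discrete search space. Counting its size shows why an exhaustive grid search would be costly. The sketch below enumerates the same ranges in plain Python (the `search_space` dictionary is introduced here for illustration; the dropout values are written out to avoid floating-point drift).

```python
from itertools import product

# The same ranges as in MyHyperModel.build, listed explicitly.
search_space = {
    "embedding_dim": list(range(32, 128 + 1, 32)),  # 32, 64, 96, 128
    "lstm_units": list(range(64, 256 + 1, 64)),     # 64, 128, 192, 256
    "gru_units": list(range(64, 256 + 1, 64)),      # 64, 128, 192, 256
    "dropout": [0.2, 0.3, 0.4, 0.5],
}

# Every possible combination of one value per hyperparameter.
combinations = list(product(*search_space.values()))
print(len(combinations))  # 4 * 4 * 4 * 4 = 256
```

Training all 256 configurations in full would take far longer than a search that tries a subset and drops weak candidates early.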

In [19]:
# Set up the tuner
tuner = kt.Hyperband(
    hypermodel,
    objective='val_accuracy',
    max_epochs=20,
    factor=3,
    directory='my_dir',
    project_name='name_origin_classification'
)

stop_early = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5)

Reloading Tuner from my_dir/name_origin_classification/tuner0.json
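
The `EarlyStopping` callback above monitors `val_loss` with `patience=5`. As a simplified illustration of that rule (not the Keras implementation, and assuming the default `min_delta=0`), the hypothetical function below reports the epoch at which training would stop for a given list of validation losses.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return the 1-based epoch at which training stops, or None if it never does."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Hypothetical validation losses: the best value (0.70) is reached at epoch 3.
toy_losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.7, 0.75, 0.73]
stop_epoch = early_stopping_epoch(toy_losses, patience=5)
print(stop_epoch)  # 8: five epochs in a row without beating 0.70
```

Note that matching the best loss (epoch 6 here) does not count as an improvement, because the comparison is strict.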


In [20]:
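# Perform hyperparameter tuning. Hyperband sets the number of epochs for each
# trial itself (up to max_epochs), so no epochs argument is passed here.
tuner.search(names_train_padded, origins_train_int, validation_split=0.2, callbacks=[stop_early])

# Retrieve the best hyperparameter set found by the search
best_hps = tuner.get_best_hyperparameters(num_trials=1)[0]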

Trial 26 Complete [00h 00m 57s]
val_accuracy: 0.81401526927948

Best val_accuracy So Far: 0.8189970254898071
Total elapsed time: 00h 15m 42s


#### Fit the Model

In [None]:
# Build a fresh model with the best hyperparameters and train it;
# early stopping halts training once val_loss stops improving.
model = tuner.hypermodel.build(best_hps)
model.fit(names_train_padded, origins_train_int, epochs=50, validation_split=0.2, callbacks=[stop_early])

#### Model Evaluation

In [25]:
# Evaluate the model on the test set
names_test_padded = np.array(names_test_padded)
origins_test_int = np.array(origins_test_int)

loss, accuracy = model.evaluate(names_test_padded, origins_test_int)
print(f'Test Accuracy: {accuracy:.4f}')

Test Accuracy: 0.8283
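
A single accuracy figure can hide weak performance on rare classes, and the 20-row sample printed earlier is mostly Russian names. One way to check is per-class recall. The sketch below uses hypothetical toy labels, and `per_class_recall` is a helper introduced here, not part of the notebook.

```python
import numpy as np

def per_class_recall(y_true, y_pred, num_classes):
    """Return the recall of each class; NaN for classes absent from y_true."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = np.full(num_classes, np.nan)
    for cls in range(num_classes):
        mask = y_true == cls
        if mask.any():
            # Fraction of the true members of this class that were predicted correctly.
            recalls[cls] = (y_pred[mask] == cls).mean()
    return recalls

# Hypothetical labels: class 0 dominates, classes 1 and 2 are rare.
toy_true = [0, 0, 0, 0, 1, 1, 2]
toy_pred = [0, 0, 0, 0, 1, 0, 0]

toy_accuracy = float(np.mean(np.asarray(toy_true) == np.asarray(toy_pred)))
toy_recalls = per_class_recall(toy_true, toy_pred, num_classes=3)
print(f"accuracy: {toy_accuracy:.3f}")  # 5 of 7 correct
print(toy_recalls)                      # recall is 1.0, 0.5 and 0.0 for classes 0, 1 and 2
```

In the notebook, `y_true` would be `origins_test_int` and `y_pred` could be obtained with `np.argmax(model.predict(names_test_padded), axis=1)`.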


#### Write a function to predict the origin of names

In [26]:
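def predict_origin(*names):
    """Return a dict mapping each input name to its predicted origin."""
    for name in names:
        assert isinstance(name, str)
    if not names:
        return {}
    # Encode all names the same way as the training data, then predict in one batch
    names_seq = tokenizer.texts_to_sequences(list(names))
    names_padded = pad_sequences(names_seq, maxlen=max_length, padding='post')
    probabilities = model.predict(names_padded, verbose=0)
    predicted_classes = np.argmax(probabilities, axis=1)
    return {name: int_to_origin[int(cls)] for name, cls in zip(names, predicted_classes)}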



In [34]:
# Apply the function

predicted_origins = predict_origin("Justin", "Trudeau")
for name, origin in predicted_origins.items():
    print(f"The predicted origin of {name} is {origin}.")


The predicted origin of Justin is English.
The predicted origin of Trudeau is French.
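
Because the final layer is a softmax, the model outputs a probability for every origin, not just the top one. Reporting the few most likely origins can be more informative for ambiguous names. The sketch below uses a hypothetical probability vector and label map; `top_k_origins` is a helper introduced here for illustration.

```python
import numpy as np

def top_k_origins(probabilities, index_to_label, k=3):
    """Return the k most probable (label, probability) pairs, best first."""
    probabilities = np.asarray(probabilities)
    # argsort is ascending, so reverse it and keep the first k indices.
    top_indices = np.argsort(probabilities)[::-1][:k]
    return [(index_to_label[int(i)], float(probabilities[i])) for i in top_indices]

# Hypothetical label map and softmax output for a single name.
toy_int_to_origin = {0: "English", 1: "French", 2: "German", 3: "Italian"}
toy_probabilities = [0.30, 0.55, 0.05, 0.10]

toy_top = top_k_origins(toy_probabilities, toy_int_to_origin, k=2)
print(toy_top)  # [('French', 0.55), ('English', 0.3)]
```

In the notebook, one row of `model.predict(...)` and the `int_to_origin` dictionary would take the place of the toy inputs.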
