# Assignment 5 - Recurrent Neural Network

**Problem Statement #1:**
Build a sequential model to classify names into gender.

Input to the model will be a name, i.e. a sequence of characters.

Use one hot representation of the  characters.

Remove non-ascii characters, if there are any

Outputs:
Show the effect of the following on the accuracy

1. RNN cells - Simple RNN, LSTM and GRU

2. Dataset size (Randomly select 25%, 50%, 75% and 100% of the data) . For each partial dataset use 80% as training data.

Report overall and class-wise accuracies for all the combinations. (class-wises accuracy should report percentage of correctly predicted male names and female names)

**Problem Statement #2:**
Train a language model using these names.

Output
Generate 100 male names and 100 female names.

Measure the accuracy of classifying these names by using the best-performing model from part 1

**Problem Statement #2a**:
Train a language model using names starting with A, M, and Z.

Output
Generate 50 names

Use perplexity to show the quality of these names, i.e. how realistic these names are


In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [19]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, SimpleRNN, LSTM, GRU, Dropout, Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

In [4]:
data = pd.read_csv('/content/drive/MyDrive/Colab/Deep/name_gender.csv').dropna()
data

Unnamed: 0,name,gender,probability
0,Aaban,M,1.0
1,Aabha,F,1.0
2,Aabid,M,1.0
3,Aabriella,F,1.0
4,Aada,F,1.0
...,...,...,...
95020,Zyvion,M,1.0
95021,Zyvon,M,1.0
95022,Zyyanna,F,1.0
95023,Zyyon,M,1.0


# Problem Statement #1:

In [4]:
# Remove non-ASCII characters
data['name'] = data['name'].apply(lambda x: ''.join(filter(lambda y: y.isascii(), x)))

# Define the dataset sizes to evaluate
dataset_sizes = [0.25, 0.5, 0.75, 1.0]

# Define RNN cell types
rnn_cells = [SimpleRNN, LSTM, GRU]

results = {}

for cell in rnn_cells:
    results[cell.__name__] = {}
    for size in dataset_sizes:
        # Randomly select subset of the dataset
        subset_data = data.sample(frac=size, random_state=42)

        subset_data['name'] = subset_data['name'].str.lower()

        # Split dataset into features and labels
        X = subset_data['name']
        y = pd.get_dummies(subset_data['gender'])  # One-hot encoding for gender

        # Tokenize characters
        tokenizer = Tokenizer(char_level=True)
        tokenizer.fit_on_texts(X)
        X_seq = tokenizer.texts_to_sequences(X)

        # Pad sequences to ensure they have the same length
        max_length = max([len(seq) for seq in X_seq])
        X_padded = pad_sequences(X_seq, maxlen=max_length, padding='post')

        # Split data into training and testing sets
        X_train, X_test, y_train, y_test = train_test_split(X_padded, y, test_size=0.2, random_state=42)

        # Create RNN model
        model = Sequential([
            Embedding(len(tokenizer.word_index)+1, 32, input_length=max_length),
            cell(64),
            Dense(2, activation='softmax')
        ])

        model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

        # Train the model
        model.fit(X_train, y_train, epochs=10, batch_size=64, verbose=1)

        # Evaluate the model
        loss, accuracy = model.evaluate(X_test, y_test, verbose=1)
        #results[cell.__name__][size] = accuracy

        # Predict classes for test set
        y_pred_probs = model.predict(X_test)
        y_pred = np.argmax(y_pred_probs, axis=1)
        y_true = np.argmax(np.array(y_test), axis=1)

        # Calculate class-wise accuracy
        total_male = np.sum(y_true == 0)
        total_female = np.sum(y_true == 1)

        correct_male = np.sum((y_pred == 0) & (y_true == 0))
        correct_female = np.sum((y_pred == 1) & (y_true == 1))

        accuracy_male = correct_male / total_male * 100 if total_male > 0 else 0
        accuracy_female = correct_female / total_female * 100 if total_female > 0 else 0

        results[cell.__name__][size] = {'Overall Accuracy': accuracy, 'Male Accuracy': accuracy_male, 'Female Accuracy': accuracy_female}


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
E

In [5]:
# Results
for cell, sizes in results.items():
    print(f"RNN Cell Type: {cell}")
    for size, accuracy in sizes.items():
        print(f"\nDataset Size: {size * 100}% - Overall Accuracy: {accuracy['Overall Accuracy']:.2f}%")
        print(f"Male Accuracy: {accuracy['Male Accuracy']:.2f}%")
        print(f"Female Accuracy: {accuracy['Female Accuracy']:.2f}%")
    print()

RNN Cell Type: SimpleRNN

Dataset Size: 25.0% - Overall Accuracy: 0.82%
Male Accuracy: 93.04%
Female Accuracy: 61.82%

Dataset Size: 50.0% - Overall Accuracy: 0.85%
Male Accuracy: 91.64%
Female Accuracy: 74.64%

Dataset Size: 75.0% - Overall Accuracy: 0.87%
Male Accuracy: 88.18%
Female Accuracy: 83.89%

Dataset Size: 100.0% - Overall Accuracy: 0.86%
Male Accuracy: 85.97%
Female Accuracy: 87.01%

RNN Cell Type: LSTM

Dataset Size: 25.0% - Overall Accuracy: 0.84%
Male Accuracy: 84.08%
Female Accuracy: 83.34%

Dataset Size: 50.0% - Overall Accuracy: 0.86%
Male Accuracy: 89.33%
Female Accuracy: 80.27%

Dataset Size: 75.0% - Overall Accuracy: 0.87%
Male Accuracy: 89.86%
Female Accuracy: 82.92%

Dataset Size: 100.0% - Overall Accuracy: 0.88%
Male Accuracy: 90.88%
Female Accuracy: 83.02%

RNN Cell Type: GRU

Dataset Size: 25.0% - Overall Accuracy: 0.84%
Male Accuracy: 87.76%
Female Accuracy: 77.48%

Dataset Size: 50.0% - Overall Accuracy: 0.86%
Male Accuracy: 87.82%
Female Accuracy: 83.77%

D

#### Based on these results, LSTM and GRU seem to have slightly better performance compared to SimpleRNN. However, LSTM it shows slightly higher accuracy so we will use LSTM for problem statement 2.

# Problem Statement #2:

In [48]:
data['name'] = data['name'].apply(lambda x: ''.join(filter(lambda y: y.isascii(), x)))

# dataset size for evaluation
dataset_size = 0.8

results = {}
results['LSTM'] = {}

# Randomly select subset of the dataset
subset_data = data.sample(frac=dataset_size, random_state=42)

subset_data['name'] = subset_data['name'].str.lower()

# Split dataset into features and labels
X = subset_data['name']
y = pd.get_dummies(subset_data['gender'])  # One-hot encoding for gender

# Tokenize characters
tokenizer = Tokenizer(char_level=True)
tokenizer.fit_on_texts(X)
X_seq = tokenizer.texts_to_sequences(X)

# Pad sequences to ensure they have the same length
max_length = max([len(seq) for seq in X_seq])
X_padded = pad_sequences(X_seq, maxlen=max_length, padding='post')

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X_padded, y, test_size=0.2, random_state=42)

# Create LSTM model
model = Sequential([
    Embedding(len(tokenizer.word_index)+1, 32, input_length=max_length),
    LSTM(64),
    Dense(2, activation='softmax')
])

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train the LSTM model
model.fit(X_train, y_train, epochs=10, batch_size=64, verbose=1)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x7e326917e950>

In [87]:
# Create a transition matrix for Markov Chain
def create_transition_matrix(names):
    transition_matrix = {}
    for name in names:
        for i in range(len(name) - 1):
            char, next_char = name[i], name[i + 1]
            if char not in transition_matrix:
                transition_matrix[char] = {}
            if next_char not in transition_matrix[char]:
                transition_matrix[char][next_char] = 1
            else:
                transition_matrix[char][next_char] += 1
    for char in transition_matrix:
        total = sum(transition_matrix[char].values())
        for next_char in transition_matrix[char]:
            transition_matrix[char][next_char] /= total
    return transition_matrix

def generate_names(transition_matrix, num_names):
    names = []
    for _ in range(num_names):
        name = np.random.choice(list(transition_matrix.keys()))
        while len(name) < np.random.randint(4, 7):
            char = name[-1]
            next_char = np.random.choice(list(transition_matrix[char].keys()),
                                         p=list(transition_matrix[char].values()))
            name += next_char
        names.append(name.capitalize())
    return names

# Filter names for each gender
male_names = data[data['gender'] == 'M']['name'].tolist()
female_names = data[data['gender'] == 'F']['name'].tolist()

# Create transition matrices for Markov Chain for male and female names
transition_matrix_male = create_transition_matrix(male_names)
transition_matrix_female = create_transition_matrix(female_names)

# Generate 100 male and 100 female names using Markov Chain
generated_male_names = generate_names(transition_matrix_male, 100)
generated_female_names = generate_names(transition_matrix_female, 100)

# Print generated male and female names
print("Generated Male Names:")
print(*generated_male_names, sep='\n')

print("\nGenerated Female Names:")
print(*generated_female_names, sep='\n')

# Prepare sequences for generated names
sequences_male = []
sequences_female = []
for name in generated_male_names:
    sequence = [tokenizer.word_index[char] for char in name if char in tokenizer.word_index]
    sequences_male.append(sequence)
for name in generated_female_names:
    sequence = [tokenizer.word_index[char] for char in name if char in tokenizer.word_index]
    sequences_female.append(sequence)

padded_sequences_male = pad_sequences(sequences_male, maxlen=max_length, padding='post')
padded_sequences_female = pad_sequences(sequences_female, maxlen=max_length, padding='post')

# Measure accuracy for generated names using the LSTM model
accuracy_male = model.evaluate(padded_sequences_male, np.tile([1, 0], (len(padded_sequences_male), 1)))
accuracy_female = model.evaluate(padded_sequences_female, np.tile([0, 1], (len(padded_sequences_female), 1)))

print(f"Accuracy for generated male names: {accuracy_male[1] * 100:.2f}%")
print(f"Accuracy for generated female names: {accuracy_female[1] * 100:.2f}%")

Generated Male Names:
Saab
Ngane
Zartha
Lish
Pron
Clajan
Isha
Coveja
Magai
Ichz
Leluz
Quly
Zrala
Pertaa
Venid
Zani
Meic
Furan
Uare
Habat
Dekeb
Haisa
Msonet
Raden
Rrelew
Mimmu
Shaka
Ncaq
Prath
Hinere
Orai
Baren
Vinee
Crell
Metr
Nnaven
Quris
Gelr
Uiardy
Dahia
Wdende
Hass
Wiuix
Ghag
Zhaee
Celm
Thaiei
Yarni
Zalol
Huka
Ardon
Gharay
Uithya
Ashiq
Resher
Gick
Phrahm
Phot
Ereron
Inhicr
Enwi
Zeele
Miel
Donk
Oorio
Gievo
Vedyv
Juare
Fryny
Wavae
Fonde
Yale
Ersto
Erar
Wwrude
Lapl
Griak
Gilm
Mawi
Jaiv
Fenzi
Derse
Ariri
Daish
Beto
Hobaqu
Treche
Rasla
Ssare
Xsana
Yharia
Panwu
Ckarh
Minon
Chisa
Phaun
Uaza
Khabb
Omiko
Faron

Generated Female Names:
Chay
Zisy
Zynee
Jaync
Edim
Waiece
Leseie
Tret
Vrri
Moshid
Uxyav
Cidel
Arale
Lanic
Akatt
Eshae
Vran
Nnet
Ptalyn
Pamyma
Udari
Mara
Izaner
Ukea
Quli
Zaie
Blelyo
Beerov
Aninar
Yllor
Leriti
Wanan
Waui
Kiana
Bbeg
Sshi
Creri
Viann
Cashvi
Kyaku
Phmal
Tand
Vaver
Lllen
Zjima
Lyarl
Uneana
Gitrin
Qura
Kylaha
Qules
Fary
Ronan
Qusha
Cayaha
Quya
Zetit
Tharm
Carch
Lomyly
Pria

# Problem Statement #2a:

In [32]:
names = data['name'].str.lower().tolist()

# Filter names starting with A, M, and Z
names = [name for name in names if name[0] in ['a', 'm', 'z']]

# Get unique characters and create char-to-int mapping
chars = sorted(list(set(' '.join(names))))
char_to_int = {c: i for i, c in enumerate(chars)}

# Create sequences for training
seq_length = 10
dataX = []
dataY = []
for name in names:
    for i in range(len(name) - seq_length):
        seq_in = name[i:i + seq_length]
        seq_out = name[i + seq_length]
        dataX.append([char_to_int[char] for char in seq_in])
        dataY.append(char_to_int[seq_out])
n_patterns = len(dataX)

X = np.reshape(dataX, (n_patterns, seq_length, 1))
X = X / float(len(chars))
y = to_categorical(dataY, num_classes=len(chars))

# Define and train the LSTM model
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2])))
model.add(Dense(len(chars), activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam')

model.fit(X, y, epochs=20, batch_size=128, verbose=1)

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


<keras.src.callbacks.History at 0x7e32542c6050>

In [43]:
# Create a transition matrix for Markov Chain
def create_transition_matrix(names):
    transition_matrix = {}
    for name in names:
        for i in range(len(name) - 1):
            char, next_char = name[i], name[i + 1]
            if char not in transition_matrix:
                transition_matrix[char] = {}
            if next_char not in transition_matrix[char]:
                transition_matrix[char][next_char] = 1
            else:
                transition_matrix[char][next_char] += 1
    for char in transition_matrix:
        total = sum(transition_matrix[char].values())
        for next_char in transition_matrix[char]:
            transition_matrix[char][next_char] /= total
    return transition_matrix

# Generate names using the Markov Chain model
def generate_names(transition_matrix, num_names, min_length=4, max_length=10):
    names = []
    for _ in range(num_names):
        name = np.random.choice(['a', 'm', 'z'])  # Start name with A, M, or Z
        while len(name) < max_length:
            char = name[-1]
            next_char = np.random.choice(list(transition_matrix[char].keys()),
                                         p=list(transition_matrix[char].values()))
            name += next_char
            if next_char in ['a', 'm', 'z'] and len(name) >= min_length:
                break
        names.append(name.capitalize())
    return names

# Generate transition matrix
transition_matrix = create_transition_matrix(names)

# Generate 50 names using the Markov Chain model
generated_names = generate_names(transition_matrix, 50)
print('Generated 50 Names:')
print(*generated_names, sep='\n')

Generated 50 Names:
Ayshra
Afrria
Menneclero
Maica
Zicilia
Amakeliely
Mekia
Makela
Zyika
Zysia
Zialeenz
Zackoushm
Marisa
Zana
Zeeylviria
Zaya
Zeenletore
Zesta
Maiea
Zartilka
Ayalyna
Aina
Arina
Ziaua
Zeyrndna
Zonelia
Maethm
Meausha
Zymeleya
Maim
Anikeria
Axsouliena
Mona
Adalya
Aroniyelli
Zorkenntta
Mcketrism
Aunia
Minda
Mierra
Melm
Mallm
Zeahena
Mazrisa
Axela
Zyaha
Zacovelllr
Ania
Amariseond
Misha


In [44]:
# Calculate perplexity for the generated names
def calculate_perplexity(generated_names, transition_matrix):
    perplexity_values = []
    for name in generated_names:
        log_prob = 0
        for i in range(len(name) - 1):
            char, next_char = name[i], name[i+1]
            if char in transition_matrix and next_char in transition_matrix[char]:
                log_prob += -np.log2(transition_matrix[char][next_char])
            else:
                log_prob += -np.log2(1e-7)  # Smoothing for unseen transitions
        perplexity = 2 ** (log_prob / (len(name) - 1))
        perplexity_values.append(perplexity)
    return perplexity_values

# Generate perplexity scores for the generated names
perplexity_scores = calculate_perplexity(generated_names, transition_matrix)

# Print generated names and their perplexity scores
print('Generated Names and their Perplexity Scores:')
for i, name in enumerate(generated_names):
    print(f"{name} --> Perplexity = {perplexity_scores[i]:.2f}")

Generated Names and their Perplexity Scores:
Ayshra --> Perplexity = 158.83
Afrria --> Perplexity = 117.90
Menneclero --> Perplexity = 60.84
Maica --> Perplexity = 406.12
Zicilia --> Perplexity = 86.53
Amakeliely --> Perplexity = 30.49
Mekia --> Perplexity = 339.00
Makela --> Perplexity = 127.23
Zyika --> Perplexity = 372.53
Zysia --> Perplexity = 274.11
Zialeenz --> Perplexity = 69.86
Zackoushm --> Perplexity = 71.28
Marisa --> Perplexity = 106.91
Zana --> Perplexity = 604.12
Zeeylviria --> Perplexity = 46.65
Zaya --> Perplexity = 894.50
Zeenletore --> Perplexity = 55.53
Zesta --> Perplexity = 316.07
Maiea --> Perplexity = 435.97
Zartilka --> Perplexity = 105.02
Ayalyna --> Perplexity = 63.10
Aina --> Perplexity = 694.78
Arina --> Perplexity = 183.33
Ziaua --> Perplexity = 440.93
Zeyrndna --> Perplexity = 138.88
Zonelia --> Perplexity = 49.87
Maethm --> Perplexity = 271.33
Meausha --> Perplexity = 82.41
Zymeleya --> Perplexity = 67.37
Maim --> Perplexity = 2443.55
Anikeria --> Perplex

In [45]:
# Calculate the average perplexity
avg_perplexity = sum(perplexity_scores) / len(perplexity_scores)

print(f"Average Perplexity: {avg_perplexity:.2f}")

Average Perplexity: 286.72


The average perplexity for the generated names is 286.72. It indicates a certain level of inconsistency and uncertainty in character transitions across the generated names, potentially impacting the perceived quality. Perplexity is a measure of how well a probability model predicts a sample, and lower perplexity values indicate better performance.
 - Some names like 'Zialeenz', 'Minda', 'Axela', 'Ania', 'Misha', etc., have lower perplexity scores and might resemble names more commonly found in the dataset.
 - Names such as 'Zaya', 'Maim', 'Melm', 'Zeyrndna', etc., have higher perplexity scores, suggesting they are less similar or less frequently occurring names in the dataset.