# Lab 8

## Character-Level Surname Classification with Feed-Forward Neural Networks

In this notebook we will use a surnames dataset that contains 10,000 surnames from 18 different nationalities (Arabic, Chinese, Czech, Dutch, English, French, German, Greek, Irish, Italian, Japanese, Korean, Polish, Portuguese, Russian, Scottish, Spanish, and Vietnamese).

The goal is simple: show how a basic feed-forward neural network can learn to classify surnames by nationality using **character-level representations**. This is different from the word-level embeddings we used before!

**Why character-level?** For names, spelling patterns matter more than word meaning. Character-level models can capture patterns like:
- Russian names ending in "-ov" or "-ova"
- Spanish names ending in "-ez"
- Irish names starting with "O'"
- Vietnamese names with specific character combinations
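
The patterns listed above can be written down as hand-coded rules, which helps show what the network is expected to discover from data on its own. The sketch below is only an illustration: `PATTERNS` and `matching_patterns` are hypothetical names that are not used elsewhere in this lab, and the rules cover only three of the 18 nationalities.

```python
# Hypothetical hand-written rules for three of the patterns listed above.
# The model in this lab is meant to learn such cues, not have them hard-coded.
PATTERNS = {
    "Russian": lambda s: s.endswith(("ov", "ova")),
    "Spanish": lambda s: s.endswith("ez"),
    "Irish": lambda s: s.startswith("o'"),
}

def matching_patterns(surname):
    """Return the nationalities whose hand-written rule matches the surname."""
    lowered = surname.lower()
    return [nationality for nationality, rule in PATTERNS.items() if rule(lowered)]

for name in ["Ivanov", "Martinez", "O'Brien", "Smith"]:
    print(f"{name:10} -> {matching_patterns(name)}")
```

Rules like these miss any surname without a listed prefix or suffix ("Smith" matches nothing), which is one reason to let a model learn the patterns instead.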

You can run this lab either locally or in Colab.

- To run in Colab, go to `https://colab.research.google.com`, sign in, and upload this notebook. Colab offers free GPU access.
- To run locally, first install the requirements in `requirements.txt`, then run `jupyter notebook` and open the notebook for this lab.

Follow the instructions. Good luck!

In [None]:
!nvidia-smi

In [None]:
# CRITICAL: Version constraints for compatibility
# These versions are tested and required for this course
!pip install 'numpy<2' \
             'tensorflow==2.15.0' \
             'pandas' \
             'matplotlib' \
             'scikit-learn'

In [None]:
import tensorflow as tf
from keras.models import Sequential
from keras.layers import Dense, Flatten, Dropout
import numpy as np
import random
import os
import pandas as pd
import warnings
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

TRACE = False

def set_seeds_and_trace():
    """Set seeds for reproducibility across numpy, tensorflow, and python random"""
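    # Note: PYTHONHASHSEED only affects interpreters started after it is set,
    # so this line does not change hashing in the already-running kernel.
    os.environ['PYTHONHASHSEED'] = '0'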
    np.random.seed(42)
    tf.random.set_seed(42)
    random.seed(42)
    if TRACE:
        tf.debugging.set_log_device_placement(True)

set_seeds_and_trace()
warnings.filterwarnings('ignore')
print("Setup complete!")

## Getting the Dataset

We'll download the surnames dataset from Google Drive. This dataset contains 10,000 surnames labeled with their nationality of origin.

In [None]:
%%writefile get_data.sh
if [ ! -f surnames.csv ]; then
  wget --no-check-certificate 'https://drive.google.com/uc?export=download&id=1AQxhGLhoHE162HALo1NkgdaN-LVY4QWo' -O surnames.csv
fi

In [None]:
!bash get_data.sh

In [None]:
# Load the surnames dataset
df = pd.read_csv('surnames.csv')
print(f"Dataset shape: {df.shape}")
print(f"\nColumns: {df.columns.tolist()}")
print(f"\nFirst few rows:")
df.head(10)

In [None]:
# Check the distribution of nationalities in the dataset
nationality_counts = df['nationality'].value_counts()
print("Nationality distribution:")
print(nationality_counts)

# Visualize the distribution
plt.figure(figsize=(12, 5))
nationality_counts.plot(kind='bar')
plt.title('Distribution of Surnames by Nationality')
plt.xlabel('Nationality')
plt.ylabel('Count')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.show()

## Character-Level Encoding

For character-level models, we need to:
1. Build a **character vocabulary** - a mapping of every unique character to an index
2. Convert each surname into a **one-hot encoded matrix**

**Example**: If our vocabulary is `{'a': 0, 'b': 1, 'c': 2}` and max length is 3:
- "cab" becomes a 3x3 matrix where each row is a one-hot vector for that character
- Row 0: [0, 0, 1] (for 'c')
- Row 1: [1, 0, 0] (for 'a')  
- Row 2: [0, 1, 0] (for 'b')

This is different from word embeddings! Here we're working with individual characters, not words.
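
The "cab" example above can be reproduced with a few lines of plain Python. This is a minimal sketch using the toy vocabulary from the example, not the dataset vocabulary; `toy_vocab`, `toy_max_length`, and `one_hot_rows` are hypothetical names chosen for illustration, and lists are used instead of NumPy arrays to keep the snippet dependency-free.

```python
# Toy vocabulary from the example above (the real one is built from the dataset).
toy_vocab = {'a': 0, 'b': 1, 'c': 2}
toy_max_length = 3

def one_hot_rows(word, vocab, max_length):
    """Return max_length rows; each row is a one-hot list over the vocabulary."""
    rows = [[0] * len(vocab) for _ in range(max_length)]
    for position, char in enumerate(word[:max_length]):
        rows[position][vocab[char]] = 1
    return rows

for row in one_hot_rows("cab", toy_vocab, toy_max_length):
    print(row)
# [0, 0, 1]
# [1, 0, 0]
# [0, 1, 0]

# A shorter word leaves its remaining rows all zero, which acts as padding.
print(one_hot_rows("ab", toy_vocab, toy_max_length))
# [[1, 0, 0], [0, 1, 0], [0, 0, 0]]
```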

### Demo: Building the Character Vocabulary

In [None]:
def build_char_vocabulary(surnames):
    """
    Build a character vocabulary from all surnames.
    Returns a dictionary mapping characters to indices.
    """
    # Collect all unique characters from all surnames
    all_chars = set()
    for surname in surnames:
        all_chars.update(surname.lower())  # Convert to lowercase for consistency
    
    # Sort characters for consistent ordering (the padding entry is added below)
    sorted_chars = sorted(all_chars)
    
    # Create character to index mapping (reserve 0 for padding)
    char_to_idx = {char: idx + 1 for idx, char in enumerate(sorted_chars)}
    char_to_idx['<PAD>'] = 0  # Padding character for shorter names
    
    return char_to_idx

# Build the vocabulary
char_to_idx = build_char_vocabulary(df['surname'])
vocab_size = len(char_to_idx)

print(f"Vocabulary size: {vocab_size}")
print(f"\nCharacter to index mapping (first 20):")
print(dict(list(char_to_idx.items())[:20]))

In [None]:
# Find the maximum surname length to determine padding
max_length = df['surname'].str.len().max()
print(f"Maximum surname length: {max_length}")
print(f"\nWe'll use this as our fixed sequence length for all surnames.")

### Lab: Encode Surnames to One-Hot Matrices

Now you'll implement the function to convert surnames into one-hot encoded matrices.

In [None]:
def encode_surname(surname, char_to_idx, max_length):
    """
    Convert a surname to a one-hot encoded matrix.
    
    Args:
        surname: String - the surname to encode
        char_to_idx: Dict - character to index mapping
        max_length: Int - maximum length for padding
    
    Returns:
        numpy array of shape (max_length, vocab_size) - one-hot encoded matrix
    """
    # Initialize matrix with zeros
    vocab_size = len(char_to_idx)
    matrix = np.zeros((max_length, vocab_size))
    
    # Convert surname to lowercase
    surname_lower = surname.lower()
    
    # FILL: For each character in the surname (up to max_length):
    #   1. Get the character index from char_to_idx
    #   2. Set the corresponding position in the matrix to 1 (one-hot encoding)
    # Hint: matrix[position, char_index] = 1
    
    for i, char in enumerate(surname_lower[:max_length]):
        if char in char_to_idx:
            char_idx = None  # FILL: get the index for this character
            matrix[i, char_idx] = None  # FILL: set the one-hot value
    
    return matrix

# Test with an example
test_surname = "Chen"
encoded = encode_surname(test_surname, char_to_idx, max_length)
print(f"Encoding '{test_surname}':")
print(f"Shape: {encoded.shape}")
print(f"\nFirst 3 rows (one per character):")
print(encoded[:3, :])

### Lab: Prepare the Full Dataset

Now encode all surnames and prepare labels for training.
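
Label preparation follows the same idea as character encoding: map each class name to an integer, then turn that integer into a one-hot vector. The sketch below shows the two steps on a toy label list in plain Python; `toy_labels`, `label_to_idx`, and `to_one_hot` are hypothetical names, and the lab itself uses pandas and Keras utilities for the same job.

```python
# Toy labels for illustration only; the real labels come from df['nationality'].
toy_labels = ["Irish", "Chinese", "Irish", "Russian"]

# Step 1: map each class name to an integer index (sorted for a stable order).
label_to_idx = {label: idx for idx, label in enumerate(sorted(set(toy_labels)))}
indices = [label_to_idx[label] for label in toy_labels]

# Step 2: expand each index into a one-hot vector with one slot per class.
def to_one_hot(index, num_classes):
    return [1 if i == index else 0 for i in range(num_classes)]

one_hot_labels = [to_one_hot(idx, len(label_to_idx)) for idx in indices]

print(label_to_idx)    # {'Chinese': 0, 'Irish': 1, 'Russian': 2}
print(indices)         # [1, 0, 1, 2]
print(one_hot_labels)  # [[0, 1, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
```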

In [None]:
# Encode all surnames
print("Encoding all surnames...")
X = np.array([encode_surname(surname, char_to_idx, max_length) for surname in df['surname']])
print(f"X shape: {X.shape}")  # Should be (num_samples, max_length, vocab_size)

# Create nationality to label mapping
nationalities = df['nationality'].unique()
nationality_to_idx = {nat: idx for idx, nat in enumerate(sorted(nationalities))}
idx_to_nationality = {idx: nat for nat, idx in nationality_to_idx.items()}

print(f"\nNumber of nationalities: {len(nationality_to_idx)}")
print(f"Nationality to index mapping:")
print(nationality_to_idx)

# FILL: Convert nationalities to numeric labels using nationality_to_idx
# Hint: df['nationality'].map(nationality_to_idx)
y = None  # FILL

# FILL: Convert labels to one-hot encoding using tf.keras.utils.to_categorical
# The number of classes should be len(nationality_to_idx)
y_onehot = None  # FILL

print(f"\ny shape: {y_onehot.shape}")

In [None]:
# Split into training and test sets
# Use 80/20 split with random_state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(X, y_onehot, test_size=0.2, random_state=42)

print(f"Training set: {X_train.shape}")
print(f"Test set: {X_test.shape}")

## Building a Simple Feed-Forward Neural Network

Now we'll build a simple Multi-Layer Perceptron (MLP) to classify surnames. The architecture is:

1. **Flatten layer**: Convert the (max_length, vocab_size) matrix into a single vector
2. **Dense layer**: Hidden layer with 128 neurons and ReLU activation
3. **Dropout layer**: Regularization to prevent overfitting
4. **Dense layer**: Output layer with 18 neurons (one per nationality) and softmax activation

This is one of the simplest neural network architectures for this task!
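
To get a feel for the size of this architecture, the arithmetic below counts the trainable parameters of each layer. The values `max_length=20` and `vocab_size=30` are assumptions for illustration only (the real values depend on the dataset), and `mlp_parameter_count` is a hypothetical helper. Flatten and Dropout have no trainable parameters.

```python
def mlp_parameter_count(max_length, vocab_size, hidden_units, num_classes):
    """Return (flattened input size, hidden-layer params, output-layer params)."""
    flat_size = max_length * vocab_size                      # Flatten: no parameters
    hidden_params = flat_size * hidden_units + hidden_units  # weights + biases
    output_params = hidden_units * num_classes + num_classes
    return flat_size, hidden_params, output_params

# Assumed sizes for illustration; substitute the values printed earlier in the notebook.
flat_size, hidden_params, output_params = mlp_parameter_count(
    max_length=20, vocab_size=30, hidden_units=128, num_classes=18
)
print(f"Flattened input size: {flat_size}")       # 600
print(f"Hidden layer params:  {hidden_params}")   # 76928
print(f"Output layer params:  {output_params}")   # 2322
print(f"Total trainable:      {hidden_params + output_params}")  # 79250
```

Once the model is built, these totals can be compared against the output of `model.summary()`.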

### Lab: Build the Model

In [None]:
model = Sequential()

# FILL: Add a Flatten layer to convert the (max_length, vocab_size) matrix to a 1D vector
# Hint: Flatten(input_shape=(max_length, vocab_size))
model.add()

# FILL: Add a Dense layer with 128 neurons and 'relu' activation
model.add()

# FILL: Add a Dropout layer with 0.3 dropout rate for regularization
model.add()

# FILL: Add the output Dense layer with softmax activation
# The number of neurons should equal the number of nationalities
model.add()

model.summary()

### Lab: Compile and Train the Model

Now compile and train the model!
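
Before compiling, it can help to see what the softmax output and the categorical cross-entropy loss compute for a single example. The sketch below uses only the standard library; the logits are made-up numbers, and `softmax` and `categorical_crossentropy` here are hand-written illustrations, not the Keras implementations.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - largest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot_target, probabilities):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probabilities) if t > 0)

# Made-up scores for a 3-class toy problem; the true class is index 0.
probs = softmax([2.0, 1.0, 0.1])
loss = categorical_crossentropy([1, 0, 0], probs)
print([round(p, 3) for p in probs], round(loss, 3))

# A model that spreads probability evenly over 18 classes has loss ln(18).
uniform_loss = categorical_crossentropy([1] + [0] * 17, [1 / 18] * 18)
print(round(uniform_loss, 2))  # 2.89
```

As a rough sanity check, the training loss in the first epoch should start somewhere near that uniform value and decrease from there.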

In [None]:
# FILL: Compile the model
# Hint: For multi-class classification with one-hot labels, use 'categorical_crossentropy' loss
# Use 'adam' optimizer and track 'accuracy' metric
model.compile(
    optimizer=None,  # FILL
    loss=None,  # FILL
    metrics=None  # FILL
)

# FILL: Train the model
# Use validation_split=0.2, epochs=20, batch_size=32
# Add early stopping callback to prevent overfitting
callback = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

history = None  # FILL: use model.fit() with the callback

print("\nTraining complete!")

In [None]:
# Plot training history to visualize learning progress
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Loss plot - shows how well the model fits the data
ax1.plot(history.history['loss'], label='Training Loss', color='blue')
ax1.plot(history.history['val_loss'], label='Validation Loss', color='green')
ax1.set_title('Model Loss')
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Loss')
ax1.legend()
ax1.grid(True)

# Accuracy plot - shows classification performance
ax2.plot(history.history['accuracy'], label='Training Accuracy', color='blue')
ax2.plot(history.history['val_accuracy'], label='Validation Accuracy', color='green')
ax2.set_title('Model Accuracy')
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Accuracy')
ax2.legend()
ax2.grid(True)

plt.tight_layout()
plt.show()

In [None]:
# Evaluate on test set - this shows how well the model generalizes to unseen data
test_loss, test_accuracy = model.evaluate(X_test, y_test, verbose=0)
print(f"\nTest Accuracy: {test_accuracy:.4f} ({test_accuracy*100:.2f}%)")
print(f"Test Loss: {test_loss:.4f}")

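# Interpretation:
# - Uniform random guessing over 18 classes gives about 5.6% accuracy (1/18); if the classes
#   are imbalanced, always predicting the most frequent class is a stronger baseline
# - Test accuracy around 40-60% is typical for this simple model; your result may differ
# - Character-level features help here because surnames carry distinctive spelling patterns
# - Higher accuracy would likely need architectures that model character order (RNNs, CNNs)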

## Testing with Custom Surnames

Let's test the model with some example surnames and see if it can correctly predict their nationality!
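
Turning the model output into a prediction comes down to picking the index with the highest softmax probability and looking up its name. The sketch below shows that step on made-up probabilities for three classes; `toy_idx_to_nationality`, `top_prediction`, and `top_k` are hypothetical names used only here.

```python
# Made-up softmax output for a 3-class toy problem.
toy_idx_to_nationality = {0: "Chinese", 1: "Irish", 2: "Russian"}
toy_probabilities = [0.10, 0.15, 0.75]

def top_prediction(probabilities, idx_to_label):
    """Return (label, probability) for the most probable class."""
    best_idx = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return idx_to_label[best_idx], probabilities[best_idx]

def top_k(probabilities, idx_to_label, k=2):
    """Return the k most probable (label, probability) pairs, best first."""
    ranked = sorted(range(len(probabilities)), key=lambda i: probabilities[i], reverse=True)
    return [(idx_to_label[i], probabilities[i]) for i in ranked[:k]]

print(top_prediction(toy_probabilities, toy_idx_to_nationality))  # ('Russian', 0.75)
print(top_k(toy_probabilities, toy_idx_to_nationality))           # [('Russian', 0.75), ('Irish', 0.15)]
```

Looking at the top two or three classes, not just the winner, often shows whether a wrong prediction was a near miss.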

In [None]:
def predict_nationality(surname, model, char_to_idx, max_length, idx_to_nationality):
    """
    Predict the nationality of a surname using the trained model.
    
    Args:
        surname: String - surname to classify
        model: Trained Keras model
        char_to_idx: Character vocabulary dictionary
        max_length: Maximum sequence length
        idx_to_nationality: Reverse mapping from index to nationality name
    
    Returns:
        predicted_nationality: String - predicted nationality
        confidence: Float - model confidence (0-1)
    """
    # Encode the surname using the same encoding function
    encoded = encode_surname(surname, char_to_idx, max_length)
    
    # Reshape for model input: add batch dimension
    # Model expects (batch_size, max_length, vocab_size)
    encoded = encoded.reshape(1, max_length, len(char_to_idx))
    
    # Get model prediction (softmax probabilities for each nationality)
    prediction = model.predict(encoded, verbose=0)  # verbose=0 suppresses progress bar
    
    # Get the class with highest probability
    predicted_idx = np.argmax(prediction[0])  # argmax returns index of maximum value
    predicted_nationality = idx_to_nationality[predicted_idx]
    confidence = prediction[0][predicted_idx]  # Probability for predicted class
    
    return predicted_nationality, confidence

# Test with various surnames from different nationalities
test_surnames = [
    'Smith',      # English
    'Ivanov',     # Russian
    'Chen',       # Chinese
    "O'Brien",    # Irish
    'Martinez',   # Spanish
    'Schmidt',    # German
    'Nakamura'    # Japanese
]

print("Predictions:")
print("-" * 60)
for surname in test_surnames:
    nationality, confidence = predict_nationality(surname, model, char_to_idx, max_length, idx_to_nationality)
    print(f"{surname:15} -> {nationality:15} (confidence: {confidence:.2%})")

## Summary

In this notebook, you learned:

1. **Character-level representations**: How to work with individual characters instead of words
   - Built a character vocabulary from the dataset
   - Encoded surnames as one-hot matrices
   
2. **One-hot encoding**: Converting characters to numerical matrices for neural networks
   - Each character becomes a sparse vector
   - Fixed-length sequences with padding
   
3. **Feed-forward networks**: Building a simple MLP for text classification
   - Flatten layer to convert 2D input to 1D
   - Dense layers with ReLU activation
   - Dropout for regularization
   - Softmax output for multi-class classification
   
4. **Pattern recognition**: How neural networks learn spelling patterns that indicate nationality
   - Russian names: -ov, -ova endings
   - Spanish names: -ez endings
   - Irish names: O' prefix
   - Different character distributions across languages

**Key takeaway**: Character-level models are powerful for tasks where spelling patterns matter more than word semantics - like names, hashtags, or rare words!

### Optional Challenge

Try improving the model by:
- **Adding more hidden layers**: Add another Dense layer before the output
- **Experimenting with different activation functions**: Try LeakyReLU or ELU
- **Using a CNN architecture**: Add Conv1D layers to capture local character patterns
- **Testing with your own surname**: Modify the test cell to try your own or your friends' surnames!
- **Analyzing errors**: Find which nationality pairs the model confuses most
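
For the error-analysis challenge, one possible starting point is to count how often each (true, predicted) pair occurs among the mistakes. The sketch below uses made-up label lists and a hypothetical helper `most_confused_pairs`. With the trained model, the true and predicted labels could be obtained by applying `np.argmax(..., axis=1)` to `y_test` and to `model.predict(X_test)`, then mapping the indices through `idx_to_nationality`.

```python
from collections import Counter

def most_confused_pairs(true_labels, predicted_labels, top_n=3):
    """Count (true, predicted) pairs among misclassified examples, most frequent first."""
    errors = Counter(
        (true, pred)
        for true, pred in zip(true_labels, predicted_labels)
        if true != pred
    )
    return errors.most_common(top_n)

# Made-up labels for illustration only; these are not results from the model.
toy_true = ["Czech", "Czech", "Polish", "Irish", "Scottish", "Czech"]
toy_pred = ["Polish", "Polish", "Polish", "Scottish", "Scottish", "Czech"]

for (true, pred), count in most_confused_pairs(toy_true, toy_pred):
    print(f"{true} predicted as {pred}: {count}")
```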