# Project 3: Name Gender Classification using NLTK
**Course**: DATA 620  
**Student**: Ariba Mandavia


## Objective
Build a gender classifier using first names from the NLTK Names Corpus.  
Explore different classifiers (Naive Bayes, Decision Tree, MaxEnt), use custom features, and compare performance on dev-test and test datasets. 

## Introduction

This project explores how well we can classify the gender (male or female) of a person based on their first name using Natural Language Processing techniques.

We use the `names` corpus from the NLTK library, which includes over 7,000 first names labeled with gender. We apply multiple classification algorithms — Naive Bayes, Decision Tree, and Maximum Entropy — and evaluate their performance.

By engineering meaningful features from names (like suffixes, vowels, and first/last letters), we aim to understand:
- Which features are most predictive of gender?
- Which classifier performs best on unseen data?
- How reliable and generalizable are the results?




In [1]:
import nltk
import random
from nltk.corpus import names
from nltk import NaiveBayesClassifier, classify
from nltk.classify import DecisionTreeClassifier, MaxentClassifier

# Download corpus if needed
nltk.download('names')

# Load and shuffle names
labeled_names = [(name, 'male') for name in names.words('male.txt')] + \
                [(name, 'female') for name in names.words('female.txt')]
random.shuffle(labeled_names)

[nltk_data] Downloading package names to
[nltk_data]     /Users/aribarazzaq/nltk_data...
[nltk_data]   Package names is already up-to-date!


## Dataset

We use the `names` corpus from NLTK. It contains:
- 2,943 male names
- 5,001 female names

The data is randomly shuffled and split as follows:
- **Training set**: ~6,900 names (everything after the first 1,000)
- **Dev-test set**: 500 names (used for tuning and evaluation)
- **Test set**: 500 names (used only for final model evaluation)


In [2]:
## Step 1: Load and Prepare the Data
# Split data
test_names = labeled_names[:500]
devtest_names = labeled_names[500:1000]
train_names = labeled_names[1000:]
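
Because the shuffle is unseeded, every run produces a different split and slightly different accuracies. The sketch below shows one way to make the split repeatable. The helper name `split_names`, the seed value, and the toy data are assumptions for illustration and are not part of the original notebook.

```python
import random

def split_names(labeled, n_test=500, n_devtest=500, seed=620):
    """Shuffle a copy of `labeled` with a fixed seed and cut it into three parts."""
    rng = random.Random(seed)          # private generator, global state untouched
    shuffled = list(labeled)           # copy so the caller's list keeps its order
    rng.shuffle(shuffled)
    test = shuffled[:n_test]
    devtest = shuffled[n_test:n_test + n_devtest]
    train = shuffled[n_test + n_devtest:]
    return train, devtest, test

# Toy stand-in for the (name, gender) pairs built from the corpus.
toy = [(f"name{i}", "male" if i % 3 == 0 else "female") for i in range(2000)]
train, devtest, test = split_names(toy)
print(len(train), len(devtest), len(test))
```

Calling `split_names` twice with the same seed returns the same three lists, so results from separate runs can be compared directly.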

## Feature Design

We designed features to capture common gendered patterns in names:

- **Last letter**: many female names end in "a" or "e"
- **First letter**: some initials may be more common for one gender
- **Name length**: name length may differ slightly between genders
- **Vowel count**: female names may tend to contain more vowels
- **Suffixes (last 2–3 letters)**: endings such as "ia", "na", and "us" are strong gender indicators


In [None]:
## Step 2: Feature Engineering

# Feature extractor
def gender_features(name):
    return {
        'last_letter': name[-1].lower(),
        'first_letter': name[0].lower(),
        'length': len(name),
        'vowel_count': sum(1 for c in name.lower() if c in 'aeiou'),
        'suffix2': name[-2:].lower(),
        'suffix3': name[-3:].lower()
    }


We apply the feature extractor to all three data subsets.
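
To make the shape of one feature dictionary concrete, the sketch below repeats the extractor in a self-contained form and prints the result for two example names. The names "Sophia" and "Al" are arbitrary illustrations.

```python
def gender_features(name):
    """Same features as the notebook's extractor, lowercasing the name once."""
    lowered = name.lower()
    return {
        'last_letter': lowered[-1],
        'first_letter': lowered[0],
        'length': len(lowered),
        'vowel_count': sum(1 for c in lowered if c in 'aeiou'),
        'suffix2': lowered[-2:],
        'suffix3': lowered[-3:],
    }

print(gender_features("Sophia"))
# {'last_letter': 'a', 'first_letter': 's', 'length': 6, 'vowel_count': 3, 'suffix2': 'ia', 'suffix3': 'hia'}

# For names shorter than three letters, slicing returns the whole name.
print(gender_features("Al")['suffix3'])  # al
```

Note that `last_letter` is always the final character of `suffix2`, which in turn is contained in `suffix3`, so these three features are strongly correlated.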





In [5]:
## Step 3: Create Feature Sets


# Feature sets
train_set = [(gender_features(n), g) for (n, g) in train_names]
devtest_set = [(gender_features(n), g) for (n, g) in devtest_names]
test_set = [(gender_features(n), g) for (n, g) in test_names]


## Naive Bayes Classifier

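We start with the Naive Bayes classifier, a simple probabilistic model that works well for text classification problems. It assumes the features are conditionally independent given the label and picks the label with the highest posterior probability. Several of our features overlap (for example, `last_letter` is contained in `suffix2`), so this assumption does not strictly hold here.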
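
Written out, this is the standard Naive Bayes decision rule. Here $f_1, \dots, f_n$ are the feature values extracted from a name and $y$ ranges over the two labels. The symbols are notation introduced for this note and do not come from the NLTK API.

```latex
\hat{y} \;=\; \arg\max_{y \in \{\text{male},\, \text{female}\}} \; P(y) \prod_{i=1}^{n} P(f_i \mid y)
```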

In [6]:
# Naive Bayes Classifier

nb_classifier = NaiveBayesClassifier.train(train_set)
print("\nNaive Bayes Classifier:")
print("  Dev-Test Accuracy:", classify.accuracy(nb_classifier, devtest_set))
print("  Test Accuracy:", classify.accuracy(nb_classifier, test_set))
nb_classifier.show_most_informative_features(10)


Naive Bayes Classifier:
  Dev-Test Accuracy: 0.824
  Test Accuracy: 0.772
Most Informative Features
                 suffix2 = 'na'           female : male   =     94.6 : 1.0
                 suffix2 = 'ia'           female : male   =     38.1 : 1.0
                 suffix2 = 'us'             male : female =     36.5 : 1.0
             last_letter = 'a'            female : male   =     36.2 : 1.0
                 suffix2 = 'sa'           female : male   =     34.8 : 1.0
                 suffix2 = 'rd'             male : female =     31.8 : 1.0
                 suffix2 = 'ta'           female : male   =     31.6 : 1.0
                 suffix2 = 'rt'             male : female =     31.2 : 1.0
             last_letter = 'k'              male : female =     30.0 : 1.0
             last_letter = 'f'              male : female =     26.7 : 1.0
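
Each row in the listing above is a likelihood ratio: how much more probable a feature value is under one label than under the other. The sketch below computes such a ratio from raw counts. The counts are made up, and the add-0.5 smoothing and the helper name `likelihood_ratio` are assumptions chosen for illustration, not a description of NLTK's exact internals.

```python
def smoothed_prob(count, total, bins, gamma=0.5):
    """Estimate P(value | label) with additive smoothing over `bins` possible values."""
    return (count + gamma) / (total + gamma * bins)

def likelihood_ratio(count_a, total_a, count_b, total_b, bins, gamma=0.5):
    """Ratio P(value | label A) / P(value | label B)."""
    p_a = smoothed_prob(count_a, total_a, bins, gamma)
    p_b = smoothed_prob(count_b, total_b, bins, gamma)
    return p_a / p_b

# Hypothetical counts: a suffix seen in 10 of 100 names for label A
# and in 1 of 100 names for label B, with 2 possible feature values.
ratio = likelihood_ratio(10, 100, 1, 100, bins=2)
print(f"A : B = {ratio:.1f} : 1.0")  # A : B = 7.0 : 1.0
```

Smoothing keeps the ratio finite when a value never occurs with one label, which is why even a very one-sided suffix yields a large but bounded number.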


## Decision Tree Classifier

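The Decision Tree classifier learns a sequence of feature tests from the training data and follows them from the root to a leaf to classify a new name. It may overfit on small or noisy data, particularly when features such as the suffixes take many distinct values.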
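
As a rough intuition for the kind of rule a tree learns, the sketch below trains a one-level tree (a decision stump) on a single feature. The toy data and the helper names `train_stump` and `classify_stump` are hypothetical and are not part of NLTK.

```python
from collections import Counter, defaultdict

def train_stump(examples, feature):
    """Learn the majority label for each value of `feature`, plus a fallback label."""
    by_value = defaultdict(Counter)
    overall = Counter()
    for features, label in examples:
        by_value[features[feature]][label] += 1
        overall[label] += 1
    rules = {value: counts.most_common(1)[0][0] for value, counts in by_value.items()}
    default = overall.most_common(1)[0][0]   # used for values never seen in training
    return rules, default

def classify_stump(stump, features, feature):
    """Look up the rule for this feature value, falling back to the overall majority."""
    rules, default = stump
    return rules.get(features[feature], default)

toy = [
    ({'last_letter': 'a'}, 'female'),
    ({'last_letter': 'a'}, 'female'),
    ({'last_letter': 'k'}, 'male'),
    ({'last_letter': 'd'}, 'male'),
    ({'last_letter': 'e'}, 'female'),
]
stump = train_stump(toy, 'last_letter')
print(classify_stump(stump, {'last_letter': 'k'}, 'last_letter'))  # male
print(classify_stump(stump, {'last_letter': 'z'}, 'last_letter'))  # female (fallback)
```

A full tree repeats this idea recursively, which is also where overfitting comes from: deep branches end up memorizing rare feature values.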

In [7]:
# Decision Tree Classifier

dt_classifier = DecisionTreeClassifier.train(train_set)
print("\nDecision Tree Classifier:")
print("  Dev-Test Accuracy:", classify.accuracy(dt_classifier, devtest_set))
print("  Test Accuracy:", classify.accuracy(dt_classifier, test_set))



Decision Tree Classifier:
  Dev-Test Accuracy: 0.77
  Test Accuracy: 0.752


## MaxEnt Classifier

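Maximum Entropy (MaxEnt) is equivalent to multinomial logistic regression: it learns a weight for each feature and label combination by iteratively maximizing the likelihood of the training data, without assuming the features are independent. Training here is capped at 10 iterations (`max_iter=10`).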

In [8]:
# MaxEnt Classifier

maxent_classifier = MaxentClassifier.train(train_set, max_iter=10)
print("\nMaxEnt Classifier:")
print("  Dev-Test Accuracy:", classify.accuracy(maxent_classifier, devtest_set))
print("  Test Accuracy:", classify.accuracy(maxent_classifier, test_set))


  ==> Training (10 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -0.69315        0.369
             2          -0.42285        0.806
             3          -0.35610        0.834
             4          -0.32298        0.846
             5          -0.30303        0.851
             6          -0.28944        0.854
             7          -0.27940        0.856
             8          -0.27157        0.858
             9          -0.26522        0.860
         Final          -0.25992        0.862

MaxEnt Classifier:
  Dev-Test Accuracy: 0.846
  Test Accuracy: 0.788
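
The first row of the training log can be checked by hand. If every weight starts at zero (an assumption about the initial state, though it matches the log), both labels receive probability 0.5 and the average log likelihood is ln 0.5. The sketch below shows the computation; `maxent_probs` is a hypothetical helper, not an NLTK function.

```python
import math

def maxent_probs(scores):
    """Convert per-label scores (sums of active feature weights) into probabilities."""
    top = max(scores.values())                      # subtract the max for numerical stability
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    z = sum(exps.values())                          # normalizing constant
    return {label: e / z for label, e in exps.items()}

# With all weights at zero, every label has score 0.
initial = maxent_probs({'male': 0.0, 'female': 0.0})
print(initial)                                  # {'male': 0.5, 'female': 0.5}
print(round(math.log(initial['female']), 5))    # -0.69315

# Once weights move, the label with the larger score gets more probability.
later = maxent_probs({'male': -1.0, 'female': 1.5})
print(later['female'] > later['male'])          # True
```

As the weights are updated, the log likelihood rises toward zero, which is the trend visible in the remaining rows of the log.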


## Ensemble Classifier

We create an ensemble classifier that combines the predictions from all three models using majority voting. Voting can smooth out the weaknesses of individual classifiers, although it is not guaranteed to beat the best single model.
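
The voting rule itself does not depend on the classifiers. A minimal sketch using `collections.Counter` follows. The helper name `majority_vote` is hypothetical.

```python
from collections import Counter

def majority_vote(votes):
    """Return the label that appears most often in `votes`."""
    return Counter(votes).most_common(1)[0][0]

print(majority_vote(['male', 'female', 'female']))  # female
print(majority_vote(['male', 'male', 'female']))    # male
```

With three voters and two labels a tie cannot occur, so the result is always well defined. An even number of voters or a third label would need an explicit tie-breaking rule.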


In [9]:
# Ensemble Classifier (Voting)
def ensemble_classify(name):
    features = gender_features(name)
    votes = [
        nb_classifier.classify(features),
        dt_classifier.classify(features),
        maxent_classifier.classify(features)
    ]
    return max(set(votes), key=votes.count)

ensemble_dev_acc = sum(ensemble_classify(n) == g for (n, g) in devtest_names) / len(devtest_names)
ensemble_test_acc = sum(ensemble_classify(n) == g for (n, g) in test_names) / len(test_names)

print("\nEnsemble Classifier:")
print("  Dev-Test Accuracy:", ensemble_dev_acc)
print("  Test Accuracy:", ensemble_test_acc)



Ensemble Classifier:
  Dev-Test Accuracy: 0.84
  Test Accuracy: 0.78


## Cross-Validation

A single 500-name split gives a noisy accuracy estimate, so we also evaluate Naive Bayes with 5-fold cross-validation over the full set of labeled names.


In [10]:

# Cross-Validation Function
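def cross_validate(data, k=5):
    data = list(data)              # copy so the caller's list is not reshuffled in place
    random.shuffle(data)
    chunk_size = len(data) // k    # the last len(data) % k names are never held out
    accuracies = []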

    for i in range(k):
        test = data[i*chunk_size:(i+1)*chunk_size]
        train = data[:i*chunk_size] + data[(i+1)*chunk_size:]
        train_set = [(gender_features(n), g) for (n, g) in train]
        test_set = [(gender_features(n), g) for (n, g) in test]

        model = NaiveBayesClassifier.train(train_set)
        acc = classify.accuracy(model, test_set)
        accuracies.append(acc)

    return sum(accuracies) / k

cv_acc = cross_validate(labeled_names, k=5)
print(f"\n5-Fold Cross-Validated Naive Bayes Accuracy: {cv_acc:.3f}")



5-Fold Cross-Validated Naive Bayes Accuracy: 0.798
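
With contiguous chunks of size `len(data) // k`, the last `len(data) % k` names are always used for training and never held out. One variation that assigns every item to exactly one fold is sketched below. The helper name `kfold_splits` and the seed are assumptions for illustration.

```python
import random

def kfold_splits(items, k=5, seed=0):
    """Yield (training, held_out) pairs so that every item is held out exactly once."""
    shuffled = list(items)                      # work on a copy
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # strided slices cover all items
    for i in range(k):
        held_out = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield training, held_out

# 23 items do not divide evenly by 5, yet nothing is dropped.
splits = list(kfold_splits(range(23), k=5))
print([len(held_out) for _, held_out in splits])  # [5, 5, 5, 4, 4]
```

Each `(training, held_out)` pair can then be passed through the feature extractor and a classifier in the same way as in the cell above.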


## Results and Interpretation

| Classifier           | Dev-Test Accuracy | Test Accuracy |
|----------------------|-------------------|---------------|
| Naive Bayes          | 82.4%             | 77.2%         |
| Decision Tree        | 77.0%             | 75.2%         |
| MaxEnt               | 84.6%             | 78.8%         |
| **Ensemble**         | 84.0%             | 78.0%         |

- **MaxEnt performed best** on both dev-test and test sets.
- **Ensemble** voting beat Naive Bayes and the Decision Tree on both sets, but did not beat MaxEnt alone.
- **Naive Bayes** revealed key linguistic patterns: names ending in `"a"` or `"na"` strongly predict female.
- **Cross-validated NB accuracy** was 79.8%, which falls between its dev-test (82.4%) and test (77.2%) scores and suggests that the gap between the two is largely sampling noise from 500-name splits.
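
To judge how much weight small differences deserve, it helps to estimate the uncertainty of an accuracy measured on 500 names. The sketch below uses the normal approximation to a binomial proportion, which is an assumption and only a rough guide, applied to the MaxEnt test accuracy of 0.788 printed in the cell output above. The helper name `accuracy_interval` is hypothetical.

```python
import math

def accuracy_interval(accuracy, n, z=1.96):
    """Approximate 95% interval for an accuracy measured on n independent examples."""
    se = math.sqrt(accuracy * (1 - accuracy) / n)   # standard error of a proportion
    return accuracy - z * se, accuracy + z * se

low, high = accuracy_interval(0.788, 500)
print(f"{low:.3f} to {high:.3f}")  # 0.752 to 0.824
```

The interval is about 3.6 points wide on each side, so gaps of one or two points between classifiers on a 500-name set should be read with caution.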

These results show that even simple models can make accurate gender predictions from first names using linguistically meaningful features.


## Conclusion

This project demonstrates that simple NLP and classification techniques can model patterns in human names. By combining hand-crafted features with well-known classifiers, we reached about 79% accuracy on the held-out test set (and about 85% on dev-test) in predicting gender from names.

The MaxEnt classifier was the most effective overall, and the ensemble came close without surpassing it. The Naive Bayes model offered insight into which features were most influential: suffixes such as `"na"`, `"ia"`, and `"us"` are highly gender-specific.

If expanded to larger or multilingual datasets, this approach could support applications in named entity recognition or identity prediction. Future work might involve deep learning or transformer models for more complex feature representations.
