<a href="https://colab.research.google.com/github/geedoubledee/data620_project3/blob/main/DATA620_Project3_GDavis_BDavidoff.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DATA620: Project 3
#### by Glen Davis and Brett Davidoff

In [None]:
# Import libraries
import nltk
from nltk.corpus import names
import numpy as np
import random
from sklearn.feature_extraction import DictVectorizer
from sklearn.inspection import permutation_importance
from sklearn.model_selection import ShuffleSplit
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
import textwrap as tw

We load the names dataset from NLTK and shuffle the entries with a fixed seed so that their order is random but reproducible.

In [None]:
# Load names dataset from NLTK
nltk.download('names', quiet=True)
male_names = [(name, 'male') for name in names.words('male.txt')]
female_names = [(name, 'female') for name in names.words('female.txt')]
all_names = male_names + female_names

# Shuffle the dataset to ensure it's randomly ordered
random.seed(4657)
random.shuffle(all_names)
ln = len(all_names)
print(f"\nThe first 10 out of {ln} total entries in the shuffled names dataset:\n")
wrapped = tw.fill(str(all_names[:10]))
print(wrapped)


The first 10 out of 7944 total entries in the shuffled names dataset:

[('Iseabal', 'female'), ('Andee', 'female'), ('Englebart', 'male'),
('Susi', 'female'), ('Row', 'female'), ('Delmar', 'male'), ('Faina',
'female'), ('Nero', 'male'), ('Dena', 'female'), ('Crista', 'female')]


We define a function to extract a small number of features from the entries in the names dataset: a) the last letter of the name; b) the first letter of the name; c) the count of total characters within the name; and d) the count of vowel characters within the name.

In [None]:
def feature_extraction(name):
    name = name.lower()
    features = {
        "last_letter": name[-1],
        "first_letter": name[0],
        "length": len(name),
        "num_vowels": sum(name.count(v) for v in "aeiou"),
    }
    return features

We apply the feature extraction function to the entries in the names dataset.

In [None]:
# Extract features
features = [feature_extraction(name) for name, gender in all_names]
print(f"\nExtracted features for the first 10 entries in the shuffled names dataset:\n")
wrapped = tw.fill(str(features[:10]))
print(wrapped)


Extracted features for the first 10 entries in the shuffled names dataset:

[{'last_letter': 'l', 'first_letter': 'i', 'length': 7, 'num_vowels':
4}, {'last_letter': 'e', 'first_letter': 'a', 'length': 5,
'num_vowels': 3}, {'last_letter': 't', 'first_letter': 'e', 'length':
9, 'num_vowels': 3}, {'last_letter': 'i', 'first_letter': 's',
'length': 4, 'num_vowels': 2}, {'last_letter': 'w', 'first_letter':
'r', 'length': 3, 'num_vowels': 1}, {'last_letter': 'r',
'first_letter': 'd', 'length': 6, 'num_vowels': 2}, {'last_letter':
'a', 'first_letter': 'f', 'length': 5, 'num_vowels': 3},
{'last_letter': 'o', 'first_letter': 'n', 'length': 4, 'num_vowels':
2}, {'last_letter': 'a', 'first_letter': 'd', 'length': 4,
'num_vowels': 2}, {'last_letter': 'a', 'first_letter': 'c', 'length':
6, 'num_vowels': 2}]
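
As a quick sanity check on the four features, the sketch below reapplies the same logic to two names. 'Iseabal' is the first entry printed above; 'Lynn' is an illustrative input we chose (not taken from the output) to show that 'y' is not counted as a vowel by this definition.

```python
def feature_extraction(name):
    """Same four features as the function defined earlier in this notebook."""
    name = name.lower()
    return {
        "last_letter": name[-1],
        "first_letter": name[0],
        "length": len(name),
        "num_vowels": sum(name.count(v) for v in "aeiou"),
    }

# 'Iseabal' is the first entry of the shuffled dataset shown above
iseabal = feature_extraction("Iseabal")

# 'Lynn' is an illustrative input: its only vowel-like letter is 'y',
# which is not in "aeiou", so num_vowels is 0
lynn = feature_extraction("Lynn")

print(iseabal)
print(lynn)
```

If 'y' should count as a vowel in some positions, the vowel string would need to change, and the downstream ratios would change with it.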


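We vectorize the features so that the categorical features are represented numerically, which the scikit-learn classifiers require. `DictVectorizer` one-hot encodes each string-valued feature and passes numeric features through unchanged.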

In [None]:
# Convert categorical features to numerical features
vectorizer = DictVectorizer()
features_vect = vectorizer.fit_transform(features).toarray()

We extract the response variable: the gender labels.

In [None]:
# Extract labels
labels = np.array([gender for name, gender in all_names])
print(f"\nLabels for the first 10 entries in the shuffled names dataset:\n")
wrapped = tw.fill(str(labels[:10]))
print(wrapped)


Labels for the first 10 entries in the shuffled names dataset:

['female' 'female' 'male' 'female' 'female' 'male' 'female' 'male'
'female' 'female']


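We split the vectorized features and the labels into train, validate, and test sets. Of the 7,944 entries, 500 are held out for testing and another 500 for validation, leaving 6,944 for training.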

In [None]:
# Split the dataset into train_validate and test sets
sss = ShuffleSplit(n_splits=1, test_size=500, random_state=42)
train_validate_index, test_index = next(sss.split(features_vect, labels))
x_train_validate, x_test = features_vect[train_validate_index], features_vect[test_index]
y_train_validate, y_test = labels[train_validate_index], labels[test_index]

# Perform another split on the train_validate set
# (the resulting indices are positions within the train_validate set)
sss = ShuffleSplit(n_splits=1, test_size=500, random_state=43)
train_index, validate_index = next(sss.split(x_train_validate, y_train_validate))
x_train, x_validate = x_train_validate[train_index], x_train_validate[validate_index]
y_train, y_validate = y_train_validate[train_index], y_train_validate[validate_index]
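
The two-stage split can be checked without the real features. The sketch below uses a placeholder array of the same length (an assumption for illustration only) and confirms the set sizes and that the three sets do not overlap. Note that the second split returns positions within the train_validate subset, so they must be mapped back through `train_validate_index` to recover original row numbers.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

n_names = 7944  # dataset size reported earlier
placeholder = np.zeros((n_names, 1))  # stand-in for features_vect; only its length matters

outer = ShuffleSplit(n_splits=1, test_size=500, random_state=42)
train_validate_index, test_index = next(outer.split(placeholder))

inner = ShuffleSplit(n_splits=1, test_size=500, random_state=43)
train_index, validate_index = next(inner.split(placeholder[train_validate_index]))

# Map positions within the train_validate subset back to original row numbers
train_rows = train_validate_index[train_index]
validate_rows = train_validate_index[validate_index]

sizes = (len(train_rows), len(validate_rows), len(test_index))
all_rows = np.concatenate([train_rows, validate_rows, test_index])
is_partition = len(np.unique(all_rows)) == n_names  # every row used exactly once

print(sizes, is_partition)
```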

We train a Decision Tree Classifier and a Naive Bayes Classifier, and we calculate the predictive accuracy for both models using the validate set.

In [None]:
# Train a Decision Tree Classifier
classifierDT = DecisionTreeClassifier()
classifierDT.fit(x_train, y_train)

# Train a Naive Bayes Classifier
classifierNB = GaussianNB()
classifierNB.fit(x_train, y_train)

# Evaluate DTC on the validation set
preds_val_DT = classifierDT.predict(x_validate)
acc_val_DT = accuracy_score(y_validate, preds_val_DT)
print(f"\nDecision Tree Classifier: Validation Set Predictive Accuracy: {acc_val_DT}")

# Evaluate NBC on the validation set
preds_val_NB = classifierNB.predict(x_validate)
acc_val_NB = accuracy_score(y_validate, preds_val_NB)
print(f"Naive Bayes Classifier: Validation Set Predictive Accuracy: {acc_val_NB}")


Decision Tree Classifier: Validation Set Predictive Accuracy: 0.776
Naive Bayes Classifier: Validation Set Predictive Accuracy: 0.764


Using a small number of straightforward text features derived from the names, the Decision Tree Classifier and the Naive Bayes Classifier both have relatively strong performance. However, the Decision Tree Classifier's predictive accuracy of 77.6% beats the Naive Bayes Classifier's 76.4% by 1.2 percentage points.
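
One way to put these scores in context is a majority-class baseline: the accuracy of always predicting the more common gender. The sketch below is only a rough reference point. It assumes class counts of 5,001 female and 2,943 male names for the NLTK corpus, which are not printed in this notebook (they do sum to the 7,944 entries reported above), and it uses the full dataset rather than the validation set.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most common label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Assumed class counts for the NLTK names corpus (not printed in this notebook);
# they sum to the 7944 entries reported above
assumed_labels = ["female"] * 5001 + ["male"] * 2943

baseline = majority_baseline_accuracy(assumed_labels)
print(f"Majority-class baseline accuracy: {baseline:.3f}")
```

Under that assumption the baseline is roughly 63%, so both classifiers add well over ten percentage points on top of simply guessing the majority class.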

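In an attempt to improve both models' predictive accuracy, we expand the feature extraction function so that it derives a wider variety of features from the names. The new features include: the ratio of vowel characters to total characters; the first and last two and three letters of the name; the count of consonant characters; the ratio of consonants to vowels; and counts of every character sequence (n-gram) of lengths two to four.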

In [None]:
# Expand the feature set with ratios, prefixes, suffixes, and n-gram counts
def feature_extraction2(name):
    name = name.lower()
    features = feature_extraction(name)
    features["vowel_to_length_ratio"] = features["num_vowels"] / features["length"]
    features["first_2_letters"] = name[:2]
    features["last_2_letters"] = name[-2:]
    features["first_3_letters"] = name[:3] if features["length"] > 2 else 0
    features["last_3_letters"] = name[-3:] if features["length"] > 2 else 0
    features["num_consonants"] = sum(name.count(c) for c in "bcdfghjklmnpqrstvwxyz")
    features["consonant_to_vowel_ratio"] = features["num_consonants"] / features["num_vowels"] if features["num_vowels"] > 0 else 0
    for n in range(2, 5): # Add all 2- to 4-letter ngrams
        for i in range(len(name) - n + 1):
            ngram = name[i:i+n]
            features[f"{n}gram_{ngram}"] = features.get(f"{n}gram_{ngram}", 0) + 1
    return features
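
The n-gram loop is the part of the function that generates most of the new columns, so it is worth seeing in isolation. The sketch below reimplements just that loop as a standalone helper (`char_ngram_counts` is a hypothetical name, not used elsewhere in this notebook) and applies it to two illustrative inputs.

```python
from collections import Counter

def char_ngram_counts(name, n_min=2, n_max=4):
    """Count every contiguous character sequence of length n_min..n_max."""
    name = name.lower()
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # range() is empty when the name is shorter than n, so short names
        # simply contribute no n-grams of that length
        for i in range(len(name) - n + 1):
            counts[f"{n}gram_{name[i:i + n]}"] += 1
    return dict(counts)

mimi = char_ngram_counts("Mimi")  # "mi" occurs twice, so its count is 2
al = char_ngram_counts("Al")      # too short for any 3- or 4-grams

print(mimi)
print(al)
```

Because every distinct n-gram becomes its own column after vectorizing, the feature matrix grows from a few dozen columns to a very large, mostly zero matrix.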

We apply the newly expanded function to the names dataset and update the train, validate, and test sets to include the new features. Importantly, we use the same indices we generated for the original splits so that the observations remain in the same order and groups. Then we refit the classifiers and calculate their new predictive accuracy scores.

In [None]:
# Extract and vectorize new features, then split the data again using the same indices as earlier
features = [feature_extraction2(name) for name, gender in all_names]
features_vect = vectorizer.fit_transform(features).toarray()
x_train_validate, x_test = features_vect[train_validate_index], features_vect[test_index]
y_train_validate, y_test = labels[train_validate_index], labels[test_index]
x_train, x_validate = x_train_validate[train_index], x_train_validate[validate_index]
y_train, y_validate = y_train_validate[train_index], y_train_validate[validate_index]

# Refit the models
classifierDT.fit(x_train, y_train)
classifierNB.fit(x_train, y_train)

# Evaluate new DTC on the validation set
preds_val_DT = classifierDT.predict(x_validate)
acc_val_DT = accuracy_score(y_validate, preds_val_DT)
print(f"\nDecision Tree Classifier: Validation Set Predictive Accuracy: {acc_val_DT}")

# Evaluate new NBC on the validation set
preds_val_NB = classifierNB.predict(x_validate)
acc_val_NB = accuracy_score(y_validate, preds_val_NB)
print(f"Naive Bayes Classifier: Validation Set Predictive Accuracy: {acc_val_NB}")


Decision Tree Classifier: Validation Set Predictive Accuracy: 0.78
Naive Bayes Classifier: Validation Set Predictive Accuracy: 0.846


The Decision Tree Classifier's performance only improved by 0.4 percentage points (77.6% to 78.0%), but the Naive Bayes Classifier's performance improved by 8.2 percentage points (76.4% to 84.6%), and it now beats the Decision Tree Classifier by 6.6 percentage points.

We take a look at the 20 most important features in the Decision Tree Classifier. Estimating feature importance for a Naive Bayes Classifier requires permutation importance, which is unfortunately too computationally costly with this many features, so we skip the calculation for that model.

In [None]:
# Extract feature importance estimates from the Decision Tree Classifier
feature_imp_DT = classifierDT.feature_importances_

# Get feature names from the DictVectorizer
feature_names = vectorizer.get_feature_names_out()

# Combine names and importances
feature_imp_DT = zip(feature_names, feature_imp_DT)
feature_imp_DT = sorted(feature_imp_DT, key=lambda x: x[1], reverse=True)

# Print the 20 most important features
print("\nTop 20 features in the Decision Tree Classifier:\n")
for feature, importance in feature_imp_DT[:20]:
    print(f"{feature}: {importance}")



Top 20 features in the Decision Tree Classifier:

last_letter=a: 0.1712474982701188
last_letter=e: 0.08568029374999002
last_letter=i: 0.04638251628264688
last_letter=y: 0.02285590738568853
2gram_ly: 0.02147401260829455
length: 0.01531039973630272
last_3_letters=een: 0.011641215012002139
2gram_nn: 0.01026805868668367
num_consonants: 0.009945397238504945
2gram_is: 0.009919731958735679
last_2_letters=ah: 0.008816397507745918
last_letter=l: 0.008514944404524574
consonant_to_vowel_ratio: 0.0083677953425022
2gram_el: 0.007128011315891874
2gram_ne: 0.006790810682062194
num_vowels: 0.006680710229687934
first_letter=c: 0.006221632673721103
2gram_be: 0.006205142364663398
last_2_letters=yn: 0.0060674935890985306
vowel_to_length_ratio: 0.005874865421348605
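
For reference, this is roughly what the skipped permutation calculation looks like on a problem small enough to run instantly. The data here are synthetic and hypothetical (one informative column and one noise column), so the sketch only illustrates the mechanics and the cost, not anything about the names dataset.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Hypothetical toy data: column 0 tracks the label closely, column 1 is pure noise
y_toy = np.array([0, 1] * 100)
x_toy = np.column_stack([
    y_toy + rng.normal(scale=0.1, size=y_toy.size),
    rng.normal(size=y_toy.size),
])

toy_nb = GaussianNB().fit(x_toy, y_toy)

# Each feature is shuffled n_repeats times and the model is re-scored each time,
# so the cost grows with (number of features) x n_repeats
result = permutation_importance(toy_nb, x_toy, y_toy, n_repeats=5, random_state=0)
print(result.importances_mean)
```

With two columns this takes ten scoring passes. With one column per distinct n-gram, prefix, and suffix, the same procedure needs a scoring pass for every column and every repeat, which is why we did not run it on the full model.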


Finally, we calculate the predictive accuracy for both models on the test set.

In [None]:
# Evaluate new DTC on the test set
preds_test_DT = classifierDT.predict(x_test)
acc_test_DT = accuracy_score(y_test, preds_test_DT)
print(f"\nDecision Tree Classifier: Test Set Predictive Accuracy: {acc_test_DT}")

# Evaluate new NBC on the test set
preds_test_NB = classifierNB.predict(x_test)
acc_test_NB = accuracy_score(y_test, preds_test_NB)
print(f"Naive Bayes Classifier: Test Set Predictive Accuracy: {acc_test_NB}")


Decision Tree Classifier: Test Set Predictive Accuracy: 0.794
Naive Bayes Classifier: Test Set Predictive Accuracy: 0.806
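
Because these scores come from only 500 names, each one carries some sampling uncertainty. The sketch below applies a simple normal-approximation interval to the two accuracies printed above. It is a rough guide rather than a formal comparison, since both models were scored on the same names and the intervals are therefore not independent; `accuracy_interval` is a hypothetical helper.

```python
import math

def accuracy_interval(accuracy, n, z=1.96):
    """Approximate 95% interval for an accuracy measured on n examples."""
    half_width = z * math.sqrt(accuracy * (1 - accuracy) / n)
    return accuracy - half_width, accuracy + half_width

n_test = 500  # test_size used in the split above
intervals = {}
for model, acc in [("Decision Tree", 0.794), ("Naive Bayes", 0.806)]:
    low, high = accuracy_interval(acc, n_test)
    intervals[model] = (low, high)
    print(f"{model}: {acc:.3f} (approx. 95% interval {low:.3f} to {high:.3f})")
```

Each interval spans roughly 3.5 percentage points on either side, which is wider than the 1.2-point gap between the two models on this set.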
