
# Project 3: Names Corpus Classifiers

By: Al Haque, Taha Ahmad


---
## Goal

This project's goal is to:
-  Using any of the three classifiers described in chapter 6 of Natural Language Processing with Python, and any features you can think of, build the best name gender classifier you can.
  -  Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the dev-test set, and the remaining 6900 words for the training set.
  -  Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress.
  -  Once you are satisfied with your classifier, check its final performance on the test set.
  -  Describe how the performance on the test set compares to the performance on the dev-test set and if the divergence is expected.


---
## Package Installation

Any packages needed for the classifiers can be installed in the code block below. Our initial assumption is that we only need nltk and possibly pandas.

In [None]:
!pip install nltk
!pip install pandas


---
## Dataset Loading

[Our data](http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/nlp/corpora/names/0.html) is a set of names collected by Mark Kantrowitz and Bill Ross in 1994, with the names separated into files by gender. There are 7944 observations in total: 5001 female names and 2943 male names, sorted alphabetically.

The nltk package allows for directly downloading and accessing this dataset, which we do below. Note that the dataset loading process is largely identical to the sample code provided in the [nltk book](https://www.nltk.org/book/ch06.html) by Steven Bird, Ewan Klein, and Edward Loper.

In [59]:
import nltk
nltk.download('names')

[nltk_data] Downloading package names to /root/nltk_data...
[nltk_data]   Package names is already up-to-date!


True

After downloading the dataset, which consists of two separate files for male and female names, we create a combined list of tuples that pairs each name with its gender. Since these are initially sorted alphabetically, we use the random package's ability to shuffle lists in place so that we can split the data randomly. Note that a seed is set here to encourage reproducibility.

In [60]:
from nltk.corpus import names
import random

labeled_names = ([(name, 'male') for name in names.words('male.txt')] + [(name, 'female') for name in names.words('female.txt')])

random.seed(1337)
random.shuffle(labeled_names)

We split the `labeled_names` list into a training set with 6944 observations, a dev-test set with 500 observations, and a test set with 500 observations. The training set will be used for training the models, the dev-test set will be used for initially testing the trained models while further developing them, and the test set will be used for the final performance test.



In [61]:
train_names = labeled_names[1000:]
devtest_names = labeled_names[500:1000]
test_names = labeled_names[:500]
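
As a sanity check on the slicing above, here is a minimal sketch of the same three-way split wrapped in a helper. The function name `three_way_split` and the placeholder data are our own assumptions for illustration; only the sizes (7944 names, 500 test, 500 dev-test) come from the notebook.

```python
import random

def three_way_split(items, n_test=500, n_devtest=500, seed=1337):
    """Shuffle a copy of items and cut it into train, dev-test and test lists."""
    shuffled = list(items)                 # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # local RNG, global random state is left alone
    test = shuffled[:n_test]
    devtest = shuffled[n_test:n_test + n_devtest]
    train = shuffled[n_test + n_devtest:]
    return train, devtest, test

# Placeholder data with the same size and class counts as the Names Corpus.
toy_names = [(f"name{i}", "female" if i < 5001 else "male") for i in range(7944)]
train, devtest, test = three_way_split(toy_names)
print(len(train), len(devtest), len(test))
```

Because the three slices never overlap, no name used for evaluation can leak into training.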

---
## Model Building

With our dataset split we can now focus on building a model to classify gender from names. We will focus on three different types of classifiers and optimizing these three classifiers to get the best predictive performance out of them.

---
### Classifier 1: Naive Bayes

#### Book Base

We begin the base of our naive Bayes classifier by looking at how the NLTK book tackles it. We have three different ways to extract features from our dataset:

1.   A very simple approach that uses the last letter of the name as the singular feature.
2.   A complex approach that predicts based on the first letter of the name, the last letter of the name, and two features for every letter of the alphabet: whether the letter is present in the name and how many times it occurs.
3.   A simple approach that uses the last two letters of the name as features.



In [62]:
def gender_features1(word):
  return {'last_letter': word[-1].lower()}

def gender_features2(name):
    features = {}
    features["first_letter"] = name[0].lower()
    features["last_letter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count({})".format(letter)] = name.lower().count(letter)
        features["has({})".format(letter)] = (letter in name.lower())
    return features

def gender_features3(word):
  return {'suffix1': word[-1:].lower(),
          'suffix2': word[-2:].lower()}

# Superset of every feature defined above; used to build the evaluation sets.
def gender_features(name):
    features = {}
    features["first_letter"] = name[0].lower()
    features["last_letter"] = name[-1].lower()
    features["suffix1"] = name[-1:].lower()
    features["suffix2"] = name[-2:].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count({})".format(letter)] = name.lower().count(letter)
        features["has({})".format(letter)] = (letter in name.lower())
    return features

If we check the accuracy of a naive Bayes model trained on each of these feature sets, we see that the more features we have, the greater the dev-test accuracy. However, the book shows that the complex feature set is prone to overfitting the training set, since with so many features the model can latch onto idiosyncrasies that do not generalize. Thus, our starting point for improvement is to add and modify features on the suffix-based classifier.

In [63]:
train_set1 = [(gender_features1(n), gender) for (n, gender) in train_names]
classifier1 = nltk.NaiveBayesClassifier.train(train_set1)

train_set2 = [(gender_features2(n), gender) for (n, gender) in train_names]
classifier2 = nltk.NaiveBayesClassifier.train(train_set2)

train_set3 = [(gender_features3(n), gender) for (n, gender) in train_names]
classifier3 = nltk.NaiveBayesClassifier.train(train_set3)

devtest_set = [(gender_features(n), gender) for (n, gender) in devtest_names]

print(f"""
The gender prediction accuracy for our last letter based classifier is {nltk.classify.accuracy(classifier1, devtest_set)*100}%
The gender prediction accuracy for our large featureset classifier is {nltk.classify.accuracy(classifier2, devtest_set)*100}%
The gender prediction accuracy for our suffix based classifier is {nltk.classify.accuracy(classifier3, devtest_set)*100}%
""")


The gender prediction accuracy for our last letter based classifier is 77.2%
The gender prediction accuracy for our large featureset classifier is 80.0%
The gender prediction accuracy for our suffix based classifier is 78.2%



#### Building Upon The Base

Here we begin improving our existing naive Bayes model. We can gain some insight into which features matter most by using the `show_most_informative_features` function. Perhaps unsurprisingly, the last letter is the most informative feature for our first two classifiers, while the two-letter suffixes in the third classifier are far more informative, with much higher likelihood ratios.

In [64]:
classifier1.show_most_informative_features(5)
print("")
classifier2.show_most_informative_features(5)
print("")
classifier3.show_most_informative_features(5)

Most Informative Features
             last_letter = 'k'              male : female =     39.4 : 1.0
             last_letter = 'a'            female : male   =     34.7 : 1.0
             last_letter = 'f'              male : female =     12.5 : 1.0
             last_letter = 'p'              male : female =     11.8 : 1.0
             last_letter = 'v'              male : female =      9.8 : 1.0

Most Informative Features
             last_letter = 'k'              male : female =     39.4 : 1.0
             last_letter = 'a'            female : male   =     34.7 : 1.0
             last_letter = 'f'              male : female =     12.5 : 1.0
             last_letter = 'p'              male : female =     11.8 : 1.0
             last_letter = 'v'              male : female =      9.8 : 1.0

Most Informative Features
                 suffix2 = 'na'           female : male   =    157.3 : 1.0
                 suffix2 = 'la'           female : male   =     71.9 : 1.0
                 suf
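
The ratios printed above compare how likely a feature value is under one label versus the other. The following is a minimal sketch of that calculation, assuming add-0.5 smoothing (which we believe matches NLTK's default estimator for naive Bayes, though we have not verified every detail). The helper name `likelihood_ratio` and the toy counts are hypothetical and are not taken from the Names Corpus.

```python
from collections import Counter

def likelihood_ratio(samples, value, bins):
    """Return (likelier_label, other_label, ratio) for one feature value.

    samples: list of (feature_value, label) pairs for a single feature.
    bins: number of distinct values the feature can take (used for smoothing).
    """
    label_totals = Counter(label for _, label in samples)
    value_counts = Counter(label for v, label in samples if v == value)
    # Add-0.5 smoothing so an unseen (value, label) pair never gets probability 0.
    probs = {
        label: (value_counts[label] + 0.5) / (total + 0.5 * bins)
        for label, total in label_totals.items()
    }
    high = max(probs, key=probs.get)
    low = min(probs, key=probs.get)
    return high, low, probs[high] / probs[low]

# Hypothetical last-letter counts, invented purely for illustration.
toy = ([("a", "female")] * 30 + [("a", "male")] * 1 +
       [("k", "male")] * 10 +
       [("e", "female")] * 20 + [("e", "male")] * 9)
print(likelihood_ratio(toy, "a", bins=3))
print(likelihood_ratio(toy, "k", bins=3))
```

A value that is common for one label and almost absent for the other produces a large ratio, which is why rare but one-sided suffixes rank so highly.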

Our first modification will be adding the length of the name to determine if it is a useful indicator for classification.

In [65]:
def gender_features4(word):
  return {'suffix1': word[-1:].lower(),
          'suffix2': word[-2:].lower(),
          'length': len(word)}

def gender_features(name):
    features = {}
    features["first_letter"] = name[0].lower()
    features["last_letter"] = name[-1].lower()
    features["suffix1"] = name[-1:].lower()
    features["suffix2"] = name[-2:].lower()
    features["length"] = len(name)
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count({})".format(letter)] = name.lower().count(letter)
        features["has({})".format(letter)] = (letter in name.lower())
    return features

train_set4 = [(gender_features4(n), gender) for (n, gender) in train_names]
classifier4 = nltk.NaiveBayesClassifier.train(train_set4)

devtest_set = [(gender_features(n), gender) for (n, gender) in devtest_names]

print(f"""
The gender prediction accuracy for our suffix and length classifier is {nltk.classify.accuracy(classifier4, devtest_set)*100}%
The gender prediction accuracy for our large featureset classifier is {nltk.classify.accuracy(classifier2, devtest_set)*100}%
The gender prediction accuracy for our suffix based classifier is {nltk.classify.accuracy(classifier3, devtest_set)*100}%
""")


The gender prediction accuracy for our suffix and length classifier is 77.8%
The gender prediction accuracy for our large featureset classifier is 80.0%
The gender prediction accuracy for our suffix based classifier is 78.2%



Perhaps unsurprisingly, the length of a name does not matter much for determining its gender. What is more interesting is that accuracy ends up slightly worse than with our purely suffix-based classifier (77.8% versus 78.2%), even though adding features generally leads to higher accuracy. What is likely happening is that, because a naive Bayes model lets each feature contribute to the classification independently, the weak length feature occasionally works against our suffix features.

Checking this, we can also see that length does not appear anywhere in the top 10 most informative features, and in fact only shows up in the 100s when we look further down the list.

In [66]:
classifier4.show_most_informative_features(10)
# Ran separately: classifier4.show_most_informative_features(200)

Most Informative Features
                 suffix2 = 'na'           female : male   =    157.3 : 1.0
                 suffix2 = 'la'           female : male   =     71.9 : 1.0
                 suffix1 = 'k'              male : female =     39.4 : 1.0
                 suffix2 = 'ia'           female : male   =     36.9 : 1.0
                 suffix2 = 'us'             male : female =     35.4 : 1.0
                 suffix1 = 'a'            female : male   =     34.7 : 1.0
                 suffix2 = 'sa'           female : male   =     34.0 : 1.0
                 suffix2 = 'ra'           female : male   =     34.0 : 1.0
                 suffix2 = 'rt'             male : female =     32.1 : 1.0
                 suffix2 = 'ch'             male : female =     24.8 : 1.0
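
To make the independence argument about the length feature concrete, here is a minimal sketch of how a naive Bayes model combines features. All probabilities below are hypothetical numbers chosen for illustration, and `naive_bayes_posterior` is our own helper, not part of NLTK.

```python
import math

def naive_bayes_posterior(priors, likelihoods, features):
    """Combine per-feature likelihoods independently, as naive Bayes assumes.

    priors: {label: P(label)}
    likelihoods: {label: {feature_name: {value: P(value | label)}}}
    features: {feature_name: value} for the name being classified
    """
    log_scores = {}
    for label, prior in priors.items():
        score = math.log(prior)
        for fname, value in features.items():
            # Each feature adds its own log-likelihood, regardless of the others.
            score += math.log(likelihoods[label][fname][value])
        log_scores[label] = score
    # Normalise so the posteriors sum to 1.
    top = max(log_scores.values())
    exp_scores = {label: math.exp(s - top) for label, s in log_scores.items()}
    total = sum(exp_scores.values())
    return {label: s / total for label, s in exp_scores.items()}

# Hypothetical probabilities: the suffix leans male, the length leans female.
priors = {"female": 0.63, "male": 0.37}
likelihoods = {
    "female": {"suffix2": {"an": 0.02}, "length": {4: 0.20}},
    "male": {"suffix2": {"an": 0.04}, "length": {4: 0.10}},
}
suffix_only = naive_bayes_posterior(priors, likelihoods, {"suffix2": "an"})
with_length = naive_bayes_posterior(priors, likelihoods, {"suffix2": "an", "length": 4})
print(suffix_only)
print(with_length)
```

In this toy setup the suffix alone favors male, but multiplying in the length likelihood flips the prediction to female, which is the kind of interference the paragraph above speculates about.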


Upon researching further what makes female and male names different, it seems that the beginning of a name is another indicator of gender. A softer prefix indicates that the name is more likely to be female, while a harder prefix indicates that the name is more likely to be male. To add this to our classifier, we can add prefix features to our suffix classifier.

In [67]:
def gender_features5(word):
  return {'suffix1': word[-1:].lower(),
          'suffix2': word[-2:].lower(),
          'prefix1': word[0].lower(),
          'prefix2': word[0:2].lower()}

def gender_features(name):
    features = {}
    features["first_letter"] = name[0].lower()
    features["last_letter"] = name[-1].lower()
    features["suffix1"] = name[-1:].lower()
    features["suffix2"] = name[-2:].lower()
    features["prefix1"] = name[0].lower()
    features["prefix2"] = name[0:2].lower()
    features["length"] = len(name)
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count({})".format(letter)] = name.lower().count(letter)
        features["has({})".format(letter)] = (letter in name.lower())
    return features

train_set5 = [(gender_features5(n), gender) for (n, gender) in train_names]
classifier5 = nltk.NaiveBayesClassifier.train(train_set5)

devtest_set = [(gender_features(n), gender) for (n, gender) in devtest_names]

print(f"""
The gender prediction accuracy for our suffix and prefix classifier is {nltk.classify.accuracy(classifier5, devtest_set)*100}%
The gender prediction accuracy for our suffix and length classifier is {nltk.classify.accuracy(classifier4, devtest_set)*100}%
The gender prediction accuracy for our large featureset classifier is {nltk.classify.accuracy(classifier2, devtest_set)*100}%
The gender prediction accuracy for our suffix based classifier is {nltk.classify.accuracy(classifier3, devtest_set)*100}%
""")


The gender prediction accuracy for our suffix and prefix classifier is 80.0%
The gender prediction accuracy for our suffix and length classifier is 77.8%
The gender prediction accuracy for our large featureset classifier is 80.0%
The gender prediction accuracy for our suffix based classifier is 78.2%



Adding prefixes raises classifier accuracy by 1.8 percentage points (78.2% to 80.0%), which matches the large feature set classifier with far fewer features. Provided this model isn't overfitting, it might be the final model we want to use. Of course, this all depends on testing against the actual test set of names.

Still, the most informative features of our naive Bayes classifier, shown below, are all suffixes, which suggests that suffixes matter more than prefixes for determining the gender of a name:

In [68]:
classifier5.show_most_informative_features(10)
# Ran separately: classifier5.show_most_informative_features(200)

Most Informative Features
                 suffix2 = 'na'           female : male   =    157.3 : 1.0
                 suffix2 = 'la'           female : male   =     71.9 : 1.0
                 suffix1 = 'k'              male : female =     39.4 : 1.0
                 suffix2 = 'ia'           female : male   =     36.9 : 1.0
                 suffix2 = 'us'             male : female =     35.4 : 1.0
                 suffix1 = 'a'            female : male   =     34.7 : 1.0
                 suffix2 = 'sa'           female : male   =     34.0 : 1.0
                 suffix2 = 'ra'           female : male   =     34.0 : 1.0
                 suffix2 = 'rt'             male : female =     32.1 : 1.0
                 suffix2 = 'ch'             male : female =     24.8 : 1.0


A final modification to this classifier we want to make is reducing features to just the two letter suffix and prefix to attempt to reduce any potential overfitting.

In [107]:
def gender_features6(word):
  return {'suffix2': word[-2:].lower(),
          'prefix2': word[0:2].lower()}

train_set6 = [(gender_features6(n), gender) for (n, gender) in train_names]

devtest_set = [(gender_features(n), gender) for (n, gender) in devtest_names]

classifier6 = nltk.NaiveBayesClassifier.train(train_set6)

print(f"""
The gender prediction accuracy for our cut down suffix and prefix classifier is {nltk.classify.accuracy(classifier6, devtest_set)*100}%
The gender prediction accuracy for our suffix and prefix classifier is {nltk.classify.accuracy(classifier5, devtest_set)*100}%
The gender prediction accuracy for our suffix and length classifier is {nltk.classify.accuracy(classifier4, devtest_set)*100}%
The gender prediction accuracy for our large featureset classifier is {nltk.classify.accuracy(classifier2, devtest_set)*100}%
The gender prediction accuracy for our suffix based classifier is {nltk.classify.accuracy(classifier3, devtest_set)*100}%
""")


The gender prediction accuracy for our cut down suffix and prefix classifier is 82.19999999999999%
The gender prediction accuracy for our suffix and prefix classifier is 80.0%
The gender prediction accuracy for our suffix and length classifier is 77.8%
The gender prediction accuracy for our large featureset classifier is 80.0%
The gender prediction accuracy for our suffix based classifier is 78.2%



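We are able to increase our accuracy, and hopefully reduce overfitting, by removing the influence that single-character suffixes and prefixes had on our predictions. The one-letter features are fully determined by the two-letter ones, so under naive Bayes they were likely double counting the same evidence.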
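
The long decimals in some of the printed accuracies above (for example 82.19999999999999%) are floating-point artifacts of multiplying by 100. A small formatting helper, sketched here under the assumption that one decimal place is enough, keeps the reports readable. The name `format_accuracy` is our own.

```python
def format_accuracy(accuracy):
    """Render a 0-1 accuracy as a percentage with one decimal place."""
    return f"{accuracy * 100:.1f}%"

print(format_accuracy(0.822))  # 82.2%
print(format_accuracy(0.806))  # 80.6%
```

The same format spec could be used directly inside the f-strings of the print statements.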

---
### Classifier 2: Maximum Entropy Model

The maximum entropy model uses maximum likelihood estimation to fit the weights of all features jointly, so that each feature contributes to the classification in combination with the others, whereas a naive Bayes model has each feature contribute independently. Let's see how well a few of our feature sets fare with this model. Ideally, larger feature sets become more accurate within a maximum entropy model because the features harmonize rather than tug in individual directions.


In [70]:
devtest_set = [(gender_features(n), gender) for (n, gender) in devtest_names]

maxent_classifier6 = nltk.MaxentClassifier.train(train_set6, trace=0)
maxent_classifier3 = nltk.MaxentClassifier.train(train_set3, trace=0)
maxent_classifier5 = nltk.MaxentClassifier.train(train_set5, trace=0)
maxent_classifier2 = nltk.MaxentClassifier.train(train_set2, trace=0)


print(f"""
The gender prediction accuracy for our cut down suffix and prefix classifier is {nltk.classify.accuracy(maxent_classifier6, devtest_set)*100}%
The gender prediction accuracy for our suffix and prefix classifier is {nltk.classify.accuracy(maxent_classifier5, devtest_set)*100}%
The gender prediction accuracy for our large featureset classifier is {nltk.classify.accuracy(maxent_classifier2, devtest_set)*100}%
The gender prediction accuracy for our suffix based classifier is {nltk.classify.accuracy(maxent_classifier3, devtest_set)*100}%
""")


The gender prediction accuracy for our cut down suffix and prefix classifier is 81.39999999999999%
The gender prediction accuracy for our suffix and prefix classifier is 81.6%
The gender prediction accuracy for our large featureset classifier is 82.19999999999999%
The gender prediction accuracy for our suffix based classifier is 77.0%



We can see that a maximum entropy model makes better use of larger feature sets than a naive Bayes model: our large feature set and our suffix and prefix feature set both increase in accuracy, while our smaller feature sets decrease in accuracy.

Surprisingly, our naive Bayes cut-down suffix and prefix model is simpler, yet it ties the large feature set maximum entropy model at 82.2% and beats every other classifier so far. So, we keep it as a candidate for our final model.
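
For intuition on how a maximum entropy model turns features into a prediction, here is a minimal sketch of the log-linear scoring step. The weights are hypothetical values picked by hand; in a real model they are fitted jointly during training, which is what lets correlated features share credit instead of each counting in full. `maxent_probabilities` is our own helper, not NLTK's API.

```python
import math

def maxent_probabilities(weights, features, labels):
    """P(label | features) under a log-linear (maximum entropy) model.

    weights: {(feature_name, value, label): weight}; missing entries count as 0.
    """
    scores = {
        label: sum(weights.get((fname, value, label), 0.0)
                   for fname, value in features.items())
        for label in labels
    }
    # Softmax over the summed weights, shifted by the max for numerical stability.
    top = max(scores.values())
    exp_scores = {label: math.exp(s - top) for label, s in scores.items()}
    normaliser = sum(exp_scores.values())
    return {label: s / normaliser for label, s in exp_scores.items()}

# Hypothetical weights, not learned from the Names Corpus.
weights = {
    ("suffix2", "na", "female"): 1.5,
    ("suffix2", "na", "male"): -1.5,
    ("prefix2", "jo", "male"): 0.4,
}
probs = maxent_probabilities(weights, {"suffix2": "na", "prefix2": "jo"}, ["female", "male"])
print(probs)
```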

---
### Classifier 3: Decision Tree

Finally, we have the decision tree model, which repeatedly picks a feature to split on, branching on its values, and then splits each branch again on a different (or even the same) feature. Decision trees do not work very well with feature sets that have many unique values per feature. Thus, our decision tree models will likely perform worse than either of our previous models.

In [73]:
devtest_set = [(gender_features(n), gender) for (n, gender) in devtest_names]

dt_classifier6 = nltk.DecisionTreeClassifier.train(train_set6)
dt_classifier3 = nltk.DecisionTreeClassifier.train(train_set3)
dt_classifier5 = nltk.DecisionTreeClassifier.train(train_set5)
dt_classifier2 = nltk.DecisionTreeClassifier.train(train_set2)


print(f"""
The gender prediction accuracy for our cut down suffix and prefix classifier is {nltk.classify.accuracy(dt_classifier6, devtest_set)*100}%
The gender prediction accuracy for our suffix and prefix classifier is {nltk.classify.accuracy(dt_classifier5, devtest_set)*100}%
The gender prediction accuracy for our large featureset classifier is {nltk.classify.accuracy(dt_classifier2, devtest_set)*100}%
The gender prediction accuracy for our suffix based classifier is {nltk.classify.accuracy(dt_classifier3, devtest_set)*100}%
""")


The gender prediction accuracy for our cut down suffix and prefix classifier is 73.6%
The gender prediction accuracy for our suffix and prefix classifier is 73.6%
The gender prediction accuracy for our large featureset classifier is 80.0%
The gender prediction accuracy for our suffix based classifier is 76.6%



As expected, our decision tree models were a downgrade in accuracy for almost every feature set. Only the large feature set classifier, which contains many binary features, retains the 80.0% accuracy of its naive Bayes counterpart. Even then, the maximum entropy version of that feature set would be the better choice.

In this case our data is not well suited to the model, and we are left with the naive Bayes cut-down suffix and prefix classifier as the leading candidate for the final model.
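
The weakness of decision trees on many-valued features can be sketched with a one-level tree (a decision stump). The code below is a simplified illustration that picks the feature with the lowest training error, which we understand to be roughly how NLTK chooses its splits, though we have not checked its internals in detail. The toy names and both helper functions are hypothetical.

```python
from collections import Counter, defaultdict

def stump_error(labeled_featuresets, feature_name):
    """Training error of a one-level tree that branches on every value of one feature."""
    by_value = defaultdict(Counter)
    for features, label in labeled_featuresets:
        by_value[features.get(feature_name)][label] += 1
    # Each branch predicts its majority label; everything else is an error.
    errors = sum(sum(c.values()) - max(c.values()) for c in by_value.values())
    return errors / len(labeled_featuresets)

def best_stump_feature(labeled_featuresets):
    """Pick the feature whose stump has the lowest training error."""
    feature_names = {f for features, _ in labeled_featuresets for f in features}
    return min(sorted(feature_names), key=lambda f: stump_error(labeled_featuresets, f))

# Toy data: every prefix2 value is unique, while suffix2 values repeat.
toy = [
    ({"suffix2": "na", "prefix2": "an"}, "female"),
    ({"suffix2": "la", "prefix2": "el"}, "female"),
    ({"suffix2": "rt", "prefix2": "al"}, "male"),
    ({"suffix2": "ck", "prefix2": "ja"}, "male"),
    ({"suffix2": "na", "prefix2": "jo"}, "female"),
    ({"suffix2": "rt", "prefix2": "ro"}, "male"),
    ({"suffix2": "an", "prefix2": "je"}, "female"),
    ({"suffix2": "an", "prefix2": "st"}, "male"),
]
print(stump_error(toy, "suffix2"), stump_error(toy, "prefix2"))
print(best_stump_feature(toy))
```

Here the stump prefers `prefix2` because a feature with a unique value per name memorizes the training data perfectly, even though it says nothing about unseen prefixes.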

---
## Conclusion

Here we test the final versions of each classifier against the test set to see which model is possibly the best for predicting our data.

In [108]:
test_set = [(gender_features(n), gender) for (n, gender) in test_names]

print(f"""
The gender prediction accuracy for our cut down naive-bayes suffix and prefix classifier is {nltk.classify.accuracy(classifier6, test_set)*100}%
The gender prediction accuracy for our maximum entropy large featureset classifier is {nltk.classify.accuracy(maxent_classifier2, test_set)*100}%
The gender prediction accuracy for our decision tree large featureset classifier is {nltk.classify.accuracy(dt_classifier2, test_set)*100}%
""")


The gender prediction accuracy for our cut down naive-bayes suffix and prefix classifier is 80.60000000000001%
The gender prediction accuracy for our maximum entropy large featureset classifier is 80.0%
The gender prediction accuracy for our decision tree large featureset classifier is 78.60000000000001%



With the best of each type of model evaluated against the test set, we see that on pure accuracy our cut-down naive Bayes suffix and prefix classifier (80.6%) narrowly beats the maximum entropy large feature set model (80.0%), with the decision tree classifier trailing at 78.6%. Each model loses a little accuracy compared with the dev-test set (82.2% to 80.6% for the naive Bayes model), but that is to be expected, since we tuned our feature sets against the dev-test set, and the drop is not a concerning amount.
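
One rough way to judge whether the drop from dev-test to test is concerning is to compare it with the sampling noise of an accuracy measured on 500 names. The sketch below assumes each prediction is an independent Bernoulli trial, which is a simplification; the helper name is our own.

```python
import math

def accuracy_standard_error(accuracy, n):
    """Binomial standard error of an accuracy measured on n independent examples."""
    return math.sqrt(accuracy * (1 - accuracy) / n)

# Naive Bayes cut-down classifier: dev-test and test accuracies reported above.
devtest_acc, test_acc, n = 0.822, 0.806, 500
se = accuracy_standard_error(test_acc, n)
gap = devtest_acc - test_acc
print(f"standard error: {se:.4f}, gap: {gap:.3f}")
```

Under this assumption the standard error is roughly 1.8 percentage points, so a gap of 1.6 points is within what chance alone could produce on sets of this size.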

Naive Bayes models are at their best with distinct, non-overlapping features that each provide important information for classification. That ends up suiting names well: they are typically short, so there is little extra data to extract that would make a maximum entropy model stand out. For a decision tree model to work here, we would need name features that take only a few possible values while still carrying information about gender.

In [138]:
from nltk.metrics import ConfusionMatrix

test_class = classifier6.classify_many([x for x,y in test_set])
test_true = [y for x,y in test_set]

cm = ConfusionMatrix(test_true,test_class)

print(cm)
print(cm.evaluate())

       |   f     |
       |   e     |
       |   m   m |
       |   a   a |
       |   l   l |
       |   e   e |
-------+---------+
female |<288> 29 |
  male |  68<115>|
-------+---------+
(row = reference; col = test)

   Tag | Prec.  | Recall | F-measure
-------+--------+--------+-----------
female | 0.8090 | 0.9085 | 0.8559
  male | 0.7986 | 0.6284 | 0.7034



Looking at the confusion matrix, we see that our final naive Bayes model classifies female names more accurately than male names: it recovers about 91% of the female names but only about 63% of the male names. This is useful to know. Since our data has more female names than male names, it is possible that the model is leaning on that imbalance to label names as female rather than relying purely on the characteristics of the names.
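
The precision, recall, and F-measure figures can be reproduced by hand from the four counts in the confusion matrix. The sketch below does that and also computes a majority-class baseline (always guessing "female") for comparison. The helper and variable names are our own.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from true positives, false positives, false negatives."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Counts read off the matrix above (rows = reference, columns = predicted).
female_as_female, female_as_male = 288, 29
male_as_female, male_as_male = 68, 115

female = precision_recall_f1(tp=female_as_female, fp=male_as_female, fn=female_as_male)
male = precision_recall_f1(tp=male_as_male, fp=female_as_male, fn=male_as_female)

total = female_as_female + female_as_male + male_as_female + male_as_male
accuracy = (female_as_female + male_as_male) / total
majority_baseline = (female_as_female + female_as_male) / total  # always guess "female"

print([round(x, 4) for x in female])
print([round(x, 4) for x in male])
print(accuracy, majority_baseline)
```

Always guessing "female" would score 63.4% on this test set, so the classifier's 80.6% suggests it is doing more than exploiting the class imbalance, even though its errors are skewed toward male names.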

---
## Video Presentation

The code below allows a YouTube link to the video presentation to be inserted for the url variable and will then display the YouTube video within the notebook itself.

A regex match extracts the video ID from the URL, which is then fed into the IPython package's built-in YouTube embedder.

In [139]:
url = "https://youtu.be/tYksib7BFWA"

In [140]:
from IPython.display import YouTubeVideo
import re

reg = r"(?:v=|\/)([0-9A-Za-z_-]{11}).*"
urlid = re.search(reg, url)[1]

YouTubeVideo(urlid, width=800, height=450)
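
Indexing the result of `re.search` directly raises a `TypeError` when the URL does not match. A slightly more defensive variant is sketched below. It reuses the same pattern idea, and the function name `extract_video_id` is our own. Note that the pattern is deliberately loose and would also accept any 11-character path segment on other sites.

```python
import re

# "v=" or "/" followed by exactly the 11 characters YouTube uses for video IDs.
VIDEO_ID_PATTERN = re.compile(r"(?:v=|/)([0-9A-Za-z_-]{11})")

def extract_video_id(url):
    """Return the 11-character video ID, or None if the URL has no match."""
    match = VIDEO_ID_PATTERN.search(url)
    return match.group(1) if match else None

print(extract_video_id("https://youtu.be/tYksib7BFWA"))
print(extract_video_id("https://www.youtube.com/watch?v=tYksib7BFWA"))
print(extract_video_id("not a url"))
```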