# DATA 620 Project 3
## Every Student in CUNY SPS's Summer 2018 DATA 620 class

Working as a single group, the class was tasked with building the best name gender classifier we could.

## Data setup

In [1]:
import nltk
from nltk.corpus import names
from nltk.classify import apply_features
import random
import pandas as pd

The `nltk` library was central to this project: it supplied both the names corpus and the classifiers. The `random` library was used to shuffle the names, and `pandas` was used to collect the results of each accuracy run in a data frame so that they could be summarized efficiently.

In [2]:
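# Label every name with its gender and combine both lists.
# Note: this rebinds `names` from the corpus reader to a plain list.
names = ([(name, 'male') for name in names.words('male.txt')] +
         [(name, 'female') for name in names.words('female.txt')])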

The names provided by `nltk` were used to train and test our classifiers, with male and female names stored together in a single list of `(name, gender)` pairs.

## Determination of accuracy

Most of the class had been using the Naive Bayes method, so the function written to measure the accuracy of any given set of features uses a Naive Bayes classifier as well.
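For reference, the standard Naive Bayes decision rule picks the label that maximizes the prior probability times the product of the per-feature likelihoods. The notation below is generic and is not taken from the class's notebook: $g$ is a gender label and $f_1, \dots, f_n$ are the feature values extracted from a name.

```latex
\hat{g} = \arg\max_{g \in \{\text{male},\, \text{female}\}} P(g) \prod_{i=1}^{n} P(f_i \mid g)
```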

In [3]:
def accuracy(number_of_runs, function_to_use):
    acc_df = {
        "classifier": [],
        "train_set_accuracy": [],
        "test_set_accuracy": [],
        "devtest_set_accuracy": [],
        "devtest_errors": []
    }
    for i in range(number_of_runs):
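        # Reshuffle on every run so that each run uses a different split.
        random.shuffle(names)
        acc_train_names = names[1000:]      # everything after the first 1,000 names
        acc_devtest_names = names[500:1000] # the next 500 names
        acc_test_names = names[:500]        # the first 500 names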
        acc_train_set = [(function_to_use(n), g) for (n,g) in acc_train_names]
        acc_devtest_set = [(function_to_use(n), g) for (n,g) in acc_devtest_names]
        acc_test_set = [(function_to_use(n), g) for (n,g) in acc_test_names]
        acc_classifier = nltk.NaiveBayesClassifier.train(acc_train_set)
        acc_df["classifier"].append(acc_classifier)
        acc_df["train_set_accuracy"].append(nltk.classify.accuracy(acc_classifier, acc_train_set))
        acc_df["test_set_accuracy"].append(nltk.classify.accuracy(acc_classifier, acc_test_set))
        acc_df["devtest_set_accuracy"].append(nltk.classify.accuracy(acc_classifier, acc_devtest_set))
        acc_errors = []
        for (name, tag) in acc_devtest_names:
            acc_guess = acc_classifier.classify(function_to_use(name))
            if acc_guess != tag:
                acc_errors.append( (tag, acc_guess, name) )
        acc_df["devtest_errors"].append(acc_errors)
    # Convert the collected per-run results into a data frame.
    acc_df = pd.DataFrame.from_dict(acc_df)
    return acc_df

The results of each run are collected in a dictionary, later converted to a data frame, with one row per run of the feature function being checked against the names in the `names` variable. The `accuracy` function takes a `number_of_runs` parameter so that the class could choose how many runs to average over before treating a function's accuracy as reliable. The number ultimately settled on was 100.

Within the `accuracy` function, the names are shuffled on every run; after each shuffle, the first 500 names are used as the test set, the next 500 as the dev-test set, and the remaining names as the training set. The classifier from each run is kept, as is the list of dev-test errors.

Lastly, the data frame is returned so that it can be stored in a variable of the caller's choosing.
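The shuffle-and-slice split can be sketched on its own with a toy list. This is a minimal illustration, not the class's code: `toy_names`, `split_names`, and the `seed` parameter are hypothetical, and unlike the notebook's function this version shuffles a copy rather than the original list.

```python
import random

# Hypothetical stand-in for the labeled (name, gender) list built from the corpus.
toy_names = [("name%d" % i, "male" if i % 2 else "female") for i in range(2000)]

def split_names(labeled_names, seed=None):
    """Shuffle a copy of the list, then slice it into train, dev-test, and test sets."""
    shuffled = list(labeled_names)
    random.Random(seed).shuffle(shuffled)
    test = shuffled[:500]          # first 500 names
    devtest = shuffled[500:1000]   # next 500 names
    train = shuffled[1000:]        # everything else
    return train, devtest, test

train, devtest, test = split_names(toy_names, seed=0)
print(len(train), len(devtest), len(test))  # 1000 500 500
```

Because the three slices never overlap, no name is used for both training and evaluation within a single run.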

## Gender features

Chapter 6 of **Natural Language Processing with Python** provides two ready-made feature functions to check against the corpus of names. The class wrote a third function and compared its accuracy with that of the textbook's examples.

In [4]:
def textbook_gender_features_1(word):
    return {'last_letter': word[-1]}

This is the textbook's first example of a gender feature function. It extracts a single feature: the last letter of the name.

In [5]:
def textbook_gender_features_2(name):
    features = {}
    features["firstletter"] = name[0].lower()
    features["lastletter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count(%s)" % letter] = name.lower().count(letter)
        features["has(%s)" % letter] = (letter in name.lower())
    return features

This is the textbook's second example of a gender feature function. In addition to the last letter of the name, it records the first letter, the number of times each letter of the alphabet appears, and whether each letter appears at all.
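To make the size of this feature set concrete, the sketch below repeats the function so that it runs on its own and applies it to one name. The choice of "Anna" is only an illustration.

```python
def textbook_gender_features_2(name):
    # Same logic as the cell above, repeated so this block is self-contained.
    features = {}
    features["firstletter"] = name[0].lower()
    features["lastletter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count(%s)" % letter] = name.lower().count(letter)
        features["has(%s)" % letter] = (letter in name.lower())
    return features

features = textbook_gender_features_2("Anna")
print(len(features))                                    # 54
print(features["firstletter"], features["lastletter"])  # a a
print(features["count(n)"], features["has(n)"])         # 2 True
print(features["count(z)"], features["has(z)"])         # 0 False
```

Every name produces the same 54 keys: two for the first and last letter, plus a count and a presence flag for each of the 26 letters.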

In [6]:
def class_gender_features(name):
    features = {}
    temp_name = name
    eng_cons_clusters = ["bl", "br", "ch", "cl", "cr", "dr", "fl", "fr", "gl", "gr", "pl", "pr", "sc", "sh", "sk", "sl", "sm", "sn", "sp", "st", "sw", "th", "tr", "tw", "wh", "wr", "sch", "scr", "shr", "sph", "spl", "spr", "squ", "str", "thr"]
    features["firstletter"] = name[0].lower() 
    features["lastletter"] = name[-1].lower() 
    features["prefix"] = name[:3].lower() if len(name) > 4 else name[:2].lower() 
    features["suffix"] = name[-3:].lower() if len(name) > 4 else name[-2:].lower()
    clusters = []
    for cluster in eng_cons_clusters[::-1]:
        if cluster in temp_name:
            temp_name = temp_name.replace(cluster, "")
            clusters.append(cluster)
    features["english_consonant_clusters_1"] = clusters[0] if len(clusters) > 0 else None
    features["english_consonant_clusters_2"] = clusters[1] if len(clusters) > 1 else None
    features["english_consonant_clusters_3"] = clusters[2] if len(clusters) > 2 else None
    return features

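This was the class's function. It reuses the first and last letter from the textbook's second function, adds a prefix and suffix (the first and last three letters of a name longer than four characters, otherwise the first and last two), and records up to three English consonant clusters found in the name. Clusters are checked longest first, and each match is removed before the search continues, so that "str" is not also counted as "st" or "tr". The cluster check is case-sensitive and runs against the name as written, so a capitalized leading cluster such as the "Ch" in "Charles" is not matched.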
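The cluster search is the least obvious part of the function, so the sketch below isolates it. The helper name `find_clusters` is hypothetical; the loop mirrors the one in `class_gender_features`, and the cluster list is copied from it.

```python
ENG_CONS_CLUSTERS = ["bl", "br", "ch", "cl", "cr", "dr", "fl", "fr", "gl", "gr",
                     "pl", "pr", "sc", "sh", "sk", "sl", "sm", "sn", "sp", "st",
                     "sw", "th", "tr", "tw", "wh", "wr", "sch", "scr", "shr",
                     "sph", "spl", "spr", "squ", "str", "thr"]

def find_clusters(name, clusters=ENG_CONS_CLUSTERS):
    """Return the clusters found in `name`, checking three-letter clusters first."""
    remaining = name
    found = []
    # The list holds two-letter clusters first, so reversing it tries the
    # three-letter clusters before the two-letter ones they contain.
    for cluster in clusters[::-1]:
        if cluster in remaining:
            # Remove the match so its sub-clusters are not counted again.
            remaining = remaining.replace(cluster, "")
            found.append(cluster)
    return found

print(find_clusters("Astrid"))       # ['str']
print(find_clusters("Christopher"))  # ['st']
print(find_clusters("christopher"))  # ['st', 'ch']
```

The last two calls show the effect of case: the capitalized "Ch" is skipped, while the lowercased name also yields "ch".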

### Honorable mentions

The class also experimented with functions that looked at:

* letter order;
* first, second, and third letter at the beginning;
* first, second, and third letter at the end;
* first two letters;
* first three letters;
* last two letters;
* last three letters;
* double letters;
* combination of letters (any);
* last letter, if it was "y", "a", "e", "i", "k", "o", "r", "s", or "t";
* number of syllables.
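A few of these ideas are simple enough to sketch. The function below is a hypothetical illustration of how such features might be written, not the code the class actually tried; the names `has_double_letter` and `honorable_mention_features` and the feature keys are assumptions.

```python
def has_double_letter(name):
    """Return True if any letter appears twice in a row, ignoring case."""
    lowered = name.lower()
    return any(a == b for a, b in zip(lowered, lowered[1:]))

def honorable_mention_features(name):
    # Hypothetical feature set combining a few of the ideas listed above.
    lowered = name.lower()
    return {
        "first_two": lowered[:2],
        "last_two": lowered[-2:],
        "double_letter": has_double_letter(name),
        "last_letter_in_set": lowered[-1] in "yaeikorst",
    }

print(honorable_mention_features("Emma"))
# {'first_two': 'em', 'last_two': 'ma', 'double_letter': True, 'last_letter_in_set': True}
```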

## Testing accuracy

The class's goal for this project was to beat the accuracy of the gender feature functions provided by the textbook. To compare them, we ran each function 100 times.

In [7]:
textbook_df_1 = accuracy(100, textbook_gender_features_1)
textbook_df_1.describe()

| statistic | train_set_accuracy | test_set_accuracy | devtest_set_accuracy |
|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 |
| mean | 0.762863 | 0.76266 | 0.76074 |
| std | 0.001756 | 0.016983 | 0.018034 |
| min | 0.758641 | 0.71 | 0.716 |
| 25% | 0.761521 | 0.756 | 0.746 |
| 50% | 0.762673 | 0.764 | 0.764 |
| 75% | 0.763969 | 0.772 | 0.772 |
| max | 0.766993 | 0.804 | 0.81 |


The first function, while simple, produces fairly impressive results: its average accuracy is between 76.1% and 76.3% on all three sets. It showed us that less can be more.

In [8]:
textbook_df_2 = accuracy(100, textbook_gender_features_2)
textbook_df_2.describe()

| statistic | train_set_accuracy | test_set_accuracy | devtest_set_accuracy |
|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 |
| mean | 0.778436 | 0.77102 | 0.77394 |
| std | 0.001988 | 0.016616 | 0.018594 |
| min | 0.773906 | 0.728 | 0.726 |
| 25% | 0.777506 | 0.76 | 0.7635 |
| 50% | 0.778226 | 0.77 | 0.773 |
| 75% | 0.779666 | 0.7805 | 0.784 |
| max | 0.783122 | 0.812 | 0.822 |


The second function provided by the textbook, while somewhat more complex, had an average accuracy ranging from 77.1% to 77.8% across the three sets. This showed the class that adding a few more features could raise accuracy, here by about one percentage point on the test set.

In [9]:
class_df = accuracy(100, class_gender_features)
class_df.describe()

| statistic | train_set_accuracy | test_set_accuracy | devtest_set_accuracy |
|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 |
| mean | 0.883759 | 0.83494 | 0.828 |
| std | 0.001636 | 0.017847 | 0.017167 |
| min | 0.879752 | 0.786 | 0.79 |
| 25% | 0.882488 | 0.8215 | 0.8155 |
| 50% | 0.883713 | 0.834 | 0.83 |
| 75% | 0.884937 | 0.8465 | 0.8385 |
| max | 0.887097 | 0.882 | 0.864 |


The class's function was more complex than either textbook function. Its average accuracy was 82.8% on the dev-test set, 83.5% on the test set, and 88.4% on the training set, with individual runs sometimes scoring higher. It outperformed both of the textbook's functions.
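The size of the improvement can be checked directly from the mean accuracies reported in the three summaries above. The sketch below copies those means by hand; the dictionary layout and the helper `gain_in_points` are hypothetical.

```python
# Mean accuracies copied from the three describe() summaries.
mean_accuracy = {
    "textbook_gender_features_1": {"train": 0.762863, "test": 0.76266, "devtest": 0.76074},
    "textbook_gender_features_2": {"train": 0.778436, "test": 0.77102, "devtest": 0.77394},
    "class_gender_features": {"train": 0.883759, "test": 0.83494, "devtest": 0.828},
}

def gain_in_points(better, baseline, split):
    """Difference in mean accuracy between two functions, in percentage points."""
    return 100 * (mean_accuracy[better][split] - mean_accuracy[baseline][split])

for split in ("test", "devtest"):
    gain = gain_in_points("class_gender_features", "textbook_gender_features_2", split)
    print(f"{split}: +{gain:.1f} points")
# test: +6.4 points
# devtest: +5.4 points
```

Measured against the stronger of the two textbook functions, the class's function gains a little over five percentage points on held-out names.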

## Conclusion

In conclusion, by working together and challenging each other, a group of well over 20 students came up with a set of features for the names corpus provided by the `nltk` library that challenged *and* beat the accuracy of the functions provided by our class's textbook by more than 5 percentage points. This is what we had hoped for from our final function, having set out with the goal of beating the textbook's functions.