# Project 3 Text Analysis with NLTK Classification

## By Team VC

### Intro into Text Mining 

For project 3, we were given the task to create and improve a classifier that correctly guesses the gender of random names (e.g. Charles, Roberta, Eve) as shown in Chapter 6 of NLP. We trained a decision tree classifier by submitting a 500 name training set and testing it with other names.

Some initial functions from the text are included as a benchmark and we know we've improved our gender classifier if it has a higher accuracy rate than the text book classifiers. One noticed issue is the accuracy is directly proportional to the amount of test names. We decided on the text classfiyer to use the Decision Tree model for our approach, as it can efficently scale down the possibilities of the correct choice dependent on the branch.

First, Let's start up the classfiyer and its following model.

#### Gender Identification With A Decision Tree

In [2]:
### Initialize a classifier
import re
import pandas as pd
import random
import nltk

# Download names if they don't exist locally
try:
    nltk.data.find('corpora/names')
except LookupError:
    nltk.download('names')
    
from nltk.corpus import names


### Defining Different Features for the Decision Tree Classifiers

In order to train our machine to guess the correct gender, we incorporated the deciding factors. The machine will take account the examined parts of the name and see if there's a strong polarity with a specific feature of the name and gender. For example, the given feature takes the first and last letter of the name for its deiciding feature and the machine make an inference. For example, we could see there is a strong trend of female names ending with an 'e'. 

We split up the feature deciders into their own functions (gender_features3 & gender_features4) to limit on what factors may lead the machine to overfit its results. Then, we displayed all the created functions with their accuracy results compared overall. For our created feature sets: The gender_features 4 and 3 uses parts of US cultural structure for the text analysis. For example, gender_features4 has a vowel feature which would return a count of variables used in a name. The program can take this naming feature and make a guess that more feminine names use more vowels than masculine ones, or there is a pattern of a specific vowel count.

In [3]:
#Features from NLTK Chapter 6
# Guessing the gender from the last letter of a name
def gender_features(word):
    return {'last_letter': word[-1] }

# Guess the gender from the first/last letter and counting letters
def gender_features2(name):
    features = {}
    features['firstletter'] = name[0].lower()
    features['lastletter'] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features['count (%s)' % letter] = name.lower().count(letter)
        features['has (%s)' % letter] = (letter in name.lower())
    return features

#our attempts at features

# Just a predictably bad guesser
def bad_feature(word):
    return {'bleah' : 1}

#Features which take in account specific letters with strong accuray given in the g1 func
def gender_features3(name):
    features = {}
    features['firstletter'] = name[0].lower()
    features['lastletter'] = name[-1].lower()
    features['.*arry$'] = re.match(r".*arry$", name)
    features['.*b[ea]rt$'] = re.match(".*b[ea]rt$", name)
    features['.*ie$'] = re.match(".*ie$", name)
    features['Sch'] = re.match(r"Sch", name)
    features['Pam'] = re.match(r"Pam", name)
    features['V.*a$'] = re.match(r"V.*a$", name)
    features['M.*l$'] = re.match(r"M.*l$", name) 
    
    return features

#features with a count of the number of vowels, letters with softer sounds, ending for re-gender names
def gender_features4(name):
    features={}
    features['firstletter'] = name[0].lower()
    features['lastletter'] = name[-1].lower()
    for vowels in 'aeiou':
        features['Vowel']=(vowels in name.lower())
    features['.*y$']=re.match(r".*y$", name)
    features['.*ia$']=re.match(r".*ia$", name)
    features['.*li']=re.match(r".*li$", name)
    features['.*ck$']=re.match(r".*ck$", name)
    return features
    
# create a list of feature function to test
gender_functions = [gender_features, 
                    gender_features2, 
                    bad_feature, 
                    gender_features3,
                    gender_features4
                   ]


### Evaulating the feature functions

We can now put our created functions up to the test. We compiled three types of name lists with a random combination of names of all genders inside and divided it up into three sets:
  * 500 name training set
  * 500 name development-test set
  * 6500 name official test sets
  
We then iterated through our feauture functions through all its features sets and trained our decison trees with the collection. Then, the classifier was used to identify the accuracy of its guesses with the collection. 

All our runs produced different rankings, as the randomization of the names can benefit certain feature sets. It is shown the high accuracy the given gender_features2 driven by count. For example, our bad feature set seems to have a identical ranking compared to the feature set given from the chapter. This shows  the classifier can be influenced into wrong guesses, as the gender_feature will base its guess all on the last letter than to guess blindly in the bad feature. 

For our positive scores, Features 4 & 3 holds higher ranking in the dev and test accuracy compared to the others. It can be seen  the classifier has a higher chance in dev/test accuracy if its training accuracy is already high. However, there are the odds that the training set had more revelant letter for the given feature sets than others.

In [16]:
# Compile all names into a list
all_names = ([(name, 'male') for name in names.words('male.txt')] +
            [(name, 'female') for name in names.words('female.txt')])

# randomize the entire list
random.shuffle(all_names)


# Setup the train, devtest and test sets
train_names, devtest_names, test_names = all_names[0:500], all_names[500:1000], all_names[1000:]


In [17]:
# Iterate through the different feature functions 
# and compare their accuracy

results_list = []

# For each feature function, try to classify the test sets
for fn in gender_functions:
    train_set = [(fn(n), g) for (n,g) in train_names]
    devtest_set = [(gender_features2(n), g) for (n,g) in devtest_names]
    test_set = [(gender_features2(n), g) for (n,g) in test_names]
    
    # make a classifier from the training set
    classifier = nltk.classify.DecisionTreeClassifier.train(train_set)

    # Print the classifier logic
    #print(classifier)

    # Get the accuracies
    accuracy_train = nltk.classify.accuracy(classifier, train_set)
    accuracy_devtest = nltk.classify.accuracy(classifier, devtest_set)
    accuracy_test = nltk.classify.accuracy(classifier, test_set)

    results_list.append([fn.__name__, accuracy_train, accuracy_devtest, accuracy_test])


results_df = pd.DataFrame(results_list,
                          columns=['Function', 
                                   'Training Accuracy', 
                                   'Devtest Accuracy',
                                   'Test Accuracy',
                                   ])

results_df.head()


Unnamed: 0,Function,Training Accuracy,Devtest Accuracy,Test Accuracy
0,gender_features,0.78,0.366,0.36924
1,gender_features2,0.962,0.72,0.743664
2,bad_feature,0.608,0.634,0.63076
3,gender_features3,0.872,0.758,0.755616
4,gender_features4,0.9,0.754,0.759217


In [58]:
# Optional code to detail incorrect guesses
# 
#errors = []
#for (name, tag) in devtest_names:
#    guess = classifier.classify(gender_features(name))
#    if guess != tag:
#        errors.append( (tag, guess, name) )

#for (tag, guess, name) in sorted(errors):
#    print( 'correct=%-8s guess=%-8s name=%-30s' % (tag, guess, name))


### Takeaways

The feature functions the machine uses to learn benefits off the relevancy of its training set. The machine has a higher possibility accuracy if the supplied features training hits correct branches. As our set was modeled from the decision tree, its probability is at the high variance. Once the machine travels down a branch of a stronger association with a gender, it cannot traverse onto the other options. 

So, the feature options that take account cultural relevancy like feminine names like lia or marcelia or masculine names with harsher letters can flip the machines into hard choices. Gender_features2 has a highest test accuracy as its uses non-cultural features and uses counts to make the assumptions, which allows the machines to pick more freely between branches. The machine can set its choices later in the branches.

One item not mentioned in the gender classfication is the set of names used and their variety. We cannot determine if the names used are just names relevant to a certain type of group and this can affect the accuracy when a new group of names are introduce. This can alter the testing results if the training group's names lack diversity.