## Project 3
Al Haque, Taha Ahmad


Using any of the three classifiers described in chapter 6 of Natural Language Processing with Python,
and any features you can think of, build the best name gender classifier you can. For this part, we reuse code from the textbook and add simple features to improve the gender classifier. We use the Naive Bayes and Decision Tree classifiers to make our predictions.

In [124]:
import nltk

The first step is to decide which features of the input are relevant and how to encode them. We start with a function modeled on gender_features from the textbook, here using the first letter of the name.

In [125]:
## Feature extractor: the lowercased first letter of the name

def gender_feature(word):
    return {'first letter': word[0].lower()}



In [126]:
# Let's check as an example.
gender_feature('Fiona')

{'first letter': 'f'}

Extract the male and female names from the two corpus files and store them as a list of (name, gender) tuples (taken from the textbook).

In [127]:
import random
from nltk.corpus import names

# Build a list of (name, gender) tuples from the male and female name files
labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])
# Fix the random seed so the shuffle gives the same order on every run
random.seed(543)
random.shuffle(labeled_names)

In [128]:
# Here the names are randomized
labeled_names

[('Rozelle', 'female'),
 ('Augusto', 'male'),
 ('Augusta', 'female'),
 ('Cheslie', 'female'),
 ('Carlota', 'female'),
 ('Rupert', 'male'),
 ('Floria', 'female'),
 ('Joellen', 'female'),
 ('Jessica', 'female'),
 ('Arel', 'male'),
 ('Beale', 'male'),
 ('Alanah', 'female'),
 ('Rebecka', 'female'),
 ('Maurene', 'female'),
 ('Correy', 'female'),
 ('Vern', 'male'),
 ('Bryana', 'female'),
 ('Hewett', 'male'),
 ('Allie', 'female'),
 ('Garvin', 'male'),
 ('Waylin', 'male'),
 ('Nathanael', 'male'),
 ('Tull', 'male'),
 ('Mollee', 'female'),
 ('Peter', 'male'),
 ('Beau', 'female'),
 ('Nathaniel', 'male'),
 ('Terri-Jo', 'female'),
 ('Alan', 'male'),
 ('Amalea', 'female'),
 ('Kaile', 'female'),
 ('Glenine', 'female'),
 ('Conroy', 'male'),
 ('Nealy', 'male'),
 ('Carroll', 'female'),
 ('Malcolm', 'male'),
 ('Alyce', 'female'),
 ('Riannon', 'female'),
 ('Gabbi', 'female'),
 ('Hermy', 'male'),
 ('Odelinda', 'female'),
 ('Layney', 'female'),
 ('Colene', 'female'),
 ('Teane', 'female'),
 ('Latashia', 'fem

In [129]:
# Create the training, dev-test, and test sets.
# The training set is used to train the model, the dev-test set is used for error analysis,
# and the test set is used for the final evaluation of the system.
train_names = labeled_names[1000:]
devtest_names = labeled_names[500:1000]
test_names = labeled_names[:500]
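
The slicing above can be wrapped in a small helper that also makes it easy to check that the three sets do not overlap. This is only a minimal sketch: `split_names` and the toy list are hypothetical names introduced for illustration, and the toy data stands in for `labeled_names` so the example runs without the NLTK corpus.

```python
def split_names(labeled, n_test=500, n_devtest=500):
    """Split an already-shuffled list into train, dev-test, and test parts."""
    if len(labeled) <= n_test + n_devtest:
        raise ValueError("not enough items to leave a training set")
    test = labeled[:n_test]
    devtest = labeled[n_test:n_test + n_devtest]
    train = labeled[n_test + n_devtest:]
    return train, devtest, test

# Toy stand-in for labeled_names (hypothetical data, not the NLTK corpus)
toy = [(f"name{i}", "female" if i % 2 else "male") for i in range(1200)]
train, devtest, test = split_names(toy)
print(len(train), len(devtest), len(test))
```

With the defaults, the first 500 items become the test set, the next 500 the dev-test set, and the remainder the training set, which matches the slices used in the notebook.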

In [133]:
train_set = [(gender_feature(n), gender) for (n, gender) in train_names]
devtest_set = [(gender_feature(n), gender) for (n, gender) in devtest_names]
test_set = [(gender_feature(n), gender) for (n, gender) in test_names]
# Train a Naive Bayes classifier and a decision tree classifier
classifier = nltk.NaiveBayesClassifier.train(train_set)
dtclassifier = nltk.DecisionTreeClassifier.train(train_set)
print(f'Naive Bayes dev-test accuracy: {nltk.classify.accuracy(classifier, devtest_set)}')
print(f'Decision tree dev-test accuracy: {nltk.classify.accuracy(dtclassifier, devtest_set)}')

Naive Bayes dev-test accuracy: 0.616
Decision tree dev-test accuracy: 0.616


Using only the first letter of each name gives about 62% accuracy on the dev-test set for both classifiers. Let's add more features to our gender_feature function.
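
To judge whether an accuracy figure like this is good, it helps to compare it with a majority-class baseline, that is, the accuracy obtained by always predicting the most common training label. The sketch below is an assumption-laden illustration: `majority_baseline_accuracy` and the toy label lists are hypothetical and are not part of the notebook or of NLTK.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, eval_labels):
    """Accuracy of always predicting the most common training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for label in eval_labels if label == majority_label)
    return majority_label, hits / len(eval_labels)

# Toy labels (hypothetical), standing in for the real train and dev-test labels
toy_train = ["female"] * 6 + ["male"] * 4
toy_eval = ["female", "male", "female", "female", "male"]
label, acc = majority_baseline_accuracy(toy_train, toy_eval)
print(label, acc)
```

In the notebook, the inputs would be `[g for _, g in train_names]` and `[g for _, g in devtest_names]`. A classifier that barely beats this baseline has learned little from its features.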

In [134]:
classifier.show_most_informative_features(5)

Most Informative Features
            first letter = 'w'              male : female =      5.1 : 1.0
            first letter = 'u'              male : female =      2.8 : 1.0
            first letter = 'q'              male : female =      2.7 : 1.0
            first letter = 'x'              male : female =      2.3 : 1.0
            first letter = 'k'            female : male   =      2.3 : 1.0
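
Each row of this output is a likelihood ratio: how much more often a feature value occurs with one label than with the other. The sketch below estimates such a ratio from raw counts. It is a simplification under stated assumptions: `likelihood_ratio` and the toy feature sets are hypothetical, and NLTK smooths its probability estimates, so its reported numbers will differ somewhat from unsmoothed counts.

```python
from collections import Counter

def likelihood_ratio(featuresets, feature, value, label_a, label_b):
    """Estimate P(feature=value | label_a) / P(feature=value | label_b) from raw counts.

    Unsmoothed: raises ZeroDivisionError if the value never occurs with label_b.
    """
    label_totals = Counter()
    value_counts = Counter()
    for features, label in featuresets:
        label_totals[label] += 1
        if features.get(feature) == value:
            value_counts[label] += 1
    p_a = value_counts[label_a] / label_totals[label_a]
    p_b = value_counts[label_b] / label_totals[label_b]
    return p_a / p_b

# Toy feature sets (hypothetical): 3 of 10 male names and 1 of 10 female names start with 'w'
toy = ([({'first letter': 'w'}, 'male')] * 3 + [({'first letter': 'a'}, 'male')] * 7 +
       [({'first letter': 'w'}, 'female')] * 1 + [({'first letter': 'a'}, 'female')] * 9)
ratio = likelihood_ratio(toy, 'first letter', 'w', 'male', 'female')
print(f'male : female = {ratio:.1f} : 1.0')
```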


In [138]:
# Add the last letter as a feature to see whether the end of a name helps predict gender
def gender_feature2(word):
    return {'first letter': word[0].lower(),
            'last letter': word[-1].lower()}

In [140]:
train_set2 = [(gender_feature2(n), gender) for (n, gender) in train_names]
devtest_set2 = [(gender_feature2(n), gender) for (n, gender) in devtest_names]
test_set2 = [(gender_feature2(n), gender) for (n, gender) in test_names]
# Train the Naive Bayes and decision tree classifiers again on the new features
classifier2 = nltk.NaiveBayesClassifier.train(train_set2)
dtclassifier2 = nltk.DecisionTreeClassifier.train(train_set2)
print(f'Naive Bayes dev-test accuracy: {nltk.classify.accuracy(classifier2, devtest_set2)}')
print(f'Decision tree dev-test accuracy: {nltk.classify.accuracy(dtclassifier2, devtest_set2)}')

Naive Bayes dev-test accuracy: 0.774
Decision tree dev-test accuracy: 0.778


Adding the last letter raises dev-test accuracy to about 78%, which is not surprising since the textbook also uses this feature. Let's test one more feature: the length of the name.

In [141]:
classifier2.show_most_informative_features(5)

Most Informative Features
             last letter = 'a'            female : male   =     31.9 : 1.0
             last letter = 'k'              male : female =     31.1 : 1.0
             last letter = 'v'              male : female =     17.7 : 1.0
             last letter = 'f'              male : female =     17.5 : 1.0
             last letter = 'p'              male : female =     12.0 : 1.0
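
The dev-test set exists for error analysis: listing the names the classifier gets wrong often suggests new features. The following is a minimal sketch of that step. `collect_errors`, `toy_features`, `toy_classify`, and the toy name list are all hypothetical stand-ins so the example runs without the NLTK corpus or a trained model.

```python
def collect_errors(classify, extract_features, labeled_names):
    """Return (true_label, guess, name) for every misclassified name, sorted."""
    errors = []
    for name, label in labeled_names:
        guess = classify(extract_features(name))
        if guess != label:
            errors.append((label, guess, name))
    return sorted(errors)

# Hypothetical stand-ins for a feature extractor and a trained classifier
def toy_features(name):
    return {'last letter': name[-1].lower()}

def toy_classify(features):
    # Crude rule for illustration only: names ending in a, e, i, or y are guessed female
    return 'female' if features['last letter'] in 'aeiy' else 'male'

toy_names = [('Augusta', 'female'), ('Rupert', 'male'), ('Beau', 'female'),
             ('Vern', 'male'), ('Jesse', 'male')]
toy_errors = collect_errors(toy_classify, toy_features, toy_names)
for label, guess, name in toy_errors:
    print(f'correct={label:<8} guess={guess:<8} name={name}')
```

In the notebook, the equivalent call would be `collect_errors(classifier2.classify, gender_feature2, devtest_names)`, since NLTK classifiers expose a `classify` method that takes a feature dictionary.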


In [147]:
## Add the length of the name as a third feature
def gender_feature3(word):
    return {'first letter': word[0].lower(),
            'last letter': word[-1].lower(),
            'length': len(word)}

In [148]:
train_set3 = [(gender_feature3(n), gender) for (n, gender) in train_names]
devtest_set3 = [(gender_feature3(n), gender) for (n, gender) in devtest_names]
test_set3 = [(gender_feature3(n), gender) for (n, gender) in test_names]
# Train the Naive Bayes and decision tree classifiers again on the three features
classifier3 = nltk.NaiveBayesClassifier.train(train_set3)
dtclassifier3 = nltk.DecisionTreeClassifier.train(train_set3)
print(f'Naive Bayes dev-test accuracy: {nltk.classify.accuracy(classifier3, devtest_set3)}')
print(f'Decision tree dev-test accuracy: {nltk.classify.accuracy(dtclassifier3, devtest_set3)}')

Naive Bayes dev-test accuracy: 0.776
Decision tree dev-test accuracy: 0.768
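
These dev-test accuracies differ from the two-feature results by a percentage point or less. A rough way to judge whether such a gap is meaningful is the binomial standard error of an accuracy measured on 500 names. This is a back-of-the-envelope sketch that assumes the dev-test names are independent draws; `accuracy_standard_error` is a hypothetical helper, not an NLTK function.

```python
import math

def accuracy_standard_error(accuracy, n):
    """Binomial standard error of an accuracy measured on n independent items."""
    return math.sqrt(accuracy * (1 - accuracy) / n)

# Naive Bayes dev-test accuracy reported above, measured on 500 names
se = accuracy_standard_error(0.776, 500)
print(f'standard error: {se:.3f}')
print(f'approx. 95% interval: {0.776 - 1.96 * se:.3f} to {0.776 + 1.96 * se:.3f}')
```

The standard error comes out just under two percentage points, so differences of a point or less between feature sets on this dev-test set are within what sampling noise alone could produce.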


In [149]:
## Let's test this classifier on our test-set
print(f'The accuracy of the test set using Naive Bayes is: {nltk.classify.accuracy(classifier3, test_set3)}')
print(f'The accuracy of the test set using Decision Tree is: {nltk.classify.accuracy(dtclassifier3, test_set3)}')

The accuracy of the test set using Naive Bayes is: 0.792
The accuracy of the test set using Decision Tree is: 0.76


In [150]:
## Length does not appear among the five most informative features
classifier3.show_most_informative_features(5)

Most Informative Features
             last letter = 'a'            female : male   =     31.9 : 1.0
             last letter = 'k'              male : female =     31.1 : 1.0
             last letter = 'v'              male : female =     17.7 : 1.0
             last letter = 'f'              male : female =     17.5 : 1.0
             last letter = 'p'              male : female =     12.0 : 1.0


## Analysis 
The last letter of a name was by far the most useful feature: adding it raised dev-test accuracy from about 62% to about 78% for both classifiers. The first letter on its own reached only about 62%, and adding the length of the name changed dev-test accuracy only slightly (0.774 to 0.776 for Naive Bayes, 0.778 to 0.768 for the decision tree). On the test set, Naive Bayes generalized somewhat better than the decision tree (0.792 versus 0.76), but the difference is small.
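
One direction the textbook explores after error analysis is using suffixes longer than a single letter. The extractor below is only a sketch of what that could look like here: `gender_feature_suffix` is a hypothetical name, and it was not trained or evaluated in this project, so no accuracy is claimed for it.

```python
def gender_feature_suffix(word):
    """Hypothetical extractor: first letter plus one- and two-letter suffixes."""
    word = word.lower()
    return {'first letter': word[0],
            'suffix1': word[-1:],   # last letter
            'suffix2': word[-2:]}   # last two letters (whole word if shorter)

print(gender_feature_suffix('Fiona'))
```

It could be swapped in for gender_feature3 when building the training and dev-test sets, with the test set held back until the feature set is final.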