# Gender Identification

The purpose of this notebook is to build a model that classifies first names as male or female.

In [18]:
import random 
from nltk.corpus import names 
import nltk 

### Features

The first step is to decide what **features** of the input are relevant and how to encode those features.

In [1]:
# The following function builds a dictionary containing relevant information about a given name:

def gender_features(word):
    return {'last_letter': word[-1]}

In [20]:
gender_features('Shrek')

{'last_letter': 'k'}
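
The same dictionary pattern extends to more than one feature. The following is a minimal sketch only: the function name `gender_features_extended` and the two extra features are assumptions made for illustration and are not used by the model trained in this notebook.

```python
def gender_features_extended(word):
    """Hypothetical extension: encode several properties of a name."""
    return {
        'last_letter': word[-1],           # same feature as gender_features
        'first_letter': word[0].lower(),   # assumed extra feature
        'length': len(word),               # assumed extra feature
    }

print(gender_features_extended('Shrek'))
```

Each key becomes a separate feature for the classifier, so adding a key is enough to expose a new property of the name to the model.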

In [19]:
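# The next step is to create a list of examples and corresponding class labels.
# A separate variable name avoids shadowing the imported `names` corpus.
labeled_names = ([(name, 'male') for name in names.words('male.txt')] +
                 [(name, 'female') for name in names.words('female.txt')])

### Training Set and Test Set

In [23]:
# Note: the list is ordered (male names first), so the first 500 entries,
# held out here as the test set, are all male names.
featuresets = [(gender_features(n), g) for (n, g) in labeled_names]
train_set, test_set = featuresets[500:], featuresets[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)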
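
One way to obtain a test set that mixes both labels is to shuffle the examples before slicing. The sketch below illustrates the idea on a small stand-in list. The data, the helper name `shuffled_split`, and the fixed seed are all assumptions for illustration, and the results shown in this notebook were not produced this way.

```python
import random

# Toy stand-in for the labeled name list (hypothetical data), ordered by label
# in the same way as a list built by concatenating male and female names.
toy_names = [('Adam', 'male'), ('Bruno', 'male'), ('Carl', 'male'),
             ('Dana', 'female'), ('Elena', 'female'), ('Fiona', 'female')]

def shuffled_split(examples, test_size, seed=0):
    """Return (train, test) after shuffling a copy of `examples`."""
    shuffled = list(examples)              # copy so the input is left untouched
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    return shuffled[test_size:], shuffled[:test_size]

train, test = shuffled_split(toy_names, test_size=2)
print(len(train), len(test))
```

Seeding the generator keeps the split stable between runs, which makes accuracy figures comparable when the features change.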

In [29]:
print('Label Class for Neo:', classifier.classify(gender_features('Neo')))
print('Label Class for Trinity:', classifier.classify(gender_features('Trinity')))
print('Label Class for Camila:', classifier.classify(gender_features('Camila')))
print('Label Class for Juan Pablo:', classifier.classify(gender_features('Juan Pablo')))

Label Class for Neo: male
Label Class for Trinity: female
Label Class for Camila: female
Label Class for Juan Pablo: male


In [31]:
# Obtain accuracy of the model
print(nltk.classify.accuracy(classifier, test_set))

0.602
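
`nltk.classify.accuracy` reports the fraction of test examples whose predicted label matches the true label. The sketch below reproduces that calculation by hand, with no dependency on NLTK. The rule-based `toy_classify` function and the four test examples are hypothetical and stand in for the trained classifier and the real test set.

```python
def toy_classify(features):
    """Hypothetical stand-in for a trained classifier: a fixed last-letter rule."""
    return 'female' if features['last_letter'] in 'aeiy' else 'male'

def accuracy(classify, labeled_examples):
    """Fraction of (features, label) pairs for which classify(features) == label."""
    correct = sum(1 for features, label in labeled_examples
                  if classify(features) == label)
    return correct / len(labeled_examples)

# Hypothetical test examples, not drawn from the names corpus.
toy_test_set = [({'last_letter': 'a'}, 'female'),
                ({'last_letter': 'k'}, 'male'),
                ({'last_letter': 'o'}, 'male'),
                ({'last_letter': 'n'}, 'female')]

print(accuracy(toy_classify, toy_test_set))
```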


In [33]:
# Check which features the classifier found most effective for distinguishing the names' genders.
classifier.show_most_informative_features(5)

Most Informative Features
             last_letter = 'a'            female : male   =     35.5 : 1.0
             last_letter = 'k'              male : female =     34.1 : 1.0
             last_letter = 'f'              male : female =     15.9 : 1.0
             last_letter = 'p'              male : female =     13.5 : 1.0
             last_letter = 'v'              male : female =     12.7 : 1.0


The listing shows that, in the training set, a name ending in 'a' is 35.5 times more likely among female names than among male names, whereas a name ending in 'k' is 34.1 times more likely among male names than among female names. These ratios are known as **likelihood ratios**.
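
A likelihood ratio compares how probable a feature value is under one label with how probable it is under the other. The sketch below estimates such a ratio from raw counts. The training pairs are hypothetical, not the corpus counts, and the helper name `likelihood_ratio` is an assumption. The estimate is unsmoothed, whereas NLTK's classifier smooths its probability estimates, so the figures it reports would not match this calculation exactly.

```python
from collections import Counter

# Hypothetical training data as (last_letter, label) pairs.
toy_training = ([('a', 'female')] * 6 + [('e', 'female')] * 4 +
                [('a', 'male')] * 1 + [('k', 'male')] * 4 + [('n', 'male')] * 5)

def likelihood_ratio(examples, value, label_a, label_b):
    """Estimate P(value | label_a) / P(value | label_b) from raw counts.

    Unsmoothed: raises ZeroDivisionError if `value` never occurs with label_b.
    """
    label_totals = Counter(label for _, label in examples)
    pair_counts = Counter(examples)  # counts of each (value, label) pair
    p_a = pair_counts[(value, label_a)] / label_totals[label_a]
    p_b = pair_counts[(value, label_b)] / label_totals[label_b]
    return p_a / p_b

ratio = likelihood_ratio(toy_training, 'a', 'female', 'male')
print(f"last_letter = 'a'   female : male = {ratio:.1f} : 1.0")
```

Dividing by the per-label totals is what makes this a ratio of conditional probabilities and not a ratio of raw counts, which matters when one label has many more examples than the other.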