## Supervised Classification

![classifier](images/1.png)

In [19]:
''' 
Classification is the task of choosing the correct class label for a given input. In basic
classification tasks, each input is considered in isolation from all other inputs, and the
set of labels is defined in advance. Some examples of classification tasks are:

• Deciding whether an email is spam or not.
• Deciding what the topic of a news article is, from a fixed list of topic areas such as
“sports,” “technology,” and “politics.”
• Deciding whether a given occurrence of the word bank is used to refer to a river
bank, a financial institution, the act of tilting to the side, or the act of depositing
something in a financial institution.

'''

# Gender Identification

''' 
Names ending in a, e, and i are likely to be female, while names ending in k, o, r, s, and
t are likely to be male. Let’s build a classifier to model these differences more precisely.
'''
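The claim about final letters can be checked by counting. The following is a minimal sketch, assuming a small hand-picked sample: the `sample` list and the `last_letter_counts` helper are hypothetical and stand in for the full names corpus.

```python
from collections import Counter, defaultdict

# Hypothetical toy sample; the real corpus contains thousands of names.
sample = [
    ('Anna', 'female'), ('Maria', 'female'), ('Sophie', 'female'),
    ('Naomi', 'female'), ('Mark', 'male'), ('Frank', 'male'),
    ('Marco', 'male'), ('Peter', 'male'), ('Thomas', 'male'),
    ('Robert', 'male'),
]

def last_letter_counts(labeled_names):
    """Count how often each final letter occurs for each label."""
    counts = defaultdict(Counter)
    for name, label in labeled_names:
        counts[name[-1].lower()][label] += 1
    return counts

counts = last_letter_counts(sample)
for letter in sorted(counts):
    print(letter, dict(counts[letter]))
```

On real data the same tabulation shows which final letters are skewed toward one label, which is the kind of signal a classifier can exploit.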

''' 
The first step in creating a classifier is deciding what features of the input are relevant,
and how to encode those features.
'''
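import nltk


def gender_features(word):
    """Map a name to a dict of simple spelling features."""
    return {'last_letter': word[-1],
            'first_letter': word[0],
            'length': len(word),
            'first_vowel': first_vowel(word),
            # Single-letter names have no second letter.
            'second_letter': word[1] if len(word) > 1 else ''
            }


def first_vowel(word):
    """Return the first vowel in word, or '' if it has none."""
    for letter in word:
        if letter in 'aeiouAEIOU':
            return letter
    return ''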

In [20]:
gender_features('Anson')

{'last_letter': 'n',
 'first_letter': 'A',
 'length': 5,
 'first_vowel': 'A',
 'second_letter': 'n'}

In [21]:
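import random

# Import the corpus reader under an alias so that the `names` list below
# does not shadow it (re-running this cell would otherwise fail).
from nltk.corpus import names as names_corpus

names = ([(name, 'male') for name in names_corpus.words('male.txt')] +
         [(name, 'female') for name in names_corpus.words('female.txt')])

# Shuffle so that the train/test split is not ordered by gender.
random.shuffle(names)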

In [22]:
featuresets = [(gender_features(n), g) for (n, g) in names]
# Hold out the first 500 shuffled names for testing; train on the rest.
train_set, test_set = featuresets[500:], featuresets[:500]

classifier = nltk.NaiveBayesClassifier.train(train_set)

In [26]:
classifier.classify(gender_features('Harry'))

'male'

In [28]:
classifier.classify(gender_features('Praveen'))


'male'

In [29]:
nltk.classify.accuracy(classifier, test_set)

0.77
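The accuracy figure is the fraction of test items whose predicted label matches the gold label. A minimal sketch of that computation follows. The `accuracy` function, the `last_letter_rule` stand-in classifier, and the toy test set are illustrative assumptions, not NLTK's implementation.

```python
def accuracy(classify, labeled_featuresets):
    """Fraction of items for which classify(features) equals the gold label."""
    if not labeled_featuresets:
        raise ValueError("need at least one labeled item")
    correct = sum(1 for features, label in labeled_featuresets
                  if classify(features) == label)
    return correct / len(labeled_featuresets)

# Hypothetical stand-in classifier: guess from the last letter only.
def last_letter_rule(features):
    return 'female' if features['last_letter'] in 'aei' else 'male'

# Hypothetical test items in the same (features, label) shape as test_set.
toy_test_set = [
    ({'last_letter': 'a'}, 'female'),
    ({'last_letter': 'k'}, 'male'),
    ({'last_letter': 'n'}, 'female'),   # the rule gets this one wrong
    ({'last_letter': 'o'}, 'male'),
]
print(accuracy(last_letter_rule, toy_test_set))  # 0.75
```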

In [30]:
classifier.show_most_informative_features(5)

Most Informative Features
             last_letter = 'k'              male : female =     44.5 : 1.0
             last_letter = 'a'            female : male   =     33.0 : 1.0
             last_letter = 'f'              male : female =     17.3 : 1.0
           second_letter = 'k'              male : female =     15.3 : 1.0
             last_letter = 'p'              male : female =     11.9 : 1.0
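
Each ratio in this listing compares how likely a feature value is under one label versus the other. A minimal sketch of the idea from raw counts is shown here. It ignores any smoothing the classifier applies to its probability estimates, and the counts are made up for illustration rather than taken from the corpus.

```python
def likelihood_ratio(count_a, total_a, count_b, total_b):
    """Return P(feature | label A) / P(feature | label B) from raw counts."""
    if count_b == 0:
        raise ZeroDivisionError("feature never seen with label B; smoothing needed")
    p_a = count_a / total_a
    p_b = count_b / total_b
    return p_a / p_b

# Hypothetical counts: 90 of 3000 male names end in 'k',
# while 2 of 5000 female names do.
ratio = likelihood_ratio(90, 3000, 2, 5000)
print(f"male : female = {ratio:.1f} : 1.0")
```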


In [32]:
''' 
When working with large corpora, constructing a single list that contains the features
of every instance can use up a large amount of memory. In these cases, use the function
nltk.classify.apply_features, which returns an object that acts like a list but does not
store all the feature sets in memory.
'''
from nltk.classify import apply_features
train_set = apply_features(gender_features, names[500:])
test_set = apply_features(gender_features, names[:500])
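
The idea behind a lazily computed feature list can be sketched in plain Python. This illustrates the concept only and is not NLTK's implementation: `LazyFeatureList` and `toy_features` are hypothetical names. The raw (name, label) pairs are kept, and each feature dict is built only when an item is accessed.

```python
from collections.abc import Sequence

class LazyFeatureList(Sequence):
    """List-like view that computes (features, label) pairs on access."""

    def __init__(self, feature_func, labeled_tokens):
        self._feature_func = feature_func
        self._labeled_tokens = labeled_tokens

    def __len__(self):
        return len(self._labeled_tokens)

    def __getitem__(self, index):
        if isinstance(index, slice):
            # Slicing returns another lazy view; no features are computed yet.
            return LazyFeatureList(self._feature_func,
                                   self._labeled_tokens[index])
        token, label = self._labeled_tokens[index]
        return (self._feature_func(token), label)

def toy_features(word):
    # Hypothetical one-feature extractor, kept small for the example.
    return {'last_letter': word[-1]}

lazy = LazyFeatureList(toy_features, [('Anna', 'female'), ('Mark', 'male')])
print(len(lazy))
print(lazy[1])
```

Because the class derives from `Sequence`, iteration and membership tests come for free, so code that expects a list of feature sets can often consume it unchanged.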

# pg 224 - choosing the right features