## Project 3

This is a collaborative project conducted by the Fall 2017 students of DATA 620 at The City University of New York, in partial fulfillment of the requirements for the MS in Data Science degree.

### Problem Description

This is a Team Project! For this project, please work with the entire class as one collaborative group! Your project should be submitted (as an IPython Notebook via GitHub) by end of day on Monday, October 25th. The group should present their code and findings in our meet-up on Tuesday October 26th. The ability to be an effective member of a virtual team is highly valued in the data science job market.

Using any of the three classifiers described in chapter 6 of Natural Language Processing with Python, and any features you can think of, build the best name gender classifier you can. Begin by splitting the Names Corpus into three subsets: 500 words for the test set, 500 words for the dev-test set, and the remaining 6900 words for the training set. Then, starting with the example name gender classifier, make incremental improvements. Use the dev-test set to check your progress. Once you are satisfied with your classifier, check its final performance on the test set. How does the performance on the test set compare to the performance on the dev-test set? Is this what you'd expect?

Source: Natural Language Processing with Python, exercise 6.10.2.

### Contributors Include

* Joy Payton
* Keith Folsom
* Sonya Hong


### First, Obtain the Corpus

Note: If the Names Corpus has not been downloaded yet, run nltk.download('names') first to get access to it.

In [1]:
import nltk
from nltk.corpus import names
import random
import numpy as np
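from nltk.metrics import *  # provides ConfusionMatrix, used further below
import re
import string
# textstat is a third-party package and must be installed separately
from textstat.textstat import textstat
#nltk.download('names')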

In [2]:
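# Build one list of (name, gender) pairs. Note that this rebinds `names`,
# which until now referred to the corpus reader imported above.
names = ([(name, 'male') for name in names.words('male.txt')] +
         [(name, 'female') for name in names.words('female.txt')])

# random.shuffle works in place and returns None, so its result must not be assigned:
#names = random.shuffle(names)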

In [10]:
random.shuffle(names)

# let's see what the randomly shuffled names look like
names[1:10]

[('TeresaAnne', 'female'),
 ('Gene', 'female'),
 ('Penelopa', 'female'),
 ('Eudora', 'female'),
 ('Ursuline', 'female'),
 ('Ray', 'male'),
 ('Carlena', 'female'),
 ('Deny', 'female'),
 ('Linoel', 'male')]

### Create three subsets for development and error analysis of the models

##### Development set:
* 6,944 names for the training set (everything left after the two 500-name subsets are removed)
* 500 names for the dev-test set

##### Test set:
* 500 names for the test set

In [4]:
test_names, devtest_names, train_names = names[0:500], names[500:1000], names[1000:]

In [5]:
# Confirm the size of the three subsets
print("Training Set = {}".format(len(train_names)))
print("Dev-Test Set = {}".format(len(devtest_names)))
print("Test Set = {}".format(len(test_names)))

Training Set = 6944
Dev-Test Set = 500
Test Set = 500
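
Because `random.shuffle` is called without a seed, each run of the notebook produces a different split, so the accuracies reported further down will not reproduce exactly. The following is a minimal sketch of a repeatable split, assuming a fixed seed is acceptable for this exercise. The helper name `split_names`, the seed value, and the toy data are hypothetical and not part of the original notebook.

```python
import random

def split_names(labeled_names, n_test=500, n_devtest=500, seed=620):
    """Shuffle a copy of labeled_names and split it into train/dev-test/test."""
    shuffled = list(labeled_names)          # leave the caller's list untouched
    random.Random(seed).shuffle(shuffled)   # local RNG; global random state is not changed
    test = shuffled[:n_test]
    devtest = shuffled[n_test:n_test + n_devtest]
    train = shuffled[n_test + n_devtest:]
    return train, devtest, test

# Toy stand-in for the (name, gender) pairs built above
toy_names = [("name%d" % i, "female" if i % 3 else "male") for i in range(7944)]
train, devtest, test = split_names(toy_names)
print(len(train), len(devtest), len(test))
```

Calling `split_names` twice with the same seed returns the same three subsets, which makes it easier to compare feature extractors across sessions.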


### Feature Extractor Functions

This section incrementally improves the feature extraction functions, which are subsequently applied to the development and test datasets.

In [6]:
# book example
def gender_features(name):
    return {'last_letter': name[-1]}

# most names beginning with a vowel are associated with females
def gender_features2(name):
    return {'first_letter': name[0]}

# feature extractor from the book that overfits
def gender_features3(name):
    features = {}
    features["firstletter"] = name[0].lower()
    features["lastletter"] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features["count(%s)" % letter] = name.lower().count(letter)
        features["has(%s)" % letter] = (letter in name.lower())
    return features

# suffix-based feature extractor from the book
def gender_features4(name):
    return {"suffix1": name[-1:], "suffix2": name[-2:]}

In [22]:
gender_features3('amber')
gender_features4('amber')

{'suffix1': 'r', 'suffix2': 'er'}
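
Only the last expression in a cell is displayed, so the output above shows `gender_features4` but not `gender_features3`. As a rough illustration of how much larger the third extractor's output is, the sketch below repeats that extractor (so the block runs on its own) and counts its features. The variable names `feats` and `nonzero` are introduced here for illustration only.

```python
def gender_features3(name):
    # Same logic as the extractor defined above, repeated so this sketch is self-contained
    lowered = name.lower()
    features = {"firstletter": lowered[0], "lastletter": lowered[-1]}
    for letter in "abcdefghijklmnopqrstuvwxyz":
        features["count(%s)" % letter] = lowered.count(letter)
        features["has(%s)" % letter] = letter in lowered
    return features

feats = gender_features3("amber")

# Keep only the features that carry a signal for this name (drop 0 counts and False flags)
nonzero = {key: value for key, value in feats.items() if value not in (0, False)}

print(len(feats))     # total number of features per name
print(nonzero)
```

Most of the features are zero or `False` for any single name, which is one reason this extractor is prone to overfitting on a small training set.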

### Helper Functions

In [7]:
# Generic function to generate an error list based on the arguments provided
# Accepts the classifier, names dataset, and the extractor function
# Returns the list of errors as (tag, guess, name) tuples

def generate_errors(classifier, dataset, extractor_function): 
    
    errors = [] 

    for (name, tag) in dataset:
        guess = classifier.classify(extractor_function(name)) 
        if guess != tag: 
            errors.append((tag, guess, name))
            
    return errors

In [8]:
# Generic function to display classification errors
# Accepts the error list and an optional limit n; when n is given, only the
# first n errors (in their original order) are kept, and those are then sorted

def show_errors(errors, n=None):

    if n is not None:
        errors = errors[:n]

    for (tag, guess, name) in sorted(errors):
        print('correct=%-8s guess=%-8s name=%-30s' % (tag, guess, name))
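
Long error listings are hard to scan for patterns. One possible complement, sketched here under the assumption that errors keep the `(tag, guess, name)` layout returned by `generate_errors`, is to tally them by direction and by final letter. The helper `summarize_errors` and the sample tuples are hypothetical.

```python
from collections import Counter

def summarize_errors(errors):
    """Count errors by (correct tag, guessed tag) and by the final letter of the name."""
    by_direction = Counter((tag, guess) for (tag, guess, name) in errors)
    by_last_letter = Counter(name[-1].lower() for (tag, guess, name) in errors)
    return by_direction, by_last_letter

# Illustrative error tuples in the (tag, guess, name) format
sample_errors = [
    ("female", "male", "Amber"),
    ("female", "male", "Christen"),
    ("female", "male", "Allison"),
    ("male", "female", "Cole"),
]

by_direction, by_last_letter = summarize_errors(sample_errors)
print(by_direction.most_common())
print(by_last_letter.most_common(3))
```

A lopsided direction count, or a single letter that dominates the tally, can suggest which feature to add next.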

In [11]:
[(gender_features(n), g)  for (n, g) in train_names]

[({'last_letter': 'e'}, 'female'),
 ({'last_letter': 'i'}, 'female'),
 ({'last_letter': 'd'}, 'male'),
 ({'last_letter': 'a'}, 'female'),
 ({'last_letter': 'e'}, 'female'),
 ({'last_letter': 'b'}, 'male'),
 ({'last_letter': 'a'}, 'female'),
 ({'last_letter': 'e'}, 'male'),
 ({'last_letter': 'e'}, 'female'),
 ({'last_letter': 'c'}, 'male'),
 ({'last_letter': 'a'}, 'female'),
 ({'last_letter': 'n'}, 'male'),
 ({'last_letter': 'a'}, 'female'),
 ({'last_letter': 'i'}, 'female'),
 ({'last_letter': 'n'}, 'female'),
 ({'last_letter': 'e'}, 'female'),
 ({'last_letter': 'y'}, 'female'),
 ({'last_letter': 't'}, 'male'),
 ({'last_letter': 'y'}, 'male'),
 ({'last_letter': 'e'}, 'female'),
 ({'last_letter': 'd'}, 'female'),
 ({'last_letter': 'i'}, 'female'),
 ({'last_letter': 'y'}, 'female'),
 ({'last_letter': 'y'}, 'male'),
 ({'last_letter': 'd'}, 'male'),
 ({'last_letter': 'c'}, 'male'),
 ({'last_letter': 'e'}, 'female'),
 ({'last_letter': 'k'}, 'male'),
 ({'last_letter': 'e'}, 'female'),
 ...

### Gender Identification Models (Try Some Models)

### Naive Bayes
#### Gender Classification Model 1

In [9]:
# apply the first gender_features extractor function from the book to the three datasets
train_set = [(gender_features(n), g)  for (n, g) in train_names]
devtest_set = [(gender_features(n), g)  for (n, g) in devtest_names]
test_set = [(gender_features(n), g)  for (n, g) in test_names]

In [13]:
classifier = nltk.NaiveBayesClassifier.train(train_set) 

In [14]:
# Examine the likelihood ratios
classifier.show_most_informative_features(5)

Most Informative Features
             last_letter = 'a'            female : male   =     35.4 : 1.0
             last_letter = 'k'              male : female =     27.9 : 1.0
             last_letter = 'f'              male : female =     15.3 : 1.0
             last_letter = 'p'              male : female =      9.9 : 1.0
             last_letter = 'd'              male : female =      9.2 : 1.0
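
Each row above compares how often a feature value occurs with one label versus the other. The sketch below computes such ratios from plain relative frequencies on toy data. It is a simplification: NLTK's Naive Bayes classifier smooths its probability estimates, so its numbers will differ, and the function `likelihood_ratios` and the toy counts are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def likelihood_ratios(labeled_features, feature):
    """Return {value: (ratio, more_likely_label)} from unsmoothed relative frequencies."""
    label_totals = Counter()
    value_counts = defaultdict(Counter)   # label -> Counter of values seen for `feature`
    for features, label in labeled_features:
        label_totals[label] += 1
        value_counts[label][features[feature]] += 1

    labels = list(label_totals)
    values = set().union(*(value_counts[label] for label in labels))

    ratios = {}
    for value in values:
        # P(feature = value | label) for every label
        probs = {label: value_counts[label][value] / label_totals[label] for label in labels}
        likely = max(probs, key=probs.get)
        unlikely = min(probs, key=probs.get)
        if probs[unlikely] > 0:   # without smoothing, unseen combinations have no finite ratio
            ratios[value] = (probs[likely] / probs[unlikely], likely)
    return ratios

# Toy data: 10 female and 5 male names described only by their last letter
toy = ([({"last_letter": "a"}, "female")] * 8 + [({"last_letter": "n"}, "female")] * 2 +
       [({"last_letter": "a"}, "male")] * 1 + [({"last_letter": "n"}, "male")] * 4)

for value, (ratio, label) in sorted(likelihood_ratios(toy, "last_letter").items()):
    print("last_letter = %r  favours %-6s  %.1f : 1.0" % (value, label, ratio))
```

In the toy data a final `a` appears in 80% of female names and 20% of male names, giving a ratio of 4 to 1 in favour of female.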


In [16]:
print(nltk.classify.accuracy(classifier, devtest_set))

0.794


In [13]:
# display the classification errors
# calls the helper functions above
show_errors(generate_errors(classifier, devtest_names, gender_features))

correct=female   guess=male     name=Adel                          
correct=female   guess=male     name=Aleen                         
correct=female   guess=male     name=Alexis                        
correct=female   guess=male     name=Allison                       
correct=female   guess=male     name=Allyn                         
correct=female   guess=male     name=Amber                         
correct=female   guess=male     name=Babs                          
correct=female   guess=male     name=Caitlin                       
correct=female   guess=male     name=Caril                         
correct=female   guess=male     name=Charmion                      
correct=female   guess=male     name=Charo                         
correct=female   guess=male     name=Christen                      
correct=female   guess=male     name=Christian                     
correct=female   guess=male     name=Chrystal                      
correct=female   guess=male     name=Ciel       

#### Gender Classification Model 2

In [23]:
# apply the gender_features3 extractor function from the book to the three datasets

train_set = [(gender_features3(n), g)  for (n, g) in train_names]
devtest_set = [(gender_features3(n), g)  for (n, g) in devtest_names]
test_set = [(gender_features3(n), g)  for (n, g) in test_names]

classifier = nltk.NaiveBayesClassifier.train(train_set) 

## accuracy
print(nltk.classify.accuracy(classifier, devtest_set))

0.788


In [24]:
# Examine the likelihood ratios
classifier.show_most_informative_features(20)

Most Informative Features
              lastletter = 'a'            female : male   =     35.4 : 1.0
              lastletter = 'k'              male : female =     27.9 : 1.0
              lastletter = 'f'              male : female =     15.3 : 1.0
              lastletter = 'p'              male : female =      9.9 : 1.0
              lastletter = 'd'              male : female =      9.2 : 1.0
              lastletter = 'v'              male : female =      9.2 : 1.0
              lastletter = 'o'              male : female =      9.1 : 1.0
                count(v) = 2              female : male   =      8.4 : 1.0
              lastletter = 'm'              male : female =      8.0 : 1.0
              lastletter = 'r'              male : female =      6.6 : 1.0
                count(a) = 3              female : male   =      5.7 : 1.0
              lastletter = 'w'              male : female =      5.5 : 1.0
              lastletter = 'z'              male : female =      5.1 : 1.0

In [16]:
show_errors(generate_errors(classifier, devtest_names, gender_features3), 30)

correct=female   guess=male     name=Amber                         
correct=female   guess=male     name=Ardyth                        
correct=female   guess=male     name=Betsy                         
correct=female   guess=male     name=Christen                      
correct=female   guess=male     name=Dorcas                        
correct=female   guess=male     name=Gerry                         
correct=female   guess=male     name=Guenevere                     
correct=female   guess=male     name=Gunvor                        
correct=female   guess=male     name=Honey                         
correct=female   guess=male     name=Margot                        
correct=female   guess=male     name=Meridith                      
correct=female   guess=male     name=Mikako                        
correct=female   guess=male     name=Odette                        
correct=female   guess=male     name=Tobie                         
correct=female   guess=male     name=Wendy      

#### Gender Classification Model 3

In [26]:
# apply the gender_features4 extractor function from the book to the three datasets

train_set = [(gender_features4(n), g)  for (n, g) in train_names]
devtest_set = [(gender_features4(n), g)  for (n, g) in devtest_names]
test_set = [(gender_features4(n), g)  for (n, g) in test_names]

classifier = nltk.NaiveBayesClassifier.train(train_set) 

## accuracy
print(nltk.classify.accuracy(classifier, devtest_set))

0.81


In [27]:
# Examine the likelihood ratios
classifier.show_most_informative_features(20)

Most Informative Features
                 suffix2 = 'na'           female : male   =    151.7 : 1.0
                 suffix2 = 'la'           female : male   =     69.8 : 1.0
                 suffix2 = 'ia'           female : male   =     37.8 : 1.0
                 suffix1 = 'a'            female : male   =     35.4 : 1.0
                 suffix2 = 'sa'           female : male   =     32.2 : 1.0
                 suffix2 = 'ta'           female : male   =     29.4 : 1.0
                 suffix1 = 'k'              male : female =     27.9 : 1.0
                 suffix2 = 'us'             male : female =     26.4 : 1.0
                 suffix2 = 'io'             male : female =     25.0 : 1.0
                 suffix2 = 'do'             male : female =     25.0 : 1.0
                 suffix2 = 'ra'           female : male   =     24.1 : 1.0
                 suffix2 = 'ld'             male : female =     23.0 : 1.0
                 suffix2 = 'rd'             male : female =     22.8 : 1.0

In [19]:
show_errors(generate_errors(classifier, devtest_names, gender_features4), 30)

correct=female   guess=male     name=Adel                          
correct=female   guess=male     name=Allison                       
correct=female   guess=male     name=Amber                         
correct=female   guess=male     name=Ardyth                        
correct=female   guess=male     name=Caril                         
correct=female   guess=male     name=Charo                         
correct=female   guess=male     name=Christen                      
correct=female   guess=male     name=Dorcas                        
correct=female   guess=male     name=Gerry                         
correct=female   guess=male     name=Gunvor                        
correct=female   guess=male     name=Jaleh                         
correct=female   guess=male     name=Karel                         
correct=female   guess=male     name=Linell                        
correct=female   guess=male     name=Lisabeth                      
correct=female   guess=male     name=Margot     

### Decision Tree
#### Gender Classification Model 4

In [28]:
# apply the first gender_features extractor function from the book to the three datasets
train_set = [(gender_features(n), g)  for (n, g) in train_names]
devtest_set = [(gender_features(n), g)  for (n, g) in devtest_names]
test_set = [(gender_features(n), g)  for (n, g) in test_names]
classifier = nltk.DecisionTreeClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, devtest_set))

0.794


In [29]:
# display the classification errors
# calls the helper functions above
show_errors(generate_errors(classifier, devtest_names, gender_features))

correct=female   guess=male     name=Anais                         
correct=female   guess=male     name=Brett                         
correct=female   guess=male     name=Camel                         
correct=female   guess=male     name=Cameo                         
correct=female   guess=male     name=Carlin                        
correct=female   guess=male     name=Cloris                        
correct=female   guess=male     name=Con                           
correct=female   guess=male     name=Daloris                       
correct=female   guess=male     name=Demeter                       
correct=female   guess=male     name=Demetris                      
correct=female   guess=male     name=Dew                           
correct=female   guess=male     name=Ealasaid                      
correct=female   guess=male     name=Emlyn                         
correct=female   guess=male     name=Fawn                          
correct=female   guess=male     name=Flo        

#### Gender Classification Model 5

In [30]:
# apply the gender_features3 extractor function from the book to the three datasets

train_set = [(gender_features3(n), g)  for (n, g) in train_names]
devtest_set = [(gender_features3(n), g)  for (n, g) in devtest_names]
test_set = [(gender_features3(n), g)  for (n, g) in test_names]

classifier = nltk.DecisionTreeClassifier.train(train_set)

## accuracy
print(nltk.classify.accuracy(classifier, devtest_set))

0.818


In [31]:
show_errors(generate_errors(classifier, devtest_names, gender_features3), 30)

correct=female   guess=male     name=Ann-Mari                      
correct=female   guess=male     name=Avivah                        
correct=female   guess=male     name=Bobbi                         
correct=female   guess=male     name=Con                           
correct=female   guess=male     name=Demetris                      
correct=female   guess=male     name=Flo                           
correct=female   guess=male     name=Gerhardine                    
correct=female   guess=male     name=Lauren                        
correct=female   guess=male     name=Noreen                        
correct=female   guess=male     name=Odille                        
correct=female   guess=male     name=Oreste                        
correct=female   guess=male     name=Wally                         
correct=female   guess=male     name=Waly                          
correct=male     guess=female   name=Alden                         
correct=male     guess=female   name=Berke      

#### Gender Classification Model 6

In [32]:
# apply the gender_features4 extractor function from the book to the three datasets

train_set = [(gender_features4(n), g)  for (n, g) in train_names]
devtest_set = [(gender_features4(n), g)  for (n, g) in devtest_names]
test_set = [(gender_features4(n), g)  for (n, g) in test_names]

classifier = nltk.DecisionTreeClassifier.train(train_set)

## accuracy
print(nltk.classify.accuracy(classifier, devtest_set))

0.83


In [35]:
show_errors(generate_errors(classifier, devtest_names, gender_features4), 30)

correct=female   guess=male     name=Con                           
correct=female   guess=male     name=Flo                           
correct=female   guess=male     name=Jaclin                        
correct=female   guess=male     name=Leigh                         
correct=female   guess=male     name=Mame                          
correct=female   guess=male     name=Nil                           
correct=female   guess=male     name=Pier                          
correct=male     guess=female   name=Alden                         
correct=male     guess=female   name=Arnie                         
correct=male     guess=female   name=Chrisy                        
correct=male     guess=female   name=Clancy                        
correct=male     guess=female   name=Cole                          
correct=male     guess=female   name=Cyrille                       
correct=male     guess=female   name=Donn                          
correct=male     guess=female   name=Dru        

### Model Selection (Choose Best Candidate) 

#### Check the model's final performance on the test set. 

#### How does the performance on the test set compare to the performance on the dev-test set? 

#### Is this what you'd expect?
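
The questions above call for one final comparison between dev-test and test accuracy. Since the feature extractors were tuned while looking at dev-test errors, a somewhat lower test score would not be surprising. The sketch below shows the shape of that comparison with a toy rule standing in for a trained classifier. The functions `accuracy` and `last_letter_rule` and both toy lists are hypothetical, and the numbers say nothing about the real models.

```python
def accuracy(classify, labeled_names):
    """Fraction of (name, label) pairs for which classify(name) returns the label."""
    correct = sum(1 for name, label in labeled_names if classify(name) == label)
    return correct / len(labeled_names)

def last_letter_rule(name):
    # Hypothetical stand-in classifier: names ending in a, e, or i are guessed female
    return "female" if name[-1].lower() in "aei" else "male"

# Toy stand-ins for devtest_names and test_names
toy_devtest = [("Carlena", "female"), ("Ray", "male"), ("Amber", "female"), ("Aldo", "male")]
toy_test = [("Eudora", "female"), ("Allison", "female"), ("Brody", "male"), ("Fay", "female")]

devtest_acc = accuracy(last_letter_rule, toy_devtest)
test_acc = accuracy(last_letter_rule, toy_test)
print("dev-test = {:.3f}, test = {:.3f}, gap = {:+.3f}".format(
    devtest_acc, test_acc, test_acc - devtest_acc))
```

With the real data, the same comparison would use the chosen classifier and extractor on `devtest_names` and `test_names`, and the test set would be scored only once, after model selection is finished.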

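In [5]:
# textstat is a third-party package; this cell fails until it is installed
# (for example with `pip install textstat`)
import re
import string
from textstat.textstat import textstat

ModuleNotFoundError: No module named 'textstat'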

In [85]:
class NLP_Classifier():
    def __init__(self,model):
        self.model = model
        #self.feat_num = feat_num
        
        #train_set = [(gender_features3(n), g)  for (n, g) in train_names]
        #devtest_set = [(gender_features3(n), g)  for (n, g) in devtest_names]
        #test_set = [(gender_features3(n), g)  for (n, g) in test_names]


    def get_features(self,name,feat_num):
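        '''
        Parameters:
            name - name (string) to extract features from
            feat_num - an int, or an iterable collection of ints, selecting the features
                1: last letter
                2: first letter
                3: Vowel count
                4: Hard consonant count
                5: Soft consonant count
                6: Syllable count
                7: Name length
                8: Last two chars
                9: Last three chars
                10: char count --> feature for all alpha chars
                11: char present --> feature for all alpha chars (boolean)
                12: char counts only (discards any features selected before it)
                13: boolean for every two-letter combination (discards earlier features)
                14: first two chars
                15: first/last letter, last two and three chars, char counts,
                    char and two-letter booleans (discards earlier features)
                16: hand-picked substring and suffix booleans (discards earlier features)
        Returns:
            features: a dictionary of extracted features
        '''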
        features = {}
        
        
        
        # Wrap a single int in a tuple so the membership tests below work
        # (0 is a placeholder that matches no feature)
        if isinstance(feat_num, int):
            feat_num = (0, feat_num)
       
        # Gender Feature 1: Last letter - book example
        if 1 in feat_num:
            features['last_letter'] = name[-1].lower()
            
        # Gender Feature 2: First letter - most names beginning with a vowel --> females
        if 2 in feat_num:
            features['first_letter'] = name[0].lower()
            
        # Gender Feature 3: Vowel Counts
        if 3 in feat_num:
            features['vowel_count'] = len(re.sub(r'[^aeiou]', '', name.lower()))
            
        # Gender Feature 4: Hard consonants using general rules of c and g
        if 4 in feat_num:
            features['hard_consts'] = len(re.findall(r'[cg][^eiy]', name.lower()))/2
            
        # Gender Feature 5: Soft consonants using general rules of c and g
        if 5 in feat_num:
            features['soft_consts'] = len(re.findall(r'[cg][eiy]', name.lower()))/2
            
        # Gender Feature 6: Syllable Count of names via textstat
        if 6 in feat_num:
            features['syllable_count'] = textstat.syllable_count(name.lower())
    
        # Gender Feature 7: Name length
        if 7 in feat_num:
            features["length"] = len(name)
        
        # Gender Feature 8: Last two chars
        if 8 in feat_num:
            features["last2letters"] = name[-2:].lower()
            
        # Gender Feature 9: Last three chars
        if 9 in feat_num:
            features["last3letters"] = name[-3:].lower()
    
        # Gender Feature 10: Char Counts (overfits)
        if 10 in feat_num:
            for letter in string.ascii_lowercase:
                features["count_{0}".format(letter)] = name.lower().count(letter)
                
        # Gender Feature 11: Char Booleans (overfits)
        if 11 in feat_num:
            for letter in string.ascii_lowercase:
                features["has_{0}".format(letter)] = letter in name.lower()
        
        
        if 12 in feat_num:
            features = {}
            letters=list(map(chr, range(ord('a'), ord('z') + 1)))
            for letter in letters:
                features["count(%s)" % letter] = name.lower().count(letter)


        if 13 in feat_num:
            features = {}
            letters=list(map(chr, range(ord('a'), ord('z') + 1)))
            for letter1 in letters:
                for letter2 in letters:
                    features["has("+letter1+letter2+")"] = (letter1+letter2 in name.lower())

        if 14 in feat_num:
            features["first2Letters"]=name[0:2].lower()

        if 15 in feat_num:
            features = {}
            features["firstletter"] = name[0].lower()
            features["lastletter"] = name[-1].lower()
            features["last2letter"] = name[-2:].lower()
            features["last3letter"] = name[-3:].lower()

            letters=list(map(chr, range(ord('a'), ord('z') + 1)))
            for letter1 in letters:
                features["count("+letter1+")"] = name.lower().count(letter1)
                features["has("+letter1+")"] = (letter1 in name.lower())
                # iterate over 2-grams
                for letter2 in letters:

                    features["has("+letter1+letter2+")"] = (letter1+letter2 in name.lower())


        if 16 in feat_num:
            # define features
            features = {}
            # has(fo) = True
            features["has(fo)"] = ('fo' in name.lower())
            # has(hu) = True
            features["has(hu)"] = ('hu' in name.lower())
            # has(rv) = True
            features["has(rv)"] = ('rv' in name.lower())    
            # has(rw) = True
            features["has(rw)"] = ('rw' in name.lower()) 
            # has(sp) = True
            features["has(sp)"] = ('sp' in name.lower())

            # lastletter = 'a'
            features["lastletter=a"] = ('a' in name[-1:].lower())
            # lastletter = 'f'
            features["lastletter=f"] = ('f' in name[-1:].lower())
            # lastletter = 'k'
            features["lastletter=k"] = ('k' in name[-1:].lower())

            # last2letter = 'ch'
            features["last2letter=ch"] = ('ch' in name[-2:].lower())
            # last2letter = 'do'
            features["last2letter=do"] = ('do' in name[-2:].lower())
            # last2letter = 'ia'
            features["last2letter=ia"] = ('ia' in name[-2:].lower())
            # last2letter = 'im'
            features["last2letter=im"] = ('im' in name[-2:].lower())
            # last2letter = 'io'
            features["last2letter=io"] = ('io' in name[-2:].lower())
            # last2letter = 'la'
            features["last2letter=la"] = ('la' in name[-2:].lower())
            # last2letter = 'ld'
            features["last2letter=ld"] = ('ld' in name[-2:].lower())
            # last2letter = 'na'
            features["last2letter=na"] = ('na' in name[-2:].lower())
            # last2letter = 'os'
            features["last2letter=os"] = ('os' in name[-2:].lower())
            # last2letter = 'ra'
            features["last2letter=ra"] = ('ra' in name[-2:].lower())
            # last2letter = 'rd'
            features["last2letter=rd"] = ('rd' in name[-2:].lower())
            # last2letter = 'rt'
            features["last2letter=rt"] = ('rt' in name[-2:].lower())
            # last2letter = 'sa'
            features["last2letter=sa"] = ('sa' in name[-2:].lower())
            # last2letter = 'ta'
            features["last2letter=ta"] = ('ta' in name[-2:].lower())
            # last2letter = 'us'
            features["last2letter=us"] = ('us' in name[-2:].lower())

            # last3letter = 'ana'
            features["last3letter=ana"] = ('ana' in name[-3:].lower())    
            # last3letter = u'ard'
            features["last3letter=ard"] = ('ard' in name[-3:].lower())        
            # last3letter = u'ita'
            features["last3letter=ita"] = ('ita' in name[-3:].lower())    
            # last3letter = u'nne'
            features["last3letter=nne"] = ('nne' in name[-3:].lower())    
            # last3letter = u'tta'
            features["last3letter=tta"] = ('tta' in name[-3:].lower())    
        
        return features

    
    def show_errors(self, errors, n=None):
        if n is not None: errors = errors[:n]          
        for (tag, guess, name) in sorted(errors): 
            print('correct=%-8s guess=%-8s name=%-30s' %(tag, guess, name))
        return None 
    
    
    def classifier_report(self,classifier,dataset,feat_num):
        feat_num = int(feat_num)
        dataset_predictions = [classifier.classify(self.get_features(n,feat_num))  for (n, g) in dataset]
        dataset_gold = [g  for (n, g) in dataset]
        cm=ConfusionMatrix(dataset_gold, dataset_predictions)
        print(cm)
    
    def fit_model(self,feat_num):
        # Trains one classifier per feature number from 1 up to, but not
        # including, feat_num, and prints a report for each of them
        for i in range(1, feat_num):
            feature = int(i)
            errors = []

            train_set = [(self.get_features(n,feature), g)  for (n, g) in train_names]
            devtest_set = [(self.get_features(n,feature), g)  for (n, g) in devtest_names]
            test_set = [(self.get_features(n,feature), g)  for (n, g) in test_names]
            classifier = self.model.train(train_set)

            # For errors list
            for (name, tag) in devtest_names:
                guess = classifier.classify(self.get_features(name,feature))
                if guess != tag:
                    errors.append((tag, guess, name))

            # Print errors (n=0 suppresses the listing; raise it to see examples)
            self.show_errors(errors, 0)

            # Print classifier report (confusion matrix computed on the training names)
            self.classifier_report(classifier,train_names,feature)

            # Only for NaiveBayes
            if self.model == nltk.NaiveBayesClassifier:
                classifier.show_most_informative_features(5)

            print("Accuracy of model {} using feature {}:{}".format(self.model,feature,nltk.classify.accuracy(classifier, devtest_set)))
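
The class above selects features through a long chain of `if` blocks and is driven one feature number at a time. As one possible alternative, sketched here rather than taken from the notebook, a dictionary of small extractor functions makes it straightforward to combine several features in one call. The names `FEATURE_EXTRACTORS` and `extract_features` are hypothetical, and only a handful of the features are reproduced.

```python
# Hypothetical registry: each feature number maps to a small function that
# returns the features it contributes for an already lowercased name.
FEATURE_EXTRACTORS = {
    1: lambda name: {"last_letter": name[-1]},
    2: lambda name: {"first_letter": name[0]},
    7: lambda name: {"length": len(name)},
    8: lambda name: {"last2letters": name[-2:]},
    9: lambda name: {"last3letters": name[-3:]},
}

def extract_features(name, feat_nums):
    """Merge the features selected by feat_nums (an int or an iterable of ints)."""
    if isinstance(feat_nums, int):
        feat_nums = (feat_nums,)
    lowered = name.lower()
    features = {}
    for num in feat_nums:
        features.update(FEATURE_EXTRACTORS[num](lowered))
    return features

print(extract_features("Aubry", (1, 2, 8)))
```

Adding a new feature then means adding one dictionary entry, and no extractor can silently discard the features selected before it.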
            
        

In [86]:
clf = NLP_Classifier(nltk.NaiveBayesClassifier)
clf.fit_model(16)

       |    f      |
       |    e      |
       |    m    m |
       |    a    a |
       |    l    l |
       |    e    e |
-------+-----------+
female |<3594> 782 |
  male |  859<1709>|
-------+-----------+
(row = reference; col = test)

Most Informative Features
             last_letter = 'a'            female : male   =     32.3 : 1.0
             last_letter = 'k'              male : female =     30.9 : 1.0
             last_letter = 'f'              male : female =     28.9 : 1.0
             last_letter = 'p'              male : female =     14.2 : 1.0
             last_letter = 'v'              male : female =     10.5 : 1.0
Accuracy of model <class 'nltk.classify.naivebayes.NaiveBayesClassifier'> using feature 1:0.78
       |    f      |
       |    e      |
       |    m    m |
       |    a    a |
       |    l    l |
       |    e    e |
-------+-----------+
female |<4173> 203 |
  male | 2234 <334>|
-------+-----------+
(row = reference; col = test)

Most Informative Features
...

       |    f      |
       |    e      |
       |    m    m |
       |    a    a |
       |    l    l |
       |    e    e |
-------+-----------+
female |<3729> 647 |
  male | 1006<1562>|
-------+-----------+
(row = reference; col = test)

Most Informative Features
                 has(rv) = True             male : female =     29.0 : 1.0
                 has(hu) = True             male : female =     23.5 : 1.0
                 has(fo) = True             male : female =     21.5 : 1.0
                 has(rw) = True             male : female =     17.6 : 1.0
                 has(sp) = True             male : female =     16.5 : 1.0
Accuracy of model <class 'nltk.classify.naivebayes.NaiveBayesClassifier'> using feature 13:0.764
       |    f      |
       |    e      |
       |    m    m |
       |    a    a |
       |    l    l |
       |    e    e |
-------+-----------+
female |<3946> 430 |
  male | 1738 <830>|
-------+-----------+
(row = reference; col = test)

Most Informative Features
...
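
Note that `classifier_report` is called with `train_names`, so each confusion matrix above describes the 6,944 training names, while the accuracy line printed after it is measured on the dev-test set. The sketch below derives accuracy, recall, precision, and a majority-class baseline from the first matrix. The counts are copied from that output; the function names are hypothetical helpers, not NLTK calls.

```python
# Counts copied from the first confusion matrix above (rows = reference, columns = predicted)
matrix = {
    "female": {"female": 3594, "male": 782},
    "male": {"female": 859, "male": 1709},
}

def accuracy(matrix):
    total = sum(sum(row.values()) for row in matrix.values())
    correct = sum(matrix[label][label] for label in matrix)
    return correct / total

def recall(matrix, label):
    # Of all names whose reference label is `label`, how many were predicted as such
    return matrix[label][label] / sum(matrix[label].values())

def precision(matrix, label):
    # Of all names predicted as `label`, how many really carry that label
    predicted = sum(row[label] for row in matrix.values())
    return matrix[label][label] / predicted

def majority_baseline(matrix):
    # Accuracy of always guessing the most frequent reference label
    row_totals = [sum(row.values()) for row in matrix.values()]
    return max(row_totals) / sum(row_totals)

print("accuracy = {:.3f}".format(accuracy(matrix)))
print("baseline = {:.3f}".format(majority_baseline(matrix)))
for label in matrix:
    print("{:6s} recall = {:.3f}  precision = {:.3f}".format(
        label, recall(matrix, label), precision(matrix, label)))
```

Comparing each model against the majority-class baseline helps separate features that carry real signal from those that mostly push every name toward the larger class.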

In [65]:
clf = NLP_Classifier(nltk.DecisionTreeClassifier)

       |    f      |
       |    e      |
       |    m    m |
       |    a    a |
       |    l    l |
       |    e    e |
-------+-----------+
female |<4376>   . |
  male | 2568   <.>|
-------+-----------+
(row = reference; col = test)

Most Informative Features
Accuracy of model <class 'nltk.classify.naivebayes.NaiveBayesClassifier'> using feature 0:0.638
       |    f      |
       |    e      |
       |    m    m |
       |    a    a |
       |    l    l |
       |    e    e |
-------+-----------+
female |<3594> 782 |
  male |  859<1709>|
-------+-----------+
(row = reference; col = test)

Most Informative Features
             last_letter = 'a'            female : male   =     32.3 : 1.0
             last_letter = 'k'              male : female =     30.9 : 1.0
             last_letter = 'f'              male : female =     28.9 : 1.0
             last_letter = 'p'              male : female =     14.2 : 1.0
             last_letter = 'v'              male : female =     10.5 :

In [68]:
train_set_predictions = [classifier.classify(gender_features10(n))  for (n, g) in train_names]
train_set_gold = [g  for (n, g) in train_names]
cm=ConfusionMatrix(train_set_gold, train_set_predictions)

NameError: name 'classifier' is not defined

In [57]:
train_set = [(gender_features(n), g)  for (n, g) in train_names]
devtest_set = [(gender_features(n), g)  for (n, g) in devtest_names]
test_set = [(gender_features(n), g)  for (n, g) in test_names]

In [66]:
train_set = [(clf.get_features(n,1), g)  for (n, g) in train_names]


In [22]:
class WithClass():
    def __init__(self):
        self.value = "Bob"
    def my_func(self):
        print(self.value)
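    def test_class(self):
        # A bare my_func is not in scope inside a method; it has to be
        # reached through the instance, i.e. self.my_func()
        my_func(self)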

In [21]:
WithClass().my_func()

Bob


In [23]:
WithClass().test_class()

NameError: name 'my_func' is not defined
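
The `NameError` arises because method names live on the class and its instances, not in the enclosing scope of another method. A minimal corrected sketch follows, with the one assumption that the method returns the value instead of printing it, which makes the result easier to check.

```python
class WithClass:
    def __init__(self):
        self.value = "Bob"

    def my_func(self):
        # Returns rather than prints (a deliberate change for this sketch)
        return self.value

    def test_class(self):
        # Methods are looked up on the instance, so the call goes through self
        return self.my_func()

print(WithClass().test_class())
```

The same rule explains why `get_features` has to be called as `self.get_features(...)` inside `NLP_Classifier`.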

In [28]:
names[10]

def gender_fea(name):
    return {'last_letter': name[-1]}

In [34]:
names[10][0]

'Aubry'

In [42]:
# get_features is a method of NLP_Classifier, so call it on the clf instance
clf.get_features(names[10][0],2)
#gender_fea(names[10][0])

{'first_letter': 'a'}

[('Anjela', 'female'),
 ('Lane', 'male'),
 ('Morganne', 'female'),
 ('Mauricio', 'male'),
 ('Lela', 'female'),
 ('Hamil', 'male'),
 ('Dorothy', 'female'),
 ('Patrice', 'female'),
 ('Broderic', 'male'),
 ('Nonah', 'female'),
 ('Aubry', 'female'),
 ('Malory', 'female'),
 ('Gonzalo', 'male'),
 ('Shalom', 'male'),
 ('Gilligan', 'female'),
 ('Caro', 'female'),
 ('Xymenes', 'male'),
 ('Rosalie', 'female'),
 ('Page', 'female'),
 ('Catha', 'female'),
 ('Xylia', 'female'),
 ('Karita', 'female'),
 ('Lita', 'female'),
 ('Kylie', 'female'),
 ('Brody', 'male'),
 ('Binnie', 'female'),
 ('Hannis', 'female'),
 ('Betta', 'female'),
 ('Johnna', 'female'),
 ('Theressa', 'female'),
 ('Ursulina', 'female'),
 ('Fay', 'female'),
 ('Yuri', 'male'),
 ('Cherianne', 'female'),
 ('Westbrook', 'male'),
 ('Hallie', 'female'),
 ('Kirbie', 'female'),
 ('Marcellina', 'female'),
 ('Aldo', 'male'),
 ('Nicoline', 'female'),
 ('Korrie', 'female'),
 ('Bartholomeus', 'male'),
 ('Charmane', 'female'),
 ('Phillis', 'female'),