## **NLP: Classifying Text**

**Submitted by:** Euclides

**Course:** CUNY DATA 620

**Data Source:** NLTK Names Package Corpus 

### **Introduction**

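Using the 'Names' corpus bundled with the NLTK package, a Naive Bayes classifier is trained to predict whether a given first name is male or female. Several simple letter-based features are compared on a development set, and the best one is then evaluated on a held-out test set.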


In [1]:
import nltk
from nltk.corpus import names
import random

In [2]:
names.fileids()

['female.txt', 'male.txt']

In [3]:
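#Build one list of (name, gender) tuples from both files
#Note: this rebinds `names` from the corpus reader to a plain list
names = ([(name, 'male') for name in names.words('male.txt')]+
         [(name, 'female') for name in names.words('female.txt')])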



In [4]:
names[:10]

[('Aamir', 'male'),
 ('Aaron', 'male'),
 ('Abbey', 'male'),
 ('Abbie', 'male'),
 ('Abbot', 'male'),
 ('Abbott', 'male'),
 ('Abby', 'male'),
 ('Abdel', 'male'),
 ('Abdul', 'male'),
 ('Abdulkarim', 'male')]

In [5]:
random.seed(10)
random.shuffle(names)
names[:10]

[('Gabrila', 'female'),
 ('Rosario', 'female'),
 ('Annabella', 'female'),
 ('Mead', 'female'),
 ('Pepe', 'male'),
 ('Malina', 'female'),
 ('Kaari', 'female'),
 ('Gabbie', 'female'),
 ('Tatiania', 'female'),
 ('Philippine', 'female')]

In [6]:
#Length of tuple list
print("Length of Corpus:", len(names))

Length of Corpus: 7944


In [7]:
#Train Set, Dev Set, Test Set 
train_names = names[:500] #First Random 500
dev_names = names[500:1000] #Next Random 500
test_names = names[1000:] #Balance of words for Test Set 
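
The slices above leave most of the corpus in the test set. A minimal sketch of the arithmetic, using a list of integers as a stand-in for the shuffled names (the variable names `items` and `split_sizes` are illustrative and not part of the notebook):

```python
# Stand-in for the shuffled (name, gender) list; only its length matters here.
corpus_size = 7944
items = list(range(corpus_size))

# Same slice boundaries as the notebook cell above.
train = items[:500]
dev = items[500:1000]
test = items[1000:]

split_sizes = {"train": len(train), "dev": len(dev), "test": len(test)}
print(split_sizes)  # {'train': 500, 'dev': 500, 'test': 6944}
```

Because the list was shuffled before slicing, each split should contain a mix of male and female names rather than the alphabetical, male-first order of the raw corpus.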

In [8]:
#Feature 1: First Letter 
def gender_feature1(word):
    return {'first letter': word[0]}

#Feature 2: First Two Letters   
def gender_feature2(word):
    return {'prefix1': word[0],
            'prefix2': word[1]}

#Feature 3: Last Letter 
def gender_feature3(word):
    return {'last letter': word[-1]}

#Feature 4: Last Two Letters 
def gender_feature4(word):
    return {'suffix1': word[-1],
            'suffix2': word[-2]}

#Feature 5: First and Last Letter
def gender_feature5(word):
    return {'suffix1': word[-1],
            'prefix1': word[0]}

#Create a list of functions     
functions = [gender_feature1, gender_feature2, gender_feature3, gender_feature4, gender_feature5]
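
To make the feature dictionaries concrete, here is a small sketch that applies two of the extractors above to one name from the corpus. The two `gender_feature` functions are copied from the cell above; `safe_suffix_features` is a hypothetical variant, not part of the notebook, that shows one way to avoid an `IndexError` if a one-letter string were ever passed to a two-letter feature.

```python
def gender_feature3(word):
    return {'last letter': word[-1]}

def gender_feature5(word):
    return {'suffix1': word[-1],
            'prefix1': word[0]}

def safe_suffix_features(word):
    # Hypothetical variant: slicing never raises, so a one-letter
    # name simply yields an empty string for 'suffix2'.
    return {'suffix1': word[-1:],
            'suffix2': word[-2:-1]}

sample = 'Gabrila'
print(gender_feature3(sample))        # {'last letter': 'a'}
print(gender_feature5(sample))        # {'suffix1': 'a', 'prefix1': 'G'}
print(safe_suffix_features(sample))   # {'suffix1': 'a', 'suffix2': 'l'}
print(safe_suffix_features('A'))      # {'suffix1': 'A', 'suffix2': ''}
```

Note that the features are case-sensitive as written: the first letter is an uppercase character while the last letter is lowercase.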

In [9]:
#Loop to test all features on development list
for i, features in enumerate(functions, start=1):
    train_set = [(features(n), g) for (n, g) in train_names]
    devtest_set = [(features(n), g) for (n, g) in dev_names]
    classifier = nltk.NaiveBayesClassifier.train(train_set)
    dev_accuracy = nltk.classify.accuracy(classifier, devtest_set)
    print("Feature", i, "Accuracy on Development Set: ", dev_accuracy)

    

Feature 1 Accuracy on Development Set:  0.61
Feature 2 Accuracy on Development Set:  0.606
Feature 3 Accuracy on Development Set:  0.734
Feature 4 Accuracy on Development Set:  0.752
Feature 5 Accuracy on Development Set:  0.774
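
For readers curious about what a Naive Bayes classifier does with these feature dictionaries, the following is a minimal from-scratch sketch on toy data. It is not NLTK's implementation: the function names `train_naive_bayes` and `classify` are hypothetical, the six training examples are made up for illustration, and add-one smoothing is an assumption chosen for simplicity (NLTK's own classifier differs in details such as its smoothing estimator).

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_featuresets):
    """Count labels and per-label feature values from (features, label) pairs."""
    label_counts = Counter()
    value_counts = defaultdict(Counter)  # (label, feature name) -> value counts
    vocab = defaultdict(set)             # feature name -> values seen in training
    for features, label in labeled_featuresets:
        label_counts[label] += 1
        for fname, value in features.items():
            value_counts[(label, fname)][value] += 1
            vocab[fname].add(value)
    return label_counts, value_counts, vocab

def classify(model, features):
    """Return the label with the highest log prior plus log likelihoods."""
    label_counts, value_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)  # log P(label)
        for fname, value in features.items():
            seen = value_counts[(label, fname)]
            # Add-one smoothing (an assumption of this sketch); the extra +1
            # in the denominator reserves mass for values never seen in training.
            numerator = seen[value] + 1
            denominator = sum(seen.values()) + len(vocab[fname]) + 1
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data, made up for illustration only.
toy_train = [
    ({'last letter': 'a'}, 'female'),
    ({'last letter': 'a'}, 'female'),
    ({'last letter': 'e'}, 'female'),
    ({'last letter': 'o'}, 'male'),
    ({'last letter': 'n'}, 'male'),
    ({'last letter': 'n'}, 'male'),
]
model = train_naive_bayes(toy_train)
print(classify(model, {'last letter': 'a'}))  # female
print(classify(model, {'last letter': 'n'}))  # male
```

The "naive" part is the independence assumption: with several features, such as `suffix1` and `prefix1`, their log likelihoods are simply added as if each letter were independent of the others given the gender.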


In [10]:
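#Retrain with Feature 5 (best on the development set) and evaluate on the test set
#gender_feature5 is named explicitly rather than relying on the leftover loop variable
train_set = [(gender_feature5(n), g) for (n, g) in train_names]
test_set = [(gender_feature5(n), g) for (n, g) in test_names]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print("Feature 5 Accuracy on Test Set: ", round(nltk.classify.accuracy(classifier, test_set), 4))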



Feature 5 Accuracy on Test Set:  0.7602
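
One way to put a test accuracy in context is to compare it with a majority-class baseline, which always predicts the most common training label. The sketch below shows the idea on made-up labels; the helper `majority_baseline_accuracy` is hypothetical and the numbers are not results from the Names corpus.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Predict the most common training label for every test item."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    correct = sum(1 for label in test_labels if label == majority_label)
    return majority_label, correct / len(test_labels)

# Made-up labels for illustration only.
toy_train_labels = ['female'] * 6 + ['male'] * 4
toy_test_labels = ['female', 'male', 'female', 'female', 'male']

label, baseline = majority_baseline_accuracy(toy_train_labels, toy_test_labels)
print(label, baseline)  # female 0.6
```

In the notebook, the same helper could be called with `[g for (_, g) in train_names]` and `[g for (_, g) in test_names]` to see how far the Feature 5 classifier improves on always guessing the majority gender.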
