# Learning to classify text

Text classification is the task of assigning a document to one of a set of known categories.
Here we use Multinomial Naive Bayes, a generative algorithm, to learn from labelled examples and then classify new documents. This is a supervised learning approach.

The first problem is to process the text.
We use the *bag of words* representation to turn each document into a vector of word counts.
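
To make the representation concrete, here is a minimal pure-Python sketch of what a bag-of-words vectorizer does, using two made-up sentences rather than the lyrics files. The helper names `tokenize` and `bag_of_words_matrix` are illustrative and not part of scikit-learn, and the tokenization only approximates the defaults of `CountVectorizer` (lowercasing, tokens of two or more word characters).

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep runs of two or more word characters,
    # roughly what CountVectorizer does by default.
    return re.findall(r"\b\w\w+\b", text.lower())

def bag_of_words_matrix(docs):
    # Count tokens per document, then lay the counts out against a shared,
    # alphabetically sorted vocabulary (one column per word).
    counts = [Counter(tokenize(doc)) for doc in docs]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, rows

toy_docs = ["the cat sat on the mat", "the dog chased the cat"]
vocabulary, rows = bag_of_words_matrix(toy_docs)
print(vocabulary)
for row in rows:
    print(row)
```

Each row is one document and each column one word, which is the same layout the real matrix uses.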

# Feature extraction

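In [1]:
import sklearn
import sklearn.feature_extraction.text

def read_lyrics(path):
    # Read a whole lyrics file and close it afterwards.
    with open(path) as f:
        return f.read()

# Three clash files followed by three gaga files; the order matters
# because the labels are assigned by position later on.
vec=sklearn.feature_extraction.text.CountVectorizer()
mat=vec.fit_transform([
    read_lyrics('lyrics/clash/livinginfame.txt'),
    read_lyrics('lyrics/clash/londoncalling.txt'),
    read_lyrics('lyrics/clash/hateandwar.txt'),

    read_lyrics('lyrics/gaga/pokerface.txt'),
    read_lyrics('lyrics/gaga/lovesickgirl.txt'),
    read_lyrics('lyrics/gaga/animal.txt')])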




Turn all the document vectors into a matrix.

One dimension is the words (features) and the other is the documents.

Note: real text raises further problems that are not handled here, including sparsity, stop words, stemming and misspellings.
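
As a rough illustration of stopping, the sketch below drops very common words before counting. The stop list is a tiny hand-picked assumption for this example; in scikit-learn the comparable option is the `stop_words` parameter of `CountVectorizer` (for example `stop_words='english'`).

```python
import re
from collections import Counter

# Tiny, hand-picked stop list for illustration only (an assumption,
# not the list scikit-learn uses).
STOP_WORDS = {"the", "and", "is", "it", "an", "on"}

def count_without_stop_words(text, stop_words=STOP_WORDS):
    # Same rough tokenization as before: lowercase, 2+ word characters.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return Counter(t for t in tokens if t not in stop_words)

counts = count_without_stop_words("The cat sat on the mat and the dog is on it")
print(counts)
```

Without a step like this, frequent function words can end up among the highest-count features for a class.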

In [2]:
mat

<6x411 sparse matrix of type '<class 'numpy.int64'>'
	with 581 stored elements in Compressed Sparse Row format>
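
The summary above already shows the sparsity: only 581 of the 6 x 411 cells hold a non-zero count. A quick check of that arithmetic, with the numbers copied from the output (on the real matrix the same values are available as `mat.shape` and `mat.nnz`):

```python
# Numbers copied from the sparse matrix summary above.
n_docs, n_features = 6, 411
stored = 581

total_cells = n_docs * n_features
density = stored / total_cells
print(total_cells)          # 2466
print(round(density, 3))    # 0.236
```

So roughly three quarters of the matrix is zeros even with only six short documents, which is why a sparse format is used.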

In [3]:
mat[1,3]  # count of vocabulary word 3 in document 1 (both indices are zero-based)

3

In [4]:
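# get_feature_names() was removed in scikit-learn 1.2; on newer versions use get_feature_names_out()
print(vec.get_feature_names())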

['aaaye', 'about', 'after', 'age', 'aggression', 'ain', 'alike', 'all', 'alone', 'am', 'an', 'and', 'animal', 'another', 'any', 'apparel', 'are', 'around', 'as', 'at', 'away', 'baby', 'back', 'bad', 'bank', 'barracuda', 'bathroom', 'battle', 'be', 'beat', 'beatlemania', 'beats', 'been', 'before', 'behave', 'best', 'better', 'big', 'bite', 'bitten', 'black', 'blame', 'blockhead', 'bluffin', 'body', 'bodysnatcher', 'bout', 'boys', 'breath', 'brother', 'burnin', 'but', 'by', 'cage', 'calling', 'can', 'cards', 'care', 'casino', 'cause', 'cheat', 'check', 'chick', 'circles', 'city', 'clash', 'close', 'come', 'coming', 'control', 'cooler', 'cops', 'crazy', 'creator', 'creep', 'cries', 'cupboard', 'currency', 'dat', 'day', 'days', 'de', 'deal', 'death', 'declared', 'deh', 'dem', 'dial', 'die', 'direction', 'dis', 'do', 'does', 'doesn', 'don', 'down', 'draw', 'dread', 'dreader', 'dream', 'drowning', 'dust', 'eh', 'else', 'em', 'engines', 'english', 'er', 'erectors', 'error', 'even', 'every', '

# Train the algorithm

This is supervised learning, so we first label our examples.

In [5]:
target=["clash"]*3+["gaga"]*3
target

['clash', 'clash', 'clash', 'gaga', 'gaga', 'gaga']

There are three examples of each class, so our prior is 50/50.

In [6]:
import sklearn.naive_bayes
classifier=sklearn.naive_bayes.MultinomialNB().fit(mat,target)

classifier.class_count_

array([3., 3.])
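
The 50/50 prior follows directly from the label counts. A small sketch of that calculation, independent of scikit-learn (the fitted model stores the logarithm of these values in `class_log_prior_`):

```python
import math
from collections import Counter

target = ["clash"] * 3 + ["gaga"] * 3

# Prior probability of each class = share of training documents with that label.
class_counts = Counter(target)
total = sum(class_counts.values())
prior = {label: n / total for label, n in class_counts.items()}
log_prior = {label: math.log(p) for label, p in prior.items()}

print(prior)
print(log_prior)
```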

Take a look at our distinguishing features: for each word, its count in the clash documents and its count in the gaga documents.

In [7]:
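# Each tuple is (count in class 0 'clash', count in class 1 'gaga', word).
# On scikit-learn 1.2 or later, use vec.get_feature_names_out() instead.
scores=set(zip(classifier.feature_count_[0],classifier.feature_count_[1],vec.get_feature_names()))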
scores

{(0.0, 1.0, 'are'),
 (0.0, 1.0, 'bank'),
 (0.0, 1.0, 'bathroom'),
 (0.0, 1.0, 'been'),
 (0.0, 1.0, 'before'),
 (0.0, 1.0, 'better'),
 (0.0, 1.0, 'bluffin'),
 (0.0, 1.0, 'cards'),
 (0.0, 1.0, 'casino'),
 (0.0, 1.0, 'check'),
 (0.0, 1.0, 'chick'),
 (0.0, 1.0, 'crazy'),
 (0.0, 1.0, 'feel'),
 (0.0, 1.0, 'fold'),
 (0.0, 1.0, 'gambling'),
 (0.0, 1.0, 'glue'),
 (0.0, 1.0, 'gun'),
 (0.0, 1.0, 'gunning'),
 (0.0, 1.0, 'hair'),
 (0.0, 1.0, 'hand'),
 (0.0, 1.0, 'hard'),
 (0.0, 1.0, 'heart'),
 (0.0, 1.0, 'his'),
 (0.0, 1.0, 'hit'),
 (0.0, 1.0, 'hooked'),
 (0.0, 1.0, 'hug'),
 (0.0, 1.0, 'hunter'),
 (0.0, 1.0, 'hunting'),
 (0.0, 1.0, 'intuition'),
 (0.0, 1.0, 'isn'),
 (0.0, 1.0, 'kind'),
 (0.0, 1.0, 'kiss'),
 (0.0, 1.0, 'let'),
 (0.0, 1.0, 'light'),
 (0.0, 1.0, 'little'),
 (0.0, 1.0, 'luck'),
 (0.0, 1.0, 'lying'),
 (0.0, 1.0, 'marvelous'),
 (0.0, 1.0, 'mess'),
 (0.0, 1.0, 'mirror'),
 (0.0, 1.0, 'muffin'),
 (0.0, 1.0, 'neighbors'),
 (0.0, 1.0, 'pair'),
 (0.0, 1.0, 'pay'),
 (0.0, 1.0, 'please'),
 (0.0,

In [8]:
# Difference in counts (gaga minus clash) for every word, most clash-like first.
delta_score=sorted((gaga-clash,word) for clash,gaga,word in scores)
delta_score[:10]

[(-25.0, 'say'),
 (-23.0, 'and'),
 (-16.0, 'the'),
 (-15.0, 'hate'),
 (-14.0, 'london'),
 (-11.0, 'calling'),
 (-10.0, 'clash'),
 (-10.0, 'dem'),
 (-10.0, 'eh'),
 (-10.0, 'is')]

In [9]:
delta_score[-10:]  # the ten most gaga-like words

[(15.0, 're'),
 (15.0, 'where'),
 (16.0, 'face'),
 (19.0, 'an'),
 (20.0, 'it'),
 (21.0, 'love'),
 (21.0, 'she'),
 (39.0, 'mum'),
 (49.0, 'oh'),
 (66.0, 'animal')]
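
Raw count differences favour frequent words regardless of how much text each class has. Multinomial Naive Bayes instead compares smoothed per-class word probabilities. The sketch below computes the log-ratio for a few toy counts; the counts, the helper name `log_ratio` and the choice of `alpha=1.0` (Laplace smoothing, which is also the `MultinomialNB` default) are assumptions for illustration, not values taken from the lyrics.

```python
import math

def log_ratio(counts_a, counts_b, alpha=1.0):
    # counts_a / counts_b map word -> count in class A / class B.
    vocab = sorted(set(counts_a) | set(counts_b))
    total_a = sum(counts_a.values()) + alpha * len(vocab)
    total_b = sum(counts_b.values()) + alpha * len(vocab)
    ratios = {}
    for word in vocab:
        p_a = (counts_a.get(word, 0) + alpha) / total_a
        p_b = (counts_b.get(word, 0) + alpha) / total_b
        # Positive means the word points towards class B.
        ratios[word] = math.log(p_b / p_a)
    return ratios

toy_a = {"london": 3, "the": 4}
toy_b = {"love": 3, "the": 4}
ratios = log_ratio(toy_a, toy_b)
for word, score in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{score:+.3f}  {word}")
```

A word that is equally common in both classes scores zero here, however frequent it is, which is the behaviour one wants from a distinguishing feature.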

# Test the algorithm

In [10]:
# Sanity check only: pokerface.txt was part of the training set.
testdoc=vec.transform([open('lyrics/gaga/pokerface.txt').read()])
testdoc


<1x411 sparse matrix of type '<class 'numpy.int64'>'
	with 104 stored elements in Compressed Sparse Row format>

In [11]:
classifier.predict(testdoc)

array(['gaga'], dtype='<U5')

In [12]:
# dirtypunk.txt was not used for training, so this is a held-out test.
testdoc=vec.transform([open('lyrics/clash/dirtypunk.txt').read()])
classifier.predict(testdoc)

array(['clash'], dtype='<U5')
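
For readers curious what `fit` and `predict` compute, here is a compact from-scratch sketch of multinomial Naive Bayes on toy token lists. It is not scikit-learn's implementation; the function names, the toy documents and the `alpha=1.0` smoothing are assumptions chosen for illustration. Words not seen in training are skipped at prediction time, which mirrors how `vec.transform` ignores out-of-vocabulary words.

```python
import math
from collections import Counter, defaultdict

def train(docs, labels, alpha=1.0):
    word_counts = defaultdict(Counter)   # label -> word -> count
    class_counts = Counter(labels)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = sorted({w for c in word_counts.values() for w in c})
    model = {}
    for label, n_docs in class_counts.items():
        # Smoothed denominator: total words in the class plus alpha per vocabulary word.
        total = sum(word_counts[label].values()) + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(n_docs / len(labels)),
            "log_prob": {w: math.log((word_counts[label][w] + alpha) / total)
                         for w in vocab},
        }
    return model

def predict(model, tokens):
    # Score each class as log prior plus the sum of per-word log probabilities.
    scores = {}
    for label, params in model.items():
        score = params["log_prior"]
        for w in tokens:
            if w in params["log_prob"]:      # skip out-of-vocabulary words
                score += params["log_prob"][w]
        scores[label] = score
    return max(scores, key=scores.get)

# Made-up token lists, not the real lyrics.
toy_docs = [["london", "calling", "war"], ["hate", "war", "london"],
            ["love", "face", "oh"], ["oh", "love", "animal"]]
toy_labels = ["clash", "clash", "gaga", "gaga"]
model = train(toy_docs, toy_labels)
first = predict(model, ["oh", "love", "unknownword"])
second = predict(model, ["london", "war"])
print(first, second)
```

Working in log space avoids multiplying many small probabilities together, which would otherwise underflow on longer documents.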