In text classification, the goal is to assign a class label to a piece of text or a document.  NLTK supports several approaches to classification. 

### 1. Bag of words feature extraction 

Text feature extraction is the process of transforming a list of words into a feature set that a classifier can use.  The bag of words model is the simplest method: we construct a word-presence feature set from all the words of an instance.  Neither the order of the words nor how many times each one occurs matters, only whether a word is present in the list.  

The idea is to convert a list of words into a dictionary, where each word becomes a key with the value of `True`.  

In [1]:
# Bag of words method: map every word to True (duplicates collapse into one key)
def bag_of_words(words):
    return {word: True for word in words}

In [2]:
bag_of_words(['the', 'quick', 'brown', 'fox'])

{'the': True, 'quick': True, 'brown': True, 'fox': True}
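
To make the point about order and repetition concrete, here is a minimal sketch that restates the function so it runs on its own and compares two word lists. The second list is a made-up example that shuffles the words and repeats one of them.

```python
def bag_of_words(words):
    # Each distinct word becomes a key; repeated words overwrite the same key.
    return {word: True for word in words}

original = bag_of_words(['the', 'quick', 'brown', 'fox'])
shuffled = bag_of_words(['fox', 'the', 'brown', 'quick', 'the', 'the'])

# Dictionary equality ignores insertion order, so the two feature sets match.
print(original == shuffled)  # True
print(len(shuffled))         # 4
```

Both lists produce the same feature set, which is exactly the information a word-presence model keeps.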

To improve the bag-of-words technique, we can also filter out stopwords: 

In [5]:
# Filter out stopwords (requires the corpus: run nltk.download('stopwords') once)
from nltk.corpus import stopwords 

def bag_of_words_not_in_set(words, badwords):
    return bag_of_words(set(words) - set(badwords))

def bag_of_non_stopwords(words, stopfile='english'):
    badwords = stopwords.words(stopfile)
    return bag_of_words_not_in_set(words, badwords)

In [6]:
bag_of_non_stopwords(['the', 'quick', 'brown', 'fox', 'is', 'gone'])

{'brown': True, 'gone': True, 'fox': True, 'quick': True}
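
Because `bag_of_words_not_in_set` accepts any collection of words to exclude, it can also be used without the NLTK corpus. The sketch below restates the two helpers so it is self-contained, and it assumes a small hand-picked stopword set (`my_stopwords` is a hypothetical name, not part of NLTK) in place of `stopwords.words('english')`.

```python
def bag_of_words(words):
    return {word: True for word in words}

def bag_of_words_not_in_set(words, badwords):
    # Set difference drops every word that appears in badwords.
    return bag_of_words(set(words) - set(badwords))

# Hypothetical stopword list, standing in for stopwords.words('english').
my_stopwords = {'the', 'is', 'a', 'an'}

features = bag_of_words_not_in_set(
    ['the', 'quick', 'brown', 'fox', 'is', 'gone'], my_stopwords
)

# Sets are unordered, so sort the keys for a stable display.
print(sorted(features))  # ['brown', 'fox', 'gone', 'quick']
```

A custom set like this can be useful when a domain has its own uninformative words that a general-purpose stopword list does not cover.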

Another option is to include significant bigrams instead of just single words.  Significant bigrams are less common than most individual words, so including them in the bag-of-words model can improve a classifier.  We can use the `BigramCollocationFinder` to do this:  

In [9]:
# Include bigrams 
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

def bag_of_bigram_words(words, score_fn=BigramAssocMeasures.chi_sq, n=200):
    bigram_finder = BigramCollocationFinder.from_words(words)
    bigrams = bigram_finder.nbest(score_fn, n)
    return bag_of_words(words + bigrams)

In [10]:
bag_of_bigram_words(['the', 'quick', 'brown', 'fox'])

{'the': True,
 'quick': True,
 'brown': True,
 'fox': True,
 ('brown', 'fox'): True,
 ('quick', 'brown'): True,
 ('the', 'quick'): True}

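In the function above, the default `n=200` returns the 200 highest-scoring bigrams according to `score_fn` (chi-square by default), which are not necessarily the most frequent ones.  We can pass a different `n` to keep more or fewer bigrams. 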
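
The effect of `n` is easier to see on a toy input. The sketch below does not use NLTK: it ranks adjacent word pairs by raw count, a deliberately simple stand-in for `score_fn`, and the helper names `top_bigrams_by_frequency` and `bag_of_words_with_bigrams` are hypothetical. It only illustrates how limiting `n` changes the resulting feature set.

```python
from collections import Counter

def top_bigrams_by_frequency(words, n=200):
    # Pair each word with its successor, then keep the n most frequent pairs.
    pairs = zip(words, words[1:])
    return [pair for pair, _ in Counter(pairs).most_common(n)]

def bag_of_words_with_bigrams(words, n=200):
    # Single-word features plus the top-n bigram features.
    features = {word: True for word in words}
    features.update({pair: True for pair in top_bigrams_by_frequency(words, n)})
    return features

words = ['the', 'quick', 'brown', 'fox', 'the', 'quick', 'dog']

# ('the', 'quick') occurs twice; every other pair occurs once.
print(top_bigrams_by_frequency(words, n=1))       # [('the', 'quick')]
print(len(bag_of_words_with_bigrams(words, n=1)))  # 6: five distinct words + one bigram
```

With a small `n`, only the strongest pairs are added as features; with a large `n` on a short text, every bigram is included, as in the four-word example above.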

### 2. Naive Bayes classifier 