# Classifying *text* data

Up to this point, we haven't really been classifying text data. The question is: if we are working with text, what should the features be?

## The data set: Random acts of pizza

[Kaggle link](https://www.kaggle.com/c/random-acts-of-pizza)

As described on Kaggle:

> This competition contains a dataset with 5671 textual requests for pizza from the Reddit community Random Acts of Pizza together with their outcome (successful/unsuccessful) and meta-data. Participants must create an algorithm capable of predicting which requests will garner a cheesy (but sincere!) act of kindness.

### First, we'll read it in, and take a look at it

In [None]:
import csv

# Read every row of the CSV into a list; the with-block closes the file for us.
with open('corpora/pizza.csv', newline='') as csvfile:
    pizza_reader = csv.reader(csvfile)
    data_list = [row for row in pizza_reader]

In [None]:
len(data_list)

In [None]:
data_list[0]

### Next we have to tokenize it

A note: Tokenizing this turned out to be a bit of a pain.
The problem was that, since this is text posted online, people used all sorts of abbreviations, and contractions were important.

I finally ended up using a special-purpose tokenizer I created for some Twitter data. This uses some techniques that we will learn about in a later week. Basically, it's the result of a lot of fiddling.
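
To see why plain whitespace splitting is not enough for this kind of text, here is a minimal sketch. The sample sentence and the regular expression are my own illustration, not part of the tokenizer used in this notebook.

```python
import re

# Hypothetical sample text, written to resemble an online post.
sample = "I can't pay u back rn, but I'll pay it forward!"

# Naive whitespace splitting leaves punctuation glued to words.
naive_tokens = sample.lower().split()

# A simple regex alternative: keep word-internal apostrophes together,
# and emit every other non-space character as its own token.
regex_tokens = re.findall(r"\w+(?:'\w+)*|[^\w\s]", sample.lower())

print(naive_tokens)
print(regex_tokens)
```

With naive splitting, `rn,` and `forward!` come out as single tokens, so they would never match `rn` or `forward` in a vocabulary. The regex version keeps contractions like `can't` whole while splitting off the comma and exclamation mark.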

In [None]:
def bruces_twitter_tokenizer(text):
    import re
    
    def is_contraction(the_text):
        contraction_patterns = re.compile(r"(?i)(.)('ll|'re|'ve|n't|'s|'m|'d)\b")
        return contraction_patterns.search(the_text)
    
    punctuation_class = r"([\.\-\/&\";:\(\)\?\!\]\[\{\}\*#])"
    
    # eliminate_urls
    text = re.sub(r"http\S*", "", text)
    
    # eliminate hashtags, user mentions
    text = re.sub(r"#\S*", "", text)
    text = re.sub(r"@\S*", "", text)
    
    # Separate most punctuation at end of words

    text = re.sub(r"(\w)" + punctuation_class, r'\1 \2 ', text)
    
    # Separate most punctuation at start of words
    text = re.sub(punctuation_class + r"(\w)", r'\1 \2', text)
    
    # Separate punctuation from other punctuation
    text = re.sub(punctuation_class + punctuation_class, r'\1 \2 ', text)
    
    # Put spaces between + and = signs and digits. Also %s that follow a digit, $s that come before a digit
    text = re.sub(r"(\d)([+=%])", r'\1 \2 ', text)
    text = re.sub(r"([\$+=])(\d)", r'\1 \2', text)
    
    # Separate commas if they're followed by space.
    # (E.g., don't separate 2,500)
    text = re.sub(r"(,\s)", r' \1', text)
    
    # Collapse two consecutive double quotes into one.
    text = re.sub("\"\"", "\"", text)

    # Separate leading and trailing single and double quotes .
    text = re.sub(r"(\'\s)", r' \1', text)
    text = re.sub(r"(\s\')", r'\1 ', text)
    text = re.sub(r"(\"\s)", r' \1', text)
    text = re.sub(r"(\s\")", r'\1 ', text)
    text = re.sub(r"(^\")", r'\1 ', text)
    text = re.sub(r"(^\')", r'\1 ', text)
    text = re.sub(r"(\'$)", r' \1', text)
    text = re.sub(r"(\"$)", r' \1', text)

    #Separate parentheses where appropriate
    text = re.sub(r"(\)\s)", r' \1', text)
    text = re.sub(r"(\s\()", r'\1 ', text)

    # Separate periods that come before newline or end of string.
    text = re.sub(r'\. *(\n|$)', ' . ', text)
    
    # separate single quotes in the middle of words
    # text = re.sub(r"(\w)(\')(\w)", r'\1 \2 \3', text)
    
    # separate out 's at the end of words
    text = re.sub(r"(\w)(\'s)(\s)", r"\1 s ", text)
    split_text = text.split()
    
    return split_text


In [None]:
tokenized_data_list = []
for row in data_list:
    new_row = [bruces_twitter_tokenizer(row[0].lower()), row[1]]
    tokenized_data_list.append(new_row)

In [None]:
print(tokenized_data_list[0])

### Converting each text item to a featureset

There are many ways we could convert each request to a featureset. To start, we are going to use a sort of vector space representation. We are going to create a vocabulary, as we have done before. Then each feature will be the presence or absence of one item in our vocabulary.
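
As a small illustration of the presence/absence idea, here is a sketch with a made-up vocabulary and a made-up tokenized request (neither comes from the pizza data). The feature names follow the `contains(word)` pattern used later in this notebook.

```python
# Hypothetical toy vocabulary and tokenized request (not taken from the pizza data).
toy_vocab = ["pizza", "job", "kids", "thanks"]
toy_tokens = ["lost", "my", "job", "would", "love", "a", "pizza"]

# Membership tests are faster against a set than against a list.
token_set = set(toy_tokens)

# One boolean feature per vocabulary word.
toy_features = {"contains(%s)" % word: (word in token_set) for word in toy_vocab}
print(toy_features)
```

Every text gets the same set of feature names, whatever its length; only the True/False values differ.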

First we build a stop list:

In [None]:
# Load the stop words, then add punctuation marks and single characters.
with open("lists/stop-words_english_5_en.txt") as f:
    stop_list = f.read().split("\n")
stop_list += list('!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~’')
stop_list += list("abcdefghijklmnopqrstuvwxyz0123456789")
stop_list = set(stop_list)

Next we create a frequency distribution, ignoring words on the stop list:

In [None]:
import nltk
word_fdist = nltk.FreqDist() # the corpus frequencies
for row in tokenized_data_list:
    for word in row[0]:
        if word not in stop_list:
            word_fdist[word] += 1

Our vocabulary will be the 100 most common words:

In [None]:
mc = word_fdist.most_common(100)
vocab = [w[0] for w in mc]
print(vocab)
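
The same counting-then-truncating step can be sketched with the standard library alone. In NLTK 3, `FreqDist` is built on `collections.Counter`, so `most_common` behaves the same way. The toy texts and stop list below are made up for illustration.

```python
from collections import Counter

# Hypothetical toy data: tokenized texts and a tiny stop list.
toy_texts = [
    ["a", "pizza", "please"],
    ["pizza", "for", "my", "kids"],
    ["please", "help", "pizza"],
]
toy_stop = {"a", "for", "my"}

# Count every token that is not on the stop list.
counts = Counter()
for tokens in toy_texts:
    counts.update(w for w in tokens if w not in toy_stop)

# Keep the 2 most frequent words as the vocabulary.
small_vocab = [word for word, _ in counts.most_common(2)]
print(small_vocab)
```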

Now we create the feature sets:

In [None]:
def word_feature_extractor(the_text, vocab):
    features = {}
    for word in vocab:
        features['contains(%s)' % word] = (word in the_text)
    return features

In [None]:
word_feature_extractor(tokenized_data_list[0][0], vocab)

In [None]:
labeled_featuresets = []
for row in tokenized_data_list:
    new_lf = [word_feature_extractor(row[0], vocab), row[1]]
    labeled_featuresets.append(new_lf)

In [None]:
print(labeled_featuresets[0])

### Separate all of the labeled featuresets into a test set and a training set

I'm going to reserve just about 10% of the data for testing: the first 400 items.

In [None]:
test_size = 400

In [None]:
test_set = labeled_featuresets[:test_size]
train_set = labeled_featuresets[test_size:]
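
The slice above takes the first 400 items as the test set, which is only representative if the rows are in no particular order. I have not checked whether that holds for this file, so here is a minimal sketch of shuffling before splitting. The helper name `shuffled_split` and the toy data are hypothetical.

```python
import random

def shuffled_split(items, test_size, seed=0):
    """Return (test, train) after shuffling a copy of items."""
    shuffled = list(items)                 # copy, so the original order is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    return shuffled[:test_size], shuffled[test_size:]

# Hypothetical stand-in for labeled_featuresets.
toy_items = list(range(10))
toy_test, toy_train = shuffled_split(toy_items, test_size=3)
print(len(toy_test), len(toy_train))
```

Using a fixed seed means rerunning the notebook gives the same split, so accuracy numbers stay comparable between runs.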

In [None]:
pizza_classifier = nltk.NaiveBayesClassifier.train(train_set)

In [None]:
nltk.classify.accuracy(pizza_classifier, test_set)

In [None]:
gold_list = [t[1] for t in test_set]
test_list = [pizza_classifier.classify(t[0]) for t in test_set]
cm = nltk.ConfusionMatrix(gold_list, test_list)
print(cm)
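
When reading an accuracy number and a confusion matrix, it helps to compare against the accuracy we would get by always guessing the most common label. Here is a minimal sketch with made-up labels and predictions (these are not real results from the pizza data).

```python
from collections import Counter

# Hypothetical gold labels and predictions (not real results).
toy_gold = ["no", "no", "no", "yes", "yes", "no"]
toy_pred = ["no", "no", "yes", "no", "yes", "no"]

# Count (gold, predicted) pairs: these are the cells of a confusion matrix.
cells = Counter(zip(toy_gold, toy_pred))

# Accuracy is the fraction of items where the prediction matches the gold label.
accuracy = sum(g == p for g, p in zip(toy_gold, toy_pred)) / len(toy_gold)

# Baseline: always guess the most common gold label.
majority_label, majority_count = Counter(toy_gold).most_common(1)[0]
baseline = majority_count / len(toy_gold)

print(cells)
print(accuracy, majority_label, baseline)
```

In this toy case the classifier's accuracy is no better than the majority baseline, which is the kind of thing a bare accuracy number can hide when one label is much more common than the other.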

### Try a decision tree

In [None]:
dt_pizza_classifier = nltk.DecisionTreeClassifier.train(train_set)

In [None]:
nltk.classify.accuracy(dt_pizza_classifier, test_set)

In [None]:
gold_list = [t[1] for t in test_set]
test_list = [dt_pizza_classifier.classify(t[0]) for t in test_set]
cm = nltk.ConfusionMatrix(gold_list, test_list)
print(cm)

### Try a slightly different feature extractor

In [None]:
def word_feature_extractor_with_length(the_text, vocab):
    features = {}
    for word in vocab:
        features['contains(%s)' % word] = (word in the_text)
    # Flag texts with more than 25 tokens as long.
    features["long"] = len(the_text) > 25
    return features

In [None]:
labeled_featuresets = []
for row in tokenized_data_list:
    new_lf = [word_feature_extractor_with_length(row[0], vocab), row[1]]
    labeled_featuresets.append(new_lf)

In [None]:
test_size = 400

In [None]:
test_set = labeled_featuresets[:test_size]
train_set = labeled_featuresets[test_size:]

In [None]:
pizza_classifier = nltk.NaiveBayesClassifier.train(train_set)

In [None]:
nltk.classify.accuracy(pizza_classifier, test_set)

In [None]:
gold_list = [t[1] for t in test_set]
test_list = [pizza_classifier.classify(t[0]) for t in test_set]
cm = nltk.ConfusionMatrix(gold_list, test_list)
print(cm)