# Solo Work 4

## A bit more about pizza

**Here, I'd like you to do a little bit more on the pizza dataset:**

* First: Try using slightly different parts of the corpus for training and testing. Do the results vary?
* Second: See if you can make a better classifier than the one we built in class.
    * One thing to try would be to experiment with the vocabulary. You can try a smaller or larger vocabulary.
    * You could try tailoring the vocabulary, perhaps by messing with the stop list.
    * You could also try adding other features entirely to the feature set. We looked at the length of posts in class. You can play with that. You could also maybe look for phrases.
    * You could try some of the other classification algorithms provided by NLTK. Maybe you could get decision trees to work?

**I copied over some big parts of notebook 14 to save you some trouble**

In [None]:
import csv

# Read every row of the pizza corpus into a list; the with-block closes the file.
with open('corpora/pizza.csv', newline='') as csvfile:
    pizza_reader = csv.reader(csvfile)
    data_list = list(pizza_reader)

### Tokenize it

In [None]:
def bruces_twitter_tokenizer(text):
    import re
    
    def is_contraction(the_text):
        contraction_patterns = re.compile(r"(?i)(.)('ll|'re|'ve|n't|'s|'m|'d)\b")
        return contraction_patterns.search(the_text)
    
    punctuation_class = r"([\.\-\/&\";:\(\)\?\!\]\[\{\}\*#])"
    
    # eliminate_urls
    text = re.sub(r"http\S*", "", text)
    
    # eliminate hashtags, user mentions
    text = re.sub(r"#\S*", "", text)
    text = re.sub(r"@\S*", "", text)
    
    # Separate most punctuation at end of words

    text = re.sub(r"(\w)" + punctuation_class, r'\1 \2 ', text)
    
    # Separate most punctuation at start of words
    text = re.sub(punctuation_class + r"(\w)", r'\1 \2', text)
    
    # Separate punctuation from other punctuation
    text = re.sub(punctuation_class + punctuation_class, r'\1 \2 ', text)
    
    # Put spaces between + and = signs and digits. Also %s that follow a digit, $s that come before a digit
    text = re.sub(r"(\d)([+=%])", r'\1 \2 ', text)
    text = re.sub(r"([\$+=])(\d)", r'\1 \2', text)
    
    # Separate commas if they're followed by space.
    # (E.g., don't separate 2,500)
    text = re.sub(r"(,\s)", r' \1', text)
    
    # When we have two double quotes in a row, collapse them into one.
    text = re.sub("\"\"", "\"", text)

    # Separate leading and trailing single and double quotes.
    text = re.sub(r"(\'\s)", r' \1', text)
    text = re.sub(r"(\s\')", r'\1 ', text)
    text = re.sub(r"(\"\s)", r' \1', text)
    text = re.sub(r"(\s\")", r'\1 ', text)
    text = re.sub(r"(^\")", r'\1 ', text)
    text = re.sub(r"(^\')", r'\1 ', text)
    text = re.sub(r"(\'$)", r' \1', text)
    text = re.sub(r"(\"$)", r' \1', text)

    #Separate parentheses where appropriate
    text = re.sub(r"(\)\s)", r' \1', text)
    text = re.sub(r"(\s\()", r'\1 ', text)

    # Separate periods that come before newline or end of string.
    text = re.sub(r'\. *(\n|$)', ' . ', text)
    
    # separate single quotes in the middle of words
    # text = re.sub(r"(\w)(\')(\w)", r'\1 \2 \3', text)
    
    # separate out 's at the end of words
    text = re.sub(r"(\w)(\'s)(\s)", r"\1 s ", text)
    split_text = text.split()
    
    return split_text


In [None]:
tokenized_data_list = []
for row in data_list:
    new_row = [bruces_twitter_tokenizer(row[0].lower()), row[1]]
    tokenized_data_list.append(new_row)

### Converting each text item to a featureset

In [None]:
with open("lists/stop-words_english_5_en.txt") as f:
    stop_list = f.read().split("\n")
stop_list += list('!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~’')
stop_list += list("abcdefghijklmnopqrstuvwxyz0123456789")
stop_list = set(stop_list)

Next, we create a frequency distribution, ignoring words on the stop list.

In [None]:
import nltk
word_fdist = nltk.FreqDist() # the corpus frequencies
for row in tokenized_data_list:
    for word in row[0]:
        if word not in stop_list:
            word_fdist[word] += 1

Our vocabulary will be the 100 most common words

In [None]:
mc = word_fdist.most_common(100)
vocab = [w[0] for w in mc]
print(vocab)
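
To experiment with the vocabulary size or a tailored stop list, it may help to wrap the two steps above in one function. The sketch below is one possible way to do that, not the method from class: `build_vocab`, `extra_stop_words`, and the toy rows are hypothetical names and data. It counts with `collections.Counter`, which `nltk.FreqDist` subclasses, so `most_common` behaves the same way.

```python
from collections import Counter

def build_vocab(tokenized_rows, stop_words, size, extra_stop_words=()):
    """Return the `size` most common tokens that are not blocked."""
    blocked = set(stop_words) | set(extra_stop_words)
    counts = Counter()
    for tokens, _label in tokenized_rows:
        for token in tokens:
            if token not in blocked:
                counts[token] += 1
    return [word for word, _count in counts.most_common(size)]

# Toy rows in the same [tokens, label] shape as tokenized_data_list.
# The labels here are placeholders, not the labels used in the corpus.
toy_rows = [
    [["please", "pizza", "hungry", "the"], "label_a"],
    [["pizza", "please", "pizza", "the"], "label_b"],
    [["hungry", "pizza", "student"], "label_a"],
]

small_vocab = build_vocab(toy_rows, stop_words={"the"}, size=2)
tailored_vocab = build_vocab(toy_rows, {"the"}, size=2,
                             extra_stop_words={"pizza"})
print(small_vocab)
print(tailored_vocab)
```

In the notebook, a call such as `build_vocab(tokenized_data_list, stop_list, 500)` would give a larger vocabulary to compare against the 100-word one.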

Now we create the feature sets

In [None]:
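def word_feature_extractor(the_text, vocab):
    # Map each vocabulary word to whether it occurs in the tokenized text.
    tokens = set(the_text)  # set lookup is faster than scanning the token list
    features = {}
    for word in vocab:
        features['contains(%s)' % word] = (word in tokens)
    return features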

In [None]:
word_feature_extractor(tokenized_data_list[0][0], vocab)

In [None]:
labeled_featuresets = []
for row in tokenized_data_list:
    new_lf = [word_feature_extractor(row[0], vocab), row[1]]
    labeled_featuresets.append(new_lf)
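
If you want to add features beyond word presence, one option is a second extractor that also records post length and a few phrases. This is only a sketch under assumptions: `combined_feature_extractor`, `phrases`, and `long_post_cutoff` are hypothetical names, and the cutoff and example phrases are arbitrary. Length is stored as a true/false bucket because NLTK's Naive Bayes classifier treats each distinct feature value as its own category, so raw counts would mostly be unique values.

```python
def combined_feature_extractor(tokens, vocab, phrases=(), long_post_cutoff=50):
    """Word-presence features plus a length bucket and phrase features."""
    token_set = set(tokens)
    features = {}
    for word in vocab:
        features['contains(%s)' % word] = (word in token_set)

    # Length feature: a coarse bucket rather than the raw token count.
    features['long_post'] = len(tokens) >= long_post_cutoff

    # Phrase features: does a two-word phrase occur as adjacent tokens?
    bigrams = set(zip(tokens, tokens[1:]))
    for first, second in phrases:
        name = 'has_phrase(%s %s)' % (first, second)
        features[name] = (first, second) in bigrams
    return features

example = combined_feature_extractor(
    ["i", "lost", "my", "job", "last", "week"],
    vocab=["job", "pizza"],
    phrases=[("my", "job"), ("pay", "forward")],
    long_post_cutoff=5,
)
print(example)
```

To use it in the notebook, swap it in for `word_feature_extractor` when building `labeled_featuresets`.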

### Separate all of the labeled featuresets into a test set and a training set

I'm going to reserve about 10% of the data (400 items) for testing.

In [None]:
test_size = 400

In [None]:
test_set = labeled_featuresets[:test_size]
train_set = labeled_featuresets[test_size:]
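
To try different parts of the corpus for testing, the split above can be generalized so the test window starts anywhere, not only at the beginning. The helpers below are a minimal sketch; `split_at` and `all_folds` are hypothetical names, and the toy list stands in for `labeled_featuresets`.

```python
def split_at(items, start, test_size):
    """Use items[start:start + test_size] for testing, the rest for training."""
    test = items[start:start + test_size]
    train = items[:start] + items[start + test_size:]
    return train, test

def all_folds(items, test_size):
    """Yield (start, train, test) for each full, non-overlapping test window."""
    for start in range(0, len(items) - test_size + 1, test_size):
        train, test = split_at(items, start, test_size)
        yield start, train, test

# Toy data standing in for labeled_featuresets.
toy_items = list(range(10))
folds = list(all_folds(toy_items, test_size=3))
for start, train, test in folds:
    print(start, test, train)
```

In the notebook, you could train a classifier on each `train` and score it on the matching `test` to see how much the accuracy moves from window to window. If the rows of the corpus are ordered in some way, shuffling a copy of the list first may give a fairer picture.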

In [None]:
pizza_classifier = nltk.NaiveBayesClassifier.train(train_set)

In [None]:
nltk.classify.accuracy(pizza_classifier, test_set)
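
When comparing several classification algorithms, a small harness that trains each candidate on the same data keeps the comparison fair. The sketch below uses plain Python and a majority-label baseline so that it runs on its own; `MajorityClassifier`, `accuracy`, `compare_trainers`, and the toy data are all hypothetical and not part of NLTK or the original notebook.

```python
from collections import Counter

class MajorityClassifier:
    """Baseline that always predicts the most frequent training label."""
    def __init__(self, label):
        self.label = label

    @classmethod
    def train(cls, labeled_featuresets):
        counts = Counter(label for _features, label in labeled_featuresets)
        return cls(counts.most_common(1)[0][0])

    def classify(self, featureset):
        return self.label

def accuracy(classifier, labeled_featuresets):
    """Fraction of items whose predicted label matches the gold label."""
    correct = sum(1 for features, label in labeled_featuresets
                  if classifier.classify(features) == label)
    return correct / len(labeled_featuresets)

def compare_trainers(trainers, train_set, test_set):
    """Train each candidate and return {name: accuracy on test_set}."""
    return {name: accuracy(train(train_set), test_set)
            for name, train in trainers.items()}

# Toy featuresets with placeholder labels.
toy_train = [({"contains(job)": True}, "yes"),
             ({"contains(job)": False}, "no"),
             ({"contains(job)": False}, "no")]
toy_test = [({"contains(job)": False}, "no"),
            ({"contains(job)": True}, "yes")]

scores = compare_trainers({"majority": MajorityClassifier.train},
                          toy_train, toy_test)
print(scores)
```

In the notebook, the dictionary could also hold `nltk.NaiveBayesClassifier.train` and `nltk.DecisionTreeClassifier.train`, since both accept the same list of labeled featuresets and return an object with a `classify` method. A classifier that cannot beat the majority baseline is not learning much from the features.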

In [None]:
gold_list = [t[1] for t in test_set]
test_list = [pizza_classifier.classify(t[0]) for t in test_set]
cm = nltk.ConfusionMatrix(gold_list, test_list)
print(cm)
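
The confusion matrix can be summarized as precision and recall for each label, which helps when one label is much more common than the other. The function below is a sketch in plain Python; `per_label_scores` and the toy labels are hypothetical, and in the notebook the inputs would be `gold_list` and `test_list`.

```python
def per_label_scores(gold, predicted):
    """Return {label: (precision, recall)} from two parallel label lists."""
    labels = sorted(set(gold) | set(predicted))
    pairs = list(zip(gold, predicted))
    scores = {}
    for label in labels:
        true_pos = sum(1 for g, p in pairs if g == label and p == label)
        false_pos = sum(1 for g, p in pairs if g != label and p == label)
        false_neg = sum(1 for g, p in pairs if g == label and p != label)
        # Guard against dividing by zero when a label is never predicted
        # or never appears in the gold list.
        precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
        recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
        scores[label] = (precision, recall)
    return scores

# Toy labels, not taken from the pizza corpus.
toy_gold = ["yes", "no", "no", "yes", "no"]
toy_pred = ["yes", "no", "yes", "no", "no"]
toy_scores = per_label_scores(toy_gold, toy_pred)
print(toy_scores)
```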