# W207 Group Project/Final
## Kaggle Competition: Random Acts of Pizza

Project Prompt:
People post pizza requests on Reddit
Build 2-class classifier
Classify whether post will get pizza
Practice mining features from text

Reference links:
https://www.kaggle.com/c/random-acts-of-pizza
http://cs.stanford.edu/~althoff/raop-dataset/

Data Set:
This training dataset contains a collection of 5671 textual requests for pizza from the Reddit community "Random Acts of Pizza" together with their outcome (successful/unsuccessful) and meta-data.

We will split the dataset into:
25% for development
75% for training

A separate dataset file was provided for testing purposes, of which we do not have the labels as to whether or not a pizza was received.

In [1]:
import numpy as np
import pandas as pd
import re

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.grid_search import GridSearchCV

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn import metrics



In [2]:
# import the data
df_train = pd.read_json('train.json')
df_test = pd.read_json('test.json')

# drop the target column from the data and use it for the labels
classification_column_name = 'requester_received_pizza'

train_data = df_train.drop([classification_column_name], axis=1)
train_labels = df_train[classification_column_name]

# use twenty-five percent of the training data for a dev data set
# note that we cannot use the test data set here, because we are not given their labels
train_data, dev_data, train_labels, dev_labels = train_test_split(train_data, train_labels, random_state=42)

As we will be parsing the message body of each pizza request to utilize as features for our models, we will create term-frequency matricies of the text to use in our models as features.

In [4]:
def decimal_to_percent(decimal):
    return round(decimal * 100, 2)

def basic_vectorizer():
    ''' Construct term-frequency matrices for use in models '''
    
    # use just the text of the post
    text_column = 'request_text'
    train_text = train_data[text_column]
    dev_text = dev_data[text_column]

    # construct the term-frequency count matrix
    tf_vect = CountVectorizer()
    tf_train = tf_vect.fit_transform(train_text)
    tf_dev = tf_vect.transform(dev_text)
    
    #output_file("basic_vocab.txt", tf_vect.get_feature_names())
    
    return (tf_train, tf_dev)

(tf_train, tf_dev) = basic_vectorizer()

Next we will train basic models, without tuning, for each candidate learning model:
  Logistic Regression
  Naive Bayes
  Decision Tree
  
We will train and find the accuracies of each model.  This will give us a general idea of which learning models may be most successful and can build upon them from there.

In [26]:
def train_and_evaluate_model(model, tf_train, train_labels, tf_dev, dev_labels):
    ''' Train and score a model'''
    clf = model
    clf.fit(tf_train, train_labels)
    
    # return the accuracy and F1 scores
    accuracy = clf.score(tf_dev, dev_labels) 
    predicted = clf.predict(tf_dev)
    f1_score = metrics.f1_score(predicted, dev_labels, average=None)

    return accuracy, f1_score

def print_model_scores(model_type, accuracy, f1_score):
    ''' Print the accuracy and f1 scores '''
    
    print 'The accuracy of {} model is {}%\n'.format(model_type, decimal_to_percent(accuracy))
    print 'The F1 scores are:\nFalse: {}\nTrue: {}\n'.format(*[decimal_to_percent(score) for score in f1_score])

In [27]:
# train basic models to gauge baseline performance of the models
# Logisitc Regression
# Naive Bayes
# Decision Tree
basic_lr = LogisticRegression()
basic_nb = BernoulliNB()
basic_dt = DecisionTreeClassifier()

for model, model_name in [(basic_lr, 'Logistic Regression'), (basic_nb, 'Naive Bayes'),
                  (basic_dt, 'Decision Tree')]:
    accuracy, f1_score = train_and_evaluate_model(model, tf_train, train_labels, tf_dev, dev_labels)
    print_model_scores(model_name, accuracy, f1_score)


The accuracy of Logistic Regression model is 69.7%

The F1 scores are:
False: 81.02
True: 25.0

The accuracy of Naive Bayes model is 71.39%

The F1 scores are:
False: 82.08
True: 28.99

The accuracy of Decision Tree model is 65.54%

The F1 scores are:
False: 77.43
True: 27.2



Lets attempt to build upon our basic models by introducing preprocessing algorithms for our word vocabulary vectorizer.

In [7]:
def vectorize_with_preprocessor(preprocessor_func):
    ''' Construct term-frequency matrices for use in models '''
    
    # use just the text of the post
    text_column = 'request_text'
    train_text = train_data[text_column]
    dev_text = dev_data[text_column]

    # construct the term-frequency count matrix
    tf_vect = CountVectorizer(preprocessor=preprocessor_func)
    tf = tf_vect.fit(train_text)
    
    # make the matrices global variables for convenience?
    tf_train_pp = tf.transform(train_text)
    tf_dev_pp = tf.transform(dev_text)
    
    #output_file("porter.txt", tf_vect.get_feature_names())
    
    return (tf_train_pp, tf_dev_pp)


In [8]:
# use porter-stemming algorithm to generalize words in the messages
# https://tartarus.org/martin/PorterStemmer/def.txt
def porter_stemming(s):
    # create a new empty string, since s is the entire message not a single word
    new_s = ""
    
    # make everything lowercase
    s = s.lower()
    # eliminate special characters
    s = re.sub('[^A-Za-z0-9\s]+', ' ', s)
    # eliminate repeated numbers
    s = re.sub('([0-9])[0-9]+', r'\1', s)
    
    # iterate through each word in space delimited string
    for w in s.split():
        # calculate the measure, which is the number of vowel to consanant transitions
        m_cnt = 0
        if (re.search('^[^aeiou]*(([aeiou]+[^aeiou]+)+)[aeiou]*$', w)):
            measure_match = re.match('^[^aeiou]*(([aeiou]+[^aeiou]+)+)[aeiou]*$', w)
            # split on vowels to count number of transitions
            consanant_groups = re.split('[aeiou]+', measure_match.group(1))
            m_cnt = len(consanant_groups) - 1
        
        # Step 1a of porter stemming
        if re.search('sses$', w):
            w = re.sub('sses$', 'ss', w)
        elif re.search('ies$', w):
            w = re.sub('ies$', 'i', w)
        elif re.search('ss$', w):
            w = re.sub('ss$', 'i', w)
        elif re.search('s$', w):
            w = re.sub('s$', '', w)
        # Step 1b
        # Porter-Stemming says this should be m_cnt > 0, but doesn't
        # even match their own examples, tweaked to 1 and got slightly better performance
        if (m_cnt > 1 and re.search('eed$', w)):
            w = re.sub('eed$', 'ee', w)
        elif (re.search('.*[aeiou].*(ed|ing)$', w)):
            w = re.sub('(ed|ing)$', '', w)
            # if the second or third rule of 1b is successful, we also
            if (re.search('(at|bl|iz)$', w)):
                w += 'e'
            # ends in double consanant, but no l s or z
            elif (re.search('.*([^aeiou])([^aeiou])$', w)) :
                m = re.match('(.*)([^aeiou])([^aeiou])$', w)
                if (m.group(3) == m.group(2) and
                    m.group(3) != 'l' and
                    m.group(3) != 's' and
                    m.group(3) != 'z') :
                    w = m.group(1) + m.group(2)
            # measure at least one and ends in cvc
            # but second c is not W,X,Y
            elif (m_cnt == 1 and re.search('^.*[^aeiou][aeiou][^aeiouwxy]$', w)) :
                w = re.sub('[^aeiouwxy]$', 'e', w)
        # Step 1c
        if (re.search('.*[aeiou].*y$', w)) :
            w = re.sub('y$', 'i', w)
        # Step 2
        if (m_cnt > 0) :
            if (re.search('ational$', w)) :
                w = re.sub('ational$', 'ate', w)
            elif (re.search('tional$', w)) :
                w = re.sub('tional$', 'tion', w)
            elif (re.search('enci$', w)) :
                w = re.sub('enci$', 'ence', w)
            elif (re.search('anci$', w)) :
                w = re.sub('anci$', 'ance', w)
            elif (re.search('izer$', w)) :
                w = re.sub('izer$', 'ize', w)
            elif (re.search('abli$', w)) :
                w = re.sub('abli$', 'able', w)
            elif (re.search('alli$', w)) :
                w = re.sub('alli$', 'al', w)
            elif (re.search('entli$', w)) :
                w = re.sub('entli$', 'ent', w)
            elif (re.search('eli$', w)) :
                w = re.sub('eli$', 'e', w)
            elif (re.search('ousli$', w)) :
                w = re.sub('ousli$', 'ous', w)
            elif (re.search('ization$', w)) :
                w = re.sub('ization$', 'ize', w)
            elif (re.search('(ation|ator)$', w)) :
                w = re.sub('(ation|ator)$', 'ate', w)
            elif (re.search('alism$', w)) :
                w = re.sub('alism$', 'al', w)
            elif (re.search('iveness$', w)) :
                w = re.sub('iveness$', 'ive', w)
            elif (re.search('fulness$', w)) :
                w = re.sub('fulness$', 'ful', w)
            elif (re.search('ousness$', w)) :
                w = re.sub('ousness$', 'ous', w)
            elif (re.search('aliti$', w)) :
                w = re.sub('aliti$', 'al', w)
            elif (re.search('iviti$', w)) :
                w = re.sub('iviti$', 'ive', w)
            elif (re.search('biliti$', w)) :
                w = re.sub('biliti$', 'ble', w)
        # Step 3
        if (m_cnt > 0) :
            if (re.search('icate$', w)) :
                w = re.sub('icate$', 'ic', w)
            elif (re.search('ative$', w)) :
                w = re.sub('ative$', '', w)
            elif (re.search('alize$', w)) :
                w = re.sub('alize$', 'al', w)
            elif (re.search('iciti$', w)) :
                w = re.sub('iciti$', 'ic', w)
            elif (re.search('ical$', w)) :
                w = re.sub('ical$', 'ic', w)
            elif (re.search('ful$', w)) :
                w = re.sub('ful$', '', w)
            elif (re.search('ness$', w)) :
                w = re.sub('ness$', '', w)
        # Step 4
        if (m_cnt > 1) :
            if (re.search('al$', w)) :
                w = re.sub('al$', '', w)
            elif (re.search('ance$', w)) :
                w = re.sub('ance$', '', w)
            elif (re.search('ence$', w)) :
                w = re.sub('ence$', '', w)
            elif (re.search('er$', w)) :
                w = re.sub('er$', '', w)
            elif (re.search('ic$', w)) :
                w = re.sub('ic$', '', w)
            elif (re.search('able$', w)) :
                w = re.sub('able$', '', w)
            elif (re.search('ible$', w)) :
                w = re.sub('ible$', '', w)
            elif (re.search('ant$', w)) :
                w = re.sub('ant$', '', w)
            elif (re.search('ement$', w)) :
                w = re.sub('ement$', '', w)
            elif (re.search('ent$', w)) :
                w = re.sub('ent$', '', w)
            elif (m_cnt > 1 and re.search('[s|t]ion$', w)) :
                w = re.sub('[s|t]ion$', '', w)
            elif (re.search('ou$', w)) :
                w = re.sub('ou$', '', w)
            elif (re.search('ism$', w)) :
                w = re.sub('ism$', '', w)
            elif (re.search('ate$', w)) :
                w = re.sub('ate$', '', w)
            elif (re.search('iti$', w)) :
                w = re.sub('iti$', '', w)
            elif (re.search('ous$', w)) :
                w = re.sub('ous$', '', w)
            elif (re.search('ive$', w)) :
                w = re.sub('ive$', '', w)
            elif (re.search('ize$', w)) :
                w = re.sub('ize$', '', w)
        # Step 5a
        if (m_cnt > 1 and re.search('e$', w)) :
            w = re.sub('e$', '', w)
        # measure at least one and ends in cvc
        # but second c is not W,X,Y
        elif (m_cnt == 1 and not re.search('^[^aeiou][aeiou][^aeiouwxy]e$', w)) :
            w = re.sub('e$', '', w)
        # Step 5b
        if (m_cnt > 1 and re.search('.*ll$', w)):
            w = re.sub('l$', '', w)
        # end of porter stemming
        # attach the word back to the new string
        new_s += (" " + w)
    return new_s

def output_file(output_name, output_list):
    with open(output_name, 'w') as output:
        for i in output_list:
            output.write(i.encode('UTF-8') + "\n")

In [9]:
def stopword_preprocessor(s):
    # make everything lowercase
    s = s.lower()
    # eliminate repeated numbers
    s = re.sub('([0-9])[0-9]+', r'\1', s)
    # eliminate special characters
    s = re.sub('[^A-Za-z0-9\s\-]+', ' ', s)
    # add a space between consecutive numbers/alpha
    s = re.sub('([0-9])([a-z])', '\1 \2', s)
    s = re.sub('([a-z])([0-9])', '\1 \2', s)
    
    # remove "stop" words which bear no significant meaning in most contexts
    s = re.sub(' ?the ', r' ', s)
    s = re.sub(' ?who ', r' ', s)
    s = re.sub(' ?what ', r' ', s)
    s = re.sub(' ?them ', r' ', s)
    s = re.sub(' ?my ', r' ', s)
    s = re.sub(' ?our ', r' ', s)
    s = re.sub(' ?this ', r' ', s)
    s = re.sub(' ?that ', r' ', s)
    s = re.sub(' ?which ', r' ', s)
    s = re.sub(' ?why ', r' ', s)
    s = re.sub(' ?me ', r' ', s)
    #s = re.sub(' ?i ', r' ', s)
    #s = re.sub(' ?us ', r' ', s)
    #s = re.sub(' ?you ', r' ', s)
    s = re.sub(' ?they ', r' ', s)
    s = re.sub(' ?where ', r' ', s)
    s = re.sub(' ?and ', r' ', s)
    s = re.sub(' ?for ', r' ', s)
    s = re.sub(' ?his ', r' ', s)
    s = re.sub(' ?her ', r' ', s)
    s = re.sub(' ?to ', r' ', s)
    #s = re.sub(' ?of ', r' ', s)
    
    return s

Retrain new basic models, this time we use an updated vocabulary based on our vectorizer preprocessor.

In [28]:
(tf_pp_train, tf_pp_dev) = vectorize_with_preprocessor(stopword_preprocessor)
    
# train basic models to gauge baseline performance of the models
# Logisitc Regression
# Naive Bayes
# Decision Tree
basic_pp_lr = LogisticRegression()
basic_pp_nb = BernoulliNB()
basic_pp_dt = DecisionTreeClassifier()

for model, model_name in [(basic_pp_lr, 'Logistic Regression'), (basic_pp_nb, 'Naive Bayes'),
                  (basic_pp_dt, 'Decision Tree')]:
    accuracy, f1_score = train_and_evaluate_model(model, tf_pp_train, train_labels,
                                                  tf_pp_dev, dev_labels)
    print_model_scores(model_name, accuracy, f1_score)


The accuracy of Logistic Regression model is 70.99%

The F1 scores are:
False: 81.97
True: 25.82

The accuracy of Naive Bayes model is 71.58%

The F1 scores are:
False: 82.32
True: 27.71

The accuracy of Decision Tree model is 67.43%

The F1 scores are:
False: 79.08
True: 26.4



Using the stopword preprocessor, the accuracy of for our models improved:
Logistic Regression model from 69.7% to 70.99%
Naive Bayes model from 71.39% to 71.58%
Decision Tree model from 64.75% to 66.53%

In [29]:
# Vectorize textual data with TF-IDF transformation
def tfidf_vectorizer(preprocessor_func):
    # use just the text of the post
    text_column = 'request_text'
    train_text = train_data[text_column]
    dev_text = dev_data[text_column]

    # construct the term-frequency count matrix
    tfidf_vect = TfidfVectorizer(preprocessor=preprocessor_func)
    tfidf = tfidf_vect.fit(train_text)
    
    # make the matrices global variables for convenience?
    tfidf_train = tf.transform(train_text)
    tfidf_dev = tf.transform(dev_text)
    
    return (tfidf_train, tfidf_dev)

(tfidf_train, tfidf_dev) = vectorize_with_preprocessor(stopword_preprocessor)

basic_tfidf_lr = LogisticRegression()
basic_tfidf_nb = BernoulliNB()
basic_tfidf_dt = DecisionTreeClassifier()

for model, model_name in [(basic_tfidf_lr, 'Logistic Regression'),
                          (basic_tfidf_nb, 'Naive Bayes'),
                          (basic_tfidf_dt, 'Decision Tree')]:
    accuracy, f1_score = train_and_evaluate_model(model,
                                                  tfidf_train, train_labels,
                                                  tfidf_dev, dev_labels)
    print_model_scores(model_name, accuracy, f1_score)


The accuracy of Logistic Regression model is 70.99%

The F1 scores are:
False: 81.97
True: 25.82

The accuracy of Naive Bayes model is 71.58%

The F1 scores are:
False: 82.32
True: 27.71

The accuracy of Decision Tree model is 65.35%

The F1 scores are:
False: 77.3
True: 26.78



Overall, Naive Bayes appears to be performing better than the Logistic Regression and Decision Tree models, although only about 2% better over the Logistic Regression model.

We will attempt to tune our default modeling by gridsearching over various hyperparameters.

In [32]:
# generic grid search which can field different models and hyperparameters
# and output the best results
def gridsearch_model(model, parameters, tf_dev, dev_labels): 
    gridsearch = GridSearchCV(estimator=model,
                              param_grid=parameters)
    gridsearch.fit(tf_dev, dev_labels)
    
    print "Best parameters:"
    print gridsearch.best_params_
    
# c_values for logistic regression
c_values = {'C': [0.001, 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 2.0,
                  5.0, 10.0, 15.0, 20.0, 25.0, 50.0, 75.0, 100.0]}

# logistic regression tuning
#lr_basic = LogisticRegression()
#lr_basic.fit(tf_pp_train, train_labels)
gridsearch_model(basic_pp_lr, c_values, tf_pp_dev, dev_labels)

# use optimal parameters
tuned_lr = LogisticRegression(C=0.001)
tuned_lr.fit(tf_pp_train, train_labels)
lr_accuracy = tuned_lr.score(tf_pp_dev, dev_labels) 
lr_predicted = tuned_lr.predict(tf_pp_dev)
lr_f1_score = metrics.f1_score(lr_predicted, dev_labels, average=None)
print_model_scores("Tuned Logistic Regression", lr_accuracy, lr_f1_score)
#print "Accuracy: ", lr_accuracy
#print "F1 Score: ", lr_f1_score

Best parameters:
{'C': 0.001}
The accuracy of Tuned Logistic Regression model is 74.55%

The F1 scores are:
False: 85.42
True: 0.0



In [31]:
# naive bayes tuning
#nb_basic = BernoulliNB()
#nb_basic.fit(tx_train, train_labels)

params = {'alpha': [0.0001, 0.001, 0.01, 0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 1.0,]}
gridsearch_model(basic_pp_nb, params, tf_pp_dev, dev_labels)

tuned_nb = BernoulliNB(alpha=0.9)
tuned_nb.fit(tf_pp_train, train_labels)
nb_accuracy = tuned_nb.score(tf_pp_dev, dev_labels)
nb_predicted = tuned_nb.predict(tf_pp_dev)
nb_f1_score = metrics.f1_score(nb_predicted, dev_labels, average=None)
print_model_scores("Tuned Naive Bayes", nb_accuracy, nb_f1_score)
#print "Accuracy: ", nb_accuracy
#print "F1 Score: ", nb_f1_score

Best parameters:
{'alpha': 0.9}
The accuracy of Tuned Naive Bayes model is 71.39%

The F1 scores are:
False: 82.06
True: 29.34



Although the tuned hyper parameters produce better overall accuracy, in the case of logistic regression, we are getting 0% results for "True" or successful Pizza Requests.  It possible there is a class imbalance in our training data and the model is merely always guessing False/unsuccessful pizza requests to obtain a higher accuracy.

If there is a class imbalance in our training data, it may be difficult for the model to distinguish between sucessful and unsucessful features.

In [33]:
# doing some exploration to find out if there is a class imbalance in our training data
def class_counts():
    counts = [0, 0]
    for i in train_labels:
        counts[i] += 1
    
    for i in range(len(counts)):
        print i, counts[i]
        
class_counts()

0 2293
1 737


In [37]:
# There is some class imbalance, so lets try weighting
weight_dict = {0: 0.35, 1: .65}
lr_weight = LogisticRegression(C=0.001, class_weight=weight_dict)
lr_weight.fit(tf_pp_train, train_labels)
lrw_accuracy = lr_weight.score(tf_pp_dev, dev_labels) 
lrw_predicted = lr_weight.predict(tf_pp_dev)
lrw_f1_score = metrics.f1_score(lrw_predicted, dev_labels, average=None)
print_model_scores("Weighed Logistic Regression", lrw_accuracy, lrw_f1_score)
#print "Accuracy: ", lrw_accuracy
#print "F1 Score: ", lrw_f1_score

The accuracy of Weighed Logistic Regression model is 70.79%

The F1 scores are:
False: 81.78
True: 26.43

