##W207 Final Project - Kaggle Competition

###Random Acts of Pizza

This dataset includes 5671 requests collected from the Reddit community Random Acts of Pizza between December 8, 2010 and September 29, 2013 (retrieved on September 30, 2013). All requests ask for the same thing: a free pizza. The outcome of each request -- whether its author received a pizza or not -- is known. Meta-data includes information such as: time of the request, activity of the requester, community-age of the requester, etc.

Each JSON entry corresponds to one request (the first and only request by the requester on Random Acts of Pizza). We have removed fields from the test set which would not be available at the time of posting.

**Data fields**

"giver_username_if_known": Reddit username of giver if known, i.e. the person satisfying the request ("N/A" otherwise).

"number_of_downvotes_of_request_at_retrieval": Number of downvotes at the time the request was collected.

"number_of_upvotes_of_request_at_retrieval": Number of upvotes at the time the request was collected.

"post_was_edited": Boolean indicating whether this post was edited (from Reddit).

"request_id": Identifier of the post on Reddit, e.g. "t3_w5491".

"request_number_of_comments_at_retrieval": Number of comments for the request at time of retrieval.

"request_text": Full text of the request.

"request_text_edit_aware": Edit aware version of "request_text". We use a set of rules to strip edited comments indicating the success of the request such as "EDIT: Thanks /u/foo, the pizza was delicous".

"request_title": Title of the request.

"requester_account_age_in_days_at_request": Account age of requester in days at time of request.

"requester_account_age_in_days_at_retrieval": Account age of requester in days at time of retrieval.

"requester_days_since_first_post_on_raop_at_request": Number of days between the requester's first post on RAOP and this request (zero if the requester has never posted before on RAOP).

"requester_days_since_first_post_on_raop_at_retrieval": Number of days between the requester's first post on RAOP and time of retrieval.

"requester_number_of_comments_at_request": Total number of comments on Reddit by requester at time of request.

"requester_number_of_comments_at_retrieval": Total number of comments on Reddit by requester at time of retrieval.

"requester_number_of_comments_in_raop_at_request": Total number of comments in RAOP by requester at time of request.

"requester_number_of_comments_in_raop_at_retrieval": Total number of comments in RAOP by requester at time of retrieval.

"requester_number_of_posts_at_request": Total number of posts on Reddit by requester at time of request.

"requester_number_of_posts_at_retrieval": Total number of posts on Reddit by requester at time of retrieval.

"requester_number_of_posts_on_raop_at_request": Total number of posts in RAOP by requester at time of request.

"requester_number_of_posts_on_raop_at_retrieval": Total number of posts in RAOP by requester at time of retrieval.

"requester_number_of_subreddits_at_request": The number of subreddits in which the author had already posted at the time of request.

"requester_received_pizza": Boolean indicating the success of the request, i.e., whether the requester received pizza.

"requester_subreddits_at_request": The list of subreddits in which the author had already posted at the time of request.

"requester_upvotes_minus_downvotes_at_request": Difference of total upvotes and total downvotes of requester at time of request.

"requester_upvotes_minus_downvotes_at_retrieval": Difference of total upvotes and total downvotes of requester at time of retrieval.

"requester_upvotes_plus_downvotes_at_request": Sum of total upvotes and total downvotes of requester at time of request.

"requester_upvotes_plus_downvotes_at_retrieval": Sum of total upvotes and total downvotes of requester at time of retrieval.

"requester_user_flair": Users on RAOP receive badges (Reddit calls them flairs), small pictures shown next to their username. In our data set the user flair is either None (neither given nor received pizza, N=4282), "shroom" (received pizza, but not given, N=1306), or "PIF" (pizza given after having received, N=83).

"requester_username": Reddit username of requester.

"unix_timestamp_of_request": Unix timestamp of request (supposedly in timezone of user, but in most cases it is equal to the UTC timestamp -- which is incorrect since most RAOP users are from the USA).

"unix_timestamp_of_request_utc": Unix timestamp of request in UTC.
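The UTC timestamp field can be turned into calendar features such as hour of day or day of week. The snippet below is a minimal sketch using only the standard library; the helper name `utc_time_features` and the sample timestamp are illustrative assumptions, not part of the dataset tooling.

```python
from datetime import datetime, timezone

def utc_time_features(unix_timestamp):
    """Convert a Unix timestamp (seconds) into a few UTC calendar features."""
    moment = datetime.fromtimestamp(unix_timestamp, tz=timezone.utc)
    return {
        "hour": moment.hour,          # 0-23, in UTC
        "weekday": moment.weekday(),  # Monday is 0, Sunday is 6
        "year": moment.year,
    }

# Hypothetical sample: 13:00 UTC on December 8, 2010 (the first day covered by the data).
features = utc_time_features(1291766400 + 13 * 3600)
print(features)
```

Because the description says the non-UTC field is unreliable, this sketch works from the UTC value only.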

In [121]:
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV  # sklearn.grid_search was removed in scikit-learn 0.20
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.decomposition import PCA
from sklearn.neural_network import BernoulliRBM
from sklearn.svm import SVC
from sklearn import metrics
from textblob import TextBlob

In [122]:
traindf = pd.read_json("train.json")
testdf = pd.read_json("test.json")

In [123]:
traindf.dtypes

giver_username_if_known                                  object
number_of_downvotes_of_request_at_retrieval               int64
number_of_upvotes_of_request_at_retrieval                 int64
post_was_edited                                           int64
request_id                                               object
request_number_of_comments_at_retrieval                   int64
request_text                                             object
request_text_edit_aware                                  object
request_title                                            object
requester_account_age_in_days_at_request                float64
requester_account_age_in_days_at_retrieval              float64
requester_days_since_first_post_on_raop_at_request      float64
requester_days_since_first_post_on_raop_at_retrieval    float64
requester_number_of_comments_at_request                   int64
requester_number_of_comments_at_retrieval                 int64
requester_number_of_comments_in_raop_at_

In [124]:
testdf.dtypes

giver_username_if_known                                object
request_id                                             object
request_text_edit_aware                                object
request_title                                          object
requester_account_age_in_days_at_request              float64
requester_days_since_first_post_on_raop_at_request    float64
requester_number_of_comments_at_request                 int64
requester_number_of_comments_in_raop_at_request         int64
requester_number_of_posts_at_request                    int64
requester_number_of_posts_on_raop_at_request            int64
requester_number_of_subreddits_at_request               int64
requester_subreddits_at_request                        object
requester_upvotes_minus_downvotes_at_request            int64
requester_upvotes_plus_downvotes_at_request             int64
requester_username                                     object
unix_timestamp_of_request                               int64
unix_tim

Hmm... the test data has far fewer features than the training data. Let's start by exploring only the features present in both, plus whether the requester received pizza. Also, let's split the data into numerical features and text-based features.
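One way to find the shared features without typing the column names by hand is to compare the two frames' columns directly. This is a minimal sketch on toy frames (`toy_train` and `toy_test` are hypothetical stand-ins for `traindf` and `testdf`), not the approach the cells below take.

```python
import pandas as pd

# Toy stand-ins for traindf / testdf; the real frames come from train.json and test.json.
toy_train = pd.DataFrame({
    "request_title": ["a", "b"],
    "requester_number_of_posts_at_request": [1, 2],
    "requester_number_of_posts_at_retrieval": [3, 4],
    "requester_received_pizza": [True, False],
})
toy_test = pd.DataFrame({
    "request_title": ["c"],
    "requester_number_of_posts_at_request": [5],
})

# Columns present in both frames, in the order they appear in the training frame.
shared_columns = [c for c in toy_train.columns if c in toy_test.columns]

# Split the shared columns by dtype: numeric features vs. everything else (text, ids).
numeric_columns = (
    toy_train[shared_columns].select_dtypes(include="number").columns.tolist()
)
text_columns = [c for c in shared_columns if c not in numeric_columns]

print(numeric_columns, text_columns)
```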

In [125]:
# DataFrame.as_matrix was removed in pandas 1.0; select the columns, then convert
train_data = traindf[['giver_username_if_known',
                      'request_id',
                      'request_text_edit_aware',
                      'request_title',
                      'requester_account_age_in_days_at_request',
                      'requester_days_since_first_post_on_raop_at_request',
                      'requester_number_of_comments_at_request',
                      'requester_number_of_comments_in_raop_at_request',
                      'requester_number_of_posts_at_request',
                      'requester_number_of_posts_on_raop_at_request',
                      'requester_number_of_subreddits_at_request',
                      'requester_upvotes_minus_downvotes_at_request',
                      'requester_upvotes_plus_downvotes_at_request',
                      'requester_username',
                      'unix_timestamp_of_request_utc']].to_numpy()

# double brackets keep the labels 2-D, shape (n, 1), as the slicing below expects
train_labels = traindf[['requester_received_pizza']].to_numpy()

dev_data = train_data[3000:,:]
dev_labels = train_labels[3000:,:]

train_data = train_data[:3000,:]
train_labels = train_labels[:3000,:]

test_data = testdf.to_numpy()

# just numerical data
train_data_numerical = traindf[['requester_account_age_in_days_at_request',
                                'requester_days_since_first_post_on_raop_at_request',
                                'requester_number_of_comments_at_request',
                                'requester_number_of_comments_in_raop_at_request',
                                'requester_number_of_posts_at_request',
                                'requester_number_of_posts_on_raop_at_request',
                                'requester_number_of_subreddits_at_request',
                                'requester_upvotes_minus_downvotes_at_request',
                                'requester_upvotes_plus_downvotes_at_request',
                                'unix_timestamp_of_request_utc']].to_numpy()

dev_data_numerical = train_data_numerical[3000:,:]
train_data_numerical = train_data_numerical[:3000,:]

Prediction time:

Start with using Random Forest on numerical features.

In [126]:
forests = [5, 10, 15, 25, 50, 100, 150, 200, 250, 500]
for trees in forests:    
    rf = RandomForestClassifier(random_state=42, n_estimators=trees)
    rf.fit(train_data_numerical, np.ravel(train_labels))
    print('Accuracy of Random Forest with', trees,'trees:',rf.score(dev_data_numerical, np.ravel(dev_labels)))

Accuracy of Random Forest with 5 trees: 0.716346153846
Accuracy of Random Forest with 10 trees: 0.7375
Accuracy of Random Forest with 15 trees: 0.724038461538
Accuracy of Random Forest with 25 trees: 0.731730769231
Accuracy of Random Forest with 50 trees: 0.734615384615
Accuracy of Random Forest with 100 trees: 0.7375
Accuracy of Random Forest with 150 trees: 0.733653846154
Accuracy of Random Forest with 200 trees: 0.731730769231
Accuracy of Random Forest with 250 trees: 0.732692307692
Accuracy of Random Forest with 500 trees: 0.735576923077
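Accuracy figures like these are easier to judge next to a majority-class baseline, which the notebook does not report. The following is a minimal sketch on toy labels; `majority_baseline_accuracy` and the toy arrays are hypothetical, and on the real data the function would be called with `train_labels` and `dev_labels`.

```python
import numpy as np

def majority_baseline_accuracy(train_labels, dev_labels):
    """Accuracy on dev_labels of always predicting the most common training label."""
    train_labels = np.ravel(train_labels).astype(bool)
    dev_labels = np.ravel(dev_labels).astype(bool)
    majority = train_labels.mean() >= 0.5  # True if most training requests succeeded
    return float(np.mean(dev_labels == majority))

# Toy (n, 1) label arrays shaped like the notebook's train_labels / dev_labels.
toy_train_labels = np.array([[False], [False], [True], [False]])
toy_dev_labels = np.array([[False], [True], [False], [False], [True]])

baseline = majority_baseline_accuracy(toy_train_labels, toy_dev_labels)
print(baseline)
```

A model only adds value to the extent that it beats this number on the same dev split.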


73.75% accuracy with 10 or 100 trees, not bad. Let's dig into the request title.

In [127]:
req_reason = traindf[['request_title']].to_numpy()

In [128]:
req = req_reason.tolist()

In [129]:
subjectivities = []
polarities = []
for r in range(len(req)):
    tmp = TextBlob(req[r][0])
    subjectivities.append(tmp.sentiment.subjectivity)
    polarities.append(tmp.sentiment.polarity)

In [130]:
new_train_data_numerical = traindf[['requester_account_age_in_days_at_request',
                                    'requester_days_since_first_post_on_raop_at_request',
                                    'requester_number_of_comments_at_request',
                                    'requester_number_of_comments_in_raop_at_request',
                                    'requester_number_of_posts_at_request',
                                    'requester_number_of_posts_on_raop_at_request',
                                    'requester_number_of_subreddits_at_request',
                                    'requester_upvotes_minus_downvotes_at_request',
                                    'requester_upvotes_plus_downvotes_at_request',
                                    'unix_timestamp_of_request_utc']].to_numpy()

In [131]:
s = np.array(subjectivities)
p = np.array(polarities)

In [132]:
sp = np.column_stack((s,p))

In [133]:
new_train_data_num = np.c_[new_train_data_numerical, sp]

In [134]:
sp_dev_data = sp[3000:,:]
sp_train_data = sp[:3000,:]

new_dev_data_numerical = new_train_data_num[3000:,:]
new_train_data_num = new_train_data_num[:3000,:]
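The last few cells append two sentiment columns to the numerical matrix and then split the rows at a fixed index. The same pattern on a small toy array makes the resulting shapes easy to check; all names and values here are illustrative, not taken from the dataset.

```python
import numpy as np

base = np.arange(12, dtype=float).reshape(6, 2)  # 6 rows, 2 existing features
subjectivity = np.linspace(0.0, 1.0, 6)          # one score per row
polarity = np.linspace(-1.0, 1.0, 6)

# Stack the two 1-D score vectors side by side, then append them as new columns.
extra = np.column_stack((subjectivity, polarity))  # shape (6, 2)
combined = np.c_[base, extra]                      # shape (6, 4)

# Positional split: the first `split` rows for training, the rest for dev.
split = 4
train_part, dev_part = combined[:split], combined[split:]

print(combined.shape, train_part.shape, dev_part.shape)
```

The split is purely positional, so it relies on every array keeping the original row order.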

Now let's try Random Forest again on:  
1) Just the request title sentiments  
2) The new numerical array, which joins the previous array with the sentiments

In [135]:
forests = [5, 10, 15, 25, 50, 100, 150, 200, 250, 500]
for trees in forests:    
    rf = RandomForestClassifier(random_state=42, n_estimators=trees)
    rf.fit(sp_train_data, np.ravel(train_labels))
    print('Accuracy of Random Forest with', trees,'trees:',rf.score(sp_dev_data, np.ravel(dev_labels)))

Accuracy of Random Forest with 5 trees: 0.724038461538
Accuracy of Random Forest with 10 trees: 0.727884615385
Accuracy of Random Forest with 15 trees: 0.724038461538
Accuracy of Random Forest with 25 trees: 0.731730769231
Accuracy of Random Forest with 50 trees: 0.729807692308
Accuracy of Random Forest with 100 trees: 0.732692307692
Accuracy of Random Forest with 150 trees: 0.733653846154
Accuracy of Random Forest with 200 trees: 0.735576923077
Accuracy of Random Forest with 250 trees: 0.732692307692
Accuracy of Random Forest with 500 trees: 0.734615384615


Accuracy from the sentiment data alone topped out at 73.56% with 200 trees, which is not an improvement by itself. Let's see if adding the sentiment data to the previous numerical data helps.

In [136]:
forests = [5, 10, 15, 25, 50, 100, 150, 200, 250, 500]
for trees in forests:    
    rf = RandomForestClassifier(random_state=42, n_estimators=trees)
    rf.fit(new_train_data_num, np.ravel(train_labels))
    print('Accuracy of Random Forest with', trees,'trees:',rf.score(new_dev_data_numerical, np.ravel(dev_labels)))

Accuracy of Random Forest with 5 trees: 0.724038461538
Accuracy of Random Forest with 10 trees: 0.732692307692
Accuracy of Random Forest with 15 trees: 0.727884615385
Accuracy of Random Forest with 25 trees: 0.733653846154
Accuracy of Random Forest with 50 trees: 0.739423076923
Accuracy of Random Forest with 100 trees: 0.736538461538
Accuracy of Random Forest with 150 trees: 0.739423076923
Accuracy of Random Forest with 200 trees: 0.740384615385
Accuracy of Random Forest with 250 trees: 0.740384615385
Accuracy of Random Forest with 500 trees: 0.739423076923


Hmm... adding sentiment on the request title only boosted accuracy to 74.038% (at 200 and 250 trees), up from 73.75%: an increase of only about 0.3 percentage points.
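To put a gain of this size in context, it can be converted into a count of dev examples and compared with the binomial standard error of the accuracy estimate. This is a rough sketch: the dev-set size of 1040 is an assumption (it is consistent with the printed accuracies but is not stated in the notebook), and `accuracy_gain` is a hypothetical helper.

```python
import math

def accuracy_gain(before, after, n_dev):
    """Return the gain in percentage points and in extra correct predictions."""
    gain_points = (after - before) * 100
    extra_correct = round((after - before) * n_dev)
    return gain_points, extra_correct

N_DEV = 1040  # assumed dev-set size, not confirmed by the notebook

gain_points, extra_correct = accuracy_gain(0.7375, 0.740384615385, N_DEV)

# Binomial standard error of an accuracy p measured on n examples: sqrt(p * (1 - p) / n).
standard_error = math.sqrt(0.7375 * (1 - 0.7375) / N_DEV)

print(gain_points, extra_correct, standard_error)
```

Under that assumption the gain is about three additional correct predictions, which is small relative to one standard error of roughly 1.4 percentage points.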

Let's see if adding sentiment for the request text helps.

In [137]:
req_texts = traindf[['request_text_edit_aware']].to_numpy()
req_text = req_texts.tolist()

In [138]:
subjectivities = []
polarities = []
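for r in range(len(req_text)):
    # Note: req holds the request titles, so this scores the titles again;
    # use req_text[r][0] here to score the request body instead.
    tmp = TextBlob(req[r][0])
    subjectivities.append(tmp.sentiment.subjectivity)
    polarities.append(tmp.sentiment.polarity)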

In [139]:
s = np.array(subjectivities)
p = np.array(polarities)

In [140]:
sp = np.column_stack((s,p))

In [141]:
sp_dev_data = sp[3000:,:]
sp_train_data = sp[:3000,:]

In [142]:
forests = [5, 10, 15, 25, 50, 100, 150, 200, 250, 500]
for trees in forests:    
    rf = RandomForestClassifier(random_state=42, n_estimators=trees)
    rf.fit(sp_train_data, np.ravel(train_labels))
    print('Accuracy of Random Forest with', trees,'trees:',rf.score(sp_dev_data, np.ravel(dev_labels)))

Accuracy of Random Forest with 5 trees: 0.724038461538
Accuracy of Random Forest with 10 trees: 0.727884615385
Accuracy of Random Forest with 15 trees: 0.724038461538
Accuracy of Random Forest with 25 trees: 0.731730769231
Accuracy of Random Forest with 50 trees: 0.729807692308
Accuracy of Random Forest with 100 trees: 0.732692307692
Accuracy of Random Forest with 150 trees: 0.733653846154
Accuracy of Random Forest with 200 trees: 0.735576923077
Accuracy of Random Forest with 250 trees: 0.732692307692
Accuracy of Random Forest with 500 trees: 0.734615384615


In [143]:
new_train_data_num = np.c_[new_train_data_num, sp_train_data]
new_dev_data_numerical = np.c_[new_dev_data_numerical, sp_dev_data]

In [144]:
forests = [5, 10, 15, 25, 50, 100, 150, 200, 250, 500]
for trees in forests:    
    rf = RandomForestClassifier(random_state=42, n_estimators=trees)
    rf.fit(new_train_data_num, np.ravel(train_labels))
    print('Accuracy of Random Forest with', trees,'trees:',rf.score(new_dev_data_numerical, np.ravel(dev_labels)))

Accuracy of Random Forest with 5 trees: 0.694230769231
Accuracy of Random Forest with 10 trees: 0.725961538462
Accuracy of Random Forest with 15 trees: 0.714423076923
Accuracy of Random Forest with 25 trees: 0.728846153846
Accuracy of Random Forest with 50 trees: 0.735576923077
Accuracy of Random Forest with 100 trees: 0.734615384615
Accuracy of Random Forest with 150 trees: 0.730769230769
Accuracy of Random Forest with 200 trees: 0.7375
Accuracy of Random Forest with 250 trees: 0.739423076923
Accuracy of Random Forest with 500 trees: 0.7375


Uh oh. Adding these request text sentiment scores without any preprocessing slightly hurt our predictive accuracy: the best score dropped from 74.04% to 73.94%.

Let's explore some of the common vocabulary in request text and how it correlates to success.

In [160]:
common_words = {'pizza':0, 'please':0, 'hungry':0, 'kids':0, 'money':0}
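for r in range(len(req_text)):
    # Note: req holds the request titles, so these counts describe the titles;
    # use req_text[r][0] here to count words in the request body instead.
    tmp = TextBlob(req[r][0])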
    for word in common_words.keys():
        common_words[word] += tmp.word_counts[word]
common_words

{'hungry': 478, 'kids': 54, 'money': 249, 'pizza': 1577, 'please': 136}

Cool, now that we have word counts, do any of these words align with success?

In [171]:
def has_word(array, word):
    word_list = []
    for item in range(len(array)):
        tmp = TextBlob(array[item][0].lower())
        if tmp.word_counts[word] > 0:
            word_list.append(1)
        else:
            word_list.append(0)
    return np.array(word_list)

has_pizza = has_word(req_text, 'pizza')
has_please = has_word(req_text, 'please')
has_hungry = has_word(req_text, 'hungry')
has_kids = has_word(req_text, 'kids')
has_money = has_word(req_text, 'money')
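For comparison, the same kind of 0/1 indicator can be built without TextBlob by matching whole words with a regular expression. This is a sketch under stated assumptions: unlike the notebook's `has_word`, it takes a flat list of strings, and its tokenization may differ from TextBlob's in edge cases.

```python
import re

def has_word(texts, word):
    """Return a list of 0/1 flags: 1 where `word` occurs as a whole token in the text."""
    pattern = re.compile(r"\b" + re.escape(word.lower()) + r"\b")
    return [1 if pattern.search(text.lower()) else 0 for text in texts]

# Toy requests, invented for illustration.
toy_texts = [
    "Please help, my kids are hungry",
    "Pleased to meet you",
    "No MONEY until Friday",
]

flags_please = has_word(toy_texts, "please")
flags_money = has_word(toy_texts, "money")

print(flags_please, flags_money)
```

The word-boundary anchors keep "pleased" from counting as "please", which a plain substring test would not.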

In [189]:
print('has pizza matches train_labels', np.sum(has_pizza[0:3000]==np.ravel(train_labels)) / float(train_labels.shape[0]), '% of the time.')
print('has please matches train_labels', np.sum(has_please[0:3000]==np.ravel(train_labels)) / float(train_labels.shape[0]), '% of the time.')
print('has hungry matches train_labels', np.sum(has_hungry[0:3000]==np.ravel(train_labels)) / float(train_labels.shape[0]), '% of the time.')
print('has kids matches train_labels', np.sum(has_kids[0:3000]==np.ravel(train_labels)) / float(train_labels.shape[0]), '% of the time.')
print('has money matches train_labels', np.sum(has_money[0:3000]==np.ravel(train_labels)) / float(train_labels.shape[0]), '% of the time.')

has pizza matches train_labels 0.454666666667 % of the time.
has please matches train_labels 0.706 % of the time.
has hungry matches train_labels 0.696333333333 % of the time.
has kids matches train_labels 0.741333333333 % of the time.
has money matches train_labels 0.645666666667 % of the time.
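Agreement between a word flag and the label can look high simply because both are mostly zero: a rare word "matches" every unsuccessful request that lacks it. A more direct check compares the success rate of requests with the word against those without it. The sketch below uses toy arrays; `success_rate_by_presence` is a hypothetical helper.

```python
import numpy as np

def success_rate_by_presence(has_word, labels):
    """Return (success rate where the word is present, success rate where it is absent)."""
    has_word = np.asarray(has_word).astype(bool)
    labels = np.ravel(labels).astype(bool)
    with_word = float(labels[has_word].mean()) if has_word.any() else float("nan")
    without_word = float(labels[~has_word].mean()) if (~has_word).any() else float("nan")
    return with_word, without_word

# Toy data: the word is rare and most requests fail.
toy_has_word = np.array([1, 1, 0, 0, 0, 0, 0, 0])
toy_labels = np.array([1, 0, 0, 0, 1, 0, 0, 0])

agreement = float(np.mean(toy_has_word.astype(bool) == toy_labels.astype(bool)))
with_word, without_word = success_rate_by_presence(toy_has_word, toy_labels)

print(agreement, with_word, without_word)
```

In this toy case the agreement is 75%, yet the informative signal is the gap between the two conditional success rates.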


Cool! We see that pizza, the most commonly used word, is not very predictive of success (its presence matches the label only about 45% of the time), but the other key words, which match 65-74% of the time, may be useful. Let's plug the non-pizza words into our numerical data matrix and run the Random Forest model again.

In [190]:
key_words = np.column_stack((has_please, has_hungry, has_kids, has_money))

In [192]:
keywords_dev_data = key_words[3000:,:]
keywords_train_data = key_words[:3000,:]

In [193]:
new_train_data_num = np.c_[new_train_data_num, keywords_train_data]
new_dev_data_numerical = np.c_[new_dev_data_numerical, keywords_dev_data]

In [199]:
forests = [5, 10, 15, 25, 50, 100, 150, 200, 250, 500]
for trees in forests:    
    rf = RandomForestClassifier(random_state=42, n_estimators=trees)
    rf.fit(keywords_train_data, np.ravel(train_labels))
    print('Accuracy of Random Forest with', trees,'trees:',rf.score(keywords_dev_data, np.ravel(dev_labels)))

Accuracy of Random Forest with 5 trees: 0.754807692308
Accuracy of Random Forest with 10 trees: 0.754807692308
Accuracy of Random Forest with 15 trees: 0.753846153846
Accuracy of Random Forest with 25 trees: 0.753846153846
Accuracy of Random Forest with 50 trees: 0.754807692308
Accuracy of Random Forest with 100 trees: 0.754807692308
Accuracy of Random Forest with 150 trees: 0.754807692308
Accuracy of Random Forest with 200 trees: 0.754807692308
Accuracy of Random Forest with 250 trees: 0.754807692308
Accuracy of Random Forest with 500 trees: 0.754807692308
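An accuracy that barely moves with the number of trees is worth checking against the predictions themselves, since a classifier that predicts one class for nearly every request can score well on imbalanced labels. Whether that happened here cannot be read off the accuracy alone. The sketch below counts confusion-matrix cells for toy predictions; `confusion_counts` is a hypothetical helper that would be fed `rf.predict(keywords_dev_data)` and the flattened `dev_labels` on the real data.

```python
from collections import Counter

def confusion_counts(predictions, labels):
    """Count true/false positives and negatives for binary predictions."""
    counts = Counter()
    for predicted, actual in zip(predictions, labels):
        if predicted and actual:
            counts["tp"] += 1
        elif predicted and not actual:
            counts["fp"] += 1
        elif not predicted and actual:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts

# Toy labels with 25% positives, and a model that always predicts "no pizza".
toy_labels = [0, 0, 0, 1, 0, 1, 0, 0]
always_negative = [0] * len(toy_labels)

counts = confusion_counts(always_negative, toy_labels)
accuracy = (counts["tp"] + counts["tn"]) / len(toy_labels)
recall = counts["tp"] / (counts["tp"] + counts["fn"])

print(dict(counts), accuracy, recall)
```

Here the always-negative model reaches 75% accuracy with zero recall, which is why the counts matter alongside the score.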


Alright! Accuracy boosted to 75.48% using just the key words! Let's try folding those into the existing numerical data.

In [196]:
forests = [5, 10, 15, 25, 50, 100, 150, 200, 250, 500]
for trees in forests:    
    rf = RandomForestClassifier(random_state=42, n_estimators=trees)
    rf.fit(new_train_data_num, np.ravel(train_labels))
    print('Accuracy of Random Forest with', trees,'trees:',rf.score(new_dev_data_numerical, np.ravel(dev_labels)))

Accuracy of Random Forest with 5 trees: 0.704807692308
Accuracy of Random Forest with 10 trees: 0.726923076923
Accuracy of Random Forest with 15 trees: 0.727884615385
Accuracy of Random Forest with 25 trees: 0.744230769231
Accuracy of Random Forest with 50 trees: 0.75
Accuracy of Random Forest with 100 trees: 0.743269230769
Accuracy of Random Forest with 150 trees: 0.748076923077
Accuracy of Random Forest with 200 trees: 0.738461538462
Accuracy of Random Forest with 250 trees: 0.740384615385
Accuracy of Random Forest with 500 trees: 0.740384615385


Well, that didn't help... maybe we should go back and take out the request_text sentiment scores.

In [216]:
new_train_data_num2 = np.copy(new_train_data_num)

In [217]:
new_train_data_num2.shape

(3000, 18)

In [218]:
train_data_numerical.shape

(3000, 10)

In [219]:
# The original numerical data had 10 features; we then added 2 request_title sentiment columns,
# followed by 2 request_text sentiment columns. To drop the request_text scores, delete the
# 13th and 14th columns (indices 12 and 13) of the new numerical array.
new_train_data_num2 = np.delete(new_train_data_num2,[12,13], axis=1)

In [221]:
new_train_data_num2.shape

(3000, 16)
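Hard-coded column indices are easy to get wrong once several feature groups have been appended. One option is to keep a parallel list of feature names and look the indices up. This is a sketch on a zero-filled toy matrix; the names in `feature_names` are hypothetical labels, since the notebook's arrays carry no column names.

```python
import numpy as np

# Hypothetical labels mirroring the order in which the columns were appended.
feature_names = (
    ["num_%d" % i for i in range(10)]
    + ["title_subjectivity", "title_polarity"]
    + ["text_subjectivity", "text_polarity"]
    + ["has_please", "has_hungry", "has_kids", "has_money"]
)
matrix = np.zeros((5, len(feature_names)))  # toy stand-in for new_train_data_num

# Look up the indices of the columns to drop instead of typing them by hand.
drop = ["text_subjectivity", "text_polarity"]
drop_idx = [feature_names.index(name) for name in drop]

reduced = np.delete(matrix, drop_idx, axis=1)
kept_names = [name for name in feature_names if name not in drop]

print(drop_idx, reduced.shape)
```

The same `drop_idx` can then be applied to the dev matrix so both arrays stay aligned.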

In [222]:
new_dev_data_numerical2 = np.copy(new_dev_data_numerical)
new_dev_data_numerical2 = np.delete(new_dev_data_numerical2,[12,13], axis=1)

In [223]:
forests = [5, 10, 15, 25, 50, 100, 150, 200, 250, 500]
for trees in forests:    
    rf = RandomForestClassifier(random_state=42, n_estimators=trees)
    rf.fit(new_train_data_num2, np.ravel(train_labels))
    print('Accuracy of Random Forest with', trees,'trees:',rf.score(new_dev_data_numerical2, np.ravel(dev_labels)))

Accuracy of Random Forest with 5 trees: 0.694230769231
Accuracy of Random Forest with 10 trees: 0.734615384615
Accuracy of Random Forest with 15 trees: 0.730769230769
Accuracy of Random Forest with 25 trees: 0.738461538462
Accuracy of Random Forest with 50 trees: 0.743269230769
Accuracy of Random Forest with 100 trees: 0.741346153846
Accuracy of Random Forest with 150 trees: 0.736538461538
Accuracy of Random Forest with 200 trees: 0.739423076923
Accuracy of Random Forest with 250 trees: 0.7375
Accuracy of Random Forest with 500 trees: 0.733653846154


So far there doesn't seem to be much difference between the numerical data, the sentiment of the request, and the presence of key words when it comes to predicting whether a request succeeds. Every approach so far has landed at only 73-75% accuracy.