# Project 4: Final Project - Random Acts of Pizza
### Predicting altruism through free pizza

This project originates from the Kaggle competition https://www.kaggle.com/c/random-acts-of-pizza. We will build a model to predict which requests will receive pizza and which will not. The competition provides a dataset of 5671 textual requests for pizza from the Reddit community Random Acts of Pizza, together with their outcome (successful/unsuccessful) and meta-data. The data was collected and graciously shared by Althoff et al. (http://www.timalthoff.com/).

**Reference Paper:**
Tim Althoff, Cristian Danescu-Niculescu-Mizil, Dan Jurafsky. How to Ask for a Favor: A Case Study on the Success of Altruistic Requests, Proceedings of ICWSM, 2014. (http://cs.stanford.edu/~althoff/raop-dataset/altruistic_requests_icwsm.pdf)


## Approach 


**Step 1:  Exploratory Data Analysis**

**Step 2:  Create a Baseline Model**

**Step 3:  Feature Engineering**

- Preprocessing data 
    - data cleansing 
    - data transformation

- Use other meta-data from the data set, such as,
    - request_text_edit_aware
    - request_title
    - requester_account_age_in_days_at_request
    - requester_days_since_first_post_on_raop_at_request
    - requester_number_of_comments_at_request
    - requester_number_of_comments_in_raop_at_request
    - requester_number_of_posts_at_request
    - requester_number_of_posts_on_raop_at_request
    - requester_number_of_subreddits_at_request
    - requester_subreddits_at_request
    - requester_upvotes_minus_downvotes_at_request
    - requester_upvotes_plus_downvotes_at_request
    - requester_username
    - unix_timestamp_of_request
    - unix_timestamp_of_request_utc
    - other features, including:
        - number of requests made by the same user
        - number of requests fulfilled or % of requests fulfilled, etc
    
- Generate new features from the data set, such as:
    - Politeness
    - Evidentiality
    - Reciprocity
    - Sentiment
    - Length, etc.
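
The feature ideas in Step 3 can be sketched in a few lines of plain Python. This is a minimal illustration, not the method of Althoff et al.: the keyword patterns, the helper names `text_features` and `requests_per_user`, and the sample rows are all assumptions made for this example. Only the field names `requester_username` and `request_text_edit_aware` come from the data set. Sentiment is left out because it needs a lexicon or a trained model.

```python
import re
from collections import Counter

# Illustrative keyword patterns only; these are assumptions made for this
# sketch, not the lexicons used in the reference paper.
POLITENESS = re.compile(r"\b(please|thank you|thanks?|appreciate)\b", re.I)
RECIPROCITY = re.compile(r"\b(pay it forward|pay it back|return the favor|repay)\b", re.I)
EVIDENCE = re.compile(r"https?://\S+|\bimgur\b", re.I)


def text_features(text):
    """Map one request text to a dict of simple numeric features."""
    words = re.findall(r"[A-Za-z']+", text)
    return {
        "n_words": len(words),
        "n_politeness": len(POLITENESS.findall(text)),
        "has_reciprocity": int(bool(RECIPROCITY.search(text))),
        "has_evidence": int(bool(EVIDENCE.search(text))),
    }


def requests_per_user(rows):
    """Count how many requests each username posted."""
    return Counter(row["requester_username"] for row in rows)


# Invented rows that mimic two fields of the RAOP JSON.
rows = [
    {"requester_username": "alice",
     "request_text_edit_aware": "Please help, I will pay it forward. Thank you!"},
    {"requester_username": "bob",
     "request_text_edit_aware": "pizza pls"},
    {"requester_username": "alice",
     "request_text_edit_aware": "Still hungry, thanks. Proof: http://imgur.com/abc"},
]

per_user = requests_per_user(rows)
feature_rows = []
for row in rows:
    feats = text_features(row["request_text_edit_aware"])
    feats["n_requests_by_user"] = per_user[row["requester_username"]]
    feature_rows.append(feats)
    print(row["requester_username"], feats)
```

A "% of requests fulfilled" feature would be built the same way, but it depends on the outcome label, so it should be computed from training rows only to avoid leaking the answer into the dev or test features.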

**Step 4:  Algorithm / Model Selection**

- Generative Models 
    - Naive Bayes 
- Discriminative Models
    - Logistic Regression  
- Neural Network 
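
One possible way to compare the candidate models is to wrap each in a pipeline and score it with cross-validation. The sketch below assumes scikit-learn 0.18 or later; the tiny corpus, its labels, and the names `candidates` and `results` are invented for illustration, and the scores it prints say nothing about the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny invented corpus so the sketch runs without train.json.
texts = [
    "lost my job and the kids are hungry, will pay it forward",
    "paycheck is late, rent is due, anything helps, thank you",
    "student with no money until friday, proof of account attached",
    "just moved and the fridge is empty, would appreciate a pizza",
    "pizza pls",
    "want pizza now",
    "bored and hungry lol",
    "free pizza anyone",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

candidates = {
    "naive_bayes": MultinomialNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

results = {}
for name, model in candidates.items():
    # Keeping the vectorizer inside the pipeline refits it on each
    # training fold, so no vocabulary leaks from the held-out fold.
    pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("clf", model)])
    scores = cross_val_score(pipeline, texts, labels, cv=2, scoring="accuracy")
    results[name] = scores.mean()

for name, score in results.items():
    print("{}: mean CV accuracy = {:.3f}".format(name, score))
```

A neural network could be added to `candidates` in the same way, for example with `sklearn.neural_network.MLPClassifier`, although that choice is an assumption and not something the plan above specifies.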


**Step 5:  Error Analysis & Optimization**


**Step 6:  Final Model**




## Load Data

In [157]:
# This tells matplotlib not to try opening a new window for each plot.
%matplotlib inline

# General libraries.
import re
import numpy as np
import matplotlib.pyplot as plt

# SK-learn libraries for learning.
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# SK-learn libraries for evaluation.
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import classification_report

# SK-learn libraries for feature extraction from text.
from sklearn.feature_extraction.text import *

# SK-learn libraries for model selection 
from sklearn.model_selection import train_test_split

# json libraries to parse json file
import json
from pandas.io.json import json_normalize

In [158]:
# read json file
train_json = json.load(open('train.json'))

# normalize data and put in a dataframe
train_json_df = json_normalize(train_json)

# read json file
test_json = json.load(open('test.json'))

# normalize data and put in a dataframe
test_json_df = json_normalize(test_json)

In [159]:
# for the baseline, we just use the "request_text_edit_aware" as input features 
# but we can explore other metadata (such as "request_title", 
# "number_of_downvotes_of_request_at_retrieval", "number_of_upvotes_of_request_at_retrieval", etc) 
# to add to the input features 
train_data = train_json_df.request_text_edit_aware.values

# convert the requester_received_pizza field to 0 and 1
# 0 means the user doesn't receive pizza & 1 means the user receives pizza
train_labels = train_json_df.requester_received_pizza.astype(int).values

# split the training data into training data and dev data 
train_data, dev_data, train_labels, dev_labels = \
            train_test_split(train_data, train_labels, test_size=0.2, random_state=42)
    
    
# apply same logic as train_data to test_data
test_data = test_json_df.request_text_edit_aware.values

print('training data shape:', train_data.shape)
print('dev data shape:', dev_data.shape)
print('test data shape:', test_data.shape)

training data shape: (3232,)
dev data shape: (808,)
test data shape: (1631,)


## Step 1:  Exploratory Data Analysis

In [162]:
num_examples=5
    
# for each label, display a number of examples 
for i in range(2):

    # find the indexes of the corresponding label 
    index = np.where(train_labels == i)

    for j in range(num_examples):

        # print the training data for that label
        # (label 1 = received pizza, label 0 = did not)
        if i == 1:
            title = "This message receives pizza"
        else:
            title = "This message doesn't receive pizza"
        print("-----------------------------------------------")
        print("{} : Sample {}".format(title, j+1))
        print("-----------------------------------------------")
        print(train_data[index[0][j]])
        print("\n")


-----------------------------------------------
This message doesn't receive pizza : Sample 1
-----------------------------------------------
My power was out for about 3 hours earlier this afternoon. I keep trying to watch DVD's (Twister...I mean when in Rome, right?) but as soon as I get halfway in the power either flickers on and off or stays off for an extended period of time. I don't feel up to going out for food either since I've been sick for about 3 days now and Hurricane Irene is being a bitch...

Thank you all!


-----------------------------------------------
This message doesn't receive pizza : Sample 2
-----------------------------------------------
I'm lucky that internet comes part and parcel with my rent. I work retail grocery and summer hours are murder on the wallet. I have to hustle managers just to get a good 20-25 hours a week, when I used to push the part time hour limit. 

I hate begging, but my stomach is scoffing at me and calling me a proud asshole. So here I am :). Just on

## Step 2: Create a Baseline Model 

In [166]:
# use standard TfidfVectorizer to transform the training data and dev data 
vectorizer = TfidfVectorizer()
train_bag_of_words = vectorizer.fit_transform(train_data)
dev_bag_of_words = vectorizer.transform(dev_data)

# create MultinomialNB
nb = MultinomialNB()
    
# candidate values for the smoothing parameter alpha
parameters = {'alpha': np.linspace(0.01, 10, 100)}

# create GridSearchCV to find the best alpha
clf = GridSearchCV(nb, parameters)
    
# run the grid search; the best MultinomialNB is refit on all training data
clf.fit(train_bag_of_words, train_labels)

pred_dev_labels = clf.predict(dev_bag_of_words)

print(clf.best_params_)
# note: for single-label classification, micro-averaged F1 equals accuracy
print("f1 score using TfidfVectorizer & MultinomialNB = {}".format(metrics.f1_score(dev_labels, pred_dev_labels, average='micro')))


{'alpha': 0.81727272727272726}
f1 score using TfidfVectorizer & MultinomialNB = 0.7388613861386139
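
Because each request has exactly one label, the micro-averaged F1 reported above is the same number as the dev-set accuracy. The plain-Python sketch below illustrates why that matters when classes are imbalanced. The labels are invented and the helper functions are written out by hand for this example; they are not scikit-learn calls.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_for_class(y_true, y_pred, positive):
    """F1 treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if 2 * tp + fp + fn == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


def micro_f1(y_true, y_pred):
    """Pool TP/FP/FN over all classes, then compute a single F1."""
    pairs = list(zip(y_true, y_pred))
    tp = fp = fn = 0
    for c in set(y_true) | set(y_pred):
        tp += sum(t == c and p == c for t, p in pairs)
        fp += sum(t != c and p == c for t, p in pairs)
        fn += sum(t == c and p != c for t, p in pairs)
    return 2 * tp / (2 * tp + fp + fn)


# Invented labels: 3 of 4 requests fail, and the model always predicts 0.
y_true = [0, 0, 0, 1, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0]

print(accuracy(y_true, y_pred))         # 0.75
print(micro_f1(y_true, y_pred))         # 0.75
print(f1_for_class(y_true, y_pred, 1))  # 0.0
```

The output above does not show whether the baseline behaves like this toy model. Reporting the F1 of the positive class (`average='binary'` in `metrics.f1_score`) or printing `classification_report(dev_labels, pred_dev_labels)` would make that visible.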
