# Kaggle: Random Acts of Pizza Competition Notebook

## MIDS W207 Project: Brennan Borlaug, Cory Kind, & Divyang Prateek

### *1. Importing relavent libraries/modules*

Start by importing relevant libraries for storing and analyzing data.

In [1]:
import pandas as pd
import json as js
import random
import os
import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics
import datetime
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import *
from sklearn.grid_search import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors.nearest_centroid import NearestCentroid
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import cross_val_score
from sklearn import preprocessing
from sklearn.decomposition import PCA

### *2. Importing and structuring data*

Read in JSON file of training data.

In [2]:
#Read json file as string
dl_dir = '/home/bborlaug/Downloads'
os.chdir(dl_dir)
data2 = open("train.json").read()
#Converts JSON string to a List of Dictionaries
jsondata2 = js.loads(data2)

The RAOP data contains a variety of predictors of different formats. The decision was made to break the predictors up into three sets in order to distribute the workload and parallelize data exploration more effectively. The three predictor types included: 
  
1) Temporal variables - Features that deal with the time at which a request was submitted.  
2) Requester variables - Features that describe the requester at the time of request.  
3) Text variables - Features that describe the textual content of the request.  
  
The following cell breaks the raw data up into the three categories outlined above.  An outcome array containing the binary outcome of the request (i.e. "requester_received_pizza") is also created.  

In [3]:
#temporal_variables (related to time of request)
temporal_variables = ['unix_timestamp_of_request_utc']

#requester variables (features of requester at time of request)
requester_variables = ['requester_account_age_in_days_at_request',
                      'requester_days_since_first_post_on_raop_at_request',
                      'requester_number_of_comments_at_request',
                      'requester_number_of_comments_in_raop_at_request',
                      'requester_number_of_posts_at_request',
                      'requester_number_of_posts_on_raop_at_request',
                      'requester_number_of_subreddits_at_request',
                      'requester_upvotes_minus_downvotes_at_request',
                      'requester_upvotes_plus_downvotes_at_request']

#text variables (content of request)
text_variables = ['giver_username_if_known',
    'request_id',
    'request_text',
    'request_text_edit_aware',
    'request_title',
    'requester_user_flair',
    'requester_username']

#Creating empty data frames to store training data
temporal_elements = pd.DataFrame(np.nan, index = range(len(jsondata2)), columns = temporal_variables)
requester_elements = pd.DataFrame(np.nan, index = range(len(jsondata2)), columns = requester_variables)
text_elements = pd.DataFrame(np.nan, index = range(len(jsondata2)), columns = text_variables)
outcome = pd.DataFrame(np.nan, index = range(len(jsondata2)), columns = ['requester_received_pizza'])

#Print the number of temporal, requester, & text predictors currently included
print "Number of temporal features:", len(temporal_elements.columns)
print "Number of requester features:", len(requester_elements.columns)
print "Number of text features:", len(text_elements.columns)

Number of temporal features: 1
Number of requester features: 9
Number of text features: 7


The next step is to fill these arrays with values from thee raw JSON data. Although the loop approach is less efficient at large scale, this direction was chosen because the number of keys vary between cases in the data.  

In [4]:
for i in range(len(jsondata2)):
    mykeys = jsondata2[i].keys()
    myvals = jsondata2[i].values()
    for key, val in zip(mykeys, myvals):
        if key in temporal_variables:
            idx = temporal_variables.index(key)
            temporal_elements.iloc[i, idx] = val
        if key in requester_variables:
            idx = requester_variables.index(key)
            requester_elements.iloc[i, idx] = val
        if key in text_variables:
            idx = text_variables.index(key)
            text_elements.iloc[i, idx] = val
        if key == 'requester_received_pizza':
            outcome.iloc[i,0] = val

This is a quick check on the size of these arrays - the number of columns should match the number of temporal, requester, and text predictors determined above.

In [5]:
#Output shapes of temporal, requester, text, and outcome arrays
print "Temporal array:"
print temporal_elements.shape
print 

print "Requester array:"
print requester_elements.shape
print

print "Text array:"
print text_elements.shape
print

print "Outcome array:"
print outcome.shape
print

Temporal array:
(4040, 1)

Requester array:
(4040, 9)

Text array:
(4040, 7)

Outcome array:
(4040, 1)



We need a dev set for cross validation. Here we split out a dev set from the provided training data (80/20). There is no need to separate out a test set, since that is provided by Kaggle in a separate JSON file. To compare our results to other competitors in the Kaggle competition, we will use that test set.

In [6]:
random.seed(500)
data_size = len(jsondata2)
dev_indices = random.sample(range(data_size), data_size / 5)
train_indices = list(set(range(data_size)) - set(dev_indices))

#Define training & dev sets
train_temporal_feats = temporal_elements.ix[train_indices,]
train_requester_feats = requester_elements.ix[train_indices,]
train_text_feats = requester_elements.ix[train_indices,]
train_outcomes = outcome.ix[train_indices,].astype(int).sum(axis = 1)

dev_temporal_feats = temporal_elements.ix[dev_indices,]
dev_requester_feats = requester_elements.ix[dev_indices,]
dev_text_feats = text_elements.ix[dev_indices,]
dev_outcomes = outcome.ix[dev_indices,].astype(int).sum(axis = 1)

print "Number of training cases: ", len(train_indices)
print "Number of dev cases: ", len(dev_indices)

Number of training cases:  3232
Number of dev cases:  808


### *3. Generate Baseline*

Before we begin to develop a predictive model, we must establish a baseline to give ourselves a sense of what we are trying to achieve. In our application, the following models would serve as an effective baseline: 
    1. A model that always predicts the most frequent label (in this case, no pizza).
    2. A model that predicts outcomes with the probability of the mean response (e.g. 26.6% of requesters receive a pizza, therefore the model will predict positive outcomes 24.6% of the time).

In [7]:
y = 0
r = 0
outcomes = []
for request in jsondata2:
    if request['requester_received_pizza'] == True:
        y+=1
        r+=1
        outcomes.append(1)
    else:
        r+=1
        outcomes.append(0)
avg = float(y)/float(r)

#Baseline 1
base1 = [0]*len(jsondata2)
c = 0
n = 0
for i, j in zip(base1, outcomes):
    if i == j:
        c+=1
        n+=1
    else:
        n+=1
print 'Baseline 1 Accuracy:', round(float(c)/float(n),4)*100, '%'

#Baseline 2
base2 = np.random.binomial(1, avg, size=len(jsondata2))
c = 0
n = 0
for i, j in zip(base2, outcomes):
    if i == j:
        c+=1
        n+=1
    else:
        n+=1        
print 'Baseline 2 Accuracy:', round(float(c)/float(n),4)*100, '%'

Baseline 1 Accuracy: 75.4 %
Baseline 2 Accuracy: 62.6 %


### *4. Explore Various Classifiers*

Now we'll define a function that calculates the accuracy for a number of machine learning classifiers. These include:  

* Nearest Neighbors (w/ 1-15 neighbors)   
* Logistic Regression  
* Random Forest (w/ 1-30 estimators)  
  
This function can be used to explore the effectiveness of these classifiers on the RAOP data.

In [8]:
def multi_classifiers(train_data,train_labels, dev_data,dev_labels):
    #Nearest Neighbour Classifier\
    knn_scores=[]
    knn_estimators=[]
    for k in range(1,16,2):
        knn = KNeighborsClassifier(algorithm='auto', n_neighbors=k)
        knn.fit(train_data, train_labels)
        knn_preds = knn.predict(dev_data)
        knn_scores.append(round(metrics.accuracy_score(dev_outcomes, knn_preds),4)*100)
        knn_estimators.append(k)
    
    print 'Nearest Neighbors Model Accuracy (',knn_estimators[knn_scores.index(max(knn_scores))],'neighbors):',max(knn_scores),'%'
    
    #Nearest Centroid Classifier
    nc = NearestCentroid()
    nc.fit(train_data, train_labels)
    nc_preds = nc.predict(dev_data)
    
    print 'Nearest Centroid Model Accuracy:',round(metrics.accuracy_score(dev_labels, nc_preds),4)*100,'%'
    
    #Logistic Regression
    reg = LogisticRegression(penalty="l2", C=1.0)
    reg.fit(train_data, train_labels)
    reg_preds = reg.predict(dev_data)
    
    print 'Logistic Regression Model Accuracy:',round(metrics.accuracy_score(dev_labels, reg_preds),4)*100,'%'

    estimators=[]
    accuracies=[]

    for i in range(1,30):
        rf = RandomForestClassifier(n_estimators=i, random_state=99)
        rf.fit(train_requester_feats, train_outcomes)
        rf_preds=rf.predict(dev_requester_feats)
        acc = metrics.accuracy_score(dev_outcomes, rf_preds)
        estimators.append(i)
        accuracies.append(acc)

    max_acc = max(accuracies)
    est = estimators[accuracies.index(max_acc)]
    print 'Random Forests Model Accuracy (',est,'estimators ):',round(max_acc,4)*100,'%'

#### *4-1. KNN*

In [9]:
def KNN(train_feats, train_labels, dev_feats, dev_labels, metric, weights='uniform'):
    k = np.arange(20)+1
    parameters = {'n_neighbors': k}
    knn = KNeighborsClassifier(algorithm='auto', metric=metric, n_neighbors=k)
    knn_clf = GridSearchCV(knn, parameters, cv=10)
    knn_clf.fit(train_feats, train_labels)

    k = knn_clf.best_params_['n_neighbors']

    knn = KNeighborsClassifier(algorithm='auto', metric=metric, n_neighbors=k)
    knn.fit(train_feats, train_labels)
    knn_preds = knn.predict(dev_feats)
    acc = round(metrics.accuracy_score(dev_labels, knn_preds),4)*100

    print 'Nearest Neighbors Model Accuracy (',k,'neighbors ):',acc,'%'

##### *4-1-1. Temporal KNN Model*

In [10]:
#Convert unix timestamp to datetime
train_request_time = temporal_elements.ix[train_indices,"unix_timestamp_of_request_utc"].astype(long)
train_request_dateTime = [datetime.datetime.fromtimestamp(time) for time in train_request_time]
dev_request_time = temporal_elements.ix[dev_indices,"unix_timestamp_of_request_utc"].astype(long)
dev_request_dateTime = [datetime.datetime.fromtimestamp(time) for time in dev_request_time]

Since we are interested in developing a model capable of predicting the outcome of a new RAOP request or generating our own request with a high probability of sucess, a few temporal features will not be useful to us (like year since all requests in the training set were made between 2011 & 2013 and we are more interested in developing a model that will generalize to current and future requests than one that excels at predicting past outcomes). As such we will focus on the following temporal variables:  
* Day of week (dummies)  
* Month (dummies)  
* Hour in day (dummies)

In [11]:
#Create outcome arrays
train_time_label = np.asarray(train_outcomes)
dev_time_label = np.asarray(train_outcomes)

#Create day_of_week variable (train & dev)
train_dow=[]
dev_dow=[]
for day in train_request_dateTime:
    train_dow.append(day.weekday())
    
for day in dev_request_dateTime:
    dev_dow.append(day.weekday())

train_dow = np.asarray(train_dow)
dev_dow = np.asarray(dev_dow)
dow_train_dummies = pd.get_dummies(train_dow.flatten())
dow_train_dummies.columns=['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
dow_dev_dummies = pd.get_dummies(dev_dow.flatten())
dow_dev_dummies.columns=['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
    
#Create month variables
train_month = np.asarray([time.month for time in train_request_dateTime]).reshape((len(train_request_time),1))
dev_month = np.asarray([time.month for time in dev_request_dateTime]).reshape((len(dev_request_time),1))

month_train_dummies = pd.get_dummies(train_month.flatten())
month_train_dummies.columns=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul',
                             'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
month_dev_dummies = pd.get_dummies(dev_month.flatten())
month_dev_dummies.columns=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul',
                             'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

#Create hour_of_day variable (train & dev)
train_hour = np.asarray([time.hour for time in train_request_dateTime]).reshape((len(train_request_time),1))
dev_hour = np.asarray([time.hour for time in train_request_dateTime]).reshape((len(train_request_time), 1))

hour_train_dummies = pd.get_dummies(train_hour.flatten())
hour_dev_dummies = pd.get_dummies(dev_hour.flatten())

#Combine all features
temporal_train_dummies = pd.concat([dow_train_dummies,month_train_dummies,hour_train_dummies], axis=1, join='inner')
temporal_dev_dummies = pd.concat([dow_dev_dummies,month_dev_dummies,hour_dev_dummies], axis=1, join='inner')

#Create temporal feats dataframes
train_temporal = pd.DataFrame(temporal_train_dummies)
dev_temporal = pd.DataFrame(temporal_dev_dummies)

#Check logistic regression coeffs for all feats
lr_clf = LogisticRegression(C=1)
lr_clf = lr_clf.fit(train_temporal,train_outcomes)
print pd.DataFrame(zip(train_temporal, np.transpose(lr_clf.coef_)), 
                   columns=['features', 'coefs']).sort_values('coefs', ascending = False)

   features                coefs
25        6     [0.489011292165]
27        8     [0.356955853137]
37       18     [0.338926368015]
28        9     [0.320022143069]
12      Jun     [0.278230243056]
30       11     [0.275641923736]
31       12     [0.208599447692]
26        7     [0.204593782113]
29       10     [0.159626773017]
10      Apr     [0.122908276108]
35       16    [0.0962172862899]
8       Feb    [0.0375776860636]
33       14    [0.0210668699712]
42       23   [0.00919845490355]
40       21  [-0.00686342452334]
0       Mon   [-0.0133678020977]
3       Thu   [-0.0156104088592]
18      Dec    [-0.029851157286]
13      Jul   [-0.0403539984815]
7       Jan   [-0.0478147771957]
36       17   [-0.0633480554368]
11      May   [-0.0721255055275]
38       19    [-0.073591824871]
9       Mar   [-0.0912851159366]
2       Wed    [-0.104775506167]
39       20     [-0.12784898838]
32       13    [-0.129730459723]
6       Sun    [-0.129990770321]
21        2    [-0.151261970969]
4       Fr

In [12]:
KNN(train_temporal, train_outcomes, dev_temporal, dev_outcomes, metric = "hamming")

Nearest Neighbors Model Accuracy ( 20 neighbors ): 73.39 %


Now reduce the number of temporal features to contain the 20 with the largest effect on the outcome (largest absolute value of regression coeffs) to to see if it improves accuracy.

In [13]:
lr_clf = LogisticRegression(C=1)
lr_clf = lr_clf.fit(train_temporal,train_outcomes)
rm = train_temporal.shape[1]-20
removed = []
for feat in pd.DataFrame(zip(train_temporal, np.transpose(abs(lr_clf.coef_))),
                         columns=['features', 'coefs']).sort_values('coefs', ascending = False)['features'].tail(rm):
    removed.append(feat)
reduced_train_temporal = train_temporal
reduced_dev_temporal = dev_temporal
for feat in removed:
    reduced_train_temporal.drop(feat, axis=1, inplace=True)
    reduced_dev_temporal.drop(feat, axis=1, inplace=True)
KNN(reduced_train_temporal, train_outcomes, reduced_dev_temporal, dev_outcomes, metric="hamming")

Nearest Neighbors Model Accuracy ( 14 neighbors ): 73.76 %


#### *4-1-2. Requester KNN Model*

Since requester features contain different scales, it is a good idea to normalize the data:

In [14]:
#Create new dataframe to store scaled training features
norm_train_req = pd.DataFrame()
norm_train_req['requester_account_age_in_days_at_request'] = preprocessing.scale(train_requester_feats['requester_account_age_in_days_at_request'])
norm_train_req['requester_days_since_first_post_on_raop_at_request'] = preprocessing.scale(train_requester_feats['requester_days_since_first_post_on_raop_at_request'])
norm_train_req['requester_number_of_comments_at_request'] = preprocessing.scale(train_requester_feats['requester_number_of_comments_at_request'])
norm_train_req['requester_number_of_comments_in_raop_at_request'] = preprocessing.scale(train_requester_feats['requester_number_of_comments_in_raop_at_request'])                      
norm_train_req['requester_number_of_posts_at_request'] = preprocessing.scale(train_requester_feats['requester_number_of_posts_at_request'])    
norm_train_req['requester_number_of_posts_on_raop_at_request'] = preprocessing.scale(train_requester_feats['requester_number_of_posts_on_raop_at_request'])                      
norm_train_req['requester_number_of_subreddits_at_request'] = preprocessing.scale(train_requester_feats['requester_number_of_subreddits_at_request'])    
norm_train_req['requester_upvotes_minus_downvotes_at_request'] = preprocessing.scale(train_requester_feats['requester_upvotes_minus_downvotes_at_request'])
norm_train_req['requester_upvotes_plus_downvotes_at_request'] = preprocessing.scale(train_requester_feats['requester_upvotes_plus_downvotes_at_request'])                     

#Create dataframe for scaled dev features
norm_dev_req = pd.DataFrame()
norm_dev_req['requester_account_age_in_days_at_request'] = preprocessing.scale(dev_requester_feats['requester_account_age_in_days_at_request'])
norm_dev_req['requester_days_since_first_post_on_raop_at_request'] = preprocessing.scale(dev_requester_feats['requester_days_since_first_post_on_raop_at_request'])
norm_dev_req['requester_number_of_comments_at_request'] = preprocessing.scale(dev_requester_feats['requester_number_of_comments_at_request'])
norm_dev_req['requester_number_of_comments_in_raop_at_request'] = preprocessing.scale(dev_requester_feats['requester_number_of_comments_in_raop_at_request'])                      
norm_dev_req['requester_number_of_posts_at_request'] = preprocessing.scale(dev_requester_feats['requester_number_of_posts_at_request'])    
norm_dev_req['requester_number_of_posts_on_raop_at_request'] = preprocessing.scale(dev_requester_feats['requester_number_of_posts_on_raop_at_request'])                      
norm_dev_req['requester_number_of_subreddits_at_request'] = preprocessing.scale(dev_requester_feats['requester_number_of_subreddits_at_request'])    
norm_dev_req['requester_upvotes_minus_downvotes_at_request'] = preprocessing.scale(dev_requester_feats['requester_upvotes_minus_downvotes_at_request'])
norm_dev_req['requester_upvotes_plus_downvotes_at_request'] = preprocessing.scale(dev_requester_feats['requester_upvotes_plus_downvotes_at_request'])

In [15]:
#Check requester regression coeffs
lr_clf = LogisticRegression(C=1)
lr_clf = lr_clf.fit(norm_train_req,train_outcomes)
print pd.DataFrame(zip(norm_train_req, np.transpose(lr_clf.coef_)), 
                   columns=['features', 'coefs']).sort_values('coefs', ascending = False)

                                            features                coefs
3    requester_number_of_comments_in_raop_at_request     [0.255744093693]
5       requester_number_of_posts_on_raop_at_request     [0.173259943227]
6          requester_number_of_subreddits_at_request     [0.064288743157]
0           requester_account_age_in_days_at_request    [0.0563296974155]
1  requester_days_since_first_post_on_raop_at_req...    [0.0416914310184]
7       requester_upvotes_minus_downvotes_at_request   [0.00742909288698]
8        requester_upvotes_plus_downvotes_at_request  [-0.00789511737875]
2            requester_number_of_comments_at_request   [-0.0272276840178]
4               requester_number_of_posts_at_request   [-0.0389159537653]


In [16]:
KNN(norm_train_req, train_outcomes, norm_dev_req, dev_outcomes, metric="euclidean")

Nearest Neighbors Model Accuracy ( 20 neighbors ): 74.01 %


#### *4-1-3. Text KNN Model*

In [17]:
#Pull out the request text and outcomes for training and dev sets
train_request_text = text_elements.ix[train_indices, "request_text"]
dev_request_text = text_elements.ix[dev_indices, "request_text"]

In [18]:
#Create CountVectorizer object with no preprocessing, but include basic English stop words
vectorizer = CountVectorizer(analyzer = "word", tokenizer = None, preprocessor = None, stop_words = "english")
train_text_features = vectorizer.fit_transform(train_request_text).toarray()
train_vocab = vectorizer.get_feature_names()

#Use train_vocab to extract the same features from the dev set
vectorizer_dev = CountVectorizer(analyzer = "word", tokenizer= None, preprocessor = None, stop_words = "english", vocabulary = train_vocab)
dev_text_features = vectorizer_dev.fit_transform(dev_request_text).toarray()

print "The size of the vocabulary using this basic model is: ", str(len(train_vocab))

The size of the vocabulary using this basic model is:  10919


In [30]:
#Use PCA to reduce # of dimensions (n_components = 214 from Cory's nb)
pca = PCA(n_components=214)
train_text_reduced = pca.fit_transform(train_text_features)
dev_text_reduced = pca.fit_transform(dev_text_features)
train_text_reduced = pd.DataFrame(train_text_reduced)
dev_text_reduced = pd.DataFrame(dev_text_reduced)

In [31]:
#Scale new reduced text features
cols = list(train_text_reduced.columns)
train_text_reduced_scaled = pd.DataFrame()

for col in cols:
    col_zscore = str(col) + '_zscore'
    train_text_reduced_scaled[col_zscore] = (train_text_reduced[col] - train_text_reduced[col].mean())/train_text_reduced[col].std(ddof=0)
    
cols = list(dev_text_reduced.columns)
dev_text_reduced_scaled = pd.DataFrame()

for col in cols:
    col_zscore = str(col) + '_zscore'
    dev_text_reduced_scaled[col_zscore] = (dev_text_reduced[col] - dev_text_reduced[col].mean())/dev_text_reduced[col].std(ddof=0)

In [32]:
KNN(train_text_reduced_scaled, train_outcomes, dev_text_reduced_scaled, dev_outcomes, metric="euclidean")

Nearest Neighbors Model Accuracy ( 10 neighbors ): 73.89 %


#### *4-1-4. Combine the three KNN models*

In [35]:
#Optimal temporal KNN model
knn = KNeighborsClassifier(algorithm='auto', metric="hamming", n_neighbors=20)
knn.fit(train_temporal, train_outcomes)
temp_knn_preds = knn.predict(dev_temporal)

#Optimal requester KNN model
knn = KNeighborsClassifier(algorithm='auto', metric="euclidean", n_neighbors=20)
knn.fit(norm_train_req, train_outcomes)
req_knn_preds = knn.predict(norm_dev_req)

#Optimal text KNN model
knn = KNeighborsClassifier(algorithm='auto', metric="euclidean", n_neighbors=10)
knn.fit(train_text_reduced_scaled, train_outcomes)
text_knn_preds = knn.predict(dev_text_reduced_scaled)

#Combine the three models
KNN_preds=[]
for i,j,k in zip(temp_knn_preds, req_knn_preds, text_knn_preds):
    pred=[]
    pred.append(i), pred.append(j), pred.append(k)
    KNN_preds.append(max(set(pred)))
    
print 'Combined KNN Model Accuracy:', round(metrics.accuracy_score(dev_outcomes, KNN_preds),4)*100, '%'


Combine KNN Model Accuracy: 73.89 %


Combining the three models showed no improvement...

#### *4-2. Logistic Regression*

#### *4-3. Random Forest*

In [36]:
def Random_Forest(train_feats, train_labels, dev_feats, dev_labels):
    estimators=[]
    accuracies=[]

    for i in range(1,30):
        rf = RandomForestClassifier(n_estimators=i, random_state=99)
        rf.fit(train_feats, train_labels)
        rf_preds=rf.predict(dev_feats)
        acc = metrics.accuracy_score(dev_labels, rf_preds)
        estimators.append(i)
        accuracies.append(acc)

    max_acc = max(accuracies)
    est = estimators[accuracies.index(max_acc)]
    print 'Random Forests Model Accuracy (',est,'estimators ):',round(max_acc,4)*100,'%'

#### *4-3-1. Temporal Random Forest Model*

In [39]:
print 'Temporal Random Forest Model:'
Random_Forest(train_temporal, train_outcomes, dev_temporal, dev_outcomes)

Temporal Random Forest Model:
Random Forests Model Accuracy ( 3 estimators ): 73.51 %


#### *4-3-2. Requester Random Forest Model*

In [47]:
print 'Requester Random Forest Model (simple):'
Random_Forest(train_requester_feats, train_outcomes, dev_requester_feats, dev_outcomes)

Requester Random Forest Model (simple):
Random Forests Model Accuracy ( 26 estimators ): 73.64 %


In [46]:
#Convert normalized feats to binary feats
#Start with training set
bin_train_req_1 = []
bin_train_req_2 = []
bin_train_req_3 = []
bin_train_req_4 = []
bin_train_req_5 = []
bin_train_req_6 = []
bin_train_req_7 = []
bin_train_req_8 = []
bin_train_req_9 = []

for a,b,c,d,e,f,g,h,i in zip(norm_train_req['requester_account_age_in_days_at_request'],
                             norm_train_req['requester_days_since_first_post_on_raop_at_request'],
                             norm_train_req['requester_number_of_comments_at_request'],
                             norm_train_req['requester_number_of_comments_in_raop_at_request'],
                             norm_train_req['requester_number_of_posts_at_request'],
                             norm_train_req['requester_number_of_posts_on_raop_at_request'],
                             norm_train_req['requester_number_of_subreddits_at_request'],
                             norm_train_req['requester_upvotes_minus_downvotes_at_request'],
                             norm_train_req['requester_upvotes_plus_downvotes_at_request']):
    if a>0:
        bin_train_req_1.append(1)
    else:
        bin_train_req_1.append(0)
        
    if b>0:
        bin_train_req_2.append(1)
    else:
        bin_train_req_2.append(0)
        
    if c>0:
        bin_train_req_3.append(1)
    else:
        bin_train_req_3.append(0)
        
    if d>0:
        bin_train_req_4.append(1)
    else:
        bin_train_req_4.append(0)
        
    if e>0:
        bin_train_req_5.append(1)
    else:
        bin_train_req_5.append(0)
        
    if f>0:
        bin_train_req_6.append(1)
    else:
        bin_train_req_6.append(0)
        
    if g>0:
        bin_train_req_7.append(1)
    else:
        bin_train_req_7.append(0)
        
    if h>0:
        bin_train_req_8.append(1)
    else:
        bin_train_req_8.append(0)
        
    if i>0:
        bin_train_req_9.append(1)
    else:
        bin_train_req_9.append(0)
                             

bin_train_req = pd.DataFrame()

bin_train_req['requester_account_age_in_days_at_request'] = bin_train_req_1
bin_train_req['requester_days_since_first_post_on_raop_at_request'] = bin_train_req_2
bin_train_req['requester_number_of_comments_at_request'] = bin_train_req_3
bin_train_req['requester_number_of_comments_in_raop_at_request'] = bin_train_req_4
bin_train_req['requester_number_of_posts_at_request'] = bin_train_req_5
bin_train_req['requester_number_of_posts_on_raop_at_request'] = bin_train_req_6
bin_train_req['requester_number_of_subreddits_at_request'] = bin_train_req_7
bin_train_req['requester_upvotes_minus_downvotes_at_request'] = bin_train_req_8
bin_train_req['requester_upvotes_plus_downvotes_at_request'] = bin_train_req_9

#Now the dev set
bin_dev_req_1 = []
bin_dev_req_2 = []
bin_dev_req_3 = []
bin_dev_req_4 = []
bin_dev_req_5 = []
bin_dev_req_6 = []
bin_dev_req_7 = []
bin_dev_req_8 = []
bin_dev_req_9 = []

for a,b,c,d,e,f,g,h,i in zip(norm_dev_req['requester_account_age_in_days_at_request'],
                             norm_dev_req['requester_days_since_first_post_on_raop_at_request'],
                             norm_dev_req['requester_number_of_comments_at_request'],
                             norm_dev_req['requester_number_of_comments_in_raop_at_request'],
                             norm_dev_req['requester_number_of_posts_at_request'],
                             norm_dev_req['requester_number_of_posts_on_raop_at_request'],
                             norm_dev_req['requester_number_of_subreddits_at_request'],
                             norm_dev_req['requester_upvotes_minus_downvotes_at_request'],
                             norm_dev_req['requester_upvotes_plus_downvotes_at_request']):
    if a>0:
        bin_dev_req_1.append(1)
    else:
        bin_dev_req_1.append(0)
        
    if b>0:
        bin_dev_req_2.append(1)
    else:
        bin_dev_req_2.append(0)
        
    if c>0:
        bin_dev_req_3.append(1)
    else:
        bin_dev_req_3.append(0)
        
    if d>0:
        bin_dev_req_4.append(1)
    else:
        bin_dev_req_4.append(0)
        
    if e>0:
        bin_dev_req_5.append(1)
    else:
        bin_dev_req_5.append(0)
        
    if f>0:
        bin_dev_req_6.append(1)
    else:
        bin_dev_req_6.append(0)
        
    if g>0:
        bin_dev_req_7.append(1)
    else:
        bin_dev_req_7.append(0)
        
    if h>0:
        bin_dev_req_8.append(1)
    else:
        bin_dev_req_8.append(0)
        
    if i>0:
        bin_dev_req_9.append(1)
    else:
        bin_dev_req_9.append(0)
                             

bin_dev_req = pd.DataFrame()

bin_dev_req['requester_account_age_in_days_at_request'] = bin_dev_req_1
bin_dev_req['requester_days_since_first_post_on_raop_at_request'] = bin_dev_req_2
bin_dev_req['requester_number_of_comments_at_request'] = bin_dev_req_3
bin_dev_req['requester_number_of_comments_in_raop_at_request'] = bin_dev_req_4
bin_dev_req['requester_number_of_posts_at_request'] = bin_dev_req_5
bin_dev_req['requester_number_of_posts_on_raop_at_request'] = bin_dev_req_6
bin_dev_req['requester_number_of_subreddits_at_request'] = bin_dev_req_7
bin_dev_req['requester_upvotes_minus_downvotes_at_request'] = bin_dev_req_8
bin_dev_req['requester_upvotes_plus_downvotes_at_request'] = bin_dev_req_9

print 'Requester Random Forest Model (binary, scaled):'
Random_Forest(bin_train_req, train_outcomes, bin_dev_req, dev_outcomes)

Requester Random Forest Model (binary, scaled):
Random Forests Model Accuracy ( 3 estimators ): 74.01 %


In [42]:
#Combine Temporal & Requester Random Forest Models
temp_req_train = pd.concat([train_temporal, bin_train_req], axis=1, join='inner')
temp_req_dev = pd.concat([dev_temporal, bin_dev_req], axis=1, join='inner')

print 'Temporal + Requester Random Forest Model:'
Random_Forest(temp_req_train, train_outcomes, temp_req_dev, dev_outcomes)

Temporal + Requester Random Forest Model:
Random Forests Model Accuracy ( 2 estimators ): 71.41 %


#### *4-3-3. Text Random Forest Model*

In [43]:
print 'Text Random Forest Model (All Features):'
Random_Forest(train_text_features, train_outcomes, dev_text_features, dev_outcomes)

Random Forests Model Accuracy ( 12 estimators ): 73.27 %


In [44]:
print 'Text Random Forest Model (Reduced Features):'
Random_Forest(train_text_reduced, train_outcomes, dev_text_reduced, dev_outcomes)

Text Random Forest Model (Reduced Features):
Random Forests Model Accuracy ( 14 estimators ): 71.41 %


#### *4-3-4. Combined Random Forest Model*

In [50]:
train_text_features = pd.DataFrame(train_text_features)
dev_text_features = pd.DataFrame(dev_text_features)

temp_req_txt_train = pd.concat([temp_req_train, train_text_features], axis=1, join='inner')
temp_req_txt_dev = pd.concat([temp_req_dev, dev_text_features], axis=1, join='inner')

print 'Temporal + Requester + Text Random Forest Model:'
Random_Forest(temp_req_txt_train, train_outcomes, temp_req_txt_dev, dev_outcomes)

Temporal + Requester + Text Random Forest Model:
Random Forests Model Accuracy ( 16 estimators ): 73.64 %
