**W207 Final Project Team 2 - Random Acts of Pizza**

Team Members: Ahmad Azizi, Jordan Thomas, Prashant K Dhingra, Qi Yao

**Problem Description:**

This dataset includes 5671 requests collected from the Reddit community Random Acts of Pizza between December 8, 2010 and September 29, 2013 (retrieved on September 30, 2013). All requests ask for the same thing: a free pizza. The outcome of each request -- whether its author received a pizza or not -- is known. Metadata include the time of the request, the requester's activity, the requester's community age, and so on.

The objective of the competition was to create an algorithm capable of predicting which requests will garner a cheesy (but sincere!) act of kindness.

The data are stored in JSON format. Each JSON entry corresponds to one request (the first and only request by the requester on Random Acts of Pizza).
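
As a rough illustration of that layout, the sketch below parses two made-up records. The field names are ones used later in this notebook; the values are invented for the example and are not taken from the dataset.

```python
import json

# Two hypothetical records that mimic the shape of entries in train.json.
# Field names appear in the dataset; the values here are invented.
raw = '''
[
  {"request_title": "[Request] Hungry student",
   "request_text_edit_aware": "Exams all week.",
   "requester_subreddits_at_request": ["AskReddit", "Random_Acts_Of_Pizza"],
   "requester_received_pizza": true},
  {"request_title": "[Request] Payday is Friday",
   "request_text_edit_aware": "",
   "requester_subreddits_at_request": [],
   "requester_received_pizza": false}
]
'''

records = json.loads(raw)  # one dict per request

# Share of requests that received a pizza (the label we want to predict)
success_rate = sum(r["requester_received_pizza"] for r in records) / len(records)
print(len(records), success_rate)
```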

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import json
import seaborn as sns
import matplotlib.pyplot as plt

import re
from sklearn.feature_extraction.text import CountVectorizer
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import r2_score
from xgboost import XGBClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

from sklearn.metrics import accuracy_score
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [2]:
#load json train and test data, print the shape
#Train has more columns than test
with open('../input/random-acts-of-pizza/train.json') as train_data:
    traindata = json.load(train_data)
train = pd.json_normalize(traindata)

with open('../input/random-acts-of-pizza/test.json') as test_data:
    testdata = json.load(test_data)
test = pd.json_normalize(testdata)

print('Train shape: ', train.shape)
print('Test shape: ', test.shape)

**Exploratory Analysis**

In [3]:
train.head()
#request_text_edit_aware, requester_subreddits_at_request and request_title are the text features
#most of the remaining columns are numeric features

In [4]:
train.columns

In [5]:
test.columns
#the test columns are only a subset of the train columns

In [6]:
#no null value for what we are using
train.isnull().sum()

In [7]:
#no null on test data as well
test.isnull().sum()

In [8]:
train_sns = train[['requester_account_age_in_days_at_request', 'requester_days_since_first_post_on_raop_at_request', 
                     'requester_number_of_comments_at_request', 'requester_number_of_comments_in_raop_at_request', 
                     'requester_number_of_posts_at_request', 'requester_number_of_posts_on_raop_at_request', 
                     'requester_number_of_subreddits_at_request',  
                     'requester_upvotes_minus_downvotes_at_request', 'requester_upvotes_plus_downvotes_at_request', 
                             'requester_received_pizza']]

In [9]:
#the classes do not look easily separable on these features
sns.pairplot(train_sns,hue='requester_received_pizza')
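
A numeric summary can complement the pair plot. The sketch below is only an illustration on a small invented frame (the column names exist in the dataset, the values do not): it reports the class share and the per-class median of one feature, which is one simple way to check how much the two classes overlap.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the dataset
toy = pd.DataFrame({
    "requester_number_of_posts_at_request": [0, 5, 12, 3, 40, 7],
    "requester_received_pizza": [False, False, True, False, True, False],
})

# Convert the boolean label to 0/1, as the notebook does for train_label
toy["label"] = toy["requester_received_pizza"].astype(int)

# Fraction of rows in each class
class_share = toy["label"].value_counts(normalize=True)

# Median of the feature within each class
medians = toy.groupby("label")["requester_number_of_posts_at_request"].median()

print(class_share)
print(medians)
```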

**Feature Engineering:**

Based on the initial observation, we have mostly numeric columns, and three text columns.

Numeric features and text features (categorical features) will need to be processed separately.
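
One way to express "process numeric and text columns separately" in a single object is scikit-learn's `ColumnTransformer`. This is a minimal sketch on invented rows, not the approach used in the cells below, which build each block by hand and concatenate them.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy rows; the column names exist in the dataset
toy = pd.DataFrame({
    "requester_number_of_posts_at_request": [0, 5, 12, 3],
    "requester_account_age_in_days_at_request": [0.0, 120.5, 800.0, 33.2],
    "request_title": ["pizza please", "hungry student",
                      "pizza for family", "broke until payday"],
})

numeric_cols = ["requester_number_of_posts_at_request",
                "requester_account_age_in_days_at_request"]

preprocess = ColumnTransformer([
    # Standardize the numeric columns
    ("num", StandardScaler(), numeric_cols),
    # A single column name (not a list) hands the vectorizer 1-D text
    ("title", CountVectorizer(), "request_title"),
])

features = preprocess.fit_transform(toy)
print(features.shape)  # 2 numeric columns + one column per distinct title word
```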



In [10]:
#create label set and convert true/false value to 1/0
train_label = train['requester_received_pizza'].astype('int')

In [11]:
#train_A is the subset of relevant features minus the text features; train_B and train_C are each one text feature
#the text features need to be engineered or trained separately
train_A = train[['requester_account_age_in_days_at_request', 'requester_days_since_first_post_on_raop_at_request', 
                     'requester_number_of_comments_at_request', 'requester_number_of_comments_in_raop_at_request', 
                     'requester_number_of_posts_at_request', 'requester_number_of_posts_on_raop_at_request', 
                     'requester_number_of_subreddits_at_request',  
                     'requester_upvotes_minus_downvotes_at_request', 'requester_upvotes_plus_downvotes_at_request']]
train_B = train['request_text_edit_aware']
train_C = train['request_title']

#repeat for test
test_A = test[['requester_account_age_in_days_at_request', 'requester_days_since_first_post_on_raop_at_request', 
                     'requester_number_of_comments_at_request', 'requester_number_of_comments_in_raop_at_request', 
                     'requester_number_of_posts_at_request', 'requester_number_of_posts_on_raop_at_request', 
                     'requester_number_of_subreddits_at_request',  
                     'requester_upvotes_minus_downvotes_at_request', 'requester_upvotes_plus_downvotes_at_request']]
test_B = test['request_text_edit_aware']
test_C = test['request_title']


In [12]:
#for column requester_subreddits_at_request, convert it to three values, either no data, has random acts of pizza, or others
#make it more meaningful for the model
train_sub = []
sub = train['requester_subreddits_at_request']
for i in sub:    
    if len(i) == 0:
        i='none'
    elif 'Random_Acts_Of_Pizza' in i:
        i='flagged'
    else:
        i='nonflagged'
    train_sub.append(i)
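
The same three-way bucketing can be written as a small function, which makes it easy to reuse on the test set. This is a sketch; `bucket_subreddits` is a name introduced here for illustration.

```python
def bucket_subreddits(subreddits):
    """Map a list of subreddit names to 'none', 'flagged', or 'nonflagged'."""
    if len(subreddits) == 0:
        return 'none'          # requester had no subreddit history
    if 'Random_Acts_Of_Pizza' in subreddits:
        return 'flagged'       # requester was already active on RAOP
    return 'nonflagged'        # history exists, but not on RAOP

# Hypothetical example lists
example_lists = [[], ['AskReddit', 'Random_Acts_Of_Pizza'], ['funny', 'pics']]
example_buckets = [bucket_subreddits(s) for s in example_lists]
print(example_buckets)  # ['none', 'flagged', 'nonflagged']
```

On the real data this could be applied with `train['requester_subreddits_at_request'].apply(bucket_subreddits)`, and likewise for `test`.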

In [13]:
#function to pre-process the text data
#uses lemmatization; other combinations (stop words, stemming vs lemmatization, digit removal) can also be tried
def pre_processing(data):
    stop_words = set(stopwords.words('english')) #use stop words
    #data = re.sub(r'(\d+)', '', data) #remove digits
    data = re.sub(r'\W+',' ', data) #replace runs of special characters with a space
    word_tokens = word_tokenize(data.lower()) #lowercase and tokenize the string
    filtered_sentence = [w for w in word_tokens if w not in stop_words] #remove stop words
    #porter_stemmer = PorterStemmer() #stem the words 
    #stemmed_words = [porter_stemmer.stem(word) for word in filtered_sentence]
    wordnet_lemmatizer = WordNetLemmatizer() #lemmatize the words; did better than stemming here
    lemm_words = [wordnet_lemmatizer.lemmatize(word) for word in filtered_sentence]
    return ' '.join(lemm_words)
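
To see what the regex and stop-word steps do without downloading any NLTK data, here is a simplified stand-in. `simple_clean` and `TOY_STOP_WORDS` are hypothetical names, the stop list is a tiny hand-written substitute for NLTK's, and lemmatization is left out.

```python
import re

# Tiny hand-written stop list, standing in for NLTK's English stop words
TOY_STOP_WORDS = {"i", "a", "the", "for", "my", "and", "is"}

def simple_clean(text):
    text = re.sub(r'\W+', ' ', text)   # runs of non-word characters become one space
    tokens = text.lower().split()      # lowercase, then split on whitespace
    return ' '.join(t for t in tokens if t not in TOY_STOP_WORDS)

cleaned = simple_clean("I'd LOVE a pizza for my kids!!! :)")
print(cleaned)  # d love pizza kids
```

Note how the apostrophe in "I'd" is replaced by a space, leaving a stray "d" token. Because `pre_processing` applies the same `\W+` substitution, contractions there are split in the same way.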

In [14]:
#pre-process and vectorize the text data
#note: vectorizer_proc is refit on the titles below, so the body-text vocabulary is saved first
vectorizer_proc = CountVectorizer(preprocessor = pre_processing)
train_B_proc = vectorizer_proc.fit_transform(train_B)
train_B_proc = pd.DataFrame(train_B_proc.toarray())
train_B_columns = vectorizer_proc.get_feature_names_out() #get_feature_names() was removed in scikit-learn 1.2
train_C_proc = vectorizer_proc.fit_transform(train_C)
train_C_proc = pd.DataFrame(train_C_proc.toarray())
#train_C_columns = vectorizer_proc.get_feature_names_out()
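
When the test set is eventually scored, its text has to be mapped onto the vocabulary learned from the training text. The sketch below shows that pattern on invented sentences: `fit_transform` on train, `transform` only on test. It uses a plain `CountVectorizer` without the custom preprocessor to stay self-contained.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sentences standing in for request text
train_texts = ["pizza for my family", "hungry student needs pizza"]
test_texts = ["student wants pizza tonight"]

vec = CountVectorizer()
train_counts = vec.fit_transform(train_texts)  # learns the vocabulary from training text only
test_counts = vec.transform(test_texts)        # reuses that vocabulary; unseen words are dropped

print(train_counts.shape, test_counts.shape)
print(sorted(vec.vocabulary_))
```

Both matrices have the same number of columns, so a model fitted on `train_counts` can score `test_counts` directly. Words such as "wants" and "tonight" never appeared in training and contribute nothing.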

In [15]:
#standardize the numeric value columns
sc = StandardScaler()
train_A_norm = sc.fit_transform(train_A)
train_A_norm = pd.DataFrame(train_A_norm)
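
The same fit-on-train rule applies to scaling. A minimal sketch with invented values: the scaler learns the mean and standard deviation from the training rows and reuses them for held-out rows, so no information from the held-out data leaks into the features.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values for a single numeric feature
train_values = np.array([[0.0], [10.0], [20.0]])
test_values = np.array([[10.0], [40.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_values)  # mean and std come from train only
test_scaled = scaler.transform(test_values)        # same mean and std applied to test

print(train_scaled.ravel())
print(test_scaled.ravel())
```

In this notebook the scaler is fitted on all of `train_A` before the train/test split in In [18], so the held-out 10% influences the scaling slightly. Fitting after the split would avoid that.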

In [16]:
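#use numbers to name each column so they are all unique
#the column index can be used to look up the text feature name if needed for further analysis
#ranges are derived from the actual widths rather than hard-coded; numbering starts at 3,
#presumably because the three one-hot subreddit columns take indices 0-2
n_A, n_B, n_C = train_A_norm.shape[1], train_B_proc.shape[1], train_C_proc.shape[1]
train_A_norm.columns = [str(i) for i in range(3, 3 + n_A)]
train_B_proc.columns = [str(i) for i in range(3 + n_A, 3 + n_A + n_B)]
train_C_proc.columns = [str(i) for i in range(3 + n_A + n_B, 3 + n_A + n_B + n_C)]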
#train_A_norm.columns = train_A_norm_columns
#train_B_proc.columns = train_B_columns
#train_C_proc.columns = train_C_columns

In [17]:
#one-hot encode the train_sub data, which is categorical
enc = OneHotEncoder(handle_unknown='ignore')
train_sub_one = enc.fit_transform(np.array(train_sub).reshape(-1,1))
train_sub_one = pd.DataFrame(train_sub_one.toarray())
train_sub_one.columns = [str(i) for i in range(train_sub_one.shape[1])] #string names, so all column names in train_all share one type
#train_sub_one.columns = list(enc.get_feature_names_out())

In [18]:
#combine all of them and do train test split using 0.9 ratio for the model
train_all = pd.concat([train_B_proc,train_C_proc, train_sub_one, train_A_norm], axis=1, sort=False)
X_train, X_test, Y_train, Y_test = train_test_split(train_all, train_label, train_size = 0.9, random_state = 42)
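
Densifying thousands of word-count columns uses a lot of memory. As an alternative sketch (not what the cell above does), the blocks can stay sparse and be joined with `scipy.sparse.hstack`, and `stratify` keeps the class ratio the same in both splits. All data below are randomly generated placeholders.

```python
import numpy as np
from scipy import sparse
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder blocks: 20 rows, 6 "word count" columns and 2 numeric columns
text_counts = sparse.csr_matrix(rng.integers(0, 3, size=(20, 6)))
numeric = rng.normal(size=(20, 2))
labels = np.array([0] * 15 + [1] * 5)

# Join the blocks column-wise without converting to a dense array
X = sparse.hstack([text_counts, sparse.csr_matrix(numeric)]).tocsr()

# stratify=labels preserves the 75/25 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, labels, train_size=0.8, random_state=42, stratify=labels)

print(X_tr.shape, X_te.shape, y_tr.sum(), y_te.sum())
```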

**Sensible Methods**

In [19]:
#start with logistic regression
Log = LogisticRegression(max_iter=10000)
Log.fit(X_train, Y_train)
Y_pred_log = Log.predict(X_test)
print(accuracy_score(Y_test,Y_pred_log))

In [20]:
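#use XGBoost to see whether boosting improves performance
#use_label_encoder is deprecated in newer XGBoost releases and can be dropped there
xgb = XGBClassifier(eval_metric = 'logloss', use_label_encoder=False)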
xgb.fit(X_train, Y_train)
Y_pred_xgb = xgb.predict(X_test)
print(accuracy_score(Y_test,Y_pred_xgb))

In [21]:
#random forest as a third model; random_state is fixed so the result is reproducible
rf = RandomForestClassifier(random_state = 42)
rf.fit(X_train, Y_train)
Y_pred_rf = rf.predict(X_test)
print(accuracy_score(Y_test,Y_pred_rf))
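
Accuracy alone is hard to interpret when one class is more common than the other. The sketch below, on synthetic data from `make_classification` (not the pizza data), compares a model against a majority-class baseline and also reports ROC AUC from predicted probabilities. The 75/25 class weights are an arbitrary choice for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, moderately imbalanced data (assumption: roughly 75% negatives)
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.75, 0.25], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, train_size=0.9, random_state=42, stratify=y)

# Baseline that always predicts the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

baseline_acc = accuracy_score(y_te, baseline.predict(X_te))
model_acc = accuracy_score(y_te, model.predict(X_te))
# AUC is computed from the predicted probability of the positive class
model_auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

print(baseline_acc, model_acc, model_auc)
```

The same three numbers could be computed for `Log`, `xgb` and `rf` above by swapping in `X_train`, `X_test`, `Y_train` and `Y_test`.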

**Error Analysis**

In [22]:
#confusion matrix for the RF model
#false negatives are where the model could improve most
cm = metrics.confusion_matrix(Y_test, Y_pred_rf)
pd.DataFrame(data = cm, columns = ['Predicted False', 'Predicted True'],
            index = ['Actual False', 'Actual True'])
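
Precision and recall for the positive class follow directly from the four cells of the confusion matrix. A small sketch on invented labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true labels and predictions
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 0, 0])

# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # of the predicted pizzas, how many were real
recall = tp / (tp + fn)     # of the real pizzas, how many were found

print(tn, fp, fn, tp, precision, recall)
```

A low recall here corresponds to many false negatives, which is the weakness noted in the comment above.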

**What's next?**

1, More feature engineering on the text features, and perhaps outlier handling for the numeric features, since there are no missing values. Based on the error analysis, we may also try oversampling the positive class (the data are not very imbalanced, but it is still worth trying).

2, Consider fine-tuning the models and training different feature groups separately with more advanced models: an ANN for the numeric features, an RNN (LSTM) or MultinomialNB() for the text features, followed by model ensembling (averaging the predictions). (This may be outside the scope of this class.)

3, Presentation preparation (polish the notebook with a bit more write-up).
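
The ensembling idea in item 2 could look roughly like the sketch below: one model per feature group, with the predicted probabilities averaged. The data are invented, a logistic regression stands in for the ANN, and the models are scored on their own training rows purely to show the mechanics.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Hypothetical requests: text, one numeric feature, and the outcome
texts = ["lost my job need pizza", "pizza party for fun",
         "kids hungry lost job", "just want free pizza for fun",
         "hungry student no money", "bored want pizza"]
numeric = np.array([[120.0], [3.0], [400.0], [1.0], [250.0], [2.0]])
labels = np.array([1, 0, 1, 0, 1, 0])

# Text model: word counts fed to Multinomial Naive Bayes
vec = CountVectorizer()
text_counts = vec.fit_transform(texts)
text_model = MultinomialNB().fit(text_counts, labels)

# Numeric model: logistic regression as a simple stand-in for an ANN
numeric_model = LogisticRegression(max_iter=1000).fit(numeric, labels)

# Average the positive-class probabilities of the two models
p_text = text_model.predict_proba(text_counts)[:, 1]
p_num = numeric_model.predict_proba(numeric)[:, 1]
p_avg = (p_text + p_num) / 2

print(np.round(p_avg, 3))
```

A real comparison would score held-out rows and could weight the two models unequally.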