# Project 4: Final Project - Random Acts of Pizza
### Predicting altruism through free pizza

#### Team Members: Gurdit Chahal, Shan He, Joanna Huang, Emmy Lau

This project originated from the Kaggle competition https://www.kaggle.com/c/random-acts-of-pizza. We will create an algorithm to predict which requests will receive pizza and which ones will not. The competition contains a dataset of 5671 textual requests for pizza from the Reddit community Random Acts of Pizza, together with their outcome (successful/unsuccessful) and metadata. This data was collected and graciously shared by Althoff et al. (http://www.timalthoff.com/).

**Reference Paper:**
Tim Althoff, Cristian Danescu-Niculescu-Mizil, Dan Jurafsky. How to Ask for a Favor: A Case Study on the Success of Altruistic Requests, Proceedings of ICWSM, 2014. (http://cs.stanford.edu/~althoff/raop-dataset/altruistic_requests_icwsm.pdf)

### Initialize packages and load in data

In [1]:
import re
import numpy as np
import pandas as pd
import os
import string
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# SK-learn libraries for learning.
from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier as RF
from sklearn.svm import SVC
from sklearn.svm import LinearSVC as LSVC
from sklearn.decomposition import TruncatedSVD as TSVD
from sklearn.decomposition import PCA
from scipy.sparse import hstack
from sklearn.feature_selection import SelectFromModel
from sklearn.preprocessing import StandardScaler

# SK-learn libraries for evaluation.
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score

# SK-learn libraries for feature extraction from text.
from sklearn.feature_extraction.text import *

# SK-learn libraries for model selection 
from sklearn.model_selection import train_test_split

# json libraries to parse json file
import json
from pandas.io.json import json_normalize

from sklearn.decomposition import LatentDirichletAllocation as LDA
import lda
import gensim
from gensim import utils
import xgboost as xgb



In [2]:
# read json file
train_json = json.load(open('train.json'))

# normalize data and put in a dataframe
train_json_df = json_normalize(train_json)

# read json file
test_json = json.load(open('test.json'))

# normalize data and put in a dataframe
test_json_df = json_normalize(test_json)

print("Train shape: ", train_json_df.shape)
print("Test shape: ", test_json_df.shape)

Train shape:  (4040, 32)
Test shape:  (1631, 17)


There appears to be a discrepancy between the shapes of the train and test datasets: the training set has 32 columns and the test set only 17. Let's take a closer look at what these columns represent.

In [3]:
train_only_columns = set(train_json_df.columns.values)-set(test_json_df.columns.values)
print("Columns in Train but not Test:\n",train_only_columns)
test_only_columns = set(test_json_df.columns.values)-set(train_json_df.columns.values)
print("\nColumns in Test but not Train:",test_only_columns)
test_w_train_col = train_json_df[test_json_df.columns.values]

Columns in Train but not Test:
 {'requester_days_since_first_post_on_raop_at_retrieval', 'number_of_downvotes_of_request_at_retrieval', 'requester_upvotes_minus_downvotes_at_retrieval', 'requester_user_flair', 'request_text', 'requester_number_of_comments_in_raop_at_retrieval', 'requester_upvotes_plus_downvotes_at_retrieval', 'requester_number_of_posts_on_raop_at_retrieval', 'request_number_of_comments_at_retrieval', 'requester_received_pizza', 'post_was_edited', 'requester_account_age_in_days_at_retrieval', 'number_of_upvotes_of_request_at_retrieval', 'requester_number_of_comments_at_retrieval', 'requester_number_of_posts_at_retrieval'}

Columns in Test but not Train: set()


#### Details on the additional columns in the train set:

* request_text/post_was_edited: Posts are often edited after a successful request, so the raw request_text column may mention the outcome. Instead we use request_text_edit_aware, which is available in both the train and test sets. This edit-aware version of "request_text" strips edited comments indicating the success of the request.

* *_at_retrieval: For our purposes of real-time prediction, the corresponding _at_request columns are more relevant.

* requester_user_flair: This is a post-receipt-of-pizza feature: it is only known after the outcome, so it cannot be used as a predictor.

* requester_received_pizza: The label to be predicted.

We will move forward with only the columns in both the train and test set. 
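
As a minimal sketch of that selection step, assuming the column names are available as plain lists (the helper name `shared_columns` and the shortened column lists are illustrative, not part of the notebook):

```python
def shared_columns(train_cols, test_cols):
    """Return the train columns that also exist in test, keeping train order."""
    test_set = set(test_cols)
    return [col for col in train_cols if col in test_set]

# Illustrative subset of the real column names
train_cols = [
    "request_title",
    "request_text",
    "request_text_edit_aware",
    "requester_user_flair",
    "requester_account_age_in_days_at_request",
    "requester_account_age_in_days_at_retrieval",
    "requester_received_pizza",
]
test_cols = [
    "request_title",
    "request_text_edit_aware",
    "requester_account_age_in_days_at_request",
]

kept = shared_columns(train_cols, test_cols)
dropped = [col for col in train_cols if col not in kept]
print(kept)     # columns usable at prediction time
print(dropped)  # columns only known after the fact, plus the label
```

The label column is dropped from the features by the same rule, which is why it has to be read from the full training frame separately.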

#### Split the data into train and dev for model testing

In [4]:
# 0 means the user doesn't receive pizza & 1 means the user receives pizza
train_labels = train_json_df.requester_received_pizza.astype(int).values

# split the training data into training data and dev data 
train_data, dev_data, train_labels, dev_labels = \
            train_test_split(test_w_train_col, train_labels, test_size=0.2, random_state=12)

### Feature Engineering

We begin by prepping our text data, creating a new column, full_text, that combines the request body (request_text_edit_aware) and the request title.

In [5]:
train_data['full_text']=train_data['request_text_edit_aware'] +' '+train_data['request_title']
dev_data['full_text']=dev_data['request_text_edit_aware'] +' '+dev_data['request_title']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """Entry point for launching an IPython kernel.


In preparation for further processing, we preprocess the text as follows: convert it to lowercase, remove punctuation and other non-alphanumeric characters, and collapse extra spaces.

In [6]:
#from nltk.stem import SnowballStemmer
def pre_process(s):
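    s = re.sub(r"[^\w']|_", " ", s)  # Replace non-word characters (except apostrophes) and underscores with spaces
    s = s.translate(str.maketrans('', '', string.punctuation))  # Strip remaining punctuation (only apostrophes survive the step above)
    s = re.sub(' +', ' ', s)  # Collapse runs of spaces
    s = s.lower()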
    #s = ' '.join(word[:6] if len(word)>6 else word 
                        # for word in s.split()) # Shorten long words
    #s = re.sub(r'\b\d+\b', r' ', s) # Replace sequences of numbers with a single token
    
    #s_stemmer = SnowballStemmer('english')
    #s = ' '.join([s_stemmer.stem(i) for i in s.split()])
    return s

# All lexicon words are lowercase so they can match the lowercased text
money = ["money", "now", "broke", "week", "until", "time",
          "last", "day", "when", "today", "tonight", "paid", "next",
          "first", "night", "after", "tomorrow", "month", "while",
          "account", "before", "long", "friday", "rent", "buy",
          "bank", "still", "bills", "ago", "cash", "due",
          "soon", "past", "never", "paycheck", "check", "spent",
          "years", "poor", "till", "yesterday", "morning", "dollars",
          "financial", "hour", "bill", "evening", "credit",
          "budget", "loan", "bucks", "deposit", "dollar", "current",
          "payed"]
job =["work", "job", "paycheck", "unemployment", "interview",
          "fired", "employment", "hired", "hire"]
student = ["college", "student", "school", "roommate",
          "studying", "university", "finals", "semester",
          "class", "study", "project", "dorm", "tuition"]
family =["family", "mom", "wife", "parents", "mother", "husband",
           "dad", "son", "daughter", "father", "parent",
           "mum"]
craving = ["friend", "girlfriend", "craving", "birthday",
          "boyfriend", "celebrate", "party", "game", "games",
          "movie", "date", "drunk", "beer", "celebrating", "invited",
          "drinks", "crave", "wasted", "invite"]

narratives = [money, job, student, family, craving]

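def find_narr(narr, s):
    # Fraction of the tokens in s that belong to the narrative lexicon narr
    tokens = s.split()
    if not tokens:  # Avoid dividing by zero on empty text
        return 0.0
    return sum(tokens.count(word) for word in narr) / len(tokens)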

def construct_topic_features(data): 
    
    data['full_text'] = data['request_text_edit_aware'] + ' ' + data['request_title']
    clean_text = data['full_text'].apply(lambda s: pre_process(s))

    features = pd.DataFrame()

    for n in narratives:
        features[n[0]] = clean_text.apply(lambda s: find_narr(n,s))
    return features

train_topic_features = construct_topic_features(train_data)
dev_topic_features = construct_topic_features(dev_data)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
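
Each narrative feature computed above is the share of a request's tokens that fall in one lexicon. A small worked example, assuming already-cleaned text and a shortened lexicon with no repeated words (the helper name `narrative_score` and the sample sentence are illustrative):

```python
def narrative_score(lexicon, text):
    """Fraction of whitespace-separated tokens in `text` that appear in `lexicon`."""
    tokens = text.split()
    if not tokens:  # guard against empty requests
        return 0.0
    lexicon = set(lexicon)
    return sum(token in lexicon for token in tokens) / len(tokens)

job_words = ["work", "job", "paycheck"]        # shortened, illustrative lexicon
text = "lost my job and my paycheck is late"   # already lowercased, no punctuation

score = narrative_score(job_words, text)
print(score)  # 2 matching tokens out of 8
```

Because the score is a ratio, long and short requests are comparable, but a single lexicon word in a very short request produces a large value.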


Create a function to extract the seasonality information from our post metadata.

In [7]:
#https://www.timeanddate.com/calendar/aboutseasons.html
def ts_to_season(month):
    if month>=3 and month<=5:
        return "spring"
    elif month>=6 and month <=8:
        return "summer"
    elif month>=9 and month <=11:
        return "fall"
    else:
        return "winter"
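
To sanity-check the mapping on raw values, the sketch below goes from a Unix timestamp to a season label using only the standard library. It assumes timestamps are in seconds and interpreted in UTC; the helper names are illustrative, and the month ranges mirror `ts_to_season` (Northern Hemisphere meteorological seasons).

```python
from datetime import datetime, timezone

def month_to_season(month):
    """Same month ranges as ts_to_season."""
    if 3 <= month <= 5:
        return "spring"
    if 6 <= month <= 8:
        return "summer"
    if 9 <= month <= 11:
        return "fall"
    return "winter"

def season_of_timestamp(unix_ts):
    """Convert a Unix timestamp (seconds, UTC) to a season label."""
    month = datetime.fromtimestamp(unix_ts, tz=timezone.utc).month
    return month_to_season(month)

print(season_of_timestamp(1325376000))  # 2012-01-01 00:00:00 UTC
print(season_of_timestamp(1341100800))  # 2012-07-01 00:00:00 UTC
```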

Create features for both the train and dev sets to leverage the metadata provided within the dataset.

In [8]:
def construct_ft_mat(train_data):
    feat_mat=pd.DataFrame()
    
    # Extract temporal features
    feat_mat['hour_request']=pd.to_datetime(train_data['unix_timestamp_of_request_utc'],unit = 's').dt.hour
    feat_mat['day_request']=pd.to_datetime(train_data['unix_timestamp_of_request_utc'],unit = 's').dt.day
    feat_mat['day_request']=feat_mat['day_request'].apply(lambda x: 0 if x<16 else 1)
    feat_mat['season_request']=pd.to_datetime(train_data['unix_timestamp_of_request_utc'],unit = 's').dt.month
    feat_mat['season_request']=feat_mat['season_request'].apply(ts_to_season)
    feat_mat['is_spring']=feat_mat['season_request'].apply(lambda x: 1 if x=='spring' else 0)
    feat_mat['is_summer']=feat_mat['season_request'].apply(lambda x: 1 if x=='summer' else 0)
    feat_mat['is_fall']=feat_mat['season_request'].apply(lambda x: 1 if x=='fall' else 0)
    feat_mat['is_winter']=feat_mat['season_request'].apply(lambda x: 1 if x=='winter' else 0)
    del feat_mat['season_request']
    
    # Extract post popularity features
    feat_mat['first_post']=np.log(train_data['requester_days_since_first_post_on_raop_at_request']+1)
    feat_mat['upvotes_minus_downvotes']=train_data['requester_upvotes_minus_downvotes_at_request']
    feat_mat['upvotes_plus_downvotes_at_request']=np.log(train_data['requester_upvotes_plus_downvotes_at_request']+1)
    upvotes=train_data.apply(lambda row: (row['requester_upvotes_plus_downvotes_at_request'] + row['requester_upvotes_minus_downvotes_at_request'])/2,axis=1)
    downvotes=train_data.apply(lambda row: (row['requester_upvotes_plus_downvotes_at_request']- row['requester_upvotes_minus_downvotes_at_request'])/2,axis=1)
    feat_mat['upvotes']=upvotes
    feat_mat['vote_ratio']=upvotes/(upvotes+downvotes+1)
    
    # Extract requester features
    feat_mat['req_age']=np.log(train_data['requester_account_age_in_days_at_request']+1)
    feat_mat['num_subs']=np.log(train_data['requester_number_of_subreddits_at_request']+1)
    feat_mat['num_posts']=np.log(train_data['requester_number_of_posts_at_request']+1)
    feat_mat['pizza_activity']=np.log(train_data['requester_number_of_posts_on_raop_at_request']+1)
    feat_mat['pizza_comments']=np.log(train_data['requester_number_of_comments_in_raop_at_request']+1)
    # Whole days between the request and 2010-12-08, the reference date this feature is measured from
    request_time = pd.to_datetime(train_data['unix_timestamp_of_request_utc'], unit = 's')
    feat_mat['community_age'] = (request_time - pd.Timestamp('2010-12-08')).dt.days
    feat_mat['community_age'] = (feat_mat['community_age'] * 10./feat_mat.community_age.max()).astype(int)
    
    #feat_mat['karma']=(train_data['requester_upvotes_minus_downvotes_at_request']* 10.\
                                         #/train_data.requester_upvotes_minus_downvotes_at_request.max()).astype(int)

    #feat_mat['posted_in_raop_before']= (train_data['requester_number_of_posts_on_raop_at_request'] > 0).astype(int)
    
    #feat_mat['posted_before']= (train_data['requester_number_of_posts_at_request'] > 0).astype(int)
    
    
    # Extract post features
    feat_mat['len_request']=np.log(train_data['request_text_edit_aware'].apply(len)+1)
    feat_mat['len_title']=np.log(train_data['request_title'].apply(len)+1)
    feat_mat['reciprocity'] = train_data['full_text'].apply(lambda x:1 if re.search("repay|pay.+back|pay.+forward|return.+favor", x) 
                                               else 0)
    feat_mat['image_in_text'] = train_data['full_text'].str.contains(r'imgur\.com|\.jpg|\.png|\.jpeg', case=False).apply(lambda x: 1 if x else 0)
    feat_mat['politeness'] = train_data['full_text'].apply(lambda x: 1 if re.search("thank|appreciate|advance", x) else 0)
    
    # Extract narrative features
    #craving = re.compile(r'(friend|party|birthday|boyfriend|girlfriend|date|drinks|drunk|wasted|invite|invited|celebrate|celebrating|game|games|movie|beer|crave|craving)', re.IGNORECASE)
    #family = re.compile(r'(husband|wife|family|parent|parents|mother|father|mom|mum|son|dad|daughter)', re.IGNORECASE)
    #job = re.compile(r'(job|unemployment|employment|hire|hired|fired|interview|work|paycheck)', re.IGNORECASE)
    #money = re.compile(r'(money|bill|bills|rent|bank|account|paycheck|due|broke|bills|deposit|cash|dollar|dollars|bucks|paid|payed|buy|check|spent|financial|poor|loan|credit|budget|day|now| \
        #time|week|until|last|month|tonight|today|next|night|when|tomorrow|first|after|while|before|long|hour|Friday|ago|still|due|past|soon|current|years|never|till|yesterday|morning|evening)', re.IGNORECASE)
    #student = re.compile(r'(college|student|university|finals|study|studying|class|semester|school|roommate|project|tuition|dorm)', re.IGNORECASE)
    #gratitude = re.compile(r'(thank|thanks|thankful|appreciate|grateful|gratitude|advance)', re.IGNORECASE)
    # ratio of narrative words to length of full word
    #feat_mat['money'] = train_data['full_text'].apply(lambda x: len(money.findall(x))/len(x.split()))
    #feat_mat['job'] = train_data['full_text'].apply(lambda x: len(job.findall(x))/len(x.split()))
    #feat_mat['student'] = train_data['full_text'].apply(lambda x: len(student.findall(x))/len(x.split()))
    #feat_mat['family'] = train_data['full_text'].apply(lambda x: len(family.findall(x))/len(x.split()))
    #feat_mat['craving'] = train_data['full_text'].apply(lambda x: len(craving.findall(x))/len(x.split()))
    return feat_mat

feat_mat=construct_ft_mat(train_data)
dev_mat=construct_ft_mat(dev_data)
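
The dataset exposes only the sum and the difference of a requester's votes, so `construct_ft_mat` recovers the two counts by solving plus = up + down and minus = up - down. A minimal sketch of that arithmetic with made-up numbers (the helper names are illustrative):

```python
def split_votes(plus, minus):
    """Recover (upvotes, downvotes) from their sum `plus` and difference `minus`."""
    upvotes = (plus + minus) / 2
    downvotes = (plus - minus) / 2
    return upvotes, downvotes

def vote_ratio(plus, minus):
    """Share of upvotes, with +1 in the denominator to avoid division by zero."""
    upvotes, downvotes = split_votes(plus, minus)
    return upvotes / (upvotes + downvotes + 1)

# Illustrative numbers: 7 upvotes and 3 downvotes give plus=10, minus=4
print(split_votes(10, 4))
print(vote_ratio(10, 4))
print(vote_ratio(0, 0))   # a brand-new account yields 0.0 rather than an error
```

Since both inputs are whole columns, the same formulas can be applied to the two columns directly instead of row by row.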

In [9]:
# Convert dataframe to a numpy-array representation.
t_mat=feat_mat.values
d_mat=dev_mat.values

In [10]:
# Extract text features from the post text by generating features from words
vectorizer = TfidfVectorizer(min_df=5,ngram_range=(1,2), preprocessor=pre_process,stop_words='english',norm='l2',sublinear_tf=True) 
train_bag_of_words = vectorizer.fit_transform(train_data['full_text'])
dev_bag_of_words = vectorizer.transform(dev_data['full_text'])

In [11]:
lsvc = LSVC(C=.85, penalty="l1", dual=False,random_state=42).fit(train_bag_of_words,train_labels)
model = SelectFromModel(lsvc, prefit=True)

X_new = model.transform(train_bag_of_words)
print(X_new.shape)

d_new=model.transform(dev_bag_of_words)

(3232, 873)
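
The cell above keeps only the tf-idf columns to which an L1-penalized linear SVM assigns a non-negligible weight, which reduces the vocabulary to 873 features. A minimal sketch of the same mechanism on synthetic data, where only the first two columns carry signal (the data and variable names are illustrative):

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.svm import LinearSVC

rng = np.random.RandomState(0)
X = rng.rand(200, 20)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)  # only columns 0 and 1 matter

# The L1 penalty drives the weights of uninformative columns towards zero
svc = LinearSVC(C=0.85, penalty="l1", dual=False, random_state=42).fit(X, y)

# prefit=True reuses the fitted model instead of refitting inside the selector
selector = SelectFromModel(svc, prefit=True)
X_selected = selector.transform(X)
kept = selector.get_support(indices=True)

print(X.shape, "->", X_selected.shape)
print("kept columns:", kept)
```

A smaller `C` means a stronger penalty and fewer surviving features, so `C` here controls the size of the selected vocabulary as well as the fit.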


In [12]:
no_features = 1000
# NMF is able to use tf-idf
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, max_features=no_features, stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(train_data['full_text'])
tfidf_feature_names = tfidf_vectorizer.get_feature_names()

# LDA is a probabilistic model over word counts, so it needs raw term counts rather than tf-idf weights
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, max_features=no_features, stop_words='english')
tf = tf_vectorizer.fit_transform(train_data['full_text'])
tf_feature_names = tf_vectorizer.get_feature_names()

In [15]:
from sklearn.decomposition import NMF
no_topics = 20

# Run NMF
nmf = NMF(n_components=no_topics, random_state=1, alpha=.1, l1_ratio=.5, init='nndsvd').fit(tfidf)

# Run LDA
lda = LDA(n_components=no_topics, max_iter=5, learning_method='online', learning_offset=50.,random_state=0).fit(tf)




In [18]:
# Derived topics with the top 10 words in each topic
def display_topics(model, feature_names, no_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic %d:" % (topic_idx))
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))

no_top_words = 10
print("NMF Topics:\n")
display_topics(nmf, tfidf_feature_names, no_top_words)
print("\nLDA Topics:\n")
display_topics(lda, tf_feature_names, no_top_words)

NMF Topics:

Topic 0:
ve help work time know like going little family thanks
Topic 1:
pizza craving like send buy want hut ll random free
Topic 2:
student college finals studying school poor kid appreciate students university
Topic 3:
http com imgur jpg www reddit proof comments pic picture
Topic 4:
pay forward ll promise paycheck soon check rent money thanks
Topic 5:
job lost new got celebrate paycheck months moved recently started
Topic 6:
hungry im help pretty dont nc ky right favor sure
Topic 7:
love pizza usa thanks forever night hot starving finals guys
Topic 8:
food money house starving stamps appreciated month days left ran
Topic 9:
really use appreciate right pizza thanks pick don sick haven
Topic 10:
tonight dinner kids family help pizza daughter uk thanks night
Topic 11:
day today make work long brighten lunch home bad bed
Topic 12:
birthday today celebrate party family girlfriend spent awesome enjoy amp
Topic 13:
ramen noodles eating past living weeks tired ve sick days
Top

Based on the derived topics above, NMF seems to find more meaningful and cohesive topics than LDA. To better understand each topic, we can also display its top documents.

In [21]:
def display_topics(H, W, feature_names, documents, no_top_words, no_top_documents):
    """Print each topic's index, its top words, and its top documents.

    The top words and top documents are those with the highest weights in H and W.
    argsort() returns the indices that sort a row or column, so the highest-weighted
    entries are read from the end of that ordering.
    """
    for topic_idx, topic in enumerate(H):
        print("Topic %d:" % (topic_idx))
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-no_top_words - 1:-1]]))
        top_doc_indices = np.argsort( W[:,topic_idx] )[::-1][0:no_top_documents]
        for doc_index in top_doc_indices:
            print(documents[doc_index])

# NMF is able to use tf-idf
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(train_data['full_text'])
tfidf_feature_names = tfidf_vectorizer.get_feature_names()

# LDA is a probabilistic model over word counts, so it needs raw term counts rather than tf-idf weights
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
tf = tf_vectorizer.fit_transform(train_data['full_text'])
tf_feature_names = tf_vectorizer.get_feature_names()

no_topics = 2

# Run NMF
nmf_model = NMF(n_components=no_topics, random_state=1, alpha=.1, l1_ratio=.5, init='nndsvd').fit(tfidf)
nmf_W = nmf_model.transform(tfidf)
nmf_H = nmf_model.components_

# Run LDA
lda_model = LDA(n_components=no_topics, max_iter=5, learning_method='online', learning_offset=50.,random_state=0).fit(tf)
lda_W = lda_model.transform(tf)
lda_H = lda_model.components_

no_top_words = 4
no_top_documents = 4
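# Pass a plain list: top_doc_indices are row positions, but the shuffled Series is indexed by label
documents = train_data['full_text'].tolist()
display_topics(nmf_H, nmf_W, tfidf_feature_names, documents, no_top_words, no_top_documents)
display_topics(lda_H, lda_W, tf_feature_names, documents, no_top_words, no_top_documents)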




Topic 0:
help food just money
Hey there, I am going hungry and would really appreciate a pizza from anyone willing. I'll give a little story about what has led me to be extremely poor (£9 to last until the end of August).

I went to university this year, and moved out of my parent's home to forge a new life in Peckham, which is when everything started going horribly wrong. I hated my course, the people on my course seemingly went out of their way to make me feel uncomfortable (I'm quite shy, and very insecure).

For years now, I've been fighting my demons, keeping them just below the surface - I've always felt like I was just about floating, reading to sink at any moment. Around Christmas, I did just that - sank like a stone. I fell into a major depressive episode that last months, culminating in what I can only describe as a mental breakdown over Easter. I became dependent on weed, never leaving my room or spending time sober other than to go to work on Sundays, and then I'd come home

KeyError: 2058
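
The `KeyError: 2058` recorded above is what happens when a row position is used to index a pandas Series whose integer index has been shuffled: `np.argsort` returns positions, while `series[i]` on an integer index looks up the label `i`. A minimal reproduction with made-up data (the variable names are illustrative):

```python
import numpy as np
import pandas as pd

# A Series with a shuffled integer index, like a column of train_data after the split
documents = pd.Series(["doc a", "doc b", "doc c"], index=[7, 2, 5])
top_doc_indices = np.array([2, 0])  # row positions, as np.argsort would return

# Label-based lookup: there is no label 0, so this raises KeyError
try:
    documents[0]
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True

# Position-based lookup returns the intended rows
by_position = [documents.iloc[i] for i in top_doc_indices]
print(label_lookup_failed, by_position)
```

Note that a label lookup can also succeed silently with the wrong row (here `documents[2]` is "doc b", not the third document), so using `.iloc` or converting to a list first is the safer habit.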

In [12]:
vectorizer_lda = CountVectorizer(min_df=10,ngram_range=(1,1), preprocessor=pre_process,stop_words='english') 
lda_bag_of_words = vectorizer_lda.fit_transform(train_data['full_text'])
lda_devbag_of_words = vectorizer_lda.transform(dev_data['full_text'])

In [13]:
# LDA tells us what topics are present in any given document by observing all the words 
# in it and producing a topic distribution

lda = LDA(n_components = 3, learning_method="batch", max_iter=30,learning_decay=.7, random_state=42)
train_topics = lda.fit_transform(lda_bag_of_words)
print(lda.components_.shape)

dev_topics=lda.transform(lda_devbag_of_words)

(3, 1580)


In [14]:
f_new=hstack([X_new,t_mat,train_topics, train_topic_features])
dev_new=hstack([d_new,d_mat,dev_topics, dev_topic_features])
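
The combined design matrix is built by stacking feature blocks column-wise, which only makes sense if every block has one row per request in the same order. A small sketch with made-up values (the block names are illustrative, and the dense blocks are wrapped in `csr_matrix` here purely for explicitness):

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Three requests: 2 text features (sparse), 1 metadata feature, 2 topic weights
text_block = csr_matrix(np.array([[0.0, 0.5], [0.3, 0.0], [0.0, 0.0]]))
meta_block = np.array([[1.0], [2.0], [3.0]])
topic_block = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])

combined = hstack(
    [text_block, csr_matrix(meta_block), csr_matrix(topic_block)]
).tocsr()

print(combined.shape)          # rows are preserved, columns are concatenated
print(combined.toarray()[1])   # second request: text, then metadata, then topics
```

The train and dev matrices must be stacked in the same block order, otherwise column positions no longer refer to the same features.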

In [28]:
from sklearn.decomposition import NMF
# NMF is able to use tf-idf
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(train_data['full_text'])
dev_tfidf = tfidf_vectorizer.transform(dev_data['full_text'])
tfidf_feature_names = tfidf_vectorizer.get_feature_names()

no_topics = 3

# Run NMF
nmf_model = NMF(n_components=no_topics, random_state=1, alpha=.1, l1_ratio=.5, init='nndsvd').fit(tfidf)
nmf_train = nmf_model.transform(tfidf)
nmf_dev = nmf_model.transform(dev_tfidf)

In [29]:
f_new=hstack([X_new,t_mat,nmf_train])
dev_new=hstack([d_new,d_mat,nmf_dev])

In [15]:
lr = LogisticRegression()
#parameters = {'C':[0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000]}
parameters = {'C':np.linspace(0.0005, 0.1, 100)}
clf = GridSearchCV(lr, parameters)
clf.fit(f_new, train_labels)
pred_dev_prob = clf.predict_proba(dev_new)[:,1]
pred_dev_labels = clf.predict(dev_new)

print(clf.best_params_)
roc_auc_score(dev_labels, pred_dev_prob, average='micro')

{'C': 0.09497474747474748}


0.6863075657894737
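
`GridSearchCV` as used above selects `C` with the estimator's default score, which for a classifier is accuracy, while the result reported on the dev set is ROC AUC. One possible variation, shown here on synthetic data as an assumption rather than as what the notebook does, is to select `C` on the same metric by passing `scoring="roc_auc"`:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic, imbalanced binary problem (illustrative only)
rng = np.random.RandomState(0)
X = rng.randn(300, 5)
y = (X[:, 0] + 0.5 * rng.randn(300) > 0.7).astype(int)

search = GridSearchCV(
    LogisticRegression(),
    {"C": [0.01, 0.1, 1.0]},
    scoring="roc_auc",  # rank candidates by cross-validated AUC instead of accuracy
    cv=5,
)
search.fit(X, y)

print(search.best_params_)
print(search.best_score_)  # mean cross-validated AUC of the best candidate
```

On imbalanced labels accuracy can favor models that mostly predict the majority class, so matching the selection metric to the reported metric may change which `C` is chosen.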

In [16]:

#create dmatrices
dtrain = xgb.DMatrix(f_new, train_labels)
dtest = xgb.DMatrix(dev_new
                         , dev_labels)

#booster parameter
param = {'max_depth':15, 'eta': .0188, 'silent': 1, 'objective': 'binary:logistic'
         , 'scale_pos_weight': 3.06,'max_delta_step':1,'subsample':.7,'seed':42}#9 depth if sublin false
param['nthread'] = 4
param['eval_metric'] = 'auc'

#specify validation set to watch performance
evallist = [(dtest, 'eval'), (dtrain, 'train')]

#train model
num_round = 100
bst = xgb.train(param.items(), dtrain, num_round, evallist)

[0]	eval-auc:0.581164	train-auc:0.806568
[1]	eval-auc:0.588865	train-auc:0.888404
[2]	eval-auc:0.621332	train-auc:0.925238
[3]	eval-auc:0.633092	train-auc:0.94247
[4]	eval-auc:0.628869	train-auc:0.948143
[5]	eval-auc:0.642311	train-auc:0.958036
[6]	eval-auc:0.659104	train-auc:0.967232
[7]	eval-auc:0.66405	train-auc:0.972512
[8]	eval-auc:0.668729	train-auc:0.978589
[9]	eval-auc:0.666271	train-auc:0.97898
[10]	eval-auc:0.66699	train-auc:0.98171
[11]	eval-auc:0.672788	train-auc:0.981569
[12]	eval-auc:0.668104	train-auc:0.982041
[13]	eval-auc:0.668602	train-auc:0.983463
[14]	eval-auc:0.678364	train-auc:0.984584
[15]	eval-auc:0.674141	train-auc:0.984253
[16]	eval-auc:0.672845	train-auc:0.98505
[17]	eval-auc:0.674963	train-auc:0.985674
[18]	eval-auc:0.679515	train-auc:0.98625
[19]	eval-auc:0.678602	train-auc:0.986309
[20]	eval-auc:0.682418	train-auc:0.987215
[21]	eval-auc:0.686271	train-auc:0.988013
[22]	eval-auc:0.684737	train-auc:0.987671
[23]	eval-auc:0.687138	train-auc:0.988843
[24]	eval
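
The `scale_pos_weight` value of 3.06 is not explained in the notebook. A common heuristic for imbalanced binary problems is the ratio of negative to positive training examples; assuming that is where the value comes from, it could be computed as in this sketch (the helper name and the toy labels are illustrative):

```python
def negative_to_positive_ratio(labels):
    """Ratio of 0-labels to 1-labels, a common starting point for scale_pos_weight."""
    positives = sum(1 for y in labels if y == 1)
    negatives = sum(1 for y in labels if y == 0)
    if positives == 0:
        raise ValueError("no positive examples")
    return negatives / positives

toy_labels = [0, 0, 0, 1, 0, 0, 0, 1]  # illustrative, not the real labels
ratio = negative_to_positive_ratio(toy_labels)
print(ratio)
```

Computing the ratio from `train_labels` rather than hard-coding it keeps the weight in sync if the split or the random seed changes.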

### Adding Sentiment Analysis on Text Body and Title

In [41]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()
sentiment_metrics = ['neg', 'neu','pos']
sentiment_data_train = pd.DataFrame()
sentiment_data_dev = pd.DataFrame()
# polarity_scores returns the proportions of the text rated negative, neutral, and positive
# (plus a 'compound' score, which is not used here).
for metric in sentiment_metrics:  # 'metric' avoids shadowing the sklearn metrics module imported earlier
    sentiment_data_train['request_text_edit_aware'+ metric] = train_data['request_text_edit_aware'].apply(lambda x: sia.polarity_scores(x)[metric])
    sentiment_data_dev['request_text_edit_aware'+ metric] = dev_data['request_text_edit_aware'].apply(lambda x: sia.polarity_scores(x)[metric])
    sentiment_data_train['request_title'+ metric] = train_data['request_title'].apply(lambda x: sia.polarity_scores(x)[metric])
    sentiment_data_dev['request_title'+ metric] = dev_data['request_title'].apply(lambda x: sia.polarity_scores(x)[metric])

sent_mat_train = sentiment_data_train.values
sent_mat_dev =sentiment_data_dev.values
print(sentiment_data_train)



      request_text_edit_awareneg  request_titleneg  \
183                        0.177             0.202   
3483                       0.162             0.286   
3430                       0.000             0.000   
1974                       0.222             0.000   
3721                       0.000             0.190   
3609                       0.053             0.259   
3902                       0.024             0.000   
3546                       0.110             0.155   
1488                       0.000             0.219   
114                        0.055             0.000   
395                        0.296             0.000   
2896                       0.018             0.000   
2638                       0.085             0.139   
134                        0.000             0.483   
350                        0.149             0.000   
1915                       0.000             0.000   
714                        0.096             0.267   
1194                       0

In [42]:
# With LDA
f_new=hstack([X_new,t_mat,train_topics, sent_mat_train])
dev_new=hstack([d_new,d_mat,dev_topics, sent_mat_dev])

In [33]:
# With NMF
f_new=hstack([X_new,t_mat,nmf_train, sent_mat_train])
dev_new=hstack([d_new,d_mat,nmf_dev, sent_mat_dev])

In [43]:
lr = LogisticRegression()
#parameters = {'C':[0.00001, 0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000]}
parameters = {'C':np.linspace(0.0005, 0.1, 100)}
clf = GridSearchCV(lr, parameters)
clf.fit(f_new, train_labels)
pred_dev_prob = clf.predict_proba(dev_new)[:,1]
pred_dev_labels = clf.predict(dev_new)

print(clf.best_params_)
roc_auc_score(dev_labels, pred_dev_prob, average='micro')

{'C': 0.1}


0.6900575657894739

In [44]:
#create dmatrices
dtrain = xgb.DMatrix(f_new, train_labels)
dtest = xgb.DMatrix(dev_new
                         , dev_labels)

#booster parameter
param = {'max_depth':15, 'eta': .0188, 'silent': 1, 'objective': 'binary:logistic'
         , 'scale_pos_weight': 3.06,'max_delta_step':1,'subsample':.7,'seed':42}#9 depth if sublin false
param['nthread'] = 4
param['eval_metric'] = 'auc'

#specify validation set to watch performance
evallist = [(dtest, 'eval'), (dtrain, 'train')]

#train model
num_round = 100
bst = xgb.train(param.items(), dtrain, num_round, evallist)

[0]	eval-auc:0.573713	train-auc:0.814951
[1]	eval-auc:0.5925	train-auc:0.904558
[2]	eval-auc:0.620921	train-auc:0.939711
[3]	eval-auc:0.643479	train-auc:0.960859
[4]	eval-auc:0.641612	train-auc:0.968663
[5]	eval-auc:0.649852	train-auc:0.96989
[6]	eval-auc:0.65803	train-auc:0.969359
[7]	eval-auc:0.661086	train-auc:0.972426
[8]	eval-auc:0.663006	train-auc:0.976213
[9]	eval-auc:0.663676	train-auc:0.979939
[10]	eval-auc:0.655366	train-auc:0.982069
[11]	eval-auc:0.661637	train-auc:0.984147
[12]	eval-auc:0.652747	train-auc:0.984551
[13]	eval-auc:0.649585	train-auc:0.985734
[14]	eval-auc:0.654437	train-auc:0.986891
[15]	eval-auc:0.649942	train-auc:0.986584
[16]	eval-auc:0.65185	train-auc:0.986838
[17]	eval-auc:0.649868	train-auc:0.988017
[18]	eval-auc:0.651649	train-auc:0.98895
[19]	eval-auc:0.651706	train-auc:0.989825
[20]	eval-auc:0.653499	train-auc:0.990844
[21]	eval-auc:0.657134	train-auc:0.991536
[22]	eval-auc:0.654716	train-auc:0.991585
[23]	eval-auc:0.659803	train-auc:0.992079
[24]	eva
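
In the visible part of this log the train AUC keeps climbing while the eval AUC peaks early, which suggests the booster is overfitting. A small sketch for locating the best visible round from such a log, using a few lines copied from the output above (the parsing helper is hypothetical, and the excerpt covers only rounds 8 to 11):

```python
import re

LOG_LINE = re.compile(r"\[(\d+)\]\s+eval-auc:([\d.]+)\s+train-auc:([\d.]+)")

def parse_log(text):
    """Return (round, eval_auc, train_auc) tuples for every matching log line."""
    rows = []
    for line in text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match:
            rows.append((int(match.group(1)),
                         float(match.group(2)),
                         float(match.group(3))))
    return rows

log_excerpt = """
[8] eval-auc:0.663006 train-auc:0.976213
[9] eval-auc:0.663676 train-auc:0.979939
[10] eval-auc:0.655366 train-auc:0.982069
[11] eval-auc:0.661637 train-auc:0.984147
"""

rows = parse_log(log_excerpt)
best_round, best_eval, best_train = max(rows, key=lambda r: r[1])
gap = best_train - best_eval  # size of the train/eval gap at the best round
print(best_round, best_eval, round(gap, 6))
```

A gap this large between train and eval AUC is a hint that shallower trees or stronger regularization may be worth trying.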