# Classification

## Team Name
>### Sigma  

## Team Member
>### 조현윤, 이상협, 정하연  

## Objective
> ### in search of better methods of using this behavioral data to predict which individuals they should approach—and even when and how to approach them.
> ### to create a classification algorithm that accurately identifies which customers have the most potential business value for Red Hat based on their characteristics and activities.
> ### to predict the potential business value of a person who has performed a specific activity.

## Evaluation
> ### valuated on area under the ROC curve between the predicted and the observed outcome.

## Submission File
> ### For each activity_id in the test set, you must predict a probability for the 'outcome' variable, represented by a number between 0 and 1.
~~~~
activity_id,outcome
act1_1,0
act1_100006,0
act1_100050,0
~~~~

## Data
> ### uses two separate data files that may be joined together to create a single, unified data table: a people file and an activity file.
> ### The people file contains all of the unique people (and the corresponding characteristics) that have performed activities over time. Each row in the people file represents a unique person. Each person has a unique people_id.
> ### The activity file contains all of the unique activities (and the corresponding activity characteristics) that each person has performed over time. Each row in the activity file represents a unique activity performed by a person on a certain date. Each activity has a unique activity_id.
> ### The activity file contains several different categories of activities. 
>> Type 1 activities are different from type 2-7 activities because there are more known characteristics associated with type 1 activities (nine in total) than type 2-7 activities (which have only one associated characteristic).
> ### To develop a predictive model with this data, you will likely need to join the files together into a single data set. The two files can be joined together using person_id as the common key. All variables are categorical, with the exception of 'char_38' in the people file, which is a continuous numerical variable.

## Reference 
[kaggel Predicting Red Hat Business Value](https://www.kaggle.com/c/predicting-red-hat-business-value)

### Load Python Package

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
from datetime import datetime
from datetime import date
import seaborn as sns
import statsmodels.api as sm
import statsmodels.stats.api as sms
import statsmodels.stats.stattools as stools
import scipy as sp
%matplotlib inline

In [2]:
import xgboost



In [3]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

In [4]:
from sklearn.tree import DecisionTreeClassifier

In [5]:
from sklearn.metrics import *

In [6]:
from sklearn.ensemble import VotingClassifier

In [7]:
from sklearn.ensemble import GradientBoostingClassifier

In [8]:
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

In [9]:
from sklearn.linear_model import LogisticRegression

In [10]:
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

In [11]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

In [12]:
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import AdaBoostClassifier


In [13]:
from sklearn.preprocessing import LabelEncoder

In [14]:
cv =KFold(10)

## Exploratory Data Analysis (EDA )¶

## Load Data Set

In [51]:
# activity data set
act_Train = pd.read_csv('./data/act_train.csv')
act_Test = pd.read_csv('./data/act_test.csv')
# people data set
people = pd.read_csv('./data/people.csv')

### Split people data set wether train data set or test data set

In [52]:
idx_train =list(act_Train['people_id'].value_counts().index)
idx_test =list(act_Test['people_id'].value_counts().index)

In [53]:
train_people = people.loc[people['people_id'].isin(idx_train)]
test_people = people.loc[people['people_id'].isin(idx_test)]

In [54]:
train_people.to_csv('./data/act_train_people.csv',index=False)
test_people.to_csv('./data/act_test_people.csv',index=False)

In [55]:
print('Number of active people: {}'.format(act_Train['people_id'].nunique()))

Number of active people: 151295


In [56]:
print('Number of active people: {}'.format(act_Test['people_id'].nunique()))

Number of active people: 37823


In [57]:
trainMerge = pd.merge(act_Train,people, on='people_id')
trainMerge.tail()

Unnamed: 0,people_id,activity_id,date_x,activity_category,char_1_x,char_2_x,char_3_x,char_4_x,char_5_x,char_6_x,...,char_29,char_30,char_31,char_32,char_33,char_34,char_35,char_36,char_37,char_38
2197286,ppl_99994,act2_4668076,2023-06-16,type 4,,,,,,,...,True,True,True,True,False,True,True,True,True,95
2197287,ppl_99994,act2_4743548,2023-03-30,type 4,,,,,,,...,True,True,True,True,False,True,True,True,True,95
2197288,ppl_99994,act2_536973,2023-01-19,type 2,,,,,,,...,True,True,True,True,False,True,True,True,True,95
2197289,ppl_99994,act2_688656,2023-05-02,type 4,,,,,,,...,True,True,True,True,False,True,True,True,True,95
2197290,ppl_99994,act2_715089,2023-06-15,type 2,,,,,,,...,True,True,True,True,False,True,True,True,True,95


In [58]:
trainMerge.to_csv('./data/train_merge.csv',index=False)

In [59]:
testMerge = pd.merge(act_Test,people, on='people_id')
testMerge.tail()

Unnamed: 0,people_id,activity_id,date_x,activity_category,char_1_x,char_2_x,char_3_x,char_4_x,char_5_x,char_6_x,...,char_29,char_30,char_31,char_32,char_33,char_34,char_35,char_36,char_37,char_38
498682,ppl_99997,act2_4367092,2023-04-22,type 2,,,,,,,...,False,False,False,False,False,False,False,False,False,36
498683,ppl_99997,act2_4404220,2022-11-12,type 2,,,,,,,...,False,False,False,False,False,False,False,False,False,36
498684,ppl_99997,act2_448830,2022-08-02,type 2,,,,,,,...,False,False,False,False,False,False,False,False,False,36
498685,ppl_99997,act2_450133,2022-08-02,type 2,,,,,,,...,False,False,False,False,False,False,False,False,False,36
498686,ppl_99997,act2_847967,2022-10-15,type 2,,,,,,,...,False,False,False,False,False,False,False,False,False,36


In [60]:
testMerge.to_csv('./data/test_merge.csv',index=False)

In [61]:
dfx = act_Train.groupby(['people_id','outcome']).size().unstack()
dfx = dfx.fillna(0).astype(int)

In [62]:
only1 = dfx[(dfx[0]==0) & (dfx[1]!=0)]
only0 = dfx[(dfx[0]!=0) & (dfx[1]==0)]
mix_0or1 = dfx[(dfx[0]!=0) & (dfx[1]!=0)]

In [63]:
print (len(only1.index),len(only0.index),len(mix_0or1))

62115 82524 6656


In [64]:
train_People = pd.merge(train_people, dfx, left_on = 'people_id',right_index = True)

In [65]:
train_People.rename(columns={0:'outcome_0',1:'outcome_1'}, inplace = True)

In [66]:
def ax(x):
    if x['outcome_0'] !=0 and x['outcome_1'] ==0:
        return 0
    elif x['outcome_0'] ==0 and x['outcome_1'] !=0:
        return 1
    else:
        return 2

In [67]:
train_People.head()

Unnamed: 0,people_id,char_1,group_1,char_2,date,char_3,char_4,char_5,char_6,char_7,...,char_31,char_32,char_33,char_34,char_35,char_36,char_37,char_38,outcome_0,outcome_1
0,ppl_100,type 2,group 17304,type 2,2021-06-29,type 5,type 5,type 5,type 3,type 11,...,True,False,False,True,True,True,False,36,6,0
1,ppl_100002,type 2,group 8688,type 3,2021-01-06,type 28,type 9,type 5,type 3,type 11,...,True,True,True,True,True,True,False,76,0,2
2,ppl_100003,type 2,group 33592,type 3,2022-06-10,type 4,type 8,type 5,type 2,type 5,...,True,True,True,True,False,True,True,99,0,34
4,ppl_100006,type 2,group 6534,type 3,2022-07-27,type 40,type 25,type 9,type 3,type 8,...,True,False,False,False,True,True,False,84,0,3
7,ppl_100013,type 2,group 4204,type 3,2023-01-24,type 4,type 8,type 4,type 1,type 7,...,True,True,True,True,False,True,True,91,0,5


In [68]:
train_People['result'] = train_People.apply(ax, axis = 1)

In [69]:
train_People.tail()

Unnamed: 0,people_id,char_1,group_1,char_2,date,char_3,char_4,char_5,char_6,char_7,...,char_32,char_33,char_34,char_35,char_36,char_37,char_38,outcome_0,outcome_1,result
189111,ppl_99981,type 2,group 17304,type 2,2023-01-26,type 5,type 5,type 5,type 2,type 5,...,True,False,True,False,True,True,3,7,0,0
189113,ppl_99987,type 1,group 8600,type 1,2022-04-02,type 4,type 6,type 4,type 3,type 11,...,True,False,True,False,True,True,89,0,1,1
189114,ppl_9999,type 2,group 17304,type 2,2023-02-23,type 6,type 2,type 8,type 3,type 11,...,False,False,False,False,False,False,0,2,0,0
189115,ppl_99992,type 2,group 17304,type 2,2020-06-25,type 5,type 5,type 3,type 4,type 16,...,False,False,False,False,False,False,0,2,0,0
189116,ppl_99994,type 2,group 17764,type 3,2023-01-06,type 2,type 7,type 2,type 1,type 2,...,True,False,True,True,True,True,95,0,46,1


In [70]:
train_People.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 151295 entries, 0 to 189116
Data columns (total 44 columns):
people_id    151295 non-null object
char_1       151295 non-null object
group_1      151295 non-null object
char_2       151295 non-null object
date         151295 non-null object
char_3       151295 non-null object
char_4       151295 non-null object
char_5       151295 non-null object
char_6       151295 non-null object
char_7       151295 non-null object
char_8       151295 non-null object
char_9       151295 non-null object
char_10      151295 non-null bool
char_11      151295 non-null bool
char_12      151295 non-null bool
char_13      151295 non-null bool
char_14      151295 non-null bool
char_15      151295 non-null bool
char_16      151295 non-null bool
char_17      151295 non-null bool
char_18      151295 non-null bool
char_19      151295 non-null bool
char_20      151295 non-null bool
char_21      151295 non-null bool
char_22      151295 non-null bool
char_23      15

In [71]:
X = train_People.drop(['people_id','outcome_0', 'outcome_1','result'],axis = 1)
y = train_People['result']
#151295

In [72]:
np.any(X != np.NaN)

True

In [73]:
for idx in X.columns:
    #X[idx] = X[idx].fillna('type 0')
    X[idx] = LabelEncoder().fit_transform(X[idx])

In [74]:
X.tail()

Unnamed: 0,char_1,group_1,char_2,date,char_3,char_4,char_5,char_6,char_7,char_8,...,char_29,char_30,char_31,char_32,char_33,char_34,char_35,char_36,char_37,char_38
189111,1,4691,1,981,38,20,4,1,20,7,...,0,0,0,1,0,1,0,1,1,3
189113,0,28923,0,682,33,21,3,2,2,1,...,0,0,0,1,0,1,0,1,1,89
189114,1,4691,1,1009,39,11,7,2,2,2,...,0,0,0,0,0,0,0,0,0,0
189115,1,4691,1,38,38,20,2,3,7,2,...,0,0,0,0,0,0,0,0,0,0
189116,1,5064,2,961,11,22,1,0,11,1,...,1,1,1,1,0,1,1,1,1,95


In [56]:
logimodel = LogisticRegression().fit(X,y)

In [62]:
cross_val_score(logimodel, X, y, scoring='accuracy',cv=cv)

array([ 0.81949769,  0.81962987,  0.8236616 ,  0.82181097,  0.82313285,
        0.81975015,  0.81968405,  0.81657743,  0.81690793,  0.82265847])

In [63]:
qda = QuadraticDiscriminantAnalysis(store_covariances = True).fit(X,y)



In [64]:
cross_val_score(qda, X, y, scoring='accuracy',cv=cv)



array([ 0.48360872,  0.47402512,  0.59391937,  0.49345671,  0.57224058,
        0.7880891 ,  0.65820609,  0.48079847,  0.49329103,  0.70963051])

In [65]:
lda = LinearDiscriminantAnalysis(n_components=3, solver="svd", store_covariance=True).fit(X, y)
cross_val_score(lda, X, y, scoring='accuracy',cv=cv)



array([ 0.81996034,  0.82233972,  0.81857237,  0.81705221,  0.81665565,
        0.81875868,  0.82358385,  0.82127041,  0.81875868,  0.82298896])

In [66]:
clf_norm = GaussianNB().fit(X,y)
cross_val_score(clf_norm, X, y, scoring='accuracy',cv=cv)

array([ 0.68830139,  0.68559154,  0.68744217,  0.6849306 ,  0.67699934,
        0.68649613,  0.68530637,  0.68001851,  0.67988631,  0.67763897])

In [67]:
clf_bern = BernoulliNB().fit(X, y)
cross_val_score(clf_bern, X, y, scoring='accuracy',cv=cv)

array([ 0.6784534 ,  0.67514871,  0.67805684,  0.67448777,  0.66794448,
        0.67572212,  0.6766475 ,  0.67155794,  0.66759204,  0.67129354])

In [68]:
clf_mult = MultinomialNB().fit(X,y)
cross_val_score(clf_mult, X, y, scoring='accuracy',cv=cv)

array([ 0.48770654,  0.48955717,  0.48420357,  0.47904825,  0.4849306 ,
        0.48998612,  0.48767268,  0.48139335,  0.49375372,  0.48628462])

In [77]:
tree1 = DecisionTreeClassifier(criterion='entropy', max_depth=1).fit(X, y)
cross_val_score(tree1, X, y, scoring='accuracy',cv=cv)

array([ 0.82009253,  0.82214144,  0.81837409,  0.81771315,  0.81982816,
        0.81889087,  0.82318726,  0.82100601,  0.82047723,  0.82252627])

In [80]:
bagging = BaggingClassifier(DecisionTreeClassifier(), bootstrap_features=True, random_state=0).fit(X, y)
cross_val_score(bagging, X, y, scoring='accuracy',cv=cv)

array([ 0.87891606,  0.88162591,  0.88023794,  0.87779247,  0.87680106,
        0.87599974,  0.87950294,  0.87837927,  0.87593364,  0.87751999])

In [81]:
extraforest = ExtraTreesClassifier(n_estimators=150, random_state=0).fit(X,y)
cross_val_score(extraforest, X, y, scoring='accuracy',cv=cv)

array([ 0.87561137,  0.88056841,  0.87514871,  0.87686715,  0.87607403,
        0.87309141,  0.87904025,  0.87282702,  0.87408289,  0.87117457])

In [82]:
forest = RandomForestClassifier(n_estimators = 150).fit(X,y)
cross_val_score(forest, X, y, scoring='accuracy',cv=cv)

array([ 0.87904825,  0.88446794,  0.87713153,  0.87627231,  0.87600793,
        0.87791658,  0.88082491,  0.87414899,  0.87646242,  0.87414899])

In [86]:
model_ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=5, random_state=0), 
                               algorithm="SAMME", n_estimators=100).fit(X,y)
cross_val_score(model_ada, X, y, scoring='accuracy',cv=cv)

NameError: name 'AdaBoostClassifier' is not defined

In [None]:
model_grad = GradientBoostingClassifier(n_estimators=100, max_depth=5, random_state=0).fit(X,y)
cross_val_score(model_grad, X, y, scoring='accuracy',cv=cv)

In [35]:
model_xgb = xgboost.XGBClassifier(n_estimators=100, max_depth=10)
%time
Xx = model_xgb.fit(X,y)

Wall time: 0 ns


In [36]:
for ix in range(len(X.columns)):
    print (X.columns[ix], Xx.feature_importances_[ix])

char_1 0.00996145
group_1 0.272938
char_2 0.00406065
date 0.142657
char_3 0.054752
char_4 0.0442251
char_5 0.03376
char_6 0.0334002
char_7 0.0752197
char_8 0.0320535
char_9 0.0323619
char_10 0.00654845
char_11 0.00821383
char_12 0.00645592
char_13 0.00734002
char_14 0.00580828
char_15 0.00571575
char_16 0.00547931
char_17 0.00473914
char_18 0.00728861
char_19 0.00499615
char_20 0.00493446
char_21 0.00400925
char_22 0.00479054
char_23 0.00612696
char_24 0.0074531
char_25 0.00924184
char_26 0.00632228
char_27 0.00576715
char_28 0.00217939
char_29 0.00643536
char_30 0.00539707
char_31 0.00688769
char_32 0.00533539
char_33 0.00572603
char_34 0.0059008
char_35 0.00723721
char_36 0.00457466
char_37 0.00392701
char_38 0.099779


In [37]:
actX = act_Train.drop(['people_id','activity_id','outcome'],axis = 1)
acty = act_Train['outcome']

In [38]:
for idx in actX.columns:
    actX[idx] = actX[idx].fillna('type 0')
    actX[idx] = LabelEncoder().fit_transform(actX[idx])

In [69]:
logimodel_act = LogisticRegression().fit(actX,acty)

In [70]:
cross_val_score(logimodel_act, actX, acty, scoring='accuracy',cv=cv)

array([ 0.61977427,  0.50443046,  0.51324131,  0.51512545,  0.53869084,
        0.63643397,  0.52123752,  0.6305722 ,  0.57515849,  0.48487455])

In [71]:
qda_act= QuadraticDiscriminantAnalysis(store_covariances = True).fit(actX,acty)



In [72]:
cross_val_score(qda_act, actX, acty, scoring='accuracy',cv=cv)



array([ 0.39371501,  0.50752973,  0.50146317,  0.4968757 ,  0.47854858,
        0.37569461,  0.48644922,  0.38067347,  0.43497217,  0.49541026])

In [73]:
lda_act = LinearDiscriminantAnalysis(n_components=3, solver="svd", store_covariance=True).fit(actX, acty)
cross_val_score(lda_act, actX, acty, scoring='accuracy',cv=cv)

array([ 0.61819961,  0.50453058,  0.51197156,  0.51494796,  0.53868174,
        0.63604258,  0.52119656,  0.63067688,  0.57513118,  0.48533876])

In [74]:
clf_norm_act = GaussianNB().fit(actX,acty)
cross_val_score(clf_norm_act, actX, acty, scoring='accuracy',cv=cv)

array([ 0.39244982,  0.50046648,  0.50321532,  0.49189684,  0.50147682,
        0.37361477,  0.5005211 ,  0.37857998,  0.43417573,  0.47920393])

In [75]:
clf_bern_act = BernoulliNB().fit(actX, acty)
cross_val_score(clf_bern_act, actX, acty, scoring='accuracy',cv=cv)

array([ 0.6203477 ,  0.50820329,  0.51335054,  0.51678659,  0.5408617 ,
        0.63783115,  0.52536989,  0.63180554,  0.58041497,  0.48548439])

In [76]:
clf_mult_act = MultinomialNB().fit(actX,acty)
cross_val_score(clf_mult_act, actX, acty, scoring='accuracy',cv=cv)

array([ 0.59483457,  0.50232787,  0.49638418,  0.4996837 ,  0.50401631,
        0.61737413,  0.4923838 ,  0.62836949,  0.53971483,  0.48157048])

In [78]:
tree_act = DecisionTreeClassifier(criterion='entropy', max_depth=1).fit(actX, acty)
cross_val_score(tree_act, actX, acty, scoring='accuracy',cv=cv)

array([ 0.6203477 ,  0.50820329,  0.51335054,  0.51678659,  0.5408617 ,
        0.63783115,  0.52536989,  0.63180554,  0.58041497,  0.48548439])

In [83]:
bagging_act = BaggingClassifier(DecisionTreeClassifier(), bootstrap_features=True, random_state=0).fit(actX, acty)
cross_val_score(bagging_act, actX, acty, scoring='accuracy',cv=cv)

array([ 0.59866199,  0.58892545,  0.5889391 ,  0.59686705,  0.5929941 ,
        0.68994534,  0.59068216,  0.70186457,  0.63712118,  0.57653291])

In [84]:
extraforest_act = ExtraTreesClassifier(n_estimators=150, random_state=0).fit(actX,acty)
cross_val_score(extraforest_act, actX, acty, scoring='accuracy',cv=cv)

array([ 0.66260411,  0.58974009,  0.5919519 ,  0.59710371,  0.59362214,
        0.69190685,  0.59203382,  0.70196924,  0.63570125,  0.57894952])

In [85]:
forest_act = RandomForestClassifier(n_estimators = 150).fit(actX,acty)
cross_val_score(forest_act, actX, acty, scoring='accuracy',cv=cv)

array([ 0.59942657,  0.59016789,  0.59176986,  0.59723569,  0.59252989,
        0.69185224,  0.59141033,  0.65345949,  0.63431318,  0.57873562])

In [None]:
model_ada_act = AdaBoostClassifier(DecisionTreeClassifier(max_depth=5, random_state=0), 
                               algorithm="SAMME", n_estimators=100).fit(actX,acty)
cross_val_score(model_ada_act, actX, acty, scoring='accuracy',cv=cv)

In [None]:
model_grad_act = GradientBoostingClassifier(n_estimators=100, max_depth=5, random_state=0).fit(actX,acty)
cross_val_score(model_grad_act, actX, acty, scoring='accuracy',cv=cv)

In [39]:
model_xgb2 = xgboost.XGBClassifier(n_estimators=100, max_depth=10)
%time
Xx2 = model_xgb2.fit(actX,acty)

Wall time: 0 ns


In [40]:
for ix in range(len(actX.columns)):
    print (actX.columns[ix], Xx2.feature_importances_[ix])

date 0.435285
activity_category 0.0495369
char_1 0.0314591
char_2 0.0288091
char_3 0.020413
char_4 0.012804
char_5 0.0119644
char_6 0.0106263
char_7 0.0145882
char_8 0.0204655
char_9 0.0222759
char_10 0.341773


In [75]:
for idx in trainMerge.columns:
    if 'type 0' in list(trainMerge[idx].unique()):
        print (idx, 'type 0')
    else:pass

In [76]:
trainMerge.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2197291 entries, 0 to 2197290
Data columns (total 55 columns):
people_id            object
activity_id          object
date_x               object
activity_category    object
char_1_x             object
char_2_x             object
char_3_x             object
char_4_x             object
char_5_x             object
char_6_x             object
char_7_x             object
char_8_x             object
char_9_x             object
char_10_x            object
outcome              int64
char_1_y             object
group_1              object
char_2_y             object
date_y               object
char_3_y             object
char_4_y             object
char_5_y             object
char_6_y             object
char_7_y             object
char_8_y             object
char_9_y             object
char_10_y            bool
char_11              bool
char_12              bool
char_13              bool
char_14              bool
char_15              bool
char

In [None]:
trainMerge.head()

Unnamed: 0,people_id,activity_id,date_x,activity_category,char_1_x,char_2_x,char_3_x,char_4_x,char_5_x,char_6_x,...,char_29,char_30,char_31,char_32,char_33,char_34,char_35,char_36,char_37,char_38
0,ppl_100,act2_1734928,2023-08-26,type 4,,,,,,,...,False,True,True,False,False,True,True,True,False,36
1,ppl_100,act2_2434093,2022-09-27,type 2,,,,,,,...,False,True,True,False,False,True,True,True,False,36
2,ppl_100,act2_3404049,2022-09-27,type 2,,,,,,,...,False,True,True,False,False,True,True,True,False,36
3,ppl_100,act2_3651215,2023-08-04,type 2,,,,,,,...,False,True,True,False,False,True,True,True,False,36
4,ppl_100,act2_4109017,2023-08-26,type 2,,,,,,,...,False,True,True,False,False,True,True,True,False,36


In [None]:
for idx in trainMerge.columns:
    print (idx)
    if idx not in ['people_id', 'activity_id', 'date_x','date_y', 'char_38', 'outcome']:
        if trainMerge[idx].dtype == 'object':
            trainMerge.fillna('type 0', inplace = True)
            trainMerge[idx] = trainMerge[idx].apply(lambda x:x.split(' ')[1]).astype(np.int32)
        elif trainMerge[idx].dtype == 'bool':
            trainMerge[idx] = trainMerge[idx].astype(np.int8)

people_id
activity_id
date_x
activity_category
char_1_x
char_2_x


In [None]:
trainMerge['date_x'] = pd.to_datetime(trainMerge['date_x'])
trainMerge['date_y'] = pd.to_datetime(trainMerge['date_y'])

In [None]:
trainMerge['year_x'] = trainMerge['date_x'].dt.year
trainMerge['month_x'] = trainMerge['date_x'].dt.month
trainMerge['day_x'] = trainMerge['date_x'].dt.day
trainMerge['weekday_x'] = trainMerge['date_x'].dt.weekday
trainMerge['weekend_x'] = ((trainMerge.weekday_x == 0) | (trainMerge.weekday_x == 6)).astype(int)
trainMerge = trainMerge.drop('date_x', axis = 1)
    
trainMerge['year_y'] = trainMerge['date_y'].dt.year
trainMerge['month_y'] = trainMerge['date_y'].dt.month
trainMerge['day_y'] = trainMerge['date_y'].dt.day
trainMerge['weekday_y'] = trainMerge['date_y'].dt.weekday
trainMerge['weekend_y'] = ((trainMerge.weekday_y == 0) | (trainMerge.weekday_y == 6)).astype(int)
trainMerge = trainMerge.drop('date_y', axis = 1)

In [None]:
trainMerge.tail()

In [None]:
mergeX = trainMerge.drop(['people_id','activity_id','outcome'],axis = 1)
mergey = trainMerge['outcome']

In [None]:
for idx in testMerge.columns:
    print (idx)
    if idx not in ['people_id', 'activity_id', 'date_x','date_y', 'char_38', 'outcome']:
        if testMerge[idx].dtype == 'object':
            testMerge.fillna('type 0', inplace = True)
            testMerge[idx] = testMerge[idx].apply(lambda x:x.split(' ')[1]).astype(np.int32)
        elif testMerge[idx].dtype == 'bool':
            testMerge[idx] = testMerge[idx].astype(np.int8)

In [50]:
testMerge['date_x'] = pd.to_datetime(testMerge['date_x'])
testMerge['date_y'] = pd.to_datetime(testMerge['date_y'])

KeyError: 'date_x'

In [48]:
testMerge['year_x'] = testMerge['date_x'].dt.year
testMerge['month_x'] = testMerge['date_x'].dt.month
testMerge['day_x'] = testMerge['date_x'].dt.day
testMerge['weekday_x'] = testMerge['date_x'].dt.weekday
testMerge['weekend_x'] = ((testMerge.weekday_x == 0) | (testMerge.weekday_x == 6)).astype(int)
testMerge = testMerge.drop('date_x', axis = 1)
    
testMerge['year_y'] = testMerge['date_y'].dt.year
testMerge['month_y'] = testMerge['date_y'].dt.month
testMerge['day_y'] = testMerge['date_y'].dt.day
testMerge['weekday_y'] = testMerge['date_y'].dt.weekday
testMerge['weekend_y'] = ((testMerge.weekday_y == 0) | (testMerge.weekday_y == 6)).astype(int)
testMerge = testMerge.drop('date_y', axis = 1)

KeyError: 'date_x'

In [46]:
testX = testMerge.drop(['people_id','activity_id'],axis = 1)
#testy = trainMerge['outcome']

In [103]:
model_xgb3 = xgboost.XGBClassifier(n_estimators=100)
%time
Xx3 = model_xgb3.fit(mergeX,mergey)
#cross_val_score(Xx3, mergeX, mergey, scoring='accuracy',cv=cv)

CPU times: user 3 µs, sys: 1e+03 ns, total: 4 µs
Wall time: 7.15 µs


In [104]:
output = pd.concat([testMerge['activity_id'], pd.DataFrame(Xx3.predict(testX))],axis = 1)
output.rename({0:'outcome'},axis = 1,inplace = True)

In [105]:
output.to_csv('./submission.csv',index = False)

In [72]:
for ix in range(len(mergeX.columns)):
    print (mergeX.columns[ix], Xx3.feature_importances_[ix])

activity_category 0.00743537
char_1_x 0.00196751
char_2_x 0.00180737
char_3_x 0.00102951
char_4_x 0.000640586
char_5_x 0.00105239
char_6_x 0.000617708
char_7_x 0.000938
char_8_x 0.00118966
char_9_x 0.00164722
char_10_x 0.0199497
char_1_y 0.00652025
group_1 0.21757
char_2_y 0.00304278
char_3_y 0.0396248
char_4_y 0.0310684
char_5_y 0.0210249
char_6_y 0.0322352
char_7_y 0.0701213
char_8_y 0.0334248
char_9_y 0.0274308
char_10_y 0.00587966
char_11 0.00617708
char_12 0.00473576
char_13 0.00571952
char_14 0.00462137
char_15 0.00510181
char_16 0.00382064
char_17 0.00409517
char_18 0.00558225
char_19 0.00382064
char_20 0.00423244
char_21 0.00315717
char_22 0.00356898
char_23 0.00462137
char_24 0.00443834
char_25 0.00800732
char_26 0.00475864
char_27 0.00391215
char_28 0.00148707
char_29 0.0052162
char_30 0.00475864
char_31 0.0052162
char_32 0.00352322
char_33 0.00329444
char_34 0.00510181
char_35 0.00571952
char_36 0.00363761
char_37 0.00272249
char_38 0.0787692
year_x 0.0267902
month_x 0.05902

In [73]:
logimodelmerge = LogisticRegression().fit(mergeX,mergey)
cross_val_score(logimodelmerge, mergeX, mergey, scoring='accuracy',cv=cv)

array([ 0.85367952,  0.81624638,  0.81061671,  0.80914672,  0.80718521,
        0.85277774,  0.8079862 ,  0.86519303,  0.81605978,  0.77994257])

In [74]:
qdamerge = QuadraticDiscriminantAnalysis(store_covariances = True).fit(mergeX,mergey)



In [75]:
cross_val_score(qdamerge, mergeX, mergey, scoring='accuracy',cv=cv)



array([ 0.62059346,  0.50818053,  0.81209126,  0.517046  ,  0.81556372,
        0.63777198,  0.81617811,  0.63190111,  0.82374197,  0.48560272])

In [76]:
ldamerge = LinearDiscriminantAnalysis(n_components=3, solver="svd", store_covariance=True).fit(mergeX, mergey)
cross_val_score(ldamerge, mergeX, mergey, scoring='accuracy',cv=cv)

array([ 0.85596869,  0.82679118,  0.81842633,  0.82106595,  0.82131171,
        0.85879424,  0.8212935 ,  0.87016734,  0.82472045,  0.82841136])

In [77]:
clf_normmerge = GaussianNB().fit(mergeX,mergey)
cross_val_score(clf_normmerge, mergeX, mergey, scoring='accuracy',cv=cv)

array([ 0.73122013,  0.66348548,  0.67646965,  0.68250891,  0.67930041,
        0.74960064,  0.67598269,  0.74794861,  0.59273924,  0.66433197])

In [78]:
clf_bernmerge = BernoulliNB().fit(mergeX, mergey)
cross_val_score(clf_bernmerge, mergeX, mergey, scoring='accuracy',cv=cv)

array([ 0.71996086,  0.64669661,  0.65858853,  0.66844613,  0.66450036,
        0.7376086 ,  0.64823032,  0.73514192,  0.57807117,  0.64820301])

In [79]:
clf_multmerge = MultinomialNB().fit(mergeX,mergey)
cross_val_score(clf_multmerge, mergeX, mergey, scoring='accuracy',cv=cv)

array([ 0.74633869,  0.66453222,  0.68505295,  0.67880435,  0.68361482,
        0.76247559,  0.68787916,  0.77176431,  0.69864697,  0.65938497])

In [80]:
tree1merge = DecisionTreeClassifier(criterion='entropy', max_depth=1).fit(mergeX, mergey)
cross_val_score(tree1merge, mergeX, mergey, scoring='accuracy',cv=cv)

array([ 0.85038456,  0.82158022,  0.81369323,  0.81952314,  0.81330184,
        0.85232263,  0.81772547,  0.87026747,  0.82280445,  0.82262241])

In [81]:
baggingmerge = BaggingClassifier(DecisionTreeClassifier(), bootstrap_features=True, random_state=0).fit(mergeX, mergey)
cross_val_score(baggingmerge, mergeX, mergey, scoring='accuracy',cv=cv)

array([ 0.90233013,  0.87740353,  0.86963032,  0.86558442,  0.87749   ,
        0.90748149,  0.87475936,  0.8635683 ,  0.88395251,  0.84388497])

In [96]:
pred = baggingmerge.predict(testX)

In [101]:
output = pd.concat([testMerge['activity_id'], pd.DataFrame(pred)],axis = 1)
output.rename({0:'outcome'},axis = 1,inplace = True)

In [102]:
output.to_csv('./submission.csv',index = False)

In [106]:
forestmerge = RandomForestClassifier(n_estimators = 100).fit(mergeX,mergey)
#cross_val_score(forestmerge, mergeX, mergey, scoring='accuracy',cv=cv)
output2 = pd.concat([testMerge['activity_id'], pd.DataFrame(forestmerge.predict(testX))],axis = 1)
output2.rename({0:'outcome'},axis = 1,inplace = True)

In [109]:
output2.to_csv('./submission2.csv',index=False)

In [111]:
model_gradmerge = GradientBoostingClassifier(n_estimators=100,random_state=0).fit(mergeX,mergey)
#cross_val_score(model_gradmerge, mergeX, mergey, scoring='accuracy',cv=cv)

In [112]:
output3 = pd.concat([testMerge['activity_id'], pd.DataFrame(model_gradmerge.predict(testX))],axis = 1)
output3.rename({0:'outcome'},axis = 1,inplace = True)

In [113]:
output3.to_csv('./submission3.csv',index=False)

In [114]:
extraforestmerge = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(mergeX,mergey)
#cross_val_score(extraforestmerge, mergeX, mergey, scoring='accuracy',cv=cv)

In [115]:
output4 = pd.concat([testMerge['activity_id'], pd.DataFrame(extraforestmerge.predict(testX))],axis = 1)
output4.rename({0:'outcome'},axis = 1,inplace = True)

In [116]:
output4.to_csv('./submission4.csv',index=False)

In [117]:
model_adamerge = AdaBoostClassifier(DecisionTreeClassifier(max_depth=5, random_state=0), 
                               algorithm="SAMME", n_estimators=100).fit(mergeX,mergey)
#cross_val_score(model_adamerge, mergeX, mergey, scoring='accuracy',cv=cv)

In [118]:
output5 = pd.concat([testMerge['activity_id'], pd.DataFrame(model_adamerge.predict(testX))],axis = 1)
output5.rename({0:'outcome'},axis = 1,inplace = True)

In [119]:
output5.to_csv('./submission5.csv',index=False)

In [48]:
from sklearn.svm import SVC

In [None]:
modelsvc = SVC(kernel='linear').fit(mergeX, mergey)

In [None]:
output6 = pd.concat([testMerge['activity_id'], pd.DataFrame(modelsvc.predict(testX))],axis = 1)
output6.rename({0:'outcome'},axis = 1,inplace = True)

In [None]:
output6.to_csv('./submission6.csv',index=False)

In [None]:
modelsvc2 = SVC(kernel='rbf',C=0.5,gamma=30).fit(mergeX, mergey)

In [None]:
output7 = pd.concat([testMerge['activity_id'], pd.DataFrame(modelsvc2.predict(testX))],axis = 1)
output7.rename({0:'outcome'},axis = 1,inplace = True)
output7.to_csv('./submission7.csv',index=False)