# Classification

## Team Name
>### Sigma  

## Team Members
>### 조현윤, 이상협, 정하연  

## Objective
> ### Red Hat is looking for better ways to use behavioral data to predict which individuals it should approach, and even when and how to approach them.
> ### The goal is to create a classification algorithm that accurately identifies which customers have the most potential business value for Red Hat, based on their characteristics and activities.
> ### Concretely, the task is to predict the potential business value of a person who has performed a specific activity.

## Evaluation
> ### Submissions are evaluated on the area under the ROC curve between the predicted and the observed outcome.
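
To make the metric concrete, here is a minimal sketch of ROC AUC computed directly from its pairwise definition: the fraction of (positive, negative) pairs in which the positive example receives the higher score, with ties counted as one half. The function name `roc_auc` and the toy labels and scores are illustrative assumptions, not part of the competition code.

```python
def roc_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (invented): two negatives, two positives
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, y_score)
print(auc)  # 3 of the 4 pairs are ordered correctly -> 0.75
```

Because only the ordering of the scores matters, any monotonic rescaling of the predicted probabilities leaves the AUC unchanged. The double loop is quadratic, so on data of this size a library routine such as `sklearn.metrics.roc_auc_score` would be the practical choice.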

## Submission File
> ### For each activity_id in the test set, you must predict a probability for the 'outcome' variable, represented by a number between 0 and 1.
~~~~
activity_id,outcome
act1_1,0
act1_100006,0
act1_100050,0
~~~~
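
A submission in this format could be assembled roughly as follows. This is a sketch only: the `activity_id` values are taken from the sample above, while the probabilities are made-up placeholders standing in for real model output.

```python
import io

import pandas as pd

# activity_id values from the sample file; the probabilities are invented placeholders
activity_ids = ["act1_1", "act1_100006", "act1_100050"]
predicted = [0.12, 0.87, 0.5]

submission = pd.DataFrame({"activity_id": activity_ids, "outcome": predicted})

# Every prediction must be a probability between 0 and 1
assert submission["outcome"].between(0, 1).all()

# Write to an in-memory buffer here; pass a file path instead to create the real file
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing `index=False` keeps the pandas row index out of the file, so the header is exactly `activity_id,outcome`.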

## Data
> ### The competition uses two separate data files that may be joined together to create a single, unified data table: a people file and an activity file.
> ### The people file contains all of the unique people (and the corresponding characteristics) that have performed activities over time. Each row in the people file represents a unique person. Each person has a unique people_id.
> ### The activity file contains all of the unique activities (and the corresponding activity characteristics) that each person has performed over time. Each row in the activity file represents a unique activity performed by a person on a certain date. Each activity has a unique activity_id.
> ### The activity file contains several different categories of activities. 
>> Type 1 activities are different from type 2-7 activities because there are more known characteristics associated with type 1 activities (nine in total) than type 2-7 activities (which have only one associated characteristic).
> ### To develop a predictive model with this data, you will likely need to join the files together into a single data set. The two files can be joined using people_id as the common key. All variables are categorical, with the exception of 'char_38' in the people file, which is a continuous numerical variable.
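
The join described above can be sketched on toy data. The frames `people_toy` and `act_toy` and all of their values are invented for illustration. Explicit suffixes are used here so that same-named columns (both files have a `char_1`) stay distinguishable; pandas would otherwise append its defaults `_x` and `_y`.

```python
import pandas as pd

# Toy stand-ins for the people and activity files (values invented)
people_toy = pd.DataFrame({
    "people_id": ["ppl_1", "ppl_2"],
    "char_1": ["type 2", "type 1"],
    "char_38": [36, 76],
})
act_toy = pd.DataFrame({
    "people_id": ["ppl_1", "ppl_1", "ppl_2"],
    "activity_id": ["act2_10", "act2_11", "act1_12"],
    "char_1": [None, None, "type 3"],
})

# Many activities map to one person; validate= raises if people_id repeats in people_toy
merged = pd.merge(
    act_toy,
    people_toy,
    on="people_id",
    how="left",
    suffixes=("_act", "_ppl"),
    validate="many_to_one",
)
print(merged.columns.tolist())
```

A left join keeps every activity row even if a person were missing from the people file, and `validate="many_to_one"` is a cheap guard against accidental row duplication.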

## Reference 
[Kaggle: Predicting Red Hat Business Value](https://www.kaggle.com/c/predicting-red-hat-business-value)

### Load Python Packages

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
from datetime import datetime
from datetime import date
import seaborn as sns
import statsmodels.api as sm
import statsmodels.stats.api as sms
import statsmodels.stats.stattools as stools
import scipy as sp
%matplotlib inline

## Exploratory Data Analysis (EDA)

## Load Data Set

In [2]:
# activity data set
act_Train = pd.read_csv('./data/act_train.csv')
act_Test = pd.read_csv('./data/act_test.csv')
# people data set
people = pd.read_csv('./data/people.csv')

### Split the people data set into train and test subsets

In [3]:
# people_id values that appear in each activity file
idx_train = list(act_Train['people_id'].unique())
idx_test = list(act_Test['people_id'].unique())

In [4]:
train_people = people.loc[people['people_id'].isin(idx_train)]
test_people = people.loc[people['people_id'].isin(idx_test)]

In [5]:
train_people.to_csv('./data/act_train_people.csv',index=False)
test_people.to_csv('./data/act_test_people.csv',index=False)

In [6]:
print('Number of active people: {}'.format(act_Train['people_id'].nunique()))

Number of active people: 151295


In [7]:
print('Number of active people: {}'.format(act_Test['people_id'].nunique()))

Number of active people: 37823


In [8]:
trainMerge = pd.merge(act_Train,people, on='people_id')
trainMerge.tail()

Unnamed: 0,people_id,activity_id,date_x,activity_category,char_1_x,char_2_x,char_3_x,char_4_x,char_5_x,char_6_x,...,char_29,char_30,char_31,char_32,char_33,char_34,char_35,char_36,char_37,char_38
2197286,ppl_99994,act2_4668076,2023-06-16,type 4,,,,,,,...,True,True,True,True,False,True,True,True,True,95
2197287,ppl_99994,act2_4743548,2023-03-30,type 4,,,,,,,...,True,True,True,True,False,True,True,True,True,95
2197288,ppl_99994,act2_536973,2023-01-19,type 2,,,,,,,...,True,True,True,True,False,True,True,True,True,95
2197289,ppl_99994,act2_688656,2023-05-02,type 4,,,,,,,...,True,True,True,True,False,True,True,True,True,95
2197290,ppl_99994,act2_715089,2023-06-15,type 2,,,,,,,...,True,True,True,True,False,True,True,True,True,95


In [9]:
trainMerge.to_csv('./data/train_merge.csv',index=False)

In [10]:
testMerge = pd.merge(act_Test,people, on='people_id')
testMerge.tail()

Unnamed: 0,people_id,activity_id,date_x,activity_category,char_1_x,char_2_x,char_3_x,char_4_x,char_5_x,char_6_x,...,char_29,char_30,char_31,char_32,char_33,char_34,char_35,char_36,char_37,char_38
498682,ppl_99997,act2_4367092,2023-04-22,type 2,,,,,,,...,False,False,False,False,False,False,False,False,False,36
498683,ppl_99997,act2_4404220,2022-11-12,type 2,,,,,,,...,False,False,False,False,False,False,False,False,False,36
498684,ppl_99997,act2_448830,2022-08-02,type 2,,,,,,,...,False,False,False,False,False,False,False,False,False,36
498685,ppl_99997,act2_450133,2022-08-02,type 2,,,,,,,...,False,False,False,False,False,False,False,False,False,36
498686,ppl_99997,act2_847967,2022-10-15,type 2,,,,,,,...,False,False,False,False,False,False,False,False,False,36


In [11]:
testMerge.to_csv('./data/test_merge.csv',index=False)

In [12]:
dfx = act_Train.groupby(['people_id','outcome']).size().unstack()
dfx = dfx.fillna(0).astype(int)

In [13]:
only1 = dfx[(dfx[0]==0) & (dfx[1]!=0)]
only0 = dfx[(dfx[0]!=0) & (dfx[1]==0)]
mix_0or1 = dfx[(dfx[0]!=0) & (dfx[1]!=0)]

In [14]:
print (len(only1.index),len(only0.index),len(mix_0or1))

62115 82524 6656


In [15]:
train_People = pd.merge(train_people, dfx, left_on = 'people_id',right_index = True)

In [16]:
train_People.rename(columns={0:'outcome_0',1:'outcome_1'}, inplace = True)

In [17]:
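def ax(x):
    # Label each person by the outcomes seen across their activities:
    # 0 = only outcome 0, 1 = only outcome 1, 2 = both outcomes occur
    if x['outcome_0'] != 0 and x['outcome_1'] == 0:
        return 0
    elif x['outcome_0'] == 0 and x['outcome_1'] != 0:
        return 1
    else:
        return 2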
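
The same three-way labelling can also be written without a row-wise `apply`, which tends to be slow on a frame with well over a hundred thousand people. The sketch below uses `np.select` on a toy frame; the frame `counts` and its values are invented, but the column names follow the `outcome_0` / `outcome_1` layout used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy per-person outcome counts (invented values)
counts = pd.DataFrame(
    {"outcome_0": [6, 0, 3], "outcome_1": [0, 2, 4]},
    index=["ppl_a", "ppl_b", "ppl_c"],
)

conditions = [
    (counts["outcome_0"] != 0) & (counts["outcome_1"] == 0),  # only outcome 0 seen
    (counts["outcome_0"] == 0) & (counts["outcome_1"] != 0),  # only outcome 1 seen
]
# Rows matching neither condition fall through to the default label 2
counts["result"] = np.select(conditions, [0, 1], default=2)
print(counts["result"].tolist())  # [0, 1, 2]
```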

In [18]:
train_People['result'] = train_People.apply(ax, axis = 1)

In [19]:
train_People.head()

Unnamed: 0,people_id,char_1,group_1,char_2,date,char_3,char_4,char_5,char_6,char_7,...,char_32,char_33,char_34,char_35,char_36,char_37,char_38,outcome_0,outcome_1,result
0,ppl_100,type 2,group 17304,type 2,2021-06-29,type 5,type 5,type 5,type 3,type 11,...,False,False,True,True,True,False,36,6,0,0
1,ppl_100002,type 2,group 8688,type 3,2021-01-06,type 28,type 9,type 5,type 3,type 11,...,True,True,True,True,True,False,76,0,2,1
2,ppl_100003,type 2,group 33592,type 3,2022-06-10,type 4,type 8,type 5,type 2,type 5,...,True,True,True,False,True,True,99,0,34,1
4,ppl_100006,type 2,group 6534,type 3,2022-07-27,type 40,type 25,type 9,type 3,type 8,...,False,False,False,True,True,False,84,0,3,1
7,ppl_100013,type 2,group 4204,type 3,2023-01-24,type 4,type 8,type 4,type 1,type 7,...,True,True,True,False,True,True,91,0,5,1


In [24]:
X.tail()

Unnamed: 0,char_1,group_1,char_2,date,char_3,char_4,char_5,char_6,char_7,char_8,...,char_29,char_30,char_31,char_32,char_33,char_34,char_35,char_36,char_37,char_38
189111,1,4691,1,981,38,20,4,1,20,7,...,0,0,0,1,0,1,0,1,1,3
189113,0,28923,0,682,33,21,3,2,2,1,...,0,0,0,1,0,1,0,1,1,89
189114,1,4691,1,1009,39,11,7,2,2,2,...,0,0,0,0,0,0,0,0,0,0
189115,1,4691,1,38,38,20,2,3,7,2,...,0,0,0,0,0,0,0,0,0,0
189116,1,5064,2,961,11,22,1,0,11,1,...,1,1,1,1,0,1,1,1,1,95


In [25]:
import xgboost



In [76]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold

In [77]:
from sklearn.tree import DecisionTreeClassifier

In [78]:
from sklearn.metrics import *

In [79]:
from sklearn.ensemble import GradientBoostingClassifier

In [80]:
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier

In [32]:
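model_xgb = xgboost.XGBClassifier(n_estimators=100, max_depth=10)
# %time only times the statement on its own line; a bare %time measures nothing,
# which is why the timing recorded below is only a few microseconds
%time Xx = model_xgb.fit(X, y)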

CPU times: user 2 µs, sys: 1 µs, total: 3 µs
Wall time: 5.96 µs


In [33]:
for ix in range(len(X.columns)):
    print (X.columns[ix], Xx.feature_importances_[ix])

char_1 0.00996145
group_1 0.272938
char_2 0.00406065
date 0.142657
char_3 0.054752
char_4 0.0442251
char_5 0.03376
char_6 0.0334002
char_7 0.0752197
char_8 0.0320535
char_9 0.0323619
char_10 0.00654845
char_11 0.00821383
char_12 0.00645592
char_13 0.00734002
char_14 0.00580828
char_15 0.00571575
char_16 0.00547931
char_17 0.00473914
char_18 0.00728861
char_19 0.00499615
char_20 0.00493446
char_21 0.00400925
char_22 0.00479054
char_23 0.00612696
char_24 0.0074531
char_25 0.00924184
char_26 0.00632228
char_27 0.00576715
char_28 0.00217939
char_29 0.00643536
char_30 0.00539707
char_31 0.00688769
char_32 0.00533539
char_33 0.00572603
char_34 0.0059008
char_35 0.00723721
char_36 0.00457466
char_37 0.00392701
char_38 0.099779


In [34]:
actX = act_Train.drop(['people_id','activity_id','outcome'],axis = 1)
acty = act_Train['outcome']

In [35]:
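# LabelEncoder is not imported in any cell shown above
from sklearn.preprocessing import LabelEncoder

for idx in actX.columns:
    # treat missing values as their own category, then map categories to integers
    actX[idx] = actX[idx].fillna('type 0')
    actX[idx] = LabelEncoder().fit_transform(actX[idx])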

In [36]:
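model_xgb2 = xgboost.XGBClassifier(n_estimators=100, max_depth=10)
# %time must share a line with the statement it times; the bare %time used originally
# explains the microsecond timing recorded below
%time Xx2 = model_xgb2.fit(actX, acty)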

CPU times: user 3 µs, sys: 0 ns, total: 3 µs
Wall time: 8.11 µs


In [37]:
for ix in range(len(actX.columns)):
    print (actX.columns[ix], Xx2.feature_importances_[ix])

date 0.435285
activity_category 0.0495369
char_1 0.0314591
char_2 0.0288091
char_3 0.020413
char_4 0.012804
char_5 0.0119644
char_6 0.0106263
char_7 0.0145882
char_8 0.0204655
char_9 0.0222759
char_10 0.341773


In [81]:
# Check whether 'type 0' already occurs as a real category in any column
for idx in trainMerge.columns:
    if 'type 0' in list(trainMerge[idx].unique()):
        print(idx, 'type 0')

In [82]:
mergeX = trainMerge.drop(['people_id','activity_id','outcome'],axis = 1)
mergey = trainMerge['outcome']

In [83]:
for idx in mergeX.columns:
    mergeX[idx] = mergeX[idx].fillna('type 0')
    mergeX[idx] = LabelEncoder().fit_transform(mergeX[idx])
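
One thing to watch with per-column label encoding is that an encoder fitted on the training data alone may assign different integers to the same category in the test data, or fail on categories it has not seen. The sketch below is one possible way around that, not the approach used in this notebook: it builds a shared category list from both sets. The two toy columns are invented.

```python
import pandas as pd

# Toy train/test versions of one categorical column (values invented)
train_col = pd.Series(["type 2", None, "type 1"])
test_col = pd.Series(["type 3", "type 1", None])

# Same placeholder for missing values as in the notebook
train_filled = train_col.fillna("type 0")
test_filled = test_col.fillna("type 0")

# One shared, sorted category list so a given label gets the same code in both sets
categories = sorted(set(train_filled) | set(test_filled))
train_codes = pd.Categorical(train_filled, categories=categories).codes
test_codes = pd.Categorical(test_filled, categories=categories).codes

print(categories)
print(train_codes.tolist(), test_codes.tolist())
```

The codes are arbitrary identifiers rather than meaningful magnitudes: string sorting puts `type 10` before `type 2`, so the integer order does not follow the numeric suffix.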

In [85]:
mergeX1 = mergeX[['date_x','group_1','date_y','char_38','char_7_y','char_10_x']]

In [88]:
cv =KFold(10)

In [89]:
from sklearn.ensemble import ExtraTreesClassifier

In [90]:
forest = ExtraTreesClassifier(n_estimators=150, random_state=0)
forest.fit(mergeX1, mergey)

ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='gini',
           max_depth=None, max_features='auto', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=2,
           min_weight_fraction_leaf=0.0, n_estimators=150, n_jobs=1,
           oob_score=False, random_state=0, verbose=0, warm_start=False)

In [91]:
cross_val_score(forest, mergeX1, mergey, scoring='accuracy',cv=cv)

array([ 0.90596186,  0.88171338,  0.86944827,  0.87097288,  0.87468199,
        0.90906071,  0.86988518,  0.66371758,  0.88092605,  0.88326529])
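
The scores above are accuracies, while the competition is evaluated on ROC AUC, and a plain `KFold` can place activities of the same person in both the training and the validation fold. A hedged sketch of an alternative setup follows, using `scoring='roc_auc'` together with `GroupKFold` so that each group stays within one fold. The data is synthetic and `groups_toy` merely stands in for `people_id`; this is not a result from the real data set.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

# Synthetic stand-in data: 200 rows, 3 features, 20 groups playing the role of people_id
rng = np.random.RandomState(0)
n = 200
X_toy = rng.rand(n, 3)
y_toy = (X_toy[:, 0] + 0.1 * rng.randn(n) > 0.5).astype(int)
groups_toy = rng.randint(0, 20, size=n)

model = ExtraTreesClassifier(n_estimators=50, random_state=0)

# GroupKFold never splits a group across train and validation folds
scores = cross_val_score(
    model,
    X_toy,
    y_toy,
    groups=groups_toy,
    scoring="roc_auc",
    cv=GroupKFold(n_splits=5),
)
print(scores)
```

Grouping by person matters here because most people in the training set have activities with a single outcome value, so a model can score well simply by recognising people it has already seen.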

In [41]:
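model_xgb3 = xgboost.XGBClassifier(n_estimators=100, max_depth=10)
# %time must share a line with the statement it times; the bare %time used originally
# explains the microsecond timing recorded below
%time Xx3 = model_xgb3.fit(mergeX, mergey)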

CPU times: user 3 µs, sys: 1 µs, total: 4 µs
Wall time: 6.91 µs


In [42]:
for ix in range(len(mergeX.columns)):
    print (mergeX.columns[ix], Xx3.feature_importances_[ix])

date_x 0.130113
activity_category 0.00766686
char_1_x 0.00295051
char_2_x 0.00201171
char_3_x 0.00167643
char_4_x 0.00113997
char_5_x 0.0014082
char_6_x 0.000827037
char_7_x 0.000894095
char_8_x 0.00154231
char_9_x 0.00225759
char_10_x 0.0123385
char_1_y 0.00918682
group_1 0.235013
char_2_y 0.00221288
date_y 0.100809
char_3_y 0.0462694
char_4_y 0.0371049
char_5_y 0.0271805
char_6_y 0.0335956
char_7_y 0.0691806
char_8_y 0.030891
char_9_y 0.0293039
char_10_y 0.00657159
char_11 0.00605749
char_12 0.00507399
char_13 0.00677277
char_14 0.0056775
char_15 0.00569985
char_16 0.00402343
char_17 0.00534221
char_18 0.00502928
char_19 0.00364344
char_20 0.0037999
char_21 0.00359873
char_22 0.00384461
char_23 0.00444812
char_24 0.00558809
char_25 0.00858331
char_26 0.00476105
char_27 0.00464929
char_28 0.00151996
char_29 0.00543162
char_30 0.00397872
char_31 0.00596808
char_32 0.00556574
char_33 0.00362108
char_34 0.00518575
char_35 0.00527516
char_36 0.00315168
char_37 0.00301757
char_38 0.0785462

In [None]:
cross_val_score(forest, mergeX, mergey, scoring='accuracy',cv=cv)

In [32]:
act_null_chr10 = act_Train[act_Train['char_10'].isnull()]
act_null_chr10

Unnamed: 0,people_id,activity_id,date,activity_category,char_1,char_2,char_3,char_4,char_5,char_6,char_7,char_8,char_9,char_10,outcome
52,ppl_100025,act1_9923,2022-11-25,type 1,type 3,type 5,type 1,type 1,type 6,type 3,type 3,type 6,type 8,,0
105,ppl_100033,act1_198174,2022-07-26,type 1,type 36,type 11,type 5,type 1,type 6,type 1,type 1,type 4,type 1,,0
106,ppl_100033,act1_214090,2023-06-15,type 1,type 24,type 6,type 6,type 3,type 1,type 3,type 4,type 5,type 1,,0
107,ppl_100033,act1_230588,2023-02-28,type 1,type 2,type 2,type 3,type 3,type 5,type 2,type 2,type 4,type 2,,0
108,ppl_100033,act1_271874,2022-07-26,type 1,type 2,type 5,type 3,type 2,type 6,type 1,type 1,type 6,type 8,,0
124,ppl_100035,act1_104259,2023-07-28,type 1,type 5,type 2,type 7,type 3,type 1,type 3,type 5,type 4,type 7,,1
125,ppl_100035,act1_188526,2023-02-03,type 1,type 5,type 2,type 8,type 3,type 1,type 2,type 6,type 9,type 13,,1
126,ppl_100035,act1_212220,2023-02-02,type 1,type 3,type 2,type 8,type 3,type 1,type 2,type 3,type 9,type 13,,1
127,ppl_100035,act1_313621,2023-02-03,type 1,type 5,type 2,type 8,type 3,type 1,type 2,type 2,type 9,type 13,,1
128,ppl_100035,act1_336085,2023-02-03,type 1,type 5,type 2,type 8,type 3,type 1,type 2,type 2,type 9,type 13,,1


In [33]:
act_notnull_chr10 = act_Train[act_Train['char_10'].notnull()]
act_notnull_chr10

Unnamed: 0,people_id,activity_id,date,activity_category,char_1,char_2,char_3,char_4,char_5,char_6,char_7,char_8,char_9,char_10,outcome
0,ppl_100,act2_1734928,2023-08-26,type 4,,,,,,,,,,type 76,0
1,ppl_100,act2_2434093,2022-09-27,type 2,,,,,,,,,,type 1,0
2,ppl_100,act2_3404049,2022-09-27,type 2,,,,,,,,,,type 1,0
3,ppl_100,act2_3651215,2023-08-04,type 2,,,,,,,,,,type 1,0
4,ppl_100,act2_4109017,2023-08-26,type 2,,,,,,,,,,type 1,0
5,ppl_100,act2_898576,2023-08-04,type 4,,,,,,,,,,type 1727,0
6,ppl_100002,act2_1233489,2022-11-23,type 2,,,,,,,,,,type 1,1
7,ppl_100002,act2_1623405,2022-11-23,type 2,,,,,,,,,,type 1,1
8,ppl_100003,act2_1111598,2023-02-07,type 2,,,,,,,,,,type 1,1
9,ppl_100003,act2_1177453,2023-06-28,type 2,,,,,,,,,,type 1,1
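
The two tables suggest a pattern: rows with a missing `char_10` are type 1 activities with `char_1` to `char_9` filled in, while rows with `char_10` present have the other characteristics empty. One way to check such a pattern is to compute the share of missing values per column within each activity category. The sketch below does this on a toy frame whose values are taken from the rows shown above; it is only an illustration, not a result for the full data.

```python
import pandas as pd

# Toy frame mirroring the structure of the rows displayed above
toy = pd.DataFrame({
    "activity_category": ["type 1", "type 1", "type 2", "type 4"],
    "char_1": ["type 3", "type 36", None, None],
    "char_10": [None, None, "type 1", "type 76"],
})

# Share of missing values per column, within each activity category
null_share = (
    toy.drop(columns="activity_category")
    .isnull()
    .groupby(toy["activity_category"])
    .mean()
)
print(null_share)
```

A share of exactly 1.0 or 0.0 in every cell would indicate that missingness is fully determined by the activity category, in which case the `'type 0'` fill value carries no information beyond the category itself.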
