# Classification

## Team Name
>### Sigma  

## Team Members
>### 조현윤, 이상협, 정하연  

## Objective
> ### Red Hat is in search of better methods of using its behavioral data to predict which individuals it should approach, and even when and how to approach them.
> ### The goal is to create a classification algorithm that accurately identifies which customers have the most potential business value for Red Hat based on their characteristics and activities.
> ### Concretely, the task is to predict the potential business value of a person who has performed a specific activity.

## Evaluation
> ### Submissions are evaluated on the area under the ROC curve between the predicted and the observed outcome.
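
To make the metric concrete, here is a minimal sketch that computes the area under the ROC curve as the fraction of (positive, negative) pairs that the scores rank correctly. The function name `roc_auc` and the labels and scores are illustrative assumptions, not competition data; scikit-learn's `roc_auc_score` computes the same quantity.

```python
def roc_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Illustrative values only (hypothetical outcomes and predicted probabilities).
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(y_true, y_score)
print(auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```

Because only the ranking of the scores matters, the predicted probabilities do not need to be calibrated to score well on this metric.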

## Submission File
> ### For each activity_id in the test set, you must predict a probability for the 'outcome' variable, represented by a number between 0 and 1.
~~~~
activity_id,outcome
act1_1,0
act1_100006,0
act1_100050,0
~~~~
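
A file in this format can be produced with pandas. The sketch below is only an illustration: the activity ids are taken from the sample above, while the probabilities are made up, and the output goes to an in-memory buffer rather than a real file.

```python
import io

import pandas as pd

# Hypothetical predictions: ids from the sample above, made-up probabilities.
activity_ids = ['act1_1', 'act1_100006', 'act1_100050']
probabilities = [0.12, 0.87, 0.5]

submission = pd.DataFrame({'activity_id': activity_ids, 'outcome': probabilities})

# index=False keeps the row index out of the file, so the header is exactly
# "activity_id,outcome". Pass a file path instead of the buffer to write to disk.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```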

## Data
> ### The competition uses two separate data files that may be joined together to create a single, unified data table: a people file and an activity file.
> ### The people file contains all of the unique people (and the corresponding characteristics) that have performed activities over time. Each row in the people file represents a unique person. Each person has a unique people_id.
> ### The activity file contains all of the unique activities (and the corresponding activity characteristics) that each person has performed over time. Each row in the activity file represents a unique activity performed by a person on a certain date. Each activity has a unique activity_id.
> ### The activity file contains several different categories of activities. 
>> Type 1 activities are different from type 2-7 activities because there are more known characteristics associated with type 1 activities (nine in total) than type 2-7 activities (which have only one associated characteristic).
> ### To develop a predictive model with this data, you will likely need to join the files together into a single data set. The two files can be joined together using people_id as the common key. All variables are categorical, with the exception of 'char_38' in the people file, which is a continuous numerical variable.
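
The join described above can be sketched on toy data. The two small frames below are made-up stand-ins for the activity and people files, and `how='left'` and `validate='many_to_one'` are choices made for this sketch (the notebook itself calls `pd.merge` with the defaults). The key is the `people_id` column that the code in this notebook uses.

```python
import pandas as pd

# Toy stand-ins for the activity and people files (all values are made up).
activities = pd.DataFrame({
    'people_id': ['ppl_1', 'ppl_1', 'ppl_2'],
    'activity_id': ['act2_10', 'act2_11', 'act1_12'],
    'char_1': [None, None, 'type 3'],
})
persons = pd.DataFrame({
    'people_id': ['ppl_1', 'ppl_2'],
    'char_1': ['type 2', 'type 1'],
    'char_38': [36, 76],
})

# Many-to-one join: every activity row picks up its person's characteristics.
# validate= raises an error if people_id is not unique in the right-hand frame.
# Column names present in both frames get the suffixes _x (left) and _y (right).
merged = pd.merge(activities, persons, on='people_id',
                  how='left', validate='many_to_one')
print(merged.columns.tolist())
```

The `_x` and `_y` suffixes explain the column names such as `char_1_x` and `char_1_y` that appear in the merged training table later in the notebook.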

## Reference 
[Kaggle: Predicting Red Hat Business Value](https://www.kaggle.com/c/predicting-red-hat-business-value)

### Load Python Packages

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
from datetime import datetime
from datetime import date
import seaborn as sns
import statsmodels.api as sm
import statsmodels.stats.api as sms
import statsmodels.stats.stattools as stools
import scipy as sp
%matplotlib inline

## Exploratory Data Analysis (EDA)

## Load Data Set

In [2]:
# activity data set
act_Train = pd.read_csv('./data/act_train.csv')
act_Test = pd.read_csv('./data/act_test.csv')
# people data set
people = pd.read_csv('./data/people.csv')

### Split the people data set into train and test subsets

In [3]:
# Unique people_id values that appear in the train and test activity files
idx_train = act_Train['people_id'].unique()
idx_test = act_Test['people_id'].unique()

In [4]:
train_people = people.loc[people['people_id'].isin(idx_train)]
test_people = people.loc[people['people_id'].isin(idx_test)]

In [5]:
train_people.to_csv('./data/act_train_people.csv',index=False)
test_people.to_csv('./data/act_test_people.csv',index=False)

In [6]:
print('Number of active people: {}'.format(act_Train['people_id'].nunique()))

Number of active people: 151295


In [7]:
print('Number of active people: {}'.format(act_Test['people_id'].nunique()))

Number of active people: 37823


In [8]:
dfx = act_Train.groupby(['people_id','outcome']).size().unstack()
dfx = dfx.fillna(0).astype(int)

In [11]:
only1 = dfx[(dfx[0]==0) & (dfx[1]!=0)]
only1

people_id,outcome=0,outcome=1
ppl_100002,0,2
ppl_100003,0,34
ppl_100006,0,3
ppl_100013,0,5
ppl_100019,0,2
ppl_100035,0,54
ppl_100040,0,4
ppl_100043,0,1
ppl_100049,0,9
ppl_100050,0,12


In [12]:
only0 = dfx[(dfx[0]!=0) & (dfx[1]==0)]
only0

people_id,outcome=0,outcome=1
ppl_100,6,0
ppl_100025,46,0
ppl_100028,3,0
ppl_100029,1,0
ppl_100032,3,0
ppl_100033,19,0
ppl_100042,3,0
ppl_100045,20,0
ppl_100047,1,0
ppl_10005,3,0


In [14]:
mix_0or1 = dfx[(dfx[0]!=0) & (dfx[1]!=0)]
mix_0or1

people_id,outcome=0,outcome=1
ppl_10006,1,10
ppl_100075,57,2
ppl_100145,2,36
ppl_100297,3,13
ppl_100324,14,67
ppl_100382,3,12
ppl_100387,51,399
ppl_10041,20,4
ppl_100451,8,5
ppl_100510,26,21


In [16]:
print (len(only1.index),len(only0.index),len(mix_0or1))

62115 82524 6656
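
As a quick sanity check on the three counts above, the arithmetic below confirms that the groups add up to the number of people in the training set and shows how many people have a single outcome value across all of their activities. The variable names are illustrative; the numbers are copied from the output above.

```python
# Counts printed above: people with only outcome 1, only outcome 0, and both.
only1_count, only0_count, mixed_count = 62115, 82524, 6656

total = only1_count + only0_count + mixed_count
single_outcome_share = (only1_count + only0_count) / total
mixed_share = mixed_count / total

print(total)                            # 151295, the number of people in act_Train
print(round(single_outcome_share, 3))   # 0.956
print(round(mixed_share, 3))            # 0.044
```

Roughly 96% of people have the same outcome on every activity, which suggests (without proving it) that person-level information carries much of the signal.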


In [18]:
act_null_chr10 = act_Train[act_Train['char_10'].isnull()]
act_null_chr10

index,people_id,activity_id,date,activity_category,char_1,char_2,char_3,char_4,char_5,char_6,char_7,char_8,char_9,char_10,outcome
52,ppl_100025,act1_9923,2022-11-25,type 1,type 3,type 5,type 1,type 1,type 6,type 3,type 3,type 6,type 8,,0
105,ppl_100033,act1_198174,2022-07-26,type 1,type 36,type 11,type 5,type 1,type 6,type 1,type 1,type 4,type 1,,0
106,ppl_100033,act1_214090,2023-06-15,type 1,type 24,type 6,type 6,type 3,type 1,type 3,type 4,type 5,type 1,,0
107,ppl_100033,act1_230588,2023-02-28,type 1,type 2,type 2,type 3,type 3,type 5,type 2,type 2,type 4,type 2,,0
108,ppl_100033,act1_271874,2022-07-26,type 1,type 2,type 5,type 3,type 2,type 6,type 1,type 1,type 6,type 8,,0
124,ppl_100035,act1_104259,2023-07-28,type 1,type 5,type 2,type 7,type 3,type 1,type 3,type 5,type 4,type 7,,1
125,ppl_100035,act1_188526,2023-02-03,type 1,type 5,type 2,type 8,type 3,type 1,type 2,type 6,type 9,type 13,,1
126,ppl_100035,act1_212220,2023-02-02,type 1,type 3,type 2,type 8,type 3,type 1,type 2,type 3,type 9,type 13,,1
127,ppl_100035,act1_313621,2023-02-03,type 1,type 5,type 2,type 8,type 3,type 1,type 2,type 2,type 9,type 13,,1
128,ppl_100035,act1_336085,2023-02-03,type 1,type 5,type 2,type 8,type 3,type 1,type 2,type 2,type 9,type 13,,1


In [19]:
act_notnull_chr10 = act_Train[act_Train['char_10'].notnull()]
act_notnull_chr10

index,people_id,activity_id,date,activity_category,char_1,char_2,char_3,char_4,char_5,char_6,char_7,char_8,char_9,char_10,outcome
0,ppl_100,act2_1734928,2023-08-26,type 4,,,,,,,,,,type 76,0
1,ppl_100,act2_2434093,2022-09-27,type 2,,,,,,,,,,type 1,0
2,ppl_100,act2_3404049,2022-09-27,type 2,,,,,,,,,,type 1,0
3,ppl_100,act2_3651215,2023-08-04,type 2,,,,,,,,,,type 1,0
4,ppl_100,act2_4109017,2023-08-26,type 2,,,,,,,,,,type 1,0
5,ppl_100,act2_898576,2023-08-04,type 4,,,,,,,,,,type 1727,0
6,ppl_100002,act2_1233489,2022-11-23,type 2,,,,,,,,,,type 1,1
7,ppl_100002,act2_1623405,2022-11-23,type 2,,,,,,,,,,type 1,1
8,ppl_100003,act2_1111598,2023-02-07,type 2,,,,,,,,,,type 1,1
9,ppl_100003,act2_1177453,2023-06-28,type 2,,,,,,,,,,type 1,1


In [20]:
actX = act_Train.drop(['people_id','activity_id','date','outcome'],axis = 1)
acty = act_Train['outcome']

In [21]:
from sklearn.preprocessing import LabelEncoder

In [23]:
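# Replace missing values with a placeholder label ('type 0'), then map each
# column's string labels to integer codes so the model can consume them.
for idx in actX.columns:
    actX[idx] = actX[idx].fillna('type 0')
    actX[idx] = LabelEncoder().fit_transform(actX[idx])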

In [24]:
import xgboost



In [25]:
model_xgb = xgboost.XGBClassifier(n_estimators=100, max_depth=5)
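# NOTE: a bare `%time` on its own line times an empty statement, so the timings
# printed below do not measure training. To time the fit, put the call on the
# same line: `%time Xx = model_xgb.fit(actX, acty)`.
%time
Xx = model_xgb.fit(actX,acty)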

CPU times: user 2 µs, sys: 0 ns, total: 2 µs
Wall time: 5.01 µs


In [26]:
for name, importance in zip(actX.columns, Xx.feature_importances_):
    print(name, importance)

activity_category 0.0900398
char_1 0.0545817
char_2 0.0565737
char_3 0.0342629
char_4 0.0227092
char_5 0.0282869
char_6 0.0175299
char_7 0.0211155
char_8 0.0266932
char_9 0.0390438
char_10 0.609163


In [28]:
trainMerge = pd.merge(act_Train,people, on='people_id')
trainMerge.tail()

index,people_id,activity_id,date_x,activity_category,char_1_x,char_2_x,char_3_x,char_4_x,char_5_x,char_6_x,...,char_29,char_30,char_31,char_32,char_33,char_34,char_35,char_36,char_37,char_38
2197286,ppl_99994,act2_4668076,2023-06-16,type 4,,,,,,,...,True,True,True,True,False,True,True,True,True,95
2197287,ppl_99994,act2_4743548,2023-03-30,type 4,,,,,,,...,True,True,True,True,False,True,True,True,True,95
2197288,ppl_99994,act2_536973,2023-01-19,type 2,,,,,,,...,True,True,True,True,False,True,True,True,True,95
2197289,ppl_99994,act2_688656,2023-05-02,type 4,,,,,,,...,True,True,True,True,False,True,True,True,True,95
2197290,ppl_99994,act2_715089,2023-06-15,type 2,,,,,,,...,True,True,True,True,False,True,True,True,True,95


In [29]:
trainMerge[trainMerge['date_x'] ==trainMerge['date_y']].groupby('outcome').count()

outcome,people_id,activity_id,date_x,activity_category,char_1_x,char_2_x,char_3_x,char_4_x,char_5_x,char_6_x,...,char_29,char_30,char_31,char_32,char_33,char_34,char_35,char_36,char_37,char_38
0,83034,83034,83034,83034,31196,31196,31196,31196,31196,31196,...,83034,83034,83034,83034,83034,83034,83034,83034,83034,83034
1,49239,49239,49239,49239,16178,16178,16178,16178,16178,16178,...,49239,49239,49239,49239,49239,49239,49239,49239,49239,49239
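
The two row counts above can be turned into an outcome rate for activities whose date equals the date in the people file. This is plain arithmetic on the numbers in the table; the variable names are illustrative.

```python
# Row counts from the table above (activities where date_x == date_y).
same_date_outcome0 = 83034
same_date_outcome1 = 49239

same_date_total = same_date_outcome0 + same_date_outcome1
positive_rate = same_date_outcome1 / same_date_total

print(same_date_total)           # 132273
print(round(positive_rate, 3))   # 0.372
```

About 37% of these same-date activities have outcome 1.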


In [30]:
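# Check that no column already uses 'type 0' as a value, so it can safely
# serve as the placeholder for missing values. Nothing is printed if it is unused.
for idx in trainMerge.columns:
    if 'type 0' in list(trainMerge[idx].unique()):
        print(idx, 'type 0')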

In [31]:
mergeX = trainMerge.drop(['people_id','activity_id','date_x','date_y','outcome'],axis = 1)
mergey = trainMerge['outcome']

In [32]:
for idx in mergeX.columns:
    mergeX[idx] = mergeX[idx].fillna('type 0')
    mergeX[idx] = LabelEncoder().fit_transform(mergeX[idx])

In [33]:
model_xgb = xgboost.XGBClassifier(n_estimators=100, max_depth=2)
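# NOTE: a bare `%time` on its own line times an empty statement, so the timings
# printed below do not measure training. To time the fit, put the call on the
# same line: `%time Xx = model_xgb.fit(mergeX, mergey)`.
%time
Xx = model_xgb.fit(mergeX,mergey)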

CPU times: user 3 µs, sys: 9 µs, total: 12 µs
Wall time: 32.2 µs


In [34]:
for name, importance in zip(mergeX.columns, Xx.feature_importances_):
    print(name, importance)

activity_category 0.0
char_1_x 0.0
char_2_x 0.0
char_3_x 0.0
char_4_x 0.0
char_5_x 0.0
char_6_x 0.0
char_7_x 0.0
char_8_x 0.0
char_9_x 0.0
char_10_x 0.0
char_1_y 0.0433333
group_1 0.2
char_2_y 0.0933333
char_3_y 0.0233333
char_4_y 0.00333333
char_5_y 0.0
char_6_y 0.226667
char_7_y 0.14
char_8_y 0.00666667
char_9_y 0.0133333
char_10_y 0.0
char_11 0.0
char_12 0.0
char_13 0.00666667
char_14 0.0
char_15 0.0
char_16 0.0
char_17 0.0
char_18 0.0
char_19 0.0
char_20 0.0
char_21 0.0
char_22 0.0
char_23 0.0
char_24 0.00666667
char_25 0.0
char_26 0.0
char_27 0.0
char_28 0.0
char_29 0.00333333
char_30 0.0
char_31 0.0
char_32 0.0
char_33 0.0
char_34 0.00666667
char_35 0.0
char_36 0.00666667
char_37 0.00333333
char_38 0.216667
