#### Service churn prediction data
Predict whether a customer will leave the company's service from customer profile data (target variable: Exited)

In [144]:
import pandas as pd
# Load the data
x_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/churnk/X_train.csv")
y_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/churnk/y_train.csv")
x_test= pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/churnk/X_test.csv")


display(x_train.head())
display(y_train.head())

Unnamed: 0,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary
0,15799217,Zetticci,791,Germany,Female,35,7,52436.2,1,1,0,161051.75
1,15748986,Bischof,705,Germany,Male,42,8,166685.92,2,1,1,55313.51
2,15722004,Hsiung,543,France,Female,31,4,138317.94,1,0,0,61843.73
3,15780966,Pritchard,709,France,Female,32,2,0.0,2,0,0,109681.29
4,15636731,Ts'ai,714,Germany,Female,36,1,101609.01,2,1,1,447.73


Unnamed: 0,CustomerId,Exited
0,15799217,0
1,15748986,0
2,15722004,0
3,15780966,0
4,15636731,0


In [145]:
display(x_test.head())

Unnamed: 0,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary
0,15601012,Abdullah,802,France,Female,60,3,92887.06,1,1,0,39473.63
1,15734762,Ignatiev,602,France,Female,56,3,115895.22,3,1,0,4176.17
2,15586757,Anenechukwu,801,France,Female,32,4,75170.54,1,1,1,37898.5
3,15590888,Wade,693,Spain,Female,34,10,107556.06,2,0,0,154631.35
4,15726087,Ch'in,592,France,Female,62,5,0.0,1,1,1,100941.57


In [146]:
len(x_train), len(x_test), len(x_train)+len(x_test)

(6499, 3501, 10000)

In [147]:
# Data Preprocessing

# Check for missing values
x_train.isnull().sum()
x_test.isnull().sum()

# Combine train and test so both receive identical preprocessing
x_data = pd.concat([x_train, x_test], ignore_index=True)

# Encode gender (male: 0, female: 1)
x_data.Gender = x_data.Gender.str.upper()
x_data.Gender = x_data.Gender.map({'MALE':0, ' MALE':0, 'FEMALE':1})

# Bin age into decades
x_data.Age.max() # 92
x_data.Age.min() # 18
x_data['Age'] = x_data.Age.map(lambda x : x//10)

# Bin Balance into steps of 2,500
x_data.Balance.max() # 250898.09
x_data.Balance.min() # 0
x_data['Balance'] = x_data.Balance.map(lambda x : x // 2500)

# Bin EstimatedSalary into steps of 1,999
x_data.EstimatedSalary.max() # 199992.48
x_data.EstimatedSalary.min() # 11.58
x_data.EstimatedSalary = x_data.EstimatedSalary.map(lambda x : x // 1999)

# Encode Geography (country) as integers
x_data.Geography.unique() # Germany, France, Spain
x_data.Geography = x_data.Geography.map({"Germany":0, "France":1, "Spain":2})

# Drop Surname
x_data.drop('Surname', axis=1, inplace=True)
x_data

Unnamed: 0,CustomerId,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary
0,15799217,791,0,1,3,7,20.0,1,1,0,80.0
1,15748986,705,0,0,4,8,66.0,2,1,1,27.0
2,15722004,543,1,1,3,4,55.0,1,0,0,30.0
3,15780966,709,1,1,3,2,0.0,2,0,0,54.0
4,15636731,714,0,1,3,1,40.0,2,1,1,0.0
...,...,...,...,...,...,...,...,...,...,...,...
9995,15733966,496,0,1,5,4,50.0,1,1,1,15.0
9996,15669994,556,0,1,3,1,51.0,2,1,0,62.0
9997,15712403,589,1,1,6,1,0.0,1,1,0,30.0
9998,15643819,714,1,1,2,4,0.0,2,0,0,41.0
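
The gender mapping and the binning steps above can also be written with vectorised pandas operations. The following is a minimal sketch on a hypothetical three-row frame (`toy` and its values are illustrative, not the actual dataset); it assumes that stray whitespace around the gender strings is the only irregularity.

```python
import pandas as pd

# Hypothetical toy rows standing in for the combined train/test frame.
toy = pd.DataFrame({
    "Gender": ["Female", " male", "MALE "],
    "Age": [35, 42, 18],
    "Balance": [52436.2, 166685.92, 0.0],
})

# Strip whitespace and upper-case first, so one dictionary key per category is enough.
toy["Gender"] = toy["Gender"].str.strip().str.upper().map({"MALE": 0, "FEMALE": 1})

# Vectorised floor division produces the same bins as map(lambda x: x // n).
toy["Age"] = toy["Age"] // 10
toy["Balance"] = toy["Balance"] // 2500

print(toy)
```

Stripping before mapping avoids listing every whitespace variant as a separate key, and any value that still fails to match shows up as NaN, which is easy to check with `isnull().sum()`.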


In [148]:
x_train = x_data[:6499]
x_test = x_data[6499:]  # the test rows start right after the 6,499 training rows
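
Hard-coded slice boundaries are easy to get wrong by one row. A minimal sketch, using hypothetical small frames (`train_part`, `test_part`) in place of the real data, derives the boundary from the length of the training frame so that the two slices always cover every row exactly once.

```python
import pandas as pd

# Hypothetical small frames; in the notebook these would be x_train and x_test.
train_part = pd.DataFrame({"v": [1, 2, 3]})
test_part = pd.DataFrame({"v": [4, 5]})

# Remember the boundary before concatenating.
n_train = len(train_part)
combined = pd.concat([train_part, test_part], ignore_index=True)

# Use the same boundary on both sides so no row is dropped or duplicated.
train_back = combined.iloc[:n_train]
test_back = combined.iloc[n_train:]

print(len(train_back), len(test_back))
```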

In [149]:
from sklearn.model_selection import train_test_split
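# Note: train_size=0.1 fits the model on only 10% of the labeled rows; the remaining 90% form the validation set.
X_tr, X_val, y_tr, y_val = train_test_split(x_train, y_train['Exited'], train_size=0.1, random_state=99)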
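
Churn labels are usually imbalanced, so it can help to keep the class ratio the same in both parts of the split. The sketch below shows the `stratify` argument of `train_test_split` on hypothetical toy data with 20% positives; the split ratio of 0.25 is an arbitrary choice for illustration, not the one used in this notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 20% of them positive.
X = pd.DataFrame({"f": range(100)})
y = pd.Series([1] * 20 + [0] * 80)

# stratify=y keeps the share of positives the same in both parts.
X_tr_s, X_val_s, y_tr_s, y_val_s = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

print(len(X_tr_s), len(X_val_s))
print(y_tr_s.mean(), y_val_s.mean())
```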

In [150]:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier()
model.fit(X_tr, y_tr)

In [151]:
predict_train_label = model.predict(X_tr)
predict_train_proba = model.predict_proba(X_tr)[:,1]

predict_validation_label = model.predict(X_val)
predict_validation_proba = model.predict_proba(X_val)[:,1]

In [152]:
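# The evaluation code follows the reference solution.
# Accuracy, f1_score, recall, precision -> use the labels from model.predict
# AUC, or any question that asks for a probability -> use model.predict_proba and take the second column (positive class): model.predict_proba()[:,1]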
from sklearn.metrics import accuracy_score, f1_score, recall_score, roc_auc_score ,precision_score

print('train accuracy :', accuracy_score(y_tr,predict_train_label))
print('validation accuracy :', accuracy_score(y_val,predict_validation_label))
print('\n')
print('train f1_score :', f1_score(y_tr,predict_train_label))
print('validation f1_score :', f1_score(y_val,predict_validation_label))
print('\n')
print('train recall_score :', recall_score(y_tr,predict_train_label))
print('validation recall_score :', recall_score(y_val,predict_validation_label))
print('\n')
print('train precision_score :', precision_score(y_tr,predict_train_label))
print('validation precision_score :', precision_score(y_val,predict_validation_label))
print('\n')
print('train auc :', roc_auc_score(y_tr,predict_train_proba))
print('validation auc :', roc_auc_score(y_val,predict_validation_proba))

train accuracy : 1.0
validation accuracy : 0.8418803418803419


train f1_score : 1.0
validation f1_score : 0.463768115942029


train recall_score : 1.0
validation recall_score : 0.3352891869237217


train precision_score : 1.0
validation precision_score : 0.7518796992481203


train auc : 0.9999999999999999
validation auc : 0.8120891119030362
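
The scores above show perfect training metrics but a validation recall of about 0.34, which means many churners are missed when predicted labels are used directly. One common adjustment, sketched here on synthetic data from `make_classification` (not the churn dataset), is to lower the probability cutoff applied to `predict_proba`. The cutoff of 0.3 is an arbitrary illustrative value, and higher recall usually comes at the cost of lower precision.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (roughly 20% positives), used only for illustration.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.8, 0.2], random_state=0
)
X_a, X_b, y_a, y_b = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_a, y_a)
proba = clf.predict_proba(X_b)[:, 1]  # probability of the positive class

# Compare recall at a 0.5 cutoff with recall at a lower, hypothetical cutoff.
recall_default = recall_score(y_b, (proba >= 0.5).astype(int))
recall_lower = recall_score(y_b, (proba >= 0.3).astype(int))

print('recall at 0.5 :', recall_default)
print('recall at 0.3 :', recall_lower)
```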


In [153]:
# Predict on the test data in the same way
predict_test_label = model.predict(x_test)
predict_test_proba = model.predict_proba(x_test)[:,1]

# If the task asks for accuracy, f1_score, recall, or precision: submit the labels
#pd.DataFrame({'CustomerId': x_test.CustomerId, 'Exited': predict_test_label}).to_csv('003000000.csv', index=False)

# If the task asks for AUC or a probability: submit the probabilities
#pd.DataFrame({'CustomerId': x_test.CustomerId, 'Exited': predict_test_proba}).to_csv('003000000.csv', index=False)
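
Before writing a submission file it is worth confirming that there is exactly one prediction per test customer. A minimal sketch, assuming a hypothetical set of three IDs and probabilities, builds the frame and writes it to an in-memory buffer instead of a real file.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical IDs and predicted probabilities standing in for the real test set.
customer_ids = pd.Series([15601012, 15734762, 15586757], name="CustomerId")
pred_proba = np.array([0.12, 0.80, 0.05])

# One prediction per customer, otherwise the submission is malformed.
assert len(customer_ids) == len(pred_proba)

submission = pd.DataFrame(
    {"CustomerId": customer_ids.to_numpy(), "Exited": pred_proba}
)

# Write to a buffer here; pass a file name to to_csv for a real submission.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
print(buffer.getvalue())
```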

### Solution

In [154]:
import pandas as pd
# Load the data
x_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/churnk/X_train.csv")
y_train = pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/churnk/y_train.csv")
x_test= pd.read_csv("https://raw.githubusercontent.com/Datamanim/datarepo/main/churnk/X_test.csv")


display(x_train.head())
display(y_train.head())

Unnamed: 0,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary
0,15799217,Zetticci,791,Germany,Female,35,7,52436.2,1,1,0,161051.75
1,15748986,Bischof,705,Germany,Male,42,8,166685.92,2,1,1,55313.51
2,15722004,Hsiung,543,France,Female,31,4,138317.94,1,0,0,61843.73
3,15780966,Pritchard,709,France,Female,32,2,0.0,2,0,0,109681.29
4,15636731,Ts'ai,714,Germany,Female,36,1,101609.01,2,1,1,447.73


Unnamed: 0,CustomerId,Exited
0,15799217,0
1,15748986,0
2,15722004,0
3,15780966,0
4,15636731,0


In [155]:
# print(x_train.info())
# print(x_train.nunique())  <- may not work on older pandas versions

drop_col = ['CustomerId','Surname']
x_train_drop = x_train.drop(columns = drop_col)
x_test_drop = x_test.drop(columns = drop_col)

# import sklearn
# print(sklearn.__all__)
# import sklearn.model_selection
# print(dir(sklearn.model_selection))

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
x_train_dummies = pd.get_dummies(x_train_drop)
y = y_train['Exited']


x_test_dummies = pd.get_dummies(x_test_drop)
# Match the train column order (get_dummies already emits columns in a consistent order, so a KeyError here means a dummy column is missing from the test set)
x_test_dummies = x_test_dummies[x_train_dummies.columns]
# print(help(train_test_split))



X_train, X_validation, Y_train, Y_validation = train_test_split(x_train_dummies, y, test_size=0.33, random_state=42)
rf = RandomForestClassifier(random_state =23)
rf.fit(X_train,Y_train)

# import sklearn.metrics
# print(dir(sklearn.metrics))

from sklearn.metrics import accuracy_score , f1_score, recall_score, roc_auc_score ,precision_score

#model_score
predict_train_label = rf.predict(X_train)
predict_train_proba = rf.predict_proba(X_train)[:,1]

predict_validation_label = rf.predict(X_validation)
predict_validation_prob = rf.predict_proba(X_validation)[:,1]


# Check the metric that the task actually asks for
# Accuracy, f1_score, recall, precision -> use the labels from model.predict
# AUC, or any question that asks for a probability -> use model.predict_proba and take the second column (positive class): model.predict_proba()[:,1]
print('train accuracy :', accuracy_score(Y_train,predict_train_label))
print('validation accuracy :', accuracy_score(Y_validation,predict_validation_label))
print('\n')
print('train f1_score :', f1_score(Y_train,predict_train_label))
print('validation f1_score :', f1_score(Y_validation,predict_validation_label))
print('\n')
print('train recall_score :', recall_score(Y_train,predict_train_label))
print('validation recall_score :', recall_score(Y_validation,predict_validation_label))
print('\n')
print('train precision_score :', precision_score(Y_train,predict_train_label))
print('validation precision_score :', precision_score(Y_validation,predict_validation_label))
print('\n')
print('train auc :', roc_auc_score(Y_train,predict_train_proba))
print('validation auc :', roc_auc_score(Y_validation,predict_validation_prob))


# Predict on the test data in the same way
predict_test_label = rf.predict(x_test_dummies)
predict_test_proba = rf.predict_proba(x_test_dummies)[:,1]


# If the task asks for accuracy, f1_score, recall, or precision: submit the labels
#pd.DataFrame({'CustomerId': x_test.CustomerId, 'Exited': predict_test_label}).to_csv('003000000.csv', index=False)

# If the task asks for AUC or a probability: submit the probabilities
#pd.DataFrame({'CustomerId': x_test.CustomerId, 'Exited': predict_test_proba}).to_csv('003000000.csv', index=False)

train accuracy : 1.0
validation accuracy : 0.8652680652680653


train f1_score : 1.0
validation f1_score : 0.5912305516265912


train recall_score : 1.0
validation recall_score : 0.4543478260869565


train precision_score : 1.0
validation precision_score : 0.8461538461538461


train auc : 1.0
validation auc : 0.8497613211198555
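
The solution aligns the test columns to the train columns by direct indexing, which raises a `KeyError` when a category is absent from the test set. A more tolerant variant, sketched below on hypothetical toy frames (`train_raw`, `test_raw`), uses `DataFrame.reindex` with `fill_value=0` so that missing dummy columns are added as all-zero columns.

```python
import pandas as pd

# Hypothetical toy frames: the test set happens to contain only one country.
train_raw = pd.DataFrame(
    {"Geography": ["Germany", "France", "Spain"], "Age": [35, 42, 31]}
)
test_raw = pd.DataFrame({"Geography": ["France", "France"], "Age": [60, 56]})

train_dummies = pd.get_dummies(train_raw)
test_dummies = pd.get_dummies(test_raw)

# reindex enforces the train column order and fills absent dummy columns with 0.
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(test_aligned.columns.tolist())
```

Columns that exist only in the test set are dropped by the same call, which matches what the model expects because it was never trained on them.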
