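**Using one year of purchase data from department store customers, work through:**
- Data preprocessing
- Feature engineering
- Modeling (with a classification algorithm)
- Hyperparameter tuning (hyperparameter optimization)
- Model ensembling
- CSV submission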

**Notes**
- Submit code that produces a file named after your exam number (수험번호.csv)
- The submitted model is scored with the ROC-AUC metric
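Because grading uses ROC-AUC, the submission should contain predicted probabilities rather than hard 0/1 labels. The following is a minimal sketch on made-up toy values (`y_true` and `proba` are illustrative, not taken from the dataset) showing how thresholding can lower the score:

```python
from sklearn.metrics import roc_auc_score

# Toy ground truth and predicted probabilities (illustrative values only)
y_true = [0, 0, 1, 1, 1, 0]
proba = [0.1, 0.4, 0.35, 0.8, 0.45, 0.6]

# Hard labels obtained by thresholding the probabilities at 0.5
labels = [int(p >= 0.5) for p in proba]

# ROC-AUC measures how well the scores rank positives above negatives
auc_proba = roc_auc_score(y_true, proba)
auc_labels = roc_auc_score(y_true, labels)

print(f"AUC from probabilities: {auc_proba:.3f}")
print(f"AUC from hard labels:   {auc_labels:.3f}")
```

On this toy data the thresholded labels lose ranking information, which is why the baseline further down submits `predict_proba` output.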

**Data source and links**
- Data source: https://www.dataq.or.kr/ - Notices - post no. 759, "제2회 빅데이터분석기사 실기 안내" (guide to the 2nd Big Data Analysis Engineer practical exam) - attachment

**Dataset upload**
- Uploading a private dataset: https://youtu.be/BZlEQ5JwLiA
- Datasets - New Dataset - (drag & drop) - Create / must be set to Private
- Task type 2 example: https://youtu.be/_GIBVt5-khk
- The code below is a baseline example

In [1]:
import pandas as pd

# The CSV files are read as EUC-KR (Korean column names and category values)
x = pd.read_csv('../../dataset/t2-ex/X_train.csv', encoding='euc-kr')
y = pd.read_csv('../../dataset/t2-ex/y_train.csv', encoding='euc-kr')
test = pd.read_csv('../../dataset/t2-ex/X_test.csv', encoding='euc-kr')
x.head()

Unnamed: 0,cust_id,총구매액,최대구매액,환불금액,주구매상품,주구매지점,내점일수,내점당구매건수,주말방문비율,구매주기
0,0,68282840,11264000,6860000.0,기타,강남점,19,3.894737,0.527027,17
1,1,2136000,2136000,300000.0,스포츠,잠실점,2,1.5,0.0,1
2,2,3197000,1639000,,남성 캐주얼,관악점,2,2.0,0.0,1
3,3,16077620,4935000,,기타,광주점,18,2.444444,0.318182,16
4,4,29050000,24000000,,보석,본 점,2,1.5,0.0,85


---
## My Code

---
## Solution

##### Quick EDA

In [2]:
x.shape, y.shape, test.shape

((3500, 10), (3500, 2), (2482, 10))

In [3]:
y.head()

Unnamed: 0,cust_id,gender
0,0,0
1,1,0
2,2,1
3,3,1
4,4,0


In [4]:
x.isnull().sum()

cust_id       0
총구매액          0
최대구매액         0
환불금액       2295
주구매상품         0
주구매지점         0
내점일수          0
내점당구매건수       0
주말방문비율        0
구매주기          0
dtype: int64

In [5]:
x.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3500 entries, 0 to 3499
Data columns (total 10 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   cust_id  3500 non-null   int64  
 1   총구매액     3500 non-null   int64  
 2   최대구매액    3500 non-null   int64  
 3   환불금액     1205 non-null   float64
 4   주구매상품    3500 non-null   object 
 5   주구매지점    3500 non-null   object 
 6   내점일수     3500 non-null   int64  
 7   내점당구매건수  3500 non-null   float64
 8   주말방문비율   3500 non-null   float64
 9   구매주기     3500 non-null   int64  
dtypes: float64(3), int64(5), object(2)
memory usage: 273.6+ KB


In [6]:
y['gender'].value_counts()

0    2184
1    1316
Name: gender, dtype: int64
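The counts above show a moderate class imbalance. As a small sketch, the same counts can be expressed as proportions with `value_counts(normalize=True)`; the Series below is rebuilt by hand from the printed counts rather than read from the file:

```python
import pandas as pd

# Rebuild a label Series with the same class counts as y['gender'] above
gender = pd.Series([0] * 2184 + [1] * 1316, name="gender")

# normalize=True returns class proportions instead of raw counts
ratio = gender.value_counts(normalize=True)
print(ratio.round(3))
```

With the real data, `y['gender'].value_counts(normalize=True)` would give the same view directly.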

##### Data preprocessing

In [7]:
# Only 환불금액 (refund amount) has missing values; treat a missing refund as 0
x = x.fillna(0)
test = test.fillna(0)
x.isnull().sum()

cust_id    0
총구매액       0
최대구매액      0
환불금액       0
주구매상품      0
주구매지점      0
내점일수       0
내점당구매건수    0
주말방문비율     0
구매주기       0
dtype: int64
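Filling with 0 removes the distinction between "no refund recorded" and an actual amount. One possible extension, not part of the baseline, is to keep that information in an indicator column before imputing. In the sketch below the toy values come from the `x.head()` output, and the column name `has_refund` is a hypothetical choice:

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the 환불금액 column (values from x.head() above)
toy = pd.DataFrame({"환불금액": [6860000.0, 300000.0, np.nan, np.nan, np.nan]})

# Hypothetical extra feature: 1 if a refund amount was recorded, else 0
toy["has_refund"] = toy["환불금액"].notna().astype(int)

# Same imputation as the baseline
toy["환불금액"] = toy["환불금액"].fillna(0)
print(toy)
```

Whether the indicator helps would need to be checked on validation data.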

In [8]:
# cust_id is an identifier, not a feature; test_id is kept for the submission file
x_id = x.pop('cust_id')
test_id = test.pop('cust_id')

##### Feature engineering

In [9]:
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()

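# Encode train and test together so both share the same label mapping
df = pd.concat([x, test])

# Categorical (object) columns: 주구매상품, 주구매지점
cols = df.select_dtypes(include='object').columns

# fit_transform refits the single encoder on each column in turn
for col in cols:
    df[col] = encoder.fit_transform(df[col])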


df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5982 entries, 0 to 2481
Data columns (total 9 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   총구매액     5982 non-null   int64  
 1   최대구매액    5982 non-null   int64  
 2   환불금액     5982 non-null   float64
 3   주구매상품    5982 non-null   int64  
 4   주구매지점    5982 non-null   int64  
 5   내점일수     5982 non-null   int64  
 6   내점당구매건수  5982 non-null   float64
 7   주말방문비율   5982 non-null   float64
 8   구매주기     5982 non-null   int64  
dtypes: float64(3), int64(6)
memory usage: 467.3 KB
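The loop above reuses one `LabelEncoder`, so after it finishes only the mapping of the last column is still available. A sketch of an alternative that keeps one fitted encoder per column is shown here; the toy frames are built from category values seen in `x.head()`, and the `encoders` dictionary is an illustrative name:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy train/test frames; the toy test set contains a store ('본 점')
# that does not appear in the toy training set
train = pd.DataFrame({"주구매상품": ["기타", "스포츠", "보석"],
                      "주구매지점": ["강남점", "잠실점", "광주점"]})
test = pd.DataFrame({"주구매상품": ["보석", "기타"],
                     "주구매지점": ["본 점", "강남점"]})

encoders = {}  # one fitted encoder per categorical column
for col in train.columns:
    enc = LabelEncoder()
    # Fit on the union of train and test values so no label is unseen
    enc.fit(pd.concat([train[col], test[col]]))
    train[col] = enc.transform(train[col])
    test[col] = enc.transform(test[col])
    encoders[col] = enc

# The stored encoder can map integer codes back to the original labels
decoded = encoders["주구매지점"].inverse_transform(test["주구매지점"])
print(list(decoded))
```

Fitting on the combined values mirrors the `pd.concat([x, test])` step of the baseline, which avoids errors from categories that occur only in the test set.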


In [10]:
# Split the combined frame back into train and test by row position
x = df[:x.shape[0]]
test = df[x.shape[0]:]
x.shape, test.shape

((3500, 9), (2482, 9))

In [11]:
x.head()

Unnamed: 0,총구매액,최대구매액,환불금액,주구매상품,주구매지점,내점일수,내점당구매건수,주말방문비율,구매주기
0,68282840,11264000,6860000.0,5,0,19,3.894737,0.527027,17
1,2136000,2136000,300000.0,21,19,2,1.5,0.0,1
2,3197000,1639000,0.0,6,1,2,2.0,0.0,1
3,16077620,4935000,0.0,5,2,18,2.444444,0.318182,16
4,29050000,24000000,0.0,15,8,2,1.5,0.0,85


##### Modeling & hyperparameter tuning

In [12]:
from sklearn.ensemble import RandomForestClassifier

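model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=2022)
model.fit(x, y['gender'])
# Note: score() here is accuracy on the training data, not the ROC-AUC used for grading
print(model.score(x, y['gender']))
# Class probabilities for the test set; column 1 is P(gender == 1)
pred = model.predict_proba(test)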

0.6874285714285714
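The value printed above is accuracy on the rows the model was trained on, so it says little about the graded metric. A minimal sketch of a hold-out estimate of ROC-AUC follows; it assumes a synthetic stand-in from `make_classification` in place of `x` and `y['gender']`, so the number it prints does not reflect the real data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for (x, y['gender']): 3500 rows, 9 features
X, y = make_classification(n_samples=3500, n_features=9, random_state=2022)

# Hold out 20% of the rows, keeping the class ratio in both parts
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=2022
)

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=2022)
model.fit(X_tr, y_tr)

# ROC-AUC is computed from the positive-class probabilities
val_proba = model.predict_proba(X_val)[:, 1]
auc = roc_auc_score(y_val, val_proba)
print(f"validation ROC-AUC: {auc:.4f}")
```

With the notebook's data, the split would be applied to `x` and `y['gender']` instead of the synthetic arrays.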


In [13]:
pred[:,1]

array([0.43567157, 0.19725558, 0.17732635, ..., 0.43703219, 0.36002886,
       0.54383742])
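These probabilities come from hyperparameters that were fixed by hand (`n_estimators=100`, `max_depth=5`). One way to tune them against the graded metric is a cross-validated grid search with `scoring="roc_auc"`. The sketch below again uses synthetic stand-in data, and the grid values are illustrative assumptions rather than recommended settings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for (x, y['gender'])
X, y = make_classification(n_samples=3500, n_features=9, random_state=2022)

# Illustrative grid; the candidate values are assumptions
param_grid = {"max_depth": [3, 5, 7]}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=100, random_state=2022),
    param_grid,
    scoring="roc_auc",  # optimize the same metric used for grading
    cv=3,               # 3-fold cross-validation
)
search.fit(X, y)

print(search.best_params_)
print(f"best cross-validated ROC-AUC: {search.best_score_:.4f}")
```

By default `GridSearchCV` refits the best configuration on all the data, so `search.predict_proba(test)[:, 1]` could then replace the single-model predictions.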

In [14]:
submission = pd.DataFrame({
    'cust_id': test_id,
    'gender': pred[:,1]
})
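The `gender` column above comes from a single random forest. The task outline also lists model ensembling, which the baseline does not cover. A minimal soft-voting sketch is given below, assuming synthetic stand-in data and an arbitrary pairing of a random forest with a scaled logistic regression:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for (x, y['gender'])
X, y = make_classification(n_samples=3500, n_features=9, random_state=2022)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=2022
)

rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=2022)
# Logistic regression benefits from scaled inputs, hence the pipeline
lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# voting='soft' averages the class probabilities predicted by the members
ensemble = VotingClassifier(estimators=[("rf", rf), ("lr", lr)], voting="soft")
ensemble.fit(X_tr, y_tr)

ens_proba = ensemble.predict_proba(X_val)[:, 1]
ens_auc = roc_auc_score(y_val, ens_proba)
print(f"ensemble validation ROC-AUC: {ens_auc:.4f}")
```

Whether an ensemble beats the single model on the real data would have to be confirmed on a validation split.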

In [15]:
submission.head()

Unnamed: 0,cust_id,gender
0,3500,0.435672
1,3501,0.197256
2,3502,0.177326
3,3503,0.420662
4,3504,0.484252


In [16]:
submission.to_csv('t2-ex.csv', index=False)

In [17]:
pd.read_csv('t2-ex.csv').head()

Unnamed: 0,cust_id,gender
0,3500,0.435672
1,3501,0.197256
2,3502,0.177326
3,3503,0.420662
4,3504,0.484252
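The read-back above can be turned into explicit checks before submitting. The sketch below writes a toy submission to a temporary directory; `exam_number` is a hypothetical placeholder, since the notes only say the file must be named after the exam number:

```python
import os
import tempfile

import pandas as pd

# Toy submission using the first rows shown in submission.head() above
submission = pd.DataFrame({
    "cust_id": [3500, 3501, 3502],
    "gender": [0.435672, 0.197256, 0.177326],
})

# Hypothetical exam number; replace with the real one
exam_number = "000000"
path = os.path.join(tempfile.mkdtemp(), f"{exam_number}.csv")

# index=False keeps the DataFrame index out of the file
submission.to_csv(path, index=False)

# Read the file back and check its format
check = pd.read_csv(path)
assert list(check.columns) == ["cust_id", "gender"]
assert check["gender"].between(0, 1).all()
print(check.shape)
```

With the full test set, the row count could also be compared against `len(test_id)`.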
