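# Machine Learning

- Machine learning: the computer learns patterns from data instead of following hand-written rules.
- Data mining: finding rules and patterns in data in order to classify it and predict what will happen next. The computer does this work on its own, and machine learning is one of the ways to do it.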

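## Machine learning libraries in Python

- Scikit Learn (scipy + toolkit)
  - Works smoothly with pandas.
  - Data mining: a methodology for finding patterns or rules in data and turning them into insight that can be applied in practice.
  - Data mining problem-solving processes: KDD / SEMMA / ...
    1. Data preparation
       - Set the target variable (Y), e.g. "sales revenue"
       - X (features): product name, sale date, day of week, ...
    2. Data preprocessing: put the data into a form the computer can read and process
       - Handle missing values and outliers
       - Split into training and evaluation data
    3. Model training
       - Algorithms (regression, Tree, SVM): the "teachers" that do the teaching
    4. Model evaluation
       - Check whether the model has learned well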
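The four steps above can be sketched end to end on a small synthetic sales table. Everything in this block is an assumption made for illustration: the column names (`weekday`, `unit_price`, `revenue`), the generated values, and the choice of `LinearRegression` as the algorithm.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# 1. Data preparation: synthetic sales records (hypothetical columns and values).
rng = np.random.default_rng(0)
n = 200
sales = pd.DataFrame({
    "weekday": rng.integers(0, 7, size=n),           # 0 = Monday ... 6 = Sunday
    "unit_price": rng.uniform(1_000, 5_000, size=n),
})
# Target (Y): revenue grows with price, gets a weekend bonus, plus noise.
sales["revenue"] = (
    3.0 * sales["unit_price"]
    + 2_000 * (sales["weekday"] >= 5)
    + rng.normal(0, 500, size=n)
)
# Inject a few missing values so that step 2 has something to clean.
sales.loc[[3, 17, 42], "unit_price"] = np.nan

# 2. Preprocessing: drop missing rows, one-hot encode weekday, split the data.
clean = sales.dropna()
X = pd.get_dummies(clean[["weekday", "unit_price"]], columns=["weekday"], dtype=int)
y = clean["revenue"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# 3. Training: fit the chosen algorithm on the training split only.
model = LinearRegression().fit(X_train, y_train)

# 4. Evaluation: score the model on data it has not seen.
r2 = r2_score(y_test, model.predict(X_test))
print(f"rows after cleaning: {len(clean)}, test R^2: {r2:.3f}")
```

The target here is continuous, so this is a regression problem; the notebook below follows the same four steps for a classification target.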


In [1]:
import pandas as pd

In [2]:
df1 = pd.read_csv('Data06.csv')
df1.head(2)

Unnamed: 0,회원번호,조합원상태,주소-구,주소-동,성별,연령,연령대,총구매금액,총구매수량,1회방문시구매금액(평균),배송서비스신청여부,모바일알람여부,Gold_member
0,272369856,정상회원,수지구,풍덕천동,여,45.0,40대,5733884,546.5,47782,미신청,수신,VIP
1,1506656256,정상회원,수지구,풍덕천동,여,36.0,30대이하,673414,90.0,35443,미신청,.,normal


In [None]:
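# This is member data actually used by a real store.
# Let's train a model on it.
# What kind of learning?
# We use each customer's Gold_member value as the target.
# What is the probability that a given customer becomes VIP? What is the probability of normal?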


## 1. Data preparation and preprocessing

* Set the Y target: **Gold_member** / X (features) ...
* Feature Selection
* Handle missing values

In [3]:
# Drop unneeded columns (drop returns a new DataFrame; df1 itself is unchanged)
df1.drop(columns=['회원번호', '주소-동', '연령대', '1회방문시구매금액(평균)'])

Unnamed: 0,조합원상태,주소-구,성별,연령,총구매금액,총구매수량,배송서비스신청여부,모바일알람여부,Gold_member
0,정상회원,수지구,여,45.0,5733884,546.5,미신청,수신,VIP
1,정상회원,수지구,여,36.0,673414,90.0,미신청,.,normal
2,정상회원,수지구,여,34.0,655919,66.0,미신청,.,normal
3,정상회원,수지구,여,51.0,2984534,252.1,미신청,.,normal
4,정상회원,수지구,여,51.0,1901488,152.0,신청,.,normal
...,...,...,...,...,...,...,...,...,...
6514,정상회원,수지구,남,32.0,52646,1.1,미신청,.,normal
6515,정상회원,분당구,여,46.0,61740,9.0,미신청,.,normal
6516,정상회원,기타,여,82.0,15507,2.0,미신청,.,normal
6517,정상회원,분당구,여,55.0,36374,2.0,미신청,.,normal


In [4]:
df2 = df1.drop(columns=['회원번호', '주소-동', '연령대', '1회방문시구매금액(평균)'])
df2.head()

Unnamed: 0,조합원상태,주소-구,성별,연령,총구매금액,총구매수량,배송서비스신청여부,모바일알람여부,Gold_member
0,정상회원,수지구,여,45.0,5733884,546.5,미신청,수신,VIP
1,정상회원,수지구,여,36.0,673414,90.0,미신청,.,normal
2,정상회원,수지구,여,34.0,655919,66.0,미신청,.,normal
3,정상회원,수지구,여,51.0,2984534,252.1,미신청,.,normal
4,정상회원,수지구,여,51.0,1901488,152.0,신청,.,normal


In [6]:
# Handle missing values
df2.isnull()

Unnamed: 0,조합원상태,주소-구,성별,연령,총구매금액,총구매수량,배송서비스신청여부,모바일알람여부,Gold_member
0,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...
6514,False,False,False,False,False,False,False,False,False
6515,False,False,False,False,False,False,False,False,False
6516,False,False,False,False,False,False,False,False,False
6517,False,False,False,False,False,False,False,False,False


In [7]:
# Count the missing values in each column
df2.isnull().sum()

조합원상태          0
주소-구           0
성별             1
연령             1
총구매금액          0
총구매수량          0
배송서비스신청여부      0
모바일알람여부        0
Gold_member    0
dtype: int64

In [8]:
# Drop rows that contain missing values
df2.dropna()

Unnamed: 0,조합원상태,주소-구,성별,연령,총구매금액,총구매수량,배송서비스신청여부,모바일알람여부,Gold_member
0,정상회원,수지구,여,45.0,5733884,546.5,미신청,수신,VIP
1,정상회원,수지구,여,36.0,673414,90.0,미신청,.,normal
2,정상회원,수지구,여,34.0,655919,66.0,미신청,.,normal
3,정상회원,수지구,여,51.0,2984534,252.1,미신청,.,normal
4,정상회원,수지구,여,51.0,1901488,152.0,신청,.,normal
...,...,...,...,...,...,...,...,...,...
6514,정상회원,수지구,남,32.0,52646,1.1,미신청,.,normal
6515,정상회원,분당구,여,46.0,61740,9.0,미신청,.,normal
6516,정상회원,기타,여,82.0,15507,2.0,미신청,.,normal
6517,정상회원,분당구,여,55.0,36374,2.0,미신청,.,normal


In [9]:
df3 = df2.dropna()
df3

Unnamed: 0,조합원상태,주소-구,성별,연령,총구매금액,총구매수량,배송서비스신청여부,모바일알람여부,Gold_member
0,정상회원,수지구,여,45.0,5733884,546.5,미신청,수신,VIP
1,정상회원,수지구,여,36.0,673414,90.0,미신청,.,normal
2,정상회원,수지구,여,34.0,655919,66.0,미신청,.,normal
3,정상회원,수지구,여,51.0,2984534,252.1,미신청,.,normal
4,정상회원,수지구,여,51.0,1901488,152.0,신청,.,normal
...,...,...,...,...,...,...,...,...,...
6514,정상회원,수지구,남,32.0,52646,1.1,미신청,.,normal
6515,정상회원,분당구,여,46.0,61740,9.0,미신청,.,normal
6516,정상회원,기타,여,82.0,15507,2.0,미신청,.,normal
6517,정상회원,분당구,여,55.0,36374,2.0,미신청,.,normal


In [10]:
# Split into X and Y: first check the class counts of the target
df3['Gold_member'].value_counts()

normal    6421
VIP         96
Name: Gold_member, dtype: int64
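These counts show a strong class imbalance, which matters later when reading accuracy. The arithmetic can be checked directly; the numbers below are copied from the `value_counts()` output, and the "always normal" model is a hypothetical baseline used only for comparison.

```python
# Class counts copied from the value_counts() output above.
counts = {"normal": 6421, "VIP": 96}
total = sum(counts.values())

vip_share = counts["VIP"] / total
# Accuracy of a trivial model that predicts "normal" for every customer.
baseline_accuracy = counts["normal"] / total

print(f"VIP share: {vip_share:.2%}")                       # VIP share: 1.47%
print(f"always-normal accuracy: {baseline_accuracy:.2%}")  # always-normal accuracy: 98.53%
```

A model can therefore reach roughly 98.5% accuracy without identifying a single VIP, so per-class precision and recall are more informative than accuracy here.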

In [None]:
# Convert the labels into a form the computer can read:
# text to numbers
# e.g. VIP -> 1
#      normal -> 0

In [20]:
df3['Gold_member'].replace({'VIP':1,'normal':0})

0       1
1       0
2       0
3       0
4       0
       ..
6514    0
6515    0
6516    0
6517    0
6518    0
Name: Gold_member, Length: 6517, dtype: int64

In [21]:
df3 = df3.copy()  # take an explicit copy so the assignment below does not trigger SettingWithCopyWarning
df3['Target'] = df3['Gold_member'].replace({'VIP':1,'normal':0})


In [22]:
df3

Unnamed: 0,조합원상태,주소-구,성별,연령,총구매금액,총구매수량,배송서비스신청여부,모바일알람여부,Gold_member,Target
0,정상회원,수지구,여,45.0,5733884,546.5,미신청,수신,VIP,1
1,정상회원,수지구,여,36.0,673414,90.0,미신청,.,normal,0
2,정상회원,수지구,여,34.0,655919,66.0,미신청,.,normal,0
3,정상회원,수지구,여,51.0,2984534,252.1,미신청,.,normal,0
4,정상회원,수지구,여,51.0,1901488,152.0,신청,.,normal,0
...,...,...,...,...,...,...,...,...,...,...
6514,정상회원,수지구,남,32.0,52646,1.1,미신청,.,normal,0
6515,정상회원,분당구,여,46.0,61740,9.0,미신청,.,normal,0
6516,정상회원,기타,여,82.0,15507,2.0,미신청,.,normal,0
6517,정상회원,분당구,여,55.0,36374,2.0,미신청,.,normal,0


In [23]:
pd.get_dummies(df3) # one-hot Encoding

Unnamed: 0,연령,총구매금액,총구매수량,Target,조합원상태_이관처리중,조합원상태_정상회원,조합원상태_탈퇴,조합원상태_탈퇴신청,조합원상태_탈퇴처리중,주소-구_광주,...,주소-구_하남,주소-구_화성,성별_남,성별_여,배송서비스신청여부_미신청,배송서비스신청여부_신청,모바일알람여부_.,모바일알람여부_수신,Gold_member_VIP,Gold_member_normal
0,45.0,5733884,546.5,1,0,1,0,0,0,0,...,0,0,0,1,1,0,0,1,1,0
1,36.0,673414,90.0,0,0,1,0,0,0,0,...,0,0,0,1,1,0,1,0,0,1
2,34.0,655919,66.0,0,0,1,0,0,0,0,...,0,0,0,1,1,0,1,0,0,1
3,51.0,2984534,252.1,0,0,1,0,0,0,0,...,0,0,0,1,1,0,1,0,0,1
4,51.0,1901488,152.0,0,0,1,0,0,0,0,...,0,0,0,1,0,1,1,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6514,32.0,52646,1.1,0,0,1,0,0,0,0,...,0,0,1,0,1,0,1,0,0,1
6515,46.0,61740,9.0,0,0,1,0,0,0,0,...,0,0,0,1,1,0,1,0,0,1
6516,82.0,15507,2.0,0,0,1,0,0,0,0,...,0,0,0,1,1,0,1,0,0,1
6517,55.0,36374,2.0,0,0,1,0,0,0,0,...,0,0,0,1,1,0,1,0,0,1


In [26]:
pd.get_dummies(df3.drop(columns=['Gold_member','Target']))

Unnamed: 0,연령,총구매금액,총구매수량,조합원상태_이관처리중,조합원상태_정상회원,조합원상태_탈퇴,조합원상태_탈퇴신청,조합원상태_탈퇴처리중,주소-구_광주,주소-구_기타,...,주소-구_중원구,주소-구_처인구,주소-구_하남,주소-구_화성,성별_남,성별_여,배송서비스신청여부_미신청,배송서비스신청여부_신청,모바일알람여부_.,모바일알람여부_수신
0,45.0,5733884,546.5,0,1,0,0,0,0,0,...,0,0,0,0,0,1,1,0,0,1
1,36.0,673414,90.0,0,1,0,0,0,0,0,...,0,0,0,0,0,1,1,0,1,0
2,34.0,655919,66.0,0,1,0,0,0,0,0,...,0,0,0,0,0,1,1,0,1,0
3,51.0,2984534,252.1,0,1,0,0,0,0,0,...,0,0,0,0,0,1,1,0,1,0
4,51.0,1901488,152.0,0,1,0,0,0,0,0,...,0,0,0,0,0,1,0,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6514,32.0,52646,1.1,0,1,0,0,0,0,0,...,0,0,0,0,1,0,1,0,1,0
6515,46.0,61740,9.0,0,1,0,0,0,0,0,...,0,0,0,0,0,1,1,0,1,0
6516,82.0,15507,2.0,0,1,0,0,0,0,1,...,0,0,0,0,0,1,1,0,1,0
6517,55.0,36374,2.0,0,1,0,0,0,0,0,...,0,0,0,0,0,1,1,0,1,0


In [28]:
X = pd.get_dummies(df3.drop(columns=['Gold_member','Target']))
Y = df3['Target']
# X and Y are now ready for training. Everything up to this point is the job of pandas.
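The reason for dropping `Gold_member` and `Target` before encoding is visible in the `pd.get_dummies(df3)` output above: it contains `Gold_member_VIP` and `Gold_member_normal`, which would hand the answer to the model. A minimal sketch on a three-row toy frame (the values are made up; the column names mirror the notebook, and `dtype=int` is passed so the dummies are 0/1 regardless of pandas version):

```python
import pandas as pd

# Toy frame with hypothetical values; column names follow the notebook.
toy = pd.DataFrame({
    "성별": ["여", "남", "여"],
    "연령": [45.0, 32.0, 51.0],
    "Gold_member": ["VIP", "normal", "normal"],
})

# Encode the target by hand, then one-hot encode only the features.
y = toy["Gold_member"].map({"VIP": 1, "normal": 0})
X = pd.get_dummies(toy.drop(columns=["Gold_member"]), dtype=int)

print(list(X.columns))  # ['연령', '성별_남', '성별_여']
print(y.tolist())       # [1, 0, 0]
```

Numeric columns pass through unchanged, and each text column becomes one 0/1 column per category.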

## 2. Data preprocessing (train/test split)

- sklearn
  - model_selection
  - metrics
  - tree

In [30]:
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

In [31]:
train_test_split(X,Y)

[        연령    총구매금액   총구매수량  조합원상태_이관처리중  조합원상태_정상회원  조합원상태_탈퇴  조합원상태_탈퇴신청  \
 5615  40.0    42117    2.00            0           1         0           0   
 4901  35.0   786127   83.08            0           1         0           0   
 5361  49.0    60649    6.60            0           1         0           0   
 2720  46.0   854307   51.00            0           1         0           0   
 1400  54.0    30343    4.00            0           0         1           0   
 ...    ...      ...     ...          ...         ...       ...         ...   
 4412  55.0    83851   10.00            0           1         0           0   
 6256  80.0   216999   25.00            0           1         0           0   
 6098  41.0    21059    2.00            0           1         0           0   
 4407  56.0    81171    3.00            0           1         0           0   
 1595  63.0  4301433  383.08            0           1         0           0   
 
       조합원상태_탈퇴처리중  주소-구_광주  주소-구_기타  ...  주소-구_중원

In [37]:
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=0.3, random_state=2020)

print(X_train.shape)
print(X_test.shape)
print(Y_train.shape)
print(Y_test.shape)

# Check that the shapes are consistent (same row counts for X and Y, same columns for train and test)

(4561, 30)
(1956, 30)
(4561,)
(1956,)
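With only 96 VIP rows, a purely random split can leave the two sides with different VIP proportions. One option, not used in the notebook, is the `stratify` argument of `train_test_split`, which keeps the class ratio roughly equal in both splits. The sketch below uses synthetic labels with the notebook's class counts and a placeholder feature, both of which are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the notebook (6421 vs 96).
y = np.array([0] * 6421 + [1] * 96)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature, not real data

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=2020, stratify=y
)

# The VIP share should be close to 1.47% on both sides.
print(f"train VIP share: {y_train.mean():.2%}")
print(f"test VIP share:  {y_test.mean():.2%}")
```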


## 3. Training

In [39]:
model = DecisionTreeClassifier()
model.fit(X_train,Y_train)

DecisionTreeClassifier()

## 4. Evaluation

In [40]:
# Check whether the model was built well.

Y_train_pred = model.predict(X_train) # how well the algorithm fit the training data
Y_test_pred = model.predict(X_test) # how well it performs when new data comes in
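`predict` returns only the class label. The question posed at the start (what is the probability that a customer is VIP?) maps to `predict_proba`. The sketch below uses a synthetic single-feature dataset as a stand-in for the member data; the feature, the labelling rule, and `max_depth=2` are all assumptions. The depth cap matters because a fully grown tree usually has pure leaves and reports only 0 or 1.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Synthetic stand-in: one feature (total purchase amount); large, noisy amounts are VIP.
amount = rng.uniform(0, 6_000_000, size=500)
is_vip = (amount + rng.normal(0, 500_000, size=500) > 4_500_000).astype(int)
X = amount.reshape(-1, 1)

# A shallow tree keeps mixed-class leaves, so probabilities can fall between 0 and 1.
model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, is_vip)

proba = model.predict_proba(X[:5])
print(model.classes_)  # column order of predict_proba: [0 1]
print(proba[:, 1])     # estimated probability of class 1 (VIP) for the first five rows
```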

In [43]:
print(classification_report(Y_train,Y_train_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      4495
           1       1.00      1.00      1.00        66

    accuracy                           1.00      4561
   macro avg       1.00      1.00      1.00      4561
weighted avg       1.00      1.00      1.00      4561



In [44]:
print(classification_report(Y_test,Y_test_pred))

              precision    recall  f1-score   support

           0       0.99      0.99      0.99      1926
           1       0.54      0.47      0.50        30

    accuracy                           0.99      1956
   macro avg       0.77      0.73      0.75      1956
weighted avg       0.98      0.99      0.99      1956
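The class 1 row can be reproduced by hand. With 30 actual VIPs in the test set, the rounded values recall 0.47 and precision 0.54 are consistent with 14 true positives, 12 false positives and 16 false negatives. These counts are inferred from the rounded report, not read from a confusion matrix in the notebook.

```python
# Counts for class 1 (VIP), inferred from the rounded report above.
tp, fp, fn = 14, 12, 16

precision = tp / (tp + fp)  # of the customers predicted VIP, how many really are
recall = tp / (tp + fn)     # of the real VIPs, how many the model found
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 2), round(recall, 2), round(f1, 2))  # 0.54 0.47 0.5
```

The model finds fewer than half of the VIPs, and almost half of its VIP predictions are wrong, even though overall accuracy is 0.99.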



In [46]:
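# The model generalizes poorly: every training score is 1.00, but VIP (class 1) recall on the test set is only 0.47.
# Possible causes: an algorithm that does not suit the data,
# flawed preprocessing, and so on.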
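A perfect training score next to a much lower test score is the usual signature of overfitting, and an unconstrained `DecisionTreeClassifier` tends to produce it because it keeps splitting until the training data is fit exactly. The sketch below compares an unconstrained tree with a depth-limited one on synthetic imbalanced data. The dataset from `make_classification` and the value `max_depth=3` are assumptions, and whether the shallower tree scores better on the test split depends on the data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (about 5% positives), not the member data.
X, y = make_classification(
    n_samples=3000, n_features=10, n_informative=4,
    weights=[0.95, 0.05], flip_y=0.02, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

results = {}
for depth in (None, 3):  # None = grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    train_f1 = f1_score(y_train, tree.predict(X_train), zero_division=0)
    test_f1 = f1_score(y_test, tree.predict(X_test), zero_division=0)
    results[depth] = (train_f1, test_f1)
    print(f"max_depth={depth}: train F1={train_f1:.2f}, test F1={test_f1:.2f}")
```

The gap between the train and test columns is the quantity to watch; a large gap means the model memorized the training rows.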

# Types of learning

- Supervised: learn Y from X
  - Classification: Y is text or categorical
  - Regression: Y is continuous
- Unsupervised
- Reinforcement
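The difference between the two supervised cases shows up in the type of Y and in the estimator used. A minimal sketch with synthetic data (the feature, the labelling rule, and the tree depth are assumptions for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))

# Classification: Y is a category.
y_class = np.where(X[:, 0] > 5, "VIP", "normal")
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y_class)

# Regression: Y is a continuous number.
y_reg = 3.0 * X[:, 0] + rng.normal(0, 1, size=200)
reg = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y_reg)

sample = np.array([[2.0], [8.0]])
class_pred = clf.predict(sample)
value_pred = reg.predict(sample)
print(class_pred)  # class labels: ['normal' 'VIP']
print(value_pred)  # real-valued estimates, lower for 2.0 than for 8.0
```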

# Learning and generalization

- Numeric / text features differ in scale (e.g. age: 20~100, male label: 0~1)
  => handle with Scaling
- Class imbalance: VIP about 90 / normal about 6,000
  => Threshold (cut-off) adjustment / Sampling
- Algorithm
  => Hyperparameter Tuning
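The three remedies above can be combined in one scikit-learn pipeline. The sketch below is one possible arrangement, not the notebook's method: it swaps in `LogisticRegression` (a model for which scaling matters, unlike a decision tree), uses `class_weight="balanced"` as a reweighting alternative to resampling, searches over a small hypothetical grid for `C`, and then compares two cut-offs on the predicted probability. The data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data (about 5% positives), not the member data.
X, y = make_classification(
    n_samples=3000, n_features=8, n_informative=4,
    weights=[0.95, 0.05], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Scaling + imbalance handling: standardize features, weight classes inversely to frequency.
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)

# Hyperparameter tuning: 5-fold cross-validated search over C, scored by F1.
search = GridSearchCV(
    pipe, {"logisticregression__C": [0.01, 0.1, 1, 10]}, scoring="f1", cv=5
)
search.fit(X_train, y_train)
print("best params:", search.best_params_)

# Threshold (cut-off): lowering it can only keep or raise recall for class 1.
proba = search.predict_proba(X_test)[:, 1]
recalls = {}
for threshold in (0.5, 0.3):
    pred = (proba >= threshold).astype(int)
    recalls[threshold] = recall_score(y_test, pred)
    print(f"threshold={threshold}: recall={recalls[threshold]:.2f}")
```

Lowering the cut-off trades precision for recall, so the right value depends on whether missing a VIP or mislabelling a normal customer is the costlier mistake.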