# **LightGBM**

---

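- **Goal**
    - Predict fraudulent transactions from a dataset of card transaction records.
- **Algorithm**: LightGBM
- **Problem type**: Classification
- **Target variable**: is_fraud (fraudulent transaction)
- **Models used**: LGBMClassifier, train
- **Dataset**
    - File name: fraud.csv
    - Description
        - Data on fraudulent card transactions.
        - A fraudulent transaction is, for example, a payment made with no intention of paying the card bill, or a payment made with a stolen card.
        - The target variable is whether a transaction is fraudulent; the independent variables include the transaction amount, customer gender, and merchant category.
- **Evaluation metrics**: accuracy (ACC), confusion matrix, classification report, ROC AUC score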
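
The four evaluation metrics listed above can all be computed with scikit-learn. The following is a minimal sketch on made-up values: `y_true` and `y_prob` are illustrative and not results from fraud.csv, and the 0.5 threshold for turning probabilities into class labels is an assumption.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, classification_report,
                             confusion_matrix, roc_auc_score)

# Hypothetical ground truth and predicted fraud probabilities (toy values).
y_true = np.array([0, 0, 0, 0, 1, 1, 0, 1, 0, 0])
y_prob = np.array([0.10, 0.20, 0.05, 0.60, 0.80, 0.40, 0.30, 0.90, 0.15, 0.25])

# Assumed threshold: probabilities of 0.5 or more are labelled as fraud.
y_pred = (y_prob >= 0.5).astype(int)

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)  # rows: actual class, columns: predicted class
auc = roc_auc_score(y_true, y_prob)    # uses the probabilities, not the thresholded labels

print('Accuracy:', acc)
print('Confusion matrix:\n', cm)
print(classification_report(y_true, y_pred))
print('ROC AUC:', auc)
```

ROC AUC is computed from the predicted probabilities, so it does not depend on the chosen threshold, whereas the other three metrics do.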

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

file_url = 'https://media.githubusercontent.com/media/musthave-ML10/data_source/main/fraud.csv'
data = pd.read_csv(file_url) # read the dataset

In [2]:
data.head()

Unnamed: 0,trans_date_trans_time,cc_num,merchant,category,amt,first,last,gender,street,city,...,lat,long,city_pop,job,dob,trans_num,unix_time,merch_lat,merch_long,is_fraud
0,2019-01-01 00:00:18,2703186189652095,"fraud_Rippin, Kub and Mann",misc_net,4.97,Jennifer,Banks,F,561 Perry Cove,Moravian Falls,...,36.0788,-81.1781,3495,"Psychologist, counselling",1988-03-09,0b242abb623afc578575680df30655b9,1325376018,36.011293,-82.048315,0
1,2019-01-01 00:00:44,630423337322,"fraud_Heller, Gutmann and Zieme",grocery_pos,107.23,Stephanie,Gill,F,43039 Riley Greens Suite 393,Orient,...,48.8878,-118.2105,149,Special educational needs teacher,1978-06-21,1f76529f8574734946361c461b024d99,1325376044,49.159047,-118.186462,0
2,2019-01-01 00:00:51,38859492057661,fraud_Lind-Buckridge,entertainment,220.11,Edward,Sanchez,M,594 White Dale Suite 530,Malad City,...,42.1808,-112.262,4154,Nature conservation officer,1962-01-19,a1a22d70485983eac12b5b88dad1cf95,1325376051,43.150704,-112.154481,0
3,2019-01-01 00:01:16,3534093764340240,"fraud_Kutch, Hermiston and Farrell",gas_transport,45.0,Jeremy,White,M,9443 Cynthia Court Apt. 038,Boulder,...,46.2306,-112.1138,1939,Patent attorney,1967-01-12,6b849c168bdad6f867558c3793159a81,1325376076,47.034331,-112.561071,0
4,2019-01-01 00:03:06,375534208663984,fraud_Keeling-Crist,misc_pos,41.96,Tyler,Garcia,M,408 Bradley Rest,Doe Hill,...,38.4207,-79.4629,99,Dance movement psychotherapist,1986-03-28,a41d7549acf90789359a9aa5346dcb46,1325376186,38.674999,-78.632459,0


In [3]:
data.info(show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1852394 entries, 0 to 1852393
Data columns (total 22 columns):
 #   Column                 Non-Null Count    Dtype  
---  ------                 --------------    -----  
 0   trans_date_trans_time  1852394 non-null  object 
 1   cc_num                 1852394 non-null  int64  
 2   merchant               1852394 non-null  object 
 3   category               1852394 non-null  object 
 4   amt                    1852394 non-null  float64
 5   first                  1852394 non-null  object 
 6   last                   1852394 non-null  object 
 7   gender                 1852394 non-null  object 
 8   street                 1852394 non-null  object 
 9   city                   1852394 non-null  object 
 10  state                  1852394 non-null  object 
 11  zip                    1852394 non-null  int64  
 12  lat                    1852394 non-null  float64
 13  long                   1852394 non-null  float64
 14  city_pop          

In [4]:
round(data.describe(), 2)

Unnamed: 0,cc_num,amt,zip,lat,long,city_pop,unix_time,merch_lat,merch_long,is_fraud
count,1852394.0,1852394.0,1852394.0,1852394.0,1852394.0,1852394.0,1852394.0,1852394.0,1852394.0,1852394.0
mean,4.17386e+17,70.06,48813.26,38.54,-90.23,88643.67,1358674000.0,38.54,-90.23,0.01
std,1.309115e+18,159.25,26881.85,5.07,13.75,301487.62,18195080.0,5.11,13.76,0.07
min,60416210000.0,1.0,1257.0,20.03,-165.67,23.0,1325376000.0,19.03,-166.67,0.0
25%,180042900000000.0,9.64,26237.0,34.67,-96.8,741.0,1343017000.0,34.74,-96.9,0.0
50%,3521417000000000.0,47.45,48174.0,39.35,-87.48,2443.0,1357089000.0,39.37,-87.44,0.0
75%,4642255000000000.0,83.1,72042.0,41.94,-80.16,20328.0,1374581000.0,41.96,-80.25,0.0
max,4.992346e+18,28948.9,99921.0,66.69,-67.95,2906700.0,1388534000.0,67.51,-66.95,1.0


In [5]:
# Preprocessing
data.drop(['first', 'last', 'street', 'city', 'state', 'zip', 'trans_num', 'unix_time', 'job', 'merchant'], axis=1, inplace=True) # drop columns that are not useful for modeling

In [6]:
data['trans_date_trans_time'] = pd.to_datetime(data['trans_date_trans_time']) # convert to datetime

In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1852394 entries, 0 to 1852393
Data columns (total 12 columns):
 #   Column                 Dtype         
---  ------                 -----         
 0   trans_date_trans_time  datetime64[ns]
 1   cc_num                 int64         
 2   category               object        
 3   amt                    float64       
 4   gender                 object        
 5   lat                    float64       
 6   long                   float64       
 7   city_pop               int64         
 8   dob                    object        
 9   merch_lat              float64       
 10  merch_long             float64       
 11  is_fraud               int64         
dtypes: datetime64[ns](1), float64(5), int64(3), object(3)
memory usage: 169.6+ MB


In [8]:
amt_info = data.groupby('cc_num')['amt'].agg(['mean', 'std']).reset_index() # mean and standard deviation of amt per cc_num (select amt first so non-numeric columns are not aggregated)


In [9]:
amt_info.head()

Unnamed: 0,cc_num,mean,std
0,60416207185,59.257796,142.869746
1,60422928733,65.483159,92.042844
2,60423098130,96.376084,1000.693872
3,60427851591,107.48755,131.014534
4,60487002085,64.096925,153.20766


In [10]:
data = data.merge(amt_info, on='cc_num', how='left') # attach the per-card statistics to each transaction

In [12]:
data['amt_z_score'] = (data['amt'] - data['mean']) / data['std'] # compute the z-score

## Z-score

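- The Z-score (also called the standard value or standard score) uses the mean and standard deviation to express how many standard deviations a particular value lies from the mean, that is, where it sits within the distribution.

$$Z = \frac{x - m}{\sigma}$$

Here $x$ is the particular value, $m$ is the mean, and $\sigma$ is the standard deviation.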
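
As a small worked example of the formula, the sketch below reproduces the per-card z-score steps on a toy DataFrame (the values are made up for illustration and do not come from fraud.csv). Note that pandas' `std` is the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

# Toy transactions for two hypothetical cards.
toy = pd.DataFrame({
    'cc_num': [1, 1, 1, 2, 2],
    'amt': [10.0, 20.0, 30.0, 5.0, 15.0],
})

# Per-card mean and sample standard deviation of the amount.
toy_info = toy.groupby('cc_num')['amt'].agg(['mean', 'std']).reset_index()

# Attach the per-card statistics to every transaction, then standardize.
toy = toy.merge(toy_info, on='cc_num', how='left')
toy['amt_z_score'] = (toy['amt'] - toy['mean']) / toy['std']

print(toy)
```

For card 1 the mean is 20 and the sample standard deviation is 10, so the amounts 10, 20 and 30 map to z-scores of -1, 0 and 1.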

In [13]:
data[['amt', 'mean', 'std', 'amt_z_score']].head()

Unnamed: 0,amt,mean,std,amt_z_score
0,4.97,89.408743,127.530101,-0.662108
1,107.23,56.078113,159.201852,0.321302
2,220.11,69.924272,116.688602,1.287064
3,45.0,80.09004,280.07788,-0.125287
4,41.96,95.341146,94.322842,-0.565941


In [14]:
data.drop(['mean', 'std'], axis=1, inplace=True)

In [15]:
category_info = data.groupby(['cc_num', 'category'])['amt'].agg(['mean', 'std']).reset_index() # mean and standard deviation of amt per cc_num and category


In [16]:
data = data.merge(category_info, on=['cc_num', 'category'], how='left')

In [17]:
data['cat_z_score'] = (data['amt'] - data['mean']) / data['std']
data.drop(['mean', 'std'], axis=1, inplace=True)
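
One caveat with the per-card, per-category z-score: a (card, category) group that contains a single transaction has an undefined sample standard deviation, so its z-score comes out as NaN. The sketch below shows this on toy data, using `groupby(...).transform` as an alternative to the merge-and-drop steps. The `fillna(0)` at the end is only one possible way to handle such rows and is an assumption, not something this notebook does.

```python
import pandas as pd

# Toy data: card 1 has two grocery_pos transactions; every other group has one.
toy = pd.DataFrame({
    'cc_num':   [1, 1, 1, 2],
    'category': ['grocery_pos', 'grocery_pos', 'misc_net', 'grocery_pos'],
    'amt':      [10.0, 30.0, 50.0, 7.0],
})

# transform() returns group statistics aligned with the original rows,
# so no merge or temporary 'mean'/'std' columns are needed.
grouped_amt = toy.groupby(['cc_num', 'category'])['amt']
toy['cat_z_score'] = (toy['amt'] - grouped_amt.transform('mean')) / grouped_amt.transform('std')

# Single-transaction groups yield NaN because their sample std is undefined.
n_missing = int(toy['cat_z_score'].isna().sum())

# Assumed handling: treat "no information" as an average transaction (z = 0).
toy['cat_z_score_filled'] = toy['cat_z_score'].fillna(0)

print(toy)
print('Rows with undefined z-score:', n_missing)
```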

In [None]:
import geopy.distance
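
The `geopy.distance` import suggests that the next step measures how far each merchant (`merch_lat`, `merch_long`) is from the cardholder's location (`lat`, `long`). As a sketch only, and swapping in a different technique from geopy, the hypothetical helper `haversine_km` below computes the great-circle distance with NumPy. It assumes a spherical Earth with a radius of 6371 km, so its results can differ slightly from an ellipsoidal geodesic distance.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * radius_km * np.arcsin(np.sqrt(a))

# Coordinates from the first row of data.head(): cardholder vs. merchant.
d = haversine_km(36.0788, -81.1781, 36.011293, -82.048315)
print(round(float(d), 1), 'km')
```

Because the helper is built only from NumPy functions, it also accepts whole columns, for example `haversine_km(data['lat'], data['long'], data['merch_lat'], data['merch_long'])`.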