### Installing LightGBM, SHAP, and scikit-image

In [13]:
!pip install lightgbm

Collecting lightgbm
  Downloading lightgbm-3.3.1-py3-none-win_amd64.whl (1.0 MB)
Installing collected packages: lightgbm
Successfully installed lightgbm-3.3.1




In [29]:
!pip install shap --user





In [21]:
!pip install scikit-image

Collecting scikit-image
  Downloading scikit_image-0.19.0-cp37-cp37m-win_amd64.whl (34.7 MB)
Collecting networkx>=2.2
  Downloading networkx-2.6.3-py3-none-any.whl (1.9 MB)




Collecting tifffile>=2019.7.26
  Downloading tifffile-2021.11.2-py3-none-any.whl (178 kB)
Collecting PyWavelets>=1.1.1
  Downloading PyWavelets-1.2.0-cp37-cp37m-win_amd64.whl (4.2 MB)
Installing collected packages: tifffile, PyWavelets, networkx, scikit-image
Successfully installed PyWavelets-1.2.0 networkx-2.6.3 scikit-image-0.19.0 tifffile-2021.11.2


In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [5]:
data = pd.read_csv('kc_house_data.csv')

In [6]:
data.head()

|   | id | date | price | bedrooms | bathrooms | floors | waterfront | condition | grade | yr_built | yr_renovated | zipcode | lat | long |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7129300520 | 20141013T000000 | 221900.0 | 3 | 1.0 | 1.0 | 0 | 3 | 7 | 1955 | 0 | 98178 | 47.5112 | -122.257 |
| 1 | 6414100192 | 20141209T000000 | 538000.0 | 3 | 2.25 | 2.0 | 0 | 3 | 7 | 1951 | 1991 | 98125 | 47.721 | -122.319 |
| 2 | 5631500400 | 20150225T000000 | 180000.0 | 2 | 1.0 | 1.0 | 0 | 3 | 6 | 1933 | 0 | 98028 | 47.7379 | -122.233 |
| 3 | 2487200875 | 20141209T000000 | 604000.0 | 4 | 3.0 | 1.0 | 0 | 5 | 7 | 1965 | 0 | 98136 | 47.5208 | -122.393 |
| 4 | 1954400510 | 20150218T000000 | 510000.0 | 3 | 2.0 | 1.0 | 0 | 3 | 8 | 1987 | 0 | 98074 | 47.6168 | -122.045 |


In [8]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21613 entries, 0 to 21612
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   id            21613 non-null  int64  
 1   date          21613 non-null  object 
 2   price         21613 non-null  float64
 3   bedrooms      21613 non-null  int64  
 4   bathrooms     21613 non-null  float64
 5   floors        21613 non-null  float64
 6   waterfront    21613 non-null  int64  
 7   condition     21613 non-null  int64  
 8   grade         21613 non-null  int64  
 9   yr_built      21613 non-null  int64  
 10  yr_renovated  21613 non-null  int64  
 11  zipcode       21613 non-null  int64  
 12  lat           21613 non-null  float64
 13  long          21613 non-null  float64
dtypes: float64(5), int64(8), object(1)
memory usage: 2.3+ MB


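id: unique identifier of the house

date: date the house was sold

price: sale price of the house (target variable)

bedrooms: number of bedrooms per house

bathrooms: number of bathrooms per house

floors: total number of floors

waterfront: whether the house has a waterfront view (0, 1)

condition: state of upkeep of the house (1 to 5)

grade: rating under the King County grading system (1 to 13)

yr_built: year the house was built

yr_renovated: year the house was renovated

zipcode: ZIP code

lat: latitude

long: longitude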
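
The value ranges listed above can be checked before modelling. The following is a minimal sketch, not part of the original notebook: the helper `find_out_of_range` and the `EXPECTED_RANGES` table are hypothetical names, and the bounds are simply the ones stated in the descriptions (they are assumed to be inclusive).

```python
# Documented ranges from the variable descriptions above (assumed inclusive).
EXPECTED_RANGES = {
    "waterfront": (0, 1),
    "condition": (1, 5),
    "grade": (1, 13),
}


def find_out_of_range(records, ranges=EXPECTED_RANGES):
    """Return (row_index, column, value) for every value outside its documented range."""
    problems = []
    for i, row in enumerate(records):
        for col, (lo, hi) in ranges.items():
            value = row[col]
            if not lo <= value <= hi:
                problems.append((i, col, value))
    return problems


# The first two rows reuse values shown by data.head();
# the third row is made up purely to trigger the check.
sample = [
    {"waterfront": 0, "condition": 3, "grade": 7},
    {"waterfront": 0, "condition": 5, "grade": 7},
    {"waterfront": 2, "condition": 3, "grade": 14},  # hypothetical bad row
]
print(find_out_of_range(sample))  # [(2, 'waterfront', 2), (2, 'grade', 14)]
```

With the real DataFrame, the same helper could be fed `data.to_dict('records')`; an empty list would mean every row respects the documented ranges.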

In [9]:
nCar = data.shape[0]  # number of rows (observations)
nVar = data.shape[1]  # number of columns (variables)
print('nCar: %d' % nCar, 'nVar: %d' % nVar)

nCar: 21613 nVar: 14


In [10]:
data = data.drop(['id', 'date', 'zipcode', 'lat', 'long'], axis = 1)

In [11]:
feature_columns = list(data.columns.difference(['price']))
X = data[feature_columns]
y = data['price']
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size = 0.3, random_state = 42) 
print(train_x.shape, test_x.shape, train_y.shape, test_y.shape)

(15129, 8) (6484, 8) (15129,) (6484,)
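
The printed shapes follow from the row count and `test_size = 0.3`. As a rough sketch of the arithmetic, assuming the test count is rounded up and the remaining rows go to training (which matches the output above), the hypothetical helper `split_sizes` reproduces the numbers:

```python
from math import ceil


def split_sizes(n_samples, test_size):
    """Row counts for a fractional test_size: test rows rounded up, the rest used for training."""
    n_test = ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


# 21613 rows in total, 30% held out for testing.
n_train, n_test = split_sizes(21613, 0.3)
print(n_train, n_test)  # 15129 6484
```

The second dimension, 8, is the 14 original columns minus the 5 dropped ones and minus the target `price`.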


In [18]:
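import lightgbm as lgb
from math import sqrt
from sklearn.metrics import mean_squared_error

# Wrap the training data in LightGBM's own Dataset format
lgb_dtrain = lgb.Dataset(data = train_x, label = train_y)
lgb_param = {'max_depth': 10,            # maximum depth of each tree
             'learning_rate': 0.01,      # shrinkage applied to each boosting step
             'n_estimators': 1000,       # number of boosting rounds (alias of num_iterations)
             'verbose': -1,              # silence training logs
             'objective': 'regression'}  # squared-error regression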

In [19]:
lgb_model = lgb.train(params = lgb_param, train_set = lgb_dtrain)
lgb_model_predict = lgb_model.predict(test_x)
# mean_squared_error expects (y_true, y_pred); the value is the same either way
print("RMSE: {}".format(sqrt(mean_squared_error(test_y, lgb_model_predict))))



RMSE: 212217.42594653403
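
RMSE is the square root of the mean squared difference between actual and predicted prices, so it is expressed in the same unit as `price`. The sketch below spells the formula out in plain Python; the function `rmse` is a hypothetical stand-in for `sqrt(mean_squared_error(...))`, and the predicted values are made up for illustration.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean squared error of two equally long sequences of numbers."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Actual prices are the first two rows of data.head(); predictions are invented.
actual = [221900.0, 538000.0]
predicted = [251900.0, 508000.0]  # off by +30000 and -30000
print(rmse(actual, predicted))  # 30000.0
```

Because each error is squared, the result does not depend on which sequence is passed first, and a few large misses weigh more heavily than many small ones.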


In [30]:
import shap
explainer = shap.TreeExplainer(lgb_model) # explainer object for computing SHAP values of a tree model
shap_values = explainer.shap_values(test_x) # compute the SHAP values for the test set

ImportError: Numba needs NumPy 1.20 or less
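
The error comes from Numba, which `shap` imports, rejecting the installed NumPy version. A minimal sketch of that comparison follows; the helpers `major_minor` and `satisfies_numba_limit` are hypothetical, and the `(1, 20)` limit is taken only from the error message above, so it applies to this particular Numba installation and may differ for others.

```python
def major_minor(version):
    """Turn a version string such as '1.21.4' into the tuple (1, 21)."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def satisfies_numba_limit(numpy_version, limit=(1, 20)):
    """True if the NumPy major.minor version is at or below the limit from the error message."""
    return major_minor(numpy_version) <= limit


# Example version strings; in the notebook, np.__version__ would be passed instead.
for v in ("1.19.5", "1.20.3", "1.21.4"):
    print(v, satisfies_numba_limit(v))
# 1.19.5 True
# 1.20.3 True
# 1.21.4 False
```

If the check fails for `np.__version__`, two possible ways out are installing an older NumPy (for example `!pip install "numpy<1.21"`) or upgrading Numba to a release that supports the installed NumPy, then restarting the kernel before importing `shap` again.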