# SC23x 
## Applied Predictive Modeling

 이번 스프린트 챌린지에서는 시카고에 있는 식당들의 정보와 해당 식당들의 위생 검사 결과에 관한 정보를 담은 데이터셋을 다루게 됩니다.

데이터셋에 관한 설명은 [PDF 문서](https://data.cityofchicago.org/api/assets/BAD5301B-681A-4202-9D25-51B2CAE672FF)를 참고해주시길 바랍니다.

#### 목표: 오늘 여러분은 Chicago시의 공중보건부에서 진행한 레스토랑들의 위생 검사 "불합격" 여부를 예측하는 모델을 만들어야 합니다.

여러분의 모델이 예측할 target은 `Inspection Fail` 칼럼입니다.   
칼럼 값은 아래와 같습니다:
- 식당이 위생 검사에 불합격한 경우: **1**
- 식당이 검사를 통과한 경우: **0**

In [1]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.filterwarnings("ignore")

In [2]:
# 기본도구
import pandas as pd
import numpy as np 

# Pipeline
from sklearn.pipeline import Pipeline
from sklearn.pipeline import make_pipeline

# 전처리도구
from category_encoders import OneHotEncoder,TargetEncoder,OrdinalEncoder

from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

# 모델
from xgboost import XGBRFClassifier
import xgboost as xgb
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# selection 도구
from sklearn.model_selection import cross_val_score, train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.metrics import roc_auc_score

# 검증도구
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
from sklearn.metrics import classification_report

# 시각화
import eli5
import shap
from pdpbox.pdp import pdp_isolate, pdp_plot
from eli5.sklearn import PermutationImportance
from pdpbox.pdp import pdp_interact, pdp_interact_plot

In [3]:
# 데이터셋을 불러오기 위해 판다스 라이브러리를 불러옵니다


train_url = 'https://ds-lecture-data.s3.ap-northeast-2.amazonaws.com/food_inspection_sc23x/food_ins_train.csv'
test_url  = 'https://ds-lecture-data.s3.ap-northeast-2.amazonaws.com/food_inspection_sc23x/food_ins_test.csv'

# train, test 데이터셋을 불러옵니다
train = pd.read_csv(train_url)
test  = pd.read_csv(test_url)

# 데이터셋 확인
assert train.shape == (60000, 17)
assert test.shape  == (20000, 17)

# Part 1: 데이터 전처리 (Data Preprocessing)

## 1.1 데이터셋을 파악하기 위한 EDA를 진행하세요
> EDA를 하는 방식 및 라이브러리에 대한 제한은 없습니다. 단, **시간 분배**에 주의하세요.

In [4]:
train.shape, test.shape

((60000, 17), (20000, 17))

In [5]:
train.head(2)

Unnamed: 0,Inspection ID,DBA Name,AKA Name,License #,Facility Type,Risk,Address,City,State,Zip,Inspection Date,Inspection Type,Violations,Latitude,Longitude,Location,Inspection Fail
0,2050629,MY SWEET STATION INC,MY SWEET STATION,2327223.0,Restaurant,Risk 1 (High),2511 N LINCOLN AVE,CHICAGO,IL,60614.0,2017-05-18,Canvass,,41.927577,-87.651528,"(-87.65152817242594, 41.92757677830966)",0
1,2078428,OUTTAKES,RED MANGO,2125004.0,Restaurant,Risk 2 (Medium),10 S DEARBORN ST FL,CHICAGO,IL,60603.0,2017-08-14,Canvass,"34. FLOORS: CONSTRUCTED PER CODE, CLEANED, GOO...",41.881807,-87.629543,"(-87.62954311539407, 41.88180696006542)",0


In [6]:
train.isnull().sum()

Inspection ID          0
DBA Name               0
AKA Name             717
License #              4
Facility Type       1427
Risk                  24
Address                0
City                  45
State                 15
Zip                   13
Inspection Date        0
Inspection Type        0
Violations         15870
Latitude             178
Longitude            178
Location             178
Inspection Fail        0
dtype: int64

In [7]:
train.columns

Index(['Inspection ID', 'DBA Name', 'AKA Name', 'License #', 'Facility Type',
       'Risk', 'Address', 'City', 'State', 'Zip', 'Inspection Date',
       'Inspection Type', 'Violations', 'Latitude', 'Longitude', 'Location',
       'Inspection Fail'],
      dtype='object')

In [8]:
train["DBA Name"].unique()

array(['MY SWEET STATION INC', 'OUTTAKES', 'JAFFA BAGELS', ...,
       'NEW CELEBRITY LOUNGE INC', 'ITALIAN EXPRESS RESTAURANT',
       'RAFAEL CRUZ'], dtype=object)

In [9]:
train["AKA Name"].unique()

array(['MY SWEET STATION', 'RED MANGO', 'JAFFA BAGELS', ...,
       'NEW CELEBRITY LOUNGE INC', 'ITALIAN EXPRESS RESTAURANT',
       'RAFAEL CRUZ'], dtype=object)

In [10]:
train["Risk"].unique()

array(['Risk 1 (High)', 'Risk 2 (Medium)', 'Risk 3 (Low)', 'All', nan],
      dtype=object)

In [11]:
train["Facility Type"].unique()

array(['Restaurant', 'Mobile Food Dispenser', 'School', 'Grocery Store',
       "Children's Services Facility", 'Mobile Food Preparer',
       'watermelon house', 'Long Term Care', nan, 'Bakery', 'theater',
       'Daycare Above and Under 2 Years', 'Mobile Frozen Desserts Vendor',
       'Shared Kitchen User (Long Term)', 'Daycare (Under 2 Years)',
       'Liquor', 'Daycare (2 - 6 Years)', 'Mobile Prepared Food Vendor',
       'Daycare Combo 1586', 'STORE', 'PALETERIA /ICECREAM SHOP',
       'TAVERN', 'convenience store', 'Hospital', 'Catering',
       'MOBILE DESSERTS VENDOR', 'COLLEGE', 'THEATRE', 'Social Club',
       'BANQUET HALL', 'FITNESS CENTER', 'Shelter', 'WINE STORE',
       'DAYCARE', 'GROCERY STORE/TAQUERIA', 'Wholesale',
       'GROCERY(GAS STATION)', 'GROCERY STORE/BAKERY',
       'GROCERY/RESTAURANT', 'grocery/butcher', 'Special Event',
       'ROOFTOP', 'Pop-Up Establishment Host-Tier II', 'LIVE POULTRY',
       'Golden Diner', 'DISTRIBUTION CENTER',
       'EXERCISE A

In [12]:
train["City"].unique()

array(['CHICAGO', 'CICERO', 'NILES NILES', 'MAYWOOD', 'Chicago',
       'INACTIVE', nan, 'CHICAGOCHICAGO', 'CHicago', 'EVANSTON',
       'ELK GROVE VILLAGE', 'CHCICAGO', 'chicago', 'OAK PARK', 'CCHICAGO',
       'ROSEMONT', 'LAKE ZURICH', 'EAST HAZEL CREST', 'BURNHAM',
       'CHARLES A HAYES', '312CHICAGO', 'SKOKIE', 'GRIFFITH',
       'SCHAUMBURG', 'BOLINGBROOK', 'HIGHLAND PARK', 'ELMHURST',
       'PLAINFIELD', 'WHEATON', 'CALUMET CITY', 'STREAMWOOD', 'CHICAGO.',
       'SUMMIT', 'BRIDGEVIEW', 'WESTMONT', 'WORTH', 'NEW HOLSTEIN',
       'BANNOCKBURNDEERFIELD', 'alsip', 'CHICAGOI'], dtype=object)

In [13]:
train["State"].unique()

array(['IL', nan, 'IN', 'WI'], dtype=object)

In [14]:
train["Violations"].unique()

array([nan,
       '34. FLOORS: CONSTRUCTED PER CODE, CLEANED, GOOD REPAIR, COVING INSTALLED, DUST-LESS CLEANING METHODS USED - Comments: OBSERVED BOXES OF POTATO CHIPS STORED ON THE FLOOR OF DRY STORAGE. INSTRUCTED TO STORE ALL FOOD 6 INCHES ABOVE THE FLOOR TO PROPERLY CLEAN THE FLOOR. | 38. VENTILATION: ROOMS AND EQUIPMENT VENTED AS REQUIRED: PLUMBING: INSTALLED AND MAINTAINED - Comments: OBSERVED THE WASTE PIPE LEAKING AT THE 3 COMPARTMENT SINK MIDDLE BASIN. INSTRUCTED TO REPAIR AND MAINTAIN ALL PLUMBING.  | 42. APPROPRIATE METHOD OF HANDLING OF FOOD (ICE) HAIR RESTRAINTS AND CLEAN APPAREL WORN - Comments: OBSERVED A FOOD HANDLER WITH NO HAIR RESTRAINT. INSTRUCTED TO WEAR A HAIR RESTRAINT WHEN HANDLING FOOD.',
       '30. FOOD IN ORIGINAL CONTAINER, PROPERLY LABELED: CUSTOMER ADVISORY POSTED AS NEEDED - Comments: MUST DATE AND LABEL ALL LARGE PLASTIC TUB CONTAINERS HOLDING CHICKEN IN THE BOTH 2-SLIDING DOOR COOLERS IN THE REAR PART OF THE STORE WHERE THERE ARE 3 SMALL CUSTOMER TABLE

In [15]:
train["Inspection Type"].unique()

array(['Canvass', 'Complaint', 'License Re-Inspection', 'License',
       'Canvass Re-Inspection', 'Short Form Complaint',
       'License-Task Force', 'Complaint Re-Inspection', 'LICENSE REQUEST',
       'Complaint-Fire', 'Tag Removal', 'Task Force Liquor 1475',
       'Suspected Food Poisoning', 'Consultation',
       'Complaint-Fire Re-inspection', 'Recent Inspection',
       'Recent inspection', 'Suspected Food Poisoning Re-inspection',
       'CANVASS RE INSPECTION OF CLOSE UP', 'OUT OF BUSINESS',
       'Out of Business', 'Special Events (Festivals)',
       'Short Form Fire-Complaint', 'DAY CARE LICENSE RENEWAL',
       'No Entry', 'Pre-License Consultation', 'Package Liquor 1474',
       'Non-Inspection', 'CHANGED COURT DATE', 'TASKFORCE', 'Not Ready',
       'SFP/COMPLAINT', 'O.B.', 'task force', 'error save',
       'REINSPECTION', 'TASK FORCE PACKAGE GOODS 1474',
       'CANVASS SPECIAL EVENTS', 'LIQOUR TASK FORCE NOT READY',
       'TASK FORCE LIQUOR (1481)', 'Duplicated',


In [16]:
train["Inspection Fail"].unique()

array([0, 1], dtype=int64)

In [17]:
train.dtypes

Inspection ID        int64
DBA Name            object
AKA Name            object
License #          float64
Facility Type       object
Risk                object
Address             object
City                object
State               object
Zip                float64
Inspection Date     object
Inspection Type     object
Violations          object
Latitude           float64
Longitude          float64
Location            object
Inspection Fail      int64
dtype: object

In [18]:
train["Inspection Fail"].value_counts(normalize=True)

0    0.804733
1    0.195267
Name: Inspection Fail, dtype: float64

## 1.2 EDA의 결과를 토대로 Feature Engineering 및 Preprocessing을 진행하세요
> 새로운 feature를 만드는 작업뿐만이 아니라, 필요한 feature가 적절한 데이터 타입을 가지고 있지 않다면 변환합니다

In [19]:
# risk에 대한 feature 변경 
train.loc[(train.Risk =="Risk 1 (High)"), "Risk"] = 3
train.loc[(train.Risk =='Risk 2 (Medium)'), "Risk"] = 2
train.loc[(train.Risk =='Risk 3 (Low)'), "Risk"] = 1
train.loc[(train.Risk =="All"), "Risk"] = 4

test.loc[(train.Risk =="Risk 1 (High)"), "Risk"] = 3
test.loc[(train.Risk =='Risk 2 (Medium)'), "Risk"] = 2
test.loc[(train.Risk =='Risk 3 (Low)'), "Risk"] = 1
test.loc[(train.Risk =="All"), "Risk"] = 4

In [20]:
# 시간 순서대로 정렬 ( 시간적인 의미를 가지고 있기 때문에)
train = train.sort_values(by=["Inspection Date"], axis=0)

In [21]:
for i in range(7):
    train[f'Violation{i}'] = train.Violations.str.split('|').str[i]
    train[f'Violation{i}'] = train[f'Violation{i}'].str[:2]
    train[f'Violation{i}'] = pd.to_numeric(train[f'Violation{i}']).fillna(0)

In [22]:
for i in range(7):
    test[f'Violation{i}'] = test.Violations.str.split('|').str[i]
    test[f'Violation{i}'] = test[f'Violation{i}'].str[:2]
    test[f'Violation{i}'] = pd.to_numeric(test[f'Violation{i}']).fillna(0)

# Part 2: 모델링 (Modeling)

## 2.1 검증 방식 (Cross-validation / Hold-out Validation)을 정한 후 데이터셋을 목적에 맞게 분할하세요

In [23]:
target = 'Inspection Fail'
features = train.columns.drop([target])

train, val = train_test_split(train, train_size=0.80, test_size=0.2, stratify=train[target], random_state=42)

X_train = train[features]
y_train = train[target]
X_val = val[features]
y_val = val[target]

X_test = test[features]
y_test = test[target]

## 2.2 모델 학습을 위한 파이프라인을 구축 후 학습(fit)까지 진행하세요
> 모델은 scikit-learn, xgboost, lightgbm 등 어떤 라이브러리를 사용하셔도 괜찮지만 특정 라이브러리는 **설치 및 설정에 시간이 소요되는 점**을 감안하시기 바랍니다

학습한 내용을 확인하기 위해 xgboosting까지 활용해 보았습니다.

In [24]:
encoder = OrdinalEncoder()
X_train_encoded = encoder.fit_transform(X_train) # 학습데이터
X_val_encoded = encoder.transform(X_val) # 검증데이터
X_test_encoded = encoder.fit_transform(X_test) # 학습데이터

boosting = XGBRFClassifier(
    n_estimators=1000,
    objective='reg:squarederror', # default
    learning_rate=0.01,
    n_jobs=-1
)

eval_set = [(X_train_encoded, y_train), 
            (X_val_encoded, y_val)]

boosting.fit(X_train_encoded, y_train, 
          eval_set=eval_set,
          early_stopping_rounds=50
         )

[0]	validation_0-rmse:0.49643	validation_1-rmse:0.49683


XGBRFClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
                colsample_bytree=1, gamma=0, gpu_id=-1, importance_type='gain',
                interaction_constraints='', learning_rate=0.01,
                max_delta_step=0, max_depth=6, min_child_weight=1, missing=nan,
                monotone_constraints='()', n_estimators=1000, n_jobs=-1,
                num_parallel_tree=1000, objective='reg:squarederror',
                random_state=0, reg_alpha=0, scale_pos_weight=1,
                tree_method='exact', validate_parameters=1, verbosity=None)

In [25]:
print('xg 훈련세트 정확도', boosting.score(X_train_encoded, y_train).round(7))
print('xg 검증세트 정확도', boosting.score(X_val_encoded, y_val).round(7))
print('xg 테스트세트 정확도', boosting.score(X_test_encoded, y_test).round(7))

xg 훈련세트 정확도 0.9168125
xg 검증세트 정확도 0.8116667
xg 테스트세트 정확도 0.81265


In [26]:
y_pred = boosting.predict(X_test_encoded)
print("accuracy_score : ", accuracy_score(y_pred, y_test))
print("recall_score : ", recall_score(y_pred, y_test))
print("precision_score : ", precision_score(y_pred, y_test))
print("f1_score : ", f1_score(y_pred, y_test))

accuracy_score :  0.81265
recall_score :  0.8081632653061225
precision_score :  0.05079527963057978
f1_score :  0.0955829109341057


In [27]:
model = xgb.XGBClassifier()

pipe = Pipeline([
    ("le", OrdinalEncoder()),
    ('standard_scaler', StandardScaler()), 
    ('imputer', SimpleImputer()),
    ('model', model)
])

pipe.fit(X_train, y_train)



Pipeline(steps=[('le',
                 OrdinalEncoder(cols=['DBA Name', 'AKA Name', 'Facility Type',
                                      'Risk', 'Address', 'City', 'State',
                                      'Inspection Date', 'Inspection Type',
                                      'Violations', 'Location'],
                                mapping=[{'col': 'DBA Name',
                                          'data_type': dtype('O'),
                                          'mapping': SUBWAY AT NORWEGIAN AMERICAN HOSPITAL                     1
PASTA AND BURGER SALON                                    2
TODDLER TOWN DAY CARE TOO                                 3
UMAIYA CAFE                                               4
Halsted Street Deli                                       5
                                                      ...  
PALERIA Y NEVERI...
                               colsample_bytree=1, gamma=0, gpu_id=-1,
                               importance_type='gai

In [28]:
print('검증 정확도: ', pipe.score(X_val, y_val))

검증 정확도:  0.8645


In [29]:
y_pred = pipe.predict(X_val)
print(classification_report(y_val, y_pred))

              precision    recall  f1-score   support

           0       0.86      0.99      0.92      9657
           1       0.91      0.34      0.49      2343

    accuracy                           0.86     12000
   macro avg       0.89      0.67      0.71     12000
weighted avg       0.87      0.86      0.84     12000



In [30]:
y_pred_proba = pipe.predict_proba(X_val)[:, -1]
print('AUC score: ', roc_auc_score(y_val, y_pred_proba))

AUC score:  0.9394936461473613


In [31]:
y_pred = pipe.predict(X_test)
print("accuracy_score : ", accuracy_score(y_pred, y_test).round(7))
print("recall_score : ", recall_score(y_pred, y_test).round(7))
print("precision_score : ", precision_score(y_pred, y_test).round(7))
print("f1_score : ", f1_score(y_pred, y_test).round(7))

accuracy_score :  0.86405
recall_score :  0.8980419
precision_score :  0.3412006
f1_score :  0.4945157


In [32]:
pipe_rf_tag = make_pipeline(
    TargetEncoder(), 
    SimpleImputer(), 
    RandomForestClassifier(max_depth=6,
                          criterion='entropy',
                          n_jobs=-1,
                          min_samples_leaf=3,
                          random_state=100)
)

pipe_rf_tag.fit(X_train, y_train)
print('rf 훈련세트 정확도', pipe_rf_tag.score(X_train, y_train).round(7))
print('rf 훈련세트 f1 score', f1_score(y_train, pipe_rf_tag.predict(X_train).round(7)))

rf 훈련세트 정확도 0.9177083
rf 훈련세트 f1 score 0.7863941163746485


In [33]:
pipe_rf_ordi = make_pipeline(
    OrdinalEncoder(), 
    SimpleImputer(), 
    RandomForestClassifier(max_depth=6,
                          criterion='entropy',
                          n_jobs=-1,
                          min_samples_leaf=3,
                          random_state=100)
)

pipe_rf_ordi.fit(X_train, y_train)
print('rf 훈련세트 정확도', pipe_rf_ordi.score(X_train, y_train).round(7))
print('rf 훈련세트 f1 score', f1_score(y_train, pipe_rf_ordi.predict(X_train).round(7)))

rf 훈련세트 정확도 0.8781458
rf 훈련세트 f1 score 0.6252802870139023


In [34]:
pipe_rf_ordi.fit(X_train, y_train)
print('rf 훈련세트 정확도', pipe_rf_ordi.score(X_train, y_train).round(7))
print('rf 훈련세트 f1 score', f1_score(y_train, pipe_rf_ordi.predict(X_train).round(7)))

rf 훈련세트 정확도 0.8781458
rf 훈련세트 f1 score 0.6252802870139023


## 2.3 테스트셋의 ROC / AUC 검증 점수를 예측합니다
> 제작한 모델을 활용해서 테스트셋의 **확률**을 예측하세요 (`ROC/AUC 검증 점수`를 **0.65 혹은 그 이상** 달성한다면 매우 훌륭한 모델입니다). 명시된 성능이 나오지 않았더라도 이 문제에 모든 시간을 소요하시면 안됩니다 (점수가 안나오는 경우 여러분이 구축한 파이프라인 모델링 및 모델링 과정으로 평가합니다)

**★피드백 주신 부분에 대한 추가작성 코드★**

In [35]:
from sklearn.metrics import roc_auc_score

y_pred_proba = pipe.predict_proba(X_val)[:, -1]
print('AUC score: ', roc_auc_score(y_val, y_pred_proba).round(7))

AUC score:  0.9394936


In [36]:
print('rf 훈련 정확도 : ',pipe.score(X_train, y_train).round(7))
print('rf 검증 정확도 : ',pipe.score(X_val, y_val).round(7))

rf 훈련 정확도 :  0.9688542
rf 검증 정확도 :  0.8645


In [37]:
print('rf 훈련 정확도 : ',pipe_rf_ordi.score(X_train, y_train).round(7))
print('rf 검증 정확도 : ',pipe_rf_ordi.score(X_val, y_val).round(7))

rf 훈련 정확도 :  0.8781458
rf 검증 정확도 :  0.86775


In [38]:
print('rf 훈련 f1 score : ',f1_score(y_train, pipe.predict(X_train).round(7)))
print('rf 검증 f1 score : ',f1_score(y_val, pipe.predict(X_val).round(7)))

rf 훈련 f1 score :  0.9183194011910616
rf 검증 f1 score :  0.4944029850746269


In [39]:
print('rf 훈련 f1 score : ',f1_score(y_train, pipe_rf_ordi.predict(X_train).round(7)))
print('rf 검증 f1 score : ',f1_score(y_val, pipe_rf_ordi.predict(X_val).round(7)))

rf 훈련 f1 score :  0.6252802870139023
rf 검증 f1 score :  0.5443583118001722


## 2.4 하이퍼 파라미터 튜닝을 통해서 모델을 개선하세요
> `RandomSearchCV`, `GridSearchCV` 등을 활용해서 모델의 성능을 개선합니다. 범위 설정에 따라 시간이 매우 소요될 수 있습니다.

In [40]:
model = xgb.XGBClassifier()

pipeline = Pipeline([
    ("le", OrdinalEncoder()),
    ('standard_scaler', StandardScaler()), 
    ('imputer', SimpleImputer()),
    ('model', model)
])

param_grid = {
    'model__max_depth': [2, 3, 5, 7, 10],
    'model__n_estimators': [10, 100, 500],
}

grid = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1, scoring='f1')

In [None]:
grid.fit(X_train, y_train)

In [None]:
print(f"Best parameters : {grid.best_params_}")
grid_model = grid.best_estimator_

In [None]:
grid_model.fit(X_train, y_train)

In [None]:
print('rf 훈련세트 정확도', grid_model.score(X_train, y_train))

In [None]:
y_pred = grid_model.predict(X_test)
print("accuracy_score : ", accuracy_score(y_pred, y_test).round(7))
print("recall_score : ", recall_score(y_pred, y_test).round(7))
print("precision_score : ", precision_score(y_pred, y_test).round(7))
print("f1_score : ", f1_score(y_pred, y_test).round(7))

# Part 3: 시각화 (Visualization)

> 모델의 해석을 위한 시각화를 해주세요. 아래의 제시 된 종류 중 **2가지**를 선택하세요 (시각화에서 가장 중요한 것은 **여러분의 해석**입니다):
> - Permutation Importances
> - Partial Dependence Plot, 1 feature isolation
> - Partial Dependence Plot, 2 features interaction
> - Shapley Values (SHAP)

3-1) 시각화 - Partial Dependence Plot, 1 feature isolation

In [None]:
feature = 'Risk'

isolated = pdp_isolate(
    model=pipe_rf_ordi, 
    dataset=X_train, 
    model_features=X_train.columns, 
    feature=feature,
    grid_type='percentile', # default='percentile', or 'equal'
    num_grid_points=10 # default=10
)
pdp_plot(isolated, feature_name=feature);

★ 그래프 해석 ★  
 -. risk 1~2사이에 있는 식당들은 점검 후 적극적인 피드백에 의해 0보다 작은 곳에 위치하는 것 같고  
 -. risk3이상의 식당은 이미 회복하기에는 너무 큰 위반사항들이 있어서 폐업으로 이어지는 것으로 관측된다.   

3-2) 시각화 - Partial Dependence Plot, 2 features interaction

In [None]:
features = ['Risk', 'Violation5']

interaction = pdp_interact(
    model=boosting, 
    dataset=X_train_encoded, 
    model_features=X_train_encoded.columns, 
    features=features
)

pdp_interact_plot(interaction, plot_type='grid', feature_names=features);

★ 그래프 해석 ★  
 -. violation5가 많이 적발될 수록 risk가 높은 그룹에 해당되었다.   
 -. risk가 높은 그룹은 페업활률이 높아진다.   

3-3) Permutation Importances  

In [None]:
# permuter 정의
permuter = PermutationImportance(
    model,
    n_iter=5, 
    random_state=2).fit(X_val_encoded, y_val)


In [None]:
feature_names = X_val_encoded.columns.tolist()
pd.Series(permuter.feature_importances_, feature_names).sort_values(ascending=False)

In [None]:
row = X_train.iloc[[1000]] 
row

In [None]:
model.predict(row)

In [None]:
explainer = shap.TreeExplainer(boosting)
shap_values = explainer.shap_values(X_test_encoded.iloc[:3300])

In [None]:
X_test_encoded

In [None]:
shap.initjs()
shap.force_plot(
    base_value=explainer.expected_value, 
    shap_values=shap_values,
    features=X_test
)