# SC23x 
## Applied Predictive Modeling

 이번 스프린트 챌린지에서는 시카고에 있는 식당들의 정보와 해당 식당들의 위생 검사 결과에 관한 정보를 담은 데이터셋을 다루게 됩니다.

데이터셋에 관한 설명은 [PDF 문서](https://data.cityofchicago.org/api/assets/BAD5301B-681A-4202-9D25-51B2CAE672FF)를 참고해주시길 바랍니다.

#### 목표: 오늘 여러분은 Chicago시의 공중보건부에서 진행한 레스토랑들의 위생 검사 "불합격" 여부를 예측하는 모델을 만들어야 합니다.

여러분의 모델이 예측할 target은 `Inspection Fail` 칼럼입니다.   
칼럼 값은 아래와 같습니다:
- 식당이 위생 검사에 불합격한 경우: **1**
- 식당이 검사를 통과한 경우: **0**

In [1]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [2]:
# 데이터셋을 불러오기 위해 판다스 라이브러리를 불러옵니다
import pandas as pd

train_url = 'https://ds-lecture-data.s3.ap-northeast-2.amazonaws.com/food_inspection_sc23x/food_ins_train.csv'
test_url  = 'https://ds-lecture-data.s3.ap-northeast-2.amazonaws.com/food_inspection_sc23x/food_ins_test.csv'

# train, test 데이터셋을 불러옵니다
train = pd.read_csv(train_url)
test  = pd.read_csv(test_url)

# 데이터셋 확인
assert train.shape == (60000, 17)
assert test.shape  == (20000, 17)

# Part 1: 데이터 전처리 (Data Preprocessing)

## 1.1 데이터셋을 파악하기 위한 EDA를 진행하세요
> EDA를 하는 방식 및 라이브러리에 대한 제한은 없습니다. 단, **시간 분배**에 주의하세요.

In [3]:
train.shape, test.shape

((60000, 17), (20000, 17))

In [4]:
train.head()

Unnamed: 0,Inspection ID,DBA Name,AKA Name,License #,Facility Type,Risk,Address,City,State,Zip,Inspection Date,Inspection Type,Violations,Latitude,Longitude,Location,Inspection Fail
0,2050629,MY SWEET STATION INC,MY SWEET STATION,2327223.0,Restaurant,Risk 1 (High),2511 N LINCOLN AVE,CHICAGO,IL,60614.0,2017-05-18,Canvass,,41.927577,-87.651528,"(-87.65152817242594, 41.92757677830966)",0
1,2078428,OUTTAKES,RED MANGO,2125004.0,Restaurant,Risk 2 (Medium),10 S DEARBORN ST FL,CHICAGO,IL,60603.0,2017-08-14,Canvass,"34. FLOORS: CONSTRUCTED PER CODE, CLEANED, GOO...",41.881807,-87.629543,"(-87.62954311539407, 41.88180696006542)",0
2,1591748,JAFFA BAGELS,JAFFA BAGELS,2278918.0,Restaurant,Risk 1 (High),225 N MICHIGAN AVE,CHICAGO,IL,60601.0,2015-12-15,Complaint,"30. FOOD IN ORIGINAL CONTAINER, PROPERLY LABEL...",41.886377,-87.624382,"(-87.62438167043969, 41.88637740620821)",0
3,1230035,FRANKS 'N' DAWGS,FRANKS 'N' DAWGS,2094329.0,Restaurant,Risk 1 (High),1863 N CLYBOURN AVE,CHICAGO,IL,60614.0,2012-07-10,Canvass,33. FOOD AND NON-FOOD CONTACT EQUIPMENT UTENSI...,41.91499,-87.654994,"(-87.65499361162448, 41.91498953039437)",0
4,1228186,SOUTH COAST,SOUTH COAST SUSHI,1817424.0,Restaurant,Risk 1 (High),1700 S MICHIGAN AVE,CHICAGO,IL,60616.0,2013-09-20,Canvass,,41.858996,-87.624106,"(-87.62410566978502, 41.85899630014676)",0


In [5]:
train.isnull().sum()

Inspection ID          0
DBA Name               0
AKA Name             717
License #              4
Facility Type       1427
Risk                  24
Address                0
City                  45
State                 15
Zip                   13
Inspection Date        0
Inspection Type        0
Violations         15870
Latitude             178
Longitude            178
Location             178
Inspection Fail        0
dtype: int64

In [6]:
train.columns

Index(['Inspection ID', 'DBA Name', 'AKA Name', 'License #', 'Facility Type',
       'Risk', 'Address', 'City', 'State', 'Zip', 'Inspection Date',
       'Inspection Type', 'Violations', 'Latitude', 'Longitude', 'Location',
       'Inspection Fail'],
      dtype='object')

In [7]:
train["DBA Name"].unique()

array(['MY SWEET STATION INC', 'OUTTAKES', 'JAFFA BAGELS', ...,
       'NEW CELEBRITY LOUNGE INC', 'ITALIAN EXPRESS RESTAURANT',
       'RAFAEL CRUZ'], dtype=object)

In [8]:
train["AKA Name"].unique()

array(['MY SWEET STATION', 'RED MANGO', 'JAFFA BAGELS', ...,
       'NEW CELEBRITY LOUNGE INC', 'ITALIAN EXPRESS RESTAURANT',
       'RAFAEL CRUZ'], dtype=object)

In [9]:
train["Risk"].unique()

array(['Risk 1 (High)', 'Risk 2 (Medium)', 'Risk 3 (Low)', 'All', nan],
      dtype=object)

In [10]:
train["Facility Type"].unique()

array(['Restaurant', 'Mobile Food Dispenser', 'School', 'Grocery Store',
       "Children's Services Facility", 'Mobile Food Preparer',
       'watermelon house', 'Long Term Care', nan, 'Bakery', 'theater',
       'Daycare Above and Under 2 Years', 'Mobile Frozen Desserts Vendor',
       'Shared Kitchen User (Long Term)', 'Daycare (Under 2 Years)',
       'Liquor', 'Daycare (2 - 6 Years)', 'Mobile Prepared Food Vendor',
       'Daycare Combo 1586', 'STORE', 'PALETERIA /ICECREAM SHOP',
       'TAVERN', 'convenience store', 'Hospital', 'Catering',
       'MOBILE DESSERTS VENDOR', 'COLLEGE', 'THEATRE', 'Social Club',
       'BANQUET HALL', 'FITNESS CENTER', 'Shelter', 'WINE STORE',
       'DAYCARE', 'GROCERY STORE/TAQUERIA', 'Wholesale',
       'GROCERY(GAS STATION)', 'GROCERY STORE/BAKERY',
       'GROCERY/RESTAURANT', 'grocery/butcher', 'Special Event',
       'ROOFTOP', 'Pop-Up Establishment Host-Tier II', 'LIVE POULTRY',
       'Golden Diner', 'DISTRIBUTION CENTER',
       'EXERCISE A

In [11]:
train["City"].unique()

array(['CHICAGO', 'CICERO', 'NILES NILES', 'MAYWOOD', 'Chicago',
       'INACTIVE', nan, 'CHICAGOCHICAGO', 'CHicago', 'EVANSTON',
       'ELK GROVE VILLAGE', 'CHCICAGO', 'chicago', 'OAK PARK', 'CCHICAGO',
       'ROSEMONT', 'LAKE ZURICH', 'EAST HAZEL CREST', 'BURNHAM',
       'CHARLES A HAYES', '312CHICAGO', 'SKOKIE', 'GRIFFITH',
       'SCHAUMBURG', 'BOLINGBROOK', 'HIGHLAND PARK', 'ELMHURST',
       'PLAINFIELD', 'WHEATON', 'CALUMET CITY', 'STREAMWOOD', 'CHICAGO.',
       'SUMMIT', 'BRIDGEVIEW', 'WESTMONT', 'WORTH', 'NEW HOLSTEIN',
       'BANNOCKBURNDEERFIELD', 'alsip', 'CHICAGOI'], dtype=object)

In [12]:
train["State"].unique()

array(['IL', nan, 'IN', 'WI'], dtype=object)

In [13]:
train["Violations"].unique()

array([nan,
       '34. FLOORS: CONSTRUCTED PER CODE, CLEANED, GOOD REPAIR, COVING INSTALLED, DUST-LESS CLEANING METHODS USED - Comments: OBSERVED BOXES OF POTATO CHIPS STORED ON THE FLOOR OF DRY STORAGE. INSTRUCTED TO STORE ALL FOOD 6 INCHES ABOVE THE FLOOR TO PROPERLY CLEAN THE FLOOR. | 38. VENTILATION: ROOMS AND EQUIPMENT VENTED AS REQUIRED: PLUMBING: INSTALLED AND MAINTAINED - Comments: OBSERVED THE WASTE PIPE LEAKING AT THE 3 COMPARTMENT SINK MIDDLE BASIN. INSTRUCTED TO REPAIR AND MAINTAIN ALL PLUMBING.  | 42. APPROPRIATE METHOD OF HANDLING OF FOOD (ICE) HAIR RESTRAINTS AND CLEAN APPAREL WORN - Comments: OBSERVED A FOOD HANDLER WITH NO HAIR RESTRAINT. INSTRUCTED TO WEAR A HAIR RESTRAINT WHEN HANDLING FOOD.',
       '30. FOOD IN ORIGINAL CONTAINER, PROPERLY LABELED: CUSTOMER ADVISORY POSTED AS NEEDED - Comments: MUST DATE AND LABEL ALL LARGE PLASTIC TUB CONTAINERS HOLDING CHICKEN IN THE BOTH 2-SLIDING DOOR COOLERS IN THE REAR PART OF THE STORE WHERE THERE ARE 3 SMALL CUSTOMER TABLE

In [14]:
train["Inspection Type"].unique()

array(['Canvass', 'Complaint', 'License Re-Inspection', 'License',
       'Canvass Re-Inspection', 'Short Form Complaint',
       'License-Task Force', 'Complaint Re-Inspection', 'LICENSE REQUEST',
       'Complaint-Fire', 'Tag Removal', 'Task Force Liquor 1475',
       'Suspected Food Poisoning', 'Consultation',
       'Complaint-Fire Re-inspection', 'Recent Inspection',
       'Recent inspection', 'Suspected Food Poisoning Re-inspection',
       'CANVASS RE INSPECTION OF CLOSE UP', 'OUT OF BUSINESS',
       'Out of Business', 'Special Events (Festivals)',
       'Short Form Fire-Complaint', 'DAY CARE LICENSE RENEWAL',
       'No Entry', 'Pre-License Consultation', 'Package Liquor 1474',
       'Non-Inspection', 'CHANGED COURT DATE', 'TASKFORCE', 'Not Ready',
       'SFP/COMPLAINT', 'O.B.', 'task force', 'error save',
       'REINSPECTION', 'TASK FORCE PACKAGE GOODS 1474',
       'CANVASS SPECIAL EVENTS', 'LIQOUR TASK FORCE NOT READY',
       'TASK FORCE LIQUOR (1481)', 'Duplicated',


In [15]:
train["Inspection Fail"].unique()

array([0, 1], dtype=int64)

In [16]:
train.dtypes

Inspection ID        int64
DBA Name            object
AKA Name            object
License #          float64
Facility Type       object
Risk                object
Address             object
City                object
State               object
Zip                float64
Inspection Date     object
Inspection Type     object
Violations          object
Latitude           float64
Longitude          float64
Location            object
Inspection Fail      int64
dtype: object

In [17]:
train["Inspection Fail"].value_counts(normalize=True)

0    0.804733
1    0.195267
Name: Inspection Fail, dtype: float64

## 1.2 EDA의 결과를 토대로 Feature Engineering 및 Preprocessing을 진행하세요
> 새로운 feature를 만드는 작업뿐만이 아니라, 필요한 feature가 적절한 데이터 타입을 가지고 있지 않다면 변환합니다

In [18]:
# risk에 대한 feature 변경 
train.loc[(train.Risk =="Risk 1 (High)"), "Risk"] = 3
train.loc[(train.Risk =='Risk 2 (Medium)'), "Risk"] = 2
train.loc[(train.Risk =='Risk 3 (Low)'), "Risk"] = 1
train.loc[(train.Risk =="All"), "Risk"] = 4

test.loc[(train.Risk =="Risk 1 (High)"), "Risk"] = 3
test.loc[(train.Risk =='Risk 2 (Medium)'), "Risk"] = 2
test.loc[(train.Risk =='Risk 3 (Low)'), "Risk"] = 1
test.loc[(train.Risk =="All"), "Risk"] = 4

In [19]:
# 시간 순서대로 정렬 ( 시간적인 의미를 가지고 있기 때문에)
train = train.sort_values(by=["Inspection Date"], axis=0)

In [20]:
for i in range(7):
    train[f'Violation{i}'] = train.Violations.str.split('|').str[i]
    train[f'Violation{i}'] = train[f'Violation{i}'].str[:2]
    train[f'Violation{i}'] = pd.to_numeric(train[f'Violation{i}']).fillna(0)

In [21]:
for i in range(7):
    test[f'Violation{i}'] = test.Violations.str.split('|').str[i]
    test[f'Violation{i}'] = test[f'Violation{i}'].str[:2]
    test[f'Violation{i}'] = pd.to_numeric(test[f'Violation{i}']).fillna(0)

# Part 2: 모델링 (Modeling)

## 2.1 검증 방식 (Cross-validation / Hold-out Validation)을 정한 후 데이터셋을 목적에 맞게 분할하세요

In [22]:
from category_encoders import OneHotEncoder # 테스트 목적을 위한 import
from category_encoders import TargetEncoder # 테스트목적을 위한 import
from sklearn.feature_selection import f_regression, SelectKBest
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

target = 'Inspection Fail'
features = train.columns.drop([target])

train, val = train_test_split(train, train_size=0.80, test_size=0.2, random_state=42)

X_train = train[features]
y_train = train[target]
X_val = val[features]
y_val = val[target]

X_test = test[features]
y_test = test[target]

## 2.2 모델 학습을 위한 파이프라인을 구축 후 학습(fit)까지 진행하세요
> 모델은 scikit-learn, xgboost, lightgbm 등 어떤 라이브러리를 사용하셔도 괜찮지만 특정 라이브러리는 **설치 및 설정에 시간이 소요되는 점**을 감안하시기 바랍니다

학습한 내용을 확인하기 위해 xgboosting까지 활용해 보았습니다.

In [23]:
from category_encoders import OrdinalEncoder
from sklearn.metrics import r2_score
from xgboost import XGBRegressor

encoder = OrdinalEncoder()
X_train_encoded = encoder.fit_transform(X_train) # 학습데이터
X_val_encoded = encoder.transform(X_val) # 검증데이터

boosting = XGBRegressor(
    n_estimators=1000,
    objective='reg:squarederror', # default
    learning_rate=0.2,
    n_jobs=-1
)

eval_set = [(X_train_encoded, y_train), 
            (X_val_encoded, y_val)]

boosting.fit(X_train_encoded, y_train, 
          eval_set=eval_set,
          early_stopping_rounds=50
         )


[0]	validation_0-rmse:0.43073	validation_1-rmse:0.43139
[1]	validation_0-rmse:0.37613	validation_1-rmse:0.37755
[2]	validation_0-rmse:0.33615	validation_1-rmse:0.33906
[3]	validation_0-rmse:0.30759	validation_1-rmse:0.31094
[4]	validation_0-rmse:0.28689	validation_1-rmse:0.29404
[5]	validation_0-rmse:0.27205	validation_1-rmse:0.27910
[6]	validation_0-rmse:0.26088	validation_1-rmse:0.26982
[7]	validation_0-rmse:0.25304	validation_1-rmse:0.26314
[8]	validation_0-rmse:0.24772	validation_1-rmse:0.25805
[9]	validation_0-rmse:0.24308	validation_1-rmse:0.25402
[10]	validation_0-rmse:0.23935	validation_1-rmse:0.25106
[11]	validation_0-rmse:0.23662	validation_1-rmse:0.24855
[12]	validation_0-rmse:0.23454	validation_1-rmse:0.24674
[13]	validation_0-rmse:0.23207	validation_1-rmse:0.24533
[14]	validation_0-rmse:0.23047	validation_1-rmse:0.24407
[15]	validation_0-rmse:0.22973	validation_1-rmse:0.24353
[16]	validation_0-rmse:0.22828	validation_1-rmse:0.24265
[17]	validation_0-rmse:0.22731	validation

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.2, max_delta_step=0, max_depth=6,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=1000, n_jobs=-1, num_parallel_tree=1, random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)

In [24]:
y_pred = boosting.predict(X_val_encoded)
print('R^2', r2_score(y_val, y_pred))

R^2 0.6378588455551615


In [25]:
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from category_encoders import OrdinalEncoder
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

pipe_rf_tag = make_pipeline(
    TargetEncoder(), 
    SimpleImputer(), 
    RandomForestClassifier(max_depth=6,
                          criterion='entropy',
                          n_jobs=-1,
                          min_samples_leaf=3,
                          random_state=100)
)

pipe_rf_tag.fit(X_train, y_train)
print('rf 훈련세트 정확도', pipe_rf_tag.score(X_train, y_train))
print('rf 훈련세트 f1 score', f1_score(y_train, pipe_rf_tag.predict(X_train)))

rf 훈련세트 정확도 0.9195
rf 훈련세트 f1 score 0.7861885790172642


In [26]:
from sklearn.pipeline import Pipeline

In [27]:
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from category_encoders import OrdinalEncoder
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.linear_model import Ridge

pipe_rf_ordi = make_pipeline(
    OrdinalEncoder(), 
    SimpleImputer(), 
    RandomForestClassifier(max_depth=6,
                          criterion='entropy',
                          n_jobs=-1,
                          min_samples_leaf=3,
                          random_state=100)
)

pipe_rf_ordi.fit(X_train, y_train)
print('rf 훈련세트 정확도', pipe_rf_ordi.score(X_train, y_train))
print('rf 훈련세트 f1 score', f1_score(y_train, pipe_rf_ordi.predict(X_train)))

rf 훈련세트 정확도 0.8856458333333334
rf 훈련세트 f1 score 0.6502262155101001


In [28]:
pipe_rf_ordi.fit(X_train, y_train)
print('rf 훈련세트 정확도', pipe_rf_ordi.score(X_train, y_train))
print('rf 훈련세트 f1 score', f1_score(y_train, pipe_rf_ordi.predict(X_train)))

rf 훈련세트 정확도 0.8856458333333334
rf 훈련세트 f1 score 0.6502262155101001


## 2.3 테스트셋의 ROC / AUC 검증 점수를 예측합니다
> 제작한 모델을 활용해서 테스트셋의 **확률**을 예측하세요 (`ROC/AUC 검증 점수`를 **0.65 혹은 그 이상** 달성한다면 매우 훌륭한 모델입니다). 명시된 성능이 나오지 않았더라도 이 문제에 모든 시간을 소요하시면 안됩니다 (점수가 안나오는 경우 여러분이 구축한 파이프라인 모델링 및 모델링 과정으로 평가합니다)

In [29]:
print('rf 훈련 정확도 : ',pipe_rf_tag.score(X_train, y_train))
print('rf 검증 정확도 : ',pipe_rf_tag.score(X_val, y_val))

rf 훈련 정확도 :  0.9195
rf 검증 정확도 :  0.8428333333333333


In [30]:
print('rf 훈련 정확도 : ',pipe_rf_ordi.score(X_train, y_train))
print('rf 검증 정확도 : ',pipe_rf_ordi.score(X_val, y_val))

rf 훈련 정확도 :  0.8856458333333334
rf 검증 정확도 :  0.87275


In [31]:
print('rf 훈련 f1 score : ',f1_score(y_train, pipe_rf_tag.predict(X_train)))
print('rf 검증 f1 score : ',f1_score(y_val, pipe_rf_tag.predict(X_val)))

rf 훈련 f1 score :  0.7861885790172642
rf 검증 f1 score :  0.5220476431829701


In [32]:
print('rf 훈련 f1 score : ',f1_score(y_train, pipe_rf_ordi.predict(X_train)))
print('rf 검증 f1 score : ',f1_score(y_val, pipe_rf_ordi.predict(X_val)))

rf 훈련 f1 score :  0.6502262155101001
rf 검증 f1 score :  0.5835833106081266


proba를 구해서 roc까지 구라하는 말씀이신지.....
아니면,,, 둘중에 하나만 해도 된다라는 말씀이신지....
물론..auc가.. roc의 면적인건.. 알고 있지만..

## 2.4 하이퍼 파라미터 튜닝을 통해서 모델을 개선하세요
> `RandomSearchCV`, `GridSearchCV` 등을 활용해서 모델의 성능을 개선합니다. 범위 설정에 따라 시간이 매우 소요될 수 있습니다.

could not convert string to float: 'THE MARKET'이라는 오류를 만났었는데

우리 데이터에는 MARKET이 없는데 이게 왜 뜨는지는 확인하지 못하였다

In [33]:
from scipy.stats import randint, uniform
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

param_distributions = { 
    'n_estimators': randint(50, 500), 
    'max_depth': [5, 10, 15, 20, None], 
    'max_features': uniform(0, 1), 
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=2), 
    param_distributions=param_distributions, 
    n_iter=5, 
    cv=3, 
    scoring='neg_mean_absolute_error', 
    verbose=10, 
    return_train_score=True, 
    n_jobs=-1, 
    random_state=2
)

search.fit(X_train, y_train);

Fitting 3 folds for each of 5 candidates, totalling 15 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 16 concurrent workers.
[Parallel(n_jobs=-1)]: Done   2 out of  15 | elapsed:    2.4s remaining:   16.4s
[Parallel(n_jobs=-1)]: Done   4 out of  15 | elapsed:    3.0s remaining:    8.3s
[Parallel(n_jobs=-1)]: Done   6 out of  15 | elapsed:    3.4s remaining:    5.2s
[Parallel(n_jobs=-1)]: Done   8 out of  15 | elapsed:    3.9s remaining:    3.4s
[Parallel(n_jobs=-1)]: Done  10 out of  15 | elapsed:    4.4s remaining:    2.2s
[Parallel(n_jobs=-1)]: Done  12 out of  15 | elapsed:    4.8s remaining:    1.1s
[Parallel(n_jobs=-1)]: Done  15 out of  15 | elapsed:    5.3s finished


ValueError: could not convert string to float: 'THE MARKET'

In [None]:
print('최적 하이퍼파라미터: ', search.best_params_)
print('CV MAE: ', -search.best_score_)
model = search.best_estimator_

# Part 3: 시각화 (Visualization)

> 모델의 해석을 위한 시각화를 해주세요. 아래의 제시 된 종류 중 **2가지**를 선택하세요 (시각화에서 가장 중요한 것은 **여러분의 해석**입니다):
> - Permutation Importances
> - Partial Dependence Plot, 1 feature isolation
> - Partial Dependence Plot, 2 features interaction
> - Shapley Values (SHAP)

3-1) 시각화 - Partial Dependence Plot, 1 feature isolation

In [None]:
from pdpbox.pdp import pdp_isolate, pdp_plot

feature = 'Risk'

In [None]:
isolated = pdp_isolate(
    model=pipe_rf_ordi, 
    dataset=X_train, 
    model_features=X_train.columns, 
    feature=feature,
    grid_type='percentile', # default='percentile', or 'equal'
    num_grid_points=10 # default=10
)
pdp_plot(isolated, feature_name=feature);

★ 그래프 해석 ★  
 -. risk 1~2사이에 있는 식당들은 점검 후 적극적인 피드백에 의해 0보다 작은 곳에 위치하는 것 같고  
 -. risk3이상의 식당은 이미 회복하기에는 너무 큰 위반사항들이 있어서 폐업으로 이어지는 것으로 관측된다.   

3-2) 시각화 - Partial Dependence Plot, 2 features interaction

In [None]:
from pdpbox.pdp import pdp_interact, pdp_interact_plot

In [None]:
features = ['Risk', 'Violation5']

interaction = pdp_interact(
    model=boosting, 
    dataset=X_train_encoded, 
    model_features=X_train_encoded.columns, 
    features=features
)

pdp_interact_plot(interaction, plot_type='grid', feature_names=features);

★ 그래프 해석 ★  
 -. violation5가 많이 적발될 수록 risk가 높은 그룹에 해당되었다.   
 -. risk가 높은 그룹은 페업활률이 높아진다.   

3-3) Permutation Importances  
  -. 시각화를 도전하였으나, 처음만나는 에러로 인해 실패하였습니다.   
  -. 스첼 후 잔여시간동안 확인하여 해결해보겠습니다. 

In [None]:
import eli5
from eli5.sklearn import PermutationImportance

# permuter 정의
permuter = PermutationImportance(
    boosting, 
    
    n_iter=5, # 다른 random seed를 사용하여 5번 반복
    random_state=42
)


permuter.fit(X_val, y_val);

In [None]:
feature_names = X_val.columns.tolist()
pd.Series(permuter.feature_importances_, feature_names).sort_values()

3-4) Shapley Values (SHAP)  
-. 시각화를 도전하였으나, 처음만나는 에러로 인해 실패하였습니다.  
-. 스첼 후 잔여시간동안 확인하여 해결해보겠습니다

In [None]:
row = X_train.iloc[[1000]] 
row

In [None]:
pipe_rf_tag.predict(row)

In [None]:
import shap
explainer = shap.TreeExplainer(model)
row_processed = processor.transform(row)
shap_values = explainer.shap_values(row_processed)


shap.initjs()
shap.force_plot(
    base_value=explainer.expected_value, 
    shap_values=shap_values, 
    features=row, 
    link='logit' # SHAP value를 확률로 변환해 표시합니다.
)