<a href="https://colab.research.google.com/github/trista-paul/DS-Unit-4-Sprint-1-Tree-Ensembles/blob/master/DS41SC.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Science Unit 4 Sprint Challenge 1 — Tree Ensembles

### Chicago Food Inspections

For this Sprint Challenge, you'll use a dataset with information from inspections of restaurants and other food establishments in Chicago from January 2010 to March 2019. 

[See this PDF](https://data.cityofchicago.org/api/assets/BAD5301B-681A-4202-9D25-51B2CAE672FF) for descriptions of the data elements included in this dataset.

According to [Chicago Department of Public Health — Food Protection Services](https://www.chicago.gov/city/en/depts/cdph/provdrs/healthy_restaurants/svcs/food-protection-services.html), "Chicago is home to 16,000 food establishments like restaurants, grocery stores, bakeries, wholesalers, lunchrooms, mobile food vendors and more. Our business is food safety and sanitation with one goal, to prevent the spread of food-borne disease. We do this by inspecting food businesses, responding to complaints and food recalls." 

#### Your challenge: Predict whether inspections failed

The target is the `Fail` column.

- When the food establishment failed the inspection, the target is `1`.
- When the establishment passed, the target is `0`.

#### Run this cell to load the data:

In [0]:
!pip install category_encoders
!pip install eli5
!pip install pdpbox
!pip install shap

In [77]:
!pip install --upgrade git+https://github.com/scikit-learn-contrib/categorical-encoding

Collecting git+https://github.com/scikit-learn-contrib/categorical-encoding
  Cloning https://github.com/scikit-learn-contrib/categorical-encoding to /tmp/pip-req-build-t17qb1kx
Building wheels for collected packages: category-encoders
  Building wheel for category-encoders (setup.py) ... [?25ldone
[?25h  Stored in directory: /tmp/pip-ephem-wheel-cache-sagqf_yg/wheels/94/ac/6d/fe3feae87e68e96dd9d4ab262c1858312af15d4076fd23656f
Successfully built category-encoders
Installing collected packages: category-encoders
  Found existing installation: category-encoders 1.3.0
    Uninstalling category-encoders-1.3.0:
      Successfully uninstalled category-encoders-1.3.0
Successfully installed category-encoders-2.0.0


In [0]:
import pandas as pd
import numpy as np
import datetime
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import cross_val_score, cross_val_predict
import category_encoders as ce
import xgboost
from sklearn.pipeline import make_pipeline
import eli5
from pdpbox.pdp import pdp_isolate, pdp_plot
import shap
from sklearn.pipeline import make_pipeline

In [0]:
train_url = 'https://drive.google.com/uc?export=download&id=13_tP9JpLcZHSPVpWcua4t2rY44K_s4H5'
test_url  = 'https://drive.google.com/uc?export=download&id=1GkDHjsiGrzOXoF_xcYjdzBTSjOIi3g5a'

train = pd.read_csv(train_url)
test  = pd.read_csv(test_url)

assert train.shape == (51916, 17)
assert test.shape  == (17306, 17)

### Part 1: Preprocessing

You may choose which features you want to use, and whether/how you will preprocess them. If you use categorical features, you may use any tools and techniques for encoding. (Pandas, category_encoders, sklearn.preprocessing, or any other library.)

_To earn a score of 3 for this part, find and explain leakage. The dataset has a feature that will give you an ROC AUC score > 0.90 if you process and use the feature. Find the leakage and explain why the feature shouldn't be used in a real-world model to predict the results of future inspections._

### Part 2: Modeling

Fit a Random Forest or Gradient Boosting model with the train set. (You may use scikit-learn, xgboost, or any other library.) Use cross-validation to estimate an ROC AUC validation score.

Use your model to predict probabilities for the test set. Get an ROC AUC test score >= 0.60.

_To earn a score of 3 for this part, get an ROC AUC test score >= 0.70 (without using the feature with leakage)._


### Part 3: Visualization

Make one visualization for model interpretation. (You may use any libraries.) Choose one of these types:
- Feature Importances
- Permutation Importances
- Partial Dependence Plot
- Shapley Values

_To earn a score of 3 for this part, make at least two of these visualization types._

# Part 1

### Value Counts

In [18]:
train['Inspection Type'].value_counts()

Canvass                                   24170
License                                    7825
Canvass Re-Inspection                      6346
Complaint                                  4948
License Re-Inspection                      3002
Complaint Re-Inspection                    2241
Short Form Complaint                       2103
License-Task Force                          214
Suspected Food Poisoning                    207
Consultation                                189
Tag Removal                                 146
Out of Business                             109
Task Force Liquor 1475                       92
Recent Inspection                            66
Suspected Food Poisoning Re-inspection       58
Complaint-Fire                               51
Short Form Fire-Complaint                    36
No Entry                                     21
Special Events (Festivals)                   21
Package Liquor 1474                          16
Complaint-Fire Re-inspection            

In [7]:
train.isnull().sum()

Inspection ID         0
DBA Name              0
AKA Name            623
License #             5
Facility Type       224
Risk                 12
Address               0
City                 53
State                10
Zip                  26
Inspection Date       0
Inspection Type       1
Violations         9655
Latitude            198
Longitude           198
Location            198
Fail                  0
dtype: int64

In [16]:
train['City'].value_counts()

CHICAGO              51659
Chicago                 91
chicago                 34
CCHICAGO                16
SCHAUMBURG               6
CHicago                  5
MAYWOOD                  4
ELK GROVE VILLAGE        4
CHESTNUT STREET          3
CICERO                   3
CHICAGOCHICAGO           2
OAK PARK                 2
NAPERVILLE               2
EAST HAZEL CREST         2
ELMHURST                 2
NILES NILES              2
ALSIP                    2
SKOKIE                   2
ROSEMONT                 2
CHARLES A HAYES          1
CHICAGOI                 1
CHICAGOHICAGO            1
EVANSTON                 1
CHICAGO HEIGHTS          1
CHCHICAGO                1
WORTH                    1
TINLEY PARK              1
STREAMWOOD               1
BERWYN                   1
OOLYMPIA FIELDS          1
HIGHLAND PARK            1
SCHILLER PARK            1
BRIDGEVIEW               1
SUMMIT                   1
OLYMPIA FIELDS           1
BEDFORD PARK             1
LAKE BLUFF               1
B

In [10]:
train['Facility Type'].value_counts()

Restaurant                                         34264
Grocery Store                                       6904
School                                              3876
Bakery                                               846
Daycare (2 - 6 Years)                                830
Children's Services Facility                         802
Daycare Above and Under 2 Years                      656
Long Term Care                                       394
Catering                                             304
Mobile Food Dispenser                                280
Liquor                                               261
Daycare Combo 1586                                   227
Wholesale                                            203
Golden Diner                                         162
Mobile Food Preparer                                 159
Hospital                                             141
TAVERN                                                88
Shared Kitchen User (Long Term)

In [19]:
train['Zip'].value_counts()

60647.0    1782
60614.0    1756
60657.0    1676
60622.0    1670
60611.0    1663
60618.0    1526
60608.0    1518
60607.0    1450
60625.0    1388
60639.0    1365
60623.0    1356
60616.0    1324
60640.0    1270
60632.0    1211
60609.0    1166
60659.0    1129
60613.0    1087
60619.0    1079
60617.0    1029
60654.0    1011
60620.0     990
60629.0     982
60634.0     967
60610.0     964
60641.0     961
60601.0     928
60628.0     923
60612.0     898
60606.0     858
60626.0     839
           ... 
60655.0     205
60633.0      79
60827.0      36
60193.0       6
60007.0       4
60153.0       4
60804.0       3
60540.0       2
60429.0       2
60018.0       2
60126.0       2
60803.0       2
60714.0       2
60302.0       2
60461.0       2
60411.0       1
60402.0       1
60440.0       1
60201.0       1
60482.0       1
60176.0       1
60155.0       1
60455.0       1
60107.0       1
60077.0       1
60076.0       1
60044.0       1
60035.0       1
60477.0       1
60501.0       1
Name: Zip, Length: 86, d

### def clean

In [0]:
def drop(X):
  
    #drop unpredictive id columns, redundant columns
    #and Violation. Violation is the result of an inspection
    #and can't be used to evaluate the location bef|ehand
    X = X.drop(columns=['DBA Name',
                        'AKA Name',
                        'License #',
                        'Inspection ID',
                        'Address',
                        'Location',
                        'Violations',
                        'City',
                        'State'
                                     ]
                                     )
    return X
    
train = drop(train)
test = drop(test)

In [0]:
def facility(X):
  #Facility Type: break into smaller categ|ies
    X['Facility Type'] = X['Facility Type'].str.lower()
    
    X['Nursing'] = X['Facility Type'].str.contains("long term")
    X['Hospital'] = X['Facility Type'].str.contains("hospital")
    X['Truck'] = X['Facility Type'].str.contains("mobile")
    X['Event'] = X['Facility Type'].str.contains("cater")
    X['Bakery'] = X['Facility Type'].str.contains("bakery")
    X['Store'] = X['Facility Type'].str.contains("liquor")
    X['Bar'] =  X['Facility Type'].str.contains("tavern")
    X['Daycare'] = X['Facility Type'].str.contains('daycare')
    X['School'] = X['Facility Type'] == "School"
    X['Restaurant'] = X['Facility Type'].str.contains('restaurant')
    X['Grocery'] = X['Facility Type'].str.contains('grocery')
    
    return X

train = facility(train)
test = facility(test)

In [0]:
def date(X):
  #Inspection Date: make into year and month
    X['Inspection Date'] = pd.to_datetime(X['Inspection Date'])
    X['Year'] = X['Inspection Date'].dt.year
    X['Month'] = X['Inspection Date'].dt.month
    X = X.drop(columns=['Inspection Date', 'Facility Type'])
    return X

train = date(train)
test = date(test)

In [0]:
def latlon(X):
  #latitude and longitude: real coordinates
    X['Latitude'] = X['Latitude'][X['Latitude'] <= 90]
    X['Latitude'] = X['Latitude'][X['Latitude'] >= -90]
    X['Latitude'] = X['Latitude'][X['Latitude'] != 0]
    X['Longitude'] = X['Latitude'][X['Longitude'] <= 180]
    X['Longitude'] = X['Latitude'][X['Longitude'] >= -180]
    X['Longitude'] = X['Latitude'][X['Longitude'] != -180]
    return X
  
train = latlon(train)
test = latlon(test)

In [0]:
train['Inspection Type'] = train['Inspection Type'].str.lower()
test['Inspection Type'] = test['Inspection Type'].str.lower()

In [0]:
train['Zip'] = train['Zip'].astype(str)
test['Zip'] = test['Zip'].astype(str)

In [0]:
y = train['Fail']
train = train.drop(columns=['Fail'])

In [0]:
encoder = ce.TargetEncoder(handle_unknown = 'ignore')
train_transform = encoder.fit_transform(train, y)

In [80]:
train_transform.head()

Unnamed: 0,Risk,Zip,Inspection Type,Latitude,Longitude,Nursing,Hospital,Truck,Event,Bakery,Store,Bar,Daycare,School,Restaurant,Grocery,Year,Month
0,0.249689,0.26253,0.259123,41.938007,41.938007,0.257218,0.257357,0.256386,0.257488,0.256713,0.2565,0.256881,0.257939,False,0.250298,0.249675,2017,9
1,0.337132,0.319572,0.103971,41.772402,41.772402,0.257218,0.257357,0.256386,0.257488,0.256713,0.2565,0.256881,0.257939,False,0.270971,0.305358,2011,10
2,0.249689,0.253782,0.073117,41.758779,41.758779,0.257218,0.257357,0.256386,0.257488,0.256713,0.2565,0.256881,0.257939,False,0.250298,0.249675,2016,4
3,0.259866,0.26507,0.259123,41.812181,41.812181,0.257218,0.257357,0.256386,0.257488,0.256713,0.2565,0.256881,0.257939,False,0.250298,0.249675,2016,4
4,0.249689,0.215886,0.259123,,,0.257218,0.257357,0.256386,0.257488,0.256713,0.2565,0.256881,0.257939,False,0.270971,0.249675,2011,1


# Part 2 - over 70%!

In [81]:
model = xgboost.XGBClassifier(max_depth = 2,
                              learning_rate = 0.1,
                              verbosity = 0,
                              n_jobs = -1,
                              random_state = 0)

model.fit(train_transform, y)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=2, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=-1, nthread=None, objective='binary:logistic',
       random_state=0, reg_alpha=0, reg_lambda=1, scale_pos_weight=1,
       seed=None, silent=True, subsample=1, verbosity=0)

In [87]:
#identical transforms to above to make compatible with cross_val_score
pipe = make_pipeline(
                     ce.TargetEncoder(),
                     xgboost.XGBClassifier(max_depth = 2, #tree depth
                                           learning_rate = 0.1, #equiv to eta
                                           verbosity = 0,
                                           n_jobs = -1, #parallel threads
                                           random_state = 0)
                     )

pipe.fit(train, y)

cross_val_score(pipe, train, y, cv=5, scoring='roc_auc', n_jobs=-1)

array([0.70777529, 0.703305  , 0.70697543, 0.71375145, 0.71416228])

In [0]:
test_y = test['Fail']
test = test.drop(columns=['Fail'])

In [90]:
cross_val_score(pipe, test, test_y, cv=5, scoring='roc_auc', n_jobs=-1)

array([0.70984955, 0.69732862, 0.72450502, 0.7120717 , 0.70721424])

# Part 3 - eli5

In [84]:
eli5.explain_weights_xgboost(model, top=None,
                             feature_names=train_transform.columns.tolist())

Weight,Feature
0.416,Inspection Type
0.1527,Zip
0.0772,Year
0.0648,Month
0.0511,Bar
0.0482,Daycare
0.0476,Grocery
0.0438,Store
0.0339,Restaurant
0.0262,Latitude
