# Data Science Unit 4 Sprint Challenge 1 — Tree Ensembles

### Chicago Food Inspections

For this Sprint Challenge, you'll use a dataset with information from inspections of restaurants and other food establishments in Chicago from January 1, 2010 to the present. 

[See this PDF](https://data.cityofchicago.org/api/assets/BAD5301B-681A-4202-9D25-51B2CAE672FF) for descriptions of the data elements included in this dataset.

According to [Chicago Department of Public Health — Food Protection Services](https://www.chicago.gov/city/en/depts/cdph/provdrs/healthy_restaurants/svcs/food-protection-services.html), "Chicago is home to 16,000 food establishments like restaurants, grocery stores, bakeries, wholesalers, lunchrooms, mobile food vendors and more. Our business is food safety and sanitation with one goal, to prevent the spread of food-borne disease. We do this by inspecting food businesses, responding to complaints and food recalls." 

#### Your challenge: Predict whether inspections failed

The target is the `Fail` column.

- When the food establishment failed the inspection, the target is `1`.
- When the establishment passed, the target is `0`.

#### Run this cell to load the data:

In [14]:
import pandas as pd
import numpy as np
import category_encoders as ce
import xgboost as xgb
import matplotlib.pyplot as plt
plt.style.use('dark_background')

train_url = 'https://drive.google.com/uc?export=download&id=13_tP9JpLcZHSPVpWcua4t2rY44K_s4H5'
test_url  = 'https://drive.google.com/uc?export=download&id=1GkDHjsiGrzOXoF_xcYjdzBTSjOIi3g5a'

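# Work with random subsamples (8192 train rows, 4096 test rows) to keep the notebook fast. 
# No random_state is set, so the rows drawn differ from run to run. 
train = pd.read_csv(train_url).sample(2**13)
test  = pd.read_csv(test_url).sample(2**12)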

X_train_0 = train.drop(['Fail'], axis=1)
X_test_0 = test.drop(['Fail'], axis=1)
y_train = train.Fail
y_test = test.Fail

# Disabled: these full-dataset shapes no longer hold once the frames are subsampled. 
#assert train.shape == (51916, 17)
#assert test.shape  == (17306, 17)

### Part 1: Preprocessing

You may choose which features you want to use, and whether/how you will preprocess them. You may use any tools and techniques for categorical encoding. (Pandas, category_encoders, sklearn.preprocessing, or any other library.)

_To earn a score of 3 for this part, engineer new features, and use any alternative categorical encoding instead of One-Hot or Ordinal/Label encoding._
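
One alternative to One-Hot or Ordinal encoding is target (mean) encoding, which replaces each category with an average of the target for that category. The sketch below is a minimal pure-Python illustration, assuming additive smoothing toward the global mean. The names `fit_target_encoding` and `apply_target_encoding`, the `smoothing` parameter, and the toy data are hypothetical, and this is not how category_encoders implements its encoder.

```python
from collections import defaultdict

def fit_target_encoding(categories, targets, smoothing=10.0):
    """Learn a smoothed target mean per category (fit on training data only)."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    # Shrink each category mean toward the global mean; rare categories shrink most.
    mapping = {
        cat: (sums[cat] + smoothing * global_mean) / (counts[cat] + smoothing)
        for cat in counts
    }
    return mapping, global_mean

def apply_target_encoding(categories, mapping, global_mean):
    """Encode categories; values unseen during fitting fall back to the global mean."""
    return [mapping.get(cat, global_mean) for cat in categories]

# Toy data, made up for illustration (not taken from the inspections dataset).
train_cats = ['Restaurant', 'Restaurant', 'Grocery Store', 'Restaurant', 'Bakery']
train_fail = [1, 0, 0, 1, 0]
mapping, prior = fit_target_encoding(train_cats, train_fail, smoothing=1.0)
encoded_test = apply_target_encoding(['Restaurant', 'Mobile Food Vendor'], mapping, prior)
```

Fitting the mapping on the train set and only applying it to the test set keeps test labels out of the features.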

### Part 2: Modeling

Fit a Random Forest or Gradient Boosting model with the train set. (You may use scikit-learn, xgboost, or any other library.) Use cross-validation to estimate an ROC AUC validation score.

Use your model to predict probabilities for the test set. Get an ROC AUC test score >= 0.60.

_To earn a score of 3 for this part, get an ROC AUC test score >= 0.70._
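
Cross-validation holds out each part of the train set once and averages the validation scores. The sketch below shows only the splitting logic in plain Python; `k_fold_indices`, `cross_val_scores`, and the `fit_and_score` callback are hypothetical names, and the placeholder scorer stands in for fitting a model and computing ROC AUC on the held-out fold. With scikit-learn, `cross_val_score(..., scoring='roc_auc')` covers the same ground.

```python
import random

def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (train_idx, valid_idx) pairs; each row is held out exactly once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        valid_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, valid_idx

def cross_val_scores(n_rows, fit_and_score, k=5, seed=0):
    """Call fit_and_score(train_idx, valid_idx) for each fold and collect the scores."""
    return [fit_and_score(tr, va) for tr, va in k_fold_indices(n_rows, k, seed)]

# Placeholder scorer: returns the validation fold size instead of a real ROC AUC.
scores = cross_val_scores(10, lambda tr, va: len(va), k=5)
```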


### Part 3: Visualization

Make one visualization for model interpretation. (You may use any libraries.) Choose one of these types:
- Feature Importances
- Permutation Importances
- Partial Dependence Plot

_To earn a score of 3 for this part, make at least two of these visualization types._
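
Permutation importance measures how much a score drops when one feature column is shuffled, which breaks its relationship with the target. The following is a rough sketch under simplifying assumptions: it scores with accuracy rather than ROC AUC, shuffles each column once, and uses a hypothetical hand-written `predict` function in place of a fitted model.

```python
import random

def accuracy(predict, rows, labels):
    """Fraction of rows where the prediction matches the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(predict, rows, labels, n_features, seed=0):
    """Score drop when each feature column is shuffled, one column at a time."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    drops = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)
        # Rebuild the rows with only column j replaced by its shuffled values.
        shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
        drops.append(baseline - accuracy(predict, shuffled, labels))
    return drops

# Hypothetical "model" that only looks at feature 0; feature 1 is ignored.
predict = lambda row: int(row[0] > 0.5)
rows = [[0.9, 3.0], [0.1, 1.0], [0.8, 2.0], [0.2, 5.0], [0.7, 4.0], [0.3, 0.0]]
labels = [1, 0, 1, 0, 1, 0]
importances = permutation_importance(predict, rows, labels, n_features=2)
```

Because the toy model never reads feature 1, shuffling that column leaves the score unchanged and its importance is zero.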

In [15]:
train.head()

| | Inspection ID | DBA Name | AKA Name | License # | Facility Type | Risk | Address | City | State | Zip | Inspection Date | Inspection Type | Violations | Latitude | Longitude | Location | Fail |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 35113 | 2104532 | JUST LIKE HOME CHILD CARE CENTER INC. | JUST LIKE HOME CHILD CARE CENTER | 2216207.0 | Children's Services Facility | Risk 1 (High) | 1249-1251 W 63RD ST | CHICAGO | IL | 60636.0 | 2017-11-13T00:00:00 | Canvass | 1. SOURCE SOUND CONDITION, NO SPOILAGE, FOODS ... | 41.779473 | -87.656456 | {'longitude': '-87.65645608488629', 'latitude'... | 1 |
| 2166 | 1523054 | KOPI, A TRAVELER'S CAFE | KOPI, A TRAVELER'S CAFE | 50261.0 | Restaurant | Risk 1 (High) | 5317 N CLARK ST | CHICAGO | IL | 60640.0 | 2015-02-19T00:00:00 | Canvass | 31. CLEAN MULTI-USE UTENSILS AND SINGLE SERVIC... | 41.978607 | -87.668176 | {'longitude': '-87.6681764114042', 'latitude':... | 0 |
| 20492 | 2129832 | MORSE FRESH MARKET | MORSE FRESH MARKET | 1518304.0 | Grocery Store | Risk 1 (High) | 1430 W MORSE AVE | CHICAGO | IL | 60626.0 | 2018-01-02T00:00:00 | Canvass Re-Inspection | 29. PREVIOUS MINOR VIOLATION(S) CORRECTED 7-42... | 42.007995 | -87.667174 | {'longitude': '-87.66717414671852', 'latitude'... | 1 |
| 42572 | 419272 | SABRI NIHARI RESTAURANT | SABRI NIHARI RESTAURANT | 1772527.0 | Restaurant | Risk 1 (High) | 2500-2502 W DEVON AVE | CHICAGO | IL | 60659.0 | 2010-10-07T00:00:00 | Complaint-Fire | 18. NO EVIDENCE OF RODENT OR INSECT OUTER OPEN... | 41.997797 | -87.692389 | {'longitude': '-87.69238861949636', 'latitude'... | 1 |
| 9120 | 457457 | MARKET CREATIONS | MARKET CREATIONS | 1914291.0 | CAFETERIA | Risk 1 (High) | 219 S DEARBORN ST | CHICAGO | IL | 60604.0 | 2010-11-10T00:00:00 | Canvass | 31. CLEAN MULTI-USE UTENSILS AND SINGLE SERVIC... | 41.878967 | -87.629193 | {'longitude': '-87.62919335609855', 'latitude'... | 0 |


In [16]:
train.dtypes

Inspection ID        int64
DBA Name            object
AKA Name            object
License #          float64
Facility Type       object
Risk                object
Address             object
City                object
State               object
Zip                float64
Inspection Date     object
Inspection Type     object
Violations          object
Latitude           float64
Longitude          float64
Location            object
Fail                 int64
dtype: object

In [17]:
train.describe()

| | Inspection ID | License # | Zip | Latitude | Longitude | Fail |
|---|---|---|---|---|---|---|
| count | 8192.0 | 8191.0 | 8187.0 | 8157.0 | 8157.0 | 8192.0 |
| mean | 1316196.0 | 1548803.0 | 60628.470624 | 41.878751 | -87.676395 | 0.263794 |
| std | 618641.2 | 904719.9 | 24.815222 | 0.080939 | 0.058859 | 0.440716 |
| min | 48212.0 | 0.0 | 60035.0 | 41.64467 | -87.914428 | 0.0 |
| 25% | 670780.8 | 1141028.0 | 60614.0 | 41.823914 | -87.707321 | 0.0 |
| 50% | 1401870.0 | 1959556.0 | 60625.0 | 41.890082 | -87.666764 | 0.0 |
| 75% | 1932798.0 | 2215574.0 | 60642.0 | 41.938553 | -87.634853 | 1.0 |
| max | 2279534.0 | 8700606.0 | 60827.0 | 42.021064 | -87.525094 | 1.0 |


In [18]:
X_train_0.isna().sum()

Inspection ID         0
DBA Name              0
AKA Name            101
License #             1
Facility Type        39
Risk                  0
Address               0
City                  7
State                 3
Zip                   5
Inspection Date       0
Inspection Type       0
Violations         1569
Latitude             35
Longitude            35
Location             35
dtype: int64

In [19]:
todrop1 = ['Inspection ID', 'AKA Name', 'License #', 'Location']

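def wrangle(dat): 
    '''Fill missing values, convert dates to ordinals, and drop unused columns.'''
    import datetime as dt
    assigns = {
        # Flag missing text fields with explicit placeholder categories, and 
        # convert the inspection date to an integer day count (Gregorian ordinal). 
        **{'Violations': dat.Violations.fillna('NO_VIOLATIONS'), 
          'Facility Type': dat['Facility Type'].fillna('NOT_APPLICABLE'), 
          'Inspection Date': pd.to_datetime(dat['Inspection Date'], 
                                            infer_datetime_format=True
                                           ).apply(dt.datetime.toordinal)}, 
        # Impute missing coordinates with the column mean of the frame passed in. 
        **{l: dat[l].fillna(dat[l].mean()) for l in ['Latitude', 'Longitude']}
    }
    
    return dat.assign(**assigns).drop(todrop1, axis=1)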

X_train_ = wrangle(X_train_0)
X_test_ = wrangle(X_test_0)
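
The date ordinal used in `wrangle` is a single integer day count, which lets a tree split on "before or after" a date but hides seasonal or weekly patterns. As one possible feature-engineering step, the sketch below splits a timestamp in the dataset's format into a few calendar features using only the standard library. The helper name `date_features` and the choice of features are assumptions, not part of the notebook.

```python
import datetime as dt

def date_features(iso_timestamp):
    """Split an ISO timestamp such as '2017-11-13T00:00:00' into simple features."""
    d = dt.datetime.fromisoformat(iso_timestamp)
    return {
        'ordinal': d.toordinal(),   # Gregorian ordinal, where 0001-01-01 is day 1
        'year': d.year,
        'month': d.month,
        'weekday': d.weekday(),     # Monday == 0
    }

features = date_features('2017-11-13T00:00:00')
```

For this timestamp the ordinal is 736646, the same value that appears in the `Inspection Date` column after wrangling.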

In [20]:
X_train_.isna().sum()

DBA Name           0
Facility Type      0
Risk               0
Address            0
City               7
State              3
Zip                5
Inspection Date    0
Inspection Type    0
Violations         0
Latitude           0
Longitude          0
dtype: int64

In [21]:
print(f'The target is unbalanced by a factor of {y_train.value_counts()[0] / y_train.value_counts()[1]} favoring `0`. \n\n')

# Split the remaining columns into categorical (object dtype) and numeric features. 
cats = X_train_.select_dtypes(include='object').columns
nums = X_train_.select_dtypes(exclude='object').columns

print(f"We have {len(cats)} categoricals and {len(nums)} numerics. \n")

# Cardinality = number of distinct non-missing values in each categorical column. 
cards = {n: X_train_[n].value_counts().shape[0] for n in cats}

def report_cards(c): 
    l = [f"\tFeature {k} has cardinality {v}. " for k,v in c.items()]
    return '\n'.join(l)

print(report_cards(cards))

The target is unbalanced by a factor of 2.7908375751966683 favoring `0`. 


We have 8 categoricals and 4 numerics. 

	Feature DBA Name has cardinality 5958. 
	Feature Facility Type has cardinality 130. 
	Feature Risk has cardinality 3. 
	Feature Address has cardinality 5798. 
	Feature City has cardinality 15. 
	Feature State has cardinality 1. 
	Feature Inspection Type has cardinality 27. 
	Feature Violations has cardinality 6621. 


In [22]:
## We're going to encode "high" and "low" cardinality features differently, depending on some meaning of "high"/"low"

thresh=2**12
low = [k for k in cards.keys() if cards[k]<=thresh]
high = [k for k in cards.keys() if cards[k]>thresh]

assert len(low) + len(high) == len(cards.keys())

# Original plan: send lows to a binary encoder and highs to a target encoder. 
low, high

(['Facility Type', 'Risk', 'City', 'State', 'Inspection Type'],
 ['DBA Name', 'Address', 'Violations'])

In [45]:
def encode(train_dat, test_dat, low=low, high=high, targ=y_train, n=4): 
    '''
    Base-n encode the categorical features. 
    
    The original plan was to base-2 encode the low-cardinality features and 
    target encode the high-cardinality ones. Having given up on target encoding 
    because of `IndexingError: Unalignable boolean Series provided as indexer`, 
    we base-n encode even the high-cardinality ones, so `targ` is unused. 
    
    
    DO NOT RUN the line below. It crashes the computer. 
        #plt.plot(x=list(range(1,10)), y=[encode(X_train[cats], n=k) for k in range(1,10)])
    '''
    # Wrap the dtype checks in all(): asserting a non-empty list always passes. 
    assert all(x=='object' for x in train_dat.dtypes)
    assert all(x=='object' for x in test_dat.dtypes)
    assert all([x==y for x,y in zip(X_train_[high].index, y_train.index)])
    
    bl = ce.basen.BaseNEncoder(base=n, return_df=True)
    #th = ce.TargetEncoder()
    
    # Fit on the training data only, then apply the same mapping to train and test. 
    bl.fit(train_dat[low+high])
    #bl_df = bl.fit_transform(train_dat[low+high])
    #th_df = th.fit_transform(dat[high], targ, return_df=True)
    return {'train': bl.transform(train_dat[low+high]), 'test': bl.transform(test_dat[low+high])}


print(f"Now is {encode(X_train_[cats], X_test_[cats])['train'].shape[1]} really too many?")



Now is 40 really too many?
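
Base-n encoding gives each category an integer index and writes that index as digits in base n, one column per digit, so the column count grows with the logarithm of the cardinality rather than linearly. The sketch below illustrates the idea in plain Python. The helper names and the convention of starting indices at 1 are assumptions for illustration, and `BaseNEncoder` may lay out its columns differently (the notebook output shows one more column per feature than this minimal count).

```python
def to_base_n(index, base, width):
    """Return the digits of a non-negative integer in the given base, zero-padded."""
    digits = []
    for _ in range(width):
        index, digit = divmod(index, base)
        digits.append(digit)
    return digits[::-1]  # most significant digit first

def digits_needed(cardinality, base):
    """Smallest number of base-`base` digits that can represent 0..cardinality."""
    width = 1
    while base ** width <= cardinality:
        width += 1
    return width

# Assumption: categories are indexed 1..cardinality, leaving 0 unused.
width = digits_needed(130, base=4)        # e.g. Facility Type, cardinality 130
row = to_base_n(3, base=4, width=width)   # digits for the category with index 3
```

With base 4, a feature with 130 categories fits in 4 digit columns and one with 5798 categories fits in 7, compared with one column per category under One-Hot encoding.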


In [46]:
cats_dfs = encode(X_train_[cats], X_test_[cats])
X_train = pd.concat([X_train_[nums], cats_dfs['train']], axis=1, sort=False)
X_test = pd.concat([X_test_[nums], cats_dfs['test']], axis=1, sort=False)

In [47]:
X_train.head()

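| | Zip | Inspection Date | Latitude | Longitude | Facility Type_0 | Facility Type_1 | Facility Type_2 | Facility Type_3 | Facility Type_4 | Risk_0 | ... | Address_6 | Address_7 | Violations_0 | Violations_1 | Violations_2 | Violations_3 | Violations_4 | Violations_5 | Violations_6 | Violations_7 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 35113 | 60636.0 | 736646 | 41.779473 | -87.656456 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2166 | 60640.0 | 735648 | 41.978607 | -87.668176 | 0 | 0 | 0 | 0 | 2 | 0 | ... | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 20492 | 60626.0 | 736696 | 42.007995 | -87.667174 | 0 | 0 | 0 | 0 | 3 | 0 | ... | 0 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
| 42572 | 60659.0 | 734052 | 41.997797 | -87.692389 | 0 | 0 | 0 | 0 | 2 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 9120 | 60604.0 | 734086 | 41.878967 | -87.629193 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |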


In [48]:

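# Wrap the encoded features in XGBoost's DMatrix format. 
dtrain = xgb.DMatrix(X_train.values, y_train.values)
dtest = xgb.DMatrix(X_test.values) # features only; labels are not needed for prediction

# specify parameters via map
param = {'booster': 'dart',
         'max_depth': 5, 'learning_rate': 0.1,
         'objective': 'binary:logistic', 
         'silent': True,  # older xgboost releases; later ones use 'verbosity' instead
         'sample_type': 'uniform',
         'normalize_type': 'tree',
         'rate_drop': 0.1,
         'skip_drop': 0.5}

num_round = 50
bst = xgb.train(param, dtrain, num_round)
# make prediction
# ntree_limit must not be 0
# (ntree_limit belongs to the older xgboost API; later releases use iteration_range)
preds = bst.predict(dtest, ntree_limit=num_round)

# Pair each predicted failure probability with its test-row index. 
preds_df = pd.DataFrame({'id': X_test.index, 'Fail?': preds})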


In [49]:
preds_df.head(30)

| | id | Fail? |
|---|---|---|
| 0 | 14092 | 0.364225 |
| 1 | 5825 | 0.251822 |
| 2 | 6079 | 0.370663 |
| 3 | 11233 | 0.167482 |
| 4 | 16950 | 0.554544 |
| 5 | 17124 | 0.275952 |
| 6 | 15411 | 0.131426 |
| 7 | 1684 | 0.545223 |
| 8 | 11406 | 0.2207 |
| 9 | 2952 | 0.158459 |


In [52]:
from sklearn.metrics import roc_auc_score

roc_auc_score(y_test, preds_df['Fail?'])

0.6103129370389406
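
A test score of about 0.61 clears the 0.60 bar but falls short of the 0.70 target. To make the metric concrete: ROC AUC equals the probability that a randomly chosen failed inspection receives a higher predicted score than a randomly chosen passed one, with ties counted as half. The sketch below computes it by brute force over all positive/negative pairs. The function name `roc_auc` and the toy inputs are made up for illustration, and `roc_auc_score` remains the practical choice for real data.

```python
def roc_auc(labels, scores):
    """ROC AUC as P(score of a random positive > score of a random negative)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError('ROC AUC needs at least one positive and one negative label')
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))

# Toy labels and scores, made up for illustration.
auc = roc_auc([1, 0, 1, 0], [0.9, 0.4, 0.35, 0.1])
# A model that gives every row the same score cannot rank anything.
constant_auc = roc_auc([1, 0, 1, 0], [0.3, 0.3, 0.3, 0.3])
```

Here three of the four positive/negative pairs are ranked correctly, so `auc` is 0.75, while the constant scorer lands at 0.5, the chance-level baseline.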