# Data Science Unit 4 Sprint Challenge 1 — Tree Ensembles

### Chicago Food Inspections

For this Sprint Challenge, you'll use a dataset with information from inspections of restaurants and other food establishments in Chicago from January 1, 2010 to the present. 

[See this PDF](https://data.cityofchicago.org/api/assets/BAD5301B-681A-4202-9D25-51B2CAE672FF) for descriptions of the data elements included in this dataset.

According to [Chicago Department of Public Health — Food Protection Services](https://www.chicago.gov/city/en/depts/cdph/provdrs/healthy_restaurants/svcs/food-protection-services.html), "Chicago is home to 16,000 food establishments like restaurants, grocery stores, bakeries, wholesalers, lunchrooms, mobile food vendors and more. Our business is food safety and sanitation with one goal, to prevent the spread of food-borne disease. We do this by inspecting food businesses, responding to complaints and food recalls." 

#### Your challenge: Predict whether inspections failed

The target is the `Fail` column.

- When the food establishment failed the inspection, the target is `1`.
- When the establishment passed, the target is `0`.

#### Run this cell to load the data:

In [1]:
import pandas as pd

train_url = 'https://drive.google.com/uc?export=download&id=13_tP9JpLcZHSPVpWcua4t2rY44K_s4H5'
test_url  = 'https://drive.google.com/uc?export=download&id=1GkDHjsiGrzOXoF_xcYjdzBTSjOIi3g5a'

train = pd.read_csv(train_url)
test  = pd.read_csv(test_url)

In [2]:
# Check shape of train and test datasets

assert train.shape == (51916, 17)
assert test.shape  == (17306, 17)

In [3]:
train.shape, test.shape

((51916, 17), (17306, 17))

In [22]:
# Copy train and test so don't have to download multiple times
# Plus, rename to X and y

X_train = train.copy()
X_test = test.copy()

In [23]:
# Define target for both train1 and test1

y_train = X_train['Fail']
y_test = X_test['Fail']

In [24]:
# Check shapes

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((51916, 17), (17306, 17), (51916,), (17306,))

### Part 1: Preprocessing

You may choose which features you want to use, and whether/how you will preprocess them. You may use any tools and techniques for categorical encoding. (Pandas, category_encoders, sklearn.preprocessing, or any other library.)

_To earn a score of 3 for this part, engineer new features, and use any alternative categorical encoding instead of One-Hot or Ordinal/Label encoding._

In [25]:
import category_encoders as ce

# Aim for 'first, fast' baseline
# Start preprocessing by dropping nulls from all df's
X_train = X_train.dropna()
X_test = X_test.dropna()
y_train = y_train.dropna()
y_test = y_test.dropna()

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((41665, 17), (13822, 17), (51916,), (17306,))

In [26]:
# Need to reshape before encoding data
# Need X_train and y_train to have same shape[0]
# Also need X_test and y_test to have same?

y_train_random_sample = np.random.choice(y_train.index.values, 41665)
sampled_y_train = y_train.loc[y_train_random_sample]
y_test_random_sample = np.random.choice(y_test.index.values, 13822)
sampled_y_test = y_test.loc[y_test_random_sample]

X_train.shape, X_test.shape, sampled_y_train.shape, sampled_y_test.shape

# test_pred_proba = gb.predict_proba(test)[:,1]
# print('Validation ROC AUC:', roc_auc_score(test, test_pred_proba))

((41665, 17), (13822, 17), (41665,), (13822,))

In [27]:
encoder = ce.OrdinalEncoder()
X_train = encoder.fit_transform(X_train)
X_test = encoder.transform(X_test)

### Part 2: Modeling

Fit a Random Forest or Gradient Boosting model with the train set. (You may use scikit-learn, xgboost, or any other library.) Use cross-validation to estimate an ROC AUC validation score.

Use your model to predict probabilities for the test set. Get an ROC AUC test score >= 0.60.

_To earn a score of 3 for this part, get an ROC AUC test score >= 0.70._

In [21]:
# First work is for baseline modeling

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score

In [28]:
# With reshaping done, fit model, then compute cross_val_score

gb = GradientBoostingClassifier()
gb.fit(X_train, sampled_y_train)
cross_val_score(gb, X_test, sampled_y_test, scoring='roc_auc', cv=5, n_jobs=-1)

array([0.49449246, 0.49003441, 0.48145368, 0.49692797, 0.47023361])

### Part 3: Visualization

Make one visualization for model interpretation. (You may use any libraries.) Choose one of these types:
- Feature Importances
- Permutation Importances
- Partial Dependence Plot

_To earn a score of 3 for this part, make at least two of these visualization types._