
# Data Science Unit 4 Sprint Challenge 1 — Tree Ensembles

### Chicago Food Inspections

For this Sprint Challenge, you'll use a dataset with information from inspections of restaurants and other food establishments in Chicago from January 2010 to March 2019. 

[See this PDF](https://data.cityofchicago.org/api/assets/BAD5301B-681A-4202-9D25-51B2CAE672FF) for descriptions of the data elements included in this dataset.

According to [Chicago Department of Public Health — Food Protection Services](https://www.chicago.gov/city/en/depts/cdph/provdrs/healthy_restaurants/svcs/food-protection-services.html), "Chicago is home to 16,000 food establishments like restaurants, grocery stores, bakeries, wholesalers, lunchrooms, mobile food vendors and more. Our business is food safety and sanitation with one goal, to prevent the spread of food-borne disease. We do this by inspecting food businesses, responding to complaints and food recalls." 

#### Your challenge: Predict whether inspections failed

The target is the `Fail` column.

- When the food establishment failed the inspection, the target is `1`.
- When the establishment passed, the target is `0`.

#### Run this cell to load the data:

In [11]:
!pip install category_encoders

Collecting category_encoders
  Downloading https://files.pythonhosted.org/packages/f7/d3/82a4b85a87ece114f6d0139d643580c726efa45fa4db3b81aed38c0156c5/category_encoders-1.3.0-py2.py3-none-any.whl (61kB)
Installing collected packages: category-encoders
Successfully installed category-encoders-1.3.0


In [0]:
import category_encoders as ce
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

In [0]:

train_url = 'https://drive.google.com/uc?export=download&id=13_tP9JpLcZHSPVpWcua4t2rY44K_s4H5'
test_url  = 'https://drive.google.com/uc?export=download&id=1GkDHjsiGrzOXoF_xcYjdzBTSjOIi3g5a'

train = pd.read_csv(train_url)
test  = pd.read_csv(test_url)

assert train.shape == (51916, 17)
assert test.shape  == (17306, 17)

In [0]:
pd.options.display.max_colwidth = 50

In [41]:
train.head()

Unnamed: 0,Inspection ID,DBA Name,AKA Name,License #,Facility Type,Risk,Address,City,State,Zip,Inspection Date,Inspection Type,Violations,Latitude,Longitude,Location,Fail
0,2088270,"TOM YUM RICE & NOODLE, INC.",TOM YUM CAFE,2354911.0,Restaurant,Risk 1 (High),608 W BARRY,CHICAGO,IL,60657.0,2017-09-15,Canvass,3. POTENTIALLY HAZARDOUS FOOD MEETS TEMPERATUR...,41.938007,-87.644755,"{'longitude': '-87.6447545707008', 'latitude':...",1
1,555268,FILLING STATION & CONVENIENCE STORE,FILLING STATION & CONVENIENCE STORE,1044901.0,Grocery Store,Risk 3 (Low),6646-6658 S WESTERN AVE,CHICAGO,IL,60636.0,2011-10-20,Complaint Re-Inspection,32. FOOD AND NON-FOOD CONTACT SURFACES PROPERL...,41.772402,-87.683603,"{'longitude': '-87.68360273081268', 'latitude'...",0
2,1751394,A P DELI,A P DELI,47405.0,Restaurant,Risk 1 (High),2025 E 75TH ST,CHICAGO,IL,60649.0,2016-04-05,Canvass Re-Inspection,"35. WALLS, CEILINGS, ATTACHED EQUIPMENT CONSTR...",41.758779,-87.575054,"{'longitude': '-87.57505446746121', 'latitude'...",0
3,1763905,FRANK'S CHICAGO SHRIMP HOUSE,FRANK'S CHICAGO SHRIMP HOUSE,6414.0,Restaurant,Risk 2 (Medium),4459 S ARCHER AVE,CHICAGO,IL,60632.0,2016-04-29,Canvass,38. VENTILATION: ROOMS AND EQUIPMENT VENTED AS...,41.812181,-87.707125,"{'longitude': '-87.70712481334274', 'latitude'...",0
4,453326,MORRILL,MORRILL,24571.0,School,Risk 1 (High),6011 S Rockwell (2600W) AVE,CHICAGO,IL,60629.0,2011-01-10,Canvass,"34. FLOORS: CONSTRUCTED PER CODE, CLEANED, GOO...",,,,0


In [9]:
train.isnull().sum()

Inspection ID         0
DBA Name              0
AKA Name            623
License #             5
Facility Type       224
Risk                 12
Address               0
City                 53
State                10
Zip                  26
Inspection Date       0
Inspection Type       1
Violations         9655
Latitude            198
Longitude           198
Location            198
Fail                  0
dtype: int64

In [18]:
train['Fail'].value_counts()

0    38490
1    13426
Name: Fail, dtype: int64

In [31]:
X_train.dtypes

Inspection ID        int64
DBA Name            object
AKA Name            object
License #          float64
Facility Type       object
Risk                object
Address             object
City                object
State               object
Zip                float64
Inspection Date     object
Inspection Type     object
Violations          object
Latitude           float64
Longitude          float64
Location            object
dtype: object

In [0]:
# Count unique values per column to see which columns can be dropped
columns = list(X_train)

unique_values = []

for col in columns:
    unique_values.append(X_train[col].nunique())

In [0]:
Unique = pd.DataFrame(unique_values)


In [0]:
Unique['Category'] = list(X_train)

In [0]:
train['Facility Type'].value_counts()

In [26]:
#keep categories under 10
Unique

Unnamed: 0,0,Category
0,51916,Inspection ID
1,17049,DBA Name
2,16350,AKA Name
3,21421,License #
4,329,Facility Type
5,3,Risk
6,13954,Address
7,39,City
8,1,State
9,86,Zip


In [0]:
def wrangle(df):
    df = df.copy()

    # Drop the target and identifier-like columns that don't add anything
    df = df.drop(columns=['Fail', 'Inspection ID', 'DBA Name', 'AKA Name',
                          'License #', 'Address', 'Location'])

    # Convert to datetime and split into day, month, and year
    df['Inspection Date'] = pd.to_datetime(df['Inspection Date'])
    df['inspection_day'] = df['Inspection Date'].dt.day
    df['inspection_month'] = df['Inspection Date'].dt.month
    df['inspection_year'] = df['Inspection Date'].dt.year
    df = df.drop(columns='Inspection Date')

    # Create a violation count: violations are separated by '|',
    # and a missing value means no violations were recorded
    df['violation_count'] = df['Violations'].str.count(r'\|') + 1
    df['violation_count'] = df['violation_count'].fillna(0)
    df = df.drop(columns='Violations')

    return df

In [0]:
df_train = wrangle(train)
df_test = wrangle(test)
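
The violation count in `wrangle` relies on the `Violations` text using `|` as a separator. A minimal sketch of that counting step on its own, using made-up sample strings (hypothetical, not rows from the dataset):

```python
import pandas as pd

# Hypothetical sample values mimicking pipe-delimited Violations text
violations = pd.Series([
    "3. FOOD TEMPERATURE | 32. CONTACT SURFACES | 35. WALLS AND CEILINGS",
    "34. FLOORS",
    None,  # inspection with no recorded violations
])

# Each '|' sits between two violations, so count = separators + 1
violation_count = violations.str.count(r"\|") + 1

# Missing text means no violations were recorded
violation_count = violation_count.fillna(0).astype(int)

print(violation_count.tolist())
```

The raw string `r"\|"` matters because `str.count` treats its argument as a regular expression, where a bare `|` means alternation.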

In [52]:
df_train.head()

Unnamed: 0,Facility Type,Risk,City,State,Zip,Inspection Type,Latitude,Longitude,inspection_day,inspection_month,inspection_year,violation_count
0,Restaurant,Risk 1 (High),CHICAGO,IL,60657.0,Canvass,41.938007,-87.644755,15,9,2017,5.0
1,Grocery Store,Risk 3 (Low),CHICAGO,IL,60636.0,Complaint Re-Inspection,41.772402,-87.683603,20,10,2011,7.0
2,Restaurant,Risk 1 (High),CHICAGO,IL,60649.0,Canvass Re-Inspection,41.758779,-87.575054,5,4,2016,1.0
3,Restaurant,Risk 2 (Medium),CHICAGO,IL,60632.0,Canvass,41.812181,-87.707125,29,4,2016,2.0
4,School,Risk 1 (High),CHICAGO,IL,60629.0,Canvass,,,10,1,2011,3.0


In [0]:
y_train = train['Fail']
# Features: every column except the target
X_train = train.drop(columns=['Fail'])

### Part 1: Preprocessing

You may choose which features you want to use, and whether/how you will preprocess them. If you use categorical features, you may use any tools and techniques for encoding. (Pandas, category_encoders, sklearn.preprocessing, or any other library.)

_To earn a score of 3 for this part, find and explain leakage. The dataset has a feature that will give you an ROC AUC score > 0.90 if you process and use the feature. Find the leakage and explain why the feature shouldn't be used in a real-world model to predict the results of future inspections._
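
One possible way to screen for leakage is to score a shallow tree on a single feature at a time and look for a suspiciously high ROC AUC. The sketch below uses synthetic data, not the Chicago dataset; `single_feature_auc`, `leaky_count`, and `noise` are hypothetical names introduced for illustration. In the real data, a feature derived from `Violations` is a natural candidate to test this way, since violations are recorded during the inspection itself.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier


def single_feature_auc(X, y, feature, cv=5):
    """Cross-validated ROC AUC of a shallow tree trained on one numeric feature."""
    model = DecisionTreeClassifier(max_depth=2, random_state=0)
    scores = cross_val_score(model, X[[feature]], y, cv=cv, scoring="roc_auc")
    return scores.mean()


# Synthetic stand-in data (an assumption for this sketch)
rng = np.random.default_rng(0)
n = 1000
y = pd.Series(rng.integers(0, 2, size=n), name="Fail")
X = pd.DataFrame({
    # Leaky: computed from the outcome itself, plus a little noise
    "leaky_count": y * 5 + rng.integers(0, 3, size=n),
    # Harmless: unrelated to the outcome
    "noise": rng.normal(size=n),
})

for col in X.columns:
    print(col, round(single_feature_auc(X, y, col), 3))
```

A feature that scores far above the rest on its own deserves a closer look at when and how its values are recorded relative to the outcome.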



In [0]:
pipe = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True), 
    DecisionTreeClassifier(max_depth=2)
)

In [21]:
cross_val_score(pipe, X_train, y_train, cv=5, scoring='roc_auc', n_jobs=-1)

KeyboardInterrupt: ignored

### Part 2: Modeling

Fit a Random Forest or Gradient Boosting model with the train set. (You may use scikit-learn, xgboost, or any other library.) Use cross-validation to estimate an ROC AUC validation score.

Use your model to predict probabilities for the test set. Get an ROC AUC test score >= 0.60.

_To earn a score of 3 for this part, get an ROC AUC test score >= 0.70 (without using the feature with leakage)._
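
A minimal sketch of the workflow this part describes: cross-validate a Random Forest on the train set, then score predicted probabilities on the test set. It runs on synthetic numeric data so it is self-contained; `make_synthetic` and its columns are assumptions for illustration, and the hyperparameters are arbitrary starting points rather than tuned values. Categorical columns would first need an encoder such as the `ce.OneHotEncoder` used in Part 1.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline


def make_synthetic(n, seed):
    """Build a small synthetic frame with numeric columns and a binary target."""
    rng = np.random.default_rng(seed)
    X = pd.DataFrame({
        "inspection_year": rng.integers(2010, 2020, size=n),
        "inspection_month": rng.integers(1, 13, size=n),
        "Latitude": rng.normal(41.85, 0.1, size=n),
    })
    # Tie the target loosely to the year so there is some signal to learn
    p_fail = np.where(X["inspection_year"] >= 2015, 0.7, 0.15)
    y = pd.Series((rng.random(n) < p_fail).astype(int), name="Fail")
    # Blank out some coordinates to mimic missing values
    X.loc[rng.random(n) < 0.05, "Latitude"] = np.nan
    return X, y


X_tr, y_tr = make_synthetic(2000, seed=0)
X_te, y_te = make_synthetic(1000, seed=1)

model = make_pipeline(
    SimpleImputer(strategy="median"),  # fill missing numeric values
    RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42),
)

# Cross-validation estimates the ROC AUC using the train set only
cv_scores = cross_val_score(model, X_tr, y_tr, cv=5, scoring="roc_auc")
print("CV ROC AUC:", cv_scores.mean())

# Fit on all training rows, then score probabilities on the held-out test set
model.fit(X_tr, y_tr)
test_proba = model.predict_proba(X_te)[:, 1]
test_auc = roc_auc_score(y_te, test_proba)
print("Test ROC AUC:", test_auc)
```

ROC AUC needs the predicted probability of the positive class, which is why the sketch takes column 1 of `predict_proba` instead of calling `predict`.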




### Part 3: Visualization

Make one visualization for model interpretation. (You may use any libraries.) Choose one of these types:
- Feature Importances
- Permutation Importances
- Partial Dependence Plot
- Shapley Values

_To earn a score of 3 for this part, make at least two of these visualization types._
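
As a rough sketch of the first two options, the example below computes impurity-based feature importances and permutation importances for a Random Forest. The data is synthetic and the column names (`signal`, `noise_a`, `noise_b`) are hypothetical; with the real data, permutation importance would ideally be measured on held-out rows rather than on the training set.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic data: one informative column and two pure-noise columns
rng = np.random.default_rng(0)
n = 1000
X = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise_a": rng.normal(size=n),
    "noise_b": rng.normal(size=n),
})
y = (X["signal"] + 0.3 * rng.normal(size=n) > 0).astype(int)

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=0)
model.fit(X, y)

# Impurity-based importances come from the fitted trees and sum to 1
impurity_imp = pd.Series(model.feature_importances_, index=X.columns).sort_values()

# Permutation importance: drop in ROC AUC when one column is shuffled
perm = permutation_importance(
    model, X, y, scoring="roc_auc", n_repeats=5, random_state=0
)
perm_imp = pd.Series(perm.importances_mean, index=X.columns).sort_values()

print(impurity_imp)
print(perm_imp)
```

With matplotlib installed, either Series can be drawn as a horizontal bar chart via `impurity_imp.plot.barh()` or `perm_imp.plot.barh()`.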