<a href="https://colab.research.google.com/github/bundickm/Study-Guides/blob/master/Unit_2_Sprint_3_Applied_Modeling_Study_Guide.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This study guide should reinforce and provide practice for all of the concepts you have seen in the past week. There is a mix of written questions and coding exercises; both are equally important for preparing you for the sprint challenge and for speaking comfortably about these topics in interviews and on the job.

If you get stuck or are unsure of something, remember the 20-minute rule. If that doesn't help, research a solution with Google and Stack Overflow. Only once you have exhausted these methods should you turn to your Team Lead; they won't be there during your sprint challenge or an interview. That being said, don't hesitate to ask for help if you truly are stuck.

Have fun studying!

# Resources

[XGBoost](https://xgboost.readthedocs.io/en/latest/)

[Shapley Values](https://shap.readthedocs.io/en/latest/)

[eli5](https://eli5.readthedocs.io/en/latest/overview.html)

[Partial Dependence Plot](https://scikit-learn.org/stable/modules/partial_dependence.html)

Use the dataset below to complete the coding challenges throughout the notebook unless otherwise specified.

In [1]:
!pip install category_encoders
import warnings
import category_encoders as ce
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.exceptions import DataConversionWarning
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_predict
warnings.filterwarnings(action='ignore', category=DataConversionWarning)



In [2]:
auto_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data'
columns = ['symboling','norm_loss','make','fuel','aspiration','doors',
           'bod_style','drv_wheels','eng_loc','wheel_base','length','width',
           'height','curb_weight','engine','cylinders','engine_size',
           'fuel_system','bore','stroke','compression','hp','peak_rpm',
           'city_mpg','hgwy_mpg','price']
df = pd.read_csv(auto_url, header=None, names=columns)

df.head()

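| | symboling | norm_loss | make | fuel | aspiration | doors | bod_style | drv_wheels | eng_loc | wheel_base | ... | engine_size | fuel_system | bore | stroke | compression | hp | peak_rpm | city_mpg | hgwy_mpg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
| 1 | 3 | ? | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
| 2 | 1 | ? | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 |
| 3 | 2 | 164 | audi | gas | std | four | sedan | fwd | front | 99.8 | ... | 109 | mpfi | 3.19 | 3.4 | 10.0 | 102 | 5500 | 24 | 30 | 13950 |
| 4 | 2 | 164 | audi | gas | std | four | sedan | 4wd | front | 99.4 | ... | 136 | mpfi | 3.19 | 3.4 | 8.0 | 115 | 5500 | 18 | 22 | 17450 |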


# Data Cleaning and Exploring

Define the following terms in your own words. Do not simply copy and paste a definition found elsewhere; reword it to be understandable and memorable to you. *Double click the markdown to add your definitions.*
<br/><br/>

**Outliers** - `Extreme values that lie far from the rest of the data and can distort a model's fit and its error metrics.`

**Skew** - `Asymmetry in a distribution: a long tail on one side pulls the mean away from the median. A heavily skewed feature or target can hurt model performance.`

**Log Transformation** - `Replacing each value with its logarithm, which compresses large values. A common way to deal with right-skewed data.`

**Leakage** - `When your model is trained on information it would not have at prediction time, such as a feature that reveals the target or details from the test data.`
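
To make the skew and log transformation terms concrete, here is a minimal sketch using only the standard library. The price list is made up for illustration, and `skewness` is a hypothetical helper (a simple population-style skewness), not a function from the course notebooks.

```python
import math
import statistics

def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)

# Made-up, right-skewed prices: most are modest, a few are very large.
prices = [5_000, 6_000, 6_500, 7_000, 8_000, 9_000, 12_000, 15_000, 30_000, 45_000]

# log1p computes log(1 + x), which is also safe when a value is 0.
log_prices = [math.log1p(p) for p in prices]

print(f"skew before: {skewness(prices):.2f}")
print(f"skew after:  {skewness(log_prices):.2f}")

# expm1 inverts log1p, so predictions made on the log scale can be converted back.
recovered = [math.expm1(v) for v in log_prices]
```

The transformed values are still right-skewed, just less so. If you train on a log-transformed target, remember to convert predictions back before reporting errors in the original units.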

Throughout this unit we have been building predictive models and consistently doing certain steps such as loading data and doing a train-test split. In the space below, list the steps needed to build a machine learning model. Make sure to include all major steps, from sourcing and loading data all the way to scoring on your test set. More detail than just the major steps is encouraged if time permits. Feel free to look back at assignments and lecture notebooks, or use Google to see the workflow steps others use and adapt them for yourself.

```
1. Import pandas and load the data.
2. To prevent leakage, split your data (with sklearn's train_test_split or along a date column).
3. Explore your training data to find inconsistencies and identify the target. During this step, it is also a good idea to think about which features need to be engineered.
4. Create a wrangle function that handles outliers and replaces bad values. Wrangle your datasets.
5. Define your target and features; engineer new features here.
6. Import the imputers, encoders, and regressors and/or classifiers you need, then create your pipeline.
7. Fit the pipeline on your training data. Find your training and validation scores. Score on your test data only at the very end.
```
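
The steps above can be sketched roughly as follows. This is one possible shape for the workflow, not the required solution: the data is synthetic (it is not the autos dataset), and the choice of `SimpleImputer`, `StandardScaler`, and `Ridge` is an assumption made only to keep the example small.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the autos dataset): two numeric features and a target.
rng = np.random.default_rng(42)
data = pd.DataFrame({
    "engine_size": rng.normal(130, 40, 300),
    "curb_weight": rng.normal(2500, 500, 300),
})
data["price"] = 80 * data["engine_size"] + 4 * data["curb_weight"] + rng.normal(0, 1000, 300)
data.loc[::25, "engine_size"] = np.nan  # simulate some missing values

# Step 2: split first so nothing learned from validation/test rows leaks into training.
train, test = train_test_split(data, test_size=0.2, random_state=42)
train, val = train_test_split(train, test_size=0.25, random_state=42)

# Step 5: define target and features.
target = "price"
features = ["engine_size", "curb_weight"]

# Step 6: imputing and scaling live inside the pipeline, so they are fit on training rows only.
pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler(), Ridge())

# Step 7: fit on train, compare train and validation scores, touch the test set once at the end.
pipeline.fit(train[features], train[target])
train_r2 = pipeline.score(train[features], train[target])
val_r2 = pipeline.score(val[features], val[target])
test_r2 = pipeline.score(test[features], test[target])
print(f"train R^2={train_r2:.3f}  val R^2={val_r2:.3f}  test R^2={test_r2:.3f}")
```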

How do we detect leakage and what are some examples?

```
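Look for features that are suspiciously predictive of the target, features that would not be known at prediction time, and validation scores that seem too good to be true. Example: in the bank dataset used later in this guide, the duration column is dropped as a leaky feature.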
```
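
One quick, partial check for leakage is to see whether any single feature is almost perfectly related to the target. The sketch below uses synthetic data with hypothetical column names (`refund_issued`, `churned`); the 0.9 cutoff is an arbitrary illustration, and a low correlation does not prove a feature is safe.

```python
import numpy as np
import pandas as pd

# Synthetic example (hypothetical columns): "refund_issued" is only recorded
# after the outcome is known, so it leaks the target.
rng = np.random.default_rng(0)
n = 500
churned = rng.integers(0, 2, n)
frame = pd.DataFrame({
    "age": rng.integers(18, 80, n),
    "monthly_fee": rng.normal(50, 10, n),
    "refund_issued": churned.copy(),  # copied straight from the target: pure leakage
    "churned": churned,
})
# Flip the first ten values so the leak is strong but not perfect.
frame.loc[:9, "refund_issued"] = 1 - frame.loc[:9, "refund_issued"]

# Absolute correlation of every feature with the target, strongest first.
correlations = (
    frame.drop(columns="churned")
    .corrwith(frame["churned"])
    .abs()
    .sort_values(ascending=False)
)
print(correlations)

# The 0.9 threshold is an arbitrary illustration, not a standard cutoff.
suspects = correlations[correlations > 0.9].index.tolist()
print("Suspiciously predictive features:", suspects)
```

A flagged feature is a prompt to ask when and how that value gets recorded, not an automatic reason to drop it.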

Use your Machine Learning Workflow above to load and prep the dataframe above. When you get to feature selection, choose a subset to include in your model and use code comments to justify why you kept or dropped each feature.

In [3]:
df.shape

(205, 26)

In [9]:
df.describe(exclude="number")

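| | norm_loss | make | fuel | aspiration | doors | bod_style | drv_wheels | eng_loc | engine | cylinders | fuel_system | bore | stroke | hp | peak_rpm | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 205 | 205 | 205 | 205 | 205 | 205 | 205 | 205 | 205 | 205 | 205 | 205.0 | 205.0 | 205 | 205 | 205 |
| unique | 52 | 22 | 2 | 2 | 3 | 5 | 3 | 2 | 7 | 7 | 8 | 39.0 | 37.0 | 60 | 24 | 187 |
| top | ? | toyota | gas | std | four | sedan | fwd | front | ohc | four | mpfi | 3.62 | 3.4 | 68 | 5500 | ? |
| freq | 41 | 32 | 185 | 168 | 114 | 96 | 120 | 202 | 148 | 159 | 94 | 23.0 | 20.0 | 19 | 37 | 4 |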


In [10]:
df.describe()

| | symboling | wheel_base | length | width | height | curb_weight | engine_size | compression | city_mpg | hgwy_mpg |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 | 205.0 |
| mean | 0.834146 | 98.756585 | 174.049268 | 65.907805 | 53.724878 | 2555.565854 | 126.907317 | 10.142537 | 25.219512 | 30.75122 |
| std | 1.245307 | 6.021776 | 12.337289 | 2.145204 | 2.443522 | 520.680204 | 41.642693 | 3.97204 | 6.542142 | 6.886443 |
| min | -2.0 | 86.6 | 141.1 | 60.3 | 47.8 | 1488.0 | 61.0 | 7.0 | 13.0 | 16.0 |
| 25% | 0.0 | 94.5 | 166.3 | 64.1 | 52.0 | 2145.0 | 97.0 | 8.6 | 19.0 | 25.0 |
| 50% | 1.0 | 97.0 | 173.2 | 65.5 | 54.1 | 2414.0 | 120.0 | 9.0 | 24.0 | 30.0 |
| 75% | 2.0 | 102.4 | 183.1 | 66.9 | 55.5 | 2935.0 | 141.0 | 9.4 | 30.0 | 34.0 |
| max | 3.0 | 120.9 | 208.1 | 72.3 | 59.8 | 4066.0 | 326.0 | 23.0 | 49.0 | 54.0 |


In [11]:
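import numpy as np

def wrangle(X):
    """Return a cleaned copy of X with '?' placeholders marked as missing."""
    # Work on a copy so the original dataframe is left untouched
    X = X.copy()

    # The dataset uses '?' for missing values; replace it with NaN
    X = X.replace("?", np.nan)

    return X

wrangle(df)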

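| | symboling | norm_loss | make | fuel | aspiration | doors | bod_style | drv_wheels | eng_loc | wheel_base | ... | engine_size | fuel_system | bore | stroke | compression | hp | peak_rpm | city_mpg | hgwy_mpg | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | NaN | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 13495 |
| 1 | 3 | NaN | alfa-romero | gas | std | two | convertible | rwd | front | 88.6 | ... | 130 | mpfi | 3.47 | 2.68 | 9.0 | 111 | 5000 | 21 | 27 | 16500 |
| 2 | 1 | NaN | alfa-romero | gas | std | two | hatchback | rwd | front | 94.5 | ... | 152 | mpfi | 2.68 | 3.47 | 9.0 | 154 | 5000 | 19 | 26 | 16500 |
| 3 | 2 | 164 | audi | gas | std | four | sedan | fwd | front | 99.8 | ... | 109 | mpfi | 3.19 | 3.40 | 10.0 | 102 | 5500 | 24 | 30 | 13950 |
| 4 | 2 | 164 | audi | gas | std | four | sedan | 4wd | front | 99.4 | ... | 136 | mpfi | 3.19 | 3.40 | 8.0 | 115 | 5500 | 18 | 22 | 17450 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 200 | -1 | 95 | volvo | gas | std | four | sedan | rwd | front | 109.1 | ... | 141 | mpfi | 3.78 | 3.15 | 9.5 | 114 | 5400 | 23 | 28 | 16845 |
| 201 | -1 | 95 | volvo | gas | turbo | four | sedan | rwd | front | 109.1 | ... | 141 | mpfi | 3.78 | 3.15 | 8.7 | 160 | 5300 | 19 | 25 | 19045 |
| 202 | -1 | 95 | volvo | gas | std | four | sedan | rwd | front | 109.1 | ... | 173 | mpfi | 3.58 | 2.87 | 8.8 | 134 | 5500 | 18 | 23 | 21485 |
| 203 | -1 | 95 | volvo | diesel | turbo | four | sedan | rwd | front | 109.1 | ... | 145 | idi | 3.01 | 3.40 | 23.0 | 106 | 4800 | 26 | 27 | 22470 |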
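
The `describe(exclude="number")` output earlier lists columns such as `norm_loss`, `hp`, and `price` as non-numeric, because the `?` placeholder forces them to load as strings. A possible next step, sketched here on a tiny hand-made frame rather than the real data, is to convert those columns after replacing `?`. The helper name `wrangle_numeric` and its `numeric_cols` argument are hypothetical.

```python
import numpy as np
import pandas as pd

# Tiny hand-made frame that mimics how the autos data loads: "?" marks a
# missing value, which forces otherwise numeric columns to load as strings.
sample = pd.DataFrame({
    "make": ["alfa-romero", "audi", "volvo"],
    "norm_loss": ["?", "164", "95"],
    "price": ["13495", "?", "22470"],
})

def wrangle_numeric(X, numeric_cols):
    """Replace "?" with NaN, then convert the listed columns to numbers."""
    X = X.copy()
    X = X.replace("?", np.nan)
    for col in numeric_cols:
        # errors="coerce" turns any remaining non-numeric text into NaN.
        X[col] = pd.to_numeric(X[col], errors="coerce")
    return X

cleaned = wrangle_numeric(sample, ["norm_loss", "price"])
print(cleaned.dtypes)
print(cleaned.isna().sum())
```

Rows with a missing target (here `price`) usually need to be dropped before modeling, while missing feature values can be left for an imputer inside the pipeline.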


# Model Building

**Bagging** - `Your Answer Here`

**Boosting** - `Your Answer Here`

**Gradient Boosting** - `Your Answer Here`

**Monotonic Function** - `Your Answer Here`

**Hyperparameter Tuning** - `Your Answer Here`

**Pipeline** - `Your Answer Here`

**Overfitting** - `Your Answer Here`

Using your cleaned-up dataframe above, build a model and score it with an appropriate metric and cross-validation.

How do you know if your model is overfitting?

```
Your Answer Here
```
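
One common signal of overfitting is a large gap between the training score and the cross-validated score. The sketch below illustrates that gap on synthetic data (not the autos dataset) by comparing a shallow decision tree with an unrestricted one; the model choice and the `gaps` dictionary are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data (NOT the autos dataset): a noisy linear signal.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=2.0, size=200)

gaps = {}
train_scores = {}
for depth in (2, None):  # None lets the tree keep splitting until leaves are pure
    model = DecisionTreeRegressor(max_depth=depth, random_state=0)
    scores = cross_validate(model, X, y, cv=5, scoring="r2", return_train_score=True)
    train_r2 = scores["train_score"].mean()
    val_r2 = scores["test_score"].mean()
    train_scores[depth] = train_r2
    gaps[depth] = train_r2 - val_r2
    print(f"max_depth={depth}: train R^2={train_r2:.2f}, "
          f"validation R^2={val_r2:.2f}, gap={gaps[depth]:.2f}")
```

The unrestricted tree memorizes the training folds, so its training score is close to perfect while its validation score lags well behind.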

# Model Interpretation

**Confusion Matrix** - `Your Answer Here`

**Permutation Importance** - `Your Answer Here`

**Partial Dependence Plot** - `Your Answer Here`

**Shapley Values** - `Your Answer Here`

**Drop Column Importance** - `Your Answer Here`

Use the model you trained above or the classification model provided to complete the following. Create each of the visuals below and then explain how they help you interpret and/or refine your model.

In [0]:
bank = pd.read_csv('https://raw.githubusercontent.com/bundickm/Study-Guides/master/data/bank.csv')
bank.head()

# Assign to X, y
X = bank.drop(columns='y')
y = bank['y'] == 'yes'

# Drop leaky feature
X = X.drop(columns='duration')

# Split Train, Test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Make pipeline
pipeline = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True), 
    StandardScaler(), 
    LogisticRegression(solver='lbfgs', max_iter=1000)
)

# Predict
y_pred_proba = cross_val_predict(pipeline, X_train, y_train, cv=3, n_jobs=-1, 
                                 method='predict_proba')[:,1]

### Shapley Values
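
Before reaching for the `shap` library linked in the Resources, it can help to see what a Shapley value is on a problem small enough to compute exactly. The sketch below is not the `shap` API: `predict` is a hypothetical toy model with two binary features, and `shapley_values` enumerates every feature ordering, which is only feasible for a handful of features.

```python
from itertools import permutations

def predict(features):
    """Hypothetical toy model: a base price plus effects, with one interaction."""
    price = 10_000
    price += 3_000 * features["turbo"]
    price += 2_000 * features["convertible"]
    price += 1_000 * features["turbo"] * features["convertible"]  # interaction
    return price

def shapley_values(predict_fn, instance, baseline):
    """Exact Shapley values: average each feature's marginal contribution
    over every order in which features can switch from baseline to instance."""
    names = list(instance)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)
        previous = predict_fn(current)
        for name in order:
            current[name] = instance[name]
            new = predict_fn(current)
            totals[name] += new - previous
            previous = new
    return {name: total / len(orderings) for name, total in totals.items()}

instance = {"turbo": 1, "convertible": 1}
baseline = {"turbo": 0, "convertible": 0}
values = shapley_values(predict, instance, baseline)
print(values)

# The contributions add up to prediction(instance) - prediction(baseline).
print(sum(values.values()), predict(instance) - predict(baseline))
```

The 1,000 interaction bonus is split evenly between the two features, which is why `turbo` is credited with 3,500 and `convertible` with 2,500.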

How does the above visual help you to interpret or refine your model?

```
Your Answer Here
```

### Partial Dependence Plot

In [0]:
# PDP, 1 Feature Isolation

In [0]:
# PDP, 2 Feature Interaction
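
As a reference for what a one-feature partial dependence plot computes, here is a hand-rolled sketch on synthetic data (not the bank or autos data). The helper `partial_dependence_1d` is hypothetical; scikit-learn's `sklearn.inspection` module, linked in the Resources, provides built-in tooling that would normally be used instead.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data: the target rises with feature 0 and is unrelated to feature 1.
rng = np.random.default_rng(7)
X = rng.uniform(0, 10, size=(300, 2))
y = 2 * X[:, 0] + rng.normal(scale=1.0, size=300)

model = GradientBoostingRegressor(random_state=0).fit(X, y)

def partial_dependence_1d(model, X, feature_idx, grid):
    """For each grid value, overwrite one feature in every row and average the predictions."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature_idx] = value
        averages.append(model.predict(X_mod).mean())
    return np.array(averages)

grid = np.linspace(0, 10, 6)
pdp_feature0 = partial_dependence_1d(model, X, 0, grid)
pdp_feature1 = partial_dependence_1d(model, X, 1, grid)
print("feature 0:", pdp_feature0.round(1))
print("feature 1:", pdp_feature1.round(1))
```

Plotting `grid` against each array gives the one-feature curves: a steep line for feature 0 and a nearly flat one for feature 1. A two-feature interaction plot follows the same idea, overwriting two columns at once over a grid of value pairs.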

How does the above visual help you to interpret or refine your model?

```
Your Answer Here
```

### Confusion Matrix
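
A minimal sketch of building a confusion matrix from predicted probabilities, using hand-made labels for illustration only. The same thresholding could be applied to `y_pred_proba` and `y_train` from the pipeline cell above; the 0.5 threshold is an assumption, not a recommendation.

```python
from sklearn.metrics import confusion_matrix

# Hand-made labels and predicted probabilities, for illustration only.
y_true = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
y_proba = [0.9, 0.2, 0.6, 0.4, 0.1, 0.8, 0.3, 0.7, 0.55, 0.05]

# Turn probabilities into class predictions with a 0.5 threshold.
threshold = 0.5
y_pred = [int(p >= threshold) for p in y_proba]

# Rows are true classes, columns are predicted classes; with labels [0, 1],
# ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)
print(f"tn={tn} fp={fp} fn={fn} tp={tp}")
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Raising the threshold generally trades recall for precision, which is one way the matrix can guide model refinement.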

How does the above visual help you to interpret or refine your model?

```
Your Answer Here
```

### Permutation Importance
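
The sketch below shows permutation importance with scikit-learn's `permutation_importance` on synthetic data where only the first feature matters. It is an illustration under those assumptions, not the bank model; the `eli5` library linked in the Resources offers a similar tool.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic classification data: only the first feature carries signal.
rng = np.random.default_rng(3)
X = rng.normal(size=(600, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=600) > 0).astype(int)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=3)
model = LogisticRegression().fit(X_train, y_train)

# Shuffle one column at a time on the validation set and record the drop in accuracy.
result = permutation_importance(
    model, X_val, y_val, scoring="accuracy", n_repeats=10, random_state=3
)
for idx in np.argsort(result.importances_mean)[::-1]:
    print(f"feature {idx}: {result.importances_mean[idx]:.3f} "
          f"+/- {result.importances_std[idx]:.3f}")
```

Features whose importance is near zero (or negative) are candidates to drop when refining the model, since shuffling them barely changes the validation score.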

How does the above visual help you to interpret or refine your model?

```
Your Answer Here
```

Use some of the visuals above to further refine your model and rescore. Once you have a validation score you are happy with, use your test set to get a final score for your model.