This study guide should reinforce and provide practice for all of the concepts you have seen in the past week. There is a mix of written questions and coding exercises; both are equally important for preparing you for the sprint challenge and for speaking comfortably about these topics in interviews and on the job.

If you get stuck or are unsure of something, remember the 20-minute rule. If that doesn't help, research a solution with Google and Stack Overflow. Only once you have exhausted these methods should you turn to your Team Lead; they won't be there during your sprint challenge or an interview. That being said, don't hesitate to ask for help if you truly are stuck.

Have fun studying!

# Resources

[XGBoost](https://xgboost.readthedocs.io/en/latest/)

[Shapley Values](https://shap.readthedocs.io/en/latest/)

[eli5](https://eli5.readthedocs.io/en/latest/overview.html)

[Partial Dependence Plot](https://scikit-learn.org/stable/modules/partial_dependence.html)

Use the dataset below to complete the coding challenges throughout the notebook unless otherwise specified.

In [2]:
!pip install category_encoders
import warnings
import category_encoders as ce
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.exceptions import DataConversionWarning
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_predict
warnings.filterwarnings(action='ignore', category=DataConversionWarning)



In [11]:
auto_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data'
columns = ['symboling','norm_loss','make','fuel','aspiration','doors',
           'bod_style','drv_wheels','eng_loc','wheel_base','length','width',
           'height','curb_weight','engine','cylinders','engine_size',
           'fuel_system','bore','stroke','compression','hp','peak_rpm',
           'city_mpg','hgwy_mpg','price']
df = pd.read_csv(auto_url, header=None, names=columns)
pd.set_option('display.max_columns', None)

df.head(50)

Unnamed: 0,symboling,norm_loss,make,fuel,aspiration,doors,bod_style,drv_wheels,eng_loc,wheel_base,length,width,height,curb_weight,engine,cylinders,engine_size,fuel_system,bore,stroke,compression,hp,peak_rpm,city_mpg,hgwy_mpg,price
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,168.8,64.1,48.8,2548,dohc,four,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,171.2,65.5,52.4,2823,ohcv,six,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,176.6,66.2,54.3,2337,ohc,four,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,176.6,66.4,54.3,2824,ohc,five,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450
5,2,?,audi,gas,std,two,sedan,fwd,front,99.8,177.3,66.3,53.1,2507,ohc,five,136,mpfi,3.19,3.4,8.5,110,5500,19,25,15250
6,1,158,audi,gas,std,four,sedan,fwd,front,105.8,192.7,71.4,55.7,2844,ohc,five,136,mpfi,3.19,3.4,8.5,110,5500,19,25,17710
7,1,?,audi,gas,std,four,wagon,fwd,front,105.8,192.7,71.4,55.7,2954,ohc,five,136,mpfi,3.19,3.4,8.5,110,5500,19,25,18920
8,1,158,audi,gas,turbo,four,sedan,fwd,front,105.8,192.7,71.4,55.9,3086,ohc,five,131,mpfi,3.13,3.4,8.3,140,5500,17,20,23875
9,0,?,audi,gas,turbo,two,hatchback,4wd,front,99.5,178.2,67.9,52.0,3053,ohc,five,131,mpfi,3.13,3.4,7.0,160,5500,16,22,?


# Data Cleaning and Exploring

Define the following terms in your own words. Do not simply copy and paste a definition found elsewhere; reword it so that it is understandable and memorable to you. *Double click the markdown to add your definitions.*
<br/><br/>

**Outliers** - `Extreme values that deviate markedly from the other observations in the data`

**Skew** - `Data is skewed when its distribution is asymmetric, with a longer tail stretching either to the left or to the right.`

**Log Transformation** - `Taking the logarithm of a skewed (usually right-skewed) variable so that its distribution is closer to normal`

**Leakage** - `When the model is trained on information that would not be available at prediction time, such as a feature derived from the target`
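
To make the first three definitions concrete, here is a minimal sketch on synthetic data (the log-normal values are an assumption standing in for a right-skewed column such as `price`). It measures skew before and after a log transform and flags outliers with the common 1.5 × IQR rule, which is only one of several reasonable conventions.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Synthetic right-skewed values (log-normal); a stand-in, not the autos data
values = pd.Series(rng.lognormal(mean=9.0, sigma=0.8, size=1000))

# Skew: positive means a long right tail, near zero means roughly symmetric
skew_before = values.skew()
skew_after = np.log1p(values).skew()

# Outliers: flag points more than 1.5 * IQR outside the middle 50% of the data
q1, q3 = values.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

print(f"skew before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
print(f"outliers flagged: {len(outliers)}")
```

If you log-transform a target, remember to invert the transform (for example with `np.expm1`) before reporting errors in the original units.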

Throughout this unit we have been building predictive models and consistently repeating certain steps, such as loading data and making a train-test split. In the space below, list the steps needed to build a machine learning model. Make sure to include all major steps, from sourcing and loading data all the way to scoring on your test set. If time permits, go into more detail than just the major steps. Feel free to look back at assignments and lecture notebooks, or use Google to see the workflow steps others use, and adapt them for yourself.

```
# 1. Arrange X features matrix & y target vector
# 2. Create a train-test split
# 3. Import the appropriate estimator class from Scikit-Learn
# 4. Instantiate this class
# 5. Fit the model to your training data
# 6. Look at your error metric for both training and testing sets
```
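
The steps above can be sketched end to end as follows. This is only an illustration under assumptions: the data is synthetic (`make_regression`), and `LinearRegression` with mean absolute error is an arbitrary choice of estimator and metric, not a recommendation for the autos dataset.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression  # 3. Import the estimator class
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# 1. Arrange X features matrix & y target vector (synthetic stand-in data)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# 2. Create a train-test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# 4. Instantiate the class
model = LinearRegression()

# 5. Fit the model to the training data only
model.fit(X_train, y_train)

# 6. Compare the error metric on the training and testing sets
train_mae = mean_absolute_error(y_train, model.predict(X_train))
test_mae = mean_absolute_error(y_test, model.predict(X_test))
print(f"train MAE: {train_mae:.2f}, test MAE: {test_mae:.2f}")
```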

How do we detect leakage and what are some examples?

```
Your Answer Here
```
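
One heuristic you might include in your answer: score each feature on its own and treat a near-perfect single-feature score as a red flag. The sketch below uses synthetic data; the column names `honest_feature` and `leaky_feature` and the 0.95 threshold are hypothetical choices for illustration. A check like this does not replace asking whether a feature would actually be known at prediction time.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 500
y = rng.integers(0, 2, size=n)

# Hypothetical features: one weak but legitimate, one nearly a copy of the target
X = pd.DataFrame({
    "honest_feature": y + rng.normal(0, 2.0, size=n),
    "leaky_feature": y + rng.normal(0, 0.05, size=n),
})

# Cross-validated ROC AUC for a model trained on each feature alone
single_feature_auc = {
    col: cross_val_score(LogisticRegression(), X[[col]], y,
                         cv=5, scoring="roc_auc").mean()
    for col in X.columns
}

# The 0.95 cutoff is arbitrary; tune it to what is plausible in your domain
suspects = [col for col, auc in single_feature_auc.items() if auc > 0.95]
print(single_feature_auc)
print("possible leakage:", suspects)
```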

Use your Machine Learning Workflow above to load and prep the dataframe above. When you get to feature selection, choose a subset to include in your model, and use code comments to justify why you kept or dropped each feature.

In [14]:
# 1. Arrange X features matrix & y target vector
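def wrangle(df, thresh=350):
    # Dropping nulls: the raw file marks missing values with '?'.
    # replace() returns a new dataframe, so the caller's df is untouched.
    # (`thresh` is not used yet.)
    df = df.replace('?', float('nan')).dropna(subset=['price'])

    # Split labels from feature matrix
    y = df['price'].astype(float)
    X = df.drop(columns='price')

    return X, y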

In [16]:
# 2. Create train test split
from sklearn.model_selection import train_test_split
# `stratify` is meant for class labels, so it is omitted for the continuous price target
X, y = wrangle(df)
X_train, X_val, y_train, y_val = train_test_split(X, y, train_size=0.80, test_size=0.20,
                                                  random_state=42)

In [None]:
# 

# Model Building

**Bagging** - `Your Answer Here`

**Boosting** - `Your Answer Here`

**Gradient Boosting** - `Your Answer Here`

**Monotonic Function** - `Your Answer Here`

**Hyperparameter Tuning** - `Your Answer Here`

**Pipeline** - `Your Answer Here`

**Overfitting** - `Your Answer Here`

Using your cleaned up dataframe above, build a model and score it with an appropriate metric and cross validation.
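
As a rough template for this exercise, the sketch below scores a pipeline with 5-fold cross validation. It assumes synthetic regression data and scikit-learn's `GradientBoostingRegressor`; an XGBoost `XGBRegressor` could be dropped into the same pipeline. Mean absolute error is one reasonable metric for a price-like target, not the only one.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for a cleaned feature matrix and numeric target
X, y = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=0)

# Keeping preprocessing inside the pipeline means it is refit on each
# training fold, which avoids leaking validation-fold statistics
pipeline = make_pipeline(
    SimpleImputer(strategy="median"),
    GradientBoostingRegressor(n_estimators=50, random_state=0),
)

# scikit-learn maximizes scores, so MAE is reported as a negative number
scores = cross_val_score(pipeline, X, y, cv=5,
                         scoring="neg_mean_absolute_error")
mae_per_fold = -scores
print("MAE per fold:", mae_per_fold.round(2))
print("mean MAE:", mae_per_fold.mean().round(2))
```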

How do you know if your model is overfitting?

```
Your Answer Here
```
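
A common symptom to look for is a large gap between training and validation scores. The sketch below, on synthetic noisy data, compares a fully grown decision tree with a depth-limited one; the dataset parameters and the choice of `max_depth=3` are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# flip_y adds label noise, so a perfect training score must be memorization
X, y = make_classification(n_samples=400, n_features=10, n_informative=3,
                           flip_y=0.2, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y)

scores = {}
for name, depth in [("unlimited depth", None), ("max_depth=3", 3)]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=1)
    tree.fit(X_train, y_train)
    scores[name] = (tree.score(X_train, y_train), tree.score(X_val, y_val))

for name, (train_acc, val_acc) in scores.items():
    print(f"{name}: train={train_acc:.2f}, val={val_acc:.2f}, "
          f"gap={train_acc - val_acc:.2f}")
```

The unconstrained tree fits the training set perfectly yet scores noticeably lower on validation data; that gap is the signal to watch.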

# Model Interpretation

**Confusion Matrix** - `Your Answer Here`

**Permutation Importance** - `Your Answer Here`

**Partial Dependence Plot** - `Your Answer Here`

**Shapley Values** - `Your Answer Here`

**Drop Column Importance** - `Your Answer Here`

Use the model you trained above or the classification model provided to complete the following. Create each of the visuals below and then explain how they help you interpret and/or refine your model.

In [37]:
bank = pd.read_csv('https://raw.githubusercontent.com/bundickm/Study-Guides/master/data/bank.csv')
bank.head()

# Assign to X, y
X = bank.drop(columns='y')
y = bank['y'] == 'yes'

# Drop leaky feature
X = X.drop(columns='duration')

# Split Train, Test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Make pipeline
pipeline = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True), 
    StandardScaler(), 
    LogisticRegression(solver='lbfgs', max_iter=1000)
)

# Predict
y_pred_proba = cross_val_predict(pipeline, X_train, y_train, cv=3, n_jobs=-1, 
                                 method='predict_proba')[:,1]

In [38]:
pipeline.fit(X_train, y_train)

Pipeline(steps=[('onehotencoder',
                 OneHotEncoder(cols=['job', 'marital', 'education', 'default',
                                     'housing', 'loan', 'contact', 'month',
                                     'day_of_week', 'poutcome'],
                               use_cat_names=True)),
                ('standardscaler', StandardScaler()),
                ('logisticregression', LogisticRegression(max_iter=1000))])

In [35]:
# `search` is never defined in this notebook, so use the pipeline fitted above.
row = X_test.iloc[[0]]
model = pipeline
model.predict(row)

### Shapley Values

In [18]:
row = X_test.iloc[[0]]
row

Unnamed: 0.1,Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed
14455,14455,32,management,divorced,university.degree,no,no,no,cellular,jul,tue,5,999,0,nonexistent,1.4,93.918,-42.7,4.961,5228.1


In [24]:
y_test.iloc[[0]]

14455    False
Name: y, dtype: bool

In [34]:
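import shap

# TreeExplainer only supports tree models and cannot take a Pipeline, so
# explain the final LogisticRegression step with LinearExplainer instead.
encoder = pipeline.named_steps['onehotencoder']
scaler = pipeline.named_steps['standardscaler']
model = pipeline.named_steps['logisticregression']
# The explainer must see the same transformed inputs the model sees
feature_names = list(encoder.transform(row).columns)
X_train_scaled = scaler.transform(encoder.transform(X_train))
row_scaled = scaler.transform(encoder.transform(row))
explainer = shap.LinearExplainer(model, X_train_scaled)
shap_values = explainer.shap_values(row_scaled)
shap.initjs()
shap.force_plot(
    base_value=explainer.expected_value,
    shap_values=shap_values,
    features=row_scaled,
    feature_names=feature_names
)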

How does the above visual help you to interpret or refine your model?

```
Your Answer Here
```

### Partial Dependence Plot

In [None]:
# PDP, 1 Feature Isolation

In [None]:
# PDP, 2 Feature Interaction
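
As a starting point for the two cells above, here is a minimal sketch that computes the numbers behind a partial dependence plot with scikit-learn's `partial_dependence`. The synthetic data and the `GradientBoostingRegressor` are assumptions; for the actual visual, `PartialDependenceDisplay.from_estimator` from `sklearn.inspection` draws the same quantity.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import partial_dependence

# Synthetic stand-in data and model
X, y = make_regression(n_samples=300, n_features=4, noise=5.0, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

# PDP, 1 feature isolation: average prediction as feature 0 sweeps a grid
pd_one = partial_dependence(model, X, features=[0], grid_resolution=20)

# PDP, 2 feature interaction: average prediction over a grid of features 0 and 1
pd_two = partial_dependence(model, X, features=[0, 1], grid_resolution=10)

print(pd_one["average"].shape)  # one output, 20 grid points
print(pd_two["average"].shape)  # one output, 10 x 10 grid
```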

How does the above visual help you to interpret or refine your model?

```
Your Answer Here
```

### Confusion Matrix
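
As a small worked example, the sketch below builds a confusion matrix from hand-written labels (the `y_true` and `y_pred` lists are made up for illustration) and derives precision and recall from its four cells. With your own model, you would pass the true labels and the model's predictions instead.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 is the positive class
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 0, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)  # of the predicted positives, how many were right
recall = tp / (tp + fn)     # of the actual positives, how many were found

print(cm)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```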

How does the above visual help you to interpret or refine your model?

```
Your Answer Here
```

### Permutation Importance
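
One way to approach this section is scikit-learn's `permutation_importance`, sketched below on synthetic data where only the first two columns carry signal (an assumption enforced by `shuffle=False`). The random forest is an arbitrary model choice; eli5's `PermutationImportance` is another option.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# shuffle=False keeps the 2 informative features in columns 0 and 1
X, y = make_classification(n_samples=500, n_features=6, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, random_state=0, stratify=y)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Shuffle each column of the validation set in turn and record the score drop
result = permutation_importance(model, X_val, y_val,
                                n_repeats=10, random_state=0)

# Feature indices sorted from most to least important
ranking = result.importances_mean.argsort()[::-1]
for i in ranking:
    print(f"feature {i}: {result.importances_mean[i]:.3f} "
          f"+/- {result.importances_std[i]:.3f}")
```

Computing importances on held-out data rather than the training set shows which features the model relies on to generalize.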

How does the above visual help you to interpret or refine your model?

```
Your Answer Here
```

Use some of the visuals above to further refine your model and rescore. Once you have a validation score you are happy with, use your test set to get a final score for your model.