This study guide should reinforce and provide practice for all of the concepts you have seen in the past week. There is a mix of written questions and coding exercises; both are equally important for preparing you for the sprint challenge and for speaking about these topics comfortably in interviews and on the job.

If you get stuck or are unsure of something, remember the 20-minute rule. If that doesn't help, research a solution with Google and Stack Overflow. Only once you have exhausted these methods should you turn to your Team Lead; they won't be there during your sprint challenge (SC) or an interview. That being said, don't hesitate to ask for help if you truly are stuck.

Have fun studying!

# Resources

[XGBoost](https://xgboost.readthedocs.io/en/latest/)

[Shapley Values](https://shap.readthedocs.io/en/latest/)

[eli5](https://eli5.readthedocs.io/en/latest/overview.html)

[Partial Dependence Plot](https://scikit-learn.org/stable/modules/partial_dependence.html)

Use the dataset below to complete the coding challenges throughout the notebook unless otherwise specified.

In [0]:
!pip install category_encoders
import warnings
import category_encoders as ce
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.exceptions import DataConversionWarning
from sklearn.preprocessing import StandardScaler
warnings.filterwarnings(action='ignore', category=DataConversionWarning)

In [0]:
auto_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data'
columns = ['symboling','norm_loss','make','fuel','aspiration','doors',
           'bod_style','drv_wheels','eng_loc','wheel_base','length','width',
           'height','curb_weight','engine','cylinders','engine_size',
           'fuel_system','bore','stroke','compression','hp','peak_rpm',
           'city_mpg','hgwy_mpg','price']
df = pd.read_csv(auto_url, header=None, names=columns)

df.head()

# Data Cleaning and Exploring

Define the following terms in your own words. Do not simply copy and paste a definition found elsewhere; reword it so that it is understandable and memorable to you. *Double click the markdown to add your definitions.*
<br/><br/>

**Outliers** - `Your Answer Here`

**Skew** - `Your Answer Here`

**Log Transformation** - `Your Answer Here`

**Leakage** - `Your Answer Here`
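As a reference point for skew and log transformation, here is a minimal sketch. The `prices` values are made up for illustration (they are not taken from the autos dataset), and `np.log1p` is only one common choice of log transform.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed values: most are small, one is very large
prices = pd.Series([5_000, 6_000, 6_500, 7_000, 8_000, 9_000, 12_000, 45_000])

# log1p computes log(1 + x), which compresses the long right tail
log_prices = np.log1p(prices)

# Positive skew means a long right tail; the transform should reduce it
skew_before = prices.skew()
skew_after = log_prices.skew()
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")

# expm1 is the inverse of log1p, so values can be mapped back to the original scale
restored = np.expm1(log_prices)
```

If a target is log-transformed before fitting, predictions come out on the log scale and need the inverse step shown on the last line before they are reported.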

Throughout this unit we have been building predictive models and consistently performing certain steps, such as loading data and doing a train-test split. In the space below, list the steps needed to build a machine learning model. Make sure to include all major steps, from sourcing and loading data all the way to scoring on your test set. Greater detail than just the major steps is encouraged if time permits. Feel free to look back at assignments and lecture notebooks, or use Google to see the workflow steps others use, and adapt them to suit you.

```
Your Answer Here
```

How do we detect leakage and what are some examples?

```
Your Answer Here
```

Use your Machine Learning Workflow above to load and prep the dataframe above. When you get to feature selection, choose a subset of features to include in your model, and use code comments to justify why you kept or dropped each feature.

# Model Building

**Bagging** - `Your Answer Here`

**Boosting** - `Your Answer Here`

**Gradient Boosting** - `Your Answer Here`

**Monotonic Function** - `Your Answer Here`

**Hyperparameter Tuning** - `Your Answer Here`

**Pipeline** - `Your Answer Here`

**Overfitting** - `Your Answer Here`

Using your cleaned-up dataframe above, build a model and score it with an appropriate metric and cross-validation.
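One possible shape for the cross-validated scoring step is sketched below. It runs on synthetic stand-in data rather than the autos dataframe, and the model, metric, and names (`X_demo`, `y_demo`, `deep_tree`) are assumptions chosen only to illustrate the mechanics, not a recommended solution.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; flip_y adds label noise so memorizing the
# training folds does not carry over to the held-out folds
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=3,
    flip_y=0.2, random_state=42)

# A decision tree with no depth limit is free to memorize its training data
deep_tree = DecisionTreeClassifier(random_state=42)

# return_train_score=True reports the score on the training folds
# alongside the score on each held-out validation fold
results = cross_validate(
    deep_tree, X_demo, y_demo, cv=5,
    scoring='accuracy', return_train_score=True)

train_mean = results['train_score'].mean()
val_mean = results['test_score'].mean()
print(f"train accuracy: {train_mean:.3f}, validation accuracy: {val_mean:.3f}")
```

Reporting both numbers side by side makes the gap between training and validation performance visible, which is worth keeping in mind for the question that follows.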

How do you know if your model is overfitting?

```
Your Answer Here
```

# Model Interpretation

**Confusion Matrix** - `Your Answer Here`

**Permutation Importance** - `Your Answer Here`

**Partial Dependence Plot** - `Your Answer Here`

**Shapley Values** - `Your Answer Here`

**Drop Column Importance** - `Your Answer Here`

Use the model you trained above or the classification model provided to complete the following. Create each of the visuals below and then explain how they help you interpret and/or refine your model.

In [0]:
bank = pd.read_csv('https://raw.githubusercontent.com/bundickm/Study-Guides/master/data/bank.csv')
bank.head()

# Assign to X, y
X = bank.drop(columns='y')
y = bank['y'] == 'yes'

# Drop leaky feature
X = X.drop(columns='duration')

# Split Train, Test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Make pipeline
pipeline = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True), 
    StandardScaler(), 
    LogisticRegression(solver='lbfgs', max_iter=1000)
)

# Predict
y_pred_proba = cross_val_predict(pipeline, X_train, y_train, cv=3, n_jobs=-1, 
                                 method='predict_proba')[:,1]

### Shapley Values

How does the above visual help you to interpret or refine your model?

```
Your Answer Here
```

### Partial Dependence Plot

In [0]:
# PDP, 1 Feature Isolation

In [0]:
# PDP, 2 Feature Interaction

How does the above visual help you to interpret or refine your model?

```
Your Answer Here
```

### Confusion Matrix

How does the above visual help you to interpret or refine your model?

```
Your Answer Here
```
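Because the pipeline above produces probabilities rather than class labels, a threshold has to be chosen before a confusion matrix can be built. The sketch below shows that step on a tiny hand-made example; `y_true`, `proba`, and the helper `counts_at` are hypothetical stand-ins, not values from the bank dataset.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hand-made labels and predicted probabilities for the positive class
y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1])
proba = np.array([0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.45, 0.9])

def counts_at(threshold):
    """Return (tn, fp, fn, tp) after thresholding the probabilities."""
    y_hat = (proba >= threshold).astype(int)
    # For binary labels, ravel() flattens [[tn, fp], [fn, tp]] row by row
    return tuple(int(v) for v in confusion_matrix(y_true, y_hat).ravel())

print(counts_at(0.5))   # (4, 1, 1, 2)
print(counts_at(0.35))  # (3, 2, 0, 3)
```

Lowering the threshold here removes the one false negative at the cost of an extra false positive, which is the trade-off a confusion matrix makes visible.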

### Permutation Importance

How does the above visual help you to interpret or refine your model?

```
Your Answer Here
```
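The mechanics of permutation importance can be sketched with scikit-learn's `permutation_importance`, used here in place of the eli5 tool listed under Resources. The data is synthetic and the feature names `signal` and `noise` are made up: the first column fully determines the label and the second is unrelated to it.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data: column 0 determines the label, column 1 is pure noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 2))
y_demo = (X_demo[:, 0] > 0).astype(int)

X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

model = LogisticRegression().fit(X_tr, y_tr)

# Shuffle one column at a time on the validation set and measure
# how much the score drops, repeated 10 times per column
result = permutation_importance(
    model, X_val, y_val, n_repeats=10, random_state=0)

for name, mean, std in zip(['signal', 'noise'],
                           result.importances_mean,
                           result.importances_std):
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```

Importances are computed on held-out data here so that they reflect what the model relies on when generalizing, not what it memorized during training.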

Use some of the visuals above to further refine your model and rescore it. Once you have a validation score you are happy with, use your test set to get a final score for your model.