This study guide should reinforce and provide practice for all of the concepts you have seen in the past week. There is a mix of written questions and coding exercises; both are equally important for preparing you for the sprint challenge and for speaking about these topics comfortably in interviews and on the job.

If you get stuck or are unsure of something, remember the 20-minute rule. If that doesn't help, research a solution with Google and Stack Overflow. Only once you have exhausted these methods should you turn to your Team Lead; they won't be there during your sprint challenge or an interview. That being said, don't hesitate to ask for help if you truly are stuck.

Have fun studying!

# Resources

[XGBoost](https://xgboost.readthedocs.io/en/latest/)

[Shapley Values](https://shap.readthedocs.io/en/latest/)

[eli5](https://eli5.readthedocs.io/en/latest/overview.html)

[Partial Dependence Plot](https://scikit-learn.org/stable/modules/partial_dependence.html)

Use the dataset below to complete the coding challenges throughout the notebook unless otherwise specified.

In [1]:
import warnings
import numpy as np
import category_encoders as ce
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.exceptions import DataConversionWarning
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_predict
warnings.filterwarnings(action='ignore', category=DataConversionWarning)

In [2]:
auto_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data'
columns = ['symboling','norm_loss','make','fuel','aspiration','doors',
           'bod_style','drv_wheels','eng_loc','wheel_base','length','width',
           'height','curb_weight','engine','cylinders','engine_size',
           'fuel_system','bore','stroke','compression','hp','peak_rpm',
           'city_mpg','hgwy_mpg','price']
df = pd.read_csv(auto_url, header=None, names=columns)

df.head()

Unnamed: 0,symboling,norm_loss,make,fuel,aspiration,doors,bod_style,drv_wheels,eng_loc,wheel_base,...,engine_size,fuel_system,bore,stroke,compression,hp,peak_rpm,city_mpg,hgwy_mpg,price
0,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,13495
1,3,?,alfa-romero,gas,std,two,convertible,rwd,front,88.6,...,130,mpfi,3.47,2.68,9.0,111,5000,21,27,16500
2,1,?,alfa-romero,gas,std,two,hatchback,rwd,front,94.5,...,152,mpfi,2.68,3.47,9.0,154,5000,19,26,16500
3,2,164,audi,gas,std,four,sedan,fwd,front,99.8,...,109,mpfi,3.19,3.4,10.0,102,5500,24,30,13950
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450


# Data Cleaning and Exploring

Define the following terms in your own words. Do not simply copy and paste a definition found elsewhere; reword it so that it is understandable and memorable to you. *Double click the markdown to add your definitions.*
<br/><br/>

**Outliers** - `Extreme observations outside the range of expected values`

**Skew** - `A measure of how asymmetric a distribution is about its mean; a symmetric distribution such as the normal has a skew of zero`

**Log Transformation** - `A data processing technique that replaces each value with its logarithm, which compresses large values and makes a right-skewed distribution more symmetric`

**Leakage** - `Information that would not be available at prediction time (for example, from the target or the test set) is used in training, so scores look better than the model's real-world performance`
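
To make skew and the log transformation concrete, here is a minimal sketch on synthetic data. The log-normal `prices` series is made up for illustration and is not the autos dataset. It measures skew before and after `np.log1p` and shows that `np.expm1` reverses the transform.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" (log-normal); an assumption for illustration.
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=9.0, sigma=0.6, size=1000))

skew_before = prices.skew()           # clearly positive: long right tail
skew_after = np.log1p(prices).skew()  # much closer to zero after the transform

print(f"skew before: {skew_before:.2f}, after log1p: {skew_after:.2f}")

# np.expm1 undoes np.log1p, so predictions made on the log scale
# can be mapped back to the original units.
restored = np.expm1(np.log1p(prices))
```

If a model is trained on a log-transformed target, its predictions need the inverse transform before they are reported in dollars.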

Throughout this unit we have been building predictive models and consistently repeating certain steps, such as loading data and doing a train-test split. In the space below, list the steps needed to build a machine learning model. Make sure to include all major steps, from sourcing and loading data all the way to scoring on your test set. Greater detail than just the major steps is encouraged if time permits. Feel free to look back at assignments and lecture notebooks, or use Google to see the workflow steps others use and adapt them for yourself.

```
1. Split the data: train-validate-test split or a cross-validation approach.
2. Choose a class of model by importing the appropriate estimator class from Scikit-Learn.
3. Choose model hyperparameters by instantiating this class with desired values.
4. Arrange data into a features matrix and target vector.
5. Fit the model to your data by calling the fit() method of the model instance.
6. Apply the model to new data:
   - For supervised learning, often we predict labels for unknown data using the predict() method.
   - For unsupervised learning, we often transform or infer properties of the data using the transform() or predict() method.

Thanks JakeVDP: https://jakevdp.github.io/PythonDataScienceHandbook/05.02-introducing-scikit-learn.html
```
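
As a rough illustration, the workflow steps can be sketched end to end as follows. The data is synthetic (`make_regression`), and the choice of `LinearRegression` and the split sizes are assumptions for the example, not a recommendation for the autos data.

```python
from sklearn.datasets import make_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# 1. Source and load data (synthetic here so the sketch is self-contained).
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# 2. Split first, so every later decision is based on training data only.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.2, random_state=0)

# 3. Choose a model class and hyperparameters; keep preprocessing in the pipeline.
model = make_pipeline(SimpleImputer(), StandardScaler(), LinearRegression())

# 4. Fit on the training set only.
model.fit(X_train, y_train)

# 5. Compare candidate models on the validation set.
val_r2 = r2_score(y_val, model.predict(X_val))

# 6. Score on the test set once, after all choices are final.
test_r2 = r2_score(y_test, model.predict(X_test))

print(f"validation R^2: {val_r2:.3f}, test R^2: {test_r2:.3f}")
```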

How do we detect leakage and what are some examples?

```
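A model with near-perfect accuracy is highly suspect of data leakage. Check for features that would not be known at prediction time; for example, the `duration` column in the bank dataset used later in this guide is dropped as a leaky feature.

Fitting preprocessing steps inside each cross-validation fold (for example, with a pipeline) helps keep validation data from leaking into training.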
```
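
One way to see what leakage looks like is to plant a leaky column on purpose. The sketch below uses synthetic data; the `leaky` column (the label plus a little noise) and the shallow decision tree are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic labels (flip_y), so an honest model cannot be perfect.
X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           flip_y=0.2, random_state=0)

# A hypothetical leaky feature: the target itself plus a tiny amount of noise.
rng = np.random.default_rng(0)
leaky = y + rng.normal(scale=0.01, size=len(y))
X_leaky = np.column_stack([X, leaky])

model = DecisionTreeClassifier(max_depth=3, random_state=0)
honest_acc = cross_val_score(model, X, y, cv=5).mean()
leaky_acc = cross_val_score(model, X_leaky, y, cv=5).mean()

print(f"without leak: {honest_acc:.3f}, with leak: {leaky_acc:.3f}")
```

Note that cross-validation does not remove this kind of leak: the leaky column is present in every fold, so the score stays suspiciously high until the column is dropped.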

Use your Machine Learning Workflow above to load and prep the dataframe above. When you get to feature selection, choose a subset to include in your model, and use code comments to justify why you kept or dropped each feature.

In [62]:
df = df[df["price"] != "?"]

In [3]:
from sklearn.model_selection import train_test_split

In [63]:
train, test = train_test_split(
    df, 
    train_size=0.8, 
    test_size=0.2, 
    random_state=8
)

train, validate = train_test_split(
    train, 
    train_size=0.8, 
    test_size=0.2, 
    random_state=8
)

train.shape, validate.shape, test.shape

((128, 26), (32, 26), (41, 26))

In [64]:
train.head()

Unnamed: 0,symboling,norm_loss,make,fuel,aspiration,doors,bod_style,drv_wheels,eng_loc,wheel_base,...,engine_size,fuel_system,bore,stroke,compression,hp,peak_rpm,city_mpg,hgwy_mpg,price
4,2,164,audi,gas,std,four,sedan,4wd,front,99.4,...,136,mpfi,3.19,3.4,8.0,115,5500,18,22,17450
159,0,91,toyota,diesel,std,four,hatchback,fwd,front,95.7,...,110,idi,3.27,3.35,22.5,56,4500,38,47,7788
193,0,?,volkswagen,gas,std,four,wagon,fwd,front,100.4,...,109,mpfi,3.19,3.4,9.0,88,5500,25,31,12290
22,1,118,dodge,gas,std,two,hatchback,fwd,front,93.7,...,90,2bbl,2.97,3.23,9.4,68,5500,31,38,6377
100,0,106,nissan,gas,std,four,sedan,fwd,front,97.2,...,120,2bbl,3.33,3.47,8.5,97,5200,27,34,9549


In [65]:
train.columns

Index(['symboling', 'norm_loss', 'make', 'fuel', 'aspiration', 'doors',
       'bod_style', 'drv_wheels', 'eng_loc', 'wheel_base', 'length', 'width',
       'height', 'curb_weight', 'engine', 'cylinders', 'engine_size',
       'fuel_system', 'bore', 'stroke', 'compression', 'hp', 'peak_rpm',
       'city_mpg', 'hgwy_mpg', 'price'],
      dtype='object')

In [66]:
train.corr() > 0.5

Unnamed: 0,symboling,wheel_base,length,width,height,curb_weight,engine_size,compression,city_mpg,hgwy_mpg
symboling,True,False,False,False,False,False,False,False,False,False
wheel_base,False,True,True,True,True,True,True,False,False,False
length,False,True,True,True,True,True,True,False,False,False
width,False,True,True,True,False,True,True,False,False,False
height,False,True,True,False,True,False,False,False,False,False
curb_weight,False,True,True,True,False,True,True,False,False,False
engine_size,False,True,True,True,False,True,True,False,False,False
compression,False,False,False,False,False,False,False,True,False,False
city_mpg,False,False,False,False,False,False,False,False,True,True
hgwy_mpg,False,False,False,False,False,False,False,False,True,True


In [67]:
train.isna().sum()

symboling      0
norm_loss      0
make           0
fuel           0
aspiration     0
doors          0
bod_style      0
drv_wheels     0
eng_loc        0
wheel_base     0
length         0
width          0
height         0
curb_weight    0
engine         0
cylinders      0
engine_size    0
fuel_system    0
bore           0
stroke         0
compression    0
hp             0
peak_rpm       0
city_mpg       0
hgwy_mpg       0
price          0
dtype: int64

In [68]:
train.dtypes

symboling        int64
norm_loss       object
make            object
fuel            object
aspiration      object
doors           object
bod_style       object
drv_wheels      object
eng_loc         object
wheel_base     float64
length         float64
width          float64
height         float64
curb_weight      int64
engine          object
cylinders       object
engine_size      int64
fuel_system     object
bore            object
stroke          object
compression    float64
hp              object
peak_rpm        object
city_mpg         int64
hgwy_mpg         int64
price           object
dtype: object

In [45]:
train["norm_loss"].head() #Should be numeric

24     148
155     91
165    168
46       ?
147     89
Name: norm_loss, dtype: object

In [60]:
train["norm_loss"].value_counts() #Replace ? with np.NaN

?      24
161     9
134     5
85      5
91      5
74      4
168     4
150     4
128     4
104     4
95      3
115     3
103     3
154     3
122     3
125     3
110     2
81      2
106     2
119     2
194     2
101     2
113     2
137     2
118     2
129     2
102     2
83      2
192     2
77      1
142     1
78      1
153     1
94      1
148     1
93      1
158     1
65      1
87      1
188     1
197     1
89      1
98      1
108     1
121     1
145     1
186     1
231     1
Name: norm_loss, dtype: int64

In [46]:
train["make"].head() #Object is fine

24      dodge
155    toyota
165    toyota
46      isuzu
147    subaru
Name: make, dtype: object

In [61]:
train["make"].value_counts()

toyota           19
mazda            13
nissan           12
honda            10
mitsubishi        9
subaru            7
peugot            7
plymouth          7
volkswagen        6
volvo             6
dodge             5
bmw               5
mercedes-benz     5
saab              4
porsche           3
isuzu             3
chevrolet         3
audi              3
jaguar            2
alfa-romero       1
renault           1
Name: make, dtype: int64

In [47]:
train["fuel"].head() #Object is fine

24     gas
155    gas
165    gas
46     gas
147    gas
Name: fuel, dtype: object

In [62]:
train["fuel"].value_counts()

gas       120
diesel     11
Name: fuel, dtype: int64

In [48]:
train["aspiration"].head() #Object is fine

24     std
155    std
165    std
46     std
147    std
Name: aspiration, dtype: object

In [63]:
train["aspiration"].value_counts()

std      110
turbo     21
Name: aspiration, dtype: int64

In [49]:
train["doors"].head() #Object is fine

24     four
155    four
165     two
46      two
147    four
Name: doors, dtype: object

In [67]:
train["doors"].value_counts() #Replace ? with np.NaN

four    71
two     59
?        1
Name: doors, dtype: int64

In [50]:
train["bod_style"].head() #Object is fine

24     hatchback
155        wagon
165        sedan
46     hatchback
147        wagon
Name: bod_style, dtype: object

In [65]:
train["bod_style"].value_counts()

sedan          63
hatchback      46
wagon          15
hardtop         5
convertible     2
Name: bod_style, dtype: int64

In [51]:
train["eng_loc"].head() #Object is fine

24     front
155    front
165    front
46     front
147    front
Name: eng_loc, dtype: object

In [66]:
train["eng_loc"].value_counts()

front    130
rear       1
Name: eng_loc, dtype: int64

In [52]:
train["engine"].head() #Object is fine

24      ohc
155     ohc
165    dohc
46      ohc
147    ohcf
Name: engine, dtype: object

In [68]:
train["engine"].value_counts()

ohc      93
ohcv     12
l         8
ohcf      8
dohc      6
rotor     3
dohcv     1
Name: engine, dtype: int64

In [53]:
train["cylinders"].head() #Object is fine

24     four
155    four
165    four
46     four
147    four
Name: cylinders, dtype: object

In [69]:
train["cylinders"].value_counts()

four      102
six        14
eight       5
five        5
two         3
three       1
twelve      1
Name: cylinders, dtype: int64

In [54]:
train["fuel_system"].head() #Object is fine

24     2bbl
155    2bbl
165    mpfi
46     spfi
147    mpfi
Name: fuel_system, dtype: object

In [70]:
train["fuel_system"].value_counts()

mpfi    54
2bbl    47
idi     11
1bbl     9
spdi     6
4bbl     2
spfi     1
mfi      1
Name: fuel_system, dtype: int64

In [55]:
train["bore"].head() #Should be numeric

24     2.97
155    3.05
165    3.24
46     3.43
147    3.62
Name: bore, dtype: object

In [71]:
train["bore"].value_counts() #Replace ? with np.NaN

3.62    15
3.15    12
2.97    10
3.19     9
3.43     7
3.39     6
3.46     6
3.03     6
2.91     5
3.05     5
3.54     5
3.01     4
3.70     3
3.31     3
3.35     3
?        3
3.17     3
3.78     3
3.80     2
3.58     2
3.50     2
3.27     2
3.24     2
3.94     2
3.59     2
3.61     1
3.34     1
3.60     1
2.92     1
2.68     1
3.74     1
3.63     1
3.13     1
3.08     1
Name: bore, dtype: int64

In [56]:
train["stroke"].head() #Should be numeric

24     3.23
155    3.03
165    3.08
46     3.23
147    2.64
Name: stroke, dtype: object

In [72]:
train["stroke"].value_counts() #Replace ? with np.NaN

3.23    12
3.03    10
3.40    10
3.39     9
3.15     7
3.29     7
3.46     7
2.64     6
3.11     5
3.50     5
3.07     5
3.27     5
3.58     5
3.19     4
3.35     4
3.41     4
3.52     3
?        3
3.86     2
3.10     2
3.90     2
3.64     2
3.08     2
2.80     2
4.17     1
3.21     1
2.87     1
2.90     1
3.54     1
3.47     1
2.36     1
2.76     1
Name: stroke, dtype: int64

In [57]:
train["hp"].head() #Should be numeric

24      68
155     62
165    112
46      90
147     94
Name: hp, dtype: object

In [73]:
train["hp"].value_counts() #Replace ? with np.NaN

68     14
116     8
69      8
70      7
160     6
84      5
101     5
62      5
110     5
88      5
86      4
97      3
145     3
76      3
95      3
184     2
112     2
52      2
82      2
155     2
114     2
182     2
152     2
111     2
200     1
134     1
162     1
288     1
154     1
90      1
?       1
85      1
207     1
64      1
143     1
135     1
92      1
161     1
78      1
72      1
123     1
262     1
60      1
102     1
142     1
176     1
48      1
94      1
56      1
58      1
106     1
121     1
73      1
Name: hp, dtype: int64

In [59]:
train["peak_rpm"].head() #Should be numeric

24     5500
155    4800
165    6600
46     5000
147    5200
Name: peak_rpm, dtype: object

In [74]:
train["peak_rpm"].value_counts() #Replace ? with np.NaN

4800    28
5500    24
5000    16
5200    14
5800     7
5400     7
6000     6
4500     4
5250     3
4750     3
4150     3
5100     2
6600     2
4200     2
4650     1
4400     1
4250     1
5600     1
5300     1
5900     1
4900     1
5750     1
?        1
4350     1
Name: peak_rpm, dtype: int64

In [58]:
train["price"].head() #Should be numeric

24      6229
155     8778
165     9298
46     11048
147    10198
Name: price, dtype: object

In [76]:
train["price"].value_counts() #Replace ? with np.NaN

?        3
9279     2
7898     2
7295     2
8845     2
        ..
13845    1
6918     1
6989     1
11048    1
6377     1
Name: price, Length: 120, dtype: int64

In [69]:
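def wrangle(df):
    """Clean one split of the autos data and return a new DataFrame."""

    # "?" marks a missing value in this dataset
    df = df.replace("?", np.nan)

    # These columns were read as strings because of the "?" placeholders
    columns_to_convert = [
        "norm_loss",
        "bore",
        "stroke",
        "hp",
        "peak_rpm",
        "price"
    ]

    for column in columns_to_convert:
        df[column] = df[column].astype(float)

    # city_mpg and hgwy_mpg are correlated (> 0.5 in the table above), so keep only one
    df = df.drop(columns=["city_mpg"])

    # Columns with skew >= 1 are candidates for a log transform (not applied here)
    skew = df.select_dtypes("number").skew()
    skewed_columns = skew[skew >= 1].index.tolist()

    return df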
    

In [70]:
train_wrangled = wrangle(train)
validate_wrangled = wrangle(validate)
test_wrangled = wrangle(test)

In [71]:
target = "price"

features = train_wrangled.columns.drop([target])

In [72]:
guess = df["price"].replace("?",np.NaN).astype(float).mean()
print("Guess the mean for all values:", guess, "dollars")

errors = guess - df["price"].replace("?",np.NaN).astype(float)
mean_absolute_error = errors.abs().mean()

print(f'If we just guessed every price for ${guess:,.0f}, we would be off by ${mean_absolute_error:,.0f} on average.')

Guess the mean for all values: 13207.129353233831 dollars
If we just guessed every price for $13,207, we would be off by $5,841 on average.
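
A mean baseline can also be computed from the training split only and scored on validation data, which keeps the held-out rows out of the guess. The sketch below uses scikit-learn's `DummyRegressor`; the tiny `y_train` and `y_val` arrays are made-up numbers, not rows from the autos data.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error

# Made-up prices for illustration only.
y_train = np.array([6000.0, 8000.0, 10000.0, 16000.0])
y_val = np.array([7000.0, 13000.0])

# DummyRegressor ignores the features, so placeholder columns are enough.
baseline = DummyRegressor(strategy="mean")
baseline.fit(np.zeros((len(y_train), 1)), y_train)

pred = baseline.predict(np.zeros((len(y_val), 1)))  # always the training mean
mae = mean_absolute_error(y_val, pred)

print(f"baseline guess: ${pred[0]:,.0f}, validation MAE: ${mae:,.0f}")
```

Any real model should beat this number on the same validation rows before it is worth tuning further.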


In [84]:
X_train = train_wrangled[features]
y_train = train_wrangled[target]
X_validate = validate_wrangled[features]
y_validate = validate_wrangled[target]
X_test = test_wrangled[features]
y_test = test_wrangled[target]


from sklearn.linear_model import LinearRegression
from sklearn.metrics import classification_report
from sklearn.impute import SimpleImputer
from sklearn.metrics import r2_score

In [86]:
# Make pipeline
pipeline = make_pipeline(
    ce.OrdinalEncoder(), 
    SimpleImputer(),
    StandardScaler(), 
    LinearRegression()
)

In [87]:
pipeline.fit(
    X_train, 
    y_train
)

y_pred = pipeline.predict(
    X_validate
)

print(
    "\n",
    r2_score(
          y_validate, 
          y_pred
    )
)


 0.4729779927868908


In [88]:
from sklearn.tree import DecisionTreeRegressor

# Make pipeline
pipeline = make_pipeline(
    ce.OrdinalEncoder(), 
    SimpleImputer(),
    StandardScaler(), 
    DecisionTreeRegressor()
)

pipeline.fit(
    X_train, 
    y_train
)

y_pred = pipeline.predict(
    X_validate
)

print(
    "\n",
    r2_score(
          y_validate, 
          y_pred
    )
)


 0.865358713456198


In [91]:
from sklearn.ensemble import RandomForestRegressor

# Make pipeline
pipeline = make_pipeline(
    ce.OrdinalEncoder(), 
    SimpleImputer(),
    StandardScaler(), 
    RandomForestRegressor()
)

pipeline.fit(
    X_train, 
    y_train
)

y_pred = pipeline.predict(
    X_validate
)

print(
    "\n",
    r2_score(
          y_validate, 
          y_pred
    )
)


 0.959589731829623


# Model Building

**Bagging** - `Your Answer Here`

**Boosting** - `Your Answer Here`

**Gradient Boosting** - `Your Answer Here`

**Monotonic Function** - `Your Answer Here`

**Hyperparameter Tuning** - `Your Answer Here`

**Pipeline** - `Your Answer Here`

**Overfitting** - `Your Answer Here`
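
To help with the boosting definitions, here is a hand-rolled sketch of gradient boosting for squared error: each small tree is fit to the residuals of the ensemble so far. The synthetic sine data, the learning rate, and the tree depth are all assumptions for illustration; libraries such as XGBoost add regularization and many optimizations on top of this idea.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression problem, for illustration only.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())  # start from a constant guess
mse_history = [np.mean((y - prediction) ** 2)]

for _ in range(50):
    # For squared error, the negative gradient is simply the residual.
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    # Take a small step in the direction the new tree suggests.
    prediction += learning_rate * tree.predict(X)
    mse_history.append(np.mean((y - prediction) ** 2))

print(f"training MSE: {mse_history[0]:.3f} -> {mse_history[-1]:.3f}")
```

Bagging, by contrast, trains each tree independently on a bootstrap sample and averages them, so the trees do not depend on each other's errors.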

Using your cleaned up dataframe above, build a model and score it with an appropriate metric and cross validation.
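
As a starting point, cross-validated scoring might look like the sketch below. It uses synthetic data and a `RandomForestRegressor` as stand-ins; with the autos data you would pass your own pipeline, features, and target instead. Mean absolute error is one reasonable metric for a price target, not the only one.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a cleaned features matrix and target vector.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

model = RandomForestRegressor(n_estimators=100, random_state=0)

# scikit-learn maximizes scores, so error metrics are returned negated.
scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error")
mae_per_fold = -scores

print("MAE per fold:", mae_per_fold.round(1))
print("mean MAE:", mae_per_fold.mean().round(1))
```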

How do you know if your model is overfitting?

```
Your Answer Here
```
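
One common check is to compare the training score with a held-out score: a large gap suggests the model has memorized the training data. The sketch below uses synthetic data and an unconstrained decision tree, both chosen for illustration because they make the gap easy to see.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Noisy synthetic data, for illustration only.
X, y = make_regression(n_samples=300, n_features=10, noise=50.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0)

# With no depth limit, the tree can fit every training row exactly.
deep = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

train_r2 = deep.score(X_train, y_train)
val_r2 = deep.score(X_val, y_val)
gap = train_r2 - val_r2

print(f"train R^2: {train_r2:.3f}, validation R^2: {val_r2:.3f}, gap: {gap:.3f}")
```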

# Model Interpretation

**Confusion Matrix** - `a specific table layout that allows visualization of the performance of an algorithm (wiki)`

**Permutation Importance** - `the decrease in a model score when a single feature value is randomly shuffled (sklearn)`

**Partial Dependence Plot** - `depicts the functional relationship between a small number of input variables and predictions (sas)`

**Shapley Values** - `An approach from cooperative game theory that explains an individual prediction by fairly distributing it among the features`

**Drop Column Importance** - `Measuring a feature's importance by dropping its column, refitting the model, and comparing the score with the full model's score`
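
Drop column importance has no dedicated section in this guide, so here is a minimal sketch of the idea. The synthetic data (where feature 0 matters most, feature 1 a little, and features 2 and 3 not at all) and the helper `cv_r2` are assumptions made up for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: only the first two columns influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.5, size=300)

def cv_r2(features):
    """Mean cross-validated R^2 of a freshly fit linear model (hypothetical helper)."""
    return cross_val_score(LinearRegression(), features, y, cv=5, scoring="r2").mean()

baseline = cv_r2(X)

# Importance of column j = how much the score falls when j is removed and the model refit.
drop_importance = np.array(
    [baseline - cv_r2(np.delete(X, j, axis=1)) for j in range(X.shape[1])]
)

print(drop_importance.round(3))
```

Because the model is refit once per column, this method is more expensive than permutation importance, which reuses a single fitted model.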

Use the model you trained above or the classification model provided to complete the following. Create each of the visuals below and then explain how they help you interpret and/or refine your model.

In [94]:
bank = pd.read_csv('https://raw.githubusercontent.com/bundickm/Study-Guides/master/data/bank.csv')
bank.head()

# Assign to X, y
X = bank.drop(columns='y')
y = bank['y'] == 'yes'

# Drop leaky feature
X = X.drop(columns='duration')

# Split Train, Test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

# Make pipeline
pipeline = make_pipeline(
    ce.OneHotEncoder(use_cat_names=True), 
    StandardScaler(), 
    LogisticRegression(solver='lbfgs', max_iter=1000)
)

# Predict
y_pred_proba = cross_val_predict(pipeline, X_train, y_train, cv=3, n_jobs=-1, 
                                 method='predict_proba')[:,1]

### Shapley Values

How does the above visual help you to interpret or refine your model?

```
Your Answer Here
```
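
If it helps to see what a Shapley value actually is, the sketch below computes them exactly by enumerating every feature coalition for a tiny linear model. This brute-force approach is only practical for a handful of features, and it is not how the `shap` library works internally; the synthetic data and the helper names `coalition_value` and `shapley_values` are made up for illustration.

```python
from itertools import combinations
from math import factorial

import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data and a small linear model, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)
model = LinearRegression().fit(X, y)

def coalition_value(x, subset, background):
    """Average prediction when the features in `subset` are fixed to x's values."""
    data = background.copy()
    cols = list(subset)
    if cols:
        data[:, cols] = x[cols]
    return model.predict(data).mean()

def shapley_values(x, background):
    """Exact Shapley values for one row x, by enumerating all coalitions."""
    n = len(x)
    phi = np.zeros(n)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                gain = (coalition_value(x, subset + (i,), background)
                        - coalition_value(x, subset, background))
                phi[i] += weight * gain
    return phi

x = X[0]
phi = shapley_values(x, X)
print("Shapley values:", phi.round(3))
```

The values sum to the difference between this row's prediction and the average prediction, which is what lets a force plot show each feature pushing the prediction up or down from a base value.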

### Partial Dependence Plot

In [None]:
# PDP, 1 Feature Isolation

In [None]:
# PDP, 2 Feature Interaction
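
Before reaching for a plotting library, it can help to compute a one-feature partial dependence curve by hand: fix the feature at each grid value for every row, predict, and average. The sketch below does that on synthetic data; the helper `partial_dependence_1d` and the quadratic target are assumptions for illustration. A two-feature interaction plot applies the same idea over a grid of value pairs.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target is U-shaped in feature 0 and linear in feature 1.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(400, 2))
y = X[:, 0] ** 2 + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=400)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

def partial_dependence_1d(model, X, feature, grid):
    """Average prediction with one feature forced to each grid value (hypothetical helper)."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature] = value  # overwrite the feature for every row
        averages.append(model.predict(X_mod).mean())
    return np.array(averages)

grid = np.linspace(-2, 2, 9)
pdp = partial_dependence_1d(model, X, feature=0, grid=grid)

for value, avg in zip(grid, pdp):
    print(f"feature 0 = {value:+.1f} -> average prediction {avg:.2f}")
```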

How does the above visual help you to interpret or refine your model?

```
Your Answer Here
```

### Confusion Matrix

How does the above visual help you to interpret or refine your model?

```
Your Answer Here
```
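
For reference, a confusion matrix and the metrics derived from it can be computed as in the sketch below. The label arrays are made up; with the bank model, thresholding `y_pred_proba` (for example at 0.5, which is an assumption and not a value from this guide) would produce the predicted labels.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true labels and predictions, for illustration only.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# For binary labels, rows are actual classes and columns are predicted classes.
matrix = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = matrix.ravel()

precision = tp / (tp + fp)  # of the predicted positives, how many were right
recall = tp / (tp + fn)     # of the actual positives, how many were found

print(matrix)
print(f"precision: {precision:.2f}, recall: {recall:.2f}")
```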

### Permutation Importance

How does the above visual help you to interpret or refine your model?

```
Your Answer Here
```
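
The sketch below shows one way to compute permutation importance with scikit-learn's `permutation_importance` on held-out data. The synthetic data, in which only feature 0 determines the label, and the random forest are assumptions for illustration; the eli5 library linked in Resources offers a similar tool.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data: only feature 0 carries any signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 3))
y = (X[:, 0] > 0).astype(int)

X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Shuffle one column at a time on validation data and measure the score drop.
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=0)
importances = result.importances_mean

for j, value in enumerate(importances):
    print(f"feature {j}: {value:.3f}")
```

Features whose importance is near zero (or negative) are candidates for removal when you refine the model.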

Use some of the visuals above to further refine your model and rescore. Once you have a validation score you are happy with, use your test set to get a final score for your model.