Lambda School Data Science

*Unit 2, Sprint 3, Module 3*

---


# Applied Modeling, Module 3

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your work.

- [ ] Continue to iterate on your project: data cleaning, exploration, feature engineering, modeling.
- [ ] Make at least 1 partial dependence plot to explain your model.
- [ ] Share at least 1 visualization on Slack.

(If you have not yet completed an initial model for your portfolio project, do today's assignment using your Tanzania Waterpumps model.)

## Stretch Goals
- [ ] Make multiple PDPs with 1 feature in isolation.
- [ ] Make multiple PDPs with 2 features in interaction. 
- [ ] Use Plotly to make a 3D PDP.
- [ ] Make PDPs with categorical feature(s). Use Ordinal Encoder, outside of a pipeline, to encode your data first. If there is a natural ordering, then take the time to encode it that way, instead of random integers. Then use the encoded data with pdpbox. Get readable category names on your plot, instead of integer category codes.
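
For the categorical stretch goal, one way to encode a natural ordering by hand is an explicit mapping, which can then be inverted to recover readable labels for the plot axis. This is only a minimal sketch: the `education` column, its category names, and their order are hypothetical and not taken from any dataset used here.

```python
import pandas as pd

# Hypothetical column and category names, for illustration only
toy = pd.DataFrame({'education': ['secondary', 'higher', 'primary', 'secondary']})

# Explicit mapping that follows the natural order, from lowest to highest
education_order = {'primary': 1, 'secondary': 2, 'higher': 3}
toy['education_encoded'] = toy['education'].map(education_order)

# Invert the mapping to turn integer codes back into readable names,
# e.g. for tick labels on a partial dependence plot
code_to_name = {code: name for name, code in education_order.items()}
tick_labels = [code_to_name[code] for code in sorted(code_to_name)]

print(toy)
print(tick_labels)
```

The `category_encoders` `OrdinalEncoder` also accepts a `mapping` argument for the same purpose; check its documentation for the exact format it expects.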

## Links
- [Christoph Molnar: Interpretable Machine Learning — Partial Dependence Plots](https://christophm.github.io/interpretable-ml-book/pdp.html) + [animated explanation](https://twitter.com/ChristophMolnar/status/1066398522608635904)
- [Kaggle / Dan Becker: Machine Learning Explainability — Partial Dependence Plots](https://www.kaggle.com/dansbecker/partial-plots)
- [Plotly: 3D PDP example](https://plot.ly/scikit-learn/plot-partial-dependence/#partial-dependence-of-house-value-on-median-age-and-average-occupancy)

In [9]:
%%capture
import sys

# If you're on Colab:
if 'google.colab' in sys.modules:
    DATA_PATH = 'https://raw.githubusercontent.com/LambdaSchool/DS-Unit-2-Applied-Modeling/master/data/'
    !pip install category_encoders==2.*
    !pip install eli5
    !pip install pdpbox

# If you're working locally:
else:
    DATA_PATH = '../data/'

In [10]:
import pandas as pd

path = '/Users/maxefremov/Workshop/data/home-credit-default-risk/'
application_train = 'application_train.csv'

df = pd.read_csv(path + application_train)
print(df.shape)

# Split train into train & val
from sklearn.model_selection import train_test_split
train, val = train_test_split(df, train_size=0.80, test_size=0.20, random_state=42)

(307511, 122)


### Baseline
A majority class baseline model would have us predict that every loan applicant will pay back their loan with no difficulty, assigning 0s for every observation in the test dataset.
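
The baseline idea can be sketched on toy labels. The numbers below are made up for illustration (92 repaid loans, 8 defaults) and are not the Home Credit data; on the real data the same calculation would give an accuracy equal to the majority class frequency, about 0.919.

```python
import pandas as pd

# Toy labels for illustration only (not the Home Credit data):
# 92 repaid loans (0) and 8 defaults (1)
y_toy = pd.Series([0] * 92 + [1] * 8, name='TARGET')

# The majority class is the most frequent label
majority_class = y_toy.mode()[0]

# A majority class baseline predicts that label for every observation
y_pred = pd.Series([majority_class] * len(y_toy))

# Its accuracy equals the frequency of the majority class
baseline_accuracy = (y_pred == y_toy).mean()
print(majority_class, baseline_accuracy)
```

Because a constant prediction ranks no applicant above any other, its ROC AUC is 0.5. With classes this imbalanced, 0.5 ROC AUC is a more informative number to beat than the high baseline accuracy.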

In [11]:
df['TARGET'].value_counts(normalize=True)
# The classes are imbalanced: only about 8% of observations are in the positive class (TARGET = 1)

0    0.919271
1    0.080729
Name: TARGET, dtype: float64

### An XGBClassifier model with minimal-to-no feature engineering

In [None]:
target = 'TARGET'

# Get a dataframe with all train columns except the target
# (use the train split, not the full df, so validation rows don't inform feature selection)
train_features = train.drop(columns=[target])

# Get a list of the numeric features
numeric_features = train_features.select_dtypes(include='number').columns.tolist()

# Get a series with the cardinality of the nonnumeric features
cardinality = train_features.select_dtypes(exclude='number').nunique()

# Get a list of all categorical features with cardinality <= 50
categorical_features = cardinality[cardinality <= 50].index.tolist()

# Combine the lists 
features = numeric_features + categorical_features

In [None]:
# Arrange data into X features matrix and y target vector 
X_train = train[features]
y_train = train[target]
X_val = val[features]
y_val = val[target]

In [None]:
!pip install xgboost

Collecting xgboost
  Using cached https://files.pythonhosted.org/packages/96/84/4e2cae6247f397f83d8adc5c2a2a0c5d7d790a14a4c7400ff6574586f589/xgboost-0.90.tar.gz


In [None]:
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from xgboost import XGBClassifier

# Fit the encoder on the training data only, then reuse the fitted
# mapping to transform the validation data
encoder = ce.OrdinalEncoder()
X_train_encoded = encoder.fit_transform(X_train)
X_val_encoded = encoder.transform(X_val)

print(X_train_encoded.shape, X_val_encoded.shape)

eval_set = [(X_train_encoded, y_train), (X_val_encoded, y_val)]

model = XGBClassifier(
    n_estimators=1000,
    max_depth=7,
    learning_rate=.1,
    n_jobs=-1
)

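# Note: this call matches xgboost 0.90. From xgboost 2.0 on, eval_metric and
# early_stopping_rounds are passed to the XGBClassifier constructor instead.
model.fit(X_train_encoded, y_train, eval_set=eval_set,
          eval_metric='auc', early_stopping_rounds=50)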

In [None]:
# Get the validation ROC AUC from the predicted probability of the positive class (last column)
from sklearn.metrics import roc_auc_score
y_pred_proba = model.predict_proba(X_val_encoded)[:, -1]
print('Validation ROC AUC', roc_auc_score(y_val, y_pred_proba))

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

# The eval metric is AUC (higher is better), not an error rate
results = model.evals_result()
train_auc = results['validation_0']['auc']
val_auc = results['validation_1']['auc']
epoch = range(1, len(train_auc) + 1)
plt.plot(epoch, train_auc, label='Train')
plt.plot(epoch, val_auc, label='Validation')
plt.ylabel('ROC AUC')
plt.xlabel('Model Complexity (n_estimators)')
plt.ylim((.5, 1)) # Zoom in
plt.legend();

### Partial Dependence Plot

In [None]:
# Later, when you save matplotlib images to include in blog posts or web apps,
# increase the dots per inch (double it), so the text isn't so fuzzy
%matplotlib inline
import matplotlib.pyplot as plt
plt.rcParams['figure.dpi'] = 72

In [None]:
nulls = pd.DataFrame(data=df.isnull().sum())
pd.set_option('display.max_columns', 50)
pd.set_option('display.max_rows', 150)

In [None]:
nulls.iloc[0:125]

In [None]:
from pdpbox.pdp import pdp_isolate, pdp_plot

feature = 'AMT_CREDIT'

# Use the same encoded data and the same column list the model was trained on
isolated = pdp_isolate(
    model=model,
    dataset=X_val_encoded,
    model_features=features,
    feature=feature
)

pdp_plot(isolated, feature_name=feature);
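
What a one-feature partial dependence curve measures can be sketched by hand: for each grid value, overwrite the feature in every row, predict, and average the predictions. The code below is a minimal illustration, not pdpbox's implementation. `toy_predict` is a hypothetical stand-in for a fitted model, and the data is random.

```python
import numpy as np

def toy_predict(X):
    """Hypothetical stand-in for a fitted model's predicted probability."""
    # Logistic function of a linear score over two features
    return 1 / (1 + np.exp(-(0.8 * X[:, 0] - 0.5 * X[:, 1])))

def partial_dependence(predict, X, feature_idx, grid):
    """Average prediction when one feature is forced to each grid value."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature_idx] = value  # overwrite the feature in every row
        averages.append(predict(X_mod).mean())
    return np.array(averages)

rng = np.random.default_rng(42)
X_toy = rng.normal(size=(200, 2))   # random data, for illustration only
grid = np.linspace(-2, 2, 5)        # feature values to evaluate

pd_values = partial_dependence(toy_predict, X_toy, feature_idx=0, grid=grid)
print(pd_values)
```

Because the toy model's score increases with feature 0, the averaged predictions rise along the grid. For a real model the curve can take any shape, which is what makes the plot informative.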