Lambda School Data Science

*Unit 2, Sprint 3, Module 3*

---


# Permutation & Boosting

You will use your portfolio project dataset for all assignments this sprint.

## Assignment

Complete these tasks for your project, and document your work.

- [ ] If you haven't completed assignment #1, please do so first.
- [ ] Continue to clean and explore your data. Make exploratory visualizations.
- [ ] Fit a model. Does it beat your baseline? 
- [ ] Try xgboost.
- [ ] Get your model's permutation importances.

You should try to complete an initial model today, because the rest of the week, we're making model interpretation visualizations.

But, if you aren't ready to try xgboost and permutation importances with your dataset today, that's okay. You can practice with another dataset instead. You may choose any dataset you've worked with previously.

The data subdirectory includes the Titanic dataset for classification and the NYC apartments dataset for regression. You may want to choose one of these datasets, because example solutions will be available for each.


## Reading

Top recommendations in _**bold italic:**_

#### Permutation Importances
- _**[Kaggle / Dan Becker: Machine Learning Explainability](https://www.kaggle.com/dansbecker/permutation-importance)**_
- [Christoph Molnar: Interpretable Machine Learning](https://christophm.github.io/interpretable-ml-book/feature-importance.html)

#### (Default) Feature Importances
  - [Ando Saabas: Selecting good features, Part 3, Random Forests](https://blog.datadive.net/selecting-good-features-part-iii-random-forests/)
  - [Terence Parr, et al: Beware Default Random Forest Importances](https://explained.ai/rf-importance/index.html)

#### Gradient Boosting
  - [A Gentle Introduction to the Gradient Boosting Algorithm for Machine Learning](https://machinelearningmastery.com/gentle-introduction-gradient-boosting-algorithm-machine-learning/)
  - [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/ISLR%20Seventh%20Printing.pdf), Chapter 8
  - _**[Gradient Boosting Explained](https://www.gormanalysis.com/blog/gradient-boosting-explained/)**_ — Ben Gorman
  - [Gradient Boosting Explained](http://arogozhnikov.github.io/2016/06/24/gradient_boosting_explained.html) — Alex Rogozhnikov
  - [How to explain gradient boosting](https://explained.ai/gradient-boosting/) — Terence Parr & Jeremy Howard

In [52]:
import platform
platform.architecture()[0]

'64bit'

In [79]:
# Read in Churn DS - Saved Locally to my PyCharm
# still looking into the certificate error

import numpy as np
import pandas as pd
import category_encoders as ce
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from category_encoders import OneHotEncoder
import seaborn as sns

df = pd.read_csv('../churn_ds.csv')


In [2]:
# function for turning TotalChares into Float values
def fix_float(cell):
  try: 
    return float(cell)
  except: 
    print(cell)
    return np.NaN

In [3]:
# Cleaning Wrangle function 

def wrangle(X):
    
    X = X.copy()
    
    # fixing column to change to float
    df['TotalCharges'] = df['TotalCharges'].apply(fix_float)
                                                  
    # replacing Yes/No with True/False
    columns = ['Partner', 'Dependents', 'PhoneService', 'PaperlessBilling', 'Churn']                                              
    for col in columns:
        X[col] = X[col].apply(lambda cell: cell.lower() == 'yes')

    y = X['Churn']    
        
    X.drop(['customerID', 'Churn'], axis=1, inplace=True)
                                                  
    return X, y     

In [9]:
X, y = wrangle(df)

 
 
 
 
 
 
 
 
 
 
 


In [10]:
X.shape

(7043, 19)

In [11]:
y.value_counts(normalize=True)

False    0.73463
True     0.26537
Name: Churn, dtype: float64

In [12]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=.25, stratify=y, random_state=42)

In [13]:
X_train.shape

(5282, 19)

In [14]:
y_train.shape

(5282,)

In [19]:
X_train.head()

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges
6661,Female,0,True,True,72,False,No phone service,DSL,No,Yes,No,Yes,Yes,Yes,Two year,False,Credit card (automatic),53.65,3784.0
4811,Female,0,False,False,4,True,No,DSL,No,No,No,No,No,No,Month-to-month,True,Mailed check,46.0,193.6
2193,Male,0,False,True,56,True,No,No,No internet service,No internet service,No internet service,No internet service,No internet service,No internet service,One year,True,Mailed check,21.2,1238.65
1904,Male,0,False,False,56,True,Yes,Fiber optic,No,Yes,No,Yes,No,Yes,Month-to-month,True,Electronic check,94.45,5124.6
6667,Female,0,False,False,9,True,No,Fiber optic,No,No,No,No,No,Yes,Month-to-month,True,Electronic check,79.55,723.4


In [80]:
sns?

In [94]:
# Having some trouble with graphing features and target in my dataset
import matplotlib.pyplot as plt
import plotly.express as px

In [103]:
px.bar(df, x='tenure', y='PaymentMethod', color='Churn')

In [15]:
# basic pipeline model
model = make_pipeline(
    OneHotEncoder(), 
    SimpleImputer(), 
    RandomForestClassifier()
)

model.fit(X_train, y_train)

Pipeline(steps=[('onehotencoder',
                 OneHotEncoder(cols=['gender', 'MultipleLines',
                                     'InternetService', 'OnlineSecurity',
                                     'OnlineBackup', 'DeviceProtection',
                                     'TechSupport', 'StreamingTV',
                                     'StreamingMovies', 'Contract',
                                     'PaymentMethod', 'TotalCharges'])),
                ('simpleimputer', SimpleImputer()),
                ('randomforestclassifier', RandomForestClassifier())])

In [16]:
# train accuracy - looks like I need to fix overfitting
model.score(X_train, y_train)

0.9979174555092768

In [17]:
# Val accuracy score - beats baseline guessing by about 5pts. I can get this much higher
model.score(X_val, y_val)

0.7830777967064169

# XGBoost

In [38]:
from xgboost import XGBClassifier

In [62]:
XGBClassifier?

In [68]:
pipeline = make_pipeline(
    ce.OneHotEncoder(),
    XGBClassifier(n_estimators=100, random_state=42, n_jobs=6)
)
pipeline.fit(X_train, y_train)

Pipeline(steps=[('onehotencoder',
                 OneHotEncoder(cols=['gender', 'MultipleLines',
                                     'InternetService', 'OnlineSecurity',
                                     'OnlineBackup', 'DeviceProtection',
                                     'TechSupport', 'StreamingTV',
                                     'StreamingMovies', 'Contract',
                                     'PaymentMethod', 'TotalCharges'])),
                ('xgbclassifier',
                 XGBClassifier(base_score=0.5, booster='gbtree',
                               colsample_bylevel=1, colsample_bynode=1,
                               colsample_...gpu_id=-1,
                               importance_type='gain',
                               interaction_constraints='',
                               learning_rate=0.300000012, max_delta_step=0,
                               max_depth=6, min_child_weight=1, missing=nan,
                               monotone_constraints='(

In [69]:
pipeline.score(X_train, y_train)

0.9278682317304051

In [70]:
pipeline.score(X_val, y_val)

0.7751277683134583

In [64]:
pipeline = make_pipeline(
    ce.OneHotEncoder(),
    XGBClassifier(n_estimators=100, random_state=42, n_jobs=6)
)

param_distributions= {
    'xgbclassifier__max_depth': range(4,6,1), 
}

search = RandomizedSearchCV(
    pipeline,
    param_distributions=param_distributions, 
    n_iter=10, 
    cv=5, 
    scoring='accuracy', 
    verbose= 5, 
    return_train_score= True,

)
search.fit(X_train, y_train)
print(search.best_score_)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 5 folds for each of 2 candidates, totalling 10 fits
[CV] xgbclassifier__max_depth=4 ......................................
[CV]  xgbclassifier__max_depth=4, score=(train=0.873, test=0.784), total=  18.5s
[CV] xgbclassifier__max_depth=4 ......................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   21.0s remaining:    0.0s


[CV]  xgbclassifier__max_depth=4, score=(train=0.865, test=0.811), total=  19.9s
[CV] xgbclassifier__max_depth=4 ......................................


[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:   43.4s remaining:    0.0s


[CV]  xgbclassifier__max_depth=4, score=(train=0.873, test=0.797), total=  20.5s
[CV] xgbclassifier__max_depth=4 ......................................


[Parallel(n_jobs=1)]: Done   3 out of   3 | elapsed:  1.1min remaining:    0.0s


[CV]  xgbclassifier__max_depth=4, score=(train=0.879, test=0.776), total=  25.3s
[CV] xgbclassifier__max_depth=4 ......................................


[Parallel(n_jobs=1)]: Done   4 out of   4 | elapsed:  1.6min remaining:    0.0s


[CV]  xgbclassifier__max_depth=4, score=(train=0.878, test=0.787), total=  33.6s
[CV] xgbclassifier__max_depth=5 ......................................
[CV]  xgbclassifier__max_depth=5, score=(train=0.908, test=0.782), total=  43.5s
[CV] xgbclassifier__max_depth=5 ......................................
[CV]  xgbclassifier__max_depth=5, score=(train=0.905, test=0.809), total=  38.7s
[CV] xgbclassifier__max_depth=5 ......................................
[CV]  xgbclassifier__max_depth=5, score=(train=0.908, test=0.785), total=  38.0s
[CV] xgbclassifier__max_depth=5 ......................................
[CV]  xgbclassifier__max_depth=5, score=(train=0.904, test=0.775), total=  36.4s
[CV] xgbclassifier__max_depth=5 ......................................
[CV]  xgbclassifier__max_depth=5, score=(train=0.912, test=0.793), total=  35.5s


[Parallel(n_jobs=1)]: Done  10 out of  10 | elapsed:  5.7min finished


0.7909857802241909


In [67]:
best_pipe = search.best_estimator_
best_pipe.score(X_val, y_val)

0.7893242475865985

# Permutation Importance 

In [71]:
# import the function
from sklearn.inspection import permutation_importance

In [72]:
# run function with Validation data and best_pipe from above
result = permutation_importance(best_pipe, X_val, y_val, n_repeats=5, random_state=42)

In [73]:
# set in df for easy viewing
dfpi = pd.DataFrame({'feature': X_val.columns, 
                    'importances_mean': np.round(result['importances_mean'], 3), 
                    'importances_std': result['importances_std']})

In [75]:
# See permutation importance by most to least importance
dfpi.sort_values(by='importances_mean', ascending=False)

Unnamed: 0,feature,importances_mean,importances_std
4,tenure,0.05,0.005489
14,Contract,0.034,0.007183
17,MonthlyCharges,0.008,0.00532
7,InternetService,0.005,0.002773
16,PaymentMethod,0.004,0.001781
6,MultipleLines,0.004,0.002012
8,OnlineSecurity,0.004,0.003077
11,TechSupport,0.004,0.002904
10,DeviceProtection,0.003,0.002339
3,Dependents,0.002,0.000718
