<a href="https://colab.research.google.com/github/damerei/DS-Unit-2-Sprint-3-Classification-Validation/blob/master/LS_DS_241_Categorical_Encoding_LIVE_LESSON.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

_Lambda School Data Science — Practicing & Understanding Predictive Modeling_

# Categorical Encoding

### [category_encoders](http://contrib.scikit-learn.org/categorical-encoding/)

Install `category_encoders`, version 2.0.0 or later:
- Google Colab: `pip install category_encoders`
- Local, Anaconda: `conda install -c conda-forge category_encoders`

In [0]:
!pip install category_encoders



In [0]:
import category_encoders as ce
ce.__version__

'2.0.0'

### Categorical encoding for trees
- [Are categorical variables getting lost in your random forests?](https://roamanalytics.com/2016/10/28/are-categorical-variables-getting-lost-in-your-random-forests/)
- [Beyond One-Hot: An Exploration of Categorical Variables](http://www.willmcginnis.com/2015/11/29/beyond-one-hot-an-exploration-of-categorical-variables/)
- [Categorical Features and Encoding in Decision Trees](https://medium.com/data-design/visiting-categorical-features-and-encoding-in-decision-trees-53400fa65931)
- [Coursera — How to Win a Data Science Competition: Learn from Top Kagglers — Concept of mean encoding](https://www.coursera.org/lecture/competitive-data-science/concept-of-mean-encoding-b5Gxv)
- [Mean (likelihood) encodings: a comprehensive study](https://www.kaggle.com/vprokopev/mean-likelihood-encodings-a-comprehensive-study)
- [The Mechanics of Machine Learning, Chapter 6: Categorically Speaking](https://mlbook.explained.ai/catvars.html)

### Can you predict if peer-to-peer loans are charged off or fully paid?

[Lending Club says,](https://www.lendingclub.com/) _"Our mission is to transform the banking system to make credit more affordable and investing more rewarding."_ You can view their [loan statistics and visualizations](https://www.lendingclub.com/info/demand-and-credit-profile.action).

[According to Wikipedia,](https://en.wikipedia.org/wiki/Lending_Club) _Lending Club is the world's largest peer-to-peer lending platform._

>Lending Club enables borrowers to create unsecured personal loans between $1,000 and $40,000. The standard loan period is three years. Investors can search and browse the loan listings on Lending Club website and select loans that they want to invest in based on the information supplied about the borrower, amount of loan, loan grade, and loan purpose. Investors make money from interest. Lending Club makes money by charging borrowers an origination fee and investors a service fee.

The data is a stratified sample of 100,000 Lending Club peer-to-peer loans with a loan status of "Charged Off" or "Fully Paid", issued from 2007 through 2018.

The set of variables included here is the intersection of what's available both when investors download historical data and when investors browse loans for manual investing.

Target: `charged_off`

Data dictionary: https://resources.lendingclub.com/LCDataDictionary.xlsx

In [0]:
import pandas as pd
pd.options.display.max_columns = 500
pd.options.display.max_rows = 500
url = 'https://drive.google.com/uc?export=download&id=1AafT_i1dmfaxqKiyFofVndleKozbQw3l'
df = pd.read_csv(url)
df.shape

(100000, 104)

In [0]:
from sklearn.model_selection import train_test_split

X = df.drop(columns='charged_off')
y = df['charged_off']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.80, test_size=0.20, stratify=y, random_state=42)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

((80000, 103), (20000, 103), (80000,), (20000,))

In [0]:
y_train.value_counts(normalize=True)

0    0.80045
1    0.19955
Name: charged_off, dtype: float64

In [0]:
X_train.describe(include='number').T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| id | 80000.0 | 1667101.0 | 384960.93752 | 1000000.0 | 1333199.0 | 1667750.5 | 1998223.0 | 2336172.0 |
| member_id | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| loan_amnt | 80000.0 | 14385.93 | 8653.335024 | 1000.0 | 8000.0 | 12000.0 | 20000.0 | 40000.0 |
| funded_amnt | 80000.0 | 14378.77 | 8649.381479 | 1000.0 | 8000.0 | 12000.0 | 20000.0 | 40000.0 |
| installment | 80000.0 | 437.2868 | 260.270011 | 23.61 | 249.54 | 374.64 | 578.265 | 1566.8 |
| annual_inc | 80000.0 | 75872.64 | 59552.605399 | 0.0 | 45760.0 | 65000.0 | 90000.0 | 6000000.0 |
| url | 0.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| dti | 79988.0 | 18.36223 | 12.2945 | 0.0 | 11.88 | 17.635 | 24.12 | 999.0 |
| delinq_2yrs | 80000.0 | 0.3147875 | 0.86675 | 0.0 | 0.0 | 0.0 | 0.0 | 20.0 |
| inq_last_6mths | 80000.0 | 0.6546125 | 0.936913 | 0.0 | 0.0 | 0.0 | 1.0 | 7.0 |


In [0]:
X_train.describe(exclude='number').T.sort_values(by='unique')

| | count | unique | top | freq |
|---|---|---|---|---|
| term | 80000 | 2 | 36 months | 60781 |
| application_type | 80000 | 2 | Individual | 78548 |
| initial_list_status | 80000 | 2 | w | 46513 |
| disbursement_method | 80000 | 2 | Cash | 79612 |
| home_ownership | 80000 | 6 | MORTGAGE | 39614 |
| grade | 80000 | 7 | B | 23583 |
| emp_length | 75387 | 11 | 10+ years | 26395 |
| purpose | 80000 | 14 | debt_consolidation | 46509 |
| sub_grade | 80000 | 35 | C1 | 5049 |
| addr_state | 80000 | 50 | CA | 11523 |


## Categorical exploration, 1 feature at a time

Change `feature`, then re-run these cells!

In [0]:
feature = 'addr_state'

In [0]:
X_train[feature].value_counts()

CA    11523
TX     6565
NY     6417
FL     5822
IL     3097
NJ     2875
PA     2685
OH     2585
GA     2528
VA     2326
NC     2159
MI     2110
AZ     1952
MA     1905
MD     1838
CO     1720
WA     1718
MN     1410
IN     1314
MO     1265
TN     1226
CT     1181
NV     1158
WI     1071
OR      984
SC      968
AL      968
LA      942
KY      766
OK      756
KS      642
AR      634
UT      608
NM      479
MS      428
HI      399
NH      377
RI      332
WV      318
MT      251
NE      217
DE      212
DC      212
AK      201
SD      190
WY      171
VT      167
ME      114
ID      112
ND      102
Name: addr_state, dtype: int64

In [0]:
X_train[[feature]].head()

| | addr_state |
|---|---|
| 25539 | MT |
| 28968 | NV |
| 34666 | NY |
| 20864 | CA |
| 75088 | CA |


### One Hot Encoding

Warning: May run slowly, or run out of memory, with high-cardinality categoricals!

In [0]:
encoder = ce.OneHotEncoder(use_cat_names=True)
encoded = encoder.fit_transform(X_train[[feature]])
print(f'{len(encoded.columns)} columns')
encoded.head()

50 columns


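| | addr_state_MT | addr_state_NV | addr_state_NY | addr_state_CA | addr_state_AZ | addr_state_TX | addr_state_KY | addr_state_MS | addr_state_NJ | addr_state_GA | addr_state_PA | addr_state_CO | addr_state_IL | addr_state_MI | addr_state_CT | addr_state_KS | addr_state_SC | addr_state_TN | addr_state_AL | addr_state_VA | addr_state_MD | addr_state_WA | addr_state_AR | addr_state_NM | addr_state_OK | addr_state_OH | addr_state_FL | addr_state_IN | addr_state_MO | addr_state_NE | addr_state_DE | addr_state_MA | addr_state_VT | addr_state_LA | addr_state_MN | addr_state_UT | addr_state_WI | addr_state_NC | addr_state_DC | addr_state_RI | addr_state_OR | addr_state_WV | addr_state_NH | addr_state_HI | addr_state_ID | addr_state_SD | addr_state_ND | addr_state_AK | addr_state_ME | addr_state_WY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 25539 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 28968 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 34666 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 20864 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 75088 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |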
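The warning above follows from one-hot encoding producing one column per distinct category, so the encoded frame grows as rows times cardinality. A minimal sketch with plain pandas on the five states shown above; `toy_states` and `one_hot_cells` are illustrative names, not part of the notebook or of `category_encoders`.

```python
import pandas as pd

# Toy column: the first five addr_state values shown above.
toy_states = pd.Series(['MT', 'NV', 'NY', 'CA', 'CA'], name='addr_state')

# pandas' built-in one-hot encoding: one indicator column per distinct value.
one_hot = pd.get_dummies(toy_states, prefix='addr_state')
print(f'{one_hot.shape[1]} columns for {toy_states.nunique()} unique values')

def one_hot_cells(n_rows, n_categories):
    """Number of cells in a dense one-hot encoding (hypothetical helper)."""
    return n_rows * n_categories

# 80,000 training rows and 50 states, as in this dataset.
print(one_hot_cells(80_000, 50))
```

At 50 categories this is still manageable, but the same arithmetic applied to a column with tens of thousands of distinct values is what makes the encoding slow or memory-hungry.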


### Binary Encoding

In [0]:
encoder = ce.BinaryEncoder()
encoded = encoder.fit_transform(X_train[[feature]])
print(f'{len(encoded.columns)} columns')
encoded.head()

7 columns


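| | addr_state_0 | addr_state_1 | addr_state_2 | addr_state_3 | addr_state_4 | addr_state_5 | addr_state_6 |
|---|---|---|---|---|---|---|---|
| 25539 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 28968 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 34666 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 20864 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 75088 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |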


In [0]:
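import math
# Caution: the number of binary-encoded columns scales with log2 of the
# cardinality, not its square root; sqrt(50) being close to 7 is a coincidence.
math.sqrt(50)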

7.0710678118654755
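The rows in the binary-encoded output above read as the integers 1, 2, 3, 4, 4 written in base 2, which suggests each category first gets an integer code and the code's binary digits then become columns. Under that reading, the column count grows with the base-2 logarithm of the cardinality: codes 1 through 50 need 6 bits, and the library output shows 7 columns, one more than strictly required. The sketch below is a hand-rolled illustration, not the `category_encoders` implementation; `binary_encode` is a hypothetical helper, and numbering codes by order of first appearance is an assumption taken from the output above.

```python
import pandas as pd

def binary_encode(series):
    """Binary-encode a categorical Series (illustrative sketch only)."""
    # Assumed convention: integer codes 1..k in order of first appearance.
    codes = {cat: i + 1 for i, cat in enumerate(pd.unique(series))}
    # Number of bits needed to write the largest code in base 2.
    width = max(codes.values()).bit_length()
    rows = [[int(bit) for bit in format(codes[value], f'0{width}b')]
            for value in series]
    columns = [f'{series.name}_{i}' for i in range(width)]
    return pd.DataFrame(rows, index=series.index, columns=columns)

toy = pd.Series(['MT', 'NV', 'NY', 'CA', 'CA'], name='addr_state')
encoded_toy = binary_encode(toy)
print(encoded_toy)

# Bits needed for the largest code when there are 50 categories.
print((50).bit_length())
```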

### "Ordinal" Encoding

In [0]:
encoder = ce.OrdinalEncoder()
encoded = encoder.fit_transform(X_train[[feature]])
print(f'1 column, {encoded[feature].nunique()} unique values')
encoded.head()

1 column, 50 unique values


| | addr_state |
|---|---|
| 25539 | 1 |
| 28968 | 2 |
| 34666 | 3 |
| 20864 | 4 |
| 75088 | 4 |
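The quotation marks around "Ordinal" fit what the output shows: the integers for `addr_state` follow the order in which states first appear (MT is 1, NV is 2, and so on), not any meaningful ranking. For a feature that does have a natural order, such as `grade` with its seven levels, an explicit mapping keeps the integers meaningful. A minimal sketch in plain pandas; `grade_order` and `grade_to_int` are illustrative names, and treating A as the lowest value is an assumption consistent with the `sub_grade` transformation later in this notebook.

```python
import pandas as pd

# Assumed ordering: A is the best grade, G the worst.
grade_order = ['A', 'B', 'C', 'D', 'E', 'F', 'G']
grade_to_int = {grade: rank for rank, grade in enumerate(grade_order, start=1)}

grades = pd.Series(['B', 'A', 'G', 'C'], name='grade')

# Map each grade to its rank so that larger integers mean riskier grades.
encoded_grades = grades.map(grade_to_int)
print(encoded_grades.tolist())
```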


### Target (Mean) Encoding

Warning: May overfit, since each training row's encoding is computed from target values that include its own label!

In [0]:
first5 = X_train[feature].head().values
train = pd.concat([X_train, y_train], axis='columns')
target_mean = train.groupby(feature)['charged_off'].mean()
target_mean[first5]

addr_state
MT    0.163347
NV    0.202936
NY    0.218015
CA    0.200295
CA    0.200295
Name: charged_off, dtype: float64
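One common way to reduce the overfitting risk flagged in the warning is to compute each row's encoding from the other rows only, for example fold by fold. The sketch below is one possible approach, not something this notebook or `category_encoders` does; `out_of_fold_target_mean` is a hypothetical helper, and the fold count and seed are arbitrary choices.

```python
import numpy as np
import pandas as pd

def out_of_fold_target_mean(categories, target, n_splits=5, seed=42):
    """Encode each row with the category mean computed on the other folds."""
    rng = np.random.default_rng(seed)
    # Shuffle row positions and split them into roughly equal folds.
    folds = np.array_split(rng.permutation(len(categories)), n_splits)
    encoded = pd.Series(np.nan, index=categories.index, dtype=float)
    for held_out in folds:
        in_fit = np.ones(len(categories), dtype=bool)
        in_fit[held_out] = False
        # Category means from rows outside the held-out fold.
        fit_means = target[in_fit].groupby(categories[in_fit]).mean()
        encoded.iloc[held_out] = categories.iloc[held_out].map(fit_means).to_numpy()
    # Categories unseen in the fitting rows fall back to the overall mean.
    return encoded.fillna(target.mean())

toy_cats = pd.Series(['a'] * 10 + ['b'] * 10 + ['z'])
toy_target = pd.Series([1, 0] * 5 + [1] * 10 + [1])
toy_encoded = out_of_fold_target_mean(toy_cats, toy_target)
print(toy_encoded.tail(3))
```

The single `'z'` row shows the difference: a plain group mean would encode it as exactly its own label, while the out-of-fold version falls back to the overall mean.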

In [0]:
min_samples_leaf = 10
encoder = ce.TargetEncoder(min_samples_leaf=min_samples_leaf)
encoded = encoder.fit_transform(X_train[[feature]], y_train)
print(f'1 column, {encoded[feature].nunique()} unique values, min_samples_leaf={min_samples_leaf}')
encoded.head()

1 column, 50 unique values, min_samples_leaf=10


| | addr_state |
|---|---|
| 25539 | 0.163347 |
| 28968 | 0.202936 |
| 34666 | 0.218015 |
| 20864 | 0.200295 |
| 75088 | 0.200295 |


In [0]:
min_samples_leaf = 100
encoder = ce.TargetEncoder(min_samples_leaf=min_samples_leaf)
encoded = encoder.fit_transform(X_train[[feature]], y_train)
print(f'1 column, {encoded[feature].nunique()} unique values, min_samples_leaf={min_samples_leaf}')
encoded.head()

1 column, 50 unique values, min_samples_leaf=100


| | addr_state |
|---|---|
| 25539 | 0.163347 |
| 28968 | 0.202936 |
| 34666 | 0.218015 |
| 20864 | 0.200295 |
| 75088 | 0.200295 |


In [0]:
min_samples_leaf = 1000
encoder = ce.TargetEncoder(min_samples_leaf=min_samples_leaf)
encoded = encoder.fit_transform(X_train[[feature]], y_train)
print(f'1 column, {encoded[feature].nunique()} unique values, min_samples_leaf={min_samples_leaf}')
encoded.head()

1 column, 28 unique values, min_samples_leaf=1000


  smoove = 1 / (1 + np.exp(-(stats['count'] - self.min_samples_leaf) / self.smoothing))


| | addr_state |
|---|---|
| 25539 | 0.19955 |
| 28968 | 0.202936 |
| 34666 | 0.218015 |
| 20864 | 0.200295 |
| 75088 | 0.200295 |
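The line of library source echoed in the output above (it appears to come from a warning) shows how `min_samples_leaf` enters: a sigmoid of the category count minus `min_samples_leaf` gives a weight between 0 and 1. With `min_samples_leaf=1000`, MT (251 training rows) gets a weight near 0, and its encoding becomes 0.19955, the overall charged-off rate in `y_train`. The sketch below reproduces that behavior under the assumption that the weight blends the category mean with the overall mean; `smoothed_target_mean` is a hypothetical helper, the default `smoothing=1.0` is an assumption, and the exponent is clipped here only to avoid overflow.

```python
import numpy as np
import pandas as pd

def smoothed_target_mean(categories, target, min_samples_leaf=1, smoothing=1.0):
    """Blend each category's target mean with the overall mean (sketch)."""
    prior = target.mean()
    stats = target.groupby(categories).agg(['count', 'mean'])
    # Same sigmoid as in the echoed line, with the exponent clipped so that
    # np.exp cannot overflow for very small or very large counts.
    exponent = np.clip(-(stats['count'] - min_samples_leaf) / smoothing, -700, 700)
    weight = 1 / (1 + np.exp(exponent))
    # Weight near 1: trust the category mean. Weight near 0: use the prior.
    return prior * (1 - weight) + stats['mean'] * weight

toy_cats = pd.Series(['a'] * 200 + ['b'] * 5)
toy_target = pd.Series([1] * 50 + [0] * 150 + [1] * 5)
smoothed = smoothed_target_mean(toy_cats, toy_target, min_samples_leaf=100)
print(smoothed)
```

In the toy data, category `'a'` has 200 rows and keeps its own mean, while `'b'` has only 5 rows and is pulled to the overall mean, mirroring what happens to the smaller states above.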


### BONUS: Data Wrangling / Feature Engineering Example

In [0]:
def wrangle(X):
    X = X.copy()
    
    # Drop some columns
    X = X.drop(columns='id')  # id is random
    X = X.drop(columns=['member_id', 'url', 'desc'])  # All null
    X = X.drop(columns='title')  # Duplicative of purpose
    X = X.drop(columns='grade')  # Duplicative of sub_grade
    
    # Transform sub_grade from "A1" - "G5" to 1.1 - 7.5
    def wrangle_sub_grade(x):
        first_digit = ord(x[0]) - 64
        second_digit = int(x[1])
        return first_digit + second_digit/10
    
    X['sub_grade'] = X['sub_grade'].apply(wrangle_sub_grade)

    # Convert percentages from strings to floats
    X['int_rate'] = X['int_rate'].str.strip('%').astype(float)
    X['revol_util'] = X['revol_util'].str.strip('%').astype(float)
        
    # Transform earliest_cr_line to an integer: days since the credit line was opened,
    # measured from today's date (so the values depend on when this cell is run)
    X['earliest_cr_line'] = pd.to_datetime(X['earliest_cr_line'], infer_datetime_format=True)
    X['earliest_cr_line'] = pd.Timestamp.today() - X['earliest_cr_line']
    X['earliest_cr_line'] = X['earliest_cr_line'].dt.days
    
    # Create features for three employee titles: teacher, manager, owner
    X['emp_title'] = X['emp_title'].str.lower()
    X['emp_title_teacher'] = X['emp_title'].str.contains('teacher', na=False)
    X['emp_title_manager'] = X['emp_title'].str.contains('manager', na=False)
    X['emp_title_owner']   = X['emp_title'].str.contains('owner', na=False)
    
    # Drop categoricals with highest cardinality
    X = X.drop(columns=['emp_title', 'zip_code'])
    
    # Transform features with many nulls to binary flags
    many_nulls = ['sec_app_mths_since_last_major_derog',
                  'sec_app_revol_util',
                  'sec_app_earliest_cr_line',
                  'sec_app_mort_acc',
                  'dti_joint',
                  'sec_app_collections_12_mths_ex_med',
                  'sec_app_chargeoff_within_12_mths',
                  'sec_app_num_rev_accts',
                  'sec_app_open_act_il',
                  'sec_app_open_acc',
                  'revol_bal_joint',
                  'annual_inc_joint',
                  'sec_app_inq_last_6mths',
                  'mths_since_last_record',
                  'mths_since_recent_bc_dlq',
                  'mths_since_last_major_derog',
                  'mths_since_recent_revol_delinq',
                  'mths_since_last_delinq',
                  'il_util',
                  'emp_length',
                  'mths_since_recent_inq',
                  'mo_sin_old_il_acct',
                  'mths_since_rcnt_il',
                  'num_tl_120dpd_2m',
                  'bc_util',
                  'percent_bc_gt_75',
                  'bc_open_to_buy',
                  'mths_since_recent_bc']

    for col in many_nulls:
        X[col] = X[col].isnull()
    
    # For features with few nulls, do mean imputation
    for col in X:
        if X[col].isnull().sum() > 0:
            X[col] = X[col].fillna(X[col].mean())
    
    # Return the wrangled dataframe
    return X

X_train = wrangle(X_train)
X_test  = wrangle(X_test)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
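Because `wrangle` computes each column mean from whichever frame it receives, the call on `X_test` fills nulls with test-set means rather than with values learned from the training data. One alternative is to learn the means once on the training set and reuse them. A minimal sketch, assuming the imputation step is pulled out of `wrangle`; `fit_imputation_means` and `apply_imputation` are hypothetical helpers, and the toy frames only borrow column names from this dataset.

```python
import pandas as pd

def fit_imputation_means(X_train):
    """Learn per-column means from the training data only."""
    return X_train.mean(numeric_only=True)

def apply_imputation(X, means):
    """Fill nulls column by column using previously learned means."""
    return X.fillna(means)

toy_train = pd.DataFrame({'dti': [10.0, 20.0, None],
                          'loan_amnt': [1000.0, 3000.0, 5000.0]})
toy_test = pd.DataFrame({'dti': [None, 40.0],
                         'loan_amnt': [2000.0, None]})

train_means = fit_imputation_means(toy_train)
toy_train_filled = apply_imputation(toy_train, train_means)
toy_test_filled = apply_imputation(toy_test, train_means)
print(toy_test_filled)
```

The missing test values are filled with the training means (15.0 for `dti`, 3000.0 for `loan_amnt`), so the test set contributes nothing to the fitted preprocessing.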