Three methods for feature selection:

* Remove collinear features
* Remove features whose percentage of missing values exceeds a threshold
* Keep only the most relevant features using feature importances from a model

In [1]:
import pandas as pd
import numpy as np

# featuretools for automated feature engineering
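import importlib.util
# Install only if the package is not available (sys.modules only lists already-imported modules)
if importlib.util.find_spec('featuretools') is None: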
  !pip install featuretools
import featuretools as ft

# matplotlib and seaborn for visualizations
import matplotlib.pyplot as plt
plt.rcParams['font.size'] = 22
import seaborn as sns

# Suppress all warnings (this filter is not limited to pandas)
import warnings
warnings.filterwarnings('ignore')

# modeling 
import lightgbm as lgb

# utilities
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import LabelEncoder

# memory management
import gc

Collecting featuretools
  Downloading https://files.pythonhosted.org/packages/e2/13/d28a29bb0438d69496a97a4c6818ee3c0b3a0f9ade169bd252c05370e495/featuretools-0.23.2-py3-none-any.whl (296kB)

In [5]:
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/MyDrive/Colab\ Notebooks/home\ credit\ default\ risk

Mounted at /content/drive
/content/drive/MyDrive/Colab Notebooks/home credit default risk


In [9]:
train = pd.read_csv('bureau.csv', nrows = 1000)


### Remove Collinear Variables

In [14]:
# Threshold for removing correlated variables
threshold = 0.7

# Absolute value correlation matrix (numeric columns only; bureau.csv also has text columns)
corr_matrix = train.select_dtypes(include='number').corr().abs()
corr_matrix.head()

,SK_ID_CURR,SK_ID_BUREAU,DAYS_CREDIT,CREDIT_DAY_OVERDUE,DAYS_CREDIT_ENDDATE,DAYS_ENDDATE_FACT,AMT_CREDIT_MAX_OVERDUE,CNT_CREDIT_PROLONG,AMT_CREDIT_SUM,AMT_CREDIT_SUM_DEBT,AMT_CREDIT_SUM_LIMIT,AMT_CREDIT_SUM_OVERDUE,DAYS_CREDIT_UPDATE,AMT_ANNUITY
SK_ID_CURR,1.0,0.067046,0.042297,0.003859,0.005542,0.011839,0.068253,0.073558,0.02572,0.016058,0.029952,0.029776,0.006152,0.086643
SK_ID_BUREAU,0.067046,1.0,0.046713,0.01325,0.051398,0.046534,0.101289,0.025153,0.040555,0.01656,0.031745,0.022923,0.062732,0.000423
DAYS_CREDIT,0.042297,0.046713,1.0,0.061713,0.253069,0.858015,0.090935,0.015338,0.110251,0.187181,0.015467,0.021464,0.770254,0.20625
CREDIT_DAY_OVERDUE,0.003859,0.01325,0.061713,1.0,0.020337,,0.006643,0.002174,0.012983,0.00883,0.008114,0.621872,0.021877,
DAYS_CREDIT_ENDDATE,0.005542,0.051398,0.253069,0.020337,1.0,0.26242,0.014588,0.258166,0.113559,0.113175,0.085117,0.016002,0.293078,0.091582


In [15]:
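# Upper triangle of correlations (k=1 excludes the diagonal, so each pair is checked once)
# Use the built-in bool: the np.bool alias was removed in recent NumPy releases
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))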

to_drop = [column for column in upper.columns if any(upper[column] > threshold)]

print('There are %d columns to remove.' % (len(to_drop)))

There are 3 columns to remove.


In [16]:
train = train.drop(columns = to_drop)
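
The collinearity filter above can be tried on data small enough to check by hand. The following is a minimal sketch, not part of the original notebook: the helper name `find_collinear_columns` and the toy columns `a`, `b`, `c` are made up for illustration. Column `b` is an exact multiple of `a`, so it should be the only column flagged at the 0.7 threshold.

```python
import numpy as np
import pandas as pd


def find_collinear_columns(df, threshold=0.7):
    """Return columns whose absolute correlation with an earlier column exceeds threshold."""
    # Correlations are only defined for numeric columns
    corr = df.select_dtypes(include="number").corr().abs()
    # Strict upper triangle: each pair appears once and the diagonal is ignored
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    return [col for col in upper.columns if (upper[col] > threshold).any()]


# Toy data (hypothetical): "b" duplicates the information in "a", "c" is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({"a": a, "b": 2 * a, "c": rng.normal(size=200)})

collinear = find_collinear_columns(toy, threshold=0.7)
toy_reduced = toy.drop(columns=collinear)
```

Because only the upper triangle is inspected, the later column of each correlated pair is the one dropped, so the result depends on column order.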

### Remove Missing Values

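Most models (including most of those in scikit-learn) cannot handle missing values, so these must be filled in before training. Gradient boosting machines, at least as implemented in LightGBM, can handle missing values natively. Even so, a column that is mostly empty carries little information, so here we drop every column with more than 40% missing values.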

In [22]:
# Fraction of missing values per column in train (0.931 means 93.1%)
train_missing = (train.isnull().sum() / len(train)).sort_values(ascending = False)
train_missing.head()

AMT_ANNUITY               0.931
AMT_CREDIT_MAX_OVERDUE    0.593
AMT_CREDIT_SUM_LIMIT      0.299
DAYS_CREDIT_ENDDATE       0.066
CREDIT_TYPE               0.000
dtype: float64

In [23]:
# Identify columns whose missing fraction exceeds the 40% threshold
train_missing = train_missing.index[train_missing > 0.4]

all_missing = list(set(train_missing))
print('There are %d columns with more than 40%% missing values' % len(all_missing))

There are 2 columns with more than 40% missing values


In [24]:
# Drop the sparse columns, then one-hot encode the remaining categorical columns
train = pd.get_dummies(train.drop(columns = all_missing))
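
The missing-value filter above can be wrapped in a small helper and checked on toy data. This is a sketch under stated assumptions rather than the notebook's own code: `drop_sparse_columns` and the toy column names are hypothetical, and the 0.4 cutoff simply mirrors the 40% threshold used above.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df, max_missing=0.4):
    """Drop columns whose fraction of missing values is above max_missing."""
    missing_fraction = df.isnull().mean()
    sparse = missing_fraction.index[missing_fraction > max_missing].tolist()
    return df.drop(columns=sparse), sparse


# Toy data (hypothetical) with known missing fractions
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0, 2.0],  # 60% missing
    "some_missing": [1.0, np.nan, 3.0, 4.0, 5.0],          # 20% missing
    "kind": ["a", "b", "a", "b", "a"],                      # categorical, complete
})

kept, dropped = drop_sparse_columns(toy, max_missing=0.4)
# One-hot encode the categorical column; numeric columns pass through unchanged
encoded = pd.get_dummies(kept)
```

Note that `pd.get_dummies` does not fill anything in: the remaining `NaN` in `some_missing` is still present afterwards, which is fine for LightGBM but would need imputation for models that reject missing values.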

### Feature Selection through Feature Importances
This step is covered in the final section of EDAonTrainandTest.ipynb.
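
For orientation, here is one common way to turn feature importances into a selection. It is an illustrative sketch only and not necessarily the approach taken in EDAonTrainandTest.ipynb. The helper `select_by_importance`, the 0.75 cutoff, and the importance scores are all made up. In practice the scores would come from a fitted model, for example the `feature_importances_` attribute of a LightGBM scikit-learn estimator.

```python
import numpy as np
import pandas as pd


def select_by_importance(feature_names, importances, cumulative=0.95):
    """Keep the top features that together reach `cumulative` of total importance."""
    ranked = (
        pd.DataFrame({"feature": feature_names, "importance": importances})
        .sort_values("importance", ascending=False)
        .reset_index(drop=True)
    )
    # Features with zero importance were never used by the model
    ranked = ranked[ranked["importance"] > 0].copy()
    ranked["cumulative"] = ranked["importance"].cumsum() / ranked["importance"].sum()
    # Keep every feature up to and including the one that crosses the cutoff
    n_keep = int(np.searchsorted(ranked["cumulative"].to_numpy(), cumulative)) + 1
    return ranked["feature"].iloc[:n_keep].tolist()


# Hypothetical importances for four features
names = ["f1", "f2", "f3", "f4"]
scores = [50, 0, 30, 20]
selected = select_by_importance(names, scores, cumulative=0.75)
```

Here `f2` is discarded for having zero importance, and `f1` plus `f3` already account for 80% of the total, so `f4` is dropped as well.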