Feature engineering is an important part of machine learning: we modify existing features or create (i.e., engineer) new ones from our dataset that might be meaningful in predicting the TARGET.

Each dataset provides more information about the loan application, such as how prompt the applicant has been with their instalment payments, their credit history on other loans, and the cash or credit card balances they hold. A data scientist or researcher should always investigate and create new features from all the information provided.

In this basic exercise, we will focus on the main dataset, application_train.csv. We will go through two simple methodologies in feature engineering.

In [None]:

import pandas as pd
import numpy as np
import sklearn


# importing graphics modules
import matplotlib.pyplot as plt
import seaborn as sns



In [None]:
df_app_test_align = db['df_app_test_align'] 
df_app_train_align = db['df_app_train_align'] 
df_app_train_corr_target = db['df_app_train_corr_target']



In [None]:
print(df_app_train_corr_target.tail(10))
print(df_app_train_corr_target.head(10))

**Polynomial features**
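The variables that we select are EXT_SOURCE_1/2/3 (-ve), DAYS_BIRTH (+ve), and DAYS_EMPLOYED (+ve), which all have large correlation values with TARGET relative to the other features. Note that DAYS_EMPLOYED is commented out in the code below, so the polynomial features are built from the other four variables only.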

We create new poly_feat_x dataframes for both splits because the training and test datasets must have the same features. Thus, any polynomial features created in the training dataset must be created for the test dataset too.

In [None]:
imp_feat_list = ['EXT_SOURCE_1','EXT_SOURCE_2','EXT_SOURCE_3','DAYS_BIRTH'] #  ,'DAYS_EMPLOYED'

poly_feat_train = df_app_train_align[imp_feat_list]
poly_feat_test = df_app_test_align[imp_feat_list]


We observed that several features often had NaN values. We use the SimpleImputer class from scikit-learn's impute module to replace every np.nan with the median value of its column.

We fit on the training data, as that is all the in-sample data that we have. Then we apply the transformation to both the training and test datasets.

In [None]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan,strategy='median')
imputer.fit(poly_feat_train)

poly_feat_train = imputer.transform(poly_feat_train)
poly_feat_test = imputer.transform(poly_feat_test)
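
To make the fit-on-train, transform-both idea concrete, here is a minimal sketch on made-up data. The `toy_` names and values are hypothetical and are not part of the loan dataset. It shows that a missing value in the test split is filled with the median learned from the training split.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: one column with a missing value in each split.
toy_train = pd.DataFrame({'x': [1.0, 2.0, np.nan, 10.0]})
toy_test = pd.DataFrame({'x': [np.nan, 4.0]})

toy_imputer = SimpleImputer(missing_values=np.nan, strategy='median')
toy_imputer.fit(toy_train)  # learns the median of the training column only

toy_train_filled = toy_imputer.transform(toy_train)
toy_test_filled = toy_imputer.transform(toy_test)

print(toy_imputer.statistics_)  # the per-column medians learned from toy_train
print(toy_test_filled)          # the NaN in toy_test is replaced by the training median
```

Note that `transform` returns a NumPy array by default, so the column names are lost at this point. This is why the feature names are passed in explicitly further down.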

In [None]:
from sklearn.preprocessing import PolynomialFeatures
poly_transform = PolynomialFeatures(degree=3)
poly_transform.fit(poly_feat_train)

poly_transform_train = poly_transform.transform(poly_feat_train)
poly_transform_test = poly_transform.transform(poly_feat_test)

print('Shape of polynomial features (training): {}'.format(poly_transform_train.shape))
print('Shape of polynomial features (test): {}'.format(poly_transform_test.shape))

Shape of polynomial features (training): (307511, 35)
Shape of polynomial features (test): (48744, 35)

We can see a detailed list of the new polynomial and interaction features that have been created.

In [None]:
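# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0
poly_feat_name_list = poly_transform.get_feature_names_out(imp_feat_list)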

We would like to see whether any of these new features have a stronger correlation with TARGET than the original features. Where a new polynomial feature has a greater correlation magnitude than the features it was built from, we should consider adding it to our model.

In [None]:
df_poly_feat_train = pd.DataFrame(poly_transform_train,columns=poly_feat_name_list)
df_poly_feat_train['TARGET'] = df_app_train_align['TARGET'].values  # assign by position; the indexes are only matched up in a later cell
poly_feat_corr = df_poly_feat_train.corr()['TARGET'].sort_values()

print(poly_feat_corr.head(10))
print(poly_feat_corr.tail(10))
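
Because we care about the magnitude of the correlation rather than its sign, it can be convenient to rank by absolute value instead of reading both ends of the sorted series. The sketch below uses synthetic data (the columns `a`, `b` and the way TARGET is generated are assumptions made purely for illustration) in which the label depends on the product of two features, so the interaction term should rank above either original feature.

```python
import numpy as np
import pandas as pd

# Synthetic data: TARGET depends on the product a*b, not on a or b alone.
rng = np.random.default_rng(0)
toy = pd.DataFrame({'a': rng.normal(size=500),
                    'b': rng.normal(size=500)})
toy['a b'] = toy['a'] * toy['b']
toy['TARGET'] = (toy['a b'] + 0.5 * rng.normal(size=500) > 0).astype(int)

# Correlation of every feature with TARGET, ranked by magnitude.
toy_corr = toy.corr()['TARGET'].drop('TARGET')
ranked = toy_corr.abs().sort_values(ascending=False)
print(ranked)
```

The same two lines at the end could be applied to poly_feat_corr to get a single ranked list.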

In [None]:
df_poly_feat_test = pd.DataFrame(poly_transform_test,columns=poly_feat_name_list)
df_poly_feat_train.index = df_app_train_align.index
df_poly_feat_test.index  = df_app_test_align.index

In [None]:
df_app_train_poly = df_app_train_align.merge(df_poly_feat_train,left_index=True,right_index=True)
df_app_test_poly = df_app_test_align.merge(df_poly_feat_test,left_index=True,right_index=True)
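
One thing to watch for in this merge: when both dataframes share column names, pandas keeps both copies and appends the suffixes `_x` and `_y`. Here the degree-1 polynomial terms carry the same names as the original columns, and TARGET was added to df_poly_feat_train above, so such duplicates are to be expected. The sketch below reproduces the behaviour on hypothetical toy frames and shows one possible way of avoiding it, by dropping the overlapping columns before merging.

```python
import pandas as pd

# Hypothetical toy frames that share the columns 'EXT_SOURCE_1' and 'TARGET'.
toy_left = pd.DataFrame({'EXT_SOURCE_1': [0.1, 0.2], 'TARGET': [0, 1]},
                        index=[100, 101])
toy_right = pd.DataFrame({'EXT_SOURCE_1': [0.1, 0.2],
                          'EXT_SOURCE_1^2': [0.01, 0.04],
                          'TARGET': [0, 1]},
                         index=[100, 101])

# Overlapping names are kept twice, with the default suffixes _x and _y.
toy_merged = toy_left.merge(toy_right, left_index=True, right_index=True)
print(toy_merged.columns.tolist())

# Possible alternative: drop the overlapping columns from the right-hand frame first.
overlap = toy_right.columns.intersection(toy_left.columns)
toy_merged_clean = toy_left.merge(toy_right.drop(columns=overlap),
                                  left_index=True, right_index=True)
print(toy_merged_clean.columns.tolist())
```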

In [None]:
df_app_train_poly_align, df_app_test_poly_align = df_app_train_poly.align(df_app_test_poly,join='inner',axis=1)

In [None]:
df_app_train_poly_align.head()

In [None]:
df_app_test_poly_align.head()

Our original dataset has 237 columns (features). The polynomial dataframe has 36 columns: the 35 polynomial and interaction terms plus TARGET. Merging the two gives 273 columns, and aligning the training and test datasets leaves 271, because columns that are not present in both datasets are dropped.

In [None]:
print('Original features (train): {}'.format(df_app_train_align.shape))
print('Polynomial features (train): {}'.format(df_poly_feat_train.shape))
print('Original & polynomial features (train): {}'.format(df_app_train_poly.shape))
print('Original & polynomial features align (train): {}'.format(df_app_train_poly_align.shape))

In [None]:
s1 = set(df_app_train_poly.columns)
s2 = set(df_app_train_poly_align.columns)

diff_s1_s2 = s1-s2
diff_s1_s2
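
An inner alignment on the columns keeps only the columns that exist in both dataframes, so anything that exists only in the training set, such as the label, is removed. In this notebook that is presumably what happens to the TARGET columns, which diff_s1_s2 should confirm. The sketch below shows the effect on hypothetical toy frames, together with one way of keeping the labels: set them aside before aligning and attach them again afterwards.

```python
import pandas as pd

# Hypothetical toy frames: the label exists only in the training set.
toy_train = pd.DataFrame({'f1': [1, 2], 'f2': [3, 4], 'TARGET': [0, 1]})
toy_test = pd.DataFrame({'f1': [5], 'f2': [6]})

toy_labels = toy_train['TARGET']  # set the labels aside before aligning

# Keep only the columns that both frames have in common.
toy_train_al, toy_test_al = toy_train.align(toy_test, join='inner', axis=1)

dropped = set(toy_train.columns) - set(toy_train_al.columns)
print(dropped)

# Re-attach the labels to the aligned training set.
toy_train_al = toy_train_al.copy()
toy_train_al['TARGET'] = toy_labels
print(toy_train_al.columns.tolist())
```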


**Expert knowledge features**
Often, experts have domain knowledge about which combinations of existing features have strong explanatory or predictive power. In this case we are looking at the following features:

- Percentage of days employed (DAYS_EMPLOYED_PCT) - How long a person has been employed as a percentage of their life may be a strong predictor of their ability to keep paying off their loans.
- Credit as a percentage of income (CREDIT_INCOME_PCT) - If the credit amount is very large relative to a person's income, this can affect their ability to pay off the loan.
- Annuity as a percentage of income (ANNUITY_INCOME_PCT) - In this dataset AMT_ANNUITY is the loan annuity, i.e., the regular repayment amount, not income received. A higher ratio therefore means that repayments take up a larger share of income.
- Annuity as a percentage of credit (CREDIT_TERM) - The annuity relative to the credit amount indicates how quickly the loan is repaid, i.e., it is related to the term of the loan.

In [None]:
df_app_train_align_expert = df_app_train_align.copy()
df_app_test_align_expert = df_app_test_align.copy()

# Training dataset
df_app_train_align_expert['DAYS_EMPLOYED_PCT'] = df_app_train_align_expert['DAYS_EMPLOYED'] / df_app_train_align_expert['DAYS_BIRTH']
df_app_train_align_expert['CREDIT_INCOME_PCT'] = df_app_train_align_expert['AMT_CREDIT'] / df_app_train_align_expert['AMT_INCOME_TOTAL']
df_app_train_align_expert['ANNUITY_INCOME_PCT'] = df_app_train_align_expert['AMT_ANNUITY'] / df_app_train_align_expert['AMT_INCOME_TOTAL']
df_app_train_align_expert['CREDIT_TERM'] = df_app_train_align_expert['AMT_ANNUITY'] / df_app_train_align_expert['AMT_CREDIT']

# Test dataset
df_app_test_align_expert['DAYS_EMPLOYED_PCT'] = df_app_test_align_expert['DAYS_EMPLOYED'] / df_app_test_align_expert['DAYS_BIRTH']
df_app_test_align_expert['CREDIT_INCOME_PCT'] = df_app_test_align_expert['AMT_CREDIT'] / df_app_test_align_expert['AMT_INCOME_TOTAL']
df_app_test_align_expert['ANNUITY_INCOME_PCT'] = df_app_test_align_expert['AMT_ANNUITY'] / df_app_test_align_expert['AMT_INCOME_TOTAL']
df_app_test_align_expert['CREDIT_TERM'] = df_app_test_align_expert['AMT_ANNUITY'] / df_app_test_align_expert['AMT_CREDIT']
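
Since the same four ratios are computed for both datasets, they could also be wrapped in a small helper so that the training and test sets cannot drift apart. The function name `add_expert_features` and the toy values below are hypothetical. This is a sketch of one way to organise the code, not a change to the features themselves.

```python
import numpy as np
import pandas as pd


def add_expert_features(df):
    """Return a copy of df with the four ratio features added."""
    out = df.copy()
    out['DAYS_EMPLOYED_PCT'] = out['DAYS_EMPLOYED'] / out['DAYS_BIRTH']
    out['CREDIT_INCOME_PCT'] = out['AMT_CREDIT'] / out['AMT_INCOME_TOTAL']
    out['ANNUITY_INCOME_PCT'] = out['AMT_ANNUITY'] / out['AMT_INCOME_TOTAL']
    out['CREDIT_TERM'] = out['AMT_ANNUITY'] / out['AMT_CREDIT']
    return out


# Hypothetical single-row example with easy-to-check values.
toy_app = pd.DataFrame({'DAYS_BIRTH': [-10000],
                        'DAYS_EMPLOYED': [-2000],
                        'AMT_CREDIT': [500000.0],
                        'AMT_INCOME_TOTAL': [100000.0],
                        'AMT_ANNUITY': [25000.0]})

toy_expert = add_expert_features(toy_app)
print(toy_expert.iloc[0])
```

With such a helper, the two blocks above would reduce to one call each, e.g. `add_expert_features(df_app_train_align)` and `add_expert_features(df_app_test_align)`.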

# **Summary**

**Polynomial feature engineering**

1. Evaluate which features have the largest +ve and -ve correlations with TARGET.
2. Extract those features and fill in any np.nan values by imputing with the median of that column (i.e., sklearn.impute.SimpleImputer).
3. Create new polynomial and interaction features (i.e., sklearn.preprocessing.PolynomialFeatures).
4. Evaluate whether these new polynomial and interaction features exhibit greater +ve and -ve correlations with TARGET compared to the original feature set. If so, consider creating a new dataset with them.
5. Set the row key identifiers (i.e., the index) on the new polynomial feature set for both the training and test datasets (i.e., df_poly_feat_train, df_poly_feat_test).
6. Merge this new polynomial feature dataset with the original feature dataset (i.e., merge df_poly_feat_train and df_app_train_align) for both training and test datasets.
7. Align the new training and test datasets together.

**Expert feature engineering**

- These are features that are known from domain knowledge to have high explanatory and predictive power.
- They combine features from the original set, which can make your model more parsimonious.
- Once you've created these expert features, compare their correlations with TARGET and evaluate whether they are greater than those of the individual features themselves.