# Module 4: Feature Engineering

Our overall strategy for feature engineering will include the following steps:
1. Apply domain knowledge to drop features that are not interpretable
2. Drop features with too many missing values (attribute sampling)
3. Drop examples with too many missing values (record sampling)
4. Transform numerical features
5. Encode categorical features

## Configuration

In [None]:
# basic configuration, put these lines at the top of each notebook
%load_ext autoreload
%autoreload 2
%matplotlib inline

In [None]:
# plotting configuration (basically just change plot size)
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = (10, 6)

In [None]:
# show all columns of our data frames
import pandas as pd
pd.options.display.max_columns = None
pd.set_option("display.precision", 2)
pd.options.display.max_rows = 100

## Data loading

In [None]:
DATA_PATH = 'tmp/'
raw = pd.read_csv(f'{DATA_PATH}data_raw.csv')
raw.shape

In [None]:
raw.head()

## Data cleaning & sampling

### Applying domain knowledge to reduce features

Most of the features in our dataset were anonymized and are thus hard to interpret. Luckily, Vesta provides a high-level description of the feature groups in a [Kaggle forum post](https://www.kaggle.com/c/ieee-fraud-detection/discussion/101203#latest-607486). You can see the most important information below.

<img src="img/data_description.png" alt="Data description on Kaggle forum" style="width: 800px" />

Accordingly, we will drop the `V`, `C`, `D`, `M` and `id` features, because we have no way to interpret them during the model evaluation phase.

In [None]:
raw = raw.drop(raw.columns.to_series()["V1":"V339"], axis=1)
raw.shape

In [None]:
raw = raw.drop(raw.columns.to_series()["id_01":"id_38"], axis=1)
raw.shape

In [None]:
raw = raw.drop(raw.columns.to_series()["C1":"C14"], axis=1)
raw.shape

In [None]:
raw = raw.drop(raw.columns.to_series()["D1":"D15"], axis=1)
raw.shape

In [None]:
raw = raw.drop(raw.columns.to_series()["M1":"M9"], axis=1)
raw.shape

This preliminary step leaves us with 19 features (including the target variable) for now.
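
The label slices above depend on the column order in the file. As a minimal sketch of an alternative, the anonymized groups can also be selected by name pattern. The toy frame and its column names below are assumptions for illustration only; the regular expression is not part of the original notebook.

```python
import pandas as pd

# Toy frame with a few hypothetical columns that mimic the anonymized groups.
toy = pd.DataFrame(
    [[0, 12.5, 1, 2, 3, 4, "T"]],
    columns=["isFraud", "TransactionAmt", "V1", "V2", "C1", "id_01", "M1"],
)

# A group letter followed by digits (V1, C1, D1, M1, ...) or "id_" followed by digits.
pattern = r"^(?:[VCDM]\d+|id_\d+)$"
anonymized = toy.columns[toy.columns.str.match(pattern)]

reduced = toy.drop(columns=anonymized)
print(list(reduced.columns))
```

Because the pattern requires digits directly after the group letter, interpretable columns such as `DeviceType` or `dist1` are not matched.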

### Attribute sampling

Now, we can reuse some code from the previous step to show us which columns have the most missing values. We will then decide how many of these we have to drop.

In [None]:
levels = [0.2, 0.5, 0.8]
missing_val_cols = raw.isnull().sum().sort_values(ascending=False) / len(raw)

for l in levels:
    perc = len(missing_val_cols.loc[missing_val_cols > l]) / len(missing_val_cols)
    print('Percentage of features with more than {:.0f}% missing values: {:.1f}%'.format(l * 100, perc * 100))

In [None]:
missing_val_cols * 100

We will set our cutoff at 20% missing values, i.e., columns with more than 20% missing values will be dropped. However, we will make an exception for the features `dist1`, `DeviceType` and `R_emaildomain`, since they are interpretable and might be important for predicting fraud. We will also drop two columns that are of no value to us, namely the index column and `TransactionID`.

In [None]:
cutoff = 0.2

cols_to_drop = missing_val_cols.loc[missing_val_cols > cutoff].index.to_list()
cols_to_drop.remove("DeviceType")
cols_to_drop.remove("R_emaildomain")
cols_to_drop.remove("dist1")
cols_to_drop.append("TransactionID")
cols_to_drop.append("Unnamed: 0")
len(cols_to_drop)

In [None]:
print(f'Number of columns before attribute sampling: {raw.shape[1]}')
raw = raw.drop(labels=cols_to_drop, axis=1)
print(f'Number of columns after attribute sampling: {raw.shape[1]}')

In [None]:
raw.head()

As we can see, we have 15 features left after accounting for missing values and our domain knowledge.
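
To make the cutoff logic easier to follow, here is a small self-contained sketch on made-up data. The column names and the `keep_anyway` exception set are hypothetical; only the 20% cutoff is taken from the text above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "amount": [10.0, 20.0, 30.0, 40.0, 50.0],      # 0% missing
    "dist":   [1.0, np.nan, np.nan, 4.0, 5.0],     # 40% missing, kept as an exception
    "noise":  [np.nan, np.nan, np.nan, 1.0, 2.0],  # 60% missing
})

cutoff = 0.2
keep_anyway = {"dist"}  # hypothetical list of interpretable exceptions

# Share of missing values per column (same as isnull().sum() / len(toy)).
missing_share = toy.isnull().mean()

to_drop = [
    col for col in missing_share.index
    if missing_share[col] > cutoff and col not in keep_anyway
]
reduced = toy.drop(columns=to_drop)
print(to_drop, list(reduced.columns))
```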

### Record sampling

Now, we can use a similar process to remove examples with too many missing values. Including these in our analysis might skew the results, because they would consist largely of imputed values.

In [None]:
levels = [0.1, 0.2, 0.5]
missing_attrs = raw.isnull().sum(axis=1).sort_values(ascending=False) / raw.shape[1]

for l in levels:
    perc = len(missing_attrs.loc[missing_attrs > l]) / len(missing_attrs)
    print('Percentage of records with more than {:.0f}% missing values: {:.1f}%'.format(l * 100, perc * 100))

Since we have plenty of data, we can afford to remove all examples with more than 20% missing values.

In [None]:
cutoff = 0.2

print(f'Number of rows before record sampling: {len(raw)}')
rows_to_drop = missing_attrs.loc[missing_attrs > cutoff].index.to_list()
raw = raw.drop(labels=rows_to_drop, axis=0)
print(f'Number of rows after record sampling: {len(raw)}')
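
The row-wise filter can be traced on a toy frame. This is only a sketch with invented values; it uses a boolean mask instead of collecting index labels, which should give the same result as the cell above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan],
    "b": [1.0, 2.0, np.nan, np.nan],
    "c": [1.0, 2.0, 3.0, np.nan],
    "d": [1.0, 2.0, 3.0, 4.0],
    "e": [1.0, 2.0, 3.0, 5.0],
})

cutoff = 0.2

# Share of missing attributes per row: 0/5, 1/5, 1/5 and 3/5.
row_missing_share = toy.isnull().sum(axis=1) / toy.shape[1]

# Keep rows at or below the cutoff; only rows strictly above it are dropped.
kept = toy.loc[row_missing_share <= cutoff].reset_index(drop=True)
print(len(toy), len(kept))
```

Rows with exactly 20% missing values survive, because the comparison that drops rows is strict.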

This is a good time to save our progress. We have to reset our index first (remember: we removed rows, thus creating holes in the existing index) in order to store the data frame in the efficient _Feather_ format.

In [None]:
raw = raw.reset_index(drop=True)

In [None]:
raw.to_feather(f'{DATA_PATH}feats_raw.feather')

### Dealing with missing values

We still have missing values left in our dataset. In the following, we will discover different ways of dealing with them. Firstly, let's calculate the percentage of missing values in our dataset.

#### Preparation

In [None]:
missing_vals_sum = raw.isnull().sum().sum() 
print(f'Percentage of missing values: {missing_vals_sum / (raw.shape[0] * raw.shape[1]) * 100:.2f}%')

In [None]:
raw.head(n=100)

We will deal with missing values for categorical and numerical variables separately. Let's write a helper function that splits these variable types for us (this function is borrowed from the great [fastai library](https://docs.fast.ai/tabular.html)).

In [None]:
def cont_cat_split(df, dep_var=None):
    cont_names, cat_names = [], []
    for label in df:
        if label == dep_var: continue
        if df[label].dtype == int or df[label].dtype == float: cont_names.append(label)
        else: cat_names.append(label)
    return cont_names, cat_names

In [None]:
num_vars, cat_vars = cont_cat_split(raw, dep_var='isFraud')
print(f'Number of numerical variables: {len(num_vars)}')
print(f'Number of categorical variables: {len(cat_vars)}')

In [None]:
num_vars

In [None]:
cat_vars

We will use the `SimpleImputer` from the `scikit-learn` package to impute values for our numeric variables. Here, we will apply the `median` strategy, because both `TransactionAmt` and `dist1` are probably skewed.

#### Replace missing values for numerical features

In [None]:
from sklearn.impute import SimpleImputer
import numpy as np

In [None]:
# np.nan is the canonical spelling; the np.NaN alias was removed in NumPy 2.0
num_imputer = SimpleImputer(missing_values=np.nan, strategy="median")

In [None]:
for var in num_vars:
    raw[var] = num_imputer.fit_transform(X=raw[[var]])
raw.head()
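
The reason for preferring the median on skewed data can be seen in a tiny example. The amounts below are invented, and plain pandas is used instead of `SimpleImputer` to keep the sketch short; the median strategy computes the same per-column median of the observed values.

```python
import numpy as np
import pandas as pd

# Invented, right-skewed amounts with one extreme value and one gap.
amounts = pd.Series([10.0, 12.0, 15.0, 20.0, 5000.0, np.nan])

mean_fill = amounts.mean()      # pulled far upwards by the single outlier
median_fill = amounts.median()  # stays close to the typical value

imputed = amounts.fillna(median_fill)
print(mean_fill, median_fill)
```

Filling the gap with the mean would insert a value larger than four of the five observed amounts, whereas the median inserts a typical one.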

#### Replace missing values for categorical features

Some of our categorical variables have a lot of unique values, which slows imputation down considerably. Therefore, we should group the less frequent categories together, which will also help our model make sense of the data.

In [None]:
raw.card1 = raw.card1.astype('int64').astype('category')
raw.card2 = raw.card2.astype('int64').astype('category')
raw.card3 = raw.card3.astype('int64').astype('category')
raw.card5 = raw.card5.astype('int64').astype('category')
raw.addr1 = raw.addr1.astype('int64').astype('category')
raw.addr2 = raw.addr2.astype('int64').astype('category')

In [None]:
def coverage_of_top_n_cats(col, n):
    counts = col.value_counts()
    total_count = counts.sum()
    top_n_count = counts[:n].sum()
    print(f'Coverage of top {n} categories for column {col.name}: {top_n_count/total_count*100:.2f}%')

In [None]:
for var in cat_vars:
    unique_vals = len(pd.unique(raw[var]))
    print(f'Unique values in {var}: {unique_vals}')
    coverage_of_top_n_cats(raw[var], 10)
    coverage_of_top_n_cats(raw[var], 20)
    print("\n")

For most of our categorical features, the top 10 categories cover the bulk of the observations. We can therefore condense the long tail into one category.

In [None]:
def restructure_numerical_categories(col, n=10):
    top_ten_cats = list(col.value_counts().index[:n])
    mask = [False if row in top_ten_cats else True for row in col]
    temp = col.mask(mask, other=0)
    d = {0: (n+1)}
    for i, cat in zip(list(range(1, (n+1))), top_ten_cats):
        d[cat] = i
    return temp.astype('category').cat.rename_categories(d)

def restructure_string_categories(col, n=10):
    top_ten_cats = list(col.value_counts().index[:n])
    mask = [False if row in top_ten_cats else True for row in col]
    temp = col.mask(mask, other="other")
    return temp.astype('category')

In [None]:
raw.card1 = restructure_numerical_categories(raw.card1)
raw.card2 = restructure_numerical_categories(raw.card2)
raw.card3 = restructure_numerical_categories(raw.card3)
raw.card5 = restructure_numerical_categories(raw.card5)
raw.addr1 = restructure_numerical_categories(raw.addr1)
raw.addr2 = restructure_numerical_categories(raw.addr2)
raw.P_emaildomain = restructure_string_categories(raw.P_emaildomain)
raw.R_emaildomain = restructure_string_categories(raw.R_emaildomain)
raw.head()
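
As a rough sketch of the same idea, the long tail can also be collapsed with vectorized pandas calls instead of a Python-level list comprehension. The series below is made up and `top_n` is set to 2 only to keep the example small.

```python
import pandas as pd

# Invented email domains: two frequent ones, a short tail and one missing value.
domains = pd.Series(
    ["gmail.com"] * 5 + ["yahoo.com"] * 3 + ["aol.com", "web.de", None]
)

top_n = domains.value_counts().index[:2]  # value_counts ignores missing values

# Keep values that belong to the top categories, replace everything else.
lumped = domains.where(domains.isin(top_n), other="other")
print(lumped.value_counts())
```

Note that in this sketch the missing value also ends up in `other`, because it is not one of the top categories.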

Now, all our large categorical features only contain 11 distinct categories, where the category number also reflects the category's frequency.
At this point, we can impute the missing values in the categorical variables. We will use a constant for this, because filling with the most frequent item would distort the distribution of features with lots of missing values.

In [None]:
cat_imputer = SimpleImputer(missing_values=float('nan'), strategy="constant")

In [None]:
for var in cat_vars:
    raw[var] = cat_imputer.fit_transform(X=raw[[var]])
    raw[var] = raw[var].astype('category')
raw.head()

In [None]:
raw.addr1.cat.categories
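
The effect of the two strategies can be compared on a toy column. The values are invented and plain pandas stands in for `SimpleImputer`; the string `"missing_value"` is the default fill that scikit-learn uses for string data with the constant strategy.

```python
import numpy as np
import pandas as pd

# Invented device types where half of the entries are missing.
device = pd.Series(["desktop", "mobile", np.nan, np.nan, np.nan, "desktop"], dtype="object")

constant_filled = device.fillna("missing_value")
mode_filled = device.fillna(device.mode()[0])

print(constant_filled.value_counts())
print(mode_filled.value_counts())
```

With the most frequent value, `desktop` grows from two to five of six rows, while the constant keeps the observed proportions and makes missingness visible as its own category.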

After confirming that we don't have any missing values left, we can save our progress and go on to transformation of our features.

In [None]:
missing_vals_sum = raw.isnull().sum().sum() 
print(f'Percentage of missing values: {missing_vals_sum / (raw.shape[0] * raw.shape[1]) * 100:.2f}%')

In [None]:
raw.to_feather(f'{DATA_PATH}feats_clean.feather')

## Transformations of numerical variables

In [None]:
data = pd.read_feather(f'{DATA_PATH}feats_clean.feather')
data.head()

### Decomposition

We will start by making more sense of our datetime feature. Our goal is to decompose it into day of the week and hour of the day. With our anonymized, relative datetime it is hard to retrieve more information.

In [None]:
# offset is used to shift the start/end of a day, experimentation shows that offset of 0.58 is optimal 
def make_day_feature(col, offset=0):
    days = col / (3600*24)        
    encoded_days = np.floor(days-1+offset) % 7
    return encoded_days

def make_hour_feature(col):
    hours = col / (3600)        
    encoded_hours = np.floor(hours) % 24
    return encoded_hours

In [None]:
data['day'] = make_day_feature(data['TransactionDT'], offset=0.58).astype('int64').astype('category')
data['hour'] = make_hour_feature(data['TransactionDT']).astype('int64').astype('category')
print(data.day.describe())
print("\n")
print(data.hour.describe())

In [None]:
data = data.drop(["TransactionDT"], axis=1)

In [None]:
data.head()

As you can see, the new features are added to the data frame.
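
To see what the two helpers compute, the following sketch repeats them and applies them to three hand-picked timestamps (in seconds). The timestamps are made up for illustration.

```python
import numpy as np
import pandas as pd

def make_day_feature(col, offset=0):
    days = col / (3600 * 24)
    return np.floor(days - 1 + offset) % 7

def make_hour_feature(col):
    return np.floor(col / 3600) % 24

# 1 day, 1 day + 1 hour, 2.5 days
seconds = pd.Series([86400, 90000, 216000])

hours = make_hour_feature(seconds).astype("int64").tolist()
days_no_offset = make_day_feature(seconds).astype("int64").tolist()
days_offset = make_day_feature(seconds, offset=0.58).astype("int64").tolist()

print(hours)           # [0, 1, 12]
print(days_no_offset)  # [0, 0, 1]
print(days_offset)     # [0, 0, 2]
```

The offset moves the boundary between two days: the third timestamp falls on day 1 without an offset and on day 2 with an offset of 0.58.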

### Rescaling

Rescaling numeric variables is useful for models that are sensitive to differing feature ranges, e.g., logistic regression. We will bring all our numeric variables into the range [0, 1] using the `MinMaxScaler` from the `scikit-learn` package. Beforehand, we will apply a log transformation in order to de-skew the features.

In [None]:
num_vars, cat_vars = cont_cat_split(data, dep_var='isFraud')
num_vars

In [None]:
from sklearn.preprocessing import MinMaxScaler, FunctionTransformer

In [None]:
scaler = MinMaxScaler()
log_transformer = FunctionTransformer(func=np.log1p, inverse_func=np.expm1, validate=False)

for var in num_vars:
    data[var] = log_transformer.fit_transform(data[[var]])
    data[var] = scaler.fit_transform(data[[var]])
data.head()
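
The two steps can be reproduced by hand on a few numbers. This sketch uses NumPy only and invented values; the min-max formula is the one `MinMaxScaler` applies per feature with its default range.

```python
import numpy as np

x = np.array([0.0, 9.0, 99.0])  # invented, right-skewed values

# Step 1: log1p(x) = log(1 + x), which is defined for x = 0.
logged = np.log1p(x)

# Step 2: min-max scaling, (v - min) / (max - min).
scaled = (logged - logged.min()) / (logged.max() - logged.min())

# For comparison: min-max scaling without the log step.
scaled_raw = (x - x.min()) / (x.max() - x.min())

print(scaled)      # approximately [0, 0.5, 1]
print(scaled_raw)  # the middle value is squeezed towards 0
```

After the log step the middle value sits halfway between the extremes, whereas plain min-max scaling leaves it close to zero.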

### Discretization

We already saw an example of discretization in our exploratory data analysis, when we binned numerical data for plotting. Discretization does not make sense for our features, but an example is included nonetheless. We will use pandas' `cut` function for this.

In [None]:
transaction_amounts = raw.TransactionAmt
bins = [0, 10, 50, 100, 500, 1000, 5000, 10000, 50000]

pd.cut(transaction_amounts, bins)

We can see that the data was discretized into eight bins, replacing the original numeric values.
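
A short sketch on three invented amounts shows how values are mapped to bins. The bin edges are the ones used above.

```python
import pandas as pd

amounts = pd.Series([5.0, 75.0, 2500.0])  # invented transaction amounts
bins = [0, 10, 50, 100, 500, 1000, 5000, 10000, 50000]

binned = pd.cut(amounts, bins)

# Integer position of the bin each value fell into.
codes = binned.cat.codes.tolist()
print(binned.tolist())
print(codes)  # [0, 2, 5]
```

By default the intervals are open on the left and closed on the right, so an amount of exactly 0 would not fall into any bin and would become missing.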

### Interaction features

Since we only have two numeric features left, there is only one possible interaction term to include in our dataset. We will add a `dist1*TransactionAmt` feature and examine whether it might be a good predictor for our target variable.

In [None]:
data['dist1*TransactionAmt'] = data.dist1 * data.TransactionAmt
data.head()

In [None]:
df = data[['isFraud', 'dist1*TransactionAmt']].groupby('isFraud').agg(['mean', 'median'])
df[('dist1*TransactionAmt', 'mean')].plot(kind='bar')
plt.show()

The interaction feature will probably not help much, since the means for both groups are almost identical.

## Encoding of categorical variables

### Encoding schemes

Finally, we should encode our categorical variables in order to derive meaningful features that are also interpretable. Since our categorical features are non-ordinal, we can use one-hot encoding, which creates a new feature for every level of each categorical variable. This will result in a "wider" dataset, i.e., a data frame with more columns than before.

In [None]:
one_hot_df = pd.get_dummies(data[cat_vars], prefix=cat_vars)
one_hot_df.head()

In [None]:
data = pd.concat([data, one_hot_df], axis=1)
data.shape
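
On a toy frame the widening is easy to inspect. The values below are invented; the last line shows one optional variant (an assumption, not what the notebook does) that drops the original categorical columns after encoding.

```python
import pandas as pd

toy = pd.DataFrame({
    "card4": ["visa", "mastercard", "visa"],
    "DeviceType": ["mobile", "desktop", "missing_value"],
})
cat_cols = ["card4", "DeviceType"]

# One indicator column per level: 2 levels + 3 levels = 5 columns.
one_hot = pd.get_dummies(toy[cat_cols], prefix=cat_cols)
print(list(one_hot.columns))

# Optional: keep only the indicator columns next to the remaining features.
wide = pd.concat([toy.drop(columns=cat_cols), one_hot], axis=1)
print(wide.shape)
```

Each row has exactly one active indicator per original variable, so the row sums equal the number of encoded variables.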

### Large categorical variables

We saw an example of how to deal with large categorical variables when we limited several features to their top ten categories and an additional `other` category. Other common approaches include feature hashing and bin counting, which we will not elaborate on here.
An alternative to our approach would be to one-hot encode a feature with many categories and subsequently apply a dimensionality reduction algorithm such as PCA in order to reduce the number of columns. This approach is often used in Kaggle competitions.
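
The one-hot-plus-PCA idea can be sketched with NumPy alone. The data is random and the number of components `k` is an arbitrary choice; the projection is computed via a singular value decomposition of the centered matrix, which yields the principal component scores.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Invented high-cardinality feature with up to 50 distinct levels.
col = pd.Series(rng.integers(0, 50, size=200).astype(str), name="card1")

one_hot = pd.get_dummies(col, prefix="card1").to_numpy(dtype=float)

# PCA via SVD: center the columns, then project onto the top-k right singular vectors.
centered = one_hot - one_hot.mean(axis=0)
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)

k = 5  # arbitrary number of components for this sketch
reduced = centered @ vt[:k].T

print(one_hot.shape, reduced.shape)
```

The reduced columns are dense linear combinations of the indicators, so they are more compact but no longer interpretable as individual categories.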

## Saving the pre-processed data

In [None]:
data.to_feather(f'{DATA_PATH}feats_final.feather')

In [None]:
!ls -lh tmp/

As we can see, our file size is down to less than 100MB from the original 700MB.