In [1]:
%load_ext autoreload

In [2]:
%run ./common_init.ipynb

Setup logging to file: out.log
Figure output directory saved in figure_output at /home/datarian/OneDrive/unine/Master_Thesis/ma-thesis-report/figures


In [3]:
%autoreload 2
import os
import pathlib
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.feature_selection import VarianceThreshold
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from category_encoders import HashingEncoder, OneHotEncoder, OrdinalEncoder

# Load custom code
import kdd98.data_handler as dh
import kdd98.utils_transformer as ut
from kdd98.transformers import *
from kdd98.config import Config
import pickle

Using TensorFlow backend.


# Splitting into training and test datasets

Before applying *any* transformations, the dataset will be split 80/20 into a learning and test set.

Let's look at feature TARGET_B, which describes whether a person has donated or not:

In [4]:
with open(pathlib.Path(Config.get("df_store"), "imputed_iterative.pkl"), "rb") as f:
    dataset = pickle.load(f)

In [5]:
dataset.TARGET_B.value_counts(normalize=True)  # About 5 % of recipients have donated.

0.0    0.949241
1.0    0.050759
Name: TARGET_B, dtype: float64

We want to preserve this ratio in both the learning and the test set. scikit-learn's `StratifiedShuffleSplit` achieves this by stratifying the split on the target.

In [9]:
from sklearn.model_selection import StratifiedShuffleSplit

seed = Config.get("random_seed")

# A single stratified 80/20 split that preserves the TARGET_B class ratio
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, train_size=0.8, random_state=seed)
learn_index, test_index = next(splitter.split(dataset, dataset.TARGET_B.astype('int')))
l_i = learn_index
t_i = test_index
kdd_learn = dataset.iloc[learn_index]
kdd_test = dataset.iloc[test_index]

Now, check that the two sets are really disjoint:

In [10]:
set(kdd_learn.index).intersection(kdd_test.index)

set()

Check the frequencies of the donors in the sets:

In [11]:
kdd_learn['TARGET_B'].value_counts(normalize=True)

0.0    0.949246
1.0    0.050754
Name: TARGET_B, dtype: float64

In [12]:
kdd_test['TARGET_B'].value_counts(normalize=True)

0.0    0.949222
1.0    0.050778
Name: TARGET_B, dtype: float64
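
The effect of stratification can also be illustrated on a small toy dataset. The following is a minimal sketch, not part of the thesis pipeline: the DataFrame `toy`, its columns `x` and `label`, and the seed `0` are made up for illustration. It builds 1000 rows with 5 % positives and checks that both parts of a stratified 80/20 split keep roughly that share.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical toy data: 1000 rows, 50 of them (5 %) labelled positive
rng = np.random.RandomState(0)
toy = pd.DataFrame({
    "x": rng.normal(size=1000),
    "label": np.r_[np.ones(50, dtype=int), np.zeros(950, dtype=int)],
})

# One stratified 80/20 split on the label column
toy_splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
toy_learn_idx, toy_test_idx = next(toy_splitter.split(toy, toy["label"]))
toy_learn = toy.iloc[toy_learn_idx]
toy_test = toy.iloc[toy_test_idx]

# Share of positives in each part
toy_learn_ratio = toy_learn["label"].mean()
toy_test_ratio = toy_test["label"].mean()
print(toy_learn_ratio, toy_test_ratio)
```

Calling `next()` on the generator returned by `split` is enough here because `n_splits=1` yields exactly one pair of index arrays.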

## Separating features and labels

First, we separate the features from the labels. Both target columns are removed from the feature set and stored separately: TARGET_B, the indicator variable for donors, and TARGET_D, the donation amount.

**All subsequent steps are performed on the training dataset.**

In [13]:
target_cols = ['TARGET_B', 'TARGET_D']
df_store = pathlib.Path(Config.get("df_store"))

# Split both sets into features and targets
kdd_learn_features = kdd_learn.drop(target_cols, axis=1).copy()
kdd_learn_targets = kdd_learn[target_cols].copy()
kdd_test_features = kdd_test.drop(target_cols, axis=1).copy()
kdd_test_targets = kdd_test[target_cols].copy()

# Persist each frame as a pickle in the data store
frames = {
    "learn_features": kdd_learn_features,
    "learn_targets": kdd_learn_targets,
    "test_features": kdd_test_features,
    "test_targets": kdd_test_targets,
}
for name, frame in frames.items():
    with open(df_store / f"{name}.pkl", "wb") as f:
        pickle.dump(frame, f)

# Feature Selection
Feature selection reduces dimensionality by keeping only the features that are informative enough. This speeds up computation and can improve the accuracy of the estimator. Candidate methods:

- Variance threshold
- Recursive feature elimination with cross-validation
- L1-based feature selection (logistic regression, Lasso, SVM)
- Tree-based feature selection

See [scikit-learn: feature selection](http://scikit-learn.org/stable/modules/feature_selection.html#feature-selection)
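
The four approaches listed above can be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the selection procedure used in the thesis: the data comes from `make_classification`, and the estimators (`LogisticRegression`, `LinearSVC`, `ExtraTreesClassifier`) and all hyperparameters such as `C=0.1` or `cv=3` are arbitrary choices for demonstration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import RFECV, SelectFromModel, VarianceThreshold
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# Synthetic data (assumption): 20 features, 5 of them informative,
# plus one constant column that carries no information
X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, random_state=0)
X = np.hstack([X, np.zeros((X.shape[0], 1))])

# 1. Variance threshold: drops features with zero variance
vt = VarianceThreshold(threshold=0.0)
X_vt = vt.fit_transform(X)

# 2. Recursive feature elimination with cross-validation
rfecv = RFECV(LogisticRegression(max_iter=1000), step=1, cv=3)
rfecv.fit(X_vt, y)

# 3. L1-based selection: features with (near-)zero coefficients are dropped
l1 = SelectFromModel(LinearSVC(penalty="l1", dual=False, C=0.1, max_iter=5000))
l1.fit(X_vt, y)

# 4. Tree-based selection: keeps features with above-average importance
trees = SelectFromModel(ExtraTreesClassifier(n_estimators=50, random_state=0))
trees.fit(X_vt, y)

# Number of features each selector keeps
n_selected = {
    "variance": int(vt.get_support().sum()),
    "rfecv": int(rfecv.get_support().sum()),
    "l1": int(l1.get_support().sum()),
    "trees": int(trees.get_support().sum()),
}
print(n_selected)
```

Every selector exposes `get_support()`, so the resulting boolean masks can be compared or combined before deciding which columns to keep.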