In [1]:
%load_ext autoreload

In [2]:
%run ./common_init.ipynb

Setup logging to file: out.log
Figure output directory saved in figure_output at /home/datarian/OneDrive/unine/Master_Thesis/ma-thesis-report/figures


In [3]:
%autoreload 2
import pickle
from sklearn.model_selection import StratifiedShuffleSplit

# Splitting into learning and validation datasets

Before going any further, the dataset is split 80/20 into a learning set and a validation set.

Let's look at the binary target TARGET_B, which indicates whether a recipient has donated:

In [4]:
with open(pathlib.Path(Config.get("df_store"), "Xy_imputed_median.pd.pkl"), "rb") as f:
    dataset = pickle.load(f)

In [5]:
dataset.TARGET_B.value_counts(normalize=True)  # About 5 % of recipients have donated.

0.0    0.949241
1.0    0.050759
Name: TARGET_B, dtype: float64

We want to preserve this ratio in the split datasets. scikit-learn provides a stratified sampler for this task.

In [6]:
seed = Config.get("random_seed")

# A single stratified 80/20 split; stratification is done on the binary target.
splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, train_size=0.8, random_state=seed)
learn_index, val_index = next(splitter.split(dataset, dataset.TARGET_B.astype('int')))
learn = dataset.iloc[learn_index]
validation = dataset.iloc[val_index]

Now, check that the two sets are really disjoint:

In [7]:
set(learn.index).intersection(validation.index)

set()

Check the frequencies of the donors in the sets:

In [8]:
learn['TARGET_B'].value_counts(normalize=True)

0.0    0.949246
1.0    0.050754
Name: TARGET_B, dtype: float64

In [9]:
learn.shape

(76329, 655)

In [10]:
validation['TARGET_B'].value_counts(normalize=True)

0.0    0.949222
1.0    0.050778
Name: TARGET_B, dtype: float64

In [11]:
validation.shape

(19083, 655)

We have a split that preserves the class frequencies.
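
The checks above can be bundled into one small helper. The following is a minimal sketch on synthetic data, not the thesis dataset: the `toy` frame, its 5 % positive share and the helper name `split_report` are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for the real data (hypothetical): 1000 rows, 5 % positives.
toy = pd.DataFrame({
    "feature": np.arange(1000),
    "TARGET_B": np.r_[np.ones(50), np.zeros(950)],
})

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
learn_idx, val_idx = next(splitter.split(toy, toy["TARGET_B"].astype(int)))
toy_learn, toy_val = toy.iloc[learn_idx], toy.iloc[val_idx]


def split_report(full, learn, val, target="TARGET_B"):
    """Summarise a split: disjointness, completeness, and class-share drift."""
    full_share = full[target].mean()  # mean of a 0/1 target = share of positives
    return {
        "disjoint": bool(learn.index.intersection(val.index).empty),
        "complete": len(learn) + len(val) == len(full),
        "max_share_diff": max(
            abs(part[target].mean() - full_share) for part in (learn, val)
        ),
    }


report = split_report(toy, toy_learn, toy_val)
print(report)
```

On the real data, `split_report(dataset, learn, validation)` would give the same three figures in one call; a small non-zero `max_share_diff` is expected there, because the class counts do not divide evenly.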

## Separating features and label

Finally, we separate features and targets for the downstream model evaluation and persist the data frames.

**All subsequent steps are performed on the learning dataset.**

In [12]:
# Learning set: features (everything except the two targets) and targets.
X_train = learn.drop(['TARGET_B', 'TARGET_D'], axis=1).copy()
with open(pathlib.Path(Config.get("df_store"), "X_train.pd.pkl"), "wb") as f:
    pickle.dump(X_train, f)
y_train = learn[['TARGET_B', 'TARGET_D']].copy()
with open(pathlib.Path(Config.get("df_store"), "y_train.pd.pkl"), "wb") as f:
    pickle.dump(y_train, f)

# Validation set: same separation, persisted under the *_val names.
X_val = validation.drop(['TARGET_B', 'TARGET_D'], axis=1).copy()
with open(pathlib.Path(Config.get("df_store"), "X_val.pd.pkl"), "wb") as f:
    pickle.dump(X_val, f)
y_val = validation[['TARGET_B', 'TARGET_D']].copy()
with open(pathlib.Path(Config.get("df_store"), "y_val.pd.pkl"), "wb") as f:
    pickle.dump(y_val, f)
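
A downstream notebook would presumably read these files back with the same `pickle` pattern. The sketch below shows such a round trip and checks that features and targets still share one index. It is illustrative only: a temporary directory stands in for `Config.get("df_store")`, the tiny frames are made up, and the helpers `dump_frame` and `load_frame` are hypothetical names, not part of the project.

```python
import pathlib
import pickle
import tempfile

import pandas as pd


def dump_frame(frame, directory, name):
    """Pickle a data frame to directory/name and return the path."""
    path = pathlib.Path(directory, name)
    with open(path, "wb") as f:
        pickle.dump(frame, f)
    return path


def load_frame(directory, name):
    """Load a pickled data frame from directory/name."""
    with open(pathlib.Path(directory, name), "rb") as f:
        return pickle.load(f)


# Temporary directory as a stand-in for the real data frame store (assumption).
with tempfile.TemporaryDirectory() as store:
    toy_X = pd.DataFrame({"a": [1, 2, 3]}, index=[10, 11, 12])
    toy_y = pd.DataFrame(
        {"TARGET_B": [0.0, 1.0, 0.0], "TARGET_D": [0.0, 12.5, 0.0]},
        index=toy_X.index,
    )
    dump_frame(toy_X, store, "X_train.pd.pkl")
    dump_frame(toy_y, store, "y_train.pd.pkl")

    X_back = load_frame(store, "X_train.pd.pkl")
    y_back = load_frame(store, "y_train.pd.pkl")

    # The reloaded frames should equal the originals and stay row-aligned.
    roundtrip_ok = X_back.equals(toy_X) and y_back.equals(toy_y)
    aligned = X_back.index.equals(y_back.index)
```

Pickle files should only be loaded from trusted locations, since unpickling can execute arbitrary code.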