<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Loading-the-learning-dataset" data-toc-modified-id="Loading-the-learning-dataset-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Loading the learning dataset</a></span></li><li><span><a href="#Splitting-into-training--and-test-dataset" data-toc-modified-id="Splitting-into-training--and-test-dataset-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Splitting into training- and test dataset</a></span></li><li><span><a href="#Separating-features-and-label" data-toc-modified-id="Separating-features-and-label-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Separating features and label</a></span></li><li><span><a href="#Cleaning-pipeline" data-toc-modified-id="Cleaning-pipeline-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Cleaning pipeline</a></span><ul class="toc-item"><li><span><a href="#Automating-cleaning-and-feature-extraction" data-toc-modified-id="Automating-cleaning-and-feature-extraction-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Automating cleaning and feature extraction</a></span></li></ul></li><li><span><a href="#Imputation-of-missing-values" data-toc-modified-id="Imputation-of-missing-values-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Imputation of missing values</a></span></li><li><span><a href="#Removing-constant-features" data-toc-modified-id="Removing-constant-features-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Removing constant features</a></span></li><li><span><a href="#Exploring-strategies-for-specific-feature-types" data-toc-modified-id="Exploring-strategies-for-specific-feature-types-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Exploring strategies for specific feature types</a></span><ul class="toc-item"><li><span><a href="#Constant-and-Sparse-Features" data-toc-modified-id="Constant-and-Sparse-Features-7.1"><span class="toc-item-num">7.1&nbsp;&nbsp;</span>Constant and Sparse Features</a></span></li><li><span><a href="#Numerical-features" data-toc-modified-id="Numerical-features-7.2"><span class="toc-item-num">7.2&nbsp;&nbsp;</span>Numerical features</a></span></li><li><span><a href="#Categorical-features" data-toc-modified-id="Categorical-features-7.3"><span class="toc-item-num">7.3&nbsp;&nbsp;</span>Categorical features</a></span></li><li><span><a href="#Remaining-object-features" data-toc-modified-id="Remaining-object-features-7.4"><span class="toc-item-num">7.4&nbsp;&nbsp;</span>Remaining object features</a></span></li></ul></li><li><span><a href="#Preprocessing-Pipeline" data-toc-modified-id="Preprocessing-Pipeline-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Preprocessing Pipeline</a></span></li><li><span><a href="#Feature-Selection" data-toc-modified-id="Feature-Selection-9"><span class="toc-item-num">9&nbsp;&nbsp;</span>Feature Selection</a></span><ul class="toc-item"><li><span><a href="#Removing-constant-features-(zero-variance)" data-toc-modified-id="Removing-constant-features-(zero-variance)-9.1"><span class="toc-item-num">9.1&nbsp;&nbsp;</span>Removing constant features (zero variance)</a></span></li><li><span><a href="#Sparse-Features" data-toc-modified-id="Sparse-Features-9.2"><span class="toc-item-num">9.2&nbsp;&nbsp;</span>Sparse Features</a></span></li><li><span><a href="#Advanced-approaches" data-toc-modified-id="Advanced-approaches-9.3"><span class="toc-item-num">9.3&nbsp;&nbsp;</span>Advanced approaches</a></span></li></ul></li><li><span><a href="#Feature-Extraction" data-toc-modified-id="Feature-Extraction-10"><span class="toc-item-num">10&nbsp;&nbsp;</span>Feature Extraction</a></span></li></ul></div>

# Cleaning


This notebook contains all code for the cleaning of the KDD Cup 98 datasets.

* Splits into learning and test
* Prepares the data for model fitting

This will be done with scikit-learn's transforming framework in order to ensure all transformations are applied identically on training, test and validation datasets.

First, the steps necessary are analysed, then the implemented cleaner is introduced.

In [1]:
%load_ext autoreload

In [2]:
%run ./common_init.ipynb

Setup logging to file: out.log
Figure output directory saved in figure_output at /home/datarian/OneDrive/unine/Master_Thesis/figures


In [3]:
%autoreload 2
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.feature_selection import VarianceThreshold
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from category_encoders import HashingEncoder, OneHotEncoder, OrdinalEncoder

# Load custom code
import kdd98.data_loader as dl
import kdd98.utils_transformer as ut
from kdd98.transformers import *
from kdd98.config import Config

In [4]:
# Where to save the figures
IMAGES_PATH = pathlib.Path(figure_output/'preprocessing')

pathlib.Path(IMAGES_PATH).mkdir(parents=True, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = pathlib.Path(IMAGES_PATH/fig_id + "." + fig_extension)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

## Loading the learning dataset


Set working directory to main code folder

In [35]:
data_loader = dl.KDD98DataLoader("cup98LRN.txt")
learning_raw = data_loader.raw_data

## Overview

A first, general look at the data structure:

In [6]:
learning_raw.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 95412 entries, 95515 to 185114
Columns: 480 entries, ODATEDW to GEOCODE2
dtypes: category(25), datetime64[ns](53), float64(48), int64(297), object(57)
memory usage: 334.2+ MB


* There are 481 features (of which one is the index)
* A total of 95412 examples
* 24 categorical features, 53 datetime features, 48 numerical features with missing values, 297 integer features without missing values and 56 object (string) features

In [21]:
learning_raw.head()

Unnamed: 0_level_0,ODATEDW,OSOURCE,TCODE,STATE,ZIP,MAILCODE,PVASTATE,DOB,NOEXCH,RECINHSE,...,TARGET_D,HPHONE_D,RFA_2R,RFA_2F,RFA_2A,MDMAUD_R,MDMAUD_F,MDMAUD_A,CLUSTER2,GEOCODE2
CONTROLN,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
95515,1989-01-01,GRI,0,IL,61081,,,1937-12-01,0,,...,0,0,L,4,E,X,X,X,39.0,C
148535,1994-01-01,BOA,1,CA,91326,,,1952-02-01,0,,...,0,0,L,2,G,X,X,X,1.0,A
15078,1990-01-01,AMH,1,NC,27017,,,NaT,0,,...,0,1,L,4,E,X,X,X,60.0,C
172556,1987-01-01,BRY,0,CA,95953,,,1928-01-01,0,,...,0,1,L,4,E,X,X,X,41.0,C
7112,1986-01-01,,0,FL,33176,,,1920-01-01,0,X,...,0,1,L,2,F,X,X,X,26.0,A


### Numerical Features

In [22]:
numerical = learning_raw.select_dtypes(include=np.number).columns
print("There are {:1} numerical features".format(len(numerical)))

There are 345 numerical features


The ZIP code, which should be numerical, is missing from the list as it has some input errors. This is evident as it is a object feature:

In [23]:
learning_raw.ZIP.describe()

count     95412
unique    19938
top       85351
freq         61
Name: ZIP, dtype: object

In [24]:
# Fix formatting for ZIP feature
learning_raw.ZIP = learning_raw.ZIP.str.replace(
    '-', '').replace([' ', '.'], np.nan).astype('int64')

### Categorical Features

Some categories are already created on import of the data. Additionally, we will have to treat some special cases:

* Multibyte features. These are features that group together several related nominal features. These are mainly the promotion history codes. Recency, Frequency and Amount as of a particular mailing are glued together in one feature. For RFA_2 and additionally MDMAUD, the major donor matrix, the features were already spread out by the supplier of the data. These two were dropped on import of the CSV file and their spread out features kept.

* OSOURCE: It identifies the origin of the data for a particular record. However, it has so many levels that the feature space would get inflated heavily by one-hot encoding. For this feature, hasing is employed.

* TCODE: Special treatment will also be necessary for the TCODE feature. It describes the title code (Ms., Hon., and so on) in an unfortunate integer coding ranging from 1e0 to 1e4. We will also use the hasing encoder for these features

After having the categorical features ready, missing values are assigned their own category, 'missing'. Then, all non-hashed categorical features are one-hot encoded.

In [25]:
categories = learning_raw.select_dtypes(include='category').columns
print(categories)

Index(['STATE', 'PVASTATE', 'DOMAIN', 'CLUSTER', 'CHILD03', 'CHILD07',
       'CHILD12', 'CHILD18', 'INCOME', 'GENDER', 'WEALTH1', 'DATASRCE',
       'SOLP3', 'SOLIH', 'WEALTH2', 'GEOCODE', 'LIFESRC', 'TARGET_D', 'RFA_2R',
       'RFA_2F', 'RFA_2A', 'MDMAUD_R', 'MDMAUD_F', 'MDMAUD_A', 'GEOCODE2'],
      dtype='object')


In [26]:
learning_raw[categories].describe().transpose()

Unnamed: 0,count,unique,top,freq
STATE,95412,57,CA,17343
PVASTATE,1458,2,P,1453
DOMAIN,93096,16,R2,13623
CLUSTER,93096,53,40,3979
CHILD03,1146,3,M,869
CHILD07,1566,3,M,1061
CHILD12,1811,3,M,1149
CHILD18,2847,3,M,1442
INCOME,74126,7,5,15451
GENDER,92455,6,F,51277


#### Treating multibyte features

In [27]:
print(dl.NOMINAL_FEATURES)

['OSOURCE', 'TCODE', 'RFA_3', 'RFA_4', 'RFA_5', 'RFA_6', 'RFA_7', 'RFA_8', 'RFA_9', 'RFA_10', 'RFA_11', 'RFA_12', 'RFA_13', 'RFA_14', 'RFA_15', 'RFA_16', 'RFA_17', 'RFA_18', 'RFA_19', 'RFA_20', 'RFA_21', 'RFA_22', 'RFA_23', 'RFA_24']


The cup documentation states that for the MDMAUD_* features, X is used as NA code. This is fixed now:

In [28]:
learning_raw[['MDMAUD_R', 'MDMAUD_F', 'MDMAUD_A']] = learning_raw.loc[:, ['MDMAUD_R', 'MDMAUD_F', 'MDMAUD_A']].replace('X', np.nan)

In [30]:
multibyte_transformer = ColumnTransformer([
            ("spread_rfa",
             MultiByteExtract(["R", "F", "A"]),
             dl.NOMINAL_FEATURES[2:]),
             ("spread_domain",
             MultiByteExtract(["Urbanicity", "SocioEconomic"]),
             ["DOMAIN"])
        ])

In [32]:
multibytes = multibyte_transformer.fit_transform(learning_raw)
multibytes_names = [n[n.find('__')+2:]
                 for n in multibyte_transformer.get_feature_names()]

Merge learning and the new nominal features, then drop the originals

In [None]:
multibytes = pd.DataFrame(data=multibytes, columns=multibytes_names,
                   index=learning_raw.index).astype("category")
learning_raw = learning_raw.merge(multibytes, on=learning_raw.index.name)

In [35]:
learning_raw.drop(dl.NOMINAL_FEATURES[2:]+["DOMAIN"], inplace=True)

KeyError: "['RFA_3' 'RFA_4' 'RFA_5' 'RFA_6' 'RFA_7' 'RFA_8' 'RFA_9' 'RFA_10' 'RFA_11'\n 'RFA_12' 'RFA_13' 'RFA_14' 'RFA_15' 'RFA_16' 'RFA_17' 'RFA_18' 'RFA_19'\n 'RFA_20' 'RFA_21' 'RFA_22' 'RFA_23' 'RFA_24' 'DOMAIN'] not found in axis"

In [None]:
for cat in learning_raw.select_dtypes(include="category").columns:
    learning_raw[cat] = learning_raw[cat].cat.remove_unused_categories()
    print("Feature: {}\n{}".format(cat, learning_raw[cat].cat.categories))

#### Ordinal features

Several ordinal features are present. We need to ensure to encode the levels correctly.

In [36]:
ordinal_transformer = ColumnTransformer([
            ("order_mdmaud",
             OrdinalEncoder(mapping=dl.ORDINAL_MAPPING_MDMAUD,
                            handle_unknown='ignore'),
             ['MDMAUD_R', 'MDMAUD_A']),
            ("order_rfa",
             OrdinalEncoder(mapping=dl.ORDINAL_MAPPING_RFA,
                            handle_unknown='ignore'),
                            list(learning_raw.filter(regex=r"RFA_\d{1,2}A", axis=1).columns.values)),
            ("recode_socioecon", RecodeUrbanSocioEconomic(), ["DOMAINUrbanicity", "DOMAINSocioEconomic"])
        ])

In [37]:
ordinals = ordinal_transformer.fit_transform(learning_raw)

In [38]:
ordinal_names = [n[n.find('__')+2:]
                 for n in ordinal_transformer.get_feature_names()]

In [41]:
ordinals = pd.DataFrame(data=ordinals, columns=ordinal_names,
                   index=learning_raw.index).astype("category")
learning_raw[ordinal_names] = ordinals

When the order is obvious, no order has to be passed in (i.e. 0 < 1 < 2 < 3 < ... and alphabetical)

In [None]:
learning_raw["WEALTH1"].describe()

In [43]:
remaining_ordinals = ['WEALTH1','WEALTH2','INCOME']+learning_raw.filter(regex=r"RFA_\d{1,2}F").columns.values.tolist()

for f in learning_raw[remaining_ordinals]:
    try:
        learning_raw[f] = learning_raw[f].cat.as_ordered()
    except AttributeError:
        learning_raw[f] = learning_raw[f].astype("category").cat.as_ordered()

### Binary features

For these, we will convert the values specified as True and False as per the dataset dictionary into 1.0 and 0.0 respectively. Furthermore, input errors are also being treated. In the end, these features will be of dtype float64, having {1.0, 0.0 and NaN} as values.

For features that either have a value representing True or are empty (as specified in the dataset dictionary), all empty cells will be considered False. For features specifically denoting True and False values, these will be coded appropriately and empty cells set to NaN.

In [37]:
learning_raw[dl.BINARY_FEATURES].describe().transpose()

Unnamed: 0,count,unique,top,freq
MAILCODE,95412,2,,94013
NOEXCH,95412,4,0,95085
RECSWEEP,95412,2,,93795
RECINHSE,95412,2,,88709
RECP3,95412,2,,93395
RECPGVG,95412,2,,95298
AGEFLAG,95412,3,E,57344
HOMEOWNR,95412,3,H,52354
MAJOR,95412,2,,95118
COLLECT1,95412,2,,90210


NOEXCH has X and 1 for True, 0 for False, which is not consistent with the documentation. It is therefore recoded to 1/0

In [38]:
learning_raw.NOEXCH.unique()

array(['0', '1', 'X', ' '], dtype=object)

In [39]:
# Fix binary encoding inconsistency for NOEXCH
learning_raw.NOEXCH = learning_raw.NOEXCH.str.replace("X", "1")

In [40]:
binary_transformer = ColumnTransformer([
            ("binary_x_bl",
             BinaryFeatureRecode(
                 value_map={'true': 'X', 'false': ''}, correct_noisy=False),
             ['PEPSTRFL', 'MAJOR', 'RECINHSE',
                 'RECP3', 'RECPGVG', 'RECSWEEP']
             ),
            ("binary_y_n",
             BinaryFeatureRecode(
                 value_map={'true': 'Y', 'false': 'N'}, correct_noisy=False),
             ['COLLECT1', 'VETERANS', 'BIBLE', 'CATLG', 'HOMEE', 'PETS', 'CDPLAY', 'STEREO',
              'PCOWNERS', 'PHOTO', 'CRAFTS', 'FISHER', 'GARDENIN',  'BOATS', 'WALKER', 'KIDSTUFF',
              'CARDS', 'PLATES']
             ),
            ("binary_e_i",
             BinaryFeatureRecode(
                 value_map={'true': "E", 'false': 'I'}, correct_noisy=False),
             ['AGEFLAG']
             ),
            ("binary_h_u",
             BinaryFeatureRecode(
                 value_map={'true': "H", 'false': 'U'}, correct_noisy=False),
             ['HOMEOWNR']),
            ("binary_b_bl",
             BinaryFeatureRecode(
                 value_map={'true': 'B', 'false': ' '}, correct_noisy=False),
             ['MAILCODE']
             ),
            ("binary_1_0",
             BinaryFeatureRecode(
                 value_map={'true': '1', 'false': '0'}, correct_noisy=False),
             ['NOEXCH', 'HPHONE_D']
             )
        ])


In [41]:
learning_raw.MAJOR.unique()

array([' ', 'X'], dtype=object)

In [42]:
binaries = binary_transformer.fit_transform(learning_raw)
binary_names = [n[n.find('__')+2:]
                 for n in binary_transformer.get_feature_names()]
binaries = pd.DataFrame(data=binaries, columns=binary_names, index=learning_raw.index)

TypeError: unsupported format string passed to list.__format__

In [34]:
binaries.MAJOR

CONTROLN
95515     0.0
148535    0.0
15078     0.0
172556    0.0
7112      0.0
47784     0.0
62117     0.0
109359    0.0
75768     0.0
49909     0.0
106016    0.0
60127     0.0
85548     0.0
12890     0.0
134891    0.0
143689    0.0
64667     0.0
98090     0.0
35557     0.0
42556     0.0
82943     0.0
72675     0.0
190166    0.0
92152     0.0
82229     0.0
160963    0.0
89160     0.0
102610    0.0
122772    0.0
97870     0.0
         ... 
56972     0.0
22658     0.0
126131    0.0
93718     0.0
157506    0.0
31573     0.0
46748     0.0
139193    0.0
98104     0.0
23868     0.0
132458    0.0
17039     0.0
35112     0.0
104515    0.0
12322     1.0
131980    0.0
78831     0.0
29549     0.0
38061     0.0
109741    0.0
47945     0.0
84678     0.0
58178     0.0
156106    0.0
35088     0.0
184568    0.0
122706    0.0
189641    0.0
4693      0.0
185114    1.0
Name: MAJOR, Length: 95412, dtype: float64

In [21]:
learning_raw[binary_names] = binaries

In [23]:
learning_raw.RECPGVG

CONTROLN
95515     NaN
148535    NaN
15078     NaN
172556    NaN
7112      NaN
47784     NaN
62117     NaN
109359    NaN
75768     NaN
49909     NaN
106016    NaN
60127     NaN
85548     NaN
12890     NaN
134891    NaN
143689    NaN
64667     NaN
98090     NaN
35557     NaN
42556     NaN
82943     NaN
72675     NaN
190166    NaN
92152     NaN
82229     NaN
160963    NaN
89160     NaN
102610    NaN
122772    NaN
97870     NaN
         ... 
56972     NaN
22658     NaN
126131    NaN
93718     NaN
157506    NaN
31573     NaN
46748     NaN
139193    NaN
98104     NaN
23868     NaN
132458    NaN
17039     NaN
35112     NaN
104515    NaN
12322     NaN
131980    NaN
78831     NaN
29549     NaN
38061     NaN
109741    NaN
47945     NaN
84678     NaN
58178     NaN
156106    NaN
35088     NaN
184568    NaN
122706    NaN
189641    NaN
4693      NaN
185114    1.0
Name: RECPGVG, Length: 95412, dtype: float64

### Object Features

These features have mixed datatypes and are encoded as strings. This hints at noisy data and features that will have to be transformed before becoming usable.

In [48]:
objects = learning_raw.select_dtypes(include='object').columns
print(objects)

Index(['OSOURCE', 'TCODE', 'MDMAUD', 'RFA_2', 'RFA_3', 'RFA_4', 'RFA_5',
       'RFA_6', 'RFA_7', 'RFA_8', 'RFA_9', 'RFA_10', 'RFA_11', 'RFA_12',
       'RFA_13', 'RFA_14', 'RFA_15', 'RFA_16', 'RFA_17', 'RFA_18', 'RFA_19',
       'RFA_20', 'RFA_21', 'RFA_22', 'RFA_23', 'RFA_24', 'TARGET_B'],
      dtype='object')


In [49]:
learning_raw[objects].describe().transpose()

Unnamed: 0,count,unique,top,freq
OSOURCE,94484,895,MBC,4539
TCODE,95412,55,0,40917
MDMAUD,95412,28,XXXX,95118
RFA_2,95412,14,L1F,30380
RFA_3,93462,70,A1F,21950
RFA_4,93100,63,A1F,21818
RFA_5,61822,40,A1F,11027
RFA_6,91855,108,A1F,15696
RFA_7,86538,105,A1F,10954
RFA_8,91901,108,A1F,11312


In [50]:
learning_raw[objects] = learning_raw[objects].astype("category")

### Date features
Dates are parsed into datetime64 by pandas on reading the csv.

In [51]:
dates = learning_raw[dl.DATE_FEATURES]
dates.describe().transpose()

Unnamed: 0,count,unique,top,freq,first,last
ODATEDW,95412,54,1995-01-01 00:00:00,15358,1983-06-01 00:00:00,1997-01-01 00:00:00
DOB,71692,935,1948-01-01 00:00:00,1479,1901-01-01 00:00:00,1997-10-01 00:00:00
ADATE_2,95412,2,1997-06-01 00:00:00,95399,1997-04-01 00:00:00,1997-06-01 00:00:00
ADATE_3,93462,2,1996-06-01 00:00:00,93444,1996-04-01 00:00:00,1996-06-01 00:00:00
ADATE_4,93221,8,1996-04-01 00:00:00,92405,1995-11-01 00:00:00,1996-09-01 00:00:00
ADATE_5,61822,1,1996-04-01 00:00:00,61822,1996-04-01 00:00:00,1996-04-01 00:00:00
ADATE_6,91855,2,1996-03-01 00:00:00,91804,1996-01-01 00:00:00,1996-03-01 00:00:00
ADATE_7,86538,3,1996-02-01 00:00:00,81512,1995-12-01 00:00:00,1996-02-01 00:00:00
ADATE_8,91901,5,1996-01-01 00:00:00,85468,1995-11-01 00:00:00,1996-05-01 00:00:00
ADATE_9,84167,3,1995-11-01 00:00:00,80718,1995-09-01 00:00:00,1995-11-01 00:00:00


In the cleaned dataset, multibyte features were split. There are therefore more features present.

## The cleaning process put together

The steps highlighted above are conveniently wrapped in the class `Cleaner` in module `data_loader`

In [52]:
learning_clean = data_loader.clean_data

In [53]:
learning_clean.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 95412 entries, 95515 to 185114
Columns: 520 entries, ODATEDW to DOMAINSocioEconomic
dtypes: category(95), datetime64[ns](50), float64(77), int64(298)
memory usage: 318.9 MB
