
# Cleaning


This notebook contains all code for the cleaning of the KDD Cup 98 datasets.

* Splits into learning and test
* Prepares the data for model fitting

This is done with scikit-learn's transformer framework to ensure that all transformations are applied identically to the training, validation and test datasets.
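
As a minimal sketch of why this matters, the hypothetical transformer below learns which columns to drop on the training data only and then applies exactly that decision to new data. It is not part of the `kdd98` package; the class name, the threshold and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class DropSparseColumns(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: drops columns that are mostly empty in the data seen by fit()."""

    def __init__(self, min_fill=0.2):
        self.min_fill = min_fill

    def fit(self, X, y=None):
        # Share of non-missing values per column, computed on the fitting data only
        fill_rate = X.notna().mean()
        self.columns_to_drop_ = fill_rate[fill_rate <= self.min_fill].index.tolist()
        return self

    def transform(self, X):
        # Apply the decision learned in fit(), whatever X looks like
        return X.drop(columns=self.columns_to_drop_)


train = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [np.nan, np.nan, np.nan]})
test = pd.DataFrame({"a": [4.0, 5.0], "b": [6.0, 7.0]})

dropper = DropSparseColumns().fit(train)
print(dropper.transform(train).columns.tolist())
print(dropper.transform(test).columns.tolist())
```

Although `b` is fully populated in the toy test set, it is dropped there as well, because the decision was learned on the training data.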

First, the steps necessary are analysed, then the implemented cleaner is introduced.

In [1]:
%load_ext autoreload

In [2]:
%run ./common_init.ipynb

Setup logging to file: out.log
Figure output directory saved in figure_output at /home/datarian/OneDrive/unine/Master_Thesis/figures


In [3]:
%autoreload 2
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.feature_selection import VarianceThreshold
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from category_encoders import HashingEncoder, OneHotEncoder, OrdinalEncoder

# Load custom code
import kdd98.data_handler as dh
import kdd98.utils_transformer as ut
from kdd98.transformers import *
from kdd98.config import Config

Using TensorFlow backend.


In [4]:
# Where to save the figures (figure_output is set by common_init.ipynb)
import pathlib

IMAGES_PATH = pathlib.Path(figure_output/'preprocessing')

pathlib.Path(IMAGES_PATH).mkdir(parents=True, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = pathlib.Path(IMAGES_PATH, fig_id + "." + fig_extension)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

## Loading the learning dataset


In [5]:
data_provider = dh.KDD98DataProvider("cup98LRN.txt")

In [6]:
learning_raw = data_provider.raw_data

## Overview

A first, general look at the data structure:

In [None]:
learning_raw.info()

* There are 481 features (of which one is the index)
* A total of 95412 examples
* 25 categorical features, 48 numerical features with missing values, 297 integer features without missing values and 110 object (string / date) features

In [None]:
learning_raw.head()

### Numerical Features

In [None]:
numerical = learning_raw.select_dtypes(include=np.number).columns
print("There are {:1} numerical features".format(len(numerical)))

The ZIP code, which should be numerical, is missing from the list because it contains some input errors. This is evident from the fact that it was imported as an object feature:

In [None]:
learning_raw.ZIP.describe()

In [None]:
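# Fix formatting for ZIP feature: strip dashes, treat blank and '.' entries as missing.
# A nullable integer dtype is used because plain int64 cannot hold missing values.
zip_clean = learning_raw.ZIP.str.replace('-', '', regex=False).replace([' ', '.'], np.nan)
learning_raw.ZIP = pd.to_numeric(zip_clean, errors='coerce').astype('Int64')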

### Categorical Features

Some categories are already created on import of the data. Additionally, we will have to treat some special cases:

* Multibyte features. These group several related nominal features into one value and are mainly the promotion history codes: recency, frequency and amount as of a particular mailing are glued together in one feature. For RFA_2 and for MDMAUD, the major donor matrix, the supplier of the data already provides the spread-out features. These two multibyte features were dropped on import of the CSV file and only their spread-out features kept.

* OSOURCE: Identifies the origin of the data for a particular record. It has so many levels that one-hot encoding would inflate the feature space heavily. For this feature, hashing is employed.

* TCODE: Describes the title code (Ms., Hon., and so on) in an unfortunate integer coding ranging from 1e0 to 1e4. The hashing encoder is used for this feature as well.

After having the categorical features ready, missing values are assigned their own category, 'missing'. Then, all non-hashed categorical features are one-hot encoded.
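
To illustrate what spreading a multibyte feature means, here is a minimal pandas sketch. It is not the `MultiByteExtract` transformer used later in this notebook; the helper `spread_multibyte` and the toy codes are assumptions for illustration only.

```python
import pandas as pd


def spread_multibyte(series, field_names):
    """Split fixed-width codes into one column per character position."""
    spread = pd.DataFrame(index=series.index)
    for position, field in enumerate(field_names):
        chars = series.str.slice(position, position + 1)
        # Blank or absent characters are treated as missing values
        spread[f"{series.name}{field}"] = chars.mask(chars.str.strip() == "")
    return spread


# Hypothetical three-byte codes: recency, frequency, amount
toy = pd.Series(["S4E", "A1 ", None, "N2"], name="RFA_3")
spread = spread_multibyte(toy, ["R", "F", "A"])
print(spread)
```

Each byte ends up in its own column, so that it can later be treated as a nominal or ordinal feature in its own right.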

In [None]:
categories = learning_raw.select_dtypes(include='category').columns
print(categories)

In [None]:
learning_raw[categories].describe().transpose()

## Cleaning data

### Treating multibyte features

In [None]:
print(dh.NOMINAL_FEATURES)

The cup documentation states that for the MDMAUD_* features, X is used as NA code. This is fixed now:

In [None]:
learning_raw[['MDMAUD_R', 'MDMAUD_F', 'MDMAUD_A']] = learning_raw.loc[:, ['MDMAUD_R', 'MDMAUD_F', 'MDMAUD_A']].replace('X', np.nan)

In [None]:
multibyte_transformer = ColumnTransformer([
            ("spread_rfa",
             MultiByteExtract(["R", "F", "A"]),
             dh.NOMINAL_FEATURES[2:]),
             ("spread_domain",
             MultiByteExtract(["Urbanicity", "SocioEconomic"]),
             ["DOMAIN"])
        ])

In [None]:
multibytes = multibyte_transformer.fit_transform(learning_raw)
multibytes_names = [n[n.find('__')+2:]
                 for n in multibyte_transformer.get_feature_names()]

Merge learning and the new nominal features, then drop the originals

In [None]:
multibytes = pd.DataFrame(data=multibytes, columns=multibytes_names,
                   index=learning_raw.index).astype("category")
learning_raw = learning_raw.merge(multibytes, on=learning_raw.index.name)

In [None]:
# axis=1 is required to drop columns (the default would look for row labels)
learning_raw.drop(dh.NOMINAL_FEATURES[2:]+["DOMAIN"], axis=1, inplace=True)

In [None]:
for cat in learning_raw.select_dtypes(include="category").columns:
    learning_raw[cat] = learning_raw[cat].cat.remove_unused_categories()
    print("Feature: {}\n{}".format(cat, learning_raw[cat].cat.categories))

### Categorical features

#### Ordinal

Several ordinal features are present. Their levels have to be encoded in the correct order.

In [None]:
ordinal_transformer = ColumnTransformer([
            ("order_mdmaud",
             OrdinalEncoder(mapping=dh.ORDINAL_MAPPING_MDMAUD,
                            handle_unknown='ignore'),
             ['MDMAUD_R', 'MDMAUD_A']),
            ("order_rfa",
             OrdinalEncoder(mapping=dh.ORDINAL_MAPPING_RFA,
                            handle_unknown='ignore'),
                            list(learning_raw.filter(regex=r"RFA_\d{1,2}A", axis=1).columns.values)),
            ("recode_socioecon", RecodeUrbanSocioEconomic(), ["DOMAINUrbanicity", "DOMAINSocioEconomic"])
        ])

In [None]:
ordinals = ordinal_transformer.fit_transform(learning_raw)

In [None]:
ordinal_names = [n[n.find('__')+2:]
                 for n in ordinal_transformer.get_feature_names()]

In [None]:
ordinals = pd.DataFrame(data=ordinals, columns=ordinal_names,
                   index=learning_raw.index).astype("category")
learning_raw[ordinal_names] = ordinals

When the levels already sort in their natural order (numeric, i.e. 0 < 1 < 2 < 3 < ..., or alphabetical), no explicit order has to be passed in:

In [None]:
learning_raw["WEALTH1"].describe()

In [None]:
remaining_ordinals = ['WEALTH1','WEALTH2','INCOME']+learning_raw.filter(regex=r"RFA_\d{1,2}F").columns.values.tolist()

for f in learning_raw[remaining_ordinals]:
    try:
        learning_raw[f] = learning_raw[f].cat.as_ordered()
    except AttributeError:
        learning_raw[f] = learning_raw[f].astype("category").cat.as_ordered()
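
The difference between an explicit order and the default order can be sketched on toy data. The level codes and their order below are assumptions for illustration and are not taken from `dh.ORDINAL_MAPPING_MDMAUD`.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

toy_codes = pd.Series(["C", "T", "L", "M", "L"])

# Explicit order (hypothetical): L < C < M < T
explicit = toy_codes.astype(CategoricalDtype(["L", "C", "M", "T"], ordered=True))

# Default order: levels are sorted alphabetically
default = toy_codes.astype("category").cat.as_ordered()

print(list(explicit.cat.categories), explicit.cat.codes.tolist())
print(list(default.cat.categories), default.cat.codes.tolist())
```

With the default order, `C` would rank below `L`, which is why features whose levels do not sort naturally need an explicit mapping.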

### Binary features

For these, we will convert the values specified as True and False as per the dataset dictionary into 1.0 and 0.0 respectively. Furthermore, input errors are also being treated. In the end, these features will be of dtype float64, having {1.0, 0.0 and NaN} as values.

For features that either have a value representing True or are empty (as specified in the dataset dictionary), all empty cells will be considered False. For features specifically denoting True and False values, these will be coded appropriately and empty cells set to NaN.
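
The two recoding rules can be sketched as follows. This is not the `BinaryFeatureRecode` transformer from the `kdd98` package; the function `recode_binary` and the toy series are assumptions for illustration.

```python
import pandas as pd


def recode_binary(series, true_value, false_value=None):
    """Recode a binary feature to 1.0 / 0.0 / NaN."""
    if false_value is None:
        # Flag-or-empty feature: anything that is not the flag counts as False
        return (series == true_value).astype("float64")
    # True/False feature: values outside the mapping become NaN
    return series.map({true_value: 1.0, false_value: 0.0}).astype("float64")


flags = pd.Series(["X", " ", None, "X"])
yes_no = pd.Series(["Y", "N", None, "?"])

flag_recoded = recode_binary(flags, "X")
yes_no_recoded = recode_binary(yes_no, "Y", "N")
print(flag_recoded.tolist())
print(yes_no_recoded.tolist())
```

In the second case, both the empty cell and the input error `?` end up as NaN and are left for imputation.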

In [None]:
learning_raw[dh.BINARY_FEATURES].describe().transpose()

NOEXCH uses both X and 1 for True and 0 for False, which is not consistent with the documentation. It is therefore recoded to 1/0.

In [None]:
learning_raw.NOEXCH.unique()

In [None]:
# Fix binary encoding inconsistency for NOEXCH
learning_raw.NOEXCH = learning_raw.NOEXCH.str.replace("X", "1")

In [None]:
binary_transformer = ColumnTransformer([
            ("binary_x_bl",
             BinaryFeatureRecode(
                 value_map={'true': 'X', 'false': ' '}, correct_noisy=False),
             ['PEPSTRFL', 'MAJOR', 'RECINHSE',
                 'RECP3', 'RECPGVG', 'RECSWEEP']
             ),
            ("binary_y_n",
             BinaryFeatureRecode(
                 value_map={'true': 'Y', 'false': 'N'}, correct_noisy=False),
             ['COLLECT1', 'VETERANS', 'BIBLE', 'CATLG', 'HOMEE', 'PETS', 'CDPLAY', 'STEREO',
              'PCOWNERS', 'PHOTO', 'CRAFTS', 'FISHER', 'GARDENIN',  'BOATS', 'WALKER', 'KIDSTUFF',
              'CARDS', 'PLATES']
             ),
            ("binary_e_i",
             BinaryFeatureRecode(
                 value_map={'true': "E", 'false': 'I'}, correct_noisy=False),
             ['AGEFLAG']
             ),
            ("binary_h_u",
             BinaryFeatureRecode(
                 value_map={'true': "H", 'false': 'U'}, correct_noisy=False),
             ['HOMEOWNR']),
            ("binary_b_bl",
             BinaryFeatureRecode(
                 value_map={'true': 'B', 'false': ' '}, correct_noisy=False),
             ['MAILCODE']
             ),
            ("binary_1_0",
             BinaryFeatureRecode(
                 value_map={'true': '1', 'false': '0'}, correct_noisy=False),
             ['NOEXCH', 'HPHONE_D']
             )
        ])


In [None]:
learning_raw.MAJOR.unique()

In [None]:
binaries = binary_transformer.fit_transform(learning_raw)
binary_names = [n[n.find('__')+2:]
                 for n in binary_transformer.get_feature_names()]
binaries = pd.DataFrame(data=binaries, columns=binary_names, index=learning_raw.index)

In [None]:
binaries.MAJOR.describe()

In [None]:
learning_raw[binary_names] = binaries

In [None]:
learning_raw.RECPGVG

### Object Features

These features have mixed datatypes and are encoded as strings. This hints at noisy data and features that will have to be transformed before becoming usable.

In [None]:
objects = learning_raw.select_dtypes(include='object').columns
print(objects)

In [None]:
learning_raw[objects].describe().transpose()

In [None]:
learning_raw[objects] = learning_raw[objects].astype("category")

#### Dates
For dates, input errors are fixed. The dataset contains some dates of length 3, while they should be formatted YYMM. For the short dates, the leading 0 is missing.


In [None]:
dates = learning_raw[dh.DATE_FEATURES]
dates.describe().transpose()

In [None]:
# Print all three-character date values, per feature
for f in dh.DATE_FEATURES:
    s = dates.loc[:, f]
    short = s.loc[s.str.len() == 3]
    if len(short) > 0:
        print(short)
        print(sorted(short.unique()))

We only have three-digit birth dates. All other dates were correctly entered as YYMM.

Looking at these values, a missing **trailing** digit is only possible where the value ends in **0** or **1**, because the month part of YYMM can only start with one of these two digits. For the cases that have a **1** at the end, all years would then be even decades. This cannot be a coincidence, so it is more likely that the **leading 0 was forgotten** for these examples.

The only case with a **0** at the end is 410, which is assumed to be 0410 based on the observation for the **1**s.

In [None]:
def fix_format(d):
    # Restore the forgotten leading zero of three-character YYMM dates
    if not pd.isna(d):
        if len(d) == 3:
            d = '0'+d
    return d

learning_raw[dh.DATE_FEATURES] = learning_raw.loc[:, dh.DATE_FEATURES].applymap(fix_format)
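
A vectorised alternative to the element-wise fix can be sketched with `Series.str.zfill`, shown here on made-up values (the toy series is an assumption; only the value 410 is mentioned in the text above).

```python
import pandas as pd

toy_dates = pd.Series(["9506", "410", None, "2801"])

# Left-pad with zeros to a width of four characters; missing values stay missing
padded = toy_dates.str.zfill(4)
years = padded.str[:2]
months = padded.str[2:]
print(pd.DataFrame({"raw": toy_dates, "padded": padded, "YY": years, "MM": months}))
```

Unlike `fix_format`, `zfill` would also pad values shorter than three characters, so it is only equivalent when three is the only short length that occurs.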


### The cleaning process put together

The steps highlighted above are conveniently wrapped in the class `Cleaner` in module `data_provider`

In [15]:
learning_clean = data_provider.clean_data

In [None]:
learning_clean.info()

In the cleaned dataset, multibyte features were split. There are therefore more features present.

## Preprocessing

The aim of preprocessing is to:

- Remove low variance and sparse features
- Impute missing values

In order to assess low-variance features, the data should be imputed.
For imputation to work, all non-numeric data has to be converted.

### Low variance (constant) and sparse feature removal

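The following cell is meant to list the sparse features, i.e. those for which less than 20% of the values are present. These are removed before continuing. Note that the condition as written tests the share of *missing* values (`isna()`), so the recorded output below lists the features with at most 20% missing values, not the sparse ones.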

In [16]:
[c for c in learning_clean.columns if learning_clean[c].isna().sum() / len(learning_clean.index) <= 0.2]

['ODATEDW',
 'OSOURCE',
 'TCODE',
 'STATE',
 'ZIP',
 'MAILCODE',
 'RECINHSE',
 'RECP3',
 'RECPGVG',
 'RECSWEEP',
 'CLUSTER',
 'GENDER',
 'HIT',
 'MALEMILI',
 'MALEVET',
 'VIETVETS',
 'WWIIVETS',
 'LOCALGOV',
 'STATEGOV',
 'FEDGOV',
 'MAJOR',
 'PEPSTRFL',
 'POP901',
 'POP902',
 'POP903',
 'POP90C1',
 'POP90C2',
 'POP90C3',
 'POP90C4',
 'POP90C5',
 'ETH1',
 'ETH2',
 'ETH3',
 'ETH4',
 'ETH5',
 'ETH6',
 'ETH7',
 'ETH8',
 'ETH9',
 'ETH10',
 'ETH11',
 'ETH12',
 'ETH13',
 'ETH14',
 'ETH15',
 'ETH16',
 'AGE901',
 'AGE902',
 'AGE903',
 'AGE904',
 'AGE905',
 'AGE906',
 'AGE907',
 'CHIL1',
 'CHIL2',
 'CHIL3',
 'AGEC1',
 'AGEC2',
 'AGEC3',
 'AGEC4',
 'AGEC5',
 'AGEC6',
 'AGEC7',
 'CHILC1',
 'CHILC2',
 'CHILC3',
 'CHILC4',
 'CHILC5',
 'HHAGE1',
 'HHAGE2',
 'HHAGE3',
 'HHN1',
 'HHN2',
 'HHN3',
 'HHN4',
 'HHN5',
 'HHN6',
 'MARR1',
 'MARR2',
 'MARR3',
 'MARR4',
 'HHP1',
 'HHP2',
 'DW1',
 'DW2',
 'DW3',
 'DW4',
 'DW5',
 'DW6',
 'DW7',
 'DW8',
 'DW9',
 'HV1',
 'HV2',
 'HV3',
 'HV4',
 'HU1',
 'HU2',
 'HU

In [12]:
data_provider.clean_data.TARGET_B.describe()

count    0.0
mean     NaN
std      NaN
min      NaN
25%      NaN
50%      NaN
75%      NaN
max      NaN
Name: TARGET_B, dtype: float64

{'MDMAUD_F', 'TARGET_B'}

Now, on to the low-variance features:

In [17]:
numeric_features = [c for c in learning_clean.columns if learning_clean[c].dtype in ['int64', 'Int64', 'float64']]



In [18]:
lowvar_numeric = [c for c in numeric_features if learning_clean[c].var() < 1e-6]
print(lowvar_numeric)

['COLLECT1', 'VETERANS', 'BIBLE', 'CATLG', 'HOMEE', 'PETS', 'CDPLAY', 'STEREO', 'PCOWNERS', 'PHOTO', 'CRAFTS', 'FISHER', 'GARDENIN', 'BOATS', 'WALKER', 'KIDSTUFF', 'CARDS', 'PLATES']


Constant categoricals have only one level:

In [None]:
categorical_features = learning_clean.select_dtypes(include=["category", "object"])

In [None]:
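# Count the levels per feature, ignoring missing values and the string 'nan'
# (note: set('nan') would build the character set {'n', 'a'}, not {'nan'})
lowvar_categorical = [c for c in categorical_features.columns if len(set(categorical_features[c].dropna().unique()) - {'nan'}) == 1]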
print(lowvar_categorical)

In [None]:
lowvar_features = lowvar_numeric + lowvar_categorical
print(lowvar_features)
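
Both checks can be sketched on toy data. The frame below is made up for illustration, and `nunique(dropna=True)` is used here as one possible way of counting levels while ignoring missing values.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "const_num": [1.0, 1.0, np.nan, 1.0],
    "varying_num": [1, 2, 3, 4],
    "const_cat": pd.Categorical(["a", "a", None, "a"]),
    "varying_cat": pd.Categorical(["a", "b", "a", "b"]),
})

# Numeric features: (near-)zero variance, missing values are skipped by var()
numeric = toy.select_dtypes(include="number")
toy_lowvar_numeric = [c for c in numeric.columns if numeric[c].var() < 1e-6]

# Categorical features: exactly one observed level
categorical = toy.select_dtypes(include="category")
toy_lowvar_categorical = [c for c in categorical.columns
                          if categorical[c].nunique(dropna=True) == 1]

print(toy_lowvar_numeric + toy_lowvar_categorical)
```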

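In [None]:
# Sparse features: at most 20% of the values are present
sparse_features = [c for c in learning_clean.columns if learning_clean[c].count() / len(learning_clean.index) <= 0.2]
lowvar_sparse_to_remove = set(lowvar_features + sparse_features)
print(lowvar_sparse_to_remove)

In [None]:
learning_clean.drop(list(lowvar_sparse_to_remove), axis=1, inplace=True)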

In [None]:
learning_clean.info()

### Converting dates

There are several date features. ODATEDW is the date the record was added, DOB the birth date. ADATE_* and RDATE_* are from the promotion history. ADATE_* is the date of a mailing, RDATE_* the date the donation for the corresponding mailing was received. While these dates are not of particular interest (very low variance), the time it took to respond might be.
Furthermore, there are the features MINRDATE, MAXRDATE, MAXADATE, FISTDATE, NEXTDATE and LASTDATE coming from the giving history file.

In [None]:
print(dh.DATE_FEATURES)

In [None]:
learning_clean[dh.DATE_FEATURES]

Three different transformations are applied:

1. ODATEDW, DOB: Years before 1997 -> membership duration, age
2. Giving history features: Relative time in months to 1997/06/01
3. For the promotion history, as specified above, the time for response in months
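
The arithmetic behind these transformations can be sketched for dates in YYMM format. The helper `months_between` is hypothetical (it is not the `DeltaTime` or `MonthsToDonation` transformer) and assumes that both dates lie in the same century.

```python
import pandas as pd


def months_between(yymm, reference="9706"):
    """Whole months from a YYMM date to a YYMM reference (same century assumed)."""
    if pd.isna(yymm) or pd.isna(reference):
        return float("nan")
    year, month = int(yymm[:2]), int(yymm[2:])
    ref_year, ref_month = int(reference[:2]), int(reference[2:])
    return (ref_year - year) * 12 + (ref_month - month)


# 1. and 2.: time relative to the reference date 1997/06, in months and in years
months_since = months_between("9606")
years_since = months_between("9512") / 12

# 3.: response time for a mailing sent in 9601 with the donation received in 9603
response_time = months_between("9601", reference="9603")

print(months_since, years_since, response_time)
```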

There are redundant features which can be safely removed, as is shown below:

1. FISTDATE and NEXTDATE are contained in TIMELAG, the number of months between first and second donation
2. DOB, the date of birth, is contained in the feature AGE

Now, we transform the dates from the giving history. First, we create two dataframes with the sending dates of the mailings and the dates when the gift (donation) for these was received.

In [None]:
don_hist_transformer = ColumnTransformer([
    ("months_to_donation",
     MonthsToDonation(),
     dh.PROMO_HISTORY_DATES+dh.GIVING_HISTORY_DATES
     )
])

In [None]:
donation_responses = don_hist_transformer.fit_transform(learning_clean)

In [None]:
don_hist_feature_names = [n[n.find('__')+2:]
                 for n in don_hist_transformer.get_feature_names()]

In [None]:
donation_responses = pd.DataFrame(
    donation_responses, index=learning_clean.index, columns=don_hist_feature_names)

In [None]:
learning_clean = learning_clean.merge(donation_responses, on=learning_clean.index.name)
# axis=1 is required to drop columns (the default would look for row labels)
learning_clean.drop(dh.PROMO_HISTORY_DATES+dh.GIVING_HISTORY_DATES, axis=1, inplace=True)

For the remaining features, time deltas are computed relative to either a specific reference date or the date of the most recent mailing:

* Time since the last donation, the minimum and maximum donations and the most recent promotion
* Delta between first and next donation
* Age, years of membership

In [None]:
timedelta_transformer = ColumnTransformer([
    ("time_last_donation", DeltaTime(unit='months'), ['LASTDATE','MINRDATE','MAXRDATE','MAXADATE']),
    # The reference is taken from learning_clean, the dataframe used throughout this section
    ("delta_first_next", DeltaTime(reference_date=learning_clean.NEXTDATE), ['FISTDATE']),
    ("membership_years", DeltaTime(unit='years'),['ODATEDW', 'DOB'])
])

In [None]:
timedeltas = timedelta_transformer.fit_transform(learning_clean)

In [None]:
timedelta_feature_names = [n[n.find('__')+2:]
                 for n in timedelta_transformer.get_feature_names()]

In [None]:
timedeltas = pd.DataFrame(timedeltas, index=learning_clean.index,columns=timedelta_feature_names)

In [None]:
timedeltas.columns

In [None]:
learning_clean = learning_clean.merge(timedeltas, on=learning_clean.index.name)
learning_clean.drop(dh.DATE_FEATURES, axis=1,inplace=True)

Checking the redundancy of DOB <-> AGE and \[FISTDATE, NEXTDATE\] <-> TIMELAG:

In [None]:
ages = pd.DataFrame([learning_clean.AGE, timedeltas.DOB_DELTA_YEARS]).T

In [None]:
ages.loc[ages.AGE != ages.DOB_DELTA_YEARS,:].dropna()

In [None]:
lags = pd.DataFrame([learning_clean.TIMELAG, timedeltas.FISTDATE_NEXTDATE_DELTA_MONTHS]).T

In [None]:
lags.loc[lags.TIMELAG != lags.FISTDATE_NEXTDATE_DELTA_MONTHS,:].dropna()

The transformed feature DOB is already represented by the feature AGE, so DOB_DELTA_YEARS can be dropped. TIMELAG already holds the difference in months between FISTDATE and NEXTDATE, so this delta can also be safely removed together with the original features.

In [None]:
learning_clean.drop(['DOB_DELTA_YEARS', 'FISTDATE_NEXTDATE_DELTA_MONTHS'], axis=1,inplace=True)

### Preprocessing put together
Again, the operations shown above are bundled together in `data_provider.Cleaner.preprocess()`.

In [None]:
learning_preprocessed = data_provider.preprocessed_data

In [None]:
learning_preprocessed.info()

## Feature engineering

### Imputation of missing values

In [None]:
import missingno as msno
msno.matrix(learning_preprocessed)
save_fig("missing_matrix", tight_layout=False)

In [None]:
msno.matrix(learning_preprocessed.drop(dh.US_CENSUS_FEATURES, axis=1))
save_fig("missing_matrix_no_census", tight_layout=False)

#### Categoricals

The imputation methods used below work on numerical data and cannot be applied directly to nominal features. The nominal features are therefore imputed first, using the mode of each feature.

In [None]:
categoricals = learning_preprocessed.select_dtypes(include="category")
categorical_features = categoricals.columns.values.tolist()

In [None]:
imputed_categoricals = categoricals.fillna(categoricals.mode().iloc[0])

In [None]:
for c in imputed_categoricals[[c for c in imputed_categoricals.columns if imputed_categoricals[c].cat.categories.dtype == 'object']]:
    print("{} has {} levels:\n{}".format(c,len(categoricals[c].cat.categories),categoricals[c].cat.categories))
    print("Number of missing values left: {}".format(imputed_categoricals[c].isna().sum()))

In [None]:
learning_preprocessed[imputed_categoricals.columns] = imputed_categoricals
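
To keep the imputation identical across datasets, the modes can be learned on the learning data and reused elsewhere. A minimal sketch on made-up data (the two frames and their values are assumptions for illustration):

```python
import pandas as pd

train = pd.DataFrame({"GENDER": ["F", "F", "M", None],
                      "STATE": ["CA", None, "CA", "TX"]})
test = pd.DataFrame({"GENDER": [None, "M"],
                     "STATE": [None, None]})

# Most frequent level per feature, learned on the training data only
modes = train.mode().iloc[0]

# The same fill values are applied to any other dataset
test_imputed = test.fillna(modes)
print(test_imputed)
```

The test set contains no observed value for `STATE` at all, yet it is filled consistently because the fill value comes from the training data.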

In [None]:
learning_numerical = data_provider.numerical_data

#### missingpy

In [None]:
from missingpy import KNNImputer
imputer = KNNImputer(n_neighbors=3, weights="distance",)

In [None]:
learning_numerical = learning_preprocessed.loc[:,learning_preprocessed.select_dtypes("number").columns.values.tolist()]
sparse_features = [c for c in learning_numerical.columns if learning_numerical[c].count() / len(learning_numerical.index) <= 0.2]
print(sparse_features)

In [None]:
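# Reassign instead of dropping in place, since learning_numerical is a slice of learning_preprocessed
learning_numerical = learning_numerical.drop(sparse_features, axis=1)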

In [None]:
imputed = imputer.fit_transform(learning_numerical.values)
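
The idea behind nearest-neighbour imputation can be sketched on a tiny matrix. The example uses scikit-learn's own `sklearn.impute.KNNImputer` (available in scikit-learn 0.22 and later) instead of `missingpy`, and the toy values are assumptions chosen so that the result is easy to follow.

```python
import numpy as np
from sklearn.impute import KNNImputer

toy = np.array([
    [1.0, 2.0],
    [2.0, np.nan],  # value to impute
    [3.0, 6.0],
    [4.0, 8.0],
])

# Two nearest neighbours, weighted by inverse distance
toy_imputer = KNNImputer(n_neighbors=2, weights="distance")
toy_imputed = toy_imputer.fit_transform(toy)
print(toy_imputed)
```

Here the two nearest rows are equally distant from the incomplete row, so the imputed value is the plain average of their second-column values, 2 and 6. Observed values are left unchanged.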

#### fancyimpute

In [None]:
from fancyimpute import IterativeImputer

In [None]:
learning_numerical.info()

In [None]:
imputed = IterativeImputer(n_iter=5,initial_strategy="median", random_state=Config.get("random_seed"),verbose=1).fit_transform(learning_numerical)

In [None]:
imputed = pd.DataFrame(data=imputed, columns = learning_numerical.columns, index=learning_numerical.index)

In [None]:
imputed.isna().sum().sum()

### Nominal features

Now, the nominal features (categoricals with string levels) are handled. Those with high cardinality (many levels) are hashed so as not to increase dimensionality too much. The remaining features are one-hot encoded.

For background on encoding categorical data, see: https://towardsdatascience.com/smarter-ways-to-encode-categorical-data-for-machine-learning-part-1-of-3-6dca2f71b159
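
The hashing idea can be sketched with the standard library. This is a simplified illustration and not the `HashingEncoder` from `category_encoders`; the helpers `hash_bucket` and `hash_encode`, the number of buckets and the toy levels are all assumptions.

```python
import hashlib

import pandas as pd


def hash_bucket(level, n_buckets=8):
    """Map a category level to one of n_buckets via a stable hash."""
    digest = hashlib.md5(str(level).encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


def hash_encode(series, n_buckets=8):
    """Return an indicator matrix with a fixed number of columns."""
    columns = [f"{series.name}_{i}" for i in range(n_buckets)]
    encoded = pd.DataFrame(0, index=series.index, columns=columns)
    for row, level in series.items():
        encoded.loc[row, f"{series.name}_{hash_bucket(level, n_buckets)}"] = 1
    return encoded


toy = pd.Series(["AAA", "BBB", "CCC", "AAA", "DDD"], name="OSOURCE")
encoded = hash_encode(toy, n_buckets=4)
print(encoded)
```

The number of columns is fixed by `n_buckets` regardless of how many levels occur, at the price of possible collisions between levels. Identical levels always land in the same column.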

### Removing low-variance features
Code: https://stackoverflow.com/questions/29298973/removing-features-with-low-variance-scikit-learn

In [None]:
cat_and_obj = learning_clean.select_dtypes(include=["category", "object"]).columns.values.tolist()
get_low_variance_columns(learning_clean, skip_columns=cat_and_obj)
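
For comparison, the same kind of selection can be sketched with scikit-learn's `VarianceThreshold`, which is imported at the top of this notebook. This is not the `get_low_variance_columns` helper; the toy frame and the threshold of 0.2 are assumptions for illustration.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

toy = pd.DataFrame({
    "constant": [0, 0, 0, 0],
    "almost_constant": [1, 1, 1, 2],
    "varying": [1, 2, 3, 4],
})

selector = VarianceThreshold(threshold=0.2)
selector.fit(toy)

# get_support() returns a boolean mask of the features that are kept
kept = toy.columns[selector.get_support()].tolist()
dropped = toy.columns[~selector.get_support()].tolist()
print(kept, dropped)
```

Note that `VarianceThreshold` works with the population variance, whereas `pandas.Series.var()` computes the sample variance by default, so thresholds are not directly interchangeable between the two.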

### Feature engineering combined

In [7]:
learning_imputed = data_provider.numerical_data

[IterativeImputer] Completing matrix with shape (95412, 421)
[IterativeImputer] Ending imputation round 1/5, elapsed time 243.10
[IterativeImputer] Ending imputation round 2/5, elapsed time 485.35
[IterativeImputer] Ending imputation round 3/5, elapsed time 728.04
[IterativeImputer] Ending imputation round 5/5, elapsed time 1213.38


In [13]:
learning_imputed.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 95412 entries, 95515 to 185114
Columns: 658 entries, MAILCODE to DOMAINUrbanicity_C
dtypes: float64(420), int64(238)
memory usage: 482.2 MB


In [15]:
learning_imputed.isna().sum().sum()

0