
# Preprocessing
This notebook contains all code for the preprocessing of the KDD Cup 98 datasets.
* Splits into learning and test
* Learns the transformation pipeline on the learning dataset for future use
* Prepares the data for model fitting

This will be done with scikit-learn's transformer framework in order to ensure that all transformations are applied identically to the training, test and validation datasets.


The transformations are 'learned' on the training dataset and then applied to the test dataset and new data later on.
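
The fit-on-training, apply-to-test principle can be illustrated with a minimal sketch. The toy arrays below are made up for illustration and are not part of the KDD Cup 98 data; `SimpleImputer` merely stands in for any transformer.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy data (hypothetical): one feature with a missing value in each split
X_train = np.array([[1.0], [3.0], [np.nan], [5.0]])
X_test = np.array([[np.nan], [10.0]])

imputer = SimpleImputer(strategy="median")
imputer.fit(X_train)                        # the median is learned on the training data only
X_test_imputed = imputer.transform(X_test)  # the learned median is reused on the test data
```

The missing test value is filled with the training median, so no information from the test set leaks into the learned transformation.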

In [1]:
%load_ext autoreload

In [2]:
%run ./common_init.ipynb

Setup logging to file: out.log
Figure output directory saved in figure_output at /home/datarian/OneDrive/unine/Master_Thesis/figures


In [3]:
%autoreload 2
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.feature_selection import VarianceThreshold
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from category_encoders import HashingEncoder, OneHotEncoder, OrdinalEncoder

# Load custom code
import kdd98.data_loader as dl
import kdd98.utils_transformer as ut
from kdd98.transformers import *
from kdd98.config import App

In [4]:
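import pathlib

# Where to save the figures (figure_output is defined in common_init.ipynb)
IMAGES_PATH = pathlib.Path(figure_output) / 'preprocessing'
IMAGES_PATH.mkdir(parents=True, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    """Save the current matplotlib figure as IMAGES_PATH/<fig_id>.<fig_extension>."""
    # Build the file name first: a Path cannot be concatenated with a str using "+"
    path = IMAGES_PATH / f"{fig_id}.{fig_extension}"
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)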

## Loading the learning dataset


Set working directory to main code folder

In [5]:
data_loader = dl.KDD98DataLoader("cup98LRN.txt")
learning_raw = data_loader.get_dataset()

## Overview

A first, general look at the data structure:

In [6]:
learning_raw.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 95412 entries, 95515 to 185114
Columns: 480 entries, ODATEDW to GEOCODE2
dtypes: category(25), datetime64[ns](53), float64(48), int64(298), object(56)
memory usage: 334.2+ MB


* There are 480 columns plus the index (CONTROLN)
* A total of 95412 examples
* 25 categorical features, 53 datetime features, 48 float features (numerical features with missing values), 298 integer features (without missing values) and 56 string (object) features

In [7]:
learning_raw.head()

CONTROLN,ODATEDW,OSOURCE,TCODE,STATE,ZIP,MAILCODE,PVASTATE,DOB,NOEXCH,RECINHSE,...,TARGET_D,HPHONE_D,RFA_2R,RFA_2F,RFA_2A,MDMAUD_R,MDMAUD_F,MDMAUD_A,CLUSTER2,GEOCODE2
95515,1989-01-01,GRI,0,IL,61081,,,1937-12-01,0,,...,0,0,L,4,E,,,,39.0,C
148535,1994-01-01,BOA,1,CA,91326,,,1952-02-01,0,,...,0,0,L,2,G,,,,1.0,A
15078,1990-01-01,AMH,1,NC,27017,,,NaT,0,,...,0,1,L,4,E,,,,60.0,C
172556,1987-01-01,BRY,0,CA,95953,,,1928-01-01,0,,...,0,1,L,4,E,,,,41.0,C
7112,1986-01-01,,0,FL,33176,,,1920-01-01,0,X,...,0,1,L,2,F,,,,26.0,A


### Numerical Features

In [8]:
numerical = learning_raw.select_dtypes(include=np.number).columns
print("There are {:1} numerical features".format(len(numerical)))

There are 346 numerical features


### Categorical Features

Categories were defined on import of the csv data. The categories were identified in the dataset dictionary.

In [9]:
categories = learning_raw.select_dtypes(include='category').columns
print(categories)

Index(['STATE', 'PVASTATE', 'DOMAIN', 'CLUSTER', 'CHILD03', 'CHILD07',
       'CHILD12', 'CHILD18', 'INCOME', 'GENDER', 'WEALTH1', 'DATASRCE',
       'SOLP3', 'SOLIH', 'WEALTH2', 'GEOCODE', 'LIFESRC', 'TARGET_D', 'RFA_2R',
       'RFA_2F', 'RFA_2A', 'MDMAUD_R', 'MDMAUD_F', 'MDMAUD_A', 'GEOCODE2'],
      dtype='object')
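
How categories can be declared at import time may be sketched as follows. This is not the code of `kdd98.data_loader`; the CSV text and the list `categorical_cols` are hypothetical stand-ins for the real file and the data dictionary.

```python
import io
import pandas as pd

# Hypothetical miniature of the raw file
csv_text = "CONTROLN,STATE,GENDER,AGE\n1,IL,F,60\n2,CA,M,45\n3,CA,,71\n"

# Hypothetical subset of the categorical columns named in the data dictionary
categorical_cols = ["STATE", "GENDER"]

toy = pd.read_csv(
    io.StringIO(csv_text),
    index_col="CONTROLN",
    dtype={col: "category" for col in categorical_cols},  # declare categories on import
)
toy_categories = toy.select_dtypes(include="category").columns
```

Empty fields in a categorical column are read as missing values rather than as a category of their own.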


In [11]:
learning_raw[categories].describe().transpose()

Feature,count,unique,top,freq
STATE,95412,57,CA,17343
PVASTATE,1458,2,P,1453
DOMAIN,93096,16,R2,13623
CLUSTER,93096,53,40,3979
CHILD03,1146,3,M,869
CHILD07,1566,3,M,1061
CHILD12,1811,3,M,1149
CHILD18,2847,3,M,1442
INCOME,74126,7,5,15451
GENDER,92455,6,F,51277


### Binary Features

In [12]:
print(dl.binary_features)

['MAILCODE', 'NOEXCH', 'RECSWEEP', 'RECINHSE', 'RECP3', 'RECPGVG', 'AGEFLAG', 'HOMEOWNR', 'MAJOR', 'COLLECT1', 'BIBLE', 'CATLG', 'HOMEE', 'PETS', 'CDPLAY', 'STEREO', 'PCOWNERS', 'PHOTO', 'CRAFTS', 'FISHER', 'GARDENIN', 'BOATS', 'WALKER', 'KIDSTUFF', 'CARDS', 'PLATES', 'PEPSTRFL', 'TARGET_B', 'HPHONE_D', 'VETERANS']


Several binary features have only one value set. These are features where a blank represents false.

In [None]:
learning_raw[dl.binary_features].describe().transpose()
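
A minimal sketch of recoding such a flag to 1 / 0, assuming the blank is stored as a single space. The helper `recode_binary` is hypothetical and only illustrates the idea; it is not the `BinaryFeatureRecode` transformer from `kdd98.transformers`.

```python
import pandas as pd

def recode_binary(series, true_value, false_value):
    """Map true_value to 1 and false_value to 0; any other value becomes missing."""
    mapping = {true_value: 1, false_value: 0}
    return series.map(mapping).astype("float")

# Hypothetical values for a flag where 'X' means true and a blank means false
flags = pd.Series(["X", " ", "X", " "], name="RECINHSE")
recoded = recode_binary(flags, true_value="X", false_value=" ")
```

Values outside the mapping end up as NaN, which keeps noisy entries visible instead of silently treating them as false.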

### Object Features

These features have mixed datatypes and are encoded as strings. This hints at noisy data and features that will have to be transformed before becoming usable.

In [None]:
objects = learning_raw.select_dtypes(include='object').columns
print(objects)

In [None]:
learning_raw[objects].describe().transpose()

### Date features
Dates are parsed into datetime64 by pandas on reading the csv.

In [None]:
dates = learning_raw[dl.date_features]
dates.describe().transpose()
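
Once parsed, datetime columns lend themselves to simple date arithmetic. The sketch below computes whole calendar months up to a reference date; both the helper `months_between` and the reference date are assumptions made for illustration and do not reproduce the custom transformers used in this notebook.

```python
import pandas as pd

def months_between(dates, reference):
    """Whole calendar months from each date to the reference date (NaN for NaT)."""
    return (reference.year - dates.dt.year) * 12 + (reference.month - dates.dt.month)

dates = pd.Series(pd.to_datetime(["1989-01-01", "1994-01-01", None]))
reference = pd.Timestamp("1997-06-01")  # hypothetical reference date
months = months_between(dates, reference)
```

Missing dates (NaT) propagate as NaN, so they can be handled by the imputation step like any other missing value.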

## Splitting into training- and test dataset

Before applying *any* transformations, the dataset will be split 80/20 into a learning and test set.

Let's look at feature TARGET_B, which describes whether a person has donated or not:

In [13]:
learning_raw.TARGET_B.value_counts(normalize=True) # 5 % of recipients have donated.

0    0.949241
1    0.050759
Name: TARGET_B, dtype: float64

We want to preserve this ratio in the split datasets. scikit-learn provides a method for achieving this.

In [14]:
seed = App.config("random_seed")
from sklearn.model_selection import StratifiedShuffleSplit

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, train_size=0.8, random_state=seed)
for learn_index, test_index in splitter.split(learning_raw, learning_raw.TARGET_B.astype('int')):
    l_i = learn_index
    t_i = test_index
    kdd_learn = learning_raw.iloc[learn_index]
    kdd_test = learning_raw.iloc[test_index]

Now, check that the two sets are really disjoint

In [15]:
set(kdd_learn.index).intersection(kdd_test.index)

set()

Check the frequencies of the donors in the sets:

In [16]:
kdd_learn['TARGET_B'].value_counts(normalize=True)

0    0.949246
1    0.050754
Name: TARGET_B, dtype: float64

In [17]:
kdd_test['TARGET_B'].value_counts(normalize=True)

0    0.949222
1    0.050778
Name: TARGET_B, dtype: float64

## Separating features and label

First, we separate the features from the labels. Both target columns are removed from the feature set: TARGET_B, the indicator variable for donors, and TARGET_D, the donation amount. They are kept together in kdd_learn_labels.

**All preprocessing is performed on *kdd_learn_feat***

In [18]:
kdd_learn_feat = kdd_learn.drop(['TARGET_B', 'TARGET_D'],axis=1).copy()
kdd_learn_labels = kdd_learn[['TARGET_B','TARGET_D']].copy()

## Preprocessing pipeline

The preprocessing pipeline results in a dataset with numerical (with binary features encoded correctly), categorical and string date features.

Following this step, feature extraction, imputation, dropping of constant and sparse features and ensuring all data is numerical can be tackled.

### Automating cleaning and feature extraction

The function below allows for a one-step cleaning and feature extraction, returning a pandas dataframe with all features correctly labelled.

As has been seen in EDA, some features are redundant. These will not be processed by the pipeline here.

https://medium.com/bigdatarepublic/integrating-pandas-and-scikit-learn-with-pipelines-f70eb6183696
https://ramhiser.com/post/2018-04-16-building-scikit-learn-pipeline-with-pandas-dataframe/

In [33]:
from sklearn.pipeline import FeatureUnion

In [34]:
preproc_pipe = FeatureUnion(n_jobs=1, transformer_list= [
    ("binary_features",
     # Binary features need recoding. All of them will be True / False afterwards, encoded as 1 / 0
     ColumnTransformer([
        ("binary_x_bl",
         BinaryFeatureRecode(value_map={'true': 'X', 'false': ' '}, correct_noisy=False),
         ['PEPSTRFL', 'NOEXCH', 'MAJOR', 'RECINHSE', 'RECP3', 'RECPGVG', 'RECSWEEP']
         ),
        ("binary_y_n",
         BinaryFeatureRecode(value_map={'true': 'Y', 'false': 'N'}, correct_noisy=False),
         ['COLLECT1', 'VETERANS', 'BIBLE', 'CATLG', 'HOMEE', 'PETS', 'CDPLAY', 'STEREO',
          'PCOWNERS', 'PHOTO', 'CRAFTS', 'FISHER', 'GARDENIN',  'BOATS', 'WALKER', 'KIDSTUFF',
          'CARDS', 'PLATES']
         ),
        ("binary_e_i",
         BinaryFeatureRecode(value_map={'true': "E", 'false': 'I'}, correct_noisy=False),
         ['AGEFLAG']
         ),
        ("binary_h_u",
         BinaryFeatureRecode(value_map={'true': "H", 'false': 'U'}, correct_noisy=False),
         ['HOMEOWNR']),
        ("binary_b_bl",
         BinaryFeatureRecode(value_map={'true': 'B', 'false': ' '}, correct_noisy=False),
         ['MAILCODE']
         ),
        ("binary_1_0",
         BinaryFeatureRecode(value_map={'true': '1', 'false': '0'}, correct_noisy=False),
         ['HPHONE_D']
         )
        ])
    ),
    ("date_features",
     # Date features are converted to time deltas.
     ColumnTransformer([
        ("months_to_donation", MonthsToDonation(), dl.promo_history_dates+dl.giving_history_dates),
         ("time_last_donation", DeltaTime(unit='months'), ['LASTDATE','MINRDATE','MAXRDATE','MAXADATE']),
        ("membership_years", DeltaTime(unit='years'),['ODATEDW'])
        ])
    ),
    ("osource",
      ColumnTransformer([("hash_osource", HashingEncoder(), ['OSOURCE'])])
    ),
    ("tcode",
      ColumnTransformer([("hash_tcode", HashingEncoder(), ['TCODE'])])
    ),
    ("zip",
      ColumnTransformer([("hash_zip", HashingEncoder(), ['ZIP'])])
    ),
    ("rfa",
      Pipeline([
        # Recency / Frequency / Amount features are spread out into individual features, then ordinally encoded
        ("spread_rfa", ColumnTransformer([('spread', MultiByteExtract(["R", "F", "A"]), dl.nominal_features[2:])])),
        ("order_multibytes", OrdinalEncoder(mapping=dl.ordinal_mapping_rfa,handle_unknown='ignore'))
      ])
    ),
    ("domain",
     Pipeline([
         # The domain feature holds a code for urbanicity and socio economic status of an area. It is split into two
         # and then the socio economic status is recoded to an ordinal feature
         ("spread_domain", ColumnTransformer([("spread",MultiByteExtract(["Urbanicity", "SocioEconomic"]),["DOMAIN"])])),
         ("recode_socioecon", RecodeUrbanSocioEconomic())
     ])
    ),
    ("mdmaud",
     ColumnTransformer([
         ("mdmaud",
         OrdinalEncoder(mapping=dl.ordinal_mapping_mdmaud,handle_unknown='ignore'),
         ['MDMAUD_R','MDMAUD_A'])
     ],
     remainder='passthrough')  # remainder is an argument of ColumnTransformer, not of the tuple
    )
])

In [35]:
preproc_pipe.fit(kdd_learn_feat)

KeyError: 'RFA_3A'

In [None]:
res_1 = preproc_pipe.transform(kdd_learn_feat)

In [None]:
preproc_pipe.fit(kdd_learn_feat)

In [None]:
res_2 = preproc_pipe.transform(kdd_learn_feat)

In [None]:
def do_cleaning_preprocessing(dataset, fit=False):
    
    # Cleaning stage
    
    # Binary features
    binary_transformers = ColumnTransformer([
        ("binary_x_bl",
         BinaryFeatureRecode(value_map={'true': 'X', 'false': ' '}, correct_noisy=False),
         ['PEPSTRFL', 'NOEXCH', 'MAJOR', 'RECINHSE', 'RECP3', 'RECPGVG', 'RECSWEEP']
         ),
        ("binary_y_n",
         BinaryFeatureRecode(value_map={'true': 'Y', 'false': 'N'}, correct_noisy=False),
         ['COLLECT1', 'VETERANS', 'BIBLE', 'CATLG', 'HOMEE', 'PETS', 'CDPLAY', 'STEREO',
          'PCOWNERS', 'PHOTO', 'CRAFTS', 'FISHER', 'GARDENIN',  'BOATS', 'WALKER', 'KIDSTUFF',
          'CARDS', 'PLATES']
         ),
        ("binary_e_i",
         BinaryFeatureRecode(value_map={'true': "E", 'false': 'I'}, correct_noisy=False),
         ['AGEFLAG']
         ),
        ("binary_h_u",
         BinaryFeatureRecode(value_map={'true': "H", 'false': 'U'}, correct_noisy=False),
         ['HOMEOWNR']),
        ("binary_b_bl",
         BinaryFeatureRecode(value_map={'true': 'B', 'false': ' '}, correct_noisy=False),
         ['MAILCODE']
         ),
        ("binary_1_0",
         BinaryFeatureRecode(value_map={'true': '1', 'false': '0'}, correct_noisy=False),
         ['HPHONE_D']
         )
    ])
    
    
    # Dates
    don_hist_transformer = ColumnTransformer([
        ("months_to_donation",
         MonthsToDonation(),
         dl.promo_history_dates+dl.giving_history_dates
         )
    ])
    
    
    timedelta_transformer = ColumnTransformer([
        ("time_last_donation", DeltaTime(unit='months'), ['LASTDATE','MINRDATE','MAXRDATE','MAXADATE']),
        ("membership_years", DeltaTime(unit='years'),['ODATEDW'])
    ])
    
    
    # Categorical Features
    
    # Nominals
    osource_transformer = ColumnTransformer([
        ("hash_osource", HashingEncoder(), ['OSOURCE'])
    ])
    
    tcode_transformer = ColumnTransformer([
        ("hash_tcode", HashingEncoder(), ['TCODE'])
    ])
    
    # Ordinals
    multibyte_transformer = ColumnTransformer([
        ("spread",
         MultiByteExtract(["R", "F", "A"]),
         dl.nominal_features[2:])
    ])
    
    domain_transformer = Pipeline([
        ("spread", ColumnTransformer([
            ("spread_domain",
            MultiByteExtract(["Urbanicity", "SocioEconomic"]),
            ["DOMAIN"])
        ])),
        ("recode_socioecon", RecodeUrbanSocioEconomic())
    ])

    
    # Remaining ordinals
    ordinal_transformer = ColumnTransformer([
        ("order_ordinals",
        OrdinalEncoder(mapping=dl.ordinal_mapping_mdmaud,handle_unknown='ignore'),
        ['MDMAUD_R','MDMAUD_A']),
        ("order_multibytes",
        OrdinalEncoder(mapping=dl.ordinal_mapping_rfa,handle_unknown='ignore'),
         list(dataset.filter(like="RFA_",axis=1).columns))
    ])
    
    # Transforming the data (possibly fitting first) and rebuilding the pandas dataframe
    
    binarys = binary_transformers.fit_transform(dataset)
    dataset = ut.update_df_with_transformed(dataset, binarys, binary_transformers)
    donation_responses = don_hist_transformer.fit_transform(dataset)
    dataset = ut.update_df_with_transformed(dataset, donation_responses, don_hist_transformer)
    timedeltas = timedelta_transformer.fit_transform(dataset)
    dataset = ut.update_df_with_transformed(dataset, timedeltas, timedelta_transformer, drop=dl.date_features)
    osource = osource_transformer.fit_transform(dataset)
    dataset = ut.update_df_with_transformed(dataset, osource, osource_transformer)
    tcode = tcode_transformer.fit_transform(dataset)
    dataset = ut.update_df_with_transformed(dataset, tcode, tcode_transformer)
    multibytes = multibyte_transformer.fit_transform(dataset)
    dataset = ut.update_df_with_transformed(dataset, multibytes, multibyte_transformer, drop=dl.nominal_features, new_dtype="category")
    domains = domain_transformer.fit_transform(dataset)
    dataset = ut.update_df_with_transformed(dataset, domains, domain_transformer)
    ordinals = ordinal_transformer.fit_transform(dataset)
    dataset = ut.update_df_with_transformed(dataset, ordinals, ordinal_transformer)
    
    return dataset

In [None]:
dataset = do_cleaning_preprocessing(kdd_learn_feat)

The next step is to one-hot encode categorical features. This results in an all-numeric dataframe.

In [None]:
dataset2 = pd.get_dummies(dataset)

In [None]:
dataset2.info()

## Imputation of missing values

https://github.com/epsilon-machine/missingpy

Olga Troyanskaya, Michael Cantor, Gavin Sherlock, Pat Brown, Trevor Hastie, Robert Tibshirani, David Botstein and Russ B. Altman, Missing value estimation methods for DNA microarrays, BIOINFORMATICS Vol. 17 no. 6, 2001 Pages 520-525

This step requires that we first drop features with more than 80% missing values for the KNNImputer to work.

Best results with k=3: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4959387/

In [None]:
[c for c in dataset2.columns if dataset2[c].count() / len(dataset2.index) <= 0.2]

In [None]:
dataset2.drop([c for c in dataset2.columns if dataset2[c].count() / len(dataset2.index) <= 0.2],axis=1,inplace=True)

We set weights to distance so that binary and categorical features get an integer value:
https://www.queryxchange.com/q/27_52658127/imputing-missing-values-with-knn/

In [None]:
from missingpy import KNNImputer
imputer = KNNImputer(n_neighbors=3, weights="distance")
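# fit_transform returns a NumPy array; rebuild the DataFrame so that column names stay available
kdd_learn_feat_imputed = pd.DataFrame(imputer.fit_transform(dataset2),
                                      columns=dataset2.columns, index=dataset2.index)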
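
To illustrate distance weighting, here is a small sketch that uses scikit-learn's `KNNImputer` as a stand-in for the missingpy implementation (an assumption; the two may differ in detail). The toy matrix is made up. Imputed values are distance-weighted averages of the neighbours' values, so for binary columns an explicit rounding step may still be needed.

```python
import numpy as np
from sklearn.impute import KNNImputer  # stand-in for missingpy's KNNImputer

# Toy data (hypothetical): column 0 is numerical, column 1 is a binary flag
X = np.array([
    [1.0, 0.0],
    [1.1, 0.0],
    [5.0, 1.0],
    [1.05, np.nan],  # the flag is missing for this example
])

imputer = KNNImputer(n_neighbors=3, weights="distance")
X_imputed = imputer.fit_transform(X)

flag = X_imputed[3, 1]           # weighted average of the neighbours' flags
flag_rounded = int(round(flag))  # explicit rounding restores a valid binary value
```

The two close neighbours dominate the weighted average, so the imputed flag lies near 0 without being exactly 0.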

## Removing constant features

As per the documentation, features with either low variance or very few non-NA examples are to be dropped.

In [None]:
[c for c in kdd_learn_feat_imputed.columns if kdd_learn_feat_imputed[c].var() <= 1e-5]

In [None]:
def get_low_variance_cols(df=None, cols=None,
                             skip_cols=[], thresh=1e-5,
                             autoremove=False):
    """
    Wrapper for sklearn VarianceThreshold for use on pandas dataframes.
    """
    try:
        # get list of all the original df cols
        all_cols = df.select_dtypes(include="number").columns

        # remove `skip_cols`
        remaining_cols = all_cols.drop(skip_cols)

        # get length of new index
        max_index = len(remaining_cols) - 1

        # get indices for `skip_cols`
        skipped_idx = [all_cols.get_loc(column)
                       for column
                       in skip_cols]

        # adjust insert location by the number of cols removed
        # (for non-zero insertion locations) to keep relative
        # locations intact
        for idx, item in enumerate(skipped_idx):
            if item > max_index:
                diff = item - max_index
                skipped_idx[idx] -= diff
            if item == max_index:
                diff = item - len(skip_cols)
                skipped_idx[idx] -= diff
            if idx == 0:
                skipped_idx[idx] = item

        # get values of `skip_cols`
        skipped_values = df.iloc[:, skipped_idx].values

        # get dataframe values
        X = df.loc[:, remaining_cols].values

        # instantiate VarianceThreshold object
        vt = VarianceThreshold(threshold=thresh)

        # fit vt to data
        vt.fit(X)

        # get the indices of the features that are being kept
        feature_indices = vt.get_support(indices=True)

        # remove low-variance cols from index
        feature_names = [remaining_cols[idx]
                         for idx, _
                         in enumerate(remaining_cols)
                         if idx
                         in feature_indices]

        # get the cols to be removed
        removed_features = list(np.setdiff1d(remaining_cols,
                                             feature_names))
        print("Found {0} low-variance cols."
              .format(len(removed_features)))

        # remove the cols
        if autoremove:
            print("Removing low-variance features.")
            # remove the low-variance cols
            X_removed = vt.transform(X)

            print("Reassembling the dataframe (with low-variance "
                  "features removed).")
            # re-assemble the dataframe
            df = pd.DataFrame(data=X_removed,
                              columns=feature_names,
                              index=df.index)

            # add back the `skip_cols`
            for idx, index in enumerate(skipped_idx):
                df.insert(loc=index,
                              column=skip_cols[idx],
                              value=skipped_values[:, idx])
            print("Successfully removed low-variance cols.")

        # do not remove cols
        else:
            print("No changes have been made to the dataframe.")

    except Exception as e:
        print(e)
        print("Could not remove low-variance features. Something "
              "went wrong.")
        # make sure the return statement below does not fail
        removed_features = []

    return df, removed_features

In [None]:
df, removed = get_low_variance_cols(kdd_learn_feat_imputed)

## Exploring strategies for specific feature types

* Noisy data: Correction of data entry / formatting errors
    - These errors must be corrected without excluding the records in question
* Missing data: Has to be inferred from known values
    - (e.g., mean, median, mode, a modeled value).
    - One exception to this rule is attributes with 99.5 percent or more missing values. These are to be dropped
* Sparse data: Features whose observed events cover only a very small subset of the event space are to be dropped
* Constant values are to be dropped
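
The rule for attributes with 99.5 percent or more missing values can be sketched as below. The helper `mostly_missing_columns` and the toy frame are hypothetical and only illustrate the threshold logic.

```python
import numpy as np
import pandas as pd

def mostly_missing_columns(df, threshold=0.995):
    """Return the names of columns whose share of missing values is >= threshold."""
    missing_share = df.isna().mean()
    return list(missing_share[missing_share >= threshold].index)

# Toy data (hypothetical): one column is 99.9 % missing, the other is complete
toy = pd.DataFrame({
    "almost_empty": [np.nan] * 999 + [1.0],
    "complete": np.arange(1000, dtype=float),
})

to_drop = mostly_missing_columns(toy)
reduced = toy.drop(columns=to_drop)
```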

### Constant and Sparse Features

Features where only one value is present and those where the majority is empty are to be dropped.


In [None]:
# Raw strings keep the regex escapes (\d) from being interpreted as string escapes
const_sparse_transformer = DropSparseLowVar(keep_anyways=[r"RAMNT_\d{1,2}", r"MONTHS_TO_DONATION_\d{1,2}"])
cs = const_sparse_transformer.fit(learning)
cs = const_sparse_transformer.fit_transform(learning)
set(cs.columns)
const_sparse_transformer.get_feature_names()

### Numerical features

In [None]:
numerical_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median"))
])

### Categorical features

### Remaining object features

In [None]:
objects = learning_raw.select_dtypes(include='object').columns
print(objects)

In [None]:
for f in objects:
    # Format the array as text; "+" between a str and an array fails on missing values
    print(f"{f}: {learning_raw[f].unique()}")

These fall into two types:

* ZIP: Malformed zip codes. Some have a dash at the end, which has to be removed.
* Multibyte values. These can be extracted into separate features bytewise. However, this is done in feature extraction later on
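
Both operations can be sketched with pandas string methods. The values below are made up, and the column naming (`RFA_3R`, `RFA_3F`, `RFA_3A`) is an assumption modelled on the already spread-out RFA_2 features.

```python
import pandas as pd

# Toy data (hypothetical values)
toy = pd.DataFrame({
    "ZIP": ["61081", "91326-", "27017"],
    "RFA_3": ["S4E", "A2G", None],
})

# Strip the trailing dash from malformed zip codes
toy["ZIP"] = toy["ZIP"].str.rstrip("-")

# Spread the three-byte code into one column per byte
for position, suffix in enumerate(["R", "F", "A"]):
    toy["RFA_3" + suffix] = toy["RFA_3"].str[position]
```

Missing multibyte codes simply yield missing values in each of the new columns.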

## Preprocessing Pipeline

It is now time to construct the preprocessing pipeline. A set of transforming operations is concatenated into a sequence of operations. This pipeline is then learned on the learning dataset. All transformations of the learning dataset will later be applied to the test dataset and to new data.

In [None]:
numerical_feats = list(kdd_learn_feat.select_dtypes(include=np.number).columns)
categorical_feats = list(kdd_learn_feat.select_dtypes(include="category").columns)

With all categories now properly formatted, it is time for one-hot encoding. The sklearn pipeline also has an impute transformation. NaN's get their own level, "missing". This step results in a huge increase in the dimension of the feature space. It is also heavy on computation.

In [None]:
cat_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("one_hot",  OneHotEncoder(impute_missing=True,use_cat_names=True,return_df=True))
])

categories_transformer = ColumnTransformer([
    ("cat_encoder",
     cat_pipe,
     list(kdd_learn_feat.select_dtypes(include="category").columns))
])

Interests and donations

In [None]:
data = learning_raw.loc[:,dl.interest_features+["TARGET_D"]].fillna(0)
interests = pd.melt(data,value_vars=dl.interest_features, value_name="Interest")
data.head()

Features with constant values:

## Feature Selection
Feature selection reduces dimensionality by keeping only features that are 'interesting enough' to be considered, in order to speed up computation and / or improve the accuracy of the estimator. Candidate approaches:
- By variance threshold
- Recursive Feature Elimination by Cross-Validation
- L1-based feature selection (Logistic Regression, Lasso, SVM)
- Tree-based feature selection

See [scikit-learn: feature selection](http://scikit-learn.org/stable/modules/feature_selection.html#feature-selection)

### Removing constant features (zero variance)

sklearn.feature_selection.VarianceThreshold

In [None]:
for column in learning.columns:
    if len(learning[column].unique()) == 1:
        print(column)
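
For numerical data, the same check can be done with `VarianceThreshold` while keeping the column names. This is a minimal sketch on a made-up frame, assuming all columns are numeric and free of missing values.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy data (hypothetical): one constant and one varying feature
toy = pd.DataFrame({
    "constant": [1.0, 1.0, 1.0, 1.0],
    "varying": [0.0, 1.0, 2.0, 3.0],
})

selector = VarianceThreshold(threshold=0.0)  # drops zero-variance features
selector.fit(toy)

kept = toy.columns[selector.get_support()]   # boolean mask -> column names
reduced = toy[kept]
```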

### Sparse Features

In [None]:
sparse_features = []
for column in learning:
    top_freq = learning[column].value_counts(normalize=True).iloc[0]
    if top_freq > 0.995:
        sparse_features.append(column)
        print(column+" has a top frequency of: " + str(top_freq))
        print(learning[column].value_counts(normalize=True))

In [None]:
sparse_features

### Advanced approaches

* If overfitting is a problem, ensemble-learning or tree learning can be used to find important features, then apply SelectFromModel before the actual estimator. See http://scikit-learn.org/stable/modules/feature_selection.html
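
A minimal sketch of this idea on synthetic data: a random forest ranks the features, `SelectFromModel` keeps those at or above the median importance, and the actual estimator is fitted on the reduced set. The choice of estimators, the threshold and the data are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Synthetic data: 20 features, of which only 3 are informative
X, y = make_classification(n_samples=300, n_features=20, n_informative=3,
                           n_redundant=0, random_state=0)

pipeline = Pipeline([
    # Tree-based importances decide which features are passed on
    ("select", SelectFromModel(RandomForestClassifier(n_estimators=50, random_state=0),
                               threshold="median")),
    # The actual estimator only sees the selected features
    ("classify", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X, y)

n_selected = int(pipeline.named_steps["select"].get_support().sum())
predictions = pipeline.predict(X)
```

Placing the selector inside the pipeline ensures that the selection is learned on the training data only, in line with the rest of the preprocessing.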

## Feature Extraction
All explanatory fields have to be numerical for the subsequent operations with scikit-learn. Here, the necessary feature extractions are performed.

See [scikit-learn: feature extraction](http://scikit-learn.org/stable/modules/feature_extraction.html)

In [None]:
import pandas as pd

In [None]:
symbolic_features = []
symbolic_features.append(tds.SymbolicFeatureSpreader(
    "DOMAIN", ["U", "S"])) #Urbanicity, SocioEconomicStatus
# RFA_2 is already spread out
for i in range(3, 25):
    feature = "_".join(["RFA", str(i)])
    symbolic_features.append(tds.SymbolicFeatureSpreader(
        feature, ["R", "F", "A"])) # Recency, Frequency, Amount

spread_multibyte = pd.DataFrame(index=learning_raw.index)
for f in symbolic_features:
    f.set_tidy_dataset_ref(learning_raw)
    spread_multibyte = pd.concat([spread_multibyte,f.spread(inplace=False)],axis=1)

In [None]:
spread_multibyte.info()

## PCA

A first look at important features

In [None]:
from sklearn import decomposition

In [None]:
X = learning.drop(["TARGET_B","TARGET_D"],axis=1)

In [None]:
n_comp = 3
pca = decomposition.PCA(n_components = n_comp)
pca.fit(X)
result = pd.DataFrame(pca.transform(X), columns=["PCA%i" % i for i in range(n_comp)], index=X.index)
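
PCA is sensitive to feature scale and requires numerical data without missing values. The sketch below, on made-up data, standardises the features first and then inspects the explained variance and the loadings, which hint at the features contributing most to each component. The scaling step is an assumption and not part of the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): five random features plus a rescaled copy of "a"
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(100, 5)), columns=list("abcde"))
toy["f"] = toy["a"] * 1000.0  # same information as "a" on a much larger scale

scaled = StandardScaler().fit_transform(toy)  # put all features on a common scale
pca = PCA(n_components=3).fit(scaled)

explained = pca.explained_variance_ratio_     # share of variance per component
loadings = pd.DataFrame(pca.components_.T, index=toy.columns,
                        columns=["PCA0", "PCA1", "PCA2"])
```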

In [None]:
import cProfile
domain_spreader = tds.SymbolicFieldToDummies(learning,"RFA_24",["Recency", "Frequency", "Amount"])
cProfile.run('domain_spreader.spread()', sort='time')

In [None]:
learning.head()

In [None]:
import os
import numpy as np
import sys
os.getcwd()
proj_dir = os.path.split(os.getcwd())[0]
if proj_dir not in sys.path:
    sys.path.append(proj_dir)

In [None]:
import eda.tidy_dataset as tds
tidy = tds.TidyDataset("cup98LRN.txt")

In [None]:
raw = tidy.get_raw_data()

In [None]:
spreader = tds.SymbolicFieldToDummies(
    raw, "RFA_24", ["Recency", "Frequency", "Amount"])
spreader.spread()