In [None]:
%load_ext autoreload

In [None]:
%run ./common_init.ipynb

In [None]:
%autoreload 2
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.feature_selection import VarianceThreshold
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from category_encoders import HashingEncoder, OneHotEncoder, OrdinalEncoder

# Load custom code
import kdd98.data_handler as dh
import kdd98.utils_transformer as ut
from kdd98.transformers import *
from kdd98.config import Config

In [None]:
# Where to save the figures
import pathlib  # harmless if common_init.ipynb already imported it

IMAGES_PATH = pathlib.Path(figure_output) / 'preprocessing'
IMAGES_PATH.mkdir(parents=True, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    # Build the file name first: `/` binds tighter than `+`, so the
    # unparenthesised form would try to add a str to a Path and fail
    path = IMAGES_PATH / (fig_id + "." + fig_extension)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

# Feature Transformations

## Dates

There are several date features. ODATEDW is the date the record was added, DOB the birth date. ADATE_* and RDATE_* are from the promotion history. ADATE_* is the date of a mailing, RDATE_* the date the donation for the corresponding mailing was received. While these dates are not of particular interest (very low variance), the time it took to respond might be.
Furthermore, there are the features MINRDATE, MAXRDATE, MAXADATE, FISTDATE, NEXTDATE and LASTDATE coming from the giving history file.

Three different transformations are applied:

1. ODATEDW, DOB: Years before 1997 -> membership duration, age
2. Giving history features: Relative time in months to 1997/06/01
3. For the promotion history, as specified above, the time for response in months

There are redundant features which can be safely removed, as is shown below:

1. FISTDATE and NEXTDATE are contained in TIMELAG, the number of months between first and second donation
2. DOB, the date of birth, is contained in the feature AGE
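
As a rough illustration of the first two transformations, the sketch below computes whole-month and whole-year differences to a reference date with the standard library only. The helpers `months_between` and `years_between` are hypothetical names, not part of the `kdd98` package, and the inputs are assumed to be parsed into `date` objects already.

```python
from datetime import date

# Reference date for the giving history, as stated in the text
REFERENCE_DATE = date(1997, 6, 1)

def months_between(start, end=REFERENCE_DATE):
    """Whole calendar months from start to end; the day of the month is ignored."""
    if start is None:  # keep missing dates missing
        return None
    return (end.year - start.year) * 12 + (end.month - start.month)

def years_between(start, end=REFERENCE_DATE):
    """Whole years from start to end, derived from the month difference."""
    months = months_between(start, end)
    return None if months is None else months // 12

last_gift = date(1996, 12, 1)
birth = date(1950, 8, 1)
print(months_between(last_gift))  # months since the last gift
print(years_between(birth))       # age in whole years at the reference date
```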

In [None]:
print(dh.date_features)

Now, we transform the dates from the giving history. First, we create two dataframes with the sending dates of the mailings and the dates when the gift (donation) for these was received.

In [None]:
don_hist_transformer = ColumnTransformer([
    ("months_to_donation",
     MonthsToDonation(),
     dh.PROMO_HISTORY_DATES+dh.GIVING_HISTORY_DATES
     )
])

In [None]:
donation_responses = don_hist_transformer.fit_transform(learning)

In [None]:
don_hist_feature_names = [n[n.find('__')+2:]
                 for n in don_hist_transformer.get_feature_names()]

In [None]:
donation_responses = pd.DataFrame(
    donation_responses, index=learning.index, columns=don_hist_feature_names)

In [None]:
learning = learning.merge(donation_responses, on=learning.index.name)

Next, time deltas are computed for the remaining features, using either a specific reference date or the date of the most recent mailing as a reference:

* Time since the last donation, the minimum and maximum donation, and the receipt of the most recent promotion
* Delta between first and next donation
* Age, years of membership

In [None]:
timedelta_transformer = ColumnTransformer([
    ("time_last_donation", DeltaTime(unit='months'), ['LASTDATE','MINRDATE','MAXRDATE','MAXADATE']),
    ("delta_first_next", DeltaTime(reference_date=learning.NEXTDATE), ['FISTDATE']),
    ("membership_years", DeltaTime(unit='years'),['ODATEDW', 'DOB'])
])

In [None]:
timedeltas = timedelta_transformer.fit_transform(learning)

In [None]:
timedelta_feature_names = [n[n.find('__')+2:]
                 for n in timedelta_transformer.get_feature_names()]

In [None]:
timedeltas = pd.DataFrame(timedeltas, index=learning.index,columns=timedelta_feature_names)

In [None]:
timedeltas.columns

In [None]:
learning = learning.merge(timedeltas, on=learning.index.name)
learning.drop(dh.date_features, axis=1,inplace=True)

Studying the redundancy of DOB <-> AGE and \[FISTDATE, NEXTDATE\] <-> TIMELAG

In [None]:
ages = pd.DataFrame([learning.AGE, timedeltas.DOB_DELTA_YEARS]).T

In [None]:
ages.loc[ages.AGE != ages.DOB_DELTA_YEARS,:].dropna()

In [None]:
lags = pd.DataFrame([learning.TIMELAG, timedeltas.FISTDATE_NEXTDATE_DELTA_MONTHS]).T

In [None]:
lags.loc[lags.TIMELAG != lags.FISTDATE_NEXTDATE_DELTA_MONTHS,:].dropna()

The transformed feature DOB is already represented in the feature AGE, so we can drop DOB_DELTA_YEARS. TIMELAG already holds the difference in months between FISTDATE and NEXTDATE, so this delta can also be safely removed together with the original features.

In [None]:
learning.drop(['DOB_DELTA_YEARS', 'FISTDATE_NEXTDATE_DELTA_MONTHS'], axis=1,inplace=True)

# Preprocessing pipeline

The preprocessing pipeline results in a dataset with numerical (binary features encoded correctly), categorical and string date features.

Following this step, feature extraction, imputation, dropping of constant and sparse features and ensuring all data is numerical can be tackled.

https://booking.ai/dont-be-tricked-by-the-hashing-trick-192a6aae3087

https://towardsdatascience.com/smarter-ways-to-encode-categorical-data-for-machine-learning-part-1-of-3-6dca2f71b159

The hashing transformer hashes the nominal feature values into a fixed number of output columns, eight with the encoder's default settings. If more than one feature is passed in, they are all encoded into the same eight columns, which in effect reduces the dimensionality of the data.
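
The idea behind the hashing trick can be sketched in a few lines of plain Python. This is only an illustration under assumed details (MD5 as the hash function, eight buckets, a hypothetical helper `hash_row`); it is not the `category_encoders` implementation.

```python
import hashlib

def hash_row(values, n_components=8):
    """Hash all nominal values of one record into a fixed-length count vector."""
    row = [0] * n_components
    for value in values:
        digest = hashlib.md5(str(value).encode("utf-8")).hexdigest()
        bucket = int(digest, 16) % n_components  # bucket index in [0, n_components)
        row[bucket] += 1
    return row

# One feature: exactly one bucket is incremented
print(hash_row(["L01"]))
# Several features share the same eight buckets, so collisions are possible
print(hash_row(["L01", "28", "61081"]))
```

Because the output width is fixed, unseen categories never add columns; the price is that distinct values can collide in the same bucket.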

In [None]:
data_loader = dh.KDD98DataLoader("cup98LRN.txt")

In [None]:
learning = data_loader.clean_data

In [None]:
hashing_transformer = ColumnTransformer([
            ("hash_osource", HashingEncoder(), ['OSOURCE']),
            ("hash_tcode", HashingEncoder(), ['TCODE']),
            ("hash_zip", HashingEncoder(), ['ZIP'])
        ])
hashes = hashing_transformer.fit_transform(learning)
data = ut.update_df_with_transformed(learning,hashes,hashing_transformer)
data = data.drop(['OSOURCE', 'TCODE', 'ZIP'], axis=1)

In [None]:
hashing_names = hashing_transformer.get_feature_names()
data_df = pd.DataFrame(data = hashes, columns = hashing_names, index=learning.index)

In [None]:
data_df.ZIP

In [None]:
# Draft of the preprocessing steps as (name, transformer) pairs.
# They are collected in a list so that the cell parses; the name
# `preprocessing_steps` is a placeholder.
preprocessing_steps = [
    ("date_features",
     # Date features are converted to time deltas.
     ColumnTransformer([
         ("months_to_donation", MonthsToDonation(), dh.promo_history_dates+dh.giving_history_dates),
         ("time_last_donation", DeltaTime(unit='months'), ['LASTDATE','MINRDATE','MAXRDATE','MAXADATE']),
         ("membership_years", DeltaTime(unit='years'), ['ODATEDW'])
     ])
    ),
    ("osource",
     ColumnTransformer([("hash_osource", HashingEncoder(), ['OSOURCE'])])
    ),
    ("tcode",
     ColumnTransformer([("hash_tcode", HashingEncoder(), ['TCODE'])])
    ),
    ("zip",
     ColumnTransformer([("hash_zip", HashingEncoder(), ['ZIP'])])
    ),
    ("rfa",
     Pipeline([
         # Recency / Frequency / Amount features are spread out into individual features, then ordinally encoded
         ("spread_rfa", ColumnTransformer([('spread', MultiByteExtract(["R", "F", "A"]), dh.nominal_features[2:])])),
         ("order_multibytes", OrdinalEncoder(mapping=dh.ordinal_mapping_rfa, handle_unknown='ignore'))
     ])
    ),
    ("domain",
     Pipeline([
         # The domain feature holds a code for urbanicity and socio-economic status of an area. It is split into two
         # and then the socio-economic status is recoded to an ordinal feature
         ("spread_domain", ColumnTransformer([("spread", MultiByteExtract(["Urbanicity", "SocioEconomic"]), ["DOMAIN"])])),
         ("recode_socioecon", RecodeUrbanSocioEconomic())
     ])
    ),
    ("mdmaud",
     # `remainder` is an argument of ColumnTransformer, not an element of the step tuple
     ColumnTransformer([
         ("mdmaud",
          OrdinalEncoder(mapping=dh.ordinal_mapping_mdmaud, handle_unknown='ignore'),
          ['MDMAUD_R','MDMAUD_A'])
     ], remainder='passthrough')
    )
]

## Imputation of missing values

https://github.com/epsilon-machine/missingpy

Olga Troyanskaya, Michael Cantor, Gavin Sherlock, Pat Brown, Trevor Hastie, Robert Tibshirani, David Botstein and Russ B. Altman, Missing value estimation methods for DNA microarrays, BIOINFORMATICS Vol. 17 no. 6, 2001 Pages 520-525

This step requires that we first drop features with more than 80% missing values for the KNNImputer to work.

Best results with k=3: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4959387/

In [None]:
# Features with at most 20% non-missing values, i.e. at least 80% missing
mostly_missing = [c for c in dataset2.columns if dataset2[c].count() / len(dataset2.index) <= 0.2]
print(mostly_missing)
dataset2.drop(mostly_missing, axis=1, inplace=True)

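We set `weights="distance"` so that closer neighbours contribute more to the imputed value, with the aim that binary and categorical features receive an integer value. Because the result is still a weighted average, this is not guaranteed, and such features may need rounding afterwards: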
https://www.queryxchange.com/q/27_52658127/imputing-missing-values-with-knn/
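
To make the effect of distance weighting concrete, here is a minimal sketch of how an imputed value could be formed from k neighbours with inverse-distance weights. The function `weighted_knn_value` is a hypothetical helper and not the `missingpy` implementation; the final rounding step is an assumption for binary features.

```python
def weighted_knn_value(neighbor_values, distances):
    """Inverse-distance weighted mean of the neighbours' values."""
    # An exact match (distance 0) determines the value on its own
    for value, dist in zip(neighbor_values, distances):
        if dist == 0:
            return float(value)
    weights = [1.0 / dist for dist in distances]
    weighted_sum = sum(w * v for w, v in zip(weights, neighbor_values))
    return weighted_sum / sum(weights)

# Three neighbours (k=3) of a record with a missing binary feature
raw = weighted_knn_value([1, 0, 1], [1.0, 2.0, 4.0])
print(raw)         # a weighted average, generally not an integer
print(round(raw))  # rounding maps it back to a valid binary level
```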

In [None]:
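from missingpy import KNNImputer
imputer = KNNImputer(n_neighbors=3, weights="distance")
# fit_transform returns a NumPy array; wrap it so that the column names stay available
kdd_learn_feat_imputed = pd.DataFrame(
    imputer.fit_transform(dataset2), index=dataset2.index, columns=dataset2.columns)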

## Removing constant features

As per the documentation, features with either low variance or very few non-NA examples are to be dropped.

In [None]:
[c for c in kdd_learn_feat_imputed.columns if kdd_learn_feat_imputed[c].var() <= 1e-5]

### Removing constant features (zero variance)

See `sklearn.feature_selection.VarianceThreshold`.

In [None]:
for column in learning.columns:
    # NaN counts as a value here, matching len(unique()) == 1
    if learning[column].nunique(dropna=False) == 1:
        print(column)
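
The variance criterion itself can be sketched without pandas or scikit-learn. The helper `low_variance_columns` below is hypothetical; it assumes a plain dict of columns, uses the population variance, and ignores missing values (`None`).

```python
from statistics import pvariance

def low_variance_columns(table, threshold=1e-5):
    """Return the names of columns whose population variance is <= threshold."""
    flagged = []
    for name, values in table.items():
        observed = [v for v in values if v is not None]  # skip missing entries
        if len(observed) < 2 or pvariance(observed) <= threshold:
            flagged.append(name)
    return flagged

table = {
    "constant": [1, 1, 1, 1],
    "binary": [0, 1, 0, 1],
    "mostly_missing": [2.0, 2.0, None, None],
}
print(low_variance_columns(table))
```

A threshold of zero would flag only strictly constant columns; a small positive threshold such as `1e-5` also catches near-constant ones.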

### Sparse Features

In [None]:
sparse_features = []
for column in learning:
    top_freq = learning[column].value_counts(normalize=True).iloc[0]
    if top_freq > 0.995:
        sparse_features.append(column)
        print(column+" has a top frequency of: " + str(top_freq))
        print(learning[column].value_counts(normalize=True))

In [None]:
sparse_features

### Advanced approaches

* If overfitting is a problem, ensemble-learning or tree learning can be used to find important features, then apply SelectFromModel before the actual estimator. See http://scikit-learn.org/stable/modules/feature_selection.html

In [None]:
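def get_low_variance_cols(df=None, cols=None,
                          skip_cols=None, thresh=1e-5,
                          autoremove=False):
    """
    Wrapper for sklearn VarianceThreshold for use on pandas dataframes.

    Returns the (possibly reduced) dataframe and the list of
    low-variance columns. `cols` is currently unused.
    """
    # avoid a mutable default argument
    skip_cols = list(skip_cols) if skip_cols is not None else []
    try:
        # get list of all the original df cols
        all_cols = df.select_dtypes(include="number").columns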

        # remove `skip_cols`
        remaining_cols = all_cols.drop(skip_cols)

        # get length of new index
        max_index = len(remaining_cols) - 1

        # get indices for `skip_cols`
        skipped_idx = [all_cols.get_loc(column)
                       for column
                       in skip_cols]

        # adjust insert location by the number of cols removed
        # (for non-zero insertion locations) to keep relative
        # locations intact
        for idx, item in enumerate(skipped_idx):
            if item > max_index:
                diff = item - max_index
                skipped_idx[idx] -= diff
            if item == max_index:
                diff = item - len(skip_cols)
                skipped_idx[idx] -= diff
            if idx == 0:
                skipped_idx[idx] = item

        # get values of `skip_cols`
        skipped_values = df.iloc[:, skipped_idx].values

        # get dataframe values
        X = df.loc[:, remaining_cols].values

        # instantiate VarianceThreshold object
        vt = VarianceThreshold(threshold=thresh)

        # fit vt to data
        vt.fit(X)

        # get the indices of the features that are being kept
        feature_indices = vt.get_support(indices=True)

        # remove low-variance cols from index
        feature_names = [remaining_cols[idx]
                         for idx, _
                         in enumerate(remaining_cols)
                         if idx
                         in feature_indices]

        # get the cols to be removed
        removed_features = list(np.setdiff1d(remaining_cols,
                                             feature_names))
        print("Found {0} low-variance cols."
              .format(len(removed_features)))

        # remove the cols
        if autoremove:
            print("Removing low-variance features.")
            # remove the low-variance cols
            X_removed = vt.transform(X)

            print("Reassembling the dataframe (with low-variance "
                  "features removed).")
            # re-assemble the dataframe
            df = pd.DataFrame(data=X_removed,
                              index=df.index,
                              columns=feature_names)

            # add back the `skip_cols`
            for idx, index in enumerate(skipped_idx):
                df.insert(loc=index,
                          column=skip_cols[idx],
                          value=skipped_values[:, idx])
            print("Successfully removed low-variance cols.")

        # do not remove cols
        else:
            print("No changes have been made to the dataframe.")

    except Exception as e:
        print(e)
        print("Could not remove low-variance features. Something "
              "went wrong.")
        # make sure the return statement below has something to return
        removed_features = []

    return df, removed_features

In [None]:
df, removed = get_low_variance_cols(kdd_learn_feat_2)

## Exploring strategies for specific feature types

* Noisy data: Correction of data entry / formatting errors
    - These errors must be corrected without excluding the records in question
* Missing data: Has to be inferred from known values
    - (e.g., mean, median, mode, a modeled value).
    - The one exception to this rule are attributes with 99.5 percent or more missing values. These are to be dropped.
* Sparse data: Features whose observed events make up only a very small subset of the event space are to be dropped
* Constant values are to be dropped

### Constant and Sparse Features

Features where only one value is present and those where the majority is empty are to be dropped.


In [None]:
# Raw strings keep the regex escapes (\d) intact
const_sparse_transformer = DropSparseLowVar(keep_anyways=[r"RAMNT_\d{1,2}", r"MONTHS_TO_DONATION_\d{1,2}"])
# fit_transform fits and transforms in one step, so a separate fit call is not needed
cs = const_sparse_transformer.fit_transform(learning)
print(set(cs.columns))
const_sparse_transformer.get_feature_names()

### Numerical features

In [None]:
numerical_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="median"))
])

### Remaining object features

In [None]:
objects = learning_raw.select_dtypes(include='object').columns
print(objects)

In [None]:
for f in objects:
    # format the array instead of concatenating str and ndarray
    print(f"{f}: {learning_raw[f].unique()}")

These fall into two types:

* ZIP: Malformed zip codes. Some have a dash at the end, which has to be removed.
* Multibyte values. These can be extracted into separate features bytewise. However, this is done in feature extraction later on

## Preprocessing Pipeline

It is now time to construct the preprocessing pipeline. A set of transforming operations is chained into a sequence. This pipeline is then fitted on the learning dataset. All transformations learned on the learning dataset will later be applied to the test dataset and to new data.

In [None]:
numerical_feats = list(kdd_learn_feat.select_dtypes(include=np.number).columns)
# categorical features have dtype "category", not a numeric dtype
categorical_feats = list(kdd_learn_feat.select_dtypes(include="category").columns)

With all categories now properly formatted, it is time for one-hot encoding. The pipeline also contains an imputation step, in which NaNs get their own level, "missing". One-hot encoding results in a huge increase in the dimension of the feature space and is computationally heavy.

In [None]:
cat_pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("one_hot",  OneHotEncoder(impute_missing=True,use_cat_names=True,return_df=True))
])

categories_transformer = ColumnTransformer([
    ("cat_encoder",
     cat_pipe,
     list(kdd_learn_feat.select_dtypes(include="category").columns))
])
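
What the two steps do to a single categorical feature can be sketched in plain Python. The helper `one_hot` is hypothetical and only mirrors the idea (missing values become their own level, then one indicator column per level); it is not the `category_encoders` implementation.

```python
def one_hot(values, missing_label="missing"):
    """Replace missing entries by their own level, then one-hot encode."""
    filled = [missing_label if v is None else v for v in values]
    levels = sorted(set(filled))  # one indicator column per level
    rows = [[1 if v == level else 0 for level in levels] for v in filled]
    return levels, rows

levels, rows = one_hot(["U", None, "S", "U"])
print(levels)
for row in rows:
    print(row)
```

A feature with n distinct levels (plus the missing level) turns into that many columns, which is where the growth in dimensionality comes from.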

Interests and donations

In [None]:
data = learning_raw.loc[:,dh.interest_features+["TARGET_D"]].fillna(0)
interests = pd.melt(data,value_vars=dh.interest_features, value_name="Interest")
data.head()

Features with constant values:

## Splitting into training- and test dataset

Before applying *any* transformations, the dataset will be split 80/20 into a learning and test set.

Let's look at feature TARGET_B, which describes whether a person has donated or not:

In [None]:
learning_raw.TARGET_B.value_counts(normalize=True) # 5 % of recipients have donated.

We want to preserve this ratio in the split datasets. scikit-learn provides a method for achieving this.

In [None]:
seed = Config.get.config("random_seed")
from sklearn.model_selection import StratifiedShuffleSplit

splitter = StratifiedShuffleSplit(n_splits=1, test_size=0.2, train_size=0.8, random_state=seed)
for learn_index, test_index in splitter.split(learning_raw, learning_raw.TARGET_B.astype('int')):
    l_i = learn_index
    t_i = test_index
    kdd_learn = learning_raw.iloc[learn_index]
    kdd_test = learning_raw.iloc[test_index]

Now, check that the two sets are really disjoint:

In [None]:
set(kdd_learn.index).intersection(kdd_test.index)

Check the frequencies of the donors in the sets:

In [None]:
kdd_learn['TARGET_B'].value_counts(normalize=True)

In [None]:
kdd_test['TARGET_B'].value_counts(normalize=True)
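
The stratified split and the checks above can also be sketched without scikit-learn, which makes the mechanism explicit. The function `stratified_split`, the seed and the toy labels are assumptions for illustration; `StratifiedShuffleSplit` differs in its details.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.2, seed=42):
    """Split indices so that each class keeps roughly the same share in both sets."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)  # seeded for reproducibility
    learn_idx, test_idx = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        n_test = round(len(members) * test_size)
        test_idx.extend(members[:n_test])
        learn_idx.extend(members[n_test:])
    return sorted(learn_idx), sorted(test_idx)

# Toy target with 5% donors, similar to TARGET_B
labels = [1] * 5 + [0] * 95
learn_idx, test_idx = stratified_split(labels)
print(len(learn_idx), len(test_idx))
print(set(learn_idx).isdisjoint(test_idx))
```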

## Separating features and label

First, we separate the features from the labels. We will also remove the label "TARGET_B", which is an indicator variable for donors that is no longer of interest

**All preprocessing is performed on *kdd_learn_feat***

In [None]:
kdd_learn_feat = kdd_learn.drop(['TARGET_B', 'TARGET_D'],axis=1).copy()
kdd_learn_labels = kdd_learn[['TARGET_B','TARGET_D']].copy()

# Feature Extraction
All explanatory fields have to be numerical for the subsequent operations with scikit-learn. Here, the necessary feature extractions are performed.

See [scikit-learn: feature extraction](http://scikit-learn.org/stable/modules/feature_extraction.html)

# Feature Selection
Feature selection reduces dimensionality by keeping only those features that are informative enough, in order to speed up computation and improve the accuracy of the estimator. Possible approaches:
- By variance threshold
- Recursive Feature Elimination by Cross-Validation
- L1-based feature selection (Logistic Regression, Lasso, SVM)
- Tree-based feature selection

See [scikit-learn: feature selection](http://scikit-learn.org/stable/modules/feature_selection.html#feature-selection)