<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Feature-engineering" data-toc-modified-id="Feature-engineering-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Feature engineering</a></span><ul class="toc-item"><li><span><a href="#Encode-zip-codes-as-coordinates" data-toc-modified-id="Encode-zip-codes-as-coordinates-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Encode zip codes as coordinates</a></span></li><li><span><a href="#Converting-dates" data-toc-modified-id="Converting-dates-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Converting dates</a></span><ul class="toc-item"><li><span><a href="#Donation-history" data-toc-modified-id="Donation-history-1.2.1"><span class="toc-item-num">1.2.1&nbsp;&nbsp;</span>Donation history</a></span></li><li><span><a href="#Time-since-donations,-membership-years" data-toc-modified-id="Time-since-donations,-membership-years-1.2.2"><span class="toc-item-num">1.2.2&nbsp;&nbsp;</span>Time since donations, membership years</a></span></li></ul></li><li><span><a href="#Categoricals" data-toc-modified-id="Categoricals-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Categoricals</a></span><ul class="toc-item"><li><span><a href="#Binary-Encoding" data-toc-modified-id="Binary-Encoding-1.3.1"><span class="toc-item-num">1.3.1&nbsp;&nbsp;</span>Binary Encoding</a></span></li><li><span><a href="#One-Hot-Encoding" data-toc-modified-id="One-Hot-Encoding-1.3.2"><span class="toc-item-num">1.3.2&nbsp;&nbsp;</span>One-Hot Encoding</a></span></li></ul></li><li><span><a href="#Feature-engineering-combined" data-toc-modified-id="Feature-engineering-combined-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Feature engineering combined</a></span></li></ul></li></ul></div>

In [1]:
%load_ext autoreload

In [2]:
%run ./common_init.ipynb

Setup logging to file: out.log
Figure output directory saved in figure_output at /home/datarian/OneDrive/unine/Master_Thesis/ma-thesis-report/figures
 cwd: /data/home/datarian/git/master-thesis-msc-statistics/code


In [3]:
%autoreload 2
import os
import pathlib
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
from scipy import stats
from sklearn.feature_selection import VarianceThreshold
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from category_encoders import HashingEncoder, OneHotEncoder, OrdinalEncoder

# Load custom code
import kdd98.data_handler as dh
import kdd98.utils_transformer as ut
from kdd98.transformers import *
from kdd98.config import Config

Using TensorFlow backend.


In [4]:
# Where to save the figures
IMAGES_PATH = pathlib.Path(figure_output/'preprocessing')

pathlib.Path(IMAGES_PATH).mkdir(parents=True, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = pathlib.Path(IMAGES_PATH, fig_id + "." + fig_extension)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

In [5]:
data_provider = dh.KDD98DataProvider("cup98LRN.txt")

In [6]:
cleaned = data_provider.cleaned_data
features = cleaned["data"]
targets = cleaned["targets"]

In [7]:
from kdd98.transformers import ZipToCoords
from category_encoders import BinaryEncoder, OneHotEncoder

## Feature engineering

### Encode zip codes as coordinates

Zip codes are transformed to their centroid coordinates. This gives an intuitive measure of the geographical relation between examples.

The coordinates are first searched for in a database from the 2018 US census, and, if not found there, the HERE geolocator web service is queried.

For military zip codes, there are no coordinates available. These are set to the coordinates of the Pentagon.

If a zip code cannot be resolved at all, its coordinates are set to (0, 0).
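The lookup order described above can be sketched in a few lines. This is a minimal illustration, not the transformer itself: `lookup_coords`, the toy `census_db` and the `geocode` callable are hypothetical names, the census entry uses made-up coordinates, and the web service is replaced by a stub.

```python
# Pentagon coordinates, used as a stand-in for military zip codes
PENTAGON = (38.8719, -77.0563)
MILITARY_STATES = {"AA", "AE", "AP"}


def lookup_coords(zip_code, state, census_db, geocode):
    """Resolve a zip code to (latitude, longitude) using the fallback chain."""
    if zip_code in census_db:          # 1. local census table
        return census_db[zip_code]
    if state in MILITARY_STATES:       # 2. military zip code: no coordinates exist
        return PENTAGON
    found = geocode(zip_code, state)   # 3. web service (hypothetical callable)
    if found is not None:
        return found
    return (0.0, 0.0)                  # 4. nothing found at all


def no_result(zip_code, state):
    """Geocoder stub that never finds anything."""
    return None


# Toy census table with made-up coordinates
census_db = {"10001": (40.0, -74.0)}

print(lookup_coords("10001", "NY", census_db, no_result))
print(lookup_coords("09001", "AE", census_db, no_result))
print(lookup_coords("99999", "ZZ", census_db, no_result))
```

Passing the geocoder in as a callable keeps the sketch testable without network access.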

In [8]:
len(features.ZIP.unique())

16488

ZipToCoords is a custom scikit-learn transformer. For zip codes missing from the census database, it relies on the HERE geolocator service. Get an app id and app code at [https://developer.here.com/](https://developer.here.com/).

The credentials can then be set using `Config.set("here_geolocator_app_id", "yourappid")` and `Config.set("here_geolocator_app_code", "yourappcode")`, or directly in the code.


```python
class ZipToCoords(NamedFeatureTransformer):
    def __init__(self):
        super().__init__()
        # Credentials are read from the configuration rather than hard-coded
        self.app_id = Config.get("here_geolocator_app_id")
        self.app_code = Config.get("here_geolocator_app_code")
        try:
            with open(pathlib.Path(Config.get("data_dir"), "zip_db.pkl"),
                      "rb") as zdb:
                self.locations = pickle.load(zdb)
        except Exception:
            zip_db = pd.read_csv(
                pathlib.Path(Config.get("data_dir"), "zipcodes2018.txt"))
            zip_db.columns = ["zip", "ZIP_latitude", "ZIP_longitude"]
            self.locations = zip_db.set_index("zip").to_dict('index')

    def _do_geo_query(self, q):
        geolocator = Here(app_id=Config.get("here_geolocator_app_id"),
                          app_code=Config.get("here_geolocator_app_code"))
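        # Wrap the geocoder so that requests are throttled and retried
        geocode = RateLimiter(geolocator.geocode,
                              min_delay_seconds=0.01,
                              max_retries=4)
        try:
            # Call the rate-limited wrapper, not the raw geocoder
            return geocode(query=q, exactly_one=True)
        except GeocoderTimedOut:
            return self._do_geo_query(q)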

    def _get_location(self, example):
        # NaN is truthy, so missing values have to be checked explicitly
        if pd.notna(example.ZIP) and example.ZIP:
            zip = str(example.ZIP).rjust(5, '0')
            q = {'postalcode': zip, 'state': example.STATE}
            location = self._do_geo_query(q)
            if location:
                loc = {
                    'ZIP_latitude': location.latitude,
                    'ZIP_longitude': location.longitude
                }
            else:
                logger.info(
                    "Transformer {}: No location found for zip {} in state {}. Setting to 0, 0"
                    .format(self.__class__.__name__, zip, example.STATE))
                loc = {'ZIP_latitude': 0, 'ZIP_longitude': 0}
        else:
            logger.info(
                "Transformer {}: ZIP is NaN, setting location to NaN as well.".
                format(self.__class__.__name__))
            loc = {'ZIP_latitude': np.nan, 'ZIP_longitude': np.nan}
        return loc

    def _extract_coords(self, example):
        try:
            return self.locations[example.ZIP]
        except KeyError:
            if example.STATE in ["AA", "AE",
                                 "AP"]:  # military zip, no coords available
                self.locations[example.ZIP] = {
                    'ZIP_latitude': 38.8719,
                    'ZIP_longitude': -77.0563  # west of Greenwich
                }
            else:
                try:
                    loc = self._get_location(example)
                    self.locations[example.ZIP] = loc
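                except Exception as e:
                    logger.info(
                        "Transformer {}: Failed to retrieve missing zip. Reason: {}"
                        .format(self.__class__.__name__, e))
                    # Fall back to 0, 0 so that the lookup below cannot fail
                    self.locations[example.ZIP] = {
                        'ZIP_latitude': 0,
                        'ZIP_longitude': 0
                    }
            return self.locations[example.ZIP]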

    def fit(self, X, y=None):
        assert (isinstance(X, pd.DataFrame))
        self.feature_names = ["ZIP_latitude", "ZIP_longitude"]
        return self

    def transform(self, X, y=None):
        X_trans = pd.DataFrame(index=X.index)
        X_trans = X.apply(self._extract_coords, axis=1, result_type="expand")
        try:
            with open(pathlib.Path(Config.get("data_dir"), "zip_db.pkl"),
                      "wb") as zdb:
                pickle.dump(self.locations, zdb)
        except Exception as e:
            logger.warning("Failed to store updated zipcode database.")
        return X_trans
```

In [9]:
zip_to_coords = ColumnTransformer([("zip_to_coords", ZipToCoords(),
                                    ["ZIP", "STATE"])])
coords = zip_to_coords.fit_transform(features)
coords_names = zip_to_coords.get_feature_names()
coords = pd.DataFrame(data=coords, index=features.index, columns=coords_names)

In [10]:
features = features.merge(coords, on=features.index.name)

In [11]:
features.drop("ZIP", axis=1, inplace=True)

### Converting dates

There are several date features. ODATEDW is the date the record was added, DOB the birth date. ADATE_* and RDATE_* are from the promotion history. ADATE_* is the date of a mailing, RDATE_* the date the donation for the corresponding mailing was received. While these dates are not of particular interest (very low variance), the time it took to respond might be.
Furthermore, there are the features MINRDATE (date of smallest donation), MAXRDATE (date of largest donation), MAXADATE (date of most recent promotion received), FISTDATE (date of first donation), NEXTDATE (date of second donation) and LASTDATE (date of most recent donation) coming from the giving history file.

In [12]:
print(dh.DATE_FEATURES)

['ODATEDW', 'DOB', 'ADATE_2', 'ADATE_3', 'ADATE_4', 'ADATE_5', 'ADATE_6', 'ADATE_7', 'ADATE_8', 'ADATE_9', 'ADATE_10', 'ADATE_11', 'ADATE_12', 'ADATE_13', 'ADATE_14', 'ADATE_15', 'ADATE_16', 'ADATE_17', 'ADATE_18', 'ADATE_19', 'ADATE_20', 'ADATE_21', 'ADATE_22', 'ADATE_23', 'ADATE_24', 'RDATE_3', 'RDATE_4', 'RDATE_5', 'RDATE_6', 'RDATE_7', 'RDATE_8', 'RDATE_9', 'RDATE_10', 'RDATE_11', 'RDATE_12', 'RDATE_13', 'RDATE_14', 'RDATE_15', 'RDATE_16', 'RDATE_17', 'RDATE_18', 'RDATE_19', 'RDATE_20', 'RDATE_21', 'RDATE_22', 'RDATE_23', 'RDATE_24', 'LASTDATE', 'MINRDATE', 'MAXRDATE', 'FISTDATE', 'NEXTDATE', 'MAXADATE']


The following helper function updates feature name lists and removes features that are no longer present because they were removed during preprocessing.

In [13]:
ALL_FEATURES = features.columns.values.tolist()
def filter_features(features):
    return [f for f in features if f in ALL_FEATURES]

In [14]:
features[filter_features(dh.DATE_FEATURES)]

CONTROLN,ODATEDW,ADATE_5,ADATE_7,ADATE_8,ADATE_9,ADATE_10,ADATE_11,ADATE_12,ADATE_13,ADATE_14,...,RDATE_16,RDATE_17,RDATE_18,RDATE_19,RDATE_21,RDATE_22,RDATE_24,LASTDATE,MINRDATE,MAXRDATE
95515,8901,9604,9602,9601,9511,9510,9510,9508,9507,9506,...,9505,9503,,,,,9406,9512,9208,9402
148535,9401,9604,9602,9601,9511,9510,9510,9509,,,...,9504,,,,,,,9512,9310,9512
15078,9001,9604,9602,9601,9511,,9510,9508,9507,9506,...,9504,,9501,,,9409,9406,9512,9111,9207
172556,8701,9604,9602,9601,9511,,9510,9508,9507,9506,...,9505,9503,,,9411,,,9512,8711,9411
7112,8601,9604,9512,9601,9511,9510,9509,9508,9502,9506,...,,,,,,,,9601,9310,9601
47784,9401,9604,9602,9601,9511,9510,9510,9509,9507,9506,...,,,9506,,,,9407,9506,9407,9412
62117,8701,,9602,9601,9511,9510,9510,9508,9507,9506,...,9504,,,,,9410,,9504,8705,9410
109359,9401,9604,9602,9601,9511,9510,9510,9509,9507,9506,...,9504,,,,,,9407,9508,9507,9508
75768,8801,9604,9602,9601,9511,9510,9509,9508,9507,9506,...,,,,,9411,,,9507,8809,9312
49909,9401,,9602,9601,9511,9511,9511,9509,9507,,...,9504,,,,,,,9504,9309,9504


#### Donation history
From ADATE_*, the date a letter was sent, and RDATE_*, the date a donation was received, we can calculate the time in months it took to respond with a donation.

When computing the deltas, several of them turn out negative. We assume that the year of the donation date (the target) was entered incorrectly in these cases and therefore increase it by one year.
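The delta computation and the one-year correction can be illustrated with plain integer arithmetic. This is a simplified sketch, assuming the dates are stored as YYMM integers from the 1900s (as in the table above); `parse_yymm` and `months_to_donation` are hypothetical helpers and not part of the kdd98 package.

```python
def parse_yymm(value):
    """Split a YYMM integer such as 9604 into (year, month); 19YY is assumed."""
    return 1900 + value // 100, value % 100


def months_to_donation(sent, received):
    """Months from mailing to donation; a negative gap is shifted by one year."""
    sent_year, sent_month = parse_yymm(sent)
    recv_year, recv_month = parse_yymm(received)
    months = (recv_year - sent_year) * 12 + (recv_month - sent_month)
    if months < 0:
        # Assumed input error in the donation year: move it forward by one year
        months += 12
    return months


print(months_to_donation(9506, 9508))  # donation two months after the mailing
print(months_to_donation(9506, 9503))  # -3 months, corrected to 9
```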

The transformer class:

```python
class MonthsToDonation(DateHandler, NamedFeatureTransformer):
    """ Calculates the elapsed months from sending the promotion
        to receiving a donation.
        The mailings usually were sent out over several months
        and in some cases, the donation is recorded as occurring
        before the mailing. In these cases, the sending date was
        probably not recorded correctly for the example in question.
        As a consequence, the first sending month will be used to
        calculate the time delta.
        In some cases, even with this in place, negative durations occur.
        In these cases, one year is added to the donation date.
        All cases where no donation was made are set to nan.
    """

    def __init__(self, reference_date=pd.datetime(1998, 6, 1)):
        self.reference_date = reference_date
        super().__init__(self.reference_date)

    def fit(self, X, y=None):
        return self

    def calc_diff(self, row):
        ref = row[0]
        target = row[1]

        if not pd.isna(ref) and not pd.isna(target):
            try:
                duration = relativedelta.relativedelta(target, ref).years * 12
                duration += relativedelta.relativedelta(target, ref).months
                if duration < 0.0:  # most likely, the target year was off
                    logger.warning("Calculated negative time difference {}"\
                        " with reference {} and target {},\n"\
                        "Adding one year to the target.".format(duration, ref, target))
                    duration = 12. + duration
            except Exception as e:
                logger.error("Failed to calculate time delta. "
                             "Dates: {} and {}\nMessage: {}".format(
                                 row[0], row[1], e))
                duration = np.nan
        else:
            # Either date is missing: no duration can be computed
            duration = np.nan
        return duration

    def transform(self, X, y=None):
        assert isinstance(X, pd.DataFrame)
        self.feature_names = list()

        X_trans = pd.DataFrame(index=X.index)
        for i in range(3, 25):
            try:
                feat_name = "MONTHS_TO_DONATION_" + str(i)
                send_date = X.loc[:, ["ADATE_" + str(i)]]
                recv_date = X.loc[:, ["RDATE_" + str(i)]]
            except KeyError as e:
                # One of the features is not there, can't compute the delta
                logger.info(
                    "Missing feature for MONTHS_TO_DONATION_{}.".format(i))
                continue

            try:
                logger.info("Transforming {}".format(feat_name))
                # Use the earliest sending date of the mailing for all examples
                send_date = self.parse_date(send_date.squeeze())
                send_date.loc[:] = send_date.min()
                recv_date = self.parse_date(recv_date.squeeze())
                diffs = pd.concat([send_date, recv_date],
                                  axis=1).agg(self.calc_diff, axis=1)
                X_trans = X_trans.join(pd.DataFrame(diffs,
                                                    columns=[feat_name],
                                                    index=X_trans.index),
                                       how="inner")
                self.feature_names.extend([feat_name])
            except Exception as e:
                logger.error("Failed to transform '{}' "
                             "on feature {} for reason {}".format(
                                 feat_name, self.__class__.__name__, e))
                raise e
        return X_trans
```


In [15]:
don_history = ColumnTransformer(
    [("months_to_donation",
      MonthsToDonation(reference_date=pd.datetime(1998, 6, 1)),
      filter_features(dh.PROMO_HISTORY_DATES + dh.GIVING_HISTORY_DATES))])
donation_history = don_history.fit_transform(features)
donation_history_names = [n[n.find('__')+2:]
                 for n in don_history.get_feature_names()]
donation_history = pd.DataFrame(data=donation_history, index=features.index, columns=donation_history_names)

In [16]:
features = features.merge(donation_history, on=features.index.name)

In [17]:
features.drop(filter_features(dh.PROMO_HISTORY_DATES + dh.GIVING_HISTORY_DATES), axis=1,inplace=True)

#### Time since donations, membership years
The time deltas for LASTDATE, MINRDATE, MAXRDATE and MAXADATE are expressed in months before the reference date (which is the sending date of the last promotion).

Membership years are also computed against the reference date of the last promotion sent out.
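As a rough illustration of both computations, the sketch below reproduces the arithmetic with the standard library only. It assumes YYMM integer dates from the 1900s and a reference date of June 1, 1997; `months_before` and `membership_years` are hypothetical helpers. The year count follows the DeltaTime transformer, which adds one to the number of completed years. The sample values are taken from the first example in the date table above.

```python
from datetime import date

REFERENCE = date(1997, 6, 1)  # assumed reference date (last promotion)


def months_before(yymm, reference=REFERENCE):
    """Whole months between a YYMM date (19YY assumed) and the reference."""
    year, month = 1900 + yymm // 100, yymm % 100
    return (reference.year - year) * 12 + (reference.month - month)


def membership_years(yymm, reference=REFERENCE):
    """Completed years before the reference date, plus one."""
    return months_before(yymm, reference) // 12 + 1


print(months_before(9512))     # LASTDATE 9512 -> 18
print(months_before(9208))     # MINRDATE 9208 -> 58
print(membership_years(8901))  # ODATEDW 8901 -> 9
```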

The DeltaTime transformer:

```python
class DeltaTime(DateHandler, NamedFeatureTransformer):
    """Computes the duration between a date and a reference date in months.

    Parameters:
    -----------

    reference_date: A datetimelike
    unit: ['months', 'years']
    """

    def __init__(self, reference_date, unit='months', suffix=True):
        self.reference_date = reference_date
        super().__init__(self.reference_date)
        self.feature_suffix = "_DELTA_" + unit.upper()
        self.unit = unit

    def get_duration(self, target):
        if not pd.isna(target):
            delta = relativedelta.relativedelta(self.reference_date, target)
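            if self.unit.lower() == 'months':
                duration = (delta.years * 12) + delta.months
            elif self.unit.lower() == 'years':
                duration = delta.years + 1
            else:
                # Avoid returning an unbound variable for unsupported units
                raise ValueError("Unknown unit: {}".format(self.unit))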
        else:
            logger.info(
                "Failed to calculate time delta. Dates: {} and {}.".format(
                    target, self.reference_date))
            duration = np.nan
        return duration

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        assert isinstance(X, pd.DataFrame)

        # We need to ensure we have datetime objects.
        # The dateparser has to return Int64 to work with sklearn, so
        # we need to recast here.
        X_trans = pd.DataFrame()

        for f in X.columns:
            feature_name = f + self.feature_suffix
            try:
                try:
                    target = self.parse_date(X[f])
                except Exception as e:
                    raise e
                X_trans[feature_name] = target.map(self.get_duration)
            except Exception as e:
                logger.error(
                    "Failed to transform '{}' on feature {} for reason {}".
                    format(self.__class__.__name__, f, e))
                raise e
        self.feature_names = X_trans.columns.values.tolist()
        return X_trans

```

In [18]:
t_deltas = ColumnTransformer(
    [("time_last_donation",
      DeltaTime(reference_date=pd.datetime(1997, 6, 1), unit="months"),
      filter_features(["LASTDATE", "MINRDATE", "MAXRDATE", "MAXADATE"])),
     ("membership_years",
      DeltaTime(reference_date=pd.datetime(1997, 6, 1), unit="years"),
      filter_features(["ODATEDW", "DOB"]))])
timedeltas = t_deltas.fit_transform(features)
timedeltas_names = [n[n.find('__')+2:]
                 for n in t_deltas.get_feature_names()]
timedeltas = pd.DataFrame(data=timedeltas, index=features.index, columns=timedeltas_names)

In [19]:
timedeltas

CONTROLN,LASTDATE_DELTA_MONTHS,MINRDATE_DELTA_MONTHS,MAXRDATE_DELTA_MONTHS,ODATEDW_DELTA_YEARS
95515,18,58,40,9
148535,18,44,18,4
15078,18,67,59,8
172556,18,115,31,11
7112,17,44,17,12
47784,24,35,30,4
62117,26,121,32,11
109359,22,23,22,4
75768,23,105,42,10
49909,26,45,26,4


In [20]:
features = features.merge(timedeltas, on=features.index.name)

In [21]:
features.drop(filter_features(["LASTDATE", "MINRDATE", "MAXRDATE", "MAXADATE", "ODATEDW"]), axis=1, inplace=True)

Some of the remaining date features are redundant and can be safely removed, because their information is already contained in other features:

1. FISTDATE and NEXTDATE are contained in TIMELAG, the number of months between first and second donation
2. DOB, the date of birth, is contained in the feature AGE

### Categoricals

We split the (nominal) categorical features into two groups for individual processing. The distinction is made by cardinality. The features with high cardinality will be binary encoded, those with relatively few categories will be one-hot encoded.

https://towardsdatascience.com/smarter-ways-to-encode-categorical-data-for-machine-learning-part-1-of-3-6dca2f71b159
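The split by cardinality could also be derived from the data instead of being listed by hand. The sketch below is only an illustration on toy data: the helper `split_by_cardinality` and the threshold of 10 levels are assumptions, not values used in this notebook.

```python
def split_by_cardinality(columns, threshold=10):
    """Partition column names into (high, low) cardinality groups.

    `columns` maps a feature name to its list of values.
    """
    high, low = [], []
    for name, values in columns.items():
        n_levels = len(set(values))
        if n_levels > threshold:
            high.append(name)   # candidate for binary encoding
        else:
            low.append(name)    # candidate for one-hot encoding
    return high, low


# Toy data: 50 distinct states, 3 distinct gender codes
toy = {
    "STATE": ["S{}".format(i) for i in range(50)],
    "GENDER": ["F", "M", "U", "F"],
}
high, low = split_by_cardinality(toy, threshold=10)
print(high, low)
```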

In [22]:
CATEGORICAL_FEATURES = features.select_dtypes(include="category").columns.values.tolist()
BE_CATEGORICALS = ['OSOURCE', 'TCODE', 'STATE', 'CLUSTER']
OHE_CATEGORICALS = [f for f in CATEGORICAL_FEATURES if f not in BE_CATEGORICALS]
OHE_CATEGORICALS

['GENDER',
 'DATASRCE',
 'GEOCODE',
 'LIFESRC',
 'GEOCODE2',
 'RFA_3R',
 'RFA_4R',
 'RFA_5R',
 'RFA_6R',
 'RFA_7R',
 'RFA_8R',
 'RFA_9R',
 'RFA_10R',
 'RFA_11R',
 'RFA_12R',
 'RFA_13R',
 'RFA_14R',
 'RFA_15R',
 'RFA_16R',
 'RFA_17R',
 'RFA_18R',
 'RFA_19R',
 'RFA_20R',
 'RFA_21R',
 'RFA_22R',
 'RFA_23R',
 'RFA_24R',
 'DOMAINUrbanicity']

#### Binary Encoding

Those categoricals with high cardinality (many levels) are binary-encoded so as to not increase dimensionality too much.

In binary encoding, the levels are first encoded ordinally, meaning we have the levels `{1, 2, 3, ..., l}` for the `l` different levels in the data. These levels are then encoded in base 2. This allows us to represent many levels with only few additional features.

Example:

A categorical feature with 100 levels is representable with 7 binary digits, since $2^6 = 64 < 100 \le 2^7 = 128$. One-hot encoding the same feature would require 100 new features.
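The two steps (ordinal codes, then base-2 digits) can be sketched in plain Python. This is a toy illustration of the idea; the hypothetical helper `binary_encode` does not necessarily reproduce the exact column layout produced by the `BinaryEncoder` from category_encoders.

```python
def binary_encode(levels):
    """Map each level to a list of 0/1 digits via its ordinal code."""
    # Step 1: ordinal codes 1..l for the l distinct levels
    ordinal = {level: i for i, level in enumerate(sorted(set(levels)), start=1)}
    # Step 2: number of binary digits needed for the largest code
    width = max(ordinal.values()).bit_length()
    return {
        level: [int(bit) for bit in format(code, "0{}b".format(width))]
        for level, code in ordinal.items()
    }


codes = binary_encode(["CA", "NY", "TX", "FL", "WA"])
for level, bits in sorted(codes.items()):
    print(level, bits)
```

Five levels fit into three binary features here, whereas one-hot encoding would create five.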

In [23]:
binary_encode = ColumnTransformer([
                    ("be_osource", BinaryEncoder(handle_missing="indicator"), filter_features(['OSOURCE'])),
                    ("be_state", BinaryEncoder(handle_missing="indicator"), filter_features(['STATE'])),
                    ("be_cluster", BinaryEncoder(handle_missing="indicator"), filter_features(['CLUSTER'])),
                    ("be_tcode", BinaryEncoder(handle_missing="indicator"), filter_features(['TCODE']))
                ])
binary_encoded_categories = binary_encode.fit_transform(features)
binary_encode_names = [n[n.find('__')+2:]
                 for n in binary_encode.get_feature_names()]
binary_encoded_categories = pd.DataFrame(data=binary_encoded_categories, index=features.index, columns = binary_encode_names)


In [24]:
features = features.merge(binary_encoded_categories, on=features.index.name)

In [25]:
features.drop(filter_features(BE_CATEGORICALS), axis=1, inplace=True)

#### One-Hot Encoding

In one-hot encoding, one new feature is created for each level in a categorical feature, plus an additional indicator feature for missing values. From these new features, each example has one *hot* feature, indicated with a 1, corresponding to the level the example has for the original feature.
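A toy version of this scheme, written in plain Python for illustration: the hypothetical helper `one_hot` treats `None` as the missing value and names the indicator column `nan`, which is an assumption and may differ from the column names generated by category_encoders.

```python
def one_hot(values):
    """Return (column names, rows): one indicator column per level
    plus a trailing indicator column for missing values."""
    levels = sorted({v for v in values if v is not None})
    columns = levels + ["nan"]
    rows = []
    for v in values:
        hot = "nan" if v is None else v
        # Exactly one column is set to 1 for each example
        rows.append([int(c == hot) for c in columns])
    return columns, rows


columns, rows = one_hot(["F", "M", None, "F"])
print(columns)
for row in rows:
    print(row)
```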

In [26]:
one_hot_encoding = ColumnTransformer([("oh",
                                       OneHotEncoder(
                                           use_cat_names=True,
                                           handle_missing="indicator"),
                                       OHE_CATEGORICALS)])
oh_encoded_categories = one_hot_encoding.fit_transform(features)
oh_encoded_categories_names = [n[n.find('__')+2:] for n in one_hot_encoding.get_feature_names()]
oh_encoded_categories = pd.DataFrame(data=oh_encoded_categories, index=features.index, columns = oh_encoded_categories_names)

In [27]:
features = features.merge(oh_encoded_categories, on=features.index.name)

In [28]:
features.drop(OHE_CATEGORICALS, axis=1, inplace=True)

### Feature engineering combined

All of the above steps are implemented in the `kdd98` package. The data after feature engineering is readily available from the data provider:

In [29]:
learning_numeric = data_provider.numeric_data