
# Exploratory Data Analysis
This notebook contains all code for the preliminary analysis of the KDD Cup 98 datasets.

In [186]:
%load_ext autoreload

In [187]:
# Set up logging and graphics defaults
%run ./common_init.ipynb

In [188]:
%autoreload 2

import kdd98.data_handler as dh
import kdd98.utils_transformer as ut
from kdd98.transformers import *

# Where to save the figures
IMAGES_PATH = pathlib.Path(figure_output/'eda')

pathlib.Path(IMAGES_PATH).mkdir(parents=True, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = pathlib.Path(IMAGES_PATH, fig_id + "." + fig_extension)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

In [189]:
plt.rcParams['figure.figsize'] = (12, 8)

In [190]:
data_handler = dh.KDD98DataProvider("cup98LRN.txt")
learning = data_handler.clean_data

## Overview

A first, general look at the data structure:

In [None]:
learning.info()

* There are 481 features (of which one is the index)
* A total of 95412 examples
* 24 categorical features, 53 datetime features, 48 numerical features with missing values, 297 integer features without missing values and 56 string features

In [None]:
learning.head()

### Numerical Features

In [None]:
numerical = learning.select_dtypes(include=np.number).columns
print("There are {:1} numerical features".format(len(numerical)))

### Categorical Features

Categories were defined when importing the CSV data, based on the dataset dictionary.
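As a minimal sketch of how categories could be declared at import time, `pd.read_csv` accepts a `dtype` mapping. The three-row excerpt and the choice of columns below are made up for illustration; the actual import logic lives in `kdd98.data_handler` and may well differ.

```python
import io

import pandas as pd

# Hypothetical three-row excerpt; the real file has several hundred columns.
raw = io.StringIO(
    "CONTROLN,STATE,GENDER\n"
    "1,CA,F\n"
    "2,IL,M\n"
    "3,CA,F\n"
)

# Columns listed here are parsed directly as pandas categoricals.
category_columns = {"STATE": "category", "GENDER": "category"}
sample = pd.read_csv(raw, index_col="CONTROLN", dtype=category_columns)

categorical = sample.select_dtypes(include="category").columns.tolist()
print(categorical)
```

Declaring the dtype up front avoids a second pass over the data to convert string columns afterwards.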

In [None]:
categories = learning.select_dtypes(include='category').columns
print(categories)

In [None]:
learning.loc[:, categories].describe().transpose()

### Object Features

The date features are not yet encoded into actual dates. Therefore, they still show as objects.
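For illustration only, the sketch below shows one way such object columns could be turned into dates with `pd.to_datetime`. It assumes the raw values are strings in a YYMM layout; the sample values are hypothetical and the package's own preprocessing may handle this differently.

```python
import pandas as pd

# Hypothetical raw values, assumed to follow a YYMM layout.
raw_dates = pd.Series(["9706", "9512", None], dtype="object")

# errors="coerce" turns unparseable or missing entries into NaT.
parsed = pd.to_datetime(raw_dates, format="%y%m", errors="coerce")
print(parsed.dtype)
```

Since the format carries no day, every parsed date falls on the first of the month.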

In [None]:
objects = learning.select_dtypes(include='object').columns
print(objects)

In [None]:
learning.loc[:, objects].describe()

### Skewness

Pandas calculates the adjusted Fisher-Pearson standardized moment coefficient.
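To make the definition concrete, the sketch below recomputes the coefficient by hand and compares it with `Series.skew()`. The helper name `adjusted_skewness` and the toy sample are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd


def adjusted_skewness(x):
    """Adjusted Fisher-Pearson coefficient G1 of a 1-D sample."""
    x = np.asarray(x, dtype=float)
    n = x.size
    deviations = x - x.mean()
    m2 = np.mean(deviations ** 2)  # biased second central moment
    m3 = np.mean(deviations ** 3)  # biased third central moment
    g1 = m3 / m2 ** 1.5            # unadjusted moment coefficient
    # The sample-size correction turns g1 into G1.
    return np.sqrt(n * (n - 1)) / (n - 2) * g1


sample = pd.Series([1.0, 2.0, 2.0, 3.0, 10.0])
manual = adjusted_skewness(sample)
print(manual, sample.skew())
```

For large samples such as this dataset, the correction factor is very close to 1, so G1 and the unadjusted coefficient nearly coincide.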

In [None]:
learning_preprocessed = data_handler.preprocessed_data

In [None]:
learning_skew = learning_preprocessed.skew()

In [None]:
n = len(learning_preprocessed.index)
# 95 % bound: 1.96 times the standard error of the sample skewness under normality
cb = 1.96*np.sqrt(6*n*(n-1)/((n-2)*(n+1)*(n+3)))
cb

In [None]:
ax = sns.scatterplot(data=learning_skew)
cb_high = ax.axhline(cb, c="red", label="95 \% C.I. for symmetric distribution")
cb_low = ax.axhline(-cb, c="red")
plt.legend(handles=[cb_high])
plt.tick_params(
    axis='x',          # changes apply to the x-axis
    which='both',      # both major and minor ticks are affected
    bottom=False,      # ticks along the bottom edge are off
    top=False,         # ticks along the top edge are off
    labelbottom=False) # labels along the bottom edge are off
save_fig("skewness_numeric_features")

In [None]:
most_skewed = learning_skew[np.abs(learning_skew) > 20].index.values.tolist()
most_skewed

In [None]:
learning_preprocessed.loc[:,most_skewed].hist()

For the categorical features, we plot histograms.
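A simple way to get per-feature frequencies is to call `value_counts` on each categorical column. The toy frame below is a hypothetical stand-in for the categorical columns of the preprocessed data.

```python
import pandas as pd

# Hypothetical toy frame standing in for the categorical columns.
toy = pd.DataFrame({
    "GENDER": pd.Categorical(["F", "M", "F", "F"]),
    "STATE": pd.Categorical(["CA", "CA", "IL", "CA"]),
})

# One frequency table per categorical column.
counts = {col: toy[col].value_counts() for col in toy.columns}
print(counts["GENDER"])

# With matplotlib available, each table can be drawn as a bar chart:
# counts["GENDER"].plot(kind="bar")
```

Each table can then be plotted on its own axis, which keeps features with many levels readable.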

In [None]:
pd.plotting.scatter_matrix(learning_preprocessed.select_dtypes(include="category").apply(pd.Series.value_counts, axis=1))

In [None]:
numeric_features = learning.select_dtypes(include="number")
stat, p_agost = stats.normaltest(numeric_features,axis=0,nan_policy='omit')

In [None]:
agost_skew = pd.DataFrame({'Feature': numeric_features.columns, 'DAgostino_stat': stat, 'DAgostino_pval': p_agost})

In [None]:
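float_features = learning.select_dtypes(include=np.float64)
# stats.anderson does not accept NaNs and reports critical values instead of a p-value
anderson_skew = {col: stats.anderson(float_features[col].dropna()) for col in float_features.columns}

In [None]:
df_skewtest = pd.DataFrame({'Feature': list(anderson_skew.keys()),
                            'Statistic': [res.statistic for res in anderson_skew.values()],
                            'Critical_5pct': [res.critical_values[2] for res in anderson_skew.values()]})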

## Targets

### Binary target
TARGET_B indicates whether an example donated in the current campaign.

In [None]:
fig = sns.barplot(x = [0,1], y = learning.groupby('TARGET_B')['TARGET_B'].count()/len(learning.index),
                  palette=Config.get("color_palette_binary"));
fig.set_xticklabels(["No", "Yes"]);
plt.xlabel("Example has donated");
plt.ylabel("Fraction of examples");
plt.ylim([0,1])
save_fig(fig_id="_ratio_binary");

### Discrete Target
TARGET_D represents the dollar amount donated in the current campaign.

In [13]:
learning.TARGET_D.describe()

count    95412.000000
mean         0.793073
std          4.429725
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max        200.000000
Name: TARGET_D, dtype: float64

In [None]:
learning.TARGET_D = learning.TARGET_D.astype('float64')

In [None]:
fig = sns.distplot(learning.loc[learning.TARGET_D > 0, 'TARGET_D'], bins=50, hist_kws={'alpha': 0.5}, color=Config.get("color_palette")[0])
plt.ylabel("Percentage of donors");
plt.xlabel("Amount donated, \$")
save_fig('target_distribution')

In [None]:
learning.loc[learning.TARGET_D > 0.0, 'TARGET_D'].median()

* The label is imbalanced: roughly 95% of examples did not donate and 5% did
* Most donations are below \$20. The median is \$13
* Spikes are visible at \$5, \$10, \$15, \$20, \$25, \$50, \$100 and \$200
* The distribution is right-skewed

Next, we check the claim from the documentation that donations are positively correlated with the time since the last donation. We plot the time since the last gift against the donation amount for the current campaign. The marker size indicates the total number of times an example has donated so far.

The plot shows that from a lag of ≥ 15 months, donations do increase across the whole range of amounts. The difference is most marked for \$50 and \$100 donations.

For this analysis, we need to transform the raw date features into time differences. This is taken care of during preprocessing, so we briefly load the preprocessed data:
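As a rough sketch of what such a transformation could look like, the helper below counts whole calendar months between a date column and a reference date. The function name, the sample dates and the reference date are assumptions; how the package actually derives `LASTDATE_DELTA_MONTHS` is not shown here.

```python
import pandas as pd


def months_between(earlier, later):
    """Whole calendar months from each date in `earlier` to the timestamp `later`."""
    return (later.year - earlier.dt.year) * 12 + (later.month - earlier.dt.month)


# Hypothetical last-gift dates and an assumed reference date.
last_gift = pd.Series(pd.to_datetime(["1996-02-01", "1997-01-01"]))
reference = pd.Timestamp("1997-06-01")

delta_months = months_between(last_gift, reference)
print(delta_months.tolist())
```

Counting calendar months rather than dividing a day difference by 30 avoids drift from unequal month lengths.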

In [None]:
learning_preproc = data_handler.preprocessed_data

In [None]:
learning_preproc.TARGET_D = learning_preproc.TARGET_D.astype("float64")

In [None]:
learning_preproc.AVGGIFT.describe()

In [None]:
sns.scatterplot(x='LASTDATE_DELTA_MONTHS',y='TARGET_D', size='Nr. of times donated', alpha=0.4, data=learning_preproc.loc[learning_preproc.TARGET_D > 0,:].rename(columns = {"NGIFTALL": "Nr. of times donated"}),
                palette=Config.get("color_palette_binary"), sizes=(1, 200))

plt.xlabel("Months since last donation of active donors");
plt.ylabel("Amount donated, \$");
save_fig(fig_id="donations_vs_time_since_last")

### Socio-economic environment and urbanicity

Donations by living environment (C=City, R=Rural, S=Suburban, T=Town, U=Urban; lowest numbers represent highest socio-economic ranking), comparing major donors with non-major donors.

Surprisingly, one of the top donations came from a rural region of low socio-economic status. Major donors that donated this time are not present in the lowest socio-economic environments.

In [None]:
learning.MAJOR.describe()

In [None]:
learning.MAJOR = learning.MAJOR.map({0: "No", 1: "Yes"})
sns.violinplot(y="TARGET_D", x="DOMAINUrbanicity", hue='Major donor',cut=0, data=learning.loc[learning.TARGET_D > 0,["TARGET_D", "DOMAINUrbanicity", "MAJOR"]].rename(columns = {"MAJOR": "Major donor"}),
               palette=Config.get("color_palette_binary"))
                   
#new_labels = ['No', 'Yes']
#for t, l in zip(g._legend.texts, new_labels): t.set_text(l)

plt.xlabel("Living environment");
plt.ylabel("Amount donated, dollar US");
save_fig(fig_id="donations_vs_living_environment")

All-time donations by environment. The y-axis uses a log scale. We now see that every socio-economic environment also harbours major donors.

In [None]:
fig=sns.boxplot(y="RAMNTALL", x="DOMAINUrbanicity", hue='Major donor', data=learning.rename(columns = {"MAJOR": "Major donor"}),palette=Config.get("color_palette_binary"))
fig.set_yscale('log')
plt.xlabel("Living environment (C: City, R: Rural, S: Suburban, T: Town, U:Urban)");
plt.ylabel("Lifetime amount donated, dollar US");
save_fig(fig_id="all_time_donations_vs_living_environment")

In [None]:
# Named aggregation; nested dict renaming in .agg() is no longer supported by pandas
urb_aggregated = learning.groupby("DOMAINUrbanicity")["RAMNTALL"].agg(
    total_donations="sum",
    average_per_capita="mean"
)
urb_aggregated.total_donations.plot(kind="bar", colormap=Config.get("color_map"))

In [None]:
urb_aggregated

In [None]:
fig=sns.boxplot(y="RAMNTALL", x="DOMAINSocioEconomic", hue='Major donor', data=learning.rename(columns = {"MAJOR": "Major donor"}),palette=Config.get("color_palette_binary"))
fig.set_yscale('log')
plt.xlabel("Socio-economic status (1: highest)");
plt.ylabel("Lifetime amount donated, dollar US");
save_fig(fig_id="all_time_donations_vs_socio_economic")

### Correlations

Since there are so many features, we will only plot those that have a significant correlation.
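One possible way to narrow a large correlation matrix down to the notable pairs is sketched below. The helper `strong_pairs`, the toy data and the threshold of 0.8 are illustrative choices, not values taken from this analysis.

```python
import numpy as np
import pandas as pd


def strong_pairs(frame, threshold=0.8):
    """Return feature pairs whose absolute Pearson correlation exceeds `threshold`."""
    corr = frame.corr()
    # Keep only the strict upper triangle so each pair appears once.
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    return pairs[pairs.abs() > threshold]


toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.5],    # almost a multiple of "a"
    "c": [1.0, -1.0, 1.0, -1.0],  # only weakly related to "a" and "b"
})
selected = strong_pairs(toy)
print(selected)
```

The resulting index lists the feature pairs, which can be used to subset the matrix before passing it to a heatmap.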

In [None]:
corr_all = learning.drop(['TARGET_B','TARGET_D'], axis=1).select_dtypes(include="number").corr()

In [None]:
mask_all = np.zeros_like(corr_all, dtype=bool)
mask_all[np.triu_indices_from(mask_all)] = True

sns.heatmap(corr_all,
            cmap=Config.get("color_map_diverging"), mask=mask_all, vmax=1.0, center = 0.0, square=True,
            linewidths = 0)

### Correlations between numerical features, excluding US census data

In [None]:
data_exclude_census_numeric = learning[learning.columns.difference(dh.US_CENSUS_FEATURES)].select_dtypes(include=["float64"])

In [None]:
data_exclude_census_corr = data_exclude_census_numeric[data_exclude_census_numeric.columns.difference(['TARGET_B','TARGET_D'])].corr()

In [None]:
mask_census = np.zeros_like(data_exclude_census_corr, dtype=bool)
mask_census[np.triu_indices_from(mask_census)] = True

sns.heatmap(data_exclude_census_corr, mask=mask_census, cmap=Config.get("color_map_diverging"), vmax=1.0, center=0,
            square=True, linewidths=.1, cbar_kws={"shrink": .5}, xticklabels=True,yticklabels=True)

### Promotion history correlations

In [None]:
donation_responses = learning.loc[:,dh.GIVING_HISTORY + dh.GIVING_HISTORY_SUMMARY]
multibytes = learning.loc[:,]

In [None]:
promotion_history_features = learning.reindex(columns=dh.PROMO_HISTORY_SUMMARY+dh.PROMO_HISTORY_DATES)
prom_hist_corr = promotion_history_features[promotion_history_features.columns.difference(['TARGET_B','TARGET_D'])].corr()

In [None]:
mask_promo = np.zeros_like(prom_hist_corr, dtype=bool)
mask_promo[np.triu_indices_from(mask_promo)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(20, 20))

sns.heatmap(prom_hist_corr, mask=mask_promo, cmap=Config.get("color_map_diverging"), vmax=1.0, center=0,
            square=True, linewidths=.3, cbar_kws={"shrink": .5}, xticklabels=True,yticklabels=True)
save_fig(fig_id="correlations_promotion_giving_history")

### Giving history correlations

In [None]:
giving_hist_f = dh.GIVING_HISTORY + dh.GIVING_HISTORY_SUMMARY +['LASTDATE_DELTA_MONTHS', 'MINRDATE_DELTA_MONTHS',
       'MAXRDATE_DELTA_MONTHS', 'MAXADATE_DELTA_MONTHS']
giving_history_features = learning.reindex(columns=giving_hist_f)
giving_corr = giving_history_features[giving_history_features.columns.difference(['TARGET_B','TARGET_D'])].corr()

In [None]:
mask_giving = np.zeros_like(giving_corr, dtype=bool)
mask_giving[np.triu_indices_from(mask_giving)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(20, 20))

sns.heatmap(giving_corr, mask=mask_giving, cmap=Config.get("color_map_diverging"), vmax=1.0, center=0,
            square=True, linewidths=.1, cbar_kws={"shrink": .5}, xticklabels=True,yticklabels=True)
save_fig(fig_id="correlations_giving_history")

### Putting donors on a map

In [None]:
num_donors_by_zip = learning[['ZIP', 'TARGET_B']].groupby('ZIP', as_index=False).agg('sum') # number of people who donated
num_members_by_zip = learning[['ZIP', 'TARGET_B']].groupby('ZIP', as_index=False).agg('count') # number of people who are registered at that ZIP
cum_donation_by_zip = learning[['ZIP', 'TARGET_D']].groupby('ZIP', as_index=False).agg('sum')
zip_states = learning[['ZIP','STATE']].drop_duplicates()

In [None]:
data_by_zip = cum_donation_by_zip.merge(num_members_by_zip, on='ZIP').merge(zip_states, on='ZIP')
data_by_zip.columns = ["ZIP", "CumDonation", "MemberCount", "State"]

In [None]:
def rel_donation(row):
    if row.CumDonation != 0.0:
        return row.CumDonation/(1.0 if row.MemberCount == 0.0 else row.MemberCount)
    else:
        return 0.0

data_by_zip['RelDonation'] = data_by_zip.apply(rel_donation,axis=1)

In [None]:
from geopy.geocoders import Here
from geopy.extra.rate_limiter import RateLimiter
from geopy.exc import GeocoderTimedOut

# Create the geocoder once so the rate limiter can track the delay between calls
geolocator = Here(app_id="ZJBxigwxa1QPHlWrtWH6", app_code="OJBun02aepkFbuHmYn1bOg")
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=0.01, max_retries=4)

def do_geo_query(q):
    # Go through the rate limiter, which also retries on geocoder errors such as timeouts
    return geocode(query=q, exactly_one=True)

def get_loc(example):
    if example.ZIP:
        zip_code = str(int(example.ZIP)).rjust(5, '0')  # avoid shadowing the built-in zip
        q = {'postalcode': zip_code, 'state': example.State}
        return do_geo_query(q)
    else:
        return None
    
def extract_coords(location):
    return [location.latitude, location.longitude]

In [None]:
import pickle
from tqdm import tqdm

tqdm.pandas()

zip_data_path = pathlib.Path(Config.get("data_dir"),"zip_data.pkl").resolve()

try:
    # Reuse cached geocoding results if they exist
    with open(zip_data_path, "rb") as zip_data:
        locations = pickle.load(zip_data)
except (OSError, pickle.UnpicklingError):
    # No usable cache: query the geocoder and store the results
    locations = data_by_zip.progress_apply(get_loc, axis=1)
    locations = locations.to_frame(name="location")
    locations['ZIP'] = data_by_zip.ZIP
    with open(zip_data_path, "wb") as zip_data:
        pickle.dump(locations, zip_data)


In [None]:
data_by_zip = data_by_zip.merge(locations, on='ZIP')

In [None]:
data_by_zip.loc[:,'longitude'] = data_by_zip.location.apply(lambda loc: loc.longitude if loc is not None else None)
data_by_zip.loc[:,'latitude'] = data_by_zip.location.apply(lambda loc: loc.latitude if loc is not None else None)

AA, AE and AP are armed-forces postal codes. ZIP codes for these do not map to a fixed location, so we exclude them. We also keep only locations where someone has actually donated, by filtering on CumDonation.

In [None]:
data_by_zip1 = data_by_zip.loc[~data_by_zip.State.isin(['AA','AE','AP']),:]
data_by_zip2 = data_by_zip1.loc[data_by_zip1.CumDonation > 0.0,:]

In [None]:
import cartopy.crs as ccrs
import cartopy.io.img_tiles as cimgt
import cartopy.feature as cfeature
fig = plt.figure(figsize=(20,16))

osm_terrain = cimgt.OSM()


ax = fig.add_subplot(1, 1, 1, projection=osm_terrain.crs)

# Map extent covering the contiguous US
ax.set_extent([-130, -65, 10, 52], crs=ccrs.PlateCarree())
ax.add_image(osm_terrain, 6)

lon = data_by_zip2.longitude
lat = data_by_zip2.latitude
mc = data_by_zip2.MemberCount
cd = data_by_zip2.CumDonation
rd = data_by_zip2.RelDonation

data_by_zip2.plot(kind="scatter",x="longitude",y="latitude",ax=ax,
                  s=cd, c=rd, label="Cumulative Donations",
                  legend=True, alpha=0.4, cmap=Config.get("color_map"),
                  subplots=True, colorbar=True, transform=ccrs.PlateCarree())
            
save_fig(fig_id="donations_geographical")

* Most donations come from the urban areas, especially San Francisco, Los Angeles, Miami, Chicago and Detroit. To a lesser extent, cities like Houston, Dallas, Minneapolis, Atlanta, Tampa, Seattle and Phoenix can be made out.
* Interestingly, the East Coast has not donated, despite featuring some large metropolitan areas like New York, Boston, or Washington

### Categorical features

In [None]:
categories = learning.select_dtypes("category").copy()
target = learning['TARGET_B'].astype("category")
categories = categories.drop('TARGET_B', axis=1, errors='ignore')
#categories['TARGET_B'] = learning.TARGET_B.astype("category")
#categories['TARGET_D'] = learning.TARGET_D
#categories_grouped = categories.groupby('TARGET_B')

In [None]:
len(categories.columns.values)

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
lm = LogisticRegression(C=1e5, solver='lbfgs', multi_class='multinomial')
# One-hot encode the categorical features; np.ndarray() is a constructor, not a converter
lm.fit(pd.get_dummies(categories), y=np.asarray(target))

In [None]:
pd.crosstab(learning.TARGET_D,[learning.INCOME],margins=True)

### The US census data

In [None]:
census = learning[dh.US_CENSUS_FEATURES]
census_corr = census.corr()

In [None]:
mask = np.zeros_like(census_corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True

# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(12, 12))

# Generate a custom diverging colormap
cmap = sns.diverging_palette(220, 10, as_cmap=True)

sns.heatmap(census_corr, mask=mask, cmap=cmap, vmax=1.0, center=0,
            square=True, linewidths=.2, cbar_kws={"shrink": .5})
save_fig(fig_id="correlation_census")

In [None]:
census.select_dtypes(include="int64")

### Income, Wealth and donations

In [None]:
inc_targ = sns.violinplot(x="INCOME", y="TARGET_D", data=learning.loc[learning.TARGET_D > 0.0, ["INCOME","TARGET_D"]])
inc_targ.set_yscale('log')
plt.show()

In [None]:
weal1_targ = sns.violinplot(x="WEALTH1", y="TARGET_D", data=learning.loc[learning.TARGET_D > 0.0, ["WEALTH1","TARGET_D"]])
weal1_targ.set_yscale('log')
plt.show()

In [None]:
weal2_targ = sns.violinplot(x="WEALTH2", y="TARGET_D", data=learning.loc[learning.TARGET_D > 0.0, ["WEALTH2","TARGET_D"]])
weal2_targ.set_yscale('log')
plt.show()

In [None]:
sns.catplot(x="WEALTH2", y="TARGET_D", hue="MAJOR",
            kind="violin", inner="stick", split=True, data=learning.loc[learning.TARGET_D > 0.0,:])

In [None]:
sns.catplot(x="CLUSTER", y="TARGET_D", kind="box", data=learning)

In [None]:
sns.distplot(learning.loc[learning.TARGET_D > 0.0,
                          'TARGET_D'], bins=50, kde=False, rug=True)

In [None]:
learning.select_dtypes(include=np.float64).hist(bins=50, figsize=(50, 50))
# Save before plt.show(), which would otherwise leave an empty figure to be written
save_fig("float_feature_histograms")
plt.show()