# Feature Selection Notebook

## Introduction
The data contains a very large number of features. Depending on the clustering strategy, this can lead to the Curse of Dimensionality. Refer to [this answer on Stack Exchange](https://stats.stackexchange.com/questions/232500/how-do-i-know-my-k-means-clustering-algorithm-is-suffering-from-the-curse-of-dim) for a more in-depth discussion.

## Methodology
One way to select features and reduce dimensionality is to use the information in the customers data that is not present in the general dataset (i.e. the `CUSTOMER_GROUP`, `ONLINE_PURCHASE` and `PRODUCT_GROUP` columns) and choose the features that help the most in segmenting these columns. This way, we keep the features that most likely segment customers and can use them to understand the behaviour of both the general population and the customer population.

In [1]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.feature_selection import SelectKBest, mutual_info_classif
from scipy import stats
import itertools
import os
import pickle

In [2]:
df_customers = pd.read_parquet('data/refined/customers_data.parquet')

# Checking for multicollinearity

We should check for multicollinearity especially among the categorical variables (ordinals are categorical nonetheless), since there are so many of them.  
In this context, too many variables can lead to the Curse of Dimensionality, making it hard to set entries apart in the feature space.  

Comparing approx. 300 variables pairwise would be too expensive computationally (approx. 44850 unique combinations), since every variable has to be evaluated against every other one.  
As a proxy, we will calculate the collinearities within each information level and keep only the most important features of each level. 
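The number of unique combinations quoted above is the number of unordered pairs among roughly 300 variables, i.e. 300 · 299 / 2. A minimal check, where `n_variables` is the rounded figure from the text and not the exact column count:

```python
from math import comb

# Assumption: roughly 300 categorical variables, as stated in the text
n_variables = 300

# Number of unordered variable pairs: n * (n - 1) / 2
n_pairs = comb(n_variables, 2)

print(n_pairs)
```

Working per information level replaces this single large set of pairs with several much smaller ones.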

## Calculating Multicollinearities by Information Group

In [None]:
# We won't use numeric vars to calculate Cramer's V

numeric_vars = ['ANZ_HAUSHALTE_AKTIV',
                'ANZ_HH_TITEL',
                'ANZ_PERSONEN',
                'ANZ_TITEL',
                'GEBURTSJAHR',
                'KBA13_ANZAHL_PKW',
                'MIN_GEBAEUDEJAHR']

In [None]:
num_var_filter = ~col_classification['Attribute'].isin(numeric_vars)

In [None]:
def calculate_cramers_v(arr1, arr2):

    '''
    Calculates Cramer's V for two arrays.
    The value lies between 0 and 1 (inclusive)

    :param arr1: Array of categorical variable
    :param arr2: Array of categorical variable

    :return v: Cramer's V index value
    '''

    crosstab = stats.contingency.crosstab(arr1, arr2)[1]

    chi2 = stats.chi2_contingency(crosstab)[0]

    # calculating the total number of observations
    n = np.sum(crosstab)

    # getting the degrees of freedom
    dof = min(crosstab.shape)-1
    
    # calculating cramer's v
    v = np.sqrt(chi2/(n*dof))

    return v

def calculate_frame_cramer_coefs(dataframe):

    '''
    Calculates pairwise Cramer's V for all possible combinations of categorical variables in 'dataframe'.
    Similar behaviour to pandas' .corr() method.

    :param dataframe: Pandas DataFrame columns with categorical variables

    :return matrix: Pairwise matrix with all variable combinations
    '''
    
    numpy_frame = dataframe.dropna().values

    matrix = np.diag([1.0] * numpy_frame.shape[1])

    table_range = list(range(0,numpy_frame.shape[1]))

    # Getting unique index combinations to minimize iterations
    combos = [combo for combo in itertools.combinations(table_range,2)]

    for i, j in combos:
        
        v = calculate_cramers_v(numpy_frame[:,i], numpy_frame[:,j])

        matrix[i,j] = v

        matrix[j,i] = v

    return matrix
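For reference, the statistic computed above is V = sqrt(chi2 / (n · (min(r, c) − 1))), where r and c are the numbers of categories of the two variables. The sketch below recomputes it in plain Python on toy data, as a sanity check of the formula and not as a replacement for the functions above. `cramers_v_plain` and the toy arrays are illustrative names. Unlike `stats.chi2_contingency` with its default settings, this sketch applies no continuity correction on 2x2 tables.

```python
from collections import Counter
from math import sqrt

def cramers_v_plain(xs, ys):
    """Cramer's V from two equally long sequences of categories (no continuity correction)."""
    n = len(xs)
    joint = Counter(zip(xs, ys))   # observed counts per (x, y) cell
    row_totals = Counter(xs)
    col_totals = Counter(ys)

    chi2 = 0.0
    for r, r_count in row_totals.items():
        for c, c_count in col_totals.items():
            expected = r_count * c_count / n
            observed = joint.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected

    dof = min(len(row_totals), len(col_totals)) - 1
    return sqrt(chi2 / (n * dof))

# Toy data: the second variable is a relabelling of the first
v_identical = cramers_v_plain(list("aabbcc"), list("xxyyzz"))

# Toy data: every combination occurs equally often
v_independent = cramers_v_plain(list("aabb"), list("xyxy"))

print(v_identical, v_independent)
```

Both functions divide by `min(r, c) - 1`, so a column with a single category would lead to a division by zero and is worth filtering out beforehand.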

**WARNING: Running the cells below can take a while. That is why the values are exported to csv, so we can use them later without going through these calculations**

In [None]:
# # Sorting information levels by amount of columns to generate results faster
# sorted_col_classes = col_classification['Information level'].value_counts(ascending=True)

In [None]:
# for i, c in enumerate(sorted_col_classes.index):  # enumerate(['125m x 125m Grid','Household'])

#     relevant_cols = col_classification[(col_classification['Information level'] == c) & num_var_filter]['Attribute']

#     v_matrix = calculate_frame_cramer_coefs(census[relevant_cols])

#     v_frame = pd.DataFrame(v_matrix, columns = relevant_cols, index = relevant_cols)

#     v_frame.name = c

#     v_frame.to_csv(f'data/trusted/{v_frame.name}_cramer.csv', sep = ';')

#     if i == 0:

#         v_frame_list = [v_frame]

#     else:

#         v_frame_list.append(v_frame)

---

In [None]:
DATA_PATH = 'data/trusted/'

In [None]:
cramer_frame_list = []

for file in os.listdir(DATA_PATH):

    # Skipping the original file to use the updated version instead
    if file == 'Microcell (RR3_ID)_cramer.csv':

        continue

    if file.endswith('_cramer.csv'):

        name = file.replace('_cramer.csv','')

        # The updated Microcell file was exported with the default ',' separator
        sep = ',' if file == 'Microcell (RR3_ID)_updated_cramer.csv' else ';'

        frame = pd.read_csv(os.path.join(DATA_PATH, file), sep = sep, index_col = 0)

        frame.name = name
    
        cramer_frame_list.append(frame)
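The cells that follow pick frames by position (`cramer_frame_list[0]`, `cramer_frame_list[1]`, ...). This relies on the order in which `os.listdir` returns the files, which is arbitrary. One possible alternative is to key the files by information level. The sketch below uses a hypothetical helper `index_cramer_files` and dummy files in a temporary folder:

```python
import os
import tempfile

def index_cramer_files(data_path, skip=()):
    """Map information-level names to the paths of their '*_cramer.csv' files."""
    files = {}
    # sorted() makes the iteration order reproducible across machines
    for file in sorted(os.listdir(data_path)):
        if file in skip or not file.endswith('_cramer.csv'):
            continue
        files[file.replace('_cramer.csv', '')] = os.path.join(data_path, file)
    return files

# Dummy folder standing in for the real data folder (assumption for illustration only)
with tempfile.TemporaryDirectory() as tmp:
    for fname in ['Person_cramer.csv', 'Building_cramer.csv', 'notes.txt']:
        open(os.path.join(tmp, fname), 'w').close()

    found = index_cramer_files(tmp)
    names = list(found)

print(names)
```

Each path could then be read with `pd.read_csv` and looked up by name, e.g. `frames['Building']`, instead of by index.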

# Feature Selection
Based on multicollinearity and the business definitions of the columns within each group

## 125m x 125m Grid
For this information level, columns that relate neither to banking activities nor, at least marginally, to the client's business (mail-order organics) will be dropped and not analyzed.

In [None]:
# Keeping columns related to banking activities and related to clients' businesses
grid_cols_list = ['D19_BANKEN_DIREKT',
                  'D19_BANKEN_GROSS',
                  'D19_BANKEN_LOKAL',
                  'D19_BANKEN_REST',
                  'D19_BIO_OEKO',
                  'D19_DIGIT_SERV',
                  'D19_LEBENSMITTEL',
                  'D19_VOLLSORTIMENT',
                  'D19_VERSAND_REST']

In [None]:
cramer_frame_list[0].columns

Index(['D19_BANKEN_DIREKT', 'D19_BANKEN_GROSS', 'D19_BANKEN_LOKAL',
       'D19_BANKEN_REST', 'D19_BEKLEIDUNG_GEH', 'D19_BEKLEIDUNG_REST',
       'D19_BIO_OEKO', 'D19_BILDUNG', 'D19_DIGIT_SERV', 'D19_DROGERIEARTIKEL',
       'D19_ENERGIE', 'D19_FREIZEIT', 'D19_GARTEN', 'D19_HANDWERK',
       'D19_HAUS_DEKO', 'D19_KINDERARTIKEL', 'D19_KOSMETIK',
       'D19_LEBENSMITTEL', 'D19_NAHRUNGSERGAENZUNG', 'D19_RATGEBER',
       'D19_REISEN', 'D19_SAMMELARTIKEL', 'D19_SCHUHE', 'D19_SONSTIGE',
       'D19_TECHNIK', 'D19_TELKO_MOBILE', 'D19_TELKO_REST', 'D19_TIERARTIKEL',
       'D19_VERSICHERUNGEN', 'D19_VOLLSORTIMENT', 'D19_VERSAND_REST',
       'D19_WEIN_FEINKOST', 'D19_BUCH_CD', 'D19_LETZTER_KAUF_BRANCHE',
       'D19_LOTTO', 'D19_SOZIALES'],
      dtype='object')

In [None]:
cramer_frame_list[0].loc[grid_cols_list, grid_cols_list].style.background_gradient()

Attribute,D19_BANKEN_DIREKT,D19_BANKEN_GROSS,D19_BANKEN_LOKAL,D19_BANKEN_REST,D19_BIO_OEKO,D19_DIGIT_SERV,D19_LEBENSMITTEL,D19_VOLLSORTIMENT,D19_VERSAND_REST
D19_BANKEN_DIREKT,1.0,0.407112,0.059448,0.15186,0.118379,0.068619,0.074993,0.07042,0.133773
D19_BANKEN_GROSS,0.407112,1.0,0.032744,0.158219,0.015708,0.040695,0.037608,0.062099,0.090168
D19_BANKEN_LOKAL,0.059448,0.032744,1.0,0.058339,0.071716,0.02773,0.050515,0.027842,0.046358
D19_BANKEN_REST,0.15186,0.158219,0.058339,1.0,0.090558,0.041225,0.064007,0.056588,0.102916
D19_BIO_OEKO,0.118379,0.015708,0.071716,0.090558,1.0,0.078483,0.117177,0.055483,0.09963
D19_DIGIT_SERV,0.068619,0.040695,0.02773,0.041225,0.078483,1.0,0.04093,0.049721,0.09127
D19_LEBENSMITTEL,0.074993,0.037608,0.050515,0.064007,0.117177,0.04093,1.0,0.078803,0.085009
D19_VOLLSORTIMENT,0.07042,0.062099,0.027842,0.056588,0.055483,0.049721,0.078803,1.0,0.102574
D19_VERSAND_REST,0.133773,0.090168,0.046358,0.102916,0.09963,0.09127,0.085009,0.102574,1.0


Even though direct banking has some correlation with big banks, they convey different information that might be interesting to the user.  
For instance, someone who uses direct banking might be a more tech-savvy user. This gives us a different segment indication than simply assuming a user uses big banks.

## Buildings

In [None]:
cramer_frame_list[1].name

'Building'

In [None]:
cramer_frame_list[1].style.background_gradient()

Attribute,GEBAEUDETYP,KBA05_HERSTTEMP,KBA05_MODTEMP,KONSUMNAEHE,OST_WEST_KZ,WOHNLAGE
GEBAEUDETYP,1.0,0.058127,0.057239,0.146835,0.054584,0.071675
KBA05_HERSTTEMP,0.058127,1.0,0.505993,0.081404,0.255567,0.303144
KBA05_MODTEMP,0.057239,0.505993,1.0,0.07329,0.099981,0.298161
KONSUMNAEHE,0.146835,0.081404,0.07329,1.0,0.109692,0.167219
OST_WEST_KZ,0.054584,0.255567,0.099981,0.109692,1.0,0.094349
WOHNLAGE,0.071675,0.303144,0.298161,0.167219,0.094349,1.0


`KBA05_HERSTTEMP` and `KBA05_MODTEMP` seem to share information. They are more closely related to auto manufacturing than to actual characteristics of the respondents themselves, so they may fit better in the RR3_ID class.  

Either way, it makes sense that car brands and segments are correlated with each other as well as with the living conditions in a neighbourhood (`WOHNLAGE`).  

No other columns from the group will be dropped.

In [None]:
buildings_col_list = list(cramer_frame_list[1].columns.drop(['KBA05_HERSTTEMP','KBA05_MODTEMP']))

## RR3_ID

In [None]:
rr3_list = list(col_classification[(col_classification['Information level'] ==  'Microcell (RR3_ID)') & num_var_filter]['Attribute'].values) \
                + ['KBA05_HERSTTEMP','KBA05_MODTEMP']

In [None]:
# v_matrix = calculate_frame_cramer_coefs(census[rr3_list])

# v_frame = pd.DataFrame(v_matrix, columns = rr3_list, index = rr3_list)

In [None]:
# v_frame.to_csv('data/trusted/Microcell (RR3_ID)_updated_cramer.csv')

In [None]:
v_frame = cramer_frame_list[4]

In [None]:
# Checking variables that can have important multicollinearities
((v_frame >= 0.3) & (v_frame < 1)).sum().sort_values(ascending = False).head(10)

KBA05_MAXHERST    9
KBA05_MOTOR       9
KBA05_MAXSEG      7
KBA05_KW3         7
KBA05_CCM1        6
KBA05_MAXBJ       6
KBA05_KRSKLEIN    6
KBA05_MOD1        6
KBA05_SEG2        5
KBA05_MAXVORB     5
dtype: int64

In [None]:
# Looking into some examples
v_frame[(v_frame >= 0.3) & (v_frame < 1)][['KBA05_MAXHERST','KBA05_MOTOR','KBA05_MAXSEG']]

Unnamed: 0,KBA05_MAXHERST,KBA05_MOTOR,KBA05_MAXSEG
KBA05_AUTOQUOT,,,
KBA05_BAUMAX,,,
KBA05_CCM1,,0.463402,0.301857
KBA05_CCM2,,0.433955,
KBA05_CCM3,,0.401816,
KBA05_CCM4,,0.396834,
KBA05_DIESEL,,,
KBA05_FRAU,,,
KBA05_GBZ,,,
KBA05_HERST1,0.403921,,


From these examples we can see that the correlations occur frequently between variables and the variables they are aggregated into. This is somewhat expected and can be verified in the variables' descriptions in their documentation.  
To reduce the number of variables, those that aggregate information will be kept. If there are correlations within this subset of variables, another selection will be made.
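To make the pattern explicit, the strong values from the output above can be grouped by the column they were inspected against. The sketch below copies those values by hand. Treating the inspected columns as the aggregates and the row labels as their components is an assumption based on which columns are kept afterwards. `strong_values` and `components_by_aggregate` are illustrative names.

```python
from collections import defaultdict

# Values copied from the output shown above: (component, aggregate) -> Cramer's V
strong_values = {
    ('KBA05_CCM1', 'KBA05_MOTOR'): 0.463402,
    ('KBA05_CCM2', 'KBA05_MOTOR'): 0.433955,
    ('KBA05_CCM3', 'KBA05_MOTOR'): 0.401816,
    ('KBA05_CCM4', 'KBA05_MOTOR'): 0.396834,
    ('KBA05_CCM1', 'KBA05_MAXSEG'): 0.301857,
    ('KBA05_HERST1', 'KBA05_MAXHERST'): 0.403921,
}

# Same threshold as in the cells above
THRESHOLD = 0.3

# Collect, for each aggregate column, the components strongly associated with it
components_by_aggregate = defaultdict(list)
for (component, aggregate), v in strong_values.items():
    if THRESHOLD <= v < 1:
        components_by_aggregate[aggregate].append(component)

for aggregate, components in components_by_aggregate.items():
    print(aggregate, components)
```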

In [None]:
kept = ['KBA05_AUTOQUOT',
        'KBA05_DIESEL',
        'KBA05_FRAU',
        'KBA05_GBZ',
        'KBA05_KRSAQUOT',
        'KBA05_KRSKLEIN',
        'KBA05_KRSOBER',
        'KBA05_KRSVAN',
        'KBA05_KRSZUL',
        'KBA05_MAXAH',
        'KBA05_MAXBJ',
        'KBA05_MAXHERST',
        'KBA05_MAXSEG',
        'KBA05_MAXVORB',
        'KBA05_MOTOR',
        'KBA05_MOTRAD',
        'KBA05_HERSTTEMP',
        'KBA05_MODTEMP']

In [None]:
v_frame.loc[kept, kept].style.background_gradient()

Unnamed: 0,KBA05_AUTOQUOT,KBA05_DIESEL,KBA05_FRAU,KBA05_GBZ,KBA05_KRSAQUOT,KBA05_KRSKLEIN,KBA05_KRSOBER,KBA05_KRSVAN,KBA05_KRSZUL,KBA05_MAXAH,KBA05_MAXBJ,KBA05_MAXHERST,KBA05_MAXSEG,KBA05_MAXVORB,KBA05_MOTOR,KBA05_MOTRAD,KBA05_HERSTTEMP,KBA05_MODTEMP
KBA05_AUTOQUOT,1.0,0.204132,0.150679,0.366311,0.508323,0.152027,0.219448,0.118688,0.17855,0.176657,0.075224,0.095022,0.044824,0.134825,0.099243,0.269872,0.091483,0.058182
KBA05_DIESEL,0.204132,1.0,0.080456,0.162997,0.149412,0.112813,0.104329,0.066027,0.128905,0.071707,0.091265,0.124947,0.083041,0.048032,0.220788,0.13553,0.12159,0.054358
KBA05_FRAU,0.150679,0.080456,1.0,0.122057,0.121396,0.156388,0.101524,0.052828,0.079196,0.059067,0.051937,0.036304,0.095453,0.074177,0.084803,0.097353,0.027916,0.050964
KBA05_GBZ,0.366311,0.162997,0.122057,1.0,0.270407,0.125568,0.171121,0.100433,0.150499,0.147698,0.066585,0.101861,0.03674,0.132081,0.110259,0.311013,0.103281,0.064825
KBA05_KRSAQUOT,0.508323,0.149412,0.121396,0.270407,1.0,0.128534,0.160946,0.166592,0.149016,0.154233,0.056927,0.072677,0.032348,0.108186,0.09226,0.219807,0.076031,0.043046
KBA05_KRSKLEIN,0.152027,0.112813,0.156388,0.125568,0.128534,1.0,0.133933,0.05764,0.063904,0.074486,0.078501,0.152645,0.475631,0.047797,0.309265,0.091746,0.107418,0.233877
KBA05_KRSOBER,0.219448,0.104329,0.101524,0.171121,0.160946,0.133933,1.0,0.082176,0.069702,0.100968,0.044349,0.168331,0.286847,0.054913,0.254888,0.094268,0.090535,0.141617
KBA05_KRSVAN,0.118688,0.066027,0.052828,0.100433,0.166592,0.05764,0.082176,1.0,0.085692,0.067214,0.077449,0.073308,0.037588,0.049195,0.069301,0.059701,0.073223,0.036308
KBA05_KRSZUL,0.17855,0.128905,0.079196,0.150499,0.149016,0.063904,0.069702,0.085692,1.0,0.099739,0.345793,0.062154,0.055741,0.239928,0.073824,0.09822,0.048187,0.036793
KBA05_MAXAH,0.176657,0.071707,0.059067,0.147698,0.154233,0.074486,0.100968,0.067214,0.099739,1.0,0.070591,0.053639,0.04302,0.209747,0.067941,0.101194,0.053937,0.039874


In [None]:
to_drop = ['KBA05_GBZ','KBA05_AUTOQUOT','KBA05_MAXSEG','KBA05_HERSTTEMP']

Dropped:

- 'KBA05_GBZ': The number of buildings is correlated with the main type of building. The type of building is more interesting for understanding clusters
- 'KBA05_AUTOQUOT': Correlated with `KBA05_KRSAQUOT`, which conveys the same information
- 'KBA05_MAXSEG': Correlates with too many other features
- 'KBA05_HERSTTEMP': Conveys similar information to `KBA05_MAXHERST`, and is also strongly correlated with other variables

In [None]:
# Preserving the order of `kept` (a set difference would not)
rr3_col_list = [col for col in kept if col not in to_drop]

## Community 

Community actually contains continuous variables, so these will be kept, with the exception of `RELAT_AB`, since, judging by the variables' descriptions, it conveys similar information to `ARBEIT`.

In [None]:
community_col_list = list(cramer_frame_list[2].columns.drop(['RELAT_AB']))

## Household

In [None]:
cramer_frame_list[3].name

'Household'

Looking into the columns in this category, we see that there are some themes we would like to avoid, such as date markers for businesses unrelated to the one we want to segment our customers for. E.g. the recency of transactions with telecommunications businesses (`D19_TELKO_DATUM`) might not be of interest to a mail-order organics company.  
We therefore keep only the columns that might be pertinent to our case. 

In [None]:
cramer_frame_list[3].columns

Index(['ALTER_HH', 'HH_EINKOMMEN_SCORE', 'D19_KONSUMTYP',
       'D19_GESAMT_OFFLINE_DATUM', 'D19_GESAMT_ONLINE_DATUM',
       'D19_GESAMT_DATUM', 'D19_BANKEN_OFFLINE_DATUM',
       'D19_BANKEN_ONLINE_DATUM', 'D19_BANKEN_DATUM',
       'D19_TELKO_OFFLINE_DATUM', 'D19_TELKO_ONLINE_DATUM', 'D19_TELKO_DATUM',
       'D19_VERSAND_OFFLINE_DATUM', 'D19_VERSAND_ONLINE_DATUM',
       'D19_VERSAND_DATUM', 'D19_VERSI_OFFLINE_DATUM',
       'D19_VERSI_ONLINE_DATUM', 'D19_VERSI_DATUM',
       'D19_GESAMT_ONLINE_QUOTE_12', 'D19_BANKEN_ONLINE_QUOTE_12',
       'D19_VERSAND_ONLINE_QUOTE_12', 'W_KEIT_KIND_HH', 'WOHNDAUER_2008',
       'D19_BANKEN_ANZ_12', 'D19_BANKEN_ANZ_24', 'D19_GESAMT_ANZ_12',
       'D19_GESAMT_ANZ_24', 'D19_KONSUMTYP_MAX', 'D19_TELKO_ANZ_12',
       'D19_TELKO_ANZ_24', 'D19_TELKO_ONLINE_QUOTE_12', 'D19_VERSAND_ANZ_12',
       'D19_VERSAND_ANZ_24', 'D19_VERSI_ANZ_12', 'D19_VERSI_ANZ_24',
       'D19_VERSI_ONLINE_QUOTE_12', 'ANZ_KINDER', 'ANZ_STATISTISCHE_HAUSHALTE',
       'STRUKT

In [None]:
household_cols_list = [#'ALTER_HH',
                        'HH_EINKOMMEN_SCORE',
                        'D19_KONSUMTYP',
                        'D19_GESAMT_OFFLINE_DATUM',
                        'D19_GESAMT_ONLINE_DATUM',
                        'D19_GESAMT_DATUM',
                        'D19_BANKEN_DATUM',
                        'D19_VERSAND_OFFLINE_DATUM',
                        'D19_VERSAND_ONLINE_DATUM',
                        'D19_VERSAND_DATUM',
                        'D19_GESAMT_ONLINE_QUOTE_12',
                        'D19_BANKEN_ONLINE_QUOTE_12',
                        'D19_VERSAND_ONLINE_QUOTE_12',
                        'W_KEIT_KIND_HH',
                        'WOHNDAUER_2008',
                        'D19_GESAMT_ANZ_12',
                        'D19_GESAMT_ANZ_24',
                        'D19_KONSUMTYP_MAX',
                        'D19_VERSAND_ANZ_12',
                        'D19_VERSAND_ANZ_24',
                        'ANZ_KINDER',
                        'ANZ_STATISTISCHE_HAUSHALTE',
                        'STRUKTURTYP']

In [None]:
cramer_frame_list[3].loc[household_cols_list, household_cols_list].style.background_gradient()

Attribute,HH_EINKOMMEN_SCORE,D19_KONSUMTYP,D19_GESAMT_OFFLINE_DATUM,D19_GESAMT_ONLINE_DATUM,D19_GESAMT_DATUM,D19_BANKEN_DATUM,D19_VERSAND_OFFLINE_DATUM,D19_VERSAND_ONLINE_DATUM,D19_VERSAND_DATUM,D19_GESAMT_ONLINE_QUOTE_12,D19_BANKEN_ONLINE_QUOTE_12,D19_VERSAND_ONLINE_QUOTE_12,W_KEIT_KIND_HH,WOHNDAUER_2008,D19_GESAMT_ANZ_12,D19_GESAMT_ANZ_24,D19_KONSUMTYP_MAX,D19_VERSAND_ANZ_12,D19_VERSAND_ANZ_24,ANZ_KINDER,ANZ_STATISTISCHE_HAUSHALTE,STRUKTURTYP
HH_EINKOMMEN_SCORE,1.0,0.109993,0.084898,0.074926,0.085503,0.032254,0.095939,0.075044,0.086865,0.070464,0.017329,0.068709,0.085683,0.093393,0.081428,0.085912,0.114336,0.080196,0.085344,0.040225,0.338323,0.138857
D19_KONSUMTYP,0.109993,1.0,0.206661,0.345968,0.391414,0.227821,0.197514,0.327225,0.352443,0.294061,0.141978,0.27905,0.152508,0.052622,0.374447,0.458377,0.74455,0.331017,0.391096,0.086162,0.100348,0.058781
D19_GESAMT_OFFLINE_DATUM,0.084898,0.206661,1.0,0.122709,0.273222,0.11657,0.753641,0.124248,0.260968,0.279688,0.073582,0.236218,0.111421,0.029056,0.197412,0.209739,0.216006,0.188778,0.205324,0.046302,0.070147,0.062609
D19_GESAMT_ONLINE_DATUM,0.074926,0.345968,0.122709,1.0,0.674477,0.246174,0.109879,0.843285,0.737979,0.339493,0.11811,0.311374,0.142615,0.069638,0.376851,0.389914,0.376182,0.37887,0.389936,0.068648,0.062023,0.036718
D19_GESAMT_DATUM,0.085503,0.391414,0.273222,0.674477,1.0,0.228058,0.210135,0.583628,0.666775,0.276448,0.101107,0.25518,0.133785,0.070114,0.457808,0.483749,0.43858,0.385687,0.397986,0.063048,0.070119,0.045972
D19_BANKEN_DATUM,0.032254,0.227821,0.11657,0.246174,0.228058,1.0,0.066575,0.105578,0.103062,0.138518,0.292082,0.09175,0.097191,0.042776,0.201792,0.208667,0.281201,0.118182,0.129112,0.042613,0.030468,0.013834
D19_VERSAND_OFFLINE_DATUM,0.095939,0.197514,0.753641,0.109879,0.210135,0.066575,1.0,0.116903,0.298301,0.220263,0.037294,0.243959,0.111092,0.04011,0.170674,0.180445,0.202984,0.183711,0.197171,0.045891,0.078022,0.080159
D19_VERSAND_ONLINE_DATUM,0.075044,0.327225,0.124248,0.843285,0.583628,0.105578,0.116903,1.0,0.857206,0.307635,0.060432,0.33337,0.143062,0.0588,0.349427,0.359242,0.347632,0.40065,0.410657,0.069175,0.061151,0.036149
D19_VERSAND_DATUM,0.086865,0.352443,0.260968,0.737979,0.666775,0.103062,0.298301,0.857206,1.0,0.288155,0.058626,0.310035,0.138494,0.045751,0.367758,0.380027,0.378542,0.427687,0.444283,0.065187,0.069692,0.049026
D19_GESAMT_ONLINE_QUOTE_12,0.070464,0.294061,0.279688,0.339493,0.276448,0.138518,0.220263,0.307635,0.288155,1.0,0.132469,0.665813,0.127713,0.027609,0.388339,0.336139,0.305934,0.38273,0.332549,0.058907,0.054444,0.038701


Online and offline data seem to be correlated, as do the 12 and 24 months data. For the first case, the aggregate date columns (the `_DATUM` columns without "ONLINE" or "OFFLINE" in their names) will be kept; for the latter, the 12 months columns. 12 months is chosen because we want to segment the database into possible customers immediately, so more recent data is more interesting than longer time periods.
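The same rule can also be expressed on the column names. The sketch below assumes the naming pattern seen in this level: `_ONLINE_DATUM` / `_OFFLINE_DATUM` suffixes for the channel-specific dates and `_ANZ_24` for the 24-month counts. `candidate_cols` repeats the columns selected earlier, and `is_redundant` is a hypothetical helper.

```python
# Columns selected earlier for the Household level
candidate_cols = ['HH_EINKOMMEN_SCORE', 'D19_KONSUMTYP',
                  'D19_GESAMT_OFFLINE_DATUM', 'D19_GESAMT_ONLINE_DATUM', 'D19_GESAMT_DATUM',
                  'D19_BANKEN_DATUM',
                  'D19_VERSAND_OFFLINE_DATUM', 'D19_VERSAND_ONLINE_DATUM', 'D19_VERSAND_DATUM',
                  'D19_GESAMT_ONLINE_QUOTE_12', 'D19_BANKEN_ONLINE_QUOTE_12',
                  'D19_VERSAND_ONLINE_QUOTE_12',
                  'W_KEIT_KIND_HH', 'WOHNDAUER_2008',
                  'D19_GESAMT_ANZ_12', 'D19_GESAMT_ANZ_24', 'D19_KONSUMTYP_MAX',
                  'D19_VERSAND_ANZ_12', 'D19_VERSAND_ANZ_24',
                  'ANZ_KINDER', 'ANZ_STATISTISCHE_HAUSHALTE', 'STRUKTURTYP']

# Assumed suffixes marking channel-specific dates and 24-month counts
REDUNDANT_SUFFIXES = ('_ONLINE_DATUM', '_OFFLINE_DATUM', '_ANZ_24')

def is_redundant(col):
    """True for columns covered by an aggregate date column or a 12-month column."""
    return col.endswith(REDUNDANT_SUFFIXES)

# Order-preserving filter
reduced_cols = [col for col in candidate_cols if not is_redundant(col)]

print(len(candidate_cols), '->', len(reduced_cols))
```

Matching on the full suffix keeps the `_ONLINE_QUOTE_12` columns, which contain "ONLINE" but are not date markers.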

In [None]:
household_cols_list = [#'ALTER_HH',
                        'HH_EINKOMMEN_SCORE',
                        'D19_KONSUMTYP',
                        'D19_GESAMT_DATUM',
                        'D19_BANKEN_DATUM',
                        'D19_VERSAND_DATUM',
                        'D19_GESAMT_ONLINE_QUOTE_12',
                        'D19_BANKEN_ONLINE_QUOTE_12',
                        'D19_VERSAND_ONLINE_QUOTE_12',
                        'W_KEIT_KIND_HH',
                        'WOHNDAUER_2008',
                        'D19_GESAMT_ANZ_12',
                        'D19_KONSUMTYP_MAX',
                        'D19_VERSAND_ANZ_12',
                        'ANZ_KINDER',
                        'ANZ_STATISTISCHE_HAUSHALTE',
                        'STRUKTURTYP']

In [None]:
cramer_frame_list[3].loc[household_cols_list, household_cols_list].style.background_gradient()

Attribute,HH_EINKOMMEN_SCORE,D19_KONSUMTYP,D19_GESAMT_DATUM,D19_BANKEN_DATUM,D19_VERSAND_DATUM,D19_GESAMT_ONLINE_QUOTE_12,D19_BANKEN_ONLINE_QUOTE_12,D19_VERSAND_ONLINE_QUOTE_12,W_KEIT_KIND_HH,WOHNDAUER_2008,D19_GESAMT_ANZ_12,D19_KONSUMTYP_MAX,D19_VERSAND_ANZ_12,ANZ_KINDER,ANZ_STATISTISCHE_HAUSHALTE,STRUKTURTYP
HH_EINKOMMEN_SCORE,1.0,0.109993,0.085503,0.032254,0.086865,0.070464,0.017329,0.068709,0.085683,0.093393,0.081428,0.114336,0.080196,0.040225,0.338323,0.138857
D19_KONSUMTYP,0.109993,1.0,0.391414,0.227821,0.352443,0.294061,0.141978,0.27905,0.152508,0.052622,0.374447,0.74455,0.331017,0.086162,0.100348,0.058781
D19_GESAMT_DATUM,0.085503,0.391414,1.0,0.228058,0.666775,0.276448,0.101107,0.25518,0.133785,0.070114,0.457808,0.43858,0.385687,0.063048,0.070119,0.045972
D19_BANKEN_DATUM,0.032254,0.227821,0.228058,1.0,0.103062,0.138518,0.292082,0.09175,0.097191,0.042776,0.201792,0.281201,0.118182,0.042613,0.030468,0.013834
D19_VERSAND_DATUM,0.086865,0.352443,0.666775,0.103062,1.0,0.288155,0.058626,0.310035,0.138494,0.045751,0.367758,0.378542,0.427687,0.065187,0.069692,0.049026
D19_GESAMT_ONLINE_QUOTE_12,0.070464,0.294061,0.276448,0.138518,0.288155,1.0,0.132469,0.665813,0.127713,0.027609,0.388339,0.305934,0.38273,0.058907,0.054444,0.038701
D19_BANKEN_ONLINE_QUOTE_12,0.017329,0.141978,0.101107,0.292082,0.058626,0.132469,1.0,0.059577,0.055191,0.016031,0.168974,0.193992,0.075111,0.026605,0.01838,0.009675
D19_VERSAND_ONLINE_QUOTE_12,0.068709,0.27905,0.25518,0.09175,0.310035,0.665813,0.059577,1.0,0.12631,0.027734,0.353018,0.283155,0.414297,0.058845,0.052516,0.036801
W_KEIT_KIND_HH,0.085683,0.152508,0.133785,0.097191,0.138494,0.127713,0.055191,0.12631,1.0,0.087741,0.14068,0.140693,0.131627,0.447214,0.114097,0.074624
WOHNDAUER_2008,0.093393,0.052622,0.070114,0.042776,0.045751,0.027609,0.016031,0.027734,0.087741,1.0,0.018202,0.059265,0.016105,0.031316,0.087382,0.05352


There are some other strong correlations, but considering that the data itself has a considerable amount of `NaN` values, it is interesting to keep these redundancies at some level to get homogeneous clusters.  
The only other correlation that will be handled is the one between `D19_KONSUMTYP` and `D19_KONSUMTYP_MAX`, since they convey the same information. The "max" column will be dropped.

In [None]:
household_cols_list.remove('D19_KONSUMTYP_MAX')

## RR4_ID

In [None]:
cramer_frame_list[5]

Attribute,CAMEO_DEUG_2015,CAMEO_DEU_2015,KBA05_ALTER1,KBA05_ALTER2,KBA05_ALTER3,KBA05_ALTER4,KBA05_ANHANG,KBA05_ANTG1,KBA05_ANTG2,KBA05_ANTG3,KBA05_ANTG4,CAMEO_INTL_2015
CAMEO_DEUG_2015,1.0,1.0,0.156121,0.080368,0.096975,0.112977,0.245989,0.307384,0.195404,0.207195,0.275275,0.812627
CAMEO_DEU_2015,1.0,1.0,0.1774,0.146382,0.115171,0.178524,0.266768,0.321341,0.210685,0.220738,0.288699,1.0
KBA05_ALTER1,0.156121,0.1774,1.0,0.069837,0.138553,0.207337,0.159244,0.173502,0.119041,0.14037,0.155803,0.167059
KBA05_ALTER2,0.080368,0.146382,0.069837,1.0,0.270582,0.276089,0.0964,0.095485,0.071673,0.076166,0.093293,0.13613
KBA05_ALTER3,0.096975,0.115171,0.138553,0.270582,1.0,0.156176,0.116907,0.118914,0.085626,0.094501,0.105175,0.102966
KBA05_ALTER4,0.112977,0.178524,0.207337,0.276089,0.156176,1.0,0.140397,0.134159,0.097115,0.098978,0.158742,0.164476
KBA05_ANHANG,0.245989,0.266768,0.159244,0.0964,0.116907,0.140397,1.0,0.318594,0.244595,0.223573,0.244924,0.255598
KBA05_ANTG1,0.307384,0.321341,0.173502,0.095485,0.118914,0.134159,0.318594,1.0,0.367649,0.412086,0.398888,0.31552
KBA05_ANTG2,0.195404,0.210685,0.119041,0.071673,0.085626,0.097115,0.244595,0.367649,1.0,0.344843,0.363847,0.203466
KBA05_ANTG3,0.207195,0.220738,0.14037,0.076166,0.094501,0.098978,0.223573,0.412086,0.344843,1.0,0.290603,0.214638


The KBA columns are misclassified (they should be in RR3). They will be kept, since they can convey useful information regarding the economic power of a respondent.

In [None]:
cramer_frame_list[5].loc[['CAMEO_DEUG_2015','CAMEO_DEU_2015','CAMEO_INTL_2015'],
                            ['CAMEO_DEUG_2015','CAMEO_DEU_2015','CAMEO_INTL_2015']]

Attribute,CAMEO_DEUG_2015,CAMEO_DEU_2015,CAMEO_INTL_2015
CAMEO_DEUG_2015,1.0,1.0,0.812627
CAMEO_DEU_2015,1.0,1.0,1.0
CAMEO_INTL_2015,0.812627,1.0,1.0


From their definitions, it is natural that these columns are highly correlated. We will drop only the international classification, so that we keep a uniform definition of German demographics in case they are useful for interpreting clusters later on.
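A Cramér's V of 1.0 is what the statistic gives when one column fully determines the other, as is plausible for a fine-grained code nested inside a coarser one. The sketch below shows a quick way to check such nesting directly. The toy codes are hypothetical, not taken from the data, and `determines` is an illustrative helper.

```python
def determines(fine, coarse):
    """Return True if every value of `fine` maps to exactly one value of `coarse`."""
    mapping = {}
    for f, c in zip(fine, coarse):
        # setdefault stores the first coarse value seen for this fine value
        if mapping.setdefault(f, c) != c:
            return False
    return True

# Hypothetical toy codes: the first character of the fine code is the coarse group
fine_codes = ['1A', '1B', '2A', '2B', '1A', '2B']
coarse_codes = [code[0] for code in fine_codes]

nested = determines(fine_codes, coarse_codes)    # fine code -> coarse group
reverse = determines(coarse_codes, fine_codes)   # coarse group -> fine code

print(nested, reverse)
```

When the check holds in one direction only, the finer column carries all the information of the coarser one, but not the other way around.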

In [None]:
rr4_col_list = list(cramer_frame_list[5].columns.drop('CAMEO_INTL_2015'))

## Person

The definitions of the columns in this category show how diverse the features are. We will therefore only look for features that correlate strongly with others and would not add much to our model.

In [None]:
cramer_frame_list[6].name

'Person'

In [None]:
((cramer_frame_list[6] > 0.3) & (cramer_frame_list[6] < 1)).sum().sort_values(ascending = False)

SEMIO_KULT               20
SEMIO_ERL                20
SEMIO_REL                19
ANREDE_KZ                18
SEMIO_TRADV              18
SEMIO_KAEM               18
SEMIO_RAT                18
SEMIO_DOM                17
ALTERSKATEGORIE_GROB     17
SEMIO_KRIT               16
SEMIO_VERT               15
SEMIO_FAM                13
SEMIO_SOZ                12
CJT_TYP_1                12
SEMIO_MAT                12
AGER_TYP                 12
HEALTH_TYP               12
LP_LEBENSPHASE_FEIN      11
SEMIO_PFLICHT            10
SEMIO_LUST                9
SHOPPER_TYP               8
PRAEGENDE_JUGENDJAHRE     8
LP_LEBENSPHASE_GROB       8
CJT_TYP_5                 7
FINANZ_MINIMALIST         7
CJT_TYP_2                 7
CJT_TYP_3                 7
FINANZ_HAUSBAUER          7
VERS_TYP                  7
GREEN_AVANTGARDE          7
CJT_TYP_6                 7
CJT_TYP_4                 6
ALTERSKATEGORIE_FEIN      6
LP_STATUS_GROB            6
LP_STATUS_FEIN            5
ZABEOTYP            

In [None]:
(cramer_frame_list[6][['SEMIO_KULT',
                    'SEMIO_ERL', 
                    'SEMIO_REL']] > 0.3).style.highlight_max(color = 'green')

Attribute,SEMIO_KULT,SEMIO_ERL,SEMIO_REL
AGER_TYP,True,True,True
ALTERSKATEGORIE_GROB,True,True,True
ANREDE_KZ,True,True,True
CJT_GESAMTTYP,False,False,False
FINANZ_MINIMALIST,False,False,False
FINANZ_SPARER,False,False,False
FINANZ_VORSORGER,False,False,False
FINANZ_ANLEGER,False,False,False
FINANZ_UNAUFFAELLIGER,False,False,False
FINANZ_HAUSBAUER,False,False,False


We can see from the examples that the correlations show up mainly with gender and age, as well as with other drivers.  
This mainly means that the drivers (`SEMIO_`) are associated with certain demographics, and we can expect some patterns to emerge from the drivers themselves.  
This is important information for a clustering exercise.  

Considering the general relevance of the variables to segment the demographics, we will keep all variables in this level except `AGER_TYP`, which the cell below drops.

In [None]:
person_col_list = list(cramer_frame_list[6].columns.drop('AGER_TYP'))

## PLZ8

From the definitions, we will initially drop columns related to very specific attributes of a car (a specific manufacturer, for instance) and then check for possible correlation groups, such as the `CCM`, `KMH` and `KW` columns.  

This is because, to segment customers demographically, it is more useful to know what kind of vehicle they drive and not necessarily the actual vehicle. We can then infer their spending patterns from this information. E.g. a more eco-friendly person (i.e. a possible customer for organic produce) might favor lower-power cars that emit less CO2.

In [None]:
plz8_to_drop = ['KBA13_AUDI',
                'KBA13_BMW',
                'KBA13_FAB_ASIEN',
                'KBA13_FAB_SONSTIGE',
                'KBA13_FIAT',
                'KBA13_FORD',
                'KBA13_HERST_ASIEN',
                'KBA13_HERST_AUDI_VW',
                'KBA13_HERST_BMW_BENZ',
                'KBA13_HERST_EUROPA',
                'KBA13_HERST_FORD_OPEL',
                'KBA13_HERST_SONST',
                'KBA13_KRSHERST_AUDI_VW',
                'KBA13_KRSHERST_BMW_BENZ',
                'KBA13_KRSHERST_FORD_OPEL',
                'KBA13_MAZDA',
                'KBA13_MERCEDES',
                'KBA13_MOTOR',
                'KBA13_NISSAN',
                'KBA13_OPEL',
                'KBA13_PEUGEOT',
                'KBA13_RENAULT',
                'KBA13_TOYOTA',
                'KBA13_VW',
                'KBA13_SITZE_4',
                'KBA13_SITZE_5',
                'KBA13_SITZE_6']

In [None]:
plz8_kept = cramer_frame_list[7].columns.drop(plz8_to_drop)

In [None]:
len(plz8_kept)

95

### Horsepower correlations

In [None]:
hp_cols = ['KBA13_CCM_1000',
'KBA13_CCM_1200',
'KBA13_CCM_1400',
'KBA13_CCM_0_1400',
'KBA13_CCM_1500',
# 'KBA13_CCM_1400_2500', # Column NOT IN DATA
'KBA13_CCM_1600',
'KBA13_CCM_1800',
'KBA13_CCM_2000',
'KBA13_CCM_2500',
'KBA13_CCM_2501',
'KBA13_KMH_110',
'KBA13_KMH_140',
'KBA13_KMH_180',
'KBA13_KMH_0_140',
'KBA13_KMH_140_210',
'KBA13_KMH_211',
'KBA13_KMH_250',
'KBA13_KMH_251',
'KBA13_KW_30',
'KBA13_KW_40',
'KBA13_KW_50',
'KBA13_KW_60',
'KBA13_KW_0_60',
'KBA13_KW_70',
'KBA13_KW_61_120',
'KBA13_KW_80',
'KBA13_KW_90',
'KBA13_KW_110',
'KBA13_KW_120',
'KBA13_KW_121']

In [None]:
ccm_cols = ['KBA13_CCM_1000',
            'KBA13_CCM_1200',
            'KBA13_CCM_1400',
            'KBA13_CCM_0_1400',
            'KBA13_CCM_1500',
            # 'KBA13_CCM_1400_2500', # Column NOT IN DATA
            'KBA13_CCM_1600',
            'KBA13_CCM_1800',
            'KBA13_CCM_2000',
            'KBA13_CCM_2500',
            'KBA13_CCM_2501']

In [None]:
cramer_frame_list[7].loc[hp_cols, hp_cols].drop(ccm_cols).style.background_gradient()

Attribute,KBA13_CCM_1000,KBA13_CCM_1200,KBA13_CCM_1400,KBA13_CCM_0_1400,KBA13_CCM_1500,KBA13_CCM_1600,KBA13_CCM_1800,KBA13_CCM_2000,KBA13_CCM_2500,KBA13_CCM_2501,KBA13_KMH_110,KBA13_KMH_140,KBA13_KMH_180,KBA13_KMH_0_140,KBA13_KMH_140_210,KBA13_KMH_211,KBA13_KMH_250,KBA13_KMH_251,KBA13_KW_30,KBA13_KW_40,KBA13_KW_50,KBA13_KW_60,KBA13_KW_0_60,KBA13_KW_70,KBA13_KW_61_120,KBA13_KW_80,KBA13_KW_90,KBA13_KW_110,KBA13_KW_120,KBA13_KW_121
KBA13_KMH_110,0.252894,0.083167,0.085248,0.158922,0.1845,0.08404,0.076441,0.084536,0.103717,0.119314,1.0,0.233051,0.085512,0.359243,0.133305,0.092016,0.092234,0.147503,0.668596,0.129882,0.070734,0.061625,0.121907,0.071701,0.119612,0.072024,0.075599,0.097941,0.173265,0.10847
KBA13_KMH_140,0.420001,0.094805,0.081642,0.232983,0.233758,0.095342,0.077085,0.109609,0.117167,0.164877,0.233051,1.0,0.082146,0.76777,0.197626,0.085513,0.086108,0.102585,0.327848,0.337987,0.081568,0.059821,0.168621,0.07321,0.161642,0.081776,0.08115,0.096373,0.226072,0.119342
KBA13_KMH_180,0.127265,0.2261,0.344641,0.22202,0.104844,0.196278,0.148462,0.322834,0.249245,0.25134,0.085512,0.082146,1.0,0.082245,0.30098,0.310675,0.309355,0.13286,0.099283,0.170301,0.254668,0.281326,0.456734,0.186094,0.272741,0.053372,0.187225,0.287783,0.162774,0.288813
KBA13_KMH_0_140,0.402078,0.090705,0.083923,0.218886,0.216022,0.096928,0.072413,0.114326,0.107234,0.148482,0.359243,0.76777,0.082245,1.0,0.204945,0.080106,0.081016,0.102849,0.437075,0.286259,0.06798,0.054364,0.177239,0.062999,0.168534,0.074304,0.076022,0.091342,0.209737,0.106544
KBA13_KMH_140_210,0.121214,0.121317,0.198104,0.072401,0.103308,0.171227,0.089088,0.078224,0.366875,0.435585,0.133305,0.197626,0.30098,0.204945,1.0,0.61688,0.609726,0.198206,0.156275,0.093377,0.130303,0.162116,0.179299,0.127931,0.107765,0.13396,0.076766,0.073166,0.249639,0.535209
KBA13_KMH_211,0.121796,0.127523,0.1974,0.137941,0.086095,0.151616,0.072008,0.09023,0.34175,0.426136,0.092016,0.085513,0.310675,0.080106,0.61688,1.0,0.94014,0.207517,0.105116,0.12572,0.137663,0.147627,0.248438,0.111269,0.081278,0.102042,0.064803,0.091924,0.230072,0.52701
KBA13_KMH_250,0.120908,0.127676,0.197268,0.137153,0.085709,0.151001,0.071415,0.089785,0.346206,0.415519,0.092234,0.086108,0.309355,0.081016,0.609726,0.94014,1.0,0.167666,0.106459,0.124458,0.137784,0.146606,0.248144,0.11022,0.080728,0.101072,0.064343,0.091641,0.232682,0.516484
KBA13_KMH_251,0.080904,0.075657,0.097684,0.073327,0.101659,0.093539,0.056402,0.070832,0.109057,0.25223,0.147503,0.102585,0.13286,0.102849,0.198206,0.207517,0.167666,1.0,0.149119,0.075607,0.074672,0.086298,0.107165,0.070365,0.077528,0.07203,0.05438,0.073936,0.116616,0.231341
KBA13_KW_30,0.311808,0.104845,0.101536,0.189252,0.219885,0.103271,0.084361,0.107106,0.125497,0.146531,0.668596,0.327848,0.099283,0.437075,0.156275,0.105116,0.106459,0.149119,1.0,0.164303,0.079456,0.074204,0.147412,0.082511,0.143451,0.085212,0.085603,0.109896,0.204565,0.126066
KBA13_KW_40,0.502925,0.186221,0.075241,0.39114,0.127568,0.082094,0.083081,0.145759,0.112088,0.118947,0.129882,0.337987,0.170301,0.286259,0.093377,0.12572,0.124458,0.075607,0.164303,1.0,0.100019,0.045336,0.287637,0.069802,0.234638,0.070614,0.092338,0.122708,0.111614,0.124609


Some correlations emerge among the engine-related measures (displacement in CCM, top speed in KMH and power in KW), e.g. 0.67 between `KBA13_KMH_110` and `KBA13_KW_30`. Columns whose ranges overlap also show higher correlations, e.g. 0.77 between `KBA13_KMH_140` and `KBA13_KMH_0_140`.  
Therefore, it is worth trying to keep only one of the two criteria, CCM or KW. Comparing their share of missing values can help with this selection.
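
Reading a 30-column heatmap by eye is error-prone, so a small helper can list the strongest pairs directly. The sketch below is an illustration, not part of the original notebook: `strongest_pairs` and the 0.5 threshold are assumptions, and the toy matrix only reuses three values from the table above.

```python
import numpy as np
import pandas as pd


def strongest_pairs(assoc: pd.DataFrame, threshold: float = 0.5) -> pd.Series:
    """Return unique column pairs whose association exceeds `threshold`."""
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal (always 1.0) is ignored.
    mask = np.triu(np.ones(assoc.shape, dtype=bool), k=1)
    pairs = assoc.where(mask).stack()
    # The comparison also discards any NaN left over from the masking.
    return pairs[pairs > threshold].sort_values(ascending=False)


# Toy symmetric matrix built from three values of the table above.
cols = ['KBA13_KMH_211', 'KBA13_KMH_250', 'KBA13_KMH_251']
toy = pd.DataFrame(
    [[1.0, 0.94014, 0.207517],
     [0.94014, 1.0, 0.167666],
     [0.207517, 0.167666, 1.0]],
    index=cols, columns=cols,
)

result = strongest_pairs(toy, threshold=0.5)
print(result)
```

On a square, symmetric matrix such as `cramer_frame_list[7].loc[hp_cols, hp_cols]` the same call would presumably work unchanged, provided rows and columns are in the same order.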

In [None]:
census[hp_cols].isna().mean()

KBA13_CCM_1000       0.118714
KBA13_CCM_1200       0.118714
KBA13_CCM_1400       0.118714
KBA13_CCM_0_1400     0.118714
KBA13_CCM_1500       0.118714
KBA13_CCM_1600       0.118714
KBA13_CCM_1800       0.118714
KBA13_CCM_2000       0.118714
KBA13_CCM_2500       0.118714
KBA13_CCM_2501       0.118714
KBA13_KMH_110        0.118714
KBA13_KMH_140        0.118714
KBA13_KMH_180        0.118714
KBA13_KMH_0_140      0.118714
KBA13_KMH_140_210    0.118714
KBA13_KMH_211        0.118714
KBA13_KMH_250        0.118714
KBA13_KMH_251        0.118714
KBA13_KW_30          0.118714
KBA13_KW_40          0.118714
KBA13_KW_50          0.118714
KBA13_KW_60          0.118714
KBA13_KW_0_60        0.118714
KBA13_KW_70          0.118714
KBA13_KW_61_120      0.118714
KBA13_KW_80          0.118714
KBA13_KW_90          0.118714
KBA13_KW_110         0.118714
KBA13_KW_120         0.118714
KBA13_KW_121         0.118714
dtype: float64

All columns have the same share of missing values (about 11.9%), so this criterion does not favor either group. We can choose either one of them. CCM will be kept.
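
When the per-column output is long, the comparison can be condensed to one number per group. This is a minimal sketch on a toy frame standing in for `census`; the helper name `missing_share_by_group` and the toy values are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def missing_share_by_group(df: pd.DataFrame, groups: dict) -> pd.Series:
    """Average share of missing values for each named group of columns."""
    return pd.Series(
        {name: df[cols].isna().mean().mean() for name, cols in groups.items()}
    )


# Toy stand-in for `census`; the real frame is not reproduced here.
toy = pd.DataFrame({
    'KBA13_CCM_1000': [1.0, np.nan, 3.0, 2.0],
    'KBA13_CCM_1200': [2.0, np.nan, 1.0, 3.0],
    'KBA13_KW_30': [1.0, np.nan, 2.0, 2.0],
    'KBA13_KW_40': [3.0, np.nan, 1.0, 1.0],
})
groups = {
    'CCM': ['KBA13_CCM_1000', 'KBA13_CCM_1200'],
    'KW': ['KBA13_KW_30', 'KBA13_KW_40'],
}

shares = missing_share_by_group(toy, groups)
print(shares)
```

Equal shares across groups, as in the output above, mean missingness cannot be used as a tie-breaker between them.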

In [None]:
kw_drop = ['KBA13_KW_30',
            'KBA13_KW_40',
            'KBA13_KW_50',
            'KBA13_KW_60',
            'KBA13_KW_0_60',
            'KBA13_KW_70',
            'KBA13_KW_61_120',
            'KBA13_KW_80',
            'KBA13_KW_90',
            'KBA13_KW_110',
            'KBA13_KW_120',
            'KBA13_KW_121',
            'KBA13_CCM_0_1400'] # Dropping CCM column because of intersection of ranges

In [None]:
plz8_kept = list(set(plz8_kept) - set(kw_drop))

In [None]:
len(plz8_kept)

82

In [None]:
for col in plz8_kept:
    if 'ANTG' in col:
        print(col)

PLZ8_ANTG2
PLZ8_ANTG3
PLZ8_ANTG1
KBA13_ANTG3
PLZ8_ANTG4
KBA13_ANTG2
KBA13_ANTG1
KBA13_ANTG4


Oddly enough, some columns have different prefixes but the same suffix.

In [None]:
plz_antg = ['PLZ8_ANTG1',
            'PLZ8_ANTG2',
            'PLZ8_ANTG3',
            'PLZ8_ANTG4']

kba13_antg = ['KBA13_ANTG1',
                'KBA13_ANTG2',
                'KBA13_ANTG3',
                'KBA13_ANTG4']


In [None]:
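# Are the columns with the same name the same?
# Note: NaN == NaN evaluates to False, so rows where both values are missing count as mismatches here.
for plz8_col, kba13_col in zip(plz_antg, kba13_antg):
    print((census[plz8_col] == census[kba13_col]).all())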


False
False
False
False


Even though they share the same suffix, their values are not identical. Since the `KBA13_ANTG` columns are not found in the documentation, the PLZ8 columns will be kept.
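
Because an element-wise `==` treats two missing values as different, a column pair with any NaN always fails the `.all()` check, even if the columns are otherwise identical. The sketch below shows a NaN-aware alternative that also reports how often the columns agree. The helper `agreement_rate` and the toy values are assumptions for illustration, not results from `census`.

```python
import numpy as np
import pandas as pd


def agreement_rate(a: pd.Series, b: pd.Series) -> float:
    """Share of rows where both columns agree, counting NaN/NaN as agreement."""
    same = (a == b) | (a.isna() & b.isna())
    return float(same.mean())


toy = pd.DataFrame({
    'PLZ8_ANTG1': [1.0, 2.0, np.nan, 3.0],
    'COPY_OF_PLZ8': [1.0, 2.0, np.nan, 3.0],   # identical, including the NaN
    'KBA13_ANTG1': [1.0, 2.0, np.nan, 2.0],    # differs in the last row
})

# The naive check fails even for the identical copy because of the NaN row.
naive_copy = bool((toy['PLZ8_ANTG1'] == toy['COPY_OF_PLZ8']).all())
rate_copy = agreement_rate(toy['PLZ8_ANTG1'], toy['COPY_OF_PLZ8'])
rate_other = agreement_rate(toy['PLZ8_ANTG1'], toy['KBA13_ANTG1'])

print(naive_copy, rate_copy, rate_other)  # False 1.0 0.75
```

An agreement rate close to 1.0 would suggest the two column families are near-duplicates, while a low rate would confirm that they carry different information.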

In [None]:
plz8_col_list = list(set(plz8_kept) - set(kba13_antg))

In [None]:
len(cramer_frame_list)

10

## Postcode

The definition of the variables suggests that they have some correlation.

In [None]:
cramer_frame_list[8].style.background_gradient()

Attribute,BALLRAUM,EWDICHTE,INNENSTADT
BALLRAUM,1.0,0.340406,0.337231
EWDICHTE,0.340406,1.0,0.399588
INNENSTADT,0.337231,0.399588,1.0


They do have some correlation, but since they represent relatively different characteristics, they will not be dropped.

In [None]:
postcode_col_list = list(cramer_frame_list[8].columns)

## RR1_ID

In [None]:
cramer_frame_list[9].style.background_gradient()

Attribute,GEBAEUDETYP_RASTER,KKK,MOBI_REGIO,ONLINE_AFFINITAET,REGIOTYP,MOBI_RASTER
GEBAEUDETYP_RASTER,1.0,0.078496,0.186512,0.0443,0.079294,0.21279
KKK,0.078496,1.0,0.081568,0.047181,0.603747,0.07618
MOBI_REGIO,0.186512,0.081568,1.0,0.115377,0.093296,0.378735
ONLINE_AFFINITAET,0.0443,0.047181,0.115377,1.0,0.043647,0.102379
REGIOTYP,0.079294,0.603747,0.093296,0.043647,1.0,0.092963
MOBI_RASTER,0.21279,0.07618,0.378735,0.102379,0.092963,1.0


Neighbourhood typology (`REGIOTYP`) and purchasing power (`KKK`) are naturally correlated, and a Cramér's V of about 0.60 on its 0 to 1 scale can be considered strong. Since geographical variables are already included in other levels (such as Community), `REGIOTYP` will be dropped and purchasing power (`KKK`) will be kept.

Also, given the correlation between `MOBI_RASTER` and `MOBI_REGIO` (about 0.38) and the fact that `MOBI_RASTER` was not found in the documentation, `MOBI_RASTER` will be dropped.
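
For reference on the 0 to 1 scale, the plain Cramér's V is $V = \sqrt{\chi^2 / (n \cdot (\min(r, c) - 1))}$ for an $r \times c$ contingency table with $n$ observations. The function below is a minimal sketch of that textbook formula. It is not necessarily the implementation used to build `cramer_frame_list` (which may, for instance, apply a bias correction), and it assumes both inputs have at least two categories.

```python
import numpy as np
import pandas as pd


def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Plain (uncorrected) Cramér's V between two categorical series."""
    # Rows with a missing value in either series are dropped by crosstab.
    observed = pd.crosstab(x, y).to_numpy(dtype=float)
    n = observed.sum()
    # Expected counts under independence: outer product of the margins over n.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n
    chi2 = ((observed - expected) ** 2 / expected).sum()
    min_dim = min(observed.shape) - 1
    return float(np.sqrt(chi2 / (n * min_dim)))


a = pd.Series(['low', 'low', 'high', 'high'])
b = pd.Series(['x', 'y', 'x', 'y'])

v_identical = cramers_v(a, a)    # perfectly associated
v_independent = cramers_v(a, b)  # no association
print(v_identical, v_independent)  # 1.0 0.0
```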

In [None]:
rr1_cols = list(cramer_frame_list[9].columns.drop(['REGIOTYP','MOBI_RASTER']))

In [None]:
buildings_col_list

['GEBAEUDETYP', 'KONSUMNAEHE', 'OST_WEST_KZ', 'WOHNLAGE']

# Generating first subset

In [None]:
# Some numeric variables were already dropped in the NaN-exclusion stage, so keep only those still present
kept_num_vars = list(np.intersect1d(census.columns, numeric_vars))

In [None]:
selected_features = (
    ['LNR']
    + kept_num_vars
    + grid_cols_list
    + buildings_col_list
    + rr3_col_list
    + community_col_list
    + household_cols_list
    + rr4_col_list
    + person_col_list
    + plz8_col_list
    + postcode_col_list
    + rr1_cols
)

In [None]:
print(len(selected_features), f'out of {census.shape[1]} features were selected to use in the segmentation')

196 out of 335 features were selected to use in the segmentation


In [None]:
census[selected_features].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891221 entries, 0 to 891220
Columns: 196 entries, LNR to ONLINE_AFFINITAET
dtypes: float64(153), int64(41), object(2)
memory usage: 1.3+ GB


In [3]:
df_customers.head()

Unnamed: 0,LNR,ANZ_HAUSHALTE_AKTIV,ANZ_HH_TITEL,ANZ_PERSONEN,ANZ_TITEL,KBA13_ANZAHL_PKW,MIN_GEBAEUDEJAHR,D19_BANKEN_DIREKT,D19_BANKEN_GROSS,D19_BANKEN_LOKAL,...,BALLRAUM,EWDICHTE,INNENSTADT,GEBAEUDETYP_RASTER,KKK,MOBI_REGIO,ONLINE_AFFINITAET,CUSTOMER_GROUP,ONLINE_PURCHASE,PRODUCT_GROUP
0,9626,1.0,0.0,2.0,0.0,1201.0,1992.0,0.0,0.0,0.0,...,3.0,2.0,4.0,4.0,1.0,4.0,3.0,MULTI_BUYER,0,COSMETIC_AND_FOOD
2,143872,1.0,0.0,1.0,0.0,433.0,1992.0,0.0,0.0,0.0,...,7.0,4.0,1.0,3.0,3.0,3.0,1.0,MULTI_BUYER,0,COSMETIC_AND_FOOD
3,143873,0.0,0.0,0.0,0.0,755.0,1992.0,0.0,0.0,0.0,...,7.0,1.0,7.0,4.0,3.0,4.0,2.0,MULTI_BUYER,0,COSMETIC
4,143874,7.0,0.0,4.0,0.0,513.0,1992.0,2.0,0.0,1.0,...,3.0,4.0,4.0,3.0,4.0,3.0,5.0,MULTI_BUYER,0,FOOD
5,143888,1.0,0.0,2.0,0.0,1167.0,1992.0,0.0,0.0,0.0,...,7.0,5.0,8.0,4.0,2.0,3.0,3.0,MULTI_BUYER,0,COSMETIC_AND_FOOD


In [4]:
df_customers[['CUSTOMER_GROUP','PRODUCT_GROUP']]

Unnamed: 0,CUSTOMER_GROUP,PRODUCT_GROUP
0,MULTI_BUYER,COSMETIC_AND_FOOD
2,MULTI_BUYER,COSMETIC_AND_FOOD
3,MULTI_BUYER,COSMETIC
4,MULTI_BUYER,FOOD
5,MULTI_BUYER,COSMETIC_AND_FOOD
...,...,...
191647,MULTI_BUYER,COSMETIC_AND_FOOD
191648,SINGLE_BUYER,COSMETIC
191649,MULTI_BUYER,COSMETIC_AND_FOOD
191650,SINGLE_BUYER,FOOD


In [5]:
for col in ['CUSTOMER_GROUP','PRODUCT_GROUP']:

    display(df_customers[col].value_counts(dropna = False))

MULTI_BUYER     98547
SINGLE_BUYER    41751
Name: CUSTOMER_GROUP, dtype: int64

COSMETIC_AND_FOOD    75446
FOOD                 33779
COSMETIC             31073
Name: PRODUCT_GROUP, dtype: int64

In [6]:
X = df_customers.drop(columns = ['LNR','CUSTOMER_GROUP','PRODUCT_GROUP','ONLINE_PURCHASE'])

In [8]:
selected = []

for col in ['ONLINE_PURCHASE','PRODUCT_GROUP','CUSTOMER_GROUP']:

    print(f'Running for {col}')

    y = df_customers[col]

    selector = SelectKBest(mutual_info_classif, k = 10)

    selector.fit(X, y)

    selected.append(selector.get_feature_names_out())

Running for ONLINE_PURCHASE
Running for PRODUCT_GROUP
Running for CUSTOMER_GROUP


In [9]:
selected_features = list(np.append(selected[0],selected[1]))

selected_features = list(np.append(selected_features, selected[2]))

In [10]:
selected_features = list(set(selected_features))

In [11]:
len(selected_features)

16

In [12]:
selected_features

['FINANZ_VORSORGER',
 'CJT_TYP_6',
 'WOHNDAUER_2008',
 'NATIONALITAET_KZ',
 'VERS_TYP',
 'CJT_TYP_3',
 'OST_WEST_KZ',
 'CJT_TYP_2',
 'STRUKTURTYP',
 'FINANZ_SPARER',
 'CJT_TYP_5',
 'D19_BANKEN_DATUM',
 'CJT_KATALOGNUTZER',
 'PRAEGENDE_JUGENDJAHRE',
 'CJT_TYP_4',
 'ANREDE_KZ']

Out of 30 possible selections (the top 10 features for each of the three targets), removing the overlaps leaves 16 candidates that are useful for separating the categories in the customer data. We will try to cluster using these variables.  
The numeric columns still need attention afterwards, since the mutual information criterion may be less useful for this type of data when separating categories.
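
The overlap bookkeeping can be made explicit and deterministic (a `set` does not guarantee a stable order between runs). The sketch below uses hypothetical per-target lists: the feature names come from the output above, but which target selected which feature is an assumption made purely for illustration.

```python
from collections import Counter
from itertools import chain

# Hypothetical per-target selections (three features each instead of ten).
selected_toy = [
    ['CJT_TYP_2', 'CJT_TYP_3', 'ANREDE_KZ'],
    ['CJT_TYP_2', 'FINANZ_SPARER', 'ANREDE_KZ'],
    ['CJT_TYP_2', 'VERS_TYP', 'OST_WEST_KZ'],
]

# Count how many targets picked each feature; Counter keeps first-seen order.
counts = Counter(chain.from_iterable(selected_toy))

unique_features = list(counts)
shared = [name for name, c in counts.items() if c > 1]

print(len(unique_features), 'unique features out of', sum(counts.values()))
print('picked for more than one target:', shared)
```

Features picked for several targets are arguably the most robust candidates, so the counts can also serve as a rough ranking.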

In [13]:
with open('data/trusted/selected_features.pkl','wb') as file:

    pickle.dump(selected_features, file)