# Segmentation EDA Notebook

## Introduction
This notebook contains the preprocessing EDA for the Demographic data.  
It will be the base for the preprocessing steps taken.  

In this stage, one of the most important parts is feature selection for clustering, since there are many possible variables.  
Selecting the most important features will help us avoid the "curse of dimensionality" for clustering and also help gain computation time.

## Steps contained
1. Missing values handling
2. Duplicates check/ handling
4. Feature Selection - Too unpopulated columns, multicolinearities (Obtained from profile)
5. Feature Selection - Manual exclusion of variables not relevant businesswise

**Note: To select manually the variables that will be kept, data from the DIAS Information Levels - Attributes Spreadsheet will be used** 

## Context
The targeted company for this project is an Organics company that is selling mail-ordered products.

# Imports

In [187]:
# Data Wrangling
import pandas as pd
import numpy as np

# Utils
import itertools
import csv
import os

# Data Viz
import seaborn as sns

# ML and Statistics
import scipy.stats as stats

In [188]:
sns.set(rc={'figure.figsize':(10,5)})

# Reading Data

In [189]:
census = pd.read_csv('data/raw/Udacity_AZDIAS_052018.csv', sep = ';')

  census = pd.read_csv('data/raw/Udacity_AZDIAS_052018.csv', sep = ';')


# Type handling on warned columns by Pandas

In [190]:
col_check = census.iloc[:, [18,19]].applymap(type)

In [191]:
census.iloc[:, [18,19]].head(10)

Unnamed: 0,CAMEO_DEUG_2015,CAMEO_INTL_2015
0,,
1,8.0,51.0
2,4.0,24.0
3,2.0,12.0
4,6.0,43.0
5,8.0,54.0
6,4.0,22.0
7,2.0,14.0
8,1.0,13.0
9,1.0,15.0


Probably, what is happening for these columns is that not encoded NaN Values are making pandas read the file as `float`.  
This will be temporarily fixed, so the columns can be preprocessed along with the others.  

In [192]:
col_check[col_check['CAMEO_DEUG_2015'] != float]

Unnamed: 0,CAMEO_DEUG_2015,CAMEO_INTL_2015
2048,<class 'str'>,<class 'str'>
2050,<class 'str'>,<class 'str'>
2052,<class 'str'>,<class 'str'>
2053,<class 'str'>,<class 'str'>
2054,<class 'str'>,<class 'str'>
...,...,...
886779,<class 'str'>,<class 'str'>
886780,<class 'str'>,<class 'str'>
886781,<class 'str'>,<class 'str'>
886782,<class 'str'>,<class 'str'>


In [193]:
census.loc[[2048,2050,2052], ['CAMEO_DEUG_2015']]

Unnamed: 0,CAMEO_DEUG_2015
2048,4
2050,3
2052,7


In [194]:
census[census['CAMEO_DEUG_2015'] == 'X'][['CAMEO_DEUG_2015','CAMEO_INTL_2015']].drop_duplicates()

Unnamed: 0,CAMEO_DEUG_2015,CAMEO_INTL_2015
2511,X,XX


'X's will be treated as `NaN`

In [195]:
census['CAMEO_DEUG_2015'] = census['CAMEO_DEUG_2015'].replace('X',np.nan)

census['CAMEO_INTL_2015'] = census['CAMEO_INTL_2015'].replace('XX',np.nan)

In [196]:
census['CAMEO_DEUG_2015'] = census['CAMEO_DEUG_2015'].fillna(-1).astype(int)

census['CAMEO_INTL_2015'] = census['CAMEO_INTL_2015'].fillna(-1).astype(int)

In [197]:
census.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891221 entries, 0 to 891220
Columns: 366 entries, LNR to ALTERSKATEGORIE_GROB
dtypes: float64(267), int32(2), int64(93), object(4)
memory usage: 2.4+ GB


# Missing Values Handling

## Creating replacements for NaN Values in each column

In [198]:
nan_val_df = pd.read_excel('data/raw/DIAS Attributes - Values 2017.xlsx', header = 1, usecols = 'B:E', dtype = str)

nan_val_df[['Attribute','Description']] = nan_val_df[['Attribute','Description']].fillna(method='ffill')

# Assuming, from manual inspection from the 'Values' Spreadsheet, that NaNs are represented with substrings in Meaning col
nan_val_df = nan_val_df[nan_val_df['Meaning'].str.contains('unknown|no transaction[s]? known',regex=True,na  = False)]

nan_val_df['Value'] = nan_val_df['Value'].str.replace('\s','', regex = True)

nan_val_df['Value'] = nan_val_df['Value'].str.split(',')

nan_val_map = dict(zip(nan_val_df['Attribute'], nan_val_df['Value']))

In [199]:
# Creating a dictionary in a pandas friendly format for filling nans
nested_nan_map = {}

for i, (k, v) in enumerate(nan_val_map.items()):

    nested_nan_map[k] = {int(digit):np.nan for digit in v}

In [200]:
census = census.replace(nested_nan_map)

# Checking for duplicates on census Data

In [201]:
# Full duplicates
census.duplicated().sum()

0

In [202]:
# Checking if an ID shows up more than once
census.LNR.duplicated().sum()

0

It seems as there are no duplicates

## NaN Values on columns

In [203]:
nan_proportion = census.isna()\
                        .mean()\
                        .sort_values(ascending = False)

In [204]:
nan_proportion.head()

ALTER_KIND4                 0.998648
TITEL_KZ                    0.997576
ALTER_KIND3                 0.993077
D19_TELKO_ONLINE_DATUM      0.990796
D19_BANKEN_OFFLINE_DATUM    0.977911
dtype: float64

In [205]:
(nan_proportion > 0.8).sum()

15

In [206]:
(nan_proportion > 0.7).sum()

21

In [207]:
nan_proportion[nan_proportion >= 0.7]

ALTER_KIND4                  0.998648
TITEL_KZ                     0.997576
ALTER_KIND3                  0.993077
D19_TELKO_ONLINE_DATUM       0.990796
D19_BANKEN_OFFLINE_DATUM     0.977911
ALTER_KIND2                  0.966900
D19_TELKO_ANZ_12             0.962713
D19_BANKEN_ANZ_12            0.933252
D19_TELKO_ANZ_24             0.927052
D19_VERSI_ANZ_12             0.921532
D19_TELKO_OFFLINE_DATUM      0.919092
ALTER_KIND1                  0.909048
D19_BANKEN_ANZ_24            0.891025
D19_VERSI_ANZ_24             0.871879
D19_BANKEN_ONLINE_DATUM      0.815715
D19_BANKEN_DATUM             0.761125
AGER_TYP                     0.760196
D19_TELKO_DATUM              0.747063
EXTSEL992                    0.733996
D19_VERSAND_ANZ_12           0.715840
D19_VERSAND_OFFLINE_DATUM    0.711645
dtype: float64

It is interesting to use 80% as threshold since we would lose a column that refers directly to mail-ordering if we use 70%.

In [208]:
census.drop(columns = nan_proportion[nan_proportion >= 0.8].index, inplace=True)

# Checking for multicolinearity

We should check especially for multicolinearity for categorical variables (ordinals are categorical nontheless), since there are so many of them.  
In this context, too many variables can lead to the Curse of Dimensionality so that it is hard to set entries apart in the feature space.  

To compare approx. 300 variables with themselves would generate a result too expensive computationally (approx. 44850 unique combinations), since we have to evaluate all variables against themselves.  
As a proxy, we will use the information-levels to calculate the colinearities within each level to keep only the most important features for each level. 

## Fetching level-information data by column

In [209]:
att_info = pd.read_excel('data/raw/DIAS Information Levels - Attributes 2017.xlsx', usecols = 'B:E', header = 1)

In [210]:
att_info['Information level'] = att_info['Information level'].fillna(method='ffill')\
                                                            .fillna(method='backfill')

In [211]:
att_info['Information level'].unique()

array(['Person', 'Household', 'Building', 'Microcell (RR4_ID)',
       'Microcell (RR3_ID)', '125m x 125m Grid', 'Postcode ', 'RR1_ID',
       'PLZ8', 'Community'], dtype=object)

In [212]:
att_info['Information level'] = att_info['Information level'].str.strip()

In [213]:
att_info['Information level'].nunique()

10

In [214]:
att_info = att_info[['Information level','Attribute']]

display(att_info)

Unnamed: 0,Information level,Attribute
0,Person,AGER_TYP
1,Person,ALTERSKATEGORIE_GROB
2,Person,ANREDE_KZ
3,Person,CJT_GESAMTTYP
4,Person,FINANZ_MINIMALIST
...,...,...
308,Community,ARBEIT
309,Community,EINWOHNER
310,Community,GKZ
311,Community,ORTSGR_KLS9


In [215]:
att_info[att_info['Attribute'].str.endswith('RZ')]

Unnamed: 0,Information level,Attribute
152,125m x 125m Grid,D19_BANKEN_DIREKT_RZ
153,125m x 125m Grid,D19_BANKEN_GROSS_RZ
154,125m x 125m Grid,D19_BANKEN_LOKAL_RZ
155,125m x 125m Grid,D19_BANKEN_REST_RZ
156,125m x 125m Grid,D19_BEKLEIDUNG_GEH_RZ
157,125m x 125m Grid,D19_BEKLEIDUNG_REST_RZ
158,125m x 125m Grid,D19_BIO_OEKO_RZ
159,125m x 125m Grid,D19_BILDUNG_RZ
160,125m x 125m Grid,D19_BUCH_RZ
161,125m x 125m Grid,D19_DIGIT_SERV_RZ


In [216]:
att_info['Information level'].unique()

array(['Person', 'Household', 'Building', 'Microcell (RR4_ID)',
       'Microcell (RR3_ID)', '125m x 125m Grid', 'Postcode', 'RR1_ID',
       'PLZ8', 'Community'], dtype=object)

In [217]:
# Fixing the names for 125 grid because in the data they dont have _RZ in their name
att_info['Attribute'] = att_info['Attribute'].str.replace('_RZ','')

In [218]:
diff_cols = list(np.setdiff1d(census.columns.values, att_info['Attribute'].values))

print(diff_cols)

['AKT_DAT_KL', 'ALTERSKATEGORIE_FEIN', 'ANZ_KINDER', 'ANZ_STATISTISCHE_HAUSHALTE', 'CAMEO_INTL_2015', 'CJT_KATALOGNUTZER', 'CJT_TYP_1', 'CJT_TYP_2', 'CJT_TYP_3', 'CJT_TYP_4', 'CJT_TYP_5', 'CJT_TYP_6', 'D19_BUCH_CD', 'D19_GESAMT_ANZ_12', 'D19_GESAMT_ANZ_24', 'D19_KONSUMTYP_MAX', 'D19_LETZTER_KAUF_BRANCHE', 'D19_LOTTO', 'D19_SOZIALES', 'D19_TELKO_ONLINE_QUOTE_12', 'D19_VERSAND_ANZ_12', 'D19_VERSAND_ANZ_24', 'D19_VERSI_ONLINE_QUOTE_12', 'DSL_FLAG', 'EINGEFUEGT_AM', 'EINGEZOGENAM_HH_JAHR', 'EXTSEL992', 'FIRMENDICHTE', 'GEMEINDETYP', 'HH_DELTA_FLAG', 'KBA13_ANTG1', 'KBA13_ANTG2', 'KBA13_ANTG3', 'KBA13_ANTG4', 'KBA13_BAUMAX', 'KBA13_CCM_1401_2500', 'KBA13_CCM_3000', 'KBA13_CCM_3001', 'KBA13_GBZ', 'KBA13_HHZ', 'KBA13_KMH_210', 'KK_KUNDENTYP', 'KOMBIALTER', 'KONSUMZELLE', 'LNR', 'MOBI_RASTER', 'RT_KEIN_ANREIZ', 'RT_SCHNAEPPCHEN', 'RT_UEBERGROESSE', 'SOHO_KZ', 'STRUKTURTYP', 'UMFELD_ALT', 'UMFELD_JUNG', 'UNGLEICHENN_FLAG', 'VERDICHTUNGSRAUM', 'VHA', 'VHN', 'VK_DHT4A', 'VK_DISTANZ', 'VK_ZG11']


Not all variables are described in the Information Levels.

In [219]:
# Removing ID
diff_cols.remove('LNR')

Even though these columns are not assigned to any information group, the majority has a prefix that helps manually assigning them to a group.  
Also, the `CJT_TYP_X` columns are actually dummyfied versions of a column in the dataset

In [220]:
[col for col in diff_cols if col.startswith('D19')]

['D19_BUCH_CD',
 'D19_GESAMT_ANZ_12',
 'D19_GESAMT_ANZ_24',
 'D19_KONSUMTYP_MAX',
 'D19_LETZTER_KAUF_BRANCHE',
 'D19_LOTTO',
 'D19_SOZIALES',
 'D19_TELKO_ONLINE_QUOTE_12',
 'D19_VERSAND_ANZ_12',
 'D19_VERSAND_ANZ_24',
 'D19_VERSI_ONLINE_QUOTE_12']

In [221]:
grid_cols = ['D19_BUCH_CD',
            'D19_LETZTER_KAUF_BRANCHE',
            'D19_LOTTO',
            'D19_SOZIALES']

In [222]:
household_cols = ['D19_TELKO_ONLINE_QUOTE_12',
                    'D19_VERSAND_ANZ_12',
                    'D19_VERSAND_ANZ_24',
                    'D19_VERSI_ONLINE_QUOTE_12',
                    'D19_GESAMT_ANZ_12',
                    'D19_GESAMT_ANZ_24',
                    'D19_KONSUMTYP_MAX']

### Fixing columns that respect the name structure and/or are in the documentation

In [223]:
new_rows = []

for col in diff_cols:

    if col in household_cols:

        new_val = ('Household', col)

        new_rows.append(new_val)

    if col in grid_cols:

        new_val = ('125m x 125m Grid', col)

        new_rows.append(new_val)

    if col.startswith('KBA13'):

        new_val = ('PLZ8',col)

        new_rows.append(new_val)

    if col.startswith('CJT'):

        new_val = ('Person',col)

        new_rows.append(new_val)

for _, name in new_rows:

    diff_cols.remove(name)

In [224]:
diff_cols

['AKT_DAT_KL',
 'ALTERSKATEGORIE_FEIN',
 'ANZ_KINDER',
 'ANZ_STATISTISCHE_HAUSHALTE',
 'CAMEO_INTL_2015',
 'DSL_FLAG',
 'EINGEFUEGT_AM',
 'EINGEZOGENAM_HH_JAHR',
 'EXTSEL992',
 'FIRMENDICHTE',
 'GEMEINDETYP',
 'HH_DELTA_FLAG',
 'KK_KUNDENTYP',
 'KOMBIALTER',
 'KONSUMZELLE',
 'MOBI_RASTER',
 'RT_KEIN_ANREIZ',
 'RT_SCHNAEPPCHEN',
 'RT_UEBERGROESSE',
 'SOHO_KZ',
 'STRUKTURTYP',
 'UMFELD_ALT',
 'UMFELD_JUNG',
 'UNGLEICHENN_FLAG',
 'VERDICHTUNGSRAUM',
 'VHA',
 'VHN',
 'VK_DHT4A',
 'VK_DISTANZ',
 'VK_ZG11']

Some columns still are to be accounted for. This will be done by manually inspecting each case to check if the columns exist in the dictionary or have problems in their name.

In [225]:
# # Exporting column names for classification

# with open('data/raw/unaccounted_cols.csv','w', newline='') as file:

#     writer = csv.writer(file, delimiter=';')

#     writer.writerow(['col_name'])

#     for col in diff_cols:

#         writer.writerow([col])

### Building final classification

In [226]:
diff_cols_remainder = pd.read_csv('data/trusted/unaccounted_cols.csv', sep = ';')

In [227]:
print('Columns unaccounted for:',diff_cols_remainder.shape[0])

Columns unaccounted for: 30


In [230]:
diff_cols_remainder['information'].value_counts()

UNDOCUMENTED          22
Household              4
Person                 2
Microcell (RR4_ID)     1
RR1_ID                 1
Name: information, dtype: int64

From 30 columns that were still unaccounted for, 22 **WERE NOT FOUND IN THE DOCUMENTATION**.  
In a real-life scenario, these would be brought to the knowledge of the business or data-sourcing team responsible so that they could be documented.  
In the context of this project, **these columns will be dropped** since we cannot safely interpret for our segmentation, if they come to be useful.

In [231]:
columns_to_drop = diff_cols_remainder[diff_cols_remainder['information'] == 'UNDOCUMENTED']['col_name'].values

In [232]:
census.drop(columns = columns_to_drop, inplace = True)

In [233]:
cols_to_keep_list = list(zip(diff_cols_remainder[diff_cols_remainder['information'] != 'UNDOCUMENTED']['information'],
                         diff_cols_remainder[diff_cols_remainder['information'] != 'UNDOCUMENTED']['col_name']))

In [234]:
new_rows.extend(cols_to_keep_list)

In [235]:
new_rows_frame = pd.DataFrame(new_rows, columns= ['Information level', 'Attribute'])

In [236]:
new_rows_frame

Unnamed: 0,Information level,Attribute
0,Person,CJT_KATALOGNUTZER
1,Person,CJT_TYP_1
2,Person,CJT_TYP_2
3,Person,CJT_TYP_3
4,Person,CJT_TYP_4
5,Person,CJT_TYP_5
6,Person,CJT_TYP_6
7,125m x 125m Grid,D19_BUCH_CD
8,Household,D19_GESAMT_ANZ_12
9,Household,D19_GESAMT_ANZ_24


In [237]:
att_info.shape

(313, 2)

In [238]:
att_info_updated = pd.concat([att_info,new_rows_frame], axis = 0)

In [239]:
# Are the columns in the census table contained in the informations table?
np.setdiff1d(census.columns, att_info_updated['Attribute'])

array(['LNR'], dtype=object)

This is ok since this is the ID column

In [240]:
class_census_cols = np.intersect1d(census.columns, att_info_updated['Attribute'])

In [241]:
len(class_census_cols)

328

In [242]:
col_classification = att_info_updated[att_info_updated['Attribute'].isin(class_census_cols)]

In [243]:
col_classification['Information level'].value_counts()

PLZ8                  123
Microcell (RR3_ID)     54
Person                 50
125m x 125m Grid       36
Household              32
Microcell (RR4_ID)     12
Building                9
RR1_ID                  6
Postcode                3
Community               3
Name: Information level, dtype: int64

## Calculating Multicolinearities by Information Group

In [244]:
# We wont use numeric vars to calculate Cramer's V

numeric_vars = ['ANZ_HAUSHALTE_AKTIV',
                'ANZ_HH_TITEL',
                'ANZ_PERSONEN',
                'ANZ_TITEL',
                'GEBURTSJAHR',
                'KBA13_ANZAHL_PKW',
                'MIN_GEBAEUDEJAHR']

In [245]:
num_var_filter = ~col_classification['Attribute'].isin(numeric_vars)

In [246]:
col_classification[(col_classification['Information level'] == 'Community') & num_var_filter]

Unnamed: 0,Information level,Attribute
308,Community,ARBEIT
311,Community,ORTSGR_KLS9
312,Community,RELAT_AB


In [247]:
def calculate_cramers_v(arr1, arr2):

    '''
    Calculates Cramer's V for two arrays.
    The value lies between 0 and 1 (inclusive)

    :param arr1: Array of categorical variable
    :param arr2: Array of categorical variable

    :return v: Cramer's V index value
    '''

    crosstab = stats.contingency.crosstab(arr1, arr2)[1]

    chi2 = stats.chi2_contingency(crosstab)[0]

    # calculating the total number of observations
    n = np.sum(crosstab)

    # getting the degrees of freedom
    dof = min(crosstab.shape)-1
    
    # calculating cramer's v
    v = np.sqrt(chi2/(n*dof))

    return v

def calculate_frame_cramer_coefs(dataframe):

    '''
    Calculates pairwise Cramer's V for all possible combinations of categorical variables in 'dataframe'.
    Similar behaviour to pandas' .corr() method.

    :param dataframe: Pandas DataFrame columns with categorical variables

    :return matrix: Pairwise matrix with all variable combinations
    '''
    
    numpy_frame = dataframe.dropna().values

    matrix = np.diag([1.0] * numpy_frame.shape[1])

    table_range = list(range(0,numpy_frame.shape[1]))

    # Getting unique index combinations to minimize iterations
    combos = [combo for combo in itertools.combinations(table_range,2)]

    for i, j in combos:
        
        v = calculate_cramers_v(numpy_frame[:,i], numpy_frame[:,j])

        matrix[i,j] = v

        matrix[j,i] = v

    return matrix

**WARNING: Running the cells below can take a while. That is why the values are exported to csv, so we can use them later without going through these calculations**

In [118]:
# Sorting information levels by amount of columns to generate results faster
sorted_col_classes = col_classification['Information level'].value_counts(ascending=True)

In [125]:
for i, c in ['125m x 125m Grid','Household']: # enumerate(sorted_col_classes.index):  

    relevant_cols = col_classification[(col_classification['Information level'] == c) & num_var_filter]['Attribute']

    v_matrix = calculate_frame_cramer_coefs(census[relevant_cols])

    v_frame = pd.DataFrame(v_matrix, columns = relevant_cols, index = relevant_cols)

    v_frame.name = c

    v_frame.to_csv(f'data/trusted/{v_frame.name}_cramer.csv', sep = ';')

    if i == 0:

        v_frame_list = [v_frame]

    else:

        v_frame_list.append(v_frame)

---

In [58]:
DATA_PATH = 'data/trusted/'

In [63]:
os.listdir(DATA_PATH)

['Building_cramer.csv',
 'Community_cramer.csv',
 'Household_cramer.csv',
 'Microcell (RR3_ID)_cramer.csv',
 'Microcell (RR4_ID)_cramer.csv',
 'Person_cramer.csv',
 'PLZ8_cramer.csv',
 'Postcode_cramer.csv',
 'RR1_ID_cramer.csv',
 'unaccounted_cols.csv']

In [65]:
cramer_frame_list = []

for file in os.listdir(DATA_PATH):

    if file.endswith('_cramer.csv'):

        name = file.replace('_cramer.csv','')

        frame = pd.read_csv(os.path.join(DATA_PATH, file), sep = ';')

        frame.name = name
    
        cramer_frame_list.append(frame)

# Feature Selection
Based on multicolinearity and business definitions of columns within each group

### Buildings

In [68]:
cramer_frame_list[0].style.background_gradient()

Unnamed: 0,Attribute,GEBAEUDETYP,KBA05_HERSTTEMP,KBA05_MODTEMP,KONSUMNAEHE,OST_WEST_KZ,WOHNLAGE
0,GEBAEUDETYP,1.0,0.058127,0.057239,0.146835,0.054584,0.071675
1,KBA05_HERSTTEMP,0.058127,1.0,0.505993,0.081404,0.255567,0.303144
2,KBA05_MODTEMP,0.057239,0.505993,1.0,0.07329,0.099981,0.298161
3,KONSUMNAEHE,0.146835,0.081404,0.07329,1.0,0.109692,0.167219
4,OST_WEST_KZ,0.054584,0.255567,0.099981,0.109692,1.0,0.094349
5,WOHNLAGE,0.071675,0.303144,0.298161,0.167219,0.094349,1.0


KBA05_HERSTTEMP and KBA05_MODTEMP seem to have shared information. They are more closely related to auto-manufacturing then actually characteristics of repondents themselves. They may belong better to the RR3_ID class.  

Either way, it makes sanse that car brands and segments are correlated to eachother as well as living conditions in a neighbourhood (WOHNANLAGE).  

No other columns from the group will be dropped

In [109]:
buildings_col_list = cramer_frame_list[0].columns

### RR3_ID

In [77]:
rr3_list = list(col_classification[(col_classification['Information level'] ==  'Microcell (RR3_ID)') & num_var_filter]['Attribute'].values) \
                + ['KBA05_HERSTTEMP','KBA05_MODTEMP']

In [79]:
v_matrix = calculate_frame_cramer_coefs(census[rr3_list])

v_frame = pd.DataFrame(v_matrix, columns = rr3_list, index = rr3_list)

In [80]:
# v_frame.to_csv('data/trusted/Microcell (RR3_ID)_cramer_updated.csv')

In [111]:
# Checking variables that can have important multicolinearities 
((v_frame >= 0.3) & (v_frame < 1)).sum().sort_values(ascending = False).head(10)

KBA05_MAXHERST    9
KBA05_MOTOR       9
KBA05_MAXSEG      7
KBA05_KW3         7
KBA05_CCM1        6
KBA05_MAXBJ       6
KBA05_KRSKLEIN    6
KBA05_MOD1        6
KBA05_SEG2        5
KBA05_MAXVORB     5
dtype: int64

In [89]:
# Looking into some examples
v_frame[(v_frame >= 0.3) & (v_frame < 1)][['KBA05_MAXHERST','KBA05_MOTOR','KBA05_MAXSEG']]

Unnamed: 0,KBA05_MAXHERST,KBA05_MOTOR,KBA05_MAXSEG
KBA05_AUTOQUOT,,,
KBA05_BAUMAX,,,
KBA05_CCM1,,0.463402,0.301857
KBA05_CCM2,,0.433955,
KBA05_CCM3,,0.401816,
KBA05_CCM4,,0.396834,
KBA05_DIESEL,,,
KBA05_FRAU,,,
KBA05_GBZ,,,
KBA05_HERST1,0.403921,,


From these examples we can see that the correlations occur frequently in variables that are aggregated into other variables. This is somewhat expected.  
To reduce the number of variables, those that aggregate information will be kept. If there are correlations within this subset of variables, another selection will be made.

In [105]:
kept = ['KBA05_AUTOQUOT',
        'KBA05_BAUMAX',
        'KBA05_DIESEL',
        'KBA05_FRAU',
        'KBA05_GBZ',
        'KBA05_KRSAQUOT',
        'KBA05_KRSKLEIN',
        'KBA05_KRSOBER',
        'KBA05_KRSVAN',
        'KBA05_KRSZUL',
        'KBA05_MAXAH',
        'KBA05_MAXBJ',
        'KBA05_MAXHERST',
        'KBA05_MAXSEG',
        'KBA05_MAXVORB',
        'KBA05_MOTOR',
        'KBA05_MOTRAD',
        'KBA05_HERSTTEMP',
        'KBA05_MODTEMP']

In [106]:
v_frame.loc[kept, kept].style.background_gradient()

Unnamed: 0,KBA05_AUTOQUOT,KBA05_BAUMAX,KBA05_DIESEL,KBA05_FRAU,KBA05_GBZ,KBA05_KRSAQUOT,KBA05_KRSKLEIN,KBA05_KRSOBER,KBA05_KRSVAN,KBA05_KRSZUL,KBA05_MAXAH,KBA05_MAXBJ,KBA05_MAXHERST,KBA05_MAXSEG,KBA05_MAXVORB,KBA05_MOTOR,KBA05_MOTRAD,KBA05_HERSTTEMP,KBA05_MODTEMP
KBA05_AUTOQUOT,1.0,0.383831,0.204132,0.150679,0.366311,0.508323,0.152027,0.219448,0.118688,0.17855,0.176657,0.075224,0.095022,0.044824,0.134825,0.099243,0.269872,0.091483,0.058182
KBA05_BAUMAX,0.383831,1.0,0.162271,0.122296,0.549989,0.28647,0.128973,0.171743,0.088552,0.148344,0.160119,0.075233,0.121101,0.08016,0.142783,0.132736,0.248707,0.106755,0.063261
KBA05_DIESEL,0.204132,0.162271,1.0,0.080456,0.162997,0.149412,0.112813,0.104329,0.066027,0.128905,0.071707,0.091265,0.124947,0.083041,0.048032,0.220788,0.13553,0.12159,0.054358
KBA05_FRAU,0.150679,0.122296,0.080456,1.0,0.122057,0.121396,0.156388,0.101524,0.052828,0.079196,0.059067,0.051937,0.036304,0.095453,0.074177,0.084803,0.097353,0.027916,0.050964
KBA05_GBZ,0.366311,0.549989,0.162997,0.122057,1.0,0.270407,0.125568,0.171121,0.100433,0.150499,0.147698,0.066585,0.101861,0.03674,0.132081,0.110259,0.311013,0.103281,0.064825
KBA05_KRSAQUOT,0.508323,0.28647,0.149412,0.121396,0.270407,1.0,0.128534,0.160946,0.166592,0.149016,0.154233,0.056927,0.072677,0.032348,0.108186,0.09226,0.219807,0.076031,0.043046
KBA05_KRSKLEIN,0.152027,0.128973,0.112813,0.156388,0.125568,0.128534,1.0,0.133933,0.05764,0.063904,0.074486,0.078501,0.152645,0.475631,0.047797,0.309265,0.091746,0.107418,0.233877
KBA05_KRSOBER,0.219448,0.171743,0.104329,0.101524,0.171121,0.160946,0.133933,1.0,0.082176,0.069702,0.100968,0.044349,0.168331,0.286847,0.054913,0.254888,0.094268,0.090535,0.141617
KBA05_KRSVAN,0.118688,0.088552,0.066027,0.052828,0.100433,0.166592,0.05764,0.082176,1.0,0.085692,0.067214,0.077449,0.073308,0.037588,0.049195,0.069301,0.059701,0.073223,0.036308
KBA05_KRSZUL,0.17855,0.148344,0.128905,0.079196,0.150499,0.149016,0.063904,0.069702,0.085692,1.0,0.099739,0.345793,0.062154,0.055741,0.239928,0.073824,0.09822,0.048187,0.036793


In [107]:
to_drop = ['KBA05_GBZ','KBA05_AUTOQUOT','KBA05_MAXSEG','KBA05_HERSTTEMP']

Dropped:

- 'KBA05_GBZ': The number of buildings is correlated to the main type of building.
- 'KBA05_AUTOQUOT': Correlated with KRSAUTOQUOT, that conveys the same information
- 'KBA05_MAXSEG': Correlate with too many other features
- 'KBA05_HERSTTEMP': Conveys similar information to MAXHERST, also strongly correlated to other variables

In [110]:
rr3_col_list = list(set(kept) - set(to_drop))

## Community 

Community actually contains continuous variables (should be identified earlier), therefore these will be kept, with exception of `RELAT_AB`, since it will convey similar information as `ARBEIT`

In [None]:
community_col_list = cramer_frame_list[1].columns.drop(['RELAT_AB','Attribute'])

### Household

In [123]:
cramer_frame_list[2].name

'Household'

In [124]:
cramer_frame_list[2].shape

(66, 67)

In [129]:
cramer_frame_list[2] = cramer_frame_list[2].set_index('Attribute') 

In [139]:
cramer_frame_list[2].columns

Index(['ALTER_HH', 'HH_EINKOMMEN_SCORE', 'D19_KONSUMTYP',
       'D19_GESAMT_OFFLINE_DATUM', 'D19_GESAMT_ONLINE_DATUM',
       'D19_GESAMT_DATUM', 'D19_BANKEN_DATUM', 'D19_TELKO_DATUM',
       'D19_VERSAND_OFFLINE_DATUM', 'D19_VERSAND_ONLINE_DATUM',
       'D19_VERSAND_DATUM', 'D19_VERSI_OFFLINE_DATUM',
       'D19_VERSI_ONLINE_DATUM', 'D19_VERSI_DATUM',
       'D19_GESAMT_ONLINE_QUOTE_12', 'D19_BANKEN_ONLINE_QUOTE_12',
       'D19_VERSAND_ONLINE_QUOTE_12', 'W_KEIT_KIND_HH', 'WOHNDAUER_2008',
       'D19_BANKEN_DIREKT', 'D19_BANKEN_GROSS', 'D19_BANKEN_LOKAL',
       'D19_BANKEN_REST', 'D19_BEKLEIDUNG_GEH', 'D19_BEKLEIDUNG_REST',
       'D19_BILDUNG', 'D19_BIO_OEKO', 'D19_BUCH_CD', 'D19_DIGIT_SERV',
       'D19_DROGERIEARTIKEL', 'D19_ENERGIE', 'D19_FREIZEIT', 'D19_GARTEN',
       'D19_GESAMT_ANZ_12', 'D19_GESAMT_ANZ_24', 'D19_HANDWERK',
       'D19_HAUS_DEKO', 'D19_KINDERARTIKEL', 'D19_KONSUMTYP_MAX',
       'D19_KOSMETIK', 'D19_LEBENSMITTEL', 'D19_LETZTER_KAUF_BRANCHE',
       'D19_LOT

In [135]:
((cramer_frame_list[2] > 0.3) & (cramer_frame_list[2] <= 1)).sum().sort_values(ascending=False).head(15)

D19_GESAMT_ONLINE_DATUM        7
D19_VERSAND_ONLINE_DATUM       6
KK_KUNDENTYP                   6
D19_GESAMT_ANZ_12              5
D19_VERSAND_ONLINE_QUOTE_12    5
D19_VERSAND_DATUM              5
D19_GESAMT_ANZ_24              5
D19_GESAMT_DATUM               5
D19_VERSAND_ANZ_12             4
D19_VERSAND_ANZ_24             4
D19_BANKEN_DIREKT              4
D19_GESAMT_ONLINE_QUOTE_12     4
D19_TELKO_DATUM                3
D19_KONSUMTYP                  3
D19_BANKEN_GROSS               3
dtype: int64

In [137]:
cramer_frame_list[2][['D19_GESAMT_ONLINE_DATUM',
                        'D19_VERSAND_ONLINE_DATUM',
                        'KK_KUNDENTYP']]

Unnamed: 0_level_0,D19_GESAMT_ONLINE_DATUM,D19_VERSAND_ONLINE_DATUM,KK_KUNDENTYP
Attribute,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
ALTER_HH,0.064625,0.061768,0.064595
HH_EINKOMMEN_SCORE,0.028009,0.025992,0.027124
D19_KONSUMTYP,0.140448,0.132302,0.174603
D19_GESAMT_OFFLINE_DATUM,0.074976,0.077721,0.149024
D19_GESAMT_ONLINE_DATUM,1.000000,0.785371,0.449465
...,...,...,...
D19_WEIN_FEINKOST,0.018653,0.018869,0.025603
ANZ_KINDER,0.039939,0.040074,0.028883
ANZ_STATISTISCHE_HAUSHALTE,0.065215,0.060175,0.067158
KK_KUNDENTYP,0.449465,0.395137,1.000000
