# Segmentation EDA Notebook

## Introduction
This notebook contains the exploratory data analysis (EDA) that precedes preprocessing of the demographic data.  
It serves as the basis for the preprocessing steps taken later.  

At this stage, one of the most important tasks is feature selection for clustering, since there are many candidate variables.  
Selecting the most important features helps avoid the "curse of dimensionality" in clustering and also reduces computation time.

## Steps contained
1. Missing values handling
2. Duplicates check / handling
3. Feature selection - Sparsely populated columns, multicollinearity (obtained from profile)
4. Feature selection - Manual exclusion of variables not relevant to the business

**Note: To manually select the variables that will be kept, data from the DIAS Information Levels - Attributes spreadsheet will be used.**

## Context
The target company for this project is an organics company that sells products by mail order.

# Imports

In [90]:
import pandas as pd
import numpy as np
import missingno as msno
from pandas_profiling import ProfileReport

import csv

import seaborn as sns

In [2]:
sns.set(rc={'figure.figsize':(10,5)})

# Reading Data

In [3]:
census = pd.read_csv('data/raw/Udacity_AZDIAS_052018.csv', sep = ';')


# Type handling for the columns pandas warned about

In [4]:
col_check = census.iloc[:, [18,19]].applymap(type)

In [5]:
census.iloc[:, [18,19]].head(10)

Unnamed: 0,CAMEO_DEUG_2015,CAMEO_INTL_2015
0,,
1,8.0,51.0
2,4.0,24.0
3,2.0,12.0
4,6.0,43.0
5,8.0,54.0
6,4.0,22.0
7,2.0,14.0
8,1.0,13.0
9,1.0,15.0


For `CAMEO_DEUG_2015`, the likely cause is that missing values without an explicit code lead pandas to read part of the column as `float`, while other rows are read as strings.  
This will be fixed temporarily so the column can be preprocessed along with the others.  

For the `CAMEO_INTL_2015` column, the values are actually strings. Since there is no encoding for NaN values in it, the column will be left as is.
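A quick way to quantify how mixed a column is consists of counting the Python type of every cell. The sketch below does this on a small hypothetical frame (`toy`), not on the census data; the column names are reused only for readability and the values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame mimicking columns that mix floats, strings and NaN
toy = pd.DataFrame({
    "CAMEO_DEUG_2015": [np.nan, 8.0, "4", "X"],
    "CAMEO_INTL_2015": [np.nan, 51.0, "24", "XX"],
}, dtype=object)

# For each column, count how many cells hold each Python type
type_counts = toy.apply(
    lambda col: col.map(lambda v: type(v).__name__).value_counts()
)
print(type_counts)
```

Note that NaN counts as `float`, which is why a column with missing values and strings shows up as mixed.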

In [6]:
col_check[col_check['CAMEO_DEUG_2015'] != float]

Unnamed: 0,CAMEO_DEUG_2015,CAMEO_INTL_2015
2048,<class 'str'>,<class 'str'>
2050,<class 'str'>,<class 'str'>
2052,<class 'str'>,<class 'str'>
2053,<class 'str'>,<class 'str'>
2054,<class 'str'>,<class 'str'>
...,...,...
886779,<class 'str'>,<class 'str'>
886780,<class 'str'>,<class 'str'>
886781,<class 'str'>,<class 'str'>
886782,<class 'str'>,<class 'str'>


In [7]:
census.loc[[2048,2050,2052], ['CAMEO_DEUG_2015']]

Unnamed: 0,CAMEO_DEUG_2015
2048,4
2050,3
2052,7


In [9]:
census[census['CAMEO_DEUG_2015'] == 'X'][['CAMEO_DEUG_2015','CAMEO_INTL_2015']].drop_duplicates()

Unnamed: 0,CAMEO_DEUG_2015,CAMEO_INTL_2015
2511,X,XX


`X` values (and `XX` in `CAMEO_INTL_2015`) will be treated as `NaN`.

In [10]:
census['CAMEO_DEUG_2015'] = census['CAMEO_DEUG_2015'].replace('X',np.nan)

census['CAMEO_INTL_2015'] = census['CAMEO_INTL_2015'].replace('XX',np.nan)
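As an alternative sketch to the explicit replacement above (this is not what the notebook does), `pd.to_numeric` with `errors='coerce'` converts every value that cannot be parsed as a number to NaN in a single step. The series below is hypothetical.

```python
import pandas as pd

# Hypothetical values: a numeric string, a float, the 'X' placeholder and a missing value
raw = pd.Series(["4", 8.0, "X", None], dtype=object)

# Anything that cannot be parsed as a number becomes NaN
cleaned = pd.to_numeric(raw, errors="coerce")
print(cleaned.tolist())
```

The trade-off is that coercion also silences unexpected values, whereas replacing only `'X'` and `'XX'` keeps any other anomaly visible.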

In [11]:
census['CAMEO_DEUG_2015'] = census['CAMEO_DEUG_2015'].fillna(-1).astype(int)

In [12]:
census.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891221 entries, 0 to 891220
Columns: 366 entries, LNR to ALTERSKATEGORIE_GROB
dtypes: float64(267), int32(1), int64(93), object(5)
memory usage: 2.4+ GB


# Missing Values Handling

## Creating replacements for NaN Values in each column

In [13]:
nan_val_df = pd.read_excel('data/raw/DIAS Attributes - Values 2017.xlsx', header = 1, usecols = 'B:E', dtype = str)

nan_val_df[['Attribute','Description']] = nan_val_df[['Attribute','Description']].ffill()

# Assuming, from manual inspection of the 'Values' spreadsheet, that NaNs are identified by substrings in the Meaning column
nan_val_df = nan_val_df[nan_val_df['Meaning'].str.contains('unknown|no transaction[s]? known', regex=True, na=False)]

# Raw string avoids an invalid-escape warning for \s
nan_val_df['Value'] = nan_val_df['Value'].str.replace(r'\s', '', regex=True)

nan_val_df['Value'] = nan_val_df['Value'].str.split(',')

nan_val_map = dict(zip(nan_val_df['Attribute'], nan_val_df['Value']))

In [14]:
# Nested mapping {column: {code: NaN}}, the structure expected by DataFrame.replace
nested_nan_map = {k: {int(digit): np.nan for digit in v} for k, v in nan_val_map.items()}

In [15]:
census = census.replace(nested_nan_map)
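The mapping logic above can be exercised on a toy frame to see what the nested dictionary does. Everything in this sketch (`toy_nan_map`, `toy`, and the codes chosen as "unknown") is hypothetical and only illustrates the structure; the real codes come from the DIAS spreadsheet.

```python
import numpy as np
import pandas as pd

# Hypothetical excerpt: attribute -> list of codes whose meaning is "unknown"
toy_nan_map = {"AGER_TYP": ["-1", "0"], "ANREDE_KZ": ["-1"]}

# Same shape as nested_nan_map: {column: {code: NaN}}
toy_nested = {col: {int(code): np.nan for code in codes}
              for col, codes in toy_nan_map.items()}

toy = pd.DataFrame({"AGER_TYP": [-1, 0, 2], "ANREDE_KZ": [1, -1, 2]})

# Each column only has its own codes replaced
toy_clean = toy.replace(toy_nested)
print(toy_clean)
```

Because the replacement is scoped per column, a code such as `0` is only turned into NaN in the columns where the documentation marks it as unknown.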

# Checking for duplicates on census Data

In [16]:
# Full duplicates
census.duplicated().sum()

0

In [17]:
# Checking if an ID shows up more than once
census.LNR.duplicated().sum()

0

There appear to be no duplicates, either as full rows or by ID (`LNR`).

## NaN Values on columns

In [18]:
nan_proportion = census.isna()\
                        .mean()\
                        .sort_values(ascending = False)

In [19]:
nan_proportion.head()

ALTER_KIND4                 0.998648
TITEL_KZ                    0.997576
ALTER_KIND3                 0.993077
D19_TELKO_ONLINE_DATUM      0.990796
D19_BANKEN_OFFLINE_DATUM    0.977911
dtype: float64

In [20]:
(nan_proportion > 0.8).sum()

15

In [21]:
(nan_proportion > 0.7).sum()

21

In [22]:
nan_proportion[nan_proportion >= 0.7]

ALTER_KIND4                  0.998648
TITEL_KZ                     0.997576
ALTER_KIND3                  0.993077
D19_TELKO_ONLINE_DATUM       0.990796
D19_BANKEN_OFFLINE_DATUM     0.977911
ALTER_KIND2                  0.966900
D19_TELKO_ANZ_12             0.962713
D19_BANKEN_ANZ_12            0.933252
D19_TELKO_ANZ_24             0.927052
D19_VERSI_ANZ_12             0.921532
D19_TELKO_OFFLINE_DATUM      0.919092
ALTER_KIND1                  0.909048
D19_BANKEN_ANZ_24            0.891025
D19_VERSI_ANZ_24             0.871879
D19_BANKEN_ONLINE_DATUM      0.815715
D19_BANKEN_DATUM             0.761125
AGER_TYP                     0.760196
D19_TELKO_DATUM              0.747063
EXTSEL992                    0.733996
D19_VERSAND_ANZ_12           0.715840
D19_VERSAND_OFFLINE_DATUM    0.711645
dtype: float64

An 80% threshold is used because a 70% threshold would also drop columns that refer directly to mail ordering, such as `D19_VERSAND_ANZ_12`.
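To make the threshold comparison concrete, the sketch below lists the columns that sit between the two cut-offs. The NaN shares are copied from the output above (subset only); the variable names `nan_share` and `only_at_70` are introduced here for illustration.

```python
import pandas as pd

# NaN shares copied from the listing above (only the columns near the cut-offs)
nan_share = pd.Series({
    "D19_BANKEN_ONLINE_DATUM": 0.815715,
    "D19_BANKEN_DATUM": 0.761125,
    "AGER_TYP": 0.760196,
    "D19_TELKO_DATUM": 0.747063,
    "EXTSEL992": 0.733996,
    "D19_VERSAND_ANZ_12": 0.715840,
    "D19_VERSAND_OFFLINE_DATUM": 0.711645,
})

# Columns that a 70% threshold would drop but an 80% threshold keeps
only_at_70 = nan_share[(nan_share >= 0.7) & (nan_share < 0.8)].index.tolist()
print(only_at_70)
```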

In [23]:
census.drop(columns = nan_proportion[nan_proportion >= 0.8].index, inplace=True)

# Checking for multicollinearity

We should check for multicollinearity especially among the categorical variables (ordinal variables are categorical nonetheless), since there are so many of them.  
In this context, too many variables can lead to the curse of dimensionality, making observations hard to tell apart in the feature space.  

Comparing approx. 300 variables against each other would be too expensive computationally, since every variable has to be evaluated against every other one (about 45,000 pairs).  
As a proxy, we will use the information levels and calculate the collinearity within each level, keeping only the most important features of each level. 
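The saving from working within information levels can be estimated by counting pairs. This is a rough sketch: the per-level column counts are taken from the value counts reported later in this notebook, and the real number of comparisons will differ slightly once numeric variables are excluded.

```python
from math import comb

# Per-level column counts, as reported by the value counts later in this notebook
level_sizes = {
    "PLZ8": 123, "Household": 68, "Microcell (RR3_ID)": 54, "Person": 50,
    "Microcell (RR4_ID)": 12, "Building": 9, "RR1_ID": 6,
    "Postcode": 3, "Community": 3,
}

total_cols = sum(level_sizes.values())

# Every column against every other column
all_pairs = comb(total_cols, 2)

# Only pairs of columns that share an information level
within_pairs = sum(comb(n, 2) for n in level_sizes.values())

print(total_cols, all_pairs, within_pairs)
```

Under these counts, the within-level comparison needs roughly a quarter of the pairs of the full comparison, at the cost of missing associations between variables of different levels.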

## Fetching level-information data by column

In [25]:
att_info = pd.read_excel('data/raw/DIAS Information Levels - Attributes 2017.xlsx', usecols = 'B:E', header = 1)

In [26]:
att_info['Information level'] = att_info['Information level'].ffill()\
                                                            .bfill()

In [27]:
att_info['Information level'].nunique()

10

In [28]:
att_info = att_info[['Information level','Attribute']]

display(att_info)

Unnamed: 0,Information level,Attribute
0,Person,AGER_TYP
1,Person,ALTERSKATEGORIE_GROB
2,Person,ANREDE_KZ
3,Person,CJT_GESAMTTYP
4,Person,FINANZ_MINIMALIST
...,...,...
308,Community,ARBEIT
309,Community,EINWOHNER
310,Community,GKZ
311,Community,ORTSGR_KLS9


Not all variables are described in the Information Levels

In [86]:
diff_cols = list(np.setdiff1d(census.columns.values, att_info['Attribute'].values))

print(diff_cols)

['AKT_DAT_KL', 'ALTERSKATEGORIE_FEIN', 'ANZ_KINDER', 'ANZ_STATISTISCHE_HAUSHALTE', 'CAMEO_INTL_2015', 'CJT_KATALOGNUTZER', 'CJT_TYP_1', 'CJT_TYP_2', 'CJT_TYP_3', 'CJT_TYP_4', 'CJT_TYP_5', 'CJT_TYP_6', 'D19_BANKEN_DIREKT', 'D19_BANKEN_GROSS', 'D19_BANKEN_LOKAL', 'D19_BANKEN_REST', 'D19_BEKLEIDUNG_GEH', 'D19_BEKLEIDUNG_REST', 'D19_BILDUNG', 'D19_BIO_OEKO', 'D19_BUCH_CD', 'D19_DIGIT_SERV', 'D19_DROGERIEARTIKEL', 'D19_ENERGIE', 'D19_FREIZEIT', 'D19_GARTEN', 'D19_GESAMT_ANZ_12', 'D19_GESAMT_ANZ_24', 'D19_HANDWERK', 'D19_HAUS_DEKO', 'D19_KINDERARTIKEL', 'D19_KONSUMTYP_MAX', 'D19_KOSMETIK', 'D19_LEBENSMITTEL', 'D19_LETZTER_KAUF_BRANCHE', 'D19_LOTTO', 'D19_NAHRUNGSERGAENZUNG', 'D19_RATGEBER', 'D19_REISEN', 'D19_SAMMELARTIKEL', 'D19_SCHUHE', 'D19_SONSTIGE', 'D19_SOZIALES', 'D19_TECHNIK', 'D19_TELKO_MOBILE', 'D19_TELKO_ONLINE_QUOTE_12', 'D19_TELKO_REST', 'D19_TIERARTIKEL', 'D19_VERSAND_ANZ_12', 'D19_VERSAND_ANZ_24', 'D19_VERSAND_REST', 'D19_VERSICHERUNGEN', 'D19_VERSI_ONLINE_QUOTE_12', 'D19_VOLL

In [87]:
# Removing ID
diff_cols.remove('LNR')

Even though these columns are not assigned to any information group, most of them have a prefix that helps assign them to a group manually. Also, the `CJT_TYP_X` columns are actually dummy-encoded versions of a column in the dataset.

### Fixing columns that respect the name structure and/or are in the documentation

In [88]:
new_rows = []

for col in diff_cols:

    if col.startswith('D19'):

        new_val = ('Household', col)

        new_rows.append(new_val)

    if col.startswith('KBA13'):

        new_val = ('PLZ8',col)

        new_rows.append(new_val)

    if col.startswith('CJT'):

        new_val = ('Person',col)

        new_rows.append(new_val)

for _, name in new_rows:

    diff_cols.remove(name)
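The loop above can also be written around a prefix-to-level mapping, which keeps the rules in one place. This is an alternative sketch, not the notebook's code: `PREFIX_LEVELS` and `assign_by_prefix` are names introduced here, and `KBA13_HYPOTHETICAL` is a made-up column used only to exercise the `KBA13` rule.

```python
# Prefix -> information level, mirroring the three rules of the loop above
PREFIX_LEVELS = {"D19": "Household", "KBA13": "PLZ8", "CJT": "Person"}


def assign_by_prefix(columns, prefix_levels=PREFIX_LEVELS):
    """Return (level, column) pairs plus the columns that match no known prefix."""
    assigned, leftover = [], []
    for col in columns:
        # First level whose prefix matches, or None if no prefix applies
        level = next(
            (lvl for pre, lvl in prefix_levels.items() if col.startswith(pre)),
            None,
        )
        if level is None:
            leftover.append(col)
        else:
            assigned.append((level, col))
    return assigned, leftover


assigned, leftover = assign_by_prefix(
    ["D19_BIO_OEKO", "CJT_TYP_1", "KBA13_HYPOTHETICAL", "DSL_FLAG"]
)
print(assigned)
print(leftover)
```

Returning the leftovers separately avoids mutating the input list while iterating over the assignments.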

In [89]:
diff_cols

['AKT_DAT_KL',
 'ALTERSKATEGORIE_FEIN',
 'ANZ_KINDER',
 'ANZ_STATISTISCHE_HAUSHALTE',
 'CAMEO_INTL_2015',
 'DSL_FLAG',
 'EINGEFUEGT_AM',
 'EINGEZOGENAM_HH_JAHR',
 'EXTSEL992',
 'FIRMENDICHTE',
 'GEMEINDETYP',
 'HH_DELTA_FLAG',
 'KK_KUNDENTYP',
 'KOMBIALTER',
 'KONSUMZELLE',
 'MOBI_RASTER',
 'RT_KEIN_ANREIZ',
 'RT_SCHNAEPPCHEN',
 'RT_UEBERGROESSE',
 'SOHO_KZ',
 'STRUKTURTYP',
 'UMFELD_ALT',
 'UMFELD_JUNG',
 'UNGLEICHENN_FLAG',
 'VERDICHTUNGSRAUM',
 'VHA',
 'VHN',
 'VK_DHT4A',
 'VK_DISTANZ',
 'VK_ZG11']

Some columns are still unaccounted for. Each one will be inspected manually to check whether it exists in the data dictionary or has a problem in its name.

In [98]:
# # Exporting column names for classification

# with open('data/raw/unaccounted_cols.csv','w', newline='') as file:

#     writer = csv.writer(file, delimiter=';')

#     writer.writerow(['col_name'])

#     for col in diff_cols:

#         writer.writerow([col])

### Building final classification

In [108]:
diff_cols_remainder = pd.read_csv('data/trusted/unaccounted_cols.csv', sep = ';')

In [114]:
print('Columns unaccounted for:',diff_cols_remainder.shape[0])

Columns unaccounted for: 30


In [110]:
diff_cols_remainder['information'].value_counts()

UNDOCUMENTED          22
Household              4
Person                 2
Microcell (RR4_ID)     1
RR1_ID                 1
Name: information, dtype: int64

Of the 30 columns that were still unaccounted for, 22 **WERE NOT FOUND IN THE DOCUMENTATION**.  
In a real-life scenario, these would be reported to the business or data-sourcing team responsible so that they could be documented.  
In the context of this project, **these columns will be dropped**, since we could not safely interpret them in our segmentation even if they turned out to be useful.

In [120]:
columns_to_drop = diff_cols_remainder[diff_cols_remainder['information'] == 'UNDOCUMENTED']['col_name'].values

In [121]:
census.drop(columns = columns_to_drop, inplace = True)

In [128]:
cols_to_keep_list = list(zip(diff_cols_remainder[diff_cols_remainder['information'] != 'UNDOCUMENTED']['information'],
                         diff_cols_remainder[diff_cols_remainder['information'] != 'UNDOCUMENTED']['col_name']))

In [129]:
new_rows.extend(cols_to_keep_list)

In [None]:
new_rows_frame = pd.DataFrame(new_rows, columns= ['Information level', 'Attribute'])

In [142]:
new_rows_frame

Unnamed: 0,Information level,Attribute
0,Person,CJT_KATALOGNUTZER
1,Person,CJT_TYP_1
2,Person,CJT_TYP_2
3,Person,CJT_TYP_3
4,Person,CJT_TYP_4
...,...,...
64,Microcell (RR4_ID),CAMEO_INTL_2015
65,Household,KK_KUNDENTYP
66,RR1_ID,MOBI_RASTER
67,Person,SOHO_KZ


In [141]:
att_info.shape

(313, 2)

In [143]:
att_info_updated = pd.concat([att_info,new_rows_frame], axis = 0)

In [145]:
# Are all columns of the census table contained in the information table?
np.setdiff1d(census.columns, att_info_updated['Attribute'])

array(['LNR'], dtype=object)

In [168]:
class_census_cols = np.intersect1d(census.columns, att_info_updated['Attribute'])

In [170]:
len(class_census_cols)

328

In [171]:
col_classification = att_info_updated[att_info_updated['Attribute'].isin(class_census_cols)]

In [175]:
col_classification['Information level'].value_counts()

PLZ8                  123
Household              68
Microcell (RR3_ID)     54
Person                 50
Microcell (RR4_ID)     12
Building                9
RR1_ID                  6
Postcode                3
Community               3
Name: Information level, dtype: int64

308         ARBEIT
311    ORTSGR_KLS9
312       RELAT_AB
Name: Attribute, dtype: object

## Calculating Multicollinearity by Information Group

In [160]:
import scipy.stats as stats

In [None]:
# Numeric variables: we won't use these

numeric_vars = ['ANZ_HAUSHALTE_AKTIV',
                'ANZ_HH_TITEL',
                'ANZ_PERSONEN',
                'ANZ_TITEL',
                'GEBURTSJAHR',
                'KBA13_ANZAHL_PKW',
                'MIN_GEBAEUDEJAHR']

In [179]:
test_cols = col_classification[col_classification['Information level'] == 'Community']['Attribute'].values

In [183]:
test_sample = census.drop(columns = numeric_vars + ['LNR'])[test_cols].sample(1000, random_state = 5)

In [187]:
test_sample.dropna(inplace = True)

In [188]:
test_sample

Unnamed: 0,ARBEIT,ORTSGR_KLS9,RELAT_AB
536960,3.0,9.0,4.0
428175,3.0,2.0,5.0
457159,3.0,5.0,1.0
421550,4.0,8.0,5.0
573175,3.0,5.0,3.0
...,...,...,...
466324,3.0,5.0,3.0
616483,2.0,4.0,3.0
862718,3.0,1.0,2.0
846895,3.0,5.0,4.0


In [255]:
import itertools

def calculate_cramers_v(arr1, arr2):
    """Cramér's V association between two categorical arrays."""

    # contingency table of observed counts
    crosstab = stats.contingency.crosstab(arr1, arr2)[1]

    chi2 = stats.chi2_contingency(crosstab)[0]

    # calculating the total number of observations
    n = np.sum(crosstab)

    # smaller table dimension minus one (this normalizes V)
    min_dim = min(crosstab.shape)-1

    # calculating cramer's v
    v = np.sqrt(chi2/(n*min_dim))

    return v

def calculate_frame_cramer_coefs(dataframe):
    """Symmetric matrix of pairwise Cramér's V values for all columns."""

    numpy_frame = dataframe.values

    n_cols = dataframe.shape[1]

    # diagonal is 1.0: every variable is perfectly associated with itself
    matrix = np.eye(n_cols)

    # every unordered pair of column indices
    for i, j in itertools.combinations(range(n_cols), 2):

        v = calculate_cramers_v(numpy_frame[:,i], numpy_frame[:,j])

        matrix[i,j] = v

        matrix[j,i] = v

    return matrix
    

In [256]:
calculate_frame_cramer_coefs(test_sample)

array([[1.        , 0.34453101, 0.42873879],
       [0.34453101, 1.        , 0.34006518],
       [0.42873879, 0.34006518, 1.        ]])
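The raw array is easier to act on once it is labelled and reduced to the pairs above some cut-off. The sketch below copies the matrix from the output above; the `threshold` value is hypothetical and is not a cut-off chosen in this notebook.

```python
import numpy as np
import pandas as pd

# Matrix copied from the output above; labels are the three Community columns
cols = ["ARBEIT", "ORTSGR_KLS9", "RELAT_AB"]
v_matrix = pd.DataFrame(
    [[1.0, 0.34453101, 0.42873879],
     [0.34453101, 1.0, 0.34006518],
     [0.42873879, 0.34006518, 1.0]],
    index=cols, columns=cols,
)

threshold = 0.4  # hypothetical cut-off, for illustration only

# Keep only the upper triangle (k=1 excludes the diagonal) so each pair appears once
upper = v_matrix.where(np.triu(np.ones(v_matrix.shape, dtype=bool), k=1))

# Long format: one row per pair, then keep the pairs above the cut-off
pairs = upper.stack()
strong_pairs = pairs[pairs > threshold]
print(strong_pairs)
```

From each strongly associated pair, one of the two variables could then be considered for removal within its information level.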

In [None]:
test_sample.shape

In [245]:
temp_mat = np.diag([1] * test_sample.shape[1])

In [246]:
temp_mat

array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 1]])

In [247]:
temp_mat[0,1] = 50

In [248]:
temp_mat

array([[ 1, 50,  0],
       [ 0,  1,  0],
       [ 0,  0,  1]])

In [218]:
temp_map

array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 1]])

In [203]:
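import itertools

# Index pairs for the three columns of test_sample
table_range = list(range(test_sample.shape[1]))

list(itertools.combinations(table_range, 2))

[(0, 1), (0, 2), (1, 2)]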

In [None]:
combos = [combo for combo in itertools.combinations(table_range,2)]

In [200]:
np.empty((test_sample.shape[1], test_sample.shape[1]))

array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])

1. All combinations of variables
2. Create the matrix
3. Fill the matrix with the calculated values

In [195]:
calculate_cramers_v(test_sample['ARBEIT'], test_sample['ORTSGR_KLS9'])

0.344531011907817
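For reference, the value computed by `calculate_cramers_v` corresponds to the usual definition of Cramér's V, where $\chi^2$ is the chi-squared statistic of the contingency table, $n$ is the total number of observations, and $r$ and $c$ are the number of rows and columns of the table:

```latex
V = \sqrt{\frac{\chi^2}{n \cdot \min(r - 1,\; c - 1)}}
```

V ranges from 0 (no association) to 1 (perfect association), so the value above suggests a moderate association between `ARBEIT` and `ORTSGR_KLS9` in this sample.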

In [184]:
test_sample.head()

Unnamed: 0,ARBEIT,ORTSGR_KLS9,RELAT_AB
536960,3.0,9.0,4.0
428175,3.0,2.0,5.0
457159,3.0,5.0,1.0
421550,4.0,8.0,5.0
573175,3.0,5.0,3.0



In [162]:
dataset = np.array([[4, 13, 17, 11], [4, 6, 9, 12],
                    [2, 7, 4, 2], [5, 13, 10, 12],
                    [5, 6, 14, 12]])
  
# Chi-squared test statistic of the contingency table
# (continuity correction disabled)
X2 = stats.chi2_contingency(dataset, correction=False)[0]

In [167]:
class_census_cols

array(['AGER_TYP', 'ALTERSKATEGORIE_FEIN', 'ALTERSKATEGORIE_GROB',
       'ALTER_HH', 'ANREDE_KZ', 'ANZ_HAUSHALTE_AKTIV', 'ANZ_HH_TITEL',
       'ANZ_KINDER', 'ANZ_PERSONEN', 'ANZ_STATISTISCHE_HAUSHALTE',
       'ANZ_TITEL', 'ARBEIT', 'BALLRAUM', 'CAMEO_DEUG_2015',
       'CAMEO_DEU_2015', 'CAMEO_INTL_2015', 'CJT_GESAMTTYP',
       'CJT_KATALOGNUTZER', 'CJT_TYP_1', 'CJT_TYP_2', 'CJT_TYP_3',
       'CJT_TYP_4', 'CJT_TYP_5', 'CJT_TYP_6', 'D19_BANKEN_DATUM',
       'D19_BANKEN_DIREKT', 'D19_BANKEN_GROSS', 'D19_BANKEN_LOKAL',
       'D19_BANKEN_ONLINE_QUOTE_12', 'D19_BANKEN_REST',
       'D19_BEKLEIDUNG_GEH', 'D19_BEKLEIDUNG_REST', 'D19_BILDUNG',
       'D19_BIO_OEKO', 'D19_BUCH_CD', 'D19_DIGIT_SERV',
       'D19_DROGERIEARTIKEL', 'D19_ENERGIE', 'D19_FREIZEIT', 'D19_GARTEN',
       'D19_GESAMT_ANZ_12', 'D19_GESAMT_ANZ_24', 'D19_GESAMT_DATUM',
       'D19_GESAMT_OFFLINE_DATUM', 'D19_GESAMT_ONLINE_DATUM',
       'D19_GESAMT_ONLINE_QUOTE_12', 'D19_HANDWERK', 'D19_HAUS_DEKO',
       'D19_KINDE

In [None]:
def calculate_cramers_v(arr1, arr2):

    # Draft variant; assumes the contingency table is built with pd.crosstab
    cross_tabs = pd.crosstab(arr1, arr2)

    chi2 = stats.chi2_contingency(cross_tabs)[0]
    # calculating the total number of observations
    n = cross_tabs.sum().sum()
    # smaller table dimension minus one
    dof = min(cross_tabs.shape)-1
    # calculating cramer's v
    v = np.sqrt(chi2/(n*dof))

    return v
