# Feature Selection Notebook

## Introduction
The data contains a very large number of features. Depending on the clustering strategy, this can lead to the curse of dimensionality. Refer to [this answer on Stack Exchange](https://stats.stackexchange.com/questions/232500/how-do-i-know-my-k-means-clustering-algorithm-is-suffering-from-the-curse-of-dim) for a more in-depth discussion.
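
To make the concern concrete, here is a minimal sketch of one symptom of the curse of dimensionality: distances between random points become nearly indistinguishable as the number of dimensions grows. It uses synthetic uniform data rather than the customer data, and `relative_contrast` is a hypothetical helper written for this illustration only.

```python
import numpy as np


def relative_contrast(n_points: int, n_dims: int, seed: int = 0) -> float:
    """Return (max - min) / min of the Euclidean distances from one random
    query point to n_points random points in the unit hypercube."""
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return float((dists.max() - dists.min()) / dists.min())


# The contrast between the nearest and farthest point shrinks as dimensions are added
for n_dims in (2, 10, 100, 1000):
    print(n_dims, round(relative_contrast(1000, n_dims), 3))
```

When the contrast approaches zero, distance-based methods such as k-means have little signal left to separate points, which is the motivation for reducing the feature set first.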

## Methodology
One way to select features and reduce dimensionality is to use the information in the customer data that is not present in the general dataset (i.e. the `CUSTOMER_GROUP`, `ONLINE_PURCHASE` and `PRODUCT_GROUP` columns) as targets, and to keep the features that best separate the categories of these columns. This way, we choose the features that are most likely to segment customers, and we can use them to understand the behaviour of both the general population and the customer population.

In [1]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.feature_selection import SelectKBest, mutual_info_classif
import pickle

In [2]:
df_customers = pd.read_parquet('data/refined/customers_data.parquet')

In [3]:
df_customers.head()

Unnamed: 0,LNR,ANZ_HAUSHALTE_AKTIV,ANZ_HH_TITEL,ANZ_PERSONEN,ANZ_TITEL,KBA13_ANZAHL_PKW,MIN_GEBAEUDEJAHR,D19_BANKEN_DIREKT,D19_BANKEN_GROSS,D19_BANKEN_LOKAL,...,BALLRAUM,EWDICHTE,INNENSTADT,GEBAEUDETYP_RASTER,KKK,MOBI_REGIO,ONLINE_AFFINITAET,CUSTOMER_GROUP,ONLINE_PURCHASE,PRODUCT_GROUP
0,9626,1.0,0.0,2.0,0.0,1201.0,1992.0,0.0,0.0,0.0,...,3.0,2.0,4.0,4.0,1.0,4.0,3.0,MULTI_BUYER,0,COSMETIC_AND_FOOD
2,143872,1.0,0.0,1.0,0.0,433.0,1992.0,0.0,0.0,0.0,...,7.0,4.0,1.0,3.0,3.0,3.0,1.0,MULTI_BUYER,0,COSMETIC_AND_FOOD
3,143873,0.0,0.0,0.0,0.0,755.0,1992.0,0.0,0.0,0.0,...,7.0,1.0,7.0,4.0,3.0,4.0,2.0,MULTI_BUYER,0,COSMETIC
4,143874,7.0,0.0,4.0,0.0,513.0,1992.0,2.0,0.0,1.0,...,3.0,4.0,4.0,3.0,4.0,3.0,5.0,MULTI_BUYER,0,FOOD
5,143888,1.0,0.0,2.0,0.0,1167.0,1992.0,0.0,0.0,0.0,...,7.0,5.0,8.0,4.0,2.0,3.0,3.0,MULTI_BUYER,0,COSMETIC_AND_FOOD


In [4]:
df_customers[['CUSTOMER_GROUP','PRODUCT_GROUP']]

Unnamed: 0,CUSTOMER_GROUP,PRODUCT_GROUP
0,MULTI_BUYER,COSMETIC_AND_FOOD
2,MULTI_BUYER,COSMETIC_AND_FOOD
3,MULTI_BUYER,COSMETIC
4,MULTI_BUYER,FOOD
5,MULTI_BUYER,COSMETIC_AND_FOOD
...,...,...
191647,MULTI_BUYER,COSMETIC_AND_FOOD
191648,SINGLE_BUYER,COSMETIC
191649,MULTI_BUYER,COSMETIC_AND_FOOD
191650,SINGLE_BUYER,FOOD


In [5]:
for col in ['CUSTOMER_GROUP','PRODUCT_GROUP']:

    display(df_customers[col].value_counts(dropna = False))

MULTI_BUYER     98547
SINGLE_BUYER    41751
Name: CUSTOMER_GROUP, dtype: int64

COSMETIC_AND_FOOD    75446
FOOD                 33779
COSMETIC             31073
Name: PRODUCT_GROUP, dtype: int64

In [6]:
# Feature matrix: every column except LNR and the three customer-only target columns
X = df_customers.drop(columns = ['LNR','CUSTOMER_GROUP','PRODUCT_GROUP','ONLINE_PURCHASE'])

In [8]:
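selected = []

# Fit one selector per customer-only target and keep its 10 highest-scoring features
for col in ['ONLINE_PURCHASE','PRODUCT_GROUP','CUSTOMER_GROUP']:

    print(f'Running for {col}')

    y = df_customers[col]

    # mutual_info_classif scores each feature by its estimated mutual information with y
    selector = SelectKBest(mutual_info_classif, k = 10)

    selector.fit(X, y)

    selected.append(selector.get_feature_names_out())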

Running for ONLINE_PURCHASE
Running for PRODUCT_GROUP
Running for CUSTOMER_GROUP


In [9]:
# Flatten the three per-target arrays of feature names into a single list
selected_features = list(np.concatenate(selected))

In [10]:
selected_features = list(set(selected_features))

In [11]:
len(selected_features)

16

In [12]:
selected_features

['FINANZ_VORSORGER',
 'CJT_TYP_6',
 'WOHNDAUER_2008',
 'NATIONALITAET_KZ',
 'VERS_TYP',
 'CJT_TYP_3',
 'OST_WEST_KZ',
 'CJT_TYP_2',
 'STRUKTURTYP',
 'FINANZ_SPARER',
 'CJT_TYP_5',
 'D19_BANKEN_DATUM',
 'CJT_KATALOGNUTZER',
 'PRAEGENDE_JUGENDJAHRE',
 'CJT_TYP_4',
 'ANREDE_KZ']
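
Because the list above is built from a `set`, its order is arbitrary and can change between runs. If a stable order matters downstream, an order-preserving union is one option. The following is a minimal sketch with made-up feature names (`per_target`, `f1` to `f5` and the `TARGET_*` keys are hypothetical); it also counts how often each feature was picked, which shows where the per-target selections overlap.

```python
from collections import Counter

# Hypothetical per-target selections, standing in for the three arrays in `selected`
per_target = {
    "TARGET_A": ["f1", "f2", "f3"],
    "TARGET_B": ["f2", "f4"],
    "TARGET_C": ["f3", "f2", "f5"],
}

# Flatten all picks, keeping the order in which they were selected
all_picks = [name for names in per_target.values() for name in names]

# dict.fromkeys drops duplicates while preserving first-seen order
unique_features = list(dict.fromkeys(all_picks))

# How many targets picked each feature
pick_counts = Counter(all_picks)

print(len(all_picks), len(unique_features))
print(unique_features)
print(pick_counts.most_common(2))
```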

Each of the three targets contributes its 10 best features, giving 30 selections in total. After removing the features that were selected for more than one target, we are left with 16 candidates that are useful for separating the categories in the customer data. We will try to cluster using these variables.  
We only need to pay attention to the numeric columns afterwards, since the mutual information criterion might be less useful for judging how well this type of data separates the categories.

In [13]:
with open('data/trusted/selected_features.pkl','wb') as file:

    pickle.dump(selected_features, file)