# Clustering on mixed data-types

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). 

Clustering is applied to identify and segment data into groups with similar characteristics. A primary limitation of popular approaches is that they are usually suited to either numerical data (e.g. PCA) or categorical data (e.g. MCA), but not to both at once.

In practice, there are several strategies for inferring clusters and segments with these popular methods, including:
 - Combining the output of independent PCA and MCA for downstream classification.
 - Converting and rescaling categorical data and then using PCA. This is acceptable for ordinal categories (e.g. weak, neutral, strong = -1, 0, 1) but fails for nominal categories with no relative order (e.g. location or industry codes).
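
To make the ordinal versus nominal distinction concrete, here is a minimal sketch on toy data. The column names and values are hypothetical and are not taken from the case-study dataset; the point is only that an integer scale suits ordered categories, while one-hot columns avoid inventing an order for unordered ones.

```python
import pandas as pd

# Toy data (hypothetical): one ordinal column and one nominal column.
toy = pd.DataFrame({
    "sentiment": ["weak", "neutral", "strong", "neutral"],
    "industry": ["retail", "mining", "finance", "retail"],
})

# Ordinal: the order is meaningful, so an explicit integer scale is defensible.
sentiment_scale = {"weak": -1, "neutral": 0, "strong": 1}
toy["sentiment_num"] = toy["sentiment"].map(sentiment_scale)

# Nominal: no order exists, so integer codes would invent distances.
# One-hot columns keep every pair of distinct categories equally far apart.
industry_onehot = pd.get_dummies(toy["industry"], prefix="industry", dtype=int)

encoded = pd.concat([toy[["sentiment_num"]], industry_onehot], axis=1)
print(encoded)
```

One-hot encoding grows quickly for high-cardinality columns, which is one reason a mixed-type distance can be the more practical route.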

In [None]:
import os
import pandas as pd
import numpy as np
#import tensorflow_data_validation as tfdv

### 1. Case Study: Auto insurance claims [dataset](https://www.kaggle.com/xiaomengsun/car-insurance-claim-data)

In [None]:
# load data
DATA_PATH = os.path.join(os.getcwd(),'../_data')
df = pd.read_csv(os.path.join(DATA_PATH,'car-insurance-claim-data/car_insurance_claim.csv'),low_memory=False,)

# convert object to numerical
df[['INCOME','HOME_VAL','BLUEBOOK','OLDCLAIM', 'CLM_AMT',]] = df[['INCOME','HOME_VAL','BLUEBOOK','OLDCLAIM', 'CLM_AMT',]].replace('[^.0-9]', '', regex=True,).astype(float).fillna(0)

# clean textual classes
for col in df.columns:
    if df[col].dtype == 'O':
        df[col] = df[col].str.upper().replace('Z_','',regex=True).replace('[^A-Z]','',regex=True)
        
data_types = {f:t for f,t in zip(df.columns,df.dtypes)}

df[:2]

### 2. Feature Encoding & Engineering

***what features do we have?***
After some exploration I found this [data dictionary](https://rpubs.com/data_feelings/msda_data621_hw4), which gives the following key definitions:
- BLUEBOOK = car resale value
- MVR_PTS = [Motor Vehicle Record (MVR) points](https://www.wnins.com/losscontrolbulletins/MVREvaluation.pdf), which summarise an individual's past driving history, indicating violations and accidents over a specified period
- TIF = time in force (customer lifetime)
- YOJ = years in job
- CLM_FREQ = number of claims in the past 5 years
- OLDCLAIM = total dollar value of claims in the past 5 years

In [None]:
# copy df
tdf = df.copy()

In [None]:
feat_id = ['ID']
feat_account = ['KIDSDRIV', 'BIRTH', 'AGE', 'HOMEKIDS', 'YOJ', 'INCOME',
                'PARENT1', 'HOME_VAL', 'MSTATUS', 'GENDER', 'EDUCATION', 'OCCUPATION','URBANICITY','TIF',]
feat_car = [ 'TRAVTIME', 'CAR_USE','MVR_PTS','BLUEBOOK','CAR_TYPE', 'RED_CAR','REVOKED','CAR_AGE',]
feat_claims = ['OLDCLAIM', 'CLM_FREQ', 'CLAIM_FLAG','CLM_AMT',]

data_meta = pd.DataFrame(tdf.nunique(),columns=['num'],index=None).sort_values('num').reset_index()
data_meta.columns = ['name','num']
data_meta[:2]

***transform binary variables***

In [None]:
from sklearn.preprocessing import LabelEncoder

In [None]:
le = LabelEncoder()
for feat in data_meta.loc[data_meta['num']<=2,'name'].values:
    tdf[feat] = le.fit_transform(tdf[feat])

In [None]:
Xy = tdf[feat_account+feat_car+feat_claims].copy()

### 3. EDA
- multiple account years (renewals)

In [None]:
Xy[:2]

### 4. Similarity

[Gower distance](https://www.jstor.org/stable/2528823?seq=1) was proposed as a measure of dissimilarity between records described by a mix of variable types (numerical and categorical).
- [docs](https://rdrr.io/cran/gower/api)
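
As a rough illustration of the idea, the unweighted form of the measure averages a per-feature dissimilarity, $d_{ij} = \frac{1}{p}\sum_{f=1}^{p} d_{ij}^{(f)}$, where a numerical feature contributes $|x_{if} - x_{jf}| / R_f$ (with $R_f$ the range of feature $f$) and a categorical feature contributes 0 for a match and 1 otherwise. The sketch below implements that simplified form with NumPy and pandas. It assumes no missing values and equal feature weights, the function name `gower_matrix_sketch` and the toy values are hypothetical, and it is not claimed to reproduce the `gower` package's output exactly.

```python
import numpy as np
import pandas as pd

def gower_matrix_sketch(frame: pd.DataFrame) -> np.ndarray:
    """Unweighted Gower dissimilarity for a frame without missing values."""
    n_rows, n_feats = frame.shape
    total = np.zeros((n_rows, n_rows))
    for name in frame.columns:
        col = frame[name]
        if pd.api.types.is_numeric_dtype(col):
            # Numerical feature: absolute difference scaled by the column range.
            values = col.to_numpy(dtype=float)
            value_range = values.max() - values.min()
            diff = np.abs(values[:, None] - values[None, :])
            # A constant column contributes zero dissimilarity.
            part = diff / value_range if value_range > 0 else np.zeros_like(diff)
        else:
            # Categorical feature: 0 when the categories match, 1 otherwise.
            codes = pd.factorize(col)[0]
            part = (codes[:, None] != codes[None, :]).astype(float)
        total += part
    # Average the per-feature dissimilarities.
    return total / n_feats

# Toy rows (hypothetical values) reusing three of the dataset's column names.
toy = pd.DataFrame({
    "AGE": [20, 30, 40],
    "INCOME": [10.0, 20.0, 50.0],
    "CAR_TYPE": ["VAN", "SUV", "VAN"],
})
toy_gd = gower_matrix_sketch(toy)
print(np.round(toy_gd, 3))
```

Because every per-feature term lies in [0, 1], the averaged distance does too, which keeps numerical and categorical columns on a comparable footing.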

In [None]:
import gower

# # Example: to find the most similar record to i=0, in rows i=1...i=100
# gower.gower_topn(Xy.iloc[0:1,:], Xy.iloc[1:100,], n = 1)
# Xy.iloc[[0,42],:].T

In [None]:
try: 
    gd = np.load(os.path.join(DATA_PATH,'car-insurance-claim-data/car_insurance_claim_gower_distance.npy'))
    print('Gower distances loaded from file.')
except FileNotFoundError:
    print('Calculating Gower distances...5-8 minutes')
    %time gd = gower.gower_matrix(Xy[:])
    np.save(os.path.join(DATA_PATH,'car-insurance-claim-data/car_insurance_claim_gower_distance.npy'),gd)

### 5. Clustering

In [None]:
# k-medoids Python implementation in scikit-learn-extra
# https://scikit-learn-extra.readthedocs.io/en/latest/install.html
# C++ build tools may be required on windows
# https://www.scivision.dev/python-windows-visual-c-14-required/

# or k-medoids in pyclustering
# https://pypi.org/project/pyclustering/
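
For intuition about what a k-medoids routine does with a precomputed distance matrix, here is a minimal NumPy sketch. It uses the simple alternating (assign, then update) variant rather than PAM's swap search, and it is not the pyclustering or scikit-learn-extra implementation. The function name `kmedoids_sketch`, its parameters, and the toy points are assumptions made for illustration.

```python
import numpy as np

def kmedoids_sketch(dist, k, n_iter=100, seed=0):
    """Alternating k-medoids on a precomputed square distance matrix."""
    rng = np.random.default_rng(seed)
    n = dist.shape[0]
    # Start from k distinct row indices.
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest medoid.
        labels = np.argmin(dist[:, medoids], axis=1)
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size == 0:
                continue  # keep the old medoid if a cluster empties
            # Update step: pick the member with the smallest total
            # distance to the rest of its cluster.
            within = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break  # converged
        medoids = new_medoids
    labels = np.argmin(dist[:, medoids], axis=1)
    return medoids, labels

# Toy example: two well-separated groups on a line.
points = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])
toy_dist = np.abs(points[:, None] - points[None, :])
medoids, labels = kmedoids_sketch(toy_dist, k=2)
print(medoids, labels)
```

Since medoids are always actual rows of the data, the algorithm only ever needs pairwise distances, which is why it pairs naturally with a Gower matrix.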

In [None]:
from pyclustering.cluster.kmedoids import kmedoids
from pyclustering.cluster import cluster_visualizer,cluster_visualizer_multidim
from pyclustering.cluster.silhouette import silhouette

In [None]:
k = 3
n = 3000
sample = np.nan_to_num(gd[:n,:n])

In [None]:
# import networkx as nx

# G = nx.from_numpy_matrix(sample)
# edge_list = [i for i in nx.generate_edgelist(G,data=True)]

***Cluster with a fixed $k$***

In [None]:
# initialise k distinct random medoids (row indices into the sample)
# the number of medoids also sets the number of clusters
initial_medoids = np.random.choice(n, size=k, replace=False).tolist()
kmedoids_instance = kmedoids(sample, initial_medoids, data_type='distance_matrix')

# run cluster analysis and obtain results
kmedoids_instance.process()
clusters = kmedoids_instance.get_clusters()
medoids = kmedoids_instance.get_medoids()

# score
# The silhouette value is a measure of how similar an object
# is to its own cluster compared to other clusters
score = silhouette(data=sample, clusters=clusters,data_type='distance_matrix').process().get_score()
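
The silhouette value described in the comments above can be sketched directly from a distance matrix. For point $i$, let $a$ be the mean distance to the other members of its own cluster and $b$ the smallest mean distance to any other cluster; then $s(i) = (b - a) / \max(a, b)$. The helper below is a hypothetical illustration, not pyclustering's implementation. It assumes at least two clusters and a flat label vector, whereas pyclustering returns clusters as lists of indices, so a label vector would have to be built first.

```python
import numpy as np

def silhouette_from_distances(dist, labels):
    """Per-point silhouette values from a precomputed distance matrix."""
    labels = np.asarray(labels)
    unique = np.unique(labels)
    scores = np.zeros(len(labels))
    for i in range(len(labels)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # convention: a singleton cluster scores 0
        # a: mean distance to the other members of the same cluster
        # (dist[i, i] is 0, so it does not affect the sum).
        a = dist[i, own].sum() / (n_own - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(dist[i, labels == c].mean() for c in unique if c != labels[i])
        denom = max(a, b)
        scores[i] = (b - a) / denom if denom > 0 else 0.0
    return scores

# Toy example: two well-separated groups on a line.
points = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])
toy_dist = np.abs(points[:, None] - points[None, :])
toy_labels = np.array([0, 0, 0, 1, 1, 1])
toy_scores = silhouette_from_distances(toy_dist, toy_labels)
print(np.round(toy_scores, 3))
```

Values near 1 indicate points that sit comfortably inside their cluster, values near 0 indicate points on a boundary, and negative values suggest a point may be closer to a different cluster.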

***Cluster using the silhouette score to find the best $k$***
- [the visualizer seems to work with paired lists only](https://github.com/annoviko/pyclustering/issues/499)

In [None]:
# search using silhouette score
# https://codedocs.xyz/annoviko/pyclustering/classpyclustering_1_1cluster_1_1silhouette_1_1silhouette__ksearch.html
from pyclustering.cluster.center_initializer import random_center_initializer
from pyclustering.cluster.silhouette import silhouette_ksearch_type, silhouette_ksearch

search_instance = silhouette_ksearch(sample, kmin=4, kmax=7,
                                     algorithm=silhouette_ksearch_type.KMEDOIDS).process()

amount = search_instance.get_amount()
scores = search_instance.get_scores()
print("Scores: '%s'" % str(scores))

# Create instance of K-Medoids algorithm with optimal settings from search
# draw distinct indices so that no medoid is selected twice
initial_medoids = np.random.choice(n, size=amount, replace=False).tolist()
kmedoids_instance = kmedoids(sample, initial_medoids, data_type='distance_matrix')
kmedoids_instance.process()

# capture results
clusters = kmedoids_instance.get_clusters()
medoids = kmedoids_instance.get_medoids()

In [None]:
len(clusters), sample.shape

In [None]:
visualizer = cluster_visualizer()
visualizer.append_clusters(clusters, data=None)
visualizer.show()

# *References*

- https://towardsdatascience.com/clustering-on-mixed-type-data-8bbd0a2569c3
- https://medium.com/@rumman1988/clustering-categorical-and-numerical-datatype-using-gower-distance-ab89b3aa90d9
- https://www.researchgate.net/post/What_is_the_best_way_for_cluster_analysis_when_you_have_mixed_type_of_data_categorical_and_scale
- https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise_distances.html
- https://discuss.analyticsvidhya.com/t/clustering-technique-for-mixed-numeric-and-categorical-variables/6753
- https://stackoverflow.com/questions/24196897/r-distance-matrix-and-clustering-for-mixed-and-large-dataset
- https://www.analyticsvidhya.com/blog/2015/11/easy-methods-deal-categorical-variables-predictive-modeling/
- https://towardsdatascience.com/all-about-categorical-variable-encoding-305f3361fd02
- https://rpubs.com/data_feelings/msda_data621_hw4
- https://pypi.org/project/gower/
- https://scikit-learn-extra.readthedocs.io/en/latest/generated/sklearn_extra.cluster.KMedoids.html
- https://towardsdatascience.com/k-medoids-clustering-on-iris-data-set-1931bf781e05
- https://www.rdocumentation.org/packages/cluster/versions/2.1.0/topics/pam
- https://github.com/annoviko/pyclustering/issues/499
- https://stats.stackexchange.com/questions/2717/clustering-with-a-distance-matrix