<h1>Gower distance calculation for Python</h1>

<p>The data under study is not always a uniform matrix of numerical values. Sometimes you need to work with data that mixes variable types (e.g., categorical, boolean, numerical).
</p>
<p>This notebook proposes a refactoring for scipy's pdist function in order to support the Gower mixed dissimilarity.
</p>
<p>For more details about the Gower distance, please visit: <a href="http://members.cbio.mines-paristech.fr/~jvert/svn/bibli/local/Gower1971general.pdf">Gower, J.C., 1971, A General Coefficient of Similarity and Some of Its Properties</a>.</p>
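
As a brief orientation, the dissimilarity computed in this notebook can be written as a weighted average of per-column dissimilarities. The symbols below ($d_{ijk}$, $R_k$) are notation chosen here for illustration; they mirror the `sij`, `wij` and `V` variables used in the code of section 4, whereas Gower's paper states the coefficient in terms of similarities.

```latex
d_{ij} = \frac{\sum_{k} w_{ijk}\, d_{ijk}}{\sum_{k} w_{ijk}},
\qquad
d_{ijk} =
\begin{cases}
\dfrac{|x_{ik} - x_{jk}|}{R_k}, \quad R_k = \max_i x_{ik} - \min_i x_{ik} & \text{numeric column } k \\[6pt]
0 & \text{categorical column } k,\ x_{ik} = x_{jk} \\
1 & \text{categorical column } k,\ x_{ik} \neq x_{jk}
\end{cases}
```

The weight $w_{ijk}$ is set to 0 when rows $i$ and $j$ cannot be compared on column $k$, for example because a value is missing.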


<h2>1. Generate some data with mixed types</h2>

In [1]:
import numpy as np
import pandas as pd
from sklearn.preprocessing import scale
xrange = range  # scipy._lib.six is not available in recent SciPy releases; on Python 3, range is the equivalent

import os
import math

X=pd.DataFrame({'age':[21,21,19,30,21,21,19,30],
'gender':['M','M','N','M','F','F','F','F'],
'civil_status':['MARRIED','SINGLE','SINGLE','SINGLE','MARRIED','SINGLE','WIDOW','DIVORCED'],
'salary':[3000.0,1200.0 ,32000.0,1800.0 ,2900.0 ,1100.0 ,10000.0,1500.0],
'children':[True,False,True,True,True,False,False,True],
'available_credit':[2200,100,22000,1100,2000,100,6000,2200]})

print(X)

   age gender civil_status   salary  children  available_credit
0   21      M      MARRIED   3000.0      True              2200
1   21      M       SINGLE   1200.0     False               100
2   19      N       SINGLE  32000.0      True             22000
3   30      M       SINGLE   1800.0      True              1100
4   21      F      MARRIED   2900.0      True              2000
5   21      F       SINGLE   1100.0     False               100
6   19      F        WIDOW  10000.0     False              6000
7   30      F     DIVORCED   1500.0      True              2200


<h2>2. Gower auxiliary functions</h2>
These helpers are needed because NumPy does not support column-wise operations on matrices that mix data types.

In [12]:
import numbers

# Normalize each numeric column by dividing it by its maximum value
def normalize_mixed_data_columns(arr, dtypes):

    if isinstance(arr, pd.DataFrame):
        arr = np.asmatrix(arr.copy())
    elif isinstance(arr, np.ndarray):
        arr = arr.copy()
    else:
        raise ValueError('A DataFrame or ndarray must be provided.')
    rows, cols = arr.shape
    for col in xrange(cols):
        if np.issubdtype(dtypes[col], np.number):
            col_max = arr[:, col].max() + 0.0  # Converts it to double
            # Caution: if arr has an integer dtype, the in-place assignment
            # below casts the quotients back to integers (truncating them).
            if cols > 1:
                arr[:, col] = arr[:, col] / col_max
            else:
                arr = arr / col_max
    return arr

#This is to obtain the range (max-min) values of each numeric column
def calc_range_mixed_data_columns(arr, dtypes):
    rows,cols = arr.shape
    
    result = np.zeros(cols)
    for col in xrange(cols):
        if np.issubdtype(dtypes[col],np.number):
            result[col]= arr[:,col].max()-arr[:,col].min()
    return( result)
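
Both helpers decide whether a column is numeric with `np.issubdtype(..., np.number)`. The short check below is a sketch using illustrative dtypes (not read from `X` itself). It shows that NumPy does not count booleans as numbers, so a column such as `children` takes the categorical path.

```python
import numpy as np

# Illustrative dtypes for the kinds of columns that appear in X.
column_dtypes = {
    "age": np.dtype("int64"),
    "salary": np.dtype("float64"),
    "children": np.dtype("bool"),
    "gender": np.dtype("object"),
}

# Same test the helper functions apply to every column.
is_numeric = {
    name: bool(np.issubdtype(dtype, np.number))
    for name, dtype in column_dtypes.items()
}
print(is_numeric)
# {'age': True, 'salary': True, 'children': False, 'gender': False}
```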
    


<h2>3. Refactoring of pdist</h2>
This version adds support for mixed data. The helper functions in SciPy's distance module are private and cannot be overridden, so they are copied and adapted here.

In [13]:
# This private helper would need to change inside SciPy's distance module to support mixed data
def _copy_array_if_base_present(a):
    if a.base is not None:
        return a.copy()
    elif np.issubdtype(a.dtype, np.float32):  # np.issubsctype was removed in NumPy 2.0
        return np.array(a, dtype=np.double)
    else:
        return a

# Same as above: object (mixed) arrays are copied instead of being cast to double
def _convert_to_double(X):
    if X.dtype == object:  # the np.object alias was removed in NumPy 1.24
        return X.copy()
    if X.dtype != np.double:
        X = X.astype(np.double)
    if not X.flags.contiguous:
        X = X.copy()
    return X

#This function was copied from pdist because it is private. No change in the original function.
def _validate_vector(u, dtype=None):
    # XXX Is order='c' really necessary?
    u = np.asarray(u, dtype=dtype, order='c').squeeze()
    # Ensure values such as u=1 and u=[1] still return 1-D arrays.
    u = np.atleast_1d(u)
    if u.ndim > 1:
        raise ValueError("Input vector should be 1-D.")
    return u


#An excerpt from pdist function only with the basic structure to call the gower dist. 
#The original pdist must be adapted for Gower using this as example.
def pdist_(X, metric='euclidean', p=2, w=None, V=None, VI=None):
    X = np.asarray(X, order='c')

    # The C code doesn't do striding.
    X = _copy_array_if_base_present(X)

    s = X.shape
    if len(s) != 2:
        raise ValueError('A 2-dimensional array must be passed.')

    m, n = s
    dm = np.zeros((m * (m - 1)) // 2, dtype=np.double)

    #(...)
    dfun = metric
    k = 0
    for i in xrange(0, m - 1):
        for j in xrange(i + 1, m):
            dm[k] = dfun(X[i], X[j],V=V,w=w,VI=VI)
            k = k + 1

    return dm
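
The double loop above fills `dm` in "condensed" order: every pair (i, j) with i < j, row by row. As a sketch of how a pair maps to a position in that vector, the hypothetical helper `condensed_index` below (not part of SciPy or of this notebook) derives the position in closed form and checks it against the same loop.

```python
def condensed_index(i, j, m):
    """Position of the pair (i, j), with i < j, in a condensed vector for m rows."""
    if not 0 <= i < j < m:
        raise ValueError("expected 0 <= i < j < m")
    # Rows 0..i-1 contribute (m-1) + (m-2) + ... + (m-i) entries before row i starts.
    return m * i - i * (i + 1) // 2 + (j - i - 1)

m = 8  # number of rows in the example data
pairs = {}
k = 0
for i in range(m - 1):          # same traversal as the loop in pdist_
    for j in range(i + 1, m):
        pairs[(i, j)] = k
        k += 1

matches = all(condensed_index(i, j, m) == pos for (i, j), pos in pairs.items())
print(len(pairs), matches)  # 28 True
```

`squareform` performs the reverse step, expanding the 28 condensed values into the symmetric 8 x 8 matrix.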



<h2>4. The Gower distance function</h2>

In [14]:
from scipy.spatial.distance import pdist, squareform
import numbers

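def gower(xi, xj, V=None, w=None, VI=None):
    # V:  (max - min) range of each numeric column
    # VI: dtype of each column
    # w:  optional per-column weights (defaults to 1 for every column)
    cols = len(xj)

    xi = _validate_vector(xi)
    xj = _validate_vector(xj)

    if V is None:
        raise ValueError('An array with the (max-min) ranges for each numeric column must be passed in V.')

    if VI is None:
        raise ValueError('An array with the dtype of each column must be passed in VI.')

    if w is None:
        w = [1] * cols

    sum_sij = 0.0
    sum_wij = 0.0
    for col in xrange(cols):
        # A comparison involving a missing value gets weight 0. Skipping it
        # also keeps a NaN difference from turning the whole sum into NaN.
        if pd.isnull(xi[col]) or pd.isnull(xj[col]):
            continue

        if np.issubdtype(VI[col], np.number):
            # Numeric column: absolute difference scaled by the column range
            sij = abs(xi[col] - xj[col]) / V[col]
        else:
            # Categorical or boolean column: 0 if equal, 1 otherwise
            sij = 0 if xi[col] == xj[col] else 1

        sum_sij += w[col] * sij
        sum_wij += w[col]

    return sum_sij / sum_wij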




<h2>5. Get the Gower distance matrix</h2>

In [15]:
#It's necessary to obtain the columns dtypes
dtypes = X.dtypes
#It's necessary to normalize between 0 and 1
Xn=normalize_mixed_data_columns(X,dtypes)

#It's necessary to obtain the range (max-min) values of each numeric column
ranges=calc_range_mixed_data_columns(Xn,dtypes)

print("Dissimilarities :")
D=np.tril(squareform(pdist_(Xn, gower,V=ranges,VI=dtypes)))
print(D)
#To get the similarities, do 1-D

Dissimilarities :
[[0.         0.         0.         0.         0.         0.
  0.         0.        ]
 [0.35902381 0.         0.         0.         0.         0.
  0.         0.        ]
 [0.67073985 0.69643032 0.         0.         0.         0.
  0.         0.        ]
 [0.31787418 0.3138769  0.6552807  0.         0.         0.
  0.         0.        ]
 [0.16872811 0.52362903 0.67280129 0.4824794  0.         0.
  0.         0.        ]
 [0.52622985 0.16720604 0.6969697  0.48108294 0.35750174 0.
  0.         0.        ]
 [0.59697856 0.45600237 0.74042795 0.74818608 0.43237334 0.28987508
  0.         0.        ]
 [0.47778758 0.65396349 0.8151941  0.34332284 0.31210361 0.4878362
  0.57476615 0.        ]]
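
The first off-diagonal entry can be checked by hand. The sketch below uses plain Python only, with rows 0 and 1 and the column ranges copied from the table in section 1. It averages the six per-column dissimilarities with equal weights.

```python
# Rows 0 and 1 of X, copied from the printed table.
row0 = {"age": 21, "gender": "M", "civil_status": "MARRIED",
        "salary": 3000.0, "children": True, "available_credit": 2200}
row1 = {"age": 21, "gender": "M", "civil_status": "SINGLE",
        "salary": 1200.0, "children": False, "available_credit": 100}

# (max - min) of each numeric column over all eight rows.
ranges = {"age": 30 - 19,
          "salary": 32000.0 - 1100.0,
          "available_credit": 22000 - 100}

total = 0.0
for col in row0:
    if col in ranges:
        # Numeric column: range-scaled absolute difference.
        total += abs(row0[col] - row1[col]) / ranges[col]
    else:
        # Categorical or boolean column: simple mismatch.
        total += 0.0 if row0[col] == row1[col] else 1.0

distance = total / len(row0)
print(round(distance, 8))  # 0.35902381
```

The two mismatches (`civil_status`, `children`) contribute 1 each, `salary` contributes 1800/30900 and `available_credit` contributes 2100/21900, which gives the 0.35902381 shown in `D[1, 0]`.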


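<h2>6. The equivalent code in R</h2>
Using the daisy function from the cluster package. The R output agrees with the Python output above except for three entries (rows 2 and 6, 2 and 7, and 6 and 7 in R's numbering). In each of those pairs, children is FALSE for both rows. This is consistent with daisy treating logical columns as asymmetric binary variables, which leaves such joint absences out of the average.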

<p>
<code>
library(cluster)

age=c(21,21,19,30,21,21,19,30)
gender=c('M','M','N','M','F','F','F','F')
civil_status=c('MARRIED','SINGLE','SINGLE','SINGLE','MARRIED','SINGLE','WIDOW','DIVORCED')
salary=c(3000.0,1200.0 ,32000.0,1800.0 ,2900.0 ,1100.0 ,10000.0,1500.0)
children=c(TRUE,FALSE,TRUE,TRUE,TRUE,FALSE,FALSE,TRUE)
available_credit=c(2200,100,22000,1100,2000,100,6000,2200)
X=data.frame(age,gender,civil_status,salary,children,available_credit)

D=daisy(X,metric="gower")

print(D)

Dissimilarities :
          1         2         3         4         5         6         7
2 0.3590238                                                            
3 0.6707398 0.6964303                                                  
4 0.3178742 0.3138769 0.6552807                                        
5 0.1687281 0.5236290 0.6728013 0.4824794                              
6 0.5262298 0.2006472 0.6969697 0.4810829 0.3575017                    
7 0.5969786 0.5472028 0.7404280 0.7481861 0.4323733 0.3478501          
8 0.4777876 0.6539635 0.8151941 0.3433228 0.3121036 0.4878362 0.5747661


</code>
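
The R value for rows 2 and 6 (pandas index 1 and 5) differs from the Python value: 0.2006472 against 0.16720604. The sketch below is one way to reproduce both numbers. It assumes daisy handles the logical `children` column as an asymmetric binary variable, meaning a FALSE/FALSE pair is dropped from the average. That reading is an interpretation of the outputs, and `gower_pair` is a hypothetical helper written for this illustration.

```python
# Rows with pandas index 1 and 5, copied from the data in section 1.
row_a = {"age": 21, "gender": "M", "civil_status": "SINGLE",
         "salary": 1200.0, "children": False, "available_credit": 100}
row_b = {"age": 21, "gender": "F", "civil_status": "SINGLE",
         "salary": 1100.0, "children": False, "available_credit": 100}

# (max - min) of each numeric column over all eight rows.
ranges = {"age": 30 - 19,
          "salary": 32000.0 - 1100.0,
          "available_credit": 22000 - 100}

def gower_pair(a, b, asymmetric=()):
    """Equal-weight Gower dissimilarity; columns listed in `asymmetric`
    are skipped when both values are False (joint absence)."""
    total, weight = 0.0, 0.0
    for col in a:
        if col in asymmetric and not a[col] and not b[col]:
            continue  # joint absence carries no weight
        if col in ranges:
            total += abs(a[col] - b[col]) / ranges[col]
        else:
            total += 0.0 if a[col] == b[col] else 1.0
        weight += 1.0
    return total / weight

symmetric = gower_pair(row_a, row_b)
asymmetric = gower_pair(row_a, row_b, asymmetric=("children",))
print(round(symmetric, 8), round(asymmetric, 7))  # 0.16720604 0.2006472
```

The only change between the two results is the denominator: six columns in the symmetric case, five when the shared FALSE is ignored.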


In [16]:
# Paths and datasets

path = os.getcwd()
datasetpath = os.path.join(path,"data")
flagdataset = os.path.join(datasetpath,"flag2.csv")
zoodataset = os.path.join(datasetpath,"zoo1.csv")


# Data Set to DataFrame
flagdf = pd.read_csv(flagdataset)
zoodf = pd.read_csv(zoodataset)
zoodropdf = zoodf.drop(["animal name","type"],axis=1)
zoodropdf

   hair  feathers  eggs  milk  airborne  aquatic  predator  toothed  backbone  breathes  venomous  fins  legs  tail  domestic  catsize
0     1         0     0     1         0        0         1        1         1         1         0     0     4     0         0        1
1     1         0     0     1         0        0         0        1         1         1         0     0     4     1         0        1
2     0         0     1     0         0        1         1        1         1         0         0     1     0     1         0        0
3     1         0     0     1         0        0         1        1         1         1         0     0     4     0         0        1
4     1         0     0     1         0        0         1        1         1         1         0     0     4     1         0        1
5     1         0     0     1         0        0         0        1         1         1         0     0     4     1         0        1
6     1         0     0     1         0        0         0        1         1         1         0     0     4     1         1        1
7     0         0     1     0         0        1         0        1         1         0         0     1     0     1         1        0
8     0         0     1     0         0        1         1        1         1         0         0     1     0     1         0        0
9     1         0     0     1         0        0         0        1         1         1         0     0     4     0         1        0


In [18]:
print(zoodropdf.dtypes)


#It's necessary to obtain the columns dtypes
dtypes = zoodropdf.dtypes
#It's necessary to normalize between 0 and 1
zoodropdfnm=normalize_mixed_data_columns(zoodropdf,dtypes)

#It's necessary to obtain the range (max-min) values of each numeric column
ranges=calc_range_mixed_data_columns(zoodropdfnm,dtypes)

print("Dissimilarities :")
Dzoo=np.tril(squareform(pdist_(zoodropdfnm, gower,V=ranges,VI=dtypes)))
print(Dzoo)
#To get the similarities, do 1-D

hair        int64
feathers    int64
eggs        int64
milk        int64
airborne    int64
aquatic     int64
predator    int64
toothed     int64
backbone    int64
breathes    int64
venomous    int64
fins        int64
legs        int64
tail        int64
domestic    int64
catsize     int64
dtype: object
Dissimilarities :
[[0.     0.     0.     ... 0.     0.     0.    ]
 [0.125  0.     0.     ... 0.     0.     0.    ]
 [0.5    0.5    0.     ... 0.     0.     0.    ]
 ...
 [0.0625 0.0625 0.4375 ... 0.     0.     0.    ]
 [0.4375 0.4375 0.4375 ... 0.5    0.     0.    ]
 [0.5625 0.4375 0.4375 ... 0.5    0.25   0.    ]]
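
Every distance printed above is a multiple of 1/16, which is what you would see if each of the 16 columns contributed only 0 or 1. One possible explanation is that `normalize_mixed_data_columns` writes the quotients back into an integer matrix, so fractional values are truncated. This is an inference from the output, not something the notebook states. The sketch below reproduces that effect on a hypothetical `legs` column. The values and the maximum of 8 are assumptions made for illustration, not read from `zoo1.csv`.

```python
import numpy as np

# Hypothetical integer column, shaped like one column of an all-integer matrix.
legs = np.array([[4], [0], [2], [8]])

# In-place assignment into an integer array casts the float quotients back to int.
truncated = legs.copy()
truncated[:, 0] = truncated[:, 0] / 8.0
print(truncated.ravel().tolist())  # [0, 0, 0, 1]

# Casting to float first keeps the fractional values.
as_float = legs.astype(float)
as_float[:, 0] = as_float[:, 0] / 8.0
print(as_float.ravel().tolist())   # [0.5, 0.0, 0.25, 1.0]
```

If this is what happens here, passing `zoodropdf.astype(float)` to the normalization step would keep `legs` as a graded numeric variable instead of collapsing it to 0 or 1.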


In [24]:
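# x and y both refer to row 2, so the distance below is 0.0 by construction
# (a self-distance sanity check). The original first assignment of row 5 to x
# was immediately overwritten and is dropped here.
x = y = zoodropdf.iloc[2]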
print(x)
gower(x,y,V=ranges,VI=dtypes)

hair        0
feathers    0
eggs        1
milk        0
airborne    0
aquatic     1
predator    1
toothed     1
backbone    1
breathes    0
venomous    0
fins        1
legs        0
tail        1
domestic    0
catsize     0
Name: 2, dtype: int64


0.0