### Calculating Distance with Categorical Predictors

This notebook solves Problem 7.1 from Chapter 7 of the following book:

Data Mining for Business Analytics: Concepts, Techniques, and Applications in Python, First Edition.

Galit Shmueli, Peter C. Bruce, Peter Gedeck, and Nitin R. Patel

© 2020 John Wiley & Sons, Inc. Published 2020 by John Wiley & Sons, Inc.

#### Importing Libraries

In [2]:
import pandas as pd
import numpy as np
import sklearn as skl
from sklearn.neighbors import NearestNeighbors

Printing versions of libraries

In [4]:
print('pandas version: {}'.format(pd.__version__))
print('numpy version: {}'.format(np.__version__))
print('sklearn version: {}'.format(skl.__version__))

pandas version: 1.5.3
numpy version: 1.23.5
sklearn version: 1.2.1


In [36]:
df = pd.DataFrame({'category':['Stat', 'Other', 'IT'],
                   'num_years':[1, 1.1, 1]})

# add a 1-based customer id as the first column
df.insert(0, 'customer', df.index + 1)

display(df)

Unnamed: 0,customer,category,num_years
0,1,Stat,1.0
1,2,Other,1.1
2,3,IT,1.0


Declaring a function to calculate the Euclidean distance between each existing customer and a new prospect customer

In [37]:
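def Calculate_Distance(existing_customers, prospect_customer):
    # feature vector of the prospect (skip the first column, the customer id)
    point_2 = prospect_customer.iloc[:, 1:].to_numpy(dtype=float)[0]

    result = []
    for _, row in existing_customers.iterrows():
        # feature vector of one existing customer
        point_1 = row.iloc[1:].to_numpy(dtype=float)
        # Euclidean distance between the two feature vectors
        distance = np.linalg.norm(point_1 - point_2)
        # use positional access for the id; row[0] is deprecated in newer pandas
        result.append([int(row.iloc[0]), distance])

    distance_df = pd.DataFrame(result, columns=['Customer', 'Distance with Prospect'])
    display(distance_df)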
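
The row-by-row loop above can also be written with NumPy broadcasting. The following is a minimal sketch, not part of the original solution: `euclidean_to_prospect` is a hypothetical helper name, and it assumes the first column is the customer id and every remaining column is numeric. The small `demo` frame reproduces the two-dummy encoding used in this notebook.

```python
import numpy as np
import pandas as pd


def euclidean_to_prospect(existing, prospect):
    """Hypothetical helper: distance from each existing row to one prospect row.

    Assumes column 0 is an identifier and all other columns are numeric features.
    """
    features = existing.iloc[:, 1:].to_numpy(dtype=float)
    target = prospect.iloc[0, 1:].to_numpy(dtype=float)
    # broadcasting subtracts the prospect vector from every existing row
    distances = np.linalg.norm(features - target, axis=1)
    return pd.DataFrame({'Customer': existing.iloc[:, 0].to_numpy(),
                         'Distance with Prospect': distances})


# illustrative data: the same values as the two-dummy encoding in this notebook
demo = pd.DataFrame({'customer': [1, 2, 3],
                     'num_years': [1.0, 1.1, 1.0],
                     'category_Other': [0, 1, 0],
                     'category_Stat': [1, 0, 0]})

result = euclidean_to_prospect(demo.iloc[:2], demo.iloc[2:])
print(result)
```

Unlike `Calculate_Distance`, this sketch returns the data frame instead of displaying it, which makes the result easier to reuse or test.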

Using two dummies: the categorical predictor is encoded as 2 binary variables (`drop_first=True` drops the first category, IT, which becomes the reference level)

In [38]:
df_2binary = pd.get_dummies(df, drop_first=True)
display(df_2binary)

existing_customers = df_2binary.iloc[:2,]
prospect_customer = df_2binary.iloc[2:,]

Calculate_Distance(existing_customers, prospect_customer)

Unnamed: 0,customer,num_years,category_Other,category_Stat
0,1,1.0,0,1
1,2,1.1,1,0
2,3,1.0,0,0


Unnamed: 0,Customer,Distance with Prospect
0,1,1.0
1,2,1.004988
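
These two values can be checked by hand. With the features ordered as (num_years, category_Other, category_Stat), the prospect is (1.0, 0, 0), customer 1 is (1.0, 0, 1), and customer 2 is (1.1, 1, 0):

```latex
\begin{aligned}
d(\text{customer 1}, \text{prospect}) &= \sqrt{(1.0 - 1.0)^2 + (0 - 0)^2 + (1 - 0)^2} = \sqrt{1} = 1 \\
d(\text{customer 2}, \text{prospect}) &= \sqrt{(1.1 - 1.0)^2 + (1 - 0)^2 + (0 - 0)^2} = \sqrt{1.01} \approx 1.004988
\end{aligned}
```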


In [44]:
# use NearestNeighbors from scikit-learn to compute knn
knn = NearestNeighbors(n_neighbors=1)
knn.fit(existing_customers.iloc[:, 1:])
distances, indices = knn.kneighbors(prospect_customer.iloc[:,1:])

# indices is a 2D array with one row per query point; we only need the first row
existing_customers.iloc[indices[0], :]

Unnamed: 0,customer,num_years,category_Other,category_Stat
0,1,1.0,0,1
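
`kneighbors` also returns the distances, which can be compared against the manual calculation. The sketch below is an illustration under stated assumptions: it rebuilds the two-dummy data inline so that it runs on its own, and it asks for both neighbors (`n_neighbors=2`) so the full ranking is visible. The `existing`, `prospect`, and `neighbors` names are introduced here for the example only.

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors

features = ['num_years', 'category_Other', 'category_Stat']

# illustrative copies of the two-dummy data used in this notebook
existing = pd.DataFrame({'customer': [1, 2],
                         'num_years': [1.0, 1.1],
                         'category_Other': [0, 1],
                         'category_Stat': [1, 0]})
prospect = pd.DataFrame({'customer': [3],
                         'num_years': [1.0],
                         'category_Other': [0],
                         'category_Stat': [0]})

# the default metric (Minkowski with p=2) is the Euclidean distance
knn = NearestNeighbors(n_neighbors=2)
knn.fit(existing[features])
distances, indices = knn.kneighbors(prospect[features])

# attach each neighbor's distance to its row, nearest first
neighbors = existing.iloc[indices[0]].assign(distance=distances[0])
print(neighbors)
```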


Using three dummies: the categorical predictor is encoded as 3 binary variables, one per category

In [45]:
df_3binary = pd.get_dummies(df)
display(df_3binary)

existing_customers = df_3binary.iloc[:2,]
prospect_customer = df_3binary.iloc[2:,]

Calculate_Distance(existing_customers, prospect_customer)

Unnamed: 0,customer,num_years,category_IT,category_Other,category_Stat
0,1,1.0,0,0,1
1,2,1.1,0,1,0
2,3,1.0,1,0,0


Unnamed: 0,Customer,Distance with Prospect
0,1,1.414214
1,2,1.417745
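
The same hand check works for the three-dummy encoding. With the features ordered as (num_years, category_IT, category_Other, category_Stat), the prospect is (1.0, 1, 0, 0), customer 1 is (1.0, 0, 0, 1), and customer 2 is (1.1, 0, 1, 0):

```latex
\begin{aligned}
d(\text{customer 1}, \text{prospect}) &= \sqrt{(1.0 - 1.0)^2 + (0 - 1)^2 + (0 - 0)^2 + (1 - 0)^2} = \sqrt{2} \approx 1.414214 \\
d(\text{customer 2}, \text{prospect}) &= \sqrt{(1.1 - 1.0)^2 + (0 - 1)^2 + (1 - 0)^2 + (0 - 0)^2} = \sqrt{2.01} \approx 1.417745
\end{aligned}
```

Compared with the two-dummy case, each squared distance grows by exactly 1, because the prospect's own IT dummy now contributes to the difference. The ranking of the two customers is therefore unchanged in this example.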


In [46]:
# use NearestNeighbors from scikit-learn to compute knn
knn = NearestNeighbors(n_neighbors=1)
knn.fit(existing_customers.iloc[:, 1:])
distances, indices = knn.kneighbors(prospect_customer.iloc[:,1:])

# indices is a 2D array with one row per query point; we only need the first row
existing_customers.iloc[indices[0], :]

Unnamed: 0,customer,num_years,category_IT,category_Other,category_Stat
0,1,1.0,0,0,1
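
Both encodings pick customer 1 here, but they do not treat the categories identically. The sketch below is an illustrative addition rather than part of the book's solution: it compares the distance between every pair of categories under each encoding, ignoring `num_years`. The `pairwise_distances` function is a hypothetical helper defined only for this example.

```python
import numpy as np
import pandas as pd

labels = ['IT', 'Other', 'Stat']
categories = pd.Series(labels, name='category')


def pairwise_distances(encoded):
    """Euclidean distance between every pair of rows of a dummy-coded frame."""
    x = encoded.to_numpy(dtype=float)
    # difference between every pair of rows, shape (n, n, n_features)
    diff = x[:, None, :] - x[None, :, :]
    return pd.DataFrame(np.sqrt((diff ** 2).sum(axis=2)),
                        index=labels, columns=labels)


# drop_first=True removes the IT column, so IT is encoded as all zeros
two_dummies = pairwise_distances(pd.get_dummies(categories, drop_first=True))
three_dummies = pairwise_distances(pd.get_dummies(categories))

print(two_dummies.round(3))
print(three_dummies.round(3))
```

With two dummies, the reference category IT lies at distance 1 from Other and from Stat, while Other and Stat are √2 apart. With three dummies, every pair of distinct categories is √2 apart, so no category is treated as closer to the others.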
