# Measuring Vehicle Similarity

The goal of this project is to estimate the similarity between vehicles of different brands, ranges, models, etc., which can serve as a basis for future recommendations to automotive dealers.

There are roughly 80,000 car varieties in the central catalogue, but even the most general car dealers see only a fraction of these in the market. Recommendations based purely on their past purchases would therefore often produce a very short list of cars. Measuring the similarity between the different variants lets this list be expanded to include "almost-but-not-exactly-identical" vehicles.

In [208]:
import numpy as np
import pandas as pd
import re
from scipy.sparse import csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from IPython.display import display


import sparse_dot_topn.sparse_dot_topn as ct

pd.options.mode.chained_assignment = None 

### 1. Import sample data
(n = 5000) 

In [209]:
vehicle_catalogue = pd.read_csv("../data/raw/vehicle_catalogue_20190903.csv")
print(vehicle_catalogue.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 13 columns):
vehicle_id       5000 non-null int64
manufacturer     5000 non-null object
range            5000 non-null object
model            5000 non-null object
derivative       5000 non-null object
transmission     5000 non-null object
trim             4791 non-null object
sector           4949 non-null object
bodystyle        5000 non-null object
fueltype         5000 non-null object
fueldelivery     5000 non-null object
engine_litres    5000 non-null float64
drivetrain       5000 non-null object
dtypes: float64(1), int64(1), object(11)
memory usage: 507.9+ KB
None


The vehicle id is a unique key to the car catalogue, so it can be used as the index. The remaining fields describe the vehicle brand, model, derivative and so on.

In [210]:
vehicle_catalogue.set_index("vehicle_id", inplace=True)
display(vehicle_catalogue.head())

vehicle_id,manufacturer,range,model,derivative,transmission,trim,sector,bodystyle,fueltype,fueldelivery,engine_litres,drivetrain
86054,MERCEDES-BENZ,A CLASS,A CLASS HATCHBACK,A200 AMG Line 5dr,MANUAL,AMG LINE,PRESTIGE LOWER,HATCHBACK,PETROL,TURBO,1.3,FRONT WHEEL DRIVE
91229,TOYOTA,C-HR,C-HR HATCHBACK,2.0 Hybrid Dynamic 5dr CVT [Leather/JBL],AUTOMATIC,DYNAMIC,MEDIUM,HATCHBACK,PETROL/ELECTRIC HYBRID,INJECTION,2.0,FRONT WHEEL DRIVE
87304,FORD,KUGA,KUGA ESTATE,1.5 EcoBoost 176 ST-Line Edition 5dr Auto,AUTOMATIC,ST-LINE EDITION,4X4 MEDIUM,ESTATE,PETROL,TURBO,1.5,FOUR WHEEL DRIVE
74189,BMW,M4,M4 CONVERTIBLE,M4 2dr [Competition Pack],MANUAL,M4,PRESTIGE CONVERTIBLE MEDIUM,CONVERTIBLE,PETROL,TURBO,3.0,REAR WHEEL DRIVE
84555,MERCEDES-BENZ,E CLASS,E CLASS CABRIOLET,E450 4Matic AMG Line Premium Plus 2dr 9G-Tronic,AUTOMATIC,AMG LINE,PRESTIGE CONVERTIBLE LARGE,CABRIOLET,PETROL,TURBO,3.0,FOUR WHEEL DRIVE


### 2. Data formatting

There are several ways to measure the similarity between vehicles (collaborative filtering, string distance, etc.). Here I implement one that builds on the TF-IDF method. The approach was inspired by this post: https://bergvca.github.io/2017/10/14/super-fast-string-matching.html

The steps in short:
1. Calculate the tf-idf vector of each entry in the dataset (one car). This vector holds the words/ngrams of the entry, each weighted by how rare it is across the entire dataset. E.g. the word "QASHQAI" is rarer and hence carries more weight than the word "DIESEL", as the latter describes all diesel vehicles, while the former only applies to Nissans.

2. Calculate the similarities between the different tf-idf vectors, and find the top x most similar vehicles for each vehicle.

3. The resulting car-to-car mapping can be used to join different vehicle datasets together (e.g. one dataset lists the vehicles currently on sale that need buyers, the other lists the vehicles those buyers purchased in the past).
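
To make step 1 more concrete, here is a minimal pure-Python sketch of the inverse document frequency part of the weighting. The four toy descriptions and the helper name `smoothed_idf` are made up for illustration. The formula is the smoothed variant that scikit-learn's `TfidfVectorizer` applies by default, ln((1 + n) / (1 + df)) + 1; the full vectorizer additionally multiplies by the term counts and L2-normalises each row, which is left out here.

```python
import math
from collections import Counter

# Hypothetical toy catalogue, not taken from the real data.
toy_descriptions = [
    "NISSAN QASHQAI HATCHBACK DIESEL",
    "FORD KUGA ESTATE DIESEL",
    "BMW 3 SERIES SALOON DIESEL",
    "TOYOTA C-HR HATCHBACK PETROL",
]

def smoothed_idf(documents):
    """Return {word: idf} with idf = ln((1 + n) / (1 + df)) + 1."""
    n_docs = len(documents)
    # Document frequency: in how many descriptions does each word occur?
    doc_freq = Counter(word for doc in documents for word in set(doc.split()))
    return {word: math.log((1 + n_docs) / (1 + df)) + 1
            for word, df in doc_freq.items()}

idf = smoothed_idf(toy_descriptions)
print("QASHQAI:", round(idf["QASHQAI"], 3))
print("DIESEL: ", round(idf["DIESEL"], 3))
```

"QASHQAI" appears in one description and "DIESEL" in three, so the former ends up with the larger weight, which is the behaviour described in step 1.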

In order to use the text description of the vehicles for the algorithm, one long string has to be created:

In [211]:
# vehicle_catalogue = vehicle_catalogue.astype(str, copy=True)

descriptive_fields = ["manufacturer", "range", "model", "derivative", "transmission", "trim", "sector", "bodystyle", "fueltype",
                     "fueldelivery", "engine_litres", "drivetrain"]

vehicle_catalogue["full_vehicle_description"] = (vehicle_catalogue[descriptive_fields]
                                                    .fillna("")    
                                                    .astype(str)
                                                    .apply(lambda x: ' '.join(x), axis = 1)
                                                )

display(vehicle_catalogue.head())


vehicle_id,manufacturer,range,model,derivative,transmission,trim,sector,bodystyle,fueltype,fueldelivery,engine_litres,drivetrain,full_vehicle_description
86054,MERCEDES-BENZ,A CLASS,A CLASS HATCHBACK,A200 AMG Line 5dr,MANUAL,AMG LINE,PRESTIGE LOWER,HATCHBACK,PETROL,TURBO,1.3,FRONT WHEEL DRIVE,MERCEDES-BENZ A CLASS A CLASS HATCHBACK A200 A...
91229,TOYOTA,C-HR,C-HR HATCHBACK,2.0 Hybrid Dynamic 5dr CVT [Leather/JBL],AUTOMATIC,DYNAMIC,MEDIUM,HATCHBACK,PETROL/ELECTRIC HYBRID,INJECTION,2.0,FRONT WHEEL DRIVE,TOYOTA C-HR C-HR HATCHBACK 2.0 Hybrid Dynamic ...
87304,FORD,KUGA,KUGA ESTATE,1.5 EcoBoost 176 ST-Line Edition 5dr Auto,AUTOMATIC,ST-LINE EDITION,4X4 MEDIUM,ESTATE,PETROL,TURBO,1.5,FOUR WHEEL DRIVE,FORD KUGA KUGA ESTATE 1.5 EcoBoost 176 ST-Line...
74189,BMW,M4,M4 CONVERTIBLE,M4 2dr [Competition Pack],MANUAL,M4,PRESTIGE CONVERTIBLE MEDIUM,CONVERTIBLE,PETROL,TURBO,3.0,REAR WHEEL DRIVE,BMW M4 M4 CONVERTIBLE M4 2dr [Competition Pack...
84555,MERCEDES-BENZ,E CLASS,E CLASS CABRIOLET,E450 4Matic AMG Line Premium Plus 2dr 9G-Tronic,AUTOMATIC,AMG LINE,PRESTIGE CONVERTIBLE LARGE,CABRIOLET,PETROL,TURBO,3.0,FOUR WHEEL DRIVE,MERCEDES-BENZ E CLASS E CLASS CABRIOLET E450 4...


Check that the concatenation worked, and list the vehicles with a missing trim.

In [212]:
print(vehicle_catalogue.iloc[0,12])
print(vehicle_catalogue[vehicle_catalogue["trim"].isna()])

MERCEDES-BENZ A CLASS A CLASS HATCHBACK A200 AMG Line 5dr MANUAL AMG LINE PRESTIGE LOWER HATCHBACK PETROL TURBO 1.3 FRONT WHEEL DRIVE
           manufacturer     range                   model  \
vehicle_id                                                  
88643       LAMBORGHINI   HURACAN      HURACAN EVO SPYDER   
90492           BENTLEY  BENTAYGA         BENTAYGA ESTATE   
87410             LEXUS        UX            UX HATCHBACK   
87274             LEXUS        LC                LC COUPE   
88582               BMW  7 SERIES  7 SERIES DIESEL SALOON   
...                 ...       ...                     ...   
85746             LEXUS        ES               ES SALOON   
84353            NISSAN   E-NV200   E-NV200 EVALIA ESTATE   
87405             LEXUS        UX            UX HATCHBACK   
88569               BMW  7 SERIES         7 SERIES SALOON   
77808            JAGUAR    F-TYPE      F-TYPE CONVERTIBLE   

                                         derivative transmission trim  \

### 3. Building the algorithm

1. Calculate the tf-idf matrices and store in a sparse matrix (CSR)
2. Calculate the (cosine) similarity between the vectors
3. Extract the results to a data frame. 

The same method also works with character n-grams, which might be useful if the word-based decomposition does not produce satisfactory results.

In [213]:
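def ngrams(string, n=3):
    """Split a cleaned-up version of `string` into overlapping character n-grams."""
    # Strip brackets, dots, pipes and apostrophes
    chars_to_remove = [")","(",".","|","[","]","{","}","'"]
    rx = '[' + re.escape(''.join(chars_to_remove)) + ']'
    string = re.sub(rx, '', string)
    # Collapse runs of spaces
    string = re.sub(' +',' ',string).strip()
    # Drop commas, hyphens, dots and slashes (hyphen escaped so it is not read as a range)
    string = re.sub(r'[,\-./]',r'', string)  
    
    # Commas and hyphens are already gone at this point, so no further replacement is needed for them
    string = (string
               .encode("ascii", errors="ignore").decode()
               .lower()              
               .replace('&', 'and')
               .title()
    )
    
    # Avoid shadowing the function name with the local variable
    grams = zip(*[string[i:] for i in range(n)])
    return [''.join(gram) for gram in grams]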
    
ngrams(vehicle_catalogue.iloc[0,12])

['Mer',
 'erc',
 'rce',
 'ced',
 'ede',
 'des',
 'esb',
 'sbe',
 'ben',
 'enz',
 'nz ',
 'z A',
 ' A ',
 'A C',
 ' Cl',
 'Cla',
 'las',
 'ass',
 'ss ',
 's A',
 ' A ',
 'A C',
 ' Cl',
 'Cla',
 'las',
 'ass',
 'ss ',
 's H',
 ' Ha',
 'Hat',
 'atc',
 'tch',
 'chb',
 'hba',
 'bac',
 'ack',
 'ck ',
 'k A',
 ' A2',
 'A20',
 '200',
 '00 ',
 '0 A',
 ' Am',
 'Amg',
 'mg ',
 'g L',
 ' Li',
 'Lin',
 'ine',
 'ne ',
 'e 5',
 ' 5D',
 '5Dr',
 'Dr ',
 'r M',
 ' Ma',
 'Man',
 'anu',
 'nua',
 'ual',
 'al ',
 'l A',
 ' Am',
 'Amg',
 'mg ',
 'g L',
 ' Li',
 'Lin',
 'ine',
 'ne ',
 'e P',
 ' Pr',
 'Pre',
 'res',
 'est',
 'sti',
 'tig',
 'ige',
 'ge ',
 'e L',
 ' Lo',
 'Low',
 'owe',
 'wer',
 'er ',
 'r H',
 ' Ha',
 'Hat',
 'atc',
 'tch',
 'chb',
 'hba',
 'bac',
 'ack',
 'ck ',
 'k P',
 ' Pe',
 'Pet',
 'etr',
 'tro',
 'rol',
 'ol ',
 'l T',
 ' Tu',
 'Tur',
 'urb',
 'rbo',
 'bo ',
 'o 1',
 ' 13',
 '13 ',
 '3 F',
 ' Fr',
 'Fro',
 'ron',
 'ont',
 'nt ',
 't W',
 ' Wh',
 'Whe',
 'hee',
 'eel',
 'el ',
 'l D',


The function below takes two sparse matrices and calculates the similarities between their vectors using cosine similarity, a.k.a. the normalised dot product. For each row it keeps only the `ntop` highest values above `lower_bound`.

See details here: http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/
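
As a small worked example of the "normalised dot product" idea, the sketch below computes the cosine similarity of two made-up weight vectors in two ways: directly from the definition, and as a plain dot product of the L2-normalised vectors. The vectors and function names are illustrative only. The second route is the one that matters here, because `TfidfVectorizer` L2-normalises its rows by default (`norm="l2"`), so multiplying the tf-idf matrix by its transpose already yields cosine similarities.

```python
import math

def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def l2_normalise(u):
    """Scale u to unit length."""
    norm = math.sqrt(sum(a * a for a in u))
    return [a / norm for a in u]

# Hypothetical weight vectors over the same three-word vocabulary.
car_a = [1.0, 2.0, 0.0]
car_b = [2.0, 1.0, 2.0]

direct = cosine_similarity(car_a, car_b)
# After L2 normalisation the plain dot product gives the same number.
unit_a, unit_b = l2_normalise(car_a), l2_normalise(car_b)
via_unit_vectors = sum(a * b for a, b in zip(unit_a, unit_b))

print(round(direct, 4), round(via_unit_vectors, 4))
```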

In [214]:
def cossim_matrix_top(A, B, ntop, lower_bound=0):
    # force A and B as a CSR matrix.
    # If they have already been CSR, there is no overhead
    A = A.tocsr()
    B = B.tocsr()
    M, _ = A.shape
    _, N = B.shape
 
    idx_dtype = np.int32
 
    nnz_max = M*ntop
 
    indptr = np.zeros(M+1, dtype=idx_dtype)
    indices = np.zeros(nnz_max, dtype=idx_dtype)
    data = np.zeros(nnz_max, dtype=A.dtype)

    ct.sparse_dot_topn(
        M, N, np.asarray(A.indptr, dtype=idx_dtype),
        np.asarray(A.indices, dtype=idx_dtype),
        A.data,
        np.asarray(B.indptr, dtype=idx_dtype),
        np.asarray(B.indices, dtype=idx_dtype),
        B.data,
        ntop,
        lower_bound,
        indptr, indices, data)

    return csr_matrix((data,indices,indptr),shape=(M,N))
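
To illustrate what `ntop` and `lower_bound` mean, here is a slow pure-Python reference sketch of the same selection on small dense vectors. It is not how `sparse_dot_topn` works internally; the function name `top_n_similarities` and the toy vectors are hypothetical, and treating the bound as strict (`>`) is an assumption of this sketch rather than something checked against the library.

```python
import heapq

def top_n_similarities(rows, ntop, lower_bound=0.0):
    """For each row vector, keep the ntop largest dot products above lower_bound.

    rows: list of equal-length, L2-normalised lists of floats.
    Returns {row_index: [(other_index, similarity), ...]}, best match first.
    """
    result = {}
    for i, u in enumerate(rows):
        scores = []
        for j, v in enumerate(rows):
            score = sum(a * b for a, b in zip(u, v))
            if score > lower_bound:  # assumption: strict comparison
                scores.append((j, score))
        result[i] = heapq.nlargest(ntop, scores, key=lambda pair: pair[1])
    return result

# Hypothetical unit vectors: rows 0 and 1 point in similar directions.
toy_rows = [
    [1.0, 0.0],
    [0.8, 0.6],
    [0.0, 1.0],
]
top = top_n_similarities(toy_rows, ntop=2, lower_bound=0.75)
for row_index, matches in top.items():
    print(row_index, [other for other, _ in matches])
```

Every row matches itself with a similarity of 1, so one of the `ntop` slots is always taken by the self-match. This is why a call with `ntop=11` leaves at most ten other vehicles per row once self-matches are filtered out.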

#### 1. Initiate and fit the tf-idf vectorizers

In [215]:
tf_idf_vectorizer_word = TfidfVectorizer(min_df=1, analyzer="word")
# tf_idf_vectorizer_ngram = TfidfVectorizer(min_df=1, analyzer=ngrams)

tf_idf_matrix_word = tf_idf_vectorizer_word.fit_transform(vehicle_catalogue.full_vehicle_description)

In [216]:
print("Shape of tf-idf matrix is:", tf_idf_matrix_word.shape, "\n(Rows of matrix correspond to number of observations, columns correspond to number of unique words)\n")
print(vehicle_catalogue.nunique(), "\n")
print(vehicle_catalogue.iloc[0,:], "\n")
print("tf-idf vector of", vehicle_catalogue.iloc[0,-1], ":\n", tf_idf_matrix_word[0])

Shape of tf-idf matrix is: (5000, 1220) 
(Rows of matrix correspond to number of observations, columns correspond to number of unique words)

manufacturer                  51
range                        338
model                        750
derivative                  4311
transmission                   2
trim                         497
sector                        33
bodystyle                     10
fueltype                       7
fueldelivery                   4
engine_litres                 39
drivetrain                     3
full_vehicle_description    4992
dtype: int64 

manufacturer                                                    MERCEDES-BENZ
range                                                                 A CLASS
model                                                       A CLASS HATCHBACK
derivative                                                  A200 AMG Line 5dr
transmission                                                           MANUAL
trim                     

#### 2. Calculate similarities (0-1 scale)

In [217]:
cossim_matrix = cossim_matrix_top(tf_idf_matrix_word, tf_idf_matrix_word.transpose(), lower_bound=0.75, ntop=11)
print("The dimensions of the similarity (sparse) matrix are:", cossim_matrix.shape)

The dimensions of the similarity (sparse) matrix are: (5000, 5000)


#### 3. Extract results to a data frame

In [218]:
def extract_matches_to_df(similarity_matrix, input_data):
    
    non_zeros = similarity_matrix.nonzero()
    sparserows = non_zeros[0]
    sparsecols = non_zeros[1]
    
    matches_count = sparsecols.size
    
    default_description = np.empty([matches_count], dtype=object)
    matching_description = np.empty([matches_count], dtype=object)
    default_index = np.empty([matches_count], dtype=object)
    matching_index = np.empty([matches_count], dtype=object)
    similarity = np.zeros(matches_count)
    
    for i in range(0, matches_count):
        default_index[i] = input_data.iloc[sparserows[i], :].name
        matching_index[i] = input_data.iloc[sparsecols[i], :].name
        default_description[i] = input_data.iloc[sparserows[i], 12]
        matching_description[i] = input_data.iloc[sparsecols[i], 12]
        similarity[i] = similarity_matrix.data[i]
        
    vehicle_mapping = pd.DataFrame({"default_index": default_index,
                                    "matching_index": matching_index,
                                    # "default_description": default_description,
                                    # "matching_description": matching_description,
                                    "similarity": similarity})
    
    vehicle_mapping = vehicle_mapping[vehicle_mapping.default_index != vehicle_mapping.matching_index]
    vehicle_mapping["ranking"] = vehicle_mapping.groupby("default_index")["similarity"].rank(ascending = False, method = "first")
    return vehicle_mapping
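
The row-by-row loop above is fine for 5,000 vehicles but may become slow on the full catalogue. A possible vectorised variant is sketched below, assuming the same input types; the name `extract_matches_vectorised` and the toy inputs are hypothetical. It reads row, column and value from one COO view of the matrix, so the three stay aligned even if the matrix happens to store explicit zeros.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

def extract_matches_vectorised(similarity_matrix, input_data):
    """Build the same columns as extract_matches_to_df without a Python loop."""
    coo = similarity_matrix.tocoo()
    keep = coo.data != 0  # drop explicitly stored zeros, if any
    index_values = input_data.index.to_numpy()
    mapping = pd.DataFrame({
        "default_index": index_values[coo.row[keep]],
        "matching_index": index_values[coo.col[keep]],
        "similarity": coo.data[keep],
    })
    # Remove self-matches, then rank the remaining matches per vehicle
    mapping = mapping[mapping["default_index"] != mapping["matching_index"]].copy()
    mapping["ranking"] = (mapping.groupby("default_index")["similarity"]
                                 .rank(ascending=False, method="first"))
    return mapping

# Hypothetical 3x3 similarity matrix and three-row catalogue.
toy_similarity = csr_matrix(np.array([
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.8],
    [0.0, 0.8, 1.0],
]))
toy_catalogue = pd.DataFrame({"model": ["a", "b", "c"]}, index=[101, 102, 103])
toy_matches = extract_matches_vectorised(toy_similarity, toy_catalogue)
print(toy_matches)
```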
    

In [219]:
top_vehicle_matches = extract_matches_to_df(similarity_matrix=cossim_matrix, input_data=vehicle_catalogue)
display(top_vehicle_matches.head(10))

,default_index,matching_index,similarity,ranking
1,86054,83771,0.977302,1.0
2,86054,86056,0.974355,2.0
3,86054,83773,0.952458,3.0
4,86054,86057,0.948524,4.0
5,86054,83774,0.927416,5.0
6,86054,83772,0.921911,6.0
7,86054,86042,0.829648,7.0
8,86054,87744,0.81558,8.0
9,86054,86689,0.814385,9.0
10,86054,86044,0.808464,10.0


In [220]:
print(vehicle_catalogue.loc[86054, "full_vehicle_description"]) 
print("\n")
print(vehicle_catalogue.loc[83771, "full_vehicle_description"])
print("\n")
print(vehicle_catalogue.loc[86044, "full_vehicle_description"])

MERCEDES-BENZ A CLASS A CLASS HATCHBACK A200 AMG Line 5dr MANUAL AMG LINE PRESTIGE LOWER HATCHBACK PETROL TURBO 1.3 FRONT WHEEL DRIVE


MERCEDES-BENZ A CLASS A CLASS HATCHBACK A200 AMG Line 5dr Auto AUTOMATIC AMG LINE PRESTIGE LOWER HATCHBACK PETROL TURBO 1.3 FRONT WHEEL DRIVE


MERCEDES-BENZ A CLASS A CLASS HATCHBACK A180 AMG Line Premium 5dr MANUAL AMG LINE PRESTIGE LOWER HATCHBACK PETROL TURBO 1.3 FRONT WHEEL DRIVE


### Summary

The results look sensible: the algorithm places similar-looking vehicles close to each other.

The mapping can be used as the engine of recommender systems that aim to measure the overlap between two sets of cars (e.g. an automotive vendor's supply and the buyers' demand).

In [226]:
fields_to_display = ["manufacturer", "range", "model", "derivative", "bodystyle"]

left_side = pd.merge(vehicle_catalogue[fields_to_display],
                     top_vehicle_matches,
                     how="inner",
                     left_index=True,
                     right_on="default_index")

fully_joined = pd.merge(left_side,
                       vehicle_catalogue[fields_to_display],
                       how = "inner",
                       left_on="matching_index",
                       right_index=True, 
                       suffixes=("_orig", "_match")).reset_index(drop=True)

fully_joined[fully_joined.ranking == 1].head()

,manufacturer_orig,range_orig,model_orig,derivative_orig,bodystyle_orig,default_index,matching_index,similarity,ranking,manufacturer_match,range_match,model_match,derivative_match,bodystyle_match
0,MERCEDES-BENZ,A CLASS,A CLASS HATCHBACK,A200 AMG Line 5dr,HATCHBACK,86054,83771,0.977302,1.0,MERCEDES-BENZ,A CLASS,A CLASS HATCHBACK,A200 AMG Line 5dr Auto,HATCHBACK
5,MERCEDES-BENZ,A CLASS,A CLASS HATCHBACK,A200 AMG Line Executive 5dr Auto,HATCHBACK,83772,83771,0.943322,1.0,MERCEDES-BENZ,A CLASS,A CLASS HATCHBACK,A200 AMG Line 5dr Auto,HATCHBACK
26,MERCEDES-BENZ,A CLASS,A CLASS HATCHBACK,A200 AMG Line Premium 5dr Auto,HATCHBACK,83773,86056,0.978446,1.0,MERCEDES-BENZ,A CLASS,A CLASS HATCHBACK,A200 AMG Line Premium 5dr,HATCHBACK
33,MERCEDES-BENZ,A CLASS,A CLASS HATCHBACK,A200 AMG Line Premium 5dr,HATCHBACK,86056,83773,0.978446,1.0,MERCEDES-BENZ,A CLASS,A CLASS HATCHBACK,A200 AMG Line Premium 5dr Auto,HATCHBACK
65,MERCEDES-BENZ,A CLASS,A CLASS HATCHBACK,A200 AMG Line Premium Plus 5dr Auto,HATCHBACK,83774,86057,0.979569,1.0,MERCEDES-BENZ,A CLASS,A CLASS HATCHBACK,A200 AMG Line Premium Plus 5dr,HATCHBACK
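
The supply/demand overlap mentioned in the summary could be sketched as follows. This is only an illustration: the helper `expand_with_similar`, the 0.9 threshold, the buyer history and the vendor stock are all hypothetical, while the three similarity values are copied from the match table for vehicle 86054 shown earlier.

```python
def expand_with_similar(purchased_ids, similarity_map, min_similarity=0.9):
    """Return the purchased ids plus every mapped vehicle at or above min_similarity."""
    expanded = set(purchased_ids)
    for vehicle_id in purchased_ids:
        for match_id, similarity in similarity_map.get(vehicle_id, []):
            if similarity >= min_similarity:
                expanded.add(match_id)
    return expanded

# Mapping in {vehicle_id: [(matching_id, similarity), ...]} form.
similarity_map = {
    86054: [(83771, 0.977302), (86056, 0.974355), (86044, 0.808464)],
}
past_purchases = {86054}               # hypothetical buyer history
vendor_stock = {83771, 86044, 91229}   # hypothetical vendor supply

# Vehicles in stock that are identical or very similar to past purchases
candidates = expand_with_similar(past_purchases, similarity_map) & vendor_stock
print(candidates)
```

Without the expansion, the buyer's history and the vendor's stock would not overlap at all in this toy case; with it, the automatic version of the same A200 becomes a candidate.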
