<a href="https://colab.research.google.com/github/basan4ik/similarity-search/blob/main/Matching.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Product mathiching based on ML algorithms

The project is done by Basang Basangov.
Telegram: [basan4ik](https://t.me/basan4ik)

## 1. Project Description

Data matching is the workflow process of comparing different data values in structured or unstructured format based on similarity or an underlying entity [[1]](https://www.width.ai/post/data-matching-software#:~:text=How%20you%20can%20use%20machine,similarity%20or%20an%20underlying%20entity.). This notebook provides a workflow of how this type of task can be solved using two-staged process: first, FAISS similarity search is used to get 100-200 the most similar products from a database (that could petentially consist of billions of items), then among those 100-200 products get 5 that are even more similar using supervised ML algoritm, which in our case would be Catboost classification algorithm. The final evaluation metric is accuracy@5.

There are 4 files:
- 'base.csv' is a dataset of anonymized set of products. Each product is presented with a unique id (0-base, 1-base, etc.) and vectors of the shape 1 x 72;
- train.csv is a training dataset. Each row has an id (0-query, 1-query, etc.), a vector (1x72), and id from 'base';
- validation.csv is a dataset of vectors that we need to find the most similar vectors from 'base'.
- validation_answer.csv is a dataset with the right answers to the previous dataset.

This is a project from [Yandex Practicum Masterskaya](https://practicum.yandex.ru/masterskaya/).

## 2. Data Preparation

### 2.1 Installations

In [1]:
!pip install faiss-gpu
!pip install catboost
!pip install ipywidgets
!jupyter nbextension enable --py widgetsnbextension
!pip install zarr

Collecting faiss-gpu
  Downloading faiss_gpu-1.7.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (85.5 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m85.5/85.5 MB[0m [31m11.8 MB/s[0m eta [36m0:00:00[0m
[?25hInstalling collected packages: faiss-gpu
Successfully installed faiss-gpu-1.7.2
Collecting catboost
  Downloading catboost-1.2.1-cp310-cp310-manylinux2014_x86_64.whl (98.7 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m98.7/98.7 MB[0m [31m8.9 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: catboost
Successfully installed catboost-1.2.1
Collecting jedi>=0.16 (from ipython>=4.0.0->ipywidgets)
  Downloading jedi-0.19.0-py2.py3-none-any.whl (1.6 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.6/1.6 MB[0m [31m18.1 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: jedi
Successfully installed jedi-0.19.0
Enabling notebook extension jupyter-js-widgets/extension...
Paths used for co

In [2]:
import pandas as pd
import numpy as np
import zarr
import faiss
from catboost import CatBoostClassifier, Pool, metrics, cv

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler


### 2.2 Data Loading

Here, we mount Google drive disk where our files located, but you can find all the necesary files in the 'data' directory on github

In [3]:
# all the necesary files can be found in the 'data' directory on github
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [4]:
with (
    open("/content/drive/MyDrive/similarity-search/data/base.csv", "r") as f1,
    open("/content/drive/MyDrive/similarity-search/data/train.csv", "r") as f2,
    open("/content/drive/MyDrive/similarity-search/data/validation.csv", "r") as f3,
    open("/content/drive/MyDrive/similarity-search/data/validation_answer.csv", "r") as f4,
):
    base = pd.read_csv(f1, index_col=0)
    train = pd.read_csv(f2, index_col=0)
    validation = pd.read_csv(f3, index_col=0)
    validation_answer = pd.read_csv(f4, index_col=0)

### 2.3 Feature Preparation

In [5]:
base.head()

Unnamed: 0_level_0,0,1,2,3,4,5,6,7,8,9,...,62,63,64,65,66,67,68,69,70,71
Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0-base,-115.08389,11.152912,-64.42676,-118.88089,216.48244,-104.69806,-469.070588,44.348083,120.915344,181.4497,...,-42.808693,38.800827,-151.76218,-74.38909,63.66634,-4.703861,92.93361,115.26919,-112.75664,-60.830353
1-base,-34.562202,13.332763,-69.78761,-166.53348,57.680607,-86.09837,-85.076666,-35.637436,119.718636,195.23419,...,-117.767525,41.1,-157.8294,-94.446806,68.20211,24.346846,179.93793,116.834,-84.888941,-59.52461
2-base,-54.233746,6.379371,-29.210136,-133.41383,150.89583,-99.435326,52.554795,62.381706,128.95145,164.38147,...,-76.3978,46.011803,-207.14442,127.32557,65.56618,66.32568,81.07349,116.594154,-1074.464888,-32.527206
3-base,-87.52013,4.037884,-87.80303,-185.06763,76.36954,-58.985165,-383.182845,-33.611237,122.03191,136.23358,...,-70.64794,-6.358921,-147.20105,-37.69275,66.20289,-20.56691,137.20694,117.4741,-1074.464888,-72.91549
4-base,-72.74385,6.522049,43.671265,-140.60803,5.820023,-112.07408,-397.711282,45.1825,122.16718,112.119064,...,-57.199104,56.642403,-159.35184,85.944724,66.76632,-2.505783,65.315285,135.05159,-1074.464888,0.319401


In [None]:
base.shape

(2918139, 73)

In [None]:
base.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,62,63,64,65,66,67,68,69,70,71
count,2918139.0,2918139.0,2918139.0,2918139.0,2918139.0,2918139.0,2918139.0,2918139.0,2918139.0,2918139.0,...,2918139.0,2918139.0,2918139.0,2918139.0,2918139.0,2918139.0,2918139.0,2918139.0,2918139.0,2918139.0
mean,-86.22947,8.080077,-44.5808,-146.635,111.3166,-71.99138,-392.2239,20.35283,123.6842,124.4581,...,-79.02286,33.29735,-154.7962,14.15132,67.79167,23.5449,74.9593,115.5667,-799.339,-47.79125
std,24.89132,4.953387,38.63166,19.8448,46.34809,28.18607,271.655,64.21638,6.356109,64.43058,...,30.45642,28.88603,41.22929,98.95115,1.823356,55.34224,61.345,21.17518,385.4131,41.74802
min,-199.4687,-13.91461,-240.0734,-232.6671,-105.583,-211.0086,-791.4699,-301.8597,93.15305,-173.8719,...,-220.5662,-88.50774,-353.9028,-157.5944,59.50944,-233.1382,-203.6016,15.72448,-1297.931,-226.7801
25%,-103.0654,4.708491,-69.55949,-159.9051,80.50795,-91.37994,-629.3318,-22.22147,119.484,81.76751,...,-98.7639,16.98862,-180.7799,-71.30038,66.58096,-12.51624,33.77574,101.6867,-1074.465,-75.66641
50%,-86.2315,8.03895,-43.81661,-146.7768,111.873,-71.9223,-422.2016,20.80477,123.8923,123.4977,...,-78.48812,34.71502,-153.9773,13.82693,67.81458,23.41649,74.92997,116.0244,-1074.465,-48.59196
75%,-69.25658,11.47007,-19.62527,-133.3277,142.3743,-52.44111,-156.6686,63.91821,127.9705,167.2206,...,-58.53355,52.16429,-127.3405,99.66753,69.02666,59.75511,115.876,129.5524,-505.7445,-19.71424
max,21.51555,29.93721,160.9372,-51.37478,319.6645,58.80624,109.6325,341.2282,152.2612,427.5421,...,60.17411,154.1678,24.36099,185.0981,75.71203,314.8988,339.5738,214.7063,98.77081,126.9732


In [6]:
base.info()

<class 'pandas.core.frame.DataFrame'>
Index: 2918139 entries, 0-base to 4744766-base
Data columns (total 72 columns):
 #   Column  Dtype  
---  ------  -----  
 0   0       float64
 1   1       float64
 2   2       float64
 3   3       float64
 4   4       float64
 5   5       float64
 6   6       float64
 7   7       float64
 8   8       float64
 9   9       float64
 10  10      float64
 11  11      float64
 12  12      float64
 13  13      float64
 14  14      float64
 15  15      float64
 16  16      float64
 17  17      float64
 18  18      float64
 19  19      float64
 20  20      float64
 21  21      float64
 22  22      float64
 23  23      float64
 24  24      float64
 25  25      float64
 26  26      float64
 27  27      float64
 28  28      float64
 29  29      float64
 30  30      float64
 31  31      float64
 32  32      float64
 33  33      float64
 34  34      float64
 35  35      float64
 36  36      float64
 37  37      float64
 38  38      float64
 39  39      float64


Let's make sure we don't have any missing values

In [7]:
pd.set_option('display.max_rows', None)
base.isna().sum()

0     0
1     0
2     0
3     0
4     0
5     0
6     0
7     0
8     0
9     0
10    0
11    0
12    0
13    0
14    0
15    0
16    0
17    0
18    0
19    0
20    0
21    0
22    0
23    0
24    0
25    0
26    0
27    0
28    0
29    0
30    0
31    0
32    0
33    0
34    0
35    0
36    0
37    0
38    0
39    0
40    0
41    0
42    0
43    0
44    0
45    0
46    0
47    0
48    0
49    0
50    0
51    0
52    0
53    0
54    0
55    0
56    0
57    0
58    0
59    0
60    0
61    0
62    0
63    0
64    0
65    0
66    0
67    0
68    0
69    0
70    0
71    0
dtype: int64

As we can see we have a dataframe with 2918139 rows with 72 float value columns. It has no missing values.

The next, we scale our numberic feautures (basically all features) using scikit's StandardScaler.

In [15]:
scaler = StandardScaler()
scaler.fit(base)

base_scaled = scaler.transform(base)

In [14]:
train.head()

Unnamed: 0_level_0,0,1,2,3,4,5,6,7,8,9,...,63,64,65,66,67,68,69,70,71,Target
Id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0-query,-53.882748,17.971436,-42.117104,-183.93668,187.51749,-87.14493,-347.360606,38.307602,109.08556,30.413513,...,70.10736,-155.80257,-101.965943,65.90379,34.4575,62.642094,134.7636,-415.750254,-25.958572,675816-base
1-query,-87.77637,6.806268,-32.054546,-177.26039,120.80333,-83.81059,-94.572749,-78.43309,124.9159,140.33107,...,4.669178,-151.69771,-1.638704,68.170876,25.096191,89.974976,130.58963,-1035.092211,-51.276833,366656-base
2-query,-49.979565,3.841486,-116.11859,-180.40198,190.12843,-50.83762,26.943937,-30.447489,125.771164,211.60782,...,78.039764,-169.1462,82.144186,66.00822,18.400496,212.40973,121.93147,-1074.464888,-22.547178,1447819-base
3-query,-47.810562,9.086598,-115.401695,-121.01136,94.65284,-109.25541,-775.150134,79.18652,124.0031,242.65065,...,44.515266,-145.41675,93.990981,64.13135,106.06192,83.17876,118.277725,-1074.464888,-19.902788,1472602-base
4-query,-79.632126,14.442886,-58.903397,-147.05254,57.127068,-16.239529,-321.317964,45.984676,125.941284,103.39267,...,45.02891,-196.09207,-117.626337,66.92622,42.45617,77.621765,92.47993,-1074.464888,-21.149351,717819-base


In [16]:
train.shape

(100000, 73)

In [10]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Index: 100000 entries, 0-query to 99999-query
Data columns (total 73 columns):
 #   Column  Non-Null Count   Dtype  
---  ------  --------------   -----  
 0   0       100000 non-null  float64
 1   1       100000 non-null  float64
 2   2       100000 non-null  float64
 3   3       100000 non-null  float64
 4   4       100000 non-null  float64
 5   5       100000 non-null  float64
 6   6       100000 non-null  float64
 7   7       100000 non-null  float64
 8   8       100000 non-null  float64
 9   9       100000 non-null  float64
 10  10      100000 non-null  float64
 11  11      100000 non-null  float64
 12  12      100000 non-null  float64
 13  13      100000 non-null  float64
 14  14      100000 non-null  float64
 15  15      100000 non-null  float64
 16  16      100000 non-null  float64
 17  17      100000 non-null  float64
 18  18      100000 non-null  float64
 19  19      100000 non-null  float64
 20  20      100000 non-null  float64
 21  21  

In [17]:
train_target = train['Target']
train_features = train.drop(['Target'], axis=1)

In [18]:
train_features_scaled = scaler.transform(train_features)

## 3. FAISS Similarity Search

### 3.1 FAISS Index

- We'll create an index on a base dataset. We'll be using IndexFlatL2 here as it provides the most accurate results trading off the speed which is different to other FAISS indicies that could be much faster trading off some accuracy. The reason behind our choice is that on a given dataset (just a little less than 3m rows with 72 dimensions) FlatL2 takes about a 1 minute to search for 100k queries giving 100 nearest neighbors on Google Colab GPU T4. Which is not that long.
- The train_features is going to be used as the query vectors, with train_target for calculating accuracy metrics and fine-tuning.

In [27]:
d = 72                           # dimensions
nb = 2918139                      # database size
nq = 1000                       # nb of queries
np.random.seed(1234)             # make reproducible

In [20]:
res = faiss.StandardGpuResources()  # use a single GPU
# make it a flat GPU index
index_flat = faiss.IndexFlatL2(d)   # build the index

# make it a flat GPU index
gpu_index_flat = faiss.index_cpu_to_gpu(res, 0, index_flat)

In [21]:
# Then we train the index to find a suitable clustering
gpu_index_flat.train(np.ascontiguousarray(base_scaled.astype('float32')))

In [22]:
# Finally we add all embeddings to the index
gpu_index_flat.add(np.ascontiguousarray(base_scaled.astype('float32')))

print(gpu_index_flat.is_trained)

True


In [23]:
print(gpu_index_flat.ntotal)

2918139


### 3.2 FAISS Search

For every vector in train_features we'll search for k closest vectors in index. Then, if it contains a target vector (train_target), we add one point to acc counter. Then calculate a proportion of how many time we actually found target.

In [28]:
# Do not forget that cell magic starts with %% and line magic starts with %.
%%time
k = 10  # we want to see 10 nearest neighbors
D, I = gpu_index_flat.search(np.ascontiguousarray(train_features_scaled[:1000, :].astype('float32')), k)

CPU times: user 240 ms, sys: 799 µs, total: 241 ms
Wall time: 252 ms


### 3.3 Evaluating k-ANN

The code below is copied from this notebook https://colab.research.google.com/drive/1WUG6JO6ra4W3bs6Wi7febuPrd9B4D61B#scrollTo=d42af9e0-3f09-4b4c-a1a6-688387db17de

In [91]:
base_index = {k: v for k, v in enumerate(base.index.to_list())}

The code above enumreates indicies from a base dataframe, so we have:


```
{0: '0-base',
 1: '1-base',
 2: '2-base',
 3: '3-base',
 ...
```




In [34]:
train_target[:1000].shape

(1000,)

In [92]:
acc = 0
for target_base_name, k_closest_vectors in zip(train_target[:1000].values.tolist(), I.tolist()):
    acc += int(target_base_name in [base_index[v] for v in k_closest_vectors])

print(100 * acc / len(I))

69.6


In [95]:
len(target_base_name)

12

We got accuracy 69.6 on first 1000 train rows, if we take the full set 100_000 and 100 the most similar vectors (which was test in other iteration), then we'll get accuracy 79.178 out of 100. I downsized parameter k and train dataset because of 12GB RAM in Google Collab.

The higer, the better.

## 4. Catboost Classfication

### 4.1 Preparing Features for Catboost Classifier

So, a variable I is a numpy array that contains 1000 rows each with 10 indicies of the most similar vectors from base, that is, these indices just are pointers to the base. So, basically, we have a 3 dimensional array - (1000,10,72).

If we concatenate this 3d array with a training feautures set (and later with a validation set) that has a shape of (1000, 72), we would end up with an ndarray of a shape (1000, 10, 144). Then, after adding a class (1 or 0), we can train a model.

In [40]:
print(f'Shape and dtype of base dataset: {base_scaled.shape}, {base_scaled.dtype}')

Shape and dtype of base dataset: (2918139, 72), float64


In [41]:
print(f'Shape and dtype of train dataset: {train_features_scaled[:1000].shape}, {train_features_scaled[:1000].dtype}')

Shape and dtype of train dataset: (1000, 72), float64


In [47]:
print(f'Shape and dtype of I matrix: {I.shape}, {I.dtype}')

Shape and dtype of I matrix: (1000, 10), int64


In [43]:
train_tiled = np.tile(train_features_scaled[:1000], 10).reshape(1000, 10, 72)  # tiled (repeated) rows in train

In [48]:
I_flat = I.ravel()  # flatten I
result_values = base_scaled[I_flat]  # take only those from base that are in flatten
I_base = result_values.reshape(1000, 10, 72)  # create new 3d array

In [59]:
train_target_tiled = np.tile(train_target[:1000], 10)
I_base_names = base.index[I_flat]

In [72]:
check_equal = np.equal(train_target_tiled, I_base_names).astype('int').reshape(1000, 10, 1)

In [73]:
check_equal.shape

(1000, 10, 1)

In [74]:
# Concatenate the two arrays on axis=2 along with the check_equal array
train_final = np.concatenate([train_tiled[..., :], I_base[..., :], check_equal], axis=2)

In [75]:
train_final.shape

(1000, 10, 145)

In [77]:
train_final = train_final.reshape(10000, 145)

In [78]:
X = train_final[:, :-1]
y = train_final[:, -1]


In [96]:
np.average(np.equal(train_target_tiled, I_base_names).astype('int'))

0.0002

In [100]:
base.index[61612]

'62471-base'

Interesting, it seems that values in id column in 'base' dataset are not continious integer values.

### 4.2 Data Splitting

In [79]:
X_train, X_validation, y_train, y_validation = train_test_split(X, y, train_size=0.75, random_state=42)


### 4.3 Model Training

In [80]:
# Baseline model
model = CatBoostClassifier(
    custom_loss=[metrics.Accuracy()],
    random_seed=42,
    logging_level='Silent'
)

In [82]:
from google.colab import output
output.enable_custom_widget_manager()

In [83]:
model.fit(
    X_train, y_train,
    eval_set=(X_validation, y_validation),
#     logging_level='Verbose',  # you can uncomment this for text output
    plot=True
);

MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))

### 4.4 Model Cross-Validation

In [85]:
cv_params = model.get_params()
cv_params.update({
    'loss_function': metrics.Logloss()
})
cv_data = cv(
    Pool(X, y),
    cv_params,
    plot=True
)



MetricVisualizer(layout=Layout(align_self='stretch', height='500px'))



KeyboardInterrupt: ignored