# Grouping by layout templates
##### The purpouse of this notebook was to test different approach to grouping labels
Firtly we compute distances between neighbour labels. Then we aggregate them and count duplicated vectors. To handle vector storing and querying I used vectorDB. Finally, I created groups by connecting labels if distance between them is common.

##### Dependencies import

In [103]:
from typing import List
import pandas as pd
from docarray import DocList, BaseDoc
from docarray.typing import NdArray

from vectordb import InMemoryExactNNVectorDB

import numpy as np

##### Algorithm hyperparameters

In [109]:
EPSILON = 100 # max distance between vectors counted as one
COMMON_THOLD = 0.2 # thold for vector count to be treated as common. Expressed as fraction of totally processed vectors up to date

##### Dataset preparation

In [110]:
df = pd.read_csv("./data/18789327023.csv")
per_timestamp = df.groupby(["Seen Timestamp"])
processed_frames = {}
for ts, subframe in per_timestamp:
    processed_frames[ts] = subframe
    processed_frames[ts]["dx"] = subframe["X 1"][:-1] - subframe["X 1"][1:]
    processed_frames[ts]["dy"] = subframe["Y 1"][:-1] - subframe["Y 1"][1:]
    processed_frames[ts]["w"] = subframe["X 2"] - subframe["X 1"]
    processed_frames[ts]["h"] = subframe["Y 2"] - subframe["Y 1"]

##### VectorDB setup

In [111]:
class MyDoc(BaseDoc):
    vec_id: int = 0
    embedding: NdArray[4]

db = InMemoryExactNNVectorDB[MyDoc]()




In [112]:
def euclidean_dist(v1: np.ndarray, v2: np.ndarray) -> float:
    return np.sqrt(((v1 - v2) ** 2).sum())

def get_vectors(frame: pd.DataFrame) -> List[np.ndarray]:
    cpy = frame.copy(deep=True)
    cpy['prev_w'] = cpy['w'].shift(1)
    cpy['prev_h'] = cpy['h'].shift(1)
    return cpy[['dx', 'dy', 'w', 'h']].values.tolist()[1:]

In [113]:
SEP = "=" * 20

frequency = {}
added_ctr = 0
processed_ctr = 0
for sdf in processed_frames.values():
    vectors = get_vectors(sdf)
    print(sdf["Text"].iloc[0])
    for i, (v, text) in enumerate(zip(vectors, sdf["Text"])):
        processed_ctr += 1
        if added_ctr > 0:
            doc = db.search(inputs=DocList[MyDoc]([MyDoc(vec_id=-1, embedding=v)]), limit=10).matches[0][0]
            d = euclidean_dist(doc.embedding, v)
            if d < EPSILON:
                frequency[doc.vec_id] += 1
                count = frequency[doc.vec_id]
                if count < processed_ctr * COMMON_THOLD:
                    print(SEP)
                doc.embedding = (v + (count - 1) * doc.embedding ) / count
                db.update(DocList[MyDoc]([doc]))
                print(text)
                continue
        print(SEP)
        print(text)
        db.index(inputs=DocList[MyDoc]([MyDoc(vec_id=added_ctr, embedding=v)]))
        frequency[added_ctr] = 1
        added_ctr += 1

nan
nan
nan
UVP 1.79
1.29
UVP
RAUCH Eistee
je 1,5 I
UVP 0.99
0.69
UVP
LIPTON Ice Tea
je 0,33 I
UVP 1.49
1.19
UVP
HOHES C Water
je 0,75 I
UVP 0.99
0.79
UVP
GEROLSTEINER Mineralwasser
je 1,5 I
Sparen auf Top-Marken
ab 05.09. bis 07.09.
Angebote
Vorteile
Einkaufsliste
Vorteilscode
UVP 1.79
UVP 1.79
1.29
UVP
RAUCH Eistee
je 1,5 I
UVP 0.99
0.69
UVP
LIPTON Ice Tea
je 0,33 I
UVP 1.49
1.19
UVP
HOHES C Water
je 0,75 I
UVP 0.99
0.79
UVP
GEROLSTEINER Mineralwasser
UVP 1.79
UVP 1.79
1.29
UVP
RAUCH Eistee
je 1,5 I
UVP 0.99
0.69
UVP
LIPTON Ice Tea
je 0,33 I
UVP 1.49
1.19
UVP
HOHES C Water
je 0,75 I
UVP 0.99
0.79
UVP
GEROLSTEINER Mineralwasser
je 1,5 I
je 1,5 I
je 0,33 I
UVP 1.49
1.19
UVP
HOHES C Water
je 0,75 I
UVP 0.99
0.79
UVP
GEROLSTEINER Mineralwasser
je 1,5 I
UVP 4.49
2.99
UVP 4.49
UVP 1.49
UVP 1.49
1.19
UVP
HOHES C Water
je 0,75 I
UVP 0.99
0.79
UVP
GEROLSTEINER Mineralwasser
je 1,5 I
UVP 4.49
2.99
UVP
LE SWEET FILOU Vin de France Rouge
je 1 I
UVP 4.49
2.99
UVP
LE SWEET FILOU Vin de France Blan

## Experiment results

The hyperparameters above were choosen as result of empirical experiments. The results show that this method currently cannot provide robust label grouping in case of sparse data. Separated sections often contain multiple coupons or coupons are split between sections. This algorithm is also sensitive to hyperparameter selection which reduces its ability to generalise. Research on larger datasets to be conducted.