
This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →


Block Filtering and Block Purging after Vector Based Blocking #10

Closed
reversingentropy opened this issue Jul 7, 2023 · 3 comments
Closed
Assignees
Labels
help wanted Extra attention is needed

Comments

@reversingentropy

Hi, I have tried vector-based blocking with sentence transformers and FAISS, and the resulting blocks are a dict mapping indices to sets of indices.
I can't proceed with Block Filtering and Block Purging.
They can't compute block cardinality, since, as I understand it, other methods like QGramsBlocking return a dict of {'key': datamodel.Block} items instead.
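A toy sketch of the mismatch (illustrative dict shapes only, not pyjedai's actual classes) shows why size-based filtering applies to one output format but not the other:

```python
# Traditional blocking methods return {blocking_key: Block}, where each
# Block groups entity ids and therefore has a cardinality (block size).
qgrams_style_blocks = {
    "smi": {0, 3, 7},  # stand-in for a datamodel.Block holding entity ids
    "ith": {1, 3},
}

# Vector-based blocking instead returns {entity_id: candidate_ids},
# i.e. one candidate set per entity rather than key-indexed blocks.
vector_style_blocks = {
    0: {3, 7},
    1: {3},
}

# Block Filtering / Block Purging rank blocks by size, which only makes
# sense for the first shape:
sizes = {key: len(block) for key, block in qgrams_style_blocks.items()}
```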

@Nikoletos-K Nikoletos-K self-assigned this Jul 7, 2023
@Nikoletos-K Nikoletos-K added the help wanted Extra attention is needed label Jul 7, 2023
@Nikoletos-K
Member

Hello, the proper way to use Vector Based Blocking is presented here:

https://pyjedai.readthedocs.io/en/latest/tutorials/pyTorchWorkflow.html

Vector Based Blocking generates a dictionary of ids that correspond to candidate matches. So at the end of vector-based blocking you get either this dictionary or a graph similar to the output of entity matching. FAISS also returns distance/similarity scores, which avoids the need for a separate entity-matching step. Check out the tutorial, and if you have any questions, I'm happy to help.
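A toy sketch of that idea (assumed shapes, not verified against pyjedai internals): because FAISS already attaches a similarity score to each candidate pair, the blocking step can emit weighted edges directly, and a clustering step can threshold on them without a separate matching pass.

```python
# Each candidate pair comes out of the FAISS search with a score attached:
scored_pairs = [
    (0, 1076, 0.91),  # (entity_id, candidate_id, similarity from FAISS)
    (1, 1077, 0.42),
    (2, 1078, 0.17),
]

# A clustering step can then keep only edges above a similarity threshold,
# playing the role that a standalone entity-matching step would otherwise:
kept = [(u, v) for u, v, score in scored_pairs if score >= 0.40]
```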

@reversingentropy
Author

Hi Nikoletos,
I used the exact code from the tutorial with sminilm and faiss, then applied Unique Mapping Clustering.
I got low scores: Precision: 3.24%, Recall: 2.23%, F1-score: 2.64%.
How do I achieve the scores of Precision: 83.18%, Recall: 67.10%, F1-score: 74.28%?


Code:

from pyjedai.vector_based_blocking import EmbeddingsNNBlockBuilding

emb = EmbeddingsNNBlockBuilding(vectorizer='sminilm',
                                similarity_search='faiss')

blocks, g = emb.build_blocks(data,
                             top_k=5,
                             similarity_distance='euclidean',
                             load_embeddings_if_exist=False,
                             save_embeddings=False,
                             with_entity_matching=True)

from pyjedai.clustering import UniqueMappingClustering

ccc = UniqueMappingClustering()
clusters = ccc.process(g, data, similarity_threshold=0.40)
_ = ccc.evaluate(clusters, with_classification_report=True)

Results:
Building blocks via Embeddings-NN Block Building [sminilm, faiss]
Embeddings-NN Block Building [sminilm, faiss]: 100%
2152/2152 [00:20<00:00, 117.82it/s]
Device selected: cuda


                                     Method:  Embeddings-NN Block Building

Method name: Embeddings-NN Block Building
Parameters:
Vectorizer: sminilm
Similarity-Search: faiss
Top-K: 5
Vector size: 384
Runtime: 20.2259 seconds
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Performance:
Precision: 9.38%
Recall: 93.77%
F1-score: 17.05%
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Classification report:
True positives: 1009
False positives: 9751
True negatives: 1156633
False negatives: 67
Total comparisons: 10760
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Statistics:
FAISS:
Indices shape returned after search: (1076, 5)
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
{'Precision %': 9.37732342007435,
'Recall %': 93.77323420074349,
'F1 %': 17.049678945589726,
'True Positives': 1009,
'False Positives': 9751,
'True Negatives': 1156633,
'False Negatives': 67}


                                     Method:  Unique Mapping Clustering

Method name: Unique Mapping Clustering
Parameters:
Runtime: 0.0187 seconds
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Performance:
Precision: 0.57%
Recall: 0.28%
F1-score: 0.37%
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Classification report:
True positives: 3
False positives: 527
True negatives: 1155627
False negatives: 1073
Total comparisons: 530
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
{'Precision %': 0.5660377358490566,
'Recall %': 0.2788104089219331,
'F1 %': 0.37359900373599003,
'True Positives': 3,
'False Positives': 527,
'True Negatives': 1155627,
'False Negatives': 1073}


@Nikoletos-K
Member

What I suggest is to start experimenting with:

  • top_k=5 (try values from 5 to 20)
  • similarity_distance='euclidean' (also try 'cosine')

and then with the clustering:

  • similarity_threshold=0.4 (anywhere from 0 to 1)

You can also check the Optuna tutorial here: https://pyjedai.readthedocs.io/en/latest/tutorials/Optuna.html
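The suggested sweep can be sketched as a plain grid search over those three parameters; the scoring function below is a hypothetical stand-in (in practice each combination would rerun the full pyjedai pipeline of build_blocks + clustering + evaluate and return its F1, or with Optuna you would replace the loop with trial.suggest_* calls):

```python
from itertools import product

def run_workflow(top_k, distance, threshold):
    # Hypothetical stand-in score for illustration only: swap in the
    # real pipeline's F1 when running against actual data.
    return -abs(threshold - 0.6) - 0.01 * abs(top_k - 10)

grid = product(range(5, 21, 5),               # top_k: 5 to 20
               ["euclidean", "cosine"],       # similarity_distance
               [t / 10 for t in range(11)])   # similarity_threshold: 0 to 1

# Keep the configuration with the best score across the whole grid.
best = max(grid, key=lambda cfg: run_workflow(*cfg))
```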

@AI-team-UoA AI-team-UoA locked and limited conversation to collaborators Jul 19, 2023
@Nikoletos-K Nikoletos-K converted this issue into discussion #12 Jul 19, 2023

