
This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →


Block Filtering and Block Purging after Vector Based Blocking #10

Closed
reversingentropy opened this issue Jul 7, 2023 · 3 comments
Closed
Assignees
Labels
help wanted Extra attention is needed

Comments

@reversingentropy

Hi, I have tried vector-based blocking with sentence transformers and FAISS, and the resulting blocks are a dict mapping indices to sets of indices.
I can't proceed with Block Filtering and Block Purging.
They can't compute block cardinality, since, as I understand it, other methods like QGramsBlocking return a dict of {'key': datamodel.Block} items instead.
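A toy sketch of the mismatch (illustrative dict shapes only, not pyjedai's actual classes) shows why size-based filtering applies to one output format but not the other:

```python
# Traditional blocking methods return {blocking_key: Block}, where each
# Block groups entity ids and therefore has a cardinality (block size).
qgrams_style_blocks = {
    "smi": {0, 3, 7},  # stand-in for a datamodel.Block holding entity ids
    "ith": {1, 3},
}

# Vector-based blocking instead returns {entity_id: candidate_ids},
# i.e. one candidate set per entity rather than key-indexed blocks.
vector_style_blocks = {
    0: {3, 7},
    1: {3},
}

# Block Filtering / Block Purging rank blocks by size, which only makes
# sense for the first shape:
sizes = {key: len(block) for key, block in qgrams_style_blocks.items()}
```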

@Nikoletos-K Nikoletos-K self-assigned this Jul 7, 2023
@Nikoletos-K Nikoletos-K added the help wanted Extra attention is needed label Jul 7, 2023
@Nikoletos-K
Member

Hello, the proper way to use Vector Based Blocking is presented here:

https://pyjedai.readthedocs.io/en/latest/tutorials/pyTorchWorkflow.html

Vector Based Blocking generates a dictionary of ids that correspond to candidate matches. So at the end of vector-based blocking you get either this dictionary or a graph similar to the output of entity matching. FAISS also returns distance/similarity scores, which avoids the need for a separate entity-matching step. Check out the tutorial, and if you have any questions, I'm happy to help.
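A toy sketch of that idea (assumed shapes, not verified against pyjedai internals): because FAISS already attaches a similarity score to each candidate pair, the blocking step can emit weighted edges directly, and a clustering step can threshold on them without a separate matching pass.

```python
# Each candidate pair comes out of the FAISS search with a score attached:
scored_pairs = [
    (0, 1076, 0.91),  # (entity_id, candidate_id, similarity from FAISS)
    (1, 1077, 0.42),
    (2, 1078, 0.17),
]

# A clustering step can then keep only edges above a similarity threshold,
# playing the role that a standalone entity-matching step would otherwise:
kept = [(u, v) for u, v, score in scored_pairs if score >= 0.40]
```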

@reversingentropy
Author

Hi Nikoletos,
I used the exact code from the tutorial with sminilm and faiss, then applied Unique Mapping Clustering.
I got low scores: Precision: 3.24%, Recall: 2.23%, F1-score: 2.64%.
How do I achieve the scores of Precision: 83.18%, Recall: 67.10%, F1-score: 74.28%?


Code:

from pyjedai.vector_based_blocking import EmbeddingsNNBlockBuilding

emb = EmbeddingsNNBlockBuilding(vectorizer='sminilm',
                                similarity_search='faiss')

blocks, g = emb.build_blocks(data,
                             top_k=5,
                             similarity_distance='euclidean',
                             load_embeddings_if_exist=False,
                             save_embeddings=False,
                             with_entity_matching=True)

from pyjedai.clustering import UniqueMappingClustering

ccc = UniqueMappingClustering()
clusters = ccc.process(g, data, similarity_threshold=0.40)
_ = ccc.evaluate(clusters, with_classification_report=True)

Results:
Building blocks via Embeddings-NN Block Building [sminilm, faiss]
Embeddings-NN Block Building [sminilm, faiss]: 100%
2152/2152 [00:20<00:00, 117.82it/s]
Device selected: cuda


                                     Method:  Embeddings-NN Block Building

Method name: Embeddings-NN Block Building
Parameters:
Vectorizer: sminilm
Similarity-Search: faiss
Top-K: 5
Vector size: 384
Runtime: 20.2259 seconds
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Performance:
Precision: 9.38%
Recall: 93.77%
F1-score: 17.05%
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Classification report:
True positives: 1009
False positives: 9751
True negatives: 1156633
False negatives: 67
Total comparisons: 10760
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Statistics:
FAISS:
Indices shape returned after search: (1076, 5)
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
{'Precision %': 9.37732342007435,
'Recall %': 93.77323420074349,
'F1 %': 17.049678945589726,
'True Positives': 1009,
'False Positives': 9751,
'True Negatives': 1156633,
'False Negatives': 67}


                                     Method:  Unique Mapping Clustering

Method name: Unique Mapping Clustering
Parameters:
Runtime: 0.0187 seconds
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Performance:
Precision: 0.57%
Recall: 0.28%
F1-score: 0.37%
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Classification report:
True positives: 3
False positives: 527
True negatives: 1155627
False negatives: 1073
Total comparisons: 530
───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
{'Precision %': 0.5660377358490566,
'Recall %': 0.2788104089219331,
'F1 %': 0.37359900373599003,
'True Positives': 3,
'False Positives': 527,
'True Negatives': 1155627,
'False Negatives': 1073}


@Nikoletos-K
Member

What I suggest is to start experimenting with:

  • top_k=5 (try values from 5 to 20)
  • similarity_distance='euclidean' (also try 'cosine')

and then with the clustering:

  • similarity_threshold=0.4 (anywhere from 0 to 1)

You can also check the Optuna tutorial here: https://pyjedai.readthedocs.io/en/latest/tutorials/Optuna.html
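The suggested sweep can be sketched as a plain grid search over those three parameters; the scoring function below is a hypothetical stand-in (in practice each combination would rerun the full pyjedai pipeline of build_blocks + clustering + evaluate and return its F1, or with Optuna you would replace the loop with trial.suggest_* calls):

```python
from itertools import product

def run_workflow(top_k, distance, threshold):
    # Hypothetical stand-in score for illustration only: swap in the
    # real pipeline's F1 when running against actual data.
    return -abs(threshold - 0.6) - 0.01 * abs(top_k - 10)

grid = product(range(5, 21, 5),               # top_k: 5 to 20
               ["euclidean", "cosine"],       # similarity_distance
               [t / 10 for t in range(11)])   # similarity_threshold: 0 to 1

# Keep the configuration with the best score across the whole grid.
best = max(grid, key=lambda cfg: run_workflow(*cfg))
```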

@AI-team-UoA AI-team-UoA locked and limited conversation to collaborators Jul 19, 2023
@Nikoletos-K Nikoletos-K converted this issue into discussion #12 Jul 19, 2023

