Content deduplication - Reference #1496 #1641
sekh77
started this conversation in
Show and tell
Hi @sekh77! The
for query in most_similar_docs:
    highest_scored_doc = query[0]
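Building on that loop, here is a minimal sketch of how the CSV report from the question might be produced. It rests on several assumptions that the thread does not confirm: that most_similar_docs is a list of lists (which the two AttributeErrors suggest), that the first entry of each inner list is the query document itself, and that each entry exposes score and meta either as dict keys or as attributes. The helper names (_get, duplicate_rows, write_report) and the 0.9 threshold are made up for illustration and are not part of Haystack.

```python
import csv
import io


def _get(doc, field):
    """Read a field from a dict key or, failing that, an object attribute."""
    if isinstance(doc, dict):
        return doc[field]
    return getattr(doc, field)


def duplicate_rows(most_similar_docs, threshold=0.9):
    """Yield (file name, score, duplicate file) rows from a nested result list."""
    for query_results in most_similar_docs:  # one inner list per query document
        if not query_results:
            continue
        source = query_results[0]  # assumption: the query document comes first
        source_name = _get(source, "meta")["name"]
        for candidate in query_results[1:]:
            score = _get(candidate, "score")
            if score >= threshold:  # hypothetical cut-off for "duplicate"
                yield (source_name, f"{score:.1%}", _get(candidate, "meta")["name"])


def write_report(most_similar_docs, out, threshold=0.9):
    """Write the header and all duplicate rows to an open text stream."""
    writer = csv.writer(out)
    writer.writerow(["File name", "Score", "Duplicate Files"])
    writer.writerows(duplicate_rows(most_similar_docs, threshold))
```

To write a file, something like `with open("duplicates.csv", "w", newline="") as f: write_report(most_similar_docs, f)` should work. Indexing the outer list first is also what avoids the errors in the question: most_similar_docs[0] is itself a list, so .get and .score only exist one level further down, for example on most_similar_docs[0][0].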
This is in reference to my earlier discussion #1496.
I managed to get the MostSimilarDocumentsPipeline running on my document store and can see duplicates being reported. Here's an example of the result object.
most_similar_docs = [[{'text': '<>', 'score': 1.0, 'question': None, 'meta': {'_split_id': 0, 'name': 'file_0673.txt'}, 'embedding': None, 'id': 'e1eccfd26a6354b493a601bf966d2b2a'}, {'text': '<>', 'score': 0.93728964, 'question': None, 'meta': {'_split_id': 0, 'name': 'file_0781.txt'}, 'embedding': None, 'id': 'ea119020fb1dad657dbbef87e7419894'}]]
I have 10 entries in the result object, most_similar_docs[0] to most_similar_docs[9]. Each was retrieved with top_k=4, so most_similar_docs[0] contains 4 documents.
How do I loop through most_similar_docs, and generate a CSV report as follows:
File name, Score, Duplicate Files
file_0673.txt, 93.7%, file_0781.txt
So far I have tried: print(list(map(lambda item: item.get('score', 'default value'), most_similar_docs)))
But I get the error: AttributeError: 'list' object has no attribute 'get'
Any help would be greatly appreciated.
Also, as @bogdankostic suggested in one of his replies in discussion #1496, I tried most_similar_docs[0].score, but I get an error:
Traceback (most recent call last):
File "", line 1, in
AttributeError: 'list' object has no attribute 'score'
Thanks,
Sekhar H.