Content deduplication - Reference #1496 #1641

sekh77 · 2021-10-23T14:08:33Z

This is in reference to my earlier discussion #1496.

I managed to get the "MostSimilarDocumentsPipeline" running for my document store, and I can see duplicates being reported. Here's an example of the Document result object:

most_similar_docs = [[{'text': '<>', 'score': 1.0, 'question': None, 'meta': {'_split_id': 0, 'name': 'file_0673.txt'}, 'embedding': None, 'id': 'e1eccfd26a6354b493a601bf966d2b2a'}, {'text': '<>', 'score': 0.93728964, 'question': None, 'meta': {'_split_id': 0, 'name': 'file_0781.txt'}, 'embedding': None, 'id': 'ea119020fb1dad657dbbef87e7419894'}]]

I have 10 entries in the result object, most_similar_docs[0] to most_similar_docs[9]. Each entry was retrieved with top_k=4, so most_similar_docs[0] holds 4 documents.

How do I loop through most_similar_docs, and generate a CSV report as follows:

File name, Score, Duplicate Files
file_0673.txt, 93.7%, file_0781.txt

So far I have tried this: print(list(map(lambda item: item.get('score', 'default value'), most_similar_docs)))
But I get the error: AttributeError: 'list' object has no attribute 'get'

Any help would be greatly appreciated.

Also, as @bogdankostic suggested in one of his replies in discussion #1496, I tried this statement: most_similar_docs[0].score. But I get an error:
Traceback (most recent call last):
File "", line 1, in
AttributeError: 'list' object has no attribute 'score'

Thanks,
Sekhar H.

bogdankostic · 2021-10-25T14:14:15Z

Hi @sekh77!

The MostSimilarDocumentsPipeline returns a list of lists of Document objects. For each document_id you provide to the pipeline's run method, you get a list of Document objects sorted by score with regard to the document you used as query. To get the highest scored document for each document you provided to the run method, the following should work:

for query in most_similar_docs:
    highest_scored_doc = query[0]
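
Building on that loop, the CSV report asked for above could be sketched roughly as follows. This is only a sketch under stated assumptions: `Doc` is a hypothetical stand-in for the Document objects (only `score` and `meta` are used), the first hit in each inner list is assumed to be the query document itself (score 1.0, as in the example result), and the `threshold` of 0.9 and the function name are illustrative choices, not part of the pipeline.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a Document (only score and meta are used)."""
    score: float
    meta: dict = field(default_factory=dict)


def write_duplicate_report(most_similar_docs, out, threshold=0.9):
    """Write one CSV row per (query file, similar file) pair scoring >= threshold."""
    writer = csv.writer(out)
    writer.writerow(["File name", "Score", "Duplicate Files"])
    for result in most_similar_docs:
        if not result:
            continue  # skip queries that returned no documents
        # Assumption: the top hit is the query document itself (score 1.0).
        query_doc, candidates = result[0], result[1:]
        for doc in candidates:
            if doc.score >= threshold:
                writer.writerow(
                    [query_doc.meta["name"], f"{doc.score:.1%}", doc.meta["name"]]
                )


# Sample data shaped like the result object shown in the question.
sample = [[
    Doc(1.0, {"_split_id": 0, "name": "file_0673.txt"}),
    Doc(0.93728964, {"_split_id": 0, "name": "file_0781.txt"}),
]]
buffer = io.StringIO()
write_duplicate_report(sample, buffer)
report = buffer.getvalue()
```

To write to disk instead of an in-memory buffer, pass a file opened with `open("duplicates.csv", "w", newline="")`.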

sekh77 Oct 26, 2021
Author

Hi @bogdankostic -

Yes, the pipeline is returning a list of lists of Document objects. I got what I needed by using two indexes. For example, most_similar_docs[0][1].score gave me the score of the second-highest matching document, most_similar_docs[0][1].meta['name'] gave me the file name, and so on.
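
For illustration, here is a minimal sketch of that two-index access. It also shows why the earlier `map(... item.get ...)` attempt failed: each element of the outer list is itself a list, so the lookup has to go one level deeper and use attribute access. The `SimpleNamespace` objects are hypothetical stand-ins for the Document objects, filled with the values from the example result.

```python
from types import SimpleNamespace

# Hypothetical stand-ins for Document objects (attribute access, not dict access).
most_similar_docs = [[
    SimpleNamespace(score=1.0, meta={"_split_id": 0, "name": "file_0673.txt"}),
    SimpleNamespace(score=0.93728964, meta={"_split_id": 0, "name": "file_0781.txt"}),
]]

# The outer index selects the query document, the inner index the rank.
second_best = most_similar_docs[0][1]
second_best_name = second_best.meta["name"]

# Nested comprehensions replace the failing map/lambda:
# one list of scores (and file names) per query document.
scores_per_query = [[doc.score for doc in result] for result in most_similar_docs]
names_per_query = [[doc.meta["name"] for doc in result] for result in most_similar_docs]
```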

For some reason I am getting only 10 results from the run query. I am not sure whether there is a default setting that I should change in the Elasticsearch config. If you know about this, kindly let me know.

Also, updating the embeddings on every run is time- and compute-intensive. Is there a way to reuse the embeddings that were calculated on the first run if the documents are the same?

Finally, how can I get the sentence transformer model downloaded to disk, and then just load it during runtime?

Thanks,
Sekhar H.
