
Support Google SMITH algorithm for long docs matching #719

Closed
lalitpagaria opened this issue Jan 7, 2021 · 12 comments
Labels
topic:modeling type:feature New feature or request

Comments

@lalitpagaria
Contributor

Currently all BERT-based models support searching/matching documents up to 512 tokens long. The upcoming SMITH algorithm/encoder looks promising and performs better on long documents (up to 2048 tokens).

I think this would go in FARM, but I am not sure. I am happy to work on it if it is wanted.
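As a rough illustration of the 512-token constraint mentioned above (the function and parameter names here are illustrative, not from any specific library): the common workaround is to split a long document into overlapping windows that a BERT-style model can accept, which is exactly the kind of preprocessing a long-document encoder would make unnecessary.

```python
def chunk_tokens(tokens, max_len=512, stride=128):
    """Split a token sequence into overlapping windows of at most max_len tokens.

    Sketches the usual workaround for BERT's 512-token input limit:
    long documents are processed window by window instead of whole.
    """
    if len(tokens) <= max_len:
        return [tokens]
    chunks = []
    step = max_len - stride  # how far each window advances
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
    return chunks

# Example: a 2048-token document becomes five overlapping 512-token windows.
doc = list(range(2048))
windows = chunk_tokens(doc)
print(len(windows), len(windows[0]))  # → 5 512
```

A model that natively handles 2048 tokens would take `doc` in one pass, avoiding the window bookkeeping and the lost cross-window context.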

@lalitpagaria lalitpagaria added the type:feature New feature or request label Jan 7, 2021
@Timoeller
Contributor

Hey @lalitpagaria thanks for pointing out this paper, seems like a useful resource. As you said, new model types preferably go into FARM through HF transformers integration.

This particular approach is good for semantically matching two long documents, which is not really QA but still might be a useful use case for haystack: when a user finds an answer inside a document, the user might be interested in very similar documents or passages to dig deeper into the topic.

I would still suggest waiting for the implementation in transformers and then bringing it in through FARM. Does that sound good?

@Timoeller Timoeller self-assigned this Jan 8, 2021
@lalitpagaria
Contributor Author

@Timoeller yes, it makes sense to wait for the implementation in transformers.

@Timoeller
Contributor

There does not seem to be an issue open in transformers. Do you prefer to open it there yourself or should I?
If you do, please reference this issue so we can track the progress better.

I will close this issue for now, feel free to add things or reopen.

@Timoeller Timoeller reopened this Jan 11, 2021
@Timoeller
Contributor

Sorry for changing states. This is a good idea for improvements to haystack and we should keep it in our icebox - potentially we might need to integrate it ourselves if it does not come to transformers.

@lalitpagaria
Contributor Author

@Timoeller transformers is very big now and the people there are quite busy, so I am not sure when they will implement it, if at all.

I see FARM as somewhat independent of transformers, so if this encoder brings good value, why not implement it inside FARM itself?

From what I understood of the paper, it is a dual-encoder model that is good for matching long documents, so I see its value in Haystack retrievers for fetching long documents.

My apologies, as my understanding might be wrong because of my lack of knowledge in this field. ☺️

@Timoeller
Contributor

Hey @lalitpagaria totally agreed with your points.

Putting the issue from "closed" into our icebox means we are prioritizing it higher : )

And you are right, FARM is independent of transformers; integrating model architectures through transformers is just easier.

Currently there are no immediate plans to incorporate SMITH in FARM, though, since it is not about retrieving long candidates given a query; it is about matching two long documents with each other. For handling long documents there are also other approaches available, like Longformer or Reformer. The matching of two long documents with SMITH could be a stage after an answer (and its related document) has been found. We would like to keep an eye on this use case in case it becomes useful for haystack - that is why we put it into the icebox.
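To make the dual-encoder idea concrete (a minimal sketch with toy vectors; the embeddings below are hypothetical stand-ins, not output of SMITH or any real model): each long document is encoded independently into a single vector, possibly offline, and matching two documents then reduces to a cheap vector comparison such as cosine similarity.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# In a dual-encoder setup, documents are embedded independently, so the
# expensive encoding step can be precomputed and cached. These vectors
# are toy stand-ins for real document embeddings.
doc_a = [0.9, 0.1, 0.3]    # embedding of document A (hypothetical)
doc_b = [0.8, 0.2, 0.4]    # embedding of a similar document (hypothetical)
doc_c = [-0.7, 0.9, -0.2]  # embedding of an unrelated document (hypothetical)

# The similar pair scores higher than the unrelated pair.
print(cosine_similarity(doc_a, doc_b) > cosine_similarity(doc_a, doc_c))  # → True
```

This independence of the two encoders is what makes the approach attractive for a "find similar documents" stage: candidate embeddings can be indexed ahead of time, unlike cross-encoder scoring, which must run the model on every document pair at query time.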

@lalitpagaria
Contributor Author

I created huggingface/transformers#9526 in transformers. Hopefully they will pick it up. 🤞🏼

@Timoeller
Contributor

transformers 4.2.0 has integrated the Longformer Encoder-Decoder (LED): https://huggingface.co/allenai/led-large-16384
That could be used for QA on long documents. Let's test this out soon.

@lalitpagaria
Contributor Author

I think it has already been tried and it did not perform well. Refer to #61.

@stale

stale bot commented Aug 20, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 21 days if no further activity occurs.

@stale stale bot added the stale label Aug 20, 2021
@lalitpagaria
Contributor Author

Hugging Face is already working on a PR, so let's wait for it. Adding this comment to remove the stale marker.

@anakin87
Member

anakin87 commented Feb 6, 2024

Today this is an outdated approach. Closing as "won't fix".

@anakin87 anakin87 closed this as not planned Won't fix, can't repro, duplicate, stale Feb 6, 2024