
Support Google SMITH algorithm for long docs matching #719

Closed
lalitpagaria opened this issue Jan 7, 2021 · 12 comments
Labels
topic:modeling type:feature New feature or request

Comments

@lalitpagaria
Contributor

Currently all BERT-based models support searching/matching documents up to 512 tokens long. The upcoming SMITH algorithm/encoder looks promising and performs better on long documents (up to 2048 tokens).

I think this would go in FARM, but I am not sure. I am happy to work on it if it is wanted.
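As a rough illustration of the 512-token constraint mentioned above (the function and parameter names here are illustrative, not from any specific library): the common workaround is to split a long document into overlapping windows that a BERT-style model can accept, which is exactly the kind of preprocessing a long-document encoder would make unnecessary.

```python
def chunk_tokens(tokens, max_len=512, stride=128):
    """Split a token sequence into overlapping windows of at most max_len tokens.

    Sketches the usual workaround for BERT's 512-token input limit:
    long documents are processed window by window instead of whole.
    """
    if len(tokens) <= max_len:
        return [tokens]
    chunks = []
    step = max_len - stride  # how far each window advances
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
    return chunks

# Example: a 2048-token document becomes five overlapping 512-token windows.
doc = list(range(2048))
windows = chunk_tokens(doc)
print(len(windows), len(windows[0]))  # → 5 512
```

A model that natively handles 2048 tokens would take `doc` in one pass, avoiding the window bookkeeping and the lost cross-window context.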

@lalitpagaria lalitpagaria added the type:feature New feature or request label Jan 7, 2021
@Timoeller
Contributor

Hey @lalitpagaria thanks for pointing out this paper, seems like a useful resource. As you said, new model types preferably go into FARM through HF transformers integration.

This particular approach is good for semantically matching two long documents, which is not really QA but still might be a useful use case for haystack: when a user finds an answer inside a document, the user might be interested in very similar documents or passages to dig deeper into the topic.

I would still suggest waiting for the implementation in transformers and then bringing it in through FARM. Does that sound good?

@Timoeller Timoeller self-assigned this Jan 8, 2021
@lalitpagaria
Contributor Author

@Timoeller yes, it makes sense to wait for the implementation in transformers.

@Timoeller
Contributor

There does not seem to be an issue open in transformers. Do you prefer to open it there yourself or should I?
If you do, please reference this issue so we can track the progress better.

I will close this issue for now, feel free to add things or reopen.

@Timoeller Timoeller reopened this Jan 11, 2021
@Timoeller
Contributor

Sorry for changing states. This is a good idea for improvements to haystack and we should keep it in our icebox - potentially we might need to integrate it ourselves if it does not come to transformers.

@lalitpagaria
Contributor Author

@Timoeller transformers is very big now and the people there are quite busy, so I am not sure when they will implement it, if at all.

I see FARM as somewhat independent of transformers, so if this encoder brings good value, why not implement it inside FARM itself?

From what I understood of the paper, it is a dual-encoder model that is good for matching long documents, so I see its value in Haystack retrievers for fetching long documents.

My apologies, as my understanding might be wrong because of my lack of knowledge in this field. ☺️

@Timoeller
Contributor

Hey @lalitpagaria totally agreed with your points.

Putting the issue from "closed" into our icebox means we are prioritizing it higher : )

And you are right, FARM is independent of transformers; integrating model architectures through transformers is just easier.

Currently there are no immediate plans to incorporate SMITH in FARM, though, since it is not about retrieving long candidates given a query; it is about matching two long documents with each other. For handling long documents there are also other approaches available, like Longformer or Reformer. The matching of two long documents with SMITH could be a stage after an answer (and its related document) has been found. We would like to keep an eye on this use case in case it becomes useful for haystack - that is why we put it into the icebox.
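To make the dual-encoder idea concrete (a minimal sketch with toy vectors; the embeddings below are hypothetical stand-ins, not output of SMITH or any real model): each long document is encoded independently into a single vector, possibly offline, and matching two documents then reduces to a cheap vector comparison such as cosine similarity.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# In a dual-encoder setup, documents are embedded independently, so the
# expensive encoding step can be precomputed and cached. These vectors
# are toy stand-ins for real document embeddings.
doc_a = [0.9, 0.1, 0.3]    # embedding of document A (hypothetical)
doc_b = [0.8, 0.2, 0.4]    # embedding of a similar document (hypothetical)
doc_c = [-0.7, 0.9, -0.2]  # embedding of an unrelated document (hypothetical)

# The similar pair scores higher than the unrelated pair.
print(cosine_similarity(doc_a, doc_b) > cosine_similarity(doc_a, doc_c))  # → True
```

This independence of the two encoders is what makes the approach attractive for a "find similar documents" stage: candidate embeddings can be indexed ahead of time, unlike cross-encoder scoring, which must run the model on every document pair at query time.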

@lalitpagaria
Contributor Author

I created huggingface/transformers#9526 in transformers. Hopefully they will pick it up. 🤞🏼

@Timoeller
Contributor

transformers 4.2.0 has integrated the Longformer Encoder-Decoder (LED): https://huggingface.co/allenai/led-large-16384
That could be used for QA on long documents. Let's test this out soon.

@lalitpagaria
Contributor Author

I think it has already been tried and it did not perform well. Refer to #61.

@stale

stale bot commented Aug 20, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 21 days if no further activity occurs.

@stale stale bot added the stale label Aug 20, 2021
@lalitpagaria
Contributor Author

Hugging Face is already working on a PR, so let's wait for it. Adding this comment to remove the stale marker.

@anakin87
Member

anakin87 commented Feb 6, 2024

Today this is an outdated approach. Closing as "won't fix".

@anakin87 anakin87 closed this as not planned Won't fix, can't repro, duplicate, stale Feb 6, 2024