Support Google SMITH algorithm for long docs matching #719
Comments
Hey @lalitpagaria, thanks for pointing out this paper; it seems like a useful resource. As you said, new model types preferably go into FARM through the HF transformers integration. This particular approach is good for semantically matching two long documents, which is not really QA but could still be a useful use case for haystack: when a user finds an answer inside a document, they might want very similar documents or passages to dig deeper into the topic. I would still suggest waiting for the implementation in transformers and then bringing it in through FARM. Does that sound good?
@Timoeller yes, it makes sense to wait for the implementation in transformers.
There does not seem to be an issue open in transformers. Do you prefer to open it there yourself, or should I? I will close this issue for now; feel free to add things or reopen.
Sorry for changing states. This is a good idea for improving haystack and we should keep it in our icebox - potentially we might need to integrate it ourselves if it does not come to transformers.
@Timoeller transformers is very big now and the people there are quite busy, so I am not sure when they will implement it, if at all. I see FARM as somewhat independent from transformers, so if this encoder brings good value, why not implement it inside FARM itself? From what I understood of the paper, it is a dual-encoder model that is good for matching long documents, so I see its value in Haystack retrievers for fetching long documents. My apologies if my understanding is wrong due to my lack of knowledge in this field.
Hey @lalitpagaria, totally agreed with your points. Moving the issue from "closed" into our icebox means we are prioritizing it higher : ) And you are right, FARM is independent of transformers; integrating model architectures through transformers is just easier. Currently there are no immediate plans to incorporate SMITH into FARM, though, since it is not about retrieving long candidates (given a query) but about matching two long documents with each other. For handling long documents there are also other approaches available, like Longformer or Reformer. Matching two long documents with SMITH could be a stage after an answer (and its related document) was generated. We would like to keep an eye on this use case in case it becomes useful for haystack - that is why we put it into the icebox.
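To make the dual-encoder document-matching idea concrete: SMITH encodes each document hierarchically (sentence blocks first, then a document-level representation) and compares the two document embeddings. The sketch below illustrates only that overall shape with a toy hashed bag-of-words block encoder standing in for the Transformer; the function names and all parameters are illustrative, not from the paper or any library.

```python
import hashlib
import math

def embed_block(text, dim=64):
    """Toy sentence-block encoder: L2-normalized hashed bag-of-words.
    A real SMITH model would run a Transformer over the block instead."""
    vec = [0.0] * dim
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def embed_document(text, block_size=32, dim=64):
    """Split a long document into fixed-size token blocks, encode each
    block independently, then mean-pool into one document vector."""
    tokens = text.split()
    blocks = [" ".join(tokens[i:i + block_size])
              for i in range(0, len(tokens), block_size)] or [""]
    block_vecs = [embed_block(b, dim) for b in blocks]
    return [sum(col) / len(block_vecs) for col in zip(*block_vecs)]

def match_score(doc_a, doc_b):
    """Cosine similarity between the two pooled document embeddings -
    the dual-encoder comparison step."""
    va, vb = embed_document(doc_a), embed_document(doc_b)
    dot = sum(a * b for a, b in zip(va, vb))
    na = math.sqrt(sum(a * a for a in va)) or 1.0
    nb = math.sqrt(sum(b * b for b in vb)) or 1.0
    return dot / (na * nb)
```

Because each document is encoded independently, document embeddings can be precomputed and the matching stage reduces to a vector comparison - which is what would make this usable as a "find similar documents" step after an answer is found.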
I created huggingface/transformers#9526 in transformers. Hopefully they will pick it up. 🤞🏼
transformers 4.2.0 has integrated the Longformer Encoder-Decoder (LED): https://huggingface.co/allenai/led-large-16384
I think it has already been tried and did not perform well. Refer to #61.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 21 days if no further activity occurs. |
Hugging Face is already working on a PR, so let's wait for it. Adding this comment to remove the stale marker.
Today this is an outdated approach. Closing as "won't fix". |
Currently all BERT-based models support searching/matching of docs up to 512 tokens long. The upcoming SMITH algorithm/encoder looks promising and performs better for long docs (up to 2048 tokens).
I think this would go into FARM, but I am not sure. I am happy to work on it if supported.
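Until a long-document encoder is available, a common workaround for the 512-token limit is to split a long document into overlapping windows and encode each window separately. A minimal sketch (the function name and defaults are illustrative, not from haystack or transformers):

```python
def split_into_windows(tokens, max_len=512, stride=256):
    """Split a token sequence into overlapping windows so each piece
    fits a BERT-style encoder's 512-token limit. The overlap
    (max_len - stride tokens) reduces the chance of cutting a
    relevant passage in half at a window boundary."""
    if len(tokens) <= max_len:
        return [tokens]
    windows = []
    start = 0
    while start < len(tokens):
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
        start += stride
    return windows
```

Each window can then be scored against the query independently, with the document's score taken as, e.g., the maximum over its windows.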