
Colbert --> Next step #20

Closed · Tracked by #19
yogeswarl opened this issue Mar 18, 2023 · 14 comments

yogeswarl (Member) commented Mar 18, 2023:

Hello @hosseinfani, I have an issue with TCT-ColBERT: it works when run in a single process, which is how I tested it on the original queries. But when I multiprocess with the predicted queries, it runs into a memory error that I cannot fix by any means with the debugger. Please let me know if you will be available in the lab this weekend so we can sort out this issue with your guidance.

Thanks

yogeswarl mentioned this issue Mar 18, 2023 (31 tasks)
yogeswarl self-assigned this Mar 18, 2023
yogeswarl added the 'bug (Something isn't working)' label Mar 18, 2023
yogeswarl (Member Author) commented:

I believe I have nailed down the problem:
Pyserini added support for dense retrieval in 0.19, and we have 0.17. Somehow the module worked on my Windows machine, but it won't run with multiprocessing, and my laptop gave problems. I followed the instructions in their documentation, and the unit test now runs without any error, where previously it wouldn't. Next I have to figure out how to run Pyserini natively from source instead of installing it as a pip module. I expect my weekend will be spent doing a fresh installation and rewiring it to our needs. I will update you on this by Monday night at the latest.
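A quick way to confirm which release is installed (assuming Pyserini exposes `__version__`, which recent releases do; the 0.19 threshold is from my note above):

```python
import pyserini

# The dense retrieval support discussed above needs pyserini >= 0.19.
print(pyserini.__version__)
```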

yogeswarl (Member Author) commented:

[Screenshot: colbert_test]
After a few tests, I was able to sort out the problem. However, there is no way we can use ColBERT to run through all 25 iterations, as even one is taking this long (see screenshot).
Any suggestions?

hosseinfani (Member) commented:

@yogeswarl
I need some domain knowledge. Can you explain dense retrieval to me, and also why ColBERT is needed?

yogeswarl (Member Author) commented:

TCT-ColBERT is a dense retriever.

Unlike BM25, which uses bag-of-words matching to find the best answer for a query, dense retrieval compares learned vector representations of queries and passages. TCT-ColBERT is a distilled version of ColBERT: a bi-encoder trained on query-passage pairs with ColBERT as the teacher, which makes retrieval effective, although it is orders of magnitude more time-consuming than sparse retrieval.

I will write a literature review on this paper to explain more about TCT-ColBERT:
https://arxiv.org/pdf/2010.11386.pdf
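For context, a minimal sketch of what TCT-ColBERT retrieval looks like through Pyserini (the class names and the prebuilt index name follow Pyserini's docs for recent releases; exact module paths vary across versions, so treat them as assumptions):

```python
from pyserini.search.faiss import FaissSearcher, TctColBertQueryEncoder

# Pretrained TCT-ColBERT query encoder from the castorini model hub.
encoder = TctColBertQueryEncoder('castorini/tct_colbert-msmarco')

# Prebuilt dense index over MS MARCO passage; downloaded on first use.
searcher = FaissSearcher.from_prebuilt_index('msmarco-passage-tct_colbert-hnsw', encoder)

hits = searcher.search('what is dense retrieval', k=10)
for hit in hits[:3]:
    print(hit.docid, hit.score)
```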

hosseinfani (Member) commented Mar 21, 2023:

@yogeswarl
ColBERT has been pretrained on MS MARCO docs. You can check for MS MARCO passage; then you don't need to train. https://github.com/stanford-futuredata/ColBERT

Don't go for the best dense retrievers, since we just want to show their application using our gold standards. So, the simplest dense model is enough.

yogeswarl (Member Author) commented Mar 21, 2023:

For MS MARCO passage, the ColBERT index is already available; I am using the top retrieval index. I am going to go for something simpler and check if that helps. Thanks for your help. But for AOL, we will have to train it ourselves!

yogeswarl (Member Author) commented:

Hello @hosseinfani, I did try every dense model. They can all do a maximum of 10 iterations per second. I reread the paper, and this is one of the stated drawbacks: in achieving effectiveness, they sacrifice retrieval speed.

yogeswarl (Member Author) commented:

I would like to suggest we try it on a sample, probably 10,000 queries or some similarly large number. Please advise.
In the meantime, I will get the CAIR method working on my workstation, so we can finish running the query refinement method as well!

hosseinfani (Member) commented:

@yogeswarl
Not sure I understood the concern. Is it the fine-tuning iterations that take time, or the retrieval step?
I'll be in the lab tomorrow around noon. We can meet then.

yogeswarl (Member Author) commented:

It is the retrieval step. The dense retrieval model is already prebuilt.

yogeswarl (Member Author) commented:

@hosseinfani. As discussed in the lab, here are our next approaches to tackle the dense retrieval problem.
We first want to find how many original queries have an absolute 0.0 as their map score (these are distinguished as hard queries: queries that did not retrieve any relevant documents through sparse retrieval). Result [...].
From here, we could take 2 approaches using bm25.agg.all (see the sketch at the end of this comment):

  • check whether the refined query was able to retrieve with high relevance; if not, write it to a file to pass on to dense retrieval and measure whether this increases effectiveness.
  • collect all queries whose refined versions could not reach effective relevance and draw a sample of 1,000-2,000 to see if dense retrieval is effective.

I will update you with my findings by next week.
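A minimal sketch of the hard-query filtering step above, assuming the per-query MAP scores live in a TSV; the file and column names are hypothetical, to be adapted to however our pipeline emits bm25.agg.all:

```python
import pandas as pd

# Hypothetical per-query evaluation output; assumed columns: qid, query, map.
scores = pd.read_csv('bm25.agg.all.tsv', sep='\t')

# Hard queries: originals whose sparse (BM25) run found no relevant documents.
hard = scores[scores['map'] == 0.0]
print(f'{len(hard)} hard queries out of {len(scores)}')

# Hand these off to the dense retrieval step.
hard[['qid', 'query']].to_csv('hard_queries.tsv', sep='\t', index=False)
```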

yogeswarl (Member Author) commented:

@hosseinfani. Update:
I have let ColBERT run for about 10 hours, which covers approximately 6,000 queries, so a run of that length seems like the right budget; we chose a sample size of 5,000. One major drawback is that we cannot do multiprocessing with ColBERT, since a single instance consumes 24-25 GB of RAM just to load the model and the encoder.
I am looking to optimize it further.
[Screenshot: Screen Shot 2023-04-11 at 9 24 59 AM]
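Drawing the 5,000-query sample could look like this (same hypothetical file layout as the sketch above; the fixed seed is my assumption for reproducibility, not something we agreed on):

```python
import pandas as pd

# Queries whose refined versions remained ineffective (hypothetical file).
hard = pd.read_csv('hard_queries.tsv', sep='\t')

# Fixed seed so the 5,000-query sample is reproducible across runs.
sample = hard.sample(n=5000, random_state=42)
sample.to_csv('hard_queries.sample.tsv', sep='\t', index=False)
```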

yogeswarl (Member Author) commented:

@hosseinfani, ColBERT is almost complete. I will get the results by this evening after computing the map scores, and will add them to our docs.

yogeswarl (Member Author) commented:

@hosseinfani. This is complete. I have added the results to our Google Docs!
