
Colbert --> Next step #20

Closed · Tracked by #19
yogeswarl opened this issue Mar 18, 2023 · 14 comments

yogeswarl (Member) commented Mar 18, 2023:

Hello @hosseinfani, I have an issue with TCT-ColBERT: it works when run in a single process, which is how I tested it on the original queries. But when I multiprocess with the predicted queries, it runs into a memory error that I cannot fix by any means with the debugger. Please let me know if you will be available in the lab this weekend so we can sort out this issue with your guidance.

Thanks

yogeswarl mentioned this issue Mar 18, 2023 (31 tasks)
yogeswarl self-assigned this Mar 18, 2023
yogeswarl added the 'bug (Something isn't working)' label Mar 18, 2023
yogeswarl (Member Author) commented:

I believe I have nailed down the problem:
Pyserini added support for dense retrieval in 0.19, and we have 0.17. Somehow the module worked on my Windows machine, but it won't run with multiprocessing, and my laptop gave problems. I followed the instructions in their documentation, and the unit test now runs without any error, where previously it wouldn't. Next I have to figure out how to run Pyserini natively from source instead of installing it as a pip module. I expect my weekend will be spent doing a fresh installation and rewiring it to our needs. I will update you on this by Monday night at the latest.
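A quick way to confirm which release is installed (assuming Pyserini exposes `__version__`, which recent releases do; the 0.19 threshold is from my note above):

```python
import pyserini

# The dense retrieval support discussed above needs pyserini >= 0.19.
print(pyserini.__version__)
```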

yogeswarl (Member Author) commented:

[Screenshot: colbert_test]
After a few tests, I was able to sort out the problem. However, there is no way we can use ColBERT to run through all 25 iterations, as even one is taking this long (see screenshot).
Any suggestions?

hosseinfani (Member) commented:

@yogeswarl
I need some domain knowledge. Can you explain dense retrieval to me, and also why ColBERT is needed?

yogeswarl (Member Author) commented:

TCT-ColBERT is a dense retriever.

Unlike BM25, which uses bag-of-words matching to find the best answer for a query, dense retrieval compares learned vector representations of queries and passages. TCT-ColBERT is a distilled version of ColBERT: a bi-encoder trained on query-passage pairs with ColBERT as the teacher, which makes retrieval effective, although it is orders of magnitude more time-consuming than sparse retrieval.

I will write a literature review on this paper to explain more about TCT-ColBERT:
https://arxiv.org/pdf/2010.11386.pdf
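For context, a minimal sketch of what TCT-ColBERT retrieval looks like through Pyserini (the class names and the prebuilt index name follow Pyserini's docs for recent releases; exact module paths vary across versions, so treat them as assumptions):

```python
from pyserini.search.faiss import FaissSearcher, TctColBertQueryEncoder

# Pretrained TCT-ColBERT query encoder from the castorini model hub.
encoder = TctColBertQueryEncoder('castorini/tct_colbert-msmarco')

# Prebuilt dense index over MS MARCO passage; downloaded on first use.
searcher = FaissSearcher.from_prebuilt_index('msmarco-passage-tct_colbert-hnsw', encoder)

hits = searcher.search('what is dense retrieval', k=10)
for hit in hits[:3]:
    print(hit.docid, hit.score)
```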

hosseinfani (Member) commented Mar 21, 2023:

@yogeswarl
ColBERT has been pretrained on MS MARCO docs. You can check for MS MARCO passage; then you don't need to train. https://github.com/stanford-futuredata/ColBERT

Don't go for the best dense retrievers, since we just want to show their application using our gold standards. So, the simplest dense model is enough.

yogeswarl (Member Author) commented Mar 21, 2023:

For MS MARCO passage, the ColBERT index is already available; I am using the top retrieval index. I am going to go for something simpler and check if that helps. Thanks for your help. But for AOL, we will have to train it ourselves!

yogeswarl (Member Author) commented:

Hello @hosseinfani, I did try every dense model. They can all do a maximum of 10 iterations per second. I reread the paper, and this is one of the stated drawbacks: in achieving effectiveness, they sacrifice retrieval speed.

yogeswarl (Member Author) commented:

I would like to suggest we try it on a sample, probably 10,000 queries or some similarly large number. Please advise.
In the meantime, I will get the CAIR method working on my workstation, so we can finish running the query refinement method as well!

hosseinfani (Member) commented:

@yogeswarl
Not sure I understood the concern. Is it the fine-tuning iterations that take time, or the retrieval step?
I'll be in the lab tomorrow around noon. We can meet then.

yogeswarl (Member Author) commented:

It is the retrieval step. The dense retrieval model is already prebuilt.

yogeswarl (Member Author) commented:

@hosseinfani. As discussed in the lab, here are our next approaches to tackle the dense retrieval problem.
We first want to find how many original queries have an absolute 0.0 as their map score (these are distinguished as hard queries: queries that did not retrieve any relevant documents through sparse retrieval). Result [...].
From here, we could take 2 approaches using bm25.agg.all (see the sketch at the end of this comment):

  • check whether the refined query was able to retrieve with high relevance; if not, write it to a file to pass on to dense retrieval and measure whether this increases effectiveness.
  • collect all queries whose refined versions could not reach effective relevance and draw a sample of 1,000-2,000 to see if dense retrieval is effective.

I will update you with my findings by next week.
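A minimal sketch of the hard-query filtering step above, assuming the per-query MAP scores live in a TSV; the file and column names are hypothetical, to be adapted to however our pipeline emits bm25.agg.all:

```python
import pandas as pd

# Hypothetical per-query evaluation output; assumed columns: qid, query, map.
scores = pd.read_csv('bm25.agg.all.tsv', sep='\t')

# Hard queries: originals whose sparse (BM25) run found no relevant documents.
hard = scores[scores['map'] == 0.0]
print(f'{len(hard)} hard queries out of {len(scores)}')

# Hand these off to the dense retrieval step.
hard[['qid', 'query']].to_csv('hard_queries.tsv', sep='\t', index=False)
```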

yogeswarl (Member Author) commented:

@hosseinfani. Update:
I have let ColBERT run for about 10 hours, which covers approximately 6,000 queries, so a run of that length seems like the right budget; we chose a sample size of 5,000. One major drawback is that we cannot do multiprocessing with ColBERT, since a single instance consumes 24-25 GB of RAM just to load the model and the encoder.
I am looking to optimize it further.
[Screenshot: Screen Shot 2023-04-11 at 9 24 59 AM]
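Drawing the 5,000-query sample could look like this (same hypothetical file layout as the sketch above; the fixed seed is my assumption for reproducibility, not something we agreed on):

```python
import pandas as pd

# Queries whose refined versions remained ineffective (hypothetical file).
hard = pd.read_csv('hard_queries.tsv', sep='\t')

# Fixed seed so the 5,000-query sample is reproducible across runs.
sample = hard.sample(n=5000, random_state=42)
sample.to_csv('hard_queries.sample.tsv', sep='\t', index=False)
```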

yogeswarl (Member Author) commented:

@hosseinfani, ColBERT is almost complete. I will get the results by this evening after computing the map scores, and will add them to our docs.

yogeswarl (Member Author) commented:

@hosseinfani. This is complete. I have added the results to our Google Docs!
