Observations on RAGatouille Performance with Numerical Data #200

karthikgali · 2024-04-16T05:45:49Z

Hi,

I was exploring RAGatouille for various use cases and observed that it doesn't work well when the text contains numbers. This is the example I tried:

query = "Verizon added 416,000 broadband subscribers."
raw_results = [
"The broadband subscriber base of Verizon grew by 416,000",
"The broadband subscriber base of Verizon grew by 437,000",
"The broadband subscriber base of Verizon grew by 4,200",
]

Output of jinaai/jina-colbert-v1-en:
[{'content': 'The broadband subscriber base of Verizon grew by 416,000',
'score': 26.900049209594727,
'rank': 0,
'result_index': 0},
{'content': 'The broadband subscriber base of Verizon grew by 437,000',
'score': 26.850547790527344,
'rank': 1,
'result_index': 1},
{'content': 'The broadband subscriber base of Verizon grew by 4,200',
'score': 26.354623794555664,
'rank': 2,
'result_index': 2}]

Output of colbert-ir/colbertv2.0:
[{'content': 'The broadband subscriber base of Verizon grew by 416,000',
'score': 26.490386962890625,
'rank': 0,
'result_index': 0},
{'content': 'The broadband subscriber base of Verizon grew by 437,000',
'score': 25.593244552612305,
'rank': 1,
'result_index': 1},
{'content': 'The broadband subscriber base of Verizon grew by 4,200',
'score': 24.46894073486328,
'rank': 2,
'result_index': 2}]

Output of mixedbread-ai/mxbai-colbert-v1:
[{'content': 'The broadband subscriber base of Verizon grew by 437,000',
'score': 29.360464096069336,
'rank': 0,
'result_index': 1},
{'content': 'The broadband subscriber base of Verizon grew by 416,000',
'score': 29.355073928833008,
'rank': 1,
'result_index': 0},
{'content': 'The broadband subscriber base of Verizon grew by 4,200',
'score': 29.151668548583984,
'rank': 2,
'result_index': 2}]

In the above example, the scores barely differ even though the numbers do, and mixedbread-ai/mxbai-colbert-v1 even ranks the 437,000 passage above the exact match.
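To put a number on how close the scores are, here is a small sketch that computes the relative gap between the exact-match passage (416,000) and the other two for each model. The scores are copied from the outputs above; the dictionary layout and the `relative_gaps` helper are only illustrative and are not part of RAGatouille.

```python
# Scores copied from the outputs above, ordered by result_index:
# [416,000 passage, 437,000 passage, 4,200 passage].
scores = {
    "jinaai/jina-colbert-v1-en": [26.900049209594727, 26.850547790527344, 26.354623794555664],
    "colbert-ir/colbertv2.0": [26.490386962890625, 25.593244552612305, 24.46894073486328],
    "mixedbread-ai/mxbai-colbert-v1": [29.355073928833008, 29.360464096069336, 29.151668548583984],
}


def relative_gaps(model_scores):
    """Return the percentage by which the exact match outscores each other passage.

    A negative value means the other passage scored higher than the exact match.
    """
    exact = model_scores[0]
    return [(exact - other) / exact * 100 for other in model_scores[1:]]


gaps = {model: relative_gaps(s) for model, s in scores.items()}

for model, (gap_437k, gap_4200) in gaps.items():
    print(f"{model}: vs 437,000 = {gap_437k:.3f}%, vs 4,200 = {gap_4200:.3f}%")
```

A negative gap means the model preferred a passage with a different number over the exact match.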

Could someone please explain how to handle numbers?

h4gen · 2024-04-19T10:25:24Z

I think the problem is that BERT is not really good at making sense of numbers in general. We have a similar use case: we use the search only to find the semantically relevant passages and then remove the numeric outliers with filters (we extracted the numbers from the texts, so we have them in table form). That said, it only really works in narrow domains where extracting the numbers makes sense. For broad domains this becomes hard.
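A minimal sketch of that idea, applied to the example from the original post: rerank semantically first, then drop results whose numbers do not match the query. This is not h4gen's actual pipeline. The regex, the `extract_numbers` and `filter_by_numbers` helpers, and the exact-match rule are all assumptions made for illustration, and the input dicts only mimic the `content` and `rank` keys shown in the outputs above.

```python
import re

# Assumed pattern: integers or decimals, with optional thousands separators
# (e.g. "416,000", "4,200", "3.5"). Real data may need a stricter parser.
NUMBER_RE = re.compile(r"\d{1,3}(?:,\d{3})+(?:\.\d+)?|\d+(?:\.\d+)?")


def extract_numbers(text):
    """Return every number found in text as a float, ignoring thousands separators."""
    return [float(m.replace(",", "")) for m in NUMBER_RE.findall(text)]


def filter_by_numbers(query, results):
    """Keep only results that contain every number mentioned in the query."""
    wanted = set(extract_numbers(query))
    if not wanted:
        # Nothing numeric to check, so leave the semantic ranking untouched.
        return results
    return [r for r in results if wanted <= set(extract_numbers(r["content"]))]


query = "Verizon added 416,000 broadband subscribers."
# Ordering taken from the mixedbread-ai/mxbai-colbert-v1 output above.
reranked = [
    {"content": "The broadband subscriber base of Verizon grew by 437,000", "rank": 0},
    {"content": "The broadband subscriber base of Verizon grew by 416,000", "rank": 1},
    {"content": "The broadband subscriber base of Verizon grew by 4,200", "rank": 2},
]

kept = filter_by_numbers(query, reranked)
print(kept)
```

Exact matching is the simplest possible rule. A tolerance-based comparison or a lookup against a pre-extracted table, as described above, may fit better when the numbers are rounded or reported in different units.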

karthikgali changed the title ~~Reranker not considering numbers~~ Observations on RAGatouille Performance with Numerical Data Apr 16, 2024
