I was exploring RAGatouille for various use cases and observed that reranking does not work well when the texts contain numbers. This is the example I considered:
query = "Verizon added 416,000 broadband subscribers."
raw_results = [
"The broadband subscriber base of Verizon grew by 416,000",
"The broadband subscriber base of Verizon grew by 437,000",
"The broadband subscriber base of Verizon grew by 4,200",
]
Output of jinaai/jina-colbert-v1-en:
[{'content': 'The broadband subscriber base of Verizon grew by 416,000',
'score': 26.900049209594727,
'rank': 0,
'result_index': 0},
{'content': 'The broadband subscriber base of Verizon grew by 437,000',
'score': 26.850547790527344,
'rank': 1,
'result_index': 1},
{'content': 'The broadband subscriber base of Verizon grew by 4,200',
'score': 26.354623794555664,
'rank': 2,
'result_index': 2}]
Output of colbert-ir/colbertv2.0:
[{'content': 'The broadband subscriber base of Verizon grew by 416,000',
'score': 26.490386962890625,
'rank': 0,
'result_index': 0},
{'content': 'The broadband subscriber base of Verizon grew by 437,000',
'score': 25.593244552612305,
'rank': 1,
'result_index': 1},
{'content': 'The broadband subscriber base of Verizon grew by 4,200',
'score': 24.46894073486328,
'rank': 2,
'result_index': 2}]
Output of mixedbread-ai/mxbai-colbert-v1:
[{'content': 'The broadband subscriber base of Verizon grew by 437,000',
'score': 29.360464096069336,
'rank': 0,
'result_index': 1},
{'content': 'The broadband subscriber base of Verizon grew by 416,000',
'score': 29.355073928833008,
'rank': 1,
'result_index': 0},
{'content': 'The broadband subscriber base of Verizon grew by 4,200',
'score': 29.151668548583984,
'rank': 2,
'result_index': 2}]
In the example above, the three candidates differ only in the number, yet their scores are almost identical, and mixedbread-ai/mxbai-colbert-v1 even ranks the 437,000 sentence above the exact match.
Could someone please explain how to handle numbers?
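To put a figure on "not much difference", here is a small sketch that computes the relative gap between the top two scores of each model. The scores are copied from the outputs above; `top_two` and `relative_gap` are names made up for this illustration, and for mxbai the top-ranked result is the 437,000 sentence rather than the exact match.

```python
# Top-1 and top-2 scores copied from the reranker outputs above.
top_two = {
    "jinaai/jina-colbert-v1-en": (26.900049209594727, 26.850547790527344),
    "colbert-ir/colbertv2.0": (26.490386962890625, 25.593244552612305),
    "mixedbread-ai/mxbai-colbert-v1": (29.360464096069336, 29.355073928833008),
}


def relative_gap(first: float, second: float) -> float:
    """Gap between the top two scores, as a fraction of the top score."""
    return (first - second) / first


gaps = {model: relative_gap(*scores) for model, scores in top_two.items()}

for model, gap in gaps.items():
    print(f"{model}: {gap:.3%}")
```

A gap this small leaves very little margin for separating an exact numeric match from a near miss using the score alone.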
karthikgali changed the title from "Reranker not considering numbers" to "Observations on RAGatouille Performance with Numerical Data" on Apr 16, 2024
I think the problem is that BERT is not very good at making sense of numbers in general. We have a similar use case: we use the search only to find the semantically relevant passages, then remove the numeric outliers with filters (we extracted the numbers from the texts, so we have them in table form). That said, this only really works in narrow domains where extracting the numbers makes sense. For broad domains it becomes hard.
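A minimal sketch of that kind of post-filter, assuming the numbers can be pulled out of the text with a regular expression. It is not part of RAGatouille; `extract_numbers` and `filter_by_numbers` are hypothetical helpers, and the input is the result format shown in the issue (dicts with a `content` key). The reranker handles the semantic match, and the filter then keeps only candidates that contain every number mentioned in the query.

```python
import re

# Matches integers and decimals with optional thousands separators,
# e.g. "416,000", "4,200", "3.5". A deliberately simple assumption.
NUMBER_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")


def extract_numbers(text: str) -> set[float]:
    """Return the numeric values found in text, with commas removed."""
    return {float(m.replace(",", "")) for m in NUMBER_RE.findall(text)}


def filter_by_numbers(query: str, results: list[dict]) -> list[dict]:
    """Keep only results that contain every number from the query."""
    wanted = extract_numbers(query)
    if not wanted:
        # Nothing numeric to check, so leave the ranking untouched.
        return results
    return [r for r in results if wanted <= extract_numbers(r["content"])]


query = "Verizon added 416,000 broadband subscribers."
# colbert-ir/colbertv2.0 output from the issue.
reranked = [
    {"content": "The broadband subscriber base of Verizon grew by 416,000",
     "score": 26.490386962890625, "rank": 0, "result_index": 0},
    {"content": "The broadband subscriber base of Verizon grew by 437,000",
     "score": 25.593244552612305, "rank": 1, "result_index": 1},
    {"content": "The broadband subscriber base of Verizon grew by 4,200",
     "score": 24.46894073486328, "rank": 2, "result_index": 2},
]

kept = filter_by_numbers(query, reranked)
for r in kept:
    print(r["rank"], r["content"])
```

Exact matching is the strictest possible choice. A tolerance-based comparison, or unit-aware parsing for values like "416k", would be a natural extension, but it depends on the domain.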