Observations on RAGatouille Performance with Numerical Data #200

Open
karthikgali opened this issue Apr 16, 2024 · 1 comment

Hi,

I have been exploring RAGatouille for various use cases and noticed that reranking does not distinguish well between passages that differ only in their numbers. Here is the example I used:

query = "Verizon added 416,000 broadband subscribers."
raw_results = [
"The broadband subscriber base of Verizon grew by 416,000",
"The broadband subscriber base of Verizon grew by 437,000",
"The broadband subscriber base of Verizon grew by 4,200",
]

Output of jinaai/jina-colbert-v1-en:
[{'content': 'The broadband subscriber base of Verizon grew by 416,000',
'score': 26.900049209594727,
'rank': 0,
'result_index': 0},
{'content': 'The broadband subscriber base of Verizon grew by 437,000',
'score': 26.850547790527344,
'rank': 1,
'result_index': 1},
{'content': 'The broadband subscriber base of Verizon grew by 4,200',
'score': 26.354623794555664,
'rank': 2,
'result_index': 2}]

Output of colbert-ir/colbertv2.0:
[{'content': 'The broadband subscriber base of Verizon grew by 416,000',
'score': 26.490386962890625,
'rank': 0,
'result_index': 0},
{'content': 'The broadband subscriber base of Verizon grew by 437,000',
'score': 25.593244552612305,
'rank': 1,
'result_index': 1},
{'content': 'The broadband subscriber base of Verizon grew by 4,200',
'score': 24.46894073486328,
'rank': 2,
'result_index': 2}]

Output of mixedbread-ai/mxbai-colbert-v1:
[{'content': 'The broadband subscriber base of Verizon grew by 437,000',
'score': 29.360464096069336,
'rank': 0,
'result_index': 1},
{'content': 'The broadband subscriber base of Verizon grew by 416,000',
'score': 29.355073928833008,
'rank': 1,
'result_index': 0},
{'content': 'The broadband subscriber base of Verizon grew by 4,200',
'score': 29.151668548583984,
'rank': 2,
'result_index': 2}]

In the example above, the three candidates differ only in the number, yet each model gives them nearly the same score. mixedbread-ai/mxbai-colbert-v1 even ranks the 437,000 passage above the exact match, 416,000.
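To put a number on how close the scores are, here is a minimal sketch that computes the relative gap between the exact match (416,000) and each other candidate. The scores are copied from the outputs above and rounded to four decimals; the names `reported_scores` and `relative_gaps` are made up for this illustration and are not part of RAGatouille.

```python
# Scores copied from the outputs reported above, rounded to 4 decimals.
# Keys are the number that appears in each candidate passage.
reported_scores = {
    "jinaai/jina-colbert-v1-en": {
        "416,000": 26.9000, "437,000": 26.8505, "4,200": 26.3546,
    },
    "colbert-ir/colbertv2.0": {
        "416,000": 26.4904, "437,000": 25.5932, "4,200": 24.4689,
    },
    "mixedbread-ai/mxbai-colbert-v1": {
        "416,000": 29.3551, "437,000": 29.3605, "4,200": 29.1517,
    },
}


def relative_gaps(scores, correct="416,000"):
    """For each non-matching candidate, return (correct - other) / correct in percent.

    A negative value means the non-matching candidate scored higher
    than the exact match.
    """
    base = scores[correct]
    return {
        number: 100.0 * (base - score) / base
        for number, score in scores.items()
        if number != correct
    }


for model, scores in reported_scores.items():
    for number, gap in relative_gaps(scores).items():
        print(f"{model}: 416,000 vs {number}: {gap:+.2f}%")
```

A gap near zero means the model barely separates the two passages; a negative gap means the passage with the wrong number was scored higher.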

Could someone please explain how to handle numbers?

karthikgali changed the title from "Reranker not considering numbers" to "Observations on RAGatouille Performance with Numerical Data" on Apr 16, 2024

h4gen commented Apr 19, 2024

I think the problem is that BERT is generally not good at making sense of numbers. We have a similar use case: we use the search only to find the semantically relevant parts, then remove the numeric outliers with filters (we extracted the numbers from the texts, so we have them in table form). That said, this only works in narrow domains where extracting the numbers makes sense. For broad domains it becomes hard.
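As a rough illustration of the filtering idea, here is a minimal sketch applied to the example from the original post. It is not the setup described above, whose details are not given; it simply assumes that numbers can be pulled out with a regular expression and that a candidate should be kept only if it contains every number in the query. The helper names `extract_numbers` and `filter_by_numbers` are hypothetical.

```python
import re

# Assumption: numbers look like "416,000", "4,200" or "3.5".
# Thousands separators are stripped before comparison.
NUMBER_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")


def extract_numbers(text):
    """Return the set of numeric values found in text."""
    return {float(match.replace(",", "")) for match in NUMBER_RE.findall(text)}


def filter_by_numbers(query, candidates):
    """Keep only candidates that contain every number mentioned in the query."""
    wanted = extract_numbers(query)
    return [c for c in candidates if wanted <= extract_numbers(c)]


query = "Verizon added 416,000 broadband subscribers."
raw_results = [
    "The broadband subscriber base of Verizon grew by 416,000",
    "The broadband subscriber base of Verizon grew by 437,000",
    "The broadband subscriber base of Verizon grew by 4,200",
]

# Apply the numeric filter to the candidates; semantic reranking of the
# survivors would happen separately.
kept = filter_by_numbers(query, raw_results)
print(kept)
```

Exact matching is the simplest possible filter. A real pipeline would likely need tolerance ranges, units, and handling for numbers written as words, which is where broad domains get difficult.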
