Querying on Subset of Document_IDs #83

Closed
PrimoUomo89 opened this issue Jan 28, 2024 · 7 comments
Labels
enhancement New feature or request good first issue Good for newcomers help wanted Extra attention is needed

Comments

@PrimoUomo89
Contributor

I would love to be able to pass an array of document_ids as an argument to the query function, representing the subset of documents to query. I'm not familiar enough with the inner workings of the technology to propose a resolution myself, but I would gladly take some guidance from someone more senior so I can produce a pull request.

@bclavie bclavie added enhancement New feature or request help wanted Extra attention is needed good first issue Good for newcomers labels Jan 28, 2024
@bclavie
Collaborator

bclavie commented Jan 28, 2024

Hey, this is a great suggestion and has already been requested!

It's doable without too many code changes because ColBERT natively supports querying only a subset of passage IDs. @anirudhdharmarajan can talk a bit more about this but right now we do have a mapping of document_ids to internal pids (passage_ids) at indexing time. This mapping could be used to make it so search() would only look within pids that are mapped to document_ids specified by the user, producing the behaviour you're requesting.
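A minimal sketch of the idea described above, using plain Python rather than the actual RAGatouille or ColBERT code. Everything here is an assumption made for illustration: the `pid_to_doc_id` dictionary stands in for the mapping built at indexing time, and `pids_for_documents` and `search_subset` are hypothetical helpers, not library functions.

```python
# Hypothetical stand-in for the mapping built at indexing time:
# internal passage id (pid) -> user-facing document_id.
pid_to_doc_id = {0: "doc_a", 1: "doc_a", 2: "doc_b", 3: "doc_c", 4: "doc_c"}


def pids_for_documents(pid_to_doc_id, document_ids):
    """Resolve user-supplied document_ids to the pids that belong to them."""
    wanted = set(document_ids)
    return sorted(pid for pid, doc_id in pid_to_doc_id.items() if doc_id in wanted)


def search_subset(scored_pids, allowed_pids, k):
    """Keep only passages whose pid is allowed, then return the k best by score."""
    allowed = set(allowed_pids)
    kept = [(pid, score) for pid, score in scored_pids if pid in allowed]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:k]


# Toy (pid, score) pairs standing in for what a real retriever would produce.
toy_scores = [(0, 0.2), (1, 0.9), (2, 0.95), (3, 0.5), (4, 0.1)]

allowed = pids_for_documents(pid_to_doc_id, ["doc_a", "doc_c"])
top = search_subset(toy_scores, allowed, k=2)
print(allowed)
print(top)
```

In this toy run, pid 2 has the highest score but is excluded because it belongs to `doc_b`, which the user did not ask for. A real implementation would hand the allowed pids to the retriever so it never scores the excluded passages, rather than filtering afterwards.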

@PrimoUomo89
Contributor Author

I had looked a bit at the pids when reviewing @anirudhdharmarajan's pull request for document_metadata, so I'm a little familiar with what you're describing. I'll take a rough swing this week, but it's possible I'm out of my depth. I'll look for feedback when I have a plan.

@PrimoUomo89
Contributor Author

Deleted my previous comments because they had the wrong approach. Will hopefully send a pull request later today.

@PrimoUomo89
Contributor Author

Pull Request complete.

@hehuan2363

Are you referring to something similar to this issue in ColBERT: stanford-futuredata/ColBERT#304?

@PrimoUomo89
Contributor Author

@hehuan2363
The main difference between what I implemented for this issue and what they're discussing in that issue is that, instead of exposing the filter lambda function, I added support for passing document_ids (which are resolved to pids before search). I'm not sure what the best implementation would be on this side, but I imagine a lambda function that resolves document_ids in RAGatouille, which would then resolve the pids to search. If someone asks for it, I'll help out with that.
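To make the contrast concrete, here is a rough sketch of how a list of document_ids could be wrapped into a pid-level filter function. This is not the merged implementation and not ColBERT's actual filter interface: `make_pid_filter` and the example mapping are hypothetical, and the filter is assumed to take and return a plain list of pids.

```python
def make_pid_filter(pid_to_doc_id, document_ids):
    """Build a filter that keeps only pids belonging to the given document_ids.

    Hypothetical helper: the mapping and the list-in, list-out filter shape
    are assumptions for illustration, not a library API.
    """
    wanted = set(document_ids)

    def keep(pids):
        # Preserve the incoming order, dropping pids from other documents.
        return [pid for pid in pids if pid_to_doc_id.get(pid) in wanted]

    return keep


# Assumed example mapping of passage ids to document ids.
example_mapping = {0: "doc_a", 1: "doc_a", 2: "doc_b", 3: "doc_c"}

keep_doc_b = make_pid_filter(example_mapping, ["doc_b"])
print(keep_doc_b([0, 1, 2, 3]))
```

The trade-off is the usual one: a list of document_ids is simpler for callers, while a filter function is more general because it can express conditions that are not a fixed set of ids.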

@bclavie
Collaborator

bclavie commented Feb 14, 2024

Pull

Thank you! Closing this issue as it's been merged.

@bclavie bclavie closed this as completed Feb 14, 2024
3 participants