Preventing duplicates and noise in embeddings #299

Open
drale2k opened this issue Aug 23, 2023 · 2 comments

drale2k commented Aug 23, 2023

I think this should be discussed even if it is not yet in scope for langchainrb, because people will inevitably run into the problem. When embedding documents with langchainrb, what is a good strategy to prevent the same documents or strings from being re-added repeatedly?

For a whole document I think checksums could work (although for big docs the cost of computing a checksum will increase), but what about individual pages of a document or text chunks? I would love some guidance, and maybe later down the road langchainrb can help with this.
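To make the checksum idea concrete at the chunk level, here is a minimal sketch using only Ruby's standard library. `ChunkDeduplicator`, `fingerprint`, and `filter_new` are hypothetical names for illustration and are not part of langchainrb. It assumes that light normalization (collapsing whitespace, lowercasing) is acceptable before hashing, and that the set of seen digests would be persisted somewhere between runs.

```ruby
require "digest"
require "set"

# Hypothetical helper, not part of langchainrb: remembers which chunks
# have already been embedded by storing a SHA-256 digest of each one.
class ChunkDeduplicator
  def initialize(seen = Set.new)
    @seen = seen # assumption: in practice, load this from a database or file
  end

  # Collapse whitespace and case so trivially different copies hash alike.
  def fingerprint(text)
    normalized = text.strip.gsub(/\s+/, " ").downcase
    Digest::SHA256.hexdigest(normalized)
  end

  # Returns only the chunks not seen before and records their fingerprints.
  # Set#add? returns nil when the digest is already present.
  def filter_new(chunks)
    chunks.select { |chunk| @seen.add?(fingerprint(chunk)) }
  end
end

dedup = ChunkDeduplicator.new
first  = dedup.filter_new(["Hello world", "hello   world ", "Another chunk"])
second = dedup.filter_new(["Another chunk", "Brand new chunk"])

puts first.inspect
puts second.inspect
```

Only the chunks returned by `filter_new` would then be sent on for embedding. Note that this catches exact and near-exact repeats only; chunks that are reworded, or that shift because the chunk boundaries changed, will hash differently.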

mengqing (Contributor) commented

It seems this is done through indexing

I wonder if there's a roadmap on porting this feature into langchainrb


drale2k commented Feb 3, 2024

Thanks, that's really useful. It would be great to have something like this in langchainrb, at least a basic version to start with, as it is a real PITA to do this manually.
