Will the LVD-142M dataset or data processing codes be released? #24
Releasing LVD-142M is not something that we are considering, I am afraid. Re: open-sourcing the data curation code, this could depend on feedback and interest from the community.
Would it be possible to share additional details regarding the deduplication part of the data curation pipeline, or how that can be done for a custom dataset?
+1.
It is probably something simple. In Figure 3, you can see examples of what the duplicates include.
Edit: You can find more information in the paper. It is a bit more complex, as it is done on the embeddings.
Thanks for your reply.
@woctezuma Thank you for the reply. I was able to follow the A3 section and get some deduplication results.
@salonit77 Hi :) Could you share the code?
Using #56 instead to keep track of requests about data curation code. Also happy to provide clarifications on the procedure.
Thanks for your reply! I have a question about the deduplication method, SSCD, mentioned in your paper. I would like to confirm: is SSCD only used to extract embeddings, with the Faiss library then used to compute the similarity between embeddings?
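(Not the authors' code, just a sketch.) Once SSCD embeddings are extracted, the similarity-and-threshold step can be prototyped without Faiss at all; the paper uses Faiss for scale, but on a small custom dataset plain NumPy is enough to check the idea. The `threshold` value below is an assumption for illustration, not the paper's setting:

```python
import numpy as np

def deduplicate(embeddings, threshold=0.6):
    """Greedy near-duplicate removal on precomputed embeddings.

    Returns the indices of images to keep; later images whose cosine
    similarity to a kept image exceeds `threshold` are dropped.
    """
    # L2-normalize so the dot product equals cosine similarity
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = emb @ emb.T
    keep, removed = [], set()
    for i in range(len(emb)):
        if i in removed:
            continue
        keep.append(i)
        # mark later images that are near-duplicates of image i
        dups = np.where(sims[i, i + 1:] >= threshold)[0] + i + 1
        removed.update(dups.tolist())
    return keep
```

At scale you would replace the dense `emb @ emb.T` with a Faiss index over the normalized embeddings and query each vector's nearest neighbors instead.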
Thanks for your reply~
So, the editor said something non-existent in OpenReview :D
soo, release it!! 🥇
@yyyyyyfs (Not the author, just want to put in my two cents :) Processing 5B images with a few GPUs would take years. Personally, I'd store the images in GCS buckets and try to apply for their free TPU units to process them. My personal experience is that a cloud provider like GCP usually optimizes the connection between storage and compute units very well, so you don't have to worry about the speed of loading images into memory. You can just treat it as an embarrassingly data-parallel process.
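The embarrassingly parallel part above can be sketched in a few lines: split the image paths into shards and process each shard independently on its own worker. `process_shard` here is a hypothetical stand-in for real embedding extraction on one GPU/TPU worker:

```python
from concurrent.futures import ThreadPoolExecutor

def process_shard(shard):
    # Hypothetical placeholder: replace with real per-shard work,
    # e.g. loading images and running SSCD inference on one worker.
    return [len(path) for path in shard]

def run_sharded(paths, n_shards=4):
    """Split `paths` round-robin into shards and process them in parallel."""
    shards = [paths[i::n_shards] for i in range(n_shards)]
    with ThreadPoolExecutor(max_workers=n_shards) as executor:
        return list(executor.map(process_shard, shards))
```

Because no shard depends on any other, the same pattern scales to separate machines (one shard per VM/TPU host) with no coordination beyond writing results back to the bucket.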
Thanks for the outstanding work. Do you plan to release the LVD-142M dataset or the code for data processing?