Will the LVD-142M dataset or data processing codes be released? #24

Closed
XiaohuJoshua opened this issue Apr 19, 2023 · 13 comments
Labels: enhancement (New feature or request)

Comments

@XiaohuJoshua

Thanks for the outstanding work. Do you have the plan to release the LVD-142M dataset or codes for data processing?

@patricklabatut
Contributor

Releasing LVD-142M is not something that we are considering, I am afraid. Re: open-sourcing the data curation code, this could depend on feedback and interest from the community.

@patricklabatut patricklabatut self-assigned this Apr 19, 2023
@patricklabatut patricklabatut added the enhancement label Apr 19, 2023
@salonit77

salonit77 commented Apr 19, 2023

Would it be possible to share additional details regarding the deduplication part used in the data curation pipeline, or how that can be done for a custom dataset?

@XiaohuJoshua
Author

Would it be possible to share additional details regarding the deduplication part used in the data curation pipeline, or how that can be done for a custom dataset?

+1.

@woctezuma

woctezuma commented Apr 20, 2023

Would it be possible to share additional details regarding the deduplication part used in the data curation pipeline, or how that can be done for a custom dataset?

It is probably something simple like idealo/imagededup.
Edit: as shown in Figure 3, the procedure is applied to embeddings rather than images, so it is a bit more complex.

In Figure 3, you can see that duplicates include:

  • an image that is identical to one in the curated dataset,
  • another image that has a different aspect ratio but is otherwise nearly identical to another image in the uncurated dataset.

[Image: deduplication examples]

Edit: You can find more information in the paper. It is a bit more complex as it is done on the embeddings.
However, I believe a method using image hashes is also used in a first deduplication process, as "PCA hash" is mentioned. 🤔

[Paper and appendix excerpts]
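
For anyone who wants to try something similar on a custom dataset, here is a minimal, unofficial sketch of embedding-level deduplication with Faiss. It assumes the copy-detection embeddings (e.g. from an SSCD-like model) have already been extracted, and the neighbour count and similarity threshold are placeholders rather than the paper's values:

```python
import faiss
import numpy as np

def deduplicate(embeddings: np.ndarray, threshold: float = 0.6, k: int = 16) -> np.ndarray:
    """Return the indices of images to keep, dropping near-duplicates.

    `embeddings` is an (N, D) array of copy-detection descriptors.
    Two images count as duplicates when the cosine similarity of their
    embeddings exceeds `threshold` (an arbitrary placeholder value).
    """
    embeddings = np.array(embeddings, dtype=np.float32)   # work on a float32 copy
    faiss.normalize_L2(embeddings)                # cosine similarity == inner product
    index = faiss.IndexFlatIP(embeddings.shape[1])
    index.add(embeddings)

    sims, ids = index.search(embeddings, k)       # k nearest neighbours per image

    keep = np.ones(len(embeddings), dtype=bool)
    for i in range(len(embeddings)):
        if not keep[i]:
            continue
        for sim, j in zip(sims[i], ids[i]):
            if j > i and sim > threshold:
                keep[j] = False                   # drop later copies, keep the first one
    return np.nonzero(keep)[0]
```

At LVD-142M scale a flat index and a Python loop like this would not fly; you would shard the data and use an approximate index, but the logic is the same.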

@XiaohuJoshua
Author

Thanks for your reply.

@salonit77

@woctezuma Thank you for the reply. I was able to follow Section A.3 and get some deduplication results.

@woctezuma woctezuma mentioned this issue Apr 21, 2023
@FrancescoSaverioZuppichini

@salonit77 hi :) could you share the code?

@patricklabatut patricklabatut changed the title Will the LVD-142M dataset or data processing codes be released? [request] LVD-142M pretraining dataset and / or data curation code Apr 24, 2023
@patricklabatut
Contributor

patricklabatut commented Apr 24, 2023

Using #56 instead to keep track of requests about data curation code. Also happy to provide clarifications on the procedure.

@patricklabatut patricklabatut changed the title [request] LVD-142M pretraining dataset and / or data curation code Will the LVD-142M dataset or data processing codes be released? Apr 24, 2023
@yyyyyyfs


Thanks for your reply! I have a question about the deduplication method SSCD mentioned in your paper. I would like to confirm: is SSCD only used to extract embeddings, and is the Faiss library then used to calculate the similarity between the embeddings?
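
Not an answer from the team, but that two-step recipe (copy-detection embeddings, then a Faiss similarity search like the sketch above) is easy to prototype. The SSCD repository (facebookresearch/sscd-copy-detection) publishes TorchScript checkpoints, so the embedding-extraction side could look roughly like this; the checkpoint filename, input size and normalisation below are assumptions, not confirmed details of the DINOv2 pipeline:

```python
import torch
from PIL import Image
from torchvision import transforms

# TorchScript checkpoint downloaded from facebookresearch/sscd-copy-detection
# (filename is illustrative; use whichever released model you pick).
model = torch.jit.load("sscd_disc_mixup.torchscript.pt").eval()

preprocess = transforms.Compose([
    transforms.Resize(288),
    transforms.CenterCrop(288),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(path: str) -> torch.Tensor:
    """Return one copy-detection descriptor for the image at `path`."""
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    return model(img)[0]
```

The resulting descriptors can then be fed to a normalised inner-product Faiss index to find near-duplicate pairs.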

@yyyyyyfs

Thanks for your reply~
About the implementation details, I still have some questions. If I have 5B images but only a few GPUs, how can I do this work? Reading a large amount of data into memory at the same time is also a serious problem. Do you have any good suggestions?

@forever208

forever208 commented Feb 20, 2024

Releasing LVD-142M is not something that we are considering, I am afraid. Re: open-sourcing the data curation code, this could depend on feedback and interest from the community.

So, the editor said something non-existent on OpenReview :D
https://openreview.net/forum?id=a68SUt6zFt

@dvikdvik

soo, release it!! 🥇

@rfan-debug

@yyyyyyfs (Not the author, just want to put in my two cents :) Processing 5B images with a few GPUs would take years. Personally, I'd store the images in GCS buckets and try to apply for their free TPU units to process them. My personal experience is that a cloud provider like GCP usually optimizes the connection between storage and compute units very well, so you don't have to worry about the speed of loading images into memory. You can just treat it as an embarrassingly data-parallel process.
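
To make the embarrassingly-parallel point concrete, here is a rough sketch (file names, shard counts and batch sizes are made up) of how each worker could stream only its own slice of the image list, so nothing close to the full 5B images ever sits in memory at once:

```python
import numpy as np
import torch
from PIL import Image
from torch.utils.data import DataLoader, Dataset

class ShardDataset(Dataset):
    """One worker's slice of a global image list; files are read lazily."""

    def __init__(self, list_file: str, shard_id: int, num_shards: int, transform):
        with open(list_file) as f:
            paths = [line.strip() for line in f]
        self.paths = paths[shard_id::num_shards]   # round-robin split of the list
        self.transform = transform

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, i):
        return self.transform(Image.open(self.paths[i]).convert("RGB"))

@torch.no_grad()
def process_shard(model, transform, list_file, shard_id, num_shards, out_file):
    loader = DataLoader(
        ShardDataset(list_file, shard_id, num_shards, transform),
        batch_size=256, num_workers=8, pin_memory=True,
    )
    chunks = [model(batch.cuda()).cpu().numpy() for batch in loader]
    np.save(out_file, np.concatenate(chunks))

# Each batch job / GPU then runs something like:
#   process_shard(model, preprocess, "images.txt", shard_id=42, num_shards=10000,
#                 out_file="embeddings_00042.npy")
```

Each shard's embeddings are written out independently, and the deduplication or retrieval steps can later run over the merged embedding files.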
