
Question about the dataset. #111

Open
Rexzhan opened this issue Dec 30, 2023 · 8 comments

Comments


Rexzhan commented Dec 30, 2023

Nice work! But I hit an error when trying to reproduce your training process. Could you tell me how to prepare these three files: "coco/annotations/panoptic_train2017_filtrefgumdval.json", "coco/annotations/captions_train2017_filtrefgumdval.json", and "coco/annotations/grounding_train2017_filtrefgumd.json"?
Following your dataset preparation guide, I think I only got panoptic_train2017_filtrefgumdval_filtvlp / captions_train2017_filtrefgumdval_filtvlp / grounding_train2017_filtrefgumdval_filtvlp.
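For reference, filtering a COCO-style annotation file down to a subset of images usually amounts to dropping excluded image ids from both the "images" and "annotations" arrays. A minimal sketch, assuming the standard COCO JSON layout (the function name and arguments here are hypothetical, not part of this repo):

```python
import json

def filter_coco_annotations(in_path, out_path, exclude_image_ids):
    """Drop every image (and its annotations) whose id is in exclude_image_ids."""
    with open(in_path) as f:
        data = json.load(f)
    exclude = set(exclude_image_ids)
    data["images"] = [im for im in data["images"] if im["id"] not in exclude]
    data["annotations"] = [a for a in data["annotations"]
                           if a["image_id"] not in exclude]
    with open(out_path, "w") as f:
        json.dump(data, f)
```

The actual "filtrefgumdval" files presumably exclude images overlapping the refcoco/val splits; the exact exclusion set would have to come from the maintainers.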

ziqipang commented

@MaureenZOU Thanks for the update! It seems these new files are not reflected on the README yet. To reproduce SEEM, I assume these new files are correct?


Rexzhan commented Dec 31, 2023

@ziqipang, @MaureenZOU May I ask when the model consumes coco_caption_karpathy_test.arrow, filtcoco2017val_caption_karpathy_train.arrow, etc. in pretrain_arrows_code224 during the training stage? I followed the script in TRAIN.md, but the provided config files do not seem to use vlp_dataset/coco_caption_karpathy. Please correct me if I am wrong.


ziqipang commented Jan 2, 2024

@Rexzhan I think X-Decoder uses that vision-language data, but SEEM uses only the segmentation data. I am just starting to work in this field, so please double-check whether my understanding is correct.

xpzwzwz commented Jan 4, 2024

@MaureenZOU Sorry, but I don't see the file "panoptic_val2017.json" you mentioned in DATASET.md. Could you upload it? Thanks.

MaureenZOU (Collaborator) commented

This can be downloaded from the official website.

seungyoungshin commented

According to DATASET.md, I have downloaded the COCO 2017 dataset from the official website (https://cocodataset.org/#download).
I also created some *.arrow files, but I can't create all of the files needed for training.

How to create "4M Image Text Pairs" with ViLT?

4M Image Text Pairs (X-Decoder)
We follow the exact data preparation for the image text pairs data with [ViLT](https://github.com/dandelin/ViLT/blob/master/DATA.md).

# The pretrained arrow file are put under .xdecoder_data/pretrain_arrows_code224 with the following list of files.
["filtcoco2017val_caption_karpathy_train.arrow", "filtcoco2017val_caption_karpathy_val.arrow", "filtcoco2017val_caption_karpathy_restval.arrow"] + ["code224_vg.arrow"] + [f"code224_sbu_{i}.arrow" for i in range(9)] + [f"code224_conceptual_caption_train_{i}.arrow" for i in range(31)]
# ["filtcoco2017val_caption_karpathy_train.arrow", "filtcoco2017val_caption_karpathy_val.arrow", "filtcoco2017val_caption_karpathy_restval.arrow"] are derived from ["coco_caption_karpathy_train.arrow", "coco_caption_karpathy_val.arrow", "coco_caption_karpathy_restval.arrow"] by deleting images that overlap with coco val2017, to avoid information leakage.
To get started quickly:

# Download coco karparthy test set (we hack the training data to be coco_caption_karpathy_test.arrow only for quick start in the codebase)
wget https://huggingface.co/xdecoder/X-Decoder/resolve/main/coco_caption_karpathy_test.arrow
After dataset preparation, the dataset structure would be:

.xdecoder_data
└── pretrain_arrows_code224/
    ├── coco_caption_karpathy_test.arrow
    ├── *filtcoco2017val_caption_karpathy_train.arrow
    ├── ...
    ├── *code224_vg.arrow
    ├── *code224_sbu_0.arrow
    ├── ...
    ├── *code224_conceptual_caption_train_0.arrow
    └── ...
* Starred datasets are optional when only debugging the pipeline, but they NEED to be added back when you are training the model.
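To avoid a failed training run from a missing file, the expected file list above can be checked up front. A minimal sketch (the helper name and split into quick-start vs. full-training lists are my own, taken from the snippet above):

```python
from pathlib import Path

# File lists copied from the DATASET.md snippet above: the test arrow is enough
# for the quick start, the rest are needed for real training.
QUICK_START = ["coco_caption_karpathy_test.arrow"]
FULL_TRAINING = (
    ["filtcoco2017val_caption_karpathy_train.arrow",
     "filtcoco2017val_caption_karpathy_val.arrow",
     "filtcoco2017val_caption_karpathy_restval.arrow",
     "code224_vg.arrow"]
    + [f"code224_sbu_{i}.arrow" for i in range(9)]
    + [f"code224_conceptual_caption_train_{i}.arrow" for i in range(31)]
)

def missing_arrows(root, names):
    """Return the subset of expected .arrow files absent under root."""
    root = Path(root)
    return [n for n in names if not (root / n).exists()]
```

For example, `missing_arrows(".xdecoder_data/pretrain_arrows_code224", FULL_TRAINING)` returns an empty list only when every training arrow is in place.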
