
How to get .source and .target file at comet_atomic2020_bart #2

Closed
yongho94 opened this issue Feb 20, 2021 · 6 comments
@yongho94

Hello sir.

I tried to run your code that uses the BART model to generate knowledge triples.

In your code, "models/comet_atomic2020_bart/finetune.py" requires a "train.source" file and a "train.target" file, but I couldn't figure out how to get these files.

How can I get them?

Thanks.

@RubenBranco

RubenBranco commented Feb 22, 2021

Hi @yongho94,

I'm not one of the authors, but I might be able to help here. The code expects the .source/.target format that used to be the standard for the Hugging Face libraries before the datasets library came about. Here's the example page: https://github.com/huggingface/transformers/tree/master/examples/legacy/seq2seq

To produce this for COMET, you iterate over the CSV file and, for each row, concatenate the head with the relation as "{head} {rel}" and write it to a "train.source" file; the tail is written to a "train.target" file, such that line i of one file corresponds to line i of the other.

I might be wrong, though.
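For what it's worth, a minimal sketch of that conversion could look like the following. The column names (`head_event`, `relation`, `tail_event`) and the tab delimiter are assumptions; adjust them to whatever the actual ATOMIC 2020 file header uses.

```python
import csv

def write_seq2seq_files(csv_path, split="train", delimiter="\t"):
    """Convert a head/relation/tail file into the legacy seq2seq
    .source/.target format: line i of .source pairs with line i of .target.

    Column names below are assumptions; rename to match the real header.
    """
    with open(csv_path, newline="", encoding="utf-8") as f, \
         open(f"{split}.source", "w", encoding="utf-8") as src, \
         open(f"{split}.target", "w", encoding="utf-8") as trg:
        for row in csv.DictReader(f, delimiter=delimiter):
            # Source line: "{head} {rel}"; target line: the tail.
            src.write(f'{row["head_event"]} {row["relation"]}\n')
            trg.write(f'{row["tail_event"]}\n')
```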

@yongho94
Author


Thanks @RubenBranco !!

It seems to work. I'll give it a try.

Thanks.

@keisks
Contributor

keisks commented Feb 23, 2021

Hi @yongho94,

Thank you for your question. Regarding the data format for BART, @RubenBranco is correct.

The src and trg datasets (for BART) are available here. If you are also interested in the model we trained, you can get it from here.

I hope this helps!

@keisks keisks closed this as completed Feb 23, 2021
@Kelaxon

Kelaxon commented Mar 29, 2021

@keisks, Sorry for re-opening this, and thanks for the fantastic work.

I have another question about the data format:

I saw that there are some "none" targets in the training, validation, and test sets. Why do you include them? Are they used to prevent over-fitting? If so, how do you determine the ratio and the sampling method?

Thanks!
[screenshot: dataset rows with "none" targets]

@keisks
Contributor

keisks commented Mar 31, 2021

The "none" targets mean that annotators answered that there are no tails for a given head and relation. In the dataset, we include all the annotations (i.e., no sampling). As you can see, they are sometimes redundant because multiple annotators give the same answer.
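If it helps, here is a small sketch for checking what fraction of a split has a "none" tail. It assumes the .target file has one tail per line, as in the legacy seq2seq format.

```python
def none_ratio(target_path):
    """Fraction of lines in a .target file whose tail is the literal 'none'.

    Assumes one tail per line (legacy seq2seq .target format).
    """
    total = nones = 0
    with open(target_path, encoding="utf-8") as f:
        for line in f:
            total += 1
            if line.strip().lower() == "none":
                nones += 1
    return nones / total if total else 0.0
```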

@Kelaxon

Kelaxon commented Apr 1, 2021

Thanks for the explanation 👍🏻
