Suggestions to run it against other datasets #7

Closed
Jurys22 opened this issue Dec 15, 2021 · 4 comments

Jurys22 commented Dec 15, 2021

Hi! I'm pretty new to deep learning and ASTE.

Could you please suggest the steps needed to run this on another dataset?
Do I need to label my dataset following this data structure (https://github.com/xuuuluuu/SemEval-Triplet-data/blob/master/README.md#data-description)?
How can I modify the code on Colab for new datasets?
Any other advice?

Thank you

@chiayewken
Owner

Hi, yes, you would need to annotate the data in the same format. In the folder "aste/data/triplet_data", create a folder called "new_data" and put train.txt, dev.txt and test.txt inside. Then specify the new dataset for training by changing line 11 in aste/main.sh to "--names new_data, " and line 12 to "--seeds 0, ".
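
The folder setup described above could be scripted roughly as follows. This is only a sketch: the function name `install_dataset` and its parameters are hypothetical, and it assumes you already have annotated train.txt, dev.txt and test.txt files in some local directory. Only the "aste/data/triplet_data" path and the three file names come from the reply above.

```python
import shutil
from pathlib import Path


def install_dataset(repo_root, source_dir, name="new_data"):
    """Copy train/dev/test files into aste/data/triplet_data/<name>.

    `repo_root` is assumed to be the root of a local Span-ASTE checkout and
    `source_dir` a folder holding the three annotated split files.
    """
    target = Path(repo_root) / "aste" / "data" / "triplet_data" / name
    target.mkdir(parents=True, exist_ok=True)
    for split in ("train.txt", "dev.txt", "test.txt"):
        src = Path(source_dir) / split
        if not src.is_file():
            # Fail early rather than start training with a missing split.
            raise FileNotFoundError(f"missing split file: {src}")
        shutil.copy(src, target / split)
    return target
```

The script only prepares the data folder; lines 11 and 12 of aste/main.sh still have to be edited by hand as described.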

@Jurys22
Author

Jurys22 commented Jan 4, 2022

Thank you!
I was reading a closed issue about the data format, and I am wondering:

1 - Has the data format changed?
From:
Exactly=O as=O posted=O plus=O a=O great=O value=T-POS .=O####Exactly=O as=O posted=O plus=O a=O great=S value=O .=O####[([6], [5], 'POS')]
To:
Exactly as posted plus a great value . [([6], [5], 'POS')]

2 - Looking at the data generated in the Colab (Span-Aste/aste/data/triplet_data/14lap), I see that train, test and dev have a similar structure:
Train
Not even safe mode boots .####Not=O even=O safe=T-NEG mode=T-NEG boots=O .=O####Not=S even=O safe=O mode=O boots=O .=O####[([2, 3], [0], 'NEG')]

Test
A lot of features and shortcuts on the MBP that I was never exposed to on a normal PC .####A=O lot=O of=O features=T-NEU and=O shortcuts=TT-NEU on=O the=O MBP=O that=O I=O was=O never=O exposed=O to=O on=O a=O normal=O PC=O .=O####A=O lot=S of=S features=O and=O shortcuts=O on=O the=O MBP=O that=O I=O was=O never=O exposed=O to=O on=O a=O normal=O PC=O .=O####[([3], [1, 2], 'NEU'), ([5], [1, 2], 'NEU')]

Eval
It was slow , locked up , and also had hardware replaced after only 2 months !####It=O was=O slow=O ,=O locked=O up=O ,=O and=O also=O had=O hardware=T-NEG replaced=O after=O only=O 2=O months=O !=O####It=O was=O slow=O ,=O locked=O up=O ,=O and=O also=O had=O hardware=O replaced=S after=O only=O 2=O months=O !=O####[([10], [11], 'NEG')]

Do I then need to label all three sets manually for the first tests on my dataset?
If yes, once I am sure that it works on my type of dataset, should the final data format be something like the following? (I use the same sentence in both examples, but of course the sentences will differ in the real scenario.)

Train:
Exactly as posted plus a great value . [([6], [5], 'POS')]

Test and Dev:
Exactly as posted plus a great value .

Thank you

@chiayewken
Owner

Hi, the data format that the training script needs is the same as in Span-ASTE/aste/data/triplet_data/14lap/train.txt, as in the sample below. The train, dev and test samples all have the same format.

I charge it at night and skip taking the cord with me because of the good battery life .####I=O charge=O it=O at=O night=O and=O skip=O taking=O the=O cord=O with=O me=O because=O of=O the=O good=O battery=T-POS life=T-POS .=O####I=O charge=O it=O at=O night=O and=O skip=O taking=O the=O cord=O with=O me=O because=O of=O the=O good=S battery=O life=O .=O####[([16, 17], [15], 'POS')]
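
To check annotations in this format, a line can be split on "####" and the index lists mapped back to words. The following is a minimal sketch, not code from the repository: `parse_line` is a hypothetical helper, and it assumes the sentence is whitespace-tokenized and that the last field is a Python-style list literal, as in the samples in this thread.

```python
import ast


def parse_line(line):
    """Return (tokens, triplets) for one 'sentence####...####[triplets]' line.

    Each triplet is returned as (target text, opinion text, polarity).
    The two tag fields in the middle are ignored.
    """
    parts = line.rstrip("\n").split("####")
    tokens = parts[0].split()
    # The last field looks like [([2, 3], [0], 'NEG')]; literal_eval parses
    # it without executing arbitrary code.
    raw_triplets = ast.literal_eval(parts[-1])
    triplets = []
    for target_idx, opinion_idx, polarity in raw_triplets:
        target = " ".join(tokens[i] for i in target_idx)
        opinion = " ".join(tokens[i] for i in opinion_idx)
        triplets.append((target, opinion, polarity))
    return tokens, triplets


sample = (
    "Not even safe mode boots ."
    "####Not=O even=O safe=T-NEG mode=T-NEG boots=O .=O"
    "####Not=S even=O safe=O mode=O boots=O .=O"
    "####[([2, 3], [0], 'NEG')]"
)
print(parse_line(sample)[1])  # [('safe mode', 'Not', 'NEG')]
```

Printing the recovered target and opinion words is a quick way to spot off-by-one index errors in hand-labeled data, since the indices are zero-based token positions.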

@chiayewken
Owner

Hi, to make it more convenient to apply to new datasets, you can omit the tags component of the annotation, and include just the sentence and triplet information, such as the sample below. Each line in the train, dev and test set can have the same format.

I charge it at night and skip taking the cord with me because of the good battery life .#### #### ####[([16, 17], [15], 'POS')]
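
Lines in this shortened form could be generated from your own annotations with something like the sketch below. `format_line` is a hypothetical helper, not part of the repository; it assumes whitespace-separated tokens and zero-based token indices, and it mirrors the sample above with the two tag fields left as a single space.

```python
def format_line(tokens, triplets):
    """Build 'sentence#### #### ####[triplets]' from tokens and index triplets.

    `triplets` is an iterable of (target indices, opinion indices, polarity).
    """
    normalized = []
    for target_idx, opinion_idx, polarity in triplets:
        for i in list(target_idx) + list(opinion_idx):
            if not 0 <= i < len(tokens):
                raise IndexError(f"token index {i} out of range")
        normalized.append((list(target_idx), list(opinion_idx), str(polarity)))
    sentence = " ".join(tokens)
    # repr() of a list of (list, list, str) tuples matches the bracketed
    # triplet notation used in the data files.
    return f"{sentence}#### #### ####{normalized!r}"


tokens = ("I charge it at night and skip taking the cord with me "
          "because of the good battery life .").split()
print(format_line(tokens, [([16, 17], [15], "POS")]))
```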

@Jurys22 Jurys22 closed this as completed Jan 19, 2022