Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

remove language restrictions in tydiqa + add arabic prompts #763

Merged
merged 22 commits into from Oct 26, 2022

Conversation

KhalidAlt
Copy link
Contributor

@KhalidAlt KhalidAlt commented May 5, 2022

  • Remove the if statement to allow English prompts to work across the Dataset.
  • Add Arabic prompts for primary task subset with if statement to include only Arabic set.
  • Add Arabic prompts for secondary task subset.

@KhalidAlt KhalidAlt changed the title remove language restrictions remove language restrictions in tydiqa May 5, 2022
@stephenbach
Copy link
Member

Can you make this PR target the eval-hackathon branch? Thanks!

@KhalidAlt KhalidAlt changed the base branch from main to eval-hackathon May 5, 2022 20:48
@KhalidAlt KhalidAlt changed the title remove language restrictions in tydiqa remove language restrictions in tydiqa + add arabic prompts May 15, 2022
VictorSanh and others added 3 commits May 22, 2022 22:36
…orkshop#778)

* accelerate `get_infos` by caching the `DataseInfoDict`s

* quality

* consistency
@VictorSanh VictorSanh self-assigned this Jun 27, 2022
@VictorSanh
Copy link
Member

VictorSanh commented Jun 29, 2022

fixed the merged conflicts

@KhalidAlt what is template.yaml.save and template.yaml.save.1? Can we remove them?

@KhalidAlt
Copy link
Contributor Author

KhalidAlt commented Jun 30, 2022

Hi @VictorSanh, I think you can remove them. Also, I will ping @zaidalyafeai to see if he can help reviewing the Arabic prompts.

@VictorSanh
Copy link
Member

@KhalidAlt there are still errors in tests for some reason, if you can get to it before please do, I have a few things to sort out this afternoon first

KhalidAlt and others added 5 commits June 30, 2022 23:31
* fix empty documents - multi_news

* fix test - unrecognized variable
* Added languages widget to UI.

* Style fixes.

* Added English tag to existing datasets.

* Add languages to viewer mode.

* Update language codes.

* Update CONTRIBUTING.md.

* Update screenshot.

* Add "Prompt" to UI to clarify languages tag usage.
@stephenbach
Copy link
Member

Sorry, I took so long to look at this that it's fallen out of date. Can you resolve the conflicts and I can take another look?

@zaidalyafeai
Copy link
Contributor

If u need some help I can review the prompts.

@KhalidAlt
Copy link
Contributor Author

Hi @zaidalyafeai, It will be so helpful if you can review the Arabic prompts. Thanks a lot.

@KhalidAlt
Copy link
Contributor Author

KhalidAlt commented Aug 16, 2022

Hi @zaidalyafeai, I think you can review the prompts much easier now. I fixed the Unicode characters problem and it shows the Arabic characters now. Please review it if you can and let me know if anything is missing. Thanks a lot.

@KhalidAlt
Copy link
Contributor Author

KhalidAlt commented Aug 16, 2022

Hi @stephenbach, I fixed all the conflicts and a small bug that prevented Unicode characters from showing properly in .yaml files. Please let me know if there is anything need to change. Also, show_new_templates check is always failing due to large size of data.

@zaidalyafeai
Copy link
Contributor

@KhalidAlt the branch seems to include many other commits for other languages?

@KhalidAlt
Copy link
Contributor Author

@zaidalyafeai, previously, Unicode characters were saved on byte format. Therefore, I fixed the issue, and most prompts have changed from byte to Unicode. This is the reason that there are many other commits for other languages. Please let me know if anything to be changed. Thanks

@zaidalyafeai
Copy link
Contributor

Sure will review prompts today.

@samsontmr
Copy link
Member

@KhalidAlt @stephenbach Any chance we can get this merged soon? We need the filter to be removed before we can run the evaluations for the paper. Thanks!

@VictorSanh
Copy link
Member

@KhalidAlt it the tests pass locally, and @zaidalyafeai has validated, i can merge this in

@stephenbach
Copy link
Member

Whoops, I DMed @KhalidAlt. Please don’t merge yet. There’s some code duplication in templates.py that needs to be removed.

@stephenbach
Copy link
Member

Also do we like/need the allow_unicode=True argument? I’m not sure exactly what it does. Should we use it to regenerate all the .yaml at once to avoid unexpected diffs down the road?

@KhalidAlt
Copy link
Contributor Author

KhalidAlt commented Oct 26, 2022

Hi @stephenbach , I removed the code duplication. I used the argument (allow_unicode=True) because non-ASCII language prompts were saved in byte formatting which makes it difficult to review them manually using a .ymal file.

@stephenbach
Copy link
Member

There's a .app.py.swp in the changes.

@KhalidAlt
Copy link
Contributor Author

KhalidAlt commented Oct 26, 2022

Hi @stephenbach, I have deleted app.py.swp in the cleaning commit. In addition, I have reverted the changes to templates.py and regenerated the tydiqa prompts. Also, I have tested it locally and everything seems right. Please let me know if anything still missing.

@stephenbach
Copy link
Member

Oh my bad, I see now that you were removing it, not adding it. Merging now. Thanks for your patience and hard work on this!

@stephenbach
Copy link
Member

For the record, show_new_templates fails because the dataset is too big to run on GitHub and check_templates fails because of the issue that #832 addresses.

Copy link
Member

@stephenbach stephenbach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@stephenbach stephenbach merged commit ef3eb4a into bigscience-workshop:eval-hackathon Oct 26, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants