
Closes #252 #432

Merged
galtay merged 6 commits into bigscience-workshop:master from MFreidank:ntcir_13_medweb
Apr 26, 2022

Conversation

@MFreidank
Contributor

@MFreidank MFreidank commented Apr 12, 2022

This PR addresses issue #252.

This is a local dataset, data can be obtained by filling a simple form here (selecting NTCIR-13): http://www.nii.ac.jp/dsc/idr/en/ntcir/ntcir.html

@galtay Following up on our previous discussion via discord.
As per your suggestions, I implemented a single source configuration that generates records for all languages, with a "Language" feature (capitalized for consistency with the other feature names in the source data).

For text classification, I implemented a BigBioConfig subset for each language, using this naming convention:
"ntcir_13_medweb_classification_{language_code}_bigbio_text" for the language codes "ja", "en", and "zh".

For text-to-text (translation), I implemented a BigBioConfig subset for each language pair, using this naming convention:
"ntcir_13_medweb_translation_{language_code}_{target_language_code}_bigbio_t2t"
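Putting the two conventions together, the full set of builder config names could be generated with a small loop. This is a hypothetical sketch of the naming scheme only (the `build_config_names` helper is illustrative and not part of the PR; the actual dataloader would build `BigBioConfig` objects rather than bare strings):

```python
# Illustrative sketch: enumerate the config names described above.
# Only the naming scheme comes from this PR; the helper itself is hypothetical.
_LANGUAGES = ["ja", "en", "zh"]


def build_config_names():
    # one source config covering all languages
    names = ["ntcir_13_medweb_source"]
    for lang in _LANGUAGES:
        # one text-classification subset per language
        names.append(f"ntcir_13_medweb_classification_{lang}_bigbio_text")
        # one translation subset per ordered language pair
        for target in _LANGUAGES:
            if target != lang:
                names.append(
                    f"ntcir_13_medweb_translation_{lang}_{target}_bigbio_t2t"
                )
    return names
```

This yields one source config, three classification configs, and six translation configs, matching the ten names the loader reports as available.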

Please let me know if I misunderstood anything or if any additional changes are required.
As you had correctly guessed, I'm currently unable to run unit tests locally due to the naming above.
Can you guide me on the best way to run tests for this dataset?

Another open question is around dependencies: the NTCIR source data ships as Excel files (.xlsx).
To load them using pandas.read_excel, I needed to install the package openpyxl.
It would probably be desirable to keep this dependency optional: check whether the package is installed and bail out with a readable error message if it is not. Do you agree? Are there any examples of how this is best handled in datasets?
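The optional-dependency check described above could look something like the following sketch. The helper names (`require_package`, `read_medweb_excel`) are hypothetical, not part of this PR or the `datasets` library; the pattern of probing with `importlib.util.find_spec` before calling `pandas.read_excel` is the assumption being illustrated:

```python
import importlib.util


def require_package(name: str, purpose: str) -> None:
    """Fail fast with a readable message if an optional dependency is missing."""
    if importlib.util.find_spec(name) is None:
        raise ImportError(
            f"{purpose} requires the optional package `{name}`. "
            f"Install it with `pip install {name}` and retry."
        )


def read_medweb_excel(path: str):
    # pandas delegates .xlsx parsing to openpyxl, so check up front to get
    # a readable error instead of a deep ImportError from inside pandas.
    require_package("openpyxl", "Loading the NTCIR-13 MedWeb .xlsx files")
    import pandas as pd  # imported lazily so the check runs first

    return pd.read_excel(path, engine="openpyxl")
```

With this in place, a missing openpyxl surfaces as a one-line install instruction rather than a traceback from inside the Excel reader.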

Checklist

  • Confirm that this PR is linked to the dataset issue.
  • Create the dataloader script biodatasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
  • Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _BIGBIO_VERSION variables.
  • Implement _info(), _split_generators() and _generate_examples() in dataloader script.
  • Make sure that the BUILDER_CONFIGS class attribute is a list with at least one BigBioConfig for the source schema and one for a bigbio schema.
  • Confirm dataloader script works with datasets.load_dataset function.
  • Confirm that your dataloader script passes the test suite run with python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py.
  • If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.

@ruisi-su
Collaborator

ruisi-su commented Apr 15, 2022

@MFreidank Thank you for helping with this dataset! I can confirm the unit test is failing on this dataset. I tried running tests.test_bigbio on a subset of this dataset, and it returned the error below. I will discuss this with the other collaborators, and we will let you know shortly.

Downloading and preparing dataset ntcir13_med_web_dataset/ntcir_13_medweb_source to /Users/rosaline17/.cache/huggingface/datasets/ntcir13_med_web_dataset/ntcir_13_medweb_source-658288ee2aa81ab7/1.0.0/8e83c529c04b8b5316340dc87ad56ca8c16fe4b0ad1e2372a4bd2a076935a963...
0 examples [00:00, ? examples/s][]
E
======================================================================
ERROR: runTest (__main__.TestDataLoader)
Run all tests that check:
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/rosaline17/biomedical/tests/test_bigbio.py", line 180, in setUp
    use_auth_token=self.USE_AUTH_TOKEN,
  File "/Users/rosaline17/anaconda3/envs/py37/lib/python3.7/site-packages/datasets-1.17.1.dev0-py3.7.egg/datasets/load.py", line 1699, in load_dataset
    use_auth_token=use_auth_token,
  File "/Users/rosaline17/anaconda3/envs/py37/lib/python3.7/site-packages/datasets-1.17.1.dev0-py3.7.egg/datasets/builder.py", line 596, in download_and_prepare
    dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
  File "/Users/rosaline17/anaconda3/envs/py37/lib/python3.7/site-packages/datasets-1.17.1.dev0-py3.7.egg/datasets/builder.py", line 684, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/Users/rosaline17/anaconda3/envs/py37/lib/python3.7/site-packages/datasets-1.17.1.dev0-py3.7.egg/datasets/builder.py", line 1081, in _prepare_split
    disable=bool(logging.get_verbosity() == logging.NOTSET),
  File "/Users/rosaline17/anaconda3/envs/py37/lib/python3.7/site-packages/tqdm/std.py", line 1180, in __iter__
    for obj in iterable:
  File "/Users/rosaline17/.cache/huggingface/modules/datasets_modules/datasets/ntcir_13_medweb/8e83c529c04b8b5316340dc87ad56ca8c16fe4b0ad1e2372a4bd2a076935a963/ntcir_13_medweb.py", line 285, in _generate_examples
    df = pd.concat(dataframes)
  File "/Users/rosaline17/anaconda3/envs/py37/lib/python3.7/site-packages/pandas/core/reshape/concat.py", line 295, in concat
    sort=sort,
  File "/Users/rosaline17/anaconda3/envs/py37/lib/python3.7/site-packages/pandas/core/reshape/concat.py", line 342, in __init__
    raise ValueError("No objects to concatenate")
ValueError: No objects to concatenate

----------------------------------------------------------------------
Ran 1 test in 0.182s

Steps to replicate:

> python
>>> from datasets import load_dataset
>>> data = load_dataset("biodatasets/ntcir_13_medweb/ntcir_13_medweb.py", name="ntcir_13_medweb_translation_en_zh_bigbio_t2t", data_dir="<local_dir>/MedWeb_TestCollection")
Using custom data configuration ntcir_13_medweb_translation_en_zh_bigbio_t2t-19d9a2aea40d9ba4
Reusing dataset ntcir13_med_web_dataset (<LOCAL_CACHE_DIR>/huggingface/datasets/ntcir13_med_web_dataset/ntcir_13_medweb_translation_en_zh_bigbio_t2t-19d9a2aea40d9ba4/1.0.0/8e83c529c04b8b5316340dc87ad56ca8c16fe4b0ad1e2372a4bd2a076935a963)
> python -m tests.test_bigbio biodatasets/ntcir_13_medweb/ntcir_13_medweb.py --schema T2T --data_dir <LOCAL_CACHE_DIR>/huggingface/datasets/ntcir13_med_web_dataset/ntcir_13_medweb_translation_en_zh_bigbio_t2t-19d9a2aea40d9ba4/1.0.0/8e83c529c04b8b5316340dc87ad56ca8c16fe4b0ad1e2372a4bd2a076935a963

MFreidank added 5 commits April 21, 2022 13:22
The NTCIR-13 MedWeb (Medical Natural Language Processing for Web Document) task is a
multi-label classification task in which labels for eight diseases/symptoms must
be assigned to each tweet. Given pseudo-tweets, the output is a Positive (p) or
Negative (n) label for each of the eight diseases/symptoms. The achievements of this
task can almost directly be applied to a fundamental engine for actual applications.

This task provides pseudo-Twitter messages in a cross-language, multi-label corpus
covering three languages (Japanese, English, and Chinese), annotated with eight
labels: influenza, diarrhea/stomachache, hay fever, cough/sore throat, headache,
fever, runny nose, and cold. Additionally, the cross-language corpus can
also be used for translation between Japanese, English, and Chinese.

For more information, see:
http://research.nii.ac.jp/ntcir/permission/ntcir-13/perm-en-MedWeb.html
@MFreidank
Contributor Author

Hi @ruisi-su
Thank you for helping with testing.
I can reproduce your error if I don't append the suffix /MedWeb_TestCollection to the path passed as the data_dir argument.

With this suffix added I got this error instead:

ValueError: BuilderConfig ntcir_13_medweb_bigbio_t2t not found. Available: ['ntcir_13_medweb_source', 'ntcir_13_medweb_classification_ja_bigbio_text', 'ntcir_13_medweb_translation_ja_en_bigbio_t2t', 'ntcir_13_medweb_translation_ja_zh_bigbio_t2t', 'ntcir_13_medweb_classification_en_bigbio_text', 'ntcir_13_medweb_translation_en_ja_bigbio_t2t', 'ntcir_13_medweb_translation_en_zh_bigbio_t2t', 'ntcir_13_medweb_classification_zh_bigbio_text', 'ntcir_13_medweb_translation_zh_ja_bigbio_t2t', 'ntcir_13_medweb_translation_zh_en_bigbio_t2t']

This happens because I have only one source configuration but multiple bigbio configurations.
However, in my prior discussion on Discord, @galtay recommended this approach and mentioned that we could figure out how to do testing later, as the tests would likely fail.

I believe this fix to the test interface (once merged) would likely allow us to test these configurations individually.

@galtay Can you confirm if I'm understanding this correctly?

@sunnnymskang
Collaborator

@galtay

@ruisi-su ruisi-su added the "Translation Task", "Span / Sentence Classification Task", and "local dataset" (dataset requires local files to run) labels Apr 21, 2022
@galtay
Collaborator

galtay commented Apr 22, 2022

Hi everyone, this dataset sits right in the middle of a few of the failings of our test suite :)
For the moment, can we try running this alternate test file over each config name?
https://github.com/bigscience-workshop/biomedical/blob/master/tests/test_bigbio_by_name.py

@MFreidank, perhaps you can write a script that loops over all the config names and calls the above test script for each? Eventually we'll have a robust enough single test script to handle cases like this.
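The suggested loop over config names could be sketched as follows. Note this is an assumption-laden sketch: the flag names (`--config_name`, `--data_dir`) are guesses, so check `tests/test_bigbio_by_name.py --help` for the script's actual interface before relying on it:

```python
# Hypothetical wrapper that runs the per-config test script for every config
# name and collects the failures. The command-line flags are assumptions;
# consult the real test script's --help output for the actual interface.
import subprocess
import sys


def build_test_cmd(dataset_path: str, config_name: str, data_dir: str) -> list:
    return [
        sys.executable, "-m", "tests.test_bigbio_by_name",
        dataset_path, "--config_name", config_name, "--data_dir", data_dir,
    ]


def run_all_configs(dataset_path, config_names, data_dir, runner=subprocess.run):
    """Run the per-config test for every config; return the names that failed."""
    failures = []
    for name in config_names:
        result = runner(build_test_cmd(dataset_path, name, data_dir))
        if result.returncode != 0:
            failures.append(name)
    return failures
```

The injectable `runner` argument keeps the loop testable without actually spawning the (data-dependent) test suite.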

galtay previously approved these changes Apr 22, 2022
Comment thread biodatasets/ntcir_13_medweb/ntcir_13_medweb.py
@MFreidank
Contributor Author

MFreidank commented Apr 25, 2022

Hi @galtay

Hi everyone, this dataset sits right in the middle of a few of the failings of our test suite :) For the moment, can we try running this alternate test file over each config name? https://github.com/bigscience-workshop/biomedical/blob/master/tests/test_bigbio_by_name.py

@MFreidank, perhaps you can write a script that loops over all the config names and calls the above test script for each? Eventually we'll have a robust enough single test script to handle cases like this.

I've been able to confirm that, when looping over all config names with the script above, all tests pass.

Will integrate your suggested changes to how data files are managed and unzipped and document the files in the description as well.

Do you believe that would be sufficient to merge this PR?

Update: @galtay I have made the changes and added commits; please have a look and check whether they look good.
I was following the guide in CONTRIBUTING.md regarding rebasing, but it seems my local branch and my fork diverged somehow in the process. I force-pushed (with lease) to avoid an unnecessary merge; let me know if the current state looks okay or if I need to reopen this PR.

@MFreidank MFreidank requested a review from galtay April 25, 2022 07:37
@sunnnymskang sunnnymskang removed their assignment Apr 26, 2022
@galtay galtay self-assigned this Apr 26, 2022
Collaborator

@galtay galtay left a comment


Thanks for all the changes. LGTM 🎉

@galtay galtay merged commit 866dc38 into bigscience-workshop:master Apr 26, 2022