
Closes #252 #432

Merged
galtay merged 6 commits into bigscience-workshop:master from MFreidank:ntcir_13_medweb
Apr 26, 2022

Conversation

@MFreidank
Contributor

@MFreidank MFreidank commented Apr 12, 2022

This PR addresses issue #252.

This is a local dataset, data can be obtained by filling a simple form here (selecting NTCIR-13): http://www.nii.ac.jp/dsc/idr/en/ntcir/ntcir.html

@galtay Following up on our previous discussion via discord.
As per your suggestions, I implemented a single source configuration that generates records for all languages, with a "Language" feature (capitalized for consistency with the other feature names in the source data).

For text classification, I implemented a BigBioConfig subset for each language, using this naming convention:
"ntcir_13_medweb_classification_{language_code}_bigbio_text" for the language codes "ja", "en", and "zh".

For text-to-text (translation), I implemented a BigBioConfig subset for each language pair, using this naming convention:
"ntcir_13_medweb_translation_{language_code}_{target_language_code}_bigbio_t2t"
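Putting the two conventions together, the full set of builder config names could be generated with a small loop. This is a hypothetical sketch of the naming scheme only (the `build_config_names` helper is illustrative and not part of the PR; the actual dataloader would build `BigBioConfig` objects rather than bare strings):

```python
# Illustrative sketch: enumerate the config names described above.
# Only the naming scheme comes from this PR; the helper itself is hypothetical.
_LANGUAGES = ["ja", "en", "zh"]


def build_config_names():
    # one source config covering all languages
    names = ["ntcir_13_medweb_source"]
    for lang in _LANGUAGES:
        # one text-classification subset per language
        names.append(f"ntcir_13_medweb_classification_{lang}_bigbio_text")
        # one translation subset per ordered language pair
        for target in _LANGUAGES:
            if target != lang:
                names.append(
                    f"ntcir_13_medweb_translation_{lang}_{target}_bigbio_t2t"
                )
    return names
```

This yields one source config, three classification configs, and six translation configs, matching the ten names the loader reports as available.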

Please let me know if I misunderstood anything or if any additional changes are required.
As you had correctly guessed, I'm currently unable to run unit tests locally due to the naming above.
Can you guide me on the best way to run tests for this dataset?

Another open question is around dependencies: the NTCIR source data ships as Excel files (.xlsx).
To load them using pandas.read_excel, I needed to install the package openpyxl.
It would probably be desirable to keep this dependency optional: check whether the package is installed and bail out with a readable error message if it is not. Do you agree? Are there any examples of how this is best handled in datasets?
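The optional-dependency check described above could look something like the following sketch. The helper names (`require_package`, `read_medweb_excel`) are hypothetical, not part of this PR or the `datasets` library; the pattern of probing with `importlib.util.find_spec` before calling `pandas.read_excel` is the assumption being illustrated:

```python
import importlib.util


def require_package(name: str, purpose: str) -> None:
    """Fail fast with a readable message if an optional dependency is missing."""
    if importlib.util.find_spec(name) is None:
        raise ImportError(
            f"{purpose} requires the optional package `{name}`. "
            f"Install it with `pip install {name}` and retry."
        )


def read_medweb_excel(path: str):
    # pandas delegates .xlsx parsing to openpyxl, so check up front to get
    # a readable error instead of a deep ImportError from inside pandas.
    require_package("openpyxl", "Loading the NTCIR-13 MedWeb .xlsx files")
    import pandas as pd  # imported lazily so the check runs first

    return pd.read_excel(path, engine="openpyxl")
```

With this in place, a missing openpyxl surfaces as a one-line install instruction rather than a traceback from inside the Excel reader.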

Checklist

  • Confirm that this PR is linked to the dataset issue.
  • Create the dataloader script biodatasets/my_dataset/my_dataset.py (please use only lowercase and underscore for dataset naming).
  • Provide values for the _CITATION, _DATASETNAME, _DESCRIPTION, _HOMEPAGE, _LICENSE, _URLs, _SUPPORTED_TASKS, _SOURCE_VERSION, and _BIGBIO_VERSION variables.
  • Implement _info(), _split_generators() and _generate_examples() in dataloader script.
  • Make sure that the BUILDER_CONFIGS class attribute is a list with at least one BigBioConfig for the source schema and one for a bigbio schema.
  • Confirm dataloader script works with datasets.load_dataset function.
  • Confirm that your dataloader script passes the test suite run with python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py.
  • If my dataset is local, I have provided an output of the unit-tests in the PR (please copy paste). This is OPTIONAL for public datasets, as we can test these without access to the data files.

@ruisi-su
Collaborator

ruisi-su commented Apr 15, 2022

@MFreidank Thank you for helping with this dataset! I can confirm the unit test is failing on this dataset. I tried running tests.test_bigbio on a subset of this dataset, and it returned the error below. I will discuss this with the other collaborators, and we will let you know shortly.

Downloading and preparing dataset ntcir13_med_web_dataset/ntcir_13_medweb_source to /Users/rosaline17/.cache/huggingface/datasets/ntcir13_med_web_dataset/ntcir_13_medweb_source-658288ee2aa81ab7/1.0.0/8e83c529c04b8b5316340dc87ad56ca8c16fe4b0ad1e2372a4bd2a076935a963...
0 examples [00:00, ? examples/s][]
E
======================================================================
ERROR: runTest (__main__.TestDataLoader)
Run all tests that check:
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/rosaline17/biomedical/tests/test_bigbio.py", line 180, in setUp
    use_auth_token=self.USE_AUTH_TOKEN,
  File "/Users/rosaline17/anaconda3/envs/py37/lib/python3.7/site-packages/datasets-1.17.1.dev0-py3.7.egg/datasets/load.py", line 1699, in load_dataset
    use_auth_token=use_auth_token,
  File "/Users/rosaline17/anaconda3/envs/py37/lib/python3.7/site-packages/datasets-1.17.1.dev0-py3.7.egg/datasets/builder.py", line 596, in download_and_prepare
    dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
  File "/Users/rosaline17/anaconda3/envs/py37/lib/python3.7/site-packages/datasets-1.17.1.dev0-py3.7.egg/datasets/builder.py", line 684, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/Users/rosaline17/anaconda3/envs/py37/lib/python3.7/site-packages/datasets-1.17.1.dev0-py3.7.egg/datasets/builder.py", line 1081, in _prepare_split
    disable=bool(logging.get_verbosity() == logging.NOTSET),
  File "/Users/rosaline17/anaconda3/envs/py37/lib/python3.7/site-packages/tqdm/std.py", line 1180, in __iter__
    for obj in iterable:
  File "/Users/rosaline17/.cache/huggingface/modules/datasets_modules/datasets/ntcir_13_medweb/8e83c529c04b8b5316340dc87ad56ca8c16fe4b0ad1e2372a4bd2a076935a963/ntcir_13_medweb.py", line 285, in _generate_examples
    df = pd.concat(dataframes)
  File "/Users/rosaline17/anaconda3/envs/py37/lib/python3.7/site-packages/pandas/core/reshape/concat.py", line 295, in concat
    sort=sort,
  File "/Users/rosaline17/anaconda3/envs/py37/lib/python3.7/site-packages/pandas/core/reshape/concat.py", line 342, in __init__
    raise ValueError("No objects to concatenate")
ValueError: No objects to concatenate

----------------------------------------------------------------------
Ran 1 test in 0.182s

Steps to replicate:

> python
>>> from datasets import load_dataset
>>> data = load_dataset("biodatasets/ntcir_13_medweb/ntcir_13_medweb.py", name="ntcir_13_medweb_translation_en_zh_bigbio_t2t", data_dir="<local_dir>/MedWeb_TestCollection")
Using custom data configuration ntcir_13_medweb_translation_en_zh_bigbio_t2t-19d9a2aea40d9ba4
Reusing dataset ntcir13_med_web_dataset (<LOCAL_CACHE_DIR>/huggingface/datasets/ntcir13_med_web_dataset/ntcir_13_medweb_translation_en_zh_bigbio_t2t-19d9a2aea40d9ba4/1.0.0/8e83c529c04b8b5316340dc87ad56ca8c16fe4b0ad1e2372a4bd2a076935a963)
> python -m tests.test_bigbio biodatasets/ntcir_13_medweb/ntcir_13_medweb.py --schema T2T --data_dir <LOCAL_CACHE_DIR>/huggingface/datasets/ntcir13_med_web_dataset/ntcir_13_medweb_translation_en_zh_bigbio_t2t-19d9a2aea40d9ba4/1.0.0/8e83c529c04b8b5316340dc87ad56ca8c16fe4b0ad1e2372a4bd2a076935a963

MFreidank added 5 commits April 21, 2022 13:22
The NTCIR-13 MedWeb (Medical Natural Language Processing for Web Document) task is a
multi-label classification task in which labels for eight diseases/symptoms must
be assigned to each tweet. Given pseudo-tweets, the output is a Positive (p) or
Negative (n) label for each of the eight diseases/symptoms. The achievements of this
task can almost directly be applied to a fundamental engine for actual applications.

This task provides pseudo-Twitter messages in a cross-language, multi-label corpus
covering three languages (Japanese, English, and Chinese), annotated with eight
labels: influenza, diarrhea/stomachache, hay fever, cough/sore throat, headache,
fever, runny nose, and cold. Additionally, the cross-language corpus can
also be used for translation between Japanese, English, and Chinese.

For more information, see:
http://research.nii.ac.jp/ntcir/permission/ntcir-13/perm-en-MedWeb.html
@MFreidank
Contributor Author

Hi @ruisi-su
Thank you for helping with testing.
I can reproduce your error if I don't append the suffix /MedWeb_TestCollection to the path passed as the data_dir argument.

With this suffix added I got this error instead:

ValueError: BuilderConfig ntcir_13_medweb_bigbio_t2t not found. Available: ['ntcir_13_medweb_source', 'ntcir_13_medweb_classification_ja_bigbio_text', 'ntcir_13_medweb_translation_ja_en_bigbio_t2t', 'ntcir_13_medweb_translation_ja_zh_bigbio_t2t', 'ntcir_13_medweb_classification_en_bigbio_text', 'ntcir_13_medweb_translation_en_ja_bigbio_t2t', 'ntcir_13_medweb_translation_en_zh_bigbio_t2t', 'ntcir_13_medweb_classification_zh_bigbio_text', 'ntcir_13_medweb_translation_zh_ja_bigbio_t2t', 'ntcir_13_medweb_translation_zh_en_bigbio_t2t']

This happens because I have only one source configuration but multiple bigbio configurations.
However, in my prior discussion on Discord, @galtay recommended this approach and mentioned that we could figure out how to do testing later, as the tests would likely fail.

I believe this fix to the test interface (once merged) would likely allow us to test these configurations individually.

@galtay Can you confirm if I'm understanding this correctly?

@sunnnymskang
Collaborator

@galtay

@ruisi-su ruisi-su added the "Translation Task", "Span / Sentence Classification Task", and "local dataset" (dataset requires local files to run) labels Apr 21, 2022
@galtay
Collaborator

galtay commented Apr 22, 2022

Hi everyone, this dataset sits right in the middle of a few of the failings of our test suite :)
For the moment, can we try running this alternate test file over each config name?
https://github.com/bigscience-workshop/biomedical/blob/master/tests/test_bigbio_by_name.py

@MFreidank, perhaps you can write a script that loops over all the config names and calls the above test script for each? Eventually we'll have a robust enough single test script to handle cases like this.
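The suggested loop over config names could be sketched as follows. Note this is an assumption-laden sketch: the flag names (`--config_name`, `--data_dir`) are guesses, so check `tests/test_bigbio_by_name.py --help` for the script's actual interface before relying on it:

```python
# Hypothetical wrapper that runs the per-config test script for every config
# name and collects the failures. The command-line flags are assumptions;
# consult the real test script's --help output for the actual interface.
import subprocess
import sys


def build_test_cmd(dataset_path: str, config_name: str, data_dir: str) -> list:
    return [
        sys.executable, "-m", "tests.test_bigbio_by_name",
        dataset_path, "--config_name", config_name, "--data_dir", data_dir,
    ]


def run_all_configs(dataset_path, config_names, data_dir, runner=subprocess.run):
    """Run the per-config test for every config; return the names that failed."""
    failures = []
    for name in config_names:
        result = runner(build_test_cmd(dataset_path, name, data_dir))
        if result.returncode != 0:
            failures.append(name)
    return failures
```

The injectable `runner` argument keeps the loop testable without actually spawning the (data-dependent) test suite.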

galtay previously approved these changes Apr 22, 2022
Comment thread biodatasets/ntcir_13_medweb/ntcir_13_medweb.py
@MFreidank
Contributor Author

MFreidank commented Apr 25, 2022

Hi @galtay

Hi everyone, this dataset sits right in the middle of a few of the failings of our test suite :) For the moment, can we try running this alternate test file over each config name? https://github.com/bigscience-workshop/biomedical/blob/master/tests/test_bigbio_by_name.py

@MFreidank, perhaps you can write a script that loops over all the config names and calls the above test script for each? Eventually we'll have a robust enough single test script to handle cases like this.

I've been able to confirm that, when looping over all config names with the script above, all tests pass.

Will integrate your suggested changes to how data files are managed and unzipped and document the files in the description as well.

Do you believe that would be sufficient to merge this PR?

Update: @galtay I have made the changes and added commits; please have a look and check whether they look good.
I was following the guide in CONTRIBUTING.md regarding rebasing, but it seems my local branch and my fork diverged somehow in the process. I force-pushed (with lease) to avoid an unnecessary merge; let me know if the current state looks okay or if I need to reopen this PR.

@MFreidank MFreidank requested a review from galtay April 25, 2022 07:37
@sunnnymskang sunnnymskang removed their assignment Apr 26, 2022
@galtay galtay self-assigned this Apr 26, 2022
Collaborator

@galtay galtay left a comment


Thanks for all the changes. LGTM 🎉

@galtay galtay merged commit 866dc38 into bigscience-workshop:master Apr 26, 2022