Closes #252 #432
galtay merged 6 commits into bigscience-workshop:master from MFreidank:ntcir_13_medweb
Conversation
@MFreidank Thank you for helping with this dataset! I can confirm the unit test is failing on this dataset. When I tried running it, I got:
Downloading and preparing dataset ntcir13_med_web_dataset/ntcir_13_medweb_source to /Users/rosaline17/.cache/huggingface/datasets/ntcir13_med_web_dataset/ntcir_13_medweb_source-658288ee2aa81ab7/1.0.0/8e83c529c04b8b5316340dc87ad56ca8c16fe4b0ad1e2372a4bd2a076935a963...
0 examples [00:00, ? examples/s][]
E
======================================================================
ERROR: runTest (__main__.TestDataLoader)
Run all tests that check:
----------------------------------------------------------------------
Traceback (most recent call last):
File "/Users/rosaline17/biomedical/tests/test_bigbio.py", line 180, in setUp
use_auth_token=self.USE_AUTH_TOKEN,
File "/Users/rosaline17/anaconda3/envs/py37/lib/python3.7/site-packages/datasets-1.17.1.dev0-py3.7.egg/datasets/load.py", line 1699, in load_dataset
use_auth_token=use_auth_token,
File "/Users/rosaline17/anaconda3/envs/py37/lib/python3.7/site-packages/datasets-1.17.1.dev0-py3.7.egg/datasets/builder.py", line 596, in download_and_prepare
dl_manager=dl_manager, verify_infos=verify_infos, **download_and_prepare_kwargs
File "/Users/rosaline17/anaconda3/envs/py37/lib/python3.7/site-packages/datasets-1.17.1.dev0-py3.7.egg/datasets/builder.py", line 684, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "/Users/rosaline17/anaconda3/envs/py37/lib/python3.7/site-packages/datasets-1.17.1.dev0-py3.7.egg/datasets/builder.py", line 1081, in _prepare_split
disable=bool(logging.get_verbosity() == logging.NOTSET),
File "/Users/rosaline17/anaconda3/envs/py37/lib/python3.7/site-packages/tqdm/std.py", line 1180, in __iter__
for obj in iterable:
File "/Users/rosaline17/.cache/huggingface/modules/datasets_modules/datasets/ntcir_13_medweb/8e83c529c04b8b5316340dc87ad56ca8c16fe4b0ad1e2372a4bd2a076935a963/ntcir_13_medweb.py", line 285, in _generate_examples
df = pd.concat(dataframes)
File "/Users/rosaline17/anaconda3/envs/py37/lib/python3.7/site-packages/pandas/core/reshape/concat.py", line 295, in concat
sort=sort,
File "/Users/rosaline17/anaconda3/envs/py37/lib/python3.7/site-packages/pandas/core/reshape/concat.py", line 342, in __init__
raise ValueError("No objects to concatenate")
ValueError: No objects to concatenate
----------------------------------------------------------------------
Ran 1 test in 0.182s

Steps to replicate:

> python
>>> from datasets import load_dataset
>>> data = load_dataset("biodatasets/ntcir_13_medweb/ntcir_13_medweb.py", name="ntcir_13_medweb_translation_en_zh_bigbio_t2t", data_dir="<local_dir>/MedWeb_TestCollection")
Using custom data configuration ntcir_13_medweb_translation_en_zh_bigbio_t2t-19d9a2aea40d9ba4
Reusing dataset ntcir13_med_web_dataset (<LOCAL_CACHE_DIR>/huggingface/datasets/ntcir13_med_web_dataset/ntcir_13_medweb_translation_en_zh_bigbio_t2t-19d9a2aea40d9ba4/1.0.0/8e83c529c04b8b5316340dc87ad56ca8c16fe4b0ad1e2372a4bd2a076935a963)
> python -m tests.test_bigbio biodatasets/ntcir_13_medweb/ntcir_13_medweb.py --schema T2T --data_dir <LOCAL_CACHE_DIR>/huggingface/datasets/ntcir13_med_web_dataset/ntcir_13_medweb_translation_en_zh_bigbio_t2t-19d9a2aea40d9ba4/1.0.0/8e83c529c04b8b5316340dc87ad56ca8c16fe4b0ad1e2372a4bd2a076935a963
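The `ValueError: No objects to concatenate` above comes from calling `pd.concat` on an empty list, i.e. no data files matched for the selected configuration. A minimal sketch of a guard that would surface a readable error instead (the function name and message here are assumptions, not the dataloader's actual code):

```python
import pandas as pd

def concat_file_frames(dataframes, data_dir):
    # Guard: pd.concat raises an opaque "No objects to concatenate"
    # ValueError on an empty list, so fail with a readable hint instead.
    if not dataframes:
        raise FileNotFoundError(
            f"No data files matched under {data_dir!r}; expected the "
            "unzipped MedWeb_TestCollection .xlsx files."
        )
    return pd.concat(dataframes, ignore_index=True)
```

With this guard, an empty match list fails immediately with a hint about `data_dir`, while a non-empty list concatenates as before.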
The NTCIR-13 MedWeb (Medical Natural Language Processing for Web Documents) task is a multi-label classification task: labels for eight diseases/symptoms must be assigned to each tweet. Given pseudo-tweets, the output is a Positive (p) or Negative (n) label for each of the eight diseases/symptoms. The results of this task can be applied almost directly as a fundamental engine for real applications. The task provides pseudo-Twitter messages as a cross-language, multi-label corpus covering three languages (Japanese, English, and Chinese), annotated with eight labels: influenza, diarrhea/stomachache, hay fever, cough/sore throat, headache, fever, runny nose, and cold. Additionally, the cross-language corpus can be used for translation between Japanese, English, and Chinese. For more information, see: http://research.nii.ac.jp/ntcir/permission/ntcir-13/perm-en-MedWeb.html
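As a concrete illustration of the labeling scheme, each pseudo-tweet carries one p/n flag per symptom. A hypothetical record and a helper to extract its positive labels might look like this (the column names and the example record are assumptions for illustration, not the corpus's exact headers or data):

```python
# Assumed symptom column names; the actual corpus headers may differ.
SYMPTOMS = [
    "influenza", "diarrhea", "hay_fever", "cough",
    "headache", "fever", "runny_nose", "cold",
]

def positive_symptoms(record):
    """Return the symptoms flagged Positive ('p') in one pseudo-tweet record."""
    return [s for s in SYMPTOMS if record.get(s) == "p"]

# Hypothetical pseudo-tweet record (not real corpus data):
tweet = {
    "text": "I can't stop sneezing today...",
    "influenza": "n", "hay_fever": "p", "runny_nose": "p",
}
```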
Hi @ruisi-su With this suffix added I got this error instead: […] This happens because I have only one […]. I believe this fix to the test interface (once merged) would likely allow us to test these configurations individually. @galtay Can you confirm that I'm understanding this correctly?
Hi everyone, this dataset sits right in the middle of a few of the failings of our test suite :) @MFreidank, perhaps you can write a script that loops over all the config names and calls the above test script for each? Eventually we'll have a robust enough single test script to handle cases like this.
Hi @galtay
I've been able to confirm that when looping over all config names with the script above, all tests pass. I will integrate your suggested changes to how data files are managed and unzipped, and document the files in the description as well. Do you believe that would be sufficient to merge this PR? Update: @galtay I have made the changes and added commits; please have a look and let me know whether they look good.
galtay
left a comment
Thanks for all the changes. LGTM 🎉
This PR addresses issue #252.
This is a local dataset; the data can be obtained by filling out a simple form (selecting NTCIR-13) here: http://www.nii.ac.jp/dsc/idr/en/ntcir/ntcir.html
@galtay Following up on our previous discussion via Discord.
As per your suggestions, I implemented a single `source` configuration with a `"Language"` feature (casing chosen for compatibility with the other features in the source data) that generates records for all languages.
For text classification, I implemented a `BigBioConfig` subset for each language, using the naming convention `"ntcir_13_medweb_classification_{language_code}_bigbio_text"` for language codes `"ja"`, `"en"`, `"zh"`.
For text-to-text (translation), I implemented a `BigBioConfig` subset for each language pair, using the naming convention `"ntcir_13_medweb_translation_{language_code}_{target_language_code}_source"`.
Please let me know if I misunderstood anything or if any additional changes are required.
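The naming conventions described above expand mechanically from the three language codes; a quick sketch, assuming all six ordered language pairs exist as configs:

```python
from itertools import permutations

LANGS = ["ja", "en", "zh"]

# One classification config per language.
classification_configs = [
    f"ntcir_13_medweb_classification_{lc}_bigbio_text" for lc in LANGS
]
# One translation config per ordered (source, target) language pair.
translation_configs = [
    f"ntcir_13_medweb_translation_{src}_{tgt}_source"
    for src, tgt in permutations(LANGS, 2)
]
```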
As you had correctly guessed, I'm currently unable to run unit tests locally, due to the naming above.
Can you guide me on the best way to run tests for this dataset?
Another open question is around dependencies. NTCIR source data ships as Excel files (`.xlsx`). To load them using `pandas.read_excel`, I needed to install the package `openpyxl`. It would probably be desirable to keep this dependency optional: check whether the package is installed and bail with a readable error message if it is not. Do you agree? Are there any examples of how this is best handled in `datasets`?

Checkbox
- Dataloader script at `biodatasets/my_dataset/my_dataset.py` (please use only lowercase and underscore for dataset naming).
- `_CITATION`, `_DATASETNAME`, `_DESCRIPTION`, `_HOMEPAGE`, `_LICENSE`, `_URLs`, `_SUPPORTED_TASKS`, `_SOURCE_VERSION`, and `_BIGBIO_VERSION` variables.
- `_info()`, `_split_generators()` and `_generate_examples()` in the dataloader script.
- `BUILDER_CONFIGS` class attribute is a list with at least one `BigBioConfig` for the source schema and one for a bigbio schema.
- Dataloader works with the `datasets.load_dataset` function.
- Unit tests pass: `python -m tests.test_bigbio biodatasets/my_dataset/my_dataset.py`.
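The optional-dependency question raised above (bail with a readable message when `openpyxl` is absent, instead of a deep `ImportError` from inside pandas) could be sketched as follows; the helper name and message are assumptions, and `datasets` may have its own convention for this:

```python
import importlib.util

def require_package(package, hint=""):
    # Fail early with a readable message when an optional dependency
    # is missing, rather than letting pandas raise deep in read_excel.
    if importlib.util.find_spec(package) is None:
        raise ImportError(f"Missing optional dependency {package!r}. {hint}".strip())

# Before reading the .xlsx files, one could call:
# require_package("openpyxl", "Install it with: pip install openpyxl")
```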