This repository has been archived by the owner on Jan 15, 2024. It is now read-only.

Test dataset issue when performing unittest with python3 #56

Closed
cgraywang opened this issue Apr 21, 2018 · 3 comments
@cgraywang
Contributor

This error caused CI to fail for this PR: #55

The detailed error message is below:

```
tests/unittest/test_datasets.py::test_men Downloading tests/data/men/MEN.tar.gz from http://clic.cimec.unitn.it/~elia.bruni/resources/MEN.tar.gz...
FAILED

=================================== FAILURES ===================================
___________________________________ test_men ___________________________________

    def test_men():
        for segment, length in [("full", 3000), ("dev", 2000), ("test", 1000)]:
            data = nlp.data.MEN(
                root=os.path.join('tests', 'data', 'men'), segment=segment)

tests/unittest/test_datasets.py:146:
gluonnlp/data/word_embedding_evaluation.py:307: in __init__
    super(MEN, self).__init__(root=root)
gluonnlp/data/word_embedding_evaluation.py:163: in __init__
    super(WordSimilarityEvaluationDataset, self).__init__(root=root)
gluonnlp/data/word_embedding_evaluation.py:120: in __init__
    self._download_data()
gluonnlp/data/word_embedding_evaluation.py:131: in _download_data
    verify=self._verify_ssl)
```
@leezu
Contributor

leezu commented Apr 22, 2018

This seems to be caused by intermittent downtime of the clic.cimec.unitn.it server.
While we are likely allowed to redistribute the MEN dataset (it is not 100% clear whether the permissive license applies to the dataset itself or only to models trained on it) and could thereby control the source server, we can't redistribute most other datasets.

Therefore I suggest not cleaning the cached datasets downloaded from external hosts between subsequent test runs. To make sure we catch URLs that go down permanently, we would have to set up a separate CI job that periodically runs the complete test suite after cleaning the cached datasets.

I don't have access to the CI configuration, but I suppose this is feasible. If there is no objection, I will implement caching of external datasets now to make sure we won't experience such intermittent test failures in the future. Then, in a follow-up, we can set up the scheduled test without the cache. For that I either need access to the CI configuration or someone (@szha ?) would have to enable it.

Please let me know if you have any alternative suggestions.
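For illustration, the caching idea could look like the following sketch. This is a hypothetical helper, not GluonNLP's actual download API: the point is simply that if the archive already sits in the cache directory from a previous CI run, the source server is never contacted, so intermittent outages cannot fail the tests.

```python
import os
import urllib.request

def cached_download(url, dest_path):
    """Download url to dest_path unless a cached copy already exists.

    Hypothetical sketch of the caching proposal: keep the downloaded
    archive between CI runs so the tests survive intermittent
    downtime of the original host.
    """
    if os.path.exists(dest_path):
        return dest_path  # cached copy found; skip the network entirely
    os.makedirs(os.path.dirname(dest_path) or ".", exist_ok=True)
    tmp_path = dest_path + ".part"
    urllib.request.urlretrieve(url, tmp_path)
    # Atomic rename so an interrupted download is never mistaken for a cache hit.
    os.replace(tmp_path, dest_path)
    return dest_path
```

The scheduled "no-cache" CI job leezu describes would simply wipe the cache directory before running the suite, forcing every download to hit the original URL again.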

@cgraywang
Contributor Author

Well, is caching a dataset that we are not allowed to redistribute itself allowed? I suggest we only use datasets whose licensing is clear. Which of the datasets used in the embedding evaluation have a license that permits free redistribution?

@leezu
Contributor

leezu commented Apr 22, 2018

By caching I mean that the CI server doesn't re-download the dataset on every run. This means that even if the author's webserver goes down intermittently, the tests still pass.

Redistribution would allow us to upload the datasets to S3 and replace the links to the author's webserver with links to our S3 bucket. S3 is unlikely to go down, so our tests wouldn't be flaky in the first place. Unfortunately, 90% of the datasets for Embedding Eval as well as CoNLL do not allow redistribution (and possibly others).

However we plan to contact the dataset owners once the toolkit is released and ask for special redistribution permission. This would save them data transfer costs caused by users of our toolkit, so they may agree.

Closing this now as #58 and #62 should have solved the issue.

You may need to make a code change to rerun the tests.

Feel free to reopen if the issue persists.

@leezu leezu closed this as completed Apr 22, 2018