This repository has been archived by the owner on Jan 15, 2024. It is now read-only.

Test dataset issue when performing unittest with python3 #56

Closed
cgraywang opened this issue Apr 21, 2018 · 3 comments
@cgraywang
Contributor

This error caused CI to fail for this PR: #55

The detailed error message is below:

```
tests/unittest/test_datasets.py::test_men Downloading tests/data/men/MEN.tar.gz from http://clic.cimec.unitn.it/~elia.bruni/resources/MEN.tar.gz...
FAILED

=================================== FAILURES ===================================
___________________________________ test_men ___________________________________

    def test_men():
        for segment, length in [("full", 3000), ("dev", 2000), ("test", 1000)]:
            data = nlp.data.MEN(
                root=os.path.join('tests', 'data', 'men'), segment=segment)

tests/unittest/test_datasets.py:146:
gluonnlp/data/word_embedding_evaluation.py:307: in __init__
    super(MEN, self).__init__(root=root)
gluonnlp/data/word_embedding_evaluation.py:163: in __init__
    super(WordSimilarityEvaluationDataset, self).__init__(root=root)
gluonnlp/data/word_embedding_evaluation.py:120: in __init__
    self._download_data()
gluonnlp/data/word_embedding_evaluation.py:131: in _download_data
    verify=self._verify_ssl)
```
@leezu
Contributor

leezu commented Apr 22, 2018

This seems to be caused by intermittent downtime of the clic.cimec.unitn.it server.
While we are likely allowed to redistribute the MEN dataset (it is not 100% clear whether the permissive license applies to the dataset itself or only to models trained on it) and could thereby control the source server, we can't redistribute most other datasets.

Therefore I suggest not cleaning the cached datasets downloaded from external hosts between subsequent test runs. To make sure we catch URLs that go down permanently, we would have to set up a separate CI job that periodically runs the complete test suite after cleaning the cached datasets.

I don't have access to the CI configuration, but I suppose this is feasible. If there is no objection, I will implement caching of external datasets now to make sure we won't experience such intermittent test failures in the future. Then, in a follow-up, we can set up the scheduled test without the cache. For that I either need access to the CI configuration or someone (@szha ?) would have to enable it.

Please let me know if you have any alternative suggestions.
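For illustration, the caching idea could look like the following sketch. This is a hypothetical helper, not GluonNLP's actual download API: the point is simply that if the archive already sits in the cache directory from a previous CI run, the source server is never contacted, so intermittent outages cannot fail the tests.

```python
import os
import urllib.request

def cached_download(url, dest_path):
    """Download url to dest_path unless a cached copy already exists.

    Hypothetical sketch of the caching proposal: keep the downloaded
    archive between CI runs so the tests survive intermittent
    downtime of the original host.
    """
    if os.path.exists(dest_path):
        return dest_path  # cached copy found; skip the network entirely
    os.makedirs(os.path.dirname(dest_path) or ".", exist_ok=True)
    tmp_path = dest_path + ".part"
    urllib.request.urlretrieve(url, tmp_path)
    # Atomic rename so an interrupted download is never mistaken for a cache hit.
    os.replace(tmp_path, dest_path)
    return dest_path
```

The scheduled "no-cache" CI job leezu describes would simply wipe the cache directory before running the suite, forcing every download to hit the original URL again.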

@cgraywang
Contributor Author

Well, is caching a dataset that we are not allowed to redistribute itself allowed? I suggest we only use datasets whose licensing is clear. Which of the datasets used in the embedding evaluation have a license that permits free redistribution?

@leezu
Contributor

leezu commented Apr 22, 2018

By caching I mean that the CI server doesn't re-download the dataset on every run. This means that even if the author's webserver goes down intermittently, the tests still pass.

Redistribution would allow us to upload the datasets to S3 and replace the links to the author's webserver with links to our S3 bucket. S3 is unlikely to go down, so our tests wouldn't be flaky in the first place. Unfortunately, 90% of the datasets for Embedding Eval as well as CoNLL do not allow redistribution (and possibly others).

However we plan to contact the dataset owners once the toolkit is released and ask for special redistribution permission. This would save them data transfer costs caused by users of our toolkit, so they may agree.

Closing this now as #58 and #62 should have solved the issue.

You may need to make a code change to rerun the tests.

Feel free to reopen if the issue persists.

@leezu leezu closed this as completed Apr 22, 2018