
[Numpy] Add "match_tokens_with_char_spans" + Enable downloading from S3 + Add Ubuntu test #1249

Merged: 18 commits into dmlc:numpy on Jun 16, 2020

Conversation

@sxjscience (Member) commented Jun 13, 2020

  • Add match_tokens_with_char_spans to the utilities.
    It converts character spans to token start/end indices via binary search (a sketch follows this list).
  • Enable downloading from S3. Now we can call:

```python
from gluonnlp.utils import download
download('s3://commoncrawl/crawl-data/CC-MAIN-2020-24/segments/1590347385193.5/wet/CC-MAIN-20200524210325-20200525000325-00003.warc.wet.gz', overwrite=True)
```
  • Add Ubuntu test
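
A minimal sketch of the binary-search span matching, assuming each token carries its (char_start, char_end) offsets; the function body here is illustrative, not necessarily the PR's exact implementation:

```python
import bisect

def match_tokens_with_char_spans(token_offsets, char_spans):
    """Map character spans to (token_start, token_end) index pairs.

    token_offsets: sorted list of (char_start, char_end), one per token.
    char_spans: list of (char_start, char_end) spans to locate.
    """
    starts = [s for s, _ in token_offsets]
    ends = [e for _, e in token_offsets]
    matched = []
    for span_start, span_end in char_spans:
        # First token whose end lies past the span start.
        token_start = bisect.bisect_right(ends, span_start)
        # Last token whose start lies before the span end.
        token_end = bisect.bisect_left(starts, span_end) - 1
        matched.append((token_start, token_end))
    return matched

# "Hello world" tokenized as ["Hello", "world"]
offsets = [(0, 5), (6, 11)]
print(match_tokens_with_char_spans(offsets, [(0, 5), (6, 11), (0, 11)]))
# -> [(0, 0), (1, 1), (0, 1)]
```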

For the S3 download to work, the user needs properly configured S3 credentials.
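
For example, assuming the S3 path uses boto3 and the standard AWS credential chain (an assumption; the exact mechanism is not spelled out here), credentials can be supplied via environment variables:

```python
import os

# Placeholder values; any source in the AWS credential chain
# (~/.aws/credentials, environment variables, or an instance role) works.
os.environ['AWS_ACCESS_KEY_ID'] = '<your-access-key-id>'
os.environ['AWS_SECRET_ACCESS_KEY'] = '<your-secret-access-key>'
```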

Also, a speed test shows that downloading from S3 can be roughly 3-4x faster on a c4.8xlarge instance in EC2 (87.3 MiB/s vs. 19.0 MiB/s; 2.5 s vs. 8.13 s per download):

```
In [7]: %timeit download('s3://commoncrawl/crawl-data/CC-MAIN-2020-24/segments/1590347385193.5/wet/CC-MAIN-20200524210325-20200525000325-00003.warc.wet.gz', overwrite=True)
Downloading CC-MAIN-20200524210325-20200525000325-00003.warc.wet.gz from s3://commoncrawl/crawl-data/CC-MAIN-2020-24/segments/1590347385193.5/wet/CC-MAIN-20200524210325-20200525000325-00003.warc.wet.gz...
100%|██████████████████████████████████████████████████████| 157M/157M [00:01<00:00, 87.3MiB/s]
2.5 s ± 401 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [8]: %timeit download('https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2020-24/segments/1590347385193.5/wet/CC-MAIN-20200524210325-20200525000325-00003.warc.wet.gz', overwrite=True)
Downloading CC-MAIN-20200524210325-20200525000325-00003.warc.wet.gz from https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2020-24/segments/1590347385193.5/wet/CC-MAIN-20200524210325-20200525000325-00003.warc.wet.gz...
100%|██████████████████████████████████████████████████████| 157M/157M [00:08<00:00, 19.0MiB/s]
8.13 s ± 389 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

This will help us download large datasets such as Wikipedia and Common Crawl.

@sxjscience (Member, Author)

@zheyuye We may revise our Wikipedia downloading script to:

  1. Try S3 first.
  2. Fall back to HTTPS if S3 raises an exception.
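
A minimal sketch of that fallback (the wrapper name is illustrative; it assumes download raises an exception when the S3 transfer fails):

```python
from gluonnlp.utils import download

def download_with_fallback(s3_url, https_url, **kwargs):
    """Prefer S3; fall back to HTTPS if the S3 download fails."""
    try:
        return download(s3_url, **kwargs)
    except Exception as err:  # e.g. missing or invalid S3 credentials
        print('S3 download failed ({}); falling back to HTTPS.'.format(err))
        return download(https_url, **kwargs)

# Usage (URLs elided; see the PR description for the full paths):
# download_with_fallback('s3://commoncrawl/...', 'https://commoncrawl.s3.amazonaws.com/...',
#                        overwrite=True)
```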

@zheyuye (Member) left a comment

LGTM

@sxjscience (Member, Author) commented Jun 13, 2020

I found that Wikipedia is not available on S3. However, Common Crawl is, so this functionality is still helpful for downloading Common Crawl.

@codecov bot commented Jun 13, 2020

Codecov Report

Merging #1249 into numpy will increase coverage by 0.11%.
The diff coverage is 61.40%.


```diff
@@            Coverage Diff             @@
##            numpy    #1249      +/-   ##
==========================================
+ Coverage   82.32%   82.44%   +0.11%     
==========================================
  Files          38       38              
  Lines        5410     5450      +40     
==========================================
+ Hits         4454     4493      +39     
- Misses        956      957       +1     
```
| Impacted Files | Coverage Δ |
|---|---|
| src/gluonnlp/utils/misc.py | 48.18% <50.00%> (-2.61%) ⬇️ |
| src/gluonnlp/utils/lazy_imports.py | 55.71% <66.66%> (+1.02%) ⬆️ |
| src/gluonnlp/utils/__init__.py | 100.00% <100.00%> (ø) |
| src/gluonnlp/utils/preprocessing.py | 100.00% <100.00%> (ø) |
| src/gluonnlp/data/loading.py | 83.39% <0.00%> (+7.54%) ⬆️ |

@zheyuye (Member) commented Jun 13, 2020

Perhaps this PR could also fix some invalid dataset links, such as those for the General NLP Benchmarks and the scripts in https://github.com/dmlc/gluon-nlp/blob/numpy/scripts/datasets/README.md.

@sxjscience (Member, Author) commented Jun 15, 2020

I'm not sure why it reports a 61.40% diff hit (it might be related to the coverage of this patch alone). I'll merge this in first; the overall test coverage actually increases after this PR.

@sxjscience sxjscience changed the title [Numpy] Add match_tokens_with_char_spans to utility + Enable downloading from S3 [Numpy] Add "match_tokens_with_char_spans" + Enable downloading from S3 + Add Ubuntu test Jun 15, 2020
@szha previously requested changes Jun 15, 2020 (review comment on codecov.yml, since resolved)

@szha dismissed their stale review Jun 15, 2020 23:28: addressed

@szha merged commit 85b6f09 into dmlc:numpy Jun 16, 2020