how to use the IWSLT2016 dataset #72

Open
drdozer opened this issue Aug 2, 2021 · 5 comments

Comments

drdozer commented Aug 2, 2021

Hi - I want to play around with some language translation tasks and saw that you've got Transformers.Datasets.IWSLT.IWSLT2016. How do I interact with this to get data that I can train a model on? I couldn't find anything in the documentation to help me out.

Owner

chengchingwen commented Aug 3, 2021

You can find some simple usages in the toy example.

Basically,

using Transformers
using Transformers.Datasets # utilities for datasets
using Transformers.Datasets: IWSLT # IWSLT datasets

# available languages for iwslt2016: :en, :cs, :ar, :fr, :de
src_lang = :en
dst_lang = :de

iwslt2016 = IWSLT.IWSLT2016(src_lang, dst_lang) # create the dataset descriptor

# get the vocabulary from the training data
vocab = get_vocab(iwslt2016)

# create dataset objects
# each one is a 2-tuple of channels containing src sentences and dst sentences
training_set = dataset(Train, iwslt2016)
dev_set = dataset(Dev, iwslt2016)
test_set = dataset(Test, iwslt2016) # usually a test set won't contain ground truth, but iwslt2016 somehow does

# get the data
batch_size = 1
src_sent, dst_sent = get_batch(training_set, batch_size) # each one is a vector of sentences

Once you have run through all the data, get_batch will return an empty vector; you can then recreate the dataset object to start another pass.
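
The "read batches until exhausted, then recreate the dataset" pattern could be sketched roughly as below. This is only an illustration, not code from Transformers.jl: `foreach_batch`, `make_source`, and `toy_pairs` are hypothetical names, and the stand-in source exists so the sketch runs without downloading anything. The comment above says the end of the data is signalled by an empty vector; the loop also stops on `nothing` as a precaution, on the assumption that the signal may differ between versions.

```julia
# Drain a batch source until it is exhausted and return the number of batches.
# `next_batch` is a hypothetical zero-argument closure; with the real dataset it
# would be something like () -> get_batch(training_set, batch_size).
function foreach_batch(f, next_batch)
    n = 0
    while true
        batch = next_batch()
        # Assumption: exhaustion is signalled by `nothing` or an empty collection.
        (batch === nothing || isempty(batch)) && break
        f(batch)
        n += 1
    end
    return n
end

# Stand-in data so the sketch is self-contained (not the real IWSLT2016 data).
toy_pairs = [("hello", "hallo"), ("thanks", "danke"), ("yes", "ja")]

# Build a fresh source; calling this again plays the role of
# "recreate the dataset object" for the next pass.
function make_source(data, batch_size)
    i = 0
    return function ()
        i >= length(data) && return nothing
        chunk = data[i+1:min(i + batch_size, length(data))]
        i += length(chunk)
        # Same shape as above: (vector of src sentences, vector of dst sentences).
        return (first.(chunk), last.(chunk))
    end
end

for epoch in 1:2
    source = make_source(toy_pairs, 2)  # recreate the source for each pass
    n_batches = foreach_batch(source) do batch
        src_sent, dst_sent = batch
        # a training step would go here
    end
    println("epoch ", epoch, ": ", n_batches, " batches")
end
```

With the real dataset, the closure passed as `next_batch` would wrap `get_batch`, and `dataset(Train, iwslt2016)` would be called again at the start of each pass.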

Contributor

maj0e commented Feb 13, 2022

The above example fails with the following error message:

┌ Info: Downloading
│   source = "https://wit3.fbk.eu/archive/2016-01//texts/en/de/en-de.tgz"
│   dest = "/home/markus/.julia/datadeps/IWSLT2016 en-de/en-de.tgz"
│   progress = NaN
│   time_taken = "0.05 s"
│   time_remaining = "NaN s"
│   average_speed = "2.141 MiB/s"
│   downloaded = "105.240 KiB"
│   remaining = "∞ B"
└   total = "∞ B"
ERROR: LoadError: HTTP.ExceptionRequest.StatusError(404, "GET", "/archive/2016-01//texts/en/de/en-de.tgz", HTTP.Messages.Response:
"""
HTTP/1.1 404 Not Found
Content-Type: text/html; charset=utf-8
X-Frame-Options: DENY
Cache-Control: no-cache, no-store, max-age=0, must-revalidate
Pragma: no-cache
Expires: Mon, 01 Jan 1990 00:00:00 GMT
Date: Sun, 13 Feb 2022 09:22:44 GMT
Cross-Origin-Opener-Policy: unsafe-none
Content-Security-Policy: base-uri 'self';object-src 'none';report-uri /_/view/cspreport;script-src 'nonce-iuQ6rOqUw/NwssS2azzWNQ' 'unsafe-inline' 'unsafe-eval';worker-src 'self';frame-ancestors https://google-admin.corp.google.com/
Referrer-Policy: origin
Server: ESF
X-XSS-Protection: 0
X-Content-Type-Options: nosniff
Accept-Ranges: none
Vary: Accept-Encoding
Transfer-Encoding: chunked

Looking at the website of IWSLT, it seems that the datasets moved to Google Drive instead.

Owner

chengchingwen commented Feb 14, 2022

It looks like they no longer provide file links for specific translation pairs, so we would need to rewrite the datadeps based on that.

Contributor

maj0e commented Feb 14, 2022

I thought I could fix this quickly by changing the download link and adapting the post_fetch_method to search for the translation pairs in the right subfolder, but it seems that DataDeps.jl doesn't support downloading from Google Drive (or maybe I did something wrong).
From a quick glance at DataDeps.jl, I found an issue discussing this topic.

@chengchingwen
Owner

@maj0e Moved this issue to #85.
