
Add Proxy for _download_extract_downstream_data() related functions #136

Merged
merged 5 commits into from Nov 6, 2019

Conversation

@cregouby (Contributor) commented Nov 4, 2019

No description provided.

@tholor (Member) commented Nov 5, 2019

Hi @cregouby ,

Thanks for tackling the proxy issue now also for the download of datasets!

I finally also had some time to look into it and I have two suggestions:

  1. Let's make the proxy argument explicit instead of passing it all the way down inside **kwargs. This will simplify debugging and help users spot the option of setting proxies more easily.
  2. I'm not sure how you planned to set the proxies argument in file_to_dicts() in the initial version of the PR. I think it's easier to set it as an attribute of the Processor itself
TextClassificationProcessor(..., proxies=<your-proxy>)

... and then accessing it directly in file_to_dicts():

    def file_to_dicts(self, file: str) -> [dict]:
        ...
        dicts = read_tsv(
            ...
            proxies=self.proxies,
        )
        return dicts

With that we avoid some additional passes of **kwargs from the processor down to file_to_dicts().

What do you think? Could you please test if this actually works behind a proxy?
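The attribute-based approach above can be sketched as follows (hypothetical class and helper names, not FARM's real implementation): the constructor stores a requests-style proxies dict, and file_to_dicts() reads it directly instead of receiving it through **kwargs.

```python
# Minimal sketch (assumed names, not FARM's actual classes): store the
# proxies dict on the Processor so helpers can read it as an attribute.

def read_tsv(file, proxies=None):
    # stand-in for the real TSV loader; just echoes what it was given
    return [{"file": file, "proxies": proxies}]

class Processor:
    def __init__(self, data_dir, proxies=None):
        self.data_dir = data_dir
        # requests-style dict, e.g. {"https": "http://proxy:3128"}
        self.proxies = proxies

class TextClassificationProcessor(Processor):
    def file_to_dicts(self, file):
        # no **kwargs threading: the stored attribute is passed down directly
        return read_tsv(file, proxies=self.proxies)
```

With this shape, every download helper below the processor only needs one explicit `proxies` parameter rather than a pass-through of arbitrary keyword arguments.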

@cregouby (Contributor, Author) commented Nov 6, 2019

I fully agree. Putting **kwargs everywhere was a quick fix; your proposal seems cleaner.

I'll be able to test it soon.

@cregouby (Contributor, Author) commented Nov 6, 2019

Hello @tholor

  1. Your commits work like a charm behind the proxy:
proxies = {'https': config['https_proxy']}
language_model = LanguageModel.load(pretrained_model_name_or_path="bert-base-cased",
                                    language="english",
                                    proxies=proxies)

which outputs:

11/06/2019 10:08:01 - INFO - transformers.configuration_utils -   loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-config.json from cache at /xxx/xxx/cache/torch/transformers/b945b69218e98b3e2c95acf911789741307dec43c698d35fad11c1ae28bda352
11/06/2019 10:08:01 - INFO - transformers.configuration_utils -   Model config {
 ...
}

11/06/2019 10:08:02 - INFO - transformers.file_utils -   https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-pytorch_model.bin not found in cache or force_download set to True, downloading to /tmp/tmp9w8wg9z3
100%|██████████| 435779157/435779157 [04:07<00:00, 1757340.05B/s]
11/06/2019 10:12:11 - INFO - transformers.file_utils -   copying /tmp/tmp9w8wg9z3 to cache at /xxx/xxx/.cache/torch/transformers/35d8b9d36faaf46728a0192d82bf7d00137490cd6074e8500778afed552a67e5.3fadbea36527ae472139fe84cddaa65454d7429f12d543d80bfc3ad70de55ac2
11/06/2019 10:12:21 - INFO - transformers.file_utils -   creating metadata file for /xxx/xxx/.cache/torch/transformers/35d8b9d36faaf46728a0192d82bf7d00137490cd6074e8500778afed552a67e5.3fadbea36527ae472139fe84cddaa65454d7429f12d543d80bfc3ad70de55ac2
11/06/2019 10:12:21 - INFO - transformers.file_utils -   removing temp file /tmp/tmp9w8wg9z3
11/06/2019 10:12:21 - INFO - transformers.modeling_utils -   loading weights file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-pytorch_model.bin from cache at /xxx/xxx/.cache/torch/transformers/35d8b9d36faaf46728a0192d82bf7d00137490cd6074e8500778afed552a67e5.3fadbea36527ae472139fe84cddaa65454d7429f12d543d80bfc3ad70de55ac2
  2. Same for the dataset download:
processor = BertStyleLMProcessor(data_dir=config['project_directory'] + "lm_finetune_nips",
                                 tokenizer=tokenizer,
                                 max_seq_len=128,
                                 max_docs=25,
                                 next_sent_pred=True,
                                 proxies=proxies)
data_silo = DataSilo(
    processor=processor,
    batch_size=32, 
    distributed=False
)
11/06/2019 11:00:18 - INFO - farm.data_handler.data_silo -   
Loading data into the data silo ... 
              ______
               |o  |   !
   __          |:`_|---'-.
  |__|______.-/ _ \-----.|       
 (o)(o)------'\ _ /     ( )      
 
11/06/2019 11:00:18 - INFO - farm.data_handler.data_silo -   Loading train set from: /xxx/xxx/lm_finetune_nips/train.txt 
11/06/2019 11:00:18 - INFO - farm.data_handler.utils -   downloading and extracting file lm_finetune_nips to dir /xxx/xxx/lm_finetune_nips/train.txt 
100%|██████████| 63322187/63322187 [00:13<00:00, 4788623.15B/s]
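The download-and-extract path exercised in the log above can be sketched like this (a hypothetical helper, not FARM's actual _download_extract_downstream_data(); the `session` parameter is an addition here purely to make the sketch testable without network access):

```python
import tarfile
import tempfile

def download_extract(url, output_dir, proxies=None, session=None):
    """Sketch of a proxy-aware download-and-extract helper (assumed API).

    proxies is a requests-style dict, e.g. {"https": "http://proxy:3128"}.
    session defaults to the requests library; anything with a compatible
    .get() can be injected for testing.
    """
    if session is None:
        import requests  # imported lazily so the sketch runs offline in tests
        session = requests
    # forwarding `proxies` to the HTTP layer is the whole point of this PR
    resp = session.get(url, proxies=proxies, stream=True)
    resp.raise_for_status()
    with tempfile.NamedTemporaryFile(suffix=".tar.gz") as tmp:
        for chunk in resp.iter_content(chunk_size=8192):
            tmp.write(chunk)
        tmp.flush()
        with tarfile.open(tmp.name, "r:gz") as archive:
            archive.extractall(output_dir)
```

Because `proxies` is an explicit parameter, a caller behind a corporate proxy only has to build the dict once and hand it to the processor, exactly as in the snippet above.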

@tholor (Member) commented Nov 6, 2019

Awesome! Then let's merge it into master.

@tholor tholor self-requested a review November 6, 2019 12:00
@tholor tholor self-assigned this Nov 6, 2019
@tholor tholor added enhancement New feature or request part: processor Processor labels Nov 6, 2019
@tholor tholor merged commit 6682b3a into deepset-ai:master Nov 6, 2019
@tholor tholor changed the title Fix ##115 for _download_extract_downstream_data() related functions Add Proxy for _download_extract_downstream_data() related functions Nov 27, 2019