
Add Proxy for _download_extract_downstream_data() related functions #136

Merged
merged 5 commits into from Nov 6, 2019

Conversation

@cregouby (Contributor) commented Nov 4, 2019

No description provided.

@tholor (Member) commented Nov 5, 2019

Hi @cregouby ,

Thanks for tackling the proxy issue now also for the download of datasets!

I finally also had some time to look into it and I have two suggestions:

  1. Let's make the proxy argument explicit instead of passing it all the way down inside **kwargs. This will simplify debugging and help users spot the option of setting proxies more easily.
  2. I'm not sure how you planned to set the proxies argument in file_to_dicts() in the initial version of the PR. I think it's easier to set it as an attribute of the Processor itself
TextClassificationProcessor(..., proxies=<your-proxy>)

... and then accessing it directly in file_to_dicts():

    def file_to_dicts(self, file: str) -> [dict]:
        ...
        dicts = read_tsv(
            ...
            proxies=self.proxies,
        )
        return dicts

With that we avoid some additional passes of **kwargs from the processor down to file_to_dicts().

What do you think? Could you please test if this actually works behind a proxy?
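The attribute-based approach above can be sketched as follows (hypothetical class and helper names, not FARM's real implementation): the constructor stores a requests-style proxies dict, and file_to_dicts() reads it directly instead of receiving it through **kwargs.

```python
# Minimal sketch (assumed names, not FARM's actual classes): store the
# proxies dict on the Processor so helpers can read it as an attribute.

def read_tsv(file, proxies=None):
    # stand-in for the real TSV loader; just echoes what it was given
    return [{"file": file, "proxies": proxies}]

class Processor:
    def __init__(self, data_dir, proxies=None):
        self.data_dir = data_dir
        # requests-style dict, e.g. {"https": "http://proxy:3128"}
        self.proxies = proxies

class TextClassificationProcessor(Processor):
    def file_to_dicts(self, file):
        # no **kwargs threading: the stored attribute is passed down directly
        return read_tsv(file, proxies=self.proxies)
```

With this shape, every download helper below the processor only needs one explicit `proxies` parameter rather than a pass-through of arbitrary keyword arguments.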

@cregouby (Contributor, Author) commented Nov 6, 2019

I fully agree. Putting **kwargs everywhere was a quick fix; your proposal seems cleaner.

I'll be able to test it soon.

@cregouby (Contributor, Author) commented Nov 6, 2019

Hello @tholor

  1. Your commits work like a charm behind the proxy:
proxies = {'https': config['https_proxy']}
language_model = LanguageModel.load(pretrained_model_name_or_path="bert-base-cased",
                                    language="english",
                                    proxies=proxies)

which outputs:

11/06/2019 10:08:01 - INFO - transformers.configuration_utils -   loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-config.json from cache at /xxx/xxx/cache/torch/transformers/b945b69218e98b3e2c95acf911789741307dec43c698d35fad11c1ae28bda352
11/06/2019 10:08:01 - INFO - transformers.configuration_utils -   Model config {
 ...
}

11/06/2019 10:08:02 - INFO - transformers.file_utils -   https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-pytorch_model.bin not found in cache or force_download set to True, downloading to /tmp/tmp9w8wg9z3
100%|██████████| 435779157/435779157 [04:07<00:00, 1757340.05B/s]
11/06/2019 10:12:11 - INFO - transformers.file_utils -   copying /tmp/tmp9w8wg9z3 to cache at /xxx/xxx/.cache/torch/transformers/35d8b9d36faaf46728a0192d82bf7d00137490cd6074e8500778afed552a67e5.3fadbea36527ae472139fe84cddaa65454d7429f12d543d80bfc3ad70de55ac2
11/06/2019 10:12:21 - INFO - transformers.file_utils -   creating metadata file for /xxx/xxx/.cache/torch/transformers/35d8b9d36faaf46728a0192d82bf7d00137490cd6074e8500778afed552a67e5.3fadbea36527ae472139fe84cddaa65454d7429f12d543d80bfc3ad70de55ac2
11/06/2019 10:12:21 - INFO - transformers.file_utils -   removing temp file /tmp/tmp9w8wg9z3
11/06/2019 10:12:21 - INFO - transformers.modeling_utils -   loading weights file https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-pytorch_model.bin from cache at /xxx/xxx/.cache/torch/transformers/35d8b9d36faaf46728a0192d82bf7d00137490cd6074e8500778afed552a67e5.3fadbea36527ae472139fe84cddaa65454d7429f12d543d80bfc3ad70de55ac2
  2. Same for the dataset download:
processor = BertStyleLMProcessor(data_dir=config['project_directory'] + "lm_finetune_nips",
                                 tokenizer=tokenizer,
                                 max_seq_len=128,
                                 max_docs=25,
                                 next_sent_pred=True,
                                 proxies=proxies)
data_silo = DataSilo(
    processor=processor,
    batch_size=32, 
    distributed=False
)
11/06/2019 11:00:18 - INFO - farm.data_handler.data_silo -   
Loading data into the data silo ... 
              ______
               |o  |   !
   __          |:`_|---'-.
  |__|______.-/ _ \-----.|       
 (o)(o)------'\ _ /     ( )      
 
11/06/2019 11:00:18 - INFO - farm.data_handler.data_silo -   Loading train set from: /xxx/xxx/lm_finetune_nips/train.txt 
11/06/2019 11:00:18 - INFO - farm.data_handler.utils -   downloading and extracting file lm_finetune_nips to dir /xxx/xxx/lm_finetune_nips/train.txt 
100%|██████████| 63322187/63322187 [00:13<00:00, 4788623.15B/s]
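The download-and-extract path exercised in the log above can be sketched like this (a hypothetical helper, not FARM's actual _download_extract_downstream_data(); the `session` parameter is an addition here purely to make the sketch testable without network access):

```python
import tarfile
import tempfile

def download_extract(url, output_dir, proxies=None, session=None):
    """Sketch of a proxy-aware download-and-extract helper (assumed API).

    proxies is a requests-style dict, e.g. {"https": "http://proxy:3128"}.
    session defaults to the requests library; anything with a compatible
    .get() can be injected for testing.
    """
    if session is None:
        import requests  # imported lazily so the sketch runs offline in tests
        session = requests
    # forwarding `proxies` to the HTTP layer is the whole point of this PR
    resp = session.get(url, proxies=proxies, stream=True)
    resp.raise_for_status()
    with tempfile.NamedTemporaryFile(suffix=".tar.gz") as tmp:
        for chunk in resp.iter_content(chunk_size=8192):
            tmp.write(chunk)
        tmp.flush()
        with tarfile.open(tmp.name, "r:gz") as archive:
            archive.extractall(output_dir)
```

Because `proxies` is an explicit parameter, a caller behind a corporate proxy only has to build the dict once and hand it to the processor, exactly as in the snippet above.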

@tholor (Member) commented Nov 6, 2019

Awesome! Then let's merge it into master.

@tholor tholor self-requested a review November 6, 2019 12:00
@tholor tholor self-assigned this Nov 6, 2019
@tholor tholor added enhancement New feature or request part: processor Processor labels Nov 6, 2019
@tholor tholor merged commit 6682b3a into deepset-ai:master Nov 6, 2019
@tholor tholor changed the title Fix ##115 for _download_extract_downstream_data() related functions Add Proxy for _download_extract_downstream_data() related functions Nov 27, 2019