
How about this group-train subcommand #2579

Closed
Whu-wxy opened this issue Mar 8, 2019 · 1 comment

Whu-wxy commented Mar 8, 2019

We often need to experiment with different configurations. If AllenNLP could automatically run several training tasks in turn, it would save us some time. Yesterday I added a 'group-train' subcommand, which is a wrapper around the train command, and after a simple test I found it very useful.

Command execution process
I used the allennlp-as-a-library-example project for the test.
experiments dir: venue_classifier.json, venue_classifier_boe.json, venue_classifier_boe_adam.json
Input: allennlp group-train /home/wxy/PycharmProjects/allennlp-as-a-library/experiments -s ./group_save --include-package my_library

Process (a simplified sketch of the core loop follows the steps below):
1. Before training begins, create a training_progress.json file that records each configuration file and whether its training has been completed.
{"venue_classifier_boe_adam": false, "venue_classifier.json": false, "venue_classifier_boe.json": false}

2. Use a 'for' loop to train each json config in turn.

3. Update the training_progress.json file after each training task completes.
{"venue_classifier_boe_adam": true, "venue_classifier.json": false, "venue_classifier_boe.json": false}

4. If training is interrupted, it resumes at the first configuration still marked 'false' in training_progress.json, and the remaining training runs proceed normally.

5. After all training is completed, there are three dirs in the serialization_dir: venue_classifier, venue_classifier_boe, venue_classifier_boe_adam. Their names correspond to the configuration files. The final training_progress.json file:
{"venue_classifier_boe_adam": true, "venue_classifier.json": true, "venue_classifier_boe.json": true}

Additional context
1. Because I am not familiar with the test module, I tested my code with the allennlp-as-a-library-example project.

2. If one training run is interrupted by an exception, the subsequent runs do not continue, so skipping the failed configuration would be a better choice, but I don't know how to do this (one untested idea is sketched after this list).

3. I have pushed this code to my fork of the allennlp repository.
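
Regarding point 2, one untested idea would be to wrap each run in a try/except and record the failure in the progress file instead of stopping. The loop body in the sketch above would become something like:

```python
# Sketch only: record a failed config and move on to the next one
# instead of aborting the whole group.
try:
    train_model_from_file(os.path.join(experiments_dir, config_name), run_dir)
    progress[config_name] = True
except Exception as exc:
    print(f"Training for {config_name} failed ({exc}); skipping it.")
    progress[config_name] = "failed"
with open(progress_path, "w") as f:
    json.dump(progress, f)
```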

Do you have any suggestions for my code?
@matt-gardner

matt-gardner (Contributor) commented

I'm glad you found a good way to do this that works well for you! I think there are a lot of different ways to solve this basic problem, and we so far have taken the position that these kinds of scripts or commands should be outside the main library. Maybe some day we'll put something in here, but because there are so many different setups a person could have (a single GPU, multiple CPUs, multiple GPUs, some cloud infrastructure, a cluster with a job queue...), it's hard to make a general solution. I think the right thing to do is leave a link to your solution, like you did, so that others with a similar situation can use it if they find it useful.

Thanks!
