This repo is a bootstrap for experiments: it includes helper functions and scripts for PyTorch training and the Slurm job scheduler.
Basic idea: create a Python script (the experiment) that accepts command-line arguments, provide argument lists, and generate Slurm jobs from the cross product of those lists.
First, set up your SSH workflow. Then let's kick-start our experiment.
```bash
ssh prince
git clone firstname.lastname@example.org:evcu/exp.bootstrp.git
mv exp.bootstrp my_exp
cd my_exp
```
First, debug the experiment in an interactive session. `prince_slurm_bootstrap.sh` loads the modules needed; update it as needed. Personally I am using python3 with `pip --user` packages. You can call the script with `install` the first time:
```bash
srun -t2:30:00 --mem=5000 --gres=gpu:1 --pty /bin/bash
. ./prince_slurm_bootstrap.sh install
cd experiments/cifar10/
python main.py --epoch 1
```
Once we are sure that our main script works, we can start creating automated experiments with the `create_experiment_jobs.py` script. The first thing to do is to update some of the SLURM fields under `experiments/default_conf.yaml`, e.g. replace `NET_ID` with your net ID if you are a fellow NYU student using Prince. You may need to change this file completely according to your needs if you are working on another system or have different requirements.
Note that each element of the `experiment` key in the YAML file is itself a dictionary of argument lists for `<exp_name>/main.py`. The values in these argument lists are cross-multiplied with the others in the dictionary to generate all possible combinations.
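The cross product can be pictured with a short Python sketch; the argument names and values below are made up for illustration and are not tied to the repo's code:

```python
from itertools import product

# Hypothetical argument lists, as they might appear under the `experiment` key.
arg_lists = {"--lr": [0.1, 0.01], "--batch_size": [64, 128]}

keys = list(arg_lists)
# Each combination of values becomes one job configuration.
jobs = [dict(zip(keys, combo)) for combo in product(*(arg_lists[k] for k in keys))]
print(len(jobs))  # 2 * 2 = 4 job configurations
```

Two values for each of two arguments yield four jobs; adding a third list multiplies the count again.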
Now we can generate the experiment scripts:

```bash
cd ../
python create_experiment_jobs.py --debug
```
If they all look good, then you can create the experiment folder and submit the jobs:

```bash
python create_experiment_jobs.py
bash /scratch/ue225/my_project/exps/cifar10/cifarLR_03.26/submit_all.sh
```
which would output something like this:
Let's say you want to define a new experiment. You would do so by creating a new folder `experiments/new_folder/` and an `experiments/new_folder/main.py` script that is intended to be run. The `main.py` script should accept a `--conf_file` flag at minimum. Then you can change `new_folder` and create new experiments.
- `read_yaml_args` reads `conf.yaml` and creates a type-checked `ArgumentParser` from the definition: write the conf, read it back, and overwrite it with CLI args.
- Customizable eval-prefixes inside the YAML file, which enable defining programmatic, eval-able arguments; e.g. the string `'+range(5)'` is evaluated and read as the corresponding list.
- The configuration is copied to the experiment folder, so you can always change an experiment's default args after submission.
- `ClassificationTrainer`/`ClassificationTester`, which wrap the main training/testing functionality and provide hooks for loggers.
- tensorboardX logging utils and examples.
- A generic `convNet` implementation.
- Multiple experiment definitions through yaml lists.
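The idea behind `read_yaml_args` can be sketched as follows. This is not the repo's implementation: the config keys are made up, and a plain dict stands in for the parsed YAML file.

```python
import argparse

# Stands in for the parsed conf.yaml (keys are hypothetical).
conf = {"lr": 0.1, "epochs": 10, "optimizer": "sgd"}

parser = argparse.ArgumentParser()
for key, default in conf.items():
    # Using type(default) makes the parser type-checked:
    # e.g. `--lr abc` is rejected because `lr` defaults to a float.
    parser.add_argument(f"--{key}", type=type(default), default=default)

# CLI arguments overwrite the YAML defaults.
args = parser.parse_args(["--lr", "0.01"])
print(args.lr, args.epochs, args.optimizer)  # 0.01 10 sgd
```

Deriving both the default and the type from the config keeps the YAML file as the single source of truth while still allowing per-run overrides.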
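A minimal sketch of how such an eval-prefix could work; the helper name and the exact prefix handling here are assumptions, not the repo's code:

```python
def maybe_eval(value, prefix="+"):
    """Evaluate strings that start with the eval-prefix; pass everything else through."""
    if isinstance(value, str) and value.startswith(prefix):
        result = eval(value[len(prefix):])
        # Ranges become concrete lists so they can be cross-multiplied later.
        return list(result) if isinstance(result, range) else result
    return value

print(maybe_eval("+range(5)"))  # [0, 1, 2, 3, 4]
print(maybe_eval("adam"))       # adam
```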
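The logger hooks mentioned for `ClassificationTrainer` follow a common trainer-with-hooks pattern; here is a generic sketch of that pattern (all names and the fake loss are made up, not the repo's API):

```python
class Trainer:
    def __init__(self):
        self.hooks = []  # callables invoked after every epoch, e.g. loggers

    def add_hook(self, fn):
        self.hooks.append(fn)

    def train(self, n_epochs):
        losses = []
        for epoch in range(n_epochs):
            loss = 1.0 / (epoch + 1)  # placeholder for a real training step
            losses.append(loss)
            for fn in self.hooks:  # hooks observe; they don't alter training
                fn(epoch, loss)
        return losses

trainer = Trainer()
trainer.add_hook(lambda epoch, loss: print(f"epoch {epoch}: loss={loss:.3f}"))
losses = trainer.train(3)
```

Registering loggers as hooks keeps the training loop free of logging code, so swapping or stacking loggers (console, tensorboardX, ...) requires no change to the trainer itself.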
To get the TensorBoard logs onto your local machine, there are several options:
- You can `scp` them, e.g.

  ```bash
  scp -r prince:/scratch/ue225/my_project/exps/cifar10/cifarLR_03.26/tb_logs ./
  ```
- You can open a tunnel to Prince, run TensorBoard there, and connect to it through port forwarding. You can look at my [remote Jupyter and port forwarding](https://evcu.github.io/notes/port-forwarding/) notes.
- You can use `sshfs` and get the logs synced into your local file system. Details here.
I am excited to collaborate and learn from you if you have figured out better ways of experimenting or want to add text/code to this repo. Please create an issue or reach out to me.
- Change `create_experiment_jobs.py` such that the defaults are, perhaps, included in the experiment YAML and dumped.
- Source code needs to be copied!