
Gradmatch Data subset selection method making training slow #78

Open · animesh-007 opened this issue Jun 27, 2022 · 9 comments

animesh-007 commented Jun 27, 2022

I tried to run some experiments as follows:

  • Ran full cifar10 without any subset selection method to train ResNet50, which took around 32m 31s.
  • Ran GradMatch cifar10 subset selection with a 0.1 fraction, which took longer than full cifar10, i.e. 22h 48m 40s.
  • Ran GradMatch cifar10 subset selection with a 0.3 fraction, which took even longer than the 0.1-fraction GradMatch run.

I am using scaled-up cifar10 images, i.e. 224x224 resolution, and defined the ResNet50 architecture accordingly.
Can you let me know how to speed up experiments 2 and 3? In general, a subset selection method should speed up the whole training process, right?

@animesh-007 animesh-007 changed the title Data subset selection method making training slow Gradmatch Data subset selection method making training slow Jul 2, 2022
@krishnatejakk
Collaborator

krishnatejakk commented Jul 14, 2022

@animesh-007

Can you point out what version of GradMatch you are using?

Ideally, subset selection should be faster unless something is wrong with the experimental setup. Please attach the log files so that I can figure out the issue after analyzing them.

@animesh-007
Author

> @animesh-007
>
> Can you point out what version of GradMatch you are using?
>
> Ideally, subset selection should be faster unless something is wrong with the experimental setup. Please attach the log files so that I can figure out the issue after analyzing them.

@krishnatejakk
These are the initial logs. Should I paste the whole log? I cloned the repo on June 27, so I guess I am using the latest version.

[06/27 16:40:56] train_sl INFO: DotMap(setting='SL', is_reg=True, dataset=DotMap(name='cifar10', datadir='../storage', feature='dss', type='image'), dataloader=DotMap(shuffle=True, batch_size=256, pin_memory=True, num_workers=8), model=DotMap(architecture='ResNet50_224', type='pre-defined', numclasses=10), ckpt=DotMap(is_load=False, is_save=True, dir='results/', save_every=20), loss=DotMap(type='CrossEntropyLoss', use_sigmoid=False), optimizer=DotMap(type='sgd', momentum=0.9, lr=0.01, weight_decay=0.0005, nesterov=False), scheduler=DotMap(type='cosine_annealing', T_max=300), dss_args=DotMap(type='GradMatch', fraction=0.3, select_every=5, lam=0.5, selection_type='PerClassPerGradient', v1=True, valid=False, kappa=0, eps=1e-100, linear_layer=True), train_args=DotMap(num_epochs=300, device='cuda', print_every=1, results_dir='results/', print_args=['val_loss', 'val_acc', 'tst_loss', 'tst_acc', 'time'], return_args=[]))
Files already downloaded and verified
Files already downloaded and verified
18it [00:01, 10.12it/s]
[06/27 16:41:12] train_sl INFO: Epoch: 1 , Validation Loss: 3.1551918701171875 , Validation Accuracy: 0.1914 , Test Loss: 3.5032728210449218 , Test Accuracy: 0.2142 , Timing: 7.0498366355896
15it [00:01, 10.10it/s]
[06/27 16:41:21] train_sl INFO: Epoch: 2 , Validation Loss: 2.387578009033203 , Validation Accuracy: 0.3002 , Test Loss: 2.735808560180664 , Test Accuracy: 0.3253 , Timing: 6.075047492980957
1it [00:00, 6.72it/s]
[06/27 16:41:31] train_sl INFO: Epoch: 3 , Validation Loss: 2.139058850097656 , Validation Accuracy: 0.3246 , Test Loss: 2.036042041015625 , Test Accuracy: 0.3344 , Timing: 6.058322191238403
8it [00:00, 10.55it/s]
[06/27 16:41:41] train_sl INFO: Epoch: 4 , Validation Loss: 3.5549482177734375 , Validation Accuracy: 0.3576 , Test Loss: 2.480993505859375 , Test Accuracy: 0.3838 , Timing: 5.7214953899383545
9it [00:01, 8.89it/s]
[06/27 16:41:50] train_sl INFO: Epoch: 5 , Validation Loss: 3.782627294921875 , Validation Accuracy: 0.3624 , Test Loss: 3.3407586791992188 , Test Accuracy: 0.39 , Timing: 5.925083160400391
12it [00:01, 10.56it/s]
4it [00:00, 8.48it/s]
5it [00:00, 10.51it/s]
15it [00:01, 11.37it/s]
16it [00:01, 11.82it/s]
18it [00:01, 13.24it/s]
11it [00:01, 9.24it/s]
7it [00:00, 7.66it/s]
2it [00:00, 12.04it/s]
15it [00:01, 11.98it/s]
[06/27 16:58:59] train_sl INFO: Epoch: 6, GradMatch subset selection finished, takes 1028.8181.

@shiyf129

@krishnatejakk
I got similar test results using the cifar10 dataset and a ResNet18 model.
For one epoch of training, the full dataset took about 50 seconds, while GradMatch and CRAIG each took more than 100 seconds.
Besides, GradMatch and CRAIG each took about 100 seconds to select the sub-dataset in an epoch.

  1. Can we preprocess the whole dataset first to get the weighted training sub-dataset, and then train directly with that weighted sub-dataset? That should shorten the training time. Is there an example of that?

  2. Is there a faster sub-dataset selection method?

Thank you.
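For question 1, a minimal sketch of the idea (hypothetical names; in practice the indices and per-sample weights would come from one GradMatch run, and the subset could then be wrapped with something like `torch.utils.data.Subset` and reused every epoch):

```python
import numpy as np

# Sketch: run selection ONCE, freeze the (indices, weights) pair, and
# reuse it for all epochs instead of re-selecting during training.
rng = np.random.default_rng(0)
n_train, fraction = 50_000, 0.1           # CIFAR-10-sized example

# Stand-ins for the indices and per-sample weights (gammas) that a
# GradMatch-style selector would return.
subset_idx = rng.choice(n_train, size=int(n_train * fraction), replace=False)
gammas = np.ones(len(subset_idx))

# During training, per-sample losses on the subset are combined with the
# selector's weights; random numbers stand in for real losses here.
per_sample_loss = rng.random(len(subset_idx))
weighted_loss = float((gammas * per_sample_loss).sum() / gammas.sum())

print(len(subset_idx))        # 5000
```

Whether training on such a frozen subset matches the accuracy of periodic re-selection is exactly the trade-off being discussed in this thread.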

[Full dataset]:
INFO: The length of dataloader: 2250
INFO: Training Timing: 50.17572069168091

[GradMatch]:
INFO: The length of dataloader: 225
INFO: GradMatch subset selection finished, takes 99.8966.
INFO: Training Timing: 104.97514295578003

[CRAIG]:
INFO: The length of dataloader: 225
INFO: subset selection finished, takes 108.4812.
INFO: Training Timing: 114.62646007537842

@animesh-007
Author

@shiyf129 What is the resolution of the images you are using while training? I am using 224x224.

@krishnatejakk
Collaborator

krishnatejakk commented Jul 21, 2022

@animesh-007 @shiyf129 I am working on the issue. We recently updated the OMP version in the GradMatch code, which improves its performance further. However, the new OMP version is making it slower in this case; I will debug why it is so slow here.

For faster training, one option is to use GradMatchPB (i.e., the per-batch version) or revert to the previous OMP version in the GradMatch strategy code below:

from ..helpers import OrthogonalMP_REG_Parallel, OrthogonalMP_REG, OrthogonalMP_REG_Parallel_V1

In the import statement, remove _V1 to revert to the previous version of the OMP code.
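Concretely, the suggested revert is a one-line change to that import in the GradMatch strategy file (a sketch of the fragment only; the exact module path depends on the repo version, and later in this thread the config flag `v1=False` is also set to select the older solver):

```python
# Before: imports the new OMP solver (_V1) as well
# from ..helpers import OrthogonalMP_REG_Parallel, OrthogonalMP_REG, \
#     OrthogonalMP_REG_Parallel_V1

# After: drop the _V1 import, per the maintainer's suggestion above,
# so the strategy uses the previous OMP implementation
from ..helpers import OrthogonalMP_REG_Parallel, OrthogonalMP_REG
```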

@shiyf129

> @shiyf129 What is the resolution of the images you are using while training? I am using 224x224.

I use the original cifar10 dataset, with a 32x32 image size.

@shiyf129

@krishnatejakk I tested the GradMatchPB algorithm and set v1=False to use the previous OMP version. I compared the first 10 epochs of training between the GradMatchPB algorithm and full-dataset training; the results show GradMatchPB takes longer, and its average accuracy is relatively low. Do you know the reason for this?

GradMatchPB

  • the mean epoch time is 26.70 (selection) + 32.06 (training) = 58.76 seconds
  • the mean test accuracy is 0.463

Full dataset training

  • the mean epoch training time is 50.867 seconds
  • the mean test accuracy is 0.7548

The GradMatchPB configuration used:

dss_args=dict(type="GradMatchPB",
            fraction=0.1,
            select_every=20,
            lam=0,
            selection_type='PerBatch',
            v1=False,
            valid=False,
            eps=1e-100,
            linear_layer=True,
            kappa=0),

GradMatchPB beginning 10 epoch training:

| Index | Subset selection time (s) | Training epoch time (s) | Test accuracy |
| --- | --- | --- | --- |
| 1 | 25.85 | 30.91 | 0.3588 |
| 2 | 25.61 | 30.72 | 0.3707 |
| 3 | 25.39 | 31.07 | 0.4201 |
| 4 | 28.71 | 34.43 | 0.4314 |
| 5 | 28.69 | 33.85 | 0.4748 |
| 6 | 25.81 | 31.17 | 0.485 |
| 7 | 29.03 | 34.72 | 0.4881 |
| 8 | 26.78 | 31.85 | 0.511 |
| 9 | 25.82 | 31.45 | 0.537 |
| 10 | 25.4 | 30.47 | 0.5535 |
| Mean | 26.7 | 32.06 | 0.463 |

Full dataset beginning 10 epoch training:

| Index | Training epoch time (s) | Test accuracy |
| --- | --- | --- |
| 1 | 51.59 | 0.5279 |
| 2 | 52.13 | 0.6543 |
| 3 | 50.17 | 0.7183 |
| 4 | 51.26 | 0.7495 |
| 5 | 51.62 | 0.7779 |
| 6 | 50.14 | 0.8205 |
| 7 | 47.99 | 0.8026 |
| 8 | 51.54 | 0.8324 |
| 9 | 49.91 | 0.8229 |
| 10 | 52.32 | 0.8423 |
| Mean | 50.867 | 0.7548 |

@krishnatejakk
Collaborator

@shiyf129 Why is subset selection happening every epoch? We usually set select_every to 20. Subset selection takes some time, and you don't need to select a subset every epoch.

Furthermore, training on a 10% subset should be roughly 10x faster per epoch than full-dataset training. From your logs, it doesn't seem that way. Can you check whether training on a fixed 10% subset of the dataset for one epoch is 10x faster than one epoch of full training?
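As a rough sanity check, amortizing the selection cost over select_every=20 epochs should already bring the per-epoch cost well below the full-dataset time (a back-of-the-envelope calculation using the mean timings reported earlier in the thread):

```python
# Amortized per-epoch cost with select_every=20, using the reported means.
selection_time = 26.70   # mean subset-selection time per selection (s)
subset_epoch = 32.06     # mean training time per epoch on the 10% subset (s)
full_epoch = 50.867      # mean training time per epoch on the full dataset (s)

amortized = selection_time / 20 + subset_epoch
speedup = full_epoch / amortized
print(amortized, speedup)   # roughly 33.4 s per epoch, ~1.5x over full
```

So even with select_every=20 the observed speedup over full training would be only about 1.5x, far from the ~10x one would expect from the dataloader lengths (225 vs 2250 batches), which is the anomaly being debugged here.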

@shiyf129

@krishnatejakk I modified the code to select a subset every 20 epochs.
I ran the cifar10 dataset on a ResNet18 model to compare GradMatchPB and full-dataset training.
Both ran for 10 minutes, recording the test accuracy every minute.
The average test accuracy of GradMatchPB is slightly lower than that of the full dataset.
What is the reason for this?

| | Full dataset | GradMatchPB (fraction=0.3) | GradMatchPB (fraction=0.1) |
| --- | --- | --- | --- |
| Average test accuracy | 0.7633 | 0.7515 | 0.6714 |

3 participants