Cancellation in Image Classification (fixes #4632) #4650

Merged
merged 11 commits into from Jan 17, 2020

antoniovs1029 commented Jan 13, 2020

Adds support for cancellation to the Image Classification trainer in a similar manner as done in #3062 (and other PRs) by adding cancellation checkpoints to the train method.

I've tested it by running the sample for this trainer. Since the other PRs that added cancellation checkpoints don't include unit tests, I didn't include any here either.

Fixes #4632 .
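
For readers unfamiliar with the checkpoint pattern described above, here is a minimal, self-contained sketch of cooperative cancellation in an epoch loop. `SketchHost` and `EpochLoopSketch` are hypothetical stand-ins written for illustration; they are not the real ML.NET host or trainer, and the exception type thrown on cancellation is an assumption.

```csharp
using System;

// Hypothetical stand-in for the trainer's host; NOT the real ML.NET host type.
public sealed class SketchHost
{
    private volatile bool _canceled;

    // Requests cancellation; takes effect at the next checkpoint.
    public void Cancel() => _canceled = true;

    // Checkpoint: a cheap flag read that throws once cancellation was requested.
    // (Assumption: cancellation surfaces as OperationCanceledException.)
    public void CheckAlive()
    {
        if (_canceled)
            throw new OperationCanceledException("Operation was canceled.");
    }
}

public static class EpochLoopSketch
{
    // Runs up to `epochs` iterations and returns how many completed.
    public static int Train(SketchHost host, int epochs, Action<int> runEpoch)
    {
        int completed = 0;
        for (int epoch = 0; epoch < epochs; epoch += 1)
        {
            host.CheckAlive(); // checkpoint at the top of every epoch
            runEpoch(epoch);
            completed += 1;
        }
        return completed;
    }
}
```

With one checkpoint per epoch, a cancellation request is honored at the next epoch boundary at the latest; the epoch already in progress still runs to completion.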

@antoniovs1029 antoniovs1029 requested a review from dotnet/mlnet-core as a code owner Jan 13, 2020

antoniovs1029 commented Jan 13, 2020

I'm not sure whether I should also add a .CheckAlive() checkpoint inside the CacheFeaturizedImagesToDisk method of the Image Classification trainer. That method can take a couple of minutes, but once it finishes, the trainer hits the checkpoint I've already added in TrainAndEvaluateClassificationLayer anyway.

Also, if anyone has other opinions as to where to put more checkpoints, please, let me know!

@antoniovs1029 antoniovs1029 requested a review from codemzs Jan 13, 2020
@@ -992,6 +995,7 @@ public Tensor ProcessImage(in VBuffer<byte> imageBuffer)

for (int epoch = 0; epoch < epochs; epoch += 1)
{
Host.CheckAlive();

codemzs Jan 14, 2020

Member

Host.CheckAlive();

I would put the check only in this loop and in CreateFeaturizedCacheFile. Please also report the perf numbers before and after the change. Please remove CheckAlive from everywhere else, as it's not very significant there and only pollutes the code. You also need to call TryCleanupTemporaryWorkspace for a graceful termination. #Closed

antoniovs1029 Jan 14, 2020

Author Member

I have added a new method "CheckAlive" to the ImageClassification trainer, with a try...catch to call TryCleanupTemporaryWorkspace when it's needed.

Also changed the places where I added the checkpoints.

I will see how to get the perf difference now. #Closed
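
A rough sketch of what such a wrapper could look like, assuming the host check throws `OperationCanceledException` on cancellation. The class name, constructor, and cleanup body are hypothetical; only the method names `CheckAlive` and `TryCleanupTemporaryWorkspace` come from the discussion, and the merged code may differ.

```csharp
using System;
using System.IO;

// Hypothetical sketch; names mirror the discussion, bodies are assumptions.
public sealed class CleanupOnCancelSketch
{
    private readonly Action _hostCheckAlive; // stand-in for Host.CheckAlive()
    private readonly string _workspacePath;  // temporary workspace directory

    public CleanupOnCancelSketch(Action hostCheckAlive, string workspacePath)
    {
        _hostCheckAlive = hostCheckAlive ?? throw new ArgumentNullException(nameof(hostCheckAlive));
        _workspacePath = workspacePath ?? throw new ArgumentNullException(nameof(workspacePath));
    }

    // Trainer-level checkpoint: clean up first, then let the cancellation propagate.
    public void CheckAlive()
    {
        try
        {
            _hostCheckAlive();
        }
        catch (OperationCanceledException)
        {
            TryCleanupTemporaryWorkspace();
            throw; // rethrow, preserving the original stack trace
        }
    }

    // Best-effort delete; a failed cleanup must not mask the cancellation.
    private void TryCleanupTemporaryWorkspace()
    {
        try
        {
            if (Directory.Exists(_workspacePath))
                Directory.Delete(_workspacePath, recursive: true);
        }
        catch (IOException) { }
        catch (UnauthorizedAccessException) { }
    }
}
```

Rethrowing after the cleanup keeps the behavior callers expect from a canceled run, while the temporary files are removed before the exception leaves the trainer.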

antoniovs1029 Jan 14, 2020

Author Member

So I ran the ImageClassificationBench.TrainResnetV250 benchmark, with and without the changes of this PR, and they both behaved in pretty much the same way.

Without the changes this was the summary output of the benchmark:

          Method |    Mean |   Error |   StdDev | Extra Metric |
---------------- |--------:|--------:|---------:|-------------:|
 TrainResnetV250 | 41.55 s | 5.580 s | 0.3058 s |            - |

And with the changes, the summary was:

          Method |    Mean |   Error |   StdDev | Extra Metric |
---------------- |--------:|--------:|---------:|-------------:|
 TrainResnetV250 | 40.10 s | 2.723 s | 0.1493 s |            - |

So on average the version with the changes was reported to have run faster.

In any case, the CheckAlive() method only evaluates a few if-statements, so I don't think it can introduce a meaningful performance difference (given that image classification training is expected to take a considerable amount of time anyway). #Closed

antoniovs1029 Jan 17, 2020

Author Member

So, as suggested online by @codemzs, I have rerun the benchmarks, this time using the CIFAR-10 dataset.

Without the changes introduced in the PR the summary is as follows:

          Method |    Mean |   Error |   StdDev | Extra Metric |
---------------- |--------:|--------:|---------:|-------------:|
 TrainResnetV250 | 79.29 m | 4.850 m | 0.2658 m |            - |

With the changes:

          Method |    Mean |   Error |  StdDev | Extra Metric |
---------------- |--------:|--------:|--------:|-------------:|
 TrainResnetV250 | 78.82 m | 21.71 m | 1.190 m |            - |

So, again, my understanding is that there's some variability in the time it takes to train this model (and that's why the benchmark with the changes ran a little bit faster), and the introduction of the CheckAlive() method doesn't really have an impact on the performance of this. #Closed
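
One way to make the "within variability" reading concrete is to check whether the reported intervals overlap. The sketch below assumes the Error column is the half-width of a confidence interval around the Mean (BenchmarkDotNet documents it as half of the 99.9% confidence interval); the helper name is hypothetical, and interval overlap is only a rough sanity check, not a formal significance test.

```csharp
using System;

public static class BenchmarkOverlapSketch
{
    // True when [meanA ± errorA] and [meanB ± errorB] share at least one point.
    public static bool IntervalsOverlap(double meanA, double errorA, double meanB, double errorB)
    {
        return Math.Abs(meanA - meanB) <= errorA + errorB;
    }
}
```

For the first run (seconds), `IntervalsOverlap(41.55, 5.580, 40.10, 2.723)` compares a gap of 1.45 against a combined error of 8.303; for the CIFAR-10 run (minutes), `IntervalsOverlap(79.29, 4.850, 78.82, 21.71)` compares 0.47 against 26.56. Both intervals overlap, which is consistent with the conclusion that the checkpoints have no measurable cost.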

codemzs commented Jan 14, 2020

CacheFeaturizedImagesToDisk can take significant time, so we must add a check there.




codemzs left a comment

:shipit:

@antoniovs1029 antoniovs1029 merged commit 6210c38 into dotnet:master Jan 17, 2020
17 of 19 checks passed
MachineLearning-CodeCoverage: Build #20200114.2 had test failures
MachineLearning-CodeCoverage (Windows_x64 Build_Debug): failed
MachineLearning-CI: Build #20200114.2 had test failures
MachineLearning-CI (Centos_x64_NetCoreApp30 Debug_Build): succeeded
MachineLearning-CI (Centos_x64_NetCoreApp30 Release_Build): succeeded
MachineLearning-CI (MacOS_x64_NetCoreApp21 Debug_Build): succeeded
MachineLearning-CI (MacOS_x64_NetCoreApp21 Release_Build): succeeded
MachineLearning-CI (Ubuntu_x64_NetCoreApp21 Debug_Build): succeeded
MachineLearning-CI (Ubuntu_x64_NetCoreApp21 Release_Build): succeeded
MachineLearning-CI (Windows_x64_NetCoreApp21 Debug_Build): succeeded
MachineLearning-CI (Windows_x64_NetCoreApp21 Release_Build): succeeded
MachineLearning-CI (Windows_x64_NetCoreApp30 Debug_Build): succeeded
MachineLearning-CI (Windows_x64_NetCoreApp30 Release_Build): succeeded
MachineLearning-CI (Windows_x64_NetFx461 Debug_Build): succeeded
MachineLearning-CI (Windows_x64_NetFx461 Release_Build): succeeded
MachineLearning-CI (Windows_x86_NetCoreApp21 Debug_Build): succeeded
MachineLearning-CI (Windows_x86_NetCoreApp21 Release_Build): succeeded
WIP: Ready for review
license/cla: All CLA requirements met.