Model Builder error on Azure Training #991

Closed
elbruno opened this issue Sep 9, 2020 · 10 comments
Labels
Bug Something isn't working Priority:0 Work that we can't release without Reported by: Customer

Comments

@elbruno

elbruno commented Sep 9, 2020

System Information (please complete the following information):

  • Model Builder Version: 16.1.1.2041102
  • Visual Studio Version: Visual Studio 2019 Preview, 16.8.0 Preview 2.1

Describe the bug
Completing the step-by-step flow to train an image classification scenario in the Azure environment fails with an exception.

To Reproduce
Steps to reproduce the behavior:

  1. Add Machine Learning to C# Project
  2. Select Image Classification scenario
  3. Select Azure as Training Environment
  4. Select training image folder
  5. Start Training

Expected behavior
Training should start once the images are uploaded. Instead, after uploading all the training images, Model Builder raises this exception:

at System.Threading.Tasks.Task`1.GetResultCore(Boolean waitCompletionNotification)
at Azure.MachineLearning.Services.Compute.ComputeTargetPageFetcher.FetchNextPage() in /_/src/Microsoft.ML.AzureMLClient/Compute/ComputeTargetPageFetcher.cs:line 35
at Azure.MachineLearning.Services.LazyEnumerator`1.d__9.MoveNext() in /_/src/Microsoft.ML.AzureMLClient/LazyEnumerator.cs:line 26
at System.Linq.Enumerable.WhereEnumerableIterator`1.MoveNext()
at System.Linq.Enumerable.FirstOrDefault[TSource](IEnumerable`1 source)
at AzureML.AutoMLRunnerImages.d__27.MoveNext() in /_/src/Microsoft.ML.ModelBuilder.AutoMLService/RemoteAutoML/AutoMLRunnerImages.cs:line 233
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.ML.ModelBuilder.AutoMLService.Experiments.AzureImageClassificationExperiment.d__13.MoveNext() in /_/src/Microsoft.ML.ModelBuilder.AutoMLService/Experiments/AzureImageClassificationExperiment.cs:line 69
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
at Microsoft.ML.ModelBuilder.AutoMLEngine.d__26.MoveNext() in /_/src/Microsoft.ML.ModelBuilder.AutoMLService/AutoMLEngineService/AutoMLEngine.cs:line 134
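
A trace like the one above is easier to triage once each frame is split into method, file, and line. The following is a minimal Python sketch of that idea, not part of Model Builder; the helper names (`parse_frames`, `first_frame_with_source`) are made up for illustration, and the sample text is a shortened copy of the frames in this report. It suggests that the innermost frame with source information is the compute target fetch.

```python
import re

# Matches a .NET stack frame such as
#   "at Namespace.Type.Method(args) in /path/File.cs:line 35"
# The " in <file>:line <n>" suffix is optional (framework frames omit it).
FRAME_RE = re.compile(
    r"^\s*at (?P<method>.+?\))(?: in (?P<file>.+):line (?P<line>\d+))?\s*$"
)


def parse_frames(trace: str):
    """Return a list of (method, file, line) tuples, one per stack frame."""
    frames = []
    for raw in trace.splitlines():
        match = FRAME_RE.match(raw)
        if match is None:
            continue  # skip separators such as "--- End of stack trace ... ---"
        line = match.group("line")
        frames.append(
            (match.group("method"), match.group("file"), int(line) if line else None)
        )
    return frames


def first_frame_with_source(trace: str):
    """Return the innermost frame that carries a source file, or None."""
    for frame in parse_frames(trace):
        if frame[1] is not None:
            return frame
    return None


# Shortened sample taken from the frames quoted in this issue.
SAMPLE = """\
at System.Threading.Tasks.Task`1.GetResultCore(Boolean waitCompletionNotification)
at Azure.MachineLearning.Services.Compute.ComputeTargetPageFetcher.FetchNextPage() in /_/src/Microsoft.ML.AzureMLClient/Compute/ComputeTargetPageFetcher.cs:line 35
--- End of stack trace from previous location where exception was thrown ---
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
"""

if __name__ == "__main__":
    print(first_frame_with_source(SAMPLE))
```

The parser expects one frame per line, so a trace pasted as a single run of text would need to be split on `" at "` first.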

image

Additional context
Attached the full runtime log.
b6e7c212-c030-45b8-8c31-a8c125ae1115.txt

@LittleLittleCloud
Contributor

It seems that AzClient can't get the compute info. Can you go to the Azure portal and check whether the selected compute was created successfully?

@arafattehsin

This is exactly my issue. And to answer you, @LittleLittleCloud - my compute is successfully created.

image

@elbruno
Author

elbruno commented Sep 14, 2020

I also tested with the latest version, 16.2.0.2046002 (the one that includes object detection), and I still get the same error.

image

LittleLittleCloud added the Bug and Priority:0 labels on Sep 14, 2020
@LittleLittleCloud
Contributor

Thanks for the feedback, @elbruno and @arafattehsin. We are taking a look at it.

@beccamc
Contributor

beccamc commented Sep 14, 2020

Thanks for reporting, @elbruno and @arafattehsin. This aggregate exception message isn't helpful (we need to fix that!). I'd like to know whether the run was started and then ran into a problem. Can you check the ML portal (https://ml.azure.com) and see if the run has any information?

If the run can't access the compute for some reason the experiment error should tell us the problem. Thanks for your patience while we figure this out!
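
To illustrate why a wrapped exception message tends to be unhelpful and how the inner cause could be surfaced instead, here is a minimal Python sketch. It is an analogy to the .NET aggregate exception mentioned above, not Model Builder code: `ComputeFetchError`, `fetch_compute_page`, `start_training`, and the message strings are all hypothetical stand-ins.

```python
class ComputeFetchError(Exception):
    """Hypothetical low-level failure, standing in for the real inner exception."""


def root_cause(exc: BaseException) -> BaseException:
    """Follow __cause__ / __context__ links down to the innermost exception."""
    seen = set()
    current = exc
    while True:
        seen.add(id(current))
        inner = current.__cause__ or current.__context__
        if inner is None or id(inner) in seen:
            return current
        current = inner


def fetch_compute_page():
    # Hypothetical failure at the lowest layer.
    raise ComputeFetchError("compute target could not be retrieved")


def start_training():
    try:
        fetch_compute_page()
    except ComputeFetchError as inner:
        # The wrapper hides the detail, much like a generic aggregate message.
        raise RuntimeError("One or more errors occurred.") from inner


def user_message(exc: BaseException) -> str:
    """Build a message from the innermost exception rather than the wrapper."""
    cause = root_cause(exc)
    return f"{type(cause).__name__}: {cause}"


if __name__ == "__main__":
    try:
        start_training()
    except RuntimeError as err:
        print("wrapper:", err)
        print("cause:  ", user_message(err))
```

Reporting the innermost exception keeps the generic wrapper out of the user-facing message while the full chain stays available in logs.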

If you're not familiar with the ML portal...

  • Click Experiments on the left nav.
  • Click on your experiment name (probably "ImageClassification" or "ObjectDetection"). You should see a run from when you were testing. It's probably failed or cancelled.

image

  • Go into that run. It might show some information on the error at the top, or the problem might be in a child run.

image

  • Check the child run for any helpful info. Sometimes it's inside a warning message at the top. You'll need to expand that to see the inner exception info. Alternatively, it's also often available in the Outputs + Logs tab on this page. You need to look through the logs there for the error information.
    image

@arafattehsin

Hi @beccamc - Thank you for this detailed explanation. Unfortunately, I can't even see any experiments being executed. My experiment is created but I don't see any run.

image

@LittleLittleCloud
Contributor

@arafattehsin This is so strange... It looks like the training fails at the compute-fetching step, so the experiment never even gets a chance to be created.

I'm trying to reproduce the error. Could you tell me how you created the compute: did you use the UI in Model Builder, or the Azure portal? Could you also share your compute's properties and the configuration you used to create it?

@arafattehsin

@LittleLittleCloud I created it using the UI in Model Builder

image

If you can tell me a way that worked for you, please do, so that I can try the exact same way to make it work.

@LittleLittleCloud
Contributor

@arafattehsin I didn't do anything special. Could you try creating the compute in the Azure portal? This is the configuration I use:
image

@LittleLittleCloud
Contributor

Using a dedicated machine solves the problem. A better error message is needed when launching training on a low-priority machine fails.
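
One way the requested error message could look is a pre-flight check that runs before training is launched. The Python sketch below is illustrative only: `ComputeInfo`, `ComputeNotUsableError`, `check_compute`, and the string values for priority and provisioning state are assumptions, not a real SDK or Model Builder API. It also assumes the tool chooses to require a dedicated machine; the thread only establishes that switching to one resolved this report.

```python
from dataclasses import dataclass


@dataclass
class ComputeInfo:
    """Hypothetical summary of a compute target; not a real SDK type."""
    name: str
    vm_priority: str          # assumed values: "dedicated" or "lowpriority"
    provisioning_state: str   # assumed values such as "Succeeded" or "Failed"


class ComputeNotUsableError(Exception):
    """Raised with an actionable message when the compute cannot be used."""


def check_compute(compute: ComputeInfo) -> None:
    """Raise a descriptive error before training starts, instead of a raw trace."""
    if compute.provisioning_state != "Succeeded":
        raise ComputeNotUsableError(
            f"Compute '{compute.name}' is in state "
            f"'{compute.provisioning_state}'; wait for provisioning to succeed "
            "or recreate it."
        )
    if compute.vm_priority.lower() != "dedicated":
        raise ComputeNotUsableError(
            f"Compute '{compute.name}' uses priority '{compute.vm_priority}'. "
            "Training could not be launched on it; try a dedicated machine."
        )


if __name__ == "__main__":
    try:
        check_compute(ComputeInfo("my-compute", "lowpriority", "Succeeded"))
    except ComputeNotUsableError as err:
        print(err)
```

Failing early with the compute name and the offending setting gives the user something to act on, which the raw stack trace in this report did not.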
