Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable Conditional Numerical Reproducibility for tests #4569

Merged
merged 3 commits into from
Feb 4, 2020

Conversation

sharwell
Copy link
Member

No description provided.

Copy link
Contributor

@frank-dong-ms-zz frank-dong-ms-zz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:shipit:

@codecov
Copy link

codecov bot commented Dec 12, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@9c9a5c4). Click here to learn what that means.
The diff coverage is 91.66%.

@@            Coverage Diff            @@
##             master    #4569   +/-   ##
=========================================
  Coverage          ?   75.84%           
=========================================
  Files             ?      947           
  Lines             ?   172190           
  Branches          ?    18579           
=========================================
  Hits              ?   130593           
  Misses            ?    36422           
  Partials          ?     5175
Flag Coverage Δ
#Debug 75.84% <91.66%> (?)
#production 71.42% <ø> (?)
#test 90.71% <91.66%> (?)
Impacted Files Coverage Δ
test/Microsoft.ML.TimeSeries.Tests/TimeSeries.cs 87.85% <ø> (ø)
...crosoft.ML.TimeSeries.Tests/TimeSeriesDirectApi.cs 99.28% <100%> (ø)
test/Microsoft.ML.TestFramework/GlobalBase.cs 29.03% <100%> (ø)
...est/Microsoft.ML.Predictor.Tests/TestPredictors.cs 67.9% <50%> (ø)

@sharwell sharwell force-pushed the enable-cnr branch 5 times, most recently from 320e765 to 8b5d1e2 Compare January 16, 2020 14:45
@sharwell sharwell marked this pull request as ready for review January 17, 2020 21:15
@sharwell sharwell requested a review from a team as a code owner January 17, 2020 21:15
@sharwell
Copy link
Member Author

I'm going to submit some parts of this as separate pull requests, and then rebase to focus it and fix conflicts.

@@ -283,6 +283,11 @@ public void BinaryClassifierLogisticRegressionTest()
[Trait("Category", "SkipInCI")]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❓ I guess I am missing something... but shouldn't this line be removed in order to check that the updated baselines are actually correct?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These baselines haven't changed in the master branch since I first implemented this pull request. I'm assuming the new values (which were correct then) are still correct now. They would be validated as part of #4722.

@sharwell
Copy link
Member Author

Based on the results of #4722 as a validation, the current plan is:

  1. Merge this PR, since the baselines here are a clear statistical improvement
  2. As a separate PR, add baselines for Windows x86, Centos (and Ubuntu separately if necessary), and MacOS
  3. Update [Test Stability] PR to Test BinaryClassifierSymSgdTest #4722 to determine the new status

@antoniovs1029
Copy link
Member

antoniovs1029 commented Feb 1, 2020

For the record, the validation @sharwell has just mentioned, consisted in running 100 times the BinaryClassifierSymSgdTest test on every build configuration. This was to validate the changes he made to the baseline in here, which is used [exclusively?] on that test which is (at the moment of writing) disabled, and that's why it was needed to be validated on PR #4722 where I reenabled the test.

It passed on every Windows x64 build, except for one, where it failed only 1 out of the 100 iterations.

It failed on every other build configuration, but it is thought to be because CNR works differently on each OS and configuration (MacOS, CentOS, Ubuntu and Windows x86), and then it would be necessary to create a new baseline for each one of these. Notice that that test was disabled recently because it was very flaky on most of our CI builds, and that it was previously disabled on MacOS and x86 since #624 and #1008 respectively. So when doing the baselines for the other build configurations, it would be ok to keep this in mind, and see if that now that CNR is enabled, these pending test scenarios are now resolved.

For reference, the same experiment (running it 100 times on all the build configurations) had 3 Windows x64 builds that failed, each one with 100 failures (one per iteration), when ran without the enabled CNR.

@antoniovs1029
Copy link
Member

antoniovs1029 commented Feb 1, 2020

After briefly discussing it with @harishsk , I will run all the tests (including all the disabled ones) with the changes in this PR, and running the test 100 times along with them, just to double check. 😄 For convenience, I am doing it on #4722

It again passed in all Windows x64 builds, except for one, where it failed only 1 out of the 100 iterations. As expected, it didn't pass in the other builds.

@antoniovs1029
Copy link
Member

antoniovs1029 commented Feb 3, 2020

@harishsk has expressed that he doesn't know if it's ok to have different baselines for each build configuration, and would like @eerhardt , @justinormont , and @tannergooding opinion on this regard, before accepting enabling the CNR.

@antoniovs1029
Copy link
Member

antoniovs1029 commented Feb 3, 2020

So I have just ran this again on #4722 , and actually now even the Windows x64 are failing, it fails in all 100 iterations of the BinaryClassifierSymSgdTest test. It doesn't seem to me that any update on master during the weekend could have changed the baseline here again, so I am not sure why this is happening.

@sharwell can you please remove the [Trait("Category", "SkipInCI")] attribute from the test in here, just temporarily, to rerun the CI in here, to be sure that your changes to the baseline are actually correct? thanks!

@sharwell
Copy link
Member Author

sharwell commented Feb 3, 2020

@antoniovs1029 the baselines were altered in #4710

@antoniovs1029
Copy link
Member

antoniovs1029 commented Feb 4, 2020

@antoniovs1029 the baselines were altered in #4710

You're right. On that PR it's added that the BinaryClassifierSymSGD should also print its bias and weights into the output. So it will only be necessary to add a couple of lines to the baseline.

After running it on the CI, I believe the updated baseline (for Windows x64) should be

On SymSGD-CV-breast-cancer-out.txt

Bias: -467.9297, Weights: [5.415065,76.39395,22.351555,-11.988394,-28.264456,44.58415,22.720116,11.132539,2.851256]
Not training a calibrator because it is not needed.
Not adding a normalizer.
Data fully loaded into memory.
Initial learning rate is tuned to 100.000000
Bias: -484.28625, Weights: [-12.787148,140.42905,121.938194,37.527378,-129.81407,70.90587,-89.37068,81.64309,-32.32781]

On SymSGD-TrainTest-breast-cancer-out.txt

Bias: -473.82773, Weights: [1.8576517,52.924484,9.490602,14.593927,-17.657146,10.126332,-8.042431,1.9634984,26.753149]

By the way, I've gotten baselines for the other platforms and made a couple of experiments with them. So far it seems that enabling CNR does yield more consistent results from the test, although different baselines would indeed be necessary for each platform. I still need to do further experiments on this, though.

@sharwell
Copy link
Member Author

sharwell commented Feb 4, 2020

@antoniovs1029 baselines have been updated

@sharwell sharwell mentioned this pull request Feb 4, 2020
@sharwell sharwell merged commit cdd309e into dotnet:master Feb 4, 2020
@sharwell sharwell deleted the enable-cnr branch February 4, 2020 23:13
@ghost ghost locked as resolved and limited conversation to collaborators Mar 19, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants