Regression test results #170
To clarify, the reason fishnet-301116 is blank is because that is the current baseline, correct? |
Yes, exactly. I might also run regression tests for past versions, but currently fishnet-301116 is the baseline. |
Thanks. I was thinking that in the future the baseline may advance to the latest stable (non-regressing) release. |
I intended to add the test results, which are always relative to the previous version, so the only empty line by design would be the first line. However, we could of course also add a second table where the results are added to compare all versions with the latest or first one. |
Results after 23 hours of testing on 6 cores @ 4 GHz, stockfish-091216 vs stockfish-301116:

- 30s+2s: Score of stockfish-modern-091216-1thread vs stockfish-modern-301116-1thread: 249 - 118 - 13 [0.672]
- 35s+2s: Score of stockfish-modern-091216-1thread vs stockfish-modern-301116-1thread: 216 - 129 - 15 [0.621]
- 40s+2s: Score of stockfish-modern-091216-1thread vs stockfish-modern-301116-1thread: 223 - 106 - 11 [0.672]
- 45s+2s: Score of stockfish-modern-091216-1thread vs stockfish-modern-301116-1thread: 198 - 122 - 12 [0.614]
- 50s+2s: 170 - 123 - 7 [57.8%]
- 55s+2s: Score of stockfish-modern-301116-1thread vs stockfish-modern-091216-1thread: 107 - 180 - 13 [0.378]

Sum and perf :

Games are here: https://www.dropbox.com/s/54bn20y88ito86w/SF091216-vs-SF301116.zip?dl=0 |
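As an aside, the bracketed score fractions above map to an Elo difference via the standard logistic model. A minimal sketch (the function name and reference-match numbers taken from the first result above; the formula is the usual Elo/score relation, not code from this fork):

```python
import math

def elo_from_results(wins, losses, draws):
    """Approximate Elo difference implied by a match result (logistic model)."""
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games  # score fraction, e.g. 0.672
    return -400 * math.log10(1 / score - 1)

# First match above: 249 - 118 - 13 -> roughly +125 Elo
print(round(elo_from_results(249, 118, 13)))  # → 125
```

Note this point estimate carries large error bars at a few hundred games, which is why the thread treats small samples cautiously.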
It's important to note that the condition 10" + 0.1" will vary according to the hardware. This is why fishtest does a per-case rescaling, so that this translates to a different time control on each machine. |
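The rescaling mentioned above can be sketched as follows. This is an illustrative assumption, not fishtest's actual code: the function name and the `reference_nps` calibration constant are hypothetical, while fishtest derives the real factor from each worker's `bench` run.

```python
def scale_time_control(base_time, increment, machine_nps, reference_nps=1_600_000):
    """Rescale a time control so slower machines get proportionally more time.

    machine_nps: nodes per second measured on the worker (e.g. via a bench run).
    reference_nps: hypothetical calibration constant for the nominal hardware.
    """
    factor = reference_nps / machine_nps
    return base_time * factor, increment * factor

# A machine half as fast as the reference plays 20s+0.2s instead of 10s+0.1s
print(scale_time_control(10.0, 0.1, 800_000))
```

The effect is that a nominal 10"+0.1" test reaches comparable search depths regardless of which worker runs it.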
@arbolis Yes, good point. My tests have been performed on AWS c1.medium machines using 1 thread. I can perform benchmark tests as soon as the testing queue is empty (which is not going to happen soon...). |
There is a surprising result for fishnet-301116 vs. fishnet-151116 in standard chess. +50 Elo in standard chess cannot be explained by upstream changes, so it must have to do with this merge overwriting the custom repetition detection. |
Upstream have tested the custom repetition (on an older master) and it cost about 3 Elo. However, if the loss of the custom repetition detection causes the loss, then reapplying fb16fb4 should make up the difference... That merge changed many lines of code, but it's unclear to me what causes the +50 Elo... |
I have not done any tests to show that the mentioned merge caused the +50 Elo, but as far as I know this was the only functional change for standard chess besides upstream changes, so I do not see any other explanation. I would be interested in the link to the test result, because I could not find it. Anyway, I will test it to see whether it causes the +50 Elo. |
Currently running on http://tests.stockfishchess.org/tests : |
Argh, they're going to break my threefold repetition check again? Sigh... |
Maybe it will be corrected in the original SF now! :-) |
The mod passed the test! |
It will not be committed. I have seen Marco reject many simplification patches just because they don't fix one of his pet gripes with SF. |
But that fixes a buggy eval where SF claims 0.00 after the same position occurs twice. |
Off-topic and not a bug since it complies with the UCI protocol. If you really want to discuss whether it's a bug, open an issue. |
The communication interface has nothing to do with the correct internal position evaluation of an engine. |
The engine evaluation 0.0 is correct. Also correct is any evaluation the engine may return. However I customize this fork to return nonzero values. |
No, "draw" is not the correct evaluation of a position after the same position has occurred twice. |
If you really want to discuss whether it's a bug, open an issue. Just be aware that opening the issue will unleash an avalanche of questions. Upstream PR#925 is introducing a new feature for threefold repetition detection, which Stockfish never supported. |
I think I already raised the subject; Joona or Marco said that the change will not raise the level of the engine, which is why this eval won't be changed. They both think the only purpose of modifying the source is to raise the level (or to simplify the code). So this buggy eval remains... |
@Vinvin20 You still fail to open an issue, so I'm forced to respond here.

- It is acceptable (but not necessary) that the evaluation of
- Suppose I have an incomplete move list
- Suppose the players are competing under Sofia rules, or have other reasons they don't want a draw. What should the evaluation be?
- Suppose 5-fold repetition occurs and the position is fed to the engine. Under FIDE rules the arbiter should have claimed a draw. What should the evaluation be? |
- At 4...Ng8 Black can claim a draw (just before playing the move) but is not forced to do so.
- I don't understand your question about the "incomplete move list"...
- I don't know the details of Sofia rules when the two players show bad sportsmanship by forcing an artificial draw (do both lose?).
- 0.00, as this is a forced draw. |
Are there tricks in your questions? |
Not intentionally; I'm just trying to understand this complex issue. My perspective is twofold:
In the second example, an objective assessment is impossible (as the move history is unknown to the engine). UCI allows this use case and does not specify what "the engine's point of view" should be; for all the engine knows, the move list could have been

- Under Sofia rules (or match conditions where the opponent cannot claim a draw), it is inappropriate to force a
- Under 5-fold repetition, if prompted the engine may still provide output according to everything noted above. It's not the engine's responsibility to enforce rules when the arbiter fails to do so. |
@ddugovic Why don't you copy this into the Official Stockfish discussion? |
They aren't interested in lengthy debates about this sort of issue, and I have no disagreement with them. |
The tables now go back to the first release with the |
Atomic Chess (Phenom II x4, 15'+3", single core): SF 250916 - SF 141216: +18 -13 =9. I know 40 games are few, but still. 250916 scored 71% against Atomkraft and 141216 scored 70%. Let me know if I am wrong about the 1st table: does it mean that SF atomic 201016 is 517 Elo better than 131016, version 021116 is 43 Elo worse than 201016, 151116 is 2 Elo worse than 021116, and so on? |
@CPagador Yes, your interpretation of the table is correct. A merge of upstream changes ( Testing has changed regarding opening books, time controls and number of games, and it might well be that some of my early test results were inaccurate and caused regressive patches to pass, so your test results are not implausible. |
@ianfab atomic chess 200 games match at 2'+1": SF 250916 - SF 141216 : +78 -76 =46 |
An improved patch regarding the 3-fold draw passed: |
Added fishnet-231216. |
Some explanation about this patch from http://abrok.eu/stockfish/
|
Tests for fishnet-050117 are completed. |
Is it possible to find such a table with all the values for recent versions, please? |
@Vinvin20 New and also updated old regression test results can be found on http://35.161.250.236:6543/regression. It perhaps makes sense to add the link to the new results in the first comment. |
I didn't know about this graph, it's very good ! |
Update: More recent regression test results can be found on http://35.161.250.236:6543/regression.
Since upstream changes are not tested for variants, regression tests should be done on a regular basis. Furthermore, this helps to keep track of our progress. I will try to update this table regularly, but I will probably skip some releases if there are only a few days in between.
The table shows the Elo difference relative to the previous version, measured over 1000 games at a time control of 10''+0.1'' using opening books. If you want to contribute results, just post them below (and perhaps announce it beforehand to avoid duplicate testing) and I will include them in the tables.
Suicide and loop chess are not tested, because they are too similar to giveaway and crazyhouse, respectively.
Elo difference to previous version
*Reintroducing atomic SEE regained a loss of about 500 Elo in the previous version.
Elo difference to most recent version
Calculated by summing up the relative Elo differences.
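The summation described above can be sketched in a few lines. The version labels and deltas here are illustrative, reusing the example numbers discussed earlier in the thread (+517, -43, -2); the real values are in the first table:

```python
from itertools import accumulate

# Per-release Elo deltas, each relative to the previous version (first entry
# is the baseline, hence 0). Labels and values are illustrative.
deltas = {"SF-131016": 0, "SF-201016": 517, "SF-021116": 474 - 517, "SF-151116": -2}

# Elo difference of each version to the first one: running sum of the deltas
cumulative = dict(zip(deltas, accumulate(deltas.values())))
print(cumulative)  # → {'SF-131016': 0, 'SF-201016': 517, 'SF-021116': 474, 'SF-151116': 472}
```

Note that summing pairwise Elo estimates compounds their error bars, so the cumulative column is rougher than any single entry.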