
Regression test results #170

Closed
ianfab opened this issue Dec 11, 2016 · 39 comments

@ianfab
Collaborator

ianfab commented Dec 11, 2016

Update: More recent regression test results can be found on http://35.161.250.236:6543/regression.

Since upstream changes are not tested for variants, regression tests should be done on a regular basis. Furthermore, this helps to keep track of our progress. I will try to update this table regularly, but I will probably skip some releases if there are only a few days in between.

The table shows each version's Elo difference to the previous version in the table, measured over 1000 games at a time control of 10''+0.1'' with opening books. If you want to contribute results, just post them below (ideally announcing beforehand to avoid duplicate testing) and I will include them in the tables.

Suicide and loop chess are not tested, because they are too similar to giveaway and crazyhouse, respectively.
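
The error margins quoted below (e.g. 2.78+-13.7) can be reproduced from raw W/L/D counts. A sketch, assuming the margins are roughly 95% confidence intervals on a logistic Elo estimate (the exact procedure used for the table isn't stated, so this is illustrative only):

```python
import math

def elo_with_error(wins, losses, draws, z=1.96):
    """Elo estimate and approximate 95% confidence half-width for a match."""
    n = wins + losses + draws
    p = (wins + 0.5 * draws) / n              # average score per game
    elo = -400.0 * math.log10(1.0 / p - 1.0)  # logistic Elo from score
    # per-game score variance (games scored 1, 0, or 0.5)
    var = (wins * (1 - p) ** 2 + losses * p ** 2 + draws * (0.5 - p) ** 2) / n
    se = z * math.sqrt(var / n)               # CI half-width on the score scale
    lo = -400.0 * math.log10(1.0 / (p - se) - 1.0)
    hi = -400.0 * math.log10(1.0 / (p + se) - 1.0)
    return elo, (hi - lo) / 2
```

For a drawish 1000-game sample such as `elo_with_error(205, 195, 600)` this gives a small positive Elo with a margin around +-13.6, consistent with the 1000-game margins in the table.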

Elo difference to previous version

| version | chess | giveaway | atomic | crazyhouse | horde | king of the hill | losers | racing kings | three-check |
|---|---|---|---|---|---|---|---|---|---|
| fishnet-131016 | | | | not supported | | | not supported | | |
| fishnet-201016 | 2.78+-13.7 | -9.38+-15.5 | 517.07+-42.3* | not supported | 131.74+-23.0 | 12.51+-19.3 | not supported | -0.35+-17.6 | 6.25+-20.5 |
| fishnet-021116 | -12.51+-14.0 | 6.95+-14.9 | -43.30+-19.4 | first version | 294.61+-29.5 | 41.89+-19.9 | not supported | 217.35+-23.8 | 7.64+-20.6 |
| fishnet-151116 | -4.86+-13.8 | 152.60+-17.5 | -2.43+-17.9 | 126.57+-22.6 | 11.47+-21.4 | -17.04+-19.3 | not supported | -14.60+-16.5 | 5.91+-20.4 |
| fishnet-301116 | 53.58+-12.3 | 47.90+-15.5 | 28.90+-16.8 | 135.76+-22.7 | 7.30+-21.3 | 35.91+-18.5 | not supported | 22.27+-15.7 | -9.73+-19.9 |
| fishnet-091216 | 5.91+-13.6 | -1.39+-15.5 | 17.39+-17.3 | 132.14+-22.7 | 43.30+-21.5 | 6.60+-18.6 | not supported | 11.82+-16.5 | -3.47+-20.4 |
| fishnet-231216 | -23.31+-12.5 | -26.46+-15.0 | -13.56+-16.7 | 32.05+-20.9 | -1.04+-21.3 | 39.43+-19.0 | not supported | -13.21+-15.5 | 14.60+-20.0 |
| fishnet-050117 | 9.38+-13.8 | -22.62+-15.8 | 53.22+-17.8 | 30.65+-21.1 | 30.30+-21.5 | 11.82+-19.0 | first version | 12.51+-17.0 | 2.08+-20.7 |

*Reintroducing atomic SEE regained a loss of about 500 Elo in the previous version.

Elo difference to most recent version

Calculated by summing up the relative Elo differences.

| version | chess | giveaway | atomic | crazyhouse | horde | king of the hill | losers | racing kings | three-check |
|---|---|---|---|---|---|---|---|---|---|
| fishnet-131016 | -30.97 | -147.60 | -557.29 | not supported | -517.68 | -131.12 | not supported | -235.79 | -23.28 |
| fishnet-201016 | -28.19 | -156.98 | -40.22 | not supported | -385.94 | -118.61 | not supported | -236.14 | -17.03 |
| fishnet-021116 | -40.70 | -150.03 | -83.52 | -457.17 | -91.33 | -76.72 | not supported | -18.79 | -9.39 |
| fishnet-151116 | -45.56 | 2.57 | -85.95 | -330.60 | -79.86 | -93.76 | not supported | -33.39 | -3.48 |
| fishnet-301116 | 8.02 | 50.47 | -57.05 | -194.84 | -72.56 | -57.85 | not supported | -11.12 | -13.21 |
| fishnet-091216 | 13.93 | 49.08 | -39.66 | -62.70 | -29.26 | -51.25 | not supported | 0.70 | -16.68 |
| fishnet-231216 | -9.38 | 22.62 | -53.22 | -30.65 | -30.30 | -11.82 | not supported | -12.51 | -2.08 |
| fishnet-050117 | | | | | | | | | |
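
The second table follows from the first by summing the per-release deltas. A sketch using the chess column (values copied from the first table; the most recent release is the reference at 0):

```python
# Chess-column deltas from the first table, each relative to the previous release.
deltas = [2.78, -12.51, -4.86, 53.58, 5.91, -23.31, 9.38]

def elo_vs_latest(deltas):
    """Difference of each release to the most recent one:
    the negated sum of all deltas that come after it."""
    return [-sum(deltas[i:]) for i in range(len(deltas) + 1)]

print([round(x, 2) for x in elo_vs_latest(deltas)])
# -> [-30.97, -28.19, -40.7, -45.56, 8.02, 13.93, -9.38, 0]
```
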
@ddugovic
Owner

To clarify, the reason fishnet-301116 is blank is because that is the current baseline, correct?

@ianfab
Collaborator Author

ianfab commented Dec 11, 2016

Yes, exactly. I might also do regression tests for past versions, but currently fishnet-301116 is the baseline.

@ddugovic
Owner

Thanks. I was thinking that in the future the baseline may advance to the latest stable (non-regressing) release.

@ianfab
Collaborator Author

ianfab commented Dec 11, 2016

I intended to add the test results, which are always relative to the previous version, so by design the only empty line would be the first one. However, we could of course also add a second table where the results are summed up to compare all versions with the latest or first one.

@Vinvin20

Vinvin20 commented Dec 12, 2016

Results after 23 hours of testing on 6 cores @ 4 GHz, stockfish-091216 vs stockfish-301116:

30s+2s : Score of stockfish-modern-091216-1thread vs stockfish-modern-301116-1thread: 249 - 118 - 13 [0.672]
ELO difference: 124.89 +/- 36.62

35s+2s : Score of stockfish-modern-091216-1thread vs stockfish-modern-301116-1thread: 216 - 129 - 15 [0.621]
ELO difference: 85.66 +/- 36.26

40s+2s : Score of stockfish-modern-091216-1thread vs stockfish-modern-301116-1thread: 223 - 106 - 11 [0.672]
ELO difference: 124.64 +/- 38.77

45s+2s : Score of stockfish-modern-091216-1thread vs stockfish-modern-301116-1thread: 198 - 122 - 12 [0.614]
ELO difference: 80.97 +/- 37.77

50s+2s : 170 - 123 - 7 [57.8%]

55s+2s : Score of stockfish-modern-301116-1thread vs stockfish-modern-091216-1thread: 107 - 180 - 13 [0.378]
ELO difference: -86.27 +/- 39.74

Sum and perf :
2012 games : +1236 -705 =71
Perf : (1236-705)/2012*400 = +105.6 Elo
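
The "Perf" figure above uses the linear approximation Elo ~ 400*(W-L)/N, which can be checked directly:

```python
# Linear Elo approximation used in the 'Sum and perf' line above.
def perf_elo(wins, losses, games):
    return 400.0 * (wins - losses) / games

print(round(perf_elo(1236, 705, 2012), 1))  # -> 105.6
```

Note this linear form overestimates at larger margins; the usual logistic formula on the same totals gives roughly 94 Elo.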

Games are here : https://www.dropbox.com/s/54bn20y88ito86w/SF091216-vs-SF301116.zip?dl=0

@arbolis

arbolis commented Dec 12, 2016

It's important to note that the condition 10" + 0.1" will vary according to the hardware. This is why fishtest does a per-case rescaling, so that this translates to a different time control for each machine.

@ianfab
Collaborator Author

ianfab commented Dec 12, 2016

@arbolis Yes, good point. My tests have been performed on AWS c1.medium machines using 1 thread. I can perform benchmark tests as soon as the testing queue is empty (which is not going to happen soon...).

@ianfab
Collaborator Author

ianfab commented Dec 12, 2016

There is a surprising result for fishnet-301116 vs. fishnet-151116 in standard chess. +50 Elo in standard chess cannot be explained by upstream changes, so it must have to do with this merge overwriting the custom repetition detection.

@ddugovic
Owner

Upstream tested the custom repetition detection (on an older master) and it cost about 3 Elo. However, if losing the custom repetition detection caused the regression, then reapplying fb16fb4 should make up the difference...

That merge changed many lines of code, but it's unclear to me what causes the +50 Elo...

@ianfab
Collaborator Author

ianfab commented Dec 13, 2016

I have not done any tests to show that the mentioned merge caused the +50 Elo, but as far as I know this was the only functional change for standard chess besides upstream changes, so I do not see any other explanation. I would be interested in the link to the test result, because I could not find it. Anyway, I will test it to see whether it causes the +50 Elo.

@Vinvin20

Currently running on http://tests.stockfishchess.org/tests :
threefold_opt : "Optimized version of my previous solution. "

@ddugovic
Owner

Argh, they're going to break my threefold repetition check again? Sigh...

@Vinvin20

Maybe it will be fixed in original SF now! :-)

@Vinvin20

The mod passed the test!

@stockfishdeveloper

stockfishdeveloper commented Dec 13, 2016

It will not be committed. I have seen Marco reject many simplification patches just because they don't fix one of his pet gripes with SF.
He is also extremely strict on actually simplifying the code: this patch adds lines.

@Vinvin20

But that fixes a buggy eval where SF claims 0.00 after the second occurrence of the same position.

@ddugovic
Owner

Off-topic and not a bug since it complies with the UCI protocol. If you really want to discuss whether it's a bug, open an issue.

@Vinvin20

The communication interface has nothing to do with the correct internal position evaluation of an engine.

@ddugovic
Owner

The engine evaluation 0.0 is correct.

Also correct is any evaluation the engine may return. However I customize this fork to return nonzero values.

@Vinvin20

Vinvin20 commented Dec 13, 2016

No, "draw" is not the correct evaluation of a position that has occurred only twice.

@ddugovic
Owner

ddugovic commented Dec 13, 2016

If you really want to discuss whether it's a bug, open an issue. Just be aware that opening the issue will unleash an avalanche of questions.

Upstream PR#925 is introducing a new feature for threefold repetition detection, which Stockfish never supported.

@Vinvin20

I think I already raised the subject; Joona or Marco said that the change would not raise the level of the engine, which is why this eval won't be changed. They both think the only purpose of modifying the source is to raise the level (or to simplify the code). So this buggy eval remains ...

@ddugovic
Owner

@Vinvin20 You still fail to open an issue, so I'm forced to respond here.

It is acceptable (but not necessary) that the evaluation of 1. Nf3 Nf6 2. Ng1 Ng8 3. Nf3 Nf6 4. Ng1 Ng8 5. Nf3 Nf6 be 0.0.

Suppose I have an incomplete move list 5. Nf3 Nf6. What should the evaluation be?

Suppose the players are competing under Sofia rules, or have other reasons they don't want a draw. What should the evaluation be?

Suppose 5-fold repetition occurs and the position is fed to the engine. Under FIDE rules the arbiter should have claimed a draw. What should the evaluation be?

@Vinvin20

Vinvin20 commented Dec 13, 2016

At 4...Ng8 black can claim a draw (just before playing the move) but is not forced to do so.
At 5. Nf3 white can claim a draw (just before playing the move) but is not forced to do so. 0.00 is the correct eval here, as white is slightly better and black can claim a draw.

Suppose I have an incomplete move list 5. Nf3 Nf6. What should the evaluation be?

I don't understand your question about an "incomplete move list" ...

Suppose the players are competing under Sofia rules, or have other reasons they don't want a draw. What should the evaluation be?

I don't know the details of the Sofia rules for when the two players show bad sportsmanship by forcing an artificial draw (do both lose?).

Suppose 5-fold repetition occurs and the position is fed to the engine. Under FIDE rules the arbiter should have claimed a draw. What should the evaluation be?

0.00 as this is a forced draw.

@Vinvin20

Are there tricks in your questions?

@ddugovic
Owner

ddugovic commented Dec 13, 2016

Not intentionally; I'm just trying to understand this complex issue. My perspective is twofold:

  1. Information provided by Stockfish must comply with the protocol:
     * score
       * cp - the score from the engine's point of view
       * lowerbound - the score is just a lower bound.
       * upperbound - the score is just an upper bound.
  2. The purpose of scoring moves is to order them and identify the best move, not to provide an objective assessment of their quality. To that end it is convenient but unnecessary to use the value 0.0 on the third or fourth repetition (as in the first example).

In the second example, an objective assessment is impossible (as the move history is unknown to the engine). UCI allows this use case and does not specify what "the engine's point of view" should be; for all the engine knows, the move list could have been 1. Nf3 Nf6 2. Ng5 Ng4 3. Nh3 Nh6 4. Ng1 Ng8 5. Nf3 Nf6 or could have been 1. Nf3 Nf6 2. Ng1 Ng8 3. Nf3 Nf6 4. Ng1 Ng8 5. Nf3 Nf6. Based on the incomplete information, the engine could assign any value, although using retrograde analysis we can conclude 5. Nf3 Nf6 carries some risk whereas 5. e4 e5 does not risk a repetition claim by the opponent.

Under Sofia rules (or match conditions where the opponent cannot claim a draw) it is inappropriate to force a 0.0 value although perhaps lowerbound 0.0 might be useful if the engine itself may claim a draw and has not yet evaluated the move.

Under 5-fold repetition, if prompted the engine may still provide output according to everything noted above. It's not the engine's responsibility to enforce rules when the arbiter fails to do so.
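
For reference, the score fields discussed above appear on the wire inside UCI "info" lines. A minimal parsing sketch (hypothetical helper; the field names `cp`, `mate`, `lowerbound`, `upperbound` are standard UCI):

```python
import re

def parse_uci_score(line):
    """Extract (kind, value, bound) from a UCI 'info' line, or None if absent."""
    m = re.search(r"\bscore (cp|mate) (-?\d+)(?: (lowerbound|upperbound))?", line)
    if not m:
        return None
    return m.group(1), int(m.group(2)), m.group(3)

print(parse_uci_score("info depth 20 score cp 0 lowerbound pv g1f3"))
# -> ('cp', 0, 'lowerbound')
```
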

@stockfishdeveloper

@ddugovic Why don't you copy this into the Official Stockfish discussion?

@ddugovic
Owner

They aren't interested in lengthy debates about this sort of issue, and I have no disagreement with them.

@ianfab
Collaborator Author

ianfab commented Dec 17, 2016

The tables now go back to the first release with the UCI_Variant option (fishnet-131016). A ~40 Elo regression in atomic chess is the only significant regression so far.

@CPagador

Atomic Chess (Phenom II x4, 15'+3", single core)

SF 250916 - SF 141216 : +18 -13 =9

I know 40 games is a small sample, but still. 250916 scored 71% against Atomkraft and 141216 scored 70%.

Let me know if I am wrong about the 1st table: does it mean that SF atomic 201016 is 517 Elo better than 131016, that 021116 is 43 Elo worse than 201016, that 151116 is 2 Elo worse than 021116, and so on?

@ianfab
Collaborator Author

ianfab commented Dec 20, 2016

@CPagador Yes, your interpretation of the table is correct.

A merge of upstream changes (see_ge) sometime in early October overwrote atomic SEE and caused a regression of about 500 Elo. At that time I was very busy and only noticed the regression after a report in the lichess forum. The subsequent fix logically gave an improvement of about 500 Elo. September 25th was probably before the mentioned merge, so that version should be at about the same level as the versions after the fix.

Testing has changed with regard to opening books, time controls, and number of games, and it might well be that some of my early test results were inaccurate and let regressive patches pass, so your test results are not implausible.

@CPagador

@ianfab atomic chess 200 games match at 2'+1":

SF 250916 - SF 141216 : +78 -76 =46

@Vinvin20

Vinvin20 commented Dec 26, 2016

An improved patch for 3-fold draw detection passed:
26-12-16 mc 3fold diff
LLR: 2.96 (-2.94,2.94) [-3.00,1.00]
Total: 24519 W: 4439 L: 4324 D: 15756
sprt @ 10+0.1 th 1 Redo Sergei's test with simplified patch: root included from extended draw detection.

@ianfab
Collaborator Author

ianfab commented Dec 31, 2016

Added fishnet-231216.

@Vinvin20

Vinvin20 commented Jan 2, 2017

Some explanation about this patch from http://abrok.eu/stockfish/

Author: Sergei Antonov
Date: Sun Jan 1 10:56:46 2017 +0100
Timestamp: 1483264606

Threefold repetition detection

Implement a threefold repetition detection. Below are the examples of
problems fixed by this change.

Losing move in a drawn position.
position fen 8/k7/3p4/p2P1p2/P2P1P2/8/8/K7 w - - 0 1 moves a1a2 a7a8 a2a1
The old code suggested a losing move "bestmove a8a7"; the new code suggests "bestmove a8b7", leading to a draw.

Incorrect evaluation (happened in a real game in TCEC Season 9).
position fen 4rbkr/1q3pp1/b3pn2/7p/1pN5/1P1BBP1P/P1R2QP1/3R2K1 w - - 5 31 moves e3d4 h8h6 d4e3
The old code evaluated it as "cp 0", the new code evaluation is around "cp -50" which is adequate.

Brings 0.5-1 ELO gain. Passes [-3.00,1.00].

STC: http://tests.stockfishchess.org/tests/view/584ece040ebc5903140c5aea
LLR: 2.96 (-2.94,2.94) [-3.00,1.00]
Total: 47744 W: 8537 L: 8461 D: 30746

LTC: http://tests.stockfishchess.org/tests/view/584f134d0ebc5903140c5b37
LLR: 2.96 (-2.94,2.94) [-3.00,1.00]
Total: 36775 W: 4739 L: 4639 D: 27397
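
The behaviour described in the commit message can be illustrated with a toy model (a hypothetical sketch only; the real patch works on Zobrist keys inside the search, not on a list like this): a position scores as a draw if it has already occurred twice anywhere in the game, or once at/after the search root.

```python
def is_repetition_draw(history, root_ply):
    """history: position keys, oldest first; the last entry is the position
    being scored. Toy model of extended draw detection: a true threefold
    anywhere in the game, or any repetition at/after the search root."""
    current, earlier = history[-1], history[:-1]
    if earlier.count(current) >= 2:       # threefold repetition
        return True
    return current in earlier[root_ply:]  # twofold inside the search

# A knight shuffle returning to the same position key 'A':
print(is_repetition_draw(["A", "B", "A"], root_ply=0))  # -> True
print(is_repetition_draw(["A", "B", "A"], root_ply=1))  # -> False
```

The second call shows the distinction the patch blurs: before it, a single repetition of a pre-root position was not scored as a draw during search.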

@ianfab
Collaborator Author

ianfab commented Jan 6, 2017

Tests for fishnet-050117 are completed.

@Vinvin20

Is it possible to find such a table with all values for recent versions, please?

@ianfab
Collaborator Author

ianfab commented Mar 13, 2017

@Vinvin20 New and also updated old regression test results can be found on http://35.161.250.236:6543/regression. It perhaps makes sense to add the link to the new results in the first comment.

@Vinvin20

I didn't know about this graph, it's very good !
Thanks !
