
Regression test results #170

Closed
ianfab opened this issue Dec 11, 2016 · 39 comments

@ianfab
Collaborator

ianfab commented Dec 11, 2016

Update: More recent regression test results can be found on http://35.161.250.236:6543/regression.

Since upstream changes are not tested for variants, regression tests should be done on a regular basis. Furthermore, this helps to keep track of our progress. I will try to update this table regularly, but I will probably skip some releases if there are only a few days in between.

The table shows each version's Elo difference to the previous version in the table, measured over 1000 games at a time control of 10''+0.1'' with opening books. If you want to contribute results, just post them below (ideally announcing beforehand to avoid duplicate testing) and I will include them in the tables.

Suicide and loop chess are not tested, because they are too similar to giveaway and crazyhouse, respectively.
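
The error margins quoted below (e.g. 2.78+-13.7) can be reproduced from raw W/L/D counts. A sketch, assuming the margins are roughly 95% confidence intervals on a logistic Elo estimate (the exact procedure used for the table isn't stated, so this is illustrative only):

```python
import math

def elo_with_error(wins, losses, draws, z=1.96):
    """Elo estimate and approximate 95% confidence half-width for a match."""
    n = wins + losses + draws
    p = (wins + 0.5 * draws) / n              # average score per game
    elo = -400.0 * math.log10(1.0 / p - 1.0)  # logistic Elo from score
    # per-game score variance (games scored 1, 0, or 0.5)
    var = (wins * (1 - p) ** 2 + losses * p ** 2 + draws * (0.5 - p) ** 2) / n
    se = z * math.sqrt(var / n)               # CI half-width on the score scale
    lo = -400.0 * math.log10(1.0 / (p - se) - 1.0)
    hi = -400.0 * math.log10(1.0 / (p + se) - 1.0)
    return elo, (hi - lo) / 2
```

For a drawish 1000-game sample such as `elo_with_error(205, 195, 600)` this gives a small positive Elo with a margin around +-13.6, consistent with the 1000-game margins in the table.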

Elo difference to previous version

| version | chess | giveaway | atomic | crazyhouse | horde | king of the hill | losers | racing kings | three-check |
|---|---|---|---|---|---|---|---|---|---|
| fishnet-131016 | | | | not supported | | | not supported | | |
| fishnet-201016 | 2.78+-13.7 | -9.38+-15.5 | 517.07+-42.3* | not supported | 131.74+-23.0 | 12.51+-19.3 | not supported | -0.35+-17.6 | 6.25+-20.5 |
| fishnet-021116 | -12.51+-14.0 | 6.95+-14.9 | -43.30+-19.4 | first version | 294.61+-29.5 | 41.89+-19.9 | not supported | 217.35+-23.8 | 7.64+-20.6 |
| fishnet-151116 | -4.86+-13.8 | 152.60+-17.5 | -2.43+-17.9 | 126.57+-22.6 | 11.47+-21.4 | -17.04+-19.3 | not supported | -14.60+-16.5 | 5.91+-20.4 |
| fishnet-301116 | 53.58+-12.3 | 47.90+-15.5 | 28.90+-16.8 | 135.76+-22.7 | 7.30+-21.3 | 35.91+-18.5 | not supported | 22.27+-15.7 | -9.73+-19.9 |
| fishnet-091216 | 5.91+-13.6 | -1.39+-15.5 | 17.39+-17.3 | 132.14+-22.7 | 43.30+-21.5 | 6.60+-18.6 | not supported | 11.82+-16.5 | -3.47+-20.4 |
| fishnet-231216 | -23.31+-12.5 | -26.46+-15.0 | -13.56+-16.7 | 32.05+-20.9 | -1.04+-21.3 | 39.43+-19.0 | not supported | -13.21+-15.5 | 14.60+-20.0 |
| fishnet-050117 | 9.38+-13.8 | -22.62+-15.8 | 53.22+-17.8 | 30.65+-21.1 | 30.30+-21.5 | 11.82+-19.0 | first version | 12.51+-17.0 | 2.08+-20.7 |

*Reintroducing atomic SEE regained a loss of about 500 Elo in the previous version.

Elo difference to most recent version

Calculated by summing up the relative Elo differences.

| version | chess | giveaway | atomic | crazyhouse | horde | king of the hill | losers | racing kings | three-check |
|---|---|---|---|---|---|---|---|---|---|
| fishnet-131016 | -30.97 | -147.60 | -557.29 | not supported | -517.68 | -131.12 | not supported | -235.79 | -23.28 |
| fishnet-201016 | -28.19 | -156.98 | -40.22 | not supported | -385.94 | -118.61 | not supported | -236.14 | -17.03 |
| fishnet-021116 | -40.70 | -150.03 | -83.52 | -457.17 | -91.33 | -76.72 | not supported | -18.79 | -9.39 |
| fishnet-151116 | -45.56 | 2.57 | -85.95 | -330.60 | -79.86 | -93.76 | not supported | -33.39 | -3.48 |
| fishnet-301116 | 8.02 | 50.47 | -57.05 | -194.84 | -72.56 | -57.85 | not supported | -11.12 | -13.21 |
| fishnet-091216 | 13.93 | 49.08 | -39.66 | -62.70 | -29.26 | -51.25 | not supported | 0.70 | -16.68 |
| fishnet-231216 | -9.38 | 22.62 | -53.22 | -30.65 | -30.30 | -11.82 | not supported | -12.51 | -2.08 |
| fishnet-050117 | | | | | | | | | |
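
The second table follows from the first by summing the per-release deltas. A sketch using the chess column (values copied from the first table; the most recent release is the reference at 0):

```python
# Chess-column deltas from the first table, each relative to the previous release.
deltas = [2.78, -12.51, -4.86, 53.58, 5.91, -23.31, 9.38]

def elo_vs_latest(deltas):
    """Difference of each release to the most recent one:
    the negated sum of all deltas that come after it."""
    return [-sum(deltas[i:]) for i in range(len(deltas) + 1)]

print([round(x, 2) for x in elo_vs_latest(deltas)])
# -> [-30.97, -28.19, -40.7, -45.56, 8.02, 13.93, -9.38, 0]
```
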
@ddugovic
Owner

To clarify, the reason fishnet-301116 is blank is because that is the current baseline, correct?

@ianfab
Collaborator Author

ianfab commented Dec 11, 2016

Yes, exactly. I might also do regression tests for past versions, but currently fishnet-301116 is the baseline.

@ddugovic
Owner

Thanks. I was thinking that in the future the baseline may advance to the latest stable (non-regressing) release.

@ianfab
Collaborator Author

ianfab commented Dec 11, 2016

I intended to add the test results, which are always relative to the previous version, so by design the only empty line would be the first one. However, we could of course also add a second table where the results are summed up to compare all versions with the latest or first one.

@Vinvin20

Vinvin20 commented Dec 12, 2016

Results after 23 hours of testing on 6 cores @ 4 GHz, stockfish-091216 vs stockfish-301116:

30s+2s : Score of stockfish-modern-091216-1thread vs stockfish-modern-301116-1thread: 249 - 118 - 13 [0.672]
ELO difference: 124.89 +/- 36.62

35s+2s : Score of stockfish-modern-091216-1thread vs stockfish-modern-301116-1thread: 216 - 129 - 15 [0.621]
ELO difference: 85.66 +/- 36.26

40s+2s : Score of stockfish-modern-091216-1thread vs stockfish-modern-301116-1thread: 223 - 106 - 11 [0.672]
ELO difference: 124.64 +/- 38.77

45s+2s : Score of stockfish-modern-091216-1thread vs stockfish-modern-301116-1thread: 198 - 122 - 12 [0.614]
ELO difference: 80.97 +/- 37.77

50s+2s : 170 - 123 - 7 [57.8%]

55s+2s : Score of stockfish-modern-301116-1thread vs stockfish-modern-091216-1thread: 107 - 180 - 13 [0.378]
ELO difference: -86.27 +/- 39.74

Sum and perf :
2012 games : +1236 -705 =71
Perf : (1236-705)/2012*400 = +105.6 Elo
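
The "Perf" figure above uses the linear approximation Elo ~ 400*(W-L)/N, which can be checked directly:

```python
# Linear Elo approximation used in the 'Sum and perf' line above.
def perf_elo(wins, losses, games):
    return 400.0 * (wins - losses) / games

print(round(perf_elo(1236, 705, 2012), 1))  # -> 105.6
```

Note this linear form overestimates at larger margins; the usual logistic formula on the same totals gives roughly 94 Elo.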

Games are here : https://www.dropbox.com/s/54bn20y88ito86w/SF091216-vs-SF301116.zip?dl=0

@arbolis

arbolis commented Dec 12, 2016

It's important to note that the condition 10" + 0.1" will vary according to the hardware. This is why fishtest does a per-case rescaling, so that this translates to a different time control for each machine.

@ianfab
Collaborator Author

ianfab commented Dec 12, 2016

@arbolis Yes, good point. My tests have been performed on AWS c1.medium machines using 1 thread. I can perform benchmark tests as soon as the testing queue is empty (which is not going to happen soon...).

@ianfab
Collaborator Author

ianfab commented Dec 12, 2016

There is a surprising result for fishnet-301116 vs. fishnet-151116 in standard chess. +50 Elo in standard chess cannot be explained by upstream changes, so it must have to do with this merge overwriting the custom repetition detection.

@ddugovic
Owner

Upstream tested the custom repetition detection (on an older master) and it cost about 3 Elo. However, if losing the custom repetition detection caused the regression, then reapplying fb16fb4 should make up the difference...

That merge changed many lines of code, but it's unclear to me what causes the +50 Elo...

@ianfab
Collaborator Author

ianfab commented Dec 13, 2016

I have not done any tests to show that the mentioned merge caused the +50 Elo, but as far as I know this was the only functional change for standard chess besides upstream changes, so I do not see any other explanation. I would be interested in the link to the test result, because I could not find it. Anyway, I will test it to see whether it causes the +50 Elo.

@Vinvin20

Currently running on http://tests.stockfishchess.org/tests :
threefold_opt : "Optimized version of my previous solution. "

@ddugovic
Owner

Argh, they're going to break my threefold repetition check again? Sigh...

@Vinvin20

Maybe it will be fixed in original SF now! :-)

@Vinvin20

The mod passed the test!

@stockfishdeveloper

stockfishdeveloper commented Dec 13, 2016

It will not be committed. I have seen Marco reject many simplification patches just because they don't fix one of his pet gripes with SF.
He is also extremely strict on actually simplifying the code: this patch adds lines.

@Vinvin20

But that fixes a buggy eval where SF claims 0.00 after the second occurrence of the same position.

@ddugovic
Owner

Off-topic and not a bug since it complies with the UCI protocol. If you really want to discuss whether it's a bug, open an issue.

@Vinvin20

The communication interface has nothing to do with the correct internal position evaluation of an engine.

@ddugovic
Owner

The engine evaluation 0.0 is correct.

Also correct is any evaluation the engine may return. However I customize this fork to return nonzero values.

@Vinvin20

Vinvin20 commented Dec 13, 2016

No, "draw" is not the correct evaluation of a position that has occurred only twice.

@ddugovic
Owner

ddugovic commented Dec 13, 2016

If you really want to discuss whether it's a bug, open an issue. Just be aware that opening the issue will unleash an avalanche of questions.

Upstream PR#925 is introducing a new feature for threefold repetition detection, which Stockfish never supported.

@Vinvin20

I think I already raised the subject; Joona or Marco said that the change would not raise the level of the engine, which is why this eval won't be changed. They both think the only purpose of modifying the source is to raise the level (or to simplify the code). So this buggy eval remains ...

@ddugovic
Owner

@Vinvin20 You still fail to open an issue, so I'm forced to respond here.

It is acceptable (but not necessary) that the evaluation of 1. Nf3 Nf6 2. Ng1 Ng8 3. Nf3 Nf6 4. Ng1 Ng8 5. Nf3 Nf6 be 0.0.

Suppose I have an incomplete move list 5. Nf3 Nf6. What should the evaluation be?

Suppose the players are competing under Sofia rules, or have other reasons they don't want a draw. What should the evaluation be?

Suppose 5-fold repetition occurs and the position is fed to the engine. Under FIDE rules the arbiter should have claimed a draw. What should the evaluation be?

@Vinvin20

Vinvin20 commented Dec 13, 2016

At 4...Ng8 black can claim a draw (just before playing the move) but is not forced to do so.
At 5. Nf3 white can claim a draw (just before playing the move) but is not forced to do so. 0.00 is the correct eval here, as white is slightly better and black can claim a draw.

Suppose I have an incomplete move list 5. Nf3 Nf6. What should the evaluation be?

I don't understand your question about an "incomplete move list" ...

Suppose the players are competing under Sofia rules, or have other reasons they don't want a draw. What should the evaluation be?

I don't know the details of the Sofia rules for when the two players show bad sportsmanship by forcing an artificial draw (do both lose?).

Suppose 5-fold repetition occurs and the position is fed to the engine. Under FIDE rules the arbiter should have claimed a draw. What should the evaluation be?

0.00 as this is a forced draw.

@Vinvin20

Are there tricks in your questions?

@ddugovic
Owner

ddugovic commented Dec 13, 2016

Not intentionally; I'm just trying to understand this complex issue. My perspective is twofold:

  1. Information provided by Stockfish must comply with the protocol:
     * score
       * cp - the score from the engine's point of view
       * lowerbound - the score is just a lower bound.
       * upperbound - the score is just an upper bound.
  2. The purpose of scoring moves is to order them and identify the best move, not to provide an objective assessment of their quality. To that end it is convenient but unnecessary to use the value 0.0 on the third or fourth repetition (as in the first example).

In the second example, an objective assessment is impossible (as the move history is unknown to the engine). UCI allows this use case and does not specify what "the engine's point of view" should be; for all the engine knows, the move list could have been 1. Nf3 Nf6 2. Ng5 Ng4 3. Nh3 Nh6 4. Ng1 Ng8 5. Nf3 Nf6 or could have been 1. Nf3 Nf6 2. Ng1 Ng8 3. Nf3 Nf6 4. Ng1 Ng8 5. Nf3 Nf6. Based on the incomplete information, the engine could assign any value, although using retrograde analysis we can conclude 5. Nf3 Nf6 carries some risk whereas 5. e4 e5 does not risk a repetition claim by the opponent.

Under Sofia rules (or match conditions where the opponent cannot claim a draw) it is inappropriate to force a 0.0 value although perhaps lowerbound 0.0 might be useful if the engine itself may claim a draw and has not yet evaluated the move.

Under 5-fold repetition, if prompted the engine may still provide output according to everything noted above. It's not the engine's responsibility to enforce rules when the arbiter fails to do so.
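
For reference, the score fields discussed above appear on the wire inside UCI "info" lines. A minimal parsing sketch (hypothetical helper; the field names `cp`, `mate`, `lowerbound`, `upperbound` are standard UCI):

```python
import re

def parse_uci_score(line):
    """Extract (kind, value, bound) from a UCI 'info' line, or None if absent."""
    m = re.search(r"\bscore (cp|mate) (-?\d+)(?: (lowerbound|upperbound))?", line)
    if not m:
        return None
    return m.group(1), int(m.group(2)), m.group(3)

print(parse_uci_score("info depth 20 score cp 0 lowerbound pv g1f3"))
# -> ('cp', 0, 'lowerbound')
```
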

@stockfishdeveloper

@ddugovic Why don't you copy this into the Official Stockfish discussion?

@ddugovic
Owner

They aren't interested in lengthy debates about this sort of issue, and I have no disagreement with them.

@ianfab
Collaborator Author

ianfab commented Dec 17, 2016

The tables now go back to the first release with the UCI_Variant option (fishnet-131016). A ~40 Elo regression in atomic chess is the only significant regression so far.

@CPagador

Atomic Chess (Phenom II x4, 15'+3", single core)

SF 250916 - SF 141216 : +18 -13 =9

I know 40 games is a small sample, but still. 250916 scored 71% against Atomkraft and 141216 scored 70%.

Let me know if I am wrong about the 1st table: does it mean that SF atomic 201016 is 517 Elo better than 131016, that 021116 is 43 Elo worse than 201016, that 151116 is 2 Elo worse than 021116, and so on?

@ianfab
Collaborator Author

ianfab commented Dec 20, 2016

@CPagador Yes, your interpretation of the table is correct.

A merge of upstream changes (see_ge) sometime in early October overwrote atomic SEE and caused a regression of about 500 Elo. At that time I was very busy and only noticed the regression after a report in the lichess forum. The subsequent fix logically gave an improvement of about 500 Elo. September 25th was probably before the mentioned merge, so that version should be at about the same level as the versions after the fix.

Testing has changed with regard to opening books, time controls, and number of games, and it might well be that some of my early test results were inaccurate and let regressive patches pass, so your test results are not implausible.

@CPagador

@ianfab atomic chess 200 games match at 2'+1":

SF 250916 - SF 141216 : +78 -76 =46

@Vinvin20

Vinvin20 commented Dec 26, 2016

An improved patch for 3-fold draw detection passed:
26-12-16 mc 3fold diff
LLR: 2.96 (-2.94,2.94) [-3.00,1.00]
Total: 24519 W: 4439 L: 4324 D: 15756
sprt @ 10+0.1 th 1 Redo Sergei's test with simplified patch: root included from extended draw detection.

@ianfab
Collaborator Author

ianfab commented Dec 31, 2016

Added fishnet-231216.

@Vinvin20

Vinvin20 commented Jan 2, 2017

Some explanation about this patch from http://abrok.eu/stockfish/

Author: Sergei Antonov
Date: Sun Jan 1 10:56:46 2017 +0100
Timestamp: 1483264606

Threefold repetition detection

Implement a threefold repetition detection. Below are the examples of
problems fixed by this change.

Losing move in a drawn position.
position fen 8/k7/3p4/p2P1p2/P2P1P2/8/8/K7 w - - 0 1 moves a1a2 a7a8 a2a1
The old code suggested a losing move "bestmove a8a7"; the new code suggests "bestmove a8b7", leading to a draw.

Incorrect evaluation (happened in a real game in TCEC Season 9).
position fen 4rbkr/1q3pp1/b3pn2/7p/1pN5/1P1BBP1P/P1R2QP1/3R2K1 w - - 5 31 moves e3d4 h8h6 d4e3
The old code evaluated it as "cp 0", the new code evaluation is around "cp -50" which is adequate.

Brings 0.5-1 ELO gain. Passes [-3.00,1.00].

STC: http://tests.stockfishchess.org/tests/view/584ece040ebc5903140c5aea
LLR: 2.96 (-2.94,2.94) [-3.00,1.00]
Total: 47744 W: 8537 L: 8461 D: 30746

LTC: http://tests.stockfishchess.org/tests/view/584f134d0ebc5903140c5b37
LLR: 2.96 (-2.94,2.94) [-3.00,1.00]
Total: 36775 W: 4739 L: 4639 D: 27397
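
The behaviour described in the commit message can be illustrated with a toy model (a hypothetical sketch only; the real patch works on Zobrist keys inside the search, not on a list like this): a position scores as a draw if it has already occurred twice anywhere in the game, or once at/after the search root.

```python
def is_repetition_draw(history, root_ply):
    """history: position keys, oldest first; the last entry is the position
    being scored. Toy model of extended draw detection: a true threefold
    anywhere in the game, or any repetition at/after the search root."""
    current, earlier = history[-1], history[:-1]
    if earlier.count(current) >= 2:       # threefold repetition
        return True
    return current in earlier[root_ply:]  # twofold inside the search

# A knight shuffle returning to the same position key 'A':
print(is_repetition_draw(["A", "B", "A"], root_ply=0))  # -> True
print(is_repetition_draw(["A", "B", "A"], root_ply=1))  # -> False
```

The second call shows the distinction the patch blurs: before it, a single repetition of a pre-root position was not scored as a draw during search.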

@ianfab
Collaborator Author

ianfab commented Jan 6, 2017

Tests for fishnet-050117 are completed.

@Vinvin20

Is it possible to find such a table with all values for recent versions, please?

@ianfab
Collaborator Author

ianfab commented Mar 13, 2017

@Vinvin20 New and also updated old regression test results can be found on http://35.161.250.236:6543/regression. It perhaps makes sense to add the link to the new results in the first comment.

@Vinvin20

I didn't know about this graph, it's very good !
Thanks !
