Testing reminder #574

ddugovic · 2020-06-12T13:10:02Z

While merging I have concerns which I need to write down and act upon; otherwise I need to develop a habit of testing blindly.

As the repository's owner I need to remind myself to frequently submit tests; relying on my bot to measure Elo swings after a mis-merge is proving to be insufficient. (I want to offer to transfer ownership, but that would be grossly unfair to the new owner; besides, the current owner/maintainer arrangement is working well despite my incompetence.)

I have enjoyed owning this project each time I added a variant or competed with other engines (crazyhouse, atomic), although I'm likely done adding variants, and NN engines are too strong. My most recent variant additions are:

Helpmate solver
Placement
Relay (and Knight-Relay)

ianfab · 2020-06-13T09:50:52Z

@ddugovic While I welcome that you are now considering regression testing, testing a trivial merge like bd3e681 with 10000 games for all variants is, in my opinion, massively overdoing things. There is a wide range between testing nothing and testing everything.

First of all one should be clear about the goal. E.g., I think restricting detection of regressions to >=20 Elo is very reasonable, and 1000 games are already enough for that (a threshold of ~50 Elo and limiting to something like 400 games could also make sense, as investigating 20 Elo regressions might not be worth the time). In any case >>100 Elo regressions can be detected easily.
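
As a rough check on these numbers, here is a minimal Python sketch of the 95% error margin of an Elo estimate after a given number of games. It assumes independent games between roughly equal engines and uses a normal approximation; the function names and the `draw_ratio` parameter are illustrative and not part of any testing tool mentioned in this thread.

```python
import math


def elo_from_score(score):
    """Convert an expected score in (0, 1) to an Elo difference (logistic model)."""
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return -400.0 * math.log10(1.0 / score - 1.0)


def elo_error_margin(games, draw_ratio=0.0, z=1.96):
    """Approximate half-width (in Elo) of the confidence interval around 0 Elo.

    Assumption: both engines are about equally strong, so wins and losses
    each occur with probability (1 - draw_ratio) / 2. The per-game score
    variance is then (1 - draw_ratio) / 4.
    """
    variance = (1.0 - draw_ratio) / 4.0
    half_width = z * math.sqrt(variance / games)
    return elo_from_score(0.5 + half_width)


if __name__ == "__main__":
    for n in (400, 1000, 10000):
        print(n, "games:", round(elo_error_margin(n), 1), "Elo")
```

Under these assumptions, 1000 games give a margin of roughly 22 Elo and 400 games roughly 34 Elo when there are no draws; a higher draw ratio narrows the margin.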

E.g., my policy for regression testing of upstream merges from official SF for Fairy-Stockfish has two parts:

1. While merging, watch out for changes (more than just a minor value tweak) that are predestined to strongly affect (positively or negatively) a specific variant, and run targeted regression tests for those variants, e.g.:

   pawn eval -> horde
   king safety -> crazyhouse, 3check
   SEE -> atomic
   etc.

2. Regularly (e.g., every 2-3 months) run regression tests for all (relevant) variants.

   Discovers holes in part 1 and potentially other bugs unrelated to merges.
   Gives an overview of the progression.

Such a combination of targeted and regular testing is, in my opinion, a good compromise between coverage and effort of regression testing.
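
The targeted part of this policy could be sketched as a simple lookup from the area touched by a merge to the variants worth testing. This is only an illustration: the area labels are the examples given above, while `TARGETED_VARIANTS` and `variants_to_test` are hypothetical names, and deciding which area a change belongs to remains a manual judgment.

```python
# Hypothetical mapping from a changed engine area to the variants that
# are most likely to be affected (entries taken from the examples above).
TARGETED_VARIANTS = {
    "pawn eval": ["horde"],
    "king safety": ["crazyhouse", "3check"],
    "SEE": ["atomic"],
}


def variants_to_test(changed_areas):
    """Return the variants to regression-test, in order, without duplicates."""
    variants = []
    for area in changed_areas:
        # Areas without a known mapping contribute no targeted tests.
        for variant in TARGETED_VARIANTS.get(area, []):
            if variant not in variants:
                variants.append(variant)
    return variants


if __name__ == "__main__":
    print(variants_to_test(["king safety", "SEE"]))
```

Areas that are not in the table yield no targeted tests, which is where the regular full run is meant to catch anything that was missed.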

ddugovic · 2020-06-19T17:08:11Z

That's a fine policy, although the first part is more than I can understand. I would make claims about the tests I run before pushing to GitHub, but there is no value in arguing.

Progress on building tests into Travis CI is gradual (the tests check for large mistakes; incidentally, this means I will also tackle #158 by creating "puzzle" EPDs for all variants):

$ ../tests/puzzle.sh
puzzle testing started
spawn python3 ../tests/chess-artist/chess_artist.py --infile ../tests/chess-artist/EPD/wacnew.epd --outfile wacnew_result.txt --enginefile ./stockfish --eval search --job test
Analyzing engine: Stockfish 2020-06-19 Multi-Variant
EPD 1: 2rr3k/pp3pp1/1nnqbN1p/3pN3/2pP4/2P3Q1/PPB4P/R4RK1 w - - bm Qg6; id "WAC.001";
FEN 1: 2rr3k/pp3pp1/1nnqbN1p/3pN3/2pP4/2P3Q1/PPB4P/R4RK1 w - - 0 1
engine bm: Qg6
correct: Yes
num correct: 1 / 1
EPD 2: 8/7p/5k2/5p2/p1p2P2/Pr1pPK2/1P1R3P/8 b - - bm Rxb2; id "WAC.002";
FEN 2: 8/7p/5k2/5p2/p1p2P2/Pr1pPK2/1P1R3P/8 b - - 0 1
engine bm: Rb7
correct: No
num correct: 1 / 2
EPD 3: 5rk1/1ppb3p/p1pb4/6q1/3P1p1r/2P1R2P/PP1BQ1P1/5RKN w - - bm Rg3; id "WAC.003";
FEN 3: 5rk1/1ppb3p/p1pb4/6q1/3P1p1r/2P1R2P/PP1BQ1P1/5RKN w - - 0 1
engine bm: Rg3
correct: Yes
num correct: 2 / 3
EPD 4: r1bq2rk/pp3pbp/2p1p1pQ/7P/3P4/2PB1N2/PP3PPR/2KR4 w - - bm Qxh7+; id "WAC.004";
FEN 4: r1bq2rk/pp3pbp/2p1p1pQ/7P/3P4/2PB1N2/PP3PPR/2KR4 w - - 0 1
engine bm: Qxh7+
correct: Yes
num correct: 3 / 4
EPD 5: 5k2/6pp/p1qN4/1p1p4/3P4/2PKP2Q/PP3r2/3R4 b - - bm Qc4+; id "WAC.005";
FEN 5: 5k2/6pp/p1qN4/1p1p4/3P4/2PKP2Q/PP3r2/3R4 b - - 0 1
engine bm: Qc4+
correct: Yes
num correct: 4 / 5
EPD 6: 7k/p7/1R5K/6r1/6p1/6P1/8/8 w - - bm Rb7; id "WAC.006";
FEN 6: 7k/p7/1R5K/6r1/6p1/6P1/8/8 w - - 0 1
engine bm: Rb7
correct: Yes
num correct: 5 / 6
EPD 7: rnbqkb1r/pppp1ppp/8/4P3/6n1/7P/PPPNPPP1/R1BQKBNR b KQkq - bm Ne3; id "WAC.007";
FEN 7: rnbqkb1r/pppp1ppp/8/4P3/6n1/7P/PPPNPPP1/R1BQKBNR b KQkq - 0 1
engine bm: Ne3
correct: Yes
num correct: 6 / 7
EPD 8: r4q1k/p2bR1rp/2p2Q1N/5p2/5p2/2P5/PP3PPP/R5K1 w - - bm Rf7; id "WAC.008";
FEN 8: r4q1k/p2bR1rp/2p2Q1N/5p2/5p2/2P5/PP3PPP/R5K1 w - - 0 1
engine bm: Rf7
correct: Yes
num correct: 7 / 8
EPD 9: 3q1rk1/p4pp1/2pb3p/3p4/6Pr/1PNQ4/P1PB1PP1/4RRK1 b - - bm Bh2+; id "WAC.009";
FEN 9: 3q1rk1/p4pp1/2pb3p/3p4/6Pr/1PNQ4/P1PB1PP1/4RRK1 b - - 0 1
engine bm: Bh2+
correct: Yes
num correct: 8 / 9
EPD 10: 2br2k1/2q3rn/p2NppQ1/2p1P3/Pp5R/4P3/1P3PPP/3R2K1 w - - bm Rxh7; id "WAC.010";
FEN 10: 2br2k1/2q3rn/p2NppQ1/2p1P3/Pp5R/4P3/1P3PPP/3R2K1 w - - 0 1
engine bm: Rxh7
correct: Yes
num correct: 9 / 10
EPD 11: r1b1kb1r/3q1ppp/pBp1pn2/8/Np3P2/5B2/PPP3PP/R2Q1RK1 w kq - bm Bxc6; id "WAC.011";
FEN 11: r1b1kb1r/3q1ppp/pBp1pn2/8/Np3P2/5B2/PPP3PP/R2Q1RK1 w kq - 0 1
engine bm: Bxc6
correct: Yes
num correct: 10 / 11
EPD 12: 4k1r1/2p3r1/1pR1p3/3pP2p/3P2qP/P4N2/1PQ4P/5R1K b - - bm Qxf3+; id "WAC.012";
FEN 12: 4k1r1/2p3r1/1pR1p3/3pP2p/3P2qP/P4N2/1PQ4P/5R1K b - - 0 1
engine bm: Qxf3+
correct: Yes
num correct: 11 / 12
EPD 13: 5rk1/pp4p1/2n1p2p/2Npq3/2p5/6P1/P3P1BP/R4Q1K w - - bm Qxf8+; id "WAC.013";
FEN 13: 5rk1/pp4p1/2n1p2p/2Npq3/2p5/6P1/P3P1BP/R4Q1K w - - 0 1
engine bm: Qxf8+
correct: Yes
num correct: 12 / 13
EPD 14: r2rb1k1/pp1q1p1p/2n1p1p1/2bp4/5P2/PP1BPR1Q/1BPN2PP/R5K1 w - - bm Qxh7+; id "WAC.014";
FEN 14: r2rb1k1/pp1q1p1p/2n1p1p1/2bp4/5P2/PP1BPR1Q/1BPN2PP/R5K1 w - - 0 1
puzzle testing OK
The command "../tests/puzzle.sh" exited with 0.
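
One way to turn such a log into a pass/fail check for large mistakes is to count the `correct: Yes` and `correct: No` lines. The following is a minimal sketch that assumes the output format shown above; `summarize`, `passes`, and the `min_ratio` threshold are hypothetical and not part of chess-artist or `puzzle.sh`.

```python
import re


def summarize(log_text):
    """Return (solved, attempted) counted from 'correct: Yes/No' lines."""
    results = re.findall(r"^correct: (Yes|No)\s*$", log_text, flags=re.MULTILINE)
    return results.count("Yes"), len(results)


def passes(log_text, min_ratio=0.8):
    """Report success when enough puzzles were solved.

    min_ratio is an illustrative threshold; an empty log counts as a failure.
    """
    solved, attempted = summarize(log_text)
    return attempted > 0 and solved / attempted >= min_ratio


if __name__ == "__main__":
    sample = (
        "engine bm: Qg6\ncorrect: Yes\nnum correct: 1 / 1\n"
        "engine bm: Rb7\ncorrect: No\nnum correct: 1 / 2\n"
    )
    print(summarize(sample), passes(sample))
```

Applied to the log above, this would count 12 solved out of 13 attempted positions, since the run ends before a result is printed for EPD 14.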

ianfab · 2020-06-19T17:39:08Z

For me, detecting regressions that would otherwise be missed is not the point of the targeted tests. Running regular regression tests for all variants already ensures that all significant regressions are found, so the targeted tests only serve to detect obvious potential regressions early and to avoid time-consuming bisection later on. Of course they require making good guesses about the potential regressions; otherwise they are no more efficient than bisection. This is just my suggestion since it works well for me, and if you do not feel comfortable with it, there are of course many other good ways to test.

ghost · 2020-10-29T12:24:30Z

is everything okay

ddugovic · 2020-10-29T12:32:38Z

Lately #580 and #585, timeouts on Travis CI, finishing the AppVeyor build, and the SF 12 release have diverted most of my attention, and I haven't been able to focus on this.

ddugovic self-assigned this Jun 12, 2020

ddugovic added a commit that referenced this issue Jun 19, 2020

Add regression test using chess-artist #574

bd9c4b9

ddugovic added a commit that referenced this issue Jul 5, 2020

Add regression test using chess-artist #574

e2ea67c

ddugovic closed this as completed Feb 20, 2021
