Testing reminder #574

Closed
ddugovic opened this issue Jun 12, 2020 · 5 comments
@ddugovic
Owner

While merging I have concerns that I need to write down and act upon; otherwise I need to develop a habit of testing blindly.

As the repository's owner I need to remind myself to frequently submit tests; relying on my bot to measure Elo swings after a mis-merge is proving to be insufficient. (I want to offer to transfer ownership, but that would be grossly unfair to the new owner; besides, the current owner/maintainer arrangement is working well despite my incompetence.)

I have enjoyed owning this project each time I added a variant or competed with other engines (crazyhouse, atomic), although I'm likely done adding variants and NN engines are too strong. My most recent variant additions are:

  • Helpmate solver
  • Placement
  • Relay (and Knight-Relay)
@ddugovic ddugovic self-assigned this Jun 12, 2020
@ianfab
Collaborator

ianfab commented Jun 13, 2020

@ddugovic While I welcome that you are now considering regression testing, testing a trivial merge like bd3e681 with 10000 games for all variants is, in my opinion, massively overdoing things. There is a wide range between testing nothing and testing everything.

First of all one should be clear about the goal. E.g., I think restricting detection of regressions to >=20 Elo is very reasonable, and 1000 games are already enough for that (a threshold of ~50 Elo and limiting to something like 400 games could also make sense, as investigating 20 Elo regressions might not be worth the time). In any case >>100 Elo regressions can be detected easily.
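To relate game counts to Elo thresholds like the ones above, here is a rough sketch of the usual back-of-the-envelope estimate. It assumes the standard logistic Elo model, independent games, and two engines of nearly equal strength; the function names and the `draw_ratio` parameter are illustrative and not part of any test tool mentioned in this thread.

```python
import math

def elo_from_score(score: float) -> float:
    """Convert an expected score in (0, 1) to an Elo difference (logistic model)."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_margin_95(games: int, draw_ratio: float = 0.0) -> float:
    """Approximate 95% error margin (in Elo) of a match result.

    Assumption: the true score is close to 50%, so the per-game variance
    of the score is (1 - draw_ratio) / 4. Draws reduce the variance.
    """
    std_error = math.sqrt((1.0 - draw_ratio) / 4.0 / games)
    return elo_from_score(0.5 + 1.96 * std_error)

# Compare the margin for a few game counts, without draws and with an
# assumed 40% draw ratio (the draw ratio varies a lot between variants).
for n in (400, 1000, 10000):
    print(n, round(elo_margin_95(n), 1), round(elo_margin_95(n, draw_ratio=0.4), 1))
```

The margin shrinks only with the square root of the number of games, which is why going from 1000 to 10000 games is expensive relative to the precision gained.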

E.g., my policy for regression testing of upstream merges from official SF for Fairy-Stockfish has two parts:

  1. While merging, watch out for changes (more than just a minor value tweak) that are likely to strongly affect (positively or negatively) a specific variant, and run targeted regression tests for those variants, e.g.:
       • pawn eval -> horde
       • king safety -> crazyhouse, 3check
       • SEE -> atomic
       • etc.
  2. Regularly (e.g., every 2-3 months) run regression tests for all (relevant) variants.
       • Discovers holes in part 1 and potentially other bugs unrelated to merges.
       • Gives an overview of the progression.

Such a combination of targeted and regular testing is, in my opinion, a good compromise between coverage and effort of regression testing.
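The two-part policy above could be sketched roughly as follows. The mapping only contains the examples given in the list; the function name, the `full_run_due` flag, and the way changed areas are labeled are hypothetical and would have to be adapted to however merges are actually reviewed.

```python
# Hypothetical mapping from the area a merge touches to the variants most
# likely affected. The entries are the examples from the list above.
TARGETED_VARIANTS = {
    "pawn eval": ["horde"],
    "king safety": ["crazyhouse", "3check"],
    "SEE": ["atomic"],
}

def variants_to_test(changed_areas, all_variants, full_run_due=False):
    """Return the variants to regression-test for one merge.

    Part 1: targeted tests for the areas touched by the merge.
    Part 2: every variant when the periodic full run is due.
    """
    if full_run_due:
        return list(all_variants)
    selected = []
    for area in changed_areas:
        for variant in TARGETED_VARIANTS.get(area, []):
            if variant not in selected:  # keep order, avoid duplicates
                selected.append(variant)
    return selected

print(variants_to_test(["king safety", "SEE"], ["horde", "crazyhouse", "3check", "atomic"]))
```

A trivial merge that touches none of the listed areas yields an empty list, which matches the point that not every merge needs a test run.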

@ddugovic
Owner Author

That's a fine policy, although the first part is more than I can understand. I would make claims about the tests I run before pushing to GitHub, but there is no value in arguing.

Progress building tests into Travis CI (checking for large mistakes; incidentally this means I will also tackle #158 by creating "puzzle" EPDs for all variants) is gradual:

$ ../tests/puzzle.sh
puzzle testing started
spawn python3 ../tests/chess-artist/chess_artist.py --infile ../tests/chess-artist/EPD/wacnew.epd --outfile wacnew_result.txt --enginefile ./stockfish --eval search --job test
Analyzing engine: Stockfish 2020-06-19 Multi-Variant
EPD 1: 2rr3k/pp3pp1/1nnqbN1p/3pN3/2pP4/2P3Q1/PPB4P/R4RK1 w - - bm Qg6; id "WAC.001";
FEN 1: 2rr3k/pp3pp1/1nnqbN1p/3pN3/2pP4/2P3Q1/PPB4P/R4RK1 w - - 0 1
engine bm: Qg6
correct: Yes
num correct: 1 / 1
EPD 2: 8/7p/5k2/5p2/p1p2P2/Pr1pPK2/1P1R3P/8 b - - bm Rxb2; id "WAC.002";
FEN 2: 8/7p/5k2/5p2/p1p2P2/Pr1pPK2/1P1R3P/8 b - - 0 1
engine bm: Rb7
correct: No
num correct: 1 / 2
EPD 3: 5rk1/1ppb3p/p1pb4/6q1/3P1p1r/2P1R2P/PP1BQ1P1/5RKN w - - bm Rg3; id "WAC.003";
FEN 3: 5rk1/1ppb3p/p1pb4/6q1/3P1p1r/2P1R2P/PP1BQ1P1/5RKN w - - 0 1
engine bm: Rg3
correct: Yes
num correct: 2 / 3
EPD 4: r1bq2rk/pp3pbp/2p1p1pQ/7P/3P4/2PB1N2/PP3PPR/2KR4 w - - bm Qxh7+; id "WAC.004";
FEN 4: r1bq2rk/pp3pbp/2p1p1pQ/7P/3P4/2PB1N2/PP3PPR/2KR4 w - - 0 1
engine bm: Qxh7+
correct: Yes
num correct: 3 / 4
EPD 5: 5k2/6pp/p1qN4/1p1p4/3P4/2PKP2Q/PP3r2/3R4 b - - bm Qc4+; id "WAC.005";
FEN 5: 5k2/6pp/p1qN4/1p1p4/3P4/2PKP2Q/PP3r2/3R4 b - - 0 1
engine bm: Qc4+
correct: Yes
num correct: 4 / 5
EPD 6: 7k/p7/1R5K/6r1/6p1/6P1/8/8 w - - bm Rb7; id "WAC.006";
FEN 6: 7k/p7/1R5K/6r1/6p1/6P1/8/8 w - - 0 1
engine bm: Rb7
correct: Yes
num correct: 5 / 6
EPD 7: rnbqkb1r/pppp1ppp/8/4P3/6n1/7P/PPPNPPP1/R1BQKBNR b KQkq - bm Ne3; id "WAC.007";
FEN 7: rnbqkb1r/pppp1ppp/8/4P3/6n1/7P/PPPNPPP1/R1BQKBNR b KQkq - 0 1
engine bm: Ne3
correct: Yes
num correct: 6 / 7
EPD 8: r4q1k/p2bR1rp/2p2Q1N/5p2/5p2/2P5/PP3PPP/R5K1 w - - bm Rf7; id "WAC.008";
FEN 8: r4q1k/p2bR1rp/2p2Q1N/5p2/5p2/2P5/PP3PPP/R5K1 w - - 0 1
engine bm: Rf7
correct: Yes
num correct: 7 / 8
EPD 9: 3q1rk1/p4pp1/2pb3p/3p4/6Pr/1PNQ4/P1PB1PP1/4RRK1 b - - bm Bh2+; id "WAC.009";
FEN 9: 3q1rk1/p4pp1/2pb3p/3p4/6Pr/1PNQ4/P1PB1PP1/4RRK1 b - - 0 1
engine bm: Bh2+
correct: Yes
num correct: 8 / 9
EPD 10: 2br2k1/2q3rn/p2NppQ1/2p1P3/Pp5R/4P3/1P3PPP/3R2K1 w - - bm Rxh7; id "WAC.010";
FEN 10: 2br2k1/2q3rn/p2NppQ1/2p1P3/Pp5R/4P3/1P3PPP/3R2K1 w - - 0 1
engine bm: Rxh7
correct: Yes
num correct: 9 / 10
EPD 11: r1b1kb1r/3q1ppp/pBp1pn2/8/Np3P2/5B2/PPP3PP/R2Q1RK1 w kq - bm Bxc6; id "WAC.011";
FEN 11: r1b1kb1r/3q1ppp/pBp1pn2/8/Np3P2/5B2/PPP3PP/R2Q1RK1 w kq - 0 1
engine bm: Bxc6
correct: Yes
num correct: 10 / 11
EPD 12: 4k1r1/2p3r1/1pR1p3/3pP2p/3P2qP/P4N2/1PQ4P/5R1K b - - bm Qxf3+; id "WAC.012";
FEN 12: 4k1r1/2p3r1/1pR1p3/3pP2p/3P2qP/P4N2/1PQ4P/5R1K b - - 0 1
engine bm: Qxf3+
correct: Yes
num correct: 11 / 12
EPD 13: 5rk1/pp4p1/2n1p2p/2Npq3/2p5/6P1/P3P1BP/R4Q1K w - - bm Qxf8+; id "WAC.013";
FEN 13: 5rk1/pp4p1/2n1p2p/2Npq3/2p5/6P1/P3P1BP/R4Q1K w - - 0 1
engine bm: Qxf8+
correct: Yes
num correct: 12 / 13
EPD 14: r2rb1k1/pp1q1p1p/2n1p1p1/2bp4/5P2/PP1BPR1Q/1BPN2PP/R5K1 w - - bm Qxh7+; id "WAC.014";
FEN 14: r2rb1k1/pp1q1p1p/2n1p1p1/2bp4/5P2/PP1BPR1Q/1BPN2PP/R5K1 w - - 0 1
puzzle testing OK
The command "../tests/puzzle.sh" exited with 0.
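Since the goal is to catch large mistakes in CI, one way to turn a log like the one above into a pass/fail decision is to read the running `num correct: X / Y` counter. This is only a sketch based on the log format shown here: the function names and the `min_ratio` threshold are assumptions, and it is not how `puzzle.sh` actually decides that testing is "OK".

```python
import re

# Matches the running counter lines printed in the log, e.g. "num correct: 11 / 12".
NUM_CORRECT = re.compile(r"^num correct:\s*(\d+)\s*/\s*(\d+)$")

def solved_ratio(log_text: str):
    """Return (correct, total) from the last counter line, or None if absent."""
    result = None
    for line in log_text.splitlines():
        match = NUM_CORRECT.match(line.strip())
        if match:
            result = (int(match.group(1)), int(match.group(2)))
    return result

def puzzles_pass(log_text: str, min_ratio: float = 0.8) -> bool:
    """True if enough puzzles were solved; min_ratio is a hypothetical threshold."""
    counts = solved_ratio(log_text)
    if counts is None or counts[1] == 0:
        return False  # no results at all is treated as a failure
    correct, total = counts
    return correct / total >= min_ratio

sample = "engine bm: Rb7\ncorrect: No\nnum correct: 1 / 2\nengine bm: Rg3\ncorrect: Yes\nnum correct: 2 / 3\n"
print(solved_ratio(sample), puzzles_pass(sample))
```

Treating a missing counter as a failure guards against a run that exits early (for example on a timeout) being reported as a success.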

@ianfab
Collaborator

ianfab commented Jun 19, 2020

Detecting regressions that would otherwise be missed is not the point of the targeted tests for me. Running regular regression tests for all variants already ensures that all significant regressions are found. The targeted tests only serve to detect obvious potential regressions early and to avoid time-consuming bisection later on. They of course require good guesses about the potential regressions; otherwise they are no more efficient than bisection. This is just my suggestion since it works well for me, and if you do not feel comfortable with it, there are of course many other good ways to test.
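For comparison, the cost of bisection can be sketched as a binary search over the commits since the last good test. The code below is a generic illustration, not a wrapper around `git bisect`; `is_regressed` stands for a full regression test of one commit, and the sketch assumes the first commit is known good, the last is known bad, and the regression persists once introduced.

```python
def first_bad_commit(commits, is_regressed):
    """Binary-search an ordered commit list for the first regressed commit.

    Assumes commits[0] is good and commits[-1] is bad.
    Returns (commit, number_of_test_runs).
    """
    good, bad = 0, len(commits) - 1
    test_runs = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        test_runs += 1  # each probe is one full regression test
        if is_regressed(commits[mid]):
            bad = mid
        else:
            good = mid
    return commits[bad], test_runs

# Illustrative data: 64 commits, with the regression introduced at commit 40.
commit, runs = first_bad_commit(list(range(64)), lambda c: c >= 40)
print(commit, runs)
```

The number of test runs grows with the logarithm of the commit range, so a well-aimed targeted test at merge time only saves work if the guess about the affected variant is good.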

@ghost

ghost commented Oct 29, 2020

is everything okay

@ddugovic
Owner Author

ddugovic commented Oct 29, 2020

Lately #580 and #585, timeouts on Travis CI, finishing the AppVeyor build, and the SF 12 release have diverted most of my attention, and I haven't been able to focus on this.
