Controlled-bias_Sudoku_generator_and_collection

A large collection (~6M) of minimal Sudoku puzzles generated with a controlled-bias generator, intended to allow the easy computation of unbiased statistics + associated results about ratings, correlation and distributions.

1. INTRODUCTION

A Sudoku puzzle is called minimal if it has a unique solution and any puzzle obtained from it by deleting one clue (or "given") has several solutions.

In general, in statistical studies, only minimal puzzles are worth considering, because adding clues to a puzzle eventually leads to huge amounts of trivial puzzles that would completely dominate the statistics.

[Note that this may have to be reconsidered in some contexts (e.g. in the search for the "hardest" puzzles), after the introduction of new steps in the generation methods suggests that, in order to avoid redundancies, only expansions of minimals by Singles should be taken into account, because two puzzles with the same such expansion are "the same" for all practical purposes (Singles being trivial resolution steps). However, in the case of the present work, it is very unlikely that two of the minimal puzzles generated by the controlled-bias generator (or by any of the generators considered here) have the same expansion.]

There are two standard ways of creating random minimal Sudoku puzzles: bottom-up or top-down. In the bottom-up approach, one starts from an empty grid and progressively adds givens; in the top-down aproach, one starts from a complete Sudoku grid and progresively deletes givens. In both cases, one has to test uniqueness and minimality at each step, so that the process is complicated by possibly having to delete added givens or to re-add deleted givens. Both algorithms are nevertheless extremely fast and they can produce thousands of minimal puzzles per minute.

Unfortunately, it appeared in the late 2000s that both bottom-up and top-down Sudoku generators are strongly biased with respect to the number of clues of the puzzles thus obtained and that there is no a priori way of knowing the bias.

In 2009, I defined a new type of generator, which I called the controlled-bias generator. Compared to the previous two kinds, the controlled-bias generator is extremely slow, it is still strongly biased (though much less than the other two kinds), but the bias is precisely known.

As a result, using collections produced by this generator, one can easily compute unbiased statistics of any feature of Sudoku puzzles (unbiased distribution of the number of clues, unbiased distribution of any rating...).
As an accessory result, one can also compute (with 0.065% relative error):
. an estimate of the total number of minimal Sudoku puzzles: 3.1055e+37
. and an estimate of the total number of non-isomorphic minimal Sudoku puzzles: 2.5477e+25.

See [Berthier 2009] or [PBCS] for details of how the three kinds of generators work.

2. THE CONTENTS OF THIS REPOSITORY

This repository consists of four main folders (PROGRAMS, SMALL-CB-COLLECTIONS, GLOBAL-CB-RESULTS and DOCS), this README file and a "CB-stats.clp" file.

You don't need to know how to use the programs or to re-run them in order to use the provided controlled-bias collections.
These collections can be used independently of the programs, which is fortunate because generating controlled-bias collections is much slower than bottom-up or top-down ones.
The controlled-bias collections provided here are so large that they should be enough for the computation of almost any unbiased statitistcs with very high precision.

3. THE "PROGRAMS" FOLDER CONTAINS THE VARIOUS GENERATORS MENTIONNED IN SECTION 1

The four programs given here can be used as standalone. suexg-cb and suexg-cb-Paul-optim48 can also be used with an input stream of complete Sudoku grids.

suexg-bu is a basic bottom-up generation program.
suexg-td is a basic top-down generation program.
suexg-cb is one of the first versions of the controlled-bias generator.
suexg-cb-Paul-optim48 is the speed-optimised final version of the controlled-bias generator.

The two versions suexg-cb and suexg-cb-Paul-optim48 have been used to generate the small-cb-collections and the resulting 5,926,343 puzzles. They differ only in the speed of producing puzzles.

This folder also containes the FGPX version of SER, used to compute the FGPXnoU ratings.

3.1 How to compile the programs

Unless you are using MasOS, for which precompiled versions are provided, you will need to compile the three programs before using them, as follows:

gcc -O3 -Wall suexg-bu.c -o suexg-bu[.exe]
gcc -O3 -Wall suexg-td.c -o suexg-td[.exe]
gcc -O3 -Wall suexg-cb.c -o suexg-cb[.exe]

In folder suexg-cb-Paul-optim48:
g++ -O3 -I. suexg-cb.cpp bb_sudoku_solver.cpp -o suexg-cb-Paul[.exe]

3.2 How to run the programs

./suexg-bu[.exe] s n
./suexg-td[.exe] s n [file]
./suexg-cb[.exe] s n [file]

In folder suexg-cb-Paul-optim48:
./suexg-cb-Paul[.exe] s n [file]

the .exe suffix is for Windows users.

s is the seed (an integer used as the seed of the random numbers generator);
if the optional argument [file] is not present, n is the number of puzzles generated;
if the optional argument file is present, it must be a file with a full grid in each line:
*- in suexg-td, n puzzles are generated for each input line
*- in suexg-cb and suexg-cb-Paul-optim48, the grid in each input line is used n times to try to obtain a puzzle (but no result is guaranteed for this line if n is not large);
*- in suexg-cb and suexg-cb-Paul-optim48, the optional file argument can be defined as "-" and the algorithm will take a stream of complete grids as input.

3.3 About the creation of the controlled-bias collections:

Different versions of the controlled-bias generator have been used (with increasing speeds, but all with the same controlled-bias properties). The last but one version was the present version of suexg-cb (as largely modified by Paul Isaacsson); it doesn't have much in common with the original suexg program (appearing here as "suexg-td"). The last version was based on Paul Isaacsson's solver.
The input was a stream of complete Sudoku grids, obtained by gsf's compressed full list of Sudoku Grids (modulo isomorphisms) and his decompressor. Unfortunately, gsf's compressed full list is too large for GitHub.

4. THE "SMALL-CB-COLLECTIONS" FOLDER CONTAINS INDEPENDENT CONTROLLED-BIAS COLLECTIONS OF PUZZLES GENERATED BY THE CONTROLLED-BIAS GENERATOR

In practice, it is easier to do separate computations on several small collections than on a very large one, and possibly assemble the results only at the end.

Also, although one could easily concatenate all the "small" puzzle files in the SMALL-CB-COLLECTIONS folder into a large one with 5,926,343 minimal puzzles, this would not allow to choose arbitrary subparts of this global file: the results would not be controlled-bias.

Each of the 129 puzzles files in this SMALL-CB-COLLECTIONS folder consists of a minimal collection necessary to run controlled-bias computations.

Contrary to the first version of this repository, each puzzle file now has its own sub-folder, which contains both the puzzles themselves and associated results. Of course, this new organisation has no impact on the global statistical results first published in 2009 ([Berthier 2009]) and recalled in each edition of [PBCS], such as the unbiased distributions of clues or of the W-rating, or the correlation between SER and W, or the estimates for the numbers of minimal Sudoku puzzles...

The name of each folder is the seed used to generate random numbers in the controlled-bias generator (but this is irrelevant for all practical purposes).

Each folder with name <xxx> contains various files:

an <xxx>-puzzles.txt file, which contains the puzzles (one per line) obtained using seed <xxx> for the random numbers generator used by the controlled-bias generator;
an <xxx>-nb-clues.txt file, which contains the corresponding numbers of clues;
an <xxx>-nb-cands-after-BRT.txt file, which contains the corresponding numbers of candidates remaining after rules in BRT (i.e. Singles plus eliminations by direct contradictions arising from them) have been applied until no more of these rules can be applied;
an <xxx>-nb-cands-after-W1.txt file, which contains the corresponding numbers of candidates remaining after rules in W1 (i.e. BRT plus whips[1]) have been applied;
an <xxx>-W-ratings.txt file, which contains the corresponding W-ratings;
an <xxx>-gW-ratings.txt file, which contains the corresponding gW-ratings;
an <xxx>-B-ratings.txt file, which contains the corresponding B-ratings;
an <xxx>-gB-ratings.txt file, which contains the corresponding gB-ratings;
an <xxx>-SER.txt file, which contains the corresponding SER ratings;
an <xxx>-FPGXnoU-ratings.txt file, which contains the corresponding FGPX ratings with rules for uniqueness disabled. (FPGX is a faster version of SER, with some problems of isomorphism-dependency corrected and with rules for uniqueness de-activated. It is now included in the PROGRAMS folder, where more details about it can be found in the local README.md file.).

Puzzle files have different sizes (that you can find in "DOCS/Total-puzzles.pdf"), due to the generating method. The smallest ones have about 21,000 puzzles.

Some files (the last ones in the collection, starting from "397") are about four times larger than the first ones (and thus have about 85,000 puzzles), because each complete grid was submitted to the controlled-bias generator four times instead of once at a time in the other cases. This has no impact on controlled-bias computations based on them (other than the time needed to process them and the precisions of the results based on them). Remember that relative precision of the results (expressed as a percentage of the values obtained) varies as 1/sqr(number of puzzles used) and this is true of both the controlled-bias results and the unbiased ones derived from them.

All the files in the same "small" folder have the same length.

As for the tools used for the calculations:

All the calculations for nb-clues, nb-cands-after-BRT, nb-cands-after-W1, W-ratings and gW-ratings were done with the current version of SudoRules-V20.1 (the Sudoku part of [CSP-Rules]) as of this writing (2024, Feb. 02).
All the calculations for the B-ratings were done with the current version of [SHC] as of this writing.
All the calculations for the gB-ratings were done with a non-published version of [SHC] (slower than the current one that doesn't have g-candidates).
All the calculations for the FPGX-noU ratings were done with [FPGX], a slightly faster variant of SER, with some of the isomorphism-dependency problems of SER solved, including modifications by "1to9only", "iksudoku" and "creint").

5. THE "GLOBAL-CB-RESULTS" FOLDER CONTAINS VARIOUS UNIX SRCIPTS FOR ASSEMBLING GLOBAL RESULTS FROM THE SMALL COLLECTIONS

Whereas the previous release of this project had global files for the W ratings, nb-clues and SER, this is no longer the case. Instead, there are Unix scripts for assembling such files from the available small collections.
More precisely, for each file type <ttt> in any of the SMALL-CB-COLLECTIONS sub-folders, the GLOBAL-CB-RESULTS folder contains an assemble-<ttt>.clp file (e.g. "assemble-puzzles.txt", "assemble-nb-clues.txt", "assemble-W-ratings.txt"...), which includes:

a Unix script for generating the corresponding global file,
a CLIPS function for generating this Unix script.
The Unix scripts must be run from the "GLOBAL-CB-RESULTS" folder.

Global files resulting from the above scripts are named all-<ttt>.text (e.g. "all-puzzles.txt", "all-nb-clues.txt", "all-W-ratings.txt", "all-gW-ratings.txt"...).

Once assembled, all the global files should have exactly 5,926,343 lines.

However, you can use the scripts in almost unrestricted ways by selecting only the parts of the data you want to assemble. Of course, the selections should be consistent from one script to another.

6. THE "DOCS" FOLDER CONTAINS:

a "Total.pdf" file, recalling the number of puzzles per file in the "SMALL-CB-COLLECTIONS" folder. The order in this file is also the order in which they must be assembled in case you wanted a unique large file of puzzles. It also corresponds to the order used by the scripts for assembling small result files into big ones in the "GLOBAL-CB-RESULTS" folder. [Note that this order follows the increasing values of seeds, contrary to the previous version.];
my article [Berthier 2009];
an old "Comparisons.pdf" file stating the results of comparing the various generators in the "PROGRAMS" folder (plus other results);
an old "W-classification-results.pdf" file stating the detailed results of the W classification, obtained thanks to the programs and data in this repository and to the CSP-Rules software; it repeats much of [Berthier 2009] but it provides more details and results than a single article can do.

7. THE "CB-stats.clp" FILE

It contains:

SudoRules commands for computing the statistical results exploiting the full collection of puzzles and ratings;
results of their execution;
comments on the results.

It is the most up-to-date version of these results, with all the calculations done on the whole collection of 5,926,343 controlled-bias puzzles.
Notice however that, in addition to the results here, [CSP-Rules-Examples] has a cbg-000 folder in its Sudoku examples folder where more ratings are computed and compared (e.g. S+W, S+gW, W+uniqueness...). Considering the computation times all these ratings required, this was done for only a small part (the first 21,375) of the full controlled-bias collection (and this was enough for the additional conclusions drawn for them).

8. LICENSE

The Controlled-bias_Sudoku_generator_and_collection is distributed under the GNU GPL v3.0 license.

9. ACKNOWLEDGMENTS

Thanks are due to “Eleven” for implementing the first modification (suexg-cb) of a well-known top-down generator (suexg, written in C) to make it compliant with my specification of controlled-bias, and then several faster versions of it; this allowed to turn the whole idea into reality. Thanks to Paul Isaacson for adapting Brian Turner’s fast solver so that it could be used to replace the solver part of suexg. Thanks to Glenn Fowler (alias gsf) for providing an a priori unbiased source of complete grids: the full compressed collection of their equivalence classes (with respect to Sudoku isomorphisms) together with a fast decompressor allowing to use it as the direct input to the generator. Thanks also, for discussions and/or various contributions, to Allan Barker, Coloin, David P. Bird, Mike Metcalf, Red Ed (who was first to suggest the existence of a bias in the current generators). The informal collaboration that the controlled-bias idea sprouted on the late Sudoku Player's Forum was very productive: due to several independent optimisations, the last version of suexg-cb (which does not retain much of the original suexg code) is 200 times faster than the first.
Thanks are also due to François Cordoliani for developing the Sudoku Hierarchical Classifier ([SHC]). It allowed to compute the B and gB ratings for the whole collection, which would have taken much too long with the SudoRules implementation of braids or g-braids.

10. REFERENCES

All the references mentioned below are available:

in free pdf form on ResearchGate (https://www.researchgate.net/profile/Denis-Berthier/research),
in free pdf form in the Publications folder of CSP-Rules on GitHub (https://github.com/denis-berthier/CSP-Rules-V2.1),
in print form on Lulu.com (https://www.lulu.com/spotlight/berthier) and on any of the main book sellers websites.

Articles

[Berthier 2009]: BERTHIER D., Unbiased Statistics of a CSP - A Controlled-Bias Generator, International Joint Conferences on Computer, Information, Systems Sciences and Engineering (CISSE 09), December 4-12, 2009, Springer. Published as a chapter of Innovations in Computing Sciences and Software Engineering, Khaled Elleithy Editor, pp. 11-17, Springer, 2010. Also available here, in the Docs folder.

Books

[HCCS] : BERTHIER D., Hierachical Classifications in Constraint Satisfaction, Lulu.com Publishers, October 2023.
[PBCS1]: BERTHIER D., Pattern-Based Constraint Satisfaction and Logic Puzzles, Lulu.com Publishers, July 2012.
[PBCS2]: BERTHIER D., Pattern-Based Constraint Satisfaction and Logic Puzzles (Second Edition), Lulu.com Publishers, November 2015.
[PBCS3]: BERTHIER D., Pattern-Based Constraint Satisfaction and Logic Puzzles (Third Edition), Lulu.com Publishers, November 2021.
[PBCS]: any of [PBCS1], [PBCS2] or [PBCS3].

Software

[CSP-Rules]: BERTHIER D., CSP-Rules-V2.1, August 2020, https://github.com/denis-berthier/CSP-Rules-V2.1
[CSP-Rules-Examples]: BERTHIER D., August 2020, https://github.com/denis-berthier/CSP-Rules-Examples
[FPGX]: https://drive.google.com/file/d/1vxl21CkrY_r6Vv5Cbc8DIHmgo_F2S02_/view?usp=share_link
[SHC]: CORDOLIANI F., The Sudoku Hierarchical Classifier, October 2023, https://github.com/denis-berthier/Sudoku_Hierarchical_Classifier

Name		Name	Last commit message	Last commit date
Latest commit History 92 Commits
Docs		Docs
GLOBAL-CB-RESULTS		GLOBAL-CB-RESULTS
PROGRAMS		PROGRAMS
SMALL-CB-COLLECTIONS		SMALL-CB-COLLECTIONS
.gitattributes		.gitattributes
.gitignore		.gitignore
CB-stats.clp		CB-stats.clp
LICENSE		LICENSE
README.md		README.md

License

denis-berthier/Controlled-bias_Sudoku_generator_and_collection

Folders and files

Latest commit

History

Repository files navigation