Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding shrinking flag for cover and fastcover #1656

Merged
merged 43 commits into from
Jun 27, 2019
Merged

Adding shrinking flag for cover and fastcover #1656

merged 43 commits into from
Jun 27, 2019

Conversation

Tyler-Tran
Copy link
Contributor

Created a flag that allows the dictionary builder to select the smallest dictionary that is within a 1.01 regression of the largest dictionary.

lib/dictBuilder/fastcover.c Outdated Show resolved Hide resolved
lib/dictBuilder/cover.c Outdated Show resolved Hide resolved
lib/dictBuilder/cover.c Outdated Show resolved Hide resolved
lib/dictBuilder/cover.c Outdated Show resolved Hide resolved
lib/dictBuilder/cover.c Outdated Show resolved Hide resolved
lib/dictBuilder/cover.c Outdated Show resolved Hide resolved
lib/dictBuilder/cover.c Outdated Show resolved Hide resolved
lib/dictBuilder/zdict.h Outdated Show resolved Hide resolved
programs/zstdcli.c Outdated Show resolved Hide resolved
lib/dictBuilder/cover.c Outdated Show resolved Hide resolved
@terrelln
Copy link
Contributor

I haven't reviewed yet, but can you please add a test case in tests/playTests.sh, where the other dictionary builder tests are?

@terrelln
Copy link
Contributor

Also, lets add a test case in tests/fuzzer.c where the other dict builder tests are

Copy link
Contributor

@terrelln terrelln left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looking good, just some nits.

lib/dictBuilder/cover.c Outdated Show resolved Hide resolved
lib/dictBuilder/cover.c Outdated Show resolved Hide resolved
lib/dictBuilder/cover.c Outdated Show resolved Hide resolved
lib/dictBuilder/cover.c Outdated Show resolved Hide resolved
lib/dictBuilder/cover.c Outdated Show resolved Hide resolved
Copy link
Contributor

@terrelln terrelln left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly nits, but 2 changes:

  1. I think there is a bug when no smaller dictionary is found.
  2. Please switch to --train-cover=shrink=1 instead of --shrink-dict. See how split is implemented.

lib/dictBuilder/cover.c Outdated Show resolved Hide resolved
lib/dictBuilder/cover.c Outdated Show resolved Hide resolved
lib/dictBuilder/cover.c Outdated Show resolved Hide resolved
lib/dictBuilder/cover.c Outdated Show resolved Hide resolved
lib/dictBuilder/cover.c Outdated Show resolved Hide resolved
programs/zstdcli.c Outdated Show resolved Hide resolved
programs/zstdcli.c Outdated Show resolved Hide resolved
tests/fuzzer.c Show resolved Hide resolved
tests/playTests.sh Outdated Show resolved Hide resolved
tests/playTests.sh Outdated Show resolved Hide resolved
Copy link
Contributor

@terrelln terrelln left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update programs/zstd.1.md, which is the man page for zstd, to say what shrink does. Search for --train-cover and --train-fastcover to see where to update.


BYTE *const largestDictbuffer = (BYTE * const)malloc(dictContentSize);
BYTE *const candidateDictBuffer = (BYTE * const)malloc(dictContentSize);
double regressionTolerance = params.shrinkDict + 1.00;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I liked taking the parameter as an unsigned, but we should distinguish:

  1. Auto shrinking enabled vs disabled. If you need fast dictionary training, you don't want auto shrinking.
    It can be disabled by default.
  2. The regression we allow. A regression of 0 could make sense if you want to take a smaller dictionary only if it is better.

I think we could take two flags shrinkDict and shrinkDictMaxRegression, both unsigned. It was a mistake to take split as a double, so I don't want to repeat it here.

On the library side, we can default to disabled, and the regression defaults to 0.

On the CLI, we can default to disabled, but we could say if someone sets --train-fastcover=shrink we enable with default to a regression of 1, but --train-fastcover=shrink=0 or --train-fastcover=shrink=2 should both still work.

dict, dictBufferCapacity);
dictBufferCapacity = selection.dictSize;
totalCompressedSize = selection.totalCompressedSize;
memcpy(dict, selection.dictContent, dictBufferCapacity);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We no longer need dict. If we move the declaration of selection to the top scope, we can refactor COVER_best_finish() to take a COVER_dictSelection instead of the three parameters (dict, totalCompressedSize, dictBufferCapacity).

COVER_dictSelection_t selection = COVER_dictSelectionError(ERROR(GENERIC));
/* ... */
{
  const size_t tail = COVER_buildDictionary(/* ... */);
  selection = COVER_selectDict(/* ... */);
}
_cleanup:
  COVER_best_finish(data->best, parameters, selection); /* Copies the dict out of selection */
  COVER_dictSelectionFree(selection);

``

@@ -311,10 +311,15 @@ static unsigned parseCoverParameters(const char* stringPtr, ZDICT_cover_params_t
params->splitPoint = (double)splitPercentage / 100.0;
if (stringPtr[0]==',') { stringPtr++; continue; } else break;
}
if (longCommandWArg(&stringPtr, "shrink=")) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets allow shrink to set a default regression amount of 1.

if (longCommandWArg(&stringPtr, "shrink")) {
  unsigned regression = kDefaultRegression;
  if (stringPtr[0] == '=') {
    stringPtr++;
    regression = readU32FromChar(&stringPtr);
  }
  /* ... */

Copy link
Contributor

@terrelln terrelln left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the CI is broken.

@@ -94,7 +94,8 @@ typedef struct {
unsigned steps; /* Number of steps : Only used for optimization : 0 means default (40) : Higher means more parameters checked */
unsigned nbThreads; /* Number of threads : constraint: 0 < nbThreads : 1 means single-threaded : Only used for optimization : Ignored if ZSTD_MULTITHREAD is not defined */
double splitPoint; /* Percentage of samples used for training: Only used for optimization : the first nbSamples * splitPoint samples will be used to training, the last nbSamples * (1 - splitPoint) samples will be used for testing, 0 means default (1.0), 1.0 when all samples are used for both training and testing */
double shrinkDict; /* Shrink dictionaries to select the smallest dictionary within the % specified by the auto-shrink-dict flag: 0 disables shrinking */
unsigned shrinkDict; /* Train dictionaries to shrink in size starting from the minimum size and selects the smallest dictionary within the max regression. 0 means no shrinking and 1 means shrinking */
unsigned shrinkDictMaxRegression; /* Sets the value that determines how much worse a dictionary can be than the dictionary of max dictionary size */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you describe what this value means? E.g. A smaller dict is allowed to be shrinkDictMaxRegression worse than the full sized dictionary.

@@ -179,8 +179,8 @@ static int usage_advanced(const char* programName)
DISPLAY( "\n");
DISPLAY( "Dictionary builder : \n");
DISPLAY( "--train ## : create a dictionary from a training set of files \n");
DISPLAY( "--train-cover[=k=#,d=#,steps=#,split=#,shrink=#] : use the cover algorithm with optional args\n");
DISPLAY( "--train-fastcover[=k=#,d=#,f=#,steps=#,split=#,accel=#,shrink=#] : use the fast cover algorithm with optional args\n");
DISPLAY( "--train-cover[=k=#,d=#,steps=#,split=#,shrink] : use the cover algorithm with optional args\n");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shrink[=#] brackets means optional arguments

@@ -377,7 +394,8 @@ static ZDICT_cover_params_t defaultCoverParams(void)
params.d = 8;
params.steps = 4;
params.splitPoint = 1.0;
params.shrinkDict = 0.0;
params.shrinkDict = 0;
params.shrinkDictMaxRegression = 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kDefaultRegression

@@ -390,7 +408,8 @@ static ZDICT_fastCover_params_t defaultFastCoverParams(void)
params.steps = 4;
params.splitPoint = 0.75; /* different from default splitPoint of cover */
params.accel = DEFAULT_ACCEL;
params.shrinkDict = 0.0;
params.shrinkDict = 0;
params.shrinkDictMaxRegression = 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kDefaultRegression

free(data);
COVER_map_destroy(&activeDmers);
if (dict) {
free(dict);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like you're leaking dict.

free(data);
free(segmentFreqs);
free(dict);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like you're leaking dict.

Copy link
Contributor

@terrelln terrelln left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A one line fix and this looks good to go!

@@ -929,6 +934,111 @@ void COVER_best_finish(COVER_best_t *best, size_t compressedSize,
}
}

COVER_dictSelection_t COVER_dictSelectionError(size_t error) {
COVER_dictSelection_t selection = { NULL, error, error };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to set dictSize == 0 here.

In COVER_best_finish() when totalCompressedSize is an error, but still < -1 we will allocate space for the dictionary. We will try to allocate ~2^64 bytes, which will probably fail, but still we don't want to do that.

The dictionary size should always be the right number here, which is 0.

@terrelln terrelln merged commit c55d2e7 into facebook:dev Jun 27, 2019
Tyler-Tran added a commit to Tyler-Tran/zstd that referenced this pull request Jul 22, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants