Ability to generate unique values #232

giunto · 2022-07-17T18:24:02Z

Currently there isn't a way to enforce that random values from faker are different. This is mainly an issue when writing tests with ids or keys that cannot be the same. The data produced by faker is usually unique enough, but there is still a small chance that tests will randomly fail if we're not careful.

My solution is to have a unique faker that keeps a store of every value that it has generated. It has a base method that takes in a supplier and ensures that the value from the supplier has not been generated before during the unique faker's lifespan. For example:

// These two names will never be the same
faker.unique().get(() -> faker.name().firstName());
faker.unique().get(() -> faker.name().firstName());

// If a generated last name happens to match an earlier first name, it is
// regenerated, so it is guaranteed to be unique as well
faker.unique().get(() -> faker.name().lastName());

The store is kept at the unique faker level, so the uniqueness is only persisted during the lifespan of the faker object. If there are two different fakers they could potentially generate the same values.

Would there be any issues with having a faker like this? Here is the full implementation that I had in mind:

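import static java.lang.System.currentTimeMillis;

import java.util.HashSet;
import java.util.Set;
import java.util.function.Supplier;

// Note: Faker must also be imported from the library's own package.
public class Unique {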
    private final Faker faker;
    private final Set<Object> uniqueValueStore;

    private static final long LOOP_TIMEOUT_MILLIS = 10000;

    public Unique(Faker faker) {
        this.faker = faker;
        this.uniqueValueStore = new HashSet<>();
    }

    public <T> T get(Supplier<T> supplier) {
        T value = supplier.get();
        long millisBeforeCheck = currentTimeMillis();
        while (uniqueValueStore.contains(value)) {
            handleInfiniteLoop(millisBeforeCheck);
            value = supplier.get();
        }
        uniqueValueStore.add(value);
        return value;
    }

    public String resolve(String key) {
        return get(() -> faker.resolve(key));
    }

    public String expression(String expression) {
        return get(() -> faker.expression(expression));
    }

    public int nextInt() {
        return get(() -> faker.random().nextInt());
    }

    public int nextInt(int n) {
        return get(() -> faker.random().nextInt(n));
    }

    public int nextInt(int min, int max) {
        return get(() -> faker.random().nextInt(min, max));
    }

    public long nextLong() {
        return get(() -> faker.random().nextLong());
    }

    public long nextLong(long n) {
        return get(() -> faker.random().nextLong(n));
    }

    public long nextLong(long min, long max) {
        return get(() -> faker.random().nextLong(min, max));
    }

    private void handleInfiniteLoop(long initialMillis) {
        if (currentTimeMillis() - initialMillis > LOOP_TIMEOUT_MILLIS) {
            throw new RuntimeException("Unable to get unique value from supplier");
        }
    }
}

snuyanzin · 2022-07-17T20:02:11Z

Unique values are a tricky thing...

There are at least two issues that are unsolved, and it is currently not clear how to solve them.

1. OutOfMemory. For cases with a relatively small number of elements it should not be a problem; however, people sometimes generate millions of records, as in DiUS/java-faker#663 ("create 100 million object cost lot of time"). In that case uniqueValueStore will consume a lot of memory.
2. Some generated values come from a limited set, e.g. one defined in a *.yml file. Imagine that en/name.yml defines 473 unique last names. What should happen if I try to generate 1000 unique last names with the solution above? I guess it will get stuck in an endless loop.

giunto · 2022-07-17T20:48:23Z

OutOfMemory. For cases with a relatively small number of elements it should not be a problem; however, people sometimes generate millions of records, as in DiUS/java-faker#663. In that case uniqueValueStore will consume a lot of memory.

That's a good point that I didn't really consider. The intended use case I had for this was for generating small data sets of just a few values. For unique integers and longs, tracking the last generated value and incrementing it might work:

private int uniqueInt = 0;

public int nextInt() {
    // x is a placeholder for the maximum step size between consecutive values
    uniqueInt += faker.random().nextInt(1, x);
    return uniqueInt;
}

This would raise another issue though, where it's possible for the value to overflow, so it might not be the best solution for large amounts of data.
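The overflow concern can at least be made visible rather than silent. The following is a minimal sketch, not part of faker: the class name `IncreasingIntSequence` and its `maxStep` and `start` parameters are assumptions made for illustration, and `java.util.Random` stands in for `faker.random()`. It uses `Math.addExact`, which throws an `ArithmeticException` instead of wrapping around.

```java
import java.util.Random;

// Hypothetical helper for illustration; the class and parameter names are assumptions.
final class IncreasingIntSequence {
    private final Random random;
    private final int maxStep;
    private int last;

    IncreasingIntSequence(Random random, int maxStep, int start) {
        if (maxStep < 1) {
            throw new IllegalArgumentException("maxStep must be at least 1");
        }
        this.random = random;
        this.maxStep = maxStep;
        this.last = start;
    }

    int next() {
        // Random step in [1, maxStep], so each value is strictly greater than the previous one.
        int step = 1 + random.nextInt(maxStep);
        // Math.addExact throws ArithmeticException instead of silently wrapping around.
        last = Math.addExact(last, step);
        return last;
    }

    public static void main(String[] args) {
        IncreasingIntSequence ids = new IncreasingIntSequence(new Random(), 10, 0);
        for (int i = 0; i < 5; i++) {
            System.out.println(ids.next());
        }

        // Starting at Integer.MAX_VALUE, the very next step cannot fit in an int.
        IncreasingIntSequence nearLimit =
                new IncreasingIntSequence(new Random(), 10, Integer.MAX_VALUE);
        try {
            nearLimit.next();
        } catch (ArithmeticException e) {
            System.out.println("Overflow detected: " + e.getMessage());
        }
    }
}
```

Values produced this way are unique and need no store, but they are strictly increasing, so they are less random than independent draws.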

Some generated values come from a limited set, e.g. one defined in a *.yml file. Imagine that en/name.yml defines 473 unique last names. What should happen if I try to generate 1000 unique last names with the solution above? I guess it will get stuck in an endless loop.

The solution I made does handle infinite loops by timing out and throwing an exception if the unique value check takes longer than 10 seconds. But like I said above, the solution is mostly intended for small amounts of data where the user is mindful of the random data they are generating.
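To make the timeout behaviour concrete, here is a rough stand-alone sketch of the same retry loop. It is an illustration under assumptions, not the proposed class: `TimedUniqueStore` is a hypothetical name, the timeout is a constructor argument so it can be shortened to 50 ms, and a three-element list stands in for a small *.yml-backed pool.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical stand-alone version of the retry loop; not part of any library.
final class TimedUniqueStore {
    private final Set<Object> seen = new HashSet<>();
    private final long timeoutMillis;

    TimedUniqueStore(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    <T> T get(Supplier<T> supplier) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        T value = supplier.get();
        // Set.add returns false when the value was already present, so retry.
        while (!seen.add(value)) {
            if (System.currentTimeMillis() > deadline) {
                throw new IllegalStateException("Unable to get unique value from supplier");
            }
            value = supplier.get();
        }
        return value;
    }

    public static void main(String[] args) {
        // A pool of three values stands in for a small file-backed data set.
        List<String> pool = List.of("Smith", "Jones", "Brown");
        Random random = new Random();
        TimedUniqueStore store = new TimedUniqueStore(50);

        for (int i = 0; i < 3; i++) {
            System.out.println(store.get(() -> pool.get(random.nextInt(pool.size()))));
        }

        // The pool is now exhausted, so the fourth request can only time out.
        try {
            store.get(() -> pool.get(random.nextInt(pool.size())));
        } catch (IllegalStateException e) {
            System.out.println("Pool exhausted: " + e.getMessage());
        }
    }
}
```

With the 10 second timeout from the proposal, the fourth call would block for the full 10 seconds before failing, which is the cost of not knowing the pool size up front.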

snuyanzin · 2022-07-17T21:06:11Z

One more question: why do all the methods share the same store of unique values?
It means that if somewhere in the code I call nextInt() and it returns, e.g., 42, then a later call to nextLong() will not be allowed to return 42 because it is already in uniqueValueStore. Is this intentional?

giunto · 2022-07-17T21:14:28Z

One more question: why do all the methods share the same store of unique values?
It means that if somewhere in the code I call nextInt() and it returns, e.g., 42, then a later call to nextLong() will not be allowed to return 42 because it is already in uniqueValueStore. Is this intentional?

Since Long 42 is a different type than Integer 42, it would be possible for nextLong to return 42 after nextInt returned it.
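A quick check of that claim, using only the JDK: a `HashSet<Object>` compares elements with `equals`, and `Long.equals` returns false for any argument that is not a `Long`. The class and method names below are hypothetical and exist only for the demonstration.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical demo class; shows how boxed Integer and Long behave in a Set<Object>.
final class BoxedEqualityDemo {
    // Stores one value, then reports whether the set considers the probe already present.
    static boolean storeContains(Object stored, Object probe) {
        Set<Object> store = new HashSet<>();
        store.add(stored);
        return store.contains(probe);
    }

    public static void main(String[] args) {
        System.out.println(storeContains(42, 42L)); // false: Integer 42 and Long 42 are not equal
        System.out.println(storeContains(42, 42));  // true: same type and same value
    }
}
```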

It was intentional for everything to share the same uniqueValueStore. My thinking was that everything returned from a method on faker.unique should be something that wasn't returned previously. In theory it would be possible to track return values per method, but that would add extra complexity that I didn't see as beneficial.
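For reference, per-method tracking could look roughly like the sketch below. It assumes each caller passes a scope key such as the method name; `ScopedUniqueStore` and `markIfNew` are hypothetical names, not an existing API.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: one set of seen values per scope instead of a single shared set.
final class ScopedUniqueStore {
    private final Map<String, Set<Object>> seenByScope = new HashMap<>();

    // Records the value under the scope and returns true only if it was new in that scope.
    boolean markIfNew(String scope, Object value) {
        return seenByScope.computeIfAbsent(scope, key -> new HashSet<>()).add(value);
    }

    public static void main(String[] args) {
        ScopedUniqueStore store = new ScopedUniqueStore();
        System.out.println(store.markIfNew("firstName", "Lee")); // true: first use in this scope
        System.out.println(store.markIfNew("lastName", "Lee"));  // true: a different scope
        System.out.println(store.markIfNew("firstName", "Lee")); // false: repeated within the scope
    }
}
```

The extra cost is one map lookup per call plus the need to choose scope keys consistently, which is the added complexity mentioned above.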

bodiam · 2022-07-18T02:21:18Z

I think it's an interesting idea, but it sounds quite brittle. I don't think there's any way to make this a reliable feature if you're generating large amounts of data. It's almost impossible to know how many unique values can come from a yml file, which complicates things.

What's wrong with generating a large amount of data, putting it in a set, and taking the data you need after that?
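That approach might be sketched as follows. `OversampleUnique.take` is a hypothetical helper, and the `samples` budget is an assumption the caller has to pick; as a small shortcut the loop stops early once enough distinct values have been collected.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical helper: draw up to `samples` values, keep the distinct ones, return `wanted` of them.
final class OversampleUnique {
    static <T> List<T> take(Supplier<T> supplier, int wanted, int samples) {
        // LinkedHashSet drops duplicates while preserving generation order.
        Set<T> distinct = new LinkedHashSet<>();
        for (int i = 0; i < samples && distinct.size() < wanted; i++) {
            distinct.add(supplier.get());
        }
        if (distinct.size() < wanted) {
            throw new IllegalStateException(
                    "Only " + distinct.size() + " distinct values after " + samples + " samples");
        }
        return new ArrayList<>(distinct);
    }

    public static void main(String[] args) {
        // A counter stands in for a faker call; every draw is distinct here.
        AtomicInteger counter = new AtomicInteger();
        System.out.println(take(counter::getAndIncrement, 5, 100)); // [0, 1, 2, 3, 4]
    }
}
```

The work is bounded by `samples`, so a source with too few distinct values fails quickly instead of looping, and the set can be discarded as soon as the values have been taken.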

I have no objection to a feature like this, but I'm a bit hesitant to add a feature that only "sometimes" works.

giunto · 2022-07-20T22:46:22Z

I'll go ahead and close this issue. I don't think there's a good way to guarantee uniqueness with large amounts of data without running into issues with memory or throwing an exception. It seems like it's better to let the user decide how they want uniqueness handled. Thanks for the feedback!

giunto closed this as completed Jul 20, 2022

snuyanzin mentioned this issue Jul 30, 2022

Generator of unique values for file based generators #265

Closed
