Ability to generate unique values #232

Closed
giunto opened this issue Jul 17, 2022 · 6 comments

@giunto
Contributor

giunto commented Jul 17, 2022

Currently there isn't a way to enforce that random values from faker are different. This is mainly an issue when writing tests with ids or keys that cannot be the same. The data produced by faker is usually unique enough, but there is still a small chance that tests will randomly fail if we're not careful.

My solution is to have a unique faker that keeps a store of every value that it has generated. It has a base method that takes in a supplier and ensures that the value from the supplier has not been generated before during the unique faker's lifespan. For example:

// These two names will never be the same
faker.unique().get(() -> faker.name().firstName());
faker.unique().get(() -> faker.name().firstName());

// If the last name has the possibility of being the same as the first name, it will be 
// regenerated and guaranteed to be unique as well
faker.unique().get(() -> faker.name().lastName());

The store is kept at the unique faker level, so the uniqueness is only persisted during the lifespan of the faker object. If there are two different fakers they could potentially generate the same values.

Would there be any issues with having a faker like this? Here is the full implementation that I had in mind:

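import static java.lang.System.currentTimeMillis;

import java.util.HashSet;
import java.util.Set;
import java.util.function.Supplier;

// Faker is the library's main class; its import is omitted here.
public class Unique {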
    private final Faker faker;
    private final Set<Object> uniqueValueStore;

    private static final long LOOP_TIMEOUT_MILLIS = 10000;

    public Unique(Faker faker) {
        this.faker = faker;
        this.uniqueValueStore = new HashSet<>();
    }

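    // Calls the supplier until it yields a value that is not yet in the store,
    // then records that value and returns it.
    public <T> T get(Supplier<T> supplier) {
        T value = supplier.get();
        long millisBeforeCheck = currentTimeMillis();
        while (uniqueValueStore.contains(value)) {
            // Throws once the retry loop has run longer than LOOP_TIMEOUT_MILLIS.
            handleInfiniteLoop(millisBeforeCheck);
            value = supplier.get();
        }
        uniqueValueStore.add(value);
        return value;
    }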

    public String resolve(String key) {
        return get(() -> faker.resolve(key));
    }

    public String expression(String expression) {
        return get(() -> faker.expression(expression));
    }

    public int nextInt() {
        return get(() -> faker.random().nextInt());
    }

    public int nextInt(int n) {
        return get(() -> faker.random().nextInt(n));
    }

    public int nextInt(int min, int max) {
        return get(() -> faker.random().nextInt(min, max));
    }

    public long nextLong() {
        return get(() -> faker.random().nextLong());
    }

    public long nextLong(long n) {
        return get(() -> faker.random().nextLong(n));
    }

    public long nextLong(long min, long max) {
        return get(() -> faker.random().nextLong(min, max));
    }

    private void handleInfiniteLoop(long initialMillis) {
        if (currentTimeMillis() - initialMillis > LOOP_TIMEOUT_MILLIS) {
            throw new RuntimeException("Unable to get unique value from supplier");
        }
    }
}
@snuyanzin
Collaborator

Unique values are a tricky thing...

There are at least two issues that are unsolved, and it is currently not clear how to solve them.

  1. OutOfMemory. With a relatively small number of elements this should not be a problem, but people sometimes generate millions of records, e.g. in "create 100 million object cost lot of time" (DiUS/java-faker#663). In that case uniqueValueStore will consume a lot of memory.
  2. Some generated values come from a limited set, e.g. one defined in a *.yml file. Imagine that en/name.yml defines 473 unique last names. What should happen if I try to generate 1000 unique last names with the solution above? I guess it will get stuck in an endless loop.
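
To make the second point concrete, here is a minimal sketch of what happens when more unique values are requested than the pool contains. It is not the implementation posted above: `BoundedUnique`, `maxAttempts`, and the three-name pool are hypothetical, and the retry loop is capped by an attempt count rather than by the 10-second wall-clock timeout used in the posted `Unique` class.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical helper, not part of the library: retries a supplier a bounded number of times.
final class BoundedUnique {
    private final Set<Object> seen = new HashSet<>();
    private final int maxAttempts;

    BoundedUnique(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    <T> T get(Supplier<T> supplier) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            T value = supplier.get();
            // Set.add returns false when the value was already present.
            if (seen.add(value)) {
                return value;
            }
        }
        throw new IllegalStateException(
                "No unseen value after " + maxAttempts + " attempts; the pool may be exhausted");
    }

    public static void main(String[] args) {
        // Stand-in for a small list of names loaded from a yml file.
        List<String> pool = List.of("Smith", "Jones", "Brown");
        Random random = new Random(42);
        Supplier<String> lastName = () -> pool.get(random.nextInt(pool.size()));
        BoundedUnique unique = new BoundedUnique(1_000);

        // The pool holds three names, so three unique requests can succeed.
        for (int i = 0; i < pool.size(); i++) {
            System.out.println(unique.get(lastName));
        }
        try {
            unique.get(lastName); // a fourth distinct name cannot exist
        } catch (IllegalStateException e) {
            System.out.println("Gave up: " + e.getMessage());
        }
    }
}
```

Either kind of cap turns the endless loop into an exception, but neither can tell the caller in advance how many distinct values a given supplier is able to produce.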

@giunto
Contributor Author

giunto commented Jul 17, 2022

> OutOfMemory. With a relatively small number of elements this should not be a problem, but people sometimes generate millions of records, e.g. in DiUS/java-faker#663. In that case uniqueValueStore will consume a lot of memory.

That's a good point that I didn't really consider. The intended use case I had for this was for generating small data sets of just a few values. For unique integers and longs, tracking the last generated value and incrementing it might work:

private int uniqueInt = 0;

public int nextInt() {
    uniqueInt += faker.random().nextInt(1, x); // x: an unspecified upper bound on the random step
    return uniqueInt;
}

This would raise another issue though, where it's possible for the value to overflow, so it might not be the best solution for large amounts of data.
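
As a rough illustration of the incrementing idea and its overflow problem, the sketch below keeps a running total and adds a random positive step on each call. `IncreasingInts` and `maxStep` (standing in for `x`) are hypothetical names, and `java.util.Random` is used in place of `faker.random()` so the example is self-contained. `Math.addExact` makes the overflow fail loudly instead of silently wrapping around to a negative value that could collide with an earlier result.

```java
import java.util.Random;

// Hypothetical sketch: strictly increasing ints with random gaps, failing loudly on overflow.
final class IncreasingInts {
    private final Random random;
    private final int maxStep; // plays the role of "x" in the snippet above; must be >= 1
    private int last = 0;

    IncreasingInts(Random random, int maxStep) {
        if (maxStep < 1) {
            throw new IllegalArgumentException("maxStep must be >= 1");
        }
        this.random = random;
        this.maxStep = maxStep;
    }

    int next() {
        // random.nextInt(maxStep) is in [0, maxStep), so step is in [1, maxStep].
        int step = 1 + random.nextInt(maxStep);
        // Math.addExact throws ArithmeticException instead of wrapping around.
        last = Math.addExact(last, step);
        return last;
    }
}
```

Because every step is at least 1, each value is larger than the previous one, so no store of past values is needed. The trade-off is that the output is monotonic rather than uniformly random, and the range runs out sooner the larger the step bound is.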

> Some generated values come from a limited set, e.g. one defined in a *.yml file. Imagine that en/name.yml defines 473 unique last names. What should happen if I try to generate 1000 unique last names with the solution above? I guess it will get stuck in an endless loop.

The solution I made does handle infinite loops by timing out and throwing an exception if the unique value check takes longer than 10 seconds. But like I said above, the solution is mostly intended for small amounts of data where the user is mindful of the random data they are generating.

@snuyanzin
Collaborator

snuyanzin commented Jul 17, 2022

One more question: why do all the methods share the same store of unique values?
It means that if somewhere in the code I call nextInt() and it returns e.g. 42, then a later call to nextLong() will not be allowed to return 42 because it is already in uniqueValueStore. Is this intentional?

@giunto
Contributor Author

giunto commented Jul 17, 2022

> One more question: why do all the methods share the same store of unique values?
> It means that if somewhere in the code I call nextInt() and it returns e.g. 42, then a later call to nextLong() will not be allowed to return 42 because it is already in uniqueValueStore. Is this intentional?

Since Long 42 is a different type than Integer 42, it would be possible for nextLong to return 42 after nextInt returned it.
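
The boxing behaviour behind this can be checked in isolation. The sketch below uses a plain `HashSet<Object>` like the one in `Unique`; the class name `BoxedEqualityDemo` is only for illustration.

```java
import java.util.HashSet;
import java.util.Set;

final class BoxedEqualityDemo {
    static Set<Object> storeWithIntAndLong() {
        Set<Object> store = new HashSet<>();
        store.add(42);  // autoboxed to Integer
        store.add(42L); // autoboxed to Long; Integer.equals(Long) is false, so this is a second entry
        return store;
    }

    public static void main(String[] args) {
        Set<Object> store = storeWithIntAndLong();
        System.out.println(store.size());                       // prints 2
        System.out.println(Integer.valueOf(42).equals(42L));    // prints false
    }
}
```

Values of the same type still collide across methods, which is why a last name equal to an earlier first name gets regenerated in the example at the top of this issue.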

It was intentional for everything to share the same uniqueValueStore. My thinking was that everything returned from a method on faker.unique should be something that wasn't returned previously. Theoretically it should be possible to track return values based on the method that was called, but that would add extra complexity which I didn't see as beneficial.
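
For comparison, tracking values per method could look roughly like the sketch below. It is only an illustration of the alternative, not something proposed in this issue: `KeyedUnique`, the string `key` parameter, and the `maxAttempts` cap are all hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical variant: one store per caller-chosen key instead of a single shared store.
final class KeyedUnique {
    private final Map<String, Set<Object>> stores = new HashMap<>();
    private final int maxAttempts;

    KeyedUnique(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    <T> T get(String key, Supplier<T> supplier) {
        // Create the store for this key on first use.
        Set<Object> seen = stores.computeIfAbsent(key, k -> new HashSet<>());
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            T value = supplier.get();
            if (seen.add(value)) {
                return value;
            }
        }
        throw new IllegalStateException("No unseen value for key '" + key + "'");
    }
}
```

With this shape, `get("firstName", ...)` and `get("lastName", ...)` may both return "James", while two calls under the same key may not. The cost is the extra key argument and one set per key.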

@bodiam
Contributor

bodiam commented Jul 18, 2022

I think it's an interesting idea, but it sounds quite brittle. I don't think there's any way to make this a reliable feature if you're generating large amounts of data. It's almost impossible to know how many unique values can come from a yml file, which complicates things.

What's wrong with generating a large amount of data, putting it in a set, and taking the data you need after that?

I have no objection to a feature like this, but I'm a bit hesitant to add a feature which "sometimes" works.
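
The generate-then-deduplicate approach suggested above could be sketched as follows. `DistinctBatch.take`, `wanted`, and `maxDraws` are hypothetical names; the supplier would be any faker call, such as `() -> faker.name().lastName()`.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Supplier;

final class DistinctBatch {
    // Draws at most maxDraws values, deduplicates them, and returns `wanted` distinct ones.
    static <T> List<T> take(Supplier<T> supplier, int wanted, int maxDraws) {
        Set<T> distinct = new LinkedHashSet<>(); // keeps first-seen order
        for (int i = 0; i < maxDraws && distinct.size() < wanted; i++) {
            distinct.add(supplier.get());
        }
        if (distinct.size() < wanted) {
            throw new IllegalStateException(
                    "Only " + distinct.size() + " distinct values in " + maxDraws + " draws");
        }
        return new ArrayList<>(distinct);
    }
}
```

This keeps the uniqueness bookkeeping in the caller's code and scoped to a single batch, so nothing accumulates for the lifetime of the faker. It still has to give up when the supplier cannot produce enough distinct values.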

@giunto
Contributor Author

giunto commented Jul 20, 2022

I'll go ahead and close this issue. I don't think there's a good way to guarantee uniqueness with large amounts of data without running into issues with memory or throwing an exception. It seems like it's better to let the user decide how they want uniqueness handled. Thanks for the feedback!
