
Getting same words or names most of the time #122

Closed · ghmendonca opened this issue Feb 2, 2023 · 6 comments

ghmendonca commented Feb 2, 2023

I've noticed that I get the same word or name almost every time. How can I make the output more random?

I'm running some tests against my API, connected to a real database with 52 documents. Pretty much every time I run the tests, inserts are rejected because some fields have to be unique and the generated words or names are already in the database.


tgross35 commented Feb 2, 2024

One option is to add a Unique<T> wrapper, such as Unique<Username<..>>, that contains a HashSet of all produced values and rerolls the inner generator on duplicates. The HashSet would either need to be in a RefCell/Mutex, or the trait signatures would need to be updated to take &mut. Unique versions of fake::vec! and friends could also be added that do this faster.

The other, much more performant option is to use a linear congruential generator (LCG) to select from the dictionary. These generators are pseudorandom but can guarantee no repeated values within the dictionary size. This seems like the better option, if it is possible to implement. It could use a different interface, since an RNG is not needed for picking items, only for selecting the initial seed.

There is some discussion and linked issues for Python's factory_boy at FactoryBoy/factory_boy#305; I think their initial implementation uses a set.

@proegssilb

I'm also doing some DB testing that needs certain fields to be unique. I guess I'll have to generate and load the data in Python for now (since it has a library that does performant, unique data generation) and benchmark querying in Rust (for sub-ms precision).

Maybe if there's maintainer interest in a particular design for this, I could look into a PR, but for now, the path of least resistance lies elsewhere.


tgross35 commented Feb 5, 2024

Figure it's worth an ask - @cksac do you have any ideas here?

Owner

cksac commented Feb 6, 2024

I think it would be better to have a custom faker for the field that is required to be unique, like below:

```rust
use fake::{Dummy, Fake};
use once_cell::sync::Lazy;
use std::{collections::HashSet, sync::Mutex};

// Global cache of every id produced so far.
static ORDER_ID_CACHE: Lazy<Mutex<HashSet<usize>>> = Lazy::new(|| Mutex::new(HashSet::new()));

pub struct OrderIdFaker<U>(pub U);

impl<U> Dummy<OrderIdFaker<U>> for usize
where
    usize: Dummy<U>,
{
    fn dummy_with_rng<R: rand::prelude::Rng + ?Sized>(
        config: &OrderIdFaker<U>,
        rng: &mut R,
    ) -> Self {
        let faker = &config.0;
        let mut id = faker.fake_with_rng(rng);
        let mut cache = ORDER_ID_CACHE.lock().unwrap();
        // Reroll until we hit an id that has not been produced before.
        // Note: this loops forever once the value space (here 0..1000) is exhausted.
        while cache.contains(&id) {
            id = faker.fake_with_rng(rng);
        }
        cache.insert(id);
        id
    }
}

#[derive(Debug, Dummy)]
pub struct Order {
    #[dummy(faker = "OrderIdFaker(0..1000)")]
    id: usize,
}

fn main() {
    let orders = fake::vec![Order; 1..10];
    println!("{:?}", orders);
}
```

A Unique<T> wrapper will not work here because:

  1. We can't implement Dummy for it, due to overlapping implementations within the fake crate.
  2. We can't get a global cache of type T inside the dummy_with_rng fn.
  3. It can't support different caches for the same target type.

```rust
pub struct Unique<T>(T);

// This blanket impl overlaps with the crate's existing Dummy impls:
impl<U, T> Dummy<Unique<T>> for U where U: Dummy<T> {
    // ...
}
```

@proegssilb

I'm not sure I'm on the same page as you regarding the technical limitations, and I'm not sure this note will be helpful, but it's worth mentioning for the record if nothing else.

If I have this schema (Python code):

```python
schema_fun = lambda: {
    "username": field("person.username"),
    "pwd": "password",
    "name": field("full_name"),
    "email": field("person.email", unique=True),
    "created": field("timestamp", fmt=TimestampFormat.POSIX),
    "verified": field("timestamp", fmt=TimestampFormat.POSIX),
    "modified": field("timestamp", fmt=TimestampFormat.POSIX),
}
```

Suppose that, internally, the generator behind person.username gets reused between the username and email schema fields. I don't actually need the internally generated person.username to be unique between those two fields; I just need the same email address to not be generated twice.

(In practice, I actually had to drop the username field because I couldn't find the spot in the docs where mimesis provides unique usernames. Just unique emails.)

All that to say: locally unique output is, in fact, a useful start, and would solve real problems.

--

Another thought: suppose we had both (1) UniqueFromArray, which "shuffles" an array and pulls each item at most once, and (2) UniqueFromArrays, which picks from multiple arrays and combines the picks with a lambda (but returns each combination only once). That would probably be a good start. It's locally unique only, doesn't support the normal APIs, and takes some serious hacking to build, but at least it lets problems be solved without devising a custom algorithm from scratch for each unique field. For example, an email address would require manually combining First Name, Last Name, and Free Email Domain (or Lorem Ipsum Word + Lorem Ipsum Word + TLD for more options). But that's still far more approachable than writing the same set-membership check every time, or devising a custom linear congruential sequence every time.

Owner

cksac commented Feb 7, 2024

Hi @proegssilb, that is what I proposed in my previous suggestion. In the example below, email is unique among generated UserProfile instances and is not related to the username. Your proposed approach could be implemented as a different faker if you like.

```rust
use fake::faker::internet::en::*;
use fake::{Dummy, Fake};
use once_cell::sync::Lazy;
use std::{collections::HashSet, sync::Mutex};

// Global cache of every email produced so far.
static EMAIL_CACHE: Lazy<Mutex<HashSet<String>>> = Lazy::new(|| Mutex::new(HashSet::new()));

pub struct UniqueEmailFaker;

impl Dummy<UniqueEmailFaker> for String {
    fn dummy_with_rng<R: rand::prelude::Rng + ?Sized>(
        _config: &UniqueEmailFaker,
        rng: &mut R,
    ) -> Self {
        let mut email: String = FreeEmail().fake_with_rng(rng);
        let mut cache = EMAIL_CACHE.lock().unwrap();
        // Reroll until the generated email has not been seen before.
        while cache.contains(&email) {
            email = FreeEmail().fake_with_rng(rng);
        }
        cache.insert(email.clone());
        email
    }
}

#[derive(Debug, Dummy)]
pub struct UserProfile {
    #[dummy(faker = "Username()")]
    pub username: String,
    #[dummy(faker = "UniqueEmailFaker")]
    pub email: String,
}

fn main() {
    let user_set_1 = fake::vec![UserProfile; 1..10];
    println!("{:?}", user_set_1);

    let user_set_2 = fake::vec![UserProfile; 1..10];
    println!("{:?}", user_set_2);

    // No duplicate emails across user_set_1 and user_set_2, unless EMAIL_CACHE is cleared.
}
```

@cksac cksac closed this as completed Feb 15, 2024