
Replace simplistic dbldatagen.text_generators with community-maintained Faker generators #50

Closed
nfx opened this issue Jul 16, 2021 · 5 comments · Fixed by #51

@nfx (Member) commented on Jul 16, 2021

Why don't we use Faker as the random generation backend? It is already more powerful than the self-written dbldatagen.text_generators, and there are already plenty of data providers - https://faker.readthedocs.io/en/stable/providers.html and https://faker.readthedocs.io/en/stable/communityproviders.html. It's a very good idea to build on top of existing efforts.
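
For illustration, here is a minimal sketch of plain Faker usage with one of its standard providers (the calls shown are from Faker's documented API; the comments describe what each call returns):

from faker import Faker
from faker.providers import internet

fake = Faker()
fake.add_provider(internet)

print(fake.name())           # a random full name
print(fake.ipv4_private())   # a random private IPv4 address
print(fake.company_email())  # a random company-style email address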

@ronanstokes-db (Contributor) commented on Jul 16, 2021

It depends on what you mean by more powerful - it does not support repeatability of data generation, and it uses the basic Python random number generator without seeding.

It does provide lots of support for pre-canned formats, so that aspect is interesting. But we would need to override the default random number generator, and it would be hard to incorporate vectorized operations.

It's not just the text generators that can be used to generate text data in the existing dbldatagen implementation, but all of Spark SQL, PySpark, and pandas / numpy / scipy - that's what's leveraged by it being fully integrated with PySpark.

@ronanstokes-db (Contributor) commented on Jul 16, 2021

A more sensible approach might be to offer it as a possible integration in the future, in a similar way to how Factory Boy uses it, rather than as a replacement for the existing mechanism. It would then become an additional text generator rather than replacing the existing ones.

This would have some limitations, such as only generating uniformly distributed values (due to the mechanics of Faker's random number generator) and only supporting string columns.

I've confirmed that this is at least feasible for non-repeatable data (using pandas UDF integration in conjunction with dbldatagen), but performance is up to 100x slower - more realistically 20x slower for smaller data sets.

So I would suggest this as a possible documentation example rather than a built-in feature.

Update - I was able to generate 10 million rows of data in about 1.5 minutes with a 12 x 8-core cluster. For comparison, dbldatagen can generate 1 billion rows of data with basic formatting AND write them out to a Delta table on Azure in 1.5 minutes on the same cluster. Performance for complex formatting in dbldatagen can be slower (10-15 minutes for 1 billion rows in some cases).

Trying to do the same for 1 billion rows with parallelized Faker failed after 18 minutes, when it was only about 1/3 of the way complete.

For 100 million rows, I was able to generate Faker data using an extension mechanism in dbldatagen on a 12 x 8-core cluster and write it out in 5 minutes. So I think we can show an example of using Faker in conjunction with dbldatagen, but it does not make sense as the default mechanism.
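
For context, here is a minimal sketch of the kind of pandas UDF integration described above; the column and function names are illustrative, and this is not the benchmark code behind the timings:

import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType
from faker import Faker

@pandas_udf(StringType())
def fake_name(ids: pd.Series) -> pd.Series:
    # one Faker instance per batch of rows handed to the UDF
    fake = Faker()
    return pd.Series([fake.name() for _ in ids])

# df is any Spark DataFrame with an "id" column, e.g. one built with dbldatagen's withIdOutput()
# df.withColumn("name", fake_name("id")).show()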

ronanstokes-db added the documentation label on Jul 18, 2021
@nfx (Member, Author) commented on Jul 19, 2021

But the current approach generates data in a pandas UDF, not in Spark. So probably setting a random seed for Faker would achieve the same goal? Faker should work with custom distributions.
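
For reference, a sketch of the seeding being suggested, using Faker's documented seeding API; whether this alone gives repeatable per-row values once the work is split across Spark partitions is the open question:

from faker import Faker

Faker.seed(4321)           # seeds the shared random generator used by all instances
fake = Faker()
print(fake.name())         # deterministic for a given seed

fake2 = Faker()
fake2.seed_instance(4321)  # per-instance seeding is also available
print(fake2.name())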

@nfx (Member, Author) commented on Jul 19, 2021

Well, I think it makes the most sense to have it as a plugin. Where is the performance bottleneck? Maybe going down to the Faker provider APIs would help?

@ronanstokes-db (Contributor) commented on Jul 19, 2021

Pandas UDFs are only used for text generation from templates and Lorem Ipsum text.

A pandas UDF is still distributed across Spark nodes.

Aside from that, I think having a generic plugin that can support Faker but also other libraries is useful. It won't be bound specifically to Faker, and we don't want to ship Faker, have a dependency on Faker, test Faker, or require it to be preinstalled.

This mechanism would also allow use of arbitrary Python functions.

Here is how I see the syntax working:

import dbldatagen as dg
from faker import Faker
from faker.providers import internet

shuffle_partitions_requested = 12 * 4
partitions_requested = 96  * 5
data_rows = 1 * 1000 * 1000

spark.conf.set("spark.sql.shuffle.partitions", shuffle_partitions_requested)

my_word_list = [
    'danish', 'cheesecake', 'sugar',
    'Lollipop', 'wafer', 'Gummies',
    'sesame', 'Jelly', 'beans',
    'pie', 'bar', 'Ice', 'oat'
]

# The context is shared information used across generation of many rows.
# Here, it's the Faker instance, but it could include customer lookup data, custom number generators, etc.
# As it's a Python object, you can store anything within the bounds of what's reasonable for a Python object.
# It also gets around the issue of using objects from 3rd-party libraries that don't support pickling.

def initFaker(context):
  context.faker = Faker()
  context.faker.add_provider(internet)

# The data generation functions are lambdas or Python functions taking a context and the base value of the column;
# they return the generated value.
ip_address_generator = lambda context, v: context.faker.ipv4_private()
name_generator = lambda context, v: context.faker.name()
text_generator = lambda context, v: context.faker.sentence(ext_word_list=my_word_list)
cc_generator = lambda context, v: context.faker.credit_card_number()
email_generator = lambda context, v: context.faker.ascii_company_email()

# Example using Faker text generation alongside standard dbldatagen text generation.
# PyfuncText is the proposed plugin text generator wrapper (to be added by the linked pull request).
fakerDataspec = (dg.DataGenerator(spark, rows=data_rows, partitions=partitions_requested)
            .withColumn("name", percent_nulls=1.0, text=PyfuncText(name_generator , initFn=initFaker))
            .withColumn("name2", percent_nulls=1.0, template=r'\\w \\w|\\w a. \\w')
            .withColumn("payment_instrument", text=PyfuncText(cc_generator, initFn=initFaker))
            .withColumn("email", text=PyfuncText(email_generator, initFn=initFaker))
            .withColumn("ip_address", text=PyfuncText(ip_address_generator , initFn=initFaker))
            .withColumn("faker_text", text=PyfuncText(text_generator, initFn=initFaker))
            .withColumn("il_text", text=dg.ILText(words=(1,8), extendedWordList=my_word_list))
            )
dfFakerOnly = fakerDataspec.build()

display(dfFakerOnly)

ronanstokes-db linked pull request #51 on Jul 20, 2021 that will close this issue
ronanstokes-db added the enhancement label on Jul 29, 2021
ronanstokes-db added this to the initial-release milestone on Jul 29, 2021