Skip to content

Add Spark SQL dialect#1261

Merged
snuyanzin merged 3 commits intodatafaker-net:mainfrom
gatear:spark-sql
Jun 17, 2024
Merged

Add Spark SQL dialect#1261
snuyanzin merged 3 commits intodatafaker-net:mainfrom
gatear:spark-sql

Conversation

@gatear
Copy link
Contributor

@gatear gatear commented Jun 16, 2024

Add Spark SQL support.
See "INSERT INTO" spec https://spark.apache.org/docs/3.2.1/sql-ref-syntax-dml-insert-into.html

There are some issues with existing design in order to support all Spark types.
Spark SQL has 3 complex data types:

  • Array
  • Map
  • Struct

Insertions look like this

-- Inserting an array
INSERT INTO MyTable (array_column) VALUES (ARRAY(1, 2, 3));

-- Inserting a map
INSERT INTO MyTable (map_column) VALUES (MAP('key1', 'value1', 'key2', 'value2'));

-- Inserting a struct
INSERT INTO MyTable (struct_column) VALUES (NAMED_STRUCT('field1', 'value1', 'field2', 'value2'));

So notable design changes are

  • array start and end markers are now configurable with default "[", "]". Spark requires "(" and ")".
  • row prefix is configurable as keyword "ROW" doesn't exist but we have "NAMED_TUPLE" for complex fields in Spark.
  • a named tuples requires a prefix for each value, the field name (but root columns don't require field name prefix)
  • there's also Map support required. I think it makes sense to correlate java.util.Map with Spark SQL Map
  • batch insert is not support so I think it's best to throw exception on attempt to generate "batch" statements

I tested on latest Databricks runtime this generated SQL and works well 👍

INSERT INTO `MyTable` (`bytes`) VALUES (ARRAY(1, 0));
INSERT INTO `MyTable` (`booleans`) VALUES (ARRAY(true, false));
INSERT INTO `MyTable` (`ints`) VALUES (ARRAY(1, 2, 3));

INSERT INTO `MyTable` (`struct_array`) VALUES (NAMED_STRUCT('name1', '1', 'struct', NAMED_STRUCT('name', ARRAY(1, 2, 3))));
INSERT INTO `MyTable` (`struct_struct`) VALUES (NAMED_STRUCT('name1', '1', 'struct', NAMED_STRUCT('name', '2')));
INSERT INTO `MyTable` (`longs`) VALUES (ARRAY(23, 45));
INSERT INTO `MyTable` (`empty_map`) VALUES (MAP());
INSERT INTO `MyTable` (`maps`) VALUES (MAP('k1', MAP('k1', 'v1'), 'k2', MAP('k1', 'v1')));


private static final String DEFAULT_BEFORE_EACH_BATCH_PREFIX = " ";


Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please delete this line

return result.toString();
}

private String handeObjectInMap(Object value) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
private String handeObjectInMap(Object value) {
private String handleObjectInMap(Object value) {

seems like a misprint

TERADATA("\""),
VERTICA("\"", Casing.UNCHANGED);
VERTICA("\"", Casing.UNCHANGED),
SPARKSQL("`",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to keep records here sorted alphabetically

Copy link
Collaborator

@snuyanzin snuyanzin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution

in general lgtm, I left some comments

@gatear
Copy link
Contributor Author

gatear commented Jun 17, 2024

Thanks @snuyanzin for the review 🙏

I also have some questions ..

  • in order to port this change to faker version 1.x I need to also open a PR to branch 1.x ?
  • I would also invest in some documentation, where would you suggest it's best to make it ?

@gatear gatear requested a review from snuyanzin June 17, 2024 08:08
@codecov-commenter
Copy link

codecov-commenter commented Jun 17, 2024

Codecov Report

Attention: Patch coverage is 95.23810% with 2 lines in your changes missing coverage. Please review.

Project coverage is 91.91%. Comparing base (b37c566) to head (57e83c6).
Report is 166 commits behind head on main.

Files Patch % Lines
.../net/datafaker/transformations/sql/SqlDialect.java 84.61% 2 Missing ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #1261      +/-   ##
============================================
- Coverage     92.35%   91.91%   -0.44%     
- Complexity     2821     3079     +258     
============================================
  Files           292      310      +18     
  Lines          5609     6026     +417     
  Branches        599      631      +32     
============================================
+ Hits           5180     5539     +359     
- Misses          275      325      +50     
- Partials        154      162       +8     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@snuyanzin
Copy link
Collaborator

Thanks for addressing feedback and for the valuable contribution

@snuyanzin snuyanzin merged commit f3fa54a into datafaker-net:main Jun 17, 2024
@snuyanzin
Copy link
Collaborator

snuyanzin commented Jun 17, 2024

in order to port this change to faker version 1.x I need to also open a PR to branch 1.x ?

yep, need a backport PR for that

I would also invest in some documentation, where would you suggest it's best to make it ?

depending on how large documentation update you want to make, will it be ok to add as a subsection of SQL transformation ?
https://github.com/datafaker-net/datafaker/blame/f3fa54a4495fd8923e790701fab0570df8bd75b7/docs/documentation/schemas.md#L163

or ### Advanced SQL types at https://github.com/datafaker-net/datafaker/blame/981eaa714266b9246c127f3b84855270a3a00591/docs/documentation/schemas.md#L239

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants