[SPARK-33172][SQL] Adding support for UserDefinedType for Spark SQL Code generator #30372

davidrabinowitz · 2020-11-13T21:30:38Z

This PR is based on the master branch and replaces PR #30071.

What changes were proposed in this pull request?

Make CodeGenerator.getValueFromVector() treat UserDefinedTypes the same way CodeGenerator.javaType() does, by resolving them to their underlying SQL type.

Why are the changes needed?

Without it, the generated Java code does not compile. The error was:

org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 153, Column 126: No applicable constructor/method found for actual parameters "int, int"; candidates are: "public org.apache.spark.sql.vectorized.ColumnarRow org.apache.spark.sql.vectorized.ColumnVector.getStruct(int)"

The fix makes sure the method call has just one parameter.
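
To illustrate the difference, here is a minimal, self-contained sketch of the dispatch logic. It does not use Spark's real classes: DataType, StructType, UserDefinedType and the CodegenSketch object are simplified, hypothetical stand-ins, and the "before" variant is my reading of the old behaviour rather than a copy of the original source. It only shows why a UDT backed by a struct ended up with the two-argument row accessor, and how unwrapping udt.sqlType inside getValueFromVector yields the one-argument ColumnVector.getStruct call.

```scala
// Simplified stand-ins for Spark's type hierarchy (hypothetical, not the real classes).
sealed trait DataType
case object IntegerType extends DataType
final case class StructType(fieldCount: Int) extends DataType
final case class UserDefinedType(sqlType: DataType) extends DataType

object CodegenSketch {
  // Row-style accessor: struct access takes the ordinal AND the number of fields.
  def getValue(input: String, dataType: DataType, ordinal: String): String =
    dataType match {
      case StructType(n)        => s"$input.getStruct($ordinal, $n)"
      case UserDefinedType(sql) => getValue(input, sql, ordinal)
      case IntegerType          => s"$input.getInt($ordinal)"
    }

  // Assumed "before" behaviour: only a bare StructType gets the one-argument
  // vector accessor; a UDT falls through to the row-style getValue.
  def getValueFromVectorBefore(vector: String, dataType: DataType, rowId: String): String =
    dataType match {
      case StructType(_) => s"$vector.getStruct($rowId)"
      case _             => getValue(vector, dataType, rowId)
    }

  // Sketch of the proposed behaviour: unwrap the UDT first, then dispatch again,
  // so a struct-backed UDT also gets the one-argument accessor.
  def getValueFromVectorAfter(vector: String, dataType: DataType, rowId: String): String =
    dataType match {
      case StructType(_)        => s"$vector.getStruct($rowId)"
      case UserDefinedType(sql) => getValueFromVectorAfter(vector, sql, rowId)
      case _                    => getValue(vector, dataType, rowId)
    }
}
```

With a struct-backed UDT such as UserDefinedType(StructType(4)), the "before" variant emits a getStruct call with two arguments (the shape the compiler rejects above), while the "after" variant emits getStruct with the row id only.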

Does this PR introduce any user-facing change?

No

How was this patch tested?

I've added a unit test to verify the proper code is generated: getStruct(ordinal)

davidrabinowitz · 2020-11-13T21:33:30Z

@HyukjinKwon @gengliangwang FYI

AmplabJenkins · 2020-11-13T21:37:52Z

Can one of the admins verify this patch?

davidrabinowitz · 2020-11-16T04:42:24Z

To verify it, first create a table in BigQuery as follows:

bq load --source_format NEWLINE_DELIMITED_JSON <TABLE> vector_test.data.json vector_test.schema.json

The files are:

vector_test.data.json:

{"name":"row1","num":"1","vector":{"type":"1","indices":[],"values":[1,2,3]}}
{"name":"row2","num":"2","vector":{"type":"1","indices":[],"values":[4,5,6]}}
{"name":"row3","num":"3","vector":{"type":"1","indices":[],"values":[7,8,9]}}

vector_test.schema.json:

[
  {
    "mode": "NULLABLE",
    "name": "name",
    "type": "STRING"
  },
  {
    "mode": "NULLABLE",
    "name": "num",
    "type": "INTEGER"
  },
  {
    "description": "{spark.type=vector}",
    "fields": [
      {
        "mode": "NULLABLE",
        "name": "type",
        "type": "INTEGER"
      },
      {
        "mode": "NULLABLE",
        "name": "size",
        "type": "INTEGER"
      },
      {
        "mode": "REPEATED",
        "name": "indices",
        "type": "INTEGER"
      },
      {
        "mode": "REPEATED",
        "name": "values",
        "type": "FLOAT"
      }
    ],
    "mode": "NULLABLE",
    "name": "vector",
    "type": "RECORD"
  }
]

A GCP account is needed for that, but the amount of data and the operations involved are well within the free tier.

Run spark-shell --packages com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.17.3 and enter the following commands:

val df = spark.read.format("com.google.cloud.spark.bigquery.v2.BigQueryDataSourceV2").load("<TABLE>")
df.schema
df.show()

Note that when the format is changed to bigquery, a different read path is used. That path does not rely on the code generator and hence does not suffer from this issue.

maropu · 2020-11-17T07:46:58Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala

-    } else {
-      getValue(vector, dataType, rowId)
+    dataType match {
+      case udt: UserDefinedType[_] => getValueFromVector(vector, udt.sqlType, rowId)


Does this issue only happen when using spark-bigquery-with-dependencies? In the current Spark codebase, it seems dataType cannot be a user-defined type in this method.

maropu · 2020-11-17T07:47:20Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/CodeGenerator.scala

+      }
    }
  }
-


nit: revert this (unnecessary change)

github-actions · 2021-02-26T00:45:57Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

Adding support for UserDefinedType for Spark SQL Code generator

3adbaa5

github-actions bot added the SQL label Nov 13, 2020

davidrabinowitz mentioned this pull request Nov 13, 2020

[SPARK-33172][SQL] Adding support for UserDefinedType for Spark SQL Code generator #30071

Closed

HyukjinKwon requested a review from gengliangwang November 17, 2020 05:12

maropu reviewed Nov 17, 2020

View reviewed changes

github-actions bot added the Stale label Feb 26, 2021

github-actions bot closed this Feb 27, 2021
