@davidrabinowitz davidrabinowitz commented Nov 13, 2020

This PR is based on the master branch, replacing PR #30071

What changes were proposed in this pull request?

Make CodeGenerator.getValueFromVector() treat UserDefinedType the same way CodeGenerator.javaType() does.

Why are the changes needed?

Without it, the generated Java code does not compile. The error was:

org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 153, Column 126: No applicable constructor/method found for actual parameters "int, int"; candidates are: "public org.apache.spark.sql.vectorized.ColumnarRow org.apache.spark.sql.vectorized.ColumnVector.getStruct(int)"

The fix makes sure the method call has just one parameter.
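
The sketch below is a simplified model of the problem, not Spark's actual code. DataType, StructType, UserDefinedType, and CodegenSketch are hypothetical stand-ins, and the "before" behaviour is an assumption inferred from the compile error above: a user-defined type backed by a struct falls through to the row-style accessor, which emits getStruct with two arguments.

```scala
// Simplified stand-ins for Spark's DataType hierarchy (hypothetical, not the real classes).
sealed trait DataType
case object IntegerType extends DataType
final case class StructType(fieldNames: Seq[String]) extends DataType
// A user-defined type wraps an underlying SQL type.
final case class UserDefinedType(name: String, sqlType: DataType) extends DataType

object CodegenSketch {
  // Row-style accessor: struct access takes (ordinal, numFields).
  def getValue(input: String, dataType: DataType, ordinal: String): String =
    dataType match {
      case StructType(fields)   => s"$input.getStruct($ordinal, ${fields.size})"
      case udt: UserDefinedType => getValue(input, udt.sqlType, ordinal)
      case IntegerType          => s"$input.getInt($ordinal)"
    }

  // Assumed pre-fix behaviour: only a bare StructType gets the one-argument form,
  // so a UDT over a struct is unwrapped too late, inside getValue.
  def getValueFromVectorBefore(vector: String, dataType: DataType, rowId: String): String =
    dataType match {
      case _: StructType => s"$vector.getStruct($rowId)"
      case _             => getValue(vector, dataType, rowId)
    }

  // With the fix: unwrap the UDT first, then dispatch on its underlying SQL type.
  def getValueFromVectorAfter(vector: String, dataType: DataType, rowId: String): String =
    dataType match {
      case udt: UserDefinedType => getValueFromVectorAfter(vector, udt.sqlType, rowId)
      case _: StructType        => s"$vector.getStruct($rowId)"
      case _                    => getValue(vector, dataType, rowId)
    }
}
```

In this model, a UDT whose sqlType is a four-field struct produces a two-argument getStruct call before the change and a one-argument call after it, which is the form ColumnVector.getStruct(int) accepts.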

Does this PR introduce any user-facing change?

No

How was this patch tested?

I've added a unit test to verify the proper code is generated: getStruct(ordinal)

@davidrabinowitz
Author

@HyukjinKwon @gengliangwang FYI

@AmplabJenkins

Can one of the admins verify this patch?

@davidrabinowitz
Author

To verify it, first create a table in BigQuery as follows:

bq load --source_format NEWLINE_DELIMITED_JSON <TABLE> vector_test.data.json vector_test.schema.json

The files are:

  • vector_test.data.json:
{"name":"row1","num":"1","vector":{"type":"1","indices":[],"values":[1,2,3]}}
{"name":"row2","num":"2","vector":{"type":"1","indices":[],"values":[4,5,6]}}
{"name":"row3","num":"3","vector":{"type":"1","indices":[],"values":[7,8,9]}}
  • vector_test.schema.json:
[
  {
    "mode": "NULLABLE",
    "name": "name",
    "type": "STRING"
  },
  {
    "mode": "NULLABLE",
    "name": "num",
    "type": "INTEGER"
  },
  {
    "description": "{spark.type=vector}",
    "fields": [
      {
        "mode": "NULLABLE",
        "name": "type",
        "type": "INTEGER"
      },
      {
        "mode": "NULLABLE",
        "name": "size",
        "type": "INTEGER"
      },
      {
        "mode": "REPEATED",
        "name": "indices",
        "type": "INTEGER"
      },
      {
        "mode": "REPEATED",
        "name": "values",
        "type": "FLOAT"
      }
    ],
    "mode": "NULLABLE",
    "name": "vector",
    "type": "RECORD"
  }
]

A GCP account is needed for this, but the amount of data and the operations involved fall well within the free tier.

Run spark-shell --packages com.google.cloud.spark:spark-bigquery-with-dependencies_2.11:0.17.3 and enter the following commands:

val df = spark.read.format("com.google.cloud.spark.bigquery.v2.BigQueryDataSourceV2").load("<TABLE>")
df.schema
df.show()

Note that when the format is changed to bigquery, a different code path is used that does not rely on the code generator and hence does not suffer from this issue.

} else {
getValue(vector, dataType, rowId)
dataType match {
case udt: UserDefinedType[_] => getValueFromVector(vector, udt.sqlType, rowId)
Member

Does this issue only happen when using spark-bigquery-with-dependencies? In the current Spark codebase, it seems dataType cannot be a user-defined type in this method.

}
}
}

Member

nit: revert this (unnecessary change)

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Feb 26, 2021
@github-actions github-actions bot closed this Feb 27, 2021