Nested objects fail parsing in Spark SQL when empty objects present #2157

Closed
jbaiera opened this issue Nov 9, 2023 · 0 comments · Fixed by #2158
jbaiera commented Nov 9, 2023

We keep track of which field we are currently parsing in the org.elasticsearch.spark.sql.ScalaRowValueReader#readValue method:

override def readValue(parser: Parser, value: String, esType: FieldType) = {
  // Track the field currently being parsed; fall back to the root-level
  // name when the parser reports no current field.
  sparkRowField = if (getCurrentField == null) null else getCurrentField.getFieldName
  if (sparkRowField == null) {
    sparkRowField = Utils.ROOT_LEVEL_NAME
  }
  super.readValue(parser, value, esType)
}

When reading an array of objects, though, the current field being read is overwritten between row creations. We work around this in the createArray method by stashing the array's row order on the call stack:

https://github.com/elastic/elasticsearch-hadoop/blob/4a14860391d00716a5225804a4c71c46a5633162/spark/sql-30/src/main/scala/org/elasticsearch/spark/sql/ScalaEsRowValueReader.scala#L76C55-L89
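The essence of that workaround is a save-and-restore around the recursive array read. The sketch below is illustrative only, not the code behind the link; readArrayValues is a hypothetical helper standing in for the actual read path:

def readArray(parser: Parser): AnyRef = {
  // Stash the enclosing state in locals on the call stack before descending.
  val previousInArray = inArray
  val previousRowOrder = currentArrayRowOrder

  // Install the row order for the array we are about to read, keyed by the
  // field being parsed.
  inArray = true
  if (rowColumnsMap.contains(sparkRowField)) {
    currentArrayRowOrder = rowColumns(sparkRowField)
  }

  val result = readArrayValues(parser) // hypothetical recursive read; may nest

  // Restore the enclosing state while unwinding.
  inArray = previousInArray
  currentArrayRowOrder = previousRowOrder
  result
}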

When we create a Row object, we check the current field. If we're in an array and the current field has a row order of its own, we use that, because we're probably creating a subobject under the array; otherwise we fall back to the stashed row order:

val rowOrd =
  if (inArray) {
    if (rowColumnsMap.contains(sparkRowField)) {
      // The current field has its own row order: likely a subobject
      // under the array, so prefer its columns.
      rowColumns(sparkRowField)
    }
    else {
      // No row order for the current field: fall back to the order
      // stashed for the enclosing array.
      currentArrayRowOrder
    }
  }
  else rowColumns(sparkRowField)

Unfortunately, if we are parsing a nested document that ends with an empty object, the empty object's field name remains in the parser's current-field variable. When the next object in the array is created, it picks up the column list of that previous empty object, which results in downstream serialization issues:

{
  "nested": [          // Current field: `nested`
    {                  // Current field: `nested` (creates map for `nested`)
      "key": "value",  // Current field: `nested.key`
      "object": {}     // Current field: `nested.object` (creates map for `nested.object`)
    },
    {                  // Current field: `nested.object` (creates map for `nested.object` but should have created map for `nested`)
      "key": "value"
    }
  ]
}
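For reference, indexing a document shaped like the one above and reading it back through Spark SQL should trigger the failure (a hypothetical repro sketch; the index name repro-index is a placeholder):

// Hypothetical reproduction sketch: read an index containing the JSON above.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("repro-2157").getOrCreate()

val df = spark.read
  .format("org.elasticsearch.spark.sql")
  // Tell the connector to treat `nested` as an array of objects.
  .option("es.read.field.as.array.include", "nested")
  .load("repro-index")

// Materializing the rows forces the faulty row order to surface downstream.
df.selectExpr("nested").collect()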

This isn't a problem if the trailing object has fields of its own, because the underlying fields won't have object mappings unless they too are empty objects:

{
  "nested": [             // Current field: `nested`
    {                     // Current field: `nested` (creates map for `nested`)
      "key": "value",     // Current field: `nested.key`
      "object": {         // Current field: `nested.object` (creates map for `nested.object`)
        "subkey": "value" // Current field: `nested.object.subkey`
      }
    },
    {                     // Current field: `nested.object.subkey` (creates map for `nested` using stashed row order because `nested.object.subkey` has no column order data)
      "key": "value"
    }
  ]
}
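The contrast between the two walkthroughs can be modeled by running the rowOrd logic in isolation (a self-contained sketch with hypothetical data, not connector code):

// Self-contained model of the rowOrd lookup above (hypothetical data).
// It shows why a stale current field breaks the first walkthrough but
// not the second.
object StaleFieldDemo extends App {
  // Row orders registered per field, as rowColumnsMap would hold them.
  val rowColumnsMap = Map(
    "nested"        -> Seq("key", "object"),
    "nested.object" -> Seq.empty[String] // the empty object registers no columns
  )
  val currentArrayRowOrder = Seq("key", "object") // stashed for `nested`

  def rowOrd(sparkRowField: String, inArray: Boolean): Seq[String] =
    if (inArray) {
      if (rowColumnsMap.contains(sparkRowField)) rowColumnsMap(sparkRowField)
      else currentArrayRowOrder
    }
    else rowColumnsMap(sparkRowField)

  // First walkthrough: the stale field `nested.object` has a (wrong, empty)
  // row order of its own, so the stashed fallback never kicks in.
  println(rowOrd("nested.object", inArray = true))         // List() -- wrong

  // Second walkthrough: the stale field `nested.object.subkey` is a leaf with
  // no row order, so the stashed order for `nested` is used as intended.
  println(rowOrd("nested.object.subkey", inArray = true))  // List(key, object)
}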