Nested objects fail parsing in Spark SQL when empty objects present #2157

Closed
jbaiera opened this issue Nov 9, 2023 · 0 comments · Fixed by #2158
jbaiera commented Nov 9, 2023

We keep track of which field we are currently parsing in the org.elasticsearch.spark.sql.ScalaRowValueReader#readValue method:

override def readValue(parser: Parser, value: String, esType: FieldType) = {
  // Track the field currently being parsed; fall back to the root-level
  // name when the parser reports no current field.
  sparkRowField = if (getCurrentField == null) null else getCurrentField.getFieldName
  if (sparkRowField == null) {
    sparkRowField = Utils.ROOT_LEVEL_NAME
  }
  super.readValue(parser, value, esType)
}

When reading an array of objects, though, the current field being read is overwritten between row creations. We work around this in the createArray method by stashing the array's row order on the call stack:

https://github.com/elastic/elasticsearch-hadoop/blob/4a14860391d00716a5225804a4c71c46a5633162/spark/sql-30/src/main/scala/org/elasticsearch/spark/sql/ScalaEsRowValueReader.scala#L76C55-L89
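The essence of that workaround is a save-and-restore around the recursive array read. The sketch below is illustrative only, not the code behind the link; readArrayValues is a hypothetical helper standing in for the actual read path:

def readArray(parser: Parser): AnyRef = {
  // Stash the enclosing state in locals on the call stack before descending.
  val previousInArray = inArray
  val previousRowOrder = currentArrayRowOrder

  // Install the row order for the array we are about to read, keyed by the
  // field being parsed.
  inArray = true
  if (rowColumnsMap.contains(sparkRowField)) {
    currentArrayRowOrder = rowColumns(sparkRowField)
  }

  val result = readArrayValues(parser) // hypothetical recursive read; may nest

  // Restore the enclosing state while unwinding.
  inArray = previousInArray
  currentArrayRowOrder = previousRowOrder
  result
}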

When we create a Row object, we check the current field. If we're in an array and the current field has a row order of its own, we use that, because we're probably creating a subobject under the array; otherwise we fall back to the stashed row order:

val rowOrd =
  if (inArray) {
    if (rowColumnsMap.contains(sparkRowField)) {
      // The current field has its own row order: likely a subobject
      // under the array, so prefer its columns.
      rowColumns(sparkRowField)
    }
    else {
      // No row order for the current field: fall back to the order
      // stashed for the enclosing array.
      currentArrayRowOrder
    }
  }
  else rowColumns(sparkRowField)

Unfortunately, if we are parsing a nested document that ends with an empty object, the empty object's field name remains in the parser's current-field variable. When the next object in the array is created, it picks up the column list of that previous empty object, which results in downstream serialization issues:

{
  "nested": [          // Current field: `nested`
    {                  // Current field: `nested` (creates map for `nested`)
      "key": "value",  // Current field: `nested.key`
      "object": {}     // Current field: `nested.object` (creates map for `nested.object`)
    },
    {                  // Current field: `nested.object` (creates map for `nested.object` but should have created map for `nested`)
      "key": "value"
    }
  ]
}
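For reference, indexing a document shaped like the one above and reading it back through Spark SQL should trigger the failure (a hypothetical repro sketch; the index name repro-index is a placeholder):

// Hypothetical reproduction sketch: read an index containing the JSON above.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("repro-2157").getOrCreate()

val df = spark.read
  .format("org.elasticsearch.spark.sql")
  // Tell the connector to treat `nested` as an array of objects.
  .option("es.read.field.as.array.include", "nested")
  .load("repro-index")

// Materializing the rows forces the faulty row order to surface downstream.
df.selectExpr("nested").collect()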

This isn't a problem if the trailing object has fields of its own, because the underlying fields won't have object mappings unless they too are empty objects:

{
  "nested": [             // Current field: `nested`
    {                     // Current field: `nested` (creates map for `nested`)
      "key": "value",     // Current field: `nested.key`
      "object": {         // Current field: `nested.object` (creates map for `nested.object`)
        "subkey": "value" // Current field: `nested.object.subkey`
      }
    },
    {                     // Current field: `nested.object.subkey` (creates map for `nested` using stashed row order because `nested.object.subkey` has no column order data)
      "key": "value"
    }
  ]
}
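The contrast between the two walkthroughs can be modeled by running the rowOrd logic in isolation (a self-contained sketch with hypothetical data, not connector code):

// Self-contained model of the rowOrd lookup above (hypothetical data).
// It shows why a stale current field breaks the first walkthrough but
// not the second.
object StaleFieldDemo extends App {
  // Row orders registered per field, as rowColumnsMap would hold them.
  val rowColumnsMap = Map(
    "nested"        -> Seq("key", "object"),
    "nested.object" -> Seq.empty[String] // the empty object registers no columns
  )
  val currentArrayRowOrder = Seq("key", "object") // stashed for `nested`

  def rowOrd(sparkRowField: String, inArray: Boolean): Seq[String] =
    if (inArray) {
      if (rowColumnsMap.contains(sparkRowField)) rowColumnsMap(sparkRowField)
      else currentArrayRowOrder
    }
    else rowColumnsMap(sparkRowField)

  // First walkthrough: the stale field `nested.object` has a (wrong, empty)
  // row order of its own, so the stashed fallback never kicks in.
  println(rowOrd("nested.object", inArray = true))         // List() -- wrong

  // Second walkthrough: the stale field `nested.object.subkey` is a leaf with
  // no row order, so the stashed order for `nested` is used as intended.
  println(rowOrd("nested.object.subkey", inArray = true))  // List(key, object)
}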