This repository was archived by the owner on Mar 24, 2025. It is now read-only.

How to extract all individual elements from a nested WrappedArray from a DataFrame in Spark #192

Closed
@deepakmundhada

Description

I'm using spark-xml to parse an XML file. It creates a DataFrame with the schema shown below. How can I get all the individual elements from MEMBERDETAIL?

scala> xmlDF.printSchema
    root
     |-- MEMBERDETAIL: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- FILE_ID: double (nullable = true)
     |    |    |-- INP_SOURCE_ID: long (nullable = true)
     |    |    |-- NET_DB_CR_SW: string (nullable = true)
     |    |    |-- NET_PYM_AMT: string (nullable = true)
     |    |    |-- ORGNTD_DB_CR_SW: string (nullable = true)
     |    |    |-- ORGNTD_PYM_AMT: double (nullable = true)
     |    |    |-- RCVD_DB_CR_SW: string (nullable = true)
     |    |    |-- RCVD_PYM_AMT: string (nullable = true)
     |    |    |-- RECON_DATE: string (nullable = true)
     |    |    |-- SLNO: long (nullable = true)
scala> xmlDF.head
res147: org.apache.spark.sql.Row = [WrappedArray([1.1610100000001425E22,1,D,        94,842.38,C,0.0,D,        94,842.38,2016-10-10,1], [1.1610100000001425E22,1,D,        33,169.84,C,0.0,D,        33,169.84,2016-10-10,2], [1.1610110000001425E22,1,D,       155,500.88,C,0.0,D,       155,500.88,2016-10-11,3], [1.1610110000001425E22,1,D,       164,952.29,C,0.0,D,       164,952.29,2016-10-11,4], [1.1610110000001425E22,1,D,       203,061.06,C,0.0,D,       203,061.06,2016-10-11,5], [1.1610110000001425E22,1,D,       104,040.01,C,0.0,D,       104,040.01,2016-10-11,6], [2.1610110000001427E22,1,C,           849.14,C,849.14,C,             0.00,2016-10-11,7], [1.1610100000001465E22,1,D,             3.78,C,0.0,D,             3.78,2016-10-10,1], [1.1610100000001465E22,1,D,           261.54,C,0.0,D,    ...

After trying several approaches, I can only get an "Any" object, as shown below, and I still cannot read the fields individually.

xmlDF.select($"MEMBERDETAIL".getItem(0)).head().get(0)
res56: Any = [1.1610100000001425E22,1,D,94,842.38,C,0.0,D,94,842.38,2016-10-10,1]

The schema for the "Any" element above is:

res61: org.apache.spark.sql.DataFrame = [MEMBERDETAIL[0]: struct<FILE_ID:double,INP_SOURCE_ID:bigint,NET_DB_CR_SW:string,NET_PYM_AMT:string,ORGNTD_DB_CR_SW:string,ORGNTD_PYM_AMT:double,RCVD_DB_CR_SW:string,RCVD_PYM_AMT:string,RECON_DATE:string,SLNO:bigint>]
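
For reference, one common way to reach the individual fields is to explode the array into one row per element and then expand the struct into top-level columns. The snippet below is only a minimal sketch, assuming Spark 2.x or later. The Member and Record case classes, the sample values, and the local SparkSession are hypothetical stand-ins for the real xmlDF (only three of the ten fields are modelled); explode, select("m.*") and Row.getAs are standard Spark SQL calls.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

// Hypothetical stand-ins for the spark-xml schema (three of the ten fields only)
case class Member(FILE_ID: Double, NET_PYM_AMT: String, SLNO: Long)
case class Record(MEMBERDETAIL: Seq[Member])

// Local session for illustration; in spark-shell this returns the existing session
val spark = SparkSession.builder().master("local[1]").appName("explode-sketch").getOrCreate()
import spark.implicits._

// Made-up sample data shaped like the DataFrame in the question
val xmlDF = Seq(
  Record(Seq(Member(1.0, "842.38", 1L), Member(2.0, "169.84", 2L)))
).toDF()

// explode: one output row per array element; "m.*": one column per struct field
val members = xmlDF
  .select(explode(col("MEMBERDETAIL")).as("m"))
  .select("m.*")

members.printSchema()

// Once flattened, fields can also be read by name from collected Rows
val amounts = members.collect().map(row => row.getAs[String]("NET_PYM_AMT"))
```

With the real xmlDF, only the two select calls should be needed. The flattened DataFrame can then be filtered, joined, or written out like any other, without casting the "Any" value by hand.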
