
How to extract all individual elements from a nested WrappedArray from a DataFrame in Spark #192

Closed
deepakmundhada opened this issue Oct 24, 2016 · 13 comments

@deepakmundhada

deepakmundhada commented Oct 24, 2016

I'm using spark-xml to parse an XML file. It creates a DataFrame with the schema below. How can I get all the individual elements from MEMBERDETAIL?

scala> xmlDF.printSchema
    root
     |-- MEMBERDETAIL: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- FILE_ID: double (nullable = true)
     |    |    |-- INP_SOURCE_ID: long (nullable = true)
     |    |    |-- NET_DB_CR_SW: string (nullable = true)
     |    |    |-- NET_PYM_AMT: string (nullable = true)
     |    |    |-- ORGNTD_DB_CR_SW: string (nullable = true)
     |    |    |-- ORGNTD_PYM_AMT: double (nullable = true)
     |    |    |-- RCVD_DB_CR_SW: string (nullable = true)
     |    |    |-- RCVD_PYM_AMT: string (nullable = true)
     |    |    |-- RECON_DATE: string (nullable = true)
     |    |    |-- SLNO: long (nullable = true)
scala> xmlDF.head
res147: org.apache.spark.sql.Row = [WrappedArray([1.1610100000001425E22,1,D,        94,842.38,C,0.0,D,        94,842.38,2016-10-10,1], [1.1610100000001425E22,1,D,        33,169.84,C,0.0,D,        33,169.84,2016-10-10,2], [1.1610110000001425E22,1,D,       155,500.88,C,0.0,D,       155,500.88,2016-10-11,3], [1.1610110000001425E22,1,D,       164,952.29,C,0.0,D,       164,952.29,2016-10-11,4], [1.1610110000001425E22,1,D,       203,061.06,C,0.0,D,       203,061.06,2016-10-11,5], [1.1610110000001425E22,1,D,       104,040.01,C,0.0,D,       104,040.01,2016-10-11,6], [2.1610110000001427E22,1,C,           849.14,C,849.14,C,             0.00,2016-10-11,7], [1.1610100000001465E22,1,D,             3.78,C,0.0,D,             3.78,2016-10-10,1], [1.1610100000001465E22,1,D,           261.54,C,0.0,D,    ...

After trying many approaches, I am able to get only an "Any" object, as below, but I am still not able to read all the fields separately.

xmlDF.select($"MEMBERDETAIL".getItem(0)).head().get(0)
res56: Any = [1.1610100000001425E22,1,D,94,842.38,C,0.0,D,94,842.38,2016-10-10,1]

And the schema for the above "Any" element is as below:

res61: org.apache.spark.sql.DataFrame = [MEMBERDETAIL[0]: struct<FILE_ID:double,INP_SOURCE_ID:bigint,NET_DB_CR_SW:string,NET_PYM_AMT:string,ORGNTD_DB_CR_SW:string,ORGNTD_PYM_AMT:double,RCVD_DB_CR_SW:string,RCVD_PYM_AMT:string,RECON_DATE:string,SLNO:bigint>]
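For reference, one common way to get at the individual fields is to explode the array and then select the struct fields as ordinary columns. The sketch below is only an illustration: it assumes the same xmlDF as above and that spark.implicits._ (or sqlContext.implicits._ on older Spark) is in scope for the $ syntax, and the field names simply follow the schema printed above.

// Minimal sketch: one row per MEMBERDETAIL element, then plain columns.
import org.apache.spark.sql.functions.explode

val detailDF = xmlDF
  .select(explode($"MEMBERDETAIL").as("detail"))
  .select(
    $"detail.FILE_ID",
    $"detail.INP_SOURCE_ID",
    $"detail.NET_DB_CR_SW",
    $"detail.NET_PYM_AMT",
    $"detail.RECON_DATE",
    $"detail.SLNO")

detailDF.show()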
@HyukjinKwon
Member

I guess #141 (comment) might be helpful.

@deepakmundhada
Author

deepakmundhada commented Oct 25, 2016

The above link was helpful, but my data looks like the following.

  <MEMBERHEADER>
    <MEMBER_NAME>SERVICES LIMITED</MEMBER_NAME>
    <MEMBER_ID>16253</MEMBER_ID>
    <INTERNAL_ID>16253</INTERNAL_ID>
  </MEMBERHEADER>
  <MEMBERDETAIL>
    <SLNO>1</SLNO>
    <INP_SOURCE_ID>0000000001</INP_SOURCE_ID>
  </MEMBERDETAIL>
  <MEMBERDETAIL>
    <SLNO>2</SLNO>
    <INP_SOURCE_ID>0000000001</INP_SOURCE_ID>        
  </MEMBERDETAIL>
 <MEMBERHEADER>
    <MEMBER_NAME>SERVICES LIMITED</MEMBER_NAME>
    <MEMBER_ID>16223</MEMBER_ID>
    <INTERNAL_ID>16223</INTERNAL_ID>
  </MEMBERHEADER>
  <MEMBERDETAIL>
    <SLNO>1</SLNO>
    <INP_SOURCE_ID>0000000002</INP_SOURCE_ID>
  </MEMBERDETAIL>
  <MEMBERDETAIL>
    <SLNO>2</SLNO>
    <INP_SOURCE_ID>0000000002</INP_SOURCE_ID>        
  </MEMBERDETAIL>

Will the DataFrame always maintain the order of records from the file? I mean, will the first MEMBERHEADER and the MEMBERDETAIL elements that follow it always be the first row in the DataFrame, the next group the second row, and so on? Or can this change based on the number of partitions (tasks) created by Spark?

@davidcrossland

I have a similar issue. My schema (well, a portion of it) is below; in fact this schema relates to a single column:

|-- AirportDataList: struct (nullable = true)
| |-- AirportData: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- Airport: struct (nullable = true)
| | | | |-- TACode: string (nullable = true)
| | | | |-- AOCode: string (nullable = true)
| | | | |-- _airportFunction: string (nullable = true)
| | | | |-- _airportName: string (nullable = true)

Every attempt to retrieve the values results in null.

@HyukjinKwon
Member

@deepakmundhada Ah, yes. As far as I know, the order is as written in the file; however, I would not encourage relying on the natural order.
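As a hedged illustration of not depending on the natural order, one can sort on a key that is already present in the data rather than on row position; the sketch below assumes the xmlDF from the original question and uses SLNO from the exploded MEMBERDETAIL structs as that key.

// Sketch: order rows by an explicit key from the data, not by file position.
import org.apache.spark.sql.functions.explode

xmlDF
  .select(explode($"MEMBERDETAIL").as("detail"))
  .orderBy($"detail.SLNO")
  .show()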

@HyukjinKwon
Member

@davidcrossland I guess that is related to #185; it seems critical. I will try to release another version soon, although we have only a few fixes.

@davidcrossland

Just to try and validate whether I'm being dumb...

So I imported the XML file using the outermost tag, which resulted in a bunch of columns as you would expect.

Inspecting the schema of a specific column gives this:

StructType(StructField(AirportDataList,StructType(StructField(AirportData,ArrayType(StructType(StructField(Airport,StructType(StructField(AirportIATACode,StringType,true), StructField(AirportICAOCode,StringType,true), StructField(_airportFunction,StringType,true), StructField(_airportName,StringType,true)),true), StructField(PlannedRunway,StringType,true), StructField(SuitablePeriod,StructType(StructField(_VALUE,StringType,true), StructField(_from,StringType,true), StructField(_until,StringType,true)),true), StructField(TerminalProcedure,ArrayType(StructType(StructField(_VALUE,StringType,true), StructField(_procedureType,StringType,true)),true),true)),true),true)),true))

or, presented in a nicer way:

|-- AirportDataList: struct (nullable = true)
| |-- AirportData: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- Airport: struct (nullable = true)
| | | | |-- AirportIATACode: string (nullable = true)
| | | | |-- AirportICAOCode: string (nullable = true)
| | | | |-- _airportFunction: string (nullable = true)
| | | | |-- _airportName: string (nullable = true)
| | | |-- PlannedRunway: string (nullable = true)
| | | |-- SuitablePeriod: struct (nullable = true)
| | | | |-- _VALUE: string (nullable = true)
| | | | |-- _from: string (nullable = true)
| | | | |-- _until: string (nullable = true)
| | | |-- TerminalProcedure: array (nullable = true)
| | | | |-- element: struct (containsNull = true)
| | | | | |-- _VALUE: string (nullable = true)
| | | | | |-- _procedureType: string (nullable = true)

I've tried extracting the data a bunch of different ways, but from the docs I would have expected this to work:

df.select("AirportDataList.AirportData.Airport.AirportIATACode").show

Unless I am approaching this incorrectly?

I certainly don't get any error, and when I try this:

df.select("AirportDataList.AirportData.Airport.AirportIATACode").first()

it results in a GenericRowWithSchema object whose schema has the expected value:

StructField(AirportIATACode,ArrayType(StringType,true),true)

However, the values are [null].

So it appears to be traversing the XML structure correctly but is not pulling back the values. In fact, I can prove to myself that it is indeed able to traverse the structure correctly by selecting incorrect paths through the data. If you need any further info from me, please let me know. If you're able to give me an estimate as to when you might release a fix, that would be useful, as I need this for a customer project. I'll take a look through the code and see if I can spot anything.
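For what it's worth, here is a hedged diagnostic sketch (same df as above, with spark.implicits._ assumed in scope for the $ syntax, and not a confirmed workaround): exploding the inner array first and then selecting the nested field can help tell whether the nulls come from the dot-path traversal or from the parsed values themselves.

// Diagnostic sketch: explode AirportData first, then drill into the Airport struct.
import org.apache.spark.sql.functions.explode

df.select(explode($"AirportDataList.AirportData").as("airportData"))
  .select($"airportData.Airport.AirportIATACode")
  .show(false)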

@denysthegitmenace

I can confirm that I am facing the same issue.

@shermilaguerra

I have the same issue. Has this been resolved?

@HyukjinKwon
Member

This was resolved in master, I believe. I will release a new version soon. I apologise that it has been postponed.

@denysthegitmenace

Last week's master code (Wednesday, October 26th) didn't fix the issue for PySpark.

@davidcrossland

Any ETA on when the fix may be released?

@HyukjinKwon
Member

I am planning it for this weekend. I will merge a few more small fixes and then run some tests.

HyukjinKwon added a commit that referenced this issue Nov 6, 2016
There are some critical bugs in 0.4.0. Although it includes only a few fixes, it'd be great if another version is released.

This will include the fixes for #193, #196, #192 and #185.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #198 from HyukjinKwon/changes-0.4.1.
@HyukjinKwon
Member

I am closing this. Please feel free to reopen if this is not resolved and you still face this issue. Thanks again!

HyukjinKwon added a commit that referenced this issue Nov 6, 2016
There are some critical bugs in 0.3.4. Although it includes only a few fixes, it'd be great if another version is released.

This will include the fixes for #193, #196, #192 and #185.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #200 from HyukjinKwon/changes-0.3.5.