How to extract all individual elements from a nested WrappedArray from a DataFrame in Spark #192
Comments
I guess #141 (comment) might be helpful.
The above link was helpful, but my data is like the below.
Will the DataFrame always maintain the order of records from the file? I mean, will the 1st MEMBERHEADER and the MEMBERDETAIL that follows it always be the 1st row in the DataFrame, the next pair the 2nd row, and so on? Or can the order change based on the number of partitions (tasks) created by Spark?
I have a similar issue. A portion of my schema (in fact, this schema relates to a single column) is:
|-- AirportDataList: struct (nullable = true)
Every attempt to retrieve the values results in null.
@deepakmundhada Ah, yes. As far as I know, the order is as written in the file; however, I guess relying on the natural order is not encouraged.
@davidcrossland I guess that is related to #185, which seems critical. I will try to release another version soon, although we have very few fixes.
Just to try and validate whether I'm being dumb.

I imported the XML file using the outermost tag, which resulted in a bunch of columns, as you would expect. Inspecting the schema of a specific column results in this:

StructType(StructField(AirportDataList,StructType(StructField(AirportData,ArrayType(StructType(StructField(Airport,StructType(StructField(AirportIATACode,StringType,true), StructField(AirportICAOCode,StringType,true), StructField(_airportFunction,StringType,true), StructField(_airportName,StringType,true)),true), StructField(PlannedRunway,StringType,true), StructField(SuitablePeriod,StructType(StructField(_VALUE,StringType,true), StructField(_from,StringType,true), StructField(_until,StringType,true)),true), StructField(TerminalProcedure,ArrayType(StructType(StructField(_VALUE,StringType,true), StructField(_procedureType,StringType,true)),true),true)),true),true)),true))

or, presented in a nicer way:

|-- AirportDataList: struct (nullable = true)

I've tried extracting the data in a bunch of different ways, but from the docs I would have expected this to work:

df.select("AirportDataList.AirportData.Airport.AirportIATACode").show

Unless I am approaching this incorrectly? I certainly don't get any error, and when I try this:

df.select("AirportDataList.AirportData.Airport.AirportIATACode").first()

the result is a GenericRowWithSchema object whose schema has the expected value StructField(AirportIATACode,ArrayType(StringType,true),true), but whose values are [null].

So it appears to be traversing the XML structure correctly but is not pulling back the values. In fact, I can prove to myself that it is able to traverse the structure correctly by selecting incorrect paths through the data.

If you need any further info from me, please let me know. If you're able to give me an estimate as to when you might release a fix, that would be useful, as I need this for a customer project. I'll take a root through the code to see if I can spot anything.
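For later readers, here is a minimal sketch of how a nested array like AirportData is commonly flattened into one row per element with explode, assuming Spark 2.x or later and assuming the values parse correctly (it does not work around the null-values problem described above). The case classes, the object name, the output column aliases, and the sample values are all made up for illustration, and the in-memory DataFrame only stands in for what spark-xml would load; only the field names come from the schema in this comment.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

// Hypothetical case classes mirroring a trimmed-down part of the schema above.
// Only the field names are taken from the thread; the class names are invented.
case class AirportRec(AirportIATACode: String, AirportICAOCode: String)
case class AirportDataRec(Airport: AirportRec, PlannedRunway: String)
case class AirportDataListRec(AirportData: Seq[AirportDataRec])
case class FlightRec(AirportDataList: AirportDataListRec)

object FlattenAirports {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("flatten-airports")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Stand-in for the DataFrame that spark-xml would produce (sample values are made up).
    val df = Seq(
      FlightRec(AirportDataListRec(Seq(
        AirportDataRec(AirportRec("LHR", "EGLL"), "27L"),
        AirportDataRec(AirportRec("JFK", "KJFK"), "04R")
      )))
    ).toDF()

    // explode turns each element of the AirportData array into its own row;
    // the struct fields can then be selected by their dotted path.
    val airports = df
      .select(explode(col("AirportDataList.AirportData")).as("airportData"))
      .select(
        col("airportData.Airport.AirportIATACode").as("iataCode"),
        col("airportData.Airport.AirportICAOCode").as("icaoCode"),
        col("airportData.PlannedRunway").as("plannedRunway")
      )

    airports.show()
    spark.stop()
  }
}
```

Selecting the dotted path directly, as in the comment above, keeps one row per input record and returns an array column; exploding first gives one row per array element, which is usually easier to work with when every individual element is needed.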
I confirm that I am facing the same issue.
I have the same issue. Was this resolved?
I believe this was resolved in master. I will release a new version soon. I apologise that it has been postponed.
Last week's master code (Wednesday, October 26th) didn't fix the issue for PySpark.
Any ETA on when the fix may be released?
I am planning it for this weekend. I will merge a few more small fixes and then run some tests.
I am closing this. Please feel free to reopen if this is not resolved and you still face this issue. Thanks again!
I'm using spark-xml to parse an XML file. It creates a DataFrame with a schema like the one below. How can I get all the individual elements from MEMEBERDETAIL?
After trying many approaches, I can get only an "Any" object like the one below, but I still cannot read its fields separately.
The schema for the above "Any" element is as follows: