To Pandas doesn't work with parquet file - Type Error #159
I haven't seen such an error before, I'm afraid. The error suggests that the block has no data-page, which would be very odd - I wonder if it's possible that you have blocks containing no data?
@martindurant I'm also seeing this error. I'm wondering if you have any further thoughts. Can you elaborate on what you mean by "blocks containing no data"? Appreciate the help.
The schema for this field is:
Just getting started with fastparquet - the script itself is very simple:
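The snippet itself wasn't preserved in this thread; a minimal sketch of that kind of script, assuming a `filelist` of parquet paths, might look like:

```python
# Hypothetical reconstruction of the kind of script described above;
# `filelist` is assumed to be a list of parquet file paths.
from fastparquet import ParquetFile

filelist = ["data/part-00000.parquet", "data/part-00001.parquet"]
pf = ParquetFile(filelist)  # fastparquet also accepts a list of files
df = pf.to_pandas()
print(df.head())
```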
The traceback suggests that parsing of the thrift header to a data chunk failed, the "None" should be the data chunk header. This most likely means that the file is corrupt; how was it produced, and does it load successfully in any other parquet frameworks? Do any other columns load OK? Is this happening for only one of the files (try looping over the contents of filelist and loading each in turn)? If it turns out that you have a valid file, but fastparquet is failing to load it, then I may ask to see the file and debug from there.
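A minimal sketch of that suggested debugging loop, assuming `filelist` holds the parquet paths:

```python
# Try each file individually to isolate the ones that fail to parse.
from fastparquet import ParquetFile

bad_files = []
for path in filelist:
    try:
        ParquetFile(path).to_pandas()
    except Exception as exc:
        print(f"{path} failed: {exc}")
        bad_files.append(path)
```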
Thank you for the suggestion. Three of the ~60 files I was trying to load caused this error; removing them made it work. I need to look into why they are corrupt... Appreciate the help.
@martindurant - if you are still open to checking out the file, I would appreciate it:
To summarize, there seems to be something funky about the "features" column of the "troublesome" file causing fastparquet to generate this error:
I'm able to read other columns from the "troublesome" file with fastparquet, and I'm also able to read in the whole thing (including the "features" field) with Spark, so I don't actually think the file is corrupt. As I mentioned in a previous comment, I'm able to read other files of the same format ("working file" attached). I appreciate any thoughts/insights into this. Happy to provide more info if it would be helpful. Thanks.
OK, so: there appear to be multiple dictionary pages, which is not supposed to happen, but which I can deal with. Also, the encoding is "bit-packed (deprecated)", which, as the name suggests, is not supposed to be around. I can maybe code it up, since the spec is well-stated, and I can compare the result against ground-truth as given by Spark.
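For reference, the deprecated BIT_PACKED encoding packs values back-to-back at a fixed bit width, starting from the most significant bit of each byte (unlike the bit-packing used inside the RLE hybrid, which starts from the least significant bit). A rough sketch of a decoder under that reading of the spec:

```python
def decode_bit_packed(data: bytes, bit_width: int, count: int) -> list:
    """Sketch of the deprecated BIT_PACKED decoding: `count` integers of
    `bit_width` bits each, packed back-to-back, most significant bit first."""
    values = []
    bit_pos = 0
    for _ in range(count):
        value = 0
        for _ in range(bit_width):
            byte = data[bit_pos // 8]
            value = (value << 1) | ((byte >> (7 - bit_pos % 8)) & 1)
            bit_pos += 1
        values.append(value)
    return values
```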
Hey @martindurant - were you able to find anything here? Appreciate the insights. Thanks a lot for the help.
Sorry, I did not manage to fix this yet. I did allow for multiple dictionaries, and that works fine, but my implementation of bit-packed reading apparently does not work: I end up at a bad byte location and seg-fault. I don't know when I'll have the chance to look into this further.
I'm seeing the same stack trace, but the column doesn't seem to be bit-packed encoded. What can I do to help debug this issue?
Perhaps fixed by #264?
I have the same problem here.
Are you using 0.1.5? |
Yes, and python-snappy 0.5.1.
I could isolate some columns that cause the error. Those contain arrays of integers. But other columns are also arrays of integers, and they do work with toPandas().
Oh no, I'm also using Apache Spark, and unfortunately I'm hitting that exact same exception, but for a list of doubles. It seems like fastparquet just cannot deal with the way Apache Spark writes arrays. I guess I'll just have to switch to CSV.
fastparquet does deal with some list types, so if you can produce a sample of the data, I might be able to help. Just wondering, how do you write lists into CSV?
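For anyone wanting to produce such a sample, a hypothetical way to generate data of the shape discussed here (an array-of-doubles column written by Spark; the column names are made up):

```python
# Sketch: write a small Spark parquet file containing an array<double> column.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(i, [float(i), float(i) * 2.0]) for i in range(1000)],
    ["id", "values"],  # "values" is inferred as array<double>
)
df.write.parquet("sample_arrays.parquet")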
Same issue on my end with some Parquet file written by Spark, with the schema being:
The same files can be read fine with Spark for all fields, and with fastparquet if I read just the remaining (non-array) fields.
Previously generated files from Spark which I have look like this:
What do you get with fastparquet's view of your file?
Here is what I get with the schema printed via fastparquet, where I can only read the non-array fields:
Versus the schema for the exact same file via Spark, where I can read all fields fine:
And for good measure, here is what I'm getting through other tooling:
Let me know if I can provide anything more to help investigate this.
I think I found the issue @martindurant. It seems that in some cases you can have multiple dictionary pages within a single column chunk. I'm not familiar enough with Parquet internals to say why this happens, but since other Parquet frameworks seem to handle this case fine, it looks like a bug in fastparquet. The fix probably involves just moving the check for dictionary page headers inside the page-reading loop.
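A schematic sketch of that idea (the helper names are hypothetical, not fastparquet's actual code): instead of reading the dictionary once before the data-page loop, treat a dictionary page as something that can appear anywhere in the page stream.

```python
# Illustration only - helper names are made up, not fastparquet's API.
DICTIONARY_PAGE = 2  # parquet thrift PageType; the "type 2" pages noted later

def read_column_chunk(reader, num_values):
    dictionary = None
    values = []
    while len(values) < num_values:
        header = read_page_header(reader)    # parse the next thrift page header
        if header.type == DICTIONARY_PAGE:   # may appear more than once
            dictionary = read_dictionary_page(reader, header)
        else:
            values.extend(read_data_page(reader, header, dictionary))
    return values
```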
@cmenguy, I had thought that this exact issue came up before and was fixed. Do you think it's possible to produce a test file (with randomised data) of reasonable size, with the structure you describe?
Perhaps #367 solves this? I would appreciate you trying with that code.
I've tried #367, and it fails with a different error:
I've added a sample file with this issue, which just contains the problematic column.
I am looking into it, but I'm not sure yet what to do. It seems that after reading the second dictionary page header, we are no longer at a valid data section. Interestingly, the column metadata says:
i.e., there is exactly one "type 2" page (these are the dictionaries), not two. However, the data decompresses OK, and that wouldn't happen, I think, if it didn't look like a valid SNAPPY block.
I tried comparing with the output of the other tool, but the 2nd dictionary page seems to happen right at the end of the 58 pages for the first column.
Actually, I'm pretty sure now that this 2nd dictionary is for the second column.
The second dictionary page appears to surface after 38575 values of the first column.
@martindurant I messed around a bit with the code, and I think I found the root cause - could you take a look at #368 and let me know your thoughts? I was able to read the file fine with this fix, and the number of rows matches.
Actually, after 1289 rows are read correctly, everything else is null, so most likely some pages are not being read, but at least it's not throwing an exception. It's probably a matter of tuning how the remaining pages are read.
So we can probably close this?
I still have the exact same problem while reading a parquet file (4.3 GB) saved from Spark.
@tao-cao, are you using the master version of fastparquet? A release should happen in the next few days. If yes, then (as usual) the specifics of your schema will be important.
Thanks! I'm using the pip-installed version 0.1.6; is the master on git newer than 0.1.6?
Yes, the fixes mentioned above are more recent. You would need to install directly from git to test (see below), or wait for the release.
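The exact command wasn't preserved here; installing from git master presumably looks something like:

```
pip install git+https://github.com/dask/fastparquet
```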
@martindurant, to give you more info: the whole file has more than 5M rows and about 10 columns. If I save 100 rows as a test dataset, it works fine, but I get the error on the whole file.
We are getting the same error with version 0.2.0 installed using Conda.
@spektom, did you try with the master version as directed above, or try to change things on the Spark side (see the discussion)?
@martindurant The version installed from git master works! Thanks.
Excellent; I will schedule a release for when I return from vacation.
Hi all,
I'm loading some parquet files generated by a Spark ETL job.
I get this error when calling parquet_file.to_pandas().