string data type does not work in Amazon Athena but identical Drill file does work #150
Comments
This is pretty puzzling: if the schema for both files is bytes/UTF8/optional, what exactly does Athena think the type is, if not varchar? I don't have Athena myself, so I can't do any experimentation, but it does appear that it is Drill's load-save roundtrip which has changed the representation, in your case to the more convenient one.
Does the octal dump that I attached of each complete tiny file shed any light on the difference between the Drill version and the fastparquet version? Thank you.
I am attaching the actual files here.
The basic uncompressed binary column data in both cases is: ...
Opening both files with fastparquet.ParquetFile and then printing the fmd variable shows differences: fastparquet has ... while drill has ... Also, drill has ...
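(For anyone reproducing the comparison, the inspection goes roughly like this; file names are placeholders:)

```python
import fastparquet

# Placeholder paths; compare the thrift footer metadata of the two files.
pf_fp = fastparquet.ParquetFile("fastparquet_version.parquet")
pf_drill = fastparquet.ParquetFile("drill_version.parquet")

print(pf_fp.fmd)     # FileMetaData as written by fastparquet
print(pf_drill.fmd)  # FileMetaData as written by Drill
```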
Here is a variant with those changes; I very much doubt it makes any difference.
Sadly you are right, it made no difference. I very much appreciate your time. I will continue to dig and see what the problem might be. Should I close this issue?
The issue can remain, unless you run out of ideas too.
I also noticed that the drill file has key_value_metadata set to None, while the fastparquet file has it set to an empty list...
+1 here. Files created with fastparquet do not work with PrestoDB/AWS Athena.
@diegojancic, as you can see from the conversation above, I have tried to research the problem. I have Drill, but not Athena; I see that PrestoDB is probably easy to install locally.
@martindurant Athena is fairly easy to use, as you just have to register with AWS. Anyway, I'm working on it right now; I'll post the solution if I'm able to fix it. I just didn't want the issue to be closed. Thanks!
Thanks for having a go!
@diegojancic: if it helps you debug the issue, the pyarrow parquet writer works with Athena. I made the switch, but would prefer a fastparquet solution.
I just noticed that you are passing ...
Thanks @zdk123. I've tried with pyarrow but couldn't make the ... work. As for what @martindurant says: yes, I believe it's an encoding issue. Here are a couple of tests. RETURNS NO RESULTS: ... RETURNS VALID RESULTS: ...
I'm playing with changing the encoding of the pandas DataFrame and using ...
FYI, ...
I've been trying with several combinations of ...
That's exactly what I would have tried. I suppose it should be "utf8", but since I think your strings contain only ASCII characters anyway, it won't make any difference.
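For reference, forcing the encoding explicitly would look something like this (a sketch, with stand-in data):

```python
import pandas as pd
import fastparquet

df = pd.DataFrame({"name": ["hello"]})  # stand-in data

# Force UTF-8 for object columns instead of letting fastparquet infer
# the encoding from the first few values.
fastparquet.write("test.parquet", df, object_encoding="utf8")
```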
I can mention a workaround that I'm using: write out multiple parquet files and then use parquet-cat from the parquet-tools package to merge them into a single parquet file, which Athena has no problem with. This actually works out better for me, because on a multi-processor machine I can write out multiple parquet files simultaneously and then use parquet-cat to combine them while I write out the next set of files.
What do you mean by multiple files? Do you partition them?
@Non-PlayerCharacter, this doesn't speak to the problem, but you may be interested in dask dataframe's ...
@diegojancic since Athena transparently scans across multiple parquet files, I can just break my data up into arbitrary chunks and write them to separate files. The Athena docs said somewhere that Parquet files of 100GB were the right size for speedy queries. My workflow is to read the data into pandas using read_csv, read_fwf or read_sql_query in chunks; once a chunk is read, I fork my process, and the child does data cleanup and transformation in RAM on the chunk sitting in the pandas DataFrame, then writes the DataFrame out using fastparquet. The parent process checks whether the sum total of the parquet files the child processes have written adds up to 100GB; if so, it calls parquet-cat to create an Athena-readable file in S3 from the individual files written by the children. If the total size is under 100GB, the parent reads the next chunk and forks off another child. I can get all the cores working on my data transforms this way.
@Non-PlayerCharacter can you point this out in the docs? Not relevant to this issue, but useful for a different problem I'm having. FYI, I was still having the string-matching issue even with a single 'simple-schema' parquet file.
100GB per single file sounds very big to me; 100MB is more typical on HDFS systems (although I know reading from S3 is not the same situation because of the connection overhead). You would do the same workflow in dask like this (the original snippet was stripped; a plausible sketch with illustrative paths):
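```python
import dask.dataframe as dd

def clean_and_transform(part):
    # stand-in for the real per-chunk cleanup
    return part.dropna()

# Illustrative paths; dask reads the chunks in parallel.
df = dd.read_csv("s3://bucket/raw/*.csv")
df = df.map_partitions(clean_and_transform)
df.to_parquet("s3://bucket/clean.parquet")  # writes one part-file per partition
```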
and this will parallelize by default, or you can run with the distributed scheduler over a cluster too. Of course, the issue above would not be solved... In fact, you might not need Athena, since df, above, is queryable (using pandas syntax) and can read from parquet on S3 too :)
In case this helps anyone debug the issue (I wouldn't know how): I'm attaching the same dataset saved as Parquet using fastparquet and PyArrow, uncompressed. fastparquet's does not work while PyArrow's does. Full code used to generate that:
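(The original code was lost in formatting; a minimal reconstruction, with a stand-in DataFrame, might look like this:)

```python
import pandas as pd
import fastparquet
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({"name": ["foo", "bar"]})  # stand-in for the real dataset

# fastparquet version: string comparisons fail in Athena
fastparquet.write("data_fastparquet.parquet", df, compression="UNCOMPRESSED")

# pyarrow version: works in Athena
pq.write_table(pa.Table.from_pandas(df), "data_pyarrow.parquet",
               compression="none")
```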
Thanks @zdk123 for the suggestion of using PyArrow. I would prefer to use fastparquet too.
Interestingly, the arrow file is encoded as a dictionary, not simple strings, unlike the drill variant at the top of the thread.
(I don't know if this points to a useful possible solution.) Also, fastparquet adds extra null bytes to the data page because of an earlier problem with Spark, but I don't see how this could cause a problem (Athena seems to have no trouble loading the data, only with the interpretation of the encoding). By the way, is it possible to ask Athena a question like: what type do you think this column is?
@martindurant, using ... As for your other question: no, Athena will only say "it's a ...".
Would you mind trying the following? I have removed the statistics metadata, which perhaps has the wrong format and causes the missed values.
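(For context, the statistics in question live in the footer thrift and can be inspected like so; the path is a placeholder:)

```python
import fastparquet

pf = fastparquet.ParquetFile("test.parquet")
# Per-column-chunk min/max statistics sit in the footer metadata.
stats = pf.fmd.row_groups[0].columns[0].meta_data.statistics
print(stats)
```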
@martindurant it worked!!! Can you explain what you did? Just delete the statistics?
I understand now; I can apply a fix. It is not the lack of metadata files (the metadata is contained in the one file in this case, just to make it easier: this is the "simple" format).
Please see if #179 solves your problem. Also, it would be good to try with categoricals.
Sorry, not sure how to build from source on Windows. I'm installing using ...
I'm afraid that some of the code needs compilation; I presume you have no appropriate compiler installed. I've never done this on Windows, so sorry, I can't do it for you.
@martindurant I was able to build and install your branch for Python 3.6.1, and can confirm exact string matching now works on Athena. I don't have categorical data in my test sets, but I will try and report back.
Yes, that's OK. I'll continue trying. The build for your pull request failed, but once that's solved I guess it will be published on Anaconda (hopefully soon).
OK, I merged to master. If categoricals break, please make a new issue.
OK. Thanks for everything @martindurant. I'll wait for the conda-forge release whenever you can. If not, I'll continue trying to compile it for Windows. Very much appreciated!
@diegojancic you said: ...
Can you report the issue on the Arrow JIRA? Since Windows support only appeared recently, I would be keen to hear of any issues so we can fix them. Thanks!
@martindurant has the fix made it to conda-forge? I just tried to upgrade from there, and also from GitHub, both to no avail: I still need to cast.
The time types are a different issue. There is a compatibility int96 type available on write for the "non-standard" times representation that, nevertheless, many parquet frameworks use.
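A sketch of that option (the DataFrame here is a stand-in with a datetime column):

```python
import pandas as pd
import fastparquet

df = pd.DataFrame({"when": pd.to_datetime(["2017-06-01"])})  # stand-in data

# times="int96" stores timestamps in the legacy 96-bit layout that many
# older frameworks expect, instead of the standard int64 representation.
fastparquet.write("out.parquet", df, times="int96")
```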
I tried that, but it shows up in Drill as byte-array (e.g. ...). As for the casting, I resolved it coincidentally when playing with encodings, as the object_encoding arg gave me an error. It turns out that once I added the encoding arg (I had given none) to the various pd.read_* calls I use, Drill showed the strings correctly.
Unless you say otherwise, fastparquet looks at the types of the first few values in an object column to determine how to encode.
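(In other words, something like this; the column name is illustrative:)

```python
import pandas as pd
import fastparquet

df = pd.DataFrame({"name": ["hello"]})  # stand-in data

# object_encoding="infer" is the default; a dict pins the encoding
# of each object column explicitly.
fastparquet.write("out.parquet", df, object_encoding={"name": "utf8"})
```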
Actually, the encoding arg helped with read_csv but not read_fwf. On the latter DataFrames, I still get a byte array when querying through Drill after fastparquet. I will create a separate issue.
Creating a little file like this:
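(The snippet was stripped in formatting; a minimal reconstruction with stand-in data:)

```python
import pandas as pd
import fastparquet

df = pd.DataFrame({"name": ["hello"]})  # stand-in data
fastparquet.write("test.parquet", df)   # default settings
```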
Causes Athena string comparisons to fail:
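(The query was also stripped; reconstructed for illustration, with assumed table and column names:)

```python
# Plain string-literal comparison against the fastparquet-written table:
query = "SELECT * FROM test_table WHERE name = 'hello'"
```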
(NO ROWS ARE RETURNED)
To make it work, you have to cast the column to varchar:
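(Again reconstructed; names are assumptions:)

```python
# Casting the column to varchar makes the comparison match:
query = "SELECT * FROM test_table WHERE CAST(name AS VARCHAR) = 'hello'"
```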
But creating the identical file using Drill allows you to do string literal comparisons without any casting:
I've attached a file with all the steps, including an octal dump of each of the two tiny parquet files.
fastparquet_vs_drill.txt