
string data type does not work in Amazon Athena but identical Drill file does work #150

Closed
Non-PlayerCharacter opened this issue May 15, 2017 · 43 comments · Fixed by #179

Comments

@Non-PlayerCharacter

Creating a little file like this:

import fastparquet
import pandas as pd
df = pd.DataFrame({'z': [u'#',]})
fastparquet.write('fastparquet.pq', df, object_encoding='utf8', compression='GZIP')

This causes Athena string comparisons to fail:

athena "select * from my_database.fastparquet where z='#';"
z
-

(NO ROWS ARE RETURNED)

To make it work, you have to cast the column to varchar:

athena "select * from my_database.fastparquet where CAST(z as varchar) = '#' ;"
z
-
#

But creating the identical file using Drill allows you to do string literal comparisons without any casting:

athena "select * from my_database.other_source where z='#';"
z
-
#

I've attached a file with all the steps, including an octal dump of each of the two tiny parquet files.
fastparquet_vs_drill.txt

@martindurant
Member

This is pretty puzzling - if the schema for both files is bytes/UTF8/optional, what exactly does athena think the type is, if not varchar? I don't have athena myself, so I can't do any experimentation - but it does appear that it is drill's load-save roundtrip which has changed the representation, in your case to something more convenient.

@Non-PlayerCharacter
Author

Does the octal dump that I attached of each complete tiny file shed any light on the difference between the Drill version and the fastparquet version?

Thank you.

@Non-PlayerCharacter
Author

I am attaching the actual files here.

other_source.zip
fastparquet.zip

@martindurant
Member

The basic uncompressed binary column data in both cases is b'\x02\x00\x00\x00\x03\x01\x01\x00\x00\x00#'
The schemas are the same.
The fastparquet version lacks statistics at the data page level (both have at the column chunk level). The fastparquet file is slightly bigger.
I have no idea, sorry...

@Non-PlayerCharacter
Author

Opening both files with fastparquet.ParquetFile and then printing the fmd variable shows:

fastparquet has
encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 0
]

drill has:
encoding_stats: None

Also
fastparquet has
name: schema

drill has
name: root
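
For reference, a minimal sketch of the inspection described above; the file names are assumptions based on the attachments:

import fastparquet

# Print the raw thrift FileMetaData footer of each file for comparison.
for path in ('fastparquet.pq', 'other_source.parquet'):
    pf = fastparquet.ParquetFile(path)
    print(path)
    print(pf.fmd)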

@martindurant
Member

Here is a variant with those changes - I very much doubt it makes any difference.
fastparquet2.pq.zip

@Non-PlayerCharacter
Author

Sadly you are right, it made no difference. I very much appreciate your time. I will continue to dig and see what the problem might be. Should I close this issue?

@martindurant
Member

The issue can remain, unless you run out of ideas too.

@Non-PlayerCharacter
Author

I also noticed that the drill file has key_value_metadata set to None, while the fastparquet file has it set to an empty list...

@diegojancic

+1 here. Files created with fastparquet do not work with PrestoDB/AWS Athena.

@martindurant
Member

@diegojancic, as you can see from the conversation above, I have tried to research the problem. I have drill, but not athena; I see that prestoDB is probably easy to install locally.
Do you perhaps have any way to shed light on what's going on?

@diegojancic

@martindurant Athena is fairly easy to use as you just have to register with AWS. Anyway, I'm working on it right now, and I'll post the solution if I'm able to fix it. I just didn't want the issue to be closed. Thanks!

@martindurant
Member

Thanks for having a go!

@zdk123

zdk123 commented Jul 6, 2017

@diegojancic: if it helps you debug the issue, the pyarrow parquet writer works with athena. I made the switch, but would prefer a fastparquet solution.

@martindurant
Member

I just noticed that you are passing u'#' and using UTF8 encoding, but VARCHAR is supposed to be explicitly ASCII. Yes, I know that drill makes UTF8 columns too... but I wonder what happens if you have the data as b'#' instead, and object_encoding='bytes'?
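
A minimal sketch of that experiment, reusing the toy data from the first comment (the output file name is an assumption):

import fastparquet
import pandas as pd

# Same single-value column as above, but stored as raw bytes with object_encoding='bytes'.
df = pd.DataFrame({'z': [b'#']})
fastparquet.write('fastparquet_bytes.pq', df, object_encoding='bytes', compression='GZIP')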

@diegojancic

Thanks @zdk123. I've tried with pyarrow but I couldn't make .write_table work (maybe it doesn't work well on Windows).

As for what @martindurant says, yes, I believe it's an encoding issue. Here are a couple tests:

RETURNS NO RESULTS:
... where column = 'value'

RETURNS VALID RESULTS:
... where column like 'value%' and length(column) = 5

... where to_utf8(column) = to_utf8('value')

I'm playing with changing the encoding of the pandas DataFrame and using object_encoding, but it hasn't worked yet.

@diegojancic

FYI,
This is what I'm doing for testing (without success):

dataFrame.firstname = dataFrame.firstname.str.encode("ascii")
fastparquet.write("dir", dataFrame, compression='GZIP', file_scheme='hive', object_encoding='bytes')

I've been trying several combinations of .encode("..."), including utf8 and latin_1, and also object_encoding set to bytes and utf8. Obviously I've no idea what I'm doing LOL, but I can try whatever you suggest. Thanks!

@martindurant
Member

That's exactly what I would have tried. I suppose it should be "utf8", but since I think your strings contain only ascii characters anyway, it won't make any difference.

@Non-PlayerCharacter
Author

I can mention a work-around that I'm using: write out multiple parquet files and then use "parquet-cat" from the parquet-tools package to merge them into a single parquet file that Athena has no problem with. This actually works out better for me, because on a multi-processor machine I can write out multiple parquet files simultaneously and then use parquet-cat to combine them while I write out the next set of files.

@diegojancic

What do you mean with multiple files? Do you partition them?

@martindurant
Member

@Non-PlayerCharacter , this doesn't speak to the problem, but you may be interested in dask dataframe's to_parquet method, which writes partitioned parquet in parallel.

@Non-PlayerCharacter
Author

@diegojancic since Athena transparently scans across multiple parquet files, I can just break my data up into arbitrary chunks and write them to separate files. The Athena docs said somewhere that Parquet files of 100GB were the right size for speedy queries.

My workflow is to read the data into Pandas using read_csv, read_fwf or read_sql_query using chunks; once a chunk is read, I fork my process and the child does data cleanup and transform in RAM on the chunk sitting in the Pandas dataframe, and then writes the dataframe out using fastparquet.

The parent process checks to see if the sum total of the parquet files that the child processes have written out adds up to 100GB, and if so it calls parquet-cat to create an Athena-readable-file in S3 from the individual files that had been written out by the child processes. If the sum bytesize is not 100GB, the parent reads the next chunk and forks off another child.

I can get all the cores working on my data transforms this way.
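
A rough sketch of that workflow, using a multiprocessing pool rather than a raw fork; the paths, chunk size and part naming are assumptions:

import multiprocessing as mp
import pandas as pd
import fastparquet

def clean_and_write(numbered_chunk):
    i, chunk = numbered_chunk
    # ... cleanup / transforms on the in-RAM chunk would go here ...
    fastparquet.write('part-%05d.parq' % i, chunk,
                      compression='GZIP', object_encoding='utf8')

if __name__ == '__main__':
    chunks = pd.read_csv('input.csv', chunksize=1000000)
    with mp.Pool() as pool:
        pool.map(clean_and_write, enumerate(chunks))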

@zdk123

zdk123 commented Jul 7, 2017

The Athena docs said somewhere that Parquet files of 100GB were the right size for speedy queries.

@Non-PlayerCharacter can you point this out in the docs? Not relevant to this issue, but for a different problem I'm having.

Fyi, I was still having the string matching issue even with a single 'simple-schema' parquet file.

@martindurant
Member

100GB per single file sounds very big to me, 100MB is more typical on HDFS systems (although I know reading from S3 is not the same situation because of the connection overhead).

You would do the same workflow in dask like

import dask.dataframe as dd
df = dd.read_csv(..., blocksize=500000000)
df = df.dropna()  # or whatever map/apply cleanup steps you need
df.to_parquet('s3://mybucket/data.parquet', compression='GZIP', object_encoding='utf8')

and this will parallelize by default, or you can run it with the distributed scheduler over a cluster too. Of course, the issue above would not be solved... In fact, you might not need athena, since df, above, is queryable (using pandas syntax) and can read from parquet on S3 too :)

@diegojancic

If this helps anyone debug the issue (I wouldn't know how). I'm attaching the same dataset saved as Parquet using Fastparquet and PyArrow, uncompressed. Fastparquet's does not work while PyArrow's does.

Full code used to generate that:

import pandas as pd
import fastparquet
import pyarrow
import pyarrow.parquet as pqt

dataFrame = pd.DataFrame({'id': [1,2], 'name': ['John', 'Mary']})

# FASTPARQUET
fastparquet.write("users_fastparquet", dataFrame, compression='UNCOMPRESSED', file_scheme='hive', object_encoding='utf8')

# PYARROW
table = pyarrow.Table.from_pandas(dataFrame)
pqt.write_table(table, '/users_pyarrow/part.0.parquet', compression='none')
pqt.write_metadata(table.schema, '/users_pyarrow/_metadata')

Thanks @zdk123 for the suggestion of using PyArrow. I would prefer to use fastparquet too.

users_fastparquet.zip
users_pyarrow.zip

@martindurant
Member

Interestingly, the arrow file is encoded as a dictionary, not simple strings, unlike the drill variant at the top of the thread.
Does anyone want to try with the name column as a categorical?

df['name'] = df['name'].astype('category')

(I don't know if this is a useful possible solution)

Also, fastparquet adds extra null bytes to the data page, because of an earlier problem with spark, but I don't see how this can cause a problem (athena seems to have no problem loading the data, only with interpreting the encoding).

By the way, is it possible to ask athena a question like: what type do you think this column is?

@diegojancic

@martindurant using .astype('category') didn't work in Athena. It's like it's not able to process the data now.

As for your other question: no, Athena will only say 'it's a string', and only because we said it was a string when we created the table (i.e. in the CREATE TABLE...)

@martindurant
Member

Would you mind trying the following? I have removed the statistics metadata, which perhaps has the wrong format and causes the missed values.

temp.zip

@diegojancic

@martindurant it worked!!! Can you explain what you did? Did you just delete the _common_metadata and _metadata files?

@martindurant
Member

I understand now, and I can apply a fix.
The statistics (the max/min of each column for a given chunk) are used to decide whether to load that chunk based on query conditions. The string values were being encoded the same way as they are in the data itself, as is done for non-string values, but it seems that they are supposed to be stored directly as thrift string types instead.

It is not the lack of metadata files (the metadata is contained in the one file in this case, just to make it easier - this is the "simple" format).
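
For anyone curious, a hedged sketch of how to peek at the column-chunk statistics in question; the attribute names follow the parquet-format thrift definitions, and the path is an assumption:

import fastparquet

pf = fastparquet.ParquetFile('users_fastparquet')   # directory written with file_scheme='hive'
for rg in pf.fmd.row_groups:
    for col in rg.columns:
        stats = col.meta_data.statistics
        if stats is not None:
            print(col.meta_data.path_in_schema, repr(stats.min), repr(stats.max))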

@martindurant
Member

Please see if #179 solves your problem. Also, it would be good to try with categoricals.

@diegojancic

Sorry, not sure how to build from source on Windows. I'm installing using conda.
I'm getting an error while executing python setup.py build:

...
running build_ext
error: [WinError 2] The system cannot find the file specified

@martindurant
Member

I'm afraid that some of the code needs compilation - I presume you have no appropriate compiler installed. I've never done this on Windows, so I'm sorry I can't do it for you.

@zdk123

zdk123 commented Jul 7, 2017

@martindurant I was able to build and install your branch for python 3.6.1, and can confirm exact string matching now works on Athena.

I don't have categorical data in my test sets, but I will try and report back.

@diegojancic

Yes, that's OK. I'll continue trying. The build for your pull-request failed, but once that's solved I guess it will be published in Anaconda (hopefully soon).

@martindurant
Member

OK, I merged to master. If categoricals break, please make a new issue.
Next week I am at scipy, so I'm not sure if I'll have time to release this on conda-forge.

@diegojancic

OK. Thanks for everything @martindurant. I'll wait for the conda-forge release whenever you can. If not I'll continue trying to compile it for Windows. Very much appreciated!

@wesm

wesm commented Jul 9, 2017

@diegojancic you said

Thanks @zdk123. I've tried with pyarrow but I couldn't make .write_table work (maybe it doesn't work well on Windows).

Can you report the issue on the Arrow JIRA? Since Windows support only appeared recently I would be keen to hear of any issues so we can fix them. Thanks!

@yan-hic

yan-hic commented Aug 15, 2017

@martindurant has the fix made it to conda-forge? I just tried to upgrade from there, and also from GitHub, both to no avail: I still need to cast.
It seems that datetime64 is not converted from the frame correctly either, e.g. 1034640000000000 instead of Oct 15 2012.
FYI, I don't use Athena but Drill to read parquet.

@martindurant
Member

The time types are a different issue. There is a compatibility int96 type available on write

fastparquet.write(filename, data, times='int96')

for the "non-standard" times representation that, nevertheless, many parquet frameworks use.

@yan-hic

yan-hic commented Aug 15, 2017

I tried that but it shows up in Drill as a byte-array (e.g. B@5bbf1c15). I will revert for now, converting in pd, and do this down the line, in SQL/Drill.

As for the casting, I resolved it coincidentally when playing with encodings, as the object_encoding arg gave me an error. It turns out that once I added the encoding arg - I had given none - to the various pd.read_ functions I use, Drill showed the strings correctly.
Weird - does fastparquet determine and use the encoding of the frame to pass along in the parq files?

@martindurant
Member

Unless you say otherwise, fastparquet looks at the types of the first few values in an object column to determine how to encode.
Perhaps adding the encoding argument to pandas ensures that the strings are encodable in UTF8 (the only encoding parquet supports).
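
For example, a minimal sketch of being explicit rather than relying on that inference; the file names here are hypothetical:

import fastparquet
import pandas as pd

df = pd.read_csv('input.csv', encoding='utf-8')   # ensure the object columns hold str values
fastparquet.write('output.parq', df, object_encoding='utf8')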

@yan-hic

yan-hic commented Aug 15, 2017

Actually the encoding arg helped with read_csv but not read_fwf. On the latter df's, I still get byte arrays when querying through Drill after fastparquet. I will create a separate issue.
