
string data type does not work in Amazon Athena but identical Drill file does work #150

Closed
Non-PlayerCharacter opened this issue May 15, 2017 · 43 comments · Fixed by #179

Comments

@Non-PlayerCharacter

Creating a little file like this:

import fastparquet
import pandas as pd
df = pd.DataFrame({'z': [u'#',]})
fastparquet.write('fastparquet.pq', df, object_encoding='utf8', compression='GZIP')

This causes Athena string comparisons to fail:

athena "select * from my_database.fastparquet where z='#';"
z
-

(NO ROWS ARE RETURNED)

To make it work, you have to cast the column to varchar:

athena "select * from my_database.fastparquet where CAST(z as varchar) = '#' ;"
z
-
#

But creating the identical file using Drill allows you to do string literal comparisons without any casting:

athena "select * from my_database.other_source where z='#';"
z
-
#

I've attached a file with all the steps, including an octal dump of each of the two tiny parquet files.
fastparquet_vs_drill.txt

@martindurant
Member

This is pretty puzzling - if the schema for both files is bytes/UTF8/optional, what exactly does athena think the type is, if not varchar? I don't have athena myself, so I can't do any experimentation - but it does appear that it is drill's load-save roundtrip which has changed the representation, in your case to something more convenient.

@Non-PlayerCharacter
Author

Does the octal dump that I attached of each complete tiny file shed any light on the difference between the Drill version and the fastparquet version?

Thank you.

@Non-PlayerCharacter
Author

I am attaching the actual files here.

other_source.zip
fastparquet.zip

@martindurant
Member

The basic uncompressed binary column data in both cases is b'\x02\x00\x00\x00\x03\x01\x01\x00\x00\x00#'
The schemas are the same.
The fastparquet version lacks statistics at the data page level (both have at the column chunk level). The fastparquet file is slightly bigger.
I have no idea, sorry...

@Non-PlayerCharacter
Author

Opening both files with fastparquet.ParquetFile and then printing the fmd variable shows:

fastparquet has
encoding_stats: [<class 'parquet_thrift.PageEncodingStats'>
count: 1
encoding: 0
page_type: 0
]

drill has:
encoding_stats: None

Also
fastparquet has
name: schema

drill has
name: root
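
For reference, a minimal sketch of the inspection described above; the file names are assumptions based on the attachments:

import fastparquet

# Print the raw thrift FileMetaData footer of each file for comparison.
for path in ('fastparquet.pq', 'other_source.parquet'):
    pf = fastparquet.ParquetFile(path)
    print(path)
    print(pf.fmd)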

@martindurant
Member

Here is a variant with those changes - I very much doubt it makes any difference.
fastparquet2.pq.zip

@Non-PlayerCharacter
Author

Sadly you are right, it made no difference. I very much appreciate your time. I will continue to dig and see what the problem might be. Should I close this issue?

@martindurant
Member

The issue can remain, unless you run out of ideas too.

@Non-PlayerCharacter
Author

I also noticed that the drill file has key_value_metadata set to None, while the fastparquet file has it set to an empty list...

@diegojancic

+1 here. Files created with fastparquet do not work with PrestoDB/AWS Athena.

@martindurant
Member

@diegojancic, as you can see from the conversation above, I have tried to research the problem. I have drill, but not athena; I see that prestoDB is probably easy to install locally.
Do you perhaps have any way to shed light on what's going on?

@diegojancic

@martindurant Athena is fairly easy to use as you just have to register with AWS. Anyway, I'm working on it right now, and I'll post the solution if I'm able to fix it. I just didn't want the issue to be closed. Thanks!

@martindurant
Member

Thanks for having a go!

@zdk123

zdk123 commented Jul 6, 2017

@diegojancic: if it helps you debug the issue, the pyarrow parquet writer works with athena. I made the switch, but would prefer a fastparquet solution.

@martindurant
Member

I just noticed that you are passing u'#' and using UTF8 encoding, but VARCHAR is supposed to be explicitly ASCII. Yes, I know that drill makes UTF8 columns too... but I wonder what happens if you have the data as b'#' instead, and object_encoding='bytes'?
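
A minimal sketch of that experiment, reusing the toy data from the first comment (the output file name is an assumption):

import fastparquet
import pandas as pd

# Same single-value column as above, but stored as raw bytes with object_encoding='bytes'.
df = pd.DataFrame({'z': [b'#']})
fastparquet.write('fastparquet_bytes.pq', df, object_encoding='bytes', compression='GZIP')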

@diegojancic

Thanks @zdk123. I've tried with pyarrow but I couldn't make .write_table work (maybe it doesn't work well on Windows).

As for what @martindurant says, yes, I believe it's an encoding issue. Here are a couple tests:

RETURNS NO RESULTS:
... where column = 'value'

RETURNS VALID RESULTS:
... where column like 'value%' and length(column) = 5

... where to_utf8(column) = to_utf8('value')

I'm playing with changing the encoding of the pandas DataFrame and using object_encoding, but it hasn't worked yet.

@diegojancic

FYI,
This is what I'm doing for testing (without success):

dataFrame.firstname = dataFrame.firstname.str.encode("ascii")
fastparquet.write("dir", dataFrame, compression='GZIP', file_scheme='hive', object_encoding='bytes')

I've been trying several combinations of .encode("..."), including utf8 and latin_1, and also object_encoding set to bytes and utf8. Obviously I've no idea what I'm doing LOL, but I can try whatever you suggest. Thanks!

@martindurant
Member

That's exactly what I would have tried. I suppose it should be "utf8", but since I think your strings contain only ascii characters anyway, it won't make any difference.

@Non-PlayerCharacter
Author

I can mention a work-around that I'm using: write out multiple parquet files and then use "parquet-cat" from the parquet-tools package to merge them into a single parquet file that Athena has no problem with. This actually works out better for me, because on a multi-processor machine I can write out multiple parquet files simultaneously and then use parquet-cat to combine them while I write out the next set of files.

@diegojancic

What do you mean with multiple files? Do you partition them?

@martindurant
Member

@Non-PlayerCharacter , this doesn't speak to the problem, but you may be interested in dask dataframe's to_parquet method, which writes partitioned parquet in parallel.

@Non-PlayerCharacter
Author

@diegojancic since Athena transparently scans across multiple parquet files, I can just break my data up into arbitrary chunks and write them to separate files. The Athena docs said somewhere that Parquet files of 100GB were the right size for speedy queries.

My workflow is to read the data into Pandas using read_csv, read_fwf or read_sql_query using chunks; once a chunk is read, I fork my process and the child does data cleanup and transform in RAM on the chunk sitting in the Pandas dataframe, and then writes the dataframe out using fastparquet.

The parent process checks to see if the sum total of the parquet files that the child processes have written out adds up to 100GB, and if so it calls parquet-cat to create an Athena-readable-file in S3 from the individual files that had been written out by the child processes. If the sum bytesize is not 100GB, the parent reads the next chunk and forks off another child.

I can get all the cores working on my data transforms this way.
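
A rough sketch of that workflow, using a multiprocessing pool rather than a raw fork; the paths, chunk size and part naming are assumptions:

import multiprocessing as mp
import pandas as pd
import fastparquet

def clean_and_write(numbered_chunk):
    i, chunk = numbered_chunk
    # ... cleanup / transforms on the in-RAM chunk would go here ...
    fastparquet.write('part-%05d.parq' % i, chunk,
                      compression='GZIP', object_encoding='utf8')

if __name__ == '__main__':
    chunks = pd.read_csv('input.csv', chunksize=1000000)
    with mp.Pool() as pool:
        pool.map(clean_and_write, enumerate(chunks))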

@zdk123

zdk123 commented Jul 7, 2017

The Athena docs said somewhere that Parquet files of 100GB were the right size for speedy queries.

@Non-PlayerCharacter can you point this out in the docs? Not relevant to this issue, but for a different problem I'm having.

Fyi, I was still having the string matching issue even with a single 'simple-schema' parquet file.

@martindurant
Member

100GB per single file sounds very big to me, 100MB is more typical on HDFS systems (although I know reading from S3 is not the same situation because of the connection overhead).

You would do the same workflow in dask like

import dask.dataframe as dd
df = dd.read_csv(..., blocksize=500000000)
df = df.dropna()  # or whatever map/apply cleanup steps you need
df.to_parquet('s3://mybucket/data.parquet', compression='GZIP', object_encoding='utf8')

and this will parallelize by default, or you can run it with the distributed scheduler over a cluster too. Of course, the issue above would not be solved... In fact, you might not need athena, since df, above, is queryable (using pandas syntax) and can read from parquet on S3 too :)

@diegojancic

If this helps anyone debug the issue (I wouldn't know how). I'm attaching the same dataset saved as Parquet using Fastparquet and PyArrow, uncompressed. Fastparquet's does not work while PyArrow's does.

Full code used to generate that:

import pandas as pd
import fastparquet
import pyarrow
import pyarrow.parquet as pqt

dataFrame = pd.DataFrame({'id': [1,2], 'name': ['John', 'Mary']})

# FASTPARQUET
fastparquet.write("users_fastparquet", dataFrame, compression='UNCOMPRESSED', file_scheme='hive', object_encoding='utf8')

# PYARROW
table = pyarrow.Table.from_pandas(dataFrame)
pqt.write_table(table, '/users_pyarrow/part.0.parquet', compression='none')
pqt.write_metadata(table.schema, '/users_pyarrow/_metadata')

Thanks @zdk123 for the suggestion of using PyArrow. I would prefer to use fastparquet too.

users_fastparquet.zip
users_pyarrow.zip

@martindurant
Member

Interestingly, the arrow file is encoded as a dictionary, not simple strings, unlike the drill variant at the top of the thread.
Does anyone want to try with the name column as a categorical?

df['name'] = df['name'].astype('category')

(I don't know if this is a useful possible solution)

Also, fastparquet adds extra null bytes to the data page, because of an earlier problem with spark, but I don't see how this can cause a problem (athena seems to have no problem loading the data, only with interpreting the encoding).

By the way, is it possible to ask athena a question like: what type do you think this column is?

@diegojancic

@martindurant using .astype('category') didn't work in Athena. It's like it's not able to process the data now.

As for your other question: no, Athena will only say 'it's a string', and only because we said it was a string when we created the table (i.e. in the CREATE TABLE...)

@martindurant
Member

Would you mind trying the following? I have removed the statistics metadata, which perhaps has the wrong format and causes the missed values.

temp.zip

@diegojancic

@martindurant it worked!!! Can you explain what you did? Did you just delete the _common_metadata and _metadata files?

@martindurant
Member

I understand now, and I can apply a fix.
The statistics (the max/min of each column for a given chunk) are used to decide whether to load that chunk based on query conditions. The string values were being encoded the same way as they are in the data itself, as is done for non-string values, but it seems that they are supposed to be stored directly as thrift string types instead.

It is not the lack of metadata files (the metadata is contained in the one file in this case, just to make it easier - this is the "simple" format).
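
For anyone curious, a hedged sketch of how to peek at the column-chunk statistics in question; the attribute names follow the parquet-format thrift definitions, and the path is an assumption:

import fastparquet

pf = fastparquet.ParquetFile('users_fastparquet')   # directory written with file_scheme='hive'
for rg in pf.fmd.row_groups:
    for col in rg.columns:
        stats = col.meta_data.statistics
        if stats is not None:
            print(col.meta_data.path_in_schema, repr(stats.min), repr(stats.max))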

@martindurant
Member

Please see if #179 solves your problem. Also, it would be good to try with categoricals.

@diegojancic

Sorry, not sure how to build from source on Windows. I'm installing using conda.
I'm getting an error while executing python setup.py build:

...
running build_ext
error: [WinError 2] The system cannot find the file specified

@martindurant
Member

I'm afraid that some of the code needs compilation - I presume you have no appropriate compiler installed. I've never done this on Windows, so I'm sorry I can't do it for you.

@zdk123

zdk123 commented Jul 7, 2017

@martindurant I was able to build and install your branch for python 3.6.1, and can confirm exact string matching now works on Athena.

I don't have categorical data in my test sets, but I will try and report back.

@diegojancic

Yes, that's OK. I'll continue trying. The build for your pull-request failed, but once that's solved I guess it will be published in Anaconda (hopefully soon).

@martindurant
Member

OK, I merged to master. If categoricals break, please make a new issue.
Next week I am at scipy, so I'm not sure if I'll have time to release this on conda-forge.

@diegojancic

OK. Thanks for everything @martindurant. I'll wait for the conda-forge release whenever you can. If not I'll continue trying to compile it for Windows. Very much appreciated!

@wesm

wesm commented Jul 9, 2017

@diegojancic you said

Thanks @zdk123. I've tried with pyarrow but I couldn't make .write_table work (maybe it doesn't work well on Windows).

Can you report the issue on the Arrow JIRA? Since Windows support only appeared recently I would be keen to hear of any issues so we can fix them. Thanks!

@yan-hic

yan-hic commented Aug 15, 2017

@martindurant has the fix made it to conda-forge? I just tried to upgrade from there, and also from GitHub, both to no avail: I still need to cast.
It seems that datetime64 is not converted from the frame correctly either, e.g. 1034640000000000 instead of Oct 15 2012.
FYI, I don't use Athena but Drill to read parquet.

@martindurant
Member

The time types are a different issue. There is a compatibility int96 type available on write

fastparquet.write(filename, data, times='int96')

for the "non-standard" times representation that, nevertheless, many parquet frameworks use.

@yan-hic

yan-hic commented Aug 15, 2017

I tried that but it shows up in Drill as a byte-array (e.g. B@5bbf1c15). I will revert for now, converting in pd, and do this down the line, in SQL/Drill.

As for the casting, I resolved it coincidentally when playing with encodings, as the object_encoding arg gave me an error. It turns out that once I added the encoding arg - I had given none - to the various pd.read_ functions I use, Drill showed the strings correctly.
Weird - does fastparquet determine and use the encoding of the frame to pass along in the parq files?

@martindurant
Member

Unless you say otherwise, fastparquet looks at the types of the first few values in an object column to determine how to encode.
Perhaps adding the encoding argument to pandas ensures that the strings are encodable in UTF8 (the only encoding parquet supports).
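
For example, a minimal sketch of being explicit rather than relying on that inference; the file names here are hypothetical:

import fastparquet
import pandas as pd

df = pd.read_csv('input.csv', encoding='utf-8')   # ensure the object columns hold str values
fastparquet.write('output.parq', df, object_encoding='utf8')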

@yan-hic

yan-hic commented Aug 15, 2017

Actually the encoding arg helped with read_csv but not read_fwf. On the latter df's, I still get byte arrays when querying through Drill after fastparquet. I will create a separate issue.
