Speed of Reading into ADAM RDDs from S3 #2003
We're using ADAM via the Python API, and we're running into bottlenecks loading data from S3 via s3a. We see a maximum throughput of about 100 Mbps when reading BAMs and VCFs into ADAM RDDs from S3; loading the same files into Spark as text files runs at roughly 1 Gbps. I realize many factors could affect this performance, but are these numbers in the ballpark of what's expected for this use case of ADAM? If not, are there recommended troubleshooting steps?
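For context on what that throughput gap means in wall-clock terms, here's a rough back-of-the-envelope sketch (the 20 GB file size is purely an illustrative assumption, not a figure from this issue):

```python
def transfer_time_seconds(size_gb: float, throughput_mbps: float) -> float:
    """Estimate wall-clock seconds to move size_gb at throughput_mbps (megabits/s)."""
    size_megabits = size_gb * 1024 * 8  # GB -> megabits
    return size_megabits / throughput_mbps

# Hypothetical 20 GB BAM, comparing the two observed rates:
slow = transfer_time_seconds(20, 100)    # ~100 Mbps via ADAM + s3a
fast = transfer_time_seconds(20, 1000)   # ~1 Gbps as Spark text files
print(f"{slow / 60:.0f} min vs {fast / 60:.1f} min")  # -> 27 min vs 2.7 min
```

So at the reported rates the same file takes roughly 10x longer to land, which is why it's worth ruling out connector configuration before blaming ADAM itself.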
I can provide more info if needed.
Hello @nick-phillips, thanks for the question.
Things have been strange for me and others reading BAM and VCF from S3 via s3a recently. Parquet works fine though. See #1951
Perhaps it would be useful to discuss this further on Gitter? Feel free to start a one-on-one if there is anything sensitive about your environment.
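While the thread leaks linked above are the likely culprit, s3a read throughput is also sensitive to connector settings. As a hedged starting point (the property names are from the Hadoop s3a connector; the values here are illustrative guesses, not tuned for any particular workload), something like this in `spark-defaults.conf`:

```properties
# Illustrative s3a tuning knobs; adjust values for your cluster
spark.hadoop.fs.s3a.connection.maximum          100
spark.hadoop.fs.s3a.threads.max                 64
spark.hadoop.fs.s3a.readahead.range             1M
# 'random' favors seek-heavy formats like BAM/Parquet; 'sequential' favors full scans
spark.hadoop.fs.s3a.experimental.input.fadvise  random
```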
@heuermh - would you care to elaborate on "Things have been strange for me and others reading BAM and VCF from S3 via s3a recently."
Also, I am attempting to load VCFs as Parquet via the Python API, as you suggested for @nick-phillips, and have saved a VCF I loaded to Parquet via
When I try to load this, however, I get the following error:
```
Py4JJavaError: An error occurred while calling o72.loadVariants.
```
My Parquet VCF exists as a directory on S3 with many partitioned files, and I verified the `_SUCCESS` file was present, but although the documentation says that `loadVariants` in the Python API supports Parquet, it can't seem to load it via the s3a protocol. Is there something I am missing here?
@pjongeneel It's in the linked issue: there are thread leaks upstream in the Hadoop libraries that cause trouble.
```python
df = adamContext.loadVariants(path).toDF()
df.write.format("parquet") \
    .save("s3a://my_bucket/df.parquet") \
    .saveMetadata("s3a://my_bucket/df.parquet")
```
I haven't tried writing the
You can write to Parquet via the
@heuermh , thanks for the info, I tried that and it worked fine!
Side note: I actually got the original code for saving my dataframe from the ADAM Scala API:
```scala
override def saveAsParquet(filePath: String,
```
However, when I saved it manually like that, I got
I'm not sure yet if there is an easy way to save the dataframe directly as `.gz.parquet` files, but I have a solution that works for now, so thank you!
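On the `.gz.parquet` side note: Spark picks the Parquet compression codec from configuration, so one hedged option (this is the standard Spark SQL setting, untested against this exact pipeline) is to set the codec to gzip before writing; part files then come out with the `.gz.parquet` suffix:

```properties
# spark-defaults.conf (or set via SparkConf / --conf)
spark.sql.parquet.compression.codec  gzip
```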