Does Spark.jl support writing data to HDFS? #98
Comments
Indeed:

```julia
function save_as_text_file(rdd::RDD, path::AbstractString)
    jcall(rdd.jrdd, "saveAsTextFile", Nothing, (JString,), path)
end
```

You can add other methods in a similar way using the Spark Java docs and JavaCall.jl. Note that the RDD API is quite old, so you might also be interested in the SQL API, e.g. methods
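To illustrate "adding other methods in a similar way", here is a hedged sketch of wrapping one more JavaRDD method with a one-line `jcall`. `saveAsObjectFile` is a standard method on Spark's JavaRDD; the Julia wrapper name `save_as_object_file` is hypothetical and not part of Spark.jl:

```julia
using JavaCall

# Hypothetical wrapper, following the same pattern as save_as_text_file
# above. JavaRDD.saveAsObjectFile(String) serializes the RDD's elements
# and writes them to the given path; it returns void, hence Nothing.
function save_as_object_file(rdd::RDD, path::AbstractString)
    jcall(rdd.jrdd, "saveAsObjectFile", Nothing, (JString,), path)
end
```

The pattern is always the same: look up the Java method's name, return type, and argument types in the Spark Java API docs, then mirror them in the `jcall` invocation.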
@dfdx Thank you for the reply and the suggestions. I will definitely read more about the SQL and DataFrame APIs.
Can you post a reproducible snippet? If you don't see any errors during execution, it might be some generic issue like saving an empty RDD or exiting before Spark has time to finish writing to HDFS. Also, does it work if you read from and write to a local file, for example?
@dfdx, thank you for your help. It seems that Spark does not overwrite the output folder by default. It is working now after removing the old empty folder. Could we set some parameter to make the output overwritable? Here is the code snippet.
Yes, it's possible to overwrite the output directory. In the RDD API, it should be enough to set:

```julia
conf = SparkConf(Dict("spark.hadoop.validateOutputSpecs" => "false"))
sc = SparkContext(master="local", conf=conf)
```

In the SQL interface there's a special method for it, but we don't have a convenient API for it in Spark.jl, so you'll have to use a chain of
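Putting the pieces together, a minimal end-to-end sketch, assuming the `save_as_text_file` helper defined earlier in this thread and placeholder HDFS paths (adjust the namenode host and port for your cluster):

```julia
using Spark

# Disable Hadoop's output-spec validation so an existing output
# directory does not abort the job, as discussed above.
conf = SparkConf(Dict("spark.hadoop.validateOutputSpecs" => "false"))
sc = SparkContext(master="local", conf=conf)

# Placeholder HDFS paths for illustration only.
rdd = text_file(sc, "hdfs://namenode:9000/data/input.txt")
save_as_text_file(rdd, "hdfs://namenode:9000/data/output")

# Shut down the context so Spark finishes writing before the
# process exits (assumes a close method is available for it).
close(sc)
```

Note that rerunning this without the `validateOutputSpecs` setting would fail as soon as the output directory exists, which matches the behavior observed above.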
Thanks. Line 4 in the Spark.jl file seems to be a typo? It should be "SparkConf"? After correcting this, HDFS files can be overwritten. `module Spark export`
Ah, it's interesting that this simple mistake never surfaced before! Thanks for noticing. I'll fix the typo after your PR is merged to avoid rebasing on your side. Should we close this issue now?
Yep, thank you for the help!
Hi there, thank you for the great work. I am trying to use Spark.jl to read data from HDFS files and also write results back to HDFS. The user guide (http://dfdx.github.io/Spark.jl/index.html) says we can use `text_file` to load data, but it does not mention how to write to HDFS. Does Spark.jl currently support writing data to HDFS, something like `saveAsTextFile`? Or any suggestions on how to output RDD objects to HDFS? Thanks.