
Does Spark.jl support writing data to HDFS? #98

Closed
pzhanggit opened this issue Sep 17, 2021 · 8 comments

Comments

pzhanggit commented Sep 17, 2021

Hi there, thank you for the great work. I am trying to use Spark.jl to read data from HDFS files and also write results to HDFS. The user guide (http://dfdx.github.io/Spark.jl/index.html) says we could use "text_file" to load data, but does not mention how to write to HDFS. Does Spark.jl currently support writing data to HDFS, something like "saveAsTextFile"? Or any suggestions on how to output RDD objects to HDFS? Thanks.

dfdx (Owner) commented Sep 18, 2021

Indeed, saveAsTextFile is missing from the API. But it should be relatively easy to add. I'm not at my main laptop right now, but something like this should do the trick:

# wraps the Java RDD's saveAsTextFile method via JavaCall's jcall
function save_as_text_file(rdd::RDD, path::AbstractString)
    jcall(rdd.jrdd, "saveAsTextFile", Nothing, (JString,), path)
end

You can add other methods in a similar way using Spark Java docs and JavaCall.jl.
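For illustration, here is a hypothetical wrapper for another JavaRDD method, count, following the same pattern (the name count_elements is made up for this example; Spark.jl may already expose count under another name):

```julia
# Hypothetical example of wrapping a no-argument JavaRDD method.
# "count" returns a Java long, so the jcall return type is jlong.
function count_elements(rdd::RDD)
    jcall(rdd.jrdd, "count", jlong, ())
end
```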

Note that the RDD API is quite old, so you might also be interested in the SQL API, e.g. methods read_json(), write_json(), etc.
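For comparison, a rough sketch of a read/write round trip through the SQL API (the session constructor and the paths here are illustrative and may differ between Spark.jl versions; check the package docs for the exact API):

```julia
using Spark

Spark.init()
# hypothetical session setup; the exact constructor may vary
sess = SparkSession(master="local")
df = read_json(sess, "hdfs://.../input.json")    # placeholder path
write_json(df, "hdfs://.../output.json")         # placeholder path
close(sess)
```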

pzhanggit (Author) commented

@dfdx Thank you for the reply and suggestions. I will definitely read more about SQL and DataFrames.
I am a beginner in Spark and was trying to play with the word-count example using the RDD API in Julia. I tried to add save_as_text_file to Spark.jl but got an empty HDFS output folder. I added the function you have above to rdd.jl and exported save_as_text_file from Spark.jl. Did I miss something here?

dfdx (Owner) commented Sep 19, 2021

Can you post a reproducible snippet? If you don't see any errors during execution, it might be some generic error like saving an empty RDD or exiting before Spark has time to finish writing to HDFS.

Also, does it work if you read and write to a local file, for example?

pzhanggit (Author) commented

@dfdx, thank you for your help. It seems that Spark does not overwrite the output folder by default; it is working now after removing the old empty folder. Is there a parameter that makes the output overwritable? Here is the code snippet.

using Spark

filepath_input = "hdfs://..."
filepath_output = "hdfs://..."

Spark.init()
sc = SparkContext(master="local")
text = text_file(sc, filepath_input)
# split each line into individual words
words = flat_map(text, s -> [string(word) for word in split(s)])
# pair every word with 1, then sum the counts per word
words_tuple = cartesian(words, parallelize(sc, [1]))
counts = reduce_by_key(words_tuple, +)
save_as_text_file(counts, filepath_output)
close(sc)

dfdx (Owner) commented Sep 20, 2021

Yes, it's possible to overwrite the output directory. In the RDD API, it should be enough to set the "spark.hadoop.validateOutputSpecs" property to "false", e.g.:

conf = SparkConf(Dict("spark.hadoop.validateOutputSpecs" => "false"))
sc = SparkContext(master="local", conf=conf)

In the SQL interface there's a dedicated method for it, but Spark.jl doesn't expose a convenient API for it yet, so you'll have to use a chain of jcalls directly against the Java API. See more details on the solution for the SQL interface here.
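Such a jcall chain might look roughly like the following. The class names come from the Spark Java API (DataFrameWriter's mode("overwrite")), but the wrapper field name df.jdf is an assumption about Spark.jl internals, and the path is a placeholder:

```julia
using JavaCall

# Spark's Java class for writing Datasets/DataFrames
const JDataFrameWriter = @jimport org.apache.spark.sql.DataFrameWriter

# df.jdf is assumed to hold the underlying Java Dataset reference
jwriter = jcall(df.jdf, "write", JDataFrameWriter, ())
# switch the save mode to "overwrite" so existing output is replaced
jwriter = jcall(jwriter, "mode", JDataFrameWriter, (JString,), "overwrite")
jcall(jwriter, "json", Nothing, (JString,), "hdfs://...")   # placeholder path
```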

pzhanggit (Author) commented

Thanks. Line 4 in the Spark.jl file seems to be a typo? It should be "SparkConf". After correcting it, HDFS files can be overwritten.

module Spark

export
SparkConfig,
SparkContext,

dfdx (Owner) commented Sep 27, 2021

Ah, it's interesting that this simple mistake never surfaced before! Thanks for noticing. I'll fix the typo after your PR is merged to avoid rebasing on your side.

Should we close this issue now?

pzhanggit (Author) commented

Yep, thank you for the help!
