
DataFrames.jl to Spark dataframe #114

Open
bienpierre opened this issue Aug 1, 2022 · 2 comments

Hello,

Good job on Spark.jl.

I have an issue. I am trying to learn Spark, and I followed the documentation:

This is a quick introduction into the Spark.jl core functions. It closely follows the official PySpark tutorial and copies many examples verbatim. In most cases, PySpark docs should work for Spark.jl as is or with little adaptation.

I found that it is possible to create a PySpark DataFrame from a pandas DataFrame, so I tried to create a Spark DataFrame from a Julia DataFrames.jl DataFrame:

```julia
df = spark.createDataFrame([
    Row(a=1, b=2.0, c="string1", d=Date(2000, 1, 1), e=DateTime(2000, 1, 1, 12, 0)),
    Row(a=2, b=3.0, c="string2", d=Date(2000, 2, 1), e=DateTime(2000, 1, 2, 12, 0)),
    Row(a=4, b=5.0, c="string3", d=Date(2000, 3, 1), e=DateTime(2000, 1, 3, 12, 0))
])
+---+---+-------+----------+-------------------+
|  a|  b|      c|         d|                  e|
+---+---+-------+----------+-------------------+
|  1|2.0|string1|2000-01-01|2000-01-01 13:00:00|
|  2|3.0|string2|2000-02-01|2000-01-02 13:00:00|
|  4|5.0|string3|2000-03-01|2000-01-03 13:00:00|
+---+---+-------+----------+-------------------+

julia> println(df)
+---+---+-------+----------+-------------------+
|  a|  b|      c|         d|                  e|
+---+---+-------+----------+-------------------+
|  1|2.0|string1|2000-01-01|2000-01-01 13:00:00|
|  2|3.0|string2|2000-02-01|2000-01-02 13:00:00|
|  4|5.0|string3|2000-03-01|2000-01-03 13:00:00|
+---+---+-------+----------+-------------------+

julia> df_dataframes = DataFrames.DataFrame(A=1:4, B=["M", "F", "F", "M"])
4×2 DataFrame
 Row │ A      B
     │ Int64  String
─────┼───────────────
   1 │     1  M
   2 │     2  F
   3 │     3  F
   4 │     4  M

julia> df = spark.createDataFrame(df_dataframes)
ERROR: MethodError: no method matching createDataFrame(::SparkSession, ::DataFrames.DataFrame)
Closest candidates are:
  createDataFrame(::SparkSession, ::Vector{Vector{Any}}, ::Union{String, Vector{String}}) at ~/.julia/packages/Spark/89BUd/src/session.jl:92
  createDataFrame(::SparkSession, ::Vector{Row}) at ~/.julia/packages/Spark/89BUd/src/session.jl:98
  createDataFrame(::SparkSession, ::Vector{Row}, ::Union{String, Vector{String}}) at ~/.julia/packages/Spark/89BUd/src/session.jl:87
  ...
Stacktrace:
 [1] (::Spark.DotChainer{SparkSession, typeof(Spark.createDataFrame)})(args::DataFrames.DataFrame)
   @ Spark ~/.julia/packages/Spark/89BUd/src/chainable.jl:13
 [2] top-level scope
   @ REPL[21]:1
```

Is there a way to create a Spark DataFrame from a DataFrames.jl DataFrame, or do I have to use Pandas.jl?

Regards

dfdx (Owner) commented Aug 2, 2022

DataFrames.jl is definitely the way to go, but the integration isn't done yet. In the simplest case, you can convert the rows of a DataFrames.DataFrame to a Vector of Spark.Row and pass it to Spark.createDataFrame(...). A more efficient solution would be to use Arrow.jl to pass data between Julia and the JVM underlying Spark.jl, but that requires a bit more research than I currently have time for :( I will take a look at it in the next iteration of work on Spark.jl.
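
For anyone hitting the same error, here is a minimal sketch of that workaround, assuming the `Row` keyword constructor and the `createDataFrame(::SparkSession, ::Vector{Row})` method listed in the error above, and that `spark` is the already-created `SparkSession`. The `to_spark_df` helper name is hypothetical and not part of Spark.jl:

```julia
using Spark
import DataFrames

# Hypothetical helper (not part of Spark.jl): build a Spark Row from each
# row of the DataFrames.DataFrame, then hand the Vector{Row} to createDataFrame.
function to_spark_df(spark::SparkSession, df::DataFrames.DataFrame)
    cols = Symbol.(DataFrames.names(df))
    rows = [Row(; (c => df[i, c] for c in cols)...) for i in 1:DataFrames.nrow(df)]
    return spark.createDataFrame(rows)
end

df_dataframes = DataFrames.DataFrame(A=1:4, B=["M", "F", "F", "M"])
sdf = to_spark_df(spark, df_dataframes)
println(sdf)
```

Note that this copies the data row by row through the JVM bridge, so it is only practical for small tables; the Arrow-based path mentioned above would avoid that overhead.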

exyi (Contributor) commented Aug 2, 2022

No promises, but I'll most likely have some time this month to bring the Arrow.jl interop back :)
