[VL] Unsupported spark function list [please leave a comment if you plan to pick some] #4039

Open
53 of 98 tasks
PHILO-HE opened this issue Dec 14, 2023 · 74 comments
Labels
enhancement New feature or request

Comments

@PHILO-HE
Contributor

PHILO-HE commented Dec 14, 2023

Description

Listed here are the Spark functions that are not yet supported by the Gluten Velox backend. Please leave a comment if you'd like to pick some. In the list below, [√] means someone is already working on the corresponding function.
You can find the support status of all functions in this Gluten doc.

To avoid duplicate work, before starting, please check whether a PR has already been submitted to the Velox community, or whether the function has already been implemented in Velox, which holds most SQL functions in its sparksql and prestosql folders.

Reference:


  • percentile_approx/approx_percentile (WIP, guangxin)
  • concat_ws (PR ready, Add concat_ws Spark function facebookincubator/velox#8854)
  • unix_timestamp: "Only supports string type, with session timezone considered, todo: support date type"
  • locate
  • parse_url (PR drafted, not merged)
  • urldecoder: "UDF, supported by spark as a built-in function since 3.4.0."
  • normalizenanandzero
  • arrayintersects
  • default.json_split (udf, no need to impl.): "external UDF"
  • parsejsonarray: "external UDF"
  • struct
  • percentile (@Yohahaha)
  • first/first_value (@JkSelf)
  • last/last_value (@JkSelf)
  • posexplode (WIP, @marin-ma)
  • trunc (WIP, HannanKan)
  • months_between (PR ready)
  • date_trunc (WIP, HannanKan)
  • stack
  • grouping_id
  • printf (@Surbhi-Vijay)
  • space (WIP, rhh777)
  • inline (WIP, @marin-ma)
  • to_unix_timestamp: "Only supports string type, with session timezone considered. todo: support date type"
  • from_csv
  • from_json
  • json_object_keys
  • json_tuple
  • schema_of_csv
  • schema_of_json
  • to_csv
  • to_json (Suppose workable with folly function used)
  • make_ym_interval (WIP, @marin-ma)
  • make_timestamp (WIP, @marin-ma)
  • make_interval
  • make_dt_interval
  • monotonically_increasing_id
  • from_utc_timestamp (@acvictor)
  • extract
  • exists (@lyy-pineapple)
  • date_part
  • zip_with
  • transform (@Yohahaha)
  • transform_keys
  • transform_values
  • map_from_entries (WIP, MaYan)
  • map_filter (WIP, MaYan)
  • map_entries (Done, by MaYan)
  • map_concat
  • forall (@lyy-pineapple)
  • flatten (@ivoson)
  • filter
  • filter (array) (@ivoson)
  • width_bucket
  • array_sort (@boneanxs)
  • xpath
  • xpath_boolean
  • xpath_double
  • xpath_float
  • xpath_int
  • xpath_long
  • xpath_number
  • xpath_short
  • xpath_string
  • unbase64 (WIP, @fyp711)
  • decode (partially supported if translated to caseWhen. WIP Cody)
  • initcap (WIP, velox PR: 8676)
  • unix_date (velox PR 8725, completed)
  • count_min_sketch
  • bool_and/every (@mskapilks)
  • bool_or/any/some (@mskapilks)
  • shuffle (completed)
  • bround (@xumingming)
  • format_string (@gaoyangxiaozhu)
  • format_number (@gaoyangxiaozhu)
  • soundex (@zhli1142015)
  • levenshtein (@zhli1142015)
  • cot (@honeyhexin)
  • expm1 (@Donvi)
  • stack (generator function, @xumingming)
  • randn (@Donvi)
  • empty2null (internal function, @jinchengchenghh)
  • toprettystring (internal function, @jinchengchenghh)
  • AtLeastNNonNulls (internal function, @zhli1142015)
  • Since Spark-3.3 (related to ML, low priority)
  • regr_count
  • regr_avgx
  • regr_avgy
  • regr_r2
  • regr_sxx
  • regr_sxy
  • regr_syy
  • regr_slope
  • regr_intercept
  • Since Spark-3.3

  • Since Spark-3.4

@PHILO-HE PHILO-HE added the enhancement New feature or request label Dec 14, 2023
@PHILO-HE PHILO-HE pinned this issue Dec 14, 2023
@PHILO-HE PHILO-HE changed the title [VL] Spark function support list [please leave comment/mark if you plan to implement] [VL] Unsupported spark function list [please leave comment/mark if you plan to implement] Dec 15, 2023
@PHILO-HE PHILO-HE changed the title [VL] Unsupported spark function list [please leave comment/mark if you plan to implement] [VL] Unsupported spark function list [please leave a comment if you plan to pick some] Dec 15, 2023
@Yohahaha
Contributor

Yohahaha commented Dec 29, 2023

I'd like to support hex and unhex.

Update: hex and unhex are already supported in Gluten.

@zwangsheng
Contributor

Hi, I'd like to give the hour function a try.

@konjac
Contributor

konjac commented Jan 4, 2024

Hi, I'd like to have a look into map_keys

@fyp711
Contributor

fyp711 commented Jan 11, 2024

Hi I'd like to support find_in_set in velox

@HannanKan
Contributor

Hi, I'd like to support date_trunc/trunc.

@JkSelf
Contributor

JkSelf commented Jan 22, 2024

Hi, I'd like to support dense_rank.

@JkSelf
Contributor

JkSelf commented Jan 22, 2024

dense_rank is already supported in Velox: facebookincubator/velox#6289.

@zhztheplayer
Member

  • percentile_approx
  • approx_percentile: "Third argument accuracy is different with velox, velox is double but spark is long"

The two stand for the same function I assume? I'll take these two if nobody is working on it.
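
To make the accuracy mismatch quoted above concrete: Spark documents `accuracy` as a positive integer, with `1.0 / accuracy` being the relative error, while the note in the list says the Velox side takes a double. The sketch below shows one possible conversion. It assumes the Velox-side argument is a fractional error bound; the helper name is hypothetical and this is not the code Gluten uses.

```python
def spark_accuracy_to_error_bound(accuracy: int) -> float:
    """Convert a Spark-style integer accuracy into a fractional error bound.

    Hypothetical helper for illustration only. It assumes the target
    function expects a double error bound rather than an integer accuracy.
    """
    if accuracy <= 0:
        raise ValueError("accuracy must be a positive integer")
    # Spark documents 1.0 / accuracy as the relative error of the result.
    return 1.0 / accuracy


# Spark's documented default accuracy is 10000.
default_bound = spark_accuracy_to_error_bound(10000)
```

Whether a plain reciprocal is sufficient depends on how each engine interprets the bound, so the results would still need to be checked for consistency.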

@PHILO-HE
Contributor Author

  • percentile_approx
  • approx_percentile: "Third argument accuracy is different with velox, velox is double but spark is long"

The two stand for the same function I assume? I'll take these two if nobody is working on it.

Yes, they are one thing. Just unify them into one checkbox. Thanks!

@JkSelf
Contributor

JkSelf commented Jan 22, 2024

I will take a look at the ntile window function.

@zhouyuan
Contributor

unbase64:
#4482

@zjuwangg
Contributor

Is there any plan to support the from_json function?

@yma11
Contributor

yma11 commented Jan 29, 2024

I'd like to take map_entries and map_from_entries. There are already Presto implementations in Velox, but I will need to check their consistency with Spark.

@acvictor
Contributor

I'd like to give date_from_unix_date a shot

@PHILO-HE
Contributor Author

PHILO-HE commented Feb 21, 2024

Just removed the below functions from the list, since they have been supported. Thanks! @acvictor, @Yohahaha, @fyp711, @zwangsheng, @JkSelf, etc.

to_date hour mod pow ifnull add_months next_day dense_rank find_in_set hex ntile
date_from_unix_date array_repeat array_position array_except array_distinct weekday
year month day

@acvictor
Contributor

acvictor commented Feb 21, 2024

@PHILO-HE I see support for year, month, day, last_day in Velox too. I can also give from_utc_timestamp a go.

@Surbhi-Vijay
Contributor

nullif is supported out of the box. Spark sends it as a converted If expression, which is supported in Gluten.
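
The rewrite described in this comment can be sketched in a few lines. This is a plain-Python model of the semantics, with `None` standing in for SQL NULL; the function names are hypothetical and this is not Spark or Gluten code.

```python
def sql_equals(a, b):
    """SQL-style equality: the result is NULL (None) if either side is NULL."""
    if a is None or b is None:
        return None
    return a == b


def nullif_as_if(a, b):
    """Model nullif(a, b) as: if (a = b) then NULL else a."""
    # A NULL condition counts as not true, so the else branch returns a.
    return None if sql_equals(a, b) is True else a
```

Because the function reduces to an If over an equality check, a backend that already handles both needs no dedicated nullif implementation.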

@PHILO-HE
Contributor Author

nullif is supported out of the box. Spark sends it as a converted If expression, which is supported in Gluten.

Thanks so much for your feedback! Just removed it from the list.

@acvictor
Contributor

@PHILO-HE I see support for year, month, day, last_day in Velox too. I can also give from_utc_timestamp a go.

Will do minute as well.

@rui-mo
Contributor

rui-mo commented Feb 26, 2024

I'd like to work on locate and arrayintersect.

@mskapilks
Contributor

I would like to work on bool_and, bool_or

@zhztheplayer
Member

zhztheplayer commented Feb 29, 2024

  • collect_list (velox supported, needs Gluten to enable array for project plan node)
  • collect_set

@PHILO-HE Should we uncheck these two? I ran a test and both functions fall back (in 3.3).

@Surbhi-Vijay
Contributor

I would like to give printf a try.

@NEUpanning
Contributor

I'd like to take unix_date, thanks.

@PHILO-HE
Contributor Author

PHILO-HE commented May 17, 2024

I'd like to take unix_date, thanks.

@NEUpanning, we have supported it in both Gluten & Velox. Just changed its state in the list. Thanks!
#5287
facebookincubator/velox#8725

@NEUpanning
Contributor

@PHILO-HE Thanks for your feedback. So I'd like to take date_part. Is to_date supported in Gluten now? It doesn't show in the list. I would also like to pick it.

@PHILO-HE
Contributor Author

@PHILO-HE Thanks for your feedback. So I'd like to take date_part. Is to_date supported in Gluten now? It doesn't show in the list. I would also like to pick it.

@NEUpanning, this list only maintains working-in-progress functions. I think to_date has been supported. See https://github.com/apache/incubator-gluten/blob/main/docs/velox-backend-support-progress.md.

date_part may be supported also. I note the below test in Gluten. You can confirm whether all date patterns have been supported.

"SELECT date_part('yearofweek', dt), extract(yearofweek from dt)" +

@NEUpanning
Contributor

@PHILO-HE Thanks for your feedback. So I'd like to take date_part. Is to_date supported in Gluten now? It doesn't show in the list. I would also like to pick it.

@NEUpanning, this list only maintains working-in-progress functions. I think to_date has been supported. See https://github.com/apache/incubator-gluten/blob/main/docs/velox-backend-support-progress.md.

date_part may be supported also. I note the below test in Gluten. You can confirm whether all date patterns have been supported.

"SELECT date_part('yearofweek', dt), extract(yearofweek from dt)" +

I can't find any implementation of date_part and to_date function in Velox. Would you like to help me find it? Thanks.

@xumingming
Contributor

shuffle, array_sort are already supported, can be marked as complete.

@xumingming
Contributor

xumingming commented May 22, 2024

I will take a look at bround.

@PHILO-HE
Contributor Author

@PHILO-HE Thanks for your feedback. So I'd like to take date_part. Is to_date supported in Gluten now? It doesn't show in the list. I would also like to pick it.

@NEUpanning, this list only maintains working-in-progress functions. I think to_date has been supported. See https://github.com/apache/incubator-gluten/blob/main/docs/velox-backend-support-progress.md.
date_part may be supported also. I note the below test in Gluten. You can confirm whether all date patterns have been supported.

"SELECT date_part('yearofweek', dt), extract(yearofweek from dt)" +

I can't find any implementation of date_part and to_date function in Velox. Would you like to help me find it? Thanks.

@NEUpanning, not a direct replacement. date_part is covered here. to_date is converted to Cast + GetTimestamp by Spark.

@PHILO-HE
Contributor Author

PHILO-HE commented May 24, 2024

shuffle, array_sort are already supported, can be marked as complete.

@xumingming, seems sort_array is supported, but array_sort is not. Please spare some time to confirm. Thanks!
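
Since the two names are easy to confuse, here is a small plain-Python model of how they differ in Spark's documented behavior: `sort_array` places NULLs first when ascending and last when descending, while `array_sort` (without a comparator) sorts ascending with NULLs last. `None` stands in for NULL. This models only the null placement; it is not Spark, Gluten, or Velox code, and it ignores the optional comparator lambda that `array_sort` accepts.

```python
def sort_array(values, asc=True):
    """Model Spark sort_array: NULLs first if ascending, last if descending."""
    nulls = [v for v in values if v is None]
    rest = sorted((v for v in values if v is not None), reverse=not asc)
    return nulls + rest if asc else rest + nulls


def array_sort(values):
    """Model Spark array_sort without a comparator: ascending, NULLs last."""
    nulls = [v for v in values if v is None]
    rest = sorted(v for v in values if v is not None)
    return rest + nulls
```

The differing null placement is one reason support for one function does not imply support for the other.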

@Donvi
Contributor

Donvi commented May 27, 2024

As I see only rand exists and no randn, I'm taking randn

@xumingming
Contributor

shuffle, array_sort are already supported, can be marked as complete.

@xumingming, seems sort_array is supported, but array_sort is not. Please spare some time to confirm. Thanks!

@PHILO-HE array_sort is marked as supported in the doc:

| array_sort | array_sort | array_sort | S | | | | | | | | | | | | | | | | | | | |

And there is a test for collect_set which used array_sort

runQueryAndCompare("SELECT array_sort(collect_set(l_partkey)) FROM lineitem") {

@PHILO-HE
Contributor Author

And there is a test for collect_set which used array_sort

runQueryAndCompare("SELECT array_sort(collect_set(l_partkey)) FROM lineitem") {

@xumingming, this test only confirms aggregate is offloaded. In my local test, array_sort is not offloaded actually.

@boneanxs
Contributor

@PHILO-HE I can try to support array_sort if no one picked, we internally need this function :)

@Donvi
Contributor

Donvi commented Jun 6, 2024

unbase64: #4482

I see you've mapped from_base64 to unbase64. Similarly, base64 looks almost the same as to_base64, so is that mapping simply missing, or is there some other consideration?

@PHILO-HE
Contributor Author

PHILO-HE commented Jun 6, 2024

unbase64: #4482

I see you've mapped from_base64 to unbase64. Similarly, base64 looks almost the same as to_base64, so is that mapping simply missing, or is there some other consideration?

@Donvi, seems there are a few semantic differences between Spark's unbase64 & Velox's from_base64. So the simple mapping has not been accepted by the community. See discussion: #5242 (comment). I guess similarly to_base64 cannot be mapped due to some unknown differences.

@gaoyangxiaozhu
Contributor

gaoyangxiaozhu commented Jun 20, 2024

FYI, I am working on mask function support. @PHILO-HE

@zhli1142015
Contributor

I'd like to pick up mode, thanks

@jinchengchenghh
Contributor

Can you add empty2null to the list? @PHILO-HE

@PHILO-HE
Contributor Author

Can you add empty2null to the list? @PHILO-HE

Just added.

@jinchengchenghh
Contributor

Thanks!

@jinchengchenghh
Contributor

jinchengchenghh commented Jun 25, 2024

Can you add the function toprettystring to the list? Thanks! @PHILO-HE
This query will use it
I will take it.

select sum(hash(floor(l_extendedprice)) * l_discount + hash(l_orderkey) + hash(l_partkey) + hash(l_suppkey) + hash(l_linenumber) + hash(l_comment) + hash(l_shipinstruct)) as revenue from lineitem;

@zhli1142015
Contributor

I would like to take AtLeastNNonNulls, thanks.

@jinchengchenghh
Contributor

Some other unsupported functions are listed here:
https://github.com/apache/incubator-gluten/blob/main/cpp/velox/substrait/SubstraitToVeloxPlanValidator.cc#L62
And here are functions for which some data types or behaviors do not align with Spark:
https://github.com/apache/incubator-gluten/blob/main/cpp/velox/substrait/SubstraitToVeloxPlanValidator.cc#L188

@zml1206
Contributor

zml1206 commented Oct 18, 2024

Hi, I'd like to support date_trunc/trunc.

@HannanKan Are you still doing this? If you don't have time, I can take over, thank you.

@zjuwangg
Contributor

zjuwangg commented Nov 7, 2024

@PHILO-HE I can try to support array_sort if no one picked, we internally need this function :)

@boneanxs How is this going? If you don't have time, I'd like to look into it.

@boneanxs
Contributor

boneanxs commented Nov 7, 2024

@PHILO-HE I can try to support array_sort if no one picked, we internally need this function :)

@boneanxs How is this going? If you don't have time, I'd like to look into it.

@zjuwangg, you can see this PR: facebookincubator/velox#10138. It is still under review.
