[SPARK-33124][SQL] Fills missing group tags and re-categorizes all the group tags for built-in functions #30867

Closed
wants to merge 3 commits

Conversation

@maropu maropu commented Dec 21, 2020

What changes were proposed in this pull request?

This PR proposes to fill in missing group tags and re-categorize all the group tags for built-in functions.
The following new groups are added in this PR:

  • binary_funcs
  • bitwise_funcs
  • collection_funcs
  • predicate_funcs
  • conditional_funcs
  • conversion_funcs
  • csv_funcs
  • generator_funcs
  • hash_funcs
  • lambda_funcs
  • math_funcs
  • misc_funcs
  • string_funcs
  • struct_funcs
  • xml_funcs

The basic policy for re-categorizing functions is that functions defined in the same file are categorized into the same group. For example, all the functions in hash.scala are categorized into hash_funcs. There are, however, some exceptional or ambiguous cases; here are some special notes (a small annotation sketch follows these notes):

  • All the aggregate functions are categorized into agg_funcs.
  • array_funcs and map_funcs are sub-groups of collection_funcs. For example, array_contains is used only for arrays, so it is assigned to array_funcs. On the other hand, reverse is used for both arrays and strings, so it is assigned to collection_funcs.
  • Some functions logically belong to multiple groups. In this case, these functions are categorized based on the file that they belong to. For example, schema_of_csv can be grouped into both csv_funcs and struct_funcs in terms of input types, but it is assigned to csv_funcs because it belongs to the csvExpressions.scala file that holds the other CSV-related functions.
  • Functions in nullExpressions.scala, complexTypeCreator.scala, randomExpressions.scala, and regexExpressions.scala are categorized based on their functionalities. For example:
    • isnull in nullExpressions.scala is assigned to predicate_funcs because it is a predicate function.
    • array in complexTypeCreator.scala is assigned to array_funcs based on its output type (the other functions in array_funcs are categorized based on their input types, though).

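For reference, the sketch below shows where a group tag is declared: the group field sits in the @ExpressionDescription annotation, next to since. The expression here (ArrayIsEmpty) is hypothetical and only illustrates the annotation shape; it is not a class from the Spark source, and newer Spark versions may require additional overrides (e.g. withNewChildInternal).

import org.apache.spark.sql.catalyst.expressions.{Expression, ExpressionDescription, UnaryExpression}
import org.apache.spark.sql.catalyst.expressions.codegen.{CodegenContext, ExprCode}
import org.apache.spark.sql.catalyst.util.ArrayData
import org.apache.spark.sql.types.{BooleanType, DataType}

// Hypothetical expression, shown only to illustrate where the `group` tag goes.
@ExpressionDescription(
  usage = "_FUNC_(array) - Returns true if `array` is empty.",
  examples = """
    Examples:
      > SELECT _FUNC_(array());
       true
  """,
  since = "3.2.0",
  group = "array_funcs")
case class ArrayIsEmpty(child: Expression) extends UnaryExpression {
  override def dataType: DataType = BooleanType
  // Interpreted path: for array inputs the child value is an ArrayData.
  override protected def nullSafeEval(input: Any): Any =
    input.asInstanceOf[ArrayData].numElements() == 0
  // Codegen path mirroring the interpreted evaluation.
  override protected def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode =
    defineCodeGen(ctx, ev, c => s"($c.numElements() == 0)")
}
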
A category list (after this PR) is as follows; each row shows the group, the function name, and the expression class. The list below also includes the exprs that already have a group tag in the current master:

group name class
agg_funcs any org.apache.spark.sql.catalyst.expressions.aggregate.BoolOr
agg_funcs approx_count_distinct org.apache.spark.sql.catalyst.expressions.aggregate.HyperLogLogPlusPlus
agg_funcs approx_percentile org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile
agg_funcs avg org.apache.spark.sql.catalyst.expressions.aggregate.Average
agg_funcs bit_and org.apache.spark.sql.catalyst.expressions.aggregate.BitAndAgg
agg_funcs bit_or org.apache.spark.sql.catalyst.expressions.aggregate.BitOrAgg
agg_funcs bit_xor org.apache.spark.sql.catalyst.expressions.aggregate.BitXorAgg
agg_funcs bool_and org.apache.spark.sql.catalyst.expressions.aggregate.BoolAnd
agg_funcs bool_or org.apache.spark.sql.catalyst.expressions.aggregate.BoolOr
agg_funcs collect_list org.apache.spark.sql.catalyst.expressions.aggregate.CollectList
agg_funcs collect_set org.apache.spark.sql.catalyst.expressions.aggregate.CollectSet
agg_funcs corr org.apache.spark.sql.catalyst.expressions.aggregate.Corr
agg_funcs count_if org.apache.spark.sql.catalyst.expressions.aggregate.CountIf
agg_funcs count_min_sketch org.apache.spark.sql.catalyst.expressions.aggregate.CountMinSketchAgg
agg_funcs count org.apache.spark.sql.catalyst.expressions.aggregate.Count
agg_funcs covar_pop org.apache.spark.sql.catalyst.expressions.aggregate.CovPopulation
agg_funcs covar_samp org.apache.spark.sql.catalyst.expressions.aggregate.CovSample
agg_funcs cube org.apache.spark.sql.catalyst.expressions.Cube
agg_funcs every org.apache.spark.sql.catalyst.expressions.aggregate.BoolAnd
agg_funcs first_value org.apache.spark.sql.catalyst.expressions.aggregate.First
agg_funcs first org.apache.spark.sql.catalyst.expressions.aggregate.First
agg_funcs grouping_id org.apache.spark.sql.catalyst.expressions.GroupingID
agg_funcs grouping org.apache.spark.sql.catalyst.expressions.Grouping
agg_funcs kurtosis org.apache.spark.sql.catalyst.expressions.aggregate.Kurtosis
agg_funcs last_value org.apache.spark.sql.catalyst.expressions.aggregate.Last
agg_funcs last org.apache.spark.sql.catalyst.expressions.aggregate.Last
agg_funcs max_by org.apache.spark.sql.catalyst.expressions.aggregate.MaxBy
agg_funcs max org.apache.spark.sql.catalyst.expressions.aggregate.Max
agg_funcs mean org.apache.spark.sql.catalyst.expressions.aggregate.Average
agg_funcs min_by org.apache.spark.sql.catalyst.expressions.aggregate.MinBy
agg_funcs min org.apache.spark.sql.catalyst.expressions.aggregate.Min
agg_funcs percentile_approx org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile
agg_funcs percentile org.apache.spark.sql.catalyst.expressions.aggregate.Percentile
agg_funcs rollup org.apache.spark.sql.catalyst.expressions.Rollup
agg_funcs skewness org.apache.spark.sql.catalyst.expressions.aggregate.Skewness
agg_funcs some org.apache.spark.sql.catalyst.expressions.aggregate.BoolOr
agg_funcs stddev_pop org.apache.spark.sql.catalyst.expressions.aggregate.StddevPop
agg_funcs stddev_samp org.apache.spark.sql.catalyst.expressions.aggregate.StddevSamp
agg_funcs stddev org.apache.spark.sql.catalyst.expressions.aggregate.StddevSamp
agg_funcs std org.apache.spark.sql.catalyst.expressions.aggregate.StddevSamp
agg_funcs sum org.apache.spark.sql.catalyst.expressions.aggregate.Sum
agg_funcs var_pop org.apache.spark.sql.catalyst.expressions.aggregate.VariancePop
agg_funcs var_samp org.apache.spark.sql.catalyst.expressions.aggregate.VarianceSamp
agg_funcs variance org.apache.spark.sql.catalyst.expressions.aggregate.VarianceSamp
array_funcs array_contains org.apache.spark.sql.catalyst.expressions.ArrayContains
array_funcs array_distinct org.apache.spark.sql.catalyst.expressions.ArrayDistinct
array_funcs array_except org.apache.spark.sql.catalyst.expressions.ArrayExcept
array_funcs array_intersect org.apache.spark.sql.catalyst.expressions.ArrayIntersect
array_funcs array_join org.apache.spark.sql.catalyst.expressions.ArrayJoin
array_funcs array_max org.apache.spark.sql.catalyst.expressions.ArrayMax
array_funcs array_min org.apache.spark.sql.catalyst.expressions.ArrayMin
array_funcs array_position org.apache.spark.sql.catalyst.expressions.ArrayPosition
array_funcs array_remove org.apache.spark.sql.catalyst.expressions.ArrayRemove
array_funcs array_repeat org.apache.spark.sql.catalyst.expressions.ArrayRepeat
array_funcs array_union org.apache.spark.sql.catalyst.expressions.ArrayUnion
array_funcs arrays_overlap org.apache.spark.sql.catalyst.expressions.ArraysOverlap
array_funcs arrays_zip org.apache.spark.sql.catalyst.expressions.ArraysZip
array_funcs array org.apache.spark.sql.catalyst.expressions.CreateArray
array_funcs flatten org.apache.spark.sql.catalyst.expressions.Flatten
array_funcs sequence org.apache.spark.sql.catalyst.expressions.Sequence
array_funcs shuffle org.apache.spark.sql.catalyst.expressions.Shuffle
array_funcs slice org.apache.spark.sql.catalyst.expressions.Slice
array_funcs sort_array org.apache.spark.sql.catalyst.expressions.SortArray
bitwise_funcs & org.apache.spark.sql.catalyst.expressions.BitwiseAnd
bitwise_funcs ^ org.apache.spark.sql.catalyst.expressions.BitwiseXor
bitwise_funcs bit_count org.apache.spark.sql.catalyst.expressions.BitwiseCount
bitwise_funcs shiftrightunsigned org.apache.spark.sql.catalyst.expressions.ShiftRightUnsigned
bitwise_funcs shiftright org.apache.spark.sql.catalyst.expressions.ShiftRight
bitwise_funcs ~ org.apache.spark.sql.catalyst.expressions.BitwiseNot
collection_funcs cardinality org.apache.spark.sql.catalyst.expressions.Size
collection_funcs concat org.apache.spark.sql.catalyst.expressions.Concat
collection_funcs reverse org.apache.spark.sql.catalyst.expressions.Reverse
collection_funcs size org.apache.spark.sql.catalyst.expressions.Size
conditional_funcs coalesce org.apache.spark.sql.catalyst.expressions.Coalesce
conditional_funcs ifnull org.apache.spark.sql.catalyst.expressions.IfNull
conditional_funcs if org.apache.spark.sql.catalyst.expressions.If
conditional_funcs nanvl org.apache.spark.sql.catalyst.expressions.NaNvl
conditional_funcs nullif org.apache.spark.sql.catalyst.expressions.NullIf
conditional_funcs nvl2 org.apache.spark.sql.catalyst.expressions.Nvl2
conditional_funcs nvl org.apache.spark.sql.catalyst.expressions.Nvl
conditional_funcs when org.apache.spark.sql.catalyst.expressions.CaseWhen
conversion_funcs bigint org.apache.spark.sql.catalyst.expressions.Cast
conversion_funcs binary org.apache.spark.sql.catalyst.expressions.Cast
conversion_funcs boolean org.apache.spark.sql.catalyst.expressions.Cast
conversion_funcs cast org.apache.spark.sql.catalyst.expressions.Cast
conversion_funcs date org.apache.spark.sql.catalyst.expressions.Cast
conversion_funcs decimal org.apache.spark.sql.catalyst.expressions.Cast
conversion_funcs double org.apache.spark.sql.catalyst.expressions.Cast
conversion_funcs float org.apache.spark.sql.catalyst.expressions.Cast
conversion_funcs int org.apache.spark.sql.catalyst.expressions.Cast
conversion_funcs smallint org.apache.spark.sql.catalyst.expressions.Cast
conversion_funcs string org.apache.spark.sql.catalyst.expressions.Cast
conversion_funcs timestamp org.apache.spark.sql.catalyst.expressions.Cast
conversion_funcs tinyint org.apache.spark.sql.catalyst.expressions.Cast
csv_funcs from_csv org.apache.spark.sql.catalyst.expressions.CsvToStructs
csv_funcs schema_of_csv org.apache.spark.sql.catalyst.expressions.SchemaOfCsv
csv_funcs to_csv org.apache.spark.sql.catalyst.expressions.StructsToCsv
datetime_funcs add_months org.apache.spark.sql.catalyst.expressions.AddMonths
datetime_funcs current_date org.apache.spark.sql.catalyst.expressions.CurrentDate
datetime_funcs current_timestamp org.apache.spark.sql.catalyst.expressions.CurrentTimestamp
datetime_funcs current_timezone org.apache.spark.sql.catalyst.expressions.CurrentTimeZone
datetime_funcs date_add org.apache.spark.sql.catalyst.expressions.DateAdd
datetime_funcs date_format org.apache.spark.sql.catalyst.expressions.DateFormatClass
datetime_funcs date_from_unix_date org.apache.spark.sql.catalyst.expressions.DateFromUnixDate
datetime_funcs date_part org.apache.spark.sql.catalyst.expressions.DatePart
datetime_funcs date_sub org.apache.spark.sql.catalyst.expressions.DateSub
datetime_funcs date_trunc org.apache.spark.sql.catalyst.expressions.TruncTimestamp
datetime_funcs datediff org.apache.spark.sql.catalyst.expressions.DateDiff
datetime_funcs dayofmonth org.apache.spark.sql.catalyst.expressions.DayOfMonth
datetime_funcs dayofweek org.apache.spark.sql.catalyst.expressions.DayOfWeek
datetime_funcs dayofyear org.apache.spark.sql.catalyst.expressions.DayOfYear
datetime_funcs day org.apache.spark.sql.catalyst.expressions.DayOfMonth
datetime_funcs extract org.apache.spark.sql.catalyst.expressions.Extract
datetime_funcs from_unixtime org.apache.spark.sql.catalyst.expressions.FromUnixTime
datetime_funcs from_utc_timestamp org.apache.spark.sql.catalyst.expressions.FromUTCTimestamp
datetime_funcs hour org.apache.spark.sql.catalyst.expressions.Hour
datetime_funcs last_day org.apache.spark.sql.catalyst.expressions.LastDay
datetime_funcs make_date org.apache.spark.sql.catalyst.expressions.MakeDate
datetime_funcs make_interval org.apache.spark.sql.catalyst.expressions.MakeInterval
datetime_funcs make_timestamp org.apache.spark.sql.catalyst.expressions.MakeTimestamp
datetime_funcs minute org.apache.spark.sql.catalyst.expressions.Minute
datetime_funcs months_between org.apache.spark.sql.catalyst.expressions.MonthsBetween
datetime_funcs month org.apache.spark.sql.catalyst.expressions.Month
datetime_funcs next_day org.apache.spark.sql.catalyst.expressions.NextDay
datetime_funcs now org.apache.spark.sql.catalyst.expressions.Now
datetime_funcs quarter org.apache.spark.sql.catalyst.expressions.Quarter
datetime_funcs second org.apache.spark.sql.catalyst.expressions.Second
datetime_funcs timestamp_micros org.apache.spark.sql.catalyst.expressions.MicrosToTimestamp
datetime_funcs timestamp_millis org.apache.spark.sql.catalyst.expressions.MillisToTimestamp
datetime_funcs timestamp_seconds org.apache.spark.sql.catalyst.expressions.SecondsToTimestamp
datetime_funcs to_date org.apache.spark.sql.catalyst.expressions.ParseToDate
datetime_funcs to_timestamp org.apache.spark.sql.catalyst.expressions.ParseToTimestamp
datetime_funcs to_unix_timestamp org.apache.spark.sql.catalyst.expressions.ToUnixTimestamp
datetime_funcs to_utc_timestamp org.apache.spark.sql.catalyst.expressions.ToUTCTimestamp
datetime_funcs trunc org.apache.spark.sql.catalyst.expressions.TruncDate
datetime_funcs unix_date org.apache.spark.sql.catalyst.expressions.UnixDate
datetime_funcs unix_micros org.apache.spark.sql.catalyst.expressions.UnixMicros
datetime_funcs unix_millis org.apache.spark.sql.catalyst.expressions.UnixMillis
datetime_funcs unix_seconds org.apache.spark.sql.catalyst.expressions.UnixSeconds
datetime_funcs unix_timestamp org.apache.spark.sql.catalyst.expressions.UnixTimestamp
datetime_funcs weekday org.apache.spark.sql.catalyst.expressions.WeekDay
datetime_funcs weekofyear org.apache.spark.sql.catalyst.expressions.WeekOfYear
datetime_funcs year org.apache.spark.sql.catalyst.expressions.Year
generator_funcs explode_outer org.apache.spark.sql.catalyst.expressions.Explode
generator_funcs explode org.apache.spark.sql.catalyst.expressions.Explode
generator_funcs inline_outer org.apache.spark.sql.catalyst.expressions.Inline
generator_funcs inline org.apache.spark.sql.catalyst.expressions.Inline
generator_funcs posexplode_outer org.apache.spark.sql.catalyst.expressions.PosExplode
generator_funcs posexplode org.apache.spark.sql.catalyst.expressions.PosExplode
generator_funcs stack org.apache.spark.sql.catalyst.expressions.Stack
hash_funcs crc32 org.apache.spark.sql.catalyst.expressions.Crc32
hash_funcs hash org.apache.spark.sql.catalyst.expressions.Murmur3Hash
hash_funcs md5 org.apache.spark.sql.catalyst.expressions.Md5
hash_funcs sha1 org.apache.spark.sql.catalyst.expressions.Sha1
hash_funcs sha2 org.apache.spark.sql.catalyst.expressions.Sha2
hash_funcs sha org.apache.spark.sql.catalyst.expressions.Sha1
hash_funcs xxhash64 org.apache.spark.sql.catalyst.expressions.XxHash64
json_funcs from_json org.apache.spark.sql.catalyst.expressions.JsonToStructs
json_funcs get_json_object org.apache.spark.sql.catalyst.expressions.GetJsonObject
json_funcs json_array_length org.apache.spark.sql.catalyst.expressions.LengthOfJsonArray
json_funcs json_object_keys org.apache.spark.sql.catalyst.expressions.JsonObjectKeys
json_funcs json_tuple org.apache.spark.sql.catalyst.expressions.JsonTuple
json_funcs schema_of_json org.apache.spark.sql.catalyst.expressions.SchemaOfJson
json_funcs to_json org.apache.spark.sql.catalyst.expressions.StructsToJson
lambda_funcs aggregate org.apache.spark.sql.catalyst.expressions.ArrayAggregate
lambda_funcs array_sort org.apache.spark.sql.catalyst.expressions.ArraySort
lambda_funcs exists org.apache.spark.sql.catalyst.expressions.ArrayExists
lambda_funcs filter org.apache.spark.sql.catalyst.expressions.ArrayFilter
lambda_funcs forall org.apache.spark.sql.catalyst.expressions.ArrayForAll
lambda_funcs map_filter org.apache.spark.sql.catalyst.expressions.MapFilter
lambda_funcs map_zip_with org.apache.spark.sql.catalyst.expressions.MapZipWith
lambda_funcs transform_keys org.apache.spark.sql.catalyst.expressions.TransformKeys
lambda_funcs transform_values org.apache.spark.sql.catalyst.expressions.TransformValues
lambda_funcs transform org.apache.spark.sql.catalyst.expressions.ArrayTransform
lambda_funcs zip_with org.apache.spark.sql.catalyst.expressions.ZipWith
map_funcs element_at org.apache.spark.sql.catalyst.expressions.ElementAt
map_funcs map_concat org.apache.spark.sql.catalyst.expressions.MapConcat
map_funcs map_entries org.apache.spark.sql.catalyst.expressions.MapEntries
map_funcs map_from_arrays org.apache.spark.sql.catalyst.expressions.MapFromArrays
map_funcs map_from_entries org.apache.spark.sql.catalyst.expressions.MapFromEntries
map_funcs map_keys org.apache.spark.sql.catalyst.expressions.MapKeys
map_funcs map_values org.apache.spark.sql.catalyst.expressions.MapValues
map_funcs map org.apache.spark.sql.catalyst.expressions.CreateMap
map_funcs str_to_map org.apache.spark.sql.catalyst.expressions.StringToMap
math_funcs % org.apache.spark.sql.catalyst.expressions.Remainder
math_funcs * org.apache.spark.sql.catalyst.expressions.Multiply
math_funcs + org.apache.spark.sql.catalyst.expressions.Add
math_funcs - org.apache.spark.sql.catalyst.expressions.Subtract
math_funcs / org.apache.spark.sql.catalyst.expressions.Divide
math_funcs abs org.apache.spark.sql.catalyst.expressions.Abs
math_funcs acosh org.apache.spark.sql.catalyst.expressions.Acosh
math_funcs acos org.apache.spark.sql.catalyst.expressions.Acos
math_funcs asinh org.apache.spark.sql.catalyst.expressions.Asinh
math_funcs asin org.apache.spark.sql.catalyst.expressions.Asin
math_funcs atan2 org.apache.spark.sql.catalyst.expressions.Atan2
math_funcs atanh org.apache.spark.sql.catalyst.expressions.Atanh
math_funcs atan org.apache.spark.sql.catalyst.expressions.Atan
math_funcs bin org.apache.spark.sql.catalyst.expressions.Bin
math_funcs bround org.apache.spark.sql.catalyst.expressions.BRound
math_funcs cbrt org.apache.spark.sql.catalyst.expressions.Cbrt
math_funcs ceiling org.apache.spark.sql.catalyst.expressions.Ceil
math_funcs ceil org.apache.spark.sql.catalyst.expressions.Ceil
math_funcs conv org.apache.spark.sql.catalyst.expressions.Conv
math_funcs cosh org.apache.spark.sql.catalyst.expressions.Cosh
math_funcs cos org.apache.spark.sql.catalyst.expressions.Cos
math_funcs cot org.apache.spark.sql.catalyst.expressions.Cot
math_funcs degrees org.apache.spark.sql.catalyst.expressions.ToDegrees
math_funcs div org.apache.spark.sql.catalyst.expressions.IntegralDivide
math_funcs expm1 org.apache.spark.sql.catalyst.expressions.Expm1
math_funcs exp org.apache.spark.sql.catalyst.expressions.Exp
math_funcs e org.apache.spark.sql.catalyst.expressions.EulerNumber
math_funcs factorial org.apache.spark.sql.catalyst.expressions.Factorial
math_funcs floor org.apache.spark.sql.catalyst.expressions.Floor
math_funcs greatest org.apache.spark.sql.catalyst.expressions.Greatest
math_funcs hex org.apache.spark.sql.catalyst.expressions.Hex
math_funcs hypot org.apache.spark.sql.catalyst.expressions.Hypot
math_funcs least org.apache.spark.sql.catalyst.expressions.Least
math_funcs ln org.apache.spark.sql.catalyst.expressions.Log
math_funcs log10 org.apache.spark.sql.catalyst.expressions.Log10
math_funcs log1p org.apache.spark.sql.catalyst.expressions.Log1p
math_funcs log2 org.apache.spark.sql.catalyst.expressions.Log2
math_funcs log org.apache.spark.sql.catalyst.expressions.Logarithm
math_funcs mod org.apache.spark.sql.catalyst.expressions.Remainder
math_funcs negative org.apache.spark.sql.catalyst.expressions.UnaryMinus
math_funcs pi org.apache.spark.sql.catalyst.expressions.Pi
math_funcs pmod org.apache.spark.sql.catalyst.expressions.Pmod
math_funcs positive org.apache.spark.sql.catalyst.expressions.UnaryPositive
math_funcs power org.apache.spark.sql.catalyst.expressions.Pow
math_funcs pow org.apache.spark.sql.catalyst.expressions.Pow
math_funcs radians org.apache.spark.sql.catalyst.expressions.ToRadians
math_funcs randn org.apache.spark.sql.catalyst.expressions.Randn
math_funcs random org.apache.spark.sql.catalyst.expressions.Rand
math_funcs rand org.apache.spark.sql.catalyst.expressions.Rand
math_funcs rint org.apache.spark.sql.catalyst.expressions.Rint
math_funcs round org.apache.spark.sql.catalyst.expressions.Round
math_funcs shiftleft org.apache.spark.sql.catalyst.expressions.ShiftLeft
math_funcs signum org.apache.spark.sql.catalyst.expressions.Signum
math_funcs sign org.apache.spark.sql.catalyst.expressions.Signum
math_funcs sinh org.apache.spark.sql.catalyst.expressions.Sinh
math_funcs sin org.apache.spark.sql.catalyst.expressions.Sin
math_funcs sqrt org.apache.spark.sql.catalyst.expressions.Sqrt
math_funcs tanh org.apache.spark.sql.catalyst.expressions.Tanh
math_funcs tan org.apache.spark.sql.catalyst.expressions.Tan
math_funcs unhex org.apache.spark.sql.catalyst.expressions.Unhex
math_funcs width_bucket org.apache.spark.sql.catalyst.expressions.WidthBucket
misc_funcs assert_true org.apache.spark.sql.catalyst.expressions.AssertTrue
misc_funcs current_catalog org.apache.spark.sql.catalyst.expressions.CurrentCatalog
misc_funcs current_database org.apache.spark.sql.catalyst.expressions.CurrentDatabase
misc_funcs input_file_block_length org.apache.spark.sql.catalyst.expressions.InputFileBlockLength
misc_funcs input_file_block_start org.apache.spark.sql.catalyst.expressions.InputFileBlockStart
misc_funcs input_file_name org.apache.spark.sql.catalyst.expressions.InputFileName
misc_funcs java_method org.apache.spark.sql.catalyst.expressions.CallMethodViaReflection
misc_funcs monotonically_increasing_id org.apache.spark.sql.catalyst.expressions.MonotonicallyIncreasingID
misc_funcs raise_error org.apache.spark.sql.catalyst.expressions.RaiseError
misc_funcs reflect org.apache.spark.sql.catalyst.expressions.CallMethodViaReflection
misc_funcs spark_partition_id org.apache.spark.sql.catalyst.expressions.SparkPartitionID
misc_funcs typeof org.apache.spark.sql.catalyst.expressions.TypeOf
misc_funcs uuid org.apache.spark.sql.catalyst.expressions.Uuid
misc_funcs version org.apache.spark.sql.catalyst.expressions.SparkVersion
predicate_funcs ! org.apache.spark.sql.catalyst.expressions.Not
predicate_funcs <=> org.apache.spark.sql.catalyst.expressions.EqualNullSafe
predicate_funcs <= org.apache.spark.sql.catalyst.expressions.LessThanOrEqual
predicate_funcs < org.apache.spark.sql.catalyst.expressions.LessThan
predicate_funcs == org.apache.spark.sql.catalyst.expressions.EqualTo
predicate_funcs = org.apache.spark.sql.catalyst.expressions.EqualTo
predicate_funcs >= org.apache.spark.sql.catalyst.expressions.GreaterThanOrEqual
predicate_funcs > org.apache.spark.sql.catalyst.expressions.GreaterThan
predicate_funcs and org.apache.spark.sql.catalyst.expressions.And
predicate_funcs in org.apache.spark.sql.catalyst.expressions.In
predicate_funcs isnan org.apache.spark.sql.catalyst.expressions.IsNaN
predicate_funcs isnotnull org.apache.spark.sql.catalyst.expressions.IsNotNull
predicate_funcs isnull org.apache.spark.sql.catalyst.expressions.IsNull
predicate_funcs like org.apache.spark.sql.catalyst.expressions.Like
predicate_funcs not org.apache.spark.sql.catalyst.expressions.Not
predicate_funcs or org.apache.spark.sql.catalyst.expressions.Or
predicate_funcs regexp_like org.apache.spark.sql.catalyst.expressions.RLike
predicate_funcs rlike org.apache.spark.sql.catalyst.expressions.RLike
string_funcs ascii org.apache.spark.sql.catalyst.expressions.Ascii
string_funcs base64 org.apache.spark.sql.catalyst.expressions.Base64
string_funcs bit_length org.apache.spark.sql.catalyst.expressions.BitLength
string_funcs char_length org.apache.spark.sql.catalyst.expressions.Length
string_funcs character_length org.apache.spark.sql.catalyst.expressions.Length
string_funcs char org.apache.spark.sql.catalyst.expressions.Chr
string_funcs chr org.apache.spark.sql.catalyst.expressions.Chr
string_funcs concat_ws org.apache.spark.sql.catalyst.expressions.ConcatWs
string_funcs decode org.apache.spark.sql.catalyst.expressions.Decode
string_funcs elt org.apache.spark.sql.catalyst.expressions.Elt
string_funcs encode org.apache.spark.sql.catalyst.expressions.Encode
string_funcs find_in_set org.apache.spark.sql.catalyst.expressions.FindInSet
string_funcs format_number org.apache.spark.sql.catalyst.expressions.FormatNumber
string_funcs format_string org.apache.spark.sql.catalyst.expressions.FormatString
string_funcs initcap org.apache.spark.sql.catalyst.expressions.InitCap
string_funcs instr org.apache.spark.sql.catalyst.expressions.StringInstr
string_funcs lcase org.apache.spark.sql.catalyst.expressions.Lower
string_funcs left org.apache.spark.sql.catalyst.expressions.Left
string_funcs length org.apache.spark.sql.catalyst.expressions.Length
string_funcs levenshtein org.apache.spark.sql.catalyst.expressions.Levenshtein
string_funcs locate org.apache.spark.sql.catalyst.expressions.StringLocate
string_funcs lower org.apache.spark.sql.catalyst.expressions.Lower
string_funcs lpad org.apache.spark.sql.catalyst.expressions.StringLPad
string_funcs ltrim org.apache.spark.sql.catalyst.expressions.StringTrimLeft
string_funcs octet_length org.apache.spark.sql.catalyst.expressions.OctetLength
string_funcs overlay org.apache.spark.sql.catalyst.expressions.Overlay
string_funcs parse_url org.apache.spark.sql.catalyst.expressions.ParseUrl
string_funcs position org.apache.spark.sql.catalyst.expressions.StringLocate
string_funcs printf org.apache.spark.sql.catalyst.expressions.FormatString
string_funcs regexp_extract_all org.apache.spark.sql.catalyst.expressions.RegExpExtractAll
string_funcs regexp_extract org.apache.spark.sql.catalyst.expressions.RegExpExtract
string_funcs regexp_replace org.apache.spark.sql.catalyst.expressions.RegExpReplace
string_funcs repeat org.apache.spark.sql.catalyst.expressions.StringRepeat
string_funcs replace org.apache.spark.sql.catalyst.expressions.StringReplace
string_funcs right org.apache.spark.sql.catalyst.expressions.Right
string_funcs rpad org.apache.spark.sql.catalyst.expressions.StringRPad
string_funcs rtrim org.apache.spark.sql.catalyst.expressions.StringTrimRight
string_funcs sentences org.apache.spark.sql.catalyst.expressions.Sentences
string_funcs soundex org.apache.spark.sql.catalyst.expressions.SoundEx
string_funcs space org.apache.spark.sql.catalyst.expressions.StringSpace
string_funcs split org.apache.spark.sql.catalyst.expressions.StringSplit
string_funcs substring_index org.apache.spark.sql.catalyst.expressions.SubstringIndex
string_funcs substring org.apache.spark.sql.catalyst.expressions.Substring
string_funcs substr org.apache.spark.sql.catalyst.expressions.Substring
string_funcs translate org.apache.spark.sql.catalyst.expressions.StringTranslate
string_funcs trim org.apache.spark.sql.catalyst.expressions.StringTrim
string_funcs ucase org.apache.spark.sql.catalyst.expressions.Upper
string_funcs unbase64 org.apache.spark.sql.catalyst.expressions.UnBase64
string_funcs upper org.apache.spark.sql.catalyst.expressions.Upper
struct_funcs named_struct org.apache.spark.sql.catalyst.expressions.CreateNamedStruct
struct_funcs struct org.apache.spark.sql.catalyst.expressions.CreateNamedStruct
window_funcs cume_dist org.apache.spark.sql.catalyst.expressions.CumeDist
window_funcs dense_rank org.apache.spark.sql.catalyst.expressions.DenseRank
window_funcs lag org.apache.spark.sql.catalyst.expressions.Lag
window_funcs lead org.apache.spark.sql.catalyst.expressions.Lead
window_funcs nth_value org.apache.spark.sql.catalyst.expressions.NthValue
window_funcs ntile org.apache.spark.sql.catalyst.expressions.NTile
window_funcs percent_rank org.apache.spark.sql.catalyst.expressions.PercentRank
window_funcs rank org.apache.spark.sql.catalyst.expressions.Rank
window_funcs row_number org.apache.spark.sql.catalyst.expressions.RowNumber
xml_funcs xpath_boolean org.apache.spark.sql.catalyst.expressions.xml.XPathBoolean
xml_funcs xpath_double org.apache.spark.sql.catalyst.expressions.xml.XPathDouble
xml_funcs xpath_float org.apache.spark.sql.catalyst.expressions.xml.XPathFloat
xml_funcs xpath_int org.apache.spark.sql.catalyst.expressions.xml.XPathInt
xml_funcs xpath_long org.apache.spark.sql.catalyst.expressions.xml.XPathLong
xml_funcs xpath_number org.apache.spark.sql.catalyst.expressions.xml.XPathDouble
xml_funcs xpath_short org.apache.spark.sql.catalyst.expressions.xml.XPathShort
xml_funcs xpath_string org.apache.spark.sql.catalyst.expressions.xml.XPathString
xml_funcs xpath org.apache.spark.sql.catalyst.expressions.xml.XPathList

Closes #30040

NOTE: The original author of this PR is @tanelk, so the credit should go to @tanelk.

Why are the changes needed?

For better documentation.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added a test to ExpressionInfoSuite that checks whether exprs have a group tag. A rough sketch of such a check is shown below.
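
The following is a minimal sketch of that kind of check, not the exact test added in this PR. It assumes a SharedSparkSession-style `spark` in scope, and the method names (listFunction, lookupFunctionInfo, getGroup) are the catalyst/SQL internals as recalled, which may differ slightly across Spark versions; the valid-group set is taken from the category list above.

// Rough sketch only: walk every registered built-in function and assert that
// its ExpressionInfo carries one of the expected group tags.
test("built-in expressions should have a valid group tag") {
  val validGroups = Set(
    "agg_funcs", "array_funcs", "binary_funcs", "bitwise_funcs", "collection_funcs",
    "conditional_funcs", "conversion_funcs", "csv_funcs", "datetime_funcs",
    "generator_funcs", "hash_funcs", "json_funcs", "lambda_funcs", "map_funcs",
    "math_funcs", "misc_funcs", "predicate_funcs", "string_funcs", "struct_funcs",
    "window_funcs", "xml_funcs")

  spark.sessionState.functionRegistry.listFunction().foreach { funcId =>
    val info = spark.sessionState.catalog.lookupFunctionInfo(funcId)
    assert(validGroups.contains(info.getGroup),
      s"${info.getClassName} (${info.getName}) has an unexpected group: '${info.getGroup}'")
  }
}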

@github-actions github-actions bot added the SQL label Dec 21, 2020

maropu commented Dec 21, 2020

cc: @HyukjinKwon @tanelk

maropu commented Dec 21, 2020

This PR focuses on categorizing the existing built-in funcs. In a follow-up PR, I'll fix the doc-related issue.

SparkQA commented Dec 21, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37716/

@dongjoon-hyun dongjoon-hyun left a comment

Is this for Apache Spark 3.2.0, @maropu ?

maropu commented Dec 21, 2020

Yea, I think so, @dongjoon-hyun. cc: @HyukjinKwon

SparkQA commented Dec 21, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37716/

@@ -205,7 +207,8 @@ case class SchemaOfCsv(
       > SELECT _FUNC_(named_struct('time', to_timestamp('2015-08-26', 'yyyy-MM-dd')), map('timestampFormat', 'dd/MM/yyyy'));
        26/08/2015
   """,
-  since = "3.0.0")
+  since = "3.0.0",
+  group = "csv_funcs")

In this case, could this be struct_funcs, since the input type is StructType, too?

@dongjoon-hyun dongjoon-hyun left a comment

Thank you, @maropu. Mostly looks good. It seems that we need a few clarifications.

  1. Is the category based on the output type or the input type? Some functions are at the intersection of both types, e.g., StructsToCsv or CreateArray.
  2. What is the definition of array_funcs and collection_funcs?
  3. Some function categories are changed from Apache Spark 3.1.0, so this is not just "Adds a group tag"; it is a re-categorization.

It would be enough to clarify the design at the beginning of the PR description.

@maropu maropu changed the title [SPARK-33124][SQL] Adds a group tag in all the expressions for built-in functions [WIP][SPARK-33124][SQL] Adds a group tag in all the expressions for built-in functions Dec 21, 2020
@maropu maropu changed the title [WIP][SPARK-33124][SQL] Adds a group tag in all the expressions for built-in functions [WIP][SPARK-33124][SQL] Assign missing group tags and re-categorizes all the group tags for built-in functions Dec 21, 2020
@maropu maropu changed the title [WIP][SPARK-33124][SQL] Assign missing group tags and re-categorizes all the group tags for built-in functions [WIP][SPARK-33124][SQL] Fills missing group tags and re-categorizes all the group tags for built-in functions Dec 21, 2020

SparkQA commented Dec 21, 2020

Test build #133117 has finished for PR 30867 at commit 6fdea42.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

maropu commented Dec 21, 2020

Thanks for the comment, @dongjoon-hyun !

> Is the category based on the output type or the input type? Some functions are at the intersection of both types, e.g., StructsToCsv or CreateArray.

A basic policy for re-categorizing functions is that functions in the same file are categorized into the same group. But, yea, the two cases you pointed out above are ambiguous cases, I think. In the current approach, StructsToCsv is categorized into csv_funcs because it is defined in the csvExpressions.scala file (I mean that it is categorized based on its functionality). CreateArray is categorized into array_funcs based on its output type (the other functions in array_funcs are categorized based on their input types, though...).

> What is the definition of array_funcs and collection_funcs?

array_funcs and map_funcs are sub-groups of collection_funcs in the current approach. For example, array_contains is used only for arrays, so it is assigned to array_funcs. On the other hand, reverse is used for both arrays and strings, so it is assigned to collection_funcs.

Anyway, this is a first shot at re-categorizing them, so I'm open to other ideas.

maropu commented Dec 21, 2020

I've updated the title and the description based on the comment above.

@maropu maropu changed the title [WIP][SPARK-33124][SQL] Fills missing group tags and re-categorizes all the group tags for built-in functions [SPARK-33124][SQL] Fills missing group tags and re-categorizes all the group tags for built-in functions Dec 21, 2020

SparkQA commented Dec 21, 2020

Test build #133133 has finished for PR 30867 at commit 4e82580.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Dec 21, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37732/

SparkQA commented Dec 21, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/37732/

@HyukjinKwon

Nice!

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM for Apache Spark 3.2.0.
Thank you, @maropu and @HyukjinKwon !

@dongjoon-hyun

Merged to master!

maropu commented Dec 21, 2020

Thanks, @dongjoon-hyun and @HyukjinKwon !

dongjoon-hyun commented Dec 21, 2020

Enforcing this early in Apache Spark 3.2.0 is good because it prevents new expressions from being added without proper group tags.
Since we have enough time before 3.2.0, I believe we can revise this later if needed.
