[SPARK-46677][CONNECT][FOLLOWUP] Convert `count(df["*"])` to `count(1)` on client side #44752

zhengruifeng · 2024-01-16T08:24:44Z

What changes were proposed in this pull request?

Before #44689, df["*"] and sf.col("*") were both converted to UnresolvedStar, and Count(UnresolvedStar) was then converted to Count(1) in the Analyzer:

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala

Lines 1893 to 1897 in 381f369

    
    case f0: UnresolvedFunction if !f0.isDistinct &&
        f0.nameParts.map(_.toLowerCase(Locale.ROOT)) == Seq("count") &&
        f0.arguments == Seq(UnresolvedStar(None)) =>
      // Transform COUNT(*) into COUNT(1).
      f0.copy(nameParts = Seq("count"), arguments = Seq(Literal(1)))

In that fix, we introduced a new node, UnresolvedDataFrameStar, for df["*"], which is later replaced with ResolvedStar. Unfortunately, Count(UnresolvedDataFrameStar) no longer matches the Count(UnresolvedStar) pattern in the rule above.
As a result:

In [1]: from pyspark.sql import functions as sf

In [2]: df1 = spark.createDataFrame([{"id": 1, "val": "v"}])

In [3]: df1.select(sf.count(df1["*"]))
Out[3]: DataFrame[count(id, val): bigint]

which should be

In [3]: df1.select(sf.count(df1["*"]))
Out[3]: DataFrame[count(1): bigint]
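
To make the mismatch concrete, here is a minimal, self-contained Python sketch of the Analyzer rule quoted above. The classes are toy stand-ins that only borrow the Catalyst node names; `plan_id`, `is_distinct`, and `rewrite_count_star` are hypothetical names chosen for illustration, not Spark APIs.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class UnresolvedStar:
    # Toy stand-in for the node produced by sf.col("*").
    target: Optional[Tuple[str, ...]] = None


@dataclass(frozen=True)
class UnresolvedDataFrameStar:
    # Toy stand-in for the node produced by df["*"] after #44689.
    # plan_id is a hypothetical field used only in this sketch.
    plan_id: int


@dataclass(frozen=True)
class Literal:
    value: object


@dataclass(frozen=True)
class UnresolvedFunction:
    name: str
    arguments: tuple
    is_distinct: bool = False


def rewrite_count_star(f: UnresolvedFunction) -> UnresolvedFunction:
    """Toy version of the rule: COUNT(*) -> COUNT(1).

    The match is an equality check against UnresolvedStar(None), so any
    other star-like node falls through unchanged.
    """
    if (
        not f.is_distinct
        and f.name.lower() == "count"
        and f.arguments == (UnresolvedStar(None),)
    ):
        return UnresolvedFunction("count", (Literal(1),))
    return f


# sf.count(sf.col("*")): the rule fires.
col_star = rewrite_count_star(UnresolvedFunction("count", (UnresolvedStar(),)))

# sf.count(df["*"]): the rule does not fire, and the star is expanded later.
df_star = rewrite_count_star(
    UnresolvedFunction("count", (UnresolvedDataFrameStar(plan_id=0),))
)
```

In this toy model `col_star` carries `Literal(1)` while `df_star` still carries the DataFrame star, which corresponds to the `count(id, val)` column name shown above.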

In vanilla Spark, the count function itself performs this conversion, sf.count(df1["*"]) -> sf.count(sf.lit(1)); see

spark/sql/core/src/main/scala/org/apache/spark/sql/functions.scala

Lines 422 to 436 in e8dfcd3

    
    /**
     * Aggregate function: returns the number of items in a group.
     *
     * @group agg_funcs
     * @since 1.3.0
     */
    def count(e: Column): Column = {
      val withoutStar = e.expr match {
        // Turn count(*) into count(1)
        case _: Star => Column(Literal(1))
        case _ => e
      }
      Column.fn("count", withoutStar)
    }

So fixing this behavior on the client side is the natural approach.
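
The client-side conversion can be sketched as follows. This is a simplified, self-contained model that mirrors the Scala count function quoted above; it is not the actual Spark Connect client code, and `Star`, `Lit`, `Attr`, `Call`, and `plan_id` are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Star:
    # Toy stand-in for any star expression: sf.col("*") or df["*"].
    # plan_id is a hypothetical field; None models sf.col("*").
    plan_id: Optional[int] = None


@dataclass(frozen=True)
class Lit:
    value: object


@dataclass(frozen=True)
class Attr:
    name: str


@dataclass(frozen=True)
class Call:
    name: str
    args: tuple


def count(expr):
    """Build a count call, turning count(*) into count(1) before sending it.

    The check is on the expression type, so it does not matter which kind
    of star the caller passed in.
    """
    without_star = Lit(1) if isinstance(expr, Star) else expr
    return Call("count", (without_star,))


from_df_star = count(Star(plan_id=7))  # models sf.count(df["*"])
from_col_star = count(Star())          # models sf.count(sf.col("*"))
from_column = count(Attr("id"))        # ordinary columns pass through
```

Because the rewrite happens when the function call is built, the server never sees a star inside count, so the Analyzer rule does not need to know about the new node.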

Why are the changes needed?

To preserve the existing behavior.

Does this PR introduce any user-facing change?

Yes, it fixes a behavior change introduced in #44689.

How was this patch tested?

Added a unit test.

Was this patch authored or co-authored using generative AI tooling?

no

zhengruifeng · 2024-01-16T10:02:55Z

thanks, merged to master

zhengruifeng added 2 commits January 16, 2024 16:04

fix

186b80e

fix

44e5dcd

github-actions bot added SQL PYTHON CONNECT labels Jan 16, 2024

zhengruifeng requested review from cloud-fan and HyukjinKwon January 16, 2024 08:26

HyukjinKwon approved these changes Jan 16, 2024


cloud-fan approved these changes Jan 16, 2024


zhengruifeng closed this in bfb0f01 Jan 16, 2024

zhengruifeng deleted the connect_fix_count_df_star branch January 16, 2024 10:03
