Skip to content

Conversation

@ChengXiangLi
Copy link

The Literal(1) is used as the IntermediateField of Count aggregation, so multi Count looks the same in ExpandAggregations.

@aljoscha
Copy link
Contributor

Hi,
the reason it was implemented as it was is to eliminate common subexpressions. For example, if you wrote:

select(a.count, b.count)

the resulting expansion would be

intermediate = select(1.count as intermediate.1)
agg = intermediate.aggregate(intermediate.1, sum)
result = agg.select(intermediate.1, intermediate.1)

You see, the 1 would only be aggregated once here.

The problem was that selecting intermediate.1 twice is not allowed since the field names have to be unique.

Changing ExpandAggregations to this should do the trick:

var intermediateCount = 0
var resultCount = 0
selection foreach {  f =>
  f.transformPre {
    case agg: Aggregation =>
      val intermediateReferences = agg.getIntermediateFields.zip(agg.getAggregations) map {
        case (expr, basicAgg) =>
          aggregations.get((expr, basicAgg)) match {
            case Some(intermediateName) =>
              resultCount = resultCount + 1
              val resultName = s"result.$resultCount"
              Naming(ResolvedFieldReference(intermediateName, expr.typeInfo), resultName)
            case None =>
              intermediateCount = intermediateCount + 1
              val intermediateName = s"intermediate.$intermediateCount"
              intermediateFields += Naming(expr, intermediateName)
              aggregations((expr, basicAgg)) = intermediateName
              resultCount = resultCount + 1
              val resultName = s"result.$resultCount"
              Naming(ResolvedFieldReference(intermediateName, expr.typeInfo), resultName)
          }
      }

      aggregationIntermediates(agg) = intermediateReferences
      // Return a NOP so that we don't add the children of the aggregation
      // to intermediate fields. We already added the necessary fields to the list
      // of intermediate fields.
      NopExpression()

    case fa: ResolvedFieldReference =>
      if (!fa.name.startsWith("intermediate")) {
        intermediateFields += Naming(fa, fa.name)
      }
      fa
  }
}

Now the aggregations will be expanded to:

intermediate = select(1.count as intermediate.1)
agg = intermediate.aggregate(intermediate.1, sum)
result = agg.select(intermediate.1 as result.1, intermediate.1 as result.2)

This also required changing ExpressionCodeGenerator.scala, line 111 to:

def cleanExpr(e: Expression): Expression = {
  e match {
    case expressions.Naming(namedExpr, _) => cleanExpr(namedExpr)
    case _ => e
  }
}

because there is another bug that doesn't allow having nested Namings à la Naming(Naming(expr, intermediate.1, result.1)).

@aljoscha
Copy link
Contributor

It's good that you found the problem and added a test. 😄

@ChengXiangLi
Copy link
Author

Thanks, @aljoscha ,
a.count and b.count may not the same all the times(for example, if a contains NULL values), that's why i try to aggregate it twice. As NULL value is not allowed until now, so i can update the PR in your way and revisit this issue after NULL value handling in Table API is enabled.

@aljoscha
Copy link
Contributor

aljoscha commented Dec 1, 2015

Oh, you are way ahead of me. 😅 You are right, keep it that way since we will probably have support for Null values in the future.

I'll merge it later if there are no objections. @twalthr ?

@twalthr
Copy link
Contributor

twalthr commented Dec 1, 2015

Looks good. No objections.

@aljoscha
Copy link
Contributor

aljoscha commented Dec 1, 2015

Ah, I see you changed it to my (wrong) suggestion. @ChengXiangLi Do you want to change it back to your initial version (plus maybe the fix for the naming and cleanExpr). Or should it be changed when adding support for null values?

@ChengXiangLi
Copy link
Author

Let's reuse the count aggregation now, as it's better for performance. I would revisit this after NULL value handling is enabled.

@aljoscha
Copy link
Contributor

aljoscha commented Dec 2, 2015

Alright, I merged it. Could you please close it if github doesn't do so automatically.

Thanks for the work!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants