
[SPARK-19436][SQL] Add missing tests for approxQuantile #16776

Closed
wants to merge 14 commits

Conversation

zhengruifeng
Contributor

What changes were proposed in this pull request?

1. Check the behavior with illegal quantiles and relativeError
2. Add tests for relativeError > 1
3. Update tests for null data
4. Update some docs for javadoc8

How was this patch tested?

local test in spark-shell
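For illustration, a minimal plain-Scala sketch of the kind of argument checks items 1 and 2 exercise (not the actual Spark test code; `approxQuantileArgsCheck` is a hypothetical stand-in for the validation under test):

```scala
// Hypothetical stand-in for the argument validation that approxQuantile performs.
def approxQuantileArgsCheck(probabilities: Seq[Double], relativeError: Double): Unit = {
  probabilities.foreach { p =>
    require(p >= 0.0 && p <= 1.0, s"quantile should be in the range [0.0, 1.0], got $p")
  }
  require(relativeError >= 0.0, s"Relative Error must be non-negative, got $relativeError")
}

// An out-of-range quantile throws IllegalArgumentException (require does this).
assert(scala.util.Try(approxQuantileArgsCheck(Seq(-0.1), 0.01)).isFailure)

// A negative relativeError throws as well.
assert(scala.util.Try(approxQuantileArgsCheck(Seq(0.5), -1.0)).isFailure)

// relativeError > 1 is accepted: only non-negativity is required.
approxQuantileArgsCheck(Seq(0.5), 2.0)
```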

@zhengruifeng zhengruifeng changed the title [SPARK-14352][FOLLOWUP][SQL] add tests for approxQuantile & [SPARK-14352][FOLLOWUP][SQL] update tests for approxQuantile Feb 2, 2017
@zhengruifeng zhengruifeng changed the title [SPARK-14352][FOLLOWUP][SQL] update tests for approxQuantile [SPARK-19436][SQL] Add missing tests for approxQuantile Feb 2, 2017
@SparkQA

SparkQA commented Feb 2, 2017

Test build #72278 has finished for PR 16776 at commit db9ffc2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Contributor Author

@SparkQA

SparkQA commented Feb 2, 2017

Test build #72279 has finished for PR 16776 at commit 917fd6e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

@zhengruifeng Could you split the test cases into multiple independent ones with meaningful titles? Thanks!

}

// quantile should be in the range [0.0, 1.0]
val e: IllegalArgumentException = intercept[IllegalArgumentException] {
Member

val e: IllegalArgumentException -> val e

@SparkQA

SparkQA commented Feb 3, 2017

Test build #72304 has finished for PR 16776 at commit 3ea1301.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Contributor Author

@gatorsmile Updated! Thanks for reviewing!

@gatorsmile
Member

Thank you! What is the expected output if the input dataset is empty? Could you also add a test case?

@@ -80,18 +80,16 @@ final class DataFrameStatFunctions private[sql](df: DataFrame) {
* @see [[DataFrameStatsFunctions.approxQuantile(col:Str* approxQuantile]] for
Member

You also need to fix this line, right?

4, update some docs for javadoc8

@SparkQA

SparkQA commented Feb 4, 2017

Test build #72364 has finished for PR 16776 at commit 92bcf05.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

assert(e2.getMessage.contains("Relative Error must be non-negative"))

// dataset should be non-empty
intercept[NoSuchElementException] {
Member

We should return null, instead of throwing an exception. Could you fix it?

Contributor

Do we really want to return null? Seems very un-Scala-like to me.

@SparkQA

SparkQA commented Feb 5, 2017

Test build #72407 has finished for PR 16776 at commit 292cd02.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Contributor Author

@gatorsmile Updated! Thanks for reviewing!

StatFunctions.multipleApproxQuantiles(df.select(cols.map(col): _*).na.drop(), cols,
probabilities, relativeError).map(_.toArray).toArray
try {
StatFunctions.multipleApproxQuantiles(df.select(cols.map(col): _*).na.drop(), cols,
Member

We will drop the whole row if any column has null or NaN. For example,

    Seq[(java.lang.Long, java.lang.Double)]((null, 1.23), (3L, null), (4L, 3.45))
      .toDF("a", "b").na.drop().show()

That means users could get different results depending on which API they use:

df.stat.approxQuantile("col1", Array(q1), epsilon)
df.stat.approxQuantile("col2", Array(q1), epsilon)
df.stat.approxQuantile(Array("col1", "col2"), Array(q1), epsilon)

I am wondering if this is the expected behavior?
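The inconsistency can be shown without Spark at all. The sketch below models the two behaviors on a plain `Seq` of tuples (an illustration of the semantics, not the Spark implementation): dropping whole rows that contain any NaN removes values that a per-column drop would keep.

```scala
// Rows of (col1, col2); NaN plays the role of null/NaN in a DataFrame.
val rows = Seq((Double.NaN, 1.23), (3.0, Double.NaN), (4.0, 3.45))

// Single-column API behavior: drop NaNs from the requested column only.
val col1Single = rows.map(_._1).filterNot(_.isNaN)

// Multi-column API behavior: na.drop() removes any row containing a NaN,
// then the quantile for col1 is computed on what remains.
val col1Multi = rows.filterNot { case (a, b) => a.isNaN || b.isNaN }.map(_._1)

// The same column yields different data depending on which API was used.
assert(col1Single == Seq(3.0, 4.0))
assert(col1Multi == Seq(4.0))
```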

Member
@gatorsmile gatorsmile Feb 8, 2017

This does not sound right to me. cc @holdenk @MLnick @jkbradley

I think that might be the reason why we did not provide such an API at the beginning.

If we want to make them consistent, this API does not provide any performance benefit. Should we revert PR #12135?

Contributor Author

@gatorsmile Good catch! Agree that this will cause confusing results.
I think there are two ways to make them consistent:
1. The na-dropping behavior was included in SPARK-17219 to enhance NaN value handling, and the single-column version of approxQuantile is only used in QuantileDiscretizer. So we can make the na-dropping happen in QuantileDiscretizer and remove it from approxQuantile.
2. Modify the implementation StatFunctions.multipleApproxQuantiles to deal with null and NaN, and remove the na-drop in approxQuantile.

Member

@zhengruifeng Sure. If we want to make them consistent, I am fine. How about reverting #12135 first? In the meantime, we can work on the new solution.

Contributor

Originally there was never any na-dropping in approxQuantile, as far as I can recall. That was added in #14858. cc @srowen

You could also simply change the na-dropping to drop only from the cols passed as args for each version?

Member

Great catch! I vote for modifying multipleApproxQuantiles to handle null and NaN values. As far as reverting, I'm OK either way as long as we get the fix into 2.2. I'd actually recommend going ahead and merging this PR and creating a follow-up Critical Bug targeted at 2.2.

@MLnick I think dropping NAs from the cols passed as args still will not work. Say the user passes cols "a" and "b" as args, but some rows have (a = NaN, b = 1.0). Then those rows will be ignored.

Member

Let us add a TODO comment above this function and create a JIRA for tracking this issue. Thanks!

Contributor

You're correct, I missed that!

Ok, so #14858 added the na-dropping to approxQuantile. It wasn't there in the original impl. It can work on data that has NaN, in the sense that it will return NaNs as part of the quantiles in some cases. I'm not sure whether that was the intended behavior in the original impl. cc @thunterdb

In that sense it's the same as aggregates like min or max: those don't exclude NaN automatically, and it's up to the user to handle them. So it could be best to just remove the na-dropping from approxQuantile (which would revert to the original impl behavior) and put it in QuantileDiscretizer, which was the original need.

* @param cols the names of the numerical columns
* @param probabilities a list of quantile probabilities
* Each number must belong to [0, 1].
* For example 0 is the minimum, 0.5 is the median, 1 is the maximum.
* @param relativeError The relative target precision to achieve (>= 0).
* @param relativeError The relative target precision to achieve (greater or equal to 0).
Member

"greater" -> "greater than"

if (res != null) {
res.head
} else {
null
Member

The Scaladoc should describe this case

@SparkQA

SparkQA commented Feb 9, 2017

Test build #72634 has started for PR 16776 at commit 4db82b4.

@zhengruifeng
Contributor Author

Jenkins, retest this please

@SparkQA

SparkQA commented Feb 10, 2017

Test build #72676 has finished for PR 16776 at commit 4db82b4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

res.head
} else {
null
}
Member

The above five lines can be shortened to Option(res).map(_.head).orNull
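The two forms are equivalent, as this small sketch shows (simplified types, not the actual Spark signature):

```scala
// The verbose null-checking branch from the diff above.
def headOrNullVerbose(res: Array[Array[Double]]): Array[Double] =
  if (res != null) {
    res.head
  } else {
    null
  }

// The suggested one-liner: Option(null) is None, so orNull yields null;
// otherwise the first element is extracted.
def headOrNullConcise(res: Array[Array[Double]]): Array[Double] =
  Option(res).map(_.head).orNull

assert(headOrNullVerbose(null) == null && headOrNullConcise(null) == null)
val res = Array(Array(1.0, 2.0))
assert(headOrNullVerbose(res).sameElements(headOrNullConcise(res)))
```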

}

/**
* Calculates the approximate quantiles of numerical columns of a DataFrame.
* @see [[DataFrameStatsFunctions.approxQuantile(col:Str* approxQuantile]] for
* detailed description.
* @see `DataFrameStatsFunctions.approxQuantile` for detailed description.
Member

Did this link cause problems in doc generation? It looks like there is a missing ")" in the original link. If you add the ")" then will it generate correctly (with a hyperlink) for the Scala and Java docs?

Contributor Author

@jkbradley Do you mean @see [[DataFrameStatsFunctions.approxQuantile(col:Str* approxQuantile]])? I am not sure whether it works for the Java docs. @HyukjinKwon Could you help review this?

@SparkQA

SparkQA commented Feb 13, 2017

Test build #72794 has finished for PR 16776 at commit a3171e4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 13, 2017

Test build #72795 has finished for PR 16776 at commit c77755d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

/**
* Calculates the approximate quantiles of numerical columns of a DataFrame.
* @see [[DataFrameStatsFunctions.approxQuantile(col:Str* approxQuantile]] for
* detailed description.
* @see `[[DataFrameStatsFunctions.approxQuantile(col:Str* approxQuantile]]` for detailed
Member
@gatorsmile gatorsmile Feb 13, 2017

DataFrameStatsFunctions.approxQuantile(col:Str* approxQuantile
->
DataFrameStatsFunctions.approxQuantile(col:Str* approxQuantile)
??? How to generate the JAVA8 doc?

Member
@HyukjinKwon HyukjinKwon Feb 14, 2017

nit: DataFrameStatsFunctions -> DataFrameStatFunctions or remove it.

For example, just

`approxQuantile(String, Array[Double], Double)`

or

`approxQuantile()` with some description

We could just wrap them in backticks without [[ ... ]] in general. It seems the Scaladoc-specific annotation also does not work to disambiguate the argument types.

[error] .../spark/sql/core/target/java/org/apache/spark/sql/DataFrameStatFunctions.java:43: error: unexpected content
[error]    * @see {@link DataFrameStatFunctions.approxQuantile(col:Str* approxQuantile)} for
[error]      ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/DataFrameStatFunctions.java:45: error: unexpected text
[error]    * @see #approxQuantile(String, Array[Double], Double) for detailed description.
[error]      ^

I guess it does not necessarily need to be a link if making it one breaks the build.
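Concretely, the backtick form being suggested would look something like this in the Scaladoc (a sketch, not the exact final diff):

```scala
/**
 * Calculates the approximate quantiles of numerical columns of a DataFrame.
 *
 * @see `approxQuantile(String, Array[Double], Double)` for detailed description.
 */
```

Since the backticked text is plain code formatting rather than a `[[...]]` link, unidoc has nothing to resolve and the javadoc8 build no longer breaks on it.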

Member

It seems the doc breaks have queued up a bit. Let me sweep them soon.

@SparkQA

SparkQA commented Feb 13, 2017

Test build #72802 has started for PR 16776 at commit 4b7ad19.

@zhengruifeng
Contributor Author

https://issues.apache.org/jira/browse/SPARK-19573 has been created to track the issue of inconsistent na-dropping.

@SparkQA

SparkQA commented Feb 13, 2017

Test build #72809 has finished for PR 16776 at commit 1f07901.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

/**
* Calculates the approximate quantiles of numerical columns of a DataFrame.
* @see [[DataFrameStatsFunctions.approxQuantile(col:Str* approxQuantile]] for
Member
@HyukjinKwon HyukjinKwon Feb 14, 2017

I am sorry. Actually, I initially meant removing DataFrameStatFunctions and leaving just the method, because it is in the same class. Nevertheless, FWIW, I am fine with removing this @see, as is done for the other functions here.

Contributor

The * was me getting fancy to get the ScalaDoc to link to the correct single-arg method (I did test it at the time and it does work for Scala though there may be a mistake here somewhere).

It would still be good to provide a @see reference even if it does not link nicely (so the simple backtick method name as you suggested?)

Contributor Author

OK, I will add it back. Should it be approxQuantile(col:Str* approxQuantile) or approxQuantile(String, Array[Double], Double) ?

@SparkQA

SparkQA commented Feb 14, 2017

Test build #72867 has finished for PR 16776 at commit db23d11.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 14, 2017

Test build #72868 has finished for PR 16776 at commit 7c54234.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Feb 16, 2017

Test build #72983 has finished for PR 16776 at commit 2268d36.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

ghost pushed a commit to dbtsai/spark that referenced this pull request Feb 16, 2017
## What changes were proposed in this pull request?

The errors below seem to be caused by unidoc, which does not understand doubly commented blocks.

```
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:69: error: class, interface, or enum expected
[error]  * MapGroupsWithStateFunction<String, Integer, Integer, String> mappingFunction =
[error]                                  ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:69: error: class, interface, or enum expected
[error]  * MapGroupsWithStateFunction<String, Integer, Integer, String> mappingFunction =
[error]                                                                       ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:70: error: class, interface, or enum expected
[error]  *    new MapGroupsWithStateFunction<String, Integer, Integer, String>() {
[error]                                         ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:70: error: class, interface, or enum expected
[error]  *    new MapGroupsWithStateFunction<String, Integer, Integer, String>() {
[error]                                                                             ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:72: error: illegal character: '#'
[error]  *      &apache#64;Override
[error]          ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:72: error: class, interface, or enum expected
[error]  *      &apache#64;Override
[error]              ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:73: error: class, interface, or enum expected
[error]  *      public String call(String key, Iterator<Integer> value, KeyedState<Integer> state) {
[error]                ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:73: error: class, interface, or enum expected
[error]  *      public String call(String key, Iterator<Integer> value, KeyedState<Integer> state) {
[error]                                                    ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:73: error: class, interface, or enum expected
[error]  *      public String call(String key, Iterator<Integer> value, KeyedState<Integer> state) {
[error]                                                                ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:73: error: class, interface, or enum expected
[error]  *      public String call(String key, Iterator<Integer> value, KeyedState<Integer> state) {
[error]                                                                                     ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:73: error: class, interface, or enum expected
[error]  *      public String call(String key, Iterator<Integer> value, KeyedState<Integer> state) {
[error]                                                                                                 ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:76: error: class, interface, or enum expected
[error]  *          boolean shouldRemove = ...; // Decide whether to remove the state
[error]  ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:77: error: class, interface, or enum expected
[error]  *          if (shouldRemove) {
[error]  ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:79: error: class, interface, or enum expected
[error]  *          } else {
[error]  ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:81: error: class, interface, or enum expected
[error]  *            state.update(newState); // Set the new state
[error]  ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:82: error: class, interface, or enum expected
[error]  *          }
[error]  ^
[error] .../forked/spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:85: error: class, interface, or enum expected
[error]  *          state.update(initialState);
[error]  ^
[error] .../forked/spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:86: error: class, interface, or enum expected
[error]  *        }
[error]  ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:90: error: class, interface, or enum expected
[error]  * </code></pre>
[error]  ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:92: error: class, interface, or enum expected
[error]  * tparam S User-defined type of the state to be stored for each key. Must be encodable into
[error]            ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:93: error: class, interface, or enum expected
[error]  *           Spark SQL types (see {link Encoder} for more details).
[error]                                          ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:94: error: class, interface, or enum expected
[error]  * since 2.1.1
[error]           ^
```

And another link seems unrecognisable.

```
.../spark/sql/core/target/java/org/apache/spark/sql/KeyedState.java:16: error: reference not found
[error]  * That is, in every batch of the {link streaming.StreamingQuery StreamingQuery},
[error]
```

Note that this PR does not fix the two breaks as below:

```
[error] .../spark/sql/core/target/java/org/apache/spark/sql/DataFrameStatFunctions.java:43: error: unexpected content
[error]    * see {link DataFrameStatsFunctions.approxQuantile(col:Str* approxQuantile} for
[error]      ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/DataFrameStatFunctions.java:52: error: bad use of '>'
[error]    * param relativeError The relative target precision to achieve (>= 0).
[error]                                                                     ^
[error]
```

because these will probably be fixed soon in apache#16776, and I intended to avoid potential conflicts.

## How was this patch tested?

Manually via `jekyll build`

Author: hyukjinkwon <gurwls223@gmail.com>

Closes apache#16926 from HyukjinKwon/javadoc-break.
@gatorsmile
Member

LGTM

@gatorsmile
Member

Thanks! Merging to master.

Please continue to finish the work https://issues.apache.org/jira/browse/SPARK-19573

@asfgit asfgit closed this in 54a30c8 Feb 16, 2017
@zhengruifeng zhengruifeng deleted the fix_approxQuantile branch February 17, 2017 02:19
@thunterdb
Contributor

Sorry I missed the conversation here. LGTM.
