Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-39616][BUILD][ML] Upgrade Breeze to 2.0 #37002

Closed
wants to merge 9 commits into from

Conversation

zhengruifeng
Copy link
Contributor

@zhengruifeng zhengruifeng commented Jun 27, 2022

What changes were proposed in this pull request?

1, Upgrade Breeze to 2.0
2, import a implict breeze.stats.distributions.Rand.FixedSeed.randBasis in ML for this upgrade: there are two options: FixedSeed and VariableSeed, I took a quick look and choose FixedSeed for reproductivity

Why are the changes needed?

since 1.3, breeze has replaced com.github.fommil.netlib with dev.ludovic.netlib

after upgrade to 2.0:

1, breeze should be faster because of this replacement;
2, completely resolve the license issue related to com.github.fommil.netlib:all

Does this PR introduce any user-facing change?

No

How was this patch tested?

existing UT

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for working on this, @zhengruifeng .

@srowen
Copy link
Member

srowen commented Jun 27, 2022

Looks good if it works; I kinda thought it would break more

@zhengruifeng
Copy link
Contributor Author

there are some failures in pyspark, I think I need some more time.

@srowen
Copy link
Member

srowen commented Jul 1, 2022

Looks OK pending tests

@github-actions github-actions bot added the R label Jul 4, 2022
Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this ready to merge?

@@ -96,7 +96,8 @@ def run_individual_python_test(target_dir, test_name, pyspark_python):
os.mkdir(metastore_dir)

# Also override the JVM's temp directory by setting driver and executor options.
java_options = "-Djava.io.tmpdir={0} -Dio.netty.tryReflectionSetAccessible=true".format(tmp_dir)
java_options = "-Djava.io.tmpdir={0}".format(tmp_dir)
java_options = java_options + " -Dio.netty.tryReflectionSetAccessible=true -Xss16M"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The -Xss16m seems OK, just wondering if we think that breeze 2.0 is going to cause stack overflows more readily without this setting? could be an issue

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, without this setting, pyspark and sparkr always failed.

I think it should be a new restriction in breeze, found such a commit in breeze 2.0
scalanlp/breeze@07c2d06

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but I guess configurations in practice should normally safisty this stack size requirement.
for example, the scala UT run with breeze 2.0 successfully.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also cc @HyukjinKwon @WeichenXu123 since this PR change the java option of tests in pySpark and sparkR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we using smaller memory in PySpark and SparkR?

Scala UT (Maven and SBT) seems to use small stack size, 4m, at large memory like mx4g.

-Xmx4g -Xss4m -XX:MaxMetaspaceSize=2g

Although this is already passed, do you think we can use more consistent settings across Scala/Python/R?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, let's have the same settings.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let me have a try with -Xss4m

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I left a comment. However, since I believe we can tune JVM memory and stack size options more later, I'm +1 for AS-IS code too, @zhengruifeng .

@zhengruifeng
Copy link
Contributor Author

zhengruifeng commented Jul 5, 2022

@dongjoon-hyun @HyukjinKwon -Xss4M also works, the javaoptions in PySpark and SparkR are different from the scala side, but here I only set stack size.

@dongjoon-hyun
Copy link
Member

Thank you, @zhengruifeng , @srowen , @HyukjinKwon . Merged to master for Apache Spark 3.4.

@zhengruifeng zhengruifeng deleted the build_update_breeze branch July 6, 2022 00:10
@HyukjinKwon
Copy link
Member

HyukjinKwon commented Jul 6, 2022

@zhengruifeng seems like there's one test flaky presumably after this (https://github.com/apache/spark/runs/7203043049?check_suite_focus=true):


File "/__w/spark/spark/python/pyspark/mllib/linalg/distributed.py", line 859, in __main__.IndexedRowMatrix.computeSVD
Failed example:
    svd_model.V # doctest: +ELLIPSIS
Expected:
    DenseMatrix(3, 2, [-0.4082, -0.8165, -0.4082, 0.8944, -0.4472, 0.0], 0)
Got:
    DenseMatrix(3, 2, [-0.4082, -0.8165, -0.4082, 0.8944, -0.4472, -0.0], 0)
**********************************************************************
File "/__w/spark/spark/python/pyspark/mllib/linalg/distributed.py", line 426, in __main__.RowMatrix.computeSVD
Failed example:
    svd_model.V # doctest: +ELLIPSIS
Expected:
    DenseMatrix(3, 2, [-0.4082, -0.8165, -0.4082, 0.8944, -0.4472, 0.0], 0)
Got:
    DenseMatrix(3, 2, [-0.4082, -0.8165, -0.4082, 0.8944, -0.4472, -0.0], 0)
**********************************************************************
   1 of   6 in __main__.IndexedRowMatrix.computeSVD
   1 of   6 in __main__.RowMatrix.computeSVD
***Test Failed*** 2 failures.
Had test failures in pyspark.mllib.linalg.distributed with python3.9; see logs.

Mind taking a look please?

@dongjoon-hyun
Copy link
Member

Thank you for reporting that, @HyukjinKwon .

@zhengruifeng
Copy link
Contributor Author

@HyukjinKwon @dongjoon-hyun let me take a look, thanks

dongjoon-hyun pushed a commit that referenced this pull request Jul 6, 2022
### What changes were proposed in this pull request?
Skip flaky doctests

### Why are the changes needed?
```
File "/__w/spark/spark/python/pyspark/mllib/linalg/distributed.py", line 859, in __main__.IndexedRowMatrix.computeSVD
Failed example:
    svd_model.V # doctest: +ELLIPSIS
Expected:
    DenseMatrix(3, 2, [-0.4082, -0.8165, -0.4082, 0.8944, -0.4472, 0.0], 0)
Got:
    DenseMatrix(3, 2, [-0.4082, -0.8165, -0.4082, 0.8944, -0.4472, -0.0], 0)
**********************************************************************
File "/__w/spark/spark/python/pyspark/mllib/linalg/distributed.py", line 426, in __main__.RowMatrix.computeSVD
Failed example:
    svd_model.V # doctest: +ELLIPSIS
Expected:
    DenseMatrix(3, 2, [-0.4082, -0.8165, -0.4082, 0.8944, -0.4472, 0.0], 0)
Got:
    DenseMatrix(3, 2, [-0.4082, -0.8165, -0.4082, 0.8944, -0.4472, -0.0], 0)
**********************************************************************
   1 of   6 in __main__.IndexedRowMatrix.computeSVD
   1 of   6 in __main__.RowMatrix.computeSVD
***Test Failed*** 2 failures.
Had test failures in pyspark.mllib.linalg.distributed with python3.9; see logs.
```

#37002 occasionally cause above tests output `-0.0` instead of `0.0`, I think they are both acceptable.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
updated doctests

Closes #37097 from zhengruifeng/build_breeze_followup.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
@zhengruifeng
Copy link
Contributor Author

shall we also delete the license file LICENSE-netlib?
@srowen @dongjoon-hyun @HyukjinKwon

@srowen
Copy link
Member

srowen commented Aug 30, 2022

I believe you could. Check the license of the ludovic package to make sure we haven't missed something there, and that the fommil package isn't in the build now

HyukjinKwon pushed a commit to HyukjinKwon/spark that referenced this pull request Aug 30, 2022
### What changes were proposed in this pull request?
Delete the license of fommil-netlib

### Why are the changes needed?
after upgrading breeze to 2.0 in apache#37002, fommil-netlib no longer exists in spark build

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
existing testsuites

Closes apache#37717 from zhengruifeng/del_netlib_license.

Authored-by: Ruifeng Zheng <ruifengz@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
srowen pushed a commit that referenced this pull request Sep 4, 2022
…bra acceleration

### What changes were proposed in this pull request?
The pr aims to update doc for mllib linear algebra acceleration after #37002 & #37700

### Why are the changes needed?
> A.Dependence on netlib-java has been completely eliminated after #37002 & #37700.
> B.Some interfaces of dev.ludovic.netlib.blas have changed.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
N/A

Closes #37791 from panbingkun/update_doc_for_mllib.

Authored-by: panbingkun <pbk1982@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
srowen pushed a commit that referenced this pull request Mar 12, 2023
### What changes were proposed in this pull request?
The pr aims to remove dependency `shapeless` on breeze.

### Why are the changes needed?
After pr #37002, `shapeless` has been deleted in [spark-deps-hadoop-2-hive-2.3](https://github.com/apache/spark/pull/37002/files#diff-670b971a2758f55d602f0d1ef63f7af5f8d9ca095b5a55664bc3275e274ca395).

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass GA.

Closes #40378 from panbingkun/remove_dep_on_breeze.

Authored-by: panbingkun <pbk1982@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants