[SPARK-20687][MLLIB] mllib.Matrices.fromBreeze may crash when converting from Breeze sparse matrix #17940
Conversation
…ng breeze CSCMatrix

In an operation on two CSCMatrices A and B, the resulting matrix C may have extra 0s in its rowIndices and data arrays, which Breeze keeps for performance reasons. This causes problems when converting back to an mllib.Matrix, because that conversion relies on rowIndices and data being consistent with colPtrs. Therefore rowIndices and data must be truncated to the number of active elements held by C before creating Spark's SparseMatrix.
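For context, a minimal sketch of how a sparse-sparse operation can leave the backing arrays longer than the number of active entries. The matrices and values here are illustrative only; whether a given operation actually over-allocates depends on Breeze's internal buffer handling.

```scala
import breeze.linalg.CSCMatrix

object TrailingZerosDemo {
  def main(args: Array[String]): Unit = {
    // Build two small sparse matrices; the values are arbitrary.
    val builderA = new CSCMatrix.Builder[Double](rows = 2, cols = 3)
    builderA.add(0, 0, 2.0)
    builderA.add(1, 0, 2.0)
    val a = builderA.result()

    val builderB = new CSCMatrix.Builder[Double](rows = 2, cols = 3)
    builderB.add(0, 0, 2.0)
    builderB.add(0, 1, 1e-15)
    builderB.add(1, 0, 2.0)
    builderB.add(1, 2, 1e-15)
    val b = builderB.result()

    // A sparse-sparse operation; the result's rowIndices/data arrays may end
    // up longer than activeSize, while colPtrs only accounts for activeSize.
    val c = a - b
    println(s"activeSize = ${c.activeSize}, rowIndices.length = ${c.rowIndices.length}")
    // If the lengths differ, copying rowIndices/data verbatim into a Spark
    // SparseMatrix yields arrays that disagree with colPtrs.
  }
}
```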
Please fix up the title and description per http://spark.apache.org/contributing.html
Sorry about that. I added more context in the description and updated the title.
Test build #3710 has finished for PR 17940 at commit
Need to fix a line in the test because it's too long.
Thanks for the PR. Also cc @yanboliang. Looks like the issue is still present.
// despite sm being a valid CSCMatrix.
// We need to truncate both arrays (rowIndices, data)
// to the real size of the vector sm.activeSize to allow valid conversion
Maybe we can add if (sm.activeSize != sm.rowIndices.length) here, since the truncation is only needed in that case.
Please refer to https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/linalg/CSCMatrix.scala#L130
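For illustration, a standalone sketch of the guarded, slice-based conversion discussed here, shown as a hypothetical helper rather than the PR's actual patch (which lives inside Matrices.fromBreeze):

```scala
import breeze.linalg.{CSCMatrix => BSM}
import org.apache.spark.mllib.linalg.SparseMatrix

// Hypothetical helper mirroring the guarded slice approach: only truncate
// when Breeze reports fewer active entries than the backing arrays hold.
def fromBreezeSparse(sm: BSM[Double]): SparseMatrix = {
  if (sm.activeSize != sm.rowIndices.length) {
    val truncRowIndices = sm.rowIndices.slice(0, sm.activeSize)
    val truncData = sm.data.slice(0, sm.activeSize)
    new SparseMatrix(sm.rows, sm.cols, sm.colPtrs, truncRowIndices, truncData)
  } else {
    new SparseMatrix(sm.rows, sm.cols, sm.colPtrs, sm.rowIndices, sm.data)
  }
}
```

With the guard, inputs that are already consistent avoid the two array copies, which is what the quoted diff lines below implement.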
val truncRowIndices = sm.rowIndices.slice(0, sm.activeSize)
val truncData = sm.data.slice(0, sm.activeSize)
This is the same as calling compact(). To make it less sensitive to Breeze's internal implementation, how about:
val matCopy = sm.copy
matCopy.compact()
I'm implementing both suggestions. However, wouldn't the sm.copy be more expensive than just doing those two slices?
@@ -46,6 +46,26 @@ class MatricesSuite extends SparkFunSuite {
}
}

test("Test Breeze Conversion Bug - SPARK-20687") {
Use a more specific name: "Test FromBreeze when Breeze.CSCMatrix.rowIndices has trailing zeros".
And move the test after the existing unit test "fromBreeze with sparse matrix".
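A hedged sketch of what such a regression test could look like inside MatricesSuite (where the private[mllib] Matrices.fromBreeze is visible). The concrete matrices are illustrative and not necessarily the ones in the merged test:

```scala
import breeze.linalg.{CSCMatrix => BSM, Matrix => BM}

test("fromBreeze when Breeze.CSCMatrix.rowIndices has trailing zeros - SPARK-20687") {
  // Two sparse matrices whose difference exercises the sparse-sparse path.
  val builderA = new BSM.Builder[Double](rows = 2, cols = 3)
  builderA.add(0, 0, 2.0)
  builderA.add(1, 0, 2.0)

  val builderB = new BSM.Builder[Double](rows = 2, cols = 3)
  builderB.add(0, 0, 2.0)
  builderB.add(0, 1, 1e-15)
  builderB.add(0, 2, 1e-15)
  builderB.add(1, 0, 2.0)
  builderB.add(1, 1, 1e-15)
  builderB.add(1, 2, 1e-15)

  val diff: BM[Double] = builderA.result() - builderB.result()
  // Before the fix, this conversion could throw when the Breeze result
  // carried trailing entries in rowIndices/data.
  val converted = Matrices.fromBreeze(diff)
  assert(converted.numRows === 2)
  assert(converted.numCols === 3)
}
```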
Thanks for the update. Looks good to me.
@@ -992,7 +992,24 @@ object Matrices {
new DenseMatrix(dm.rows, dm.cols, dm.data, dm.isTranspose)
case sm: BSM[Double] =>
// There is no isTranspose flag for sparse matrices in Breeze
new SparseMatrix(sm.rows, sm.cols, sm.colPtrs, sm.rowIndices, sm.data)

// Some Breeze CSCMatrices may have extra trailing zeros in
minor: make this comment more compact in fewer lines.
Test build #3714 has started for PR 17940 at commit
Test build #3727 has started for PR 17940 at commit
Test build #3733 has finished for PR 17940 at commit
smC.compact()
new SparseMatrix(smC.rows, smC.cols, smC.colPtrs, smC.rowIndices, smC.data)
} else {
new SparseMatrix(sm.rows, sm.cols, sm.colPtrs, sm.rowIndices, sm.data) |
@hhbyyh what do you think of the current state? I wasn't clear if you were requesting a specific change to the comment.
Here, if you're going to change it again, you could simplify by not repeating the new SparseMatrix(...)
call. Just pick which sparse matrix you're copying in the if statement (original or compacted) and then return the result of converting that.
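In other words, something along these lines: a sketch of the suggested restructuring, shown as a standalone helper with an illustrative name (the real change stays inside Matrices.fromBreeze, as the final diff further down shows):

```scala
import breeze.linalg.{CSCMatrix => BSM}
import org.apache.spark.mllib.linalg.SparseMatrix

// Pick the matrix to convert first (the original, or a compacted copy when
// the backing arrays carry trailing entries), then build SparseMatrix once.
def fromBreezeSparseCompacting(sm: BSM[Double]): SparseMatrix = {
  val nsm = if (sm.rowIndices.length > sm.activeSize) {
    val csm = sm.copy
    csm.compact()
    csm
  } else {
    sm
  }
  new SparseMatrix(nsm.rows, nsm.cols, nsm.colPtrs, nsm.rowIndices, nsm.data)
}
```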
Hey, I shortened the comment and removed the repeated new SparseMatrix call.
Give it a look now.
LGTM
@@ -992,7 +992,16 @@ object Matrices {
new DenseMatrix(dm.rows, dm.cols, dm.data, dm.isTranspose)
case sm: BSM[Double] =>
// There is no isTranspose flag for sparse matrices in Breeze
new SparseMatrix(sm.rows, sm.cols, sm.colPtrs, sm.rowIndices, sm.data)
val nsm = if (sm.rowIndices.length > sm.activeSize) {
// This sparse matrix has trainling zeros.
trailing
Oops.
Test build #3746 has finished for PR 17940 at commit
…ing from Breeze sparse matrix ## What changes were proposed in this pull request? When two Breeze SparseMatrices are operated, the result matrix may contain provisional 0 values extra in rowIndices and data arrays. This causes an incoherence with the colPtrs data, but Breeze get away with this incoherence by keeping a counter of the valid data. In spark, when this matrices are converted to SparseMatrices, Sparks relies solely on rowIndices, data, and colPtrs, but these might be incorrect because of breeze internal hacks. Therefore, we need to slice both rowIndices and data, using their counter of active data This method is at least called by BlockMatrix when performing distributed block operations, causing exceptions on valid operations. See http://stackoverflow.com/questions/33528555/error-thrown-when-using-blockmatrix-add ## How was this patch tested? Added a test to MatricesSuite that verifies that the conversions are valid and that code doesn't crash. Originally the same code would crash on Spark. Bugfix for https://issues.apache.org/jira/browse/SPARK-20687 Author: Ignacio Bermudez <ignaciobermudez@gmail.com> Author: Ignacio Bermudez Corrales <icorrales@splunk.com> Closes #17940 from ghoto/bug-fix/SPARK-20687. (cherry picked from commit 06dda1d) Signed-off-by: Sean Owen <sowen@cloudera.com>
Merged to master/2.2/2.1
What changes were proposed in this pull request?
When an operation is performed on two Breeze sparse matrices, the result matrix may contain extra, provisional 0 values in its rowIndices and data arrays. This makes those arrays inconsistent with colPtrs, but Breeze gets away with the inconsistency by keeping a counter of the valid entries (activeSize).
In Spark, when these matrices are converted to SparseMatrices, Spark relies solely on rowIndices, data, and colPtrs, but those arrays may be incorrect because of Breeze's internal optimizations. Therefore, rowIndices and data need to be truncated using the counter of active entries.
This conversion is called at least by BlockMatrix when performing distributed block operations, causing exceptions on otherwise valid operations (a sketch of that code path follows this description).
See http://stackoverflow.com/questions/33528555/error-thrown-when-using-blockmatrix-add
How was this patch tested?
Added a test to MatricesSuite that verifies that the conversion is valid and that the code doesn't crash. The same code previously crashed on Spark.
Bugfix for https://issues.apache.org/jira/browse/SPARK-20687
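For reference, a hedged sketch of the BlockMatrix code path that reaches this conversion. The local-mode setup and matrix values are illustrative; whether a particular input triggered the pre-fix crash depended on Breeze's internal buffers.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.{Matrices, Matrix}
import org.apache.spark.mllib.linalg.distributed.BlockMatrix

object BlockMatrixAddSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("SPARK-20687-sketch").setMaster("local[2]"))

    // A single 2x2 sparse block: one entry per column.
    val block: ((Int, Int), Matrix) =
      ((0, 0), Matrices.sparse(2, 2, Array(0, 1, 2), Array(0, 1), Array(1.0, 2.0)))
    val a = new BlockMatrix(sc.parallelize(Seq(block)), 2, 2)
    val b = new BlockMatrix(sc.parallelize(Seq(block)), 2, 2)

    // add() converts matching sparse blocks to Breeze, adds them, and converts
    // the result back through Matrices.fromBreeze -- the path fixed here.
    val c = a.add(b)
    c.blocks.collect().foreach(println)

    sc.stop()
  }
}
```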