[SPARK-4562] [MLlib] speedup vector #3420
Test build #23761 has started for PR 3420 at commit
Test build #23762 has started for PR 3420 at commit
Test build #23763 has started for PR 3420 at commit
Test build #23761 has finished for PR 3420 at commit
Test PASSed.
Test build #23762 has finished for PR 3420 at commit
Test FAILed.
Test build #23763 has finished for PR 3420 at commit
Test FAILed.
Test build #532 has started for PR 3420 at commit
Test build #23767 has started for PR 3420 at commit
Test build #532 has finished for PR 3420 at commit
Test build #23767 has finished for PR 3420 at commit
Test FAILed.
Just curious: Is there not a good way to pick up the character set from the encoding utils?
sorry, I didn't find a name for it.
For the record, I ran some tests with this and confirmed the speedups. This PR brings test-time prediction for GLMs to the same speed as in the Spark 1.1 release. Combined with [https://github.com//pull/3397], this makes training much faster than in Spark 1.1. Test summary: 16 worker nodes on EC2; GLMs with 1M rows, 10K columns, 20 iterations. Python GLM training ran about 3X faster than in Spark 1.1, making Python and Scala training times almost equal.
Test build #23771 has started for PR 3420 at commit
By the way, my tests were with dense vectors, not sparse.
Test build #23771 has finished for PR 3420 at commit
Test FAILed.
Test build #23776 has started for PR 3420 at commit
Test build #23776 has finished for PR 3420 at commit
python/pyspark/mllib/linalg.py
Outdated
It should work if we say m = DenseMatrix(2, 2, range(4))
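One way a constructor can accept any sequence, including range(4), is to coerce the incoming values into a typed float buffer on entry. The following is a minimal sketch under stated assumptions, not the code in python/pyspark/mllib/linalg.py: the class name TinyDenseMatrix, its get() helper, and the column-major layout are all hypothetical, and the stdlib array module stands in for whatever buffer type the real class uses.

```python
import array


class TinyDenseMatrix(object):
    """Hypothetical stand-in for a dense matrix; not pyspark's DenseMatrix."""

    def __init__(self, num_rows, num_cols, values):
        # Materialize first so range objects, tuples and generators all work.
        values = list(values)
        if len(values) != num_rows * num_cols:
            raise ValueError("expected %d values, got %d"
                             % (num_rows * num_cols, len(values)))
        self.num_rows = num_rows
        self.num_cols = num_cols
        # Coerce ints to C doubles so later code sees one uniform element type.
        self.values = array.array('d', values)

    def get(self, i, j):
        # Assumption: column-major storage, so (i, j) lives at i + j * num_rows.
        return self.values[i + j * self.num_rows]


m = TinyDenseMatrix(2, 2, range(4))
```

With this sketch, integer input such as range(4) ends up stored as doubles, so callers do not need to convert the values themselves.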
Test build #23791 has started for PR 3420 at commit
@mengxr fixed, thanks!
Test build #23791 has finished for PR 3420 at commit
Test FAILed.
Test build #533 has started for PR 3420 at commit
Test build #23793 has started for PR 3420 at commit
Test FAILed.
Test build #533 has finished for PR 3420 at commit
Test build #536 has started for PR 3420 at commit
Test build #23793 has finished for PR 3420 at commit
Test PASSed.
Test build #536 has finished for PR 3420 at commit
LGTM. Merged into master and branch-1.2. Thanks!
This PR changes the underlying array of DenseVector to numpy.ndarray to avoid the conversion, because most users will be using numpy.array. It also improves the serialization of DenseVector.

Before this change:

trial | trainingTime | testTime
-------|--------|--------
0 | 5.126 | 1.786
1 | 2.698 | 1.693

After the change:

trial | trainingTime | testTime
-------|--------|--------
0 | 4.692 | 0.554
1 | 2.307 | 0.525

This could partially fix the performance regression during testing.

Author: Davies Liu <davies@databricks.com>

Closes apache#3420 from davies/ser2 and squashes the following commits:

0e1e6f3 [Davies Liu] fix tests
426f5db [Davies Liu] impove toArray()
44707ec [Davies Liu] add name for ISO-8859-1
fa7d791 [Davies Liu] address comments
1cfb137 [Davies Liu] handle zero sparse vector
2548ee2 [Davies Liu] fix tests
9e6389d [Davies Liu] bugfix
470f702 [Davies Liu] speed up DenseMatrix
f0d3c40 [Davies Liu] speedup SparseVector
ef6ce70 [Davies Liu] speed up dense vector

(cherry picked from commit b660de7)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
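As a rough illustration of why keeping a dense vector in one typed buffer can make serialization cheaper, the sketch below packs the values as a single contiguous block of 8-byte doubles instead of handling each Python float separately. This is not the code from this PR: pack_dense and unpack_dense are hypothetical names, and the stdlib array module is used here as a stand-in for numpy.ndarray so the example has no third-party dependencies.

```python
import array
import pickle


def pack_dense(values):
    """Serialize floats as one contiguous block of C doubles (hypothetical helper)."""
    return array.array('d', values).tobytes()


def unpack_dense(payload):
    """Rebuild the float buffer from raw bytes in a single bulk copy."""
    buf = array.array('d')
    buf.frombytes(payload)
    return buf


values = [float(i) / 7.0 for i in range(1000)]

# The whole vector travels as one bytes object, so the pickler handles
# one item rather than one item per element.
wire = pickle.dumps(pack_dense(values), protocol=2)
restored = unpack_dense(pickle.loads(wire))
```

The round trip is exact because the doubles are copied bit for bit. A NumPy-backed version would follow the same pattern with the array's raw bytes, assuming both sides agree on element type and byte order.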