[SPARK-4562] [MLlib] speedup vector by davies · Pull Request #3420 · apache/spark

davies · 2014-11-23T09:25:43Z

This PR changes the underlying array of DenseVector to numpy.ndarray to avoid the conversion, because most users will be using numpy.array.

It also improves the serialization of DenseVector.
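For readers who want to see the idea in isolation, here is a minimal sketch, not the code from this patch. `DenseVectorSketch` is a hypothetical stand-in class. It assumes the vector keeps a float64 `numpy.ndarray` as its storage and pickles itself as one raw bytes blob through `__reduce__`.

```python
import numpy as np


class DenseVectorSketch(object):
    """Hypothetical stand-in: a dense vector stored as a float64 ndarray."""

    def __init__(self, values):
        if isinstance(values, (bytes, bytearray)):
            # Raw float64 bytes, as produced by __reduce__ below.
            values = np.frombuffer(values, dtype=np.float64)
        elif not isinstance(values, np.ndarray):
            # Lists, tuples and ranges are converted once, at construction.
            values = np.array(values, dtype=np.float64)
        if values.dtype != np.float64:
            values = values.astype(np.float64)
        self.array = values

    def __reduce__(self):
        # pickle calls this; one bytes blob replaces a list of Python floats.
        return DenseVectorSketch, (self.array.tobytes(),)

    def toArray(self):
        # No conversion: the storage already is an ndarray.
        return self.array

    def dot(self, other):
        # Accepts another sketch vector, an ndarray, or a plain sequence.
        return float(np.dot(self.array, getattr(other, "array", other)))


v = DenseVectorSketch([1.0, 2.0, 3.0])
cls, args = v.__reduce__()   # what pickle would store
w = cls(*args)               # what unpickling would rebuild
print(w.toArray(), v.dot(w))
```

When the caller already passes a float64 `numpy.ndarray`, the constructor keeps that array as-is, which is the conversion the description says is being avoided.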

Before this change:

trial	trainingTime	testTime
0	5.126	1.786
1	2.698	1.693

After the change:

trial	trainingTime	testTime
0	4.692	0.554
1	2.307	0.525

This could partially fix the performance regression during test.
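As a quick check on the two tables above, the ratios can be worked out directly. This is only arithmetic on the posted numbers; the dictionary layout and the `speedups` helper are made up for illustration.

```python
# (trainingTime, testTime) per trial, copied from the tables above.
before = {0: (5.126, 1.786), 1: (2.698, 1.693)}
after = {0: (4.692, 0.554), 1: (2.307, 0.525)}


def speedups(before, after):
    """Return {trial: (training speedup, test speedup)} as before/after ratios."""
    out = {}
    for trial in sorted(before):
        train_b, test_b = before[trial]
        train_a, test_a = after[trial]
        out[trial] = (train_b / train_a, test_b / test_a)
    return out


for trial, (train_x, test_x) in sorted(speedups(before, after).items()):
    print("trial %d: training %.2fx, test %.2fx" % (trial, train_x, test_x))
# trial 0: training 1.09x, test 3.22x
# trial 1: training 1.17x, test 3.22x
```

So most of the gain in these two trials is in testTime, roughly 3.2x, while trainingTime improves by about 1.1x to 1.2x.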

SparkQA · 2014-11-23T09:29:55Z

Test build #23761 has started for PR 3420 at commit ef6ce70.

This patch merges cleanly.

SparkQA · 2014-11-23T10:30:03Z

Test build #23762 has started for PR 3420 at commit f0d3c40.

This patch merges cleanly.

SparkQA · 2014-11-23T10:42:44Z

Test build #23763 has started for PR 3420 at commit 470f702.

This patch merges cleanly.

SparkQA · 2014-11-23T11:03:54Z

Test build #23761 has finished for PR 3420 at commit ef6ce70.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-11-23T11:03:58Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23761/

SparkQA · 2014-11-23T11:42:06Z

Test build #23762 has finished for PR 3420 at commit f0d3c40.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-11-23T11:42:10Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23762/

SparkQA · 2014-11-23T11:54:43Z

Test build #23763 has finished for PR 3420 at commit 470f702.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-11-23T11:54:46Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23763/

SparkQA · 2014-11-23T20:07:09Z

Test build #532 has started for PR 3420 at commit 470f702.

This patch merges cleanly.

SparkQA · 2014-11-23T20:10:03Z

Test build #23767 has started for PR 3420 at commit 9e6389d.

This patch merges cleanly.

SparkQA · 2014-11-23T21:20:58Z

Test build #532 has finished for PR 3420 at commit 470f702.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class RandomForestModel(JavaModelWrapper):
- class RandomForest(object):
- class DefaultSource extends RelationProvider
- case class ParquetRelation2(path: String)(@transient val sqlContext: SQLContext)
- abstract class CatalystScan extends BaseRelation

SparkQA · 2014-11-23T21:22:51Z

Test build #23767 has finished for PR 3420 at commit 9e6389d.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-11-23T21:22:55Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23767/

jkbradley · 2014-11-24T01:21:00Z

mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala

Just curious: Is there not a good way to pick up the character set from the encoding utils?

sorry, I didn't find a name for it.
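The thread does not say why this charset is used. One property that may be relevant (an assumption on my part, not something stated here) is that ISO-8859-1 maps every byte value 0 to 255 to exactly one code point, so arbitrary binary data survives a decode/encode round trip. A small Python check, with `roundtrips` as a hypothetical helper:

```python
def roundtrips(raw, charset):
    """Return True if raw bytes survive decode + encode with this charset."""
    try:
        return raw.decode(charset).encode(charset) == raw
    except UnicodeDecodeError:
        return False


# Every possible byte value once.
all_bytes = bytes(bytearray(range(256)))

print(roundtrips(all_bytes, "ISO-8859-1"))  # True
print(roundtrips(all_bytes, "UTF-8"))       # False
```

UTF-8 fails here because bytes such as 0x80 are not valid on their own, so it cannot carry raw binary payloads through a string.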

jkbradley · 2014-11-24T01:37:00Z

For the record, I ran some tests with this and confirmed the speedups. This PR puts test time prediction for GLMs at the same speed as the Spark 1.1 release. Combined with [https://github.com//pull/3397], this makes training time much faster than Spark 1.1.

Test summary: 16 worker nodes on EC2. GLMs with 1M rows, 10K cols, 20 iterations. Python GLM training ran about 3X faster than in Spark 1.1, making Python & Scala training times almost equal.

SparkQA · 2014-11-24T05:40:06Z

Test build #23771 has started for PR 3420 at commit 2548ee2.

This patch merges cleanly.

jkbradley · 2014-11-24T06:34:10Z

By the way, my tests were with dense vectors, not sparse.

SparkQA · 2014-11-24T06:51:30Z

Test build #23771 has finished for PR 3420 at commit 2548ee2.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-11-24T06:51:34Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23771/

SparkQA · 2014-11-24T07:55:12Z

Test build #23776 has started for PR 3420 at commit 1cfb137.

This patch merges cleanly.

SparkQA · 2014-11-24T09:28:08Z

Test build #23776 has finished for PR 3420 at commit 1cfb137.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mengxr · 2014-11-24T19:02:07Z

python/pyspark/mllib/linalg.py

It should work if we say m = DenseMatrix(2, 2, range(4))
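A rough sketch of what accepting `range(4)` could look like, assuming the constructor coerces any iterable of numbers into a float64 `numpy.ndarray`. `DenseMatrixSketch` is a hypothetical class, not the one in linalg.py. The column-major layout in `toArray()` follows how MLlib's DenseMatrix stores its values.

```python
import numpy as np


class DenseMatrixSketch(object):
    """Hypothetical stand-in: column-major dense matrix over a float64 ndarray."""

    def __init__(self, numRows, numCols, values):
        if not isinstance(values, np.ndarray):
            # list() makes range objects and generators work as well as lists.
            values = np.array(list(values), dtype=np.float64)
        elif values.dtype != np.float64:
            values = values.astype(np.float64)
        assert values.size == numRows * numCols, "wrong number of values"
        self.numRows = numRows
        self.numCols = numCols
        self.values = values

    def toArray(self):
        # order="F" reads the flat values column by column.
        return self.values.reshape((self.numRows, self.numCols), order="F")


m = DenseMatrixSketch(2, 2, range(4))
print(m.toArray())
# [[0. 2.]
#  [1. 3.]]
```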

SparkQA · 2014-11-24T20:10:21Z

Test build #23791 has started for PR 3420 at commit 44707ec.

This patch merges cleanly.

davies · 2014-11-24T21:19:20Z

@mengxr fixed, thanks!

SparkQA · 2014-11-24T21:20:11Z

Test build #23791 has finished for PR 3420 at commit 44707ec.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-11-24T21:20:16Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23791/

SparkQA · 2014-11-24T21:27:14Z

Test build #533 has started for PR 3420 at commit 426f5db.

This patch merges cleanly.

SparkQA · 2014-11-24T21:30:01Z

Test build #23793 has started for PR 3420 at commit 0e1e6f3.

This patch merges cleanly.

AmplabJenkins · 2014-11-24T21:37:28Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23792/

SparkQA · 2014-11-24T22:29:18Z

Test build #533 has finished for PR 3420 at commit 426f5db.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2014-11-24T22:30:07Z

Test build #536 has started for PR 3420 at commit 0e1e6f3.

This patch merges cleanly.

SparkQA · 2014-11-24T22:55:48Z

Test build #23793 has finished for PR 3420 at commit 0e1e6f3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

AmplabJenkins · 2014-11-24T22:55:51Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23793/

SparkQA · 2014-11-25T00:04:54Z

Test build #536 has finished for PR 3420 at commit 0e1e6f3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mengxr · 2014-11-25T00:37:51Z

LGTM. Merged into master and branch-1.2. Thanks!

This PR changes the underlying array of DenseVector to numpy.ndarray to avoid the conversion, because most users will be using numpy.array.

It also improves the serialization of DenseVector.

Before this change:

trial	trainingTime	testTime
0	5.126	1.786
1	2.698	1.693

After the change:

trial	trainingTime	testTime
0	4.692	0.554
1	2.307	0.525

This could partially fix the performance regression during test.

Author: Davies Liu <davies@databricks.com>

Closes apache#3420 from davies/ser2 and squashes the following commits:

0e1e6f3 [Davies Liu] fix tests
426f5db [Davies Liu] impove toArray()
44707ec [Davies Liu] add name for ISO-8859-1
fa7d791 [Davies Liu] address comments
1cfb137 [Davies Liu] handle zero sparse vector
2548ee2 [Davies Liu] fix tests
9e6389d [Davies Liu] bugfix
470f702 [Davies Liu] speed up DenseMatrix
f0d3c40 [Davies Liu] speedup SparseVector
ef6ce70 [Davies Liu] speed up dense vector

(cherry picked from commit b660de7)
Signed-off-by: Xiangrui Meng <meng@databricks.com>


speed up dense vector

ef6ce70

speedup SparseVector

f0d3c40

speed up DenseMatrix

470f702

davies changed the title ~~[SPARK-4562] [MLlib] speed up dense vector~~ [SPARK-4562] [MLlib] speedup vector Nov 23, 2014

bugfix

9e6389d

jkbradley reviewed Nov 24, 2014

fix tests

2548ee2

handle zero sparse vector

1cfb137

mengxr reviewed Nov 24, 2014

Davies Liu added 2 commits November 24, 2014 12:03

address comments

fa7d791

add name for ISO-8859-1

44707ec

impove toArray()

426f5db

davies force-pushed the ser2 branch from 64d1a25 to 426f5db Compare November 24, 2014 21:20

fix tests

0e1e6f3

davies closed this Nov 25, 2014
