[SPARK-4562] [MLlib] speedup vector #3420

Closed
davies wants to merge 10 commits into apache:master from davies:ser2

Conversation

@davies
Contributor

@davies davies commented Nov 23, 2014

This PR changes the underlying array of DenseVector to numpy.ndarray to avoid the conversion, because most users will be using numpy.array.

It also improves the serialization of DenseVector.

Before this change:

| trial | trainingTime | testTime |
|-------|--------------|----------|
| 0 | 5.126 | 1.786 |
| 1 | 2.698 | 1.693 |

After the change:

| trial | trainingTime | testTime |
|-------|--------------|----------|
| 0 | 4.692 | 0.554 |
| 1 | 2.307 | 0.525 |

This could partially fix the performance regression at test time.
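
For readers skimming the thread, here is a minimal sketch of the idea described above: keep a numpy.ndarray as the backing store and serialize it as one raw bytes blob instead of a list of Python floats. The class name `SketchDenseVector` and the helper `roundtrip` are hypothetical and only for illustration; this is not the code in the patch, and it assumes float64 storage.

```python
import numpy as np


class SketchDenseVector:
    """Hypothetical stand-in for a dense vector backed by numpy.ndarray."""

    def __init__(self, values):
        if isinstance(values, (bytes, bytearray)):
            # Raw float64 bytes, as produced by __reduce__ below.
            # Note: np.frombuffer on bytes yields a read-only array.
            values = np.frombuffer(values, dtype=np.float64)
        elif not isinstance(values, np.ndarray):
            # Lists, tuples, ranges and so on are converted once, here.
            values = np.array(values, dtype=np.float64)
        if values.dtype != np.float64:
            values = values.astype(np.float64)
        # A float64 ndarray passed by the caller is stored as-is (no copy).
        self.array = values

    def __reduce__(self):
        # Tell pickle to rebuild the object from a single bytes argument.
        return SketchDenseVector, (self.array.tobytes(),)

    def dot(self, other):
        return float(np.dot(self.array, other.array))


def roundtrip(vec):
    # Mimics what pickle does on load: call the factory with the saved args.
    factory, args = vec.__reduce__()
    return factory(*args)
```

With this shape, a caller who already holds a float64 numpy.ndarray pays no conversion cost on construction, and the serialized form is 8 bytes per element.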

@SparkQA

SparkQA commented Nov 23, 2014

Test build #23761 has started for PR 3420 at commit ef6ce70.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 23, 2014

Test build #23762 has started for PR 3420 at commit f0d3c40.

  • This patch merges cleanly.

@davies davies changed the title from "[SPARK-4562] [MLlib] speed up dense vector" to "[SPARK-4562] [MLlib] speedup vector" on Nov 23, 2014
@SparkQA

SparkQA commented Nov 23, 2014

Test build #23763 has started for PR 3420 at commit 470f702.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 23, 2014

Test build #23761 has finished for PR 3420 at commit ef6ce70.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23761/
Test PASSed.

@SparkQA

SparkQA commented Nov 23, 2014

Test build #23762 has finished for PR 3420 at commit f0d3c40.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23762/
Test FAILed.

@SparkQA

SparkQA commented Nov 23, 2014

Test build #23763 has finished for PR 3420 at commit 470f702.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23763/
Test FAILed.

@SparkQA

SparkQA commented Nov 23, 2014

Test build #532 has started for PR 3420 at commit 470f702.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 23, 2014

Test build #23767 has started for PR 3420 at commit 9e6389d.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 23, 2014

Test build #532 has finished for PR 3420 at commit 470f702.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class RandomForestModel(JavaModelWrapper):
    • class RandomForest(object):
    • class DefaultSource extends RelationProvider
    • case class ParquetRelation2(path: String)(@transient val sqlContext: SQLContext)
    • abstract class CatalystScan extends BaseRelation

@SparkQA

SparkQA commented Nov 23, 2014

Test build #23767 has finished for PR 3420 at commit 9e6389d.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23767/
Test FAILed.

Member

Just curious: Is there not a good way to pick up the character set from the encoding utils?

Contributor Author

sorry, I didn't find a name for it.
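
The code under review is not shown in this thread, but the commit list includes "add name for ISO-8859-1", so the character set in question is presumably that one. As a hedged illustration of why that charset suits carrying binary data through a string: ISO-8859-1 maps each byte value 0..255 to the code point with the same value, so arbitrary bytes survive a decode/encode round trip. The constant and function names below are hypothetical.

```python
# Standard codec name; Python also accepts aliases such as "latin-1".
CHARSET = "iso-8859-1"


def bytes_to_text(raw):
    # One character per byte, with the same numeric value.
    return raw.decode(CHARSET)


def text_to_bytes(text):
    # Inverse of bytes_to_text for any text made of code points 0..255.
    return text.encode(CHARSET)


def utf8_roundtrip_ok(raw):
    # For contrast: arbitrary bytes are often not valid UTF-8 at all.
    try:
        return raw.decode("utf-8").encode("utf-8") == raw
    except UnicodeDecodeError:
        return False


all_bytes = bytes(range(256))
lossless = text_to_bytes(bytes_to_text(all_bytes)) == all_bytes
```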

@jkbradley
Member

For the record, I ran some tests with this and confirmed the speedups. This PR puts test time prediction for GLMs at the same speed as the Spark 1.1 release. Combined with [https://github.com//pull/3397], this makes training time much faster than Spark 1.1.

Test summary: 16 worker nodes on EC2. GLMs with 1M rows, 10K cols, 20 iterations. Python GLM training ran about 3X faster than in Spark 1.1, making Python & Scala training times almost equal.

@SparkQA

SparkQA commented Nov 24, 2014

Test build #23771 has started for PR 3420 at commit 2548ee2.

  • This patch merges cleanly.

@jkbradley
Member

By the way, my tests were with dense vectors, not sparse.

@SparkQA

SparkQA commented Nov 24, 2014

Test build #23771 has finished for PR 3420 at commit 2548ee2.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23771/
Test FAILed.

@SparkQA

SparkQA commented Nov 24, 2014

Test build #23776 has started for PR 3420 at commit 1cfb137.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 24, 2014

Test build #23776 has finished for PR 3420 at commit 1cfb137.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

It should work if we say m = DenseMatrix(2, 2, range(4))
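
One way to make a call like that work is to coerce any iterable of numbers into a float64 array inside the constructor. The sketch below is only an illustration under stated assumptions: `SketchDenseMatrix` and `to_array` are hypothetical names, and the column-major layout is assumed here rather than taken from the patch.

```python
import numpy as np


class SketchDenseMatrix:
    """Hypothetical dense matrix that accepts any iterable of numbers."""

    def __init__(self, num_rows, num_cols, values):
        if not isinstance(values, np.ndarray):
            # Handles lists, tuples, range objects and generators alike.
            values = np.array(list(values), dtype=np.float64)
        elif values.dtype != np.float64:
            values = values.astype(np.float64)
        if values.size != num_rows * num_cols:
            raise ValueError("expected %d values, got %d"
                             % (num_rows * num_cols, values.size))
        self.num_rows = num_rows
        self.num_cols = num_cols
        self.values = values.ravel()

    def to_array(self):
        # Assumption: values are stored in column-major (Fortran) order.
        return self.values.reshape((self.num_rows, self.num_cols), order="F")


m = SketchDenseMatrix(2, 2, range(4))
```

Under the column-major assumption, `m.to_array()` places 0 and 1 in the first column and 2 and 3 in the second.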

@SparkQA

SparkQA commented Nov 24, 2014

Test build #23791 has started for PR 3420 at commit 44707ec.

  • This patch merges cleanly.

@davies
Contributor Author

davies commented Nov 24, 2014

@mengxr fixed, thanks!

@SparkQA

SparkQA commented Nov 24, 2014

Test build #23791 has finished for PR 3420 at commit 44707ec.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23791/
Test FAILed.

@SparkQA

SparkQA commented Nov 24, 2014

Test build #533 has started for PR 3420 at commit 426f5db.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 24, 2014

Test build #23793 has started for PR 3420 at commit 0e1e6f3.

  • This patch merges cleanly.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23792/
Test FAILed.

@SparkQA

SparkQA commented Nov 24, 2014

Test build #533 has finished for PR 3420 at commit 426f5db.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 24, 2014

Test build #536 has started for PR 3420 at commit 0e1e6f3.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Nov 24, 2014

Test build #23793 has finished for PR 3420 at commit 0e1e6f3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23793/
Test PASSed.

@SparkQA

SparkQA commented Nov 25, 2014

Test build #536 has finished for PR 3420 at commit 0e1e6f3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Contributor

mengxr commented Nov 25, 2014

LGTM. Merged into master and branch-1.2. Thanks!

@davies davies closed this Nov 25, 2014
andrewor14 pushed a commit to andrewor14/spark that referenced this pull request Nov 25, 2014
This PR changes the underlying array of DenseVector to numpy.ndarray to avoid the conversion, because most users will be using numpy.array.

It also improves the serialization of DenseVector.

Before this change:

| trial | trainingTime | testTime |
|-------|--------------|----------|
| 0 | 5.126 | 1.786 |
| 1 | 2.698 | 1.693 |

After the change:

| trial | trainingTime | testTime |
|-------|--------------|----------|
| 0 | 4.692 | 0.554 |
| 1 | 2.307 | 0.525 |

This could partially fix the performance regression at test time.

Author: Davies Liu <davies@databricks.com>

Closes apache#3420 from davies/ser2 and squashes the following commits:

0e1e6f3 [Davies Liu] fix tests
426f5db [Davies Liu] impove toArray()
44707ec [Davies Liu] add name for ISO-8859-1
fa7d791 [Davies Liu] address comments
1cfb137 [Davies Liu] handle zero sparse vector
2548ee2 [Davies Liu] fix tests
9e6389d [Davies Liu] bugfix
470f702 [Davies Liu] speed up DenseMatrix
f0d3c40 [Davies Liu] speedup SparseVector
ef6ce70 [Davies Liu] speed up dense vector

(cherry picked from commit b660de7)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
asfgit pushed a commit that referenced this pull request Nov 25, 2014