[SPARK-8711] [ML] Add additional methods to PySpark ML tree models #7095

Closed
wants to merge 5 commits into from

Conversation

MechCoder
Contributor

Add numNodes and depth to the tree models, and treeWeights to the ensemble models.
Add repr to all models.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@MechCoder
Contributor Author

@mengxr Should we add a wrapper around the trees method in ensembles as well? I'm not sure how we would manipulate Java objects in Python.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented Jun 29, 2015

Test build #36023 has started for PR 7095 at commit 849aabe.

@AmplabJenkins

Merged build finished. Test FAILed.

@SparkQA

SparkQA commented Jun 29, 2015

Test build #36023 has finished for PR 7095 at commit 849aabe.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DecisionTreeClassificationModel(DecisionTreeModel):
    • class RandomForestClassificationModel(treeEnsembleModels):
    • class GBTClassificationModel(treeEnsembleModels):
    • class DecisionTreeModel(JavaModel):
    • class treeEnsembleModels(JavaModel):
    • class DecisionTreeRegressionModel(DecisionTreeModel):
    • class RandomForestRegressionModel(treeEnsembleModels):
    • class GBTRegressionModel(treeEnsembleModels):

@AmplabJenkins

Merged build finished. Test PASSed.

@@ -409,6 +444,8 @@ class GBTRegressor(JavaEstimator, HasFeaturesCol, HasLabelCol, HasPredictionCol,
... (0.0, Vectors.sparse(1, [], []))], ["label", "features"])
>>> gbt = GBTRegressor(maxIter=5, maxDepth=2)
>>> model = gbt.fit(df)
>>> model.treeWeights
[1.0, 0.1, 0.1, 0.1, 0.1]
Contributor

Floating-point equality is causing this test to fail. Try numpy.allclose.
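
To illustrate the suggestion above, here is a minimal sketch of why an exact comparison of floating-point weights can fail while a tolerance-based one passes. The `computed` values are made up for illustration and are not the values Spark produced, and `all_close` is a hypothetical standard-library stand-in; where NumPy is available, `numpy.allclose(computed, expected)` plays the same role.

```python
import math


def all_close(actual, expected, rel_tol=1e-9, abs_tol=1e-12):
    """Hypothetical helper: element-wise approximate equality of two sequences."""
    return len(actual) == len(expected) and all(
        math.isclose(a, e, rel_tol=rel_tol, abs_tol=abs_tol)
        for a, e in zip(actual, expected)
    )


expected = [1.0, 0.1, 0.1, 0.1, 0.1]

# Illustrative values only: 0.3 - 0.2 is not exactly 0.1 in binary floating point.
computed = [1.0] + [0.3 - 0.2] * 4

print(computed == expected)           # False: exact comparison is too strict
print(all_close(computed, expected))  # True: comparison within a tolerance
```

In a doctest, the same idea means asserting on the result of the tolerance check rather than on the printed list.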

@MechCoder
Contributor Author

@feynmanliang Thanks. I have addressed your comments.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented Jun 30, 2015

Test build #36090 has started for PR 7095 at commit 2fde31a.

@SparkQA

SparkQA commented Jun 30, 2015

Test build #36090 has finished for PR 7095 at commit 2fde31a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DecisionTreeClassificationModel(DecisionTreeModel):
    • class RandomForestClassificationModel(TreeEnsembleModels):
    • class GBTClassificationModel(TreeEnsembleModels):
    • class DecisionTreeModel(JavaModel):
    • class TreeEnsembleModels(JavaModel):
    • class DecisionTreeRegressionModel(DecisionTreeModel):
    • class RandomForestRegressionModel(TreeEnsembleModels):
    • class GBTRegressionModel(TreeEnsembleModels):

@AmplabJenkins

Merged build finished. Test PASSed.

@JoshRosen
Contributor

Uh oh: looks like this is actually failing some tests:

Exception in thread Thread-1:
Traceback (most recent call last):
  File "/home/anaconda/envs/py3k/lib/python3.4/threading.py", line 920, in _bootstrap_inner
    self.run()
  File "/home/anaconda/envs/py3k/lib/python3.4/threading.py", line 868, in run
    self._target(*self._args, **self._kwargs)
  File "./python/run-tests.py", line 166, in process_queue
    run_individual_python_test(test_goal, python_exec)
  File "./python/run-tests.py", line 74, in run_individual_python_test
    if not re.match('[0-9]+', line):
  File "/home/anaconda/envs/py3k/lib/python3.4/re.py", line 160, in match
    return _compile(pattern, flags).match(string)
TypeError: can't use a string pattern on a bytes-like object

Finished test(python3.4): pyspark.ml.recommendation (15s)
Finished test(python3.4): pyspark.ml.feature (18s)
Exception in thread Thread-3:
Traceback (most recent call last):
  File "/home/anaconda/envs/py3k/lib/python3.4/threading.py", line 920, in _bootstrap_inner
    self.run()
  File "/home/anaconda/envs/py3k/lib/python3.4/threading.py", line 868, in run
    self._target(*self._args, **self._kwargs)
  File "./python/run-tests.py", line 166, in process_queue
    run_individual_python_test(test_goal, python_exec)
  File "./python/run-tests.py", line 74, in run_individual_python_test
    if not re.match('[0-9]+', line):
  File "/home/anaconda/envs/py3k/lib/python3.4/re.py", line 160, in match
    return _compile(pattern, flags).match(string)
TypeError: can't use a string pattern on a bytes-like object

This is the consequence of not having good test coverage on the test script... I'll fix this shortly.
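
For readers unfamiliar with this error, the sketch below reproduces the `TypeError` from the traceback in isolation and shows two common ways around it. The sample `line` value is hypothetical, and this is not necessarily how the test script itself was fixed.

```python
import re

# Hypothetical line as read from a subprocess pipe: bytes in Python 3.
line = b"12 tests passed"

# A str pattern applied to bytes raises TypeError in Python 3.
try:
    re.match('[0-9]+', line)
    raised = False
except TypeError:
    raised = True

# Option 1: decode the bytes first, then match with a str pattern.
decoded_match = re.match('[0-9]+', line.decode("utf-8"))

# Option 2: keep the bytes and use a bytes pattern instead.
bytes_match = re.match(b'[0-9]+', line)

print(raised, decoded_match.group(0), bytes_match.group(0))
```

Either option keeps the pattern and the input the same type, which is what `re` requires.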

@JoshRosen
Contributor

I've opened #7112 to hotfix the build issue which masked the test failure here.

@JoshRosen
Contributor

Jenkins, retest this please.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented Jun 30, 2015

Test build #36096 has started for PR 7095 at commit 2fde31a.

@SparkQA

SparkQA commented Jun 30, 2015

Test build #36096 has finished for PR 7095 at commit 2fde31a.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DecisionTreeClassificationModel(DecisionTreeModel):
    • class RandomForestClassificationModel(TreeEnsembleModels):
    • class GBTClassificationModel(TreeEnsembleModels):
    • class DecisionTreeModel(JavaModel):
    • class TreeEnsembleModels(JavaModel):
    • class DecisionTreeRegressionModel(DecisionTreeModel):
    • class RandomForestRegressionModel(TreeEnsembleModels):
    • class GBTRegressionModel(TreeEnsembleModels):

@AmplabJenkins

Merged build finished. Test FAILed.

@JoshRosen
Contributor

Now that the test reporting is fixed, we can see the failure:

**********************************************************************
File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/ml/regression.py", line 175, in __main__.DecisionTreeRegressor
Failed example:
    model.depth
Expected:
    2
Got:
    1
**********************************************************************
File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/ml/regression.py", line 177, in __main__.DecisionTreeRegressor
Failed example:
    model.numNodes
Expected:
    1
Got:
    3
**********************************************************************
   2 of  10 in __main__.DecisionTreeRegressor
***Test Failed*** 2 failures.
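
The values the doctest actually got (depth 1, numNodes 3) are consistent with each other: a tree with a single split has one internal node and two leaves. The sketch below shows that relationship with a hypothetical `Node` class; it assumes the convention that a lone leaf has depth 0 and is not Spark's implementation.

```python
class Node:
    """Hypothetical binary tree node; a node without children is a leaf."""

    def __init__(self, left=None, right=None):
        self.left = left
        self.right = right


def depth(node):
    """Edges on the longest root-to-leaf path (a lone leaf has depth 0)."""
    children = [c for c in (node.left, node.right) if c is not None]
    if not children:
        return 0
    return 1 + max(depth(c) for c in children)


def num_nodes(node):
    """Total number of nodes, counting internal nodes and leaves."""
    children = [c for c in (node.left, node.right) if c is not None]
    return 1 + sum(num_nodes(c) for c in children)


# One split: a root with two leaf children.
stump = Node(Node(), Node())
print(depth(stump), num_nodes(stump))  # 1 3
```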

@JoshRosen
Contributor

By the way, check out --help on python/run-tests if you haven't already. There are some neat options for controlling which suites, Python versions, and degree of parallelism to use, which might be a big productivity boost for you given all of the Python work that you've been doing lately.

@AmplabJenkins

Merged build triggered.

@MechCoder
Contributor Author

I renamed pyTreeWeights to javaTreeWeights.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented Jul 6, 2015

Test build #36598 has started for PR 7095 at commit 38a0860.

@SparkQA

SparkQA commented Jul 6, 2015

Test build #36598 has finished for PR 7095 at commit 38a0860.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DecisionTreeClassificationModel(DecisionTreeModel):
    • class RandomForestClassificationModel(TreeEnsembleModels):
    • class GBTClassificationModel(TreeEnsembleModels):
    • class DecisionTreeModel(JavaModel):
    • class TreeEnsembleModels(JavaModel):
    • class DecisionTreeRegressionModel(DecisionTreeModel):
    • class RandomForestRegressionModel(TreeEnsembleModels):
    • class GBTRegressionModel(TreeEnsembleModels):

@AmplabJenkins

Merged build finished. Test PASSed.

@@ -70,6 +71,10 @@ private[ml] trait TreeEnsembleModel {
/** Weights for each tree, zippable with [[trees]] */
def treeWeights: Array[Double]

/** Weights used by the python wrappers. */
// Note: An array cannot be returned directly due to serialization problems.
def javaTreeWeights: Vector = Vectors.dense(treeWeights)
Contributor

Make it package private. Java users do not really need it.

@mengxr
Contributor

mengxr commented Jul 6, 2015

LGTM except that the utility method should be package private.

@MechCoder
Contributor Author

Done. I could not make it private[python]; it was giving an error.

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@AmplabJenkins

Merged build finished. Test FAILed.

@MechCoder
Contributor Author

jenkins retest this please

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented Jul 7, 2015

Test build #36652 has started for PR 7095 at commit 23b08be.

@SparkQA

SparkQA commented Jul 7, 2015

Test build #36652 has finished for PR 7095 at commit 23b08be.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DecisionTreeClassificationModel(DecisionTreeModel):
    • class RandomForestClassificationModel(TreeEnsembleModels):
    • class GBTClassificationModel(TreeEnsembleModels):
    • class DecisionTreeModel(JavaModel):
    • class TreeEnsembleModels(JavaModel):
    • class DecisionTreeRegressionModel(DecisionTreeModel):
    • class RandomForestRegressionModel(TreeEnsembleModels):
    • class GBTRegressionModel(TreeEnsembleModels):

@AmplabJenkins

Merged build finished. Test FAILed.

@MechCoder
Contributor Author

jenkins retest this please

@mengxr
Contributor

mengxr commented Jul 7, 2015

test this please

@AmplabJenkins

Merged build triggered.

@AmplabJenkins

Merged build started.

@SparkQA

SparkQA commented Jul 7, 2015

Test build #36690 has started for PR 7095 at commit 23b08be.

@SparkQA

SparkQA commented Jul 7, 2015

Test build #36690 has finished for PR 7095 at commit 23b08be.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DecisionTreeClassificationModel(DecisionTreeModel):
    • class RandomForestClassificationModel(TreeEnsembleModels):
    • class GBTClassificationModel(TreeEnsembleModels):
    • class DecisionTreeModel(JavaModel):
    • class TreeEnsembleModels(JavaModel):
    • class DecisionTreeRegressionModel(DecisionTreeModel):
    • class RandomForestRegressionModel(TreeEnsembleModels):
    • class GBTRegressionModel(TreeEnsembleModels):

@AmplabJenkins

Merged build finished. Test PASSed.

@mengxr
Contributor

mengxr commented Jul 7, 2015

Merged into master. Thanks!

@asfgit asfgit closed this in 1dbc4a1 Jul 7, 2015
@MechCoder MechCoder deleted the missing_methods_tree branch July 7, 2015 16:23