[SPARK-3181][MLLIB]: Add Robust Regression Algorithm with Huber Estimator #7722

Closed
wants to merge 16 commits into from

@fjiang6 fjiang6 commented Jul 28, 2015

Huber Robust Regression under spark/ml/regression

SparkQA commented Jul 28, 2015

Test build #38683 has finished for PR 7722 at commit dcd757b.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class RobustRegression(override val uid: String)
    • public static class StructWriter
    • abstract class InternalRow extends Serializable with SpecializedGetters
    • implicit class DslLogicalPlan(val logicalPlan: LogicalPlan)
    • case class CreateStructUnsafe(children: Seq[Expression]) extends Expression
    • case class CreateNamedStructUnsafe(children: Seq[Expression]) extends Expression
    • case class LastDay(startDate: Expression) extends UnaryExpression with ImplicitCastInputTypes
    • case class NextDay(startDate: Expression, dayOfWeek: Expression)
    • case class TungstenProject(projectList: Seq[NamedExpression], child: SparkPlan) extends UnaryNode

SparkQA commented Jul 29, 2015

Test build #38787 has finished for PR 7722 at commit fbd0b64.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class RobustRegression(override val uid: String)
    • implicit class DslLogicalPlan(val logicalPlan: LogicalPlan)

SparkQA commented Jul 29, 2015

Test build #38829 has finished for PR 7722 at commit c980a1f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class RobustRegression(override val uid: String)

SparkQA commented Jul 29, 2015

Test build #38827 has finished for PR 7722 at commit e693c54.

  • This patch fails PySpark unit tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • class RobustRegression(override val uid: String)

SparkQA commented Jul 29, 2015

Test build #38826 has finished for PR 7722 at commit dd70763.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • class RobustRegression(override val uid: String)

SparkQA commented Jul 29, 2015

Test build #38828 has finished for PR 7722 at commit 952dcab.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • class RobustRegression(override val uid: String)

Contributor

mengxr commented Jul 29, 2015

@dbtsai @srowen Need your input to decide whether we want to add costFunc: Param[String] to LinearRegression or create a new class RobustRegression (or RobustLinearRegression).

Author

fjiang6 commented Jul 30, 2015

@dbtsai @srowen Need your input to decide whether we want to add costFunc: Param[String] to LinearRegression or create a new class RobustRegression (or RobustLinearRegression).

Member

srowen commented Jul 30, 2015

Hm... I suppose I would expect to optionally change the cost function to something like absolute error, rather than introduce a different class. This is still essentially linear regression, right?

If the difference is more than the cost function, I could see making a parallel implementation, but that seems like a lot of duplication to avoid if possible.

Contributor

mengxr commented Jul 30, 2015

Discussed with @dbtsai offline. He suggested using LinearRegression since the output model remains the same no matter what loss function we use.

Member

dbtsai commented Jul 30, 2015

I would keep them in the same LinearRegression codebase, as @mengxr said. Almost 90% of the code is the same, so a separate class would be hard to maintain. BTW, I can take over the code review for this PR.
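For readers following the design discussion, here is a minimal sketch of how a single estimator could switch between squared error and the Huber loss through a string-valued setting, in the spirit of the `costFunc: Param[String]` idea above. It is not the PR's `HuberCostFun`: the object name `HuberSketch`, the loss names `"squaredError"` and `"huber"`, and the threshold parameter `delta` are all assumptions made for illustration.

```scala
// Illustrative sketch only; names and the `delta` threshold are assumptions,
// not taken from the PR's code.
object HuberSketch {

  /** Huber loss of one residual (prediction - label); `delta` > 0 is the switch point. */
  def loss(residual: Double, delta: Double): Double = {
    val a = math.abs(residual)
    if (a <= delta) 0.5 * residual * residual // quadratic near zero
    else delta * (a - 0.5 * delta)            // linear in the tails, so outliers weigh less
  }

  /** Derivative of the Huber loss with respect to the residual. */
  def gradient(residual: Double, delta: Double): Double =
    if (math.abs(residual) <= delta) residual
    else delta * math.signum(residual)

  /** Picks a per-residual loss by name, the way a string-valued param might. */
  def lossFor(name: String, delta: Double): Double => Double = name match {
    case "squaredError" => r => 0.5 * r * r
    case "huber"        => r => loss(r, delta)
    case other          => throw new IllegalArgumentException(s"Unknown loss: $other")
  }
}
```

Only the per-residual loss and its gradient differ between the two cases; the fitted model is still a coefficient vector plus intercept, which matches the argument that the output model stays the same whichever loss is used.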

* It's used in Breeze's convex optimization routines.
*/
private class HuberCostFun(
data: RDD[(Double, Vector)],
Member

indentation

Member

dbtsai commented Jul 30, 2015

We also need unit tests.

SparkQA commented Aug 4, 2015

Test build #39626 has finished for PR 7722 at commit cff7ecb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

dbtsai commented Aug 4, 2015

Please add the unit tests. Thanks.

SparkQA commented Aug 4, 2015

Test build #39782 has finished for PR 7722 at commit 5a94f99.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class RobustRegression(override val uid: String)

SparkQA commented Aug 5, 2015

Test build #39786 has finished for PR 7722 at commit 412c34d.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class RobustRegression(override val uid: String)

Author

fjiang6 commented Aug 5, 2015

@AmplabJenkins I can build. Can you re-test please?

SparkQA commented Aug 5, 2015

Test build #39793 has finished for PR 7722 at commit 4f0865f.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class RobustRegression(override val uid: String)

Author

fjiang6 commented Aug 5, 2015

@AmplabJenkins Need your help. I can build with this command:
sbt publish-local -Pyarn -Phadoop-2.4 -Dhadoop.version=2.5.2

and I can run all the tests.

Please help me understand this error:
not enough arguments for constructor LinearRegressionTrainingSummary: (predictions: org.apache.spark.sql.DataFrame, predictionCol: String, labelCol: String, featuresCol: String, objectiveHistory: Array[Double])org.apache.spark.ml.regression.LinearRegressionTrainingSummary.
[error] Unspecified value parameter objectiveHistory.
[error] val trainingSummary = new LinearRegressionTrainingSummary(
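The message says the call site supplies fewer arguments than the five-parameter constructor it quotes, with `objectiveHistory` left unspecified. A minimal sketch of that situation follows, using a hypothetical stand-in class rather than Spark's real `LinearRegressionTrainingSummary`; `TrainingSummarySketch` and `SummaryExample` are invented names, and `predictions` is simplified to `Seq[Double]` instead of a `DataFrame`.

```scala
// Hypothetical stand-in that mirrors only the parameter list quoted in the error.
class TrainingSummarySketch(
    val predictions: Seq[Double], // simplified; the real parameter is a DataFrame
    val predictionCol: String,
    val labelCol: String,
    val featuresCol: String,
    val objectiveHistory: Array[Double])

object SummaryExample {
  def build(): TrainingSummarySketch = {
    // Dropping the last argument here would fail to compile with
    // "Unspecified value parameter objectiveHistory".
    new TrainingSummarySketch(
      Seq(1.0, 2.0),
      "prediction",
      "label",
      "features",
      objectiveHistory = Array(0.5, 0.25))
  }
}
```

Passing a value for every listed parameter, including `objectiveHistory`, is what the compiler is asking for at the `new LinearRegressionTrainingSummary(` call.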

SparkQA commented Aug 6, 2015

Test build #39949 has finished for PR 7722 at commit a79855a.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
    • class RobustRegression(override val uid: String)

@asfgit asfgit closed this in 0d9ab01 Sep 15, 2015
5 participants