
[SPARK-19270][ML] Add summary table to GLM summary #16630

Closed
wants to merge 21 commits

Conversation

actuaryzhang
Contributor

@actuaryzhang actuaryzhang commented Jan 18, 2017

What changes were proposed in this pull request?

Add an R-like summary table to the GLM summary, which includes the feature names (if they exist), parameter estimates, standard errors, t-statistics and p-values. This allows Scala users to easily gather these commonly used inference results.

@srowen @yanboliang @felixcheung

How was this patch tested?

New tests: one for the feature names and one for the summary table.

@actuaryzhang
Contributor Author

actuaryzhang commented Jan 18, 2017

The following code illustrates the idea of this PR. Let me know if this makes sense. Thanks.

val datasetWithWeight = Seq(
  (1.0, 1.0, 0.0, 5.0),
  (0.5, 2.0, 1.0, 2.0),
  (1.0, 3.0, 2.0, 1.0),
  (0.0, 4.0, 3.0, 3.0)
).toDF("y", "w", "x1", "x2")

val formula = new RFormula()
  .setFormula("y ~ x1 + x2")
  .setFeaturesCol("features")
  .setLabelCol("label")
val output = formula.fit(datasetWithWeight).transform(datasetWithWeight)

val glr = new GeneralizedLinearRegression()
val model = glr.fit(output)
model.summary.summaryTable.show

This prints out:

+---------+--------------------+-------------------+-------------------+-------------------+
|  Feature|            Estimate|           StdError|             TValue|             PValue|
+---------+--------------------+-------------------+-------------------+-------------------+
|Intercept|  1.4523809523809539| 0.9245946589975053| 1.5708299180050451| 0.3609009059280113|
|       x1|-0.33333333333333387|0.28171808490950573|-1.1832159566199243|0.44669962096188565|
|       x2|-0.11904761904761924|   0.21295885499998|-0.5590169943749482| 0.6754896416955616|
+---------+--------------------+-------------------+-------------------+-------------------+

@yanboliang
Contributor

Jenkins, test this please

@actuaryzhang
Contributor Author

The test just refuses to start. Could someone help? Many thanks!
@srowen @jkbradley @MLnick @yanboliang

@srowen
Member

srowen commented Jan 25, 2017

Jenkins add to whitelist

@srowen
Member

srowen commented Jan 25, 2017

Jenkins test this please

@SparkQA

SparkQA commented Jan 25, 2017

Test build #71985 has finished for PR 16630 at commit 6173ba9.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 25, 2017

Test build #71998 has finished for PR 16630 at commit 78bb77f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@actuaryzhang
Contributor Author

Could somebody help review this PR? I think this will make gathering the estimation results in Scala much easier. It will also be helpful in constructing tests. For example, the GLM tests with weights can be simplified a lot if we have all results in arrays and the standard errors etc. are aligned with the coefficients (the current GLM tests with weights force no intercept to avoid this nuisance).

@sethah @imatiach-msft @felixcheung

@imatiach-msft
Contributor

@actuaryzhang sorry, I'm at Spark Summit East; will take a look soon. For the feature names (or "lazy val featureName: Array[String]"), I recall there are sparse (e.g. output by HashingTF) and dense versions of the metadata for the StructField. I need to look into that code a bit more to understand whether it works...

val model = trainer.fit(dataset)
val summaryTable = model.summary.summaryTable

summaryTable.select("Feature").rdd.collect.map(_.getString(0))
Member

you shouldn't need .rdd.collect here; plain .collect() should work?

val featureAttrs = AttributeGroup.fromStructField(
dataset.schema(model.getFeaturesCol)).attributes
if (featureAttrs == None) {
Array.tabulate[String](origModel.numFeatures)((x: Int) => ("V" + (x + 1)))
Member

is there another way feature names could be used elsewhere? The concern is that names from here would not match feature names in other places if this is not stored in the model.

might need to add more tests for this?

Contributor Author

@felixcheung The feature names were not available prior to this PR, right? One other place I see that does a similar summary is the GeneralizedLinearRegressionWrapper for R. Do you think we should consolidate the two, e.g., update the wrapper to use the summary table directly?

Member

quite possibly - could you check what would be new or removed with that approach?

.zip(expectedFeature(idx)).foreach{ x => assert(x._1 === x._2,
"Feature name mismatch in summaryTable") }
assert(Vectors.dense(summaryTable.select("Estimate").rdd.collect.map(_.getDouble(0)))
~= expectedEstimate(idx) absTol 1E-3, "Coefficient mismatch in summaryTable")
Member

use ~==

@felixcheung
Member

Let's ping @yanboliang and wait for @imatiach-msft to take a look.

(featureNames(i), coefficients(i), coefficientStandardErrors(i), tValues(i), pValues(i))

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._
Contributor

minor comment: might it be better to simplify this as "import dataset.sparkSession.implicits._", or is there a reason to prefer the SparkSession.builder().getOrCreate()?

Contributor Author

I was using the Spark session and implicits to be able to use toDF to create a DataFrame with column names from a Seq. Could you explain how this import dataset.sparkSession.implicits._ works? I could not import it in the spark-shell:

<console>:56: error: not found: value dataset
       import dataset.sparkSession.implicits._

lazy val featureName: Array[String] = {
val featureAttrs = AttributeGroup.fromStructField(
dataset.schema(model.getFeaturesCol)).attributes
if (featureAttrs == None) {
Contributor

if I do the example below in spark-shell:

import org.apache.spark.ml.feature.HashingTF
val tf = new HashingTF().setInputCol("x").setOutputCol("hash")
val df = spark.createDataFrame(Seq(Tuple3(0.0,Array("a", "b"), 4), Tuple3(1.0, Array("b", "c"), 6), Tuple3(1.0, Array("a", "c"), 7), Tuple3(0.0, Array("b","c"), 7))).toDF("y", "x", "z")
val dfres = tf.transform(df)

when doing show():
scala> dfres.show
+---+------+---+--------------------+
| y| x| z| hash|
+---+------+---+--------------------+
|0.0|[a, b]| 4|(262144,[30913,22...|
|1.0|[b, c]| 6|(262144,[28698,30...|
|1.0|[a, c]| 7|(262144,[28698,22...|
|0.0|[b, c]| 7|(262144,[28698,30...|
+---+------+---+--------------------+

but, when I look at schema:
import org.apache.spark.ml.attribute.AttributeGroup
scala> AttributeGroup.fromStructField(dfres.schema("hash")).attributes
res5: Option[Array[org.apache.spark.ml.attribute.Attribute]] = None

scala> AttributeGroup.fromStructField(dfres.schema("hash"))
res6: org.apache.spark.ml.attribute.AttributeGroup = {"ml_attr":{"num_attrs":262144}}

but in this case the name should be of the form: hash_{#}
instead of V{#}
for example, when using VectorAssembler on the above:
import org.apache.spark.ml.feature.VectorAssembler
val va = new VectorAssembler().setInputCols(Array("y","z","hash")).setOutputCol("outputs")
scala> va.transform(dfres).show()
+---+------+---+--------------------+--------------------+
| y| x| z| hash| outputs|
+---+------+---+--------------------+--------------------+
|0.0|[a, b]| 4|(262144,[30913,22...|(262146,[1,30915,...|
|1.0|[b, c]| 6|(262144,[28698,30...|(262146,[0,1,2870...|
|1.0|[a, c]| 7|(262144,[28698,22...|(262146,[0,1,2870...|
|0.0|[b, c]| 7|(262144,[28698,30...|(262146,[1,28700,...|
+---+------+---+--------------------+--------------------+

scala> print(AttributeGroup.fromStructField(va.transform(dfres).schema("outputs")).attributes.get)
[Lorg.apache.spark.ml.attribute.Attribute;@4416197b
scala> AttributeGroup.fromStructField(va.transform(dfres).schema("outputs")).attributes.get
res22: Array[org.apache.spark.ml.attribute.Attribute] = Array({"type":"numeric","idx":0,"name":"y"}, {"type":"numeric","idx":1,"name":"z"}, {"type":"numeric","idx":2,"name":"hash_0"}, {"type":"numeric","idx":3,"name":"hash_1"}, {"type":"numeric","idx":4,"name":"hash_2"}, {"type":"numeric","idx":5,"name":"hash_3"}, {"type":"numeric","idx":6,"name":"hash_4"}, {"type":"numeric","idx":7,"name":"hash_5"}, {"type":"numeric","idx":8,"name":"hash_6"}, {"type":"numeric","idx":9,"name":"hash_7"}, {"type":"numeric","idx":10,"name":"hash_8"}, {"type":"numeric","idx":11,"name":"hash_9"}, {"type":"numeric","idx":12,"name":"hash_10"}, {"type":"numeric","idx":13,"name":"hash_11"}, {"type":"numeric","idx":14,"name":"hash_12"}, {"type":"numeric","idx":15,"name":"hash_13"}, {"type":"numeric","idx":16,"nam...

you can see that the attributes are given the column name followed by the index.
This seems like a bug in the VectorAssembler, because it is making the schema dense when it should be sparse, but regardless this seems to be the more official way to represent the name of the attributes instead of using a "V" followed by index - unless you have seen the "V" + index used elsewhere?

Contributor Author

@imatiach-msft This makes sense. I have now changed the code to mirror the same logic. When attributes are missing, the default names are set to the features column name with suffixes "_0", "_1", etc.
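As a minimal, standalone sketch of that fallback (the helper name is mine, not the merged implementation), the default names could be generated like this:

```scala
// Hypothetical helper mirroring the fallback described above: when the
// features column carries no ML attribute metadata, build default names
// from the column name plus an index suffix.
def defaultFeatureNames(featuresCol: String, numFeatures: Int): Array[String] =
  Array.tabulate(numFeatures)(i => s"${featuresCol}_$i")

val names = defaultFeatureNames("features", 3)
// names: Array("features_0", "features_1", "features_2")
```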

var coefficients = model.coefficients.toArray
var idx = Array.range(0, coefficients.length)
if (model.getFitIntercept) {
featureNames = featureNames :+ "Intercept"
Contributor

would it be possible to move this to a constant ("Intercept")

Contributor Author

Done

var idx = 0
for (fitIntercept <- Seq(false, true)) {
val trainer = new GeneralizedLinearRegression()
.setFamily("gaussian")
Contributor

not related to this code review, but it's unfortunate that these aren't constants that can be referenced from the model; it's messy to have to type strings like this everywhere as opposed to referencing variables

Contributor Author

Indeed, there is an object Gaussian, and one can use Gaussian.name for the string name.
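To illustrate the idea in isolation (this standalone object is only a stand-in; the real family objects live inside Spark's GeneralizedLinearRegression):

```scala
// Stand-in for Spark's internal Gaussian family object; referencing a
// named constant avoids scattering the raw string "gaussian" everywhere.
object Gaussian { val name: String = "gaussian" }

val family = Gaussian.name  // instead of the literal "gaussian"
```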

Member

Gaussian.name.toLowerCase (or Gaussian.name, since it is converted to lowercase later) would generally be the approach.

But this is a test suite, so I think it's OK.

Contributor

I would usually prefer to use variables wherever possible, as it is much easier to update them through various editors and, in general, much easier to catch compile-time rather than runtime errors. But it is a minor point, and it looks like this is consistent with most of the Spark codebase.

result.toDF("Feature", "Estimate", "StdError", "TValue", "PValue").repartition(1)
} else {
throw new UnsupportedOperationException(
"No summary table available for this GeneralizedLinearRegressionModel")
Contributor

minor suggestion: it would be nice to add a test to verify this exception is thrown (and with the right error message using the withClue() check)
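A minimal sketch of such a check, in plain Scala rather than ScalaTest's withClue (the stand-in method below is hypothetical; the real accessor is model.summary.summaryTable):

```scala
// Hypothetical stand-in for the accessor under review: it throws the
// same exception and message as the diff above when no table exists.
def summaryTableOrThrow(available: Boolean): String =
  if (available) "summary table"
  else throw new UnsupportedOperationException(
    "No summary table available for this GeneralizedLinearRegressionModel")

// Capture the message so both the exception type and its text can be checked.
val message =
  try { summaryTableOrThrow(available = false); None }
  catch { case e: UnsupportedOperationException => Some(e.getMessage) }
```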


val spark = SparkSession.builder().getOrCreate()
import spark.implicits._
result.toDF("Feature", "Estimate", "StdError", "TValue", "PValue").repartition(1)
Contributor

question: is "Estimate" the better term to use here as opposed to "Coefficient"? Are there other libraries which use this specific term in this case?

Contributor Author

R was using 'Estimate'. I changed it to 'Coefficient' now.

* set default names to "V1", "V2", and so on.
*/
@Since("2.2.0")
lazy val featureName: Array[String] = {
Contributor

minor comment: it looks like this is an array so should be plural, as in "featureNames" instead of "featureName" without the s

Contributor Author

OK, changed.

@imatiach-msft
Contributor

the code looks very good, I added a few minor comments, will take another look tomorrow, thanks!

@actuaryzhang
Contributor Author

@felixcheung @imatiach-msft Thanks much for the review. I have made most of the suggested changes. Please see my inline replies.

@SparkQA

SparkQA commented Feb 14, 2017

Test build #72901 has finished for PR 16630 at commit b67d3fd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


val spark = SparkSession.builder().getOrCreate()
import spark.implicits._
result.toDF("Feature", "Coefficient", "StdError", "TValue", "PValue").repartition(1)
Contributor

Sorry, I didn't realize that R uses Estimate instead of coefficient - if you feel strongly about using Estimate here instead you can change this back. Up to you.

dataset.schema(model.getFeaturesCol)).attributes
if (featureAttrs == None) {
Array.tabulate[String](origModel.numFeatures)(
(x: Int) => (model.getFeaturesCol + "_" + x))
Contributor

in general I would have preferred to create a platform-level function (or use one if it exists) to format the strings in the same way, so there is no duplicate code in VectorAssembler vs here that can diverge (and which other functions in spark can generally use). However, this seems a bit out of scope of this code review, so I don't think you need to do this.

@imatiach-msft
Contributor

Thanks for the updates, the changes look good to me. One question, out of scope of the specific changes in this review: are there any other summary statistics that we could add in the future? Maybe R^2 and adjusted R^2? Also, do you know of any good reference papers that have an overview of the most popular summary statistics used in GLM (not including the ones in this pull request)?


val expectedFeature = Seq(Array("features_0", "features_1"),
Array("(Intercept)", "features_0", "features_1"))
val expectedEstimate = Seq(Vectors.dense(0.2884, 0.538),
Contributor

is this comparing the summary to the results from R? If so, in general you should add, in a comment, the R code that was used to generate the expected results, so that the expected values are reproducible.

Contributor Author

Thanks. Added in R code.

~== expectedEstimate(idx) absTol 1E-3, "Coefficient mismatch in summaryTable")
assert(Vectors.dense(summaryTable.select("StdError").rdd.collect.map(_.getDouble(0)))
~== expectedStdError(idx) absTol 1E-3, "Standard error mismatch in summaryTable")
assert(Vectors.dense(summaryTable.select("TValue").rdd.collect.map(_.getDouble(0)))
Contributor

it looks like for all of these below you can just call collect instead of rdd.collect?

Contributor Author

Fixed.

@actuaryzhang
Contributor Author

@imatiach-msft @felixcheung
I cleaned up the tests as suggested, and also updated the R GLM wrapper to use the result from this PR. Please let me know if there are any other suggestions. Thanks much for the review and comments.

@imatiach-msft
Contributor

@actuaryzhang thanks, LGTM!

@actuaryzhang
Contributor Author

@yanboliang Thanks for the suggestions. I have made a new commit that addresses your comments.
In the new version, I used an array of tuples to represent the coefficient matrix. I used tuples because I have mixed types of strings and doubles (it is necessary to store the feature names, since they also depend on whether there is an intercept). I then wrote a showString function, similar to the one in the Dataset class, that compiles all the summary info into a string, and defined show methods to print out the estimated model. The output is very similar to R's, except that I did not show the residuals and significance levels. Please let me know your thoughts on this update.

Below is an example of the call and the output:

model.summary.show()
+-----------+--------+--------+------+------+
|    Feature|Estimate|StdError|TValue|PValue|
+-----------+--------+--------+------+------+
|(Intercept)|   0.790|   4.013| 0.197| 0.862|
| features_0|   0.226|   2.115| 0.107| 0.925|
| features_1|   0.468|   0.582| 0.804| 0.506|
+-----------+--------+--------+------+------+

(Dispersion parameter for gaussian family taken to be 14.516)
    Null deviance: 46.800 on 2 degrees of freedom
Residual deviance: 29.032 on 2 degrees of freedom
AIC: 30.984
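The table-formatting step behind show() can be sketched in plain Scala (column names follow the output above; the alignment logic and helper names are my guesses, not the merged code):

```scala
// Illustrative showString-style formatter over the summary tuples.
// Right-aligns each column to its widest cell, like the output above.
def showString(rows: Seq[(String, Double, Double, Double, Double)],
               numDigits: Int = 3): String = {
  // Locale pinned so the decimal separator is always a dot.
  def fmt(x: Double): String =
    String.format(java.util.Locale.US, s"%.${numDigits}f", Double.box(x))
  val header = Seq("Feature", "Estimate", "StdError", "TValue", "PValue")
  val table = header +: rows.map { case (f, est, se, t, p) =>
    Seq(f, fmt(est), fmt(se), fmt(t), fmt(p))
  }
  val widths = header.indices.map(i => table.map(_(i).length).max)
  table.map { row =>
    row.indices.map(i => row(i).reverse.padTo(widths(i), ' ').reverse).mkString(" ")
  }.mkString("\n")
}

val out = showString(Seq(("(Intercept)", 0.790, 4.013, 0.197, 0.862)))
```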

@SparkQA

SparkQA commented Jul 17, 2017

Test build #79685 has finished for PR 16630 at commit a16cbee.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 18, 2017

Test build #79686 has finished for PR 16630 at commit 640d564.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 18, 2017

Test build #79688 has finished for PR 16630 at commit 57f1e5c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* @since 2.3.0
*/
// scalastyle:off println
def show(numRows: Int, truncate: Boolean, numDigits: Int): Unit = if (truncate) {
Contributor

I think not all of these functions are useful for the GLM summary. I'd recommend keeping only one show function with default settings, such as numRows = coefficientMatrix.size, truncate = 20 and numDigits = 3. It differs little from Dataset.show, and it's not necessary to provide lots of options for users to set; users just want to see output like R's. The code will then be cleaner.

}

private[regression] def showString(_numRows: Int, truncate: Int = 20,
numDigits: Int = 3): String = {
Contributor

Align.

* tValue and pValue.
*/
@Since("2.3.0")
lazy val coefficientMatrix: Array[(String, Double, Double, Double, Double)] = {
Contributor

This is not a matrix, so it's not appropriate to name it coefficientMatrix. Since it's only used for generating the summary string output, what about keeping it private or inlining it?

@actuaryzhang
Contributor Author

Made a new commit to address the comments.

@SparkQA

SparkQA commented Jul 19, 2017

Test build #79766 has finished for PR 16630 at commit 174fc49.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 26, 2017

Test build #79970 has finished for PR 16630 at commit 7281b77.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Contributor

LGTM, merged into master. Thanks, all.

@asfgit asfgit closed this in ddcd2e8 Jul 27, 2017
@yanboliang
Contributor

Note that the output format of the GLR summary.toString is:

Coefficients:
   Feature Estimate Std Error    T Value P Value
features_0  2.21304   0.00279  792.03163 0.00000
features_1  0.83096   0.00080 1042.07543 0.00000

(Dispersion parameter for gaussian family taken to be 0.06483)
Null deviance: 2344915.50893 on 9998 degrees of freedom
Residual deviance: 648.20325 on 9998 degrees of freedom
AIC: 1023.40993

@actuaryzhang actuaryzhang deleted the glmTable branch July 27, 2017 16:28