
[SPARK-19270][ML] Add summary table to GLM summary #16630

Closed
wants to merge 21 commits

Conversation

actuaryzhang
Contributor

@actuaryzhang actuaryzhang commented Jan 18, 2017

What changes were proposed in this pull request?

Add an R-like summary table to the GLM summary, which includes the feature names (if they exist), parameter estimates, standard errors, t-statistics and p-values. This allows Scala users to easily gather these commonly used inference results.

@srowen @yanboliang @felixcheung

How was this patch tested?

New tests: one for the feature names and one for the summary table.

@actuaryzhang
Contributor Author

actuaryzhang commented Jan 18, 2017

The following code illustrates the idea of this PR. Let me know if this makes sense. Thanks.

val datasetWithWeight = Seq(
  (1.0, 1.0, 0.0, 5.0),
  (0.5, 2.0, 1.0, 2.0),
  (1.0, 3.0, 2.0, 1.0),
  (0.0, 4.0, 3.0, 3.0)
).toDF("y", "w", "x1", "x2")

val formula = new RFormula()
  .setFormula("y ~ x1 + x2")
  .setFeaturesCol("features")
  .setLabelCol("label")
val output = formula.fit(datasetWithWeight).transform(datasetWithWeight)

val glr = new GeneralizedLinearRegression()
val model = glr.fit(output)
model.summary.summaryTable.show

This prints out:

+---------+--------------------+-------------------+-------------------+-------------------+
|  Feature|            Estimate|           StdError|             TValue|             PValue|
+---------+--------------------+-------------------+-------------------+-------------------+
|Intercept|  1.4523809523809539| 0.9245946589975053| 1.5708299180050451| 0.3609009059280113|
|       x1|-0.33333333333333387|0.28171808490950573|-1.1832159566199243|0.44669962096188565|
|       x2|-0.11904761904761924|   0.21295885499998|-0.5590169943749482| 0.6754896416955616|
+---------+--------------------+-------------------+-------------------+-------------------+

@yanboliang
Contributor

Jenkins, test this please

@actuaryzhang
Contributor Author

The test just refuses to start. Could someone help? Many thanks!
@srowen @jkbradley @MLnick @yanboliang

@srowen
Member

srowen commented Jan 25, 2017

Jenkins add to whitelist

@srowen
Member

srowen commented Jan 25, 2017

Jenkins test this please

@SparkQA

SparkQA commented Jan 25, 2017

Test build #71985 has finished for PR 16630 at commit 6173ba9.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 25, 2017

Test build #71998 has finished for PR 16630 at commit 78bb77f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@actuaryzhang
Contributor Author

Could somebody help review this PR? I think this will make gathering the estimation results in Scala much easier. It will also be helpful in constructing tests. For example, the GLM tests with weights can be simplified a lot if we have all results in arrays and the standard errors etc. are aligned with the coefficients (the current GLM tests with weights force no intercept to avoid this nuisance).

@sethah @imatiach-msft @felixcheung

@imatiach-msft
Contributor

@actuaryzhang sorry, I'm at Spark Summit East; will take a look soon. For the feature names (or "lazy val featureName: Array[String]"), I recall there are sparse (e.g. output by HashingTF) and dense versions of the metadata for the StructField. I need to look into that code a bit more to understand whether it works...

val model = trainer.fit(dataset)
val summaryTable = model.summary.summaryTable

summaryTable.select("Feature").rdd.collect.map(_.getString(0))
Member

you shouldn't need .rdd.collect here; plain .collect() should work?

val featureAttrs = AttributeGroup.fromStructField(
dataset.schema(model.getFeaturesCol)).attributes
if (featureAttrs == None) {
Array.tabulate[String](origModel.numFeatures)((x: Int) => ("V" + (x + 1)))
Member

is there another way feature names could be used elsewhere? The concern is that names from here would not match feature names in other places if this is not stored in the model.

might need to add more tests for this?

Contributor Author

@felixcheung The feature names were not available prior to this PR, right? One other place I see that does a similar summary is the GeneralizedLinearRegressionWrapper for R. Do you think we should consolidate the two, e.g., update the wrapper to use the summary table directly?

Member

quite possibly - could you check what would be new or removed with that approach?

.zip(expectedFeature(idx)).foreach{ x => assert(x._1 === x._2,
"Feature name mismatch in summaryTable") }
assert(Vectors.dense(summaryTable.select("Estimate").rdd.collect.map(_.getDouble(0)))
~= expectedEstimate(idx) absTol 1E-3, "Coefficient mismatch in summaryTable")
Member

use ~==

@felixcheung
Member

Let's ping @yanboliang and wait for @imatiach-msft to take a look.

(featureNames(i), coefficients(i), coefficientStandardErrors(i), tValues(i), pValues(i))

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._
Contributor

minor comment: might it be better to simplify this as "import dataset.sparkSession.implicits._", or is there a reason to prefer the SparkSession.builder().getOrCreate()?

Contributor Author

I was using the Spark session and implicits to be able to use toDF to create a DataFrame with column names from a Seq. Could you explain how this import dataset.sparkSession.implicits._ works? I could not import it in the spark-shell:

<console>:56: error: not found: value dataset
       import dataset.sparkSession.implicits._

lazy val featureName: Array[String] = {
val featureAttrs = AttributeGroup.fromStructField(
dataset.schema(model.getFeaturesCol)).attributes
if (featureAttrs == None) {
Contributor

if I do the example below in spark-shell:

import org.apache.spark.ml.feature.HashingTF
val tf = new HashingTF().setInputCol("x").setOutputCol("hash")
val df = spark.createDataFrame(Seq(Tuple3(0.0,Array("a", "b"), 4), Tuple3(1.0, Array("b", "c"), 6), Tuple3(1.0, Array("a", "c"), 7), Tuple3(0.0, Array("b","c"), 7))).toDF("y", "x", "z")
val dfres = tf.transform(df)

when doing show():
scala> dfres.show
+---+------+---+--------------------+
| y| x| z| hash|
+---+------+---+--------------------+
|0.0|[a, b]| 4|(262144,[30913,22...|
|1.0|[b, c]| 6|(262144,[28698,30...|
|1.0|[a, c]| 7|(262144,[28698,22...|
|0.0|[b, c]| 7|(262144,[28698,30...|
+---+------+---+--------------------+

but, when I look at schema:
import org.apache.spark.ml.attribute.AttributeGroup
scala> AttributeGroup.fromStructField(dfres.schema("hash")).attributes
res5: Option[Array[org.apache.spark.ml.attribute.Attribute]] = None

scala> AttributeGroup.fromStructField(dfres.schema("hash"))
res6: org.apache.spark.ml.attribute.AttributeGroup = {"ml_attr":{"num_attrs":262144}}

but in this case the name should be of the form: hash_{#}
instead of V{#}
for example, when using VectorAssembler on the above:
import org.apache.spark.ml.feature.VectorAssembler
val va = new VectorAssembler().setInputCols(Array("y","z","hash")).setOutputCol("outputs")
scala> va.transform(dfres).show()
+---+------+---+--------------------+--------------------+
| y| x| z| hash| outputs|
+---+------+---+--------------------+--------------------+
|0.0|[a, b]| 4|(262144,[30913,22...|(262146,[1,30915,...|
|1.0|[b, c]| 6|(262144,[28698,30...|(262146,[0,1,2870...|
|1.0|[a, c]| 7|(262144,[28698,22...|(262146,[0,1,2870...|
|0.0|[b, c]| 7|(262144,[28698,30...|(262146,[1,28700,...|
+---+------+---+--------------------+--------------------+

scala> print(AttributeGroup.fromStructField(va.transform(dfres).schema("outputs")).attributes.get)
[Lorg.apache.spark.ml.attribute.Attribute;@4416197b
scala> AttributeGroup.fromStructField(va.transform(dfres).schema("outputs")).attributes.get
res22: Array[org.apache.spark.ml.attribute.Attribute] = Array({"type":"numeric","idx":0,"name":"y"}, {"type":"numeric","idx":1,"name":"z"}, {"type":"numeric","idx":2,"name":"hash_0"}, {"type":"numeric","idx":3,"name":"hash_1"}, {"type":"numeric","idx":4,"name":"hash_2"}, {"type":"numeric","idx":5,"name":"hash_3"}, {"type":"numeric","idx":6,"name":"hash_4"}, {"type":"numeric","idx":7,"name":"hash_5"}, {"type":"numeric","idx":8,"name":"hash_6"}, {"type":"numeric","idx":9,"name":"hash_7"}, {"type":"numeric","idx":10,"name":"hash_8"}, {"type":"numeric","idx":11,"name":"hash_9"}, {"type":"numeric","idx":12,"name":"hash_10"}, {"type":"numeric","idx":13,"name":"hash_11"}, {"type":"numeric","idx":14,"name":"hash_12"}, {"type":"numeric","idx":15,"name":"hash_13"}, {"type":"numeric","idx":16,"nam...

you can see that the attributes are given the column name followed by the index.
This seems like a bug in the VectorAssembler, because it is making the schema dense when it should be sparse, but regardless this seems to be the more official way to represent the name of the attributes instead of using a "V" followed by index - unless you have seen the "V" + index used elsewhere?

Contributor Author

@imatiach-msft This makes sense. I have now changed the code to mirror the same logic. When attributes are missing, the default names are set to the features column name with suffixes "_0", "_1", etc.
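As a minimal, standalone sketch of that fallback (the helper name is mine, not the merged implementation), the default names could be generated like this:

```scala
// Hypothetical helper mirroring the fallback described above: when the
// features column carries no ML attribute metadata, build default names
// from the column name plus an index suffix.
def defaultFeatureNames(featuresCol: String, numFeatures: Int): Array[String] =
  Array.tabulate(numFeatures)(i => s"${featuresCol}_$i")

val names = defaultFeatureNames("features", 3)
// names: Array("features_0", "features_1", "features_2")
```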

var coefficients = model.coefficients.toArray
var idx = Array.range(0, coefficients.length)
if (model.getFitIntercept) {
featureNames = featureNames :+ "Intercept"
Contributor

would it be possible to move this to a constant ("Intercept")

Contributor Author

Done

var idx = 0
for (fitIntercept <- Seq(false, true)) {
val trainer = new GeneralizedLinearRegression()
.setFamily("gaussian")
Contributor

not related to this code review, but it's unfortunate that these aren't constants that can be referenced from the model; it's messy to have to type strings like this everywhere as opposed to referencing variables

Contributor Author

Indeed, there is an object Gaussian, and one can use Gaussian.name for the string name.
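To illustrate the idea in isolation (this standalone object is only a stand-in; the real family objects live inside Spark's GeneralizedLinearRegression):

```scala
// Stand-in for Spark's internal Gaussian family object; referencing a
// named constant avoids scattering the raw string "gaussian" everywhere.
object Gaussian { val name: String = "gaussian" }

val family = Gaussian.name  // instead of the literal "gaussian"
```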

Member

Gaussian.name.toLowerCase (or Gaussian.name, since it is converted to lowercase later) would generally be the approach.

But this is a test suite, so I think it's OK.

Contributor

I would usually prefer to use variables wherever possible, as it is much easier to update them through various editors and, in general, much easier to catch compile-time rather than runtime errors. But it is a minor point, and it looks like this is consistent with most of the Spark codebase.

result.toDF("Feature", "Estimate", "StdError", "TValue", "PValue").repartition(1)
} else {
throw new UnsupportedOperationException(
"No summary table available for this GeneralizedLinearRegressionModel")
Contributor

minor suggestion: it would be nice to add a test to verify this exception is thrown (and with the right error message using the withClue() check)
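A minimal sketch of such a check, in plain Scala rather than ScalaTest's withClue (the stand-in method below is hypothetical; the real accessor is model.summary.summaryTable):

```scala
// Hypothetical stand-in for the accessor under review: it throws the
// same exception and message as the diff above when no table exists.
def summaryTableOrThrow(available: Boolean): String =
  if (available) "summary table"
  else throw new UnsupportedOperationException(
    "No summary table available for this GeneralizedLinearRegressionModel")

// Capture the message so both the exception type and its text can be checked.
val message =
  try { summaryTableOrThrow(available = false); None }
  catch { case e: UnsupportedOperationException => Some(e.getMessage) }
```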


val spark = SparkSession.builder().getOrCreate()
import spark.implicits._
result.toDF("Feature", "Estimate", "StdError", "TValue", "PValue").repartition(1)
Contributor

question: is "Estimate" the better term to use here as opposed to "Coefficient"? Are there other libraries which use this specific term in this case?

Contributor Author

R was using 'Estimate'. I changed it to 'Coefficient' now.

* set default names to "V1", "V2", and so on.
*/
@Since("2.2.0")
lazy val featureName: Array[String] = {
Contributor

minor comment: it looks like this is an array so should be plural, as in "featureNames" instead of "featureName" without the s

Contributor Author

OK, changed.

@imatiach-msft
Contributor

the code looks very good, I added a few minor comments, will take another look tomorrow, thanks!

@actuaryzhang
Contributor Author

@felixcheung @imatiach-msft Thanks much for the review. I have made most of the suggested changes. Please see my inline replies.

@SparkQA

SparkQA commented Feb 14, 2017

Test build #72901 has finished for PR 16630 at commit b67d3fd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


val spark = SparkSession.builder().getOrCreate()
import spark.implicits._
result.toDF("Feature", "Coefficient", "StdError", "TValue", "PValue").repartition(1)
Contributor

Sorry, I didn't realize that R uses Estimate instead of coefficient - if you feel strongly about using Estimate here instead you can change this back. Up to you.

dataset.schema(model.getFeaturesCol)).attributes
if (featureAttrs == None) {
Array.tabulate[String](origModel.numFeatures)(
(x: Int) => (model.getFeaturesCol + "_" + x))
Contributor

in general I would have preferred to create a platform-level function (or use one if it exists) to format the strings in the same way, so there is no duplicate code in VectorAssembler vs here that can diverge (and which other functions in spark can generally use). However, this seems a bit out of scope of this code review, so I don't think you need to do this.

@imatiach-msft
Contributor

Thanks for the updates, the changes look good to me. One question, out of scope of the specific changes in this review: are there any other summary statistics that we could add in the future? Maybe R^2 and adjusted R^2? Also, do you know of any good reference papers that have an overview of the most popular summary statistics used in GLM (not including the ones in this pull request)?


val expectedFeature = Seq(Array("features_0", "features_1"),
Array("(Intercept)", "features_0", "features_1"))
val expectedEstimate = Seq(Vectors.dense(0.2884, 0.538),
Contributor

is this comparing the summary to the results from R? If so, in general you should add, in a comment, the R code that was used to generate the expected results, so that the expected values are reproducible.

Contributor Author

Thanks. Added in R code.

~== expectedEstimate(idx) absTol 1E-3, "Coefficient mismatch in summaryTable")
assert(Vectors.dense(summaryTable.select("StdError").rdd.collect.map(_.getDouble(0)))
~== expectedStdError(idx) absTol 1E-3, "Standard error mismatch in summaryTable")
assert(Vectors.dense(summaryTable.select("TValue").rdd.collect.map(_.getDouble(0)))
Contributor

it looks like for all of these below you can just call collect instead of rdd.collect?

Contributor Author

Fixed.

@actuaryzhang
Contributor Author

@imatiach-msft @felixcheung
I cleaned up the tests as suggested, and also updated the R GLM wrapper to use the result from this PR. Please let me know if there are any other suggestions. Thanks much for the review and comments.

@imatiach-msft
Contributor

@actuaryzhang thanks, LGTM!

@actuaryzhang
Contributor Author

@yanboliang Thanks for the suggestions. I have made a new commit that addresses your comments.
In the new version, I used an array of tuples to represent the coefficient matrix. I used tuples because I have mixed types of strings and doubles (it is necessary to store the feature names, since they also depend on whether there is an intercept). I then wrote a showString function, similar to the one in the Dataset class, that compiles all the summary info into a string, and defined show methods to print out the estimated model. The output is very similar to R's, except that I did not show the residuals and significance levels. Please let me know your thoughts on this update.

Below is an example of the call and the output:

model.summary.show()
+-----------+--------+--------+------+------+
|    Feature|Estimate|StdError|TValue|PValue|
+-----------+--------+--------+------+------+
|(Intercept)|   0.790|   4.013| 0.197| 0.862|
| features_0|   0.226|   2.115| 0.107| 0.925|
| features_1|   0.468|   0.582| 0.804| 0.506|
+-----------+--------+--------+------+------+

(Dispersion parameter for gaussian family taken to be 14.516)
    Null deviance: 46.800 on 2 degrees of freedom
Residual deviance: 29.032 on 2 degrees of freedom
AIC: 30.984
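The table-formatting step behind show() can be sketched in plain Scala (column names follow the output above; the alignment logic and helper names are my guesses, not the merged code):

```scala
// Illustrative showString-style formatter over the summary tuples.
// Right-aligns each column to its widest cell, like the output above.
def showString(rows: Seq[(String, Double, Double, Double, Double)],
               numDigits: Int = 3): String = {
  // Locale pinned so the decimal separator is always a dot.
  def fmt(x: Double): String =
    String.format(java.util.Locale.US, s"%.${numDigits}f", Double.box(x))
  val header = Seq("Feature", "Estimate", "StdError", "TValue", "PValue")
  val table = header +: rows.map { case (f, est, se, t, p) =>
    Seq(f, fmt(est), fmt(se), fmt(t), fmt(p))
  }
  val widths = header.indices.map(i => table.map(_(i).length).max)
  table.map { row =>
    row.indices.map(i => row(i).reverse.padTo(widths(i), ' ').reverse).mkString(" ")
  }.mkString("\n")
}

val out = showString(Seq(("(Intercept)", 0.790, 4.013, 0.197, 0.862)))
```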

@SparkQA

SparkQA commented Jul 17, 2017

Test build #79685 has finished for PR 16630 at commit a16cbee.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 18, 2017

Test build #79686 has finished for PR 16630 at commit 640d564.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 18, 2017

Test build #79688 has finished for PR 16630 at commit 57f1e5c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* @since 2.3.0
*/
// scalastyle:off println
def show(numRows: Int, truncate: Boolean, numDigits: Int): Unit = if (truncate) {
Contributor

I think not all of these functions are useful for the GLM summary. I'd recommend keeping only one show function with default settings, such as numRows = coefficientMatrix.size, truncate = 20 and numDigits = 3. It differs little from Dataset.show, and it's not necessary to provide lots of options for users to set; users just want to see output like R's. The code will then be cleaner.

}

private[regression] def showString(_numRows: Int, truncate: Int = 20,
numDigits: Int = 3): String = {
Contributor

Align.

* tValue and pValue.
*/
@Since("2.3.0")
lazy val coefficientMatrix: Array[(String, Double, Double, Double, Double)] = {
Contributor

This is not a matrix, so it's not appropriate to name it coefficientMatrix. Since it's only used for generating the summary string output, what about keeping it private or inlining it?

@actuaryzhang
Contributor Author

Made a new commit to address the comments.

@SparkQA

SparkQA commented Jul 19, 2017

Test build #79766 has finished for PR 16630 at commit 174fc49.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 26, 2017

Test build #79970 has finished for PR 16630 at commit 7281b77.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Contributor

LGTM, merged into master. Thanks, all.

@asfgit asfgit closed this in ddcd2e8 Jul 27, 2017
@yanboliang
Contributor

Note that the output format of the GLR summary.toString is:

Coefficients:
   Feature Estimate Std Error    T Value P Value
features_0  2.21304   0.00279  792.03163 0.00000
features_1  0.83096   0.00080 1042.07543 0.00000

(Dispersion parameter for gaussian family taken to be 0.06483)
Null deviance: 2344915.50893 on 9998 degrees of freedom
Residual deviance: 648.20325 on 9998 degrees of freedom
AIC: 1023.40993

@actuaryzhang actuaryzhang deleted the glmTable branch July 27, 2017 16:28