[SPARK-19270][ML] Add summary table to GLM summary #16630
Conversation
The following code illustrates the idea of this PR. Let me know if this makes sense. Thanks.
This prints out:
Jenkins, test this please
This test just refuses to start. Could someone help? Many thanks!
Jenkins add to whitelist
Jenkins test this please
Test build #71985 has finished for PR 16630 at commit
Test build #71998 has finished for PR 16630 at commit
Could somebody help review this PR? I think this will make gathering the estimation results in Scala much easier. It will also be helpful in constructing the tests. For example, the GLM tests with weights could be simplified a lot if we had all results in arrays and the SEs etc. were aligned with the coefficients (the current GLM tests with weights force no intercept to avoid this nuisance).
@actuaryzhang sorry, I'm at Spark Summit East, will take a look soon. For the feature name, or "lazy val featureName: Array[String]", I recall there is a sparse (e.g. output by HashingTF) and a dense version of the metadata for the StructField; I need to look into that code a bit more to understand if it works...
val model = trainer.fit(dataset)
val summaryTable = model.summary.summaryTable

summaryTable.select("Feature").rdd.collect.map(_.getString(0))
you shouldn't need .rdd.collect here; just .collect()?
val featureAttrs = AttributeGroup.fromStructField(
  dataset.schema(model.getFeaturesCol)).attributes
if (featureAttrs == None) {
  Array.tabulate[String](origModel.numFeatures)((x: Int) => ("V" + (x + 1)))
is there another way feature names could be used elsewhere? The concern is that names generated here would not match feature names in other places if they are not stored in the model.
might need to add more tests for this?
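For context on the fallback under discussion, here is a minimal, Spark-free sketch of the `"V" + index` default naming; the function name is hypothetical, and `numFeatures` stands in for `origModel.numFeatures`:

```scala
// Sketch of the "V" + index fallback: when the features column carries
// no attribute metadata, synthesize generic names "V1", "V2", ...
def defaultFeatureNames(numFeatures: Int): Array[String] =
  Array.tabulate(numFeatures)(i => "V" + (i + 1))
```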
@felixcheung The feature names were not available prior to this PR, right? One other place I see that does a similar summary is the GeneralizedLinearRegressionWrapper for R. Do you think we should consolidate the two, e.g., update the wrapper to use the summary table directly?
quite possibly - could you check what would be new or removed with that approach?
  .zip(expectedFeature(idx)).foreach { x => assert(x._1 === x._2,
    "Feature name mismatch in summaryTable") }
assert(Vectors.dense(summaryTable.select("Estimate").rdd.collect.map(_.getDouble(0)))
  ~= expectedEstimate(idx) absTol 1E-3, "Coefficient mismatch in summaryTable")
use ~==
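The `~==` / `absTol` operators come from Spark's internal test utilities; a dependency-free sketch of the approximate-equality idea they implement (the helper name here is illustrative, not Spark's API):

```scala
// Elementwise approximate equality with an absolute tolerance, the idea
// behind the ~== ... absTol assertions in the test suite.
def approxEqual(a: Array[Double], b: Array[Double], absTol: Double): Boolean =
  a.length == b.length &&
    a.zip(b).forall { case (x, y) => math.abs(x - y) <= absTol }
```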
Let's ping @yanboliang and wait for @imatiach-msft to take a look.
(featureNames(i), coefficients(i), coefficientStandardErrors(i), tValues(i), pValues(i))

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._
minor comment: might it be better to simplify this as "import dataset.sparkSession.implicits._", or is there a reason to prefer the SparkSession.builder().getOrCreate()?
I was using the spark session and implicits to be able to use toDF to create a data frame with names from a Seq. Could you explain how this import dataset.sparkSession.implicits._ works? I could not import it in the spark shell:
<console>:56: error: not found: value dataset
import dataset.sparkSession.implicits._
lazy val featureName: Array[String] = {
  val featureAttrs = AttributeGroup.fromStructField(
    dataset.schema(model.getFeaturesCol)).attributes
  if (featureAttrs == None) {
if I do the example below in spark-shell:
import org.apache.spark.ml.feature.HashingTF
val tf = new HashingTF().setInputCol("x").setOutputCol("hash")
val df = spark.createDataFrame(Seq(Tuple3(0.0,Array("a", "b"), 4), Tuple3(1.0, Array("b", "c"), 6), Tuple3(1.0, Array("a", "c"), 7), Tuple3(0.0, Array("b","c"), 7))).toDF("y", "x", "z")
val dfres = tf.transform(df)
when doing show():
scala> dfres.show
+---+------+---+--------------------+
| y| x| z| hash|
+---+------+---+--------------------+
|0.0|[a, b]| 4|(262144,[30913,22...|
|1.0|[b, c]| 6|(262144,[28698,30...|
|1.0|[a, c]| 7|(262144,[28698,22...|
|0.0|[b, c]| 7|(262144,[28698,30...|
+---+------+---+--------------------+
but, when I look at schema:
import org.apache.spark.ml.attribute.AttributeGroup
scala> AttributeGroup.fromStructField(dfres.schema("hash")).attributes
res5: Option[Array[org.apache.spark.ml.attribute.Attribute]] = None
scala> AttributeGroup.fromStructField(dfres.schema("hash"))
res6: org.apache.spark.ml.attribute.AttributeGroup = {"ml_attr":{"num_attrs":262144}}
but in this case the name should be of the form hash_{#} instead of V{#}
for example, when using VectorAssembler on the above:
import org.apache.spark.ml.feature.VectorAssembler
val va = new VectorAssembler().setInputCols(Array("y","z","hash")).setOutputCol("outputs")
scala> va.transform(dfres).show()
+---+------+---+--------------------+--------------------+
| y| x| z| hash| outputs|
+---+------+---+--------------------+--------------------+
|0.0|[a, b]| 4|(262144,[30913,22...|(262146,[1,30915,...|
|1.0|[b, c]| 6|(262144,[28698,30...|(262146,[0,1,2870...|
|1.0|[a, c]| 7|(262144,[28698,22...|(262146,[0,1,2870...|
|0.0|[b, c]| 7|(262144,[28698,30...|(262146,[1,28700,...|
+---+------+---+--------------------+--------------------+
scala> print(AttributeGroup.fromStructField(va.transform(dfres).schema("outputs")).attributes.get)
[Lorg.apache.spark.ml.attribute.Attribute;@4416197b
scala> AttributeGroup.fromStructField(va.transform(dfres).schema("outputs")).attributes.get
res22: Array[org.apache.spark.ml.attribute.Attribute] = Array({"type":"numeric","idx":0,"name":"y"}, {"type":"numeric","idx":1,"name":"z"}, {"type":"numeric","idx":2,"name":"hash_0"}, {"type":"numeric","idx":3,"name":"hash_1"}, {"type":"numeric","idx":4,"name":"hash_2"}, {"type":"numeric","idx":5,"name":"hash_3"}, {"type":"numeric","idx":6,"name":"hash_4"}, {"type":"numeric","idx":7,"name":"hash_5"}, {"type":"numeric","idx":8,"name":"hash_6"}, {"type":"numeric","idx":9,"name":"hash_7"}, {"type":"numeric","idx":10,"name":"hash_8"}, {"type":"numeric","idx":11,"name":"hash_9"}, {"type":"numeric","idx":12,"name":"hash_10"}, {"type":"numeric","idx":13,"name":"hash_11"}, {"type":"numeric","idx":14,"name":"hash_12"}, {"type":"numeric","idx":15,"name":"hash_13"}, {"type":"numeric","idx":16,"nam...
you can see that the attributes are given the column name followed by the index.
This seems like a bug in the VectorAssembler, because it is making the schema dense when it should be sparse, but regardless this seems to be the more official way to represent the name of the attributes instead of using a "V" followed by index - unless you have seen the "V" + index used elsewhere?
@imatiach-msft This makes sense. I have now changed the code to mirror the same logic. When attributes are missing, the default name is set to the feature column name with the suffix "_0", "_1", etc.
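A minimal sketch of the revised fallback described above, mirroring VectorAssembler's `<columnName>_<index>` convention; this is an illustration with a hypothetical helper name, not the PR's exact code:

```scala
// When attribute metadata is absent, derive names from the features
// column name and the coefficient index, e.g. "features_0", "features_1".
def fallbackNames(featuresCol: String, numFeatures: Int): Array[String] =
  Array.tabulate(numFeatures)(i => s"${featuresCol}_$i")
```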
var coefficients = model.coefficients.toArray
var idx = Array.range(0, coefficients.length)
if (model.getFitIntercept) {
  featureNames = featureNames :+ "Intercept"
would it be possible to move this string ("Intercept") to a constant?
Done
var idx = 0
for (fitIntercept <- Seq(false, true)) {
  val trainer = new GeneralizedLinearRegression()
    .setFamily("gaussian")
not related to this code review, but it's unfortunate that these aren't constants that can be referenced from the model, it's messy to have to type strings like this everywhere as opposed to referencing variables
Indeed, there is an object Gaussian, and one can use Gaussian.name for the string name.
Gaussian.name.toLowerCase (or Gaussian.name, since it is converted to lowercase later) would generally be the approach.
But this is a test suite, so I think it's ok.
I would usually prefer to use variables wherever possible as it is much easier to update through various editors and in general it is much easier to catch compile time vs runtime errors. But it is a minor point, and it looks like this is consistent with most of the spark codebase.
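The constants-over-literals point can be illustrated with a small sketch; the object below mirrors the shape of Spark's family objects but is redefined here purely for illustration:

```scala
// Referencing a named constant lets the compiler catch typos that a raw
// "gaussian" string literal would not.
object Gaussian { val name: String = "Gaussian" }
val family: String = Gaussian.name.toLowerCase
```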
  result.toDF("Feature", "Estimate", "StdError", "TValue", "PValue").repartition(1)
} else {
  throw new UnsupportedOperationException(
    "No summary table available for this GeneralizedLinearRegressionModel")
minor suggestion: it would be nice to add a test to verify this exception is thrown (and with the right error message using the withClue() check)
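A dependency-free sketch of the suggested exception test; the review suggests ScalaTest's withClue(), which is not used here, and summaryTableFor is a hypothetical stand-in for accessing model.summary.summaryTable:

```scala
import scala.util.Try

// Stand-in for the summary-table accessor: throws when no summary exists.
def summaryTableFor(hasSummary: Boolean): String =
  if (hasSummary) "summary table"
  else throw new UnsupportedOperationException(
    "No summary table available for this GeneralizedLinearRegressionModel")

// Verify that the expected exception (and message) is raised.
val outcome = Try(summaryTableFor(hasSummary = false))
```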
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._
result.toDF("Feature", "Estimate", "StdError", "TValue", "PValue").repartition(1)
question: is "Estimate" the better term to use here as opposed to "Coefficient"? Are there other libraries which use this specific term in this case?
R was using 'Estimate'. I changed it to 'Coefficient' now.
 * set default names to "V1", "V2", and so on.
 */
@Since("2.2.0")
lazy val featureName: Array[String] = {
minor comment: it looks like this is an array, so it should be plural, i.e. "featureNames" instead of "featureName" without the s
OK, changed.
the code looks very good, I added a few minor comments, will take another look tomorrow, thanks!
@felixcheung @imatiach-msft Thanks much for the review. I have made most of the suggested changes. Please see my inline replies.
Test build #72901 has finished for PR 16630 at commit
val spark = SparkSession.builder().getOrCreate()
import spark.implicits._
result.toDF("Feature", "Coefficient", "StdError", "TValue", "PValue").repartition(1)
Sorry, I didn't realize that R uses Estimate instead of coefficient - if you feel strongly about using Estimate here instead you can change this back. Up to you.
  dataset.schema(model.getFeaturesCol)).attributes
if (featureAttrs == None) {
  Array.tabulate[String](origModel.numFeatures)(
    (x: Int) => (model.getFeaturesCol + "_" + x))
in general I would have preferred to create a platform-level function (or use one if it exists) to format the strings in the same way, so there is no duplicate code in VectorAssembler vs here that can diverge (and which other functions in spark can generally use). However, this seems a bit out of scope of this code review, so I don't think you need to do this.
Thanks for the updates, the changes look good to me. One question, out of scope of the specific changes in this review: are there any other summary statistics that we could add in the future? Maybe R^2 and adjusted R^2? Also, do you know of any good reference papers with an overview of the most popular summary statistics used in GLMs (beyond the ones in this pull request)?
val expectedFeature = Seq(Array("features_0", "features_1"),
  Array("(Intercept)", "features_0", "features_1"))
val expectedEstimate = Seq(Vectors.dense(0.2884, 0.538),
is this comparing the summary to the results of R? If so, you should generally add, in a comment, the R code that was used to generate the expected results, so that the expected values are reproducible.
Thanks. Added the R code.
  ~== expectedEstimate(idx) absTol 1E-3, "Coefficient mismatch in summaryTable")
assert(Vectors.dense(summaryTable.select("StdError").rdd.collect.map(_.getDouble(0)))
  ~== expectedStdError(idx) absTol 1E-3, "Standard error mismatch in summaryTable")
assert(Vectors.dense(summaryTable.select("TValue").rdd.collect.map(_.getDouble(0)))
it looks like for all of these below you can just call collect instead of rdd.collect?
Fixed.
@imatiach-msft @felixcheung
@actuaryzhang thanks, LGTM!
@yanboliang Thanks for the suggestions. I have made a new commit that addresses your comments. Below is an example of the call and the output:
Test build #79685 has finished for PR 16630 at commit
Test build #79686 has finished for PR 16630 at commit
Test build #79688 has finished for PR 16630 at commit
 * @since 2.3.0
 */
// scalastyle:off println
def show(numRows: Int, truncate: Boolean, numDigits: Int): Unit = if (truncate) {
I think not all of these functions are useful for the GLM summary. I'd recommend keeping only one show function with default settings, such as numRows = coefficientMatrix.size, truncate = 20 and numDigits = 3. There is little difference compared with Dataset.show; it's not necessary to provide lots of options for users to set, since users just want to see output like R's. Then the code will be cleaner.
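The numDigits idea can be sketched without Spark: format each numeric cell to a fixed number of significant digits, roughly like R's summary output. The helper name is hypothetical, not the PR's code:

```scala
// Format a numeric cell to numDigits significant digits using the
// standard %g conversion (java.util.Formatter semantics).
def formatCell(value: Double, numDigits: Int = 3): String =
  ("%." + numDigits + "g").format(value)
```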
}

private[regression] def showString(_numRows: Int, truncate: Int = 20,
    numDigits: Int = 3): String = {
Align.
 * tValue and pValue.
 */
@Since("2.3.0")
lazy val coefficientMatrix: Array[(String, Double, Double, Double, Double)] = {
This is not a matrix, so it's not appropriate to name it coefficientMatrix. Since it's only used for generating the summary string output, what about keeping it private or inlining it?
Made a new commit to address the comments.
Test build #79766 has finished for PR 16630 at commit
Test build #79970 has finished for PR 16630 at commit
LGTM, merged into master. Thanks, all.
Note the output format of GLR
What changes were proposed in this pull request?
Add an R-like summary table to the GLM summary, which includes feature names (if they exist), parameter estimates, standard errors, t-statistics and p-values. This allows Scala users to easily gather these commonly used inference results.
@srowen @yanboliang @felixcheung
How was this patch tested?
New tests: one for the feature names and one for the summary table.
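The overall idea of the summary table can be sketched without Spark by zipping the per-coefficient statistics into R-style rows. The row type and field names below are assumptions for illustration, not the PR's actual DataFrame schema:

```scala
// One row per coefficient, in the spirit of R's summary() output.
case class SummaryRow(feature: String, coefficient: Double,
                      stdError: Double, tValue: Double, pValue: Double)

// Zip parallel statistic arrays into rows; assumes all inputs are aligned
// and of equal length, with the intercept (if any) appended as a row.
def summaryRows(names: Seq[String], coef: Seq[Double], se: Seq[Double],
                t: Seq[Double], p: Seq[Double]): Seq[SummaryRow] =
  names.indices.map(i => SummaryRow(names(i), coef(i), se(i), t(i), p(i)))
```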