[SPARK-19733][ML] Removed unnecessary castings and refactored checked casts in ALS. #17059

Closed
wants to merge 5 commits into apache:master from datumbox:als_casting_fix

Conversation

@datumbox (Contributor) commented Feb 24, 2017

What changes were proposed in this pull request?

The original ALS performed unnecessary casting on the user and item ids because the protected checkedCast() method required a double. I removed the casts and refactored the method to receive Any and handle all permitted numeric values efficiently.

How was this patch tested?

I tested it by running the unit tests and by manually validating the result of checkedCast for various legal and illegal values.

case v: Int => v // Avoid unnecessary casting
case v: Number =>
  val intV = v.intValue()
  if (v == intV) { // True for Byte/Short, Long within the Int range and Double/Float with no fractional part.

datumbox (Author, Contributor) commented Feb 24, 2017

One could write this differently and explicitly handle all the permitted types, but that would lead to duplicate code. Instead, what I do here is convert the number to an Int and compare it with the original Number (see the sketch below). If the values are identical, one of the following is true:

  • The value is a Byte or Short.
  • The value is a Long within the Integer range.
  • The value is a Double or Float with no fractional part.
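For reference, a minimal sketch of the approach, assuming org.apache.spark.sql.functions.udf is in scope; the method name and match cases follow the excerpt above, while the exception messages are illustrative placeholders rather than the final wording:

    protected val checkedCast = udf { (n: Any) =>
      n match {
        case v: Int => v // the common case: no conversion at all
        case v: Number =>
          val intV = v.intValue()
          // Scala's == compares numerically here: equality holds only when v
          // fits in Int range and has no fractional part.
          if (v == intV) {
            intV
          } else {
            throw new IllegalArgumentException(s"Value $n was out of Integer range.")
          }
        case _ => throw new IllegalArgumentException(s"Value $n is not numeric.")
      }
    }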
@datumbox changed the title from "Removed unnecessary castings and refactored checked casts in ALS." to "[SPARK-19733][ML] Removed unnecessary castings and refactored checked casts in ALS." Feb 25, 2017
@srowen (Member) commented Feb 25, 2017

I get it, but is that cast actually slowing things down measurably? Because this change also adds some overhead in the checks it does. I think it's better to let the cast to double deal with the source nature of the number consistently.

@SparkQA commented Feb 25, 2017

Test build #3582 has finished for PR 17059 at commit 98d6481.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@datumbox (Contributor, Author) commented Feb 25, 2017

@srowen I believe that this needs to be fixed for 2 reasons:

  1. Casting the ids to double just to convert them back to integer is not an elegant solution and it is rather confusing.
  2. The double casting puts more strain on the garbage collector; I've personally measured it in an earlier version with and without the hack.

Finally, I do not believe the proposed fix slows things down, as it does a similar number of comparisons as the original code. If you still believe this is not worth it, feel free to close the PR.

@datumbox (Contributor, Author) commented Feb 25, 2017

Concerning the failed Scala style test: it is caused by the long inline comment that I added to explain the if statement. If you decide to approve the PR, I can remove it. Cheers! :)

@srowen (Member) commented Feb 25, 2017

I get it, though the drawback is that you've duplicated, in a way, the machinery that would construe any number as something comparable to the min/max Int value. (Does this not miss the case of a Scala Long?) I like that it catches the case of Double values that aren't integral, but that's quite a corner case anyway.

@datumbox (Contributor, Author) commented Feb 25, 2017

Could you explain what you mean by "duplicating"? It is safe with Scala Long; I did lots of tests to ensure that it works well. If you require any changes, I'm happy to update the PR.

@srowen (Member) commented Feb 25, 2017

You are effectively handling the cases that casting handles. Doesn't a Scala Long get boxed another time now? I bet it works; just wondering why that's not handled like Int. I'm neutral on this and wouldn't merge it myself, but others are welcome to.

@datumbox (Contributor, Author) commented Feb 25, 2017

jenkins retest this please

@datumbox (Contributor, Author) commented Feb 25, 2017

@srowen Yes, the behaviour of the method remains the same. This patch helped me get a measurable improvement in GC overhead on Spark 2.0, so I thought it would be beneficial for others. Anyway, thanks for the comments. :)

@datumbox (Contributor, Author) commented Feb 26, 2017

I decided to provide a few more benchmarks in order to alleviate some of the concerns raised by @srowen.

To reproduce the results add the following snippet in the ALSSuite class:

  test("Speed difference") {
    val (training, test) =
      genExplicitTestData(numUsers = 200, numItems = 400, rank = 2, noiseStd = 0.01)

    val runs = 1000
    var totalTime = 0.0
    println("Performing "+runs+" runs")
    println("Run Id,Time (secs)")
    for(i <- 0 until runs) {
      val t0 = System.currentTimeMillis
      testALS(training, test, maxIter = 1, rank = 2, regParam = 0.01, targetRMSE = 0.1)
      val secs = (System.currentTimeMillis - t0)/1000.0
      println(i+","+secs)
      totalTime += secs
    }
    println("AVG Execution Time: "+(totalTime/runs)+" secs")
  }

To test both solutions, I collected 1000 samples for each (took ~1 hour for each). Here you can see the detailed output for the original and the proposed code.

Code       Mean Execution Time (secs)   Std
Original   4.75521                      0.81237
Proposed   4.56276                      0.72790

Using an unpaired t-test to compare the two means, we find that the proposed code is faster and the result is statistically significant (p-value < .0001).
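As a sanity check on that claim, the t-statistic can be recomputed from the summary statistics above (a sketch, not part of the PR; n = 1000 runs per variant):

    // Unpaired (Welch's) t-statistic from the reported means and standard deviations.
    val (mean1, std1, mean2, std2, n) = (4.75521, 0.81237, 4.56276, 0.72790, 1000)
    val se = math.sqrt(std1 * std1 / n + std2 * std2 / n)
    val t = (mean1 - mean2) / se
    println(f"t = $t%.2f") // ~5.58 with ~2000 degrees of freedom, hence p < .0001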

Below I summarize why I believe the original code needs to change:

  1. Casting user and item ids to double and then back to integer is a hacky and indirect way to validate that the ids are numeric and within integer range. The proposed code covers all the corner cases in a clear and direct way. As an added bonus, it handles Doubles and Floats with a fractional part.
  2. Given that the ALS implementation requires ids with Int values, it is expected that the majority of users encode their ids as Integers. The proposed solution avoids any casting in that case while reducing the casting in all other cases. This avoids putting unnecessary strain on the garbage collector, something you can observe if you profile the execution on a large dataset.
  3. The proposed solution is not slower than the original; if anything, it is slightly faster.

Could someone from the admins help re-test the code?

@srowen (Member) commented Feb 27, 2017

That's compelling regarding performance. It's not big, but not trivial.
My remaining concern is whether you're handling all the cases the original did. Number covers a lot, but does it include all the SQL decimal types? I think the common casting mechanism would cover those. Granted, it's odd if someone tries to use those values, but ideally it would not work differently after this change.

I am still not sure about the Long case: don't you want to handle a Scala Long like an Int here, for efficiency?

@datumbox (Contributor, Author) commented Feb 27, 2017

@srowen: Thanks for the comments. We are getting there. :)

I will handle the Long case as you suggest.

If you think people use SQL decimal types, I can include them at the end of the pattern matching. This will lead to some duplicate code though, because I need to write the same if statement twice. Any thoughts?

@datumbox (Contributor, Author) commented Feb 27, 2017

Ignore my comment about the duplicate code; it can be written so as to avoid it. I will investigate handling the SQL decimal types as you recommended and will update the code tonight.

@datumbox (Contributor, Author) commented Feb 27, 2017

@srowen The following snippet handles Longs explicitly. It could be rewritten to remove the duplicate code by introducing booleans for overflow detection, but I don't think it is worth it. In theory you could also explicitly catch other types such as Byte and Short, but I think that's overkill.

As far as I saw, all SQL numeric types inherit from Number, so comparing their doubleValue with their intValue is enough to check whether they are within integer range.

    val u = udf { (n: Any) =>
      n match {
        case v: Int => v
        case v: Long =>
          val intV = v.intValue
          if (v == intV) {
            intV
          } else {
            throw new IllegalArgumentException("out of range")
          }
        // case v: Byte => v.toInt
        // case v: Short => v.toInt
        case v: Number =>
          val intV = v.intValue
          if (v.doubleValue == intV) {
            intV
          } else {
            throw new IllegalArgumentException("out of range")
          }
        case _ => throw new IllegalArgumentException("invalid type")
      }
    }

Personally, I would remove the explicit Long case, as it introduces duplicate code and does not help much. The remaining snippet avoids doing any casting if the id is an Int (which should be the majority of cases and yields the biggest memory/speed gains) or non-numeric, and it handles all the corner cases (all Scala/Java numeric types plus SQL numerics). Agree?

@srowen (Member) commented Feb 27, 2017

This one would be good for @MLnick to look at.

A Scala Long will match java.lang.Number, right? If so it works, even if there's another conversion there; but maybe that one is trivial, and no more common a case to optimize for than others.

@datumbox (Contributor, Author) commented Feb 27, 2017

Yeah, a Scala Long matches. Here is the "stand-alone" script that I used to confirm that everything works OK (tested on Spark 2.1):

import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

val u = udf { (n: Any) =>
  n match {
    case v: Int => v
    case v: Number =>
      val intV = v.intValue
      if (v.doubleValue == intV) {
        intV
      }
      else {
        throw new IllegalArgumentException("out of range")
      }
    case _ => throw new IllegalArgumentException("invalid type")
  }
}

val df = sqlContext.range(10)
  .withColumn("int_success", lit(123))
  .withColumn("long_success", lit(1231L))
  .withColumn("long_fail", lit(1231000000000L))
  .withColumn("decimal_success", lit(123).cast(DecimalType(5, 2)))
  .withColumn("decimal_fail", lit(123.1).cast(DecimalType(5, 2)))
  .withColumn("double_success", lit(123.0))
  .withColumn("double_fail", lit(123.1))
  .withColumn("double_fail2", lit(1231000000000.0))
  .withColumn("string_fail", lit("123.1"))

// these work fine
df.select(u(df.col("int_success"))).show
df.select(u(df.col("long_success"))).show
df.select(u(df.col("decimal_success"))).show
df.select(u(df.col("double_success"))).show

// these fail with out of int range exception
df.select(u(df.col("long_fail"))).show
df.select(u(df.col("decimal_fail"))).show
df.select(u(df.col("double_fail"))).show
df.select(u(df.col("double_fail2"))).show

// this fails with invalid type exception
df.select(u(df.col("string_fail"))).show

Cool, I'll commit the changes tonight so that @MLnick can check the final code.

@srowen I really appreciate your input; you did raise good points, especially about handling the SQL data types. Thank you.

@datumbox (Contributor, Author) commented Feb 27, 2017

@srowen @MLnick I updated the PR based on what was discussed above and tested it again on Spark 2.1. I also updated the coding style and the exception message.

The changes requested by Sean should not affect the speed of the proposed solution but rather help us catch all the corner cases. I tested this by generating a fresh batch of 1000 samples using the latest implementation (details here). The mean of the sample is 4.53837, which is no different from the mean of the code that I committed yesterday (p-value 0.4265; H0 of equal means can't be rejected). So we are good.

Could any of the admins help re-test this, since the "Jenkins, retest this please" did not work for me earlier?

case v: Number =>
  val intV = v.intValue()
  // Checks if number within Int range and has no fractional part.
  if (v.doubleValue == intV) {

imatiach-msft (Contributor) commented Feb 28, 2017

Sorry, I'm not sure if this is a good idea due to floating point precision... the code above doesn't seem to do this check, it just calls toInt. However, if this is absolutely necessary, I would hope that we could give the user some way to specify the Int range or precision. Also, if we are going to go ahead with this change, then we should add some tests to verify that the exception is thrown; but without some ability to specify the precision, I'm not sure this is the correct thing to do (?).

datumbox (Author, Contributor) commented Feb 28, 2017

@imatiach-msft: In this snippet we deal with user ids and item ids. Those should not have fractional bits. What I do here is convert the number to an Int and compare it with its double value. If the values are identical, one of the following is true:

  • The value is a Byte or Short.
  • The value is a Long within the Integer range.
  • The value is a Double or Float within the Integer range and without any fractional part.

I think this snippet is fine.

MLnick (Contributor) commented Feb 28, 2017

Yeah, the reality is we mostly expect Int or Long ids, but we want to support any numeric id as long as it falls within Int range; passing in a fractional float/double just doesn't make sense, and we'll throw an error.

srowen (Member) commented Feb 28, 2017

I think the test does work, but it may be slightly more direct to check whether v.doubleValue % 1.0 == 0.0? That also works.

datumbox (Author, Contributor) commented Feb 28, 2017

The way it is written, it checks 2 things at the same time: 1) that the value is within integer range (so if it overflows, the equality will not hold), and 2) that it has no fractional part. The modulo check only covers you for point 2, not point 1.
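To illustrate the difference (a hypothetical REPL session, reusing the failing values from the script above):

    val big = 1231000000000L
    big.doubleValue == big.intValue // false: intValue wraps on overflow, so the range check trips
    123.1 % 1.0 == 0.0              // false: modulo catches the fractional part...
    1231000000000.0 % 1.0 == 0.0    // ...but true here, so modulo alone misses the out-of-range case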

imatiach-msft (Contributor) commented Feb 28, 2017

Also, if for some reason we do decide to go with the check against the fractional part, which I would not recommend, we should add a test to verify that the error message is thrown, and possibly add this change in accepted input format to the documentation, since it may break some users (they may have to do an additional conversion to int for their values prior to calling this API if they run into such floating point precision issues).

datumbox (Author, Contributor) commented Feb 28, 2017

@imatiach-msft I am aware of the implications of floating point precision and I understand your concerns.

Having said that, even allowing user and item ids to be double/float is not a good idea; we keep it for backwards compatibility, I guess. Also note that the current implementation in Spark 2.1 will actually take your 0.9999999999999996 value and silently cast it to Int (so it becomes 0)! In my view, the only permitted types should have been Integer, Long and BigInteger.

I don't have strong opinions about refactoring anything in the Number case, as it simply does not matter performance-wise. The point of this PR is to optimize the common case where the id is an Int, because the casting in the current approach generates twice as much data as the original dataset (of course it is GCed, but at a cost).
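To make the truncation point concrete (a hypothetical REPL check contrasting the old double-cast path with the proposed check):

    0.9999999999999996.toInt // 0: the old path truncates the id silently
    0.9999999999999996.doubleValue == 0.9999999999999996.intValue // false: the proposed check throws instead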

@MLnick @srowen It's your call.

imatiach-msft (Contributor) commented Feb 28, 2017

@datumbox I agree that calling toInt in the previous code was not the best decision either; if we really wanted to support the double type correctly, we would probably have done the check with some precision. Maybe if we add to the documentation that double types for user/item ids should be avoided, I would be OK with it.
Also, could you add a test case to verify that the exception is thrown for doubles with a fraction, similar to the test case I sent you (with a catch on the exception type and withClue on the message)? We should clean up the error message too, since it is thrown not only when the value is not in Integer range (Int.Max/Min) but also when there is a fractional part.

datumbox (Author, Contributor) commented Feb 28, 2017

@imatiach-msft I already tested it; check the snippet posted here. I can add it to ALSSuite if necessary. The exception message is technically correct, since a number with a fractional part is not in Integer range (plus you get to see the actual value of the number). Returning a different exception would require two separate if checks instead of one. Do we really want that?

Guys, to be honest, this PR solves a very simple thing and I did not anticipate it would be so controversial. May I suggest we agree on the final set of changes and merge it? Perhaps we can tackle any other concerns in new pull requests?

imatiach-msft (Contributor) commented Feb 28, 2017

@datumbox Sorry, the link doesn't seem to work for me. If you could add it to the ALSSuite, that would be great. Oh no, I don't mean to have a different check, just a clearer exception message, something like: throw new IllegalArgumentException(s"ALS only supports values in Integer range and without fractional part for columns ${$(userCol)} and ${$(itemCol)}. Value $n was either out of Integer range or contained a fractional part that could not be converted.")

As a general rule, all check-ins should have tests associated with them to verify functionality, and we should aim for 100% code coverage to validate all production code paths, even if that is not always realistically possible.
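Such a test might look roughly like the following in ALSSuite (a sketch assuming checkedCast is reachable from the suite and the message above is adopted; the final committed test may differ):

    test("checkedCast rejects fractional doubles") {
      val checkedCast = new ALS().checkedCast
      val df = spark.range(1)
      withClue("Invalid Double: has a fractional part") {
        // The UDF's IllegalArgumentException surfaces wrapped in a SparkException.
        val e: SparkException = intercept[SparkException] {
          df.select(checkedCast(lit(123.1))).collect()
        }
        assert(e.getMessage.contains("either out of Integer range or contained a fractional part"))
      }
    }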

@imatiach-msft (Contributor) commented Feb 28, 2017

@datumbox I like the changes; I just had a minor concern about the code where we call v.intValue and then compare it to v.doubleValue. Due to precision issues, I'm not sure this is desirable, since the data could come from any source and be slightly modified outside the Int range, and the previous code does not do this check.

@datumbox (Contributor, Author) commented Feb 28, 2017

@imatiach-msft This comparison is intentional and checks 2 things: that the number is within integer range, and that the id does not have any non-zero digits after the decimal point. If the number is outside integer range, the intValue will overflow and will not match its double counterpart. Moreover, if it has a fractional part, it will also not match the integer.

Effectively we are making sure that the userId and itemId have values within integer range.

@MLnick (Contributor) commented Feb 28, 2017

@datumbox you mention there is GC & performance overhead, which makes some sense. Have you run into problems at very large scale (like millions of users, items & ratings)? I did regression tests here (admittedly vs 1.6.1 at that time).

@datumbox (Contributor, Author) commented Feb 28, 2017

@MLnick Yeap! I have run into problems with the original implementation when dealing with several billion records. This is when removing the casting starts paying off. :)

@MLnick (Contributor) commented Feb 28, 2017

Ok, let me take a look at this. Thanks!

@MLnick (Contributor) commented Feb 28, 2017

ok to test

@SparkQA commented Feb 28, 2017

Test build #73585 has finished for PR 17059 at commit fc3acfb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@imatiach-msft (Contributor) commented Feb 28, 2017

@datumbox ah ok, thanks for the link. LGTM!

@datumbox (Contributor, Author) commented Mar 1, 2017

I updated the exception message and added unit tests to cover the checkedCast method, as @imatiach-msft suggested. I have not corrected the value of regParam, but I had a look and it is safe to do so.

Thanks, everyone, for the comments and recommendations. Please consider merging this PR into master, because many people are working on ALS lately and conflicts will soon arise.

@SparkQA commented Mar 1, 2017

Test build #73620 has finished for PR 17059 at commit eb33189.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@MLnick (Contributor) left a comment

Made a few very minor comments.

val e: SparkException = intercept[SparkException] {
  df.select(checkedCast( lit(1231000000000L) )).collect()
}
assert(e.getMessage.contains("either out of Integer range or contained a fractional part"))

MLnick (Contributor) commented Mar 1, 2017

May as well factor this string out to a val.

val df = spark.sqlContext.range(1)

withClue("Valid Integer Ids") {
  df.select(checkedCast( lit(123) )).collect()

MLnick (Contributor) commented Mar 1, 2017

Could you remove the spaces around lit(123) here and throughout for code style?

@@ -82,12 +82,22 @@ private[recommendation] trait ALSModelParams extends Params with HasPredictionCo
* Attempts to safely cast a user/item id to an Int. Throws an exception if the value is
* out of integer range.

MLnick (Contributor) commented Mar 1, 2017

Should probably add to the comment "... out of integer range or contains a fractional part"

-      .select(checkedCast(col($(userCol)).cast(DoubleType)),
-        checkedCast(col($(itemCol)).cast(DoubleType)), r)
+      .select(checkedCast(col($(userCol))),
+        checkedCast(col($(itemCol))), r)

MLnick (Contributor) commented Mar 1, 2017

Nit - but can this now go on one line if short enough?

@@ -205,6 +206,69 @@ class ALSSuite
assert(decompressed.toSet === expected)
}

test("CheckedCast") {
val checkedCast = new ALS().checkedCast
val df = spark.sqlContext.range(1)

MLnick (Contributor) commented Mar 1, 2017

This can just be spark.range(1)?

@@ -513,31 +577,31 @@ class ALSSuite
     withClue("fit should fail when ids exceed integer range. ") {
       assert(intercept[SparkException] {
         als.fit(df.select(df("user_big").as("user"), df("item"), df("rating")))
-      }.getCause.getMessage.contains("was out of Integer range"))
+      }.getCause.getMessage.contains("out of Integer range or contained a fractional part"))

MLnick (Contributor) commented Mar 1, 2017

May as well clean this message up into a val also while we're at it :)

@MLnick (Contributor) commented Mar 1, 2017

The discussion around numeric types was at #12762 (comment). The checked cast was introduced because ids were previously implicitly cast to Int anyway, which would mangle Long ids without any warning. I did think about only supporting Int & Long there, but we decided to just support any numeric type within range. I guess it was around the time that we were porting most pipeline components to support all numeric input types.

In hindsight I would have just gone for restricting it to Int & Long.

In reality, in the case of ALS, users really should not be using Double for ids (e.g. an id of 123.4 is clearly wrong; 123.0 could occur more by accident or poor schema, I guess, but in that case we support it here). So I think the likelihood of any real impact from being restrictive about fractionals is really low.

@MLnick (Contributor) commented Mar 1, 2017

@datumbox left a few minor comments to clean up. Overall looks good. I think it is fine for 2.2.

For the regParam, even though it is a minor thing, it would be best to just create a new JIRA & PR for it separately.

@MLnick (Contributor) commented Mar 1, 2017

@imatiach-msft thanks for reviewing and requesting the test case, I was going to ask for one :)

@datumbox (Contributor, Author) commented Mar 1, 2017

@MLnick I accepted all the proposed changes. Please retest. :)

OK, I will create a new JIRA & PR for the regParam update, but it will go down in history as the laziest commit ever. :p

@SparkQA commented Mar 1, 2017

Test build #73693 has finished for PR 17059 at commit a1e32aa.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@datumbox (Contributor, Author) commented Mar 1, 2017

@srowen @MLnick I had a typo in the committed version... Could you please retest it? I don't think I have the rights to do it myself.

@SparkQA commented Mar 1, 2017

Test build #73697 has finished for PR 17059 at commit 0b49a50.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
ghost pushed a commit to dbtsai/spark that referenced this pull request Mar 1, 2017
## What changes were proposed in this pull request?

In the ALS method the default values of regParam do not match within the same file (lines [224](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L224) and [714](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L714)). In one place we set it to 1.0 and in the other to 0.1.

I changed the one in the train() method to 0.1, and now it matches the default value that is visible to Spark users. The method is marked with DeveloperApi, so it should not affect users. Whenever we use the particular method we provide all parameters, so the default does not matter. The only exception is the unit tests in ALSSuite, but the change does not break them.

Note: This PR should get the award for the laziest commit in Spark history. Originally I wanted to correct this in another PR, but MLnick [suggested](apache#17059 (comment)) creating a separate PR & ticket. If you think this change is too insignificant/minor, you are probably right, so feel free to reject and close it. :)

## How was this patch tested?

Unit tests

Author: Vasilis Vryniotis <vvryniotis@hotels.com>

Closes apache#17121 from datumbox/als_regparam.
@datumbox datumbox force-pushed the datumbox:als_casting_fix branch Mar 1, 2017
@MLnick (Contributor) commented Mar 1, 2017

@datumbox once a PR is OK'ed to test by a committer, it should automatically retest on each commit.

@datumbox (Contributor, Author) commented Mar 1, 2017

@MLnick Cool, did not know that. I just updated the commit messages. I think I'm good with this PR from my side.

@SparkQA commented Mar 1, 2017

Test build #73706 has finished for PR 17059 at commit 361d9cb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@datumbox datumbox force-pushed the datumbox:als_casting_fix branch to 3050f6e Mar 1, 2017
@SparkQA commented Mar 2, 2017

Test build #73718 has finished for PR 17059 at commit 3050f6e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
MLnick approved these changes Mar 2, 2017

@MLnick (Contributor) left a comment

LGTM

@MLnick (Contributor) commented Mar 2, 2017

Merged to master. Thanks @datumbox, and everyone for reviews.

@asfgit asfgit closed this in 625cfe0 Mar 2, 2017
@datumbox (Contributor, Author) commented Mar 2, 2017

@MLnick @srowen @imatiach-msft Thank you for your comments and for helping make this improvement robust.

devlace added a commit to devlace/spark that referenced this pull request Mar 6, 2017

Closes apache#17121 from datumbox/als_regparam.
devlace added a commit to devlace/spark that referenced this pull request Mar 6, 2017

… casts in ALS.

Closes apache#17059 from datumbox/als_casting_fix.