
[SPARK-19733][ML]Removed unnecessary castings and refactored checked casts in ALS. #17059

Closed
datumbox wants to merge 5 commits into apache:master from datumbox:als_casting_fix

Conversation

@datumbox
Contributor

datumbox commented Feb 24, 2017

What changes were proposed in this pull request?

The original ALS performed unnecessary casting of the user and item ids because the protected checkedCast() method required a Double. I removed the casts and refactored the method to receive Any and handle all permitted numeric values efficiently.

How was this patch tested?

I tested it by running the unit-tests and by manually validating the result of checkedCast for various legal and illegal values.
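
For context, the pre-PR checkedCast had roughly the following shape (a paraphrase of the old ALS.scala, with the exact error message omitted, not verbatim code); callers had to cast the id column to DoubleType before applying it, which is the casting this PR removes:

    // Paraphrased sketch of the pre-PR checkedCast: it only accepted Double,
    // so every id column was first cast to DoubleType.
    import org.apache.spark.sql.functions.udf

    val checkedCast = udf { (n: Double) =>
      if (n > Int.MaxValue || n < Int.MinValue) {
        throw new IllegalArgumentException("value out of Integer range")
      } else {
        n.toInt
      }
    }

    // Call site, paraphrased: checkedCast(col($(userCol)).cast(DoubleType))
    // -- the extra .cast(DoubleType) is what this PR removes.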

@datumbox datumbox changed the title from Removed unnecessary castings and refactored checked casts in ALS. to [SPARK-19733][ML]Removed unnecessary castings and refactored checked casts in ALS. Feb 25, 2017

@srowen
Member

srowen commented Feb 25, 2017

I get it, but is that cast actually slowing things down measurably? Because this change also adds some overhead in the checks it does. I think it's better to let the cast to double deal with the source nature of the number consistently.

@SparkQA

SparkQA commented Feb 25, 2017

Test build #3582 has finished for PR 17059 at commit 98d6481.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@datumbox
Contributor

datumbox commented Feb 25, 2017

@srowen I believe that this needs to be fixed for 2 reasons:

  1. Casting the ids to double just to convert them back to integer is not an elegant solution and it is rather confusing.
  2. The double casting puts more strain on the garbage collector, and I've personally measured it in an earlier version with and without the hack.

Finally, I do not believe the proposed fix slows things down, as it does a similar number of comparisons as the original code. If you still believe this is not worth it, feel free to close the PR.

@datumbox
Contributor

datumbox commented Feb 25, 2017

Concerning the failed Scala style test, this is caused by the long in-line comment that I added to explain the if statement. If you decide to approve the PR, I can remove it. Cheers! :)

@srowen
Member

srowen commented Feb 25, 2017

I get it, though the drawback is mostly that you've duplicated, in a way, the machinery that would construe any number as something comparable to the min/max int value. (Does this not miss the case of a Scala Long?) I like that it catches the case of Double values that aren't integral, but that's quite a corner case anyway.

@datumbox
Contributor

datumbox commented Feb 25, 2017

Could you explain what you mean by "duplicating"? It is safe with Scala Long; I did lots of tests to ensure that it works well. If you require any changes, I'm happy to update the PR.

@srowen
Member

srowen commented Feb 25, 2017

You are effectively handling the cases that casting handles. Doesn't a Scala Long get boxed another time now? I bet it works, just wondering why that's not handled like Int. I'm neutral on this and wouldn't merge it myself, but others are welcome to.

@datumbox
Contributor

datumbox commented Feb 25, 2017

jenkins retest this please

@datumbox
Contributor

datumbox commented Feb 25, 2017

@srowen Yes, the behaviour of the method remains the same. This patch helped me get a measurable improvement on GC overhead in Spark 2.0, so I thought that it would be beneficial for others. Anyway, thanks for the comments. :)

@datumbox
Contributor

datumbox commented Feb 26, 2017

I decided to provide a few more benchmarks in order to alleviate some of the concerns raised by @srowen.

To reproduce the results, add the following snippet in the ALSSuite class:

  test("Speed difference") {
    val (training, test) =
      genExplicitTestData(numUsers = 200, numItems = 400, rank = 2, noiseStd = 0.01)

    val runs = 1000
    var totalTime = 0.0
    println(s"Performing $runs runs")
    println("Run Id,Time (secs)")
    for (i <- 0 until runs) {
      val t0 = System.currentTimeMillis
      testALS(training, test, maxIter = 1, rank = 2, regParam = 0.01, targetRMSE = 0.1)
      val secs = (System.currentTimeMillis - t0) / 1000.0
      println(s"$i,$secs")
      totalTime += secs
    }
    println(s"AVG Execution Time: ${totalTime / runs} secs")
  }

To test both solutions, I collected 1000 samples for each (took ~1 hour for each). Here you can see the detailed output for the original and the proposed code.

  Code       Mean Execution Time (secs)   Std (secs)
  Original   4.75521                      0.81237
  Proposed   4.56276                      0.72790

Using an unpaired t-test to compare the two means, we find that the proposed code is faster and the result is statistically significant (p-value < .0001).

Below I summarize why I believe the original code needs to change:

  1. Casting user and item ids into double and then to integer is a hacky & indirect way to validate that the ids are numerical and within integer range. The proposed code covers all the corner cases in a clear and direct way. As an added bonus, it handles Doubles and Floats with a fractional part.
  2. Given that the ALS implementation requires ids with int values, it is expected that the majority of users encode their ids as Integer. The proposed solution avoids any casting in that case while reducing the casting in all the other cases. This avoids putting unnecessary strain on the garbage collector, something that you can observe if you profile the execution on a large dataset.
  3. The proposed solution is not slower than the original; if anything, it is slightly faster.

Could someone from the admins help re-test the code?
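
As a sanity check on the significance claim, the t statistic can be recomputed from the summary figures above; this is a stand-alone sketch (not part of the PR), assuming 1000 samples per group as stated:

    // Recomputing Welch's unpaired t statistic from the summary stats above;
    // n = 1000 samples per group, as stated in the comment.
    object TTestCheck extends App {
      val (mean1, std1, n1) = (4.75521, 0.81237, 1000) // original
      val (mean2, std2, n2) = (4.56276, 0.72790, 1000) // proposed

      val se = math.sqrt(std1 * std1 / n1 + std2 * std2 / n2) // std error of the difference
      val t = (mean1 - mean2) / se

      // Prints t = 5.58; with roughly 2000 degrees of freedom this corresponds
      // to p < .0001, consistent with the significance claim above.
      println(f"t = $t%.2f")
    }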

@srowen
Member

srowen commented Feb 27, 2017

That's compelling regarding performance. It's not big, but not trivial.
My remaining concern is whether you're handling all the cases the original did. Number covers a lot, but does it include all the SQL decimal types? I think the common casting mechanism would cover those. Granted, it's odd if someone tries to use those values, but ideally it would not work differently after this change.

I am still not sure about the Long case -- don't you want to handle Scala Long like Int here for efficiency?

@datumbox
Contributor

datumbox commented Feb 27, 2017

@srowen: Thanks for the comments. We are getting there. :)

I will handle the Long case as you suggest.

If you think people use SQL decimal types, I can include them at the end of the pattern matching. This will lead to some duplicate code though, because I need to write the same if statement twice. Any thoughts?

@datumbox
Contributor

datumbox commented Feb 27, 2017

Ignore my comment about duplicate code. It can be written to avoid it. I will investigate handling the SQL decimal types as you recommended and I will update the code tonight.

@datumbox
Contributor

datumbox commented Feb 27, 2017

@srowen The following snippet explicitly handles Longs. It can be rewritten to remove duplicate code by introducing bools for overflow detection, but I don't think it is worth it. In theory you can also explicitly catch other types such as Byte and Short, but I think that's overkill.

As far as I saw, all SQL numerical types inherit from Number, so comparing their doubleValue with their intValue is enough to check if they are within integer range.

    val u = udf { (n: Any) =>
      n match {
        case v: Int => v
        case v: Long =>
          // A Long outside Int range wraps around on intValue, so the equality fails.
          val intV = v.intValue
          if (v == intV) {
            intV
          } else {
            throw new IllegalArgumentException("out of range")
          }
        // case v: Byte => v.toInt
        // case v: Short => v.toInt
        case v: Number =>
          // Rejects values outside Int range and values with a fractional part.
          val intV = v.intValue
          if (v.doubleValue == intV) {
            intV
          } else {
            throw new IllegalArgumentException("out of range")
          }
        case _ => throw new IllegalArgumentException("invalid type")
      }
    }

Personally, I would remove the explicit Long case, as it introduces duplicate code and does not help much. The remaining snippet avoids doing any casting if the id is an Integer (which should be the majority of cases and yields the biggest memory/speed gains) or non-numeric, and handles all corner cases (all Scala/Java numeric types + SQL numerics). Agree?
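
A quick REPL check (mine, not from the thread) of the claim that SQL decimals reach the Number branch; java.math.BigDecimal, one of the JVM types a Spark DecimalType value can surface as, extends java.lang.Number:

    val d = new java.math.BigDecimal("123.10")
    d.isInstanceOf[Number]        // true: BigDecimal extends java.lang.Number
    d.intValue                    // 123 (fraction truncated)
    d.doubleValue == d.intValue   // false -> rejected, as intended for 123.10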

@srowen
Member

srowen commented Feb 27, 2017

This one would be good for @MLnick to look at.

The Scala Long will match java.lang.Number, right? If so, it works even if there's another conversion there, but maybe that one is trivial, and no more common a case to optimize for than others.
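
A quick REPL check (not from the thread) confirming this: a Scala Long that flows in through an Any parameter arrives boxed as java.lang.Long, which extends java.lang.Number, so the Number case matches:

    val x: Any = 1231L
    x.isInstanceOf[Number]   // true: the value was boxed to java.lang.Long at the Any boundary
    x.getClass               // class java.lang.Long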

@datumbox
Contributor

datumbox commented Feb 27, 2017

Yeah, Scala Long matches. Here is the "stand-alone" script that I used to confirm that everything works ok (tested on Spark 2.1):

import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

val u = udf { (n: Any) =>
  n match {
    case v: Int => v
    case v: Number =>
      val intV = v.intValue
      if (v.doubleValue == intV) {
        intV
      } else {
        throw new IllegalArgumentException("out of range")
      }
    case _ => throw new IllegalArgumentException("invalid type")
  }
}

val df = sqlContext.range(10)
  .withColumn("int_success", lit(123))
  .withColumn("long_success", lit(1231L))
  .withColumn("long_fail", lit(1231000000000L))
  .withColumn("decimal_success", lit(123).cast(DecimalType(5, 2)))
  .withColumn("decimal_fail", lit(123.1).cast(DecimalType(5, 2)))
  .withColumn("double_success", lit(123.0))
  .withColumn("double_fail", lit(123.1))
  .withColumn("double_fail2", lit(1231000000000.0))
  .withColumn("string_fail", lit("123.1"))

// these work fine
df.select(u(df.col("int_success"))).show
df.select(u(df.col("long_success"))).show
df.select(u(df.col("decimal_success"))).show
df.select(u(df.col("double_success"))).show

// these fail with out of int range exception
df.select(u(df.col("long_fail"))).show
df.select(u(df.col("decimal_fail"))).show
df.select(u(df.col("double_fail"))).show
df.select(u(df.col("double_fail2"))).show

// this fails with invalid type exception
df.select(u(df.col("string_fail"))).show

Cool, I'll commit the changes tonight so that @MLnick can check the final code.

@srowen I really appreciate your input; you did raise good points, especially about handling the SQL datatypes. Thank you.

@datumbox
Contributor

datumbox commented Feb 27, 2017

@srowen @MLnick I updated the PR based on what was discussed above and I tested it again on Spark 2.1. I also updated the coding styles and the exception message.

The changes requested by Sean should not affect the speed of the proposed solution, but rather help us catch all the corner cases. I tested this by generating a fresh batch of 1000 samples using the latest implementation (details here). The mean of the sample is 4.53837, which is no different from the mean of the code that I committed yesterday (p-value 0.4265 - H0 of equal means can't be rejected). So we are good.

Could any of the admins help re-test this, because the "Jenkins, retest this please" did not work for me earlier?

@imatiach-msft
Contributor

imatiach-msft commented Feb 28, 2017

@datumbox I like the changes, I just had a minor concern about the code where we call v.intValue and then compare this to v.doubleValue -- due to precision issues, I'm not sure if this is desirable, since the data could come from any source and be slightly modified outside the Int range, and the previous code does not do this check.

@datumbox
Contributor

datumbox commented Feb 28, 2017

@imatiach-msft This comparison is intentional and checks 2 things: that the number is within integer range and that the id does not have any non-zero digits after the decimal point. If the number is outside integer range, the intValue will overflow and the number will not match its double part. Moreover, if it has a fractional part, it will also not match the integer.

Effectively we are making sure that the UserId and ItemId have values within integer range.
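
To make the two failure modes concrete, a small REPL sketch (mine, not from the thread) using the thread's own example values:

    // Out of Int range: intValue keeps only the low 32 bits, so the wrapped
    // value no longer equals the Double value and the id is rejected.
    val big: Number = 1231000000000L
    big.doubleValue == big.intValue    // false -> rejected

    // Fractional part: intValue truncates, so 123.1 != 123 and the id is rejected.
    val frac: Number = 123.1
    frac.doubleValue == frac.intValue  // false -> rejected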

@MLnick
Contributor

MLnick commented Feb 28, 2017

@datumbox you mention there is GC & performance overhead, which makes some sense. Have you run into problems at very large scale (like millions of users, items & ratings)? I did regression tests here - admittedly vs 1.6.1 at that time.

@datumbox
Contributor

datumbox commented Feb 28, 2017

@MLnick Yeap! I have run into problems with the original implementation when dealing with several billion records. This is when removing casting starts paying off. :)

@MLnick
Contributor

MLnick commented Feb 28, 2017

Ok, let me take a look at this. Thanks!

@MLnick
Contributor

MLnick commented Feb 28, 2017

ok to test

@SparkQA

SparkQA commented Feb 28, 2017

Test build #73585 has finished for PR 17059 at commit fc3acfb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@imatiach-msft
Contributor

imatiach-msft commented Feb 28, 2017

@datumbox thanks for adding the tests, I've approved the review, assuming the other reviewers are ok with minimal double support. I haven't seen withClue used in the same way you've used it, but it seems fine to me (usually the withClue would contain the error message, and you wouldn't have the extra assert). Thank you for the nice performance improvement!
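
For reference, the conventional ScalaTest pattern the reviewer alludes to looks roughly like this (a generic sketch, not the PR's actual test code):

    import org.scalatest.FunSuite

    class WithClueExample extends FunSuite {
      test("withClue prefixes the failure message") {
        // The clue carries the human-readable context; the assert inside
        // does the actual check and reports the clue on failure.
        withClue("ids with a fractional part must be rejected: ") {
          assert(123.1 != 123.1.toInt)
        }
      }
    }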

@datumbox
Contributor

datumbox commented Feb 28, 2017

@imatiach-msft
Contributor

imatiach-msft commented Feb 28, 2017

@datumbox ah ok, thanks for the link. LGTM!

@datumbox
Contributor

datumbox commented Mar 1, 2017

I updated the exception message and I added unit-tests to cover the checkedCast method, as @imatiach-msft suggested. I have not corrected the value of regParam, but I had a look and it is safe to do so.

Thanks everyone for the comments and recommendations. Please consider merging this PR with master, because many people are working on the ALS lately and conflicts will soon arise.

@SparkQA

SparkQA commented Mar 1, 2017

Test build #73620 has finished for PR 17059 at commit eb33189.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@MLnick
Contributor

MLnick commented Mar 1, 2017

The discussion around numeric types was at #12762 (comment). The checked cast was introduced because ids were previously implicitly cast to Int anyway, which would mangle Long ids without any warning. I did think about only supporting Int & Long there - but we decided to just support any numeric type within range. I guess it was around the time that we were porting most pipeline components to support all numeric input types.

In hindsight I would have just gone for restricting it to Int & Long.

In reality, in the case of ALS, users really should not be using Double for ids (e.g. an id of 123.4 is clearly wrong; 123.0 could occur, I guess, more by accident or poor schema, but in that case we support it here). So I think the likelihood of any real impact from being restrictive in terms of fractionals is really low.
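
To see the mangling concretely, a one-line sketch (mine, not from the thread):

    // A plain toInt silently keeps only the low 32 bits of a Long id:
    val id = 1231000000000L
    id.toInt  // wraps to a negative value instead of failing -- a silently mangled id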

@MLnick
Contributor

MLnick commented Mar 1, 2017

@datumbox left a few minor comments to clean up. Overall looks good. I think it is fine for 2.2.

For the regParam, even though it is a minor thing, it would be best to just create a new JIRA & PR for it separately.

@MLnick
Contributor

MLnick commented Mar 1, 2017

@imatiach-msft thanks for reviewing and requesting the test case, I was going to ask for one :)

@datumbox
Contributor

datumbox commented Mar 1, 2017

@MLnick I accepted all proposed changes. Please retest. :)

Ok, I will create a new JIRA & PR for the regParam update, but it will go down in history as the laziest commit ever. :p

@SparkQA

SparkQA commented Mar 1, 2017

Test build #73693 has finished for PR 17059 at commit a1e32aa.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@datumbox
Contributor

datumbox commented Mar 1, 2017

@srowen @MLnick I had a typo in the committed version... Could you please retest it? I don't think I have the rights to do it myself.

@SparkQA

SparkQA commented Mar 1, 2017

Test build #73697 has finished for PR 17059 at commit 0b49a50.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

ghost pushed a commit to dbtsai/spark that referenced this pull request Mar 1, 2017

[SPARK-19787][ML] Changing the default parameter of regParam.
## What changes were proposed in this pull request?

In the ALS method the default values of regParam do not match within the same file (lines [224](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L224) and [714](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L714)). In one place we set it to 1.0 and in the other to 0.1.

I changed the one in the train() method to 0.1, and now it matches the default value which is visible to Spark users. The method is marked with DeveloperApi, so it should not affect the users. Whenever we use the particular method we provide all parameters, so the default does not matter. The only exception is the unit-tests in ALSSuite, but the change does not break them.

Note: This PR should get the award of the laziest commit in Spark history. Originally I wanted to correct this on another PR but MLnick [suggested](apache#17059 (comment)) to create a separate PR & ticket. If you think this change is too insignificant/minor, you are probably right, so feel free to reject and close this. :)

## How was this patch tested?

Unit-tests

Author: Vasilis Vryniotis <vvryniotis@hotels.com>

Closes apache#17121 from datumbox/als_regparam.
@MLnick
Contributor

MLnick commented Mar 1, 2017

@datumbox once a PR is ok'ed to test by a committer, it should automatically retest on each commit.

@datumbox
Contributor

datumbox commented Mar 1, 2017

@MLnick Cool, did not know that. Just updated the commit messages. I think I'm good with this PR from my side.

@SparkQA

SparkQA commented Mar 1, 2017

Test build #73706 has finished for PR 17059 at commit 361d9cb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@SparkQA

SparkQA commented Mar 2, 2017

Test build #73718 has finished for PR 17059 at commit 3050f6e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
@MLnick

MLnick approved these changes Mar 2, 2017

LGTM

@MLnick
Contributor

MLnick commented Mar 2, 2017

Merged to master. Thanks @datumbox, and everyone for reviews.

@asfgit asfgit closed this in 625cfe0 Mar 2, 2017

@datumbox
Contributor

datumbox commented Mar 2, 2017

@MLnick @srowen @imatiach-msft Thank you for your comments and for helping make this improvement robust.

devlace added a commit to devlace/spark that referenced this pull request Mar 6, 2017

[SPARK-19787][ML] Changing the default parameter of regParam.

devlace added a commit to devlace/spark that referenced this pull request Mar 6, 2017

[SPARK-19733][ML] Removed unnecessary castings and refactored checked casts in ALS.

## What changes were proposed in this pull request?

The original ALS was performing unnecessary casting to the user and item ids because the protected checkedCast() method required a double. I removed the castings and refactored the method to receive Any and efficiently handle all permitted numeric values.

## How was this patch tested?

I tested it by running the unit-tests and by manually validating the result of checkedCast for various legal and illegal values.

Author: Vasilis Vryniotis <bbriniotis@datumbox.com>

Closes apache#17059 from datumbox/als_casting_fix.

ThySinner pushed a commit to ThySinner/spark that referenced this pull request Jun 9, 2017

[SPARK-19787][ML] Changing the default parameter of regParam.

ThySinner pushed a commit to ThySinner/spark that referenced this pull request Jun 9, 2017

[SPARK-19733][ML] Removed unnecessary castings and refactored checked casts in ALS.