potential regression on scalding joinWithTiny on 4.5-wip #91

Closed
daniel-sudz opened this issue Mar 9, 2022 · 7 comments

@daniel-sudz
Collaborator

I've been testing scalding with newer cascading as a demo on a branch here: daniel-sudz/scalding#1.

I currently have the following bad output:

[info] - should merge and joinWithTiny shouldn't duplicate data *** FAILED ***
[info]   Set((1,3), (2,3), (3,3), (4,2)) was not equal to Set((1,2), (2,2), (3,2), (4,1)) (PlatformTest.scala:466)

It looks like some duplication is going on: every count is one higher than expected (3 instead of 2, and 2 instead of 1). There was some earlier discussion of this in twitter/scalding#1592, from when the cascading3 branch of scalding was being developed, before that work stalled. The resolution there seemed to be a higher Hadoop version, so it is not really applicable here.

Not sure where to begin debugging this but would love some pointers.

@daniel-sudz
Collaborator Author

here is the scalding test:

class TinyJoinAndMergeJob(args: Args) extends Job(args) {
  import TinyJoinAndMergeJob._

  val people = peopleInput.read.mapTo(0 -> 'id) { v: Int => v }

  val messages = messageInput.read
    .mapTo(0 -> 'id) { v: Int => v }
    .joinWithTiny('id -> 'id, people)

  (messages ++ people).groupBy('id)(_.size('count)).write(output)
}

here is the input for the tests:

object TinyJoinAndMergeJob {
  val peopleInput = TypedTsv[Int]("input1")
  val peopleData = List(1, 2, 3, 4)

  val messageInput = TypedTsv[Int]("input2")
  val messageData = List(1, 2, 3)

  val output = TypedTsv[(Int, Int)]("output")
  val outputData = List((1, 2), (2, 2), (3, 2), (4, 1))
}

previous behavior:

(1,2,3) joins against (1,2,3,4), creating (1,2,3).
(1,2,3) + (1,2,3,4) -> (1,1,2,2,3,3,4).
(1,1,2,2,3,3,4) count by key -> ((1,2), (2,2), (3,2), (4,1))

current behavior: every key is being over-counted by one.
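
The arithmetic above can be reproduced with plain Scala collections, as a rough sketch of the intended semantics only. It does not use Scalding or Cascading and says nothing about how they execute the job; the object name `ExpectedCounts` and the values `joined`, `merged`, `expected`, `observed`, and `extraPerKey` are illustrative and do not come from the test suite. The join is modeled as a simple filter, which assumes the ids in `peopleData` are unique (they are in this test).

```scala
object ExpectedCounts {
  // Same inputs as TinyJoinAndMergeJob.
  val peopleData = List(1, 2, 3, 4)
  val messageData = List(1, 2, 3)

  // Inner join on id, modeled as a filter (assumes unique people ids):
  // keep each message id that also appears in people.
  val joined = messageData.filter(id => peopleData.contains(id))

  // Merge (++) the joined messages with people, then count rows per id.
  val merged = joined ++ peopleData
  val expected = merged.groupBy(identity).map { case (id, ids) => (id, ids.size) }.toSet

  // Counts reported by the failing run in PlatformTest.
  val observed = Set((1, 3), (2, 3), (3, 3), (4, 2))

  // Difference between observed and expected count for each id.
  private val expectedMap = expected.toMap
  val extraPerKey = observed.toMap.map { case (id, n) => id -> (n - expectedMap(id)) }
}
```

Here `expected` equals the test's `outputData` as a set, and `extraPerKey` maps every id to 1, which matches the "over-counted by one" observation.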

@daniel-sudz
Collaborator Author

I think this is possibly also related to the third-party cascading.avro.AvroScheme package leaking an old cascading 2.X version onto the classpath. That project seems to be completely abandoned right now.

@cwensel
Owner

cwensel commented Mar 9, 2022

So there is this: https://github.com/cwensel/cascading-avro

@daniel-sudz
Collaborator Author

Thanks for the fork and link. We're still trying to set up some discussion with the scalding project to figure out what a potential upgrade path might look like. Since it's not my project, it's not clear to me what is a "must have" to support. I will try to get back once I have more info on what the plan is.

@cwensel
Owner

cwensel commented Mar 9, 2022

It makes sense to push the avro fork out to Maven Central, but I don't have time to patch it since it uses Maven as its build. If you guys need it, feel free to push PRs to it so we can get it out. I also don't have a build server for it, so it will need GitHub workflow action(s) as well.

Note, the hardest part of pushing to Maven Central will be getting the private keys on the build server. We can collaborate on that bit.

@cwensel
Owner

cwensel commented Mar 16, 2022

@daniel-sudz has this been resolved?

@daniel-sudz
Collaborator Author

@cwensel I'm going to close this. I think it's unlikely that scalding will adopt cascading 4.X in the near future, so I don't really have time to look into it.
