potential regression on scalding joinWithTiny on 4.5-wip #91
Here is the Scalding test:

class TinyJoinAndMergeJob(args: Args) extends Job(args) {
  import TinyJoinAndMergeJob._

  val people = peopleInput.read.mapTo(0 -> 'id) { v: Int => v }

  val messages = messageInput.read
    .mapTo(0 -> 'id) { v: Int => v }
    .joinWithTiny('id -> 'id, people)

  (messages ++ people).groupBy('id)(_.size('count)).write(output)
}

Here is the input for the tests:

object TinyJoinAndMergeJob {
  val peopleInput = TypedTsv[Int]("input1")
  val peopleData = List(1, 2, 3, 4)

  val messageInput = TypedTsv[Int]("input2")
  val messageData = List(1, 2, 3)

  val output = TypedTsv[(Int, Int)]("output")
  val outputData = List((1, 2), (2, 2), (3, 2), (4, 1))
}

Previous behavior:
(1,2,3) joins against (1,2,3,4), creating (1,2,3).
(1,2,3) + (1,2,3,4) -> (1,1,2,2,3,3,4).
(1,1,2,2,3,3,4) count by key -> ((1,2), (2,2), (3,2), (4,1))

Current behaviour: every key is being over-counted by one.
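For reference, the expected counts can be reproduced with plain Scala collections, with no Scalding or Cascading involved. This is only a sketch of the intended semantics (inner join, then merge, then count per key), not of how either library implements them. The object name `ExpectedCounts` and the values `joined`, `merged`, and `counts` are made up for illustration and are not part of the job.

```scala
// Plain-collections model of the expected behavior (illustrative names only).
object ExpectedCounts {
  val peopleData: List[Int] = List(1, 2, 3, 4)
  val messageData: List[Int] = List(1, 2, 3)

  // Models joinWithTiny('id -> 'id, people) as an inner join on id.
  // A filter is enough here only because ids are unique in peopleData.
  val joined: List[Int] = messageData.filter(id => peopleData.contains(id))

  // Models (messages ++ people): merge the joined rows with the people rows.
  val merged: List[Int] = joined ++ peopleData

  // Models groupBy('id)(_.size('count)): count rows per id, sorted by id.
  val counts: List[(Int, Int)] =
    merged.groupBy(identity).map { case (id, rows) => (id, rows.size) }.toList.sortBy(_._1)

  def main(args: Array[String]): Unit = println(counts)
}
```

Running `ExpectedCounts.main` should print `List((1,2), (2,2), (3,2), (4,1))`, which matches `outputData` in the test above.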
I think this is also possibly related to
So there is this: https://github.com/cwensel/cascading-avro
Thanks for the fork and link. We're still trying to set up some discussion with the Scalding project to figure out what a potential upgrade path might look like. Since it's not my project, it's not clear to me what is a "must have" to support. I will try to get back once I have more info on what the plan is.
It makes sense to push the Avro fork out to Maven Central, but I don't have time to patch it since it's using Maven as the build. If you guys need it, feel free to push PRs to it so we can get it out. I also don't have a build server for it, so it will need one or more GitHub workflow actions as well. Note, the hardest part of pushing to Maven Central will be getting the private keys on the build server. We can collaborate on that bit.
@daniel-sudz has this been resolved?
@cwensel I'm going to close this. I think it's unlikely that Scalding will adopt Cascading 4.X in the near future, so I don't really have time to look into it.
I've been testing scalding with newer cascading as a demo on a branch here: daniel-sudz/scalding#1.
I currently have the following bad output:
It looks like there is some duplication going on, considering 3 > 2 and 2 > 1. I saw that there was some previous discussion around this when the cascading3 Scalding branch was being developed, before it got stalled: twitter/scalding#1592. The resolution there seemed to be a higher Hadoop version, so it's not really applicable here. I'm not sure where to begin debugging this, but would love some pointers.