merge to hadoop? #4

Closed
Tagar opened this issue Apr 28, 2015 · 9 comments

Comments

@Tagar

Tagar commented Apr 28, 2015

I'm surprised it's not yet part of the Apache Hadoop project :)
LZO is a pain to index. Plus it has some licensing issues.
Great project.

@carlomedas
Collaborator

Thanks for the good feedback.
On Hadoop 2.x you have the LZ4 codec by default, but its compression ratio is not configurable and it does not actually provide any splittability.
I would be happy to see this as a patch to Hadoop 2.x, but so far I have not even been able to get the attention of the ElephantBird guys to work on integrating 4mc into EB to replace LZO.

@Tagar
Author

Tagar commented Apr 29, 2015

I just emailed Cloudera folks to have a look and file a JIRA ticket to integrate it in.
Hopefully this will get integrated. Thanks a lot!

@carlomedas
Collaborator

Thanks!

@svravitej

Please let us know when it is integrated.

Waiting for integration with Hadoop.

@ianoc

ianoc commented Oct 21, 2015

EB as in elephantbird from twitter? Do you have a PR/issue to add support?

(Replacing isn't really an option for something like a serialization library, since people have TBs/PBs of data written with existing formats.)

@carlomedas
Collaborator

Yes, sorry, 'replacing' is wrong here; 'add support' makes much more sense.
I got in touch with some EB devs but never got positive feedback about the idea of integration, so I never opened a PR/issue on EB about it.

@ianoc

ianoc commented Oct 22, 2015

I think we'd be fine with the integration, though we @ twitter aren't super likely to use it. I'd like to try it out, but will probably do that outside EB. We have discussed getting off those container formats in EB, so if we were to migrate it would more likely be to something sequence file based for ourselves (which handles splitting regardless of compression). But I plan on trying out the extra options and such from 4mc to see how they perform for our existing lz4 use cases.

@carlomedas
Collaborator

Very good, let me know what you think and how you find it.
Moreover, I agree with your approach as well: using a protobuf container is not the best option from a performance point of view when you already have a super-packet containing other info. In our tests we saw a slight performance degradation when moving from our data blocks (compressed with LZ4 anyway) to EB/4mc (also inside C++ native code only). Of course it was more than acceptable w.r.t. the scalability we have in the Hadoop/EB architecture and, most of all, w.r.t. having the EB framework already coded and bug-free :)

@svravitej

Hi,

I don't think I am connected to this thread in any way.
Please remove me from the notifications.

Regards,
Ravitej

On Mon, Jul 25, 2016 at 5:53 AM, Carlo Medas notifications@github.com
wrote:

Closed #4.

Regards

RaviTej Somayajula
