Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-9650] [formats] add support for protobuf objects #7865

Closed

Conversation

amalakar
Copy link

@amalakar amalakar commented Feb 28, 2019

flink-protobuf

This library adds support to flink for running sql against protobuf objects. Flink as of now
supports avro and json files backed by JsonSchema only. To add support for sql, flink needs to know
the TypeInformation, this library provides TypeInformation for protobuf object.

It uses protobuf apis to retrieve fields and types of a prorobuf object and than provides the
field name, and type as a PojoField to flink.

Current limitations:

  • In protobuf object field names have underscore at the end like loggedAt_, so in the sql it needs
    to be referred as loggedAt_ instead of logged_at. This should be fixable in flink apis, but
    would need some digging around in the code. If we whitelist Message classes in PojoField that should help.

  • Some fields are not supported yet like Enum etc, but should be trivial to add support.

With this it is possible to run a query like the following in the stream of events:

SELECT region_,
       count(*)
FROM people
WHERE currentAge_ > 40
  AND region_ IN ('SFO',
                 'BKN')
GROUP BY region_

Note: I have been a bit hasty to get this out, as this was sitting in our internal repo for a while and I haven't had the time to clean it up to make it flink ready. But also wanted to get the code out if someone wants to work on it they can work off this code rather than working on it from scratch. We have been using this for close to an year in production. Due to other commitments I may not get a chance to work on coding style/review comments immediately, so wouldn't mind if someone wants to improve this before merge. For example some there are pending TODO items like enum support/change in PojoField to make the sql nicer (no underscore) etc.

(Apologize for not conforming to the coding style and the rest of the guidelines yet, hoping it is still useful as a beta version patch and someone may find this useful).

@flinkbot
Copy link
Collaborator

flinkbot commented Feb 28, 2019

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Review Progress

  • ✅ 1. The [description] looks good.
  • ✅ 2. There is [consensus] that the contribution should go into to Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.


The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required Bot commands
The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@amalakar amalakar force-pushed the FLINK-9650-add-protobuf-support branch from a59aa45 to a438c11 Compare February 28, 2019 20:11
Copy link
Contributor

@twalthr twalthr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @amalakar for contributing this format. I haven't had a deep look at your code but please try to align the changes with other formats. A good example is the recently added CSV format #7777. Most of the changes there should also exist in your PR. We need:

  • no full end-to-end tests
  • documentation instead of README
  • schema converter between Flink's schema and protobuf (deriveSchema())
  • consistent behavior

It might also be worth it to take a look at the AvroRowDerSe schema.

@twalthr
Copy link
Contributor

twalthr commented Mar 5, 2019

@flinkbot approve description
@flinkbot approve consensus

@zentol
Copy link
Contributor

zentol commented Nov 11, 2022

Subsumed by FLINK-18202.

@zentol zentol closed this Nov 11, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants