Skip to content

Conversation

@WadeWaldron
Copy link
Contributor

@WadeWaldron WadeWaldron commented Jun 25, 2024

Description

Implementing the second exercise (module 7) which focuses on writing basic Select statements in Flink.

Checklist

  • Unit tests created/updated for any new code (where applicable).
  • Run all tests with ./build.sh validate.
  • Update the CHANGELOG.md.
  • Update the README.md if necessary.

@cla-assistant
Copy link

cla-assistant bot commented Jun 25, 2024

CLA assistant check
All committers have signed the CLA.

@WadeWaldron WadeWaldron force-pushed the exercise-07 branch 2 times, most recently from 3b9fa6c to 783f5eb Compare June 25, 2024 20:47
@WadeWaldron WadeWaldron force-pushed the exercise-07 branch 3 times, most recently from 0f48c1c to 3278d80 Compare June 25, 2024 20:56
@WadeWaldron WadeWaldron marked this pull request as ready for review July 8, 2024 16:36
@WadeWaldron WadeWaldron requested a review from a team as a code owner July 8, 2024 16:36
Copy link
Member

@pmoskovi pmoskovi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice exercises. My comments are based on reading them. If you need someone going through the exercises after David & Martijn's reviews, let me know, happy to do it.


More importantly, it creates a foundation for future work.

**Note:** It's important to remember that the data in the table is streaming and unbounded. Once the query is executed it will run forever until it is terminated.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will it be clear to the reader at this point what the term unbounded means?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this describes streaming: Once the query is executed it will run forever until it is terminated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the user was just doing the exercises, then that might not be clear. If they are watching the lectures, then I hope it would be clear.

- Execute and return the result.

<details>
<summary>**Hint**</summary>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the expandable Hint here! Nice touch!


Modify `Marketplace.java` as follows.

- In the `main` method, create an instance of the `CustomerService`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When referring to a method, consider adding (), as in: main().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

main() wouldn't be the right signature though. I think using the backticks works to make it clear this is a code entity.

Modify `Marketplace.java` as follows.

- In the `main` method, create an instance of the `CustomerService`.
- Call the `allCustomers` method on the service to get a `TableResult`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

allCustomers()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment applies to other methods mentioned below.

You can implement a basic select statement as follows:

```
env.from("TABLE NAME")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like how Hints are just that. No solutions, just hints.
This made me think - would it make sense to add links to the actual solutions? Or they'll have it all cloned to their machines anyway, so it's not necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

During the first exercise, they will be reading through the README which explains how they can pull the solution if necessary.

Links to the solution are an interesting idea though. Let me think on that.


import static org.junit.jupiter.api.Assertions.*;

class CustomerServiceTest extends FlinkTableAPITest {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are written as "Integration" style tests. They talk to the real cluster. That's not ideal. Many of these tests take 30-90 seconds to execute which is incredibly slow.

However, I don't think we currently have a viable unit test strategy. I am open to alternatives that might help speed things up.

The question is, in the absence of a viable unit test strategy, is this suitable for at least doing internal enablement on the technology?

this.env = env;
}

public TableResult allCustomers() {
Copy link
Contributor Author

@WadeWaldron WadeWaldron Jul 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I needed a way to make these methods at least somewhat "testable". Most of the examples of the Table API consist of nothing but a "main" method which isn't testable.

I opted to create these methods inside of Services so that I could write tests that target individual Flink queries/jobs.

However, if there is a better/more common way to organize a Flink Table API application, please let me know what it is (ideally with an example) and I can see if I can adapt.


Arrays.stream(env.listTables()).forEach(System.out::println);

customers.allCustomers();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When these queries are executed there is nothing to shut them down. For queries like this, with no sink, they will automatically shutdown a short time after the application terminates.

For complete streams that include a sink, they never shut down and will run forever.

Is that desirable? How would we recommend people manage these jobs to make sure they don't leak a bunch of queries and consume resources unnecessarily? How can we prevent someone from running the same job multiple times by accident and causing duplicate data in their topic? What is our recommended way of managing the lifecycle of these jobs?

@WadeWaldron WadeWaldron deleted the exercise-07 branch August 21, 2024 15:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants