Skip to content

[FLINK-1486] add print method for prefixing a user defined string#372

Closed
mxm wants to merge 1 commit intoapache:masterfrom
mxm:flink-1486
Closed

[FLINK-1486] add print method for prefixing a user defined string#372
mxm wants to merge 1 commit intoapache:masterfrom
mxm:flink-1486

Conversation

@mxm
Copy link
Copy Markdown
Contributor

@mxm mxm commented Feb 6, 2015

  • extend API to include a print(String sinkIdentifier) method
  • change PrintingOutputformat to include the sink identifier
  • if appropriate, print sink identifier and task id

@rmetzger
Copy link
Copy Markdown
Contributor

rmetzger commented Feb 6, 2015

Looks good.

@fhueske
Copy link
Copy Markdown
Contributor

fhueske commented Feb 7, 2015

I think it would be nice to have some kind of hierarchical structure of the output such as:
$sinkName:$taskId > $outputValue
That would give the name of the sink first followed by the sink's task id, and finally behind the >prompt the actual output. Wouldn't that also make output parsing easier?

Looks good otherwise.

@mxm
Copy link
Copy Markdown
Contributor Author

mxm commented Feb 8, 2015

Good idea. So we would print $taskId > $outputValue if the user did not supply a string and $string:$taskId > $outputValue otherwise. If the parallelization degree is 1, we would just print $string > $outputValue if a string was supplied.

@fhueske
Copy link
Copy Markdown
Contributor

fhueske commented Feb 8, 2015

Sounds good to me!

@mxm mxm force-pushed the flink-1486 branch 2 times, most recently from f440c0e to 3d1c95c Compare February 9, 2015 13:02
@mxm
Copy link
Copy Markdown
Contributor Author

mxm commented Feb 9, 2015

To make it a bit more explicit what is sink identifier and what is the task identifier (especially when just one of the two are printed), I prefixed the sink identifier with "sinkId" and the task identifier with "taskId".

@uce
Copy link
Copy Markdown
Contributor

uce commented Feb 11, 2015

I think this is very valuable. I've tried it out and it looks good to me.

Personally, I would prefer the shorter versions proposed by Fabian. If we don't differentiate between parallelism 1 and > 1 we wouldn't have to worry about cases where just one is printed.

But I'm fine with your current solution as well.

@fhueske
Copy link
Copy Markdown
Contributor

fhueske commented Feb 11, 2015

+1 for conciseness.

@mxm mxm force-pushed the flink-1486 branch 2 times, most recently from c575134 to 71f318a Compare April 20, 2015 16:31
@mxm
Copy link
Copy Markdown
Contributor Author

mxm commented Apr 20, 2015

I've updated the pull request. I decided to implement the concise method:

sinkId:taskId> output  <- sink id provided, parallelism > 1
sinkId> output         <- sink id provided, parallelism == 1
taskId> output         <- no sink id provided, parallelism > 1
output                 <- no sink id provided, parallelism == 1

If no objections, I will merge this tomorrow.

@fhueske
Copy link
Copy Markdown
Contributor

fhueske commented Apr 20, 2015

LGTM

… string

- extend API to include a print(String sinkIdentifier) method
- change PrintingOutputformat to include the sink identifier
- if appropriate, print sink identifier and task id
- update documentation
@mxm
Copy link
Copy Markdown
Contributor Author

mxm commented Apr 21, 2015

I've added documentation for the new print method. Will merge later on.

@asfgit asfgit closed this in df7c61e Apr 22, 2015
@mxm mxm deleted the flink-1486 branch April 22, 2015 11:42
@StephanEwen
Copy link
Copy Markdown
Contributor

This may come a bit late (given that this is merged now), but I did not think of it before:

When we change the printing to happen on the client console (which we should, IMHO), we will probably realize it via collect().foreach(print) or so. In that case, there it gets executed immediately and there is always only one sink active.

Does this change still make sense then?

@fhueske
Copy link
Copy Markdown
Contributor

fhueske commented Apr 22, 2015

I think you are right. If there's only one sink active, there is no need for a sink identifier.

@mxm
Copy link
Copy Markdown
Contributor Author

mxm commented Apr 23, 2015

Do we want to break backwards compatibility or include a new method for printing on the client? After all, printing on the workers is a useful tool to debug the dataflow of a program.

@StephanEwen
Copy link
Copy Markdown
Contributor

Can you think of a case where printing on the client is worse than printing on worker?

@StephanEwen
Copy link
Copy Markdown
Contributor

From what I have seen, most people expect print() to actually go to the client.
It would be a break of backwards compatibility, agreed. Maybe one of the few that are necessary.

@mxm
Copy link
Copy Markdown
Contributor Author

mxm commented Apr 23, 2015

@StephanEwen Don't think printing on the client can be worse if the output still contains information about the producers (e.g. by a task id). IMO, a sink identifier could still make sense when you make multiple calls to print and want to distinguish easily between the outputs.

@StephanEwen
Copy link
Copy Markdown
Contributor

Ah, okay. You mean we have two methods:

  1. print() which just prints everything onto the client
  2. printWithTaskid(), which prefixes lines with the task ID?

@mxm
Copy link
Copy Markdown
Contributor Author

mxm commented Apr 24, 2015

Just saying that a prefix helps to identify output, even if everything is printed on the client. Additionally, including the task id in the output can be useful for debugging.

@StephanEwen
Copy link
Copy Markdown
Contributor

Okay, let's just try and not make this too confusing for users. Do we need all three versions?

  1. print()
  2. printWithParallelId()
  3. printWithPrefix()

@mxm
Copy link
Copy Markdown
Contributor Author

mxm commented Apr 24, 2015

Yes, it should be simple for the user. It makes sense to have one print method which just prints the output on the client. In addition, we could have another advanced print method which prints a prefix and optionally the task id.

  • print()
  • print(String prefix, boolean includeParallelID)

thvasilo pushed a commit to thvasilo/flink that referenced this pull request Apr 24, 2015
… string

- extend API to include a print(String sinkIdentifier) method
- change PrintingOutputformat to include the sink identifier
- if appropriate, print sink identifier and task id
- update documentation

This closes apache#372
bhatsachin pushed a commit to bhatsachin/flink that referenced this pull request May 5, 2015
… string

- extend API to include a print(String sinkIdentifier) method
- change PrintingOutputformat to include the sink identifier
- if appropriate, print sink identifier and task id
- update documentation

This closes apache#372
marthavk pushed a commit to marthavk/flink that referenced this pull request Jun 9, 2015
… string

- extend API to include a print(String sinkIdentifier) method
- change PrintingOutputformat to include the sink identifier
- if appropriate, print sink identifier and task id
- update documentation

This closes apache#372
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants