Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FLINK-3836] Add LongHistogram accumulator #1966

Closed
wants to merge 70 commits into from
Closed

Conversation

mbode
Copy link
Contributor

@mbode mbode commented May 6, 2016

New accumulator LongHistogram; the Histogram accumulator now throws an IllegalArgumentException instead of letting the int overflow.

  • General
    • The pull request references the related JIRA issue ("[FLINK-XXX] Jira title text")
    • The pull request addresses only one issue
    • Each commit in the PR has a meaningful commit message (including the JIRA id)
  • Documentation
    • Documentation has been added for new functionality
    • Old documentation affected by the pull request has been updated
    • JavaDoc for public methods has been added
  • Tests & Build
    • Functionality added by the pull request is covered by tests
    • mvn clean verify has been executed successfully locally or a Travis build has passed

@mbode mbode changed the title [Flink-3836] Add LongHistogram accumulator [FLINK-3836] Add LongHistogram accumulator May 8, 2016
@mbode mbode closed this May 11, 2016
@mbode mbode reopened this May 11, 2016
@@ -148,6 +149,11 @@ public Histogram getHistogram(String name) {
}

@Override
public LongHistogram getLongHistogram(String name) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should try and not have utility methods for each accumulator type in the runtime context - it becomes a lot otherwise. The methods for getHistogram() etc are also marked as public evolving, because they may possibly be removed in the future.

@StephanEwen
Copy link
Contributor

I like the addition. Two things, however, that I am not sure about:

  1. The Histogram uses a LongHistogram internally, which results in a lot of wrapping and converting. Each accumulator report (each heartbeat) needs to do the conversion. I think the Histogram should rather hold the proper (int, int) map directly.
  2. I am skeptical about failing hard in the Histogram on an integer overflow. These kind of hard failures in utility types are always tricky. A program causing an overflow will result in a non-recoverable failure (it will always overflow again). For streaming programs, that is not a nice property. I would actually rather try to deprecate the int histogram (let it overflow) and encourage to replace it over time completely with the long histogram.

@StephanEwen
Copy link
Contributor

Please let me know what you think about these suggestions!

@mbode
Copy link
Contributor Author

mbode commented May 19, 2016

Hi Stephan, sounds good.

  1. I wanted to avoid too much duplication, but I see your point.
  2. Ok, throwing a new exception breaks the API. So should I just mark Histogram as deprecated? I guess the proper way would be to make Histogram generic, enabling users to instantiate Histogram<Long>. Again, this breaks the API, so we would have to wait for the next major release – what is the process for cases like that?

@StephanEwen
Copy link
Contributor

I think for now, we should add a @Deprecated annotation to the Histogram and its related method on RuntimeContext and mention in the comment that we encourage the LongHistogram instead.

On a major release, we should go through all deprecated parts and remove them.

uce and others added 21 commits May 27, 2016 15:47
Graph algorithms for annotating
  vertex degree for undirected graphs
  vertex out-, in-, and out- and in-degree for directed graphs
  edge source, target, and source and target degree for undirected graphs
…nslators

The TranslateFunction interface is similar to MapFunction but may be
called multiple times before serialization.

This closes apache#1968
* ubuntu trusty->xenial
* jdk 7u51 -> 8u91
* flink 0.10.1 -> 1.0.2

This closes apache#1969
- replaced CharSets with StandardCharsets
- added checkElementIndex to Flink Preconditions
- replaced Guava Preconditions with Flink Preconditions
- removed single usages Ints.max() and Joiner()

This closes apache#1938
Depending on the context, the ExecutionConfig's type fields may either
be deserialized using a custom class loader or the default class
loader. It may be explicitly serialized for the Task or shipped inside
the PojoSerializer where it is serialized or directly passed in local
mode. An ExecutionConfig may be reused and thus its fields can't be set
to null after it has been shipped once.

The entire ExecutionConfig is now serialized upon setting it on the
JobGraph. It is not passed through the JobGraph's constructor but set
explicitly on the JobGraph. If no ExecutionConfig has been set, the
default is used. Unlike before, no code may modify the ExecutionConfig
after it has been set on the JobGraph.

This closes apache#1913
Addition to bbd02d2.

The java.lang.Date type shouldn't be an automatically Kryo registered
anymore.
The local clustering coefficient measures the connectedness of each
vertex's neighborhood. Scores range from 0.0 (no edges between
neighbors) to 1.0 (neighborhood is a clique).

This closes apache#1896
greghogan and others added 26 commits May 27, 2016 15:47
The Jaccard Index measures the similarity between vertex neighborhoods.
Scores range from 0.0 (no shared neighbors) to 1.0 (all neighbors are
shared).

This closes apache#1980
…e code

  - Metric groups are generally thread-safe
  - Metric groups are closable. Closed groups do not register metric objects and more.
  - TaskManager's JobMetricsGroup auto disposes when all TaskMetricGroups are closed
  - Maven project with metric reporters renamed to 'flink-metric-reporters'
  - Various code style cleanups
- introduce a unique container id independent of the Hadoop version
- improve printing of exceptions during registration
- minor improvements to the Yarn ResourceManager code

This closes apache#2013
After 38698c0, there are now two
executions defined for the Surefire plugin: unit-tests and
integration-tests. In addition, there is an implicit default execution
called default-test. This leads to the unit tests to be executed
twice. This renames unit-tests to default-test to prevent duplicate
execution.

This closes apache#2019
Until FLINK-3960 is fixed, we need to disable this test to allow other
tests to execute properly.

This closes apache#2022
We should use java.util.concurrent.ConcurrentHashMap because Netty's
ConcurrentHashMap is not available for Hadoop 1. Also, Netty's ConcurrentHashMap
is merely a copy of Java's to support Java versions prior 1.5.
- Fix FLINK-3696 (type issues of DataSetUnion by forwarding expected types to input operators).

This closes apache#2025
Only tested behavior on int overflow.
Recommend LongHistogram instead.
@mbode
Copy link
Contributor Author

mbode commented May 27, 2016

Sorry guys, botched the PR :/

@mbode mbode closed this May 27, 2016
@greghogan
Copy link
Contributor

I was interested to see what happened here and a simple rebase and force push corrects the problem.

Make sure local master is up-to-date
$ git checkout master
$ git pull apache

Fetch this PR and checkout the branch
$ git fetch github pull/1966/head:pr1966
$ git checkout pr1966

Move the new commits after the last commit on master
$ git rebase master

Push the changes to your repo
$ git push -f pr1966 origin

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet