Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[wip] [FLINK-2288] [FLINK-2302] Setup ZooKeeper for distributed coordination #886

Closed
wants to merge 2 commits into from

Conversation

uce
Copy link
Contributor

@uce uce commented Jul 3, 2015

  • FLINK-2288: Setup ZooKeeper for distributed coordination
    • Add FlinkZooKeeperQuorumPeer to wrap ZooKeeper's quorum peers with
      utilities to write required config values (default datadir, myid)
    • Add default conf/zoo.cfg config for ZooKeeper
    • Add startup scripts for ZooKeeper quorum
    • Add conf/masters file for HA masters
    • @rmetzger This PR includes docs ;-)
  • FLINK-2302: Allow multiple instances to run on single host
    • Multiple TaskManager and JobManager instances can run on a single
      host.

@tillrohrmann, you can base your changes on this branch. After that we can close this PR. I've added TODOs in TaskManager and JobManager, where you need to integrate your leader election/retrieval service.

From the docs:

Example: Start and stop a local HA-cluster with 2 JobManagers

  1. Configure ZooKeeper quorum in conf/flink.yaml:

    ha.zookeeper.quorum: localhost
  2. Configure masters in conf/masters:

    localhost
    localhost
  3. Configure ZooKeeper server in conf/zoo.cfg (currently it's only possible to run a single ZooKeeper server per machine, because there is a single client port per configuration):

    server.0=localhost:2888:3888
  4. Start ZooKeeper quorum:

    $ bin/start-zookeeper-quorum.sh
    Starting zookeeper daemon on host localhost.
  5. Start an HA-cluster:

    $ bin/start-cluster-streaming.sh
    Starting HA cluster (streaming mode) with 2 masters and 1 peers in ZooKeeper quorum.
    Starting jobmanager daemon on host localhost.
    Starting jobmanager daemon on host localhost.
    Starting taskmanager daemon on host localhost.
  6. Stop ZooKeeper quorum and cluster:

    $ bin/stop-cluster.sh
    Stopping taskmanager daemon (pid: 7647) on localhost.
    Stopping jobmanager daemon (pid: 7495) on host localhost.
    Stopping jobmanager daemon (pid: 7349) on host localhost.
    $ bin/stop-zookeeper-quorum.sh
    Stopping zookeeper daemon (pid: 7101) on host localhost.

uce added 2 commits July 3, 2015 16:47
- FLINK-2288: Setup ZooKeeper for distributed coordination
  * Add FlinkZooKeeperQuorumPeer to wrap ZooKeeper's quorum peers with
    utilities to write required config values (default datadir, myid)
  * Add default conf/zoo.cfg config for ZooKeeper
  * Add startup scripts for ZooKeeper quorum
  * Add conf/masters file for HA masters

- FLINK-2302: Allow multiple instances to run on single host
  * Multiple TaskManager and JobManager instances can run on a single
    host.
server.Y=addressY:peerPort:leaderPort
</pre>

The script `bin/start-zookeeper-quorum.sh` will start a ZooKeeper server on each of the configured hosts. The started processes start ZooKeeper servers via a Flink wrapper, which reads the configuration from `conf/zoo.cfg` and makes sure to set some rqeuired configuration values for convenience. In production setups, it is recommended to manage your own ZooKeeper installation.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: rqeuired -> required

@StephanEwen
Copy link
Contributor

Looks like a good piece of work!

Can we actually get into ZooKeeper version conflicts here? For example, the Kafka connector needs a certain Zookeeper version. Can it conflict with our Zookeeper version?

@StephanEwen
Copy link
Contributor

I will address the comments and merge this...

@StephanEwen
Copy link
Contributor

BTW: It is beautiful how easy it is to start multiple TaskManagers and JobManagers on one machine now :-)

@StephanEwen
Copy link
Contributor

We need to find a solution for the webfrontend, though. Starting it on random ports is not a real solution. Also the user that wants to see progress needs to connect manually to the web server of the leader JobManager.

StephanEwen pushed a commit to StephanEwen/flink that referenced this pull request Jul 8, 2015
@asfgit asfgit closed this in 9c0dd97 Jul 8, 2015
@mxm
Copy link
Contributor

mxm commented Jul 9, 2015

Did we find a solution for the random port problem?

@StephanEwen
Copy link
Contributor

In HA mode, JobManagers start with a random free port. That is fine, because no one connects to them based on a config value, but only based on ZooKeeper entries.

@mxm
Copy link
Contributor

mxm commented Jul 9, 2015

Yes that makes sense. So the user will always have to connect to the web interface of the leading job manager, right? We could only circumvent that by separating the web interface from the job manager.

@tillrohrmann
Copy link
Contributor

The web interface is, modulo some object which are not serializable,
already independent of the JobManager. It should not be a big problem to
only have one web server which also retrieves the leading JobManager from
ZooKeeper and then serves the information from the leader.

On Thu, Jul 9, 2015 at 4:24 PM, Max notifications@github.com wrote:

Yes that makes sense. So the user will always have to connect to the web
interface of the leading job manager, right? We could only circumvent that
by separating the web interface from the job manager.


Reply to this email directly or view it on GitHub
#886 (comment).

@mxm
Copy link
Contributor

mxm commented Jul 9, 2015

Alright, I've opened a JIRA for this: https://issues.apache.org/jira/browse/FLINK-2340

mxm pushed a commit to mxm/flink that referenced this pull request Jul 14, 2015
shghatge pushed a commit to shghatge/flink that referenced this pull request Aug 8, 2015
@uce uce deleted the zk-2288 branch August 24, 2015 15:32
nikste pushed a commit to nikste/flink that referenced this pull request Sep 29, 2015
nltran pushed a commit to nltran/flink that referenced this pull request Jan 8, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants