This repository has been archived by the owner on Jul 10, 2024. It is now read-only.

SUBMARINE-254. Workbench server support cluster mode #59

Closed
wants to merge 1 commit into from

Member

xunliu commented Oct 21, 2019

What is this PR for?

Workbench Server mainly serves the Submarine Workbench web UI, which provides algorithm users with algorithm development, Python/Spark interpreter operation, and other services through Notebook.

The workbench server allows multiple workbench servers to form a service cluster by integrating the cluster components.

The goal of the Submarine project is to provide highly available and reliable services for big data processing, algorithm development, job scheduling, model online services, and batch and incremental model updates. In addition to the high availability of the big data and machine learning frameworks, the high availability of Submarine Server and Workbench Server themselves is a key consideration.

Design Doc: https://docs.google.com/document/d/1Ax6FQ5CAP-jowm2_Mp2r5kc9r7s1bkFLRfzvvbm5Wzc/edit#

What type of PR is it?

[Feature]

What is the Jira issue?

How should this be tested?

  • CI Pass
  • TestWorkbenchClusterServer.java

Screenshots (if appropriate)

Questions:

  • Do the license files need an update? No
  • Are there breaking changes for older versions? No
  • Does this need documentation? Yes

# Submarine Cluster Server Design

## Introduction
The Submarine system contains a total of two Server services, Submarine Server and Workbench Server, which are long-running in the form of Daemon.
Contributor

How about we change this to:
The Submarine system contains a total of two daemon services, Submarine Server and Workbench Server.

Member Author

Done.

Among them, Submarine Server mainly provides job submission, job scheduling, job status monitoring, and model online service for Submarine.
Contributor

How about we delete "Among them"

Member Author

Done.


Workbench Server is mainly for Submarine Workbench WEB is mainly for algorithm users to provide algorithm development, Python/Spark interpreter operation and other services through Notebook.
Contributor

This one, "Workbench Server is mainly for Submarine Workbench WEB is mainly for". There is something missing after "Workbench Server is mainly for"

Member Author

Done.


Multiple Submarine (or Workbench) Server processes form a Submarine Cluster through the RAFT algorithm library. The cluster internally maintains a metadata center, and every server can operate on the metadata. The RAFT algorithm ensures that when multiple processes modify the same data at the same time, the changes do not overwrite one another or leave dirty data.

This metadata center stores data by means of key-value pairs. It is very easy to use a variety of data, but it should be noted that metadata is only suitable for storing small amounts of data and cannot be used to replace data storage.
Contributor

"It is very easy to use a variety of data", do you mean it can store/support a variety of data?

Member Author

Done.


### Disadvantages

Because the RAFT algorithm requires more than half of the servers available to ensure the normality of the RAFT algorithm, if we need to turn on the clustering capabilities of Submarine (Workbench) Server, when more than half of the servers are unavailable, some programs may appear. Abnormal, of course, we also detected this in the system, downgrading the system or refusing to provide service status.
Contributor

Is it ok to modify "some programs may appear. Abnormal, of course, " to "some programs may appear abnormal. Of course,"

Member Author

Done.
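
As a small illustration of the majority requirement discussed in this thread, the sketch below computes the quorum size for a cluster and reports whether enough servers are reachable. The class and method names (`QuorumCheck`, `quorumSize`, `hasQuorum`) are hypothetical and are not part of Submarine or Atomix; how the PR actually detects the condition and downgrades the service may differ.

```java
// Hypothetical helper for illustration only; not part of Submarine or Atomix.
final class QuorumCheck {
  private QuorumCheck() {}

  // Raft needs a strict majority of the configured members: floor(n / 2) + 1.
  static int quorumSize(int clusterSize) {
    if (clusterSize <= 0) {
      throw new IllegalArgumentException("clusterSize must be positive");
    }
    return clusterSize / 2 + 1;
  }

  // True when enough servers are reachable for Raft to keep committing writes.
  static boolean hasQuorum(int clusterSize, int availableServers) {
    return availableServers >= quorumSize(clusterSize);
  }
}
```

With three servers, one failure is tolerated (two of three still form a majority). With only one of three available, `hasQuorum` returns false, which is the point at which a server could degrade or refuse service.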

### Cluster monitoring

The cluster needs to monitor whether the Submarine-Server and Submarine-Interpreter processes are working properly.
The Submarine-Server and Submarine-Interpreter processes periodically send heartbeats to update their own timestamps in the cluster metadata. The Submarine-Server with Leader identity periodically checks the timestamps of the Submarine-Server and Submarine-Interpreter processes to clear the timeout service. And process;
Contributor

Modify "the timeout service. And process;" to "the timeout services and processes."

Member Author

Done.
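
The heartbeat scheme in this hunk (each process refreshes its own timestamp, and the leader clears entries that have timed out) could be sketched roughly as follows. This is a minimal in-memory stand-in under stated assumptions: `HeartbeatRegistry` and its methods are hypothetical names, and a plain map replaces the Raft-replicated cluster metadata that the PR uses. Timestamps are passed in explicitly so the logic is easy to exercise.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the cluster metadata; the real data would live
// in the Raft-replicated store rather than in a local map.
final class HeartbeatRegistry {
  private final Map<String, Long> lastSeenMillis = new ConcurrentHashMap<>();
  private final long timeoutMillis;

  HeartbeatRegistry(long timeoutMillis) {
    this.timeoutMillis = timeoutMillis;
  }

  // Called periodically by every server or interpreter process.
  void heartbeat(String processId, long nowMillis) {
    lastSeenMillis.put(processId, nowMillis);
  }

  // Called periodically by the leader: removes and returns every process
  // whose last heartbeat is older than the timeout.
  List<String> evictTimedOut(long nowMillis) {
    List<String> evicted = new ArrayList<>();
    for (Map.Entry<String, Long> entry : lastSeenMillis.entrySet()) {
      if (nowMillis - entry.getValue() > timeoutMillis) {
        evicted.add(entry.getKey());
      }
    }
    for (String id : evicted) {
      lastSeenMillis.remove(id);
    }
    return evicted;
  }

  boolean isAlive(String processId) {
    return lastSeenMillis.containsKey(processId);
  }
}
```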

1. The cluster monitoring module runs in each Submarine-Server and Submarine-Interpreter process and periodically sends heartbeat data for that service or process to the cluster;
2. When the cluster monitoring module runs in Submarine-Server, it collects the CPU and memory usage of the server and sends the resource usage to the cluster's ClusterStateMachine. When an interpreter process needs to be created in the cluster, it is created on the server with the most idle resources.
Contributor

Maybe we need to delete cluster monitor

Member Author

Done.
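
Item 2 in this hunk has each server report its CPU and memory usage so that a new interpreter process can be placed on a lightly loaded server. A minimal sketch of that selection step is shown below. The names `ServerLoad` and `IdleServerSelector` are hypothetical, and the equal weighting of CPU and memory is an assumption made for illustration; the source does not say how the two figures are combined.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical snapshot of one server's reported resource usage.
final class ServerLoad {
  final String host;
  final double cpuUsage;    // fraction in [0.0, 1.0]
  final double memoryUsage; // fraction in [0.0, 1.0]

  ServerLoad(String host, double cpuUsage, double memoryUsage) {
    this.host = host;
    this.cpuUsage = cpuUsage;
    this.memoryUsage = memoryUsage;
  }

  // Assumption: CPU and memory are weighted equally.
  double load() {
    return (cpuUsage + memoryUsage) / 2.0;
  }
}

final class IdleServerSelector {
  private IdleServerSelector() {}

  // Returns the server with the lowest combined load, or empty if none reported.
  static Optional<ServerLoad> pickMostIdle(List<ServerLoad> servers) {
    return servers.stream().min(Comparator.comparingDouble(ServerLoad::load));
  }
}
```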


### Atomix Raft algorithm library

In order to reduce the deployment complexity of distributed mode, submarine server did not use zookeeper to build a distributed cluster. multiple submarine server groups are built into distributed clusters by building the Raft algorithm in submarine server. the Raft algorithm selects the algorithm library of atomix that has passed Jepsen consistency verification.
Contributor

Modify “submarine server did not” to "submarine server does not".
Modify "multiple submarine server" to "Multiple submarine servers".
Modify "by building the Raft" to "by using the Raft"
Modify "the Raft algorithm selects the algorithm library of atomix" to "The Raft algorithm is involved by atomix lib"

Member Author

Done.


### Synchronize workbench notes

In cluster mode, the user creates, modifies, and deletes the note on any of the servers. all need to be notified to all the servers in the cluster to synchronize the update of Notebook. failure to do so will result in the user not being able to continue while switching to another server.
Contributor

The first letter needs to be upper-case.

Member Author

Done.


### Listen for note update events

Listen for the NEW_NOTE、DEL_NOTE、REMOVE_NOTE_TO_TRASH ... event of the notebook in the NotebookServer#onMessage() function.
Contributor

Modify "NEW_NOTE、DEL_NOTE、REMOVE_NOTE_TO_TRASH" to "NEW_NOTE, DEL_NOTE, REMOVE_NOTE_TO_TRASH"

Member Author

Done.
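
The note synchronization described in the last two hunks (listen for note events, then notify every other server) might look roughly like the sketch below. Only the event names `NEW_NOTE`, `DEL_NOTE`, and `REMOVE_NOTE_TO_TRASH` come from the design text. `NoteOp.OTHER`, `ClusterPeer`, and `NoteEventBroadcaster` are hypothetical names introduced for illustration, and the real `NotebookServer#onMessage()` signature is not shown in this PR page.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// OTHER is a hypothetical placeholder for events that need no synchronization.
enum NoteOp { NEW_NOTE, DEL_NOTE, REMOVE_NOTE_TO_TRASH, OTHER }

// Hypothetical abstraction over "another server in the cluster".
interface ClusterPeer {
  void send(NoteOp op, String noteId);
}

final class NoteEventBroadcaster {
  // Events that change the set of notes and therefore must reach every server.
  private static final Set<NoteOp> SYNCED_OPS =
      EnumSet.of(NoteOp.NEW_NOTE, NoteOp.DEL_NOTE, NoteOp.REMOVE_NOTE_TO_TRASH);

  private final List<ClusterPeer> peers;

  NoteEventBroadcaster(List<ClusterPeer> peers) {
    this.peers = new ArrayList<>(peers);
  }

  // Forwards a note event to all peers; returns how many peers were notified.
  int onMessage(NoteOp op, String noteId) {
    if (!SYNCED_OPS.contains(op)) {
      return 0;
    }
    for (ClusterPeer peer : peers) {
      peer.send(op, noteId);
    }
    return peers.size();
  }
}
```

Filtering on an explicit set of operations keeps unrelated messages off the cluster channel, so only changes that other servers must see are broadcast.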

@xunliu force-pushed the SUBMARINE-254 branch 3 times, most recently from 30f5248 to ff4941c on October 22, 2019 11:48
Member Author

xunliu commented Oct 22, 2019

@asfgit asfgit closed this in 186a8dc Oct 22, 2019