Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Document the clustering #302

Merged
merged 5 commits into from
Jul 18, 2023
Merged

Document the clustering #302

merged 5 commits into from
Jul 18, 2023

Conversation

hanahmily
Copy link
Contributor

tldr

  • Meta nodes hold active nodes and shard mapping info.

  • Liaison nodes shard data based on real-time shard mapping.

  • Query nodes retrieve data from all active nodes without shard mapping info.

  • Update the CHANGES log.

Signed-off-by: Gao Hongtao <hanahmily@gmail.com>
@hanahmily hanahmily added the documentation Improvements or additions to documentation label Jul 18, 2023
@hanahmily hanahmily added this to the 0.5.0 milestone Jul 18, 2023
@codecov-commenter
Copy link

codecov-commenter commented Jul 18, 2023

Codecov Report

Merging #302 (e88d47a) into main (c06b5e1) will decrease coverage by 0.02%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##             main     #302      +/-   ##
==========================================
- Coverage   39.96%   39.94%   -0.02%     
==========================================
  Files         100      100              
  Lines       10868    10868              
==========================================
- Hits         4343     4341       -2     
- Misses       6107     6109       +2     
  Partials      418      418              

see 1 file with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

docs/concept/clustering.md Outdated Show resolved Hide resolved
docs/concept/clustering.md Show resolved Hide resolved
docs/concept/clustering.md Outdated Show resolved Hide resolved
lujiajing1126
lujiajing1126 previously approved these changes Jul 18, 2023
Copy link
Contributor

@lujiajing1126 lujiajing1126 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

----------------- ----------------- -----------------
| Data Node 1 | | Data Node 2 | | Data Node 3 |
| (Shard 1) | | (Shard 2) | | (Shard 3) |
----------------- ----------------- -----------------
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead, it delegates the task of replication to these underlying storage systems.

I think diagram is not complete, should we add more detail diagram about how to delegates the task of replication to these underlying storage systems?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

like :
image

Copy link
Contributor Author

@hanahmily hanahmily Jul 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense. You could push a "suggestion" to update the text diagram.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry. The data node doesn't necessarily have to be built on "shared" storage. It just requires a robust one in case of a potential single point of failure. Therioally, the whole architecture is shared-nothing instead of shared storage.

docs/concept/clustering.md Outdated Show resolved Hide resolved
@wu-sheng
Copy link
Member

The design is good enough for now.

I recommend to add principles/philosophy for why the arch looks like this.

Such as

  1. Much lower query traffic compared with reading. So, all nodes oriented query works
  2. With highly adopted cloud native and public cloud vendors tech stack, network add storage is reliable and popular, so replication counts on that.
    .... More you discussed and confirmed

These are fundamentals of why we did these choices, which are more important. We could have countless optimizations with time and experiences, but these are rarely to be changed as they only close to our use cases, which is skywalking itself.

@hanahmily
Copy link
Contributor Author

The design is good enough for now.

I recommend to add principles/philosophy for why the arch looks like this.

Such as

  1. Much lower query traffic compared with reading. So, all nodes oriented query works
  2. With highly adopted cloud native and public cloud vendors tech stack, network add storage is reliable and popular, so replication counts on that.
    .... More you discussed and confirmed

These are fundamentals of why we did these choices, which are more important. We could have countless optimizations with time and experiences, but these are rarely to be changed as they only close to our use cases, which is skywalking itself.

done

@hailin0
Copy link
Contributor

hailin0 commented Jul 18, 2023

LGTM

@wu-sheng wu-sheng merged commit 056624a into main Jul 18, 2023
13 checks passed

### 6.1 Query Routing

Query Nodes differ from Liaison Nodes in that they do not store shard mapping information from Meta Nodes. Instead, they access all Data Nodes to retrieve the necessary data for queries. As the query load is lower, it is practical for query nodes to access all data nodes for this purpose. It may increase network traffic, but simplifies scaling out of the cluster.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be necessary to explain the impact of the datanodes scale on the long tail of the query

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could. Side effects of this solution is there for sure. No matter we wrote or not.

Copy link
Contributor Author

@hanahmily hanahmily Jul 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This pattern is good at batching and aggregation. Some scenarios can leverage it:

  • Ad-hoc "TopN" is a classic aggregation operation. OAP's service and instance TopN metric query will benefit from this pattern.
  • "MultiGet" is a batching operation on several entities, which is natural to retrieve data from all data nodes.
  • A Skywalking UI's classic Dashboard will fetch several metrics belonging to an entity. The OAP generates several query operations to fetch data from DB. If it can combine them to issue a batch operation, BanyanDB will perform better than the one with several separate queries.

The scenarios serve as examples to illustrate the potential of "stateless query". As core contributors, we must possess a thorough understanding of this concept.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation
Projects
None yet
6 participants