Minor improvements in docs build and content #9752

Merged (7 commits) on Mar 19, 2020
557 changes: 1 addition & 556 deletions docs/en/operations/performance/sampling_query_profiler.md

Large diffs are not rendered by default.

5 changes: 2 additions & 3 deletions docs/en/operations/table_engines/aggregatingmergetree.md
@@ -1,4 +1,3 @@

# AggregatingMergeTree

The engine inherits from [MergeTree](mergetree.md#table_engines-mergetree), altering the logic for data parts merging. ClickHouse replaces all rows with the same primary key (or more accurately, with the same [sorting key](mergetree.md)) with a single row (within a one data part) that stores a combination of states of aggregate functions.
@@ -53,7 +52,7 @@ All of the parameters have the same meaning as in `MergeTree`.
To insert data, use [INSERT SELECT](../../query_language/insert_into.md) query with aggregate -State- functions.
When selecting data from `AggregatingMergeTree` table, use `GROUP BY` clause and the same aggregate functions as when inserting data, but using `-Merge` suffix.

In the results of `SELECT` query the values of `AggregateFunction` type have implementation-specific binary representation for all of the ClickHouse output formats. If dump data into, for example, `TabSeparated` format with `SELECT` query then this dump can be loaded back using `INSERT` query.
In the results of `SELECT` query, the values of `AggregateFunction` type have implementation-specific binary representation for all of the ClickHouse output formats. If dump data into, for example, `TabSeparated` format with `SELECT` query then this dump can be loaded back using `INSERT` query.
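
A minimal sketch of this `-State` / `-Merge` round trip, complementing the materialized-view example below (the `agg_visits` table name and the exact column types are illustrative assumptions, not part of this diff):

```sql
-- Illustrative: a table that stores aggregate function states.
CREATE TABLE agg_visits
(
    CounterID UInt32,
    StartDate Date,
    Visits AggregateFunction(sum, Int8),
    Users AggregateFunction(uniq, UInt64)
)
ENGINE = AggregatingMergeTree()
ORDER BY (CounterID, StartDate);

-- INSERT SELECT with -State combinators writes partial aggregation states.
INSERT INTO agg_visits
SELECT CounterID, StartDate, sumState(Sign), uniqState(UserID)
FROM test.visits
GROUP BY CounterID, StartDate;

-- Reading requires GROUP BY and the matching -Merge combinators.
SELECT CounterID, StartDate, sumMerge(Visits) AS TotalVisits, uniqMerge(Users) AS UniqueUsers
FROM agg_visits
GROUP BY CounterID, StartDate;
```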

## Example of an Aggregated Materialized View

@@ -71,7 +70,7 @@ FROM test.visits
GROUP BY CounterID, StartDate;
```

Inserting of data into the `test.visits` table.
Inserting data into the `test.visits` table.

```sql
INSERT INTO test.visits ...
18 changes: 9 additions & 9 deletions docs/en/operations/table_engines/collapsingmergetree.md
@@ -4,7 +4,7 @@ The engine inherits from [MergeTree](mergetree.md) and adds the logic of rows co

`CollapsingMergeTree` asynchronously deletes (collapses) pairs of rows if all of the fields in a sorting key (`ORDER BY`) are equivalent excepting the particular field `Sign` which can have `1` and `-1` values. Rows without a pair are kept. For more details see the [Collapsing](#table_engine-collapsingmergetree-collapsing) section of the document.

The engine may significantly reduce the volume of storage and increase efficiency of `SELECT` query as a consequence.
The engine may significantly reduce the volume of storage and increase the efficiency of `SELECT` query as a consequence.

## Creating a Table

@@ -63,7 +63,7 @@ Consider the situation where you need to save continually changing data for some

Use the particular column `Sign`. If `Sign = 1` it means that the row is a state of an object, let's call it "state" row. If `Sign = -1` it means the cancellation of the state of an object with the same attributes, let's call it "cancel" row.

For example, we want to calculate how much pages users checked at some site and how long they were there. At some moment of time we write the following row with the state of user activity:
For example, we want to calculate how much pages users checked at some site and how long they were there. At some moment we write the following row with the state of user activity:

```text
┌──────────────UserID─┬─PageViews─┬─Duration─┬─Sign─┐
@@ -80,7 +80,7 @@ At some moment later we register the change of user activity and write it with t
└─────────────────────┴───────────┴──────────┴──────┘
```

The first row cancels the previous state of the object (user). It should copy the sorting key fields of the canceled state excepting `Sign`.
The first row cancels the previous state of the object (user). It should copy the sorting key fields of the cancelled state excepting `Sign`.

The second row contains the current state.

@@ -100,7 +100,7 @@ Why we need 2 rows for each change read in the [Algorithm](#table_engine-collaps
**Peculiar properties of such approach**

1. The program that writes the data should remember the state of an object to be able to cancel it. "Cancel" string should contain copies of the sorting key fields of the "state" string and the opposite `Sign`. It increases the initial size of storage but allows to write the data quickly.
2. Long growing arrays in columns reduce the efficiency of the engine due to load for writing. The more straightforward data, the higher efficiency.
2. Long growing arrays in columns reduce the efficiency of the engine due to load for writing. The more straightforward data, the higher the efficiency.
3. The `SELECT` results depend strongly on the consistency of object changes history. Be accurate when preparing data for inserting. You can get unpredictable results in inconsistent data, for example, negative values for non-negative metrics such as session depth.

### Algorithm {#table_engine-collapsingmergetree-collapsing-algorithm}
@@ -110,11 +110,11 @@ When ClickHouse merges data parts, each group of consecutive rows with the same
For each resulting data part ClickHouse saves:

1. The first "cancel" and the last "state" rows, if the number of "state" and "cancel" rows matches and the last row is a "state" row.
2. The last "state" row, if there is more "state" rows than "cancel" rows.
3. The first "cancel" row, if there is more "cancel" rows than "state" rows.
2. The last "state" row, if there are more "state" rows than "cancel" rows.
3. The first "cancel" row, if there are more "cancel" rows than "state" rows.
4. None of the rows, in all other cases.

In addition when there is at least 2 more "state" rows than "cancel" rows, or at least 2 more "cancel" rows then "state" rows, the merge continues, but ClickHouse treats this situation as a logical error and records it in the server log. This error can occur if the same data were inserted more than once.
Also when there are at least 2 more "state" rows than "cancel" rows, or at least 2 more "cancel" rows then "state" rows, the merge continues, but ClickHouse treats this situation as a logical error and records it in the server log. This error can occur if the same data were inserted more than once.

Thus, collapsing should not change the results of calculating statistics.
Changes gradually collapsed so that in the end only the last state of almost every object left.
@@ -123,7 +123,7 @@ The `Sign` is required because the merging algorithm doesn't guarantee that all

To finalize collapsing, write a query with `GROUP BY` clause and aggregate functions that account for the sign. For example, to calculate quantity, use `sum(Sign)` instead of `count()`. To calculate the sum of something, use `sum(Sign * x)` instead of `sum(x)`, and so on, and also add `HAVING sum(Sign) > 0`.

The aggregates `count`, `sum` and `avg` could be calculated this way. The aggregate `uniq` could be calculated if an object has at least one state not collapsed. The aggregates `min` and `max` could not be calculated because `CollapsingMergeTree` does not save values history of the collapsed states.
The aggregates `count`, `sum` and `avg` could be calculated this way. The aggregate `uniq` could be calculated if an object has at least one state not collapsed. The aggregates `min` and `max` could not be calculated because `CollapsingMergeTree` does not save the values history of the collapsed states.

If you need to extract data without aggregation (for example, to check whether rows are present whose newest values match certain conditions), you can use the `FINAL` modifier for the `FROM` clause. This approach is significantly less efficient.
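
For instance, both approaches against the `UAct` table used in the example below (columns as shown earlier on this page) look roughly like this:

```sql
-- Finalize collapsing at query time: every aggregate accounts for Sign.
SELECT
    UserID,
    sum(PageViews * Sign) AS PageViews,
    sum(Duration * Sign) AS Duration
FROM UAct
GROUP BY UserID
HAVING sum(Sign) > 0;

-- Less efficient alternative: collapse rows at read time with FINAL.
SELECT * FROM UAct FINAL;
```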

@@ -182,7 +182,7 @@ SELECT * FROM UAct

What do we see and where is collapsing?

With two `INSERT` queries, we created 2 data parts. The `SELECT` query was performed in 2 threads, and we got a random order of rows. Collapsing not occurred because there was no merge of the data parts yet. ClickHouse merges data part in an unknown moment of time which we can not predict.
With two `INSERT` queries, we created 2 data parts. The `SELECT` query was performed in 2 threads, and we got a random order of rows. Collapsing not occurred because there was no merge of the data parts yet. ClickHouse merges data part in an unknown moment which we can not predict.

Thus we need aggregation:

@@ -2,7 +2,7 @@

Partitioning is available for the [MergeTree](mergetree.md) family tables (including [replicated](replication.md) tables). [Materialized views](materializedview.md) based on MergeTree tables support partitioning, as well.

A partition is a logical combination of records in a table by a specified criterion. You can set a partition by an arbitrary criterion, such as by month, by day, or by event type. Each partition is stored separately in order to simplify manipulations of this data. When accessing the data, ClickHouse uses the smallest subset of partitions possible.
A partition is a logical combination of records in a table by a specified criterion. You can set a partition by an arbitrary criterion, such as by month, by day, or by event type. Each partition is stored separately to simplify manipulations of this data. When accessing the data, ClickHouse uses the smallest subset of partitions possible.

The partition is specified in the `PARTITION BY expr` clause when [creating a table](mergetree.md#table_engine-mergetree-creating-a-table). The partition key can be any expression from the table columns. For example, to specify partitioning by month, use the expression `toYYYYMM(date_column)`:
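
A minimal sketch of such a table definition (the table and column names are only illustrative):

```sql
-- Illustrative: each month of VisitDate goes to its own partition.
CREATE TABLE visits
(
    VisitDate Date,
    Hour UInt8,
    ClientID UUID
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(VisitDate)
ORDER BY Hour;
```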

1 change: 0 additions & 1 deletion docs/en/operations/table_engines/dictionary.md
@@ -1,4 +1,3 @@

# Dictionary

The `Dictionary` engine displays the [dictionary](../../query_language/dicts/external_dicts.md) data as a ClickHouse table.
15 changes: 7 additions & 8 deletions docs/en/operations/table_engines/distributed.md
@@ -1,7 +1,6 @@

# Distributed

**The Distributed engine does not store data itself**, but allows distributed query processing on multiple servers.
**Tables with Distributed engine do not store any data by themself**, but allow distributed query processing on multiple servers.
Reading is automatically parallelized. During a read, the table indexes on remote servers are used, if there are any.

The Distributed engine accepts parameters:
@@ -23,7 +22,7 @@ Distributed(logs, default, hits[, sharding_key[, policy_name]])
```

Data will be read from all servers in the 'logs' cluster, from the default.hits table located on every server in the cluster.
Data is not only read, but is partially processed on the remote servers (to the extent that this is possible).
Data is not only read but is partially processed on the remote servers (to the extent that this is possible).
For example, for a query with GROUP BY, data will be aggregated on remote servers, and the intermediate states of aggregate functions will be sent to the requestor server. Then data will be further aggregated.
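
For illustration, assuming a Distributed table named `hits_all` built over `default.hits` with a `CounterID` column (both are assumptions, not part of this diff), such a query could look like:

```sql
-- Each shard aggregates its local default.hits data; the requestor server
-- merges the partial aggregation states and applies ORDER BY / LIMIT.
SELECT CounterID, count() AS hits
FROM hits_all
GROUP BY CounterID
ORDER BY hits DESC
LIMIT 10;
```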

Instead of the database name, you can use a constant expression that returns a string. For example: currentDatabase().
@@ -83,7 +82,7 @@ The parameters `host`, `port`, and optionally `user`, `password`, `secure`, `com

When specifying replicas, one of the available replicas will be selected for each of the shards when reading. You can configure the algorithm for load balancing (the preference for which replica to access) – see the [load_balancing](../settings/settings.md#settings-load_balancing) setting.
If the connection with the server is not established, there will be an attempt to connect with a short timeout. If the connection failed, the next replica will be selected, and so on for all the replicas. If the connection attempt failed for all the replicas, the attempt will be repeated the same way, several times.
This works in favor of resiliency, but does not provide complete fault tolerance: a remote server might accept the connection, but might not work, or work poorly.
This works in favour of resiliency, but does not provide complete fault tolerance: a remote server might accept the connection, but might not work, or work poorly.

You can specify just one of the shards (in this case, query processing should be called remote, rather than distributed) or up to any number of shards. In each shard, you can specify from one to any number of replicas. You can specify a different number of replicas for each shard.

@@ -99,9 +98,9 @@ The Distributed engine requires writing clusters to the config file. Clusters fr

There are two methods for writing data to a cluster:

First, you can define which servers to write which data to and perform the write directly on each shard. In other words, perform INSERT in the tables that the distributed table "looks at". This is the most flexible solution as you can use any sharding scheme, which could be non-trivial due to the requirements of the subject area. This is also the most optimal solution, since data can be written to different shards completely independently.
First, you can define which servers to write which data to and perform the write directly on each shard. In other words, perform INSERT in the tables that the distributed table "looks at". This is the most flexible solution as you can use any sharding scheme, which could be non-trivial due to the requirements of the subject area. This is also the most optimal solution since data can be written to different shards completely independently.

Second, you can perform INSERT in a Distributed table. In this case, the table will distribute the inserted data across servers itself. In order to write to a Distributed table, it must have a sharding key set (the last parameter). In addition, if there is only one shard, the write operation works without specifying the sharding key, since it doesn't mean anything in this case.
Second, you can perform INSERT in a Distributed table. In this case, the table will distribute the inserted data across the servers itself. In order to write to a Distributed table, it must have a sharding key set (the last parameter). In addition, if there is only one shard, the write operation works without specifying the sharding key, since it doesn't mean anything in this case.
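
A rough sketch of this second approach, reusing the `logs` cluster and `default.hits` table from the example above (the `hits_all` name and the `intHash64(UserID)` sharding key are assumptions):

```sql
-- Illustrative: a Distributed table that routes inserted rows itself.
-- The last engine parameter, intHash64(UserID), is the sharding key.
CREATE TABLE hits_all AS default.hits
ENGINE = Distributed(logs, default, hits, intHash64(UserID));

-- Rows written to hits_all are split across the shards of the 'logs' cluster
-- according to the sharding key; reads fan out to all shards.
```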

Each shard can have a weight defined in the config file. By default, the weight is equal to one. Data is distributed across shards in the amount proportional to the shard weight. For example, if there are two shards and the first has a weight of 9 while the second has a weight of 10, the first will be sent 9 / 19 parts of the rows, and the second will be sent 10 / 19.

@@ -115,9 +114,9 @@ To select the shard that a row of data is sent to, the sharding expression is an

The sharding expression can be any expression from constants and table columns that returns an integer. For example, you can use the expression 'rand()' for random distribution of data, or 'UserID' for distribution by the remainder from dividing the user's ID (then the data of a single user will reside on a single shard, which simplifies running IN and JOIN by users). If one of the columns is not distributed evenly enough, you can wrap it in a hash function: intHash64(UserID).

A simple remainder from division is a limited solution for sharding and isn't always appropriate. It works for medium and large volumes of data (dozens of servers), but not for very large volumes of data (hundreds of servers or more). In the latter case, use the sharding scheme required by the subject area, rather than using entries in Distributed tables.
A simple reminder from the division is a limited solution for sharding and isn't always appropriate. It works for medium and large volumes of data (dozens of servers), but not for very large volumes of data (hundreds of servers or more). In the latter case, use the sharding scheme required by the subject area, rather than using entries in Distributed tables.

SELECT queries are sent to all the shards, and work regardless of how data is distributed across the shards (they can be distributed completely randomly). When you add a new shard, you don't have to transfer the old data to it. You can write new data with a heavier weight – the data will be distributed slightly unevenly, but queries will work correctly and efficiently.
SELECT queries are sent to all the shards and work regardless of how data is distributed across the shards (they can be distributed completely randomly). When you add a new shard, you don't have to transfer the old data to it. You can write new data with a heavier weight – the data will be distributed slightly unevenly, but queries will work correctly and efficiently.

You should be concerned about the sharding scheme in the following cases:

1 change: 0 additions & 1 deletion docs/en/operations/table_engines/external_data.md
@@ -1,4 +1,3 @@

# External Data for Query Processing

ClickHouse allows sending a server the data that is needed for processing a query, together with a SELECT query. This data is put in a temporary table (see the section "Temporary tables") and can be used in the query (for example, in IN operators).
4 changes: 2 additions & 2 deletions docs/tools/build.py
@@ -207,10 +207,10 @@ def build_single_page_version(lang, args, cfg):
]
})
mkdocs_build.build(cfg)
if not args.version_prefix: # maybe enable in future
    test.test_single_page(os.path.join(test_dir, 'single', 'index.html'), lang)
if args.save_raw_single_page:
    shutil.copytree(test_dir, args.save_raw_single_page)
if not args.version_prefix: # maybe enable in future
    test.test_single_page(os.path.join(test_dir, 'single', 'index.html'), lang)


def write_redirect_html(out_path, to_url):
2 changes: 1 addition & 1 deletion docs/tools/mkdocs-material-theme/partials/social.html
@@ -1,3 +1,3 @@
<div class="md-footer-social">
<span class="md-footer-copyright__highlight">Built from <a href="{{ config.extra.rev_url }}" rel="external nofollow">{{ config.extra.rev_short }}</a></span>
<span class="md-footer-copyright__highlight">Built from <a href="{{ config.extra.rev_url }}" rel="external nofollow" target="_blank">{{ config.extra.rev_short }}</a></span>
</div>
6 changes: 2 additions & 4 deletions docs/tools/test.py
@@ -1,4 +1,4 @@
#!/usr/bin/env python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import logging
@@ -33,10 +33,8 @@ def test_single_page(input_path, lang):

if duplicate_anchor_points:
    logging.warning('Found %d duplicate anchor points' % duplicate_anchor_points)
if links_to_nowhere:
    logging.error('Found %d links to nowhere' % links_to_nowhere)
    sys.exit(10)

assert not links_to_nowhere, 'Found %d links to nowhere' % links_to_nowhere
assert len(anchor_points) > 10, 'Html parsing is probably broken'


2 changes: 1 addition & 1 deletion website/templates/index/community.html
@@ -150,7 +150,7 @@ <h5 class="text-yellow">Google Group</h5>
<div class="container">
<div class="row my-5">
<div class="col-lg">
<h2>Hosting ClickHouse Meetups</h2>
<h2 id="meet">Hosting ClickHouse Meetups</h2>
<p class="lead">
ClickHouse meetups are essential for strengthening community worldwide, but they couldn't be possible without the help of local organizers. Please, feel this form if you want to become one or want to meet ClickHouse core team for any other reason.
</p>