Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DOCSUP-3584 edit and translate #17176

Merged
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
17 commits
Select commit Hold shift + click to select a range
be9a2fe
Merge branch 'master' of https://github.com/ClickHouse/ClickHouse
Sep 16, 2020
24d5bb1
Merge branch 'master' of https://github.com/ClickHouse/ClickHouse int…
Oct 12, 2020
4eba3ca
Merge branch 'master' of https://github.com/ClickHouse/ClickHouse int…
Nov 1, 2020
322b135
Merge branch 'master' of https://github.com/ClickHouse/ClickHouse int…
Nov 15, 2020
8cd98bf
Edit and translate.
Nov 17, 2020
a2e4807
Update docs/ru/operations/utilities/clickhouse-obfuscator.md
damozhaeva Nov 22, 2020
3671f50
Update docs/ru/operations/utilities/clickhouse-obfuscator.md
damozhaeva Nov 22, 2020
77bbe97
Update docs/ru/operations/utilities/clickhouse-obfuscator.md
damozhaeva Nov 22, 2020
6b8c53e
Update docs/ru/operations/utilities/clickhouse-obfuscator.md
damozhaeva Nov 22, 2020
584b044
Update docs/ru/operations/utilities/clickhouse-obfuscator.md
damozhaeva Nov 22, 2020
605c823
Update docs/ru/operations/utilities/clickhouse-obfuscator.md
damozhaeva Nov 22, 2020
8b87c9a
Update docs/ru/operations/utilities/clickhouse-obfuscator.md
damozhaeva Nov 22, 2020
c2722a8
Update docs/en/operations/utilities/clickhouse-obfuscator.md
damozhaeva Nov 22, 2020
9bf376a
Update docs/en/operations/utilities/clickhouse-obfuscator.md
damozhaeva Nov 22, 2020
57559fc
Update docs/en/operations/utilities/clickhouse-obfuscator.md
damozhaeva Nov 22, 2020
0785772
Update docs/ru/operations/utilities/clickhouse-obfuscator.md
damozhaeva Nov 22, 2020
c310bd5
Update docs/en/operations/utilities/clickhouse-obfuscator.md
damozhaeva Nov 22, 2020
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
84 changes: 42 additions & 42 deletions docs/en/operations/utilities/clickhouse-obfuscator.md
Original file line number Diff line number Diff line change
@@ -1,42 +1,42 @@
# ClickHouse obfuscator

Simple tool for table data obfuscation.

It reads input table and produces output table, that retain some properties of input, but contains different data.
It allows to publish almost real production data for usage in benchmarks.

It is designed to retain the following properties of data:
- cardinalities of values (number of distinct values) for every column and for every tuple of columns;
- conditional cardinalities: number of distinct values of one column under condition on value of another column;
- probability distributions of absolute value of integers; sign of signed integers; exponent and sign for floats;
- probability distributions of length of strings;
- probability of zero values of numbers; empty strings and arrays, NULLs;
- data compression ratio when compressed with LZ77 and entropy family of codecs;
- continuity (magnitude of difference) of time values across table; continuity of floating point values.
- date component of DateTime values;
- UTF-8 validity of string values;
- string values continue to look somewhat natural.

Most of the properties above are viable for performance testing:

reading data, filtering, aggregation and sorting will work at almost the same speed
as on original data due to saved cardinalities, magnitudes, compression ratios, etc.

It works in deterministic fashion: you define a seed value and transform is totally determined by input data and by seed.
Some transforms are one to one and could be reversed, so you need to have large enough seed and keep it in secret.

It use some cryptographic primitives to transform data, but from the cryptographic point of view,
It doesn't do anything properly and you should never consider the result as secure, unless you have other reasons for it.

It may retain some data you don't want to publish.

It always leave numbers 0, 1, -1 as is. Also it leaves dates, lengths of arrays and null flags exactly as in source data.
For example, you have a column IsMobile in your table with values 0 and 1. In transformed data, it will have the same value.
So, the user will be able to count exact ratio of mobile traffic.

Another example, suppose you have some private data in your table, like user email and you don't want to publish any single email address.
If your table is large enough and contain multiple different emails and there is no email that have very high frequency than all others,
It will perfectly anonymize all data. But if you have small amount of different values in a column, it can possibly reproduce some of them.
And you should take care and look at exact algorithm, how this tool works, and probably fine tune some of it command line parameters.

This tool works fine only with reasonable amount of data (at least 1000s of rows).
# ClickHouse obfuscator
A simple tool for table data obfuscation.
It reads an input table and produces an output table, that retains some properties of input, but contains different data.
It allows publishing almost real production data for usage in benchmarks.
It is designed to retain the following properties of data:
- cardinalities of values (number of distinct values) for every column and every tuple of columns;
- conditional cardinalities: number of distinct values of one column under the condition on the value of another column;
- probability distributions of the absolute value of integers; the sign of signed integers; exponent and sign for floats;
- probability distributions of the length of strings;
- probability of zero values of numbers; empty strings and arrays, `NULL`s;
- data compression ratio when compressed with LZ77 and entropy family of codecs;
- continuity (magnitude of difference) of time values across the table; continuity of floating-point values;
- date component of `DateTime` values;
- UTF-8 validity of string values;
- string values look natural.
Most of the properties above are viable for performance testing:
reading data, filtering, aggregatio, and sorting will work at almost the same speed
as on original data due to saved cardinalities, magnitudes, compression ratios, etc.
It works in a deterministic fashion: you define a seed value and the transformation is determined by input data and by seed.
Some transformations are one to one and could be reversed, so you need to have a large seed and keep it in secret.
It uses some cryptographic primitives to transform data but from the cryptographic point of view, it doesn't do it properly, that is why you should not consider the result as secure unless you have another reason. The result may retain some data you don't want to publish.
It always leaves 0, 1, -1 numbers, dates, lengths of arrays, and null flags exactly as in source data.
For example, you have a column `IsMobile` in your table with values 0 and 1. In transformed data, it will have the same value.
So, the user will be able to count the exact ratio of mobile traffic.
Let's give another example. When you have some private data in your table, like user email and you don't want to publish any single email address.
If your table is large enough and contains multiple different emails and no email has a very high frequency than all others, it will anonymize all data. But if you have a small number of different values in a column, it can reproduce some of them.
You should look at the working algorithm of this tool works, and fine-tune its command line parameters.
This tool works fine only with an average amount of data (at least 1000s of rows).
43 changes: 43 additions & 0 deletions docs/ru/operations/utilities/clickhouse-obfuscator.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,43 @@
# Обфускатор ClickHouse

Простой инструмент для обфускации табличных данных.

Он считывает данные входной таблицы и создает выходную таблицу, которая сохраняет некоторые свойства входных данных, но при этом содержит другие данные.

Это позволяет публиковать практически реальные данные и использовать их в тестах на производительность.

Обфускатор предназначен для сохранения следующих свойств данных:
- кардинальность (количество уникальных данных) для каждого столбца и каждого кортежа столбцов;
- условная кардинальность: количество уникальных данных одного столбца в соответствии со значением другого столбца;
- вероятностные распределения абсолютного значения целых чисел; знак числа типа Int; показатель степени и знак для чисел с плавающей запятой;
- вероятностное распределение длины строк;
- вероятность нулевых значений чисел; пустые строки и массивы, `NULL`;
- степень сжатия данных алгоритмом LZ77 и семейством энтропийных кодеков;

- непрерывность (величина разницы) значений времени в таблице; непрерывность значений с плавающей запятой;
- дату из значений `DateTime`;

- кодировка UTF-8 значений строки;
- строковые значения выглядят естественным образом.


Большинство перечисленных выше свойств пригодны для тестирования производительности. Чтение данных, фильтрация, агрегирование и сортировка будут работать почти с той же скоростью, что и исходные данные, благодаря сохраненной кардинальности, величине, степени сжатия и т. д.

Он работает детерминированно. Вы задаёте значение инициализатора, а преобразование полностью определяется входными данными и инициализатором.

Некоторые преобразования выполняются один к одному, и их можно отменить. Поэтому нужно использовать большое значение инициализатора и хранить его в секрете.


Обфускатор использует некоторые криптографические примитивы для преобразования данных, но, с криптографической точки зрения, результат будет небезопасным. В нем могут сохраниться данные, которые не следует публиковать.


Он всегда оставляет без изменений числа 0, 1, -1, даты, длины массивов и нулевые флаги.
Например, если у вас есть столбец `IsMobile` в таблице со значениями 0 и 1, то в преобразованных данных он будет иметь то же значение.

Таким образом, пользователь сможет посчитать точное соотношение мобильного трафика.

Давайте рассмотрим случай, когда у вас есть какие-то личные данные в таблице (например, электронная почта пользователя), и вы не хотите их публиковать.
Если ваша таблица достаточно большая и содержит несколько разных электронных почтовых адресов, и ни один из них не встречается часто, то обфускатор полностью анонимизирует все данные. Но, если у вас есть небольшое количество разных значений в столбце, он может скопировать некоторые из них.
В этом случае вам следует посмотреть на алгоритм работы инструмента и настроить параметры командной строки.

Обфускатор полезен в работе со средним объемом данных (не менее 1000 строк).