Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix](load)Support load json data with default value #20078

Closed
wants to merge 418 commits into from
Closed

[fix](load)Support load json data with default value #20078

wants to merge 418 commits into from

Conversation

DarvenDuan
Copy link
Contributor

@DarvenDuan DarvenDuan commented May 26, 2023

Proposed changes

Issue Number: close #19741

Problem summary

If a column has been defined as :key INT NULL DEFALT '100', when loading json data which lacks column key, the default value of key is not 100 but NULL.

Checklist(Required)

  • Does it affect the original behavior
  • Has unit tests been added
  • Has document been added or modified
  • Does it need to update dependencies
  • Is this PR support rollback (If NO, please explain WHY)

Further comments

If this is a relatively large or complex change, kick off the discussion at dev@doris.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc...

@github-actions
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@GoGoWen
Copy link
Contributor

GoGoWen commented May 26, 2023

run buildall

@morningman morningman self-assigned this May 29, 2023
@github-actions
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@DarvenDuan
Copy link
Contributor Author

run buildall

@github-actions
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@DarvenDuan
Copy link
Contributor Author

run buildall

@DarvenDuan
Copy link
Contributor Author

run p0

@DarvenDuan
Copy link
Contributor Author

run P0

@github-actions
Copy link
Contributor

github-actions bot commented Jun 7, 2023

clang-tidy review says "All clean, LGTM! 👍"

@DarvenDuan
Copy link
Contributor Author

run buildall

if (!(*valid)) {
return Status::OK();
auto* column_ptr = block.get_by_position(i).column->assume_mutable().get();
auto field = value->find_field_unordered(slot_desc->col_name());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is much slower than iterate through for (auto field : *value) , this loop will utilize simd instruction

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I will fix it

} else {
const std::string& v_str = col_value->second;
nullable_column->get_null_map_data().push_back(0);
assert_cast<ColumnString*>(column_ptr)->insert_data(v_str.c_str(), v_str.size());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if default value is CURRENT_TIMESTAMP will v_str be ok?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should add test cases

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for your reminding, I had tested for default value is CURRENT_TIMESTAMP , I will add some test case.

}
}
}
if (!has_valid_value) {
for (int i = 0; i < block.columns(); ++i) {
auto column = block.get_by_position(i).column->assume_mutable();
column->pop_back(1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add comment to explain why we need to pop_back here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If has no valid value in json line and we fill with default value, it's not reasonable,so we should remove this line in block. I will add some comment there

@@ -1370,53 +1374,40 @@ Status NewJsonReader::_simdjson_handle_nested_complex_json(
return Status::OK();
}

size_t NewJsonReader::_column_index(const StringRef& name, size_t key_index) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this function accelerate parsing speed, do not delete it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this function accelerate parsing speed, do not delete it

I have added it back

@@ -1730,4 +1726,52 @@ Status NewJsonReader::_simdjson_write_columns_by_jsonpath(
return Status::OK();
}

Status NewJsonReader::_get_column_default_value(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should _fill_missing_column also when fill missing slot

 // fill missing slot
    int ctx_idx = 0;
    int nullcount = 0;
    for (auto slot_desc : slot_descs) {
        if (!slot_desc->is_materialized()) {
            continue;
        }
        int dest_index = ctx_idx++;
        auto* column_ptr = block.get_by_position(dest_index).column->assume_mutable().get();
        if (column_ptr->size() < cur_row_count + 1) {
           // ..._fill_missing_column.. here
            ++nullcount;
        }
        DCHECK(column_ptr->size() == cur_row_count + 1);
    }
    // There is at least one valid value here
    DCHECK(nullcount < block.columns());
    *valid = true;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I have filled all slot before, do I need to fill missing slot again?

Copy link
Member

@eldenmoon eldenmoon Jun 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this logic is useless now? or maybe we could delete it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so this logic is useless now? or maybe we could delete it?

Yea, it's useless logically speaking

@github-actions
Copy link
Contributor

github-actions bot commented Jun 7, 2023

clang-tidy review says "All clean, LGTM! 👍"

@DarvenDuan
Copy link
Contributor Author

run buildall

Copy link
Member

@eldenmoon eldenmoon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@github-actions
Copy link
Contributor

github-actions bot commented Jun 8, 2023

PR approved by anyone and no changes requested.

@eldenmoon
Copy link
Member

maybe rebase could help to pass p0 regression-test

hufengkai and others added 19 commits June 9, 2023 09:59
…the file cache is enabled. #20560

Use consistent hash to collect BE only when the file cache is enabled. And move the consistent BE assign code to FederationBackendPolicy.
Fix explain split number and file size incorrect bug.
In some case of agg function, maybe running as streaming agg firstly,
this will call the add function when serialize, so need implement add function also.
Co-authored-by: 王翔宇 <wangxiangyu@360shuke.com>
change check tablet path interval's default value to -1
Fix some bugs of orc lazy materialization(#18615)
- Fix issue causing column size to continuously increase after `execute_conjuncts()` by calling `Block::erase_useless_column()`.
- Fix partition issues of orc lazy materialization. 
- Fix lazy materialization will not be used when the predicate column is inconsistent with the orc file.
…ot correct when creating `MergeRangeFileReader` in orc reader. (#20393)

Fix the inner reader of MergeRangeFileReader is not correct when creating MergeRangeFileReader in orc reader.
…ult value (#20478)

image update FE config max_running_txn_num_per_db default value: old value : 100 new value : 1000
…perations in Legacy (#20354)

The precision handling for division with DECIMALV3 is as follows (excluding cases where division increases precision):

(p1, s1) / (p2, s2) ----> (p1 + s2, s1)

However, due to precision loss in division, it is considered to increase the precision of the left operand:

(p1, s1) / (p2, s2) =====> (p1 + s2, s1 + s2) / (p2, s2) ----> (p1 + s2, s1)

However, the legacy optimizer repeats the analyze and substitute steps for an expression, which can result in the accumulation of precision:

(p1, s1) / (p2, s2) =====> (p1 + s2, s1 + s2) / (p2, s2) =====> (p1 + s2 + s2, s1 + s2 + s2) / (p2, s2)

To address this, the previous approach was to forcibly convert the left operand of DECIMALV3 calculations. This results in rewriting the expression as:

(p1, s1) / (p2, s2) =====> cast((p1, s1) as (p1 + s2, s1 + s2)) / (p2, s2)

Then, during the substitution step, a check is performed. If it is a cast expression, the expression modified by the cast is extracted:

cast((p1, s1) as (p1 + s2, s1 + s2)) =====> (p1, s1)

protected Expr substituteImpl(ExprSubstitutionMap smap, ExprSubstitutionMap disjunctsMap, Analyzer analyzer) {
        if (isImplicitCast()) {
            return getChild(0).substituteImpl(smap, disjunctsMap, analyzer);
        }
This way, there won't be repeated analysis, preventing the continuous increase in precision. However, if the left expression is a constant (literal), theoretically, the precision would continue to increase. Unfortunately, the code that was removed in this PR (#19926) obscured this issue.

for (Expr child : children) {
    if (child instanceof DecimalLiteral && child.getType().isDecimalV3()) {
      ((DecimalLiteral)child).tryToReduceType();
    }
}
An attempt will be made to reduce the precision of literals in the expressions. However, this code snippet can cause such a bug.

mysql [test]>select cast(1 as DECIMALV3(16, 2)) /  cast(3 as DECIMALV3(16, 2));
+-----------------------------------------------------------+
| CAST(1 AS DECIMALV3(16, 2)) / CAST(3 AS DECIMALV3(16, 2)) |
+-----------------------------------------------------------+
|                                                      0.00 |
+-----------------------------------------------------------+
1.00 / 3.00, due to reduced precision, becomes 1 / 3.
<--Describe your changes.-->
…ble (#20391)

in join node, if it's broadcast_join
and shared hash table, some counter/timer about build hash table is useless,
so we could add those counter/timer in faker profile, and those will not display in web profile.
…20342)

1. fix hive catalog docs 
2. fix dlf properties
3. fix obs impl
…20423)

fix config::txn_commit_rpc_timeout_ms is not available for commit phase in 2pc.
truncate table with partition name need case insensitive
Co-authored-by: hugoluo <hugoluo@tencent.com>
@github-actions github-actions bot added area/load Issues or PRs related to all kinds of load area/nereids area/pipeline area/planner Issues or PRs related to the query planner kind/docs Categorizes issue or PR as related to documentation. labels Jun 9, 2023
Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

// specific language governing permissions and limitations
// under the License.

#include "olap/delete_bitmap_calculator.h"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: 'olap/delete_bitmap_calculator.h' file not found [clang-diagnostic-error]

#include "olap/delete_bitmap_calculator.h"
         ^


#include <gtest/gtest.h>

#include "olap/memtable.h"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: 'olap/memtable.h' file not found [clang-diagnostic-error]

#include "olap/memtable.h"
         ^

@DarvenDuan DarvenDuan closed this Jun 9, 2023
@DarvenDuan DarvenDuan deleted the support_json_default_value branch June 9, 2023 02:34
@DarvenDuan
Copy link
Contributor Author

maybe rebase could help to pass p0 regression-test

There is something wrong when I rebase my branch, I have created a new PR #20624, can you review it please? Thank you.

@victoyzz
Copy link

请问你已经解决了这个问题吗?我也遇到了 字段为null时默认值不生效的问题

@DarvenDuan
Copy link
Contributor Author

请问你已经解决了这个问题吗?我也遇到了 字段为null时默认值不生效的问题

#20624 这个 PR 已经解决了这个问题

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/load Issues or PRs related to all kinds of load area/nereids area/pipeline area/planner Issues or PRs related to the query planner area/vectorization kind/docs Categorizes issue or PR as related to documentation. kind/test reviewed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Feature] support real default value for json stream load