Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix](string-column) fix unescape result length error #22411

Merged
merged 1 commit into from
Aug 2, 2023

Conversation

TangSiyang2001
Copy link
Collaborator

@TangSiyang2001 TangSiyang2001 commented Jul 31, 2023

Proposed changes

While working on #22382, found that text convertor got error result while doing unescape, due to result length error.
Related case is in #22382.

Further comments

If this is a relatively large or complex change, kick off the discussion at dev@doris.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc...

@TangSiyang2001
Copy link
Collaborator Author

run buildall

@github-actions
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@hello-stephen
Copy link
Contributor

(From new machine)TeamCity pipeline, clickbench performance test result:
the sum of best hot time: 48.78 seconds
stream load tsv: 528 seconds loaded 74807831229 Bytes, about 135 MB/s
stream load json: 20 seconds loaded 2358488459 Bytes, about 112 MB/s
stream load orc: 65 seconds loaded 1101869774 Bytes, about 16 MB/s
stream load parquet: 31 seconds loaded 861443392 Bytes, about 26 MB/s
insert into select: 28.9 seconds inserted 10000000 Rows, about 346K ops/s
storage size: 17162127243 Bytes

Copy link
Contributor

@dataroaring dataroaring left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@github-actions
Copy link
Contributor

github-actions bot commented Aug 2, 2023

PR approved by at least one committer and no changes requested.

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Aug 2, 2023
@github-actions
Copy link
Contributor

github-actions bot commented Aug 2, 2023

PR approved by anyone and no changes requested.

@dataroaring dataroaring merged commit e991f60 into apache:master Aug 2, 2023
19 of 21 checks passed
xiaokang pushed a commit to xiaokang/doris that referenced this pull request Aug 9, 2023
morningman pushed a commit that referenced this pull request Aug 15, 2023
…port enclose and escape for stream load (#22539)

 ## Proposed changes

Refactor thoughts: close #22383
Descriptions about `enclose` and `escape`: #22385

## Further comments

2023-08-09: 
It's a pity that experiment shows that the original way for parsing plain CSV is faster. Therefor, the refactor is only applied on enclose related code. The plain CSV parser use the original logic.

Fallback of performance is unavoidable anyway. From the `CSV reader`'s perspective, the real weak point may be the write column behavior, proved by the flame graph.
 
Trimming escape will be enable after fix: #22411 is merged

Cases should be discussed: 

1. When an incomplete enclose appears in the beginning of a large scale data, the line delimiter will be unreachable till the EOF, will the buffer become extremely large?
2. What if an infinite line occurs in the case? Essentially,  `1.` is equivalent to this.  

Only support stream load as trial in this PR, avoid too many unrelated changes. Docs will be added when `enclose` and `escape` is available for all kinds of load.
xiaokang pushed a commit that referenced this pull request Aug 17, 2023
…port enclose and escape for stream load (#22539)

 ## Proposed changes

Refactor thoughts: close #22383
Descriptions about `enclose` and `escape`: #22385

2023-08-09:
It's a pity that experiment shows that the original way for parsing plain CSV is faster. Therefor, the refactor is only applied on enclose related code. The plain CSV parser use the original logic.

Fallback of performance is unavoidable anyway. From the `CSV reader`'s perspective, the real weak point may be the write column behavior, proved by the flame graph.

Trimming escape will be enable after fix: #22411 is merged

Cases should be discussed:

1. When an incomplete enclose appears in the beginning of a large scale data, the line delimiter will be unreachable till the EOF, will the buffer become extremely large?
2. What if an infinite line occurs in the case? Essentially,  `1.` is equivalent to this.

Only support stream load as trial in this PR, avoid too many unrelated changes. Docs will be added when `enclose` and `escape` is available for all kinds of load.
airborne12 pushed a commit to airborne12/apache-doris that referenced this pull request Aug 21, 2023
…port enclose and escape for stream load (apache#22539)

 ## Proposed changes

Refactor thoughts: close apache#22383
Descriptions about `enclose` and `escape`: apache#22385

2023-08-09:
It's a pity that experiment shows that the original way for parsing plain CSV is faster. Therefor, the refactor is only applied on enclose related code. The plain CSV parser use the original logic.

Fallback of performance is unavoidable anyway. From the `CSV reader`'s perspective, the real weak point may be the write column behavior, proved by the flame graph.

Trimming escape will be enable after fix: apache#22411 is merged

Cases should be discussed:

1. When an incomplete enclose appears in the beginning of a large scale data, the line delimiter will be unreachable till the EOF, will the buffer become extremely large?
2. What if an infinite line occurs in the case? Essentially,  `1.` is equivalent to this.

Only support stream load as trial in this PR, avoid too many unrelated changes. Docs will be added when `enclose` and `escape` is available for all kinds of load.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by one committer. dev/2.0.1-merged reviewed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants