Skip to content

Conversation

@kaka11chen
Copy link
Contributor

@kaka11chen kaka11chen commented Apr 22, 2024

Proposed changes

Issue:

F20240421 17:14:37.494115 184986 hdfs_util.h:65] Check failed: _ref_cnt == 0 
*** Check failure stack trace: ***
F20240421 17:14:37.505879 185108 hdfs_util.h:65] Check failed: _ref_cnt == 0 
*** Check failure stack trace: ***
    @     0x556f5236d316  google::LogMessageFatal::~LogMessageFatal()
    @     0x556f5236d316  google::LogMessageFatal::~LogMessageFatal()
    @     0x556f2830e200  doris::io::HdfsFileWriter::~HdfsFileWriter()
    @     0x556f2830e200  doris::io::HdfsFileWriter::~HdfsFileWriter()
    @     0x556f2830e21e  doris::io::HdfsFileWriter::~HdfsFileWriter()
    @     0x556f2830e21e  doris::io::HdfsFileWriter::~HdfsFileWriter()
    @     0x556f507893b0  doris::vectorized::VHivePartitionWriter::~VHivePartitionWriter()
    @     0x556f507893b0  doris::vectorized::VHivePartitionWriter::~VHivePartitionWriter()
    @     0x556f506c005e  std::_Hashtable<>::clear()
    @     0x556f506c005e  std::_Hashtable<>::clear()
    @     0x556f50780f4f  doris::vectorized::VHiveTableWriter::close()
    @     0x556f50780f4f  doris::vectorized::VHiveTableWriter::close()
    @     0x556f5072bc4f  doris::vectorized::AsyncResultWriter::process_block()
    @     0x556f5072bc4f  doris::vectorized::AsyncResultWriter::process_block()
    @     0x556f5072cd01  std::_Function_handler<>::_M_invoke()
    @     0x556f5072cd01  std::_Function_handler<>::_M_invoke()
    @     0x556f2b02b73d  doris::ThreadPool::dispatch_thread()
    @     0x556f2b02b73d  doris::ThreadPool::dispatch_thread()
    @     0x556f2b008d59  doris::Thread::supervise_thread()
    @     0x556f2b008d59  doris::Thread::supervise_thread()
    @     0x7f2c2bfb4609  start_thread
    @     0x7f2c2bfb4609  start_thread
    @     0x7f2c2c261133  clone
    @     0x7f2c2c261133  clone
    @              (nil)  (unknown)
*** Query id: ac4f457c003d4489-b04ac56ef05b12f0 ***
*** is nereids: 1 ***
*** tablet id: 0 ***
*** Aborted at 1713690877 (unix time) try "date -d @1713690877" if you are using GNU date ***
*** Current BE git commitID: e6f4a2f ***
*** SIGABRT unknown detail explain (@0x38a6) received by PID 14502 (TID 184986 OR 0x7f21d0614700) from PID 14502; stack trace: ***
    @              (nil)  (unknown)
F20240421 17:14:37.505879 185108 hdfs_util.h:65] Check failed: _ref_cnt == 0 F20240421 17:14:37.887202 185110 hdfs_util.h:65] Check failed: _ref_cnt == 0 
*** Check failure stack trace: ***
    @     0x556f5236d316  google::LogMessageFatal::~LogMessageFatal()
    @     0x556f2830e200  doris::io::HdfsFileWriter::~HdfsFileWriter()
    @     0x556f2830e21e  doris::io::HdfsFileWriter::~HdfsFileWriter()
    @     0x556f507893b0  doris::vectorized::VHivePartitionWriter::~VHivePartitionWriter()
 0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /root/doris/be/src/common/signal_handler.h:421

The root cause is
When it cannot be processed in the cache (such as when the cache is full), we will create a fs_handler, and the life cycle of fs_handler is managed by caller. We have separated the hdfs writer and can create fs_handler separately, so HdfsFileSystem and HdfsFileWriter may be callers. In HdfsFileSystem, the fs_handler of HdfsFileWriter is shared, so it needs to be changed to shared_ptr.

Solution

Fix hdfs file writer core with check failed: _ref_cnt == 0 in dtor of HdfsFileWriter.
Change fs_handler ptr to shared_ptr and remove ref count operations.

Further comments

If this is a relatively large or complex change, kick off the discussion at dev@doris.apache.org by explaining why you chose the solution you did and what alternatives you considered, etc...

@doris-robot
Copy link

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR

Since 2024-03-18, the Document has been moved to doris-website.
See Doris Document.

@kaka11chen
Copy link
Contributor Author

run buildall

@github-actions
Copy link
Contributor

clang-tidy review says "All clean, LGTM! 👍"

@doris-robot
Copy link

TeamCity be ut coverage result:
Function Coverage: 35.37% (8916/25210)
Line Coverage: 27.10% (73320/270591)
Region Coverage: 26.24% (37877/144368)
Branch Coverage: 23.05% (19288/83696)
Coverage Report: http://coverage.selectdb-in.cc/coverage/b2b3027b5d4e28307be23facb767dbb959967c4c_b2b3027b5d4e28307be23facb767dbb959967c4c/report/index.html

@doris-robot
Copy link

TPC-H: Total hot run time: 38297 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit b2b3027b5d4e28307be23facb767dbb959967c4c, data reload: false

------ Round 1 ----------------------------------
q1	17609	4363	4225	4225
q2	2016	189	184	184
q3	10434	1209	1120	1120
q4	10192	720	762	720
q5	7494	2679	2601	2601
q6	216	134	136	134
q7	1026	611	605	605
q8	9213	2035	2020	2020
q9	7372	6533	6484	6484
q10	8499	3537	3532	3532
q11	445	234	229	229
q12	497	220	213	213
q13	17777	2953	2944	2944
q14	258	238	230	230
q15	514	476	478	476
q16	524	375	381	375
q17	951	693	718	693
q18	7323	6729	6672	6672
q19	6776	1534	1513	1513
q20	642	319	301	301
q21	3531	2758	2717	2717
q22	372	311	309	309
Total cold run time: 113681 ms
Total hot run time: 38297 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4328	4232	4229	4229
q2	355	263	268	263
q3	2966	2664	2798	2664
q4	1835	1567	1576	1567
q5	5292	5281	5254	5254
q6	208	123	126	123
q7	2241	1848	1834	1834
q8	3193	3346	3307	3307
q9	8542	8575	8566	8566
q10	4105	3836	3943	3836
q11	605	490	498	490
q12	794	615	646	615
q13	16485	3331	3128	3128
q14	311	307	295	295
q15	529	486	493	486
q16	510	432	464	432
q17	1850	1504	1503	1503
q18	8062	7947	7849	7849
q19	1659	1583	1616	1583
q20	2075	1859	1831	1831
q21	6266	4851	4896	4851
q22	546	479	489	479
Total cold run time: 72757 ms
Total hot run time: 55185 ms

@doris-robot
Copy link

TPC-DS: Total hot run time: 185925 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit b2b3027b5d4e28307be23facb767dbb959967c4c, data reload: false

query1	887	374	366	366
query2	6447	2592	2358	2358
query3	6657	200	199	199
query4	24604	21218	21255	21218
query5	4225	402	408	402
query6	266	180	173	173
query7	4568	298	287	287
query8	236	186	178	178
query9	8488	2330	2308	2308
query10	412	239	247	239
query11	14681	14241	14268	14241
query12	134	90	86	86
query13	1645	369	357	357
query14	8721	7851	7900	7851
query15	288	183	187	183
query16	8172	265	265	265
query17	1919	584	560	560
query18	2062	292	278	278
query19	326	151	152	151
query20	94	84	82	82
query21	201	127	122	122
query22	5020	4912	4834	4834
query23	33705	33284	33515	33284
query24	10693	3139	3118	3118
query25	616	413	396	396
query26	713	163	160	160
query27	2255	370	380	370
query28	5894	2068	2064	2064
query29	904	635	651	635
query30	285	196	179	179
query31	972	754	753	753
query32	98	54	57	54
query33	660	269	258	258
query34	882	490	494	490
query35	875	719	715	715
query36	1096	915	955	915
query37	114	72	72	72
query38	3433	3386	3317	3317
query39	1643	1575	1613	1575
query40	178	135	133	133
query41	57	43	46	43
query42	104	100	99	99
query43	575	566	536	536
query44	1140	771	751	751
query45	301	271	273	271
query46	1119	759	731	731
query47	2021	1947	1987	1947
query48	372	307	305	305
query49	867	396	398	396
query50	766	384	394	384
query51	6914	6932	6875	6875
query52	98	86	83	83
query53	336	273	275	273
query54	307	234	231	231
query55	76	69	71	69
query56	234	221	219	219
query57	1176	1150	1113	1113
query58	227	188	192	188
query59	3292	3185	3144	3144
query60	244	226	228	226
query61	90	87	83	83
query62	591	427	440	427
query63	300	277	282	277
query64	5380	3854	3944	3854
query65	3058	3028	3073	3028
query66	746	377	329	329
query67	15542	14922	15095	14922
query68	8822	543	540	540
query69	540	303	301	301
query70	1212	1096	1182	1096
query71	1470	1267	1269	1267
query72	6682	2598	2425	2425
query73	837	321	314	314
query74	6835	6378	6515	6378
query75	4007	2662	2644	2644
query76	4897	998	968	968
query77	616	257	262	257
query78	10951	10223	10145	10145
query79	9949	535	529	529
query80	2006	435	426	426
query81	515	243	246	243
query82	781	94	92	92
query83	194	158	160	158
query84	253	83	79	79
query85	1248	270	273	270
query86	396	285	305	285
query87	3517	3229	3233	3229
query88	5113	2303	2344	2303
query89	530	370	370	370
query90	1977	178	179	178
query91	123	92	94	92
query92	61	45	44	44
query93	7090	502	506	502
query94	1166	177	177	177
query95	397	294	323	294
query96	612	259	265	259
query97	3103	2967	2980	2967
query98	233	216	218	216
query99	1239	847	865	847
Total cold run time: 298374 ms
Total hot run time: 185925 ms

@doris-robot
Copy link

ClickBench: Total hot run time: 30.02 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit b2b3027b5d4e28307be23facb767dbb959967c4c, data reload: false

query1	0.04	0.03	0.03
query2	0.08	0.03	0.04
query3	0.23	0.05	0.04
query4	1.68	0.06	0.07
query5	0.50	0.48	0.47
query6	1.50	0.74	0.73
query7	0.01	0.01	0.02
query8	0.05	0.04	0.04
query9	0.58	0.51	0.50
query10	0.56	0.57	0.57
query11	0.17	0.13	0.12
query12	0.15	0.13	0.12
query13	0.60	0.60	0.60
query14	0.77	0.78	0.81
query15	0.86	0.82	0.80
query16	0.36	0.37	0.37
query17	1.00	1.01	1.00
query18	0.22	0.25	0.23
query19	1.84	1.72	1.68
query20	0.02	0.01	0.01
query21	15.40	0.64	0.64
query22	4.18	7.73	1.57
query23	18.31	1.41	1.28
query24	2.00	0.21	0.21
query25	0.16	0.08	0.09
query26	0.24	0.16	0.16
query27	0.07	0.08	0.08
query28	13.31	1.00	0.98
query29	12.60	3.26	3.22
query30	0.25	0.08	0.05
query31	2.88	0.40	0.37
query32	3.27	0.48	0.46
query33	2.78	2.85	2.83
query34	17.19	4.40	4.48
query35	4.58	4.51	4.48
query36	0.65	0.46	0.45
query37	0.17	0.15	0.15
query38	0.15	0.15	0.14
query39	0.04	0.03	0.04
query40	0.17	0.14	0.13
query41	0.09	0.05	0.04
query42	0.06	0.05	0.05
query43	0.04	0.04	0.03
Total cold run time: 109.81 s
Total hot run time: 30.02 s

if (!handle->invalid()) {
handle->inc_ref();
*fs_handle = handle;
*fs_handle = std::move(handle);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we still have to deal with inc_ref() even with shared_ptr support?
can we integrate the process (inc_ref, dec_ref) into shared_ptr ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is for fs handle cache.

HdfsFileWriter::HdfsFileWriter(Path path, std::shared_ptr<HdfsHandler> handler, hdfsFile hdfs_file,
std::string fs_name, const FileWriterOptions* opts)
: _path(std::move(path)),
_hdfs_handler(handler),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
_hdfs_handler(handler),
_hdfs_handler(std::move(handler)),

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto:
return std::make_unique<HdfsFileWriter>(std::move(path), std::move(handler), hdfs_file, fs_name, opts);

// do not use std::shared_ptr or std::unique_ptr
// _fs_handle is managed by HdfsFileSystemCache
HdfsHandler* _fs_handle = nullptr;
std::shared_ptr<HdfsHandler> _fs_handle = nullptr;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
std::shared_ptr<HdfsHandler> _fs_handle = nullptr;
std::shared_ptr<HdfsHandler> _fs_handle;

可以不用显式写 = nullptr,智能指针默认初始化都是 nullptr

}
if (_cache.size() < MAX_CACHE_HANDLE) {
std::unique_ptr<HdfsHandler> handle = std::make_unique<HdfsHandler>(hdfs_fs, true);
std::shared_ptr<HdfsHandler> handle = std::make_shared<HdfsHandler>(hdfs_fs, true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
std::shared_ptr<HdfsHandler> handle = std::make_shared<HdfsHandler>(hdfs_fs, true);
auto handle = std::make_shared<HdfsHandler>(hdfs_fs, true);

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

_last_access_time = std::chrono::duration_cast<std::chrono::milliseconds>(
std::chrono::system_clock::now().time_since_epoch())
.count();
void update_last_access_time() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

warning: method 'update_last_access_time' can be made const [readability-make-member-function-const]

Suggested change
void update_last_access_time() {
void update_last_access_time() const {

@kaka11chen kaka11chen force-pushed the fix_hdfs_file_writer_core branch 2 times, most recently from 60e60ff to ffc4826 Compare April 23, 2024 04:38
@kaka11chen kaka11chen force-pushed the fix_hdfs_file_writer_core branch from ffc4826 to c95dd69 Compare April 23, 2024 04:39
@kaka11chen
Copy link
Contributor Author

run buildall

@doris-robot
Copy link

TeamCity be ut coverage result:
Function Coverage: 35.24% (8916/25302)
Line Coverage: 27.00% (73325/271622)
Region Coverage: 26.17% (37878/144735)
Branch Coverage: 22.98% (19286/83924)
Coverage Report: http://coverage.selectdb-in.cc/coverage/c95dd699c71fd756e1728ad9ce523a637b17ad18_c95dd699c71fd756e1728ad9ce523a637b17ad18/report/index.html

@kaka11chen kaka11chen requested a review from platoneko April 23, 2024 07:30
Copy link
Contributor

@morningman morningman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Apr 23, 2024
@github-actions
Copy link
Contributor

PR approved by at least one committer and no changes requested.

@github-actions
Copy link
Contributor

PR approved by anyone and no changes requested.

@morningman morningman merged commit feaf51c into apache:master Apr 23, 2024
dataroaring pushed a commit that referenced this pull request Apr 23, 2024
…_cnt == 0` in dtor of `HdfsFileWriter`. (#33959)

## Issue:
```
F20240421 17:14:37.494115 184986 hdfs_util.h:65] Check failed: _ref_cnt == 0 
*** Check failure stack trace: ***
F20240421 17:14:37.505879 185108 hdfs_util.h:65] Check failed: _ref_cnt == 0 
*** Check failure stack trace: ***
    @     0x556f5236d316  google::LogMessageFatal::~LogMessageFatal()
    @     0x556f5236d316  google::LogMessageFatal::~LogMessageFatal()
    @     0x556f2830e200  doris::io::HdfsFileWriter::~HdfsFileWriter()
    @     0x556f2830e200  doris::io::HdfsFileWriter::~HdfsFileWriter()
    @     0x556f2830e21e  doris::io::HdfsFileWriter::~HdfsFileWriter()
    @     0x556f2830e21e  doris::io::HdfsFileWriter::~HdfsFileWriter()
    @     0x556f507893b0  doris::vectorized::VHivePartitionWriter::~VHivePartitionWriter()
    @     0x556f507893b0  doris::vectorized::VHivePartitionWriter::~VHivePartitionWriter()
    @     0x556f506c005e  std::_Hashtable<>::clear()
    @     0x556f506c005e  std::_Hashtable<>::clear()
    @     0x556f50780f4f  doris::vectorized::VHiveTableWriter::close()
    @     0x556f50780f4f  doris::vectorized::VHiveTableWriter::close()
    @     0x556f5072bc4f  doris::vectorized::AsyncResultWriter::process_block()
    @     0x556f5072bc4f  doris::vectorized::AsyncResultWriter::process_block()
    @     0x556f5072cd01  std::_Function_handler<>::_M_invoke()
    @     0x556f5072cd01  std::_Function_handler<>::_M_invoke()
    @     0x556f2b02b73d  doris::ThreadPool::dispatch_thread()
    @     0x556f2b02b73d  doris::ThreadPool::dispatch_thread()
    @     0x556f2b008d59  doris::Thread::supervise_thread()
    @     0x556f2b008d59  doris::Thread::supervise_thread()
    @     0x7f2c2bfb4609  start_thread
    @     0x7f2c2bfb4609  start_thread
    @     0x7f2c2c261133  clone
    @     0x7f2c2c261133  clone
    @              (nil)  (unknown)
*** Query id: ac4f457c003d4489-b04ac56ef05b12f0 ***
*** is nereids: 1 ***
*** tablet id: 0 ***
*** Aborted at 1713690877 (unix time) try "date -d @1713690877" if you are using GNU date ***
*** Current BE git commitID: e6f4a2f ***
*** SIGABRT unknown detail explain (@0x38a6) received by PID 14502 (TID 184986 OR 0x7f21d0614700) from PID 14502; stack trace: ***
    @              (nil)  (unknown)
F20240421 17:14:37.505879 185108 hdfs_util.h:65] Check failed: _ref_cnt == 0 F20240421 17:14:37.887202 185110 hdfs_util.h:65] Check failed: _ref_cnt == 0 
*** Check failure stack trace: ***
    @     0x556f5236d316  google::LogMessageFatal::~LogMessageFatal()
    @     0x556f2830e200  doris::io::HdfsFileWriter::~HdfsFileWriter()
    @     0x556f2830e21e  doris::io::HdfsFileWriter::~HdfsFileWriter()
    @     0x556f507893b0  doris::vectorized::VHivePartitionWriter::~VHivePartitionWriter()
 0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /root/doris/be/src/common/signal_handler.h:421
```
The root cause is
When it cannot be processed in the cache (such as when the cache is full), we will create a `fs_handler`, and the life cycle of `fs_handler` is managed by caller. We have separated the hdfs writer and can create `fs_handler` separately, so `HdfsFileSystem` and `HdfsFileWriter` may be callers. In `HdfsFileSystem`, the `fs_handler` of `HdfsFileWriter` is shared, so it needs to be changed to `shared_ptr`.

### Solution
Fix hdfs file writer core with `check failed: _ref_cnt == 0` in dtor of `HdfsFileWriter`.
Change `fs_handler` ptr to `shared_ptr` and remove ref count operations.
morningman pushed a commit that referenced this pull request Apr 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by one committer. dev/2.1.3-merged dev/3.0.0-merged reviewed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants