Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Data bridge target unavailable #10645

Merged
merged 4 commits into from Jun 21, 2023

Conversation

paulozulato
Copy link
Contributor

@paulozulato paulozulato commented May 8, 2023

Fixes https://emqx.atlassian.net/browse/EMQX-9026

Summary

馃 Generated by Copilot at 6b23c31

This pull request enhances the error handling and reporting of the MySQL and PostgreSQL connectors when the target table is not created. It also updates the default SQL statement for the MySQL connector to match the new table schema. It affects the files emqx_connector_mysql.erl, emqx_connector_pgsql.erl, emqx_resource_buffer_worker.erl, and emqx_ee_bridge_mysql.erl.

PR Checklist

Please convert it to a draft if any of the following conditions are not met. Reviewers may skip over until all the items are checked:

  • Added tests for the changes
  • Changed lines covered in coverage report
  • Change log has been added to changes/{ce,ee}/(feat|perf|fix)-<PR-id>.en.md files
  • For internal contributor: there is a jira ticket to track this change
  • If there should be document changes, a PR to emqx-docs.git is sent, or a jira ticket is created to follow up
  • Schema changes are backward compatible

Checklist for CI (.github/workflows) changes

  • If changed package build workflow, pass this action (manual trigger)
  • Change log has been added to changes/ dir for user-facing artifacts update

@paulozulato paulozulato force-pushed the data-bridge-target-unavailable branch from 6b23c31 to 9dcdc19 Compare May 10, 2023 00:54
@paulozulato paulozulato force-pushed the data-bridge-target-unavailable branch 2 times, most recently from a561f3d to e9bf002 Compare May 18, 2023 00:52
@paulozulato paulozulato force-pushed the data-bridge-target-unavailable branch 3 times, most recently from b5c67e9 to ed91d81 Compare June 9, 2023 20:14
Comment on lines 140 to 153
% wolff starts a socket to get metadata and closes it asynchronously (calling a spawn to
% actually close it), causing some {'EXIT', tcp_closed} errors.
process_flag(trap_exit, true),
Ret =
case catch wolff_client:check_if_topic_exists(Hosts, KafkaConfig, KafkaTopic) of
ok ->
ok;
{error, unknown_topic_or_partition} ->
unhealthy_target;
_ ->
error
end,
process_flag(trap_exit, false),
Ret.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
% wolff starts a socket to get metadata and closes it asynchronously (calling a spawn to
% actually close it), causing some {'EXIT', tcp_closed} errors.
process_flag(trap_exit, true),
Ret =
case catch wolff_client:check_if_topic_exists(Hosts, KafkaConfig, KafkaTopic) of
ok ->
ok;
{error, unknown_topic_or_partition} ->
unhealthy_target;
_ ->
error
end,
process_flag(trap_exit, false),
Ret.
try wolff_client:check_if_topic_exists(Hosts, KafkaConfig, KafkaTopic) of
ok ->
ok;
{error, unknown_topic_or_partition} ->
unhealthy_target;
_ ->
error
catch
_:_ -> error
end.

?

Comment on lines 331 to 341
lists:foldl(
fun
(_, connected) ->
connected;
({_Partition, Pid}, Acc) ->
case is_pid(Pid) andalso erlang:is_process_alive(Pid) of
false ->
Acc;
true ->
CheckIfTopicExists(Pid, KafkaTopic, Acc)
end
end,
disconnected,
Leaders
);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could disentangle the two checks to make it more readable, and check the topic existence only once?

on_get_status ->
    case do_get_status(..., State) of
         {disconnected, unhealthy} -> {disconnected, State, unhealthy};
         Status -> Status
    end.

%% ...

do_get_status ->
    HealthyLeaders = lists:filter(fun(L) -> is_pid ... end, Leaders),
    case HealthyLeaders of
        [Pid | _] ->
            do_get_topic_status(Pid, Topic);
        [] ->
            disconnected
    end.

do_get_topic_status ->
     try wolf_client:check_if_topic_exists(Pid, Topic) of
       ...
     catch
        _:_ -> disconnected
     end.

Comment on lines 601 to 609
delete_all_bridges(),
?check_trace(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: maybe split this test case into 2.

Comment on lines 675 to 682
delete_bridge(Config),
?check_trace(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Split into two separate cases?

end.

check_if_topic_exists(Hosts, KafkaConfig, KafkaTopic) ->
% wolff starts a socket to get metadata and closes it asynchronously (calling a spawn to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

then setting trap_exit immediately after would not help? it could happen that the trap_exist is set back to false before the socket is down.
better solution is to span_monitor a process to do it and drain the DOWN message

do_get_status(Client, #{kafka_topic := KafkaTopic} = State) ->
CheckIfTopicExists =
fun(Pid, Topic, Default) ->
case wolff_client:check_if_topic_exists(Pid, Topic) of
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the return value match of wolff_client:check_if_topic_exists is repeated, maybe create a wrapper function instead.

@paulozulato paulozulato force-pushed the data-bridge-target-unavailable branch from ed91d81 to a02fa2d Compare June 15, 2023 23:05
@coveralls
Copy link
Collaborator

coveralls commented Jun 16, 2023

Pull Request Test Coverage Report for Build 5337384650

  • 83 of 106 (78.3%) changed or added relevant lines in 6 files are covered.
  • 30 unchanged lines in 16 files lost coverage.
  • Overall coverage increased (+0.05%) to 81.803%

Changes Missing Coverage Covered Lines Changed/Added Lines %
apps/emqx_resource/src/emqx_resource_buffer_worker.erl 0 1 0.0%
apps/emqx_connector/src/emqx_connector_mysql.erl 10 14 71.43%
apps/emqx_connector/src/emqx_connector_pgsql.erl 12 17 70.59%
apps/emqx_oracle/src/emqx_oracle.erl 42 47 89.36%
apps/emqx_bridge_kafka/src/emqx_bridge_kafka_impl_producer.erl 17 25 68.0%
Files with Coverage Reduction New Missed Lines %
apps/emqx_conf/src/emqx_conf_schema.erl 1 95.0%
apps/emqx_dashboard/src/emqx_dashboard_desc_cache.erl 1 94.44%
apps/emqx_dashboard/src/emqx_dashboard_monitor.erl 1 66.21%
apps/emqx_gateway_mqttsn/src/emqx_mqttsn_frame.erl 1 63.48%
apps/emqx_oracle/src/emqx_oracle.erl 1 82.58%
apps/emqx_resource/src/emqx_resource_buffer_worker.erl 1 93.4%
apps/emqx/src/emqx_cm.erl 1 92.05%
apps/emqx/src/emqx_logger_textfmt.erl 1 75.0%
apps/emqx/src/emqx_message.erl 1 94.83%
apps/emqx/src/emqx_quic_data_stream.erl 1 75.21%
Totals Coverage Status
Change from base Build 5333842210: 0.05%
Covered Lines: 29958
Relevant Lines: 36622

馃挍 - Coveralls

@paulozulato paulozulato marked this pull request as ready for review June 16, 2023 15:54
@paulozulato paulozulato requested review from a team and JimMoen as code owners June 16, 2023 15:54
Comment on lines 319 to 323
case do_get_topic_status(HealthyLeaders, KafkaTopic) of
error ->
% it can be that topic is unknown or there was some error getting the leaders (like
% some connection issue). Therefore, let's try to get metadata for that topic directly.
do_get_topic_status(Hosts, KafkaConfig, KafkaTopic);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case, I guess we could just call do_get_topic_status(Hosts, KafkaConfig, KafkaTopic) directly rather than going through do_get_topic_status(HealthyLeaders, KafkaTopic)?

Comment on lines 348 to 350
(_Pid, ok) ->
ok;
(Pid, Acc) ->
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should stop iterating when we already know the topic is unknown.

Comment on lines 318 to 319
HealthyLeaders = do_get_healthy_leaders(Client, KafkaTopic),
case do_get_topic_status(HealthyLeaders, KafkaTopic) of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the unknown topic error seems to take precedence over connection issues, we could check the topic first and short-circuit if it's unknown, instead of checking the leader pids.

Comment on lines 370 to 400
RetPid = self(),
QueryRef = make_ref(),
CheckTopic = fun() ->
Ret =
case wolff_client:check_if_topic_exists(Hosts, KafkaConfig, KafkaTopic) of
ok ->
ok;
{error, unknown_topic_or_partition} ->
unhealthy_target;
_ ->
error
end,
RetPid ! {QueryRef, Ret}
end,
% wolff starts a socket to get metadata and closes it asynchronously (calling a spawn to
% actually close it), causing some {'EXIT', tcp_closed} errors. Therefore, it'll be executed
% by a spawn process to avoid crashing the main process.
{Pid, Ref} = spawn_monitor(CheckTopic),
% wait process terminate
receive
{'DOWN', Ref, process, Pid, _What} -> ok
after 5000 ->
?SLOG(error, #{msg => "failed to check kafka topic", pid => Pid}),
demonitor(Ref),
exit(Pid, kill)
end,
% check for result
receive
{QueryRef, Result} -> Result
after 500 ->
error
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like this could use emqx_utils:nolink_apply.

try 
  emqx_utils:nolink_apply(fun() -> check... end, Timeout)
catch
  exit:_ -> ...;
  error:_ -> ...

lists:foldl(
fun
(WorkerPid, ok) ->
{ok, Conn} = ecpool_worker:client(WorkerPid),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better avoid exploding here?

@paulozulato paulozulato force-pushed the data-bridge-target-unavailable branch from a02fa2d to ca40e80 Compare June 19, 2023 20:02
wolff_client:check_if_topic_exists(Hosts, KafkaConfig, KafkaTopic)
end,
try
case emqx_utils:nolink_apply(CheckTopicFun, 500) of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe 500 ms is a bit too strict and could lead to false positives?

Suggested change
case emqx_utils:nolink_apply(CheckTopicFun, 500) of
case emqx_utils:nolink_apply(CheckTopicFun, 5_000) of

@@ -224,7 +224,10 @@ on_get_status(_InstId, #{pool_name := PoolName} = State) ->
{connected, NState};
{error, _Reason} ->
%% do not log error, it is logged in prepare_sql_to_conn
connecting
connecting;
{undefined_table, NState} ->
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit:

Suggested change
{undefined_table, NState} ->
{error, {undefined_table, NState}} ->

(will require reordering with the catch-all clause above)

lists:foldl(
fun
(WorkerPid, ok) ->
{ok, Conn} = ecpool_worker:client(WorkerPid),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one is still explosive.

lists:foldl(
fun
(WorkerPid, ok) ->
{ok, Conn} = ecpool_worker:client(WorkerPid),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

馃挘

apps/emqx_oracle/src/emqx_oracle.erl Show resolved Hide resolved
@paulozulato paulozulato force-pushed the data-bridge-target-unavailable branch 3 times, most recently from 1f9ebb3 to a5ceaf3 Compare June 21, 2023 13:57
thalesmg
thalesmg previously approved these changes Jun 21, 2023
Copy link
Contributor

@thalesmg thalesmg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are still a few occurrences of {undefined_table, _} instead of {error, {undefined_table, _}}, but that's a nit that could be addressed at a later time.

@paulozulato
Copy link
Contributor Author

there are still a few occurrences of {undefined_table, _} instead of {error, {undefined_table, _}}, but that's a nit that could be addressed at a later time.

fixed =)

@paulozulato paulozulato merged commit 62d3766 into emqx:master Jun 21, 2023
130 checks passed
@paulozulato paulozulato deleted the data-bridge-target-unavailable branch June 21, 2023 21:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants