Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix a bug that mnesia:add_table_copy/3 could hold a read lock forever #6013

Merged
merged 2 commits into from
Jun 7, 2022

Conversation

sile
Copy link
Contributor

@sile sile commented May 24, 2022

Hi,

I found a rare mnesia bug that could occure when a node ("A") executes mnesia:add_table_copy/3 to send a table data to another node ("B").
If node B terminated during the sending process, node A could hold a read lock forever because that waits for an already received message {copier_done, node()} again in a transaction after acquiring a read lock.
This PR fixes the redundant receives to prevent the problem.

Example code to reproduce the problem

Because this problem is inherently timing dependent, it's difficult to write a deterministic reproducing code. However, the following example code can reproduce the issue with almost 100% probability on my laptop.

-module(mnesia_debug).

-export([run/0]).

%% For RPC.
-export([init_node0/0, init_node1/0, add_table_copy/2]).

-record(table, {foo, bar, baz}).

run() ->
    {ok, Pid0, _Node0} = peer:start_link(#{name => peer:random_name(), connection => standard_io}),
    {ok, Pid1, Node1} = peer:start_link(#{name => peer:random_name(), connection => standard_io}),
    io:format("Started peer nodes\n"),

    ok = peer:call(Pid0, ?MODULE, init_node0, []),
    io:format("Initialized node0: started mnesia, created table, inserted data\n"),

    ok = peer:call(Pid1, ?MODULE, init_node1, []),
    io:format("Initialized node1: started mnesia\n"),

    {ok, _} = peer:call(Pid0, mnesia, change_config, [extra_db_nodes, [Node1]]),
    io:format("Added node1 to the `extra_db_nodes` of node`\n"),

    MainPid = self(),

    peer:cast(Pid0, ?MODULE, add_table_copy, [Node1, MainPid]),
    timer:sleep(10),  % Wait for `mnesia_loader:send_more/6` has started
    io:format("Started sending table data from node0 to node1\n"),

    AcquireLockFun = fun() ->
                             io:format("Acquire write lock to block the transaction in `mnesia_loader:finish_copy/5`\n"),
                             ok = mnesia:write_lock_table(table),
                             MainPid ! {write_lock_acquired, self()},
                             io:format("Acquired write lock\n"),
                             receive 
                                 node1_stopped -> ok
                             end,
                             io:format("Released write lock\n"),
                             MainPid ! write_lock_released,
                             ok
                     end,
    peer:cast(Pid0, mnesia, transaction, [AcquireLockFun]),

    receive
        {write_lock_acquired, LockPid} -> ok
    end,
    timer:sleep(200),  % Wait for `mnesia_loader:send_more/6` has finished
    ok = peer:stop(Pid1),
    LockPid ! node1_stopped,
    io:format("Stopped node1 so that the `{copier_done, Node1}` message will be received in `mnesia_loader:finish_copy/5`\n"),

    receive
        write_lock_released -> ok
    end,
    receive
        add_table_copy_finished -> ok
    end,
    timer:sleep(1000),  % Wait for `mnesia_loader:finish_copy/5` has acquired the read lock

    io:format("Try acquiring write lock. This call will be timed out (due to the bug).\n"),
    {atomic, ok} = peer:call(Pid0, mnesia, transaction, [fun() -> mnesia:write_lock_table(table) end]),
    io:format("Unexpectedlly acquired write lock\n"),

    ok = peer:stop(Pid0),
    io:format("Stopped node0\n"),

    ok.

init_node0() ->
    ok = mnesia:create_schema([node()]),
    ok = mnesia:start(),
    {atomic, ok} = mnesia:create_table(table, [{attributes, record_info(fields, table)}, {ram_copies, [node()]}]),

    Fun = fun() ->
                  lists:foreach(fun(I) -> ok = mnesia:write(#table{foo = I, bar = I, baz = I}) end,
                                lists:seq(1, 100000))
          end,
    {atomic, ok} = mnesia:transaction(Fun),
    ok.

init_node1() ->
    ok = mnesia:start(),
    ok.

add_table_copy(Node1, MainPid) ->
    mnesia:add_table_copy(table, Node1, ram_copies),
    MainPid ! add_table_copy_finished.

Execution result:

$ erl -name foo@127.0.0.1
Erlang/OTP 25 [erts-13.0] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1] [jit:ns]
> c(mnesia_debug).
> mnesia_debug:run().
Started peer nodes
Initialized node0: started mnesia, created table, inserted data
Initialized node1: started mnesia
Added node1 to the `extra_db_nodes` of node`
Started sending table data from node0 to node1
Acquire write lock to block the transaction in `mnesia_loader:finish_copy/5`
Acquired write lock
Stopped node1 so that the `{copier_done, Node1}` message will be received in `mnesia_loader:finish_copy/5`
Released write lock
Try acquiring write lock. This call will be timed out.
** exception exit: {timeout,{gen_server,call,
                                        [<0.160.0>,
                                         {call,mnesia,transaction,[#Fun<mnesia_debug.1.103016018>]},
                                         5000]}}
     in function  gen_server:call/3 (gen_server.erl, line 382)
     in call from peer:call/5 (peer.erl, line 221)
     in call from mnesia_debug:run/0 (mnesia_debug.erl, line 82)

@github-actions
Copy link
Contributor

github-actions bot commented May 24, 2022

CT Test Results

    2 files    55 suites   22m 39s ⏱️
660 tests 511 ✔️ 149 💤 0
707 runs  538 ✔️ 169 💤 0

Results for commit 7d47129.

♻️ This comment has been updated with latest results.

To speed up review, make sure that you have read Contributing to Erlang/OTP and that all checks pass.

See the TESTING and DEVELOPMENT HowTo guides for details about how to run test locally.

Artifacts

// Erlang/OTP Github Action Bot

@dgud dgud self-assigned this May 25, 2022
@dgud
Copy link
Contributor

dgud commented May 25, 2022

Thanks, I will take a closer look next week, long weekend here in Sweden.

@IngelaAndin IngelaAndin added the team:PS Assigned to OTP team PS label May 25, 2022
@dgud
Copy link
Contributor

dgud commented May 30, 2022

I would like a real testcase for this, since you managed to reproduce often enough (1 of 1 when I tried),
can you write that or should I copy your "idea" to make a mnesia testcase?

@dgud dgud added the testing currently being tested, tag is used by OTP internal CI label May 30, 2022
@sile
Copy link
Contributor Author

sile commented May 31, 2022

Thank you for your comment.
I'm going to try implementing an mnesia test case based on the above reproduce code.

@sile
Copy link
Contributor Author

sile commented May 31, 2022

I have a question.
Which is better, adding a new test case to an existing file or creating a new file?

@dgud
Copy link
Contributor

dgud commented May 31, 2022

There is already a add_copy_when_going_down test case in mnesia_evil_coverage_test, so
my suggestion would be to add second test case after that one.

@dgud
Copy link
Contributor

dgud commented May 31, 2022

It would also be great if the branch is re-based to 24.3.4, because this will need to be patched on 24 as well.

@sile sile force-pushed the mnesia-add-table-copy-deadlock-fix branch from 2a3cbbd to 7d47129 Compare May 31, 2022 07:26
@sile
Copy link
Contributor Author

sile commented May 31, 2022

Thank you for your advice.
I rebased this branch and added a test case to mnesia_evil_coverage_test.erl file.

@dgud dgud added testing currently being tested, tag is used by OTP internal CI and removed testing currently being tested, tag is used by OTP internal CI labels May 31, 2022
@dgud dgud merged commit 72643be into erlang:maint Jun 7, 2022
@sile sile deleted the mnesia-add-table-copy-deadlock-fix branch June 7, 2022 08:29
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
team:PS Assigned to OTP team PS testing currently being tested, tag is used by OTP internal CI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants