New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix de-serialization bug where AddrMan is left corrupted #7696

Merged
merged 1 commit into from May 17, 2016

Conversation

Projects
None yet
5 participants
@EthanHeilman
Contributor

EthanHeilman commented Mar 16, 2016

CAddrDB::Read is used to manage the loading of AddrMan from peers.dat. As shown in the code below, when CAddrDB::Read catches an exception from the de-serialization code it returns addrman "as-is", despite the fact that it failed to load correctly.

    try {
        // de-serialize file header (network specific magic number) and ..
        ssPeers >> FLATDATA(pchMsgTmp);

        // ... verify the network matches ours
        if (memcmp(pchMsgTmp, Params().MessageStart(), sizeof(pchMsgTmp)))
            return error("%s: Invalid network magic number", __func__);

        // de-serialize address data into one CAddrMan object
        ssPeers >> addr;
    }
    catch (const std::exception& e) {
        return error("%s: Deserialize or I/O error - %s", __func__, e.what());
    }

https://github.com/bitcoin/bitcoin/blob/master/src/net.cpp#L2330

This use of a corrupted addrman can cause a bitcoind client to get in a state such that when bitcoind starts it will not run correctly. Once bitcoind gets into such a state the only fix is for the user to manually delete the offending peers.dat file.

This pull request fixes this behavior so that an exception during the de-serialization process will leave addrman in a clean state. Unittests verify this new behavior.

@paveljanik

View changes

Show outdated Hide outdated src/test/net_tests.cpp
@@ -60,6 +60,7 @@ BITCOIN_TESTS =\
test/merkle_tests.cpp \
test/miner_tests.cpp \
test/multisig_tests.cpp \
test/net_tests.cpp \

This comment has been minimized.

@paveljanik

paveljanik Mar 16, 2016

Contributor

You test addrman, not net. Please name the file addrman_tests.cpp.

@paveljanik

paveljanik Mar 16, 2016

Contributor

You test addrman, not net. Please name the file addrman_tests.cpp.

This comment has been minimized.

@EthanHeilman

EthanHeilman Mar 16, 2016

Contributor

I am testing CAddrDB, which is called from and lives in net.h and net.cpp not CAddrMan.

@EthanHeilman

EthanHeilman Mar 16, 2016

Contributor

I am testing CAddrDB, which is called from and lives in net.h and net.cpp not CAddrMan.

This comment has been minimized.

@jonasschnelli

jonasschnelli Mar 16, 2016

Member

No strong opinion: IMO we don't need to use an identical filename for what is tested in the test. I think addrman_tests.cpp would be more clear.

@jonasschnelli

jonasschnelli Mar 16, 2016

Member

No strong opinion: IMO we don't need to use an identical filename for what is tested in the test. I think addrman_tests.cpp would be more clear.

This comment has been minimized.

@EthanHeilman

EthanHeilman Mar 16, 2016

Contributor

If someone feels strongly I'm happy to move it to addrman_tests.cpp. I wrote all the tests in addrman_tests and choose to put this test in net_tests since it feels more common to net than addrman.

Ideally CAddrDB, CBanDB and other files which manage interaction with the data directory should be refactored into a common file/class but that is much larger task (for instance CBanDB is very tightly coupled to net.cpp).

@EthanHeilman

EthanHeilman Mar 16, 2016

Contributor

If someone feels strongly I'm happy to move it to addrman_tests.cpp. I wrote all the tests in addrman_tests and choose to put this test in net_tests since it feels more common to net than addrman.

Ideally CAddrDB, CBanDB and other files which manage interaction with the data directory should be refactored into a common file/class but that is much larger task (for instance CBanDB is very tightly coupled to net.cpp).

@paveljanik

This comment has been minimized.

Show comment
Hide comment
@paveljanik

paveljanik Mar 16, 2016

Contributor

Travis fails with:

test/net_tests.cpp(93): error: in "addrman_tests/caddrdb_deserialiation": check addrman1.size() == 3 has failed
test/net_tests.cpp(105): error: in "addrman_tests/caddrdb_deserialiation": check addrman2.size() == 3 has failed

But only in one configuration.

Contributor

paveljanik commented Mar 16, 2016

Travis fails with:

test/net_tests.cpp(93): error: in "addrman_tests/caddrdb_deserialiation": check addrman1.size() == 3 has failed
test/net_tests.cpp(105): error: in "addrman_tests/caddrdb_deserialiation": check addrman2.size() == 3 has failed

But only in one configuration.

@jonasschnelli jonasschnelli added the P2P label Mar 16, 2016

@jonasschnelli

This comment has been minimized.

Show comment
Hide comment
@jonasschnelli

jonasschnelli Mar 16, 2016

Member

Concept ACK, more tests are better. Haven't verified the tests though.

Member

jonasschnelli commented Mar 16, 2016

Concept ACK, more tests are better. Haven't verified the tests though.

@EthanHeilman

This comment has been minimized.

Show comment
Hide comment
@EthanHeilman

EthanHeilman Mar 16, 2016

Contributor

@paveljanik The non-determinism of the test passing is worrying especially because the test that was failing was not testing new behavior. I haven't been able to replicate this failure locally, any idea what is happening with it?

Contributor

EthanHeilman commented Mar 16, 2016

@paveljanik The non-determinism of the test passing is worrying especially because the test that was failing was not testing new behavior. I haven't been able to replicate this failure locally, any idea what is happening with it?

@EthanHeilman

This comment has been minimized.

Show comment
Hide comment
@EthanHeilman

EthanHeilman Mar 16, 2016

Contributor

Can someone add the labels bug and data corruption.

Contributor

EthanHeilman commented Mar 16, 2016

Can someone add the labels bug and data corruption.

@laanwj

This comment has been minimized.

Show comment
Hide comment
@laanwj

laanwj Mar 24, 2016

Member

Concept ACK.

Instead of relying on the output of CAddrDB::Read to be "clean" in case of an error I'd prefer a different approach, though: ignore and reset its output when the function fails. In my experience this is a more robust approach - it's easy to forget resetting it in some code path. In a lot of APIs the output of failing functions is thus undefined.

Member

laanwj commented Mar 24, 2016

Concept ACK.

Instead of relying on the output of CAddrDB::Read to be "clean" in case of an error I'd prefer a different approach, though: ignore and reset its output when the function fails. In my experience this is a more robust approach - it's easy to forget resetting it in some code path. In a lot of APIs the output of failing functions is thus undefined.

@EthanHeilman

This comment has been minimized.

Show comment
Hide comment
@EthanHeilman

EthanHeilman Mar 24, 2016

Contributor

@laanwj I'm not sure I follow, do you want the "clearing" to happen closer to the source of the error? In the serialization code?

Contributor

EthanHeilman commented Mar 24, 2016

@laanwj I'm not sure I follow, do you want the "clearing" to happen closer to the source of the error? In the serialization code?

@laanwj

This comment has been minimized.

Show comment
Hide comment
@laanwj

laanwj Mar 25, 2016

Member

@EthanHeilman At the call site:

        if (adb.Read(addrman))
            LogPrintf("Loaded %i addresses from peers.dat  %dms\n", addrman.size(), GetTimeMillis() - nStart);
        else {
            addrman.clear(); // Addrman can be in an inconsistent state after failure, reset it
            LogPrintf("Invalid or missing peers.dat; recreating\n");
            DumpAddresses();
        }

It's hard to guarantee that adb.Read won't leave the passed addrman in an inconsistent state after failure in all possible internal code paths. This would at least catch all possible cases.

The cleanest way if we had a clean slate, I think, would be to have Read return an AddrMan* which is null if an error happened, then instantiate a new one on faiilure. But as we have that global object that's too impactful.

Member

laanwj commented Mar 25, 2016

@EthanHeilman At the call site:

        if (adb.Read(addrman))
            LogPrintf("Loaded %i addresses from peers.dat  %dms\n", addrman.size(), GetTimeMillis() - nStart);
        else {
            addrman.clear(); // Addrman can be in an inconsistent state after failure, reset it
            LogPrintf("Invalid or missing peers.dat; recreating\n");
            DumpAddresses();
        }

It's hard to guarantee that adb.Read won't leave the passed addrman in an inconsistent state after failure in all possible internal code paths. This would at least catch all possible cases.

The cleanest way if we had a clean slate, I think, would be to have Read return an AddrMan* which is null if an error happened, then instantiate a new one on faiilure. But as we have that global object that's too impactful.

@EthanHeilman

This comment has been minimized.

Show comment
Hide comment
@EthanHeilman

EthanHeilman Mar 25, 2016

Contributor

@laanwj Good idea, I amended the commit to include addrman.Clear in StartNode's Read() error condition.

Contributor

EthanHeilman commented Mar 25, 2016

@laanwj Good idea, I amended the commit to include addrman.Clear in StartNode's Read() error condition.

@laanwj

View changes

Show outdated Hide outdated src/net.cpp
@EthanHeilman

This comment has been minimized.

Show comment
Hide comment
@EthanHeilman

EthanHeilman Apr 18, 2016

Contributor

Does this pull-request need any additional changes?

Contributor

EthanHeilman commented Apr 18, 2016

Does this pull-request need any additional changes?

@laanwj

This comment has been minimized.

Show comment
Hide comment
@laanwj

laanwj Apr 19, 2016

Member

I still don't really like passing the hash into bool CAddrDB::Read, I think it makes for an awful API :)

Member

laanwj commented Apr 19, 2016

I still don't really like passing the hash into bool CAddrDB::Read, I think it makes for an awful API :)

Fix de-serialization bug where AddrMan is corrupted after exception
* CAddrDB modified so that when de-serialization code throws an exception Addrman is reset to a clean state
* CAddrDB modified to make unit tests possible
* Regression test created to ensure bug is fixed
* StartNode modifed to clear adrman if CAddrDB::Read returns an error code.
@EthanHeilman

This comment has been minimized.

Show comment
Hide comment
@EthanHeilman

EthanHeilman May 4, 2016

Contributor

@laanwj I removed hash from CAddrDB::Read

Contributor

EthanHeilman commented May 4, 2016

@laanwj I removed hash from CAddrDB::Read

@sipa

This comment has been minimized.

Show comment
Hide comment
@sipa

sipa May 5, 2016

Member

utACK 1475ecf

Member

sipa commented May 5, 2016

utACK 1475ecf

@laanwj

This comment has been minimized.

Show comment
Hide comment
@laanwj

laanwj May 5, 2016

Member

@EthanHeilman Looks good to me now, thanks. utACK

Member

laanwj commented May 5, 2016

@EthanHeilman Looks good to me now, thanks. utACK

} catch (const std::exception& e) {
exceptionThrown = true;
}
// Even through de-serialization failed adddrman is not left in a clean state.

This comment has been minimized.

@sipa

sipa May 17, 2016

Member

adddrman

@sipa

sipa May 17, 2016

Member

adddrman

@sipa sipa merged commit 1475ecf into bitcoin:master May 17, 2016

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

sipa added a commit that referenced this pull request May 17, 2016

Merge #7696: Fix de-serialization bug where AddrMan is left corrupted
1475ecf Fix de-serialization bug where AddrMan is corrupted after exception * CAddrDB modified so that when de-serialization code throws an exception Addrman is reset to a clean state * CAddrDB modified to make unit tests possible * Regression test created to ensure bug is fixed * StartNode modifed to clear adrman if CAddrDB::Read returns an error code. (EthanHeilman)

This was referenced May 18, 2016

sickpig referenced this pull request in sickpig/BitcoinUnlimited Apr 13, 2017

Merge #7696: Fix de-serialization bug where AddrMan is left corrupted
1475ecf Fix de-serialization bug where AddrMan is corrupted after exception (EthanHeilman)

* CAddrDB modified so that when de-serialization code throws an exception Addrman is reset to a clean state
* CAddrDB modified to make unit tests possible
* Regression test created to ensure bug is fixed
* StartNode modifed to clear adrman if CAddrDB::Read returns an error code.

sickpig referenced this pull request in sickpig/BitcoinUnlimited Apr 14, 2017

Merge #7696: Fix de-serialization bug where AddrMan is left corrupted
1475ecf Fix de-serialization bug where AddrMan is corrupted after exception (EthanHeilman)

* CAddrDB modified so that when de-serialization code throws an exception Addrman is reset to a clean state
* CAddrDB modified to make unit tests possible
* Regression test created to ensure bug is fixed
* StartNode modifed to clear adrman if CAddrDB::Read returns an error code.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment