Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LMDB corruption recovery -- follow-up #3880

Open
wants to merge 7 commits into
base: master
from

Conversation

@vpodzime
Copy link
Contributor

commented Oct 8, 2019

@vpodzime vpodzime added the WIP label Oct 8, 2019
@lgtm-com

This comment has been minimized.

Copy link

commented Oct 8, 2019

This pull request introduces 10 alerts and fixes 25 when merging 3aebce9 into 4ebcf91 - view on LGTM.com

new alerts:

  • 9 for Pointer argument is dereferenced without checking for NULL
  • 1 for Time-of-check time-of-use filesystem race condition

fixed alerts:

  • 25 for Pointer argument is dereferenced without checking for NULL
@vpodzime vpodzime force-pushed the vpodzime:master-lmdb_recover_avoid__exit branch 2 times, most recently from a4eb259 to 3f1a347 Oct 9, 2019
@lgtm-com

This comment has been minimized.

Copy link

commented Oct 10, 2019

This pull request introduces 12 alerts and fixes 25 when merging 3f1a347 into e9b38c1 - view on LGTM.com

new alerts:

  • 9 for Pointer argument is dereferenced without checking for NULL
  • 3 for Time-of-check time-of-use filesystem race condition

fixed alerts:

  • 25 for Pointer argument is dereferenced without checking for NULL
@vpodzime vpodzime force-pushed the vpodzime:master-lmdb_recover_avoid__exit branch from 3f1a347 to 5f1b7b7 Oct 18, 2019
This was referenced Oct 18, 2019
@vpodzime vpodzime force-pushed the vpodzime:master-lmdb_recover_avoid__exit branch from 5f1b7b7 to 6a9bf01 Oct 20, 2019
@vpodzime vpodzime removed the WIP label Oct 20, 2019
@vpodzime vpodzime requested review from olehermanse and craigcomstock Oct 20, 2019
@vpodzime vpodzime force-pushed the vpodzime:master-lmdb_recover_avoid__exit branch 5 times, most recently from b7c1c67 to 300930a Oct 21, 2019
@lgtm-com

This comment has been minimized.

Copy link

commented Oct 21, 2019

This pull request introduces 11 alerts and fixes 1 when merging 300930a into 7b2938d - view on LGTM.com

new alerts:

  • 9 for Pointer argument is dereferenced without checking for NULL
  • 2 for Time-of-check time-of-use filesystem race condition

fixed alerts:

  • 1 for File created without restricting permissions
@vpodzime vpodzime force-pushed the vpodzime:master-lmdb_recover_avoid__exit branch 3 times, most recently from 137b4a5 to e370fa5 Oct 21, 2019
@lgtm-com

This comment has been minimized.

Copy link

commented Oct 21, 2019

This pull request introduces 11 alerts and fixes 1 when merging e370fa5 into f3bad22 - view on LGTM.com

new alerts:

  • 9 for Pointer argument is dereferenced without checking for NULL
  • 2 for Time-of-check time-of-use filesystem race condition

fixed alerts:

  • 1 for File created without restricting permissions
@vpodzime vpodzime force-pushed the vpodzime:master-lmdb_recover_avoid__exit branch from e370fa5 to 31f0b45 Oct 21, 2019
vpodzime added 6 commits Oct 8, 2019
C doesn't support exceptions so in the LMDB corruption handling
callback we have no way to backtrack to the code that run the
LMDB operation that hit the corruption and tell it that things
went wrong. Our only chance is to exit the process. To avoid our
exit handlers from potentially messing up the freshly-repaired
LMDB file, we have to either use _exit() or make sure the LMDB
file is not touched by the exit handlers. This change implements
the latter.

We also use special exit codes to signal that LMDB corruption was
detected and either repaired or failed to be reparied.

Ticket: CFE-3127
Changelog: None
We need to make sure the file is repaired for future use.
SIGBUS means unaligned memory access or an attempt to access area
outside of mmap(). In most cases for us this means an LMDB
corruption. Thus if a process gets SIGBUS, it touches a flag file
that triggers a DB repair attempt in the next agent or cf-execd
run.

Ticket: CFE-3127
Changelog: Received SIGBUS now triggers a repair of local DBs
repair_lmdb_file() is the function that repairs the given LMDB
file and thus it should be the same function recording the repair
timestamp. However, some complexities are out of its scope. Let's
make it either take a file descriptor for the repair timestamp
file, assuming that file locking and such stuff is taken care of,
or do the simple repair timestamp recording on its own.

Also treat its return code in a more appropriate way -- only a
real failure should be considered failure. If the data
recovery/replication fails, but the LMDB file is moved out of the
way (removed), it should be considered as successful repair.
@vpodzime vpodzime force-pushed the vpodzime:master-lmdb_recover_avoid__exit branch from 31f0b45 to b5ee6a3 Oct 21, 2019
@lgtm-com

This comment has been minimized.

Copy link

commented Oct 21, 2019

This pull request introduces 11 alerts and fixes 1 when merging b5ee6a3 into f3bad22 - view on LGTM.com

new alerts:

  • 9 for Pointer argument is dereferenced without checking for NULL
  • 2 for Time-of-check time-of-use filesystem race condition

fixed alerts:

  • 1 for File created without restricting permissions
@vpodzime vpodzime force-pushed the vpodzime:master-lmdb_recover_avoid__exit branch from b5ee6a3 to e36cf72 Oct 21, 2019
@lgtm-com

This comment has been minimized.

Copy link

commented Oct 21, 2019

This pull request introduces 11 alerts and fixes 1 when merging e36cf72 into f3bad22 - view on LGTM.com

new alerts:

  • 9 for Pointer argument is dereferenced without checking for NULL
  • 2 for Time-of-check time-of-use filesystem race condition

fixed alerts:

  • 1 for File created without restricting permissions
Copy link
Member

left a comment

Looks mostly good! (Please make sure it passes jenkins before merging)

bool force_repair = false;
{
char repair_flag_file[PATH_MAX] = { 0 };
xsnprintf(repair_flag_file, PATH_MAX, "%s%c%s",

This comment has been minimized.

Copy link
@olehermanse

olehermanse Oct 21, 2019

Member

Could you maybe add a comment to explain what creates this flag file and when?

/* This is full of race-conditions, but it's just a best-effort
* thing. If a force-repair is missed, it will happen next time. If it's
* done twice, no big deal. */
if (access(repair_flag_file, F_OK) == 0)

This comment has been minimized.

Copy link
@olehermanse

olehermanse Oct 21, 2019

Member

Copy-pasted code, could maybe be a function?

free(tstamp_file);
local_fd_tstamp = fd_tstamp;
}
if (!ExclusiveLockFileCheck(fd_tstamp) && ExclusiveLockFile(fd_tstamp, true) == -1)

This comment has been minimized.

Copy link
@olehermanse

olehermanse Oct 21, 2019

Member

I think the comparison expression should be parenthesized.

@vpodzime vpodzime force-pushed the vpodzime:master-lmdb_recover_avoid__exit branch from e36cf72 to 2e12ad9 Oct 21, 2019
@lgtm-com

This comment has been minimized.

Copy link

commented Oct 21, 2019

This pull request introduces 11 alerts and fixes 1 when merging 2e12ad9 into f3bad22 - view on LGTM.com

new alerts:

  • 9 for Pointer argument is dereferenced without checking for NULL
  • 2 for Time-of-check time-of-use filesystem race condition

fixed alerts:

  • 1 for File created without restricting permissions
File locking in libntech is now a bit more abstract and
versatile.
@vpodzime vpodzime force-pushed the vpodzime:master-lmdb_recover_avoid__exit branch from 2e12ad9 to 9e3c789 Oct 22, 2019
@lgtm-com

This comment has been minimized.

Copy link

commented Oct 22, 2019

This pull request introduces 11 alerts and fixes 1 when merging 9e3c789 into f3bad22 - view on LGTM.com

new alerts:

  • 9 for Pointer argument is dereferenced without checking for NULL
  • 2 for Time-of-check time-of-use filesystem race condition

fixed alerts:

  • 1 for File created without restricting permissions
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
2 participants
You can’t perform that action at this time.