
TS-3948 Lock g_records in RecDumpRecords to avoid a race #304

Closed
sekimura wants to merge 1 commit from the TS-3948 branch

Conversation

sekimura (Contributor) commented Oct 9, 2015

Under certain conditions, the data passed to RecDumpEntryCb and the data seen inside the callback can differ. The problem is that g_records is not locked.

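To make the proposed change concrete, here is a minimal sketch of the idea: hold the read side of the records lock for the entire iteration so the dump callback never sees a half-updated record. Everything below is a hypothetical stand-in for the librecords internals named in this PR (the real Record, RecDumpEntryCb, and RecDumpRecords have different signatures), and std::shared_mutex stands in for the project's own rwlock wrapper.

```cpp
// Illustrative sketch only -- hypothetical stand-ins for the librecords
// internals discussed in this PR, not the actual Traffic Server code.
#include <mutex>
#include <shared_mutex>
#include <string>
#include <vector>

enum class RecDataType { RECD_INT, RECD_STRING };

struct Record {
  std::string name;
  RecDataType data_type = RecDataType::RECD_INT;
  long long   int_value = 0;
  std::string string_value;
};

using RecDumpEntryCb = void (*)(const Record &rec, void *edata);

std::vector<Record> g_records;        // global record store (stand-in)
std::shared_mutex   g_records_rwlock; // guards g_records (stand-in for the rwlock)

// The spirit of this PR: take the read lock around the whole iteration so a
// concurrent writer cannot change a record's type or value while the dump
// callback is reading it.
void RecDumpRecords(RecDumpEntryCb cb, void *edata)
{
  std::shared_lock<std::shared_mutex> lock(g_records_rwlock); // shared (read) lock
  for (const Record &rec : g_records) {
    cb(rec, edata); // the callback sees a consistent data_type/value pair
  }
}
```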
jpeach (Contributor) commented Oct 9, 2015

How does this happen? There are lots of places that iterate over the records without holding the read lock. We should make sure that the lock is held correctly and consistently in all places.

sekimura (Contributor, Author) commented Oct 9, 2015

@jpeach I've added the backtrace from the crash we hit to https://issues.apache.org/jira/browse/TS-3948

zwoop (Contributor) commented Oct 9, 2015

James, it does hold this lock in a lot of places. :) But yeah, maybe we need to file a subsequent Jira on the other APIs? The crash seems to happen when something (e.g. propstats) modifies librecords while, at the same time, stats_over_http iterates over the metrics.
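For illustration, the writer side of that scenario needs the matching exclusive lock; continuing the hypothetical sketch from the PR description above (stand-in names, not the real propstats or librecords code):

```cpp
// Hypothetical writer path (a "propstats"-style updater), reusing the stand-in
// types from the earlier sketch. Holding the exclusive lock means a reader
// that holds the shared lock in RecDumpRecords can never observe the type
// changed while the value has not been written yet (or vice versa).
void RecSetString(const std::string &name, const std::string &value)
{
  std::unique_lock<std::shared_mutex> lock(g_records_rwlock); // exclusive (write) lock
  for (Record &rec : g_records) {
    if (rec.name == name) {
      rec.data_type    = RecDataType::RECD_STRING; // type and value change together,
      rec.string_value = value;                    // atomically with respect to readers
      return;
    }
  }
}
```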

PSUdaemon (Contributor)

FWIW, I use this in prod and our corruption issues have gone away. We are not using hardening, so I think we see silent corruption instead of the crashes that @zwoop and @sekimura see.

jpeach (Contributor) commented Oct 9, 2015

For example, RecLookupMatchingRecords doesn't lock around g_records, likewise g_records is not locked anywhere in P_RecCore.cc. The g_records_rwlock lock seems to be used when accessing the records store via the hash table, so it is not clear to me why you need to lock here.

In the stack trace, the relevant record is of type RECD_INT, but by the time it gets to the plugin it is RECD_STRING. The proxy.process.ssl.origin_server_decryption_failed metric is a RECD_INT, so how did it get a partial update to a different type? Rather than adding extra locking, I really want an explanation of why the global lock should be held here.
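To spell out the failure mode that stack trace suggests, here is the racy (unlocked) read path in the same hypothetical sketch. Without the lock, the callback can catch a record between the write of its type and the write of its value; since the real record value lives in a tagged union, taking the wrong branch means treating an integer payload as a string pointer, which would look exactly like a RECD_INT arriving at the plugin as a RECD_STRING.

```cpp
#include <cstdio> // in addition to the includes from the earlier sketch

// The racy pattern: no lock, so the callback may observe data_type from one
// update and the value from another -- a torn record.
void RecDumpRecordsUnlocked(RecDumpEntryCb cb, void *edata)
{
  for (const Record &rec : g_records) {
    cb(rec, edata); // may deliver a mismatched data_type/value pair
  }
}

// Example callback in the style of a stats-dumping plugin (hypothetical):
// it branches on data_type, so a torn record sends it down the wrong path.
void DumpEntry(const Record &rec, void * /* edata */)
{
  if (rec.data_type == RecDataType::RECD_STRING) {
    std::printf("%s: %s\n", rec.name.c_str(), rec.string_value.c_str());
  } else {
    std::printf("%s: %lld\n", rec.name.c_str(), rec.int_value);
  }
}
```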

PSUdaemon (Contributor)

Wanted to update: our prod issues did not go away completely, but they were greatly reduced. So I don't think this is the correct or complete fix, but it's related.

bryancall (Contributor)

This needs to be investigated more to completely solve this issue.

zwoop (Contributor) commented Mar 27, 2016

@sekimura I think we should close this for now, and maybe you can produce a more complete fix? If you agree, please close this PR.

sekimura (Contributor, Author)

Since this is not a complete fix, I'll close this PR.

sekimura closed this Mar 27, 2016
PSUdaemon (Contributor)

Is there a new issue for this? I have some more information to share.

zwoop (Contributor) commented May 5, 2016

Well, TS-3948 is still open, use that?

sekimura deleted the TS-3948 branch December 1, 2016 22:39
SolidWallOfCode pushed a commit to SolidWallOfCode/trafficserver that referenced this pull request Feb 1, 2017
YTSATS-935: TSStringPercentDecode null-termination