[M] 1931913: API-level product/content updates no longer occur in parallel (CANDLEPIN-440) #3670
Conversation
Ceiu
commented
Nov 14, 2022
- Changed the ProductManager and ContentManager to obtain pessimistic write locks before making any change or removal of products or content
- Moved several org-less content and product queries from the OwnerContentCurator or OwnerProductCurator to the ContentCurator or ProductCurator as appropriate
- Changed the output from several curator methods from lists of tuples to maps of collections to better convey exactly what was being returned and to make it easier to follow during analysis
- The OrphanCleanupJob no longer removes content that is referenced by a non-orphaned product or an environment, even in cases where the content is technically orphaned (bad content mapping)
- Added a DB-level delete cascade on Content.modifiedProductIds, as JPA is inexplicably unwilling or unable to cascade a deletion on the parent entity to an element collection when using JPA bulk deletions
- Removed some unused curator methods
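The DB-level delete cascade mentioned in the bullets above can be expressed as a foreign key with cascading deletes on the element-collection table. A hypothetical liquibase sketch of that idea (the changeset id, author, and table/column names here are placeholders for illustration, not Candlepin's actual schema):

```xml
<changeSet id="example-modified-products-cascade" author="example">
    <!-- Delete rows in the element-collection table automatically when
         the parent content row is deleted, even via JPA bulk deletes. -->
    <addForeignKeyConstraint
        baseTableName="cp_content_modified_products"
        baseColumnNames="content_uuid"
        referencedTableName="cp_content"
        referencedColumnNames="uuid"
        constraintName="fk_content_modified_products"
        onDelete="CASCADE"/>
</changeSet>
```

With the cascade enforced by the database itself, a JPA bulk delete of content rows cannot strand orphaned element-collection rows, regardless of what the provider does.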
There's a lot of churn in this change, but the heavy lifting is done by ProductManager and ContentManager switching from a pessimistic read lock to a pessimistic write lock. This effectively means only one content or product change can occur at a time via the API, but without a large restructuring of the model, we don't have many better options.

There's a new spec test in OwnerContentSpecTest which hits one of the ways this bug can arise, but the root cause appears to be simultaneous modifications to one or more content entities owned by the same product. When that happens, a race condition can occur during the child remapping step on the owning product. Due to transaction isolation, the content changes (and thus the new content UUIDs) are not visible to each other, so while both sets of content changes go through, the mapping to one of them gets clobbered. In many cases this is mostly transparent (save for the "missing" content change due to the mapping error), but it becomes very apparent once the OrphanCleanupJob runs and removes the incorrectly mapped content, leaving the product with less content than expected.

The fix here is two-fold: the aforementioned locking changes, combined with enhancements to the OrphanCleanupJob so that it no longer removes content that is still mapped to a product or environment, even if that mapping is invalid. Additionally, content removal is now performed in bulk, which may improve performance in some cases.
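To illustrate the lost-update pattern described above, here is a minimal, hypothetical sketch in plain Java. A ReentrantLock stands in for the database's pessimistic write lock, and a map stands in for the product-to-content mapping; none of these class or field names come from Candlepin's code.

```java
import java.util.*;
import java.util.concurrent.locks.ReentrantLock;

public class LostUpdateDemo {
    // Stand-in for the persisted product -> content mapping.
    static final Map<String, Set<String>> productContent = new HashMap<>();
    // Exclusive lock standing in for the pessimistic write lock: only one
    // content/product change may proceed at a time.
    static final ReentrantLock writeLock = new ReentrantLock();

    // Replace one content ID with its updated version on product "prod-1".
    static void updateContent(String oldId, String newId) {
        writeLock.lock();
        try {
            // Read-copy-update, like remapping children onto a new product
            // version. Without the exclusive lock, two threads could copy the
            // same original set and the second write would clobber the first.
            Set<String> updated = new HashSet<>(productContent.get("prod-1"));
            updated.remove(oldId);
            updated.add(newId);
            productContent.put("prod-1", updated);
        }
        finally {
            writeLock.unlock();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        productContent.put("prod-1", new HashSet<>(Arrays.asList("c1", "c2")));

        Thread t1 = new Thread(() -> updateContent("c1", "c1-v2"));
        Thread t2 = new Thread(() -> updateContent("c2", "c2-v2"));
        t1.start();
        t2.start();
        t1.join();
        t2.join();

        // With the exclusive lock, both updates always survive.
        System.out.println(new TreeSet<>(productContent.get("prod-1")));
    }
}
```

Serializing the writers is the blunt-but-safe option the comment above describes: it trades write concurrency for the guarantee that no remapping is ever computed from a stale snapshot.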
Force-pushed 50a9340 to f375611
Looks good! Small doc changes and a print statement possibly left in the spec test
spec-tests/src/test/java/org/candlepin/spec/content/OwnerContentSpecTest.java
Force-pushed f375611 to cc92e63
These test failures have me concerned. Going to flip this to a draft and take a look at them to make sure they're not related.
Look to be sporadic failures. Cannot reproduce locally at all, or reliably on Jenkins.
It is rare, but I was able to reproduce this locally in the container by using the @RepeatedTest(100) annotation to run shouldAutoSubscribePhysicalSystemsWithQuantity2PerSocketPair 100 times.
Here is a shortened version of the stack trace (I have the full one if you need it):
2022-11-17 13:11:14,038 [thread=http-bio-8443-exec-11] [req=da6d7d5d-54e8-40a5-923d-6955339573ab, org=, csid=] ERROR org.candlepin.exceptions.mappers.CandlepinExceptionMapper - Runtime Error could not extract ResultSet at org.mariadb.jdbc.export.ExceptionFactory.createException:296
org.hibernate.exception.LockAcquisitionException: could not extract ResultSet
at org.hibernate.dialect.MySQLDialect$3.convert(MySQLDialect.java:562)
at org.hibernate.exception.internal.StandardSQLExceptionConverter.convert(StandardSQLExceptionConverter.java:37)
...
at org.hibernate.query.internal.AbstractProducedQuery.getSingleResult(AbstractProducedQuery.java:1665)
at org.candlepin.model.AbstractHibernateCurator.getSystemLock(AbstractHibernateCurator.java:1339)
at org.candlepin.model.AbstractHibernateCurator.getSystemLock(AbstractHibernateCurator.java:1343)
at org.candlepin.controller.ProductManager.createProduct(ProductManager.java:232)
at com.google.inject.persist.jpa.JpaLocalTxnInterceptor.invoke(JpaLocalTxnInterceptor.java:56)
at org.candlepin.resource.OwnerProductResource.createProductByOwner(OwnerProductResource.java:164)
...
Caused by: java.sql.SQLTransactionRollbackException: (conn=28) Deadlock found when trying to get lock; try restarting transaction
...
It is concerning that it is a deadlock.
Interestingly enough: this seems to ONLY happen the first 6 times the test runs after I bring up the container/candlepin. It is as if 6 of the repeated runs are run in parallel, and they all deadlock each other. No matter how many hundreds of times I run the test again, it won't fail until the container/candlepin is started fresh.
Force-pushed cc92e63 to d30980f
@nikosmoum this is related to having an empty database and needing to create the system locks on demand; for some reason, that is a potential area of deadlock. I'll look deeper into it. If we prime the DB with those values, we don't get into that situation. The update adds a liquibase changeset to insert them into the DB if they don't already exist.
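Priming the lock rows so they never need to be created on demand can be done with an idempotent liquibase changeset along these lines. This is a hypothetical sketch: the changeset id, author, table name, and lock id are placeholders, not Candlepin's actual schema.

```xml
<changeSet id="example-prime-system-locks" author="example">
    <!-- Only insert the lock row if it is not already present, so the
         changeset is safe to run against both empty and primed databases. -->
    <preConditions onFail="MARK_RAN">
        <sqlCheck expectedResult="0">
            SELECT COUNT(*) FROM cp_system_locks WHERE id = 'content_lock'
        </sqlCheck>
    </preConditions>
    <insert tableName="cp_system_locks">
        <column name="id" value="content_lock"/>
    </insert>
</changeSet>
```

Because the row already exists at startup, concurrent requests only ever acquire a lock on an existing row rather than racing to insert it, which removes the insert-time gap locking that can deadlock on an empty table.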