Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[M] 1931913: API-level product/content updates no longer occur in parallel (CANDLEPIN-440) #3670

Merged
merged 1 commit into from Nov 17, 2022

Conversation

Ceiu
Copy link
Contributor

@Ceiu Ceiu commented Nov 14, 2022

  • Changed the ProductManager and ContentManager to obtain pessimistic write locks before making any change or removal of products or content
  • Moved several org-less content and product queries from the OwnerContentCurator or OwnerProductCurator to the ContentCurator or ProductCurator as appropriate
  • Changed the output from several curator methods from lists of tuples to maps of collections to better convey exactly what was being returned and to make it easier to follow during analysis
  • The OrphanCleanupJob no longer removes content that is referenced by a non-orphaned product or an environment, even in cases where the content is technically orphaned (bad content mapping)
  • Added a DB-level delete cascade on Content.modifiedProductIds, as JPA is inexplicably unwilling or unable to cascade a deletion on the parent entity to an element collection when using JPA bulk deletions
  • Removed some unused curator methods

@nikosmoum nikosmoum requested review from a team and JoshAlbrecht23 and removed request for a team November 14, 2022 20:57
@Ceiu
Copy link
Contributor Author

Ceiu commented Nov 14, 2022

There's a lot of churn in this change, but the heavy lifting is being done by the change in ProductManager and ContentManager by changing the pessimistic read lock to a pessimistic write lock. This effectively means only one content or product change can occur at a time via the API, but without a large restructuring of the model, we don't have many better options.

There's a new spec test in OwnerContentSpecTest which hits one of the ways this bug can arise, but the root cause seems to be making simultaneous modifications to one or more content entities which are owned by the same product. When performed, there's a race condition that can occur during the child remapping step on the owning product. Due to transaction isolation, the content changes (and thus, new content UUIDs) will not be visible to each other, so while both sets of content changes go through, the mapping to one of them gets clobbered. In many cases, this is mostly transparent (save for the "missing" content change due to the mapping error), but will become very apparent once the OrphanCleanupJob runs and removes the incorrectly mapped content; leaving the product with fewer content than expected.

The fix here is two-fold: the aforementioned locking changes, combined with enhancements to the OrphanCleanupJob to no longer remove content that is still mapped to a product or environment, even if that mapping is invalid. Additionally, the content removal has been changed to be performed in bulk, which may improve performance in some cases.

@Ceiu Ceiu force-pushed the crog/m/update_versioned_entity_locking branch from 50a9340 to f375611 Compare November 15, 2022 12:55
@Ceiu Ceiu changed the title [M] API-level product and content modifications no longer occur in parallel (CANDLEPIN-440) [M] 1931913: API-level product/content updates no longer occur in parallel (CANDLEPIN-440) Nov 15, 2022
Copy link
Contributor

@JoshAlbrecht23 JoshAlbrecht23 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Small doc changes and a print statement possibly left in the spec test

@Ceiu Ceiu force-pushed the crog/m/update_versioned_entity_locking branch from f375611 to cc92e63 Compare November 15, 2022 20:41
@Ceiu
Copy link
Contributor Author

Ceiu commented Nov 15, 2022

These test failures have me concerned. Going to flip this to a draft and take a look at them to make sure they're not related.

@Ceiu Ceiu marked this pull request as draft November 15, 2022 23:19
@Ceiu Ceiu marked this pull request as ready for review November 16, 2022 15:21
@nikosmoum nikosmoum requested a review from a team November 16, 2022 15:21
@Ceiu
Copy link
Contributor Author

Ceiu commented Nov 16, 2022

Look to be sporadic failures. Cannot reproduce locally at all, or reliably on Jenkins.

Copy link
Member

@nikosmoum nikosmoum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is rare, but I was able to reproduce this using the container locally by using the @RepeatedTest(100) annotation to run the shouldAutoSubscribePhysicalSystemsWithQuantity2PerSocketPair 100 times.
Here is a shortened version of the stacktrace (I have the full one if you need):

2022-11-17 13:11:14,038 [thread=http-bio-8443-exec-11] [req=da6d7d5d-54e8-40a5-923d-6955339573ab, org=, csid=] ERROR org.candlepin.exceptions.mappers.CandlepinExceptionMapper - Runtime Error could not extract ResultSet at org.mariadb.jdbc.export.ExceptionFactory.createException:296
org.hibernate.exception.LockAcquisitionException: could not extract ResultSet
        at org.hibernate.dialect.MySQLDialect$3.convert(MySQLDialect.java:562)
        at org.hibernate.exception.internal.StandardSQLExceptionConverter.convert(StandardSQLExceptionConverter.java:37)
...
        at org.hibernate.query.internal.AbstractProducedQuery.getSingleResult(AbstractProducedQuery.java:1665)
        at org.candlepin.model.AbstractHibernateCurator.getSystemLock(AbstractHibernateCurator.java:1339)
        at org.candlepin.model.AbstractHibernateCurator.getSystemLock(AbstractHibernateCurator.java:1343)
        at org.candlepin.controller.ProductManager.createProduct(ProductManager.java:232)
        at com.google.inject.persist.jpa.JpaLocalTxnInterceptor.invoke(JpaLocalTxnInterceptor.java:56)
        at org.candlepin.resource.OwnerProductResource.createProductByOwner(OwnerProductResource.java:164)
...
Caused by: java.sql.SQLTransactionRollbackException: (conn=28) Deadlock found when trying to get lock; try restarting transaction
...

It is concerning that it is a deadlock

@nikosmoum
Copy link
Member

Interestingly enough: This seems to ONLY happen on the first 6 times the test runs after I bring up the container/candlepin. It is as if 6 of the repeated runs are run in parallel, and they all deadlock each other. No matter how many hundreds of times I run the test again, it won't fail until the container/candlepin is started fresh.

@Ceiu Ceiu marked this pull request as draft November 17, 2022 14:27
- Changed the ProductManager and ContentManager to obtain pessimistic
  write locks before making any change or removal of products or content
- Moved several org-less content and product queries from the
  OwnerContentCurator or OwnerProductCurator to the ContentCurator or
  ProductCurator as appropriate
- Changed the output from several curator methods from lists of tuples
  to maps of collections to better convey exactly what was being
  returned and to make it easier to follow during analysis
- The OrphanCleanupJob no longer removes content that is referenced
  by a non-orphaned product or an environment, even in cases where the
  content is technically orphaned (bad content mapping)
- Added a DB-level delete cascade on Content.modifiedProductIds, as
  JPA is inexplicably unwilling or unable to cascade a deletion on
  the parent entity to an element collection when using JPA bulk
  deletions
- Removed some unused curator methods
@Ceiu Ceiu force-pushed the crog/m/update_versioned_entity_locking branch from cc92e63 to d30980f Compare November 17, 2022 18:56
@Ceiu
Copy link
Contributor Author

Ceiu commented Nov 17, 2022

@nikosmoum this is related to having an empty database and needing to create the system locks on demand. For some reason, this is a potential area of deadlock. I'll look deeper into it

If we prime the DB with those values, we don't get into that situation. The update adds a liquibase update to add them to the DB if they don't already exist.

@Ceiu Ceiu marked this pull request as ready for review November 17, 2022 18:59
@nikosmoum nikosmoum requested a review from a team November 17, 2022 18:59
@JoshAlbrecht23 JoshAlbrecht23 merged commit e981b71 into master Nov 17, 2022
2 checks passed
@JoshAlbrecht23 JoshAlbrecht23 deleted the crog/m/update_versioned_entity_locking branch November 17, 2022 20:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants