Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bugfix: Fix pyfunc ensembler undeployment bug #331

Conversation

deadlycoconuts
Copy link
Contributor

@deadlycoconuts deadlycoconuts commented Apr 12, 2023

Context

After the Gorm library has been updated to v2 in #261, there is a bug discovered in how we're using the Gorm methods to save router versions to the DB whereby nested fields (such as the ensembler configs) do not get updated in their respective tables. This has caused deployed pyfunc ensemblers to not have their corresponding docker configs updated in the ensembler_configs table. Since these are left as NULL, what then happens downstream is that the cluster controller fails to build KnativeService objects that correspond to the deployed ensembler instances, causing it to be unable to deploy ensemblers that have been deployed (either as part of a router update operation or an undeployment operation).

This PR introduces a config to the Gorm DB layer which adds a new field FullSaveAssociations: true to the existing Gorm DB configs, which would ensure that all tables get updated whenever Gorm is used to update a router version (or any other object for that matter of fact). See go-gorm/gorm#6162 for more details.

This PR introduces additional steps to update the ensembler and enricher configs of a router version (these are stored in their respective tables) and were previously not getting updated whenever a router version is updated:

err = tx.Save(routerVersion).Error
if err != nil {
	return nil, err
}

if routerVersion.Enricher != nil {
	err = tx.Save(routerVersion.Enricher).Error
	if err != nil {
		return nil, err
	}
}
if routerVersion.Ensembler != nil {
	err = tx.Save(routerVersion.Ensembler).Error
	if err != nil {
		return nil, err
	}
}

Totally Unrelated Changes - Unused Arguments Clean Up

The linter's raising a lot of warnings about unused arguments so I've decided to remove them completely wherever possible, which is the case for the methods like NewRouterEndpoint, NewEnricherService and NewEnsemblerService since the existing implementations of the the clusterSvcBuilder don't even use them. And since these methods are concrete implementations of the interface ClusterServiceBuilder's methods, I've also removed the arguments from them too, which then requires some minor refactoring of the methods of mockClusterServiceBuilder (which also implements the interface) and the tests where it is used.

Most notably, I've removed all instances of the following:

			Labels: map[string]string{
				"env": envType,
			},

that's specified in the expected and mocked responses (in api/turing/service/router_deployment_service_test.go) since in no way is the actual implementation of the clusterSvcBuilder ever using the envType argument to set labels in the KnativeService struct returned. It's weird to continue mocking responses that set those labels and then configure the expected responses to include those labels.

In all other cases (the majority of them), I've simply renamed the remaining unused arguments to _ so those should be pretty easy to look through.

@deadlycoconuts deadlycoconuts added the type: bug Something isn't working label Apr 12, 2023
@deadlycoconuts deadlycoconuts self-assigned this Apr 12, 2023
@deadlycoconuts deadlycoconuts marked this pull request as ready for review April 12, 2023 08:22
@deadlycoconuts deadlycoconuts requested review from a team and ariefrahmansyah April 12, 2023 08:22
Comment on lines -87 to -89
Labels: map[string]string{
"env": envType,
},
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should no longer be here the actual clusterServiceBuilder does not even use an argument envType to set the labels.

Comment on lines -274 to -276
Labels: map[string]string{
"env": envType,
},
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed these from the expected responses since we were just 'expecting' a mocked response though in reality the actual method doesn't even actually set the envType as the env label.

Copy link
Collaborator

@krithika369 krithika369 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for tracking this down, @deadlycoconuts ! Left some questions.

Endpoint: PyFuncEnsemblerServiceEndpoint,
Port: PyFuncEnsemblerServicePort,
Env: pyfuncConfig.Env,
ServiceAccount: "",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: If this is going to be an empty value, we can just leave this field unset.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for pointing that out, I've just removed that :)

api/turing/service/router_deployment_service.go Outdated Show resolved Hide resolved
@deadlycoconuts deadlycoconuts force-pushed the fix_pyfunc_ensembler_undeployment_bug branch from ae645a6 to 8a71e04 Compare April 13, 2023 09:49
@deadlycoconuts deadlycoconuts force-pushed the fix_pyfunc_ensembler_undeployment_bug branch 2 times, most recently from 4404564 to 973bce2 Compare April 14, 2023 03:06
@deadlycoconuts deadlycoconuts force-pushed the fix_pyfunc_ensembler_undeployment_bug branch 3 times, most recently from 239fdac to dbfdac5 Compare April 17, 2023 08:09
@deadlycoconuts deadlycoconuts force-pushed the fix_pyfunc_ensembler_undeployment_bug branch from dbfdac5 to 64c2cbf Compare April 17, 2023 09:07
@@ -184,7 +184,6 @@ var _ = DeployedRouterContext("testdata/create_router_nop_logger_proprietary_exp

Eventually(func(g Gomega) {
router = api.GetRouter(apiE, routerCtx.ProjectID, routerCtx.ID)
// TODO: Why is it pending? - g.Expect(router.Value("status").Raw()).To(Equal(api.Status.Pending))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed this as it is no longer needed (I agree that we shouldn't be expecting the pending state).

),
endpoint="anything",
endpoint="ensemble",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made these changes to the SDK tests so these configs resemble the test configs here: https://github.com/caraml-dev/turing/blob/main/api/e2e/test/testdata/create_router_nop_logger_proprietary_exp.json.tmpl.

Comment on lines +21 to +27
# check that the status of the router is pending
try:
router.wait_for_status(RouterStatus.PENDING)
except TimeoutError:
raise Exception(
f"Turing API is taking too long for router {router.id} to start getting deployed."
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added this block here in order to verify that the router enters a pending state before being deployed. This is similar to the API e2e test here: https://github.com/caraml-dev/turing/blob/main/api/e2e/test/router_e2e_test.go#L236.

Copy link
Collaborator

@krithika369 krithika369 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks for sorting this out and all the test case analysis, @deadlycoconuts !

@deadlycoconuts deadlycoconuts merged commit 6c7b8be into caraml-dev:main Apr 18, 2023
@deadlycoconuts deadlycoconuts deleted the fix_pyfunc_ensembler_undeployment_bug branch June 6, 2023 07:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type: bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants