feat(hc): Use hybrid cloud services in AuthIdentityHandler #43035
Conversation
These changes are targeted toward getting all AuthHelper and AuthIdentityHandler code to run with control silo limits. Most availability errors from AuthIdentityHandler are fixed, but we need more work around the Actor model and ApiInviteHelper to make it stable.
    request: HttpRequest
    identity: Mapping[str, Any]

    def __post_init__(self) -> None:
        # For debugging. TODO: Remove when tests are stable
        if not isinstance(self.organization, ApiOrganization):
👍
You could also make this conditional to tests by

import sys

if 'pytest' in sys.modules:
    ...

which means you can safely leave it in.
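The suggestion above could look like this in the dataclass (a sketch with stand-in types; `ApiOrganization` and the field come from the PR's diff, while `AuthIdentityParams` is a hypothetical name for the dataclass being discussed):

```python
import sys
from dataclasses import dataclass


@dataclass
class ApiOrganization:
    """Stand-in for the real hybrid-cloud organization model."""
    id: int


@dataclass
class AuthIdentityParams:
    """Hypothetical holder mirroring the dataclass in the diff."""
    organization: ApiOrganization

    def __post_init__(self) -> None:
        # Debug-only type check, active only when pytest has been imported.
        if "pytest" in sys.modules:
            if not isinstance(self.organization, ApiOrganization):
                raise TypeError("organization must be an ApiOrganization")
```

Since `sys.modules` only contains `pytest` when the test runner has imported it, production requests never pay for the `isinstance` check.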
        self, *, metadata: AuditLogMetadata, organization_member: ApiOrganizationMember
    ) -> None:
        user_id = organization_member.user_id
        assert user_id is not None
Hmmm... I think this is wrong. Just because a user does not exist does not mean we cannot log something about their id. Many AuditLogEntry invocations come from async processes, which technically can have in hand a user_id for a user that has already been deleted. At the moment, we use cascade logic to set_null those user_ids, but in the future an async process will provide that.
I think the correct thing is to take the user_id provided "as is", without validating its deletion, and allow the eventual consistency of the tombstone/outbox pattern I'm working on to handle the rest.
So basically, we still want audit log entries even when a user no longer exists, and we'll handle the set-null at a different level.
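In code, dropping the assertion means recording whatever id the member carries, deleted user or not (a sketch with stand-in types; the real AuditLogEntry fields are not shown in this thread):

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class ApiOrganizationMember:
    """Minimal stand-in for the hybrid-cloud member model."""
    user_id: Optional[int]


def audit_payload(member: ApiOrganizationMember) -> Dict[str, Any]:
    # Take user_id "as is": even a deleted user's id is worth recording,
    # and the tombstone/outbox machinery can null it out later.
    return {"actor_user_id": member.user_id}
```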
Sounds good. I would guess that, when I dropped in that assertion, I was thinking more about getting Mypy to pass than about the correct behavior for audit logs.
It also wouldn't surprise me if, before my changes, the rare case of a null user would have simply caused an error, and it's just more obvious now.
from sentry.services.hybrid_cloud.user import APIUser, user_service


class DatabaseBackedAuditLogService(AuditLogService):
A high-level concern: I actually already implemented an AuditLogEntry service:

def save_or_write_to_kafka(self):
    """
    Region Silos do not have access to the AuditLogEntry table which is specific to the control silo.
    For those silos, this method publishes the attempted audit log write to a durable kafka queue synchronously
    that will eventually be consumed by the control silo. For the control silo, this method ultimately results
    in a save() call.
    """
    from sentry.region_to_control.producer import audit_log_entry_service

    audit_log_entry_service.produce_audit_log_entry(self)
Which is hooked up to some existing call sites but isn't running in production. However, it is currently used in a handful of tests.
One major difference is that in my original design, AuditLogs were written to Kafka and asynchronously written to the database. I do think we want to move away from Kafka, but I wonder if using Outbox to synchronize these might also be a strong option. It would improve the latency and availability of these code paths, since writing an AuditLog is never a synchronously required action.
Strictly speaking, such work is not necessary for this PR (let's get the auth login work across the line), but when the login work is wrapping up, let's think about how we can converge on a single audit log service, and whether we think it should remain synchronous or not.
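The outbox alternative could be sketched like this (all names here are hypothetical; Sentry's actual Outbox model is not shown in this thread): the region silo appends the audit payload to a local outbox table in the same transaction as the business write, and a background drain later delivers it to the control silo.

```python
import json
from typing import Callable, Dict, List


class AuditLogOutbox:
    """Toy outbox: rows are written locally, then drained asynchronously."""

    def __init__(self) -> None:
        self.pending: List[str] = []

    def enqueue(self, payload: Dict) -> None:
        # Written in the same transaction as the business change, so the
        # audit record cannot be lost even if delivery is delayed.
        self.pending.append(json.dumps(payload))

    def drain(self, deliver: Callable[[Dict], None]) -> None:
        # Delivery is at-least-once; the control-silo consumer must be
        # idempotent (e.g. dedupe on an id embedded in the payload).
        while self.pending:
            deliver(json.loads(self.pending.pop(0)))


delivered: List[Dict] = []
outbox = AuditLogOutbox()
outbox.enqueue({"event": "SSO_IDENTITY_LINK", "target_object": 42})
outbox.drain(delivered.append)
```

This trades the cross-region Kafka dependency for a local table plus a drain loop, which is where the latency and availability win comes from.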
AuditLogs were written to kafka and asynchronously written to the database. I do think we want to move away from kafka, but I wonder if using Outbox to synchronize these might also be a strong option. It would improve latency and availability of these code paths since writing an AuditLog is never a synchronously required action.
Having audits be transported by outbox would be ideal imo. Not having to set up cross-region Kafka would be good, as it helps us improve resilience and reduces infrastructure complexity.
    @abc.abstractmethod
    def handle_new_membership(
        self,
        request: Request,
Hmm, my worry here is the requirement of the request object. Is there any way we can further unpack this? Because when this becomes an RPC invocation, we don't want to be serializing an entire request object.
It looks like we rely on two session values, invite_token and invite_member_id, and the request user.
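So the RPC-friendly version of the interface might accept just those scalars rather than the whole Request (a sketch; the keyword names mirror the session keys mentioned above, and MembershipService is a hypothetical name for the service interface):

```python
import abc
from typing import List, Optional, Tuple


class MembershipService(abc.ABC):
    @abc.abstractmethod
    def handle_new_membership(
        self,
        *,
        user_id: int,
        invite_token: Optional[str],
        invite_member_id: Optional[int],
    ) -> None:
        # Only serializable scalars cross the RPC boundary; the caller
        # unpacks request.user and request.session before invoking this.
        ...


# Demo implementation showing a call with unpacked values.
calls: List[Tuple] = []


class RecordingService(MembershipService):
    def handle_new_membership(self, *, user_id, invite_token, invite_member_id):
        calls.append((user_id, invite_token, invite_member_id))


RecordingService().handle_new_membership(
    user_id=1, invite_token="tok", invite_member_id=None
)
```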
Sounds good, but it would cascade some sizeable changes through ApiInviteHelper. How about as a separate PR?
Totally makes sense to do separately 👍
            flags.member_limit__restricted = True

        # Otherwise create a new membership
        om = organization_service.add_organization_member(
Cool. In the future, this will become simpler: we'll create an OrganizationMemberMapping ORM object locally on the control silo, and that will synchronize back. But this is sufficient for current-world compatibility.
Cool! I think, as you said at standup, starting by setting tests to stable=True and following the implications that surfaces is a great way to make progress. It's a very smart way to break up the complex work, and I can see where you're headed. I added a lot of feedback, but I want to emphasize that I agree with your approach and am grateful for breaking this down.
Resolves failures in test_auth_organization_login.
    def log_organization_membership(
        self,
        *,
        metadata: AuditLogMetadata,
        organization_member: ApiOrganizationMember,
    ) -> None:
        raise NotImplementedError
Are you thinking that there will be a method for each audit log that is recorded? We have quite a few audit log event types, which will make this interface wide.
We should be able to extend the existing AuditLogEvent plumbing code to be aware of required data attributes, which would let us have an interface like

audit_log_service.log(
    AuditLogMetaData(
        organization=self.organization,
        actor=self.user,
        ip_address=self.request.META["REMOTE_ADDR"],
        target_object=auth_identity.id,
        event=audit_log.get_event_id("SSO_IDENTITY_LINK"),
        auth_identity=auth_identity,
    )
)

With required parameters embedded into the event definition, AuditLogMetaData could ensure all the required attributes are present. This approach would have a narrower interface, but wouldn't be as type-sound as what you have presently.
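A minimal sketch of that idea (the registry, the field names, and this AuditLogMetaData shape are assumptions, not the real plumbing): the event definition declares its required attributes, and construction fails fast if any are missing.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Set

# Hypothetical registry mapping an event name to the attributes it requires.
REQUIRED_ATTRS: Dict[str, Set[str]] = {
    "SSO_IDENTITY_LINK": {"auth_identity"},
}


@dataclass
class AuditLogMetaData:
    event: str
    data: Dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Validate at construction time, keeping a single log() entry point
        # instead of one method per audit log event type.
        missing = REQUIRED_ATTRS.get(self.event, set()) - set(self.data)
        if missing:
            raise ValueError(f"{self.event} missing: {sorted(missing)}")
```

This keeps the interface narrow at the cost of the validation being a runtime check rather than something Mypy can verify, which is the type-soundness trade-off noted above.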
Regarding several comments on and near the writing of audit logs: I agree with your feedback, and everything around the audit logs was pretty much a hack job trying to get this integrated with minimal code movement outside of
These changes are targeted toward getting all AuthHelper and AuthIdentityHandler code to run with control silo limits. Most availability errors from AuthIdentityHandler are fixed, but we need more work around the Actor model and ApiInviteHelper to make it stable. Co-authored-by: Zach Collins <recursive.cookie.jar@gmail.com>