feat(hc): Use hybrid cloud services in AuthIdentityHandler #43035
Conversation
These changes are targeted toward getting all AuthHelper and AuthIdentityHandler code to run with control silo limits. Most availability errors from AuthIdentityHandler are fixed, but we need more work around the Actor model and ApiInviteHelper to make it stable.
    request: HttpRequest
    identity: Mapping[str, Any]

    def __post_init__(self) -> None:
        # For debugging. TODO: Remove when tests are stable
        if not isinstance(self.organization, ApiOrganization):
👍
You could also make this conditional to tests by

import sys

if 'pytest' in sys.modules:
    ...

which means you can safely leave it in.
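The suggestion above could look like this in the dataclass (a sketch with stand-in types; `ApiOrganization` and the field come from the PR's diff, while `AuthIdentityParams` is a hypothetical name for the dataclass being discussed):

```python
import sys
from dataclasses import dataclass


@dataclass
class ApiOrganization:
    """Stand-in for the real hybrid-cloud organization model."""
    id: int


@dataclass
class AuthIdentityParams:
    """Hypothetical holder mirroring the dataclass in the diff."""
    organization: ApiOrganization

    def __post_init__(self) -> None:
        # Debug-only type check, active only when pytest has been imported.
        if "pytest" in sys.modules:
            if not isinstance(self.organization, ApiOrganization):
                raise TypeError("organization must be an ApiOrganization")
```

Since `sys.modules` only contains `pytest` when the test runner has imported it, production requests never pay for the `isinstance` check.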
        self, *, metadata: AuditLogMetadata, organization_member: ApiOrganizationMember
    ) -> None:
        user_id = organization_member.user_id
        assert user_id is not None
Hmmm... I think this is wrong. Just because a user does not exist does not mean we cannot log something about their id. Many AuditLogEntry invocations come from async processes, which technically can have in hand a user_id for a user that has already been deleted. At the moment, we use cascade logic to set_null those user_ids, but in the future an async process will provide that.
I think the correct thing is to take the user_id provided "as is", without validating its deletion, and allow the eventual consistency of the tombstone/outbox pattern I'm working on to handle the rest.
So basically, we still want audit log entries even when a user no longer exists, and we'll handle the set-null at a different level.
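In code, dropping the assertion means recording whatever id the member carries, deleted user or not (a sketch with stand-in types; the real AuditLogEntry fields are not shown in this thread):

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class ApiOrganizationMember:
    """Minimal stand-in for the hybrid-cloud member model."""
    user_id: Optional[int]


def audit_payload(member: ApiOrganizationMember) -> Dict[str, Any]:
    # Take user_id "as is": even a deleted user's id is worth recording,
    # and the tombstone/outbox machinery can null it out later.
    return {"actor_user_id": member.user_id}
```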
Sounds good. I would guess that, when I dropped in that assertion, I was thinking more about getting Mypy to pass than about the correct behavior for audit logs.
It also wouldn't surprise me if, before my changes, the rare case of a null user would have simply caused an error, and it's just more obvious now.
from sentry.services.hybrid_cloud.user import APIUser, user_service


class DatabaseBackedAuditLogService(AuditLogService):
A high-level concern: I actually already implemented an AuditLogEntry service:

def save_or_write_to_kafka(self):
    """
    Region Silos do not have access to the AuditLogEntry table which is specific to the control silo.
    For those silos, this method publishes the attempted audit log write to a durable kafka queue synchronously
    that will eventually be consumed by the control silo. For the control silo, this method ultimately results
    in a save() call.
    """
    from sentry.region_to_control.producer import audit_log_entry_service

    audit_log_entry_service.produce_audit_log_entry(self)
Which is hooked up to some existing call sites but isn't running in production. However, it is currently used in a handful of tests.
One major difference is that in my original design, AuditLogs were written to Kafka and asynchronously written to the database. I do think we want to move away from Kafka, but I wonder if using Outbox to synchronize these might also be a strong option. It would improve the latency and availability of these code paths, since writing an AuditLog is never a synchronously required action.
Strictly speaking, such work is not necessary for this PR (let's get the auth login work across the line), but when the login work is wrapping up, let's think about how we can converge on a single audit log service, and whether we think it should remain synchronous or not.
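The outbox alternative could be sketched like this (all names here are hypothetical; Sentry's actual Outbox model is not shown in this thread): the region silo appends the audit payload to a local outbox table in the same transaction as the business write, and a background drain later delivers it to the control silo.

```python
import json
from typing import Callable, Dict, List


class AuditLogOutbox:
    """Toy outbox: rows are written locally, then drained asynchronously."""

    def __init__(self) -> None:
        self.pending: List[str] = []

    def enqueue(self, payload: Dict) -> None:
        # Written in the same transaction as the business change, so the
        # audit record cannot be lost even if delivery is delayed.
        self.pending.append(json.dumps(payload))

    def drain(self, deliver: Callable[[Dict], None]) -> None:
        # Delivery is at-least-once; the control-silo consumer must be
        # idempotent (e.g. dedupe on an id embedded in the payload).
        while self.pending:
            deliver(json.loads(self.pending.pop(0)))


delivered: List[Dict] = []
outbox = AuditLogOutbox()
outbox.enqueue({"event": "SSO_IDENTITY_LINK", "target_object": 42})
outbox.drain(delivered.append)
```

This trades the cross-region Kafka dependency for a local table plus a drain loop, which is where the latency and availability win comes from.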
AuditLogs were written to kafka and asynchronously written to the database. I do think we want to move away from kafka, but I wonder if using Outbox to synchronize these might also be a strong option. It would improve latency and availability of these code paths since writing an AuditLog is never a synchronously required action.
Having audits be transported by outbox would be ideal imo. Not having to set up cross-region Kafka would be good, as it helps us improve resilience and reduces infrastructure complexity.
    @abc.abstractmethod
    def handle_new_membership(
        self,
        request: Request,
Hmm, my worry here is the requirement of the request object. Is there any way we can further unpack this? Because when this becomes an RPC invocation, we don't want to be serializing an entire request object.
It looks like we rely on two session values, invite_token and invite_member_id, and the request user.
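So the RPC-friendly version of the interface might accept just those scalars rather than the whole Request (a sketch; the keyword names mirror the session keys mentioned above, and MembershipService is a hypothetical name for the service interface):

```python
import abc
from typing import List, Optional, Tuple


class MembershipService(abc.ABC):
    @abc.abstractmethod
    def handle_new_membership(
        self,
        *,
        user_id: int,
        invite_token: Optional[str],
        invite_member_id: Optional[int],
    ) -> None:
        # Only serializable scalars cross the RPC boundary; the caller
        # unpacks request.user and request.session before invoking this.
        ...


# Demo implementation showing a call with unpacked values.
calls: List[Tuple] = []


class RecordingService(MembershipService):
    def handle_new_membership(self, *, user_id, invite_token, invite_member_id):
        calls.append((user_id, invite_token, invite_member_id))


RecordingService().handle_new_membership(
    user_id=1, invite_token="tok", invite_member_id=None
)
```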
Sounds good, but it would cascade some sizeable changes through ApiInviteHelper. How about as a separate PR?
Totally makes sense to do separately 👍
            flags.member_limit__restricted = True

        # Otherwise create a new membership
        om = organization_service.add_organization_member(
Cool. In the future, this will become simpler: we'll create an OrganizationMemberMapping ORM object locally on the control silo, and that will synchronize back. But this is sufficient for current-world compatibility.
Cool! I think, as you said at standup, starting by setting tests to stable=True and following the implications that surfaces is a great way to make progress. It's a very smart way to break up the complex work, and I can see where you're headed. I added a lot of feedback, but I want to emphasize that I agree with your approach and am grateful for breaking this down.
Resolves failures in test_auth_organization_login.
    def log_organization_membership(
        self,
        *,
        metadata: AuditLogMetadata,
        organization_member: ApiOrganizationMember,
    ) -> None:
        raise NotImplementedError
Are you thinking that there will be a method for each audit log that is recorded? We have quite a few audit log event types, which will make this interface wide.
We should be able to extend the existing AuditLogEvent plumbing code to be aware of required data attributes, which would let us have an interface like

audit_log_service.log(
    AuditLogMetaData(
        organization=self.organization,
        actor=self.user,
        ip_address=self.request.META["REMOTE_ADDR"],
        target_object=auth_identity.id,
        event=audit_log.get_event_id("SSO_IDENTITY_LINK"),
        auth_identity=auth_identity,
    )
)

With required parameters embedded into the event definition, AuditLogMetaData could ensure all the required attributes are present. This approach would have a narrower interface, but wouldn't be as type-sound as what you have presently.
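A minimal sketch of that idea (the registry, the field names, and this AuditLogMetaData shape are assumptions, not the real plumbing): the event definition declares its required attributes, and construction fails fast if any are missing.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Set

# Hypothetical registry mapping an event name to the attributes it requires.
REQUIRED_ATTRS: Dict[str, Set[str]] = {
    "SSO_IDENTITY_LINK": {"auth_identity"},
}


@dataclass
class AuditLogMetaData:
    event: str
    data: Dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Validate at construction time, keeping a single log() entry point
        # instead of one method per audit log event type.
        missing = REQUIRED_ATTRS.get(self.event, set()) - set(self.data)
        if missing:
            raise ValueError(f"{self.event} missing: {sorted(missing)}")
```

This keeps the interface narrow at the cost of the validation being a runtime check rather than something Mypy can verify, which is the type-soundness trade-off noted above.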
Regarding several comments on and near the writing of audit logs: I agree with your feedback, and everything around the audit logs was pretty much a hack job trying to get this integrated with minimal code movement outside of
These changes are targeted toward getting all AuthHelper and AuthIdentityHandler code to run with control silo limits. Most availability errors from AuthIdentityHandler are fixed, but we need more work around the Actor model and ApiInviteHelper to make it stable. Co-authored-by: Zach Collins <recursive.cookie.jar@gmail.com>