
New Sentry Architecture #2

Merged
merged 8 commits into from
Sep 28, 2022
Conversation

mitsuhiko
Member

@mitsuhiko mitsuhiko commented Jul 21, 2022

This is a living document with some of my collected thoughts about changes to Sentry's core architecture on a longer time horizon.

Rendered RFC


@fpacifici fpacifici left a comment


more comments to come ...

This will cause issues in the test suite, as the Sentry test suite currently creates events
regularly; pragmatic solutions for this need to be found.

### Remove Pickle in Rabbit


yes please

Are we also considering serialization formats other than JSON? I don't think we ever tried anything seriously, but it may be interesting to explore alternatives to improve parsing/serialization (mostly parsing).


### Remove Celery and RabbitMQ for Kafka

Our use of RabbitMQ is reaching the limits of what can be done with this system safely. If we hit


Do you have some references about this? When you say "hit the disk", are you talking about Rabbit, or the tasks themselves?
Rabbit has some advantages, specifically in terms of autoscaling. In Kafka you are bound to the number of partitions of your topic. Traffic is not redistributed across partitions when you add consumers to a consumer group; only partitions are reshuffled. So in order to autoscale and preserve a balanced load you need a number of partitions much larger than the number of consumers, which presents problems of its own, as it is very heavy on the broker and causes massive rebalancing.
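The partition-bound scaling constraint can be illustrated with a small sketch (plain Python with made-up numbers; Kafka's real assignors such as range or round-robin differ in details but share the constraint): each partition is owned by exactly one member of the consumer group, so consumers beyond the partition count sit idle.

```python
def assign_partitions(partitions: int, consumers: int) -> dict[int, list[int]]:
    """Toy partition assignment: every partition is owned by exactly one
    member of the consumer group, so adding consumers beyond the number
    of partitions leaves the extras with nothing to do."""
    assignment: dict[int, list[int]] = {c: [] for c in range(consumers)}
    for p in range(partitions):
        assignment[p % consumers].append(p)
    return assignment

# With 8 partitions and 12 consumers, four consumers own no partitions
# at all, so scaling the group up did not add any throughput.
idle = [c for c, owned in assign_partitions(8, 12).items() if not owned]
```

This is why a balanced autoscaling setup needs far more partitions than consumers, which in turn stresses the broker and makes rebalances heavier.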

Member Author


Today, if we ever hit disk on Rabbit, the whole thing slows to a crawl. I am unfortunately not able to find the last incident involving this as it predates the current incident process, but it's something we are generally aware of. We could probably operate Rabbit so that going to disk is an option, but today that does not appear to be something we can do.

Member


I wish we fully understood what is going on with RabbitMQ in that scenario and did tests that re-validate that understanding.

It should however still be possible to move this to a Kafka model where batches of tasks are redistributed
to slower and faster topics. For instance the processing of the task could be started and if it's not
finishing within a deadline (or the code already knows that this execution can't happen) it would be
dispatched to increasingly slower topics. With sufficient concurrency on the consumers this is probably


This is how we would do parallelism right in a scenario where we have different classes of traffic that do not have to be processed in strict order:
https://github.com/confluentinc/parallel-consumer
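A rough sketch of the deadline-based escalation the RFC describes (topic names, deadlines, and the `run`/`publish` callables are hypothetical, not an existing API): a task that misses its tier's deadline is re-published to the next, slower topic instead of blocking the fast one.

```python
# Hypothetical tiers and deadlines; the last tier has no deadline, so a
# task there is simply expected to eventually finish.
TIERS = ["tasks-fast", "tasks-slow", "tasks-very-slow"]
DEADLINES = {"tasks-fast": 1.0, "tasks-slow": 30.0, "tasks-very-slow": None}

def process_with_escalation(task, topic, run, publish):
    """run(task, deadline) returns True if the task finished within the
    deadline; publish(topic, task) re-enqueues it on a slower topic."""
    if run(task, DEADLINES[topic]):
        return "done"
    # Deadline missed (or known-unrunnable): escalate one tier down.
    next_topic = TIERS[TIERS.index(topic) + 1]
    publish(next_topic, task)
    return f"escalated to {next_topic}"
```

With enough concurrency per consumer, the fast topic stays responsive because slow work keeps moving out of it.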

the entire pipeline whereas a native event can spend up to 30 minutes or more in the pipeline in very
extreme cases (completely cold caches against a slow symbol server).

It should however still be possible to move this to a Kafka model where batches of tasks are redistributed


I feel the split between classes of traffic and the need to move away from Rabbit are orthogonal. Or am I missing something?
Multiple queues can be implemented in either system.

Member Author


Generally speaking, I think we just need to make a decision about how we feel about Rabbit vs. more Kafka. Everything else is downstream from that. If we no longer want Rabbit, we need to build some stuff to replicate the benefits we get from it with Kafka.


Do we need to, though?
I feel that Rabbit and Kafka solve two related problems but provide different functionality. Some problems can be solved by both, but not all of them.
Rabbit:

  • Autoscales well.
  • Cheaper?

Kafka:

  • Can provide in-order delivery guarantees (which Rabbit cannot provide).
  • Has better delivery guarantees (we can easily do both at-least-once and at-most-once).
  • It seems to scale massively, but partitioning makes autoscaling less effective.

So if we want to pick one over the other, it has to be Kafka, because of the functional requirements around delivery guarantees that Rabbit does not provide. That means we cannot do without Kafka.

I guess the question is whether we have real use cases where we use RabbitMQ that would be hard to replace with Kafka.

Member Author


From where I'm standing we have RabbitMQ for two reasons at the moment: one is that it does not have head-of-line blocking, which makes it a pretty trivial choice for parts of the processing library where we are not quite sure how long something takes. The second reason is that we built a lot of stuff on top of Celery, particularly the cron-based stuff.

The first reason is a real benefit, but it can also be achieved with Kafka. The second reason is just a historic thing, I believe. I'm not sure if we still have actual cost or scaling benefits with RabbitMQ, but that's something I would like to explore.

Member Author


Particularly on the RabbitMQ side, we currently operate with non-durable, federated queues.

Member


One concern I have is that Celery is a great abstraction for quick prototyping while Kafka is everything but. Moving every trivially low-scale task that doesn't have consistency requirements to Kafka is today much harder than creating a task that runs on worker-glob.

One thing we could do is implement a Celery broker on top of Kafka. This would make it easier to gather operational experience with running certain product features on Kafka vs RabbitMQ, and could significantly sweeten the deal if we decided we're going to ditch RabbitMQ entirely for whatever reason.

Member


Have we considered a managed cloud service for task queues instead of RabbitMQ? GCP Tasks or Pub/Sub? Here's a comparison.

I can confirm as well that scaling workers on RabbitMQ is simpler than on Kafka, due to the one-consumer-per-partition limit and scaling of consumer groups causing rebalancing.

Member


If we did want to do everything with Kafka, I agree with @fpacifici that the parallel consumer should be investigated for non-ordered workloads.

Member Author


I will say that I have no preference on whether we use RabbitMQ (or a hosted queue) or Kafka. I can in general live with both, but if we stay with RabbitMQ I would like to move us into a world where different codebases, and particularly other languages, can use the queue and not just the monolith.

text/0002-new-architecture.md (outdated; resolved)
This is generally quite inefficient, but we are also running into limitations in the workers. For instance,
we rely exclusively on memcache for caches, which means that we are bound by our cache size limits.
Anything above the cache size is not processable, which can cause a bad experience for users depending
on very large source maps.
Member


Having one way to apply symbols sounds like a good simplification.

Currently quite a bit of code lives in `getsentry` for reasons of assumed convenience. However this has created
the situation that it is quite easy to ship regressions because the changes in `getsentry` were not considered.
There is likely a whole range of code that does not need to live in `getsentry` and moving it to `sentry` would
make the testsuite more reliable, faster to run and catch more regressions early.
Member


I could see usage periods and usage collection being something that could move to sentry. We would keep invoicing and subscriptions in getsentry but the usage subsystem could be moved out. This could also help align product features like orgstats and dynamic sampling suggestions with the usage periods that we use for billing in saas more easily.

text/0002-new-architecture.md (outdated; resolved)
text/0002-new-architecture.md (outdated; resolved)
Comment on lines +190 to +196
The Sentry monolith currently houses a range of small services that are clustered together. For instance the
processing pipeline really could be separated entirely from the rest of the system, and the same is true for quite
a few other parts of the product. For instance, certain parts of the user experience are already sufficiently independent
in the UI code today that they could turn into a separate service that just supplies UI views together with some
APIs. While it's unclear if the existing experiences are worth splitting up into new services, it's likely that
we can at least structure their code so that we reduce the total amount of code that needs to be checked and
tested against changes.
Member


Is it worthwhile being explicit that service compartmentalization doesn't imply splitting the monolith into multiple repos and running them as independent processes? Similarly, compartmentalizing the UI doesn't mean 'micro-frontends' that are stitched together.

My understanding is that we want to compartmentalize build and test dependencies so that CI and code review can be done more efficiently; is that what you had in mind?

Member Author


Pretty much, yes. I think the discussion about single vs. multi repo is largely orthogonal to how things are organized. We have pretty bad tooling for both mono-repo and multi-repo setups at the moment, and in the absence of a clear direction I would personally keep the status quo for now.

We can still have a separate conversation later if we want to move things more into one repo or into multiple repos.


I feel we should clarify what we really want to achieve with a service infrastructure. I think different people have different ideas about the goal, and depending on the goal the solution is different.
I have heard about testability, isolating features to reduce risk, and isolating features to reduce deployment cost (which would require independent standalone services).

I would add another one, which is architectural sanity (when everything depends on everything there is no architecture).
Do we have other reasons?

Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that covers it.

(For me one hidden reason about the service discussion is that if we acknowledge that we already operate services, we would do a better job articulating the operational requirements. Today all services we create are one-off creations which makes operating the entirety outside of the primary sentry.io installation challenging. Though that applies less to the monolith than other things we operate but which we might separate from the monolith in the future.)

Comment on lines +203 to +207
The biggest limiting factor today in creating new services is the configuration of these services. When an engineer adds
a new service there is typically a followup PR against the internal ops repo to configure these services for the main
sentry installation. Then there is some sort of ad-hoc workaround to make the whole thing work in local development,
on a single tenant, or on a self-hosted installation. Each new service undergoes a new phase of rediscovering how
suboptimal the rules and policies around scaling these services are.
Member


Are we overloading 'services'? Do we need a different term for the processes that a product service needs in order to operate?

Member Author


Agreed we should probably be more explicit about what those things are.


Semantic nit, maybe: should we start talking about domain boundaries and bounded contexts (from domain-driven design) instead of services?
Defining bounded contexts is a logical partitioning of the data model (which requires having a coherent data model to begin with). That then implies clear dependencies between features/components, as your component cannot directly access the part of the data model that belongs to another bounded context.
Having clear dependencies then allows us to isolate what we call services today, whether or not we split the code base.

Member


Should we start talking about domain boundaries and bounded contexts (from domain driven design) instead of services?
Defining bounded contexts is a logical partition of the data model (which requires having a coherent data model to begin with).

Totally agree that separating 'services', within the monolith or outside of it, should start with bounded contexts and with what consistency guarantees we need within each context.

Member Author


For the service separation within the monolith, I'm trying to work out something sensible as a near-term roadmap.

text/0002-new-architecture.md (outdated; resolved)
mitsuhiko and others added 4 commits August 21, 2022 16:42
Co-authored-by: Mark Story <mark@mark-story.com>
Co-authored-by: Mark Story <mark@mark-story.com>
Co-authored-by: Mark Story <mark@mark-story.com>
Co-authored-by: Mark Story <mark@mark-story.com>
Comment on lines +13 to +15
decisions have to be made about the future of our codebase. As we are talking about creating
more Sentry deployments in different regions and for smaller clusters, some of these problems
are becoming more pressing as they create complexity for each added installation.

@fpacifici fpacifici Aug 21, 2022


I think there is an implication for our architecture coming from the idea of multiple smaller clusters that we are not covering here: operation and monitoring.
Essentially the focus moves from the ability to run large scalable clusters to efficiently operating many clusters, where each has smaller scalability issues.
There are two ways to address this problem (and probably both have to be pursued together).

  • Make the architecture simpler, with fewer components or fewer types of components to be deployed. This could drive us to consolidate components like multiple Snuba consumers into fewer deployment units.
  • Massively invest in automation so that managing multiple clusters (adding components, tuning components) becomes no harder than managing one. Example: we cannot be scaling individual Kafka consumer clusters up and down by hand across 10 regions, and we cannot have to add a new Kafka topic manually to each region.

I think the above should be a requirement for the new Sentry architecture.
Maybe this should be a workstream of its own.


Comment on lines +131 to +132
Within Relay some changes are likely to be necessary to continue providing a stable service. In
particular, as more and more work moves into Relay, certain risks that the current architecture is carrying need to


Is it a good idea that more and more work keeps moving into Relay?
I see a lot of work that cannot be anywhere else, like bucketing metrics and processing dynamic sampling, as we want customer relays to be able to drop transactions.
But the more logic we put in Relay without a split between infra code and product code, the harder it is to keep it stable.
Product logic that maybe should not be there includes:

  • extracting span details (for the suspect spans feature)
  • various replay-specific processing logic
  • a lot of event-specific normalization (the more envelope types we add, the more product logic will end up in Relay to normalize the data structures)
  • ...

Is there something that prevents us from making Relay simpler and moving all product logic into the specific data category ingestion pipelines, to further isolate them?
Also, did a bug specific to one data category's processing ever cause an incident in Relay that impacted others?

Member Author


Is it a good idea that more and more work keeps moving into Relay?

Unclear, but I have been talking with @jan-auer already about the idea that, if we keep doing this, we might want to split Relay into two pieces connected through a persistent queue. And once we do something like that we can also modularize it. Metrics aggregation I'm conflicted about, but in particular there is a lot of longer-term reason to attempt symbolication before something hits the innermost parts of the pipeline, particularly now that we have a lot more mobile/native traffic.

Is there something that prevents us from making Relay simpler and moving all product logic into the specific data category ingestion pipeline to further isolate them?

Anything we need to know ahead of dynamic sampling needs to be expanded/processed by Relay.

Member


This needs a more concrete plan. We've been talking for a long time about doing symbolication in Relay but so far it's been a pie-in-the-sky idea.

I think the problems this pipeline-reordering brings up are mostly impossible to solve. I have a rough idea on how to do grouping-aware filtering in some cases, but I don't see how we can stackwalk before dynamic sampling without any access to our massive internal system symbols, for example.

Comment on lines +139 to +140
layers. For instance we can achieve much better aggregations by ensuring that data from related metrics keys are
forwarded to the same processing relays.


I think there are three types of traffic routing problems we are facing today, though they are really instances of the same problem, and we could have one system to solve all three.

  1. Traffic affinity, which is what you mentioned.
  2. Partitioning by organization to run parallel slices of the architecture. Everything based on Kafka does not scale indefinitely horizontally on one topic. We generally first scale by adding partitions, then by adding parallelism on the consumer, but at some point (above 64 partitions) we break the traffic into independent topics and partition organizations across topics. As we scale further, this last scenario will happen more often.
  3. Smart semantic partitioning. Besides Relay metrics aggregation, there are a number of Kafka consumers that become more efficient if traffic is semantically partitioned, for example if an organization is, tentatively, only present in a small number of partitions. The gain is cache locality.

I think the problem is always the same:

  • map an organization to a logical shard;
  • have a component (which can even be a service) that dynamically or statically routes a logical shard to a physical resource (a Relay node, a Kafka topic or partition, etc.);
  • have a feedback loop for dynamic traffic routing that observes the load on the physical resource and informs the component that performs the routing.

I think there is some value in building this as a Relay-agnostic system that works based on an abstract concept of partition and physical resource.
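The steps above could be sketched like this (shard count, topic names, and the static routing table are made-up illustrations; in a real system the table would be rewritten by the load feedback loop rather than fixed):

```python
import hashlib

LOGICAL_SHARDS = 256  # fixed, so the org -> logical shard mapping stays stable

def logical_shard(org_id: int) -> int:
    # Step 1: stable hash of the organization onto a logical shard.
    digest = hashlib.sha256(str(org_id).encode()).digest()
    return int.from_bytes(digest[:4], "big") % LOGICAL_SHARDS

# Step 2: routing table from logical shard to physical resource (a Kafka
# topic here, but it could equally be a partition or a pool of Relay nodes).
routing_table = {s: f"ingest-events-{s % 4}" for s in range(LOGICAL_SHARDS)}

def route(org_id: int) -> str:
    # Step 3 (not shown): a feedback loop observing load per physical
    # resource would update routing_table dynamically.
    return routing_table[logical_shard(org_id)]
```

Because the org-to-shard mapping is stable, moving load around only requires remapping logical shards to resources, never re-hashing organizations.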

I wrote down some of that in the comments here
https://www.notion.so/sentry/Traffic-Steering-f2244ea8c40d405ebca998fe8dc6fd05
This is the scenario in the case 2 above https://www.notion.so/sentry/Traffic-routing-between-shards-d5f15d5e419f4dc5aac67c1c1140fa98

Member


I agree with Filippo, and I wish we would start building this service sooner rather than later. Our current initiatives don't have a common interface to code against, and since multiple teams are working on sharding/slicing/hybrid-cloud solutions, I fear there will be some avoidable fragmentation in approaches.



more Sentry deployments in different regions and for smaller clusters, some of these problems
are becoming more pressing as they create complexity for each added installation.

# Expected Workstreams


I think we may have another workstream to deal with.
We have a few independent code bases today (a few independent services) and likely, in the future, we will have more. But we do not have a foundational set of Sentry services and libraries that solve common problems for all of these code bases, so we end up rebuilding things.
There are a number of common components I am thinking about:

  • Rate limiting libraries (a number of rate-limiting policies; today we literally reimplemented this in Snuba and Sentry).
  • A runtime config flags service + SDK with an admin UI usable by everybody. We have 3 systems now, none of which is good.
  • A shared settings system across services.
  • A traffic routing SDK.
  • A streaming platform that allows product teams to run their product logic without caring too much about the deployment of a Kafka consumer. If we move more towards Kafka to replace Rabbit, we will have a lot more consumers running product logic.
  • Postgres models (we will certainly not allow different services to share a Postgres instance, but different services could each access their own Postgres instance and take advantage of reusing the Sentry abstractions).
  • ...
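As an illustration of the first bullet, here is a minimal token-bucket limiter of the kind such a shared library could provide; the class and its parameters are hypothetical, not an existing Sentry API.

```python
import time

class TokenBucket:
    """Minimal token-bucket rate limiter. A shared foundation library
    could provide a family of these policies instead of each service
    (Snuba, Sentry, ...) reimplementing them."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate            # tokens refilled per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        # Refill based on elapsed time, then try to spend `cost` tokens.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A shared library would also need a distributed backing store (e.g. Redis) for cross-process limits; the in-memory version only shows the policy itself.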

Comment on lines +231 to +232
However it's likely that we will be unable to reduce cardinality in the long run and the data model should ideally
be able to represent this high cardinality data.


Is supporting high cardinality metrics the first problem to solve?
The product fundamentally relies on the assumption that real-world data is low cardinality (at least for what concerns transaction names). The product concept falls apart if that is not true, i.e. if you cannot group by transaction names effectively, because the whole product experience is based on aggregating by transaction names.

So, knowing that real-world data is high cardinality, shouldn't we first decide whether we change the product assumption (and then build a high cardinality storage), or whether we keep the product assumption and fix data cardinality instead?

Member

@untitaker untitaker Aug 30, 2022


@fpacifici I think you're mostly thinking about transaction names, but people are also wishing for significantly more tags on performance metrics AND release health, tags the product would be able to deal with very well but the infrastructure would not.

I think it would help to note down some concrete examples here of product features or queries that we would want to enable with this.

Member


Some examples of high cardinality data outside of URLs:

  • User / Customer ID
  • Shopping Cart ID
  • Request ID
  • IP address

scale is quite tricky and in some situations we need to needlessly spawn more processes and containers even though they
could be colocated.

## Data Store


A few other ideas about data store improvements:

  • The next generation of our ClickHouse fleet, which should support dynamic schemas, elasticity, separation of storage and compute, and self service. I had some discussion with @nikhars about this.
  • An intermediate high-volume storage between Postgres (for consistency) and ClickHouse (for scale). We should identify which cases fit here, and specifically whether we should revisit the main storage for our issue data model.


# Expected Workstreams

This captures all expected streams of work that are in the scope of overhauling Sentry's


Another aspect we may want to capture regarding the new architecture: the concept of ownership and self service. We have seen that we cannot live anymore in a world where ops and SnS own everything and that product teams need to take ownership of their system, which can include application code in the monolith but also systems (like kafka consumers or storage instances). This also includes being accountable for the resources they use so that they can take responsibility for decisions between a short term non-scalable solution to deliver faster and a long term scalable solution.

I think there are two fundamental ways to approach the problem that generate different architectures:

  • physically segregate systems by ownership: product teams would have full-stack ownership of their resources, including k8s deployments, ClickHouse clusters, and k8s namespaces.
  • a mixed model where the product owns application logic running on a serverless infrastructure managed by infra teams.

For Snuba we are going towards the second model. I am outlining it here: https://www.notion.so/sentry/Improve-product-development-speed-92c5b9f06a6744288454437f5a8c3db6

Though this problem will apply to Sentry as well.

We still use pickle as serialization format for Celery which means that it's not possible for code
outside of the Sentry monolith to dispatch tasks. This is both an issue for the separation of
`sentry` and `sentry_pipeline` (as import paths are changing) as well as it causes problems for
Member

@untitaker untitaker Aug 30, 2022

While I absolutely support moving away from pickle, I would challenge that this is a blocker for renaming sentry to sentry_pipeline. Celery tasks are explicitly named using the instrumented_task decorator, and while the names usually correspond to the import path, you can move tasks around and rename modules without breaking task producers. We already did that successfully.

The only place where I think actual import paths are serialized for Celery today is when a complex object is pickled as a task argument. There is prior art to restricting what objects may be pickled in load-it-up.py. A custom pickle serializer wired up to Celery that warns if a non-JSON-serializable object is being serialized should be step 0 before moving to a different serialization format. Ideally we'd be doing the same for org/project/global options.

This step 0 will require a lot of coordination and followup tasks for many product areas as I am fairly sure we pickle complex objects in both task arguments and postgres
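A minimal sketch of that step 0, assuming nothing about Sentry's actual serializer wiring (the function name `checked_dumps` and the warning text are made up): keep pickle on the wire, but emit a warning whenever a task payload would not survive a JSON round trip.

```python
import json
import pickle
import warnings


def checked_dumps(obj):
    """Serialize with pickle, but warn if the payload is not JSON-safe.

    The idea: the wire format stays unchanged, while every task argument
    that would break a future JSON migration gets surfaced as a warning.
    """
    try:
        json.dumps(obj)
    except (TypeError, ValueError):
        warnings.warn(
            "task payload is not JSON-serializable: %r" % type(obj),
            stacklevel=2,
        )
    return pickle.dumps(obj)
```

In practice this would be registered as the Celery task serializer (Celery serializers are wired up through kombu's `serialization.register`); routing the warnings into Sentry itself would yield a per-task inventory of non-JSON-safe arguments to drive the follow-up work.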

Member Author

The issue is not the naming of the task but the arguments. If I today want to dispatch a task containing a model from outside the sentry codebase, I need to create a message that pickles into a django model from within the sentry codebase. The only realistic and portable way to do this is to import the monolith, create the instance and dispatch the task that way.
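One way out, sketched here with hypothetical payload and helper names, is to make the contract scalar-only: external producers enqueue plain identifiers, and only the worker (inside the monolith) refetches the model.

```python
import json


def build_payload(group_id: int, project_id: int) -> str:
    """Producer side: any external service can emit this without
    importing the monolith, because it contains only scalars."""
    return json.dumps({"group_id": group_id, "project_id": project_id})


def handle(payload: str, fetch_group):
    """Worker side (inside the monolith): refetch the model by primary
    key. `fetch_group` stands in for e.g. Group.objects.get(id=...)."""
    args = json.loads(payload)
    return fetch_group(args["group_id"])
```

The cost is an extra database read per task, but the producer no longer needs to construct a picklable Django model instance at all.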

Member Author

The non-JSON-safe detection code, by the way, is already in Sentry. The issue is that tasks are still passing these complex objects in practice. That's a thing we need to start addressing before we can even entertain the idea of moving off pickle.


### Separation of Traffic Classes

Relay currently forwards all envelopes to the next relay in the chain in the same way. There are, however, benefits to being able
Member

@untitaker untitaker Aug 30, 2022

open question:

  • Where should this split occur? I assume the simplest answer is to do the routing in PoPs and only separate processing relays, but could it make sense to do it in customer relays already, so that PoPs are also segmented?
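Wherever the split happens, the routing step itself is small. A sketch of the idea (the item-type names and the two traffic classes are illustrative, not Relay's actual taxonomy):

```python
from collections import defaultdict

# Hypothetical classification: envelopes carrying error events or
# sessions go to a latency-sensitive path, everything else to bulk.
HIGH_PRIORITY = {"event", "session"}


def traffic_class(item_types):
    """Classify an envelope by the item types it contains."""
    if any(t in HIGH_PRIORITY for t in item_types):
        return "realtime"
    return "bulk"


def route(envelopes):
    """Partition envelopes so that bulk traffic (attachments, replays)
    cannot starve the error-event path behind it."""
    queues = defaultdict(list)
    for item_types in envelopes:
        queues[traffic_class(item_types)].append(item_types)
    return queues
```

The same partitioning works regardless of whether the split happens in PoPs or in customer relays; only the place where `route` runs changes.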


### Serverless Function Declarations for Consumers

The smallest unit of execution at Sentry is not a docker image but in fact a function. We currently have no real serverless
Member

Related to the RabbitMQ vs. Kafka discussion, but we already have this abstraction in Sentry: a Celery task.

We can consider two almost orthogonal things here:

  1. whether we want to continue using Celery as our task abstraction, or build something simpler and more restrictive in-house
  2. what the backing queue should be (RabbitMQ, Kafka, something else)
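Option 1 could be quite small. A sketch of a deliberately restrictive in-house abstraction (all names here are hypothetical): tasks are addressed by an explicit stable name rather than an import path, and dispatch rejects anything that is not JSON-safe, so producers outside the monolith can enqueue work freely.

```python
import json
from typing import Callable, Dict

_REGISTRY: Dict[str, Callable] = {}


def task(name: str):
    """Register a function under a stable name, decoupled from its
    import path, so modules can move without breaking producers."""
    def decorator(fn):
        _REGISTRY[name] = fn
        return fn
    return decorator


def dispatch(name: str, **kwargs) -> bytes:
    """Build a wire message; json.dumps raises TypeError for anything
    that would not survive a JSON round trip (no pickle escape hatch)."""
    return json.dumps({"task": name, "kwargs": kwargs}).encode()


def execute(payload: bytes):
    """Consumer side: look up the task by name and invoke it."""
    msg = json.loads(payload)
    return _REGISTRY[msg["task"]](**msg["kwargs"])


@task("events.cleanup")
def cleanup(project_id: int):
    return f"cleaned {project_id}"
```

The backing queue (question 2) then becomes an implementation detail behind `dispatch`: the same message bytes could go to RabbitMQ, a Kafka topic, or anything else.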

handling of moving such code into environments where other code wants to access it. We already require depickling
in the data pipeline for such cases.

### Phase out Buffers where Possible
Member

@untitaker untitaker Aug 30, 2022

Honestly, I think buffers are conceptually a good idea to keep. They provide batching for a subset of operations in your task/function without forcing you to give up isolation of tasks/functions. I instead wish we made buffers more scalable under the hood, encouraged/allowed isolation per product domain, and extended the abstraction to work for more things than models.

Moving to bespoke Kafka consumers that do batching is a pretty big investment to make for each product, and I would prefer that we continue to have higher-level building blocks (higher-level than Arroyo). To me that would be a rewrite of buffers.

Member Author

I think buffers are a good concept and they should stay. The way they are built is maybe not ideal (in particular the model and pickle dependency), and particularly given extended outages as a possibility, it would be preferable for this to be backed by Kafka rather than Redis, for instance. I don't think everything should build a Kafka consumer, but a buffer service could be a possibility.
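Independent of the backing store, the core of what buffers buy us is write coalescing. A minimal in-memory sketch of the concept (a real implementation would back `pending` with Redis or a Kafka topic, and flush on a schedule):

```python
from collections import defaultdict


class Buffer:
    """Coalesce many small increments and apply them as batched writes.

    Many events incrementing the same counter become a single write per
    key at flush time, instead of one write per event.
    """

    def __init__(self):
        self.pending = defaultdict(int)

    def incr(self, key, delta=1):
        """Record an increment in memory; no database write happens here."""
        self.pending[key] += delta

    def flush(self, apply):
        """Apply one combined write per distinct key, then reset."""
        for key, delta in self.pending.items():
            apply(key, delta)
        self.pending.clear()
```

Nothing in this interface is model-specific, which is the point untitaker raises above: the same abstraction could batch arbitrary commutative operations per product domain, not just model column increments.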

markstory added a commit that referenced this pull request Sep 8, 2022
Building on the ideas presented in #2, this RFC aims to provide a more
formal description of how services/domains within the monolith could be
structured.
@mitsuhiko mitsuhiko merged commit 9581e0e into main Sep 28, 2022
@mitsuhiko mitsuhiko deleted the feature/arch branch September 28, 2022 09:38
@mitsuhiko
Member Author

I merged this RFC and will maintain it as an aspirational living document. The link to the PR is maintained so that old discussions can be found.
