ref(sentry-metrics): Add MetricsKeyIndexer table #28914
The new migration:

```python
# Generated by Django 2.2.24 on 2021-10-04 18:19

import django.utils.timezone
from django.db import migrations, models

import sentry.db.models.fields.bounded


class Migration(migrations.Migration):
    # This flag is used to mark that a migration shouldn't be automatically run in
    # production. We set this to True for operations that we think are risky and want
    # someone from ops to run manually and monitor.
    # General advice is that if in doubt, mark your migration as `is_dangerous`.
    # Some things you should always mark as dangerous:
    # - Large data migrations. Typically we want these to be run manually by ops so that
    #   they can be monitored. Since data migrations will now hold a transaction open
    #   this is even more important.
    # - Adding columns to highly active tables, even ones that are NULL.
    is_dangerous = False

    # This flag is used to decide whether to run this migration in a transaction or not.
    # By default we prefer to run in a transaction, but for migrations where you want
    # to `CREATE INDEX CONCURRENTLY` this needs to be set to False. Typically you'll
    # want to create an index concurrently when adding one to an existing table.
    # You'll also usually want to set this to `False` if you're writing a data
    # migration, since we don't want the entire migration to run in one long-running
    # transaction.
    atomic = True

    dependencies = [
        ("sentry", "0234_grouphistory"),
    ]

    operations = [
        migrations.CreateModel(
            name="MetricsKeyIndexer",
            fields=[
                (
                    "id",
                    sentry.db.models.fields.bounded.BoundedBigAutoField(
                        primary_key=True, serialize=False
                    ),
                ),
                ("string", models.CharField(max_length=200)),
                ("date_added", models.DateTimeField(default=django.utils.timezone.now)),
            ],
            options={
                "db_table": "sentry_metricskeyindexer",
            },
        ),
        migrations.AddConstraint(
            model_name="metricskeyindexer",
            constraint=models.UniqueConstraint(fields=("string",), name="unique_string"),
        ),
    ]
```
The model:

```python
from typing import Any

from django.db import connections, models, router
from django.utils import timezone

from sentry.db.models import Model


class MetricsKeyIndexer(Model):  # type: ignore
    __include_in_export__ = False

    string = models.CharField(max_length=200)
    date_added = models.DateTimeField(default=timezone.now)

    class Meta:
        db_table = "sentry_metricskeyindexer"
        app_label = "sentry"
        constraints = [
            models.UniqueConstraint(fields=["string"], name="unique_string"),
        ]

    @classmethod
    def get_next_values(cls, num: int) -> Any:
        using = router.db_for_write(cls)
        connection = connections[using].cursor()

        connection.execute(
            "SELECT nextval('sentry_metricskeyindexer_id_seq') from generate_series(1,%s)",
            [num],
        )
        return connection.fetchall()
```
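The `nextval(...) FROM generate_series(1, num)` query advances the table's id sequence `num` times and returns one row per value, which is how `get_next_values` reserves a batch of ids up front. A pure-Python simulation of those semantics (the `_seq` counter is a stand-in for the Postgres sequence, not real Sentry code):

```python
from itertools import count

# Stand-in for the sentry_metricskeyindexer_id_seq Postgres sequence.
_seq = count(1)


def get_next_values_sim(num: int) -> list:
    # Mirrors cursor.fetchall() on the nextval query: a list of
    # one-element tuples, one per reserved sequence value.
    return [(next(_seq),) for _ in range(num)]


first = get_next_values_sim(3)   # [(1,), (2,), (3,)]
second = get_next_values_sim(2)  # [(4,), (5,)]
```

Note that, as with a real sequence, consecutive calls never hand out the same value twice.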
Comment on lines +23 to +30

**Member:** I don't have a lot of context on this project; how will we use the values from this sequence? It vaguely looks like you want to reserve a range of ids, and then use those ids later on to create new rows. Is that the general idea?

**Member (Author):** @wedamija Sorry for not giving enough context in the PR description, I can go back and update it in a bit, but yeah: that's basically it. Eventually we want to have Postgres be off the critical path, but in order to do that we need to know the ids ahead of time. What I am unsure about is what kind of ranges we are talking about: is it 100, 1,000, 10,000? Since this metrics indexer will be used by metric names, tag keys, and tag values, it could be a lot of writes for high-cardinality tags.

**Member:** Looks good for now. Once we know how many ids we're allocating per second we can decide whether we need to do something more complex here. I'm not sure if there's a performance hit to calling `nextval` this many times.

**Contributor:** I bet it will be desirable to avoid calling `nextval` 10k times (that would be 10k writes, I believe). We may not strictly need a sequence at that point but just a counter.
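The counter-based alternative floated in that last comment could look something like the sketch below: one atomic bump of a shared counter reserves a whole contiguous block of ids, instead of one `nextval` call (and one write) per id. This is not code from the PR; the class and method names are made up for illustration.

```python
import threading


class BlockAllocator:
    """Hypothetical block allocator: a single counter update reserves a
    contiguous range of ids, rather than calling nextval once per id."""

    def __init__(self, start: int = 1) -> None:
        self._next = start
        self._lock = threading.Lock()

    def reserve(self, num: int) -> range:
        # One atomic update reserves `num` ids. In Postgres this could be a
        # single "UPDATE counter SET value = value + %s RETURNING value".
        with self._lock:
            first = self._next
            self._next += num
        return range(first, first + num)


alloc = BlockAllocator()
block_a = alloc.reserve(10_000)  # range(1, 10001)
block_b = alloc.reserve(100)     # range(10001, 10101)
```

The trade-off versus a sequence is that ids in an unused reserved block are simply lost, which is usually acceptable for an indexer like this.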
The indexer implementation:

```python
from collections import defaultdict
from typing import Any, Dict, List, Optional, Set

from sentry.sentry_metrics.indexer.models import MetricsKeyIndexer
from sentry.utils.services import Service


class PGStringIndexer(Service):  # type: ignore
    """
    Provides integer IDs for metric names, tag keys and tag values
    and the corresponding reverse lookup.
    """

    __all__ = ("record", "resolve", "reverse_resolve", "bulk_record")

    def _bulk_record(self, unmapped_strings: Set[str]) -> Any:
        records = [MetricsKeyIndexer(string=string) for string in unmapped_strings]
        # We use `ignore_conflicts=True` here to avoid race conditions where metric indexer
        # records might have been created between when we queried in `bulk_record` and the
        # attempt to create the rows down below.
        MetricsKeyIndexer.objects.bulk_create(records, ignore_conflicts=True)
        # Using `ignore_conflicts=True` prevents the pk from being set on the model
        # instances. Re-query the database to fetch the rows; they should all exist at
        # this point.
        return MetricsKeyIndexer.objects.filter(string__in=unmapped_strings)

    def bulk_record(self, strings: List[str]) -> Dict[str, int]:
        # First look up to see if we already have any of the values.
        records = MetricsKeyIndexer.objects.filter(string__in=strings)
        result: Dict[str, int] = defaultdict(int)

        for record in records:
            result[record.string] = record.id

        unmapped = set(strings).difference(result.keys())
        new_mapped = self._bulk_record(unmapped)

        for new in new_mapped:
            result[new.string] = new.id

        return result

    def record(self, string: str) -> int:
        """Store a string and return the integer ID generated for it."""
        result = self.bulk_record(strings=[string])
        return result[string]

    def resolve(self, string: str) -> Optional[int]:
        """Look up the integer ID for a string.

        Returns None if the entry cannot be found.
        """
        try:
            id: int = MetricsKeyIndexer.objects.filter(string=string).values_list(
                "id", flat=True
            )[0]
        except IndexError:
            return None

        return id

    def reverse_resolve(self, id: int) -> Optional[str]:
        """Look up the stored string for a given integer ID.

        Returns None if the entry cannot be found.
        """
        try:
            string: str = MetricsKeyIndexer.objects.filter(id=id).values_list(
                "string", flat=True
            )[0]
        except IndexError:
            return None

        return string
```
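To make the two-phase shape of `bulk_record` concrete (look up known strings, insert the missing ones idempotently, then read everything back), here is an in-memory analogue. It is a toy sketch, not Sentry code: a dict plays the role of the Postgres table, and `setdefault` plays the role of `bulk_create(ignore_conflicts=True)` followed by the re-query.

```python
class InMemoryIndexer:
    """Toy analogue of PGStringIndexer backed by a dict, illustrating the
    lookup / insert-missing / re-read flow of bulk_record."""

    def __init__(self) -> None:
        self._ids: dict = {}

    def bulk_record(self, strings: list) -> dict:
        # Phase 1: collect ids for strings we already know.
        result = {s: self._ids[s] for s in strings if s in self._ids}
        # Phase 2: insert the rest; setdefault is idempotent, like
        # bulk_create(ignore_conflicts=True), so a concurrent writer
        # racing us would not clobber an existing mapping.
        for s in set(strings) - result.keys():
            self._ids.setdefault(s, len(self._ids) + 1)
            result[s] = self._ids[s]
        return result

    def resolve(self, string: str):
        return self._ids.get(string)


indexer = InMemoryIndexer()
first = indexer.bulk_record(["a", "b"])
second = indexer.bulk_record(["b", "c"])  # "b" keeps its original id
```

The key property, preserved from the real implementation, is that recording the same string twice always yields the same id.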
**Member (Author):** I had to add the `IndexOperation` here because otherwise I got the missing `hints={'tables': ...}` argument error for `AddIndex`. It seemed to me that since the `AddIndex` operation is model specific, I could put this here. cc @wedamija

**Member:** That looks good to me.