feat: Add dif assemble endpoint #7141
Conversation
Migration Checklist
Generated by 🚫 danger
Force-pushed from 93549ad to 171b6c5
src/sentry/api/endpoints/chunk.py
Outdated
```python
return Response(
    {
        'url': '{}{}'.format(endpoint, reverse('sentry-api-0-chunk-upload')),
```
This is the only thing that changed here.
We want to return the full url instead of just the "domain"
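The shape of that change, as a standalone sketch (the `reverse` stand-in, route path, and `endpoint` value below are hypothetical; the real code uses Django's `reverse` and Sentry's configured system URL prefix):

```python
# Stand-in for Django's reverse() so the sketch runs on its own; the
# real endpoint resolves 'sentry-api-0-chunk-upload' via the URL conf.
def reverse(name):
    routes = {'sentry-api-0-chunk-upload': '/api/0/chunk-upload/'}
    return routes[name]

# Hypothetical system URL prefix; the real value comes from Sentry options.
endpoint = 'https://sentry.example.com'

# Full URL returned to the client instead of just the domain:
url = '{}{}'.format(endpoint, reverse('sentry-api-0-chunk-upload'))
```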
```python
CELERY_QUEUES = [
    Queue('alerts', routing_key='alerts'),
    Queue('auth', routing_key='auth'),
    Queue('assemble', routing_key='assemble'),
```
cc @JTCunning I'm not sure if we need to do anything special anymore from ops to handle a new queue.
No, we're good. If the task takes up a significant amount of resources, we'll isolate it with another pool of workers.
Just blocking this until we add verification to checksums like we discussed offline.
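One way such verification could look (a hypothetical sketch, not the code that ended up in this PR): recompute the SHA1 of each uploaded chunk server-side and reject it when it doesn't match the checksum the client claimed.

```python
from hashlib import sha1

def verify_chunk(data, claimed_checksum):
    # A chunk is addressed by the SHA1 of its content, so the server
    # can recompute the digest instead of trusting the client.
    return sha1(data).hexdigest() == claimed_checksum

chunk = b'example chunk payload'
good = sha1(chunk).hexdigest()
```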
src/sentry/tasks/assemble.py
Outdated
```python
    The type is a File.ChunkAssembleType
    '''
    if len(file_blob_ids) == 0:
        logger.warning('sentry.tasks.assemble.assemble_chunks', extra={
```
You can remove all of your log statements that are prepended with `sentry.tasks.assemble`, since that will be in the logger name and is unnecessary.
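What the comment is getting at, in generic `logging` terms (not Sentry's actual logging setup, and the short event name below is hypothetical): the logger's name already carries the `sentry.tasks.assemble` namespace, so repeating it in every event string is redundant.

```python
import logging

logger = logging.getLogger('sentry.tasks.assemble')

# The namespace is already part of the logger itself ...
namespace = logger.name

# ... so the event string only needs the short, specific part:
event = 'assemble_chunks'  # not 'sentry.tasks.assemble.assemble_chunks'
```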
src/sentry/tasks/assemble.py
Outdated
```python
file.assemble_from_file_blob_ids(file_blob_ids, checksum)
if file.headers.get('state', '') == ChunkFileState.ERROR:
    logger.error(
        'sentry.tasks.assemble.assemble_chunks',
```
Since you have multiple `assemble_chunks` error statements throughout your code, you should append them with the logical reason they're erroring, so `assemble_chunks.state_error` or something.
I think we should kill mode 1 and always require the chunks and metadata to be sent (as a file with the same checksum might exist with other parameters). Additionally we will need to org-scope the chunk upload for security reasons, as discussed on Slack. I'm fine storing a chunk-verified bit in cache for 12 hours, which should also be our "chunk not stable" time. David also says we can keep a huge table. Either works, I think.
What we discussed on slack:
Notes unrelated to above convo:
Would be great if we could return the error description in
I've added a new model called
@mattrobenolt any new feedback?
Left some specific comments.
Generally though, I think we should really make this a DIF-specific endpoint for now (e.g. on a DIF-specific URL instead of the generic chunk-assemble). Reason being that it seems cleaner from an access-management point of view and fits better into how the current code functions.
src/sentry/api/endpoints/chunk.py
Outdated
```python
    )
except IntegrityError:
    pass
if blob.checksum not in checksum_list:
```
Can we change this loop to be a `for checksum, chunk in izip(checksums, files)` and then check the `blob.checksum` against the checksum directly?
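The suggested shape, sketched with plain lists (hypothetical data; `izip` is Python 2's lazy `zip` from `itertools`, and the two lists are assumed to arrive in matching order):

```python
# Hypothetical parallel lists: the checksums the client claimed and
# the uploaded chunks, in the same order.
checksums = ['aaa111', 'bbb222']
files = [{'checksum': 'aaa111'}, {'checksum': 'zzz999'}]

bad = []
for checksum, chunk in zip(checksums, files):  # izip() on Python 2
    # Compare each chunk against its own claimed checksum directly,
    # instead of a membership test against a separate checksum_list.
    if chunk['checksum'] != checksum:
        bad.append(checksum)
```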
src/sentry/api/endpoints/chunk.py
Outdated
```python
for owned_blob in all_owned_blobs:
    owned_blobs.append((owned_blob.blob.id, owned_blob.blob.checksum))

# If the request does not cotain any chunks for a file
```
typo "cotain"
src/sentry/api/endpoints/chunk.py
Outdated
```python
elif len(owned_blobs) != len(chunks):
    # Create a missing chunks array which we return as response
    # so the client knows which chunks to reupload
    missing_chunks = list(chunks)
```
Make this into `missing_chunks = set(chunks)` and then remove items with `missing_chunks.discard(blob[1])`. Faster and easier (`O(1)` vs `O(n)`-something).
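A runnable sketch of the suggestion (hypothetical checksums; `blob[1]` is the checksum in the `(id, checksum)` tuples built earlier):

```python
chunks = ['c1', 'c2', 'c3']
owned_blobs = [(17, 'c2'), (42, 'c9')]

# Set-membership removal is O(1) per blob (vs O(n) for list.remove),
# and discard() is a silent no-op for checksums that aren't missing.
missing_chunks = set(chunks)
for blob in owned_blobs:
    missing_chunks.discard(blob[1])
```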
```python
def forwards(self, orm):
    # Adding index on 'File', fields ['checksum']
    db.create_index('sentry_file', ['checksum'])
```
We can't run this in production. This needs to be done with `CREATE INDEX CONCURRENTLY`. grep the repo for examples of other migrations adding indexes.
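A rough sketch of the pattern being asked for (a South-era migration fragment, not runnable standalone; the index name and exact incantation are assumptions, so copy the real pattern from an existing index migration in the repo). `CREATE INDEX CONCURRENTLY` builds the index without taking a long write lock, but it refuses to run inside a transaction block, so the migration has to step outside South's implicit one:

```python
def forwards(self, orm):
    # CONCURRENTLY avoids locking writes to sentry_file while the
    # index builds, but cannot run inside a transaction.
    db.commit_transaction()
    try:
        db.execute(
            'CREATE INDEX CONCURRENTLY sentry_file_checksum '
            'ON sentry_file (checksum)'
        )
    finally:
        db.start_transaction()
```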
```python
# Flag to indicate if this migration is too risky
# to run online and needs to be coordinated for offline
is_dangerous = False
```
Because creating this index will take a while, especially with `CONCURRENTLY`, this needs to be flipped to `True` so we don't block deploy for however long it takes to create the index.
As far as I can tell this is good to go from my side. Annoyingly, we can't push this to staging because of the migration.
Also needs @mattrobenolt's seal of approval.
Blocking for the `prefetch_related` comment, and to make sure the migration doesn't need to be rebased.
src/sentry/api/bases/chunk.py
Outdated
```python
def _check_file_blobs(self, organization, checksum, chunks):
    files = File.objects.filter(
        checksum=checksum
    ).select_related('blobs').all()
```
I think you're looking for `prefetch_related` here. Otherwise, below you're doing an O(n) query for each file to fetch the blobs.
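The distinction, sketched as an ORM fragment (not runnable outside a configured Django project, and the loop body is illustrative): `select_related` only follows foreign-key/one-to-one relations via a SQL join, so it can't help with a many-to-many relation like `blobs`; `prefetch_related` fetches all related blobs in one extra query and stitches them together in Python.

```python
# One extra query total for all files' blobs, instead of one query
# per file each time file.blobs.all() is touched in the loop:
files = File.objects.filter(checksum=checksum).prefetch_related('blobs')
for f in files:
    for blob in f.blobs.all():  # served from the prefetch cache
        ...
```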
src/sentry/api/bases/chunk.py
Outdated
```python
name=name,
checksum=checksum,
type='chunked',
headers={'state': ChunkFileState.CREATED}
```
I'm not a fan of using a header here to manage the state. Is this file truly temporary? Might I suggest a name like `__state` to better signal that it's not real?
Migration needs to be rebased over #7191
This adds a `POST /api/0/projects/sentry/internal/files/difs/assemble/` API endpoint to Sentry.

tl;dr

This is part 2 of #7095. This enables us to upload DIFs (Debug Information Files) of arbitrary size.
95%+ of the changed lines are the added `crash.sym` fixture for tests. This endpoint pieces together chunks that were uploaded before in the mentioned PR.
It adds a new model called `FileBlobOwner`, which makes sure that someone who uploads a blob has access rights to it (per Organization). The request is JSON-schema validated.
Request

`POST /api/0/projects/sentry/internal/files/difs/assemble/`

Body:
This actually tries to assemble a file if all chunks were uploaded before.
If chunks are missing (or ownership is missing), the response will be:
If all chunks are already uploaded and the file did not exist before, the response will be:
State can be:
This will trigger the task `sentry.tasks.assemble.assemble_chunks` to do the actual assembling.

The assemble task supports DIFs (Debug Information Files) like dSYMs and so on. It does not currently support ProGuard and source map files.
Legacy

**Request** `POST` `/api/0/chunk-assemble/`

Body:

```json
{
  "38fbd8b2cbe56884115e324dd5f2a10c8201450c": true
}
```

This will check the `File` model if a file with this checksum already exists in the database.

Response: