
Retry Deepcell in case of bad gateway (or similar) response errors #605

Merged
merged 11 commits into master from deepcell_json_decode
Jun 30, 2022

Conversation

alex-l-kong
Contributor

What is the purpose of this PR?

Closes #602, which reports a JSON decode error that occurs when the Deepcell API returns an error response instead of valid JSON.

How did you implement your changes

Utilize the Retry class (from urllib3, exposed through requests) mounted on a requests.Session via an HTTPAdapter, and use that session to invoke the post call to Deepcell.

We also add a try/except block around this statement in case none of the retries work and the resulting JSON still cannot be decoded.
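
A minimal sketch of the approach (the endpoint URL, payload, retry counts, and backoff values below are illustrative placeholders, not necessarily the exact values used in this PR):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Retry transient gateway/availability failures with exponential backoff.
# POST must be listed explicitly because urllib3 does not retry it by default;
# on urllib3 < 1.26 the argument is called method_whitelist instead.
retries = Retry(
    total=5,
    backoff_factor=1,
    status_forcelist=[502, 503, 504],
    allowed_methods=["POST"],
)

session = requests.Session()
session.mount("https://", HTTPAdapter(max_retries=retries))

payload = {"hash": "example-hash"}  # placeholder payload
response = session.post("https://deepcell.org/api/predict", json=payload)  # placeholder URL

try:
    result = response.json()
except ValueError as e:  # JSONDecodeError is a subclass of ValueError
    # even after the retries, the body is not valid JSON (e.g. an HTML error page)
    print(f"Deepcell did not return decodable JSON: {e}")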

Remaining issues

I'm a bit in the dark here because I've not encountered this error personally. @ngreenwald let me know if this resolves the issue, or if it will need additional work.

@alex-l-kong
Contributor Author

@ngreenwald whoops, I forgot to request the review earlier. I wasn't able to replicate the bad gateway/JSON decode error on my end, so let me know if you run into issues with the Retry module on yours.

@alex-l-kong
Contributor Author

@ngreenwald since we make similar API calls to our Redis database later, I can implement the same Retry logic there if this works for Deepcell on your end.

@ngreenwald
Member

Just tried it, looks like the error isn't being caught:

JSONDecodeError                           Traceback (most recent call last)
/usr/local/lib/python3.6/site-packages/requests/models.py in json(self, **kwargs)
    909         try:
--> 910             return complexjson.loads(self.text, **kwargs)
    911         except JSONDecodeError as e:

/usr/local/lib/python3.6/json/__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    353             parse_constant is None and object_pairs_hook is None and not kw):
--> 354         return _default_decoder.decode(s)
    355     if cls is None:

/usr/local/lib/python3.6/json/decoder.py in decode(self, s, _w)
    338         """
--> 339         obj, end = self.raw_decode(s, idx=_w(s, 0).end())
    340         end = _w(s, end).end()

/usr/local/lib/python3.6/json/decoder.py in raw_decode(self, s, idx)
    356         except StopIteration as err:
--> 357             raise JSONDecodeError("Expecting value", s, err.value) from None
    358         return obj, end

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

During handling of the above exception, another exception occurred:

JSONDecodeError                           Traceback (most recent call last)
<ipython-input-9-c0157cd31fa5> in <module>
----> 1 deepcell_service_utils.create_deepcell_output(deepcell_input_dir, deepcell_output_dir, fovs=fovs, scale=rescale_factor, zip_size=10)

/usr/local/lib/python3.6/site-packages/ark/utils/deepcell_service_utils.py in create_deepcell_output(deepcell_input_dir, deepcell_output_dir, fovs, suffix, host, job_type, scale, timeout, zip_size, parallel)
    147             executor.shutdown(wait=True)
    148     else:
--> 149         list(map(_zip_run_extract, fov_groups, range(len(fov_groups))))
    150 
    151 

/usr/local/lib/python3.6/site-packages/ark/utils/deepcell_service_utils.py in _zip_run_extract(fov_group, group_index)
    117         print('Uploading files to DeepCell server.')
    118         status = run_deepcell_direct(
--> 119             zip_path, deepcell_output_dir, host, job_type, scale, timeout
    120         )
    121 

/usr/local/lib/python3.6/site-packages/ark/utils/deepcell_service_utils.py in run_deepcell_direct(input_dir, output_dir, host, job_type, scale, timeout, num_retries)
    244             json={
    245                 'hash': predict_hash,
--> 246                 'key': ["status", "progress", "output_url", "reason", "failures"]
    247             }
    248         ).json()

/usr/local/lib/python3.6/site-packages/requests/models.py in json(self, **kwargs)
    915                 raise RequestsJSONDecodeError(e.message)
    916             else:
--> 917                 raise RequestsJSONDecodeError(e.msg, e.doc, e.pos)
    918 
    919     @property

JSONDecodeError: [Errno Expecting value] <html>
<head><title>502 Bad Gateway</title></head>
<body>
<center><h1>502 Bad Gateway</h1></center>
<hr><center>nginx/1.17.10</center>
</body>
</html>
: 01

@alex-l-kong
Contributor Author

alex-l-kong commented Jun 19, 2022

@ngreenwald in that case, it seems likely that the main endpoint doesn't actually return a 502 status code, but instead returns an HTML page reporting that it encountered one (which can't be parsed as JSON), which is presumably why the status-based Retry never triggers. I've explicitly added a check for JSONDecodeError; let me know if it works now.
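
For illustration, a hedged sketch of what an explicit JSONDecodeError check around the response could look like (endpoint, payload, num_retries, and retry_delay are hypothetical placeholders, not the exact names used in ark):

import time
from json import JSONDecodeError

import requests

endpoint = "https://deepcell.org/api/status"  # placeholder URL
payload = {"hash": "example-hash"}            # placeholder payload
num_retries, retry_delay = 5, 5               # hypothetical attempt count and delay (seconds)

for attempt in range(num_retries):
    response = requests.post(endpoint, json=payload)
    try:
        result = response.json()
        break
    except JSONDecodeError:
        # the body was an HTML 502 page rather than JSON; wait and try again
        time.sleep(retry_delay)
else:
    raise ValueError("Deepcell returned a non-JSON response on every attempt")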

@ngreenwald
Member

Looks like this did the trick. Now we need to deal with zip files that never got processed. There should be a status update printed for the user with the name of the zip file that wasn't processed if the retry times out.

I think having the 5 retries in place should do the trick, but if it doesn't, we'll need a record of what happened so we can add functionality to retry the zip files that failed.

I also added some print statements for debugging; you can remove those.

Review threads on ark/utils/deepcell_service_utils.py (resolved)
@alex-l-kong
Contributor Author

Here's how I'm thinking the process will work:

while True:
    # make calls in parallel
    if parallel:
        with ThreadPoolExecutor() as executor:
            executor.map(_zip_run_extract, fov_groups, range(len(fov_groups)))
            executor.shutdown(wait=True)
    else:
        list(map(_zip_run_extract, fov_groups, range(len(fov_groups))))

    # run check for zip file created for each `fov_group`
    # if they all exist, break
    # if not, then set fov_groups equal to the ones that remain, and stay in the loop

I can also add another status variable to prevent this from getting stuck in a long cycle in case of a catastrophic error on Deepcell's end, but based on your findings using the Retry module, I doubt this loop will need to run more than a few times at most.
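
A hypothetical guard along those lines, extending the sketch above (max_attempts and output_zip_missing are illustrative names, not existing code in ark):

max_attempts = 3  # cap so a catastrophic Deepcell outage can't keep us looping forever
for attempt in range(max_attempts):
    # ... run _zip_run_extract over the current fov_groups as in the loop above ...
    # keep only the groups whose output zip never appeared (hypothetical helper)
    fov_groups = [group for group in fov_groups if output_zip_missing(group)]
    if not fov_groups:
        break
else:
    print(f"Giving up after {max_attempts} attempts; unprocessed groups: {fov_groups}")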

@ngreenwald
Member

ngreenwald commented Jun 29, 2022

Yeah, I agree that's the long-term solution. However, for now I think we should just print a message when this error occurs. Switching the default batch_size to 5, and retrying on failures, seems to have resolved all the problems for me.

Once other users report that they're running into the same issue, we can make the change, but it will require some refactoring of how the function works, since right now the output zip file isn't uniquely named. So we'll need to make sure the output name matches the input zip name, and also make sure that when the process is restarted the old zip files don't give a false positive for which files need to be uploaded.

Can you open an issue describing it?

@alex-l-kong
Contributor Author

Sounds good, I've just opened the issue.

I think the now-modified print statements we have do a pretty good job of notifying the user. I'll also include an additional printout of the FOVs that couldn't be processed by the normal retry logic under the if status != 0 check.
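
For instance, something along these lines (the exact message wording is illustrative; status and fov_group come from the surrounding _zip_run_extract code):

if status != 0:
    # tell the user which FOVs never made it through Deepcell so they can be rerun
    print(f"The following FOVs could not be processed: {', '.join(fov_group)}")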

@ngreenwald ngreenwald merged commit ad6ccfc into master Jun 30, 2022
@ngreenwald ngreenwald deleted the deepcell_json_decode branch June 30, 2022 17:46
@srivarra srivarra added the bug Something isn't working label Aug 16, 2022
Linked issue: json decode error when uploading to deepcell server (#602)