Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CT-352] catch and retry malformed json #4982

Merged
merged 14 commits into from
Apr 5, 2022
Merged

Conversation

emmyoop
Copy link
Member

@emmyoop emmyoop commented Mar 30, 2022

resolves #4849, resolves #4342

Description

There are still some exceptions when trying to install deps that fail without retrying. This PR tries to catch the rest of them.

It also caches the HUB response to reduce calls. See #4983 for some more detailed context.

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have added information about my change to be included in the CHANGELOG.

@cla-bot cla-bot bot added the cla:yes label Mar 30, 2022
@github-actions
Copy link
Contributor

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.

try:
# the quotes in the response were converted to single quotes which is not
# valid json. json.dumps() will fix that.
json.loads(json.dumps(jsonData))
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels bad. Any suggestions on a better way to do this welcome!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment doesn't state this yet but the api/v1/index.json path also returns a list objects that causes it fail.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wait... I think it should be valid JSON... it's just a list... according to jsonlint dot com it's valid

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jtcohen6 my understanding is json.resp() returns a json object which really means it's returning a python list in this case. So json.loads(jsonData) freaks out with TypeError: the JSON object must be str, bytes or bytearray, not list. I've updated all this confusing logic I added to just straight check if json.resp() is a list or dict. That's all we ever expect it to be. Since we control the HUB and the only time it's not one of those two things something else is going wrong, this is safe. It's also much clearer as to what's happening.

Comment on lines 102 to 106
expected_type = dict
expected_keys = ["name", "packages", "downloads"]
return _get_with_retries(
"api/v1/{}/{}.json".format(name, version), registry_base_url, expected_type, expected_keys
)
Copy link
Contributor

@jtcohen6 jtcohen6 Apr 1, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can avoid a lot of superfluous requests by making this a filter on the (cached) package request response, rather than a separate request. This is what I'm trying over in #4983:

def package_version(name, version, registry_base_url=None):
response = package(name, registry_base_url)
return response["versions"][version]

But we're not validating ["name", "packages", "downloads"] as expected keys of the package response there, just the presence of a versions key (which contains a list of version dictionaries, each of which should contain those keys).

Maybe we should take a hybrid approach like:

  • First try grabbing the specific version info from the (cached) package response
  • Check to ensure that the expected keys are present (["name", "packages", "downloads"])
  • If they're missing for some reason, fall back to making the current version-specific request, with retries: "api/v1/{}/{}.json"

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check to ensure that the expected keys are present (["name", "packages", "downloads"])

Another way of doing this: RegistryPackageMetadata.from_dict() here, on the specific version response, instead of in dbt.deps.registry._fetch_metadata

@emmyoop emmyoop marked this pull request as ready for review April 1, 2022 21:16
@emmyoop emmyoop requested a review from a team April 1, 2022 21:16
@emmyoop emmyoop requested review from a team as code owners April 1, 2022 21:16
@emmyoop emmyoop added the backport 1.0.latest Tag for PR to be backported to the 1.0.latest branch label Apr 1, 2022
@emmyoop emmyoop requested review from nathaniel-may, gshank and jtcohen6 and removed request for a team April 4, 2022 13:05
Copy link
Contributor

@nathaniel-may nathaniel-may left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to me! Added a bunch of optional suggestions as well.

Comment on lines 49 to 51
error_msg = (
f"Request error: The response type of {type(response)} is not valid: {resp.text}"
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's put what we expected in this error message


def package(name, registry_base_url=None):
response = _get_with_retries("api/v1/{}.json".format(name), registry_base_url)
expected_keys = ["name", "versions"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since we seem to only use this as a set, maybe we could just create it as a set using {} instead of []

if not set(expected_keys).issubset(response) and not set(expected_version_keys).issubset(
response["versions"]
):
error_msg = f"Request error: Expected the response to contain keys {expected_keys} but one or more are missing: {resp.text}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit] we can use set diff to discover and explicitly report these keys. There's only 3 keys right now so if we don't want to bother I'm fine with that too.

def _get(path, registry_base_url=None):
url = _get_url(path, registry_base_url)
def _get(package_name, registry_base_url=None):
url = _get_url(package_name, registry_base_url)
fire_event(RegistryProgressMakingGETRequest(url=url))
resp = requests.get(url, timeout=30)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this line throw at all?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. But we're catching all exceptions from requests in the retry logic.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets add a comment to that effect then.

Comment on lines 87 to 107
def package(package_name, registry_base_url=None):
# returns a dictionary of metadata for all versions of a package
response = _get_with_retries(package_name, registry_base_url)
return response["versions"]


def package_version(package_name, version, registry_base_url=None):
# returns the metadata of a specific version of a package
response = package(package_name, registry_base_url)
return response[version]


def get_available_versions(package_name):
# returns a list of all available versions of a package
response = package(package_name)
return list(response)


def _get_index(registry_base_url=None):

url = _get_url("index", registry_base_url)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like these functions and how they abstract over retries. Can we add type hints to them?

@emmyoop emmyoop requested a review from leahwicz April 5, 2022 01:15
Copy link
Contributor

@jtcohen6 jtcohen6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! This is a huge improvement

@emmyoop emmyoop merged commit 7f953a6 into main Apr 5, 2022
@emmyoop emmyoop deleted the er/ct-352-json-malform branch April 5, 2022 15:44
@github-actions
Copy link
Contributor

github-actions bot commented Apr 5, 2022

The backport to 1.0.latest failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-1.0.latest 1.0.latest
# Navigate to the new working tree
cd .worktrees/backport-1.0.latest
# Create a new branch
git switch --create backport-4982-to-1.0.latest
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick --mainline 1 7f953a6d48a4db254b17fd3929d267e4b6069228
# Push it to GitHub
git push --set-upstream origin backport-4982-to-1.0.latest
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-1.0.latest

Then, create a pull request where the base branch is 1.0.latest and the compare/head branch is backport-4982-to-1.0.latest.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 1.0.latest Tag for PR to be backported to the 1.0.latest branch cla:yes
Projects
None yet
4 participants