Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce permissions of default authentication scope #4372

Merged
merged 11 commits into from
May 29, 2024

Conversation

sarayourfriend
Copy link
Contributor

@sarayourfriend sarayourfriend commented May 22, 2024

Fixes

Fixes #4324 by @sarayourfriend

Description

This PR adds a new Privilege concept to the Openverse API, and allows us to "upgrade" access to fine-grained aspects of the API for certain requesters. I've used this to add further gradations to the limits on page size and pagination depth in line with the issue's requirements.

I've left comments on the PR to help explain some of the changes. The line count looks big, but it really is mostly a lot of changes to unit tests and reorganisation of the admin site configuration. The unit tests had some severe duplication of the media_type_config fixture, which is now fixed, and cleans the test data up a bit. This makes the tests consistent across the integration and unit tests, and easier to reason about and work with (no more context switching which media type fixture is this one?...). I ended up touching these because the changes in the serializer at first assumed the request was always available. This works in practice, but not in tests, because we actually do not always instantiate the serializer with a fully and correctly configured request object. That's pretty frustrating, but fixing it would require even more changes outside the scope of this PR, so for now the serializer shamefully allows the request to be missing so that tests will not fail. I've fixed the issue for some tests, but really we should have a helper method for the tests to use that creates the serializer with particular data and an correctly configured authed or anonymous request. Or, even better, move more tests towards the integration level, rather than unit level, and forget about these implementation details of the serializer in the tests (where they really don't matter). Anyway, that's all a lot more work and this is messy enough as it is.

Otherwise, the bulk of the real changes are the new api.constants.privileges module and its data, along with the changes to the media serializer.

Testing Instructions

Run the app with just api/test and make anonymous requests, and see that limits are enforced. Page size should be limited to 20, and you shouldn't be able to go deeper than 240 results (12 pages of 20 size).

Make authenticated requests, and confirm the behaviour matches the configuration and your understanding of it. Do the same with a privileged app, by visiting the Django admin and adding the privileges on the ThrottledApplication. Enable the privilege by checking the checkbox for that privilege, and save the throttled application.

Review the new unit tests in test_auth and confirm they cover the requirements.

Checklist

  • My pull request has a descriptive title (not a vague title likeUpdate index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • [N/A] I ran the DAG documentation generator (just catalog/generate-docs for catalog
    PRs) or the media properties generator (just catalog/generate-docs media-props
    for the catalog or just api/generate-docs for the API) where applicable.

Developer Certificate of Origin

Developer Certificate of Origin
Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@sarayourfriend sarayourfriend added 🟧 priority: high Stalls work on the project or its dependents 🕹 aspect: interface Concerns end-users' experience with the software 🧰 goal: internal improvement Improvement that benefits maintainers, not users 🧱 stack: api Related to the Django API labels May 22, 2024
@github-actions github-actions bot added the migrations Modifications to Django migrations label May 22, 2024
@sarayourfriend sarayourfriend force-pushed the add/refined-registered-app-limits branch from 47a42a8 to 53b2423 Compare May 23, 2024 03:32
@WordPress WordPress deleted a comment from github-actions bot May 23, 2024
@sarayourfriend
Copy link
Contributor Author

Something (presumably caused by my deduplication of the media_type fixture in favour of the more complete, older, and widely used media_type_config), is causing test state to leak between tests, and the query assertions are failing due to the enabled sources query augmentation.

Very frustrating! I can't see at all how the test configuration is causing that to happen, and it only happens on a full test run, so it's purely down to some luck of ordering. Very annoying. I'm trying to find a way to completely squash the redis data on each subsequent test run so there's no chance of either the real redis being used nor of the data leaking between tests.

@WordPress WordPress deleted a comment from github-actions bot May 23, 2024
@WordPress WordPress deleted a comment from github-actions bot May 23, 2024
@@ -41,18 +43,20 @@ def get_redis_connection(*args, **kwargs):


@pytest.fixture(autouse=True)
def django_cache(redis, monkeypatch) -> RedisCache:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it! Indeed, something was not working about the cache fixtures, and they were sometimes persisting between tests. Adding new tests (and moving fixtures around) in this PR exposed the issue, but it ends up being easy to solve by actually completely squashing the possibility of a cache being reused by literally replacing it. We can access the real data structure underneath (and it isn't even a hack to do so), so we don't need to monkeypatch in this case.

This explains the changes in this fixture module.

api/api/constants/privilege.py Outdated Show resolved Hide resolved
@@ -1,140 +1,22 @@
from django.conf import settings
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had to reorganise this module and the admin in general. We've been putting things pretty haphazardly, and it was becoming a mess. The forms module was too generically named, and encouraged needlessly separating a form definition from the admin it was meant for. That doesn't make sense.

I've split it up into basically the conceptual model of our data, roughly matching the serializers, models, and so forth, rather than the implementation detail (form this, admin that).

The only new things in the admin are the throttled admin changes and the new form for it in admin.oauth.

@@ -50,19 +50,17 @@
"name": "auth",
"description": dedent(
"""
The API has rate-limiting and throttling in place to prevent
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chose not to list out all the details of what the limits are for throttling or privileged pagination. The throttle limits were wrong, sort of, in that those are the limits for some views, but thumbnails and some others have different limits. I don't think we actually need to publish the configured limits explicitly, and doing it like this (when we know they are configured via environment variables elsewhere anyway) inevitably leads to inaccurate information and drifting documentation.

Instead, I've changed this to describe the general idea of Openverse's API having intentional limits to requesters, how we communicate those limits (kept from the existing docs), and letting the reader know they can request higher limits if they wish.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, this is a great improvement IMO.

@@ -0,0 +1,161 @@
from dataclasses import dataclass
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just moved from api/test/unit/conftest.py and deduplicates the much more recent and less complete media_type fixture removed from api/test/integration.

"/", HTTP_AUTHORIZATION=f"Bearer {access_token.token}"
)

return APIView().initialize_request(request)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixtures in this module are just moved around, but I've also fixed an issue in this authed_request fixture where it was causing request.auth to be a token string, rather than the actual authenticated access token object that it would be in normal circumstances. It was using force_authenticate which causes auth to be just a string, and to bypass the authentication class. We don't need to do that, we have a real token, so we can just rely on the normal authentication configuration. That's a good thing! It means we're integrating the whole authentication flow into these tests and ensuring it works in all scenarios that we have tests for.

Comment on lines -252 to -277
@pytest.mark.django_db
def test_page_size_limit_unauthed(api_client):
query_params = {"page_size": 20}
res = api_client.get("/v1/images/", query_params)
assert res.status_code == 200
query_params["page_size"] = 21
res = api_client.get("/v1/images/", query_params)
assert res.status_code == 401


@pytest.mark.django_db
def test_page_size_limit_authed(api_client, test_auth_token_exchange):
time.sleep(1)
token = test_auth_token_exchange["access_token"]
query_params = {"page_size": 21}
res = api_client.get(
"/v1/images/", query_params, HTTP_AUTHORIZATION=f"Bearer {token}"
)

assert res.status_code == 200

query_params = {"page_size": 500}
res = api_client.get(
"/v1/images/", query_params, HTTP_AUTHORIZATION=f"Bearer {token}"
)
assert res.status_code == 200
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests are superseded by the new ones at the end of this module.

Comment on lines -203 to -208
@pytest.mark.django_db
def test_max_page_count(api_client):
response = api_client.get(
"/v1/images/", {"page": settings.MAX_PAGINATION_DEPTH + 1}
)
assert response.status_code == 400
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is replaced by the new test_auth tests.

@@ -1,40 +1,6 @@
from dataclasses import dataclass
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of these removed lines are just moved to api/test/fixtures/media_type_config.py

@sarayourfriend sarayourfriend marked this pull request as ready for review May 23, 2024 04:14
@sarayourfriend sarayourfriend requested a review from a team as a code owner May 23, 2024 04:14
@obulat
Copy link
Contributor

obulat commented May 28, 2024

I tried testing this PR locally, but I get

django.db.utils.ProgrammingError: column api_throttledapplication.privileges does not exist
LINE 1: ..."verified", "api_throttledapplication"."revoked", "api_throt...

error in tests and when I try to open the ThrottledApplication admin page. I assumed that I needed to run just api/dj migrate, but this gives me the following error:

  Applying api.0062_decision_through_tables...Traceback (most recent call last):

django.db.utils.ProgrammingError: relation "api_imagedecisionthrough" already exists

What should I do?

It would also be easier to review this PR if the admin splitting was extracted to a separate PR.

@sarayourfriend
Copy link
Contributor Author

sarayourfriend commented May 28, 2024

@obulat you need to run just api/init, I thought? It entirely depends on what branches you've worked on recently and what migrations you've applied to your local API database. Unfortunately I don't know what more specific instructions to give other than just recreate (or just down -v && just api/up && just api/init).

@openverse-bot
Copy link
Collaborator

Based on the high urgency of this PR, the following reviewers are being gently reminded to review this PR:

@obulat
@stacimc
This reminder is being automatically generated due to the urgency configuration.

Excluding weekend1 days, this PR was ready for review 3 day(s) ago. PRs labelled with high urgency are expected to be reviewed within 2 weekday(s)2.

@sarayourfriend, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.

Footnotes

  1. Specifically, Saturday and Sunday.

  2. For the purpose of these reminders we treat Monday - Friday as weekdays. Please note that the operation that generates these reminders runs at midnight UTC on Monday - Friday. This means that depending on your timezone, you may be pinged outside of the expected range.

Copy link
Contributor

@stacimc stacimc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code, including tests, looks great to me @sarayourfriend. I really like the refactoring here, and the effort in comments and the commit history to make that easier to review 😅

I have been having trouble actually testing but just confirmed it's an issue with my local env and not this branch. Submitting my comments so far, but I'm going to continue looking at this now!

"""

slug: str
anonymous: typing.Any
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why typing.Any for these?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In practice, they're currently int, but I resume they could be anything. I'll try and see if Python will support a Generic parameter here 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A generic appears to work! 🎉 Well, whether it would actually pass mypy, I have no idea, but the code runs still and I don't get any errors with my informal type information from my IDE.

"""This serializer passes pagination parameters from the query string."""

_SUBJECT_TO_PAGINATION_LIMITS = (
"This parameter is subject to limitations based on authentication "
"and special privileges. For details, refer to [the authentication "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

special privileges -> access level here might also be more clear.

api/api/constants/privilege.py Outdated Show resolved Hide resolved

if real_page_count > max_possible_page_count:
return floor(max_possible_page_count)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The math here reads confusingly to me for the same reason commented above related to the name PAGINATION_DEPTH. That being said -- very cool to clamp the actual result count independent of page size!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me know if the changes to names fixes this for you. Otherwise, I can try to think of a different way to express this.

@@ -50,19 +50,17 @@
"name": "auth",
"description": dedent(
"""
The API has rate-limiting and throttling in place to prevent
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, this is a great improvement IMO.

@sarayourfriend sarayourfriend force-pushed the add/refined-registered-app-limits branch from c66cbe2 to f8e26dc Compare May 29, 2024 01:37
@WordPress WordPress deleted a comment from github-actions bot May 29, 2024
Copy link

This PR has migrations. Please rebase it before merging to ensure that conflicting migrations are not introduced.

Copy link
Contributor

@stacimc stacimc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your patience @sarayourfriend! Finally got it working and this tested well for me. I tried authenticated & unauthenticated, with setting the privileges individually and both at once. Everything worked beautifully :)

Copy link
Contributor

@obulat obulat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks great and works well locally (just recreate worked, thank you!).

My main change request is for form clarification. It can be done in this PR, or in a follow-up one.

As a maintainer trying to restrict a violating app's access level, what do I need to do? When I look at the form, without the knowledge about the underlying code, I'm not sure if I should be changing the rate_limit_model or the privileges, and which privileges should I restrict to fix the problem.
We could add a documentation page about it, but I think for the maintainer trying to fix an API traffic problem, it would be nice to have an instruction inside the form itself. Something like "Unselect max_page_size for an app that is violating Openverse TOS by scraping content" in the privileges label.

@sarayourfriend
Copy link
Contributor Author

That's definitely a documentation problem, and I think a misunderstanding of how to apply decisions based on ToS violations. The correct response to an application abusing any level of access it's been granted is to revoke its access. I'll talk with @zackkrida about where best to document this, I'm not entirely sure if it's suitable for public documentation right now.

@sarayourfriend sarayourfriend merged commit e0c2951 into main May 29, 2024
48 checks passed
@sarayourfriend sarayourfriend deleted the add/refined-registered-app-limits branch May 29, 2024 06:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🕹 aspect: interface Concerns end-users' experience with the software 🧰 goal: internal improvement Improvement that benefits maintainers, not users migrations Modifications to Django migrations 🟧 priority: high Stalls work on the project or its dependents 🧱 stack: api Related to the Django API
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Reduce permissions of default authentication scope
4 participants