Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implementation Plan: Bulk moderation actions #3719

Merged
merged 6 commits into from Feb 28, 2024

Conversation

stacimc
Copy link
Contributor

@stacimc stacimc commented Jan 29, 2024

Fixes

Fixes #1967 by @sarayourfriend

Description

I've chosen @obulat and @sarayourfriend as reviewers. In particular I'd like to call out to @sarayourfriend (as I do in the document itself) that this plan build on and makes some slight change to the work in #3494 that should be carefully reviewed.

This discussion is following the Openverse decision-making process. Information about this process can be found on the Openverse documentation site.

Requested reviewers or participants will be following this process. If you are being asked to give input on a specific detail, you do not need to familiarise yourself with the process and follow it.

Current round

This discussion is currently in the Decision round.

The deadline for review of this round is 19 February 2024.

Checklist

  • My pull request has a descriptive title (not a vague title likeUpdate index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • I ran the DAG documentation generator (if applicable).

Developer Certificate of Origin

Developer Certificate of Origin
Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@stacimc stacimc added 🟨 priority: medium Not blocking but should be addressed soon 🌟 goal: addition Addition of new feature 📄 aspect: text Concerns the textual material in the repository 🧱 stack: api Related to the Django API skip-changelog labels Jan 29, 2024
@stacimc stacimc self-assigned this Jan 29, 2024
@stacimc stacimc requested a review from a team as a code owner January 29, 2024 23:10
@stacimc stacimc requested review from AetherUnbound, sarayourfriend and obulat and removed request for AetherUnbound January 29, 2024 23:10
@stacimc
Copy link
Contributor Author

stacimc commented Jan 29, 2024

@sarayourfriend I'm very happy to extend the deadline for review given your schedule, please just let me know! 🙂

Copy link

Full-stack documentation: https://docs.openverse.org/_preview/3719

Please note that GitHub pages takes a little time to deploy newly pushed code, if the links above don't work or you see old versions, wait 5 minutes and try again.

You can check the GitHub pages deployment action list to see the current status of the deployments.

New files ➕:

@sarayourfriend
Copy link
Contributor

sarayourfriend commented Jan 30, 2024

5 Feb should be fine but I will let you know by Thursday if I am going to be late on this. At worst I would review this on my 5 Feb, which is before your 5 Feb, and you would never suspect that I'd done it last minute 🤡 But I'll let you know if that isn't going to work for whatever reason.

Copy link
Contributor

@sarayourfriend sarayourfriend left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really great, and I anticipate approving without issue, provided the usage of Elasticsearch or Postgres for searching records to bulk action against (and the rationale for the decision) is clarified. My hunch is that Elasticsearch is the safer option, and may even be the necessary option, considering creator is not indexed in the API Postgres. However, I haven't spent a great deal of time thinking about this, so I am really just asking for clarification on which we will use and why considering the limitations of each. Postgres would be a lot less work overall, I imagine, considering how easily Django admin already interacts with it. But is the performance hit worth that? Probably that balances against how much effort it takes to use Elasticsearch instead of Postgres for that view.

Copy link
Contributor

@obulat obulat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will also approve this IP :)

My main 2 questions are:

  1. Do we delete the item from the main media table, or only Elasticsearch? (see more details in the inline comment)
  2. How will we proceed with updating the related IP? Will we open another issue for it?

- Capitalize Elasticsearch
- Clarify language
- Specify that new media filters should query Elasticsearch rather than Postgres
@stacimc
Copy link
Contributor Author

stacimc commented Feb 12, 2024

I am going to leave this in the Revision round for a few more days before moving to the final round, to give @sarayourfriend and @obulat another chance to respond to continue the discussion on a few comments, in particular:

In response to @obulat's other question, How will we proceed with updating the related IP: I think it's sufficient to update the existing issue for creating the ModerationDecision. There are actually very few changes needed per this final draft. Curious what others think though!

@sarayourfriend
Copy link
Contributor

whether to #3719 (comment) to only remove records from Elasticsearch, to give a 'safety window' for reindexing

To clarify my answer from the comment thread, my vote is a strong "no", it is not worth making a change for this. My reasons are in the thread.

@stacimc stacimc marked this pull request as draft February 14, 2024 19:06
@stacimc stacimc marked this pull request as ready for review February 14, 2024 19:07
@stacimc
Copy link
Contributor Author

stacimc commented Feb 14, 2024

Re-requesting reviews as this has entered the Decision Round :)

Copy link
Contributor

@sarayourfriend sarayourfriend left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Nice work 👍

Copy link
Member

@dhruvkb dhruvkb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I'm not an assigned reviewer, I am still sharing some comments that came up during working on #3760. I'll add more comments if something comes up later, but these are the thoughts I have so far.

`media_identifier` (previously a one-to-one relationship to the media the
decision is related to), will be updated to `media_identifiers` _plural_: a
many-to-many relationship to potentially multiple media records the decision is
related to. Under the covers, this will be implemented by a join table. As
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are using PostgreSQL media_identifiers can be implemented as a ArrayField in Django that allows the array to be stored in-table rather than as a separate join table. Also it allows us to have the weak constraints because they will just be strings to the ArrayField.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer the join table due to the assumption that bulk reports may affect large numbers of records. The size of the rows in the ModerationDecision table could get quite large and querying on the field could be very slow.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yes, that makes sense. I had not thought about the impact this would have on querying. But wouldn't DB-writes be incredibly slow if we have to write individual records for each media item in a bulk-moderation event?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe so -- we can use RelatedManager.set to bulk set the list of related records, if I'm understanding your question!

@sarayourfriend
Copy link
Contributor

@obulat are you able to review the implementation plan soon? It would be great to get this approved soon (with any changes needed, of course, if you have blockers to share) so that we can update the issues from the previous implementation plan with the new ModerationDecision model and other modifications to the existing plan.

Co-authored-by: sarayourfriend <24264157+sarayourfriend@users.noreply.github.com>
@openverse-bot
Copy link
Collaborator

Based on the medium urgency of this PR, the following reviewers are being gently reminded to review this PR:

@obulat
This reminder is being automatically generated due to the urgency configuration.

Excluding weekend1 days, this PR was ready for review 5 day(s) ago. PRs labelled with medium urgency are expected to be reviewed within 4 weekday(s)2.

@stacimc, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.

Footnotes

  1. Specifically, Saturday and Sunday.

  2. For the purpose of these reminders we treat Monday - Friday as weekdays. Please note that the operation that generates these reminders runs at midnight UTC on Monday - Friday. This means that depending on your timezone, you may be pinged outside of the expected range.

@stacimc
Copy link
Contributor Author

stacimc commented Feb 22, 2024

I've just realized (prompted by @dhruvkb 's comment above) that I may need to clarify the specifics of the implementation for the many to many relationship. I likely do not have time to do so today so I'm going to draft this briefly.

@stacimc stacimc marked this pull request as draft February 22, 2024 20:16
@sarayourfriend
Copy link
Contributor

@obulat Can you confirm you've seen the request to review this? Will you confirm if you will approve with the change Staci is making or if you'll require other changes?

@stacimc
Copy link
Contributor Author

stacimc commented Feb 23, 2024

I was briefly concerned that there would be an issue with using the ManyToManyField as designed, due to the requirement that the record ids be non-constrained. TL;DR I had forgotten that this is possible using the db_constrained argument as we already do elsewhere in the code (for example, in the implementation of the MediaReports). I thought it was worth adding a specific mention of that argument to the IP and will also do so in the issues when they are updated/created. This should not require more work and I am undrafting now, apologies -- better to be safe 😅

@stacimc stacimc marked this pull request as ready for review February 23, 2024 00:59
@krysal krysal added the 🧭 project: implementation plan An implementation plan for a project label Feb 27, 2024
Copy link
Contributor

@obulat obulat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The plan is solid and very detailed as it considers all of the edge cases, and also updates the related plan.
Sorry for the delay

Co-authored-by: Olga Bulat <obulat@gmail.com>
@stacimc stacimc merged commit 1818cb1 into main Feb 28, 2024
38 checks passed
@stacimc stacimc deleted the implementation-plan/bulk-moderation-actions branch February 28, 2024 17:31
@stacimc
Copy link
Contributor Author

stacimc commented Feb 28, 2024

I've created issues under this milestone, and updated the relevant issue from the related IP.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
📄 aspect: text Concerns the textual material in the repository 🌟 goal: addition Addition of new feature 🟨 priority: medium Not blocking but should be addressed soon 🧭 project: implementation plan An implementation plan for a project skip-changelog 🧱 stack: api Related to the Django API
Projects
Status: Accepted
Archived in project
Development

Successfully merging this pull request may close these issues.

Implementation Plan: Bulk moderation actions
6 participants