
Project Proposal: Detecting, filtering, and blurring results that include sensitive terms #873

Merged
merged 17 commits into main from kickoff/detecting-sensitive-textual-content on Mar 17, 2023

Conversation

sarayourfriend
Contributor

@sarayourfriend sarayourfriend commented Mar 9, 2023

Project proposal for #377

Due date

Review for this proposal should aim to be completed by 24 March 2023, roughly two weeks from when this PR was undrafted.

Notes

I've added a project-proposal.md template file, based on the approach I took here, to assist future project proposal writers in getting started. The sections are based on the requirements for project proposals described in the project process documentation.

Reviewers: Please make note of the two outstanding questions. I've made recommendations for each, but I would like particular consideration from y'all on these two points, whether you agree or disagree with my recommendations.

I pinged @krysal and @AetherUnbound semi-randomly as the reviewers, but if y'all aren't able to review this proposal over the next two weeks, please let me know and I'll find other folks to do so.

Checklist

  • My pull request has a descriptive title (not a vague title like
    Update index.md).
  • My pull request targets the default branch of the repository (main) or
    a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • [N/A] I added or updated tests for the changes I made (if applicable).
  • [N/A] I added or updated documentation (if applicable).
  • [N/A] I tried running the project locally and verified that there are no visible
    errors.


@openverse-bot openverse-bot added this to In progress in Openverse PRs Mar 9, 2023
@openverse-bot openverse-bot added the 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work label Mar 9, 2023

github-actions bot commented Mar 9, 2023

Full-stack documentation: Ready

https://WordPress.github.io/openverse/_preview/873

Please note that GitHub Pages takes a little time to deploy newly pushed code. If the links above don't work or you see old versions, wait 5 minutes and try again.

You can check the GitHub pages deployment action list to see the current status of the deployments.

@sarayourfriend sarayourfriend force-pushed the kickoff/detecting-sensitive-textual-content branch from 9bfdff4 to 55ab235 on March 9, 2023 05:39
@sarayourfriend sarayourfriend force-pushed the kickoff/detecting-sensitive-textual-content branch from 55ab235 to 9b7a9eb on March 9, 2023 05:45
@sarayourfriend sarayourfriend marked this pull request as ready for review March 9, 2023 05:49
@sarayourfriend sarayourfriend requested a review from a team as a code owner March 9, 2023 05:49
@openverse-bot openverse-bot moved this from In progress to Needs review in Openverse PRs Mar 9, 2023
@sarayourfriend sarayourfriend added 🟨 priority: medium Not blocking but should be addressed soon 📄 aspect: text Concerns the textual material in the repository 🧰 goal: internal improvement Improvement that benefits maintainers, not users and removed 🚦 status: awaiting triage Has not been triaged & therefore, not ready for work labels Mar 9, 2023
@krysal krysal added 🟧 priority: high Stalls work on the project or its dependents 🧱 stack: api Related to the Django API and removed 🟨 priority: medium Not blocking but should be addressed soon labels Mar 11, 2023
quickly. I'm not totally convinced it'd be _that_ much faster though. Like I
said above, BlurHash looks complicated, but really only in comparison to how
simple the CSS approach is. Anything would look complicated when compared to
just adding a CSS filter style to something.
Contributor

What if we had a single generic blurred image instead of trying to blur the one that needs it? This way, the user would not be able to discern what's behind the blur, but is it really necessary?

Contributor Author

This way, the user would not be able to discern what's behind the blur

I think it's desirable to be able to sort of "guess" at what's behind the blur. For example, if you notice a lot of red, maybe the image includes a large amount of blood or something you'd rather avoid seeing, even though some sensitive results might be of interest for your query.

On the other hand, there are issues with this assumption as well, as in the case of this Mastodon issue: mastodon/mastodon#10729

Our context might be different from Mastodon, though we do include Flickr, which is social media, and may have the same issues the Mastodon users are describing in that issue.

If we used a single generic blurred image, how would we choose it? We decided against using anything associated with Openverse (like the logo) to avoid implicitly associating Openverse with sensitive content. Are there other reasonable options that wouldn't potentially mislead people? Maybe a blank box or some other solid colour would be appropriate 🤔

Member

Minor point: I can also see having variety on the search results page, when we use these as placeholders, being much more visually appealing than the same uniform placeholder over every image.

Contributor Author

Oh, I'll also note that the issue is slightly different from Mastodon in that on Mastodon images are often marked with author-composed "content warnings". We won't have those, at least not for the time being (until we implement some hypothetical categorisation of sensitive terms, for example). In our case, the blurred image would be an even more valuable signal for someone deciding whether a result might be relevant to their query.


FWIW, Pixabay places a black overlay with the text "Adult Content" (more or less like a Mastodon content warning) that toggles off when a user selects the option to remove safe search filters. It does not provide any hint of the image behind it.

Contributor Author

That's an interesting alternative, thanks cogdog. We could do something similar with "Sensitive Content" for the text if folks felt it was a better approach.

If this feels important to further discuss, I would like to suggest tabling the discussion and having a separate proposal written to identify the particular approach. If we do that, I would change this proposal to merely reference "obscuring" images, without specifying how they are obscured (whether blurred or hidden entirely or something else).

I suggest this because the API implementation plan could go forward without needing to have a decision made on how to obscure the images, whereas this discussion only really blocks the frontend implementation plan.

determination is made in Elasticsearch and benefits from Elasticsearch's full
text search, then we may need to change the single result endpoint to pull the
result in question from Elasticsearch rather than Postgres, leveraging the
`_source` property on hits. This could be a big, complex change, though and
Contributor

I am not sure we have access to the _source property on hits. @dhruvkb, could you confirm?
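(For reference only, and not an answer about our own client configuration: in a raw Elasticsearch search response, each hit normally carries the indexed document under `_source` unless the index mapping or the query disables it. A minimal sketch, with a hypothetical host, index, and field names:)

```python
from elasticsearch import Elasticsearch

# Hypothetical host, index, and field names, purely for illustration.
es = Elasticsearch("http://localhost:9200")
response = es.search(index="image", query={"match_all": {}}, size=1)

for hit in response["hits"]["hits"]:
    document = hit["_source"]  # absent if _source is disabled for the index/query
    print(document.get("identifier"), document.get("title"))
```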


sarayourfriend and others added 3 commits March 14, 2023 17:39
…osal_detecting_sensitive_textual_content.md

Co-authored-by: Zack Krida <zackkrida@pm.me>
…osal_detecting_sensitive_textual_content.md

Co-authored-by: Zack Krida <zackkrida@pm.me>
@sarayourfriend
Contributor Author

I've updated the proposal to recommend implementing CSS filtering first. I did this after testing the solution locally (by blurring all images) and visiting the site on the lowest-spec hardware I have access to; it doesn't appear to have any adverse effects.

Contributor

@AetherUnbound AetherUnbound left a comment

Thank you for the proposal and template! I've included a few questions, as well as answers to your outstanding questions from my own perspective 😄

Comment on lines +173 to +174
To simply phrase the question to reviewers: **What is your preference for
storing the "include sensitive results" option and why?**.
Contributor

I also agree that this is a reasonable balance! From my own experience with other settings on other applications I know that if I had to re-enable a setting every time I used the search, I probably wouldn't use the setting at all.

Comment on lines +196 to +199
I am almost entirely closed to the option of storing the option in the query
params. It gives no discernible benefit in my estimation but comes with risks
that are counter-productive to the goals of this project. If there is a strong
argument to be made in favour of this approach, however, please share it.
Contributor

I agree, especially given the case you mentioned.

Comment on lines +354 to +357
For reviewers: **What are your feelings on this issue and why?**. Do you think it
is worth investing time up front in BlurHash, LQIP, or another solution; or is
it fine to push that off to a later date as a potential future improvement to
the feature?
Contributor

I sincerely appreciate the effort you took to explore various options. While reading the proposal (before getting to your recommendation on the approach), my perspective was that we should go with the CSS-blur option initially. I like your idea of introducing a move to BlurHash down the line as its own proposal; the other applications of the hash seem fascinating and may be something we'd like to explore as well.

In short, I agree that we should move forward with CSS blurring at this time, and this document can be used as a jumping off point for a move to BlurHash down the line.

Comment on lines +262 to +269
- Requires conditionally calculating the hash on the API side in the thumbnail
endpoint
- This would be done by using a new query parameter on the thumbnail endpoint
to request the blurred hash
- Can have a performance impact on the thumbnail endpoint
- Introduces a delay between when the image is unblurred and when it is displayed
  due to needing to download the actual thumbnail
- Could have different performance issues if decoding the hash is a heavy
Contributor

Much of this section discusses the thumbnail endpoint - do we use thumbnails for the single result page or do we load the whole image? If it's the latter, since we're not serving the images themselves, would we need to switch to using the thumbnail for sensitive results?

Contributor

Per @zackkrida: It appears that we hotlink the upstream image in the single result case. I imagine that would impact how we handle navigating to the single result page for sensitive content and what the blurring process is for that, is that correct?

Contributor Author

@sarayourfriend sarayourfriend Mar 14, 2023

You're right, there's some nuance that needs to be teased out here.

The frontend does use the thumbnail on the single results page if navigating to the result from the search page. This allows us to display the result (albeit at lower quality) until we have the full image ready to display. It prevents some weird pop-in behaviours from happening. In that case, we're already using the thumbnail and so we could delay downloading the full image until it is unblurred.

In the case where someone lands directly on the single results page (without navigating from the search page), we do not use the thumbnail. We'd need to have the frontend decide to use the thumbnail instead of the full image initially for sensitive results and then only load the full image once it's been "unblurred".

This is mostly moot with the CSS filter approach though, as in that case we can always just apply the filter client side, regardless of where the image is coming from. We would only need to consider these nuances if we went with an API-side blurring approach, BlurHash, LQIP, etc.

Contributor

Ah, okay that's a great point! I didn't realize that we could punt on these nuances with the CSS blur approach. That's even further argument for it at this time 🙂

Member

@krysal krysal left a comment

This is a great proposal; it would be incredible to (re)add this feature to Openverse! I agree with most of what is written. My outstanding questions are related to the API implementation, which I know is still in planning, but I'd like to discuss it at a high level. When I talked to Zack about this initiative, I thought of it more as work on the Catalog.

From what I read, the classification of the sensitive content will happen at query time, either in Python or ES. I can see this as a first step to expedite the detection of the sensitive media, given that, presumably, it will take us a significant amount of time to classify all our existing content. Now, this makes me wonder:

  • will these discoveries be saved to the database? I can see having this info stored being useful for analytical purposes
  • is it a temporary solution (the API part) until we implement a sort of classifier DAG or add the checks to the MediaStores classes?

I'm intrigued by the ES solution since I saw your proof-of-concept PR; I hadn't thought of it before. I'm worried it may put too much burden on the cluster, but I'm eager to see it working for us.

Comment on lines +354 to +357
For reviewers: **What are your feelings on this issue and why?**. Do you think it
is worth investing time up front in BlurHash, LQIP, or another solution; or is
it fine to push that off to a later date as a potential future improvement to
the feature?
Member

The CSS solution sounds good enough (I didn't even know there were other ways, TIL!), and I'm fine putting off a more advanced way of blurring the results. Indeed, I don't have strong opinions on this end, so I'll leave that to the rest of the reviewers.

Comment on lines +60 to +67
#### Designation of results with sensitive terms

Results that include sensitive terms in their textual content are so denoted in
the API results in a new result field named `sensitivity`. Sensitivity should be
an array that should be empty for results where `mature = false` and the textual
content does not contain any sensitive terms. For results marked as `mature`,
the array should include a `"mature"` string. For results with sensitive terms
in their textual content, the array should include a `"sensitive_text"` string.
Member

This is speaking of the Response serializers, which sounds great to me. However, do you know how the manually reported content will fit into this process? For example, will sensitive terms be included as tags? Do we need to add more columns? Will the api_mature<media> tables be deprecated?

Contributor Author

I hadn't considered the mature reports and had to look at the code to learn how they work again, specifically how the information gets incorporated into search. Forgive me if you already know this, but it was new to me and is necessary context for understanding my answer, so I'm including it for my sake and for others who might not be familiar.

When the mature report is created, the .save method updates the document in ES to set mature=True and forces a refresh of the index.
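In rough terms (a simplified sketch, not the actual Openverse code; the index name, document id, and client setup here are assumptions), that update amounts to something like:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed client setup

def confirm_mature_report(document_id: str, index: str = "image") -> None:
    # Flip the `mature` flag on the live document and refresh so the change
    # is visible to searches right away.
    es.update(index=index, id=document_id, doc={"mature": True}, refresh=True)
```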

There are a couple details here:

  1. This will automatically work with the proposed solution: the reports just get folded in with the rest of the documents already labelled as "mature". These documents would have the "mature" string in the sensitivity array.
  2. I believe this is slightly broken in that this information is not preserved between data refreshes. As far as I can tell, there is no mechanism to re-apply the setting of mature=true for results that have confirmed reports during data refresh. Once a new index is created, the mature setting on those documents will go back to false. I tried digging for something that would prevent this, but I can't find it. If you take a look at the code and confirm my suspicion, I'll open an issue for us to investigate how to fix this. I suspect we would need to use a backport strategy similar to the others under discussion to get the confirmed reports from the API to the catalogue. We could also use Elasticsearch ingestion pipelines to flip the mature flag on documents as they're being ingested, based on the mature report tables in the API database. Yet another alternative is to iterate through the mature report table after indexing and set the flag on the documents in a batch update (sketched roughly just below this list). All are probably complex solutions to implement, with their own pitfalls that would need to be considered.
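A rough sketch of that last, batch-update alternative (hypothetical names throughout; not a proposed implementation, just the general shape it could take):

```python
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

def reapply_mature_flags(es: Elasticsearch, index: str, confirmed_ids: list[str]) -> None:
    # One partial-update action per document with a confirmed mature report.
    actions = (
        {"_op_type": "update", "_index": index, "_id": doc_id, "doc": {"mature": True}}
        for doc_id in confirmed_ids
    )
    bulk(es, actions)
```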

So that being said, if the mature content reporting process didn't have that data refresh issue, I think it would work perfectly without modification with this proposal. If we fix that issue, then it will work with the proposal, but there's nothing this proposal needs to do to handle that bug. It is a separate bug and doesn't impede the sensitive terms detection discussed here. We would keep the mature tables around, provided we don't decide a different way of handling them is necessary in the course of fixing the bug described.

will sensitive terms be included as tags? Do we need to add more columns?

Do you mean tags on the Elasticsearch documents? I don't believe it's necessary, and it would make search more complicated, as we'd have to somehow prevent those sensitive terms from affecting the overall score of the document. Each tag would add an instance of the term to the document's overall textual content.

As for adding new columns, I don't believe it's necessary. Right now we're not classifying the particular terms into categories, so the only token that would exist in the sensitivity array is sensitive_text.
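Purely as an illustration of the shape of the field (a hypothetical Django REST Framework sketch, not the planned implementation; `mature` and `has_sensitive_text` are assumed attributes):

```python
from rest_framework import serializers

class MediaResultSerializer(serializers.Serializer):
    # Hypothetical sketch: field and attribute names are illustrative only.
    sensitivity = serializers.SerializerMethodField()

    def get_sensitivity(self, obj) -> list[str]:
        sensitivity = []
        if getattr(obj, "mature", False):
            sensitivity.append("mature")
        if getattr(obj, "has_sensitive_text", False):
            sensitivity.append("sensitive_text")
        return sensitivity
```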

I guess it is worth clarifying explicitly that documents labelled "mature" are separate entirely (at least conceptually) from those we find have potentially sensitive textual content.

I hope this answers your questions. Please do follow up if there's anything else I can clarify.

Contributor

Great call about considering the data refresh pieces of this! I looked at the code, and it appears that when building the index we do already consider deleted and mature records:

exists_in_deleted_table = SQL(exists_in_table).format(
    table=Identifier(f"api_deleted{model}"),
    identifier=Identifier(table, "identifier"),
    name=Identifier("deleted"),
)
exists_in_mature_table = SQL(exists_in_table).format(
    table=Identifier(f"api_mature{model}"),
    identifier=Identifier(table, "identifier"),
    name=Identifier("mature"),
)
return exists_in_deleted_table, exists_in_mature_table

That function is called in the update & reindex steps, and I believe the mature value returned there is applied to the index. So if the record exists in api_mature{media}, it will have mature=True in the resultant index:

deleted, mature = get_existence_queries(model_name, table_name)
query = SQL(
    "SELECT *, {deleted}, {mature} "

I was unsure about this myself because my initial assumption mirrored your own, Sara; I'm glad that @dhruvkb ensured that we handled this appropriately during the refresh 😄

sarayourfriend and others added 5 commits March 15, 2023 09:46
Co-authored-by: Zack Krida <zackkrida@pm.me>
…osal_detecting_sensitive_textual_content.md

Co-authored-by: Krystle Salazar <krystle.salazar@automattic.com>
Co-authored-by: Krystle Salazar <krystle.salazar@automattic.com>
@sarayourfriend
Contributor Author

Thanks for the feedback, Krystle. I think it's fine to discuss the broad strokes of the API implementation to make sure we've got the right idea in mind as far as how everything will wire up, and especially to make sure we're planning to implement things in the right places. As you mentioned, there is an alternative approach that could see this implemented partially at the catalogue level.

From what I read, the classification of the sensitive content will happen at query time, either in Python or ES.

This is mostly true, but I want to clarify that the initial filtering will always happen in ES, as a separate re-indexing step during data refresh. At least, that's the solution I found previously that avoids us needing to include the exclusion of sensitive terms in every single query that doesn't want them (the majority of queries).
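For illustration only (hypothetical index names, fields, and term list; the real details belong to the API implementation plan), that re-indexing step could use Elasticsearch's reindex API with a query that excludes documents matching any sensitive term:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed client setup

SENSITIVE_TERMS = ["example-term-a", "example-term-b"]  # placeholder list

# Copy only documents that match none of the sensitive terms into a
# hypothetical filtered index used by default searches.
es.reindex(
    body={
        "source": {
            "index": "image",
            "query": {
                "bool": {
                    "must_not": [
                        {
                            "multi_match": {
                                "query": term,
                                "fields": ["title", "description", "tags.name"],
                            }
                        }
                        for term in SENSITIVE_TERMS
                    ]
                }
            },
        },
        "dest": {"index": "image-filtered"},
    },
    wait_for_completion=True,
)
```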

That being said, there's the secondary issue of how to determine which results, in a set that we know might include sensitive results, are the sensitive ones. That's the part that is still up in the air and could happen in either ES or Python. I hope to know more about which approach we should take after getting answers back from folks more expert in Elasticsearch than me. I have one viable solution for doing it in Elasticsearch that I think would work very well and have little to no performance impact for us. The viability of that solution still needs to be verified with folks who know ES better than I do.

it will take us a significant amount of time to classify all our existing content

This is also my assumption. If we had the recrawler working and ready to go, we might be able to implement the classification work there and have it be part of the data before it is ingested into Elasticsearch. That would allow us to avoid the re-indexing with every data refresh. However, I don't believe we yet know how long the recrawling process will take. If there's more up-to-date information about this, then it's worth looking into whether the process takes a reasonable amount of time to run.

will these discoveries be saved to the database? I can see having this info stored being useful for analytical purposes

It would be great to have this data saved somewhere, both for analytical purposes and to save us from needing to redo the reindex into the filtered index at every data refresh (if we're able to save it in the catalogue). The query to find results that match the sensitive terms would not, theoretically, be a significant burden on our cluster, but I should test this during downtime with the production cluster to make sure. If that's the case, then we can at least easily query for all documents that include sensitive content and then do whatever we want with those results, within the limitations of Elasticsearch's pagination.

More comprehensive analysis that wouldn't be limited by ES's pagination would require saving the data elsewhere, hopefully backported to the catalogue or managed by the recrawler.

is it a temporary solution (the API part) until we implement a sort of classifier DAG or add the checks to the MediaStores classes?

Definitely! If there are any ways for us to get this information before documents are indexed, that is 100% preferable to this solution. As above, my understanding is that this isn't viable at the moment. I'm happy to explicitly label the querying approach a stop-gap until we're able to implement something less ad-hoc in the future.

Thanks again for the review and feedback!

@sarayourfriend sarayourfriend force-pushed the kickoff/detecting-sensitive-textual-content branch from 8b7bf53 to b1b562e on March 14, 2023 23:39
@sarayourfriend
Copy link
Contributor Author

Thanks @krysal and @AetherUnbound for the initial reviews 🙏 It sounds like everyone is on the same page as far as implementing blurring first with the CSS filter.

I wanted to highlight for both of y'all, as reviewers, this discussion that @obulat started: #873 (comment)

Do either of y'all have concerns over specifically blurring over other obfuscation methods like using a placeholder or hiding the image behind a solid colour with explanatory text? If so, do you feel it's necessary to hash out those particulars in this proposal or is it something you'd be comfortable handling as a separate proposal to be written and approved before the frontend implementation plan is written?

cc @panchovm as I believe the designs would need to keep this in mind as well. Have you got any opinion on whether we need to more carefully consider other obfuscation methods?

@fcoveram
Contributor

The proposal is very clear. Thanks for putting all the info and testing the possible blur implementations.

I agree with how Mastodon approached the feature, but for this project, I have no concerns over the solution we are taking. Regarding the blur effect, I am ok with any path we take as all three solutions are in line with the design idea I have discussed.

Contributor

@AetherUnbound AetherUnbound left a comment

Similarly, I (personally) think that the blurring approach is more appealing than using a placeholder image. As others stated, it gives variety to our search results page and gives the user some indication of what the content is. The placeholder would not give any context, and until we develop more detailed ways to provide users with context before they decide to view an image marked as sensitive media, this would mean that images anywhere along the gradient of "sensitive" could be behind every placeholder. I'd personally rather have a sense of what the image looks like before viewing it than have no clue whatsoever (especially since we also don't have the user-generated content warnings that Mastodon has).

Thanks for the clarifications on everything Sara, I think this proposal looks great! We can refine the Elasticsearch implementation details based on what we hear back from the Elasticsearch experts.

Openverse PRs automation moved this from Needs review to Reviewer approved Mar 15, 2023
Member

@krysal krysal left a comment

Thanks, Sara! Everything is clear with your thoughtful answers, so I feel confident approving this proposal ✅

Edit: About the blurring vs other obfuscation methods, I slightly prefer the blurring for the same reasons that Madison and Zack mention; nothing more to add.

@sarayourfriend sarayourfriend merged commit 2d65aa0 into main Mar 17, 2023
32 checks passed
Openverse PRs automation moved this from Reviewer approved to Merged! Mar 17, 2023
@sarayourfriend sarayourfriend deleted the kickoff/detecting-sensitive-textual-content branch March 17, 2023 01:07
@obulat obulat added the 🧭 project: proposal A proposal for a project label May 18, 2023
Labels
📄 aspect: text Concerns the textual material in the repository 🧰 goal: internal improvement Improvement that benefits maintainers, not users 🟧 priority: high Stalls work on the project or its dependents 🧭 project: proposal A proposal for a project 🧱 stack: api Related to the Django API
Projects
Status: Accepted
Openverse PRs: Merged!

8 participants