Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow filtering moderation queue by language #2161

Merged
merged 20 commits into from
Feb 19, 2024
Merged

Conversation

foysalit
Copy link
Contributor

@foysalit foysalit commented Feb 9, 2024

This PR adds language tracking to moderation subjects and allows queryModerationStatuses endpoint to filter the moderation queue by languages.

For record subjects, the record is fetched from the appview and the language is retrieved from langs property.
For repo subjects, the author's feed is fetched from the appview and all languages are collected from all posts in the first page of their timeline.

@@ -1,9 +1,5 @@
lockfileVersion: '6.0'

settings:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure why these changed 🤷🏽‍♂️

@@ -39,8 +39,8 @@ describe('moderation-statuses', () => {
}
const bobsPost = {
$type: 'com.atproto.repo.strongRef',
uri: sc.posts[sc.dids.bob][1].ref.uriStr,
cid: sc.posts[sc.dids.bob][1].ref.cidStr,
uri: sc.posts[sc.dids.bob][0].ref.uriStr,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This post has klingon lang set, the previous one doesn't.

@foysalit foysalit requested a review from bnewbold February 12, 2024 23:34
Copy link
Collaborator

@bnewbold bnewbold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we will want to generalize this to an internal flag/tag system, and have flags like lang-ja or lang:ja to do queries. That will let us do faceting/filtering/queues on arbitrary metadata, and make it way easier to add that kind of metadata/feature without doing a database migration etc.

Timeline wise, it would be great to get flags in before mod service launch.

I know the per-language queues is a pretty urgent need, and i'm not against going ahead with this in the shape it is now to get it out this week, but if it isn't too much work to do flags instead, i'd lean that way.

Curious what @devinivy thinks.

export async function up(db: Kysely<unknown>): Promise<void> {
await db.schema
.alterTable('moderation_subject_status')
.addColumn('langs', 'jsonb')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would the array datatype be easier here? I don't have a strong opinion

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm... do you have anything written down about the flag approach? just trying to imagine the design for this. do we just have a flags array column which stores arbitrary string flags? or do we need a more elaborate, many to many relationship set up to normalize the flags?
If we go with the former, it's straight forward of course.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already use jsonb for structured arrays in a couple places (i.e. blob cids), I'd be inclined to stick with that just by convention.

@bnewbold
Copy link
Collaborator

(moving flags conversation out of thread)

I was previously imagining that flags would basically work the same as labels, just be internal only. And we would want the ability to query/filter subjects by flag values.

I think that this would mean a new mod event type for changing flags on a subject. Unlike labels, flags don't need signatures, negations, or as much metadata about createdAt, etc; that metadata would be captured in the event log. I think there would basically be a one-to-many table of subjects and flag values. I could be wrong but I assume we don't need to fully normalize flag values out as a table (many-to-many), just use the strings as values (with an index I guess). Maybe a string array column on the subject table itself would work? I'm not sure, one-to-many feels like the usual way to do it. I wouldn't be against caching createAt or other metadata about flags as well, just don't have any product/feature need for it right now, I think.

Flags would get hydrated in to subject view responses.

A possible downside is if this creates a lot of event churn. For the current task of adding language tags to newly-reported content, i'm not sure if we'd want the ozone backend to just add the language tag as part of the report event, or if there would be a report event and then a separate "flag" event. If there isn't a flag event, there doesn't end up being provenance info about when and why the flag was created; on the other hand maybe we don't really need that for system flags like language. A related precedent is doing the takedown labels for a takedown event: this results in a label which doesn't have an associated label-mod-event (IIRC). Probably fine, but gets a bit at an abstraction question of whether the mod event log captures every fine detail or whether it is mostly to capture human and external-bot (eg, automod) interventions.

@devinivy
Copy link
Collaborator

@bnewbold I'm a bit inclined to capture the events, and then later make a judgement call about how/whether to present them.

Copy link
Collaborator

@bnewbold bnewbold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

quick review of modEventFlag lexicon. left some notes, but generally this shape/direction looks correct!

"modEventFlag": {
"type": "object",
"description": "Add/Remove a flag on a subject",
"required": ["add", "remove"],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do the arrays need to be required, or can they be omitted if empty?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to stay in line with the labelling lexicons where we require both create and negate properties. I don't think we allow defining required fields as "one or the other" which is probably why this is kinda needed?

"remove": {
"type": "array",
"items": { "type": "string" },
"description": "Flags to be removed to the subject. Ignores a flag If doesn't exists, won't be duplicated."
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Ignores a flag if it doesn't exists."

Comment on lines 77 to 80
"langs": {
"type": "array",
"items": { "type": "string" }
},
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could make these flags now, right? (maybe you were just focusing on the other lexicons for now)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought the api seems a bit nicer if we used specifically named params like this that ultimately searches against flags. but now realizing that it would make things difficult to add more filters so will move this to flags instead.

type: 'string',
},
description:
"Flags to be removed to the subject. Ignores a flag If doesn't exists, won't be duplicated.",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as other description

@@ -587,6 +587,23 @@
}
}
},
"modEventFlag": {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would definitely add optional comment field; I think every event type should have comment

Comment on lines 260 to 263
if (isModEventTag(event)) {
if (event.add.length) meta.addedTags = event.add.join(' ')
if (event.remove.length) meta.removedTags = event.remove.join(' ')
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason these are stuffed in meta rather than having their own jsonb columns (i.e. in line with labels)?

Copy link
Collaborator

@devinivy devinivy Feb 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another thing that's a little funky here is that the fields addedTags and removedTags are not reflected in the model (i.e. in the types for db/tables/). Even if they end-up in meta, we have access to the full range of jsonb, so I'd be all for modeling them as an array rather than a space-separated strings.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no reason in particular other than trying to keep the column count low. we would need 2 columns if we want to go with the label approach and for most rows these columns will just be nulls.

@@ -745,6 +771,7 @@ export class ModerationService {
nullsLast: true,
})

// console.log(paginatedBuilder.compile())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// console.log(paginatedBuilder.compile())

builder = builder.where(
sql<string>`${ref(
'moderation_subject_status.tags',
)} @> ${JSON.stringify(tags)}`,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I recall, I added a new jsonb() db type helper that would fit here.

const { ref } = this.db.db.dynamic
if (tags.length) {
builder = builder.where(
sql<string>`${ref(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This wont evaluate to a string, but the type also doesn't need to be preserved here anyway so just nixing it.

Suggested change
sql<string>`${ref(
sql`${ref(

createdBy,
})
} catch (err) {
console.error('Error getting record langs', err)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use the logger here rather than console.error() so that we can consume the log in a standard format.

@@ -163,6 +163,12 @@ export class ModerationViews {
eventView.event.sticky = true
}

if (event.action === 'com.atproto.admin.defs#modEventTag') {
eventView.event.add = event.meta?.addedTags?.toString().split(' ') || []
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A common issue with space-separated strings is that we need a little bit more logic than a split(' '). If the string is empty '' then the split gives you [''], when you probably want [].

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved to dedicated jsonb columns as you recommended above so we won't have to worry about the string related edge cases.

@@ -39,6 +39,9 @@ Object {
},
"subjectBlobCids": Array [],
"subjectRepoHandle": "alice.test",
"tags": Array [
"lang:unknown",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume we like the lang:unknown as it's explicit, versus not having a lang tag?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah and also, this helps reduce unnecessary calls to figure out lang tag on every event.
since this tagger runs on every event for a subject unless the subject already has a lang tag, failing to figure out the right tag once probably means future attempts will fail too so we will just avoid running it again.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the 3-char code und ("Undetermined") is a pseudo-specified way to represent this.

thinking ahead to UI stuff... we may want to have a special/specific drop-down/conversion for language tags, different from other tags? But I think we can get to that whenever, doesn't need to be first iteration.

Comment on lines 198 to 211
if (addedTags.length) {
const tags = currentStatus?.tags || []
newStatus.tags = jsonb([
...new Set([...tags, ...addedTags]),
]) as unknown as string[]
subjectStatus.tags = newStatus.tags
}
if (removedTags.length) {
const tags = currentStatus?.tags || []
newStatus.tags = jsonb(
tags.filter((tag) => !removedTags.includes(tag)),
) as unknown as string[]
subjectStatus.tags = newStatus.tags
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems that adding and removing a tag at the same time would respect the removal but not the addition.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed and cleaned up the logic here. there's also a test for this now. however, one edge case here is if a request comes in with add: ['x'], remove: ['x']. with the current implementation, we will remove x and not add.

event: {
$type: 'com.atproto.admin.defs#modEventTag',
add: recordLangs
? recordLangs.map((lang) => `lang:${lang}`)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small thing, just an idea. But the language on a post could be highly specific, anything within BCP 47. I wonder if for the purposes here it would make sense to use only the primary language tag. E.g. en-US would be boiled down to en, not distinguishing between US english and british english en-GB.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes! I think this would be good to do

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the the first fragment of the hyphenated codes is the generalized code? That's what I'm storing now but not sure if we would want the full code to be there too?

@bnewbold
Copy link
Collaborator

seems like this backend part has moved in a good direction. LGTM; i'll leave to dan/divy to review the actual implementation.

@bnewbold
Copy link
Collaborator

Cross-posting from another issue: I think we need the ability to filter on "not-tag" in addition to tag. (not sure what the query param name should be for that)

@foysalit
Copy link
Contributor Author

Fixed up some minor issues and added a way to retrieve subjects that do not have certain tags using a excludeTags filter param. @devinivy / @bnewbold waiting on your final ✅ on this.

Copy link
Collaborator

@devinivy devinivy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@devinivy devinivy merged commit d815732 into main Feb 19, 2024
10 checks passed
@devinivy devinivy deleted the mod-subject-language-filter branch February 19, 2024 15:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants