Muted Keywords #1144
Replies: 6 comments
-
How would this hold up to thousands of keywords? I ask because iterating through every single keyword might be very slow.
-
We could benchmark to find the maximum number of keywords at which retrieving a user's complete timeline isn't noticeably slower. We then cap the number of keywords a user is allowed to have at that value.
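A minimal sketch of how such a cap might be enforced when a keyword is added. The constant `MAX_MUTED_KEYWORDS`, its value of 100, and the `addMutedKeyword` helper are all hypothetical; the real limit would have to come from benchmarking.

```typescript
// Hypothetical cap; the real value would come from benchmarking timeline retrieval.
const MAX_MUTED_KEYWORDS = 100;

type AddResult =
  | { ok: true; keywords: string[] }
  | { ok: false; reason: string };

// Returns a new list with `keyword` added, or a failure once the cap is reached.
function addMutedKeyword(current: string[], keyword: string): AddResult {
  const trimmed = keyword.trim();
  if (trimmed.length === 0) {
    return { ok: false, reason: "keyword is empty" };
  }
  if (current.includes(trimmed)) {
    return { ok: true, keywords: current }; // already muted, nothing to do
  }
  if (current.length >= MAX_MUTED_KEYWORDS) {
    return { ok: false, reason: `limit of ${MAX_MUTED_KEYWORDS} keywords reached` };
  }
  return { ok: true, keywords: [...current, trimmed] };
}
```

Rejecting at insertion time keeps the per-request filtering cost bounded without needing any check on the read path.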
-
Sounds good then. I would make a PR, but I want to wait for the bsky devs to say something.
-
Some thoughts:
Should the lists be public and subscribeable, like mute lists?
Should the lists support any kind of normalization, e.g. "Word" == "word"? This also includes special characters and characters which look alike. This isn't required for an MVP, but would most certainly be nice to have.
More of an implementation detail, but what if every post returned includes a blocked word?
Regarding performance, I wouldn't care initially. From a high level this sounds like something that can be easily parallelised to reduce end-user latency.
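To make the normalization idea concrete, here is a rough sketch with two levels: a basic one that folds case, accents, and compatibility forms, and a more aggressive one that also maps a few lookalike characters. The function names and the tiny `LOOKALIKES` table are made up for illustration; a real table would be much larger and curated.

```typescript
// Hypothetical lookalike map; a real one would be far larger and curated.
const LOOKALIKES: Record<string, string> = {
  "0": "o", "1": "l", "3": "e", "4": "a", "5": "s", "@": "a", "$": "s",
};

// Folds case, accents, and compatibility forms (e.g. fullwidth letters).
function normalizeBasic(text: string): string {
  return text
    .normalize("NFKD")               // decompose accented and compatibility characters
    .replace(/[\u0300-\u036f]/g, "") // drop combining diacritical marks
    .toLowerCase();
}

// Additionally replaces characters from the lookalike map.
function normalizeAggressive(text: string): string {
  return Array.from(normalizeBasic(text))
    .map((ch) => LOOKALIKES[ch] ?? ch)
    .join("");
}

console.log(normalizeBasic("WORD"));      // "word"
console.log(normalizeBasic("w0rd"));      // "w0rd"
console.log(normalizeAggressive("w0rd")); // "word"
```

The aggressive variant rewrites ordinary digits too (for example "10" becomes "lo"), so it would raise the false-positive rate and is probably only worth it if evasion becomes a real problem.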
-
Great thoughts! Here are my responses:
It certainly wouldn't hurt for lists of keywords to be shared amongst users, but I'm not sure it justifies the added complexity. At the end of the day, these are simple words, and adding/removing words from your list isn't as complex as muting undesired actors. So I'm thinking this v1 can be focused on simple keyword lists. It shouldn't be too difficult to build off this and add functionality down the line to make subscribeable keyword lists.
I was thinking for now we'll have a lowercase-to-lowercase comparison ("WORD"=="word"). I agree that bad actors will always try to find a way around normalization filters, but I don't think it's worth the added time/compute power for now to come up with a list of possible leetspeak alternatives ("w0rd"=="word") unless we really start to notice abuse. Of course we can make this as complex as we want, like training an AI to compare words by Levenshtein distance. But I think you're gonna need another server and team for that 🤣 Though, something that comes to mind: should we check if a string is located within a string ("Melons are disgusting" -> "elon" ✅) or separate words by a set of delimiters ("Delicious apples" -> "apple" ❌)?
It would just return an empty |
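The substring-versus-delimiter question above can be made concrete with a small sketch. Both helper names and the delimiter set (whitespace plus common ASCII punctuation) are assumptions for illustration, not anything decided in this thread.

```typescript
// Assumed delimiter set: whitespace plus common ASCII punctuation.
const DELIMITERS = /[\s.,!?;:"'()\[\]{}]+/;

// Option 1: the keyword may appear anywhere inside the text.
function matchesSubstring(text: string, keyword: string): boolean {
  return text.toLowerCase().includes(keyword.toLowerCase());
}

// Option 2: the keyword must equal one of the delimiter-separated words.
function matchesWholeWord(text: string, keyword: string): boolean {
  const words = text.toLowerCase().split(DELIMITERS);
  return words.includes(keyword.toLowerCase());
}

console.log(matchesSubstring("Melons are disgusting", "elon")); // true
console.log(matchesWholeWord("Melons are disgusting", "elon")); // false
console.log(matchesSubstring("Delicious apples", "apple"));     // true
console.log(matchesWholeWord("Delicious apples", "apple"));     // false
```

Substring matching over-blocks ("elon" hides posts about melons), while strict whole-word matching under-blocks (muting "apple" misses "apples"), which is an argument for letting the user pick per keyword.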
-
As for matching the keyword itself, it might be best to keep it similar to Mastodon's. Here's roughly how it builds the matcher:
// Characters that have special meaning inside a regular expression.
const ESCAPE_RE = /[.*+?^${}()|[\]\\]/g;
const escape = (str: string) => {
  return str.replace(ESCAPE_RE, '\\$&');
};
// Test whether a keyword starts or ends with a mark, letter, number, or connector punctuation.
const WORD_START_RE = /^[\p{M}\p{L}\p{N}\p{Pc}]/u;
const WORD_END_RE = /[\p{M}\p{L}\p{N}\p{Pc}]$/u;
export interface KeywordFilter {
  keyword: string;
  whole: boolean;
}
// Builds one case-insensitive regex that is an alternation of all keywords.
export const createRegexMatcher = (filters: KeywordFilter[]) => {
  let str = '';
  let pfx = '';
  let sfx = '';
  for (let i = 0, l = filters.length; i < l; i++) {
    const { keyword, whole } = filters[i];
    str && (str += '|');
    if (whole) {
      // Only anchor with \b on the edges where the keyword has a word character.
      pfx = WORD_START_RE.test(keyword) ? '\\b' : '';
      sfx = WORD_END_RE.test(keyword) ? '\\b' : '';
      str += pfx + escape(keyword) + sfx;
    } else {
      str += escape(keyword);
    }
  }
  return new RegExp(str, 'i');
};
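For illustration, here is how a matcher like that might be applied to a feed. The `Post` shape and `filterPosts` helper are hypothetical, and the regex is written by hand to mirror what the function above would build for `[{ keyword: 'apple', whole: true }, { keyword: 'elon', whole: false }]`. One thing to watch: with zero filters the builder above ends up with `new RegExp('', 'i')`, which matches every string, so this sketch passes `null` for "nothing muted" instead.

```typescript
// Hypothetical post shape, for illustration only.
interface Post {
  uri: string;
  text: string;
}

// Hand-written equivalent of the generated pattern for
// [{ keyword: 'apple', whole: true }, { keyword: 'elon', whole: false }].
const exampleMatcher = /\bapple\b|elon/i;

// Pass null when no keywords are muted: an empty pattern would match every post.
function filterPosts(posts: Post[], matcher: RegExp | null): Post[] {
  if (matcher === null) return posts;
  const re = matcher;
  return posts.filter((post) => !re.test(post.text));
}

const examplePosts: Post[] = [
  { uri: "post-1", text: "Delicious apples" },      // kept: "apples" is not the whole word "apple"
  { uri: "post-2", text: "I ate an apple today" },  // hidden: whole-word match on "apple"
  { uri: "post-3", text: "Melons are disgusting" }, // hidden: substring match on "elon"
];

console.log(filterPosts(examplePosts, exampleMatcher).map((p) => p.uri)); // ["post-1"]
```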
-
As a Bluesky services developer, I would like users to be able to mute keywords on the timeline, so they can hide posts that contain certain keywords from their feed.
Background
This issue is a server-side companion to the following issue on the bluesky repo. While this popular issue could be solved with a client-side solution that parses the text of every feed record for keywords, a server-side implementation would prove more efficient and dynamic.
Proposed Implementation
A new database table, mutedKeywords, would be created along with two new app.bsky.graph lexicons, mutedKeyword and getMutedKeywords, the structures of which would look very similar to that of app.bsky.graph.block. Keywords can then be inserted into the table one at a time, along with the user's DID and a datetime, through a simple endpoint (with a deletion endpoint also in place). On retrieval of any feed, each post's record would be checked against the keywords stored for the requester's DID, should any exist. Posts that include any of those keywords would be excluded.
Some Tricky Situations
Should a keyword match anywhere inside a string ("Melons are disgusting".includes("elon") ✅)?
Or should the text be split into words by a set of delimiters first ("Delicious apples".split(" ").includes("apple") ❌)?
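The flow described under Proposed Implementation (insert, delete, then filter on feed retrieval) could be sketched roughly as follows. This uses an in-memory map as a stand-in for the mutedKeywords table; the class, method, and field names are hypothetical and are not the actual lexicon or database schema. Matching here is a plain case-insensitive substring check, which is only one of the options discussed above.

```typescript
// Hypothetical row and post shapes, for illustration only.
interface MutedKeywordRow {
  did: string;
  keyword: string;
  createdAt: string;
}
interface FeedPost {
  uri: string;
  text: string;
}

// In-memory stand-in for the proposed mutedKeywords table, keyed by DID.
class MutedKeywordStore {
  private rows = new Map<string, MutedKeywordRow[]>();

  add(did: string, keyword: string): void {
    const existing = this.rows.get(did) ?? [];
    if (existing.some((row) => row.keyword === keyword)) return; // already stored
    const row = { did, keyword, createdAt: new Date().toISOString() };
    this.rows.set(did, [...existing, row]);
  }

  remove(did: string, keyword: string): void {
    const existing = this.rows.get(did) ?? [];
    this.rows.set(did, existing.filter((row) => row.keyword !== keyword));
  }

  list(did: string): string[] {
    return (this.rows.get(did) ?? []).map((row) => row.keyword);
  }
}

// Drops posts whose text contains any of the requester's keywords
// (case-insensitive substring match).
function filterFeed(store: MutedKeywordStore, requesterDid: string, feed: FeedPost[]): FeedPost[] {
  const keywords = store.list(requesterDid).map((k) => k.toLowerCase());
  if (keywords.length === 0) return feed;
  return feed.filter((post) => {
    const text = post.text.toLowerCase();
    return !keywords.some((k) => text.includes(k));
  });
}
```

A requester with no stored keywords gets the feed back unchanged, so users who never mute anything pay almost nothing for the feature.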