feat: accept more social media patterns#1286
B4nan merged 11 commits into apify:master from lhotanok:accept-more-social-media-patterns
 * https://www.tiktok.com/trending?shareId=123456789
 * https://www.tiktok.com/embed/123456789
 * https://m.tiktok.com/v/123456789
 * https://www.tiktok.com/@user
 * https://www.tiktok.com/@user-account.pro
 * https://www.tiktok.com/@user/video/123456789
examples are nice, but tests better :]
https://github.com/apify/apify-js/blob/master/test/utils_social.test.js
Sure, I noticed the failing tests just after I created the PR, so I'll resolve them now. They are most likely caused by the non-capturing groups not being set correctly. I'll also try to add new tests to https://github.com/apify/apify-js/blob/master/test/utils_social.test.js 🙂
B4nan
left a comment
cc @metalwarrior665 @pocesar, would be great if you could check this if it works as expected
It feels a bit weird to detect that email as a LinkedIn URL. I would also expect that when we detect a URL without the http prefix, we add the prefix automatically, so the return value is guaranteed to be a valid list of URLs. That is not currently true, but given the existing tests, it behaved the same way before in some cases.
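The prefix normalization suggested here could look roughly like the sketch below. `ensureUrlScheme` is a hypothetical helper, not an existing function in the library, and defaulting to `https` is an assumption.

```javascript
// Hypothetical helper (not part of the existing utils): prepend a scheme
// to matches that were found without an http(s) prefix.
const ensureUrlScheme = (match, scheme = 'https') => (
    /^https?:\/\//i.test(match) ? match : `${scheme}://${match}`
);

// Example matches, one without and one with a scheme.
const matches = ['www.tiktok.com/@user', 'https://discord.gg/abc123'];
const urls = matches.map((match) => ensureUrlScheme(match));

// Every normalized entry can now be parsed by the WHATWG URL constructor.
const hostnames = urls.map((url) => new URL(url).hostname);
console.log(urls, hostnames);
```

Doing this as a post-processing step keeps the regexes themselves unchanged, so existing tests on the raw matches would not need to be rewritten.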
@B4nan I'll add new tests for TikTok, Discord and Pinterest in the next commit.
I don't mind, it can just as well be a separate PR.
Tests finally updated 😄 Please, check if capturing groups for
const PINTEREST_REGEX_STRING = '(?<!\\w)(?:http(?:s)?:\\/\\/)?(?:(?:(?:(?:www\\.)?pinterest(?:\\.com|(?:\\.[a-z]{2}){1,2}))|(?:[a-z]{2})\\.pinterest\\.com)(?:\\/))((pin\\/[0-9]{2,50})|((?!pin)[a-z0-9\\-_\\.]+(\\/[a-z0-9\\-_\\.]+)?))(?:\\/)?';
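To make the capturing groups easier to review, here is a minimal sketch that runs the Pinterest pattern from the diff above against two sample URLs. Wrapping the string in `^...$` with the `i` flag is an assumption about how a single-URL check would be built, and the sample URLs are made up for illustration.

```javascript
// Copied verbatim from the diff above.
// eslint-disable-next-line max-len
const PINTEREST_REGEX_STRING = '(?<!\\w)(?:http(?:s)?:\\/\\/)?(?:(?:(?:(?:www\\.)?pinterest(?:\\.com|(?:\\.[a-z]{2}){1,2}))|(?:[a-z]{2})\\.pinterest\\.com)(?:\\/))((pin\\/[0-9]{2,50})|((?!pin)[a-z0-9\\-_\\.]+(\\/[a-z0-9\\-_\\.]+)?))(?:\\/)?';

// Assumption: anchored, case-insensitive variant for checking one URL at a time.
const PINTEREST_REGEX = new RegExp(`^${PINTEREST_REGEX_STRING}$`, 'i');

const pin = PINTEREST_REGEX.exec('https://www.pinterest.com/pin/123456789');
const board = PINTEREST_REGEX.exec('https://www.pinterest.com/user/board');

// Group 1 is the whole path, group 2 the pin branch,
// group 3 the user(/board) branch, group 4 the optional board segment.
console.log(pin.slice(1));   // [ 'pin/123456789', 'pin/123456789', undefined, undefined ]
console.log(board.slice(1)); // [ 'user/board', undefined, 'user/board', '/board' ]
```

Only the domain and scheme groups are non-capturing here; the four path groups are all captured, which is worth confirming against what the tests expect.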
// eslint-disable-next-line max-len, quotes
const DISCORD_REGEX_STRING = '(?<!\\w)(?:https?:\\/\\/)?(?:www\\.)?((?:(?:discord|discordapp)\\.com\\/channels(?:\\/)[0-9]{2,50}(\\/[0-9]{2,50})*)|(?:(?:discord\\.(?:com|me|li|gg|io)|discordapp\\.com)(?:\\/invite)?)\\/(?!channels)[a-z0-9\\-_]{2,50})(?:\\/)?';
Just noticed this, it would be nice if it matched Discord's other subdomains (mainly ptb and canary) as well 🙏
Also, is it intended that the URL will only match message links to messages in servers? When a message link to a message in DMs is sent, it has the format /channels/@me/id 👀
Meaning https://canary.discord.com/ and https://ptb.discord.com/? Never encountered these before 😄 Could be added though
I thought we probably don't want to collect /@me/... links as these are typically not listed on the web among social media links and contacts, right? But maybe given this logic, we should exclude links to messages in servers as well. 🤔 I'm not sure whether they are listed somewhere publicly or not.
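For reference, a rough sketch of how the `ptb` and `canary` subdomains could be accepted. The pattern is the one from the diff above; the `DISCORD_WITH_SUBDOMAINS` variant, the `anchored` helper and the sample URLs are hypothetical and only illustrate the idea. DM links (`/channels/@me/...`) stay unmatched in both variants.

```javascript
// Copied verbatim from the diff above.
// eslint-disable-next-line max-len
const DISCORD_REGEX_STRING = '(?<!\\w)(?:https?:\\/\\/)?(?:www\\.)?((?:(?:discord|discordapp)\\.com\\/channels(?:\\/)[0-9]{2,50}(\\/[0-9]{2,50})*)|(?:(?:discord\\.(?:com|me|li|gg|io)|discordapp\\.com)(?:\\/invite)?)\\/(?!channels)[a-z0-9\\-_]{2,50})(?:\\/)?';

// Hypothetical variant: widen the optional "www." prefix to ptb and canary as well.
const DISCORD_WITH_SUBDOMAINS = DISCORD_REGEX_STRING.replace('(?:www\\.)?', '(?:(?:www|ptb|canary)\\.)?');

// Assumption: anchored, case-insensitive regex for checking one URL at a time.
const anchored = (source) => new RegExp(`^${source}$`, 'i');
const current = anchored(DISCORD_REGEX_STRING);
const extended = anchored(DISCORD_WITH_SUBDOMAINS);

const canaryUrl = 'https://canary.discord.com/channels/231496023303957476/2332823543826404586';
const dmUrl = 'https://discord.com/channels/@me/2332823543826404586';

console.log(current.test(canaryUrl));  // false
console.log(extended.test(canaryUrl)); // true
console.log(extended.test(dmUrl));     // false
```

If server message links end up being excluded too, the first alternative inside the capturing group would be the part to drop.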

Fixes vdrmota/Social-Media-and-Contact-Info-Extractor/#21