Skip to content

feat: accept more social media patterns#1286

Merged
B4nan merged 11 commits intoapify:masterfrom
lhotanok:accept-more-social-media-patterns
Feb 18, 2022
Merged

feat: accept more social media patterns#1286
B4nan merged 11 commits intoapify:masterfrom
lhotanok:accept-more-social-media-patterns

Conversation

@lhotanok
Copy link
Copy Markdown
Contributor

Fixes vdrmota/Social-Media-and-Contact-Info-Extractor/#21

@lhotanok lhotanok changed the title Accept more social media patterns feat: accept more social media patterns Jan 25, 2022
Copy link
Copy Markdown
Member

@B4nan B4nan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just few nit picks, it looks good to me, but we need lots of new tests (and existing tests need adjustments)

Comment thread src/utils_social.js Outdated
Comment thread src/utils_social.js Outdated
Comment thread src/utils_social.js
Comment on lines +529 to +534
* https://www.tiktok.com/trending?shareId=123456789
* https://www.tiktok.com/embed/123456789
* https://m.tiktok.com/v/123456789
* https://www.tiktok.com/@user
* https://www.tiktok.com/@user-account.pro
* https://www.tiktok.com/@user/video/123456789
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment thread src/utils_social.js Outdated
Comment thread src/utils_social.js Outdated
Comment thread src/utils_social.js Outdated
@lhotanok
Copy link
Copy Markdown
Contributor Author

lhotanok commented Jan 26, 2022

just few nit picks, it looks good to me, but we need lots of new tests (and existing tests need adjustments)

Sure, I noticed the failing tests just after I created PR, so I'll resolve them now. It should mostly be related to not setting the non-capturing groups correctly. Also, I'll try to add new tests into https://github.com/apify/apify-js/blob/master/test/utils_social.test.js 🙂

Copy link
Copy Markdown
Member

@B4nan B4nan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @metalwarrior665 @pocesar, would be great if you could check this if it works as expected

it feels a bit weird to detect that email as linked in URL, and I would maybe expect that if we detect URL without http prefix, we'll add it automatically, so the return value is guaranteed to be valid list of URLs (which is not currently true, but given the existing tests, it was behaving like that before too for some cases)

Comment thread src/utils_social.js Outdated
@lhotanok
Copy link
Copy Markdown
Contributor Author

lhotanok commented Feb 3, 2022

@B4nan I'll add new tests for TikTok, Discord and Pinterest in the next commit.
By the way, do we want to add more websites within this patch? @zpelechova suggested Reddit

@B4nan
Copy link
Copy Markdown
Member

B4nan commented Feb 3, 2022

I don't mind, can be as well separate PR.

Comment thread src/utils_social.js
@lhotanok
Copy link
Copy Markdown
Contributor Author

lhotanok commented Feb 4, 2022

I don't mind, can be as well separate PR.

Tests finally updated 😄

Please, check if capturing groups for matches[1] are acceptable this way. E.g. I tried to capture only pin id for Pinterest or invite id for Discord but the regex handles multiple different patterns so it matches prefixes as well - such as pin/123456789, username/board, discord.com/invite/jyEM2PRvMU , discordapp.com/channels/231496023303957476 or trending?shareId=123456789 vs embed/123456789 for TikTok).

Comment thread src/utils_social.js Outdated
const PINTEREST_REGEX_STRING = '(?<!\\w)(?:http(?:s)?:\\/\\/)?(?:(?:(?:(?:www\\.)?pinterest(?:\\.com|(?:\\.[a-z]{2}){1,2}))|(?:[a-z]{2})\\.pinterest\\.com)(?:\\/))((pin\\/[0-9]{2,50})|((?!pin)[a-z0-9\\-_\\.]+(\\/[a-z0-9\\-_\\.]+)?))(?:\\/)?';

// eslint-disable-next-line max-len, quotes
const DISCORD_REGEX_STRING = '(?<!\\w)(?:https?:\\/\\/)?(?:www\\.)?((?:(?:discord|discordapp)\\.com\\/channels(?:\\/)[0-9]{2,50}(\\/[0-9]{2,50})*)|(?:(?:discord\\.(?:com|me|li|gg|io)|discordapp\\.com)(?:\\/invite)?)\\/(?!channels)[a-z0-9\\-_]{2,50})(?:\\/)?';
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just noticed this, it would be nice if it matched Discord's other subdomains (mainly ptb and canary) as well 🙏

Also, is it intended that the URL will only match message links to messages in servers? When a message link to a message in DMs is sent, it has the format /channels/@me/id 👀

Copy link
Copy Markdown
Contributor Author

@lhotanok lhotanok Feb 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Meaning https://canary.discord.com/ and https://ptb.discord.com/? Never encountered these before 😄 Could be added though

I thought we probably don't want to collect /@me/... links as these are typically not listed on the web among social media links and contacts, right? But maybe given this logic, we should exclude links to messages in servers as well. 🤔 I'm not sure whether they are listed somewhere publicly or not.

@B4nan B4nan merged commit d1fdf65 into apify:master Feb 18, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants