Enhance URL parsing in regex to exclude non-Latin alphabets at the end #946

notJoon · 2023-07-03T08:44:15Z

Description

The existing regex in the function detectLinkables also matches non-Latin characters when they are attached to a URL without a space. This leads to unexpected results with Asian languages (e.g. external links or usernames can't be opened).

This happens because many Asian languages attach grammatical elements, such as particles, directly to the end of a noun without a space.

To solve this issue, I modified the previous regex to ensure URLs end with Latin alphabet characters or numbers. The updated regex is as follows:

/((^|\s|()@[a-z0-9.-]+)|((^|\s|()https?:\/\/[\w.-]+[a-z0-9])|((^|\s|()(?<domain>[a-z][a-z0-9]*(.[a-z0-9]+)+)[a-z0-9]))))/gi

This regex will help to correctly parse URLs in contexts with mixed language usage.

Thanks!

modify detectLinkables

b72224c
