Improved detection of URLs with a valid list of TLDs #74

Merged Jun 13, 2021 (9 commits)

@amadejpapez commented Jun 11, 2021

  • A script for fetching the list of valid TLDs was already added; this PR only plugs that list into the URL regex.
  • The URL regex no longer requires http, https or ftp at the start (Better URL Regex #45), unless the host is an IP address.
  • Added a few new tests. Removed the skip from tryhackme.com, as it now matches, and added a skip to http://папироска.рф, as it no longer matches. Browsers display the domain in that form but resolve it in its punycode form, https://xn--80aaxitdbjk.xn--p1ai/, which the regex does match.
  • Removed langdetect from pyproject.toml.
  • Reran poetry install, so poetry.lock is up to date.
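
To illustrate the behaviour described in the list above, here is a minimal sketch, not the actual pywhat regex. It assumes a tiny hard-coded `TLDS` list in place of the real TLD file, and the helper names `build_url_regex` and `to_punycode` are hypothetical. The scheme is optional for named hosts, required for IPv4 addresses, and non-ASCII hosts only match after conversion to punycode.

```python
import re

# Hypothetical stand-in for the TLD list; the real file has far more entries.
TLDS = ["com", "org", "io", "xn--p1ai"]


def build_url_regex(tlds):
    """Build a URL regex whose host must end in one of the given TLDs."""
    # Longest alternatives first so a short TLD never shadows a longer one.
    tld_group = "|".join(re.escape(t) for t in sorted(tlds, key=len, reverse=True))
    scheme = r"(?:https?|ftp)://"
    label = r"[a-z0-9](?:[a-z0-9-]*[a-z0-9])?"
    domain = rf"(?:{label}\.)+(?:{tld_group})"
    ipv4 = r"(?:\d{1,3}\.){3}\d{1,3}"
    tail = r"(?::\d{1,5})?(?:[/?#]\S*)?"
    # Scheme is mandatory for IP addresses, optional for named hosts.
    return re.compile(rf"(?:{scheme}{ipv4}|(?:{scheme})?{domain}){tail}", re.IGNORECASE)


def to_punycode(host):
    """Convert an internationalised host name to its ASCII (punycode) form."""
    # Python's built-in "idna" codec encodes each label separately.
    return host.encode("idna").decode("ascii")


URL_REGEX = build_url_regex(TLDS)

for candidate in [
    "tryhackme.com",
    "192.168.0.1",
    "http://192.168.0.1",
    "http://папироска.рф",
    "http://" + to_punycode("папироска.рф"),
]:
    print(candidate, bool(URL_REGEX.fullmatch(candidate)))
```

The sketch uses `fullmatch` to keep the cases easy to reason about; a text scanner would more likely use `search` or `finditer` with suitable word boundaries.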

@amadejpapez
Collaborator Author

I need to do some more regex testing, so this is not finished yet. :)

@bee-san
Owner

bee-san commented Jun 12, 2021

Want to review? There's a merge conflict too but it's probably just black :)

@amadejpapez
Collaborator Author

Should be good now! :)

Owner

@bee-san bee-san left a comment

good work!!!

Review threads:
pywhat/Data/tld_list.txt (outdated, resolved)
pywhat/regex_identifier.py (outdated, resolved)
scripts/get_tlds.py (resolved)
tests/test_regex_identifier.py (resolved)
@bee-san bee-san merged commit 4d8f83e into main Jun 13, 2021
@bee-san bee-san deleted the regex-tlds branch June 13, 2021 06:58