add support for public suffix extraction #14 #17
Codecov Report
@@ Coverage Diff @@
## main #17 +/- ##
==========================================
- Coverage 25.73% 25.50% -0.24%
==========================================
Files 5 7 +2
Lines 3913 3949 +36
==========================================
Hits 1007 1007
- Misses 2906 2942 +36
|
missing all these cases:
psl::public_suffix("https://www.takatoukiter.asakshfakjf.yokohama.jp")
#> [1] "asakshfakjf.yokohama.jp"
adaR::public_suffix("https://www.takatoukiter.asakshfakjf.yokohama.jp")
#> [1] "jp"
Created on 2023-09-23 with reprex v2.0.2 |
|
@schochastics I tried this in Ruby (I think it is probably more authoritative than
require 'uri/http'
require 'public_suffix'
def parse(url)
uri = URI.parse(url)
domain = PublicSuffix.parse(uri.host, ignore_private: true)
puts domain.domain
puts PublicSuffix.valid?(uri.host)
end
urls = ["https://www.takatoukiter.asakshfakjf.yokohama.jp",
"https://domain.api.gov.uk/page?q=1234#abcd",
"https://domain.com/page?q=1234#abcd",
"https://blogspot.com"] ## private domain
for url in urls do
parse(url)
end
edit: this is the equivalent of |
require 'uri/http'
require 'public_suffix'
def parse(url)
uri = URI.parse(url)
domain = PublicSuffix.parse(uri.host, ignore_private: true)
puts domain.tld
puts PublicSuffix.valid?(uri.host)
end
urls = ["https://www.takatoukiter.asakshfakjf.yokohama.jp",
"https://domain.api.gov.uk/page?q=1234#abcd",
"https://whatever.domain.com/page?q=1234#abcd",
"https://a.b.c.kobe.jp/abc",
"https://blogspot.com"] ## private domain
for url in urls do
parse(url)
end
This one indeed gives |
Thanks for the help! Should we rename it to tld_extract maybe once it works? |
@schochastics It depends on the purpose. If what we want is It's so darn complicated. |
@chainsawriot yeah, it is crazy. I guess in the best case we can accommodate both tld and registrar, or just the one that the webtrack people need. Can you see how the rust lib gets gov.uk right? AFAIS all these libs just work with the dat file as input, and I don't know how they get gov.uk instead of api.gov.uk. What you are wondering about are the wildcard cases. There is a tld entry *.kawasaki.jp, so anything.kawasaki.jp is apparently a tld. We definitely need to be more explicit (for the layperson) with the naming. So tld_extract and registrar_extract maybe. |
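To make the wildcard case concrete, here is a minimal Ruby sketch of how a single list rule could be matched against a host name, comparing labels from right to left and letting `*` stand for exactly one label. `suffix_for_rule` is a hypothetical helper written for illustration; it is not part of adaR, psl, or any of the libraries mentioned in this thread, and it ignores exception rules.

```ruby
# Sketch: match ONE Public Suffix List rule against a host name.
# `suffix_for_rule` is a hypothetical helper, not an API of any real library.
def suffix_for_rule(rule, host)
  rule_labels = rule.downcase.split(".").reverse
  host_labels = host.downcase.split(".").reverse
  # A rule with more labels than the host can never match.
  return nil if rule_labels.length > host_labels.length

  # Compare labels right to left; "*" matches exactly one arbitrary label.
  matches = rule_labels.each_with_index.all? do |label, i|
    label == "*" || label == host_labels[i]
  end
  return nil unless matches

  # The suffix takes as many host labels as the rule has labels.
  host_labels.first(rule_labels.length).reverse.join(".")
end

puts suffix_for_rule("*.kawasaki.jp", "www.anything.kawasaki.jp") # anything.kawasaki.jp
puts suffix_for_rule("jp", "www.anything.kawasaki.jp")            # jp
p suffix_for_rule("*.kawasaki.jp", "kawasaki.jp")                 # nil
```

When several rules match the same host, the one with the most labels is normally taken, which is why the wildcard rule beats the plain `jp` rule here.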
This (js) actually gives "api.gov.uk". I think there is no authoritative answer. I think all we can say is "x compatible", e.g. libpsl compatible.
var psl = require("psl");
var parsed = psl.parse('domain.api.gov.uk');
console.log(parsed.tld); |
Another simple option is to return TLD (the very strict one) and the matched entry in psl.
adaR::extract_tld("https://www.takatoukiter.asakshfakjf.yokohama.jp") ## asakshfakjf.yokohama.jp
adaR::extract_psl_entry("https://www.takatoukiter.asakshfakjf.yokohama.jp") ## *.yokohama.jp
adaR::extract_tld("https://domain.api.gov.uk") ## gov.uk, libpsl compatible
adaR::extract_psl_entry("https://domain.api.gov.uk") ## api.gov.uk |
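A rough Ruby sketch of the "return both" idea: one lookup that yields the extracted suffix together with the list entry that produced it. Everything here is an assumption for illustration: `psl_lookup` and `build_result` are hypothetical names, and `SAMPLE_RULES` is a tiny hand-picked sample rather than the real list. The selection logic follows the usual reading of the list format: an exception rule (`!`) wins outright, otherwise the matching rule with the most labels wins, and `*` is the fallback.

```ruby
# Sketch only: SAMPLE_RULES is a hand-picked sample, NOT the real list.
SAMPLE_RULES = ["jp", "*.yokohama.jp", "!city.yokohama.jp", "uk", "gov.uk"].freeze

# Hypothetical helper: build the result from the n rightmost host labels.
def build_result(host_labels, rule, n)
  { suffix: host_labels.first(n).reverse.join("."), entry: rule }
end

# Hypothetical helper: return the suffix and the rule that matched it.
def psl_lookup(host, rules = SAMPLE_RULES)
  host_labels = host.downcase.split(".").reverse
  best = nil # [rule text, number of labels in the suffix]

  rules.each do |rule|
    exception = rule.start_with?("!")
    labels = rule.delete_prefix("!").split(".").reverse
    next if labels.length > host_labels.length
    next unless labels.each_with_index.all? { |l, i| l == "*" || l == host_labels[i] }

    # An exception rule prevails; its suffix drops the leftmost rule label.
    return build_result(host_labels, rule, labels.length - 1) if exception

    best = [rule, labels.length] if best.nil? || labels.length > best[1]
  end

  # No rule matched: fall back to the implicit "*" rule (last label only).
  best ||= ["*", 1]
  build_result(host_labels, best[0], best[1])
end

p psl_lookup("www.takatoukiter.asakshfakjf.yokohama.jp")
p psl_lookup("www.city.yokohama.jp")
```

With this sample, the first call pairs the suffix `asakshfakjf.yokohama.jp` with the entry `*.yokohama.jp`, and the second shows the exception rule `!city.yokohama.jp` reducing the suffix to `yokohama.jp`.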
|
@schochastics Now I REALLY know what the problem is: The psl is actually two lists.
"api.gov.uk" is actually in the private domains section and therefore should not be matched; "gov.uk" should therefore be the definitive correct answer. So, the js implementation is incorrect. For us, the solution is to split the list into two. |
Dear lord, thanks for the detective work. Then I'll split the file. Do we even need the private stuff? |
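For reference, a minimal Ruby sketch of splitting the list text into its two sections. It relies on the `// ===BEGIN ... DOMAINS===` and `// ===END ... DOMAINS===` marker comments that delimit the ICANN and private sections in the list file. `split_psl` is a hypothetical helper, and the sample text is a shortened stand-in that only mirrors the entries mentioned in this thread, not the real file.

```ruby
# Sketch: split Public Suffix List text into ICANN and private rule lists.
# `split_psl` is a hypothetical helper written for illustration.
def split_psl(text)
  sections = { icann: [], private: [] }
  current = nil # section currently being read, or nil when outside both

  text.each_line do |raw|
    line = raw.strip
    case line
    when "// ===BEGIN ICANN DOMAINS===" then current = :icann
    when "// ===BEGIN PRIVATE DOMAINS===" then current = :private
    when "// ===END ICANN DOMAINS===", "// ===END PRIVATE DOMAINS===" then current = nil
    else
      # Skip blank lines, ordinary comments, and anything outside a section.
      next if current.nil? || line.empty? || line.start_with?("//")
      # A rule ends at the first whitespace on its line.
      sections[current] << line.split(/\s+/).first
    end
  end
  sections
end

# Shortened stand-in for the real file (assumption, for illustration only).
sample = <<~PSL
  // ===BEGIN ICANN DOMAINS===
  // uk
  uk
  gov.uk
  // ===END ICANN DOMAINS===
  // ===BEGIN PRIVATE DOMAINS===
  api.gov.uk
  blogspot.com
  // ===END PRIVATE DOMAINS===
PSL

lists = split_psl(sample)
p lists[:icann]   # ["uk", "gov.uk"]
p lists[:private] # ["api.gov.uk", "blogspot.com"]
```

Matching only against the `:icann` list would then give `gov.uk` for `domain.api.gov.uk`, while including `:private` would give `api.gov.uk`.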
@schochastics Well, unless you want to check for the ICANN compliance. We can implement it when we need that. For now, YAGNI. |
Of course, you can also have |
I will keep that for future work |
Sanity check with psl package.
Created on 2023-09-23 with reprex v2.0.2
The result from psl should be authoritative. api.gov.uk does appear in the psl list, but I am not sure why it is apparently not the right solution here.