Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add support for public suffix extraction #14 #17

Merged
merged 10 commits into from
Sep 24, 2023
Merged

add support for public suffix extraction #14 #17

merged 10 commits into from
Sep 24, 2023

Conversation

schochastics
Copy link
Member

Sanity check with psl package.

urls <- c(
'https://subsub.sub.domain.co.uk/page?q=1234#abcd',
'https://domain.api.gov.uk/page?q=1234#abcd',
'https://domain.co.uk/page?q=1234#abcd',
'https://www.domain.com/page?q=1234#abcd',
'https://domain.com/page?q=1234#abcd'
)

adaR::public_suffix(urls)
#> [1] "co.uk"      "api.gov.uk" "co.uk"      "com"        "com"
hosts <- adaR::ada_get_hostname(urls)
psl::public_suffix(hosts)
#> [1] "co.uk"  "gov.uk" "co.uk"  "com"    "com"

Created on 2023-09-23 with reprex v2.0.2

The result from psl should be authoritative. api.gov.uk does appear in the psl list, but not sure why it is apparently not the right solution here

@codecov-commenter
Copy link

codecov-commenter commented Sep 23, 2023

Codecov Report

Merging #17 (4fa0070) into main (a3fbc40) will decrease coverage by 0.24%.
Report is 2 commits behind head on main.
The diff coverage is 0.00%.

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

@@            Coverage Diff             @@
##             main      #17      +/-   ##
==========================================
- Coverage   25.73%   25.50%   -0.24%     
==========================================
  Files           5        7       +2     
  Lines        3913     3949      +36     
==========================================
  Hits         1007     1007              
- Misses       2906     2942      +36     
Files Changed Coverage Δ
R/psl.R 0.00% <0.00%> (ø)

... and 5 files with indirect coverage changes

@schochastics
Copy link
Member Author

missing all these cases:

*.kawasaki.jp
*.kitakyushu.jp
*.kobe.jp
*.nagoya.jp
*.sapporo.jp
*.sendai.jp
*.yokohama.jp
psl::public_suffix("https://www.takatoukiter.asakshfakjf.yokohama.jp")
#> [1] "asakshfakjf.yokohama.jp"
adaR::public_suffix("https://www.takatoukiter.asakshfakjf.yokohama.jp")
#> [1] "jp"

Created on 2023-09-23 with reprex v2.0.2

@schochastics
Copy link
Member Author

  • Not utf8 safe yet.

  • api.gov.uk remains a mystery. All parsers I tried return gov.uk although api.gov.uk is in the list

@chainsawriot
Copy link
Collaborator

chainsawriot commented Sep 24, 2023

@schochastics I tried this in Ruby (I think it is probably more authoritative than psl the R package, because GitHub Pages also uses public_suffix)

require 'uri/http'
require 'public_suffix'

def parse(url)
  uri = URI.parse(url)
  domain = PublicSuffix.parse(uri.host, ignore_private: true)
  puts domain.domain
  puts PublicSuffix.valid?(uri.host)
end

urls = ["https://www.takatoukiter.asakshfakjf.yokohama.jp",
        "https://domain.api.gov.uk/page?q=1234#abcd",
        "https://domain.com/page?q=1234#abcd",
        "https://blogspot.com"] ## private domain

for url in urls do
  parse(url)
end

api.gov.uk is indeed returned.

edit: this is the equivalent of apex_domain; finding a way to get apex_suffix

@chainsawriot
Copy link
Collaborator

require 'uri/http'
require 'public_suffix'

def parse(url)
  uri = URI.parse(url)
  domain = PublicSuffix.parse(uri.host, ignore_private: true)
  puts domain.tld
  puts PublicSuffix.valid?(uri.host)
end

urls = ["https://www.takatoukiter.asakshfakjf.yokohama.jp",
        "https://domain.api.gov.uk/page?q=1234#abcd",
        "https://whatever.domain.com/page?q=1234#abcd",
        "https://a.b.c.kobe.jp/abc",
        "https://blogspot.com"] ## private domain

for url in urls do
  parse(url)
end

This one indeed gives gov.uk. ok, now I understand what the issue is.

@schochastics
Copy link
Member Author

schochastics commented Sep 24, 2023

Thanks for the help! Should we rename it to tld_extract maybe once it works?

@chainsawriot
Copy link
Collaborator

@schochastics It depends on the purpose. If what we want is tld (top-level domain), then I think "gov.uk" is probably correct (although I don't understand "asakshfakjf.yokohama.jp" as a top-level domain). However, if the purpose is to identify the registrar to whom the domain name "domain.api.gov.uk" is registered, then I think "api.gov.uk" is a better answer (and one can look at the list to see the contact person).

It's so darn complicated.

@schochastics
Copy link
Member Author

schochastics commented Sep 24, 2023

@chainsawriot yeah it is crazy. Guess best case we can accomodate both tld and registrar or the one that the webtrack people need. Can you see how the rust lib gets gov.uk right? AFAIS all this libs just work with the dat file as input and dont I know how they get gov.uk instead of api.gov.uk.

What you are wondering about are the wildcard cases. So there are tld *.kawasaki.jp and anything.kawasaki.jp is apparently a tld.

We definitely need to be more explicit (for the layperson) with the naming. So tld_extract and registrar_extract maybe.

@chainsawriot
Copy link
Collaborator

This (js) actually gives "api.gov.uk". I think there is no authoritative answer. I think what we can only say is "x compatible", e.g. libpsl compatible.

var psl = require("psl");
var parsed = psl.parse('domain.api.gov.uk');
console.log(parsed.tld);

@chainsawriot
Copy link
Collaborator

Another simple option is to return TLD (the very strict one) and the matched entry in psl.

adaR::extract_tld("https://www.takatoukiter.asakshfakjf.yokohama.jp") ## asakshfakjf.yokohama.jp
adaR::extract_psl_entry("https://www.takatoukiter.asakshfakjf.yokohama.jp") ## *.yokohama.jp
adaR::extract_tld("https://domain.api.gov.uk") ## gov.uk, libpsl compatible
adaR::extract_psl_entry("https://domain.api.gov.uk") ## api.gov.uk

@schochastics
Copy link
Member Author

schochastics commented Sep 24, 2023

@chainsawriot ok then we stick with what we get out of the list for now (ie what we have). seems less painful. @chainsawriot sry, answered before I saw your next answer. I like this approach. Do you have an idea howto do that efficiently-ish?

@chainsawriot
Copy link
Collaborator

@schochastics Now I REALLY know what the problem is:

The psl is actually two lists.

  • ICANN DOMAINS
  • PRIVATE DOMAINS

"api.gov.uk" is actually in the private domains, and therefore should not be matched; therefore, "gov.uk" should the definitive correct answer. So, the js implementation is incorrect.

For us, the solution is to split the list into two.

@schochastics
Copy link
Member Author

Dear lord thanks for the detective work. Then I split the file. Do we even need the private stuff?

@chainsawriot
Copy link
Collaborator

@schochastics Well, unless you want to check for the ICANN compliance. We can implement it when we need that. For now, YAGNI.

@chainsawriot
Copy link
Collaborator

Of course, you can also have adaR::public_suffix(domain, private = TRUE).

@schochastics
Copy link
Member Author

Of course, you can also have adaR::public_suffix(domain, private = TRUE).

I will keep that for future work

@schochastics schochastics marked this pull request as ready for review September 24, 2023 12:09
@schochastics schochastics merged commit 2ef77ef into main Sep 24, 2023
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants