add ada_get_domain #43

Closed
schochastics opened this issue Sep 26, 2023 · 8 comments · Fixed by #46
@schochastics
Member

Kindly requested by the webtrack team:

ada_get_domain("https://subsub.sub.domain.co.uk")
#> domain.co.uk

Just gluing together some existing functions.

@schochastics schochastics self-assigned this Sep 26, 2023
@chainsawriot
Collaborator

@schochastics
Member Author

schochastics commented Sep 26, 2023

Is there a get_domain hidden somewhere in ada-url? I haven't found anything.
This is where I am now, but it does not catch all special cases:

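R_ada_get_domain <- function(url) {
    host <- ada_get_hostname(url)
    ps <- public_suffix(url)
    # strip the public suffix from the end of the hostname
    pat <- paste0("\\.", ps, "$")
    dom <- mapply(function(x, y) sub(x, "", y), pat, host, USE.NAMES = FALSE)
    # keep only the last label before the suffix and re-attach the suffix
    domain <- paste0(sub(".*\\.([^\\.]+)$", "\\1", dom), ".", ps)
    domain[host == ps] <- ""
    # no public suffix found: fall back to the hostname of those elements
    domain[is.na(ps)] <- host[is.na(ps)]
    domain
}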

#' @rdname ada_get_domain
#' @export
ada_get_domain <- function(url, decode = TRUE) {
    .get(url, decode, R_ada_get_domain)
}

@chainsawriot
Collaborator

No, I don't think ada has it, given that it is not PSL-aware. The domain should be the public suffix (via the PSL) plus the label before it. How about using pat plus all the non-dot characters before it?

domain <- "https://www.domain.biz"
stringr::str_extract(domain, paste0("[^\\.]+\\.", public_suffix(domain)))

@schochastics
Member Author

I think this does not work, e.g. with the example in #44.

@chainsawriot
Collaborator

Very bad way to fix this (given "kobe.jp" can be extracted).

quickfixquicksand <- function(url, suffix = adaR::public_suffix(url)) {
    hostname <- adaR::ada_get_hostname(url)
    if (suffix == hostname) {
        return(hostname)
    }
    stringr::str_extract(hostname, paste0("[^\\.]+\\.", suffix))
}

quickfixquicksand("https://kobe.jp", "kobe.jp")
quickfixquicksand("https://www.bbc.co.uk")
quickfixquicksand("https://www.bmbf.de")

@schochastics
Member Author

Yet again, wildcard public suffixes need special treatment.

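R_ada_get_domain <- function(url) {
    host <- ada_get_hostname(url)
    host <- sub("^www\\.", "", host)
    ps <- public_suffix(url)
    # strip the public suffix from the end of the hostname
    pat <- paste0("\\.", ps, "$")

    dom <- mapply(function(x, y) sub(x, "", y), pat, host, USE.NAMES = FALSE)
    # keep only the last label before the suffix and re-attach the suffix
    domain <- paste0(sub(".*\\.([^\\.]+)$", "\\1", dom), ".", ps)
    # hostname is itself a public suffix: empty unless it is a wildcard suffix
    domain[host == ps & !ps %in% psl$wildcard] <- ""
    domain[host == ps & ps %in% psl$wildcard] <- ps[host == ps & ps %in% psl$wildcard]
    # no public suffix found: fall back to the hostname of those elements
    domain[is.na(ps)] <- host[is.na(ps)]
    domain
}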

This works for the tests I wrote, but I will now go through the whole list you posted.

@schochastics
Member Author

Oh crap, this broke things again.

@schochastics
Member Author

OK, we cannot support all the test cases, because not all of them have a valid public suffix.
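
For illustration, here is a rough base-R sketch of the "public suffix plus the label before it" idea, including what can happen when a host has no valid public suffix. It is not the adaR implementation: `toy_suffixes`, `toy_public_suffix()` and `toy_get_domain()` are made-up names, the suffix list is a tiny stand-in for the real PSL, and wildcard and exception rules are ignored. Falling back to the hostname when no suffix matches is an assumption borrowed from the drafts in this thread.

```r
# Toy suffix list for illustration only; the real Public Suffix List is far larger.
toy_suffixes <- c("co.uk", "uk", "de", "biz", "jp")

# Return the longest toy suffix matching the end of each hostname, or NA.
toy_public_suffix <- function(host) {
    vapply(host, function(h) {
        hits <- toy_suffixes[h == toy_suffixes | endsWith(h, paste0(".", toy_suffixes))]
        if (length(hits) == 0L) NA_character_ else hits[which.max(nchar(hits))]
    }, character(1), USE.NAMES = FALSE)
}

# Domain = last label before the suffix + "." + suffix.
toy_get_domain <- function(host) {
    ps <- toy_public_suffix(host)
    out <- host                      # assumption: no suffix -> keep the hostname
    ok <- !is.na(ps) & host != ps
    # Strip the suffix by length, so dots in the suffix never act as regex wildcards.
    rest <- substr(host[ok], 1L, nchar(host[ok]) - nchar(ps[ok]) - 1L)
    out[ok] <- paste0(sub("^.*\\.", "", rest), ".", ps[ok])
    out[!is.na(ps) & host == ps] <- ""  # the hostname is itself a suffix
    out
}

toy_get_domain(c("subsub.sub.domain.co.uk", "www.bmbf.de", "co.uk", "localhost"))
#> [1] "domain.co.uk" "bmbf.de"      ""             "localhost"
```

Whether a host without a valid suffix should come back as the hostname, `NA`, or an empty string is a design choice; the sketch only shows that some explicit fallback is needed.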

@schochastics schochastics removed their assignment Sep 26, 2023