add ada_get_domain #43
Comments
is there a R_ada_get_domain <- function(url) {
host <- ada_get_hostname(url)
ps <- public_suffix(url)
pat <- paste0("\\.", ps, "$")
dom <- mapply(function(x, y) sub(x, "", y), pat, host, USE.NAMES = FALSE)
domain <- paste0(sub(".*\\.([^\\.]+)$", "\\1", dom), ".", ps)
domain[host == ps] <- ""
domain[is.na(ps)] <- host[is.na(ps)]
domain
}
#' @rdname ada_get_domain
#' @export
ada_get_domain <- function(url, decode = TRUE) {
.get(url, decode, R_ada_get_domain)
}
|
No, I don't think ada has it, given that it is not PSL-aware. The domain should be the public suffix (via the PSL) plus the label before it. How about using:
domain <- "https://www.domain.biz"
stringr::str_extract(domain, paste0("[^\\.]+\\.", public_suffix(domain))) |
I think this does not work, e.g. with the example in #44. |
A very bad way to fix this (given that "kobe.jp" can be extracted):
quickfixquicksand <- function(url, suffix = adaR::public_suffix(url)) {
hostname <- adaR::ada_get_hostname(url)
if (suffix == hostname) {
return(hostname)
}
stringr::str_extract(hostname, paste0("[^\\.]+\\.", suffix))
}
quickfixquicksand("https://kobe.jp", "kobe.jp")
quickfixquicksand("https://www.bbc.co.uk")
quickfixquicksand("https://www.bmbf.de") |
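One caveat with the regex approach above: the suffix is pasted into the pattern unescaped and unanchored, so the `.` in `co.uk` matches any character and the match is not tied to the end of the hostname. A minimal base-R sketch that avoids regex on the suffix, assuming the hostname and its public suffix are already known as plain strings (the helper name `registrable_domain` is hypothetical and not part of adaR):

```r
# Hypothetical helper, not part of adaR: takes a hostname and its public
# suffix as plain strings and returns the suffix plus one preceding label.
registrable_domain <- function(hostname, suffix) {
  if (identical(hostname, suffix)) {
    return(hostname) # same behaviour as the quick fix above
  }
  tail_part <- paste0(".", suffix)
  if (!endsWith(hostname, tail_part)) {
    return(NA_character_) # the suffix does not terminate the hostname
  }
  # Drop ".suffix" from the end, then keep only the last remaining label.
  rest <- substr(hostname, 1, nchar(hostname) - nchar(tail_part))
  labels <- strsplit(rest, ".", fixed = TRUE)[[1]]
  paste0(labels[length(labels)], tail_part)
}

registrable_domain("www.bbc.co.uk", "co.uk") # "bbc.co.uk"
registrable_domain("kobe.jp", "kobe.jp")     # "kobe.jp"
```

Comparing literal strings with `endsWith()` and `fixed = TRUE` sidesteps escaping entirely; whether that fits the package's vectorised design is a separate question.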
Wildcard public suffixes yet again need special treatment:
R_ada_get_domain <- function(url) {
host <- ada_get_hostname(url)
host <- sub("^www\\.", "", host)
ps <- public_suffix(url)
pat <- paste0("\\.", ps, "$")
dom <- mapply(function(x, y) sub(x, "", y), pat, host, USE.NAMES = FALSE)
domain <- paste0(sub(".*\\.([^\\.]+)$", "\\1", dom), ".", ps)
domain[host == ps & !ps %in% psl$wildcard] <- ""
is_wild <- host == ps & ps %in% psl$wildcard
domain[is_wild] <- ps[is_wild]
domain[is.na(ps)] <- host[is.na(ps)]
domain
}
This works for the tests I made, but I will now go through the whole list you posted. |
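To make the three special cases easier to check by hand, here is a rough, self-contained restatement of the branching in base R. It is only a sketch under assumptions: `domain_from_parts` and `toy_wildcard` are hypothetical names, the hostname and public suffix are passed in directly instead of coming from `ada_get_hostname()` and `public_suffix()`, and the wildcard list is a one-entry stand-in whose format may differ from the real `psl$wildcard`.

```r
# Toy stand-in for psl$wildcard (assumption: entries are stored without "*.").
toy_wildcard <- c("kobe.jp")

# Hypothetical, vectorised restatement of the branching logic.
# `host` and `ps` are parallel character vectors; `ps` may contain NA.
domain_from_parts <- function(host, ps, wildcard = toy_wildcard) {
  host <- sub("^www\\.", "", host)
  tail_part <- paste0(".", ps)
  has_tail <- !is.na(ps) & endsWith(host, tail_part)

  # Strip ".suffix" and keep only the last remaining label.
  rest <- substr(host, 1, nchar(host) - nchar(tail_part))
  label <- sub(".*\\.", "", rest)

  domain <- rep(NA_character_, length(host))
  domain[has_tail] <- paste0(label[has_tail], tail_part[has_tail])

  same <- !is.na(ps) & host == ps
  is_wild <- ps %in% wildcard
  domain[same & !is_wild] <- ""                 # bare suffix: no domain
  domain[same & is_wild] <- ps[same & is_wild]  # wildcard suffix: keep it
  domain[is.na(ps)] <- host[is.na(ps)]          # no known suffix: keep host
  domain
}

domain_from_parts(
  host = c("www.bbc.co.uk", "kobe.jp", "co.uk", "localhost"),
  ps   = c("co.uk", "kobe.jp", "co.uk", NA)
)
# "bbc.co.uk" "kobe.jp" "" "localhost"
```

Indexing the replacement values with the same logical mask keeps the assignments length-safe when several URLs are passed at once.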
oh crap this broke things again |
OK, we cannot support all the test cases, because not all of them have a valid public suffix. |
kindly requested by webtrack team:
Just glueing some existing functions