NUTCH-1806 Delegate processing of URL domains to crawler-commons #816

sebastian-nagel · 2024-04-29T13:45:44Z

and NUTCH-1942 Remove TopLevelDomain

- use methods from crawler-commons' EffectiveTldFinder in URLUtil, replacing classes and methods from the "org.apache.nutch.util.domain" package
- adapt and extend unit tests
  - add tests for URLUtil.getTopLevelDomainName(url)
  - reflect changes to the public suffix list since 2014 ("xyz" is now a public suffix / ICANN suffix)
  - adapt to minor API changes
    - URLUtil.getDomainName(url) returns the host name if no valid public suffix is found
    - for Unicode suffixes and TLDs, the methods URLUtil.getDomainSuffix(url) and URLUtil.getTopLevelDomainName(url), respectively, now return the ASCII representation
  - add unit tests for host names with trailing dot ("www.apache.org.")
  - add unit test for URLs without host/domain (cf. NUTCH-2450)
- update and complete Javadoc
- update DomainStatistics, TLDIndexingFilter and domain URL filters to use the updated methods in URLUtil
- remove the class TLDScoringFilter: its configuration is bound to domain-suffixes.xml, which was no longer maintained and is now removed
- remove package org.apache.nutch.util.domain
- move DomainStatistics to org.apache.nutch.util
- remove configuration files of domain utils
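To illustrate what "ASCII representation" means for a Unicode suffix or TLD, here is a minimal sketch that uses only the JDK's `java.net.IDN`. It is not the code from this PR and does not call crawler-commons. The class name `UnicodeSuffixSketch` and the helper `toAsciiHost` are hypothetical names chosen for the example.

```java
import java.net.IDN;
import java.util.Locale;

public class UnicodeSuffixSketch {

  /**
   * Hypothetical helper (not part of Nutch or crawler-commons):
   * lowercases a host name and converts it to its ASCII (Punycode) form.
   */
  static String toAsciiHost(String host) {
    return IDN.toASCII(host.toLowerCase(Locale.ROOT));
  }

  public static void main(String[] args) {
    // "\u0440\u0444" is the Cyrillic TLD of Russia; its ASCII form is "xn--p1ai"
    System.out.println(toAsciiHost("example.\u0440\u0444"));
    // Plain ASCII host names pass through unchanged
    System.out.println(toAsciiHost("www.apache.org"));
  }
}
```

Callers that compare suffixes or TLDs against stored values may need to apply the same conversion on both sides, since a Unicode string and its Punycode form are not equal as Java strings.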

- use methods from crawler-commons' EffectiveTldFinder in URLUtil, replacing classes and methods from the org.apache.nutch.util.domain package
- adapt and extend unit tests
  - add tests for URLUtil.getTopLevelDomainName(url)
  - changes to the public suffix list since 2014 ("xyz" is now a public suffix / ICANN suffix)
  - minor API changes
    - URLUtil.getDomainName(url) returns the host name if no valid public suffix is found
    - for Unicode suffixes and TLDs, the methods URLUtil.getDomainSuffix(url) and URLUtil.getTopLevelDomainName(url), respectively, now return the ASCII representation
- complete Javadoc

NUTCH-1942 Remove TopLevelDomain
- update DomainStatistics, TLDIndexingFilter and domain URL filters to use the updated methods in URLUtil
- remove TLDScoringFilter
- remove package org.apache.nutch.util.domain
- move DomainStatistics to org.apache.nutch.util
- remove configuration files of domain utils

sebastian-nagel added 4 commits April 29, 2024 12:09

NUTCH-1806 Delegate processing of URL domains to crawler commons

f6bcec9

- add unit test for URLs without host/domain (cf. NUTCH-2450)

NUTCH-1806 Delegate processing of URL domains to crawler commons

bc2ae7e

- add unit tests for host names with trailing dot ("www.apache.org.")

sebastian-nagel changed the title ~~NUTCH-1806 Delegate processing of URL domains to crawler-common~~ NUTCH-1806 Delegate processing of URL domains to crawler-commons Sep 6, 2024

NUTCH-1806 Delegate processing of URL domains to crawler commons

40881e8

- restore previous behavior of URLUtil.getDomainSuffix(...) and getTopLevelDomainName(...) to return null if there is no valid public suffix or TLD, respectively
- unify spelling of top-level domain
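The two contracts described in this PR (a suffix lookup that returns null when nothing matches, and a domain-name lookup that falls back to the host name) can be sketched with a toy suffix set. This is only an illustration under stated assumptions: `TOY_SUFFIXES`, `toySuffix` and `toyDomainName` are hypothetical names, the set is a tiny stand-in for the real public suffix list, and the matching logic is much simpler than what crawler-commons' EffectiveTldFinder does.

```java
import java.util.Set;

public class SuffixFallbackSketch {

  // Toy stand-in for the public suffix list (assumption: the real list is far larger)
  static final Set<String> TOY_SUFFIXES = Set.of("org", "com", "co.uk", "xyz");

  /** Returns the longest matching suffix from the toy set, or null if none matches. */
  static String toySuffix(String host) {
    String candidate = host;
    while (true) {
      if (TOY_SUFFIXES.contains(candidate)) {
        return candidate;
      }
      int dot = candidate.indexOf('.');
      if (dot < 0) {
        return null; // no valid suffix: signal this with null
      }
      candidate = candidate.substring(dot + 1); // drop the leftmost label
    }
  }

  /** Returns the suffix plus one label, or the host itself if no suffix matches. */
  static String toyDomainName(String host) {
    String suffix = toySuffix(host);
    if (suffix == null || suffix.equals(host)) {
      return host; // fall back to the host name
    }
    String prefix = host.substring(0, host.length() - suffix.length() - 1);
    return prefix.substring(prefix.lastIndexOf('.') + 1) + "." + suffix;
  }

  public static void main(String[] args) {
    System.out.println(toySuffix("www.apache.org"));      // org
    System.out.println(toyDomainName("www.apache.org"));  // apache.org
    System.out.println(toySuffix("intranet.local"));      // null
    System.out.println(toyDomainName("intranet.local"));  // intranet.local
  }
}
```

With this split, code that only needs a grouping key can use the domain name unconditionally, while code that depends on a valid suffix has to check for null first.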

sebastian-nagel merged commit 8b11962 into apache:master Sep 17, 2024
4 checks passed

sebastian-nagel deleted the NUTCH-1942-domain-utils-to-use-crawler-commons branch September 17, 2024 15:04
