NUTCH-1806 Delegate processing of URL domains to crawler-commons #816

@sebastian-nagel sebastian-nagel commented Apr 29, 2024

and NUTCH-1942 Remove TopLevelDomain

  • use methods from crawler-commons' EffectiveTldFinder in URLUtil, replacing classes and methods from the "org.apache.nutch.util.domain" package

  • adapt and extend unit tests

    • add tests for URLUtil.getTopLevelDomainName(url)
    • reflect changes to the public suffix list since 2014 ("xyz" is now a public suffix / ICANN suffix)
    • adapt to minor API changes
      • URLUtil.getDomainName(url) returns the host name in case no valid public suffix is found
      • for Unicode suffixes and TLDs, the methods URLUtil.getDomainSuffix(url) and URLUtil.getTopLevelDomainName(url), respectively, now return the ASCII representation
    • add unit tests for host names with trailing dot ("www.apache.org.")
    • add unit test for URLs without host/domain (cf. NUTCH-2450)
  • update and complete Javadoc

  • update DomainStatistics, TLDIndexingFilter and domain URL filters to use the updated methods in URLUtil

  • remove the class TLDScoringFilter. Its configuration is bound to domain-suffixes.xml, which was no longer maintained and is now removed

  • remove package org.apache.nutch.util.domain

  • move DomainStatistics to org.apache.nutch.util

  • remove configuration files of domain utils
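
The behavior changes listed above can be illustrated with a small toy sketch. It does not use Nutch or crawler-commons: the class `DomainNameSketch`, its methods, and the four-entry `SUFFIXES` set are hypothetical stand-ins for URLUtil, EffectiveTldFinder and the public suffix list. It works on host names rather than URLs, and stripping the trailing dot from the result is an assumption of the sketch, not something this description states.

```java
import java.util.Locale;
import java.util.Set;

/**
 * Toy illustration only. The real URLUtil delegates to crawler-commons'
 * EffectiveTldFinder and the full public suffix list.
 */
class DomainNameSketch {

    // Hypothetical, tiny stand-in for the public suffix list.
    static final Set<String> SUFFIXES = Set.of("org", "com", "xyz", "co.uk");

    /** Lowercases the host and drops one trailing dot ("www.apache.org."). */
    static String normalize(String host) {
        String h = host.toLowerCase(Locale.ROOT);
        return h.endsWith(".") ? h.substring(0, h.length() - 1) : h;
    }

    /** Returns the longest matching suffix of the host, or null if none matches. */
    static String suffixOf(String host) {
        String h = normalize(host);
        int idx = -1;
        // Candidates from longest to shortest: "a.b.co.uk", "b.co.uk", "co.uk", "uk"
        do {
            String candidate = h.substring(idx + 1);
            if (SUFFIXES.contains(candidate)) {
                return candidate;
            }
            idx = h.indexOf('.', idx + 1);
        } while (idx >= 0);
        return null;
    }

    /** Returns "label.suffix", or the host itself if no valid suffix is found. */
    static String domainName(String host) {
        String h = normalize(host);
        String suffix = suffixOf(h);
        if (suffix == null || suffix.length() == h.length()) {
            return h; // fallback: no suffix, or the host is itself a suffix
        }
        String prefix = h.substring(0, h.length() - suffix.length() - 1);
        return prefix.substring(prefix.lastIndexOf('.') + 1) + "." + suffix;
    }

    public static void main(String[] args) {
        System.out.println(domainName("www.apache.org."));  // apache.org
        System.out.println(domainName("www.example.co.uk")); // example.co.uk
        System.out.println(domainName("nic.xyz"));           // nic.xyz
        System.out.println(domainName("localhost"));         // localhost
    }
}
```

The fallback branch is the part that mirrors the API change: a host without a valid public suffix is returned as-is instead of producing no result.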

- add unit test for URLs without host/domain (cf. NUTCH-2450)
- add unit tests for host names with trailing dot ("www.apache.org.")
- use methods from crawler-commons' EffectiveTldFinder in URLUtil
  replacing classes and methods from the org.apache.nutch.util.domain
  package
- adapt and extend unit tests
  - add tests for URLUtil.getTopLevelDomainName(url)
  - changes to the public suffix list since 2014
    ("xyz" is now a public suffix / ICANN suffix)
  - minor API changes
    - URLUtil.getDomainName(url) returns the host name
      in case no valid public suffix is found
    - for Unicode suffixes and TLDs the methods
      URLUtil.getDomainSuffix(url) resp.
      URLUtil.getTopLevelDomainName(url) now return
      the ASCII representation
- complete Javadoc
NUTCH-1942 Remove TopLevelDomain
- update DomainStatistics, TLDIndexingFilter and domain URL filters
  to use the updated methods in URLUtil
- remove TLDScoringFilter
- remove package org.apache.nutch.util.domain
- move DomainStatistics to org.apache.nutch.util
- remove configuration files of domain utils
@sebastian-nagel sebastian-nagel changed the title from "NUTCH-1806 Delegate processing of URL domains to crawler-common" to "NUTCH-1806 Delegate processing of URL domains to crawler-commons" Sep 6, 2024
- restore previous behavior of URLUtil.getDomainSuffix(...) and
  getTopLevelDomainName(...) to return null if there is no valid
  public suffix resp. TLD
- unify spelling of top-level domain
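
A minimal sketch of the restored contract, assuming a hypothetical helper `topLevelDomainName(host)` and a three-entry `ASCII_SUFFIXES` set in place of the real public suffix list. It is not the Nutch implementation. It only shows the two properties named in this PR: the result is null when the top-level label is not a valid suffix, and Unicode TLDs come back in their ASCII (Punycode) form, here obtained with the JDK's `java.net.IDN`.

```java
import java.net.IDN;
import java.util.Locale;
import java.util.Set;

/** Toy illustration only, not the Nutch or crawler-commons code. */
class DomainSuffixSketch {

    // Hypothetical stand-in for the public suffix list, stored in ASCII form.
    // "xn--p1ai" is the ASCII representation of the Cyrillic TLD "рф".
    static final Set<String> ASCII_SUFFIXES = Set.of("org", "xyz", "xn--p1ai");

    /** Returns the ASCII form of the top-level label if it is a known suffix, else null. */
    static String topLevelDomainName(String host) {
        if (host == null || host.isEmpty()) {
            return null; // URL without host/domain
        }
        // Convert Unicode labels to their ASCII representation.
        String ascii = IDN.toASCII(host).toLowerCase(Locale.ROOT);
        String tld = ascii.substring(ascii.lastIndexOf('.') + 1);
        return ASCII_SUFFIXES.contains(tld) ? tld : null;
    }
}
```

With this sketch, `topLevelDomainName("пример.рф")` yields `"xn--p1ai"`, while `topLevelDomainName("localhost")` and an empty host both yield null, so callers need a null check.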
@sebastian-nagel sebastian-nagel merged commit 8b11962 into apache:master Sep 17, 2024
4 checks passed
@sebastian-nagel sebastian-nagel deleted the NUTCH-1942-domain-utils-to-use-crawler-commons branch September 17, 2024 15:04