Skip to content

Commit

Permalink
NUTCH-3041 Address confusing logging in o.a.n.net.URLExemptionFilters (
Browse files Browse the repository at this point in the history
  • Loading branch information
lewismc committed May 16, 2024
1 parent 5f1330a commit 8abc78a
Show file tree
Hide file tree
Showing 3 changed files with 26 additions and 23 deletions.
7 changes: 5 additions & 2 deletions src/java/org/apache/nutch/net/URLExemptionFilters.java
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,7 @@
import org.slf4j.LoggerFactory;

import java.lang.invoke.MethodHandles;
import java.util.Arrays;

/** Creates and caches {@link URLExemptionFilter} implementing plugins. */
public class URLExemptionFilters {
Expand All @@ -44,8 +45,10 @@ public URLExemptionFilters(Configuration conf) {
throw new IllegalStateException(e);
}
}
LOG.info("Found {} extensions at point:'{}'", filters.length,
URLExemptionFilter.X_POINT_ID);
if (filters.length > 0) {
LOG.info("Found {} URLExemptionFilter implementations: '{}'", filters.length,
Arrays.toString(filters));
}
}

/**
Expand Down
18 changes: 7 additions & 11 deletions src/plugin/urlfilter-ignoreexempt/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -17,8 +17,8 @@

urlfilter-ignoreexempt
======================
This plugin allows certain urls to be exempted when the external links are configured to be ignored.
This is useful when focused crawl is setup but some resources like static files are linked from CDNs (external domains).
This plugin allows certain urls to be exempted when the external links are configured to be ignored.
This is useful when focused crawl is setup but some resources like static files are linked from CDNs (external domains).

# How to enable ?
Add `urlfilter-ignoreexempt` value to `plugin.includes` property
Expand All @@ -36,25 +36,21 @@ open `conf/db-ignore-external-exemptions.txt` and add the regex rules.
## Format :

The format is same same as `regex-urlfilter.txt`.
Each non-comment, non-blank line contains a regular expression
prefixed by '+' or '-'. The first matching pattern in the file
determines whether a URL is exempted or ignored. If no pattern
matches, the URL is ignored.

Each non-comment, non-blank line contains a regular expression
prefixed by '+' or '-'. The first matching pattern in the file
determines whether a URL is exempted or ignored. If no pattern
matches, the URL is ignored.

## Example :

To exempt urls ending with image extensions, use this rule
To exempt urls ending with image extensions, use this rule

`+(?i)\.(jpg|png|gif)$`



## Testing the Rules :

After enabling the plugin and adding your rules to `conf/db-ignore-external-exemptions.txt`, run:

`bin/nutch plugin urlfilter-ignoreexempt org.apache.nutch.urlfilter.ignoreexempt.ExemptionUrlFilter http://yoururl.here`


This should print `true` for urls which are accepted by configured rules.
Original file line number Diff line number Diff line change
Expand Up @@ -25,21 +25,25 @@
import java.util.regex.Pattern;
import java.util.List;


/**
* This implementation of {@link org.apache.nutch.net.URLExemptionFilter} uses regex configuration
* to check if URL is eligible for exemption from 'db.ignore.external'.
* When this filter is enabled, the external urls will be checked against configured sequence of regex rules.
* This implementation of {@link org.apache.nutch.net.URLExemptionFilter}
* uses regex configuration to check if URL is eligible for exemption from
* the <code>db.ignore.external.links</code> configuration property.
* When this filter is enabled, the external urls will be checked
* against configured sequence of regex rules.
*<p>
* The exemption rule file defaults to db-ignore-external-exemptions.txt in the classpath but can be
* overridden using the property <code>"db.ignore.external.exemptions.file" in ./conf/nutch-*.xml</code>
* The exemption rule file defaults to
* <code>db-ignore-external-exemptions.txt</code> in the classpath but
* can be overridden using the configuration property
* <code>db.ignore.external.exemptions.file</code>.
*</p>
*
* The exemption rules are specified in plain text file where each line is a rule.
* The format is same same as `regex-urlfilter.txt`.
* The exemption rules are specified in plain text file where each line
* is a rule.
* The format is same same as <code>regex-urlfilter.txt</code>.
* Each non-comment, non-blank line contains a regular expression
* prefixed by '+' or '-'. The first matching pattern in the file
* determines whether a URL is exempted or ignored. If no pattern
* prefixed by <b>+</b> or <b>-</b>. The first matching pattern in the file
* determines whether a URL is exempted or ignored. If no pattern
* matches, the URL is ignored.
*
* @since Feb 10, 2016
Expand Down

0 comments on commit 8abc78a

Please sign in to comment.