Commits on Feb 25, 2016
  1. NUTCH-2231 Jexl support in generator job

    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1732332 13f79535-47bb-0310-9956-ffa450edef68
    Markus Jelsma committed Feb 25, 2016
Commits on Feb 24, 2016
  1. NUTCH-2231 Jexl support in generator job

    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1732177 13f79535-47bb-0310-9956-ffa450edef68
    Markus Jelsma committed Feb 24, 2016
  2. NUTCH-2232 DeduplicationJob should decode URL's before length is compared

    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1732160 13f79535-47bb-0310-9956-ffa450edef68
    Markus Jelsma committed Feb 24, 2016
  3. NUTCH-2229 Allow Jexl expressions on CrawlDatum's fixed attributes

    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1732140 13f79535-47bb-0310-9956-ffa450edef68
    Markus Jelsma committed Feb 24, 2016
Commits on Feb 23, 2016
  1. NUTCH-2227 RegexParseFilter

    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1731849 13f79535-47bb-0310-9956-ffa450edef68
    Markus Jelsma committed Feb 23, 2016
  2. NUTCH-2221 Introduce db.ignore.internal.links to FetcherThread

    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1731836 13f79535-47bb-0310-9956-ffa450edef68
    Markus Jelsma committed Feb 23, 2016
  3. NUTCH-2220 Rename db.* options used only by the linkdb to linkdb.*

    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1731831 13f79535-47bb-0310-9956-ffa450edef68
    Markus Jelsma committed Feb 23, 2016
  4. NUTCH-2228 Plugin index-replace unit test broken on Java 8

    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1731824 13f79535-47bb-0310-9956-ffa450edef68
    Markus Jelsma committed Feb 23, 2016
Commits on Feb 22, 2016
  1. NUTCH-2219 Criteria order to be configurable in DeduplicationJob

    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1731651 13f79535-47bb-0310-9956-ffa450edef68
    Markus Jelsma committed Feb 22, 2016
Commits on Feb 18, 2016
  1. NUTCH-2218 - Update CHANGES.txt. Merge PR #91

    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1731103 13f79535-47bb-0310-9956-ffa450edef68
    MJJoyce committed Feb 18, 2016
  2. NUTCH-2218 - Update CrawlComplete util to use Commons CLI

    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1731102 13f79535-47bb-0310-9956-ffa450edef68
    MJJoyce committed Feb 18, 2016
Commits on Feb 17, 2016
  1. NUTCH-2223 Upgrade xercesImpl to 2.11.0 to fix hang on issue in tika mimetype detection

    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1730808 13f79535-47bb-0310-9956-ffa450edef68
    Markus Jelsma committed Feb 17, 2016
  2. NUTCH-2224 Average bytes/second calculated incorrectly in fetcher

    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1730803 13f79535-47bb-0310-9956-ffa450edef68
    Markus Jelsma committed Feb 17, 2016
  3. NUTCH-2225 Parsed time calculated incorrectly

    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1730802 13f79535-47bb-0310-9956-ffa450edef68
    Markus Jelsma committed Feb 17, 2016
Commits on Feb 16, 2016
  1. NUTCH-961 Expose Tika's Boilerpipe support

    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1730694 13f79535-47bb-0310-9956-ffa450edef68
    Markus Jelsma committed Feb 16, 2016
  2. NUTCH-1233 Rely on Tika for outlink extraction

    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1730687 13f79535-47bb-0310-9956-ffa450edef68
    Markus Jelsma committed Feb 16, 2016
  3. NUTCH-2210 Upgrade to Tika 1.12

    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1730686 13f79535-47bb-0310-9956-ffa450edef68
    Markus Jelsma committed Feb 16, 2016
Commits on Feb 11, 2016
  1. NUTCH-2209 Improved Tokenization for Similarity Scoring plugin, this closes #87

    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1729763 13f79535-47bb-0310-9956-ffa450edef68
    sujen1412 committed Feb 11, 2016
Commits on Feb 3, 2016
  1. NUTCH-2211 Added filterchecker and normalizerchecker to bin/nutch script

    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1728339 13f79535-47bb-0310-9956-ffa450edef68
    Markus Jelsma committed Feb 3, 2016
  2. NUTCH-2197 Add Solr 5 cloud indexer support

    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1728313 13f79535-47bb-0310-9956-ffa450edef68
    Markus Jelsma committed Feb 3, 2016
Commits on Jan 27, 2016
  1. Added missing stopword file for NUTCH-2206

    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1727126 13f79535-47bb-0310-9956-ffa450edef68
    sujen1412 committed Jan 27, 2016
  2. NUTCH-2206 Provide example scoring.similarity.stopword.file

    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1727122 13f79535-47bb-0310-9956-ffa450edef68
    sujen1412 committed Jan 27, 2016
Commits on Jan 22, 2016
  1. NUTCH-2204 : revert erroneous commit

    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1726318 13f79535-47bb-0310-9956-ffa450edef68
    sebastian-nagel committed Jan 22, 2016
  2. NUTCH-2204 Remove junit lib from runtime

    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1726314 13f79535-47bb-0310-9956-ffa450edef68
    sebastian-nagel committed Jan 22, 2016
Commits on Jan 21, 2016
  1. NUTCH-2201 Remove loops program from webgraph package

    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1725981 13f79535-47bb-0310-9956-ffa450edef68
    Markus Jelsma committed Jan 21, 2016
  2. NUTCH-1325 HostDB for Nutch

    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1725952 13f79535-47bb-0310-9956-ffa450edef68
    Markus Jelsma committed Jan 21, 2016
Commits on Jan 19, 2016
  1. NUTCH-2203 Suffix URL filter can't handle trailing/leading whitespaces

    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1725538 13f79535-47bb-0310-9956-ffa450edef68
    Markus Jelsma committed Jan 19, 2016
Commits on Jan 15, 2016
  1. NUTCH-2194 Run IndexingFilterChecker as simple Telnet server

    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1724771 13f79535-47bb-0310-9956-ffa450edef68
    Markus Jelsma committed Jan 15, 2016
Commits on Jan 13, 2016
  1. NUTCH-2196 IndexingFilterChecker to optionally normalize

    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1724418 13f79535-47bb-0310-9956-ffa450edef68
    Markus Jelsma committed Jan 13, 2016
  2. NUTCH-2195 IndexingFilterChecker to optionally follow N redirects

    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1724409 13f79535-47bb-0310-9956-ffa450edef68
    Markus Jelsma committed Jan 13, 2016
Commits on Jan 12, 2016
  1. NUTCH-2190 Protocol normalizer

    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1724199 13f79535-47bb-0310-9956-ffa450edef68
    Markus Jelsma committed Jan 12, 2016
Commits on Jan 11, 2016
  1. NUTCH-2190 Protocol normalizer

    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1724085 13f79535-47bb-0310-9956-ffa450edef68
    Markus Jelsma committed Jan 11, 2016
Commits on Jan 8, 2016
  1. NUTCH-1838 Host and domain based regex and automaton filtering

    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1723710 13f79535-47bb-0310-9956-ffa450edef68
    asf-sync-process committed Jan 8, 2016
  2. NUTCH-2178 DeduplicationJob to optionally group on host or domain

    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1723690 13f79535-47bb-0310-9956-ffa450edef68
    Markus Jelsma committed Jan 8, 2016
  3. NUTCH-1449 Optionally delete documents skipped by IndexingFilters

    git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1723688 13f79535-47bb-0310-9956-ffa450edef68
    Markus Jelsma committed Jan 8, 2016