Commits on Aug 23, 2016
  1. @sebastian-nagel

    NUTCH-2164 NUTCH-2242 Inconsistent 'Modified Time' in crawl db / lastModified not always set
    
     - set modified time (time of last successful fetch) by DefaultFetchSchedule and AdaptiveFetchSchedule
       but only if the document is actually modified
     - update unit tests to check whether modification time is properly set
     - set modified time (sent by responding server in HTTP header) in ProtocolOutput:
       FetchSchedule implementations can access the HTTP modified time from CrawlDatum's
       metadata (PROTO_STATUS_KEY = "_pst_")
    sebastian-nagel committed Mar 11, 2016
  2. @sebastian-nagel
Commits on Aug 22, 2016
  1. @sebastian-nagel

    NUTCH-2300 Fetcher to optionally save robots.txt

    Merge branch 'SaveRobotsTxt' of https://github.com/sebastian-nagel/nutch, this closes #141
    sebastian-nagel committed Aug 22, 2016
  2. @sebastian-nagel
Commits on Aug 19, 2016
  1. @sebastian-nagel

    add hint and log warning that fetcher.store.robotstxt works only in combination with fetcher.store.content
    sebastian-nagel committed Aug 19, 2016
  2. @sebastian-nagel
  3. @sebastian-nagel

    Allow Fetcher to optionally store robots.txt content (if property fetcher.store.robotstxt == true).
    
    Improved RobotRulesParser command-line tool.
    sebastian-nagel committed May 25, 2016
Commits on Aug 16, 2016
  1. @sebastian-nagel

    Merge branch 'NUTCH-2299' of https://github.com/sebastian-nagel/nutch

    …this closes #140
    
    - Remove obsolete properties protocol.plugin.check.*
    sebastian-nagel committed Aug 16, 2016
Commits on Aug 15, 2016
  1. @sebastian-nagel
Commits on Aug 9, 2016
  1. @sujen1412 @sujen1412
Commits on Jul 24, 2016
  1. @lewismc
Commits on Jul 16, 2016
  1. @lewismc
  2. @lewismc
  3. @naegelejd @lewismc

    NUTCH-2287 Indexer-elastic plugin should use Elasticsearch BulkProcessor and BackoffPolicy
    naegelejd committed with lewismc Jun 30, 2016
  4. @lewismc
Commits on Jul 13, 2016
  1. @stevegy

    Format HttpFormAuthentication.java with the Eclipse formatter and add Javadoc. Add httpclient-auth.xml.template as a cookie policy config example.
    stevegy committed Jul 13, 2016
Commits on Jul 12, 2016
  1. @stevegy

    fix the cookie policy issue when the form authentication receives a session cookie in a non-standard format - NUTCH-2280
    stevegy committed Jul 12, 2016
Commits on Jul 5, 2016
  1. @naegelejd

    NUTCH-2287 Indexer-elastic plugin should use Elasticsearch BulkProcessor and BackoffPolicy
    naegelejd committed Jun 30, 2016
Commits on Jul 2, 2016
  1. @sebastian-nagel
  2. @sebastian-nagel
  3. @sebastian-nagel
  4. @sebastian-nagel
  5. @sebastian-nagel
Commits on Jul 1, 2016
  1. @thammegowda
  2. @thammegowda
  3. @sebastian-nagel

    NUTCH-1553 Property 'indexer.delete.robots.noindex' not working when using parser-html
    
    - fix broken unit test (fix HTML markup, make test for meta data extraction obligatory)
    - add all values of general metadata to parse metadata
    sebastian-nagel committed Jul 1, 2016
  4. @sebastian-nagel

    NUTCH-2291 - Fix mrunit dependencies

    - remove classifier from dependency because pom file name on Maven repository does not contain a classifier
    sebastian-nagel committed Jul 1, 2016
Commits on Jun 30, 2016
  1. @sjwoodard

    Use static HttpClient for all SOLR connections

    Changed HttpClient to static, based on http://hc.apache.org/httpclient-3.x/performance.html, and added it to all SolrJ connections.
    sjwoodard committed on GitHub Jun 30, 2016
  2. @sebastian-nagel

    NUTCH-1553 Property 'indexer.delete.robots.noindex' not working when using parser-html
    
    - add general metadata to parse metadata where it can be checked by the indexer
    sebastian-nagel committed Jun 30, 2016
Commits on Jun 29, 2016
  1. @lewismc
Commits on Jun 27, 2016
  1. @naegelejd

    fix for NUTCH-2234

    and NUTCH-2236.
    Upgrades Elasticsearch and Hadoop dependencies, which, in turn,
    requires updates to Guava and Lucene dependencies:
    
    - Elasticsearch 1.4.1 -> Elasticsearch 2.3.3
    - Lucene 4.10.2 -> 5.5.0
    - Solrj 5.4.1 -> 5.5.0
    - Guava 16.0.1 -> Guava 18.0
    - Hadoop 2.4.0 -> 2.7.2
    naegelejd committed May 25, 2016
  2. @sjwoodard

    NUTCH-2267 - Solr and Hadoop JAR mismatch

    Explicitly pass in an instance of SystemDefaultHttpClient to CloudSolrClient, otherwise SolrJ will use a default implementation of CloseableHttpClient, which is not present in the HttpClient and HttpCore JARs in Hadoop < 2.8 (see https://issues.apache.org/jira/browse/SOLR-7948 and https://issues.apache.org/jira/browse/HADOOP-12767).
    sjwoodard committed on GitHub Jun 27, 2016
Commits on Jun 23, 2016
  1. @sebastian-nagel

    NUTCH-2272 Index checker server to optionally keep client connection open
    
    - removed from change log for release 1.12 as it is not included
    sebastian-nagel committed Jun 23, 2016
Commits on Jun 20, 2016
  1. @lewismc
Commits on Jun 3, 2016
  1. NUTCH-2272 Index checker server to optionally keep client connection open
    Markus Jelsma committed Jun 3, 2016