Commits on Nov 29, 2017
Commits on Apr 20, 2017
  1. Merge pull request #13 from bryant1410/master

    sebastian-nagel committed Apr 20, 2017
    Fix broken headings in Markdown files
Commits on Apr 17, 2017
Commits on Feb 14, 2013
Commits on Feb 13, 2013
  1. Fix ARCFileReader to deal with payload length mismatch between what is
     specified in the ARC Header and what is actually available in the
     decompressed stream.

    ahadrana committed Feb 13, 2013
Commits on Feb 8, 2013
  1. 1. Add support for ArcFileItem via ARCFileItemInputFormat.

    ahadrana committed Feb 8, 2013
    2. Remove non-maven third party lib dependencies in anticipation of
    moving to maven instead of ant.
Commits on Jan 23, 2013
  1. Revert to making ARCFileReader constructor public, and delegating
     S3InputStream swap out in ARCFileInputFormat classes.

    ahadrana committed Jan 23, 2013
Commits on Jan 21, 2013
  1. ARCFileReader does not close incoming stream in its close method.

    ahadrana committed Jan 21, 2013
    ARCFileRecordReader needs to assume this responsibility.
Commits on Jan 18, 2013
Commits on Jan 17, 2013
  1. 1. Fix build.xml to fetch maven ant task properly.

    ahadrana committed Jan 17, 2013
    2. Remove external Hadoop dependency.
    3. Add deprecated files back in.
  2. Merge pull request #8 from pshken/master

    ahadrana committed Jan 17, 2013
    Updated "mvn.ant.task.url" URL in build.xml
  3. 1. Normalize project structure in anticipation of move from ant to mvn.

    ahadrana committed Jan 17, 2013
    2. Deprecated old ARCInputFormat code (moved from hadoop.io to
    hadoop.io.deprecated).
    3. Added new Hadoop FileSystem based ARCInputFormat to hadoop.io package.
    4. Added test cases to validate new ARCInputFormat code.
    5. ARCInputFormat now returns a BytesWritable buffer as the value type.
       The buffer consists of the HTTP Headers (encoded as UTF-8), followed
       by a trailing CRLF (\r\n). Everything after the trailing CRLF is
       the raw content returned by the original HTTP request.
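The value layout described in item 5 can be split back into headers and content with a short helper. The following is only a sketch under stated assumptions: each header line ends with CRLF, so the trailing CRLF shows up as the first `\r\n\r\n` sequence in the buffer. The class `ArcValueSplitter` and its methods are hypothetical names, not part of this repository. It works on a plain `byte[]` plus a valid length so that it runs without Hadoop; with a `BytesWritable` value you would pass `getBytes()` and `getLength()`, since the backing array can be longer than the valid data.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of the repository.
public class ArcValueSplitter {

    /** Returns the index of the first content byte, or -1 if no CRLF CRLF is found. */
    static int findContentStart(byte[] buf, int length) {
        for (int i = 0; i + 3 < length; i++) {
            if (buf[i] == '\r' && buf[i + 1] == '\n'
                    && buf[i + 2] == '\r' && buf[i + 3] == '\n') {
                return i + 4;
            }
        }
        return -1;
    }

    /** Decodes the header block (without the trailing CRLF) as UTF-8. */
    static String headers(byte[] buf, int length) {
        int start = findContentStart(buf, length);
        if (start < 0) {
            throw new IllegalArgumentException("no header terminator found");
        }
        // Keep the CRLF that ends the last header line, drop the trailing CRLF.
        return new String(buf, 0, start - 2, StandardCharsets.UTF_8);
    }

    /** Copies the raw content bytes that follow the trailing CRLF. */
    static byte[] content(byte[] buf, int length) {
        int start = findContentStart(buf, length);
        if (start < 0) {
            throw new IllegalArgumentException("no header terminator found");
        }
        return Arrays.copyOfRange(buf, start, length);
    }

    public static void main(String[] args) {
        // Made-up sample value for illustration only.
        byte[] value = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html/>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.print(headers(value, value.length));
        System.out.println(content(value, value.length).length + " content bytes");
    }
}
```

The content is returned as bytes rather than a string because the commit message describes it as raw content, so its encoding is not known at this point.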
Commits on Nov 29, 2012
Commits on Jun 29, 2012
  1. adding '.gitignore' file. bumped version to 1.0.

    Chris Stephens committed Jun 29, 2012
Commits on Mar 20, 2012
  1. Added protocol definitions for ArchiveInfo and ParseOutput.

    Ahad Rana authored and Ahad Rana committed Mar 20, 2012
Commits on Mar 15, 2012
  1. Merge pull request #4 from namin/master

    ahadrana committed Mar 15, 2012
    Fix a S3ServiceException when using an input prefix
  2. Do not list directories as resources.

    namin committed Mar 15, 2012
    Listing directories as resources triggers errors such as:
    org.jets3t.service.S3ServiceException: S3 GET failed for '/data%2F' XML Error Message: <?xml version="1.0" encoding="UTF-8"?><Error><Code>InvalidRange</Code><Message>The requested range is not satisfiable</Message><ActualObjectSize>0</ActualObjectSize><RequestId>F063C5C315CC967B</RequestId><HostId>HiShCYLg5oo+hdZceTVRkhhqebTZL5kl1m2gqf+0a0Mme+CSS0d2e9RERPMmcnPY</HostId><RangeRequested>bytes=0-</RangeRequested></Error>
      at org.jets3t.service.impl.rest.httpclient.RestS3Service.performRequest(RestS3Service.java:416)
      at org.jets3t.service.impl.rest.httpclient.RestS3Service.performRestGet(RestS3Service.java:752)
      at org.jets3t.service.impl.rest.httpclient.RestS3Service.getObjectImpl(RestS3Service.java:1601)
      at org.jets3t.service.impl.rest.httpclient.RestS3Service.getObjectImpl(RestS3Service.java:1544)
      at org.jets3t.service.S3Service.getObject(S3Service.java:2072)
      at org.commoncrawl.hadoop.io.JetS3tARCSource.getStream(JetS3tARCSource.java:261)
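The trace above shows a GET on the zero-length key `data/`. One way to avoid that kind of request is to drop directory placeholder keys before treating the listing as input resources. The sketch below assumes that such placeholders can be recognised by a trailing `/`; the class `ResourceFilter` and its methods are hypothetical, and this is not necessarily how the commit itself implements the fix. It works on plain key strings so that it does not depend on the JetS3t API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the repository.
public class ResourceFilter {

    /** Assumption: a key ending in "/" is a directory placeholder, not a data file. */
    static boolean isDirectoryPlaceholder(String key) {
        return key.endsWith("/");
    }

    /** Returns the keys that should be opened as resources, in their original order. */
    static List<String> filterResources(List<String> keys) {
        List<String> resources = new ArrayList<>();
        for (String key : keys) {
            if (!isDirectoryPlaceholder(key)) {
                resources.add(key);
            }
        }
        return resources;
    }

    public static void main(String[] args) {
        // Made-up listing for illustration only.
        List<String> listing = Arrays.asList("data/", "data/part-0.arc.gz");
        System.out.println(filterResources(listing));
    }
}
```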
Commits on Nov 21, 2011
Commits on Nov 17, 2011
  1. Merge pull request #1 from matpalm/emr

    commoncrawl committed Nov 17, 2011
    slight generalisation so we can build on elastic mapreduce
Commits on Nov 16, 2011
  1. Added introductory README

    ahadrana committed Nov 16, 2011
Commits on Nov 14, 2011
  1. Fix directory tree.

    Ahad Rana authored and Ahad Rana committed Nov 14, 2011
  2. Minor modification to GoogleURL interface.

    Ahad Rana authored and Ahad Rana committed Nov 9, 2011
  3. Fixed bug in InputStream implementation.

    Ahad Rana authored and Ahad Rana committed Nov 9, 2011
  4. Includes, among other things, (1) added mergeutils project into
     commoncrawl source tree (2) added query project into commoncrawl source
     tree (3) major refactoring of query project (4) bulk scan implementation
     (5) integration of parallel query functionality (6) bulk query support
     in cacheFE server (7) fix improper flush bug in Indexer code

    Ahad Rana authored and Ahad Rana committed Aug 10, 2011