-
sebastian-nagel committed
Nov 29, 2017
-
Merge pull request #13 from bryant1410/master
Fix broken headings in Markdown files
sebastian-nagel committed Apr 20, 2017
-
bryant1410 committed
Apr 17, 2017
-
Remove content overflow warning message.
ahadrana committed Feb 14, 2013
-
Fix ARCFileReader to deal with payload length mismatch between what is specified in the ARC Header and what is actually available in the decompressed stream.
ahadrana committed Feb 13, 2013
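The commit above does not show how the mismatch is handled. As a rough illustration only, one way to tolerate a payload that is shorter than the length declared in the ARC header is to read at most the declared number of bytes and accept an early end of stream. The class and method names below are hypothetical and are not taken from ARCFileReader; the sketch covers only the "stream is shorter than declared" case.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper for illustration; not the actual ARCFileReader code.
class TolerantPayloadReader {
    /**
     * Reads at most declaredLength bytes from the stream. If the stream ends
     * early, returns the bytes that were actually available instead of failing.
     */
    static byte[] readPayload(InputStream in, int declaredLength) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int remaining = Math.max(declaredLength, 0);
        while (remaining > 0) {
            int n = in.read(chunk, 0, Math.min(chunk.length, remaining));
            if (n < 0) {
                break; // stream ended before the declared length was reached
            }
            out.write(chunk, 0, n);
            remaining -= n;
        }
        return out.toByteArray();
    }
}
```

A caller can compare the returned array's length with the declared length to detect and log the mismatch.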
-
1. Add support for ArcFileItem via ARCFileItemInputFormat.
2. Remove non-maven third party lib dependencies in anticipation of moving to maven instead of ant.
ahadrana committed Feb 8, 2013
-
Revert to making ARCFileReader constructor public, and delegating S3InputStream swap out in ARCFileInputFormat classes.
ahadrana committed Jan 23, 2013
-
ahadrana committed
Jan 23, 2013
-
Add InputFormats for both mapred and mapreduce packages.
ahadrana committed Jan 23, 2013
-
ARCFileReader does not close incoming stream in its close method.
ARCFileRecordReader needs to assume this responsibility.
ahadrana committed Jan 21, 2013
-
Yet again, remove unused imports.
ahadrana committed Jan 18, 2013
-
Remove improper HexDump import statement.
ahadrana committed Jan 18, 2013
-
Fix behavioral issue related to the InputStream provided by the S3N FileSystem.
ahadrana committed Jan 18, 2013
-
Add simplified command line utility to validate ARC File.
ahadrana committed Jan 18, 2013
-
Remove all Thrift dependencies from CC project.
ahadrana committed Jan 17, 2013
-
1. Fix build.xml to fetch maven ant task properly.
2. Remove external Hadoop dependency.
3. Add deprecated files back in.
ahadrana committed Jan 17, 2013
-
Merge pull request #8 from pshken/master
Updated "mvn.ant.task.url" URL in build.xml
ahadrana committed Jan 17, 2013
-
1. Normalize project structure in anticipation of move from ant to mvn.
2. Deprecated old ARCInputFormat code (moved from hadoop.io to hadoop.io.deprecated).
3. Added new Hadoop FileSystem based ARCInputFormat to hadoop.io package.
4. Added test cases to validate new ARCInputFormat code.
5. ARCInputFormat now returns a BytesWritable buffer as the value type. The buffer consists of the HTTP Headers (encoded as UTF-8), followed by a trailing CRLF (\r\n). Everything after the trailing CRLF is the raw content returned by the original HTTP request.
ahadrana committed Jan 17, 2013
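The value layout described in item 5 can be illustrated with a small sketch that splits such a buffer into headers and raw content. It assumes that each header line ends with CRLF and that the "trailing CRLF" is the empty line closing the header block, so the boundary is the first CRLF CRLF sequence; the commit message does not spell this out. The class and method names are hypothetical. The sketch works on a plain byte array plus a valid length so that it has no Hadoop dependency; with a BytesWritable, the array would come from getBytes() and the length from getLength(), since the backing array can be longer than the valid data.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration; not part of the commoncrawl code base.
class ArcValueSplitter {
    /** Returns the index of the first byte after the first CRLF CRLF, or -1 if none is found. */
    static int bodyOffset(byte[] buf, int length) {
        for (int i = 0; i + 3 < length; i++) {
            if (buf[i] == '\r' && buf[i + 1] == '\n'
                    && buf[i + 2] == '\r' && buf[i + 3] == '\n') {
                return i + 4;
            }
        }
        return -1;
    }

    /** Decodes the header block (including its terminating CRLFs) as UTF-8. */
    static String headers(byte[] buf, int length) {
        int offset = bodyOffset(buf, length);
        int end = offset < 0 ? length : offset; // no separator: treat everything as headers
        return new String(buf, 0, end, StandardCharsets.UTF_8);
    }

    /** Returns a copy of the raw content bytes that follow the header block. */
    static byte[] body(byte[] buf, int length) {
        int offset = bodyOffset(buf, length);
        return offset < 0 ? new byte[0] : Arrays.copyOfRange(buf, offset, length);
    }
}
```

The content is kept as raw bytes because its encoding depends on the original HTTP response and should not be assumed to be UTF-8.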
-
ant.task URL have updated in build.xml due to dead link
Shh committed Nov 29, 2012
-
adding '.gitignore' file. bumped version to 1.0.
Chris Stephens committed Jun 29, 2012
-
Added protocol definitions for ArchiveInfo and ParseOutput.
Ahad Rana authored and Ahad Rana committed Mar 20, 2012
-
Merge pull request #4 from namin/master
Fix a S3ServiceException when using an input prefix
ahadrana committed Mar 15, 2012
-
Do not list directories as resources.
Listing directories as resources trigger errors such as:
org.jets3t.service.S3ServiceException: S3 GET failed for '/data%2F' XML Error Message: <?xml version="1.0" encoding="UTF-8"?><Error><Code>InvalidRange</Code><Message>The requested range is not satisfiable</Message><ActualObjectSize>0</ActualObjectSize><RequestId>F063C5C315CC967B</RequestId><HostId>HiShCYLg5oo+hdZceTVRkhhqebTZL5kl1m2gqf+0a0Mme+CSS0d2e9RERPMmcnPY</HostId><RangeRequested>bytes=0-</RangeRequested></Error>
    at org.jets3t.service.impl.rest.httpclient.RestS3Service.performRequest(RestS3Service.java:416)
    at org.jets3t.service.impl.rest.httpclient.RestS3Service.performRestGet(RestS3Service.java:752)
    at org.jets3t.service.impl.rest.httpclient.RestS3Service.getObjectImpl(RestS3Service.java:1601)
    at org.jets3t.service.impl.rest.httpclient.RestS3Service.getObjectImpl(RestS3Service.java:1544)
    at org.jets3t.service.S3Service.getObject(S3Service.java:2072)
    at org.commoncrawl.hadoop.io.JetS3tARCSource.getStream(JetS3tARCSource.java:261)
namin committed Mar 15, 2012
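The error above is raised for the key 'data/' (shown URL-encoded as 'data%2F'), a zero-byte directory placeholder. The actual change made in this commit is not shown here; as a minimal sketch, assuming directory placeholders can be recognized by a trailing slash in the key, a listing could be filtered before any object is fetched. The class and method names are hypothetical and the keys are plain strings rather than jets3t objects.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not the actual JetS3tARCSource code.
class S3KeyFilter {
    /** Assumption: a key ending in "/" is a directory placeholder, not an ARC file. */
    static boolean looksLikeDirectory(String key) {
        return key.endsWith("/");
    }

    /** Returns the keys from the listing that do not look like directories, in order. */
    static List<String> fileKeysOnly(List<String> keys) {
        List<String> result = new ArrayList<>();
        for (String key : keys) {
            if (!looksLikeDirectory(key)) {
                result.add(key);
            }
        }
        return result;
    }
}
```

Filtering at listing time avoids issuing a ranged GET against an empty object, which is what produces the InvalidRange response in the trace.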
-
Merge branch 'master' of github.com:commoncrawl/commoncrawl
ahadrana committed Nov 21, 2011
-
Remove accidental check-in of generator timestamp files.
ahadrana committed Nov 21, 2011
-
Merge pull request #1 from matpalm/emr
slight generalisation so we can build on elastic mapreduce
commoncrawl committed Nov 17, 2011
-
slight generalisation so we can build on elastic mapreduce
matpalm committed Nov 17, 2011
-
Fix HTML escaping issue in README.md
ahadrana committed Nov 16, 2011
-
Fix launcher script to classpath build/lib as part of transition to Maven.
ahadrana committed Nov 16, 2011
-
ahadrana committed
Nov 16, 2011
-
Final modifications before recommit to GitHub
ahadrana committed Nov 16, 2011
-
Ahad Rana authored and Ahad Rana committed
Nov 14, 2011
-
Minor modification to GoogleURL interface.
Ahad Rana authored and Ahad Rana committed Nov 9, 2011
-
Fixed bug in InputStream implementation.
Ahad Rana authored and Ahad Rana committed Nov 9, 2011
-
Includes, among other things, (1) added mergeutils project into commoncrawl source tree (2) added query project into commoncrawl source tree (3) major refactoring of query project (4) bulk scan implementation (5) integration of parallel query functionality (6) bulk query support in cacheFE server (7) fix improper flush bug in Indexer code
Ahad Rana authored and Ahad Rana committed Aug 10, 2011