Commit

NUTCH-2182 Make reverseUrlDirs file dumper option hash the URL for consistency

git-svn-id: https://svn.apache.org/repos/asf/nutch/trunk@1720466 13f79535-47bb-0310-9956-ffa450edef68
MJJoyce committed Dec 16, 2015
1 parent fa4ebb8 commit c0d9490
Showing 2 changed files with 4 additions and 18 deletions.
2 changes: 2 additions & 0 deletions CHANGES.txt
@@ -1,5 +1,7 @@
 Nutch Change Log

+* NUTCH-2182 Make reverseUrlDirs file dumper option hash the URL for consistency
+
 * NUTCH-2183 Improvement to SegmentChecker for skipping non-segments present in segments directory (lewismc)

 * NUTCH-2180 FileDumper skips Corrupt Segments (Harshavardhan Manjunatha via lewismc)
20 changes: 2 additions & 18 deletions src/java/org/apache/nutch/tools/FileDumper.java
@@ -37,6 +37,7 @@
 //Commons imports
 import org.apache.commons.io.IOUtils;
 import org.apache.commons.io.FilenameUtils;
+import org.apache.commons.codec.digest.DigestUtils;

 //Hadoop
 import org.apache.hadoop.conf.Configuration;
@@ -244,24 +245,7 @@ public boolean accept(File file) {
 String[] reversedURL = TableUtil.reverseUrl(url).split(":");
 reversedURL[0] = reversedURL[0].replace('.', '/');

-// URLs with content at a folder level and nested below that
-// run into problems when dumping. For example:
-//
-// www.foo.com/bar/
-// www.foo.com/bar/about.html
-//
-// One of these will fail to dump depending on processing order.
-// To address this, we will use a placeholder when dumping a URL
-// such as the one ending in '/bar/'
-String lastDir = reversedURL[reversedURL.length - 1];
-if (! lastDir.contains(".")) {
-if (lastDir.charAt(lastDir.length() - 1) != '/') {
-reversedURL[reversedURL.length - 1] += '/';
-}
-reversedURL[reversedURL.length - 1] += "_file";
-}
-
-String reversedURLPath = org.apache.commons.lang3.StringUtils.join(reversedURL, "/");
+String reversedURLPath = reversedURL[0] + "/" + DigestUtils.sha256Hex(url).toUpperCase();
 outputFullPath = String.format("%s/%s", fullDir, reversedURLPath);

 // We'll drop the trailing file name and create the nested structure if it doesn't already exist.
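The effect of the change can be illustrated with a small standalone sketch. It is not the FileDumper code: `HashedDumpPath`, `reversedHostDir`, `sha256HexUpper`, and `dumpPath` are hypothetical names, the host reversal is a simplified stand-in for `TableUtil.reverseUrl(url).split(":")[0].replace('.', '/')` (assuming that first component is the reversed host name), and `java.security.MessageDigest` is used in place of `DigestUtils.sha256Hex` so that the example needs only the JDK.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class HashedDumpPath {

  // Hypothetical stand-in for the reversed-host directory: www.foo.com -> com/foo/www
  public static String reversedHostDir(String url) {
    String[] parts = URI.create(url).getHost().split("\\.");
    StringBuilder dir = new StringBuilder();
    for (int i = parts.length - 1; i >= 0; i--) {
      dir.append(parts[i]);
      if (i > 0) {
        dir.append('/');
      }
    }
    return dir.toString();
  }

  // JDK-only substitute for DigestUtils.sha256Hex(s).toUpperCase()
  public static String sha256HexUpper(String s) {
    try {
      MessageDigest md = MessageDigest.getInstance("SHA-256");
      byte[] digest = md.digest(s.getBytes(StandardCharsets.UTF_8));
      StringBuilder hex = new StringBuilder(digest.length * 2);
      for (byte b : digest) {
        hex.append(String.format("%02X", b & 0xff));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      // SHA-256 is required on every Java platform, so this should not happen.
      throw new IllegalStateException(e);
    }
  }

  // Reversed host directory plus one hash-named file per URL.
  public static String dumpPath(String url) {
    return reversedHostDir(url) + "/" + sha256HexUpper(url);
  }

  public static void main(String[] args) {
    // The two URLs from the removed comment now map to sibling files,
    // so neither needs a directory named after the other.
    System.out.println(dumpPath("http://www.foo.com/bar/"));
    System.out.println(dumpPath("http://www.foo.com/bar/about.html"));
  }
}
```

Under these assumptions, every URL on a host becomes a file with a 64-character name directly inside the reversed-host directory, which is why the `_file` placeholder logic is no longer needed.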
