NUTCH-2793 indexer-csv: make it work in distributed mode #534

Draft
wants to merge 1 commit into
base: master
4 changes: 2 additions & 2 deletions src/plugin/indexer-csv/README.md
@@ -1,7 +1,7 @@
indexer-csv plugin for Nutch
============================

-**indexer-csv plugin** is used for writing documents to a CSV file. It does not work in distributed mode, the output is written to the local filesystem, not to HDFS, see [NUTCH-1541](https://issues.apache.org/jira/browse/NUTCH-1541). The configuration for the index writers is on **conf/index-writers.xml** file, included in the official Nutch distribution and it's as follow:
+**indexer-csv plugin** is used for writing documents to a CSV file. The configuration for the index writers is on **conf/index-writers.xml** file, included in the official Nutch distribution and it's as follow:

```xml
<writer id="<writer_id>" class="org.apache.nutch.indexwriter.csv.CSVIndexWriter">
@@ -39,4 +39,4 @@ escapechar | Escape character used to escape a quote character | &quot;
maxfieldlength | Max. length of a single field value in characters | 4096
maxfieldvalues | Max. number of values of one field, useful for, e.g., the anchor texts field | 12
header | Write CSV column headers | true
-outpath | Output path / directory (local filesystem path, relative to current working directory) | csvindexwriter
+outpath | Output path / directory (local filesystem path, relative to current working directory) | csvindexwriter
Contributor

Still "local filesystem"? Possibly we could use the outpath to overcome the problem of multiple index writers.

Contributor Author

Sorry, I did not understand that, could you elaborate?

Contributor

Sorry, I've mixed two points together:

  • the description would also need a change, as it will not be a path on the local filesystem when running in distributed mode
  • there is also the open question of how to allow two index writers to write output to the filesystem:
    • in local mode this would require that each writer's outpath points to a different directory
    • in distributed mode we could use outpath to write into distinct output directories or distinct subdirectories of one job-specific output directory
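
The two cases in the list above could be sketched roughly as follows. This is only an illustration, not code from the PR: `WriterOutputDirs`, `resolve`, `resolveAll`, the writer ids, and the `/user/nutch/index-job` directory are all hypothetical, and the "subdirectory of one job-specific output directory" layout is just one of the options mentioned. Paths are plain strings here so the sketch runs without Hadoop on the classpath.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper (not part of Nutch): one output directory per index writer. */
class WriterOutputDirs {

  /**
   * Local mode: the configured outpath is used unchanged.
   * Distributed mode (assumed layout): a subdirectory of the job output directory.
   */
  static String resolve(String jobOutputDir, String outpath, boolean distributed) {
    if (!distributed) {
      return outpath;
    }
    // Drop a trailing slash so the joined path has exactly one separator.
    String base = jobOutputDir.endsWith("/")
        ? jobOutputDir.substring(0, jobOutputDir.length() - 1)
        : jobOutputDir;
    return base + "/" + outpath;
  }

  /** Resolves all writers and rejects configurations where two writers share a directory. */
  static Map<String, String> resolveAll(String jobOutputDir,
      Map<String, String> outpathByWriterId, boolean distributed) {
    Map<String, String> dirByWriterId = new LinkedHashMap<>();
    Set<String> seen = new HashSet<>();
    for (Map.Entry<String, String> e : outpathByWriterId.entrySet()) {
      String dir = resolve(jobOutputDir, e.getValue(), distributed);
      if (!seen.add(dir)) {
        throw new IllegalArgumentException(
            "Writer " + e.getKey() + " shares output directory " + dir);
      }
      dirByWriterId.put(e.getKey(), dir);
    }
    return dirByWriterId;
  }

  public static void main(String[] args) {
    // Hypothetical writer ids and outpath values.
    Map<String, String> writers = new LinkedHashMap<>();
    writers.put("csv_all", "csv_all");
    writers.put("csv_news", "csv_news");
    System.out.println(resolveAll("/user/nutch/index-job", writers, true));
  }
}
```

Failing fast on a shared directory is a design choice of this sketch; the thread leaves open how such a conflict should actually be handled.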

@@ -44,17 +44,14 @@
* index as CSV or tab-separated plain text table. Format (encoding, separators,
* etc.) is configurable by a couple of options, see output of
* {@link #describe()}.
*
* <p>
* Note: works only in local mode, to be used with index option
* <code>-noCommit</code>.
* </p>
*
*/
public class CSVIndexWriter implements IndexWriter {

public static final Logger LOG = LoggerFactory
.getLogger(CSVIndexWriter.class);

private String filename = "nutch.csv";
private Configuration config;

/** ordered list of fields (columns) in the CSV file */
@@ -192,7 +189,7 @@ protected int find(String value, int start) {

@Override
public void open(Configuration conf, String name) throws IOException {
Contributor

This method is deprecated since the switch to the XML-based index writer configuration (see NUTCH-1480 and the wiki page IndexWriters). The "name" parameter was just an arbitrary name, not a file name indicating a task-specific output path. We would need a method which takes both the IndexWriterParams and the output path. This would require changes in the IndexWriter interface and also in the classes IndexWriters and IndexerMapReduce. I'm also not sure whether the output path alone is sufficient. We'll eventually need an OutputCommitter, and we need to think about situations where we have multiple index writers (e.g. via exchanges). See also the discussion in NUTCH-1541.
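
To make the suggestion concrete, here is a minimal sketch of what an `open` method taking both the writer parameters and a task-specific output path might look like. It is an assumption for discussion only: `PathAwareIndexWriter`, `SketchCsvWriter`, and the `/tmp/job/task-0` path are hypothetical, a plain `Map` stands in for `IndexWriterParams`, and nothing here reflects the actual `IndexWriter` interface or addresses the `OutputCommitter` question.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical contract, NOT the actual Nutch IndexWriter interface. */
interface PathAwareIndexWriter {
  /** Receives the writer's parameters plus an output path chosen by the framework per task. */
  void open(Map<String, String> params, String taskOutputPath);
}

/** Stand-in writer that only computes where its CSV file would go. */
class SketchCsvWriter implements PathAwareIndexWriter {
  private String csvFile;

  @Override
  public void open(Map<String, String> params, String taskOutputPath) {
    // "outpath" and its default "csvindexwriter" are taken from the README table.
    String outpath = params.getOrDefault("outpath", "csvindexwriter");
    // The file name stays fixed because the task-specific part comes from the path.
    csvFile = taskOutputPath + "/" + outpath + "/nutch.csv";
  }

  String csvFile() {
    return csvFile;
  }

  public static void main(String[] args) {
    Map<String, String> params = new HashMap<>();
    params.put("outpath", "csv_news");
    SketchCsvWriter writer = new SketchCsvWriter();
    // Hypothetical task-specific directory handed in by the caller.
    writer.open(params, "/tmp/job/task-0");
    System.out.println(writer.csvFile());
  }
}
```

With this shape the writer never invents a file name from an arbitrary `name` argument; the caller decides where each task may write.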


filename = name;
}

/**
@@ -227,7 +224,7 @@ public void open(IndexWriterParams parameters) throws IOException {
LOG.info("Writing output to {}", outputPath);
Path outputDir = new Path(outputPath);
fs = outputDir.getFileSystem(config);
-csvLocalOutFile = new Path(outputDir, "nutch.csv");
+csvLocalOutFile = new Path(outputDir, filename);
if (!fs.exists(outputDir)) {
fs.mkdirs(outputDir);
}