This repository has been archived by the owner on May 12, 2021. It is now read-only.

APEXMALHAR-2116 Added FS record reader operator, module, test #326

Merged
merged 1 commit into from
Jul 31, 2016

Conversation

yogidevendra
Contributor

No description provided.

@yogidevendra
Contributor Author

@amberarrow Could you please review this?

@amberarrow
Contributor

Ok, I'll take a look

import com.datatorrent.lib.io.block.ReaderContext;

/**
* This operator can be used for reading records/tuples from Filesystem
Contributor

This line and several others have trailing whitespace, which is generally considered bad practice; see for instance:
http://programmers.stackexchange.com/questions/121555/why-is-trailing-whitespace-a-big-deal
I suggest removing all trailing whitespace.
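
One way to locate and clean such lines is sketched below. This is a minimal illustration, not part of the PR; the class `TrailingWhitespace` and its methods are hypothetical names, and it only treats spaces and tabs as trailing whitespace.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the PR: reports and strips trailing whitespace.
final class TrailingWhitespace
{
  private TrailingWhitespace() {}

  /** Returns the 1-based numbers of lines that end in a space or a tab. */
  static List<Integer> findOffendingLines(List<String> lines)
  {
    List<Integer> result = new ArrayList<>();
    for (int i = 0; i < lines.size(); i++) {
      String line = lines.get(i);
      if (line.endsWith(" ") || line.endsWith("\t")) {
        result.add(i + 1);
      }
    }
    return result;
  }

  /** Removes trailing spaces and tabs from a single line (no line terminator expected). */
  static String strip(String line)
  {
    return line.replaceAll("[ \\t]+$", "");
  }
}
```

Most editors and IDEs can also be configured to strip trailing whitespace on save, which avoids the need for a separate pass.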

@otterc
Contributor

otterc commented Jun 21, 2016

@yogidevendra
Can you please explain how the downstream operators will know which block a record belongs to?
Specifically, when you converge, the information about which record belongs to which block will be lost.

Is this being created for any specific use case? Can you please share that?

@yogidevendra
Contributor Author

@ChandniSingh
This is mainly for use cases where downstream operators do not care which block an incoming record belongs to. For example, the dedup, enrichment, projection, and transform operators just work on POJOs. They do not care which file or which block a tuple belongs to.

Use cases in this context do not intend to retain the original sequence of records from the source. They can treat each tuple as independent and focus on processing tuples at scale. In this case, this module will be connected to a "byte[] to POJO" converter (say, a delimited parser) and then to a dedup or enrichment operator. Finally, the output of dedup will be connected to an output operator.
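
To make the "byte[] to POJO" step concrete, here is a minimal sketch of what such a converter does with a single record. It is an illustration only and does not use the Malhar parser API; `UserRecord`, `DelimitedRecordParser`, the two-field layout, and the comma delimiter are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical POJO, for illustration only.
final class UserRecord
{
  final int id;
  final String name;

  UserRecord(int id, String name)
  {
    this.id = id;
    this.name = name;
  }
}

// Hypothetical parser, for illustration only; not the Malhar parser API.
final class DelimitedRecordParser
{
  private DelimitedRecordParser() {}

  /** Converts one record, e.g. "42,alice" as UTF-8 bytes, into a POJO. */
  static UserRecord parse(byte[] record)
  {
    String line = new String(record, StandardCharsets.UTF_8).trim();
    String[] fields = line.split(",", -1);
    if (fields.length != 2) {
      throw new IllegalArgumentException("Expected 2 fields but found " + fields.length);
    }
    return new UserRecord(Integer.parseInt(fields[0].trim()), fields[1].trim());
  }
}
```

Because each call depends only on the bytes of one record, the block or file the record came from plays no role, which is the property the downstream operators rely on.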

@yogidevendra yogidevendra force-pushed the APEXMALHAR-2116-record-reader branch from aaeb7fe to dd31bbe Compare June 23, 2016 07:00
@yogidevendra
Contributor Author

Incorporated review comments.

* 4. recursive: if scan recursively input directories<br/>
* 5. blockSize: block size used to read input blocks of file<br/>
* 6. readersCount: count of readers to read input file<br/>
* 7. sequencialFileRead: If emit file blocks in sequence?<br/>
Contributor

sequencial => sequential

@amberarrow
Contributor

Still looking at this -- it will probably take a couple more days.

@yogidevendra yogidevendra force-pushed the APEXMALHAR-2116-record-reader branch from dd31bbe to 75ef1d5 Compare June 24, 2016 08:16
* Code for enabling BeanUtils to accept comma separated string to
* initialize FIELD_TYPE[]
*/
class RecordReaderModeConverter extends AbstractConverter
Contributor

Could you provide more detail on why this is necessary and how it is used?

@yogidevendra
Contributor Author

@amberarrow Removed the unnecessary code for the BeanUtils converter. However, I am facing some issues with the Travis build. Build failures seem to be non-deterministic (different unit tests failed on different attempts).

Please let me know if you have any other suggestions regarding the above code changes; I will update the PR accordingly. We should keep this PR on hold until the Travis build is stable.

@amberarrow
Contributor

@yogidevendra OK, the rest looks good, though I'd like to test it with a small sample application on the cluster; do you have such a sample? It would also be a good addition to the examples collection.

@yogidevendra
Contributor Author

Let me know where to add the sample app. I will open a separate PR for it in the respective repo.

@amberarrow
Contributor

https://github.com/DataTorrent/examples under tutorials

* The module reads data in parallel, following parameters can be configured
* <br/>
* 1. files: list of file(s)/directories to read<br/>
* 2. filePatternRegularExp: Files names matching given regex will be read<br/>
Contributor

Files names => Files with names

*/
public void setFiles(String files)
{
this.files = files;
Contributor

Shouldn't the field be named "directories" if we expect it to be a list of directories?

Contributor Author

I tried to keep it consistent with FSInputModule.java. If you feel that readable, intuitive names are more important than consistency across operators, then I am fine with changing this field name.

/**
* Length for fixed width record
*/
private int recordLength;
Contributor

We should either provide a default value or add a @Min constraint to make sure the user sets it.

@yogidevendra yogidevendra force-pushed the APEXMALHAR-2116-record-reader branch from 506ab58 to 5e955e4 Compare July 8, 2016 07:46
@yogidevendra
Contributor Author

@DT-Priyanka Incorporated changes based on your review feedback.
@amberarrow Could you please do a final round of review and then merge if there are no further comments from the community?

@yogidevendra
Contributor Author

@amberarrow Could you please merge this if there are no more comments?

@yogidevendra
Contributor Author

yogidevendra commented Jul 14, 2016

@DT-Priyanka Could you please merge this if there are no more comments?

* Length for fixed width record
*/
@Min(1)
private int recordLength;
Contributor

Record length is optional for DELIMITED_RECORD and compulsory for FIXED_WIDTH_RECORD, so making it either an optional or a compulsory field has pros and cons.

Contributor Author

Oh. Good point.
Should we separate this into two different classes, FixedWidthRecordReader and DelimitedRecordReader? That would keep the configuration clean.

@amberarrow Any thoughts?

Contributor

Adding another class seems like overkill; how about removing the annotation and adding a check in the code to ensure that the value is positive if the mode is FIXED_WIDTH_RECORD?

Contributor Author

@amberarrow As per your suggestion, removed the @Min annotation for record length.
Added validation in the code to ensure that the value is positive if the mode is FIXED_WIDTH_RECORD.
Also added a test case to validate this.

Could you have a look and merge this if it looks OK?
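
For readers following the thread, the mode-dependent check discussed above might look roughly like the sketch below. This is a simplified stand-in, not the code merged in this PR: `RecordReaderConfig`, its constructor, and the `validate()` method are hypothetical; only the mode names `DELIMITED_RECORD` and `FIXED_WIDTH_RECORD` and the `recordLength` field come from the discussion.

```java
// Hypothetical, simplified stand-in for the reader's configuration; not the merged code.
final class RecordReaderConfig
{
  // Mode names as used in the review thread.
  enum RecordReaderMode { DELIMITED_RECORD, FIXED_WIDTH_RECORD }

  private final RecordReaderMode mode;
  private final int recordLength;

  RecordReaderConfig(RecordReaderMode mode, int recordLength)
  {
    this.mode = mode;
    this.recordLength = recordLength;
  }

  /** Rejects a non-positive recordLength, but only when the mode requires one. */
  void validate()
  {
    if (mode == RecordReaderMode.FIXED_WIDTH_RECORD && recordLength <= 0) {
      throw new IllegalArgumentException(
          "recordLength must be positive for FIXED_WIDTH_RECORD, got " + recordLength);
    }
  }
}
```

Compared with an unconditional `@Min(1)` annotation, a check like this leaves `recordLength` unset for delimited records while still failing fast for fixed-width ones.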

@yogidevendra yogidevendra force-pushed the APEXMALHAR-2116-record-reader branch from 5e955e4 to ba8bbbe Compare July 21, 2016 11:46
@amberarrow
Contributor

There are a couple of minor issues:

  1. The commit message is missing the JIRA (see instructions at https://apex.apache.org/contributing.html:
    "Commit messages need to reference JIRA (pull requests will be linked to ticket)").
  2. Several lines (all inside comments) have whitespace at the end; please remove.

2. javadoc improvements.

3. Adding default values

4. Incorporating review comments.
@yogidevendra yogidevendra force-pushed the APEXMALHAR-2116-record-reader branch from ba8bbbe to 02d657c Compare July 25, 2016 06:37
@yogidevendra
Contributor Author

@amberarrow Updated as per the feedback.

@asfgit asfgit merged commit 02d657c into apache:master Jul 31, 2016
5 participants