
[BEAM-2222] Migrate hadoop inputformat to website #235

Closed
aaltay wants to merge 2 commits into apache:asf-site from aaltay:asf-site

Conversation

@aaltay
Member

aaltay commented May 9, 2017

@asfbot

asfbot commented May 9, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/beam_PreCommit_Website_Stage/456/

Jenkins built the site at commit id f9d0e0e with Jekyll and staged it here. Happy reviewing.

Note that any previous site has been deleted. This staged site will be automatically deleted after its TTL expires. Push any commit to the pull request branch or re-trigger the build to get it staged again.

@davorbonaci
Member

R: @hadarhg @melap

melap left a comment

I would suggest putting in Python tabs for each code block with some sort of "not applicable to Python" type message, to avoid the missing code block troubles if someone arrives at the page with Python chosen.


You will need to pass a Hadoop Configuration with parameters specifying how the read will occur. Many properties of the Configuration are optional, and some are required for certain InputFormat classes, but the following properties must be set for all InputFormats:

mapreduce.job.inputformat.class: The InputFormat class used to connect to your data source of choice.

This list isn't rendering as a list, so very hard to read

@melap

melap commented May 9, 2017

One other general comment -- missing code formatting for a lot of the class names/methods


A HadoopInputFormatIO is a Transform for reading data from any source which
implements Hadoop InputFormat. For example- Cassandra, Elasticsearch, HBase, Redis, Postgres, etc.


A HadoopInputFormatIO is a transform for reading data from any source that implements Hadoop's InputFormat. For example, Cassandra, Elasticsearch, HBase, Redis, Postgres, etc.

HadoopInputFormatIO has to make several performance trade-offs in connecting to InputFormat, so if there is another Beam IO Transform specifically for connecting to your data source of choice, we would recommend using that one, but this IO Transform allows you to connect to many data sources that do not yet have a Beam IO Transform.


HadoopInputFormatIO allows you to connect to many data sources that do not yet have a Beam IO transform. However, HadoopInputFormatIO has to make several performance trade-offs in connecting to InputFormat. So, if there is another Beam IO transform for connecting specifically to your data source of choice, we recommend you use that one.

You will need to pass a Hadoop Configuration with parameters specifying how the read will occur. Many properties of the Configuration are optional, and some are required for certain InputFormat classes, but the following properties must be set for all InputFormats:


You will need to pass a Hadoop Configuration with parameters specifying how the read will occur. Many properties of the Configuration are optional and some are required for certain InputFormat classes, but the following properties must be set for all InputFormat classes:

mapreduce.job.inputformat.class: The InputFormat class used to connect to your data source of choice.
key.class: The key class returned by the InputFormat in 'mapreduce.job.inputformat.class'.
value.class: The value class returned by the InputFormat in 'mapreduce.job.inputformat.class'.


  • mapreduce.job.inputformat.class - The InputFormat class used to connect to your data source of choice.
  • key.class - The Key class returned by the InputFormat in mapreduce.job.inputformat.class.
  • value.class - The Value class returned by the InputFormat in mapreduce.job.inputformat.class.
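For reference, the three required properties might be set on a Hadoop Configuration along these lines (a minimal sketch; MyInputFormat, MyKey, and MyValue are placeholders for the classes your data source actually uses):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.InputFormat;

// Minimal sketch: set the three properties that every
// HadoopInputFormatIO read requires. MyInputFormat, MyKey, and
// MyValue stand in for your source's real classes.
Configuration hadoopConf = new Configuration();
hadoopConf.setClass("mapreduce.job.inputformat.class",
    MyInputFormat.class, InputFormat.class);
hadoopConf.setClass("key.class", MyKey.class, Object.class);
hadoopConf.setClass("value.class", MyValue.class, Object.class);
```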


You will need to check to see if the key and value classes output by the InputFormat have a Beam Coder available. If not, You can use withKeyTranslation/withValueTranslation to specify a method transforming instances of those classes into another class that is supported by a Beam Coder. These settings are optional and you don't need to specify translation for both key and value.


You will need to check if the Key and Value classes output by the InputFormat have a Beam Coder available. If not, you can use withKeyTranslation or withValueTranslation to specify a method transforming instances of those classes into another class that is supported by a Beam Coder. These settings are optional and you don't need to specify translation for both key and value.
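As a sketch of what a key translation looks like with the Beam Java SDK (InputFormatKeyClass, MyKeyClass, the Configuration hadoopConf, and the Pipeline p are assumed names for illustration):

```java
// Sketch: translate the InputFormat's key type into MyKeyClass,
// a hypothetical class that does have a Beam Coder available.
SimpleFunction<InputFormatKeyClass, MyKeyClass> myOutputKeyType =
    new SimpleFunction<InputFormatKeyClass, MyKeyClass>() {
      @Override
      public MyKeyClass apply(InputFormatKeyClass input) {
        // ... convert input into a MyKeyClass instance ...
        return new MyKeyClass(input);
      }
    };

PCollection<KV<MyKeyClass, InputFormatValueClass>> data =
    p.apply("read",
        HadoopInputFormatIO.<MyKeyClass, InputFormatValueClass>read()
            .withConfiguration(hadoopConf)
            .withKeyTranslation(myOutputKeyType));
```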

To read data from Cassandra, org.apache.cassandra.hadoop.cql3.CqlInputFormat can be used, which needs the following properties to be set.

Create Cassandra Hadoop configuration as follows:

Delete this line ("Create Cassandra Hadoop configuration as follows:")


The CqlInputFormat key class is java.lang.Long, which has a Beam Coder. The CqlInputFormat value class is com.datastax.driver.core.Row, which does not have a Beam Coder. Rather than write a new coder, you can provide your own translation method as follows:


The CqlInputFormat key class is java.lang.Long, which has a Beam Coder. The CqlInputFormat value class is com.datastax.driver.core.Row, which does not have a Beam Coder. Rather than write a new coder, you can provide your own translation method, as follows:
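A hedged sketch of such a translation (the column name "colName", the Configuration cassandraConf, and the Pipeline p are hypothetical names for illustration):

```java
// Sketch: translate CqlInputFormat's Row values, which lack a Beam
// Coder, into Strings, which have one. "colName" is a hypothetical
// column in your Cassandra table.
SimpleFunction<Row, String> cassandraOutputValueType =
    new SimpleFunction<Row, String>() {
      @Override
      public String apply(Row row) {
        return row.getString("colName");
      }
    };

PCollection<KV<Long, String>> cassandraData =
    p.apply("read",
        HadoopInputFormatIO.<Long, String>read()
            .withConfiguration(cassandraConf)
            .withValueTranslation(cassandraOutputValueType));
```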


### Elasticsearch - EsInputFormat

To read data from Elasticsearch, EsInputFormat can be used which needs following properties to be set.

To read data from Elasticsearch, use EsInputFormat, which needs the following properties to be set:



Create ElasticSearch Hadoop configuration as follows:

Delete this line ("Create ElasticSearch Hadoop configuration as follows:")


The org.elasticsearch.hadoop.mr.EsInputFormat key class is
org.apache.hadoop.io.Text and value class is org.elasticsearch.hadoop.mr.LinkedMapWritable. Both key and value classes have Beam Coders.

The org.elasticsearch.hadoop.mr.EsInputFormat's key class is org.apache.hadoop.io.Text, and its value class is org.elasticsearch.hadoop.mr.LinkedMapWritable. Both key and value classes have Beam Coders.
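Since both classes already have Coders, the read needs no translation; a minimal sketch (the Configuration elasticsearchConf and Pipeline p are assumed names):

```java
// Sketch: Text and LinkedMapWritable both have Beam Coders, so no
// key or value translation is needed. elasticsearchConf is a Hadoop
// Configuration assumed to be set up as described above.
PCollection<KV<Text, LinkedMapWritable>> elasticsearchData =
    p.apply("read",
        HadoopInputFormatIO.<Text, LinkedMapWritable>read()
            .withConfiguration(elasticsearchConf));
```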

@aaltay
Member Author

aaltay commented May 9, 2017

Thank you! I believe I addressed all the comments, PTAL.

@asfbot

asfbot commented May 9, 2017

Refer to this link for build results (access rights to CI server needed):
https://builds.apache.org/job/beam_PreCommit_Website_Stage/457/

Jenkins built the site at commit id 89e8397 with Jekyll and staged it here. Happy reviewing.

Note that any previous site has been deleted. This staged site will be automatically deleted after its TTL expires. Push any commit to the pull request branch or re-trigger the build to get it staged again.

@aaltay
Member Author

aaltay commented May 9, 2017

Thank you!

@asfgit asfgit closed this in 47ad185 May 9, 2017
robertwb pushed a commit to robertwb/incubator-beam that referenced this pull request Jun 5, 2018
melap pushed a commit to apache/beam that referenced this pull request Jun 20, 2018