[BEAM-2222] Migrate hadoop inputformat to website #235
aaltay wants to merge 2 commits into apache:asf-site
Conversation
Refer to this link for build results (access rights to CI server needed): Jenkins built the site at commit id f9d0e0e with Jekyll and staged it here. Happy reviewing. Note that any previous site has been deleted. This staged site will be automatically deleted after its TTL expires. Push any commit to the pull request branch or re-trigger the build to get it staged again.
melap left a comment:
I would suggest putting in Python tabs for each code block with some sort of "not applicable to Python" type message, to avoid the missing code block troubles if someone arrives at the page with Python chosen.
> You will need to pass a Hadoop Configuration with parameters specifying how the read will occur. Many properties of the Configuration are optional, and some are required for certain InputFormat classes, but the following properties must be set for all InputFormats:
>
> mapreduce.job.inputformat.class: The InputFormat class used to connect to your data source of choice.

This list isn't rendering as a list, so it's very hard to read.
One other general comment: missing code formatting for a lot of the class names/methods.
> A HadoopInputFormatIO is a Transform for reading data from any source which implements Hadoop InputFormat. For example- Cassandra, Elasticsearch, HBase, Redis, Postgres, etc.
A HadoopInputFormatIO is a transform for reading data from any source that implements Hadoop's InputFormat. For example, Cassandra, Elasticsearch, HBase, Redis, Postgres, etc.
> HadoopInputFormatIO has to make several performance trade-offs in connecting to InputFormat, so if there is another Beam IO Transform specifically for connecting to your data source of choice, we would recommend using that one, but this IO Transform allows you to connect to many data sources that do not yet have a Beam IO Transform.
HadoopInputFormatIO allows you to connect to many data sources that do not yet have a Beam IO transform. However, HadoopInputFormatIO has to make several performance trade-offs in connecting to InputFormat. So, if there is another Beam IO transform for connecting specifically to your data source of choice, we recommend you use that one.
> You will need to pass a Hadoop Configuration with parameters specifying how the read will occur. Many properties of the Configuration are optional, and some are required for certain InputFormat classes, but the following properties must be set for all InputFormats:
You will need to pass a Hadoop Configuration with parameters specifying how the read will occur. Many properties of the Configuration are optional and some are required for certain InputFormat classes, but the following properties must be set for all InputFormat classes:
> mapreduce.job.inputformat.class: The InputFormat class used to connect to your data source of choice.
> key.class: The key class returned by the InputFormat in 'mapreduce.job.inputformat.class'.
> value.class: The value class returned by the InputFormat in 'mapreduce.job.inputformat.class'.
- `mapreduce.job.inputformat.class` - The `InputFormat` class used to connect to your data source of choice.
- `key.class` - The `Key` class returned by the `InputFormat` in `mapreduce.job.inputformat.class`.
- `value.class` - The `Value` class returned by the `InputFormat` in `mapreduce.job.inputformat.class`.
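For illustration, the three required properties could be assembled like this. This is a minimal sketch: a plain `java.util.Map` stands in for a real `org.apache.hadoop.conf.Configuration` so the snippet needs no Hadoop dependency, and the Cassandra `CqlInputFormat` class names are used purely as an example source.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: a plain Map stands in for org.apache.hadoop.conf.Configuration.
// The values use CqlInputFormat (Cassandra) purely as an example source.
public class RequiredReadProperties {
    public static Map<String, String> build() {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("mapreduce.job.inputformat.class",
                "org.apache.cassandra.hadoop.cql3.CqlInputFormat");
        conf.put("key.class", "java.lang.Long");
        conf.put("value.class", "com.datastax.driver.core.Row");
        return conf;
    }

    public static void main(String[] args) {
        // prints: [mapreduce.job.inputformat.class, key.class, value.class]
        System.out.println(build().keySet());
    }
}
```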
> You will need to check to see if the key and value classes output by the InputFormat have a Beam Coder available. If not, You can use withKeyTranslation/withValueTranslation to specify a method transforming instances of those classes into another class that is supported by a Beam Coder. These settings are optional and you don't need to specify translation for both key and value.
You will need to check if the Key and Value classes output by the InputFormat have a Beam Coder available. If not, you can use withKeyTranslation or withValueTranslation to specify a method transforming instances of those classes into another class that is supported by a Beam Coder. These settings are optional and you don't need to specify translation for both key and value.
> To read data from Cassandra, org.apache.cassandra.hadoop.cql3.CqlInputFormat CqlInputFormat can be used which needs following properties to be set.
> Create Cassandra Hadoop configuration as follows:

Delete this line ("Create Cassandra Hadoop configuration as follows:")
> The CqlInputFormat key class is java.lang.Long Long, which has a Beam Coder. The CqlInputFormat value class is com.datastax.driver.core.Row Row, which does not have a Beam Coder. Rather than write a new coder, you can provide your own translation method as follows:
The CqlInputFormat key class is java.lang.Long Long, which has a Beam Coder. The CqlInputFormat value class is com.datastax.driver.core.Row Row, which does not have a Beam Coder. Rather than write a new coder, you can provide your own translation method, as follows:
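The shape of such a translation can be sketched without any Beam or Cassandra dependency. In real code, `withValueTranslation` takes a Beam `SimpleFunction`; here a plain `java.util.function.Function` stands in, and `FakeRow` is a hypothetical stand-in for `com.datastax.driver.core.Row`.

```java
import java.util.function.Function;

// Sketch only: Beam's withValueTranslation takes a SimpleFunction; a plain
// java.util.function.Function stands in here, and FakeRow is a hypothetical
// stand-in for com.datastax.driver.core.Row (which has no Beam Coder).
public class ValueTranslationSketch {
    // Hypothetical minimal Row-like type, for illustration only.
    static final class FakeRow {
        private final String firstColumn;
        FakeRow(String firstColumn) { this.firstColumn = firstColumn; }
        String getString(int index) { return firstColumn; }
    }

    // Translate the coder-less Row into a String, a type that does have
    // a Beam Coder (StringUtf8Coder).
    public static Function<FakeRow, String> valueTranslation() {
        return row -> row.getString(0);
    }

    public static void main(String[] args) {
        String translated = valueTranslation().apply(new FakeRow("y_id: 42"));
        System.out.println(translated); // prints: y_id: 42
    }
}
```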
> ### Elasticsearch - EsInputFormat
> To read data from Elasticsearch, EsInputFormat can be used which needs following properties to be set.
To read data from Elasticsearch, use EsInputFormat, which needs following properties to be set:
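An Elasticsearch read configuration could be sketched along the same lines. Again a plain `java.util.Map` stands in for a Hadoop `Configuration`; the `es.nodes` and `es.resource` keys are elasticsearch-hadoop connection settings, and the host, port, and index values shown are placeholders.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: a plain Map stands in for org.apache.hadoop.conf.Configuration.
// es.nodes / es.resource are elasticsearch-hadoop connection settings; the
// host, port, and index/type values below are placeholders.
public class EsReadProperties {
    public static Map<String, String> build() {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("mapreduce.job.inputformat.class",
                "org.elasticsearch.hadoop.mr.EsInputFormat");
        conf.put("key.class", "org.apache.hadoop.io.Text");
        conf.put("value.class", "org.elasticsearch.hadoop.mr.LinkedMapWritable");
        conf.put("es.nodes", "localhost:9200");      // placeholder host:port
        conf.put("es.resource", "my-index/my-type"); // placeholder index/type
        return conf;
    }

    public static void main(String[] args) {
        // prints: org.elasticsearch.hadoop.mr.LinkedMapWritable
        System.out.println(build().get("value.class"));
    }
}
```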
> Create ElasticSearch Hadoop configuration as follows:

Delete this line ("Create ElasticSearch Hadoop configuration as follows:")
> The org.elasticsearch.hadoop.mr.EsInputFormat EsInputFormat key class is org.apache.hadoop.io.Text Text and value class is org.elasticsearch.hadoop.mr.LinkedMapWritable LinkedMapWritable. Both key and value classes have Beam Coders.
The org.elasticsearch.hadoop.mr.EsInputFormat's EsInputFormat key class is org.apache.hadoop.io.Text Text, and its value class is org.elasticsearch.hadoop.mr.LinkedMapWritable LinkedMapWritable. Both key and value classes have Beam Coders.
Thank you! I believe I addressed all the comments, PTAL.
Refer to this link for build results (access rights to CI server needed): Jenkins built the site at commit id 89e8397 with Jekyll and staged it here. Happy reviewing. Note that any previous site has been deleted. This staged site will be automatically deleted after its TTL expires. Push any commit to the pull request branch or re-trigger the build to get it staged again.
Thank you!
R: @davorbonaci
cc: @radhika-kulkarni
(Base content: https://github.com/apache/beam/blob/master/sdks/java/io/hadoop/README.md)