[BEAM-2222] Migrate hadoop inputformat to website #235
aaltay wants to merge 2 commits into apache:asf-site
Conversation
Refer to this link for build results (access rights to CI server needed): Jenkins built the site at commit id f9d0e0e with Jekyll and staged it here. Happy reviewing. Note that any previous site has been deleted. This staged site will be automatically deleted after its TTL expires. Push any commit to the pull request branch or re-trigger the build to get it staged again.
melap left a comment:
I would suggest putting in Python tabs for each code block with some sort of "not applicable to Python" type message, to avoid the missing code block troubles if someone arrives at the page with Python chosen.
> You will need to pass a Hadoop Configuration with parameters specifying how the read will occur. Many properties of the Configuration are optional, and some are required for certain InputFormat classes, but the following properties must be set for all InputFormats:
>
> mapreduce.job.inputformat.class: The InputFormat class used to connect to your data source of choice.

This list isn't rendering as a list, so it's very hard to read.
One other general comment: missing code formatting for a lot of the class names/methods.
> A HadoopInputFormatIO is a Transform for reading data from any source which implements Hadoop InputFormat. For example- Cassandra, Elasticsearch, HBase, Redis, Postgres, etc.
A HadoopInputFormatIO is a transform for reading data from any source that implements Hadoop's InputFormat. For example, Cassandra, Elasticsearch, HBase, Redis, Postgres, etc.
> HadoopInputFormatIO has to make several performance trade-offs in connecting to InputFormat, so if there is another Beam IO Transform specifically for connecting to your data source of choice, we would recommend using that one, but this IO Transform allows you to connect to many data sources that do not yet have a Beam IO Transform.
HadoopInputFormatIO allows you to connect to many data sources that do not yet have a Beam IO transform. However, HadoopInputFormatIO has to make several performance trade-offs in connecting to InputFormat. So, if there is another Beam IO transform for connecting specifically to your data source of choice, we recommend you use that one.
> You will need to pass a Hadoop Configuration with parameters specifying how the read will occur. Many properties of the Configuration are optional, and some are required for certain InputFormat classes, but the following properties must be set for all InputFormats:
You will need to pass a Hadoop Configuration with parameters specifying how the read will occur. Many properties of the Configuration are optional and some are required for certain InputFormat classes, but the following properties must be set for all InputFormat classes:
> mapreduce.job.inputformat.class: The InputFormat class used to connect to your data source of choice.
> key.class: The key class returned by the InputFormat in 'mapreduce.job.inputformat.class'.
> value.class: The value class returned by the InputFormat in 'mapreduce.job.inputformat.class'.
- `mapreduce.job.inputformat.class` - The `InputFormat` class used to connect to your data source of choice.
- `key.class` - The `Key` class returned by the `InputFormat` in `mapreduce.job.inputformat.class`.
- `value.class` - The `Value` class returned by the `InputFormat` in `mapreduce.job.inputformat.class`.
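For illustration, the three required properties could be assembled like this. This is a minimal sketch: a plain `java.util.Map` stands in for a real `org.apache.hadoop.conf.Configuration` so the snippet needs no Hadoop dependency, and the Cassandra `CqlInputFormat` class names are used purely as an example source.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: a plain Map stands in for org.apache.hadoop.conf.Configuration.
// The values use CqlInputFormat (Cassandra) purely as an example source.
public class RequiredReadProperties {
    public static Map<String, String> build() {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("mapreduce.job.inputformat.class",
                "org.apache.cassandra.hadoop.cql3.CqlInputFormat");
        conf.put("key.class", "java.lang.Long");
        conf.put("value.class", "com.datastax.driver.core.Row");
        return conf;
    }

    public static void main(String[] args) {
        // prints: [mapreduce.job.inputformat.class, key.class, value.class]
        System.out.println(build().keySet());
    }
}
```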
> You will need to check to see if the key and value classes output by the InputFormat have a Beam Coder available. If not, You can use withKeyTranslation/withValueTranslation to specify a method transforming instances of those classes into another class that is supported by a Beam Coder. These settings are optional and you don't need to specify translation for both key and value.
You will need to check if the Key and Value classes output by the InputFormat have a Beam Coder available. If not, you can use withKeyTranslation or withValueTranslation to specify a method transforming instances of those classes into another class that is supported by a Beam Coder. These settings are optional and you don't need to specify translation for both key and value.
> To read data from Cassandra, org.apache.cassandra.hadoop.cql3.CqlInputFormat CqlInputFormat can be used which needs following properties to be set.
> Create Cassandra Hadoop configuration as follows:

Delete this line ("Create Cassandra Hadoop configuration as follows:")
> The CqlInputFormat key class is java.lang.Long Long, which has a Beam Coder. The CqlInputFormat value class is com.datastax.driver.core.Row Row, which does not have a Beam Coder. Rather than write a new coder, you can provide your own translation method as follows:
The CqlInputFormat key class is java.lang.Long Long, which has a Beam Coder. The CqlInputFormat value class is com.datastax.driver.core.Row Row, which does not have a Beam Coder. Rather than write a new coder, you can provide your own translation method, as follows:
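The shape of such a translation can be sketched without any Beam or Cassandra dependency. In real code, `withValueTranslation` takes a Beam `SimpleFunction`; here a plain `java.util.function.Function` stands in, and `FakeRow` is a hypothetical stand-in for `com.datastax.driver.core.Row`.

```java
import java.util.function.Function;

// Sketch only: Beam's withValueTranslation takes a SimpleFunction; a plain
// java.util.function.Function stands in here, and FakeRow is a hypothetical
// stand-in for com.datastax.driver.core.Row (which has no Beam Coder).
public class ValueTranslationSketch {
    // Hypothetical minimal Row-like type, for illustration only.
    static final class FakeRow {
        private final String firstColumn;
        FakeRow(String firstColumn) { this.firstColumn = firstColumn; }
        String getString(int index) { return firstColumn; }
    }

    // Translate the coder-less Row into a String, a type that does have
    // a Beam Coder (StringUtf8Coder).
    public static Function<FakeRow, String> valueTranslation() {
        return row -> row.getString(0);
    }

    public static void main(String[] args) {
        String translated = valueTranslation().apply(new FakeRow("y_id: 42"));
        System.out.println(translated); // prints: y_id: 42
    }
}
```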
> ### Elasticsearch - EsInputFormat
> To read data from Elasticsearch, EsInputFormat can be used which needs following properties to be set.
To read data from Elasticsearch, use EsInputFormat, which needs following properties to be set:
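An Elasticsearch read configuration could be sketched along the same lines. Again a plain `java.util.Map` stands in for a Hadoop `Configuration`; the `es.nodes` and `es.resource` keys are elasticsearch-hadoop connection settings, and the host, port, and index values shown are placeholders.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: a plain Map stands in for org.apache.hadoop.conf.Configuration.
// es.nodes / es.resource are elasticsearch-hadoop connection settings; the
// host, port, and index/type values below are placeholders.
public class EsReadProperties {
    public static Map<String, String> build() {
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("mapreduce.job.inputformat.class",
                "org.elasticsearch.hadoop.mr.EsInputFormat");
        conf.put("key.class", "org.apache.hadoop.io.Text");
        conf.put("value.class", "org.elasticsearch.hadoop.mr.LinkedMapWritable");
        conf.put("es.nodes", "localhost:9200");      // placeholder host:port
        conf.put("es.resource", "my-index/my-type"); // placeholder index/type
        return conf;
    }

    public static void main(String[] args) {
        // prints: org.elasticsearch.hadoop.mr.LinkedMapWritable
        System.out.println(build().get("value.class"));
    }
}
```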
> Create ElasticSearch Hadoop configuration as follows:

Delete this line ("Create ElasticSearch Hadoop configuration as follows:")
> The org.elasticsearch.hadoop.mr.EsInputFormat EsInputFormat key class is org.apache.hadoop.io.Text Text and value class is org.elasticsearch.hadoop.mr.LinkedMapWritable LinkedMapWritable. Both key and value classes have Beam Coders.
The org.elasticsearch.hadoop.mr.EsInputFormat's EsInputFormat key class is org.apache.hadoop.io.Text Text, and its value class is org.elasticsearch.hadoop.mr.LinkedMapWritable LinkedMapWritable. Both key and value classes have Beam Coders.
Thank you! I believe I addressed all the comments, PTAL.
Refer to this link for build results (access rights to CI server needed): Jenkins built the site at commit id 89e8397 with Jekyll and staged it here. Happy reviewing. Note that any previous site has been deleted. This staged site will be automatically deleted after its TTL expires. Push any commit to the pull request branch or re-trigger the build to get it staged again.
Thank you!
R: @davorbonaci
cc: @radhika-kulkarni
(Base content: https://github.com/apache/beam/blob/master/sdks/java/io/hadoop/README.md)