Explore storing continuous ingest bulk import files in S3 #94

Open
keith-turner opened this issue Jul 15, 2019 · 3 comments

Comments

@keith-turner
Contributor

When running the bulk import continuous ingest test, it can take a while to generate a good bit of data before testing can start. Not sure, but it may be faster to generate a data set once and store it in S3. Then future test runs could possibly reuse that data set.

I think it would be interesting to experiment with this and, if it works well, add documentation to the bulk import test docs explaining how to do it. One gotcha with this approach is that anyone running a test needs to be consistent with split points. A simple way to address this problem would be to store a file of split points in S3 with the data.
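
For example (just a sketch, the bucket and dir names are placeholders matching the distcp notes later in this issue), the layout in S3 might look like:

s3a://$AWS_BUCKET/continuous-1000/
  splits.txt       # split points shared by every bulk import dir
  1/files/*.rf     # rfiles from the first bulk generation job
  2/files/*.rf     # rfiles from the second bulk generation job
  ...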

@keith-turner
Contributor Author

I suspect the procedure for this would be the following:

  • Generate data
    • Run lots of map reduce jobs to create multiple dirs for bulk import
    • Save split points to a file
    • distcp the bulk import dirs and the splits file from HDFS to an S3 bucket
  • Use data (see the sketch below)
    • distcp the bulk import dirs from S3 to HDFS
    • get the splits file from S3
    • split the table (initially a single tablet) using the splits file
    • import the data

https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html#S3A
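
A rough sketch of the "use data" side, assuming the same LIBJARS and credential variables as the distcp notes below and a continuous ingest table named ci; the Accumulo shell commands (addsplits -sf, importdirectory) may need adjusting for the Accumulo version in use:

# copy the bulk dirs and the splits file back from S3 to HDFS
hadoop distcp -libjars ${LIBJARS} -Dfs.s3a.access.key=$AWS_KEY -Dfs.s3a.secret.key=$AWS_SECRET s3a://$AWS_BUCKET/continuous-1000 hdfs://leader1:8020/tmp/bt

# make the splits file readable by the Accumulo shell
hadoop fs -copyToLocal /tmp/bt/splits.txt /tmp/splits.txt

# in the Accumulo shell: pre-split the single-tablet table, then bulk import each dir
#   addsplits -t ci -sf /tmp/splits.txt
#   table ci
#   importdirectory /tmp/bt/1/files true    # repeat for each bulk dir; syntax varies by Accumulo version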

@keith-turner
Contributor Author

Current bulk import test docs: bulk-test.md

@keith-turner
Contributor Author

Below are some notes from copying bulk data from HDFS to S3

# bulk files were generated into the /tmp/bt dir in HDFS

# prep the directory before distcp; assuming all splits files are the same, just keep one
hadoop fs -mv /tmp/bt/1/splits.txt /tmp/bt
hadoop fs -rm /tmp/bt/*/splits.txt
hadoop fs -rm /tmp/bt/*/files/_SUCCESS

# get the S3 libs on the local hadoop classpath
# edit the following file and set: export HADOOP_OPTIONAL_TOOLS="hadoop-aws"
vim $HADOOP_HOME/etc/hadoop/hadoop-env.sh 

# The remote map reduce jobs will need the S3 jars on the classpath; define the following for this. The jar versions may need to change for your version of Hadoop.
export LIBJARS=$HADOOP_HOME/share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.375.jar,$HADOOP_HOME/share/hadoop/tools/lib/hadoop-aws-3.2.0.jar

# the following command will distcp the files to the bucket
hadoop distcp -libjars ${LIBJARS} -Dfs.s3a.access.key=$AWS_KEY -Dfs.s3a.secret.key=$AWS_SECRET hdfs://leader1:8020/tmp/bt s3a://$AWS_BUCKET/continuous-1000
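
One way to sanity check the copy is to list the bucket with the same credentials (the S3A jars are already on the local classpath via HADOOP_OPTIONAL_TOOLS):

hadoop fs -Dfs.s3a.access.key=$AWS_KEY -Dfs.s3a.secret.key=$AWS_SECRET -ls s3a://$AWS_BUCKET/continuous-1000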
