s3booster-snowball.py: this script batches files and uploads them in parallel, so it is fast and simple to use, especially when dealing with small files.

aws-samples/s3booster-snowball

s3booster snowball

s3booster-snowball-v2.py batches files into tar archives and uploads them in parallel, so it is fast and simple to use, especially when dealing with many small files. If slow small-file uploads are giving you a headache, it may be the StimPack you need! s3booster provides two features: 1) it accelerates the ingestion of small files onto Snowball, and 2) it archives files into large tar files on Amazon S3, which improves upload performance and reduces management cost.
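To illustrate the batching idea, here is a minimal Python sketch that groups files into batches bounded by a maximum size, one batch per tar file. It is not the script's actual implementation; the function name `batch_by_size` and the sample sizes are hypothetical.

```python
from typing import Iterable, Iterator, List, Tuple


def batch_by_size(files: Iterable[Tuple[str, int]], max_bytes: int) -> Iterator[List[str]]:
    """Group (path, size) pairs into batches whose total size does not exceed max_bytes.

    A single file larger than max_bytes ends up in a batch of its own.
    """
    batch: List[str] = []
    total = 0
    for path, size in files:
        # Close the current batch before it would grow past the limit.
        if batch and total + size > max_bytes:
            yield batch
            batch, total = [], 0
        batch.append(path)
        total += size
    if batch:
        yield batch


# Hypothetical file names and sizes in bytes.
files = [("a", 400), ("b", 500), ("c", 300), ("d", 1200), ("e", 100)]
batches = list(batch_by_size(files, max_bytes=1000))
print(batches)
```

Each batch could then be written to one tar file and uploaded by a separate worker process, which is roughly what --max_process and --max_tarfile_size control.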

How to Use

Here is an example of how to run s3booster-snowball-v2.py. You can also refer to the run-s3booster-sbe.sh and run-s3booster-archive.sh shell scripts.

For Snowball usage:

python3 s3booster-snowball-v2.py --bucket_name your-own-bucket --src_dir /data/fs1/ --endpoint https://s3.ap-northeast-2.amazonaws.com --profile_name sbe1 --prefix_root fs3/ --max_process 5 --max_tarfile_size $((1*(1024**3))) --symlinkdir no

For archiving usage:

python3 s3booster-snowball-v2.py --bucket_name your-own-bucket --src_dir /data/fs1/ --endpoint https://s3.ap-northeast-2.amazonaws.com --profile_name sbe1 --max_process 5 --max_tarfile_size $((1*(1024**3))) --no_extract 'yes' --target_file_prefix 'new_s3_path/' --storage_class 'GLACIER_IR'

Here is the help output:

ec2-user$ python3 s3booster-snowball-v2.py --help
usage: s3booster-snowball-v2.py [-h] --bucket_name BUCKET_NAME --src_dir SRC_DIR --endpoint ENDPOINT [--profile_name PROFILE_NAME]
                                [--prefix_root PREFIX_ROOT] [--max_process MAX_PROCESS] [--max_tarfile_size MAX_TARFILE_SIZE]
                                [--compression COMPRESSION] [--no_extract NO_EXTRACT] [--target_file_prefix TARGET_FILE_PREFIX]
                                [--storage_class STORAGE_CLASS]

optional arguments:
  -h, --help            show this help message and exit
  --bucket_name BUCKET_NAME
                        your bucket name e) your-bucket
  --src_dir SRC_DIR     source directory e) /data/dir1/
  --endpoint ENDPOINT   snowball endpoint e) http://10.10.10.10:8080 or https://s3.ap-northeast-2.amazonaws.com
  --profile_name PROFILE_NAME
                        aws_profile_name e) sbe1
  --prefix_root PREFIX_ROOT
                        prefix root e) dir1/
  --max_process MAX_PROCESS
                        NUM e) 5
  --max_tarfile_size MAX_TARFILE_SIZE
                        NUM bytes e) $((1*(1024**3))) #1GB for < total 50GB, 10GB for >total 50GB
  --compression COMPRESSION
                        specify gz to enable
  --no_extract NO_EXTRACT
                        yes|no; Do not set the autoextract flag
  --target_file_prefix TARGET_FILE_PREFIX
                        prefix of the target file we are creating into the snowball
  --storage_class STORAGE_CLASS
                        specify S3 classes, be cautious Snowball support only STANDARD class; StorageClass=STANDARD|REDUCED_REDUNDANCY|STANDARD_I
                        A|ONEZONE_IA|INTELLIGENT_TIERING|GLACIER|DEEP_ARCHIVE|OUTPOSTS|GLACIER_IR
  --symlinkdir          yes|no; if you want to follow symbolic link dir, type 'yes', default is 'no'
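The --max_tarfile_size value is given in bytes, and the help text suggests 1GB for data sets under 50GB in total and 10GB for larger ones. The following sketch encodes that guideline; the function name `suggest_max_tarfile_size` is hypothetical, and treating exactly 50GB as "large" is an assumption, since the help text does not cover that case.

```python
GIB = 1024 ** 3  # same value as the shell expression $((1*(1024**3)))


def suggest_max_tarfile_size(total_bytes: int) -> int:
    """Return 1 GiB for data sets under 50 GiB in total, otherwise 10 GiB."""
    return 1 * GIB if total_bytes < 50 * GIB else 10 * GIB


print(suggest_max_tarfile_size(10 * GIB))   # small data set
print(suggest_max_tarfile_size(200 * GIB))  # large data set
```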

Executing Script

Here is sample output from a run:

sh run-s3booster-sbe.sh
multi part uploading:  1 / 11 , size: 104884733 bytes
multi part uploading:  1 / 11 , size: 104884714 bytes
multi part uploading:  1 / 11 , size: 104869657 bytes
multi part uploading:  1 / 11 , size: 104884786 bytes
multi part uploading:  1 / 11 , size: 104883288 bytes
multi part uploading:  1 / 11 , size: 104868660 bytes
multi part uploading:  1 / 11 , size: 104867541 bytes
... omitted
... omitted
snowball-20210810_152400-7Y5EPP.tgz is uploaded successfully

multi part uploading:  7 / 11 , size: 104866395 bytes
multi part uploading:  8 / 11 , size: 104862516 bytes
multi part uploading:  9 / 11 , size: 104890119 bytes
multi part uploading:  10 / 11 , size: 104866477 bytes
metadata info: {'ResponseMetadata': {'RequestId': '3X9ZKZA90YRQ98SC', 'HostId': 'YcmBg0Syf9pEbRjMPdorhyIZgckXsz8xliXagtZxDp8gasK4TDwgG98g6rrHxTy8F6fKEOQ3/+4=', 'HTTPStatusCode': 200, 'HTTPHeaders': {'x-amz-id-2': 'YcmBg0Syf9pEbRjMPdorhyIZgckXsz8xliXagtZxDp8gasK4TDwgG98g6rrHxTy8F6fKEOQ3/+4=', 'x-amz-request-id': '3X9ZKZA90YRQ98SC', 'date': 'Tue, 10 Aug 2021 15:26:28 GMT', 'last-modified': 'Tue, 10 Aug 2021 15:25:24 GMT', 'etag': '"06aa2906ce7dbf864d64ff828d615c65-11"', 'x-amz-meta-snowball-auto-extract': 'true', 'accept-ranges': 'bytes', 'content-type': 'binary/octet-stream', 'server': 'AmazonS3', 'content-length': '1077720331'}, 'RetryAttempts': 0}, 'AcceptRanges': 'bytes', 'LastModified': datetime.datetime(2021, 8, 10, 15, 25, 24, tzinfo=tzutc()), 'ContentLength': 1077720331, 'ETag': '"06aa2906ce7dbf864d64ff828d615c65-11"', 'ContentType': 'binary/octet-stream', 'Metadata': {'snowball-auto-extract': 'true'}}

snowball-20210810_152400-MRCMA5.tgz is uploaded successfully

====================================
Duration: 0:02:27.091026
Total File numbers: 503004
S3 Endpoint: https://s3.ap-northeast-2.amazonaws.com
End

Checking the logs

Log Directory: ./log/

  • error-{date}.log: each file that failed to be added to a tar file is logged here
  • success-{date}.log: success messages are logged here
  • filelist-{date}.log: every archived file is logged here

File Path

If you want to change the path of the extracted objects, specify prefix_root.

If you want to change the tar file's path on S3, specify target_file_prefix. When you use target_file_prefix, don't forget the trailing '/', as in 'newpath/'.
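The sketch below shows how these two prefixes could shape the resulting keys. It assumes that the object key is prefix_root plus the file's path relative to src_dir, and that the tar file key is target_file_prefix concatenated with the tar file name; both are assumptions about the script's behavior, and the function names and the tar file name are hypothetical.

```python
import os


def object_key(src_dir: str, prefix_root: str, file_path: str) -> str:
    """Key of an extracted object: prefix_root + path relative to src_dir (assumed)."""
    rel = os.path.relpath(file_path, start=src_dir)
    return prefix_root + rel.replace(os.sep, "/")


def tarfile_key(target_file_prefix: str, tar_name: str) -> str:
    """Key of the tar file itself: plain concatenation (assumed)."""
    return target_file_prefix + tar_name


print(object_key("/data2/fs1/", "fs3/", "/data2/fs1/d0011/dir0009/file0441"))
print(tarfile_key("newpath/", "snowball-example.tar"))  # intended layout
print(tarfile_key("newpath", "snowball-example.tar"))   # missing '/' fuses the names
```

Under this assumption, leaving out the trailing '/' produces a key such as 'newpathsnowball-example.tar' instead of placing the tar file under 'newpath/'.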

Caveat

metadata, snowball-auto-extract

--no_extract 'no': if you are moving data to Snowball Edge, use "--no_extract 'no'" so that the 'snowball-auto-extract=true' metadata is set on each tar file. This metadata causes the contents of the archives to be extracted automatically when the data is imported into Amazon S3. Use "--no_extract 'yes'" when the tar files should stay archived, as in the archiving example. You can confirm the metadata in 'success-[date].log'.
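As a quick check, the metadata can be read from the response dictionary that the script prints as "metadata info" (it has the shape of an S3 HeadObject response). The helper below is a hypothetical sketch that only inspects such a dictionary; the sample response is trimmed from the execution output shown earlier.

```python
def auto_extract_enabled(head_response: dict) -> bool:
    """Return True if the response carries snowball-auto-extract=true in its metadata."""
    return head_response.get("Metadata", {}).get("snowball-auto-extract") == "true"


# Trimmed copy of the "metadata info" output shown above.
response = {
    "ContentLength": 1077720331,
    "ContentType": "binary/octet-stream",
    "Metadata": {"snowball-auto-extract": "true"},
}
print(auto_extract_enabled(response))
```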

Don't include './' path in src_dir parameter

In Unix/Linux, './' means the current directory, so it is tempting to use it. However, if you use it in the '--src_dir' parameter, a '.' prefix is added to the object keys in S3.

For example, "--src_dir './d001/dir001'" creates keys such as "s3://[bucket_name]/./d001/dir001/file.1".
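One way to avoid this is to normalize the path before passing it to --src_dir. The sketch below uses os.path.normpath for that; the function name `clean_src_dir` is hypothetical, and adding the trailing separator back is an assumption based on the examples in this README, which all end src_dir with '/'.

```python
import os


def clean_src_dir(src_dir: str) -> str:
    """Drop a leading './' and other redundant segments, keeping a trailing separator."""
    normalized = os.path.normpath(src_dir)
    # os.path.join(path, "") appends a separator only if one is missing.
    return os.path.join(normalized, "")


print(clean_src_dir("./d001/dir001"))
print(clean_src_dir("/data/fs1/"))
```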

Symbolic link directories supported (2022.06.09)

The --symlinkdir 'yes' option has been added so that symbolic links to directories are followed; the default is 'no'. Use it with care, because following links can cause an infinite loop.

Also, broken symbolic links are now ignored, and valid symbolic links to files are included in the tar file.

Reference: https://docs.python.org/3/library/os.html#os.walk

Note: Be aware that setting followlinks to True can lead to infinite recursion if a link points to a parent directory of itself. walk() does not keep track of the directories it visited already.
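For illustration, here is one way to follow directory links with os.walk while guarding against such loops, by remembering the real path of every directory already visited. This is a sketch of the general technique, not what s3booster does internally; the function name `walk_files_following_links` is hypothetical.

```python
import os
from typing import Iterator


def walk_files_following_links(top: str) -> Iterator[str]:
    """Yield file paths under top, following directory symlinks without looping."""
    visited = set()
    for dirpath, dirnames, filenames in os.walk(top, followlinks=True):
        real = os.path.realpath(dirpath)
        if real in visited:
            # Directory already seen through another path: do not descend again.
            dirnames[:] = []
            continue
        visited.add(real)
        for name in filenames:
            path = os.path.join(dirpath, name)
            # os.path.exists() is False for broken symbolic links, so they are skipped.
            if os.path.exists(path):
                yield path
```

A directory that is reachable through several links is listed only once with this approach, which may or may not be what you want for an archive.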

Search files with s3select

When you archive files on S3, you need to know which tar file contains the file you want to retrieve.

s3select.sh searches for a keyword in the filelist log on S3 and reports the matching tar files.
The filelist log (filelist-{date}.log) is generated by s3booster after it uploads the tar files to S3. You can find it under s3://[bucket]/log/.

Here is the result.

[archiver]$ sh s3select.sh
new_s3_path/snowball-20220329_050947-VKNI2Q.tar ,/data2/fs1/d0011/dir0009/file0441 ,fs3/d0011/dir0009/file0441 ,26726
new_s3_path/snowball-20220329_050947-QV93SP.tar ,/data2/fs1/d0006/dir0009/file0441 ,fs3/d0006/dir0009/file0441 ,33763
new_s3_path/snowball-20220329_050947-YHKQ51.tar ,/data2/fs1/d0001/dir0009/file0441 ,fs3/d0001/dir0009/file0441 ,17378
new_s3_path/snowball-20220329_050947-QV93SP.tar ,/data2/fs1/d0007/dir0009/file0441 ,fs3/d0007/dir0009/file0441 ,17968
new_s3_path/snowball-20220329_050947-R0X1MB.tar ,/data2/fs1/d0022/dir0009/file0441 ,fs3/d0022/dir0009/file0441 ,22852

===== TAR Files containing dir0009/file0441 =====
new_s3_path/snowball-20220329_050947-QV93SP.tar
new_s3_path/snowball-20220329_050947-R0X1MB.tar
new_s3_path/snowball-20220329_050947-VKNI2Q.tar
new_s3_path/snowball-20220329_050947-YHKQ51.tar

Here is the s3select.sh script.

#!/bin/bash
# Search the s3booster filelist log on S3 with S3 Select and list the tar files that contain a keyword.
bucket="your-own-bucket"                        # S3 bucket name
key="log/filelist-20220329_050947.log"          # filelist log generated by s3booster
keyword="dir0009/file0441"                      # keyword or filename to search for
limitNum="100"                                  # maximum number of matching rows to print
tmpfile="/tmp/temp-s3select.log"                # output file

# The query lowercases column 2 (the source path), so lowercase the keyword as well.
keyword_lc=$(printf '%s' "$keyword" | tr '[:upper:]' '[:lower:]')

aws s3api select-object-content \
    --bucket "$bucket" \
    --key "$key" \
    --expression "SELECT * FROM s3object s WHERE LOWER(s._2) LIKE '%${keyword_lc}%' LIMIT $limitNum" \
    --expression-type 'SQL' \
    --input-serialization '{"CSV": {"FieldDelimiter": ","}}' \
    --output-serialization '{"CSV": {"FieldDelimiter": ","}}' "$tmpfile"

cat "$tmpfile"
echo ""
echo "===== TAR Files containing $keyword ====="
# The first whitespace-separated field of each row is the tar file key.
awk '{print $1}' "$tmpfile" | sort -u
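
If the filelist log has already been downloaded, the same lookup can be sketched locally in Python. This assumes the row layout visible in the result above (tar file key, source path, object key, size, separated by commas) and that paths contain no commas; the function name `tarfiles_containing` is hypothetical.

```python
from typing import Iterable, List


def tarfiles_containing(lines: Iterable[str], keyword: str) -> List[str]:
    """Return the sorted, de-duplicated tar file keys whose source path contains keyword."""
    tars = set()
    for line in lines:
        fields = [field.strip() for field in line.split(",")]
        # Column 1 is the tar file key, column 2 the source path (case-insensitive match).
        if len(fields) >= 2 and keyword.lower() in fields[1].lower():
            tars.add(fields[0])
    return sorted(tars)


# Rows copied from the sample result above.
rows = [
    "new_s3_path/snowball-20220329_050947-VKNI2Q.tar ,/data2/fs1/d0011/dir0009/file0441 ,fs3/d0011/dir0009/file0441 ,26726",
    "new_s3_path/snowball-20220329_050947-QV93SP.tar ,/data2/fs1/d0006/dir0009/file0441 ,fs3/d0006/dir0009/file0441 ,33763",
    "new_s3_path/snowball-20220329_050947-QV93SP.tar ,/data2/fs1/d0007/dir0009/file0441 ,fs3/d0007/dir0009/file0441 ,17968",
]
print(tarfiles_containing(rows, "dir0009/file0441"))
```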
