
PoC backup tool to S3 storage for Linux. It supports file/chunk level dedupe, file versioning and Python asyncio for maximum concurrency with small footprint.


achiwa912/bus3

Table of Contents

  1. bus3 - backup to S3
    1. Overview
    2. Getting started
      1. Prerequisites
      2. Installation
      3. FYI; Postgres config for Fedora/CentOS
      4. Configuration file
      5. Usage
  2. License
  3. Contact
  4. Acknowledgements
  5. Appendix A; Past improvements
  6. Appendix B; Performance testing
    1. Small random files (4KB)
    2. Large random files (1 or 4GB)

bus3 - backup to S3

bus3.py is an experimental tool for backing up to S3 storage. It uses asyncio throughout to maximize concurrency with a small footprint, and relies on the aiofiles, asyncpg and aioboto3 libraries.

Important notice - bus3 is still under development (experimental) and may or may not work for now.

Overview

bus3 is designed to be able to:

  • back up files, directories and symbolic/hard links
  • preserve extended attributes
  • track backup history and file versions
  • perform file or chunk (default 64MB) level dedupe
  • back up very large files without using up all the memory
  • handle a large number of files without using up memory
  • maximize concurrency with asyncio (coroutines)
    • spawn an async task for each file or directory to back up
    • spawn an async task for each object write to S3
  • support PostgreSQL as opposed to sqlite3 to avoid the global write lock
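
The task-per-file idea in the list above can be sketched roughly as follows. This is a minimal illustration and not bus3's actual code: back_up_one, back_up_all and MAX_TASKS are hypothetical names, and the semaphore limit is an assumption loosely modeled on the "max S3 tasks" setting in Appendix B.

```python
import asyncio

MAX_TASKS = 4  # hypothetical limit; not a real bus3 setting


async def back_up_one(path: str, limit: asyncio.Semaphore, done: list) -> None:
    """Pretend to back up one path while holding a semaphore slot."""
    async with limit:
        await asyncio.sleep(0)  # stands in for real file and S3 I/O
        done.append(path)


async def back_up_all(paths: list) -> list:
    """Spawn one task per path and wait for all of them to finish."""
    limit = asyncio.Semaphore(MAX_TASKS)
    done: list = []
    tasks = [asyncio.create_task(back_up_one(p, limit, done)) for p in paths]
    await asyncio.gather(*tasks)
    return done


result = asyncio.run(back_up_all([f"file{i}" for i in range(10)]))
```

Every path gets its own task right away, but at most MAX_TASKS of them do I/O at the same time, which is one way to keep the footprint small.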

bus3 splits large files into chunks and stores them as separate objects in S3 storage. It stores file metadata in the database. The database needs to be backed up separately after each backup.
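
Chunk-level dedupe could look something like the sketch below. It assumes chunks are identified by a SHA-256 digest and uses a dict in place of the S3 bucket. The hash choice and the names iter_chunks and dedupe_chunks are assumptions for illustration, not bus3's actual scheme.

```python
import hashlib
import io

CHUNK_SIZE = 64 * 1024 * 1024  # 64MB, the default chunk size named above


def iter_chunks(stream, chunk_size=CHUNK_SIZE):
    """Yield successive chunks so only one chunk is in memory at a time."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield chunk


def dedupe_chunks(stream, store, chunk_size=CHUNK_SIZE):
    """Store each unseen chunk under its digest; return the file's digest list."""
    digests = []
    for chunk in iter_chunks(stream, chunk_size):
        digest = hashlib.sha256(chunk).hexdigest()  # assumed hash choice
        if digest not in store:  # only new content would be uploaded
            store[digest] = chunk
        digests.append(digest)
    return digests


store = {}  # stands in for the S3 bucket
data = io.BytesIO(b"aaaa" + b"bbbb" + b"aaaa")
recipe = dedupe_chunks(data, store, chunk_size=4)  # tiny chunks for the demo
```

Here the file has three chunks but only two distinct ones, so store ends up with two objects while recipe (the kind of metadata that would go in the database) lists three digests.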

Getting started

Prerequisites

  • S3 storage
    • Not tested with Amazon AWS S3 (yet)
  • Linux
    • Developed on Fedora 33 and CentOS 8
  • Python 3.8 or later
  • bus3.py - the backup tool
  • bus3.yaml - config file
  • May need root privileges to execute

Installation

  1. Prepare S3 storage and a dedicated bucket for bus3.py
  2. Set up Python 3.8 or later
  3. Set up Postgres and create a database named bus3
  4. Install aiofiles
  5. Install aioboto3==8.3.0 (the newer 9.0 release did not seem to work)
  6. Install asyncpg
  7. Install pyyaml
  8. Edit bus3.yaml to set the S3 storage endpoint, bucket name and directory to back up
  9. Set up ~/.aws/credentials (e.g., with the AWS CLI)
  10. Run python bus3.py -b to back up

FYI; Postgres config for Fedora/CentOS

https://fedoraproject.org/wiki/PostgreSQL#Installation

  1. sudo dnf install postgresql-server

  2. sudo vi /var/lib/pgsql/data/pg_hba.conf

    host all all 127.0.0.1/32 md5

  3. sudo postgresql-setup --initdb

  4. sudo systemctl start postgresql

  5. sudo su - postgres

  6. createdb bus3

  7. psql

    ALTER USER postgres PASSWORD '';

Configuration file

bus3.yaml is the configuration file.

root_dir: /<path-to-backup-directory>
s3_config:
  s3_bucket: <bucket name>
  s3_endpoint: https://<S3-storage-URL>:<port>
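
As a rough sketch, a file with this layout could be read and checked as shown below. validate_config and load_config are hypothetical helpers, not functions from bus3.py, and the example values are placeholders. Only yaml.safe_load is a real PyYAML call.

```python
def validate_config(cfg: dict) -> dict:
    """Check that the keys shown above exist and return them flattened."""
    missing = [k for k in ("root_dir", "s3_config") if k not in cfg]
    s3 = cfg.get("s3_config") or {}
    missing += [f"s3_config.{k}" for k in ("s3_bucket", "s3_endpoint") if k not in s3]
    if missing:
        raise ValueError("missing config keys: " + ", ".join(missing))
    return {
        "root_dir": cfg["root_dir"],
        "bucket": s3["s3_bucket"],
        "endpoint": s3["s3_endpoint"],
    }


def load_config(path: str = "bus3.yaml") -> dict:
    """Read the YAML file and validate it (requires PyYAML)."""
    import yaml  # imported here so validate_config works without PyYAML

    with open(path) as f:
        return validate_config(yaml.safe_load(f))


# Placeholder values for illustration only.
example = validate_config(
    {
        "root_dir": "/home/test/data",
        "s3_config": {
            "s3_bucket": "bus3-bucket",
            "s3_endpoint": "https://s3.example.local:9000",
        },
    }
)
```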

Usage

To back up:

python bus3.py -b

To see backup history/list:

python bus3.py [-l]

Example output:

(bus3) [test@localhost bus3]$ python bus3.py -l
  #: date & time         backup root directory
  0: 2021-06-24 15:31:01 /home/test/py/bus3/test
  1: 2021-06-24 15:57:25 /home/test/py/bus3/test
  2: 2021-06-24 16:26:53 /home/test/py/bus3/test
  3: 2021-06-24 22:34:11 /home/test/py/bus3/test
  4: 2021-06-25 07:26:45 /home/test/py/bus3/test
  5: 2021-06-25 07:31:05 /home/test/py/bus3/test
  6: 2021-06-25 07:41:52 /home/test/py/bus3/test
07:46:42,292 INFO: Completed or gracefully terminated

The # column is the backup history number (or scan counter).

To restore directory/file:

python bus3.py -r all|<file/directory-to-restore> <directory-to-be-restored> [<backup-history-number>]

<file/directory-to-restore> can be given either as a full path (i.e., starting with /) or as a path relative to the backup root directory specified in bus3.yaml. If all is specified, bus3 will restore all backed-up files and directories. (Most testing so far has used all.)

If <backup-history-number> is not specified, bus3 will restore the latest version.
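
The argument rules above can be illustrated with a short sketch. resolve_restore_target and pick_version are hypothetical helpers, not taken from bus3.py, and the sketch assumes that "latest" means the highest backup history number.

```python
import posixpath


def resolve_restore_target(target: str, root_dir: str) -> str:
    """Map the -r argument to a full path under the rules described above."""
    if target == "all":
        return root_dir  # restore everything under the backup root
    if target.startswith("/"):
        return target  # already a full path
    return posixpath.join(root_dir, target)  # relative to the backup root


def pick_version(history_numbers, requested=None):
    """Return the requested history number, or the latest if none is given."""
    numbers = list(history_numbers)
    if requested is None:
        return max(numbers)  # assumption: latest == highest number
    if requested not in numbers:
        raise ValueError(f"no backup with history number {requested}")
    return requested


root = "/home/test/py/bus3/test"
full = resolve_restore_target("docs/a.txt", root)
latest = pick_version(range(7))
```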

Important: Make sure to back up the database after each backup of files/directories with bus3.py.

License

bus3.py is released under the MIT license.

Contact

Kyosuke Achiwa - @kyos_achwan - achiwa912+gmail.com (please replace + with @)

Project Link: https://github.com/achiwa912/bus3

Acknowledgements

TBD

Appendix A; Past improvements

improvement                      supported  comment
Switch from sqlite3 to postgres  yes
Create DB pool                   yes
Create S3 client pool            yes
Reduce local file reads          no         Performance didn't change but increased memory utilization
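
For readers unfamiliar with the pooling idea, one generic way to build a fixed-size pool on top of asyncio.Queue is sketched below. It is not how bus3 implements its DB or S3 client pools. The Pool class and the string placeholders standing in for clients are assumptions for illustration.

```python
import asyncio
from contextlib import asynccontextmanager


class Pool:
    """A fixed-size pool: callers wait while every resource is checked out."""

    def __init__(self, resources):
        self._queue = asyncio.Queue()
        for resource in resources:
            self._queue.put_nowait(resource)

    @asynccontextmanager
    async def acquire(self):
        resource = await self._queue.get()  # waits if the pool is empty
        try:
            yield resource
        finally:
            self._queue.put_nowait(resource)  # always hand it back


async def demo():
    pool = Pool(["client-a", "client-b"])  # placeholders for real S3 clients
    in_use = 0
    peak = 0
    used = []

    async def worker(n):
        nonlocal in_use, peak
        async with pool.acquire() as client:
            in_use += 1
            peak = max(peak, in_use)
            await asyncio.sleep(0)  # stands in for an S3 request
            used.append((n, client))
            in_use -= 1

    await asyncio.gather(*(worker(n) for n in range(6)))
    return peak, used


peak, used = asyncio.run(demo())
```

Six workers share two placeholder clients, so no more than two are ever in use at once.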

Appendix B; Performance testing

Performance tests were conducted in a local environment with locally connected S3 storage (i.e., not Amazon AWS).

Small random files (4KB)

Backed up and restored 1000 4KB random files in a directory.

S3 pool size  max S3 tasks  max DB tasks  backup (files/sec)  restore (files/sec)
150           150           96            45.2                59.9
150           150           150           61.1                59.1
150           150           256           60.9                62.7
256           256           256           61.8                59.5
96            256           256           65.8                58.3
64            256           256           66.9                63.0
32            256           256           63.9                60.0
16            256           256           46.5                59.4
8             256           256           37.9                62.7

Large random files (1 or 4GB)

file size (GB)  files  max large buffers  backup (MB/s)  restore (MB/s)
4               2      16                 57.57          88.68
1               1      16                 57.5           92.53
1               2      16                 55.15          78.18
1               4      16                 56.29          88.63
1               8      16                 56.8           93.79
1               8      32                 56.48          90.69
1               16     32                 54.73          91.09
