Here, you have to use methods from the multiprocessing and subprocess modules to sync the data from the /data/prod folder to the /data/prod_backup folder.

Hint: os.walk() generates the file names in a directory tree by walking the tree either top-down or bottom-up. This is used to traverse the file system in Python.
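
To make the hint concrete, here is a minimal sketch of how os.walk() visits a tree in each direction. The helper name list_dirs and the folder names a, b and c are made up for illustration; the tree is built in a temporary directory so nothing on your system is touched.

```python
import os
import tempfile

def list_dirs(root, topdown=True):
    """Return every directory under root as a path relative to root."""
    found = []
    for path, directories, files in os.walk(root, topdown=topdown):
        for name in sorted(directories):
            found.append(os.path.relpath(os.path.join(path, name), root))
    return found

# Build a small throwaway tree containing a/b and c (hypothetical names).
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "a", "b"))
    os.makedirs(os.path.join(root, "c"))
    top_down = list_dirs(root, topdown=True)
    bottom_up = list_dirs(root, topdown=False)

print("top-down: ", top_down)
print("bottom-up:", bottom_up)
```

With topdown=True a directory is reported before the directories inside it; with topdown=False the nested directories come first.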

The script below uses Python's built-in multiprocessing module to create a pool of worker processes. It calls os.cpu_count() to set the number of workers to the number of CPUs on your machine, then uses the pool's map() method to apply the backup() function to every entry that os.listdir() returns for the source directory. Each top-level subdirectory therefore gets its own rsync call, and up to os.cpu_count() of those calls run at the same time. This can speed up the sync if there are many directories and the operation is I/O-bound, that is, if each rsync call spends most of its time waiting on disk or network rather than using the CPU.


In [None]:
#!/usr/bin/env python3
import os
import subprocess
from multiprocessing import Pool

src = "/home/<user_name>/data/prod/"
dest = "/home/<user_name>/data/prod_backup/"

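def backup(dir_name):
    # Copy one top-level entry of src into dest. The source path has no
    # trailing slash, so rsync recreates the entry itself inside dest
    # (using dest/dir_name as the target would nest it one level too deep).
    src_dir = os.path.join(src, dir_name)
    subprocess.call(["rsync", "-arq", src_dir, dest])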

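if __name__ == "__main__":
    # One worker per CPU core. The guard keeps worker processes from
    # re-running the pool setup on platforms that start them with "spawn".
    with Pool(os.cpu_count()) as p:
        p.map(backup, os.listdir(src))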

The script above and the script below have slightly different functionality.

The first script:

Parallelizes the copying process with a separate rsync call for each top-level entry under src, but does not start separate calls for subdirectories.

The second script:

Parallelizes the copying process with a separate rsync call for each directory and subdirectory under src.

If you have a deeply nested directory structure with only a few top-level directories, the second script may be better because it gives the pool more independent tasks to spread across the workers. Keep in mind that rsync -a is recursive, so a nested directory is also covered by the call for each of its parent directories and some work is repeated.

However, if the second script creates a pool with as many workers as there are directories, it can put a heavy load on the system when the number of directories is large.

For that reason, the second script below uses a number of workers equal to the number of CPU cores (like the first script does), which is a good trade-off between speed and system load.

Finally, note that the efficiency of these scripts also depends on the specific workload and system configuration, so it would be a good idea to test them in your environment to see which one performs better.

In [None]:
#!/usr/bin/env python3
from multiprocessing import Pool
import subprocess
import os

src = "{}/data/prod/".format(os.getenv("HOME"))
dest = "{}/data/prod_backup/".format(os.getenv("HOME"))

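def run(folder):
    # Sync one folder into the matching location under dest, so that a nested
    # folder such as a/b lands in dest/a/b rather than directly in dest.
    parent = os.path.relpath(os.path.dirname(folder), src)
    target = os.path.normpath(os.path.join(dest, parent))
    os.makedirs(target, exist_ok=True)
    subprocess.call(["rsync", "-arq", folder, target])

if __name__ == "__main__":
    # Collect every directory and subdirectory under src.
    folders = []
    for path, directories, files in os.walk(src):
        for name in directories:
            folders.append(os.path.join(path, name))

    # Use as many workers as there are CPU cores.
    with Pool(os.cpu_count()) as pool:
        pool.map(run, folders)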
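
Since the best choice depends on your workload, one way to test it is to time the same set of rsync calls with different pool sizes. The following is only a rough sketch under several assumptions: rsync is installed, ~/data/prod exists, and there is enough free space in the temporary directory to hold a full copy of the data. The helper names rsync_commands and time_sync are hypothetical and are not part of the scripts above.

```python
#!/usr/bin/env python3
import os
import shutil
import subprocess
import tempfile
import time
from multiprocessing import Pool

def rsync_commands(entries, dest):
    """Build one rsync argument list per source entry (hypothetical helper)."""
    return [["rsync", "-arq", entry, dest] for entry in entries]

def time_sync(commands, workers):
    """Run the commands in a pool; return (elapsed seconds, exit codes)."""
    start = time.perf_counter()
    with Pool(workers) as pool:
        # subprocess.call returns the exit code of each command.
        codes = pool.map(subprocess.call, commands)
    return time.perf_counter() - start, codes

if __name__ == "__main__":
    # Assumed source location; adjust to your environment.
    src = os.path.join(os.path.expanduser("~"), "data", "prod")
    if shutil.which("rsync") and os.path.isdir(src):
        entries = [os.path.join(src, name) for name in os.listdir(src)]
        for workers in (1, os.cpu_count() or 1):
            # A fresh, empty destination per run keeps the runs comparable.
            with tempfile.TemporaryDirectory() as dest:
                elapsed, codes = time_sync(rsync_commands(entries, dest), workers)
            failures = sum(code != 0 for code in codes)
            print(f"{workers} worker(s): {elapsed:.2f}s, failed calls: {failures}")
    else:
        print("rsync or the source folder is missing; nothing was timed.")
```

The measured times are only indicative: the operating system's file cache can make later runs look faster than the first one, so repeating the measurement a few times gives a fairer picture.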
