# large_files_handler.ipynb
GitHub warns about files larger than 50 MB and rejects files larger than 100 MB, so this notebook treats 50 MB as the limit.
However, large files can be split into smaller ones, stored in GitHub, and reassembled after cloning.

In this notebook, I use three Linux commands, **find**, **split** and **cat**, to implement this strategy.

The notebook also adds the paths of large files to **.gitignore** to prevent them from being added to the repo.

**find** is used to list large files. Note that '..' is included to start the search in the parent folder, which is the top-level folder of the repository. In this example, the command returns three file paths.
```
find .. -type f -size +50M

../code/object-detectors/inference_data/frozen_inference_graph_5classes.pb
../code/object-detectors/inference_data/frozen_inference_graph_3classes.pb
../code/object-detectors/inference_data/mask_rcnn_cvat_0160.h5
```
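
For readers who prefer to stay in Python, the same search can be sketched with `os.walk`. This is a minimal sketch, not the notebook's method: the name `find_large_files`, the `limit` parameter and the `skip_dirs` parameter are assumptions introduced for illustration. Skipping `.git` is also an addition; the `find` command above does not do that, so it can report large files inside Git's own folder.

```python
import os

# find interprets "50M" as 50 * 1024 * 1024 bytes
SIZE_LIMIT = 50 * 1024 * 1024


def find_large_files(root="..", limit=SIZE_LIMIT, skip_dirs=(".git",)):
    """Return the sorted paths of regular files under root larger than limit bytes."""
    large = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune skipped folders in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if d not in skip_dirs]
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Ignore symbolic links, as "find -type f" does.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            if os.path.getsize(path) > limit:
                large.append(path)
    return sorted(large)
```

Calling `find_large_files()` with the defaults searches the parent folder, as the command above does.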

**split** is used to split the files into 40 MB parts, which can be stored in a GitHub repo.

```
split -b 40MB {filepath} {filepath}.part_
```
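
To show what the command does, here is a rough Python equivalent. It is only a sketch under stated assumptions: `split_file` and `chunk_size` are hypothetical names, the default of 40,000,000 bytes mirrors GNU split's reading of `40MB` (powers of 1000), and the two-letter suffixes `aa`, `ab`, ... imitate split's default naming. Each chunk is read into memory in one piece, which is acceptable for 40 MB parts.

```python
import string


def split_file(path, chunk_size=40_000_000):
    """Write path.part_aa, path.part_ab, ... and return the list of part paths."""
    letters = string.ascii_lowercase
    parts = []
    with open(path, "rb") as src:
        index = 0
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break  # end of file reached
            if index >= len(letters) ** 2:
                raise ValueError("more parts than two-letter suffixes allow")
            # 0 -> "aa", 1 -> "ab", ..., 26 -> "ba", ...
            suffix = letters[index // 26] + letters[index % 26]
            part_path = f"{path}.part_{suffix}"
            with open(part_path, "wb") as dst:
                dst.write(chunk)
            parts.append(part_path)
            index += 1
    return parts
```

The part names sort alphabetically in the order they were written, which is what makes the later reassembly step work.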

**cat** is used to reassemble parts into the original file.
```
cat {filepath}.part_?? > {filepath}
```
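
The `??` pattern matches the two-character suffixes that split generates, and the shell expands it in sorted order, so the parts are concatenated in the order they were written. The sketch below imitates that in Python; `assemble_file` is a hypothetical helper, not part of the notebook.

```python
import glob
import shutil


def assemble_file(target):
    """Concatenate target.part_?? files, in sorted order, into target."""
    # glob.escape protects special characters in the target path itself.
    parts = sorted(glob.glob(glob.escape(target) + ".part_??"))
    if not parts:
        raise FileNotFoundError(f"no parts found for {target}")
    with open(target, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Stream each part so large files are not loaded into memory at once.
                shutil.copyfileobj(src, out)
    return parts
```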

My idea is to run this notebook at the beginning of a workflow to handle large files.

In [1]:
import subprocess
import os
import glob

In [2]:
def get_large_files_list():
    """ 
    Returns a list of paths for files greater than 50 MB.
    Note that recursive search starts in parent folder.
    """
    result = subprocess.run(['find', '..', '-type', 'f', '-size', '+50M'], stdout=subprocess.PIPE)
    # find prints one path per line; splitting on lines keeps paths that contain spaces intact
    return result.stdout.decode('utf8').splitlines()

In [3]:
def confirm_large_files_are_listed_in_gitignore(large_files_list):
    """ appends each large file path to ../.gitignore unless it is already listed """

    # if ../.gitignore does not exist, create it
    gitignore_path = '../.gitignore'
    if not os.path.exists(gitignore_path):
        open(gitignore_path, 'a').close()

    # read ../.gitignore as a string
    with open(gitignore_path, 'r') as f:
        s = f.read()

    # make sure that an entry for each file greater than 50 MB is included in the string
    existing_entries = set(s.splitlines())
    string_has_been_modified = False
    for filename in large_files_list:
        # .gitignore patterns are relative to the folder holding .gitignore,
        # so the leading '../' is dropped
        entry = os.path.relpath(filename, '..')
        if entry not in existing_entries:
            if s and not s.endswith('\n'):
                s = s + '\n'
            s = s + entry + '\n'
            existing_entries.add(entry)
            string_has_been_modified = True

    # if the string has been modified, replace ../.gitignore
    if string_has_been_modified:
        with open(gitignore_path, 'w') as f:
            f.write(s)

In [4]:
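def assemble_large_files_from_parts():
    """ each large file is assembled only if it does not already exist """
    parts_files = sorted(glob.glob('../**/*.part_*', recursive=True))
    for file in sorted(set(s[:s.rfind('.part_')] for s in parts_files)):
        # sorted part names (aa, ab, ...) give the original order
        parts = sorted(glob.glob(f'{glob.escape(file)}.part_??'))
        if parts and not os.path.exists(file):
            # same effect as the shell command printed here, but safe for paths with spaces
            print(f'cat {file}.part_?? > {file}')
            with open(file, 'wb') as out:
                subprocess.run(['cat'] + parts, stdout=out, check=True)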

In [5]:
def split_large_files(large_file_list):
    """ each large file is split only if parts do not already exist """
    for filename in large_file_list:
        parts = glob.glob(f'{glob.escape(filename)}.part_*')
        if len(parts) == 0:
            # passing a list avoids the shell, so paths with spaces are handled correctly
            command = ['split', '-b', '40MB', filename, f'{filename}.part_']
            print(' '.join(command))
            subprocess.run(command, check=True)

In [6]:
# MAIN

large_files_list = get_large_files_list()
if len(large_files_list) == 0:   # this is the state immediately after cloning from GitHub
    assemble_large_files_from_parts()
else:
    split_large_files(large_files_list)
    
large_files_list = get_large_files_list()
confirm_large_files_are_listed_in_gitignore(large_files_list)
print('The following large files (>50 MB) are included in .gitignore')
print('Splitting these files into parts allows them to be stored in GitHub repos.')
print('Running this notebook (large_files_handler.ipynb) will re-assemble large files from the parts files.')
for f in large_files_list:
    print(f)
print('FINISHED')

The following large files (>50 MB) are included in .gitignore
Splitting these files into parts allows them to be stored in GitHub repos.
Running this notebook (large_files_handler.ipynb) will re-assemble large files from the parts files.
../code/object-detectors/inference_data/frozen_inference_graph_5classes.pb
../code/object-detectors/inference_data/frozen_inference_graph_3classes.pb
../code/object-detectors/inference_data/mask_rcnn_cvat_0160.h5
FINISHED
