### DATA MANAGEMENT
#### Commonly used file and folder manipulation commands
In this notebook, I gather the Python snippets I use most often for file and folder manipulation. If you work with many different samples, tools and programs and frequently find yourself adding, deleting, renaming or replacing large numbers of files or directories, you might find this notebook useful. There are undoubtedly better ways of doing some of these things, so feel free to suggest improvements.

I like to keep the directory structure relatively flat to limit navigation. In my research, I use five or six programs and tools, each of which generates a few files, so I use the following directory structure:
```
project
│    dir_list.txt
│
└─── sample_1
│   │
│   └─── tool_1
│   │   │   tool_1_file_1
│   │   │   tool_1_file_2
│   │   │   ...
│   │
│   └─── tool_2
│   │   │   tool_2_file_1
│   │   │   tool_2_file_2
│   │   │   ...
│   │
│   └─── ...
│
└─── sample_2
│   └─── ...
│ ...
```
This helps to keep the data organized and allows easy access to the corresponding files and directories using the steps below. The file `dir_list.txt` contains the absolute paths to the directories `sample_1`, `sample_2`, etc. (see step 1) and serves as the entry-point for most operations. It also allows you to gather all the data in a single dataframe to perform analysis (see step 7).
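
Most steps below begin by reading one of these list files. A minimal sketch of a helper for that, assuming one path per line; `read_path_list` is a hypothetical name that is not used elsewhere in this notebook:

```python
import os

def read_path_list(path_list_file):
    """Return the non-empty lines of a list file as a list of paths."""
    # The with-block closes the file handle even if reading fails.
    with open(path_list_file, 'r') as f:
        return [line.strip() for line in f if line.strip()]

# Example usage (only runs if 'example/dir_list.txt' exists, see step 1)
list_file = os.path.join('example', 'dir_list.txt')
if os.path.isfile(list_file):
    for path in read_path_list(list_file):
        print(path)
```

Skipping blank lines avoids passing an empty path to later steps if the list file was edited by hand.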

In [1]:
import os
import shutil
import pandas as pd

#### Setting up some examples

In [2]:
def create_some_files_here(path, name):
    for i in [name + '_file_a.txt', name + '_file_b.txt']:    
        with open(os.path.join(path, i), 'w') as f:
            f.write(name + '_content_a\n')
            f.write(name + '_content_b\n')

def setup_examples(example_dir):
    if not os.path.exists(example_dir):
        os.mkdir(example_dir)

    dirs = [a + str(b) for b in range(3) for a in ['sample_A_','sample_B_']]
    for i in dirs:
        my_dir = os.path.join(example_dir, i)
        if not os.path.exists(my_dir):
            os.mkdir(my_dir)
        
        create_some_files_here(my_dir, 'sample')
        
        for j in ['tool_a','tool_b','tool_c']:
            sub_dir = os.path.join(my_dir, j)
            if not os.path.exists(sub_dir):
                os.mkdir(sub_dir) 
            
            create_some_files_here(sub_dir, 'tool')

example_dir = 'example'
setup_examples(example_dir)

#### 1. Creating a list of all sub-directories of a certain directory
Start with this step to specify the directory that contains the files/directories to be used in later steps.

In [3]:
example_dir = 'example'
sub_dirs    = os.listdir(example_dir)
abs_path    = os.path.abspath(example_dir)

Write the paths to all sub-directories and files in `example_dir` to the text file `content_list.txt`:

In [4]:
output_file = os.path.join(example_dir, 'content_list.txt')

with open(output_file, 'w') as f:
    for sub_dir in sub_dirs:
        sub_path = os.path.join(abs_path, sub_dir)
        f.write(sub_path + '\n')

Write the paths to all sub-directories of `example_dir` to the text file `dir_list.txt`:

In [5]:
output_file = os.path.join(example_dir, 'dir_list.txt')

with open(output_file, 'w') as f:
    for sub_dir in sub_dirs:
        sub_path = os.path.join(abs_path, sub_dir)
        if os.path.isdir(sub_path):
            f.write(sub_path + '\n')

Write the paths to all sub-directories of `example_dir` with the string `_A_` in the directory name:

In [6]:
output_file = os.path.join(example_dir, 'dir_list_A.txt')

with open(output_file, 'w') as f:
    for sub_dir in sub_dirs:
        sub_path = os.path.join(abs_path, sub_dir)
        if (os.path.isdir(sub_path)) and ('_A_' in sub_dir):
            f.write(sub_path + '\n')

Write the paths to all sub-directories of `example_dir` that end with `_0`:

In [7]:
output_file = os.path.join(example_dir, 'dir_list_0.txt')

with open(output_file, 'w') as f:
    for sub_dir in sub_dirs:
        sub_path = os.path.join(abs_path, sub_dir)
        if (os.path.isdir(sub_path)) and (sub_dir.endswith('_0')):
            f.write(sub_path + '\n')
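
The four cells above differ only in the filter they apply. As a rough sketch, they could be folded into one function that takes the filter as an argument. The name `write_dir_list` and its `keep` parameter are assumptions made for this illustration:

```python
import os

def write_dir_list(root, output_name, keep=lambda name: True):
    """Write the absolute paths of the sub-directories of `root` that pass `keep`.

    `keep` receives the directory name (not the full path) and returns a bool.
    """
    abs_root = os.path.abspath(root)
    # os.listdir returns entries in arbitrary order; sorting makes it reproducible
    selected = [
        os.path.join(abs_root, name)
        for name in sorted(os.listdir(abs_root))
        if os.path.isdir(os.path.join(abs_root, name)) and keep(name)
    ]
    with open(os.path.join(abs_root, output_name), 'w') as f:
        for path in selected:
            f.write(path + '\n')
    return selected

# Example usage (only runs if the 'example' directory exists)
if os.path.isdir('example'):
    write_dir_list('example', 'dir_list_A.txt', lambda name: '_A_' in name)
    write_dir_list('example', 'dir_list_0.txt', lambda name: name.endswith('_0'))
```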

#### 2. Creating directories
This assumes the directories in which the new sub-directories are to be created are listed in a text file, as shown in the previous step.

In [8]:
example_dir  = 'example'

Add a sub-directory `dir_d` to all directories listed in `dir_list_A.txt`:

In [9]:
path_list_file = os.path.join(example_dir, 'dir_list_A.txt')
path_list      = [line.strip() for line in open(path_list_file, 'r')]

for path in path_list:
    new_dir = os.path.join(path, 'dir_d')
    if not os.path.isdir(new_dir):
        os.mkdir(new_dir)

#### 3. Removing files or directories
Note: **use with caution**. `os.remove` and `shutil.rmtree` delete permanently; nothing is moved to a trash folder.

In [10]:
example_dir  = 'example'

Remove all directories (and the contents) listed in `dir_list_A.txt` with `shutil.rmtree`: 

In [11]:
path_list_file = os.path.join(example_dir, 'dir_list_A.txt')
path_list      = [line.strip() for line in open(path_list_file, 'r')]

for path in path_list:
    if os.path.isdir(path):
        shutil.rmtree(path, ignore_errors=True)
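
Because deletion cannot be undone, it can help to list what would be removed before removing anything. A minimal sketch, assuming a hypothetical helper `remove_dirs` with a `dry_run` flag (neither is part of `shutil`):

```python
import os
import shutil

def remove_dirs(path_list, dry_run=True):
    """Remove the directories in `path_list` and return the ones that were targeted.

    With dry_run=True (the default) nothing is deleted; targets are only printed.
    """
    targets = []
    for path in path_list:
        if not os.path.isdir(path):
            continue  # skip missing paths and plain files
        if dry_run:
            print('would remove: ' + path)
        else:
            # Unlike the cell above, errors are not ignored here,
            # so a failed deletion raises instead of passing silently.
            shutil.rmtree(path)
        targets.append(path)
    return targets
```

Calling `remove_dirs(path_list)` first and `remove_dirs(path_list, dry_run=False)` only after checking the printed list gives a chance to catch a wrong list file.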

Remove all files from the directories listed in `dir_list_0.txt`:

In [12]:
path_list_file = os.path.join(example_dir, 'dir_list_0.txt')
path_list      = [line.strip() for line in open(path_list_file, 'r')]

for path in path_list:
    if os.path.isdir(path):
        for file_name in os.listdir(path):
            file_path = os.path.join(path, file_name)
            if os.path.isfile(file_path):
                os.remove(file_path)

Remove all files whose name (ignoring the extension) ends with `_a` from the directories listed in `dir_list.txt`:

In [13]:
path_list_file = os.path.join(example_dir, 'dir_list.txt')
path_list      = [line.strip() for line in open(path_list_file, 'r')]

for path in path_list:
    if os.path.isdir(path):
        for file_name in os.listdir(path):
            file_path = os.path.join(path, file_name)
            # Compare against the name without its extension, since
            # 'sample_file_a.txt'.endswith('_a') is False
            stem = os.path.splitext(file_name)[0]
            if (os.path.isfile(file_path)) and (stem.endswith('_a')):
                os.remove(file_path)

#### 4. Renaming and moving files 
Because the previous step deleted many directories and files, `setup_examples()` is run again first to restore them.

In [14]:
example_dir = 'example'
setup_examples(example_dir)

Rename all `sample_file_a.txt` files in the directories listed in `dir_list.txt` to `sample_file_c.txt`:

In [15]:
path_list_file = os.path.join(example_dir, 'dir_list.txt')
path_list      = [line.strip() for line in open(path_list_file, 'r')]
target_file    = 'sample_file_a.txt'
rename_to      = 'sample_file_c.txt'

for path in path_list:
    src = os.path.join(path, target_file)
    if os.path.isfile(src):
        dst = os.path.join(path, rename_to)
        os.rename(src, dst)

Rename all `tool_b` sub-directories in the directories listed in `dir_list_A.txt` to `tool_d`:

In [16]:
path_list_file = os.path.join(example_dir, 'dir_list_A.txt')
path_list      = [line.strip() for line in open(path_list_file, 'r')]
target_dir     = 'tool_b'
rename_to      = 'tool_d'

for path in path_list:
    src = os.path.join(path, target_dir)
    if os.path.isdir(src):
        dst = os.path.join(path, rename_to)
        if not os.path.isdir(dst):
            os.rename(src, dst)

Move all `sample_file_c.txt` files in the directories listed in `dir_list_A.txt` to the sub-directory `tool_c`:

In [17]:
path_list_file = os.path.join(example_dir, 'dir_list_A.txt')
path_list      = [line.strip() for line in open(path_list_file, 'r')]
target_file    = 'sample_file_c.txt'
move_to        = os.path.join('tool_c', target_file)

for path in path_list:
    src = os.path.join(path, target_file)
    if os.path.isfile(src):
        dst = os.path.join(path, move_to)
        os.rename(src, dst)
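
`os.rename` fails if the destination directory does not exist, and it can fail when source and destination are on different file systems. On Unix it also silently replaces an existing destination file. A sketch of a more defensive variant built on `shutil.move`; the helper name `move_file` is an assumption for this example:

```python
import os
import shutil

def move_file(src, dst_dir):
    """Move the file `src` into `dst_dir` and return the new path.

    Creates `dst_dir` if needed and refuses to overwrite an existing file.
    """
    os.makedirs(dst_dir, exist_ok=True)
    dst = os.path.join(dst_dir, os.path.basename(src))
    if os.path.exists(dst):
        raise FileExistsError(dst)
    # shutil.move falls back to copy-and-delete across file systems
    shutil.move(src, dst)
    return dst
```

In the loop above, `os.rename(src, dst)` would then become `move_file(src, os.path.join(path, 'tool_c'))`.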

#### 5. Copying directories and files

Copy `tool_c/sample_file_c.txt` to `tool_d/sample_file_c.txt` for all directories listed in `dir_list_A.txt` (`tool_d` must already exist, which it does here after the rename in step 4):

In [18]:
path_list_file = os.path.join(example_dir, 'dir_list_A.txt')
path_list      = [line.strip() for line in open(path_list_file, 'r')]
target_file    = os.path.join('tool_c', 'sample_file_c.txt')
copy_to        = os.path.join('tool_d', 'sample_file_c.txt')

for path in path_list:
    src = os.path.join(path, target_file)
    if os.path.isfile(src):
        dst = os.path.join(path, copy_to)
        shutil.copyfile(src, dst)

Copy `tool_a` and its contents to three new directories (`tool_a_0`, `tool_a_1` and `tool_a_2`), for all directories listed in `dir_list_0.txt`:

In [19]:
path_list_file = os.path.join(example_dir, 'dir_list_0.txt')
path_list      = [line.strip() for line in open(path_list_file, 'r')]
target_dir     = 'tool_a'

for path in path_list:
    src = os.path.join(path, target_dir)
    if os.path.isdir(src):
        for i in range(3):
            dst = os.path.join(path, target_dir + '_' + str(i))
            if not os.path.exists(dst):
                shutil.copytree(src, dst)
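
The cell above skips a destination that already exists, because `shutil.copytree` raises an error in that case by default. If merging into an existing directory is wanted instead, a sketch like the following could be used. It assumes Python 3.8 or newer for the `dirs_exist_ok` argument, and `copy_tree_merge` is a hypothetical name:

```python
import os
import shutil

def copy_tree_merge(src, dst):
    """Copy the directory `src` to `dst`, merging into `dst` if it already exists.

    Files in `dst` that have the same relative path as a file in `src`
    are overwritten; other files in `dst` are left untouched.
    """
    return shutil.copytree(src, dst, dirs_exist_ok=True)
```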

#### 6. Editing the contents of existing files
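Use this if you have generated a large number of (text) files and find that you need to edit some of the content. The example replaces the second line of `tool_a/tool_file_a.txt` with `content_fixed` in every directory listed in `dir_list_0.txt`.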

In [20]:
path_list_file = os.path.join(example_dir, 'dir_list_0.txt')
path_list      = [line.strip() for line in open(path_list_file, 'r')]
target_file    = os.path.join('tool_a', 'tool_file_a.txt')

for path in path_list:
    target_path = os.path.join(path, target_file)
    if os.path.isfile(target_path):
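        with open(target_path, 'r') as f:
            content = [line.rstrip('\n') for line in f]

        # Replace the second line (index 1), if the file has one
        if len(content) > 1:
            content[1] = 'content_fixed'

        # Re-join the lines and end the file with a single newline
        with open(target_path, 'w') as f:
            f.write('\n'.join(content) + '\n')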
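
The same edit can be wrapped in a small function so that the line index and replacement text become parameters. This is only a sketch; `replace_line` is a hypothetical helper, and it assumes the file is small enough to read into memory:

```python
def replace_line(file_path, line_index, new_text):
    """Replace line `line_index` (0-based) of a text file with `new_text`.

    Returns True if the file was changed, False if it has too few lines.
    """
    with open(file_path, 'r') as f:
        lines = f.read().splitlines()
    if not 0 <= line_index < len(lines):
        return False  # leave the file untouched
    lines[line_index] = new_text
    with open(file_path, 'w') as f:
        f.write('\n'.join(lines) + '\n')
    return True
```

With this helper, the body of the loop above reduces to `replace_line(target_path, 1, 'content_fixed')`.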

#### 7. Gathering data from the output files
After running your programs and generating the output for each tool, use this step to collect all the data in a dataframe for analysis. 

In [21]:
path_list_file = os.path.join(example_dir, 'dir_list.txt')
path_list      = [line.strip() for line in open(path_list_file, 'r')]
data_set       = []

# Create a custom function for every type of file you want to read in
# (in this example, all the content is the same)
def read_sample_file_b(sample_data, f):
    content = f.readlines()
    sample_data['sample_parameter_a'] = content[0].strip()
    sample_data['sample_parameter_b'] = content[1].strip()
    
def read_tool_file_a(sample_data, f):
    content = f.readlines()
    sample_data['tool_a_parameter_a'] = content[0].strip()
    sample_data['tool_a_parameter_b'] = content[1].strip()

def read_tool_file_b(sample_data, f):
    content = f.readlines()
    sample_data['tool_b_parameter_a'] = content[0].strip()
    sample_data['tool_b_parameter_b'] = content[1].strip()

for path in path_list:
    sample_data         = {}        
    sample_data['path'] = path
    
    sample_file = os.path.join(path, 'sample_file_b.txt')
    if os.path.isfile(sample_file):
        with open(sample_file, 'r') as f:
            read_sample_file_b(sample_data, f)
    
    tool_dir = os.path.join(path, 'tool_a')
    # Skip samples without tool_a output; os.listdir would raise otherwise
    if os.path.isdir(tool_dir):
        for tool_file in os.listdir(tool_dir):
            if tool_file == 'tool_file_a.txt':
                with open(os.path.join(tool_dir, tool_file), 'r') as f:
                    read_tool_file_a(sample_data, f)
            elif tool_file == 'tool_file_b.txt':
                with open(os.path.join(tool_dir, tool_file), 'r') as f:
                    read_tool_file_b(sample_data, f)
    
    # ...repeat as needed, for other tools
    
    data_set.append(sample_data)

# Create dataframe
df = pd.DataFrame(data_set)
df.head()

| | path | sample_parameter_a | sample_parameter_b | tool_a_parameter_a | tool_a_parameter_b | tool_b_parameter_a | tool_b_parameter_b |
|---|---|---|---|---|---|---|---|
| 0 | /Users/joostv/Desktop/Research/Code/Other/[Git... | sample_content_a | sample_content_b | tool_content_a | content_fixed | tool_content_a | tool_content_b |
| 1 | /Users/joostv/Desktop/Research/Code/Other/[Git... | sample_content_a | sample_content_b | tool_content_a | tool_content_b | tool_content_a | tool_content_b |
| 2 | /Users/joostv/Desktop/Research/Code/Other/[Git... | sample_content_a | sample_content_b | tool_content_a | tool_content_b | tool_content_a | tool_content_b |
| 3 | /Users/joostv/Desktop/Research/Code/Other/[Git... | sample_content_a | sample_content_b | tool_content_a | content_fixed | tool_content_a | tool_content_b |
| 4 | /Users/joostv/Desktop/Research/Code/Other/[Git... | sample_content_a | sample_content_b | tool_content_a | tool_content_b | tool_content_a | tool_content_b |
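
The three reader functions in this step are identical apart from the column prefix, so for files that share a layout they could be replaced by one parameterized reader. A sketch under that assumption; `read_two_line_file` and `collect_sample` are hypothetical names, and the two-line layout matches only the example files created in this notebook:

```python
import os

def read_two_line_file(file_path, prefix):
    """Return the first two lines of a file as {prefix_parameter_a, prefix_parameter_b}."""
    with open(file_path, 'r') as f:
        content = f.read().splitlines()
    if len(content) < 2:
        return {}  # unexpected layout: contribute no columns
    return {
        prefix + '_parameter_a': content[0].strip(),
        prefix + '_parameter_b': content[1].strip(),
    }

def collect_sample(path, files):
    """Build one row (a dict) for the sample directory `path`.

    `files` maps a file path relative to `path` to the column prefix to use.
    Missing files are skipped, so their columns are absent from the row.
    """
    row = {'path': path}
    for rel_path, prefix in files.items():
        file_path = os.path.join(path, rel_path)
        if os.path.isfile(file_path):
            row.update(read_two_line_file(file_path, prefix))
    return row

# Mapping that reproduces the columns gathered in this step
files = {
    'sample_file_b.txt': 'sample',
    os.path.join('tool_a', 'tool_file_a.txt'): 'tool_a',
    os.path.join('tool_a', 'tool_file_b.txt'): 'tool_b',
}
```

A list such as `[collect_sample(path, files) for path in path_list]` can then be passed to `pd.DataFrame` in the same way as `data_set`. Columns for missing files show up as `NaN` in the dataframe.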
