This notebook determines which files have already been transferred into the new data structure (clidex/data/) and which remain exclusively in the old structure (clidex/data_old/). Files that have not yet been transferred to the new repository are sorted into groups corresponding to folders of the new structure before being exported from clidex/data_old/ to clidex/data/.

Last update: 23 Sep 2024 | FFW

In [1]:
import os
import hashlib
from collections import defaultdict, Counter
from concurrent.futures import ThreadPoolExecutor
import shutil 
import pandas as pd
from IPython.display import clear_output

In [76]:
# # Function to generate a hash for the first MB of the file
# def hash_partial_file(file_path, chunk_size=1024*1024):  # Default chunk size is 1 MB
#     """Generate a hash for only a portion of the file."""
#     hasher = hashlib.sha256()  # Or use hashlib.md5() if you prefer a faster, less secure hash
#     with open(file_path, 'rb') as f:
#         chunk = f.read(chunk_size)  # Read only the first chunk_size bytes
#         hasher.update(chunk)
#     return hasher.hexdigest()
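
One caveat of hashing only the first MB is that two files sharing their first MB but differing further on would be treated as identical. A minimal sketch of one possible mitigation, assuming it is acceptable to also compare file sizes (the helper name `size_and_partial_hash` is hypothetical and is not used elsewhere in this notebook):

```python
import hashlib
import os

def size_and_partial_hash(file_path, chunk_size=1024 * 1024):
    """Return (file size in bytes, SHA-256 of the first chunk_size bytes).

    Hypothetical helper: pairing the partial hash with the file size
    reduces, but does not eliminate, false matches between files that
    merely share their first chunk.
    """
    hasher = hashlib.sha256()
    with open(file_path, "rb") as f:
        hasher.update(f.read(chunk_size))  # Read only the first chunk
    return os.path.getsize(file_path), hasher.hexdigest()
```

The returned tuple could be used in place of the bare hash when building the comparison sets; a full-file hash would still be needed for a strict guarantee.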

In [77]:
# # Function to hash file and store result, or log if permission denied
# def hash_and_store_partial(file_name, directory, chunk_size=1024*1024, denied_list=None):
#     """Hash a file and store the result if it exists, or log permission denied files."""
#     file_path = os.path.join(directory, file_name)
#     try:
#         if os.path.isfile(file_path):
#             return (file_name, hash_partial_file(file_path, chunk_size))
#     except PermissionError:
#         print(f"Permission denied: {file_path}")
#         if denied_list is not None:  # Avoid a shared mutable default argument
#             denied_list.append(file_name)  # Add to permission denied list
#         return None  # Skip this file
#     return None

In [78]:
# # Read file names from text files
# def read_file_names(file_path):
#     with open(file_path, 'r') as f:
#         # Read each line, strip whitespace, and keep the full path
#         file_names = [line.strip() for line in f]
#     return file_names

# # Paths to the text files (inside your project folder)
# data_old_path = 'File Lists/data_old_files.txt'
# data_path = 'File Lists/data_files.txt'

# # Read file names from text files
# data_old_files1 = read_file_names(data_old_path)
# data_files1 = read_file_names(data_path)

In [79]:
# # Function to filter out .DS_Store files
# def filter_out_ds_store(file_list):
#     return [file for file in file_list if not (file.endswith('.DS_Store') or file.endswith('._.DS_Store'))]

# # Filter out .DS_Store files from both lists
# data_old_files = filter_out_ds_store(data_old_files1)
# data_files = filter_out_ds_store(data_files1)

In [80]:
# # Define the paths to your data directories
# data_old_directory = '/vast/clidex/'

# # Get file hashes for all filtered files in data_old using multiple threads and partial hashing
# data_old_hashes = []
# chunk_size = 1024 * 1024  # 1 MB chunk

# with ThreadPoolExecutor(max_workers=8) as executor:
#     # Map each file to the hash_and_store_partial function
#     results = executor.map(lambda file_name: hash_and_store_partial(file_name, data_old_directory, chunk_size), data_old_files)

#     # Collect non-None results
#     data_old_hashes = [res for res in results if res is not None]

Now we can move on to see which files have been moved over successfully and which have not:

In [81]:
# # Define the base directory for data
# base_directory = '/vast/clidex/data'

# # Get file hashes for all files in 'data' using absolute paths
# data_hashes = []
# denied_files = []  # Keep track of permission denied files in 'data'

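# # Create absolute paths for files in 'data'
# # (str.lstrip('./') strips every leading '.' and '/' character, which would also
# # mangle names such as './.hidden', so only a literal './' prefix is removed here)
# data_files_absolute = [os.path.join(base_directory, file[2:] if file.startswith('./') else file) for file in data_files]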

# with ThreadPoolExecutor(max_workers=8) as executor:
#     results_data = executor.map(lambda file_path: hash_and_store_partial(file_path, base_directory, chunk_size, denied_files), data_files_absolute)
#     data_hashes = [res for res in results_data if res is not None]

In [82]:
# # Comparison: Find files in data_old that are missing from data
# data_old_hashes_set = set(file_hash for _, file_hash in data_old_hashes)
# data_hashes_set = set(file_hash for _, file_hash in data_hashes)

# # Files that are in 'data_old' but not in 'data'
# only_in_old = [file_name for file_name, file_hash in data_old_hashes if file_hash not in data_hashes_set]
# in_both = [file_name for file_name, file_hash in data_old_hashes if file_hash in data_hashes_set]

In [9]:
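print(str(len(in_both)) + ' files have been transferred over from data_old to data.')
print(str(len(only_in_old)) + ' files exist only in data_old.')
print('data_old contains ' + str(len(data_old_files)) + ' total files.')
# Compute the sum from the lists themselves rather than from hard-coded numbers
print(str(len(in_both)) + ' + ' + str(len(only_in_old)) + ' = ' + str(len(in_both) + len(only_in_old)))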

6962 files have been transferred over from data_old to data.
13721 files exist only in data_old.
data_old contains 20685 total files.
6962 + 13721 = 20683


The discrepancy of two comes from the two files that I did not have access to when hashing all files in data_old: /vast/clidex/data_old/CESM_HR/Zusatzmaterial/CESM1_HR_PICONTROL.cvdp_data.150-501.nc-20230824T194710Z-001.zip and /vast/clidex/data_old/CESM_HR/Zusatzmaterial/POP_tx0.1v3_grid-001.nc. I will inquire about them later.
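
A quick way to identify such files is to compare the list of file names against the names that actually received a hash. This is a minimal sketch, assuming the hashes are stored as (file_name, hash) pairs as in data_old_hashes; the helper name `find_unhashed` and the toy entries are made up for illustration:

```python
def find_unhashed(listed_files, hashed_pairs):
    """Return the listed files that have no (file_name, hash) entry."""
    hashed_names = {file_name for file_name, _ in hashed_pairs}
    return sorted(set(listed_files) - hashed_names)

# Toy data standing in for data_old_files and data_old_hashes
listed = ["data_old/a.nc", "data_old/b.nc", "data_old/locked.zip"]
hashed = [("data_old/a.nc", "hash_a"), ("data_old/b.nc", "hash_b")]

print(find_unhashed(listed, hashed))
```

Run on the real lists, the result should contain exactly the files that were skipped during hashing.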

Before proceeding, let's save the file lists so that we do not have to repeat the somewhat lengthy hashing step:

In [83]:
# # Define paths for the output files
# only_in_old_path = 'File Lists/only_in_old.txt'
# in_both_path = 'File Lists/in_both.txt'

# # Save lists to text files
# with open(only_in_old_path, 'w') as f:
#     for file_name in only_in_old:
#         f.write(f"{file_name}\n")

# with open(in_both_path, 'w') as f:
#     for file_name in in_both:
#         f.write(f"{file_name}\n")

Now let's start organizing file names into groups based on what data they contain and where we want them to end up in the new data structure. We start by reading in the saved file lists, so that only this half of the notebook needs to be run:

In [2]:
# Define paths for the input files
only_in_old_path = 'File Lists/only_in_old.txt'
in_both_path = 'File Lists/in_both.txt'

# Read the list of files only in old
with open(only_in_old_path, 'r') as f:
    only_in_old = [line.strip() for line in f.readlines()]

# Read the list of files in both
with open(in_both_path, 'r') as f:
    in_both = [line.strip() for line in f.readlines()]

In [3]:
OCEAN_3D_keys = ['ARGO', 'CARS', 'XBT', 'WOA', 'QuOTA', 'OHC', 'INSTANT_ARRAY']
NEMO_keys = ['forcing', 'VIKING', 'FOCI', 'ORCA', 'TROPAC', 'NUSA20', 'INDRANI']

In [4]:
def suggest_sorting_criteria(file_list):
    sorting_suggestions = defaultdict(list)
    
    for file_path in file_list:
        # Split the path and get all folder names
        parts = file_path.split('/')
        
        # Extract relevant folder names (e.g., ignoring 'data_old' or 'CESM_CAM5_LME')
        relevant_parts = [part for part in parts if part not in ['data_old', 'CESM_CAM5_LME']]

        # Use the first folder name as the primary sort key
        if relevant_parts:
            primary_key = relevant_parts[0]
            sorting_suggestions[primary_key].append(file_path)

            # Check for additional potential sorting criteria
            if len(relevant_parts) > 1:
                secondary_key = relevant_parts[1]
                sorting_suggestions[primary_key].append(f"Also consider sorting by '{secondary_key}' for: {file_path}")

    return dict(sorting_suggestions)

In [5]:
# Define your keywords
OCEAN_3D_keys = ['ARGO', 'CARS', 'XBT', 'WOA', 'QuOTA', 'OHC', 'INSTANT_ARRAY']
NEMO_keys = ['forcing', 'VIKING', 'FOCI', 'ORCA', 'TROPAC', 'NUSA20', 'INDRANI']

# Initialize a dictionary to store sorted files
sorted_files = {key: [] for key in OCEAN_3D_keys + NEMO_keys}

# Example initialization of file_list and all_keywords
file_list = only_in_old
all_keywords = OCEAN_3D_keys + NEMO_keys

In [6]:
def sort_files_by_keywords(file_list, keywords):
    """
    Sort files into groups based on the earliest matching keyword in the file path.

    Parameters:
    - file_list: List of file names to be sorted.
    - keywords: List of keywords to use for sorting.

    Returns:
    - A dictionary with keywords as keys and sets of matching file names as values.
    """
    # Initialize a dictionary to store sorted files as sets
    sorted_files = {key: set() for key in keywords}

    # Sort files into groups based on keywords
    for file_name in file_list:
        best_match_key = None
        best_match_index = float('inf')  # Start with a very large number
        
        for key in sorted_files.keys():
            # Check if the key is in the file name (case insensitive)
            if key.lower() in file_name.lower():
                # Find the index of the first occurrence of the keyword in the file path
                match_index = file_name.lower().index(key.lower())
                # If this match is earlier than the best match found so far, update
                if match_index < best_match_index:
                    best_match_index = match_index
                    best_match_key = key

        # If a best match key was found, add the file to that category
        if best_match_key:
            sorted_files[best_match_key].add(file_name)

    return sorted_files
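
To make the "earliest matching keyword" rule concrete, here is a small sketch that applies the same rule to a single path. The helper name `earliest_keyword` and the example paths are made up for illustration; note that the match is a plain substring test, so a name like "cargo" also matches 'ARGO':

```python
def earliest_keyword(file_name, keywords):
    """Return the keyword whose first (case-insensitive) occurrence comes earliest."""
    lowered = file_name.lower()
    hits = [(lowered.find(key.lower()), key) for key in keywords if key.lower() in lowered]
    # min() with a key keeps the first keyword in list order when positions tie
    return min(hits, key=lambda hit: hit[0])[1] if hits else None

keywords = ["ARGO", "forcing", "ORCA"]

print(earliest_keyword("data_old/ORCA/forcing/file.nc", keywords))     # 'ORCA' appears before 'forcing'
print(earliest_keyword("data_old/forcing/ORCA025/wind.nc", keywords))  # 'forcing' appears before 'ORCA'
print(earliest_keyword("data_old/misc/cargo_list.txt", keywords))      # substring match on 'ARGO'
print(earliest_keyword("data_old/readme.txt", keywords))               # no keyword matches
```

Because of the substring behaviour, it is worth spot-checking the detailed list of each category before moving any files.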

In [7]:
# Ensure file_list and all_keywords are defined
if 'file_list' in locals() and 'all_keywords' in locals():
    # Call the function
    sorted_files = sort_files_by_keywords(file_list, all_keywords)

    # Display a summary of the sorted files
    summary = {key: len(files) for key, files in sorted_files.items() if files}

    print("Summary of sorted files:")
    for key, count in summary.items():
        print(f"Sorted by '{key}': {count} files")

    # Prompt user to specify a category to display detailed lists
    while True:
        category_to_display = input("Which category would you like to see detailed lists for? (type 'exit' to quit) ").strip().replace("'", "").lower()

        # Allow the user to exit
        if category_to_display in ['exit', 'quit']:
            print("Exiting the search...")
            break  # Exit the loop and stop the operation

        # Check against the lowercase keys of sorted_files
        lowercase_sorted_keys = {key.lower(): key for key in sorted_files.keys()}

        if category_to_display in lowercase_sorted_keys:
            original_key = lowercase_sorted_keys[category_to_display]
            files = sorted_files[original_key]
            if files:
                print(f"\nFiles sorted by '{original_key}':")
                for file_name in files:
                    print(file_name)
            else:
                print(f"No files found for category '{original_key}'.")
            break  # Exit the loop after displaying the files
        else:
            print(f"Category '{category_to_display}' not recognized. Please try again.")
else:
    print("Ensure that 'file_list' and 'all_keywords' are defined before calling the function.")

Summary of sorted files:
Sorted by 'ARGO': 344 files
Sorted by 'CARS': 2 files
Sorted by 'XBT': 25 files
Sorted by 'WOA': 1 files
Sorted by 'QuOTA': 2 files
Sorted by 'OHC': 1926 files
Sorted by 'INSTANT_ARRAY': 12 files
Sorted by 'forcing': 665 files
Sorted by 'VIKING': 127 files
Sorted by 'ORCA': 2687 files


Which category would you like to see detailed lists for? (type 'exit' to quit)  exit


Exiting the search...


In [8]:
# Initialize an empty dictionary to store results
sorted_files = {}

In [10]:
while True:
    # Clear the previous output (except the current input prompt)
    clear_output(wait=True)
    
    # Prompt user for input string
    search_string = input("Enter a string to search for in file names (or type 'exit' to quit): ").strip().lower().replace("'", "").replace('"', '')

    # Allow the user to exit the loop
    if search_string == 'exit':
        clear_output(wait=True)  # Clear the output before exiting
        print("Exiting the search.")
        break

    # Find matching files
    matching_files = [file_name for file_name in file_list if search_string in file_name.lower()]

    # Output the results
    if matching_files:
        print(f"There are {len(matching_files)} files containing this keyword.")
        
        # Ask if the user wants to see the full list
        show_full_list = input("Would you like to see the full list of matching files? (yes/no): ").strip().lower()
        
        if show_full_list == 'yes':
            print(f"Files containing '{search_string}':")
            for file_name in matching_files:
                print(file_name)

        # Ask for the range of files to append
        append_to_dict = input("Would you like to append these results to the dictionary? (yes/no): ").strip().lower()
        
        if append_to_dict == 'yes':
            range_input = input("Enter the range of files to add (e.g., 4: for all from index 4 onwards or 2:5 for a specific range): ")

            try:
                # Parse the range input
                if ':' in range_input:
                    if range_input.count(':') == 1:  # Handling cases like "4:", "2:5"
                        start_index, end_index = range_input.split(':')
                        start_index = int(start_index) if start_index else 0  # Default to 0 if start is empty
                        end_index = int(end_index) if end_index else None  # None means to the end
                        selected_files = matching_files[start_index:end_index]
                    else:
                        raise ValueError
                else:  # If no colon, add the whole list
                    selected_files = matching_files

                # Store selected files in the dictionary
                sorted_files[search_string.upper()] = pd.DataFrame(selected_files, columns=["Matching Files"])
                print(f"Results stored under key '{search_string.upper()}'.")
            except ValueError:
                print("Invalid range input. Please use the format 'start:end'.")
    else:
        print(f"No files found containing '{search_string}'.")

Exiting the search.
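
The range handling in the loop above can also be factored into a small function that is easy to test without typing into a prompt. This is a sketch under the same conventions as the loop ('4:' means from index 4 onwards, '2:5' a specific range, no colon selects everything); the name `parse_range` is hypothetical:

```python
def parse_range(range_input):
    """Turn 'start:end' text into a slice; input without a colon selects everything."""
    text = range_input.strip()
    if ":" not in text:
        return slice(None)
    if text.count(":") != 1:
        raise ValueError(f"Invalid range: {range_input!r}")
    start_text, end_text = text.split(":")
    start = int(start_text) if start_text.strip() else 0   # Default to 0 if start is empty
    end = int(end_text) if end_text.strip() else None      # None means to the end
    return slice(start, end)

files = ["a", "b", "c", "d", "e", "f"]
print(files[parse_range("4:")])
print(files[parse_range("2:5")])
```

A ValueError is raised both for more than one colon and for non-numeric bounds, matching the error branch in the loop.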


We can look at our dictionary:

In [35]:
#We can check our dict
#sorted_files

Making sure we did not add any files under more than one dictionary key:

In [15]:
def find_duplicates(sorted_files):
    """Return the set of file names that appear more than once across all keys."""
    seen_files = set()
    duplicates = set()

    # Iterate through each key in the dictionary
    for key in sorted_files:
        entries = sorted_files[key]
        # Iterating over a DataFrame yields its column labels, not its rows,
        # so read the file names from the 'Matching Files' column explicitly
        if isinstance(entries, pd.DataFrame):
            entries = entries["Matching Files"]

        for file_name in entries:
            # Normalize file name
            file_name = str(file_name).strip()

            # Check if the file name has already been seen
            if file_name in seen_files:
                duplicates.add(file_name)
            else:
                seen_files.add(file_name)

    return duplicates

In [16]:
# Example usage
duplicates = find_duplicates(sorted_files)

if duplicates:
    print("Duplicate files found:")
    for file in duplicates:
        print(file)
else:
    print("No duplicate files found.")

No duplicate files found.
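
If duplicates do turn up, it helps to know which keys share them. The following is a sketch only, assuming each dictionary value can be iterated as a sequence of file names; for the DataFrame-valued dictionary used here, one could pass `{k: df["Matching Files"] for k, df in sorted_files.items()}`. The helper name `keys_per_file` and the toy entries are made up:

```python
from collections import defaultdict

def keys_per_file(files_by_key):
    """Map each file that appears under more than one key to the sorted list of those keys."""
    owners = defaultdict(set)
    for key, file_names in files_by_key.items():
        for file_name in file_names:
            owners[str(file_name).strip()].add(key)
    return {name: sorted(keys) for name, keys in owners.items() if len(keys) > 1}

# Toy data: 'b.nc' was stored under two different keys
toy = {"ARGO": ["a.nc", "b.nc"], "OHC": ["b.nc", "c.nc"]}
print(keys_per_file(toy))
```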


In [11]:
#Now lets save these lists of sorted files as txt files to use in our terminal

# Iterate over the dictionary
for key, df in sorted_files.items():
    # Extract the list of file paths (assuming 'Matching Files' is the column name)
    file_paths = sorted_files[key].values.tolist()
    # Flatten the list (extract the string from each inner list)
    file_paths = [path[0] if isinstance(path, list) else path for path in file_paths]
    
    # Define the text file name using the key
    txt_filename = f"sorted_txt_files/{key}_files.txt"
    
    # # Write the file paths to the text file
    # with open(txt_filename, 'w') as file:
    #     for path in file_paths:
    #         file.write(f"{path}\n")

     # Write the file paths to the text file with the absolute path
    with open(txt_filename, 'w') as file:
        for path in file_paths:
            absolute_path = f"/vast/clidex/{path}"  # Prepend /vast/clidex/ to each path
            file.write(f"{absolute_path}\n")
    
    #print(f"Saved {txt_filename}")

I have worked through most of the keywords for which I know we want to move files, given our new data structure outline (https://docs.google.com/drawings/d/1hdjjVtAG1EScDvX2TAPQTSizGjYRxCvb2byInvkRNBI/edit). Now, let's see how many of the ~20,000 files we have accounted for:

In [20]:
unique_files = set()  # Use a set to automatically eliminate duplicates
for key in sorted_files.keys():
    unique_files.update(sorted_files[key]['Matching Files'])
print("Total unique files:", len(unique_files))

Total unique files: 12877


Let's see what remains and try to figure out where those files belong:

In [27]:
# Flattening the values in sorted_files and converting them to a set
sorted_files_list = []
for key in sorted_files.keys():
    # Extracting the 'Matching Files' column from the DataFrame
    sorted_files_list.extend(sorted_files[key]['Matching Files'].values)

# Convert the flattened list to a set
sorted_files_set = set(sorted_files_list)
only_in_old_set = set(only_in_old)

# Find the files that are in only_in_old but not in sorted_files
unsorted_files = only_in_old_set - sorted_files_set
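
One way to get a first impression of the unsorted files is to count them by their top-level folder inside data_old. This is a sketch, assuming the stored paths look like 'data_old/<folder>/...' (optionally with a leading './'); the helper name `count_by_top_folder` and the folder ERA5 in the toy data are made up:

```python
from collections import Counter

def count_by_top_folder(paths, skip=("", ".", "data_old")):
    """Count paths by the first folder name that is not in `skip`."""
    counts = Counter()
    for path in paths:
        parts = [part for part in path.split("/") if part not in skip]
        # A bare file name with no enclosing folder is counted separately
        counts[parts[0] if len(parts) > 1 else "(no folder)"] += 1
    return counts

toy_paths = [
    "data_old/CESM_HR/a.nc",
    "data_old/CESM_HR/sub/b.nc",
    "./data_old/ERA5/c.nc",
    "data_old/notes.txt",
]
for folder, count in count_by_top_folder(toy_paths).most_common():
    print(folder, count)
```

Calling `count_by_top_folder(unsorted_files)` would then show which old folders still hold the most unassigned files.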

In [28]:
print('There were ' + str(len(only_in_old_set)) + ' files found exclusively in data_old')
print('We have sorted ' + str(len(sorted_files_set)) + ' of these files into folders for data')
print('This leaves ' + str(len(unsorted_files)) + ' unsorted files')
print('Does this add up? ' + str(len(sorted_files_set)) + ' + ' + str(len(unsorted_files)) + ' = ' + str(len(sorted_files_set) + len(unsorted_files)))

There were 13721 files found exclusively in data_old
We have sorted 12877 of these files into folders for data
This leaves 844 unsorted files
Does this add up? 12877 + 844 = 13721


Saving our dictionaries to text files so that we can use these lists to extract files from data_old into the specified directories in data:
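
Once the lists exist as text files, the transfer itself could be done from Python as well as from the terminal. The following is only a sketch under stated assumptions: the helper name `copy_listed_files` is hypothetical, the destination folder is whatever the new structure calls for, and files are copied flat into the destination (so two files with the same name would collide). It defaults to a dry run that only reports what would be copied:

```python
import os
import shutil

def copy_listed_files(list_path, destination, dry_run=True):
    """Copy every path listed in list_path into destination.

    Returns the planned (source, target) pairs. With dry_run=True
    nothing is written, so the plan can be reviewed first.
    """
    with open(list_path, "r") as f:
        sources = [line.strip() for line in f if line.strip()]

    planned = [(src, os.path.join(destination, os.path.basename(src))) for src in sources]

    if not dry_run:
        os.makedirs(destination, exist_ok=True)
        for src, dst in planned:
            shutil.copy2(src, dst)  # copy2 also preserves timestamps where possible

    return planned
```

Copying rather than moving keeps data_old intact until the transfer has been verified.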

In [19]:
1600 - 105 - 475 - 200 -200

620