## Usage Notes:

To use this notebook effectively, please follow these guidelines for organizing your files and directories:

- Base Directory: Choose a base directory that holds all relevant subfolders and set it as `base_path` in the notebook, for example `./output/2024-08-01/`.

- Subfolders: The base directory should contain multiple subfolders, each holding at least one `.log` file for the notebook to process. Subfolder names are arbitrary; with the default Hydra configuration they follow the format `%H-%M-%S`. The notebook detects these subfolders automatically, so you do not need to list them manually.

- Log Files: Ensure that each subfolder contains the log files to be processed. The functions defined in the notebook will extract and analyze data from these files.
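
As a rough illustration of the layout described above, the sketch below lists every subfolder of a base directory that contains at least one `.log` file. The helper name `find_log_subfolders` is hypothetical and is not defined in the notebook; it only mirrors the directory convention stated in these guidelines.

```python
from pathlib import Path


def find_log_subfolders(base_dir):
    """Return {subfolder name: sorted list of .log file names}.

    Hypothetical helper for illustration only; subfolders that hold
    no .log file are skipped.
    """
    found = {}
    for entry in sorted(Path(base_dir).iterdir()):
        if not entry.is_dir():
            continue  # ignore loose files in the base directory
        logs = sorted(p.name for p in entry.glob("*.log") if p.is_file())
        if logs:
            found[entry.name] = logs
    return found
```

Calling `find_log_subfolders("./output/2024-08-01/")` before running the notebook is one way to confirm that the expected subfolders will be picked up.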

The output file is described at the end of this notebook.

First import the packages and define the log-processing function we will use:

In [1]:
import ast
import os
import pandas as pd

def process_log_file(log_file_path):
    """Extract the per-session metrics from one log file and return them as a DataFrame."""
    # Initialize empty lists to store the extracted values
    session_names = []
    r2_values = []
    footprints = []
    connection_sparsities = []
    activation_sparsities = []
    effective_macs = []
    effective_acs = []
    dense_values = []

    # Define the keywords to search in the log file
    session_keyword = "- Constructing model for"
    r2_keyword = "- r2:"
    footprint_keyword = "- footprint:"
    connection_sparsity_keyword = "- connection_sparsity:"
    activation_sparsity_keyword = "- activation_sparsity:"
    synaptic_operations_keyword = "- synaptic_operations:"

    # Read the log file line by line, skipping any line that contains "pretraining"
    with open(log_file_path, 'r') as file:
        for line in file:
            if "pretraining" in line:
                continue
            if session_keyword in line:
                session_name = line.split(session_keyword)[-1].strip()
                session_names.append(session_name)
            elif r2_keyword in line:
                r2_value = float(line.split(r2_keyword)[-1].strip())
                r2_values.append(r2_value)
            elif footprint_keyword in line:
                footprint = float(line.split(footprint_keyword)[-1].strip())
                footprints.append(footprint)
            elif connection_sparsity_keyword in line:
                connection_sparsity = float(line.split(connection_sparsity_keyword)[-1].strip())
                connection_sparsities.append(connection_sparsity)
            elif activation_sparsity_keyword in line:
                activation_sparsity = float(line.split(activation_sparsity_keyword)[-1].strip())
                activation_sparsities.append(activation_sparsity)
            elif synaptic_operations_keyword in line:
                # The value is a dict literal; literal_eval parses it without executing log content as code
                operations = ast.literal_eval(line.split(synaptic_operations_keyword)[-1].strip())
                effective_macs.append(operations.get('Effective_MACs', 0.0))
                effective_acs.append(operations.get('Effective_ACs', 0.0))
                dense_values.append(operations.get('Dense', 0.0))

    # Normalize the lengths of all lists by appending None to make them equal to the max length
    max_length = max(len(session_names), len(r2_values), len(footprints), len(connection_sparsities), 
                     len(activation_sparsities), len(effective_macs), len(effective_acs), len(dense_values))
    
    session_names = session_names + [None] * (max_length - len(session_names))
    r2_values = r2_values + [None] * (max_length - len(r2_values))
    footprints = footprints + [None] * (max_length - len(footprints))
    connection_sparsities = connection_sparsities + [None] * (max_length - len(connection_sparsities))
    activation_sparsities = activation_sparsities + [None] * (max_length - len(activation_sparsities))
    effective_macs = effective_macs + [None] * (max_length - len(effective_macs))
    effective_acs = effective_acs + [None] * (max_length - len(effective_acs))
    dense_values = dense_values + [None] * (max_length - len(dense_values))

    # Create a dictionary to store the normalized data
    data = {
        'Session Name': session_names,
        'r2': r2_values,
        'Footprint': footprints,
        'Connection Sparsity': connection_sparsities,
        'Activation Sparsity': activation_sparsities,
        'Effective MACs': effective_macs,
        'Effective ACs': effective_acs,
        'Dense': dense_values
    }

    # Create a DataFrame
    df = pd.DataFrame(data)
    return df
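
To make the keyword matching concrete, here is a minimal sketch of how a single line is parsed. The two sample lines are invented for illustration (the prefix before each keyword is an assumption, not taken from a real run), and `ast.literal_eval` is used as a safer alternative to `eval` for reading the `synaptic_operations` dictionary.

```python
import ast

# Hypothetical log lines; only the "- keyword: value" tail matters for parsing.
r2_line = "[2024-08-01 12:00:00][INFO] - r2: 0.6512\n"
ops_line = (
    "[2024-08-01 12:00:00][INFO] - synaptic_operations: "
    "{'Effective_MACs': 0.0, 'Effective_ACs': 1250.5, 'Dense': 4096.0}\n"
)

# Everything after the keyword is the value; strip() drops the trailing newline.
r2_value = float(r2_line.split("- r2:")[-1].strip())

# literal_eval accepts only Python literals, so it cannot run arbitrary code.
operations = ast.literal_eval(ops_line.split("- synaptic_operations:")[-1].strip())

print(r2_value)                     # 0.6512
print(operations["Effective_ACs"])  # 1250.5
```

If a log line does not follow this `- keyword: value` shape, the `float(...)` call raises a `ValueError`, so it is worth checking one log file by hand before processing a whole directory.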



Define the `base_path`. After processing, the notebook writes an Excel workbook to `output_file_path`. The workbook contains the data extracted from each log file together with the aggregated averages.

In [2]:
# Base path containing the subfolders (for example "./output/2024-08-01/")
base_path = "./output/"

# Path of the Excel workbook to write
output_file_path = './extracted_log_data_with_averages.xlsx'
all_subfolder_dfs = []  # To collect data from all subfolders for final averaging

# Create a pandas Excel writer using XlsxWriter as the engine
with pd.ExcelWriter(output_file_path, engine='xlsxwriter') as writer:
    for subfolder in os.listdir(base_path):
        subfolder_path = os.path.join(base_path, subfolder)
        if os.path.isdir(subfolder_path):
            for file_name in os.listdir(subfolder_path):
                if file_name.endswith('.log'):
                    log_file_path = os.path.join(subfolder_path, file_name)
                    df = process_log_file(log_file_path)
                    all_subfolder_dfs.append(df)
                    sheet_name = f"{subfolder}_{file_name.replace('.log', '')}"
                    df.to_excel(writer, sheet_name=sheet_name[:31], index=False)  # Sheet name max length is 31 characters
    
    # Calculate session-level average across all subfolders
    if all_subfolder_dfs:
        combined_df = pd.concat(all_subfolder_dfs)
        session_level_avg_df = combined_df.groupby('Session Name').mean().reset_index()
        session_level_avg_df.to_excel(writer, sheet_name="Session_Level_Average", index=False)
    
        # Calculate overall average across all sessions and subfolders
        overall_avg_df = session_level_avg_df.mean(numeric_only=True).to_frame().transpose()
        overall_avg_df.insert(0, 'Session Name', 'Overall Mean')
        overall_avg_df.to_excel(writer, sheet_name="Overall_mean", index=False)

output_file_path

'./extracted_log_data_with_averages.xlsx'
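
Because sheet names are cut to 31 characters, two long `subfolder_file` names could end up identical after truncation. The sketch below shows one possible way to keep them distinct; `unique_sheet_name` is a hypothetical helper and is not part of the notebook.

```python
def unique_sheet_name(raw_name, used, max_len=31):
    """Truncate raw_name to max_len characters, adding a numeric suffix on clashes.

    `used` is a set of names already taken; the chosen name is added to it.
    """
    name = raw_name[:max_len]
    counter = 1
    while name in used:
        suffix = f"_{counter}"
        # Shorten the base so that base + suffix still fits in max_len.
        name = raw_name[:max_len - len(suffix)] + suffix
        counter += 1
    used.add(name)
    return name


used_names = set()
first = unique_sheet_name("12-00-00_" + "a" * 40, used_names)
second = unique_sheet_name("12-00-00_" + "a" * 40 + "b", used_names)
print(first)
print(second)
```

With the default Hydra folder names (`%H-%M-%S`) and short log file names, truncation is unlikely to matter, so this is only relevant for unusually long names.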

## Description of the Output Excel File
The Excel workbook generated by this notebook contains the information extracted from the log files, structured as follows:

1. Sheets and Data Structure:
- Individual Log Data Sheets: For each log file processed, a separate sheet is created in the Excel file. The sheet names are derived from the subfolder name and the log file name. Each of these sheets contains a detailed breakdown of the data extracted from the corresponding log file.
- Columns: Each sheet contains the metrics extracted from the log files: `Session Name`, `r2`, `Footprint`, `Connection Sparsity`, `Activation Sparsity`, `Effective MACs`, `Effective ACs`, and `Dense`. The last three columns come from the `Effective_MACs`, `Effective_ACs`, and `Dense` entries of the logged `synaptic_operations` dictionary.
- Session Level Average Sheet (`Session_Level_Average`): This sheet contains the average values of `r2`, `Footprint`, `Connection Sparsity`, `Activation Sparsity`, `Effective MACs`, `Effective ACs`, and `Dense` for each session across the different subfolders (for example, different initializations).
- Overall Mean Sheet (`Overall_mean`): This sheet contains a single row holding the mean of the session-level averages across all sessions and subfolders.
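
The two summary sheets can be illustrated with a small, made-up example. The numbers below are invented and only two metric columns are shown; the point is that the overall mean is computed from the per-session averages rather than from the raw rows.

```python
import pandas as pd

# Invented results: session "A" appears in two subfolders, session "B" in one.
combined = pd.DataFrame({
    "Session Name": ["A", "A", "B"],
    "r2": [0.5, 0.75, 0.25],
    "Footprint": [100.0, 100.0, 200.0],
})

# Session-level average: one row per session, averaged over subfolders.
session_avg = combined.groupby("Session Name").mean(numeric_only=True).reset_index()

# Overall mean: the mean of the per-session averages.
overall = session_avg.mean(numeric_only=True)

print(session_avg)
print(overall)
```

Here session `A` averages to an `r2` of 0.625, so the overall `r2` is (0.625 + 0.25) / 2 = 0.4375, whereas averaging the three raw rows directly would give 0.5. Sessions that appear in more subfolders therefore do not weigh more heavily in the overall mean.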
