# Alouette Data Dictionary

The Alouette data dictionary was prepared following the PDS4 Data Dictionary standards, with slight modifications. For more information on PDS4, see: https://pds.nasa.gov/datastandards/documents/dd/all/current/

The code below first defines the helper functions, then specifies the files to be analyzed. For each column of a file, the user follows the prompts in the console to supply a description, a data type, and a unit; the answers are saved to a .csv file.

In [None]:
import os
import time
import pandas as pd
from IPython.display import clear_output

### Function Initialization Section

Please run the cells below. 

In [None]:
# Function for reading in CSV and Excel documents. Other data types to be supported in the future
def read_file(file_path):
    if file_path.endswith('.csv'):
        data = pd.read_csv(file_path)
    elif file_path.endswith('.xlsx'):
        data = pd.read_excel(file_path)
    else:
        raise ValueError("Unsupported file format. Please upload a CSV or XLSX file.")
    return data

In [None]:
# Uses the column header as the name if there is one, otherwise "Unnamed_Column". Could be updated to let the user supply a name when none exists
def get_column_name(column):
    if column.name is not None:
        return column.name
    else:
        return "Unnamed_Column"

In [None]:
# Prompts user for a description
def get_user_description(column_name):
    print(f"Please provide a description for the column '{column_name}': ")
    description = input() 
    return description


In [None]:
# Has the user select a unit from the list of available units. This list can be shortened to simplify the prompt
def get_unit_selection(column_name):
    units_list = ["Unit_Of_Measure",
    "Units_of_Acceleration",
    "Units_of_Amount_Of_Substance",
    "Units_of_Angle",
    "Units_of_Angular_Velocity",
    "Units_of_Area",
    "Units_of_Current",
    "Units_of_Energy",
    "Units_of_Force",
    "Units_of_Frame_Rate",
    "Units_of_Frequency",
    "Units_of_Gmass",
    "Units_of_Length",
    "Units_of_Map_Scale",
    "Units_of_Mass",
    "Units_of_Mass_Density",
    "Units_of_Misc",
    "Units_of_None",
    "Units_of_Optical_Path_Length",
    "Units_of_Pixel_Resolution_Angular",
    "Units_of_Pixel_Resolution_Linear",
    "Units_of_Pixel_Resolution_Map",
    "Units_of_Pixel_Scale_Angular",
    "Units_of_Pixel_Scale_Linear",
    "Units_of_Pixel_Scale_Map",
    "Units_of_Power",
    "Units_of_Pressure",
    "Units_of_Radiance",
    "Units_of_Rates",
    "Units_of_Solid_Angle",
    "Units_of_Spectral_Irradiance",
    "Units_of_Spectral_Radiance",
    "Units_of_Storage",
    "Units_of_Temperature",
    "Units_of_Time",
    "Units_of_Velocity",
    "Units_of_Voltage",
    "Units_of_Volume",
    "Units_of_Wavenumber"]
                  
    print(f"Please select a unit for the column '{column_name}' from the following list:")
    print(*units_list, sep = '\n')
    unit = input() 
    if unit not in units_list:
        print("Invalid unit selected. Defaulting to 'Units_of_None'.")
        unit = "Units_of_None"
    return unit
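
The validation step in `get_unit_selection` can also be written as a small pure function, separate from `input()`, which makes it easier to test. The sketch below is one possible shape, not part of the original notebook: the helper name `resolve_choice` is hypothetical, and the case-insensitive, whitespace-tolerant matching is an assumption (the notebook compares strings exactly).

```python
def resolve_choice(choice, options, default):
    """Return the entry of `options` matching `choice`, or `default`.

    Matching ignores surrounding whitespace and letter case. This is an
    assumption; the notebook's own check is an exact string comparison.
    """
    lookup = {option.lower(): option for option in options}
    return lookup.get(choice.strip().lower(), default)


# Shortened list for illustration only
units = ["Units_of_Length", "Units_of_Time", "Units_of_None"]

print(resolve_choice(" units_of_time ", units, "Units_of_None"))
print(resolve_choice("furlongs", units, "Units_of_None"))
```

The same helper could back the data type prompts, since they follow the same "pick from a list or fall back to a default" pattern.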


In [None]:
# Has the user pick a broad category, then a specific type within that category (a default is used if the input is not recognized)
def get_data_type(column_name):
    
    type_list = ["Date_Time",
    "File_Name",
    "Numeric",
    "Identifier",
    "String",
    "Misc"]
    
    date_list = ['ASCII_DOI', 'ASCII_Date', 'ASCII_Time', 'ASCII_Time_GPS', 'ASCII_Date_DOY', 'ASCII_Date_Time', 'ASCII_Date_Time_UTC', 'ASCII_Date_Time_YMD', 'ASCII_Date_Time_DMY', 'ASCII_Date_Time_MDY']
    
    file_list = ['ASCII_Directory_Path_Name', 'ASCII_File_Name', 'ASCII_File_Specfication_Name']
    
    numeric_list = ['ASCII_Integer', 'ASCII_NonNegative_Integer', 'ASCII_Real', 'ASCII_Complex', 'ASCII_Checksum', 'ASCII_Numeric', 'ASCII_Float', 'ASCII_Double']
    
    iden_list = ['ASCII_Boolean', 'ASCII_Reference', 'ASCII_Local_Identifier', 'ASCII_Local_Identifier_Reference', 'ASCII_Identifier']
    
    string_list = ['ASCII_String', 'ASCII_Char', 'ASCII_Text']
                  
    print(f"Please select an overall data type for the column '{column_name}' from the following list:")
    print(*type_list, sep = '\n')
    data_type = input() 
    
    if data_type not in type_list:
        print("Invalid type selected. Defaulting to 'ASCII_Misc'.")
        data_type = "ASCII_Misc"
        
    elif data_type == "Date_Time":
        clear_output()
        print(f"Please select a Date Time type for '{column_name}' from the following list:")
        print(*date_list, sep = '\n')
        data_type = input() 
        if data_type not in date_list:
            print("Invalid type selected. Defaulting to 'ASCII_Date_Time'.")
            data_type = "ASCII_Date_Time"
            
    elif data_type == "File_Name":
        clear_output()
        print(f"Please select a File Name type for '{column_name}' from the following list:")
        print(*file_list, sep = '\n')
        data_type = input() 
        if data_type not in file_list:
            print("Invalid type selected. Defaulting to 'ASCII_File_Name'.")
            data_type = "ASCII_File_Name"
            
    elif data_type == "Numeric":
        clear_output()
        print(f"Please select a Numeric type for '{column_name}' from the following list:")
        print(*numeric_list, sep = '\n')
        data_type = input() 
        if data_type not in numeric_list:
            print("Invalid type selected. Defaulting to 'ASCII_Numeric'.")
            data_type = "ASCII_Numeric"
            
    elif data_type == "Identifier":
        clear_output()
        print(f"Please select an Identifier type for '{column_name}' from the following list:")
        print(*iden_list, sep = '\n')
        data_type = input() 
        if data_type not in iden_list:
            print("Invalid type selected. Defaulting to 'ASCII_Identifier'.")
            data_type = "ASCII_Identifier"

    elif data_type == "String":
        clear_output()
        print(f"Please select a String type for '{column_name}' from the following list:")
        print(*string_list, sep = '\n')
        data_type = input() 
        if data_type not in string_list:
            print("Invalid type selected. Defaulting to 'ASCII_String'.")
            data_type = "ASCII_String"
            
    elif data_type == "Misc":
        clear_output()
        data_type = "ASCII_Misc"


    return data_type


In [None]:
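# Derives statistics for the column according to the data type specified by the user:
# [min, max, count] of the values for numeric types, [min length, max length, count] for text-like types, NaN otherwise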
def get_data_stats(data_type, column):
        
    date_list = ['ASCII_DOI', 'ASCII_Date', 'ASCII_Time', 'ASCII_Time_GPS', 'ASCII_Date_DOY', 'ASCII_Date_Time', 'ASCII_Date_Time_UTC', 'ASCII_Date_Time_YMD', 'ASCII_Date_Time_DMY', 'ASCII_Date_Time_MDY']
    
    file_list = ['ASCII_Directory_Path_Name', 'ASCII_File_Name', 'ASCII_File_Specfication_Name']
    
    numeric_list = ['ASCII_Integer', 'ASCII_NonNegative_Integer', 'ASCII_Real', 'ASCII_Complex', 'ASCII_Checksum', 'ASCII_Numeric', 'ASCII_Float', 'ASCII_Double']
    
    iden_list = ['ASCII_Boolean', 'ASCII_Reference', 'ASCII_Local_Identifier', 'ASCII_Local_Identifier_Reference', 'ASCII_Identifier']
    
    string_list = ['ASCII_String', 'ASCII_Char', 'ASCII_Text']
    
    if data_type in date_list:
        return [float('NaN')]
    
    elif data_type in file_list:
        try:
            vals = [column.map(len).min(), column.map(len).max(), column.size]
        except:
            vals = [float('NaN')]
        return vals
    
    elif data_type in numeric_list:
        try:
            vals = [column.min(), column.max(), column.size]
        except: 
            vals = [float('NaN')]
        return vals
    
    elif data_type in iden_list:
        try: 
            vals = [column.map(len).min(), column.map(len).max(), column.size]
        except:
            vals = [column.min(), column.max(), column.size]
        return vals
    
    elif data_type in string_list:
        try: 
            vals = [column.map(len).min(), column.map(len).max(), column.size]
        except:
            vals = [float('NaN')]
        return vals
    
    else:
        return [float('NaN')]
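
As an illustration of what the `Stats` entries look like, the sketch below applies the same pandas expressions used in `get_data_stats` to two small, made-up Series (the values are placeholders, not Alouette data). Numeric columns yield `[min, max, count]`, while text-like columns yield `[shortest length, longest length, count]`.

```python
import pandas as pd

# Made-up sample columns for illustration only
numeric = pd.Series([3, 7, 5])
text = pd.Series(["ab", "abcd", "a"])

# Same expression as the numeric branch: value range and number of entries
numeric_stats = [numeric.min(), numeric.max(), numeric.size]

# Same expression as the string branch: length range and number of entries
lengths = text.map(len)
text_stats = [lengths.min(), lengths.max(), text.size]

print(numeric_stats)
print(text_stats)
```

Note that `size` counts every entry, including missing ones.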
        

In [None]:
def analyze_file(file_path):
    # Read the file into a DataFrame
    data = read_file(file_path)
    
    # Create an empty DataFrame to store the analysis results
    analysis_df = pd.DataFrame(columns=['Column_Name', 'Description', 'Data_Type', 'Stats', 'Units'])
    
    # Iterate over columns to perform analysis and populate the analysis DataFrame
    for column in data.columns:
        
        clear_output(wait=True)
        column_name = get_column_name(data[column])
        
        time.sleep(1)
        clear_output(wait=True)
        description = get_user_description(column_name)
        
        time.sleep(1)
        clear_output(wait=True)
        data_type = get_data_type(column_name)
        
        time.sleep(1)
        clear_output(wait=True)
        stats = get_data_stats(data_type, data[column])
        
        time.sleep(1)
        clear_output(wait=True)
        units = get_unit_selection(column_name)
        
        new_row = {'Column_Name': column_name, 'Description': description, 'Data_Type': data_type, 'Stats': stats, 'Units': units}
        
        analysis_df.loc[len(analysis_df)] = new_row
        
    
    return analysis_df
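
Because `analyze_file` prompts for every column, it is awkward to try out without sitting through the prompts. The sketch below shows one way the same per-column loop could be exercised non-interactively by passing the "answers" in as callables. It is an assumption-laden illustration, not the notebook's method: `build_dictionary` and its parameters are hypothetical names, the sample columns and values are made up, the `Stats` column is omitted for brevity, and rows are collected in a list rather than appended with `.loc` as the notebook does.

```python
import pandas as pd


def build_dictionary(data, describe, pick_type, pick_unit):
    """Build one data-dictionary row per column using the supplied callables."""
    rows = []
    for name in data.columns:
        rows.append({
            "Column_Name": name,
            "Description": describe(name),
            "Data_Type": pick_type(name),
            "Units": pick_unit(name),
        })
    # Build the DataFrame once, with a fixed column order
    return pd.DataFrame(rows, columns=["Column_Name", "Description", "Data_Type", "Units"])


# Made-up sample data standing in for a processed .csv file
sample = pd.DataFrame({"frequency": [1.5, 2.0], "station": ["AAA", "BBB"]})

result = build_dictionary(
    sample,
    describe=lambda name: f"Placeholder description of {name}",
    pick_type=lambda name: "ASCII_String",
    pick_unit=lambda name: "Units_of_None",
)
print(result)
```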


### Analysis Section

To replicate our work, you can obtain the .csv files in either of the following two ways:

1) Downloading them from the Open Data portal of the Canadian Space Agency: https://donnees-data.asc-csa.gc.ca/en/dataset/221c1c75-4c42-4286-a4ce-ca6c3027b7fe
2) Generating the .csv files by running the "Alouette_processor", "Alouette_processor2", "E-Satellite tracking processing" and "F_Prepare result_master for Alouette Microapplication" scripts, available on GitHub at: https://github.com/asc-csa/Alouette_extract/tree/master/code

In [None]:
# Run to list the files in the current directory and confirm the name of the file to analyze.
os.listdir()

In [None]:
# run to perform actual analysis
analyzed_results = analyze_file('result_microapp.csv')

# saves results to a csv
analyzed_results.to_csv('dd_result_microapp.csv', index=False)

### Final Processing

Following the creation of data dictionaries associated with a given set of processed data, it is helpful to combine the results into a single bundle. 

1) Select and copy the paths of the processed data dictionaries.
2) Run the cell below to combine them into a single structured Excel workbook.
3) Manually add a sheet to combined_data_dictionary.xlsx to serve as an overview page.

You can find the final data dictionary for Alouette-1 here: https://github.com/asc-csa/Alouette_extract/tree/master/documentation

In [None]:
import pandas as pd

# List of sub-data dictionaries to be combined
csv_files = ['dd_result_master.csv', 'dd_result_master_orbit.csv', 'dd_result_microapp.csv']  

# Create a Pandas Excel writer using openpyxl to write to Excel workbook 
with pd.ExcelWriter('combined_data_dictionary.xlsx', engine='openpyxl') as writer:
    for file in csv_files:
        df = pd.read_csv(file)

        # Extract filename without extension for sheet name
        sheet_name = file.split('/')[-1].split('.')[0]

        # Write DataFrame to an Excel sheet
        df.to_excel(writer, sheet_name=sheet_name, index=False)
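
The sheet name above is derived by splitting the path on `/` and `.`. If the input files ever live under other directories or have longer names, a slightly more defensive derivation may be useful. The sketch below is one option under stated assumptions: `sheet_name_for` is a hypothetical helper that is not part of the notebook, and the truncation reflects Excel's 31-character limit on worksheet names.

```python
from pathlib import Path

EXCEL_SHEET_NAME_LIMIT = 31  # Excel does not accept longer worksheet names


def sheet_name_for(path):
    """Derive a worksheet name from a file path (hypothetical helper)."""
    # Path.stem drops the directory part and the final extension
    return Path(path).stem[:EXCEL_SHEET_NAME_LIMIT]


print(sheet_name_for("dd_result_master_orbit.csv"))
print(sheet_name_for("output/dd_result_microapp.csv"))
```

Unlike `split('.')[0]`, `Path.stem` removes only the last extension, so a name containing extra dots keeps everything before the final one. Truncation can also make two long names collide, so it is worth checking that the resulting sheet names are unique.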