## Calculations over sample measurements by Study, Assay, and Group

This notebook processes the transformed data of each Study by Assay, grouping samples by their combination of Factors. The measurements of the samples in each group are averaged, and the group averages are then compared pairwise as percentage differences.
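As a minimal sketch of the calculation described above, the snippet below averages measurements per group and compares the group averages pairwise. The column names `Sample Name` and `Treatment Group` follow the sample tables used later in this notebook; the sample values are hypothetical, and the symmetric percent-difference formula (absolute difference relative to the mean of the two averages) is an assumption, since the notebook does not state which definition it intends.

```python
import itertools
import pandas as pd

# Hypothetical measurements: one row per sample, one numeric column per measurement.
samples = pd.DataFrame({
    "Sample Name": ["s1", "s2", "s3", "s4"],
    "Treatment Group": ["Flight", "Flight", "Ground", "Ground"],
    "retina_thickness": [0.10, 0.12, 0.08, 0.10],
})

# Average every numeric measurement within each group.
group_means = samples.groupby("Treatment Group").mean(numeric_only=True)

def percent_difference(a, b):
    """Assumed definition: |a - b| relative to the mean of a and b, in percent."""
    return (a - b).abs() / ((a + b) / 2) * 100

# One column per pairwise group comparison, one row per measurement.
contrasts = pd.DataFrame({
    f"{g1} vs {g2}": percent_difference(group_means.loc[g1], group_means.loc[g2])
    for g1, g2 in itertools.combinations(group_means.index, 2)
})
print(contrasts)
```

Using `groupby(...).mean()` averages only over the samples that are actually present, so no separate division by group size is needed.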

## Compiling metadata for group calculations
Script modified from:
https://github.com/asaravia-butler/KnowHax2025_NASA-SPOKE

In [1]:
osd_accession = ["OSD-557", "OSD-568", "OSD-583", "OSD-679", "OSD-680", "OSD-681",]

import save_metadata as svm

for accession in osd_accession: svm.sampleToGroupMappingAndGroupToGroupMetadata(accession)

Requested metadata from: https://visualization.osdr.nasa.gov/biodata/api/v2/dataset/OSD-557/
Requested sample table from: https://visualization.osdr.nasa.gov/biodata/api/v2/query/metadata/?id.accession=OSD-557&study.factor+value.Spaceflight
4 unique groups = 6 pairwise combinations
Requested metadata from: https://visualization.osdr.nasa.gov/biodata/api/v2/dataset/OSD-568/
Requested sample table from: https://visualization.osdr.nasa.gov/biodata/api/v2/query/metadata/?id.accession=OSD-568&study.factor+value.Spaceflight
4 unique groups = 6 pairwise combinations
Requested metadata from: https://visualization.osdr.nasa.gov/biodata/api/v2/dataset/OSD-583/
Requested sample table from: https://visualization.osdr.nasa.gov/biodata/api/v2/query/metadata/?id.accession=OSD-583&study.factor+value.Spaceflight
5 unique groups = 10 pairwise combinations
Requested metadata from: https://visualization.osdr.nasa.gov/biodata/api/v2/dataset/OSD-679/
Requested sample table from: https://visualization.osdr.n
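The "unique groups = pairwise combinations" lines in the log follow from counting unordered pairs, n(n-1)/2. A short sketch, using hypothetical group labels rather than the real ones from any study:

```python
import itertools
import math

# Hypothetical group labels standing in for a study's unique treatment groups.
groups = ["Flight", "Ground Control", "Vivarium", "Basal"]

# Every unordered pair of distinct groups is one contrast.
pairs = list(itertools.combinations(groups, 2))
assert len(pairs) == math.comb(len(groups), 2)
print(f"{len(groups)} unique groups = {len(pairs)} pairwise combinations")
```

With five groups the same count gives `math.comb(5, 2) == 10`, matching the log entry for OSD-583.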

## Samples averaging by group

In [2]:
METADATA_DIR = "../data/metadata"
DATA_DIR = "../data"

## Micro-CT

In [3]:
microCT_accessions = ["557"] #Add the accession numbers of the micro-CT studies to analyze to this list

In [4]:
import pandas as pd
import subprocess as sp
import numpy as np
import itertools

#Suppressing all Python warnings (note: this also hides pandas deprecation and chained-assignment warnings)
import warnings
warnings.filterwarnings('ignore')

microCTDataSet = {} #Transformed data is loaded into this dictionary: accession -> list of assay data frames
microCTSampleGroupMapping = {} #Holds the content of metadata 'OSD-<###>_SampleTable.csv': accession -> {group: [sample names]}
microCTGroupsContrast = {} #Reserved for group-vs-group contrast tables (not populated yet)

microCTDataSet_processed = {} #Holds all the processed data, i.e. the results calculated from the data. Check for results here!

for accession in microCT_accessions:
    try:
        #Creating a dictionary for each study/accession for checking to which group a sample belongs to
        sampleGroupMap_df = pd.read_csv(METADATA_DIR +"/"+ f"OSD-{accession}_SampleTable.csv").groupby("Treatment Group").agg(list)
        microCTSampleGroupMapping[accession] = sampleGroupMap_df["Sample Name"].to_dict()
        
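        #Listing the contents of DATA_DIR, one entry per line, decoded from bytes to str
        assaysList = sp.check_output(["ls", DATA_DIR]).decode('utf-8').splitlines()
    
        if len(assaysList) == 0: continue
        
        #Keeping only the entries whose name mentions micro-CT
        microCTAssays = [assay for assay in assaysList if "microct" in assay.lower()]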
    
        filesList = list()
        for mctAssay in  microCTAssays:
            filesList.append(DATA_DIR +"/"+ mctAssay)
    
        #Reading all files pertaining to the same study/accession about MCT and creating a dictionary entry for it
        for i in range(len(filesList)):
            microCTData_df = pd.read_csv(filesList[i])
            #Creating the list only once per accession, so earlier files are not discarded
            if accession not in microCTDataSet: microCTDataSet[accession] = []
            microCTDataSet[accession].append(microCTData_df)
        
        #Going over all the assays to calculate for each of their measurements averages according to group belonging
        for assay_df in microCTDataSet[accession]:
            #Creating a data frame to hold all the averaged measurements for each group
            averaged_df =  pd.DataFrame(columns=list(assay_df.columns.insert(0, 'Groups')))
            averaged_df["Groups"] = list(microCTSampleGroupMapping[accession].keys())
            averaged_df = averaged_df[averaged_df.columns.drop(list(averaged_df.filter(regex='unit*')))]
            averaged_df = averaged_df[averaged_df.columns.drop('Sample Name')]
            averaged_df = averaged_df.fillna(0)
            #print(averaged_df)
    
            #Going over individual sample measurements for calculation of averages
            samplesInTheAssay = list() #not all Samples in the Study are present in a particular Assay, otherwise we could simply use microCTSampleGroupMapping dict 
            assay_df = assay_df.fillna(0)     
            for measurement_column in assay_df.columns[1:].drop(list(assay_df.filter(regex='unit*'))):
                #Finding to which group a sample belongs to
                for i in range(len(assay_df[measurement_column])):
                    for group in microCTSampleGroupMapping[accession].keys():
                        if assay_df['Sample Name'].iloc[i] in microCTSampleGroupMapping[accession][group]:
                            samplesInTheAssay.append(assay_df['Sample Name'].iloc[i])
                            #Summing up all the values for the same measurement which belong to the same group
                            averaged_df.loc[list(averaged_df['Groups']).index(group), measurement_column] += assay_df[measurement_column].iloc[i]
    
            #Checking which samples were present in this particular Assay
            samplesGroupMapping4Assay = {} #this is a sub-dictionary from 'microCTSampleGroupMapping' dict
            samplesInTheAssay = set(samplesInTheAssay)
            for sample in samplesInTheAssay:
                for group in microCTSampleGroupMapping[accession].keys():
                    if sample in microCTSampleGroupMapping[accession][group]:
                        if group not in samplesGroupMapping4Assay.keys(): samplesGroupMapping4Assay[group] = []
                        samplesGroupMapping4Assay[group].append(sample)
            
            #Applying column-wise division by each group size (number of samples in each group)
            averaged_df_T = averaged_df.T
            for i in range(averaged_df_T.shape[1]): #One column per group, however many groups the Study has
                groupName = averaged_df_T.iloc[0, i] #The first row of the transposed data frame holds the group name
                if groupName in samplesGroupMapping4Assay: #Only calculating averages for groups that exist in the Assay (the groups in an Assay are a subset of the groups in the Study)
                    averaged_df_T.iloc[1:, i] = averaged_df_T.iloc[1:, i]/len(samplesGroupMapping4Assay[groupName])
    
            if accession not in microCTDataSet_processed: microCTDataSet_processed[accession] = []
            microCTDataSet_processed[accession].append(averaged_df_T)
            #print(averaged_df_T) #Sample averages by Group data frame 

        #Creating contrast tables for Group vs Group comparisons according to assay
        groupContrastCombinations = list(itertools.combinations(microCTSampleGroupMapping[accession], 2))
        contrasts_columns = [' vs '.join(x) for x in groupContrastCombinations]
        contrasts_columns.insert(0, 'Averages')
        contrast_dict = {}
        for i in range(len(contrasts_columns)):
            if i == 0 :contrast_dict[contrasts_columns[i]] = microCTDataSet_processed[accession][0].index.to_list()[1:]
            else: contrast_dict[contrasts_columns[i]] = [0]*len(microCTDataSet_processed[accession][0].index.to_list()[1:])
        
        contrasts_df = pd.DataFrame(contrast_dict)
        averages_df = microCTDataSet_processed[accession][0] #Need to improve this in case there are more assays for the same study
        for column in list(contrasts_df.columns)[1:]:
            groupsInComparison = column.split(' vs ')
            #TO DO: Need to fix average data frame headers
            #contrasts_df[column] = abs(averages_df[aux1]-averages_df[aux2]) #This is where percentage difference calculation should happen. Had problem indexing data frame
        #print(contrasts_df)
    except IOError as e:
        print(f"Failed to read file: {e}")

print("Averages: ")
#Study 557 is the only micro-CT Study here and has a single Assay, hence index [0]. With more assays, iterate over microCTDataSet_processed['557'], which holds one data frame of group averages per assay.
print(microCTDataSet_processed['557'][0])







Averages: 
                                                 0                 1  \
Groups                            Cohort Control 1  Cohort Control 2   
sagittal_anteroposterior_length              1.766               0.0   
sagittal_superoinferior_length               2.037               0.0   
sagittal_retina_thickness                  0.10825               0.0   
sagittal_pigment_layer_thickness             0.057               0.0   
sagittal_choroid_thickness                  0.0545               0.0   
sagittal_sclera_thickness                    0.081               0.0   
axial_anteroposterior_length               1.72675               0.0   
axial_superoinferior_length                1.98675               0.0   
axial_retina_thickness                       0.108               0.0   
axial_pigment_layer_thickness               0.0535               0.0   
axial_choroid_thickness                     0.0465               0.0   
axial_sclera_thickness                     0.06725   