### Workflow

Figma diagram: https://www.figma.com/file/PFKdJO3HTIcM9g4nmVONCT/HRA-Data-Explorer?type=whiteboard&node-id=1%3A108&t=XGoN91s7IKbyZ7xk-1

The diagram describes the following steps:

1. Dataset Collections:
The process begins with RUI-registered tissue datasets, i.e. tissue datasets that have been spatially registered with the Registration User Interface (RUI).
These datasets are linked to specific organs and anatomical structures.

2. Dataset Processing:
The datasets are processed to produce anatomical structure and cell type information for 2D FTUs.
A decision point (an optional step that sorts these structures) follows, which leads to the extraction of a Cell Summary that includes cell type biomarker gene expression data.

    1. Compare each FTU's cell type information across Azimuth, PopV and CellTypist.
    2. Get the list of datasets for each FTU:
        1. Take the cell summary from atlas-as-cell-summaries.jsonld.
        2. Add organ data from the cell annotation crosswalk files.
    3. Identify any dataset that is shared between two FTUs.

3. Cell Type Mapping:
The list of 2D FTUs is used to map cell types, which raises a set of questions about how best to extract data from the datasets used in Steps 1 and 2.

4. Vasculature Mapping

5. Shared cell type info:
    The following cell types from the dataset are commonly found both within FTUs and in the tissue around them:

    1. Endothelial Cells:
        - glomerular capillary endothelial cell
        - efferent arteriole endothelial cell
        - afferent arteriole endothelial cell
        - peritubular capillary endothelial cell
        - alveolar capillary type 1 endothelial cell
        - capillary endothelial cell
        - endothelial cell
        - endothelial cell of artery
        - vein endothelial cell
        - endothelial cell of hepatic sinusoid
        - blood vessel endothelial cell
        - splenic endothelial cell

    2. Fibroblasts and Myofibroblasts:
        - alveolar type 1 fibroblast
        - secondary crest myofibroblasts
        - skin fibroblast
        - hepatic portal fibroblast
        - fibroblast of subepithelial connective tissue of prostatic gland
        - fibroblast of connective tissue of prostate
        - fibroblast
        - myofibroblast

    3. Macrophages and Dendritic Cells:
        - macrophage
        - thymic plasmacytoid dendritic cell
        - thymic cortical macrophage
        - thymic medullary macrophage
        - follicular dendritic cell
        - splenic tingible body macrophage
        - splenic marginal zone macrophage
        - splenic red pulp macrophage
        - dendritic cell, human
        - splenic white pulp macrophage
        - dendritic cell

    4. Stem/Progenitor Cells:
        - intestinal crypt stem cell of large intestine
        - hepatic progenitor cell
        - hematopoietic stem cell

In [None]:
import pandas as pd
import requests
import json
import matplotlib.pyplot as plt
import seaborn as sns

### Load Reference files

In [None]:
# Local paths to save the downloaded files
popv_local_path = 'ref_data/popv.csv'
celltypist_local_path = 'ref_data/celltypist.csv'
azimuth_local_path = 'ref_data/azimuth.csv'
ftu_cell_count_path = 'ref_data/FTU Cell Count Table - Cell_Type_Count.csv'

# Load the files
popv = pd.read_csv(popv_local_path)
celltypist = pd.read_csv(celltypist_local_path)
azimuth = pd.read_csv(azimuth_local_path)
ftu_cell_count = pd.read_csv(ftu_cell_count_path)

# Load the two JSON-LD files
atlas_enriched_dataset_graph_path = 'ref_data/atlas-enriched-dataset-graph.jsonld'
atlas_as_cell_summaries_path = 'ref_data/atlas-as-cell-summaries.jsonld'

with open(atlas_as_cell_summaries_path, 'r') as file:
    json_data_summary = json.load(file)

with open(atlas_enriched_dataset_graph_path, 'r') as file:
    json_data_enriched = json.load(file)

### Get the Cell annotation tool information for cell types in FTUs

In [None]:
# Define the mapping from organ name to UBERON ID.
# The IDs for large intestine, liver and small intestine agree with the
# organ_id-to-name mapping used later in this notebook.
organ_to_uberon = {
    'Kidney': 'UBERON:0002113',
    'Lung': 'UBERON:0002048',
    'Pancreas': 'UBERON:0001264',
    'Large Intestine': 'UBERON:0000059',
    'Skin': 'UBERON:0002097',
    'Liver': 'UBERON:0002107',
    'Prostate': 'UBERON:0002367',
    'Thymus': 'UBERON:0002370',
    'Spleen': 'UBERON:0002106',
    'Small Intestine': 'UBERON:0002108'
}
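The notebook handles two spellings of the same UBERON term: CURIEs such as `UBERON:0002113` in the mapping above, and full IRIs such as `http://purl.obolibrary.org/obo/UBERON_0002113` in the `organ_id` values read from the JSON-LD files. The following is a minimal sketch of converting between the two, assuming the usual OBO PURL pattern. The helper names `curie_to_iri` and `iri_to_curie` are hypothetical and are not used elsewhere in the notebook.

```python
OBO_PREFIX = 'http://purl.obolibrary.org/obo/'


def curie_to_iri(curie):
    """Turn 'UBERON:0002113' into its OBO PURL form."""
    prefix, local_id = curie.split(':', 1)
    return f'{OBO_PREFIX}{prefix}_{local_id}'


def iri_to_curie(iri):
    """Turn an OBO PURL back into a CURIE such as 'UBERON:0002048'."""
    if not iri.startswith(OBO_PREFIX):
        raise ValueError(f'Not an OBO PURL: {iri}')
    prefix, local_id = iri[len(OBO_PREFIX):].split('_', 1)
    return f'{prefix}:{local_id}'


print(curie_to_iri('UBERON:0002113'))
print(iri_to_curie('http://purl.obolibrary.org/obo/UBERON_0002048'))
```

With helpers like these, one organ mapping could serve both the crosswalk tables and the JSON-LD data instead of maintaining two dictionaries by hand.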

In [None]:
# Drop the empty spreadsheet columns 'Unnamed: 6' through 'Unnamed: 14'
ftu_cell_count = ftu_cell_count.drop(columns=[f'Unnamed: {i}' for i in range(6, 15)])

# Rename 'CT ID in CL' to 'CL_ID' and 'CT Label in CL' to 'CL_Label_FTU' for consistency
ftu_cell_count = ftu_cell_count.rename(columns={'CT ID in CL': 'CL_ID', 'CT Label in CL': 'CL_Label_FTU'})

# Filter FTU cell count to include only rows where `CL_ID` contains `CL`
ftu_cell_count_filtered = ftu_cell_count[ftu_cell_count['CL_ID'].str.contains('CL', na=False)]

# Select the necessary columns; .copy() makes this an independent frame so that
# adding 'Organ_ID' in a later cell does not raise a SettingWithCopyWarning
ftu_cell_count_filtered = ftu_cell_count_filtered[['Organ', 'FTU Label in Uberon', 'FTU ID in Uberon', 'CL_ID', 'CL_Label_FTU']].copy()

# Merge the dataframes based on 'CL_ID'
merged_df = ftu_cell_count_filtered.merge(azimuth[['CL_ID', 'CL_Label']], on='CL_ID', how='left', suffixes=('', '_azimuth'))
merged_df = merged_df.rename(columns={'CL_Label': 'CL_Label_azimuth'})

merged_df = merged_df.merge(celltypist[['CL_ID', 'CL_Label']], on='CL_ID', how='left', suffixes=('', '_celltypist'))
merged_df = merged_df.rename(columns={'CL_Label': 'CL_Label_celltypist'})

merged_df = merged_df.merge(popv[['CL_ID', 'CL_Label']], on='CL_ID', how='left', suffixes=('', '_popv'))
merged_df = merged_df.rename(columns={'CL_Label': 'CL_Label_popv'})

# Populating the columns azimuth, celltypist, popv with 1 if there is a match, otherwise 0
merged_df['azimuth'] = merged_df['CL_Label_azimuth'].notna().astype(int)
merged_df['celltypist'] = merged_df['CL_Label_celltypist'].notna().astype(int)
merged_df['popv'] = merged_df['CL_Label_popv'].notna().astype(int)

# Rename 'CL_Label_FTU' to 'CT Label'
merged_df = merged_df.rename(columns={'CL_Label_FTU': 'CT Label'})

# Selecting columns for final output
final_columns = ['Organ', 'FTU Label in Uberon', 'FTU ID in Uberon', 'CL_ID', 'CT Label', 'azimuth', 'celltypist', 'popv']
ftu_ct_ann_info_df = merged_df[final_columns]

# Save the final dataframe to a CSV file
output_path = 'output/FTU-CT-AnnTool-info.csv'
ftu_ct_ann_info_df.to_csv(output_path, index=False)

# Display the first few rows of the final dataframe
print(ftu_ct_ann_info_df.head())
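
The presence flags can also be computed without merging. A crosswalk table may list the same `CL_ID` on several rows (for example once per `Organ_Level`), in which case a left merge repeats the matching FTU rows. The sketch below uses `isin` on toy data instead; the frames `ftu` and `azimuth_toy` and their values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins for the FTU table and one crosswalk table (values are made up).
ftu = pd.DataFrame({'CL_ID': ['CL:TOY_1', 'CL:TOY_2', 'CL:TOY_3']})
azimuth_toy = pd.DataFrame({
    'CL_ID': ['CL:TOY_1', 'CL:TOY_1', 'CL:TOY_3'],  # CL:TOY_1 appears at two levels
    'Organ_Level': ['Kidney_L1', 'Kidney_L2', 'Lung_v2_L1'],
})

# isin() only tests membership, so duplicate crosswalk rows cannot repeat FTU rows.
ftu['azimuth'] = ftu['CL_ID'].isin(azimuth_toy['CL_ID']).astype(int)
print(ftu)
```

The trade-off is that `isin` does not bring the tool's own label along; if `CL_Label` is needed, a merge against a de-duplicated crosswalk would be the alternative.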

In [None]:
# Map the organ names in FTU cell count table to UBERON IDs
ftu_cell_count_filtered.loc[:, 'Organ_ID'] = ftu_cell_count_filtered['Organ'].map(organ_to_uberon)

# Prepare a common structure to hold the results
result = []

# Function to dynamically generate match info for celltypist and azimuth
def generate_match_info(tool_name, tool_data):
    levels = set()
    for organ_level in tool_data['Organ_Level']:
        if 'level' in organ_level.lower():
            level = organ_level.split('_')[-2]
        else:
            level = organ_level.split('_')[-1]
        levels.add(level)
    
    match_info = {f'{tool_name}_{level}': 0 for level in levels}
    return match_info, levels


# Dynamically generate the match info for celltypist and azimuth
celltypist_info, celltypist_levels = generate_match_info('celltypist', celltypist)
azimuth_info, azimuth_levels = generate_match_info('azimuth', azimuth)

# Iterate over the filtered FTU cell count data
for idx, ftu_row in ftu_cell_count_filtered.iterrows():
    organ = ftu_row['Organ_ID']
    ftu_label = ftu_row['FTU Label in Uberon']
    ftu_id = ftu_row['FTU ID in Uberon']
    cl_id = ftu_row['CL_ID']
    cl_label = ftu_row['CL_Label_FTU']
    
    # Initialize match columns
    match_info = {
        'Organ': ftu_row['Organ'],
        'FTU Label in Uberon': ftu_label,
        'FTU ID in Uberon': ftu_id,
        'CL_ID': cl_id,
        'CL_Label_FTU': cl_label,
        'popv': 0
    }
    match_info.update({key: 0 for key in celltypist_info.keys()})
    match_info.update({key: 0 for key in azimuth_info.keys()})
    
    # Check in celltypist
    celltypist_matches = celltypist[(celltypist['CL_ID'] == cl_id) & (celltypist['Organ_ID'] == organ)]
    for _, ct_match in celltypist_matches.iterrows():
        if 'level' in ct_match['Organ_Level'].lower():
            level = ct_match['Organ_Level'].split('_')[-2]  # Extract the level number
        else:
            level = ct_match['Organ_Level'].split('_')[-1]  # Extract the level number
        match_info[f'celltypist_{level}'] = 1

    # Check in azimuth
    azimuth_matches = azimuth[(azimuth['CL_ID'] == cl_id) & (azimuth['Organ_ID'] == organ)]
    for _, az_match in azimuth_matches.iterrows():
        if 'level' in az_match['Organ_Level'].lower():
            level = az_match['Organ_Level'].split('_')[-2]  # Extract the level number
        else:
            level = az_match['Organ_Level'].split('_')[-1]  # Extract the level number
        match_info[f'azimuth_{level}'] = 1

    # Check in popv
    popv_matches = popv[(popv['CL_ID'] == cl_id) & (popv['Organ_ID'] == organ)]
    if not popv_matches.empty:
        match_info['popv'] = 1

    # Append to results
    result.append(match_info)

# Create a dataframe from the results
result_df = pd.DataFrame(result)

result_df.head()
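
The level-parsing rule appears three times in the cell above. A small helper could hold it in one place; the sketch below restates the same rule, and the name `extract_level` is hypothetical.

```python
def extract_level(organ_level):
    """Return the level token of an Organ_Level string, using the notebook's rule."""
    parts = organ_level.split('_')
    # Names such as 'Lung_v2_finest_level' end in the word 'level', so the
    # informative token is the second-to-last one; otherwise it is the last one.
    return parts[-2] if 'level' in organ_level.lower() else parts[-1]


for name in ['Kidney_L3', 'Lung_v2_finest_level', 'Adult_Human_Skin_pkl']:
    print(name, '->', extract_level(name))
```

Note that the rule assumes every value is a string containing at least one underscore; missing values would need to be handled before calling it.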

In [None]:
# Save the result to a CSV file
output_path = 'output/FTU-CT-AnnTool-level-info.csv'
result_df.to_csv(output_path, index=False)

print(f"File saved to {output_path}")

### Retrieve the dataset information pertaining to FTU cell types.

#### Understand the structure, content, and formatting of the JSON files atlas-as-cell-summaries.jsonld and atlas-enriched-dataset-graph.jsonld

In [None]:
# Function to extract structure and keys of the JSON
def extract_structure(data, level=0):
    structure = {}
    if isinstance(data, dict):
        for key, value in data.items():
            structure[key] = extract_structure(value, level + 1)
    elif isinstance(data, list) and len(data) > 0:
        structure = [extract_structure(data[0], level + 1)]
    else:
        structure = None
    return structure
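
To illustrate what the function returns, here is a self-contained sketch on a toy document. It restates the function compactly (without the unused `level` argument) so that it runs on its own; the toy keys are invented and do not reflect the real files.

```python
def extract_structure(data):
    """Return the key skeleton of nested JSON; leaf values become None."""
    if isinstance(data, dict):
        return {key: extract_structure(value) for key, value in data.items()}
    if isinstance(data, list) and len(data) > 0:
        # Only the first element is inspected, so lists are assumed homogeneous.
        return [extract_structure(data[0])]
    return None


toy = {'@graph': [{'sex': 'Female',
                   'summary': [{'cell_id': 'CL:TOY_1', 'count': 3}],
                   'tags': []}]}
print(extract_structure(toy))
```

Because only the first list element is inspected, keys that occur only in later elements will not show up in the skeleton.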

In [None]:
# Extract structure
json_structure_cell_summaries = extract_structure(json_data_summary)
json_structure_cell_summaries

In [None]:
# Extract structure
json_structure_enriched_dataset_graph = extract_structure(json_data_enriched)
json_structure_enriched_dataset_graph

#### Extract source, CT and dataset information from atlas-as-cell-summaries

In [None]:
# Extract the relevant data from the JSON file
json_graph = json_data_summary['@graph']

# Convert the JSON graph to a DataFrame for easier manipulation
json_df = pd.json_normalize(json_graph)

# Explode 'summary' so that each cell-type entry gets its own row. explode() repeats
# the other columns (including 'modality') for every new row, so each entry keeps the
# modality of its parent record. The second explode only has an effect if 'modality'
# holds lists; for scalar values it is a no-op.
expanded_json_df = json_df.explode('summary').explode('modality').reset_index(drop=True)

# Normalize the summary column and merge with expanded_json_df
summary_df = pd.json_normalize(expanded_json_df['summary'].tolist())
merged_json_df = pd.concat([expanded_json_df.drop(columns=['summary']), summary_df], axis=1)

# No forward fill is applied to 'modality': it would copy the value of an unrelated
# neighbouring record into rows where the modality is genuinely missing.

# Extract relevant columns from the CSV data
ftu_relevant_columns = ftu_cell_count[['Organ', 'FTU Label in Uberon', 'FTU ID in Uberon','CL_ID', 'CL_Label_FTU']]

# Merge the CSV and JSON DataFrames based on matching cell_id
merged_data = pd.merge(
    merged_json_df,
    ftu_relevant_columns,
    left_on='cell_id',
    right_on='CL_ID',
    how='inner'
)

# Create the final DataFrame with the desired columns and rename
# 'aggregated_summaries' to 'datasets'. .copy() makes it an independent frame,
# so adding a column below does not raise a SettingWithCopyWarning.
CTs_with_datasets_df = merged_data[[
    'cell_id',
    'cell_label',
    'annotation_method',
    'modality',
    'cell_source_label',
    'sex',
    'aggregated_summaries'
]].rename(columns={'aggregated_summaries': 'datasets'}).copy()

# Add the new column "#datasets"
CTs_with_datasets_df['#datasets'] = CTs_with_datasets_df['datasets'].apply(lambda x: len(x) if isinstance(x, list) else 0)

# Check for null values in the 'modality' column again
null_modality_entries_after_correction = CTs_with_datasets_df[CTs_with_datasets_df['modality'].isnull()]

# Display the first few rows of the final DataFrame
print(CTs_with_datasets_df.head())
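
A toy illustration of the explode-then-normalize pattern used in this cell may help. It shows that `explode` repeats the parent record's other fields on every new row. The field values and the `count` key are invented; only the overall shape (a record with a `summary` list) follows the notebook.

```python
import pandas as pd

toy_graph = [
    {'cell_source': 'dataset-A', 'modality': 'sc_transcriptomics',
     'summary': [{'cell_id': 'CL:TOY_1', 'count': 3},
                 {'cell_id': 'CL:TOY_2', 'count': 5}]},
    {'cell_source': 'dataset-B', 'modality': 'sc_proteomics',
     'summary': [{'cell_id': 'CL:TOY_3', 'count': 1}]},
]

records = pd.json_normalize(toy_graph)            # 'summary' stays a list column
exploded = records.explode('summary').reset_index(drop=True)
details = pd.json_normalize(exploded['summary'].tolist())
flat = pd.concat([exploded.drop(columns=['summary']), details], axis=1)
print(flat)
```

Each of the three summary entries ends up on its own row together with the `cell_source` and `modality` of the record it came from.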

In [None]:
# Save the final DataFrame to a CSV file
final_csv_path = 'output/CTs-with-datasets-info.csv'
CTs_with_datasets_df.to_csv(final_csv_path, index=False)
print(f"Final CSV saved to: {final_csv_path}")

In [None]:
CTs_with_datasets_df['modality'].unique()

#### Extract gene information for the identified CT-dataset combinations from atlas-enriched-dataset-graph

In [None]:
# Create an empty dataframe with the columns that extract_data() fills in below
# (the dataset identifier is stored under 'dataset')
columns = [
    'organ_id', 'organ_name', 'reference_organ', 'cell_id', 'cell_label', 'annotation_method',
    'cell_source_label', 'as_label', 'sex', 'dataset', 'gene_id',
    'gene_label', 'ensembl_id', 'mean_gene_expr_value'
]
result_df = pd.DataFrame(columns=columns)

# Function to extract and match data based on cell_id and annotation_method
def extract_data(csv_row, json_data):
    matches = []
    cell_id = csv_row['cell_id']
    annotation_method = csv_row['annotation_method']
    aggregated_summaries = csv_row['datasets']  # list of dataset identifiers (already a list in the in-memory DataFrame)
    organ_name = 'unknown'
    for entry in json_data["@graph"]:
        for sample in entry.get('samples', []):
            rui_location = sample.get('rui_location', {})
            all_collisions = rui_location.get('all_collisions', [])
            for collision in all_collisions:
                collisions = collision.get('collisions', [])
                for col in collisions:
                    reference_organ = col.get('reference_organ', 'unknown')
                    as_label = col.get('as_label', 'unknown')
                    for section in sample.get('sections', []):
                        for dataset in section.get('datasets', []):
                            summaries = dataset.get('summaries', [])
                            organ_id = dataset.get('organ_id', 'unknown')
                            for summary in summaries:
                                if summary.get('annotation_method') == annotation_method:
                                    for summ in summary['summary']:
                                        if isinstance(summ, dict) and summ['cell_id'] == cell_id:
                                            for aggregated_summary in aggregated_summaries:
                                                if aggregated_summary in dataset['@id']:
                                                    for gene_expr in summ.get('gene_expr', []):
                                                        if isinstance(gene_expr, dict):  # Ensure gene_expr is a dictionary
                                                            new_row = {
                                                                'organ_id': organ_id,
                                                                'organ_name': organ_name,
                                                                'reference_organ': reference_organ,                                                                
                                                                'cell_id': cell_id,
                                                                'cell_label': summ['cell_label'],
                                                                'annotation_method': annotation_method,
                                                                'cell_source_label': csv_row['cell_source_label'],
                                                                'as_label': as_label,
                                                                'sex': csv_row['sex'],
                                                                'dataset': aggregated_summary,
                                                                'gene_id': gene_expr.get('gene_id'),
                                                                'gene_label': gene_expr.get('gene_label'),
                                                                'ensembl_id': gene_expr.get('ensembl_id'),
                                                                'mean_gene_expr_value': gene_expr.get('mean_gene_expr_value'),
                                                                
                                                            }
                                                            matches.append(new_row)
    return matches

# Iterate over the CT/dataset rows and collect the matched data
all_matches = []
for index, row in CTs_with_datasets_df.iterrows():
    if row['cell_source_label']:  # Skip rows with an empty cell_source_label
        all_matches.extend(extract_data(row, json_data_enriched))

# Concatenate once rather than inside the loop, which avoids repeated copying
if all_matches:
    result_df = pd.concat([result_df, pd.DataFrame(all_matches)], ignore_index=True)

# Display the new dataframe
print(result_df.head())
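
The deep nesting in `extract_data` comes mostly from walking the same path (entry, samples, sections, datasets, summaries) by hand. One way to flatten it is a generator that yields each dataset together with each of its summaries. This is a sketch under the assumption that the keys are named as in the code above; `iter_dataset_summaries` is a hypothetical name and the toy graph is invented.

```python
def iter_dataset_summaries(graph):
    """Yield (dataset, summary) pairs from entry -> samples -> sections -> datasets."""
    for entry in graph:
        for sample in entry.get('samples', []):
            for section in sample.get('sections', []):
                for dataset in section.get('datasets', []):
                    for summary in dataset.get('summaries', []):
                        yield dataset, summary


toy_graph = [{
    'samples': [{
        'sections': [{
            'datasets': [{
                '@id': 'https://example.org/dataset-1',
                'summaries': [
                    {'annotation_method': 'azimuth', 'summary': []},
                    {'annotation_method': 'popv', 'summary': []},
                ],
            }],
        }],
    }],
}, {'samples': []}]

pairs = [(d['@id'], s['annotation_method']) for d, s in iter_dataset_summaries(toy_graph)]
print(pairs)
```

The filtering on `annotation_method`, `cell_id` and dataset ID can then sit in a flat loop over the generator instead of several more levels of indentation.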


In [None]:
# Get unique organ IDs
result_df["organ_id"].unique()

In [None]:
organ_id_to_name = {
    'http://purl.obolibrary.org/obo/UBERON_0002108': 'small intestine',
    'http://purl.obolibrary.org/obo/UBERON_0000059': 'large intestine',
    'http://purl.obolibrary.org/obo/UBERON_0002113': 'kidney',
    'http://purl.obolibrary.org/obo/UBERON_0000948': 'heart',
    'http://purl.obolibrary.org/obo/UBERON_0001255': 'urinary bladder',
    'http://purl.obolibrary.org/obo/UBERON_0002048': 'lung',
    'http://purl.obolibrary.org/obo/UBERON_0002107': 'liver'
}

In [None]:
# Adding a new column to the dataframe
result_df['organ_name'] = result_df['organ_id'].map(organ_id_to_name)

# Display the first few rows of the updated dataframe
result_df.head()

In [None]:
# Cell types reported for urinary bladder datasets
filtered_df_ub = result_df[result_df['organ_name'] == 'urinary bladder']
print(filtered_df_ub['cell_label'].unique())

# Datasets that contribute the urinary bladder rows
aggregated_summaries = filtered_df_ub['dataset']
print(aggregated_summaries.unique())

In [None]:
# Sanity check: find datasets in which 'hepatocyte' is reported for heart
filtered_df = result_df[(result_df['organ_name'] == 'heart') & (result_df['cell_label'] == 'hepatocyte')]
aggregated_summaries = filtered_df['dataset']

# Display the aggregated_summaries
print(aggregated_summaries.unique())

In [None]:
result_df.to_csv('output/filtered-CTs-with-datasets-with-gene-information-1.csv', index=False)

##### Sanity check of the organ_id, cell_source_label and cell_ids

In [None]:
result_df['cell_source_label'].unique()

In [None]:
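# `final_df` is not defined earlier in this notebook. It is assumed here to be the
# CT/dataset table built above (CTs_with_datasets_df), which has the columns used below.
final_df = CTs_with_datasets_df.copy()
final_df['cell_source_label'].unique()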

In [None]:
difference_labels = set(final_df['cell_source_label'].unique()) - set(result_df['cell_source_label'])
print("Difference in cell_source_label:")
print(difference_labels)

In [None]:
difference_ids = set(final_df['cell_id'].unique()) - set(result_df['cell_id'])
print("Difference in cell_ids:")
print(difference_ids)

In [None]:
# Find the unique cell_labels in the provided CSV file
ftu_cell_labels = set(ftu_cell_count['CL_Label_FTU'].unique())
ftu_cell_ids = set(ftu_cell_count['CL_ID'].unique())

In [None]:
cell_labels_set = set()
for label in difference_labels:
    cell_labels = final_df[final_df['cell_source_label'] == label]['cell_label']
    cell_labels_set.update(cell_labels) 
    
# Check for matches between cell_labels_set and ftu_cell_labels
matches_label = cell_labels_set.intersection(ftu_cell_labels)
matches_label

In [None]:
cell_ids_set = set()
for ids in difference_ids:
    cell_ids = final_df[final_df['cell_id'] == ids]['cell_id']
    cell_ids_set.update(cell_ids)
    
matches_id = cell_ids_set.intersection(ftu_cell_ids)
matches_id

In [None]:
# Rows of the FTU table whose CL_ID is among the matched IDs
match_df = ftu_cell_count[ftu_cell_count['CL_ID'].isin(matches_id)].reset_index(drop=True)

match_df

In [None]:
CTs_with_datasets_df.columns

### Organ and CTann level information as per CTann crosswalk files

In [None]:
# Get the unique organ levels for each file
azimuth_organs = azimuth['Organ_Level'].unique().tolist()
celltypist_organs = celltypist['Organ_Level'].unique().tolist()
popv_organs = popv['Organ_Level'].unique().tolist()

print("Azimuth Organs:", azimuth_organs)
print()
print("Celltypist Organs:", celltypist_organs)
print()
print("Popv Organs:", popv_organs)

In [None]:
# Rename 'cell_id' to 'CL_ID' so that it matches the crosswalk tables
CTs_with_datasets_df = CTs_with_datasets_df.rename(columns={'cell_id': 'CL_ID'})

# Merge with azimuth_df
final_output_azimuth = CTs_with_datasets_df[CTs_with_datasets_df['annotation_method'] == 'azimuth'].merge(
    azimuth[['CL_ID', 'Organ_Level']], on='CL_ID', how='left')

# Merge with celltypist_df
final_output_celltypist = CTs_with_datasets_df[CTs_with_datasets_df['annotation_method'] == 'celltypist'].merge(
    celltypist[['CL_ID', 'Organ_Level']], on='CL_ID', how='left')

# Merge with popv_df
final_output_popv = CTs_with_datasets_df[CTs_with_datasets_df['annotation_method'] == 'popv'].merge(
    popv[['CL_ID', 'Organ_Level']], on='CL_ID', how='left')

# Combine all the dataframes
final_output_combined = pd.concat([final_output_azimuth, final_output_celltypist, final_output_popv], ignore_index=True)

# Define the organ levels for each annotation method
azimuth_organs = ['Kidney_L3', 'Lung_v2_finest_level', 'Liver_L2', 'Liver_L1', 'Kidney_L1',  'Kidney_L2', 
                  'Lung_v2_L1', 'Lung_v2_L2', 'Lung_v2_L3', 'Lung_v2_L4', 'Lung_v2_L5', 'Pancreas_L1']
celltypist_organs = ['intestine_L1', 'kidney_L1', 'liver_L1', 'lung_L1', 'pancreas_L1', 'spleen_L1', 
                     'Adult_Human_Skin_pkl', 'Healthy_Human_Liver_pkl', 'Adult_Human_PancreaticIslet_pkl', 
                     'Human_Lung_Atlas_pkl']
popv_organs = ['large intestine', 'liver', 'lung', 'male reproductive system', 'pancreas', 'prostate gland', 
               'respiratory system', 'skin', 'small intestine', 'spleen', 'thymus']

# Filter the combined dataframe for each annotation method and their corresponding organ levels
filtered_azimuth = final_output_combined[(final_output_combined['annotation_method'] == 'azimuth') & 
                                         (final_output_combined['Organ_Level'].isin(azimuth_organs))]

filtered_celltypist = final_output_combined[(final_output_combined['annotation_method'] == 'celltypist') & 
                                            (final_output_combined['Organ_Level'].isin(celltypist_organs))]

filtered_popv = final_output_combined[(final_output_combined['annotation_method'] == 'popv') & 
                                      (final_output_combined['Organ_Level'].isin(popv_organs))]

# Combine the filtered dataframes
final_filtered_combined = pd.concat([filtered_azimuth, filtered_celltypist, filtered_popv], ignore_index=True)

final_filtered_combined.head()
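
The three near-identical filter blocks can be driven by a single dictionary from annotation method to allowed organ levels. The sketch below uses toy data; `allowed_levels` and `combined` are hypothetical names, and the level lists are shortened for illustration.

```python
import pandas as pd

combined = pd.DataFrame({
    'annotation_method': ['azimuth', 'azimuth', 'celltypist', 'popv'],
    'Organ_Level': ['Kidney_L3', 'lung', 'kidney_L1', 'lung'],
})

# One entry per annotation method; shortened lists for illustration only.
allowed_levels = {
    'azimuth': ['Kidney_L3'],
    'celltypist': ['kidney_L1'],
    'popv': ['lung'],
}

# Build one boolean mask: a row is kept if its level is allowed for its own method.
keep = pd.Series(False, index=combined.index)
for method, levels in allowed_levels.items():
    keep |= (combined['annotation_method'] == method) & combined['Organ_Level'].isin(levels)

filtered = combined[keep].reset_index(drop=True)
print(filtered)
```

Unlike concatenating three filtered frames, this keeps the rows in their original order rather than grouping them by method.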

In [None]:
# Save the combined dataframe to a CSV file
output_path = 'output/filtered-CTs-with-datasets-with-organ.csv'
final_filtered_combined.to_csv(output_path, index=False)

### Extract gene information for cell types in vasculature

In [None]:
# Provided CT Labels and CT IDs
ct_labels_ids = {
    "glomerular capillary endothelial cell": "CL:1001005",
    "efferent arteriole endothelial cell": "CL:1001099",
    "afferent arteriole endothelial cell": "CL:1001096",
    "peritubular capillary endothelial cell": "CL:1001033",
    "vasa recta ascending limb cell": "CL:1001131",
    "vasa recta descending limb cell": "CL:1001285",
    "alveolar capillary type 1 endothelial cell": "CL:4028002",
    "capillary endothelial cell": "CL:0002144",
    "blood vessel smooth muscle cell": "CL:0019018",
    "endothelial cell of artery": "CL:1000413",
    "vein endothelial cell": "CL:0002543",
    "endothelial cell of hepatic sinusoid": "CL:1000398",
    "prostate gland microvascular endothelial cell": "CL:2000059",
    "blood vessel endothelial cell": "CL:0000071",
    "splenic endothelial cell": "CL:2000053"
}


In [None]:
# Identifying the structure within the "@graph" key to process accordingly
graph_data = json_data_enriched["@graph"]

In [None]:
# Grouping ensembl_ids by cell_id and adding cell_label
grouped_matches = {}
for item in graph_data:
    samples = item.get("samples", [])
    for sample in samples:
        sections = sample.get("sections", [])
        for section in sections:
            datasets = section.get("datasets", [])
            for dataset in datasets:
                summaries = dataset.get("summaries", [])
                for summary in summaries:
                    sum_details = summary.get("summary", [])
                    for sum_detail in sum_details:
                        cell_id = sum_detail.get("cell_id")
                        gene_expr_list = sum_detail.get("gene_expr", [])
                        if cell_id in ct_labels_ids.values():
                            if isinstance(gene_expr_list, list):
                                for gene_expr in gene_expr_list:
                                    if isinstance(gene_expr, dict):
                                        ensembl_id = gene_expr.get("ensembl_id")
                                        if cell_id not in grouped_matches:
                                            grouped_matches[cell_id] = {"cell_label": "", "ensembl_ids": []}
                                        grouped_matches[cell_id]["cell_label"] = [label for label, id in ct_labels_ids.items() if id == cell_id][0]
                                        grouped_matches[cell_id]["ensembl_ids"].append(ensembl_id)

# Preparing the final dataframe
final_matches = []
for cell_id, details in grouped_matches.items():
    final_matches.append({
        "cell_id": cell_id,
        "cell_label": details["cell_label"],
        "ensembl_ids": ", ".join(details["ensembl_ids"])
    })

final_matches_df = pd.DataFrame(final_matches)
final_matches_df.head()
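
The grouping loop above looks up the label for a `cell_id` by scanning `ct_labels_ids` on every gene entry. Inverting the dictionary once makes that a direct lookup, and a `defaultdict` removes the "create the key if missing" branch. This is a sketch on invented observations; `id_to_label`, `observations` and the `ENSG_TOY_*` values are hypothetical.

```python
from collections import defaultdict

# Two entries taken from ct_labels_ids above, enough for the illustration.
ct_labels_ids_toy = {
    'capillary endothelial cell': 'CL:0002144',
    'vein endothelial cell': 'CL:0002543',
}
# Invert once: CL ID -> label.
id_to_label = {cl_id: label for label, cl_id in ct_labels_ids_toy.items()}

# Hypothetical (cell_id, ensembl_id) observations standing in for the nested JSON.
observations = [
    ('CL:0002144', 'ENSG_TOY_1'),
    ('CL:0002543', 'ENSG_TOY_2'),
    ('CL:0002144', 'ENSG_TOY_3'),
    ('CL:9999999', 'ENSG_TOY_4'),  # not a vasculature cell type, so it is skipped
]

ensembl_by_cell = defaultdict(list)
for cell_id, ensembl_id in observations:
    if cell_id in id_to_label and ensembl_id is not None:
        ensembl_by_cell[cell_id].append(ensembl_id)

for cell_id, ensembl_ids in ensembl_by_cell.items():
    print(cell_id, id_to_label[cell_id], ', '.join(ensembl_ids))
```

Skipping `None` values also guards the later `', '.join(...)` call, which would raise a `TypeError` if an entry had no `ensembl_id`.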

In [None]:
# Save the dataframe to a CSV file
final_matches_df.to_csv('Biomarker_for_Vasculature_CTs.csv')

In [None]:
final_matches_df['cell_id'].unique()

In [None]:
# Grouping ensembl_ids by cell_id and organ_id, adding cell_label and mean_gene_expr_value
grouped_matches = {}
for item in graph_data:
    samples = item.get("samples", [])
    for sample in samples:
        sections = sample.get("sections", [])
        for section in sections:
            datasets = section.get("datasets", [])
            for dataset in datasets:
                summaries = dataset.get("summaries", [])               
                for summary in summaries:
                    annotation_method = summary.get("annotation_method", "Unknown")
                    sum_details = summary.get("summary", [])
                    for sum_detail in sum_details:
                        cell_id = sum_detail.get("cell_id")
                        gene_expr_list = sum_detail.get("gene_expr", [])
                        organ_id = dataset.get("organ_id", "Unknown")
                        if cell_id in ct_labels_ids.values():
                            if isinstance(gene_expr_list, list) and gene_expr_list:
                                for gene_expr in gene_expr_list:
                                    if isinstance(gene_expr, dict):
                                        ensembl_id = gene_expr.get("ensembl_id")
                                        mean_expr_value = gene_expr.get("mean_gene_expr_value")
                                        key = (cell_id, organ_id, annotation_method)
                                        if key not in grouped_matches:
                                            grouped_matches[key] = {"cell_label": "", "ensembl_ids": [], "mean_expr_values": []}
                                        grouped_matches[key]["cell_label"] = [label for label, id in ct_labels_ids.items() if id == cell_id][0]
                                        grouped_matches[key]["ensembl_ids"].append(ensembl_id)
                                        grouped_matches[key]["mean_expr_values"].append(mean_expr_value)
                            else:
                                # Handle entries with empty gene_expr lists
                                key = (cell_id, organ_id, annotation_method)
                                if key not in grouped_matches:
                                    grouped_matches[key] = {"cell_label": "", "ensembl_ids": [], "mean_expr_values": []}
                                grouped_matches[key]["cell_label"] = [label for label, id in ct_labels_ids.items() if id == cell_id][0]

# Preparing the final dataframe
final_matches = []
for (cell_id, organ_id, annotation_method), details in grouped_matches.items():
    final_matches.append({
        "cell_id": cell_id,
        "cell_label": details["cell_label"],
        "organ_id": organ_id,
        "annotation_method": annotation_method,
        "ensembl_ids": ", ".join(details["ensembl_ids"]),
        "mean_expr_values": ", ".join(map(str, details["mean_expr_values"]))
    })

final_matches_df = pd.DataFrame(final_matches)

# Expanding the dataframe to have each row for each ensembl_id
expanded_matches = []
for _, row in final_matches_df.iterrows():
    cell_id = row["cell_id"]
    cell_label = row["cell_label"]
    organ_id = row["organ_id"]
    annotation_method = row["annotation_method"]
    ensembl_ids = row["ensembl_ids"].split(", ")
    mean_expr_values = row["mean_expr_values"].split(", ")
    
    for ensembl_id, mean_expr_value in zip(ensembl_ids, mean_expr_values):
        try:
            mean_expr_value_float = float(mean_expr_value)
        except ValueError:
            mean_expr_value_float = None  # Handle non-numeric values
        expanded_matches.append({
            "cell_id": cell_id,
            "cell_label": cell_label,
            "organ_id": organ_id,
            "annotation_method": annotation_method,
            "ensembl_id": ensembl_id,
            "mean_expr_value": mean_expr_value_float
        })

expanded_matches_df = pd.DataFrame(expanded_matches)

# Displaying the expanded dataframe
expanded_matches_df.head()  # Displaying only the first few rows
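
The cell above joins the lists into comma-separated strings and then splits them again to get one row per gene. If the lists are kept as lists, `DataFrame.explode` can do the expansion directly. This is a sketch on toy data and not the notebook's method; exploding several columns at once assumes pandas 1.3 or newer and equal-length lists in each row, and the `ENSG_TOY_*` values are invented.

```python
import pandas as pd

grouped = pd.DataFrame({
    'cell_id': ['CL:0002144', 'CL:0002543'],
    'ensembl_ids': [['ENSG_TOY_1', 'ENSG_TOY_2'], ['ENSG_TOY_3']],
    'mean_expr_values': [[0.5, 'n/a'], [1.25]],
})

# One row per (ensembl_id, mean_expr_value) pair; the lists are exploded in lockstep.
expanded = grouped.explode(['ensembl_ids', 'mean_expr_values']).reset_index(drop=True)
expanded = expanded.rename(columns={'ensembl_ids': 'ensembl_id',
                                    'mean_expr_values': 'mean_expr_value'})

# Non-numeric values become NaN, mirroring the try/except float() conversion.
expanded['mean_expr_value'] = pd.to_numeric(expanded['mean_expr_value'], errors='coerce')
print(expanded)
```

This avoids the string round trip, which would also break if any identifier itself contained the `', '` separator.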

In [None]:
# Extract unique organ IDs from the expanded dataframe
unique_organ_ids = expanded_matches_df["organ_id"].unique()

unique_organ_ids_list = unique_organ_ids.tolist()
unique_organ_ids_list

In [None]:
organ_id_to_organ_name = {
    'http://purl.obolibrary.org/obo/UBERON_0002113': 'Kidney',
    'http://purl.obolibrary.org/obo/UBERON_0000948': 'heart',
    'http://purl.obolibrary.org/obo/UBERON_0001255': 'urinary bladder',
    'http://purl.obolibrary.org/obo/UBERON_0002048': 'lung',
    'http://purl.obolibrary.org/obo/UBERON_0002107': 'liver'
}

In [None]:
# Adding a new column to the dataframe
expanded_matches_df['organ_name'] = expanded_matches_df['organ_id'].map(organ_id_to_organ_name)

# Display the first few rows of the updated dataframe
expanded_matches_df.head() 

In [None]:
expanded_matches_df['organ_name'].unique()

In [None]:
# Select and clean up relevant columns from FTU dataframe
ftu_cleaned_df = ftu_cell_count[['Organ', 'CL_Label_FTU', 'FTU Label in Uberon']].dropna().drop_duplicates()

# Ensure unique mapping by dropping duplicates based on 'CL_Label_FTU'
ftu_unique_df = ftu_cleaned_df.drop_duplicates(subset=['CL_Label_FTU'])

# Perform the match and add the new column for FTU name
expanded_matches_df['FTU_name'] = expanded_matches_df['cell_label'].map(
    ftu_unique_df.set_index('CL_Label_FTU')['FTU Label in Uberon']
)

# Adding the organ name from FTU data
expanded_matches_df['Organ_name_FTU'] = expanded_matches_df['cell_label'].map(
    ftu_unique_df.set_index('CL_Label_FTU')['Organ']
)

# Rearrange columns as per the requirement
expanded_matches_df = expanded_matches_df[[
    'organ_name', 'Organ_name_FTU','organ_id', 'annotation_method', 'FTU_name', 'cell_id', 'cell_label', 'ensembl_id', 'mean_expr_value'
]]

# Rename the FTU-derived organ column for consistency
expanded_matches_df = expanded_matches_df.rename(columns={
    'Organ_name_FTU': 'organ_name_in_FTU'
})

# Display the first few rows of the updated dataframe
expanded_matches_df.head()

In [None]:
# Drop rows where organ_name is 'heart' or 'urinary bladder'
matches_df_filtered = expanded_matches_df[~expanded_matches_df['organ_name'].isin(['heart', 'urinary bladder'])]

In [None]:
matches_df_filtered.to_csv('output/vasculature/gene_information-for-vasculature-CTs_1.csv', index=False)

In [None]:
matches_df_filtered['FTU_name'].unique()

#### Generate box plot images for each unique cell label

In [None]:
# Create individual box plots for each cell_label
unique_cell_labels = matches_df_filtered['cell_label'].unique()

for cell_label in unique_cell_labels:
    cell_label_data = matches_df_filtered[matches_df_filtered['cell_label'] == cell_label]
    
    if len(cell_label_data) > 1:  # Ensure there is enough data to plot
        plt.figure(figsize=(8, 6))
        sns.boxplot(data=cell_label_data, y='mean_expr_value', color='skyblue')
        
        # Customize the box plot
        plt.xticks([])
        plt.title(f'Box Plot of Mean Expression Values for {cell_label}')
        plt.xlabel('Cell Label')
        plt.ylabel('Mean Expression Value')
        
        # Adding additional visual elements
        plt.axhline(y=cell_label_data['mean_expr_value'].median(), color='red', linestyle='-', label='Median')
        plt.legend()

        # Save each plot as an image file
        individual_box_plot_file_path = f'output/vasculature/figures/box_plot_mean_expr_value_{cell_label.replace(" ", "_")}.png'
        plt.tight_layout()
        plt.savefig(individual_box_plot_file_path)
        plt.close()
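
The file names above only replace spaces, so a cell label containing a character such as `/` or `+` could produce an invalid or unexpected path. A minimal sketch of a stricter sanitiser follows. The helper name `safe_filename` and the example labels are hypothetical and not part of the notebook.

```python
import re

def safe_filename(label: str) -> str:
    """Replace every run of characters outside [A-Za-z0-9._-] with one underscore."""
    return re.sub(r"[^A-Za-z0-9._-]+", "_", label).strip("_")

# Example labels (illustrative only)
plain_name = safe_filename("endothelial cell of artery")
tricky_name = safe_filename("CD4+ T/NK cell")

print(plain_name)
print(tricky_name)
```

The result could be used in place of `cell_label.replace(" ", "_")` when building the output path.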

#### Generate box plot images for each unique cell label per annotation method

In [None]:
unique_annotation_methods = matches_df_filtered['annotation_method'].explode().unique()
unique_annotation_methods

In [None]:
unique_annotation_methods = matches_df_filtered['annotation_method'].explode().unique()

for cell_label in unique_cell_labels:
    for annotation_method in unique_annotation_methods:
        cell_label_data = matches_df_filtered[
            (matches_df_filtered['cell_label'] == cell_label) &
            (matches_df_filtered['annotation_method'] == annotation_method)
        ]
        
        if len(cell_label_data) > 1:  # Ensure there is enough data to plot
            plt.figure(figsize=(8, 6))
            sns.boxplot(data=cell_label_data, y='mean_expr_value', color='skyblue')
            
            # Customize the box plot
            plt.xticks([])
            plt.title(f'Box Plot of Mean Expression Values for {cell_label} ({annotation_method})')
            plt.xlabel('Cell Label')
            plt.ylabel('Mean Expression Value')
            
            # Adding additional visual elements
            plt.axhline(y=cell_label_data['mean_expr_value'].median(), color='red', linestyle='-', label='Median')
            plt.legend()

            # Save each plot as an image file
            individual_box_plot_file_path = f'output/vasculature/figures/box_plot_mean_expr_value_{cell_label.replace(" ", "_")}_{annotation_method.replace(" ", "_")}.png'
            plt.tight_layout()
            plt.savefig(individual_box_plot_file_path)
            plt.close()

# List of generated plot file paths
plot_file_paths = [f'output/vasculature/figures/box_plot_mean_expr_value_{cell_label.replace(" ", "_")}_{annotation_method.replace(" ", "_")}.png'
                   for cell_label in unique_cell_labels
                   for annotation_method in unique_annotation_methods
                   if len(matches_df_filtered[(matches_df_filtered['cell_label'] == cell_label) & (matches_df_filtered['annotation_method'] == annotation_method)]) > 1]

plot_file_paths

In [None]:
# Create a box plot
plt.figure(figsize=(12, 8))
sns.boxplot(x='FTU_name', y='mean_expr_value', hue='annotation_method', data=matches_df_filtered)

# Customize the plot
plt.title('Mean Expression Value per Cell Type per FTU Name per Annotation Method')
plt.xlabel('FTU Name')
plt.ylabel('Mean Expression Value')
plt.xticks(rotation=45)
plt.legend(title='Annotation Method', bbox_to_anchor=(1.05, 1), loc='upper left')

# Save the plot to a file
output_file_path = 'output/vasculature/mean-expression-box-plot-per-CT-per-FTU-per-AnnMethod.png'
plt.tight_layout()
plt.savefig(output_file_path)
plt.show()

In [None]:
# Create a box plot
plt.figure(figsize=(12, 8))
sns.boxplot(x='FTU_name', y='mean_expr_value', hue='cell_label', data=matches_df_filtered)

# Customize the plot
plt.title('Mean Expression Value per Cell Type per FTU Name')
plt.xlabel('FTU Name')
plt.ylabel('Mean Expression Value')
plt.xticks(rotation=45)
plt.legend(title='Cell Type', bbox_to_anchor=(1.05, 1), loc='upper left')

# Save the plot to a file
output_file_path = 'output/vasculature/mean_expression_box_plot_per_cell_type.png'
plt.tight_layout()
plt.savefig(output_file_path)
plt.show()

### Understand the JSON files

In [None]:
file_path = 'ref_data/atlas-enriched-dataset-graph.jsonld'

# Load the enriched dataset graph under the name used by the cells below
with open(file_path, 'r') as file:
    json_data_enriched = json.load(file)

In [None]:
# Function to extract as_label, link, cell_label, and cell_id from the JSON data
def extract_all_entries(json_data):
    result = []
    for entry in json_data['@graph']:
        for sample in entry.get('samples', []):
            rui_location = sample.get('rui_location', {})
            all_collisions = rui_location.get('all_collisions', [])
            for collision in all_collisions:
                collisions = collision.get('collisions', [])
                for col in collisions:
                    as_label = col.get('as_label', 'unknown')
                    for section in sample.get('sections', []):
                        for dataset in section.get('datasets', []):
                            link = dataset.get('link', 'unknown')
                            for summary in dataset.get('summaries', []):
                                for summary_item in summary.get('summary', []):
                                    cell_id = summary_item.get('cell_id', 'unknown')
                                    cell_label = summary_item.get('cell_label', 'unknown')
                                    new_row = {
                                        'as_label': as_label,
                                        'link': link,
                                        'cell_label': cell_label,
                                        'cell_id': cell_id
                                    }
                                    result.append(new_row)
    return result

# Extract all entries
extracted_data_enriched = extract_all_entries(json_data_enriched)

# Convert extracted data to a DataFrame
extracted_df_enriched = pd.DataFrame(extracted_data_enriched)

# Display the DataFrame's head
print(extracted_df_enriched.head())
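
To make the nesting that `extract_all_entries` walks through easier to see, the sketch below runs a condensed restatement of the same traversal on a toy document. The field names come from the function above. The toy values (labels, link, cell IDs) and the helper name `iter_rows` are assumptions for illustration only. Note that each sample yields one row per combination of anatomical structure label and cell summary item, so the row count multiplies.

```python
import pandas as pd

# Toy document mirroring the nesting: @graph > samples > rui_location / sections
toy_graph = {
    "@graph": [{
        "samples": [{
            "rui_location": {
                "all_collisions": [{
                    "collisions": [
                        {"as_label": "kidney"},
                        {"as_label": "renal cortex"},
                    ]
                }]
            },
            "sections": [{
                "datasets": [{
                    "link": "https://example.org/dataset-1",
                    "summaries": [{
                        "summary": [
                            {"cell_id": "CL:0000115", "cell_label": "endothelial cell"},
                            {"cell_id": "CL:0000057", "cell_label": "fibroblast"},
                        ]
                    }]
                }]
            }]
        }]
    }]
}

def iter_rows(json_data):
    """Yield one row per (as_label, cell summary item) pair within each sample."""
    for entry in json_data.get("@graph", []):
        for sample in entry.get("samples", []):
            as_labels = [
                col.get("as_label", "unknown")
                for collision in sample.get("rui_location", {}).get("all_collisions", [])
                for col in collision.get("collisions", [])
            ]
            cells = [
                (dataset.get("link", "unknown"),
                 item.get("cell_id", "unknown"),
                 item.get("cell_label", "unknown"))
                for section in sample.get("sections", [])
                for dataset in section.get("datasets", [])
                for summary in dataset.get("summaries", [])
                for item in summary.get("summary", [])
            ]
            for as_label in as_labels:
                for link, cell_id, cell_label in cells:
                    yield {"as_label": as_label, "link": link,
                           "cell_label": cell_label, "cell_id": cell_id}

rows = list(iter_rows(toy_graph))
toy_rows_df = pd.DataFrame(rows)
print(toy_rows_df)
```

With two anatomical structure labels and two summary items, the toy sample produces four rows.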

In [None]:
extracted_df_enriched.to_csv('output/Dataset-source-cell-enriched.csv', index=False)

In [None]:
# Extract cell IDs from the summary section
def extract_cell_ids(data):
    cell_ids = []
    for item in data['@graph']:
        samples = item.get("samples", [])
        for sample in samples:
            sections = sample.get("sections", [])
            for section in sections:
                datasets = section.get("datasets", [])
                for dataset in datasets:
                    summaries = dataset.get("summaries", [])               
                    for summary in summaries:
                        sum_details = summary.get("summary", [])
                        for sum_detail in sum_details:
                            cell_id = sum_detail.get("cell_id")
                            cell_ids.append(cell_id)
    return cell_ids

cell_ids = extract_cell_ids(json_data_enriched)
print(f"Cell IDs from the file: {len(cell_ids)}")

In [None]:
# Extract cell IDs from the top-level 'summary' list of each graph item
def extract_cell_ids_summaries(data):
    cell_ids = []
    for item in data['@graph']:
        summaries = item.get('summary', [])
        for summary in summaries:
            # Each summary entry is a dict; keep it only if it carries a cell_id
            if 'cell_id' in summary:
                cell_ids.append(summary['cell_id'])
    return cell_ids

cell_ids_summaries = extract_cell_ids_summaries(json_data_summary)
print(f"Cell IDs from the file: {len(cell_ids_summaries)}")
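
The two counts above include repeated IDs, so comparing the sets of unique IDs can be more informative than comparing lengths. The sketch below shows that comparison on two toy lists. The ID values are illustrative placeholders, not results from the actual files.

```python
# Toy ID lists standing in for cell_ids and cell_ids_summaries
enriched_ids = ["CL:0000115", "CL:0000057", "CL:0000115"]
summary_ids = ["CL:0000115", "CL:0000084"]

enriched_set, summary_set = set(enriched_ids), set(summary_ids)

shared = sorted(enriched_set & summary_set)         # present in both lists
only_enriched = sorted(enriched_set - summary_set)  # only in the first list
only_summary = sorted(summary_set - enriched_set)   # only in the second list

print(f"Shared: {shared}")
print(f"Only in enriched: {only_enriched}")
print(f"Only in summary: {only_summary}")
```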

In [None]:
# Extract aggregated_summaries, cell_source_label, cell_label, and cell_id from the JSON data
graph_data = json_data_summary.get('@graph', [])

extracted_data_summary = []

for item in graph_data:
    if 'summary' in item:
        # Assumption: both fields are item-level keys, as the variable names suggest
        cell_source_label = item.get('cell_source_label')
        aggregated_summaries = item.get('aggregated_summaries', [])
        for summary in item['summary']:
            if 'cell_id' in summary and 'cell_label' in summary:
                # One row per (aggregated summary, cell type) pair
                for aggregated_summary in aggregated_summaries:
                    extracted_entry = {
                        'cell_source_label': cell_source_label,
                        'aggregated_summary': aggregated_summary,
                        'cell_label': summary.get('cell_label'),
                        'cell_id': summary.get('cell_id')
                    }
                    extracted_data_summary.append(extracted_entry)

# Convert extracted data to a DataFrame
extracted_df_summary = pd.DataFrame(extracted_data_summary)

# Display the DataFrame's head
print(extracted_df_summary.head())


In [None]:
extracted_df_summary.to_csv('output/Dataset-source-cell-summary.csv', index=False)