### Workflow

Figma diagram: https://www.figma.com/file/PFKdJO3HTIcM9g4nmVONCT/HRA-Data-Explorer?type=whiteboard&node-id=1%3A108&t=XGoN91s7IKbyZ7xk-1

The diagram shows the following steps:

1. Dataset Collections:
The process begins with RUI-registered tissue datasets, that is, tissue datasets whose spatial location has been registered with the Registration User Interface (RUI).
These datasets are linked to specific organs and anatomical structures.

2. Dataset Processing:
The datasets are processed to produce anatomical structure and cell type information for 2D FTUs (functional tissue units).
An optional step sorts these structures, which leads to the extraction of a cell summary, including cell type biomarker gene expression data.
    
    1. Compare each FTU's cell type information between Azimuth, PopV and CellTypist.
    2. Get the list of datasets for each FTU.
        1. Take the cell summary from `atlas-as-cell-summaries.jsonld`.
        2. Add organ data from the cell annotation crosswalk files.
    3. Identify any dataset that is shared between two FTUs.

3. Cell Type Mapping:
The list of 2D FTUs is used to map cell types, which raises a set of questions on how best to extract data from the datasets used in steps 1 and 2.

4. Additional work
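
Step 2.3 above (finding datasets shared between two FTUs) could be sketched with a `groupby`. The table is toy data, and the column names `ftu_id` and `dataset_id` are assumptions made for illustration, not columns taken from the HRA files.

```python
import pandas as pd

# Toy FTU-to-dataset table; IDs and column names are hypothetical.
ftu_datasets = pd.DataFrame({
    "ftu_id": ["FTU_A", "FTU_A", "FTU_B", "FTU_B", "FTU_C"],
    "dataset_id": ["ds1", "ds2", "ds2", "ds3", "ds3"],
})

# Count distinct FTUs per dataset and keep datasets seen in two or more FTUs.
ftus_per_dataset = ftu_datasets.groupby("dataset_id")["ftu_id"].nunique()
shared_datasets = sorted(ftus_per_dataset[ftus_per_dataset >= 2].index)

print(shared_datasets)
```

Using `nunique` rather than a plain row count avoids flagging a dataset that is merely listed twice under the same FTU.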

### Get the Cell annotation tool information for cell types in FTUs

In [4]:
import pandas as pd
import requests

# Function to download CSV files from GitHub
def download_csv_from_github(url, local_path):
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # Raise an error if the request was unsuccessful
    with open(local_path, 'wb') as file:
        file.write(response.content)

# URLs of the CSV files in the GitHub repository.
# Note: a github.com '/blob/' URL returns an HTML page rather than the CSV itself,
# so the download function needs the raw file URL (raw.githubusercontent.com) instead.
# github_base_url = 'https://github.com/hubmapconsortium/hra-workflows-runner/blob/main/crosswalking-tables/'
# popv_url = github_base_url + 'popv.csv'
# celltypist_url = github_base_url + 'celltypist.csv'
# azimuth_url = github_base_url + 'azimuth.csv'
# ftu_cell_count_url = github_base_url + 'FTU_Cell_Count_Table_Cell_Type_Count.csv'

# Local paths to save the downloaded files
popv_local_path = 'C:\\Users\\Supriya\\Downloads\\popv.csv'
celltypist_local_path = 'C:\\Users\\Supriya\\Downloads\\celltypist.csv'
azimuth_local_path = 'C:\\Users\\Supriya\\Downloads\\azimuth.csv'
ftu_cell_count_path = 'C:\\Users\\Supriya\\Downloads\\FTU Cell Count Table - Cell_Type_Count.csv'


# Download the files
#download_csv_from_github(popv_url, popv_local_path)
#download_csv_from_github(celltypist_url, celltypist_local_path)
#download_csv_from_github(azimuth_url, azimuth_local_path)

# Load the downloaded files
popv = pd.read_csv(popv_local_path)
celltypist = pd.read_csv(celltypist_local_path)
azimuth = pd.read_csv(azimuth_local_path)
ftu_cell_count = pd.read_csv(ftu_cell_count_path)

# Define the mapping dictionary (organ name -> UBERON ID)
organ_to_uberon = {
    'Kidney': 'UBERON:0002113',
    'Lung': 'UBERON:0002048',
    'Pancreas': 'UBERON:0001264',
    'Large Intestine': 'UBERON:0000059',
    'Skin': 'UBERON:0002097',
    'Liver': 'UBERON:0002107',
    'Prostate': 'UBERON:0002367',
    'Thymus': 'UBERON:0002370',
    'Spleen': 'UBERON:0002106',
    'Small Intestine': 'UBERON:0002108'
}

# Filter FTU cell count to include only rows where `CT ID in CL` contains `CL`.
# .copy() gives an independent DataFrame, so the assignment below does not
# trigger a SettingWithCopyWarning.
ftu_cell_count_filtered = ftu_cell_count[ftu_cell_count['CT ID in CL'].str.contains('CL', na=False)].copy()

# Map the organ names in FTU cell count table to UBERON IDs
ftu_cell_count_filtered['Organ_ID'] = ftu_cell_count_filtered['Organ'].map(organ_to_uberon)

# Prepare a common structure to hold the results
result = []

# Function to dynamically generate match info
def generate_match_info(tool_name, tool_data):
    levels = set()
    for organ_level in tool_data['Organ_Level']:
        level = organ_level.split('_')[-1]
        levels.add(level)
    
    match_info = {f'{tool_name}_{level}': 0 for level in levels}
    return match_info, levels

# Dynamically generate the match info for each tool
celltypist_info, celltypist_levels = generate_match_info('celltypist', celltypist)
azimuth_info, azimuth_levels = generate_match_info('azimuth', azimuth)
popv_info, popv_levels = generate_match_info('popv', popv)

# Iterate over the filtered FTU cell count data
for idx, ftu_row in ftu_cell_count_filtered.iterrows():
    organ = ftu_row['Organ_ID']
    ftu_label = ftu_row['FTU Label in Uberon']
    ftu_id = ftu_row['FTU ID in Uberon']
    cl_id = ftu_row['CT ID in CL']
    cl_label = ftu_row['CT Label in CL']
    
    # Initialize match columns
    match_info = {
        'Organ': ftu_row['Organ'],
        'FTU Label in Uberon': ftu_label,
        'FTU ID in Uberon': ftu_id,
        'CL_id': cl_id,
        'CT Label in CL': cl_label,
    }
    match_info.update({key: 0 for key in celltypist_info.keys()})
    match_info.update({key: 0 for key in azimuth_info.keys()})
    match_info.update({key: 0 for key in popv_info.keys()})
    
    # Check in celltypist
    celltypist_matches = celltypist[(celltypist['CL_ID'] == cl_id) & (celltypist['Organ_ID'] == organ)]
    for _, ct_match in celltypist_matches.iterrows():
        level = ct_match['Organ_Level'].split('_')[-1]  # Extract the level number
        match_info[f'celltypist_{level}'] = 1

    # Check in azimuth
    azimuth_matches = azimuth[(azimuth['CL_ID'] == cl_id) & (azimuth['Organ_ID'] == organ)]
    for _, az_match in azimuth_matches.iterrows():
        level = az_match['Organ_Level'].split('_')[-1]  # Extract the level number
        match_info[f'azimuth_{level}'] = 1

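    # Check in popv (flag only the matched Organ_Level, as for the other tools)
    popv_matches = popv[(popv['CL_ID'] == cl_id) & (popv['Organ_ID'] == organ)]
    for _, pv_match in popv_matches.iterrows():
        level = pv_match['Organ_Level'].split('_')[-1]  # Same key derivation as generate_match_info
        match_info[f'popv_{level}'] = 1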

    # Append to results
    result.append(match_info)

# Create a dataframe from the results
result_df = pd.DataFrame(result)

# Save the result to a CSV file
output_path = 'FTU_Cell_Count_Annotated.csv'
result_df.to_csv(output_path, index=False)

print(f"File saved to {output_path}")


File saved to FTU_Cell_Count_Annotated.csv
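
The row-by-row matching in the cell above can also be expressed as a merge followed by a cross-tabulation. The following is a minimal sketch under stated assumptions: the two DataFrames are toy stand-ins with made-up CL IDs, and `match_flags` is a hypothetical helper, not part of the original notebook.

```python
import pandas as pd

# Toy stand-ins for the FTU table and one crosswalk table (values are illustrative).
ftu = pd.DataFrame({
    "Organ_ID": ["UBERON:0002113", "UBERON:0002113", "UBERON:0002048"],
    "CT ID in CL": ["CL:0000001", "CL:0000002", "CL:0000003"],
})
crosswalk = pd.DataFrame({
    "Organ_ID": ["UBERON:0002113", "UBERON:0002113", "UBERON:0002048"],
    "CL_ID": ["CL:0000001", "CL:0000001", "CL:0000003"],
    "Organ_Level": ["Kidney_L1", "Kidney_L2", "Lung_v2_L1"],
})

def match_flags(ftu_df, tool_df, tool_name):
    """Return one 0/1 column per level, aligned to the rows of ftu_df."""
    tool = tool_df.assign(level=tool_df["Organ_Level"].str.split("_").str[-1])
    tool = tool.rename(columns={"CL_ID": "CT ID in CL"})
    tool = tool[["Organ_ID", "CT ID in CL", "level"]].drop_duplicates()
    # Left merge keeps every FTU row; unmatched rows get a missing level.
    merged = ftu_df.reset_index().merge(tool, on=["Organ_ID", "CT ID in CL"], how="left")
    # crosstab ignores rows with a missing level, so reindex restores them as zeros.
    flags = pd.crosstab(merged["index"], merged["level"]).clip(upper=1)
    flags = flags.reindex(ftu_df.index, fill_value=0)
    return flags.add_prefix(f"{tool_name}_")

flags = match_flags(ftu, crosswalk, "azimuth")
print(flags)
```

Calling the helper once per tool and joining the results with `pd.concat(..., axis=1)` would give the same wide layout as the loop, without `iterrows`.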




In [6]:
import pandas as pd
import json

# Load all the files into dataframes
atlas_as_cell_summaries_path = 'ref_data/atlas-as-cell-summaries.jsonld'

with open(atlas_as_cell_summaries_path, 'r') as file:
    json_data = json.load(file)

# Function to extract structure and keys of the JSON
def extract_structure(data, level=0):
    structure = {}
    if isinstance(data, dict):
        for key, value in data.items():
            structure[key] = extract_structure(value, level + 1)
    elif isinstance(data, list) and len(data) > 0:
        structure = [extract_structure(data[0], level + 1)]
    else:
        structure = None
    return structure

# Extract structure
json_structure_cell_summaries = extract_structure(json_data)

In [7]:
json_structure_cell_summaries

{'@context': {'CL': {'@id': None, '@prefix': None},
  'ASCTB-TEMP': {'@id': None, '@prefix': None},
  'ctpop': {'@id': None, '@prefix': None},
  'as_3d_id': {'@type': None},
  'as_id': {'@type': None},
  'all_collisions': {'@id': None},
  'collision_source': {'@reverse': None, '@type': None},
  'collisions': {'@id': None},
  'corridor_source': {'@reverse': None, '@type': None},
  'corridor': {'@id': None},
  'summaries': {'@id': None},
  'cell_source': {'@reverse': None, '@type': None},
  'aggregated_summaries': {'@id': None, '@type': None},
  'annotation_method': {'@id': None},
  'summary': {'@id': None},
  'cell_id': {'@type': None},
  'count': {'@id': None},
  'percentage': {'@id': None},
  'cell_count': {'@id': None, '@type': None},
  'gene_count': {'@id': None, '@type': None},
  'organ_id': {'@type': None},
  'cell_source_a': {'@type': None},
  'cell_source_b': {'@type': None},
  'entity_a': {'@type': None},
  'entity_b': {'@type': None},
  '@base': None,
  '@vocab': None,
  'ccf'
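
Note that `extract_structure` looks only at the first element of each list, so keys that appear only in later elements are not reported. A possible variant that unions the keys of every element is sketched below; `extract_structure_all` and `merge_structures` are hypothetical names, and the input is toy data.

```python
def merge_structures(a, b):
    """Recursively union two structure summaries."""
    if isinstance(a, dict) and isinstance(b, dict):
        return {k: merge_structures(a.get(k), b.get(k)) for k in {**a, **b}}
    if isinstance(a, list) and isinstance(b, list):
        return [merge_structures(a[0], b[0])]
    return a if a is not None else b

def extract_structure_all(data):
    """Like extract_structure, but merges the keys of every list element."""
    if isinstance(data, dict):
        return {key: extract_structure_all(value) for key, value in data.items()}
    if isinstance(data, list) and data:
        merged = None
        for item in data:
            merged = merge_structures(merged, extract_structure_all(item))
        return [merged]
    return None

# Toy input: the second graph entry has a key ('sex') that the first one lacks.
toy = {"@graph": [{"@id": "d1", "samples": [{"label": "s1"}]},
                  {"@id": "d2", "sex": "Female"}]}
structure = extract_structure_all(toy)
print(structure)
```

This visits every element, so it is slower on large files than the first-element version.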

In [8]:
# Load the JSON file
json_file_path = 'ref_data/atlas-enriched-dataset-graph.jsonld'
with open(json_file_path, 'r') as file:
    json_data = json.load(file)

# Function to extract structure and keys of the JSON
def extract_structure(data, level=0):
    structure = {}
    if isinstance(data, dict):
        for key, value in data.items():
            structure[key] = extract_structure(value, level + 1)
    elif isinstance(data, list) and len(data) > 0:
        structure = [extract_structure(data[0], level + 1)]
    else:
        structure = None
    return structure

# Extract structure
json_structure = extract_structure(json_data)

In [9]:
json_structure

{'@context': {'CL': {'@id': None, '@prefix': None},
  'ASCTB-TEMP': {'@id': None, '@prefix': None},
  'ctpop': {'@id': None, '@prefix': None},
  'as_3d_id': {'@type': None},
  'as_id': {'@type': None},
  'all_collisions': {'@id': None},
  'collision_source': {'@reverse': None, '@type': None},
  'collisions': {'@id': None},
  'corridor_source': {'@reverse': None, '@type': None},
  'corridor': {'@id': None},
  'summaries': {'@id': None},
  'cell_source': {'@reverse': None, '@type': None},
  'aggregated_summaries': {'@id': None, '@type': None},
  'annotation_method': {'@id': None},
  'summary': {'@id': None},
  'cell_id': {'@type': None},
  'count': {'@id': None},
  'percentage': {'@id': None},
  'cell_count': {'@id': None, '@type': None},
  'gene_count': {'@id': None, '@type': None},
  'organ_id': {'@type': None},
  'cell_source_a': {'@type': None},
  'cell_source_b': {'@type': None},
  'entity_a': {'@type': None},
  'entity_b': {'@type': None},
  '@base': None,
  '@vocab': None,
  'ccf'

In [31]:
import pandas as pd
import json

# Load the provided CSV and JSON files
ftu_file_path = 'ref_data/FTU Cell Count Table - Cell_Type_Count.csv'
json_file_path = 'ref_data/atlas-as-cell-summaries.jsonld'

# Read the CSV file and drop the unnamed trailing columns
ftu_data = pd.read_csv(ftu_file_path)
ftu_data = ftu_data.drop(columns=[f'Unnamed: {i}' for i in range(6, 15)])

# Load the JSON file
with open(json_file_path, 'r') as file:
    json_data = json.load(file)

# Extract the relevant data from the JSON file
json_graph = json_data['@graph']

# Convert the JSON graph to a DataFrame for easier manipulation
json_df = pd.json_normalize(json_graph)

# Ensure modality is expanded correctly
expanded_json_df = json_df.explode('summary').reset_index(drop=True)
expanded_json_df['modality'] = json_df.explode('modality').reset_index(drop=True)['modality']

# Normalize the summary column and merge with expanded_json_df
summary_df = pd.json_normalize(expanded_json_df['summary'])
merged_json_df = pd.concat([expanded_json_df.drop(columns=['summary']), summary_df], axis=1)

# Ensure correct modality assignment (forward-fill missing values)
merged_json_df['modality'] = merged_json_df['modality'].ffill()

# Extract relevant columns from the CSV data
csv_relevant_columns = ftu_data[['CT ID in CL', 'Organ', 'FTU Label in Uberon', 'FTU ID in Uberon', 'CT Label in CL']]

# Merge the CSV and JSON DataFrames based on matching cell_id
merged_data = pd.merge(
    merged_json_df,
    csv_relevant_columns,
    left_on='cell_id',
    right_on='CT ID in CL',
    how='inner'
)

# Create the final DataFrame with the desired columns.
# .copy() makes final_df independent of merged_data, so adding a column below
# does not trigger a SettingWithCopyWarning.
final_df = merged_data[[
    'cell_id',
    'cell_label',
    'annotation_method',
    'modality',
    'cell_source_label',
    'sex',
    'aggregated_summaries'
]].copy()

# Add the new column "#datasets"
final_df['#datasets'] = final_df['aggregated_summaries'].apply(lambda x: len(x) if isinstance(x, list) else 0)

# Check for null values in the 'modality' column again
null_modality_entries_after_correction = final_df[final_df['modality'].isnull()]

# Save the final DataFrame to a CSV file
final_csv_path = 'output/CTs-with-datasets.csv'
final_df.to_csv(final_csv_path, index=False)

# Display the first few rows of the final DataFrame
print(final_df.head(), '\n')

print(f"Final CSV saved to: {final_csv_path}")

      cell_id            cell_label annotation_method            modality  \
0  CL:0009080  intestinal tuft cell              popv  sc_transcriptomics   
1  CL:0000057            fibroblast              popv  sc_transcriptomics   
2  CL:0000775            neutrophil              popv  sc_transcriptomics   
3  CL:0000775            neutrophil              popv  sc_transcriptomics   
4  CL:0000097             mast cell              popv  sc_transcriptomics   

  cell_source_label     sex  \
0           jejunum  Female   
1           jejunum  Female   
2           jejunum  Female   
3           jejunum  Female   
4           jejunum  Female   

                                aggregated_summaries  #datasets  
0  [https://entity.api.hubmapconsortium.org/entit...          2  
1  [https://entity.api.hubmapconsortium.org/entit...          2  
2  [https://entity.api.hubmapconsortium.org/entit...          2  
3  [https://entity.api.hubmapconsortium.org/entit...          2  
4  [https://entity.a



In [11]:
final_df['modality'].unique()

array(['sc_transcriptomics'], dtype=object)
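
The cell above explodes `summary` and `modality` separately and then realigns them by row position. An alternative that keeps parent fields attached to each summary row is `pd.json_normalize` with `record_path` and `meta`. This is a sketch on toy records; it assumes each parent field is a scalar, which may not hold for every entry in the real `@graph`.

```python
import pandas as pd

# Toy records shaped like entries of '@graph'; all values are made up.
graph = [
    {
        "annotation_method": "popv",
        "modality": "sc_transcriptomics",
        "cell_source_label": "jejunum",
        "summary": [
            {"cell_id": "CL:0000057", "cell_label": "fibroblast", "count": 10},
            {"cell_id": "CL:0000775", "cell_label": "neutrophil", "count": 4},
        ],
    },
    {
        "annotation_method": "azimuth",
        "modality": "sc_transcriptomics",
        "cell_source_label": "renal pyramid",
        "summary": [
            {"cell_id": "CL:1000768", "cell_label": "Connecting Tubule", "count": 7},
        ],
    },
]

# One output row per summary item; the meta fields are copied from the parent record.
flat = pd.json_normalize(
    graph,
    record_path="summary",
    meta=["annotation_method", "modality", "cell_source_label"],
)
print(flat)
```

Because each row carries its own parent values, no forward fill is needed afterwards.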

In [55]:
import pandas as pd
import json

# Load CSV file
csv_file_path = 'output/CTs-with-datasets.csv'
csv_df = pd.read_csv(csv_file_path)

# Load JSON file
json_file_path = 'ref_data/atlas-enriched-dataset-graph.jsonld'
with open(json_file_path, 'r') as f:
    json_data = json.load(f)


In [78]:
import ast

# Create an empty dataframe with the specified columns
columns = [
    'organ_id', 'organ_name', 'reference_organ', 'cell_id', 'cell_label', 'annotation_method', 
    'cell_source_label', 'as_label', 'sex', 'aggregated_summaries', 'gene_id', 
    'gene_label', 'ensembl_id', 'mean_gene_expr_value'
]
result_df = pd.DataFrame(columns=columns)

# Function to extract and match data based on cell_id and annotation_method
def extract_data(csv_row, json_data):
    matches = []
    cell_id = csv_row['cell_id']
    annotation_method = csv_row['annotation_method']
    # Safely parse the string representation of the list (avoids eval on file contents)
    aggregated_summaries = ast.literal_eval(csv_row['aggregated_summaries'])
    organ_name = 'unknown'
    for entry in json_data["@graph"]:
        for sample in entry.get('samples', []):
            rui_location = sample.get('rui_location', {})
            all_collisions = rui_location.get('all_collisions', [])
            for collision in all_collisions:
                collisions = collision.get('collisions', [])
                for col in collisions:
                    reference_organ = col.get('reference_organ', 'unknown')
                    as_label = col.get('as_label', 'unknown')
                    for section in sample.get('sections', []):
                        for dataset in section.get('datasets', []):
                            summaries = dataset.get('summaries', [])
                            organ_id = dataset.get('organ_id', 'unknown')
                            for summary in summaries:
                                if summary.get('annotation_method') == annotation_method:
                                    for summ in summary['summary']:
                                        if isinstance(summ, dict) and summ['cell_id'] == cell_id:
                                            for aggregated_summary in aggregated_summaries:
                                                if aggregated_summary in dataset['@id']:
                                                    for gene_expr in summ.get('gene_expr', []):
                                                        if isinstance(gene_expr, dict):  # Ensure gene_expr is a dictionary
                                                            new_row = {
                                                                'organ_id': organ_id,
                                                                'organ_name': organ_name,
                                                                'cell_id': cell_id,
                                                                'cell_label': summ['cell_label'],
                                                                'annotation_method': annotation_method,
                                                                'cell_source_label': csv_row['cell_source_label'],
                                                                'sex': csv_row['sex'],
                                                                'aggregated_summaries': aggregated_summary,
                                                                'gene_id': gene_expr.get('gene_id'),
                                                                'gene_label': gene_expr.get('gene_label'),
                                                                'ensembl_id': gene_expr.get('ensembl_id'),
                                                                'mean_gene_expr_value': gene_expr.get('mean_gene_expr_value'),
                                                                'reference_organ': reference_organ,
                                                                'as_label': as_label
                                                            }
                                                            matches.append(new_row)
    return matches

# Iterate over the CSV data and collect matched rows
all_matches = []
for index, row in csv_df.iterrows():
    if row['cell_source_label']:  # Skip rows with an empty cell_source_label
        all_matches.extend(extract_data(row, json_data))

# Build the dataframe once instead of concatenating inside the loop
result_df = pd.DataFrame(all_matches, columns=columns)

# Display the new dataframe
print(result_df.head())




                                        organ_id organ_name  \
0  http://purl.obolibrary.org/obo/UBERON_0002108    unknown   
1  http://purl.obolibrary.org/obo/UBERON_0002108    unknown   
2  http://purl.obolibrary.org/obo/UBERON_0002108    unknown   
3  http://purl.obolibrary.org/obo/UBERON_0002108    unknown   
4  http://purl.obolibrary.org/obo/UBERON_0002108    unknown   

                                     reference_organ     cell_id  \
0  http://purl.org/ccf/latest/ccf.owl#VHFSmallInt...  CL:0009080   
1  http://purl.org/ccf/latest/ccf.owl#VHFSmallInt...  CL:0009080   
2  http://purl.org/ccf/latest/ccf.owl#VHFSmallInt...  CL:0009080   
3  http://purl.org/ccf/latest/ccf.owl#VHFSmallInt...  CL:0009080   
4  http://purl.org/ccf/latest/ccf.owl#VHFSmallInt...  CL:0009080   

             cell_label annotation_method cell_source_label as_label     sex  \
0  intestinal tuft cell              popv           jejunum  jejunum  Female   
1  intestinal tuft cell              popv           
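
`extract_data` walks the whole JSON graph once per CSV row. One way to avoid the repeated scan is to index the gene expression records once, keyed by annotation method and cell ID. The sketch below assumes the `samples` > `sections` > `datasets` > `summaries` nesting used in the cell above; the function names, the `example.org` URL and the expression value are hypothetical.

```python
def iter_datasets(graph):
    """Yield every dataset dict nested under entry -> samples -> sections -> datasets."""
    for entry in graph:
        for sample in entry.get("samples", []):
            for section in sample.get("sections", []):
                yield from section.get("datasets", [])

def index_gene_expr(graph):
    """Map (annotation_method, cell_id) to a list of (dataset_id, gene_expr dict)."""
    index = {}
    for dataset in iter_datasets(graph):
        for summary in dataset.get("summaries", []):
            method = summary.get("annotation_method")
            for cell in summary.get("summary", []):
                if not isinstance(cell, dict):
                    continue
                key = (method, cell.get("cell_id"))
                for gene in cell.get("gene_expr", []):
                    if isinstance(gene, dict):
                        index.setdefault(key, []).append((dataset.get("@id"), gene))
    return index

# Toy graph with a single dataset; identifiers and values are made up.
toy_graph = [{"samples": [{"sections": [{"datasets": [{
    "@id": "https://example.org/ds1",
    "summaries": [{"annotation_method": "popv", "summary": [{
        "cell_id": "CL:0009080",
        "gene_expr": [{"gene_label": "PLCG2", "mean_gene_expr_value": 4.1}],
    }]}],
}]}]}]}]

index = index_gene_expr(toy_graph)
hits = index.get(("popv", "CL:0009080"), [])
print(hits)
```

Each CSV row then needs only a dictionary lookup plus a filter on the dataset IDs listed in `aggregated_summaries`. This sketch leaves out the collision fields (`reference_organ`, `as_label`), which would have to be attached separately.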

In [85]:
# Get unique organ IDs
result_df["organ_id"].unique()

array(['http://purl.obolibrary.org/obo/UBERON_0002108',
       'http://purl.obolibrary.org/obo/UBERON_0000059',
       'http://purl.obolibrary.org/obo/UBERON_0002113',
       'http://purl.obolibrary.org/obo/UBERON_0000948',
       'http://purl.obolibrary.org/obo/UBERON_0001255',
       'http://purl.obolibrary.org/obo/UBERON_0002048',
       'http://purl.obolibrary.org/obo/UBERON_0002107'], dtype=object)

In [86]:
organ_id_to_name = {
    'http://purl.obolibrary.org/obo/UBERON_0002108': 'small intestine',
    'http://purl.obolibrary.org/obo/UBERON_0000059': 'large intestine',
    'http://purl.obolibrary.org/obo/UBERON_0002113': 'kidney',
    'http://purl.obolibrary.org/obo/UBERON_0000948': 'heart',
    'http://purl.obolibrary.org/obo/UBERON_0001255': 'urinary bladder',
    'http://purl.obolibrary.org/obo/UBERON_0002048': 'lung',
    'http://purl.obolibrary.org/obo/UBERON_0002107': 'liver',
}

In [87]:
# Adding a new column to the dataframe
result_df['organ_name'] = result_df['organ_id'].map(organ_id_to_name)

# Display the first few rows of the updated dataframe
result_df.head()

| | organ_id | organ_name | reference_organ | cell_id | cell_label | annotation_method | cell_source_label | as_label | sex | aggregated_summaries | gene_id | gene_label | ensembl_id | mean_gene_expr_value |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | http://purl.obolibrary.org/obo/UBERON_0002108 | small intestine | http://purl.org/ccf/latest/ccf.owl#VHFSmallInt... | CL:0009080 | intestinal tuft cell | popv | jejunum | jejunum | Female | https://entity.api.hubmapconsortium.org/entiti... | ASCTB-TEMP:plcg2 | PLCG2 | PLCG2 | 4.136788 |
| 1 | http://purl.obolibrary.org/obo/UBERON_0002108 | small intestine | http://purl.org/ccf/latest/ccf.owl#VHFSmallInt... | CL:0009080 | intestinal tuft cell | popv | jejunum | jejunum | Female | https://entity.api.hubmapconsortium.org/entiti... | ASCTB-TEMP:st18 | ST18 | ST18 | 3.566198 |
| 2 | http://purl.obolibrary.org/obo/UBERON_0002108 | small intestine | http://purl.org/ccf/latest/ccf.owl#VHFSmallInt... | CL:0009080 | intestinal tuft cell | popv | jejunum | jejunum | Female | https://entity.api.hubmapconsortium.org/entiti... | ASCTB-TEMP:grk5 | GRK5 | GRK5 | 3.731159 |
| 3 | http://purl.obolibrary.org/obo/UBERON_0002108 | small intestine | http://purl.org/ccf/latest/ccf.owl#VHFSmallInt... | CL:0009080 | intestinal tuft cell | popv | jejunum | jejunum | Female | https://entity.api.hubmapconsortium.org/entiti... | ASCTB-TEMP:itpr2 | ITPR2 | ITPR2 | 4.091242 |
| 4 | http://purl.obolibrary.org/obo/UBERON_0002108 | small intestine | http://purl.org/ccf/latest/ccf.owl#VHFSmallInt... | CL:0009080 | intestinal tuft cell | popv | jejunum | jejunum | Female | https://entity.api.hubmapconsortium.org/entiti... | ASCTB-TEMP:zfhx3 | ZFHX3 | ZFHX3 | 3.002826 |
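
The notebook uses two notations for the same organ: full OBO PURLs in the JSON-LD graph (`http://purl.obolibrary.org/obo/UBERON_0002113`) and CURIEs in the crosswalk mapping (`UBERON:0002113`). A small pair of converters, sketched here with hypothetical function names, could make the two comparable before any join.

```python
OBO_PREFIX = "http://purl.obolibrary.org/obo/"

def iri_to_curie(iri):
    """Convert an OBO PURL such as .../UBERON_0002113 to 'UBERON:0002113'."""
    if not iri.startswith(OBO_PREFIX):
        return iri  # leave non-OBO identifiers untouched
    prefix, sep, local_id = iri[len(OBO_PREFIX):].partition("_")
    if not sep:
        return iri  # no 'PREFIX_ID' pattern to convert
    return f"{prefix}:{local_id}"

def curie_to_iri(curie):
    """Convert 'UBERON:0002113' back to its OBO PURL."""
    prefix, _, local_id = curie.partition(":")
    return f"{OBO_PREFIX}{prefix}_{local_id}"

kidney_iri = "http://purl.obolibrary.org/obo/UBERON_0002113"
print(iri_to_curie(kidney_iri))
print(curie_to_iri("UBERON:0002113") == kidney_iri)
```

With a DataFrame, the same function can be applied as `result_df['organ_id'].map(iri_to_curie)`.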


In [89]:
result_df.to_csv('output/filtered-CTs-with-datasets-with-gene-information_1.csv', index=False)

##### Sanity check of the organ_id, cell_source_label and cell_ids

In [90]:
result_df['cell_source_label'].unique()

array(['jejunum', 'descending colon', 'superior part of duodenum',
       'descending part of duodenum', 'ascending part of duodenum',
       'horizontal part of duodenum', 'sigmoid colon',
       'distal part of ileum', 'ileum', 'renal pyramid',
       'outer cortex of kidney', 'heart left ventricle', 'rectum',
       'ascending colon', 'transverse colon', 'renal column',
       'renal papilla', 'fundus of urinary bladder',
       'Right Medial Bronchopulmonary Segment',
       'Right Lateral Bronchopulmonary Segment', 'left ureter',
       'hilum of kidney', 'kidney capsule', 'gastric impression of liver',
       'diaphragmatic surface of liver', 'capsule of the liver'],
      dtype=object)

In [91]:
final_df['cell_source_label'].unique()

array(['jejunum', 'descending colon', 'superior part of duodenum',
       'descending part of duodenum', 'ascending part of duodenum',
       'horizontal part of duodenum', 'sigmoid colon',
       'distal part of ileum', 'ileum', 'renal pyramid',
       'outer cortex of kidney', 'diaphragmatic surface of spleen',
       'hilum of spleen', 'heart right ventricle',
       'interventricular septum', 'heart left ventricle', 'rectum',
       'ascending colon', 'transverse colon', 'renal column',
       'renal papilla', 'fundus of urinary bladder',
       'Right Medial Bronchopulmonary Segment',
       'Right Lateral Bronchopulmonary Segment', 'left ureter',
       'hilum of kidney', 'kidney capsule', 'caecum',
       'gastric impression of liver', 'diaphragmatic surface of liver',
       'capsule of the liver', 'left cardiac atrium',
       'right cardiac atrium',
       'Posteromedial head of posterior papillary muscle of left ventricle',
       'Lateral segmental bronchus',
       'Left a

In [92]:
difference_labels = set(final_df['cell_source_label'].unique()) - set(result_df['cell_source_label'])
print("Difference in cell_source_label:")
print(difference_labels)

Difference in cell_source_label:
{'right cardiac atrium', 'peripheral zone of prostate', 'Posteromedial head of posterior papillary muscle of left ventricle', 'caecum', 'hilum of spleen', 'left cardiac atrium', 'Interlobar adipose tissue of right mammary gland', 'skin of body', 'Left apical bronchopulmonary segment', 'central zone of prostate', 'diaphragmatic surface of spleen', 'interventricular septum', 'heart right ventricle', 'Lateral segmental bronchus'}


In [96]:
difference_ids = set(final_df['cell_id'].unique()) - set(result_df['cell_id'])
print("Difference in cell_ids:")
print(difference_ids)

Difference in cell_ids:
{'CL:0000788', 'CL:1000458', 'CL:0000071', 'CL:0009016', 'CL:0002341', 'CL:0002457', 'CL:0000787', 'CL:0002340', 'CL:0001065', 'CL:0000499'}
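
The set differences above compare one column at a time. To see which combinations of cell ID and source label were dropped, a left merge with `indicator=True` acts as an anti-join. The frames below are toy stand-ins for `final_df` and `result_df`, with made-up IDs.

```python
import pandas as pd

# Toy stand-ins: 'left' plays the role of final_df, 'right' of result_df.
left = pd.DataFrame({
    "cell_id": ["CL:1", "CL:2", "CL:3"],
    "cell_source_label": ["jejunum", "caecum", "ileum"],
})
right = pd.DataFrame({
    "cell_id": ["CL:1", "CL:3"],
    "cell_source_label": ["jejunum", "ileum"],
})

keys = ["cell_id", "cell_source_label"]
# The '_merge' column records whether each left row found a partner on the right.
merged = left.merge(right[keys].drop_duplicates(), on=keys, how="left", indicator=True)
missing = merged.loc[merged["_merge"] == "left_only", keys]
print(missing)
```

Deduplicating the right-hand keys first keeps the merge from multiplying rows on the left.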


In [97]:
# Find the unique cell_labels in the provided CSV file
ftu_cell_labels = set(ftu_data['CT Label in CL'].unique())
ftu_cell_ids = set(ftu_data['CT ID in CL'].unique())

In [101]:
cell_labels_set = set()
for label in difference_labels:
    cell_labels = final_df[final_df['cell_source_label'] == label]['cell_label']
    cell_labels_set.update(cell_labels) 
    
# Check for matches between cell_labels_set and ftu_cell_labels
matches_label = cell_labels_set.intersection(ftu_cell_labels)
matches_label

{'B cell',
 'basal cell',
 'basal cell of prostate epithelium',
 'blood vessel endothelial cell',
 'capillary endothelial cell',
 'dendritic cell',
 'endothelial cell',
 'endothelial cell of artery',
 'enterocyte of epithelium of large intestine',
 'fibroblast',
 'hepatocyte',
 'innate lymphoid cell',
 'intestinal crypt stem cell of large intestine',
 'intestinal tuft cell',
 'luminal cell of prostate epithelium',
 'macrophage',
 'mast cell',
 'memory B cell',
 'naive B cell',
 'neutrophil',
 'pericyte',
 'plasma cell',
 'smooth muscle cell',
 'stromal cell',
 'type I pneumocyte',
 'type II pneumocyte',
 'vascular associated smooth muscle cell',
 'vein endothelial cell'}

In [102]:
cell_ids_set = set()
for ids in difference_ids:
    cell_ids = final_df[final_df['cell_id'] == ids]['cell_id']
    cell_ids_set.update(cell_ids)
    
matches_id = cell_ids_set.intersection(ftu_cell_ids)
matches_id

{'CL:0000071',
 'CL:0000499',
 'CL:0000787',
 'CL:0000788',
 'CL:0001065',
 'CL:0002340',
 'CL:0002341',
 'CL:0002457',
 'CL:0009016',
 'CL:1000458'}

In [105]:
match_df = pd.DataFrame()
for ids in matches_id:
    match_df = pd.concat([match_df, (ftu_data[ftu_data['CT ID in CL'] == ids])],ignore_index=True)

match_df

| | Organ | FTU Label in Uberon | FTU ID in Uberon | CT Label in CL | CT ID in CL | CT Label in 2D Object | Unnamed: 6 | Unnamed: 7 | Unnamed: 8 | Unnamed: 9 | Unnamed: 10 | Unnamed: 11 | Unnamed: 12 | Unnamed: 13 | Unnamed: 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Thymus | Thymus Lobule | UBERON:0002125 | naive B cell | CL:0000788 | B Cell | | | | | | | | | |
| 1 | Skin | Epidermal Ridge | UBERON:0013487 | melanocyte of skin | CL:1000458 | Melanocyte | | | | | | | | | |
| 2 | Thymus | Thymus Lobule | UBERON:0002125 | blood vessel endothelial cell | CL:0000071 | Blood Vessel Endothelial Cell | | | | | | | | | Liver |
| 3 | Large Intestine | Crypt of Lieberkuhn | UBERON:0001984 | intestinal crypt stem cell of large intestine | CL:0009016 | Epithelial Stem Cells | | | | | | | | | |
| 4 | Prostate | Prostate Glandular Acinus | UBERON:0004179 | basal cell of prostate epithelium | CL:0002341 | Basal Cell Of Prostate Epithelium | | | | | | | | | |
| 5 | Skin | Epidermal Ridge | UBERON:0013487 | epidermal Langerhans cell | CL:0002457 | Epidermal Langerhans Cell | | | | | | | | | |
| 6 | Thymus | Thymus Lobule | UBERON:0002125 | memory B cell | CL:0000787 | B Cell | | | | | | | | | Prostate |
| 7 | Prostate | Prostate Glandular Acinus | UBERON:0004179 | luminal cell of prostate epithelium | CL:0002340 | Luminal Cell Of Prostate Epithelium | | | | | | | | | |
| 8 | Small Intestine | intestinal villus | UBERON 0001213 | innate lymphoid cell | CL:0001065 | innate lymphoid cell | | | | | | | | | |
| 9 | Spleen | White Pulp | UBERON:0001959 | stromal cell | CL:0000499 | Stromal Cell | | | | | | | | | |


In [40]:
# Collect the final_df rows for every matched cell label
matched_data = pd.DataFrame()
for cl_label in matches_label:
    matched_data = pd.concat([matched_data, final_df[final_df['cell_label'] == cl_label]])
    
matched_data['cell_source_label'].unique()

array(['diaphragmatic surface of spleen', 'hilum of spleen',
       'central zone of prostate', 'peripheral zone of prostate',
       'skin of body', 'descending colon', 'sigmoid colon', 'rectum',
       'ascending colon', 'transverse colon', 'caecum', 'jejunum',
       'superior part of duodenum', 'descending part of duodenum',
       'ascending part of duodenum', 'horizontal part of duodenum',
       'distal part of ileum', 'ileum', 'fundus of urinary bladder',
       'Interlobar adipose tissue of right mammary gland',
       'Right Medial Bronchopulmonary Segment',
       'Right Lateral Bronchopulmonary Segment',
       'gastric impression of liver', 'diaphragmatic surface of liver',
       'capsule of the liver', 'Lateral segmental bronchus',
       'Left apical bronchopulmonary segment', 'heart right ventricle',
       'interventricular septum', 'heart left ventricle',
       'left cardiac atrium', 'right cardiac atrium',
       'Posteromedial head of posterior papillary muscle of

### Organ information from CTann crosswalk files

In [60]:
# Load the uploaded files
final_output_df = pd.read_csv('output/CTs-with-datasets.csv')
popv_df = pd.read_csv('ref_data/popv.csv')
celltypist_df = pd.read_csv('ref_data/celltypist.csv')
azimuth_df = pd.read_csv('ref_data/azimuth.csv')

# Rename the column in final_output_df
final_output_df.rename(columns={'cell_id': 'CL_ID'}, inplace=True)

# Merge with azimuth_df
final_output_azimuth = final_output_df[final_output_df['annotation_method'] == 'azimuth'].merge(
    azimuth_df[['CL_ID', 'Organ_Level']], on='CL_ID', how='left')

# Merge with celltypist_df
final_output_celltypist = final_output_df[final_output_df['annotation_method'] == 'celltypist'].merge(
    celltypist_df[['CL_ID', 'Organ_Level']], on='CL_ID', how='left')

# Merge with popv_df
final_output_popv = final_output_df[final_output_df['annotation_method'] == 'popv'].merge(
    popv_df[['CL_ID', 'Organ_Level']], on='CL_ID', how='left')

# Combine all the dataframes
final_output_combined = pd.concat([final_output_azimuth, final_output_celltypist, final_output_popv], ignore_index=True)

# Define the organ levels for each annotation method
azimuth_organs = ['Kidney_L3', 'Lung_v2_finest_level', 'Liver_L2', 'Liver_L1', 'Kidney_L1',  'Kidney_L2', 
                  'Lung_v2_L1', 'Lung_v2_L2', 'Lung_v2_L3', 'Lung_v2_L4', 'Lung_v2_L5', 'Pancreas_L1']
celltypist_organs = ['intestine_L1', 'kidney_L1', 'liver_L1', 'lung_L1', 'pancreas_L1', 'spleen_L1', 
                     'Adult_Human_Skin_pkl', 'Healthy_Human_Liver_pkl', 'Adult_Human_PancreaticIslet_pkl', 
                     'Human_Lung_Atlas_pkl']
popv_organs = ['large intestine', 'liver', 'lung', 'male reproductive system', 'pancreas', 'prostate gland', 
               'respiratory system', 'skin', 'small intestine', 'spleen', 'thymus']

# Filter the combined dataframe for each annotation method and their corresponding organ levels
filtered_azimuth = final_output_combined[(final_output_combined['annotation_method'] == 'azimuth') & 
                                         (final_output_combined['Organ_Level'].isin(azimuth_organs))]

filtered_celltypist = final_output_combined[(final_output_combined['annotation_method'] == 'celltypist') & 
                                            (final_output_combined['Organ_Level'].isin(celltypist_organs))]

filtered_popv = final_output_combined[(final_output_combined['annotation_method'] == 'popv') & 
                                      (final_output_combined['Organ_Level'].isin(popv_organs))]

# Combine the filtered dataframes
final_filtered_combined = pd.concat([filtered_azimuth, filtered_celltypist, filtered_popv], ignore_index=True)

# Save the combined dataframe to a CSV file
output_path = 'output/filtered-CTs-with-datasets-with-organ.csv'
final_filtered_combined.to_csv(output_path, index=False)

final_filtered_combined.head()

Unnamed: 0,CL_ID,cell_label,annotation_method,modality,cell_source_label,sex,aggregated_summaries,#datasets,Organ_Level
0,CL:1000716,Outer Medullary Collecting Duct Principal,azimuth,sc_transcriptomics,renal pyramid,Female,['https://entity.api.hubmapconsortium.org/enti...,55,Kidney_L3
1,CL:1000716,Outer Medullary Collecting Duct Principal,azimuth,sc_transcriptomics,renal pyramid,Female,['https://entity.api.hubmapconsortium.org/enti...,55,Kidney_L2
2,CL:1000768,Connecting Tubule,azimuth,sc_transcriptomics,renal pyramid,Female,['https://entity.api.hubmapconsortium.org/enti...,55,Kidney_L3
3,CL:1000768,Connecting Tubule,azimuth,sc_transcriptomics,renal pyramid,Female,['https://entity.api.hubmapconsortium.org/enti...,55,Kidney_L1
4,CL:1000768,Connecting Tubule,azimuth,sc_transcriptomics,renal pyramid,Female,['https://entity.api.hubmapconsortium.org/enti...,55,Kidney_L2
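
The three per-method filters above all follow the same pattern, so they could be collapsed into a single lookup table. The following is a minimal sketch, assuming a dataframe with the `annotation_method` and `Organ_Level` columns used above. The helper name `filter_by_allowed_levels` and the toy rows are hypothetical and only illustrate the idea.

```python
import pandas as pd


def filter_by_allowed_levels(df, allowed_levels):
    """Keep rows whose Organ_Level is allowed for the row's annotation_method.

    allowed_levels maps an annotation method to an iterable of organ levels.
    Rows whose method is not a key of the mapping are dropped.
    """
    parts = [
        df[(df["annotation_method"] == method) & (df["Organ_Level"].isin(list(levels)))]
        for method, levels in allowed_levels.items()
    ]
    if not parts:
        # No methods configured: return an empty frame with the same columns
        return df.iloc[0:0].copy()
    return pd.concat(parts, ignore_index=True)


# Toy data (hypothetical rows, only to exercise the helper)
toy = pd.DataFrame({
    "CL_ID": ["CL:1000768", "CL:1000768", "CL:0002144", "CL:0002144"],
    "annotation_method": ["azimuth", "azimuth", "popv", "celltypist"],
    "Organ_Level": ["Kidney_L3", "Heart_L2", "lung", "blood_L1"],
})
allowed = {
    "azimuth": ["Kidney_L3", "Kidney_L2"],
    "celltypist": ["kidney_L1"],
    "popv": ["lung"],
}
filtered = filter_by_allowed_levels(toy, allowed)
print(filtered)
```

With the notebook's own variables, the mapping would be `{'azimuth': azimuth_organs, 'celltypist': celltypist_organs, 'popv': popv_organs}` applied to `final_output_combined`.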


### Additional work

In [4]:
# Load the files
azimuth = pd.read_csv("C:\\Users\\Supriya\\Downloads\\azimuth.csv")
celltypist = pd.read_csv("C:\\Users\\Supriya\\Downloads\\celltypist.csv")
popv = pd.read_csv("C:\\Users\\Supriya\\Downloads\\popv.csv")

# Get the unique organ levels for each file
azimuth_organs = azimuth['Organ_Level'].unique().tolist()
celltypist_organs = celltypist['Organ_Level'].unique().tolist()
popv_organs = popv['Organ_Level'].unique().tolist()

print("Azimuth Organs:", azimuth_organs)
print()
print("Celltypist Organs:", celltypist_organs)
print()
print("Popv Organs:", popv_organs)

Azimuth Organs: ['Heart_L2', 'Kidney_L3', 'Lung_v2_finest_level', 'Liver_L2', 'Liver_L1', 'Heart_L1', 'Kidney_L1', 'Human_PBMC_L3', 'Human_PBMC_L1', 'Kidney_L2', 'Lung_v2_L1', 'Lung_v2_L2', 'Lung_v2_L3', 'Lung_v2_L4', 'Lung_v2_L5', 'Pancreas_L1', 'Human_PBMC_L2', 'Bone_marrow_L1', 'Bone_marrow_L2', 'Adipose_L1', 'Adipose_L2', 'Tonsil_v2_L2', 'Tonsil_v2_L1']

Celltypist Organs: ['blood_L1', 'bone marrow_L1', 'heart_L1', 'hippocampus_L1', 'intestine_L1', 'kidney_L1', 'liver_L1', 'lung_L1', 'lymph node_L1', 'pancreas_L1', 'skeletal muscle_L1', 'spleen_L1', 'Adult_Human_Skin_pkl', 'Healthy_Human_Liver_pkl', 'Adult_Human_PancreaticIslet_pkl', 'Human_Lung_Atlas_pkl', 'Healthy_Adult_Heart_pkl', 'Human_AdultAged_Hippocampus_pkl']

Popv Organs: ['blood', 'blood vasculature', 'bone marrow', 'eye', 'heart', 'large intestine', 'liver', 'lung', 'lymph node', 'male reproductive system', 'mammary gland', 'mesenteric lymph node', 'pancreas', 'prostate gland', 'respiratory system', 'skin', 'small intes

In [None]:
import json
import pandas as pd

# Load the JSON file
file_path = 'C:\\Users\\Supriya\\Downloads\\atlas-enriched-dataset-graph.jsonld'

with open(file_path, 'r') as file:
    json_data = json.load(file)

# Provided CT Labels and CT IDs
ct_labels_ids = {
    "glomerular capillary endothelial cell": "CL:1001005",
    "efferent arteriole endothelial cell": "CL:1001099",
    "afferent arteriole endothelial cell": "CL:1001096",
    "peritubular capillary endothelial cell": "CL:1001033",
    "vasa recta ascending limb cell": "CL:1001131",
    "vasa recta descending limb cell": "CL:1001285",
    "alveolar capillary type 1 endothelial cell": "CL:4028002",
    "capillary endothelial cell": "CL:0002144",
    "blood vessel smooth muscle cell": "CL:0019018",
    "endothelial cell of artery": "CL:1000413",
    "vein endothelial cell": "CL:0002543",
    "endothelial cell of hepatic sinusoid": "CL:1000398",
    "prostate gland microvascular endothelial cell": "CL:2000059",
    "blood vessel endothelial cell": "CL:0000071",
    "splenic endothelial cell": "CL:2000053"
}

# Identifying the structure within the "@graph" key to process accordingly
graph_data = json_data["@graph"]

In [29]:
# Grouping ensembl_ids by cell_id and adding cell_label
grouped_matches = {}
for item in graph_data:
    samples = item.get("samples", [])
    for sample in samples:
        sections = sample.get("sections", [])
        for section in sections:
            datasets = section.get("datasets", [])
            for dataset in datasets:
                summaries = dataset.get("summaries", [])
                for summary in summaries:
                    sum_details = summary.get("summary", [])
                    for sum_detail in sum_details:
                        cell_id = sum_detail.get("cell_id")
                        gene_expr_list = sum_detail.get("gene_expr", [])
                        if cell_id in ct_labels_ids.values():
                            if isinstance(gene_expr_list, list):
                                for gene_expr in gene_expr_list:
                                    if isinstance(gene_expr, dict):
                                        ensembl_id = gene_expr.get("ensembl_id")
                                        if cell_id not in grouped_matches:
                                            grouped_matches[cell_id] = {"cell_label": "", "ensembl_ids": []}
                                        # Reverse lookup of the label; `ct_id` avoids shadowing the built-in `id`
                                        grouped_matches[cell_id]["cell_label"] = next(label for label, ct_id in ct_labels_ids.items() if ct_id == cell_id)
                                        grouped_matches[cell_id]["ensembl_ids"].append(ensembl_id)

# Preparing the final dataframe
final_matches = []
for cell_id, details in grouped_matches.items():
    final_matches.append({
        "cell_id": cell_id,
        "cell_label": details["cell_label"],
        "ensembl_ids": ", ".join(details["ensembl_ids"])
    })

final_matches_df = pd.DataFrame(final_matches)

# Save the final dataframe to a CSV file
final_matches_df.to_csv('Biomarker_for_Vasculature_CTs.csv')

In [28]:
final_matches_df['cell_id'].unique()

array(['CL:1001131', 'CL:1001005', 'CL:0002144', 'CL:1000413',
       'CL:0002543', 'CL:1000398'], dtype=object)
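
The nested loops above walk the same path through the graph (`samples`, `sections`, `datasets`, `summaries`, `summary`) that later cells repeat. One way to keep that traversal in a single place is a generator. This is a sketch under the assumption that the file follows exactly this nesting; the name `iter_cell_summaries` and the toy graph are hypothetical.

```python
def iter_cell_summaries(graph):
    """Yield (dataset, summary, cell_summary) for each cell-type entry.

    Follows the nesting used in the notebook:
    item -> samples -> sections -> datasets -> summaries -> summary.
    Missing keys are treated as empty lists.
    """
    for item in graph:
        for sample in item.get("samples", []):
            for section in sample.get("sections", []):
                for dataset in section.get("datasets", []):
                    for summary in dataset.get("summaries", []):
                        for cell_summary in summary.get("summary", []):
                            yield dataset, summary, cell_summary


# Toy graph (hypothetical values) with the same shape as the "@graph" list
toy_graph = [
    {
        "samples": [
            {
                "sections": [
                    {
                        "datasets": [
                            {
                                "organ_id": "http://purl.obolibrary.org/obo/UBERON_0002113",
                                "summaries": [
                                    {
                                        "annotation_method": "azimuth",
                                        "summary": [
                                            {"cell_id": "CL:1001131", "gene_expr": []},
                                            {"cell_id": "CL:0000000", "gene_expr": []},
                                        ],
                                    }
                                ],
                            }
                        ]
                    }
                ]
            },
            {"sections": []},
        ]
    },
    {},
]

wanted = {"CL:1001131"}
hits = [
    (cell["cell_id"], summary.get("annotation_method", "Unknown"), dataset.get("organ_id", "Unknown"))
    for dataset, summary, cell in iter_cell_summaries(toy_graph)
    if cell.get("cell_id") in wanted
]
print(hits)
```

Using a set for the wanted IDs (for example `set(ct_labels_ids.values())`) also makes the membership test cheaper than scanning `ct_labels_ids.values()` on every entry.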

In [59]:
import json
import pandas as pd

# Load the JSON file
file_path = 'ref_data/atlas-enriched-dataset-graph.jsonld'

with open(file_path, 'r') as file:
    json_data = json.load(file)

# Provided CT Labels and CT IDs
ct_labels_ids = {
    "glomerular capillary endothelial cell": "CL:1001005",
    "efferent arteriole endothelial cell": "CL:1001099",
    "afferent arteriole endothelial cell": "CL:1001096",
    "peritubular capillary endothelial cell": "CL:1001033",
    "vasa recta ascending limb cell": "CL:1001131",
    "vasa recta descending limb cell": "CL:1001285",
    "alveolar capillary type 1 endothelial cell": "CL:4028002",
    "capillary endothelial cell": "CL:0002144",
    "blood vessel smooth muscle cell": "CL:0019018",
    "endothelial cell of artery": "CL:1000413",
    "vein endothelial cell": "CL:0002543",
    "endothelial cell of hepatic sinusoid": "CL:1000398",
    "prostate gland microvascular endothelial cell": "CL:2000059",
    "blood vessel endothelial cell": "CL:0000071",
    "splenic endothelial cell": "CL:2000053"
}

# Identifying the structure within the "@graph" key to process accordingly
graph_data = json_data["@graph"]

# Grouping ensembl_ids by cell_id and organ_id, adding cell_label and mean_gene_expr_value
grouped_matches = {}
for item in graph_data:
    samples = item.get("samples", [])
    for sample in samples:
        sections = sample.get("sections", [])
        for section in sections:
            datasets = section.get("datasets", [])
            for dataset in datasets:
                summaries = dataset.get("summaries", [])               
                for summary in summaries:
                    annotation_method = summary.get("annotation_method", "Unknown")
                    sum_details = summary.get("summary", [])
                    for sum_detail in sum_details:
                        cell_id = sum_detail.get("cell_id")
                        gene_expr_list = sum_detail.get("gene_expr", [])
                        organ_id = dataset.get("organ_id", "Unknown")
                        if cell_id in ct_labels_ids.values():
                            if isinstance(gene_expr_list, list) and gene_expr_list:
                                for gene_expr in gene_expr_list:
                                    if isinstance(gene_expr, dict):
                                        ensembl_id = gene_expr.get("ensembl_id")
                                        mean_expr_value = gene_expr.get("mean_gene_expr_value")
                                        key = (cell_id, organ_id, annotation_method)
                                        if key not in grouped_matches:
                                            grouped_matches[key] = {"cell_label": "", "ensembl_ids": [], "mean_expr_values": []}
                                        # Reverse lookup of the label; `ct_id` avoids shadowing the built-in `id`
                                        grouped_matches[key]["cell_label"] = next(label for label, ct_id in ct_labels_ids.items() if ct_id == cell_id)
                                        grouped_matches[key]["ensembl_ids"].append(ensembl_id)
                                        grouped_matches[key]["mean_expr_values"].append(mean_expr_value)
                            else:
                                # Handle entries with empty gene_expr lists
                                key = (cell_id, organ_id, annotation_method)
                                if key not in grouped_matches:
                                    grouped_matches[key] = {"cell_label": "", "ensembl_ids": [], "mean_expr_values": []}
                                # Reverse lookup of the label; `ct_id` avoids shadowing the built-in `id`
                                grouped_matches[key]["cell_label"] = next(label for label, ct_id in ct_labels_ids.items() if ct_id == cell_id)

# Preparing the final dataframe
final_matches = []
for (cell_id, organ_id, annotation_method), details in grouped_matches.items():
    final_matches.append({
        "cell_id": cell_id,
        "cell_label": details["cell_label"],
        "organ_id": organ_id,
        "annotation_method": annotation_method,
        "ensembl_ids": ", ".join(details["ensembl_ids"]),
        "mean_expr_values": ", ".join(map(str, details["mean_expr_values"]))
    })

final_matches_df = pd.DataFrame(final_matches)

# Expanding the dataframe to have each row for each ensembl_id
expanded_matches = []
for _, row in final_matches_df.iterrows():
    cell_id = row["cell_id"]
    cell_label = row["cell_label"]
    organ_id = row["organ_id"]
    annotation_method = row["annotation_method"]
    ensembl_ids = row["ensembl_ids"].split(", ")
    mean_expr_values = row["mean_expr_values"].split(", ")
    
    for ensembl_id, mean_expr_value in zip(ensembl_ids, mean_expr_values):
        try:
            mean_expr_value_float = float(mean_expr_value)
        except ValueError:
            mean_expr_value_float = None  # Handle non-numeric values
        expanded_matches.append({
            "cell_id": cell_id,
            "cell_label": cell_label,
            "organ_id": organ_id,
            "annotation_method": annotation_method,
            "ensembl_id": ensembl_id,
            "mean_expr_value": mean_expr_value_float
        })

expanded_matches_df = pd.DataFrame(expanded_matches)

# Displaying the expanded dataframe
expanded_matches_df.head()  # Displaying only the first few rows

Unnamed: 0,cell_id,cell_label,organ_id,annotation_method,ensembl_id,mean_expr_value
0,CL:1001131,vasa recta ascending limb cell,http://purl.obolibrary.org/obo/UBERON_0002113,azimuth,ENSG00000130300.9,5.059016
1,CL:1001131,vasa recta ascending limb cell,http://purl.obolibrary.org/obo/UBERON_0002113,azimuth,ENSG00000169744.13,40.209354
2,CL:1001131,vasa recta ascending limb cell,http://purl.obolibrary.org/obo/UBERON_0002113,azimuth,ENSG00000072163.20,4.431148
3,CL:1001131,vasa recta ascending limb cell,http://purl.obolibrary.org/obo/UBERON_0002113,azimuth,ENSG00000150625.16,13.383242
4,CL:1001131,vasa recta ascending limb cell,http://purl.obolibrary.org/obo/UBERON_0002113,azimuth,ENSG00000251322.9,3.683606
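
The cell above joins the Ensembl IDs and expression values into comma-separated strings and then splits them again. A possible alternative is to build the long-format rows straight from the grouped lists, which avoids the string round trip (for instance, `", ".join` raises a `TypeError` if an `ensembl_id` is `None`). This is a sketch, assuming the same `grouped_matches` layout as above; `grouped_to_long_rows` and the toy values are hypothetical.

```python
def grouped_to_long_rows(grouped):
    """Turn {(cell_id, organ_id, method): {...lists...}} into one dict per gene."""
    rows = []
    for (cell_id, organ_id, annotation_method), details in grouped.items():
        pairs = zip(details["ensembl_ids"], details["mean_expr_values"])
        for ensembl_id, value in pairs:
            try:
                value = float(value)
            except (TypeError, ValueError):
                value = None  # missing or non-numeric expression value
            rows.append({
                "cell_id": cell_id,
                "cell_label": details["cell_label"],
                "organ_id": organ_id,
                "annotation_method": annotation_method,
                "ensembl_id": ensembl_id,
                "mean_expr_value": value,
            })
    return rows


# Toy input (hypothetical identifiers and values)
toy_grouped = {
    ("CL:1001131", "UBERON:toy", "azimuth"): {
        "cell_label": "vasa recta ascending limb cell",
        "ensembl_ids": ["ENSG_TOY_1", None, "ENSG_TOY_3"],
        "mean_expr_values": [5.0, "n/a", None],
    },
    ("CL:1001005", "UBERON:toy", "popv"): {
        "cell_label": "glomerular capillary endothelial cell",
        "ensembl_ids": [],
        "mean_expr_values": [],
    },
}
long_rows = grouped_to_long_rows(toy_grouped)
print(len(long_rows))
```

The result can be passed to `pd.DataFrame(long_rows)`. Note one difference in behavior: a group with empty lists produces no rows here, whereas splitting an empty string yields `['']` and therefore one row with an empty `ensembl_id`.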


In [61]:
# Extract unique organ IDs from the expanded dataframe
unique_organ_ids = expanded_matches_df["organ_id"].unique()

unique_organ_ids_list = unique_organ_ids.tolist()
unique_organ_ids_list

['http://purl.obolibrary.org/obo/UBERON_0002113',
 'http://purl.obolibrary.org/obo/UBERON_0000948',
 'http://purl.obolibrary.org/obo/UBERON_0001255',
 'http://purl.obolibrary.org/obo/UBERON_0002048',
 'http://purl.obolibrary.org/obo/UBERON_0002107']

In [62]:
organ_id_to_organ_name = {
         'http://purl.obolibrary.org/obo/UBERON_0002113' : 'Kidney',
         'http://purl.obolibrary.org/obo/UBERON_0000948' : 'heart',
         'http://purl.obolibrary.org/obo/UBERON_0001255': 'urinary bladder',
         'http://purl.obolibrary.org/obo/UBERON_0002048' : 'lung',
         'http://purl.obolibrary.org/obo/UBERON_0002107' : 'liver'
        }

In [63]:
# Adding a new column to the dataframe
expanded_matches_df['organ_name'] = expanded_matches_df['organ_id'].map(organ_id_to_organ_name)

# Display the first few rows of the updated dataframe
expanded_matches_df.head() 

Unnamed: 0,cell_id,cell_label,organ_id,annotation_method,ensembl_id,mean_expr_value,organ_name
0,CL:1001131,vasa recta ascending limb cell,http://purl.obolibrary.org/obo/UBERON_0002113,azimuth,ENSG00000130300.9,5.059016,Kidney
1,CL:1001131,vasa recta ascending limb cell,http://purl.obolibrary.org/obo/UBERON_0002113,azimuth,ENSG00000169744.13,40.209354,Kidney
2,CL:1001131,vasa recta ascending limb cell,http://purl.obolibrary.org/obo/UBERON_0002113,azimuth,ENSG00000072163.20,4.431148,Kidney
3,CL:1001131,vasa recta ascending limb cell,http://purl.obolibrary.org/obo/UBERON_0002113,azimuth,ENSG00000150625.16,13.383242,Kidney
4,CL:1001131,vasa recta ascending limb cell,http://purl.obolibrary.org/obo/UBERON_0002113,azimuth,ENSG00000251322.9,3.683606,Kidney


In [64]:
# Load the new CSV file
ftu_file_path = 'ref_data/FTU Cell Count Table - Cell_Type_Count.csv'
ftu_df = pd.read_csv(ftu_file_path)

# Select and clean up relevant columns from FTU dataframe
ftu_cleaned_df = ftu_df[['Organ', 'CT Label in CL', 'FTU Label in Uberon']].dropna().drop_duplicates()

# Ensure unique mapping by dropping duplicates based on 'CT Label in CL'
ftu_unique_df = ftu_cleaned_df.drop_duplicates(subset=['CT Label in CL'])

# Perform the match and add the new column for FTU name
expanded_matches_df['FTU_name'] = expanded_matches_df['cell_label'].map(
    ftu_unique_df.set_index('CT Label in CL')['FTU Label in Uberon']
)

# Adding the organ name from FTU data
expanded_matches_df['Organ_name_FTU'] = expanded_matches_df['cell_label'].map(
    ftu_unique_df.set_index('CT Label in CL')['Organ']
)

# Rearrange columns as per the requirement
expanded_matches_df = expanded_matches_df[[
    'organ_name', 'Organ_name_FTU','organ_id', 'annotation_method', 'FTU_name', 'cell_id', 'cell_label', 'ensembl_id', 'mean_expr_value'
]]

# Rename columns for consistency
expanded_matches_df = expanded_matches_df.rename(columns={
    'Organ_name_FTU': 'organ_name_in_FTU'
})

# Display the first few rows of the updated dataframe
expanded_matches_df.head()

Unnamed: 0,organ_name,organ_name_in_FTU,organ_id,annotation_method,FTU_name,cell_id,cell_label,ensembl_id,mean_expr_value
0,Kidney,Kidney,http://purl.obolibrary.org/obo/UBERON_0002113,azimuth,Thick Ascending Loop Of Henle,CL:1001131,vasa recta ascending limb cell,ENSG00000130300.9,5.059016
1,Kidney,Kidney,http://purl.obolibrary.org/obo/UBERON_0002113,azimuth,Thick Ascending Loop Of Henle,CL:1001131,vasa recta ascending limb cell,ENSG00000169744.13,40.209354
2,Kidney,Kidney,http://purl.obolibrary.org/obo/UBERON_0002113,azimuth,Thick Ascending Loop Of Henle,CL:1001131,vasa recta ascending limb cell,ENSG00000072163.20,4.431148
3,Kidney,Kidney,http://purl.obolibrary.org/obo/UBERON_0002113,azimuth,Thick Ascending Loop Of Henle,CL:1001131,vasa recta ascending limb cell,ENSG00000150625.16,13.383242
4,Kidney,Kidney,http://purl.obolibrary.org/obo/UBERON_0002113,azimuth,Thick Ascending Loop Of Henle,CL:1001131,vasa recta ascending limb cell,ENSG00000251322.9,3.683606
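
Because `drop_duplicates(subset=['CT Label in CL'])` keeps only the first row for each cell label, a cell type that is listed under several FTUs is assigned to just one of them. Before relying on the mapping, it may be worth listing such labels. The sketch below assumes the column names used above; the helper name `labels_with_multiple_ftus` and the toy table are hypothetical.

```python
import pandas as pd


def labels_with_multiple_ftus(ftu_df,
                              label_col="CT Label in CL",
                              ftu_col="FTU Label in Uberon"):
    """Return a Series of cell labels that occur in more than one FTU,
    with the number of distinct FTUs as the value."""
    counts = (
        ftu_df.dropna(subset=[label_col, ftu_col])
        .groupby(label_col)[ftu_col]
        .nunique()
    )
    return counts[counts > 1].sort_index()


# Toy table (hypothetical labels)
toy_ftu = pd.DataFrame({
    "Organ": ["Kidney", "Kidney", "Kidney", "Lung"],
    "CT Label in CL": ["cell a", "cell a", "cell b", "cell c"],
    "FTU Label in Uberon": ["ftu 1", "ftu 2", "ftu 1", "ftu 3"],
})
ambiguous = labels_with_multiple_ftus(toy_ftu)
print(ambiguous)
```

Applied to the notebook data, `labels_with_multiple_ftus(ftu_cleaned_df)` would show which `FTU_name` values depend on row order.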


In [65]:
expanded_matches_df.to_csv('output/vasculature/gene_information-for-vasculature-CTs_1.csv', index=False)

#### Generate box plot images for each unique cell label

In [69]:
import matplotlib.pyplot as plt
import seaborn as sns

# Create individual box plots for each cell_label
unique_cell_labels = expanded_matches_df['cell_label'].unique()

for cell_label in unique_cell_labels:
    cell_label_data = expanded_matches_df[expanded_matches_df['cell_label'] == cell_label]
    
    if len(cell_label_data) > 1:  # Ensure there is enough data to plot
        plt.figure(figsize=(8, 6))
        sns.boxplot(data=cell_label_data, y='mean_expr_value', color='skyblue')
        
        # Customize the box plot
        plt.xticks([])
        plt.title(f'Box Plot of Mean Expression Values for {cell_label}')
        plt.xlabel('Cell Label')
        plt.ylabel('Mean Expression Value')
        
        # Adding additional visual elements
        plt.axhline(y=cell_label_data['mean_expr_value'].median(), color='red', linestyle='-', label='Median')
        plt.legend()

        # Save each plot as an image file
        individual_box_plot_file_path = f'output/vasculature/figures/box_plot_mean_expr_value_{cell_label.replace(" ", "_")}.png'
        plt.tight_layout()
        plt.savefig(individual_box_plot_file_path)
        plt.close()
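
`plt.savefig` does not create missing directories, and a cell label containing a character such as `/` would produce an invalid path. A small helper can handle both. This is a sketch only; `plot_path` is a hypothetical name, and the file name prefix mirrors the one used above.

```python
import re
import tempfile
from pathlib import Path


def plot_path(out_dir, cell_label, suffix=""):
    """Build a file path for a box plot, creating the directory if needed.

    Runs of characters other than letters and digits are replaced with "_".
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    slug = re.sub(r"[^A-Za-z0-9]+", "_", cell_label).strip("_")
    if suffix:
        slug = f"{slug}_{suffix}"
    return out_dir / f"box_plot_mean_expr_value_{slug}.png"


# Demonstration in a temporary directory so nothing is left on disk
with tempfile.TemporaryDirectory() as tmp:
    example = plot_path(Path(tmp) / "figures", "endothelial cell of artery", "azimuth")
    example_name = example.name
    example_dir_exists = example.parent.is_dir()

print(example_name)
```

For labels that contain only letters and spaces, the resulting names match the ones produced by `cell_label.replace(" ", "_")`.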

#### Generate box plot images for each unique cell label per annotation method

In [67]:
unique_annotation_methods = expanded_matches_df['annotation_method'].explode().unique()
unique_annotation_methods

array(['azimuth', 'celltypist', 'popv'], dtype=object)

In [70]:
unique_annotation_methods = expanded_matches_df['annotation_method'].explode().unique()

for cell_label in unique_cell_labels:
    for annotation_method in unique_annotation_methods:
        cell_label_data = expanded_matches_df[
            (expanded_matches_df['cell_label'] == cell_label) &
            (expanded_matches_df['annotation_method'] == annotation_method)
        ]
        
        if len(cell_label_data) > 1:  # Ensure there is enough data to plot
            plt.figure(figsize=(8, 6))
            sns.boxplot(data=cell_label_data, y='mean_expr_value', color='skyblue')
            
            # Customize the box plot
            plt.xticks([])
            plt.title(f'Box Plot of Mean Expression Values for {cell_label} ({annotation_method})')
            plt.xlabel('Cell Label')
            plt.ylabel('Mean Expression Value')
            
            # Adding additional visual elements
            plt.axhline(y=cell_label_data['mean_expr_value'].median(), color='red', linestyle='-', label='Median')
            plt.legend()

            # Save each plot as an image file
            individual_box_plot_file_path = f'output/vasculature/figures/box_plot_mean_expr_value_{cell_label.replace(" ", "_")}_{annotation_method.replace(" ", "_")}.png'
            plt.tight_layout()
            plt.savefig(individual_box_plot_file_path)
            plt.close()

# List of generated plot file paths
plot_file_paths = [f'output/vasculature/figures/box_plot_mean_expr_value_{cell_label.replace(" ", "_")}_{annotation_method.replace(" ", "_")}.png'
                   for cell_label in unique_cell_labels
                   for annotation_method in unique_annotation_methods
                   if len(expanded_matches_df[(expanded_matches_df['cell_label'] == cell_label) & (expanded_matches_df['annotation_method'] == annotation_method)]) > 1]

plot_file_paths

['output/vasculature/figures/box_plot_mean_expr_value_vasa_recta_ascending_limb_cell_azimuth.png',
 'output/vasculature/figures/box_plot_mean_expr_value_glomerular_capillary_endothelial_cell_azimuth.png',
 'output/vasculature/figures/box_plot_mean_expr_value_capillary_endothelial_cell_azimuth.png',
 'output/vasculature/figures/box_plot_mean_expr_value_capillary_endothelial_cell_celltypist.png',
 'output/vasculature/figures/box_plot_mean_expr_value_capillary_endothelial_cell_popv.png',
 'output/vasculature/figures/box_plot_mean_expr_value_endothelial_cell_of_artery_azimuth.png',
 'output/vasculature/figures/box_plot_mean_expr_value_endothelial_cell_of_artery_celltypist.png',
 'output/vasculature/figures/box_plot_mean_expr_value_endothelial_cell_of_artery_popv.png',
 'output/vasculature/figures/box_plot_mean_expr_value_vein_endothelial_cell_celltypist.png',
 'output/vasculature/figures/box_plot_mean_expr_value_vein_endothelial_cell_popv.png',
 'output/vasculature/figures/box_plot_mean_ex

In [26]:
import json

file_path = 'ref_data/atlas-enriched-dataset-graph.jsonld'

with open(file_path, 'r') as file:
    data = json.load(file)

# Extract cell IDs from the summary section
def extract_cell_ids(data):
    cell_ids = []
    for item in data['@graph']:
        samples = item.get("samples", [])
        for sample in samples:
            sections = sample.get("sections", [])
            for section in sections:
                datasets = section.get("datasets", [])
                for dataset in datasets:
                    summaries = dataset.get("summaries", [])               
                    for summary in summaries:
                        sum_details = summary.get("summary", [])
                        for sum_detail in sum_details:
                            cell_id = sum_detail.get("cell_id")
                            cell_ids.append(cell_id)
    return cell_ids

cell_ids = extract_cell_ids(data)
print(f"Cell IDs from the file: {len(cell_ids)}")

Cell IDs from the file: 12650


In [45]:
# Load the data
file_path = 'ref_data/atlas-as-cell-summaries.jsonld'

with open(file_path, 'r') as file:
    data = json.load(file)

def extract_cell_ids_summaries(data):
    cell_ids = []
    for item in data['@graph']:
        summaries = item.get('summary', [])
        for summary in summaries:
            # Each summary entry is a dict; record its cell_id when present
            if 'cell_id' in summary:
                cell_ids.append(summary['cell_id'])
    return cell_ids

cell_ids_summaries = extract_cell_ids_summaries(data)
print(f"Cell IDs from the file: {len(cell_ids_summaries)}")

Cell IDs from the file: 6072
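
The two counts above are lengths of lists that can contain the same cell ID many times, so they are not directly comparable. Comparing the sets of distinct IDs is one way to see how the two files relate. The following is a sketch with hypothetical toy lists; `compare_id_lists` is not part of the notebook.

```python
def compare_id_lists(ids_a, ids_b):
    """Summarize two lists of cell IDs: totals, distinct counts, and overlap.

    None entries (records without a cell_id) count toward the totals
    but are ignored when building the sets.
    """
    set_a = {i for i in ids_a if i is not None}
    set_b = {i for i in ids_b if i is not None}
    return {
        "total_a": len(ids_a),
        "total_b": len(ids_b),
        "unique_a": len(set_a),
        "unique_b": len(set_b),
        "shared": sorted(set_a & set_b),
        "only_a": sorted(set_a - set_b),
        "only_b": sorted(set_b - set_a),
    }


# Toy lists (hypothetical contents)
report = compare_id_lists(
    ["CL:0002144", "CL:0002144", "CL:1000413", None],
    ["CL:0002144", "CL:0002543"],
)
print(report)
```

In the notebook this could be called as `compare_id_lists(cell_ids, cell_ids_summaries)`.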
