## Download and preprocess anatomogram data from EMBL-EBI

Download links:
1. Kidney: https://www.ebi.ac.uk/gxa/sc/experiment/E-CURD-119/download/zip?fileType=normalised
2. Liver: https://www.ebi.ac.uk/gxa/sc/experiment/E-MTAB-10553/download/zip?fileType=normalised
3. Lung: https://www.ebi.ac.uk/gxa/sc/experiment/E-GEOD-130148/download/zip?fileType=normalised
4. Pancreas: https://www.ebi.ac.uk/gxa/sc/experiment/E-MTAB-5061/download/zip?fileType=normalised

Experimental design files:
1. Kidney: https://www.ebi.ac.uk/gxa/sc/experiment/E-CURD-119/download?fileType=experiment-design
2. Liver: https://www.ebi.ac.uk/gxa/sc/experiment/E-MTAB-10553/download?fileType=experiment-design
3. Lung: https://www.ebi.ac.uk/gxa/sc/experiment/E-GEOD-130148/download?fileType=experiment-design
4. Pancreas: https://www.ebi.ac.uk/gxa/sc/experiment/E-MTAB-5061/download?fileType=experiment-design

Single Cell Expression Atlas (SCEA) download pages:
1. Kidney: https://www.ebi.ac.uk/gxa/sc/experiments/E-CURD-119/downloads
2. Liver: https://www.ebi.ac.uk/gxa/sc/experiments/E-MTAB-10553/downloads
3. Lung: https://www.ebi.ac.uk/gxa/sc/experiments/E-GEOD-130148/downloads
4. Pancreas: https://www.ebi.ac.uk/gxa/sc/experiments/E-MTAB-5061/downloads

## Install and import libraries

In [56]:
%pip install numpy pandas scanpy anndata mygene ipywidgets hra_jupyter_widgets

import os
from pathlib import Path
import shutil
import numpy as np
import pandas as pd
import scanpy as sc
import anndata as ad
import csv
import mygene
import ipywidgets as widgets
# Import hra-jupyter-widgets. For documentation, please see https://github.com/x-atlas-consortia/hra-jupyter-widgets/blob/main/usage.ipynb
from hra_jupyter_widgets import FtuExplorerSmall

import warnings
warnings.filterwarnings("ignore")




## Global settings

In [57]:
organ_metadata = [
    {
        'name': 'kidney',
        'url_counts': 'https://www.ebi.ac.uk/gxa/sc/experiment/E-CURD-119/download/zip?fileType=normalised',
        'url_experimental_design': 'https://www.ebi.ac.uk/gxa/sc/experiment/E-CURD-119/download?fileType=experiment-design',
        'experiment_id': 'E-CURD-119'
    },
    {
        'name': 'liver',
        'url_counts': 'https://www.ebi.ac.uk/gxa/sc/experiment/E-MTAB-10553/download/zip?fileType=normalised',
        'url_experimental_design': 'https://www.ebi.ac.uk/gxa/sc/experiment/E-MTAB-10553/download?fileType=experiment-design',
        'experiment_id': 'E-MTAB-10553'
    },
    {
        'name': 'lung',
        'url_counts': 'https://www.ebi.ac.uk/gxa/sc/experiment/E-GEOD-130148/download/zip?fileType=normalised',
        'url_experimental_design': 'https://www.ebi.ac.uk/gxa/sc/experiment/E-GEOD-130148/download?fileType=experiment-design',
        'experiment_id': 'E-GEOD-130148'
    },
    {
        'name': 'pancreas',
        'url_counts': 'https://www.ebi.ac.uk/gxa/sc/experiment/E-MTAB-5061/download/zip?fileType=normalised',
        'url_experimental_design': 'https://www.ebi.ac.uk/gxa/sc/experiment/E-MTAB-5061/download?fileType=experiment-design',
        'experiment_id': 'E-MTAB-5061'
    }
]
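
The four entries differ only in the organ name and the experiment accession, so both URLs can be derived instead of typed out. The following is a minimal sketch, assuming the SCEA URL pattern shown above stays stable; `SCEA_BASE`, `build_organ_metadata`, and `organ_metadata_derived` are hypothetical names that are not part of the notebook.

```python
# Hypothetical helper: derive both download URLs from the experiment accession.
SCEA_BASE = "https://www.ebi.ac.uk/gxa/sc/experiment"


def build_organ_metadata(experiments: dict) -> list:
    """Return one metadata dict per organ, in the same shape as organ_metadata."""
    return [
        {
            "name": organ,
            "url_counts": f"{SCEA_BASE}/{exp_id}/download/zip?fileType=normalised",
            "url_experimental_design": f"{SCEA_BASE}/{exp_id}/download?fileType=experiment-design",
            "experiment_id": exp_id,
        }
        for organ, exp_id in experiments.items()
    ]


organ_metadata_derived = build_organ_metadata({
    "kidney": "E-CURD-119",
    "liver": "E-MTAB-10553",
    "lung": "E-GEOD-130148",
    "pancreas": "E-MTAB-5061",
})
```

Adding a fifth organ would then only require one more `name: accession` pair.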

## Functions

In [58]:
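def download_file(url:str, file_name:str, sub_folder_name:str):
  """Downloads a file into data/<sub_folder_name>/ unless it is already present.

  Args:
      url (str): URL for file download
      file_name (str): file name
      sub_folder_name (str): subfolder name (created under data/ if missing)
  """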
  # Make sure the data folder is present
  folder_path = f"data/{sub_folder_name}"

  if not os.path.exists(folder_path):
      os.makedirs(folder_path)
      print(f"Folder '{folder_path}' created.")
  else:
      print(f"Folder '{folder_path}' already exists.")

  # Define the path to the file. 
  file_path = f'{folder_path}/{file_name}'

  # Check if the file exists
  if not os.path.exists(file_path):
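      # If the file doesn't exist, download it with curl (IPython shell escape; -L follows redirects).
      # The URL is quoted because it contains '?', which the shell could otherwise interpret.
      !curl -L "{url}" -o "{file_path}"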
      print(f"File downloaded and saved at {file_path}")
  else:
      print(f"File already exists at {file_path}")
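
The `!curl` line only works inside IPython or Jupyter. If the function ever needs to run as a plain Python script, the same behaviour can be sketched with the standard library. This is an assumption-laden alternative rather than the notebook's method: `download_file_py` and its `root` parameter are hypothetical, and `urllib.request.urlopen` follows HTTP redirects by default, which roughly corresponds to `curl -L`.

```python
import shutil
import urllib.request
from pathlib import Path


def download_file_py(url: str, file_name: str, sub_folder_name: str, root: str = "data") -> Path:
    """Download url to <root>/<sub_folder_name>/<file_name> unless it already exists."""
    folder = Path(root) / sub_folder_name
    folder.mkdir(parents=True, exist_ok=True)  # no error if the folder is already there
    file_path = folder / file_name
    if file_path.exists():
        return file_path

    # Write to a temporary name first so that an interrupted download
    # is not mistaken for a complete file on the next run.
    tmp_path = file_path.with_name(file_path.name + ".part")
    with urllib.request.urlopen(url) as response, open(tmp_path, "wb") as out:
        shutil.copyfileobj(response, out)
    tmp_path.replace(file_path)
    return file_path
```

A call such as `download_file_py(organ['url_counts'], 'kidney.zip', 'kidney')` would then mirror the notebook's `download_file` call.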

In [59]:
def unzip_to_folder(file_path: str, target_folder: str):
    """
    Unzip the file at the specified file_path into target_folder,
    but only if the folder holds no files other than .zip archives and .tsv files.

    Args:
        file_path (str): Path to the .zip (or other archive) file.
        target_folder (str): Path where the archive should be extracted.
    """
    target = Path(target_folder)

    # Ignore the archive itself and the experiment design .tsv when checking contents
    if target.exists() and any(
        p.is_file() and p.suffix != ".zip" and ".tsv" not in p.name
        for p in target.iterdir()
    ):
        print(f"Skipped: {target} already contains extracted files.")
        return

    # Otherwise, unzip
    shutil.unpack_archive(file_path, target)
    print(f"Unzipped {file_path} → {target}")
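
The check above infers "already extracted" from the presence of any file that is not a `.zip` or `.tsv`. A stricter variant can ask the archive which files it contains and confirm each one exists. The sketch below assumes the download is a zip archive; `archive_is_extracted` is a hypothetical helper, not part of the notebook.

```python
import zipfile
from pathlib import Path


def archive_is_extracted(file_path: str, target_folder: str) -> bool:
    """Return True if every file listed in the zip archive exists under target_folder."""
    target = Path(target_folder)
    with zipfile.ZipFile(file_path) as archive:
        # Directory entries end with "/" and are skipped; only files are checked.
        members = [name for name in archive.namelist() if not name.endswith("/")]
    return all((target / name).exists() for name in members)
```

This also catches a partially extracted archive, which the suffix-based check would treat as complete.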

In [60]:
def download_anatomogram_data(url_counts:str, url_experiment:str, experiment_name:str, organ_name:str):
  """Download and unzip anatomogram data for a given organ.

  Args:
      url_counts (str): The URL to download the normalised counts archive from.
      url_experiment (str): The URL to download the experimental design file from.
      experiment_name (str): The name for the experimental design file (without the .tsv extension).
      organ_name (str): The name of the organ (used for file and folder names).
  """
  download_file(url_counts, f'{organ_name}.zip', f'{organ_name}')
  download_file(url_experiment, f'{experiment_name}.tsv', f'{organ_name}')
  unzip_to_folder(f'data/{organ_name}/{organ_name}.zip', f'data/{organ_name}')

## Download all data

In [61]:
for organ in organ_metadata:
  download_anatomogram_data(
      organ['url_counts'],
      organ['url_experimental_design'],
      organ['experiment_id'],
      organ['name']
  )

Folder 'data/kidney' already exists.
File already exists at data/kidney/kidney.zip
Folder 'data/kidney' already exists.
File already exists at data/kidney/E-CURD-119.tsv
Skipped: data\kidney already contains extracted files.
Folder 'data/liver' already exists.
File already exists at data/liver/liver.zip
Folder 'data/liver' already exists.
File already exists at data/liver/E-MTAB-10553.tsv
Skipped: data\liver already contains extracted files.
Folder 'data/lung' already exists.
File already exists at data/lung/lung.zip
Folder 'data/lung' already exists.
File already exists at data/lung/E-GEOD-130148.tsv
Skipped: data\lung already contains extracted files.
Folder 'data/pancreas' already exists.
File already exists at data/pancreas/pancreas.zip
Folder 'data/pancreas' already exists.
File already exists at data/pancreas/E-MTAB-5061.tsv
Skipped: data\pancreas already contains extracted files.


## Process kidney data

In [62]:
anndata_kidney = sc.read_mtx('data/kidney/E-CURD-119.aggregated_filtered_normalised_counts.mtx')
df_kidney = anndata_kidney.to_df()
df_kidney

In [None]:
# AnnData has no `columns` attribute; the axis labels live in obs_names and var_names
anndata_kidney.var_names

In [None]:
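# groupby must name a column in anndata_kidney.obs that holds the group labels (e.g., cell types).
# No such column has been added at this point, so the call is kept disabled until one exists.
# sc.tl.rank_genes_groups(anndata_kidney, groupby='<obs column with cell types>', n_genes=10)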

In [64]:
# Load gene IDs (matrix rows) and cell barcodes (matrix columns) for kidney
rows_kidney = pd.read_csv('data/kidney/E-CURD-119.aggregated_filtered_normalised_counts.mtx_rows',names=['col1', 'col2'], sep='\t').drop(['col2'], axis=1)
cols_kidney = pd.read_csv('data/kidney/E-CURD-119.aggregated_filtered_normalised_counts.mtx_cols', names=['col1'])
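
The `.mtx_rows` file is read with two columns and the second is then dropped. As a small sketch, the unwanted column can instead be skipped at read time with `usecols`. The example uses an in-memory stand-in for the file; the content of the second column and the names `gene_id` and `gene_id_dup` are assumptions made for illustration.

```python
import io

import pandas as pd

# Stand-in for an .mtx_rows file: two tab-separated columns per line.
# What the second column holds in the real file is an assumption here.
rows_text = (
    "ENSG00000000003\tENSG00000000003\n"
    "ENSG00000000005\tENSG00000000005\n"
)

genes = pd.read_csv(
    io.StringIO(rows_text),
    sep="\t",
    names=["gene_id", "gene_id_dup"],  # name both columns in the file
    usecols=["gene_id"],               # but load only the first one
)
```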

In [65]:
rows_kidney

Unnamed: 0,col1
0,ENSG00000000003
1,ENSG00000000005
2,ENSG00000000419
3,ENSG00000000457
4,ENSG00000000460
...,...
38912,ENSG00000290147
38913,ENSG00000290149
38914,ENSG00000290163
38915,ENSG00000290164


In [66]:
cols_kidney

Unnamed: 0,col1
0,SAMN15040593-AAACCTGAGGACATTA
1,SAMN15040593-AAACCTGCAGCTCGAC
2,SAMN15040593-AAACCTGCAGTATAAG
3,SAMN15040593-AAACCTGCATGCCTAA
4,SAMN15040593-AAACCTGGTATAGTAG
...,...
30584,SAMN15040597-TTTGTCATCACAGGCC
30585,SAMN15040597-TTTGTCATCACCAGGC
30586,SAMN15040597-TTTGTCATCACCGTAA
30587,SAMN15040597-TTTGTCATCCTACAGA


In [67]:
ref_data_kidney = pd.read_csv('data/kidney/E-CURD-119.tsv', sep='\t')
ref_data_kidney.head()

Unnamed: 0,Assay,Sample Characteristic[organism],Sample Characteristic Ontology Term[organism],Sample Characteristic[individual],Sample Characteristic Ontology Term[individual],Sample Characteristic[ethnic group],Sample Characteristic Ontology Term[ethnic group],Sample Characteristic[sex],Sample Characteristic Ontology Term[sex],Sample Characteristic[age],...,Sample Characteristic[organism part],Sample Characteristic Ontology Term[organism part],Sample Characteristic[clinical information],Sample Characteristic Ontology Term[clinical information],Factor Value[sex],Factor Value Ontology Term[sex],Factor Value[inferred cell type - ontology labels],Factor Value Ontology Term[inferred cell type - ontology labels],Factor Value[inferred cell type - authors labels],Factor Value Ontology Term[inferred cell type - authors labels]
0,SAMN15040593-AAACCTGAGGACATTA,Homo sapiens,http://purl.obolibrary.org/obo/NCBITaxon_9606,Healthy5,,European,http://purl.obolibrary.org/obo/HANCESTRO_0005,female,http://purl.obolibrary.org/obo/PATO_0000383,52 year,...,cortex of kidney,http://purl.obolibrary.org/obo/UBERON_0001225,glomerular filtration rate 98 ml/min/1.73 m2...,,female,http://purl.obolibrary.org/obo/PATO_0000383,kidney loop of Henle thick ascending limb epit...,http://purl.obolibrary.org/obo/CL_1001106,thick ascending limb,
1,SAMN15040593-AAACCTGCAGCTCGAC,Homo sapiens,http://purl.obolibrary.org/obo/NCBITaxon_9606,Healthy5,,European,http://purl.obolibrary.org/obo/HANCESTRO_0005,female,http://purl.obolibrary.org/obo/PATO_0000383,52 year,...,cortex of kidney,http://purl.obolibrary.org/obo/UBERON_0001225,glomerular filtration rate 98 ml/min/1.73 m2...,,female,http://purl.obolibrary.org/obo/PATO_0000383,,,,
2,SAMN15040593-AAACCTGCAGTATAAG,Homo sapiens,http://purl.obolibrary.org/obo/NCBITaxon_9606,Healthy5,,European,http://purl.obolibrary.org/obo/HANCESTRO_0005,female,http://purl.obolibrary.org/obo/PATO_0000383,52 year,...,cortex of kidney,http://purl.obolibrary.org/obo/UBERON_0001225,glomerular filtration rate 98 ml/min/1.73 m2...,,female,http://purl.obolibrary.org/obo/PATO_0000383,,,,
3,SAMN15040593-AAACCTGCATGCCTAA,Homo sapiens,http://purl.obolibrary.org/obo/NCBITaxon_9606,Healthy5,,European,http://purl.obolibrary.org/obo/HANCESTRO_0005,female,http://purl.obolibrary.org/obo/PATO_0000383,52 year,...,cortex of kidney,http://purl.obolibrary.org/obo/UBERON_0001225,glomerular filtration rate 98 ml/min/1.73 m2...,,female,http://purl.obolibrary.org/obo/PATO_0000383,,,,
4,SAMN15040593-AAACCTGGTATAGTAG,Homo sapiens,http://purl.obolibrary.org/obo/NCBITaxon_9606,Healthy5,,European,http://purl.obolibrary.org/obo/HANCESTRO_0005,female,http://purl.obolibrary.org/obo/PATO_0000383,52 year,...,cortex of kidney,http://purl.obolibrary.org/obo/UBERON_0001225,glomerular filtration rate 98 ml/min/1.73 m2...,,female,http://purl.obolibrary.org/obo/PATO_0000383,,,,


In [68]:
index_kidney = pd.read_csv(
    'data/kidney/E-CURD-119.aggregated_filtered_normalised_counts.mtx_rows', names=['col1', 'col2'], sep='\t')
cols_kidney = pd.read_csv(
    'data/kidney/E-CURD-119.aggregated_filtered_normalised_counts.mtx_cols', names=['col1'])

In [69]:
index_kidney = index_kidney.drop(['col2'], axis=1)
index_kidney

Unnamed: 0,col1
0,ENSG00000000003
1,ENSG00000000005
2,ENSG00000000419
3,ENSG00000000457
4,ENSG00000000460
...,...
38912,ENSG00000290147
38913,ENSG00000290149
38914,ENSG00000290163
38915,ENSG00000290164


In [70]:
ref_data_kidney = ref_data_kidney.rename(
    columns={'Factor Value[inferred cell type - authors labels]': 'Cell_Type', 'Factor Value Ontology Term[inferred cell type - authors labels]': 'CL_ID'})

In [71]:
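# Take an explicit copy so that later column assignments do not act on a view of ref_data_kidney
ref_data_mod_kidney = ref_data_kidney[['Assay', 'Cell_Type', 'CL_ID']].copy()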

In [72]:
ref_data_mod_kidney['CL_ID'] = ref_data_mod_kidney['CL_ID'].str.split(
    '/').str[-1]

ref_data_mod_kidney['CL_ID'] = ref_data_mod_kidney['CL_ID'].str.replace(
    '_', ':')

ref_data_mod_kidney

Unnamed: 0,Assay,Cell_Type,CL_ID
0,SAMN15040593-AAACCTGAGGACATTA,thick ascending limb,
1,SAMN15040593-AAACCTGCAGCTCGAC,,
2,SAMN15040593-AAACCTGCAGTATAAG,,
3,SAMN15040593-AAACCTGCATGCCTAA,,
4,SAMN15040593-AAACCTGGTATAGTAG,,
...,...,...,...
30584,SAMN15040597-TTTGTCATCACAGGCC,,
30585,SAMN15040597-TTTGTCATCACCAGGC,,
30586,SAMN15040597-TTTGTCATCACCGTAA,connecting tubule,
30587,SAMN15040597-TTTGTCATCCTACAGA,distal convoluted tubule 1,UBERON:0001292
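
The two string operations above turn an ontology PURL into a compact identifier (CURIE) by keeping the last path segment and swapping the underscore for a colon. The same conversion can be sketched as a plain function, which makes the handling of missing values explicit. `iri_to_curie` is a hypothetical helper; unlike `str.replace('_', ':')`, it only converts the first underscore.

```python
def iri_to_curie(iri):
    """Convert an OBO PURL such as .../obo/CL_1001106 to a CURIE such as CL:1001106."""
    if not isinstance(iri, str) or not iri:
        return None  # a missing annotation (e.g., NaN) stays missing
    local_id = iri.rstrip("/").rsplit("/", 1)[-1]   # text after the last slash
    prefix, sep, number = local_id.partition("_")   # split at the first underscore only
    return f"{prefix}:{number}" if sep else local_id


print(iri_to_curie("http://purl.obolibrary.org/obo/CL_1001106"))      # CL:1001106
print(iri_to_curie("http://purl.obolibrary.org/obo/UBERON_0001292"))  # UBERON:0001292
```

As the second call shows, the result is not necessarily a Cell Ontology ID; the prefix depends on what the source column contains.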


In [73]:
# Build a barcode -> cell type lookup from the experimental design table
mapping = ref_data_mod_kidney.set_index('Assay')['Cell_Type']

# Replace each barcode in cols_kidney with its cell type (NaN where no label exists)
cols_kidney['col1'] = cols_kidney['col1'].map(mapping)

# Display the relabelled column
print(cols_kidney)

                             col1
0            thick ascending limb
1                             NaN
2                             NaN
3                             NaN
4                             NaN
...                           ...
30584                         NaN
30585                         NaN
30586           connecting tubule
30587  distal convoluted tubule 1
30588  distal convoluted tubule 1

[30589 rows x 1 columns]
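
The `NaN` rows above come from `Series.map`: any barcode without an entry in the lookup is mapped to `NaN` rather than raising an error. A toy sketch of that behaviour follows; the barcodes and labels are made up for illustration, and each barcode is assumed to appear only once in the lookup.

```python
import pandas as pd

# Made-up barcodes and labels, for illustration only.
barcodes = pd.Series(["cell-1", "cell-2", "cell-3"])
labels = pd.Series({"cell-1": "podocyte", "cell-3": "fibroblast"})

mapped = barcodes.map(labels)                # "cell-2" has no label and becomes NaN
n_unlabelled = int(mapped.isna().sum())      # number of barcodes left without a label
```

Counting the `NaN` values right after the mapping step is a cheap way to confirm how many cells will be lost if unlabelled cells are dropped later.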


In [74]:
cols_kidney.value_counts()

col1                                 
proximal tubule                          5033
thick ascending limb                     4431
distal convoluted tubule 1               2760
connecting tubule                        1805
Type A intercalated cell                 1107
principle cell                           1022
endothelial cell                         1008
parietal epithelial cell                  551
distal convoluted tubule 2                489
podocyte                                  463
proximal tubule with VCAM1 expression     448
Type B intercalated cell                  349
mesangial cell                            239
fibroblast                                206
leukocyte                                  63
Name: count, dtype: int64
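
The counts above sum to 19,974, while the table has 30,589 barcodes, because `value_counts` leaves out `NaN` by default. A short sketch with toy data shows how `dropna=False` makes the unlabelled cells visible; the labels are placeholders, not values from the kidney dataset.

```python
import numpy as np
import pandas as pd

# Toy labels: three annotated cells and three without annotation.
cell_types = pd.Series(["podocyte", np.nan, "podocyte", "fibroblast", np.nan, np.nan])

counts_labelled = cell_types.value_counts()           # NaN is excluded by default
counts_all = cell_types.value_counts(dropna=False)    # NaN gets its own row
fraction_labelled = cell_types.notna().mean()         # share of cells with a label
```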

In [75]:
index_val_kidney = index_kidney['col1'].tolist()  # gene IDs; the column is named 'col1', not 'gene_name'
col_val_kidney = cols_kidney['col1'].tolist()

print(len(index_val_kidney), len(col_val_kidney))

In [None]:
# Transpose so that cells become observations (rows) and genes become variables (columns)
anndata_kidney = anndata_kidney.T
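
The transpose is needed because the matrix file stores genes as rows (the `.mtx_rows` file lists Ensembl gene IDs) and cells as columns (the `.mtx_cols` file lists barcodes), whereas scanpy treats rows as cells. The orientation change can be sketched on a toy matrix with NumPy and pandas; the expression values are invented, and only the IDs are taken from the outputs above.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the count matrix: 3 genes (rows) x 2 cells (columns).
genes_by_cells = np.array([
    [0.0, 1.5],
    [2.0, 0.0],
    [0.0, 0.0],
])
gene_ids = ["ENSG00000000003", "ENSG00000000005", "ENSG00000000419"]
barcodes = ["SAMN15040593-AAACCTGAGGACATTA", "SAMN15040593-AAACCTGCAGCTCGAC"]

# After transposing, each row is a cell and each column is a gene.
cells_by_genes = pd.DataFrame(genes_by_cells.T, index=barcodes, columns=gene_ids)
```

Labelling both axes at this point makes it harder to mix up the two orientations in later steps.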

## Process liver data

In [None]:
# download_anatomogram_data('https://www.ebi.ac.uk/gxa/sc/experiment/E-MTAB-10553/download/zip?fileType=normalised', 'https://www.ebi.ac.uk/gxa/sc/experiment/E-MTAB-10553/download?fileType=experiment-design', 'ExpDesign-E-MTAB-10553', 'liver')

## Process lung data

In [None]:
# download_anatomogram_data('https://www.ebi.ac.uk/gxa/sc/experiment/E-GEOD-130148/download/zip?fileType=normalised', 'https://www.ebi.ac.uk/gxa/sc/experiment/E-GEOD-130148/download?fileType=experiment-design', 'ExpDesign-E-GEOD-130148', 'lung')

## Process pancreas data

In [None]:
# download_anatomogram_data('https://www.ebi.ac.uk/gxa/sc/experiment/E-MTAB-5061/download/zip?fileType=normalised', 'https://www.ebi.ac.uk/gxa/sc/experiment/E-MTAB-5061/download?fileType=experiment-design', 'ExpDesign-E-MTAB-5061', 'pancreas')

## Filter to keep only cell types (CTs) in functional tissue units (FTUs)

In [None]:
# Need to get a listing of AS-CT combos in the tables for FTUs only, and then the same for the crosswalks
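
Once such a listing exists, the filtering step itself could be as simple as a membership test on the cell type column. The sketch below is only a placeholder for that step: `ftu_cell_types` is a made-up set and not the real FTU listing, and the small table stands in for `ref_data_mod_kidney`.

```python
import pandas as pd

# Placeholder set: NOT the real FTU cell type listing, which still has to be compiled.
ftu_cell_types = {"podocyte", "mesangial cell"}

# Toy stand-in for the barcode / cell type table.
cells = pd.DataFrame({
    "Assay": ["cell-1", "cell-2", "cell-3", "cell-4"],
    "Cell_Type": ["podocyte", "fibroblast", None, "mesangial cell"],
})

# Keep only rows whose cell type is in the FTU set; unlabelled cells are dropped too.
cells_in_ftus = cells[cells["Cell_Type"].isin(ftu_cell_types)].reset_index(drop=True)
```

Matching on ontology IDs instead of free-text labels would likely be more robust, since author labels vary between datasets.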