## Download and preprocess anatomogram data from the EMBL-EBI

Download links:
1. Kidney: https://www.ebi.ac.uk/gxa/sc/experiment/E-CURD-119/download/zip?fileType=normalised
2. Liver: https://www.ebi.ac.uk/gxa/sc/experiment/E-MTAB-10553/download/zip?fileType=normalised
3. Lung: https://www.ebi.ac.uk/gxa/sc/experiment/E-GEOD-130148/download/zip?fileType=normalised
4. Pancreas : https://www.ebi.ac.uk/gxa/sc/experiment/E-MTAB-5061/download/zip?fileType=normalised

Experimental design files:
1. Kidney: https://www.ebi.ac.uk/gxa/sc/experiment/E-CURD-119/download?fileType=experiment-design
2. Liver: https://www.ebi.ac.uk/gxa/sc/experiment/E-MTAB-10553/download?fileType=experiment-design
3. Lung: https://www.ebi.ac.uk/gxa/sc/experiment/E-GEOD-130148/download?fileType=experiment-design
4. Pancreas : https://www.ebi.ac.uk/gxa/sc/experiment/E-MTAB-5061/download?fileType=experiment-design

SCEA websites: 
1. Kidney: https://www.ebi.ac.uk/gxa/sc/experiments/E-CURD-119/downloads
2. Liver: https://www.ebi.ac.uk/gxa/sc/experiments/E-MTAB-10553/downloads
3. Lung: https://www.ebi.ac.uk/gxa/sc/experiments/E-GEOD-130148/downloads
4. Pancreas : https://www.ebi.ac.uk/gxa/sc/experiments/E-MTAB-5061/downloads

## Install and import libraries

In [15]:
%pip install numpy pandas scanpy anndata mygene ipywidgets hra_jupyter_widgets

import os
from pathlib import Path
import shutil
import numpy as np
import pandas as pd
import scanpy as sc
import anndata as ad
import csv
import mygene
import ipywidgets as widgets
# Import hra-jupyter-widgets. For documentation, please see https://github.com/x-atlas-consortia/hra-jupyter-widgets/blob/main/usage.ipynb
from hra_jupyter_widgets import FtuExplorerSmall

import warnings
warnings.filterwarnings("ignore")

Note: you may need to restart the kernel to use updated packages.


## Global settings

In [16]:
organ_links = {
  'kidney':{
      'url_counts': 'https://www.ebi.ac.uk/gxa/sc/experiment/E-CURD-119/download/zip?fileType=normalised',
      'url_experimental_design': 'https://www.ebi.ac.uk/gxa/sc/experiment/E-CURD-119/download?fileType=experiment-design',
      'experiment_id': 'E-CURD-119'
  }
}

## Functions

In [17]:
def download_file(url:str, file_name:str, sub_folder_name:str):
  """Downloads 

  Args:
      url (str): URL for file download
      file_name (str): file name
      subfolder_name (str): subfolder name
  """
  # Make sure the data folder is present
  folder_path = f"data/{sub_folder_name}"

  if not os.path.exists(folder_path):
      os.makedirs(folder_path)
      print(f"Folder '{folder_path}' created.")
  else:
      print(f"Folder '{folder_path}' already exists.")

  # Define the path to the file. 
  file_path = f'{folder_path}/{file_name}'

  # Check if the file exists
  if not os.path.exists(file_path):
      # If the file doesn't exist, run the curl command
      !curl -L {url} -o {file_path}
      print(f"File downloaded and saved at {file_path}")
  else:
      print(f"File already exists at {file_path}")

In [18]:
def unzip_to_folder(file_path: str, target_folder: str):
    """
    Unzip the file at the specified file_path into target_folder,
    but only if the folder is empty.

    Args:
        file_path (str): Path to the .zip (or other archive) file.
        target_folder (str): Path where the archive should be extracted.
    """
    target = Path(target_folder)

    # Exclude the archive itself when checking contents
    if target.exists() and any(
        p.is_file() and p.suffix != ".zip" and "ExpDesign" not in p.name
        for p in target.iterdir()
    ):
        print(f"Skipped: {target} already contains extracted files.")
        return

    # Otherwise, unzip
    shutil.unpack_archive(file_path, target)
    print(f"Unzipped {file_path} → {target}")

In [19]:
def download_anatomogram_data(url_counts:str, url_experiment:str, experiment_name:str, organ_name:str):
  """Download and unzip anatomogram data for a given organ.

    Args:
        url_counts (str): The URL to download the data from.
        url_experiment (str): The URL to download the experimental metadata from.
        url_experiment (str): The name for the experimental design file.
        organ_name (str): The name of the organ (used for file and folder names).
  """
  download_file(url_counts, f'{organ_name}.zip', f'{organ_name}')
  download_file(url_experiment, f'{experiment_name}', f'{organ_name}')
  unzip_to_folder(f'data/{organ_name}/{organ_name}.zip', f'data/{organ_name}')

## Process kidney data

In [20]:
download_anatomogram_data(
  'https://www.ebi.ac.uk/gxa/sc/experiment/E-CURD-119/download/zip?fileType=normalised', 
  'https://www.ebi.ac.uk/gxa/sc/experiment/E-CURD-119/download?fileType=experiment-design',
  'ExpDesign-E-CURD-119',
  'kidney'
  )

Folder 'data/kidney' already exists.
File already exists at data/kidney/kidney.zip
Folder 'data/kidney' already exists.
File already exists at data/kidney/ExpDesign-E-CURD-119
Skipped: data\kidney already contains extracted files.


In [21]:
# anndata_kidney = sc.read_mtx('data/kidney/E-CURD-119.aggregated_filtered_normalised_counts.mtx')
# df_kidney = anndata_kidney.to_df()
# df_kidney

In [22]:
# # Load genes and cell type information for kidney
# rows_kidney = pd.read_csv('data/kidney/E-CURD-119.aggregated_filtered_normalised_counts.mtx_rows',names=['col1', 'col2'], sep='\t').drop(['col2'], axis=1)
# cols_kidney = pd.read_csv('data/kidney/E-CURD-119.aggregated_filtered_normalised_counts.mtx_cols', names=['col1'])

In [23]:
# rows_kidney

In [24]:
# cols_kidney

In [25]:
# ref_data_kidney = pd.read_csv(
#     'data/kidney/ExpDesign-E-CURD-119.tsv', sep='\t')
# ref_data_kidney

## Process liver data

In [26]:
download_anatomogram_data('https://www.ebi.ac.uk/gxa/sc/experiment/E-MTAB-10553/download/zip?fileType=normalised', 'https://www.ebi.ac.uk/gxa/sc/experiment/E-MTAB-10553/download?fileType=experiment-design', 'ExpDesign-E-MTAB-10553', 'liver')

Folder 'data/liver' already exists.
File already exists at data/liver/liver.zip
Folder 'data/liver' already exists.
File already exists at data/liver/ExpDesign-E-MTAB-10553
Skipped: data\liver already contains extracted files.


## Process lung data

In [27]:
download_anatomogram_data('https://www.ebi.ac.uk/gxa/sc/experiment/E-GEOD-130148/download/zip?fileType=normalised', 'https://www.ebi.ac.uk/gxa/sc/experiment/E-GEOD-130148/download?fileType=experiment-design', 'ExpDesign-E-GEOD-130148', 'lung')

Folder 'data/lung' already exists.
File already exists at data/lung/lung.zip
Folder 'data/lung' already exists.
File already exists at data/lung/ExpDesign-E-GEOD-130148
Skipped: data\lung already contains extracted files.


## Process pancreas

In [28]:
download_anatomogram_data('https://www.ebi.ac.uk/gxa/sc/experiment/E-MTAB-5061/download/zip?fileType=normalised', 'https://www.ebi.ac.uk/gxa/sc/experiment/E-MTAB-5061/download?fileType=experiment-design','ExpDesign-/E-MTAB-5061', 'pancreas')

Folder 'data/pancreas' already exists.
File already exists at data/pancreas/pancreas.zip
Folder 'data/pancreas' already exists.
File downloaded and saved at data/pancreas/ExpDesign-/E-MTAB-5061
Skipped: data\pancreas already contains extracted files.


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed


  0 2234k    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
curl: (23) client returned ERROR on write of 6993 bytes


## Filter to only keep CTs in FTUs

In [29]:
# neet to get listing of AS-CT combos in tables only for FTUs and then the same for crosswalks