# CellxGene Census Data Functions Documentation

This document describes functions for managing donor-specific data slices from CellxGene Census datasets. These functions retrieve, save, and track donor data for specified datasets one donor at a time, which keeps memory usage low.

---

## 1. `donor_id_information`

Retrieves all available donor IDs for a specified dataset and additional metadata from the census.

**Parameters:**
- `dataset_id` *(str)*: The ID of the dataset to retrieve information for.
- `organism` *(str)*: The organism to query in the census, either `Mus musculus` or `Homo Sapiens`.
- `display_info` *(bool, optional)*: If `True`, displays dataset information as a DataFrame. Default is `False`

**Returns:**
- `metadata_dict` *(dict)*: A dictionary containing:
  - **`dataset_id`**: The dataset ID.
  - **`available_donor_ids`**: A list of all donor IDs in the dataset.
  - **`downloaded_donor_ids`**: A list to track downloaded donor IDs (initially empty).
  - **`dataset_info`**: Additional dataset metadata in dictionary format.


In [11]:
import os
import json
import cellxgene_census
import numpy as np 
import pandas as pd
import anndata

# This function is to retrieve the donor IDs and can be used by itself 
def donor_id_information(dataset_id, organism, display_info=False):
    """
    Retrieves all available donor IDs for the specified dataset from the census.
    
    Parameters:
    - dataset_id (str): The ID of the dataset to retrieve information for.
    - organism (str): The organism to query, either 'Mus musculus' or 'Homo Sapiens'.
    - display_info (bool): Whether to display the dataset information as a DataFrame (default is False).
    
    Returns:
    - metadata_dict (dict): Dictionary containing dataset metadata including available donor IDs.
    """
    # Open the census and close it again once the required tables have been read
    with cellxgene_census.open_soma() as census:
        # Only the columns needed to list the donors are fetched, to keep memory usage low
        metadata = cellxgene_census.get_obs(
            census,
            organism=organism,
            value_filter=f"dataset_id == '{dataset_id}'",
            column_names=["dataset_id", "donor_id"],
        )

        # Read the census datasets table to get additional dataset information
        census_datasets = census["census_info"]["datasets"].read().concat().to_pandas()

    available_donor_ids = set(np.unique(metadata['donor_id']))
    census_datasets = census_datasets.set_index("soma_joinid")

    # Filter for the specific dataset_id
    dataset_info = pd.DataFrame(census_datasets[census_datasets.dataset_id == dataset_id])
    if dataset_info.empty:
        raise ValueError(f"Dataset ID {dataset_id} not found in CellxGene Census.")

    if display_info:
        display(dataset_info)

    # Build the metadata dictionary
    metadata_dict = {
        "dataset_id": dataset_id,
        "available_donor_ids": list(available_donor_ids),  # All IDs from census
        "downloaded_donor_ids": [],  # To track IDs that have been downloaded
        "dataset_info": dataset_info.to_dict(orient="records")  # Additional dataset info
    }

    return metadata_dict

In [2]:
# Example usage of the function donor_id_information
# It is necessary to know the ID of the dataset you need to fetch 
dataset_id = '0895c838-e550-48a3-a777-dbcd35d30272'
donor_id_information(dataset_id, organism = 'Homo Sapiens')

The "stable" release is currently 2024-07-01. Specify 'census_version="2024-07-01"' in future calls to open_soma() to ensure data consistency.


{'dataset_id': '0895c838-e550-48a3-a777-dbcd35d30272',
 'available_donor_ids': ['C70', 'C58', 'C72', 'C41'],
 'downloaded_donor_ids': [],
 'dataset_info': [{'citation': 'Publication: https://doi.org/10.1002/hep4.1854 Dataset Version: https://datasets.cellxgene.cziscience.com/fb76c95f-0391-4fac-9fb9-082ce2430b59.h5ad curated and distributed by CZ CELLxGENE Discover in Collection: https://cellxgene.cziscience.com/collections/44531dd9-1388-4416-a117-af0a99de2294',
   'collection_id': '44531dd9-1388-4416-a117-af0a99de2294',
   'collection_name': 'Single-Cell, Single-Nucleus, and Spatial RNA Sequencing of the Human Liver Identifies Cholangiocyte and Mesenchymal Heterogeneity',
   'collection_doi': '10.1002/hep4.1854',
   'dataset_id': '0895c838-e550-48a3-a777-dbcd35d30272',
   'dataset_version_id': 'fb76c95f-0391-4fac-9fb9-082ce2430b59',
   'dataset_title': 'Healthy human liver: B cells',
   'dataset_h5ad_path': '0895c838-e550-48a3-a777-dbcd35d30272.h5ad',
   'dataset_total_cell_count': 146

## 2. `update_metadata_file`

This function creates a JSON file containing the metadata of a dataset or, if the file already exists, updates it. It appends a donor ID to the list of downloaded donors to keep track of which donor data has been saved.

**Parameters**:

- `dataset_id` *(str)*: The ID of the dataset whose metadata needs to be updated.
- `donor_id` *(str)*: The donor ID to mark as downloaded.
- `organism` *(str)*: The organism to query in the census, either `Mus musculus` or `Homo Sapiens`.

**Process**:

1. Defines the metadata file path using `dataset_id`.
2. If the metadata file already exists, it loads it; if not, it calls `donor_id_information()` to create it.
3. Checks if `donor_id` is already in `downloaded_donor_ids`. If not, it appends it to the list.
4. Saves the updated metadata to the JSON file.
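
The steps above amount to a load-or-create step followed by an idempotent append. A minimal, standard-library-only sketch of that pattern is shown here. The names `mark_donor_downloaded`, `create_metadata`, and `fake_metadata` are hypothetical and exist only for this illustration; `fake_metadata` stands in for the census query so the example runs offline.

```python
import json
import os
import tempfile


def mark_donor_downloaded(metadata_path, donor_id, create_metadata):
    """Record donor_id in the JSON file at metadata_path.

    create_metadata is a hypothetical zero-argument callable standing in for
    donor_id_information(); it is only called when the file does not exist yet.
    Returns True if the donor was newly added, False if it was already recorded.
    """
    if os.path.exists(metadata_path):
        # Steps 1-2: load the existing metadata
        with open(metadata_path, "r") as fh:
            metadata = json.load(fh)
    else:
        # Step 2: build a fresh metadata structure and make sure the folder exists
        metadata = create_metadata()
        os.makedirs(os.path.dirname(metadata_path), exist_ok=True)

    # Step 3: do nothing if the donor is already tracked
    if donor_id in metadata["downloaded_donor_ids"]:
        return False

    # Steps 3-4: append the donor and write the file back
    metadata["downloaded_donor_ids"].append(donor_id)
    with open(metadata_path, "w") as fh:
        json.dump(metadata, fh, indent=4)
    return True


def fake_metadata():
    # Assumption: a stand-in for the census query, not real census output
    return {
        "dataset_id": "demo",
        "available_donor_ids": ["C41", "C58"],
        "downloaded_donor_ids": [],
        "dataset_info": [],
    }


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo", "demo_metadata.json")
    first = mark_donor_downloaded(path, "C58", fake_metadata)
    second = mark_donor_downloaded(path, "C58", fake_metadata)
    with open(path) as fh:
        recorded = json.load(fh)["downloaded_donor_ids"]

print(first, second, recorded)
```

Calling the helper twice with the same donor leaves a single entry in the file, which is the property the tracking relies on.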

**Output**:

Updates or creates a `metadata` JSON file for the dataset, tracking downloaded donor IDs.


**Example structure of the JSON metadata file**
```json
{
    "dataset_id": "0895c838-e550-48a3-a777-dbcd35d30272",
    "available_donor_ids": [
        "C58",
        "C70",
        "C41",
        "C72"
    ],
    "downloaded_donor_ids": [
        "C58",
        "C70",
        "C41"
    ],
    "dataset_info": [
        {
            "citation": "Publication: https://doi.org/10.1002/hep4.1854 Dataset Version: https://datasets.cellxgene.czi...",
            "collection_id": "44531dd9-1388-4416-a117-af0a99de2294",
            "collection_name": "Single-Cell, Single-Nucleus, and Spatial RNA Sequencing of the Human Liver Identifies C...",
            "collection_doi": "10.1002/hep4.1854",
            "dataset_id": "0895c838-e550-48a3-a777-dbcd35d30272",
            "dataset_version_id": "fb76c95f-0391-4fac-9fb9-082ce2430b59",
            "dataset_title": "Healthy human liver: B cells",
            "dataset_h5ad_path": "0895c838-e550-48a3-a777-dbcd35d30272.h5ad",
            "dataset_total_cell_count": 146
        }
    ]
}
```

In [3]:
# These two functions take the dataset_id and the donor_id(s) to retrieve as input
# The census is accessed only if necessary: if the files are already present in the directory, the census is not queried
# The data are stored in slices (one file per donor_id) to keep RAM requirements low
def update_metadata_file(dataset_id, donor_id, organism):
    """
    Updates the metadata file for the specified dataset_id by appending the queried donor_id to downloaded_donor_ids.

    Parameters:
    - dataset_id (str): The ID of the dataset to update.
    - donor_id (str): The donor ID to append to downloaded_donor_ids.
    - organism (str): The organism to query, either 'Mus musculus' or 'Homo Sapiens'.
    """
    # Define the directory and path for the metadata file
    metadata_directory = os.path.join("my_data/adata_slices/", dataset_id)
    metadata_file_path = os.path.join(metadata_directory, f"{dataset_id}_metadata.json")
    
    # Load or create metadata structure
    if os.path.exists(metadata_file_path):
        # Load existing metadata
        with open(metadata_file_path, "r") as metadata_file:
            metadata_dict = json.load(metadata_file)
    else:
        # If file does not exist, retrieve available donor IDs and create metadata structure
        metadata_dict = donor_id_information(dataset_id, organism = organism)
        os.makedirs(metadata_directory, exist_ok=True)  # Ensure the directory exists

    # Ensure donor_id is not already in downloaded_donor_ids before appending
    if donor_id not in metadata_dict["downloaded_donor_ids"]:
        metadata_dict["downloaded_donor_ids"].append(donor_id)

        # Save the updated metadata back to the file
        with open(metadata_file_path, "w") as metadata_file:
            json.dump(metadata_dict, metadata_file, indent=4)

        print(f"Updated metadata for dataset_id '{dataset_id}': added donor_id '{donor_id}' to downloaded_donor_ids.")
    else:
        print(f"donor_id '{donor_id}' already exists in downloaded_donor_ids for dataset_id '{dataset_id}'.")


## 3. `save_adata_slices`

This function retrieves and saves data slices (subsets) of an AnnData object for each specified donor ID and dataset ID combination.

**Parameters**:

- `donor_ids` *(list)*: List of donor IDs for which to retrieve data.
- `dataset_id` *(str)*: The ID of the dataset to retrieve data from.
- `organism` *(str)*: The name of the organism to query in the census, either `Mus musculus` or `Homo Sapiens`.


**Process**:

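1. Loads the dataset's metadata file, or creates it by calling `donor_id_information()` if it does not exist.
2. Determines which of the requested donor IDs have not been downloaded yet.
3. Connects to the CellxGene Census only if at least one donor ID still has to be downloaded.
4. For each remaining `donor_id`, queries the census with the specified `dataset_id` and `donor_id` to retrieve an AnnData slice.
5. Saves the AnnData slice to an `.h5ad` file if it contains observations.
6. Calls `update_metadata_file()` to record the downloaded `donor_id` in the metadata file.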

**Output**:

- Saves each AnnData slice as an .h5ad file in a folder structure organized by dataset.
- Prints the status of each saved file and reports an error if a query fails or no data is found.
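
The function decides what to fetch with a set difference against `downloaded_donor_ids`. It also loads `available_donor_ids`, although it does not use them to validate the request. The sketch below shows one way the requested IDs could be split into three groups before the census is opened. `plan_downloads` is a hypothetical helper, not part of the functions in this document, and `X99` is a made-up donor ID used to show the "unknown" case.

```python
def plan_downloads(requested, available, downloaded):
    """Split requested donor IDs into (to_fetch, already_downloaded, unknown)."""
    requested = set(requested)
    available = set(available)
    downloaded = set(downloaded)

    # IDs the dataset does not contain at all
    unknown = requested - available
    # IDs that already have a slice recorded in the metadata file
    already = requested & downloaded
    # IDs that exist in the dataset but have not been downloaded yet
    to_fetch = requested - downloaded - unknown
    return sorted(to_fetch), sorted(already), sorted(unknown)


to_fetch, already, unknown = plan_downloads(
    requested=["C58", "C72", "X99"],          # "X99" is a made-up ID
    available=["C41", "C58", "C70", "C72"],
    downloaded=["C58", "C70", "C41"],
)
print(to_fetch, already, unknown)
```

With such a split, the census only needs to be opened when `to_fetch` is non-empty, and unknown IDs can be reported without issuing a query.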

In [4]:

def save_adata_slices(donor_ids, dataset_id, organism):
    """
    Retrieves and saves AnnData slices for each donor_id and dataset_id combination.

    Parameters:
    - donor_ids (list): List of donor IDs to retrieve data for.
    - dataset_id (str): ID of the dataset to retrieve data for.
    - organism (str): Organism name to use in the census query ('Mus musculus' or 'Homo Sapiens').
    """
    main_directory = "my_data/adata_slices"
    os.makedirs(main_directory, exist_ok=True)

    
    dataset_directory = os.path.join(main_directory, dataset_id)
    metadata_file_path = os.path.join(dataset_directory, f"{dataset_id}_metadata.json")

    # Load metadata if it exists, or initialize by calling donor_id_information
    if os.path.exists(metadata_file_path):
        with open(metadata_file_path, "r") as metadata_file:
            metadata_dict = json.load(metadata_file)
        available_donor_ids = set(metadata_dict["available_donor_ids"])
        downloaded_donor_ids = set(metadata_dict["downloaded_donor_ids"])
    else:
        # Retrieve metadata and create metadata file if it doesn't exist
        metadata_dict = donor_id_information(dataset_id, organism)
        available_donor_ids = set(metadata_dict["available_donor_ids"])
        downloaded_donor_ids = set(metadata_dict["downloaded_donor_ids"])
        os.makedirs(dataset_directory, exist_ok=True)
        with open(metadata_file_path, "w") as metadata_file:
            json.dump(metadata_dict, metadata_file, indent=4)

    # Identify which donor_ids need to be downloaded
    donor_ids_to_download = set(donor_ids) - downloaded_donor_ids
    if not donor_ids_to_download:
        print(f"All requested donor IDs have already been downloaded for dataset_id '{dataset_id}'.")
    else: 
        print(f"Downloading {donor_ids_to_download} for dataset_id '{dataset_id}'.")
        # Open the census only if there are donor IDs to download
        census = cellxgene_census.open_soma()

    for donor_id in donor_ids_to_download:
        obs_value_filter = f"dataset_id == '{dataset_id}' and donor_id == '{donor_id}'"
        try:
            adata_slice = cellxgene_census.get_anndata(
                census=census,
                organism=organism,
                obs_value_filter=obs_value_filter
            )
            
            # Check if the slice is not empty
            if adata_slice.n_obs > 0:
                file_name = f"{dataset_id}_{donor_id}.h5ad"
                file_path = os.path.join(dataset_directory, file_name)
                adata_slice.write(file_path)
                print(f"Saved AnnData slice for dataset_id '{dataset_id}' and donor_id '{donor_id}' in '{file_path}'")

                # Update metadata to include the newly downloaded donor_id
                update_metadata_file(dataset_id, donor_id, organism)

            else:
                print(f"No data found for dataset_id '{dataset_id}' and donor_id '{donor_id}'")

        except Exception as e:
            print(f"Failed to retrieve data for dataset_id '{dataset_id}' and donor_id '{donor_id}': {e}")

    # Close the census handle if it was opened above
    if donor_ids_to_download:
        census.close()

    print(f"All AnnData slices have been saved in directory: {main_directory}")


One thing still missing is a way to pass in the main directory: at the moment the path is hardcoded, as an ad hoc solution for the cluster.
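
One possible way to lift the hardcoded directory mentioned in the note above is to resolve it from an optional argument, falling back to an environment variable and then to the current default. This is only a sketch: `slice_paths`, the `base_dir` argument, and the `ADATA_SLICES_DIR` environment variable are all hypothetical names, and `/scratch/census` is an illustrative path.

```python
import os
from pathlib import Path

# Default used by the functions in this document
DEFAULT_BASE_DIR = "my_data/adata_slices"


def slice_paths(dataset_id, donor_id, base_dir=None):
    """Return (metadata_path, slice_path) for one dataset/donor pair.

    Resolution order for the base directory: explicit argument, then the
    hypothetical ADATA_SLICES_DIR environment variable, then the default.
    """
    base = Path(base_dir or os.environ.get("ADATA_SLICES_DIR", DEFAULT_BASE_DIR))
    dataset_dir = base / dataset_id
    metadata_path = dataset_dir / f"{dataset_id}_metadata.json"
    slice_path = dataset_dir / f"{dataset_id}_{donor_id}.h5ad"
    return metadata_path, slice_path


metadata_path, slice_path = slice_paths(
    "0895c838-e550-48a3-a777-dbcd35d30272", "C58", base_dir="/scratch/census"
)
print(metadata_path.as_posix())
print(slice_path.as_posix())
```

Keeping the file naming in one helper would also ensure that the saving and aggregating functions always agree on where a slice lives.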

In [67]:
# example usage... 
dataset_id = '0bd1a1de-3aee-40e0-b2ec-86c7a30c7149'
donor_id_list = ['21-F-55', '30-M-3']
save_adata_slices(donor_id_list, dataset_id, organism="Mus musculus")

All requested donor IDs have already been downloaded for dataset_id '0bd1a1de-3aee-40e0-b2ec-86c7a30c7149'.
All AnnData slices have been saved in directory: my_data/adata_slices


In [8]:
dataset_id = '0895c838-e550-48a3-a777-dbcd35d30272'
donor_id_list = ['C58', 'C70', 'C41']
save_adata_slices(donor_id_list, dataset_id, organism="Homo Sapiens")

All requested donor IDs have already been downloaded for dataset_id '0895c838-e550-48a3-a777-dbcd35d30272'.
All AnnData slices have been saved in directory: my_data/adata_slices


## 4. `aggregate_adata_slices`

This function aggregates the already downloaded files, retrieves any missing ones, and returns a single AnnData object ready to be processed and analyzed.

**Parameters**: 

- `donor_ids` *(list)*: List of donor IDs for which to retrieve data.
- `dataset_id` *(str)*: The ID of the dataset to retrieve data from.
- `organism` *(str, optional)*: The name of the organism to query in the census, either `Mus musculus` or `Homo Sapiens`. Default is `Mus musculus`.

**Process**: 

1. Calls `save_adata_slices` to ensure that all the required data are saved locally.
2. Loads the required files, previously saved in *h5ad* format.
3. Returns an aggregated AnnData object built with `anndata.concat`.
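
Because a donor whose file is absent is skipped with a warning, it can be useful to check up front which slices are actually on disk. The sketch below mirrors the file-naming convention used in this document (`{dataset_id}_{donor_id}.h5ad` inside a per-dataset folder). `find_slice_files` is a hypothetical helper, and the example uses an empty placeholder file in a temporary directory rather than real AnnData output.

```python
import os
import tempfile


def find_slice_files(donor_ids, dataset_id, main_directory):
    """Return (found_paths, missing_donor_ids) for the requested donors."""
    dataset_directory = os.path.join(main_directory, dataset_id)
    found, missing = [], []
    for donor_id in donor_ids:
        file_path = os.path.join(dataset_directory, f"{dataset_id}_{donor_id}.h5ad")
        if os.path.exists(file_path):
            found.append(file_path)
        else:
            missing.append(donor_id)
    return found, missing


with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "demo"))
    # Create an empty placeholder file for one donor only
    open(os.path.join(tmp, "demo", "demo_C58.h5ad"), "w").close()
    found, missing = find_slice_files(["C58", "C70"], "demo", tmp)
    found_names = [os.path.basename(p) for p in found]

print(found_names, missing)
```

A caller could then decide whether a non-empty `missing` list should abort the aggregation or merely be logged.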


In [65]:
# This function loads the aggregated AnnData object into memory
# It takes donor_ids, dataset_id and organism as input and returns a single aggregated AnnData object
# It calls save_adata_slices to fetch any missing file, then concatenates the files of the specified donor IDs
def aggregate_adata_slices(donor_ids, dataset_id, organism="Mus musculus"):
    """
    Aggregates the AnnData files for specified donor_ids and dataset_ids into a single AnnData object.

    Parameters:
    - donor_ids (list): List of donor IDs to include in the aggregation.
    - dataset_id (str): ID of the dataset to include in the aggregation.
    - organism (str): Organism name to use in the census query (default is 'Mus musculus').

    Returns:
    - AnnData: A single AnnData object containing all specified donor_id and dataset_id data.
    """
    # Ensure required data slices are downloaded
    save_adata_slices(donor_ids, dataset_id, organism=organism)

    # Initialize an empty list to collect AnnData slices
    adata_list = []
    main_directory = "my_data/adata_slices"

    # Load the saved AnnData slice of each requested donor_id
    
    dataset_directory = os.path.join(main_directory, dataset_id)
    
    for donor_id in donor_ids:
        # Construct the file path for the specific donor_id and dataset_id
        file_name = f"{dataset_id}_{donor_id}.h5ad"
        file_path = os.path.join(dataset_directory, file_name)

        # Check if the file exists and load it if it does
        if os.path.exists(file_path):
            # Load the AnnData slice and add to list
            adata_slice = anndata.read_h5ad(file_path)
            adata_list.append(adata_slice)
            print(f"Loaded data for dataset_id '{dataset_id}' and donor_id '{donor_id}' from '{file_path}'")
        else:
            print(f"Warning: Expected file '{file_path}' does not exist. Skipping this donor_id.")

    # Concatenate all the loaded AnnData objects if there are any
    if adata_list:
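        # No `label` is passed: obs already carries the real donor_id column from the census,
        # and label="donor_id" would overwrite it with the positional keys "0", "1", ...
        aggregated_adata = anndata.concat(adata_list, join="outer", index_unique="-")
        # Grab the var DataFrame of every slice in the list
        all_var = [x.var for x in adata_list]
        # Concatenate them
        all_var = pd.concat(all_var, join="outer")
        # Remove duplicated features
        all_var = all_var[~all_var.index.duplicated()]
        # Restore the feature annotations, aligned to the aggregated object's var_names
        aggregated_adata.var = all_var.loc[aggregated_adata.var_names]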
        # This is for preventing a warning
        #aggregated_adata.obs_names_make_unique()
        print(f"Aggregated AnnData object created with {aggregated_adata.n_obs} observations.")
        return aggregated_adata
    else:
        print("No AnnData files were found to aggregate.")
        return None


In [66]:
# Example usage
adata = aggregate_adata_slices(['C58', 'C70'], '0895c838-e550-48a3-a777-dbcd35d30272', organism="Homo Sapiens")

All requested donor IDs have already been downloaded for dataset_id '0895c838-e550-48a3-a777-dbcd35d30272'.
All AnnData slices have been saved in directory: my_data/adata_slices
Loaded data for dataset_id '0895c838-e550-48a3-a777-dbcd35d30272' and donor_id 'C58' from 'my_data/adata_slices/0895c838-e550-48a3-a777-dbcd35d30272/0895c838-e550-48a3-a777-dbcd35d30272_C58.h5ad'
Loaded data for dataset_id '0895c838-e550-48a3-a777-dbcd35d30272' and donor_id 'C70' from 'my_data/adata_slices/0895c838-e550-48a3-a777-dbcd35d30272/0895c838-e550-48a3-a777-dbcd35d30272_C70.h5ad'
Aggregated AnnData object created with 44 observations.


In [58]:
adata.var

| | soma_joinid | feature_id | feature_name | feature_length | nnz | n_measured_obs |
|---|---|---|---|---|---|---|
| 0 | 0 | ENSG00000000003 | TSPAN6 | 4530 | 4530448 | 73855064 |
| 1 | 1 | ENSG00000000005 | TNMD | 1476 | 236059 | 61201828 |
| 2 | 2 | ENSG00000000419 | DPM1 | 9276 | 17576462 | 74159149 |
| 3 | 3 | ENSG00000000457 | SCYL3 | 6883 | 9117322 | 73988868 |
| 4 | 4 | ENSG00000000460 | C1orf112 | 5970 | 6287794 | 73636201 |
| ... | ... | ... | ... | ... | ... | ... |
| 60525 | 60525 | ENSG00000288718 | ENSG00000288718.1 | 1070 | 4 | 1248980 |
| 60526 | 60526 | ENSG00000288719 | ENSG00000288719.1 | 4252 | 2826 | 1248980 |
| 60527 | 60527 | ENSG00000288724 | ENSG00000288724.1 | 625 | 36 | 1248980 |
| 60528 | 60528 | ENSG00000290791 | ENSG00000290791.1 | 3612 | 1642 | 43485 |


In [70]:
adata.obs.head()

| | soma_joinid | dataset_id | assay | assay_ontology_term_id | cell_type | cell_type_ontology_term_id | development_stage | development_stage_ontology_term_id | disease | disease_ontology_term_id | ... | tissue | tissue_ontology_term_id | tissue_type | tissue_general | tissue_general_ontology_term_id | raw_sum | nnz | raw_mean_nnz | raw_variance_nnz | n_measured_vars |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0-0 | 85 | 0895c838-e550-48a3-a777-dbcd35d30272 | 10x 3' v3 | EFO:0009922 | plasma cell | CL:0000786 | human adult stage | HsapDv:0000087 | normal | PATO:0000461 | ... | caudate lobe of liver | UBERON:0001117 | tissue | liver | UBERON:0002107 | 176.0 | 121 | 1.454545 | 9.233333 | 13696 |
| 1-0 | 86 | 0895c838-e550-48a3-a777-dbcd35d30272 | 10x 3' v3 | EFO:0009922 | mature B cell | CL:0000785 | human adult stage | HsapDv:0000087 | normal | PATO:0000461 | ... | caudate lobe of liver | UBERON:0001117 | tissue | liver | UBERON:0002107 | 269.0 | 212 | 1.268868 | 5.657225 | 13696 |
| 2-0 | 87 | 0895c838-e550-48a3-a777-dbcd35d30272 | 10x 3' v3 | EFO:0009922 | mature B cell | CL:0000785 | human adult stage | HsapDv:0000087 | normal | PATO:0000461 | ... | caudate lobe of liver | UBERON:0001117 | tissue | liver | UBERON:0002107 | 193.0 | 174 | 1.109195 | 0.225002 | 13696 |
| 3-0 | 88 | 0895c838-e550-48a3-a777-dbcd35d30272 | 10x 3' v2 | EFO:0009899 | mature B cell | CL:0000785 | human adult stage | HsapDv:0000087 | normal | PATO:0000461 | ... | caudate lobe of liver | UBERON:0001117 | tissue | liver | UBERON:0002107 | 1971.0 | 600 | 3.285 | 37.409457 | 13696 |
| 4-0 | 89 | 0895c838-e550-48a3-a777-dbcd35d30272 | 10x 3' v2 | EFO:0009899 | mature B cell | CL:0000785 | human adult stage | HsapDv:0000087 | normal | PATO:0000461 | ... | caudate lobe of liver | UBERON:0001117 | tissue | liver | UBERON:0002107 | 1880.0 | 672 | 2.797619 | 30.310695 | 13696 |
