# Combine raw data across all samples

In this notebook, we'll use the file manifest we assembled previously to retrieve all of the human PBMC data for our reference dataset.

Here, we'll retrieve data from each individual sample, stored in HDF5 format in HISE, and concatenate them into a single AnnData object.

## Load Packages

anndata: Data structures for scRNA-seq  
h5py: HDF5 file I/O  
hisepy: The HISE SDK for Python  
numpy: Mathematical data structures and computation  
os: operating system calls  
pandas: DataFrame data structures  
re: Regular expressions  
scanpy: scRNA-seq analysis  
scipy.sparse: Sparse matrix data structures  
shutil: Shell utilities

In [1]:
import anndata
import h5py
import hisepy
import numpy as np
import os
import pandas as pd
import re
import scanpy as sc
import scipy.sparse as scs
import shutil

## Helper functions

These functions assist in reading from our pipeline .h5 file format into AnnData format:

In [2]:
# define a function to read count data
def read_mat(h5_con):
    mat = scs.csc_matrix(
        (h5_con['matrix']['data'][:], # Count values
         h5_con['matrix']['indices'][:], # Row indices
         h5_con['matrix']['indptr'][:]), # Pointers for column positions
        shape = tuple(h5_con['matrix']['shape'][:]) # Matrix dimensions
    )
    return mat

# define a function to read observation metadata (i.e. cell metadata)
def read_obs(h5con):
    bc = h5con['matrix']['barcodes'][:]
    bc = [x.decode('UTF-8') for x in bc]

    # Initialize the DataFrame with cell barcodes
    obs_df = pd.DataFrame({ 'barcodes' : bc })

    # Get the list of available metadata columns
    obs_columns = h5con['matrix']['observations'].keys()

    # For each column
    for col in obs_columns:
        # Read the values
        values = h5con['matrix']['observations'][col][:]
        # Check for byte storage (guarding against empty columns)
        if len(values) > 0 and isinstance(values[0], (bytes, bytearray)):
            # Decode byte strings
            values = [x.decode('UTF-8') for x in values]
        # Add column to the DataFrame
        obs_df[col] = values

    obs_df = obs_df.set_index('barcodes', drop = False)
    
    return obs_df

# define a function to construct anndata object from a h5 file
def read_h5_anndata(h5_con):
    #h5_con = h5py.File(h5_file, mode = 'r')
    # extract the expression matrix
    mat = read_mat(h5_con)
    # extract gene names
    genes = h5_con['matrix']['features']['name'][:]
    genes = [x.decode('UTF-8') for x in genes]
    # extract metadata
    obs_df = read_obs(h5_con)
    # construct anndata
    adata = anndata.AnnData(mat.T,
                             obs = obs_df)
    # assign gene names to the variables (columns) of adata
    adata.var_names = genes

    adata.var_names_make_unique()
    return adata
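
To make the `read_mat()` step more concrete, here is a minimal sketch of how the three arrays stored under `matrix` describe a compressed sparse column (CSC) matrix. The toy counts below are hypothetical and are not taken from any HISE file; the sketch assumes, as the transpose in `read_h5_anndata()` suggests, that the stored matrix is genes x cells.

```python
import numpy as np
import scipy.sparse as scs

# Hypothetical 3-gene x 2-cell count matrix, stored column by column (CSC):
#      cell0  cell1
# g0     1      0
# g1     0      2
# g2     3      4
data = np.array([1, 3, 2, 4])      # non-zero counts, one column after another
indices = np.array([0, 2, 1, 2])   # row (gene) index of each count
indptr = np.array([0, 2, 4])       # column j spans data[indptr[j]:indptr[j + 1]]

mat = scs.csc_matrix((data, indices, indptr), shape=(3, 2))
dense = mat.toarray()

# Transposing gives the cells x genes orientation used for the AnnData object
cells_by_genes = mat.T
```

Transposing a CSC matrix yields a CSR matrix without copying the underlying arrays, so the `mat.T` call in `read_h5_anndata()` should be cheap even for large samples.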

## Retrieve file list from HISE

First, we'll pull the manifest of samples and associated files that we want to retrieve for building our reference dataset. We previously assembled this file and loaded it into HISE via a Watchfolder.

In [3]:
help(hisepy.reader.download_files)

Help on function download_files in module hisepy.reader:

download_files(file_dict: dict)
    Read the contents of a dictionary of non-result file ids into hise_file objects
    These files will contain NULL descriptors (since they are not result files)
    
    Parameters:
        file_dict (dict): a dictionary of file_uuid: file_name
    
    Returns:
        a list of hise_file objects with empty descriptors



In [4]:
manifest_file_uuid = '223b4aa9-19fc-41e1-8bea-43682e5ac278'
manifest_file_name = 'ref_h5_meta_data_2024-02-08.csv'
manifest_file_dict = {
    manifest_file_uuid: manifest_file_name
}

In [5]:
res = hisepy.reader.download_files(manifest_file_dict)

After the download, the file is stored in `cache/downloadable/`.

In [6]:
manifest_file = "cache/downloadable/{fn}".format(fn = manifest_file_name)
meta_data = pd.read_csv(manifest_file)
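
Before reading files from the manifest, it can help to confirm that it contains the `file.id` column used in the later cells and that no file is listed twice. The following is a minimal sketch using a small hypothetical DataFrame in place of the real manifest; the placeholder ids and the names `dup_mask`, `n_dups`, and `meta_unique` are illustrative assumptions.

```python
import pandas as pd

# Hypothetical stand-in for the manifest; the real file has more columns
meta_data = pd.DataFrame({
    'file.id': ['uuid-a', 'uuid-b', 'uuid-b'],
})

# The later cells rely on this column being present
assert 'file.id' in meta_data.columns

# Flag any file listed more than once so it is not read twice
dup_mask = meta_data['file.id'].duplicated()
n_dups = int(dup_mask.sum())

# Keep the first occurrence of each file id and renumber the rows
meta_unique = meta_data.loc[~dup_mask].reset_index(drop=True)
```

Resetting the index after filtering matters if rows are later looked up by integer label, since dropped rows would otherwise leave gaps in the index.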

Next, we'll use this list of files to read and assemble a combined dataset.

In [7]:
help(hisepy.reader.read_files)

Help on function read_files in module hisepy.reader:

read_files(file_list: list = None, query_id: list = None, query_dict: dict = None, to_df: bool = True)
    Read the contents of a list of file ids into a hise_file object
    Note: users should only use 1 parameter per function call
    
    Parameters:
        file_list (list): a list of UUIDS to retrieve
        query_id (str): string value of queryID from Advanced Search
        query_dict (dict): dictionary that allows users to submit a query.
            Note: for each key:value pair, the value must be of type list
        to_df (bool):  boolean determining whether result should be returned as a data.frame. 
    
    Returns:
        a list of hise_file objects
    
    Example: hp.read_files(file_list=['6cb2f536-2d20-4e66-b04d-327dce6870f4'])



`read_files()` returns a dictionary whose `values` entry contains a list of `h5py.File()` objects. We can pass each of these directly to our helper function, `read_h5_anndata()`, defined above, to read the .h5 file into an AnnData object.

Once a file has been read into memory, we no longer need the copy on disk, so we'll remove its cache directory.

In [8]:
def get_adata(uuid):
    # Load the file using HISE
    res = hisepy.reader.read_files([uuid])
    
    # Read the file to adata
    h5_con = res['values'][0]
    adata = read_h5_anndata(h5_con)
    
    # Clean up the file now that we're done with it
    h5_file = h5_con.filename
    h5_con.close()
    h5_dir = re.sub('(cache/[^/]+/).+','\\1',h5_file)
    shutil.rmtree(h5_dir)

    return adata
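
The cleanup step in `get_adata()` depends on the regular expression trimming the file path back to the first directory below `cache/`. A small sketch of that substitution on its own, using hypothetical paths (the helper name `cache_dir_for` is an assumption added for illustration and is not part of hisepy):

```python
import re

def cache_dir_for(h5_file):
    # Keep everything up to and including the first directory below cache/
    return re.sub('(cache/[^/]+/).+', '\\1', h5_file)

# Hypothetical paths, for illustration only
example_rel = cache_dir_for('cache/abc-123/sample.h5')
example_abs = cache_dir_for('/home/user/cache/abc-123/nested/sample.h5')

# A path without a cache/ component is returned unchanged
example_other = cache_dir_for('data/sample.h5')
```

Because a path that does not match comes back unchanged, it may be worth checking that the result is a directory before passing it to `shutil.rmtree()`.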

Here, we'll iterate over each file in our manifest and read it into an AnnData object:

In [9]:
adata_list = []
for uuid in meta_data['file.id']:
    adata_list.append(get_adata(uuid))

Concatenate all of the datasets into a single object:

In [10]:
adata = anndata.concat(adata_list)

Save the combined object to .h5ad:

In [11]:
out_dir = 'output'
os.makedirs(out_dir, exist_ok=True)

In [12]:
out_h5ad = 'output/adata_all_raw.h5ad'
adata.write_h5ad(out_h5ad)

## Upload assembled data to HISE

Finally, we'll use `hisepy.upload.upload_files()` to send a copy of our output to HISE to use for downstream analysis steps.

In [13]:
help(hisepy.upload.upload_files)

Help on function upload_files in module hisepy.upload:

upload_files(files: list, study_space_id: str = None, project: str = None, title: str = None, input_file_ids=None, input_sample_ids=None, file_types=None, store=None, destination=None, do_prompt: bool = True)
    Uploads files to a specified study.
    
    Parameters:
        files (list): absolute filepath of file to be uploaded
        study_space_id (str): ID that pertains to a study in the collaboration space (optional)
        project (str): project short name (required if study space is not specified)
        title (str): 10+ character title for upload result 
        input_file_ids (list): fileIds from HISE that were utilized to generate a user's result
        input_sample_ids (list): sampleIds from HISE that were utilized to generate a user's result
        file_types (str): filetype of uploaded files 
        store (str): Which store ('project' or 'permanent') to use for the files (default in 'project')
        destinat

In [14]:
study_space_uuid = '64097865-486d-43b3-8f94-74994e0a72e0'
title = 'Assembled Raw PBMC .h5ad'

In [15]:
in_files = [manifest_file_uuid] + meta_data['file.id'].to_list()

In [16]:
out_file = [out_h5ad]

In [17]:
hisepy.upload.upload_files(
    files = out_file,
    study_space_id = study_space_uuid,
    title = title,
    input_file_ids = in_files
)

you are trying to upload file_ids... ['output/adata_all_raw.h5ad']. Do you truly want to proceed?


(y/n) y


{'trace_id': '74e553f6-30eb-45ca-ac9e-ff6bb20c8812',
 'files': ['output/adata_all_raw.h5ad']}

In [18]:
import session_info
session_info.show()