# R SingleCellExperiment Object to Disk (for Python Usage)

**TODO:** Add this documentation to local readme file

**Author:** Prisca Dotti

**Last modified:** 12.08.2025

This script imports R datasets saved as `.rds` files into Python and saves them in a format compatible with the PAGEpy pipeline.

📋 **Requirements**

- R with the `SingleCellExperiment` package installed  
- Python package `anndata2ri`  
  Install via:  
  `%pip install anndata2ri`

🗂 **Outputs**

Creates the following files in a given output directory:

1. `count_matrix.mtx` — count matrix in Matrix Market format  
2. `gene_names.txt` — list of gene names  
3. `sample_names.txt` — list of sample IDs  
4. `response_labels.csv` — CSV file containing sample response labels
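
To confirm that the four files agree with each other, they can be read back in Python. The sketch below is one possible check, not part of PAGEpy: `load_pagepy_inputs` is a hypothetical helper name, and it assumes the file names listed above, a genes × samples matrix, and the `Sample` and `Status` columns written by the save function in this script.

```python
import os

import pandas as pd
from scipy.io import mmread


def load_pagepy_inputs(data_dir):
    """Read the four exported files back and check that their sizes agree.

    Hypothetical helper for illustration; not part of PAGEpy.
    """
    # Sparse Matrix Market files are read as a COO matrix; convert to CSR.
    matrix = mmread(os.path.join(data_dir, "count_matrix.mtx")).tocsr()

    with open(os.path.join(data_dir, "gene_names.txt")) as f:
        gene_names = [line.rstrip("\n") for line in f]
    with open(os.path.join(data_dir, "sample_names.txt")) as f:
        sample_names = [line.rstrip("\n") for line in f]

    # dtype=str keeps sample IDs such as "001" from being parsed as numbers.
    labels = pd.read_csv(os.path.join(data_dir, "response_labels.csv"), dtype=str)

    n_genes, n_samples = matrix.shape
    if n_genes != len(gene_names):
        raise ValueError(f"{n_genes} matrix rows but {len(gene_names)} gene names")
    if n_samples != len(sample_names):
        raise ValueError(f"{n_samples} matrix columns but {len(sample_names)} sample names")
    if labels["Sample"].tolist() != sample_names:
        raise ValueError("Sample column of response_labels.csv does not match sample_names.txt")

    return matrix, gene_names, sample_names, labels
```

Calling `load_pagepy_inputs("../../bulk_data")` after the export has run would raise a `ValueError` if, for example, the matrix had been written in the wrong orientation.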

🔍 **Context**

This project currently works with an HIV single-cell dataset. The goal here is to convert a bulk dataset stored in R so that it matches the structure of the single-cell data (renaming variables, columns, and rows where needed) and is therefore compatible with the PAGEpy pipeline.

When running the package with a new dataset, this script can serve as a starting point for processing the data, especially for people unfamiliar with R.


In [1]:
import pandas as pd
import numpy as np
import scanpy as sc
from scipy.sparse import csr_matrix
from scipy.io import mmwrite
import os

In [None]:
%pip install anndata2ri

Collecting anndata2ri
  Downloading anndata2ri-2.0-py3-none-any.whl.metadata (4.9 kB)
Collecting rpy2>=3.5.2 (from anndata2ri)
  Downloading rpy2-3.6.2-py3-none-any.whl.metadata (5.4 kB)
Collecting tzlocal (from anndata2ri)
  Downloading tzlocal-5.3.1-py3-none-any.whl.metadata (7.6 kB)
Collecting rpy2-rinterface>=3.6.2 (from rpy2>=3.5.2->anndata2ri)
  Downloading rpy2_rinterface-3.6.2.tar.gz (79 kB)
  Installing build dependencies ...

In [2]:
import anndata2ri
%load_ext rpy2.ipython
anndata2ri.set_ipython_converter()

ModuleNotFoundError: No module named 'anndata2ri'
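
The `ModuleNotFoundError` above is raised when the kernel running the cell cannot find `anndata2ri`. One way to get a clearer message before loading the extension is to check for the packages first. This is a minimal sketch using only the standard library; `missing_modules` is a hypothetical helper name.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names that the current kernel cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(["anndata2ri", "rpy2"])
if missing:
    print(f"Missing packages: {', '.join(missing)}. "
          "Install them first; a kernel restart may be needed afterwards.")
else:
    print("anndata2ri and rpy2 can be imported.")
```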

In [None]:
# %%R
# # Install BiocManager if not already installed
# if (!requireNamespace("BiocManager", quietly = TRUE))
#     install.packages("BiocManager")

# # Install SingleCellExperiment
# BiocManager::install("SingleCellExperiment")

# # Also install other potentially needed packages
# BiocManager::install(c("SingleCellExperiment", "SummarizedExperiment"))

Load .rds file in R and convert to AnnData

In [None]:
%%R -o adata
library(SingleCellExperiment)

# Read the SingleCellExperiment object
sce <- readRDS("../../bulk_data/GEO_singlecellexperiment_11ds.rds")

# Check the structure
print("SingleCellExperiment object:")
print(sce)
print("Available assays:")
print(names(assays(sce)))
print("colData columns:")
print(colnames(colData(sce)))
print("rowData columns:")
print(colnames(rowData(sce)))

# Assign to `adata`; the `-o adata` flag exports it to Python,
# where the anndata2ri converter turns it into an AnnData object
adata <- sce

Verify the AnnData object in Python

In [None]:
print("AnnData object received from R:")
print(adata)
print(f"\nShape: {adata.shape}")
print(
    f"Available layers: {list(adata.layers.keys()) if adata.layers else 'No layers'}")
print(f"obs columns: {adata.obs.columns.tolist()}")
print(f"var columns: {adata.var.columns.tolist()}")

Extract and save required data to run PAGEpy

In [None]:
# Filter out samples whose adata.obs['Response'] is "partial"
# (or any other value that is not exactly "yes" or "no")

if 'Response' in adata.obs.columns:
    print("Before filtering:", adata.shape, "Response counts:")
    print(adata.obs['Response'].value_counts())

    # Keep only samples with Response "yes" or "no"
    adata = adata[adata.obs['Response'].isin(['yes', 'no'])].copy()

    print("After filtering:", adata.shape, "Response counts:")
    print(adata.obs['Response'].value_counts())
else:
    print("⚠ 'Response' column not found; skipping filtering.")
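
Note that `isin(['yes', 'no'])` matches exact values only, so the filter above removes more than the "partial" samples: labels with different capitalization and missing values are dropped as well. The sketch below illustrates this on a toy stand-in for `adata.obs`; the sample IDs and labels are made up.

```python
import pandas as pd

# Toy stand-in for adata.obs; the sample IDs and labels are made up.
obs = pd.DataFrame(
    {"Response": ["yes", "no", "partial", "Yes", None]},
    index=["s1", "s2", "s3", "s4", "s5"],
)

# Same test as in the filtering cell: exact match against "yes" or "no".
mask = obs["Response"].isin(["yes", "no"])
kept = obs[mask]
dropped = obs[~mask]

print(kept.index.tolist())     # ['s1', 's2']
print(dropped.index.tolist())  # ['s3', 's4', 's5']
```

Inspecting the dropped rows in this way before overwriting `adata` makes it easier to spot labels that were excluded unintentionally.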

In [None]:
def save_data_files(adata, output_dir="output"):
    """Extract and save the data in PAGEpy required formats."""

    os.makedirs(output_dir, exist_ok=True)

    # 1. Get the count matrix (the 'scalelogcounts' layer option is disabled below)
    # if 'scalelogcounts' in adata.layers:
    #     count_matrix = adata.layers['scalelogcounts']
    #     print("✓ Found scalelogcounts in layers")
    if 'counts' in adata.layers:
        count_matrix = adata.layers['counts']
        print("✓ Found counts in layers")
    elif hasattr(adata, 'X') and adata.X is not None:
        count_matrix = adata.X
        print("✓ Using main X matrix")
    else:
        print("⚠ Could not find count matrix")
        return

    # Convert to dense array if sparse
    if hasattr(count_matrix, 'toarray'):
        count_matrix_dense = count_matrix.toarray()
    else:
        count_matrix_dense = np.array(count_matrix)
    # Transpose to genes × samples to match the HIV dataset
    count_matrix_dense = count_matrix_dense.T

    print(f"Count matrix shape: {count_matrix_dense.shape} (genes × samples)")

    # 2. Get gene names
    if 'gene_name' in adata.var.columns:
        gene_names = adata.var['gene_name'].tolist()
        print(f"✓ Found {len(gene_names)} gene names from 'gene_name' column")
    else:
        gene_names = adata.var.index.tolist()
        print(f"✓ Using var index as gene names: {len(gene_names)} genes")
        print("Available var columns:", adata.var.columns.tolist())

    # 3. Get sample IDs
    sample_ids = adata.obs.index.tolist()
    print(f"✓ Found {len(sample_ids)} sample IDs")

    # 4. Get Response labels
    if 'Response' in adata.obs.columns:
        response_labels = adata.obs['Response']
        print("✓ Found Response column")
        print("Response distribution:", response_labels.value_counts().to_dict())
    else:
        print("⚠ 'Response' column not found in obs")
        print("Available obs columns:", adata.obs.columns.tolist())
        response_labels = None

    # Save files
    print("\nSaving files...")

    # 1. Count matrix as .mtx
    sparse_matrix = csr_matrix(count_matrix_dense)
    mmwrite(os.path.join(output_dir, 'count_matrix.mtx'), sparse_matrix)
    print("✓ Saved count_matrix.mtx")

    # 2. Gene names as .txt
    with open(os.path.join(output_dir, 'gene_names.txt'), 'w') as f:
        for gene in gene_names:
            f.write(f"{gene}\n")
    print("✓ Saved gene_names.txt")

    # 3. Sample names as .txt
    with open(os.path.join(output_dir, 'sample_names.txt'), 'w') as f:
        for sample in sample_ids:
            f.write(f"{sample}\n")
    print("✓ Saved sample_names.txt")

    # 4. Response labels as .csv
    if response_labels is not None:
        labels_df = pd.DataFrame({
            'Sample': sample_ids,
            'Status': response_labels.values
        })
        labels_df.to_csv(os.path.join(
            output_dir, 'response_labels.csv'), index=False)
        print("✓ Saved response_labels.csv")

    # # 5. Save AnnData object for future use # <- doesn't work
    # adata.write(os.path.join(output_dir, 'data.h5ad'))
    # print("✓ Saved data.h5ad (AnnData format)")

    print(f"\nAll files saved to '{output_dir}' directory!")

    return {
        'count_matrix': count_matrix_dense,
        'gene_names': gene_names,
        'sample_ids': sample_ids,
        'response_labels': response_labels.values if response_labels is not None else None
    }
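
The function above converts the matrix to a dense array before transposing it, then back to a sparse matrix for `mmwrite`. For large inputs, a possible alternative is to stay sparse throughout. The sketch below shows the idea on a small made-up matrix; `to_genes_by_samples` is a hypothetical helper, not part of this script or of PAGEpy.

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse


def to_genes_by_samples(matrix):
    """Transpose samples × genes to genes × samples as CSR.

    Sparse input is never converted to a dense array; dense input is
    wrapped in a CSR matrix first.
    """
    if not issparse(matrix):
        matrix = csr_matrix(np.asarray(matrix))
    return matrix.T.tocsr()


x = csr_matrix(np.array([[1, 0, 2], [0, 3, 0]]))  # 2 samples × 3 genes
genes_by_samples = to_genes_by_samples(x)
print(genes_by_samples.shape)  # (3, 2)
```

The result can be passed directly to `mmwrite`, which accepts sparse matrices.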

In [None]:
# Run the extraction
extracted_data = save_data_files(
    adata=adata, output_dir="../../bulk_data")