# Converting h5ad File to coloncancer.csv Format

This notebook converts `dataset_debug_restricted.h5ad` to match the format of `coloncancer.csv`.

The target format has the following columns:
- dataset: The name of the dataset
- tissue: The tissue type
- marker: A comma-separated list of marker genes for each cell type
- manual_annotation: The cell type annotation
- manual_CLname: The cell ontology name
- manual_CLID: The cell ontology ID
- manual_broadtype: The broad cell type category
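
As a minimal sketch of this schema, the column names can be kept in one place and reused whenever a row is built. `TARGET_COLUMNS` and `make_row` are hypothetical names introduced here for illustration and are not part of the notebook; fields that are not supplied default to an empty string.

```python
# Column order of the target format, as listed above.
TARGET_COLUMNS = [
    "dataset",
    "tissue",
    "marker",
    "manual_annotation",
    "manual_CLname",
    "manual_CLID",
    "manual_broadtype",
]


def make_row(**fields):
    """Return a dict with every target column; missing fields become ''."""
    unknown = set(fields) - set(TARGET_COLUMNS)
    if unknown:
        raise KeyError(f"Unexpected columns: {sorted(unknown)}")
    return {col: fields.get(col, "") for col in TARGET_COLUMNS}


example = make_row(
    dataset="coloncancer",
    tissue="colon cancer",
    marker="CD3D,TRAC,CCL5",  # shortened marker list, for illustration only
    manual_annotation="T cells",
    manual_CLname="T cell",
    manual_CLID="CL:0000084",
    manual_broadtype="t/nk cell",
)
print(list(example))
```

Raising on unknown keys is a design choice in this sketch: it surfaces misspelled column names early instead of silently dropping them.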

In [25]:
# Import required libraries
import scanpy as sc
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns

# Set plotting parameters
sc.settings.set_figure_params(dpi=100, frameon=False)
plt.rcParams['figure.figsize'] = (8, 8)
plt.rcParams['figure.dpi'] = 100

# Print versions for reproducibility
print(f"scanpy version: {sc.__version__}")
print(f"pandas version: {pd.__version__}")
print(f"numpy version: {np.__version__}")

scanpy version: 1.11.2
pandas version: 2.3.0
numpy version: 2.2.6


## Loading the h5ad File

First, we load `dataset_debug_restricted.h5ad` with scanpy and inspect its structure.

In [26]:
# Define the path to the h5ad file
file_path = 'dataset_debug_restricted.h5ad'

# Load the h5ad file
adata = sc.read_h5ad(file_path)

# Print basic information about the AnnData object
print(f"Shape of AnnData: {adata.shape}")
print("\nKeys in .obs (per-cell annotations):")
print(adata.obs.columns.tolist())
print("\nKeys in .var (per-gene annotations):")
print(adata.var.columns.tolist())
print("\nKeys in .uns (unstructured annotations):")
print(list(adata.uns.keys()))  # .uns always exists on an AnnData object

Shape of AnnData: (1000, 33541)

Keys in .obs (per-cell annotations):
[]

Keys in .var (per-gene annotations):
[]

Keys in .uns (unstructured annotations):
[]


In [27]:
# Let's look at the first few rows of the observation metadata
print("First few rows of .obs:")
display(adata.obs.head())

# Check if there's any cell type annotation in the metadata
# Common column names for cell type annotations: cell_type, leiden, louvain, cluster, etc.
cell_type_cols = [col for col in adata.obs.columns if any(x in col.lower() for x in ['cell_type', 'celltype', 'leiden', 'louvain', 'cluster'])]

if cell_type_cols:
    print(f"\nPotential cell type annotation columns: {cell_type_cols}")
    for col in cell_type_cols:
        print(f"\nUnique values in {col}:")
        print(adata.obs[col].value_counts())
else:
    print("\nNo obvious cell type annotation columns found.")

First few rows of .obs:


Gao2021_ACTGCTCAGAAGAAGC
Gao2021_AGGCCACTCAACTCTT
Gao2021_CAAGGCCAGTGTCCCG
Gao2021_CGGACTGTCTACTTAC
Gao2021_GATCGATAGTATCGAA



No obvious cell type annotation columns found.
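
The cell barcodes shown above all start with `Gao2021_`. Assuming that this prefix identifies the source study (the notebook does not confirm it), a small sketch like the following could tally prefixes as a sanity check before choosing a dataset label. `barcode_prefixes` is a hypothetical helper, and the list holds only the five barcodes displayed above; in the notebook the input would be `adata.obs_names`.

```python
from collections import Counter


def barcode_prefixes(barcodes, sep="_"):
    """Count the text before the first separator in each barcode."""
    return Counter(b.split(sep, 1)[0] for b in barcodes if sep in b)


# The five barcodes displayed in the .obs preview.
barcodes = [
    "Gao2021_ACTGCTCAGAAGAAGC",
    "Gao2021_AGGCCACTCAACTCTT",
    "Gao2021_CAAGGCCAGTGTCCCG",
    "Gao2021_CGGACTGTCTACTTAC",
    "Gao2021_GATCGATAGTATCGAA",
]
prefix_counts = barcode_prefixes(barcodes)
print(prefix_counts.most_common(1))
```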


## Identifying Marker Genes for Each Cell Type

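Next, we identify marker genes for each cell type or cluster by ranking genes with a Wilcoxon rank-sum test that compares each cluster against all other cells. If `.obs` contains no annotation column, the cells are first clustered with the Leiden algorithm. The top-ranked genes per cluster become the `marker` field of the CSV output.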

In [28]:
# Choose the .obs column that defines the groups for marker detection.
# Start with no column selected; it is set below from the inspection results.
cluster_col = None

# Dynamically set the cluster column based on what we found above
if cell_type_cols:
    cluster_col = cell_type_cols[0]  # Use the first identified cell type column
else:
    # If no cell type columns were found, we'll try to cluster the data
    print("No cell type annotations found. Performing clustering...")
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)
    sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
    sc.pp.pca(adata, svd_solver='arpack')
    sc.pp.neighbors(adata)
    sc.tl.leiden(adata)
    cluster_col = 'leiden'

print(f"Using '{cluster_col}' as cell type/cluster column")

# Compute marker genes for each cluster
sc.tl.rank_genes_groups(adata, groupby=cluster_col, method='wilcoxon')

# Function to get top marker genes for each cluster
def get_top_markers(adata, groupby, n_genes=10):
    """Return {cluster: [top n_genes marker names]} from rank_genes_groups."""
    markers_dict = {}

    try:
        # Gene names are stored per cluster, best-ranked first
        names = adata.uns['rank_genes_groups']['names']
        for cluster in adata.obs[groupby].unique():
            markers_dict[cluster] = list(names[cluster][:n_genes])
        return markers_dict
    except (KeyError, ValueError):
        # KeyError: results missing from .uns; ValueError: cluster not in the results
        print("Error accessing rank_genes_groups results. Check if the computation completed successfully.")
        return {}

# Get top 10 marker genes for each cluster
top_markers = get_top_markers(adata, groupby=cluster_col, n_genes=10)

# Display the markers for each cluster
for cluster, markers in top_markers.items():
    print(f"\nCluster {cluster} top markers:")
    print(", ".join(markers))

No cell type annotations found. Performing clustering...
Using 'leiden' as cell type/cluster column

Cluster 0 top markers:
SCGB2A2, XBP1, SCGB1D2, KRT18, TRPS1, TFF3, KRT8, MGP, RPL31, AZGP1

Cluster 11 top markers:
IGFBP7, PCAT19, CD93, SPARCL1, CD59, GNG11, SPRY1, TM4SF1, TCF4, EIF1

Cluster 2 top markers:
AIF1, TYROBP, FCER1G, HLA-DRA, FTL, HLA-DPB1, HLA-DPA1, CD74, CTSS, LYZ

Cluster 5 top markers:
CRYAB, KRT7, KRT17, SFRP1, NFIB, FBXO32, KRT14, CALD1, TAGLN, CD59

Cluster 4 top markers:
COL3A1, COL1A2, COL1A1, CALD1, SPARC, COL6A2, COL6A1, DCN, LUM, AEBP1

Cluster 10 top markers:
RPS17, CD59, S100A6, RPL37A, TNFRSF12A, SOD2, CXCL1, KRT7, RPL14, HMGA1

Cluster 14 top markers:
GEM, CXCL1, NFKBIA, CXCL3, CXCL2, HSP90AB1, SOD2, CCL2, CTSL, PLPP3

Cluster 8 top markers:
RPS15, POLR2L, MUC1, TOMM7, STC2, RPS2, CRIP1, RPL27A, POLR2J3.1, RPL41

Cluster 9 top markers:
FABP3, COX6C, ADIRF, RPS4X, SMIM22, TCEAL4, RPL41, MT-ND5, KIF22, HEBP1

Cluste
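
Leiden labels are strings, and `unique()` returns them in order of first appearance, which is why the clusters above are not listed numerically. A minimal sketch of numeric ordering follows, assuming every label is an integer-like string (true for default Leiden output, but only an assumption for other annotation columns). `sort_cluster_labels` is a hypothetical helper.

```python
def sort_cluster_labels(labels):
    """Sort labels numerically when they all look like integers, else lexically."""
    labels = [str(label) for label in labels]
    if all(label.isdigit() for label in labels):
        return sorted(labels, key=int)
    return sorted(labels)


# Labels in the order they appear in the output above.
seen_order = ["0", "11", "2", "5", "4", "10", "14", "8", "9"]
print(sort_cluster_labels(seen_order))
```

Iterating over the sorted labels instead of `unique()` would give rows in a stable, predictable order in the final CSV.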

## Preparing the Output CSV Format

Next, we build a DataFrame that matches the format of `coloncancer.csv`, with one row per cluster:

1. `dataset`: Name of the dataset, taken from the file name
2. `tissue`: Tissue type, inferred from `.uns` or `.obs` if available, otherwise a placeholder
3. `marker`: Comma-separated list of the top marker genes
4. `manual_annotation`: Cell type annotation (the cluster name, until clusters are annotated by hand)
5. `manual_CLname`: Cell ontology name (left blank; requires manual annotation)
6. `manual_CLID`: Cell ontology ID (left blank; requires manual annotation)
7. `manual_broadtype`: Broad cell type category (left blank; requires manual annotation)
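
The three ontology fields cannot be derived from the expression data alone; they have to come from a curated mapping. The sketch below shows one way such a lookup could be applied once a cluster has been labeled by hand. The `CL_LOOKUP` table and `fill_ontology` helper are hypothetical; the entries are copied from the `coloncancer.csv` preview in the verification section of this notebook, and the label assigned to the example row is purely illustrative.

```python
# (manual_CLname, manual_CLID, manual_broadtype) keyed by manual_annotation.
# Values copied from the coloncancer.csv preview; extend as needed.
CL_LOOKUP = {
    "Epithelial cells": ("epithelial cell", "CL:0000066", "epithelial cell"),
    "Stromal cells": ("stromal cell", "CL:0000499", "stromal cell"),
    "Myeloids": ("myeloid cell", "CL:0000763", "myeloid cell"),
    "T cells": ("T cell", "CL:0000084", "t/nk cell"),
}


def fill_ontology(row, annotation):
    """Return a copy of row with the annotation and ontology fields set.

    Annotations missing from CL_LOOKUP leave the ontology fields blank.
    """
    cl_name, cl_id, broad = CL_LOOKUP.get(annotation, ("", "", ""))
    return {
        **row,
        "manual_annotation": annotation,
        "manual_CLname": cl_name,
        "manual_CLID": cl_id,
        "manual_broadtype": broad,
    }


row = {
    "dataset": "dataset_debug_restricted",
    "tissue": "unknown tissue",
    "marker": "COL3A1,COL1A2,COL1A1",  # shortened for illustration
    "manual_annotation": "Cluster 4",
    "manual_CLname": "",
    "manual_CLID": "",
    "manual_broadtype": "",
}
# Illustrative label only; the notebook does not annotate this cluster.
labeled = fill_ontology(row, "Stromal cells")
print(labeled["manual_CLID"])
```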

In [29]:
# Get the dataset name from the file name
dataset_name = os.path.basename(file_path).split('.')[0]  # 'dataset_debug_restricted'

# Try to infer tissue type from metadata if available
tissue = None

# Look for tissue information in adata.uns
if hasattr(adata, 'uns') and isinstance(adata.uns, dict):
    tissue_keys = [k for k in adata.uns.keys() if 'tissue' in k.lower()]
    if tissue_keys:
        tissue = str(adata.uns[tissue_keys[0]])

# Look for tissue information in adata.obs
if tissue is None:
    tissue_cols = [c for c in adata.obs.columns if 'tissue' in c.lower()]
    if tissue_cols:
        tissue = adata.obs[tissue_cols[0]].iloc[0]

# If we still couldn't find tissue information, use a placeholder
if tissue is None:
    tissue = "unknown tissue"

# Create a list to store our rows
csv_rows = []

# For each cluster, create a row in our CSV format
for cluster, markers in top_markers.items():
    # Convert marker list to comma-separated string
    marker_str = ",".join(markers)
    
    # Create a row
    row = {
        'dataset': dataset_name,
        'tissue': tissue,
        'marker': marker_str,
        'manual_annotation': f"Cluster {cluster}",  # Initial annotation based on cluster name
        'manual_CLname': "",  # Would need cell ontology mapping
        'manual_CLID': "",    # Would need cell ontology mapping
        'manual_broadtype': ""  # Would need broad cell type mapping
    }
    
    csv_rows.append(row)

# Create DataFrame
output_df = pd.DataFrame(csv_rows)

# Display the DataFrame
display(output_df)

Unnamed: 0,dataset,tissue,marker,manual_annotation,manual_CLname,manual_CLID,manual_broadtype
0,dataset_debug_restricted,unknown tissue,"SCGB2A2,XBP1,SCGB1D2,KRT18,TRPS1,TFF3,KRT8,MGP...",Cluster 0,,,
1,dataset_debug_restricted,unknown tissue,"IGFBP7,PCAT19,CD93,SPARCL1,CD59,GNG11,SPRY1,TM...",Cluster 11,,,
2,dataset_debug_restricted,unknown tissue,"AIF1,TYROBP,FCER1G,HLA-DRA,FTL,HLA-DPB1,HLA-DP...",Cluster 2,,,
3,dataset_debug_restricted,unknown tissue,"CRYAB,KRT7,KRT17,SFRP1,NFIB,FBXO32,KRT14,CALD1...",Cluster 5,,,
4,dataset_debug_restricted,unknown tissue,"COL3A1,COL1A2,COL1A1,CALD1,SPARC,COL6A2,COL6A1...",Cluster 4,,,
5,dataset_debug_restricted,unknown tissue,"RPS17,CD59,S100A6,RPL37A,TNFRSF12A,SOD2,CXCL1,...",Cluster 10,,,
6,dataset_debug_restricted,unknown tissue,"GEM,CXCL1,NFKBIA,CXCL3,CXCL2,HSP90AB1,SOD2,CCL...",Cluster 14,,,
7,dataset_debug_restricted,unknown tissue,"RPS15,POLR2L,MUC1,TOMM7,STC2,RPS2,CRIP1,RPL27A...",Cluster 8,,,
8,dataset_debug_restricted,unknown tissue,"FABP3,COX6C,ADIRF,RPS4X,SMIM22,TCEAL4,RPL41,MT...",Cluster 9,,,
9,dataset_debug_restricted,unknown tissue,"FTH1,LINC01238,RHOBTB3,KCNE4,PIP,CST5,MRPS30-D...",Cluster 3,,,


## Saving the Final CSV Output

Next, we save the DataFrame to a CSV file, omitting the index so that the columns match `coloncancer.csv`.
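
Because the `marker` field itself contains commas, it has to be quoted in the CSV so that it stays a single column; `DataFrame.to_csv` does this by default. The standard-library sketch below illustrates the same behavior on an in-memory buffer, using one shortened row as example data.

```python
import csv
import io

columns = ["dataset", "tissue", "marker", "manual_annotation",
           "manual_CLname", "manual_CLID", "manual_broadtype"]
row = {
    "dataset": "dataset_debug_restricted",
    "tissue": "unknown tissue",
    "marker": "SCGB2A2,XBP1,SCGB1D2",  # shortened for illustration
    "manual_annotation": "Cluster 0",
    "manual_CLname": "",
    "manual_CLID": "",
    "manual_broadtype": "",
}

# Write one row; the default QUOTE_MINIMAL quotes only fields that need it.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns)
writer.writeheader()
writer.writerow(row)

# Read it back and recover the marker genes as a list.
buffer.seek(0)
restored = next(csv.DictReader(buffer))
markers = restored["marker"].split(",")
print(markers)
```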

In [30]:
# Define the output file path
output_file = f"{dataset_name}_formatted.csv"

# Save the DataFrame to CSV
output_df.to_csv(output_file, index=False)

print(f"Successfully saved the formatted data to {output_file}")

# Let's also display the first few rows of the output
print("\nPreview of the output CSV:")
display(output_df.head())

Successfully saved the formatted data to dataset_debug_restricted_formatted.csv

Preview of the output CSV:


Unnamed: 0,dataset,tissue,marker,manual_annotation,manual_CLname,manual_CLID,manual_broadtype
0,dataset_debug_restricted,unknown tissue,"SCGB2A2,XBP1,SCGB1D2,KRT18,TRPS1,TFF3,KRT8,MGP...",Cluster 0,,,
1,dataset_debug_restricted,unknown tissue,"IGFBP7,PCAT19,CD93,SPARCL1,CD59,GNG11,SPRY1,TM...",Cluster 11,,,
2,dataset_debug_restricted,unknown tissue,"AIF1,TYROBP,FCER1G,HLA-DRA,FTL,HLA-DPB1,HLA-DP...",Cluster 2,,,
3,dataset_debug_restricted,unknown tissue,"CRYAB,KRT7,KRT17,SFRP1,NFIB,FBXO32,KRT14,CALD1...",Cluster 5,,,
4,dataset_debug_restricted,unknown tissue,"COL3A1,COL1A2,COL1A1,CALD1,SPARC,COL6A2,COL6A1...",Cluster 4,,,


## Summary and Verification

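To verify the result, we load the original `coloncancer.csv` and check that our output contains every column it defines. This check covers column names only; the ontology fields in our output are still blank.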

In [31]:
# Load the original coloncancer.csv for comparison
colon_cancer = pd.read_csv('coloncancer.csv')

# Display the original CSV
print("Original coloncancer.csv format:")
display(colon_cancer.head())

print("\nOur converted format:")
display(output_df.head())

# Check if we have all the required columns
required_cols = colon_cancer.columns
missing_cols = [col for col in required_cols if col not in output_df.columns]

if missing_cols:
    print(f"\nWarning: Missing columns in our output: {missing_cols}")
else:
    print("\nSuccess! Our output contains all the required columns.")
    # Report completion only when the column check has passed
    print("\nConversion completed. The h5ad file has been successfully converted to match the coloncancer.csv format.")

Original coloncancer.csv format:


Unnamed: 0,dataset,tissue,marker,manual_annotation,manual_CLname,manual_CLID,manual_broadtype
0,coloncancer,colon cancer,"KRT18,KRT8,EPCAM,CLDN3,KRT19,TFF3,CLDN4,TSPAN8...",colon cancer cell,,,malignant cell
1,coloncancer,colon cancer,"PHGR1,GUCA2A,MT1G,PIGR,MT1E,GUCA2B,FABP1,LGALS...",Epithelial cells,epithelial cell,CL:0000066,epithelial cell
2,coloncancer,colon cancer,"IGFBP7,SPARC,CALD1,COL1A2,COL3A1,COL1A1,DCN,CO...",Stromal cells,stromal cell,CL:0000499,stromal cell
3,coloncancer,colon cancer,"TYROBP,FCER1G,IL1B,CCL3,CXCL8,CCL3L3,G0S2,S100...",Myeloids,myeloid cell,CL:0000763,myeloid cell
4,coloncancer,colon cancer,"CD3D,TRAC,CCL5,CD7,CD3E,TRBC2,CD2,KLRB1,CD52,T...",T cells,T cell,CL:0000084,t/nk cell



Our converted format:


Unnamed: 0,dataset,tissue,marker,manual_annotation,manual_CLname,manual_CLID,manual_broadtype
0,dataset_debug_restricted,unknown tissue,"SCGB2A2,XBP1,SCGB1D2,KRT18,TRPS1,TFF3,KRT8,MGP...",Cluster 0,,,
1,dataset_debug_restricted,unknown tissue,"IGFBP7,PCAT19,CD93,SPARCL1,CD59,GNG11,SPRY1,TM...",Cluster 11,,,
2,dataset_debug_restricted,unknown tissue,"AIF1,TYROBP,FCER1G,HLA-DRA,FTL,HLA-DPB1,HLA-DP...",Cluster 2,,,
3,dataset_debug_restricted,unknown tissue,"CRYAB,KRT7,KRT17,SFRP1,NFIB,FBXO32,KRT14,CALD1...",Cluster 5,,,
4,dataset_debug_restricted,unknown tissue,"COL3A1,COL1A2,COL1A1,CALD1,SPARC,COL6A2,COL6A1...",Cluster 4,,,



Success! Our output contains all the required columns.

Conversion completed. The h5ad file has been successfully converted to match the coloncancer.csv format.
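
The check above only tests that each required column name is present, so it would not catch extra columns or a different column order. A slightly stricter comparison could look like the sketch below. `compare_columns` is a hypothetical helper, and the example lists stand in for `colon_cancer.columns` and `output_df.columns`.

```python
def compare_columns(expected, actual):
    """Report missing and extra columns and whether the order matches."""
    expected, actual = list(expected), list(actual)
    return {
        "missing": [c for c in expected if c not in actual],
        "extra": [c for c in actual if c not in expected],
        "same_order": expected == actual,
    }


expected = ["dataset", "tissue", "marker", "manual_annotation",
            "manual_CLname", "manual_CLID", "manual_broadtype"]
# A CSV written with its index and read back gains an extra leading column.
actual = ["Unnamed: 0"] + expected
report = compare_columns(expected, actual)
print(report)
```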
