# Spatial transcriptomics unveils the in situ cellular and molecular hallmarks of the lung in fatal COVID-19

# Process the Human Lung Cell Atlas (HLCA) core object for use as a reference dataset for deconvolution and mapping of Visium ST data

**Author:** Carlos A. Garcia-Prieto

* This notebook explains how the HLCA core AnnData object is processed for use as a reference single-cell dataset for deconvolution and mapping of the studied Visium ST samples.
* See the [HLCA paper](https://doi.org/10.1038/s41591-023-02327-2) for more details. The [HLCA core AnnData object](https://cellxgene.cziscience.com/collections/6f6d381a-7701-4781-935c-db10d30de293) can be downloaded from CELLxGENE.


## Import modules

In [1]:
import anndata
import matplotlib.pyplot as plt
import scanpy as sc
import pandas as pd
import seaborn as sns
import scanpy.external as sce
import numpy as np

In [2]:
pd.set_option('display.max_rows', 250)
pd.set_option('display.max_columns', 500)

In [3]:
#Read HLCA core AnnData object (read_h5ad replaces the deprecated anndata.read alias)
HLCA_core = anndata.read_h5ad('/Users/carlosgarciaprieto/Proyectos_IJC/Spatial/COVID/Results/Python/HLCA_publication/HLCA_core.h5ad')

In [4]:
#Explore HLCA core
HLCA_core

AnnData object with n_obs × n_vars = 584944 × 28024
    obs: 'suspension_type', 'donor_id', 'is_primary_data', 'assay_ontology_term_id', 'cell_type_ontology_term_id', 'development_stage_ontology_term_id', 'disease_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'tissue_ontology_term_id', 'organism_ontology_term_id', 'sex_ontology_term_id', 'BMI', 'age_or_mean_of_age_range', 'age_range', 'anatomical_region_ccf_score', 'ann_coarse_for_GWAS_and_modeling', 'ann_finest_level', 'ann_level_1', 'ann_level_2', 'ann_level_3', 'ann_level_4', 'ann_level_5', 'cause_of_death', 'dataset', 'entropy_dataset_leiden_3', 'entropy_original_ann_level_1_leiden_3', 'entropy_original_ann_level_2_clean_leiden_3', 'entropy_original_ann_level_3_clean_leiden_3', 'entropy_subject_ID_leiden_3', 'fresh_or_frozen', 'leiden_1', 'leiden_2', 'leiden_3', 'leiden_4', 'leiden_5', 'log10_total_counts', 'lung_condition', 'mixed_ancestry', 'n_genes_detected', 'original_ann_highest_res', 'original_ann_level_1', '

In [5]:
HLCA_core.obs

Unnamed: 0,suspension_type,donor_id,is_primary_data,assay_ontology_term_id,cell_type_ontology_term_id,development_stage_ontology_term_id,disease_ontology_term_id,self_reported_ethnicity_ontology_term_id,tissue_ontology_term_id,organism_ontology_term_id,sex_ontology_term_id,BMI,age_or_mean_of_age_range,age_range,anatomical_region_ccf_score,ann_coarse_for_GWAS_and_modeling,ann_finest_level,ann_level_1,ann_level_2,ann_level_3,ann_level_4,ann_level_5,cause_of_death,dataset,entropy_dataset_leiden_3,entropy_original_ann_level_1_leiden_3,entropy_original_ann_level_2_clean_leiden_3,entropy_original_ann_level_3_clean_leiden_3,entropy_subject_ID_leiden_3,fresh_or_frozen,leiden_1,leiden_2,leiden_3,leiden_4,leiden_5,log10_total_counts,lung_condition,mixed_ancestry,n_genes_detected,original_ann_highest_res,original_ann_level_1,original_ann_level_2,original_ann_level_3,original_ann_level_4,original_ann_level_5,original_ann_nonharmonized,reannotation_type,reference_genome,sample,scanvi_label,sequencing_platform,size_factors,smoking_status,study,subject_type,tissue_dissociation_protocol,tissue_level_2,tissue_level_3,tissue_sampling_method,cell_type,assay,disease,organism,sex,tissue,self_reported_ethnicity,development_stage
GCGACCATCCCTAACC_SC22,cell,homosapiens_None_2023_None_sikkemalisa_001_d10...,False,EFO:0009899,CL:0000583,HsapDv:0000143,PATO:0000461,HANCESTRO:0005,UBERON:0008946,NCBITaxon:9606,PATO:0000383,,49.0,,0.97,Alveolar macrophages,Alveolar macrophages,Immune,Myeloid,Macrophages,Alveolar macrophages,,intracranial hemorrhage,Misharin_Budinger_2018,1.664049,0.177918,0.182714,0.277385,2.602398,fresh,1,1.0,1.0.0,1.0.0.0,,4.016197,Healthy,,2438,4,Immune,Myeloid,Macrophages,Alveolar macrophages,,Alveolar macrophages,Correctly annotated,Homo_sapiens.GRCh38.84,SC22,Macrophages,Illumina HiSeq 4000,0.998238,active,Misharin_Budinger_2018,organ_donor,Collagenase D + DNAse,parenchyma lower lobe,,donor_lung,alveolar macrophage,10x 3' v2,normal,Homo sapiens,female,lung parenchyma,European,49-year-old human stage
P2_1_GCGCAACCAGTTAACC,cell,homosapiens_None_2023_None_sikkemalisa_001_d10...,False,EFO:0009899,CL:0000623,HsapDv:0000140,PATO:0000461,HANCESTRO:0005,UBERON:0008946,NCBITaxon:9606,PATO:0000384,33.100,46.0,,0.97,Innate lymphoid cell NK,NK cells,Immune,Lymphoid,Innate lymphoid cell NK,NK cells,,,Krasnow_2020,1.461924,0.013459,0.015608,0.326962,2.729274,fresh,1,1.3,1.3.0,1.3.0.0,,3.203848,Healthy (tumor adjacent),,700,4,Immune,Lymphoid,Innate lymphoid cell NK,NK cells,,Natural Killer,Correctly annotated,Homo_sapiens.GRCh38.84,distal 2,Non-T/B cells,Illumina NovaSeq 6000,0.172927,never,Krasnow_2020,alive_disease,Collagenase + Elastase + DNAse,parenchyma right middle lobe,,surgical_resection,natural killer cell,10x 3' v2,normal,Homo sapiens,male,lung parenchyma,European,46-year-old human stage
GCTCTGTAGTGCTGCC_SC27,cell,homosapiens_None_2023_None_sikkemalisa_001_d10...,False,EFO:0009899,CL:0002063,HsapDv:0000141,PATO:0000461,HANCESTRO:0005,UBERON:0008946,NCBITaxon:9606,PATO:0000383,,47.0,,0.97,AT2,AT2,Epithelial,Alveolar epithelium,AT2,,,intracranial hemorrhage,Misharin_Budinger_2018,1.381972,0.280016,0.208872,0.388602,2.006138,fresh,0,0.1,0.1.2,0.1.2.0,,3.493458,Healthy,,1200,3,Epithelial,Alveolar epithelium,AT2,,,Alveolar epithelial type 2 cells,Correctly annotated,Homo_sapiens.GRCh38.84,SC27,AT2,Illumina HiSeq 4000,0.343101,active,Misharin_Budinger_2018,organ_donor,Collagenase D + DNAse,parenchyma lower lobe,,donor_lung,type II pneumocyte,10x 3' v2,normal,Homo sapiens,female,lung parenchyma,European,47-year-old human stage
P2_8_TTAGGACGTTCAGGCC,cell,homosapiens_None_2023_None_sikkemalisa_001_d10...,False,EFO:0009899,CL:0000583,HsapDv:0000140,PATO:0000461,HANCESTRO:0005,UBERON:0001005,NCBITaxon:9606,PATO:0000384,33.100,46.0,,0.81,Alveolar macrophages,Alveolar Mph CCL3+,Immune,Myeloid,Macrophages,Alveolar macrophages,Alveolar Mph CCL3+,,Krasnow_2020,1.005008,0.002307,0.019563,0.062629,1.192024,fresh,1,1.0,1.0.2,1.0.2.0,,4.242591,Healthy (tumor adjacent),,3286,4,Immune,Myeloid,Macrophages,Alveolar macrophages,,Macrophage,Underannotated,Homo_sapiens.GRCh38.84,medial 2,Macrophages,Illumina NovaSeq 6000,1.571874,never,Krasnow_2020,alive_disease,Collagenase + Elastase + DNAse,distal lobular airways,distal airways middle right lobe,surgical_resection,alveolar macrophage,10x 3' v2,normal,Homo sapiens,male,respiratory airway,European,46-year-old human stage
CTTGATTGTCAGTTTG_T164,cell,homosapiens_None_2023_None_sikkemalisa_001_d10...,False,EFO:0009922,CL:0002633,HsapDv:0000160,PATO:0000461,HANCESTRO:0005,UBERON:0001005,NCBITaxon:9606,PATO:0000384,28.903,66.0,,0.36,Basal,Suprabasal,Epithelial,Airway epithelium,Basal,Suprabasal,,,Seibold_2020_10Xv3,1.377439,0.006080,0.018198,0.159813,3.217918,frozen,0,0.0,0.0.1,0.0.1.1,,4.733551,Healthy,,6891,4,Epithelial,Airway epithelium,Basal,KRT8 high basal,,KRT8.high,Misannotated,Homo_sapiens.GRCh38.84,T164,Basal,Illumina NovaSeq 6000,7.520303,never,Seibold_2020,organ_donor,Cold protease overnight,trachea,,donor_lung,respiratory basal cell,10x 3' v3,normal,Homo sapiens,male,respiratory airway,European,66-year-old human stage
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ACCTTTACATTAACCG_T120,cell,homosapiens_None_2023_None_sikkemalisa_001_d10...,False,EFO:0009899,CL:0002633,HsapDv:0000151,PATO:0000461,HANCESTRO:0005,UBERON:0001005,NCBITaxon:9606,PATO:0000383,22.039,57.0,,0.36,Basal,Suprabasal,Epithelial,Airway epithelium,Basal,Suprabasal,,,Seibold_2020_10Xv2,1.631233,0.016913,0.273477,0.387291,2.790810,frozen,0,0.0,0.0.2,0.0.2.3,,4.184152,Healthy,,3177,4,Epithelial,Airway epithelium,Basal,KRT8 high basal,,KRT8.high,Misannotated,Homo_sapiens.GRCh38.84,T120,Basal,Illumina NovaSeq 6000,1.801739,active,Seibold_2020,organ_donor,Cold protease overnight,trachea,,donor_lung,respiratory basal cell,10x 3' v2,normal,Homo sapiens,female,respiratory airway,European,57-year-old human stage
CATTATCTCCATGAAC_F01639,cell,homosapiens_None_2023_None_sikkemalisa_001_d10...,False,EFO:0011025,CL:0002399,HsapDv:0000147,PATO:0000461,HANCESTRO:0005,UBERON:0008946,NCBITaxon:9606,PATO:0000384,33.800,53.0,,0.97,DC2,DC2,Immune,Myeloid,Dendritic cells,DC2,,,Banovich_Kropski_2020,2.148653,0.099850,0.091353,1.151454,3.716129,fresh,1,1.2,1.2.1,1.2.1.0,,4.007705,Healthy,,2590,4,Immune,Myeloid,Dendritic cells,DC1,,cDC1,Misannotated,Homo_sapiens.GRCh38.84,F01639,Dendritic cells,Illumina NovaSeq 6000 S2,1.169098,active,Banovich_Kropski_2020,organ_donor,Dispase + collagenase,,,donor_lung,CD1c-positive myeloid dendritic cell,10x 5' v1,normal,Homo sapiens,male,lung parenchyma,European,53-year-old human stage
AGGCCGTGTGTGACCC-SC56,cell,homosapiens_None_2023_None_sikkemalisa_001_d10...,False,EFO:0009899,CL:0002063,HsapDv:0000151,PATO:0000461,HANCESTRO:0005,UBERON:0008946,NCBITaxon:9606,PATO:0000383,,57.0,,0.97,AT2,AT2,Epithelial,Alveolar epithelium,AT2,,,,Lafyatis_Rojas_2019_10Xv2,1.381972,0.280016,0.208872,0.388602,2.006138,fresh,0,0.1,0.1.2,0.1.2.0,,4.076422,Healthy,,2848,3,Epithelial,Alveolar epithelium,AT2,,,6 AT2,Correctly annotated,Homo_sapiens.GRCh38.93,SC56,AT2,Illumina NextSeq 500,1.401193,never,Lafyatis_Rojas_2019,organ_donor,Collagenase A + DNAse,,,donor_lung,type II pneumocyte,10x 3' v2,normal,Homo sapiens,female,lung parenchyma,European,57-year-old human stage
CGATGGCAGCAGGCTA-1-2,cell,homosapiens_None_2023_None_sikkemalisa_001_d10...,False,EFO:0011025,CL:0000158,HsapDv:0000135,PATO:0000461,HANCESTRO:0005,UBERON:0000004,NCBITaxon:9606,PATO:0000384,19.900,41.0,,0.00,Secretory,Club (nasal),Epithelial,Airway epithelium,Secretory,Club,Club (nasal),,Jain_Misharin_2021_10Xv1,1.404169,0.004833,0.014457,0.700806,2.782702,fresh,0,0.2,0.2.1,0.2.1.0,,3.653309,Healthy,,1899,4,Epithelial,Airway epithelium,Secretory,Club,,Club,Underannotated,Homo_sapiens.GRCh38.84,SC174_SC173,Secretory,Illumina NovaSeq 6000,0.746264,never,Jain_Misharin_2021,alive_healthy,Cold protease 1h,inferior turbinate,inferior turbinate right nostril,scraping,club cell,10x 5' v1,normal,Homo sapiens,male,nose,European,41-year-old human stage


In [6]:
HLCA_core.obs.columns

Index(['suspension_type', 'donor_id', 'is_primary_data',
       'assay_ontology_term_id', 'cell_type_ontology_term_id',
       'development_stage_ontology_term_id', 'disease_ontology_term_id',
       'self_reported_ethnicity_ontology_term_id', 'tissue_ontology_term_id',
       'organism_ontology_term_id', 'sex_ontology_term_id', 'BMI',
       'age_or_mean_of_age_range', 'age_range', 'anatomical_region_ccf_score',
       'ann_coarse_for_GWAS_and_modeling', 'ann_finest_level', 'ann_level_1',
       'ann_level_2', 'ann_level_3', 'ann_level_4', 'ann_level_5',
       'cause_of_death', 'dataset', 'entropy_dataset_leiden_3',
       'entropy_original_ann_level_1_leiden_3',
       'entropy_original_ann_level_2_clean_leiden_3',
       'entropy_original_ann_level_3_clean_leiden_3',
       'entropy_subject_ID_leiden_3', 'fresh_or_frozen', 'leiden_1',
       'leiden_2', 'leiden_3', 'leiden_4', 'leiden_5', 'log10_total_counts',
       'lung_condition', 'mixed_ancestry', 'n_genes_detected',
       'o

In [7]:
HLCA_core.obs["assay"].value_counts()

10x 3' v2    328263
10x 5' v1    130370
10x 3' v3     86309
10x 5' v2     37081
10x 3' v1      2921
Name: assay, dtype: int64

In [8]:
HLCA_core.obs["sequencing_platform"].value_counts()

Illumina NovaSeq 6000                         199172
Illumina HiSeq 4000                           159395
Illumina NextSeq 500                           98668
Illumina NovaSeq 6000 S4                       57225
Illumina NovaSeq 6000 S1                       41001
Illumina NovaSeq 6000 S2                       21371
Illumina NovaSeq 6000; Illumina HiSeq 4000      5815
Illumina NovaSeq 6000 SP                        1920
Name: sequencing_platform, dtype: int64

In [9]:
HLCA_core.obs["donor_id"].value_counts()

homosapiens_None_2023_None_sikkemalisa_001_d10_1101_2022_03_10_483747NU_CZI02           36814
homosapiens_None_2023_None_sikkemalisa_001_d10_1101_2022_03_10_483747donor 2            28792
homosapiens_None_2023_None_sikkemalisa_001_d10_1101_2022_03_10_483747NU_CZI01           28029
homosapiens_None_2023_None_sikkemalisa_001_d10_1101_2022_03_10_483747donor 3            24667
homosapiens_None_2023_None_sikkemalisa_001_d10_1101_2022_03_10_483747GRO-09             16187
homosapiens_None_2023_None_sikkemalisa_001_d10_1101_2022_03_10_483747D353               15380
homosapiens_None_2023_None_sikkemalisa_001_d10_1101_2022_03_10_483747390C               14728
homosapiens_None_2023_None_sikkemalisa_001_d10_1101_2022_03_10_483747356C               13557
homosapiens_None_2023_None_sikkemalisa_001_d10_1101_2022_03_10_483747GRO-04             13476
homosapiens_None_2023_None_sikkemalisa_001_d10_1101_2022_03_10_483747D372               13264
homosapiens_None_2023_None_sikkemalisa_001_d10_1101_2022_03_

In [10]:
HLCA_core.obs["tissue"].value_counts()

lung parenchyma       333468
respiratory airway    173197
nose                   78279
Name: tissue, dtype: int64

In [11]:
HLCA_core.obs[["tissue","anatomical_region_ccf_score"]].value_counts()

tissue              anatomical_region_ccf_score
lung parenchyma     0.97                           333468
nose                0.00                            78279
respiratory airway  0.72                            74464
                    0.36                            54004
                    0.81                            27543
                    0.50                             9417
                    0.64                             7769
dtype: int64

## We select only lung parenchyma cells to match our studied Visium ST samples

In [12]:
HLCA_core_parenchyma = HLCA_core[HLCA_core.obs["tissue"] == "lung parenchyma"].copy()
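The tissue filter above is a boolean mask on `obs`. A minimal sketch on a toy table (the barcodes and the `toy_obs` name are hypothetical, not part of the HLCA object) illustrates a simple sanity check: the subset size should equal the `value_counts()` entry for the selected tissue, which is 333468 cells for lung parenchyma in the output above.

```python
import pandas as pd

# Toy stand-in for HLCA_core.obs; barcodes and values are made up for illustration.
toy_obs = pd.DataFrame(
    {"tissue": ["lung parenchyma", "nose", "lung parenchyma", "respiratory airway"]},
    index=["cell_0", "cell_1", "cell_2", "cell_3"],
)

# Boolean mask selecting lung parenchyma cells, as in the cell above.
is_parenchyma = toy_obs["tissue"] == "lung parenchyma"
toy_parenchyma = toy_obs[is_parenchyma].copy()

# The subset size should match the per-tissue count before subsetting.
expected_n = int(toy_obs["tissue"].value_counts()["lung parenchyma"])
assert len(toy_parenchyma) == expected_n
print(f"{len(toy_parenchyma)} of {len(toy_obs)} cells kept")
```

The same comparison can be run on the real objects by checking `HLCA_core_parenchyma.n_obs` against the count reported for lung parenchyma.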

In [13]:
#Explore HLCA parenchyma object
HLCA_core_parenchyma

AnnData object with n_obs × n_vars = 333468 × 28024
    obs: 'suspension_type', 'donor_id', 'is_primary_data', 'assay_ontology_term_id', 'cell_type_ontology_term_id', 'development_stage_ontology_term_id', 'disease_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'tissue_ontology_term_id', 'organism_ontology_term_id', 'sex_ontology_term_id', 'BMI', 'age_or_mean_of_age_range', 'age_range', 'anatomical_region_ccf_score', 'ann_coarse_for_GWAS_and_modeling', 'ann_finest_level', 'ann_level_1', 'ann_level_2', 'ann_level_3', 'ann_level_4', 'ann_level_5', 'cause_of_death', 'dataset', 'entropy_dataset_leiden_3', 'entropy_original_ann_level_1_leiden_3', 'entropy_original_ann_level_2_clean_leiden_3', 'entropy_original_ann_level_3_clean_leiden_3', 'entropy_subject_ID_leiden_3', 'fresh_or_frozen', 'leiden_1', 'leiden_2', 'leiden_3', 'leiden_4', 'leiden_5', 'log10_total_counts', 'lung_condition', 'mixed_ancestry', 'n_genes_detected', 'original_ann_highest_res', 'original_ann_level_1', '

In [14]:
#Explore finest annotation level
HLCA_core_parenchyma.obs["ann_finest_level"].value_counts()

Alveolar macrophages             65291
AT2                              59457
Monocyte-derived Mph             26057
CD4 T cells                      15702
Classical monocytes              15454
CD8 T cells                      14932
NK cells                         14878
EC general capillary             14329
Multiciliated (non-nasal)        10180
Adventitial fibroblasts           9739
Non-classical monocytes           8415
AT1                               7411
EC arterial                       6866
EC venous pulmonary               6194
Alveolar Mph CCL3+                5917
EC aerocyte capillary             5881
DC2                               5623
Mast cells                        4754
Interstitial Mph perivascular     4604
Alveolar fibroblasts              4349
pre-TB secretory                  3964
Lymphatic EC mature               3777
B cells                           2415
Pericytes                         2248
Smooth muscle                     1939
EC venous systemic       

## We keep only cell types (finest annotation level) with at least 150 cells in the HLCA core lung parenchyma object, for more robust reference model training

In [15]:
min_cells = 150

In [16]:
#Count cells per cell type at the finest annotation level
counts = HLCA_core_parenchyma.obs.groupby('ann_finest_level')['ann_finest_level'].count()

In [17]:
#Keep cell types with at least min_cells cells
minimumValueCells = counts[counts>=min_cells]

In [18]:
#Subset to the retained cell types (this returns a view of the parent object)
HLCA_core_parenchyma_filter = HLCA_core_parenchyma[HLCA_core_parenchyma.obs['ann_finest_level'].isin(minimumValueCells.index)]
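The minimum-cell filter can be sketched on a toy label column to make the logic explicit. The labels, the threshold of 3, and the variable names below are hypothetical and chosen only for illustration; `value_counts()` is used here as an equivalent alternative to the `groupby(...).count()` call in the notebook.

```python
import pandas as pd

# Toy stand-in for obs['ann_finest_level']; the labels and counts are made up.
labels = pd.Series(
    ["AT2"] * 5 + ["AT1"] * 3 + ["Rare type"] * 1,
    name="ann_finest_level",
    dtype="category",
)

min_cells_toy = 3  # illustrative threshold; the notebook uses 150

# Count cells per type and keep the types that reach the threshold.
counts_toy = labels.value_counts()
kept_types = list(counts_toy[counts_toy >= min_cells_toy].index)

# Boolean mask over cells, then report what the filter removed.
keep_mask = labels.isin(kept_types)
filtered = labels[keep_mask]
n_removed = int((~keep_mask).sum())
print(f"kept {len(filtered)} cells in {len(kept_types)} types, removed {n_removed}")
```

On the real object, the shapes reported in this notebook (333468 cells before filtering and 333011 after) imply that the filter removes 457 cells.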

In [19]:
HLCA_core_parenchyma_filter

View of AnnData object with n_obs × n_vars = 333011 × 28024
    obs: 'suspension_type', 'donor_id', 'is_primary_data', 'assay_ontology_term_id', 'cell_type_ontology_term_id', 'development_stage_ontology_term_id', 'disease_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'tissue_ontology_term_id', 'organism_ontology_term_id', 'sex_ontology_term_id', 'BMI', 'age_or_mean_of_age_range', 'age_range', 'anatomical_region_ccf_score', 'ann_coarse_for_GWAS_and_modeling', 'ann_finest_level', 'ann_level_1', 'ann_level_2', 'ann_level_3', 'ann_level_4', 'ann_level_5', 'cause_of_death', 'dataset', 'entropy_dataset_leiden_3', 'entropy_original_ann_level_1_leiden_3', 'entropy_original_ann_level_2_clean_leiden_3', 'entropy_original_ann_level_3_clean_leiden_3', 'entropy_subject_ID_leiden_3', 'fresh_or_frozen', 'leiden_1', 'leiden_2', 'leiden_3', 'leiden_4', 'leiden_5', 'log10_total_counts', 'lung_condition', 'mixed_ancestry', 'n_genes_detected', 'original_ann_highest_res', 'original_ann_lev

In [20]:
HLCA_core_parenchyma_filter.obs["ann_finest_level"].value_counts()

Alveolar macrophages             65291
AT2                              59457
Monocyte-derived Mph             26057
CD4 T cells                      15702
Classical monocytes              15454
CD8 T cells                      14932
NK cells                         14878
EC general capillary             14329
Multiciliated (non-nasal)        10180
Adventitial fibroblasts           9739
Non-classical monocytes           8415
AT1                               7411
EC arterial                       6866
EC venous pulmonary               6194
Alveolar Mph CCL3+                5917
EC aerocyte capillary             5881
DC2                               5623
Mast cells                        4754
Interstitial Mph perivascular     4604
Alveolar fibroblasts              4349
pre-TB secretory                  3964
Lymphatic EC mature               3777
B cells                           2415
Pericytes                         2248
Smooth muscle                     1939
EC venous systemic       

## Save HLCA filtered object for reference model training

In [21]:
#Write filtered lung parenchyma annData object
HLCA_core_parenchyma_filter.write_h5ad('HLCA_publication/HLCA_core_parenchyma_filter.h5ad', compression='gzip')