# Normal breast epithelia comprise three major populations plus transient intermediates

This notebook attempts to reproduce the analysis behind the source paper's first result: normal breast tissue contains three major cell types, namely basal, luminal progenitor, and mature luminal.

The analysis uses single-cell RNA sequencing data from 11 subjects, identified as follows:

|"Sample" identifier|File identifier|
|---|---|
|N-1105-epi|`GSM4909260_N-N1105-Epi`|
|N-0280-epi|`GSM4909255_N-N280-Epi`|
|N-0230.16-epi|`GSM4909264_N-N1B-Epi`|
|N-0408-epi|`GSM4909259_N-NE-Epi`|
|N-1469-epi|`GSM4909258_N-NF-Epi`|
|N-0123-epi|`GSM4909267_N-MH0023-Epi`|
|N-0064-epi|`GSM4909313_N-MH0064-Epi`|
|N-0093-epi|`GSM4909256_N-PM0095-Epi`|
|N-0342-epi|`GSM4909269_N-PM0342-Epi`|
|N-0372-epi|`GSM4909275_N-PM0372-Epi`|
|N-0275-epi|`GSM4909273_N-MH275-Epi`|

The table was derived from two sources:
* Figure 1c of the source article lists the "Sample" identifiers used. (The quotation marks flag that, in the article, a sample corresponds to a human subject.)
* Supplementary Table 1 of the source article maps these identifiers to file identifiers.
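
The mapping in the table can also be kept in code, so that the list of cached files is derived rather than typed by hand. This is a minimal sketch: the name `SAMPLE_TO_FILE_ID` and the helper `h5ad_filename` are illustrative and not part of the project, and the `.h5ad` suffix is assumed from the cached filenames used later in this notebook.

```python
# Mapping copied from the table above: article "Sample" identifier -> file identifier.
SAMPLE_TO_FILE_ID = {
    "N-1105-epi": "GSM4909260_N-N1105-Epi",
    "N-0280-epi": "GSM4909255_N-N280-Epi",
    "N-0230.16-epi": "GSM4909264_N-N1B-Epi",
    "N-0408-epi": "GSM4909259_N-NE-Epi",
    "N-1469-epi": "GSM4909258_N-NF-Epi",
    "N-0123-epi": "GSM4909267_N-MH0023-Epi",
    "N-0064-epi": "GSM4909313_N-MH0064-Epi",
    "N-0093-epi": "GSM4909256_N-PM0095-Epi",
    "N-0342-epi": "GSM4909269_N-PM0342-Epi",
    "N-0372-epi": "GSM4909275_N-PM0372-Epi",
    "N-0275-epi": "GSM4909273_N-MH275-Epi",
}


def h5ad_filename(sample_id: str) -> str:
    """Return the cached AnnData filename for a sample (hypothetical helper)."""
    return f"{SAMPLE_TO_FILE_ID[sample_id]}.h5ad"


# Dictionaries preserve insertion order, so this follows the table's row order.
filenames = [h5ad_filename(sample_id) for sample_id in SAMPLE_TO_FILE_ID]
```

Keeping the mapping in one place makes it easier to spot a mistyped identifier than comparing two hand-written lists.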

## 1. Additional preprocessing
For this finding, the authors relied on the basal, luminal progenitor, and mature luminal gene signatures published by Lim *et al.* (2009) [2]. Using these signatures, they classified the sequenced cells as belonging to one of these three populations or to "other".
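
To make the idea of signature-based classification concrete, here is a toy sketch. It is not the authors' method, which this section does not spell out. It assumes that a cell is scored by the mean expression of each signature's genes and falls into "other" when no score reaches a threshold. The function names, the `min_score` threshold, and the gene names (`GENE_A` and so on) are placeholders.

```python
from statistics import mean


def signature_score(expression: dict[str, float], signature: set[str]) -> float:
    """Mean expression of the signature genes measured in this cell (0.0 if none)."""
    values = [expression[gene] for gene in signature if gene in expression]
    return mean(values) if values else 0.0


def classify_cell(
    expression: dict[str, float],
    signatures: dict[str, set[str]],
    min_score: float = 1.0,  # assumed cutoff, chosen only for illustration
) -> str:
    """Return the best-scoring population, or 'other' if no score reaches min_score."""
    scores = {name: signature_score(expression, genes) for name, genes in signatures.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= min_score else "other"


# Placeholder signatures; the real gene lists come from the Lim et al. files.
toy_signatures = {
    "basal": {"GENE_A", "GENE_B"},
    "lp": {"GENE_C"},
    "ml": {"GENE_D"},
}

toy_cell = {"GENE_A": 3.0, "GENE_B": 1.0, "GENE_C": 0.5}
label = classify_cell(toy_cell, toy_signatures)
```

A real analysis would normally score normalized expression across all cells at once rather than one dictionary per cell; the per-cell form is used here only to keep the logic visible.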

To reproduce their analysis, I use the following supplementary files from Lim *et al.*, which contain the gene signatures used to classify the cells:

|Population|Filename|
|---|---|
|basal|`41591_2009_BFnm2000_MOESM13_ESM.xls`|
|luminal progenitor (lp)|`41591_2009_BFnm2000_MOESM14_ESM.xls`|
|mature luminal (ml)|`41591_2009_BFnm2000_MOESM15_ESM.xls`|

In [3]:
gene_signature_filenames = {
    'basal': '41591_2009_BFnm2000_MOESM13_ESM.xls',
    'lp': '41591_2009_BFnm2000_MOESM14_ESM.xls',
    'ml': '41591_2009_BFnm2000_MOESM15_ESM.xls',
}

In [6]:
import pandas as pd
from signals_in_the_noise.utilities.storage import get_resources_path

In [8]:
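# Requires `gse` from the GSE161529 cell (In [1]) and the optional `xlrd` package to read .xls files.
df = pd.read_excel(get_resources_path(gse.STUDY_ID + '/' + gene_signature_filenames['basal']))
gse.STUDY_ID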

ImportError: Missing optional dependency 'xlrd'. Install xlrd >= 2.0.1 for xls Excel support Use pip or conda to install xlrd.
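
The `ImportError` above arises because pandas hands legacy `.xls` parsing to the optional `xlrd` package, which is not installed in this environment. A minimal sketch of a check that could run before `pd.read_excel` follows; `has_module` is a hypothetical helper, not part of pandas or of this project.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module called `name` can be found for import."""
    return importlib.util.find_spec(name) is not None


if not has_module("xlrd"):
    # Same remedy the pandas error message suggests.
    print("xlrd is missing: install it with `pip install xlrd` or `conda install xlrd`, then re-run the cell.")
```

Checking up front gives a short, readable message instead of a traceback from deep inside pandas.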

In [1]:
from signals_in_the_noise.preprocessing.data_gse161529 import GSE161529

gse = GSE161529()

2025-07-10 20:43:31,825 [INFO] signals_in_the_noise.utilities.tenx_genomics: Reading C:\Users\silly\GitHub\signals-in-the-noise\data\GSE161529_adata_cache\GSM4909253_N-PM0092-Total.h5ad as AnnData object.
2025-07-10 20:43:32,138 [INFO] signals_in_the_noise.utilities.tenx_genomics: Reading C:\Users\silly\GitHub\signals-in-the-noise\data\GSE161529_adata_cache\GSM4909254_N-PM0019-Total.h5ad as AnnData object.
2025-07-10 20:43:32,507 [INFO] signals_in_the_noise.utilities.tenx_genomics: Reading C:\Users\silly\GitHub\signals-in-the-noise\data\GSE161529_adata_cache\GSM4909255_N-N280-Epi.h5ad as AnnData object.
2025-07-10 20:43:32,669 [INFO] signals_in_the_noise.utilities.tenx_genomics: Reading C:\Users\silly\GitHub\signals-in-the-noise\data\GSE161529_adata_cache\GSM4909256_N-PM0095-Epi.h5ad as AnnData object.
2025-07-10 20:43:32,959 [INFO] signals_in_the_noise.utilities.tenx_genomics: Reading C:\Users\silly\GitHub\signals-in-the-noise\data\GSE161529_adata_cache\GSM4909257_N-PM0095-Total.h5ad 

In [2]:
filenames = [
    "GSM4909260_N-N1105-Epi.h5ad",
    "GSM4909255_N-N280-Epi.h5ad",
    "GSM4909264_N-N1B-Epi.h5ad",
    "GSM4909259_N-NE-Epi.h5ad",
    "GSM4909258_N-NF-Epi.h5ad",
    "GSM4909267_N-MH0023-Epi.h5ad",
    "GSM4909313_N-MH0064-Epi.h5ad",
    "GSM4909256_N-PM0095-Epi.h5ad",
    "GSM4909269_N-PM0342-Epi.h5ad",
    "GSM4909275_N-PM0372-Epi.h5ad",
    "GSM4909273_N-MH275-Epi.h5ad",
]

In [None]:
adata = gse.objects[fil