<a href="https://colab.research.google.com/github/WormBase/wormcells-notebooks/blob/main/wormcells_wrangle_packer2019_h5ad.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## This notebook will create the `packer2019.h5ad` file from scratch. 


### Original study:
A lineage-resolved molecular atlas of C. elegans embryogenesis at single-cell resolution

Packer, Jonathan S. and Zhu, Qin and Huynh, Chau and Sivaramakrishnan, Priya and Preston, Elicia and Dueck, Hannah and Stefanik, Derek and Tan, Kai and Trapnell, Cole and Kim, Junhyong and Waterston, Robert H. and Murray, John I.

Science  20 Sep 2019:
Vol. 365, Issue 6459, eaax1971
DOI: 10.1126/science.aax1971
https://science.sciencemag.org/content/365/6459/eaax1971.editor-summary

### Data description and link

89,701 cells profiled with 10x Genomics v2 chemistry across multiple timepoints of embryonic development.

Data available at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE126954



## Data wrangling conventions

Where possible, we keep field names lower case, short, and descriptive, and use only valid Python variable names so that they can be accessed with the syntax `adata.var.field_name`.

Below we describe the mandatory fields we use in all datasets, plus some of the common optional ones we have used so far. Our goal is to standardize the naming convention for frequently used fields so that code can be reused without having to rename variables.
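
A minimal sketch of how the naming convention could be checked. The helper `is_valid_field_name` is a hypothetical name, not part of anndata or pandas, and it assumes that "valid" means a lower-case Python identifier that is not a reserved keyword.

```python
import keyword


def is_valid_field_name(name: str) -> bool:
    """Return True if `name` is a lower-case Python identifier and not a keyword."""
    return name.isidentifier() and not keyword.iskeyword(name) and name == name.lower()


# original GEO column names next to their wrangled versions
for name in ["cell.type", "Size_Factor", "cell_type", "size_factor"]:
    print(name, is_valid_field_name(name))
```

Names such as `cell.type` fail the check because the dot would be read as attribute access, which is why the columns are renamed during wrangling.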

### `adata.var`: gene IDs, names and descriptions 

|Field name | Description | Type | Example value | Optionality|
|-----------|-------------|------|-------|-----|
| `adata.var.index` | WormBase gene ID, must be unique | string | `WBGene00010957`| Required|
| `adata.var.gene_id` | WormBase gene ID, repeat values from index | string | `WBGene00010957`|Required|
| `adata.var.gene_name` | WormBase gene name | string | `nduo-6`|Required|
| `adata.var.gene_description` | WormBase short gene description. Full list available for download [here](https://www.alliancegenome.org/downloads) | string | `Predicted to have NADH dehydrogenase (ubiquinone) activity. Predicted to localize to integral component of membrane; mitochondrial membrane; and respirasome.`|Required|

### `adata.obs`: cell barcode, experiment, batch, original study, cell type

|Field name | Description | Type | Example value | Optionality|
|-----------|-------------|------|-------|-----|
| `adata.obs.index` | The batch name joined with the cell barcode by a `+` character | string | `F4_1+TGTAACGGTTAGCTAC-1`| Required|
| `adata.obs.study` | A unique shorthand for the study that published the data, ideally in the style `<first author><year>`, all lower case. The .h5ad file should have the same name as the study it corresponds to. | categorical | `taylor2020`| Required|
| `adata.obs.batch` | The run that produced the corresponding barcode. Most of the time batch and experiment will be the same, but with multiplexing one batch can sometimes contain multiple experiments | categorical | `F4_1-1`|Required|
| `adata.obs.experiment` | The biological experiment performed | categorical string | `F4_1`|Required|
| `adata.obs.experiment_description` | Description of the experiment performed. This is mandatory because otherwise it will be very easy to confuse two experiments from their name without carefully reading the paper or contacting authors | categorical string | `F4_1`|Required|
| `adata.obs.barcode` | The cell barcode | string | `AAACCCAAGATCGCTT-1`|Required|
| `adata.obs.cell_type` | The cell type annotation provided by the authors. Should be `not provided` if not available | categorical | `ASJ`|Required|
| `adata.obs.cell_subtype` | The cell subtype annotation if provided by the authors | categorical | `BWM_head_row_1`|Optional|
| `adata.obs.tissue` | The tissue annotation if provided by the authors | categorical | `Intestine`|Optional|
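
The `adata.obs.index` convention can be sketched in plain Python. The helper name `make_obs_index` is hypothetical, and `F4_2` is a made-up second batch used only to show why the batch prefix matters; the barcode is the one from the table above.

```python
def make_obs_index(batches, barcodes):
    """Join each batch name with its cell barcode using '+' and check uniqueness."""
    index = [f"{batch}+{barcode}" for batch, barcode in zip(batches, barcodes)]
    if len(set(index)) != len(index):
        raise ValueError("obs index is not unique")
    return index


# the same barcode can occur in two batches; the batch prefix keeps the index unique
batches = ["F4_1", "F4_2"]
barcodes = ["TGTAACGGTTAGCTAC-1", "TGTAACGGTTAGCTAC-1"]
print(make_obs_index(batches, barcodes))
```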

In [12]:
!pip install anndata --quiet
import anndata 
import pandas as pd

anndata.__version__

'0.7.5'

In [13]:
# download cell annotation
!wget https://ftp.ncbi.nlm.nih.gov/geo/series/GSE126nnn/GSE126954/suppl/GSE126954_cell_annotation.csv.gz
!gunzip GSE126954_cell_annotation.csv.gz
cells = pd.read_csv('GSE126954_cell_annotation.csv', index_col=0)
cells.head().T

--2021-04-06 07:39:57--  https://ftp.ncbi.nlm.nih.gov/geo/series/GSE126nnn/GSE126954/suppl/GSE126954_cell_annotation.csv.gz
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 130.14.250.11, 130.14.250.13, 2607:f220:41e:250::11, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|130.14.250.11|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2814304 (2.7M) [application/x-gzip]
Saving to: ‘GSE126954_cell_annotation.csv.gz’


2021-04-06 07:39:57 (24.2 MB/s) - ‘GSE126954_cell_annotation.csv.gz’ saved [2814304/2814304]



Unnamed: 0,AAACCTGAGACAATAC-300.1.1,AAACCTGAGGGCTCTC-300.1.1,AAACCTGAGTGCGTGA-300.1.1,AAACCTGAGTTGAGTA-300.1.1,AAACCTGCAAGACGTG-300.1.1
cell,AAACCTGAGACAATAC-300.1.1,AAACCTGAGGGCTCTC-300.1.1,AAACCTGAGTGCGTGA-300.1.1,AAACCTGAGTTGAGTA-300.1.1,AAACCTGCAAGACGTG-300.1.1
n.umi,1630,2319,3719,4251,1003
time.point,300_minutes,300_minutes,300_minutes,300_minutes,300_minutes
batch,Waterston_300_minutes,Waterston_300_minutes,Waterston_300_minutes,Waterston_300_minutes,Waterston_300_minutes
Size_Factor,1.02319,1.45821,2.33828,2.65905,0.62961
cell.type,Body_wall_muscle,,,Body_wall_muscle,Ciliated_amphid_neuron
cell.subtype,BWM_head_row_1,,,BWM_anterior,AFD
plot.cell.type,BWM_head_row_1,,,BWM_anterior,AFD
raw.embryo.time,360,260,270,260,350
embryo.time,380,220,230,280,350


In [14]:
cells.columns

Index(['cell', 'n.umi', 'time.point', 'batch', 'Size_Factor', 'cell.type',
       'cell.subtype', 'plot.cell.type', 'raw.embryo.time', 'embryo.time',
       'embryo.time.bin', 'raw.embryo.time.bin', 'lineage',
       'passed_initial_QC_or_later_whitelisted'],
      dtype='object')

In [15]:
cells

Unnamed: 0,cell,n.umi,time.point,batch,Size_Factor,cell.type,cell.subtype,plot.cell.type,raw.embryo.time,embryo.time,embryo.time.bin,raw.embryo.time.bin,lineage,passed_initial_QC_or_later_whitelisted
AAACCTGAGACAATAC-300.1.1,AAACCTGAGACAATAC-300.1.1,1630,300_minutes,Waterston_300_minutes,1.023195,Body_wall_muscle,BWM_head_row_1,BWM_head_row_1,360,380.0,330-390,330-390,MSxpappp,True
AAACCTGAGGGCTCTC-300.1.1,AAACCTGAGGGCTCTC-300.1.1,2319,300_minutes,Waterston_300_minutes,1.458210,,,,260,220.0,210-270,210-270,MSxapaap,True
AAACCTGAGTGCGTGA-300.1.1,AAACCTGAGTGCGTGA-300.1.1,3719,300_minutes,Waterston_300_minutes,2.338283,,,,270,230.0,210-270,270-330,,True
AAACCTGAGTTGAGTA-300.1.1,AAACCTGAGTTGAGTA-300.1.1,4251,300_minutes,Waterston_300_minutes,2.659051,Body_wall_muscle,BWM_anterior,BWM_anterior,260,280.0,270-330,210-270,Dxap,True
AAACCTGCAAGACGTG-300.1.1,AAACCTGCAAGACGTG-300.1.1,1003,300_minutes,Waterston_300_minutes,0.629610,Ciliated_amphid_neuron,AFD,AFD,350,350.0,330-390,330-390,ABalpppapav/ABpraaaapav,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
TCTGAGACATGTCGAT-b02,TCTGAGACATGTCGAT-b02,585,mixed,Murray_b02,0.364709,Rectal_gland,Rectal_gland,Rectal_gland,390,700.0,> 650,390-450,,True
TCTGAGACATGTCTCC-b02,TCTGAGACATGTCTCC-b02,510,mixed,Murray_b02,0.323907,,,,510,470.0,450-510,510-580,,True
TGGCCAGCACGAAGCA-b02,TGGCCAGCACGAAGCA-b02,843,mixed,Murray_b02,0.529174,,,,400,470.0,450-510,390-450,,True
TGGCGCACAGGCAGTA-b02,TGGCGCACAGGCAGTA-b02,636,mixed,Murray_b02,0.397979,,,,330,350.0,330-390,330-390,,True


In [16]:
cells_wrangled=cells.copy()
# rename the original columns to lower-case, valid Python identifiers
cells_wrangled.columns=['barcode', 'n_umi', 'time_point', 'batch', 'size_factor',
       'cell_type', 'cell_subtype', 'plot_cell_type', 'raw_embryo_time',
       'embryo_time', 'embryo_time_bin', 'raw_embryo_time_bin', 'lineage','passed_qc']
# add the study name
cells_wrangled['study']='packer2019'
cells_wrangled['study']=cells_wrangled['study'].astype("category")
# replace missing cell type annotations with 'not provided'
cells_wrangled['cell_type']=cells_wrangled.cell_type.fillna('not provided')
cells_wrangled['cell_type']=cells_wrangled['cell_type'].astype("category")

# samples are not multiplexed, so each batch corresponds to exactly one sample
cells_wrangled['sample']=cells_wrangled.batch
cells_wrangled['sample']=cells_wrangled['sample'].astype("category")

# build the obs index by joining the batch name and the cell barcode with '+'
cells_wrangled.index=cells_wrangled.batch+'+'+cells_wrangled.barcode

#create dict with descriptions of each sample from geo
sample_descriptions_dict={
'Waterston_300_minutes':'GSM3618670 UW synchronized 300 min post bleach',
'Waterston_400_minutes':'GSM3618671 UW synchronized 400 min post bleach',
'Waterston_500_minutes_batch_1':'GSM3618672 UW synchronized 500 min post bleach batch 1',
'Waterston_500_minutes_batch_2':'GSM3618673 UW synchronized 500 min post bleach batch 2',
'Murray_r17':'GSM3618674 UPenn mixed embryo batch r17',
'Murray_b01':'GSM3618675 UPenn mixed embryo batch b01',
'Murray_b02':'GSM3618676 UPenn mixed embryo batch b02'  
}
#map descriptions to samples in a new column
cells_wrangled['sample_description']=cells_wrangled['sample'].map(sample_descriptions_dict)
cells_wrangled['sample_description']=cells_wrangled['sample_description'].astype("category")

#reorder columns
cells_wrangled=cells_wrangled[[
'study','batch','sample','sample_description','barcode','cell_type', 
'n_umi', 'time_point', 'size_factor', 'cell_subtype', 'plot_cell_type', 'raw_embryo_time',
'embryo_time', 'embryo_time_bin', 'raw_embryo_time_bin', 'lineage','passed_qc']]

required_columns=['study','batch','sample','sample_description','barcode','cell_type']
if not set(required_columns).issubset(cells_wrangled.columns):
    raise ValueError('At least one of the required obs columns is missing')

cells_wrangled

Unnamed: 0,study,batch,sample,sample_description,barcode,cell_type,n_umi,time_point,size_factor,cell_subtype,plot_cell_type,raw_embryo_time,embryo_time,embryo_time_bin,raw_embryo_time_bin,lineage,passed_qc
Waterston_300_minutes+AAACCTGAGACAATAC-300.1.1,packer2019,Waterston_300_minutes,Waterston_300_minutes,GSM3618670 UW synchronized 300 min post bleach,AAACCTGAGACAATAC-300.1.1,Body_wall_muscle,1630,300_minutes,1.023195,BWM_head_row_1,BWM_head_row_1,360,380.0,330-390,330-390,MSxpappp,True
Waterston_300_minutes+AAACCTGAGGGCTCTC-300.1.1,packer2019,Waterston_300_minutes,Waterston_300_minutes,GSM3618670 UW synchronized 300 min post bleach,AAACCTGAGGGCTCTC-300.1.1,not provided,2319,300_minutes,1.458210,,,260,220.0,210-270,210-270,MSxapaap,True
Waterston_300_minutes+AAACCTGAGTGCGTGA-300.1.1,packer2019,Waterston_300_minutes,Waterston_300_minutes,GSM3618670 UW synchronized 300 min post bleach,AAACCTGAGTGCGTGA-300.1.1,not provided,3719,300_minutes,2.338283,,,270,230.0,210-270,270-330,,True
Waterston_300_minutes+AAACCTGAGTTGAGTA-300.1.1,packer2019,Waterston_300_minutes,Waterston_300_minutes,GSM3618670 UW synchronized 300 min post bleach,AAACCTGAGTTGAGTA-300.1.1,Body_wall_muscle,4251,300_minutes,2.659051,BWM_anterior,BWM_anterior,260,280.0,270-330,210-270,Dxap,True
Waterston_300_minutes+AAACCTGCAAGACGTG-300.1.1,packer2019,Waterston_300_minutes,Waterston_300_minutes,GSM3618670 UW synchronized 300 min post bleach,AAACCTGCAAGACGTG-300.1.1,Ciliated_amphid_neuron,1003,300_minutes,0.629610,AFD,AFD,350,350.0,330-390,330-390,ABalpppapav/ABpraaaapav,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Murray_b02+TCTGAGACATGTCGAT-b02,packer2019,Murray_b02,Murray_b02,GSM3618676 UPenn mixed embryo batch b02,TCTGAGACATGTCGAT-b02,Rectal_gland,585,mixed,0.364709,Rectal_gland,Rectal_gland,390,700.0,> 650,390-450,,True
Murray_b02+TCTGAGACATGTCTCC-b02,packer2019,Murray_b02,Murray_b02,GSM3618676 UPenn mixed embryo batch b02,TCTGAGACATGTCTCC-b02,not provided,510,mixed,0.323907,,,510,470.0,450-510,510-580,,True
Murray_b02+TGGCCAGCACGAAGCA-b02,packer2019,Murray_b02,Murray_b02,GSM3618676 UPenn mixed embryo batch b02,TGGCCAGCACGAAGCA-b02,not provided,843,mixed,0.529174,,,400,470.0,450-510,390-450,,True
Murray_b02+TGGCGCACAGGCAGTA-b02,packer2019,Murray_b02,Murray_b02,GSM3618676 UPenn mixed embryo batch b02,TGGCGCACAGGCAGTA-b02,not provided,636,mixed,0.397979,,,330,350.0,330-390,330-390,,True


In [17]:
foo=['study','barcode', 'n_umi', 'time_point', 'batch', 'size_factor',
       'cell_type', 'cell_subtype', 'plot_cell_type', 'raw_embryo_time',
       'embryo_time', 'embryo_time_bin', 'raw_embryo_time_bin', 'lineage',
       'passed_qc']

foo

['study',
 'barcode',
 'n_umi',
 'time_point',
 'batch',
 'size_factor',
 'cell_type',
 'cell_subtype',
 'plot_cell_type',
 'raw_embryo_time',
 'embryo_time',
 'embryo_time_bin',
 'raw_embryo_time_bin',
 'lineage',
 'passed_qc']

In [18]:
 set(['a', 'sdb']).issubset(['a', 'b', 'c'])

False
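
The `issubset` check only reports that something is missing. A small sketch that also names the missing columns is shown below; `missing_columns` is a hypothetical helper, and `available_columns` is a made-up, deliberately incomplete list.

```python
def missing_columns(required, available):
    """Return the required column names that are absent, in their original order."""
    available = set(available)
    return [name for name in required if name not in available]


required_columns = ['study', 'batch', 'sample', 'sample_description', 'barcode', 'cell_type']
available_columns = ['study', 'batch', 'barcode', 'cell_type', 'n_umi']

missing = missing_columns(required_columns, available_columns)
if missing:
    print('Missing required obs columns:', missing)
```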

In [19]:
# download gene annotation
!wget https://ftp.ncbi.nlm.nih.gov/geo/series/GSE126nnn/GSE126954/suppl/GSE126954_gene_annotation.csv.gz
!gunzip GSE126954_gene_annotation.csv.gz
genes = pd.read_csv('GSE126954_gene_annotation.csv', index_col=0)
genes.head().T

--2021-04-06 07:39:58--  https://ftp.ncbi.nlm.nih.gov/geo/series/GSE126nnn/GSE126954/suppl/GSE126954_gene_annotation.csv.gz
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 165.112.9.228, 165.112.9.229, 2607:f220:41e:250::11, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|165.112.9.228|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 177278 (173K) [application/x-gzip]
Saving to: ‘GSE126954_gene_annotation.csv.gz’


2021-04-06 07:39:59 (4.45 MB/s) - ‘GSE126954_gene_annotation.csv.gz’ saved [177278/177278]



Unnamed: 0,WBGene00010957,WBGene00010958,WBGene00010959,WBGene00010960,WBGene00010961
id,WBGene00010957,WBGene00010958,WBGene00010959,WBGene00010960,WBGene00010961
gene_short_name,nduo-6,ndfl-4,nduo-1,atp-6,nduo-2


In [20]:
genes_wrangled=genes.copy()

genes_wrangled.columns=['gene_id','gene_name']

genes_wrangled

Unnamed: 0,gene_id,gene_name
WBGene00010957,WBGene00010957,nduo-6
WBGene00010958,WBGene00010958,ndfl-4
WBGene00010959,WBGene00010959,nduo-1
WBGene00010960,WBGene00010960,atp-6
WBGene00010961,WBGene00010961,nduo-2
...,...,...
WBGene00021597,WBGene00021597,spsb-1
WBGene00021596,WBGene00021596,spsb-2
WBGene00021595,WBGene00021595,Y46E12BL.2
WBGene00021594,WBGene00021594,tig-3
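
The conventions above also list `gene_description` as a required `adata.var` field. The sketch below shows one way such a column might be joined onto the gene table, using two genes from the annotation. The `descriptions` lookup is hypothetical (its single entry is shortened from the example in the conventions table), and the `not provided` placeholder is an assumption borrowed from the `cell_type` convention.

```python
import pandas as pd

# two genes from the annotation shown above
genes_toy = pd.DataFrame(
    {"gene_id": ["WBGene00010957", "WBGene00010958"],
     "gene_name": ["nduo-6", "ndfl-4"]},
    index=["WBGene00010957", "WBGene00010958"],
)

# hypothetical description lookup, keyed by WormBase gene ID
descriptions = pd.Series(
    {"WBGene00010957": "Predicted to have NADH dehydrogenase (ubiquinone) activity."},
    name="gene_description",
)

# a left join on the index keeps every gene; missing descriptions get a placeholder
genes_toy = genes_toy.join(descriptions, how="left")
genes_toy["gene_description"] = genes_toy["gene_description"].fillna("not provided")

# the conventions require a unique index that repeats gene_id
assert genes_toy.index.is_unique
assert (genes_toy.index == genes_toy["gene_id"]).all()
print(genes_toy)
```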


In [21]:
# download gene count matrix
!wget https://ftp.ncbi.nlm.nih.gov/geo/series/GSE126nnn/GSE126954/suppl/GSE126954_gene_by_cell_count_matrix.txt.gz
!gunzip GSE126954_gene_by_cell_count_matrix.txt.gz


--2021-04-06 07:39:59--  https://ftp.ncbi.nlm.nih.gov/geo/series/GSE126nnn/GSE126954/suppl/GSE126954_gene_by_cell_count_matrix.txt.gz
Resolving ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)... 165.112.9.228, 165.112.9.229, 2607:f220:41e:250::11, ...
Connecting to ftp.ncbi.nlm.nih.gov (ftp.ncbi.nlm.nih.gov)|165.112.9.228|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 261278354 (249M) [application/x-gzip]
Saving to: ‘GSE126954_gene_by_cell_count_matrix.txt.gz’


2021-04-06 07:40:03 (61.8 MB/s) - ‘GSE126954_gene_by_cell_count_matrix.txt.gz’ saved [261278354/261278354]



In [22]:
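# the count matrix is stored gene-by-cell, so after loading the rows (obs) are genes
# and the columns (var) are cells; use adata.T to obtain cells in obs and genes in var
adata=anndata.read_mtx('GSE126954_gene_by_cell_count_matrix.txt')
adata.var=cells_wrangled
adata.obs=genes_wrangled
adata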

AnnData object with n_obs × n_vars = 20222 × 89701
    obs: 'gene_id', 'gene_name'
    var: 'study', 'batch', 'sample', 'sample_description', 'barcode', 'cell_type', 'n_umi', 'time_point', 'size_factor', 'cell_subtype', 'plot_cell_type', 'raw_embryo_time', 'embryo_time', 'embryo_time_bin', 'raw_embryo_time_bin', 'lineage', 'passed_qc'
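
The shape 20222 × 89701 reflects the gene-by-cell layout of the matrix file: genes are the observations and cells are the variables. The conventions above place cells in `obs` and genes in `var`, which corresponds to the transposed layout. The toy sketch below illustrates the orientation change with a pandas DataFrame standing in for the count matrix (the counts and cell names are made up); AnnData objects expose an analogous `.T` attribute.

```python
import pandas as pd

# toy gene-by-cell counts: rows are genes, columns are cells (values are made up)
gene_by_cell = pd.DataFrame(
    [[0, 3, 1],
     [2, 0, 0]],
    index=["WBGene00010957", "WBGene00010958"],
    columns=["cell_a", "cell_b", "cell_c"],
)

# transpose so that rows are cells (obs) and columns are genes (var)
cell_by_gene = gene_by_cell.T

print(gene_by_cell.shape)
print(cell_by_gene.shape)
```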

In [23]:
adata.write_h5ad('packer2019.h5ad')
# once the file is written to disk, download it and upload it to figshare manually

... storing 'batch' as categorical
... storing 'time_point' as categorical
... storing 'cell_subtype' as categorical
... storing 'plot_cell_type' as categorical
... storing 'embryo_time_bin' as categorical
... storing 'raw_embryo_time_bin' as categorical
... storing 'lineage' as categorical


In [24]:
print('\nprint(adata) \n')
print(adata)
print('\nprint(adata.obs.head(1).T) \n')
print(adata.obs.head(1).T)


print(adata) 

AnnData object with n_obs × n_vars = 20222 × 89701
    obs: 'gene_id', 'gene_name'
    var: 'study', 'batch', 'sample', 'sample_description', 'barcode', 'cell_type', 'n_umi', 'time_point', 'size_factor', 'cell_subtype', 'plot_cell_type', 'raw_embryo_time', 'embryo_time', 'embryo_time_bin', 'raw_embryo_time_bin', 'lineage', 'passed_qc'

print(adata.obs.head(1).T) 

           WBGene00010957
gene_id    WBGene00010957
gene_name          nduo-6
