<a href="https://colab.research.google.com/github/WormBase/wormcells-notebooks/blob/main/wormcells_wrangle_bendavid2021_h5ad.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## This notebook will create the `bendavid2021.h5ad` file 


**Whole-organism eQTL mapping at cellular resolution with single-cell sequencing**

_Eyal Ben-David, James Boocock, Longhua Guo, Stefan Zdraljevic, Joshua S Bloom, Leonid Kruglyak_

eLife 2021;10:e65857, DOI: [https://doi.org/10.7554/eLife.65857](https://doi.org/10.7554/eLife.65857)

### Data description and links

55,508 cells from L2 larvae, profiled with 10x Genomics v2 chemistry (10xv2)

Raw sequencing data is available at https://www.ncbi.nlm.nih.gov/bioproject/658829

The authors kindly provided the original gene count matrix without the SoupX modifications, which is used to create the `.h5ad` file in this notebook. The author-provided files are:
```
bendavid2021_barcode_metadata_for.csv
bendavid2021_count_matrix_wormsceQTL_unmodified.mm.gz
bendavid2021_gene_metadata.csv
```

The R object as processed by the authors with SoupX is also available on GitHub, together with all the code used to make the figures and calculate the numbers reported in the paper:

[https://github.com/eyalbenda/worm_sceQTL/](https://github.com/eyalbenda/worm_sceQTL/)

## Data wrangling conventions

This document describes how WormBase should wrangle single-cell RNA-seq data into the [anndata](https://anndata.readthedocs.io/en/stable/) format in `.h5ad` files with standard fields, plus any number of optional fields that vary depending on the metadata the authors provide.

Where possible, we keep field names lower case, short, and descriptive, and we use only valid Python variable names so that they may be accessed via the syntax `adata.var.field_name`.
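
The naming rule above can be checked mechanically. The following is a minimal sketch using only the standard library; `is_valid_field_name` is a hypothetical helper, not part of anndata or of this notebook.

```python
import keyword


def is_valid_field_name(name: str) -> bool:
    """Return True if `name` is lower case and a valid Python variable name."""
    return (
        name.isidentifier()              # letters, digits, underscores; no leading digit
        and not keyword.iskeyword(name)  # reserved words such as `class` are rejected
        and name == name.lower()         # lower case only
    )


# Illustrative names only; `Size_Factor` appears in the author metadata.
results = {
    name: is_valid_field_name(name)
    for name in ["gene_name", "cell_type", "Size_Factor", "cell type", "class"]
}
print(results)
```

Note that a valid name does not guarantee attribute access works: a column whose name matches an existing `pandas.DataFrame` attribute, such as `sample`, still has to be read with bracket syntax, for example `adata.obs['sample']`.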

Our goal is to standardize the naming convention for frequently used fields so that code can be reused without having to change variable names. The `.h5ad` file should only contain genes and cells with at least one count.

For now we are keeping these guidelines simple so that we can easily stick to them. Going forward we will update them as we learn from experience. For example, at some point WormBase may start uniformly reprocessing the data, so a version field may be needed in `adata.uns`.

Below we describe the mandatory fields used in all datasets, plus some common optional ones that we have used so far. The optional list is not exhaustive, since each dataset may come with different author-provided metadata.

### `adata.var`: gene IDs, names and descriptions 
| Field name | Description | Type | Example value | Optionality |
|-----------|-------------|------|-------|-----|
| `adata.var.index` | WormBase gene ID, must be unique | string | `WBGene00010957` | Required |
| `adata.var.gene_id` | WormBase gene ID, repeat values from index | string | `WBGene00010957` | Required |
| `adata.var.gene_name` | WormBase gene name | string | `nduo-6` | Required |
| `adata.var.gene_description` | WormBase short gene description. Full list available for download [here](https://www.alliancegenome.org/downloads) | string | `Predicted to have NADH dehydrogenase (ubiquinone) activity. Predicted to localize to integral component of membrane; mitochondrial membrane; and respirasome.` | Optional |
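
The `adata.var` rules in the table can be expressed as a small validation function. This is a sketch under stated assumptions: `check_var` is a hypothetical helper, and the pattern `WBGene` followed by eight digits is inferred from the example value rather than taken from a WormBase specification.

```python
import pandas as pd


def check_var(var: pd.DataFrame) -> None:
    """Raise ValueError if `var` breaks the adata.var conventions."""
    for column_name in ["gene_id", "gene_name"]:
        if column_name not in var.columns:
            raise ValueError("missing required column: " + column_name)
    if not var.index.is_unique:
        raise ValueError("var index contains duplicate gene IDs")
    if not (var.index.to_numpy() == var["gene_id"].to_numpy()).all():
        raise ValueError("gene_id must repeat the index values")
    # Assumption: WormBase gene IDs look like 'WBGene' plus eight digits.
    if not var.index.str.match(r"^WBGene\d{8}$").all():
        raise ValueError("index contains values that do not look like WormBase gene IDs")


# Toy example built from the values shown in the table.
var = pd.DataFrame(
    {
        "gene_id": ["WBGene00010957", "WBGene00010958"],
        "gene_name": ["nduo-6", "ndfl-4"],
    },
    index=["WBGene00010957", "WBGene00010958"],
)
check_var(var)
print("var looks ok")
```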

### `adata.obs`: cell barcode, experiment, batch, original study, cell type
| Field name | Description | Type | Example value | Optionality |
|-----------|-------------|------|-------|-----|
| `adata.obs.index` | The batch name joined with the cell barcode with a `+` character | string | `F4_1+TGTAACGGTTAGCTAC-1` | Required |
| `adata.obs.study` | A unique shorthand for the study that published the data, ideally in the style `<first author><year>`, all lower case. The `.h5ad` file should have the same name as the study it corresponds to. | categorical | `taylor2020` | Required |
| `adata.obs.sample_batch` | The run that produced the corresponding barcode. Most of the time batch and experiment will be the same, but with multiplexing one batch can sometimes contain multiple experiments | categorical | `F4_1` | Required |
| `adata.obs.sample` | The biological sample that is in this batch | categorical string | `L2 larvae fourth repeat` | Required |
| `adata.obs.sample_description` | Description of the sample. This is mandatory because otherwise it is very easy to confuse two samples by name without carefully reading the paper or contacting the authors | categorical string | `F4_1` | Required |
| `adata.obs.barcode` | The cell barcode | string | `AAACCCAAGATCGCTT-1` | Required |
| `adata.obs.cell_type` | The cell type annotation provided by the authors. Should be `not provided` if not available | categorical | `ASJ` | Required |
| `adata.obs.cell_subtype` | The cell subtype annotation if provided by the authors | categorical | `BWM_head_row_1` | Optional |
| `adata.obs.tissue` | The tissue annotation if provided by the authors | categorical | `Intestine` | Optional |
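
As an illustration of the `adata.obs` conventions, the sketch below builds the index by joining batch and barcode with `+`, fills a missing cell type with `not provided`, and stores repeated labels as categoricals. The toy rows are made up for the example (the missing cell type in particular), and the column list is a subset of the required fields.

```python
import pandas as pd

# Toy metadata for two cells; the second has no cell type annotation.
obs = pd.DataFrame({
    "sample_batch": ["F4_1", "F4_1"],
    "barcode": ["TGTAACGGTTAGCTAC-1", "GGCAGTCCAGCCTATA-1"],
    "cell_type": ["Intestine", None],
})

# Index: batch name joined with the cell barcode by a '+' character.
obs.index = obs["sample_batch"] + "+" + obs["barcode"]

# Missing annotations are recorded explicitly rather than left empty.
obs["cell_type"] = obs["cell_type"].fillna("not provided")

# Repeated labels are stored as categoricals.
for column_name in ["sample_batch", "cell_type"]:
    obs[column_name] = obs[column_name].astype("category")

print(obs)
```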

In [7]:
!pip install anndata --quiet
import anndata 
import pandas as pd

anndata.__version__

'0.7.6'

### Manually download these three files from CaltechDATA and upload them to Colab before running the notebook, since CaltechDATA does not support `wget`

```
bendavid2021_barcode_metadata_for.csv
bendavid2021_count_matrix_wormsceQTL_unmodified.mm.gz
bendavid2021_gene_metadata.csv
```

https://data.caltech.edu/tindfiles/serve/fed92e1c-a935-4917-b8c6-5405d3af73f3/

https://data.caltech.edu/tindfiles/serve/7a2b5d31-ebc1-4796-b350-6a8d6fc91b40/

https://data.caltech.edu/tindfiles/serve/c41d4507-204f-4d1f-b9ed-ea3f9683178e/
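
Before running the loading cells, it may help to confirm that the uploads are in the working directory. This is a minimal sketch: `missing_files` is a hypothetical helper, and the file names are assumed from the paths the loading cells read (the barcode metadata is read there as `bendavid2021_barcode_metadata.csv`), so adjust them if your downloads are named differently.

```python
from pathlib import Path

# Assumption: these are the names the loading cells expect.
REQUIRED_FILES = [
    "bendavid2021_barcode_metadata.csv",
    "bendavid2021_count_matrix_wormsceQTL_unmodified.mm.gz",
    "bendavid2021_gene_metadata.csv",
]


def missing_files(directory=".", names=REQUIRED_FILES):
    """Return the required file names that are not present in `directory`."""
    return [name for name in names if not (Path(directory) / name).is_file()]


missing = missing_files()
if missing:
    print("Upload these files before continuing:", missing)
else:
    print("All input files found.")
```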

In [8]:
### LOAD CELL ANNOTATION DATAFRAME
### wrangles names to conform to convention

# NOTE: the file list above names this file `bendavid2021_barcode_metadata_for.csv`;
# adjust the path if your download uses that name.
cells = pd.read_csv('bendavid2021_barcode_metadata.csv')
display(cells.head())
# Barcodes look like 'F4_1_TGTAACGGTTAGCTAC-1'; keep only the part after the last '_'
cells['barcode'] = cells['barcode'].str.rsplit('_', n=1).str[-1]
cells = cells.rename(columns={'Batch': 'sample_batch', 'neuronal_subtype': 'cell_subtype'})
# Index is the batch name joined to the cell barcode with a '+'
cells.index = cells['sample_batch'] + '+' + cells['barcode']
# Cells without a neuronal subtype fall back to their cell type
cells['cell_subtype'] = cells['cell_subtype'].fillna(cells['cell_type'])
display(cells.head())
cells['study'] = 'bendavid2021'
cells['sample'] = cells['sample_batch']
cells['sample_description'] = 'C. elegans F4 segregants collected at the L2 larval stage'
cells = cells[['study', 'sample_batch', 'sample', 'cell_type', 'cell_subtype', 'sample_description', 'barcode']]
display(cells.head())

### CHECK THAT CELLS DATAFRAME CONFORMS TO ADATA NAMING CONVENTION
### must contain at least the `adata.obs` columns listed below
for column_name in ['study', 'sample_batch', 'sample', 'sample_description', 'barcode', 'cell_type', 'cell_subtype']:
    if column_name not in cells.columns:
        raise ValueError(column_name + ' is not a name in cells dataframe columns')

print('cells column names look ok!')

Unnamed: 0,Batch,Size_Factor,cell_type,neuronal_subtype,total,barcode,doublet
0,F4_1,102.5322,Intestine,,104863,F4_1_TGTAACGGTTAGCTAC-1,False
1,F4_1,60.046012,Intestine,,61411,F4_1_GGCAGTCCAGCCTATA-1,False
2,F4_1,60.384321,Somatic Gonad,,61757,F4_1_AAGTACCGTCATCCCT-1,False
3,F4_1,51.692898,Intestine,,52868,F4_1_AAGATAGTCCCTCTAG-1,False
4,F4_1,59.00175,Pharynx and Arcade Cells,,60343,F4_1_ACCAAACCAGCTGTAT-1,False


Unnamed: 0,sample_batch,Size_Factor,cell_type,cell_subtype,total,barcode,doublet
F4_1+TGTAACGGTTAGCTAC-1,F4_1,102.5322,Intestine,Intestine,104863,TGTAACGGTTAGCTAC-1,False
F4_1+GGCAGTCCAGCCTATA-1,F4_1,60.046012,Intestine,Intestine,61411,GGCAGTCCAGCCTATA-1,False
F4_1+AAGTACCGTCATCCCT-1,F4_1,60.384321,Somatic Gonad,Somatic Gonad,61757,AAGTACCGTCATCCCT-1,False
F4_1+AAGATAGTCCCTCTAG-1,F4_1,51.692898,Intestine,Intestine,52868,AAGATAGTCCCTCTAG-1,False
F4_1+ACCAAACCAGCTGTAT-1,F4_1,59.00175,Pharynx and Arcade Cells,Pharynx and Arcade Cells,60343,ACCAAACCAGCTGTAT-1,False


Unnamed: 0,study,sample_batch,sample,cell_type,cell_subtype,sample_description,barcode
F4_1+TGTAACGGTTAGCTAC-1,bendavid2021,F4_1,F4_1,Intestine,Intestine,C. elegans F4 segregants collected at the L2 l...,TGTAACGGTTAGCTAC-1
F4_1+GGCAGTCCAGCCTATA-1,bendavid2021,F4_1,F4_1,Intestine,Intestine,C. elegans F4 segregants collected at the L2 l...,GGCAGTCCAGCCTATA-1
F4_1+AAGTACCGTCATCCCT-1,bendavid2021,F4_1,F4_1,Somatic Gonad,Somatic Gonad,C. elegans F4 segregants collected at the L2 l...,AAGTACCGTCATCCCT-1
F4_1+AAGATAGTCCCTCTAG-1,bendavid2021,F4_1,F4_1,Intestine,Intestine,C. elegans F4 segregants collected at the L2 l...,AAGATAGTCCCTCTAG-1
F4_1+ACCAAACCAGCTGTAT-1,bendavid2021,F4_1,F4_1,Pharynx and Arcade Cells,Pharynx and Arcade Cells,C. elegans F4 segregants collected at the L2 l...,ACCAAACCAGCTGTAT-1


cells column names look ok!


In [9]:
### LOAD GENE DATAFRAME AND CHANGE NAMES TO CONVENTION
genes = pd.read_csv('bendavid2021_gene_metadata.csv', index_col=0)
display(genes.head())
genes = genes.rename(columns={'wormbase_gene': 'gene_id', 'gene_short_name': 'gene_name'})
genes.index.name = None
# Put the required columns first, followed by the remaining author-provided metadata
genes = genes[['gene_id', 'gene_name', 'wbps_transcript_id', 'chromosome_name', 'start_position',
               'end_position', 'strand', 'external_gene_id', 'external_transcript_id',
               'wormbase_locus', 'wormbase_gseq']]
display(genes)

### CHECK THAT GENES DATAFRAME CONFORMS TO ADATA NAMING CONVENTION
### must contain at least columns `gene_id` and `gene_name`
for column_name in ['gene_id', 'gene_name']:
    if column_name not in genes.columns:
        raise ValueError(column_name + ' is not a name in gene dataframe columns')

print('gene column names look ok!')

wbps_gene_id,wbps_transcript_id,chromosome_name,start_position,end_position,strand,external_gene_id,external_transcript_id,wormbase_gene,wormbase_locus,wormbase_gseq,gene_short_name
WBGene00010957,MTCE.3.1,MtDNA,113,549,1,nduo-6,MTCE.3.1,WBGene00010957,nduo-6,MTCE.3,nduo-6
WBGene00010958,MTCE.4.1,MtDNA,549,783,1,ndfl-4,MTCE.4.1,WBGene00010958,ndfl-4,MTCE.4,ndfl-4
WBGene00010959,MTCE.11.1,MtDNA,1763,2635,1,nduo-1,MTCE.11.1,WBGene00010959,nduo-1,MTCE.11,nduo-1
WBGene00010960,MTCE.12.1,MtDNA,2634,3235,1,atp-6,MTCE.12.1,WBGene00010960,atp-6,MTCE.12,atp-6
WBGene00010961,MTCE.16.1,MtDNA,3389,4269,1,nduo-2,MTCE.16.1,WBGene00010961,nduo-2,MTCE.16,nduo-2


Unnamed: 0,gene_id,gene_name,wbps_transcript_id,chromosome_name,start_position,end_position,strand,external_gene_id,external_transcript_id,wormbase_locus,wormbase_gseq
WBGene00010957,WBGene00010957,nduo-6,MTCE.3.1,MtDNA,113,549,1,nduo-6,MTCE.3.1,nduo-6,MTCE.3
WBGene00010958,WBGene00010958,ndfl-4,MTCE.4.1,MtDNA,549,783,1,ndfl-4,MTCE.4.1,ndfl-4,MTCE.4
WBGene00010959,WBGene00010959,nduo-1,MTCE.11.1,MtDNA,1763,2635,1,nduo-1,MTCE.11.1,nduo-1,MTCE.11
WBGene00010960,WBGene00010960,atp-6,MTCE.12.1,MtDNA,2634,3235,1,atp-6,MTCE.12.1,atp-6,MTCE.12
WBGene00010961,WBGene00010961,nduo-2,MTCE.16.1,MtDNA,3389,4269,1,nduo-2,MTCE.16.1,nduo-2,MTCE.16
...,...,...,...,...,...,...,...,...,...,...,...
WBGene00021597,WBGene00021597,spsb-1,Y46E12BL.4a.1,II,15207443,15229725,1,spsb-1,Y46E12BL.4a.1,spsb-1,Y46E12BL.4
WBGene00021596,WBGene00021596,spsb-2,Y46E12BL.3.1,II,15230343,15232803,1,spsb-2,Y46E12BL.3.1,spsb-2,Y46E12BL.3
WBGene00021595,WBGene00021595,Y46E12BL.2,Y46E12BL.2a.1,II,15233153,15244931,1,Y46E12BL.2,Y46E12BL.2a.1,,Y46E12BL.2
WBGene00021594,WBGene00021594,tig-3,Y46E12BL.1.1,II,15244855,15251774,-1,tig-3,Y46E12BL.1.1,tig-3,Y46E12BL.1


gene column names look ok!


In [10]:
# The matrix file is stored as genes x cells, so transpose to get cells x genes
adata = anndata.read_mtx('bendavid2021_count_matrix_wormsceQTL_unmodified.mm.gz')
adata = adata.T
adata.var = genes
adata.obs = cells

### CHECK THAT OBS AND VAR CONFORM TO ADATA NAMING CONVENTION
for column_name in ['study', 'sample_batch', 'sample', 'sample_description', 'barcode', 'cell_type', 'cell_subtype']:
    if column_name not in adata.obs.columns:
        raise ValueError(column_name + ' is not a name in adata.obs columns')
for column_name in ['gene_id', 'gene_name']:
    if column_name not in adata.var.columns:
        raise ValueError(column_name + ' is not a name in adata.var columns')

print('adata.var and adata.obs column names look ok!')
adata.write_h5ad('bendavid2021.h5ad')
adata

... storing 'study' as categorical
... storing 'sample_batch' as categorical
... storing 'sample' as categorical
... storing 'cell_type' as categorical
... storing 'cell_subtype' as categorical
... storing 'sample_description' as categorical


adata.var and adata.obs column names look ok!


... storing 'barcode' as categorical
... storing 'chromosome_name' as categorical
... storing 'wormbase_locus' as categorical
... storing 'wormbase_gseq' as categorical


AnnData object with n_obs × n_vars = 55508 × 20138
    obs: 'study', 'sample_batch', 'sample', 'cell_type', 'cell_subtype', 'sample_description', 'barcode'
    var: 'gene_id', 'gene_name', 'wbps_transcript_id', 'chromosome_name', 'start_position', 'end_position', 'strand', 'external_gene_id', 'external_transcript_id', 'wormbase_locus', 'wormbase_gseq'