# Importing Tabula Muris data to Scanpy

Simon Anders, 2018-06

This notebook shows how to import the Tabula Muris data into Scanpy.

The Tabula Muris project has sequenced around 100k cells from various mouse tissues using two platforms: SmartSeq2 and the droplet-based 10X Genomics platform. See this paper:

  The Tabula Muris Consortium: Transcriptomic characterization of 20 organs and tissues from mouse at single cell resolution creates a Tabula Muris. bioRxiv, 2018. doi: https://doi.org/10.1101/237446
  
The droplet (10X) data is available from figshare: https://doi.org/10.6084/m9.figshare.5968960.v2

Download the three files found there and unpack the zip archive; then you can run this notebook.


First, load the packages. Here I use Python 3.6.5 and Scanpy 1.0.4, as well as Numpy 1.14.3 and Pandas 0.23.0.

In [2]:
import os
import numpy as np
import pandas as pd
import scanpy.api as sc

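Now, load the file with the cell type assignments, provided at figshare as "annotations_droplet.csv". The cell names are converted to bytes so that they can later be compared with the barcodes, which the reading function below stores as byte strings: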

In [4]:
anno = pd.read_csv( "annotations_droplet.csv" )
anno["cell"] = anno["cell"].apply( lambda x: bytes(x, "UTF8") )
anno.head()


Unnamed: 0,cell,cell_ontology_class,cell_ontology_id,channel,cluster.ids,free_annotation,mouse.id,mouse.sex,subsetA,subsetA_cluster.ids,subsetB,subsetB_cluster.ids,subtissue,tissue,tissue_tSNE_1,tissue_tSNE_2
0,b'10X_P7_8_AAACGGGAGGATATAC',myeloid cell,CL:0000763,10X_P7_8,20,dendritic cells and interstital macrophages,3-F-56,F,,,,,,Lung,17.024721,-32.902836
1,b'10X_P7_8_AAACGGGTCTCGTATT',alveolar macrophage,CL:0000583,10X_P7_8,5,,3-F-56,F,,,,,,Lung,25.160619,25.066566
2,b'10X_P7_8_AAAGATGCAGATCTGT',B cell,CL:0000236,10X_P7_8,12,,3-F-56,F,,,,,,Lung,1.740567,46.488878
3,b'10X_P7_8_AAATGCCAGATAGTCA',natural killer cell,CL:0000623,10X_P7_8,7,,3-F-56,F,,,,,,Lung,-31.647934,-2.208061
4,b'10X_P7_8_AAATGCCCAAACTGCT',T cell,CL:0000084,10X_P7_8,21,,3-F-56,F,,,,,,Lung,-37.281266,-5.619565
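
The conversion of the cell names to `bytes` may look odd. It is there because the reading function below stores the barcodes as NumPy byte strings (dtype 'S32'), and in Python 3 a `str` never compares equal to a `bytes` object. The following toy sketch shows the effect on a set lookup. The single cell name is taken from the table above; the variable names are made up for illustration, and `np.array` is used here instead of `np.fromiter` only to keep the example short.

```python
import numpy as np

# One cell name stored with the same dtype that read_Tm_10x uses for obs_names.
names = np.array(["10X_P7_8_AAACGGGAGGATATAC"], dtype="S32")

# The same cell name as pd.read_csv delivers it: an ordinary str.
annotated_str = {"10X_P7_8_AAACGGGAGGATATAC"}
# The same set after the conversion applied to anno["cell"].
annotated_bytes = {bytes(s, "UTF8") for s in annotated_str}

# Membership tests, written as in the filtering step of read_Tm_10x.
found_str = [s in annotated_str for s in names]
found_bytes = [s in annotated_bytes for s in names]
print(found_str, found_bytes)
```

Without the conversion, the lookup finds nothing and every cell would be filtered out.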


Next, we read in the contents of the 'droplet' folder that you get from unpacking the file 'droplet.zip'. The following function processes one of the subdirectories therein, using Scanpy's 'read' function to read the counts, which are stored in each subdirectory in the 10X format (matrix.mtx, barcodes.tsv, genes.tsv).

In [28]:
# This function reads one subdirectory of the "droplet" data directory of the
# Tabula muris data, i.e., one "channel".

def read_Tm_10x( dirname, path = "droplet" ):
    
    # Read the sparse matrix:
    a = sc.read( os.path.join( path, dirname, "matrix.mtx" ) ).transpose()
    
    # Add the barcodes and gene symbols:
    a.obs_names = pd.read_table( os.path.join( path, dirname, "barcodes.tsv" ), header=None )[0]
    a.var_names = pd.read_table( os.path.join( path, dirname, "genes.tsv" ), header=None )[0]
    
    # The dirname has the format "tissue-channel"
    assert len( dirname.split("-") ) == 2 
    
    # Remove "-1" label from cell barcodes, prefix channel:
    a.obs_names = np.fromiter( [ dirname.split("-")[1] + "_" + s.replace( "-1", "" ) 
       for s in a.obs_names ], 'S32' )
    
    # Look these up in the annotation table and keep only those found
    # This is necessary because 4 of the droplet directories contain superfluous barcodes
    annotated_cells = set( anno["cell"].values )
    a = a[ np.fromiter( ( s in annotated_cells for s in a.obs_names ), dtype=bool ), ]    
    
    # Store the data from the dirname
    a.obs["tissue"] = dirname.split("-")[0]
    a.obs["channel"] = dirname.split("-")[1]
    
    # Merge in the annotation data
    a.obs = a.obs.merge( anno, how="left", left_index=True, right_on="cell", 
        suffixes=("",".y") ).set_index( "cell" )
    
    # Remove unnecessary columns
    assert all( a.obs["tissue"] == a.obs["tissue.y"] )
    a.obs = a.obs.drop( columns = [ "tissue.y" ] )

    return a
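
The renaming step in the function above is what makes the barcodes match the 'cell' column of the annotation table. As a minimal sketch, here is that step in isolation, on plain strings. The helper name `make_cell_id` is hypothetical and not part of the notebook. Unlike the function above, it strips "-1" only at the end of the barcode and returns a `str` rather than bytes.

```python
def make_cell_id(dirname, barcode):
    """Build an annotation-style cell name from a directory name and a raw 10X barcode."""
    parts = dirname.split("-")
    if len(parts) != 2:
        raise ValueError(f"expected 'tissue-channel', got {dirname!r}")
    channel = parts[1]
    # Drop the "-1" label at the end of the barcode.
    if barcode.endswith("-1"):
        barcode = barcode[:-2]
    # Prefix the channel, as in the annotation table.
    return channel + "_" + barcode

print(make_cell_id("Lung-10X_P7_8", "AAACGGGAGGATATAC-1"))
# prints 10X_P7_8_AAACGGGAGGATATAC
```

The result is the name of the first cell in the annotation table shown earlier.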

In [29]:
path10x = "droplet"
dirs10x = sorted( os.listdir( path10x ) )
dirs10x = [ x for x in dirs10x if not x.startswith(".") ]   # get rid of spurious semi-hidden MacOS files
dirs10x

['Bladder-10X_P4_3',
 'Bladder-10X_P4_4',
 'Bladder-10X_P7_7',
 'Heart_and_Aorta-10X_P7_4',
 'Kidney-10X_P4_5',
 'Kidney-10X_P4_6',
 'Kidney-10X_P7_5',
 'Limb_Muscle-10X_P7_14',
 'Limb_Muscle-10X_P7_15',
 'Liver-10X_P4_2',
 'Liver-10X_P7_0',
 'Liver-10X_P7_1',
 'Lung-10X_P7_8',
 'Lung-10X_P7_9',
 'Lung-10X_P8_12',
 'Lung-10X_P8_13',
 'Mammary_Gland-10X_P7_12',
 'Mammary_Gland-10X_P7_13',
 'Marrow-10X_P7_2',
 'Marrow-10X_P7_3',
 'Spleen-10X_P4_7',
 'Spleen-10X_P7_6',
 'Thymus-10X_P7_11',
 'Tongue-10X_P4_0',
 'Tongue-10X_P4_1',
 'Tongue-10X_P7_10',
 'Trachea-10X_P8_14',
 'Trachea-10X_P8_15']
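
Each directory name encodes a tissue and a channel, so a quick tabulation shows how many channels there are per tissue. This is only a sketch on the first four names from the listing above; in the notebook you would pass `dirs10x` instead of the hard-coded list `dirs`, which is made up for illustration.

```python
import pandas as pd

# A few of the directory names from the listing above.
dirs = ["Bladder-10X_P4_3", "Bladder-10X_P4_4", "Bladder-10X_P7_7", "Heart_and_Aorta-10X_P7_4"]

# Split "tissue-channel" into two columns.
channels = pd.DataFrame([d.split("-") for d in dirs], columns=["tissue", "channel"])

# Count the channels per tissue.
channels_per_tissue = channels["tissue"].value_counts()
print(channels_per_tissue)
```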

This loop now reads all the subdirectories of the 'droplet' folder, which contain the actual count data. We get back one big Scanpy AnnData object.

In [30]:
ac = read_Tm_10x( dirs10x[0], path10x )
print( f"Read {dirs10x[0]}.")
for d in dirs10x[1:]:
    a2 = read_Tm_10x( d, path10x )
    assert( all( ac.var_names == a2.var_names ) )
    ac = ac.concatenate( a2, index_unique=None )
    print( f"Read {d}.")
del a2

ac

Read Bladder-10X_P4_3.
Read Bladder-10X_P4_4.
Read Bladder-10X_P7_7.
Read Heart_and_Aorta-10X_P7_4.
Read Kidney-10X_P4_5.
Read Kidney-10X_P4_6.
Read Kidney-10X_P7_5.
Read Limb_Muscle-10X_P7_14.
Read Limb_Muscle-10X_P7_15.
Read Liver-10X_P4_2.
Read Liver-10X_P7_0.
Read Liver-10X_P7_1.
Read Lung-10X_P7_8.
Read Lung-10X_P7_9.
Read Lung-10X_P8_12.
Read Lung-10X_P8_13.
Read Mammary_Gland-10X_P7_12.
Read Mammary_Gland-10X_P7_13.
Read Marrow-10X_P7_2.
Read Marrow-10X_P7_3.
Read Spleen-10X_P4_7.
Read Spleen-10X_P7_6.
Read Thymus-10X_P7_11.
Read Tongue-10X_P4_0.
Read Tongue-10X_P4_1.
Read Tongue-10X_P7_10.
Read Trachea-10X_P8_14.
Read Trachea-10X_P8_15.


AnnData object with n_obs × n_vars = 55656 × 23433 
    obs: 'tissue', 'channel', 'cell_ontology_class', 'cell_ontology_id', 'channel.y', 'cluster.ids', 'free_annotation', 'mouse.id', 'mouse.sex', 'subsetA', 'subsetA_cluster.ids', 'subsetB', 'subsetB_cluster.ids', 'subtissue', 'tissue_tSNE_1', 'tissue_tSNE_2', 'batch'

Finally, load the channel metadata from the file 'metadata_droplet.csv', which can also be found in the figshare archive, and merge it into the AnnData object.

In [31]:
# Load channel metadata
meta = pd.read_csv( "metadata_droplet.csv" )
meta.head()

Unnamed: 0,channel,mouse.id,tissue,subtissue,mouse.sex
0,10X_P4_0,3-M-8,Tongue,,M
1,10X_P4_1,3-M-9,Tongue,,M
2,10X_P4_2,3-M-8/9,Liver,hepatocytes,M
3,10X_P4_3,3-M-8,Bladder,,M
4,10X_P4_4,3-M-9,Bladder,,M


In [14]:
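# Merge in channel metadata. A merge on a column resets the row index,
# so we put the cell names back afterwards.
cells = ac.obs.index
ac.obs = ac.obs.merge( meta, how="left", on="channel", suffixes=("",".meta") ).set_index( cells )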
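
One thing to watch with this merge: when pandas merges on a column, the result gets a fresh integer index, so the cell names are no longer the row labels unless they are put back. The following minimal sketch illustrates this with toy data. The cell names 'cell_a' to 'cell_c' are made up; the two metadata rows are taken from the table above.

```python
import pandas as pd

# Toy stand-in for ac.obs: cell names as index, one column with the channel.
obs = pd.DataFrame(
    {"channel": ["10X_P4_0", "10X_P4_3", "10X_P4_0"]},
    index=["cell_a", "cell_b", "cell_c"],
)

# Two rows of the channel metadata.
meta = pd.DataFrame(
    {"channel": ["10X_P4_0", "10X_P4_3"], "tissue": ["Tongue", "Bladder"]}
)

# A left merge on a column keeps the row order of the left table
# but replaces its index with 0, 1, 2, ...
merged_plain = obs.merge(meta, how="left", on="channel")

# Putting the original index back restores the cell names.
merged_kept = merged_plain.set_index(obs.index)

print(merged_plain.index.tolist())
print(merged_kept.index.tolist())
```

This works because each channel appears only once in the metadata, so the left merge neither adds nor reorders rows.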

Now, write out the object in Scanpy's h5ad format.

In [15]:
ac.write( "Tabula_muris_10x.h5ad" )