## Work out how to add cell type annotations to a loom file

Here I will figure out the commands needed to add ClusterID and cell type annotations to a loom file using loompy. This will contribute towards eventually creating a script to automate this process.

I will start with the Gary Bader dataset as we have cell type annotations and their own cluster IDs that I can work with.

The first thing I want to do is work out how to match HCA cell IDs to the cell IDs used in the paper.

In [85]:
import loompy
import numpy as np
import pprint
import pandas as pd
import json
from collections import Counter

In [6]:
bader_loom="SingleCellLiverLandscape.loom"

Now we take a look at how the Bader lab identifies its cells.

In [27]:
bader_celltypes=pd.read_csv("bader_cell_type_cell_id.csv")

In [28]:
pprint.pprint(bader_celltypes[1:5])

                    cell_id donor_id           barcode  cluster_id  \
1  P1TLH_AAACCTGTCCTCATTA_1    P1TLH  AAACCTGTCCTCATTA          17   
2  P1TLH_AAACCTGTCTAAGCCA_1    P1TLH  AAACCTGTCTAAGCCA          12   
3  P1TLH_AAACGGGAGTAGGCCA_1    P1TLH  AAACGGGAGTAGGCCA          10   
4  P1TLH_AAACGGGGTTCGGGCT_1    P1TLH  AAACGGGGTTCGGGCT           2   

                      cell type  
1                Cholangiocytes  
2          Central venous LSECs  
3  Non-inflammatory Macrophages  
4                    ab T cells  


The lab names each cell with a combination of the donor ID and the cell barcode. Since there is only one cell suspension per specimen and one specimen per donor, we can identify each cell in the loom file by combining `donor_organism.provenance.document_id` and `barcode`.
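
Before relying on that naming scheme, it may be worth checking that every `cell_id` really does agree with its `donor_id` and `barcode` columns. The following is only a rough sketch: the helper names `split_cell_id` and `find_inconsistent_rows` are made up for illustration, and the `donor_barcode_suffix` layout is an assumption based on the few rows previewed above.

```python
def split_cell_id(cell_id):
    """Split a paper-style cell ID into (donor_id, barcode, suffix).

    Assumes the 'donor_barcode_suffix' layout seen in the CSV preview,
    e.g. 'P1TLH_AAACCTGTCCTCATTA_1'. Splitting from the right keeps any
    underscores inside the donor ID intact.
    """
    donor_id, barcode, suffix = cell_id.rsplit("_", 2)
    return donor_id, barcode, suffix


def find_inconsistent_rows(rows):
    """Return rows whose cell_id disagrees with their donor_id or barcode.

    `rows` is an iterable of dicts with 'cell_id', 'donor_id' and
    'barcode' keys (for example the output of csv.DictReader).
    """
    bad = []
    for row in rows:
        donor_id, barcode, _ = split_cell_id(row["cell_id"])
        if (donor_id, barcode) != (row["donor_id"], row["barcode"]):
            bad.append(row)
    return bad


# Two rows copied from the preview above.
rows = [
    {"cell_id": "P1TLH_AAACCTGTCCTCATTA_1", "donor_id": "P1TLH", "barcode": "AAACCTGTCCTCATTA"},
    {"cell_id": "P1TLH_AAACCTGTCTAAGCCA_1", "donor_id": "P1TLH", "barcode": "AAACCTGTCTAAGCCA"},
]
print(split_cell_id(rows[0]["cell_id"]))
print(find_inconsistent_rows(rows))
```

An empty list from `find_inconsistent_rows` would suggest that the donor ID plus barcode is a safe key for this dataset.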

Next I need to match the user-provided IDs to the UUIDs in our system, using output from the script created by the ingest devs here: https://github.com/ebi-ait/hca-ebi-dev-team/tree/master/scripts/spreadsheet-id-mapper

In [32]:
# Load the json output from the spreadsheet_id_mapper.py script
with open("BaderLiverLandscapeMapping.json") as bader_json:
    loaded_json = json.load(bader_json)

In [37]:
# Create a donor dictionary with local ids as keys since that's all we want to match
bader_donor_dict={v: k for k, v in loaded_json["donor_organism"].items()}
pprint.pprint(bader_donor_dict)

{'P1TLH': '893bb9d2-9d13-42ff-88ed-2fe40f499090',
 'P2TLH': '1a9760c6-30ca-4d20-81af-72cc4ab7d07b',
 'P3TLH': '85dafcd5-795a-46d0-a17e-0f982e9108ce',
 'P4TLH': 'f69bb28a-df96-4725-9933-e86e63c4450c',
 'P5TLH': '1505e6b0-e7a1-4436-ba79-5454f78198c8'}
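
One thing to watch when flipping a dictionary like this: if two uuids shared the same local ID, the comprehension would silently keep only one of them. Here is a small sketch of a stricter inversion. The function name `invert_unique` is my own, and I am assuming the JSON maps uuid to local ID, which is what the inversion above implies.

```python
from collections import Counter


def invert_unique(mapping):
    """Invert a dict, raising ValueError if any value occurs more than once."""
    counts = Counter(mapping.values())
    duplicates = sorted(v for v, n in counts.items() if n > 1)
    if duplicates:
        raise ValueError(f"values are not unique, cannot invert: {duplicates}")
    return {v: k for k, v in mapping.items()}


# Two entries copied from the output above, in the assumed
# uuid -> local ID orientation of the JSON file.
uuid_to_local = {
    "893bb9d2-9d13-42ff-88ed-2fe40f499090": "P1TLH",
    "1a9760c6-30ca-4d20-81af-72cc4ab7d07b": "P2TLH",
}
local_to_uuid = invert_unique(uuid_to_local)
print(local_to_uuid["P1TLH"])
```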


In [38]:
# Add the uuids to the celltypes dataframe for easy creation of the celltype dict
bader_celltypes["donor_uuid"] = bader_celltypes['donor_id'].map(bader_donor_dict)

In [39]:
pprint.pprint(bader_celltypes[1:5])

                    cell_id donor_id           barcode  cluster_id  \
1  P1TLH_AAACCTGTCCTCATTA_1    P1TLH  AAACCTGTCCTCATTA          17   
2  P1TLH_AAACCTGTCTAAGCCA_1    P1TLH  AAACCTGTCTAAGCCA          12   
3  P1TLH_AAACGGGAGTAGGCCA_1    P1TLH  AAACGGGAGTAGGCCA          10   
4  P1TLH_AAACGGGGTTCGGGCT_1    P1TLH  AAACGGGGTTCGGGCT           2   

                      cell type                            donor_uuid  
1                Cholangiocytes  893bb9d2-9d13-42ff-88ed-2fe40f499090  
2          Central venous LSECs  893bb9d2-9d13-42ff-88ed-2fe40f499090  
3  Non-inflammatory Macrophages  893bb9d2-9d13-42ff-88ed-2fe40f499090  
4                    ab T cells  893bb9d2-9d13-42ff-88ed-2fe40f499090  


In [87]:
# Create a celltype dict where the key is a tuple of (barcode, donor_uuid)
celltype_dict={}
for index, row in bader_celltypes.iterrows():
    celltype_dict[(row['barcode'], row['donor_uuid'])] = row['cell type']

In [80]:
# Return the cell type for a given cell barcode and donor uuid.
# celltype_dict is keyed by tuples of the form (cell_barcode, donor_uuid);
# cells that are missing from the dict are labelled "none".

def get_cell_type(cell_barcode, donor_uuid, celltype_dict):
    try:
        cell_type = celltype_dict[(cell_barcode, donor_uuid)]
    except KeyError:
        cell_type = "none"
    return cell_type

In [81]:
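# Test function with the first cell shown in the preview above
cell_barcode_eg = "AAACCTGTCCTCATTA"
donor_uuid_eg = "893bb9d2-9d13-42ff-88ed-2fe40f499090"
get_cell_type(cell_barcode_eg, donor_uuid_eg, celltype_dict)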

'Cholangiocytes'

In [83]:
celltype_list = []
with loompy.connect(bader_loom) as ds:
    for i in range(ds.shape[1]):
        celltype_list.append(get_cell_type(ds.ca["barcode"][i], ds.ca["donor_organism.provenance.document_id"][i], celltype_dict))
    ds.ca["cell_type"] = celltype_list
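
The loop above looks up one cell at a time by index. A possible alternative, sketched here with plain Python sequences so it runs without a loom file, is to read the two column attributes once and pair them up with `zip`. The function name `annotate_cells` and the toy data are assumptions for illustration; in the notebook the `barcodes` and `donor_uuids` arguments would be the two `ds.ca` arrays.

```python
from collections import Counter


def annotate_cells(barcodes, donor_uuids, celltype_dict, default="none"):
    """Return one cell type label per cell, in input order.

    Cells whose (barcode, donor_uuid) key is missing get `default`.
    """
    return [
        celltype_dict.get((barcode, donor_uuid), default)
        for barcode, donor_uuid in zip(barcodes, donor_uuids)
    ]


# Toy data: the first cell is taken from the preview earlier on, the
# second is a made-up barcode standing in for an unannotated cell.
celltype_dict = {
    ("AAACCTGTCCTCATTA", "893bb9d2-9d13-42ff-88ed-2fe40f499090"): "Cholangiocytes",
}
barcodes = ["AAACCTGTCCTCATTA", "TTTTTTTTTTTTTTTT"]
donor_uuids = ["893bb9d2-9d13-42ff-88ed-2fe40f499090"] * 2

labels = annotate_cells(barcodes, donor_uuids, celltype_dict)
print(labels)
print(Counter(labels))
```

Using `dict.get` with a default gives the same "none" fallback as the `try`/`except` in `get_cell_type`.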
    

In [86]:
# Check that the cell type was added
with loompy.connect(bader_loom) as ds:
    pprint.pprint(ds.ca.keys())
    pprint.pprint(Counter(ds.ca["cell_type"]))
    # Print the cell type and donor uuid for the first 20 cells
    for i in range(20):
        print(ds.ca["cell_type"][i] + " " + ds.ca["donor_organism.provenance.document_id"][i])
    

['CellID',
 'analysis_protocol.protocol_core.protocol_id',
 'analysis_protocol.provenance.document_id',
 'analysis_working_group_approval_status',
 'barcode',
 'bundle_uuid',
 'bundle_version',
 'cell_suspension.genus_species.ontology',
 'cell_suspension.genus_species.ontology_label',
 'cell_suspension.provenance.document_id',
 'cell_type',
 'derived_organ_label',
 'derived_organ_ontology',
 'derived_organ_parts_label',
 'derived_organ_parts_ontology',
 'donor_organism.development_stage.ontology',
 'donor_organism.development_stage.ontology_label',
 'donor_organism.diseases.ontology',
 'donor_organism.diseases.ontology_label',
 'donor_organism.human_specific.ethnicity.ontology',
 'donor_organism.human_specific.ethnicity.ontology_label',
 'donor_organism.is_living',
 'donor_organism.provenance.document_id',
 'donor_organism.sex',
 'dss_bundle_fqid',
 'emptydrops_is_cell',
 'file_uuid',
 'file_version',
 'genes_detected',
 'library_preparation_protocol.end_bias',
 'library_preparation_pr

## Conclusion
Note that many cells are not annotated (they carry the "none" label), most likely because a large amount of filtering was applied before cell types were assigned.
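
To put a number on this, something like the sketch below could report the fraction of unannotated cells per donor. The function name `unannotated_fraction` and the example values are made up; in practice the inputs would be the `cell_type` and `donor_organism.provenance.document_id` column attributes.

```python
from collections import defaultdict


def unannotated_fraction(cell_types, donor_uuids, missing_label="none"):
    """Return {donor_uuid: fraction of cells labelled `missing_label`}."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for cell_type, donor in zip(cell_types, donor_uuids):
        totals[donor] += 1
        if cell_type == missing_label:
            missing[donor] += 1
    return {donor: missing[donor] / totals[donor] for donor in totals}


# Made-up labels for two hypothetical donors, for illustration only.
cell_types = ["Cholangiocytes", "none", "none", "ab T cells"]
donors = ["donor-a", "donor-a", "donor-b", "donor-b"]
print(unannotated_fraction(cell_types, donors))
```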

The next step would be to write a more automated script for this, but that may be difficult because each dataset may identify its cells in a different way.