# Assign T cell annotations

To assemble our annotations, we'll read our clustered T cell data and assign our expert annotations to those clusters. We'll then inspect the annotations in our UMAP projections, and output final labels for these cells.

For T cells, we have multiple groups of cells to label. We first clustered all T cells, then subset individual cell types and subclustered them for additional resolution. So, we'll load these sets, remove cells that appear in more than one subset, assign identities based on the clusters in each, and finally concatenate the labels for all cell barcodes.

In [1]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=RuntimeWarning)

from datetime import date
import hisepy
import os
import pandas as pd
import scanpy as sc

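### Helper functions

These functions make it easy to pull csv and h5ad files stored in HISE. `read_csv_uuid()` returns the csv contents as a pandas DataFrame, and `read_obs_uuid()` returns the `obs` table of an h5ad file as a pandas DataFrame.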

In [2]:
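def read_csv_uuid(csv_uuid):
    """Cache a csv file from HISE if needed, then read it as a DataFrame."""
    csv_path = '/home/jupyter/cache/{u}'.format(u = csv_uuid)
    # Download only if this UUID is not already in the local cache
    if not os.path.isdir(csv_path):
        hise_res = hisepy.reader.cache_files([csv_uuid])
    # Assumes the cache directory holds a single file for this UUID
    csv_filename = os.listdir(csv_path)[0]
    csv_file = '{p}/{f}'.format(p = csv_path, f = csv_filename)
    df = pd.read_csv(csv_file, index_col = 0)
    return df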

In [3]:
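def read_obs_uuid(h5ad_uuid):
    """Cache an h5ad file from HISE if needed, then return a copy of its obs table."""
    h5ad_path = '/home/jupyter/cache/{u}'.format(u = h5ad_uuid)
    # Download only if this UUID is not already in the local cache
    if not os.path.isdir(h5ad_path):
        hise_res = hisepy.reader.cache_files([h5ad_uuid])
    # Assumes the cache directory holds a single file for this UUID
    h5ad_filename = os.listdir(h5ad_path)[0]
    h5ad_file = '{p}/{f}'.format(p = h5ad_path, f = h5ad_filename)
    # Backed mode avoids loading the expression matrix; we only need obs
    adata = sc.read_h5ad(h5ad_file, backed = 'r')
    obs = adata.obs.copy()
    return obs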
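
Both helpers share the same "check the cache, fetch if missing, take the first file" logic. As a minimal sketch of how that shared step could be factored out, the function below takes the cache root and a fetch callable as arguments. The names `cached_file_path`, `cache_root`, and `fake_fetch` are hypothetical and are not part of `hisepy`; the fake fetch only stands in for a real download so the example can run anywhere.

```python
import os
import tempfile

def cached_file_path(uuid, cache_root, fetch):
    """Return the path of the first file under cache_root/uuid.

    fetch(uuid) is called only when the UUID directory is missing.
    """
    uuid_dir = os.path.join(cache_root, uuid)
    if not os.path.isdir(uuid_dir):
        fetch(uuid)
    # Sort so the choice of "first file" is deterministic
    filenames = sorted(os.listdir(uuid_dir))
    if not filenames:
        raise FileNotFoundError('No files cached for {u}'.format(u = uuid))
    return os.path.join(uuid_dir, filenames[0])

# Demonstration with a temporary cache and a stand-in fetch function
cache_root = tempfile.mkdtemp()

def fake_fetch(uuid):
    # Hypothetical replacement for a real download step
    os.makedirs(os.path.join(cache_root, uuid))
    with open(os.path.join(cache_root, uuid, 'data.csv'), 'w') as f:
        f.write('x\n1\n')

path = cached_file_path('example-uuid', cache_root, fake_fetch)
print(os.path.basename(path))
```

Passing the fetch step in as an argument keeps the path logic testable without access to HISE.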

## Read subclustering results from HISE

In [4]:
cell_class = 't-cells'

In [5]:
h5ad_uuid = 'd6ebc576-34ea-4394-a569-e35e16f20253'
h5ad_path = '/home/jupyter/cache/{u}'.format(u = h5ad_uuid)

In [6]:
if not os.path.isdir(h5ad_path):
    hise_res = hisepy.reader.cache_files([h5ad_uuid])

In [7]:
h5ad_filename = os.listdir(h5ad_path)[0]
h5ad_file = '{p}/{f}'.format(p = h5ad_path, f = h5ad_filename)

In [8]:
adata = sc.read_h5ad(h5ad_file)

In [9]:
adata.shape

(1191327, 1487)

## Read iterative results from HISE

In [24]:
iter_uuids = {
    't-cd4-naive':      '70651e60-282b-4ed0-96f6-414547297232',
    't-cd8-mait':       '0f821486-866b-4c08-b0b8-508a5c544547',
    't-cd8-cm':         '6c1dff43-ddc5-437b-8e3d-dd5a32553b16',
    't-cd8-em':         'b671c53a-2698-41c1-a886-9ab939306716',
    'treg':             'bf615641-d907-4daa-b0b7-5280bd86b861',
    't-cd8-naive':      '5ae29893-5a77-4081-86d1-523713a237e6',
    't-proliferating':  '90a71622-5713-47f7-82e8-18e164ca9454',
    't-gd':             '71d79aee-5600-4f3f-a3d1-e3f830e1c0ff',
    't-isg-high':       'd33ef147-59db-4fb6-950c-1dd8af242d4f',
    't-other':          'bda4fe2f-1d8a-4ec5-9ce7-6bee1a158d7b'
}

In [11]:
iter_obs = {}
for cell_type, uuid in iter_uuids.items():
    obs = read_obs_uuid(uuid)
    iter_obs[cell_type] = obs

downloading fileID: bf615641-d907-4daa-b0b7-5280bd86b861
Files have been successfully downloaded!
downloading fileID: 71d79aee-5600-4f3f-a3d1-e3f830e1c0ff
Files have been successfully downloaded!


## Drop gdT cells from non-gdT data

For gdT cell subclustering, we included some cells that initially clustered with MAIT, CD8 CM, and CD8 EM cells. Here, we'll collect the barcodes in the gdT subclustering results, then drop those barcodes from the MAIT, CD8 CM, and CD8 EM results so that no cell is labeled twice.

In [12]:
gdt_bc = iter_obs["t-gd"]['barcodes'].tolist()
len(gdt_bc)

54113

In [13]:
drop_set = ['t-cd8-mait', 't-cd8-cm', 't-cd8-em']

In [14]:
for cell_type in drop_set:
    obs = iter_obs[cell_type]
    n_start = obs.shape[0]
    # Keep rows whose barcode is not among the gdT barcodes
    keep_bc = ~obs['barcodes'].isin(gdt_bc)
    obs = obs[keep_bc]
    n_end = obs.shape[0]
    print('{c}; N Start: {s}; N End: {e}'.format(c = cell_type, s = str(n_start), e=str(n_end)))
    
    iter_obs[cell_type] = obs

t-cd8-mait; N Start: 50823; N End: 48084
t-cd8-cm; N Start: 43289; N End: 37568
t-cd8-em; N Start: 118291; N End: 105726
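
To make the filtering step concrete, here is a small sketch on made-up data. The barcodes and table contents are invented for illustration; only the `barcodes` column name and the cell type keys come from this notebook. It applies the same "not in the gdT set" filter and then checks that no barcode remains in more than one table.

```python
import pandas as pd

# Toy stand-ins for the subclustering obs tables; barcodes are made up
toy_obs = {
    't-gd':       pd.DataFrame({'barcodes': ['bc1', 'bc2', 'bc3']}),
    't-cd8-mait': pd.DataFrame({'barcodes': ['bc2', 'bc4', 'bc5']}),
    't-cd8-em':   pd.DataFrame({'barcodes': ['bc3', 'bc6']}),
}

# A set makes repeated membership checks cheap
gdt_set = set(toy_obs['t-gd']['barcodes'])

for cell_type in ['t-cd8-mait', 't-cd8-em']:
    obs = toy_obs[cell_type]
    # Keep only rows whose barcode is NOT in the gdT set
    toy_obs[cell_type] = obs[~obs['barcodes'].isin(gdt_set)]

# After filtering, each barcode should appear in exactly one table
combined = pd.concat(toy_obs.values())['barcodes']
n_dup = int(combined.duplicated().sum())
print(n_dup)
```

A check like `n_dup` could also be run across all subclustering tables, not only the three in `drop_set`, to catch overlaps between other subsets.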


## Assign labels to cell barcodes

Now, we'll join cell type labels from our cluster annotations to our cell barcode-level observations.

In [15]:
anno_uuids = {
    't-cd4-naive':     '68a20596-2a50-426e-a31c-92b7d579be0c',
    'treg':            '9e931d2a-4b0e-4392-895d-a1b1ffe494ca',
    't-other':         '459d1402-2c4d-4536-9c42-3ecf3c076306',
    't-cd8-naive':     '1cf6a489-6a46-46f4-9b13-7cdb73bcc267',
    't-cd8-mait':      '814634d6-d39e-4e02-8f66-e5d2be60f7e3',
    't-gd':            '4e13eb94-a7ac-44d6-bc3c-a8923033d7f3',
    't-proliferating': '3477460e-28c4-483e-a525-70a820b2e3fc',
    't-cd8-em':        'd5212382-fe47-4f32-8e2c-cc080427b08b',
    't-cd8-cm':        '0cf05843-0ff0-4356-a3de-82ff62912e4e',
    't-isg-high':      '51dbac56-a8b1-4296-a199-fd6ac2104b15'
}

In [16]:
iter_anno = {}
for cell_type,uuid in anno_uuids.items():
    iter_anno[cell_type] = read_csv_uuid(uuid)

downloading fileID: 68a20596-2a50-426e-a31c-92b7d579be0c
Files have been successfully downloaded!
downloading fileID: 9e931d2a-4b0e-4392-895d-a1b1ffe494ca
Files have been successfully downloaded!
downloading fileID: 459d1402-2c4d-4536-9c42-3ecf3c076306
Files have been successfully downloaded!
downloading fileID: 1cf6a489-6a46-46f4-9b13-7cdb73bcc267
Files have been successfully downloaded!
downloading fileID: 814634d6-d39e-4e02-8f66-e5d2be60f7e3
Files have been successfully downloaded!
downloading fileID: 4e13eb94-a7ac-44d6-bc3c-a8923033d7f3
Files have been successfully downloaded!
downloading fileID: 3477460e-28c4-483e-a525-70a820b2e3fc
Files have been successfully downloaded!
downloading fileID: d5212382-fe47-4f32-8e2c-cc080427b08b
Files have been successfully downloaded!
downloading fileID: 0cf05843-0ff0-4356-a3de-82ff62912e4e
Files have been successfully downloaded!
downloading fileID: 51dbac56-a8b1-4296-a199-fd6ac2104b15
Files have been successfully downloaded!


In [17]:
iter_bc_anno = {}
for cell_type,sub_obs in iter_obs.items():
    sub_anno = iter_anno[cell_type]
    join_col = sub_anno.columns[0]
    sub_anno[join_col] = sub_anno[join_col].astype(str).astype('category')
    sub_obs = sub_obs.merge(sub_anno, on = join_col, how = 'left')
    sub_obs = sub_obs[['barcodes', 'AIFI_L1', 'AIFI_L2', 'AIFI_L3']]
    iter_bc_anno[cell_type] = sub_obs
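
The join above is a left merge of per-cell observations onto per-cluster labels. The sketch below shows the same pattern on made-up data and adds two optional safeguards: `validate = 'many_to_one'` makes pandas raise an error if a cluster appears more than once in the label table, and an `isna()` check lists cells whose cluster received no label. The column name `leiden` and the label values are placeholders, not the real annotation files.

```python
import pandas as pd

# Toy per-cell table: each cell belongs to one cluster
toy_cells = pd.DataFrame({
    'barcodes': ['bc1', 'bc2', 'bc3', 'bc4'],
    'leiden':   ['0', '1', '1', '2'],
})

# Toy per-cluster labels; cluster '2' is deliberately missing
toy_labels = pd.DataFrame({
    'leiden':  ['0', '1'],
    'AIFI_L3': ['type A', 'type B'],
})

merged = toy_cells.merge(
    toy_labels,
    on = 'leiden',
    how = 'left',
    validate = 'many_to_one'  # error if a cluster is listed twice in toy_labels
)

# Cells whose cluster had no entry in the label table
unlabeled = merged[merged['AIFI_L3'].isna()]
print(unlabeled['barcodes'].tolist())
```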

## Assemble all labels

In [18]:
all_anno = pd.concat(iter_bc_anno)

In [19]:
all_anno.shape

(1191200, 4)

In [20]:
adata.shape

(1191327, 1487)

In [25]:
obs = adata.obs
missing_obs = obs[~obs['barcodes'].isin(all_anno['barcodes'])]

In [27]:
missing_obs['leiden_resolution_1.5'].value_counts()

leiden_resolution_1.5
14    398
0       0
12      0
21      0
20      0
19      0
18      0
17      0
16      0
15      0
13      0
11      0
1       0
10      0
9       0
8       0
7       0
6       0
5       0
4       0
3       0
2       0
22      0
Name: count, dtype: int64

All of the unlabeled cells fall in cluster 14, which corresponds to Tregs.

In [28]:
treg_anno = iter_bc_anno['treg']

In [30]:
missing_treg_anno = treg_anno[treg_anno['barcodes'].isin(missing_obs['barcodes'])]

In [32]:
missing_treg_anno.head()

Unnamed: 0,barcodes,AIFI_L1,AIFI_L2,AIFI_L3


In [23]:
sum(adata.obs['barcodes'].isin(all_anno['barcodes']))

1190929

## Add to AnnData to preview assignments

In [21]:
obs = adata.obs
obs = obs.reset_index(drop = True)
obs = obs.merge(all_anno, on = 'barcodes', how = 'left')
obs = obs.set_index('barcodes', drop = True)

In [22]:
adata.obs = obs

ValueError: Length of passed value for obs_names is 1191598, but this AnnData has shape: (1191327, 1487)
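
The merge produced 1,191,598 rows for an AnnData with 1,191,327 cells, which is 271 more than expected. That matches the gap between the 1,191,200 rows in `all_anno` and the 1,190,929 `adata` barcodes found in it, which is what we would see if some barcodes appear more than once in `all_anno`. This is an inference from the counts, not something verified here. The sketch below uses made-up data to show how duplicated keys inflate a left merge, and one possible way to detect and resolve them. Keeping the first label per barcode is an arbitrary choice for illustration; in practice the duplicated rows should be inspected before deciding which label to keep.

```python
import pandas as pd

# Toy obs table with unique barcodes
toy_obs = pd.DataFrame({'barcodes': ['bc1', 'bc2', 'bc3', 'bc4']})

# Toy annotations: 'bc2' is labeled twice, 'bc4' is not labeled at all
toy_anno = pd.DataFrame({
    'barcodes': ['bc1', 'bc2', 'bc2', 'bc3'],
    'AIFI_L3':  ['A', 'B', 'C', 'D'],
})

# Show every row that shares a barcode with another row
dup_rows = toy_anno[toy_anno['barcodes'].duplicated(keep = False)]

# A left merge repeats the obs row once per matching annotation row
merged = toy_obs.merge(toy_anno, on = 'barcodes', how = 'left')
n_extra = len(merged) - len(toy_obs)

# One possible resolution: keep a single row per barcode, then merge
dedup = toy_anno.drop_duplicates(subset = 'barcodes', keep = 'first')
merged_ok = toy_obs.merge(dedup, on = 'barcodes', how = 'left',
                          validate = 'one_to_one')
print(n_extra, len(merged_ok))
```

With one row per barcode, the merged table has the same length as `obs`, so it can be assigned back to `adata.obs` without a length mismatch.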

In [None]:
sc.pl.umap(adata, color = ['AIFI_L1', 'AIFI_L2', 'AIFI_L3'], ncols = 1)

In [None]:
sc.pl.umap(adata, 
           color = ['leiden_resolution_1',
                    'leiden_resolution_1.5',
                    'leiden_resolution_2'],
           ncols = 1)

## Output final annotations

In [None]:
obs = adata.obs
obs = obs.reset_index(drop = False)

In [None]:
umap_mat = adata.obsm['X_umap']
umap_df = pd.DataFrame(umap_mat, columns = ['umap_1', 'umap_2'])
obs['umap_1'] = umap_df['umap_1']
obs['umap_2'] = umap_df['umap_2']
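
Assigning a pandas Series to a DataFrame column aligns on the index, so the cell above relies on `obs` having a default integer index after `reset_index()`. The sketch below, on made-up data, shows what happens when the indexes differ and a positional alternative that does not depend on the index. The array values and barcodes are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy obs indexed by barcode, and toy UMAP coordinates in the same row order
toy_obs = pd.DataFrame({'n_genes': [10, 20, 30]},
                       index = ['bc1', 'bc2', 'bc3'])
toy_umap = np.array([[0.1, 1.0], [0.2, 2.0], [0.3, 3.0]])

# A DataFrame built from the array gets a 0..n-1 integer index
umap_df = pd.DataFrame(toy_umap, columns = ['umap_1', 'umap_2'])

# Series assignment aligns on index: no labels match, so all values are NaN
toy_obs_bad = toy_obs.copy()
toy_obs_bad['umap_1'] = umap_df['umap_1']

# Assigning array columns is positional and ignores the index
toy_obs['umap_1'] = toy_umap[:, 0]
toy_obs['umap_2'] = toy_umap[:, 1]
print(toy_obs)
```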

In [None]:
obs.head()

In [None]:
out_dir = 'output'
if not os.path.isdir(out_dir):
    os.makedirs(out_dir)

In [None]:
obs_out_csv = '{p}/ref_pbmc_{c}_labeled_meta_umap_{d}.csv'.format(p = out_dir, c = cell_class, d = date.today())
obs.to_csv(obs_out_csv, index = False)

In [None]:
obs_out_parquet = '{p}/ref_pbmc_{c}_labeled_meta_umap_{d}.parquet'.format(p = out_dir, c = cell_class, d = date.today())
obs.to_parquet(obs_out_parquet, index = False)

In [None]:
bc_anno = obs[['barcodes', 'AIFI_L1', 'AIFI_L2', 'AIFI_L3']]

In [None]:
label_out_csv = '{p}/ref_pbmc_{c}_barcode_labels_{d}.csv'.format(p = out_dir, c = cell_class, d = date.today())
bc_anno.to_csv(label_out_csv, index = False)

In [None]:
label_out_parquet = '{p}/ref_pbmc_{c}_barcode_labels_{d}.parquet'.format(p = out_dir, c = cell_class, d = date.today())
bc_anno.to_parquet(label_out_parquet, index = False)

## Upload annotations to HISE

Finally, we'll use `hisepy.upload.upload_files()` to send a copy of our output to HISE to use for downstream analysis steps.

In [None]:
study_space_uuid = '64097865-486d-43b3-8f94-74994e0a72e0'
title = 'T cell barcode annotations {d}'.format(d = date.today())

In [None]:
iter_h5ad_uuids = list(iter_uuids.values())
iter_anno_uuids = list(anno_uuids.values())

In [None]:
in_files = [h5ad_uuid] + iter_h5ad_uuids + iter_anno_uuids

In [None]:
in_files

We should have 21 input files: the full T cell h5ad, 10 subclustering h5ads, and 10 annotation files.

In [None]:
len(in_files)
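
Since `in_files` is built from one parent h5ad plus the two UUID dictionaries, the expected length can be checked in code instead of by eye. This is a small sketch with placeholder IDs in place of the real UUIDs; the `toy_` names are hypothetical.

```python
# Placeholder IDs standing in for the real HISE UUIDs
toy_parent_id = 'parent-h5ad'
toy_h5ad_ids = ['h5ad-{i}'.format(i = i) for i in range(10)]
toy_anno_ids = ['anno-{i}'.format(i = i) for i in range(10)]

toy_in_files = [toy_parent_id] + toy_h5ad_ids + toy_anno_ids

# One parent file plus one h5ad and one annotation file per subset
expected_n = 1 + len(toy_h5ad_ids) + len(toy_anno_ids)
assert len(toy_in_files) == expected_n

# Each input ID should be listed only once
assert len(set(toy_in_files)) == len(toy_in_files)
print(len(toy_in_files))
```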

In [None]:
out_files = [obs_out_csv, obs_out_parquet,
             label_out_csv, label_out_parquet]

In [None]:
out_files

In [None]:
hisepy.upload.upload_files(
    files = out_files,
    study_space_id = study_space_uuid,
    title = title,
    input_file_ids = in_files
)

In [None]:
import session_info
session_info.show()