<h1>Automated Gating with Immunova: Loading peritonitis data into database</h1>

This is the first of three notebooks comparing automated gating in Immunova with expert manual gating. In this notebook I create the database entries.

<h3>Create project and experiments</h3>

In [1]:
from immunova.data.project import Project
from immunova.data.mongo_setup import global_init
from immunova.data.fcs_experiments import FCSExperiment, Panel
from immunova.data.utilities import get_fcs_file_paths
from immunova.flow.readwrite.read_fcs import explore_channel_mappings
from tqdm import tqdm_notebook, tqdm
import pandas as pd
global_init()

In [2]:
pd_project = Project(project_id='Peritonitis', owner='burtonrossj')
pd_project.save()

<Project: Project object>

In [3]:
str(pd_project.id)

'5d824215602a252ab0e7ebf4'

I will create four experiments in total:
* PBMC_T: T cell panel for PBMC samples
* PBMC_M: Myeloid cell panel for PBMC samples
* PDMC_T: T cell panel for peritoneal fluid samples
* PDMC_M: Myeloid cell panel for peritoneal fluid samples

Each of these experiments needs an associated flow cytometry panel. A Panel object defines the channel (fluorochrome)/marker (antibody) mappings for all associated flow data. This standardises the flow cytometry metadata at the point of entry.

Panel objects can be created from a Python dictionary or from an Excel template. In this case I have created an Excel template (see documentation for details on creating panel templates).

It is often useful to explore the channel and marker names of a large selection of fcs files to get a feel for the naming conventions and make sure you have covered all edge cases. There is a useful utility function in `immunova.flow.readwrite.read_fcs` called `explore_channel_mappings`. Given a directory, the function will search for all `.fcs` files and return all permutations of channel/marker pairings found.
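
To make the idea of "permutations of channel/marker pairings" concrete, here is a minimal sketch that counts the distinct sets of pairings across files. It is not Immunova's implementation: the function name `unique_mapping_sets` and the input shape (one list of `{'channel': ..., 'marker': ...}` dictionaries per file) are assumptions made for illustration, and no fcs files are read.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Assumed shape: each file is described by a list of channel/marker dictionaries.
Mapping = Dict[str, str]


def unique_mapping_sets(files: Iterable[List[Mapping]]) -> Counter:
    """Count each distinct set of channel/marker pairings across files."""
    counts: Counter = Counter()
    for mappings in files:
        # Sort the pairs so that column order alone does not create a new permutation.
        key: Tuple[Tuple[str, str], ...] = tuple(
            sorted((m["channel"], m.get("marker", "")) for m in mappings)
        )
        counts[key] += 1
    return counts


# Three hypothetical files: the first two differ only in column order,
# the third labels HLA-DR differently.
example_files = [
    [{"channel": "PE-A", "marker": "CD116"}, {"channel": "BV711-A", "marker": "HLA-DR"}],
    [{"channel": "BV711-A", "marker": "HLA-DR"}, {"channel": "PE-A", "marker": "CD116"}],
    [{"channel": "PE-A", "marker": "CD116"}, {"channel": "BV711-A", "marker": "HLA DR"}],
]
permutation_counts = unique_mapping_sets(example_files)
print(len(permutation_counts))
```

Each distinct key is one naming convention that the panel template has to account for.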

In [None]:
cm_permutations = explore_channel_mappings('/media/ross/FCS_DATA/Raya PD Samples/ds_friendly')

In [6]:
len(cm_permutations)

20

So there are 20 permutations of the ways that markers have been labelled in the fcs files. I can account for most cases using regular expressions, but in a few cases (e.g. live/dead staining) I have added like-for-like matches in the templates.
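
The combination of regular expressions and like-for-like matches can be sketched as follows. This is an illustration only: the patterns, the `LiveDead` standard name and the function `normalise_marker` are hypothetical and are not taken from the Excel templates or from Immunova.

```python
import re
from typing import Dict, Optional

# Hypothetical rules: standard marker name -> regular expression for raw labels.
REGEX_RULES: Dict[str, str] = {
    "CD14": r"^\s*CD[\s\-_]*14\s*$",
    "HLA-DR": r"^\s*HLA[\s\-_]*DR\s*$",
}
# Hypothetical like-for-like rules for labels that are awkward to express as a regex.
EXACT_RULES: Dict[str, str] = {
    "zombie aqua": "LiveDead",
    "l/d": "LiveDead",
}


def normalise_marker(raw: str) -> Optional[str]:
    """Map a raw marker label to its standard name, or None if nothing matches."""
    exact = EXACT_RULES.get(raw.strip().lower())
    if exact is not None:
        return exact
    hits = [
        name
        for name, pattern in REGEX_RULES.items()
        if re.search(pattern, raw, flags=re.IGNORECASE)
    ]
    if len(hits) > 1:
        # An ambiguous label should be fixed in the panel definition, not guessed.
        raise ValueError(f"{raw!r} matched multiple definitions: {hits}")
    return hits[0] if hits else None


print(normalise_marker("HLA DR"), normalise_marker("Zombie Aqua"))
```

Checking the exact rules first keeps the regular expressions simple, and raising on multiple matches surfaces overlapping definitions early.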

In [4]:
t_panel = Panel(panel_name='peritonitis_t_panel')
m_panel = Panel(panel_name='peritonitis_m_panel')

In [5]:
t_panel.create_from_excel(path='experiment_data/peritonitis_t_template.xlsx')

True

In [6]:
m_panel.create_from_excel(path='experiment_data/peritonitis_m_template.xlsx')

True

The `create_from_excel` method populates the Panel object from the Excel template. I can now save the panels to the database.

In [7]:
t_panel.save()
m_panel.save()

<Panel: Panel object>

With the panels created I can now create the experiments. When you create an experiment you must always associate it with a project. We therefore use the `add_experiment` method of the Project object.

In [8]:
pbmc_t = pd_project.add_experiment(experiment_id='PBMC_T', panel_name='peritonitis_t_panel')
pdmc_t = pd_project.add_experiment(experiment_id='PDMC_T', panel_name='peritonitis_t_panel')
pbmc_m = pd_project.add_experiment(experiment_id='PBMC_M', panel_name='peritonitis_m_panel')
pdmc_m = pd_project.add_experiment(experiment_id='PDMC_M', panel_name='peritonitis_m_panel')

Experiment created successfully!
Experiment created successfully!
Experiment created successfully!
Experiment created successfully!
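
The four calls above follow one pattern, so they could also be driven from a small table. The sketch below assumes only the keyword arguments shown in the cell (`experiment_id`, `panel_name`); `create_experiments` and `fake_add_experiment` are hypothetical helpers, and the stand-in replaces `pd_project.add_experiment` so that the example runs without a database.

```python
from typing import Callable, Dict

# Experiment ID -> panel name, copied from the four calls above.
EXPERIMENT_PANELS: Dict[str, str] = {
    "PBMC_T": "peritonitis_t_panel",
    "PDMC_T": "peritonitis_t_panel",
    "PBMC_M": "peritonitis_m_panel",
    "PDMC_M": "peritonitis_m_panel",
}


def create_experiments(add_experiment: Callable[..., object],
                       panels: Dict[str, str]) -> Dict[str, object]:
    """Call add_experiment once per entry and key the results by experiment ID."""
    return {
        exp_id: add_experiment(experiment_id=exp_id, panel_name=panel_name)
        for exp_id, panel_name in panels.items()
    }


def fake_add_experiment(experiment_id: str, panel_name: str) -> str:
    """Stand-in for pd_project.add_experiment (assumption, for illustration only)."""
    return f"{experiment_id} -> {panel_name}"


experiments = create_experiments(fake_add_experiment, EXPERIMENT_PANELS)
print(experiments["PDMC_M"])
```

In the notebook, `pd_project.add_experiment` would be passed in place of the stand-in.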


Now that the experiments are created I can start adding the fcs files. The `add_new_sample` method creates a new fcs file entry in the Mongo database and associates it with the experiment. See the documentation below:


In [9]:
?pbmc_t.add_new_sample

Signature:
pbmc_t.add_new_sample(
    sample_id: str,
    file_path: str,
    controls: list,
    comp_matrix: <built-in function array> = None,
    compensate: bool = True,
    feedback: bool = True,
) -> str
Docstring:
Add a new sample (FileGroup) to this experiment
:param sample_id: primary ID for identification of sample (FileGroup.primary_id)
:param file_path: file path of the primary fcs fi

<h3>Add PDMC files (Myeloid cell panel)</h3>

A summary table of all the samples collected in the peritonitis study can provide us with the sample numbers and the manual gating results.

In [10]:
summary = pd.read_excel('/media/ross/FCS_DATA/Raya PD Samples/ClinicalData_and_ManualGatingResults.xlsx')

In [11]:
pdmc_sample_ids = summary[summary['Cell origin'] == 'PDMC']['Patient no.'].values

In [12]:
pdmc_sample_ids

array(['142-09', '175-09', '209-03', '209-05', '210-12', '210-14',
       '229-02', '237-06', '239-02', '239-04', '251-07', '251-08',
       '254-04', '254-05', '255-04', '255-05', '262-01', '264-02',
       '267-01', '267-02', '272-01', '273-01', '276-01', '279-03',
       '286-02', '286-03', '286-04', '288-02', '289-01', '294-01',
       '294-02', '294-03', '295-01', '298-01', '302-01', '305-01',
       '305-02', '305-03', '306-01', '307-01', '308-01', '308-02R',
       '308-03R', '308-04', '310-01', '315-01', '315-02', '316-01',
       '318-01', '320-01', '321-01', '322-01', '323-01', '323-02',
       '324-01', '326-01'], dtype=object)

We can use the utility function `get_fcs_file_paths` from immunova's data module to generate file paths for adding samples.

In [13]:
get_fcs_file_paths(fcs_dir='/media/ross/FCS_DATA/Raya PD Samples/ds_friendly/PDMC/142-09/m_panel',
                  control_names=['CD1c', 'HLA-DR'], ctrl_id='FMO')

{'primary': ['/media/ross/FCS_DATA/Raya PD Samples/ds_friendly/PDMC/142-09/m_panel/Peri 142-09R PDMC 1 N Panel_N1_013.fcs'],
 'controls': [{'control_id': 'CD1c',
   'path': '/media/ross/FCS_DATA/Raya PD Samples/ds_friendly/PDMC/142-09/m_panel/Peri 142-09R PDMC 1 N Panel_N2 FMO CD1c_014.fcs'},
  {'control_id': 'HLA-DR',
   'path': '/media/ross/FCS_DATA/Raya PD Samples/ds_friendly/PDMC/142-09/m_panel/Peri 142-09R PDMC 1 N Panel_N3 FMO HLA-DR_015.fcs'}]}
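
The output above suggests that files are separated by looking for the control tag (`FMO`) and a control name in the file name. The sketch below reproduces that behaviour on plain file names. It is a guess at the logic, not the source of `get_fcs_file_paths`, and `split_primary_and_controls` is a hypothetical name.

```python
from typing import Dict, List


def split_primary_and_controls(file_names: List[str],
                               control_names: List[str],
                               ctrl_id: str) -> Dict[str, list]:
    """Separate primary files from control files using substrings of the file name."""
    primary: List[str] = []
    controls: List[Dict[str, str]] = []
    for name in sorted(file_names):
        if ctrl_id not in name:
            # No control tag in the name: treat the file as a primary file.
            primary.append(name)
            continue
        # Use the first entry of control_names found after the control tag.
        # A tagged file that matches no control name is dropped.
        tail = name.split(ctrl_id, 1)[1]
        for control in control_names:
            if control in tail:
                controls.append({"control_id": control, "path": name})
                break
    return {"primary": primary, "controls": controls}


files = [
    "Peri 142-09R PDMC 1 N Panel_N1_013.fcs",
    "Peri 142-09R PDMC 1 N Panel_N2 FMO CD1c_014.fcs",
    "Peri 142-09R PDMC 1 N Panel_N3 FMO HLA-DR_015.fcs",
]
result = split_primary_and_controls(files, ["CD1c", "HLA-DR"], "FMO")
print(result["primary"], [c["control_id"] for c in result["controls"]])
```

A consequence of this approach is that control files without the tag in their name end up in the `primary` list.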

In [14]:
pdmc_m_142_09 = get_fcs_file_paths(fcs_dir='/media/ross/FCS_DATA/Raya PD Samples/ds_friendly/PDMC/142-09/m_panel',
                  control_names=['CD1c', 'HLA-DR'], ctrl_id='FMO')
primary = pdmc_m_142_09['primary'][0]
controls = pdmc_m_142_09['controls']

In [15]:
pdmc_m.add_new_sample(sample_id='pd142_09_m', file_path=primary, controls=controls)

Generating main file entry...
Generating file entries for controls...
Successfully created pd142_09_m and associated to PDMC_M


'5d824225602a252ab0e7f12a'

Let's add the rest of the files for the PDMC samples.

In [29]:
#pd209-03 M
# Only one file for this patient
primary = '/media/ross/FCS_DATA/Raya PD Samples/ds_friendly/PDMC/209-03/m_panel/Peri 209-03 PDMC_A1_001.fcs'
pdmc_m.add_new_sample(sample_id='pd209-03_m', controls=[], file_path=primary)

Generating main file entry...
Generating file entries for controls...
Successfully created pd209-03_m and associated to PDMC_M


'5d824693602a252ab0e7f48c'

In [30]:
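# Assumption: FlowData comes from the flowio package; it is not imported earlier in this notebook
from flowio import FlowData
fcs = FlowData(primary)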

In [31]:
fcs.channels

{'1': {'PnN': 'FSC-A'},
 '2': {'PnN': 'FSC-H'},
 '3': {'PnN': 'SSC-A'},
 '4': {'PnN': 'SSC-H'},
 '5': {'PnN': 'SSC-W'},
 '6': {'PnN': 'Alexa Fluor 488-A', 'PnS': 'CD14'},
 '7': {'PnN': 'Alexa Fluor 647-A', 'PnS': 'Siglec-8'},
 '8': {'PnN': 'APC-Cy7-A', 'PnS': 'CD3'},
 '9': {'PnN': 'Alexa Fluor 405-A', 'PnS': 'CD1c'},
 '10': {'PnN': 'AmCyan-A', 'PnS': 'Zombie Aqua'},
 '11': {'PnN': 'BV605-A', 'PnS': 'CD15'},
 '12': {'PnN': 'BV711-A', 'PnS': 'HLA-DR'},
 '13': {'PnN': 'PE-A', 'PnS': 'CD116'},
 '14': {'PnN': 'PE-Cy7-A', 'PnS': 'CD19'},
 '15': {'PnN': 'Alexa Fluor 700-A', 'PnS': 'CD45'},
 '16': {'PnN': 'Time'}}

The myeloid panel file for pd209-03 is missing CD16.
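
A check like this can be automated by comparing the `PnS` entries of the channel dictionary against the markers a panel should contain. The sketch below uses a shortened copy of the dictionary shown above; the `EXPECTED` list is an assumption made for illustration and is not the full myeloid panel definition.

```python
from typing import Dict, Iterable, Set


def missing_markers(channels: Dict[str, Dict[str, str]],
                    expected: Iterable[str]) -> Set[str]:
    """Return the expected markers that have no PnS entry in the channel dictionary."""
    present = {meta["PnS"] for meta in channels.values() if "PnS" in meta}
    return set(expected) - present


# Shortened copy of the channel dictionary printed above.
channels = {
    "1": {"PnN": "FSC-A"},
    "6": {"PnN": "Alexa Fluor 488-A", "PnS": "CD14"},
    "12": {"PnN": "BV711-A", "PnS": "HLA-DR"},
    "16": {"PnN": "Time"},
}
# Assumed subset of the markers expected in the myeloid panel.
EXPECTED = ["CD14", "CD16", "HLA-DR"]
print(missing_markers(channels, EXPECTED))
```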

In [33]:
# pd209-05 M
# Only one file for this patient
primary = '/media/ross/FCS_DATA/Raya PD Samples/ds_friendly/PDMC/209-05/m_panel/Stable 209-05 PDMC N1 FSC 350_013.fcs'
pdmc_m.add_new_sample(sample_id='pd209-05_m', controls=[], file_path=primary)

Generating main file entry...
Generating file entries for controls...
Successfully created pd209-05_m and associated to PDMC_M


'5d824920602a252ab0e7f49f'

In [38]:
# Add some convenience variables/partialised func
from functools import partial
root = '/media/ross/FCS_DATA/Raya PD Samples/ds_friendly/PDMC'
get_fp = partial(get_fcs_file_paths, control_names=['CD1c', 'HLA-DR'], ctrl_id='FMO')

In [40]:
# pd210-14 M
file_paths = get_fp(f'{root}/210-14/m_panel')



In [41]:
file_paths

{'primary': ['/media/ross/FCS_DATA/Raya PD Samples/ds_friendly/PDMC/210-14/m_panel/Peri 210-14 PDMC N1_019.fcs',
  '/media/ross/FCS_DATA/Raya PD Samples/ds_friendly/PDMC/210-14/m_panel/Peri 210-14 PDMC N2_020.fcs',
  '/media/ross/FCS_DATA/Raya PD Samples/ds_friendly/PDMC/210-14/m_panel/Peri 210-14 PDMC N3_021.fcs'],
 'controls': []}

These file names do not include the name of the FMO, so my function did not identify them as controls and returned all three files as primary. I know that N2 is the FMO for CD1c and N3 is the FMO for HLA-DR.

In [45]:
primary = file_paths['primary'][0]
controls = [dict(control_id='CD1c', path=file_paths['primary'][1]),
           dict(control_id='HLA-DR', path=file_paths['primary'][2])]

In [46]:
pdmc_m.add_new_sample(sample_id='pd210-14_m', controls=controls, file_path=primary)

Generating main file entry...
Generating file entries for controls...
Successfully created pd210-14_m and associated to PDMC_M


'5d824e60602a252ab0e7fa29'
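
The manual assignment above (N1 is the primary file, N2 the CD1c FMO, N3 the HLA-DR FMO) can be generalised by parsing the tube number from the file name. The sketch below is one possible way to do this. `tube_number` and `assign_tubes` are hypothetical helpers, and the regular expression assumes that the tube label is a prefix letter followed by digits and is not preceded by another letter or digit.

```python
import re
from pathlib import PurePosixPath
from typing import Dict, List, Optional, Tuple


def tube_number(file_name: str, prefix: str = "N") -> Optional[int]:
    """Extract the tube number from labels such as 'N1' or 'Panel_N2'."""
    pattern = rf"(?<![A-Za-z0-9]){re.escape(prefix)}(\d+)(?!\d)"
    match = re.search(pattern, file_name)
    return int(match.group(1)) if match else None


def assign_tubes(paths: List[str],
                 tube_to_control: Dict[int, str],
                 prefix: str = "N",
                 primary_tube: int = 1) -> Tuple[List[str], List[Dict[str, str]]]:
    """Split paths into primary files and control entries by tube number."""
    primary: List[str] = []
    controls: List[Dict[str, str]] = []
    for path in paths:
        # Only inspect the file name, so directory names cannot cause false matches.
        number = tube_number(PurePosixPath(path).name, prefix)
        if number == primary_tube:
            primary.append(path)
        elif number in tube_to_control:
            controls.append({"control_id": tube_to_control[number], "path": path})
    return primary, controls


paths = [
    "/data/210-14/m_panel/Peri 210-14 PDMC N1_019.fcs",
    "/data/210-14/m_panel/Peri 210-14 PDMC N2_020.fcs",
    "/data/210-14/m_panel/Peri 210-14 PDMC N3_021.fcs",
]
primary, controls = assign_tubes(paths, {2: "CD1c", 3: "HLA-DR"})
print(primary, [c["control_id"] for c in controls])
```

Unlike a plain substring test, parsing the number means a label such as `N10` is not mistaken for `N1`. The `/data/...` directories are shortened placeholders for the real paths.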

In [47]:
# pd237-06 M
file_paths = get_fp(f'{root}/237-06/m_panel')



In [48]:
file_paths

{'primary': ['/media/ross/FCS_DATA/Raya PD Samples/ds_friendly/PDMC/237-06/m_panel/Peri 237-06 PDMC N1_001.fcs',
  '/media/ross/FCS_DATA/Raya PD Samples/ds_friendly/PDMC/237-06/m_panel/Peri 237-06 PDMC N2_002.fcs',
  '/media/ross/FCS_DATA/Raya PD Samples/ds_friendly/PDMC/237-06/m_panel/Peri 237-06 PDMC N3_003.fcs'],
 'controls': []}

In [49]:
primary = file_paths['primary'][0]
controls = [dict(control_id='CD1c', path=file_paths['primary'][1]),
           dict(control_id='HLA-DR', path=file_paths['primary'][2])]

In [50]:
pdmc_m.add_new_sample(sample_id='pd237-06_m', controls=controls, file_path=primary)

Generating main file entry...
Generating file entries for controls...
Successfully created pd237-06_m and associated to PDMC_M


'5d824f5e602a252ab0e80183'

In [51]:
# pd239-02 M
file_paths = get_fp(f'{root}/239-02/m_panel')



In [52]:
file_paths

{'primary': ['/media/ross/FCS_DATA/Raya PD Samples/ds_friendly/PDMC/239-02/m_panel/239-02 Stable PDMC Neat N1_014.fcs',
  '/media/ross/FCS_DATA/Raya PD Samples/ds_friendly/PDMC/239-02/m_panel/239-02 Stable PDMC Neat N2_015.fcs',
  '/media/ross/FCS_DATA/Raya PD Samples/ds_friendly/PDMC/239-02/m_panel/239-02 Stable PDMC Neat N3_016.fcs'],
 'controls': []}

In [53]:
primary = file_paths['primary'][0]
controls = [dict(control_id='CD1c', path=file_paths['primary'][1]),
           dict(control_id='HLA-DR', path=file_paths['primary'][2])]

In [54]:
pdmc_m.add_new_sample(sample_id='pd239-02_m', controls=controls, file_path=primary)

Generating main file entry...
Generating file entries for controls...
Successfully created pd239-02_m and associated to PDMC_M


'5d8250b9602a252ab0e8080f'

In [55]:
# pd239-04 M
file_paths = get_fp(f'{root}/239-04/m_panel')



I now know the general errors that arise, so I can loop over the remaining files and catch errors where necessary.

In [65]:
from immunova.data.fcs import FileGroup
processed_so_far = FileGroup.objects()
processed_so_far = [f.primary_id.replace('_m', '').replace('pd', '') for f in processed_so_far]
processed_so_far

['142-09', '209-03', '209-05', '210-14', '237-06', '239-02']

In [66]:
remaining = [p for p in pdmc_sample_ids if p not in processed_so_far]
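
Recovering the patient number with chained `replace` calls works for the IDs used here, but it removes `pd` and `_m` wherever they occur in the string. A slightly stricter sketch strips only a leading prefix and a trailing suffix. `patient_id_from_sample_id` is a hypothetical helper and the sample IDs below are examples, not database contents.

```python
from typing import List


def patient_id_from_sample_id(sample_id: str,
                              prefix: str = "pd",
                              suffix: str = "_m") -> str:
    """Strip the prefix from the start and the suffix from the end only."""
    if sample_id.startswith(prefix):
        sample_id = sample_id[len(prefix):]
    if sample_id.endswith(suffix):
        sample_id = sample_id[:-len(suffix)]
    return sample_id


processed_ids: List[str] = ["pd209-03_m", "pd210-14_m"]
all_patients: List[str] = ["142-09", "209-03", "210-14"]

# A set gives fast membership tests; the list comprehension keeps the original order.
done = {patient_id_from_sample_id(s) for s in processed_ids}
remaining_patients = [p for p in all_patients if p not in done]
print(remaining_patients)

# A made-up ID showing the difference from chained replace calls.
print("pd300_mix_m".replace("_m", "").replace("pd", ""))
print(patient_id_from_sample_id("pd300_mix_m"))
```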

In [74]:
# Add some convenience variables/partialised func
root = '/media/ross/FCS_DATA/Raya PD Samples/ds_friendly/PDMC'
get_fp = partial(get_fcs_file_paths, control_names=['CD1c', 'HLA-DR'], ctrl_id='FMO')
no_files = []
errors = []

for pt_id in tqdm_notebook(remaining):
    file_paths = get_fp(f'{root}/{pt_id}/m_panel')
    if not file_paths['controls']:
        if not file_paths['primary']:
            no_files.append(pt_id)
            continue
        if len(file_paths['primary']) == 1:
            print(f'Single file found for {pt_id}')
            primary = file_paths['primary'][0]
            controls = []
            pdmc_m.add_new_sample(sample_id=f'pd{pt_id}_m', 
                                  controls=controls, 
                                  file_path=primary)
        else:
            primary = []
            controls = []
            for f in file_paths['primary']:
                if f.find('N2') != -1:
                    controls.append(dict(control_id='CD1c', path=f))
                if f.find('N3') != -1:
                    controls.append(dict(control_id='HLA-DR', path=f))
                if f.find('N1') != -1:
                    primary.append(f)
            if len(primary) > 1:
                errors.append(pt_id)
                continue
            if len(controls) > 2:
                errors.append(pt_id)
                continue
            primary = primary[0]
            pdmc_m.add_new_sample(sample_id=f'pd{pt_id}_m', 
                      controls=controls, 
                      file_path=primary)
    else:
        if not file_paths['primary']:
            errors.append(pt_id)
            continue
        if len(file_paths['primary'])>1:
            errors.append(pt_id)
            continue
        # control_names lists two FMO controls (CD1c, HLA-DR), so expect two here
        if len(file_paths['controls'])>2:
            errors.append(pt_id)
            continue
        if len(file_paths['primary']) == 1 and len(file_paths['controls']) == 2:
            primary = file_paths['primary'][0]
            controls = file_paths['controls']
            pdmc_m.add_new_sample(sample_id=f'pd{pt_id}_m', 
                                  controls=controls, 
                                  file_path=primary)

Single file found for 239-04
Error: a file group with id pd239-04_m already exists
Error: a file group with id pd251-07_m already exists
Error: a file group with id pd251-08_m already exists
Single file found for 254-04
Error: a file group with id pd254-04_m already exists
Single file found for 254-05
Error: a file group with id pd254-05_m already exists
Single file found for 255-04
Error: a file group with id pd255-04_m already exists
Single file found for 255-05
Error: a file group with id pd255-05_m already exists
Error: a file group with id pd264-02_m already exists
Single file found for 267-01
Error: a file group with id pd267-01_m already exists
Error: a file group with id pd267-02_m already exists
Single file found for 272-01
Error: a file group with id pd272-01_m already exists
Error: a file group with id pd273-01_m already exists
Error: a file group with id pd276-01_m already exists
Error: a file group with id pd286-02_m already exists
Error: a file group with id pd286-03_m al

In [75]:
no_files

['175-09',
 '229-02',
 '262-01',
 '289-01',
 '294-01',
 '308-04',
 '320-01',
 '321-01',
 '322-01',
 '323-02',
 '324-01',
 '326-01']
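
The branching in the loop above can be separated from the database calls, which makes it easier to test. The sketch below is a simplified variant rather than a line-for-line copy of the loop: it requires exactly one primary file and accepts any number of controls up to the number of known control tubes. `classify_sample` and the example file names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Control = Dict[str, str]
Result = Tuple[str, Optional[str], List[Control]]


def classify_sample(file_paths: Dict[str, list],
                    tube_to_control: Dict[str, str],
                    primary_tag: str) -> Result:
    """Return (status, primary, controls); status is 'ok', 'no_files' or 'error'."""
    primary = list(file_paths["primary"])
    controls = list(file_paths["controls"])
    if not primary:
        return ("error" if controls else "no_files"), None, []
    if not controls and len(primary) > 1:
        # Controls were not tagged in the file names, so fall back to tube labels.
        untagged = primary
        primary = [p for p in untagged if primary_tag in p]
        controls = [
            {"control_id": control, "path": p}
            for tag, control in tube_to_control.items()
            for p in untagged
            if tag in p
        ]
    if len(primary) != 1 or len(controls) > len(tube_to_control):
        return "error", None, []
    return "ok", primary[0], controls


M_TUBES = {"N2": "CD1c", "N3": "HLA-DR"}
examples = {
    "no-files": {"primary": [], "controls": []},
    "single": {"primary": ["a N1_001.fcs"], "controls": []},
    "untagged": {"primary": ["b N1_001.fcs", "b N2_002.fcs", "b N3_003.fcs"],
                 "controls": []},
    "ambiguous": {"primary": ["c N1_001.fcs", "c N1_002.fcs"], "controls": []},
}

no_files_demo, errors_demo, ready = [], [], {}
for pt_id, paths in examples.items():
    status, primary, controls = classify_sample(paths, M_TUBES, "N1")
    if status == "no_files":
        no_files_demo.append(pt_id)
    elif status == "error":
        errors_demo.append(pt_id)
    else:
        # In the notebook this is where add_new_sample would be called.
        ready[pt_id] = (primary, controls)

print(no_files_demo, errors_demo, sorted(ready))
```

Because the function returns a status instead of writing to the database, it can be run repeatedly without triggering "already exists" errors.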

<h3>Add PDMC files (T cell panel)</h3>

In [77]:
root = '/media/ross/FCS_DATA/Raya PD Samples/ds_friendly/PDMC'
get_fp = partial(get_fcs_file_paths, control_names=['CXCR3', 'CD25', 'CD45RA',
                                                   'CCR7'], ctrl_id='FMO')

In [79]:
pdmc_t = FCSExperiment.objects(experiment_id='PDMC_T').get()
no_files = []
errors = []

for pt_id in tqdm_notebook(pdmc_sample_ids):
    file_paths = get_fp(f'{root}/{pt_id}/t_panel')
    if not file_paths['controls']:
        if not file_paths['primary']:
            no_files.append(pt_id)
            continue
        if len(file_paths['primary']) == 1:
            print(f'Single file found for {pt_id}')
            primary = file_paths['primary'][0]
            controls = []
            pdmc_t.add_new_sample(sample_id=f'pd{pt_id}_t', 
                                  controls=controls, 
                                  file_path=primary)
        else:
            primary = []
            controls = []
            for f in file_paths['primary']:
                if f.find('T2') != -1:
                    controls.append(dict(control_id='CXCR3', path=f))
                if f.find('T3') != -1:
                    controls.append(dict(control_id='CD25', path=f))
                if f.find('T4') != -1:
                    controls.append(dict(control_id='CD45RA', path=f))
                if f.find('T5') != -1:
                    controls.append(dict(control_id='CCR7', path=f))
                if f.find('T1') != -1:
                    primary.append(f)
            if not primary:
                errors.append(pt_id)
                continue                
            if len(primary) > 1:
                errors.append(pt_id)
                continue
            # The T cell panel has four FMO controls (T2 to T5), so allow up to four
            if len(controls) > 4:
                errors.append(pt_id)
                continue
            primary = primary[0]
            pdmc_t.add_new_sample(sample_id=f'pd{pt_id}_t', 
                      controls=controls, 
                      file_path=primary)
    else:
        if not file_paths['primary']:
            errors.append(pt_id)
            continue
        if len(file_paths['primary'])>1:
            errors.append(pt_id)
            continue
        if len(file_paths['controls'])>4:
            errors.append(pt_id)
            continue
        if len(file_paths['primary']) == 1 and len(file_paths['controls']) == 4:
            primary = file_paths['primary'][0]
            controls = file_paths['controls']
            pdmc_t.add_new_sample(sample_id=f'pd{pt_id}_t', 
                                  controls=controls, 
                                  file_path=primary)

Single file found for 175-09
Generating main file entry...
Unable to normalise CD45RA PE Dazz; matched multiple channels in linked panel, check panel for incorrect definitions. Matches found: ['CD45RA', 'CD4']
Error: invalid channel/marker mappings for pd175-09_t, at path /media/ross/FCS_DATA/Raya PD Samples/ds_friendly/PDMC/175-09/t_panel/175-09 Stable PDMC T1_014.fcs, aborting.
Single file found for 209-05
Generating main file entry...
Unable to normalise CD45RA PE Dazz; matched multiple channels in linked panel, check panel for incorrect definitions. Matches found: ['CD45RA', 'CD4']
Error: invalid channel/marker mappings for pd209-05_t, at path /media/ross/FCS_DATA/Raya PD Samples/ds_friendly/PDMC/209-05/t_panel/PD210-12-P PDMC 2 T Cells_T1 209-05stable Tpanel PDMC_019.fcs, aborting.
Single file found for 239-02
Generating main file entry...
Unable to normalise CD45RA PE Dazz; matched multiple channels in linked panel, check panel for incorrect definitions. Matches found: ['CD45RA',

KeyboardInterrupt: 
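
The error above reports that `CD45RA PE Dazz` matched both the `CD45RA` and the `CD4` definitions. One likely cause is a `CD4` pattern that also matches the start of `CD45RA`. The patterns below are assumptions made for illustration (the actual definitions are in the Excel template and are not shown here). They sketch how a negative lookahead removes the ambiguity.

```python
import re
from typing import Dict, List

raw_label = "CD45RA PE Dazz"

# Assumed loose patterns: 'CD4' is a prefix of 'CD45RA', so both match.
loose: Dict[str, str] = {"CD4": r"CD4", "CD45RA": r"CD45RA"}

# Stricter patterns: the marker name must not be followed by a letter or digit.
strict: Dict[str, str] = {
    "CD4": r"CD4(?![0-9A-Za-z])",
    "CD45RA": r"CD45RA(?![0-9A-Za-z])",
}


def matching_definitions(patterns: Dict[str, str], label: str) -> List[str]:
    """Return the sorted names of all definitions whose pattern matches the label."""
    return sorted(
        name
        for name, pattern in patterns.items()
        if re.search(pattern, label, flags=re.IGNORECASE)
    )


print(matching_definitions(loose, raw_label))
print(matching_definitions(strict, raw_label))
```

With the stricter patterns a label such as `CD4 BV510` (a made-up example) still resolves to `CD4` only.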