This notebook documents the manipulations I use to count patients and samples, because I'm sick of redoing them every time.

In [1]:
import pandas as pd
import numpy as np

In [2]:
fnotu = '/Users/claire/github/aspiration-analysis/data/clean/rosen.otu_table.rel_abun.clean'
fnmeta = '/Users/claire/github/aspiration-analysis/data/clean/rosen.metadata.clean'

meta = pd.read_csv(fnmeta, sep='\t', index_col=0)
meta.columns

Index([u' If Yes, specify the symptom score',
       u' If yes, please indicate level', u'% time pH<4', u'% time pH<4:',
       u'A1. Subject ID number:', u'A2. Subject initials:',
       u'A3. What Cohort is the subject enrolled into?',
       u'A4. Aim(s) enrolled in?',
       u'A5.  Date of initial/baseline visit/procedure (MM/DD/YYYY):',
       u'A5a. Date filled out(MM/DD/YYYY):',
       ...
       u'STUDYID', u'STUDY', u'AIM', u'SOURCE', u'PHMII', u'ACIDSUP', u'DATE',
       u'ppi_consolidated', u'mbs_consolidated', u'total_reads'],
      dtype='object', length=958)

In [3]:
# Remove some samples I don't want to include from the metadata
meta = meta[~meta['sample_id.1'].str.endswith('F')]
meta = meta[~meta['sample_id.1'].str.endswith('sick')]
meta = meta[~meta['sample_id.1'].str.endswith('F2')]
meta = meta[~meta['sample_id.1'].str.endswith('F2T')]
meta = meta[~meta['sample_id.1'].str.endswith('2')]
meta = meta[~meta['sample_id.1'].str.startswith('05')]
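
The exclusions above can also be collected into a single boolean mask, which makes the rules easier to audit in one place. This is only a sketch on invented sample IDs, not the real metadata: the `toy` table, the `drop_suffixes` and `drop_prefixes` lists, and every ID in it are hypothetical. Note that the `'2'` suffix already covers `'F2'`.

```python
import pandas as pd

# Toy stand-in for the metadata; these sample IDs are invented for illustration.
toy = pd.DataFrame({'sample_id.1': ['04-074-1T', '04-074-1G', '01-001-1F',
                                    '01-002-2sick', '01-003-3F2', '01-004-4F2T',
                                    '05-010-1B', '02-164-1B']})

drop_suffixes = ['F', 'sick', 'F2', 'F2T', '2']  # same suffixes as the filters above
drop_prefixes = ['05']                           # same prefix as the filter above

ids = toy['sample_id.1']

# Start with "exclude nothing" and OR in each rule
exclude = pd.Series(False, index=toy.index)
for suffix in drop_suffixes:
    exclude |= ids.str.endswith(suffix)
for prefix in drop_prefixes:
    exclude |= ids.str.startswith(prefix)

kept = toy[~exclude]
print(kept['sample_id.1'].tolist())
```

Keeping the rules in lists means a new exclusion is a one-word change rather than another reassignment of `meta`.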

In [4]:
meta['mbs_consolidated'] = meta['mbs_consolidated'].fillna('nan')
meta['ppi_consolidated'] = meta['ppi_consolidated'].fillna('nan')

patientsamples = meta\
    .groupby(['mbs_consolidated', 'site', 'subject_id'])\
    .size()\
    .to_frame('n_samples').reset_index()
patientsamples.sort_values(by='n_samples', ascending=False).head()

Unnamed: 0,mbs_consolidated,site,subject_id,n_samples
0,Aspiration/Penetration,bal,02-184-5,1
317,,gastric_fluid,04-159-2,1
315,,gastric_fluid,04-150-1,1
314,,gastric_fluid,04-149-2,1
313,,gastric_fluid,04-144-7,1


# First, number of patients for each site alone

This is useful for Figure 1, where I compare samples across patients.

I should probably re-make Figure 1 using data only from patients who are not known to be aspirators?

In [5]:
sites = ['stool', 'bal', 'gastric_fluid', 'throat_swab']

In [6]:
# With MBS info (the grouping is on mbs_consolidated, not PPI)
patientsamples.query('site == @sites')\
    .groupby(['site', 'mbs_consolidated'])\
    .size()\
    .to_frame('n_samples')

Unnamed: 0_level_0,Unnamed: 1_level_0,n_samples
site,mbs_consolidated,Unnamed: 2_level_1
bal,Aspiration/Penetration,33
bal,Normal,33
bal,,36
gastric_fluid,Aspiration/Penetration,41
gastric_fluid,Normal,48
gastric_fluid,,58
stool,,25
throat_swab,Aspiration/Penetration,36
throat_swab,Normal,43
throat_swab,,97


Note: the number of stool samples to report is lower than the count above. The only stool samples I used were those from patients who also had oropharyngeal samples, whereas this table shows the total number of stool samples I have, regardless of whether there is an associated oropharyngeal sample.

To see the number of stool samples I actually used, look at the within-patient beta diversity notebook or below in the within-patient sample numbers.
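
As a rough illustration of the difference between the two stool counts, here is a minimal sketch assuming a table shaped like `patientsamples` (one row per patient and site). The `toy_ps` table and its patient IDs are invented; this is not the code from the beta diversity notebook.

```python
import pandas as pd

# Toy table with one row per (subject_id, site); values are invented.
toy_ps = pd.DataFrame({
    'subject_id': ['p1', 'p1', 'p2', 'p3', 'p3', 'p4'],
    'site': ['stool', 'throat_swab', 'stool',
             'stool', 'throat_swab', 'throat_swab'],
})

# Patients who have a throat swab
with_throat = set(toy_ps.loc[toy_ps['site'] == 'throat_swab', 'subject_id'])

# All stool samples vs. stool samples from patients who also have a throat swab
stool = toy_ps[toy_ps['site'] == 'stool']
n_stool_total = len(stool)
n_stool_used = int(stool['subject_id'].isin(with_throat).sum())

print(n_stool_total, n_stool_used)
```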

### Number of unique patients with each sample type

In [7]:
patientsamples.query('site == @sites')\
    .groupby(['site', 'subject_id'])\
    .size()\
    .reset_index()\
    .groupby('site')\
    .size()

site
bal              102
gastric_fluid    147
stool             25
throat_swab      176
dtype: int64

In [8]:
patientsamples.query('site == @sites')\
    .groupby(['mbs_consolidated', 'subject_id'])\
    .size()\
    .reset_index()\
    .groupby(['mbs_consolidated'])\
    .size()

mbs_consolidated
Aspiration/Penetration     47
Normal                     57
nan                       118
dtype: int64

# Number of patients with each combination of sites

This is for Figure 2: the within-patient comparisons.

The comparisons I've made are: bal_throat, bal_gastric, gastric_throat, stool_throat, stool_stool.
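
One way to tabulate pairwise overlaps like these is a patient-by-site presence matrix. The sketch below is an alternative formulation on invented data, not the approach this notebook takes; `toy_ps`, `has_site`, and `pair_counts` are hypothetical names, and it ignores MBS status.

```python
from itertools import combinations

import pandas as pd

# Toy table with one row per (subject_id, site); values are invented.
toy_ps = pd.DataFrame({
    'subject_id': ['p1', 'p1', 'p1', 'p2', 'p2', 'p3', 'p4', 'p4'],
    'site': ['bal', 'gastric_fluid', 'throat_swab',
             'bal', 'throat_swab',
             'gastric_fluid',
             'stool', 'throat_swab'],
})

# Boolean patient-by-site matrix: True where the patient has that sample type
has_site = pd.crosstab(toy_ps['subject_id'], toy_ps['site']) > 0

# For every pair of sites, count patients who have both
pair_counts = {}
for site1, site2 in combinations(has_site.columns, 2):
    pair_counts[(site1, site2)] = int((has_site[site1] & has_site[site2]).sum())

for (site1, site2), n in pair_counts.items():
    print('{} + {}: {}'.format(site1, site2, n))
```

The same matrix also answers the "all three sites" and "at least two sites" questions via `has_site.sum(axis=1)`.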

In [9]:
for site1 in sites:
    for site2 in sites[sites.index(site1)+1:]:
        subjects = patientsamples\
                    .query('(site == @site1) | (site == @site2)')\
                    .groupby(['mbs_consolidated', 'subject_id'])\
                    .size()
        print('{} + {}'.format(site1, site2))
                
        # This line shows the number of within-patient comparisons,
        # just grouped by MBS status 
        print(subjects[subjects == 2].reset_index()
              .groupby(['mbs_consolidated']).size())

        ## Uncomment this line to see disaggregation by PPI status too
        ## (this only affects stool_throat comparisons)
        #print(subjects[subjects == 2].reset_index()
        #      .groupby(['ppi_consolidated', 'mbs_consolidated']).size())        
        
        # And this line is just straight-up the number of unique patients
        print(subjects[subjects == 2].reset_index()['subject_id'].unique().shape)
        print('')


stool + bal
Series([], dtype: int64)
(0,)

stool + gastric_fluid
Series([], dtype: int64)
(0,)

stool + throat_swab
mbs_consolidated
nan    20
dtype: int64
(20,)

bal + gastric_fluid
mbs_consolidated
Aspiration/Penetration    29
Normal                    28
nan                       32
dtype: int64
(89,)

bal + throat_swab
mbs_consolidated
Aspiration/Penetration    25
Normal                    23
nan                       25
dtype: int64
(73,)

gastric_fluid + throat_swab
mbs_consolidated
Aspiration/Penetration    32
Normal                    35
nan                       45
dtype: int64
(112,)



In [10]:
site1 = 'stool'
site2 = 'throat_swab'

subjects = patientsamples\
            .query('(site == @site1) | (site == @site2)')\
            .groupby(['mbs_consolidated', 'subject_id'])\
            .size()

subjects[subjects == 2].reset_index().shape

(20, 3)

## All three sites

In [11]:
aero_sites = ['bal', 'gastric_fluid', 'throat_swab']
subjects = patientsamples\
            .query('(site == @aero_sites)')\
            .groupby(['mbs_consolidated', 'subject_id'])\
            .size()

# This line shows the number of within-patient comparisons,
# just grouped by MBS status 
print(subjects[subjects == 3].reset_index()
      .groupby(['mbs_consolidated']).size())

# And this line is just straight-up the number of unique patients
print(subjects[subjects == 3].reset_index()['subject_id'].unique().shape)
print('')

mbs_consolidated
Aspiration/Penetration    23
Normal                    19
nan                       24
dtype: int64
(66,)



## Number of patients with at least two sites

In [12]:
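# Number of patients with an MBS result who have more than one site
# (the comparison is wrapped in its own parentheses so .sum() applies to it, not to print)
print((patientsamples.query('mbs_consolidated != "nan"').groupby('subject_id').size() > 1).sum())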

# All types of patients
(patientsamples\
    .query('site == @sites')\
    .groupby(['mbs_consolidated', 'subject_id'])\
    .size() > 1)\
    .reset_index()\
    .groupby('mbs_consolidated').sum()

88


Unnamed: 0_level_0,0
mbs_consolidated,Unnamed: 1_level_1
Aspiration/Penetration,40.0
Normal,48.0
,74.0


# PPI information

In [29]:
sra = pd.read_csv('../../final/patients/biosample_attributes.SUB3758953.txt', sep='\t')
sra.head()

Unnamed: 0,organism,env_biome,geo_loc_name,host,lat_lon,collection_date,sample_name,subject_id,sequencing_date,env_feature,env_material
0,human-associated metagenome,human-associated habitat,USA: Boston,Homo sapiens,missing,missing,04-074-1T,04-074-1,2014,oropharynx,oropharyngeal_swab
1,human-associated metagenome,human-associated habitat,USA: Boston,Homo sapiens,missing,missing,02-164-1G,02-164-1,2014,stomach,gastric_fluid
2,human-associated metagenome,human-associated habitat,USA: Boston,Homo sapiens,missing,missing,04-262-5T,04-262-5,2016,oropharynx,oropharyngeal_swab
3,human-associated metagenome,human-associated habitat,USA: Boston,Homo sapiens,missing,missing,04-074-1G,04-074-1,2014,stomach,gastric_fluid
4,human-associated metagenome,human-associated habitat,USA: Boston,Homo sapiens,missing,missing,04-074-1B,04-074-1,2014,lung,bronchoalveolar_lavage


In [30]:
sra['subject_id'].unique().shape

(222,)

In [32]:
subjects = sra['subject_id'].unique().tolist()

In [33]:
meta.query('subject_id == @subjects').drop_duplicates(subset='subject_id').groupby('ppi_consolidated').size()

ppi_consolidated
conflicting     8
nan            30
off            88
on             96
dtype: int64

In [34]:
#', '.join(meta.query('subject_id == @subjects').drop_duplicates(subset='subject_id').query('ppi_consolidated == "conflicting"')['subject_id'].tolist())

'029-6-F1, 030-5-F1, 03-107-4, 032-1-F1, 036-2-F1, 04-029-8, 045-9-F1, 13-117-4'

In [35]:
#', '.join(meta.query('subject_id == @subjects').drop_duplicates(subset='subject_id').query('ppi_consolidated == "nan"')['subject_id'].tolist())

'01-112-7, 01-164-7, 01-165-8, 01-173-4, 01-200-1, 01-209-2, 01-215-7, 01-230-9, 01-247-3, 01-263-4, 01-270-3, 01-297-4, 01-299-7, 03-076-8, 03-138-9, 03-146-6, 03-149-1, 03-150-8, 03-153-7, 03-156-7, 03-178-6, 03-181-7, 03-182-8, 03-199-7, 03-225-1, 03-226-4, 03-272-3, 04-235-8, 04-269-0, 14-233-0'

In [36]:
with open('../../final/patients/all_patients_in_sra.txt', 'w') as f:
    f.write('\n'.join(sra['subject_id'].tolist()) + '\n')

# Number with swallow study

Of the 222 patients, 104 had swallow studies.

Of the 104 that had swallow studies, 47 were aspirators (aspiration/penetration) and 57 had normal swallow function.

 

In [42]:
subjects = sra['subject_id'].unique().tolist()
print(len(subjects))

222


In [46]:
meta.query('subject_id == @subjects').drop_duplicates(subset=['subject_id', 'mbs_consolidated']).groupby('mbs_consolidated').size()

mbs_consolidated
Aspiration/Penetration     47
Normal                     57
nan                       118
dtype: int64
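
The summary figures for this section follow directly from these counts. A small sketch, hard-coding the three values printed above into a hypothetical `counts` Series rather than recomputing them from `meta`:

```python
import pandas as pd

# Counts copied from the output above; 'nan' means no swallow study result
counts = pd.Series({'Aspiration/Penetration': 47, 'Normal': 57, 'nan': 118})

n_total = int(counts.sum())               # all patients in the SRA submission
n_with_mbs = int(counts.drop('nan').sum())  # patients with a swallow study

print('{} patients, {} with swallow studies'.format(n_total, n_with_mbs))
```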