This notebook is my first attempt at cleaning up the Goyal metadata.

I got this metadata from a few email interactions with Andrew Yeh <yeha@upmc.edu>.

In [91]:
import pandas as pd
import numpy as np

In [67]:
fexcel = '../../data/raw/goyal2018/FMT_study_log_23Sept2016_edited.xlsx'
fprjna = '../../data/raw/goyal2018/goyal2018.PRJNA380944.txt'

# Out file
fout = '../../data/clean/goyal2018.metadata.txt'

outcomes = pd.read_excel(fexcel, sheet_name="PUCAI-PCDAI", skiprows=2)
outcomes = outcomes.rename(columns={'Unnamed: 5': 'Notes'})

Note: there are lots of other types of metadata available like medications, adverse events, calprotectin, etc.

In [68]:
outcomes.head()

Unnamed: 0,Patient # dds 1/26/15,Screen,Week 1,Month 1,Month 6,Notes
0,001 JFC,55.0,,,,
1,002 A-Z,50.0,,,,
2,003 SMA,70.0,,,,
3,004 MPM,12.5,0.0,15.0,5.0,
4,005 J-C,55.0,55.0,35.0,35.0,


This metadata file includes patients who weren't in the study (which is encoded through colors in the cells...)

Let's get the patient IDs out of the ENA metadata.

### Clean up ENA metadata

In [69]:
meta = pd.read_csv(fprjna, sep='\t')
meta.head()

Unnamed: 0,study_accession,secondary_study_accession,sample_accession,secondary_sample_accession,experiment_accession,run_accession,submission_accession,tax_id,scientific_name,instrument_platform,...,sra_ftp,sra_aspera,sra_galaxy,cram_index_ftp,cram_index_aspera,cram_index_galaxy,sample_alias,broker_name,nominal_sdev,first_created
0,PRJNA380944,SRP102742,SAMN06652281,SRS2088440,SRX2691263,SRR5396454,SRA550516,408170,human gut metagenome,ILLUMINA,...,ftp.sra.ebi.ac.uk/vol1/srr/SRR539/004/SRR5396454,fasp.sra.ebi.ac.uk:/vol1/srr/SRR539/004/SRR539...,ftp.sra.ebi.ac.uk/vol1/srr/SRR539/004/SRR5396454,,,,FMT.03.042.W,,,2017-04-01
1,PRJNA380944,SRP102742,SAMN06652280,SRS2088438,SRX2691264,SRR5396455,SRA550516,408170,human gut metagenome,ILLUMINA,...,ftp.sra.ebi.ac.uk/vol1/srr/SRR539/005/SRR5396455,fasp.sra.ebi.ac.uk:/vol1/srr/SRR539/005/SRR539...,ftp.sra.ebi.ac.uk/vol1/srr/SRR539/005/SRR5396455,,,,FMT.03.042.P,,,2017-04-01
2,PRJNA380944,SRP102742,SAMN06652279,SRS2088439,SRX2691265,SRR5396456,SRA550516,408170,human gut metagenome,ILLUMINA,...,ftp.sra.ebi.ac.uk/vol1/srr/SRR539/006/SRR5396456,fasp.sra.ebi.ac.uk:/vol1/srr/SRR539/006/SRR539...,ftp.sra.ebi.ac.uk/vol1/srr/SRR539/006/SRR5396456,,,,FMT.03.042.D,,,2017-04-01
3,PRJNA380944,SRP102742,SAMN06652278,SRS2088441,SRX2691266,SRR5396457,SRA550516,408170,human gut metagenome,ILLUMINA,...,ftp.sra.ebi.ac.uk/vol1/srr/SRR539/007/SRR5396457,fasp.sra.ebi.ac.uk:/vol1/srr/SRR539/007/SRR539...,ftp.sra.ebi.ac.uk/vol1/srr/SRR539/007/SRR5396457,,,,FMT.03.040.M,,,2017-04-01
4,PRJNA380944,SRP102742,SAMN06652277,SRS2088442,SRX2691267,SRR5396458,SRA550516,408170,human gut metagenome,ILLUMINA,...,ftp.sra.ebi.ac.uk/vol1/srr/SRR539/008/SRR5396458,fasp.sra.ebi.ac.uk:/vol1/srr/SRR539/008/SRR539...,ftp.sra.ebi.ac.uk/vol1/srr/SRR539/008/SRR5396458,,,,FMT.03.040.W,,,2017-04-01


In [70]:
smpls = meta['sample_alias'].str.split('.', expand=True)
smpls = smpls.rename(columns={0: 'FMT', 
                              1: 'sequencing_run', 
                              2: 'patient_id', 
                              3: 'sample_type'})
smpls.head()

Unnamed: 0,FMT,sequencing_run,patient_id,sample_type
0,FMT,3,42,W
1,FMT,3,42,P
2,FMT,3,42,D
3,FMT,3,40,M
4,FMT,3,40,W


In [71]:
meta = pd.merge(meta, smpls, left_index=True, right_index=True)

Let's parse the sample types. From Andrew's email:

> D refers to the donor sample. When there is a 1 or 2 that means we sent the donor sample for repeat sequencing (most likely because it failed). P is the patient sample before FMT. 4M, 6M, 9M refer to the 4 month, 6 month, and 9 month sample. W is the 1 week sample. WX, P1, P2, PA, PB, 6MX are just samples that underwent repeat sequencing because they failed.

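Not sure what plain `M` is, but the paper includes a 1-month sample, so let's guess that it's that. `PX` isn't in the email's list either; the mapping below treats it as a repeat pre-FMT sample, like `P1`/`P2`. (Also note: the paper doesn't actually look at the 4-month or 9-month samples, I wonder why...)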

In [72]:
meta['sample_type'].unique()

array(['W', 'P', 'D', 'M', '6M', '4M', 'PB', 'D2', 'D1', 'PX', '6MX', 'P1',
       'WX', 'P2', 'PA', '9M'], dtype=object)

In [73]:
time_dict = {'W': '1_week', 
             'P': 'pre_fmt',
             'D': 'donor', 
             'M': '1_month',
             '4M': '4_month',
             '6M': '6_month',
             'PB': 'pre_fmt',
             'D1': 'donor',
             'D2': 'donor',
             'PX': 'pre_fmt', 
             '6MX': '6_month', 
             'P1': 'pre_fmt', 
             'WX': '1_week',
             'P2': 'pre_fmt',
             'PA': 'pre_fmt', 
             '9M': '9_month'}

meta['time_point'] = meta['sample_type'].apply(lambda x: time_dict[x])
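
Since the repeat-sequencing codes get collapsed into the same time point as the originals, it may help to keep a flag recording which samples were re-runs. A minimal sketch on toy data, assuming that any code other than a bare base code (`W`, `P`, `D`, `M`, `4M`, `6M`, `9M`) is a repeat; `base_codes` and `is_resequenced` are names made up here, and treating `PX` as a repeat is my guess rather than something the email states:

```python
import pandas as pd

# Toy stand-in for meta['sample_type'], using codes seen in the ENA metadata
sample_types = pd.Series(['W', 'P', 'D', 'M', '6M', 'PB', 'D2', '6MX', 'WX', '9M'])

# Base codes from Andrew's email, plus 'M' (guessed to be the 1-month sample)
base_codes = {'W', 'P', 'D', 'M', '4M', '6M', '9M'}

# Assumption: anything that is not a bare base code is a repeat-sequencing run
is_resequenced = ~sample_types.isin(base_codes)

print(pd.DataFrame({'sample_type': sample_types, 'is_resequenced': is_resequenced}))
```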

In [74]:
meta.columns

Index([u'study_accession', u'secondary_study_accession', u'sample_accession',
       u'secondary_sample_accession', u'experiment_accession',
       u'run_accession', u'submission_accession', u'tax_id',
       u'scientific_name', u'instrument_platform', u'instrument_model',
       u'library_name', u'library_layout', u'nominal_length',
       u'library_strategy', u'library_source', u'library_selection',
       u'read_count', u'base_count', u'center_name', u'first_public',
       u'last_updated', u'experiment_title', u'study_title', u'study_alias',
       u'experiment_alias', u'run_alias', u'fastq_bytes', u'fastq_md5',
       u'fastq_ftp', u'fastq_aspera', u'fastq_galaxy', u'submitted_bytes',
       u'submitted_md5', u'submitted_ftp', u'submitted_aspera',
       u'submitted_galaxy', u'submitted_format', u'sra_bytes', u'sra_md5',
       u'sra_ftp', u'sra_aspera', u'sra_galaxy', u'cram_index_ftp',
       u'cram_index_aspera', u'cram_index_galaxy', u'sample_alias',
       u'broker_name', u'n

In [75]:
(meta
 .groupby(['patient_id', 'time_point'])
 .size()
 .reset_index(name='n_samples')
 .sort_values(by='n_samples', ascending=False)
).head(10)


Unnamed: 0,patient_id,time_point,n_samples
14,7,pre_fmt,2
52,29,donor,2
31,22,1_week,2
7,5,6_month,2
19,10,pre_fmt,2
43,24,pre_fmt,2
42,24,donor,2
41,24,6_month,2
66,32,6_month,1
73,33,pre_fmt,1


Ok, so some of the patients who have duplicate samples sent for sequencing also still have the original. When I look at the data, I should see which sample looks better and only use that one.
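
To pull out exactly which runs are involved, something like the following could work. It is only a sketch on toy data: the run IDs are made up, and the `P`/`PB` pairing for patient 007 is illustrative rather than read from the real table. On the real `meta`, the `read_count` column might be one way to compare the duplicates.

```python
import pandas as pd

# Toy stand-in for the relevant columns of `meta`
toy = pd.DataFrame({
    'run_accession': ['run_a', 'run_b', 'run_c', 'run_d'],  # made-up IDs
    'patient_id': ['007', '007', '007', '032'],
    'sample_type': ['P', 'PB', 'W', '6M'],
    'time_point': ['pre_fmt', 'pre_fmt', '1_week', '6_month'],
})

# keep=False marks every member of a duplicated (patient, time point) group,
# not just the second and later occurrences
is_dup = toy.duplicated(subset=['patient_id', 'time_point'], keep=False)
dups = toy[is_dup].sort_values(['patient_id', 'time_point', 'sample_type'])

print(dups)
```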

### Clean up clinical data

Specifically, extract the patient IDs so I can merge on that with the ENA metadata.

In [76]:
outcomes['patient_id'] = outcomes['Patient # dds 1/26/15'].str.split(expand=True)[0]

Add responder/remission flags, based on the clinical outcomes.

The paper says:

> Response was defined as a decrease of 15 points in PUCAI or 12.5 points in PCDAI at 1 month, as used in previous studies. Remission was defined as normalization of previously elevated fecal biomarkers and a PCDAI/PUCAI of 0 points. If subjects required escalation of medical therapy prior to 1-month evaluation, they were considered to be nonresponders. Subsequently, any escalation of medical therapy was considered a loss of response.

I don't really know how to figure out who required escalation of medical therapy, but I'm assuming that would be reflected in the PCDAI/PUCAI scores. Anyway, for now let's just go from the scores.

Note: I think that the numbers highlighted in red are the PCDAI scores and the black ones are PUCAI. There aren't very many red ones, so let's just go with everything being PUCAI...
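
If the index type ever gets recorded per patient, the paper's two thresholds could be applied separately. A rough sketch with made-up scores: `index_type` is a hypothetical column (the spreadsheet apparently encodes this only through font color), and the threshold values are the ones quoted from the paper above.

```python
import pandas as pd

# Response thresholds quoted from the paper (points of decrease at 1 month)
thresholds = {'PUCAI': 15.0, 'PCDAI': 12.5}

# Made-up scores; 'index_type' is a hypothetical column, not in the spreadsheet
toy = pd.DataFrame({
    'Screen': [50.0, 30.0, 27.5],
    'Month 1': [30.0, 17.5, 20.0],
    'index_type': ['PUCAI', 'PCDAI', 'PCDAI'],
})

# Compare each patient's decrease against the threshold for their own index
decrease = toy['Screen'] - toy['Month 1']
toy['response_m1'] = decrease >= toy['index_type'].map(thresholds)

print(toy)
```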

In [92]:
# Row 23: move the 'Month 6' entry into 'Notes' and blank out the score
outcomes.loc[23, 'Notes'] = outcomes.loc[23, 'Month 6']
outcomes.loc[23, 'Month 6'] = np.nan

We'll define remission as a score of 0 (ignoring the paper's fecal biomarker criterion), and response as _either_ remission OR a decrease of at least 15 points from the screening score.

In [102]:
outcomes['remission_m1'] = outcomes['Month 1'] == 0
outcomes['response_m1'] = ((outcomes['Screen'] - outcomes['Month 1']) >= 15) | outcomes['remission_m1']

outcomes['remission_m6'] = outcomes['Month 6'] == 0
outcomes['response_m6'] = ((outcomes['Screen'] - outcomes['Month 6']) >= 15) | outcomes['remission_m6']

outcomes.sort_values(by='Month 6').head(10)

Unnamed: 0,Patient # dds 1/26/15,Screen,Week 1,Month 1,Month 6,Notes,patient_id,remission_m1,response_m1,remission_m6,response_m6
37,040 RJ,27.5,10.0,10.0,0,,40,False,True,True,True
34,037 TSH,10.0,5.0,5.0,0,,37,False,False,True,True
32,034 RNH,25.0,5.0,5.0,0,,34,False,True,True,True
10,011 A-E,0.0,5.0,0.0,0,,11,True,True,True,True
28,030 CJH,30.0,32.5,17.5,5,,30,False,False,False,True
21,022 M-M,50.0,15.0,15.0,5,Week 1 note: activity affected by flu,22,False,True,False,True
3,004 MPM,12.5,0.0,15.0,5,,4,False,False,False,False
27,029 LPS,55.0,10.0,35.0,5,,29,False,True,False,True
26,028 AER,30.0,50.0,15.0,10,,28,False,True,False,True
15,016 M-B,30.0,5.0,5.0,15,,16,False,True,False,True


Hm ok, not super sure about patient 004 (screen 12.5, week 1 0, month 1 15, month 6 5), i.e. not sure whether the authors considered this person a responder or not. I don't expect these data to be that useful anyway, so let's not bother the authors again...
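
One more caveat with these flags: comparisons against `NaN` evaluate to `False` in pandas, so patients with no follow-up score (e.g. 001, who only has a screening value) come out as non-responders rather than as missing. A small sketch of how the flag could be masked instead, using rows copied from the outputs above; `has_score` is a name made up here:

```python
import numpy as np
import pandas as pd

# Scores copied from the first rows of the PUCAI-PCDAI sheet
toy = pd.DataFrame({
    'patient_id': ['001', '004', '005'],
    'Screen': [55.0, 12.5, 55.0],
    'Month 1': [np.nan, 15.0, 35.0],
})

remission = toy['Month 1'] == 0
response = ((toy['Screen'] - toy['Month 1']) >= 15) | remission

# Keep the flag only where a Month 1 score exists; otherwise mark it missing
has_score = toy['Month 1'].notna()
toy['response_m1'] = response.astype(object).where(has_score, np.nan)

print(toy)
```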

### Merge clinical and ENA metadata

In [103]:
# Inner join on the shared patient ID; samples without clinical data are dropped
fullmeta = pd.merge(meta, outcomes, on='patient_id')
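
Because the merge is an inner join, any sample whose patient ID has no match in the clinical sheet silently disappears. A quick way to check this is a left join with `indicator=True`. A sketch on toy data, assuming both ID columns are zero-padded strings; the alias `FMT.00.999.P` is made up to force a non-match:

```python
import pandas as pd

# Toy stand-ins for `meta` and `outcomes`, following the formats shown above
toy_meta = pd.DataFrame({
    'sample_alias': ['FMT.03.040.M', 'FMT.03.040.W', 'FMT.00.999.P'],  # last one is made up
})
toy_meta['patient_id'] = toy_meta['sample_alias'].str.split('.', expand=True)[2]

toy_outcomes = pd.DataFrame({'Patient # dds 1/26/15': ['040 RJ', '004 MPM']})
toy_outcomes['patient_id'] = toy_outcomes['Patient # dds 1/26/15'].str.split(expand=True)[0]

# The left join keeps every sample; the '_merge' column says whether it matched
check = pd.merge(toy_meta, toy_outcomes, on='patient_id', how='left', indicator=True)
unmatched = check.loc[check['_merge'] == 'left_only', 'sample_alias'].tolist()

print(unmatched)
```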

In [104]:
fullmeta.to_csv(fout, sep='\t', index=False)