This program reads in and merges multiple output files from running the TIME pairwise Dynamic Time Warping (DTW) Distance workflow (Workflow 5b) on the CF data. Dynamic Time Warping measures the similarity between longitudinal data series, and the TIME version of the distance ranges from 0 (most similar) to 1 (most different). For more on the algorithm, see the [relevant research paper here](https://www.frontiersin.org/articles/10.3389/fmicb.2018.00036/full).

Workflow 5b allows for calculations across all samples as well as by condition. Conditions are part of the metadata entered into the application. Our conditions were 'Exacerbation' and 'Stable', which is how they appear in the output file names. For my analysis, I used the following settings: a taxonomic level of 'Genus', the default DTW constraint of 2 and a 0.1 cutoff for ignoring rare taxa because my input data had already been appropriately filtered. The input data was not rarefied or normalised in the initial upload.

Although the program is written for use with the CF data, I have included some images of what the output should look like for those who do not have access to it. To run this code yourself using example data, see Create_TDTW_all_example. This program and Create_TDTW_all_example generate files which are used as input in my box plot programs, DTW_All_boxplots and DTW_All_boxplots_example. I have also included in this repository an altered version of the output from this program called TDTW_all_filtered2.csv, which contains randomly generated values, so that users can run DTW_All_boxplots as well as DTW_All_boxplots_example. If you would like to see the code I used to generate TDTW_all_filtered2.csv, please email me at vtalbot@lesley.edu or vrtalbot@yahoo.com.
<br>

For DTW_All_boxplots_example, users can do everything themselves, from running Workflow 5b on their repeated antibiotic perturbation example analysis to creating the merged file in Create_TDTW_all_example to plotting the data. [Click here to run this analysis](https://web.rniapps.net/time/index.php) now. To learn more about the antibiotic data, click [here to view the antibiotic research paper](https://www.pnas.org/content/108/Supplement_1/4554.long).

<br>
The file generated by Create_TDTW_all_example, TDTW_all_example.csv, also serves as input for DTW_boxplots_by_status.

Workflow 5b automatically names the csv files it outputs in the following format:
<br>

(participant id)\_(taxonomic level)\_mdtw\_(cutoff for ignoring rare taxa)\_(constraint)\_(condition).csv

For example, one of my output files was named '711_Genus_mdtw_0.1_2_Stable.csv', for the genus-level analysis of only the samples taken from participant 711 while in the Stable condition.
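
The naming pattern can be expressed as a pair of small helpers. This is only a sketch: `build_filename`, `parse_filename` and `FILENAME_PATTERN` are hypothetical names that are not used elsewhere in this notebook, and the default arguments simply mirror the settings described above.

```python
import re

def build_filename(participant_id, condition, level='Genus', cutoff=0.1, constraint=2):
    """Return the expected Workflow 5b output file name (hypothetical helper)."""
    return '{}_{}_mdtw_{}_{}_{}.csv'.format(participant_id, level, cutoff, constraint, condition)

# one named group per field of the file name; assumes no field contains an underscore
FILENAME_PATTERN = re.compile(
    r'^(?P<participant_id>[^_]+)_(?P<level>[^_]+)_mdtw_'
    r'(?P<cutoff>[^_]+)_(?P<constraint>[^_]+)_(?P<condition>[^_]+)\.csv$'
)

def parse_filename(name):
    """Split a file name back into its fields, returned as strings."""
    match = FILENAME_PATTERN.match(name)
    if match is None:
        raise ValueError('unexpected file name: {}'.format(name))
    return match.groupdict()

print(build_filename(711, 'Stable'))
print(parse_filename('711_Genus_mdtw_0.1_2_Stable.csv'))
```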

The code may be easily modified to read in and merge multiple files which differ in name only by certain strings or variables.
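
As a sketch of such a modification, the loader below takes the varying parts of the file name as arguments and skips any file that is missing. `load_outputs` and its parameters are hypothetical names introduced for illustration, and the default template assumes the file name format and settings described above.

```python
import os
import pandas as pd

def load_outputs(ids, conditions, directory='.', template='{}_Genus_mdtw_0.1_2_{}.csv'):
    """Read every (condition, participant) file that exists into nested dictionaries."""
    tables = {}
    for condition in conditions:
        tables[condition] = {}
        for participant_id in ids:
            path = os.path.join(directory, template.format(participant_id, condition))
            if os.path.exists(path):
                tables[condition][participant_id] = pd.read_csv(path)
            else:
                # report rather than fail, so one missing file does not stop the run
                print('missing file, skipped: {}'.format(path))
    return tables
```

It could be called as `load_outputs([708, 711], ['All', 'Stable', 'Exacerbation'])`; changing `template` adapts it to files that differ in other parts of the name.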

In [2]:
#import necessary libraries
import pandas as pd

In [1]:
#make a list of the IDs of participants whose data was analyzed in the workflow
IDs = [708, 711, 761, 762, 764, 768]

In [3]:
#read the output files from the different types of samples into dictionaries
#first for the pairwise distances across all the samples combined
csv = {i: pd.read_csv('{}_Genus_mdtw_0.1_2_All.csv'.format(i)) for i in IDs }
#then for each of the two conditions individually
csvS = {i: pd.read_csv('{}_Genus_mdtw_0.1_2_Stable.csv'.format(i)) for i in IDs}
csvE = {i: pd.read_csv('{}_Genus_mdtw_0.1_2_Exacerbation.csv'.format(i)) for i in IDs}

In [4]:
#create a dictionary for the dictionaries, and a list of the keys
#I made them lower case because the words will become part of a complicated string, and I found them easier to read that way
#they can easily be changed back with str.title()
conditions={'all': csv, 'stable':csvS, 'exacerbation':csvE}
keylist=list(conditions.keys())

The tables start out with three columns: Taxa1 for the first bacterium in a pair, Taxa2 for the second, and TIME_DTW_Distance for the DTW distance between them.

In [5]:
#combine the two bacteria name columns, and edit distance column to specify which samples it gives the pairwise distance for
#the bacteria in our data have underscores before their names, so we join them without any additional strings as separators
for i in IDs:
    for j in keylist:
        df = conditions[j][i]
        #concatenate the two name columns element-wise
        df['Taxa1Taxa2'] = df['Taxa1'] + df['Taxa2']
        #use the columns keyword, since passing the axis positionally fails in recent pandas versions
        df = df.drop(columns=['Taxa1', 'Taxa2'])
        conditions[j][i] = df.rename(columns={'TIME_DTW_Distance': 'TDTW_{}_{}'.format(i, j)})
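
To see what this step does without the CF data, the sketch below applies the same reshaping to a small made-up table. The taxa pairs, the distance values and the label `TDTW_000_all` are invented for illustration.

```python
import pandas as pd

# toy table in the three-column layout described above (values are made up)
toy = pd.DataFrame({
    'Taxa1': ['_Actinomyces', '_Actinomyces'],
    'Taxa2': ['_Actinomyces', '_Rothia'],
    'TIME_DTW_Distance': [0.0, 0.4],
})

# join the two name columns, drop the originals, and relabel the distance column
toy['Taxa1Taxa2'] = toy['Taxa1'] + toy['Taxa2']
toy = toy.drop(columns=['Taxa1', 'Taxa2'])
toy = toy.rename(columns={'TIME_DTW_Distance': 'TDTW_000_all'})
print(toy)
```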

Check that it ran properly by looking at one of the dataframes for each condition, if desired.

In [None]:
csv[764].head()

In [None]:
csvS[768].head()

In [None]:
csvE[708].head()

Samples of what a row and the columns should look like for the data frames, using the trivial example of the distance between Actinomyces and itself: 
<img src='https://i.imgur.com/fUL01Rp.png' style='height:80px'>
<br>

<img src='https://imgur.com/5NDGBVQ.png' style='height:70px'>

<br>

<img src='https://imgur.com/9Rez6Og.png' style='height:80px'>

In [6]:
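#merge the dataframes on the combined name column, starting from the first one
#an outer merge keeps bacteria pairs that are missing from some of the files
result = None
for i in IDs:
    for j in keylist:
        if result is None:
            result = conditions[j][i]
        else:
            result = pd.merge(result, conditions[j][i], on="Taxa1Taxa2", how="outer")
result.head()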
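
The effect of the outer merge can be checked on toy data. In this sketch the two small tables and their values are made up, and `functools.reduce` is used as an alternative to an explicit loop, which is not how the cell above is written.

```python
from functools import reduce
import pandas as pd

# two toy tables that share only one bacteria pair (values are made up)
a = pd.DataFrame({'Taxa1Taxa2': ['_A_A', '_A_B'], 'TDTW_1_all': [0.0, 0.3]})
b = pd.DataFrame({'Taxa1Taxa2': ['_A_A', '_B_C'], 'TDTW_2_all': [0.0, 0.7]})

# fold the list of tables into one, merging on the shared name column
merged = reduce(
    lambda left, right: pd.merge(left, right, on='Taxa1Taxa2', how='outer'),
    [a, b],
)
print(merged)
```

Pairs that appear in only one table are kept, with NaN in the columns of the table that lacks them.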

Sample of what a row and the first few columns should look like at this point, using the trivial example of the distance between Actinomyces and itself: 
<img src='https://imgur.com/LkV5PcG.png' style='height:55px'>
The columns need reordering, and we want to transpose the table.

In [7]:
#sort the columns and transpose, making the columns the bacteria pairs 
#the index becomes the string identifying the participant and their condition
result = result.reindex(sorted(result.columns), axis=1)
result=result.set_index('Taxa1Taxa2').T
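
The same sort and transpose can be tried on a toy table. The column labels and values below are made up, and `wide` is a hypothetical variable name.

```python
import pandas as pd

# toy merged table with its columns deliberately out of order (values are made up)
merged = pd.DataFrame({
    'Taxa1Taxa2': ['_A_A', '_A_B'],
    'TDTW_2_all': [0.0, 0.6],
    'TDTW_1_stable': [0.0, 0.2],
    'TDTW_1_all': [0.0, 0.3],
})

# sort the column labels as strings, then make the bacteria pairs the columns
merged = merged.reindex(sorted(merged.columns), axis=1)
wide = merged.set_index('Taxa1Taxa2').T
print(wide)
```

Because `sorted` compares the labels as strings, the order matches numeric participant order only when the IDs have the same number of digits, as they do in this notebook.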


In [None]:
#check that it ran properly by looking at the first few rows
result.head()

In [9]:
#save file, keeping the index because it is now an identifying string 
result.to_csv('TDTW_all_filtered.csv', index=True)
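
When a file saved this way is read back in, the identifying strings can be restored as the index with `index_col=0`. The sketch below shows the round trip on a made-up table written to a temporary directory; the file name `TDTW_toy.csv` and the values are invented for illustration.

```python
import os
import tempfile
import pandas as pd

# stand-in for the merged table (values are made up)
table = pd.DataFrame(
    {'_A_A': [0.0, 0.0], '_A_B': [0.3, 0.6]},
    index=['TDTW_1_all', 'TDTW_2_all'],
)

path = os.path.join(tempfile.mkdtemp(), 'TDTW_toy.csv')
table.to_csv(path, index=True)

# index_col=0 turns the first column back into the row labels
restored = pd.read_csv(path, index_col=0)
print(restored)
```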