### Gabi's env_res project
This notebook documents the processing of the data for Gabi's MSc project. The data were collected some time ago by Maren.
The samples were supposed to be used to characterise the zooxs present before, during and after a bleaching episode, but the bleaching never happened and so the sample collection was somewhat abandoned. The samples will now be repurposed and used by Gabi for her MSc project, the objective of which will be to compare the diversity of ITS2 sequences found in the corals to that found in the environmental samples.

Originally, we had planned to look for the ITS2 profiles found in the coral samples in the environmental samples. I spent a lot of time writing code to search for the ITS2 type profiles, but in the end this approach was abandoned because we did not have a big enough selection of corals to really capture the full diversity of DIV abundances. As such, we weren't finding many of the profiles in the environmental samples.

Instead, we will now aim for a much simpler approach that looks purely at the diversity of ITS2 sequences in the coral samples and compares this to the diversity of sequences found in the environmental samples.

As mentioned above, there was originally a time component to the samples collected. You will see in the info file that there are different collection dates. We will not be using any of these time replicate samples; we will only use the samples that were collected at the same time as the first batch of corals. You will also see that there was an additional set of corals that were collected at a second date and did not have a plot number associated with them. These will also be discarded.

There was also a spatial component to the original data collection: there were 5 plots, and the aim was to see whether environmental reservoirs were more specific to coral samples between and among the plots. However, due to the minimal number of environmental samples, we are not able to take this aspect of the study into account, so the plot factor will be ignored.

That leaves us with a total of about 50 corals (although several sponge samples need to be removed) and associated mucus samples, 5 water samples, 5 close sediment samples (sampled at the base of the coral shelf), 5 far sediment samples (sampled 2 m from the base of the coral shelf), and 10 turf algae samples.

#### Sections
* [Clean up data input](#Clean-up-data-input)
* [Python code processing](#Python-code-processing)

#### Clean up data input

One of the first things to deal with is the info file that contains the metadata for the samples. This was in the form of an Excel file that has some formatting issues when exported as .csv (as always). We will clean that up below. We can't type the ^M (carriage return) special character here, so I will just say that we ran ```$ sed -r 's/[,]+^M$//g' info_290819.csv > info_290819_fix.csv```

We have the SP outputs in this directory, the info.csv that needs cleaning up and the code for generating the figs.

In [2]:
!ls

131_142.DIVs.absolute.txt  131_142.DIVs.relative.txt  gabi_env_res.py
131_142.DIVs.fasta	   gabi_env_res.ipynb	      info_290819.csv


In [1]:
!head info_290819.csv

Sample no.,coral genus,coral species,environ type,plot #,depth (m),date collected,Sample ID,Sample Name,Sequence file,no. of reads,,,,,,,,,,,,,,,,,,,,,,,,
coral 1,Acropora,plate,NA,1,4.4,22.08.2016,Coral_ITS_001,P5_A01,P5-A01_P5-A01_N701-S502_L002_R2_001.fastq.gz,"281,819",,,,,,,,,,,,,,,,,,,,,,,,
coral 2 ,Stylophora,pistillata,NA,1,4.4,22.08.2016,Coral_ITS_002,P5_A02,P5-A02_P5-A02_N702-S502_L002_R2_001.fastq.gz,"368,805",,,,,,,,,,,,,,,,,,,,,,,,
coral 3,Xenia,,NA,1,4.4,22.08.2016,Coral_ITS_003,P5_A03,P5-A03_P5-A03_N703-S502_L002_R2_001.fastq.gz,"375,166",,,,,,,,,,,,,,,,,,,,,,,,
coral 4,Platygyra ,lamellina,NA,1,4.4,22.08.2016,Coral_ITS_004,P5_A04,P5-A04_P5-A04_N704-S502_L002_R2_001.fastq.gz,"344,712",,,,,,,,,,,,,,,,,,,,,,,,
coral 5,Acropora,"pharaonis, cytherea, ",NA,1,4.4,22.08.2016,Coral_ITS_005,P5_A05,P5-A05_P5-A05_N705-S502_L002_R2_001.fastq.gz,"717,521",,,,,,,,,,,,,,,,,,,,,,,,
coral 6,Galaxea,fascicularis,NA,1,4.4,22.08.2016,Coral_ITS_006,P5_A06,P5-A06_P5-A06_N706-S502_

In [2]:
! head info_290819_fix.csv

Sample no.,coral genus,coral species,environ type,plot #,depth (m),date collected,Sample ID,Sample Name,Sequence file,no. of reads
coral 1,Acropora,plate,NA,1,4.4,22.08.2016,Coral_ITS_001,P5_A01,P5-A01_P5-A01_N701-S502_L002_R2_001.fastq.gz,"281,819"
coral 2 ,Stylophora,pistillata,NA,1,4.4,22.08.2016,Coral_ITS_002,P5_A02,P5-A02_P5-A02_N702-S502_L002_R2_001.fastq.gz,"368,805"
coral 3,Xenia,,NA,1,4.4,22.08.2016,Coral_ITS_003,P5_A03,P5-A03_P5-A03_N703-S502_L002_R2_001.fastq.gz,"375,166"
coral 4,Platygyra ,lamellina,NA,1,4.4,22.08.2016,Coral_ITS_004,P5_A04,P5-A04_P5-A04_N704-S502_L002_R2_001.fastq.gz,"344,712"
coral 5,Acropora,"pharaonis, cytherea, ",NA,1,4.4,22.08.2016,Coral_ITS_005,P5_A05,P5-A05_P5-A05_N705-S502_L002_R2_001.fastq.gz,"717,521"
coral 6,Galaxea,fascicularis,NA,1,4.4,22.08.2016,Coral_ITS_006,P5_A06,P5-A06_P5-A06_N706-S502_L002_R2_001.fastq.gz,"498,419"
coral 7,Porites,lobata,NA,1,4.4,22.08.2016,Coral_ITS_007,P5_A07,P5-A07_P5-A07_N707-S502_L002_R2_001.fastq.gz,"424,227"

In [4]:
! sed '/^\s*$/d' info_290819_fix.csv > info_290819_fix_fix.csv

In [5]:
! head info_290819_fix_fix.csv

Sample no.,coral genus,coral species,environ type,plot #,depth (m),date collected,Sample ID,Sample Name,Sequence file,no. of reads
coral 1,Acropora,plate,NA,1,4.4,22.08.2016,Coral_ITS_001,P5_A01,P5-A01_P5-A01_N701-S502_L002_R2_001.fastq.gz,"281,819"
coral 2 ,Stylophora,pistillata,NA,1,4.4,22.08.2016,Coral_ITS_002,P5_A02,P5-A02_P5-A02_N702-S502_L002_R2_001.fastq.gz,"368,805"
coral 3,Xenia,,NA,1,4.4,22.08.2016,Coral_ITS_003,P5_A03,P5-A03_P5-A03_N703-S502_L002_R2_001.fastq.gz,"375,166"
coral 4,Platygyra ,lamellina,NA,1,4.4,22.08.2016,Coral_ITS_004,P5_A04,P5-A04_P5-A04_N704-S502_L002_R2_001.fastq.gz,"344,712"
coral 5,Acropora,"pharaonis, cytherea, ",NA,1,4.4,22.08.2016,Coral_ITS_005,P5_A05,P5-A05_P5-A05_N705-S502_L002_R2_001.fastq.gz,"717,521"
coral 6,Galaxea,fascicularis,NA,1,4.4,22.08.2016,Coral_ITS_006,P5_A06,P5-A06_P5-A06_N706-S502_L002_R2_001.fastq.gz,"498,419"
coral 7,Porites,lobata,NA,1,4.4,22.08.2016,Coral_ITS_007,P5_A07,P5-A07_P5-A07_N707-S502_L002_R2_001.fastq.gz,"424,227"

In [6]:
! tail info_290819_fix_fix.csv | cat -A

17TAA,NA,NA,turf algae,1,4.4,02.10.2016,Turf_ITS_072,P7_G08,P7-G08_P7-G08_N710-S521_L002_R2_001.fastq.gz,"79,565"$
17TAB,NA,NA,turf algae,2,4,02.10.2016,Turf_ITS_073,P7_G09,P7-G09_P7-G09_N711-S521_L002_R2_001.fastq.gz,"84,231"$
18TAA,NA,NA,turf algae,2,4,02.10.2016,Turf_ITS_074,P7_G10,P7-G10_P7-G10_N712-S521_L002_R2_001.fastq.gz,"76,205"$
18TAB,NA,NA,turf algae,3,4,02.10.2016,Turf_ITS_075,P7_G11,P7-G11_P7-G11_N714-S521_L002_R2_001.fastq.gz,"96,295"$
18TAB.2,NA,NA,turf algae,3,4,02.10.2016,Turf_ITS_076,P7_G12,P7-G12_P7-G12_N715-S521_L002_R2_001.fastq.gz,"2,816"$
19TAA,NA,NA,turf algae,4,4,02.10.2016,Turf_ITS_077,P7_H01,P7-H01_P7-H01_N701-S522_L002_R2_001.fastq.gz,"178,471"$
19TAB,NA,NA,turf algae,4,4,02.10.2016,Turf_ITS_078,P7_H02,P7-H02_P7-H02_N702-S522_L002_R2_001.fastq.gz,"163,046"$
20TAA,NA,NA,turf algae,5,4.5,02.10.2016,Turf_ITS_079,P7_H03,P7-H03_P7-H03_N703-S522_L002_R2_001.fastq.gz,"178,250"$
20TAB,NA,NA,turf algae,5,4.5,02.10.2016,Turf_ITS_080,P7_H04,P7-H04_P7-H04_N704-S

In [7]:
! sed -r 's/^,+$//g' info_290819_fix_fix.csv > info_290819_fix_fix_fix.csv

In [8]:
! tail info_290819_fix_fix_fix.csv

16TAB,NA,NA,turf algae,1,4.4,02.10.2016,Turf_ITS_071,P7_G07,P7-G07_P7-G07_N707-S521_L002_R2_001.fastq.gz,"6,737"
17TAA,NA,NA,turf algae,1,4.4,02.10.2016,Turf_ITS_072,P7_G08,P7-G08_P7-G08_N710-S521_L002_R2_001.fastq.gz,"79,565"
17TAB,NA,NA,turf algae,2,4,02.10.2016,Turf_ITS_073,P7_G09,P7-G09_P7-G09_N711-S521_L002_R2_001.fastq.gz,"84,231"
18TAA,NA,NA,turf algae,2,4,02.10.2016,Turf_ITS_074,P7_G10,P7-G10_P7-G10_N712-S521_L002_R2_001.fastq.gz,"76,205"
18TAB,NA,NA,turf algae,3,4,02.10.2016,Turf_ITS_075,P7_G11,P7-G11_P7-G11_N714-S521_L002_R2_001.fastq.gz,"96,295"
18TAB.2,NA,NA,turf algae,3,4,02.10.2016,Turf_ITS_076,P7_G12,P7-G12_P7-G12_N715-S521_L002_R2_001.fastq.gz,"2,816"
19TAA,NA,NA,turf algae,4,4,02.10.2016,Turf_ITS_077,P7_H01,P7-H01_P7-H01_N701-S522_L002_R2_001.fastq.gz,"178,471"
19TAB,NA,NA,turf algae,4,4,02.10.2016,Turf_ITS_078,P7_H02,P7-H02_P7-H02_N702-S522_L002_R2_001.fastq.gz,"163,046"
20TAA,NA,NA,turf algae,5,4.5,02.10.2016,Turf_ITS_079,P7_H03,P7-H03_P7-H03_N703-S522_L002_R

In [9]:
! ls

131_142.DIVs.absolute.txt  gabi_env_res.ipynb  info_290819_fix.csv
131_142.DIVs.fasta	   gabi_env_res.py     info_290819_fix_fix.csv
131_142.DIVs.relative.txt  info_290819.csv     info_290819_fix_fix_fix.csv


In [10]:
! mv info_290819_fix_fix_fix.csv info_300718.csv

In [11]:
! rm *2908*

In [12]:
! ls

131_142.DIVs.absolute.txt  131_142.DIVs.relative.txt  gabi_env_res.py
131_142.DIVs.fasta	   gabi_env_res.ipynb	      info_300718.csv


info_300718.csv is now the metadata file that we will work with in the Python processing.

In [16]:
! head info_300718.csv

Sample no.,coral genus,coral species,environ type,plot #,depth (m),date collected,Sample ID,Sample Name,Sequence file,no. of reads
coral 1,Acropora,plate,NA,1,4.4,22.08.2016,Coral_ITS_001,P5_A01,P5-A01_P5-A01_N701-S502_L002_R2_001.fastq.gz,"281,819"
coral 2 ,Stylophora,pistillata,NA,1,4.4,22.08.2016,Coral_ITS_002,P5_A02,P5-A02_P5-A02_N702-S502_L002_R2_001.fastq.gz,"368,805"
coral 3,Xenia,,NA,1,4.4,22.08.2016,Coral_ITS_003,P5_A03,P5-A03_P5-A03_N703-S502_L002_R2_001.fastq.gz,"375,166"
coral 4,Platygyra ,lamellina,NA,1,4.4,22.08.2016,Coral_ITS_004,P5_A04,P5-A04_P5-A04_N704-S502_L002_R2_001.fastq.gz,"344,712"
coral 5,Acropora,"pharaonis, cytherea, ",NA,1,4.4,22.08.2016,Coral_ITS_005,P5_A05,P5-A05_P5-A05_N705-S502_L002_R2_001.fastq.gz,"717,521"
coral 6,Galaxea,fascicularis,NA,1,4.4,22.08.2016,Coral_ITS_006,P5_A06,P5-A06_P5-A06_N706-S502_L002_R2_001.fastq.gz,"498,419"
coral 7,Porites,lobata,NA,1,4.4,22.08.2016,Coral_ITS_007,P5_A07,P5-A07_P5-A07_N707-S502_L002_R2_001.fastq.gz,"424,227"
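
The `sed` passes above (strip the trailing commas and carriage returns, then drop the empty rows) could also be done in a single pass in Python. The following is only a sketch of that idea, not the code that was run: the function name `clean_info_lines` is made up for illustration and the header row in the example is shortened.

```python
import re


def clean_info_lines(lines):
    """Strip trailing commas / carriage returns and drop empty rows."""
    cleaned = []
    for line in lines:
        # remove the line ending (\r\n or \n), then any run of trailing commas
        stripped = re.sub(r',+$', '', line.rstrip('\r\n'))
        # skip rows that were blank or consisted only of commas
        if stripped.strip():
            cleaned.append(stripped)
    return cleaned


# shortened, illustrative rows in the style of the Excel export
raw = [
    'Sample no.,coral genus,no. of reads,,,,\r\n',
    'coral 1,Acropora,"281,819",,,,\r\n',
    ',,,,,,\r\n',
    '\r\n',
]

for row in clean_info_lines(raw):
    print(row)
```

Like the `sed` version, this would also strip trailing commas from a row whose last real column is empty, so it is only safe when the final column (here the read count) is always filled in.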

#### Python code processing

__Read in the info file and the SP output file__

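Then clean these up by dropping the samples that were not collected on the correct date. Then relate the SP output and info files so that they use the same names for the samples. Also pickle out the dataframes and check whether the pickles already exist, so that the processing can be skipped on later runs.

The warning spat out by pandas (a `SettingWithCopyWarning`) can be ignored in this case. It is raised because `QC_info_df` and `info_df` are created as selections from other dataframes and are then modified in place.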

In [20]:
import pickle
import pandas as pd

try:
    sp_output_df = pickle.load(open('sp_output_df.pickle', 'rb'))
    QC_info_df = pickle.load(open('QC_info_df.pickle', 'rb'))
    info_df = pickle.load(open('info_df.pickle', 'rb'))
except FileNotFoundError:
    # read in the SymPortal output
    sp_output_df = pd.read_csv('131_142.DIVs.absolute.txt', sep='\t', lineterminator='\n')

    # The SP output contains the QC info columns between the DIVs and the no_name ITS2 columns.
    # Let's put the QC info columns into a separate df.
    QC_info_df = sp_output_df[['Samples','raw_contigs', 'post_qc_absolute_seqs', 'post_qc_unique_seqs',
                            'post_taxa_id_absolute_symbiodinium_seqs', 'post_taxa_id_unique_symbiodinium_seqs',
                            'post_taxa_id_absolute_non_symbiodinium_seqs', 'post_taxa_id_unique_non_symbiodinium_seqs',
                               'size_screening_violation_absolute', 'size_screening_violation_unique',
                               'post_med_absolute', 'post_med_unique']]

    # now let's drop the QC columns from the SP output df and also drop the clade summation columns
    # we will be left with just columns for each one of the sequences found in the samples
    sp_output_df.drop(columns=['noName Clade A', 'noName Clade B', 'noName Clade C', 'noName Clade D',
                            'noName Clade E', 'noName Clade F', 'noName Clade G', 'noName Clade H',
                            'noName Clade I', 'raw_contigs', 'post_qc_absolute_seqs', 'post_qc_unique_seqs',
                            'post_taxa_id_absolute_symbiodinium_seqs', 'post_taxa_id_unique_symbiodinium_seqs',
                            'post_taxa_id_absolute_non_symbiodinium_seqs', 'post_taxa_id_unique_non_symbiodinium_seqs',
                               'size_screening_violation_absolute', 'size_screening_violation_unique',
                               'post_med_absolute', 'post_med_unique'
                               ]
                      , inplace=True)

    # read in the info file
    info_df = pd.read_csv('info_300718.csv')

    # drop the rows that aren't from the 22.08.2016 collection date
    info_df = info_df[info_df['date collected'] == '22.08.2016']

    # Now we need to link the SP output to the sample names in the excel. Annoyingly they are formatted
    # slightly differently so we can't make a direct comparison.
    # easiest way to link them is to see if the first part of the SP name is the same as the first part
    # of the 'sequence file' in the meta info
    # when doing this we can also drop the SP info for those samples that won't be used i.e. those that
    # aren't now in the info_df

    # firstly rename the columns so that they are 'sample_name' in all of the dfs
    QC_info_df.rename(index=str, columns={'Samples': 'sample_name'}, inplace=True)
    sp_output_df.rename(index=str, columns={'Samples': 'sample_name'}, inplace=True)
    info_df.rename(index=str, columns={'Sample Name': 'sample_name'}, inplace=True)

    indices_to_drop = []
    for sp_index in sp_output_df.index.values.tolist():
        # keep track of whether the sp_index was found in the info table
        # if it wasn't then it should be dropped
        found = False
        for info_index in info_df.index.values.tolist():
            if sp_output_df.loc[sp_index, 'sample_name'].split('_')[0] == info_df.loc[info_index, 'Sequence file'].split('_')[0]:
                found = True
                # then these are a related set of rows and we should make the sample_names the same
                sp_output_df.loc[sp_index, 'sample_name'] = info_df.loc[info_index, 'sample_name']
                QC_info_df.loc[sp_index, 'sample_name'] = info_df.loc[info_index, 'sample_name']


        if not found:
            indices_to_drop.append(sp_index)

    # drop the rows from the SP output tables that aren't going to be used
    sp_output_df.drop(inplace=True, index=indices_to_drop)
    QC_info_df.drop(inplace=True, index=indices_to_drop)

    # let's sort out the 'environ type' column in the info_df
    # currently it is a bit of a mess
    for index in info_df.index.values.tolist():
        if 'coral' in info_df.loc[index, 'Sample no.']:
            info_df.loc[index, 'environ type'] = 'coral'
        elif 'seawater' in info_df.loc[index, 'Sample no.']:
            info_df.loc[index, 'environ type'] = 'sea_water'
        elif 'mucus' in info_df.loc[index, 'Sample no.']:
            info_df.loc[index, 'environ type'] = 'mucus'
        elif 'SA' in info_df.loc[index, 'Sample no.']:
            info_df.loc[index, 'environ type'] = 'sed_close'
        elif 'SB' in info_df.loc[index, 'Sample no.']:
            info_df.loc[index, 'environ type'] = 'sed_far'
        elif 'TA' in info_df.loc[index, 'Sample no.']:
            info_df.loc[index, 'environ type'] = 'turf'

    # now clean up the df indices
    info_df.index = range(len(info_df))
    sp_output_df.index = range(len(sp_output_df))
    QC_info_df.index = range(len(QC_info_df))

    # pickle the output so that the check at the top can skip the processing next time
    pickle.dump(sp_output_df, open('sp_output_df.pickle', 'wb'))
    pickle.dump(QC_info_df, open('QC_info_df.pickle', 'wb'))
    pickle.dump(info_df, open('info_df.pickle', 'wb'))

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  return super(DataFrame, self).rename(**kwargs)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item] = s
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
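
The nested loop in the cell above matches each SP row to an info row by comparing the text before the first underscore. As a rough sketch, the same rule can be written as a lookup table plus `Series.map`, which avoids the row-by-row `.loc` assignments. The two small tables below are toy stand-ins, and the SP-style names such as `P5-A01_S1` are made up for illustration; only the `Sequence file` values come from the info file.

```python
import pandas as pd

# toy stand-ins for the real tables; the SP-style names are hypothetical
sp = pd.DataFrame({
    'sample_name': ['P5-A01_S1', 'P5-A02_S2', 'P9-Z99_S3'],
    'A1': [29, 0, 5],
})
info = pd.DataFrame({
    'sample_name': ['P5_A01', 'P5_A02'],
    'Sequence file': ['P5-A01_P5-A01_N701-S502_L002_R2_001.fastq.gz',
                      'P5-A02_P5-A02_N702-S502_L002_R2_001.fastq.gz'],
})

# lookup: text before the first underscore of the file name -> info sample name
prefix_to_name = dict(zip(info['Sequence file'].str.split('_').str[0],
                          info['sample_name']))

# look each SP name up by its own prefix; rows without a match become NaN
linked_names = sp['sample_name'].str.split('_').str[0].map(prefix_to_name)

# overwrite the names, drop the unmatched rows and tidy the index
sp_linked = (sp.assign(sample_name=linked_names)
               .dropna(subset=['sample_name'])
               .reset_index(drop=True))
print(sp_linked)
```

Because `assign` returns a new dataframe rather than modifying a selection in place, this form should not trigger the copy-of-a-slice warning shown above.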


In [18]:
! ls

131_142.DIVs.absolute.txt  gabi_env_res.ipynb  info_df.pickle
131_142.DIVs.fasta	   gabi_env_res.py     QC_info_df.pickle
131_142.DIVs.relative.txt  info_300718.csv     sp_output_df.pickle
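
The "load the pickle if it exists, otherwise build it and pickle it" pattern is used more than once in this notebook, so it could be wrapped in a small helper. This is a sketch only; `load_or_build` and the demo file name are hypothetical and are not part of `gabi_env_res.py`.

```python
import os
import pickle
import tempfile


def load_or_build(path, build):
    """Return the object pickled at `path`, building and caching it if absent."""
    try:
        with open(path, 'rb') as handle:
            return pickle.load(handle)
    except FileNotFoundError:
        obj = build()
        with open(path, 'wb') as handle:
            pickle.dump(obj, handle)
        return obj


calls = []


def expensive():
    # stands in for the slow processing step; records each time it runs
    calls.append(1)
    return {'A1': 29}


with tempfile.TemporaryDirectory() as tmp:
    cache = os.path.join(tmp, 'demo.pickle')
    first = load_or_build(cache, expensive)   # builds and writes the pickle
    second = load_or_build(cache, expensive)  # reads the pickle back

print(first == second, len(calls))
```

Catching only `FileNotFoundError` means that a typo in a file name still forces a rebuild, but any other problem (for example a corrupt pickle) is raised instead of being silently hidden.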


In [21]:
print(sp_output_df)

    sample_name      A1     A1k  A10   A1c   A1h   A1cc   A1cr    A1m   A1z  \
0        P5_D07      19       0    0     0     0      0      0      0     0   
1        P5_A07      18       0    0     0     0      0      0      0     0   
2        P5_C10      17       8    0     0     0      0      0      0     0   
3        P5_A01      29       0    0     0     0      0      0      0     0   
4        P5_B05   58573  114471    0     0     0      0      0      0   415   
5        P5_C02   14743       0    0     0     0      0      0      0     0   
6        P5_B03      32      29    0     0     0      0      0      0     0   
7        P5_B01      18      10    0     0     0      0      0      0     0   
8        P5_D10      21      10    6     0     0      0      0      0     0   
9        P5_B07       6       4    0     0     0      0      0      0     0   
10       P5_A08      28       0    0     0     0      0      0      0     0   
11       P5_D02      37       0    0     0     0    

In [22]:
print(info_df)

    Sample no.  coral genus          coral species environ type  plot #  \
0      coral 1     Acropora                  plate          NaN     1.0   
1     coral 2    Stylophora             pistillata          NaN     1.0   
2      coral 3        Xenia                    NaN          NaN     1.0   
3      coral 4   Platygyra               lamellina          NaN     1.0   
4      coral 5     Acropora  pharaonis, cytherea,           NaN     1.0   
5      coral 6      Galaxea           fascicularis          NaN     1.0   
6      coral 7      Porites                 lobata          NaN     1.0   
7      coral 8   Echinopora            fruticulosa          NaN     1.0   
8      coral 9  Pocillopora              verrucosa          NaN     1.0   
9     coral 10   Goniastrea       aspera, edwardsi          NaN     1.0   
10    coral 11   Stylophora             pistillata          NaN     2.0   
11    coral 12  Pocillopora              verrucosa          NaN     2.0   
12    coral 13        Fav
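
The `environ type` clean-up in the processing cell keys off substrings of `Sample no.`, and the order of the checks matters. A minimal sketch of the same rules as a standalone function is given below so that they can be checked in isolation. The function name is made up, and the sediment label `1SA` is an assumed example, since no sediment rows are shown in the outputs above.

```python
# (substring, label) pairs in the same order as the elif chain in the processing cell
ENVIRON_RULES = [
    ('coral', 'coral'),
    ('seawater', 'sea_water'),
    ('mucus', 'mucus'),
    ('SA', 'sed_close'),
    ('SB', 'sed_far'),
    ('TA', 'turf'),
]


def classify_environ_type(sample_no):
    """Return the environ type for a 'Sample no.' label, or None if no rule matches."""
    for token, label in ENVIRON_RULES:
        if token in sample_no:
            return label
    return None


# 'coral 1' and '17TAA' appear in the info file; '1SA' is a hypothetical label
for label in ['coral 1', '17TAA', '18TAB.2', '1SA', 'unknown']:
    print(label, '->', classify_environ_type(label))
```

Returning `None` for an unmatched label makes it easy to spot rows that none of the rules cover, whereas the `elif` chain leaves such rows unchanged.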

In [23]:
print(QC_info_df)

    sample_name  raw_contigs  post_qc_absolute_seqs  post_qc_unique_seqs  \
0        P5_D07     249412.0               125388.0               2308.0   
1        P5_A07     424227.0               223882.0               3628.0   
2        P5_C10    1174688.0               763225.0               5029.0   
3        P5_A01     281819.0               145353.0               1878.0   
4        P5_B05     387147.0               212392.0               1964.0   
5        P5_C02     397892.0               231619.0               2918.0   
6        P5_B03     358106.0               195359.0               1774.0   
7        P5_B01     192914.0                93647.0               1714.0   
8        P5_D10     632170.0               397407.0               3577.0   
9        P5_B07     230982.0               120302.0               1590.0   
10       P5_A08     273633.0               149802.0               2363.0   
11       P5_D02     293009.0               172731.0               2083.0   
12       P5_

__Get a dictionary that will hold seq name as key and, as value, a tup of the sample name and relative abundance for the sample in which the sequence was found to be most abundant__

In [24]:
# We want to plot the ITS2 sequence diversity in each of the samples as bar charts.
# We are going to have a huge diversity of sequences to deal with, so I think we should do something along
# the lines of plotting the top 10 most abundant sequences. The term 'most abundant' should be considered
# carefully here. I think it will be best if we work on a sample-by-sample basis, i.e. we pick the 10
# sequences that have the highest representation in any one sample. So, for example, what we are not doing
# is seeing how many times C3 was sequenced across all of the samples, finding that it is a lot, and
# therefore picking it. Instead we are looking in each of the samples and seeing the highest proportion
# it is found at in any one sample. This way we should have the best chance of having a coloured
# representation for each sample's most abundant sequence.

# to start lets go sample by sample and see what the highest prop for each seq is.

# dict to hold info on which sample and what the proportion is for each sequence
# key = sequence name, value = tup ( sample name, relative abundance)
import sys

try:
    seq_rel_abund_calculator_dict = pickle.load(open('seq_rel_abund_caculator_dict.picle', 'rb'))
except FileNotFoundError:
    seq_rel_abund_calculator_dict = {}
    for sample_index in sp_output_df.index.values.tolist():
        sample_name = sp_output_df.loc[sample_index, 'sample_name']
        sys.stdout.write('\nGetting rel seq abundances from {}\n'.format(sample_name))
        # the first column is the sample name; the rest are absolute sequence counts
        seq_counts = sp_output_df.loc[sample_index].iloc[1:]
        temp_prop_array = seq_counts.div(seq_counts.sum())
        for seq_name in temp_prop_array.keys():
            sys.stdout.write('\rseq: {}'.format(seq_name))
            val = temp_prop_array[seq_name]
            if val != 0:  # if the sequence was found in the sample
                # if the sequence is already in the dict
                if seq_name in seq_rel_abund_calculator_dict:
                    # check to see if the rel abundance is larger than the one already logged
                    if val > seq_rel_abund_calculator_dict[seq_name][1]:
                        seq_rel_abund_calculator_dict[seq_name] = (sample_name, val)
                # if we haven't logged this sequence yet, then add this as the first log
                else:
                    seq_rel_abund_calculator_dict[seq_name] = (sample_name, val)
    pickle.dump(seq_rel_abund_calculator_dict, open('seq_rel_abund_caculator_dict.picle', 'wb'))

# here we have a dict that contains, for each of the seqs, the largest rel abundance found in any one sample
# now we can sort this to look at the top 10 (?) sequences to start with (I'm not sure how the colouring
# will look, so let's just start with 10 and see where we get to)
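
The per-sample loop above can be slow when there are many sequence columns. As a sketch of an alternative, the same "highest relative abundance in any one sample" value can be computed by normalising every row at once and then taking the column-wise maximum. The function name and the toy table are made up for illustration, and the sketch assumes every column other than `sample_name` holds absolute counts.

```python
import pandas as pd


def max_rel_abund_per_seq(counts_df, name_col='sample_name'):
    """Map each sequence to (sample, rel abundance) for the sample where it peaks."""
    counts = counts_df.set_index(name_col).astype(float)
    # divide every row by its own total to get within-sample proportions
    rel = counts.div(counts.sum(axis=1), axis=0)
    result = {}
    for seq in rel.columns:
        col = rel[seq]
        # skip sequences that were not found in any sample
        if (col > 0).any():
            result[seq] = (col.idxmax(), col.max())
    return result


# toy counts, illustrative only
toy = pd.DataFrame({
    'sample_name': ['s1', 's2'],
    'A1': [3, 1],
    'C3': [1, 1],
    'D1': [0, 0],
})
result = max_rel_abund_per_seq(toy)
print(result)
```

In the toy table A1 peaks in `s1` at 0.75 and C3 peaks in `s2` at 0.5, while D1 is left out because it was never found. When two samples tie, `idxmax` returns the first one, which matches the strict `>` comparison used in the loop.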

In [25]:
! ls

131_142.DIVs.absolute.txt  info_300718.csv
131_142.DIVs.fasta	   info_df.pickle
131_142.DIVs.relative.txt  QC_info_df.pickle
gabi_env_res.ipynb	   seq_rel_abund_caculator_dict.picle
gabi_env_res.py		   sp_output_df.pickle


In [26]:
print(seq_rel_abund_calculator_dict)

{'A1': ('P7_E02', 0.92837632776934753), 'C41': ('P6_C10', 0.66655707522356655), 'C1': ('P6_E08', 0.4414120961352232), 'C1b': ('P6_F12', 0.28770949720670391), 'C39': ('P6_E01', 0.083333333333333329), 'C3': ('P7_A06', 0.10174825174825175), 'C1ae': ('P6_G01', 0.049325267566309915), 'C41f': ('P6_G01', 0.054618427175430431), 'C41a': ('P6_E08', 0.04481028259529888), 'C15h': ('P7_E05', 0.7390385631273112), 'C1ai': ('P6_G01', 0.021696137738483015), 'C1w': ('P6_F12', 0.023743016759776536), 'C1n': ('P6_D11', 0.0040863762465322036), 'C1a': ('P6_E07', 0.0076177285318559558), 'C41e': ('P6_G01', 0.02472080037226617), 'C1af': ('P6_E08', 0.011708777181089884), 'C3v': ('P6_E08', 0.0014966106171317897), 'C39a': ('P6_G01', 0.0054094927873429505), 'C89': ('P6_E08', 0.00052821551192886698), 'C1f': ('P6_G01', 0.010237319683573755), 'C93a': ('P7_E05', 0.017960908610670893), 'C3u': ('P7_A06', 0.02062937062937063), 'C39j': ('P6_D11', 0.0014995876134063133), 'D1': ('P6_D08', 0.605235769754223), '1398_C': ('P5_B
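
With the dictionary in this shape, picking the top sequences amounts to sorting on the second element of each value. The sketch below shows one way to do that; `top_sequences` is a hypothetical helper, and the example dict is a small subset of the entries printed above.

```python
def top_sequences(seq_dict, n=10):
    """Return the n (seq, sample, rel_abund) tuples with the highest rel abundance."""
    ranked = sorted(seq_dict.items(), key=lambda item: item[1][1], reverse=True)
    return [(seq, sample, abund) for seq, (sample, abund) in ranked[:n]]


# a few entries copied from the printed dictionary
example = {
    'A1': ('P7_E02', 0.92837632776934753),
    'C41': ('P6_C10', 0.66655707522356655),
    'C1': ('P6_E08', 0.4414120961352232),
    'C15h': ('P7_E05', 0.7390385631273112),
    'D1': ('P6_D08', 0.605235769754223),
    'C89': ('P6_E08', 0.00052821551192886698),
}

for seq, sample, abund in top_sequences(example, n=3):
    print(seq, sample, round(abund, 3))
```

For this subset the top three are A1, C15h and C41, in that order.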