# Statistical Comparison of Bin Metrics Between DAS Tool Outputs With and Without MetaDecoder

In [1]:
import pandas as pd

# Show all columns
pd.set_option("display.max_columns", None)

## Data Cleaning

In [2]:
def clean_fastani(file):
    """
    Remove unwanted columns, rename them and remove path prefixes from query and reference bins.
    """

    # Read FastANI file
    table = pd.read_table(file)

    # We need only the first three columns
    table = table.iloc[:, 0:3]

    # Reset column names
    table.columns = range(table.columns.size)

    # Rename columns to more appropriate names
    table.columns = ["query", "reference", "match"]

    # Extract only the bin names
    table["query"] = table["query"].str.split("/").str[-1]
    table["reference"] = table["reference"].str.split("/").str[-1]

    return table
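
One thing worth double-checking in `clean_fastani`: `pd.read_table` treats the first line of a file as a header by default. As far as I know, raw FastANI output has no header row, so if the `.tsv` files were not preprocessed, the first comparison would be consumed as column names. Below is a minimal sketch with made-up paths and values; the labels for the last two columns are my own, not FastANI's.

```python
import io

import pandas as pd

# Two made-up lines in the five-column, headerless layout assumed for raw FastANI output
raw = (
    "/data/bins/maxbin_bin.1.fa\t/data/ref/maxbin_bin.1.fa\t100.0\t950\t960\n"
    "/data/bins/maxbin_bin.2.fa\t/data/ref/vamb_S1C1.fa\t81.25\t300\t900\n"
)

# Hypothetical labels for the five columns
names = ["query", "reference", "match", "fragments_mapped", "fragments_total"]

# header=None keeps the first line as data instead of using it as column names
table = pd.read_table(io.StringIO(raw), header=None, names=names)
table = table[["query", "reference", "match"]].copy()

# Keep only the file name of each path
for col in ["query", "reference"]:
    table[col] = table[col].str.split("/").str[-1]

print(table)
```

If the files already carry a header line, the default call used in `clean_fastani` is the right one.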

In [3]:
kaloevig_fastani = clean_fastani("kaloevig_fastani.tsv")
loegten_fastani = clean_fastani("loegten_fastani.tsv")

In [4]:
kaloevig_fastani

Unnamed: 0,query,reference,match
0,maxbin_bin.10.fa,maxbin_bin.10_sub.fa,100.0000
1,maxbin_bin.114.fa,maxbin_bin.114.fa,100.0000
2,maxbin_bin.116_sub.fa,maxbin_bin.116_sub.fa,100.0000
3,maxbin_bin.121_sub.fa,maxbin_bin.121_sub.fa,100.0000
4,maxbin_bin.121_sub.fa,metadecoder_kaloevig.metadecoder.1157.fa,81.2515
...,...,...,...
196,vamb_S1C8489.fa,vamb_S1C8489.fa,100.0000
197,vamb_S1C8497.fa,metadecoder_kaloevig.metadecoder.1115.fa,100.0000
198,vamb_S1C9541.fa,vamb_S1C9541.fa,100.0000
199,vamb_S1C9779_sub.fa,metadecoder_kaloevig.metadecoder.666_sub.fa,100.0000


In [5]:
kaloevig_fastani[kaloevig_fastani["query"] == "maxbin_bin.55_sub.fa"]

Unnamed: 0,query,reference,match
35,maxbin_bin.55_sub.fa,maxbin_bin.55_sub.fa,100.0
36,maxbin_bin.55_sub.fa,metadecoder_kaloevig.metadecoder.700.fa,80.9148


In [6]:
loegten_fastani

Unnamed: 0,query,reference,match
0,maxbin_bin.117_sub.fa,metadecoder_loegten.metadecoder.1170.fa,99.9817
1,maxbin_bin.121_sub.fa,maxbin_bin.121_sub.fa,100.0000
2,maxbin_bin.130_sub.fa,maxbin_bin.130_sub.fa,100.0000
3,maxbin_bin.135.fa,maxbin_bin.135_sub.fa,100.0000
4,maxbin_bin.13_sub.fa,maxbin_bin.13_sub.fa,100.0000
...,...,...,...
149,vamb_S1C7996.fa,vamb_S1C7996.fa,100.0000
150,vamb_S1C8265_sub.fa,vamb_S1C8265_sub.fa,99.9999
151,vamb_S1C8664.fa,vamb_S1C8664.fa,99.9999
152,vamb_S1C8901.fa,vamb_S1C8901.fa,100.0000


In [7]:
loegten_fastani[loegten_fastani["query"].duplicated(keep=False)]

Unnamed: 0,query,reference,match
4,maxbin_bin.13_sub.fa,maxbin_bin.13_sub.fa,100.0
5,maxbin_bin.13_sub.fa,vamb_S1C5018.fa,78.7938
6,maxbin_bin.140_sub.fa,maxbin_bin.140_sub.fa,100.0
7,maxbin_bin.140_sub.fa,metadecoder_loegten.metadecoder.785_sub.fa,77.7941
10,maxbin_bin.177.fa,metadecoder_loegten.metadecoder.620.fa,99.9411
11,maxbin_bin.177.fa,vamb_S1C3219_sub.fa,95.1226
18,maxbin_bin.252_sub.fa,maxbin_bin.252_sub.fa,100.0
19,maxbin_bin.252_sub.fa,metabat_bin.439.fa,79.2989
20,maxbin_bin.252_sub.fa,metabat_bin.149_sub.fa,79.1409
23,maxbin_bin.39_sub.fa,maxbin_bin.39_sub.fa,100.0


In the original reference files the numbers of bins were 179 for Kalø Vig, and 139 for Løgten. 

Some of the query bins are duplicated (as well as reference bins). I'll retain only the bins with the **highest match value**. 

I will first remove the duplicates from the `query` column and then from the `reference` column if any duplicate is still left.

In [8]:
print(
    f"# of duplicates in 'query' column of the Kalø Vig dataset: \
{kaloevig_fastani['query'].duplicated().sum()}"
)

print(
    f"# of duplicates in 'query' column of the Løgten dataset: \
{loegten_fastani['query'].duplicated().sum()}"
)

# of duplicates in 'query' column of the Kalø Vig dataset: 51
# of duplicates in 'query' column of the Løgten dataset: 32


In [9]:
def remove_query_dup(df):
    """
    For duplicated entries in the 'query' column, keep only the row with the highest 'match' value.
    """
    return (
        df.sort_values("match", ascending=False).drop_duplicates("query").sort_index()
    )

In [10]:
kaloevig_no_dup = remove_query_dup(kaloevig_fastani)
loegten_no_dup = remove_query_dup(loegten_fastani)

In [11]:
# Kalø Vig reference duplicates
kaloevig_no_dup[kaloevig_no_dup["reference"].duplicated(keep=False)]

Unnamed: 0,query,reference,match


In [12]:
# Loegten reference duplicates
loegten_no_dup[loegten_no_dup["reference"].duplicated(keep=False)]

Unnamed: 0,query,reference,match
39,maxbin_bin.70_sub.fa,metabat_bin.373_sub.fa,99.9569
81,metabat_bin.373_sub.fa,metabat_bin.373_sub.fa,99.8184


In [13]:
print(
    f"# of duplicates in 'reference' column of Kalø Vig dataset: \
{kaloevig_no_dup['reference'].duplicated().sum()}"
)

print(
    f"# of duplicates in 'reference' column of Løgten dataset: \
{loegten_no_dup['reference'].duplicated().sum()}"
)

# of duplicates in 'reference' column of Kaloevig dataset: 0
# of duplicates in 'reference' column of Loegten dataset: 1


There are still duplicates in the `reference` column. For example, in the Løgten dataset the reference `metabat_bin.373_sub.fa` corresponds to two queries: `maxbin_bin.70_sub.fa` and `metabat_bin.373_sub.fa`. The match between them is almost the same: `99.9569` and `99.8184`, respectively.

Between the two or more query bins that correspond to the same reference bin, I'll choose the one with the highest total length (from the results of Quast). The reasoning is that one of the algorithms has probably split a bin that remains a single bin in the other algorithm.

First, I'll shortlist the duplicated reference bins.

In [14]:
# Kalø Vig reference duplicates
kaloevig_ref_dup = kaloevig_no_dup[
    kaloevig_no_dup["reference"].duplicated(keep=False)
].reset_index(drop=True)

Now remove the `.fa` extension from the `query` column to allow merging, and remove all rows where the match is less than 99.5% (there is just one such entry in the Kalø Vig dataset).

In [15]:
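# Remove extension .fa from query bins
# (an anchored regex removes only a trailing ".fa"; str.rstrip(".fa") would strip
# any trailing '.', 'f' or 'a' characters instead of the suffix)
kaloevig_no_dup["query"] = kaloevig_no_dup["query"].str.replace(
    r"\.fa$", "", regex=True
)
loegten_no_dup["query"] = loegten_no_dup["query"].str.replace(
    r"\.fa$", "", regex=True
)

# Keep only rows where 'match' > 99.5
kaloevig_no_dup = kaloevig_no_dup[kaloevig_no_dup["match"] > 99.5]
loegten_no_dup = loegten_no_dup[loegten_no_dup["match"] > 99.5]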

kaloevig_no_dup.head()

Unnamed: 0,query,reference,match
0,maxbin_bin.10,maxbin_bin.10_sub.fa,100.0
1,maxbin_bin.114,maxbin_bin.114.fa,100.0
2,maxbin_bin.116_sub,maxbin_bin.116_sub.fa,100.0
3,maxbin_bin.121_sub,maxbin_bin.121_sub.fa,100.0
8,maxbin_bin.128_sub,maxbin_bin.128_sub.fa,100.0
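
A note on stripping the `.fa` extension: `str.rstrip` removes a *set of characters* from the end of a string, not a literal suffix. The bin names shown here end in a digit or in `_sub`, so the outcome is the same on this data, but a name ending in `a`, `f` or `.` would be cut too short. A small sketch with one hypothetical name (`vamb_S1Ca.fa`) to show the difference:

```python
import pandas as pd

# The last bin name is hypothetical: it ends in "a" before the extension
bins = pd.Series(["maxbin_bin.10_sub.fa", "vamb_S1C8489.fa", "vamb_S1Ca.fa"])

# rstrip removes any trailing '.', 'f' or 'a' characters
stripped_chars = bins.str.rstrip(".fa")

# An anchored regex removes only a literal ".fa" at the end of the name
stripped_suffix = bins.str.replace(r"\.fa$", "", regex=True)

print(stripped_chars.tolist())
print(stripped_suffix.tolist())
```

Only the third name differs between the two results: `rstrip` returns `vamb_S1C`, while the regex returns `vamb_S1Ca`.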


**Update 2022-03-29**: there was an error in an earlier version of the function `remove_query_dup`. It used `groupby()` code by **Ted Petrou** from [here](https://stackoverflow.com/questions/12497402/remove-duplicates-by-columns-a-keeping-the-row-with-the-highest-value-in-column), but `groupby()` does not keep the values of a row together when every value in one column has to stay associated with the correct value in another column (`groupby()` works only with two columns). As a result, a query could end up paired with the wrong reference or match value.
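
To illustrate the kind of mismatch this can produce, here is a minimal sketch. It assumes the earlier version aggregated with `groupby(...).max()`; the earlier code itself is not shown in this notebook, so treat that as an assumption. The two rows use the values printed above for `maxbin_bin.55_sub.fa`.

```python
import pandas as pd

# Two rows for the same query, as shown earlier for maxbin_bin.55_sub.fa
df = pd.DataFrame(
    {
        "query": ["maxbin_bin.55_sub.fa", "maxbin_bin.55_sub.fa"],
        "reference": [
            "maxbin_bin.55_sub.fa",
            "metadecoder_kaloevig.metadecoder.700.fa",
        ],
        "match": [100.0, 80.9148],
    }
)

# Assumed earlier approach: max() is taken per column, so 'reference'
# (alphabetical maximum) and 'match' (numeric maximum) can come from different rows
per_column = df.groupby("query", as_index=False).max()

# Row-wise approach: whole rows are sorted and dropped, so the columns stay together
row_wise = (
    df.sort_values("match", ascending=False).drop_duplicates("query").sort_index()
)

print(per_column)
print(row_wise)
```

In the per-column result the query is paired with the MetaDecoder bin at a match of 100, while the row-wise result keeps the original row with `maxbin_bin.55_sub.fa` as reference.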

Now it's very easy to remove just one duplicated reference value from the Løgten dataset.

In [16]:
# Løgten reference duplicates
loegten_no_dup[loegten_no_dup["reference"].duplicated(keep=False)]

Unnamed: 0,query,reference,match
39,maxbin_bin.70_sub,metabat_bin.373_sub.fa,99.9569
81,metabat_bin.373_sub,metabat_bin.373_sub.fa,99.8184


In [17]:
# Drop index 81
loegten_no_dup = loegten_no_dup.drop(index=[81])

# Loegten reference duplicates
loegten_no_dup[loegten_no_dup["reference"].duplicated(keep=False)]

Unnamed: 0,query,reference,match


In [18]:
# Before, when using groupby(), maxbin_bin.55_sub.fa was associated with metadecoder_kaloevig.metadecoder.700
# with a match of 100, which is erroneous as these bins have only ~81% match
kaloevig_no_dup[kaloevig_no_dup["query"] == "maxbin_bin.55_sub"]

Unnamed: 0,query,reference,match
35,maxbin_bin.55_sub,maxbin_bin.55_sub.fa,100.0


## Quast

Next, change the format of the Quast (no MetaDecoder) datasets to an easier format to work with (use snippets from `filter_n50`).

In [19]:
def reformat_quast(input_file):
    """
    Change the format of the Quast datasets to an easier format to work with.
    """
    # Open input file
    input_file = pd.read_table(input_file)

    # Transpose the tsv file
    transposed = input_file.set_index("Assembly").T

    return transposed
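
A toy illustration of what `reformat_quast` does, assuming the Quast report has metrics as rows, bins as columns, and the metric names in a column called `Assembly` (which is what `set_index("Assembly")` relies on). The bin names and numbers are made up.

```python
import io

import pandas as pd

# Made-up report in the assumed layout: metrics as rows, bins as columns
raw = (
    "Assembly\tbin_a\tbin_b\n"
    "# contigs\t10\t20\n"
    "Total length\t3000\t5000\n"
    "N50\t400\t350\n"
)

report = pd.read_table(io.StringIO(raw))

# After transposing, each bin is a row and each metric is a column
transposed = report.set_index("Assembly").T

print(transposed)
print(transposed["Total length"])
```

With bins as rows, a single metric can be picked out as a column, which is how `Total length` is extracted in the next cell.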

I'll use the non-MetaDecoder Quast datasets as more bins will be kept.

In [20]:
# Extract bin names and their total length
kaloevig_quast_tot_len = reformat_quast(
    "results_kaloevig_dastool_no_metadecoder_quast.tsv"
)["Total length"]
kaloevig_quast_tot_len = kaloevig_quast_tot_len.reset_index()
kaloevig_quast_tot_len.columns = ["bin", "total_length"]

loegten_quast_tot_len = reformat_quast(
    "results_loegten_dastool_no_metadecoder_quast.tsv"
)["Total length"]
loegten_quast_tot_len = loegten_quast_tot_len.reset_index()
loegten_quast_tot_len.columns = ["bin", "total_length"]

Next step is to merge datasets by the `bin` and `query` columns.

In [21]:
# Merge datasets
kaloevig_merged = kaloevig_quast_tot_len.merge(
    kaloevig_no_dup, left_on="bin", right_on="query"
).drop(["bin", "match"], axis=1)

loegten_merged = loegten_quast_tot_len.merge(
    loegten_no_dup, left_on="bin", right_on="query"
).drop(["bin", "match"], axis=1)
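
Because `merge` defaults to an inner join, bins present in only one of the two tables are dropped silently. If it is useful to see which ones, an outer merge with `indicator=True` is one way to check. This is a sketch on hypothetical bins, not a step of the analysis itself.

```python
import pandas as pd

# Hypothetical bins and lengths
quast = pd.DataFrame(
    {"bin": ["bin_a", "bin_b", "bin_c"], "total_length": [3000.0, 5000.0, 1200.0]}
)
fastani = pd.DataFrame(
    {
        "query": ["bin_a", "bin_b", "bin_d"],
        "reference": ["ref_a.fa", "ref_b.fa", "ref_d.fa"],
        "match": [100.0, 99.9, 99.8],
    }
)

# indicator=True adds a '_merge' column: 'both', 'left_only' or 'right_only'
check = quast.merge(
    fastani, left_on="bin", right_on="query", how="outer", indicator=True
)

# Rows that the default inner merge would drop
dropped = check[check["_merge"] != "both"]
print(dropped[["bin", "query", "_merge"]])
```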

In [22]:
kaloevig_merged

Unnamed: 0,total_length,query,reference
0,3174270.0,maxbin_bin.10,maxbin_bin.10_sub.fa
1,6091574.0,maxbin_bin.114,maxbin_bin.114.fa
2,3000122.0,maxbin_bin.116_sub,maxbin_bin.116_sub.fa
3,2112080.0,maxbin_bin.121_sub,maxbin_bin.121_sub.fa
4,3863717.0,maxbin_bin.128_sub,maxbin_bin.128_sub.fa
...,...,...,...
143,2033899.0,vamb_S1C7378,vamb_S1C7378.fa
144,2654240.0,vamb_S1C8489,vamb_S1C8489.fa
145,2528295.0,vamb_S1C8497,metadecoder_kaloevig.metadecoder.1115.fa
146,782798.0,vamb_S1C9541,vamb_S1C9541.fa


Now extract the `query` bins with the largest total length of the contigs.

In [23]:
kaloevig_query = (
    kaloevig_merged.groupby(["reference"], as_index=False)
    .apply(lambda group: group.nlargest(1, columns="total_length"))
    .reset_index(drop=True)
)  # From https://stackoverflow.com/questions/32459325/python-pandas-dataframe-select-row-by-max-value-in-group

loegten_query = (
    loegten_merged.groupby(["reference"], as_index=False)
    .apply(lambda group: group.nlargest(1, columns="total_length"))
    .reset_index(drop=True)
)
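
For reference, the same "longest query per reference" selection can also be written without `apply`, using `idxmax`. This is a sketch on hypothetical bins and lengths; on data without missing lengths it should pick the same rows as the `nlargest(1)` version above.

```python
import pandas as pd

# Hypothetical lengths; two queries share the reference 'ref_a.fa'
merged = pd.DataFrame(
    {
        "total_length": [3000.0, 4500.0, 5000.0],
        "query": ["bin_a1", "bin_a2", "bin_b"],
        "reference": ["ref_a.fa", "ref_a.fa", "ref_b.fa"],
    }
)

# idxmax returns, per reference, the index label of the row with the largest total_length
best_rows = merged.groupby("reference")["total_length"].idxmax()

# Select those rows from the original table
longest = merged.loc[best_rows].reset_index(drop=True)
print(longest)
```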

In [24]:
print(f"# of rows in Kalø Vig dataset after filtering is {kaloevig_query.shape[0]}.")

# of rows in Kaloevig dataset after filtering is 148.


In [25]:
print(f"# of rows in Løgten dataset after filtering is {loegten_query.shape[0]}.")

# of rows in Loegten dataset after filtering is 120.


In [26]:
# Example to demonstrate that the approach worked correctly
kaloevig_merged[kaloevig_merged["reference"] == "metabat_bin.467.fa"]

Unnamed: 0,total_length,query,reference
90,3737853.0,metabat_bin.467,metabat_bin.467.fa


In [27]:
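# Remove .fa extension as the format should be identical to CheckM and Quast outputs
# (an anchored regex removes only a trailing ".fa"; str.rstrip would strip characters)
kaloevig_query["reference"] = kaloevig_query["reference"].str.replace(
    r"\.fa$", "", regex=True
)
loegten_query["reference"] = loegten_query["reference"].str.replace(
    r"\.fa$", "", regex=True
)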

# Drop total_length as we won't need it anymore
kaloevig_query.drop("total_length", axis=1, inplace=True)
loegten_query.drop("total_length", axis=1, inplace=True)

In [28]:
kaloevig_query

Unnamed: 0,query,reference
0,maxbin_bin.10,maxbin_bin.10_sub
1,maxbin_bin.114,maxbin_bin.114
2,maxbin_bin.116_sub,maxbin_bin.116_sub
3,maxbin_bin.121_sub,maxbin_bin.121_sub
4,maxbin_bin.128_sub,maxbin_bin.128_sub
...,...,...
143,vamb_S1C7378,vamb_S1C7378
144,vamb_S1C8489,vamb_S1C8489
145,vamb_S1C8497,metadecoder_kaloevig.metadecoder.1115
146,vamb_S1C9541,vamb_S1C9541


In [29]:
loegten_query

Unnamed: 0,query,reference
0,maxbin_bin.117_sub,metadecoder_loegten.metadecoder.1170
1,maxbin_bin.121_sub,maxbin_bin.121_sub
2,maxbin_bin.130_sub,maxbin_bin.130_sub
3,maxbin_bin.135,maxbin_bin.135_sub
4,maxbin_bin.13_sub,maxbin_bin.13_sub
...,...,...
115,vamb_S1C7595,vamb_S1C7595_sub
116,vamb_S1C7996,vamb_S1C7996
117,vamb_S1C8265_sub,vamb_S1C8265_sub
118,vamb_S1C8664,vamb_S1C8664


Export the two datasets in `csv` as they'll be used in the selection of the best bins.

In [30]:
kaloevig_query.to_csv("kaloevig_query.csv", index=False)
loegten_query.to_csv("loegten_query.csv", index=False)

In [31]:
kaloevig_query.shape

(148, 2)

In [32]:
loegten_query.shape

(120, 2)

## Comparison

The bins that will be used in the comparison are from the `kaloevig_query` and `loegten_query` datasets. The metrics for the comparison are:

1. Completeness
2. Contamination
3. N50
4. L50
5. \# of contigs
6. Maximum contig length
7. Total contig length

The first two metrics come from the CheckM datasets, while the remaining five come from the Quast datasets.

These metrics are recommended in [this paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6436528/) (*Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea*):
<blockquote>
    To assist downstream users in the evaluation of assembly quality, we recommend reporting basic assembly statistics from individual SAGs and/or MAGs, including, total assembly size, contig N50/L50, and maximum contig length.
</blockquote>

### Completeness and Contamination

To make this process easier, I'll first work with just completeness and contamination and then extend the approach to the other metrics.

I'll be using the Bacteria results of CheckM because they have higher bin completeness.

In [33]:
checkm_cols = ["Bin Id", "Completeness", "Contamination"]

# Kalø Vig CheckM datasets
kaloevig_md_checkm = pd.read_table(
    "results_bacteria_kaloevig_dastool_metadecoder.tsv", usecols=checkm_cols
)
kaloevig_no_md_checkm = pd.read_table(
    "results_bacteria_kaloevig_dastool_no_metadecoder.tsv", usecols=checkm_cols
)

# Løgten CheckM datasets
loegten_md_checkm = pd.read_table(
    "results_bacteria_loegten_dastool_metadecoder.tsv", usecols=checkm_cols
)
loegten_no_md_checkm = pd.read_table(
    "results_bacteria_loegten_dastool_no_metadecoder.tsv", usecols=checkm_cols
)

Rename `Bin Id` to either `reference` (MetaDecoder dataset) or `query` (non-MetaDecoder dataset).

In [34]:
# Kalø Vig
kaloevig_md_checkm.rename(columns={"Bin Id": "reference"}, inplace=True)
kaloevig_no_md_checkm.rename(columns={"Bin Id": "query"}, inplace=True)

# Løgten
loegten_md_checkm.rename(columns={"Bin Id": "reference"}, inplace=True)
loegten_no_md_checkm.rename(columns={"Bin Id": "query"}, inplace=True)

Before merging these datasets, it's important to remove all bins with contamination > 10 or completeness < 50. This avoids including bins that do not satisfy these requirements when we select the best bins in `selecting_best_bins`.
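
A minimal sketch of how such a filter could look, using three rows from the `kaloevig_md_checkm` table in this notebook. The names `keep` and `filtered` are only for illustration, and the thresholds are the ones stated above.

```python
import pandas as pd

# Three rows taken from the kaloevig_md_checkm table
checkm = pd.DataFrame(
    {
        "reference": ["maxbin_bin.105_sub", "maxbin_bin.107", "maxbin_bin.116_sub"],
        "Completeness": [85.19, 91.38, 48.28],
        "Contamination": [50.73, 1.72, 0.47],
    }
)

# Keep bins with completeness of at least 50 and contamination of at most 10
keep = (checkm["Completeness"] >= 50) & (checkm["Contamination"] <= 10)
filtered = checkm[keep].reset_index(drop=True)

print(filtered)
```

Of these three bins, only `maxbin_bin.107` passes both thresholds: the first is too contaminated and the third is not complete enough.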

In [35]:
kaloevig_md_checkm

Unnamed: 0,reference,Completeness,Contamination
0,maxbin_bin.105_sub,85.19,50.73
1,maxbin_bin.107,91.38,1.72
2,maxbin_bin.10_sub,81.90,1.72
3,maxbin_bin.114,87.93,5.33
4,maxbin_bin.116_sub,48.28,0.47
...,...,...,...
339,vamb_S1C5430,93.97,1.72
340,vamb_S1C7378,70.69,0.00
341,vamb_S1C8489,82.76,0.00
342,vamb_S1C9502_sub,76.80,11.03


In [36]:
kaloevig_md_checkm[
    kaloevig_md_checkm["reference"] == "metadecoder_kaloevig.metadecoder.1115"
]

Unnamed: 0,reference,Completeness,Contamination
214,metadecoder_kaloevig.metadecoder.1115,89.66,3.61


In [37]:
kaloevig_no_md_checkm[kaloevig_no_md_checkm["query"] == "vamb_S1C8497"]

Unnamed: 0,query,Completeness,Contamination
276,vamb_S1C8497,77.59,0.0


Perform two consecutive merges:
1. CheckM with MetaDecoder + Kalø Vig/Løgten `reference`
2. CheckM without MetaDecoder + Kalø Vig/Løgten `query`

The merge is inner because we only need the bins that we can use to evaluate the performance of the two algorithms, so it's OK to exclude some bins.

In [38]:
# MD + reference
kaloevig_final = kaloevig_md_checkm.merge(kaloevig_query, on="reference")
loegten_final = loegten_md_checkm.merge(loegten_query, on="reference")

# No-MD + query
kaloevig_final = kaloevig_no_md_checkm.merge(
    kaloevig_final, on="query", suffixes=["_no_MD", "_MD"]
)
loegten_final = loegten_no_md_checkm.merge(
    loegten_final, on="query", suffixes=["_no_MD", "_MD"]
)

In [39]:
kaloevig_final

Unnamed: 0,query,Completeness_no_MD,Contamination_no_MD,reference,Completeness_MD,Contamination_MD
0,maxbin_bin.10,81.90,1.72,maxbin_bin.10_sub,81.90,1.72
1,maxbin_bin.114,87.93,5.33,maxbin_bin.114,87.93,5.33
2,maxbin_bin.116_sub,48.28,0.47,maxbin_bin.116_sub,48.28,0.47
3,maxbin_bin.121_sub,76.65,7.05,maxbin_bin.121_sub,76.65,5.33
4,maxbin_bin.128_sub,82.76,8.62,maxbin_bin.128_sub,82.76,8.62
...,...,...,...,...,...,...
143,vamb_S1C7378,70.69,0.00,vamb_S1C7378,70.69,0.00
144,vamb_S1C8489,82.76,0.00,vamb_S1C8489,82.76,0.00
145,vamb_S1C8497,77.59,0.00,metadecoder_kaloevig.metadecoder.1115,89.66,3.61
146,vamb_S1C9541,77.43,0.00,vamb_S1C9541,77.43,0.00


In [40]:
loegten_final

Unnamed: 0,query,Completeness_no_MD,Contamination_no_MD,reference,Completeness_MD,Contamination_MD
0,maxbin_bin.117_sub,79.31,1.72,metadecoder_loegten.metadecoder.1170,89.66,0.00
1,maxbin_bin.121_sub,80.17,10.34,maxbin_bin.121_sub,80.17,10.34
2,maxbin_bin.130_sub,94.83,21.87,maxbin_bin.130_sub,94.83,21.87
3,maxbin_bin.135,96.47,1.72,maxbin_bin.135_sub,95.61,1.72
4,maxbin_bin.13_sub,95.69,2.51,maxbin_bin.13_sub,95.69,2.51
...,...,...,...,...,...,...
115,vamb_S1C7595,96.55,15.52,vamb_S1C7595_sub,94.83,0.00
116,vamb_S1C7996,89.66,0.00,vamb_S1C7996,89.66,0.00
117,vamb_S1C8265_sub,26.69,0.16,vamb_S1C8265_sub,26.69,0.16
118,vamb_S1C8664,95.30,0.00,vamb_S1C8664,95.30,0.00


The `reference` column corresponds to the bins generated by DAS Tool with MetaDecoder and `query` without MetaDecoder.

Now we add columns that'll contain the comparison results for completeness and contamination. Each difference is the metric with MetaDecoder minus the same metric without MetaDecoder.

Thus, a **positive number** for completeness means that the inclusion of MetaDecoder improved the result, while a **positive number** for contamination means that it worsened the result.
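
As a worked example of the subtraction, here is one Kalø Vig pair with the values from the `kaloevig_final` table above (`vamb_S1C8497` without MetaDecoder, `metadecoder_kaloevig.metadecoder.1115` with MetaDecoder). A positive difference simply means that the value with MetaDecoder is the larger one.

```python
import pandas as pd

# Values for one Kalø Vig pair, taken from the kaloevig_final table
pair = pd.DataFrame(
    {
        "query": ["vamb_S1C8497"],
        "reference": ["metadecoder_kaloevig.metadecoder.1115"],
        "Completeness_no_MD": [77.59],
        "Completeness_MD": [89.66],
        "Contamination_no_MD": [0.00],
        "Contamination_MD": [3.61],
    }
)

# Difference = metric with MetaDecoder minus metric without MetaDecoder
pair["completeness_diff"] = (
    pair["Completeness_MD"] - pair["Completeness_no_MD"]
).round(2)
pair["contamination_diff"] = (
    pair["Contamination_MD"] - pair["Contamination_no_MD"]
).round(2)

print(pair[["completeness_diff", "contamination_diff"]])
```

For this pair both differences are positive: completeness is 12.07 points higher and contamination 3.61 points higher with MetaDecoder.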

For reproducibility, the function below will compute all differences for a given dataset.

In [41]:
def comp_diff(df):
    """
    Compute differences in metrics of MetaDecoder and non-MetaDecoder dataset.
    """

    # Difference in completeness
    df["completeness_diff"] = df["Completeness_MD"] - df["Completeness_no_MD"]

    # Difference in contamination
    df["contamination_diff"] = df["Contamination_MD"] - df["Contamination_no_MD"]

    # Difference in # of contigs
    df["#_contigs_diff"] = df["# contigs_MD"] - df["# contigs_no_MD"]

    # Difference in largest contig lengths
    df["largest_contig_diff"] = df["Largest contig_MD"] - df["Largest contig_no_MD"]

    # Difference in total contig length
    df["total_length_diff"] = df["Total length_MD"] - df["Total length_no_MD"]

    # Difference in N50
    df["N50_diff"] = df["N50_MD"] - df["N50_no_MD"]

    # Difference in L50
    df["L50_diff"] = df["L50_MD"] - df["L50_no_MD"]

    return df

To make the table easier to read, I am adding the suffix `_MD` (MetaDecoder) to the `reference` column and `_no_MD` to the `query` column.

In [42]:
# New column names
final_cols = {"reference": "reference_MD", "query": "query_no_MD"}

# Rename columns
kaloevig_final.rename(columns=final_cols, inplace=True)
loegten_final.rename(columns=final_cols, inplace=True)

In [43]:
kaloevig_final.head()

Unnamed: 0,query_no_MD,Completeness_no_MD,Contamination_no_MD,reference_MD,Completeness_MD,Contamination_MD
0,maxbin_bin.10,81.9,1.72,maxbin_bin.10_sub,81.9,1.72
1,maxbin_bin.114,87.93,5.33,maxbin_bin.114,87.93,5.33
2,maxbin_bin.116_sub,48.28,0.47,maxbin_bin.116_sub,48.28,0.47
3,maxbin_bin.121_sub,76.65,7.05,maxbin_bin.121_sub,76.65,5.33
4,maxbin_bin.128_sub,82.76,8.62,maxbin_bin.128_sub,82.76,8.62


In [44]:
loegten_final.head()

Unnamed: 0,query_no_MD,Completeness_no_MD,Contamination_no_MD,reference_MD,Completeness_MD,Contamination_MD
0,maxbin_bin.117_sub,79.31,1.72,metadecoder_loegten.metadecoder.1170,89.66,0.0
1,maxbin_bin.121_sub,80.17,10.34,maxbin_bin.121_sub,80.17,10.34
2,maxbin_bin.130_sub,94.83,21.87,maxbin_bin.130_sub,94.83,21.87
3,maxbin_bin.135,96.47,1.72,maxbin_bin.135_sub,95.61,1.72
4,maxbin_bin.13_sub,95.69,2.51,maxbin_bin.13_sub,95.69,2.51


#### \# of contigs, Maximum contig length, Total contig length, N50, L50

Now add the remaining metrics to the table.

In [45]:
# Columns to select from the Quast datasets
cols_quast = ["# contigs", "Largest contig", "Total length", "N50", "L50"]

# Reformat Kalø Vig Quast datasets and select columns
kaloevig_md_quast = reformat_quast("results_kaloevig_dastool_metadecoder_quast.tsv")[
    cols_quast
]
kaloevig_no_md_quast = reformat_quast(
    "results_kaloevig_dastool_no_metadecoder_quast.tsv"
)[cols_quast]

# Reformat Løgten Quast datasets and select columns
loegten_md_quast = reformat_quast("results_loegten_dastool_metadecoder_quast.tsv")[
    cols_quast
]
loegten_no_md_quast = reformat_quast(
    "results_loegten_dastool_no_metadecoder_quast.tsv"
)[cols_quast]

For some reason, merging the datasets on an index and a column while adding suffixes swaps the metrics (what should be the metric with MetaDecoder has the values of the metric without MetaDecoder, and vice versa). Thus, I'll turn the indices of `kaloevig/loegten_md_quast` and `kaloevig/loegten_no_md_quast` into columns and merge on those.
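
I can't tell from the cells here what caused the swap, but one thing worth ruling out in any merge with suffixes is their order: in `DataFrame.merge`, the first suffix is applied to the left frame (the one `merge` is called on) and the second to the right frame. A minimal sketch with hypothetical N50 values:

```python
import pandas as pd

# Hypothetical N50 values for one bin, without and with MetaDecoder
no_md = pd.DataFrame({"assembly": ["bin_a"], "N50": [100.0]})
md = pd.DataFrame({"assembly": ["bin_a"], "N50": [200.0]})

# First suffix goes to the left frame (no_md), second to the right frame (md)
merged = no_md.merge(md, on="assembly", suffixes=("_no_MD", "_MD"))
print(merged)

# Swapping the frames without swapping the suffixes mislabels the columns
swapped = md.merge(no_md, on="assembly", suffixes=("_no_MD", "_MD"))
print(swapped)
```

In `merged` the labels match the data, while in `swapped` the column `N50_no_MD` holds the MetaDecoder value, which is the kind of swap described above.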

In [46]:
ind_cols = {"index": "assembly"}
# Kalø Vig
kaloevig_md_quast = kaloevig_md_quast.reset_index().rename(columns=ind_cols)

kaloevig_no_md_quast = kaloevig_no_md_quast.reset_index().rename(columns=ind_cols)

## Rename indices
kaloevig_md_quast.index = kaloevig_md_quast.index.set_names(["#"])
kaloevig_no_md_quast.index = kaloevig_no_md_quast.index.set_names(["#"])

# Løgten
loegten_md_quast = loegten_md_quast.reset_index().rename(columns=ind_cols)

loegten_no_md_quast = loegten_no_md_quast.reset_index().rename(columns=ind_cols)

## Rename indices
loegten_md_quast.index = loegten_md_quast.index.set_names(["#"])
loegten_no_md_quast.index = loegten_no_md_quast.index.set_names(["#"])

In [47]:
kaloevig_md_quast.head()

#,assembly,# contigs,Largest contig,Total length,N50,L50
0,maxbin_bin.105_sub,843.0,49404.0,4637323.0,6661.0,215.0
1,maxbin_bin.107,364.0,63160.0,3745902.0,16094.0,74.0
2,maxbin_bin.10_sub,279.0,78889.0,3171641.0,17796.0,52.0
3,maxbin_bin.114,398.0,109419.0,6091574.0,25439.0,70.0
4,maxbin_bin.116_sub,238.0,169044.0,2997117.0,27636.0,26.0


Below are two rows to confirm the accuracy of the merging process.

In [48]:
kaloevig_md_quast[
    kaloevig_md_quast["assembly"] == "metadecoder_kaloevig.metadecoder.1115"
]

#,assembly,# contigs,Largest contig,Total length,N50,L50
214,metadecoder_kaloevig.metadecoder.1115,313.0,54943.0,3116509.0,13492.0,73.0


In [49]:
kaloevig_no_md_quast[kaloevig_no_md_quast["assembly"] == "vamb_S1C8497"]

#,assembly,# contigs,Largest contig,Total length,N50,L50
276,vamb_S1C8497,203.0,54943.0,2528295.0,15438.0,54.0


In [50]:
loegten_md_quast[loegten_md_quast["assembly"] == "vamb_S1C7595_sub"]

#,assembly,# contigs,Largest contig,Total length,N50,L50
246,vamb_S1C7595_sub,129.0,139870.0,3439320.0,37218.0,30.0


In [51]:
loegten_no_md_quast[loegten_no_md_quast["assembly"] == "vamb_S1C7595"]

#,assembly,# contigs,Largest contig,Total length,N50,L50
198,vamb_S1C7595,141.0,139870.0,3960912.0,38894.0,33.0


Merge MetaDecoder dataset with `loegten/kaloevig_final` on `reference` and non-MetaDecoder dataset on `query`.

In [52]:
# MD + reference
kaloevig_final = kaloevig_final.merge(
    kaloevig_md_quast, left_on="reference_MD", right_on="assembly"
)
loegten_final = loegten_final.merge(
    loegten_md_quast, left_on="reference_MD", right_on="assembly"
)

# No-MD + query
kaloevig_final = kaloevig_no_md_quast.merge(
    kaloevig_final,
    left_on="assembly",
    right_on="query_no_MD",
    suffixes=["_no_MD", "_MD"],
)
loegten_final = loegten_no_md_quast.merge(
    loegten_final,
    left_on="assembly",
    right_on="query_no_MD",
    suffixes=["_no_MD", "_MD"],
)

In [53]:
kaloevig_final.head()

Unnamed: 0,assembly_no_MD,# contigs_no_MD,Largest contig_no_MD,Total length_no_MD,N50_no_MD,L50_no_MD,query_no_MD,Completeness_no_MD,Contamination_no_MD,reference_MD,Completeness_MD,Contamination_MD,assembly_MD,# contigs_MD,Largest contig_MD,Total length_MD,N50_MD,L50_MD
0,maxbin_bin.10,280.0,78889.0,3174270.0,17796.0,52.0,maxbin_bin.10,81.9,1.72,maxbin_bin.10_sub,81.9,1.72,maxbin_bin.10_sub,279.0,78889.0,3171641.0,17796.0,52.0
1,maxbin_bin.114,398.0,109419.0,6091574.0,25439.0,70.0,maxbin_bin.114,87.93,5.33,maxbin_bin.114,87.93,5.33,maxbin_bin.114,398.0,109419.0,6091574.0,25439.0,70.0
2,maxbin_bin.116_sub,239.0,169044.0,3000122.0,27636.0,26.0,maxbin_bin.116_sub,48.28,0.47,maxbin_bin.116_sub,48.28,0.47,maxbin_bin.116_sub,238.0,169044.0,2997117.0,27636.0,26.0
3,maxbin_bin.121_sub,146.0,83833.0,2112080.0,22270.0,26.0,maxbin_bin.121_sub,76.65,7.05,maxbin_bin.121_sub,76.65,5.33,maxbin_bin.121_sub,134.0,83833.0,1989826.0,23286.0,24.0
4,maxbin_bin.128_sub,233.0,128670.0,3863717.0,38217.0,36.0,maxbin_bin.128_sub,82.76,8.62,maxbin_bin.128_sub,82.76,8.62,maxbin_bin.128_sub,232.0,128670.0,3853027.0,38217.0,36.0


In [54]:
kaloevig_final.shape

(148, 18)

In [55]:
kaloevig_final.sort_values("Completeness_no_MD", ascending=True).head(40)

Unnamed: 0,assembly_no_MD,# contigs_no_MD,Largest contig_no_MD,Total length_no_MD,N50_no_MD,L50_no_MD,query_no_MD,Completeness_no_MD,Contamination_no_MD,reference_MD,Completeness_MD,Contamination_MD,assembly_MD,# contigs_MD,Largest contig_MD,Total length_MD,N50_MD,L50_MD
18,maxbin_bin.4,17.0,16450.0,126483.0,10257.0,5.0,maxbin_bin.4,25.36,0.0,maxbin_bin.4,25.36,0.0,maxbin_bin.4,17.0,16450.0,126483.0,10257.0,5.0
71,metabat_bin.310,258.0,83467.0,3333323.0,25154.0,40.0,metabat_bin.310,33.82,1.18,metabat_bin.310,33.82,1.18,metabat_bin.310,258.0,83467.0,3333323.0,25154.0,40.0
26,maxbin_bin.586,225.0,143447.0,2329597.0,26056.0,25.0,maxbin_bin.586,33.82,2.74,maxbin_bin.586,33.82,2.74,maxbin_bin.586,225.0,143447.0,2329597.0,26056.0,25.0
89,metabat_bin.466,262.0,59839.0,2589623.0,16809.0,48.0,metabat_bin.466,47.41,0.0,metadecoder_kaloevig.metadecoder.1639_sub,77.66,4.02,metadecoder_kaloevig.metadecoder.1639_sub,474.0,59839.0,4006755.0,12921.0,90.0
2,maxbin_bin.116_sub,239.0,169044.0,3000122.0,27636.0,26.0,maxbin_bin.116_sub,48.28,0.47,maxbin_bin.116_sub,48.28,0.47,maxbin_bin.116_sub,238.0,169044.0,2997117.0,27636.0,26.0
7,maxbin_bin.186_sub,295.0,76732.0,2775691.0,17946.0,45.0,maxbin_bin.186_sub,57.29,5.17,maxbin_bin.186_sub,57.29,5.17,maxbin_bin.186_sub,295.0,76732.0,2775691.0,17946.0,45.0
96,metabat_bin.513,396.0,31158.0,3325675.0,10658.0,106.0,metabat_bin.513,57.76,0.0,metabat_bin.513,57.76,0.0,metabat_bin.513,396.0,31158.0,3325675.0,10658.0,106.0
68,metabat_bin.302,166.0,31624.0,1330667.0,10806.0,45.0,metabat_bin.302,60.66,1.57,metabat_bin.302_sub,60.66,1.57,metabat_bin.302_sub,165.0,31624.0,1327943.0,10806.0,45.0
88,metabat_bin.456,166.0,41239.0,1737299.0,13179.0,42.0,metabat_bin.456,61.21,0.47,metabat_bin.456,61.21,0.47,metabat_bin.456,166.0,41239.0,1737299.0,13179.0,42.0
147,vamb_S1C9779_sub,83.0,132592.0,3150102.0,56596.0,20.0,vamb_S1C9779_sub,61.29,0.0,metadecoder_kaloevig.metadecoder.666_sub,73.35,8.62,metadecoder_kaloevig.metadecoder.666_sub,125.0,132592.0,4095697.0,40050.0,29.0


In [56]:
loegten_final.head()

Unnamed: 0,assembly_no_MD,# contigs_no_MD,Largest contig_no_MD,Total length_no_MD,N50_no_MD,L50_no_MD,query_no_MD,Completeness_no_MD,Contamination_no_MD,reference_MD,Completeness_MD,Contamination_MD,assembly_MD,# contigs_MD,Largest contig_MD,Total length_MD,N50_MD,L50_MD
0,maxbin_bin.117_sub,172.0,204078.0,4088783.0,66256.0,18.0,maxbin_bin.117_sub,79.31,1.72,metadecoder_loegten.metadecoder.1170,89.66,0.0,metadecoder_loegten.metadecoder.1170,104.0,204078.0,3886821.0,66872.0,17.0
1,maxbin_bin.121_sub,376.0,95467.0,3148782.0,18435.0,45.0,maxbin_bin.121_sub,80.17,10.34,maxbin_bin.121_sub,80.17,10.34,maxbin_bin.121_sub,376.0,95467.0,3148782.0,18435.0,45.0
2,maxbin_bin.130_sub,318.0,97860.0,3703001.0,21396.0,50.0,maxbin_bin.130_sub,94.83,21.87,maxbin_bin.130_sub,94.83,21.87,maxbin_bin.130_sub,316.0,97860.0,3693587.0,21396.0,50.0
3,maxbin_bin.135,327.0,228522.0,7107400.0,51065.0,40.0,maxbin_bin.135,96.47,1.72,maxbin_bin.135_sub,95.61,1.72,maxbin_bin.135_sub,326.0,228522.0,7095657.0,52238.0,39.0
4,maxbin_bin.13_sub,230.0,100650.0,3061740.0,23802.0,35.0,maxbin_bin.13_sub,95.69,2.51,maxbin_bin.13_sub,95.69,2.51,maxbin_bin.13_sub,230.0,100650.0,3061740.0,23802.0,35.0


### Differences

In [57]:
kaloevig_final.shape

(148, 18)

In [58]:
loegten_final.shape

(120, 18)

In [59]:
kaloevig_diff = comp_diff(kaloevig_final)
loegten_diff = comp_diff(loegten_final)

In [60]:
kaloevig_final[kaloevig_final["reference_MD"] == "metadecoder_kaloevig.metadecoder.700"]

Unnamed: 0,assembly_no_MD,# contigs_no_MD,Largest contig_no_MD,Total length_no_MD,N50_no_MD,L50_no_MD,query_no_MD,Completeness_no_MD,Contamination_no_MD,reference_MD,Completeness_MD,Contamination_MD,assembly_MD,# contigs_MD,Largest contig_MD,Total length_MD,N50_MD,L50_MD,completeness_diff,contamination_diff,#_contigs_diff,largest_contig_diff,total_length_diff,N50_diff,L50_diff


In [61]:
kaloevig_diff.sort_values("Completeness_no_MD")

Unnamed: 0,assembly_no_MD,# contigs_no_MD,Largest contig_no_MD,Total length_no_MD,N50_no_MD,L50_no_MD,query_no_MD,Completeness_no_MD,Contamination_no_MD,reference_MD,Completeness_MD,Contamination_MD,assembly_MD,# contigs_MD,Largest contig_MD,Total length_MD,N50_MD,L50_MD,completeness_diff,contamination_diff,#_contigs_diff,largest_contig_diff,total_length_diff,N50_diff,L50_diff
18,maxbin_bin.4,17.0,16450.0,126483.0,10257.0,5.0,maxbin_bin.4,25.36,0.00,maxbin_bin.4,25.36,0.00,maxbin_bin.4,17.0,16450.0,126483.0,10257.0,5.0,0.00,0.00,0.0,0.0,0.0,0.0,0.0
71,metabat_bin.310,258.0,83467.0,3333323.0,25154.0,40.0,metabat_bin.310,33.82,1.18,metabat_bin.310,33.82,1.18,metabat_bin.310,258.0,83467.0,3333323.0,25154.0,40.0,0.00,0.00,0.0,0.0,0.0,0.0,0.0
26,maxbin_bin.586,225.0,143447.0,2329597.0,26056.0,25.0,maxbin_bin.586,33.82,2.74,maxbin_bin.586,33.82,2.74,maxbin_bin.586,225.0,143447.0,2329597.0,26056.0,25.0,0.00,0.00,0.0,0.0,0.0,0.0,0.0
89,metabat_bin.466,262.0,59839.0,2589623.0,16809.0,48.0,metabat_bin.466,47.41,0.00,metadecoder_kaloevig.metadecoder.1639_sub,77.66,4.02,metadecoder_kaloevig.metadecoder.1639_sub,474.0,59839.0,4006755.0,12921.0,90.0,30.25,4.02,212.0,0.0,1417132.0,-3888.0,42.0
2,maxbin_bin.116_sub,239.0,169044.0,3000122.0,27636.0,26.0,maxbin_bin.116_sub,48.28,0.47,maxbin_bin.116_sub,48.28,0.47,maxbin_bin.116_sub,238.0,169044.0,2997117.0,27636.0,26.0,0.00,0.00,-1.0,0.0,-3005.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
69,metabat_bin.309_sub,238.0,92110.0,3334483.0,24114.0,40.0,metabat_bin.309_sub,100.00,9.48,metadecoder_kaloevig.metadecoder.1633,96.55,0.86,metadecoder_kaloevig.metadecoder.1633,154.0,92110.0,3018838.0,27530.0,34.0,-3.45,-8.62,-84.0,0.0,-315645.0,3416.0,-6.0
67,metabat_bin.28,360.0,105846.0,5480739.0,24881.0,66.0,metabat_bin.28,100.00,6.52,metabat_bin.28_sub,100.00,6.52,metabat_bin.28_sub,359.0,105846.0,5477847.0,24881.0,66.0,0.00,0.00,-1.0,0.0,-2892.0,0.0,0.0
39,metabat_bin.11,147.0,153117.0,3777217.0,63577.0,19.0,metabat_bin.11,100.00,1.72,metadecoder_kaloevig.metadecoder.683,93.97,3.45,metadecoder_kaloevig.metadecoder.683,335.0,153117.0,4980782.0,44272.0,32.0,-6.03,1.73,188.0,0.0,1203565.0,-19305.0,13.0
111,metabat_bin.92,291.0,162915.0,3361454.0,20445.0,48.0,metabat_bin.92,100.00,0.31,metadecoder_kaloevig.metadecoder.1208_sub,100.00,0.00,metadecoder_kaloevig.metadecoder.1208_sub,216.0,162915.0,3215280.0,20853.0,45.0,0.00,-0.31,-75.0,0.0,-146174.0,408.0,-3.0


In [62]:
kaloevig_diff

Unnamed: 0,assembly_no_MD,# contigs_no_MD,Largest contig_no_MD,Total length_no_MD,N50_no_MD,L50_no_MD,query_no_MD,Completeness_no_MD,Contamination_no_MD,reference_MD,Completeness_MD,Contamination_MD,assembly_MD,# contigs_MD,Largest contig_MD,Total length_MD,N50_MD,L50_MD,completeness_diff,contamination_diff,#_contigs_diff,largest_contig_diff,total_length_diff,N50_diff,L50_diff
0,maxbin_bin.10,280.0,78889.0,3174270.0,17796.0,52.0,maxbin_bin.10,81.90,1.72,maxbin_bin.10_sub,81.90,1.72,maxbin_bin.10_sub,279.0,78889.0,3171641.0,17796.0,52.0,0.00,0.00,-1.0,0.0,-2629.0,0.0,0.0
1,maxbin_bin.114,398.0,109419.0,6091574.0,25439.0,70.0,maxbin_bin.114,87.93,5.33,maxbin_bin.114,87.93,5.33,maxbin_bin.114,398.0,109419.0,6091574.0,25439.0,70.0,0.00,0.00,0.0,0.0,0.0,0.0,0.0
2,maxbin_bin.116_sub,239.0,169044.0,3000122.0,27636.0,26.0,maxbin_bin.116_sub,48.28,0.47,maxbin_bin.116_sub,48.28,0.47,maxbin_bin.116_sub,238.0,169044.0,2997117.0,27636.0,26.0,0.00,0.00,-1.0,0.0,-3005.0,0.0,0.0
3,maxbin_bin.121_sub,146.0,83833.0,2112080.0,22270.0,26.0,maxbin_bin.121_sub,76.65,7.05,maxbin_bin.121_sub,76.65,5.33,maxbin_bin.121_sub,134.0,83833.0,1989826.0,23286.0,24.0,0.00,-1.72,-12.0,0.0,-122254.0,1016.0,-2.0
4,maxbin_bin.128_sub,233.0,128670.0,3863717.0,38217.0,36.0,maxbin_bin.128_sub,82.76,8.62,maxbin_bin.128_sub,82.76,8.62,maxbin_bin.128_sub,232.0,128670.0,3853027.0,38217.0,36.0,0.00,0.00,-1.0,0.0,-10690.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
143,vamb_S1C7378,85.0,68987.0,2033899.0,28980.0,25.0,vamb_S1C7378,70.69,0.00,vamb_S1C7378,70.69,0.00,vamb_S1C7378,85.0,68987.0,2033899.0,28980.0,25.0,0.00,0.00,0.0,0.0,0.0,0.0,0.0
144,vamb_S1C8489,97.0,88040.0,2654240.0,36022.0,28.0,vamb_S1C8489,82.76,0.00,vamb_S1C8489,82.76,0.00,vamb_S1C8489,97.0,88040.0,2654240.0,36022.0,28.0,0.00,0.00,0.0,0.0,0.0,0.0,0.0
145,vamb_S1C8497,203.0,54943.0,2528295.0,15438.0,54.0,vamb_S1C8497,77.59,0.00,metadecoder_kaloevig.metadecoder.1115,89.66,3.61,metadecoder_kaloevig.metadecoder.1115,313.0,54943.0,3116509.0,13492.0,73.0,12.07,3.61,110.0,0.0,588214.0,-1946.0,19.0
146,vamb_S1C9541,56.0,71702.0,782798.0,19957.0,12.0,vamb_S1C9541,77.43,0.00,vamb_S1C9541,77.43,0.00,vamb_S1C9541,56.0,71702.0,782798.0,19957.0,12.0,0.00,0.00,0.0,0.0,0.0,0.0,0.0


In [63]:
kaloevig_diff[kaloevig_diff["assembly_no_MD"] == "maxbin_bin.55_sub"]

Unnamed: 0,assembly_no_MD,# contigs_no_MD,Largest contig_no_MD,Total length_no_MD,N50_no_MD,L50_no_MD,query_no_MD,Completeness_no_MD,Contamination_no_MD,reference_MD,Completeness_MD,Contamination_MD,assembly_MD,# contigs_MD,Largest contig_MD,Total length_MD,N50_MD,L50_MD,completeness_diff,contamination_diff,#_contigs_diff,largest_contig_diff,total_length_diff,N50_diff,L50_diff
23,maxbin_bin.55_sub,297.0,202267.0,3827608.0,28901.0,34.0,maxbin_bin.55_sub,97.41,11.36,maxbin_bin.55_sub,95.69,9.64,maxbin_bin.55_sub,279.0,202267.0,3695094.0,30901.0,31.0,-1.72,-1.72,-18.0,0.0,-132514.0,2000.0,-3.0


In [64]:
# Save final datasets to csv
loegten_diff.to_csv("loegten_difference.csv", index=False)
kaloevig_diff.to_csv("kaloevig_difference.csv", index=False)

## Conclusion

It is difficult to draw a firm conclusion about which approach (DAS Tool with or without MetaDecoder) is better, because the metrics sometimes point in opposite directions. For example, in the `loegten_diff` dataset, the bin `metabat_bin.294` (matched to `metadecoder_loegten.metadecoder.365`) has more or less the same completeness (difference of 1.72) and contamination (difference of -2.20) in both runs, but the MetaDecoder bin is 704327 bp shorter in total length, while its N50 is 2804 bp higher.
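One way to see how often the metrics disagree is to flag the bins whose difference columns point in opposite directions. The following is a minimal sketch, not the method used in this notebook. It assumes that higher completeness, lower contamination, and higher N50 each count as "better", and it uses a small hand-copied subset of `kaloevig_diff` rows; the names `direction` and `conflicting` are introduced here only for illustration.

```python
import numpy as np
import pandas as pd

# Small subset of kaloevig_diff, copied by hand from the output above
diff = pd.DataFrame(
    {
        "assembly_no_MD": [
            "maxbin_bin.4",
            "metabat_bin.466",
            "metabat_bin.309_sub",
            "metabat_bin.92",
        ],
        "completeness_diff": [0.00, 30.25, -3.45, 0.00],
        "contamination_diff": [0.00, 4.02, -8.62, -0.31],
        "N50_diff": [0.0, -3888.0, 3416.0, 408.0],
    }
)

# Assumption: +1 means the MetaDecoder run is better, -1 worse, 0 unchanged.
# Contamination is negated because a lower value is the better one.
direction = pd.DataFrame(
    {
        "completeness": np.sign(diff["completeness_diff"]),
        "contamination": -np.sign(diff["contamination_diff"]),
        "N50": np.sign(diff["N50_diff"]),
    }
)

# A bin is "conflicting" if at least one metric improved and at least one got worse
diff["conflicting"] = (direction.max(axis=1) > 0) & (direction.min(axis=1) < 0)

conflicting_bins = diff.loc[diff["conflicting"], "assembly_no_MD"].tolist()
print(conflicting_bins)
```

Total length is left out of the sketch on purpose, since a longer bin is not necessarily a better one.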

Thus, the best approach would be to select medium-quality bins (completeness ≥50%, contamination <10%), as defined in [this paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6436528/) (*Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea*), from three groups: bins found exclusively with MetaDecoder, bins found exclusively without MetaDecoder, and bins whose metrics do not differ (all differences are 0; in the end, only completeness and contamination were used). These are then combined with the bins that do differ, as illustrated in this notebook. This approach exploits the strengths of both algorithms.
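The medium-quality threshold itself can be expressed as a simple filter. The sketch below is only an illustration of that threshold under stated assumptions: the helper `is_medium_quality` and the columns `passes_no_MD` and `passes_MD` are hypothetical names, and the rows are a hand-copied subset of `kaloevig_diff`.

```python
import pandas as pd

# Medium-quality thresholds quoted above (completeness >= 50%, contamination < 10%)
MIN_COMPLETENESS = 50.0
MAX_CONTAMINATION = 10.0


def is_medium_quality(completeness, contamination):
    """Return a boolean mask of bins that meet the medium-quality threshold."""
    return (completeness >= MIN_COMPLETENESS) & (contamination < MAX_CONTAMINATION)


# Small subset of kaloevig_diff, copied by hand from the outputs above
bins = pd.DataFrame(
    {
        "assembly_no_MD": [
            "maxbin_bin.4",
            "metabat_bin.466",
            "maxbin_bin.55_sub",
            "metabat_bin.11",
        ],
        "Completeness_no_MD": [25.36, 47.41, 97.41, 100.00],
        "Contamination_no_MD": [0.00, 0.00, 11.36, 1.72],
        "Completeness_MD": [25.36, 77.66, 95.69, 93.97],
        "Contamination_MD": [0.00, 4.02, 9.64, 3.45],
    }
)

# Check each version of the bin (without and with MetaDecoder) separately
bins["passes_no_MD"] = is_medium_quality(
    bins["Completeness_no_MD"], bins["Contamination_no_MD"]
)
bins["passes_MD"] = is_medium_quality(
    bins["Completeness_MD"], bins["Contamination_MD"]
)

# Keep bins where at least one version is medium quality
passing = bins[bins["passes_no_MD"] | bins["passes_MD"]]
print(passing[["assembly_no_MD", "passes_no_MD", "passes_MD"]])
```

In this subset, `maxbin_bin.55_sub` and `metabat_bin.466` pass only in their MetaDecoder version, which shows why both runs are worth keeping as candidates.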

The exact approach is described in the notebook `selecting_best_bins.ipynb`.