# Selecting the Best Bins

As noted in the `statistics` notebook, the best approach is to select medium-quality bins (completeness ≥50%, contamination <10%), as defined in [this paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6436528/) (*Minimum information about a single amplified genome (MISAG) and a metagenome-assembled genome (MIMAG) of bacteria and archaea*). These bins are drawn from three groups: bins exclusive to MetaDecoder, bins exclusive to the non-MetaDecoder run, and bin pairs with no difference in metrics (all differences are 0). They are then combined with the best bins from the pairs whose metrics differ, which are examined in this notebook. This approach exploits the strengths of both algorithms.

FastANI was run only on bins with N50 > 10 Kbp, and pairs with ANI < 80% were excluded (a few pairs at around 79.5% were initially included but were later dropped because they duplicated pairs with a higher ANI). The `statistics` notebook additionally removed bin pairs with low ANI and duplicates. As a result, removing the bins that FastANI reported as identical between the two algorithms does not guarantee that the remaining bins are free of duplicates.

This process generated two datasets: `kaloevig_query.csv` with 148 bins and `loegten_query.csv` with 120 bins. We will use them to extract, from the Kalø Vig and Løgten CheckM datasets, the bins that are exclusive to one of the two approaches.

The bins that have different metrics between the two approaches will be worked on in the second step.

## Filtering CheckM Datasets

In the first step, we take the `reference` (MetaDecoder) and `query` (non-MetaDecoder) bins and use them as masks to keep only the exclusively MetaDecoder or non-MetaDecoder bins from the CheckM Bacteria datasets.

In [1]:
import pandas as pd

# Display all columns
pd.set_option("display.max_columns", None)

In [2]:
checkm_cols = ["Bin Id", "Completeness", "Contamination"]

# Kalø Vig CheckM datasets
kaloevig_md_checkm = pd.read_table(
    "results_bacteria_kaloevig_dastool_metadecoder.tsv", usecols=checkm_cols
)
kaloevig_no_md_checkm = pd.read_table(
    "results_bacteria_kaloevig_dastool_no_metadecoder.tsv", usecols=checkm_cols
)

# Løgten CheckM datasets
loegten_md_checkm = pd.read_table(
    "results_bacteria_loegten_dastool_metadecoder.tsv", usecols=checkm_cols
)
loegten_no_md_checkm = pd.read_table(
    "results_bacteria_loegten_dastool_no_metadecoder.tsv", usecols=checkm_cols
)

In [3]:
# Open datasets to filter on
kaloevig_query = pd.read_csv("kaloevig_query.csv")
loegten_query = pd.read_csv("loegten_query.csv")

In [4]:
kaloevig_query.shape[0]

148

In [5]:
loegten_query.shape[0]

120

In [6]:
# Keep bins exclusive to one approach in Kaloevig (drop entries that appear in the query dataset)
kaloevig_md_checkm_filt = kaloevig_md_checkm[
    ~kaloevig_md_checkm["Bin Id"].isin(kaloevig_query["reference"])
]

kaloevig_no_md_checkm_filt = kaloevig_no_md_checkm[
    ~kaloevig_no_md_checkm["Bin Id"].isin(kaloevig_query["query"])
]

# Filter bins exclusive to Loegten
loegten_md_checkm_filt = loegten_md_checkm[
    ~loegten_md_checkm["Bin Id"].isin(loegten_query["reference"])
]

loegten_no_md_checkm_filt = loegten_no_md_checkm[
    ~loegten_no_md_checkm["Bin Id"].isin(loegten_query["query"])
]
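
The masking step above can be illustrated on a toy table. This is a minimal sketch with made-up bin names and values (not taken from the real datasets); it only shows how `isin` combined with `~` keeps the rows that do not appear in the pair list.

```python
import pandas as pd

# Toy CheckM table (hypothetical bin names and values)
checkm = pd.DataFrame(
    {
        "Bin Id": ["bin.1", "bin.2", "bin.3"],
        "Completeness": [90.0, 55.0, 70.0],
        "Contamination": [1.0, 4.0, 2.0],
    }
)

# Toy FastANI pairs: bin.2 was matched to a bin from the other approach
pairs = pd.DataFrame({"reference": ["bin.2"], "query": ["other.7"]})

# isin() marks the matched bins; ~ inverts the mask to keep the exclusive ones
exclusive = checkm[~checkm["Bin Id"].isin(pairs["reference"])]

print(exclusive["Bin Id"].tolist())
```

Only `bin.1` and `bin.3` remain, because `bin.2` has a counterpart in the pair list.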

## Adding the Best Bins

In [7]:
def filter_cont_compl(df):
    """
    Filter a dataset by completeness and contamination.
    """
    return df.loc[(df["Completeness"] >= 50) & (df["Contamination"] < 10)]
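
To make the threshold boundaries explicit, here is a small sketch that applies the same filter to invented values. The bin names and numbers are hypothetical; the point is that completeness of exactly 50 passes (`>=`) while contamination of exactly 10 does not (`<`).

```python
import pandas as pd


def filter_cont_compl(df):
    """Same filter as in the notebook: completeness >= 50 and contamination < 10."""
    return df.loc[(df["Completeness"] >= 50) & (df["Contamination"] < 10)]


# Hypothetical bins sitting on or near the thresholds
toy = pd.DataFrame(
    {
        "Bin Id": ["a", "b", "c", "d"],
        "Completeness": [50.0, 49.9, 80.0, 80.0],
        "Contamination": [9.9, 0.0, 10.0, 5.0],
    }
)

kept = filter_cont_compl(toy)["Bin Id"].tolist()
print(kept)
```

Bins `a` and `d` are kept; `b` fails on completeness and `c` fails on contamination.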

In [8]:
def concat_df(df1, df2):
    """
    Concatenate two dataframes.
    """
    return pd.concat([df1, df2], ignore_index=True)

In [9]:
# Concatenate two datasets vertically
kaloevig_final = concat_df(
    filter_cont_compl(kaloevig_md_checkm_filt),
    filter_cont_compl(kaloevig_no_md_checkm_filt),
)

loegten_final = concat_df(
    filter_cont_compl(loegten_md_checkm_filt),
    filter_cont_compl(loegten_no_md_checkm_filt),
)

# Drop duplicate rows (rows in which every value is identical)
kaloevig_final = kaloevig_final.drop_duplicates()
loegten_final = loegten_final.drop_duplicates()

However, `drop_duplicates` removes only rows that are identical in every column, so a `Bin Id` can still appear more than once with different metrics. Only one bin is duplicated (two rows), so it is easy to drop the worse row by hand.

In [10]:
kaloevig_final[kaloevig_final["Bin Id"].duplicated(keep=False)]

Unnamed: 0,Bin Id,Completeness,Contamination
13,maxbin_bin.591_sub,62.07,8.62
115,maxbin_bin.591_sub,63.79,8.62


In [11]:
# Drop index 13: same contamination as index 115 but lower completeness
kaloevig_final = kaloevig_final.drop(index=[13])
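
Dropping a hard-coded index works here but depends on the row order. As an alternative sketch (not what the notebook does), the duplicated `Bin Id` could be resolved by sorting so that the preferred row comes first and then keeping the first row per bin. The ordering used below (lower contamination, then higher completeness) is an assumption that mirrors the selection rule applied later in this notebook; the two rows are copied from the output above.

```python
import pandas as pd

# The two duplicated rows shown above, with their original index labels
dupes = pd.DataFrame(
    {
        "Bin Id": ["maxbin_bin.591_sub", "maxbin_bin.591_sub"],
        "Completeness": [62.07, 63.79],
        "Contamination": [8.62, 8.62],
    },
    index=[13, 115],
)

# Sort so the preferred row comes first, then keep one row per Bin Id
best = dupes.sort_values(
    ["Contamination", "Completeness"], ascending=[True, False]
).drop_duplicates(subset="Bin Id", keep="first")

print(best.index.tolist())
```

With these values the row at index 115 survives, which is the same outcome as dropping index 13 manually.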

In the Løgten dataset, there are no duplicates.

In [12]:
loegten_final[loegten_final["Bin Id"].duplicated(keep=False)]

Unnamed: 0,Bin Id,Completeness,Contamination


In [13]:
kaloevig_final[kaloevig_final["Bin Id"] == "metadecoder_kaloevig.metadecoder.700"]

Unnamed: 0,Bin Id,Completeness,Contamination
91,metadecoder_kaloevig.metadecoder.700,93.1,3.45


In [14]:
cols_to_drop = ["Completeness", "Contamination"]

# Drop the metric columns; only the Bin Id is needed from here on
kaloevig_final = kaloevig_final.drop(columns=cols_to_drop)
loegten_final = loegten_final.drop(columns=cols_to_drop)

kaloevig_final

Unnamed: 0,Bin Id
0,maxbin_bin.107
1,maxbin_bin.157_sub
2,maxbin_bin.167
3,maxbin_bin.254_sub
4,maxbin_bin.290_sub
...,...
158,metabat_bin.499
161,metabat_bin.509_sub
164,metabat_bin.556
165,metabat_bin.575_sub


In [15]:
loegten_final

Unnamed: 0,Bin Id
0,maxbin_bin.114
1,maxbin_bin.139_sub
2,maxbin_bin.25
3,maxbin_bin.382_sub
4,maxbin_bin.450
...,...
99,metabat_bin.338
100,metabat_bin.381
109,metabat_bin.47_sub
112,metabat_bin.78


Next, we add the bins that show no difference between the two approaches (every metric difference is equal to 0). I will use the bin names from the MetaDecoder dataset (column `assembly_MD`) because it also includes the MetaDecoder bins.

For the remaining bin pairs (those whose metrics differ between the two datasets), select the bin with the lower contamination; if contamination is the same, choose the more complete bin.

In [16]:
# Open 'difference' datasets
kaloevig_diff = pd.read_csv("kaloevig_difference.csv")
loegten_diff = pd.read_csv("loegten_difference.csv")

Select all the bins where every metric difference is 0, and keep only those with completeness >= 50 and contamination < 10. Since the differences are 0, it does not matter which dataset's columns we filter on. I will take the MetaDecoder bin names for the final dataset because they may also include `metadecoder` names (again, the choice is arbitrary because the paired bins do not differ).

In [17]:
# Select bins where all differences are 0s
kaloevig_no_diff_metrics = kaloevig_diff[(kaloevig_diff.iloc[:, 18:] == 0).all(axis=1)]
loegten_no_diff_metrics = loegten_diff[(loegten_diff.iloc[:, 18:] == 0).all(axis=1)]

# Rename assembly_MD to Bin Id so the result can be concatenated with kaloevig_final
kaloevig_no_diff_metrics = kaloevig_no_diff_metrics.rename(
    columns={"assembly_MD": "Bin Id"}
)
loegten_no_diff_metrics = loegten_no_diff_metrics.rename(
    columns={"assembly_MD": "Bin Id"}
)

# Filter by completeness >= 50 and contamination < 10
kaloevig_no_diff_metrics = kaloevig_no_diff_metrics[
    (kaloevig_no_diff_metrics["Completeness_MD"] >= 50)
    & (kaloevig_no_diff_metrics["Contamination_MD"] < 10)
]

loegten_no_diff_metrics = loegten_no_diff_metrics[
    (loegten_no_diff_metrics["Completeness_MD"] >= 50)
    & (loegten_no_diff_metrics["Contamination_MD"] < 10)
]

# Keep only the Bin Id column (as a dataframe) for concatenation
kaloevig_no_diff_metrics = kaloevig_no_diff_metrics[["Bin Id"]]
loegten_no_diff_metrics = loegten_no_diff_metrics[["Bin Id"]]
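
The positional slice `iloc[:, 18:]` relies on the difference columns starting at position 18. A name-based selection is one possible alternative; the sketch below assumes that every difference column ends in `_diff` (as in the header of the difference dataset) and uses a toy frame with invented values and only two difference columns.

```python
import pandas as pd

# Toy difference table (hypothetical bins and values, reduced set of columns)
toy = pd.DataFrame(
    {
        "assembly_MD": ["bin.a", "bin.b", "bin.c"],
        "Completeness_MD": [90.0, 60.0, 40.0],
        "Contamination_MD": [1.0, 3.0, 0.0],
        "completeness_diff": [0.0, 5.0, 0.0],
        "contamination_diff": [0.0, 0.0, 0.0],
    }
)

# Pick the difference columns by name instead of by position
diff_cols = [col for col in toy.columns if col.endswith("_diff")]

# Rows where every difference is 0
no_diff = toy[(toy[diff_cols] == 0).all(axis=1)]

# Apply the medium-quality thresholds and keep only the bin name
passing = no_diff[
    (no_diff["Completeness_MD"] >= 50) & (no_diff["Contamination_MD"] < 10)
]
bin_ids = passing[["assembly_MD"]].rename(columns={"assembly_MD": "Bin Id"})

print(bin_ids["Bin Id"].tolist())
```

Here `bin.b` is excluded because one difference is non-zero, and `bin.c` is excluded by the completeness threshold.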

In [18]:
# Add bins with no differences to the final dataset
kaloevig_final = concat_df(kaloevig_final, kaloevig_no_diff_metrics)
loegten_final = concat_df(loegten_final, loegten_no_diff_metrics)

Now select the bins with differences in the metrics.

In [19]:
# Select bins with differences in metrics
kaloevig_diff_metrics = kaloevig_diff[~(kaloevig_diff.iloc[:, 18:] == 0).all(axis=1)]
loegten_diff_metrics = loegten_diff[~(loegten_diff.iloc[:, 18:] == 0).all(axis=1)]

In [20]:
def add_best_bins(df):
    """
    Accept a dataframe of paired bins (assembly names for MetaDecoder and non-MetaDecoder,
    their metrics and the metric differences) and return a list of the best bins.
    A bin is added only if its completeness >= 50 and contamination < 10:
    1. If contamination_diff < 0: select assembly_MD (MetaDecoder is less contaminated);
       if contamination_diff > 0: select assembly_no_MD
    2. If contamination is equal: select the bin with the higher completeness
       (assembly_no_MD when completeness is also equal)
    """
    bins_to_add = []

    for row in df.itertuples():
        assembly_no_MD = row[1]
        assembly_MD = row[13]

        completeness_no_MD = row[8]
        contamination_no_MD = row[9]

        completeness_MD = row[11]
        contamination_MD = row[12]

        completeness_diff = row[19]
        contamination_diff = row[20]

        if contamination_diff < 0 and contamination_MD < 10 and completeness_MD >= 50:
            bins_to_add.append(assembly_MD)
        elif (
            contamination_diff > 0
            and contamination_no_MD < 10
            and completeness_no_MD >= 50
        ):
            bins_to_add.append(assembly_no_MD)
        elif contamination_diff == 0:  # If contaminations are equal
            if (
                completeness_diff > 0
                and completeness_MD >= 50
                and contamination_MD < 10
            ):
                bins_to_add.append(assembly_MD)
            elif (
                completeness_diff <= 0
                and completeness_no_MD >= 50
                and contamination_no_MD < 10
            ):
                bins_to_add.append(assembly_no_MD)
        else:
            continue
    return bins_to_add

In [21]:
# Select bins to add to the final datasets
kaloevig_bins_to_add = add_best_bins(kaloevig_diff_metrics)
loegten_bins_to_add = add_best_bins(loegten_diff_metrics)

In [22]:
kaloevig_added_bins = pd.DataFrame(kaloevig_bins_to_add, columns=["Bin Id"])
loegten_added_bins = pd.DataFrame(loegten_bins_to_add, columns=["Bin Id"])

In [23]:
kaloevig_added_bins.head()

Unnamed: 0,Bin Id
0,maxbin_bin.10
1,maxbin_bin.121_sub
2,maxbin_bin.128_sub
3,maxbin_bin.243
4,maxbin_bin.326


In [24]:
loegten_added_bins.head()

Unnamed: 0,Bin Id
0,metadecoder_loegten.metadecoder.1170
1,maxbin_bin.135
2,maxbin_bin.147_sub
3,maxbin_bin.231
4,maxbin_bin.520


Finally, add these bins to the final dataset. It is important to **note that lower contamination alone is not enough for a bin to be selected: its completeness must also be >= 50** (if it is not, it does not matter that its contamination is smaller). Among bins that pass the thresholds, the contamination metric is prioritized. For example, if the MetaDecoder bin has completeness 60 and contamination 2, and the non-MetaDecoder bin has completeness 95 and contamination 9, the MetaDecoder bin is selected because its contamination is lower (2 < 9), even though its completeness is lower (but still above the threshold of 50).
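
The selection rule can be checked on single pairs with a small stand-alone function. This is a sketch that mirrors the branches of `add_best_bins` as written; `passes` and `pick_best` are hypothetical helper names, and each bin is passed as a `(name, completeness, contamination)` tuple.

```python
def passes(completeness, contamination):
    """Medium-quality thresholds used throughout the notebook."""
    return completeness >= 50 and contamination < 10


def pick_best(md, no_md):
    """Return the name of the selected bin, or None if the pair adds nothing."""
    md_name, md_comp, md_cont = md
    no_name, no_comp, no_cont = no_md

    if md_cont < no_cont:  # MetaDecoder is less contaminated
        return md_name if passes(md_comp, md_cont) else None
    if md_cont > no_cont:  # non-MetaDecoder is less contaminated
        return no_name if passes(no_comp, no_cont) else None
    # Equal contamination: the more complete bin wins, ties go to non-MetaDecoder
    if md_comp > no_comp:
        return md_name if passes(md_comp, md_cont) else None
    return no_name if passes(no_comp, no_cont) else None


# The example from the text: lower contamination wins despite lower completeness
example = pick_best(("md", 60, 2), ("no_md", 95, 9))

# The less contaminated bin is below 50% completeness, so nothing is added
skipped = pick_best(("md", 40, 2), ("no_md", 95, 9))

print(example, skipped)
```

As the branches are written, a pair contributes no bin when its less contaminated member fails the thresholds, even if the other member would pass them.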


In [25]:
kaloevig_final = concat_df(kaloevig_final, kaloevig_added_bins)
loegten_final = concat_df(loegten_final, loegten_added_bins)

In [26]:
dup = kaloevig_final[kaloevig_final.duplicated()]
kaloevig_diff[kaloevig_diff["assembly_MD"].isin(dup["Bin Id"])]

Unnamed: 0,assembly_no_MD,# contigs_no_MD,Largest contig_no_MD,Total length_no_MD,N50_no_MD,L50_no_MD,query_no_MD,Completeness_no_MD,Contamination_no_MD,reference_MD,Completeness_MD,Contamination_MD,assembly_MD,# contigs_MD,Largest contig_MD,Total length_MD,N50_MD,L50_MD,completeness_diff,contamination_diff,#_contigs_diff,largest_contig_diff,total_length_diff,N50_diff,L50_diff


In [27]:
kaloevig_final[kaloevig_final["Bin Id"] == "maxbin_bin.55_sub"]

Unnamed: 0,Bin Id
214,maxbin_bin.55_sub


In [28]:
kaloevig_final[kaloevig_final.duplicated()]

Unnamed: 0,Bin Id


In [29]:
loegten_final[loegten_final.duplicated()]

Unnamed: 0,Bin Id
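
Instead of inspecting an empty output by eye, the uniqueness of the bin names can also be checked programmatically. The sketch below uses a toy frame with hypothetical bin names; `is_unique` is a standard pandas `Series` property.

```python
import pandas as pd

# Toy final dataset (hypothetical bin names)
final = pd.DataFrame({"Bin Id": ["bin.1", "bin.2", "bin.3"]})

# True when no Bin Id occurs more than once
unique = final["Bin Id"].is_unique

# Any repeated names, for inspection (empty here)
repeated = final.loc[final["Bin Id"].duplicated(keep=False), "Bin Id"]

print(unique, repeated.tolist())
```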


In [30]:
kaloevig_final.to_csv("best_kaloevig_bins.csv", index=False)
loegten_final.to_csv("best_loegten_bins.csv", index=False)