# Creating a Big Table with Quality Metrics for the *Candidatus* Electrothrix communis RB Sample

The purpose of this notebook is to create a single table that combines, for each bin, its name, its GTDB-Tk taxonomic classification, and the CheckM quality metrics:
1. Completeness
2. Contamination

This table will be useful when we have to prioritize the taxa we want to look into: bins with higher completeness and lower contamination are the better candidates. The "marine_gs" abbreviation stands for "Marine Golden Standard", which is our internal name for the *Candidatus* Electrothrix communis RB species.
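As a rough illustration of that prioritization, the sketch below filters and ranks bins by the two metrics. The toy rows, the `prioritize_bins` helper, and the 90 % / 5 % cutoffs are assumptions made for this example only; they are not taken from the real dataset, and the cutoffs should be adjusted to the analysis at hand.

```python
import pandas as pd

# Hypothetical toy rows with the same columns as the merged table
toy = pd.DataFrame(
    {
        "Bin Id": ["bin.1", "bin.2", "bin.3", "bin.4"],
        "classification": [
            "d__Bacteria;p__A",
            "d__Bacteria;p__B",
            "d__Bacteria;p__C",
            "d__Bacteria;p__D",
        ],
        "Completeness": [98.2, 75.0, 91.5, 98.2],
        "Contamination": [1.1, 0.5, 7.3, 0.4],
    }
)


def prioritize_bins(df, min_completeness=90.0, max_contamination=5.0):
    """Keep bins that pass both cutoffs (assumed values), best bins first."""
    keep = (df["Completeness"] >= min_completeness) & (
        df["Contamination"] <= max_contamination
    )
    return (
        df[keep]
        # Highest completeness first; ties broken by lowest contamination
        .sort_values(["Completeness", "Contamination"], ascending=[False, True])
        .reset_index(drop=True)
    )


prioritized = prioritize_bins(toy)
print(prioritized)
```

Sorting on both columns means two bins with equal completeness are ordered by contamination, so the cleanest bin comes first.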

In [1]:
import pandas as pd

# Relevant columns
checkm_cols = ["Bin Id", "Completeness", "Contamination"]

# Open Candidatus Electrothrix communis RB CheckM dataset
marine_gs_illumina_checkm = pd.read_table(
    "binning/results/2022-05-11/checkm/marine_gs_illumina/metabat/bacteria/marine_gs_illumina_results_bacteria.tsv",
    usecols=checkm_cols,
)

# Open Candidatus Electrothrix communis RB taxonomic table
marine_gs_illumina_taxonomy = pd.read_table(
    "taxonomy/results/2022-05-16/gtdbtk/marine_gs_illumina_metabat/gtdbtk.bac120.summary.Illumina.tsv",
    usecols=["user_genome", "classification"],
)

# Rename user_genome in the taxonomic table to match the column name in the CheckM dataset
marine_gs_illumina_taxonomy = marine_gs_illumina_taxonomy.rename(columns={"user_genome": "Bin Id"})

In [2]:
### Merge datasets ###
# Inner join (the pandas default): only bins present in both tables are kept
marine_gs_illumina_taxa_quality = marine_gs_illumina_taxonomy.merge(marine_gs_illumina_checkm, on="Bin Id", how="inner")

# Save the merged dataset to a CSV file
marine_gs_illumina_taxa_quality.to_csv("marine_gs_illumina_taxa_quality.csv", index=False)
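
Because the merge above is an inner join, any bin that appears in only one of the two tables is dropped without warning. A minimal sketch of how this could be checked, using an outer merge with `indicator=True`, is shown below; the two toy tables are hypothetical stand-ins for the CheckM and taxonomy tables, not real data.

```python
import pandas as pd

# Hypothetical stand-ins for the CheckM and taxonomy tables
checkm = pd.DataFrame(
    {
        "Bin Id": ["bin.1", "bin.2", "bin.3"],
        "Completeness": [98.2, 75.0, 91.5],
        "Contamination": [1.1, 0.5, 7.3],
    }
)
taxonomy = pd.DataFrame(
    {
        "Bin Id": ["bin.1", "bin.2", "bin.4"],
        "classification": ["d__Bacteria;p__A", "d__Bacteria;p__B", "d__Bacteria;p__D"],
    }
)

# An outer merge keeps every bin; the _merge column records where each row came from
audit = taxonomy.merge(checkm, on="Bin Id", how="outer", indicator=True)

# Rows that an inner join would drop
unmatched = audit[audit["_merge"] != "both"]
print(unmatched[["Bin Id", "_merge"]])
```

Here `left_only` marks bins found only in the taxonomy table and `right_only` marks bins found only in the CheckM table.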