# Protein Hits Count in Kalø Vig and Løgten

The purpose of this notebook is to generate tables with the number of hits for each protein in each bin. I am working with the translations of the original final high-quality bins (251 from Kalø Vig and 185 from Løgten).

In [1]:
import os
from pathlib import Path

import numpy as np
import pandas as pd
from Bio import SeqIO

In the BLAST output I get some **overlapping matches**. For example, in the `kaloevig_omc_blast.tsv` file, protein ABL63052.1 matches the contig k127_22601620_5 twice. This is probably because the matches overlap: **the query is searched against the database and finds a hit, but then the search continues and finds another hit in the same contig** (i.e., protein sequence), which is also reported. Continuing with the previous example, the first hit in contig k127_22601620_5 spans positions 55 to 232, and the next hit spans positions 123 to 263 (with a lower e-value). Thus, I have to remove these duplicates and keep only the hit with the lowest e-value for each protein-contig pair.

In [2]:
# Open BLAST output tables
kaloevig_blast = pd.read_table("kaloevig/kaloevig_omc_blast.tsv", header=None)
loegten_blast = pd.read_table("loegten/loegten_omc_blast.tsv", header=None)

# Sort by e-value (column 10) and keep only the lowest-e-value hit for each protein-contig pair
kaloevig_blast = kaloevig_blast.sort_values(10).drop_duplicates(subset=[0, 1])
loegten_blast = loegten_blast.sort_values(10).drop_duplicates(subset=[0, 1])

# Select the first two columns
kaloevig_blast = kaloevig_blast.iloc[:, :2]
loegten_blast = loegten_blast.iloc[:, :2]

# Add column names
blast_col_names = ["protein_id", "contig"]
kaloevig_blast.columns = blast_col_names
loegten_blast.columns = blast_col_names
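
To illustrate why sorting before `drop_duplicates` keeps the best hit, here is a minimal sketch on a toy table. The e-values and the second contig id (`k127_1_1`) are made up for illustration, and column 2 only stands in for the e-value column of the real BLAST output.

```python
import pandas as pd

# Toy BLAST-like table: columns 0 and 1 are the protein and contig ids,
# column 2 stands in for the e-value (all values here are hypothetical)
toy_blast = pd.DataFrame(
    [
        ["ABL63052.1", "k127_22601620_5", 1e-10],
        ["ABL63052.1", "k127_22601620_5", 1e-30],
        ["ABL63052.1", "k127_1_1", 1e-5],
    ]
)

# Ascending sort puts the lowest e-value first within each pair;
# drop_duplicates keeps the first occurrence by default
toy_dedup = toy_blast.sort_values(2).drop_duplicates(subset=[0, 1])
print(toy_dedup)
```

Only one row per protein-contig pair should remain, and for the duplicated pair it should be the row with the smaller e-value.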

Create column names (protein names) and indices (bin names) for the final dataframes.

In [3]:
# Protein accession numbers are identical in both BLAST outputs
protein_col_names = kaloevig_blast["protein_id"].unique()


def get_bins(sample_name):
    """
    Extract bin names without extension.
    """
    index = []
    for file in os.listdir(f"../../../results/2022-04-26/prodigal/{sample_name}"):
        index.append(Path(file).stem)
    return index


kaloevig_bins = get_bins("kaloevig")
loegten_bins = get_bins("loegten")
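
As a quick check of what `Path(...).stem` returns, the sketch below uses made-up file names (the real bin file names may differ). Only the last extension is removed, so dots inside the bin name are preserved.

```python
from pathlib import Path

# Hypothetical bin file names, for illustration only
example_files = ["bin.12.faa", "kaloevig_bin_3.faa"]

# stem strips only the final suffix
example_stems = [Path(name).stem for name in example_files]
print(example_stems)
```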

In [4]:
def contig_hits(sample_name, blast_output, bins):
    """
    Save a dataframe with the number of sequence hits of each protein in each genomic bin.
    """
    # Bins path
    bin_path = "../../../results/2022-04-26/prodigal/"

    # Final dataframe skeleton
    final_df = pd.DataFrame(0, index=bins, columns=protein_col_names)

    for bin_file in Path(bin_path + sample_name).iterdir():
        # print(bin_file)

        # Save the current bin's sequence ids to a dataframe
        bin_record_ids = []

        for record in SeqIO.parse(bin_file, "fasta"):
            bin_record_ids.append(record.id)
        bin_record_ids_df = pd.DataFrame(bin_record_ids, columns=["contig"])

        # Merge the dataframe of sequence ids of the current bin and blast output
        merged = bin_record_ids_df.merge(blast_output, on="contig")

        # Group the merged dataframe to compute the total number of contig hits for each protein in the bin
        grouped = merged.groupby("protein_id").count().reset_index()

        # The name of the current bin
        current_bin = Path(bin_file).stem

        # Loop through the grouped dataframe and extract protein id and number of contig hits
        # Then add the number to the corresponding row/column of the dataset
        for index, row in grouped.iterrows():
            final_df.loc[current_bin, row["protein_id"]] = row["contig"]

    return final_df
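
The counting logic above can also be sketched without the row-by-row loop, assuming the sequence ids of all bins are first collected into one table with a `bin` column. All ids and bin names below are hypothetical toy data, and `pd.crosstab` is an alternative I am illustrating here, not what the function above uses.

```python
import pandas as pd

# Hypothetical mapping of sequence ids to bins
# (in the notebook this information comes from the FASTA files)
toy_bins = pd.DataFrame(
    {
        "contig": ["c1_1", "c1_2", "c2_1", "c3_1"],
        "bin": ["bin_a", "bin_a", "bin_a", "bin_b"],
    }
)

# Hypothetical deduplicated BLAST hits
toy_hits = pd.DataFrame(
    {
        "protein_id": ["P1", "P1", "P2", "P1"],
        "contig": ["c1_1", "c1_2", "c1_1", "c3_1"],
    }
)

# Inner merge keeps only sequences that have at least one hit
toy_merged = toy_bins.merge(toy_hits, on="contig")

# Count hits per (bin, protein) pair
toy_counts = pd.crosstab(toy_merged["bin"], toy_merged["protein_id"])

# Add bins and proteins without any hit as rows/columns of zeros
toy_counts = toy_counts.reindex(
    index=["bin_a", "bin_b", "bin_c"], columns=["P1", "P2", "P3"], fill_value=0
)
print(toy_counts)
```

The final `reindex` plays the role of the zero-filled skeleton: bins and proteins with no hits still appear in the table.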

In [5]:
kaloevig_contig_hits = contig_hits(
    sample_name="kaloevig", blast_output=kaloevig_blast, bins=kaloevig_bins
)

loegten_contig_hits = contig_hits(
    sample_name="loegten", blast_output=loegten_blast, bins=loegten_bins
)

In [6]:
# Save to csv files, keeping the bin names (the dataframe index) as the first column
kaloevig_contig_hits.to_csv("kaloevig_contig_hits.csv", index_label="bin")
loegten_contig_hits.to_csv("loegten_contig_hits.csv", index_label="bin")
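
Since the bin names live in the dataframe index, whether they reach the CSV depends on the `index` argument of `to_csv`. The sketch below shows the difference on a toy table written to in-memory buffers; the bin and protein names are hypothetical.

```python
import io

import pandas as pd

# Toy hit-count table with bin names in the index (hypothetical values)
toy_table = pd.DataFrame({"P1": [2, 1], "P2": [1, 0]}, index=["bin_a", "bin_b"])

# Write once with the index (labelled "bin") and once without it
with_index = io.StringIO()
toy_table.to_csv(with_index, index_label="bin")

without_index = io.StringIO()
toy_table.to_csv(without_index, index=False)

header_with = with_index.getvalue().splitlines()[0]
header_without = without_index.getvalue().splitlines()[0]
print(header_with)
print(header_without)

# Reading the first version back restores the bin names as the index
restored = pd.read_csv(io.StringIO(with_index.getvalue()), index_col="bin")
print(restored)
```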