# Dsr Proteins in InterProScan Datasets

The purpose of this notebook is to search for Dsr proteins in the InterProscan datasets. Dissimilatory sulfate reduction (DSR) is an anaerobic process common in marine sediments which uses sulfate as a terminal electron acceptor and produces hydrogen sulfide, the main substrate of cable bacteria. Two proteins are essential for this process: [sulphite reductase, dissimilatory-type alpha subunit](http://www.ebi.ac.uk/interpro/entry/InterPro/IPR011806/) (DsrA) and [sulphite reductase, dissimilatory-type beta subunit](http://www.ebi.ac.uk/interpro/entry/InterPro/IPR011808/) (DsrB). Thus, it is interesting to understand which organisms are capable of this process as they are essential for cable bacteria growth.

Cable bacteria belong to the *Desulfobulbaceae* family which is known for this pathway but it was not confirmed that cable bacteria perform this type of metabolism, even though their genome contains the essential genes required for DSR.

The search was performed only in the top 50 most abundant genomic bins in the Kalø Vig and Løgten samples. Note that the bin maxbin_bin.55_sub was removed from the final genomic bins as it did not pass the threshold of contamination (its contamination was 11.46%). Therefore, the total number of bins in the Kalø Vig sample accounts for 49 bins.

In [1]:
import os

import pandas as pd

## Kalø Vig Sample

In [2]:
def read_parser_error(tsv_file):
    """
    If a TSV file cannot be read due to a ParserError, read it line by line,
    split each line by a tab, and convert this list of lists into a DataFrame.
    """
    with open(tsv_file) as tsv:
        lines = tsv.readlines()

    # List to save correctly split lines
    tsv_lines = []

    for line in lines:
        tsv_lines.append(line.split("\t"))

    return pd.DataFrame(tsv_lines)

In [3]:
def find_dsr(path):
    """
    Return bin names which contain either DsrA or DsrB proteins.

    Parameters
    ----------
    path : str
        Path to genomic bins.
    """
    for file in os.listdir(path):
        if file.endswith(".tsv"):
            open_file = read_parser_error(os.path.join(path, file))
        if (open_file.iloc[:, 11] == "IPR011806").any() or (  # DsrA
            open_file.iloc[:, 11] == "IPR011808"  # DsrB
        ).any():  # 11th column is "InterPro annotations - accession"
            print(file.rstrip(".faa.tsv"))

In [4]:
print("Genomic bins of the Kalø Vig sample containing DsrA or DsrB:")
find_dsr("interproscan_bins/kaloevig/")

Genomic bins of the Kalø Vig sample containing DsrA or DsrB:
metadecoder_kaloevig.metadecoder.700
maxbin_bin.657
maxbin_bin.7_sub
maxbin_bin.654
maxbin_bin.9
metabat_bin.603
metabat_bin.12
metabat_bin.479
metabat_bin.189
maxbin_bin.525
metabat_bin.596
metabat_bin.18
maxbin_bin.40
maxbin_bin.10


Cable bacteria are:
* `metabat_bin.603` - genus *Electronema*
* `metabat_bin.479` - genus *Electrothrix*
* `metabat_bin.189` - genus *Electronema*

This list contains some of the bins that were shortlisted:
* `metadecoder_kaloevig.metadecoder.700` - family *SZUA-229*, class Gammaproteobacteria
* `maxbin_bin.657` - *Sedimenticola thiotaurini_A*
* `maxbin_bin.654` - genus *Magnetovibrio*
* `maxbin_bin.9` - genus *41T-STBD-0c-01*, family *Sedimenticolaceae*
* `maxbin_bin.10` - genus *41T-STBD-0c-01*, family *Sedimenticolaceae*

Other bacteria are:
* `maxbin_bin.7` - genus *JABZFP01*, order Desulfobacterales
* `metabat_bin.12` - genus *UBA9214*, order Thiohalobacterales
* `maxbin_bin.525` - genus *SLDE01*, family *Thiohalomonadaceae*
* `metabat_bin.596` - genus *Sulfuritalea*
* `metabat_bin.18` - family *BMS3Bbin11*, order Arenicellales
* `maxbin_bin.40` - family *Beggiatoaceae*

## Løgten Sample

In [5]:
print("Genomic bins of the Løgten sample containing DsrA or DsrB:")
find_dsr("interproscan_bins/loegten/")

Genomic bins of the Løgten sample containing DsrA or DsrB:
metabat_bin.351
metabat_bin.126
metabat_bin.370
maxbin_bin.56_sub
metabat_bin.57
metabat_bin.73
maxbin_bin.85_sub
metabat_bin.312_sub
metabat_bin.126
maxbin_bin.450


Cable bacteria are:
* `maxbin_bin.450` - genus *Electrothrix*

Shortlisted bacteria are:
* `metabat_bin.351` - genus *41T-STBD-0c-01*, family *Sedimenticolaceae*
* `metabat_bin.57` - family *Sedimenticolaceae*
* `maxbin_bin.85`_sub - genus *IGN2*, family *Ignavibacteriaceae*
* `metabat_bin.312_sub` - family *Mor1*, phylum Acidobacteriota

Other bacteria are:
* `metabat_bin.126` - genus *SZUA-116*, family *SZUA-229*, class Gammaproteobacteria
* `metabat_bin.370` - family *UBA6429*, order Thiohalomonadales
* `maxbin_bin.56_sub` - family *J044*, class Gammaproteobacteria
* `metabat_bin.73` - genus *JABZFP01*, order Desulfobacterales
* `metabat_bin.126` - genus *SZUA-116*, family *SZUA-229*, class Gammaproteobacteria