# Contamination analysis

I don't want to run [ContScout](https://github.com/h836472/ContScout/) because it requires insane
amounts of RAM/storage. Instead, I will align my draft genome against UniRef90 with mmseqs2, append
taxonomic information to the better hits, and calculate the taxonomic makeup of each scaffold.

We start by [downloading the UniRef90 database](mmseqs-download_uniref90.sh) and creating an mmseqs2
index. While the download is running, we will [prepare the draft genome](mmseqs-prepdb.sh) for
alignment. When the download finishes, we can [align the draft genome](mmseqs-align.sh) against
UniRef90, add taxonomic information, and save the result in a `.tsv` file, which we can afterwards
read and parse at our leisure.

In [None]:
import datetime

print(datetime.datetime.today().date().isoformat())

In [None]:
from tqdm import tqdm
import os

import pandas as pd

In [None]:
columns = ["query", "target", "alnScore", "seqIdentity", "eVal", "qStart", "qEnd", "qLen",
           "tStart", "tEnd", "tLen", "queryOrfStart", "queryOrfEnd", "dbOrfStart", "dbOrfEnd",
           "taxid", "level", "level_value", "taxonomy"]

The end result is close to 20 GB. To parse it with a progress bar, we first need to know how many
lines it has.

In [None]:
m8 = "/Volumes/scratch/pycnogonum/genome/draft/contamination/contam_tax.m8"
output = "/Volumes/scratch/pycnogonum/genome/draft/contamination/chromosomes/"

In [None]:
%%bash -s "$m8" --out lines
m8=$1

wc -l "$m8"

In [None]:
lines

In [None]:
total_lines = int(lines.split()[0])
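
If the `%%bash` magic is not available, the same count can be obtained in Python. This is a minimal sketch, not part of the original notebook; `count_lines` and `chunk_size` are hypothetical names. Like `wc -l`, it counts newline characters, and it reads the file in fixed-size chunks so memory use stays small.

```python
def count_lines(path, chunk_size=1 << 20):
    """Count newline characters in a file by reading fixed-size binary chunks."""
    n = 0
    with open(path, "rb") as handle:
        # read() returns b"" at end of file, which ends the loop
        while chunk := handle.read(chunk_size):
            n += chunk.count(b"\n")
    return n


# Usage with the variable defined above (not executed here):
# total_lines = count_lines(m8)
```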

Now we will parse the file. Since we don't want to keep 20 GB in memory, we will stream it line by
line: what we need is the taxonomic information of the hits _per scaffold/chromosome_. We will
write the rows of the `.m8` file to separate files, one per chromosome/scaffold, for later
processing. This is somewhat time-consuming, but can just run in the background, and we will only
need to do it once*.

\* unless, of course, we change the genome, which has already happened once :D

In [None]:
chromosome = None
out = None

# Stream the file line by line. Rows for one scaffold are assumed to be contiguous;
# otherwise reopening a file in "w" mode would overwrite its earlier rows.
with open(m8) as f:
    for raw_line in tqdm(f, total=total_lines):
        line = raw_line.strip().split("\t")
        query = line[0]
        if query != chromosome:
            # New scaffold: close the previous output file and start a new one.
            if out is not None:
                out.close()
            chromosome = query
            out = open(os.path.join(output, f"{chromosome}.tsv"), "w")
        out.write("\t".join(line) + "\n")

if out is not None:
    out.close()
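
The same split can be phrased with `itertools.groupby`, which makes the "consecutive rows with the same query" logic explicit. This is a sketch under the same contiguity assumption; `split_by_query` and its `keep` argument are hypothetical names, not part of the notebook. `keep` lets us retain only some columns, but the later cells expect all 19, so the default keeps everything.

```python
import itertools
import os


def split_by_query(lines, out_dir, keep=None):
    """Write consecutive rows sharing the same first column to <out_dir>/<query>.tsv.

    `keep` is an optional list of column indices to retain; None keeps every column.
    Returns the list of files written, in input order.
    """
    # Skip blank lines so they do not produce a file with an empty name.
    rows = (line.rstrip("\n").split("\t") for line in lines if line.strip())
    written = []
    for query, group in itertools.groupby(rows, key=lambda row: row[0]):
        path = os.path.join(out_dir, f"{query}.tsv")
        with open(path, "w") as out:
            for row in group:
                fields = row if keep is None else [row[i] for i in keep]
                out.write("\t".join(fields) + "\n")
        written.append(path)
    return written


# Usage with the variables defined above (not executed here):
# with open(m8) as f:
#     split_by_query(f, output)
```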

In [None]:
def approximate_taxonomic_distribution(sequence_path, columns, resolution=1000):
    """
    Approximate the taxonomic distribution of a scaffold/pseudochromosome by aggregating hits within
    ORFs. Essentially breaks the sequence into bins of size `resolution` but only uses detected ORFs
    instead of blindly scanning with a sliding window. Assigns to each bin the taxon of its first
    hit in file order, which is the highest-scoring hit only if the file lists hits best-first.

    Parameters
    ----------
    sequence_path : str
        Path to the per-scaffold hits file (tab-separated, no header).
    columns : list
        Column names to assign to the hits file.
    resolution : int
        Bin size, in the units of the ORF coordinates (default: 1000).

    Returns
    -------
    pd.Series
        Approximated taxonomic distribution; contains the absolute number of hits for each taxon
        (Metazoa, unknown, Viridiplantae, uc_Bacteria, Fungi, various viruses, uc_Archaea,
        uc_Eukaryota).
    """
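    raw = pd.read_csv(sequence_path, sep="\t", header=None)
    raw.columns = columns
    # Assign each ORF to a bin by integer-dividing its start and end coordinates.
    raw["queryOrfStart_approx"] = raw["queryOrfStart"] // resolution
    raw["queryOrfEnd_approx"] = raw["queryOrfEnd"] // resolution
    # First pass: keep the first hit (in file order) per start bin.
    first_pass = raw.groupby("queryOrfStart_approx").first().sort_values("queryOrfEnd", ascending=False)
    # Second pass: collapse start bins that share an end bin, keeping the ORF that ends last.
    second_pass = first_pass.groupby("queryOrfEnd_approx").first().sort_values("queryOrfEnd", ascending=False)
    return second_pass["taxonomy"].value_counts()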
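
To see what the two passes do, here is a toy example with five made-up hits and only the three columns that matter. It repeats the same steps in a hypothetical helper, `bin_best_hits`, so it runs without any file on disk. The coordinates and taxa are invented for illustration.

```python
import pandas as pd


def bin_best_hits(hits, resolution=1000):
    """Same two-pass binning as above, on an in-memory table of hits."""
    binned = hits.assign(
        start_bin=hits["queryOrfStart"] // resolution,
        end_bin=hits["queryOrfEnd"] // resolution,
    )
    # First pass: one hit per start bin (the first one in row order).
    first_pass = binned.groupby("start_bin").first().sort_values("queryOrfEnd", ascending=False)
    # Second pass: one hit per end bin (the ORF that ends last wins).
    second_pass = first_pass.groupby("end_bin").first()
    return second_pass["taxonomy"].value_counts()


toy = pd.DataFrame(
    {
        "queryOrfStart": [100, 150, 1900, 2100, 5200],
        "queryOrfEnd": [900, 950, 2700, 2800, 5900],
        "taxonomy": ["Metazoa", "Fungi", "Viridiplantae", "Metazoa", "uc_Bacteria"],
    }
)
counts = bin_best_hits(toy)
print(counts)
```

The Fungi hit shares start bin 0 with the Metazoa hit listed before it and is dropped in the first pass. The Viridiplantae hit survives the first pass (start bin 1) but shares end bin 2 with a Metazoa ORF that ends later, so it is dropped in the second pass.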

In [None]:
dir_contents = os.listdir(f"{output}")
sequences = [s for s in dir_contents if s.endswith('.tsv')]

result = {}

for sequence in tqdm(sequences):
    result[sequence] = approximate_taxonomic_distribution(f"{output}/{sequence}", columns)

In [None]:
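# one column per scaffold, one row per taxon; taxa without hits on a scaffold become 0
df = pd.concat(result.values(), axis=1).fillna(0)
# strip only the ".tsv" extension, so scaffold names that contain dots stay intact
df.columns = [os.path.splitext(k)[0] for k in result.keys()]
# normalize each column by the sum of the column
perc = df.div(df.sum(axis=0), axis=1)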

df = df.T
perc = perc.T
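
A small made-up example shows how the concatenation and normalization behave. The scaffold names and counts here are invented; the point is that taxa missing from a scaffold are filled with 0 and that each scaffold's fractions sum to 1.

```python
import pandas as pd

# Hypothetical per-scaffold hit counts, keyed by file name as in `result`.
per_scaffold = {
    "scaffold_1.tsv": pd.Series({"Metazoa": 95, "uc_Bacteria": 5}),
    "scaffold_2.tsv": pd.Series({"Metazoa": 40, "uc_Bacteria": 50, "Fungi": 10}),
}

# Rows are taxa, columns are scaffolds; missing taxa become 0.
table = pd.concat(per_scaffold.values(), axis=1).fillna(0)
table.columns = [name.removesuffix(".tsv") for name in per_scaffold]

# Divide each column by its total, then transpose to one row per scaffold.
fractions = table.div(table.sum(axis=0), axis=1).T
print(fractions)
```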

In [None]:
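# The table was written once; the save is commented out so that re-running the notebook
# reloads the stored table instead of overwriting it.
# df.to_csv("/Volumes/scratch/pycnogonum/genome/draft/contamination/scaffolds_taxonomic_distribution.tsv", sep="\t")
df = pd.read_csv("/Volumes/scratch/pycnogonum/genome/draft/contamination/scaffolds_taxonomic_distribution.tsv", sep="\t", header=0, index_col=0)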

In [None]:
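# Collapse all virus taxa into a single column. The match is case-sensitive,
# so "Viridiplantae" is not caught by the lowercase pattern "vir".
viruses = df.columns[df.columns.str.contains("vir")]
df['viruses'] = df[viruses].sum(axis=1)
df.drop(columns=viruses, inplace=True)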
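
The effect of the substring match can be checked on a toy table. The column names below are illustrative (two virus realm names and one plant clade), not necessarily the ones in the real table, and the counts are invented.

```python
import pandas as pd

counts = pd.DataFrame(
    {
        "Metazoa": [90, 10],
        "Viridiplantae": [1, 0],
        "Riboviria": [2, 3],
        "Duplodnaviria": [0, 4],
    },
    index=["scaffold_1", "scaffold_2"],
)

# Case-sensitive substring match: catches "Riboviria" and "Duplodnaviria" only.
virus_cols = counts.columns[counts.columns.str.contains("vir")]

# Non-mutating variant of the cell above: drop the matched columns, add their row sums.
collapsed = counts.drop(columns=virus_cols).assign(viruses=counts[virus_cols].sum(axis=1))
print(collapsed)
```

It is worth listing `virus_cols` once on the real table to confirm that nothing non-viral matches and that no virus taxon with a capitalised name is missed.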

In [None]:
df.to_csv("/Volumes/scratch/pycnogonum/genome/draft/contamination/scaffolds_taxonomic_distribution_collapsed_vir.tsv", sep="\t")

In [None]:
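# Recompute the per-scaffold fractions so this cell also works after reloading `df` from disk.
perc = df.div(df.sum(axis=1), axis=0)
# Flag scaffolds where fewer than 90% of the binned hits are metazoan.
suspect = df.loc[perc["Metazoa"] < 0.9]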

In [None]:
suspect.to_csv("/Volumes/scratch/pycnogonum/genome/draft/contamination/scaffolds_taxonomic_distribution_suspect.tsv", sep="\t")
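
The flagging step can be wrapped in a small function to try other cutoffs. This is a sketch; `flag_suspect`, `taxon` and `min_fraction` are hypothetical names, and the toy counts are invented. It expects one row per scaffold and one column per taxon.

```python
import pandas as pd


def flag_suspect(counts, taxon="Metazoa", min_fraction=0.9):
    """Return the rows of `counts` whose share of `taxon` hits is below `min_fraction`."""
    fractions = counts.div(counts.sum(axis=1), axis=0)
    return counts.loc[fractions[taxon] < min_fraction]


toy = pd.DataFrame(
    {"Metazoa": [95, 40, 0], "uc_Bacteria": [5, 60, 10]},
    index=["scaffold_1", "scaffold_2", "scaffold_3"],
)
flagged = flag_suspect(toy)
print(flagged)
```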