# Investigating r2_g3735

Hox clusters are usually uninterrupted. In our gene models, however, a gene (along with RNA-seq
peaks) shows up between _Hox3_ and _Dfd_. To test whether this is a real gene, we can search for
it in the de novo assembled transcriptomes.

```bash
module load mmseqs2

M8FORMAT="query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits,qlen"
cd /lisc/scratch/zoology/pycnogonum/genome/draft/r2_g3735/

for pep in ../transcriptome/*/Trinity.fasta.transdecoder.pep; do
    # the stage name is the directory directly below ../transcriptome/
    stage=$(echo "$pep" | cut -d"/" -f3)
    mmseqs easy-search query.fa "$pep" "$stage.m8" tmp --format-output "$M8FORMAT" --threads 4
done
```

In [1]:
import pandas as pd
import matplotlib.pyplot as plt

import matplotlib.colors as mcolors

A helper function to read the MMseqs2 output files:

In [2]:
# read an MMseqs2 .m8 file and name its columns (same order as M8FORMAT)
def read_aln(m8, id_sep=None):
    aln = pd.read_csv(m8, sep="\t", header=None)
    m8_columns = ['query', 'target', 'seq_id', 'ali_len', 'no_mism', 'no_go',
                'q_start', 'q_end', 't_start', 't_end', 'eval', 'bit', 'qlen']
    aln.columns = m8_columns
    # optionally trim the query to just the ID:
    # split on id_sep and keep the second field
    if id_sep is not None:
        aln["query"] = aln["query"].str.split(id_sep).str[1]
    return aln
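
As a quick sanity check that the column names line up with the `--format-output` fields, the same parsing can be tried on a single line held in memory. This is only a minimal sketch: the line is the INSTAR5 hit from the result table further down, and `M8_COLUMNS` and `example_line` are names introduced here for illustration.

```python
import io

import pandas as pd

# Column names in the same order as the fields in M8FORMAT
# (illustrative constant, not part of the original notebook).
M8_COLUMNS = ['query', 'target', 'seq_id', 'ali_len', 'no_mism', 'no_go',
              'q_start', 'q_end', 't_start', 't_end', 'eval', 'bit', 'qlen']

# One tab-separated hit, copied from the INSTAR5 row of the result table.
example_line = "\t".join([
    "r2_g3735.t1.p1", "TRINITY_DN6471_c0_g1_i2.p1", "1.000", "221", "0", "0",
    "1", "221", "1", "221", "1.437E-146", "461", "297",
])

# pd.read_csv accepts any file-like object, so StringIO stands in for the .m8 file.
example = pd.read_csv(io.StringIO(example_line), sep="\t", header=None, names=M8_COLUMNS)

print(example[["query", "target", "seq_id", "eval", "qlen"]])
```

If a column ends up with an unexpected type or value here, the order of names most likely no longer matches the format string used in the search.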

First, read the output tables, one per developmental stage. Each contains all hits that pass the e-value threshold of the search (`1e-3`); of these, we keep only hits with an e-value below `1e-20`.

In [3]:
hits = {}

deep_dev_transcriptomes = ["INSTAR1", "INSTAR3", "INSTAR5", "JUV1", "EMBRYO3", "INSTAR2", "INSTAR4", "INSTAR6", "SUBADULT"]

for stage in deep_dev_transcriptomes:
    r2_g3735_txome_loc = f'/Volumes/scratch/pycnogonum/genome/draft/r2_g3735/{stage}.m8'
    r2_g3735 = read_aln(r2_g3735_txome_loc)

    top_hits = r2_g3735[r2_g3735['eval'] < 1e-20]
    hits[stage] = top_hits.copy()

In [4]:
best = pd.concat([df.assign(Source=key) for key, df in hits.items()], ignore_index=True)

In [5]:
best.sort_values(by=['seq_id', 'eval'], ascending=[False, True])

| | query | target | seq_id | ali_len | no_mism | no_go | q_start | q_end | t_start | t_end | eval | bit | qlen | Source |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 34 | r2_g3735.t1.p1 | TRINITY_DN6471_c0_g1_i2.p1 | 1.000 | 221 | 0 | 0 | 1 | 221 | 1 | 221 | 1.437000e-146 | 461 | 297 | INSTAR5 |
| 86 | r2_g3735.t1.p1 | TRINITY_DN11889_c0_g1_i3.p1 | 0.989 | 296 | 3 | 0 | 1 | 296 | 1 | 296 | 3.168000e-196 | 604 | 297 | INSTAR2 |
| 87 | r2_g3735.t1.p1 | TRINITY_DN11889_c0_g3_i2.p1 | 0.989 | 296 | 3 | 0 | 1 | 296 | 1 | 296 | 3.168000e-196 | 604 | 297 | INSTAR2 |
| 88 | r2_g3735.t1.p1 | TRINITY_DN11889_c0_g1_i5.p1 | 0.989 | 296 | 3 | 0 | 1 | 296 | 1 | 296 | 3.168000e-196 | 604 | 297 | INSTAR2 |
| 89 | r2_g3735.t1.p1 | TRINITY_DN11889_c0_g3_i4.p1 | 0.989 | 296 | 3 | 0 | 1 | 296 | 1 | 296 | 3.168000e-196 | 604 | 297 | INSTAR2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 66 | r2_g3735.t1.p1 | TRINITY_DN8916_c0_g1_i4.p1 | 0.369 | 229 | 141 | 0 | 1 | 229 | 22 | 245 | 7.612000e-38 | 146 | 297 | JUV1 |
| 67 | r2_g3735.t1.p1 | TRINITY_DN8916_c0_g1_i2.p1 | 0.369 | 229 | 141 | 0 | 1 | 229 | 53 | 276 | 7.612000e-38 | 146 | 297 | JUV1 |
| 68 | r2_g3735.t1.p1 | TRINITY_DN8916_c0_g2_i6.p1 | 0.369 | 229 | 141 | 0 | 1 | 229 | 53 | 276 | 7.612000e-38 | 146 | 297 | JUV1 |
| 137 | r2_g3735.t1.p1 | TRINITY_DN12148_c0_g1_i8.p1 | 0.369 | 229 | 141 | 0 | 1 | 229 | 22 | 245 | 1.759000e-37 | 145 | 297 | INSTAR6 |
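
Sequence identity alone can be misleading here: the top row is a perfect match but spans only 221 of the 297 query residues, whereas the INSTAR2 hits cover 296. One way to make this explicit is to compute the fraction of the query covered by each alignment. The sketch below assumes the column names assigned by `read_aln`; `add_query_coverage` and the `q_cov` column are hypothetical names, and the two demo rows are copied from the table above.

```python
import pandas as pd


def add_query_coverage(hits):
    """Return a copy of `hits` with a `q_cov` column (fraction of the query aligned)."""
    out = hits.copy()
    # q_start and q_end are 1-based and inclusive, hence the +1
    out["q_cov"] = (out["q_end"] - out["q_start"] + 1) / out["qlen"]
    return out


# Two rows copied from the table above (only the columns needed here).
demo = pd.DataFrame({
    "target": ["TRINITY_DN6471_c0_g1_i2.p1", "TRINITY_DN11889_c0_g1_i3.p1"],
    "seq_id": [1.000, 0.989],
    "q_start": [1, 1],
    "q_end": [221, 296],
    "qlen": [297, 297],
    "Source": ["INSTAR5", "INSTAR2"],
})

print(add_query_coverage(demo)[["target", "Source", "seq_id", "q_cov"]])
```

On the full table this would be `add_query_coverage(best)`, after which `q_cov` could be used as an additional sort key next to `seq_id` and `eval`.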
