# Visualising the NK cluster

## Finding NK gene candidates

We compiled a list of NK-family homeobox genes from the [Aase-Remedios _et al._
(2023)](https://doi.org/10.1093/molbev/msad239) analysis of the spider homeobox gene repertoire. We
then searched the _P. litorale_ draft genome with these sequences using `mmseqs`:

```bash
cd /lisc/scratch/zoology/pycnogonum/genome/draft/nkx
module load mmseqs2
M8FORMAT="query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits,qlen"
mmseqs easy-search nk.fa ../draft.fasta nk_genomic.m8 tmp --format-output $M8FORMAT --threads 4
```

We searched the deeply sequenced _P. litorale_ transcriptomes with the same query list:

```bash
for pep in ../transcriptome/*/Trinity.fasta.transdecoder.pep; do
    # the stage name is the third "/"-separated field of the path
    stage=$(echo "$pep" | cut -d"/" -f3)
    mmseqs easy-search nk.fa "$pep" "nk_$stage.m8" tmp --format-output "$M8FORMAT" --threads 4
done
```
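The stage name in the loop above is the third `/`-separated field of each peptide path, and it determines the name of the `.m8` file that the notebook reads later. A minimal Python sketch of the same derivation follows; `stage_from_path` and `output_name` are hypothetical helper names, and the example path only mirrors the glob pattern used above.

```python
from pathlib import PurePosixPath


def stage_from_path(pep_path):
    """Return the third '/'-separated field of the path (the stage directory)."""
    return PurePosixPath(pep_path).parts[2]


def output_name(pep_path):
    """Build the result file name the same way the shell loop does."""
    return f"nk_{stage_from_path(pep_path)}.m8"


# illustrative path following the pattern ../transcriptome/<stage>/Trinity.fasta.transdecoder.pep
example = "../transcriptome/INSTAR1/Trinity.fasta.transdecoder.pep"
result_file = output_name(example)
print(result_file)
```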

To locate the loci that overlap with our genomic hits, we first need to read the GFF annotation
file.

In [1]:
from tqdm import tqdm

import pandas as pd
import numpy as np

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

from matplotlib import pyplot as plt
from matplotlib.patches import ConnectionPatch

In [4]:
def read_gff(loc):
    """Read a GFF3 file into a DataFrame with the nine standard GFF columns."""
    # skip the first four lines of the file
    gff = pd.read_csv(loc, sep="\t", header=None, skiprows=4)
    gff_columns = ['seqid', 'source', 'type', 'start', 'end', 'score', 'strand', 'phase', 'attributes']
    gff.columns = gff_columns
    return gff

def read_aln(m8, id_sep=None):
    """Read an mmseqs .m8 table; the column order follows $M8FORMAT."""
    # read file, name columns
    aln = pd.read_csv(m8, sep="\t", header=None)
    m8_columns = ['query', 'target', 'seq_id', 'ali_len', 'no_mism', 'no_go',
                'q_start', 'q_end', 't_start', 't_end', 'eval', 'bit', 'qlen']
    aln.columns = m8_columns
    # optionally trim the query to just the ID (the second field after splitting on id_sep)
    if id_sep is not None:
        aln["query"] = aln["query"].str.split(id_sep).str[1]
    return aln
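As a quick check of how the `mmseqs` output fields line up with the column names used in the notebook, the sketch below pairs the two lists and parses a single alignment line from memory. The alignment line, the `|`-separated query header and the names `field_to_column` and `toy` are made up for illustration; only the field list and the column names come from the code above.

```python
import io

import pandas as pd

# field names passed to --format-output, and the column names used in the notebook
M8FORMAT = "query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits,qlen"
M8_COLUMNS = ['query', 'target', 'seq_id', 'ali_len', 'no_mism', 'no_go',
              'q_start', 'q_end', 't_start', 't_end', 'eval', 'bit', 'qlen']

# one-to-one mapping, relying on both lists being in the same order
field_to_column = dict(zip(M8FORMAT.split(","), M8_COLUMNS))

# a single invented alignment line (tab-separated, no header), NOT real data
toy_m8 = "sp|NK_TOY|Msx\tpseudochrom_1\t0.85\t60\t9\t0\t100\t159\t21000\t21179\t1e-25\t110\t300\n"
toy = pd.read_csv(io.StringIO(toy_m8), sep="\t", header=None, names=M8_COLUMNS)

# same trimming as id_sep="|" would perform: keep the second field of the query header
toy["query"] = toy["query"].str.split("|").str[1]
print(toy[["query", "target", "eval"]])
```

If the field list given to `--format-output` changes, the column list has to change with it, because the file itself carries no header.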

In [14]:
gff_loc = "/Volumes/project/pycnogonum/paper/zenodo/results/merged_sorted_named_dedup_flagged.gff3"
tmp = read_gff(gff_loc)
gff = tmp[tmp['type'] == 'gene'].copy()
gff['gene'] = gff['attributes'].str.split("ID=").str[1].str.split(";").str[0]
del tmp
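The gene ID is taken from the `ID=` tag of the GFF attributes column. The sketch below shows that extraction on a toy Series, next to a regex-based alternative that gives the same result when `ID=` is the first tag. The second attribute string is invented for illustration, and the regex variant is an assumption of this example, not what the notebook uses.

```python
import pandas as pd

# toy attribute strings; the second one is made up for this example
toy_attributes = pd.Series([
    "ID=g1;gene=Uncharacterised protein g1;gene_id=g1",
    "ID=PB.1;gene=Uncharacterised protein PB.1;gene_id=PB.1",
])

# approach used in the notebook: text after "ID=", up to the next ";"
by_split = toy_attributes.str.split("ID=").str[1].str.split(";").str[0]

# alternative sketch: capture everything between a leading "ID=" and the first ";"
by_regex = toy_attributes.str.extract(r"^ID=([^;]+)", expand=False)

print(by_split.tolist())
```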

In [15]:
gff

| | seqid | source | type | start | end | score | strand | phase | attributes | gene |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | pseudochrom_1 | GeneMark.hmm3 | gene | 119 | 5102 | . | + | . | ID=g1;gene=Uncharacterised protein g1;gene_id=g1 | g1 |
| 14 | pseudochrom_1 | PacBio | gene | 20074 | 33326 | . | - | . | ID=PB.1;function=Major Facilitator Superfamily... | PB.1 |
| 67 | pseudochrom_1 | PacBio | gene | 33362 | 39771 | . | + | . | ID=PB.2;gene=Uncharacterised protein PB.2;gene... | PB.2 |
| 117 | pseudochrom_1 | PacBio | gene | 50782 | 71089 | . | - | . | ID=PB.3;function=ribose phosphate diphosphokin... | PB.3 |
| 259 | pseudochrom_1 | AUGUSTUS | gene | 63017 | 66568 | . | + | . | ID=g5;function=Bifunctional DNA N-glycosylase ... | g5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 761020 | scaffold_12318 | PacBio | gene | 1232 | 15217 | . | - | . | ID=PB.8896;gene=Uncharacterised protein PB.889... | PB.8896 |
| 761038 | scaffold_12330 | AGAT | gene | 6041 | 13951 | . | + | . | ID=r2_g4229;gene=Uncharacterised protein r2_g4... | r2_g4229 |
| 761046 | scaffold_12335 | PacBio | gene | 1064 | 1939 | . | + | . | ID=PB.8840;gene=Uncharacterised protein PB.884... | PB.8840 |
| 761050 | scaffold_12338 | tRNAscan-SE | gene | 19482 | 19554 | 29.5 | - | . | ID=tRNA905;gene_id=scaffold_12338.tRNA1-TyrGTA... | tRNA905 |


In [17]:
nk = read_aln('/Volumes/scratch/pycnogonum/genome/draft/nkx/nk_genomic.m8')

In [52]:
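def hit_to_locus(hit, gff):
    """Return the ID of the first annotated gene whose span contains the hit."""
    same_chrom = gff['seqid'] == hit['target']
    within_borders = (gff['start'] <= hit['t_start']) & (gff['end'] >= hit['t_end'])

    overlap = gff[same_chrom & within_borders]
    # a hit outside every annotated gene yields NaN instead of raising an IndexError
    if overlap.empty:
        return np.nan
    return overlap['gene'].values[0]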
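To make the hit-to-locus lookup concrete, here is a self-contained sketch on toy coordinates. It is a variant rather than the function above: `locus_for_hit` is a hypothetical name, it orders the two target coordinates before comparing (in case a hit is reported with `t_start` greater than `t_end`, as can happen for reverse-strand matches), and it returns `None` when no gene contains the hit. All sequence names and coordinates are invented.

```python
import pandas as pd


def locus_for_hit(hit, genes):
    """Return the ID of the first gene that fully contains the hit, or None."""
    # order the coordinates so reverse-oriented hits are handled like forward ones
    lo, hi = sorted((hit["t_start"], hit["t_end"]))
    same_chrom = genes["seqid"] == hit["target"]
    contains = (genes["start"] <= lo) & (genes["end"] >= hi)
    match = genes.loc[same_chrom & contains, "gene"]
    return match.iloc[0] if len(match) else None


# toy annotation: two genes on chrA, one on chrB
toy_genes = pd.DataFrame({
    "seqid": ["chrA", "chrA", "chrB"],
    "start": [100, 5000, 100],
    "end": [2000, 9000, 2000],
    "gene": ["toy1", "toy2", "toy3"],
})

# toy hits: forward, reverse-oriented, other sequence, and intergenic
toy_hits = pd.DataFrame({
    "target": ["chrA", "chrA", "chrB", "chrA"],
    "t_start": [500, 8000, 150, 3000],
    "t_end": [800, 7400, 400, 3300],
})

toy_hits["gene"] = toy_hits.apply(locus_for_hit, axis=1, genes=toy_genes)
print(toy_hits)
```

Requiring full containment is the stricter choice; a hit that straddles a gene boundary would need an overlap test instead.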

In [58]:
hits = nk[nk['eval'] < 1e-20].copy()
hits['gene'] = hits.apply(hit_to_locus, axis=1, gff=gff)

In [60]:
hits['gene'].value_counts()

gene
g8689      84
PB.3341    83
PB.3347    83
PB.1005    41
PB.1003    36
g6762      31
PB.7934    20
Name: count, dtype: int64

In [61]:
best_nk = {}

deep_dev_transcriptomes = ["INSTAR1", "INSTAR3", "INSTAR5", "JUV1", "EMBRYO3", "INSTAR2", "INSTAR4", "INSTAR6", "SUBADULT"]

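for stage in deep_dev_transcriptomes:
    nk_txome_loc = f'/Volumes/scratch/pycnogonum/genome/draft/nkx/nk_{stage}.m8'
    # use a separate name so the genomic hits in `nk` are not overwritten
    nk_txome = read_aln(nk_txome_loc)

    # keep the unique peptide IDs with confident hits in this stage
    top_hits = nk_txome[nk_txome['eval'] < 1e-20]
    best_nk[stage] = top_hits['target'].unique()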
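Because the targets are TransDecoder peptides predicted from Trinity transcripts, several IDs per stage can belong to the same Trinity gene and differ only in the isoform (`_i<N>`) and ORF (`.p<N>`) suffixes. The sketch below collapses such IDs to the gene level. `trinity_gene` is a hypothetical helper, the suffix pattern is assumed from the usual Trinity/TransDecoder naming, and the three IDs are copied from the `INSTAR1` hits.

```python
import re


def trinity_gene(pep_id):
    """Strip the isoform (_iN) and TransDecoder ORF (.pN) suffixes from an ID."""
    return re.sub(r"_i\d+(\.p\d+)?$", "", pep_id)


example_ids = [
    "TRINITY_DN3973_c1_g1_i3.p1",
    "TRINITY_DN3973_c1_g1_i4.p1",
    "TRINITY_DN20982_c2_g1_i1.p1",
]

# one entry per Trinity gene, regardless of how many isoforms passed the filter
example_genes = sorted({trinity_gene(pep_id) for pep_id in example_ids})
print(example_genes)
```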

In [62]:
best_nk

{'INSTAR1': array(['TRINITY_DN20982_c2_g1_i1.p1', 'TRINITY_DN1632_c8_g1_i1.p1',
        'TRINITY_DN3973_c1_g1_i3.p1', 'TRINITY_DN3973_c1_g1_i4.p1',
        'TRINITY_DN3973_c1_g1_i2.p1', 'TRINITY_DN3973_c1_g1_i5.p1',
        'TRINITY_DN98526_c0_g2_i1.p1', 'TRINITY_DN20982_c0_g1_i1.p1',
        'TRINITY_DN3014_c2_g1_i1.p1', 'TRINITY_DN191_c0_g3_i1.p1'],
       dtype=object),
 'INSTAR3': array(['TRINITY_DN77411_c0_g1_i1.p1', 'TRINITY_DN1357_c1_g1_i1.p1',
        'TRINITY_DN2510_c0_g1_i2.p1', 'TRINITY_DN2510_c0_g1_i4.p1',
        'TRINITY_DN2510_c0_g1_i1.p1', 'TRINITY_DN2510_c0_g1_i3.p1',
        'TRINITY_DN71961_c0_g1_i1.p1', 'TRINITY_DN4683_c3_g1_i4.p1',
        'TRINITY_DN4683_c3_g1_i7.p1', 'TRINITY_DN954_c0_g1_i1.p1',
        'TRINITY_DN2546_c0_g1_i10.p1', 'TRINITY_DN2546_c0_g1_i3.p1',
        'TRINITY_DN2546_c0_g1_i6.p1', 'TRINITY_DN2546_c0_g1_i1.p1',
        'TRINITY_DN2546_c0_g1_i11.p1', 'TRINITY_DN2546_c0_g1_i12.p1',
        'TRINITY_DN2546_c0_g1_i7.p1', 'TRINITY_DN3779_c0_g2_i1.p1