# Visualising the NK cluster

## Finding NK gene candidates

We compiled a list of NK-family homeobox genes from the [Aase-Remedios _et al._
(2023)](https://doi.org/10.1093/molbev/msad239) analysis of the spider homeobox gene repertoire. We
then searched the _P. litorale_ draft genome with these sequences using `mmseqs`:

```bash
cd /lisc/scratch/zoology/pycnogonum/genome/draft/nkx
module load mmseqs2
M8FORMAT="query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits,qlen"
mmseqs easy-search nk_chelicerates.fa ../draft.fasta nk_genomic.m8 tmp --format-output $M8FORMAT --threads 4
```
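
The column order in `$M8FORMAT` determines how the result table has to be read later on. As a minimal sketch (the helper names and the example line are made up for illustration), the column names can be derived from the same string, so that the two cannot drift apart:

```python
# Same string as passed to --format-output in the mmseqs call above.
M8FORMAT = "query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits,qlen"


def m8_columns(format_string):
    """Split a comma-separated --format-output string into column names."""
    return [field.strip() for field in format_string.split(",")]


def parse_m8_line(line, columns):
    """Turn one tab-separated result line into a dict keyed by column name."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(columns):
        raise ValueError(f"expected {len(columns)} fields, got {len(values)}")
    return dict(zip(columns, values))


columns = m8_columns(M8FORMAT)
# Hypothetical result line, invented for this example (not taken from nk_genomic.m8).
example = "query_1\tpseudochrom_1\t0.85\t60\t9\t0\t1\t60\t1200\t1021\t1e-25\t110\t300"
record = parse_m8_line(example, columns)
print(len(columns), record["target"], record["evalue"])
```

All values stay strings here; converting them to numbers is left to whatever reads the table afterwards.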

First, we'll have to read the GFF file so that we can locate the loci that overlap with our genomic
hits.

In [1]:
from tqdm import tqdm

import pandas as pd
import numpy as np

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

from matplotlib import pyplot as plt
from matplotlib.patches import ConnectionPatch

In [2]:
def read_gff(loc):
    # skiprows=4 assumes that the first four lines of the file are header lines
    gff = pd.read_csv(loc, sep="\t", header=None, skiprows=4)
    gff_columns = ['seqid', 'source', 'type', 'start', 'end', 'score', 'strand', 'phase', 'attributes']
    gff.columns = gff_columns
    return gff

def read_aln(m8, id_sep=None):
    # read the mmseqs result table and name the columns (same order as in $M8FORMAT)
    nk = pd.read_csv(m8, sep="\t", header=None)
    m8_columns = ['query', 'target', 'seq_id', 'ali_len', 'no_mism', 'no_go',
                'q_start', 'q_end', 't_start', 't_end', 'eval', 'bit', 'qlen']
    nk.columns = m8_columns
    # optionally trim the query to just the ID (second field after splitting on id_sep)
    if id_sep is not None:
        nk["query"] = nk["query"].str.split(id_sep).str[1]
    return nk

In [3]:
gff_loc = "/Volumes/project/pycnogonum/paper/zenodo/results/merged_sorted_named_dedup_flagged.gff3"
tmp = read_gff(gff_loc)
gff = tmp[tmp['type'] == 'gene'].copy()
gff['gene'] = gff['attributes'].str.split("ID=").str[1].str.split(";").str[0]
del tmp
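
The chained split above picks out whatever follows the first `ID=` in the attribute column. A slightly more explicit alternative, sketched here in plain Python with a hypothetical helper name, is to parse the `key=value` pairs into a dictionary and look up `ID` directly:

```python
def parse_gff_attributes(attributes):
    """Parse a GFF3 attribute column ('key=value;key=value') into a dict."""
    parsed = {}
    for field in attributes.strip().split(";"):
        if not field:
            continue  # tolerate a trailing semicolon
        key, _, value = field.partition("=")
        parsed[key] = value
    return parsed


# Attribute string of the first gene row in the annotation.
attrs = parse_gff_attributes("ID=g1;gene=Uncharacterised protein g1;gene_id=g1")
print(attrs["ID"])
```

This does not depend on `ID` being the first attribute, and it would not be confused by another key that happens to end in `ID`.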

In [4]:
gff

| | seqid | source | type | start | end | score | strand | phase | attributes | gene |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | pseudochrom_1 | GeneMark.hmm3 | gene | 119 | 5102 | . | + | . | ID=g1;gene=Uncharacterised protein g1;gene_id=g1 | g1 |
| 14 | pseudochrom_1 | PacBio | gene | 20074 | 33326 | . | - | . | ID=PB.1;function=Major Facilitator Superfamily... | PB.1 |
| 67 | pseudochrom_1 | PacBio | gene | 33362 | 39771 | . | + | . | ID=PB.2;gene=Uncharacterised protein PB.2;gene... | PB.2 |
| 117 | pseudochrom_1 | PacBio | gene | 50782 | 71089 | . | - | . | ID=PB.3;function=ribose phosphate diphosphokin... | PB.3 |
| 259 | pseudochrom_1 | AUGUSTUS | gene | 63017 | 66568 | . | + | . | ID=g5;function=Bifunctional DNA N-glycosylase ... | g5 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 761020 | scaffold_12318 | PacBio | gene | 1232 | 15217 | . | - | . | ID=PB.8896;gene=Uncharacterised protein PB.889... | PB.8896 |
| 761038 | scaffold_12330 | AGAT | gene | 6041 | 13951 | . | + | . | ID=r2_g4229;gene=Uncharacterised protein r2_g4... | r2_g4229 |
| 761046 | scaffold_12335 | PacBio | gene | 1064 | 1939 | . | + | . | ID=PB.8840;gene=Uncharacterised protein PB.884... | PB.8840 |
| 761050 | scaffold_12338 | tRNAscan-SE | gene | 19482 | 19554 | 29.5 | - | . | ID=tRNA905;gene_id=scaffold_12338.tRNA1-TyrGTA... | tRNA905 |


In [5]:
nk = read_aln('/Volumes/scratch/pycnogonum/genome/draft/nkx/nk_genomic.m8')

In [6]:
nk['species'] = nk['query'].str.split("_").str[:2].str.join(" ")
nk['symbol'] = nk['query'].str.split("_").str[2:].str.join("_")

In [7]:
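def hit_to_locus(hit, gff):
    # genes on the same sequence that satisfy both boundary checks for this hit
    same_chrom = gff['seqid'] == hit['target']
    within_borders = (gff['start'] <= hit['t_start']) & (gff['end'] >= hit['t_end'])

    overlap = gff[same_chrom & within_borders]
    if overlap.empty:
        # no annotated gene at this position
        return None
    # if several genes match, keep the first one in file order
    return overlap['gene'].values[0]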
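
One detail worth checking in the boundary test is the order of the target coordinates: we assume here that hits on the reverse strand can be reported with `t_start` larger than `t_end`, in which case the same comparison means something different than for forward hits. The following sketch (plain Python, hypothetical helper names and hit coordinates) separates "the gene contains the hit" from "the gene overlaps the hit" after putting the coordinates in order:

```python
def normalise_interval(start, end):
    """Return (low, high) regardless of the order the coordinates came in."""
    return (start, end) if start <= end else (end, start)


def contains(gene_start, gene_end, hit_start, hit_end):
    """True if the hit lies entirely inside the gene (inclusive coordinates)."""
    low, high = normalise_interval(hit_start, hit_end)
    return gene_start <= low and gene_end >= high


def overlaps(gene_start, gene_end, hit_start, hit_end):
    """True if gene and hit share at least one position (inclusive coordinates)."""
    low, high = normalise_interval(hit_start, hit_end)
    return gene_start <= high and gene_end >= low


# Gene span of PB.1 from the annotation table; the hit coordinates are invented.
gene = (20074, 33326)
print(contains(*gene, 25000, 24800))   # reverse-strand hit inside the gene
print(contains(*gene, 33300, 33400))   # hit that runs past the gene end
print(overlaps(*gene, 33300, 33400))   # ...but still overlaps it
```

Which of the two tests is appropriate depends on whether partial hits at gene borders should count as candidates.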

In [8]:
hits = nk[nk['eval'] < 1e-20].copy()
hits['gene'] = hits.apply(hit_to_locus, axis=1, gff=gff)

In [9]:
hits['gene'].value_counts()

gene
g8689      84
PB.3341    83
PB.3347    83
PB.1018    64
g10110     64
PB.1017    64
PB.1005    41
g6828      39
PB.1003    36
g8691      31
g6762      31
PB.8153    21
PB.7934    20
g1625       6
g1627       6
g10112      4
Name: count, dtype: int64

In [10]:
candidates = hits['gene'].dropna().unique()  # ignore hits without an assigned gene

In [11]:
plit_peptides = {}
with open('/Volumes/project/pycnogonum/genome/draft/transcripts.fa.transdecoder.pep', 'r') as f:
    for line in f:
        if line.startswith(">"):
            # keep only the sequence ID: drop ">" and everything after the first whitespace
            gene = line[1:].split()[0]
            plit_peptides[gene] = ""
        else:
            # sequences may span several lines; append each one to the current record
            plit_peptides[gene] += line.strip()
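
The same parsing logic can be wrapped in a small generator, which makes it reusable and easy to try out on an in-memory example. This is only a sketch; the function name and the two records are made up and do not come from the TransDecoder output:

```python
import io


def read_fasta(handle):
    """Yield (identifier, sequence) pairs from an open FASTA handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip empty lines
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            fields = line[1:].split()
            name, chunks = (fields[0] if fields else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


# Hypothetical records, for illustration only.
example = io.StringIO(
    ">PB.1003.1.p1 some description\nMKT\nLLA\n"
    ">g1625.t1.p1 another description\nMSS*\n"
)
peptides = dict(read_fasta(example))
print(peptides)
```

With a real file, `read_fasta(open(path))` would be used in the same way.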

In [12]:
candidate_sequences = {}
for gene in plit_peptides:
    for c in candidates:
        if gene.startswith(c):
            candidate_sequences[gene] = plit_peptides[gene]
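
Plain prefix matching is permissive: a candidate such as `g1625` is also a prefix of a gene like `g16250`. If the peptide IDs consist of the gene ID followed by a `.` and a transcript/ORF suffix (an assumption about the naming scheme that should be checked against the actual file), a stricter test could look like this sketch, where the peptide IDs are invented:

```python
def belongs_to(peptide_id, gene_id, sep="."):
    """True if peptide_id is gene_id itself or gene_id followed by sep and a suffix."""
    return peptide_id == gene_id or peptide_id.startswith(gene_id + sep)


candidates = ["g1625", "PB.1003"]
# Hypothetical peptide IDs, two of which only share a prefix with a candidate.
peptide_ids = ["g1625.t1.p1", "g16250.t1.p1", "PB.1003.2.p1", "PB.10031.1.p1"]

matched = [p for p in peptide_ids if any(belongs_to(p, c) for c in candidates)]
print(matched)
```

Only the first and third IDs pass the stricter test, whereas `startswith` alone would accept all four.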

In [13]:
with open('/Volumes/scratch/pycnogonum/genome/draft/nkx/nk_candidates.fa', 'w') as f:
    for gene, seq in candidate_sequences.items():
        f.write(f">{gene}\n{seq}\n")

Now we can concatenate all the sequences and align them.

```bash
cat nk_chelicerates.fa nk_candidates.fa > nk.fa
```
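
Since two files are merged here, it can be worth confirming that no sequence name occurs twice, as duplicate names can cause trouble for downstream tools. A minimal sketch, with hypothetical helper names and made-up headers standing in for the contents of `nk.fa`:

```python
from collections import Counter


def fasta_ids(lines):
    """Collect sequence IDs (text between '>' and the first whitespace)."""
    ids = []
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            ids.append(fields[0] if fields else "")
    return ids


def duplicated(ids):
    """Return the IDs that occur more than once, sorted."""
    return sorted(name for name, n in Counter(ids).items() if n > 1)


# Hypothetical FASTA lines, for illustration only.
lines = [
    ">Genus_species_Msx1 some description", "MKR",
    ">PB.1003.1.p1", "MSS",
    ">Genus_species_Msx1", "MKR",
]
print(duplicated(fasta_ids(lines)))
```

For the real file, `open("nk.fa")` can be passed in place of `lines`.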

First, we run a quick-and-dirty alignment so that we can cut out the homeodomain regions from the
_P. litorale_ sequences:

```bash
mafft --thread 4 nk.fa > nk_aligned.fasta
```

Now we can extract the homeodomain regions. This was done manually in Jalview, by deleting the
columns to the left and right of the homeodomain. Two _P. litorale_ isoforms that broke up the
homeodomain were excluded completely. Additionally, the _Hhex_ sequences were misaligned, which led
to a break in the homeodomain block. We trimmed up to the beginning of the homeodomain and included
the break, hoping that a realignment would fix this.
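
The trimming itself was done by hand, but the operation is simple enough to sketch in code: keep a fixed range of alignment columns for every sequence and drop sequences that are left with gaps only. The function names, the toy alignment and the column range below are all hypothetical and do not reflect the coordinates used in Jalview:

```python
def trim_columns(alignment, start, end):
    """Keep alignment columns [start, end) for every sequence (0-based, end exclusive)."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) > 1:
        raise ValueError("sequences differ in length; is this an alignment?")
    return {name: seq[start:end] for name, seq in alignment.items()}


def drop_gap_only(alignment, gap="-"):
    """Remove sequences that contain nothing but gap characters."""
    return {name: seq for name, seq in alignment.items() if seq.strip(gap)}


# Toy alignment, invented for this example.
toy = {
    "seq_a": "--MKRARTAF--",
    "seq_b": "MMMKRPRTAYLL",
    "seq_c": "MM--------LL",
}
trimmed = drop_gap_only(trim_columns(toy, 2, 10))
print(trimmed)
```

Here `seq_c` has no residues inside the retained block and is removed, while the other two sequences are cut down to the eight retained columns.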

The resulting trimmed alignment was saved as `homeobox.fasta` and re-aligned with MAFFT in L-INS-i
mode (probably the most accurate option, if a little slower):

```bash
mafft --thread 4 --maxiterate 1000 --localpair homeobox.fasta > homeobox_aligned.fasta
```

The re-alignment improved the alignment quality, and we could safely trim the _P. litorale_
positions that mapped to the misplaced _Hhex_ parts. The resulting alignment was used as input for
IQ-TREE 2:

```bash
module load iqtree
/usr/bin/time iqtree2 -s ./homeobox_aligned.fasta -B 1000 -T 4
```