# Visualising the NK cluster

## Finding NK gene candidates

We compiled a list of NK-family homeobox genes from the [Aase-Remedios _et al._
(2023)](https://doi.org/10.1093/molbev/msad239) analysis of the spider homeobox gene repertoire, and
searched the _P. litorale_ draft genome with these sequences using `mmseqs`:

```bash
cd /lisc/scratch/zoology/pycnogonum/genome/draft/nkx
module load mmseqs2
M8FORMAT="query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits,qlen"
mmseqs easy-search nk_chelicerates.fa ../draft.fasta nk_genomic.m8 tmp --format-output $M8FORMAT --threads 4
```

First, we'll have to read the GFF file so that we can locate the loci that overlap with our genomic
hits.

In [1]:
from tqdm import tqdm

import pandas as pd
import numpy as np

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

from matplotlib import pyplot as plt
from matplotlib.patches import ConnectionPatch

In [2]:
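def read_gff(loc):
    # skiprows=4 skips the first four lines, assumed here to be header/comment lines;
    # the remaining lines are read as the nine tab-separated GFF columns
    gff = pd.read_csv(loc, sep="\t", header=None, skiprows=4)
    gff_columns = ['seqid', 'source', 'type', 'start', 'end', 'score', 'strand', 'phase', 'attributes']
    gff.columns = gff_columns
    return gff

def read_aln(m8, id_sep=None):
    # read the tabular search output and name the columns;
    # the order must match the --format-output string passed to mmseqs
    nk = pd.read_csv(m8, sep="\t", header=None)
    m8_columns = ['query', 'target', 'seq_id', 'ali_len', 'no_mism', 'no_go',
                'q_start', 'q_end', 't_start', 't_end', 'eval', 'bit', 'qlen']
    nk.columns = m8_columns
    # optionally trim the query to just the ID (second field after splitting on id_sep)
    if id_sep is not None:
        nk["query"] = nk["query"].str.split(id_sep).str[1]
    return nk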

In [3]:
gff_loc = "/Volumes/project/pycnogonum/paper/zenodo/results/merged_sorted_named_dedup_flagged.gff3"
tmp = read_gff(gff_loc)
gff = tmp[tmp['type'] == 'gene'].copy()
gff['gene'] = gff['attributes'].str.split("ID=").str[1].str.split(";").str[0]
del tmp
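
The `ID=` extraction above relies on two chained string splits. As a minimal sketch of a more general alternative, assuming the attributes column follows the usual GFF3 `key=value;key=value` layout, the whole field can be parsed into a dictionary. The helper name `parse_attributes` is hypothetical and the example string is illustrative.

```python
def parse_attributes(field):
    """Split a GFF3 attributes string into a {key: value} dictionary.

    Assumes semicolon-separated key=value pairs; items without "=" are skipped.
    """
    pairs = (item.split("=", 1) for item in field.strip().split(";") if "=" in item)
    return {key: value for key, value in pairs}


# Illustrative attributes string (hypothetical, shortened)
attrs = parse_attributes("ID=PB.1005;function=Homeodomain;gene=NK5")
print(attrs["ID"], attrs["gene"])
```

In the notebook this could be applied with `gff['attributes'].map(parse_attributes)`, which would expose the other keys (such as `function`) without further string splitting.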

In [4]:
nk = read_aln('/Volumes/scratch/pycnogonum/genome/draft/nkx/nk_genomic.m8')

In [5]:
nk['species'] = nk['query'].str.split("_").str[:2].str.join(" ")
nk['symbol'] = nk['query'].str.split("_").str[2:].str.join("_")

In [6]:
def hit_to_locus(hit, gff):
    same_chrom = gff['seqid'] == hit['target']
    within_borders = (gff['start'] <= hit['t_start']) & (gff['end'] >= hit['t_end'])

    overlap = gff[same_chrom & within_borders]
    return overlap['gene'].values[0]
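
Note that `hit_to_locus` requires the hit to lie entirely within a gene and raises an `IndexError` when no gene matches. The following is a minimal sketch of a more permissive variant that tests for any overlap, tolerates hits whose target start is larger than the target end (which may happen for reverse-strand hits), and returns an empty list instead of failing. The function name `hit_to_loci` and the toy coordinates are hypothetical; this is not the function used for the results below.

```python
import pandas as pd


def hit_to_loci(hit, genes):
    """Return the IDs of all genes on the hit's sequence that overlap the hit."""
    # Normalise the interval in case the target coordinates are reversed
    lo, hi = sorted((hit["t_start"], hit["t_end"]))
    same_seq = genes["seqid"] == hit["target"]
    # Two closed intervals overlap when each starts before the other ends
    overlaps = (genes["start"] <= hi) & (genes["end"] >= lo)
    return genes.loc[same_seq & overlaps, "gene"].tolist()


# Toy data (hypothetical coordinates, not taken from the real annotation)
toy_genes = pd.DataFrame({
    "seqid": ["chr1", "chr1", "chr2"],
    "start": [100, 500, 100],
    "end": [300, 900, 300],
    "gene": ["gA", "gB", "gC"],
})
toy_hits = pd.DataFrame({
    "target": ["chr1", "chr1", "chr2"],
    "t_start": [250, 950, 120],
    "t_end": [150, 980, 350],
})
toy_hits["genes"] = [hit_to_loci(hit, toy_genes) for _, hit in toy_hits.iterrows()]
print(toy_hits)
```

The first toy hit has reversed coordinates, the second falls outside every gene, and the third only partially overlaps its gene; a strict containment test would handle none of these the same way.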

In [7]:
hits = nk[nk['eval'] < 1e-10].copy()
hits['gene'] = hits.apply(hit_to_locus, axis=1, gff=gff)

In [8]:
hits['gene'].value_counts()

gene
g2646        432
PB.1005      426
g8689        376
g8691        347
PB.1017      347
PB.1018      334
g10110       334
PB.1003      326
PB.7934      300
PB.8153      287
PB.3341      218
g10112       213
g9720        209
PB.1830      171
PB.3347      156
g4363        148
g9718        147
g6828        147
PB.8616      124
g9723        109
g9721        109
PB.8615       97
g1625         73
g1627         73
PB.3793       61
g6864         35
g6762         33
PB.8617       26
PB.1021        5
g1744          5
g1756          5
g11364         4
at_DN2391      4
g9725          1
Name: count, dtype: int64
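
The counts show how many query sequences hit each locus, but not which NK family those queries belong to. As a rough first look, one could report the best-scoring query per locus. This is a minimal sketch on toy data with hypothetical values, assuming the `gene`, `symbol` and `bit` columns defined above; it gives only a first guess and does not replace the phylogenetic analysis that follows.

```python
import pandas as pd

# Toy hits table (hypothetical values) with the same column names as `hits`
toy_hits = pd.DataFrame({
    "gene": ["gA", "gA", "gA", "gB", "gB"],
    "symbol": ["Msx", "Msx", "NK1", "Tlx", "Tlx"],
    "bit": [210.0, 190.0, 80.0, 150.0, 140.0],
})

# For each locus, keep the row of the query with the highest bit score
best_rows = toy_hits.groupby("gene")["bit"].idxmax()
best = toy_hits.loc[best_rows, ["gene", "symbol", "bit"]]
print(best)
```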

In [9]:
candidates = hits['gene'].unique()

In [10]:
plit_peptides = {}
with open('/Volumes/project/pycnogonum/genome/draft/transcripts.fa.transdecoder.pep', 'r') as f:
    for line in f:
        if line.startswith(">"):
            # keep only the sequence ID (text before the first space), without the leading ">"
            gene = line.split(" ")[0][1:]
            plit_peptides[gene] = ""
        else:
            # sequence lines are concatenated onto the most recent header
            plit_peptides[gene] += line.strip()

In [11]:
candidate_sequences = {}
for gene in plit_peptides:
    for c in candidates:
        if gene.startswith(c):
            candidate_sequences[gene] = plit_peptides[gene]
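
A plain `startswith` test can also pick up unrelated loci whose ID merely begins with the same characters (for example, a candidate `g1625` would match a peptide from `g16250`). Below is a minimal sketch of a stricter test that requires a separator after the gene ID. The helper `belongs_to`, the separator set, and the peptide IDs are hypothetical; the real ID format should be checked before adopting this.

```python
def belongs_to(transcript_id, gene_id, separators=(".", "_")):
    """True if transcript_id is gene_id itself or gene_id followed by a separator."""
    if transcript_id == gene_id:
        return True
    return any(transcript_id.startswith(gene_id + sep) for sep in separators)


# Hypothetical peptide IDs, for illustration only
peptide_ids = ["g1625.t1.p1", "g16250.t1.p1", "PB.1005.1.p1", "PB.10051.2.p1"]
toy_candidates = ["g1625", "PB.1005"]

strict = [p for p in peptide_ids if any(belongs_to(p, c) for c in toy_candidates)]
loose = [p for p in peptide_ids if any(p.startswith(c) for c in toy_candidates)]
print(strict)
print(loose)
```

With these toy IDs the loose test keeps all four peptides, while the strict test keeps only the two that belong to the candidate loci.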

In [12]:
with open('/Volumes/scratch/pycnogonum/genome/draft/nkx/nk_candidates.fa', 'w') as f:
    for gene, seq in candidate_sequences.items():
        f.write(f">{gene}\n{seq}\n")

Now we can concatenate all the sequences and align them.

```bash
cat nk_chelicerates.fa > nk.fa
cat nk_candidates.fa >> nk.fa
```

First, a quick-and-dirty alignment, so that we can cut out the homeodomain regions from the _P.
litorale_ sequences:

```bash
mafft --thread 4 nk.fa > nk_aligned.fasta
```

Now we can extract the homeodomain regions. This was done manually in Jalview, by deleting the
columns to the left and right of the homeodomain block. Two _P. litorale_ isoforms in which the
homeodomain was interrupted were excluded completely. Additionally, the _Hhex_ sequences were
misaligned, which introduced a break in the homeodomain block. We trimmed up to the beginning of the
homeodomain and kept the break, hoping that a realignment would fix this.
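
The trimming itself was done by hand in Jalview. For readers who prefer a scripted equivalent, here is a minimal sketch of cutting a column block out of an alignment held as a dictionary of equal-length gapped strings. The function `slice_columns`, the toy sequences, and the column indices are all hypothetical; in practice the block boundaries would be read off the alignment viewer.

```python
def slice_columns(alignment, start, end):
    """Keep alignment columns start..end (0-based, end exclusive) for every sequence."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    return {name: seq[start:end] for name, seq in alignment.items()}


# Toy alignment (hypothetical sequences); the block of interest spans columns 3 to 9
toy_alignment = {
    "seqA": "--MKRPRTAF--",
    "seqB": "MSLKRARTAFSE",
}
trimmed = slice_columns(toy_alignment, 3, 10)
for name, seq in trimmed.items():
    print(f">{name}\n{seq}")
```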

The resulting trimmed alignment was saved as `homeobox.fasta` and re-aligned with MAFFT in L-INS-i
mode, which the MAFFT documentation describes as probably the most accurate option, at the cost of
some speed:

```bash
mafft --thread 4 --maxiterate 1000 --localpair homeobox.fasta > homeobox_aligned.fasta
```

The re-alignment improved the alignment quality, and we could safely trim the _P. litorale_
positions that mapped to the misplaced _Hhex_ parts. The resulting alignment was used as input for
IQ-TREE 2, with 2000 ultrafast bootstrap replicates (`-B 2000`):

```bash
module load iqtree
/usr/bin/time iqtree2 -s ./homeobox_aligned.fasta -B 2000
```

The resulting tree was visualised in TreeViewer. The tree was rooted manually using the _labial_
sequences as a guide. Ultimately, a rooting that separated the NK/NK2 sequences was selected. In the
process of tree analysis, we realised that two of the sequences in the _Hlx_ clade are actually
_Dbx_ sequences. This happened because the alignment was restricted to the homeobox, and the _Dbx_
homeobox is very similar to the _Hlx_ homeobox. Including longer sequences in the alignment
immediately points out that the following sequences are _Dbx_ genes:



In [13]:
names = {
'PB.1017': 'Msx',
'PB.1018': 'Msx',
'g10110': 'Msx',
'g10112': 'Msx',
'PB.1003': 'NK1',
'g6864': 'NK7',
'PB.1005': 'NK5',
'PB.7934': 'NK3',
'g8689': 'NK4',
'PB.3347': 'NK2.2',
'PB.3341': 'NK2.1',
'g6762': 'NK6',
'PB.8153': 'Noto',
'g1625': 'Emx',
'g1627': 'Emx',
'PB.1021': 'Hhex',
'g6828': 'Lbx',
'g8691': 'Tlx',
'g1744': 'Hlx',
'g1756': 'Hlx',
}

In [14]:
nk_cluster = gff.set_index('gene').loc[names.keys()]
nk_cluster['name'] = nk_cluster.index.map(names)
keep = ['seqid', 'name', 'attributes']
nk_cluster.sort_values(['seqid', 'start'], ascending=[False, True])[keep]

| gene | seqid | name | attributes |
|---|---|---|---|
| PB.1005 | pseudochrom_6 | NK5 | ID=PB.1005;function=Homeodomain;gene=NK5;gene_... |
| PB.1003 | pseudochrom_6 | NK1 | ID=PB.1003;function=sequence-specific DNA bind... |
| PB.1017 | pseudochrom_6 | Msx | ID=PB.1017;function=sequence-specific DNA bind... |
| g10110 | pseudochrom_6 | Msx | ID=g10110;function=sequence-specific DNA bindi... |
| PB.1018 | pseudochrom_6 | Msx | ID=PB.1018;function=sequence-specific DNA bind... |
| g10112 | pseudochrom_6 | Msx | ID=g10112;function=Homeodomain;gene=Msx-g10112... |
| PB.1021 | pseudochrom_6 | Hhex | ID=PB.1021;function=hepatic duct development;g... |
| PB.8153 | pseudochrom_51 | Noto | ID=PB.8153;function=Homeodomain;gene=Noto;gene... |
| g8689 | pseudochrom_48 | NK4 | ID=g8689;function=sequence-specific DNA bindin... |
| PB.7934 | pseudochrom_48 | NK3 | ID=PB.7934;function=homeobox;gene=NK3;gene_id=... |
