# Visualising the HRO cluster

## Finding HRO gene candidates

We compiled a list of homeobox sequences from genes in the HRO family from the [Aase-Remedios _et
al._ (2023)](https://doi.org/10.1093/molbev/msad239) analysis of the spider homeobox gene
repertoire. We then searched the _P. litorale_ draft genome with this list using `mmseqs`:

```bash
cd /lisc/scratch/zoology/pycnogonum/genome/draft/hro
module load mmseqs2
M8FORMAT="query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits,qlen"
mmseqs easy-search hro_chelicerates.fa ../draft.fasta hro_genomic.m8 tmp --format-output "$M8FORMAT" --threads 4
```

First, we'll have to read the GFF file so that we can locate the loci that overlap with our genomic
hits.

In [1]:
from tqdm import tqdm

import pandas as pd
import numpy as np

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

from matplotlib import pyplot as plt
from matplotlib.patches import ConnectionPatch

In [2]:
def read_gff(loc):
    """Read a GFF3 file into a DataFrame with the nine standard GFF columns."""
    # skiprows=4 assumes that the file starts with exactly four header lines
    gff = pd.read_csv(loc, sep="\t", header=None, skiprows=4)
    gff_columns = ['seqid', 'source', 'type', 'start', 'end', 'score', 'strand', 'phase', 'attributes']
    gff.columns = gff_columns
    return gff

def read_aln(m8, id_sep=None):
    """Read the tabular mmseqs output; column order follows M8FORMAT above."""
    # read file, name columns
    hro = pd.read_csv(m8, sep="\t", header=None)
    m8_columns = ['query', 'target', 'seq_id', 'ali_len', 'no_mism', 'no_go',
                'q_start', 'q_end', 't_start', 't_end', 'eval', 'bit', 'qlen']
    hro.columns = m8_columns
    # optionally trim the query to just the ID (the second field after splitting on id_sep)
    if id_sep is not None:
        hro["query"] = hro["query"].str.split(id_sep).str[1]
    return hro

In [3]:
gff_loc = "/Volumes/project/pycnogonum/paper/zenodo/results/merged_sorted_named_dedup_flagged.gff3"
tmp = read_gff(gff_loc)
gff = tmp[tmp['type'] == 'gene'].copy()
gff['gene'] = gff['attributes'].str.split("ID=").str[1].str.split(";").str[0]
del tmp

In [4]:
hro = read_aln('/Volumes/scratch/pycnogonum/genome/draft/hro/hro_genomic.m8')

In [5]:
hro['species'] = hro['query'].str.split("_").str[:2].str.join(" ")
hro['symbol'] = hro['query'].str.split("_").str[2:].str.join("_")

In [6]:
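def hit_to_locus(hit, gff):
    """Return the ID of the first annotated gene on the hit's sequence that spans the hit."""
    same_chrom = gff['seqid'] == hit['target']
    within_borders = (gff['start'] <= hit['t_start']) & (gff['end'] >= hit['t_end'])

    overlap = gff[same_chrom & within_borders]
    # raises an IndexError if no annotated gene spans the hit
    return overlap['gene'].values[0]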
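
For hits that might lie outside any annotated gene, or that are reported with the start coordinate larger than the end coordinate, a more defensive lookup can help. The following is a minimal sketch rather than the function used in this analysis: `hit_to_loci` and the toy data are hypothetical, and whether reverse-strand hits are reported with start > end in this search is an assumption.

```python
import pandas as pd

def hit_to_loci(hit, genes):
    """Return the IDs of all genes on the hit's sequence that overlap the hit."""
    # order the hit coordinates so that hits with start > end are handled too
    lo, hi = sorted((hit["t_start"], hit["t_end"]))
    same_seq = genes["seqid"] == hit["target"]
    overlaps = (genes["start"] <= hi) & (genes["end"] >= lo)
    # an empty list means that no annotated gene overlaps the hit
    return genes.loc[same_seq & overlaps, "gene"].tolist()

# toy data, invented for illustration only
toy_genes = pd.DataFrame({
    "seqid": ["chr1", "chr1", "chr2"],
    "start": [100, 500, 100],
    "end": [200, 900, 200],
    "gene": ["gA", "gB", "gC"],
})
forward_hit = {"target": "chr1", "t_start": 150, "t_end": 180}
reverse_hit = {"target": "chr1", "t_start": 950, "t_end": 880}
orphan_hit = {"target": "chr2", "t_start": 300, "t_end": 400}
```

Returning a list instead of a single ID makes hits without a gene (empty list) and hits that overlap several genes explicit, at the cost of an extra step to flatten the result.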

In [7]:
hits = hro[hro['eval'] < 1e-20].copy()
hits['gene'] = hits.apply(hit_to_locus, axis=1, gff=gff)

In [8]:
candidates = hits['gene'].unique()

In [9]:
plit_peptides = {}
with open('/Volumes/project/pycnogonum/genome/draft/transcripts.fa.transdecoder.pep', 'r') as f:
    for line in f:
        if line.startswith(">"):
            # keep only the sequence ID (text between ">" and the first whitespace)
            gene = line[1:].split()[0]
            plit_peptides[gene] = ""
        else:
            # sequence lines are appended to the most recent header
            plit_peptides[gene] += line.strip()

In [10]:
candidate_sequences = {}
for gene in plit_peptides:
    for c in candidates:
        # substring match: a candidate such as g828 also matches longer IDs such as g8288.t1.p1
        if c in gene:
            candidate_sequences[gene] = plit_peptides[gene]
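
A stricter alternative to the substring test is to compare gene IDs exactly. The sketch below is not what was run here; it assumes that peptide IDs have the form `<gene>.t<N>.p<N>` (as in `g8288.t1.p1`), and `gene_id` and `select_candidates` are hypothetical helpers. Note that under this scheme a prefixed ID such as `r2_g828.t2.p1` would no longer be matched by `g828`.

```python
def gene_id(peptide_id):
    """Strip the transcript/peptide suffix, e.g. 'g828.t1.p1' -> 'g828'."""
    return peptide_id.split(".")[0]

def select_candidates(peptides, candidate_genes):
    """Keep only peptides whose gene ID is exactly one of the candidate genes."""
    wanted = set(candidate_genes)
    return {pid: seq for pid, seq in peptides.items() if gene_id(pid) in wanted}

# toy data, invented for illustration only
toy_peptides = {"g828.t1.p1": "MKR", "g8288.t1.p1": "MAA", "g845.t2.p1": "MLL"}
selected = select_candidates(toy_peptides, ["g828", "g845"])
```

With exact matching, `g8288.t1.p1` is excluded even though its ID contains `g828`.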

In [11]:
with open('/Volumes/scratch/pycnogonum/genome/draft/hro/hro_candidates.fa', 'w') as f:
    for gene, seq in candidate_sequences.items():
        f.write(f">{gene}\n{seq}\n")

Now we can concatenate all the sequences and align them.

```bash
cat hro_chelicerates.fa hro_candidates.fa > hro.fa
```

Before continuing, check whether the sequences were concatenated correctly: if
`hro_chelicerates.fa` does not end with a newline character, the header of the first candidate
sequence will be appended to the last sequence line of `hro_chelicerates.fa` in `hro.fa`.
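
One way to avoid the missing-newline problem altogether is to concatenate the files in Python and add the newline where needed. This is a minimal sketch, not part of the original workflow; `concat_fasta` and `count_headers` are hypothetical helper names.

```python
from pathlib import Path

def concat_fasta(inputs, output):
    """Concatenate FASTA files, making sure every file ends with a newline."""
    with open(output, "w") as out:
        for path in inputs:
            text = Path(path).read_text()
            # add the missing trailing newline so the next header starts on its own line
            if text and not text.endswith("\n"):
                text += "\n"
            out.write(text)

def count_headers(path):
    """Count FASTA records by counting lines that start with '>'."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

Calling `concat_fasta(["hro_chelicerates.fa", "hro_candidates.fa"], "hro.fa")` would replace the `cat` step; afterwards, `count_headers("hro.fa")` should equal the sum of the header counts of the two input files.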

Additionally, we need to check for synonyms here: where the same combination of species and gene
name occurred more than once, I collapsed identical sequences. Where the sequences differed, I
appended lowercase "i"s to the names in order to keep the additional variation.
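
The collapsing was done by hand, but the rule can be sketched in code. The function below is a hypothetical illustration, assuming that records are `(name, sequence)` pairs and that one "i" is appended per additional distinct sequence; the exact naming used in the final files may differ.

```python
def deduplicate(records):
    """Collapse identical duplicates; rename differing ones by appending 'i's."""
    unique = {}
    for name, seq in records:
        new_name = name
        while new_name in unique:
            if unique[new_name] == seq:
                break  # same name and same sequence: collapse into the existing entry
            new_name += "i"  # same name but different sequence: try the next suffix
        unique[new_name] = seq
    return unique

# toy data, invented for illustration only
toy_records = [
    ("Species_one_Otp", "RHRTTF"),
    ("Species_one_Otp", "RHRTTF"),  # identical duplicate, collapsed
    ("Species_one_Otp", "RHRTSF"),  # differing sequence, kept with an "i" suffix
]
collapsed = deduplicate(toy_records)
```

Here `collapsed` holds two entries, `Species_one_Otp` and `Species_one_Otpi`.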

First, we run a quick-and-dirty alignment so that we can cut out the homeodomain regions from the
_P. litorale_ sequences:

```bash
mafft --thread 4 hro.fa > hro_aligned.fasta
```

Now we can extract the homeodomain regions. This was done manually in Jalview, by deleting columns
left and right of the homeobox region. Some _P. litorale_ sequences that broke the homeobox in the
alignment (g8288.t1.p1, g8288.t2.p1, g8457.t1.p1, g8458.t1.p1, r2_g828.t2.p1, g8280.t1.p1,
g8459.t1.p1) were extracted and BLASTed against NCBI `nr` as a sanity check. All of them returned
hits to non-homeobox genes (potassium channel, low e-value hypothetical proteins, RNA polymerase,
granulin, low-quality solute carrier family, no good hit, respectively), and were therefore removed
from the alignment.

The resulting trimmed alignment was saved as `homeobox.fasta` and re-aligned with MAFFT in L-INS-i
mode (probably the most accurate mode, but a little slower):

```bash
mafft --thread 4 --maxiterate 1000 --localpair homeobox.fasta > homeobox_aligned.fasta
```

The resulting alignment was used as input for IQ-TREE 2.

```bash
module load iqtree
/usr/bin/time iqtree2 -s ./homeobox_aligned.fasta -B 1000 -T 4
```

The resulting tree was visualised in TreeViewer and rooted manually to separate the gene families.
This yielded the following assignments:

In [13]:
names = {
    'g11023': 'Otp',
    'g828': 'Rx',
    'g845': 'Isl'
}

# select the assigned genes by ID; an explicit list is a safe indexer for .loc
hro_cluster = gff.set_index('gene').loc[list(names)]
hro_cluster['name'] = hro_cluster.index.map(names)
hro_cluster.sort_values(['seqid', 'start'], ascending=[False, True])

| gene | seqid | source | type | start | end | score | strand | phase | attributes | name |
|---|---|---|---|---|---|---|---|---|---|---|
| g11023 | pseudochrom_9 | AUGUSTUS | gene | 5390926 | 5425657 | . | + | . | ID=g11023;function=sequence-specific DNA bindi... | Otp |
| g828 | pseudochrom_13 | AUGUSTUS | gene | 2323194 | 2375852 | . | - | . | ID=g828;function=sequence-specific DNA binding... | Rx |
| g845 | pseudochrom_13 | AUGUSTUS | gene | 2972219 | 3044387 | . | + | . | ID=g845;function=binding. It is involved in th... | Isl |
