# Visualising the SINE cluster

## Finding SINE gene candidates

We compiled a list of homeobox genes in the SINE family from the [Aase-Remedios _et al._
(2023)](https://doi.org/10.1093/molbev/msad239) analysis of the spider homeobox gene repertoire. We
then searched the _P. litorale_ draft genome with these sequences using `mmseqs`:

```bash
cd /lisc/scratch/zoology/pycnogonum/genome/draft/sine
module load mmseqs2
M8FORMAT="query,target,fident,alnlen,mismatch,gapopen,qstart,qend,tstart,tend,evalue,bits,qlen"
mmseqs easy-search sine_chelicerates.fa ../draft.fasta sine_genomic.m8 tmp --format-output $M8FORMAT --threads 4
```

First, we'll have to read the GFF file so that we can locate the loci that overlap with our genomic
hits.

In [1]:
from tqdm import tqdm

import pandas as pd
import numpy as np

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

from matplotlib import pyplot as plt
from matplotlib.patches import ConnectionPatch

In [2]:
def read_gff(loc):
    gff = pd.read_csv(loc, sep="\t", header=None, skiprows=4)
    gff_columns = ['seqid', 'source', 'type', 'start', 'end', 'score', 'strand', 'phase', 'attributes']
    gff.columns = gff_columns
    return gff

# read file, name columns
def read_aln(m8, id_sep=None):
    sine = pd.read_csv(m8, sep="\t", header=None)
    m8_columns = ['query', 'target', 'seq_id', 'ali_len', 'no_mism', 'no_go',
                'q_start', 'q_end', 't_start', 't_end', 'eval', 'bit', 'qlen']
    sine.columns = m8_columns
    # trim the query to just the ID
    if id_sep is not None:
        sine["query"] = sine["query"].str.split(id_sep).str[1]
    return sine

In [3]:
gff_loc = "/Volumes/project/pycnogonum/paper/zenodo/results/merged_sorted_named_dedup_flagged.gff3"
tmp = read_gff(gff_loc)
gff = tmp[tmp['type'] == 'gene'].copy()
gff['gene'] = gff['attributes'].str.split("ID=").str[1].str.split(";").str[0]
del tmp

In [4]:
sine = read_aln('/Volumes/scratch/pycnogonum/genome/draft/sine/sine_genomic.m8')

In [5]:
sine['species'] = sine['query'].str.split("_").str[:2].str.join(" ")
sine['symbol'] = sine['query'].str.split("_").str[2:].str.join("_")

In [6]:
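def hit_to_locus(hit, gff):
    # reverse-strand hits are reported with t_start > t_end, so sort the coordinates
    hit_start, hit_end = sorted((hit['t_start'], hit['t_end']))
    same_chrom = gff['seqid'] == hit['target']
    # a gene overlaps the hit if it starts before the hit ends and ends after it starts
    overlaps_hit = (gff['start'] <= hit_end) & (gff['end'] >= hit_start)

    overlap = gff[same_chrom & overlaps_hit]
    # assumes every hit overlaps at least one gene model; take the first match
    return overlap['gene'].values[0]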

In [7]:
hits = sine[sine['eval'] < 1e-20].copy()
hits['gene'] = hits.apply(hit_to_locus, axis=1, gff=gff)

In [8]:
hits[hits['target'] == 'pseudochrom_3']

| | query | target | seq_id | ali_len | no_mism | no_go | q_start | q_end | t_start | t_end | eval | bit | qlen | species | symbol | gene |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | T_tridentatus_Six3/6 | pseudochrom_3 | 0.983 | 180 | 1 | 0 | 1 | 60 | 7711226 | 7711047 | 1.348000e-35 | 131 | 60 | T tridentatus | Six3/6 | g5264 |
| 26 | P_tepidariorum_Six1/2 | pseudochrom_3 | 0.728 | 177 | 16 | 0 | 1 | 59 | 7711226 | 7711050 | 2.638000e-24 | 98 | 60 | P tepidariorum | Six1/2 | g5264 |
| 34 | T_antipodiana_Six1/2 | pseudochrom_3 | 0.694 | 177 | 18 | 0 | 1 | 59 | 7711226 | 7711050 | 1.772000e-23 | 96 | 60 | T antipodiana | Six1/2 | g5264 |
| 40 | A_bruennichi_Six3/6 | pseudochrom_3 | 0.983 | 180 | 1 | 0 | 1 | 60 | 7711226 | 7711047 | 4.787000e-35 | 129 | 60 | A bruennichi | Six3/6 | g5264 |
| 44 | C_sculpturatus_Six3/6 | pseudochrom_3 | 1.000 | 180 | 0 | 0 | 1 | 60 | 7711226 | 7711047 | 1.348000e-35 | 131 | 60 | C sculpturatus | Six3/6 | g5264 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 574 | S_maritima_Six1/2 | pseudochrom_3 | 0.728 | 177 | 16 | 0 | 1 | 59 | 7711226 | 7711050 | 3.931000e-25 | 101 | 60 | S maritima | Six1/2 | g5264 |
| 581 | T_castaneum_Six1/2 | pseudochrom_3 | 0.711 | 177 | 17 | 0 | 1 | 59 | 7711226 | 7711050 | 7.414000e-25 | 100 | 60 | T castaneum | Six1/2 | g5264 |
| 588 | A_bruennichi_Six3/6 | pseudochrom_3 | 1.000 | 180 | 0 | 0 | 1 | 60 | 7711226 | 7711047 | 1.348000e-35 | 131 | 60 | A bruennichi | Six3/6 | g5264 |
| 622 | A_bruennichi_Six1/2 | pseudochrom_3 | 0.694 | 177 | 18 | 0 | 1 | 59 | 7711226 | 7711050 | 3.624000e-24 | 98 | 60 | A bruennichi | Six1/2 | g5264 |
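
The hits listed above have `t_start` larger than `t_end`, as expected for matches on the reverse strand, so any comparison against GFF gene coordinates (where `start` <= `end`) has to take the orientation into account. As a minimal sketch, assuming closed 1-based intervals, the helper below sorts both intervals before measuring their overlap. The name `interval_overlap` is hypothetical and not part of the original notebook; the toy check uses the annotated coordinates of g5264 and one of its hits.

```python
def interval_overlap(a_start, a_end, b_start, b_end):
    """Return the number of positions shared by two closed intervals.

    Either interval may be given in reverse order (start > end).
    """
    a_lo, a_hi = sorted((a_start, a_end))
    b_lo, b_hi = sorted((b_start, b_end))
    return max(0, min(a_hi, b_hi) - max(a_lo, b_lo) + 1)


# toy check: gene model g5264 versus one reverse-strand hit on pseudochrom_3
gene_start, gene_end = 7701237, 7711655
hit_start, hit_end = 7711226, 7711047
shared = interval_overlap(gene_start, gene_end, hit_start, hit_end)
# shared == 180, i.e. the whole alignment lies inside the gene model
```

A non-zero result means that the hit and the gene share at least one base; comparing it to the alignment length shows whether the hit is fully contained in the gene.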


In [9]:
candidates = hits['gene'].unique()

In [10]:
plit_peptides = {}
with open('/Volumes/project/pycnogonum/genome/draft/transcripts.fa.transdecoder.pep', 'r') as f:
    for line in f:
        if line.startswith(">"):
            # keep only the sequence ID (first word of the header)
            gene = line[1:].split()[0]
            plit_peptides[gene] = ""
        else:
            # sequences may be wrapped over several lines
            plit_peptides[gene] += line.strip()

In [11]:
candidate_sequences = {}
for gene in plit_peptides:
    for c in candidates:
        if c in gene:
            candidate_sequences[gene] = plit_peptides[gene]

In [12]:
with open('/Volumes/scratch/pycnogonum/genome/draft/sine/sine_candidates.fa', 'w') as f:
    for gene, seq in candidate_sequences.items():
        f.write(f">{gene}\n{seq}\n")

Now we can concatenate all the sequences and align them.

```bash
cat sine_chelicerates.fa > sine.fa
cat sine_candidates.fa >> sine.fa
```

Before continuing, check that the sequences were concatenated correctly: if `sine_chelicerates.fa`
does not end with a newline character, the header of the first candidate sequence will be appended
to the last sequence line of `sine_chelicerates.fa` in the combined `sine.fa` file.
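
One way to avoid this pitfall is to concatenate the files in Python and add any missing trailing newline explicitly. This is only a sketch of an alternative to the `cat` commands above; the helper name `concat_fasta` is hypothetical and not part of the original workflow.

```python
from pathlib import Path


def concat_fasta(inputs, output):
    """Concatenate FASTA files, making sure each one ends with a newline."""
    with open(output, "w") as out:
        for path in inputs:
            text = Path(path).read_text()
            # a missing trailing newline would glue the next header
            # onto the last sequence line of this file
            if text and not text.endswith("\n"):
                text += "\n"
            out.write(text)
```

With the file names used here, the call would be `concat_fasta(["sine_chelicerates.fa", "sine_candidates.fa"], "sine.fa")`.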

First, we run a quick-and-dirty alignment so that we can cut out the homeodomain regions from the
_P. litorale_ sequences:

```bash
cd /lisc/scratch/zoology/pycnogonum/genome/draft/sine
mafft --thread 4 sine.fa > sine_aligned.fasta
```

Now we can extract the homeodomain regions. This was done manually in Jalview, by deleting columns
left and right of the homeobox region. We excluded the "sine-like" sequences from the alignment, as
they visibly did not agree with the rest of the sequences.
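
The trimming itself was done by eye, but once the column boundaries are known the same operation can be expressed in a few lines. The sketch below assumes an aligned FASTA in which all sequences have the same length; the helper names `read_fasta` and `trim_columns` and the toy coordinates are hypothetical and do not reflect the boundaries chosen in Jalview.

```python
def read_fasta(handle):
    """Parse FASTA lines into a dict of {name: sequence}."""
    records = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # use the first word of the header as the sequence name
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def trim_columns(alignment, start, end):
    """Keep alignment columns start..end (0-based, end exclusive)."""
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) > 1:
        raise ValueError("sequences differ in length; is this an alignment?")
    return {name: seq[start:end] for name, seq in alignment.items()}


# toy alignment with two sequences of seven columns each
toy = read_fasta([">a toy sequence", "MK--RRT", ">b", "MKAARRT"])
trimmed = trim_columns(toy, 2, 5)
# trimmed == {"a": "--R", "b": "AAR"}
```

`read_fasta` accepts any iterable of lines, so an open file handle on `sine_aligned.fasta` would work in place of the toy list.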

The resulting trimmed alignment was saved as `homeobox.fasta` and re-aligned with MAFFT in L-INS-i
mode (probably the most accurate option, but a little slower):

```bash
mafft --thread 4 --maxiterate 1000 --localpair homeobox.fasta > homeobox_aligned.fasta
```

The resulting alignment was used as input for IQ-TREE 2, with 1000 ultrafast bootstrap replicates (`-B 1000`).

```bash
module load iqtree
/usr/bin/time iqtree2 -s ./homeobox_aligned.fasta -B 1000 -T 4
```

In [13]:
names = {
    'PB.8648': 'Six4/5',
    'g5264': 'Six3/6',
    'PB.2234': 'Six1/2',
}

sine_cluster = gff.set_index('gene').loc[names.keys()]
sine_cluster['name'] = sine_cluster.index.map(names)
sine_cluster.sort_values(['seqid', 'start'], ascending=[False, True])

| gene | seqid | source | type | start | end | score | strand | phase | attributes | name |
|---|---|---|---|---|---|---|---|---|---|---|
| PB.8648 | pseudochrom_56 | PacBio | gene | 4420024 | 4432656 | . | + | . | ID=PB.8648;function=Sequence-specific DNA bind... | Six4/5 |
| g5264 | pseudochrom_3 | AUGUSTUS | gene | 7701237 | 7711655 | . | - | . | ID=g5264;function=Transcriptional regulator, S... | Six3/6 |
| PB.2234 | pseudochrom_13 | PacBio | gene | 4947930 | 4972081 | . | + | . | ID=PB.2234;function=sequence-specific DNA bind... | Six1/2 |
