# Phylogenetic tree of host species
We built a phylogeny of the host species in the hope of estimating when endosymbiotic events likely occurred and how many there were.

To build the phylogeny, we used mitochondrial 12S rRNA sequences with the accession codes listed in host-tree/12S_names. We did not use full genomes because we estimated that downloading and processing them would take too long and require too much disk space. The 12S gene was also one of the few genes that had been sequenced for all host organisms and that are suitable for constructing phylogenies over long evolutionary distances. The selected records also include some partial or complete mitochondrial genomes, which may need to be trimmed to the 12S region for the alignment to work well.

## Downloading the sequences

```while read -r f; do wget -q -O - "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=$f&rettype=fasta"; done < 12S_names > 12S_genes.frn```
We used this loop to download all the sequences with the identifiers from 12S_names and concatenate them into 12S_genes.frn.
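
Because the download loop runs `wget` quietly, a failed request would go unnoticed. The sketch below is one possible sanity check, not part of the original workflow: a hypothetical helper `missing_ids` that lists accessions from the name list with no matching FASTA header. It assumes each header starts with `>` followed directly by the accession exactly as written in 12S_names (including the version suffix).

```shell
# Hypothetical helper (not part of the original workflow).
# Usage: missing_ids NAMES_FILE FASTA_FILE
# Prints every accession from NAMES_FILE that has no FASTA header
# of the form ">ACCESSION ..." in FASTA_FILE.
missing_ids() {
    awk '
        # First file (the FASTA): remember the first word of each header, minus ">".
        FILENAME == ARGV[1] { if (/^>/) seen[substr($1, 2)] = 1; next }
        # Second file (the accession list): report non-empty lines not seen above.
        NF && !($1 in seen) { print $1 }
    ' "$2" "$1"
}
```

Running `missing_ids 12S_names 12S_genes.frn` and getting no output would suggest that every accession was retrieved.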

## Aligning the sequences with MAFFT v7.525

We first ran ```mafft 12S_genes.frn > 12S_genes.frn.mafft``` to align the sequences. However, in the resulting alignment the 12S rRNA sequences were very fragmented when aligned against the mitochondrial contigs of some other species. To address this, we tried letting MAFFT choose its alignment strategy automatically: ```mafft --auto --thread -1 12S_genes.frn > 12S_genes.mafft_auto```. MAFFT selected the FFT-NS-2 strategy, a progressive method that runs the progressive alignment twice and does not perform iterative refinement, and the alignment was still very fragmented.
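
Since the input mixes 12S genes with partial or complete mitochondrial genomes, a per-record length table can show which sequences are much longer than the rest and are likely to cause the fragmentation. The following is a minimal sketch, not something we ran as part of this analysis; `fasta_lengths` is a hypothetical helper name.

```shell
# Hypothetical helper (not part of the original workflow).
# Usage: fasta_lengths FASTA_FILE
# Prints "ID<TAB>LENGTH" for every record, where ID is the first word of the header.
fasta_lengths() {
    awk '
        /^>/ {
            # New record: emit the previous one, then reset the counter.
            if (id != "") print id "\t" len
            id = substr($1, 2); len = 0; next
        }
        {
            # Sequence line: ignore whitespace and carriage returns when counting.
            gsub(/[ \t\r]/, "")
            len += length($0)
        }
        END { if (id != "") print id "\t" len }
    ' "$1"
}
```

For example, `fasta_lengths 12S_genes.frn | sort -k2,2n` would list the records from shortest to longest. MAFFT also offers `--adjustdirection` (to handle reverse-complemented inputs) and an iterative-refinement mode (`--localpair --maxiterate 1000`), which might be worth trying here; we did not test either.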

## Trying to extract 12S sequences from mitochondrial contigs with BLASTn 2.15.0+
We tried to extract 12S sequences from the mitochondrial contigs by running ```while read -r id; do blastn -query "$id.frn" -db 12S_genes.frn -dust no -task blastn -subject_besthit -out "${id}_vs_12S.blastn"; done < 12S_names```, but this did not yield significant hits for every pair of sequences. Some sequences, such as EF678869.1, gave no significant alignments against any other sequence. Because there was not enough time to find the cause of the missing hits, we decided to move on and build the host phylogeny manually.
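
If this approach is revisited, the 12S region of each contig could be read from the coordinates of its best hit. The sketch below assumes BLAST was run with `-outfmt 6` (the standard 12-column tabular output), which the command above did not use; `best_hit_regions` is a hypothetical helper name.

```shell
# Hypothetical helper (not part of the original workflow).
# Usage: best_hit_regions BLAST_TABULAR_FILE
# Input: blastn "-outfmt 6" output with the default 12 columns
#        (column 2 = subject ID, 9 = sstart, 10 = send, 12 = bitscore).
# Output: "SUBJECT<TAB>START<TAB>END<TAB>STRAND" for the highest-scoring hit per subject.
best_hit_regions() {
    awk -F '\t' '
        !($2 in score) || $12 + 0 > score[$2] {
            score[$2] = $12 + 0
            if ($9 + 0 <= $10 + 0) {
                start[$2] = $9;  stop[$2] = $10; strand[$2] = "+"
            } else {
                # Hits on the minus strand are reported with sstart > send.
                start[$2] = $10; stop[$2] = $9;  strand[$2] = "-"
            }
        }
        END { for (s in score) print s "\t" start[s] "\t" stop[s] "\t" strand[s] }
    ' "$1" | sort
}
```

The resulting coordinates could then be used to cut the matching region out of each contig, for instance with a tool that accepts a sequence range. Sequences without any significant hit, such as EF678869.1, would still have to be checked by hand.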

## Manual phylogeny

To create a phylogeny, we looked up the binomial names of all host species and placed them on a published [insect tree](https://doi.org/10.1016/j.cub.2021.08.057). To resolve the hemipteran insects, we used a [more detailed tree](https://doi.org/10.1016/j.ympev.2019.05.009). Mealybugs were resolved using a [tree for Coccomorpha](https://doi.org/10.1111/syen.12534), where the genus Llaveia falls within the outgroup. The result is displayed in the figure below.

![Draft host phylogeny](figures/host-phylogeny-draft.png "Draft host phylogeny")

Note that the genus Columbicola is wrongly placed with the Neuroptera and should be placed with the Phthiraptera instead. This has been fixed in the final tanglegram.

## Manual tanglegram

We made a Newick file for the host phylogeny, as well as for the endosymbiotic Sodalis(-allied) species. These can be found in ~/host-tree/Host_taxonomy.newick and ~/host-tree/Endosymbionts_tree.newick, respectively. The files were imported into FigTree v1.4.4, and branches were colored according to their corresponding host/endosymbiont. The colored trees can be found in ~/host-tree/colored-hosts and ~/host-tree/colored-sodalis, respectively. The two trees were then placed side by side in an image editor, with the host tree flipped, and lines were drawn between corresponding hosts and endosymbionts. The incorrect placement of Columbicola was fixed by manually moving branches in the host tree. The result is shown below:

![Tanglegram of the host tree and the Sodalis tree](figures/tanglegram.png "Tanglegram of the host tree and the Sodalis tree")
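
Because the Newick files were written by hand, it can help to list their tip labels and check for typos or duplicates before importing them. This is a rough sketch, not part of the original workflow: `newick_tips` is a hypothetical helper, and it assumes unquoted labels that contain none of the characters `( ) , ; :`.

```shell
# Hypothetical helper (not part of the original workflow).
# Usage: newick_tips NEWICK_FILE
# Prints one tip label per line, in the order they appear in the file.
newick_tips() {
    tr -d '\n\r' < "$1" |
        # Drop internal-node labels and branch lengths that follow a ")".
        sed 's/)[^(),;]*/)/g' |
        # Put every remaining token on its own line.
        tr '(),;' '\n\n\n\n' |
        # Strip branch lengths and surrounding whitespace from tip labels.
        sed 's/:.*$//; s/^[[:space:]]*//; s/[[:space:]]*$//' |
        # Discard the empty lines left behind by the delimiters.
        awk 'NF'
}
```

For example, `newick_tips Host_taxonomy.newick | sort | uniq -d` would print any tip label that occurs more than once.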

There appears to be no association between clades in the host tree and clades in the Sodalis tree. This is consistent with a scenario in which Sodalis is a widespread facultative (endo)symbiont that has recently become an obligate endosymbiont in some host species, as outlined in [this paper](https://doi.org/10.3389/fmicb.2021.668644) and [this paper](https://doi.org/10.1186/s40851-014-0009-5).