### Creation of the 19k ortholog alignment and phylogeny for Parkinson 2018
This notebook will document the work that was necessary to go from the following output files:
* Filtered assembly .fasta files (one per species)
* Predicted ORF .fasta files (one per species)
* Predicted one-to-one ortholog file (one file for all four species)

to an amino acid alignment made by concatenating all 19000 ortholog peptide sequences, and an ML phylogeny built from it.

This documentation is split into two major sections:
* Fixing multi-ORF prediction
* Creating the super alignment and phylogeny
<hr />

### Fixing multi-ORF prediction
When working through the predicted one-to-one ortholog file (e.g. /Users/humebc/Google Drive/projects/parky/protein_alignments/a_id.csv), we found that some of the sets of amino acid sequences (four sequences, one per species) did not align well. The cause was that, although the post-filtering transcript sequences (e.g. compXXX_seqXXX) were unique within each species assembly file, several ORFs could be predicted per transcript. When the one-to-one data was created, the compXXX ID was used rather than the ORF ID (e.g. m.XXX, which is unique per species). As a result, for transcripts with several predicted ORFs, it was pure chance whether the ORF that had actually been identified as an ortholog of the other species' ORFs was the one selected.

The easiest way to solve this problem would have been to go back to the original input and output of the one-to-one ortholog predictions and work with the unique ORF IDs (e.g. m.XXX). However, this analysis was done several years ago and I was unable to find this file. As such, a different approach was required. To fix the issue I decided to go through each of the ortholog predictions and check the possible combinations of ORFs that could have been selected to find the set of ORFs with the lowest average pairwise distances.

E.g. if we consider a single ortholog (ortholog_0), we have the four transcripts from which the ORFs predicted to be orthologs came, e.g. comp0, comp1, comp2, comp3, so that:
```python
transcript_to_ORF_dict = {
    'comp0': ['comp0_m.0', 'comp0_m.1'],
    'comp1': ['comp1_m.0'],
    'comp2': ['comp2_m.0', 'comp2_m.1'],
    'comp3': ['comp3_m.0']
}
```
In this case the possible alignments that could have been selected were:
```python
list_of_possible_alignments = [
    ['comp0_m.0', 'comp1_m.0', 'comp2_m.0', 'comp3_m.0'],
    ['comp0_m.0', 'comp1_m.0', 'comp2_m.1', 'comp3_m.0'],
    ['comp0_m.1', 'comp1_m.0', 'comp2_m.0', 'comp3_m.0'],
    ['comp0_m.1', 'comp1_m.0', 'comp2_m.1', 'comp3_m.0'],
]
```
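The enumeration above can be reproduced with `itertools.product`, which is the approach named in the pseudo code further down. This is only a minimal sketch: the variable names `orfs_per_transcript` and `candidate_sets` are illustrative and do not come from the original analysis code.

```python
from itertools import product

# Illustrative mapping of transcript ID -> predicted ORF IDs
# (same shape as the example dict above; names are assumptions)
orfs_per_transcript = {
    'comp0': ['comp0_m.0', 'comp0_m.1'],
    'comp1': ['comp1_m.0'],
    'comp2': ['comp2_m.0', 'comp2_m.1'],
    'comp3': ['comp3_m.0'],
}

# Exactly one ORF is picked per transcript; product yields every such combination
candidate_sets = [list(combo) for combo in product(*orfs_per_transcript.values())]

for candidate in candidate_sets:
    print(candidate)
```

The number of candidate sets is the product of the ORF counts per transcript (here 2 x 1 x 2 x 1 = 4), so it stays small as long as only a few ORFs are predicted per transcript.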
For each of these candidate sets, every unordered pair of sequences (2-way combinations, regardless of order) was generated and the sequence distance was calculated for each pair. The average pairwise distance was then calculated for each set, and the set with the lowest average pairwise distance (i.e. the best-aligning set) was selected as the best set of ORFs to work with.
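The scoring step could look something like the sketch below. The distance function is a stand-in and an assumption: it uses `1 - difflib.SequenceMatcher(...).ratio()` from the standard library, which is not necessarily the distance measure used in the real analysis. The helper names and the toy peptide sequences are also made up for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations, product


def pairwise_distance(seq_a, seq_b):
    # Stand-in metric (assumption): 1 - difflib similarity ratio.
    # A proper alignment-based distance would replace this in practice.
    return 1.0 - SequenceMatcher(None, seq_a, seq_b, autojunk=False).ratio()


def average_pairwise_distance(seqs):
    # All unordered pairs of the sequences in one candidate set
    pairs = list(combinations(seqs, 2))
    return sum(pairwise_distance(a, b) for a, b in pairs) / len(pairs)


def pick_best_orf_set(orfs_per_species):
    """orfs_per_species: one list per species of (orf_id, aa_seq) tuples."""
    best = min(
        product(*orfs_per_species),
        key=lambda combo: average_pairwise_distance([seq for _, seq in combo]),
    )
    return [orf_id for orf_id, _ in best]


# Toy data (made up): two of the transcripts carry an unrelated second ORF
toy_orfs = [
    [('comp0_m.0', 'MKTAYIAKQR'), ('comp0_m.1', 'GGGGSSSSPP')],
    [('comp1_m.0', 'MKTAYIAKQK')],
    [('comp2_m.0', 'WWWWCCCCHH'), ('comp2_m.1', 'MKSAYIAKQR')],
    [('comp3_m.0', 'MKTAYLAKQR')],
]
best_ids = pick_best_orf_set(toy_orfs)
print(best_ids)
```

Taking the minimum over the average distance keeps the set whose four sequences are most similar to each other, which is the selection criterion described above.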

As a sanity check throughout this process, I regularly inspected the alignments that were being output as the 'best selections' to make sure that no further errors had occurred earlier in the one-to-one predictions. All of the alignments I checked looked good.

Pseudo code:

In [None]:
'''So now we know that there are a few doozies here that we need to take account of.
    1 - the comps were not unique and had sequence variations. These have been made unique in the
    longest_250 versions of the assemblies and so these are the files we should work with in terms of getting DNA
    2 - the comps are not unique across species, i.e. each species has a comp00001
    3 - after the above longest_250 processing we can essentially assume that comps are unique within species

    sooo.... some pseudo code to figure this out
    we should work in a dataframe for this
    read in the list of ortholog gene IDs for each species (the comps that make up the 19000) (this is the dataframe)
    then for each species, identify the gene IDs (comps) for which multiple ORFs were made
    go row by row in the dataframe and see if any of the comp IDs (specific to species) are with multiple ORFs
    These are our rows of interest, we need to work within these rows:

    for each comp for each species get a list of the orf aa sequences. turn these into a list of lists
    then use itertools.product to get all of the four orfs that could be aligned.
    then for each of these possible combinations do itertools.combinations(seqs, 2) and do
    pairwise distances and get the average score of the pairwise alignments
    we keep the set of four that got the best possible score
    '''
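
The 'rows of interest' step from the pseudo code above can be sketched with plain dictionaries before moving to dataframes. Everything here is hypothetical: the species labels, the toy IDs, and the assumption that an ORF ID looks like `comp0_m.1` (transcript ID, then `_m.`, then the ORF number). Real headers may be formatted differently.

```python
from collections import defaultdict


def orfs_by_comp(orf_ids):
    """Group the ORF IDs of one species by their parent transcript (comp) ID.

    Assumes IDs shaped like 'comp0_m.1'; this format is an assumption.
    """
    grouped = defaultdict(list)
    for orf_id in orf_ids:
        comp_id, _, _ = orf_id.rpartition('_m.')
        grouped[comp_id].append(orf_id)
    return grouped


# Hypothetical toy input: predicted ORF IDs per species
orf_ids_per_species = {
    'species_a': ['comp0_m.0', 'comp0_m.1', 'comp5_m.2'],
    'species_b': ['comp0_m.0', 'comp1_m.1'],
}

# Hypothetical ortholog table: one comp ID per species per ortholog
ortholog_rows = {
    'ortholog_0': {'species_a': 'comp0', 'species_b': 'comp1'},
    'ortholog_1': {'species_a': 'comp5', 'species_b': 'comp0'},
}

# Comp IDs repeat across species, so lookups are always keyed by species first
grouped = {sp: orfs_by_comp(ids) for sp, ids in orf_ids_per_species.items()}

# A row needs re-checking if any of its comps has more than one predicted ORF
rows_of_interest = [
    name for name, comps in ortholog_rows.items()
    if any(len(grouped[sp][comp]) > 1 for sp, comp in comps.items())
]
print(rows_of_interest)
```

Only the rows flagged this way need the combination and scoring steps; every other row has a single possible set of ORFs.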

#### __Read in the csv files as pandas dataframes__
These will give us the compIDs and the actual aa seqs of the currently predicted ORF variants for each ortholog

In [None]:
import pandas as pd

# First get the ID of the transcript that is related to this ortholog for each species
gene_id_df = pd.read_csv('/home/humebc/projects/parky/gene_id.csv', index_col=0)

# We want to be able to compare how many of the alignments we fixed and how many were OK due to luck
# To do this we will need a couple of counters but also the currently chosen ORFs
aa_seq_df = pd.read_csv('/home/humebc/projects/parky/aa_seq_fixed_again.csv', index_col=0)