### Creation of the 19k ortholog alignment and phylogeny for Parkinson 2018
This notebook will document the work that was necessary to go from the following output files:
* Filtered assembly .fasta files (one per species)
* Predicted ORF files .fasta files (one per species)
* Predicted one-to-one ortholog file (one file for all four species)

To an amino acid alignment made through the concatenation of all 19000 ortholog peptide sequences and a ML phylogeny.

This documentation is split into two major sections:
#### * Fixing multi-ORF prediction
#### * Creating the super alignment and phylogeny
<hr />

### Fixing multi-ORF prediction
When working through the predicted one-to-one ortholog file (e.g. /Users/humebc/Google Drive/projects/parky/protein_alignments/a_id.csv), we found that some of the sets of amino acid sequences (four sequences, one per species) did not align well. The cause was that, although the post-filtering transcript sequences (e.g. compXXX_seqXXX) were unique within each species assembly file, several ORFs could be predicted per transcript. When the one-to-one data was created, the compXXX ID was used rather than the ORF ID (e.g. m.XXX), which is unique per species. As a result, for transcripts with several predicted ORFs, it was pure chance whether the correct ORF (the one found to be an ortholog of the other species' ORFs) had been selected.

The easiest way to solve this problem would have been to go back to the original input and output of the one-to-one ortholog predictions and work with the unique ORF IDs (e.g. m.XXX). However, this analysis was done several years ago and I was unable to find this file, so a different approach was required. To fix the issue I decided to go through each of the ortholog predictions and check every combination of ORFs that could have been selected, to find the set of ORFs with the lowest average pairwise distance (in practice, the highest average pairwise alignment score).

E.g. if we consider a single ortholog (ortholog_0), we have four transcripts (one per species) from which the ORFs predicted to be orthologs came, e.g. comp0, comp1, comp2, comp3, so that:
```python
transcript_to_ORF_dict = {
    'comp0': ['comp0_m.0', 'comp0_m.1'],
    'comp1': ['comp1_m.0'],
    'comp2': ['comp2_m.0', 'comp2_m.1'],
    'comp3': ['comp3_m.0']
}
```
In this case the possible alignments that could have been selected were:
```python
list_of_possible_alignments = [
    ['comp0_m.0', 'comp1_m.0', 'comp2_m.0', 'comp3_m.0'],
    ['comp0_m.0', 'comp1_m.0', 'comp2_m.1', 'comp3_m.0'],
    ['comp0_m.1', 'comp1_m.0', 'comp2_m.0', 'comp3_m.0'],
    ['comp0_m.1', 'comp1_m.0', 'comp2_m.1', 'comp3_m.0'],   
]
```
For each of these candidate sets, every unordered pair of sequences was generated (six pairs for four sequences) and a pairwise alignment score was calculated for each pair. The average pairwise score was then calculated for each set, and the set with the highest average score (i.e. the lowest average pairwise distance) was selected as the best set of ORFs to work with.
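
As a minimal sketch of this selection step, the snippet below enumerates the candidate sets with `itertools.product` and scores each set by its mean pairwise score. The sequences are made up for illustration, and `lcs_length` is a hypothetical pure-Python stand-in for the Biopython call used later in this notebook (`pairwise2.align.globalxx(..., score_only=True)`); with a match score of 1 and no mismatch or gap penalties, that call should amount to the length of the longest common subsequence.

```python
import itertools

# Hypothetical toy data: transcript ID -> {ORF ID: amino acid sequence}.
transcript_to_orfs = {
    'comp0': {'comp0_m.0': 'MKTAYIAK', 'comp0_m.1': 'GGSPLLWW'},
    'comp1': {'comp1_m.0': 'MKTAYIAR'},
    'comp2': {'comp2_m.0': 'PPQQHHNN', 'comp2_m.1': 'MKTAYLAK'},
    'comp3': {'comp3_m.0': 'MKSAYIAK'},
}


def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ch_a == ch_b else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def mean_pairwise_score(seqs):
    """Average score over all unordered pairs of sequences."""
    scores = [lcs_length(a, b) for a, b in itertools.combinations(seqs, 2)]
    return sum(scores) / len(scores)


def score_of(orf_ids):
    """Look up the sequences for a tuple of ORF IDs and score the set."""
    seqs = [transcript_to_orfs[orf_id.split('_')[0]][orf_id] for orf_id in orf_ids]
    return mean_pairwise_score(seqs)


# One ORF per transcript, in every possible combination (2 * 1 * 2 * 1 = 4 sets).
orf_id_lists = [list(orfs) for orfs in transcript_to_orfs.values()]
candidate_sets = list(itertools.product(*orf_id_lists))

# Keep the set with the highest mean pairwise score.
best_set = max(candidate_sets, key=score_of)
```

In this toy case the two unrelated ORFs (`comp0_m.1` and `comp2_m.0`) drag the mean score of any set containing them down, so the set made only of the similar sequences is kept.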

As a sanity check throughout this process, I regularly inspected the alignments that were being output as the 'best selections' to make sure that no further errors had occurred previously in the one-to-one predictions. All alignments I checked looked great.

Pseudo code:

In [None]:
'''So now we know that there are a few doozies here that we need to take account of.
    1 - the comps were not unique and had sequence variations. These have been made unique in the
    longest_250 versions of the assemblies and so these are the files we should work with in terms of getting DNA
    2 - the comps are not unique across species, i.e. each species has a comp00001
    3 - after the above longest_250 processing we can essentially assume that comps are unique within species

    sooo.... some pseudo code to figure this out
    we should work in a dataframe for this
    read in the list of ortholog gene IDs for each species (the comps that make up the 19000) (this is the dataframe)
    then for each species, identify the gene IDs (comps) for which multiple ORFs were made
    go row by row in the dataframe and see if any of the comp IDs (specific to species) are with multiple ORFs
    These are our rows of interest, we need to work within these rows:

    for each comp for each species get a list of the orf aa sequences. turn these into a list of lists
    then use itertools.product to get all of the four orfs that could be aligned.
    then for each of these possible combinations do itertools.combinations(seqs, 2) and do
    pairwise distances and get the average score of the pairwise alignments
    we keep the set of four that got the best possible score
    '''

#### __Read in the csv files as pandas dataframes__
These will give us the compIDs and the actual aa seqs of the currently predicted ORF variants for each ortholog

In [None]:
import pandas as pd

# First get the ID of the transcript that is related to this Ortholog for each species
gene_id_df = pd.read_csv('/home/humebc/projects/parky/gene_id.csv', index_col=0)

# We want to be able to compare how many of the alignments we fixed and how many were OK due to luck
# To do this we will need a couple of counters but also the currently chosen ORFs
aa_seq_df = pd.read_csv('/home/humebc/projects/parky/aa_seq_fixed_again.csv', index_col=0)

Counters for keeping track of how many of the orthologs we had to fix

In [None]:
multi_orf_counter = 0
fixed_counter = 0

#### Read in the files containing which ORFs were predicted for each of the transcripts for each species
Hold these in a 2D list, one list per species

In [None]:
def convert_interleaved_to_sequencial_fasta_two(fasta_in):
    fasta_out = []

    for i in range(len(fasta_in)):

        if fasta_in[i].startswith('>'):
            if fasta_out:
                # if the fasta is not empty then this is not the first
                fasta_out.append(temp_seq_str)
            #else then this is the first sequence and there is no need to add the seq.
            temp_seq_str = ''
            fasta_out.append(fasta_in[i])
        else:
            temp_seq_str = temp_seq_str + fasta_in[i]
    #finally we need to add in the last sequence
    fasta_out.append(temp_seq_str)
    return fasta_out
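
To illustrate what this helper is expected to do, here is a compact sketch of the same idea (the name `unwrap_fasta` and the records are made up for illustration): sequences wrapped over several lines are joined so that each record occupies exactly two list items, header then sequence.

```python
def unwrap_fasta(lines):
    """Join wrapped sequence lines so each record is [header, sequence]."""
    out = []
    for line in lines:
        if line.startswith('>'):
            out.extend([line, ''])   # new record: header plus an empty sequence
        elif out:
            out[-1] += line          # extend the sequence of the current record
    return out


# Invented two-record FASTA with the first sequence wrapped over two lines.
wrapped = ['>m.1 example', 'MKTA', 'YIAK', '>m.2 example', 'GGSP']
unwrapped = unwrap_fasta(wrapped)
```

With the records in this two-line form, the sequence for the header at index `j` can be read directly from index `j + 1`, which is what the later cells rely on.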

In [None]:
# read in the predicted orfs
# read in the orf aa files for each of the four species
aa_orf_file_holder_list = []
for spp in list(gene_id_df):
    with open('/home/baumgas/projects/done/7_John/assemblies_species/annotation/for-dnds/{}_ORFaa.fa'.format(spp),
              'r') as f:
        aa_orf_file_holder_list.append(convert_interleaved_to_sequencial_fasta_two([line.rstrip() for line in f]))

#### Create a dict that relates compIDs with multiple ORFs predicted to a list of the ORF IDs predicted for them

In [None]:
from collections import defaultdict

comp_to_orfs_dict_holder_list = [defaultdict(list) for spp in list(gene_id_df)]
orf_to_aa_dict_holder_list = [dict() for spp in list(gene_id_df)]
for i in range(len(list(gene_id_df))):
    for j in range(len(aa_orf_file_holder_list[i])):
        if aa_orf_file_holder_list[i][j].startswith('>'):
            comp_ID = aa_orf_file_holder_list[i][j].split()[8].split('_')[0]
            orf_ID = aa_orf_file_holder_list[i][j].split()[0][1:]
            comp_to_orfs_dict_holder_list[i][comp_ID].append(orf_ID)
            orf_to_aa_dict_holder_list[i][orf_ID] = aa_orf_file_holder_list[i][j + 1]
    # print out stats for the default dict to see if we agree with Chris at this point
    multi_orf_comps = sum([1 for k, v in comp_to_orfs_dict_holder_list[i].items() if len(v) > 1])
    print('{} comps with > 1 ORF in {}'.format(multi_orf_comps, list(gene_id_df)[i]))
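
The indexing above relies on the layout of the ORF FASTA headers. As a sketch, assuming a header in which the first whitespace-separated field is the ORF ID and the ninth field starts with the transcript ID (the headers below are invented for illustration, not copied from the real files):

```python
from collections import defaultdict

# Invented header lines; the real files may differ in detail.
headers = [
    '>m.1 g.1 ORF g.1 m.1 type:complete len:120 (+) comp7_seq1:10-369(+)',
    '>m.2 g.2 ORF g.2 m.2 type:5prime_partial len:95 (-) comp7_seq1:400-684(-)',
    '>m.3 g.3 ORF g.3 m.3 type:complete len:210 (+) comp9_seq1:1-630(+)',
]

comp_to_orfs = defaultdict(list)
for header in headers:
    fields = header.split()
    orf_id = fields[0][1:]              # drop the leading '>'
    comp_id = fields[8].split('_')[0]   # 'comp7_seq1:10-369(+)' -> 'comp7'
    comp_to_orfs[comp_id].append(orf_id)

# Transcripts for which more than one ORF was predicted.
multi_orf_comps = [comp for comp, orfs in comp_to_orfs.items() if len(orfs) > 1]
```

Here `comp7` has two ORFs and would therefore flag any ortholog row that contains it for re-checking.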

#### Now make a list from this which is simply the comps that have multiple ORFs predicted for them

In [None]:
# here we have the list of dictionaries from which we can get the comp IDs that have multiple ORFs
# perhaps we should convert them to lists now rather than on the fly
list_of_multi_orf_comps_per_spp = [[k for k, v in comp_to_orfs_dict_holder_list[i].items() if len(v) > 1] for i in range(len(list(gene_id_df)))]

#### Go through the df of compIDs for each ortholog and for each species look to see if the comp ID is found in the list we just created
If it is, then at least one of the species has a transcript for which multiple ORFs were predicted. The set of ORFs associated with this ortholog will therefore need to be investigated.

We will multiprocess the checks of the problematic orthologs.
To do this we will create a list that will hold three items:
* A list of tuples. Each tuple containing four aa sequences. The collection of tuples represent all of the possible alignments that we need to check
* The list of aa sequences that are currently assigned to the ortholog
* The ortholog ID (simply an int)

#### create the list that will hold the MP info

In [None]:
mp_list = []

In [None]:
import itertools

col_labels = list(gene_id_df)
for index in gene_id_df.index.values.tolist():
    # within each row check to see if the comp is in its respective list
    row_to_check = False
    for i in range(len(col_labels)):
        if gene_id_df.loc[index, col_labels[i]] in list_of_multi_orf_comps_per_spp[i]:
            row_to_check = True
    # here we have checked through each of the comps in the row
    # if row_to_check == True then we need to work through each of the comps and see which ORF combinations
    # produce the best average pairwise distances

    if row_to_check:
        print('Checking multi-ORF ortholog {}'.format(index), end='\r')
        multi_orf_counter += 1
        list_of_lists_of_possible_orfs = [comp_to_orfs_dict_holder_list[i][gene_id_df.loc[index, col_labels[i]]] for i in range(len(col_labels))]


        # now run itertools.product on this list of lists to get the tuples which are essentially
        # the different orfs that we would be trying to align
        list_of_lists_of_possible_orfs_as_aa = [[] for spp in col_labels]
        for k in range(len(list_of_lists_of_possible_orfs)):
            for orfID in list_of_lists_of_possible_orfs[k]:
                list_of_lists_of_possible_orfs_as_aa[k].append(orf_to_aa_dict_holder_list[k][orfID])
        # hardcode it to a list so that we can get the index of each tuple below
        alignment_tuples = [tup for tup in itertools.product(*list_of_lists_of_possible_orfs_as_aa)]


        mp_list.append((alignment_tuples, aa_seq_df.loc[index].tolist(), index))

#### Setup the MP worker that will perform the pairwise comparisons

In [None]:
import itertools

from Bio import pairwise2


def ORF_screening_worker(input_queue, rows_to_be_replaced_dict):
    for tup in iter(input_queue.get, 'STOP'):
        alignment_tuples, current_seqs, index = tup
        print('Processing multi-ORF ortholog: {}'.format(index))
        # we should now have sets that are four aa sequences.
        # for each set, generate pairwise comparisons and calculate pairwise distances
        # keep track of each score and calculate the average PW score for each set_of_alignment_seqs
        average_distance_list = []
        for set_of_alignment_seqs in alignment_tuples:
            temp_pairwise_scores_list = []
            for seq_a, seq_b in itertools.combinations(set_of_alignment_seqs, 2):
                # here is a single PW distance calculation
                score = pairwise2.align.globalxx(seq_a, seq_b, score_only=True)
                temp_pairwise_scores_list.append(score)
            # now calculate average
            temp_average = sum(temp_pairwise_scores_list) / len(temp_pairwise_scores_list)
            average_distance_list.append(temp_average)

        # now we have a list of PW distance for each of the sets of sequences (virtual alignments)
        # we want to select the set that has the highest score

        index_of_best_set = average_distance_list.index(max(average_distance_list))
        # now check to see if these match those that were already chosen
        best_set_of_aas = alignment_tuples[index_of_best_set]
        alignments_are_same = True

        for i in range(len(current_seqs)):
            if current_seqs[i] != best_set_of_aas[i]:
                alignments_are_same = False
                break

        if not alignments_are_same:
            # print('Fasta for ortholog: {}'.format(index))
            # for n in range(len(best_set_of_aas)):
            #     print('>{}\n{}'.format(n, alignment_tuples[index_of_best_set][n]))
            # we don't need to know the actual ORFs that have been chosen for each spp
            # only the aa codes so let's just output this result
            rows_to_be_replaced_dict[index] = [aa for aa in alignment_tuples[index_of_best_set]]

#### Setup and run the MP

In [None]:
from multiprocessing import Manager, Process, Queue

num_proc = 20

#Queue that will hold the index of the rows that need to be checked
input_queue = Queue()

# populate input_queue
for tup in mp_list:
    input_queue.put(tup)

for i in range(num_proc):
    input_queue.put('STOP')

# Manager for a dict rows_to_be_replaced_dict that will hold the new aa_seqs for the fixed indices
manager = Manager()
rows_to_be_replaced_dict = manager.dict()

list_of_processes = []
for N in range(num_proc):
    p = Process(target=ORF_screening_worker, args=(input_queue, rows_to_be_replaced_dict))
    list_of_processes.append(p)
    p.start()

for p in list_of_processes:
    p.join()
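
The queue-and-sentinel pattern used here can be seen in isolation in the sketch below. To keep it short and runnable anywhere it uses `threading` and `queue.Queue` rather than `multiprocessing` (a deliberate substitution, not what this notebook runs); the worker loop `iter(input_queue.get, 'STOP')` is the same. The `square_worker` function and its inputs are made up for illustration.

```python
import queue
import threading


def square_worker(input_queue, results):
    # iter(callable, sentinel) keeps calling input_queue.get() until it returns 'STOP'
    for item in iter(input_queue.get, 'STOP'):
        results[item] = item * item


num_workers = 3
input_queue = queue.Queue()
for item in range(10):
    input_queue.put(item)
# one sentinel per worker so that every worker eventually leaves its loop
for _ in range(num_workers):
    input_queue.put('STOP')

results = {}
workers = [threading.Thread(target=square_worker, args=(input_queue, results))
           for _ in range(num_workers)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

The key point is that the number of `'STOP'` items must match the number of workers; with too few, some workers would block forever on `get()` and `join()` would never return.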

At this point we have a dictionary that contains the ID of an ortholog as key and a list of the four aa seqs (one for each species) that need to be __replaced in the dataframe and then written out__.

In [None]:
fixed_counter = len(rows_to_be_replaced_dict.keys())
# now replace the dataframe values
for index in aa_seq_df.index.values.tolist():
    if index in rows_to_be_replaced_dict.keys():
        # then this is a row that needs replacing with the new values
        aa_seq_df.loc[index] = rows_to_be_replaced_dict[index]


# at this point it only remains to write the data frame out as csv
print('{} orthologs were checked due to multi-ORFs\n'
      '{} were fixed\n'
      '{} already contained the optimal choice of ORFs\n'.format
    (
    multi_orf_counter, fixed_counter, multi_orf_counter-fixed_counter
    )
)
aa_seq_df.to_csv('/home/humebc/projects/parky/aa_seq_multi_orf_orths_fixed.csv')

<hr />

### Creating the super alignment and phylogeny
* Generate local alignment for each ortholog (MAFFT; cropped)
* Calculate best fit amino acid substitution matrix (using prottest)
* Concatenate the local alignments into super alignment, create q file and make ML tree (using raxml_HPC)

### Generate local alignment for each ortholog (MAFFT; cropped)

__Local functions__

In [1]:
import os

def readDefinedFileToList(filename):
    temp_list = []
    with open(filename, mode='r') as reader:
        temp_list = [line.rstrip() for line in reader]
    return temp_list

def writeListToDestination(destination, listToWrite):
    #print('Writing list to ' + destination)
    try:
        os.makedirs(os.path.dirname(destination))
    except FileExistsError:
        pass

    with open(destination, mode='w') as writer:
        i = 0
        while i < len(listToWrite):
            if i != len(listToWrite)-1:
                writer.write(listToWrite[i] + '\n')
            elif i == len(listToWrite)-1:
                writer.write(listToWrite[i])
            i += 1

def convert_interleaved_to_sequencial_fasta_two(fasta_in):
    fasta_out = []

    for i in range(len(fasta_in)):

        if fasta_in[i].startswith('>'):
            if fasta_out:
                # if the fasta is not empty then this is not the first
                fasta_out.append(temp_seq_str)
            #else then this is the first sequence and there is no need to add the seq.
            temp_seq_str = ''
            fasta_out.append(fasta_in[i])
        else:
            temp_seq_str = temp_seq_str + fasta_in[i]
    #finally we need to add in the last sequence
    fasta_out.append(temp_seq_str)
    return fasta_out

#### Read in the aa and gene ID csvs

In [None]:
# the amino acid sequences
aa_seq_array = pd.read_csv('/home/humebc/projects/parky/aa_seq_multi_orf_orths_fixed.csv', sep=',', lineterminator='\n', index_col=0, header=0)

# the gene IDs
gene_id_array = pd.read_csv('/home/humebc/projects/parky/gene_id_fixed.csv', sep=',', lineterminator='\n', index_col=0, header=0)


#### Do the alignments using multiprocessing to speed things up.
This worker will perform the alignment and also do the cropping. We do the cropping by reading the alignment into a dataframe and then dropping columns that contain gaps from both the beginning and the end, stopping at each end when we reach a column that doesn't contain a gap.
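
The cropping rule can be sketched without pandas as follows. This is an illustrative plain-Python version, not the worker used in this notebook; the function name `crop_alignment_ends` and the toy alignment are assumptions. As in the worker, a stop codon `*` is treated like a gap.

```python
def crop_alignment_ends(aligned_seqs, gap_chars='-*'):
    """Drop leading and trailing columns that contain a gap in any sequence."""
    n_cols = len(aligned_seqs[0])

    def has_gap(col):
        return any(seq[col] in gap_chars for seq in aligned_seqs)

    # walk in from the left until the first gap-free column
    start = 0
    while start < n_cols and has_gap(start):
        start += 1
    # walk in from the right until the last gap-free column
    end = n_cols
    while end > start and has_gap(end - 1):
        end -= 1
    return [seq[start:end] for seq in aligned_seqs]


# Invented three-sequence alignment with ragged ends and one internal gap column.
aligned = ['--MKT-AYIAK-', 'M-MKTQAYIAKL', '-AMKT-AYIA--']
cropped = crop_alignment_ends(aligned)
```

Note that only the ends are trimmed: the internal gap column in the toy alignment is kept, because the walk from each end stops at the first gap-free column.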

In [None]:
from plumbum import local

def create_local_alignment_worker(input, output_dir, spp_list):
    # for each list that represents an ortholog
    for k_v_pair in iter(input.get, 'STOP'):

        # ortholog_id
        ortholog_id = k_v_pair[0]
        print('Processing {}'.format(ortholog_id))
        # ortholog_spp_seq_list
        ortholog_spp_seq_list = k_v_pair[1]

        # create the fasta
        fasta_file = []

        # for each species
        # add the name and aa_seq to the fasta_file
        for i in range(len(spp_list)):
            fasta_file.extend(['>{}_{}'.format(spp_list[i], ortholog_spp_seq_list[i][0]), ortholog_spp_seq_list[i][1]])

        # here we have the fasta_file populated

        # Write out the new fasta
        fasta_output_path = '{}/{}.fasta'.format(output_dir, ortholog_id)
        writeListToDestination(fasta_output_path, fasta_file)
        # now perform the alignment with MAFFT
        mafft = local["mafft-linsi"]
        in_file = fasta_output_path
        out_file = fasta_output_path.replace('.fasta', '_aligned.fasta')
        # now run mafft including the redirect
        (mafft[in_file] > out_file)()

        # at this point we have the aligned .fasta written to the output directory
        # and we now need to trim the fasta.
        # I was going to use trimAl but this doesn't actually have an option to clean up the ends of alignments
        # instead, read in the alignment as a 2D list to a pandas dataframe
        # then delete the beginning and end columns that contain gap sites
        aligned_fasta_interleaved = readDefinedFileToList(out_file)
        aligned_fasta = convert_interleaved_to_sequencial_fasta_two(aligned_fasta_interleaved)
        array_list = []
        for i in range(1, len(aligned_fasta), 2):
            array_list.append(list(aligned_fasta[i]))

        # make into pandas dataframe
        alignment_df = pd.DataFrame(array_list)

        # go from either end deleting any columns that have a gap
        columns_to_drop = []
        for i in list(alignment_df):
            # if there is a gap (or a stop codon '*') in the column at the beginning
            if '-' in list(alignment_df[i]) or '*' in list(alignment_df[i]):
                columns_to_drop.append(i)
            else:
                break
        for i in reversed(list(alignment_df)):
            # if there is a gap (or a stop codon '*') in the column at the end
            if '-' in list(alignment_df[i]) or '*' in list(alignment_df[i]):
                columns_to_drop.append(i)
            else:
                break

        # get a list that is the columns indices that we want to keep
        col_to_keep = [col_index for col_index in list(alignment_df) if col_index not in columns_to_drop]

        # drop the gap columns
        alignment_df = alignment_df[col_to_keep]

        # here we have the pandas dataframe with the gap columns removed
        #convert back to a fasta and write out
        cropped_fasta = []
        alignment_index_labels = list(alignment_df.index)
        for i in range(len(alignment_index_labels)):
            seq_name = '>{}_{}'.format(spp_list[i], ortholog_spp_seq_list[i][0])
            aa_seq = ''.join(alignment_df.loc[alignment_index_labels[i]])
            cropped_fasta.extend([seq_name, aa_seq])

        # here we have the cropped and aligned fasta
        # write it out
        aligned_cropped_fasta_path = fasta_output_path.replace('.fasta', '_aligned_cropped.fasta')
        writeListToDestination(aligned_cropped_fasta_path, cropped_fasta)

        # here we should be done with the single alignment
        print('Local alignment for {} completed'.format(ortholog_id))

#### Create and populate the input Queue for the MP process. Then run.
The items in this queue will be key-value pairs from the dictionary that we create below. In this dictionary the key is the ortholog ID and the value is a list of (gene ID, aa seq) tuples for that ortholog, one per species.

In [None]:
# making MP data_holder_list
tuple_holder_dict = {}
#for each ortholog
for row_index in aa_seq_array.index.values.tolist():
    print('Adding ortholog {} to MP info list'.format(row_index))
    ortholog_id_seq_list = []
    # for each species.
    for spp in list(gene_id_array):
        gene_id = gene_id_array[spp][row_index]
        aa_seq = aa_seq_array[spp][row_index]
        ortholog_id_seq_list.append((gene_id, aa_seq))
    tuple_holder_dict[row_index] = ortholog_id_seq_list

# creating the MP input queue
ortholog_input_queue = Queue()

# populate with one key value pair per ortholog
for key, value in tuple_holder_dict.items():
    print('Placing {} in MP queue'.format(key))
    ortholog_input_queue.put((key, value))

num_proc = 24

# put in the STOPs
for N in range(num_proc):
    ortholog_input_queue.put('STOP')

allProcesses = []

# directory to put the local alignments
output_dir = '/home/humebc/projects/parky/aa_tree_creation/local_alignments'

# the list of species for each ortholog
spp_list = [spp for spp in list(gene_id_array)]

# Then start the workers
for N in range(num_proc):
    p = Process(target=create_local_alignment_worker, args=(ortholog_input_queue, output_dir, spp_list))
    allProcesses.append(p)
    p.start()

for p in allProcesses:
    p.join()

# at this point we have the local alignments all written as fasta files to output_dir.
# Now it just remains to concatenate and then run the ML tree.

<hr />
### Calculate best fit amino acid substitution matrix (using prottest)
We use prottest to calculate the best-fit model for each of the local alignments, again using multiprocessing. This step takes a considerable amount of time; I ran it overnight (running it inside tmux is convenient). Once the models had been output, I checked that none of the output files had been corrupted (which would cause problems further down the line) by checking their sizes. Most files were 33k in size. The smaller files (about 20) were inspected; all had IO errors and had stopped prematurely. I therefore ran the code again (it has a built-in check that skips any alignment whose output has already been produced). This time no output files were smaller than 33k.
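
A size check of this kind could be scripted roughly as follows. This is only a sketch: the helper name `find_small_outputs` and the 30000-byte threshold are assumptions chosen for illustration, and the demonstration runs on throwaway files rather than the real prottest output.

```python
import os
import tempfile


def find_small_outputs(directory, suffix, min_bytes):
    """Return names of files ending in `suffix` that are smaller than `min_bytes`."""
    small = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if name.endswith(suffix) and os.path.getsize(path) < min_bytes:
            small.append(name)
    return small


# Demonstration on throwaway files: one full-size output and one truncated output.
with tempfile.TemporaryDirectory() as tmp_dir:
    for name, size in [('1_prottest_result.out', 33000), ('2_prottest_result.out', 120)]:
        with open(os.path.join(tmp_dir, name), 'w') as f:
            f.write('x' * size)
    suspicious = find_small_outputs(tmp_dir, '_prottest_result.out', min_bytes=30000)
```

Because the worker below skips any alignment whose output file already exists, truncated outputs flagged this way would presumably need to be deleted before re-running.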

In [None]:
import os
import subprocess
import sys

def prottest_worker(input_queue):
    base_dir = '/home/humebc/projects/parky/aa_tree_creation/local_alignments'
    for file_name in iter(input_queue.get, 'STOP'):
        input_path = '{}/{}'.format(base_dir, file_name)
        output_path = input_path.replace('_aligned_cropped.fasta', '_prottest_result.out')
        if os.path.isfile(output_path):
            continue
        sys.stdout.write('\rRunning prottest: {}'.format(file_name))
        # perform the prottest
        prot_path = '/home/humebc/phylogeneticsSoftware/protest/prottest-3.4.2/prottest-3.4.2.jar'
        subprocess.run(['java', '-jar', prot_path, '-i', input_path, '-o', output_path, '-all-distributions', '-all']
                       , stdout=subprocess.PIPE, stderr=subprocess.PIPE)

In [None]:
# we will find each of the local alignments and put them into a list which we will MP
# for each item we will run prottest with a single thread and output a file
# in the concatenate local alignments file we will then create a q file that will
# designate the different partitions and which of the substitution models to use.

# get a list of all of the fasta names that we will want to concatenate
base_dir = '/home/humebc/projects/parky/aa_tree_creation/local_alignments'
list_of_files = [f for f in os.listdir(base_dir) if 'aligned_cropped.fasta' in f]


num_proc = 12

# Queue that will hold the names of the alignment files to be processed
input_queue = Queue()

# populate input_queue
for file_name in list_of_files:
    input_queue.put(file_name)

for i in range(num_proc):
    input_queue.put('STOP')

list_of_processes = []
for N in range(num_proc):
    p = Process(target=prottest_worker, args=(input_queue,))
    list_of_processes.append(p)
    p.start()

for p in list_of_processes:
    p.join()


<hr />
### Concatenate the local alignments into super alignment, create q file and make ML tree (using raxml_HPC)
To do this we read in all of the prottest output files and look up which model was suggested for each of the local alignments. We make a default dictionary that holds key = model, value = list of the ortholog IDs that use that model. We then go model by model through this dict, and within each model ortholog ID by ortholog ID, concatenating all of the local alignments together. While concatenating, we also produce the q file on a model-by-model basis. The q file tells raxml how to partition the data and, in our case, which protein model should be used with which partition. By grouping the local alignments by the model they use, and therefore minimising the number of partitions, we hope to gain a significant reduction in compute time in the raxml stage.
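
The coordinate bookkeeping for the q file can be sketched as below. The function name `build_partition_lines` and the alignment lengths are made up for illustration; the real notebook derives the coordinates from the length of the growing concatenated sequence rather than from a list of lengths.

```python
def build_partition_lines(model_to_lengths):
    """Return one partition line per model, largest group of alignments first."""
    lines = []
    position = 0
    ordered = sorted(model_to_lengths, key=lambda m: len(model_to_lengths[m]), reverse=True)
    for gene_number, model in enumerate(ordered, start=1):
        start = position + 1                      # partitions are 1-based and inclusive
        position += sum(model_to_lengths[model])  # all alignments of this model are contiguous
        lines.append('{}, gene{} = {}-{}'.format(model.upper(), gene_number, start, position))
    return lines


# Hypothetical cropped alignment lengths grouped by best-fit model.
example = {'WAG': [200], 'JTT': [300, 150, 50], 'LG': [120, 80]}
partition_lines = build_partition_lines(example)
```

Six alignments end up in three partitions here, which is the saving the grouping is meant to achieve.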

#### Create the dictionary of model to ortholog IDs

In [None]:
# The master alignment that we create should be partitioned according to the protein model used.
# I have generated all of the .out files which are the outputs from the prottest
# We should iter through these and create a dictionary that is a model type as key
# and then have a list of the orthologs of that model.
# then sort this by the length of the list
# then work our way through the local alignments in this order creating the alignment
# We will need to generate the q file as we go
# this should take the form
'''
JTT, gene1 = 1-500
WAGF, gene2 = 501-800
WAG, gene3 = 801-1000

'''

# get list of the .out prottest files
base_dir = '/home/humebc/projects/parky/aa_tree_creation/local_alignments'
list_of_prot_out_filenames = [f for f in os.listdir(base_dir) if 'prottest' in f]

# iter through the list of protfiles creating the dict relating model to ortholog
# we cannot change the +G or +I for each partition. As such I will define according to the base model
model_to_orth_dict = defaultdict(list)
for i in range(len(list_of_prot_out_filenames)):
    model = ''
    file_name = list_of_prot_out_filenames[i]
    orth_num = int(file_name.split('_')[0])
    with open('{}/{}'.format(base_dir, file_name), 'r') as f:
        temp_file_list = [line.rstrip() for line in f]
    for j in range(300, len(temp_file_list), 1):
        if 'Best model according to BIC' in temp_file_list[j]:
            model = temp_file_list[j].split(':')[1].strip().replace('+G','').replace('+I','')
            break
    if model == '':
        sys.exit('Model line not found in {}'.format(orth_num))
    model_to_orth_dict[model].append(orth_num)

# #N.B. that we cannot have different gamma for different partitions
# # Also best advice is not to run +G and +I together.
# # As such we only need to extract the base model here i.e. WAG rather than WAG [+G|+I]
# for model in model_to_orth_dict

print('The 19k sequences are best represented by {} different aa models'.format(len(model_to_orth_dict.keys())))
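
The string handling that reduces a reported model to its base model can be checked in isolation. The lines below are invented examples in the form the parsing code above expects; the helper name `base_model_from_line` is hypothetical.

```python
def base_model_from_line(line):
    """Extract the model name and strip the +G and +I rate-variation suffixes."""
    model = line.split(':')[1].strip()
    return model.replace('+G', '').replace('+I', '')


# Invented example lines; the real prottest output may be formatted differently.
examples = [
    'Best model according to BIC: LG+I+G',
    'Best model according to BIC: WAG+G',
    'Best model according to BIC: JTT',
    'Best model according to BIC: LG+G+F',
]
base_models = [base_model_from_line(line) for line in examples]
```

Only `+G` and `+I` are removed, so a `+F` suffix, where present, stays part of the model name and forms its own group.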

#### Go model by model concatenating and making the q file

In [None]:
# here we have the dict populated
# now sort the dict
sorted_model_list = sorted(model_to_orth_dict, key=lambda k: len(model_to_orth_dict[k]), reverse=True)

# now go model by model in the sorted_model_list to make the master alignment.

# not the most elegant way but I think I'll just create the master fasta in memory
master_fasta = ['>min','', '>pmin', '', '>psyg', '', '>ppsyg', '']

# The q file will hold the information for the partitioning of the alignment for the raxml analysis
q_file = []
for model in sorted_model_list:
    q_file_start = len(master_fasta[1]) + 1
    sys.stdout.write('\rProcessing model {} sequences'.format(model))
    for orth_num in model_to_orth_dict[model]:
        file_name = str(orth_num) + '_aligned_cropped.fasta'
        with open('{}/{}'.format(base_dir, file_name), 'r') as f:
            temp_list_of_lines = [line.rstrip() for line in f]

        for i in range(1, len(temp_list_of_lines), 2):
            new_seq = master_fasta[i] + temp_list_of_lines[i]
            master_fasta[i] = new_seq
    q_file_finish = len(master_fasta[1])
    q_file.append('{}, gene{} = {}-{}'.format(
        model.upper(), sorted_model_list.index(model) + 1, q_file_start, q_file_finish))

# here we have the master fasta and the q file ready to be outputted

#### write out the master fasta and the q file

In [None]:
# now write out the master fasta
master_fasta_output_path = '/home/humebc/projects/parky/aa_tree_creation/master.fasta'
with open(master_fasta_output_path, 'w') as f:
    for line in master_fasta:
        f.write('{}\n'.format(line))

# now write out the q file
q_file_output_path = '/home/humebc/projects/parky/aa_tree_creation/qfile.q'
with open(q_file_output_path, 'w') as f:
    for line in q_file:
        f.write('{}\n'.format(line))

#### Now it just remains to run the raxml.
There are a lot of options here to get it to run right.
Briefly (to hopefully save time next time):
* -s = input file
* -q = the q file that defines the partitions
* -x = switches on rapid bootstrapping and provides the random number seed for it
* -f a = runs the rapid bootstrap analysis and the search for the best-scoring ML tree in one run, so the summarised tree will also have the bootstrap values on it in the output
* -p = a seed required by raxml (random number)
* -# the number of bootstraps
* -n the base of the output files
* -w the output directory where the files will be written
* -T the threads used (processes)
* -m the model used (see comment in code)

NB I tried to get the AVX2 version of raxml to work but the architecture of Symbiomics did not support it, so we are running the AVX version. Also please see the comment in the code on the -m flag.

In [None]:
# now run raxml
# NB although we are specifying models for each partition, we still need to pass a full model name
# (here PROTGAMMAWAG) to the -m flag. This is just a weird operation of raxml and is explained but hidden in
# the manual (search for 'If you want to do a partitioned analysis of concatenated'). Raxml will only extract
# the rate variation information from this and will ignore the model component e.g. WAG. FYI any model could
# be used; it doesn't have to be WAG.
raxml_path = '/home/humebc/phylogeneticsSoftware/raxml/standard-RAxML/raxmlHPC-PTHREADS-AVX'
subprocess.run([raxml_path, '-s', master_fasta_output_path, '-q', q_file_output_path,
                '-x', '183746', '-f', 'a', '-p', '83746273', '-#', '100', '-T', '8', '-n', 'parkinson_out',
                '-m', 'PROTGAMMAWAG', '-w', '/home/humebc/projects/parky/aa_tree_creation'])

print('\nConstruction of master fasta complete:\n{}\n{}'.format(master_fasta_output_path, q_file_output_path))