# Prepare FASTA sequences for OrthoFinder analysis
To maximize and simplify the usage of OrthoFinder results, sequences must be in the following format:

1. Consists of coding sequence only. If the sequence comes from predicted transcripts, ensure that there is no UTR sequence. The beginning and end of the sequence should correspond to the beginning and end of the protein. 
2. Number of sequences == number of predicted coding genes. Select the single 'best' (in practice, usually the longest) coding sequence per gene. 
3. Each sequence header should have the genome id appended with a '_', e.g., >g000234.t1_UTEX2797
4. Sequence headers should be simple. Remove any sequence descriptors and nonstandard characters. Letters, digits, '.', '-', and '_' are okay; everything else should be removed.
5. Resulting FASTA files should be named simply with the genome id, e.g., UTEX2797.fa

*Starting FASTA files can be formatted in myriad ways, so it is impossible to standardize this step. Use this notebook to document how the sequences were preprocessed before running the OrthoFinder pipeline. Save the resulting FASTA files to the **`fasta`** subdirectory.*
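
As an illustration of rules 3 and 4, the header clean-up could be sketched as below. `clean_header` is a hypothetical helper, not part of this notebook's pipeline; it assumes the sequence id is the first whitespace-delimited token of the header and that everything after it is a descriptor that can be dropped.

```python
import re

# Anything that is not an ASCII letter, digit, '_', '.', or '-' is removed (rule 4).
_DISALLOWED = re.compile(r"[^\w.\-]", flags=re.ASCII)


def clean_header(raw_header, genome_id):
    """Return a simplified sequence id with the genome id appended (hypothetical helper)."""
    # Keep only the first whitespace-delimited token; the rest is treated as a descriptor.
    parts = raw_header.lstrip(">").split()
    seq_id = parts[0] if parts else ""
    # Strip nonstandard characters, then append the genome id with '_' (rule 3).
    seq_id = _DISALLOWED.sub("", seq_id)
    return f"{seq_id}_{genome_id}"


print(clean_header(">g000234.t1 some description", "UTEX2797"))
```

Headers that need strain-specific handling (such as removing a prefix) would still need their own logic on top of this.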

In [None]:
import glob
from Bio import SeqIO

In [None]:
infiles = '../../../figshare/annotation/genes_*_assembly/*codingseq.fa'
outdir = 'fasta/'

In [None]:
for infile in glob.glob(infiles):

    # Strain (genome id) is the part of the filename before the first '_'.
    strain = infile.split('/')[-1].split('_')[0]

    # Keep the longest coding sequence per gene: gene -> [sequence, record id].
    seqDict = {}
    for record in SeqIO.parse(infile, "fasta"):
        # Gene id is the part of the record id before the first '.'.
        gene = record.id.split(".")[0]

        if gene not in seqDict or len(seqDict[gene][0]) < len(record.seq):
            seqDict[gene] = [str(record.seq), record.id]

    outfile = outdir + strain + '.fa'
    with open(outfile, 'w') as fo:
        for seq, seqid in seqDict.values():

            # For 12B1 and UTEX2797, rebuild the id from its first 'g',
            # dropping any prefix (assumes no later 'g' in the id).
            if strain == '12B1' or strain == 'UTEX2797':
                seqid = 'g' + seqid.split('g')[1]

            fo.write('>' + seqid + '_' + strain + '\n')
            fo.write(seq + '\n')
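
Rule 1 (coding sequence only, spanning the full protein) can be spot-checked with a small test like the sketch below. `cds_problems` is a hypothetical helper, and it rests on assumptions that may not hold for every genome: the standard genetic code, complete gene models, and coding sequences that include the terminal stop codon. Partial gene models will be flagged even when they contain no UTR.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def cds_problems(seq):
    """Return a list of reasons why seq does not look like a complete CDS (hypothetical helper)."""
    seq = seq.upper()
    problems = []
    if len(seq) % 3 != 0:
        problems.append("length is not a multiple of 3")
    if not seq.startswith("ATG"):
        problems.append("does not start with ATG")
    if seq[-3:] not in STOP_CODONS:
        problems.append("does not end with a stop codon")
    return problems


print(cds_problems("ATGAAATAA"))
```

An empty list means none of the three checks failed; it does not prove the sequence is free of UTR.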

        

In [None]:
infiles = '../../../figshare/annotation/genes_*_assembly/*proteins.fa'
outdir = 'fasta/'  # trailing slash is required because the output path is built by string concatenation

In [None]:
for infile in glob.glob(infiles):

    # Strain (genome id) is the part of the filename before the first '_'.
    strain = infile.split('/')[-1].split('_')[0]

    outfile = outdir + strain + '.fa'
    with open(outfile, 'w') as fo:

        for record in SeqIO.parse(infile, "fasta"):
            header = record.id
            # Remove '*' (stop codon) characters from the protein sequence.
            sequence = str(record.seq).replace('*', '')

            # For 12B1 and UTEX2797, rebuild the id from its first 'g',
            # dropping any prefix (assumes no later 'g' in the id).
            if strain == '12B1' or strain == 'UTEX2797':
                header = 'g' + header.split('g')[1]

            newheader = header + '_' + strain
            fo.write('>' + newheader + '\n' + sequence + '\n')
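
Once the files are written, the header rules can be checked before running OrthoFinder. The sketch below is a minimal, standard-library-only check; `check_fasta` is a hypothetical helper, and it assumes each output file is named `<genome id>.fa` and lives in the `fasta` subdirectory as described at the top of this notebook.

```python
import glob
import os
import re

# Allowed header characters: ASCII letters, digits, '_', '.', and '-'.
VALID_ID = re.compile(r"[\w.\-]+", flags=re.ASCII)


def check_fasta(path):
    """Return a list of header problems found in one FASTA file (hypothetical helper)."""
    genome_id = os.path.splitext(os.path.basename(path))[0]
    problems = []
    seen = set()
    with open(path) as handle:
        for line in handle:
            if not line.startswith(">"):
                continue
            header = line[1:].rstrip("\n")
            if not VALID_ID.fullmatch(header):
                problems.append(f"disallowed characters: {header!r}")
            if not header.endswith("_" + genome_id):
                problems.append(f"missing genome id suffix: {header!r}")
            if header in seen:
                problems.append(f"duplicate header: {header!r}")
            seen.add(header)
    return problems


# Report every problem found in the output directory.
for path in sorted(glob.glob("fasta/*.fa")):
    for problem in check_fasta(path):
        print(path, problem)
```

This only inspects headers. It does not confirm that each file holds exactly one sequence per gene.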