<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Finding-genome-coordinates-for-MaxQuant-Derived-Peptides" data-toc-modified-id="Finding-genome-coordinates-for-MaxQuant-Derived-Peptides-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Finding genome coordinates for MaxQuant Derived Peptides</a></span><ul class="toc-item"><li><span><a href="#Retreiving-Indexes-and-Information-from-a-CSV-file" data-toc-modified-id="Retreiving-Indexes-and-Information-from-a-CSV-file-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Retreiving Indexes and Information from a CSV file</a></span></li><li><span><a href="#Preparing-relevant-data-for-running-a-command-line-tBLASTn-search-against-the-Tb927-genome" data-toc-modified-id="Preparing-relevant-data-for-running-a-command-line-tBLASTn-search-against-the-Tb927-genome-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Preparing relevant data for running a command line tBLASTn search against the Tb927 genome</a></span></li><li><span><a href="#Command-Line-tBLASTn-code" data-toc-modified-id="Command-Line-tBLASTn-code-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Command Line tBLASTn code</a></span></li><li><span><a href="#Handling-BLAST-output" data-toc-modified-id="Handling-BLAST-output-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Handling BLAST output</a></span></li><li><span><a href="#Comparison-of-orginal-(MaxQuant)-and-new-(tBLASTn)-results" data-toc-modified-id="Comparison-of-orginal-(MaxQuant)-and-new-(tBLASTn)-results-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Comparison of orginal (MaxQuant) and new (tBLASTn) results</a></span></li></ul></li><li><span><a href="#Reveiwing-Mass-Spec-data-(MaxQuant)" data-toc-modified-id="Reveiwing-Mass-Spec-data-(MaxQuant)-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Reveiwing Mass Spec data (MaxQuant)</a></span><ul class="toc-item"><li><span><a href="#Retreiving-relevant-Information-for-new-protein-coding-genes" 
data-toc-modified-id="Retreiving-relevant-Information-for-new-protein-coding-genes-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Retreiving relevant Information for new protein coding genes</a></span></li></ul></li></ul></div>

In [None]:
import pandas as pd
import Bio as bio
import matplotlib.pyplot as plt

# Finding genome coordinates for MaxQuant Derived Peptides




MaxQuant was used to identify predicted new peptides and map them to the T. brucei 927 (Tb927) genome available at TriTrypDB, with the intention of using these genomic coordinates to discover new protein-coding gene regions. We wished to verify the genomic coordinates output by MaxQuant by re-mapping the peptides onto a newer version of the Tb927 genome (version 48) using tBLASTn.

## Retreiving Indexes and Information from a CSV file

For the MaxQuant output (predicted_new_pep.csv), the column numbers for peptide reference (pep_ref), gene name, and ID number were retrieved. A function that returns the column index of a requested feature was defined.

In [None]:
my_file = open('predicted_new_pep.csv', 'r')

def get_index(infile, feature):
    headers = infile.readline().rstrip('\n').split(',')   #drop the newline so the last column name can be matched too
    index = headers.index(feature)
    infile.seek(0)
    return index

pep_ref_index = get_index(my_file, 'pep_ref')
name_index = get_index(my_file, 'name')
ID_index = get_index(my_file, '')   #the ID number sits in the column whose header is empty

my_file.close()
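
As a rough alternative sketch, the standard-library `csv` module can perform the same header lookup while also coping with quoted fields that contain commas. The `column_index` helper and the sample rows below are illustrative assumptions, not the real layout of predicted_new_pep.csv.

```python
import csv
import io

def column_index(handle, feature):
    """Return the position of `feature` in the header row, then rewind the handle."""
    header = next(csv.reader(handle))   # csv copes with quoting and drops the newline
    handle.seek(0)
    return header.index(feature)

# Hypothetical stand-in for the real file; the first column has an empty header
sample = io.StringIO(',chrom,name,pep_ref\n0,chrA,geneA,f0-f1-f2-f3-f4-LLDEK\n')

print(column_index(sample, 'pep_ref'))   # 3
print(column_index(sample, ''))          # 0
```

Because the handle is rewound after each call, the helper can be called repeatedly on the same open file, as `get_index` is above.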



For each row in predicted_new_pep.csv, pep_ref was retrieved. Each peptide sequence was stored as a dictionary key (which ensures that only unique peptides are kept), with the corresponding value being an identifying fasta title (>pep{ID number}_{gene name}).

In [None]:
my_file = open('predicted_new_pep.csv', 'r')

all_peptides = []
pep_dict = {}


for line_number, line in enumerate(my_file):
    if line_number >0:                              #skip over column headers
        listed = line.strip().split(',')            #strip the newline, then split line into list of values for each column entry
        pep_details = (listed[pep_ref_index]).split('-')  #splits up pep_ref entry to allow only peptide sequence to be selected
        peptide = pep_details[5]                    #select just sequence
        gene_name = (listed[name_index])            #retrieve gene name... 
        num_ID = (listed[ID_index])                 #...and ID number for identifying once in fasta format
        fasta_title = '>pep{}_{}'.format(num_ID,gene_name) #create header for fasta with ID number and name
        all_peptides.append(peptide)                #store all peptides as a list, in order to have for future reference
        pep_dict[peptide] = fasta_title             #dict_key is peptide sequence, dict_value is fasta header

print(len(all_peptides))  #double check to ensure non unique peptides have been removed
print(len(pep_dict))


my_file.close()
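
To make the de-duplication step concrete, here is a small sketch on made-up rows. The six-field `pep_ref` layout, the `build_peptide_dict` helper and the example values are assumptions, chosen only so that the sequence sits at position 5 as in the cell above.

```python
def build_peptide_dict(rows):
    """Map each unique peptide sequence to a FASTA header of the form >pep{ID}_{name}."""
    peptides = {}
    for num_id, gene_name, pep_ref in rows:
        sequence = pep_ref.split('-')[5]   # sixth '-'-separated field holds the sequence
        peptides[sequence] = '>pep{}_{}'.format(num_id, gene_name)
    return peptides

# Hypothetical rows: (ID, gene name, pep_ref); IDs 0 and 2 share a sequence
rows = [
    ('0', 'geneA', 'f0-f1-f2-f3-f4-LLDEK'),
    ('1', 'geneB', 'f0-f1-f2-f3-f4-AVTGR'),
    ('2', 'geneA', 'f0-f1-f2-f3-f4-LLDEK'),
]

pep_dict_demo = build_peptide_dict(rows)
print(len(rows), len(pep_dict_demo))   # 3 2
print(pep_dict_demo['LLDEK'])          # >pep2_geneA
```

A repeated peptide keeps the header from its last occurrence, so only that ID number is carried forward into the fasta file.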


## Preparing relevant data for running a command line tBLASTn search against the Tb927 genome

The contents of the unique peptides dictionary were written out to a file in fasta format (>title \n peptide sequence).

In [None]:
fasta_out = open('pep_ref_all.fa', 'w')

for entry in pep_dict.keys():
    fasta_out.write(pep_dict[entry]+'\n')  #write fasta header
    fasta_out.write(entry+'\n')            #write peptide sequence

fasta_out.close()


Fasta sequences were split into long (≥20 aa) and short (<20 aa) sequences and saved in separate files so that tBLASTn could be run with different parameters for short and long peptides.

In [None]:
fasta_out_longpeps = open('pep_ref_long.fa', 'w')
fasta_out_shortpeps = open('pep_ref_short.fa', 'w')

for entry in pep_dict.keys():
    if len(entry)>=20:
        fasta_out_longpeps.write(pep_dict[entry]+'\n')  #write fasta header
        fasta_out_longpeps.write(entry+'\n')            #write peptide sequence
    if len(entry)<20:
        fasta_out_shortpeps.write(pep_dict[entry]+'\n')  #write fasta header
        fasta_out_shortpeps.write(entry+'\n')            #write peptide sequence
        
fasta_out_longpeps.close()
fasta_out_shortpeps.close()
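
The same split can be sketched as a single helper that uses `with` blocks, so both files are closed even if a write fails. The function name, the `cutoff` parameter and the demo peptides are assumptions made for illustration; the 20-residue boundary follows the cell above.

```python
import os
import tempfile

def write_fasta_by_length(peptides, long_path, short_path, cutoff=20):
    """Write peptides of at least `cutoff` residues to long_path, the rest to short_path."""
    counts = {'long': 0, 'short': 0}
    with open(long_path, 'w') as long_out, open(short_path, 'w') as short_out:
        for sequence, header in peptides.items():
            if len(sequence) >= cutoff:
                target, kind = long_out, 'long'
            else:
                target, kind = short_out, 'short'
            target.write('{}\n{}\n'.format(header, sequence))   # header line, then sequence line
            counts[kind] += 1
    return counts

# Hypothetical peptides, written to a temporary directory
demo = {'A' * 25: '>pep0_geneA', 'LLDEK': '>pep1_geneB'}
out_dir = tempfile.mkdtemp()
counts = write_fasta_by_length(demo,
                               os.path.join(out_dir, 'long.fa'),
                               os.path.join(out_dir, 'short.fa'))
print(counts)   # {'long': 1, 'short': 1}
```

Returning the counts gives a quick check that every peptide ended up in exactly one of the two files.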



## Command Line tBLASTn code


The tBLASTn search itself was scripted in bash. The database 927_genome was built with the script below from the whole-genome fasta file downloaded from TriTrypDB. Both this raw file and the resulting database files (927_genome.nhr, 927_genome.nin and 927_genome.nsq) are available in the associated Git repository.

In [None]:
%%bash

#!/bin/bash

#$ -cwd

makeblastdb -in TriTrypDB-48_TbruceiTREU927_Genome.fasta -out 927_genome -dbtype nucl



Long sequences

In [None]:
%%bash

#!/bin/bash

#$ -cwd

tblastn -query pep_ref_long.fa -db 927_genome -out results_pep_ref_long.csv -outfmt '10 qseqid qseq sseqid pident qlen qstart qend sstart send frames positive mismatch gaps evalue bitscore'

Short sequences

In [None]:
%%bash

#!/bin/bash

#$ -cwd

tblastn -query pep_ref_short.fa -db 927_genome -out results_pep_ref_short.csv -evalue 100 -word_size 2 -gapopen 9 -gapextend 1 -matrix PAM30 -threshold 16 -comp_based_stats 0 -window_size 15 -outfmt '10 qseqid qseq sseqid pident qlen qstart qend sstart send frames positive mismatch gaps evalue bitscore'
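
The long and short commands differ only in the extra options passed for short peptides. As a sketch, the two argument lists could be assembled from Python; the `tblastn_command` helper is hypothetical, while the flags and file names are copied from the cells above.

```python
import shlex

OUTFMT = ('10 qseqid qseq sseqid pident qlen qstart qend sstart send '
          'frames positive mismatch gaps evalue bitscore')

# Extra options applied only to short peptides, copied from the command above
SHORT_OPTIONS = ['-evalue', '100', '-word_size', '2', '-gapopen', '9',
                 '-gapextend', '1', '-matrix', 'PAM30', '-threshold', '16',
                 '-comp_based_stats', '0', '-window_size', '15']

def tblastn_command(query, out, short=False, db='927_genome'):
    """Return the tblastn argument list for one query file."""
    cmd = ['tblastn', '-query', query, '-db', db, '-out', out]
    if short:
        cmd += SHORT_OPTIONS
    return cmd + ['-outfmt', OUTFMT]

long_cmd = tblastn_command('pep_ref_long.fa', 'results_pep_ref_long.csv')
short_cmd = tblastn_command('pep_ref_short.fa', 'results_pep_ref_short.csv', short=True)

print(shlex.join(long_cmd))    # shell-quoted form of the long-peptide command
print(shlex.join(short_cmd))   # shell-quoted form of the short-peptide command
```

On a machine where BLAST+ is installed, either list could be handed to `subprocess.run(cmd, check=True)`; the sketch only prints the commands so that it runs anywhere.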

## Handling BLAST output

tBLASTn output was reformatted to include column headers and saved in the new files results_withheaders_long.csv and results_withheaders_short.csv.


In [None]:
#defining function

def write_headers(infile,outfile):
    #end the header with a newline so that the first result row starts on its own line
    outfile.write("name,query_peptide,subject,%ID,query_length,q_start,q_end,s_start,s_end,frames(q/s),num_positives,num_mismatches,num_gaps,evalue,bitscore\n")
    for line in infile:
        outfile.write(line)
    return None

#apply for both long and short results

results_short = open('results_pep_ref_short.csv', 'r')
results_long= open('results_pep_ref_long.csv', 'r')

headers_short = open('results_withheaders_short.csv', 'w')
headers_long = open('results_withheaders_long.csv', 'w')

write_headers(results_short,headers_short)
write_headers(results_long,headers_long)


results_short.close()
results_long.close()
headers_short.close()
headers_long.close()



Only matches with 100% identity were selected (again stored in new files so that the original files are kept in case they are needed in future).

In [None]:
#defining function
def select_best(infile, outfile):
    for line in infile:
        ID = (line.split(',')).index('%ID')
        outfile.write(line)
        break

    for line in infile:
        columns=line.split(',')
        if float(columns[ID]) == 100:
            outfile.write(line)
    return None

#long peptides
results = open('results_withheaders_long.csv', 'r')
best_matches = open('results_long_bestID.csv', 'w')
select_best(results,best_matches)
               
results.close()
best_matches.close()

#short peptides - the above function was not used as another condition was included: 
#that the match was covering the entire query length, 
#as the parameters used for short sequences can often provide incomplete matches

results = open('results_withheaders_short.csv', 'r')
best_matches = open('results_short_bestID.csv', 'w')

for line in results:
        end = (line.split(',')).index('q_end')
        start = (line.split(',')).index('q_start')
        length= (line.split(',')).index('query_length')
        ID = (line.split(',')).index('%ID')
        best_matches.write(line)
        break

for line in results:
        columns=line.split(',')
        if float(columns[ID]) == 100 and ((float(columns[length])-1))<= (float(columns[end])-float(columns[start])):
            best_matches.write(line)

results.close()
best_matches.close()      
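
The condition used for short peptides can be read as "100% identity and the alignment spans the whole query". Since q_start and q_end are 1-based and inclusive, the aligned span is q_end - q_start + 1 residues. A minimal sketch of that test follows; the function name and example values are assumptions.

```python
def is_full_length_perfect_match(pident, qlen, qstart, qend):
    """True when identity is 100% and the alignment covers the entire query."""
    aligned = int(qend) - int(qstart) + 1   # inclusive 1-based coordinates
    return float(pident) == 100 and aligned >= int(qlen)

# Values are passed as strings, as they would be after splitting a CSV line
print(is_full_length_perfect_match('100.000', '9', '1', '9'))   # True: whole query aligned
print(is_full_length_perfect_match('100.000', '9', '2', '9'))   # False: first residue unaligned
print(is_full_length_perfect_match('88.889', '9', '1', '9'))    # False: identity below 100%
```

The inequality `aligned >= qlen` is the same test as `query_length - 1 <= q_end - q_start` in the cell above, just rearranged.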
        

## Comparison of orginal (MaxQuant) and new (tBLASTn) results

In [None]:
#retrieve the ID numbers of the peptides that were analysed

pep_dict_IDs = []       
for entry in pep_dict.values():     #instead of opening original file again, unique peptides dictionary can be used..
    end_ID = entry.index('_')       #...as these peptides were actually analysed after filtering out non-unique
    pep_dict_IDs.append(entry[4:end_ID])  

    
    
#obtain the original coordinates (generated by MaxQuant) of the peptides analysed, 
#storing in dictionary to ensure only unique peptides again and keep ID numbers and coordinates properly associated

old_results = open('predicted_new_pep.csv', 'r')

results_sorted = {}

for line_number, line in enumerate(old_results):      
    if line_number >0:
        data = line.strip().split(',')
        coord = '{}:{}..{}'.format(data[1],data[12],data[13])
        pep_details = (data[14]).split('-')  
        peptide = pep_details[5]
        ID = data[0]
        gene_name= data[4]
        if ID in pep_dict_IDs:
            results_sorted[data[0]] = [gene_name, peptide, coord]
        else:
            continue
               
old_results.close()


#Combining original and new results (for both long and short peptides) by peptide ID,
#saving in new 'results_by_ID_compared.csv' file by first creating results_sorted dictionary 
               
def combine_by_ID(infile):
    
    for line_number, line in enumerate(infile):
        if line_number >0: 
            data = line.strip().split(',')
            ID = ((data[0].split('_'))[0]).strip(' pep')
            if not ID in results_sorted :
                results_sorted[ID] = []
            s_start, s_end = int(data[7]), int(data[8])   #compare as numbers; as strings '999' sorts after '1000'
            if s_start<s_end:
                results_sorted[ID].append('{}:{}..{}'.format(data[2],s_start,s_end))
            elif s_end<s_start:
                results_sorted[ID].append('{}:{}..{}'.format(data[2],s_end,s_start))
    return results_sorted

          
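#NOTE: the long-peptide file read here is the unfiltered one; use 'results_long_bestID.csv' instead to keep only 100% ID matches
results_long = open('results_withheaders_long.csv', 'r')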
results_short = open('results_short_bestID.csv', 'r')

combine_by_ID(results_long)
combine_by_ID(results_short)

results_long.close()
results_short.close()
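
Reverse-strand hits are typically reported with s_start greater than s_end, and fields read from a CSV line are strings, so the coordinates need converting to integers before they are put in order. A minimal sketch, with a hypothetical helper name and a made-up subject name:

```python
def format_hit(subject, s_start, s_end):
    """Return 'subject:low..high' whichever way round the coordinates arrive."""
    low, high = sorted((int(s_start), int(s_end)))
    return '{}:{}..{}'.format(subject, low, high)

# Comparing the raw strings would put these two coordinates in the wrong order
print('999' < '1000')                     # False: strings compare character by character
print(format_hit('chrA', '1000', '999'))  # chrA:999..1000
print(format_hit('chrA', '999', '1000'))  # chrA:999..1000
```

Writing every hit as low..high gives it the same '{}:{}..{}' layout as the MaxQuant coordinate strings, so the two can be compared directly.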

#writing out to file
#checking if there is a match between original and new coordinates

results_by_ID = open('results_by_ID_compared.csv', 'w')

results_by_ID.write("ID,gene,peptide,original_coordinates,matching,blast_outputs\n")

for entry in results_sorted:
    ID = entry
    gene_name = str(results_sorted[entry][0]).strip('[')
    peptide = str(results_sorted[entry][1])
    orig_coord = results_sorted[entry][2]
    blast = results_sorted[entry][3::]
    if orig_coord in blast:
        matching = 'MATCH'
    else:
        matching = 'DIFFERENT'
    results_by_ID.write('{},{},{},{},{},"{}"\n'.format(ID,gene_name,peptide,orig_coord,matching,blast)) 
    

results_by_ID.close()

# Reveiwing Mass Spec data (MaxQuant)



The data produced by MaxQuant were reviewed manually to assess whether the evidence is strong enough to support the peptide identities assigned by the program.

## Retreiving relevant Information for new protein coding genes

Relevant information (score, delta score, PEP, raw file) for our new coding genes was retrieved from the files of MS evidence. As the raw files are too large to be added to the associated Git repository, they can be downloaded locally using the cURL commands in the following 2 cells.
Additionally, each peptide was annotated according to whether it was observed in the Blood Stream Form (BSF) or Procyclic Form (PCF) parasite lifecycle stage.

In [None]:
#PCF file

!curl 'https://dmail-my.sharepoint.com/personal/mtinti_dundee_ac_uk/_layouts/15/download.aspx?SourceUrl=%2Fpersonal%2Fmtinti%5Fdundee%5Fac%5Fuk%2FDocuments%2Fmaster%5Fproject%5Flizzie%2FMS%2Fevidence%5Fpcf%2Ezip' \
  -H 'Connection: keep-alive' \
  -H 'sec-ch-ua: "Google Chrome";v="87", " Not;A Brand";v="99", "Chromium";v="87"' \
  -H 'sec-ch-ua-mobile: ?0' \
  -H 'Upgrade-Insecure-Requests: 1' \
  -H 'DNT: 1' \
  -H 'User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36' \
  -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9' \
  -H 'Service-Worker-Navigation-Preload: true' \
  -H 'Sec-Fetch-Site: same-origin' \
  -H 'Sec-Fetch-Mode: navigate' \
  -H 'Sec-Fetch-User: ?1' \
  -H 'Sec-Fetch-Dest: document' \
  -H 'Referer: https://dmail-my.sharepoint.com/personal/mtinti_dundee_ac_uk/_layouts/15/onedrive.aspx?id=%2Fpersonal%2Fmtinti%5Fdundee%5Fac%5Fuk%2FDocuments%2Fmaster%5Fproject%5Flizzie%2FMS%2Fevidence%5Fpcf%2Ezip&parent=%2Fpersonal%2Fmtinti%5Fdundee%5Fac%5Fuk%2FDocuments%2Fmaster%5Fproject%5Flizzie%2FMS' \
  -H 'Accept-Language: en-US,en;q=0.9,it;q=0.8' \
  -H 'Cookie: MicrosoftApplicationsTelemetryDeviceId=d5c68e82-a582-4578-859f-783ec8a2ce20; MicrosoftApplicationsTelemetryFirstLaunchTime=2020-12-17T17:30:25.426Z; WordWacDataCenter=GUK1; WacDataCenter=GUK1; rtFa=irlrX2saO25ge/DfrhXLIZkMAY5SZ8Woc/v/wO33fCgmN0Y2NTRGMkMtODBEQS00NENBLUIwN0YtM0FCNDREREQ1QkEzQoKU/vFZqSq5NViWSrXANugJgoJO3jG2Uzt+dfjDSuRaTuzhj6/9sJRnv4YtW0QliwY7ZDTdKDIy7il/bPLSeEyM+ZJdeinNnbZqpI9nWSV12JPLaRMqbDxVb6BoyYab9oh+grTgh2QMP+VHfm83J5P+rhisjDj+uWAIRScIUIWu7rbViOopervSx5t46rBIfOIZU1akZU4/yzhfT9e8ub2gu3v/NpsKGjOdUjg4rxQdlk+OR5aOE2YaD7Q1VvLpbn61Dx3LL7xpgi2O5q4OLwmZ7IBTmj1bQTAJFH3dc1nkYEG0lWkNxSL1h1E1znzoBhuNv8jEW1ZsiKoU4CovRkUAAAA=; FedAuth=77u/PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0idXRmLTgiPz48U1A+VjgsMGguZnxtZW1iZXJzaGlwfDEwMDM3ZmZlODBhYzg4NThAbGl2ZS5jb20sMCMuZnxtZW1iZXJzaGlwfG10aW50aUBkdW5kZWUuYWMudWssMTMyNTI1OTAwNDcwMDAwMDAwLDEzMjI3NzM2OTkzMDAwMDAwMCwxMzI1Mjc4NjEwNzU2MTMzOTcsMTM0LjM2LjIyMy4yNDAsMiw3ZjY1NGYyYy04MGRhLTQ0Y2EtYjA3Zi0zYWI0NGRkZDViYTMsLGQ1YzJjMDljLTFjNjgtNDg0Zi05ZDdkLTg1MmRhZjZiNDNmMCwxMGJlZTE1NC1iYjljLTQzMWMtYmFiMS0wZWYyMGQ1MTY1NmEsMTBiZWUxNTQtYmI5Yy00MzFjLWJhYjEtMGVmMjBkNTE2NTZhLCwwLDEzMjUyNzAzMzA3NTMwMDg1OCwxMzI1Mjk1ODkwNzUzMDA4NTgsLCxleUo0YlhOZlkyTWlPaUpiWENKRFVERmNJbDBpZlE9PSwyNjUwNDY3NzQzOTk5OTk5OTk5LDEzMjUyNjk5NzA3MDAwMDAwMCw3OGM1MTQyYi1iODVhLTQ5NzItYjc4ZS03NmU3MTFlN2QwODYsQTB4cmhXdmFldHd1dnBaU3BMMHdON1YrakV5M1NqckNZeEx5S01HU1JpdGo4QnVzU3FLbHdMcnhJQ0FtY2NJRUEvUVVKS2N1UlBRd2h3T3BEYjExQkh2aXpJQnZ5ZUFBV1VacHVqeXZhMkw5Q0pnWlRQVnhJWHV4bDRXYVNxRkNqYXVrcEM4OEdBNVY1QmVtYmRFME54UDdRKzUzVFJXek9JVnZZOHhPUGQwdklDZW5iL3R1dGxlWFUrL1FsTVl1REhhWnB4bG5lZjhXdEtTTitVTmFBWXVmdjJaWXZrQ3JFS0pPenpxNDJMZU9mNy93aE5FZEo4MDZyWFNpMlB1ZDhxV25oZVV4a2JOdEU1RklyT2NCcTVVazFQalFnQ2haS3RxWlFvdyt3aVk1anJZeUwzdDhxVnMvKy9uNkJYR1ptNlQxUXJDaHorSS9aK0pMOUtheitBPT08L1NQPg==; odbn=1; cucg=1' \
  --compressed --output evidence_pcf.zip

In [None]:
#BSF file

!curl 'https://dmail-my.sharepoint.com/personal/mtinti_dundee_ac_uk/_layouts/15/download.aspx?SourceUrl=%2Fpersonal%2Fmtinti%5Fdundee%5Fac%5Fuk%2FDocuments%2Fmaster%5Fproject%5Flizzie%2FMS%2Fevidence%5Fbsf%2Ezip' \
  -H 'authority: dmail-my.sharepoint.com' \
  -H 'sec-ch-ua: "Google Chrome";v="87", " Not;A Brand";v="99", "Chromium";v="87"' \
  -H 'sec-ch-ua-mobile: ?0' \
  -H 'upgrade-insecure-requests: 1' \
  -H 'dnt: 1' \
  -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36' \
  -H 'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9' \
  -H 'service-worker-navigation-preload: true' \
  -H 'sec-fetch-site: same-origin' \
  -H 'sec-fetch-mode: navigate' \
  -H 'sec-fetch-user: ?1' \
  -H 'sec-fetch-dest: document' \
  -H 'referer: https://dmail-my.sharepoint.com/personal/mtinti_dundee_ac_uk/_layouts/15/onedrive.aspx?id=%2Fpersonal%2Fmtinti%5Fdundee%5Fac%5Fuk%2FDocuments%2Fmaster%5Fproject%5Flizzie%2FMS%2Fevidence%5Fbsf%2Ezip&parent=%2Fpersonal%2Fmtinti%5Fdundee%5Fac%5Fuk%2FDocuments%2Fmaster%5Fproject%5Flizzie%2FMS&originalPath=aHR0cHM6Ly9kbWFpbC1teS5zaGFyZXBvaW50LmNvbS86dTovZy9wZXJzb25hbC9tdGludGlfZHVuZGVlX2FjX3VrL0VVV3NZTk5fYy1sRWtHS1ZFcUstemhBQlZteWk1TUs3YWNGVFlnXzVrQngwd1E_cnRpbWU9N0gzZVJMR2kyRWc' \
  -H 'accept-language: en-US,en;q=0.9,it;q=0.8' \
  -H 'cookie: MicrosoftApplicationsTelemetryDeviceId=d5c68e82-a582-4578-859f-783ec8a2ce20; MicrosoftApplicationsTelemetryFirstLaunchTime=2020-12-17T17:30:25.426Z; WordWacDataCenter=GUK1; WacDataCenter=GUK1; rtFa=irlrX2saO25ge/DfrhXLIZkMAY5SZ8Woc/v/wO33fCgmN0Y2NTRGMkMtODBEQS00NENBLUIwN0YtM0FCNDREREQ1QkEzQoKU/vFZqSq5NViWSrXANugJgoJO3jG2Uzt+dfjDSuRaTuzhj6/9sJRnv4YtW0QliwY7ZDTdKDIy7il/bPLSeEyM+ZJdeinNnbZqpI9nWSV12JPLaRMqbDxVb6BoyYab9oh+grTgh2QMP+VHfm83J5P+rhisjDj+uWAIRScIUIWu7rbViOopervSx5t46rBIfOIZU1akZU4/yzhfT9e8ub2gu3v/NpsKGjOdUjg4rxQdlk+OR5aOE2YaD7Q1VvLpbn61Dx3LL7xpgi2O5q4OLwmZ7IBTmj1bQTAJFH3dc1nkYEG0lWkNxSL1h1E1znzoBhuNv8jEW1ZsiKoU4CovRkUAAAA=; FedAuth=77u/PD94bWwgdmVyc2lvbj0iMS4wIiBlbmNvZGluZz0idXRmLTgiPz48U1A+VjgsMGguZnxtZW1iZXJzaGlwfDEwMDM3ZmZlODBhYzg4NThAbGl2ZS5jb20sMCMuZnxtZW1iZXJzaGlwfG10aW50aUBkdW5kZWUuYWMudWssMTMyNTI1OTAwNDcwMDAwMDAwLDEzMjI3NzM2OTkzMDAwMDAwMCwxMzI1Mjc4NjEwNzU2MTMzOTcsMTM0LjM2LjIyMy4yNDAsMiw3ZjY1NGYyYy04MGRhLTQ0Y2EtYjA3Zi0zYWI0NGRkZDViYTMsLGQ1YzJjMDljLTFjNjgtNDg0Zi05ZDdkLTg1MmRhZjZiNDNmMCwxMGJlZTE1NC1iYjljLTQzMWMtYmFiMS0wZWYyMGQ1MTY1NmEsMTBiZWUxNTQtYmI5Yy00MzFjLWJhYjEtMGVmMjBkNTE2NTZhLCwwLDEzMjUyNzAzMzA3NTMwMDg1OCwxMzI1Mjk1ODkwNzUzMDA4NTgsLCxleUo0YlhOZlkyTWlPaUpiWENKRFVERmNJbDBpZlE9PSwyNjUwNDY3NzQzOTk5OTk5OTk5LDEzMjUyNjk5NzA3MDAwMDAwMCw3OGM1MTQyYi1iODVhLTQ5NzItYjc4ZS03NmU3MTFlN2QwODYsQTB4cmhXdmFldHd1dnBaU3BMMHdON1YrakV5M1NqckNZeEx5S01HU1JpdGo4QnVzU3FLbHdMcnhJQ0FtY2NJRUEvUVVKS2N1UlBRd2h3T3BEYjExQkh2aXpJQnZ5ZUFBV1VacHVqeXZhMkw5Q0pnWlRQVnhJWHV4bDRXYVNxRkNqYXVrcEM4OEdBNVY1QmVtYmRFME54UDdRKzUzVFJXek9JVnZZOHhPUGQwdklDZW5iL3R1dGxlWFUrL1FsTVl1REhhWnB4bG5lZjhXdEtTTitVTmFBWXVmdjJaWXZrQ3JFS0pPenpxNDJMZU9mNy93aE5FZEo4MDZyWFNpMlB1ZDhxV25oZVV4a2JOdEU1RklyT2NCcTVVazFQalFnQ2haS3RxWlFvdyt3aVk1anJZeUwzdDhxVnMvKy9uNkJYR1ptNlQxUXJDaHorSS9aK0pMOUtheitBPT08L1NQPg==; odbn=1; cucg=1' \
  --compressed --output evidence_bsf.zip

In [None]:
#read in the proteomic evidence data (relevant columns only) for BSF 
#add a 'lifecycle stage' column, set this to 'BSF' 
df_bsf = pd.read_csv('evidence_bsf.zip',sep='\t', usecols = ['Sequence','Score','Delta score','PEP','Raw file'])
df_bsf['Lifecycle Stage'] = 'BSF'

#as above for PCF
df_pcf = pd.read_csv('evidence_pcf.zip',sep='\t', usecols = ['Sequence','Score','Delta score','PEP','Raw file'])
df_pcf['Lifecycle Stage'] = 'PCF'

#combine as one dataframe (DataFrame.append was removed in pandas 2.0, so pd.concat is used)
df_all = pd.concat([df_bsf, df_pcf])
df_all

In [None]:
#from previous blast results, create dataframe of gene names and peptide sequences for genes of interest only

genes_of_interest = ['TRY.375','MSTRG.94','KS17gene_8518a','KS17gene_7003a', 'KS17gene_6998a', 
                                     'KS17gene_6299a', 'KS17gene_265a', 'KS17gene_2338a', 
                                      'KS17gene_1898a']

df_genes = pd.read_csv('results_by_ID_compared.csv', index_col=0)

df_genes = df_genes[df_genes['gene'].isin(genes_of_interest)][['gene','peptide']]

df_genes = df_genes.groupby('gene')['peptide'].apply(list).reset_index(name='peptides')

df_genes


In [None]:
#create empty output data frame to append relevant information to

df_output = pd.DataFrame(columns= ['Gene','Sequence','Score','Delta score','PEP','Raw file', 'Lifecycle Stage'])

df_output

In [None]:
#for genes of interest (df_genes), retrieve relevant information from evidence dataframe (df_all)
#df_all can be searched by peptide sequence, 
#so retrieve information for the peptides in the list under each gene in df_genes

for index, row in df_genes.iterrows():
    gene_name = row['gene']
    df_all['Gene'] = gene_name
    #save in dataframe only the information and columns that are relevant to our analysis
    #sort results by quality (low to high PEP) and remove non-unique peptides
    matches = df_all[df_all['Sequence'].isin(row['peptides'])][
        ['Gene','Sequence','Score','Delta score','PEP','Raw file', 'Lifecycle Stage']].sort_values('PEP').drop_duplicates('Sequence')
    df_output = pd.concat([df_output, matches])   #DataFrame.append was removed in pandas 2.0


df_output.to_csv('ms_data.csv')  #save as csv file
df_output
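
The "sort by PEP, then drop duplicate sequences" step keeps the single best-scoring evidence row for each peptide. A plain-Python sketch of that selection on made-up rows (the `best_evidence` helper and all values are hypothetical):

```python
def best_evidence(rows, wanted):
    """Keep, for each wanted peptide, the evidence row with the lowest PEP."""
    best = {}
    for row in rows:
        seq = row['Sequence']
        if seq in wanted and (seq not in best or row['PEP'] < best[seq]['PEP']):
            best[seq] = row
    return sorted(best.values(), key=lambda r: r['PEP'])   # lowest PEP first

# Hypothetical evidence rows; real values come from the MaxQuant evidence files
rows = [
    {'Sequence': 'LLDEK', 'PEP': 0.02, 'Lifecycle Stage': 'BSF'},
    {'Sequence': 'LLDEK', 'PEP': 0.001, 'Lifecycle Stage': 'PCF'},
    {'Sequence': 'AVTGR', 'PEP': 0.01, 'Lifecycle Stage': 'BSF'},
    {'Sequence': 'QQQQK', 'PEP': 0.5, 'Lifecycle Stage': 'PCF'},
]

selected = best_evidence(rows, {'LLDEK', 'AVTGR'})
for row in selected:
    print(row['Sequence'], row['PEP'], row['Lifecycle Stage'])
```

One consequence worth noting: a peptide observed in both BSF and PCF keeps only the lifecycle stage of its best-scoring row, as LLDEK does here.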
