Automated Protein Data Retrieval and Epitope Prediction

This algorithm was developed to streamline research on immunogenic T cell epitopes. Whole proteomes and subproteomes are retrieved from just two inputs: an organism and, optionally, a functional subcellular location. For each sequence in the proteome, epitopes are predicted with the epitope predictor NetMHCpan, which is accessed via a REST API. The MHC allele, epitope length and binding score threshold can be specified. All steps are automated and require minimal manual effort. The output comprises one spreadsheet each for protein and epitope data, and an epitope map image file.

The first step is to specify the proteome by choosing an organism and, optionally, a subcellular location the proteins should be associated with. Possible options are given in the drop-down list. If the organism strain is not selected unambiguously, multiple entries of the same protein may be retrieved. This can be prevented by answering yes in the pop-up window about redundancy deletion. Other input will result in an error. Additionally, retrieval can be restricted to reviewed entries, so that only manually curated and thus more reliable data are retrieved. By default, output is written to a folder named epitope_prediction in the system's standard documents directory. This can be changed in the directory text field at the bottom. Data are retrieved upon clicking the submit button.
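The redundancy deletion described above can be pictured with a minimal sketch. It assumes that two entries count as the same protein when their amino acid sequences are identical, which may not be the criterion the program actually uses. The function name `drop_redundant`, the dictionary keys and the entry IDs are hypothetical.

```python
def drop_redundant(entries):
    """Keep the first entry for each distinct amino acid sequence.

    Assumption: identical sequence means redundant entry.
    """
    seen = set()
    unique = []
    for entry in entries:
        sequence = entry["sequence"]
        if sequence not in seen:
            seen.add(sequence)
            unique.append(entry)
    return unique


# Placeholder entries with invented IDs and short made-up sequences.
entries = [
    {"id": "ENTRY_A", "name": "Membrane protein", "sequence": "MADSNGTITVEEL"},
    {"id": "ENTRY_B", "name": "Membrane protein", "sequence": "MADSNGTITVEEL"},
    {"id": "ENTRY_C", "name": "Envelope protein", "sequence": "MYSFVSEETGTLI"},
]
unique_entries = drop_redundant(entries)
print([entry["id"] for entry in unique_entries])
```

Here the second entry is dropped because it repeats the sequence of the first.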

Once protein data has been retrieved, the NetMHCpan button is enabled. The epitope prediction requires three inputs: an MHC allele and the epitope length in amino acids, both chosen from drop-down lists, and a numerical threshold that restricts the retrieved epitopes to a minimum binding score. Epitopes are predicted for each protein automatically upon clicking the submit button.
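How the epitope length and the threshold interact can be sketched as follows. This is not a NetMHCpan call: the scores below are invented placeholders, and the function names and the dictionary layout are hypothetical. The sketch only shows that a protein of length L yields L - k + 1 candidate peptides of length k, and that predictions below the threshold are discarded.

```python
def candidate_peptides(sequence, length):
    """Return (1-based start position, peptide) for every window of the given length."""
    return [
        (i + 1, sequence[i:i + length])
        for i in range(len(sequence) - length + 1)
    ]


def filter_by_score(predictions, threshold):
    """Keep predictions whose binding score is at least the threshold."""
    return [p for p in predictions if p["score"] >= threshold]


sequence = "MADSNGTITVEEL"  # made-up 13-residue sequence
peptides = candidate_peptides(sequence, 9)

# Placeholder scores standing in for predictor output.
placeholder_scores = [0.10, 0.75, 0.40, 0.90, 0.05]
predictions = [
    {"position": pos, "peptide": pep, "score": score}
    for (pos, pep), score in zip(peptides, placeholder_scores)
]
epitopes = filter_by_score(predictions, threshold=0.5)
print([(e["position"], e["peptide"]) for e in epitopes])
```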

Once epitope data has been retrieved, the Epitope_Map button is enabled. The program includes every protein on the epitope map unless the number of proteins is limited in the respective input field. Limiting the number of proteins increases the vertical resolution of the map, especially when a proteome comprises hundreds of proteins. Likewise, the displayed protein length, given as a number of amino acids, can be limited to increase the horizontal resolution.
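The two limits can be thought of as a selection step before drawing. The sketch below assumes that the first proteins in the given order are kept and that longer sequences are truncated rather than excluded. Neither assumption is confirmed by this description, and `limit_for_map` is a hypothetical name.

```python
def limit_for_map(proteins, max_proteins=None, max_length=None):
    """Select the proteins and sequence ranges to draw.

    proteins: list of (name, sequence) tuples.
    A limit of None means "no limit".
    """
    selected = proteins if max_proteins is None else proteins[:max_proteins]
    limited = []
    for name, sequence in selected:
        shown = sequence if max_length is None else sequence[:max_length]
        limited.append((name, shown))
    return limited


proteins = [
    ("protein_1", "MADSNGTITVEEL"),
    ("protein_2", "MYSFVSEETGTLI"),
    ("protein_3", "MKFLVFLGIITTV"),
]
print(limit_for_map(proteins, max_proteins=2, max_length=5))
```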

Output:

The first sheet of the protein table comprises log data, the input parameters and the frequency of each protein sequence. The second sheet is named after the subcellular location and lists, for each protein, the UniProt ID, protein name, gene name, protein length, subcellular location, taxonomy, signal peptide position and amino acid sequence. The filename consists of the prefix unp, the organism and the subcellular location, for example unp_SARS_CoV_2_Membrane.xlsx.
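The naming scheme can be reproduced with a short sketch. It assumes that every run of characters other than letters and digits is replaced by a single underscore, which is consistent with the example filename but is not confirmed as the program's actual rule. The helper names and the input spelling "SARS-CoV-2" are hypothetical.

```python
import re


def safe_part(text):
    """Replace each run of non-alphanumeric characters with one underscore."""
    return re.sub(r"[^A-Za-z0-9]+", "_", text).strip("_")


def output_filename(prefix, parts, extension):
    """Join the prefix and the sanitized parts, then append the extension."""
    return "_".join([prefix] + [safe_part(p) for p in parts]) + "." + extension


protein_file = output_filename("unp", ["SARS-CoV-2", "Membrane"], "xlsx")
print(protein_file)
```

The same helper accepts further parts, such as an allele name, when a filename carries more information.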

The epitope table also shows the log, the input parameters and the epitope frequencies on the first sheet. The second sheet lists, for each protein, the corresponding epitopes with their position in the amino acid sequence and their binding score. It also gives the name and ID of the protein and three metrics for rating the proteins. The protein score is the mean epitope binding score. The density is the number of predicted epitopes divided by the total number of possible peptides of the epitope length in the respective protein. The signal density is the same quantity restricted to the signal peptide range of the protein. Epitopes located inside the signal peptide range are written in green, whereas IDs with a green background mark the top-scoring proteins across the whole set according to epitope density. The filename differs from that of the protein file in the prefix nmp and the additional allele information, for example nmp_SARS_CoV_2_Membrane_HLA_A_02_01.xlsx.
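The three metrics can be written out as a minimal sketch. It assumes 1-based positions, an inclusive signal peptide range, and that an epitope counts towards the signal density only if it lies entirely inside that range. These conventions and the function names are assumptions, and the scores are placeholders.

```python
def possible_peptides(region_length, epitope_length):
    """Number of windows of epitope_length that fit into a region."""
    return max(region_length - epitope_length + 1, 0)


def protein_metrics(protein_length, epitope_length, epitopes, signal_range=None):
    """Return (protein score, density, signal density).

    epitopes: list of (1-based start position, binding score).
    signal_range: (start, end), inclusive and 1-based, or None.
    """
    scores = [score for _, score in epitopes]
    protein_score = sum(scores) / len(scores) if scores else 0.0

    total = possible_peptides(protein_length, epitope_length)
    density = len(epitopes) / total if total else 0.0

    signal_density = 0.0
    if signal_range is not None:
        sig_start, sig_end = signal_range
        sig_total = possible_peptides(sig_end - sig_start + 1, epitope_length)
        in_signal = [
            start for start, _ in epitopes
            if start >= sig_start and start + epitope_length - 1 <= sig_end
        ]
        signal_density = len(in_signal) / sig_total if sig_total else 0.0
    return protein_score, density, signal_density


# Placeholder epitopes for a 100-residue protein with a signal peptide at 1-20.
score, density, signal_density = protein_metrics(
    protein_length=100,
    epitope_length=9,
    epitopes=[(3, 0.8), (10, 0.6), (50, 0.4)],
    signal_range=(1, 20),
)
print(score, density, signal_density)
```

In this example the protein offers 92 possible 9-mers, of which 3 are predicted epitopes, and 2 of the 12 possible 9-mers inside the signal peptide are epitopes.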

The epitope map output is a .png file in which each line of the heatmap corresponds to the amino acid sequence of one protein. A light grey background indicates the range of the protein, and darker bands indicate the presence of one or more epitopes. If epitopes overlap, their scores are added. The higher the epitope score at a position, the darker the grey band. The filename has the prefix hmp, followed by the organism and location, for example hmp_SARS_CoV_2_Membrane.png.
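The values behind one line of the map can be sketched without any plotting library. The sketch assumes a small constant background value for positions covered by the protein and adds the score of every epitope covering a position, so overlaps sum. The background value 0.1, the function name and the example numbers are assumptions, and rendering to a .png file is left out.

```python
def map_row(protein_length, epitope_length, epitopes, width, background=0.1):
    """Build the intensity values for one protein on the epitope map.

    epitopes: list of (1-based start position, binding score).
    width: number of columns in the map; positions beyond the protein stay 0.
    """
    row = [0.0] * width
    for pos in range(min(protein_length, width)):
        row[pos] = background  # light grey: inside the protein range
    for start, score in epitopes:
        end = min(start - 1 + epitope_length, protein_length, width)
        for pos in range(start - 1, end):
            row[pos] += score  # overlapping epitopes add up
    return row


row = map_row(
    protein_length=12,
    epitope_length=4,
    epitopes=[(2, 0.5), (4, 0.3)],
    width=15,
)
print([round(value, 2) for value in row])
```

Positions 4 and 5 are covered by both placeholder epitopes and therefore carry the highest value, which would correspond to the darkest band.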

Contact:

Cedric Mahncke cedric.mahncke@leibniz-liv.de
