TGEClassification is a collection of tools for automated classification of ORFs identified from PIT analysis.
The pipeline requires Python 3.6, pandas 0.19, NumPy, Biopython, a Unix shell, R (packages: seqinr, ada, nnet, randomForest, caret), BLAST, MSGF+ (included) and mzIdentML-lib 1.6 (included). It has been tested on a Red Hat Enterprise Linux (v. 6, 64-bit) HPC node.
Our pipeline requires assembled RNA-seq data and raw mass-spectrometry data converted to MGF format. The classification pipeline has been tested with RNA-seq data assembled using Trinity and PASA. We use MSGF+ for protein identification and mzIdentML-lib for post-processing, such as FDR calculation, thresholding and protein grouping. To use our pipeline you need the following files.
1. PASA-assembled transcript file
2. TransDecoder-predicted ORFs
3. Reference proteome in FASTA format from UniProt
4. Reference protein chromosome location TSV file [you can download this file from UniProtKB by selecting 'proteomes' from the 'Names and Taxonomy' section and downloading it as a TSV file]
5. Contaminant FASTA file. crap.fasta is distributed with this pipeline; you can use a different contaminant FASTA file instead.
6. Modification txt file. Follow the instructions on the MSGF+ website to prepare this file.
7. Mass-spectra file in MGF format
8. GFF3 file produced by PASA for visualization (optional)
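Before launching a long run it can help to confirm that every input is in place. The following is a minimal sketch of such a pre-flight check, not part of the pipeline itself. The labels in `REQUIRED` and `OPTIONAL` and the function name `check_inputs` are hypothetical names chosen for this example; they simply mirror the eight files listed above.

```python
import os

# Hypothetical labels for the inputs listed above (assumption: these names
# are illustrative only and are not used anywhere in the pipeline scripts).
REQUIRED = [
    "transcripts_fasta",         # PASA-assembled transcripts
    "orfs_fasta",                # TransDecoder-predicted ORFs
    "reference_proteome_fasta",  # UniProt reference proteome
    "reference_location_tsv",    # reference protein chromosome locations
    "contaminants_fasta",        # e.g. crap.fasta
    "modifications_txt",         # MSGF+ modification file
    "spectra_mgf",               # mass spectra in MGF format
]
OPTIONAL = ["pasa_gff3"]         # only needed for visualization


def check_inputs(paths):
    """Return (missing_required, missing_optional) for a dict of label -> path."""
    def is_missing(label):
        path = paths.get(label)
        return not path or not os.path.isfile(path)

    missing_required = [label for label in REQUIRED if is_missing(label)]
    missing_optional = [label for label in OPTIONAL if is_missing(label)]
    return missing_required, missing_optional
```

An empty first list means all mandatory inputs were found. A missing optional GFF3 file only means the visualization step cannot be run.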
Once the prerequisite tools have been installed and added to your PATH, download the scripts and run Launch.sh from the code directory. You can also run the scripts individually.
Launch.sh runs all the scripts of the pipeline in the correct order.
You need all the files mentioned in the Files section to run Launch.sh.
usage: Launch.sh
[-u proteomefasta]
[-p orffasta]
[-r transcriptfasta]
[-o outfolder]
[-s mgffile]
[-m modificationsFile]
[-t tolerance, default 10ppm]
[-i instrument, default 1. Possible values: 1. Orbitrap/FTICR, 2. TOF, 3. Q-Exactive]
[-f fragment method, default 1. Possible values: 1. CID, 2. ETD, 3. HCD]
[-d decoysearch 0|1, default 1]
[-c contaminantfile, default crap.fasta]
[-l minlengthpeptide, integer, default 8]
[-v TSV file containing reference protein location]
[-g TransDecoder-generated sample.genome.gff3 file]
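When Launch.sh is driven from another script, assembling the argument list programmatically avoids mistyped flags. The sketch below maps descriptive option names to the flags documented in the usage text and validates the three enumerated options. The option names (`proteome_fasta`, `instrument`, and so on) and the function `build_launch_command` are assumptions made for this example; only the single-letter flags and their allowed values come from the usage text.

```python
# Descriptive names (hypothetical, chosen for this sketch) mapped to the
# flags documented in the usage text.
FLAG_FOR = {
    "proteome_fasta": "-u",
    "orf_fasta": "-p",
    "transcript_fasta": "-r",
    "out_folder": "-o",
    "mgf_file": "-s",
    "modifications_file": "-m",
    "tolerance": "-t",
    "instrument": "-i",
    "fragment_method": "-f",
    "decoy_search": "-d",
    "contaminant_file": "-c",
    "min_peptide_length": "-l",
    "location_tsv": "-v",
    "gff3_file": "-g",
}

# Allowed values as listed in the usage text.
ALLOWED = {
    "instrument": {1, 2, 3},       # 1 Orbitrap/FTICR, 2 TOF, 3 Q-Exactive
    "fragment_method": {1, 2, 3},  # 1 CID, 2 ETD, 3 HCD
    "decoy_search": {0, 1},
}


def build_launch_command(options, script="Launch.sh"):
    """Turn a dict of option name -> value into an argument list for Launch.sh."""
    unknown = set(options) - set(FLAG_FOR)
    if unknown:
        raise ValueError("unknown option(s): " + ", ".join(sorted(unknown)))
    command = ["sh", script]
    # Emit flags in the order of the usage text, skipping options not supplied
    # so that Launch.sh falls back to its own defaults.
    for name, flag in FLAG_FOR.items():
        if name not in options:
            continue
        value = options[name]
        if name in ALLOWED and value not in ALLOWED[name]:
            raise ValueError(
                f"{name} must be one of {sorted(ALLOWED[name])}, got {value!r}"
            )
        command += [flag, str(value)]
    return command
```

The returned list can be passed to `subprocess.run(command, check=True)` from the code directory. Options that are left out are not passed at all, so the defaults stated in the usage text apply.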
If you decide to run the scripts individually, run them in the following order.
1. runProteinIdentificationAndPostProcessing_cluster.py with TransDecoder-predicted ORFs [PIT Search]
2. runProteinIdentificationAndPostProcessing_cluster.py with reference FASTA [Standard Search]
3. PIT-DBData_processing.py - processes the results from step 1.
4. standardSearchResultProcessing.py - processes the results from step 2.
5. runBlastBristol.sh - runs BLAST for the identified ORFs (pass the filtered FASTA file obtained from step 3).
6. contigStat.py - uses the BLAST output from step 5.
7. UniProteinLocation.py - uses the output from the previous step and the reference protein chromosome location TSV file (file no. 4 from the file list).
8. IdentifyProteinIsoformSAP.py - uses the output from the previous step and classifies ORFs based on their BLAST mapping.
9. PreliminaryProteinAnnotationForPITDBV2.py - uses the output files from steps 7 and 8 to create an intermediate annotation file.
10. SplitAnnotationFile.py - uses the annotation file generated by step 9 and creates a more detailed annotation file.
11. annotationMatrix.py - if a GFF3 file is available, pass the filtered protein CSV file generated by step 3 and the GFF3 file (file no. 8 from the file list).
12. peptideEvidence.py - uses the filtered PSM results from step 3 and the VCF output from step 8 to find peptide evidence for the identified polymorphisms.
13. peptideEvidenceIsoforms.py - uses the output from step 7, the detailed annotation file from step 10 and the filtered PSM CSV file from step 3 to find isoform-specific peptide evidence.
14. IsoformScoring.py - uses the filtered amino acid FASTA file, the protein and peptide/PSM outputs from MSGF+ for the PIT search (step 3) and the standard search (step 4), and the VCF files from steps 12 and 13. This tool predicts the probability of variants.
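Numbering the scripts 1 to 14 in the order listed, the dependencies described above can be written down as a small table, which makes it easy to see which later steps must be repeated when one step is rerun. This is a sketch based on a reading of the step descriptions, not something derived from the scripts themselves; `DEPENDS_ON` and `steps_to_rerun` are hypothetical names used only here.

```python
# For each step, the earlier steps whose output the description says it uses.
# Assumption: "previous step" is read as the step listed immediately before.
DEPENDS_ON = {
    1: [],
    2: [],
    3: [1],
    4: [2],
    5: [3],
    6: [5],
    7: [6],
    8: [7],
    9: [7, 8],
    10: [9],
    11: [3],
    12: [3, 8],
    13: [3, 7, 10],
    14: [3, 4, 12, 13],
}


def steps_to_rerun(changed_step):
    """Return, sorted, every step that depends directly or indirectly on changed_step."""
    affected = set()
    frontier = [changed_step]
    while frontier:
        current = frontier.pop()
        for step, deps in DEPENDS_ON.items():
            if current in deps and step not in affected:
                affected.add(step)
                frontier.append(step)
    return sorted(affected)
```

For example, `steps_to_rerun(8)` lists the steps to repeat after rerunning IdentifyProteinIsoformSAP.py. Because every dependency has a lower number than the step that uses it, running the scripts in the listed order always satisfies the table.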
You can run the test data to check that all the components are working. All required files are available in the data folder except the MS file, the transcript FASTA file, the reference FASTA file and the GFF3 file.
sh Launch.sh -u uniprot-proteomeUP000005640_2016.fasta -p human_adeno.assemblies.fasta.transdecoder.pep -r human_adeno.assemblies.fasta -o ./ -s DM_from_raw.mgf -m modifications.txt -t 20ppm -v HumanReferenceLocation.tsv -g human_adeno.assemblies.fasta.transdecoder.genome.gff3
Here is an example of downloading the human proteome list with chromosome information from the UniProt website.