Classifying RNAs by Ensemble Machine learning Algorithms
Classifying transcripts as lncRNAs is difficult because there is no consensus on what a lncRNA truly is. Instead of using thresholds, or rules, for identifying lncRNAs, this tool uses an ensemble stacking method of 8 different gradient boostling models to predict lncRNAs. Trained only on true, validated lncRNAs, this method has been tested on plant transcriptomes with a high success rate.
See our publication at: Prediction of plant lncRNA by ensemble machine learning classifiers
If you use CREMA, please cite: Simopoulos CMA, Weretilnyk EA, Golding GB. "Prediction of plant lncRNA by ensemble machine learning classifiers." BMC Genomics (2018) doi:110.1186/s12864-018-4665-2
To use this tool, simply clone this repository on your machine by:
git clone https://github.com/gbgolding/crema.git
To use this tool you will need:
- CPAT v1.2.1
- Note: CREMA is not compatible with later versions of CPAT
Before you can run tool, you'll need to remove all rRNAs and tRNAs from your input data.
Then, you will need to run cpat.py. An example:
cpat.py -g your_transcript_fasta_file.fa -o cpat_output.txt -x ./cpat_models/ath_hexamer -d ./cpat_models/ath_logit.RData
Firstly you must create the DIAMOND database from the SwissProt protein database:
diamond makedb --in uniprot_sprot.fasta -d swissprot.dmnd
diamond blastx -d swissprot.dmnd -q your_transcript_fasta_file.fa -o diamond_output.txt \\ -e 0.001 -k 5 --matrix BLOSUM62 --gapopen 11 --gapextend 1 --more-sensitive \\ -f 6 qseqid pident length qframe qstart qend sstart send evalue bitscore
Once you have identified your transcript features using CPAT and DIAMOND, you can run the tool!
python3 bin/predict.py -f your_transcript_fasta_file.fa -c cpat_output.txt -d diamond_output.txt
Note: if script cannot find
logit_models.RData, please run
predict.py using its full file path. This is a known issue that we are working on solving.
All output files are written to your working directory. Custom output directories to come...
The most helpful output file is
The CSV has outputs of both the features used in prediction as well as the lncRNA prediction score and final decision.
The columns describe:
- gene name
- length of transcript
- ORF length
- Fickett score (for more info see the CPAT paper)
- Hexamer score (for more info see the CPAT paper)
- % identity to a hit in the SwissProt database
- Alignment length of hit in SwissProt database
- Ratio of alignment length to transcript lenth
- Ratio of alignment length to ORF length
- Score of lncRNA prediction (you can use this to rank your predictions)
- Final decision of prediction: 1 == lncRNA
The other files may be less useful to you, depending on what you're looking at.
all_model_predictions.csv: how each base model predicted the transcript (1 == lncRNA).
all_model_scores.csv: the lncRNA prediction scores of each transcript for each base model.
ensemble_logreg_pred.csv: the raw output of the final logistic regression stacking classifier.
Required arguments: -f input fasta file -c output file from CPAT run -d output file from Diamond blastx Optional arguments: -s minimum lncRNA prediction score (Default: 0.5)