Generates read count and annotates small and long non-coding RNAs mapping to mitochondrial genome

asan-nasa/mtR_find

mtR_find

Purpose:

mtR_find is a tool for identifying and annotating sequences that map to mitochondrial genomes. It runs on adapter-trimmed FASTQ files and requires the following dependencies: pandas (version 0.21.0 or above), multiprocessing, bowtie (version 1.1.2 or above) and samtools (version 1.9 or above). To output basic plots, the matplotlib Python module (version 2.0.2 or above) is also needed; it is optional otherwise.
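Before a first run, it can help to confirm that the dependencies listed above are visible to Python and to the shell. The following is a minimal sketch and not part of mtR_find: the function name `missing_dependencies` is hypothetical, and the check only tests for presence, not for the minimum versions stated above.

```python
import importlib.util
import shutil

# Names taken from the dependency list above; matplotlib is optional.
REQUIRED_MODULES = ("pandas", "multiprocessing")
OPTIONAL_MODULES = ("matplotlib",)
REQUIRED_EXECUTABLES = ("bowtie", "samtools")


def missing_dependencies(modules=REQUIRED_MODULES, executables=REQUIRED_EXECUTABLES):
    """Return (missing_modules, missing_executables) as two lists of names.

    A module counts as missing if Python cannot locate it; an executable
    counts as missing if it is not found on PATH.
    """
    missing_modules = [m for m in modules if importlib.util.find_spec(m) is None]
    missing_executables = [e for e in executables if shutil.which(e) is None]
    return missing_modules, missing_executables


if __name__ == "__main__":
    mods, exes = missing_dependencies()
    optional, _ = missing_dependencies(OPTIONAL_MODULES, ())
    print("Missing Python modules:", mods or "none")
    print("Missing executables on PATH:", exes or "none")
    print("Missing optional modules:", optional or "none")
```

An empty result for both lists means the required modules can be imported and that bowtie and samtools are on PATH.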

Usage:

mtR_find.py <species_name> <RNA_type> [--FASTA path/to/mitochondrial genome.fa file] [-GTF path/to/gtf file] [--graphical_output yes/no] [--output_path path/to/folder] [--input_path path/to/folder] [--files list of files] [--parallel YES/NO]

Required arguments: the species name and the RNA type. All other arguments listed above are optional.

Valid species name values:
(1) dre - for zebrafish
(2) hsa - for humans
(3) mmu - for mouse
(4) xen - for Xenopus
(5) gal - for chicken
(6) rno - for rat
(7) non_model - for species not listed above


Valid RNA type:
(1) sRNA - for mitochondrial sRNA
(2) lncRNA - for mitochondrial long non-coding RNA

Optional Arguments:

--parallel: default value is "NO". If users want to suspend multiprocessing, they have to specify "YES".
--FASTA and -GTF: by default, if users specify one of the species codes listed above, the script downloads the FASTA and GTF files automatically. To analyze mtsRNAs/mtlncRNAs in any other species, users have to download the mitochondrial genome FASTA file manually and pass its path.
--input_path: default value is the current working directory. Users can specify an input path.
--output_path: default value is the current working directory. Users can specify an output path.
--cutoff: default value is 200. The cutoff is a threshold on the total count of each individual ncRNA, summed over all libraries. For example, a cutoff value of 200 discards ncRNA sequences whose total count (from all libraries) is less than 200.
--filter: default value is 50. A length filter that applies only to mt-lncRNAs. For example, users who want to study only lncRNAs with length greater than 200 can specify "--filter 200" on the command line.

--files: default value is "None". If the files are in different locations, the absolute path of each file can be specified. Note: --input_path and --files cannot be specified together. If the --files argument is specified, files in the current working directory will not be analyzed. The output directory will still be the current working directory unless a different output path is specified using the --output_path argument.
--graphical_output: default value is "no". To generate graphical output of the basic plots, the user has to specify "yes". If "yes" is specified, it is also mandatory to specify a metadata file. This metadata file should contain the filenames in the first column and the condition/factor in the second column.
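The effect of --cutoff described above can be illustrated with a small sketch. This is not mtR_find's own code: the function name `apply_cutoff`, the dictionary layout, and the sequences and counts are all hypothetical and only mirror the rule that sequences with a total count below the cutoff are discarded.

```python
def apply_cutoff(counts, cutoff=200):
    """Keep sequences whose count summed over all libraries reaches `cutoff`.

    `counts` maps each sequence to a dict of per-library read counts.
    Sequences with a total count below `cutoff` are discarded.
    """
    return {
        seq: per_library
        for seq, per_library in counts.items()
        if sum(per_library.values()) >= cutoff
    }


# Hypothetical read counts for two sequences across two libraries.
example_counts = {
    "ACGTACGTACGTACGTACGT": {"lib1": 150, "lib2": 60},  # total 210, kept
    "TTGACCATTGACCATTGACC": {"lib1": 100, "lib2": 99},  # total 199, discarded
}

kept = apply_cutoff(example_counts, cutoff=200)
print(sorted(kept))
```

Note that the threshold applies to the combined total, so a sequence that is rare in every single library can still pass if its counts add up across libraries.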

Usage examples:

To analyze mtsRNAs in human when all FASTQ files are in the current directory
python mtR_find.py hsa sRNA

To analyze mtsRNAs in zebrafish when all FASTQ files are in the current directory
python mtR_find.py dre sRNA

To analyze mtsRNAs in zebrafish and specify the filenames explicitly
python mtR_find.py dre sRNA --files filename1.fastq filename2.fastq

To analyze mtsRNAs in zebrafish and specify the path to the folder containing the FASTQ files
python mtR_find.py dre sRNA --input_path path/to/folder

To analyze mtsRNAs in zebrafish and specify no graphical output
python mtR_find.py dre sRNA --graphical_output no

To analyze mtsRNAs in zebrafish and suspend multiprocessing
python mtR_find.py dre sRNA --parallel YES

To analyze mt-lncRNAs in zebrafish when all FASTQ files are in the current directory
python mtR_find.py dre lncRNA
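When mtR_find is called from another Python script, the command line can be assembled and validated before it is run. The sketch below is an illustration under stated assumptions, not part of mtR_find: the helper name `build_command` is hypothetical, it assumes mtR_find.py is in the current directory, and it uses only the species codes, RNA types and options documented above.

```python
VALID_SPECIES = {"dre", "hsa", "mmu", "xen", "gal", "rno", "non_model"}
VALID_RNA_TYPES = {"sRNA", "lncRNA"}


def build_command(species, rna_type, input_path=None, files=None, output_path=None):
    """Build the argument list for one mtR_find.py run (hypothetical helper)."""
    if species not in VALID_SPECIES:
        raise ValueError(f"unknown species code: {species!r}")
    if rna_type not in VALID_RNA_TYPES:
        raise ValueError(f"unknown RNA type: {rna_type!r}")
    # The README states that --input_path and --files cannot be combined.
    if input_path is not None and files:
        raise ValueError("--input_path and --files cannot be specified together")

    command = ["python", "mtR_find.py", species, rna_type]
    if input_path is not None:
        command += ["--input_path", input_path]
    if files:
        command += ["--files", *files]
    if output_path is not None:
        command += ["--output_path", output_path]
    return command


print(build_command("dre", "sRNA", files=["filename1.fastq", "filename2.fastq"]))
```

The returned list can be passed to `subprocess.run(command, check=True)`. Building the command as a list rather than a single string avoids shell quoting problems with paths that contain spaces.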

Note: bowtie and samtools should be added to PATH. More information can be found here:

https://phoenixnap.com/kb/linux-add-to-path

https://unix.stackexchange.com/questions/26047/how-to-correctly-add-a-path-to-path

https://opensource.com/article/17/6/set-path-linux

Test datasets

Adapter trimmed FASTQ files:

Adapter trimming was performed on the test datasets using Cutadapt version 1.5. If users trim raw files downloaded from SRA with a different adapter trimming tool or a different version of Cutadapt, the input data could vary, which would affect the reproducibility of the results. To ensure reproducibility, we have uploaded the adapter-trimmed FASTQ files for the test datasets. Users who want to try mtR_find on test data before running it on their own data can download these files from the links below.

Dataset-1:

https://filesender.sikt.no/?s=download&token=dcb87ce5-7224-4900-bd3b-5ec1a4295d09

Dataset-2:

https://filesender.sikt.no/?s=download&token=21680929-cd29-47d7-839c-91c5cf992052

Dataset-3:

https://filesender.sikt.no/?s=download&token=7547e686-11a8-4c62-a19b-67c888a6b904

Results:

The results of running mtR_find on each test dataset, which include the mtR_find output and the nohup command-line output, can be found in the test folder, under the sub-folders dataset-1, dataset-2 and dataset-3 respectively.
