a-velt/Mfuzz_RNAseq

Mfuzz_RNAseq.R

An R script that clusters gene expression time-series RNA-seq data with Mfuzz.

Required R libraries: optparse, tools, Mfuzz, GenomicFeatures, DESeq, edgeR

Mfuzz webpage: http://mfuzz.sysbiolab.eu/

Mfuzz paper: http://w3.ualg.pt/%7Emfutschik/publications/bioinformation.pdf

Mfuzz_RNAseq.R takes as input a set of RNA-seq count tables, one per sample, for example as produced by HTSeq-count. All the count tables must be placed in a single folder, which is passed to the script as input.

For example, a folder containing four count files: Sample1.txt, Sample2.txt, Sample3.txt, Sample4.txt

Sample1.txt contains the following data, without a header:

GeneID1 S1Count1
GeneID2 S1Count2
GeneID3 S1Count3
GeneID4 S1Count4

Mfuzz_RNAseq.R reads all the files and generates:

GeneID Sample1 Sample2 Sample3 Sample4
GeneID1 S1Count1 S2Count1 S3Count1 S4Count1
GeneID2 S1Count2 S2Count2 S3Count2 S4Count2
GeneID3 S1Count3 S2Count3 S3Count3 S4Count3
GeneID4 S1Count4 S2Count4 S3Count4 S4Count4
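The merge step can be illustrated with a small Python sketch. This is not the script's own R code; it assumes tab-separated files with a `.txt` extension, takes the sample name from the file name, and fills genes missing from a file with 0. The function name `merge_count_tables` is hypothetical.

```python
import csv
from pathlib import Path


def merge_count_tables(folder):
    """Merge per-sample count files into one table.

    Returns (samples, merged) where merged maps gene ID -> list of counts,
    one entry per sample, in the sorted order of the file names.
    Assumptions (not confirmed by the README): tab-separated columns,
    '.txt' extension, missing genes filled with 0.
    """
    files = sorted(Path(folder).glob("*.txt"))
    samples = [path.stem for path in files]
    merged = {}
    for column, path in enumerate(files):
        with open(path, newline="") as handle:
            for row in csv.reader(handle, delimiter="\t"):
                if len(row) < 2:
                    continue  # skip blank or malformed lines
                gene, count = row[0], int(row[1])
                merged.setdefault(gene, [0] * len(files))[column] = count
    return samples, merged
```

Calling `merge_count_tables("count_files_folder")` would return the sample names and a gene-by-sample table with the same layout as the example above.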

From this table, Mfuzz_RNAseq.R performs a complete RNA-seq data normalization and then uses the Mfuzz package to perform soft clustering of the gene expression time-series data.

Normalization steps: from the input count tables, Mfuzz_RNAseq.R performs a library size normalization with the DESeq method and then adjusts the normalized data for gene length (normalized data / gene length). These normalization steps make all the samples comparable, which the Mfuzz package requires.
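The two normalization steps can be sketched in Python as follows. This is an illustration of the general technique, not the script's R code: it assumes the DESeq library size normalization is the median-of-ratios size factor estimate, and the function names `size_factors` and `normalize` are hypothetical. The unit of the gene length (bases or kilobases) is left to the caller.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors, one per sample.

    counts maps gene ID -> list of raw counts (one per sample).
    Genes with a zero count in any sample are skipped, because their
    geometric mean across samples is zero.
    """
    n_samples = len(next(iter(counts.values())))
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts.values():
        if min(row) <= 0:
            continue
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    return [math.exp(median(r)) for r in log_ratios]


def normalize(counts, gene_lengths):
    """Divide counts by the sample size factor, then by the gene length."""
    factors = size_factors(counts)
    return {
        gene: [c / f / gene_lengths[gene] for c, f in zip(row, factors)]
        for gene, row in counts.items()
        if gene in gene_lengths  # genes without a known length are dropped
    }
```

In this sketch, a sample sequenced twice as deeply as another gets a size factor twice as large, so the same gene ends up with the same normalized value in both samples.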

Soft clustering steps: with the final normalized data (called RPKN data), Mfuzz_RNAseq.R performs a gene clustering analysis with the Mfuzz package, generating clusters and their associated gene lists.
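Soft clustering means that each gene receives a membership value for every cluster rather than a single label. Mfuzz is based on fuzzy c-means, and the following Python sketch shows how a membership vector could be computed for one gene once cluster centroids are known. It is an illustration only: the function names, the fuzzifier default `m=2.0` and the centroids are assumptions, and the actual clustering is done by the Mfuzz R package.

```python
import math


def standardise(profile):
    """Scale one gene's time profile to mean 0 and standard deviation 1.

    Assumes the profile is not constant (a flat profile has sd 0).
    """
    n = len(profile)
    mean = sum(profile) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in profile) / (n - 1))
    return [(x - mean) / sd for x in profile]


def memberships(profile, centroids, m=2.0):
    """Fuzzy c-means membership of one profile in each cluster.

    Uses u_i = d_i^(-2/(m-1)) / sum_k d_k^(-2/(m-1)), where d_i is the
    Euclidean distance to centroid i. The values sum to 1.
    """
    dists = [math.dist(profile, c) for c in centroids]
    if 0.0 in dists:  # profile sits exactly on one or more centroids
        hits = dists.count(0.0)
        return [1.0 / hits if d == 0.0 else 0.0 for d in dists]
    exponent = 2.0 / (m - 1.0)
    weights = [d ** -exponent for d in dists]
    total = sum(weights)
    return [w / total for w in weights]
```

A gene list per cluster can then be obtained by keeping, for each cluster, the genes whose membership exceeds a chosen cutoff.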

This script has three main inputs:

  • the argument "--folder" or "-f" is the directory containing all the RNA-seq count tables (and only these files). Mfuzz_RNAseq.R reads and merges all these tables and then performs the normalization steps.
  • the argument "--annotation" or "-a" is the path to a gene/transcript annotation file (GFF or GTF format), used to calculate gene lengths (the sum of the exon lengths, with overlapping exons taken into account). These lengths are used when normalizing the data by gene length.
  • the argument "--time" or "-t" gives the time value of each file, in the same order as the files in the folder. It is a comma-separated list such as 'time1,time1,time1,time2,time2,time2,time3'. If several files correspond to the same time point (replicates), give them the same time value; the script then averages the normalized counts of all samples from the same time point before performing the soft clustering.
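Two of the computations described in this list can be sketched in Python: the gene length as the sum of exon lengths with overlaps counted once, and the averaging of replicates that share a time value. This is a minimal illustration, not the script's R code; the function names are hypothetical, and exon coordinates are assumed to be 1-based and inclusive, as in GFF/GTF files.

```python
def union_exon_length(exons):
    """Number of bases covered by a gene's exons, counting overlaps once.

    exons is a list of (start, end) pairs, 1-based and inclusive.
    """
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(exons):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start + 1  # close the previous block
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)  # overlapping exon: extend the block
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


def mean_by_time(values, times):
    """Average one gene's normalized values over samples with the same time.

    values holds one value per sample; times holds the matching labels,
    for example "0,0,0,6,6,6,12".split(",").
    """
    groups = {}
    for value, time in zip(values, times):
        groups.setdefault(time, []).append(value)
    return {time: sum(vs) / len(vs) for time, vs in groups.items()}
```

For instance, exons (1, 100) and (50, 150) overlap and cover 150 bases together, not 201.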

For a description of the optional arguments, run: /usr/bin/Rscript Mfuzz_RNAseq.R -h

Minimal command: /usr/bin/Rscript Mfuzz_RNAseq.R -f count_files_folder -a annotation -t time

Complete command: /usr/bin/Rscript Mfuzz_RNAseq.R -f count_files_folder -a annotation -b gene_name_attribute -t time -n nb_clusters -m membership_cutoff -s min_std -e exclude_thres -r replacement_mode -o output_directory
