csv file:
In=read.csv("FileName", stringsAsFactors=F, row.names=1, header=T)
Data=data.matrix(In)
txt file:
In=read.table("FileName", stringsAsFactors=F, row.names=1, header=T)
Data=data.matrix(In)
check str(Data) and make sure it is a matrix instead of data frame. You may need to play around with the row.names and header option depends on how the input file was generated.
You may on an earlier version of EBSeq. The GetDEResults function was introduced since version 1.7.1. The latest release version could be found at: http://www.bioconductor.org/packages/devel/bioc/html/EBSeq.html And you may check your package version by typing packageVersion("EBSeq")
To generate a heatmap, you may consider the heatmap.2 function in gplots package. For example, you may run
heatmap.2(NormalizedMatrix[GenesOfInterest,], scale="row", trace="none", Colv=F)
The normalized matrix may be obtained from GetNormalizedMat() function.
The NoTest status comes from two sources
-
Using the default parameter settings of EBMultiTest(), the function will not test on genes with more than 75% values < 10 to ensure better model fitting. To disable this filter, you may set Qtrm=1 and QtrmCut=0.
-
numerical over/underflow in R. That happens when the within condition variance is extremely large or small. I did implemented a numerical approximation step to calculate the approximated PP for these genes with over/underflow. Here I use 10^-10 to approximate the parameter p in the NB distribution for these genes (I set it to a small value since I want to cover more over/underflow genes with low within-condition variation). You may try to tune this value (to a larger value) in the approximation by setting ApproxVal in EBTest() or EBMultiTest() function.
Yes you may modify the script rsem-for-ebseq-find-DE under RSEM/EBSeq change line 36
EBOut <- EBTest(Data = DataMat, NgVector = ngvector, Conditions =
conditions, sizeFactors = Sizes, maxround = 5)
to
EBOut <- EBTest(Data = DataMat, NgVector = ngvector, Conditions =
conditions, sizeFactors = Sizes, maxround = 10)
If you are running multiple condition analysis, you will need to change line 53:
MultiOut <- EBMultiTest(Data = DataMat, NgVector = ngvector,
Conditions = conditions, AllParti = patterns, sizeFactors = Sizes,
maxround = 5)
MultiOut <- EBMultiTest(Data = DataMat, NgVector = ngvector,
Conditions = conditions, AllParti = patterns, sizeFactors = Sizes,
maxround = 10)
You will need to redo make after you make the changes.
EBSeq calls a gene as DE (assign high PPDE) if the across-condition variability is significantly larger than the within-condition variability. In the cases that a gene has large within-condition variation, although the FC across two conditions is large (small), the across-condition difference could still be explained by biological variation within condition. In these cases the gene/isoform will have a moderate PPDE.
In general, it is not appropriate to perform cross sample comparisons using TPM, FPKM or RPKM without further normalization. Instead, you may use normalized counts (It can be generated by GetNormalizedMat() function from raw count, note EBSeq testing functions takes raw counts and library size factors)
Here is an example:
Suppose there are 2 samples S1 and S2 from different conditions. Each has 5 genes. For simplicity, we assume each of 5 genes contains only one isoform and all genes have the same length.
Assume only gene 5 is DE and the gene expressions of these 5 genes are:
Sample | g1 | g2 | g3 | g4 | g5 |
---|---|---|---|---|---|
S1 | 10 | 10 | 10 | 10 | 10 |
S2 | 20 | 20 | 20 | 20 | 100 |
Then the TPM/FPKM/RPKM will be (note sum TPM/FPKM/RPKM of all genes should be 10^6 ):
Sample | g1 | g2 | g3 | g4 | g5 |
---|---|---|---|---|---|
S1 | 2x10^5 | 2x10^5 | 2x10^5 | 2x10^5 | 2x10^5 |
S2 | 1.1x10^5 | 1.1x10^5 | 1.1x10^5 | 1.1x10^5 | 5.6x10^5 |
Based on TPM/FPKM/RPKM, an investigator may conclude that the first 4 genes are down-regulated and the 5th gene is up-regulated. Then we will get 4 false positive calls.
Cross-sample TPM/FPKM/RPKM comparisons will be feasible only when no hypothetical DE genes present across samples (Or when assuming the DE genes are sort of 'symmetric' regarding up and down regulation).
The posterior fold change estimations will give less extreme values for low expressers. e.g. if gene1 has mean1 = 5000 and mean2 = 1000, its FC and PostFC will both be 5. If gene2 has mean1 = 5 and mean2 = 1, its FC will be 5 but its PostFC will be < 5 and closer to 1. Therefore when we sort the PostFC, gene2 will be less significant than gene1.