Skip to content

YulongNiu/MPIPZ_microbe-host_homeostasis

Repository files navigation

Guidance of RNA-Seq and 16S amplicon sequencing analyses for the "microbe-host homeostasis" project

1. RNA-Seq analysis

1.1 Raw data

Raw RNA-Seq data-sets can be retrieved from GSE157128. Arabidopsis thaliana plants were germinated with the synthetic communities (SynComs) in the presence or absence of 1 µM flg22 and incubated for 14 days before harvesting. For pWER::FLS2-GFP plants, single-end RNA-Seq experiments were conducted for 4 conditions (3 biological replicates for each condition). For Col-0 plants, pair-end RNA-Seq experiments were conducted for 10 conditions (4 biological replicates for each condition).

1.2 Pre-processing

Raw Illumina RNA-Seq reads were pre-processed using fastp (v0.19.10) [1] with default settings for paired-end (Col-0 experiment) or single-end reads (pWER::FLS2-GFP experiment). For single-end reads, low quality sequences from the head (8 bases) and tail (2 bases) were trimmed.

Scripts:

script_trim.sh.

1.3 Pseudo-alignment

High quality reads were pseudo-aligned to TAIR 10 Arabidopsis thaliana transcriptome reference (Ensembl) [2] using kallisto (v0.46.1) [3].

Scripts:

Results:

1.4 DEGs analysis

After removal of low abundant transcripts that were absent in at least two replicates under each condition, count data were imported using the tximport [4] package.

Differential expression analyses were performed using the DESeq2 [5] package. Firstly, raw counts were normalized with respect to the library size (rlog function) and log2 transformed. We tested for sample batch effects by surrogate variable (SV) analysis using the sva [6] package. Significant SVs were automatically detected and integrated into the model for differential analyses. Principal component analysis (prcomp function) based on whole transcripts were performed and plotted to visualize the cluster and variance of biological replicates under each condition. Abundance of Arabidopsis latent virus-1 reads did not correlate with sample variances and therefore removed from downstream analyses. Pair-wise comparisons were designed as: (1) flg22 treatment effect only, (2) non- suppressive and suppressive SynCom effect only, (3) flg22 treatment plus SynCom effects, (4) living vs. heat-killed bacteria. Transcripts with fold-changes > 1.5 and adjusted p-value for multiple comparisons (Benjamini–Hochberg method) equal to or below 0.05 were considered significant.

The log2 scaled counts were normalized by the identified SVs using the limma [7] package (removeBatchEffect function), and transformed as median-centered z-score by transcripts (scaled counts, scale function).

scripts:

Results:

1.5 Clustering

The z-scores were used to conduct k-means clustering for all transcripts. The cluster number (k = 10) was determined by sum of squared error and Akaike information criterion. Next, confirmed transcripts with similar expression patterns were grouped in the same cluster. Differentially expressed transcripts (3,718 in pWER::FLS2-GFP and 4,450 in Col-0 experiments) and cluster results were visualized using heatmaps generated by ComplexHeatmap [8] package.

Scripts:

Results:

1.6 GO enrichment

Gene ontology (GO) enrichment for each cluster using the whole Arabidopsis transcriptome as background were performed with the goseq [9] package with the consideration of transcripts length. GO annotations were retrieved from the Gene Ontology Consortium [10-11] (September 2019). Significantly changed biological process GO terms (adjusted p-value < 0.05) were visualized in dot plots using the clusterProfiler [12] package.

Scripts:

Results:

2. 16S amplicon sequencing data

2.1 General description

Arabidopsis thaliana WT Col-0 and pWER::FLS2-GFP plants were germinated with the 5-member or 10-member SynCom in the presence or absence of 1 µM flg22 and incubated for 14 days before harvesting. Plants were inoculated with SynCom NS1 and S1 for experiment 1 (corresponding to Fig 2a,d and Extended Data Fig. 5a,d); SynCom NS3 and S3 for experiment 2 (corresponding to Fig 2b,e and Extended Data Fig. 5b,e); and SynCom NS4 and S2 for experiment 3 (corresponding to Fig 2c,f and Extended Data Fig. 5c,f). Detailed compositon of the SynCom can be found in supplementary Table S1.

2.2 Pre-processing

For each sequencing data set, you will find four different files named as 1) library_ID_forward_reads.fastq.gz ; 2) library_ID_reverse_reads.fastq.gz ; 3) library_ID_forward_barcodes.fastq.gz ; and 4)
library_ID_reverse_barcodes.fastq.gz. e.g. pwer001_forward_reads.fastq.gz.

You will also find a file named as library_ID_ref_seqs.fasta, which contains the reference 16S sequences (V5V7) of the individual strain used for each SynCom; and a mapping file named as library_ID_mapping.txt, which contains information of the barcode used for each sample. You can also find a fasta file containing the V5V7 region sequences of ATSPHERE in the script folder.

Join forward and reverse barcode files by running the following command on the terminal. Change the name of the files involved accordingly. e.g.

$ python /YOURPATH/merge_fastaq_pipe.py

pwer_001_forward_barcodes.fastq.gz

pwer_001_reverse_barcodes.fastq.gz --out

pwer_001_barcodes.fastq.gz

Scripts:

After merging the barcode files, prepare your own config file named as library_ID_config.sh. e.g. pwer_001_config.sh

Run the following scripts in the terminal. Change the name of the file accordingly.

$ /YOURPATH/syncom.sh pwer_001_config.sh

Scripts:

By running the above scripts, forward and reverse sequencing reads will be denoised and demultiplexed separately according to the barcode sequence using QIIME [13]. After quality-filtering, merging of paired-end reads, amplicon tags were then aligned to a reference set of sequences obtained from the whole-genome assemblies of every strain included in each experiment by using USEARCH (uparse_ref command) [14]

2.3 alpha and beta diversities

A count feature table for each strain will be generated by using only perfect matches to the reference sequence from the genome collection.

You can also skip the pre-processing step and use the provided otu tables directly.

This count table will be normalized by the total number of reads per sample and then used to calculate for the alpha and beta diversities.

The Simpson index was obtained using the diversity function in vegan package. Bray-Curtis dissimilarity index was calculated using the vegdist function in the vegan package [15] and used for unconstrained ordination by Principal Coordinate Analysis (PCoA). All data were used except for biological replicate c of experiment 1 due to potential contamination issue or PCR error. Constrained PCoA (CPCoA) was performed with the vegan capscale function on the Bray-Curtis dissimilarity matrices constraining by the interaction between flg22 treatment and SynCom variables and conditioning by technical parameters. Statistical significance of separation between community profiles according to flg22 treatment was determined using PERMANOVA with 999 permutations (anova.cca function in vegan). Finally, all amplicon data was visualized using the ggplot2 [16] R package.

The provided scripts can be used to reproduce the figure presented for pwer_001 dataset. Change the file name, pathway and specific parameters including individual strain ID accordingly to generate the figures for pwer_002 and pwer_003.

Scripts:

2.4 Relative abundance (RA)

The provided scripts can be used to reproduce the boxplots presented for pwer_001 dataset. Change the file name, pathway and specific parameters including individual strain ID accordingly to generate the figures for pwer_002 and pwer_003.

Scripts:

References

  1. Chen, S., Zhou, Y., Chen, Y. & Gu, J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, i884-i890, doi:10.1093/bioinformatics/bty560 (2018).

  2. Yates, A. D. et al. Ensembl 2020. Nucleic Acids Res 48, D682-D688, doi:10.1093/nar/gkz966 (2020).

  3. Bray, N. L., Pimentel, H., Melsted, P. & Pachter, L. Near-optimal probabilistic RNA-seq quantification. Nat Biotechnol 34, 525-527, doi:10.1038/nbt.3519 (2016).

  4. Soneson, C., Love, M. I. & Robinson, M. D. Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences. F1000Res 4, 1521, doi:10.12688/f1000research.7563.2 (2015).

  5. Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol 15, 550, doi:10.1186/s13059-014-0550-8 (2014).

  6. Leek, J. T., Johnson, W. E., Parker, H. S., Jaffe, A. E. & Storey, J. D. The sva package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics 28, 882-883, doi:10.1093/bioinformatics/bts034 (2012).

  7. Ritchie, M. E. et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res 43, e47, doi:10.1093/nar/gkv007 (2015).

  8. Gu, Z., Eils, R. & Schlesner, M. Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics 32, 2847-2849, doi:10.1093/bioinformatics/btw313 (2016).

  9. Young, M. D., Wakefield, M. J., Smyth, G. K. & Oshlack, A. Gene ontology analysis for RNA- seq: accounting for selection bias. Genome Biol 11, R14, doi:10.1186/gb-2010-11-2-r14 (2010).

  10. Ashburner, M. et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25, 25-29, doi:10.1038/75556 (2000).

  11. The Gene Ontology, C. The Gene Ontology Resource: 20 years and still GOing strong. Nucleic Acids Res 47, D330-D338, doi:10.1093/nar/gky1055 (2019).

  12. Yu, G., Wang, L. G., Han, Y. & He, Q. Y. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS 16, 284-287, doi:10.1089/omi.2011.0118 (2012).

  13. Caporaso, J. G. et al. QIIME allows analysis of high-throughput community sequencing data. Nat Methods 7, 335-336, doi:10.1038/nmeth.f.303 (2010).

  14. Edgar, R. C. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26, 2460-2461, doi:10.1093/bioinformatics/btq461 (2010).

  15. Oksanen, J. et al. vegan: Community Ecology Package. R package version 2.4-3. (2016).

  16. Wickham, H., Sievert, C. & Springer International, P. ggplot2 : elegant graphics for data analysis. (2016).

About

Scripts and results for RNA-Seq and 16S data analysis.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published