This is a collection of miscellaneous bioinformatic tools and scripts that I
have created and used during my PhD studies. They are written in R, Python and
bash (but mostly R) and are meant to be called from the command line. They are
sorted according to their general theme and use. You can get help by calling
any script from the command line with the -h
(or --help
) flag. The topics
covered by mBIT are as follows:
- Variant analysis using seqCAT
- Fetching gene/transcript information from biomaRt
- Gene/transcript expression analyses, including differential expression
- Unsupervised learning and clustering analyses
- Enrichment analysis, ANOVA and a wrapper for eSNP-Karyotyping
These are scripts using the seqCAT Bioconductor R-package for variant analysis of high throughput sequencing (HTS) data. They are simple wrappers for a common workflow for seqCAT, to facilitate re-use and aggregate analysis of many samples at once.
# Create SNV profiles from VCFs in a directory
create_profiles.R <input VCF directory> <output profile directory>
# Compare genetic similarities for all created profiles
compare_profiles.R <input profile directory> similarities.txt
# Create a similarity heatmap from comparisons
similarity_heatmap.R similarities.txt similarities.png --cluster
This is a simple script for fetching information regarding gene and transcript IDs, symbols and biotypes from Ensembl using the biomaRt package. These information files are used in some of the other scripts for filtering out non-coding genes or mapping transcripts to genes.
# Get biomaRt information
get_biomart_info.R GRCh38 biomart/biomart.grch38.txt
These are scripts related to differential expression analyses (DEA) of RNA-seq
data. The main one, de_analysis.R
calculates which genes are differentially
expressed between two samples, using either DESeq2, edgeR or
Limma or a combination of overlapping DEGs between them. It can handle
inputs as either raw counts or gene expression measurements (TPM) from both
Kallisto and Salmon, using TXimport. The
other scripts can analyse the output from de_analysis.R
in various ways.
# Calculate differential expression
# This script has many options and parameters: use `-h` to see them all
de_analysis.R counts.txt sample1,sample2 biomaRt_info.txt degs.txt edgeR
# Create a volcano plot of DEGs
volcano.R degs.txt volcano.png
# Plot the p-value distribution of DEGs
p_distribution.R degs.txt p_distribution.png
The pathway_analysis.R
script can analyse a DEG list and one or more
specified KEGG pathways, finding which genes are differentially expressed
in that pathway and visualises them in an image with fold change colour
gradients. It can also quantify and list the various types of interactions
(e.g. phosphorylations, direct interactions, etc.) in said pathway(s),
where a perturbation event is defined as an interaction A --> B where
A is a DEG. All this is done via the Pathview package.
# Analyse the MAPK pathway
pathway_analysis.R degs.txt
These are scripts for collecting and analysing raw gene and transcript
expression estimates from both Salmon and Kallisto. The
first script, collect_tpm.sh
is a BASH/AWK script for collecting all the
replicates in a directory structure into a single file, and the subsequent
scripts can use its output.
# Collect TPM estimates across samples into a single file
collect_tpm.sh <base Salmon/Kalliso directory> > tpm.transcripts.txt
# Sum transcript-level TPM to the gene level
expression_sums.R tpm.transcripts.txt <biomaRt info file> tpm.genes.txt
# Calculate TPM means across samples
expression_means.R tpm.transcripts.txt <s1,s2,...> tpm.transcripts.means.txt
# Plot some specific transcripts
expression_barplot.R tpm.transcripts.txt <ENSTID1,ENSTID2,...> transcripts.png
These are scripts related to multivariate analyses and machine learning,
applied on either variant or expression data from the above analyses. The
pairwise_correlations.R
script performs correlations in a pairwise manner
between samples, for use in the heatmap.R
, dendrogram.R
and mds.R
scripts, while the pc_analysis.R
performs a principal component analysis on
expression data.
# Perform PCA on expression data (with log2 normalisation)
pc_analysis.R tpm.genes.txt pca.png -l
# Perform pairwise correlations with expression data (with log2 normalisation)
pairwise_correlations.R tpm.genes.txt correlations.txt -l
# Add some sample-specific metadata (such as known groups) to the correlations
add_metadata.R correlations.txt <metadata file> correlations.metadata.txt
# Cluster the data and plot a heatmap of similarities
heatmap.R correlations.metadata.txt r2 heatmap.png -c all
# Cluster the data and plot a dendrogram
dendrogram.R correlations.metadata.txt r2 dendrogram.png -g <groups column>
# Cluster the data a plot a multidimensional scaling plot
mds.R correlations.metadata.txt r2 mds.png -g <groups column>
The enrichment_analyses.R
script analyses a list of genes and performs an
enrichment analysis on them, to see if any annotation terms are statistically
overrepresented. The gene list can be either the output from the DE analysis
above, or any list of genes with a column header of ENSGID
. Enrichment can be
performed on either GO (Gene Ontology) or KEGG (Kyoto Encyclopedia
of Genes and Genomes) terms.
# Perform a GOslim enrichment analysis and plot top 10 terms
enrichment_analysis.R degs.txt biomaRt_info.txt enrichment.png -t GOslim
The anova.R
script takes long-format data containing both treatment
and
response
variables, and checks if there are any differences between groups
(differently coloured groups are significantly different).
# Perform ANOVA and Tukey's HSD
anova.R <input> anova.png <treatment variable> <response variable>
The miscellaneous group also contains the esnp_karyotyping.R
script for
performing eSNP-Karyotyping, which is a wrapper for the package with
the same name. It calculated the mean allelic ratios across SNV profiles
(created with the seqCAT package), which can yield information on
aneuploidy and chromosomal aberrations in RNA-seq data.
esnp_karyotyping.R <SNV profile> esnp.txt
esnp_karyotyping <SNV profile directory> esnp.txt --directory
These scripts are available with a MIT licence. They are free software: you may
redistribute and/or modify them under the terms of the MIT license. For more
information, please see the LICENCE
file.