
Installation guide

G. E. Kenney edited this page Nov 10, 2023 · 28 revisions

making sure that you have everything needed to run prettyClusters

prettyClusters is primarily R-based, but it draws on some outside tools. Getting everything installed before you start is probably a good idea. Take a look at the TOC at the side of the page (on a desktop view) and you can skip past stuff you don't need to revisit.

A hardware note

Some tools allow you to specify how many of your CPU's threads get used, so it helps to know how many you have.

On macOS, you can use a sysctl command to get this information - the "hw.logicalcpu" value is your number of threads.

%      sysctl hw.physicalcpu hw.logicalcpu
hw.physicalcpu: 8
hw.logicalcpu: 16

On Windows, going to Task Manager > Performance > CPU will give you the number of logical processors, i.e. threads.

On Linux, try nproc.

%      nproc --all
16
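The per-OS checks above can be folded into one function. This is a minimal sketch rather than anything prettyClusters ships: count_threads is a made-up name, and the getconf fallback is an assumption for systems that have neither nproc nor the macOS sysctl key.

```shell
# Print the number of logical CPUs (threads) using whichever tool exists.
# count_threads is a hypothetical helper name, not part of prettyClusters.
count_threads() {
    if command -v nproc >/dev/null 2>&1; then
        nproc --all                      # Linux / WSL (GNU coreutils)
    elif command -v sysctl >/dev/null 2>&1 \
         && sysctl -n hw.logicalcpu >/dev/null 2>&1; then
        sysctl -n hw.logicalcpu          # macOS
    else
        getconf _NPROCESSORS_ONLN        # assumed fallback for other systems
    fi
}

count_threads
```

The number it prints is the upper limit to give any tool that asks for a thread count.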

R

Installing R and RStudio

If you haven't already, install first R and then - if desired - the RStudio IDE and set up as required for your OS. Note: as of July 2023, the current version of R is 4.3.1.

On Windows

If you are on Windows, install Rtools for building packages from source. You may also need to install git separately before using Rtools; some users have had issues using devtools::install_github without it.

On macOS

If you are on macOS, make sure you have Xcode installed, and install gfortran and XQuartz. The first two are necessary for building packages from source, and XQuartz is necessary for graphics in some R packages. If you aren't sure whether Xcode is installed, run the following in Terminal; it will either tell you it is already installed or start the installation:

%   xcode-select --install

The gfortran link goes to the package suggested by the R team, but you can also install it via homebrew (brew install gcc), macports, etc. (See my notes on installing other packages - they apply here too.)

gcc, gfortran, and non-Intel macs

NOTE FOR PEOPLE ON M1/M2/etc. "Apple Silicon" COMPUTERS: If you are not running R/RStudio in Rosetta 2 mode, you may run into some fun and exciting problems. I've only just gotten a computer where I can test this myself, so caveat utilitor! However, based on the experiences of some users, gfortran may not install in the location that R and RStudio look for it on modern macs. First, see what your system is actually calling up when it tries to use it:

%   gcc --version
%   gfortran --version

If the output for either is something including "Apple clang version [...]", the system is not by default calling the version you installed. First, try adding the following to your $PATH. On macOS, you can edit the .zshrc file in your home directory (cd ~ to get there) to contain the following somewhere (note: correct the gfortran location depending on whether you installed via the recommended package or homebrew):

export PATH="/usr/local/gfortran/bin:/Library/Frameworks/R.framework/Resources:$PATH"

(Or you can just add the directories to whatever $PATH list you already have.) You'll need to reload your .zshrc (source ~/.zshrc, or the equivalent for your .bashrc or whatever) and see if the output for the version commands above changes. If so, excellent. Note which gfortran version is listed; we're going to use it to edit another R configuration file.

There may already be a .R directory in your home directory; if so, you can go there (cd ~/.R). If not, you'll have to make it first (mkdir ~/.R). In this folder, add a file called Makevars. (I usually use emacs, but any text editor is fine.) Add the following, changing the gfortran installation location to whatever's correct on your computer. Replace the XX in the "VER" line with the major (2-digit) version number reported above, and update the XX.X.X in the FLIBS line to match the full version.

VER=-XX
CC=gcc$(VER)
CXX=g++$(VER)
CFLAGS=-mtune=native -g -O2 -Wall -pedantic -Wconversion
CXXFLAGS=-mtune=native -g -O2 -Wall -pedantic -Wconversion
FC=/usr/local/gfortran/bin/gfortran
F77=/usr/local/gfortran/bin/gfortran
FLIBS=-L/usr/local/gfortran/lib/gcc/x86_64-apple-darwin20/XX.X.X -L/usr/local/gfortran/lib -lgfortran -lm 

Save that file, close, close RStudio, and re-open it. Hopefully R and RStudio will draw on the gfortran that you installed and not Apple's clang implementation, and your packages will compile.
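The compiler checks and the version lookup described above can be bundled into one small script. This is a sketch under assumptions: is_apple_clang and major_version are hypothetical helper names, and the -dumpfullversion flag assumes GCC 7 or newer (older releases only understand -dumpversion).

```shell
# Succeed if the text on stdin looks like Apple's clang banner.
# (is_apple_clang and major_version are hypothetical helper names.)
is_apple_clang() {
    grep -q "Apple clang"
}

# Extract the major version ("12") from a full version string ("12.2.0").
major_version() {
    printf '%s\n' "${1%%.*}"
}

for tool in gcc gfortran; do
    if ! command -v "$tool" >/dev/null 2>&1; then
        echo "$tool: not found on PATH"
    elif "$tool" --version 2>&1 | is_apple_clang; then
        echo "$tool: resolves to Apple clang"
    else
        # Assumes GCC 7+; on older compilers this stays empty.
        full=$("$tool" -dumpfullversion 2>/dev/null)
        echo "$tool: version $full (major $(major_version "$full"))"
    fi
done
```

The major number is what goes after the hyphen in the VER line, and the full version is what replaces XX.X.X in FLIBS.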

Failing that: there are some reports that adding the FC and FLIBS lines to a Makeconf file located here (where X.X is your R version) works if the Makevars edits don't:

/Library/Frameworks/R.framework/Versions/X.X-arm64/Resources/etc/Makeconf

Failing that: a (potentially worse) hack to address this involves linking the actual gfortran location to the place R will look for it (note: this assumes you installed the ARM64 gfortran from GitHub; change the source directory if you installed gfortran via brew or macports):

%   ln -s /usr/local/gfortran /opt/R/arm64/

After this, you'll probably still want to edit Makevars as described above. I think symlinking generally shouldn't be necessary at this point, but I'm leaving it here as a legacy note.

xz and lzma.h on non-Intel macs

Occasionally, during package installation in R on M1/etc. macs, some tools cannot locate a file called lzma.h when installing certain packages. For this, try installing the xz package via brew (or macports, if desired), and then either symlinking the lzma.h file and the lzma directory to the directory in which R is looking for them (shown below for the most recent homebrew location, but please confirm the file locations!), or simply copying the files into that directory:

%   brew install xz
%   ln -s /opt/homebrew/Cellar/xz/5.2.5/include/lzma.h /opt/R/arm64/include/lzma.h
%   ln -s /opt/homebrew/Cellar/xz/5.2.5/include/lzma /opt/R/arm64/include/lzma

Note that these hacks may break when, say, you update gfortran or xz or possibly R/RStudio - you will need to change any hardcoded symlinks, and version numbers in the Makevars file, and so on.
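One way to make the lzma symlinks less fragile is to avoid the version-specific Cellar path. The sketch below is an assumption-laden example, not a tested recipe: link_into is a hypothetical helper, and it relies on brew --prefix xz pointing at a version-independent location for the installed xz. The brew lines are commented out so that nothing is linked until you have confirmed both directories.

```shell
# Symlink one file or directory into a destination directory, skipping it if
# something with that name already exists there.
# (link_into is a hypothetical helper name.)
link_into() {
    src=$1
    dest_dir=$2
    target="$dest_dir/$(basename "$src")"
    if [ ! -e "$src" ]; then
        echo "missing source: $src" >&2
        return 1
    fi
    if [ -e "$target" ] || [ -L "$target" ]; then
        echo "already present: $target"
        return 0
    fi
    ln -s "$src" "$target" && echo "linked: $target -> $src"
}

# Example use (confirm both paths on your machine before uncommenting):
# xz_include="$(brew --prefix xz)/include"
# link_into "$xz_include/lzma.h" /opt/R/arm64/include
# link_into "$xz_include/lzma"   /opt/R/arm64/include
```

Because the helper refuses to overwrite anything, a stale symlink from an older xz version still has to be removed by hand before relinking.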

Installing R packages

Most R packages used in prettyClusters are available as precompiled packages on CRAN; a few are found in other repositories or need to be compiled from source, which is why I suggested installing gfortran and Xcode and/or Rtools. Note that most of the core tidyverse packages - dplyr, ggplot2, stringr, tibble, tidyr - get used at some point, so I'd suggest installing the whole tidyverse.

install.packages(c("data.table", "devtools", "gggenes", "ggraph", "ggtext", "igraph", "influenceR", "magrittr", "pheatmap", "pvclust", "rlang", "scales", "seqinr", "tidygraph", "tidyverse", "viridis"))

I'm listing the palette-focused packages separately (yes, there are too many options). If you don't install any of them, viridis will be the default color package.

install.packages(c("fishualize", "ghibli", "lisa", "nord", "rtist", "scico", "wesanderson"))

Some packages are instead found on the biology-focused Bioconductor repository. To install them, make sure you've installed BiocManager first (you can check the current version available and its compatibility with your installed version of R here):

# installing BiocManager
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(version = "3.17")
# installing the non-CRAN packages
BiocManager::install(c("genbankr", "GenomicRanges", "GenomeInfoDb"))

Finally, prettyClusters isn't legit enough to be in a repository yet, so you'll have to install it from GitHub (this is why I included devtools in the package list):

devtools::install_github("g-e-kenney/prettyClusters")

Updating

I do try to keep this package more or less in sync with major changes in tidyverse or other key packages, and I've been trying to fix bugs relatively aggressively. It's worth re-installing if my main repository page says there is an update or if you haven't used this tool in a while - I might well have updated something useful! Generally standard R use of install.packages() (and devtools::install_github() for this package) should suffice. However, an RStudio caveat: if it doesn't seem to be correctly using a newer version of a package, try restarting or opening a new RStudio project. I've occasionally run into weird RStudio issues where cached versions of functions remain even when a package is fully removed and reinstalled.

Installing non-R tools

In prepNeighbors, tools including blast, mafft, and hmmer can be called when investigating hypothetical protein subgroups. However, they are not R packages and need to be installed separately. I've written up some basic guidelines here for getting these tools installed in a way that will work well with R & RStudio. Note that some of these tools have standalone installers, but it's honestly going to be less trouble to install everything via one method (if you need to troubleshoot things, you don't have to do it separately for each tool).

Managing packages

As a result, it can be convenient to use a package manager. There are some common options these days:

Windows

On Windows, I'd make sure you've installed WSL (the Windows Subsystem for Linux). I just use a standard Ubuntu installation for WSL; when you are using that, you can stick with the built-in apt-based package management system. I have not tried other non-WSL options, and do not guarantee that they will work.

Linux

See above - if you're using something other than Windows or macOS you probably already know what kind of package system your OS uses.

macOS

These days homebrew is probably the most common option - installation instructions here. While macports also works, I'm not going to list commands for it too - if you're already a macports user, I'll list the package names and you can install accordingly.

Environment settings

Note that some RStudio/R/OS/shell combinations don't correctly pass on PATH information, so depending on what installation method you use, you might need to update the paths that R searches. If you're not sure what directories to add, but you know that mafft or blastp work in the Terminal, check your profile files and see what the PATH variables contain there (or just list your full and probably ugly $PATH):

%   more ~/.zshrc
%   more ~/.bash_profile
%   echo $PATH

You can compare the $PATH reported there to what R(Studio) finds:

Sys.getenv("PATH")

If the R version is missing key things, you can tell it that no really, it needs to look in some additional places, either via R commands as below (replacing the miscellaneous paths with the rest of your $PATH output), or via editing your .Renviron file:

Sys.setenv(PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/YOUR/MISCELLANEOUS/PATHS/HERE")

If you're on a non-Intel Mac, you'll also want to make sure that any package managers you are using are getting detected (e.g. you would want to make sure that /opt/homebrew/bin:/opt/homebrew/sbin are present for homebrew.)
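Comparing two long PATH strings by eye is error-prone, so the comparison can be scripted. This is a minimal sketch: missing_from_path is a hypothetical helper, and the second argument in the example call is a placeholder standing in for whatever Sys.getenv("PATH") prints in your R(Studio) session.

```shell
# List directories that appear in the first PATH string but not the second.
# Both arguments are colon-separated lists, as printed by `echo $PATH`.
# (missing_from_path is a hypothetical helper name.)
missing_from_path() {
    printf '%s\n' "$1" | tr ':' '\n' | while IFS= read -r dir; do
        [ -n "$dir" ] || continue          # skip empty entries
        case ":$2:" in
            *":$dir:"*) ;;                 # R already has this directory
            *) printf '%s\n' "$dir" ;;     # candidate to add for R
        esac
    done
}

# Placeholder R-side PATH; paste in the real output from Sys.getenv("PATH").
missing_from_path "$PATH" "/usr/bin:/bin:/usr/sbin:/sbin"
```

Each directory it prints is one that works in the Terminal but is invisible to R, and so is a candidate for Sys.setenv() or .Renviron.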

actual package installation

You can install blast+ and mafft without package managers, but I don't think there's a current option for hmmer, and I'm not sure how well the standalone packages play with R and other tools. I haven't tested prettyClusters with (bio)conda-based environments, so caveat utilitor on that front too.

macOS

For the homebrew install, you will want:

%   brew install blast
%   brew install mafft
%   brew install hmmer

They're also on macports (ncbi-blast+, mafft, hmmer).

NOTE: if you are on an M1 or M2 macOS system, for hmmer, you will currently want to follow these instructions with these modifications to the ./configure & make & make install process.

wsl (Ubuntu on Windows) & unix/linux

Ubuntu apt-based package instructions below; if you use something else, you probably can extrapolate accordingly.

%   sudo apt install ncbi-blast+
%   sudo apt install mafft
%   sudo apt install hmmer
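Whichever package manager you used, it is worth confirming that the tools actually resolve from your shell. A minimal sketch follows; check_tools is a hypothetical helper, and blastp, mafft, and hmmscan are used as representative binaries from the blast+, mafft, and hmmer packages.

```shell
# Print one line per command saying whether it is on PATH; return non-zero
# if any are missing. (check_tools is a hypothetical helper name.)
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool ($(command -v "$tool"))"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return $missing
}

# Representative binaries for blast+, mafft, and hmmer respectively.
check_tools blastp mafft hmmscan || echo "install the missing tools, or fix your PATH"
```

If everything is found here but R still cannot see the tools, the PATH that R receives is the more likely culprit.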

Additional tools for working with UniProt, EMBL, and GenBank files:

For extracting data from GenBank formatted files, I use a range of additional tools. Depending on what you're trying to do, it might be a good idea to use them too.

python

It can be a good idea to use a local install of python rather than the OS version, which doesn't get updated much, so that you can control updates. Homebrew and MacPorts (python311) are fine options, and python can be installed as above.

%   brew install python

And on WSL/unix/linux:

%   sudo apt install python3

BioPerl

As with python, it can be good to have a version of perl that is not the macOS-controlled default, though this is less mission-critical than having an up-to-date python install for most purposes. You can install it via Homebrew or macports (perl5) as above, or from other sources like perlbrew:

%   brew install perl

And on WSL/unix/linux:

%   sudo apt install perl

Then install the BioPerl modules. Start by opening up CPAN:

%   perl -MCPAN -e shell

Look for bioperl distributions and figure out which is the newest:

cpan> d /bioperl/
Distribution    CJFIELDS/BioPerl-1.6.901.tar.gz
Distribution    CJFIELDS/BioPerl-1.6.923.tar.gz
Distribution    CJFIELDS/BioPerl-1.6.924.tar.gz

Now install it:

cpan> install CJFIELDS/BioPerl-1.6.924.tar.gz
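Once CPAN finishes, you can check from the shell that the perl on your PATH can actually load BioPerl. This is a small sketch: perl_has_module is a hypothetical helper, and Bio::SeqIO is used here only as a representative BioPerl module.

```shell
# Succeed if the perl found on PATH can load the named module.
# (perl_has_module is a hypothetical helper name.)
perl_has_module() {
    perl -M"$1" -e 1 >/dev/null 2>&1
}

if perl_has_module Bio::SeqIO; then
    echo "BioPerl (Bio::SeqIO) is available"
else
    echo "Bio::SeqIO not found; BioPerl may not be installed for this perl"
fi
```

If the check fails right after a successful install, the modules may have gone to a different perl than the one first on your PATH (for example the system perl rather than the homebrew one).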

emboss

Emboss is occasionally helpful when working with self-annotated files (from RAST, prokka, etc.) but is not needed for many workflows. Installable via homebrew (EMBOSS on macports):

%   brew install brewsci/bio/emboss

And on WSL/unix/linux:

%   sudo apt install emboss

InterProScan

You will need access to a Linux machine when using InterProScan (or a cluster, which is what I run it on.) If you're in a situation where you can and must set it up yourself, install instructions are available.

Cytoscape

If you are working with datasets where you have run EFI-EST analyses, or if you want to view the neighborhood network, you may want to download Cytoscape - it's cross-platform and Java-based. (It's also a huge memory hog - run on whatever you can find that has the most memory.)

EFI-EST

If you want to generate sequence similarity networks for your genes of interest, the EFI-EST webtool is the current standard. Use of EFI-EST is its own topic, and they have a lot of great resources. For the purposes of prettyClusters usage, you'll want to submit a fasta file of your genes of interest via Option C (user fasta file), making sure that "Read FASTA headers" is not selected (it looks for UniProt protein IDs and will become very sad when it finds only IMG ones).