# **Walkthrough: Bioinformatics Tools Installation Script**

Before you can run the main adaptive sampling analysis pipeline, you need to ensure that all the necessary bioinformatics software is installed on your server (e.g., a Google Cloud Vertex AI instance). The `install_bioinfo_tools.sh` script automates this process.

This document will walk you through what each section of the installation script does. After understanding its components, you'll be guided on how to run it.

**Important Note:** This installation script is designed for Debian-based Linux systems (like the default images on Google Cloud Vertex AI). It uses apt-get for package management and calls sudo, so your account needs administrator privileges to install software system-wide.

> **Tip:** Run the cells one‑by‑one so you can watch each step complete.

## **Script Breakdown**

Let's look at the install\_bioinfo\_tools.sh script section by section.

### **Section 0: Script Header and Initial Messages**

* **\#\!/bin/bash**: This is the "shebang" line. It tells the system that this script should be executed with bash.  
* **Comments (\# ...)**: Lines starting with \# are comments, providing explanations for human readers.  
* **echo "..."**: These commands print messages to the terminal, informing the user about the script's progress.  
* **set \-e**: This is an important command for script robustness. It ensures that if any command within the script fails (exits with a non-zero status), the script will stop immediately. This prevents further errors or incomplete installations.
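
The effect of `set -e` can be seen in a minimal sketch. It is not taken from `install_bioinfo_tools.sh`; the demo runs in a child shell so that the deliberate failure cannot end your own session.

```bash
#!/bin/bash
# The child shell enables set -e, prints one line, then hits a failing command.
# "false" always exits with a non-zero status, so "step 2" is never reached.
demo_output=$(bash -c 'set -e; echo "step 1"; false; echo "step 2"' || true)

echo "Output before the script stopped: $demo_output"
```

Without `set -e`, the child shell would have carried on and printed both lines.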

### **Section 1: Update Package Lists**

* **sudo apt-get update \-y**:  
  * sudo: Executes the command with superuser (administrator) privileges, which is necessary for system-wide package management.  
  * apt-get update: This command resynchronizes the package index files from their sources. It downloads the latest list of available software packages and their versions from the repositories configured on your system. It doesn't install or upgrade any software itself, but it's a crucial first step before installing new packages to ensure you get the latest available versions.  
  * \-y: Automatically answers "yes" to any prompts, making the command non-interactive.

### **Section 2: Install Essential Build Tools and Libraries**

* **sudo apt-get install \-y ...**: This command installs a list of packages.  
  * **build-essential**: A meta-package that installs many common development tools like GCC (C/C++ compiler), make, and other utilities needed for compiling software from source code.  
  * **wget, curl**: Utilities for downloading files from the internet.  
  * **unzip, gzip**: Utilities for decompressing files. gzip is also used by your main pipeline.  
  * **git**: A version control system, often needed to download (clone) software source code from repositories like GitHub.  
  * **zlib1g-dev, libncurses5-dev, libbz2-dev, liblzma-dev**: These are development libraries. Many bioinformatics tools depend on these for functionalities like compression (zlib, bzip2, lzma) or terminal interactions (ncurses). The \-dev suffix indicates that these packages include header files and other resources needed for compiling software that uses these libraries.  
  * **autotools-dev, autoconf, pkg-config**: Tools often used in the build process (compilation) of software from source, helping to configure the build for different systems.
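
Sections 1 and 2 can be sketched together as follows. The package names come from the list above; the `DRY_RUN` switch and the `run` helper are hypothetical additions for this example so that it only prints the commands unless you opt in. The real script may be organized differently.

```bash
#!/bin/bash
# Package names taken from the Section 2 description.
BUILD_PACKAGES=(
    build-essential wget curl unzip gzip git
    zlib1g-dev libncurses5-dev libbz2-dev liblzma-dev
    autotools-dev autoconf pkg-config
)

# DRY_RUN is a hypothetical switch for this sketch: 1 (the default) prints only.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

run sudo apt-get update -y
run sudo apt-get install -y "${BUILD_PACKAGES[@]}"
```

Saved as a file, it would perform the installation when invoked with `DRY_RUN=0` on a Debian-based system where you have sudo rights.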

### **Section 3: Install Core Utilities**

* **sudo apt-get install \-y gawk coreutils**:  
  * **gawk**: GNU Awk, a powerful text-processing utility used extensively in your main analysis script.  
  * **coreutils**: Provides the basic GNU file, shell, and text manipulation utilities (e.g., cat, ls, mkdir, head, cut, sort, uniq, rm). These are almost always present on a Linux system, but this line ensures they are. Note that grep ships in its own package, which is likewise present on virtually every Debian-based system.

### **Section 4: Install Samtools**

* **sudo apt-get install \-y samtools**: Installs Samtools, a suite of programs for interacting with high-throughput sequencing data in SAM/BAM/CRAM formats. Your main script uses it for sorting, viewing, indexing, and filtering BAM files.  
* **samtools \--version**: After installation, this command is run to print the installed version of Samtools, which helps verify that the installation was successful and the tool is in the system's PATH.

### **Section 5: Install Minimap2**

* **sudo apt-get install \-y minimap2**: Installs Minimap2, a fast sequence alignment program for mapping DNA or mRNA sequences against a large reference database. It's used for the read mapping step in your main pipeline.  
* **minimap2 \--version**: Verifies the Minimap2 installation.

### **Section 6: Install SeqKit**

* **SeqKit Installation from Binary**: SeqKit is often installed by downloading a pre-compiled binary (an executable file) directly from its GitHub releases page. This method doesn't rely on apt-get.  
  * SEQKIT\_VERSION: Sets a specific version to download. You might want to update this to the latest version available on the SeqKit GitHub page.  
  * SEQKIT\_ARCH=$(dpkg \--print-architecture): This command determines the system's architecture (e.g., amd64 for standard 64-bit Intel/AMD processors, arm64 for ARM-based processors) to download the correct binary.  
  * if ... elif ... else ... fi: Selects the appropriate download URL based on the detected architecture.  
  * cd /tmp: Changes to the /tmp directory, a standard location for temporary files.  
  * wget \-q ... \-O seqkit.tar.gz: Downloads the SeqKit archive quietly.  
  * tar \-xzf seqkit.tar.gz: Extracts the contents of the downloaded .tar.gz file.  
  * sudo mv seqkit /usr/local/bin/: Moves the extracted seqkit executable to /usr/local/bin/. This directory is typically in the system's PATH, making the command accessible from anywhere.  
  * rm seqkit.tar.gz: Deletes the downloaded archive to save space.  
  * cd /: Leaves /tmp by changing to the root directory (optional, just good practice).  
  * seqkit version: Verifies the SeqKit installation.
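
The architecture-to-URL selection described above might look like the sketch below. The version number, the `seqkit_url` helper name, and the URL pattern are assumptions for illustration (the pattern follows the file naming on SeqKit's GitHub releases page); check that page before relying on them. Nothing is downloaded here.

```bash
#!/bin/bash
# Assumed version for illustration; check the SeqKit releases page for the current one.
SEQKIT_VERSION="2.8.2"

# Hypothetical helper: map a Debian architecture name to a download URL.
# The URL pattern is an assumption based on SeqKit's GitHub release naming.
seqkit_url() {
    local version="$1" arch="$2"
    local base="https://github.com/shenwei356/seqkit/releases/download/v${version}"
    if [ "$arch" = "amd64" ]; then
        echo "${base}/seqkit_linux_amd64.tar.gz"
    elif [ "$arch" = "arm64" ]; then
        echo "${base}/seqkit_linux_arm64.tar.gz"
    else
        echo "Unsupported architecture: ${arch}" >&2
        return 1
    fi
}

# On a Debian-based system the architecture would come from:
#   SEQKIT_ARCH=$(dpkg --print-architecture)
SEQKIT_ARCH="amd64"   # hard-coded here so the sketch runs anywhere
seqkit_url "$SEQKIT_VERSION" "$SEQKIT_ARCH"
```

Returning a non-zero status for an unknown architecture lets a script running under `set -e` stop before it attempts a download that cannot succeed.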

### **Section 7: Install Seqtk**

* **Seqtk Installation (Attempt apt then Source)**:  
  * if sudo apt-get install \-y seqtk; then ...: The script first tries to install seqtk using apt-get. If this is successful, it's done.  
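  * else ... fi: If apt-get fails (e.g., seqtk is not available in the configured repositories), the script falls back to installing from source. Because the failing command is the condition of the if statement, set \-e does not stop the script here:  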
    * cd /tmp: Changes to the temporary directory.  
    * git clone https://github.com/lh3/seqtk.git: Downloads the source code for seqtk from its GitHub repository.  
    * cd seqtk: Enters the downloaded source code directory.  
    * make: Compiles the seqtk program from its source code. This requires a C compiler and make, both provided by build-essential, plus the zlib headers from zlib1g-dev (all installed in Section 2).  
    * sudo mv seqtk /usr/local/bin/: Moves the compiled seqtk executable to /usr/local/bin/.  
    * rm \-rf /tmp/seqtk: Deletes the downloaded source code directory.  
  * seqtk: Verifies the installation by running the command (which usually prints usage information if no arguments are given).
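
The try-then-fall-back control flow can be sketched without touching the system. Here `install_from_apt` and `install_from_source` are hypothetical stand-ins for the real apt-get and git/make steps, and the apt route is made to fail on purpose so that the fallback branch runs.

```bash
#!/bin/bash
# Hypothetical stand-ins for the two install routes, so the sketch needs no sudo or network.
install_from_apt() {
    echo "apt: package 'seqtk' not found" >&2
    return 1                      # simulate apt-get failing
}
install_from_source() {
    echo "cloned, compiled, and installed seqtk"
}

# The failing command is the condition of the if, so it selects the else
# branch instead of ending the script (even when set -e is active).
if install_from_apt 2>/dev/null; then
    INSTALL_ROUTE="apt"
else
    install_from_source
    INSTALL_ROUTE="source"
fi
echo "seqtk installed via: $INSTALL_ROUTE"
```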

## **How to Run the Installation Script**

Now that you understand what the script does, here's how to use it:
 
1. **Run the Script:**  
   * Execute the script by running the cell below.  
   * The script will print messages as it progresses through each installation step.  

```
!bash ~/dsc_workshop_2025/scripts/install_bioinfo_tools.sh
```

2. **Verify Installations (After Script Finishes):**  
   * Although the script attempts to verify some installations by printing version numbers, it's a good idea to manually check a few:  
     samtools \--version  
     minimap2 \--version  
     seqkit version  
     seqtk

```
!samtools --version | head -n 1
!echo -e "Minimap2 version: "
!minimap2 --version | head -n 1
!seqkit version
!seqtk 2>&1 | head -n 3
```
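
If you prefer a single pass/fail summary, a small check along these lines may help. It is a sketch meant for a terminal or a saved script rather than a `!` notebook line, and `missing_tools` is a hypothetical helper name, not part of the installation script. It only tests whether each command is on the PATH, not whether it works.

```bash
#!/bin/bash
# Hypothetical helper: print the name of every listed command not found on the PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
        fi
    done
}

MISSING=$(missing_tools samtools minimap2 seqkit seqtk)
if [ -z "$MISSING" ]; then
    echo "All four tools are on the PATH."
else
    echo "Not found on the PATH:"
    echo "$MISSING"
fi
```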

If any tool failed to install, the script should have exited early because of set \-e, so the last messages it printed belong to the failing step. Review that output to troubleshoot.

Once every tool responds as expected, the software environment for the main adaptive sampling analysis pipeline is ready.