---
abbreviations:
  GUI: Graphical User Interface
  CLI: Command Line Interface
  URL: Uniform Resource Locator
  SRA: Sequence Read Archive
---

# Intro to Linux

:::{figure} ../../images/tux.webp
:label: tux
:align: center

Meet Tux, the official mascot of Linux.
:::

Linux is an open-source operating system built on top of the Linux kernel. It was initially developed by Linus Torvalds in 1991 and has since grown to involve thousands of contributors from the open-source community.

In genomics, Linux has been the preferred platform for scientists to run and orchestrate sophisticated pipelines due to its low memory footprint, stability, and maturity.

## Command Line Interface

The command line, sometimes called the terminal, is a text-based user interface for interacting with the operating system through a program called the shell. Upon opening a new shell session, the user is greeted with the following:

```bash
$
```

The dollar symbol (`$`) is what we call the terminal prompt. It is an indication that the current shell is ready to accept commands from the user. 

A Linux command usually follows the pattern:

```{code} bash
:label: linux-command-structure

$ <command> <argument> <value>
```

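`<command>` is the name of the executable (or program) to be invoked by the user. Parameters that modify the behavior of the program are usually passed after the command name in the form of arguments (`<argument>`). By convention, long-form arguments (also called options or flags) are preceded by two dashes (`--`), while their single-letter short forms are preceded by one dash (`-`). An argument that takes a value is immediately followed by that value; other flags act as simple switches and take no value.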

:::{note} 
For brevity, the terminal prompt (`$`) will be excluded from all succeeding commands.

You could copy the commands by clicking on the paper icon on the right-most side of a code block.
:::
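To make the short and long spellings concrete, here is a minimal sketch using `head`, a standard utility that prints the first lines of its input. It assumes GNU coreutils, where `-n` and `--lines` are two spellings of the same option; the sample text is made up for illustration.

```bash
# Three lines of sample input (hypothetical data for illustration).
sample=$(printf 'first\nsecond\nthird\n')

# Short option: one dash, value passed as a separate word.
short_out=$(printf '%s\n' "$sample" | head -n 2)

# Long option: two dashes, value attached with an equals sign.
long_out=$(printf '%s\n' "$sample" | head --lines=2)

echo "$short_out"
echo "$long_out"
```

Both invocations should print the same two lines, since only the spelling of the option differs.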

As an example, the `mkdir` command is used for creating new directories. Let's check its usage by passing in the `--help` flag:

```bash
mkdir --help
```

```text
Usage: mkdir [OPTION]... DIRECTORY...
Create the DIRECTORY(ies), if they do not already exist.

Mandatory arguments to long options are mandatory for short options too.
  -m, --mode=MODE   set file mode (as in chmod), not a=rwx - umask
  -p, --parents     no error if existing, make parent directories as needed
  -v, --verbose     print a message for each created directory
  -Z                   set SELinux security context of each created directory
                         to the default type
      --context[=CTX]  like -Z, or if CTX is specified then set the SELinux
                         or SMACK security context to CTX
      --help     display this help and exit
      --version  output version information and exit

GNU coreutils online help: <https://www.gnu.org/software/coreutils/>
Full documentation <https://www.gnu.org/software/coreutils/mkdir>
or available locally via: info '(coreutils) mkdir invocation'
```


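The first line tells us how we could run the `mkdir` command. It expects _DIRECTORY_, which is a **positional** argument: it is identified by where it appears in the command rather than by a flag name. Here, `mkdir` treats every value that is not an option as a path at which a folder will be created, and the trailing `...` means that more than one path may be given.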

Note that _OPTION_ is enclosed in square brackets (`[]`). This tells the user that this part is optional. In the following lines, we get more information about the optional flags that we can use with the command:

- `-m` or `--mode` sets the file permissions for the directory (r = read, w = write, x = execute)
- `-p` or `--parents` automatically creates parent directories for the provided path
- `-v` or `--verbose` prints a message for each created directory 
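As a minimal sketch of the `-m` and `-v` options, the snippet below creates a directory that only its owner can access. The directory name `private` and the temporary working directory are hypothetical, and `stat -c` assumes GNU coreutils.

```bash
# Work inside a throwaway directory so nothing in your home is touched.
workdir=$(mktemp -d)

# -m 700: the owner may read, write, and enter; group and others get nothing.
# -v: report each directory as it is created.
mkdir -m 700 -v "$workdir/private"

# Show the resulting permissions in octal (GNU stat syntax).
perms=$(stat -c '%a' "$workdir/private")
echo "$perms"
```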

Let's create a directory for storing sequencing reads. 

This will be located in our home directory at the path `~/data/reads`. Since we will be creating a nested folder, we need to pass in the `-p` flag to automatically create the `data` directory, along with the nested `reads` directory:

```bash
mkdir -p --verbose ~/data/reads
```

```text
mkdir: created directory '/home/dagsdags/data'
mkdir: created directory '/home/dagsdags/data/reads'
```


Notice that we can mix short-hand and long-hand options in a single command. From the output, we can see that two directories were generated as expected.
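To see why `-p` matters, the following sketch tries the same nested path with and without the flag. It runs in a temporary directory (an assumption made so the example does not touch your home folder) and records whether each attempt succeeds.

```bash
workdir=$(mktemp -d)

# Without -p, mkdir refuses because the parent "data" does not exist yet.
if mkdir "$workdir/data/reads" 2>/dev/null; then
    without_p="created"
else
    without_p="failed"
fi

# With -p, missing parents are created along the way.
mkdir -p "$workdir/data/reads"

# Running it again is harmless: -p also suppresses the "already exists" error.
mkdir -p "$workdir/data/reads"
rerun_status=$?

echo "without -p: $without_p; rerun exit status: $rerun_status"
```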

## File Navigation

In most modern computers, a GUI is provided to allow users to navigate between files and directories. However, the shell only supports text input, hence commands are used for moving in and out of directories.

A **file path** is a string that represents the location of a particular file or directory within your system. At a high level, the directory structure for _most_ Linux systems follows @file-tree.

:::{figure} ../../images/linux-file-tree.png
:label: file-tree
:align: center

The Linux file tree.
:::

The **root path** is the top-most directory that stores all other files within your system. It is denoted with a forward slash (`/`) and it serves as the starting point for **absolute paths**. As an example, we could write the absolute path to Mary's data directory as `/home/mary/data`.

In contrast, **relative paths**, as the name suggests, are _relative_ to your current directory. This provides a way to anchor the location of a file based on where another file is stored. The path to the `lib` directory relative to Robert's home directory is `../local/lib`, where `..` refers to the parent of Robert's home directory.

Some paths are accessed more often than others. Examples include the root and home directories. For convenience, special symbols are used to denote these file locations as tabulated in @path-symbols.

:::{table} Path aliases.
:label: path-symbols
:align: center

| Symbol | Path Description |
| --- | ---------- |
| `.` | the current directory |
| `..` | the parent of your current directory |
| `/` | the root directory |
| `~` | the home directory |
| `-` | the previously navigated directory |

:::
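The symbols in the table can be tried out with `cd`, which is covered in more detail later. This is a minimal sketch on a made-up directory tree (`project/reads` inside a temporary directory); the variable names are only there to capture where each step lands.

```bash
# Build a small hypothetical tree: <workdir>/project/reads
workdir=$(mktemp -d)
mkdir -p "$workdir/project/reads"

cd "$workdir/project/reads"   # an absolute path, starting from /
cd ..                         # ".." moves up to the parent: .../project
parent=$(basename "$PWD")

cd ./reads                    # "." is the current directory
cd - > /dev/null              # "-" returns to the previous directory
previous=$(basename "$PWD")

echo "$parent $previous"
echo ~                        # "~" expands to your home directory
```

After `cd -`, the shell is back in `project`, because that was the directory visited just before `reads`.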

## A Survey of Linux Commands

Skim over @commands to familiarize yourself with the built-in commands included in your Linux installation. 

Afterwards, proceed to [](#a-practical-example).

:::{table} Command Cheatsheet.
:label: commands
:align: center

| Command | Description |
| ----- | ---------- |
| `pwd` | print your current working directory |
| `ls` | list all files and directories in your current path |
| `cd` | change to another directory |
| `mkdir` | make a new directory |
| `cp` | copy a file or the contents of a directory | 
| `rm` | remove a file or a directory |
| `mv` | move a file or a directory to a new location |
| `cat` | concatenate the contents of one or more files |
| `less` | load the contents of a text file to a new, interactive buffer |
| `curl` | a tool for downloading data associated with a URL |
| `man` | access the manual of a program |
| `grep` | a tool for searching text in a file that follows a specified pattern |
 
:::
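Several of these commands can be tried together in a scratch directory. The sketch below uses made-up file names and contents, and it relies on `>` to write a command's output into a file.

```bash
workdir=$(mktemp -d)
cd "$workdir"

# Create two small text files (hypothetical sample names and contents).
printf 'sample_A\nsample_B\n' > batch1.txt
printf 'sample_C\n' > batch2.txt

cp batch1.txt backup.txt              # copy a file
mkdir archive                         # make a directory
mv backup.txt archive/                # move the copy into it
cat batch1.txt batch2.txt > all.txt   # concatenate both files into one
matches=$(grep -c 'sample_' all.txt)  # count lines matching a pattern
rm batch2.txt                         # delete a file

pwd                                   # print where we are
ls                                    # list what is left
echo "$matches"
```

Working inside a directory created by `mktemp -d` keeps experiments like this away from real data, which matters because `rm` deletes files permanently.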


### A Practical Example

In this section, we will do the following tasks purely from the terminal:

1. Set up a project directory
2. Download a FASTQ file from the SRA
3. Inspect the sequencing data
4. Extract the headers to a separate file

#### Creating A Project Directory

Invoke the `pwd` command to check your current path:

```bash
pwd
```

```text
/home/dagsdags/Downloads
```

I am currently in the `Downloads` directory, but I want to create my project within the home directory. Use the `cd` command to navigate to the home path:

```bash
cd ~
```

As mentioned in @path-symbols, the tilde (`~`) is an alias for the home directory. Re-run `pwd` to verify that you are in the home directory:

```bash
pwd
```

```text
/home/dagsdags
```

We can now create a separate folder for our project. To do this, run `mkdir` and pass in the folder name as an argument:

```bash
mkdir hello-linux
```

Now enter the new folder by invoking the `cd` command:

```bash
cd hello-linux
```

#### Retrieving Data from SRA

For the data, we will be retrieving a subset of African swine fever virus (ASFV) sequencing reads generated by <doi:10.1128/mra.00719-22>. It is hosted by NCBI through the following link:

- https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos4/sra-pub-run-30/SRR031/31340/SRR31340505/SRR31340505.1

To retrieve this file, we will use the `curl` command. Let's check its usage:

```bash
curl --help
```

```text
Usage: curl [options...] <url>
 -d, --data <data>          HTTP POST data
 -f, --fail                 Fail silently (no output at all) on HTTP errors
 -h, --help <category>      Get help for commands
 -i, --include              Include protocol response headers in the output
 -o, --output <file>        Write to file instead of stdout
 -O, --remote-name          Write output to a file named as the remote file
 -s, --silent               Silent mode
 -T, --upload-file <file>   Transfer local FILE to destination
 -u, --user <user:password> Server user and password
 -A, --user-agent <name>    Send User-Agent <name> to server
 -v, --verbose              Make the operation more talkative
 -V, --version              Show version number and quit

This is not the full help, this menu is stripped into categories.
Use "--help category" to get an overview of all categories.
For all options use the manual or "--help all".
```
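Before downloading the real file, here is a minimal offline sketch of how a few of the listed flags combine. It assumes your `curl` build supports `file://` URLs, which stand in for a remote server here; the file names and contents are hypothetical.

```bash
workdir=$(mktemp -d)

# Stand-in for a remote file; the path and contents are hypothetical.
printf 'hello from the server\n' > "$workdir/remote.txt"
URL="file://$workdir/remote.txt"

# -f: fail on server errors, -s: hide the progress meter,
# -o: choose the name of the output file ourselves.
curl -f -s -o "$workdir/local_copy.txt" "$URL"
status=$?

cat "$workdir/local_copy.txt"
```

Unlike `-O`, which reuses the remote file name, `-o` lets you pick the local name at download time.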


Run the following command to download the FASTQ file:

```bash
URL=https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos4/sra-pub-run-30/SRR031/31340/SRR31340505/SRR31340505.1
curl -O -v "${URL}"
```

The `-O` flag gives the download the same name as the remote file (SRR31340505.1). Run the `ls` command to verify that the download was completed:

```bash
ls -h SRR*
```

```text
SRR31340505.1
```


Let us now add some structure to our working directory. It is good practice to store all reads within a designated folder. Let's create a _reads_ directory using the `mkdir` command, then move our file into it using `mv`. We will also rename the file to include the **.fastq.gz** file extension to make it apparent that it is a compressed sequencing file.

:::{code} bash
:label: rename-and-move-reads
# Create a reads directory.
mkdir -p reads/

# Rename the file to include the extension.
mv SRR31340505.1 SRR31340505.fastq.gz

# Move the file to the reads directory.
mv SRR31340505.fastq.gz reads/
:::

Verify the contents of the `reads` directory by re-running the `ls` command:

```bash
ls reads/
```

```text
SRR31340505.fastq.gz
```


#### Exploring the FASTQ File

FASTQ files often store a copious number of reads, making them very large in terms of file size. These files are often compressed into _gzipped_ files, which explains the `.gz` file extension. Luckily, many downstream tools for reading and processing FASTQ files support the gzipped format, allowing the user to save on disk space in exchange for a bit more compute.
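To get a feel for the space savings, the sketch below writes a toy FASTQ file and compares its size before and after compression. The reads are made up and highly repetitive, so the ratio is far better than what real sequencing data would give; treat it as an illustration only.

```bash
workdir=$(mktemp -d)
fastq="$workdir/toy.fastq"

# Write 1000 made-up reads (4 lines each) that all share one sequence.
i=1
while [ "$i" -le 1000 ]; do
    printf '@read%s\nACGTACGTACGTACGT\n+\nIIIIIIIIIIIIIIII\n' "$i"
    i=$((i + 1))
done > "$fastq"

# Compress to a separate file, keeping the original for comparison.
gzip -c "$fastq" > "$fastq.gz"

raw_bytes=$(wc -c < "$fastq")
gz_bytes=$(wc -c < "$fastq.gz")
echo "uncompressed: $raw_bytes bytes, gzipped: $gz_bytes bytes"
```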

We cannot directly view the contents of the compressed FASTQ file because gzip stores it in a binary format. We could decompress the read file, but the uncompressed copy would take up more disk space. A better approach is to use a program that decompresses the file on the fly and writes human-readable text to the terminal. One such example is `zcat`:

:::{warning}
Do not run the `zcat` command below as is. It prints the contents of the entire FASTQ file to your terminal, which may flood it or cause it to hang.
:::

```bash
zcat reads/SRR31340505.fastq.gz
```


Running the `zcat` command on our FASTQ file will output the contents of the file to _standard output_, which is just another way of saying that the data will be printed in your console. This is not that useful as it is hard to inspect the output. Instead, we could pass the output of `zcat` to another program called `head`, which prints only the first few lines of its input (ten by default). This is done using the **pipe** operator (`|`):

```bash
zcat reads/SRR31340505.fastq.gz | head
```

```text
@SRR31340505.1 64c24fe2-0b6d-4128-8c18-d29740c4a838 length=432
GCCCCCTAGCGTCACCGAGGTCAGGACGGAGAAGTTCGATGACAGCTCGATCCAGGTGCGCTGTGCCGCGGCGTCCCTGGCTTCCTTCTGTGGTTCTTGGGTTCCGTGGGCGCCGCTGGGAGCGCAGGGCCTGCCTCGTGCCTGAGGAGATAAACGTCGCGGCCTCGGAGCCCCTGTCGGCAGTTGGTTGGCGCTGGGGCCCTCTCTTGCTTCGTTGTGGGTTGTGTGCAAAGCTCCGCCTCCGGAGTCCTCTGAGCCTTGGCTTTTCTCCGGCGGCCTGGCGGGCTTCTAGTCTGTTTCCAGGTCGGGGAGTCAGCCTGTGTGGACGCGCCCGTCGCGGGCCGGCGCGCAGGGTAGGAGGCCTTCAGCGACTGGGGCTAAATGCTGGCCGAGCACGGTGCTGGGGCGCTACCACAGCAGAGCTTCGCAGTG
+SRR31340505.1 64c24fe2-0b6d-4128-8c18-d29740c4a838 length=432
=>==<=>>;><<<<<;<99433+****2689:;<=9A@?>>>>65556@??>2+++,,6?A??>>>;:;99:98:;506;<>==<..?>>1115?=<>?><<===<=755569;;;8;/.-----/.-.)39:;;<===>==..8<:658:;:,++++47655546-----/./--.199:65688?===::9:8:889:9:04,,+2)(/6<;:999:;:::;<<242.+)()+,.167<=*((,++-++--3287323367;<???==<));::;<67><653346;:////098::<=87555112.1:<<;::;;;=><=;:9/..--3678;=:555559:;:63&&&&''''()2339:<;;<=<=<<<=<<:<;876000122.----325668::41),&&%$%%$%%%%(()11229''&%%%
@SRR3134
```

We could specify the number of lines by using the `-n` flag, followed by a number. Since each read spans four lines, passing `-n 4` limits the output to only the first read:

```bash
zcat reads/SRR31340505.fastq.gz | head -n 4
```

```text
@SRR31340505.1 64c24fe2-0b6d-4128-8c18-d29740c4a838 length=432
GCCCCCTAGCGTCACCGAGGTCAGGACGGAGAAGTTCGATGACAGCTCGATCCAGGTGCGCTGTGCCGCGGCGTCCCTGGCTTCCTTCTGTGGTTCTTGGGTTCCGTGGGCGCCGCTGGGAGCGCAGGGCCTGCCTCGTGCCTGAGGAGATAAACGTCGCGGCCTCGGAGCCCCTGTCGGCAGTTGGTTGGCGCTGGGGCCCTCTCTTGCTTCGTTGTGGGTTGTGTGCAAAGCTCCGCCTCCGGAGTCCTCTGAGCCTTGGCTTTTCTCCGGCGGCCTGGCGGGCTTCTAGTCTGTTTCCAGGTCGGGGAGTCAGCCTGTGTGGACGCGCCCGTCGCGGGCCGGCGCGCAGGGTAGGAGGCCTTCAGCGACTGGGGCTAAATGCTGGCCGAGCACGGTGCTGGGGCGCTACCACAGCAGAGCTTCGCAGTG
+SRR31340505.1 64c24fe2-0b6d-4128-8c18-d29740c4a838 length=432
=>==<=>>;><<<<<;<99433+****2689:;<=9A@?>>>>65556@??>2+++,,6?A??>>>;:;99:98:;506;<>==<..?>>1115?=<>?><<===<=755569;;;8;/.-----/.-.)39:;;<===>==..8<:658:;:,++++47655546-----/./--.199:65688?===::9:8:889:9:04,,+2)(/6<;:999:;:::;<<242.+)()+,.167<=*((,++-++--3287323367;<???==<));::;<67><653346;:////098::<=87555112.1:<<;::;;;=><=;:9/..--3678;=:555559:;:63&&&&''''()2339:<;;<=<=<<<=<<:<;876000122.----325668::41),&&%$%%$%%%%(()11229''&%%%
```


To inspect the last few lines of the file, we instead use the `tail` command:

```bash
zcat reads/SRR31340505.fastq.gz | tail -n 4
```

```text
@SRR31340505.118391 e1a99ecf-c0b2-476a-8fa0-a0ad555b79af length=4135
TGGCCTCTTCTTTCCCCCTCCAGTCTACACGTCATGATGAGTTATGTCAACATCTTAAAAAAACTTTTGACGCCGCGCATTGAATTCTATGAAGATATTGAAACCATCGACCGCGGTCTTATTAATATATACCTCCTGCCAGTATTCGACCCTAAACCAACCGCAGTGCAAAAAAAAGCAGCTTTACGCATTCCTGTAAACCACTTTGAAAATTATATTCACATTCTTGCGGCGGATATTTTAAATCCCTTAAAACAGTACTCTATTTTTAACAGGTCTGGGCGTTATAGATGACCTACAGTTTATATTGCGGCCGCAGGAAATTATTAGTGTAAAAAATAAGTTTTAAAGTATATGGATATAAAAAGAGCACTTATCCTTTTTTTACTATTTTTAGTCGTATTGAGCAATGCTTTTGTGGACTACATTATTAGCAATTTTAACCATGCCGTGACATGCGGAAAACCTACCTACTTTGGTATAGTTCTTCAAGGTATTTTTCTTGTTATTCTTTTTAGCATAGTCGATTACCTTATTAATGAAAACATTCTTTAATCGGGTACCGGTAACGCACAAAATTTCTTGCCAATGTACGCTGATAAGATCTTTCAAAAACGTATAAATGGTCTCCGCGGATTCGCATGTACAGCCTAAAATATTTATCTTACCTTTAAGAAAAACATTAATGCGTACTTTTTTTCCCGGACTGACCTTAAATTTTGCGGATAACGCGGTCTTCTAATGGAGGCTTTATTTTTCTAATATGGGATATGGAGTAAAGGGATGTGTTCCAACAAAGCTGCAAATTTTTTTAAATGAATGATGACGCGGGGAGACACTGGGTTTATTTGAAACTTAAAATTAATCATAATCGTTTTAAATTCAATAATCTGGATTTTTTTCAATGGGTTGATGTTGCAAGAAGTCTACC
```
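`head` and `tail` can also be chained to pull out a read from the middle of a file. The sketch below does this on a toy gzipped FASTQ file with three made-up reads, so it runs without the real download; the file name is hypothetical.

```bash
workdir=$(mktemp -d)

# Three made-up reads written to a gzipped toy FASTQ file.
printf '@r1\nAAAA\n+\nIIII\n@r2\nCCCC\n+\nIIII\n@r3\nGGGG\n+\nIIII\n' \
    | gzip -c > "$workdir/toy.fastq.gz"

# Keep the first 8 lines (reads 1 and 2), then the last 4 of those (read 2).
second_read=$(zcat "$workdir/toy.fastq.gz" | head -n 8 | tail -n 4)
echo "$second_read"
```

In general, the read at position _k_ ends at line 4 × _k_, so `head -n` with that number followed by `tail -n 4` isolates it.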

#### Extracting the Header Lines

Sequencing reads are stored in chunks of four lines in a FASTQ file:

1. metadata associated with the read (header line)
2. the nucleotide sequence of the read
3. a delimiter (denoted by `+`)
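4. a quality (confidence) score for each position of the read, encoded as one character per base

We can print each of these lines for the first read with `awk`, where `FNR` holds the number of the line currently being read: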


```bash
# Header line or read metadata
zcat reads/SRR31340505.fastq.gz | awk 'FNR==1'
```

```text
@SRR31340505.1 64c24fe2-0b6d-4128-8c18-d29740c4a838 length=432
```

```bash
# Nucleotide sequence
zcat reads/SRR31340505.fastq.gz | awk 'FNR==2'
```

```text
GCCCCCTAGCGTCACCGAGGTCAGGACGGAGAAGTTCGATGACAGCTCGATCCAGGTGCGCTGTGCCGCGGCGTCCCTGGCTTCCTTCTGTGGTTCTTGGGTTCCGTGGGCGCCGCTGGGAGCGCAGGGCCTGCCTCGTGCCTGAGGAGATAAACGTCGCGGCCTCGGAGCCCCTGTCGGCAGTTGGTTGGCGCTGGGGCCCTCTCTTGCTTCGTTGTGGGTTGTGTGCAAAGCTCCGCCTCCGGAGTCCTCTGAGCCTTGGCTTTTCTCCGGCGGCCTGGCGGGCTTCTAGTCTGTTTCCAGGTCGGGGAGTCAGCCTGTGTGGACGCGCCCGTCGCGGGCCGGCGCGCAGGGTAGGAGGCCTTCAGCGACTGGGGCTAAATGCTGGCCGAGCACGGTGCTGGGGCGCTACCACAGCAGAGCTTCGCAGTG
```

```bash
# A delimiter line starting with +, which here repeats the header text
zcat reads/SRR31340505.fastq.gz | awk 'FNR==3'
```

```text
+SRR31340505.1 64c24fe2-0b6d-4128-8c18-d29740c4a838 length=432
```

```bash
# Basecalling scores
zcat reads/SRR31340505.fastq.gz | awk 'FNR==4'
```

```text
=>==<=>>;><<<<<;<99433+****2689:;<=9A@?>>>>65556@??>2+++,,6?A??>>>;:;99:98:;506;<>==<..?>>1115?=<>?><<===<=755569;;;8;/.-----/.-.)39:;;<===>==..8<:658:;:,++++47655546-----/./--.199:65688?===::9:8:889:9:04,,+2)(/6<;:999:;:::;<<242.+)()+,.167<=*((,++-++--3287323367;<???==<));::;<67><653346;:////098::<=87555112.1:<<;::;;;=><=;:9/..--3678;=:555559:;:63&&&&''''()2339:<;;<=<=<<<=<<:<;876000122.----325668::41),&&%$%%$%%%%(()11229''&%%%
```


Each header line is prefixed with the `@` symbol, followed by the run accession number (SRR31340505) and the index of the read within the run. To extract all headers, we use the `grep` tool, which scans each line of its input and prints only the lines that match our specified pattern:

```bash
zcat reads/SRR31340505.fastq.gz | grep "^@SRR" | head
```

```text
@SRR31340505.1 64c24fe2-0b6d-4128-8c18-d29740c4a838 length=432
@SRR31340505.2 990dc2b3-1a0b-4b3d-9082-248a77f64af0 length=359
@SRR31340505.3 2bacc2d6-1cd0-4350-adaf-ee83368b47d2 length=176
@SRR31340505.4 673843c3-f1d9-4b01-a64a-e6d8ac8918d9 length=174
@SRR31340505.5 42386295-1bbd-4dec-94f7-a45b91b110b2 length=188
@SRR31340505.6 9e5c8eec-0367-4757-ab60-eb3fdc299a6d length=240
@SRR31340505.7 e44c2b35-4476-4177-b1df-8098a78e99d2 length=248
@SRR31340505.8 e3ed03db-2051-4a44-8b28-bd8589d16c5d length=98
@SRR31340505.9 16377107-7359-475e-81f0-a047f88e7dd4 length=178
@SRR31340505.10 66a95941-638d-4d31-94a9-af394a2b72bf length=149
```


The pattern `^@SRR` tells `grep` to only print out lines that **start with @SRR**.
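Matching on a prefix works here because the pattern includes the accession, but `@` is also a valid quality character, so a bare `^@` pattern can pick up quality lines. As an alternative sketch, headers can be selected by position, assuming every read occupies exactly four lines. The toy file and read names below are made up for illustration.

```bash
workdir=$(mktemp -d)

# Toy FASTQ with two made-up reads. The second quality string deliberately
# starts with "@" to show why matching on "^@" alone can be fooled.
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\n@III\n' \
    | gzip -c > "$workdir/toy.fastq.gz"

# A bare "^@" pattern also catches the quality line of the second read.
by_pattern=$(zcat "$workdir/toy.fastq.gz" | grep -c '^@')

# Selecting lines 1, 5, 9, ... picks out the headers by position instead.
by_position=$(zcat "$workdir/toy.fastq.gz" | awk 'NR % 4 == 1' | wc -l)

echo "grep '^@': $by_pattern lines; awk by position: $by_position lines"
```

The positional approach does not depend on what the header looks like, but it breaks if a file wraps sequences over several lines.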

To count the number of reads, we could pipe the `grep` output to `wc` which stands for _word count_. We specify the `-l` flag to tell `wc` to count by line:

```bash
zcat reads/SRR31340505.fastq.gz | grep "^@SRR" | wc -l
```

```text
118391
```


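To make our counting more robust, we should make sure that each header is counted only once. Let's use the `uniq` tool to filter out duplicates. Keep in mind that `uniq` only collapses identical lines that sit next to each other, so this checks for consecutive repeats: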

```bash
zcat reads/SRR31340505.fastq.gz | grep "^@SRR" | uniq | wc -l
```

```text
118391
```


The count is unchanged, so the FASTQ file has no consecutively repeated headers. As seen above, there are a total of **118,391** reads.
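The last task on our list was to extract the headers to a separate file. A minimal sketch of that step is shown below on a toy file with made-up reads, using `>` to redirect output into a hypothetical `headers.txt`. It also shows that `uniq` on its own only removes adjacent repeats, whereas sorting first catches repeats anywhere in the file.

```bash
workdir=$(mktemp -d)

# Toy FASTQ in which the made-up read "r1" appears twice, non-adjacently.
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n@r1\nACGT\n+\nIIII\n' \
    | gzip -c > "$workdir/toy.fastq.gz"

# ">" redirects standard output into a file instead of the terminal.
zcat "$workdir/toy.fastq.gz" | grep '^@r' > "$workdir/headers.txt"

total=$(wc -l < "$workdir/headers.txt")

# uniq alone misses the repeat because the two "@r1" lines are not adjacent.
adjacent_only=$(uniq "$workdir/headers.txt" | wc -l)

# Sorting first brings identical lines together so they can be collapsed.
distinct=$(sort -u "$workdir/headers.txt" | wc -l)

echo "total: $total, uniq only: $adjacent_only, sort -u: $distinct"
```

For the real data, the same idea would apply with `reads/SRR31340505.fastq.gz` as input and the `^@SRR` pattern used earlier.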