# Global recapitulation

## BASH (& R)

* Understand the framework and structures first.
* Understand what each parameter of a tool does before you apply it. If you don't know, read the documentation first.
* Second, try searching for specific questions (*Google* etc.).
* Only then ask ChatGPT for specific solutions.

### Framework

* We use the command line to execute things in a Terminal.
* There is a hierarchical file system: *directory/subdirectory/file.extension*.
* This file system is at the base of any computer, and the direct interface with a remote server.
* Files are often very large and therefore compressed, which is marked by extensions such as `.gz`.
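
As a minimal sketch of the points above, the following creates a small directory hierarchy, compresses a file, and reads it without unpacking it on disk. The directory and file names (`project/data/example.txt`) are made up for illustration.

```shell
# Work in a throwaway directory so nothing else is touched.
workdir=$(mktemp -d)
mkdir -p "$workdir/project/data"
cd "$workdir/project"

# Create a tiny text file (hypothetical name) and compress it.
printf 'line1\nline2\nline3\n' > data/example.txt
gzip data/example.txt        # replaces the file with data/example.txt.gz

pwd                          # where am I in the hierarchy?
ls data                      # what is in the subdirectory?

# Read the compressed file as a stream, without writing an unpacked copy.
n_lines=$(gzip -dc data/example.txt.gz | wc -l | tr -d ' ')
echo "lines in compressed file: $n_lines"
```

Streaming with `gzip -dc` (or `zcat`, where available) is usually preferable to unpacking, because large files then never exist uncompressed on disk.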

### Structures

* Syntax: We execute *commands* with *-arguments* to perform actions on *.files*. Commands take *input* and give *output*.
* We can re-use commands (*Ctrl-C Ctrl-V!*; scrolling through previous commands with the arrow keys), and autocomplete with the *TAB* key.
* This is not secret knowledge, but *open source*: most problems have been encountered before!

### Commands

* Basic commands for navigation: `cd`, `ls`, `pwd` etc.
* Basic commands to look at files, copy, remove: `less`, `wc`, `head`, `cp`, `rm` etc.
* Basic commands to subset/modify files: `grep`, `cut`, `awk`, `sort` etc.
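* Regular expressions allow complex patterns (for example for pattern search): `[0-9]`, `^` etc. Comparison and logical operators such as `==`, `>=`, `&&` are not regular expressions, but are used in conditions (for example in `awk`).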
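
A short sketch of how these commands combine, using a made-up tab-separated table (sample, chromosome, coverage). The file name and its contents are assumptions for illustration only.

```shell
workdir=$(mktemp -d)
table="$workdir/table.tsv"

# Hypothetical table: sample name, chromosome, coverage.
printf 'sampleA\tchr1\t31\nsampleB\tchr2\t8\nsampleC\tchr1\t22\nsampleD\tchrX\t15\n' > "$table"

# grep with a regular expression: count rows where "chr" is followed by a digit.
n_numbered=$(grep -c 'chr[0-9]' "$table")

# awk with comparison and logical operators: chr1 rows with coverage >= 20.
high_chr1=$(awk -F'\t' '$2 == "chr1" && $3 >= 20 {print $1}' "$table")

# cut + sort + head: the lowest coverage value in column 3.
lowest=$(cut -f3 "$table" | sort -n | head -n 1)

echo "rows on numbered chromosomes: $n_numbered"
echo "chr1 samples with coverage >= 20:"
echo "$high_chr1"
echo "lowest coverage: $lowest"
```

Note the division of labour: the pattern `[0-9]` is a regular expression matched against text, while `==`, `>=` and `&&` compare field values inside an `awk` condition.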

### Code and data

* We used a GitHub repository for the code - git is version control software that tracks changes to code.
* `R` can be used for statistics and visualisation (even as command line tool)!
* Typical uses: quality control (QC) of raw sequencing statistics and of filtering outcomes.
* For example: the distribution of coverage is good to look at!
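
Before plotting, a coverage distribution can already be summarised on the command line. The sketch below assumes a hypothetical three-column table (chromosome, position, depth); real depth tables may be laid out differently.

```shell
workdir=$(mktemp -d)
depth="$workdir/depth.tsv"

# Hypothetical per-base depth table: chromosome, position, depth.
printf 'chr1\t1\t10\nchr1\t2\t12\nchr1\t3\t10\nchr1\t4\t0\nchr1\t5\t28\n' > "$depth"

# Mean depth across all positions.
mean_depth=$(awk -F'\t' '{sum += $3} END {print sum / NR}' "$depth")
echo "mean depth: $mean_depth"

# Histogram: number of positions observed at each depth value.
cut -f3 "$depth" | sort -n | uniq -c > "$workdir/depth_hist.txt"
cat "$workdir/depth_hist.txt"
```

The resulting two-column histogram is small enough to read into `R` for plotting, even when the original depth table is very large.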


## Sequencing data

### Genomes and sequencing data

* A genome comes in FASTA format (`.fa`), which *always* looks like: a header row starting with `>`, then the sequence (possibly over many rows).
* A reference genome is an "assembly" of all the chromosomes, providing information on each base across them.
* Raw sequencing data comes in `.fastq` format, in a repeating pattern of *always* 4 rows per individual sequencing read (header, sequence, separator row `+`, quality scores).
* Sequencing data can have many rows (current state of technology: 1 billion reads for one sequencing experiment).
* Sequencing reads can contain adapter sequences; you may need to remove them (`trimmomatic`)!
* Check for quality (`FastQC`)!
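
The fixed layouts of both formats make simple counts easy. The sketch below builds two toy files (their names and contents are invented) and counts sequences and reads with basic commands.

```shell
workdir=$(mktemp -d)

# Toy FASTA: two sequences, the first one wrapped over two rows.
printf '>chr1\nACGTACGT\nACGT\n>chr2\nGGGCCC\n' > "$workdir/toy.fa"

# Toy FASTQ: two reads, four rows each (header, sequence, "+", qualities).
printf '@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\nII#I\n' > "$workdir/toy.fastq"

# FASTA: every sequence has exactly one header row starting with ">".
n_seqs=$(grep -c '^>' "$workdir/toy.fa")

# FASTQ: number of reads = number of rows divided by 4.
n_lines=$(wc -l < "$workdir/toy.fastq" | tr -d ' ')
n_reads=$((n_lines / 4))

# FASTQ: the sequence is the second row of every block of four.
seqs=$(awk 'NR % 4 == 2' "$workdir/toy.fastq")

echo "sequences in FASTA: $n_seqs"
echo "reads in FASTQ: $n_reads"
echo "$seqs"
```

Counting FASTQ reads with `grep -c '^@'` is tempting but unreliable, because a quality row may also start with `@`; dividing the row count by 4 relies only on the fixed layout.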

### Mapping

* Mapping: finding the right spot in the reference genome (`.fa`) for each read in `.fastq` with the `bwa` program.
* Mapped sequencing data is stored in `.bam` format (the binary, compressed version of the text format `.sam`), with *always* one line per read alignment and *always* the same 11 columns for position, sequence, quality scores etc. (and possibly extra columns).
* Header rows start with `@` and contain *metadata*.
* Common toolkit to analyse, generate statistics, filter: `samtools view`, `samtools sort`, `samtools stats` etc.
* PCR duplicates should be marked or removed; best practice: `gatk MarkDuplicates`.
* Sequencing coverage can be obtained with `mosdepth`.
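
To make the column layout concrete, the sketch below writes a toy alignment file in the text (`.sam`) form and inspects it with basic commands. The reads are invented; on real `.bam` files the same questions would normally be answered with `samtools` (for example `samtools view -c -F 4 file.bam` to count mapped reads), since `.bam` is binary and cannot be read with `awk` directly.

```shell
workdir=$(mktemp -d)
sam="$workdir/toy.sam"

# Toy SAM: two header rows, then three reads (the last one unmapped, flag 4).
{
  printf '@HD\tVN:1.6\tSO:coordinate\n'
  printf '@SQ\tSN:chr1\tLN:1000\n'
  printf 'read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n'
  printf 'read2\t16\tchr1\t200\t60\t4M\t*\t0\t0\tGGCA\tIIII\tNM:i:0\n'
  printf 'read3\t4\t*\t0\t0\t*\t*\t0\t0\tTTTT\tIIII\n'
} > "$sam"

# Header rows start with "@"; everything else is an alignment record.
n_header=$(grep -c '^@' "$sam")
n_records=$(grep -vc '^@' "$sam")

# Every record has at least the 11 mandatory columns.
min_cols=$(awk -F'\t' '!/^@/ { if (min == "" || NF < min) min = NF } END { print min }' "$sam")

# Column 2 is the flag; the bit with value 4 means "unmapped".
n_mapped=$(awk -F'\t' '!/^@/ && int($2 / 4) % 2 == 0 { n++ } END { print n + 0 }' "$sam")

echo "header rows: $n_header, records: $n_records"
echo "fewest columns in a record: $min_cols"
echo "mapped reads: $n_mapped"
```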


## Genotype data

### Genotype files

* Genotype calling: determine the allele state at each position in the genome (`.fa`) from the mapped sequencing reads (`.bam`).
* Commonly used program: `gatk HaplotypeCaller` - this is the standard for human genotype calling; the `-ERC` option determines whether only variant sites are emitted (`NONE`), every site individually (`BP_RESOLUTION`), or all sites with non-variant stretches condensed into blocks (`GVCF`).
* Genotypes are in `vcf` format, with *always* one line per position (site) - but in a "gvcf" file, a line can represent a region.
* `vcf` files *always* have 9 columns with information (position, alleles), and then from column 10 on genotypes per sample.
* Header rows start with `#` and contain *metadata*.
* There are individual `vcf` files from new sequencing data, and population-scale data in multi-individual `vcf` files.
* Possibly 1000s of individuals can be in one `vcf` file!
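
The layout described above can be checked on a toy file. The sketch below writes a small, invented `vcf` with two individuals (`ind1`, `ind2`) and counts header rows, sites and samples with basic commands.

```shell
workdir=$(mktemp -d)
vcf="$workdir/toy.vcf"

# Toy VCF: metadata rows (##), one column-name row (#CHROM), three sites.
{
  printf '##fileformat=VCFv4.2\n'
  printf '##contig=<ID=chr1>\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tind1\tind2\n'
  printf 'chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t1/1\n'
  printf 'chr1\t200\t.\tC\tT,G\t12\tPASS\t.\tGT\t1/2\t0/0\n'
  printf 'chr1\t300\t.\tG\tA\t99\tPASS\t.\tGT\t0/0\t0/1\n'
} > "$vcf"

# Header rows start with "#"; every other row is one site.
n_header=$(grep -c '^#' "$vcf")
n_sites=$(grep -vc '^#' "$vcf")

# Samples start at column 10, so their number is the column count minus 9.
n_samples=$(awk -F'\t' '/^#CHROM/ { print NF - 9 }' "$vcf")

# Position and genotype of the first individual (column 10) at each site.
awk -F'\t' '!/^#/ { print $2, $10 }' "$vcf"

echo "header rows: $n_header, sites: $n_sites, samples: $n_samples"
```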


### Filtering

* A toolkit to analyse, generate statistics, filter: `bcftools`.
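* `bcftools view` for regions (`-r`), individuals (`-s`), sites private to the selected individuals (`-x`, used together with `-s`).
* `bcftools view -m2 -M2` to select positions with *at least* 2 alleles (`REF` + `ALT` field) and *at most* 2 alleles, i.e. biallelic sites.
* `bcftools view -a` after subsetting to trim `ALT` alleles that no longer occur in the remaining genotypes; add `--min-ac 1` to retain only sites where an alternative allele is still present.
* `bcftools concat` to combine files across positions (same individuals).
* `bcftools merge` to combine files across individuals.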
* `bcftools filter` for many possible things (`-i` to include *EXPRESSION*, `-e` to exclude).
* Beware of input and output: `-Oz` defines compressed `vcf` as output format, `-o filename.vcf.gz` writes directly to a file, piping is possible, and beware of the header (`awk` and other tools may remove it).
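
The header caveat is easiest to see on a toy file. The sketch below filters an invented `vcf` with plain `awk`, once naively (the header is lost) and once keeping header rows explicitly. The condition keeps biallelic sites with `QUAL` of at least 20, which roughly corresponds to `bcftools view -m2 -M2 -i 'QUAL>=20'`; it is an illustration of the pitfall, not a replacement for `bcftools`.

```shell
workdir=$(mktemp -d)
vcf="$workdir/toy.vcf"

# Toy VCF with one multiallelic, low-quality site (position 200).
{
  printf '##fileformat=VCFv4.2\n'
  printf '##contig=<ID=chr1>\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tind1\tind2\n'
  printf 'chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t1/1\n'
  printf 'chr1\t200\t.\tC\tT,G\t12\tPASS\t.\tGT\t1/2\t0/0\n'
  printf 'chr1\t300\t.\tG\tA\t99\tPASS\t.\tGT\t0/0\t0/1\n'
} > "$vcf"

# Naive filter: selecting on column 1 silently drops every header row.
n_header_naive=$(awk -F'\t' '$1 == "chr1"' "$vcf" | awk '/^#/ { n++ } END { print n + 0 }')

# Header-preserving filter: pass rows starting with "#" through unchanged,
# then keep sites with exactly one ALT allele (no comma) and QUAL >= 20.
# Plain gzip is used for illustration; indexing tools expect bgzip output,
# which `bcftools view -Oz` would produce.
awk -F'\t' '/^#/ || ($5 != "." && $5 !~ /,/ && $6 >= 20)' "$vcf" \
  | gzip -c > "$workdir/filtered.vcf.gz"

n_header_out=$(gzip -dc "$workdir/filtered.vcf.gz" | grep -c '^#')
n_sites_out=$(gzip -dc "$workdir/filtered.vcf.gz" | grep -vc '^#')

echo "header rows after naive filter: $n_header_naive"
echo "header rows kept: $n_header_out, sites kept: $n_sites_out"
```

A `vcf` without its header is rejected by most downstream tools, so it is worth checking the first rows of any filtered output (`gzip -dc file.vcf.gz | head`).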
