# 4.1 De novo genome assembly

## 4.1.1 Assembly approaches

### 1. Briefly describe overlap-layout-consensus assembly.
The first step is to find overlaps between all pairs of reads. Based on these overlaps, a layout is generated on an overlap graph, that is, the order and orientation of the reads are decided, which yields contigs. The assembly is then completed by deriving a consensus sequence for each contig, resolving the discrepancies between the overlapping reads.
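
The overlap step can be illustrated with a minimal sketch. The helper `overlap` below is a hypothetical name, not part of any assembler; it only reports the longest exact suffix-prefix match between two reads, whereas real OLC assemblers tolerate mismatches and use indexing to avoid comparing every pair naively.

```bash
#!/bin/bash
# overlap READ_A READ_B [MIN]: length of the longest suffix of READ_A that
# exactly matches a prefix of READ_B (0 if shorter than MIN, default 3).
overlap() {
    awk -v a="$1" -v b="$2" -v min="${3:-3}" 'BEGIN {
        max = (length(a) < length(b)) ? length(a) : length(b)
        # Try the longest possible overlap first and shrink until one matches
        for (n = max; n >= min; n--) {
            if (substr(a, length(a) - n + 1) == substr(b, 1, n)) {
                print n
                exit
            }
        }
        print 0
    }'
}

overlap ACGTTGCA TGCAGGT   # prints 4 (suffix TGCA matches prefix TGCA)
```

In a full OLC assembler, such pairwise overlaps become the edges of the overlap graph from which the layout is derived.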

### 2. What is a de Bruijn graph?
It is a graph representation of sequence data: k-mers (substrings of length k) are generated from the reads and assigned as nodes, and two nodes are connected with a directed edge when the last k-1 characters of one k-mer match the first k-1 characters of the other.

### 3. How can you use a de Bruijn graph for de novo genome assembly?
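The k-mers generated from the sequencing reads are used as nodes. Contigs are then reconstructed by traversing the graph along paths that follow the k-1 overlaps. It should be noted that this method collapses repetitive regions: identical k-mers from different copies of a repeat share one node, which creates branches that cannot be resolved from the graph alone.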
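
As a rough sketch of the construction, the hypothetical helper `debruijn_edges` below lists the edges between consecutive k-mers of each read (one read per line on standard input). It is only an illustration under these assumptions; real assemblers also count k-mer coverage, handle reverse complements and simplify the graph.

```bash
#!/bin/bash
# debruijn_edges K: read sequences from stdin (one per line) and print each
# distinct edge "kmer -> next_kmer" between consecutive k-mers.
debruijn_edges() {
    awk -v k="$1" '{
        # Every pair of adjacent k-mers in a read overlaps by k-1 characters
        for (i = 1; i + k <= length($0); i++) {
            print substr($0, i, k) " -> " substr($0, i + 1, k)
        }
    }' | LC_ALL=C sort -u
}

printf '%s\n' ACGTACGA | debruijn_edges 3
# ACG -> CGA
# ACG -> CGT
# CGT -> GTA
# GTA -> TAC
# TAC -> ACG
```

The k-mer `ACG` occurs twice in the read but is a single node with two outgoing edges, which is the collapsing of repeats described above.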

### 4.
de Bruijn graphs are much more computationally efficient, as they avoid the all-vs-all read comparisons required by OLC, which is particularly important for high-throughput sequencing. Thanks to this, they also need less memory and give faster assembly times. Furthermore, because de Bruijn graphs collapse identical k-mers into one node, repeated sequence is stored compactly, although repeats longer than k remain ambiguous. They are particularly advantageous for large numbers of short reads, where sequencing errors show up as low-coverage k-mers that can be pruned from the graph.


## 4.1.2 Data preprocessing

### 1. Estimate the quality of raw data.

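```{bash}
# Run FastQC on every raw read file; the output directory must exist beforehand
mkdir -p p4/1-2-Data-Preprocessing/fastqc/
./programs/FastQC/fastqc -o p4/1-2-Data-Preprocessing/fastqc/ data/Capsella-sequencing/DNA-seq/rawdata/*.fq.gz

# Aggregate the FastQC reports into one MultiQC report (-o sets the output directory)
multiqc -o p4/1-2-Data-Preprocessing/fastqc/ p4/1-2-Data-Preprocessing/fastqc/
```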
MultiQC combines the individual FastQC reports into a single report, so that all files can be inspected together.
In the raw sequencing data, quality scores get worse towards the end of the reads, the average quality per read is low, and the duplication statistics are flagged as problematic.

### 2. Do quality trimming of the raw data. You may want to test multiple parameters.
```{bash}
outdir=~/p4/1-2-Data-Preprocessing/subsets-phred33-trimmomatic
indir=/mnt/silo/hts2024/data/Capsella-sequencing/DNA-seq/subsets-phred33
trimmomatic=/mnt/silo/hts2024/dnarabaci/programs/Trimmomatic/Trimmomatic-0.39/trimmomatic-0.39.jar

mkdir -p "$outdir/unpaired"

# Paired-end trimming of both samples; reads that lose their mate go to unpaired/.
# Steps: trim leading/trailing bases below Q20, cut where the mean quality of a
# 20-base window drops below 15, then discard reads shorter than 36 bases.
for sample in Cru Cgr; do
    java -jar "$trimmomatic" PE \
         -threads 4 \
         "$indir/${sample}_1.fq.gz" \
         "$indir/${sample}_2.fq.gz" \
         "$outdir/${sample}_1.fq.gz" \
         "$outdir/unpaired/${sample}_1_unpaired.fq.gz" \
         "$outdir/${sample}_2.fq.gz" \
         "$outdir/unpaired/${sample}_2_unpaired.fq.gz" \
         LEADING:20 \
         TRAILING:20 \
         SLIDINGWINDOW:20:15 \
         MINLEN:36
done

```
### 3. Do error correction of the raw data. You may want to test multiple programs and/or parameters.


```{bash}
#!/bin/bash
PATH=/mnt/silo/hts2024/dnarabaci/programs/bfc/bfc-master:$PATH

outdir=~/p4/1-2-Data-Preprocessing/subsets-phred33-bfc
mkdir -p "$outdir"   # must exist before gzip writes the corrected reads into it

# Error-correct every read file with bfc (-s: approximate genome size, -t: threads)
for f in /mnt/silo/hts2024/data/Capsella-sequencing/DNA-seq/subsets-phred33/C*.fq.gz; do
    sample=$(basename "$f" .fq.gz)
    echo "$sample $f"
    bfc \
        -s 1m \
        -t 4 \
        "$f" \
        | gzip > "$outdir/$sample.fq.gz"
done

### Quality control
mkdir -p "$outdir/fastqc"

/mnt/silo/hts2024/dnarabaci/programs/FastQC/fastqc \
    -o "$outdir/fastqc" \
    "$outdir"/*.fq.gz

```
### 4. Do adapter removal followed by error correction.

## 4.1.3 De novo genome assembly using a short read assembler
### 1.
SOAPdenovo2

### 2.

```{bash}

#!/bin/bash

PATH=/mnt/silo/hts2024/dnarabaci/programs/miniconda/bin:$PATH

k=51
base=~/p4/1-3-denovo-genome-assembly-using-short-assembler/soapdenovo

# Assemble C. rubella twice: from the raw reads and from the bfc error-corrected reads
for dataset in Cru-raw Cru-bfc; do
    mkdir -p "$base/$dataset/K$k"

    # -s: config file, -K: k-mer size, -R: resolve repeats using reads, -o: output prefix
    SOAPdenovo-63mer all \
      -s "$base/$dataset.conf" \
      -K "$k" \
      -R \
      -o "$base/$dataset/K$k" \
      1> "$base/$dataset/K$k-assembly.log" \
      2> "$base/$dataset/K$k-assembly.err"
done

```

### 3.

Capsella rubella, bfc corrected vs. raw:

| Metric                        | bfc corrected | raw     |
|-------------------------------|---------------|---------|
| Scaffold number               | 37            | 41      |
| In-scaffold contig number     | 504           | 593     |
| Total scaffold length         | 989536        | 988778  |
| Average scaffold length       | 26744         | 24116   |
| Filled gap number             | 163           | 152     |
| Longest scaffold              | 179362        | 156601  |
| Scaffold and singleton number | 124           | 208     |
| Scaffold and singleton length | 1016497       | 1020498 |
| Average length                | 8197          | 4906    |
| N50                           | 95324         | 65930   |
| N90                           | 15401         | 15327   |
| Weak points                   | 0             | 0       |



### 4.
With the bfc-corrected reads, the number of scaffolds is lower (37 vs. 41), the average scaffold length is higher (26744 vs. 24116), and the N50 is higher (95324 vs. 65930).

### 5.
Capsella grandiflora, raw vs. bfc corrected:

| Metric                        | raw     | bfc corrected |
|-------------------------------|---------|---------------|
| Scaffold number               | 1286    | 1204          |
| In-scaffold contig number     | 15737   | 14908         |
| Total scaffold length         | 743873  | 726989        |
| Average scaffold length       | 578     | 603           |
| Filled gap number             | 883     | 901           |
| Longest scaffold              | 8562    | 8586          |
| Scaffold and singleton number | 13177   | 12342         |
| Scaffold and singleton length | 1817294 | 1735882       |
| Average length                | 137     | 140           |
| N50                           | 290     | 301           |
| N90                           | 52      | 52            |
| Weak points                   | 0       | 0             |

The C. rubella assembly had far fewer scaffolds and much longer scaffolds than the C. grandiflora assembly, suggesting a more contiguous assembly. N50 is also a very important metric, and it was much higher for C. rubella, which again points to a higher quality assembly.
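
The figures above come from the assembler's own statistics. As a cross-check, scaffold lengths can also be taken directly from an assembly FASTA file. The helper `fasta_lengths` below is a hypothetical sketch, not part of SOAPdenovo2, and assumes a plain, uncompressed FASTA file.

```bash
#!/bin/bash
# fasta_lengths [FILE...]: print "name<TAB>length" for every FASTA record,
# reading from the given files or from stdin.
fasta_lengths() {
    awk '
        /^>/ {
            if (seen) print name "\t" len   # flush the previous record
            name = substr($1, 2)            # header up to the first space, without ">"
            len = 0
            seen = 1
            next
        }
        { len += length($0) }               # sequence may span several lines
        END { if (seen) print name "\t" len }
    ' "$@"
}

printf '>s1 desc\nACGT\nAC\n>s2\nGGG\n' | fasta_lengths
# s1	6
# s2	3
```

The second column can then be sorted or summarized to compare assemblies independently of the assembler's report.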

## 4.1.4 Evaluating your genome assembly

### 1.
NG50 scaffold length; NG50 contig length; amount of scaffold sequence in gene-sized scaffolds (≥25 Kbp); CEGMA, number of 458 core genes mapped; fosmid coverage; fosmid validity; VFR tag scaffold summary score; optical map data (Level 1); optical map data (Levels 1+2+3); REAPR summary score.

### 2.
Of these, NG50 scaffold length; NG50 contig length; amount of scaffold sequence in gene-sized scaffolds (≥25 Kbp); CEGMA, number of 458 core genes mapped; and REAPR summary score apply to de novo assemblies.
### 3. 
N50 is calculated by sorting all sequences from longest to shortest, summing their lengths cumulatively, and reporting the length of the sequence at which the running sum first reaches 50% of the total assembly length. NG50 is calculated the same way, except that the threshold is 50% of the expected genome size instead of the total assembly length.
### 4.
Unlike N50, NG50 is normalized to the expected genome length instead of the total assembly length. This makes the values comparable between assemblies of the same species (or genome size), even if one assembly is missing parts of the genome.
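
Both metrics can be computed with the same few lines. The sketch below is only an illustration: `nx50` is a hypothetical helper, and the lengths and the genome size of 400 are made-up numbers, not values from the assemblies above.

```bash
#!/bin/bash
# nx50 [GENOME_SIZE]: read one sequence length per line from stdin.
# Without an argument it prints N50 (threshold: half the assembly length);
# with GENOME_SIZE it prints NG50 (threshold: half the expected genome size).
nx50() {
    sort -rn | awk -v target="${1:-0}" '
        { len[NR] = $1; total += $1 }
        END {
            if (target == 0) target = total   # N50: use the assembly length
            for (i = 1; i <= NR; i++) {
                sum += len[i]
                if (sum >= target / 2) { print len[i]; exit }
            }
            print "NA"                        # assembly covers less than half the genome
        }'
}

printf '%s\n' 80 70 50 40 30 20 | nx50       # N50:  total 290, half 145 -> 70
printf '%s\n' 80 70 50 40 30 20 | nx50 400   # NG50: half 200 -> 50
```

With the same scaffolds, NG50 (50) is lower than N50 (70) because the assembly (290) is shorter than the assumed genome (400), which is exactly the case where N50 alone would look too optimistic.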

## 4.1.5 Other assembly programs

- SPAdes: de novo assembler designed for microbial genomes; other versions exist, but it is still not well suited to repetitive plant genomes.
- MaSuRCA: hybrid de novo assembler designed for large genomes; integrates Illumina and long-read data.
- MEGAHIT: ultra-fast, k-mer-based de novo assembler developed for metagenomes; low memory usage, scalable to large datasets.
- Canu: long-read assembler (PacBio / Nanopore); handles noisy long reads, has built-in error correction, good for high-contiguity assemblies.
- Flye: long-read assembler; very fast with long reads, works well with low-coverage data, good repeat resolution.
- wtdbg2 (Redbean): long-read assembler; extremely fast and lightweight, generates contiguous assemblies quickly.
- Hifiasm: HiFi read assembler (PacBio CCS reads); very accurate, produces haplotype-resolved assemblies, handles high ploidy.

Gorman, Z., Chen, J., de Leon, A.A.P. et al. Comparison of assembly platforms for the assembly of the nuclear genome of Trichoderma harzianum strain PAR3. BMC Genomics 24, 454 (2023). https://doi.org/10.1186/s12864-023-09544-6

## 4.1.7 Whole genome de novo assembly of Capsella rubella and Capsella grandiflora

### 1. 
MaSuRCA handles large eukaryotic genomes better than SOAPdenovo2, produces more contiguous assemblies, and integrates error correction.
### 2. 
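Use only the raw data for MaSuRCA: it performs its own error correction, so the reads should not be trimmed or corrected beforehand.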
### 3.
https://bioinformaticsworkbook.org/dataAnalysis/GenomeAssembly/Assemblers/MaSuRCA.html#gsc.tab=0

```{bash}
git clone https://github.com/ISUGIFsingularity/masurca.git

# Write the MaSuRCA configuration file
cat > sr_config.txt <<'EOF'
DATA
PE = pa 250 50 SRR3166543_1.fastq  SRR3166543_2.fastq
PE = pb 250 50 SRR3157034_1.fastq  SRR3157034_2.fastq
JUMP = ja 8000 1600 SRR3156163_1.fastq  SRR3156163_2.fastq
JUMP = jb 20000 4000 SRR3156596_1.fastq  SRR3156596_2.fastq
PACBIO = SRR3405330.fastq
END

PARAMETERS
GRAPH_KMER_SIZE = auto
USE_LINKING_MATES = 0
LIMIT_JUMP_COVERAGE = 300
CA_PARAMETERS =  cgwErrorRate=0.15
KMER_COUNT_THRESHOLD = 1
NUM_THREADS = 28
JF_SIZE = 200000000
SOAP_ASSEMBLY=0
END
EOF

# Generate assemble.sh from the configuration, then run the assembly.
# MASURCA is assumed to be the container wrapper that comes with the cloned repository.
MASURCA masurca sr_config.txt
MASURCA ./assemble.sh
```

## 4.1.8 Evolution of whole genome assembly
### 1.
Genome assembly has changed a lot over the past twenty years. The Human Genome Project (Frazier et al., 2001) used Sanger sequencing with a BAC-by-BAC approach, which was slow, labor-intensive, and left many gaps, especially in repetitive regions. By the time the panda genome was sequenced (Li et al., 2010), short-read next-generation sequencing made assemblies faster and cheaper, but they were still fragmented and struggled with repeats. More recently, genomes like those of mosquitoes (Matthews et al., 2018) and complex crops (Jarvis et al., 2017; Ling et al., 2018) have been assembled using long-read technologies, often combined with Hi-C or optical mapping, allowing for near-complete, chromosome-level assemblies that resolve difficult repetitive and structural regions. In short, genome assembly has moved from slow, gap-filled drafts to highly contiguous, high-quality genomes thanks to advances in sequencing and assembly strategies.

### 2.
Today, genome assembly is far more accurate and complete than ever before. In this rapidly evolving field, the greater computational power of newer hardware enables algorithms that were not feasible before. High-fidelity long reads from PacBio HiFi and ultra-long Oxford Nanopore sequencing make it possible to read long stretches of DNA with very few errors. These reads are often combined with Hi-C or optical mapping data to build chromosome-level assemblies that are phased, meaning the two copies of each chromosome can be distinguished. Automated tools like hifiasm, Verkko, and Shasta make the process much faster and easier, even for complex genomes. Today’s assemblies can capture not just the genes but also repetitive regions, centromeres, telomeres, and structural variations, giving a near-complete picture of the genome.

### 3.
Bacterial genome assembly is now much easier than for complex eukaryotes. Long reads from PacBio or Oxford Nanopore often allow entire chromosomes and plasmids to be assembled in a single, gapless contig. Because bacterial genomes are small, circular, and haploid, they don’t require phasing, but repetitive elements, small plasmids, and mixed-strain samples can still pose challenges. Overall, complete bacterial genomes are now routine, though tricky cases need careful sequencing and analysis.

## 4.1.9 De novo genome assembly with long reads



## 4.2 Structural and functional gene annotation
### 4.2.1 Methods of gene predictions

#### 1. 
A gene is a stretch of DNA that carries instructions for making a protein or functional RNA. Functionally, it’s defined by the role its product plays in the cell, shaping traits and processes. From an evolutionary view, a gene is a heritable unit that can change over generations, influencing an organism’s survival and evolution.
#### 2.
Intrinsic and extrinsic methods. Intrinsic (ab initio) methods predict genes from the sequence itself, without prior knowledge, while extrinsic (homology-based) methods use information from homologs.
#### 3. 
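GeneID and Augustus are intrinsic (ab initio) predictors, while Cufflinks and Scripture (which build gene models from RNA-seq reads) and SGP2 (which uses similarity to a second genome) are extrinsic.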


### 4.2.2 Aligning scaffolds against close species

#### 1.
Arabidopsis thaliana is a close relative of these species.
#### 2.
```
#!/bin/bash

PATH=/mnt/silo/hts2024/dnarabaci/programs/canu-2.2/canu/bin:$PATH

canu \
    -p unknown \
    -d assembly \
    useGrid=false \
    genomeSize=1.5m \
    -pacbio /mnt/silo/hts2024/data/unknown/unknown-pacbio.fq.gz
```

### 4.2.3 Gene predictions using GeneMark

#### 1. 
GeneMark-ES can be used for novel eukaryotic genomes, since it trains itself on the target genome and does not need an annotated training set.

### 4.2.4 Sequencing based gene predictions
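Trinity and Oases. Trinity works in three modules: Inchworm assembles the reads into contigs via k-mers, Chrysalis clusters the contigs and builds a de Bruijn graph for each cluster, modeling the isoforms of a gene, and Butterfly traverses these graphs to reconstruct the actual transcripts. Oases builds de Bruijn graphs from the reads using a different k-mer size for each graph, then merges all assemblies into one.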

