# Whole Genome Sequencing 101

This notebook walks through the main steps of a whole genome sequencing (WGS) project. It presents:

1. [**DNA Sequencing**](#part1) - obtain *reads* from DNA fragments.
2. [**Genome Assembly or Mapping**](#part2) - put together a *genome* from reads.
3. [**Genome Annotation**](#part3) - document the *functions* of the genome.

**Remember**, this notebook is not a complete tutorial! It only presents the key steps of WGS to
demonstrate their algorithmic complexity. A real bioinformatics pipeline typically does not rely
purely on a Python script; it runs standalone programs instead. To learn how to build a full WGS
pipeline, please refer to:

- https://pmc.ncbi.nlm.nih.gov/articles/PMC10646344/ for *de novo* WGS (i.e. use that when
  studying an organism or strain with no reference genome)
- https://bmcmedgenomics.biomedcentral.com/articles/10.1186/s12920-023-01495-x for *reference-based*
  WGS (i.e. use that when you have a reference genome for your organism / strain)

This notebook uses the sequencing data available at https://pmc.ncbi.nlm.nih.gov/articles/PMC9299564/

## Configuration

Before we start, run the following cells to ensure that your environment is functional!

In [2]:
ACCESSION_NUMBER = "SRR10256704"  # Influenza virus sequence reads.
SEQUENCE_FILENAME = "sequence.fastq"  # Used to store the sequence in a local file.
EMAIL = "therrien.vincent.2@courrier.uqam.ca"  # Not obligatory, but it is courteous to tell the NCBI who you are when downloading data :)

In [3]:
import os
from Bio import SeqIO
from Bio import Entrez

<a id='part1'></a>

## 1. DNA Sequencing

DNA sequencing consists of reading the nucleotides of DNA fragments to obtain *reads*.

In [43]:
def download_sequence(filename: str, accession_number: str) -> None:
    """Download a sequence from the NCBI.

    Args:
        filename: Name of the local file in which to write the sequence.
        accession_number: ID of the sequence to download.
    """
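    Entrez.email = EMAIL
    net_handle = Entrez.efetch(
        db="sra", id=accession_number, rettype="gb", retmode="text"
    )
    data = net_handle.read()
    net_handle.close()
    # Depending on the Biopython version and the response type, read() may
    # return either bytes or str, so decode only when needed.
    if isinstance(data, bytes):
        data = data.decode("utf-8")
    # The context manager closes the file even if the write fails.
    with open(filename, "w") as out_handle:
        out_handle.write(data)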

In [42]:
print("Downloading a sequence!")
download_sequence(SEQUENCE_FILENAME, ACCESSION_NUMBER)
print("First few lines of the downloaded file:")
with open(SEQUENCE_FILENAME, "r") as f:
    # Read at most the first 10,000 characters instead of loading the whole file.
    print(f.read(10000))

#record = SeqIO.read(filename, "fasta")
#print(record)

Downloading a sequence!
First few lines of the downloaded file:

1. Genome characterization and mutation analysis of influenza virus by NGS technology
BioProject Accession: PRJNA576776
ID: 576776
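
The file name `sequence.fastq` suggests that the reads are expected in FASTQ format. As a minimal sketch of what such a file holds, assuming the usual four-line record layout and Phred+33 quality encoding, the cell below parses a small hand-written example with the standard library only. The record is made up for illustration and is not taken from SRR10256704, and `FastqRecord` and `parse_fastq` are hypothetical names. In practice, `SeqIO.parse(handle, "fastq")` from Biopython does this job.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]  # One Phred score per base.


def parse_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream (four lines per record, no line wrapping)."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # End of file (or an empty line): stop parsing.
            return
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"Malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"Sequence/quality length mismatch in {header!r}")
        # Phred+33: the quality score is the ASCII code of the character minus 33.
        scores = [ord(char) - 33 for char in quality]
        yield FastqRecord(header[1:], sequence, scores)


# Made-up record, for illustration only.
EXAMPLE_FASTQ = "@read1\nACGT\n+\nII5!\n"

records = list(parse_fastq(io.StringIO(EXAMPLE_FASTQ)))
for record in records:
    print(record.name, record.sequence, record.qualities)
```

Each base gets its own quality score, which is why the sequence and quality lines must have the same length.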




<a id='part2'></a>

## 2. Genome Assembly or Mapping
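
As noted in the introduction, a WGS project either maps reads to a reference genome or assembles them *de novo*. The toy sketch below contrasts the two on error-free reads. The sequences are made up and the helper names (`map_read`, `overlap`, `greedy_assemble`) are hypothetical. Exact-match lookup and greedy overlap merging are stand-ins chosen for brevity; real aligners and assemblers use other methods, since they must cope with sequencing errors, repeats, and millions of reads.

```python
from typing import List, Optional


def map_read(reference: str, read: str) -> Optional[int]:
    """Return the 0-based position of the first exact match, or None."""
    position = reference.find(read)
    return position if position != -1 else None


def overlap(left: str, right: str, min_length: int = 3) -> int:
    """Length of the longest suffix of `left` equal to a prefix of `right`.

    Returns 0 if no overlap of at least `min_length` exists.
    """
    for length in range(min(len(left), len(right)), min_length - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0


def greedy_assemble(reads: List[str], min_length: int = 3) -> List[str]:
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # Drop duplicate reads, keep order.
    while True:
        best_length, best_i, best_j = 0, -1, -1
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i != j:
                    length = overlap(left, right, min_length)
                    if length > best_length:
                        best_length, best_i, best_j = length, i, j
        if best_length == 0:  # Nothing left to merge.
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_length:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Made-up data: three overlapping, error-free reads from a 15-base "genome".
reference = "ATGGCGTACGTTAGC"
reads = ["ATGGCGTA", "CGTACGTT", "CGTTAGC"]

positions = [map_read(reference, read) for read in reads]  # Reference-based.
contigs = greedy_assemble(reads)                           # De novo.
print(positions)
print(contigs)
```

Even in this toy version, assembly compares every pair of sequences on every round while mapping does a single lookup per read. This hints at why *de novo* assembly is the more expensive of the two.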

<a id='part3'></a>

## 3. Genome Annotation