In [2]:
import os
import pysam
import pandas as pd
from Bio import SeqIO

# Bioinformatic File Formats Notebook

## Introduction

Loka works with clients whose needs vary with their use cases and business requirements. This means we need to know the file formats we may encounter, the storage technologies that can hold them, the processing they may undergo, and the analyses and retrieval methods applied to the information.

This guide is designed to provide a foundation for working with the different genomic file formats.

First of all, it is important to determine the kind of life science information we are dealing with. In the life sciences we encounter omics information (genomic, proteomic, and metabolomic), molecular and physico-chemical information, and clinical information. This guide covers only genomic information.

In this notebook, we will explore some of the most common bioinformatics file formats: BCL, SRA, FASTA, BAM, BED, and VCF. We will see how these files are structured and how to read them using Python.

It is important to first understand the process that produces bioinformatic data.

1. Sample Collection
   - Obtaining Biological Samples: Biological samples such as blood, tissues, or cells from plants and animals are collected.
   - Sample Preparation: The samples are prepared for analysis, which may include extracting DNA or RNA.
2. Sequencing
   - Sequencing Machines: The DNA or RNA samples are loaded into a sequencing machine. Illumina machines are very common.
   - Data Generation: The machine reads the sequences of nucleotides (A, T, C, G) in the DNA or RNA and generates sequence data.
3. Generation of Raw Data (BCL)
   - BCL Files: Sequencing machines generate BCL files, which contain raw data of the base calls (the "letters" of the sequences) and their quality.
4. Data Conversion (FASTQ)
   - Conversion to FASTQ: The BCL files are converted to a more manageable format called FASTQ using specialized software. FASTQ files contain the DNA/RNA sequences and the quality of each base.
   - Storage and Preprocessing: FASTQ files are stored and preprocessed to remove low-quality data or contaminants.
5. Bioinformatics Analysis
   - Sequence Analysis: FASTQ files are analyzed to identify genes, genetic variants, or to assemble complete genomes.
   - Comparison with Databases: The sequences are compared with known sequence databases (e.g., using BLAST).
6. Data Storage and Sharing (SRA)
   - SRA Files: For archiving and sharing, sequence data can be stored in the Sequence Read Archive (SRA) format, which includes sequence reads and associated metadata.
   - Data Submission: Researchers submit their sequencing data to repositories like NCBI, where it is stored in SRA format for public access and further research.

In summary, the process starts with collecting and preparing biological samples, sequencing the DNA or RNA to generate raw data (BCL files), converting these to FASTQ files for analysis, and finally storing and sharing the data in SRA format. Each step involves specific tools and techniques to ensure the data is accurate, manageable, and useful for research.
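
Step 4 of the process notes that a FASTQ file stores each sequence together with a quality value per base. The sketch below illustrates that idea in plain Python. It assumes the common Phred+33 quality encoding and uses a made-up record; `parse_fastq_record` is a hypothetical helper written for this notebook, not part of any library.

```python
# Hypothetical FASTQ record: header, sequence, separator, and quality lines.
fastq_record = """@read1
AGCTAGCT
+
IIIIHH#!
"""


def parse_fastq_record(text):
    """Split one 4-line FASTQ record and decode its quality string."""
    header, sequence, separator, quality = text.strip().split("\n")
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    # Assumption: Phred+33 encoding, so score = ASCII code of the character - 33.
    scores = [ord(ch) - 33 for ch in quality]
    return header[1:], sequence, scores


read_id, sequence, scores = parse_fastq_record(fastq_record)
print(read_id, sequence, scores)
```

Low scores (such as the last two bases here) are what the preprocessing step typically trims or filters out.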

That said, we can proceed to explore the structure of each of the non-binary file formats and how to operate on them.


## Requirements
To follow along with this notebook, you need to install the following libraries:
```bash
pip install biopython pysam pandas
```

### FASTA Format
#### Description
The FASTA format is used to represent nucleotide or protein sequences. Each sequence begins with a description line starting with '>', followed by lines of sequence data.
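
To make the layout concrete, here is a minimal sketch of a FASTA parser in plain Python. The sample text and the `parse_fasta` helper are illustrative assumptions; for real files, a dedicated library such as Biopython is the safer choice.

```python
# Hypothetical FASTA text; a sequence may wrap across several lines.
fasta_text = """>seq1 example description
AGCTAGCTAG
CTACGATCG
>seq2
CGATCGATCGATCGATCGA
"""


def parse_fasta(text):
    """Return a dict mapping each record ID to its full sequence."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The ID is the first word after '>'; the rest is a free-text description.
            current_id = line[1:].split()[0]
            records[current_id] = []
        elif current_id is not None:
            records[current_id].append(line)
    # Join the wrapped lines of each record into one sequence string.
    return {seq_id: "".join(parts) for seq_id, parts in records.items()}


sequences = parse_fasta(fasta_text)
print(sequences)
```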

##### Reading FASTA Files in Python
We will use the Biopython library to read FASTA files.

In [6]:
from Bio import SeqIO

# Read FASTA file
fasta_file = "../../../../data/fasta_sample.fasta"
for record in SeqIO.parse(fasta_file, "fasta"):
    print(f"ID: {record.id}")
    print(f"Sequence: {record.seq}")

ID: seq1
Sequence: AGCTAGCTAGCTACGATCG
ID: seq2
Sequence: CGATCGATCGATCGATCGA


### BAM Format

#### Description
The BAM format is the binary version of the SAM format, used to store sequence alignments. It is space-efficient and allows for fast access.

BAM files cannot be displayed directly because they are binary.
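
Although BAM itself is binary, the SAM text it encodes is tab-separated and starts with eleven mandatory fields. The sketch below parses one such line in plain Python; the alignment line is made up and `parse_sam_line` is a hypothetical helper, shown only to illustrate the record layout.

```python
# The eleven mandatory SAM fields, in order.
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

# Hypothetical SAM alignment line (tab-separated).
sam_line = "read1\t0\tchr1\t1001\t60\t8M\t*\t0\t0\tAGCTAGCT\tIIIIIIII"


def parse_sam_line(line):
    """Map the mandatory SAM fields of one alignment line to a dict."""
    values = line.rstrip("\n").split("\t")
    if len(values) < len(SAM_FIELDS):
        raise ValueError("a SAM alignment line needs at least 11 fields")
    # zip stops after the 11 mandatory fields, ignoring optional tags.
    record = dict(zip(SAM_FIELDS, values))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    return record


alignment = parse_sam_line(sam_line)
# SAM POS is 1-based, whereas pysam's reference_start is 0-based.
zero_based_start = alignment["POS"] - 1
print(alignment["QNAME"], alignment["RNAME"], zero_based_start)
```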

##### Reading BAM Files in Python
We will use the pysam library to read BAM files.

In [None]:
import pysam

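# Read BAM file
bam_file = "../../../../data/bam_sample.bam"
# The context manager closes the file; until_eof=True reads every record
# in file order without requiring a .bai index.
with pysam.AlignmentFile(bam_file, "rb") as bam:
    for read in bam.fetch(until_eof=True):
        print(f"Read ID: {read.query_name}")
        print(f"Sequence: {read.query_sequence}")
        # reference_start is a 0-based position
        print(f"Alignment start: {read.reference_start}")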

### BED Format

#### Description
The BED format is used to describe genomic features such as regions of interest using coordinates. The first three columns are: chromosome name, feature start, and feature end.
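
BED coordinates are 0-based and half-open: the start position is included and the end position is not, so a feature's length is simply `end - start`. The sketch below shows this arithmetic with two intervals; `BedInterval`, `interval_length`, and `overlap` are hypothetical names chosen for illustration.

```python
from typing import NamedTuple


class BedInterval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive


def interval_length(interval):
    """Length in bases of a half-open interval."""
    return interval.end - interval.start


def overlap(a, b):
    """Number of bases shared by two intervals (0 if they do not overlap)."""
    if a.chrom != b.chrom:
        return 0
    return max(0, min(a.end, b.end) - max(a.start, b.start))


feature1 = BedInterval("chr1", 1000, 2000)
feature2 = BedInterval("chr1", 1500, 2500)
print(interval_length(feature1), overlap(feature1, feature2))
```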

#### Example BED File

In [9]:
import pandas as pd

# Read BED file
bed_file = "../../../../data/bed_sample.bed"
# Split on any run of whitespace so that tab- and space-separated files both parse
bed_df = pd.read_csv(bed_file, sep=r"\s+", header=None, names=["chrom", "start", "end", "name"])
print(bed_df)

  chrom  start   end      name
0  chr1   1000  2000  feature1
1  chr1   1500  2500  feature2


### VCF Format

#### Description
The VCF (Variant Call Format) is used to store genomic variants such as SNPs, indels, and structural variants.

In [10]:
import pysam

# Read VCF file
vcf_file = "../../../../data/vcf_sample.vcf"
# Iterating over the file directly reads every record and needs no index
with pysam.VariantFile(vcf_file) as vcf:
    for record in vcf:
        print(f"Chromosome: {record.chrom}")
        print(f"Position: {record.pos}")
        print(f"ID: {record.id}")
        print(f"Reference: {record.ref}")
        print(f"Alternate: {record.alts}")
        print(f"Quality: {record.qual}")
        print(f"Filter: {record.filter.keys()}")
        print(f"Info: {dict(record.info)}")

[E::bcf_hdr_parse_sample_line] Could not parse the "#CHROM.." line, either the fields are incorrect or spaces are present instead of tabs:
	#CHROM  POS  ID  REF  ALT  QUAL  FILTER  INFO



ValueError: file `b'../../../../data/vcf_sample.vcf'` does not have valid header (mode=`b'r'`) - is it VCF/BCF format?
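
The error above reports that the `#CHROM` line uses spaces where VCF requires tabs. One possible repair, sketched here, rewrites the column header and data lines with tab separators. It assumes that no field on those lines contains a space of its own; `spaces_to_tabs` is a hypothetical helper and the sample text only imitates the failing file.

```python
import re


def spaces_to_tabs(vcf_text):
    """Replace runs of whitespace with tabs on header and data lines."""
    fixed = []
    for line in vcf_text.splitlines():
        if line.startswith("##") or not line.strip():
            # Meta-information lines may legitimately contain spaces; keep them.
            fixed.append(line)
        else:
            # Assumption: fields on these lines never contain spaces themselves.
            fixed.append(re.sub(r"\s+", "\t", line.strip()))
    return "\n".join(fixed) + "\n"


# Hypothetical space-separated content resembling the failing sample.
broken = (
    "##fileformat=VCFv4.2\n"
    "#CHROM  POS  ID  REF  ALT  QUAL  FILTER  INFO\n"
    "chr1  1000  rs1  A  G  50  PASS  .\n"
)
repaired = spaces_to_tabs(broken)
print(repaired)
```

The repaired text can then be written to a new file and the read retried with `pysam.VariantFile`.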

Now that we have seen the structure of the most basic non-binary formats, we can start exploring some good architecture practices for storing and processing them.

Depending on the kind of project, we will need to store and process different genomic formats. Let's suppose we have a client who needs to store every kind of bioinformatic format: raw reads (BCL) and processed files (FASTA, FASTQ, GFF3, SAM, VCF).

Raw read formats can be stored in a data lake on S3, where they can be indexed and partitioned for easier access using key patterns such as tenant/year/month/day/file/. These files are then processed by the secondary analyses that generate the other formats, from which we extract the information of interest to produce biological insights.
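
As an illustration of the partitioning pattern, the sketch below builds an object key of the form tenant/year/month/day/file. The function name, tenant, and file name are hypothetical; zero-padding the month and day is an assumption made here so that keys sort chronologically.

```python
from datetime import date


def build_raw_read_key(tenant, run_date, filename):
    """Build an object key following the tenant/year/month/day/file pattern."""
    # Zero-padded month and day keep lexicographic order equal to date order.
    return f"{tenant}/{run_date:%Y}/{run_date:%m}/{run_date:%d}/{filename}"


key = build_raw_read_key("tenant-a", date(2024, 3, 7), "run001.bcl")
print(key)
```

Listing everything under a prefix such as `tenant-a/2024/03/` would then select one tenant's files for a given month.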

BCL to (FASTA or FASTQ)