Bioinformatic File

ammykam edited this page Jul 26, 2023 · 2 revisions

Welcome to the Data Commons Bioinformatics File Types page. Here, you will find comprehensive information about the various types of bioinformatic files that are commonly encountered within the Data Commons platform. This guide will help you better understand the diverse range of data formats used in the field of bioinformatics.


Bioinformatic files come in several different formats. Each format has its own use cases, depending on: compatibility with specific software; data processing, parsing, and human-readability needs; and storage efficiency.

Sequence format

FASTA

  • is a file format for storing sequence data with annotations.
  • is a text-based format for representing either nucleotide sequences or peptide sequences, in which nucleotides or amino acids are represented using single-letter codes.
  • A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. It is recommended that all lines of text be shorter than 80 characters.
  • Code Example
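A minimal Python sketch of reading the layout described above: a description line starting with ">", followed by one or more lines of sequence data. The function name `parse_fasta` and the example records are hypothetical, and taking the first word of the description line as the record ID is an assumption made for illustration.

```python
from typing import Dict, Iterable, Optional


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {record ID: sequence} from the lines of a FASTA file."""
    records: Dict[str, str] = {}
    current_id: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Description line: the first word after ">" is used as the ID
            # (an assumption made for this sketch).
            words = line[1:].split()
            current_id = words[0] if words else ""
            records[current_id] = ""
        elif current_id is not None:
            # Sequence data may be wrapped over several lines; join them.
            records[current_id] += line
    return records


# Hypothetical example data: one nucleotide and one peptide record.
example = """>seq1 hypothetical nucleotide record
ACGTACGTAC
GTTAGC
>seq2 hypothetical peptide record
MKVLAAGIV
"""

fasta_records = parse_fasta(example.splitlines())
for seq_id, seq in fasta_records.items():
    print(seq_id, len(seq), seq)
```

For real projects, an established parser such as the one in Biopython is usually a better choice than a hand-written one.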

FASTQ

  • is a text file that contains the sequence data from the clusters that pass filter on a flow cell. If samples were multiplexed, the first step in FASTQ file generation is demultiplexing.
  • is a human-readable file format that stores the nucleotide base sequences, the calculated confidence for each base in a sequence, and information describing the origin of the read down to its position on the sequencing platform flow-cell.
  • An untrimmed, unfiltered FASTQ file is considered the standard for “raw” sequence in a study and should always be maintained as a permanent part of the study’s data.
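To illustrate how the per-base confidence is stored, here is a small Python sketch that reads FASTQ records and decodes their quality strings. It assumes the common four-line record layout (an "@" header, the sequence, a "+" separator, and a quality string) and Phred+33 quality encoding; the function name `parse_fastq` and the example read are hypothetical.

```python
from typing import Iterator, List, Tuple

PHRED_OFFSET = 33  # assumption: Phred+33 encoding of quality characters


def parse_fastq(text: str) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (header, sequence, per-base quality scores) for each record.

    Assumes unwrapped records of exactly four lines each.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    for i in range(0, len(lines), 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch: {header!r}")
        # Each quality character encodes one score: Q = ord(char) - offset.
        scores = [ord(ch) - PHRED_OFFSET for ch in qual]
        yield header[1:], seq, scores


# Hypothetical single-read example.
example_fastq = "@read1 hypothetical\nACGT\n+\nII5!\n"

fastq_records = list(parse_fastq(example_fastq))
for name, seq, scores in fastq_records:
    # A Phred score Q corresponds to an error probability of 10 ** (-Q / 10).
    error_probs = [10 ** (-q / 10) for q in scores]
    print(name, seq, scores, error_probs)
```

Higher scores mean higher confidence: under this encoding a score of 20 corresponds to a 1 in 100 chance that the base call is wrong.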

Compare FASTA and FASTQ

  • FASTQ is an extension of the FASTA file format, allowing for the storage of sequencing quality data along with the sequence itself and the sequence ID.
  • FASTA files are the most common standard for storing reference or consensus sequence data, while FASTQ is the most common format for storing raw sequence data.
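The relationship between the two formats can be sketched in Python by converting FASTQ text to FASTA: the sequence ID and sequence are kept, and the quality data is dropped. This assumes four-line FASTQ records; the function name `fastq_to_fasta` and the example read are hypothetical.

```python
def fastq_to_fasta(fastq_text: str) -> str:
    """Convert four-line FASTQ records to FASTA, discarding quality data."""
    lines = [ln for ln in fastq_text.splitlines() if ln.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    out = []
    for i in range(0, len(lines), 4):
        header, seq = lines[i], lines[i + 1]
        if not header.startswith("@"):
            raise ValueError(f"malformed FASTQ header: {header!r}")
        out.append(">" + header[1:])  # "@id ..." becomes ">id ..."
        out.append(seq)               # the sequence is kept unchanged
        # lines[i + 2] ("+") and lines[i + 3] (qualities) are dropped
    return "".join(line + "\n" for line in out)


# Hypothetical single-read example.
converted = fastq_to_fasta("@read1 sample\nACGT\n+\nIIII\n")
print(converted, end="")
```

The conversion is one-way: once the quality line is discarded it cannot be recovered from the FASTA output.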

Alignment format

  • In bioinformatics, alignment data for large numbers of aligned reads are often output as a Sequence Alignment/Map (SAM) or Binary Alignment/Map (BAM) file.
  • Alignment is a common step in many bioinformatics workflows involving nucleic acid sequencing.

SAM

  • is a text file format that contains the alignment information of sequences that are mapped against reference sequences.
  • These files can also contain unmapped sequences. Because SAM is a text format, it is more readable by humans than a binary format.
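Because SAM is plain text, a record can be inspected with ordinary string handling. The Python sketch below relies on details of the SAM format that are not spelled out on this page: header lines start with "@", each alignment line has at least 11 tab-separated fields, and bit 0x4 of the FLAG field marks an unmapped read. The names `SamRecord` and `parse_sam_line` and the example lines are hypothetical.

```python
from typing import NamedTuple, Optional


class SamRecord(NamedTuple):
    qname: str   # read name
    flag: int    # bitwise flags
    rname: str   # reference sequence name ("*" if none)
    pos: int     # 1-based leftmost mapping position (0 if unavailable)
    mapq: int    # mapping quality
    cigar: str   # alignment description ("*" if unavailable)
    seq: str     # read sequence
    qual: str    # per-base quality string

    @property
    def is_unmapped(self) -> bool:
        return bool(self.flag & 0x4)


def parse_sam_line(line: str) -> Optional[SamRecord]:
    """Parse one SAM line; return None for header lines."""
    if line.startswith("@"):
        return None  # header line, not an alignment record
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("SAM alignment lines have at least 11 fields")
    return SamRecord(
        qname=fields[0], flag=int(fields[1]), rname=fields[2],
        pos=int(fields[3]), mapq=int(fields[4]), cigar=fields[5],
        seq=fields[9], qual=fields[10],
    )


# Hypothetical example: one header line, one mapped and one unmapped read.
sam_lines = [
    "@HD\tVN:1.6\tSO:coordinate",
    "read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTGA\tIIII",
]
sam_records = [r for r in map(parse_sam_line, sam_lines) if r is not None]
for rec in sam_records:
    status = "unmapped" if rec.is_unmapped else f"{rec.rname}:{rec.pos}"
    print(rec.qname, status)
```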

BAM

  • Most bioinformatics tools accept and expect alignment results in BAM format.
  • From these files, with downstream bioinformatics analysis, you can compare gene expression, survey biodiversity, analyze DNA methylation, or investigate DNA-protein interaction, among many other NGS applications.
  • A BAM file (*.bam) is the compressed binary version of a SAM file, used to represent aligned sequences of up to 128 Mbp.
  • BAM files contain the same information as SAM files, except that they are in a binary file format, which is not readable by humans.
  • BAM files are smaller and more efficient for software to work with than SAM files – saving time and reducing costs of computation and storage.
  • BAM files are often accompanied by a BAM index file, also known as a BAI file, with a similar name. This file will always be much smaller than the BAM file and acts as a “table of contents” for the BAM file.
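Because BAM is binary, it cannot be read as text, but a quick sanity check is still possible with the Python standard library. The sketch below relies on two format details not stated on this page: BAM data is compressed in a gzip-compatible way (BGZF), and the decompressed stream begins with the four bytes "BAM" followed by 0x01. The function name `looks_like_bam` and the test bytes are hypothetical; for real analysis a dedicated tool such as samtools or pysam is the usual choice.

```python
import gzip
import io
import zlib

BAM_MAGIC = b"BAM\x01"  # first four bytes of the decompressed BAM stream


def looks_like_bam(data: bytes) -> bool:
    """Return True if the bytes decompress to a stream starting with the BAM magic."""
    try:
        with gzip.open(io.BytesIO(data), "rb") as handle:
            return handle.read(4) == BAM_MAGIC
    except (OSError, EOFError, zlib.error):
        # Not gzip-compatible, truncated, or corrupted input.
        return False


# Hypothetical stand-in for a BAM file: gzip-compressed bytes that start
# with the magic string (not a complete, valid BAM file).
fake_bam = gzip.compress(BAM_MAGIC + b"\x00" * 8)
sam_text = b"read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n"

print(looks_like_bam(fake_bam))
print(looks_like_bam(sam_text))
```

To check a file on disk, the same function can be called on the first few kilobytes read in binary mode.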

CRAM

Stockholm format

VCF

Generic feature format

GFF

GTF

Unlabeled format

BED

tar.gz

PDB

PED

MAP

CSV

Other

TIFF File (Tag Image File Format)

  • is a computer file used to store raster graphics and image information.
  • is a handy way to store high-quality images before editing if you want to avoid lossy file formats.
  • Usage: high-quality photographs; high-resolution scans; and container files, where several smaller JPEGs (https://www.adobe.com/au/creativecloud/file-types/image/raster/jpeg-file.html) are stored within one TIFF, for example if you wanted to email a selection of photos to a contact.

Reference
