# Reading and Managing Data
This notebook covers how to read and manage sequence data. In bioinformatics, sequences are usually stored in FASTA files, so here I show some libraries, such as Biopython, for working with this type of data.

**FASTA file format**
DNA and protein sequences are the most common data types in bioinformatics, and the standard file format for storing these sequences is the FASTA format.

DNA: A, T, C, G

Proteins: one-letter amino acid codes, e.g. A for alanine

A FASTA record begins with a one-line identifier. This identifier line always begins with a greater-than symbol (>), and the sequence follows on the next line or lines. Note that a FASTA file can contain more than one record.
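
To make the record layout concrete, here is a minimal sketch that formats records as FASTA text using only the standard library. The function name `format_fasta`, the two example records, and the line width are assumptions made for illustration; they are not taken from any file used in this notebook.

```python
import textwrap


def format_fasta(records, width=60):
    """Return FASTA text for an iterable of (identifier, sequence) pairs.

    `width` is an assumed line length used to wrap each sequence.
    """
    lines = []
    for identifier, sequence in records:
        # the identifier line always starts with '>'
        lines.append(f">{identifier}")
        # the sequence follows, split over one or more lines
        lines.extend(textwrap.wrap(sequence, width))
    return "\n".join(lines) + "\n"


# Hypothetical records, invented for this example
example_records = [("seq1", "ATGCGTACGTTAG"), ("seq2", "MKVLAA")]
fasta_text = format_fasta(example_records, width=10)
print(fasta_text)
```

Each record contributes one `>` line followed by its sequence lines, so a file with several records is just these blocks written one after another.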

In [1]:
#to use modules of the project
%load_ext autoreload
%autoreload 2

In [2]:
#import sys
#sys.path.insert(0,"../bioinformatics_guide")

In [2]:
#from bioinformatics_guide.utils.paths as path
from bioinformatics_guide.utils.paths import data_dir
data_dir()

PosixPath('/home/alejandrodf1/Documents/bioinformatics/bioinformatics_guide/data')

In [1]:
#getting data
#!wget https://cocalc.com/share/download/df81e09e5b8f16f28b3a2e818dcdd4560e7818ae/support/2015-04-02-ISB-notes/ls_orchid.fasta.txt

--2022-10-09 15:51:15--  https://cocalc.com/share/download/df81e09e5b8f16f28b3a2e818dcdd4560e7818ae/support/2015-04-02-ISB-notes/ls_orchid.fasta.txt
Resolving cocalc.com (cocalc.com)... 104.22.1.102, 172.67.20.35, 104.22.0.102, ...
Connecting to cocalc.com (cocalc.com)|104.22.1.102|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
Saving to: ‘ls_orchid.fasta.txt’

ls_orchid.fasta.txt     [ <=>                ]  74.69K  --.-KB/s    in 0.08s   

2022-10-09 15:51:19 (889 KB/s) - ‘ls_orchid.fasta.txt’ saved [76480]



In [3]:
seq_file = data_dir('raw/ls_orchid.fasta.txt') #path of fasta file
seq_file

PosixPath('/home/alejandrodf1/Documents/bioinformatics/bioinformatics_guide/data/raw/ls_orchid.fasta.txt')

In [10]:
#!pip install biopython

Collecting biopython
  Downloading biopython-1.79-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.6/2.6 MB 2.5 MB/s eta 0:00:00
Installing collected packages: biopython
Successfully installed biopython-1.79


# SeqIO package
The SeqIO package provides the interface for working with sequence file formats.

parse(file_path, format) - reads a sequence file and returns an iterator that yields one SeqRecord object per record. A SeqRecord contains the following info:

* id - ID used to identify the sequence – a string
* seq - Seq object containing sequence (convert to string: str(), check length of sequence with len())

In [4]:
from Bio import SeqIO
sequences = []  # empty list to store the sequences

for seq_record in SeqIO.parse(seq_file, 'fasta'):
    # save each sequence to the list as a plain string
    sequences.append(str(seq_record.seq))
    # print the sequence
    print(seq_record.seq)
    # print the sequence identifier
    print(f' ID of the sequence: {seq_record.id}')
    # print the length of the sequence
    print(f'Length of the sequence: {len(seq_record)}')


CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTGAATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGGCCGCCTCGGGAGCGTCCATGGCGGGTTTGAACCTCTAGCCCGGCGCAGTTTGGGCGCCAAGCCATATGAAAGCATCACCGGCGAATGGCATTGTCTTCCCCAAAACCCGGAGCGGCGGCGTGCTGTCGCGTGCCCAATGAATTTTGATGACTCTCGCAAACGGGAATCTTGGCTCTTTGCATCGGATGGAAGGACGCAGCGAAATGCGATAAGTGGTGTGAATTGCAAGATCCCGTGAACCATCGAGTCTTTTGAACGCAAGTTGCGCCCGAGGCCATCAGGCTAAGGGCACGCCTGCTTGGGCGTCGCGCTTCGTCTCTCTCCTGCCAATGCTTGCCCGGCATACAGCCAGGCCGGCGTGGTGCGGATGTGAAAGATTGGCCCCTTGTGCCTAGGTGCGGCGGGTCCAAGAGCTGGTGTTTTGATGGCCCGGAACCCGGCAAGAGGTGGACGGATGCTGGCAGCAGCTGCCGTGCGAATCCCCCATGTTGTCGTGCTTGTCGGACAGGCAGGAGAACCCTTCCGAACCCCAATGGAGGGCGGTTGACCGCCATTCGGATGTGACCCCAGGTCAGGCGGGGGCACCCGCTGAGTTTACGC
 ID of the sequence: gi|2765658|emb|Z78533.1|CIZ78533
Length of the sequence: 740
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAACAGAATATATGATCGAGTGAATCTGGAGGACCTGTGGTAACTCAGCTCGTCGTGGCACTGCTTTTGTCGTGACCCTGCTTTGTTGTTGGGCCTCCTCAAGAGCTTTCATGGCAGGTTTGAACTTTA

In [29]:
sequences[0]

'CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTGAATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGGCCGCCTCGGGAGCGTCCATGGCGGGTTTGAACCTCTAGCCCGGCGCAGTTTGGGCGCCAAGCCATATGAAAGCATCACCGGCGAATGGCATTGTCTTCCCCAAAACCCGGAGCGGCGGCGTGCTGTCGCGTGCCCAATGAATTTTGATGACTCTCGCAAACGGGAATCTTGGCTCTTTGCATCGGATGGAAGGACGCAGCGAAATGCGATAAGTGGTGTGAATTGCAAGATCCCGTGAACCATCGAGTCTTTTGAACGCAAGTTGCGCCCGAGGCCATCAGGCTAAGGGCACGCCTGCTTGGGCGTCGCGCTTCGTCTCTCTCCTGCCAATGCTTGCCCGGCATACAGCCAGGCCGGCGTGGTGCGGATGTGAAAGATTGGCCCCTTGTGCCTAGGTGCGGCGGGTCCAAGAGCTGGTGTTTTGATGGCCCGGAACCCGGCAAGAGGTGGACGGATGCTGGCAGCAGCTGCCGTGCGAATCCCCCATGTTGTCGTGCTTGTCGGACAGGCAGGAGAACCCTTCCGAACCCCAATGGAGGGCGGTTGACCGCCATTCGGATGTGACCCCAGGTCAGGCGGGGGCACCCGCTGAGTTTACGC'

# Big data and speed issues
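Sometimes the data set is too large to hold in RAM. In that case, working with the low-level SimpleFastaParser is often more practical than Bio.SeqIO.parse, especially for large high-throughput FASTA sequencing files where speed matters. SimpleFastaParser yields plain (title, sequence) string tuples instead of building a SeqRecord object for every record, which reduces the per-record overhead.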
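
The idea of handling one record at a time, rather than collecting everything in a list, can be sketched with a plain generator. This is not Biopython's implementation; `iter_fasta` and the in-memory example data are hypothetical and only illustrate the streaming pattern under the assumption of a well-formed FASTA input.

```python
import io


def iter_fasta(handle):
    """Lazily yield (title, sequence) string tuples from an open text handle."""
    title, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # a new record starts: emit the previous one, if any
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:], []
        elif title is not None and line:
            # sequence line belonging to the current record
            chunks.append(line)
    # emit the last record in the file
    if title is not None:
        yield title, "".join(chunks)


# Hypothetical in-memory data standing in for a large FASTA file
handle = io.StringIO(">seq1 first\nACGT\nAC\n>seq2 second\nGGG\n")

count = 0
total_length = 0
for title, seq in iter_fasta(handle):
    # only the current record is held in memory at this point
    count += 1
    total_length += len(seq)

print(count, total_length)
```

Because nothing is appended to a list, memory use depends on the longest single record rather than on the size of the whole file.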

In [30]:
# getting the data
#!wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/xenoMrna.fa.gz

--2022-10-09 17:31:14--  http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/xenoMrna.fa.gz
Resolving hgdownload.soe.ucsc.edu (hgdownload.soe.ucsc.edu)... 128.114.119.163
Connecting to hgdownload.soe.ucsc.edu (hgdownload.soe.ucsc.edu)|128.114.119.163|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7144575001 (6.7G) [application/x-gzip]
Saving to: ‘xenoMrna.fa.gz’


In [9]:
!ls -al #show all the files

total 4520268
drwxr-xr-x 2 alejandrodf1 alejandrodf1       4096 Oct  9 18:48 .
drwxr-xr-x 9 alejandrodf1 alejandrodf1       4096 Oct  9 15:48 ..
-rw-r--r-- 1 alejandrodf1 alejandrodf1          0 Oct  9 15:44 .gitkeep
-rw-r--r-- 1 alejandrodf1 alejandrodf1      92934 Oct  9 18:52 1.1-managing_data.ipynb
-rw-r--r-- 1 alejandrodf1 alejandrodf1       1991 Oct  9 18:48 1.2-DNA_exercises.ipynb
-rw-r--r-- 1 alejandrodf1 alejandrodf1 4628643840 Oct  9 18:32 xenoMrna.fa.gz


In [12]:
data_dir('raw/xenoMrna.fa')

PosixPath('/home/alejandrodf1/Documents/bioinformatics/bioinformatics_guide/data/raw')

In [14]:
#unzip the file
#import gzip
#import shutil
#with gzip.open('xenoMrna.fa.gz', 'rb') as f_in:
#    with open(data_dir('raw/xenoMrna.fa'), 'wb') as f_out:
#        shutil.copyfileobj(f_in, f_out)
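
As an alternative to writing a decompressed copy to disk, a gzip-compressed file can be read directly in text mode with the standard library. The sketch below builds a tiny compressed file first so that it runs on its own; the file name `example.fa.gz` and its three records are invented for the demonstration.

```python
import gzip
import os
import tempfile

# Build a tiny gzip-compressed FASTA file (hypothetical content) for the demo
tmp_dir = tempfile.mkdtemp()
gz_path = os.path.join(tmp_dir, "example.fa.gz")
with gzip.open(gz_path, "wt") as f_out:
    f_out.write(">seq1\nACGT\n>seq2\nGGCC\n>seq3\nTTAA\n")

# Stream the compressed file line by line in text mode ("rt"),
# counting records by their '>' identifier lines
n_records = 0
with gzip.open(gz_path, "rt") as handle:
    for line in handle:
        if line.startswith(">"):
            n_records += 1

print(n_records)
```

A handle opened with `gzip.open(path, "rt")` is a text-mode handle, so it should also be usable with SimpleFastaParser in place of a handle from a plain `open(...)` call, at the cost of decompressing on every read.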

In [5]:
# read the file with SimpleFastaParser and time the run
from Bio.SeqIO.FastaIO import SimpleFastaParser
import time

start = time.time()
count = 0
with open(data_dir("raw/xenoMrna.fa")) as handle:  # handle == open file object
    for seq_id, seq in SimpleFastaParser(handle):
        # seq_id - title line of the record (without the leading '>')
        # seq - the sequence itself, as a plain string
        count += 1
end = time.time()
print(f"Time: {end - start}")
print(f"Number of sequences: {count}")

KeyboardInterrupt: 

### Exercises
1. Create a function that takes the name of a FASTA file as input and returns its content (sequences) as a list
2. Create a function that takes the name of a FASTA file as input and returns a list of FASTA identifiers

In [None]:
#first exercise
def read_fasta(seq_file):
    sequences = []  # empty list to store the sequences
    for seq_record in SeqIO.parse(seq_file, 'fasta'):
        # save each sequence to the list as a plain string
        sequences.append(str(seq_record.seq))
    return sequences


In [7]:
#second exercise
def id_fasta(seq_file):
    sequences_id = []  # empty list to store the identifiers
    for seq_record in SeqIO.parse(seq_file, 'fasta'):
        # save each identifier to the list
        sequences_id.append(seq_record.id)
    return sequences_id

In [8]:
id_fasta(seq_file)

['gi|2765658|emb|Z78533.1|CIZ78533',
 'gi|2765657|emb|Z78532.1|CCZ78532',
 'gi|2765656|emb|Z78531.1|CFZ78531',
 'gi|2765655|emb|Z78530.1|CMZ78530',
 'gi|2765654|emb|Z78529.1|CLZ78529',
 'gi|2765652|emb|Z78527.1|CYZ78527',
 'gi|2765651|emb|Z78526.1|CGZ78526',
 'gi|2765650|emb|Z78525.1|CAZ78525',
 'gi|2765649|emb|Z78524.1|CFZ78524',
 'gi|2765648|emb|Z78523.1|CHZ78523',
 'gi|2765647|emb|Z78522.1|CMZ78522',
 'gi|2765646|emb|Z78521.1|CCZ78521',
 'gi|2765645|emb|Z78520.1|CSZ78520',
 'gi|2765644|emb|Z78519.1|CPZ78519',
 'gi|2765643|emb|Z78518.1|CRZ78518',
 'gi|2765642|emb|Z78517.1|CFZ78517',
 'gi|2765641|emb|Z78516.1|CPZ78516',
 'gi|2765640|emb|Z78515.1|MXZ78515',
 'gi|2765639|emb|Z78514.1|PSZ78514',
 'gi|2765638|emb|Z78513.1|PBZ78513',
 'gi|2765637|emb|Z78512.1|PWZ78512',
 'gi|2765636|emb|Z78511.1|PEZ78511',
 'gi|2765635|emb|Z78510.1|PCZ78510',
 'gi|2765634|emb|Z78509.1|PPZ78509',
 'gi|2765633|emb|Z78508.1|PLZ78508',
 'gi|2765632|emb|Z78507.1|PLZ78507',
 'gi|2765631|emb|Z78506.1|PLZ78506',
 