#### Read Fasta Files in Python ####


- FASTA files are plain-text files that store nucleotide (DNA) or peptide (amino acid) sequences
- A file can hold a single sequence (single-fasta) or several sequences (multi-fasta)
- FASTA files can be read using the Python library Biopython
    * Module : SeqIO (imported with "from Bio import SeqIO")
- Biopython can be installed using pip (pip install biopython --user)
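
To make the file layout concrete, here is a minimal pure-Python sketch of how a FASTA file is structured: each record starts with a ">" header line, and the following lines hold the sequence. This is only an illustration of the format, not how Biopython implements it; the function names parse_fasta and _build and the toy records are made up for this example.

```python
from io import StringIO

# Hypothetical two-record FASTA text, made up for illustration.
FASTA_TEXT = """>seq1 first toy record
ATGCGT
ACGT
>seq2 second toy record
GGCCTA
"""


def _build(header, chunks):
    # The identifier is the first whitespace-delimited word of the header.
    parts = header.split(None, 1)
    identifier = parts[0] if parts else ""
    return identifier, header, "".join(chunks)


def parse_fasta(handle):
    """Yield (identifier, description, sequence) tuples from FASTA text."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield _build(header, chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    # Emit the final record once the input is exhausted.
    if header is not None:
        yield _build(header, chunks)


records = list(parse_fasta(StringIO(FASTA_TEXT)))
for identifier, description, sequence in records:
    print(identifier, len(sequence))
```

In practice SeqIO is the better choice, since it handles many formats and returns rich record objects rather than plain tuples.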

Read Single Sequence Fasta

In [4]:
filename = "D:/Bionome Internship/Bioinformatics Practicals/BioPython/sequence.fasta"

In [7]:
from Bio import SeqIO

In [8]:
seq_object = SeqIO.read(filename, "fasta")

In [9]:
type(seq_object)

Bio.SeqRecord.SeqRecord
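
SeqIO.read expects exactly one record in the file and raises a ValueError otherwise. The sketch below shows this behaviour with in-memory text instead of a file on disk; the toy records are made up for illustration, and it assumes Biopython is installed.

```python
from io import StringIO

from Bio import SeqIO

# Made-up FASTA text; StringIO stands in for a real file handle.
single = ">toy1 made-up record\nATGC\n"
multi = single + ">toy2 another made-up record\nGGCC\n"

# One record: SeqIO.read returns a single SeqRecord.
record = SeqIO.read(StringIO(single), "fasta")
print(record.id, len(record.seq))

# Two records: SeqIO.read raises ValueError instead of picking one.
try:
    SeqIO.read(StringIO(multi), "fasta")
    read_failed = False
except ValueError as err:
    read_failed = True
    print("SeqIO.read refused the input:", err)
```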

In [10]:
# print sequence id

seq_id = seq_object.id
print(seq_id)

NG_047557.1


In [14]:
# print sequence name (for FASTA input, Biopython sets this to the same value as the id)

seq_name = seq_object.name
print(seq_name)

NG_047557.1


In [16]:
# print description (the full header line, including the id)

description = seq_object.description
print(description)

NG_047557.1 Staphylococcus aureus N315 bleO gene for bleomycin binding protein, complete CDS


In [17]:
# print the sequence itself

sequence = seq_object.seq
print(sequence)

CGGGCCATTTTGCGTAATAAGAAAAAGGATTAATTATGAGCGAATTGAATTAATAATAAGGTAATAGATTTACATTAGAAAATGAAAGGGGATTTTATGCGTGAGAATGTTACAGTCTATCCCGGCATTGCCAGTCGGGGATATTAAAAAGAGTATAGGTTTTTATTGCGATAAACTAGGTTTCACTTTGGTTCACCATGAAGATGGATTCGCAGTTCTAATGTGTAATGAGGTTCGGATTCATCTATGGGAGGCAAGTGATGAAGGCTGGCGCTCTCGTAGTAATGATTCACCGGTTTGTACAGGTGCGGAGTCGTTTATTGCTGGTACTGCTAGTTGCCGCATTGAAGTAGAGGGAATTGATGAATTATATCAACATATTAAGCCTTTGGGCATTTTGCACCCCAATACATCATTAAAAGATCAGTGGTGGGATGAACGAGACTTTGCAGTAATTGATCCCGACAACAATTTGATTAGCTTTTTTCAACAAATAAAAAGCTAAAATCTATTATTAATCTGTTCAGCAATCGGGCGCGATTGCTGAATAAAAGATACGAGAGACCTCTCTTGTATCTTTTTTATTTTGAGTGGTTTTGTCCGTT


In [20]:
# print number of nucleotides in a sequence

length = len(sequence)
print(length, "bp")

605 bp


#### Read Multi Fasta Files ####

- A **_multi-fasta_** file contains two or more sequence records
- This can be read using the **_biopython_** package
- By iterating over the records (**for/while loops**), the analysis can be run on all or some of the sequences
- When working with several sequences, the results can be organized more clearly using the **pandas** library
- Python libraries for this tutorial
    - biopython
    - pandas

In [60]:
from Bio import SeqIO
import pandas as pd
import os

In [23]:
filepath = "D:/Bionome Internship/Bioinformatics Practicals/BioPython/multi-fasta.fa"

In [24]:
# use SeqIO.parse instead of SeqIO.read because the file holds multiple sequences

seq_objects = SeqIO.parse(filepath, 'fasta')

sequences = []
for seq in seq_objects:
    sequences.append(seq)
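
SeqIO.parse returns an iterator, so the records can only be walked once unless they are stored in a list, as the loop above does. A shorter equivalent is to pass the iterator to list(). The sketch below uses made-up in-memory records rather than the multi-fasta file and assumes Biopython is installed.

```python
from io import StringIO

from Bio import SeqIO

# Made-up three-record FASTA text for illustration.
multi = ">toy1 made-up\nATGC\n>toy2 made-up\nGGCCAA\n>toy3 made-up\nTT\n"

# list() collects every record in one step.
records = list(SeqIO.parse(StringIO(multi), "fasta"))
print([rec.id for rec in records])

# The iterator is single-pass: after next() takes the first record,
# looping over it continues from the second record, not from the start.
iterator = SeqIO.parse(StringIO(multi), "fasta")
first = next(iterator)
remaining = [rec.id for rec in iterator]
print(first.id, remaining)
```

Keeping the iterator (instead of a list) can be useful for very large files, because only one record is held in memory at a time.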

In [25]:
len(sequences)

5

In [26]:
first_record = sequences[0]

In [27]:
first_record.id

'CP029082.1'

In [28]:
first_record.name

'CP029082.1'

In [29]:
first_record.description

'CP029082.1 Staphylococcus aureus strain AR465 chromosome, complete genome'

In [36]:
first_sequence = first_record.seq

In [40]:
len(first_sequence)

2911287

In [48]:
for record in sequences:
    seq_id = record.id
    # seq_name = record.name
    # seq_description = record.description
    sequence = record.seq
    length = len(sequence)
    print(seq_id, length)

CP029082.1 2911287
CP030138.1 3050015
CP039157.1 2970728
CP039167.1 2866643
CP013957.1 3085555


In [49]:
seq_ids = []
seq_lengths = []

In [51]:
for record in sequences:
    seq_id = record.id
    sequence = record.seq
    length = len(sequence)

    seq_ids.append(seq_id)
    seq_lengths.append(length)
    print("Analysis Completed")


Analysis Completed
Analysis Completed
Analysis Completed
Analysis Completed
Analysis Completed


In [52]:
print(seq_ids)

['CP029082.1', 'CP030138.1', 'CP039157.1', 'CP039167.1', 'CP013957.1']


In [55]:
print(seq_lengths)

[2911287, 3050015, 2970728, 2866643, 3085555]


In [61]:
dataframe = pd.DataFrame()
dataframe['Seq_Id'] = seq_ids
dataframe['Seq_Length'] = seq_lengths

In [62]:
dataframe.head()

       Seq_Id  Seq_Length
0  CP029082.1     2911287
1  CP030138.1     3050015
2  CP039157.1     2970728
3  CP039167.1     2866643
4  CP013957.1     3085555
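
As an alternative to filling two lists and assigning them column by column, the table can be built in a single call. This is a sketch with made-up (identifier, length) pairs standing in for the parsed records; the column names match the ones used above, and it assumes pandas is installed.

```python
import pandas as pd

# Made-up (identifier, length) pairs standing in for parsed records.
toy_records = [("toy1", 4), ("toy2", 6), ("toy3", 2)]

# Build the table in one step from a list of tuples.
table = pd.DataFrame(toy_records, columns=["Seq_Id", "Seq_Length"])

# Once the data is in a DataFrame, simple summaries are one-liners.
longest = table.sort_values("Seq_Length", ascending=False).iloc[0]
print(table)
print("Longest:", longest["Seq_Id"])
```

With real records, the list of tuples could be produced by a comprehension such as [(rec.id, len(rec.seq)) for rec in sequences].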


Save Result to a File

In [63]:
outdir = "D:/Bionome Internship/Bioinformatics Practicals/BioPython"
os.chdir(outdir)

In [65]:
dataframe.to_csv('sequence_analysis.csv', index=False)
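
Changing the working directory with os.chdir affects every later relative path in the session. One alternative, sketched below, is to pass the full output path to to_csv directly. A temporary directory stands in for the real output folder here, the table contents are made up, and the example assumes pandas is installed.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up results table for illustration.
table = pd.DataFrame({"Seq_Id": ["toy1", "toy2"], "Seq_Length": [4, 6]})

# A temporary directory stands in for the real output folder.
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "sequence_analysis.csv"
    # Write without the index column, then read back to check the result.
    table.to_csv(out_path, index=False)
    round_trip = pd.read_csv(out_path)

print(round_trip)
```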