# Introduction to Python and Bioinformatics

# - The Python Fundamentals

#### Developed by:  A. Fahim, California State University Long Beach

This notebook is a supplement to the workshop "Introduction to Python and Bioinformatics".


# Basic Sequence Analysis

In [28]:
from Bio import Entrez, Seq, SeqIO

We first fetch our sequence of interest from NCBI as a Biopython sequence record, and then save it to a FASTA file on our local disk.

In [31]:
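Entrez.email = "arjangvt@gmail.com"  # contact email sent to NCBI with each request
hdl = Entrez.efetch(db='nucleotide', id=['NM_002299'], rettype='fasta')  # Lactase gene
# To inspect the raw response instead, uncomment these lines (this consumes the handle):
# for l in hdl:
#     print(l)
seq = SeqIO.read(hdl, 'fasta')  # parse the single FASTA record into a SeqRecord
hdl.close()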

Note that we write only a slice of the sequence, using coordinates that are
hardcoded for this record and are meant to cover its Coding Sequence (CDS).
The FASTA download carries no annotations; to see the features, fetch the
record in GenBank format (rettype='gb') and look for the CDS feature. Because
the coordinates are tied to one version of the record, check them against
the CDS feature whenever the record is revised.
When you download a sequence, it may contain more than the coding part of
the gene. An mRNA record contains the exons, that is, everything that is
not removed by RNA splicing (which is more than the coding sequence, as it
includes untranslated regions), and a genomic record also contains introns
and many other features. In this case, we have chosen a sequence with only
a single CDS entry, but in general, you may have to go through multiple CDS
features in order to reconstruct the coding sequence (that is, the start
and stop codons plus all the codons coding for amino acids). So, do not
forget that the downloaded "gene sequence" is normally bigger than the
exonic part, which is bigger than the coding (CDS) part.
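When the coding sequence is split over several annotated segments, reconstructing it comes down to concatenating the segments in order. The following is a minimal pure-Python sketch of that idea, not Biopython's implementation; the helper `join_cds`, the toy sequence, and the coordinates are all hypothetical and chosen only for illustration.

```python
def join_cds(sequence, segments):
    """Concatenate coding segments given as 0-based, end-exclusive (start, end) pairs."""
    return "".join(sequence[start:end] for start, end in segments)


# Hypothetical toy sequence: lowercase marks non-coding bases, uppercase marks coding bases
toy = "ggATGGCCaaaaTTTTAAcc"
segments = [(2, 8), (12, 18)]  # assumed coordinates, as a CDS annotation might list them

cds = join_cds(toy, segments)
print(cds)            # ATGGCCTTTTAA
print(len(cds) % 3)   # 0, a complete coding sequence is a whole number of codons
```

With an annotated Biopython record, the `extract` method of a CDS feature applied to the record's sequence should play the same role, including for locations made of several parts.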

In [19]:
w_seq = seq[11:5795]
w_seq

SeqRecord(seq=Seq('GAAAATGGAGCTGTCTTGGCATGTAGTCTTTATTGCCCTGCTAAGTTTTTCATG...ATT'), id='NM_002299.4', name='NM_002299.4', description='NM_002299.4 Homo sapiens lactase (LCT), mRNA', dbxrefs=[])

In [20]:
with open('example.fasta', 'w') as w_hdl:  # the file is closed automatically on exit
    SeqIO.write([w_seq], w_hdl, 'fasta')

In most situations, you will actually have the sequence on the disk, so you will be interested in reading it.

In [32]:
recs = SeqIO.parse('example.fasta', 'fasta')
for rec in recs:
    seq = rec.seq
    print(rec.description)
    print(seq[:10])

NM_002299.4 Homo sapiens lactase (LCT), mRNA
GAAAATGGAG


In [25]:
seq = Seq.Seq(str(seq))
seq

Seq('GAAAATGGAGCTGTCTTGGCATGTAGTCTTTATTGCCCTGCTAAGTTTTTCATG...ATT')

In [33]:
print((seq[:12], seq[-12:]))
rna = seq.transcribe()
rna

(Seq('GAAAATGGAGCT'), Seq('GGTGTCTTCATT'))


Seq('GAAAAUGGAGCUGUCUUGGCAUGUAGUCUUUAUUGCCCUGCUAAGUUUUUCAUG...AUU')
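Because the sequence is stored as the coding strand, transcription here amounts to replacing every T with U, which is what the output above shows. A minimal pure-Python sketch of that step follows; the helper name `transcribe_coding_strand` is hypothetical and this is not Biopython's own code.

```python
def transcribe_coding_strand(dna):
    """Return the mRNA string for a coding-strand DNA string (T becomes U)."""
    return dna.upper().replace("T", "U")


# First 12 bases of the sequence, as printed in the output above
rna_start = transcribe_coding_strand("GAAAATGGAGCT")
print(rna_start)  # GAAAAUGGAGCU
```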

In [34]:
prot = seq.translate()
prot

Seq('ENGAVLACSLYCPAKFFMLGVRLGV**KFHFHRWSSNQ*LAAQPEWSPGRPEF*...VFI')
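The `*` characters in the translation are stop codons, and seeing them this early suggests that translation did not start on the first base of a codon. In the prefix printed above, the first ATG sits four bases into the sequence, so the hardcoded slice may not line up exactly with the CDS of the downloaded record version. The sketch below, in pure Python with the standard genetic code, translates that prefix in both frames. The helper `translate_frame` is hypothetical and is not Biopython's implementation.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_frame(dna, offset=0):
    """Translate dna starting at offset, ignoring trailing bases that do not fill a codon."""
    usable = len(dna) - (len(dna) - offset) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(offset, usable, 3))


# The 54-base prefix shown in the outputs above
prefix = "GAAAATGGAGCTGTCTTGGCATGTAGTCTTTATTGCCCTGCTAAGTTTTTCATG"

as_sliced = translate_frame(prefix)        # frame used by seq.translate()
start = prefix.find("ATG")                 # position of the first start codon
from_atg = translate_frame(prefix, start)  # frame beginning at that start codon

print(as_sliced)  # ENGAVLACSLYCPAKFFM
print(start)      # 4
print(from_atg)   # MELSWHVVFIALLSFS
```

If the second frame is the intended one, slicing the sequence from the start codon before calling `translate` should give a protein without early stops; Biopython's `translate` also accepts `to_stop=True` to end at the first stop codon.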