# Sequence annotation objects

Sequence annotation objects in Biopython describe features of biological sequences, such as genes, promoters, or regulatory regions. Each feature is typically stored as a `SeqFeature` object inside a sequence record (`SeqRecord`), which attaches structured metadata to the regions of interest.

In [3]:
from Bio.SeqRecord import SeqRecord
# Execute help() to view help on class SeqRecord
# help(SeqRecord)

### Creation and structure of a SeqRecord

In [4]:
from Bio.Seq import Seq
# Create Seq Object
simple_seq = Seq("GATC")

print(f"Sequence Object: {simple_seq}", "\n")
# Use Seq Object for Seq Record creation
simple_seq_r = SeqRecord(simple_seq)

print(simple_seq_r)

Sequence Object: GATC 

ID: <unknown id>
Name: <unknown name>
Description: <unknown description>
Number of features: 0
Seq('GATC')


### Attributes of a SeqRecord

Main attributes:
 - id          - Identifier such as a locus tag (string)
 - seq         - The sequence itself (Seq object or similar)
 
Additional attributes:
 - name        - Sequence name, e.g. gene name (string)
 - description - Additional text (string)
 - dbxrefs     - List of database cross references (list of strings)
 - features    - Any (sub)features defined (list of SeqFeature objects)
 - annotations - Further information about the whole sequence (dictionary).
     Most entries are strings, or lists of strings.
 - letter_annotations - Per letter/symbol annotation (restricted
     dictionary). This holds Python sequences (lists, strings
     or tuples) whose length matches that of the sequence.
     A typical use would be to hold a list of integers
     representing sequencing quality scores, or a string
     representing the secondary structure.
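
The length restriction on `letter_annotations` described above can be checked with a small experiment. This is a minimal sketch, assuming a made-up four-base record (`demo_record` is a hypothetical name, not part of the notebook); the exact exception type raised on a length mismatch is also an assumption, so the code catches both `TypeError` and `ValueError`.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Hypothetical four-base record, used only for illustration
demo_record = SeqRecord(Seq("GATC"), id="demo")

# One quality score per base: the lengths match, so this is accepted
demo_record.letter_annotations["phred_quality"] = [40, 38, 39, 37]

# Three scores for four bases: the restricted dictionary should reject this
try:
    demo_record.letter_annotations["phred_quality"] = [40, 38, 39]
    rejected = False
except (TypeError, ValueError) as error:
    rejected = True
    print(f"Rejected: {error}")

# The previously stored, correctly sized list is still in place
print(demo_record.letter_annotations["phred_quality"])
```

The other attributes (`dbxrefs`, `annotations`, `features`) are ordinary lists and dictionaries and carry no such length check.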

In [5]:
# No information on ID, this can be added manually
print(simple_seq_r.id)

<unknown id>


In [6]:
# Adding identifier information to the SeqRecord
simple_seq_r.id = "AC12345"
simple_seq_r.id

'AC12345'

In [None]:
# Adding description information to the SeqRecord
simple_seq_r.description = "Made up sequence I wish I could write a paper about"
simple_seq_r.description

'Made up sequence I wish I could write a paper about'

In [None]:
# seq is still a Seq Object
simple_seq_r.seq

Seq('GATC')

In [None]:
# Directly add ID information on creation of SeqRecord
simple_seq_r = SeqRecord(simple_seq, id="AC54321")
print(simple_seq_r)

ID: AC54321
Name: <unknown name>
Description: <unknown description>
Number of features: 0
Seq('GATC')


In [7]:
# Add a key-value pair to the annotations dictionary of the SeqRecord
simple_seq_r.annotations["evidence"] = "None. I just made it up."
# View the whole annotations dictionary
print(simple_seq_r.annotations)
# Look up a single value by its key
print(simple_seq_r.annotations["evidence"])

{'evidence': 'None. I just made it up.'}
None. I just made it up.


### SeqFeature objects


In [None]:
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import SeqFeature, FeatureLocation

#help(SeqFeature)

Attributes:
 - location - The location of the feature on the sequence (a `SimpleLocation`; in recent Biopython releases `FeatureLocation` is an alias for the same class)
 - type - The specified type of the feature (e.g. CDS, exon, repeat)
 - id - A string identifier for the feature
 - qualifiers - A dictionary of qualifiers on the feature, analogous to the qualifiers in a GenBank feature table. The keys are qualifier names and the values are the qualifier values.

In [None]:
# Create a sequence
sequence = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")

# Create a SeqRecord
seq_record = SeqRecord(
    sequence,
    id="NC_000913.3",  # Accession number or ID
    name="Example_Record",
    description="Example sequence with annotations",
    annotations={"molecule_type": "DNA", "organism": "E. coli"},
)

# Add features
seq_record.features.append(
    SeqFeature(location=FeatureLocation(0, 12), type="CDS", qualifiers={"gene": "example_gene"})
)
seq_record.features.append(
    SeqFeature(location=FeatureLocation(15, 30), type="regulatory", qualifiers={"note": "promoter"})
)

# Add Phred quality scores as letter annotations
seq_record.letter_annotations["phred_quality"] = [40, 38, 39, 37, 40, 38, 37, 35, 36, 38, 39, 40, 
                                                   35, 36, 38, 39, 40, 37, 38, 39, 40, 38, 37, 35,
                                                   36, 38, 39, 40, 35, 36, 38, 39, 40, 37, 38, 39, 
                                                   37, 37, 37]

# Output
print(seq_record)


ID: NC_000913.3
Name: Example_Record
Description: Example sequence with annotations
Number of features: 2
/molecule_type=DNA
/organism=E. coli
Per letter annotation for: phred_quality
Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG')
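
The printed summary only reports the number of features, so it can help to read the features back one by one. The sketch below rebuilds the same example record so that it runs on its own; the `summary` list is a hypothetical helper introduced only for illustration.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord

# Rebuild the example record from the cell above
seq_record = SeqRecord(
    Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"),
    id="NC_000913.3",
    features=[
        SeqFeature(location=FeatureLocation(0, 12), type="CDS",
                   qualifiers={"gene": "example_gene"}),
        SeqFeature(location=FeatureLocation(15, 30), type="regulatory",
                   qualifiers={"note": "promoter"}),
    ],
)

# Collect (type, start, end, length) for every feature on the record
summary = []
for feature in seq_record.features:
    start = int(feature.location.start)  # 0-based, inclusive
    end = int(feature.location.end)      # 0-based, exclusive
    summary.append((feature.type, start, end, len(feature)))
    print(f"{feature.type}: {start}..{end} ({len(feature)} bp) {dict(feature.qualifiers)}")
```

Locations follow Python slice conventions, so the length of a feature is simply `end - start`.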


### SeqRecord objects from FASTA files

In [None]:
from Bio import SeqIO

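# The FASTA title line supplies the ID, name and description:
# ID and name are the first word of the title, the description is the whole line.
# The GI number could in principle be extracted and stored as a separate annotation,
# but header formats vary between FASTA files from different sources.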
record = SeqIO.read("NC_005816.fna", "fasta")

print(record)

ID: gi|45478711|ref|NC_005816.1|
Name: gi|45478711|ref|NC_005816.1|
Description: gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence
Number of features: 0
Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG')
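
As a rough illustration of extracting the GI number, the sketch below splits the ID on `|`. It assumes the NCBI-style `gi|...|ref|...|` header shown above and will not fit FASTA files from other sources. To stay self-contained it reads a hypothetical in-memory FASTA with a shortened placeholder sequence instead of `NC_005816.fna`.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical in-memory FASTA: same header style as above, sequence shortened
fasta_text = (
    ">gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus "
    "str. 91001 plasmid pPCP1, complete sequence\n"
    "TGTAACGAACGGTGCAATAGTGATCCACACC\n"
)
record = SeqIO.read(StringIO(fasta_text), "fasta")

# 'gi|45478711|ref|NC_005816.1|' splits into
# ['gi', '45478711', 'ref', 'NC_005816.1', '']
fields = record.id.split("|")

# Only rewrite the record when the header really has the assumed layout
if len(fields) >= 4 and fields[0] == "gi":
    record.annotations["gi"] = fields[1]  # keep the GI number as an annotation
    record.id = fields[3]                 # use the accession as the ID

print(record.id)
print(record.annotations)
```

The guard leaves records with any other header layout untouched rather than guessing at their structure.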


### SeqRecord objects from GenBank files

In [30]:
from Bio import SeqIO

record = SeqIO.read("NC_005816.gb", "genbank")

print(record)

ID: NC_005816.1
Name: NC_005816
Description: Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence
Database cross-references: Project:58037
Number of features: 41
/molecule_type=DNA
/topology=circular
/data_file_division=BCT
/date=21-JUL-2008
/accessions=['NC_005816']
/sequence_version=1
/gi=45478711
/keywords=['']
/source=Yersinia pestis biovar Microtus str. 91001
/organism=Yersinia pestis biovar Microtus str. 91001
/taxonomy=['Bacteria', 'Proteobacteria', 'Gammaproteobacteria', 'Enterobacteriales', 'Enterobacteriaceae', 'Yersinia']
/references=[Reference(title='Genetics of metabolic variations between Yersinia pestis biovars and the proposal of a new biovar, microtus', ...), Reference(title='Complete genome sequence of Yersinia pestis strain 91001, an isolate avirulent to humans', ...), Reference(title='Direct Submission', ...), Reference(title='Direct Submission', ...)]
/comment=PROVISIONAL REFSEQ: This record has not yet been subject to final
NCBI review. The 

What is missing? 

- Fuzzy positions 
- Location testing
- Sequence described by a feature or location
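
The three open items can be sketched briefly. The snippet below is a minimal, hedged illustration, not a full treatment: the coordinates, the `gene` feature, and the `parent` sequence are made up for the example.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import AfterPosition, BeforePosition, FeatureLocation, SeqFeature

# Fuzzy positions: the feature starts somewhere before 5 and ends somewhere after 9
fuzzy_location = FeatureLocation(BeforePosition(5), AfterPosition(9))
print(fuzzy_location)
# Fuzzy positions can still be converted to plain integers
print(int(fuzzy_location.start), int(fuzzy_location.end))

# Location testing: check whether a 0-based position falls inside a feature
feature = SeqFeature(location=FeatureLocation(5, 18, strand=-1), type="gene")
for position in (4, 5, 17, 18):
    print(position, position in feature.location)

# Sequence described by a feature: extract() slices the parent sequence and,
# because this feature is on the reverse strand, reverse-complements the slice
parent = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
feature_seq = feature.extract(parent)
print(feature_seq)
```

The start of a location is inclusive and the end is exclusive, so position 5 lies inside the feature while position 18 does not.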