# Welcome to Day 5! 

## BLAST in NCBI and creating, reading, and writing alignments

### Section 1: BLAST against the NCBI database

### Section 2: Aligning a set of sequences 

### Section 3: Converting between files

---

## Session summary


For our last day we are going to take things easy and do something everyone loves to do: BLAST genes against NCBI. Afterwards, we will learn how to align the FASTA sequences we've been working with. Finally, we will wrap things up by learning how to convert between some of the file formats we have used (fasta, gbk, fastq, aln).

---


In [18]:
from Bio import SeqIO
from Bio.Seq import Seq
from Bio.Blast.NCBIWWW import qblast
from Bio.Blast import NCBIXML
# Bio.Alphabet is not used below and was removed in Biopython 1.78, so it is not imported here

Let's say we are interested in finding homologs of ermA from Staphylococcus aureus in the NCBI non-redundant database. 

We will use the first sequence in `'mixed_args.fasta'` as our query.

In [159]:
for record in SeqIO.parse('mixed_args.fasta','fasta'):
    break # stop after the first record; `record` now holds it
record

SeqRecord(seq=Seq('ATGAATCAAAAAAATCCTAAAGATACACAAAATTTTATTACATCTAAAAAACAT...CAT', SingleLetterAlphabet()), id='ERMA_STAAR', name='ERMA_STAAR', description='ERMA_STAAR Q6GKQ0 rRNA adenine N-6-methyltransferase OS=Staphylococcus aureus (strain MRSA252) OX=282458 GN=ermA1 PE=3 SV=1', dbxrefs=[])

Let's use `qblast()` to search for homologs of our sequence. [View full qblast options here](https://biopython.org/docs/1.75/api/Bio.Blast.NCBIWWW.html)

In [160]:
# searching for homologs of record.seq in the non-redundant (nr) database using BLASTn. BLAST will only return the top five matches (hits).
blast_results = qblast(program='blastn', database='nr', sequence=record.seq, hitlist_size=5)

# storing the results of the blast search in a file called blast_output.xml 
output_file = open('blast_output.xml', 'w') # the 'w' indicates we are writing to this file.
output_file.write(blast_results.read()) # the actual writing of the file
output_file.close() # closing the file we wrote to
blast_results.close() # closing the blast result
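
The open/write/close steps above can also be written with `with` blocks, which close the file even if an error occurs part-way through. Below is a minimal sketch, assuming only that the search result is a file-like handle with a `.read()` method (which is what `qblast()` returns). The helper name `save_handle` and the stand-in XML text are hypothetical, used so the example runs without contacting NCBI.

```python
import io
import os
import tempfile


def save_handle(handle, path):
    """Write everything in a file-like handle to `path`, then close the handle."""
    with open(path, 'w') as output_file:  # closed automatically at the end of the block
        output_file.write(handle.read())
    handle.close()


# stand-in for the handle returned by qblast(); this is not real BLAST output
fake_results = io.StringIO('<BlastOutput>stand-in text</BlastOutput>')

# write into a temporary folder so no existing file is overwritten
output_path = os.path.join(tempfile.mkdtemp(), 'blast_output.xml')
save_handle(fake_results, output_path)

# read the file back to check what was saved
with open(output_path) as saved_file:
    saved_text = saved_file.read()

print(saved_text)
```

With a real search, the handle returned by `qblast()` would be passed in place of `fake_results`.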

Each BLAST 'hit' (i.e. a sequence in the database that matched our sequence of interest) is an item in the list `blast_record.alignments`, which we get by reading the saved XML file with `NCBIXML.read()`.

Let's check that five hits matching our sequence of interest were found.

In [162]:
blast_result_handle = open('blast_output.xml') # open the blast output file
blast_record = NCBIXML.read(blast_result_handle) # read it into memory

count = 0
for alignment in blast_record.alignments:
    count = count + 1

count

5

We can print the name of each sequence that matched our search sequence.

In [161]:
# re-open and re-read the saved BLAST output so this cell can be run on its own
blast_result_handle = open('blast_output.xml')
blast_record = NCBIXML.read(blast_result_handle)

for alignment in blast_record.alignments:
    print(alignment.title)
    #print(alignment.title.split('>')[0]) # we use split in case some titles are very long. 

gi|1908485171|gb|CP049486.1| Staphylococcus aureus strain pt228 chromosome, complete genome
gi|2041328505|gb|CP071594.1| Staphylococcus aureus strain PNID0137 chromosome, complete genome
gi|2030875670|gb|CP064434.1| Enterococcus faecium strain PR00859-7 plasmid unnamed_2, complete sequence
gi|2006900978|dbj|AP024511.1| Staphylococcus aureus 2007-13 DNA, complete genome
gi|1995591623|gb|CP070983.1| Staphylococcus aureus strain WBG8366 chromosome, complete genome


We can access additional information stored in each alignment with a second, nested for loop. This information is stored in a list called `alignment.hsps`. Each item is a high-scoring segment pair (HSP), a stretch where the query and the hit align well.

Information found in each HSP includes the score, the length of the alignment, which bases did and did not align, and even the alignment itself!

In [164]:
# re-open and re-read the saved BLAST output so this cell can be run on its own
blast_result_handle = open('blast_output.xml')
blast_record = NCBIXML.read(blast_result_handle)

for alignment in blast_record.alignments:
    print(alignment.title.split('>')[0])
    for hsp in alignment.hsps:
        print('score: ', hsp.score)
        print('expected value: ', hsp.expect)
        print('number of exact matches: ', hsp.identities)
        print('number of aligned letters: ', hsp.align_length)
        print('First 100 characters aligned: \n') # for visual clarity only showing first 100 characters
        print(hsp.query[:100])
        print(hsp.match[:100])
        print(hsp.sbjct[:100])
        print('\n')
        break

gi|1908485171|gb|CP049486.1| Staphylococcus aureus strain pt228 chromosome, complete genome
score:  791.0
expected value:  0.0
number of exact matches:  595
number of aligned letters:  728
First 100 characters aligned: 

ATGAATCAAAAAAATCCTAAAGATACACAAAATTTTATTACATCTAAAAAACATGTAAAAGAAATTTTAAATCATACAAATATTTCTAAACAAGATAATG
||||| || ||||| |||||||| || |||||||||||||| |||||||| |||||||||||||| || ||||| || ||||||  ||||||||| || |
ATGAACCAGAAAAACCCTAAAGACACGCAAAATTTTATTACTTCTAAAAAGCATGTAAAAGAAATATTGAATCACACGAATATTAGTAAACAAGACAACG


gi|2041328505|gb|CP071594.1| Staphylococcus aureus strain PNID0137 chromosome, complete genome
score:  786.0
expected value:  0.0
number of exact matches:  594
number of aligned letters:  728
First 100 characters aligned: 

ATGAATCAAAAAAATCCTAAAGATACACAAAATTTTATTACATCTAAAAAACATGTAAAAGAAATTTTAAATCATACAAATATTTCTAAACAAGATAATG
||||| || ||||| |||||||| || |||||||||||||| |||||||| |||||||||||||| || ||||| || |||||   ||||||||| || |
ATGAACCAGAAAAACCCTAAAGACACGCAAAATTTTATTACTTCTAAA

In practice, we may just want to extract certain bits of information from the BLAST search.

For example, we can extract a list of the genomes in which we found a hit.

In [166]:
# re-open and re-read the saved BLAST output so this cell can be run on its own
blast_result_handle = open('blast_output.xml')
blast_record = NCBIXML.read(blast_result_handle)

store_data = []

for alignment in blast_record.alignments:
    hit_title = alignment.title.split('>')
    store_data.append(hit_title)

store_data

[['gi|1908485171|gb|CP049486.1| Staphylococcus aureus strain pt228 chromosome, complete genome'],
 ['gi|2041328505|gb|CP071594.1| Staphylococcus aureus strain PNID0137 chromosome, complete genome'],
 ['gi|2030875670|gb|CP064434.1| Enterococcus faecium strain PR00859-7 plasmid unnamed_2, complete sequence'],
 ['gi|2006900978|dbj|AP024511.1| Staphylococcus aureus 2007-13 DNA, complete genome'],
 ['gi|1995591623|gb|CP070983.1| Staphylococcus aureus strain WBG8366 chromosome, complete genome']]

This list shows us that although we searched for a Staphylococcus aureus sequence, we have found a match in a plasmid contained within an Enterococcus faecium strain.

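We can also estimate how strong each match is. We do this by calculating the identity (the fraction of aligned positions where the two sequences have the same base) and the coverage (the fraction of aligned positions that are not gaps). Note that both values are measured over the aligned region, not over the full length of the query.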

In [176]:
# re-open and re-read the saved BLAST output so this cell can be run on its own
blast_result_handle = open('blast_output.xml')
blast_record = NCBIXML.read(blast_result_handle)

store_data = []

for alignment in blast_record.alignments:
    for hsp in alignment.hsps:
        # make variables for easier readability
        length_sbjct = len(hsp.sbjct)
        length_query = len(hsp.query)
        num_gaps = hsp.gaps
        num_identities = hsp.identities
        
        # calculate identity
        identity = num_identities/length_query

        # calculate coverage
        coverage = (length_sbjct-num_gaps)/length_query
        
        # append to store_data
        store_data.append([alignment.title, identity, coverage])
        break

store_data

[['gi|1908485171|gb|CP049486.1| Staphylococcus aureus strain pt228 chromosome, complete genome',
  0.8173076923076923,
  1.0],
 ['gi|2041328505|gb|CP071594.1| Staphylococcus aureus strain PNID0137 chromosome, complete genome',
  0.8159340659340659,
  1.0],
 ['gi|2030875670|gb|CP064434.1| Enterococcus faecium strain PR00859-7 plasmid unnamed_2, complete sequence',
  0.8159340659340659,
  1.0],
 ['gi|2006900978|dbj|AP024511.1| Staphylococcus aureus 2007-13 DNA, complete genome',
  0.8159340659340659,
  1.0],
 ['gi|1995591623|gb|CP070983.1| Staphylococcus aureus strain WBG8366 chromosome, complete genome',
  0.8159340659340659,
  1.0]]
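
The coverage above is measured against the aligned region only, so it is 1.0 whenever the alignment has no gaps. BLAST's web interface reports query coverage relative to the full query length instead. The following is a rough sketch of that alternative calculation, not the method used in this notebook. The function name `identity_and_query_coverage` is hypothetical, and the query start, end, and length passed in below are made-up values for illustration. With Biopython they would typically come from `hsp.query_start`, `hsp.query_end`, and the query length stored on `blast_record` (check the attribute names against your Biopython version).

```python
def identity_and_query_coverage(identities, align_length, query_start, query_end, query_length):
    """Return (identity, query coverage) as fractions between 0 and 1."""
    # identity: identical positions divided by all aligned positions
    identity = identities / align_length
    # query coverage: aligned stretch of the query divided by the full query length
    # (BLAST coordinates are 1-based and inclusive, hence the + 1)
    coverage = (query_end - query_start + 1) / query_length
    return identity, coverage


# 595 and 728 are taken from the first hit above;
# the coordinates 1..728 and the query length of 732 are hypothetical
identity, coverage = identity_and_query_coverage(595, 728, 1, 728, 732)
print(round(identity, 4), round(coverage, 4))
```

Measured this way, a hit that aligns to only part of the query gets a coverage below 1 even when the alignment itself has no gaps.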

This is just one use of parsing BLAST output with Biopython.

Another use is to extract all the sequences returned by BLAST and write them to a new file.

---

### Exercise 1a:

Use `translate()` on `record.seq` so that it holds an amino acid sequence.

In [177]:
for record in SeqIO.parse('mixed_args.fasta','fasta'):
    break # stop after the first record

record.seq = record.seq.___()
record

SeqRecord(seq=Seq('MNQKNPKDTQNFITSKKHVKEILNHTNISKQDNVIEIGSGKGHFTKELVKMSRS...LFH', ExtendedIUPACProtein()), id='ERMA_STAAR', name='ERMA_STAAR', description='ERMA_STAAR Q6GKQ0 rRNA adenine N-6-methyltransferase OS=Staphylococcus aureus (strain MRSA252) OX=282458 GN=ermA1 PE=3 SV=1', dbxrefs=[])

### Exercise 1b:

Set the BLAST program to `'blastp'` and the database to `'nr'`. Set the maximum number of hits to 4.

Afterwards, store the blast output in the file `'day5_1b_output.xml'` 

In [178]:
blast_results = qblast(program='___', database='___', sequence=record.seq, hitlist_size=___)

output_file = open('___.xml', 'w')
output_file.write(blast_results.read())
output_file.close()
blast_results.close()

### Exercise 1c:

Print the title of each alignment. Is there something unexpected?

In [181]:
# re-open and re-read the saved BLAST output so this cell can be run on its own
blast_result_handle = open('day5_1b_output.xml')
blast_record = NCBIXML.read(blast_result_handle)

for alignment in blast_record.alignments:
    print(alignment.___)
    print('\n')

gb|ETJ08129.1| rRNA adenine N-6-methyltransferase, partial [Streptococcus parasanguinis DORA_23_24]


gb|EXM57106.1| rRNA adenine N-6-methyltransferase [Staphylococcus aureus DAR133]


ref|WP_001072201.1| MULTISPECIES: 23S rRNA (adenine(2058)-N(6))-methyltransferase Erm(A) [Bacilli] >sp|P0A0H1.1| RecName: Full=rRNA adenine N-6-methyltransferase; AltName: Full=Erythromycin resistance protein; AltName: Full=Macrolide-lincosamide-streptogramin B resistance protein [Staphylococcus aureus subsp. aureus Mu50] >sp|P0A0H2.1| RecName: Full=rRNA adenine N-6-methyltransferase; AltName: Full=Erythromycin resistance protein; AltName: Full=Macrolide-lincosamide-streptogramin B resistance protein [Staphylococcus aureus subsp. aureus N315] >sp|P0A0H3.1| RecName: Full=rRNA adenine N-6-methyltransferase; AltName: Full=Erythromycin resistance protein; AltName: Full=Macrolide-lincosamide-streptogramin B resistance protein [Staphylococcus aureus] >sp|Q6GKQ0.1| RecName: Full=rRNA adenine N-6-methyltransfera

### Exercise 1d:

If we look at the above output, we can see that the third alignment title is marked MULTISPECIES. This means that an identical sequence is found in multiple species. When this happens, NCBI gives the alignment one long title that joins the titles of all the identical entries. But we just want a short identifier.

We can see that the entries in the MULTISPECIES title are separated by a `>`. We can shorten the alignment title by using the `split()` method we went over in Day 4.
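
As a quick illustration of what `split('>')` does to such a title, here is a sketch using a version of the MULTISPECIES title above that has been shortened by hand. The variable names are only for illustration. Adding `strip()` removes the space that is left at the end of the first piece.

```python
# shortened by hand from the third title above
title = ('ref|WP_001072201.1| MULTISPECIES: 23S rRNA (adenine(2058)-N(6))-methyltransferase Erm(A) [Bacilli] '
         '>sp|P0A0H1.1| RecName: Full=rRNA adenine N-6-methyltransferase [Staphylococcus aureus subsp. aureus Mu50]')

pieces = title.split('>')        # one list item per '>'-separated entry
short_title = pieces[0].strip()  # first entry, without the trailing space

print(len(pieces))
print(short_title)
```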


In [184]:
# re-open and re-read the saved BLAST output so this cell can be run on its own
blast_result_handle = open('day5_1b_output.xml')
blast_record = NCBIXML.read(blast_result_handle)

for ____ in ___.alignments:
    # use split()
    new_alignment_title = alignment.title.___('>')

    #select the first item in new_alignment_title, assign it back to new_alignment_title
    new_alignment_title = ___[0]

    # print the new title
    print(new_alignment_title)
    print('\n') # for clarity

gb|ETJ08129.1| rRNA adenine N-6-methyltransferase, partial [Streptococcus parasanguinis DORA_23_24]


gb|EXM57106.1| rRNA adenine N-6-methyltransferase [Staphylococcus aureus DAR133]


ref|WP_001072201.1| MULTISPECIES: 23S rRNA (adenine(2058)-N(6))-methyltransferase Erm(A) [Bacilli] 


gb|ALY18984.1| rRNA adenine N-6-methyltransferase [Staphylococcus aureus] 


emb|CAC5857653.1| rRNA adenine N-6-methyltransferase [Staphylococcus aureus]




### Exercise 1f:

Now we want to extract the new alignment title and the sequence for each hit so that we can write them to a new file.

Convert the `hsp.sbjct` sequence into a `Seq()` object.

Afterwards, make a `SeqRecord` object from the extracted sequence and the new alignment title.

Store each `SeqRecord` in the list `data_store`.

In [195]:
from Bio.SeqRecord import SeqRecord

data_store = []

# re-open and re-read the saved BLAST output so this cell can be run on its own
blast_result_handle = open('day5_1b_output.xml')
blast_record = NCBIXML.read(blast_result_handle)

for alignment in blast_record.alignments:
    # use split()
    new_alignment_title = alignment.title.___('>')

    #select the first item in new_alignment_title, assign it back to new_alignment_title
    new_alignment_title = ___[0]

    # assign hsp.sbjct to sbjct_sequence
    for hsp in alignment.hsps:
        sbjct_sequence = hsp.___
        break
    
    # Make a Seq object called extracted_sequence 
    extracted_sequence = Seq(___)

    # make a SeqRecord object called extracted_record. The first item will be the sequence, id is the new alignment title
    extracted_record = SeqRecord(___, id = ___, description = '')

    # store the SeqRecord in a list
    data_store.append(extracted_record)
data_store

[SeqRecord(seq=Seq('MNQKNPKDTQNFITSKKHVKEILNHTNISKQDNVIEIGSGKGHFTKELVKMSRS...LFH'), id='gb|ETJ08129.1| rRNA adenine N-6-methyltransferase, partial [Streptococcus parasanguinis DORA_23_24]', name='<unknown name>', description='', dbxrefs=[]),
 SeqRecord(seq=Seq('MNQKNPKDTQNFITSKKHVKEILNHTNISKQDNVIEIGSGKGHFTKELVKMSRS...LFH'), id='gb|EXM57106.1| rRNA adenine N-6-methyltransferase [Staphylococcus aureus DAR133]', name='<unknown name>', description='', dbxrefs=[]),
 SeqRecord(seq=Seq('MNQKNPKDTQNFITSKKHVKEILNHTNISKQDNVIEIGSGKGHFTKELVKMSRS...LFH'), id='ref|WP_001072201.1| MULTISPECIES: 23S rRNA (adenine(2058)-N(6))-methyltransferase Erm(A) [Bacilli] ', name='<unknown name>', description='', dbxrefs=[]),
 SeqRecord(seq=Seq('MNQKNPKDTQNFITSKKHVKEILNHTNISKQDNVIEIGSGKGHFTKELVKMSRS...LFH'), id='gb|ALY18984.1| rRNA adenine N-6-methyltransferase [Staphylococcus aureus] ', name='<unknown name>', description='', dbxrefs=[]),
 SeqRecord(seq=Seq('MNQKNPKDTQNFITSKKHVKEILNHTNISKQDNVIEIGSGKGHFTKELVKMSRS..
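
One thing to keep in mind when reusing `hsp.sbjct`: it is the subject sequence as it appears in the alignment, so it can contain `-` gap characters and only covers the aligned region of the hit. If an unaligned FASTA file is wanted, the gaps could be removed first. Here is a minimal sketch with a made-up aligned fragment (not taken from the BLAST results above):

```python
# hypothetical aligned subject string containing two gap characters
aligned_sbjct = 'MNQKNP--KDTQNFITSKKH'

# remove the gap characters so only the residues remain
ungapped_sbjct = aligned_sbjct.replace('-', '')

print(ungapped_sbjct)
```

The resulting string could then be passed to `Seq()` as in the exercise.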

### Exercise 1g:

Use `SeqIO.write()` to write each `SeqRecord` in `data_store` to a file called `'day5_1g.fasta'`.



In [196]:
SeqIO.write(data_store,'day5_1g.fasta','fasta')

5

---

If we were to do the above with a lot of hits, like 500, we might also want to set some thresholds (for example on the E-value or the identity) so that only strong matches are kept.
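
Below is a minimal sketch of what such a filter might look like. The cut-off values, the function name `passes_thresholds`, and the third hit in the list are hypothetical and chosen only for illustration. The first two entries reuse the E-value and identity numbers from the BLASTn search earlier in this section.

```python
def passes_thresholds(hit, max_evalue=1e-10, min_identity=0.80):
    """Return True if a hit is at or below the E-value cut-off and at or above the identity cut-off."""
    return hit['evalue'] <= max_evalue and hit['identity'] >= min_identity


hits = [
    {'title': 'CP049486.1', 'evalue': 0.0, 'identity': 595 / 728},
    {'title': 'CP071594.1', 'evalue': 0.0, 'identity': 594 / 728},
    {'title': 'hypothetical_weak_hit', 'evalue': 0.5, 'identity': 0.42},  # made up
]

# keep only the hits that pass both thresholds
kept_hits = [hit for hit in hits if passes_thresholds(hit)]

for hit in kept_hits:
    print(hit['title'])
```

In the BLAST loops above, `hsp.expect` and `hsp.identities / hsp.align_length` would supply these two values for each hit.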



#  Section 2: Aligning a set of sequences