# Welcome to Day 5! 

## BLAST-ing against the NCBI database

### Section 1: BLAST with genes

### Section 2: BLAST with fastq reads

---

## Session summary


For our last day we are going to take things easy and do something everyone loves to do: BLAST genes against NCBI. Afterwards, we will convert fastq reads to fasta format and BLAST those as well.

---


# Section 1: BLAST with genes

In [18]:
# import all our previously used modules

from Bio import SeqIO
from Bio.Seq import Seq
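from Bio.Alphabet import IUPAC # note: Bio.Alphabet was removed in Biopython 1.78; drop this line on newer versions (it is not used below)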

# import Biopython modules for BLAST
from Bio.Blast.NCBIWWW import qblast
from Bio.Blast import NCBIXML

Let's say we are interested in finding homologs of ermA from Staphylococcus aureus in the NCBI non-redundant database. 

We have a fasta file called `single_seq.fasta`, which contains only one ermA sequence. We use `SeqIO.read()` (instead of `SeqIO.parse()`) because there is only one sequence.

In [199]:
record = SeqIO.read('single_seq.fasta','fasta')
record

SeqRecord(seq=Seq('ATGAATCAAAAAAATCCTAAAGATACACAAAATTTTATTACATCTAAAAAACAT...CAT', SingleLetterAlphabet()), id='ERMA_STAAR', name='ERMA_STAAR', description='ERMA_STAAR Q6GKQ0 rRNA adenine N-6-methyltransferase OS=Staphylococcus aureus (strain MRSA252) OX=282458 GN=ermA1 PE=3 SV=1', dbxrefs=[])

Let's use `qblast()` to search for homologs of our sequence. [View full qblast options here](https://biopython.org/docs/1.75/api/Bio.Blast.NCBIWWW.html)

In [160]:
# search for homologs of record.seq in the non-redundant (nr) database using BLASTn. hitlist_size=5 asks BLAST to return only the top five matches (hits).
# these are just some of the options we can specify in our search; the first three (program, database, sequence) are mandatory.
blast_results = qblast(program='blastn', database='nr', sequence=record.seq, hitlist_size=5) 

# storing the results of the blast search in a file called blast_output.xml 
output_file = open('blast_output.xml', 'w') # the 'w' indicates we are writing to this file.
output_file.write(blast_results.read()) # write blast_results to output_file
output_file.close() # closing the file we wrote to
blast_results.close() # closing the blast result

Each BLAST 'hit' (i.e. a sequence in the database that matched our sequence of interest) is an item in the list `blast_record.alignments`.

Let's check that five hits matching our sequence of interest were found.

In [200]:
blast_result_handle = open('blast_output.xml') # open the blast output file
blast_record = NCBIXML.read(blast_result_handle) # read it into memory

count = 0
for alignment in blast_record.alignments:
    count = count + 1

count

5

We can print the name of each sequence that matched our search sequence.

In [201]:
# re-open and re-read the output file so that this cell can be run on its own
blast_result_handle = open('blast_output.xml')
blast_record = NCBIXML.read(blast_result_handle)

for alignment in blast_record.alignments:
    print(alignment.title)

gi|1908485171|gb|CP049486.1| Staphylococcus aureus strain pt228 chromosome, complete genome
gi|2041328505|gb|CP071594.1| Staphylococcus aureus strain PNID0137 chromosome, complete genome
gi|2030875670|gb|CP064434.1| Enterococcus faecium strain PR00859-7 plasmid unnamed_2, complete sequence
gi|2006900978|dbj|AP024511.1| Staphylococcus aureus 2007-13 DNA, complete genome
gi|1995591623|gb|CP070983.1| Staphylococcus aureus strain WBG8366 chromosome, complete genome


We can access additional information stored in each alignment with a second for loop. Each alignment holds a list called `alignment.hsps` of high-scoring segment pairs (HSPs), the individual aligned regions between the query and the hit.

Information found in each HSP includes the length of the alignment, which bases did and did not match, and even the alignment itself!

In [164]:
# re-open and re-read the output file so that this cell can be run on its own
blast_result_handle = open('blast_output.xml')
blast_record = NCBIXML.read(blast_result_handle)

for alignment in blast_record.alignments:
    print(alignment.title.split('>')[0])
    for hsp in alignment.hsps:
        print('score: ', hsp.score)
        print('expected value: ', hsp.expect)
        print('number of exact matches: ', hsp.identities)
        print('number of aligned letters: ', hsp.align_length)
        print('First 100 characters aligned: \n') # for visual clarity only showing first 100 characters
        print(hsp.query[:100])
        print(hsp.match[:100])
        print(hsp.sbjct[:100])
        print('\n')
        break

gi|1908485171|gb|CP049486.1| Staphylococcus aureus strain pt228 chromosome, complete genome
score:  791.0
expected value:  0.0
number of exact matches:  595
number of aligned letters:  728
First 100 characters aligned: 

ATGAATCAAAAAAATCCTAAAGATACACAAAATTTTATTACATCTAAAAAACATGTAAAAGAAATTTTAAATCATACAAATATTTCTAAACAAGATAATG
||||| || ||||| |||||||| || |||||||||||||| |||||||| |||||||||||||| || ||||| || ||||||  ||||||||| || |
ATGAACCAGAAAAACCCTAAAGACACGCAAAATTTTATTACTTCTAAAAAGCATGTAAAAGAAATATTGAATCACACGAATATTAGTAAACAAGACAACG


gi|2041328505|gb|CP071594.1| Staphylococcus aureus strain PNID0137 chromosome, complete genome
score:  786.0
expected value:  0.0
number of exact matches:  594
number of aligned letters:  728
First 100 characters aligned: 

ATGAATCAAAAAAATCCTAAAGATACACAAAATTTTATTACATCTAAAAAACATGTAAAAGAAATTTTAAATCATACAAATATTTCTAAACAAGATAATG
||||| || ||||| |||||||| || |||||||||||||| |||||||| |||||||||||||| || ||||| || |||||   ||||||||| || |
ATGAACCAGAAAAACCCTAAAGACACGCAAAATTTTATTACTTCTAAA
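
The middle line of each printed alignment (`hsp.match`) marks the positions where the query and the subject agree. As a rough illustration (a sketch of the idea, not Biopython's or BLAST's own code), the helper below, whose name `match_line` is made up for this example, rebuilds that line for two aligned nucleotide strings and counts the identical positions.

```python
def match_line(query, sbjct):
    """Return a '|' for every identical position and a space otherwise."""
    if len(query) != len(sbjct):
        raise ValueError("aligned strings must have the same length")
    # a gap character ('-') never counts as a match
    return "".join("|" if q == s and q != "-" else " " for q, s in zip(query, sbjct))


# first ten aligned characters of the top hit shown above
query = "ATGAATCAAA"
sbjct = "ATGAACCAGA"

midline = match_line(query, sbjct)
identical = midline.count("|")

print(query)
print(midline)
print(sbjct)
print("identical positions:", identical, "of", len(query))
```

Counting the `|` characters over the whole alignment gives the same kind of number that `hsp.identities` reports.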

In practice, we may just want to extract certain bits of information from the BLAST search.

For example, we can extract a list of the genomes in which we found a hit.

In [166]:
# re-open and re-read the output file so that this cell can be run on its own
blast_result_handle = open('blast_output.xml')
blast_record = NCBIXML.read(blast_result_handle)

store_data = []

for alignment in blast_record.alignments:
    hit_title = alignment.title.split('>')
    store_data.append(hit_title)

store_data

[['gi|1908485171|gb|CP049486.1| Staphylococcus aureus strain pt228 chromosome, complete genome'],
 ['gi|2041328505|gb|CP071594.1| Staphylococcus aureus strain PNID0137 chromosome, complete genome'],
 ['gi|2030875670|gb|CP064434.1| Enterococcus faecium strain PR00859-7 plasmid unnamed_2, complete sequence'],
 ['gi|2006900978|dbj|AP024511.1| Staphylococcus aureus 2007-13 DNA, complete genome'],
 ['gi|1995591623|gb|CP070983.1| Staphylococcus aureus strain WBG8366 chromosome, complete genome']]

This list shows us that although we searched for a Staphylococcus aureus sequence, we have found a match in a plasmid contained within an Enterococcus faecium strain.

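We can calculate the strength of homology of each match. We do this by calculating the identity (the fraction of aligned positions at which the two sequences have identical bases) and the coverage (the fraction of aligned positions that are not gaps). Note that the code below divides by the length of the aligned query string (`hsp.query`), so both values describe the aligned region only, not the full-length query.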

In [176]:
# re-open and re-read the output file so that this cell can be run on its own
blast_result_handle = open('blast_output.xml')
blast_record = NCBIXML.read(blast_result_handle)

store_data = []

for alignment in blast_record.alignments:
    for hsp in alignment.hsps:
        # make variables for easier readability
        length_sbjct = len(hsp.sbjct)
        length_query = len(hsp.query)
        num_gaps = hsp.gaps
        num_identities = hsp.identities
        
        # calculate identity
        identity = num_identities/length_query

        # calculate coverage
        coverage = (length_sbjct-num_gaps)/length_query
        
        # append to store_data
        store_data.append([alignment.title, identity, coverage])
        break

store_data

[['gi|1908485171|gb|CP049486.1| Staphylococcus aureus strain pt228 chromosome, complete genome',
  0.8173076923076923,
  1.0],
 ['gi|2041328505|gb|CP071594.1| Staphylococcus aureus strain PNID0137 chromosome, complete genome',
  0.8159340659340659,
  1.0],
 ['gi|2030875670|gb|CP064434.1| Enterococcus faecium strain PR00859-7 plasmid unnamed_2, complete sequence',
  0.8159340659340659,
  1.0],
 ['gi|2006900978|dbj|AP024511.1| Staphylococcus aureus 2007-13 DNA, complete genome',
  0.8159340659340659,
  1.0],
 ['gi|1995591623|gb|CP070983.1| Staphylococcus aureus strain WBG8366 chromosome, complete genome',
  0.8159340659340659,
  1.0]]

Surprisingly, all five hits show roughly 82% identity (81.6% to 81.7%) and 100% coverage of the aligned region, including the plasmid-borne version.
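
The coverage computed above is relative to the aligned region. If coverage of the whole query is of interest, one option is to divide the aligned query span by the full query length. The sketch below uses plain numbers so that it runs without a BLAST result. The function names are made up for this example, and the query coordinates and query length are hypothetical values (only 595 and 728 come from the output above). With Biopython, the corresponding inputs would be `hsp.identities`, `hsp.align_length`, `hsp.query_start`, `hsp.query_end`, and the length of the original query, for example `len(record.seq)`.

```python
def percent_identity(identities, align_length):
    """Fraction of alignment columns with identical letters."""
    return identities / align_length


def query_coverage(query_start, query_end, query_length):
    """Fraction of the full query spanned by the alignment (1-based, inclusive coordinates)."""
    return (abs(query_end - query_start) + 1) / query_length


# 595 identities over 728 aligned letters, as reported for the top hit
identity = percent_identity(595, 728)

# hypothetical coordinates: alignment spans query positions 1-728 of a 732-base query
coverage = query_coverage(1, 728, 732)

print(f"identity: {identity:.3f}")
print(f"query coverage: {coverage:.3f}")
```

Keeping the two measures separate makes it easier to tell a short but near-perfect match from a long but divergent one.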

This is just one use of parsing BLAST outputs with Biopython.

Another is to extract all the sequences returned by BLAST and write them to a new file.

---

### Exercise 1a

Use `SeqIO.read()` to read in the only sequence in `single_seq.fasta`

Then use `translate()` on `record.seq` to convert it to an amino acid sequence.

In [177]:
record = SeqIO.___('___.fasta','fasta')

record.seq = record.seq.___()
record

SeqRecord(seq=Seq('MNQKNPKDTQNFITSKKHVKEILNHTNISKQDNVIEIGSGKGHFTKELVKMSRS...LFH', ExtendedIUPACProtein()), id='ERMA_STAAR', name='ERMA_STAAR', description='ERMA_STAAR Q6GKQ0 rRNA adenine N-6-methyltransferase OS=Staphylococcus aureus (strain MRSA252) OX=282458 GN=ermA1 PE=3 SV=1', dbxrefs=[])

### Exercise 1b

Specify the `qblast()` program to be `'blastp'`, with the `'nr'` database. Set the maximum number of hits to `4`.

Afterwards, store the blast output in the file `'day5_1b_output.xml'` 

In [178]:
blast_results = qblast(program='___', database='___', sequence=record.seq, hitlist_size=___)

output_file = open('___.xml', 'w')
output_file.write(blast_results.read())
output_file.close()
blast_results.close()

### Exercise 1c:

Read the contents of `blast_result_handle` with `NCBIXML.read` and assign the contents to `blast_record`.

Print the `alignment.title` for each alignment in `blast_record.alignments`. Is there something unexpected?

In [181]:
# re-open and re-read the output file so that this cell can be run on its own
blast_result_handle = open('day5_1b_output.xml')
___ = ___.read(blast_result_handle)

for alignment in blast_record.alignments:
    print(alignment.___)
    print('\n')

gb|ETJ08129.1| rRNA adenine N-6-methyltransferase, partial [Streptococcus parasanguinis DORA_23_24]


gb|EXM57106.1| rRNA adenine N-6-methyltransferase [Staphylococcus aureus DAR133]


ref|WP_001072201.1| MULTISPECIES: 23S rRNA (adenine(2058)-N(6))-methyltransferase Erm(A) [Bacilli] >sp|P0A0H1.1| RecName: Full=rRNA adenine N-6-methyltransferase; AltName: Full=Erythromycin resistance protein; AltName: Full=Macrolide-lincosamide-streptogramin B resistance protein [Staphylococcus aureus subsp. aureus Mu50] >sp|P0A0H2.1| RecName: Full=rRNA adenine N-6-methyltransferase; AltName: Full=Erythromycin resistance protein; AltName: Full=Macrolide-lincosamide-streptogramin B resistance protein [Staphylococcus aureus subsp. aureus N315] >sp|P0A0H3.1| RecName: Full=rRNA adenine N-6-methyltransferase; AltName: Full=Erythromycin resistance protein; AltName: Full=Macrolide-lincosamide-streptogramin B resistance protein [Staphylococcus aureus] >sp|Q6GKQ0.1| RecName: Full=rRNA adenine N-6-methyltransfera

### Exercise 1d

If we look at the above output, we can see that the third alignment title has the term MULTISPECIES in it. This term denotes an identical sequence that is found in multiple species. When this happens, NCBI reports its `alignment.title` as one long string containing all the different entries.

But we just want to shorten the name for our purposes.

We can see that the `alignment.title` with the MULTISPECIES term has each entry separated by a `>`. We can split the contents of `alignment.title` using `split()` and keep only the first entry that is found.
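
As a small illustration of the idea on a plain string (the title below is an abbreviated stand-in for a real MULTISPECIES title, and `first_entry` is a name chosen for this example), `split('>')` returns a list of entries and index `0` picks the first. Adding `strip()` is optional; it removes the trailing space left in front of the `>`.

```python
def first_entry(title):
    """Keep only the text before the first '>' and trim surrounding whitespace."""
    return title.split(">")[0].strip()


# abbreviated stand-in for a MULTISPECIES alignment title
long_title = (
    "ref|WP_001072201.1| MULTISPECIES: Erm(A) [Bacilli] "
    ">sp|P0A0H1.1| RecName: Full=rRNA adenine N-6-methyltransferase [Staphylococcus aureus]"
)

print(long_title.split(">"))   # list with one item per entry
print(first_entry(long_title))
```

Titles without a `>` pass through unchanged, so the same helper can be applied to every hit.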


In [184]:
# re-open and re-read the output file so that this cell can be run on its own
blast_result_handle = open('day5_1b_output.xml')
___ = ___.read(blast_result_handle)

for ___ in blast_record.alignments:
    # use split() on alignment.title
    new_alignment_title = alignment.title.___('>')

    #select the first item in new_alignment_title, assign it back to new_alignment_title
    new_alignment_title = ___[0]

    # print the new title
    print(new_alignment_title)
    print('\n') # for clarity

gb|ETJ08129.1| rRNA adenine N-6-methyltransferase, partial [Streptococcus parasanguinis DORA_23_24]


gb|EXM57106.1| rRNA adenine N-6-methyltransferase [Staphylococcus aureus DAR133]


ref|WP_001072201.1| MULTISPECIES: 23S rRNA (adenine(2058)-N(6))-methyltransferase Erm(A) [Bacilli] 


gb|ALY18984.1| rRNA adenine N-6-methyltransferase [Staphylococcus aureus] 


emb|CAC5857653.1| rRNA adenine N-6-methyltransferase [Staphylococcus aureus]




### Exercise 1f

Now we want to extract the new alignment title and the sequence for each hit so that we can write them to a `fasta` file.

Convert the `hsp.sbjct` sequence into a `Seq()` object.

Afterwards, make a `SeqRecord` object from the extracted sequence and the extracted title.

Store each resulting `SeqRecord` in the list `data_store`.

Read the code comments for clues!
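
One thing to keep in mind: `hsp.sbjct` is the aligned portion of the hit, so it can contain gap characters (`-`) and may be shorter than the full database sequence. If gaps are not wanted in the fasta file, they can be stripped from the string before it is turned into a `Seq`. A minimal sketch on plain strings, with `ungap` being a helper name made up for this example and the aligned string invented for illustration:

```python
def ungap(aligned_sequence):
    """Remove alignment gap characters from an aligned sequence string."""
    return aligned_sequence.replace("-", "")


aligned = "MNQK--NPKDTQ-NF"   # invented aligned string with gaps
print(ungap(aligned))
```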

In [195]:
from Bio.SeqRecord import SeqRecord

data_store = []

# re-open and re-read the output file so that this cell can be run on its own
blast_result_handle = open('day5_1b_output.xml')
___ = ___.read(blast_result_handle)

for alignment in blast_record.alignments:
    # use split()
    new_alignment_title = alignment.title.___('>')

    #select the first item in new_alignment_title, assign it back to new_alignment_title
    new_alignment_title = ___[0]

    # assign hsp.sbjct to sbjct_sequence
    for hsp in alignment.hsps:
        sbjct_sequence = hsp.___
        break
    
    # Make a Seq object out of sbjct_sequence, Assign it to extracted_sequence 
    extracted_sequence = Seq(___)

    # make a SeqRecord object called extracted_record. The first item will be the sequence, id is the new alignment title
    extracted_record = SeqRecord(___, id = ___, description = '')

    # store the SeqRecord in a list
    data_store.append(extracted_record)
data_store

[SeqRecord(seq=Seq('MNQKNPKDTQNFITSKKHVKEILNHTNISKQDNVIEIGSGKGHFTKELVKMSRS...LFH'), id='gb|ETJ08129.1| rRNA adenine N-6-methyltransferase, partial [Streptococcus parasanguinis DORA_23_24]', name='<unknown name>', description='', dbxrefs=[]),
 SeqRecord(seq=Seq('MNQKNPKDTQNFITSKKHVKEILNHTNISKQDNVIEIGSGKGHFTKELVKMSRS...LFH'), id='gb|EXM57106.1| rRNA adenine N-6-methyltransferase [Staphylococcus aureus DAR133]', name='<unknown name>', description='', dbxrefs=[]),
 SeqRecord(seq=Seq('MNQKNPKDTQNFITSKKHVKEILNHTNISKQDNVIEIGSGKGHFTKELVKMSRS...LFH'), id='ref|WP_001072201.1| MULTISPECIES: 23S rRNA (adenine(2058)-N(6))-methyltransferase Erm(A) [Bacilli] ', name='<unknown name>', description='', dbxrefs=[]),
 SeqRecord(seq=Seq('MNQKNPKDTQNFITSKKHVKEILNHTNISKQDNVIEIGSGKGHFTKELVKMSRS...LFH'), id='gb|ALY18984.1| rRNA adenine N-6-methyltransferase [Staphylococcus aureus] ', name='<unknown name>', description='', dbxrefs=[]),
 SeqRecord(seq=Seq('MNQKNPKDTQNFITSKKHVKEILNHTNISKQDNVIEIGSGKGHFTKELVKMSRS..

### Exercise 1g

Use `SeqIO.write()` to write the contents of `data_store` to a file called `'day5_1g.fasta'`. Specify the output file to be of type `'fasta'`



In [196]:
SeqIO.___(___,'day5_1g.fasta','___')

5

# Section 2: BLAST with fastq reads

One quick method for checking whether your reads actually belong to the organism you are studying is to BLAST some of the reads against the NCBI database.

Unlike in our code above, where we supplied `qblast()` with just **one** sequence to BLAST, here we want to BLAST several sequences and have each give its own result.

Before we can do that, though, we need to convert our reads from `fastq` format to `fasta` format.


We will start by reading in the fastq file `short_reads.fastq`, which contains reads from whole-genome sequencing of a Pseudomonas monteilii strain. 

We will extract their `record.id` and `record.seq` and store them in a list.

In [218]:
# empty list to store modified sequence records
extracted_seqs = []

for record in SeqIO.parse('short_reads.fastq', 'fastq'):
    # get the id and sequence assigned to two new variables
    read_id = record.id
    dna_seq = record.seq
    
    # format them into a SeqRecord. keep description empty
    new_record = SeqRecord(dna_seq, id = read_id, description = '')

    # append to extracted_seqs
    extracted_seqs.append(new_record)

    print(new_record)


SeqIO.write(extracted_seqs, 'short_fasta_reads.fasta','fasta')

ID: M05164:33:000000000-BGDN6:1:1101:9154:2416
Name: <unknown name>
Number of features: 0
Seq('GCCCAGGTTCAACAACATCTCCTCGCGCTGCGGCTTCAGGTATTGGTGTCCGAC...CCA', SingleLetterAlphabet())
ID: M05164:33:000000000-BGDN6:1:1101:16178:2450
Name: <unknown name>
Number of features: 0
Seq('GTATCAATCAACGTGCGGCGCTGATCGTTACCTTCAGAGTAACTGTTCCGAGCA...CCA', SingleLetterAlphabet())
ID: M05164:33:000000000-BGDN6:1:1101:16201:2387
Name: <unknown name>
Number of features: 0
Seq('CCCAGTATGAATAGATATTAAACAAGCACCATCACCTTAAAAATCACTGGTGCC...CCA', SingleLetterAlphabet())
ID: M05164:33:000000000-BGDN6:1:1101:24424:2376
Name: <unknown name>
Number of features: 0
Seq('CGTCTCCTGTCTCAGGTGGACTGGGTCATCCATGCCGCCGCCATTACACGGCTG...GCC', SingleLetterAlphabet())


4
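
For intuition about what the conversion involves: a fastq record has four lines (an `@` header, the sequence, a `+` separator, and the quality string), while a fasta record keeps only a `>` header and the sequence. The sketch below is a plain-Python illustration, not how `SeqIO` is implemented. It assumes simple four-line records, and the function name and the two example reads are invented.

```python
def fastq_to_fasta(fastq_text):
    """Convert four-line fastq records to fasta text, dropping the quality lines."""
    lines = fastq_text.strip().splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per fastq record")

    fasta_lines = []
    for i in range(0, len(lines), 4):
        header, sequence = lines[i], lines[i + 1]
        if not header.startswith("@"):
            raise ValueError(f"record header should start with '@': {header}")
        read_id = header[1:].split()[0]   # keep the id, drop any description
        fasta_lines.append(">" + read_id)
        fasta_lines.append(sequence)
    return "\n".join(fasta_lines) + "\n"


# two invented reads for illustration
example_fastq = "@read_1 sample\nACGTAC\n+\nIIIIII\n@read_2\nGGCCTA\n+\nIIHHGG\n"
print(fastq_to_fasta(example_fastq))
```

For real data, `SeqIO` remains the safer choice because it validates the records as it reads them.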

Now that we have our reads in fasta format, we can supply all the sequences at once to `qblast()`.

We are setting our `hitlist_size` to 1. This is because we want to keep our results simple: one BLAST hit per read.

In [219]:
# open short read fasta sequences using open() 
read_seqs = open('short_fasta_reads.fasta')
# place the contents of read_seqs into memory with read()
read_seqs_memory = read_seqs.read()
# supply qblast parameters
blast_results = qblast(program='blastn', database='nr', sequence=read_seqs_memory, hitlist_size=1)

# write the contents of the blast result to a file called 'short_read_blast_results.xml'
output_file = open('short_read_blast_results.xml', 'w')
output_file.write(blast_results.read())

# close the output file
output_file.close()
# close the blast_results file
blast_results.close()
# close the fasta file
read_seqs.close()

In [248]:
blast_result_handle = open('short_read_blast_results.xml') # open the blast output file
blast_records = NCBIXML.parse(blast_result_handle) # use parse to read a blast result with more than one query

for blast_record in blast_records:
    print('Read ID: ', blast_record.query)
    for alignment in blast_record.alignments:
        print('Hit title:', alignment.title)
    print('\n')

Read ID:  M05164:33:000000000-BGDN6:1:1101:9154:2416
Hit title: gi|982198872|gb|CP013997.1| Pseudomonas monteilii strain USDA-ARS-USMARC-56711, complete genome


Read ID:  M05164:33:000000000-BGDN6:1:1101:16178:2450
Hit title: gi|1963973487|emb|LR991037.1| Cosmia trapezina genome assembly, chromosome: 18


Read ID:  M05164:33:000000000-BGDN6:1:1101:16201:2387
Hit title: gi|982198872|gb|CP013997.1| Pseudomonas monteilii strain USDA-ARS-USMARC-56711, complete genome


Read ID:  M05164:33:000000000-BGDN6:1:1101:24424:2376
Hit title: gi|1524514342|gb|CP027762.1| Pseudomonas sp. LBUM920 chromosome, complete genome




We expected to see only Pseudomonas matches. However, the top hit for the second read is Cosmia trapezina. In the next exercise, let's re-run our BLAST search and allow up to 10 matches per query.
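
A practical note on `NCBIXML.parse()`: it yields one record at a time, so the returned object can only be looped over once, which is why the output file is opened and parsed again in each of the following cells. The plain-Python sketch below mimics that behaviour with a generator; `fake_parse` and the read names are stand-ins invented for this example, not Biopython code.

```python
def fake_parse(read_ids):
    """Stand-in for a lazy parser: yields one item at a time."""
    for read_id in read_ids:
        yield read_id


records = fake_parse(["read_1", "read_2", "read_3"])

first_pass = [r for r in records]    # consumes the generator
second_pass = [r for r in records]   # nothing left to iterate over

print(len(first_pass), len(second_pass))

# wrapping the generator in list() keeps the items for repeated use
stored = list(fake_parse(["read_1", "read_2", "read_3"]))
print(len(stored), len(stored))
```

Storing the records in a list trades memory for convenience, which is usually fine for a handful of reads.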

---

### Exercise 2a

Submit the sequences in `'short_fasta_reads.fasta'` to `qblast()`.

Use program `'blastn'`, the `'nr'` database, sequence `read_seqs_memory`, and set `hitlist_size` to `10`.

Write the contents to a file called `'day5_2a.xml'`


In [232]:
# open short read fasta sequences using open() 
read_seqs = open('short_fasta_reads.fasta')
# place the contents of read_seqs into memory with read()
read_seqs_memory = read_seqs.read()
# supply qblast parameters
blast_results = qblast(program='___', database='___', sequence=___, hitlist_size=___)

# write the contents of blast_results to a file called 'day5_2a.xml'
output_file = open('day5_2a.xml', 'w')
output_file.write(___.read())

# close the output file
output_file.___()
# close the blast_results file
___.close()
# close the fasta file
read_seqs.close()

### Exercise 2b

For each `blast_record` in the BLAST result file, print the `blast_record.query`, then print the `alignment.title` of each hit.

In [249]:
blast_result_handle = open('day5_2a.xml') # open the blast output file
blast_records = NCBIXML.parse(blast_result_handle) # use parse to read a blast result with more than one query

for ___ in blast_records:
    print('Read ID: ', ___.___)
    for alignment in blast_record.alignments:
        print('Hit title:', alignment.___)
    print('\n')

Read ID:  M05164:33:000000000-BGDN6:1:1101:9154:2416
Hit title: gi|982198872|gb|CP013997.1| Pseudomonas monteilii strain USDA-ARS-USMARC-56711, complete genome
Hit title: gi|675318909|gb|CP009048.1| Pseudomonas alkylphenolica strain KL28 chromosome, complete genome
Hit title: gi|2042583920|gb|CP071007.1| Pseudomonas sp. SORT22 chromosome, complete genome
Hit title: gi|1935551420|gb|CP062498.1| Pseudomonas sp. BIGb0427 chromosome
Hit title: gi|1339002592|gb|CP026386.1| Pseudomonas sp. PONIH3 chromosome, complete genome
Hit title: gi|684194542|gb|CP009365.1| Pseudomonas soli strain SJ10, complete genome
Hit title: gi|2063469783|gb|CP077075.1| Pseudomonas sp. COR54 chromosome, complete genome
Hit title: gi|1419237561|gb|CP030750.1| Pseudomonas putida strain NX-1 chromosome, complete genome
Hit title: gi|1024771698|gb|CP011789.1| Pseudomonas putida strain PC2, complete genome
Hit title: gi|2063614699|gb|CP077094.1| Pseudomonas sp. RW10S1 chromosome, complete genome


Read ID:  M05164:33:00

### Exercise 2c

Our results indicate that most reads return Pseudomonas matches, which is expected. However, read 2 had only two hits in total, even though up to 10 could be returned. This suggests that read 2 may not be contamination, but simply a bad read. We can check this by examining the strength of the homology for each read. If the homology for read 2 is strong, it is likely a contaminant read. If it is weak, it is more likely a low-quality read.

For each query in the BLAST result file, print the `alignment.title` of each hit and the `hsp.expect` (e-value) of each HSP in `alignment.hsps` to gauge the strength of homology.



In [241]:
blast_result_handle = open('day5_2a.xml') # open the blast output file
blast_records = ___.___(blast_result_handle) # use NCBIXML.parse to read a blast result with more than one query

for ___ in blast_records:
    print('Read ID: ', blast_record.___)
    for alignment in blast_record.alignments:
        print('Hit title:', alignment.title)
        for ___ in alignment.hsps:
            print('Expected value: ', hsp.expect)
    print('\n')

Read ID:  M05164:33:000000000-BGDN6:1:1101:9154:2416
Hit title: gi|982198872|gb|CP013997.1| Pseudomonas monteilii strain USDA-ARS-USMARC-56711, complete genome
Expected value:  7.0221e-29
Hit title: gi|675318909|gb|CP009048.1| Pseudomonas alkylphenolica strain KL28 chromosome, complete genome
Expected value:  1.6529e-11
Hit title: gi|2042583920|gb|CP071007.1| Pseudomonas sp. SORT22 chromosome, complete genome
Expected value:  2.45312e-09
Hit title: gi|1935551420|gb|CP062498.1| Pseudomonas sp. BIGb0427 chromosome
Expected value:  2.45312e-09
Hit title: gi|1339002592|gb|CP026386.1| Pseudomonas sp. PONIH3 chromosome, complete genome
Expected value:  0.000658263
Hit title: gi|684194542|gb|CP009365.1| Pseudomonas soli strain SJ10, complete genome
Expected value:  0.000658263
Hit title: gi|2063469783|gb|CP077075.1| Pseudomonas sp. COR54 chromosome, complete genome
Expected value:  0.000658263
Hit title: gi|1419237561|gb|CP030750.1| Pseudomonas putida strain NX-1 chromosome, complete genome
Expected value:  0.00801929
Hit title: gi|1024771698|g

### Exercise 2d

From our results we can see two things. Most hits have a very [low e-value](http://resources.qiagenbioinformatics.com/manuals/clcgenomicsworkbench/650/_E_value.html#:~:text=The%20default%20threshold%20for%20the,%3C%2010e%2D100%20Identical%20sequences) (less than 1), whereas the hits to read 2 have e-values of 4. The e-value estimates how many hits with an equal or better score would be expected by chance, so smaller values indicate stronger matches.

Create a variable called `e_value_threshold` with a value of `0.0001`.

Let's filter our results so that only `hsp.expect` values (e-values) at or below this threshold are printed.

In [245]:
blast_result_handle = open('day5_2a.xml') # open the blast output file
blast_records = NCBIXML.parse(blast_result_handle) # use parse to read a blast result with more than one query

e_value_threshold = ___

for blast_record in blast_records:
    print('Read ID: ', blast_record.query)
    for alignment in blast_record.alignments:
        print('Hit title:', alignment.title)
        for ___ in alignment.hsps:
            if hsp.___ <= e_value_threshold:
                print('Expected value: ', hsp.expect)
            else:
                print('Expected value: did not pass')
            break
    print('\n')

Read ID:  M05164:33:000000000-BGDN6:1:1101:9154:2416
Hit title: gi|982198872|gb|CP013997.1| Pseudomonas monteilii strain USDA-ARS-USMARC-56711, complete genome
Expected value:  7.0221e-29
Hit title: gi|675318909|gb|CP009048.1| Pseudomonas alkylphenolica strain KL28 chromosome, complete genome
Expected value:  1.6529e-11
Hit title: gi|2042583920|gb|CP071007.1| Pseudomonas sp. SORT22 chromosome, complete genome
Expected value:  2.45312e-09
Hit title: gi|1935551420|gb|CP062498.1| Pseudomonas sp. BIGb0427 chromosome
Expected value:  2.45312e-09
Hit title: gi|1339002592|gb|CP026386.1| Pseudomonas sp. PONIH3 chromosome, complete genome
Expected value: did not pass
Hit title: gi|684194542|gb|CP009365.1| Pseudomonas soli strain SJ10, complete genome
Expected value: did not pass
Hit title: gi|2063469783|gb|CP077075.1| Pseudomonas sp. COR54 chromosome, complete genome
Expected value: did not pass
Hit title: gi|1419237561|gb|CP030750.1| Pseudomonas putida strain NX-1 chromosome, complete genome
E

### Exercise 2e

We will now store our results in a list called `store_results` using `append()`.

In [250]:
blast_result_handle = open('day5_2a.xml') # open the blast output file
blast_records = NCBIXML.parse(blast_result_handle) # use parse to read a blast result with more than one query

store_results = ___

___ = 0.0001

for blast_record in blast_records:
    # get the read_id
    read_id = blast_record.query

    for alignment in blast_record.alignments:
        # get the name of the hit
        hit_title = alignment.title

        for hsp in alignment.hsps:
            # get the evalue from hsp.expect
            evalue = hsp.___

            # get a qualitative result for the evalue threshold
            if hsp.expect <= e_value_threshold:
                evalue_pass = 1
            else:
                evalue_pass = 0
            
            # append to list
            store_results.___([read_id, hit_title, evalue, evalue_pass])


store_results

[['M05164:33:000000000-BGDN6:1:1101:9154:2416',
  'gi|982198872|gb|CP013997.1| Pseudomonas monteilii strain USDA-ARS-USMARC-56711, complete genome',
  7.0221e-29,
  1],
 ['M05164:33:000000000-BGDN6:1:1101:9154:2416',
  'gi|675318909|gb|CP009048.1| Pseudomonas alkylphenolica strain KL28 chromosome, complete genome',
  1.6529e-11,
  1],
 ['M05164:33:000000000-BGDN6:1:1101:9154:2416',
  'gi|2042583920|gb|CP071007.1| Pseudomonas sp. SORT22 chromosome, complete genome',
  2.45312e-09,
  1],
 ['M05164:33:000000000-BGDN6:1:1101:9154:2416',
  'gi|1935551420|gb|CP062498.1| Pseudomonas sp. BIGb0427 chromosome',
  2.45312e-09,
  1],
 ['M05164:33:000000000-BGDN6:1:1101:9154:2416',
  'gi|1339002592|gb|CP026386.1| Pseudomonas sp. PONIH3 chromosome, complete genome',
  0.000658263,
  0],
 ['M05164:33:000000000-BGDN6:1:1101:9154:2416',
  'gi|684194542|gb|CP009365.1| Pseudomonas soli strain SJ10, complete genome',
  0.000658263,
  0],
 ['M05164:33:000000000-BGDN6:1:1101:9154:2416',
  'gi|2063469783|gb|
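
With the results stored as `[read_id, hit_title, evalue, evalue_pass]` rows, a per-read summary is one way to spot reads that deserve a closer look. The sketch below counts, for each read, how many hits passed the e-value threshold. The rows are shortened, illustrative stand-ins (the `read_2` rows in particular are hypothetical), and `summarise_by_read` is a name invented for this example.

```python
from collections import defaultdict


def summarise_by_read(rows):
    """Return {read_id: (hits_passing, total_hits)} from stored result rows."""
    counts = defaultdict(lambda: [0, 0])
    for read_id, _hit_title, _evalue, evalue_pass in rows:
        counts[read_id][0] += evalue_pass   # evalue_pass is 1 or 0
        counts[read_id][1] += 1
    return {read_id: tuple(pair) for read_id, pair in counts.items()}


# illustrative rows in the same layout as store_results
example_rows = [
    ["read_1", "Pseudomonas monteilii", 7.0221e-29, 1],
    ["read_1", "Pseudomonas alkylphenolica", 1.6529e-11, 1],
    ["read_1", "Pseudomonas sp. PONIH3", 0.000658263, 0],
    ["read_2", "hypothetical hit A", 4.0, 0],
    ["read_2", "hypothetical hit B", 4.0, 0],
]

for read_id, (passed, total) in summarise_by_read(example_rows).items():
    print(read_id, "hits passing threshold:", passed, "of", total)
```

A read with no passing hits, like the second one here, would be a candidate low-quality read rather than a contaminant.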

That's it for Day 5! Thanks for attending this workshop!