## Lab 3: Database Searching Using Biopython

#### Enter your name below.

### Part 1: Search NCBI's Gene database for genes involved in cystic fibrosis.

Your goal in this section is to use NCBI's Gene database to identify genes involved in cystic fibrosis in humans.

##### 1. Load the Biopython module "Entrez", enter your email address, then execute.

In [None]:
from Bio import Entrez

#Tell NCBI who you are
Entrez.email = ""

##### 2. Generate query and execute the search.

In [None]:
db = "gene" # This is the database we want to search

query = "cystic fibrosis" # This is the query

# We'll use the function Entrez.esearch to search the Gene database with our query
h_search = Entrez.esearch(db=db, term=query) # tell Entrez which database we want to search and what we want to look for

record = Entrez.read(h_search) # read the esearch record

res_ids = record["IdList"] # save the list of ids returned by our query to res_ids

#print the list of ids
print(res_ids)

In [None]:
# Use Entrez esummary to retrieve the record for the first id in the list 
summary = Entrez.esummary(db=db, id=res_ids[0])

# Read the summary 
gene_summary = Entrez.read(summary)

# and print it out
print(gene_summary)

Q1: Name three data types you see in the printed output of gene_summary.

##### 3. Analyze the results.

In [None]:
for r_id in res_ids: # loop over Gene IDs
    summary = Entrez.esummary(db=db, id=r_id) # retrieve the summary for this ID
    gene_read = Entrez.read(summary) # use Biopython to parse the summary
    gene_summary = gene_read['DocumentSummarySet']['DocumentSummary']

    # print selected fields for each gene (inside the loop, so every record is shown)
    print("Name:", gene_summary[0]['Name'])
    print("Description:", gene_summary[0]['Description'])
    print("Organism:", gene_summary[0]['Organism'])
    print("Summary:", gene_summary[0]['Summary'])
    print("OtherAliases:", gene_summary[0]['OtherAliases'])
    print('\n') # blank lines between records

Let's quantify the success of our query using true and false positive terminology.  
Talk with your team and decide how you will categorize the results as true and false positives. Write your definitions below.

Use a for loop to sort through the genes returned by your query. 
Assign the genes into true and false positive lists based on your criteria.
Remember that Python considers the case of a string when checking equality.
Hint - you can convert both the query and the subject to a lower case string using str.lower("YourString")
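
Here is a minimal sketch of why case matters when you compare strings. The query and description below are made up for illustration; they are not output from NCBI.

```python
# Hypothetical query and gene description, for illustration only.
query = "Cystic Fibrosis"
description = "CF transmembrane conductance regulator; mutated in cystic fibrosis"

# A direct substring test is case-sensitive, so this is False.
case_sensitive_match = query in description

# Lowercasing both sides makes the test case-insensitive, so this is True.
case_insensitive_match = query.lower() in description.lower()

print(case_sensitive_match, case_insensitive_match)
```

Which field of the record you compare against (name, description, summary, ...) is up to your team's definition of a true positive.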

In [None]:
true_positive = []
false_positive = []

for r_id in res_ids: # loop over Gene IDs
    summary = Entrez.esummary(db=db, id=r_id) # retrieve the summary for this ID
    gene_read = Entrez.read(summary) # use Biopython to parse the summary
    gene_summary = gene_read['DocumentSummarySet']['DocumentSummary']
    
    # use an if statement to define the true positive classification, remember that Python considers the case of the string used 
    
        # if the condition is met add the record to the true positive list
    
    # if the condition is NOT met add the record to the false positive list


Q2: Why didn't we include any if statements for true or false negatives?

In [None]:
#print out a statement describing how many of the results were true/false positives
print(len(true_positive), "true positives")
print(len(false_positive), "false positives")

In [None]:
# Write a for loop to check if the gene cystic fibrosis transmembrane conductance regulator,
# abbreviated "CFTR", was classified as a true positive, and print out all the information about CFTR if it was found.
for gene in true_positive:
    if gene[0]['Name'] == "CFTR":
        print(gene, "\n")

### Part 2: Craft a PubMed query to return journal articles about CFTR.

Your goal in this section is to retrieve articles (either journal or review articles) about the cystic fibrosis transmembrane conductance regulator gene ("CFTR").

The code in the following cell is complete - simply execute and observe.

In [None]:
# This is the database we want to search
db = "pubmed"

# This is the query
query = "cystic fibrosis transmembrane conductance regulator"

# We'll use the function Entrez.esearch to search the pubmed database with our query
h_search = Entrez.esearch(db=db, term=query) # tell Entrez which database we want to search and what we want to look for

record = Entrez.read(h_search)

ids = record["IdList"] # save the list of ids returned by our query to ids

print(ids) # print the list of ids

How many records were returned? 

Don't count them yourself, write code below to find out!

In [None]:
print(len(ids))
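
Keep in mind that `len(ids)` counts only the IDs that were sent back. By default, `esearch` returns the first 20 matches (adjustable with the `retmax` argument), while the total number of matching records is stored as a string under `record["Count"]`. The sketch below uses a made-up dictionary in place of a real search result, so the values are assumptions; only the key names mirror what `Entrez.read` gives back.

```python
# Hypothetical stand-in for the dictionary returned by Entrez.read(h_search).
# The key names mirror an esearch result; the values are made up.
record_example = {
    "Count": "1500",
    "RetMax": "3",
    "IdList": ["111", "222", "333"],
}

ids_returned = len(record_example["IdList"])  # IDs actually delivered
total_matches = int(record_example["Count"])  # all records matching the query

print(ids_returned, "of", total_matches, "matching records were returned")
```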

The code in the following cell is complete - simply execute and observe.

In [None]:
# Use Entrez esummary to retrieve the record for the *first id* in the list 
summ = Entrez.esummary(db=db, id=ids[0])

# Read the summary and print it out
pub_summary = Entrez.read(summ)[0]
print(pub_summary)

Q3: Which part of the code in the previous cell limited the results to the record corresponding to the *first* id in the list? Copy and paste it below.

Q4: Modify the code in the following cell so that it prints out the record for the **3rd** id in the list.
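
As a reminder, Python lists are indexed starting at zero. The short sketch below uses a made-up list of IDs (not real PubMed IDs) to show how positions map to indices.

```python
# Hypothetical IDs, for illustration only.
example_ids = ["101", "202", "303", "404"]

first_id = example_ids[0]   # index 0 is the first element
second_id = example_ids[1]  # index 1 is the second element
last_id = example_ids[-1]   # negative indices count from the end

print(first_id, second_id, last_id)
```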

In [None]:
# Use Entrez esummary to retrieve the record for the *third id* in the list 
summ = Entrez.esummary(db=db, id=ids[0])

# Read the summary and print it out
pub_summary = Entrez.read(summ)[0]
print(pub_summary)

In [None]:
# print out a list of the elements in the record pub_summary
# hint use .keys()
print(pub_summary.keys())

NCBI's PubMed database contains both primary and secondary research articles. Write a for loop to count the number of primary research articles ('Journal Article') and secondary research articles ('Review').   
This information is contained in pub_summary[0]['PubTypeList'].
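
The publication types come back as a list of strings, so the `in` operator can check whether a given type is present. Below is a minimal sketch using made-up lists and a different publication type ('Letter'); the data are assumptions, not real PubMed output.

```python
# Hypothetical PubTypeList values for three articles (not real PubMed output).
example_pub_types = [
    ["Journal Article"],
    ["Journal Article", "Review"],
    ["Letter"],
]

n_letters = 0
for pub_types in example_pub_types:
    # `in` checks whether the exact string is an element of the list
    if "Letter" in pub_types:
        n_letters += 1

print(n_letters)
```

Note that one article can carry more than one publication type, as in the second made-up entry above, so your two counts may overlap.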

In [None]:
count_journal = 0
count_review = 0

for id in ids: #loop over pubmed IDs
    #print("id:",id)
 
    summary = Entrez.esummary(db="pubmed", id=id) # retrieve summary of document
    pub_summary = Entrez.read(summary) # use biopython to parse the summary
    
    print(pub_summary[0]['PubTypeList'])
 
    #ask if 'Review' is in pub_summary[0]['PubTypeList']
    
        # if so increase the value of count_review by one
         
            
    # ask if 'Journal Article' is in pub_summary[0]['PubTypeList'] 
    
        # if so increase the value of count_journal by one
         
    
print("Number of primary research articles:", count_journal)
print("Number of review articles:", count_review)

In [None]:
# Generate a list of the titles of the publications
title_list = [] # an empty list to collect the titles

for id in ids: # loop over PubMed IDs
    summary = Entrez.esummary(db=db, id=id) # retrieve summary of document
    pub_summary = Entrez.read(summary) # use Biopython to parse the summary
    # print the title and a new line '\n' for readability
    print("Title:", pub_summary[0]['Title'], '\n')
    title_list.append(pub_summary[0]['Title']) # add the title to the list title_list

**Based on the titles of your results** - quantify the success of your query using true and false positive terminology.  
Talk with your team and decide how you will categorize the results as true and false positives. Write your definitions below.
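
Once you have your counts, one common way to summarize them in a single number is precision: the fraction of returned results that are true positives. The helper `precision` below is a hypothetical function written for this sketch (it is not part of Biopython), and the counts are made up.

```python
def precision(n_true_positive, n_false_positive):
    """Fraction of returned results that are relevant: TP / (TP + FP)."""
    total = n_true_positive + n_false_positive
    if total == 0:
        return 0.0  # nothing was returned, so report 0 by convention
    return n_true_positive / total

# Hypothetical counts, for illustration only.
example_precision = precision(15, 5)
print(example_precision)
```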

In [None]:
#Use a for loop to sort through the journal articles returned by your query. 
#Assign the titles into true and false positive lists based on your criteria.
#Remember that Python considers the case of a string when checking equality.
#Hint - you can convert both the query and the subject to a lower case string using str.lower("YourString")

true_positive = [] # an empty list to contain the titles of the true positives
false_positive = [] # an empty list to contain the title of the false positives
#loop through the elements of title_list

    #use an if statement to define the true positive classification, remember that Python considers the case of the string used
    
    #if the condition is met add the title to the true positive list
    
    #else add the title to the list of false positives
    

In [None]:
#print out a statement describing how many of the results were true/false positives
print("true positives:", len(true_positive))
print("false positives:", len(false_positive))

#### That's all folks!!! Save and download your notebook, then upload it to Blackboard.