## Lab 3: Database Searching Using Biopython

#### Enter your name below.

### Part 1: Search NCBI's Gene database for genes involved in cystic fibrosis.

Your goal in this section is to use NCBI's Gene database to identify genes involved in cystic fibrosis in humans.

##### 1. Load the Biopython module "Entrez", enter your email address, then execute.

In [2]:
from Bio import Entrez

#Tell NCBI who you are
Entrez.email = "todd.riley@umb.edu"

##### 2. Generate query and execute the search.

In [5]:
db = "gene" # This is the database we want to search

query = "cystic fibrosis" # This is the query

# We'll use the function Entrez.esearch to search the gene database with our query
h_search = Entrez.esearch(db=db, term=query) # tell Entrez which database to search and what to look for

record = Entrez.read(h_search) # read the esearch record

res_ids = record["IdList"] # save the list of ids returned by our query to res_ids

#print the list of ids
print(res_ids)
len(res_ids)

['7124', '3569', '7040', '1636', '3586', '3091', '4318', '4790', '5243', '21926', '207', '21898', '5743', '3553', '7421', '6774', '7099', '5468', '1080', '3576']


20
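
The list holds exactly 20 IDs because `Entrez.esearch` returns at most 20 by default (the `retmax` parameter); the parsed record also carries a `Count` field with the total number of matches. Below is a minimal sketch of reading those fields. It uses a hand-written dictionary in place of a live record, the `Count` value is invented for illustration, and `describe_search` is a hypothetical helper, not part of Biopython.

```python
def describe_search(record):
    """Summarize how many IDs were returned versus how many matched."""
    total = int(record["Count"])      # Entrez reports counts as strings
    returned = len(record["IdList"])  # IDs actually included in this response
    return f"{returned} of {total} matching records returned"

# Hand-written stand-in for the dictionary that Entrez.read() returns;
# the Count value is made up for illustration.
sample_record = {"Count": "1234", "RetMax": "3", "IdList": ["7124", "3569", "7040"]}
print(describe_search(sample_record))
```

With a live search, calling `describe_search(record)` on the parsed result should work the same way, and passing a larger `retmax` to `Entrez.esearch` raises the cap on returned IDs.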

In [6]:
# Use Entrez esummary to retrieve the record for the first id in the list 
summary = Entrez.esummary(db=db, id=res_ids[0])

# Read the summary 
gene_summary = Entrez.read(summary)

# and print it out
print(gene_summary)

{'DocumentSummarySet': DictElement({'DocumentSummary': [DictElement({'GeneticSource': 'genomic', 'GeneWeight': '585516', 'Summary': 'This gene encodes a multifunctional proinflammatory cytokine that belongs to the tumor necrosis factor (TNF) superfamily. This cytokine is mainly secreted by macrophages. It can bind to, and thus functions through its receptors TNFRSF1A/TNFR1 and TNFRSF1B/TNFBR. This cytokine is involved in the regulation of a wide spectrum of biological processes including cell proliferation, differentiation, apoptosis, lipid metabolism, and coagulation. This cytokine has been implicated in a variety of diseases, including autoimmune diseases, insulin resistance, and cancer. Knockout studies in mice also suggested the neuroprotective function of this cytokine. [provided by RefSeq, Jul 2008]', 'CurrentID': '0', 'Organism': {'ScientificName': 'Homo sapiens', 'TaxID': '9606', 'CommonName': 'human'}, 'OtherAliases': 'DIF, TNF-alpha, TNFA, TNFSF2, TNLG1F', 'ChrStart': '315755

Q1: Name three data types you see in the printed output of gene_summary.

##### 3. Analyze the results.

In [7]:
for r_id in res_ids: # loop over Gene IDs
    summary = Entrez.esummary(db=db, id=r_id) # retrieve the summary of each gene record
    gene_read = Entrez.read(summary) # use Biopython to parse the summary
    gene_summary = gene_read['DocumentSummarySet']['DocumentSummary']

# These print calls sit outside the loop, so they report only the last gene retrieved.
# Indent them under the for statement to print every gene.
print("Name:", gene_summary[0]['Name'])
print("Description:", gene_summary[0]['Description'])
print("Organism:", gene_summary[0]['Organism'])
print("Summary:", gene_summary[0]['Summary'])
print("OtherAliases:", gene_summary[0]['OtherAliases'])
print('\n')

Name: CXCL8
Description: C-X-C motif chemokine ligand 8
Organism: {'ScientificName': 'Homo sapiens', 'TaxID': '9606', 'CommonName': 'human'}
Summary: The protein encoded by this gene is a member of the CXC chemokine family and is a major mediator of the inflammatory response. The encoded protein is secreted primarily by neutrophils, where it serves as a chemotactic factor by guiding the neutrophils to the site of infection. This chemokine is also a potent angiogenic factor. This gene is believed to play a role in the pathogenesis of bronchiolitis, a common respiratory tract disease caused by viral infection. This gene and other members of the CXC chemokine gene family form a gene cluster in a region of chromosome 4q. [provided by RefSeq, Aug 2017]
OtherAliases: GCP-1, GCP1, IL8, LECT, LUCT, LYNAP, MDNCF, MONAP, NAF, NAP-1, NAP1




Let's quantify the success of our query using true and false positive terminology.  
Talk with your team and decide how you will categorize the results as true and false positives. Write your definitions below.

Use a for loop to sort through the genes returned by your query. 
Assign the genes into true and false positive lists based on your criteria.
Remember that Python considers the case of a string when checking equality.
Hint - you can convert both the query and the subject to a lower case string using str.lower("YourString")
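
As a small illustration of the hint above, the sketch below compares a case-sensitive and a case-insensitive substring check. The helper name `mentions` and the sample sentence are made up for this example.

```python
def mentions(query, text):
    """Return True if query occurs anywhere in text, ignoring case."""
    return query.lower() in text.lower()

# Illustrative sentence, not taken from a real Gene record
summary_text = "Mutations in this gene cause Cystic Fibrosis."

print(mentions("cystic fibrosis", summary_text))  # True: both sides are lowercased
print("cystic fibrosis" in summary_text)          # False: the capital letters differ
```

Calling `.lower()` on the string itself is equivalent to `str.lower("YourString")`.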

In [12]:
true_positives = []
false_positives = []

for r_id in res_ids: # loop over Gene IDs
    summary = Entrez.esummary(db=db, id=r_id) # retrieve the summary of each gene record
    gene_read = Entrez.read(summary) # use Biopython to parse the summary
    gene_summary = gene_read['DocumentSummarySet']['DocumentSummary']

    # use an if statement to define the true positive classification; remember that Python considers the case of the string used
    positive = False
    if query in str.lower(gene_summary[0]['Summary']):
        positive = True

    # if the condition is met, add the record to the true positive list
    if positive:
        true_positives.append(gene_summary)
    # if the condition is NOT met, add the record to the false positive list
    else:
        false_positives.append(gene_summary)
        

1
19


Q2: Why didn't we include any if statements for true or false negatives?

In [13]:
#print out a statement describing how many of the results were true/false positives
print(len(true_positives), "true positives")
print(len(false_positives), "false positives")

1 true positives
19 false positives
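
One common way to condense these two counts into a single number is precision, the fraction of returned results that are true positives. The term is not used elsewhere in this lab, and the function below is only a sketch under that definition.

```python
def precision(n_true_pos, n_false_pos):
    """Fraction of returned results that were classified as true positives."""
    total = n_true_pos + n_false_pos
    if total == 0:
        return 0.0  # no results returned, so report zero rather than divide by zero
    return n_true_pos / total

# Counts taken from the output above: 1 true positive, 19 false positives
print(precision(1, 19))  # 0.05
```

By this measure, only 5% of the genes returned for the query met the true positive criterion used above.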


In [15]:
# Write a for loop to check whether the gene cystic fibrosis transmembrane conductance regulator,
# abbreviated "CFTR", was classified as a true positive, and print out all the information about CFTR if it was found.
for a_gene_summary in true_positives:
    if a_gene_summary[0]['Name'] == "CFTR":
        print(a_gene_summary, "\n")

[DictElement({'GeneticSource': 'genomic', 'GeneWeight': '219557', 'Summary': 'This gene encodes a member of the ATP-binding cassette (ABC) transporter superfamily. The encoded protein functions as a chloride channel, making it unique among members of this protein family, and controls ion and water secretion and absorption in epithelial tissues. Channel activation is mediated by cycles of regulatory domain phosphorylation, ATP-binding by the nucleotide-binding domains, and ATP hydrolysis. Mutations in this gene cause cystic fibrosis, the most common lethal genetic disorder in populations of Northern European descent. The most frequently occurring mutation in cystic fibrosis, DeltaF508, results in impaired folding and trafficking of the encoded protein. Multiple pseudogenes have been identified in the human genome. [provided by RefSeq, Aug 2017]', 'CurrentID': '0', 'Organism': {'ScientificName': 'Homo sapiens', 'TaxID': '9606', 'CommonName': 'human'}, 'OtherAliases': 'ABC35, ABCC7, CF, C

### Part 2: Craft a PubMed query to return journal articles about CFTR.

Your goal in this section is to retrieve articles (either journal or review articles) about the cystic fibrosis transmembrane conductance regulator gene ("CFTR").

The code in the following cell is complete - simply execute and observe.

In [None]:
# This is the database we want to search
db = "pubmed"

# This is the query
query = "cystic fibrosis transmembrane conductance regulator"

# We'll use the function Entrez.esearch to search the pubmed database with our query
h_search = Entrez.esearch(db=db, term=query) # tell Entrez which database to search and what to look for

record = Entrez.read(h_search)

ids = record["IdList"] # save the list of ids returned by our query to ids

print(ids) # print the list of ids

How many records were returned? 

Don't count them yourself, write code below to find out!

In [None]:
print(len(ids))

The code in the following cell is complete - simply execute and observe.

In [None]:
# Use Entrez esummary to retrieve the record for the *first id* in the list 
summ = Entrez.esummary(db=db, id=ids[0])

# Read the summary and print it out
pub_summary = Entrez.read(summ)[0]
print(pub_summary)

Q3: What part of the code in the previous cell limited the results to the record corresponding to the *first* id in the list? Copy and paste it below.

Q4: Modify the code in the following cell so that it prints out the record for the **3rd** id in the list.

In [None]:
# Use Entrez esummary to retrieve the record for the *third id* in the list 
summ = Entrez.esummary(db=db, id=ids[0])

# Read the summary and print it out
pub_summary = Entrez.read(summ)[0]
print(pub_summary)

In [None]:
# print out a list of the elements in the record pub_summary
# hint use .keys()
print(pub_summary.keys())

NCBI's PubMed database contains both primary and secondary research articles. Write a for loop to count the number of primary research articles ('Journal Article') and secondary research articles ('Review').   
This information is contained in pub_summary[0]['PubTypeList'].

In [None]:
count_journal = 0
count_review = 0

for id in ids: #loop over pubmed IDs
    #print("id:",id)
 
    summary = Entrez.esummary(db="pubmed", id=id) # retrieve summary of document
    pub_summary = Entrez.read(summary) # use biopython to parse the summary
    
    print(pub_summary[0]['PubTypeList'])
 
    # ask if 'Review' is in pub_summary[0]['PubTypeList']
    if 'Review' in pub_summary[0]['PubTypeList']:
        # if so increase the value of count_review by one
        count_review += 1

    # otherwise ask if 'Journal Article' is in pub_summary[0]['PubTypeList']
    elif 'Journal Article' in pub_summary[0]['PubTypeList']:
        # if so increase the value of count_journal by one
        count_journal += 1
    
print("Number of primary research articles:", count_journal)
print("Number of review articles:", count_review)
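
The order of the two checks matters: a record's `PubTypeList` can hold several labels at once, and reviews are often tagged as 'Journal Article' as well, so testing for 'Review' first keeps them out of the primary count. The sketch below shows the same rule applied to hand-written lists (not real PubMed records), with `collections.Counter` doing the tallying; `classify` is a hypothetical helper.

```python
from collections import Counter

def classify(pub_types):
    """Label a record as 'review', 'primary', or 'other' from its publication types."""
    if "Review" in pub_types:           # checked first, so reviews never count as primary
        return "review"
    if "Journal Article" in pub_types:
        return "primary"
    return "other"

# Hand-written PubTypeList values for illustration only
sample_pub_types = [
    ["Journal Article"],
    ["Journal Article", "Review"],
    ["Journal Article", "Research Support, Non-U.S. Gov't"],
    ["Letter"],
]

counts = Counter(classify(pub_types) for pub_types in sample_pub_types)
print(counts["primary"], counts["review"], counts["other"])  # 2 1 1
```

The 'other' bucket makes visible any records that are neither type, which the two-counter loop silently skips.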

In [None]:
# Generate a list of the titles of the publications
title_list = []
for id in ids: # loop over pubmed IDs
    summary = Entrez.esummary(db=db, id=id) # retrieve summary of document
    pub_summary = Entrez.read(summary) # use biopython to parse the summary
    # print the title
    print("Title:", pub_summary[0]['Title'])
    title_list.append(pub_summary[0]['Title'])# add the title to the list title_list
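
The loop above sends one request to NCBI per ID. `Entrez.esummary` also accepts several IDs in a single comma-separated string, so the requests could be grouped. The sketch below only builds those strings and makes no network call; `batch_ids` is a hypothetical helper and the IDs are made up.

```python
def batch_ids(ids, size):
    """Split ids into comma-joined strings holding at most `size` IDs each."""
    return [",".join(ids[i:i + size]) for i in range(0, len(ids), size)]

sample_ids = ["101", "102", "103", "104", "105"]  # made-up IDs for illustration

for id_string in batch_ids(sample_ids, 2):
    # With a live connection, each string could be passed as
    # Entrez.esummary(db="pubmed", id=id_string)
    print(id_string)
```

For the 20 IDs in this lab a single batch would be enough; batching matters more as the result list grows.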

**Based on the titles of your results** - quantify the success of your query using true and false positive terminology.  
Talk with your team and decide how you will categorize the results as true and false positives. Write your definitions below.

In [None]:
#Use a for loop to sort through the journal articles returned by your query. 
#Assign the titles into true and false positive lists based on your criteria.
#Remember that Python considers the case of a string when checking equality.
#Hint - you can convert both the query and the subject to a lower case string using str.lower("YourString")

true_positive = [] # an empty list to contain the titles of the true positives
false_positive = [] # an empty list to contain the title of the false positives
#loop through the elements of title_list

    #use an if statement to define the true positive classification, remember that Python considers the case of the string used
    
    #if the condition is met add the title to the true positive list
    
    #else add the title to the list of false positives
    

In [None]:
#print out a statement describing how many of the results were true/false positives
print("true positives:", len(true_positive))
print("false positives:", len(false_positive))

#### That's all folks!!! Save and download your notebook, then upload it to Blackboard.