## Lab 3: Database Searching Using Biopython

#### Enter your name below.

### Part 1: Search NCBI's Gene database for genes involved in cystic fibrosis.

Your goal in this section is to use NCBI's Gene database to identify genes involved in cystic fibrosis in humans.

##### 1. Load the Biopython module "Entrez", enter your email address, then execute.

In [1]:
from Bio import Entrez

#Tell NCBI who you are
Entrez.email = "mensainah.hector001@umb.edu"

##### 2. Generate query and execute the search.

In [2]:
db = "gene" #This is the database we want to search

query = "cystic fibrosis" #This is the query

#We'll use the function Entrez.esearch to search the gene database with our query
h_search = Entrez.esearch(db=db, term=query) #tell Entrez which database to search and what to look for

record = Entrez.read(h_search) #read the esearch record

res_ids = record["IdList"] #save the list of ids returned by our query to res_ids

#print the list of ids
print (res_ids)
type (res_ids)

['7124', '3569', '7040', '1636', '3586', '3091', '21926', '4790', '5743', '5243', '4318', '21898', '3553', '207', '7421', '1080', '5468', '7099', '3576', '6774']


Bio.Entrez.Parser.ListElement

In [3]:
#Use Entrez esummary to retrieve the record for the first id in the list 
summary = Entrez.esummary(db=db, id=res_ids[0])

#Read the summary 
gene_summary = Entrez.read(summary)

#and print it out
print (gene_summary)

{u'DocumentSummarySet': DictElement({u'DbBuild': 'Build160404-0125m.1', u'DocumentSummary': [DictElement({u'Status': '0', u'NomenclatureSymbol': 'TNF', u'OtherDesignations': 'APC1 protein|TNF, macrophage-derived|TNF, monocyte-derived|TNF-a|cachectin|tumor necrosis factor ligand 1F|tumor necrosis factor ligand superfamily member 2|tumor necrosis factor-alpha', u'Mim': ['191160'], u'Name': 'TNF', u'NomenclatureName': 'tumor necrosis factor', u'CurrentID': '0', u'GenomicInfo': [{u'ChrAccVer': 'NC_000006.12', u'ChrLoc': '6', u'ExonCount': '4', u'ChrStop': '31578335', u'ChrStart': '31575566'}], u'OtherAliases': 'DIF, TNF-alpha, TNFA, TNFSF2, TNLG1F', u'Summary': 'This gene encodes a multifunctional proinflammatory cytokine that belongs to the tumor necrosis factor (TNF) superfamily. This cytokine is mainly secreted by macrophages. It can bind to, and thus functions through its receptors TNFRSF1A/TNFR1 and TNFRSF1B/TNFBR. This cytokine is involved in the regulation of a wide spectrum of biol

Q1: Name three data types you see in the printed output of gene_summary.

##### 3. Analyze the results.

In [4]:
for r_id in res_ids: #loop over the Gene IDs
    summary = Entrez.esummary(db=db, id=r_id) #retrieve the summary of the gene record
    gene_read = Entrez.read(summary) #use Biopython to parse the summary
    gene_summary = gene_read['DocumentSummarySet']['DocumentSummary']
    
    #print gene_summary[0]
    print ("Name:", gene_summary[0]['Name'])
    print ("Description:", gene_summary[0]['Description']) 
    print ("Organism:", gene_summary[0]['Organism'])
    print ("Summary:", gene_summary[0]['Summary'])
    print ("OtherAliases:",  gene_summary[0]['OtherAliases'])
    print ('\n')

Name: TNF
Description: tumor necrosis factor
Organism: {u'CommonName': 'human', u'ScientificName': 'Homo sapiens', u'TaxID': '9606'}
Summary: This gene encodes a multifunctional proinflammatory cytokine that belongs to the tumor necrosis factor (TNF) superfamily. This cytokine is mainly secreted by macrophages. It can bind to, and thus functions through its receptors TNFRSF1A/TNFR1 and TNFRSF1B/TNFBR. This cytokine is involved in the regulation of a wide spectrum of biological processes including cell proliferation, differentiation, apoptosis, lipid metabolism, and coagulation. This cytokine has been implicated in a variety of diseases, including autoimmune diseases, insulin resistance, and cancer. Knockout studies in mice also suggested the neuroprotective function of this cytokine. [provided by RefSeq, Jul 2008]
OtherAliases: DIF, TNF-alpha, TNFA, TNFSF2, TNLG1F


Name: IL6
Description: interleukin 6
Organism: {u'CommonName': 'human', u'ScientificName': 'Homo sapiens', u'TaxID': '960
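The loop above reaches each field by chaining dictionary keys and a list position. The sketch below reproduces that access pattern on a hand-built stand-in for the parsed record. It assumes plain `dict` and `list` objects in place of Biopython's `DictElement` and `ListElement` wrappers, and it copies only a few fields from the printed output, so it runs without contacting NCBI.

```python
# Hand-built stand-in for the parsed esummary result (an assumption for
# illustration); values are copied from the printed output of the TNF record.
gene_read = {
    "DocumentSummarySet": {
        "DocumentSummary": [
            {
                "Name": "TNF",
                "Description": "tumor necrosis factor",
                "Organism": {
                    "CommonName": "human",
                    "ScientificName": "Homo sapiens",
                    "TaxID": "9606",
                },
            }
        ]
    }
}

# dictionary key -> dictionary key -> list position -> dictionary key
doc = gene_read["DocumentSummarySet"]["DocumentSummary"][0]
print("Name:", doc["Name"])
print("Species:", doc["Organism"]["ScientificName"])
```

Because `Organism` is itself a dictionary, printing it whole shows all three of its keys, while adding one more key such as `['ScientificName']` returns just that string.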

Let's quantify the success of our query using true and false positive terminology.  
Talk with your team and decide how you will categorize the results as true and false positives. Write your definitions below.

Use a for loop to sort through the genes returned by your query. 
Assign the genes into true and false positive lists based on your criteria.
Remember that Python considers the case of a string when checking equality.
Hint - you can convert both the query and the subject to a lower case string using str.lower("YourString")
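As a minimal sketch of the hint above, the helper below lowercases both strings before testing for a substring. The function name `contains_ignoring_case` and the example sentence are hypothetical and are not part of the lab.

```python
def contains_ignoring_case(query, subject):
    """Return True if query occurs in subject, ignoring upper/lower case."""
    return query.lower() in subject.lower()


query = "Cystic Fibrosis"
subject = "Mutations in this gene cause CYSTIC FIBROSIS."  # hypothetical text

print(query in subject)                        # False: the case differs
print(contains_ignoring_case(query, subject))  # True
```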

In [5]:
true_positive = []
false_positive = []
for r_id in res_ids: #loop over the Gene IDs
    summary = Entrez.esummary(db=db, id=r_id) #retrieve the summary of the gene record
    gene_read = Entrez.read(summary) #use Biopython to parse the summary
    gene_summary = gene_read['DocumentSummarySet']['DocumentSummary']
    
    #use an if statement to define the true positive classification, remember that Python considers the case of the string used 
    if "cystic fibrosis" in str.lower(gene_summary[0]['Summary']):
       
        #if the condition is met add the record to the true positive list
        true_positive.append(gene_summary)
    
    #if the condition is NOT met add the record to the false positive list
    else:
        false_positive.append(gene_summary) 


Q2: Why didn't we include any if statements for true or false negatives?

In [6]:
#print out a statement describing how many of the results were true/false positives
print("true positives:", len(true_positive))
print("false positive:", len(false_positive))


true positives: 1
false positive: 19
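One common way to summarize these two counts in a single number is precision, the fraction of returned results that are true positives: TP / (TP + FP). The lab does not ask for this metric, so treat the sketch below as an optional illustration using the counts printed above.

```python
n_true_positive = 1    # count printed above
n_false_positive = 19  # count printed above

# precision = TP / (TP + FP); float() keeps the division exact on Python 2 too
precision = n_true_positive / float(n_true_positive + n_false_positive)
print("precision:", precision)  # 0.05
```

A precision of 0.05 means that, under the chosen definition, only 1 in 20 of the returned genes counted as a hit.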


In [7]:
#Write a for loop to check whether the gene cystic fibrosis transmembrane conductance regulator,
#abbreviated "CFTR", was classified as a true positive.
#If so, print out all of the information on the gene.
for gene in true_positive:
    if gene[0]['Name'] == "CFTR":
        print(gene, "\n")

[DictElement({u'Status': '0', u'NomenclatureSymbol': 'CFTR', u'OtherDesignations': 'cAMP-dependent chloride channel|channel conductance-controlling ATPase|cystic fibrosis transmembrane conductance regulator (ATP-binding cassette sub-family C, member 7)', u'Mim': ['602421'], u'Name': 'CFTR', u'NomenclatureName': 'cystic fibrosis transmembrane conductance regulator', u'CurrentID': '0', u'GenomicInfo': [{u'ChrAccVer': 'NC_000007.14', u'ChrLoc': '7', u'ExonCount': '33', u'ChrStop': '117668664', u'ChrStart': '117478339'}], u'OtherAliases': 'ABC35, ABCC7, CF, CFTR/MRP, MRP7, TNR-CFTR, dJ760C5.1', u'Summary': 'This gene encodes a member of the ATP-binding cassette (ABC) transporter superfamily. ABC proteins transport various molecules across extra- and intra-cellular membranes. ABC genes are divided into seven distinct subfamilies (ABC1, MDR/TAP, MRP, ALD, OABP, GCN20, White). This protein is a member of the MRP subfamily that is involved in multi-drug resistance. The encoded protein function

### Part 2: Craft a PubMed query to return journal articles about CFTR.

Your goal in this section is to retrieve articles (either primary journal articles or review articles) about the cystic fibrosis transmembrane conductance regulator gene ("CFTR").

The code in the following cell is complete - simply execute and observe.

In [8]:
#This is the database we want to search
db = "pubmed"

#This is the query
query = "cystic fibrosis transmembrane conductance regulator"

#We'll use the function Entrez.esearch to search the pubmed database with our query
h_search = Entrez.esearch(db=db, term=query) #tell Entrez which database to search and what to look for

record = Entrez.read(h_search)

ids = record["IdList"] #save the list of ids returned by our query to ids

print(ids) #print the list of ids

['27035618', '27034411', '27033560', '27031658', '27030675', '27017198', '27007499', '27004488', '26993289', '26989463', '26976279', '26973296', '26968770', '26968005', '26965147', '26962591', '26950439', '26939393', '26935091', '26930426']


How many records were returned? 

Don't count them yourself, write code below to find out!

In [9]:
print("number of pubs:", len(ids))


number of pubs: 20


The code in the following cell is complete - simply execute and observe.

In [10]:
#Use Entrez esummary to retrieve the record for the *first id* in the list 
summ = Entrez.esummary(db=db, id=ids[0])

#Read the summary and print it out
pub_summary = Entrez.read(summ)[0]
print(pub_summary)

{'DOI': '10.1038/cdd.2016.22', 'Title': 'A novel treatment of cystic fibrosis acting on-target: cysteamine plus epigallocatechin gallate for the autophagy-dependent rescue of class II-mutated CFTR.', 'Source': 'Cell Death Differ', 'PmcRefCount': 0, 'Issue': '', 'SO': '2016 Apr 1;', 'ISSN': '1350-9047', 'Volume': '', 'FullJournalName': 'Cell death and differentiation', 'RecordStatus': 'PubMed - as supplied by publisher', 'ESSN': '1476-5403', 'ELocationID': 'doi: 10.1038/cdd.2016.22', 'Pages': '', 'PubStatus': 'aheadofprint', 'AuthorList': ['Tosco A', 'De Gregorio F', 'Esposito S', 'De Stefano D', 'Sana I', 'Ferrari E', 'Sepe A', 'Salvadori L', 'Buonpensiero P', 'Di Pasqua A', 'Grassia R', 'Leone CA', 'Guido S', 'De Rosa G', 'Lusa S', 'Bona G', 'Stoll G', 'Maiuri MC', 'Mehta A', 'Kroemer G', 'Maiuri L', 'Raia V'], 'EPubDate': '2016 Apr 1', 'PubDate': '2016 Apr 1', 'NlmUniqueID': '9437445', 'LastAuthor': 'Raia V', 'ArticleIds': {'pii': 'cdd201622', 'medline': [], 'pubmed': ['27035618'], '

Q3: Which part of the code in the previous cell limited the results to the record corresponding to the *first* id in the list? Copy and paste it below.

Q4: Modify the code in the following cell so that it prints out the record for the **3rd** id in the list.

In [11]:
#Use Entrez esummary to retrieve the record for the *third id* in the list 
summ = Entrez.esummary(db=db, id=ids[2])

#Read the summary and print it out
pub_summary = Entrez.read(summ)[0]
print(pub_summary)

{'DOI': '10.1007/s00125-016-3936-1', 'Title': 'Islet-intrinsic effects of CFTR mutation.', 'Source': 'Diabetologia', 'PmcRefCount': 0, 'Issue': '', 'SO': '2016 Mar 31;', 'ISSN': '0012-186X', 'Volume': '', 'FullJournalName': 'Diabetologia', 'RecordStatus': 'PubMed - as supplied by publisher', 'ESSN': '1432-0428', 'ELocationID': '', 'Pages': '', 'PubStatus': 'aheadofprint', 'AuthorList': ['Koivula FN', 'McClenaghan NH', 'Harper AG', 'Kelly C'], 'EPubDate': '2016 Mar 31', 'PubDate': '2016 Mar 31', 'NlmUniqueID': '0006777', 'LastAuthor': 'Kelly C', 'ArticleIds': {'pii': '10.1007/s00125-016-3936-1', 'medline': [], 'pubmed': ['27033560'], 'eid': '27033560', 'rid': '27033560', 'doi': '10.1007/s00125-016-3936-1'}, u'Item': [], 'History': {'received': '2016/01/19 00:00', 'medline': ['2016/04/02 06:00'], 'pubmed': ['2016/04/02 06:00'], 'aheadofprint': '2016/03/31 00:00', 'accepted': '2016/02/26 00:00', 'entrez': '2016/04/02 06:00'}, 'LangList': ['English'], 'HasAbstract': 1, 'References': [], 'P
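Python lists are zero-indexed, which is why the third id is requested with `ids[2]` rather than `ids[3]`. The short sketch below shows this on the first four ids from the search output above, copied by hand so that it runs without a network connection.

```python
# First four PubMed ids from the esearch output above
ids = ['27035618', '27034411', '27033560', '27031658']

print(ids[0])   # first id:  27035618
print(ids[2])   # third id:  27033560
print(ids[-1])  # last id in this shortened list: 27031658
```

The values match the `'pubmed'` entries in the two printed records: `27035618` for the first record and `27033560` for the third.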

In [12]:
#print out a list of the elements in the record pub_summary
#hint use .keys()
print(list(pub_summary.keys()))


['DOI', 'Title', 'Source', 'PmcRefCount', 'Issue', 'SO', 'ISSN', 'Volume', 'FullJournalName', 'RecordStatus', 'ESSN', 'ELocationID', 'Pages', 'PubStatus', 'AuthorList', 'EPubDate', 'PubDate', 'NlmUniqueID', 'LastAuthor', 'ArticleIds', u'Item', 'History', 'LangList', 'HasAbstract', 'References', 'PubTypeList', u'Id']


NCBI's PubMed database contains both primary and secondary research articles. Write a for loop to count the number of primary research articles ('Journal Article') and secondary research articles ('Review').   
This information is contained in pub_summary[0]['PubTypeList'].

In [13]:
count_journal = 0
count_review = 0

for id in ids: #loop over pubmed IDs
    #print "id:",id
 
    summary = Entrez.esummary(db="pubmed", id=id) #retrieve summary of document
    pub_summary = Entrez.read(summary)#use biopython to parse the summary
    
    print(pub_summary[0]['PubTypeList'])

    #ask if 'Review' is in pub_summary[0]['PubTypeList']
    if 'Review' in pub_summary[0]['PubTypeList']:
        #if so increase the value of count_review by one
        count_review += 1

    #otherwise ask if 'Journal Article' is in pub_summary[0]['PubTypeList']
    elif 'Journal Article' in pub_summary[0]['PubTypeList']:
        #if so increase the value of count_journal by one
        count_journal += 1

print("Number of primary research articles:", count_journal)
print("Number of review articles:", count_review)

['Journal Article']
['Journal Article']
['Journal Article', 'Review']
['Journal Article']
['Journal Article', 'Review']
['Journal Article']
['Journal Article']
['Journal Article']
['Journal Article']
['Journal Article', 'Review']
['Journal Article', 'Review']
['Journal Article', 'Review']
['Journal Article']
['Journal Article']
['Journal Article']
['Review']
['Journal Article']
['Journal Article']
['Journal Article']
['Journal Article']
Number of primary research articles: 14
Number of review articles: 6
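Several records above carry both the 'Journal Article' and the 'Review' tag, so the order of the `if`/`elif` tests decides how they are counted. The sketch below rebuilds the 20 printed `PubTypeList` values by hand (their order does not affect the counts) and shows that testing for 'Review' first reproduces the totals above. The variable names are illustrative.

```python
# The 20 PubTypeList values printed above, grouped by kind
pub_types = (
    [['Journal Article']] * 14
    + [['Journal Article', 'Review']] * 5
    + [['Review']]
)

count_review = 0
count_journal = 0
for types in pub_types:
    if 'Review' in types:             # reviews are counted first ...
        count_review += 1
    elif 'Journal Article' in types:  # ... so this branch sees non-reviews only
        count_journal += 1

# How many records carry the 'Journal Article' tag at all?
tagged_journal = sum('Journal Article' in types for types in pub_types)

print("primary:", count_journal)                        # 14
print("reviews:", count_review)                         # 6
print("tagged 'Journal Article':", tagged_journal)      # 19
```

If the two tests were swapped, the five records with both tags would be counted as primary research instead of reviews.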


In [14]:
#Generate a list of the titles of the publications
title_list = []
for id in ids: #loop over pubmed IDs
    summary = Entrez.esummary(db=db, id=id) #retrieve summary of document
    pub_summary = Entrez.read(summary) #use biopython to parse the summary
    #print the title
    print("Title:", pub_summary[0]['Title'])

    title_list.append(pub_summary[0]['Title']) #add the title to the list title_list

Title: A novel treatment of cystic fibrosis acting on-target: cysteamine plus epigallocatechin gallate for the autophagy-dependent rescue of class II-mutated CFTR.
Title: Modeling Cystic Fibrosis Using Pluripotent Stem Cell-Derived Human Pancreatic Ductal Epithelial Cells.
Title: Islet-intrinsic effects of CFTR mutation.
Title: Cystic fibrosis: a model system for precision medicine.
Title: New and emerging targeted therapies for cystic fibrosis.
Title: Novel CFTR Mutations in Two Iranian Families with Severe Cystic Fibrosis.
Title: Robust Stimulation of W1282X-CFTR Channel Activity by a Combination of Allosteric Modulators.
Title: Effects of Helicobacter pylori Infection on the Expressions and Functional Activities of Human Duodenal Mucosal Bicarbonate Transport Proteins.
Title: Staphylococcus aureus and Pseudomonas aeruginosa co-infection is associated with cystic fibrosis-related diabetes and poor clinical outcomes.
Title: Emerging role of cystic fibrosis transmembrane conductance re

**Based on the titles of your results** - quantify the success of your query using true and false positive terminology.  
Talk with your team and decide how you will categorize the results as true and false positives. Write your definitions below.

In [15]:
#Use a for loop to sort through the journal articles returned by your query. 
#Assign the titles into true and false positive lists based on your criteria.
#Remember that Python considers the case of a string when checking equality.
#Hint - you can convert both the query and the subject to a lower case string using str.lower("YourString")

true_positive = [] #an empty list to contain the titles of the true positives
false_positive = [] #an empty list to contain the title of the false positives
#loop over the PubMed ids and retrieve each title
for id in ids: #loop over pubmed IDs
    summary = Entrez.esummary(db=db, id=id) #retrieve summary of document
    pub_summary = Entrez.read(summary) #use biopython to parse the summary
    title = pub_summary[0]['Title']
    print("Title:", title)

    #use an if statement to define the true positive classification, remember that Python considers the case of the string used

    #if the condition is met add the title to the true positive list
    #else add the title to the list of false positives

Title: A novel treatment of cystic fibrosis acting on-target: cysteamine plus epigallocatechin gallate for the autophagy-dependent rescue of class II-mutated CFTR.
Title: Modeling Cystic Fibrosis Using Pluripotent Stem Cell-Derived Human Pancreatic Ductal Epithelial Cells.
Title: Islet-intrinsic effects of CFTR mutation.
Title: Cystic fibrosis: a model system for precision medicine.
Title: New and emerging targeted therapies for cystic fibrosis.
Title: Novel CFTR Mutations in Two Iranian Families with Severe Cystic Fibrosis.
Title: Robust Stimulation of W1282X-CFTR Channel Activity by a Combination of Allosteric Modulators.
Title: Effects of Helicobacter pylori Infection on the Expressions and Functional Activities of Human Duodenal Mucosal Bicarbonate Transport Proteins.
Title: Staphylococcus aureus and Pseudomonas aeruginosa co-infection is associated with cystic fibrosis-related diabetes and poor clinical outcomes.
Title: Emerging role of cystic fibrosis transmembrane conductance re
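The classification step in the cell above is left as comments. One possible way to fill it in is sketched here on four titles copied from the earlier title listing, so it runs without querying NCBI. The criterion (the title must name the gene by symbol or by full name) is an assumption chosen for illustration; your team's definition may differ.

```python
# Four titles copied from the title listing printed earlier
title_list = [
    "Modeling Cystic Fibrosis Using Pluripotent Stem Cell-Derived Human Pancreatic Ductal Epithelial Cells.",
    "Islet-intrinsic effects of CFTR mutation.",
    "Cystic fibrosis: a model system for precision medicine.",
    "Novel CFTR Mutations in Two Iranian Families with Severe Cystic Fibrosis.",
]

# Assumed criterion: the title names the gene by symbol or by full name
gene_terms = ["cftr", "cystic fibrosis transmembrane conductance regulator"]

true_positive = []   # titles that meet the criterion
false_positive = []  # titles that do not
for title in title_list:
    lowered = title.lower()  # compare in lower case so "CFTR" matches "cftr"
    if any(term in lowered for term in gene_terms):
        true_positive.append(title)
    else:
        false_positive.append(title)

print("true positives:", len(true_positive))    # 2
print("false positives:", len(false_positive))  # 2
```

Under this strict criterion, titles that mention cystic fibrosis without naming the gene land in the false positive list. A looser criterion that also accepts "cystic fibrosis" would move them to the true positives.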

In [16]:
#print out a statement describing how many of the results were true/false positives


#### That's all folks!!! Save and download your notebook, then upload it to Blackboard.