## Lab 3: Database Searching Using Biopython

#### Enter your name below.

### Part 1: Search NCBI's Gene database for genes involved in cystic fibrosis.

Your goal in this section is to use NCBI's Gene database to identify genes involved in cystic fibrosis in humans.

##### 1. Load the Biopython module "Entrez", enter your email address, then execute.

In [1]:
from Bio import Entrez

#Tell NCBI who you are
Entrez.email = "mensainah.hector001@umb.edu"

##### 2. Generate query and execute the search.

In [2]:
db = "gene" # This is the database we want to search

query = "cystic fibrosis" # This is the query

#We'll use the function Entrez.esearch to search the gene database with our query
h_search =  Entrez.esearch(db=db, term=query) # tell Entrez which database to search and what to look for

record = Entrez.read(h_search) # read the esearch record

res_ids = record["IdList"] # save the list of ids returned by our query to res_ids

#print the list of ids
print(res_ids)
type(res_ids)

['109867792', '109867022', '109889621', '7124', '3569', '7040', '1636', '3586', '3091', '4318', '4790', '5243', '21926', '5743', '21898', '207', '3553', '7421', '5468', '6774']


Bio.Entrez.Parser.ListElement

In [3]:
#Use Entrez esummary to retrieve the record for the first id in the list 
summary = Entrez.esummary(db=db, id=res_ids[0])

#Read the summary 
gene_summary = Entrez.read(summary)

#and print it out
print(gene_summary)

{'DocumentSummarySet': DictElement({'DocumentSummary': [DictElement({'GeneticSource': 'genomic', 'ChrSort': '~~last', 'ChrStart': '999999999', 'Organism': {'CommonName': 'coho salmon', 'ScientificName': 'Oncorhynchus kisutch', 'TaxID': '8019'}, 'LocationHist': [{'AnnotationRelease': '100', 'ChrStart': '45834005', 'ChrAccVer': 'NC_034195.1', 'ChrStop': '45819184', 'AssemblyAccVer': 'GCF_002021735.1'}], 'CurrentID': '0', 'Status': '0', 'MapLocation': '', 'NomenclatureName': '', 'GeneWeight': '0', 'Chromosome': 'LG22', 'OtherDesignations': 'cystic fibrosis transmembrane conductance regulator-like', 'GenomicInfo': [{'ChrLoc': 'LG22', 'ChrStart': '45834005', 'ChrAccVer': 'NC_034195.1', 'ChrStop': '45819184', 'ExonCount': '5'}], 'Name': 'LOC109867792', 'Summary': '', 'Mim': [], 'OtherAliases': '', 'Description': 'cystic fibrosis transmembrane conductance regulator-like', 'NomenclatureSymbol': '', 'NomenclatureStatus': ''}, attributes={'uid': '109867792'})], 'DbBuild': 'Build170318-0315m.1'},

Q1: Name three data types you see in the printed output of gene_summary.
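
The record printed above belongs to a coho salmon gene, while the goal of this part is to find human genes. One possible way to narrow the search, sketched here, is to add an NCBI field tag such as `[Organism]` to the query term. `build_gene_query` is a hypothetical helper written for this illustration and is not part of Biopython; it only builds the string that would be passed as `term` to `Entrez.esearch`.

```python
def build_gene_query(phrase, organism=None):
    """Return an Entrez query string, optionally restricted to one organism.

    Hypothetical helper for illustration only: it builds the text that
    would be passed as the `term` argument of Entrez.esearch.
    """
    query = f"({phrase})"
    if organism:
        # [Organism] is an NCBI field tag that limits matches to one species
        query += f' AND "{organism}"[Organism]'
    return query


query = build_gene_query("cystic fibrosis", organism="Homo sapiens")
print(query)
```

The resulting string can be used in place of the plain query when calling `Entrez.esearch(db="gene", term=query)`.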

##### 3. Analyze the results.

In [4]:
for r_id in res_ids: #loop over gene IDs
    summary = Entrez.esummary(db=db, id=r_id) #retrieve summary of the gene record
    gene_read = Entrez.read(summary) #use biopython to parse the summary
    gene_summary = gene_read['DocumentSummarySet']['DocumentSummary']
    
#the print statements below sit outside the loop, so only the last gene retrieved is printed
#print gene_summary[0]
print("Name:", gene_summary[0]['Name'])
print("Description:", gene_summary[0]['Description']) 
print("Organism:", gene_summary[0]['Organism'])
print("Summary:", gene_summary[0]['Summary'])
print("OtherAliases:",  gene_summary[0]['OtherAliases'])
print('\n')

Name: STAT3
Description: signal transducer and activator of transcription 3
Organism: {'CommonName': 'human', 'ScientificName': 'Homo sapiens', 'TaxID': '9606'}
Summary: The protein encoded by this gene is a member of the STAT protein family. In response to cytokines and growth factors, STAT family members are phosphorylated by the receptor associated kinases, and then form homo- or heterodimers that translocate to the cell nucleus where they act as transcription activators. This protein is activated through phosphorylation in response to various cytokines and growth factors including IFNs, EGF, IL5, IL6, HGF, LIF and BMP2. This protein mediates the expression of a variety of genes in response to cell stimuli, and thus plays a key role in many cellular processes such as cell growth and apoptosis. The small GTPase Rac1 has been shown to bind and regulate the activity of this protein. PIAS3 protein is a specific inhibitor of this protein. Mutations in this gene are associated with infant

Let's quantify the success of our query using true and false positive terminology.  
Talk with your team and decide how you will categorize the results as true and false positives. Write your definitions below.

Use a for loop to sort through the genes returned by your query. 
Assign the genes into true and false positive lists based on your criteria.
Remember that Python considers the case of a string when checking equality.
Hint - you can convert both the query and the subject to a lower case string using str.lower("YourString")
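
The hint above can be sketched as a small function. This is a minimal illustration, and `contains_phrase` is a hypothetical name, not something provided by Biopython. The example text is one of the titles returned later in this lab.

```python
def contains_phrase(text, phrase):
    """Return True if phrase occurs in text, ignoring upper/lower case."""
    return phrase.lower() in text.lower()


title = ("Forskolin-induced Swelling in Intestinal Organoids: An In Vitro Assay "
         "for Assessing Drug Response in Cystic Fibrosis Patients.")

# A plain substring test is case sensitive, so it misses "Cystic Fibrosis"
print("cystic fibrosis" in title)
# Lower-casing both sides makes the comparison case insensitive
print(contains_phrase(title, "cystic fibrosis"))
```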

In [5]:
true_positive = []
false_positive = []
for r_id in res_ids: # loop over gene IDs
    summary = Entrez.esummary(db=db, id=r_id) # retrieve summary of the gene record
    gene_read = Entrez.read(summary) # use biopython to parse the summary
    gene_summary = gene_read['DocumentSummarySet']['DocumentSummary']
    
    # use an if statement to define the true positive classification, remember that Python considers the case of the string used 
    if "cystic fibrosis" in str.lower(gene_summary[0]['Summary']):
       
        # if the condition is met add the record to the true positive list
        true_positive.append(gene_summary)
    
    # if the condition is NOT met add the record to the false positive list
    else:
        false_positive.append(gene_summary)

Q2: Why didn't we include any if statements for true or false negatives?

In [6]:
#print out a statement describing how many of the results were true/false positives
print("true positives:", len(true_positive))
print("false positives:", len(false_positive))

true positives: 0
false positives: 20
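
One possible contributor to a count of zero is that the check looks only at the `Summary` field, which can be empty. In the first record printed earlier, `Summary` is `''` while `Description` contains the phrase. The sketch below checks several fields; `mentions` is a hypothetical helper, the record is a plain dictionary holding a few values copied from that output, and whether such a record should count as a true positive depends on your team's definition.

```python
def mentions(record, phrase, fields=("Summary",)):
    """Return True if phrase appears (case-insensitively) in any listed field."""
    phrase = phrase.lower()
    return any(phrase in str(record.get(field, "")).lower() for field in fields)


# A few fields copied from the first gene record printed earlier in this lab
salmon_record = {
    "Name": "LOC109867792",
    "Summary": "",
    "Description": "cystic fibrosis transmembrane conductance regulator-like",
}

print(mentions(salmon_record, "cystic fibrosis"))
print(mentions(salmon_record, "cystic fibrosis", fields=("Summary", "Description")))
```

With only `Summary` checked the record is rejected, and with `Description` included it is accepted.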


In [7]:
#Write a for loop to check if the gene cystic fibrosis transmembrane conductance regulator,
#abbreviated "CFTR", was classified as a true positive.
#If so, print out all of the information on the gene.
for gene in true_positive:
    if gene[0]['Name'] == "CFTR":
        print(gene, "\n")

### Part 2: Craft a PubMed query to return journal articles about CFTR.

Your goal in this section is to retrieve articles (either journal or review articles) about the cystic fibrosis transmembrane conductance regulator gene ("CFTR").

The code in the following cell is complete - simply execute and observe.

In [8]:
# This is the database we want to search
db = "pubmed"

# This is the query
query = "cystic fibrosis transmembrane conductance regulator"

# We'll use the function Entrez.esearch to search the pubmed database with our query
h_search =  Entrez.esearch(db=db, term=query) # tell Entrez which database to search and what to look for

record = Entrez.read(h_search)

ids = record["IdList"] # save the list of ids returned by our query to ids

print(ids) # print the list of ids

['28289144', '28287550', '28279152', '28273890', '28270008', '28258579', '28247055', '28242698', '28242630', '28236359', '28235656', '28235470', '28234153', '28231890', '28230981', '28230279', '28225751', '28221098', '28215711', '28209466']


How many records were returned? 

Don't count them yourself, write code below to find out!

In [9]:
print("number of pubs:", len(ids))

number of pubs: 20
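
Note that `len(ids)` counts the IDs that were returned, and `esearch` returns at most `retmax` IDs per call (20 by default). The parsed search record also carries a `"Count"` entry with the total number of matches. The sketch below compares the two; `describe_search` is a hypothetical helper, and `example_record` is a made-up stand-in for a parsed record, so its numbers are not real results.

```python
def describe_search(record):
    """Return (number of IDs returned, total number of matches) for a search record."""
    returned = len(record["IdList"])
    total = int(record["Count"])  # Entrez reports the count as a string
    return returned, total


# Made-up stand-in for the dictionary-like object produced by Entrez.read;
# the IDs and the count are placeholders, not real search results
example_record = {"Count": "57", "RetMax": "3", "IdList": ["101", "102", "103"]}

returned, total = describe_search(example_record)
print(f"{returned} of {total} matching records returned")
```

In the notebook, the same function could be called on `record` from the search cell.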


The code in the following cell is complete - simply execute and observe.

In [10]:
# Use Entrez esummary to retrieve the record for the *first id* in the list 
summ = Entrez.esummary(db=db, id=ids[0])

# Read the summary and print it out
pub_summary = Entrez.read(summ)[0]
print(pub_summary)

{'ESSN': '1098-5522', 'Source': 'Infect Immun', 'FullJournalName': 'Infection and immunity', 'References': [], 'Issue': '', 'PubDate': '2017 Mar 13', 'ISSN': '0019-9567', 'RecordStatus': 'PubMed - as supplied by publisher', 'History': {'medline': ['2017/03/16 06:00'], 'entrez': '2017/03/15 06:00', 'pubmed': ['2017/03/16 06:00']}, 'Title': '<i>Staphylococcus aureus</i> survives in cystic fibrosis macrophages forming a reservoir for chronic pneumonia.', 'Pages': '', 'HasAbstract': 1, 'NlmUniqueID': '0246127', 'EPubDate': '2017 Mar 13', 'Id': '28289144', 'PmcRefCount': 0, 'SO': '2017 Mar 13;', 'Volume': '', 'ArticleIds': {'medline': [], 'eid': '28289144', 'pii': 'IAI.00883-16', 'doi': '10.1128/IAI.00883-16', 'pubmed': ['28289144'], 'rid': '28289144'}, 'AuthorList': ['Li C', 'Wu Y', 'Riehle A', 'Ma J', 'Kamler M', 'Gulbins E', 'Grassmé H'], 'Item': [], 'PubTypeList': ['Journal Article'], 'DOI': '10.1128/IAI.00883-16', 'PubStatus': 'aheadofprint', 'LastAuthor': 'Grassmé H', 'ELocationID': '

Q3: What part of the code in the previous cell (copy and paste it below), limited the results to the record corresponding to the *first* id in the list?

Q4: Modify the code in the following cell so that it prints out the record for the **3rd** id in the list.

In [11]:
# Use Entrez esummary to retrieve the record for the *third id* in the list 
summ = Entrez.esummary(db=db, id=ids[2])

# Read the summary and print it out
pub_summary = Entrez.read(summ)[0]
print(pub_summary)

{'ESSN': '1471-2180', 'Source': 'BMC Microbiol', 'FullJournalName': 'BMC microbiology', 'References': [], 'Issue': '1', 'PubDate': '2017 Mar 9', 'ISSN': '', 'RecordStatus': 'PubMed - in process', 'History': {'medline': ['2017/03/11 06:00'], 'accepted': '2017/03/03 00:00', 'received': '2016/04/27 00:00', 'entrez': '2017/03/11 06:00', 'pubmed': ['2017/03/11 06:00']}, 'Title': 'The altered gut microbiota in adults with cystic fibrosis.', 'Pages': '58', 'HasAbstract': 1, 'NlmUniqueID': '100966981', 'EPubDate': '2017 Mar 9', 'Id': '28279152', 'PmcRefCount': 0, 'SO': '2017 Mar 9;17(1):58', 'Volume': '17', 'ArticleIds': {'medline': [], 'eid': '28279152', 'pmcid': 'pmc-id: PMC5345154;', 'rid': '28279152', 'pii': '10.1186/s12866-017-0968-8', 'pmc': 'PMC5345154', 'pubmed': ['28279152'], 'doi': '10.1186/s12866-017-0968-8'}, 'AuthorList': ['Burke DG', 'Fouhy F', 'Harrison MJ', 'Rea MC', 'Cotter PD', "O'Sullivan O", 'Stanton C', 'Hill C', 'Shanahan F', 'Plant BJ', 'Ross RP'], 'Item': [], 'PubTypeLi

In [12]:
#print out a list of the elements in the record pub_summary
#hint use .keys()
print(pub_summary.keys())

dict_keys(['ESSN', 'Source', 'FullJournalName', 'References', 'Issue', 'PubDate', 'ISSN', 'RecordStatus', 'History', 'Title', 'Pages', 'HasAbstract', 'NlmUniqueID', 'EPubDate', 'Id', 'PmcRefCount', 'SO', 'Volume', 'ArticleIds', 'AuthorList', 'Item', 'PubTypeList', 'DOI', 'PubStatus', 'LastAuthor', 'ELocationID', 'LangList'])


NCBI's PubMed database contains both primary and secondary research articles. Write a for loop to count the number of primary research articles ('Journal Article') and secondary research articles ('Review').   
This information is contained in pub_summary[0]['PubTypeList'].

In [13]:
count_journal = 0
count_review = 0

for id in ids: #loop over pubmed IDs
    #print("id:",id)
 
    summary = Entrez.esummary(db="pubmed", id=id) # retrieve summary of document
    pub_summary = Entrez.read(summary) # use biopython to parse the summary
    
    print(pub_summary[0]['PubTypeList'])
 
    # ask if 'Review' is in pub_summary[0]['PubTypeList']
    if 'Review' in pub_summary[0]['PubTypeList']:
        # if so increase the value of count_review by one
        count_review += 1
            
    # otherwise ask if 'Journal Article' is in pub_summary[0]['PubTypeList'] 
    elif 'Journal Article' in pub_summary[0]['PubTypeList']:
        # if so increase the value of count_journal by one
        count_journal += 1
    
print("Number of primary research articles:", count_journal)
print("Number of review articles:", count_review)

['Journal Article']
['Journal Article']
['Journal Article']
['Journal Article']
['Journal Article']
['Journal Article']
['Journal Article']
['Journal Article']
['Journal Article']
['Journal Article']
['Journal Article']
['Journal Article']
['Journal Article']
['Journal Article']
['Journal Article']
['Journal Article']
['Journal Article']
['Journal Article']
['Journal Article', 'Review']
['Journal Article']
Number of primary research articles: 19
Number of review articles: 1
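
The same tally can also be written with `collections.Counter` from the standard library. This is only an alternative sketch: `classify` is a hypothetical helper, and the input list reproduces the `PubTypeList` values printed above instead of querying NCBI again. As in the loop above, a record tagged as both 'Journal Article' and 'Review' is counted as a review.

```python
from collections import Counter


def classify(pub_types):
    """Label a PubTypeList as 'review', 'journal', or 'other'."""
    if "Review" in pub_types:
        return "review"
    if "Journal Article" in pub_types:
        return "journal"
    return "other"


# PubTypeList values copied from the output above (20 records, in order)
pub_type_lists = (
    [["Journal Article"]] * 18
    + [["Journal Article", "Review"]]
    + [["Journal Article"]]
)

counts = Counter(classify(pub_types) for pub_types in pub_type_lists)
print("Number of primary research articles:", counts["journal"])
print("Number of review articles:", counts["review"])
```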


In [14]:
# Generate a list of the titles of the publications
title_list = []
for id in ids: # loop over pubmed IDs
    summary = Entrez.esummary(db=db, id=id) # retrieve summary of document
    pub_summary = Entrez.read(summary) # use biopython to parse the summary
    # print the title
    
    print("Title:", pub_summary[0]['Title'])
    
    title_list.append(pub_summary[0]['Title'])# add the title to the list title_list

Title: <i>Staphylococcus aureus</i> survives in cystic fibrosis macrophages forming a reservoir for chronic pneumonia.
Title: Forskolin-induced Swelling in Intestinal Organoids: An In Vitro Assay for Assessing Drug Response in Cystic Fibrosis Patients.
Title: The altered gut microbiota in adults with cystic fibrosis.
Title: CFTR is involved in the regulation of glucagon secretion in human and rodent alpha cells.
Title: Gene delivery to the lungs: pulmonary gene therapy for cystic fibrosis.
Title: Water Transport Mediated by Other Membrane Proteins.
Title: cAMP-dependent secretagogues stimulate the NaHCO<sub>3</sub> cotransporter in the villous epithelium of the brushtail possum, Trichosurus vulpecula.
Title: Synergy of cAMP and calcium signaling pathways in CFTR regulation.
Title: Electrostatic tuning of the pre- and post-hydrolytic open states in CFTR.
Title: Single nucleotide polymorphisms related to cystic fibrosis in chronic rhinositus-a pilot study.
Title: High-expressing cystic f
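
The loops in this lab send one `esummary` request per ID. `Entrez.esummary` also accepts several IDs as one comma-separated string, so the requests could be grouped. The sketch below only builds those strings; `batch_ids` is a hypothetical helper, and the default batch size of 200 is an arbitrary assumption, not an NCBI recommendation.

```python
def batch_ids(id_list, batch_size=200):
    """Yield comma-separated ID strings holding at most batch_size IDs each."""
    for start in range(0, len(id_list), batch_size):
        yield ",".join(id_list[start:start + batch_size])


# The first five PubMed IDs returned by the search above
first_ids = ['28289144', '28287550', '28279152', '28273890', '28270008']

for id_string in batch_ids(first_ids, batch_size=2):
    # Each string could be passed as Entrez.esummary(db="pubmed", id=id_string)
    print(id_string)
```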

**Based on the titles of your results** - quantify the success of your query using true and false positive terminology.  
Talk with your team and decide how you will categorize the results as true and false positives. Write your definitions below.

In [15]:
#Use a for loop to sort through the journal articles returned by your query. 
#Assign the titles into true and false positive lists based on your criteria.
#Remember that Python considers the case of a string when checking equality.
#Hint - you can convert both the query and the subject to a lower case string using str.lower("YourString")

true_positive = [] # an empty list to contain the titles of the true positives
false_positive = [] # an empty list to contain the title of the false positives
# loop over the pubmed IDs and look up the title of each record
for id in ids: # loop over pubmed IDs
    summary = Entrez.esummary(db=db, id=id) # retrieve summary of document
    pub_summary = Entrez.read(summary) # use biopython to parse the summary
    # store the title and print it
    title = pub_summary[0]['Title']
    print("title:", title)

    # use an if statement to define the true positive classification, remember that Python considers the case of the string used
    if ("CFTR" in title) or ("cystic fibrosis transmembrane conductance regulator" in title):
        # if the condition is met add the title to the true positive list
        true_positive.append(title)
    # else add the title to the list of false positives
    else:
        false_positive.append(title)

title: <i>Staphylococcus aureus</i> survives in cystic fibrosis macrophages forming a reservoir for chronic pneumonia.
title: Forskolin-induced Swelling in Intestinal Organoids: An In Vitro Assay for Assessing Drug Response in Cystic Fibrosis Patients.
title: The altered gut microbiota in adults with cystic fibrosis.
title: CFTR is involved in the regulation of glucagon secretion in human and rodent alpha cells.
title: Gene delivery to the lungs: pulmonary gene therapy for cystic fibrosis.
title: Water Transport Mediated by Other Membrane Proteins.
title: cAMP-dependent secretagogues stimulate the NaHCO<sub>3</sub> cotransporter in the villous epithelium of the brushtail possum, Trichosurus vulpecula.
title: Synergy of cAMP and calcium signaling pathways in CFTR regulation.
title: Electrostatic tuning of the pre- and post-hydrolytic open states in CFTR.
title: Single nucleotide polymorphisms related to cystic fibrosis in chronic rhinositus-a pilot study.
title: High-expressing cystic f
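
Some of the titles above contain markup such as `<i>` and `<sub>`, which could interrupt a phrase when matching on the title text. A minimal sketch for removing it with the standard `re` module follows. `strip_tags` is a hypothetical helper, and the pattern assumes simple tags like the ones shown in the output.

```python
import re

# Matches simple opening or closing tags such as <i>, </i>, <sub>, </sub>
TAG_PATTERN = re.compile(r"</?[A-Za-z][^>]*>")


def strip_tags(title):
    """Return the title with simple HTML-style tags removed."""
    return TAG_PATTERN.sub("", title)


title = ("<i>Staphylococcus aureus</i> survives in cystic fibrosis macrophages "
         "forming a reservoir for chronic pneumonia.")
print(strip_tags(title))
```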

In [16]:
#print out a statement describing how many of the results were true/false positives
print("true positives:", len(true_positive))
print("false positives:", len(false_positive))

true positives: 9
false positives: 11
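
One common way to summarize counts like these in a single number is precision: the fraction of returned results that were classified as true positives. The sketch below applies it to the two list lengths printed above (9 and 11). `precision` is a hypothetical helper, and the result depends entirely on the definitions your team chose.

```python
def precision(true_positives, false_positives):
    """Return the fraction of returned results that are true positives."""
    total = true_positives + false_positives
    if total == 0:
        return 0.0  # no results returned, so there is nothing to score
    return true_positives / total


# Lengths of true_positive and false_positive from the cell above
print("precision:", precision(9, 11))
```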


#### That's all folks!!! Save and download your notebook, then upload it to Blackboard.