## Project Stage 1: Database Creation

This Jupyter notebook documents the Python code and SQL used to create the project database. This stage has three steps:

1. Download data from the PubMed database with Biopython
2. Create the database in PostgreSQL through Python with SQLAlchemy
3. Save the downloaded data into the database with Pandas

### Step 1: Download the data

In this step I use the Entrez utilities provided by the Biopython package to download article records from PubMed.

In [3]:
from Bio import Entrez

# Set up the API key, contact email, and search term for Entrez
API_KEY = "81301f2670cac89be2a84fc6ffe183cfad09"
EMAIL = "digicosmos@gmail.com"
SEARCH_TERM = '("epilepsy" [MeSH Major Topic]) AND ("2000/1/1"[ppdat] :"2014/12/31"[ppdat])'
Entrez.email = EMAIL
Entrez.api_key = API_KEY
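
The cell above hard-codes the email address and API key. As a hedged alternative, the sketch below reads them from a mapping such as `os.environ`. The helper `load_entrez_credentials` and the variable names `NCBI_EMAIL` and `NCBI_API_KEY` are assumptions made for this example; neither Biopython nor NCBI defines them.

```python
import os

def load_entrez_credentials(env=os.environ):
    """Return (email, api_key) read from a mapping such as os.environ.

    NCBI_EMAIL and NCBI_API_KEY are hypothetical variable names chosen
    for this sketch; they are not defined by Biopython or NCBI.
    """
    email = env.get("NCBI_EMAIL")
    api_key = env.get("NCBI_API_KEY")  # may be None: the key is optional
    if not email:
        raise KeyError("NCBI_EMAIL is not set")
    return email, api_key

# Usage with an in-memory mapping instead of the real environment
email, api_key = load_entrez_credentials(
    {"NCBI_EMAIL": "me@example.org", "NCBI_API_KEY": "abc123"}
)
print(email, api_key)
```

The returned values could then be assigned to `Entrez.email` and `Entrez.api_key` exactly as in the cell above.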

In [14]:
# Run the search in pages of 10,000 to retrieve the full list of PubMed IDs

# First query: only the total hit count is needed here
handle = Entrez.esearch(db="pubmed", retmax=100, term=SEARCH_TERM)
count = int(Entrez.read(handle)['Count'])
handle.close()

ids = []

for start in range(0, count, 10000):
    handle = Entrez.esearch(db="pubmed", retstart=start, retmax=10000, term=SEARCH_TERM)
    ids += Entrez.read(handle)['IdList']
    handle.close()

assert len(ids) == 35803  # Hit count at the time this notebook was run
assert count == len(ids)  # Make sure every hit reported by the search was retrieved
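
The paging pattern in this cell can be checked without touching the network. The sketch below uses a hypothetical helper, `page_starts`, to list the `retstart` offsets the loop would request for a given hit count.

```python
def page_starts(count, page_size=10000):
    """Return the retstart offsets needed to cover `count` results."""
    return list(range(0, count, page_size))

starts = page_starts(35803)
print(starts)  # [0, 10000, 20000, 30000]
```

With 35,803 hits and a page size of 10,000, four requests are enough; the last page simply returns fewer IDs.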

In [15]:
# Prepare four lists to be put into a pandas DataFrame

import datetime

list_id = []
list_title = []
list_date = []
list_authors = []

In [6]:
# Utility function for retrieving author information

def author_list_as_dicts(authorlist):
    result = []
    for element in authorlist:
        author = {}
        if 'LastName' in element.keys():
            author['LastName'] = element['LastName']
        if 'ForeName' in element.keys():
            author['ForeName'] = element['ForeName']
        if 'Initials' in element.keys():
            author['Initials'] = element['Initials']
        result.append(author)
    return result

#author_list_as_dicts(AuthorsList)
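
To illustrate what the function returns, here is a small sketch run on made-up records. It uses a compact equivalent of `author_list_as_dicts` so the block is self-contained; the sample data is invented and only mimics the shape of a parsed `AuthorList`.

```python
def author_list_as_dicts(authorlist):
    # Compact equivalent of the function above: copy only the name fields present
    fields = ("LastName", "ForeName", "Initials")
    return [{k: a[k] for k in fields if k in a} for a in authorlist]

# Made-up sample records; a collective author has no personal name fields
sample = [
    {"LastName": "Doe", "ForeName": "Jane", "Initials": "J", "AffiliationInfo": []},
    {"CollectiveName": "Example Study Group"},
]
result = author_list_as_dicts(sample)
print(result)
```

Note that an entry with none of the three name fields, such as a collective author, comes back as an empty dictionary rather than being skipped.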

In [8]:
# Utility function for converting a date stored as a dictionary into a datetime.date
def date_dict_as_str(d):
    month = int(d['Month'])
    day = int(d['Day'])
    year = int(d['Year'])
    return datetime.date(year, month, day)

# date_dict_as_str(date)
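
A short usage sketch for the date helper, run on a made-up dictionary shaped like one entry of `article["PubmedData"]["History"]`. The function is repeated here only so the block runs on its own.

```python
import datetime

def date_dict_as_str(d):
    # Despite its name, this returns a datetime.date, not a string
    return datetime.date(int(d["Year"]), int(d["Month"]), int(d["Day"]))

# Made-up sample; the values arrive as strings in the parsed XML
sample = {"Year": "2004", "Month": "3", "Day": "9"}
pub_date = date_dict_as_str(sample)
print(pub_date.isoformat())  # 2004-03-09
```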

In [12]:
# Iteratively fetch information on articles, 100 at a time
# Parsing each XML response takes long enough that this loop stays under
# the limit of 10 requests per second that applies when an API key is set
# Note: the four lists must be empty before this cell runs; re-running it
# without resetting them appends the same articles again

for i in range(0, len(ids), 100):

    # End of the current batch (exclusive); the slice below stops at the end
    # of the list, so the last article is included
    j = min(i + 100, len(ids))

    handle = Entrez.efetch(db="pubmed", id=ids[i:j], retmode="xml")

    print("Retrieved info from PubMed Database, articles {0} to {1}".format(i+1, j))

    temp = Entrez.read(handle)
    handle.close()

    print("Parsing XML document complete.")

    # Retrieve relevant information from the parsed XML and append it to the lists

    for article in temp["PubmedArticle"]:
        citation = article["MedlineCitation"]
        pmid = int(str(citation["PMID"]))
        title = citation["Article"]["ArticleTitle"]
        date = date_dict_as_str(article["PubmedData"]["History"][0])
        # Some records have no author list; store None for those
        authors = None
        if "AuthorList" in citation["Article"]:
            authors = author_list_as_dicts(citation["Article"]["AuthorList"])

        list_id.append(pmid)
        list_title.append(title)
        list_date.append(date)
        list_authors.append(authors)

    temp = None

assert len(list_id) == len(ids)

Retrieved info from PubMed Database, articles 1 to 101
Parsing XML document complete.
Retrieved info from PubMed Database, articles 101 to 201
Parsing XML document complete.
...

AssertionError: 

In [13]:
len(list_id)

55802
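
A count that differs from `len(ids)` points to a problem in how the batches are built or accumulated. One way to test the batching on its own, without any network calls, is a small generator like the one below. The name `batches` and the fake ID list are assumptions for this sketch.

```python
def batches(items, size=100):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        # Slicing past the end is safe in Python: the last batch is just shorter
        yield items[start:start + size]

fake_ids = [str(n) for n in range(250)]
chunks = list(batches(fake_ids))
print([len(c) for c in chunks])  # [100, 100, 50]
```

Because the slice end is exclusive, no `- 1` adjustment is needed for the final batch; subtracting one would drop the last ID. Summing the batch lengths and comparing the result with `len(fake_ids)` gives a quick check that nothing is lost or duplicated.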

In [None]:
import pandas as pd

In [None]:
# Create a pandas DataFrame from the 4 lists

pubs = pd.DataFrame({"PubMedID": list_id, "Title": list_title, "PubDate": list_date, "Authors": list_authors})

In [None]:
# Create the article DataFrame

articles = pubs[["PubMedID", "Title", "PubDate"]]

In [None]:
# Create the article-author DataFrame for the many-to-many relationship

article_authors = pubs[["PubMedID", "Authors"]]

# One row per (article, author) pair
article_authors = article_authors.explode("Authors")
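
To show what exploding the author column produces, here is a minimal sketch on made-up data in the same shape as `pubs`. The sample values and the extra `LastName` column are illustrative assumptions, not part of the project data.

```python
import pandas as pd

# Made-up sample data in the same shape as `pubs`
pubs = pd.DataFrame({
    "PubMedID": [1, 2, 3],
    "Authors": [
        [{"LastName": "Doe", "Initials": "J"}, {"LastName": "Roe", "Initials": "R"}],
        [{"LastName": "Poe", "Initials": "E"}],
        None,  # article without an author list
    ],
})

# One row per (article, author) pair; the article without authors keeps one row
article_authors = (
    pubs[["PubMedID", "Authors"]].explode("Authors").reset_index(drop=True)
)

# Pull one field out of each author dict; rows without an author get None
article_authors["LastName"] = article_authors["Authors"].apply(
    lambda a: a.get("LastName") if isinstance(a, dict) else None
)
print(article_authors[["PubMedID", "LastName"]])
```

Each author dictionary ends up on its own row next to the article's `PubMedID`, which is the shape a junction table for a many-to-many relationship needs.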

### Step 2: Use pandas and SQLAlchemy to add downloaded data to database