# PostgreSQL and Python 1

**Author**: Aaron Liu

**Date**: 5/9/2023

**Objective**: Practice basic insert queries using SQL, and the corresponding programming to automate the process in psycopg2

## Setup

**Premade psycopg2 functions**

When working with pandas or NumPy data, adapters are needed to convert between Python and PostgreSQL data types (they are included below, commented out). A connection function is also provided for your convenience.

In [1]:
import psycopg2 as pg # Postgres python
import numpy as np
import sys
from psycopg2.extensions import AsIs

# # Adapters for converting Python/NumPy data types to PostgreSQL-compatible data types
# def adapt_numpy_float64(numpy_float64):
#     return AsIs(numpy_float64)

# def adapt_numpy_int64(numpy_int64):
#     return AsIs(numpy_int64)

# def nan_to_null(f,
#         _NULL=AsIs('NULL'),
#         _Float=pg.extensions.Float):
#     if not np.isnan(f):
#         return _Float(f)
#     return _NULL

# pg.extensions.register_adapter(np.float64, adapt_numpy_float64)
# pg.extensions.register_adapter(np.int64, adapt_numpy_int64)
# pg.extensions.register_adapter(float, nan_to_null)

def connect(params_dict):
    """ Connect to the PostgreSQL database server """
    conn = None
    try:
        # connect to the PostgreSQL server
        print('Connecting to the PostgreSQL database...')
        conn = pg.connect(**params_dict)
    except (Exception, pg.DatabaseError) as error:
        print(error)
        sys.exit(1) 
    print("Connection successful")
    return conn
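
As an alternative to registering the commented-out adapters above, values can be converted to plain Python types before they reach psycopg2. The sketch below is one possible approach, not part of the original notebook: `to_db_value` and `to_db_row` are hypothetical helper names, and the code assumes that NumPy scalars can be recognized by their `.item()` method.

```python
import math


def to_db_value(value):
    """Convert one cell to a plain Python value (hypothetical helper)."""
    # NumPy scalars such as np.int64 and np.float64 expose .item(),
    # which returns the equivalent built-in Python int or float.
    if hasattr(value, "item"):
        value = value.item()
    # psycopg2 sends None as SQL NULL, so map NaN to None.
    if isinstance(value, float) and math.isnan(value):
        return None
    return value


def to_db_row(row):
    """Apply to_db_value to every cell and return a tuple ready for insertion."""
    return tuple(to_db_value(v) for v in row)
```

A row such as `(13, "benzene", float("nan"))` would come back as `(13, "benzene", None)`, which psycopg2 then writes as a NULL boiling point.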

**Connection Details**

Fill in your connection details here. Note that `127.0.0.1` and `localhost` both refer to your own computer acting as the server. Your **local IP address** (found with the `ipconfig` command on Windows) also points to your own computer, although PostgreSQL may need to be configured to accept connections on that address. If you are connecting to an external server, you need the connection details of that server instead.

I recommend creating your own database as a test environment. You must do this outside of Python, through either psql or pgAdmin. Call the database whatever you want, such as `pg_practice` or `ofetdb_testenv`. Either way, the username and password you set up during installation go into the connection details (the default superuser is usually `postgres`). PostgreSQL's default port is `5432`, unless a different one was specified during your installation.

In [2]:
conn_kwargs = {
    "host"      : "localhost",
    "database"  : "test", ## FILL IN CONNECTION DETAILS HERE
    "user"      : "postgres",
    "password"  : "password",
    "port"      : "5432",
}

conn = pg.connect(**conn_kwargs)
print("Connection Successful")

conn.close()
print("Connection Closed")

Connection Successful
Connection Closed
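
Hard-coding a password in a notebook is easy to leak. One possible alternative, sketched below, reads the same five settings from environment variables. The function name `conn_kwargs_from_env` is hypothetical; the variable names `PGHOST`, `PGDATABASE`, `PGUSER`, `PGPASSWORD`, and `PGPORT` are the ones PostgreSQL's own client tools use, and the fallback values simply mirror the cell above.

```python
import os


def conn_kwargs_from_env(env=None):
    """Build the connection dictionary from environment variables."""
    # Accept any mapping so the function can be tried without touching os.environ.
    env = os.environ if env is None else env
    return {
        "host": env.get("PGHOST", "localhost"),
        "database": env.get("PGDATABASE", "test"),
        "user": env.get("PGUSER", "postgres"),
        "password": env.get("PGPASSWORD", ""),
        "port": env.get("PGPORT", "5432"),
    }
```

The result can be passed along exactly like the hand-written dictionary, for example `pg.connect(**conn_kwargs_from_env())`.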


**ALWAYS** close a connection once an operation is finished. An idle connection, especially one left inside an open transaction, can hold locks that prevent database managers from updating the database on the backend.
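
One way to make closing automatic is `contextlib.closing` from the standard library, which calls `.close()` even when an error occurs. The sketch below is an illustration under assumptions: `run_statement` is a hypothetical helper, and `connect` stands for any DB-API connect function such as `pg.connect`. Note that in psycopg2, `with conn:` on its own only ends the transaction and does not close the connection.

```python
from contextlib import closing


def run_statement(connect, params, sql, args=None):
    """Open a connection, run one statement, commit, and always close."""
    # closing() guarantees conn.close() runs, whether or not execute() fails.
    with closing(connect(**params)) as conn:
        with closing(conn.cursor()) as cur:
            cur.execute(sql, args)
        conn.commit()
```

With the notebook's names, a call would look like `run_statement(pg.connect, conn_kwargs, "DELETE FROM SOLVENT")`.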

**Creating Tables**

For this session, we will work with two different tables of information. The first is a list of chemicals, and the second is a list of journal articles.

In [3]:
# pg.connect returns a connection instance, based on the login parameters
conn = pg.connect(**conn_kwargs)
print("Connection Successful")

# A cursor object is used to query the database from Python
cur = conn.cursor()

# The execute command takes a query as an argument. This query is creating a SOLVENT table. 
# Note, inside the triple quotes is exactly what you would type into an SQL interface
sql = '''
    
    DROP TABLE IF EXISTS SOLVENT;
    
    CREATE TABLE SOLVENT (
        pubchem_cid   INT         PRIMARY KEY,
        iupac_name    VARCHAR(50) UNIQUE,
        boiling_point FLOAT
    ); 
    '''

cur.execute(sql)


print("Table(s) created successfully")
conn.commit()

print("Operation successful")
conn.close()

Connection Successful
Table(s) created successfully
Operation successful


## Data Inserts

Let's start by filling the `SOLVENT` table with some chemicals. In SQL, the syntax is as follows:

In [4]:
# Again, we can store the query as a string and execute it with psycopg2
sql = '''
INSERT INTO SOLVENT(pubchem_cid, iupac_name, boiling_point) 
VALUES (13, '1,2,4-trichlorobenzene', 214.4)
'''

# pg.connect returns a connection instance, based on the login parameters
conn = pg.connect(**conn_kwargs)
print("Connection Successful")

# A cursor object is used to query the database from Python
cur = conn.cursor()

cur.execute(sql)

print("Record inserted successfully")
conn.commit()

print("Operation successful")
conn.close()
print("Connection closed")

Connection Successful
Record inserted successfully
Operation successful
Connection closed


This function deletes every row from the SOLVENT table, so the insert cells can be re-run without hitting a duplicate key. We will cover duplicate keys later.

In [18]:
def delete_solvent_rows():
    
    delete_query = "DELETE FROM SOLVENT *"

    # pg.connect returns a connection instance, based on the login parameters
    conn = connect(conn_kwargs)

    # A cursor object is used to query the database from Python
    cur = conn.cursor()

    cur.execute(delete_query)

    print("All rows deleted")
    conn.commit()

    print("Operation successful")
    conn.close()
    print("Connection closed")
    
    return

In [19]:
delete_solvent_rows()

Connecting to the PostgreSQL database...
Connection successful
All rows deleted
Operation successful
Connection closed


**String Formatting**

The code above works, but it doesn't scale. What if you want to add 10 different solvents at a time? An advantage of psycopg2 is that it lets you automate most queries. To illustrate this, let's first look at something that has nothing to do with psycopg2 and is just general Python syntax: string formatting.

See below for two possible ways of printing a string. Read more at: https://www.geeksforgeeks.org/what-does-s-mean-in-a-python-format-string/#

In [5]:
print("My name is Theodore")

My name is Theodore


In [6]:
name = "Theodore"
print("My name is %s" % name)

My name is Theodore


Okay, but what is the use of this? A couple of things. Most importantly, `%s` converts any value to a string, even if it is a different data type. Secondly, it lets you automate things through loops. The `%s` acts as a placeholder.

In [7]:
solvents = ["1,2,4-trichlorobenzene", "benzene", "toluene", "chloroform", "1,1,2,2-tetrachloroethane", "1-chloronaphthalene"]
boiling_points = [214.4, 80.1, 110.6, 61.2, 146.6, 305.2]

my_str = "The boiling point of %s is %s°C"

for solvent, bp in zip(solvents, boiling_points):
    print(my_str % (solvent, bp))



The boiling point of 1,2,4-trichlorobenzene is 214.4°C
The boiling point of benzene is 80.1°C
The boiling point of toluene is 110.6°C
The boiling point of chloroform is 61.2°C
The boiling point of 1,1,2,2-tetrachloroethane is 146.6°C
The boiling point of 1-chloronaphthalene is 305.2°C


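Now, let's see how psycopg2 applies the simplest formatting case to an insert. Note that the INSERT statement we used earlier writes both the columns and the values as parenthesized, comma-separated lists, much like a Python *tuple* (see: https://www.geeksforgeeks.org/tuples-in-python/). psycopg2 converts a Python tuple passed as a parameter into exactly that form, while `AsIs` inserts the column names as written, without quoting them.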

In [21]:
delete_solvent_rows() #if needed

Connecting to the PostgreSQL database...
Connection successful
All rows deleted
Operation successful
Connection closed


In [22]:
# Again, we can store the query as a string and execute it with psycopg2
sql = '''
INSERT INTO SOLVENT (%s) VALUES %s
'''

columns = ["pubchem_cid", "iupac_name", "boiling_point"]
values = (13, '1,2,4-trichlorobenzene', 214.4)

tup = (AsIs(','.join(columns)), values)

# pg.connect returns a connection instance, based on the login parameters
conn = pg.connect(**conn_kwargs)
print("Connection Successful")

# A cursor object is used to query the database from Python
cur = conn.cursor()

cur.execute(sql, tup)

print("Record inserted successfully")
conn.commit()

print("Operation successful")
conn.close()
print("Connection closed")

Connection Successful
Record inserted successfully
Operation successful
Connection closed


In [23]:
def pg_insert(sql, tup):
    """Execute one parameterized statement and commit; roll back on failure."""
    conn = None
    try:
        # Database connection
        conn = pg.connect(**conn_kwargs)
        cur = conn.cursor()

        # Pass the SQL query and its parameters; psycopg2 fills the placeholders
        cur.execute(sql, tup)

        # Commit result
        conn.commit()
        print("Operation Successful")

    except (Exception, pg.DatabaseError) as error:
        # If the statement fails, report the error and undo the transaction
        print("Error: %s" % error)
        if conn is not None:
            conn.rollback()

    finally:
        # Close the connection whether or not the statement succeeded.
        # conn stays None if the connection itself could not be opened.
        if conn is not None:
            conn.close()

    return



### Exercise: Insert from Excel

The Microsoft Excel file `solvents.xlsx` contains simple data that follows the same schema as the SOLVENT table. Write code to automate the population of the SOLVENT table using psycopg2, tuple logic, and other Python syntax.

In [24]:
import pandas as pd
# import pprint as pp

df = pd.read_excel('solvents.xlsx')

sql = '''
INSERT INTO SOLVENT (%s) VALUES %s
'''

cols = AsIs(','.join(df.columns))

for row in df.itertuples(index=False, name=None): # note: itertuples(index=False, name=None) yields each row as a plain tuple
    tup = (cols, row)
    pg_insert(sql, tup)

Error: duplicate key value violates unique constraint "solvent_pkey"
DETAIL:  Key (pubchem_cid)=(13) already exists.

Operation Successful
Operation Successful
Operation Successful
Operation Successful
Operation Successful
Operation Successful
Operation Successful
Operation Successful
Operation Successful
Operation Successful
Operation Successful
Operation Successful


In [25]:
delete_solvent_rows()

Connecting to the PostgreSQL database...
Connection successful
All rows deleted
Operation successful
Connection closed


In [31]:
sql = '''
INSERT INTO SOLVENT (%s) 
VALUES %s
ON CONFLICT(pubchem_cid) DO UPDATE
    SET iupac_name = excluded.iupac_name,
        boiling_point = excluded.boiling_point;
'''

cols = AsIs(','.join(df.columns))

for row in df.itertuples(index=False, name=None): # note: itertuples(index=False, name=None) yields each row as a plain tuple
    tup = (cols, row)
    pg_insert(sql, tup)

Operation Successful
Operation Successful
Operation Successful
Operation Successful
Operation Successful
Operation Successful
Operation Successful
Operation Successful
Operation Successful
Operation Successful
Operation Successful
Operation Successful
Operation Successful


In [58]:
print(AsIs(str(row)[1:-1]))

13229, 'cyclohexylbenzene', 257.0


In [63]:
sql % tup

"\nINSERT INTO SOLVENT (pubchem_cid,iupac_name,boiling_point) \nVALUES (13229, 'cyclohexylbenzene', 257.0)\nON CONFLICT(pubchem_cid) \nDO UPDATE SET (pubchem_cid,iupac_name,boiling_point) = \n    (SELECT None)\n    WHERE SOLVENT.constraint_columns = excluded.constraint_columns;\n\n"

In [70]:
sql = '''
INSERT INTO SOLVENT (%s) 
VALUES %s
ON CONFLICT(pubchem_cid) 
DO UPDATE SET (%s) = 
    (SELECT %s)
    WHERE SOLVENT.pubchem_cid = excluded.pubchem_cid;

'''

cols = AsIs(','.join(df.columns))

for row in df.itertuples(index=False, name=None): # note: itertuples(index=False, name=None) yields each row as a plain tuple
    a = AsIs(str(row)[1:-1])
    tup = (cols, row, cols, a)
    pg_insert(sql, tup)

Operation Successful
Operation Successful
Operation Successful
Operation Successful
Operation Successful
Operation Successful
Operation Successful
Operation Successful
Operation Successful
Operation Successful
Operation Successful
Operation Successful
Operation Successful


Aaron's question: Is there an easier way?

https://towardsdatascience.com/the-easiest-way-to-upsert-with-sqlalchemy-9dae87a75c35
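
One option that stays within psycopg2 is to generate the `SET` clause from the column names, so the same code works for any table. The sketch below is only an illustration: `build_upsert` is a hypothetical helper, it assumes the table and column names are trusted, and it needs at least one non-key column to produce valid SQL.

```python
def build_upsert(table, columns, key):
    """Build INSERT ... ON CONFLICT (key) DO UPDATE for the given columns."""
    # Every column except the conflict key is overwritten with the new value.
    updates = [c for c in columns if c != key]
    set_clause = ", ".join("%s = excluded.%s" % (c, c) for c in updates)
    placeholders = ", ".join(["%s"] * len(columns))
    return (
        "INSERT INTO %s (%s) VALUES (%s) ON CONFLICT (%s) DO UPDATE SET %s"
        % (table, ", ".join(columns), placeholders, key, set_clause)
    )
```

For the SOLVENT table, `build_upsert("SOLVENT", list(df.columns), "pubchem_cid")` produces the same `excluded.iupac_name` and `excluded.boiling_point` assignments that were written by hand earlier, and each row tuple is passed as the parameters.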

In [None]:
import requests
import json
import pandas as pd
import bibtexparser #Parses the BibTeX text returned for a DOI into Python objects
import pprint as pp #The function pp.pprint helps with better visualizing nested key-value information

# Given a valid doi string, return a dictionary of digital object information. Credit: Ron Volkovinsky
def doi2dict(doi):
    #create url
    url = "http://dx.doi.org/" + doi
    
    #create dictionary of http bibtex headers that requests will retrieve from the url
    headers = {"accept": "application/x-bibtex"}
    
    #request the BibTeX-formatted record from the url
    r = requests.get(url, headers = headers).text    

    #parse the returned bibtex text to a dictionary
    #NOTE: USE bibtexparser.customization to split strings into list, etc. (https://bibtexparser.readthedocs.io/en/master/bibtexparser.html?highlight=bparser#module-bibtexparser.bparser)
    bibdata = bibtexparser.bparser.BibTexParser().parse(r)
    
    #return dict of metadata
    return bibdata.entries[0]

doi = '10.1021/acsami.1c20994'

doidict = doi2dict(doi)
pp.pprint(doidict)

In [None]:
import psycopg2
from psycopg2.extras import Json 


kwargs = {
    'database': 'test',
    'user': 'postgres',
    'password': 'password',
    'host': '127.0.0.1',
    'port': '5432'
}

# %% Create Tables for EXPERIMENT_INFO

conn = psycopg2.connect(**kwargs)

print("Connection Successful")

cur = conn.cursor()
cur.execute(
    '''
    CREATE TABLE IF NOT EXISTS EXPERIMENT_INFO (
        exp_id              SERIAL          PRIMARY KEY,
        citation_type       VARCHAR(20),
        meta                JSONB,
        UNIQUE(citation_type, meta)
    );
    '''
)

print("Table(s) created successfully")
conn.commit()

print("Operation successful")
conn.close()

In [None]:
## I left off at inserting new tuples that already exist and violate key constraints. What about sequencing?
## Let's insert about five DOIs and see what happens
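
As a starting point for that experiment, the sketch below inserts one DOI dictionary into `EXPERIMENT_INFO` and skips it if the same `(citation_type, meta)` pair already exists. It is a sketch under assumptions: `insert_citation` is a hypothetical helper, the dictionary is sent as JSON text and cast with `::jsonb`, and `conn` is an open psycopg2 connection. As far as I know, the `SERIAL` counter still advances when an insert is skipped, so gaps in `exp_id` are to be expected.

```python
import json

INSERT_SQL = """
INSERT INTO EXPERIMENT_INFO (citation_type, meta)
VALUES (%s, %s::jsonb)
ON CONFLICT (citation_type, meta) DO NOTHING
RETURNING exp_id;
"""


def insert_citation(conn, citation_type, meta):
    """Insert one citation; return its exp_id, or None if it already existed."""
    cur = conn.cursor()
    try:
        cur.execute(INSERT_SQL, (citation_type, json.dumps(meta)))
        row = cur.fetchone()  # None when the insert was skipped
        conn.commit()
    except Exception:
        conn.rollback()
        raise
    finally:
        cur.close()
    return row[0] if row is not None else None
```

Calling `insert_citation(conn, "article", doidict)` twice with the same dictionary should return an id the first time and `None` the second.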