## Data Insert Exercises

In this set of exercises, you will practice a series of transactions using a dataset of solution-based experiments. You will find the data you need under `data/solution_experiments.xlsx`.

The first two sheets, `solvent` and `polymer`, list the chemicals held in a hypothetical laboratory storage. The third sheet, `solution`, lists the solutions that a hypothetical student named Jason has made over the last several months.

In [1]:
# Import Libraries

import requests
import json
import pandas as pd
import bibtexparser #API for requesting DOI information
import pprint as pp #The function pp.pprint helps with better visualizing nested key-value information
import psycopg2 as pg # Postgres python
import numpy as np
import sys
from psycopg2.extensions import AsIs





In [2]:
## Functions

# Adapters that convert NumPy and Python numeric types into PostgreSQL-compatible values
def adapt_numpy_float64(numpy_float64):
    return AsIs(numpy_float64)

def adapt_numpy_int64(numpy_int64):
    return AsIs(numpy_int64)

def nan_to_null(f,
        _NULL=AsIs('NULL'),
        _Float=pg.extensions.Float):
    # Send a Python float NaN to the database as NULL
    if not np.isnan(f):
        return _Float(f)
    return _NULL

pg.extensions.register_adapter(np.float64, adapt_numpy_float64)
pg.extensions.register_adapter(np.int64, adapt_numpy_int64)
pg.extensions.register_adapter(float, nan_to_null)

def pg_query(sql, tup):
    """Run one parameterized statement on a fresh connection and commit it."""
    conn = None
    try:
        # Open a connection using the module-level conn_kwargs
        conn = pg.connect(**conn_kwargs)
        cur = conn.cursor()

        # Pass the SQL statement and its parameters separately
        cur.execute(sql, tup)

#         # Fetch result
#         fetched = cur.fetchone()[0]

        # Commit result
        conn.commit()
        print("Operation Successful")

    except (Exception, pg.DatabaseError) as error:
        # Report the failure and undo any partial work
        print("Error: %s" % error)
        if conn is not None:
            conn.rollback()

    finally:
        # Close the connection whether or not the statement succeeded
        if conn is not None:
            conn.close()

    return

In [3]:
conn_kwargs = {
    "host"      : "localhost",
    "database"  : "test", ## FILL IN CONNECTION DETAILS HERE
    "user"      : "postgres",
    "password"  : "YOUR_PASSWORD", ## replace with your own password; do not commit real credentials
    "port"      : "5432",
}

def connect(**params_dict):
    """ Connect to the PostgreSQL database server """
    conn = None
    try:
        # connect to the PostgreSQL server
        print('Connecting to the PostgreSQL database...')
        conn = pg.connect(**params_dict)
    except (Exception, pg.DatabaseError) as error:
        print(error)
        sys.exit(1) 
    print("Connection successful")
    return conn

conn = connect(**conn_kwargs)

conn.close()
print("Connection Closed")

Connecting to the PostgreSQL database...
Connection successful
Connection Closed
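To keep credentials out of the notebook, the connection details can be read from environment variables instead of being typed into the cell. The following is a minimal sketch, not part of the original exercise. The helper name `conn_kwargs_from_env` is hypothetical, the variable names (`PGHOST`, `PGDATABASE`, `PGUSER`, `PGPASSWORD`, `PGPORT`) are the standard libpq ones, and the defaults simply mirror the values used above.

```python
import os

def conn_kwargs_from_env(env=os.environ):
    """Build psycopg2-style connection kwargs from libpq environment variables."""
    return {
        "host": env.get("PGHOST", "localhost"),
        "database": env.get("PGDATABASE", "test"),
        "user": env.get("PGUSER", "postgres"),
        "password": env.get("PGPASSWORD", ""),
        "port": env.get("PGPORT", "5432"),
    }

# Example with an explicit mapping instead of the real environment
example = conn_kwargs_from_env({"PGPASSWORD": "not-a-real-password", "PGDATABASE": "lab"})
print(example["database"], example["port"])
```

The resulting dictionary can be passed to `connect(**...)` in the same way as `conn_kwargs`.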


### 1. Create Polymer and Solvent Tables

First, please create from scratch two tables which contain the chemical inventory of the laboratory (Solvent and Polymer). 

* Solvent contains three attributes: a unique identifier taken from its PubChem CID, its IUPAC name, and its boiling point
* Polymer should contain a unique identifier of your choice (e.g., "polymer_id"), two columns for its common name and its full name, and three fields for molecular weight information (Mn, Mw, dispersity)

The primary key for `solvent` should be the PubChem CID, and the primary key for `polymer` should be one that you generate. You may use the `SERIAL` data type to facilitate automatic key generation for new rows. Make sure you add `UNIQUE` constraints that properly account for the real-world scenario.

In [8]:
conn = connect(**conn_kwargs)

cur = conn.cursor()

sql = '''

    DROP TABLE IF EXISTS SOLVENT;
    
    CREATE TABLE SOLVENT (
        pubchem_cid         INT             PRIMARY KEY,             
        iupac_name          VARCHAR(50),
        boiling_point       FLOAT,
        UNIQUE(iupac_name)
    );
    
    
    DROP TABLE IF EXISTS POLYMER;
    
    CREATE TABLE POLYMER (
        polymer_id          SERIAL          PRIMARY KEY,
        common_name         VARCHAR(500),
        iupac_name          VARCHAR(500),
        Mn                  FLOAT,
        Mw                  FLOAT,
        dispersity          FLOAT,
        UNIQUE(common_name,iupac_name,Mn,Mw,dispersity)
    );
'''

cur.execute(sql)
conn.commit()

print("Operation successful")
conn.close()

Connecting to the PostgreSQL database...
Connection successful
Operation successful
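One caveat about multi-column `UNIQUE` constraints: in standard SQL, `NULL` is never equal to `NULL`, so two rows that differ only in a missing value are not treated as duplicates. The sketch below illustrates this with Python's built-in `sqlite3` module as a stand-in, so that it runs without a PostgreSQL server. SQLite and PostgreSQL share this default behaviour, but the trimmed-down table and the polymer name `P3HT` are illustrative assumptions and are not taken from the dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE polymer (
        polymer_id  INTEGER PRIMARY KEY,
        common_name TEXT,
        mn          REAL,
        UNIQUE (common_name, mn)
    )
""")

# Same name, missing Mn, inserted twice: NULL never equals NULL,
# so the UNIQUE constraint does not see these rows as duplicates.
for _ in range(2):
    conn.execute("INSERT INTO polymer (common_name, mn) VALUES (?, ?)", ("P3HT", None))

# Same name with a known Mn: the second insert is rejected.
conn.execute("INSERT INTO polymer (common_name, mn) VALUES (?, ?)", ("P3HT", 20.0))
try:
    conn.execute("INSERT INTO polymer (common_name, mn) VALUES (?, ?)", ("P3HT", 20.0))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

n_null_rows = conn.execute(
    "SELECT COUNT(*) FROM polymer WHERE mn IS NULL"
).fetchone()[0]
print(n_null_rows, duplicate_rejected)
conn.close()
```

If inventory rows with missing molecular weights should also be deduplicated, PostgreSQL 15 and later accept `UNIQUE NULLS NOT DISTINCT (...)`. On older versions, the duplicates can be removed in pandas before inserting.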


### 2. Create a Record of Solutions

Next, we must create a table to store new experimental records of solutions that are made from the laboratory reagents available in inventory. Recently, Jason has only been interested in testing the solubility of single polymers in solution. In his digital lab notebook, he has written down the date he made the solution, the concentration of the solution, and what polymer and solvent he used. Usually, Jason was a good student and wrote down the batch information (molecular weights, dispersity) of the polymer, but sometimes he was an idiot and forgot. Regardless, we should record all of the data points. 

1. Create an entity-relationship diagram that shows the connections between polymer, solvent, and solution.

2. Create a table `solution` that accurately models the conceptual schema you generated in the ERD.

`solution` should contain five attributes:

* A primary key (a solution identifier)
* Two foreign keys (referencing the polymer and solvent table)
* Two fields that describe the solution concentration and the date the solution was created

**Warning**: Do not use the given dataset columns to decide how you actually name your columns. Name your attributes in a way that is friendly to programming. For this example, feel free to ignore units. We will deal with that another day.
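As an illustration of programming-friendly names, the spreadsheet headers can be converted mechanically. This is a sketch only. The helper `to_snake_case` is hypothetical, and the header strings are the ones used for the `solution` sheet later in this notebook. You may still prefer hand-picked names such as `dispersity` over `pdi`.

```python
import re

def to_snake_case(header: str) -> str:
    """Turn a spreadsheet header into a lowercase, underscore-separated name."""
    # Drop a unit in parentheses, e.g. "Mw (kg/mol)" -> "Mw"
    no_units = re.sub(r"\s*\([^)]*\)", "", header)
    # Replace runs of non-alphanumeric characters with a single underscore
    snake = re.sub(r"[^0-9A-Za-z]+", "_", no_units.strip())
    return snake.strip("_").lower()

headers = [
    "Solvent Name",
    "Polymer Used",
    "Mw (kg/mol)",
    "Solution Concentration (mg/mL)",
    "Date Created",
]
renamed = {h: to_snake_case(h) for h in headers}
print(renamed)
```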

In [9]:
sql = '''

DROP TABLE IF EXISTS SOLUTION;

CREATE TABLE SOLUTION(
    solution_id    SERIAL   PRIMARY KEY,
    solution_concentration   FLOAT,
    solvent_id     INT,
    polymer_id     INT,
    date_created   DATE,
    
    FOREIGN KEY(polymer_id) REFERENCES POLYMER(polymer_id)
            ON DELETE SET NULL ON UPDATE CASCADE,
    FOREIGN KEY(solvent_id) REFERENCES SOLVENT(pubchem_cid)
            ON DELETE SET NULL ON UPDATE CASCADE
)
'''



conn = connect(**conn_kwargs)

cur = conn.cursor()

cur.execute(sql)
conn.commit()

print("Operation successful")
conn.close()

Connecting to the PostgreSQL database...
Connection successful
Operation successful


### 3. Populate the Lab Inventory

In this example, you can pretend that you exported Sheets 1 and 2 of `solution_experiments.xlsx` from an online inventory manager, and you want to load them into your own database. Upload the inventory in Sheets 1 and 2 to `solvent` and `polymer`, respectively.

As a note: sometimes, the online manager accidentally duplicates records, so you want to ensure that no duplicates exist in your local copy. 
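One way to tolerate duplicated export rows is to let the database skip them with `ON CONFLICT DO NOTHING`. The sketch below uses the built-in `sqlite3` module as a stand-in so it runs without a server. PostgreSQL accepts the same clause, but with `%s` placeholders through psycopg2 rather than `?`. The rows are illustrative: the CID for chloroform appears later in this notebook, while the toluene entry and both boiling points are assumptions added for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE solvent (
        pubchem_cid   INTEGER PRIMARY KEY,
        iupac_name    TEXT UNIQUE,
        boiling_point REAL
    )
""")

exported_rows = [
    (6212, "chloroform", 61.2),
    (1140, "toluene", 110.6),
    (6212, "chloroform", 61.2),  # accidental duplicate from the export
]

# With no conflict target, any primary-key or UNIQUE clash is skipped silently
conn.executemany(
    "INSERT INTO solvent (pubchem_cid, iupac_name, boiling_point) "
    "VALUES (?, ?, ?) ON CONFLICT DO NOTHING",
    exported_rows,
)
conn.commit()

stored = conn.execute(
    "SELECT pubchem_cid, iupac_name FROM solvent ORDER BY pubchem_cid"
).fetchall()
print(stored)
conn.close()
```

Omitting the conflict target means the row is skipped whichever constraint it violates, which is convenient when a table has both a primary key and a separate `UNIQUE` constraint.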

In [10]:
db = pd.ExcelFile('data/solution_experiments.xlsx')
df_solvent = db.parse('solvent')
df_polymer = db.parse('polymer')
df_solution = db.parse('solution')

In [11]:
#Inserting into solvent table

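# ON CONFLICT DO NOTHING skips rows that the export duplicated
# (a clash on the primary key or on the IUPAC name)
sql = '''
INSERT INTO SOLVENT (%s) VALUES %s
ON CONFLICT DO NOTHING
'''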
columns = list(df_solvent.columns)

for row in df_solvent.itertuples(index=False):
    values = tuple(row)
    tup = (AsIs(','.join(columns)),values)
    pg_query(sql,tup)


Operation Successful
Operation Successful
Operation Successful
Operation Successful
Operation Successful
Operation Successful
Operation Successful
Operation Successful
Operation Successful
Operation Successful
Operation Successful
Operation Successful
Operation Successful


In [12]:
#Inserting into polymer table

sql = '''
INSERT INTO POLYMER (%s) VALUES %s
ON CONFLICT(common_name,iupac_name,Mn,Mw,dispersity) DO NOTHING
'''
columns = list(df_polymer.columns)

for row in df_polymer.itertuples(index=False):
    values = (tuple(row))
    tup = (AsIs(','.join(columns)),values)
    pg_query(sql,tup)

Operation Successful
[... "Operation Successful" repeats once per inserted row; remaining output truncated ...]

### 4. Update Jason's Experimental Records

Using the "Solution" sheet in the Excel file, figure out a way to import Jason's historical record of solution-based experiments into the `solution` table you made in PostgreSQL. Make sure you properly tailor Jason's dataset to the schema that you created in **Part 2**. 

Hints:
* You may have to do some Python pre-processing to generate the tuple you want. Don't be afraid to write a few helper functions for that.
* Jason is not going to write PubChem IDs in his laboratory notebook, because he's not insane. You will have to find a way to look up the PubChem ID that is associated with a given solvent name. You might be able to do this in your SQL query. More information: https://dba.stackexchange.com/questions/46410/how-do-i-insert-a-row-which-contains-a-foreign-key
* Similarly, you're not putting all of the polymer information into your solution table, because that would be redundant. Instead, you need to look up which polymer he used from the known inventory, and assign the polymer_id that way.

If the link above does not help, try to find your own resources! Programming isn't always easy, but this is a standard transaction that many other people have probably tried to do. Learn how to learn!
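The approach discussed in the linked answer, looking up the foreign key inside the `INSERT` itself, can be sketched as follows. The example uses the built-in `sqlite3` module as a stand-in so it runs anywhere. With psycopg2 the statement has the same shape but uses `%s` placeholders. The tables are trimmed to the columns needed, and the concentration value and the name `unobtainium` are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE solvent (pubchem_cid INTEGER PRIMARY KEY, iupac_name TEXT UNIQUE);
    CREATE TABLE solution (
        solution_id            INTEGER PRIMARY KEY,
        solvent_id             INTEGER REFERENCES solvent (pubchem_cid),
        solution_concentration REAL
    );
    INSERT INTO solvent VALUES (6212, 'chloroform');
""")

# INSERT ... SELECT resolves the solvent name to its key in one statement
insert_by_name = """
    INSERT INTO solution (solvent_id, solution_concentration)
    SELECT pubchem_cid, ? FROM solvent WHERE iupac_name = ?
"""
conn.execute(insert_by_name, (10.0, "chloroform"))

# A name that is not in the inventory selects no rows, so nothing is inserted
conn.execute(insert_by_name, (10.0, "unobtainium"))
conn.commit()

rows = conn.execute(
    "SELECT solvent_id, solution_concentration FROM solution"
).fetchall()
print(rows)
conn.close()
```

Note the design trade-off: an unknown solvent name fails silently here, so it is worth checking the cursor's row count or verifying the name beforehand.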

In [13]:
#Inserting into solution table

# Establish a connection to the PostgreSQL database
conn = connect(**conn_kwargs)

# Create a cursor object to execute SQL queries
cur = conn.cursor()

# Loop through each row of the DataFrame and insert the data into the solution table
for _, row in df_solution.iterrows():

    # Retrieve the solvent_id based on the solvent name from the solvent table
    cur.execute("SELECT pubchem_cid FROM solvent WHERE iupac_name = %s", (row['Solvent Name'],))
    match = cur.fetchone()
    if match is None:
        # Fail with a clear message instead of a TypeError on None[0]
        raise LookupError("Solvent %r is not in the solvent table" % row['Solvent Name'])
    solvent_id = match[0]

    # Missing batch values become None (SQL NULL) so they can be compared below
    mw, mn, pdi = (None if pd.isnull(row[c]) else float(row[c])
                   for c in ('Mw (kg/mol)', 'Mn (kg/mol)', 'PDI'))

    # IS NOT DISTINCT FROM treats NULL as equal to NULL, so one query covers both
    # complete batch records and the ones where Jason left fields blank
    cur.execute("SELECT polymer_id FROM polymer WHERE common_name = %s "
                "AND mw IS NOT DISTINCT FROM %s AND mn IS NOT DISTINCT FROM %s "
                "AND dispersity IS NOT DISTINCT FROM %s",
                (row['Polymer Used'], mw, mn, pdi))
    match = cur.fetchone()
    polymer_id = match[0] if match else None  # NULL if no inventory entry matches

    # Insert every record, including those without batch information
    cur.execute("INSERT INTO solution (solvent_id, polymer_id, solution_concentration, date_created) VALUES (%s, %s, %s, %s)",
                (solvent_id, polymer_id, row['Solution Concentration (mg/mL)'], row['Date Created']))

# Commit the changes to the database
conn.commit()

# Close the cursor and the connection
cur.close()
conn.close()

Connecting to the PostgreSQL database...
Connection successful


### 5. Upload New Experiments

Two weeks later, Jason compiles another dataset with a new set of solutions. After speaking with his advisor, he has decided to test a few new solvents. You can find these new solutions in `data/new_experiments.xlsx`.

Add these new experiments to the relational database that you have generated. 

* Note that the new solution records contain reagents that have not previously been added to the inventory. What happens when you try to import the data with the queries you generated above? What is this type of constraint called?
* Can you figure out how to structure queries or a workflow that will account for this situation? For example: Try inserting the record. If the solvent does not exist, get the PubChem CID using an API, add it to the `solvent` table, and retry the add. Normally, I would also ask you to do this with polymers; i.e., if the polymer does not exist, similarly add new information to the `polymer` table, then retry the add -- but for now let's try the exercise with solvent only. You can assume he only experimented on existing polymers.

The code block below contains a simple *Chemical Name to PubChem CID* function. You may use it to facilitate the programming process.

In [None]:
# Workflow:
# 1. Go through the solvents in the new_experiments Excel sheet and get the PubChem CID and boiling point using an API
#    (NOTE: the boiling point is hard to obtain from most APIs)
# 2. Compare against the solvent table and add any solvents that are not yet present
# 3. Add the new records to the solution table if they are not duplicates

In [8]:
db_New = pd.ExcelFile('data/new_experiments.xlsx')
df_New = db_New.parse('new_experiments')

In [28]:

import pubchempy as pcpy #You may have to pip or conda install pubchempy API. 

#Read more here: https://pubchempy.readthedocs.io/en/latest/

def name2cid(chemName):
    # Look up the PubChem CID(s) that match a chemical name
    cid = pcpy.get_cids(chemName)
    # The boiling point is not retrieved here (see the workflow note above)
    # If chemName isn't valid, cid is an empty list and an error message is returned instead
    if not cid:
        return 'Please enter a valid name.'
    else:
        return cid

name2cid('chloroform')

[6212]
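The "insert, and add the solvent if it is missing" workflow described above can be sketched as follows. This is one possible arrangement under stated assumptions, not a reference solution. It uses the built-in `sqlite3` module as a stand-in for PostgreSQL so that it runs offline. `get_or_create_solvent` and `fake_lookup` are hypothetical helpers, and `fake_lookup` replaces the network call made by `name2cid`. The tables are trimmed to the relevant columns.

```python
import sqlite3

def get_or_create_solvent(conn, name, lookup_cid):
    """Return the CID stored for `name`, adding the solvent first if it is missing."""
    row = conn.execute(
        "SELECT pubchem_cid FROM solvent WHERE iupac_name = ?", (name,)
    ).fetchone()
    if row is not None:
        return row[0]
    cids = lookup_cid(name)  # expected to return a list of CIDs
    if not cids:
        raise ValueError(f"No PubChem CID found for {name!r}")
    conn.execute(
        "INSERT INTO solvent (pubchem_cid, iupac_name) VALUES (?, ?)", (cids[0], name)
    )
    return cids[0]

def fake_lookup(name):
    # Hypothetical offline stand-in for name2cid; returns a list of CIDs
    return {"chloroform": [6212]}.get(name, [])

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.execute("CREATE TABLE solvent (pubchem_cid INTEGER PRIMARY KEY, iupac_name TEXT UNIQUE)")
conn.execute("""
    CREATE TABLE solution (
        solution_id            INTEGER PRIMARY KEY,
        solvent_id             INTEGER REFERENCES solvent (pubchem_cid),
        solution_concentration REAL
    )
""")
insert_sql = "INSERT INTO solution (solvent_id, solution_concentration) VALUES (?, ?)"

# Inserting a solution whose solvent is not in the inventory violates the foreign key
try:
    conn.execute(insert_sql, (6212, 5.0))
    fk_violation = False
except sqlite3.IntegrityError:
    fk_violation = True

# Add the missing solvent, then retry the insert
cid = get_or_create_solvent(conn, "chloroform", fake_lookup)
conn.execute(insert_sql, (cid, 5.0))
conn.commit()

n_solutions = conn.execute("SELECT COUNT(*) FROM solution").fetchone()[0]
print(fk_violation, cid, n_solutions)
conn.close()
```

With psycopg2, the corresponding error in recent versions is `psycopg2.errors.ForeignKeyViolation`, and the failed transaction has to be rolled back before retrying. If `name2cid` is passed in as `lookup_cid`, keep in mind that it returns a message string rather than an empty list for unknown names, so that case needs its own check.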

### Bonus

You may have seen that a lot of these queries can be challenging to program using just psycopg2. Many online articles and tutorials recommend the use of SQLAlchemy to facilitate the generation of queries, and many Python users prefer it because it "requires little knowledge of SQL". Your bonus assignment is to test this theory by exploring SQLAlchemy on your own. For example, can you perform all of the questions in this homework with the help of an SQLAlchemy connection to your database? You are a superstar if you do this entire exercise with SQLAlchemy, and you'll be on your way to helping develop OFETdb. But for starters, try the first few problems.

Here is a tutorial I found, but feel free to find others on YouTube or Google, or just ask ChatGPT.
https://www.learndatasci.com/tutorials/using-databases-python-postgres-sqlalchemy-and-alembic/