## Exercise : Extracting data from OFET-DB to perform ML

Its great that we have data stored in a standardized manner in our database but now we need to extract data from this database in the form of a table containing descriptors [X] and property [y] to perform ML. This notebook will focus on how we can do that. 

First i want you to use the backup file called 20231206_ofetdb_v2_backup11 and add it to your pgadmin as a practice database. We will avoid working with the original database.

In [1]:
# Connect to the database


import psycopg2
import pandas as pd
import numpy as np
import plotly.express as px


#sample connection details
# pgparams = {
#     "host": "127.0.0.1",
#     "database": "ofetdb_testenv",
#     "user": "postgres",
#     "password": "password",
#     "port": "5432",
# }


# Set max number of displayed columns and rows in Jupyter Notebook
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)


pgparams = {
    "host": "chbe-ofet-db.postgres.database.azure.com",
    "database": "ofetdb_v2_test",
    "user": "mg200_ofetdb",
    "password": "MGEROFETDB23!",
    "port": "5432",
}

def read_select_query(query):

    with psycopg2.connect(**pgparams) as conn:

        df = pd.read_sql_query(query, conn)

    return df


## Simplest Scenario :

- assume only single polymer scenarios, no blends first (wt_frac = 1)
- assume only single solvent scenarios, no multiple solvents (vol_frac = 1)
- show device substrate information 
- show film deposition information (spin, blade, etc). We will not use the parameters for now
- dont go into detail of solution treatment, substrate pretreat and post process. Just show if treatment was performed
- show hole mobility information

Follow the code blocks below and we will eventually end up with a table containing this information 

1. Prepare a dataframe containing the sample_Id, citation_type and meta information from the experimental table (This one is done as a practice for you)

In [2]:
## Adding all the experiment information 

# SQL query to fetch the required data
query = '''
    SELECT
        s.sample_id,
        e.citation_type,
        e.meta as experiment_meta
    FROM
        SAMPLE as s
    JOIN
        EXPERIMENT_INFO as e ON s.exp_id = e.exp_id;
'''

# Use the read_select_query function to execute the query
result_df = read_select_query(query)

# Display the resulting DataFrame
#print(result_df)

result_df



Unnamed: 0,sample_id,citation_type,experiment_meta
0,1,literature,"{'doi': '10.1039/C5TC02579F', 'publication_typ..."
1,2,literature,"{'doi': '10.1039/C5TC02579F', 'publication_typ..."
2,3,literature,"{'doi': '10.1039/C5TC02579F', 'publication_typ..."
3,4,literature,"{'doi': '10.1039/C5TC02579F', 'publication_typ..."
4,5,literature,"{'doi': '10.1039/C5TC02579F', 'publication_typ..."
5,6,literature,"{'doi': '10.1039/C5TC02579F', 'publication_typ..."
6,7,literature,"{'doi': '10.1039/C5TC02579F', 'publication_typ..."
7,8,literature,"{'doi': '10.1039/C5TC02579F', 'publication_typ..."
8,9,literature,"{'doi': '10.1039/C5TC02579F', 'publication_typ..."
9,10,literature,"{'doi': '10.1039/C5TC02579F', 'publication_typ..."


remember to keep adding to the existing query to add more information to the table. It is going to get complicated and long soon FYI. 

2. Now to this database add the solution concentration information

In [3]:
query = '''
    SELECT
        s.sample_id,
        e.citation_type,
        e.meta as experiment_meta,
        sol.concentration
    FROM
        SAMPLE as s
    JOIN
        EXPERIMENT_INFO as e ON s.exp_id = e.exp_id
    LEFT JOIN
        OFET_PROCESS op ON s.process_id = op.process_id
    LEFT JOIN
        SOLUTION sol ON op.solution_id = sol.solution_id;
'''

result_df = read_select_query(query)

result_df



Unnamed: 0,sample_id,citation_type,experiment_meta,concentration
0,1,literature,"{'doi': '10.1039/C5TC02579F', 'publication_typ...",4.0
1,2,literature,"{'doi': '10.1039/C5TC02579F', 'publication_typ...",4.0
2,3,literature,"{'doi': '10.1039/C5TC02579F', 'publication_typ...",4.0
3,4,literature,"{'doi': '10.1039/C5TC02579F', 'publication_typ...",4.0
4,5,literature,"{'doi': '10.1039/C5TC02579F', 'publication_typ...",4.0
5,6,literature,"{'doi': '10.1039/C5TC02579F', 'publication_typ...",4.0
6,7,literature,"{'doi': '10.1039/C5TC02579F', 'publication_typ...",4.0
7,8,literature,"{'doi': '10.1039/C5TC02579F', 'publication_typ...",4.0
8,9,literature,"{'doi': '10.1039/C5TC02579F', 'publication_typ...",4.0
9,10,literature,"{'doi': '10.1039/C5TC02579F', 'publication_typ...",4.0


3. Now to this table add the solvent information but only consider devices made from single solvent vol frac = 1

In [4]:
query = '''
    SELECT
        s.sample_id,
        e.citation_type,
        e.meta as experiment_meta,
        sol.concentration,
        sms.vol_frac as solvent_vol_fract,
        sv.pubchem_cid as solvent_pubchem_cid,
        sv.iupac_name as solvent_iupac_name
    FROM
        SAMPLE as s
    JOIN
        EXPERIMENT_INFO as e ON s.exp_id = e.exp_id
    LEFT JOIN
        OFET_PROCESS op ON s.process_id = op.process_id
    LEFT JOIN
        SOLUTION sol ON op.solution_id = sol.solution_id
    LEFT JOIN
        SOLUTION_MAKEUP_SOLVENT sms ON sol.solution_id = sms.solution_id
    LEFT JOIN
        SOLVENT sv ON sms.solvent_id = sv.pubchem_cid
    WHERE
        sms.vol_frac=1;
'''

result_df = read_select_query(query)

result_df



Unnamed: 0,sample_id,citation_type,experiment_meta,concentration,solvent_vol_fract,solvent_pubchem_cid,solvent_iupac_name
0,1,literature,"{'doi': '10.1039/C5TC02579F', 'publication_typ...",4.0,1.0,7964,chlorobenzene
1,2,literature,"{'doi': '10.1039/C5TC02579F', 'publication_typ...",4.0,1.0,7964,chlorobenzene
2,3,literature,"{'doi': '10.1039/C5TC02579F', 'publication_typ...",4.0,1.0,7964,chlorobenzene
3,4,literature,"{'doi': '10.1039/C5TC02579F', 'publication_typ...",4.0,1.0,7964,chlorobenzene
4,5,literature,"{'doi': '10.1039/C5TC02579F', 'publication_typ...",4.0,1.0,7964,chlorobenzene
5,6,literature,"{'doi': '10.1039/C5TC02579F', 'publication_typ...",4.0,1.0,7964,chlorobenzene
6,7,literature,"{'doi': '10.1039/C5TC02579F', 'publication_typ...",4.0,1.0,7964,chlorobenzene
7,8,literature,"{'doi': '10.1039/C5TC02579F', 'publication_typ...",4.0,1.0,7964,chlorobenzene
8,9,literature,"{'doi': '10.1039/C5TC02579F', 'publication_typ...",4.0,1.0,7964,chlorobenzene
9,10,literature,"{'doi': '10.1039/C5TC02579F', 'publication_typ...",4.0,1.0,7964,chlorobenzene


4. Now to this table add polymer information only for devices made from one polymer (no blends) (wt-frac = 1)

In [5]:
query = '''
    SELECT
        s.sample_id,
        e.citation_type,
        e.meta as experiment_meta,
        sol.concentration,
        sms.vol_frac as solvent_vol_fract,
        sv.pubchem_cid as solvent_pubchem_cid,
        sv.iupac_name as solvent_iupac_name,
        p.common_name as polymer_common_name,
        p.iupac_name as polymer_iupac_name,
        p.mw as polymer_mw,
        p.mn as polymer_mn,
        p.dispersity as polymer_dispersity
    FROM
        SAMPLE as s
    JOIN
        EXPERIMENT_INFO as e ON s.exp_id = e.exp_id
    LEFT JOIN
        OFET_PROCESS op ON s.process_id = op.process_id
    LEFT JOIN
        SOLUTION sol ON op.solution_id = sol.solution_id
    LEFT JOIN
        SOLUTION_MAKEUP_SOLVENT sms ON sol.solution_id = sms.solution_id
    LEFT JOIN
        SOLVENT sv ON sms.solvent_id = sv.pubchem_cid
    LEFT JOIN
        SOLUTION_MAKEUP_POLYMER smp ON sol.solution_id = smp.solution_id
    LEFT JOIN
        POLYMER p ON smp.polymer_id = p.polymer_id
    WHERE
        sms.vol_frac=1 AND smp.wt_frac=1;
'''

result_df = read_select_query(query)

result_df



Unnamed: 0,sample_id,citation_type,experiment_meta,concentration,solvent_vol_fract,solvent_pubchem_cid,solvent_iupac_name,polymer_common_name,polymer_iupac_name,polymer_mw,polymer_mn,polymer_dispersity
0,1,literature,"{'doi': '10.1039/C5TC02579F', 'publication_typ...",4.0,1.0,7964,chlorobenzene,DPP-DTT,"poly[2,5-(2-octyldodecyl)-3,6-diketopyrrolopyr...",299.0,90.0,3.32
1,2,literature,"{'doi': '10.1039/C5TC02579F', 'publication_typ...",4.0,1.0,7964,chlorobenzene,DPP-DTT,"poly[2,5-(2-octyldodecyl)-3,6-diketopyrrolopyr...",299.0,90.0,3.32
2,3,literature,"{'doi': '10.1039/C5TC02579F', 'publication_typ...",4.0,1.0,7964,chlorobenzene,DPP-DTT,"poly[2,5-(2-octyldodecyl)-3,6-diketopyrrolopyr...",299.0,90.0,3.32
3,4,literature,"{'doi': '10.1039/C5TC02579F', 'publication_typ...",4.0,1.0,7964,chlorobenzene,DPP-DTT,"poly[2,5-(2-octyldodecyl)-3,6-diketopyrrolopyr...",299.0,90.0,3.32
4,5,literature,"{'doi': '10.1039/C5TC02579F', 'publication_typ...",4.0,1.0,7964,chlorobenzene,DPP-DTT,"poly[2,5-(2-octyldodecyl)-3,6-diketopyrrolopyr...",299.0,90.0,3.32
5,6,literature,"{'doi': '10.1039/C5TC02579F', 'publication_typ...",4.0,1.0,7964,chlorobenzene,DPP-DTT,"poly[2,5-(2-octyldodecyl)-3,6-diketopyrrolopyr...",299.0,90.0,3.32
6,7,literature,"{'doi': '10.1039/C5TC02579F', 'publication_typ...",4.0,1.0,7964,chlorobenzene,DPP-DTT,"poly[2,5-(2-octyldodecyl)-3,6-diketopyrrolopyr...",299.0,90.0,3.32
7,8,literature,"{'doi': '10.1039/C5TC02579F', 'publication_typ...",4.0,1.0,7964,chlorobenzene,DPP-DTT,"poly[2,5-(2-octyldodecyl)-3,6-diketopyrrolopyr...",299.0,90.0,3.32
8,9,literature,"{'doi': '10.1039/C5TC02579F', 'publication_typ...",4.0,1.0,7964,chlorobenzene,DPP-DTT,"poly[2,5-(2-octyldodecyl)-3,6-diketopyrrolopyr...",299.0,90.0,3.32
9,10,literature,"{'doi': '10.1039/C5TC02579F', 'publication_typ...",4.0,1.0,7964,chlorobenzene,DPP-DTT,"poly[2,5-(2-octyldodecyl)-3,6-diketopyrrolopyr...",299.0,90.0,3.32


5. Now to this table add the device substrate information 

In [6]:
query = '''
    SELECT
        s.sample_id,
        e.citation_type,
        e.meta as experiment_meta,
        sol.concentration,
        sms.vol_frac as solvent_vol_fract,
        sv.pubchem_cid as solvent_pubchem_cid,
        sv.iupac_name as solvent_iupac_name,
        p.common_name as polymer_common_name,
        p.iupac_name as polymer_iupac_name,
        p.mw as polymer_mw,
        p.mn as polymer_mn,
        p.dispersity as polymer_dispersity,
        df.params as device_substrate_parameters
    FROM
        SAMPLE as s
    JOIN
        EXPERIMENT_INFO as e ON s.exp_id = e.exp_id
    LEFT JOIN
        OFET_PROCESS op ON s.process_id = op.process_id
    LEFT JOIN
        SOLUTION sol ON op.solution_id = sol.solution_id
    LEFT JOIN
        SOLUTION_MAKEUP_SOLVENT sms ON sol.solution_id = sms.solution_id
    LEFT JOIN
        SOLVENT sv ON sms.solvent_id = sv.pubchem_cid
    LEFT JOIN
        SOLUTION_MAKEUP_POLYMER smp ON sol.solution_id = smp.solution_id
    LEFT JOIN
        POLYMER p ON smp.polymer_id = p.polymer_id
    LEFT JOIN
        DEVICE_FABRICATION df ON op.device_fab_id = df.device_fab_id
    WHERE
        sms.vol_frac=1 AND smp.wt_frac=1;
'''

result_df = read_select_query(query)

result_df



Unnamed: 0,sample_id,citation_type,experiment_meta,concentration,solvent_vol_fract,solvent_pubchem_cid,solvent_iupac_name,polymer_common_name,polymer_iupac_name,polymer_mw,polymer_mn,polymer_dispersity,device_substrate_parameters
0,1,literature,"{'doi': '10.1039/C5TC02579F', 'publication_typ...",4.0,1.0,7964,chlorobenzene,DPP-DTT,"poly[2,5-(2-octyldodecyl)-3,6-diketopyrrolopyr...",299.0,90.0,3.32,"{'channel_width': 1500.0, 'gate_material': 'n-..."
1,2,literature,"{'doi': '10.1039/C5TC02579F', 'publication_typ...",4.0,1.0,7964,chlorobenzene,DPP-DTT,"poly[2,5-(2-octyldodecyl)-3,6-diketopyrrolopyr...",299.0,90.0,3.32,"{'channel_width': 1500.0, 'gate_material': 'n-..."
2,3,literature,"{'doi': '10.1039/C5TC02579F', 'publication_typ...",4.0,1.0,7964,chlorobenzene,DPP-DTT,"poly[2,5-(2-octyldodecyl)-3,6-diketopyrrolopyr...",299.0,90.0,3.32,"{'channel_width': 1500.0, 'gate_material': 'n-..."
3,4,literature,"{'doi': '10.1039/C5TC02579F', 'publication_typ...",4.0,1.0,7964,chlorobenzene,DPP-DTT,"poly[2,5-(2-octyldodecyl)-3,6-diketopyrrolopyr...",299.0,90.0,3.32,"{'channel_width': 1500.0, 'gate_material': 'n-..."
4,5,literature,"{'doi': '10.1039/C5TC02579F', 'publication_typ...",4.0,1.0,7964,chlorobenzene,DPP-DTT,"poly[2,5-(2-octyldodecyl)-3,6-diketopyrrolopyr...",299.0,90.0,3.32,"{'channel_width': 1500.0, 'gate_material': 'n-..."
5,6,literature,"{'doi': '10.1039/C5TC02579F', 'publication_typ...",4.0,1.0,7964,chlorobenzene,DPP-DTT,"poly[2,5-(2-octyldodecyl)-3,6-diketopyrrolopyr...",299.0,90.0,3.32,"{'channel_width': 1500.0, 'gate_material': 'n-..."
6,7,literature,"{'doi': '10.1039/C5TC02579F', 'publication_typ...",4.0,1.0,7964,chlorobenzene,DPP-DTT,"poly[2,5-(2-octyldodecyl)-3,6-diketopyrrolopyr...",299.0,90.0,3.32,"{'channel_width': 1500.0, 'gate_material': 'n-..."
7,8,literature,"{'doi': '10.1039/C5TC02579F', 'publication_typ...",4.0,1.0,7964,chlorobenzene,DPP-DTT,"poly[2,5-(2-octyldodecyl)-3,6-diketopyrrolopyr...",299.0,90.0,3.32,"{'channel_width': 1500.0, 'gate_material': 'n-..."
8,9,literature,"{'doi': '10.1039/C5TC02579F', 'publication_typ...",4.0,1.0,7964,chlorobenzene,DPP-DTT,"poly[2,5-(2-octyldodecyl)-3,6-diketopyrrolopyr...",299.0,90.0,3.32,"{'channel_width': 1500.0, 'gate_material': 'n-..."
9,10,literature,"{'doi': '10.1039/C5TC02579F', 'publication_typ...",4.0,1.0,7964,chlorobenzene,DPP-DTT,"poly[2,5-(2-octyldodecyl)-3,6-diketopyrrolopyr...",299.0,90.0,3.32,"{'channel_width': 1500.0, 'gate_material': 'n-..."


6. Now to this table add the film deposition type only not the parameters associated with it

In [7]:
query = '''
    SELECT
        s.sample_id,
        e.citation_type,
        e.meta as experiment_meta,
        sol.concentration,
        sms.vol_frac as solvent_vol_fract,
        sv.pubchem_cid as solvent_pubchem_cid,
        sv.iupac_name as solvent_iupac_name,
        p.common_name as polymer_common_name,
        p.iupac_name as polymer_iupac_name,
        p.mw as polymer_mw,
        p.mn as polymer_mn,
        p.dispersity as polymer_dispersity,
        df.params as device_substrate_parameters,
        fd.deposition_type as film_deposition_type,
        fd.params as film_deposition_params
    FROM
        SAMPLE as s
    JOIN
        EXPERIMENT_INFO as e ON s.exp_id = e.exp_id
    LEFT JOIN
        OFET_PROCESS op ON s.process_id = op.process_id
    LEFT JOIN
        SOLUTION sol ON op.solution_id = sol.solution_id
    LEFT JOIN
        SOLUTION_MAKEUP_SOLVENT sms ON sol.solution_id = sms.solution_id
    LEFT JOIN
        SOLVENT sv ON sms.solvent_id = sv.pubchem_cid
    LEFT JOIN
        SOLUTION_MAKEUP_POLYMER smp ON sol.solution_id = smp.solution_id
    LEFT JOIN
        POLYMER p ON smp.polymer_id = p.polymer_id
    LEFT JOIN
        DEVICE_FABRICATION df ON op.device_fab_id = df.device_fab_id
    LEFT JOIN
        FILM_DEPOSITION fd ON op.film_deposition_id = fd.film_deposition_id
    WHERE
        sms.vol_frac=1 AND smp.wt_frac=1;
'''

result_df = read_select_query(query)

result_df



Unnamed: 0,sample_id,citation_type,experiment_meta,concentration,solvent_vol_fract,solvent_pubchem_cid,solvent_iupac_name,polymer_common_name,polymer_iupac_name,polymer_mw,polymer_mn,polymer_dispersity,device_substrate_parameters,film_deposition_type,film_deposition_params
0,1,literature,"{'doi': '10.1039/C5TC02579F', 'publication_typ...",4.0,1.0,7964,chlorobenzene,DPP-DTT,"poly[2,5-(2-octyldodecyl)-3,6-diketopyrrolopyr...",299.0,90.0,3.32,"{'channel_width': 1500.0, 'gate_material': 'n-...",spin,"{'spin_rate': 1000.0, 'environment': 'air'}"
1,2,literature,"{'doi': '10.1039/C5TC02579F', 'publication_typ...",4.0,1.0,7964,chlorobenzene,DPP-DTT,"poly[2,5-(2-octyldodecyl)-3,6-diketopyrrolopyr...",299.0,90.0,3.32,"{'channel_width': 1500.0, 'gate_material': 'n-...",spin,"{'spin_rate': 1000.0, 'environment': 'air'}"
2,3,literature,"{'doi': '10.1039/C5TC02579F', 'publication_typ...",4.0,1.0,7964,chlorobenzene,DPP-DTT,"poly[2,5-(2-octyldodecyl)-3,6-diketopyrrolopyr...",299.0,90.0,3.32,"{'channel_width': 1500.0, 'gate_material': 'n-...",spin,"{'spin_rate': 1000.0, 'environment': 'air'}"
3,4,literature,"{'doi': '10.1039/C5TC02579F', 'publication_typ...",4.0,1.0,7964,chlorobenzene,DPP-DTT,"poly[2,5-(2-octyldodecyl)-3,6-diketopyrrolopyr...",299.0,90.0,3.32,"{'channel_width': 1500.0, 'gate_material': 'n-...",spin,"{'spin_rate': 1000.0, 'environment': 'air'}"
4,5,literature,"{'doi': '10.1039/C5TC02579F', 'publication_typ...",4.0,1.0,7964,chlorobenzene,DPP-DTT,"poly[2,5-(2-octyldodecyl)-3,6-diketopyrrolopyr...",299.0,90.0,3.32,"{'channel_width': 1500.0, 'gate_material': 'n-...",spin,"{'spin_rate': 1000.0, 'environment': 'air'}"
5,6,literature,"{'doi': '10.1039/C5TC02579F', 'publication_typ...",4.0,1.0,7964,chlorobenzene,DPP-DTT,"poly[2,5-(2-octyldodecyl)-3,6-diketopyrrolopyr...",299.0,90.0,3.32,"{'channel_width': 1500.0, 'gate_material': 'n-...",spin,"{'spin_rate': 1000.0, 'environment': 'air'}"
6,7,literature,"{'doi': '10.1039/C5TC02579F', 'publication_typ...",4.0,1.0,7964,chlorobenzene,DPP-DTT,"poly[2,5-(2-octyldodecyl)-3,6-diketopyrrolopyr...",299.0,90.0,3.32,"{'channel_width': 1500.0, 'gate_material': 'n-...",spin,"{'spin_rate': 1000.0, 'environment': 'air'}"
7,8,literature,"{'doi': '10.1039/C5TC02579F', 'publication_typ...",4.0,1.0,7964,chlorobenzene,DPP-DTT,"poly[2,5-(2-octyldodecyl)-3,6-diketopyrrolopyr...",299.0,90.0,3.32,"{'channel_width': 1500.0, 'gate_material': 'n-...",spin,"{'spin_rate': 1000.0, 'environment': 'air'}"
8,9,literature,"{'doi': '10.1039/C5TC02579F', 'publication_typ...",4.0,1.0,7964,chlorobenzene,DPP-DTT,"poly[2,5-(2-octyldodecyl)-3,6-diketopyrrolopyr...",299.0,90.0,3.32,"{'channel_width': 1500.0, 'gate_material': 'n-...",spin,"{'spin_rate': 1000.0, 'environment': 'air'}"
9,10,literature,"{'doi': '10.1039/C5TC02579F', 'publication_typ...",4.0,1.0,7964,chlorobenzene,DPP-DTT,"poly[2,5-(2-octyldodecyl)-3,6-diketopyrrolopyr...",299.0,90.0,3.32,"{'channel_width': 1500.0, 'gate_material': 'n-...",spin,"{'spin_rate': 1000.0, 'environment': 'air'}"


7. Now to this table add the solution treatment, substrate preatreat and post process informaiton 

1 if treatment is done and 0 if no treatment



In [8]:
query = '''
    SELECT
        s.sample_id,
        e.citation_type,
        e.meta as experiment_meta,
        sol.concentration,
        sms.vol_frac as solvent_vol_fract,
        sv.pubchem_cid as solvent_pubchem_cid,
        sv.iupac_name as solvent_iupac_name,
        p.common_name as polymer_common_name,
        p.iupac_name as polymer_iupac_name,
        p.mw as polymer_mw,
        p.mn as polymer_mn,
        p.dispersity as polymer_dispersity,
        df.params as device_substrate_parameters,
        fd.deposition_type as film_deposition_type,
        fd.params as film_deposition_params,
        CAST(m.data -> 'electron_mobility' ->> 'value' AS FLOAT) AS electron_mobility

    FROM
        SAMPLE as s
    JOIN
        EXPERIMENT_INFO as e ON s.exp_id = e.exp_id
    LEFT JOIN
        OFET_PROCESS op ON s.process_id = op.process_id
    LEFT JOIN
        SOLUTION sol ON op.solution_id = sol.solution_id
    LEFT JOIN
        SOLUTION_MAKEUP_SOLVENT sms ON sol.solution_id = sms.solution_id
    LEFT JOIN
        SOLVENT sv ON sms.solvent_id = sv.pubchem_cid
    LEFT JOIN
        SOLUTION_MAKEUP_POLYMER smp ON sol.solution_id = smp.solution_id
    LEFT JOIN
        POLYMER p ON smp.polymer_id = p.polymer_id
    LEFT JOIN
        DEVICE_FABRICATION df ON op.device_fab_id = df.device_fab_id
    LEFT JOIN
        FILM_DEPOSITION fd ON op.film_deposition_id = fd.film_deposition_id
    LEFT JOIN
        MEASUREMENT m on s.sample_id = m.sample_id
    WHERE
        sms.vol_frac=1 AND smp.wt_frac=1 AND
        m.measurement_type = 'transfer_curve' AND
        m.data -> 'electron_
        mobility' ->> 'value' IS NOT NULL;
'''

result_df = read_select_query(query)

result_df



Unnamed: 0,sample_id,citation_type,experiment_meta,concentration,solvent_vol_fract,solvent_pubchem_cid,solvent_iupac_name,polymer_common_name,polymer_iupac_name,polymer_mw,polymer_mn,polymer_dispersity,device_substrate_parameters,film_deposition_type,film_deposition_params,electron_mobility


8. Now to this table add the hole mobility information 

only keep devices that have an actual hole mobility value and is not Null or NAN

In [2]:
query = '''
    SELECT
        s.sample_id,
        e.citation_type,
        e.meta as experiment_meta,
        sol.concentration,
        sms.vol_frac as solvent_vol_frac,
        sv.pubchem_cid as solvent_pubchem_cid,
        sv.iupac_name as solvent_iupac_name,
        p.common_name as polymer_common_name,
        p.iupac_name as polymer_iupac_name,
        p.mw as polymer_mw,
        p.mn as polymer_mn,
        p.dispersity as polymer_dispersity,
        df.params as device_substrate_parameters,
        fd.deposition_type as film_deposition_type,
        fd.params as film_deposition_params,
        CAST(m.data -> 'electron_mobility' ->> 'value' AS FLOAT) AS electron_mobility,
        CASE WHEN op.solution_treatment_id IS NOT NULL THEN 1 ELSE 0 END AS Solution_treatment,
        CASE WHEN op.substrate_pretreat_id is NOT NULL THEN 1 ELSE 0 END AS Substrate_pretreatment,
        CASE WHEN op.postprocess_id IS NOT NULL THEN 1 ELSE 0 END AS Post_process
    FROM
        SAMPLE as s
    JOIN
        EXPERIMENT_INFO as e ON s.exp_id = e.exp_id
    LEFT JOIN
        OFET_PROCESS op ON s.process_id = op.process_id
    LEFT JOIN
        SOLUTION sol ON op.solution_id = sol.solution_id
    LEFT JOIN
        SOLUTION_MAKEUP_SOLVENT sms ON sol.solution_id = sms.solution_id
    LEFT JOIN
        SOLVENT sv ON sms.solvent_id = sv.pubchem_cid
    LEFT JOIN
        SOLUTION_MAKEUP_POLYMER smp ON sol.solution_id = smp.solution_id
    LEFT JOIN
        POLYMER p ON smp.polymer_id = p.polymer_id
    LEFT JOIN
        DEVICE_FABRICATION df ON op.device_fab_id = df.device_fab_id
    LEFT JOIN
        FILM_DEPOSITION fd ON op.film_deposition_id = fd.film_deposition_id
    LEFT JOIN
        MEASUREMENT m on s.sample_id = m.sample_id
    WHERE
        sms.vol_frac=1 AND smp.wt_frac=1 AND
        m.measurement_type = 'transfer_curve' AND
        m.data -> 'electron_mobility' ->> 'value' IS NOT NULL;
'''

result_df = read_select_query(query)

result_df



Unnamed: 0,sample_id,citation_type,experiment_meta,concentration,solvent_vol_frac,solvent_pubchem_cid,solvent_iupac_name,polymer_common_name,polymer_iupac_name,polymer_mw,polymer_mn,polymer_dispersity,device_substrate_parameters,film_deposition_type,film_deposition_params,electron_mobility,solution_treatment,substrate_pretreatment,post_process
0,666,laboratory,"{'email': 'myl220@lehigh.edu', 'last_name': 'L...",6.0,1.0,7964,chlorobenzene,N2200,"poly{[N,N′-bis(2-octyldodecyl)-naphthalene-1,4...",202.0,91.0,2.22,"{'channel_width': 2000, 'gate_material': 'p-do...",blade,"{'blade_angle': 90, 'environment': 'air', 'tem...",0.04615,1,1,1
1,670,laboratory,"{'email': 'myl220@lehigh.edu', 'last_name': 'L...",10.0,1.0,7964,chlorobenzene,N2200,"poly{[N,N′-bis(2-octyldodecyl)-naphthalene-1,4...",202.0,91.0,2.22,"{'channel_width': 2000, 'gate_material': 'p-do...",blade,"{'blade_angle': 90, 'environment': 'air', 'tem...",0.04,1,1,1
2,125,literature,"{'doi': '10.1002/adma.201102786', 'publication...",10.0,1.0,7239,"1,2-dichlorobenzene",DPP-DTT,"poly[2,5-(2-octyldodecyl)-3,6-diketopyrrolopyr...",193.5,50.0,3.87,"{'channel_width': 10000.0, 'gate_material': 'g...",spin,{'environment': 'nitrogen'},0.02,0,1,1
3,661,laboratory,"{'email': 'myl220@lehigh.edu', 'last_name': 'L...",1.0,1.0,7964,chlorobenzene,N2200,"poly{[N,N′-bis(2-octyldodecyl)-naphthalene-1,4...",202.0,91.0,2.22,"{'channel_width': 2000, 'gate_material': 'p-do...",blade,"{'blade_angle': 90, 'environment': 'air', 'tem...",0.013149,1,1,1
4,24,literature,"{'doi': '10.1038/srep00754 ', 'publication_typ...",5.0,1.0,7964,chlorobenzene,DPP-DTT,"poly[2,5-(2-octyldodecyl)-3,6-diketopyrrolopyr...",501.0,110.0,4.55,"{'channel_width': 4000.0, 'gate_material': 'n-...",spin,{'environment': 'air'},1.35,0,1,1
5,25,literature,"{'doi': '10.1038/srep00754 ', 'publication_typ...",5.0,1.0,7964,chlorobenzene,DPP-DTT,"poly[2,5-(2-octyldodecyl)-3,6-diketopyrrolopyr...",501.0,110.0,4.55,"{'channel_width': 4000.0, 'gate_material': 'n-...",spin,{'environment': 'air'},1.75,0,1,1
6,26,literature,"{'doi': '10.1038/srep00754 ', 'publication_typ...",5.0,1.0,7964,chlorobenzene,DPP-DTT,"poly[2,5-(2-octyldodecyl)-3,6-diketopyrrolopyr...",501.0,110.0,4.55,"{'channel_width': 1400.0, 'gate_material': 'n-...",spin,{'environment': 'air'},2.25,0,1,1
7,27,literature,"{'doi': '10.1038/srep00754 ', 'publication_typ...",5.0,1.0,7964,chlorobenzene,DPP-DTT,"poly[2,5-(2-octyldodecyl)-3,6-diketopyrrolopyr...",501.0,110.0,4.55,"{'channel_width': 1400.0, 'gate_material': 'n-...",spin,{'environment': 'air'},2.55,0,1,1
8,28,literature,"{'doi': '10.1038/srep00754 ', 'publication_typ...",5.0,1.0,7964,chlorobenzene,DPP-DTT,"poly[2,5-(2-octyldodecyl)-3,6-diketopyrrolopyr...",501.0,110.0,4.55,"{'channel_width': 1400.0, 'gate_material': 'n-...",spin,{'environment': 'air'},2.65,0,1,1
9,29,literature,"{'doi': '10.1038/srep00754 ', 'publication_typ...",5.0,1.0,7964,chlorobenzene,DPP-DTT,"poly[2,5-(2-octyldodecyl)-3,6-diketopyrrolopyr...",501.0,110.0,4.55,"{'channel_width': 1400.0, 'gate_material': 'n-...",spin,{'environment': 'air'},0.03,0,1,1


This is a simple version of the dataframe result_df. There are a couple more things we can now do to the dataframe to make it more machine readable. We no longer need to use sql queries. can just work with the dataframe and pandas

9. The code block below will unpack the data stored in device_substrate_parameters and store it as columns (this one is done for you)

In [3]:
 #unpacking the information stored in device_substrate_parameters

import pandas as pd
from pandas import json_normalize

# # Extract the 'device_substrate_parameters' column and normalize it
params_df = json_normalize(result_df['device_substrate_parameters'])

# # Concatenate the original DataFrame with the new 'params_df'
result_df = pd.concat([result_df, params_df], axis=1)

# # Drop the original 'device_substrate_parameters' column
result_df = result_df.drop('device_substrate_parameters', axis=1)

# # Display the resulting DataFrame
result_df

Unnamed: 0,sample_id,citation_type,experiment_meta,concentration,solvent_vol_frac,solvent_pubchem_cid,solvent_iupac_name,polymer_common_name,polymer_iupac_name,polymer_mw,polymer_mn,polymer_dispersity,film_deposition_type,film_deposition_params,electron_mobility,solution_treatment,substrate_pretreatment,post_process,channel_width,gate_material,channel_length,dielectric_1_material,dielectric_1_thickness,electrode_configuration,dielectric_material,dielectric_thickness,dielectric_material_2,dielectric_thickness_2
0,666,laboratory,"{'email': 'myl220@lehigh.edu', 'last_name': 'L...",6.0,1.0,7964,chlorobenzene,N2200,"poly{[N,N′-bis(2-octyldodecyl)-naphthalene-1,4...",202.0,91.0,2.22,blade,"{'blade_angle': 90, 'environment': 'air', 'tem...",0.04615,1,1,1,2000.0,p-doped Si,50.0,SiO2,3.0,BGTC,,,,
1,670,laboratory,"{'email': 'myl220@lehigh.edu', 'last_name': 'L...",10.0,1.0,7964,chlorobenzene,N2200,"poly{[N,N′-bis(2-octyldodecyl)-naphthalene-1,4...",202.0,91.0,2.22,blade,"{'blade_angle': 90, 'environment': 'air', 'tem...",0.04,1,1,1,2000.0,p-doped Si,50.0,SiO2,3.0,BGTC,,,,
2,125,literature,"{'doi': '10.1002/adma.201102786', 'publication...",10.0,1.0,7239,"1,2-dichlorobenzene",DPP-DTT,"poly[2,5-(2-octyldodecyl)-3,6-diketopyrrolopyr...",193.5,50.0,3.87,spin,{'environment': 'nitrogen'},0.02,0,1,1,10000.0,glass,20.0,,,TGBC,PMMA,550.0,,
3,661,laboratory,"{'email': 'myl220@lehigh.edu', 'last_name': 'L...",1.0,1.0,7964,chlorobenzene,N2200,"poly{[N,N′-bis(2-octyldodecyl)-naphthalene-1,4...",202.0,91.0,2.22,blade,"{'blade_angle': 90, 'environment': 'air', 'tem...",0.013149,1,1,1,2000.0,p-doped Si,50.0,SiO2,3.0,BGTC,,,,
4,24,literature,"{'doi': '10.1038/srep00754 ', 'publication_typ...",5.0,1.0,7964,chlorobenzene,DPP-DTT,"poly[2,5-(2-octyldodecyl)-3,6-diketopyrrolopyr...",501.0,110.0,4.55,spin,{'environment': 'air'},1.35,0,1,1,4000.0,n-doped Si,100.0,,,BGTC,SiO2,200.0,,
5,25,literature,"{'doi': '10.1038/srep00754 ', 'publication_typ...",5.0,1.0,7964,chlorobenzene,DPP-DTT,"poly[2,5-(2-octyldodecyl)-3,6-diketopyrrolopyr...",501.0,110.0,4.55,spin,{'environment': 'air'},1.75,0,1,1,4000.0,n-doped Si,125.0,,,BGTC,SiO2,200.0,,
6,26,literature,"{'doi': '10.1038/srep00754 ', 'publication_typ...",5.0,1.0,7964,chlorobenzene,DPP-DTT,"poly[2,5-(2-octyldodecyl)-3,6-diketopyrrolopyr...",501.0,110.0,4.55,spin,{'environment': 'air'},2.25,0,1,1,1400.0,n-doped Si,30.0,,,BGBC,SiO2,300.0,,
7,27,literature,"{'doi': '10.1038/srep00754 ', 'publication_typ...",5.0,1.0,7964,chlorobenzene,DPP-DTT,"poly[2,5-(2-octyldodecyl)-3,6-diketopyrrolopyr...",501.0,110.0,4.55,spin,{'environment': 'air'},2.55,0,1,1,1400.0,n-doped Si,40.0,,,BGBC,SiO2,300.0,,
8,28,literature,"{'doi': '10.1038/srep00754 ', 'publication_typ...",5.0,1.0,7964,chlorobenzene,DPP-DTT,"poly[2,5-(2-octyldodecyl)-3,6-diketopyrrolopyr...",501.0,110.0,4.55,spin,{'environment': 'air'},2.65,0,1,1,1400.0,n-doped Si,50.0,,,BGBC,SiO2,300.0,,
9,29,literature,"{'doi': '10.1038/srep00754 ', 'publication_typ...",5.0,1.0,7964,chlorobenzene,DPP-DTT,"poly[2,5-(2-octyldodecyl)-3,6-diketopyrrolopyr...",501.0,110.0,4.55,spin,{'environment': 'air'},0.03,0,1,1,1400.0,n-doped Si,50.0,,,BGBC,SiO2,300.0,,


In [4]:

# Drop unnecessary columns that won't be used for modeling
columns_to_drop = ['film_deposition_params', 'citation_type', 'experiment_meta', 'solvent_vol_frac', 'solvent_iupac_name', 'polymer_iupac_name', 'dielectric_material_2', 'dielectric_thickness_2', 'dielectric_1_material', 'dielectric_1_thickness']
result_df = result_df.drop(columns=columns_to_drop)

# Display the resulting DataFrame
result_df


Unnamed: 0,sample_id,concentration,solvent_pubchem_cid,polymer_common_name,polymer_mw,polymer_mn,polymer_dispersity,film_deposition_type,electron_mobility,solution_treatment,substrate_pretreatment,post_process,channel_width,gate_material,channel_length,electrode_configuration,dielectric_material,dielectric_thickness
0,666,6.0,7964,N2200,202.0,91.0,2.22,blade,0.04615,1,1,1,2000.0,p-doped Si,50.0,BGTC,,
1,670,10.0,7964,N2200,202.0,91.0,2.22,blade,0.04,1,1,1,2000.0,p-doped Si,50.0,BGTC,,
2,125,10.0,7239,DPP-DTT,193.5,50.0,3.87,spin,0.02,0,1,1,10000.0,glass,20.0,TGBC,PMMA,550.0
3,661,1.0,7964,N2200,202.0,91.0,2.22,blade,0.013149,1,1,1,2000.0,p-doped Si,50.0,BGTC,,
4,24,5.0,7964,DPP-DTT,501.0,110.0,4.55,spin,1.35,0,1,1,4000.0,n-doped Si,100.0,BGTC,SiO2,200.0
5,25,5.0,7964,DPP-DTT,501.0,110.0,4.55,spin,1.75,0,1,1,4000.0,n-doped Si,125.0,BGTC,SiO2,200.0
6,26,5.0,7964,DPP-DTT,501.0,110.0,4.55,spin,2.25,0,1,1,1400.0,n-doped Si,30.0,BGBC,SiO2,300.0
7,27,5.0,7964,DPP-DTT,501.0,110.0,4.55,spin,2.55,0,1,1,1400.0,n-doped Si,40.0,BGBC,SiO2,300.0
8,28,5.0,7964,DPP-DTT,501.0,110.0,4.55,spin,2.65,0,1,1,1400.0,n-doped Si,50.0,BGBC,SiO2,300.0
9,29,5.0,7964,DPP-DTT,501.0,110.0,4.55,spin,0.03,0,1,1,1400.0,n-doped Si,50.0,BGBC,SiO2,300.0


We are now going to do something called one hot encoding to convert columns with textual information into numbers (1 and 0)

In [5]:
# List of columns to one-hot encode
columns_to_one_hot_encode = ['film_deposition_type', 'gate_material', 'dielectric_material', 'electrode_configuration']

# Perform one-hot encoding
result_df = pd.get_dummies(result_df, columns=columns_to_one_hot_encode)

# Display the resulting DataFrame with one-hot encoding
result_df


Unnamed: 0,sample_id,concentration,solvent_pubchem_cid,polymer_common_name,polymer_mw,polymer_mn,polymer_dispersity,electron_mobility,solution_treatment,substrate_pretreatment,post_process,channel_width,channel_length,dielectric_thickness,film_deposition_type_blade,film_deposition_type_drop,film_deposition_type_inkjet,film_deposition_type_spin,film_deposition_type_water,gate_material_Al,gate_material_Au,gate_material_Cu,gate_material_PET,gate_material_glass,gate_material_n-doped Si,gate_material_p-doped Si,dielectric_material_CYTOP,dielectric_material_PAN,dielectric_material_PMMA,dielectric_material_PTrFE,dielectric_material_PVP,dielectric_material_Shellac,dielectric_material_Si3N4,dielectric_material_SiO2,electrode_configuration_BGBC,electrode_configuration_BGTC,electrode_configuration_TGBC
0,666,6.0,7964,N2200,202.0,91.0,2.22,0.04615,1,1,1,2000.0,50.0,,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0
1,670,10.0,7964,N2200,202.0,91.0,2.22,0.04,1,1,1,2000.0,50.0,,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0
2,125,10.0,7239,DPP-DTT,193.5,50.0,3.87,0.02,0,1,1,10000.0,20.0,550.0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1
3,661,1.0,7964,N2200,202.0,91.0,2.22,0.013149,1,1,1,2000.0,50.0,,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0
4,24,5.0,7964,DPP-DTT,501.0,110.0,4.55,1.35,0,1,1,4000.0,100.0,200.0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0
5,25,5.0,7964,DPP-DTT,501.0,110.0,4.55,1.75,0,1,1,4000.0,125.0,200.0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0
6,26,5.0,7964,DPP-DTT,501.0,110.0,4.55,2.25,0,1,1,1400.0,30.0,300.0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0
7,27,5.0,7964,DPP-DTT,501.0,110.0,4.55,2.55,0,1,1,1400.0,40.0,300.0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0
8,28,5.0,7964,DPP-DTT,501.0,110.0,4.55,2.65,0,1,1,1400.0,50.0,300.0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0
9,29,5.0,7964,DPP-DTT,501.0,110.0,4.55,0.03,0,1,1,1400.0,50.0,300.0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0


Now lets see all the columns in this database 

In [10]:
column_names = result_df.columns
print(column_names)

Index(['sample_id', 'concentration', 'solvent_pubchem_cid',
       'polymer_common_name', 'polymer_mw', 'polymer_mn', 'polymer_dispersity',
       'electron_mobility', 'solution_treatment', 'substrate_pretreatment',
       'post_process', 'channel_width', 'channel_length',
       'dielectric_thickness', 'film_deposition_type_blade',
       'film_deposition_type_drop', 'film_deposition_type_inkjet',
       'film_deposition_type_spin', 'film_deposition_type_water',
       'dielectric_material_CYTOP', 'dielectric_material_PAN',
       'dielectric_material_PMMA', 'dielectric_material_PTrFE',
       'dielectric_material_PVP', 'dielectric_material_Shellac',
       'dielectric_material_Si3N4', 'dielectric_material_SiO2',
       'electrode_configuration_BGBC', 'electrode_configuration_BGTC',
       'electrode_configuration_TGBC', 'gate_material_Si',
       'gate_material_Other'],
      dtype='object')


### Consolidating Descriptors

We are going to consolidate some of the descriptors into one column. 


Coating :

* film_deposition_type_MGC (dip,Dip,blade, inkjet, shear, wire) - value of 1 if any of these columns are true or else 0
* film_deposition_type_spin
* film_deposition_type_drop

Gate Material :

* gate_material_n_doped Si = ('gate_material_n-doped Si', 'gate_material_Si','gate_material_p-doped Si') 

* gate_material_other = ('gate_material_Al', 'gate_material_Au', 'gate_material_PEDOT:PSS', 'gate_material_PET','gate_material_glass')


Dielectric Material :

* dielectric_material_SiO2
* dielectric_material_other = (
        'dielectric_material_6FDA-DABC',
       'dielectric_material_CYTOP', 'dielectric_material_PAN',
       'dielectric_material_PMMA', 'dielectric_material_PTrFE',
       'dielectric_material_PVP', 'dielectric_material_Shellac',
       'dielectric_material_Si3N4')




In [9]:

#consolidating gate material columns
gate_material_Si_to_consolidate = ['gate_material_n-doped Si','gate_material_p-doped Si']
gate_material_other_to_consolidate = ['gate_material_Al', 'gate_material_Au', 'gate_material_Cu', 'gate_material_PET','gate_material_glass']
# Create new columns
result_df['gate_material_Si'] = result_df[gate_material_Si_to_consolidate].max(axis=1)
result_df['gate_material_Other'] = result_df[gate_material_other_to_consolidate].max(axis=1)
# Drop the original columns
result_df.drop(columns=gate_material_Si_to_consolidate, inplace=True)
result_df.drop(columns=gate_material_other_to_consolidate, inplace=True)


#consolidating coating columns
MGC_columns_to_consolidate = ['film_deposition_type_blade',
                           'film_deposition_type_inkjet']
# Create new columns
result_df['film_deposition_type_MGC'] = result_df[MGC_columns_to_consolidate].max(axis=1)
# Drop the original columns
result_df.drop(columns=MGC_columns_to_consolidate, inplace=True)


#consolidating dielectric columns
dielectric_material_columns_to_consolidate = ['dielectric_material_CYTOP','dielectric_material_PAN',
       'dielectric_material_PMMA', 'dielectric_material_PTrFE',
       'dielectric_material_Shellac', 'dielectric_material_Si3N4',
       'dielectric_material_SiO2']
# Create new columns
result_df['dielectric_material_other'] = result_df[dielectric_material_columns_to_consolidate].max(axis=1)
# Drop the original columns
result_df.drop(columns=dielectric_material_columns_to_consolidate, inplace=True)


KeyError: "None of [Index(['gate_material_n-doped Si', 'gate_material_p-doped Si'], dtype='object')] are in the [columns]"

In [14]:
result_df

Unnamed: 0,sample_id,concentration,solvent_pubchem_cid,polymer_common_name,polymer_mw,polymer_mn,polymer_dispersity,electron_mobility,solution_treatment,substrate_pretreatment,post_process,channel_width,channel_length,dielectric_thickness,film_deposition_type_blade,film_deposition_type_drop,film_deposition_type_inkjet,film_deposition_type_spin,film_deposition_type_water,gate_material_Al,gate_material_Au,gate_material_Cu,gate_material_PET,gate_material_glass,gate_material_n-doped Si,gate_material_p-doped Si,dielectric_material_CYTOP,dielectric_material_PAN,dielectric_material_PMMA,dielectric_material_PTrFE,dielectric_material_PVP,dielectric_material_Shellac,dielectric_material_Si3N4,dielectric_material_SiO2,electrode_configuration_BGBC,electrode_configuration_BGTC,electrode_configuration_TGBC
0,666,6.0,7964,N2200,202.0,91.0,2.22,0.04615,1,1,1,2000.0,50.0,,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0
1,670,10.0,7964,N2200,202.0,91.0,2.22,0.04,1,1,1,2000.0,50.0,,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0
2,125,10.0,7239,DPP-DTT,193.5,50.0,3.87,0.02,0,1,1,10000.0,20.0,550.0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1
3,661,1.0,7964,N2200,202.0,91.0,2.22,0.013149,1,1,1,2000.0,50.0,,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0
4,24,5.0,7964,DPP-DTT,501.0,110.0,4.55,1.35,0,1,1,4000.0,100.0,200.0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0
5,25,5.0,7964,DPP-DTT,501.0,110.0,4.55,1.75,0,1,1,4000.0,125.0,200.0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0
6,26,5.0,7964,DPP-DTT,501.0,110.0,4.55,2.25,0,1,1,1400.0,30.0,300.0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0
7,27,5.0,7964,DPP-DTT,501.0,110.0,4.55,2.55,0,1,1,1400.0,40.0,300.0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0
8,28,5.0,7964,DPP-DTT,501.0,110.0,4.55,2.65,0,1,1,1400.0,50.0,300.0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0
9,29,5.0,7964,DPP-DTT,501.0,110.0,4.55,0.03,0,1,1,1400.0,50.0,300.0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0


In [15]:
## replacing the pubchem_cid with solvent boiling point

# Get unique PubChem CIDs from the 'solvent_pubchem_cid' column
unique_pubchem_cids = result_df['solvent_pubchem_cid'].unique()

# Display the unique PubChem CIDs
print(unique_pubchem_cids)



[7964 7239 6212 6591 1140 7809 7003 7947]


In [16]:
# Dictionary mapping PubChem CIDs to boiling points
boiling_point_dict = {
    7964: 132,
    6212: 62,
    7239: 180.1,
    6591: 146,
    7809: 138,
    1140: 111,
    7947: 164.7,
    7003: 259.3
}

# Add a new column "solvent_boiling_point" based on PubChem CIDs
result_df['solvent_boiling_point'] = result_df['solvent_pubchem_cid'].map(boiling_point_dict)


# Drop unnecessary columns that won't be used for modeling
columns_to_drop = ['solvent_pubchem_cid']
result_df = result_df.drop(columns=columns_to_drop)

# Display the resulting DataFrame
result_df


Unnamed: 0,sample_id,concentration,polymer_common_name,polymer_mw,polymer_mn,polymer_dispersity,electron_mobility,solution_treatment,substrate_pretreatment,post_process,channel_width,channel_length,dielectric_thickness,film_deposition_type_blade,film_deposition_type_drop,film_deposition_type_inkjet,film_deposition_type_spin,film_deposition_type_water,gate_material_Al,gate_material_Au,gate_material_Cu,gate_material_PET,gate_material_glass,gate_material_n-doped Si,gate_material_p-doped Si,dielectric_material_CYTOP,dielectric_material_PAN,dielectric_material_PMMA,dielectric_material_PTrFE,dielectric_material_PVP,dielectric_material_Shellac,dielectric_material_Si3N4,dielectric_material_SiO2,electrode_configuration_BGBC,electrode_configuration_BGTC,electrode_configuration_TGBC,solvent_boiling_point
0,666,6.0,N2200,202.0,91.0,2.22,0.04615,1,1,1,2000.0,50.0,,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,132.0
1,670,10.0,N2200,202.0,91.0,2.22,0.04,1,1,1,2000.0,50.0,,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,132.0
2,125,10.0,DPP-DTT,193.5,50.0,3.87,0.02,0,1,1,10000.0,20.0,550.0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,1,180.1
3,661,1.0,N2200,202.0,91.0,2.22,0.013149,1,1,1,2000.0,50.0,,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,132.0
4,24,5.0,DPP-DTT,501.0,110.0,4.55,1.35,0,1,1,4000.0,100.0,200.0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,132.0
5,25,5.0,DPP-DTT,501.0,110.0,4.55,1.75,0,1,1,4000.0,125.0,200.0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,132.0
6,26,5.0,DPP-DTT,501.0,110.0,4.55,2.25,0,1,1,1400.0,30.0,300.0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0,132.0
7,27,5.0,DPP-DTT,501.0,110.0,4.55,2.55,0,1,1,1400.0,40.0,300.0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0,132.0
8,28,5.0,DPP-DTT,501.0,110.0,4.55,2.65,0,1,1,1400.0,50.0,300.0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0,132.0
9,29,5.0,DPP-DTT,501.0,110.0,4.55,0.03,0,1,1,1400.0,50.0,300.0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,1,0,0,132.0


In [17]:
column_names = result_df.columns
print(column_names)

Index(['sample_id', 'concentration', 'polymer_common_name', 'polymer_mw',
       'polymer_mn', 'polymer_dispersity', 'electron_mobility',
       'solution_treatment', 'substrate_pretreatment', 'post_process',
       'channel_width', 'channel_length', 'dielectric_thickness',
       'film_deposition_type_blade', 'film_deposition_type_drop',
       'film_deposition_type_inkjet', 'film_deposition_type_spin',
       'film_deposition_type_water', 'gate_material_Al', 'gate_material_Au',
       'gate_material_Cu', 'gate_material_PET', 'gate_material_glass',
       'gate_material_n-doped Si', 'gate_material_p-doped Si',
       'dielectric_material_CYTOP', 'dielectric_material_PAN',
       'dielectric_material_PMMA', 'dielectric_material_PTrFE',
       'dielectric_material_PVP', 'dielectric_material_Shellac',
       'dielectric_material_Si3N4', 'dielectric_material_SiO2',
       'electrode_configuration_BGBC', 'electrode_configuration_BGTC',
       'electrode_configuration_TGBC', 'solvent_boilin

In [18]:
#result_df_P3HT = result_df[result_df['polymer_common_name'] == 'P3HT']
#result_df_DPP_DTT = result_df[result_df['polymer_common_name'] != 'P3HT']
result_df_N2200 = result_df[result_df['polymer_common_name'] == 'N2200']

### P3HT Modeling

In [19]:
result_df_P3HT = result_df_P3HT.drop(columns='polymer_common_name')
result_df_P3HT

NameError: name 'result_df_P3HT' is not defined

In [None]:
num_rows, num_columns = result_df_P3HT.shape
print(f"Number of rows: {num_rows}")
print(f"Number of columns: {num_columns}")


### N2200 Modeling

In [20]:
result_df_N2200 = result_df_N2200.drop(columns='polymer_common_name')
result_df_N2200

Unnamed: 0,sample_id,concentration,polymer_mw,polymer_mn,polymer_dispersity,electron_mobility,solution_treatment,substrate_pretreatment,post_process,channel_width,channel_length,dielectric_thickness,film_deposition_type_blade,film_deposition_type_drop,film_deposition_type_inkjet,film_deposition_type_spin,film_deposition_type_water,gate_material_Al,gate_material_Au,gate_material_Cu,gate_material_PET,gate_material_glass,gate_material_n-doped Si,gate_material_p-doped Si,dielectric_material_CYTOP,dielectric_material_PAN,dielectric_material_PMMA,dielectric_material_PTrFE,dielectric_material_PVP,dielectric_material_Shellac,dielectric_material_Si3N4,dielectric_material_SiO2,electrode_configuration_BGBC,electrode_configuration_BGTC,electrode_configuration_TGBC,solvent_boiling_point
0,666,6.0,202.0,91.0,2.22,0.04615,1,1,1,2000.0,50.0,,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,132.0
1,670,10.0,202.0,91.0,2.22,0.04,1,1,1,2000.0,50.0,,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,132.0
3,661,1.0,202.0,91.0,2.22,0.013149,1,1,1,2000.0,50.0,,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,132.0
25,179,7.0,23.0,12.0,1.9,0.38,0,0,1,500.0,50.0,650.0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,146.0
26,180,3.0,23.0,12.0,1.9,0.64,0,0,1,500.0,50.0,650.0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,146.0
27,181,1.0,23.0,12.0,1.9,0.0005,0,0,1,500.0,50.0,650.0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,146.0
28,182,0.5,23.0,12.0,1.9,0.0001,0,0,1,500.0,50.0,650.0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,146.0
29,183,8.0,72.0,22.5,3.2,0.25,0,0,1,500.0,50.0,650.0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,180.1
30,184,7.0,72.0,22.5,3.2,0.3,0,0,1,500.0,50.0,650.0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,146.0
31,185,3.0,72.0,22.5,3.2,0.275,0,0,1,500.0,50.0,650.0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,146.0


In [21]:
num_rows, num_columns = result_df_N2200.shape
print(f"Number of rows: {num_rows}")
print(f"Number of columns: {num_columns}")

Number of rows: 104
Number of columns: 36


In [22]:
result_df_N2200.to_excel('result_df_N2200_test.xlsx', index=True, header=True)