# Design a bigger dataset of queries by using data augmentation

We apply two kinds of changes:
*  we change some filter conditions
*  we (exhaustively) change the attribute in the MIN aggregation, so that for each query every relation is represented once by one of its attributes in the new dataset

Initial dataset size: 229
*  STATS: 146
*  SNAP: 40
*  JOB: 15
*  LSQB: 2
*  HETIO: 26

#### Augmentation: Filter (+ changing connections for HETIO)
We manually create two variants of each STATS and JOB query by changing different filters. (SNAP and LSQB queries do not have any filters.)   
(The new queries are named "query"-augF1 and "query"-augF2, where "query" is the name of the original query.)   
6 STATS queries have only one filter, so for these we create only one new query, "query"-augF1. This is the case for:
*  STATS: 024-017
*  STATS: 025-001
*  STATS: 096-095
*  STATS: 100-005
*  STATS: 111-056
*  STATS: 143-126

In most cases we change the constants in the $>, <, \geq, \leq$ conditions.
In some cases we also change equality conditions; new string values are obtained as in the following example:

![Example Image](images/imdb_random_keyword.PNG)

For the HETIO dataset we create 12 new queries, 4 of which have filters (queries 5-8). For these we again change the filters twice.  
For the other 8 queries we augment by replacing one connection in the graph with another connection between the same two nodes   
(e.g. we replace upregulates with downregulates between disease and gene).

![Example Image2](images/hetio_graph.PNG)

We get multiple variants for those 8 queries:
*  2 variants for queries: 12, 13
*  3 variants for queries: 10, 11, 14, 16
*  9 variants for queries: 9, 15
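
The connection replacement can be pictured as swapping the edge table in the FROM clause. The following is only a minimal sketch of that idea, since the queries were edited by hand. The example query and the column name `nid` are hypothetical; `sid` and `tid` are the edge columns reported by the schema lookup later in this notebook.

```python
import re

def swap_connection(query: str, old_edge: str, new_edge: str) -> str:
    """Replace the edge table `old_edge` by `new_edge`, matching whole words only."""
    return re.sub(rf"\b{re.escape(old_edge)}\b", new_edge, query)

# Hypothetical query: disease -upregulates-> gene
query = (
    "SELECT MIN(d.nid) FROM disease d, upregulates u, gene g "
    "WHERE d.nid = u.sid AND u.tid = g.nid"
)

variant = swap_connection(query, "upregulates", "downregulates")
print(variant)
```

Matching on word boundaries keeps a table such as `regulates` from being rewritten inside `upregulates` or `downregulates`.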

This gives a new dataset of size 591:
*  STATS: 140 * 3 + 6 * 2 = 432
*  SNAP: 40
*  JOB: 15 * 3 = 45
*  LSQB: 2
*  HETIO: 26 (not changed) + 4 * 3 (filter) + 2 * 2 + 4 * 3 + 2 * 9 = 72
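
The totals can be double-checked with a few lines of arithmetic. This sketch only restates the numbers listed above; the dictionary names are chosen for illustration.

```python
# Dataset sizes before the filter augmentation
original = {"STATS": 146, "SNAP": 40, "JOB": 15, "LSQB": 2, "HETIO": 26}

# Dataset sizes after the filter augmentation (original query plus its variants)
augmented = {
    "STATS": 140 * 3 + 6 * 2,                         # 140 queries with two variants, 6 with one
    "SNAP": 40,                                       # no filters, unchanged
    "JOB": 15 * 3,                                    # every query gets two variants
    "LSQB": 2,                                        # no filters, unchanged
    "HETIO": 26 + 4 * 3 + 2 * 2 + 4 * 3 + 2 * 9,      # unchanged + filter queries + connection variants
}

print(sum(original.values()))   # 229
print(augmented)
print(sum(augmented.values()))  # 591
```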

#### Augmentation: Change aggregate-attribute
Now we take this new dataset and change the attribute in the MIN aggregate so that, for each relation of a query, there is one query that aggregates over an attribute of that relation. The size of the new dataset is therefore the sum of the number of relations over all queries.   
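
To make the counting concrete, here is a small sketch of how the relations of one query can be read from its FROM clause. It mirrors the parsing used in the cells of this notebook; the helper name `relations_of` and the example query are made up for illustration, and the sketch assumes a comma-separated FROM clause followed by a WHERE clause.

```python
def relations_of(query: str) -> dict:
    """Map each alias in the FROM clause to its table name."""
    upper = query.upper()
    from_index = upper.find("FROM")
    where_index = upper.find("WHERE")
    relations = {}
    for relation in query[from_index + 4:where_index].split(","):
        parts = relation.split()
        # the first token is the table name, the last token is the alias
        relations[parts[-1]] = parts[0]
    return relations

query = (
    "SELECT MIN(u.Id) FROM users as u, badges as b, posts as p "
    "WHERE u.Id = b.UserId AND u.Id = p.OwnerUserId"
)
relations = relations_of(query)
print(relations)
print(len(relations))
```

This query has three relations, so it contributes three queries to the final dataset: the original one aggregating over `u` and one new query each for `b` and `p`.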

In [5]:
import re

# Define input and output file paths
input_file = 'scala_commands_augment_filter.txt'

queries_stats = 0
queries_snap = 0
queries_job = 0
queries_lsqb = 0
queries_hetio = 0

# Open the input file
with open(input_file, 'r') as f_input:
    # Read input file line by line
    for line in f_input:
        # Split each line into components
        pattern = r'(?<!\\)\"|\"(?<!\\)(?=\s+\"|$)'
        components = re.split(pattern, line)
        
        # Extract relevant information
        benchmark = components[3]
        number = components[5]
        query = components[1].strip()

        query_upper = query.upper()
        from_index = query_upper.find("FROM")
        where_index = query_upper.find("WHERE")
        number_of_relations = query[from_index:where_index].count(",") + 1

        if benchmark == "STATS":
            queries_stats += number_of_relations
        elif benchmark == "SNAP":
            queries_snap += number_of_relations
        elif benchmark == "JOB":
            queries_job += number_of_relations
        elif benchmark == "LSQB":
            queries_lsqb += number_of_relations
        elif benchmark == "HETIO":
            queries_hetio += number_of_relations
        else:
            print("other benchmark?")

print("STATS:", queries_stats)
print("SNAP:", queries_snap)
print("JOB:", queries_job)
print("LSQB:", queries_lsqb)
print("HETIO:", queries_hetio)
print("This gives us", queries_stats + queries_snap + queries_job + queries_lsqb + queries_hetio, "queries in total")

STATS: 1876
SNAP: 244
JOB: 264
LSQB: 14
HETIO: 538
This gives us 2936 queries in total


In [1]:
%%bash
pip install psycopg2-binary





In [2]:
import psycopg2
import re

In [3]:
# Define input and output file paths
input_file = 'scala_commands_augment_filter.txt'
output_file = 'scala_commands_augment_filter_agg.txt'

# Open input and output files
with open(input_file, 'r') as f_input, open(output_file, 'w', newline='') as f_output:
    
    # Read input file line by line
    for line in f_input:
        # Split each line into components
        pattern = r'(?<!\\)\"|\"(?<!\\)(?=\s+\"|$)'
        components = re.split(pattern, line)
        
        benchmark = components[3]
        number = components[5]
        query = components[1].strip()
        
        query_upper = query.upper()
        from_index = query_upper.find("FROM")
        where_index = query_upper.find("WHERE")
        relations_list = query[from_index+4:where_index].split(",")
        relations = {relation.strip().rsplit(maxsplit=1)[-1]: relation.strip().split(maxsplit=1)[0] for relation in relations_list}

        # find the alias used in the current MIN aggregate and drop its relation,
        # so that new aggregates are only created for the remaining relations
        min_index = query_upper.find("MIN")
        agg = query[min_index+4:from_index-2].strip().split(".")[0].strip()
        relations = {key: value for key, value in relations.items() if key != agg}
    
        if benchmark == "JOB":
            database = "imdb"
        else:
            database = benchmark.lower()
        conn = psycopg2.connect(
            host="postgres",
            database=database,
            user=database,
            password=database
        )
        cur = conn.cursor()

        # get one column name for each remaining relation
        new_aggs = []
        for key, value in relations.items():
            # pass the table name as a query parameter instead of formatting it into the SQL string
            query_col = "SELECT column_name FROM information_schema.columns WHERE table_name = %s"
            cur.execute(query_col, (value.lower(),))
            column = cur.fetchone()[0]
            new_aggs.append(key + "." + column)
        cur.close()
        conn.close()

        # replace the MIN-aggregate with the new agg (one new query for each relation)
        i = 1
        f_output.write(line)
        for new_agg in new_aggs:
            result = re.sub(r'MIN\([^)]*\)', "MIN(" + new_agg + ")", line)
            result = result[:-2] + "-augA" + str(i) + '"' + "\n"
            f_output.write(result)
            i += 1
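
The rewriting step of the cell above can be tried without a database connection. The sketch below isolates it in a hypothetical helper `aggregate_variants`; the example line, the query name `001-001` and the column `b.Id` are invented for illustration.

```python
import re

def aggregate_variants(line: str, new_aggs: list) -> list:
    """Return the original line followed by one variant per new aggregate attribute."""
    variants = [line]
    for i, new_agg in enumerate(new_aggs, start=1):
        result = re.sub(r'MIN\([^)]*\)', "MIN(" + new_agg + ")", line)
        # drop the closing quote and the newline, then append the suffix to the query name
        result = result[:-2] + "-augA" + str(i) + '"' + "\n"
        variants.append(result)
    return variants

line = '"SELECT MIN(u.Id) FROM users as u, badges as b WHERE u.Id = b.UserId" "STATS" "001-001"\n'
variants = aggregate_variants(line, ["b.Id"])
for variant in variants:
    print(variant, end="")
```

Note that `result[:-2]` assumes every line ends with a closing quote followed by a newline; a last line without a trailing newline would lose one character of the query name.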

In [52]:
df = pd.read_csv("results/features_times.csv")

In [53]:
df[df["bench"] != "SNAP"]["#relations"].sum()

734

In [54]:
203-163

40

In [55]:
df[df["#filters"] == 0]

Unnamed: 0,bench,query,orig/rewr(mean),orig/rewr+rewr(mean),orig mean,rewr mean,rewr mean+rewr,diff rewr-orig,diff rewr+rewr-orig,#relations,...,min(branching factors),max(branching factors),mean(branching factors),median(branching factors),balancedness factor,list table rows,list join rows,container counts list,branching factors list,text
146,SNAP,dblp-path02,rewr,rewr,2.294022,0.870167,2.119866,-1.423855,-0.174155,3,...,2,2,2.0,2.0,1.0,"[437444, 437444, 437444]","[17539780, 2769959]","[1, 1, 1, 1, 2]",[2],"select MIN(p1.toNode) from dblp p1, dblp p2, d..."
147,SNAP,dblp-path03,rewr,rewr,30.795075,1.407008,2.529921,-29.388067,-28.265154,4,...,1,2,1.5,1.5,1.0,"[437444, 437444, 1049866, 1049866]","[111064404, 2769959, 6647902]","[1, 1, 1, 1, 1, 1, 2]","[1, 2]","select MIN(p1.toNode) from dblp p1, dblp p2, d..."
148,SNAP,dblp-path04,rewr,rewr,387.542229,1.62427,2.630539,-385.917959,-384.91169,5,...,1,2,1.5,1.5,1.0,"[437444, 437444, 437444, 1049866, 1049866]","[703275764, 17539780, 2769959, 6647902]","[1, 1, 1, 1, 1, 1, 1, 1, 2]","[1, 2]","select MIN(p1.toNode) from dblp p1, dblp p2, d..."
149,SNAP,dblp-path05,rewr,rewr,1800.0,1.857328,2.905927,-1798.142672,-1797.094073,6,...,1,2,1.5,1.5,1.0,"[437444, 437444, 437444, 1049866, 1049866, 104...","[4453243180, 17539780, 2769959, 42095471, 6647...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2]","[1, 2]","select MIN(p1.toNode) from dblp p1, dblp p2, d..."
150,SNAP,dblp-path06,rewr,rewr,1800.0,2.054539,3.08787,-1797.945461,-1796.91213,7,...,1,2,1.5,1.5,1.0,"[437444, 437444, 1049866, 1049866, 1049866, 10...","[28198575625, 111064404, 2769959, 6647902, 420...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2]","[1, 2]","select MIN(p1.toNode) from dblp p1, dblp p2, d..."
151,SNAP,dblp-path07,rewr,rewr,1800.0,2.245325,3.223917,-1797.754675,-1796.776083,8,...,1,2,1.5,1.5,1.0,"[437444, 437444, 1049866, 1049866, 1049866, 10...","[178557432222, 111064404, 2769959, 6647902, 26...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2]","[1, 2]","select MIN(p1.toNode) from dblp p1, dblp p2, d..."
152,SNAP,dblp-path08,rewr,rewr,1800.0,2.513097,3.493093,-1797.486903,-1796.506907,9,...,1,2,1.5,1.5,1.0,"[437444, 437444, 437444, 1049866, 1049866, 104...","[1130651314648, 703275764, 17539780, 2769959, ...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...","[1, 2]","select MIN(p1.toNode) from dblp p1, dblp p2, d..."
153,SNAP,dblp-tree01,rewr,rewr,1435.961778,1.863914,2.910837,-1434.097864,-1433.050941,5,...,2,2,2.0,2.0,0.666667,"[437444, 1049866, 1049866, 1049866, 1049866]","[703275764, 17539780, 2769959, 6647902]","[1, 1, 1, 1, 1, 2, 3]","[2, 2]","SELECT MIN(p1.toNode) FROM dblp p1, dblp p2, d..."
154,SNAP,dblp-tree02,rewr,rewr,1800.0,3.441422,4.450278,-1796.558578,-1795.549722,6,...,2,3,2.5,2.5,0.625,"[437444, 1049866, 1049866, 1049866, 1049866, 1...","[4453243180, 17539780, 2769959, 42095471, 6647...","[1, 1, 1, 1, 1, 1, 1, 2, 3]","[3, 2]","select MIN(p1.toNode) from dblp p1, dblp p2a, ..."
155,SNAP,dblp-tree03,rewr,rewr,1800.0,4.976418,5.984052,-1795.023582,-1794.015948,8,...,1,3,2.0,2.0,0.625,"[437444, 437444, 1049866, 1049866, 1049866, 10...","[178557432222, 4453243180, 111064404, 17539780...","[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 3]","[1, 3, 2]","select MIN(p1.toNode) from dblp p1, dblp p2a, ..."


In [59]:
730*3

2190

In [75]:
import pandas as pd

# Read the CSV file into a DataFrame
df = pd.read_csv('results/POS_Scala_comparison_TO_augment_filter_test.csv')

# Rename columns: shift the header names by one position
# (the rows appended by run_query contain no query name, so each value sits one column to the left of its header)
df.rename(columns={'query': 'orig_mean', 'orig_mean': 'orig1', 'orig1':'orig2', 'orig2':'orig3', 'orig3':'orig4', 'orig4':'orig5', 
                   'orig5': 'orig med', 'orig med': 'orig std'}, inplace=True)

# Drop the last column (axis=1 selects columns, inplace=True modifies the DataFrame in place)
df.drop(df.columns[-1], axis=1, inplace=True)

# Write the modified DataFrame back to a new CSV file
df.to_csv('results/POS_Scala_comparison_TO_augment_filter_test.csv', index=False)


In [76]:
import pandas as pd

# Read the CSV file into a DataFrame
df = pd.read_csv('results/POS_Scala_comparison_TO_augment_filter_test.csv')

# Select the column with the mean runtimes
column = df['orig mean']

# Initialize the counters (cumulative: each one counts all values below its threshold)
count_0_01 = 0
count_0_1 = 0
count_1 = 0
count_10 = 0
count_100 = 0

# Iterate over the values in the column
for value in column:
    if value < 0.01:
        count_0_01 += 1
    if value < 0.1:
        count_0_1 += 1
    if value < 1:
        count_1 += 1
    if value < 10:
        count_10 += 1
    if value < 100:
        count_100 += 1

# Print the counts
print("Number of values < 0.01:", count_0_01)
print("Number of values < 0.1:", count_0_1)
print("Number of values < 1:", count_1)
print("Number of values < 10:", count_10)
print("Number of values < 100:", count_100)


Number of values < 0.01: 6
Number of values < 0.1: 28
Number of values < 1: 47
Number of values < 10: 61
Number of values < 100: 68


In [80]:
47-28

19

In [81]:
6+22+19+14+7+1

69

In [None]:
< 0.01: 6
[0.01, 0.1): 22
[0.1, 1): 19
[1, 10): 14
[10, 100): 7
>= 100: 1
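
The per-bin counts follow from the cumulative counts by taking differences. A short sketch of that arithmetic, using the numbers printed above and the total of 69 from the sum computed above:

```python
thresholds = [0.01, 0.1, 1, 10, 100]
cumulative = [6, 28, 47, 61, 68]  # number of values below each threshold
total = 69                        # total number of values

# first bin, then the differences of neighbouring cumulative counts, then the remainder
bins = [cumulative[0]]
bins += [upper - lower for lower, upper in zip(cumulative, cumulative[1:])]
bins.append(total - cumulative[-1])

print(bins)  # [6, 22, 19, 14, 7, 1]
```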

In [31]:
%%bash
pip install psycopg2-binary
pip install numpy
pip install pandas









Collecting pandas
  Downloading pandas-2.2.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13.0/13.0 MB 3.8 MB/s eta 0:00:00
Collecting pytz>=2020.1
  Downloading pytz-2024.1-py2.py3-none-any.whl (505 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 505.5/505.5 KB 693.7 kB/s eta 0:00:00
Collecting tzdata>=2022.7
  Downloading tzdata-2024.1-py2.py3-none-any.whl (345 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 345.4/345.4 KB 3.0 MB/s eta 0:00:00
Installing collected packages: pytz, tzdata, pandas
Successfully installed pandas-2.2.1 pytz-2024.1 tzdata-2024.1




In [34]:
import json
import time
import psycopg2
import numpy as np
import csv
import multiprocessing
import signal
import pandas as pd

In [68]:
df = pd.read_csv('scala_commands_with_MIN_augment_filter_test.txt', delimiter=' ', quotechar='"', escapechar='\\', header = None)
df.replace('\\\\"', '', regex=True, inplace=True)

#df.iloc[10][1]

In [70]:
def handler_orig(signum, frame):
    global timeout_flag_orig
    timeout_flag_orig = True
    raise Exception("Query execution of the original query > 1800s")

def handler_rewr(signum, frame):
    global timeout_flag_rewr
    timeout_flag_rewr = True
    raise Exception("Query execution of the rewritten query > 1800s")
    
def run_query(benchmark, query):

    original_query = query 
    # the JOB queries run on the imdb database; the other databases are named after their benchmark
    database = "imdb" if benchmark == "JOB" else benchmark.lower()
    conn = psycopg2.connect(
        host="postgres",
        database=database,
        user=database,
        password=database
    )

    # abort the evaluation if it takes longer than 30 min
    global timeout_flag_orig
    timeout_flag_orig = False

    print("original1")
    # the first run is a warm-up run that is also used to check for the timeout
    signal.signal(signal.SIGALRM, handler_orig) 
    signal.alarm(1800) 
    try:
        print("o1")
        cur = conn.cursor()
        print("o2")
        cur.execute(original_query)
        print("o3")
        result = cur.fetchall()
        print(result)
    except Exception as exc: 
        print(exc)
    signal.alarm(0) 

    if not timeout_flag_orig:
        list_original = []
        print("orig")
        for i in range(5):
            # execute the original query and measure the execution time
            start_time_original = time.time()
            cur.execute(original_query)
            end_time_original = time.time()
            original_time = end_time_original - start_time_original
            list_original.append(original_time)
        orig_mean = np.mean(list_original)
        orig_med = np.median(list_original)
        orig_std = np.std(list_original)
    else:
        # assumption: a timed-out query is recorded with "TO" placeholders;
        # without this branch the variables used below would be undefined
        list_original = ["TO"] * 5
        orig_mean = orig_med = orig_std = "TO"

    cur.close()
    conn.close()

    # note: the row contains no query name, only the benchmark and the timings
    list_output = [benchmark] + [orig_mean] + list_original + [orig_med, orig_std] 
    print(list_output)
    file_path = "results/POS_Scala_comparison_TO_augment_filter_test.csv"
    with open(file_path, 'a', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerow(list_output)
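
The alarm-based timeout used in `run_query` can also be wrapped in a context manager, so that the alarm is always cleared again. This is a sketch of an alternative under stated assumptions, not the code that produced the results: `QueryTimeout`, `time_limit` and `timed` are hypothetical names, and the approach only works on Unix and in the main thread, like `signal.alarm` itself.

```python
import signal
import time
from contextlib import contextmanager

class QueryTimeout(Exception):
    """Raised when the wrapped block runs longer than the limit."""

@contextmanager
def time_limit(seconds: float):
    def _raise(signum, frame):
        raise QueryTimeout(f"execution took longer than {seconds}s")

    previous = signal.signal(signal.SIGALRM, _raise)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        yield
    finally:
        # always cancel the timer and restore the previous handler
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, previous)

def timed(fn, limit: float):
    """Return the runtime of fn() in seconds, or None if it exceeds the limit."""
    start = time.perf_counter()
    try:
        with time_limit(limit):
            fn()
    except QueryTimeout:
        return None
    return time.perf_counter() - start

print(timed(lambda: time.sleep(1), 0.05))  # None, because the limit is exceeded
print(timed(lambda: None, 1.0) is not None)
```

In `run_query`, `fn` would be a function that executes the query on the cursor, and `limit` would be 1800.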

In [45]:
file_path = "results/POS_Scala_comparison_TO_augment_filter_test.csv"

names = ["bench", "query", "orig mean", "orig 1", "orig 2", "orig 3", "orig 4", "orig 5", "orig med", "orig_std"]

with open(file_path, 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(names)

In [71]:
for index, row in df.iterrows():
    run_query("HETIO", row[1])

original1
o1
o2
o3
[('D000006',)]
orig
['HETIO', 4.920053863525391, 4.430660009384155, 4.657412052154541, 5.271406412124634, 5.433137655258179, 4.807653188705444, 4.807653188705444, 0.3762511310275177]
original1
o1
o2
o3
[('D000006',)]
orig
['HETIO', 2.773213243484497, 2.5823097229003906, 2.6654515266418457, 3.2209224700927734, 2.818692207336426, 2.57869029045105, 2.6654515266418457, 0.24016986021912112]
original1
o1
o2
o3
[('D000006',)]
orig
['HETIO', 0.8279342174530029, 0.7910575866699219, 0.8605530261993408, 0.7733461856842041, 0.8472321033477783, 0.8674821853637695, 0.8472321033477783, 0.038314947370998796]
original1
o1
o2
o3
[('D000006',)]
orig
['HETIO', 5.764636898040772, 5.994003772735596, 5.801026821136475, 5.711092710494995, 5.492204904556274, 5.824856281280518, 5.801026821136475, 0.16410114863816896]
original1
o1
o2
o3
[('D000006',)]
orig
['HETIO', 3.7000794410705566, 3.545757293701172, 3.5433855056762695, 4.001932859420776, 3.710003137588501, 3.6993184089660645, 3.6993184089

In [45]:
run_query("STATS", 'SELECT MIN(pl.Id) FROM postLinks as pl, posts as p, users as u, badges as b WHERE p.Id = pl.RelatedPostId AND u.Id = p.OwnerUserId AND u.Id = b.UserId AND pl.CreationDate<=CAST(\'2011-08-17 01:23:50\' AS TIMESTAMP) AND p.Score>=-1 AND p.Score<=10 AND p.AnswerCount<=5 AND p.CommentCount=2 AND p.FavoriteCount>=0 AND p.FavoriteCount<=6 AND u.Views<=33 AND u.DownVotes>=0 AND u.CreationDate>=CAST(\'2010-08-19 17:31:36\' AS TIMESTAMP) AND u.CreationDate<=CAST(\'2014-08-06 07:23:12\' AS TIMESTAMP) AND b.Date<=CAST(\'2014-09-10 22:50:06\' AS TIMESTAMP)')

original1
o1
o2
o3
orig


In [35]:
# Open the CSV file in read mode
with open('results/POS_Scala_comparison_TO.csv', 'r') as file:
    lines = file.readlines()  # Read all lines into a list

# Exclude the last 9 lines
modified_lines = lines[:-9]

# Open the CSV file in write mode and overwrite it with the modified content
with open('results/POS_Scala_comparison_TO.csv', 'w') as file:
    file.writelines(modified_lines)


In [4]:
database = "imdb"
conn = psycopg2.connect(
    host="postgres",
    database=database,
    user=database,
    password=database
)
cur = conn.cursor()
cur.execute("SELECT DISTINCT country_code FROM company_name")
distinct_values = cur.fetchall()
print(distinct_values)
cur.close()
conn.close()

[('[ad]',), ('[ae]',), ('[af]',), ('[ag]',), ('[ai]',), ('[al]',), ('[am]',), ('[an]',), ('[ao]',), ('[ar]',), ('[as]',), ('[at]',), ('[au]',), ('[aw]',), ('[az]',), ('[ba]',), ('[bb]',), ('[bd]',), ('[be]',), ('[bf]',), ('[bg]',), ('[bh]',), ('[bi]',), ('[bj]',), ('[bl]',), ('[bm]',), ('[bn]',), ('[bo]',), ('[br]',), ('[bs]',), ('[bt]',), ('[bv]',), ('[bw]',), ('[by]',), ('[bz]',), ('[ca]',), ('[cd]',), ('[cg]',), ('[ch]',), ('[ci]',), ('[ck]',), ('[cl]',), ('[cm]',), ('[cn]',), ('[co]',), ('[cr]',), ('[cshh]',), ('[csxx]',), ('[cu]',), ('[cv]',), ('[cy]',), ('[cz]',), ('[ddde]',), ('[de]',), ('[dk]',), ('[dm]',), ('[do]',), ('[dz]',), ('[ec]',), ('[ee]',), ('[eg]',), ('[er]',), ('[es]',), ('[et]',), ('[fi]',), ('[fj]',), ('[fo]',), ('[fr]',), ('[ga]',), ('[gb]',), ('[gd]',), ('[ge]',), ('[gf]',), ('[gg]',), ('[gh]',), ('[gi]',), ('[gl]',), ('[gm]',), ('[gn]',), ('[gp]',), ('[gq]',), ('[gr]',), ('[gt]',), ('[gu]',), ('[gw]',), ('[gy]',), ('[hk]',), ('[hn]',), ('[hr]',), ('[ht]',), ('[

In [19]:
database = "imdb"
conn = psycopg2.connect(
    host="postgres",
    database=database,
    user=database,
    password=database
)
cur = conn.cursor()
cur.execute("SELECT name FROM name ORDER BY RANDOM() LIMIT 8")
distinct_values = cur.fetchall()
print(distinct_values)
cur.close()
conn.close()

[('Maler, Henri',), ('Whelan, Laura',), ('Pawlett, Jason',), ('Mach, Oldrich',), ('Sucu, Ertugrul',), ('Lacroix, Chantal',), ('Oliver, Tracy',), ('Chittella, Rahul',)]


In [109]:
database = "hetio"
conn = psycopg2.connect(
    host="postgres",
    database=database,
    user=database,
    password=database
)
cur = conn.cursor()

# Define the query to get all table names
query = """
    SELECT table_name
    FROM information_schema.tables
    WHERE table_schema = 'public'  -- Assuming all tables are in the public schema
"""

# Execute the query
cur.execute(query)

# Fetch all the table names
table_names = [row[0] for row in cur.fetchall()]

# Print the table names
print("Table Names:")
for table_name in table_names:
    print(table_name)

# Close the cursor and connection
cur.close()
conn.close()

Table Names:
molecular_function
side_effect
gene
biological_process
compound
pathway
anatomy
cellular_component
symptom
disease
pharmacologic_class
upregulates
expresses
interacts
participates
downregulates
causes
binds
regulates
associates
covaries
localizes
resembles
treats
includes
presents
palliates


In [134]:
database = "hetio"
conn = psycopg2.connect(
    host="postgres",
    database=database,
    user=database,
    password=database
)
cur = conn.cursor()
query = """
    SELECT column_name
    FROM information_schema.columns
    WHERE table_name = %s
"""
table_name = 'interacts'
cur.execute(query, (table_name,))
column_names = [row[0] for row in cur.fetchall()]
print("Column Names:")
for column_name in column_names:
    print(column_name)
cur.close()
conn.close()

Column Names:
sid
tid


In [51]:
database = "hetio"
conn = psycopg2.connect(
    host="postgres",
    database=database,
    user=database,
    password=database
)
cur = conn.cursor()
cur.execute("SELECT name FROM cellular_component ORDER BY RANDOM() LIMIT 8")
distinct_values = cur.fetchall()
print(distinct_values)
cur.close()
conn.close()

[('ER membrane insertion complex',), ('microtubule organizing center attachment site',), ('organelle membrane contact site',), ('azurophil granule lumen',), ('peroxisome',), ('SOSS complex',), ('central element',), ('eukaryotic 43S preinitiation complex',)]


In [49]:
database = "hetio"
conn = psycopg2.connect(
    host="postgres",
    database=database,
    user=database,
    password=database
)
cur = conn.cursor()
cur.execute("SELECT name FROM side_effect ORDER BY RANDOM() LIMIT 100")
distinct_values = cur.fetchall()
print(distinct_values)
cur.close()
conn.close()

[('Nail bed tenderness',), ('Fibrocystic disease',), ('Rash morbilliform',), ('Urine output',), ('Granulomatosis with polyangiitis',), ('Eye pruritus',), ('Biliary sludge',), ('Intravascular large B-cell lymphoma',), ('Yeast infection',), ('Latent tetany',), ('Urge incontinence',), ('Ecchymosis',), ('Caecitis',), ('Postoperative fever',), ('Organic brain syndrome',), ('Uricaciduria',), ('Coccydynia',), ('Incision site complication',), ('Embolic stroke',), ('Osteogenesis imperfecta',), ('COPD exacerbation',), ('Pulse irregular',), ('QRS axis abnormal',), ('Allergic oedema',), ('Binge eating',), ('Creatinine renal clearance decreased',), ('Electrocardiogram QT shortened',), ('Ground glass appearance',), ('Tardive dyskinesia',), ('Accidental overdose',), ('Blood chloride increased',), ('Tracheostomy infection',), ('Haemolytic anaemia',), ('Depressive disorder',), ('Device interaction',), ('Myasthenia gravis-like syndrome',), ('Serum ferritin decreased',), ('Bronchitis',), ('Coarctation of

In [None]:
('carpal region',), ('endocrine system',), ('serous membrane',), ('orbit of skull',), ('semen',), ('ophthalmic artery',), ('semicircular canal',), ('nervous system',)

In [1]:
%%bash
pip install pandas

Collecting pandas
  Downloading pandas-2.2.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (13.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 13.0/13.0 MB 16.8 MB/s eta 0:00:00
Collecting numpy>=1.22.4
  Downloading numpy-1.26.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (18.2 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 18.2/18.2 MB 28.0 MB/s eta 0:00:00
Collecting tzdata>=2022.7
  Downloading tzdata-2024.1-py2.py3-none-any.whl (345 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 345.4/345.4 KB 15.7 MB/s eta 0:00:00
Collecting pytz>=2020.1
  Downloading pytz-2024.1-py2.py3-none-any.whl (505 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 505.5/505.5 KB 15.8 MB/s eta 0:00:00
Installing collected packages: pytz, tzdata, numpy, pandas
Successfully installed numpy-1.26.4 pandas-2.2.2 pytz-2024.1 tzdata-2024.1




In [8]:
import os
import pandas as pd

# Define the folder path where your text files are located
folder_path = 'rewritten/'

# Get a list of all files in the folder
files = os.listdir(folder_path)

# Filter only the txt files
txt_files = [file for file in files if file.endswith('.txt')]

data = []

# Loop through each txt file and extract the query name from its file name
for file in txt_files:
    # Split the file name by underscore
    parts = file.split('_')
    
    # Extract the part between the first and last underscore
    modified_name = '_'.join(parts[1:-1])
    
    # Add the original and modified names to the list
    data.append({'Original Name': file, 'Number': modified_name})

# Create a DataFrame from the list
df = pd.DataFrame(data)

# Display the DataFrame
df.head()

Unnamed: 0,Original Name,Number
0,STATS_028-062-augA2_jointree.txt,028-062-augA2
1,STATS_110-138-augF2_jointree.txt,110-138-augF2
2,STATS_016-021-augF1_jointree.txt,016-021-augF1
3,HETIO_7-01-DaGiGpBP-augA3_jointree.txt,7-01-DaGiGpBP-augA3
4,STATS_107-104-augA1_jointree.txt,107-104-augA1
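
The name extraction from the cell above, isolated as a small function for illustration (`query_name` is a hypothetical helper; the first two file names are taken from the output above, the third is invented):

```python
def query_name(file_name: str) -> str:
    """Return the part between the first and the last underscore of a file name."""
    parts = file_name.split("_")
    return "_".join(parts[1:-1])

print(query_name("STATS_028-062-augA2_jointree.txt"))        # 028-062-augA2
print(query_name("HETIO_7-01-DaGiGpBP-augA3_jointree.txt"))  # 7-01-DaGiGpBP-augA3
print(query_name("SNAP_a_b_jointree.txt"))                   # a_b
```

Joining `parts[1:-1]` with underscores keeps query names that themselves contain underscores intact.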


In [19]:
import re
# Define input and output file paths
input_file = 'scala_commands_augment_filter_agg.txt'

data = []
# Open the input file
with open(input_file, 'r') as f_input:
    
    # Read input file line by line
    for line in f_input:
        # Split each line into components
        pattern = r'(?<!\\)\"|\"(?<!\\)(?=\s+\"|$)'
        components = re.split(pattern, line)
        
        number = components[5]
        data.append({'Number': number})
df1 = pd.DataFrame(data)
df1.shape

(2936, 1)

In [5]:
merged_df = df.merge(df1, on='Number', how='left', indicator=True)

# Keep only the values from df that are not present in df1
result = merged_df[merged_df['_merge'] == 'left_only']['Number']

result

Series([], Name: Number, dtype: object)

In [20]:
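# Mark every repeated 'Number' in df1 (all occurrences except the first)
duplicates = df1['Number'].duplicated()

# Note: the mask is computed on df1 but applied to df, so the rows printed below are
# the rows of df at the same index positions, not the duplicated entries themselves
print(df[duplicates])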

                                Original Name                 Number
827  HETIO_15-05-SpDtCdGiG-augA6_jointree.txt  15-05-SpDtCdGiG-augA6
828    STATS_097-077-augF1-augA1_jointree.txt    097-077-augF1-augA1
829          STATS_008-045-augA3_jointree.txt          008-045-augA3
830    HETIO_9-05-DdGiGpMF-augA3_jointree.txt    9-05-DdGiGpMF-augA3
831     HETIO_3-05-CdGdCtD-augA6_jointree.txt     3-05-CdGdCtD-augA6



In [21]:
duplicates = df1[df1.duplicated(subset=['Number'], keep=False)]

print(duplicates)

      Number
817  067-110
827  067-110
828  067-110
829  067-110
830  067-110
831  067-110
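
The same two consistency checks (names that occur more than once, and names that have a file but no command line) can be sketched with the standard library only. This swaps the pandas merge for `collections.Counter` and a set difference; the function names and the toy data are hypothetical.

```python
from collections import Counter

def duplicated_names(names):
    """Return every name that occurs more than once, with its number of occurrences."""
    counts = Counter(names)
    return {name: n for name, n in counts.items() if n > 1}

def missing_names(file_names, command_names):
    """Return the names that have a file but no entry in the command list."""
    return sorted(set(file_names) - set(command_names))

# Hypothetical toy data
command_names = ["001-001", "001-001-augA1", "067-110", "067-110"]
file_names = ["001-001", "001-001-augA1", "067-110", "067-110-augA1"]

print(duplicated_names(command_names))           # {'067-110': 2}
print(missing_names(file_names, command_names))  # ['067-110-augA1']
```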


In [13]:
# Define the file path
file_path = 'scala_commands_augment_filter_agg.txt'

# Initialize a counter for 'STATS' occurrences
test_count = 0

with open(file_path, 'r') as file:
    # Iterate over each line in the file
    for line in file:
        # Split the line into words using regular expression to consider word boundaries
        words = re.findall(r'\b\w+\b', line)
        # Count the occurrences of 'STATS' in the words of the line
        test_count += words.count('STATS')

print(f"The word 'STATS' appears {test_count} times in the file.")

The word 'STATS' appears 1876 times in the file.


In [14]:
# Define the file path
file_path = 'scala_commands_augment_filter_agg.txt'

# Initialize a counter for 'SNAP' occurrences
test_count = 0

with open(file_path, 'r') as file:
    # Iterate over each line in the file
    for line in file:
        # Split the line into words using regular expression to consider word boundaries
        words = re.findall(r'\b\w+\b', line)
        # Count the occurrences of 'SNAP' in the words of the line
        test_count += words.count('SNAP')

print(f"The word 'SNAP' appears {test_count} times in the file.")

The word 'SNAP' appears 244 times in the file.


In [15]:
# Define the file path
file_path = 'scala_commands_augment_filter_agg.txt'

# Initialize a counter for 'LSQB' occurrences
test_count = 0

with open(file_path, 'r') as file:
    # Iterate over each line in the file
    for line in file:
        # Split the line into words using regular expression to consider word boundaries
        words = re.findall(r'\b\w+\b', line)
        # Count the occurrences of 'LSQB' in the words of the line
        test_count += words.count('LSQB')

print(f"The word 'LSQB' appears {test_count} times in the file.")

The word 'LSQB' appears 14 times in the file.


In [16]:
# Define the file path
file_path = 'scala_commands_augment_filter_agg.txt'

# Initialize a counter for 'JOB' occurrences
test_count = 0

with open(file_path, 'r') as file:
    # Iterate over each line in the file
    for line in file:
        # Split the line into words using regular expression to consider word boundaries
        words = re.findall(r'\b\w+\b', line)
        # Count the occurrences of 'JOB' in the words of the line
        test_count += words.count('JOB')

print(f"The word 'JOB' appears {test_count} times in the file.")

The word 'JOB' appears 264 times in the file.


In [17]:
# Define the file path
file_path = 'scala_commands_augment_filter_agg.txt'

# Initialize a counter for 'HETIO' occurrences
test_count = 0

with open(file_path, 'r') as file:
    # Iterate over each line in the file
    for line in file:
        # Split the line into words using regular expression to consider word boundaries
        words = re.findall(r'\b\w+\b', line)
        # Count the occurrences of 'HETIO' in the words of the line
        test_count += words.count('HETIO')

print(f"The word 'HETIO' appears {test_count} times in the file.")

The word 'HETIO' appears 538 times in the file.


In [18]:
1876+244+14+264+538

2936
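
The five counting cells above can be collapsed into a single pass that counts the benchmark field of each line instead of searching for the word anywhere in the line. This is a sketch under the assumption that every line has the form `"query" "benchmark" "name"`, as in the parsing cells of this notebook; `count_benchmarks` and the example lines are hypothetical.

```python
import re
from collections import Counter

# same splitting pattern as in the parsing cells of this notebook
PATTERN = r'(?<!\\)\"|\"(?<!\\)(?=\s+\"|$)'

def count_benchmarks(lines):
    """Count how many lines belong to each benchmark."""
    counts = Counter()
    for line in lines:
        components = re.split(PATTERN, line)
        counts[components[3]] += 1  # components[3] is the benchmark field
    return counts

# Hypothetical example lines
lines = [
    '"SELECT MIN(u.Id) FROM users as u" "STATS" "001-001"\n',
    '"SELECT MIN(b.Id) FROM badges as b" "STATS" "001-001-augA1"\n',
    '"select MIN(p1.toNode) from dblp p1" "SNAP" "dblp-path02"\n',
]
counts = count_benchmarks(lines)
print(counts)
print(sum(counts.values()))
```

For the real file, `lines` would be the open file object of `scala_commands_augment_filter_agg.txt`. Counting the field rather than the word avoids miscounting a query whose text happens to contain a benchmark name.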