Programming assignment

1. File `data.csv.gz` contains transportation costs between municipalities in Brazil. Load the data from this file into a PostgreSQL database, in a table named `costs` with the following columns: `orig_id`, `dest_id`, `cost`, with values read from columns of the same names in the csv file.


2. Python package `transportation_cost` contains functions to retrieve data from the database of transportation costs in module `helpers.py`. Implement the function `helpers.connect_to_db`, which returns a cursor connected to the database. Use package `psycopg2` to connect to the PostgreSQL database, with cursor `psycopg2.extensions.cursor`.


3. Script `main.py` runs the functions from module `helpers.py` in a sequence. It simulates the type of queries that would need to be performed while running a supply chain model, by repeatedly looking for the transportation cost between a subset of municipalities, and searching for the destination with the lowest transportation cost (referred to as the "closest", for simplicity) within a list of possible destinations. Functions `helpers.get_distance_slow` and `helpers.get_closest_slow` are supposed to perform these operations, but they run into issues with certain input values. Identify and fix the bug(s) with these functions.


4. The functions as they are implemented are quite inefficient. Try to improve the performance of the code with your own implementation: `helpers.get_distance_fast` and `helpers.get_closest_fast`. In the course of running the script `main.py`, information about execution time will be printed, so that you can monitor the performance of your implementation. Please add comments to your code to explain the rationale and discuss pros and cons of your solution, as well as other ideas you may have to improve performance.


5. Create a database dump for the database that you used for this assignment (using `pg_dump`).

Please provide your answer as a compressed file containing the Python package and the database dump.


Because the input is generated at random, some origin-destination pairs may not exist in the table.

When the input values do not map to an actual row in the dataset, the query returns null.
Test the cases in which these functions can fail.
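
One way to handle a pair with no matching row is to check the result of `fetchone()` before indexing into it. The sketch below assumes the `costs` table from task 1 and a psycopg2 cursor. The name `get_distance_checked` is hypothetical and is not part of `helpers.py`; returning `None` for a missing pair is one possible convention, not necessarily the one the assignment expects.

```python
def get_distance_checked(orig_id, dest_id, cur):
    """Return the cost for (orig_id, dest_id), or None if the pair has no row."""
    cur.execute(
        "SELECT cost FROM costs WHERE orig_id = %s AND dest_id = %s",
        (orig_id, dest_id),
    )
    row = cur.fetchone()  # None when the query matches no rows
    return None if row is None else row[0]
```

Callers then need to treat `None` explicitly, for example by skipping that destination when searching for the closest one.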

In [44]:
import psycopg2

In [45]:
conn = psycopg2.connect(host = "localhost", dbname = "postgres", user = "postgres")
cur = conn.cursor()

In [47]:
command = """
CREATE TABLE costs(
orig_id integer, 
dest_id integer, 
cost float)
"""

cur.execute(command)

with open("./data.csv") as f:
    next(f)
    cur.copy_from(f, 'costs', sep=';')
conn.commit()

ProgrammingError: relation "costs" already exists
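
The `relation "costs" already exists` error above comes from running `CREATE TABLE` again after the table was created in an earlier run. A possible way to make the load step re-runnable, and to read `data.csv.gz` directly while picking columns by name, is sketched below. The function name `load_costs` is hypothetical; the `;` delimiter is taken from the cell above, and the whole file is buffered in memory, which is an assumption that may not hold for a very large file.

```python
import csv
import gzip
import io

COPY_SQL = "COPY costs (orig_id, dest_id, cost) FROM STDIN WITH (FORMAT csv)"


def load_costs(cur, path="data.csv.gz", delimiter=";"):
    """Create the costs table if needed and bulk-load it from a gzipped CSV."""
    cur.execute(
        "CREATE TABLE IF NOT EXISTS costs ("
        "orig_id integer, dest_id integer, cost double precision)"
    )
    # Keep only the three named columns, whatever their position in the file.
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    with gzip.open(path, mode="rt", newline="") as f:
        for row in csv.DictReader(f, delimiter=delimiter):
            writer.writerow([row["orig_id"], row["dest_id"], row["cost"]])
    buf.seek(0)
    # copy_expert streams the buffer through PostgreSQL's COPY ... FROM STDIN.
    cur.copy_expert(COPY_SQL, buf)
```

Calling `load_costs(cur)` followed by `conn.commit()` would load the data. Note that calling it twice appends the rows twice, so the table should be emptied first if a reload is intended.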


In [356]:
from random import choice
from time import time

conn = psycopg2.connect(host = "localhost", dbname = "postgres", user = "postgres")
CUR = conn.cursor()

print("Initializing test dataset...")
query = (
    "SELECT DISTINCT orig_id "         # distinct origin ids
    "FROM costs")
CUR.execute(query)
all_orig_ids = [
    orig_id
    for (orig_id,)
    in CUR.fetchall()]
subset_orig_ids = [
    choice(all_orig_ids)
    for _
    in range(N)]

"""
This query could generate subset
"""

query = (
    "SELECT DISTINCT dest_id "
    "FROM costs")
CUR.execute(query)
all_dest_ids = [
    dest_id
    for (dest_id,)
    in CUR.fetchall()]
subset_dest_ids = [
    choice(all_dest_ids)
    for _
    in range(N)]

"""
Because test_distance_set and test_closest_set are
built with choice(), some of the orig-dest pairs
being looked up may not actually be in the table.
To remedy this, 
"""
test_distance_set = [
    (choice(subset_orig_ids),
     choice(subset_dest_ids))
    for _ in range(N*N)]
test_closest_set = [
    (choice(subset_orig_ids),
     [choice(subset_dest_ids)
      for _ in range(N)])
    for _ in range(N*N)]

# Baseline: slow functions
start = time()

print("Running get_distance_slow...")
results_distance_slow = [
    get_distance_slow(orig_id, dest_id, CUR)
    for orig_id, dest_id
    in test_distance_set]

# print("Running get_closest_slow...")
# results_closest_slow = [
#     get_closest_slow(orig_id, dest_ids, CUR)
#     for orig_id, dest_ids
#     in test_closest_set]

Initializing test dataset...
Running get_distance_slow...
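
The random pairs built with `choice` in the cell above may not exist in `costs`. One possible way to get test input that is guaranteed to match a row is to sample pairs directly from the table. This is only a sketch: the name `sample_existing_pairs` is hypothetical, and `ORDER BY random()` scans the whole table, so it is meant for building a test set once rather than for repeated use.

```python
def sample_existing_pairs(cur, n):
    """Return up to n (orig_id, dest_id) pairs drawn at random from rows in costs."""
    cur.execute(
        "SELECT orig_id, dest_id FROM costs ORDER BY random() LIMIT %s",
        (n,),
    )
    return [(orig_id, dest_id) for orig_id, dest_id in cur.fetchall()]
```

Keeping some deliberately invalid pairs in the test set is still useful for checking how the functions behave when no row is found.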


In [359]:
len(set(results_distance_slow))

559

In [112]:
conn = psycopg2.connect(host = "localhost", dbname = "postgres", user = "postgres")
cur = conn.cursor()

In [304]:
def get_closest_fast(
        orig_id,
        dest_ids,
        cur):
    """
    Return destination with lowest transportation cost
    from origin in list of destinations. 
    """
    # Start from the first destination, then compare against the rest.
    dest_cost_pair = [dest_ids[0], get_distance_slow(orig_id, dest_ids[0], cur)]
    
    for dest_id in dest_ids[1:]:
        pair = [dest_id, get_distance_slow(orig_id, dest_id, cur)] 
        if pair[1] < dest_cost_pair[1]:
            dest_cost_pair = pair

    return dest_cost_pair[0]

In [306]:
def get_closest_fast(
        orig_id,
        dest_ids,
        cur):
    """
    Return destination with lowest transportation cost
    from origin in list of destinations. 
    """
    closest = (None, float("inf"))  # sentinel: no destination found yet
    
    for dest_id in dest_ids:
        cost = get_distance_slow(orig_id, dest_id, cur)
        if cost < closest[1]:
            closest = (dest_id, cost)

    return closest[0]

In [307]:
results_closest_fast = [
        get_closest_fast(orig_id, dest_ids, cur)
        for orig_id, dest_ids
        in test_closest_set]
print(len(results_closest_fast))

900
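
Both versions above still issue one query per destination. Another idea is to fetch the costs for all origins of interest in a single query and answer the lookups from memory. The sketch below assumes the relevant rows fit in memory; the names `load_costs_by_origin` and `get_closest_cached` are hypothetical. It relies on psycopg2 adapting a Python list to a PostgreSQL array, which is what makes `= ANY(%s)` work.

```python
from collections import defaultdict


def load_costs_by_origin(cur, orig_ids):
    """Fetch all costs for the given origins in one query.

    Returns a mapping {orig_id: {dest_id: cost}}.
    """
    costs = defaultdict(dict)
    if not orig_ids:
        return costs
    cur.execute(
        "SELECT orig_id, dest_id, cost FROM costs WHERE orig_id = ANY(%s)",
        (list(orig_ids),),
    )
    for orig_id, dest_id, cost in cur.fetchall():
        costs[orig_id][dest_id] = cost
    return costs


def get_closest_cached(orig_id, dest_ids, costs):
    """Return the dest_id with the lowest known cost, or None if none is known."""
    known = costs.get(orig_id, {})
    # Ignore destinations that have no row for this origin.
    candidates = [d for d in dest_ids if d in known]
    return min(candidates, key=known.__getitem__) if candidates else None
```

The trade-off is memory and an up-front load in exchange for avoiding a database round trip per lookup; the cache also goes stale if the table changes while the model runs.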


In [316]:
"""
Trying to figure out this query.
The current issue is that I can't easily select dest_id because it's not part of the GROUP BY.
"""

def get_closest_single_query(
           orig_id,
           dest_ids,
           cur):
    """
    Return destination with lowest transportation cost
    from origin in list of destinations in a single sql query.
    """
    query = (
        """
        select dest_id from 
        (select MIN(cost), orig_id 
        from costs 
        WHERE orig_id = %s 
        AND dest_id IN %s
        GROUP BY orig_id)
        AS min_cost
        JOIN costs on costs.orig_id = min_cost.orig_id
        """)
    cur.execute(query, (orig_id, tuple(dest_ids)))
    return cur.fetchone()

In [294]:
conn.commit()

In [300]:
results_get_closest_single = [get_closest_single_query(orig_id, dest_ids, cur)
        for orig_id, dest_ids in test_closest_set[:5]]

Still need to figure out the IN clause and how to convert the array to a string.
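
One possible way around the GROUP BY problem is to skip the aggregate entirely: filter the rows for the origin and the candidate destinations, sort by cost, and keep the first row. No manual string conversion should be needed, since psycopg2 adapts a Python list to a PostgreSQL array (for `= ANY(%s)`) and a tuple to an `IN` list. The sketch below uses the hypothetical name `get_closest_in_one_query` and returns `None` when no candidate has a row.

```python
def get_closest_in_one_query(orig_id, dest_ids, cur):
    """Return the dest_id with the lowest cost from orig_id, or None."""
    if not dest_ids:
        return None  # nothing to search
    cur.execute(
        """
        SELECT dest_id
        FROM costs
        WHERE orig_id = %s
          AND dest_id = ANY(%s)
        ORDER BY cost
        LIMIT 1
        """,
        (orig_id, list(dest_ids)),
    )
    row = cur.fetchone()  # None when no candidate destination has a row
    return None if row is None else row[0]
```

An index on `(orig_id, dest_id)` would likely help this query, though that is an assumption that should be checked with `EXPLAIN ANALYZE` on the actual data.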

In [301]:
conn = psycopg2.connect(host = "localhost", dbname = "postgres", user = "postgres")

In [303]:
cur = conn.cursor()

In [314]:
query = (
        """
        select dest_id from 
        (select MIN(cost), orig_id 
        from costs 
        WHERE orig_id = %s 
        AND dest_id IN %s
        GROUP BY orig_id)
        AS min_cost
        JOIN costs on costs.orig_id = min_cost.orig_id
        """)

cur.execute(query, (test_closest_set[0][0], tuple(test_closest_set[0][1])))

In [315]:
cur.fetchone()[0]

1219