# Combining SQL and Python

An advantage of using a popular language is the high probability that programmers before us have already found solutions to our problems. As such, there exists a Python module called [Psycopg2](http://initd.org/psycopg/docs/), which allows us to interface with a PostgreSQL database.

Using this module requires a fairly advanced understanding of Python and SQL, so we'll start off with a very simple query and work through it step by step.

Let's suppose we want to implement the following SQL query:

In [None]:
SELECT 2 + 3;

### Query Result

| ?column? |
|----------|
|        5 |


To run this SQL query in Python, we use psycopg2 as follows:

In [None]:
import psycopg2

# Establish the connection
conn = psycopg2.connect(dbname='db', user='grok')
cursor = conn.cursor()

# Execute an SQL query and receive the output
cursor.execute('SELECT 2 + 3;')
records = cursor.fetchall()

print(records)

[(5,)]

And that's our expected result. Now let's look at each line in detail.

# Interfacing with databases in Python

We will now go through the example above step by step.

1. Establish a connection to the database

conn = psycopg2.connect(dbname='db', user='grok')

This command initialises a new database session and returns a connection object. We have to specify the name of the database and the name of the user. Note that the **dbname** is the name of the database, not a table in the database.

Throughout this module we're calling our database 'db'. On your local machine, you would use your user account name for the user. Here we're going to use 'grok'.

2. Create a cursor object

cursor = conn.cursor()

The cursor is the object that interfaces with the database. We can execute SQL queries and receive their output through this object. We can call the object's functions by using the dot (.) notation just like we do for modules. The two functions that we will use most often are **execute** and **fetchall**.

3. Run a SQL query

cursor.execute('SELECT 2 + 3;')

To run a SQL query, we call the **execute** function, which is a function of the cursor object. This function takes the SQL query in the form of a string as its argument.

4. Receive the query return

records = cursor.fetchall()

The **fetchall** function returns the output of the last query. When taking SQL data into Python, the data types are converted to the closest match in Python data types. We'll have a closer look at this later.
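
The four steps above follow Python's DB-API 2.0 pattern (PEP 249), which Psycopg2 implements. As a minimal sketch that runs without a PostgreSQL server, the example below swaps in the standard-library `sqlite3` module, which follows the same connect/cursor/execute/fetchall pattern. The in-memory database `':memory:'` is an assumption made only for this illustration; with Psycopg2, only the `connect` call would look different.

```python
import sqlite3

# Step 1: open a database session (in-memory SQLite stands in for PostgreSQL)
conn = sqlite3.connect(':memory:')

# Step 2: create a cursor object
cursor = conn.cursor()

# Step 3: run the SQL query, passed as a string
cursor.execute('SELECT 2 + 3;')

# Step 4: receive the query output as a list of tuples
records = cursor.fetchall()
print(records)

# Release the cursor and the session when done
cursor.close()
conn.close()
```

The result has the same shape as in the Psycopg2 example: a list with one tuple per row.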

# **Question: Taking it all in**

To get started with basic Psycopg2 usage, write a function called select_all which queries either our Star or Planet table in PostgreSQL and returns all the rows using the following query:

In [None]:
SELECT * FROM Star;

Your function should take the name of the table as a string argument, so you can call it like this to access the Star table:

In [None]:
>>> select_all('Star')
[(2713049, 5996, 0.956), (3114167, 5666, 0.677), (3115833, 5995, 0.847), ...]

Or like this for the Planet table:

In [None]:
>>> select_all('Planet')
[(10666592, 'K00002.01', 'Kepler-2b', 'CONFIRMED', 2.204735365, 16.39, 2025), ...]

It should return the result of cursor.fetchall() directly.



# ⌛Solution:

In [None]:
import psycopg2

def select_all(table_name):
    # Establish a connection to the database
    conn = psycopg2.connect(dbname='db', user='grok')

    try:
        # Create a cursor object
        cursor = conn.cursor()

        try:
            # Execute the SQL query to select all rows from the specified table.
            # table_name is placed into the query text as-is, so only pass trusted names.
            cursor.execute(f"SELECT * FROM {table_name};")

            # Fetch all rows and return the result
            return cursor.fetchall()

        finally:
            # Close the cursor (only reached if the cursor was created)
            cursor.close()

    finally:
        # Always close the connection
        conn.close()
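
Because the table name is placed into the query string with an f-string, `select_all` should only ever receive trusted names. One possible safeguard, sketched below, is to check the name against a fixed set before building the query. The helper `build_select_all` and the set `ALLOWED_TABLES` are hypothetical names introduced for this illustration; they are not part of Psycopg2.

```python
# Hypothetical whitelist of the tables used in this module
ALLOWED_TABLES = {'Star', 'Planet'}

def build_select_all(table_name):
    """Return the SELECT statement for a known table, or raise ValueError."""
    if table_name not in ALLOWED_TABLES:
        raise ValueError(f"Unknown table: {table_name!r}")
    return f"SELECT * FROM {table_name};"

print(build_select_all('Star'))
```

The returned string could then be passed to `cursor.execute` in place of the inline f-string.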

# Handling cursor data

As you've seen in the last problem, the data from SQL queries in Psycopg2 is returned in the form of Python lists. In the last problem, you requested the full Star and Planet tables, each of which returned a list of n tuples of length m, where m is the number of columns and n is the number of rows in the table.

A list of tuples cannot be used in the same way as e.g. a 2D Numpy array. For example, the following method of indexing to access the first element will not work:

In [None]:
a = [(1, 2, 3), (4, 5, 6)]
print(a[0, 0])

TypeError: list indices must be integers or slices, not tuple

Instead, we have to use the [] operator twice: first to access the first list element, i.e. the first tuple, and then to access the first element in that tuple:

In [None]:
a = [(1, 2, 3), (4, 5, 6)]
print(a[0][0])

1


Using this indexing method, we can access every individual data element. This allows us, for example, to extract entire columns of the data by looping over the rows. The following code snippet extracts the t_eff column from the full Star table and appends each value to a new list:

In [None]:
import psycopg2

conn = psycopg2.connect(dbname='db', user='grok')
cursor = conn.cursor()
cursor.execute('SELECT * FROM Star')
records = cursor.fetchall()

t_eff = []
for row in records:
  t_eff.append(row[1])

print(t_eff)

# Output

[5996, 5666, 5995, 5735, 6167, 5717, 5733, 5349, 5485, 5934, 5170, 4905, 3887, 5557, 5413, 6079, 5071, 4980, 5796, 6225, 5881, 6391, 4812, 6117, 4856, 4536, 5559, 4722, 6350, 5339, 5850, 5853, 5795, 6031, 6046, 5851, 5126, 5803, 5015, 5588, 6117, 6075, 5468, 5592, 6174, 5653, 5641, 5520, 6144, 5957, 3898, 5492, 5446, 3741, 5627, 4989, 3672, 5992, 5485, 3767, 5557, 5880, 5841, 5127, 5354, 5795]
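
The same column extraction can also be written more compactly. The sketch below works on a hard-coded list holding the first three Star rows shown in the earlier problem, so it runs without a database connection; the variable names are chosen only for this illustration.

```python
# The first three rows of the Star table: (kepler_id, t_eff, radius)
records = [(2713049, 5996, 0.956), (3114167, 5666, 0.677), (3115833, 5995, 0.847)]

# A list comprehension picks the second entry of every row
t_eff = [row[1] for row in records]
print(t_eff)

# zip(*records) transposes the rows, giving one tuple per column
kepler_id, t_eff_column, radius = zip(*records)
print(radius)
```

The list comprehension is handy for a single column, while `zip(*records)` splits all columns at once.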


# Data types in SQL and Python

Now that we've seen how to work with query results, we can have a closer look at the data itself. In the previous activity, we learned about different data types in SQL when we were setting up tables.

How do these SQL data types get converted into Python types?

Let's have a look at the Planet table's data types. We can use a query which selects all columns but only a single row:

In [None]:
SELECT * FROM Planet LIMIT 1;

### Query Result

| kepler_id | koi_name  | kepler_name |  status   |   period    | radius | t_eq |
|-----------|-----------|-------------|-----------|-------------|--------|------|
|  10666592 | K00002.01 | Kepler-2b   | CONFIRMED | 2.204735365 |  16.39 | 2025 |


In Python, this query will return a list containing a single tuple. We can loop over the entries of this tuple and call the type function to determine the data types:

In [None]:
import psycopg2

conn = psycopg2.connect(dbname='db', user='grok')
cursor = conn.cursor()

cursor.execute('SELECT * FROM Planet LIMIT 1;')

records = cursor.fetchall()

for col in records[0]:
    print(type(col))

$\text{<class 'int'>}$

$\text{<class 'str'>}$

$\text{<class 'str'>}$

$\text{<class 'str'>}$

$\text{<class 'float'>}$

$\text{<class 'float'>}$

$\text{<class 'int'>}$

The conversion of these types is straightforward: SQL's SMALLINT and INTEGER get converted to Python integers, CHAR and VARCHAR to Python strings, and FLOAT to Python floats.

Check out the [Psycopg2 documentation](http://initd.org/psycopg/docs/usage.html#adaptation-of-python-values-to-sql-types) when you want to learn about type conversion in more detail.

# Writing data into NumPy arrays

Once we have the numerical data from the database in Python, we can write them into NumPy arrays.

Since we're often dealing with data of different types in databases, it is important to remember that while Python lists and tuples can hold data of different types, NumPy arrays cannot.

To convert a Python list into a simple NumPy array, we must ensure that the list only contains data of one type. Other than that, SQL results can easily be loaded into NumPy arrays:

In [None]:
import psycopg2
import numpy as np

conn = psycopg2.connect(dbname='db', user='grok')
cursor = conn.cursor()

cursor.execute('SELECT radius FROM Star;')

records = cursor.fetchall()
array = np.array(records)

print(array.shape)
print(array.mean())
print(array.std())

In [None]:
(66, 1)
0.886863636364
0.237456527847
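
Two details of this conversion are easy to miss: a single-column result still becomes a 2D array, and mixing numbers with strings changes the array's data type. The sketch below illustrates both using hard-coded values in place of a live query; the three radii are taken from the Star rows shown earlier.

```python
import numpy as np

# A single-column query returns a list of 1-tuples
records = [(0.956,), (0.677,), (0.847,)]
array = np.array(records)
print(array.shape)   # one row per tuple, one column

# Select the only column to get a flat, 1D array
radius = array[:, 0]
print(radius.shape)

# Mixing numbers and strings makes NumPy store every entry as a string
mixed = np.array([(10666592, 'Kepler-2b', 16.39)])
print(mixed.dtype.kind)   # 'U' marks a unicode string array
```

Selecting only numerical columns in the SQL query avoids the string conversion.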

Once the data is stored in NumPy arrays, we have access to all of NumPy's functionality to manipulate and analyse our data. For example, we can now easily calculate a median.

# **Question: A proper median**

Write a function called column_stats which calculates the mean and median of a selected column in either the Star or the Planet table. For this, let your function take two string arguments:

1. the name of the table;
2. the name of the column.

and have it return the mean and median (in this order) of the selected column.

When you call your function on, for example, the t_eff column of the Star table, the function call and return should look like this:

In [None]:
>>> column_stats('Star', 't_eff')
(5490.681818181818, 5634.0)

You can compare your calculation with the pure SQL query:

In [None]:
SELECT AVG(t_eff) FROM Star;

### Query Result

| avg                     |
|-------------------------|
| 5490.6818181818181818   |


# ⌛Solution:

In [None]:
import psycopg2

def column_stats(table_name, column_name):
    # Establish a connection to the database
    conn = psycopg2.connect(dbname='db', user='grok')

    try:
        # Create a cursor object
        cursor = conn.cursor()

        try:
            # Query to calculate the mean.
            # AVG returns a Decimal for integer columns, so convert to float.
            cursor.execute(f"SELECT AVG({column_name}) FROM {table_name};")
            mean = float(cursor.fetchone()[0])

            # Query to calculate the median.
            # PERCENTILE_CONT returns double precision; float() keeps the type explicit.
            cursor.execute(f"""
                SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY {column_name})
                FROM {table_name};
            """)
            median = float(cursor.fetchone()[0])

            return mean, median

        finally:
            # Close the cursor (only reached if the cursor was created)
            cursor.close()

    finally:
        # Always close the connection
        conn.close()
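
The statistics can also be computed on the Python side with NumPy, in the spirit of the previous section. The following is a minimal sketch of that alternative, not a replacement for the solution above: the helper name `stats_from_records` is hypothetical, and the hard-coded list stands in for the result of `cursor.fetchall()` on a single-column query.

```python
import numpy as np

def stats_from_records(records):
    """Mean and median of a single-column result set (a list of 1-tuples)."""
    # Convert to a float array and select the only column
    values = np.array(records, dtype=float)[:, 0]
    return float(np.mean(values)), float(np.median(values))

# The first four t_eff values of the Star table, shaped like fetchall() output
records = [(5996,), (5666,), (5995,), (5735,)]
mean, median = stats_from_records(records)
print(mean, median)
```

With an even number of rows, `np.median` averages the two middle values.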

# SQL vs. Python

In this course you've learned two different approaches to dealing with data. Which you choose for a particular project depends on a variety of factors including the questions you're posing of the data or whether you're using a public database or catalogue.

We have seen that SQL is convenient to use for a lot of things – but exactly how convenient is it? Can we do the same thing in Python?

Let's go through a few problems in which we implement typical SQL queries from the previous activities in Python. We will start off with a simple query and add a new element in each problem.

# **Question: Simple queries in Python-1**

Your first task is to replicate the following SQL query:

In [None]:
SELECT kepler_id, radius
FROM Star
WHERE radius > 1.0;

### Query Result

| kepler_id | radius |
|-----------|--------|
|   3342970 |  1.064 |
|   3351888 |  1.057 |
|   6922244 |  1.451 |
|   8395660 |  1.029 |
|   9579641 |  1.332 |
|  10666592 |  1.991 |
|  10797460 |   1.04 |
|  10854555 |  1.046 |
|  10875245 |  1.411 |
|  10984090 |  1.073 |
|  11138155 |  1.025 |
|  11304958 |  1.046 |
|  11403044 |  1.103 |
|  11493732 |  1.091 |
|  11904151 |  1.056 |


The data is stored in stars.csv, with the kepler_id in the first column and the radius in the last.

Write a function called query which takes the CSV filename as an argument and returns the data in a 2-column NumPy array.

For example, given this small CSV file:

### stars.csv

| kepler_id | t_eff | radius |
|-----------|-------|--------|
|  10666592 |  6350 |  1.991 |
|  10682541 |  5339 |  0.847 |
|  10797460 |  5850 |   1.04 |


your query function should work as follows:

In [None]:
>>> query('stars.csv')
array([[  1.06665920e+07,   1.99100000e+00],
       [  1.07974600e+07,   1.04000000e+00]])

The numerical data gets automatically converted to floats in this procedure, so don't worry if it doesn't look like the SQL output.

# **Hint**:

You can use NumPy's [loadtxt](https://numpy.org/doc/stable/reference/generated/numpy.loadtxt.html) function with the optional usecols argument to read in only those columns you're interested in.

# ⌛Solution:

In [None]:
# Write your query function here
import numpy as np

def query(filename):
    # Load the data from the CSV file
    data = np.loadtxt(filename, delimiter=',', usecols=(0, 2))

    # Filter rows where radius > 1.0
    filtered_data = data[data[:, 1] > 1.0]

    return filtered_data


# You can use this to test your code
# Everything inside this if-statement will be ignored by the automarker
if __name__ == '__main__':
  # Compare your function output to the SQL query
  result = query('stars.csv')

# **Question: Simple queries in Python-2**

Let's add another element to our query. Sort the resulting table in ascending order to match the result you would get with:

In [None]:
SELECT kepler_id, radius
FROM Star
WHERE radius > 1.0
ORDER BY radius ASC;

### Query Result

| kepler_id | radius |
|-----------|--------|
|  11138155 |  1.025 |
|   8395660 |  1.029 |
|  10797460 |  1.04  |
|  11304958 |  1.046 |
|  10854555 |  1.046 |
|  11904151 |  1.056 |
|   3351888 |  1.057 |
|   3342970 |  1.064 |
|  10984090 |  1.073 |
|  11493732 |  1.091 |
|  11403044 |  1.103 |
|   9579641 |  1.332 |
|  10875245 |  1.411 |
|   6922244 |  1.451 |
|  10666592 |  1.991 |


You can build on your solution to the last problem. Again, the function should be named query and it should take the filename as its argument.

# **Hint**:

You can use NumPy's [argsort](https://numpy.org/doc/stable/reference/generated/numpy.argsort.html) function to solve this problem. Take a look at how it works:

In [None]:
import numpy as np

a = np.array([3, 1, 2, 0])
b = np.argsort(a)
print(b)
print(a[b])

[3 1 2 0]
[0 1 2 3]


It returns the indices that would sort the array: b[0] is the index in a of the smallest element, b[1] the index of the second smallest, and so on. You can pass this returned array b to the original array a to rearrange its entries.
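
The same idea carries over to 2D arrays: sorting one column with argsort and indexing the whole array with the result reorders complete rows. The sketch below uses three (kepler_id, radius) rows from the query result of the previous problem; the variable names are chosen only for this illustration.

```python
import numpy as np

# Three (kepler_id, radius) rows from the previous query result
data = np.array([[3342970, 1.064],
                 [3351888, 1.057],
                 [6922244, 1.451]])

# Indices that sort the radius column in ascending order
order = np.argsort(data[:, 1])
print(order)

# Indexing with these indices moves each row as a whole
sorted_rows = data[order]
print(sorted_rows)
```

Each kepler_id stays attached to its radius because the row index, not the individual value, is rearranged.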

# ⌛Solution:

In [None]:
# Write your query function here
import numpy as np

def query(filename):
    # Load the data from the CSV file
    data = np.loadtxt(filename, delimiter=',', usecols=(0, 2))

    # Filter rows where radius > 1.0
    filtered_data = data[data[:, 1] > 1.0]

    # Sort the filtered data by the radius (second column)
    sorted_indices = np.argsort(filtered_data[:, 1])
    sorted_data = filtered_data[sorted_indices]

    return sorted_data

# You can use this to test your code
# Everything inside this if-statement will be ignored by the automarker
if __name__ == '__main__':
  # Compare your function output to the SQL query
  result = query('stars.csv')

# **Question: Simple queries in Python-3**

Let's add yet another element to our query. Join the Star table with the Planet table and calculate the size ratio, i.e. planet radius / star radius for each star-planet pair. Your **query** function should produce the same result as the SQL query:

In [None]:
SELECT p.radius/s.radius AS radius_ratio
FROM Planet AS p
INNER JOIN Star AS s USING (kepler_id)
WHERE s.radius > 1.0
ORDER BY p.radius/s.radius ASC;

### Query Result

|   radius_ratio    |
|-------------------|
| 0.487987987987988 |
| 0.826044703595724 |
| 0.962099125364432 |
|  1.15563839701771 |
|  1.30403968816442 |
|  1.37310606060606 |
|  1.41141141141141 |
|  2.28377065111759 |
|  2.46246246246246 |
|  2.50728862973761 |
|  2.59082217973231 |
|  2.98076923076923 |
|  2.98076923076923 |
|  3.29887218045113 |
|  3.40225563909774 |
|  4.04351767905712 |
|  5.47801147227533 |
|  8.23204419889503 |
|  9.21475875118259 |
|  10.2205375603032 |
|   11.590243902439 |
|  58.8725939505041 |


You can build on your solution to the last problem. The function must be named **query**, but this time it should take two filenames as arguments: one for the stars and one for the planets.

In planets.csv, the first column is the **kepler_id** and the second-to-last column is the **radius**.

Your function should return a column vector of ratios, like this:

In [None]:
>>> query('stars.csv', 'planets.csv')
array([[  0.48798799],
       [  0.8260447 ],
       [  0.96209913],
       [  1.1556384 ],
       [  1.30403969],
       ...

# **Hint**:

You may need to use a nested loop to compare each Planet's **kepler_id** against each Star's **kepler_id**. Once you've found a match and the star's radius is larger than one, you can append the ratio to the results list or array.

# ⌛Solution:

In [None]:
# Write your query function here
import numpy as np

def query(stars_file, planets_file):
    # Load star and planet data
    stars = np.loadtxt(stars_file, delimiter=',', usecols=(0, 2))  # kepler_id, radius
    planets = np.loadtxt(planets_file, delimiter=',', usecols=(0, -2))  # kepler_id, radius

    # Filter stars with radius > 1.0
    stars_filtered = stars[stars[:, 1] > 1.0]

    # Initialize a list to store the radius ratios
    radius_ratios = []

    # Nested loop to compare kepler_id and calculate radius ratio
    for planet in planets:
        for star in stars_filtered:
            if planet[0] == star[0]:  # Match kepler_id
                ratio = planet[1] / star[1]  # Calculate radius ratio
                radius_ratios.append(ratio)

    # Convert the list to a NumPy array and sort it in ascending order
    radius_ratios = np.array(radius_ratios).reshape(-1, 1)
    radius_ratios = radius_ratios[np.argsort(radius_ratios[:, 0])]

    return radius_ratios

# You can use this to test your code
# Everything inside this if-statement will be ignored by the automarker
if __name__ == '__main__':
  # Compare your function output to the SQL query
  result = query('stars.csv', 'planets.csv')
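
The nested loop compares every planet with every star. One possible alternative, sketched below, stores the filtered stars in a dictionary so that each planet needs only a single lookup. The function name `ratios_by_lookup` is hypothetical, the sketch assumes each kepler_id appears at most once among the stars, and it takes in-memory (kepler_id, radius) pairs instead of filenames so that it runs on its own. The star rows come from the small stars.csv example; of the planet radii, only 16.39 (Kepler-2b) comes from the Planet table and the other two are made up.

```python
import numpy as np

def ratios_by_lookup(stars, planets):
    """Sorted planet/star radius ratios for stars with radius > 1.0."""
    # Keep only the large stars, keyed by kepler_id
    star_radius = {int(kid): r for kid, r in stars if r > 1.0}

    # One dictionary lookup per planet replaces the inner loop
    ratios = [r / star_radius[int(kid)]
              for kid, r in planets if int(kid) in star_radius]

    # Sort ascending and return a column vector
    return np.sort(np.array(ratios)).reshape(-1, 1)

stars = [(10666592, 1.991), (10682541, 0.847), (10797460, 1.04)]
planets = [(10666592, 16.39), (10682541, 2.0), (10797460, 2.6)]  # last two radii are made up
result = ratios_by_lookup(stars, planets)
print(result)
```

The planet orbiting the star with radius 0.847 is dropped, just as the WHERE clause drops it in the SQL query.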

The last three problems showed that Python is straightforward to use for simple queries, but gets a lot more difficult as queries become more complex. The last question on joins was especially hard to implement in Python, whereas it's relatively simple in SQL.

This shouldn't be surprising, as that's exactly what SQL is designed for and what we should use for these problems. There are good reasons, though, why you might not want to drop Python completely for database-related work.

One important thing to consider is that SQL was designed for accessing data, and its built-in functions support only basic mathematical operations. Beyond that it gets very complicated.

A good example of this is the calculation of the median, which we have done a couple of times in Python. PostgreSQL has no dedicated median function, however: the closest built-in tool is the ordered-set aggregate PERCENTILE_CONT(0.5), and implementing a median by hand in SQL gets pretty complicated. We haven't even covered enough SQL to understand how to implement a median by hand, but if you're interested, check out this [PostgreSQL wiki article](https://wiki.postgresql.org/wiki/Aggregate_Median), which shows examples of how a median could be implemented.
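
To give a feel for the difference, the sketch below computes a median by hand in SQL and compares it with a single Python call. It is an illustration under stated assumptions, not the approach from the linked article: the standard-library `sqlite3` module and an in-memory table stand in for PostgreSQL, the query uses SQLite's dialect, and the table holds only the first five t_eff values of the Star table.

```python
import sqlite3
import statistics

t_eff = [5996, 5666, 5995, 5735, 6167]  # first five t_eff values of the Star table

# Build a throwaway in-memory table (SQLite stands in for PostgreSQL)
conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
cursor.execute('CREATE TABLE Star (t_eff INTEGER);')
cursor.executemany('INSERT INTO Star VALUES (?);', [(t,) for t in t_eff])

# Median by hand: sort, skip to the middle, average the one or two middle rows
cursor.execute('''
    SELECT AVG(t_eff) FROM (
        SELECT t_eff FROM Star
        ORDER BY t_eff
        LIMIT 2 - (SELECT COUNT(*) FROM Star) % 2
        OFFSET (SELECT (COUNT(*) - 1) / 2 FROM Star)
    );
''')
sql_median = cursor.fetchone()[0]
conn.close()

# The same statistic in Python is a single function call
python_median = statistics.median(t_eff)
print(sql_median, python_median)
```

Both approaches give the same value, but the SQL version needs two subqueries just to locate the middle of the sorted column.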