# SQL Exercises

This exercise is about using SQL to retrieve information from a database.

We will work with a pandas dataframe and use pandas' built-in support for interacting with SQL databases via the `sqlite3` standard library module.

## Examples

We first define two helper functions. The first opens a database connection backed by a temporary file. The second converts the output of a `sqlite3` query to a pandas dataframe, so that we can exploit the rich display of dataframes as HTML tables in Jupyter.

In [None]:
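import sqlite3
import tempfile
from typing import IO

import pandas as pd

def create_temporary_database_connection() -> tuple[sqlite3.Connection, IO[bytes]]:
    # Return the file object alongside the connection so that the caller can
    # close it (which deletes the temporary file) once the work is done.
    temporary_file = tempfile.NamedTemporaryFile()
    return sqlite3.connect(temporary_file.name), temporary_file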

def sqlite3_cursor_to_dataframe(cursor: sqlite3.Cursor) -> pd.DataFrame:
    dataframe = pd.DataFrame(
        data=cursor.fetchall(),
        columns=[c[0] for c in cursor.description],
    )
    if "index" in dataframe.columns:
        dataframe.set_index("index", inplace=True)
        dataframe.index.name = ""
    return dataframe


As an initial quick demonstration, we will create a simple dataframe with `pandas`, and then use the `DataFrame.to_sql` method to write the dataframe to a temporary database connection as a table `example_table`.

In [None]:
dataframe = pd.DataFrame({"numeric": [0, 1, 2], "text": ["A", "B", "C"]})

connection, temporary_file = create_temporary_database_connection()

_ = dataframe.to_sql("example_table", con=connection)

As a first example of running an SQL query on the resulting database table, let's retrieve all values.

In [None]:
results = connection.execute("""
    SELECT *
    FROM example_table
""")

sqlite3_cursor_to_dataframe(results)

We can also retrieve particular rows that match a condition:

In [None]:
sqlite3_cursor_to_dataframe(
    connection.execute(
        """
        SELECT *
        FROM example_table
        WHERE numeric=0
        """
    )
)

We can also select particular columns:

In [None]:
sqlite3_cursor_to_dataframe(
    connection.execute(
        """
        SELECT numeric
        FROM example_table
        """
    )
)

Finally, we clean up the database connection and its associated temporary file.

In [None]:
connection.close()
temporary_file.close()

Now let's get started with the exercises. First, let's download a dataset from `scikit-learn`.

In [None]:
from sklearn.datasets import fetch_california_housing
data_california = fetch_california_housing()

Let's convert this to a dataframe so we can play with it:

In [None]:
california = pd.DataFrame(data=data_california.data, columns=data_california.feature_names)
california['target'] = data_california.target
california.head()

As before, we create a temporary file and database connection, and write the dataframe to it as a new table:

In [None]:
connection, temporary_file = create_temporary_database_connection()

_ = california.to_sql("california", con=connection)

## Finding rows with high target variable

Query the `california` table to find the rows with a `target` value greater than 4:
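One possible shape for this query is sketched below. So that it runs without downloading the dataset, it uses a small in-memory table with made-up values in place of the real `california` table; the names `toy_connection`, `row_id` and `high_target_query` are illustrative assumptions, not part of the exercise.

```python
import sqlite3

# Toy stand-in for the `california` table (made-up values, not real data).
toy_connection = sqlite3.connect(":memory:")
toy_connection.execute(
    "CREATE TABLE california (row_id INTEGER, HouseAge REAL, target REAL)"
)
toy_connection.executemany(
    "INSERT INTO california VALUES (?, ?, ?)",
    [(0, 41.0, 4.526), (1, 21.0, 3.585), (2, 52.0, 5.0), (3, 30.0, 0.9)],
)

# WHERE keeps only the rows for which the condition is true.
high_target_query = """
    SELECT *
    FROM california
    WHERE target > 4
"""
high_target_rows = toy_connection.execute(high_target_query).fetchall()
print(high_target_rows)

toy_connection.close()
```

In the notebook itself, the same query string could be passed to `connection.execute` and the resulting cursor wrapped in `sqlite3_cursor_to_dataframe` for display.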

We can also find how many such rows there are, using the `COUNT` function:
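A minimal sketch of counting the matching rows follows. It again assumes a toy in-memory table with made-up values rather than the real dataset, and `toy_connection` and `high_target_count` are hypothetical names chosen for illustration.

```python
import sqlite3

# Toy stand-in for the `california` table (made-up values, not real data).
toy_connection = sqlite3.connect(":memory:")
toy_connection.execute("CREATE TABLE california (row_id INTEGER, target REAL)")
toy_connection.executemany(
    "INSERT INTO california VALUES (?, ?)",
    [(0, 4.526), (1, 3.585), (2, 5.0), (3, 0.9)],
)

# COUNT(*) collapses the matching rows into a single row holding their number.
count_query = """
    SELECT COUNT(*) AS high_target_count
    FROM california
    WHERE target > 4
"""
# fetchone() returns that single row as a tuple; its first item is the count.
high_target_count = toy_connection.execute(count_query).fetchone()[0]
print(high_target_count)

toy_connection.close()
```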

To get an idea of the distribution of values in the target column, we can use some aggregate SQL functions. Compute the average, minimum and maximum value of the target variable in the table:

The results should show that the target values range between approximately 0.15 and 5, so our choice of 4 as a "high" target may be reasonable.
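The aggregate query described above might look like the following sketch. The table here is a toy with made-up values, so its numbers will not match the range quoted for the real dataset; `toy_connection` and the column aliases are assumptions made for illustration.

```python
import sqlite3

# Toy stand-in for the `california` table (made-up values, not real data).
toy_connection = sqlite3.connect(":memory:")
toy_connection.execute("CREATE TABLE california (row_id INTEGER, target REAL)")
toy_connection.executemany(
    "INSERT INTO california VALUES (?, ?)",
    [(0, 0.5), (1, 1.5), (2, 2.5), (3, 3.5)],
)

# Several aggregate functions can be combined in a single SELECT.
summary_query = """
    SELECT AVG(target) AS average_target,
           MIN(target) AS minimum_target,
           MAX(target) AS maximum_target
    FROM california
"""
average_target, minimum_target, maximum_target = toy_connection.execute(
    summary_query
).fetchone()
print(average_target, minimum_target, maximum_target)

toy_connection.close()
```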

## Focus on older buildings

Find the rows where `HouseAge` is greater than 50 and `Population` is more than 1000.

And count how many rows like these there are:
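The two queries in this section can be sketched as follows, combining conditions with `AND`. As in the earlier sketches, the table is a toy with made-up values, and `toy_connection`, `row_id` and the result variable names are hypothetical.

```python
import sqlite3

# Toy stand-in for the `california` table (made-up values, not real data).
toy_connection = sqlite3.connect(":memory:")
toy_connection.execute(
    "CREATE TABLE california (row_id INTEGER, HouseAge REAL, Population REAL)"
)
toy_connection.executemany(
    "INSERT INTO california VALUES (?, ?, ?)",
    [(0, 52.0, 1500.0), (1, 52.0, 800.0), (2, 35.0, 2400.0), (3, 51.0, 1001.0)],
)

# AND requires both conditions to hold for a row to be returned.
old_and_populous_rows = toy_connection.execute("""
    SELECT *
    FROM california
    WHERE HouseAge > 50 AND Population > 1000
""").fetchall()

# The same WHERE clause can be reused with COUNT(*) to count those rows.
old_and_populous_count = toy_connection.execute("""
    SELECT COUNT(*)
    FROM california
    WHERE HouseAge > 50 AND Population > 1000
""").fetchone()[0]

print(old_and_populous_rows)
print(old_and_populous_count)

toy_connection.close()
```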

## More advanced keywords

Among the rows where `HouseAge` is less than 30, find the 5 rows with the highest average number of bedrooms (`AveBedrms`).

**Hint:** You will need the `ORDER BY` and `LIMIT` keywords.
`ORDER BY` is followed by a column name and a sorting direction (`ASC` or `DESC` for ascending or descending, respectively).
`LIMIT` is followed by the maximum number of results we want to retrieve.
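A minimal sketch of how `ORDER BY` and `LIMIT` fit together is shown below. It assumes a toy in-memory table with made-up values in place of the real `california` table; `toy_connection`, `row_id` and `top_bedroom_rows` are illustrative names only.

```python
import sqlite3

# Toy stand-in for the `california` table (made-up values, not real data).
toy_connection = sqlite3.connect(":memory:")
toy_connection.execute(
    "CREATE TABLE california (row_id INTEGER, HouseAge REAL, AveBedrms REAL)"
)
toy_connection.executemany(
    "INSERT INTO california VALUES (?, ?, ?)",
    [
        (0, 10.0, 1.1),
        (1, 45.0, 9.0),  # excluded: older than 30 years
        (2, 5.0, 3.2),
        (3, 29.0, 1.0),
        (4, 12.0, 2.5),
        (5, 30.0, 8.0),  # excluded: exactly 30 is not "less than 30"
        (6, 20.0, 0.9),
        (7, 8.0, 1.7),
    ],
)

# Filter first, then sort in descending order, then keep at most 5 rows.
top_bedroom_rows = toy_connection.execute("""
    SELECT *
    FROM california
    WHERE HouseAge < 30
    ORDER BY AveBedrms DESC
    LIMIT 5
""").fetchall()
print(top_bedroom_rows)

toy_connection.close()
```

The clause order matters in SQL: `WHERE` comes before `ORDER BY`, and `LIMIT` comes last.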

Again, we close the connection and temporary file to clean up correctly.

In [None]:
connection.close()
temporary_file.close()