<img src="images/banner.png" style="width: 100%;">

# Working with Databases I Notebook 1

References:

[1] McKinney, Wes. *Python for Data Analysis*. O'Reilly Media, Inc., 2022.

[2] Teate, Renee MP. *SQL for Data Scientists: A Beginner's Guide for Building Datasets for Analysis.* John Wiley & Sons, 2021.

[3] Forta, Ben. *Sams Teach Yourself SQL in 10 Minutes a Day, 5th Edition*. O'Reilly Media, Inc., 2020.

[4] Python sqlite3 documentation - https://docs.python.org/3/library/sqlite3.html

[5] Revised and grammar checked using ChatGPT - https://chatgpt.com/

Prepared by: Leodegario Lorenzo II

In today's world, we interact with databases more often than we realize. Every time you perform a Google search, post a video on TikTok, or view a friend's profile on Instagram, you are interacting with a database.

As a data scientist, most of the data that you'll encounter in industry will not live in simple text files or Excel spreadsheets. Instead, companies usually create and store their data in **SQL-based relational databases** (e.g., Microsoft SQL Server, PostgreSQL, and MySQL). Because of this, a strong foundation in SQL and in writing effective queries is an essential skill for any aspiring data scientist.

In this notebook, we will demonstrate how Python can be used to interact with a database, and then introduce fundamental SQL concepts, focusing on the `SELECT` statement and the `WHERE` clause.

## 1 Interacting with a database in Python

We can interact with databases using Python in several ways. For this demonstration, we will show two: one using `sqlite3`, and another using `pandas` through `sqlalchemy`.

### Using `sqlite3`

In [1]:
import sqlite3
import pandas as pd

The first step is to establish a connection with a database, which is done via the `sqlite3.connect` function.

In [2]:
conn = sqlite3.connect("data/farmers_market.db")

In [3]:
conn

<sqlite3.Connection at 0x10ee47b50>

This command does two things: it opens the SQLite database file `farmers_market.db` and creates a `Connection` object, which is then used to manage interactions with the database.

As an example, let's try to retrieve all of the rows and columns of the `product` table. We first write the query as,

In [4]:
query = "SELECT * FROM product"

Then execute the query on the established connection with the database.

In [5]:
cursor = conn.execute(query)
cursor

<sqlite3.Cursor at 0x10eecb240>

Upon execution, it returns a `Cursor` object, which we can use to retrieve the result. To get the rows returned by the query, we can use the `fetchall` method of the cursor.

In [6]:
rows = cursor.fetchall()
rows

[(1, 'Habanero Peppers - Organic', 'medium', 1, 'lbs'),
 (2, 'Jalapeno Peppers - Organic', 'small', 1, 'lbs'),
 (3, 'Poblano Peppers - Organic', 'large', 1, 'unit'),
 (4, 'Banana Peppers - Jar', '8 oz', 3, 'unit'),
 (5, 'Whole Wheat Bread', '1.5 lbs', 3, 'unit'),
 (6, 'Cut Zinnias Bouquet', 'medium', 5, 'unit'),
 (7, 'Apple Pie', '10"', 3, 'unit'),
 (8, 'Cherry Pie', '10"', 3, 'unit'),
 (9, 'Sweet Potatoes', 'medium', 1, 'lbs'),
 (10, 'Eggs', '1 dozen', 6, 'unit'),
 (11, 'Pork Chops', '1 lb', 6, 'lbs'),
 (12, 'Baby Salad Lettuce Mix - Bag', '1/2 lb', 1, 'unit'),
 (13, 'Baby Salad Lettuce Mix', '1 lb', 1, 'lbs'),
 (14, 'Red Potatoes', None, 1, None),
 (15, 'Red Potatoes - Small', ' ', 1, None),
 (16, 'Sweet Corn', 'Ear', 1, 'unit'),
 (17, 'Carrots', 'sold by weight', 1, 'lbs'),
 (18, 'Carrots - Organic', 'bunch', 1, 'unit'),
 (19, 'Farmer`s Market Resuable Shopping Bag', 'medium', 7, 'unit'),
 (20, 'Homemade Beeswax Candles', '6""', 7, 'unit'),
 (21, 'Organic Cherry Tomatoes', 'pint', 1

If we want to convert this list of tuples into a `pandas` DataFrame, we will also need to specify the column names, which we can get from the `description` attribute of the cursor.

In [9]:
columns = [description[0] for description in cursor.description]
columns

['product_id',
 'product_name',
 'product_size',
 'product_category_id',
 'product_qty_type']

Finally, we convert the result into a dataframe by specifying the rows and columns of the table.

In [11]:
pd.DataFrame(rows, columns=columns)

Unnamed: 0,product_id,product_name,product_size,product_category_id,product_qty_type
0,1,Habanero Peppers - Organic,medium,1,lbs
1,2,Jalapeno Peppers - Organic,small,1,lbs
2,3,Poblano Peppers - Organic,large,1,unit
3,4,Banana Peppers - Jar,8 oz,3,unit
4,5,Whole Wheat Bread,1.5 lbs,3,unit
5,6,Cut Zinnias Bouquet,medium,5,unit
6,7,Apple Pie,"10""",3,unit
7,8,Cherry Pie,"10""",3,unit
8,9,Sweet Potatoes,medium,1,lbs
9,10,Eggs,1 dozen,6,unit


Using `sqlite3` allows us to interact with the database at a low level, meaning we can write SQL statements that Python executes directly using SQLite.
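The steps above can be condensed into a short, self-contained sketch. Since the `farmers_market.db` file may not be available, it assumes a small in-memory database with a hypothetical two-column `product` table as a stand-in.

```python
import sqlite3

# Hypothetical in-memory database standing in for data/farmers_market.db
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (product_id INTEGER, product_name TEXT)")
conn.executemany(
    "INSERT INTO product VALUES (?, ?)",
    [(1, "Habanero Peppers - Organic"), (2, "Jalapeno Peppers - Organic")],
)

cursor = conn.execute("SELECT * FROM product")
# The column name is the first item of each entry in cursor.description
columns = [description[0] for description in cursor.description]
rows = cursor.fetchall()

# Pair each row with the column names to get one dict per record
records = [dict(zip(columns, row)) for row in rows]
print(records)

conn.close()  # release the database once we are done
```

Closing the connection with `conn.close()` once the results have been fetched is a good habit, especially when working with a database file on disk.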

### Using `sqlalchemy`

If we're primarily interested in reading data from a database and performing subsequent analyses using `pandas`, `sqlalchemy` conveniently abstracts some of the steps so that we can run a query and retrieve its result quickly.

As a prerequisite, make sure that `sqlalchemy` is installed in your environment. If it isn't installed yet, you may uncomment and run the cell below.

In [None]:
# !conda install sqlalchemy -y

In [12]:
import sqlalchemy as sqla

As with `sqlite3`, the first step is to point Python to the local database. In `sqlalchemy`, we do this by creating an engine, which manages the connections to the database for us.

In [13]:
db = sqla.create_engine('sqlite:///data/farmers_market.db')

We'll write the same query as before:

In [15]:
query = """
        SELECT
            *
        FROM product
        """

Then we execute the query and load its result directly into a `pandas` DataFrame using the `read_sql` function.

In [16]:
product = pd.read_sql(query, db)
product

Unnamed: 0,product_id,product_name,product_size,product_category_id,product_qty_type
0,1,Habanero Peppers - Organic,medium,1,lbs
1,2,Jalapeno Peppers - Organic,small,1,lbs
2,3,Poblano Peppers - Organic,large,1,unit
3,4,Banana Peppers - Jar,8 oz,3,unit
4,5,Whole Wheat Bread,1.5 lbs,3,unit
5,6,Cut Zinnias Bouquet,medium,5,unit
6,7,Apple Pie,"10""",3,unit
7,8,Cherry Pie,"10""",3,unit
8,9,Sweet Potatoes,medium,1,lbs
9,10,Eggs,1 dozen,6,unit


When it comes to reading data from SQL databases, this approach is much more convenient as it simplifies the reading process. However, if we need to perform low-level database manipulation and operations, it is much more straightforward to use `sqlite3`.

## 2 The `SELECT` Statement

The majority of the queries that we'll be making will be `SELECT` statements, since our primary goal most of the time is to retrieve data from a database. When used with other SQL keywords, `SELECT` can be used to view selected columns of a table, combine multiple tables, filter the results, perform calculations, and more!

### The `SELECT` Statement

The fundamental syntax structure of a `SELECT` query is as follows:

```sql
SELECT <columns to return>
FROM <table>
WHERE <conditional filter statements>
GROUP BY <columns to group on>
HAVING <conditional filter statements that run after grouping>
ORDER BY <columns to sort on>
```

Here, the `SELECT` and `FROM` clauses are required since they indicate which columns to select and from what table.

The simplest `SELECT` statement is:
```sql
SELECT * FROM <table>
```

Using this query, all of the columns and rows from the specified table will be returned.

To select all the rows and columns of the table `product`, we use the query:

In [18]:
query = """
        SELECT *
        FROM product
        """
pd.read_sql(query, db)

Unnamed: 0,product_id,product_name,product_size,product_category_id,product_qty_type
0,1,Habanero Peppers - Organic,medium,1,lbs
1,2,Jalapeno Peppers - Organic,small,1,lbs
2,3,Poblano Peppers - Organic,large,1,unit
3,4,Banana Peppers - Jar,8 oz,3,unit
4,5,Whole Wheat Bread,1.5 lbs,3,unit
5,6,Cut Zinnias Bouquet,medium,5,unit
6,7,Apple Pie,"10""",3,unit
7,8,Cherry Pie,"10""",3,unit
8,9,Sweet Potatoes,medium,1,lbs
9,10,Eggs,1 dozen,6,unit


Usually, we don't want to return all the rows and columns, as the query might take too long to run. To restrict the output, we can use the `LIMIT` clause to specify the maximum number of rows to show in the result.

It is also considered best practice to explicitly list the names of columns instead of using the `*`.

In [25]:
query = """
        SELECT
            product_id,
            product_name
        FROM product
        LIMIT 5
        """
pd.read_sql(query, db)

Unnamed: 0,product_id,product_name
0,1,Habanero Peppers - Organic
1,2,Jalapeno Peppers - Organic
2,3,Poblano Peppers - Organic
3,4,Banana Peppers - Jar
4,5,Whole Wheat Bread


### Sorting Results using `ORDER BY`

To sort the output rows, we can use the `ORDER BY` clause together with the column you want to sort on. You can also specify whether you want the sorting to be in ascending or descending order by adding `ASC` or `DESC` after the column name.

In [27]:
query = """
        SELECT
            product_id,
            product_name
        FROM product
        ORDER BY
            product_id DESC
        LIMIT 5
        """
pd.read_sql(query, db)

Unnamed: 0,product_id,product_name
0,23,Maple Syrup - Jar
1,22,Roma Tomatoes
2,21,Organic Cherry Tomatoes
3,20,Homemade Beeswax Candles
4,19,Farmer`s Market Resuable Shopping Bag


We can also specify multiple columns in the `ORDER BY` clause.

In [32]:
query = """
        SELECT
            market_date,
            customer_id,
            product_id
        FROM customer_purchases
        ORDER BY market_date DESC, customer_id
        LIMIT 10
        """
pd.read_sql(query, db)

Unnamed: 0,market_date,customer_id,product_id
0,2020-10-10,1,4
1,2020-10-10,1,4
2,2020-10-10,1,5
3,2020-10-10,2,4
4,2020-10-10,2,5
5,2020-10-10,2,7
6,2020-10-10,2,7
7,2020-10-10,5,4
8,2020-10-10,5,4
9,2020-10-10,5,7


The same ordering can be reproduced in `pandas` using `sort_values`:

In [33]:
query = """
        SELECT
            market_date,
            customer_id,
            product_id
        FROM customer_purchases
        """
pd.read_sql(query, db).sort_values(by=['market_date', 'customer_id'],
                                   ascending=[False, True]).head(10)

Unnamed: 0,market_date,customer_id,product_id
1806,2020-10-10,1,4
1807,2020-10-10,1,4
2601,2020-10-10,1,5
1808,2020-10-10,2,4
2602,2020-10-10,2,5
3171,2020-10-10,2,7
3172,2020-10-10,2,7
1809,2020-10-10,5,4
1810,2020-10-10,5,4
3173,2020-10-10,5,7


The order of the columns specified in the `ORDER BY` clause is important, as it determines which column is sorted on first.

In [34]:
query = """
        SELECT
            market_date,
            customer_id,
            product_id
        FROM customer_purchases
        ORDER BY customer_id, market_date DESC
        LIMIT 10
        """
pd.read_sql(query, db)

Unnamed: 0,market_date,customer_id,product_id
0,2020-10-10,1,4
1,2020-10-10,1,4
2,2020-10-10,1,5
3,2020-10-07,1,4
4,2020-10-07,1,4
5,2020-10-07,1,8
6,2020-09-30,1,2
7,2020-09-30,1,3
8,2020-09-30,1,3
9,2020-09-26,1,3
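
To make the effect of column order easier to see, here is a minimal sketch on a hypothetical four-row `purchases` table (an in-memory stand-in, not the notebook's `customer_purchases` table). The same rows are sorted twice, swapping only the order of the columns in `ORDER BY`.

```python
import sqlite3

# Hypothetical in-memory table used only for illustration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (market_date TEXT, customer_id INTEGER)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [("2020-10-07", 2), ("2020-10-10", 2), ("2020-10-07", 1), ("2020-10-10", 1)],
)

# Sort by date (latest first), then break ties by customer
by_date_first = conn.execute(
    "SELECT market_date, customer_id FROM purchases "
    "ORDER BY market_date DESC, customer_id"
).fetchall()

# Sort by customer, then by date (latest first) within each customer
by_customer_first = conn.execute(
    "SELECT market_date, customer_id FROM purchases "
    "ORDER BY customer_id, market_date DESC"
).fetchall()

print(by_date_first)
print(by_customer_first)
```

In the first result, the rows are grouped by date; in the second, all rows of customer 1 come before those of customer 2.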


### Retrieving Distinct Rows using `DISTINCT`

If you want to retrieve the unique values of a column (or set of columns), you can use the `DISTINCT` keyword right after `SELECT`.

In [39]:
query = """
        SELECT DISTINCT
            vendor_type
        FROM vendor
        """
pd.read_sql(query, db)

Unnamed: 0,vendor_type
0,Eggs & Meats
1,Fresh Variety: Veggies & More
2,Fresh Focused
3,Arts & Jewelry
4,Prepared Foods


In [42]:
query = """
        SELECT DISTINCT
            market_date,
            customer_id
        FROM customer_purchases
        ORDER BY customer_id
        """
pd.read_sql(query, db)

Unnamed: 0,market_date,customer_id
0,2019-07-20,1
1,2020-07-11,1
2,2020-07-22,1
3,2020-08-26,1
4,2020-09-05,1
...,...,...
2013,2020-10-10,26
2014,2019-07-17,26
2015,2019-09-18,26
2016,2019-07-13,26


In `pandas`, the counterpart for a single column is the `Series.unique` method:

In [44]:
pd.Series.unique

<function pandas.core.series.Series.unique(self) -> 'ArrayLike'>
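
As a small illustration of how `DISTINCT` behaves with one column versus a set of columns, the sketch below uses a hypothetical in-memory `purchases` table containing a repeated row. With several columns, `DISTINCT` removes rows only when the whole combination of values is repeated.

```python
import sqlite3

# Hypothetical in-memory table; the first two rows are exact duplicates
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (market_date TEXT, customer_id INTEGER)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [("2020-10-07", 1), ("2020-10-07", 1), ("2020-10-10", 1), ("2020-10-10", 2)],
)

# Unique values of a single column
distinct_customers = conn.execute(
    "SELECT DISTINCT customer_id FROM purchases ORDER BY customer_id"
).fetchall()

# Unique combinations of two columns
distinct_pairs = conn.execute(
    "SELECT DISTINCT market_date, customer_id FROM purchases "
    "ORDER BY market_date, customer_id"
).fetchall()

print(distinct_customers)
print(distinct_pairs)
```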

### Performing Simple Inline Calculations

In addition to selecting columns already present in the table, we can also create and retrieve columns computed from two or more different columns.

As an example, we can compute the amount in dollars spent by a customer on a purchase by multiplying the `quantity` column by the `cost_to_customer_per_qty` column.

In [46]:
query = """
        SELECT
            market_date,
            customer_id,
            vendor_id,
            quantity,
            cost_to_customer_per_qty,
            quantity * cost_to_customer_per_qty
        FROM customer_purchases
        LIMIT 10
        """
pd.read_sql(query, db)

Unnamed: 0,market_date,customer_id,vendor_id,quantity,cost_to_customer_per_qty,quantity * cost_to_customer_per_qty
0,2019-07-03,14,7,0.99,6.99,6.9201
1,2019-07-03,14,7,2.18,6.99,15.2382
2,2019-07-03,15,7,1.53,6.99,10.6947
3,2019-07-03,16,7,2.02,6.99,14.1198
4,2019-07-03,22,7,0.66,6.99,4.6134
5,2019-07-06,4,7,0.27,6.99,1.8873
6,2019-07-06,12,7,3.6,6.99,25.164
7,2019-07-06,14,7,3.04,6.99,21.2496
8,2019-07-06,23,7,1.49,6.99,10.4151
9,2019-07-06,23,7,2.56,6.99,17.8944


We can give the calculated column a meaningful name by assigning it an alias: add the keyword `AS` after the calculation, followed by the desired name.

In [48]:
query = """
        SELECT
            market_date,
            customer_id,
            vendor_id,
            quantity * cost_to_customer_per_qty AS price
        FROM customer_purchases
        LIMIT 10
        """
pd.read_sql(query, db)

Unnamed: 0,market_date,customer_id,vendor_id,price
0,2019-07-03,14,7,6.9201
1,2019-07-03,14,7,15.2382
2,2019-07-03,15,7,10.6947
3,2019-07-03,16,7,14.1198
4,2019-07-03,22,7,4.6134
5,2019-07-06,4,7,1.8873
6,2019-07-06,12,7,25.164
7,2019-07-06,14,7,21.2496
8,2019-07-06,23,7,10.4151
9,2019-07-06,23,7,17.8944


Most arithmetic operations such as `+`, `-`, `/`, `*`, and `%` are available in SQL.
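One detail worth checking when using these operators in SQLite is division: dividing two integers yields an integer, while involving a real number yields a real result. The sketch below evaluates a few constant expressions on an in-memory database, so no table is needed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Evaluate arithmetic expressions directly; no table is required
row = conn.execute(
    "SELECT 7 + 2, 7 - 2, 7 * 2, 7 / 2, 7 / 2.0, 7 % 2"
).fetchone()

print(row)  # (9, 5, 14, 3, 3.5, 1)
```

Note that `7 / 2` gives `3` rather than `3.5`. If a column stores integers and you need a fractional result, one option is to multiply by `1.0` or divide by a real number.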

### Performing Inline Calculations using Functions

Furthermore, SQL also has functions that can help manipulate data of different types effectively. The syntax of an SQL function is:

```
FUNCTION_NAME([parameter 1], [parameter 2], ..., [parameter n])
```

As an example, the `ROUND` function can be used to round a number.

In [49]:
query = """
        SELECT
            market_date,
            customer_id,
            vendor_id,
            ROUND(quantity * cost_to_customer_per_qty, 2) AS price
        FROM customer_purchases
        LIMIT 10
        """
pd.read_sql(query, db)

Unnamed: 0,market_date,customer_id,vendor_id,price
0,2019-07-03,14,7,6.92
1,2019-07-03,14,7,15.24
2,2019-07-03,15,7,10.69
3,2019-07-03,16,7,14.12
4,2019-07-03,22,7,4.61
5,2019-07-06,4,7,1.89
6,2019-07-06,12,7,25.16
7,2019-07-06,14,7,21.25
8,2019-07-06,23,7,10.42
9,2019-07-06,23,7,17.89


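Another example is the `CONCAT` function, which allows you to concatenate several strings. Note that SQLite only provides `CONCAT` starting from version 3.44.0; in older versions, use the `||` operator instead.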

In [55]:
query = """
        SELECT
            customer_first_name,
            customer_last_name,
            CONCAT(customer_first_name, ' ', customer_last_name) AS customer_name
        FROM customer
        LIMIT 10
        """
pd.read_sql(query, db)

Unnamed: 0,customer_first_name,customer_last_name,customer_name
0,Jane,Connor,Jane Connor
1,Manuel,Diaz,Manuel Diaz
2,Bob,Wilson,Bob Wilson
3,Deanna,Washington,Deanna Washington
4,Abigail,Harris,Abigail Harris
5,Betty,Bullard,Betty Bullard
6,Jessica,Armenta,Jessica Armenta
7,Norma,Valenzuela,Norma Valenzuela
8,Janet,Forbes,Janet Forbes
9,Russell,Edwards,Russell Edwards


The list of functions available is dependent on the RDBMS that you're using. For SQLite, you may refer to [https://sqlite.org/lang_corefunc.html](https://sqlite.org/lang_corefunc.html) for a comprehensive list of built-in functions.
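As a quick way to try out a few of SQLite's built-in functions, the sketch below calls them on constant values using an in-memory database. It uses the `||` concatenation operator rather than `CONCAT`, since `||` is available regardless of the SQLite version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Call a few built-in scalar functions on constant values
row = conn.execute(
    "SELECT ROUND(3.14159, 2), UPPER('eggs'), LENGTH('eggs'), "
    "'Jane' || ' ' || 'Connor'"
).fetchone()

print(row)  # (3.14, 'EGGS', 4, 'Jane Connor')
```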

## 3 The `WHERE` Clause

To filter the rows of data to include in the result, we can use the `WHERE` clause with conditional statements similar to how we index `DataFrame`s using a `boolean` array.

### Filtering Results using the `WHERE` clause

The `WHERE` clause is an optional clause within the `SELECT` statement. Its syntax is as follows:

```sql
SELECT <columns to return>
FROM <table>
WHERE <conditional filter statements>
ORDER BY <columns to sort on>
```

Let's say we want to view products with a `product_category_id` of 1:

In [57]:
query = """
        SELECT
            product_id,
            product_name,
            product_category_id
        FROM product
        WHERE product_category_id = 1
        """
pd.read_sql(query, db)

Unnamed: 0,product_id,product_name,product_category_id
0,1,Habanero Peppers - Organic,1
1,2,Jalapeno Peppers - Organic,1
2,3,Poblano Peppers - Organic,1
3,9,Sweet Potatoes,1
4,12,Baby Salad Lettuce Mix - Bag,1
5,13,Baby Salad Lettuce Mix,1
6,14,Red Potatoes,1
7,15,Red Potatoes - Small,1
8,16,Sweet Corn,1
9,17,Carrots,1


SQL provides a whole range of conditional operators, some of which are listed below:

| Operator | Description |
| --------------- | ----------- |
| `=` or `==` | Equality |
| `<>` or `!=` | Nonequality |
| `<` | Less than |
| `<=` | Less than or equal to |
| `>` | Greater than |
| `>=` | Greater than or equal to |
| `BETWEEN` | Between two specified values |
| `IS NULL` | Is a `NULL` value |
| `IS NOT NULL` | Is NOT a `NULL` value |

In [63]:
query = """
        SELECT
            market_date,
            customer_id,
            vendor_id,
            quantity,
            cost_to_customer_per_qty,
            ROUND(quantity * cost_to_customer_per_qty, 2) AS price
        FROM customer_purchases
        WHERE
            price BETWEEN 10 AND 20
        ORDER BY price DESC
        LIMIT 10
        """
pd.read_sql(query, db)

Unnamed: 0,market_date,customer_id,vendor_id,quantity,cost_to_customer_per_qty,price
0,2019-04-03,7,7,5,4,20.0
1,2019-04-06,2,7,5,4,20.0
2,2019-04-06,12,7,5,4,20.0
3,2019-04-06,12,7,5,4,20.0
4,2019-04-06,16,7,5,4,20.0
5,2019-04-10,4,7,5,4,20.0
6,2019-04-10,5,7,5,4,20.0
7,2019-04-13,5,7,5,4,20.0
8,2019-04-13,13,7,5,4,20.0
9,2019-04-24,3,7,5,4,20.0


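We can use the `BETWEEN` operator to select rows with values within the specified range, including both endpoints. The query above also filters on the alias `price`; SQLite allows this, but many other RDBMSs do not accept column aliases in the `WHERE` clause.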
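
To check how `BETWEEN` treats the endpoints, here is a minimal sketch on a hypothetical in-memory `purchases` table with prices just below, on, and just above the bounds. It also compares the result with the equivalent pair of `>=` and `<=` conditions.

```python
import sqlite3

# Hypothetical in-memory table with prices around the bounds 10 and 20
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (price REAL)")
conn.executemany(
    "INSERT INTO purchases VALUES (?)",
    [(9.99,), (10.0,), (15.5,), (20.0,), (20.01,)],
)

between = conn.execute(
    "SELECT price FROM purchases WHERE price BETWEEN 10 AND 20 ORDER BY price"
).fetchall()

# Equivalent formulation using explicit comparisons
explicit = conn.execute(
    "SELECT price FROM purchases WHERE price >= 10 AND price <= 20 ORDER BY price"
).fetchall()

print(between)  # [(10.0,), (15.5,), (20.0,)]
```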

It's often useful to find rows in the database where a field is blank or `NULL`. To do this, we use the `IS NULL` operator.

In [64]:
query = """
        SELECT
            product_id,
            product_name,
            product_size
        FROM product
        WHERE product_size IS NULL
        """
pd.read_sql(query, db)

Unnamed: 0,product_id,product_name,product_size
0,14,Red Potatoes,


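Notice that there is a difference between a "blank" value and a `NULL` value. The `product_size` of product 15 is a string containing a single space, so it is not matched by `IS NULL`. To capture both cases, we can also check whether the value is an empty string after removing the surrounding whitespace with `TRIM`.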

In [66]:
query = """
        SELECT
            product_id,
            product_name,
            product_size
        FROM product
        WHERE
            product_size IS NULL
            OR TRIM(product_size) = ''
        """
pd.read_sql(query, db)

Unnamed: 0,product_id,product_name,product_size
0,14,Red Potatoes,
1,15,Red Potatoes - Small,


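Another important thing to note is that `NULL` values can't be compared with numbers or strings in any way. A comparison with `NULL` using `=` or `!=` evaluates to `NULL` rather than true, so the row is filtered out. Observe the next series of queries: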

In [67]:
query = """
        SELECT
            product_id,
            product_name,
            product_size
        FROM product
        """
pd.read_sql(query, db)

Unnamed: 0,product_id,product_name,product_size
0,1,Habanero Peppers - Organic,medium
1,2,Jalapeno Peppers - Organic,small
2,3,Poblano Peppers - Organic,large
3,4,Banana Peppers - Jar,8 oz
4,5,Whole Wheat Bread,1.5 lbs
5,6,Cut Zinnias Bouquet,medium
6,7,Apple Pie,"10"""
7,8,Cherry Pie,"10"""
8,9,Sweet Potatoes,medium
9,10,Eggs,1 dozen


In [70]:
query = """
        SELECT
            product_id,
            product_name,
            product_size
        FROM product
        WHERE
            product_size = 'small'
            OR product_size != 'small'
        """
pd.read_sql(query, db)

Unnamed: 0,product_id,product_name,product_size
0,1,Habanero Peppers - Organic,medium
1,2,Jalapeno Peppers - Organic,small
2,3,Poblano Peppers - Organic,large
3,4,Banana Peppers - Jar,8 oz
4,5,Whole Wheat Bread,1.5 lbs
5,6,Cut Zinnias Bouquet,medium
6,7,Apple Pie,"10"""
7,8,Cherry Pie,"10"""
8,9,Sweet Potatoes,medium
9,10,Eggs,1 dozen


In [71]:
query = """
        SELECT
            product_id,
            product_name,
            product_size
        FROM product
        WHERE
            product_size = 'small'
            OR product_size != 'small'
            OR product_size IS NULL
        """
pd.read_sql(query, db)

Unnamed: 0,product_id,product_name,product_size
0,1,Habanero Peppers - Organic,medium
1,2,Jalapeno Peppers - Organic,small
2,3,Poblano Peppers - Organic,large
3,4,Banana Peppers - Jar,8 oz
4,5,Whole Wheat Bread,1.5 lbs
5,6,Cut Zinnias Bouquet,medium
6,7,Apple Pie,"10"""
7,8,Cherry Pie,"10"""
8,9,Sweet Potatoes,medium
9,10,Eggs,1 dozen
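
Because the outputs above show only the first few rows, the difference between the queries is easy to miss. The sketch below repeats the same three filters on a hypothetical four-row in-memory `product` table, where one size is `NULL` and another is a single space, and collects the matching `product_id` values.

```python
import sqlite3

# Hypothetical in-memory table: product 3 has a NULL size, product 4 a blank one
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (product_id INTEGER, product_size TEXT)")
conn.executemany(
    "INSERT INTO product VALUES (?, ?)",
    [(1, "small"), (2, "medium"), (3, None), (4, " ")],
)

def matching_ids(condition):
    """Return the product_id values of rows that satisfy the given WHERE condition."""
    sql = f"SELECT product_id FROM product WHERE {condition} ORDER BY product_id"
    return [row[0] for row in conn.execute(sql)]

# The NULL row satisfies neither = nor !=, so it is dropped
without_null = matching_ids("product_size = 'small' OR product_size != 'small'")

# Adding IS NULL brings the NULL row back
with_null = matching_ids(
    "product_size = 'small' OR product_size != 'small' OR product_size IS NULL"
)

# NULL or blank sizes only
null_or_blank = matching_ids("product_size IS NULL OR TRIM(product_size) = ''")

print(without_null)   # [1, 2, 4]
print(with_null)      # [1, 2, 3, 4]
print(null_or_blank)  # [3, 4]
```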


### Filtering on Multiple Conditions

You can also define compound conditional statements using the usual logical operators `AND`, `OR`, and `NOT`. When doing so, it is good practice to group conditional statements using parentheses so that the order of evaluation is explicit.

In [75]:
query = """
        SELECT
            market_date,
            customer_id,
            vendor_id,
            quantity,
            cost_to_customer_per_qty,
            ROUND(quantity * cost_to_customer_per_qty, 2) AS price
        FROM customer_purchases
        WHERE
            (customer_id = 4 OR customer_id = 7)
            AND vendor_id = 7
        ORDER BY price DESC
        LIMIT 10
        """
pd.read_sql(query, db)

Unnamed: 0,market_date,customer_id,vendor_id,quantity,cost_to_customer_per_qty,price
0,2020-07-25,7,7,5.71,6.99,39.91
1,2020-08-05,7,7,4.92,6.99,34.39
2,2020-08-22,4,7,4.56,6.99,31.87
3,2020-07-08,4,7,4.44,6.99,31.04
4,2020-09-09,4,7,4.14,6.99,28.94
5,2020-08-22,4,7,3.47,6.99,24.26
6,2020-09-02,4,7,3.31,6.99,23.14
7,2020-08-26,4,7,3.11,6.99,21.74
8,2019-09-11,7,7,6.14,3.49,21.43
9,2019-07-17,4,7,3.03,6.99,21.18
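
To see why the parentheses matter, recall that `AND` is evaluated before `OR`. The sketch below runs the same condition with and without parentheses on a hypothetical in-memory `purchases` table holding only the two ID columns.

```python
import sqlite3

# Hypothetical in-memory table with customer and vendor IDs only
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer_id INTEGER, vendor_id INTEGER)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [(4, 7), (4, 8), (7, 7), (7, 8), (5, 7)],
)

# Customers 4 or 7, restricted to vendor 7
grouped = conn.execute(
    "SELECT customer_id, vendor_id FROM purchases "
    "WHERE (customer_id = 4 OR customer_id = 7) AND vendor_id = 7 "
    "ORDER BY customer_id, vendor_id"
).fetchall()

# Without parentheses this reads as: customer 4 (any vendor),
# OR (customer 7 AND vendor 7)
ungrouped = conn.execute(
    "SELECT customer_id, vendor_id FROM purchases "
    "WHERE customer_id = 4 OR customer_id = 7 AND vendor_id = 7 "
    "ORDER BY customer_id, vendor_id"
).fetchall()

print(grouped)    # [(4, 7), (7, 7)]
print(ungrouped)  # [(4, 7), (4, 8), (7, 7)]
```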


### Using the `IN` operator

If you want to specify a list of values, any of which may be matched, you can use the `IN` operator.

In [76]:
query = """
        SELECT
            market_date,
            customer_id,
            vendor_id,
            quantity,
            cost_to_customer_per_qty,
            ROUND(quantity * cost_to_customer_per_qty, 2) AS price
        FROM customer_purchases
        WHERE
            customer_id IN (4, 7)
            AND vendor_id = 7
        ORDER BY price DESC
        LIMIT 10
        """
pd.read_sql(query, db)

Unnamed: 0,market_date,customer_id,vendor_id,quantity,cost_to_customer_per_qty,price
0,2020-07-25,7,7,5.71,6.99,39.91
1,2020-08-05,7,7,4.92,6.99,34.39
2,2020-08-22,4,7,4.56,6.99,31.87
3,2020-07-08,4,7,4.44,6.99,31.04
4,2020-09-09,4,7,4.14,6.99,28.94
5,2020-08-22,4,7,3.47,6.99,24.26
6,2020-09-02,4,7,3.31,6.99,23.14
7,2020-08-26,4,7,3.11,6.99,21.74
8,2019-09-11,7,7,6.14,3.49,21.43
9,2019-07-17,4,7,3.03,6.99,21.18


Another way of using `IN` is to use the result of a subquery as the list of values to match against. For example, suppose we want to retrieve customer purchases made on dates where `market_rain_flag` is 1. We can write a subquery that lists all the dates with a `market_rain_flag` of 1, then use its result as the list for the `IN` operator.

In [78]:
query = """
        SELECT
            market_date,
            customer_id,
            vendor_id,
            quantity,
            cost_to_customer_per_qty,
            ROUND(quantity * cost_to_customer_per_qty, 2) AS price
        FROM customer_purchases
        WHERE market_date IN (
            SELECT market_date
            FROM market_date_info
            WHERE market_rain_flag = 1
        )
        ORDER BY price DESC
        LIMIT 10
        """
pd.read_sql(query, db)

Unnamed: 0,market_date,customer_id,vendor_id,quantity,cost_to_customer_per_qty,price
0,2019-10-19,23,8,5,18,90.0
1,2019-12-11,23,8,4,18,72.0
2,2020-05-09,22,8,4,18,72.0
3,2020-05-09,24,8,4,18,72.0
4,2020-09-30,19,8,4,18,72.0
5,2020-05-06,24,8,4,18,72.0
6,2020-05-27,24,8,4,18,72.0
7,2020-04-01,3,8,3,18,54.0
8,2020-05-09,5,8,3,18,54.0
9,2020-07-11,16,8,3,18,54.0


To check the result, we can confirm that the date of the top purchase is indeed flagged as rainy:

In [80]:
query = """
        SELECT market_date, market_rain_flag
        FROM market_date_info
        WHERE market_date = '2019-10-19'
        """
pd.read_sql(query, db)

Unnamed: 0,market_date,market_rain_flag
0,2019-10-19,1


In [81]:
query = """
        SELECT market_date, market_rain_flag
        FROM market_date_info
        WHERE market_rain_flag = 1
        """
pd.read_sql(query, db)

Unnamed: 0,market_date,market_rain_flag
0,2019-03-20,1
1,2019-03-23,1
2,2019-03-30,1
3,2019-07-31,1
4,2019-09-21,1
5,2019-10-19,1
6,2019-12-04,1
7,2019-12-11,1
8,2020-04-01,1
9,2020-04-18,1


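Note that we cannot filter on `market_rain_flag` directly in this query, since it is not a column of `customer_purchases`. Doing so raises an error:

In [82]: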
query = """
        SELECT
            market_date,
            customer_id,
            vendor_id,
            quantity,
            cost_to_customer_per_qty,
            ROUND(quantity * cost_to_customer_per_qty, 2) AS price
        FROM customer_purchases
        WHERE market_rain_flag = 1

        ORDER BY price DESC
        LIMIT 10
        """
pd.read_sql(query, db)

OperationalError: (sqlite3.OperationalError) no such column: market_rain_flag
[SQL: 
        SELECT
            market_date,
            customer_id,
            vendor_id,
            quantity,
            cost_to_customer_per_qty,
            ROUND(quantity * cost_to_customer_per_qty, 2) AS price
        FROM customer_purchases
        WHERE market_rain_flag = 1

        ORDER BY price DESC
        LIMIT 10
        ]
(Background on this error at: https://sqlalche.me/e/20/e3q8)
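
The contrast between the subquery and the direct filter can be reproduced in a self-contained sketch. It assumes two hypothetical in-memory tables that share the names of the notebook's tables but hold only a couple of made-up rows and columns.

```python
import sqlite3

# Hypothetical in-memory stand-ins for the two tables (made-up rows)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE market_date_info (market_date TEXT, market_rain_flag INTEGER)"
)
conn.execute(
    "CREATE TABLE customer_purchases (market_date TEXT, customer_id INTEGER)"
)
conn.executemany(
    "INSERT INTO market_date_info VALUES (?, ?)",
    [("2019-10-19", 1), ("2019-10-23", 0)],
)
conn.executemany(
    "INSERT INTO customer_purchases VALUES (?, ?)",
    [("2019-10-19", 23), ("2019-10-23", 5), ("2019-10-19", 7)],
)

# The subquery supplies the list of rainy dates to the IN operator
rainy_purchases = conn.execute("""
    SELECT market_date, customer_id
    FROM customer_purchases
    WHERE market_date IN (
        SELECT market_date
        FROM market_date_info
        WHERE market_rain_flag = 1
    )
    ORDER BY customer_id
""").fetchall()
print(rainy_purchases)  # [('2019-10-19', 7), ('2019-10-19', 23)]

# Filtering on a column that belongs to the other table fails
try:
    conn.execute("SELECT * FROM customer_purchases WHERE market_rain_flag = 1")
    error_message = None
except sqlite3.OperationalError as exc:
    error_message = str(exc)
print(error_message)
```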

### Filtering on string data type

When working with strings, filtering using wildcards (similar to metacharacters in RegEx) is also possible in SQL through the `LIKE` operator. The two most common wildcards in SQL are `%` and `_`. The `%` (percent symbol) matches any sequence of zero or more characters in the string. An `_` (underscore) in the `LIKE` pattern matches any single character in the string. Note that in SQLite, `LIKE` is case-insensitive for ASCII characters by default.

In [84]:
query = """
        SELECT
            customer_id,
            customer_first_name,
            customer_last_name
        FROM customer
        WHERE
            customer_first_name LIKE 'Jer%'
        """
pd.read_sql(query, db)

Unnamed: 0,customer_id,customer_first_name,customer_last_name
0,13,Jeremy,Gruber
1,18,Jeri,Mitchell


In [87]:
query = """
        SELECT
            customer_id,
            customer_first_name,
            customer_last_name
        FROM customer
        WHERE
            customer_first_name LIKE 'jer_'
        """
pd.read_sql(query, db)

Unnamed: 0,customer_id,customer_first_name,customer_last_name
0,18,Jeri,Mitchell


In [88]:
query = """
        SELECT
            customer_id,
            customer_first_name,
            customer_last_name
        FROM customer
        WHERE
            customer_first_name LIKE 'jer___'
        """
pd.read_sql(query, db)

Unnamed: 0,customer_id,customer_first_name,customer_last_name
0,13,Jeremy,Gruber


In [90]:
query = """
        SELECT
            customer_id,
            customer_first_name,
            customer_last_name
        FROM customer
        WHERE
            customer_first_name LIKE '%er_'
        """
pd.read_sql(query, db)

Unnamed: 0,customer_id,customer_first_name,customer_last_name
0,18,Jeri,Mitchell
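
The four patterns above can be compared side by side in a small sketch. It assumes a hypothetical in-memory `customer` table with three first names, and passes each pattern as a query parameter through a helper function named `match` (introduced here only for illustration).

```python
import sqlite3

# Hypothetical in-memory table with a few first names
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (customer_first_name TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?)",
    [("Jeremy",), ("Jeri",), ("Jane",)],
)

def match(pattern):
    """Return the first names matching the given LIKE pattern, sorted by name."""
    sql = (
        "SELECT customer_first_name FROM customer "
        "WHERE customer_first_name LIKE ? "
        "ORDER BY customer_first_name"
    )
    return [row[0] for row in conn.execute(sql, (pattern,))]

print(match("Jer%"))    # ['Jeremy', 'Jeri']: anything starting with "Jer"
print(match("jer_"))    # ['Jeri']: "jer" plus exactly one character
print(match("jer___"))  # ['Jeremy']: "jer" plus exactly three characters
print(match("%er_"))    # ['Jeri']: "er" followed by exactly one final character
```

The lowercase patterns still match the capitalized names because of the default case-insensitive behavior of `LIKE` in SQLite.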


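Alternatively, you could use the `REGEXP` operator to specify a regular expression instead. Note that SQLite does not define a `regexp` function by default, so this operator only works if one has been made available in your environment.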

In [86]:
query = """
        SELECT
            customer_id,
            customer_first_name,
            customer_last_name
        FROM customer
        WHERE
            customer_first_name REGEXP '[aeiou]{2,}'
        """
pd.read_sql(query, db)

Unnamed: 0,customer_id,customer_first_name,customer_last_name
0,2,Manuel,Diaz
1,4,Deanna,Washington
2,5,Abigail,Harris
3,14,William,Lopes
4,20,Valerie,Loftis
5,21,Duane,Sipp
6,22,George,Rai
7,25,Bonnie,Hassan
8,26,Tracie,Goehring
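
If `REGEXP` is not available in your environment, one possible workaround with `sqlite3` is to register a Python function named `regexp` on the connection. The sketch below assumes a hypothetical in-memory `customer` table and backs the operator with Python's `re` module, so the pattern syntax is Python's rather than that of any particular SQLite extension.

```python
import re
import sqlite3

def regexp(pattern, value):
    """Back the REGEXP operator: SQLite calls regexp(pattern, value)
    when evaluating `value REGEXP pattern`."""
    if value is None:
        return 0
    return int(re.search(pattern, value) is not None)

# Hypothetical in-memory table with a few first names
conn = sqlite3.connect(":memory:")
conn.create_function("regexp", 2, regexp)
conn.execute("CREATE TABLE customer (customer_first_name TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?)",
    [("Manuel",), ("Jane",), ("Deanna",), ("Bob",)],
)

# Names containing two or more consecutive lowercase vowels
names = [
    row[0]
    for row in conn.execute(
        "SELECT customer_first_name FROM customer "
        "WHERE customer_first_name REGEXP '[aeiou]{2,}' "
        "ORDER BY customer_first_name"
    )
]
print(names)  # ['Deanna', 'Manuel']
```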


<img src="images/banner-down.png" style="width: 100%;">