# PyPika Tutorial: Slicing, Averaging, and Preserving Order

This notebook demonstrates how to use PyPika to build SQL queries that involve:
- Basic selection and filtering (slicing).
- Aggregation (averaging).
- Combining selection and averaging while ordering the result explicitly, so that downstream functions that depend on row order keep working.

Let's get started by installing and importing the necessary packages.

## Step 1: Installation and Imports

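First, make sure PyPika is installed, along with pandas and NumPy, which the cells below use. In a notebook cell, prefix the command with `!`:
```bash
pip install pypika pandas numpy
```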

Now, import PyPika and any other required packages:

In [1]:
from pypika import Table, Field, functions as fn
import sqlite3  # To simulate the database
import pandas as pd
import numpy as np

## Step 2: Creating a Sample Database

For demonstration purposes, we will create a small in-memory SQLite database with a table named `ds1` and populate it with data.

The dataset (`ds1`) represents a grid mesh where `x` and `y` each span -1 to 1, and `z` is a Gaussian response in `x` and `y` with added noise. Column `a` labels 10 repeat measurements, each with its own noise draw (the noise scale is the same for all of them), and `b` marks two states that flip the sign of the Gaussian part of `z`.
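
Written out, and reading the generation code in the next cell as drawing one noise value per `(x, y, a)` that both `b` states share, the model is roughly the following. The symbol $\varepsilon$ is notation introduced here for illustration, not a name used in the notebook.

```latex
z(x, y, a, b) = b \, e^{-(x^2 + y^2)} + \varepsilon_{x,y,a},
\qquad \varepsilon_{x,y,a} \sim \mathcal{N}(0,\ 0.1^2),
\qquad a \in \{1, \dots, 10\},\quad b \in \{+1, -1\}
```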


In [2]:
# Establish a connection to an in-memory SQLite database
connection = sqlite3.connect(':memory:')
cursor = connection.cursor()

# Create the "ds1" table
cursor.execute('''
CREATE TABLE ds1 (
    x REAL,
    y REAL,
    z REAL,
    a INTEGER,
    b INTEGER
);
''')

# Generate sample data
x_vals = np.linspace(-1, 1, 5)
y_vals = np.linspace(-1, 1, 5)
a_vals = range(1, 11)  # 10 copies with different noise
b_vals = [1, -1]  # Two states flipping the sign of z

sample_data = []
for x in x_vals:
    for y in y_vals:
        base_z = np.exp(-(x**2 + y**2))  # Gaussian response
        for a in a_vals:
            noise = np.random.normal(0, 0.1)  # One draw per (x, y, a); both b states share it
            for b in b_vals:
                z = base_z * b + noise
                sample_data.append((x, y, z, a, b))

# Insert the sample data into "ds1"
cursor.executemany('INSERT INTO ds1 VALUES (?, ?, ?, ?, ?)', sample_data)
connection.commit()
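
Before querying, it can help to confirm the table has the size you expect. The sketch below rebuilds a table of the same shape using only the standard library, so it runs on its own; `random.gauss` stands in for `np.random.normal`, and the hard-coded `grid` list is assumed to match `np.linspace(-1, 1, 5)`.

```python
import math
import random
import sqlite3

# Rebuild a table shaped like ds1 using only the standard library.
# random.gauss is a stand-in for np.random.normal (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ds1 (x REAL, y REAL, z REAL, a INTEGER, b INTEGER)")

grid = [-1.0, -0.5, 0.0, 0.5, 1.0]  # same points as np.linspace(-1, 1, 5)
rows = []
for x in grid:
    for y in grid:
        base_z = math.exp(-(x**2 + y**2))
        for a in range(1, 11):
            noise = random.gauss(0, 0.1)  # one draw per (x, y, a)
            for b in (1, -1):
                rows.append((x, y, base_z * b + noise, a, b))
conn.executemany("INSERT INTO ds1 VALUES (?, ?, ?, ?, ?)", rows)

# 5 x-values * 5 y-values * 10 repeats * 2 states = 500 rows in total,
# and 5 * 5 * 2 = 50 rows for any single value of a.
total = conn.execute("SELECT COUNT(*) FROM ds1").fetchone()[0]
per_repeat = conn.execute("SELECT COUNT(*) FROM ds1 WHERE a = 1").fetchone()[0]
print(total, per_repeat)
```

The same two `COUNT(*)` statements can be run against the notebook's own `connection` object.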

## Step 3: Slicing and Averaging Using PyPika

Now, let's use PyPika to build SQL queries that allow us to perform slicing (selecting rows based on conditions) and averaging.



### Example 1: Simple Selection
We want to select columns `x`, `y`, `z`, and `a` from `ds1` where `a == 1`.

In [3]:
# Define the table
ds1 = Table('ds1')

# Build the query
query = ds1.select(ds1.x, ds1.y, ds1.z, ds1.a).where(ds1.a == 1)
print(query)
# Execute the query
result_df = pd.read_sql_query(str(query), connection)
print("\nSimple Selection Result:")
print(result_df)

SELECT "x","y","z","a" FROM "ds1" WHERE "a"=1

Simple Selection Result:
      x    y         z  a
0  -1.0 -1.0  0.250194  1
1  -1.0 -1.0 -0.020477  1
2  -1.0 -0.5  0.197613  1
3  -1.0 -0.5 -0.375397  1
4  -1.0  0.0  0.351108  1
5  -1.0  0.0 -0.384650  1
6  -1.0  0.5  0.230708  1
7  -1.0  0.5 -0.342301  1
8  -1.0  1.0  0.154760  1
9  -1.0  1.0 -0.115910  1
10 -0.5 -1.0  0.313137  1
11 -0.5 -1.0 -0.259873  1
12 -0.5 -0.5  0.718489  1
13 -0.5 -0.5 -0.494573  1
14 -0.5  0.0  0.767622  1
15 -0.5  0.0 -0.789979  1
16 -0.5  0.5  0.766035  1
17 -0.5  0.5 -0.447026  1
18 -0.5  1.0  0.377010  1
19 -0.5  1.0 -0.196000  1
20  0.0 -1.0  0.295325  1
21  0.0 -1.0 -0.440434  1
22  0.0 -0.5  0.646772  1
23  0.0 -0.5 -0.910829  1
24  0.0  0.0  1.067821  1
25  0.0  0.0 -0.932179  1
26  0.0  0.5  0.830140  1
27  0.0  0.5 -0.727462  1
28  0.0  1.0  0.440276  1
29  0.0  1.0 -0.295483  1
30  0.5 -1.0  0.351158  1
31  0.5 -1.0 -0.221852  1
32  0.5 -0.5  0.723103  1
33  0.5 -0.5 -0.489958  1
34  0.5  0.0  0.60

Adding `b` to the same selection shows that each grid point has one row per state for `a == 1`:


In [4]:
# Define the table
ds1 = Table('ds1')

# Build the query
query = ds1.select(ds1.x, ds1.y, ds1.z, ds1.a, ds1.b).where(ds1.a == 1)
print(query)
# Execute the query
result_df = pd.read_sql_query(str(query), connection)
print("Simple Selection Result:")
print(result_df)

SELECT "x","y","z","a","b" FROM "ds1" WHERE "a"=1
Simple Selection Result:
      x    y         z  a  b
0  -1.0 -1.0  0.250194  1  1
1  -1.0 -1.0 -0.020477  1 -1
2  -1.0 -0.5  0.197613  1  1
3  -1.0 -0.5 -0.375397  1 -1
4  -1.0  0.0  0.351108  1  1
5  -1.0  0.0 -0.384650  1 -1
6  -1.0  0.5  0.230708  1  1
7  -1.0  0.5 -0.342301  1 -1
8  -1.0  1.0  0.154760  1  1
9  -1.0  1.0 -0.115910  1 -1
10 -0.5 -1.0  0.313137  1  1
11 -0.5 -1.0 -0.259873  1 -1
12 -0.5 -0.5  0.718489  1  1
13 -0.5 -0.5 -0.494573  1 -1
14 -0.5  0.0  0.767622  1  1
15 -0.5  0.0 -0.789979  1 -1
16 -0.5  0.5  0.766035  1  1
17 -0.5  0.5 -0.447026  1 -1
18 -0.5  1.0  0.377010  1  1
19 -0.5  1.0 -0.196000  1 -1
20  0.0 -1.0  0.295325  1  1
21  0.0 -1.0 -0.440434  1 -1
22  0.0 -0.5  0.646772  1  1
23  0.0 -0.5 -0.910829  1 -1
24  0.0  0.0  1.067821  1  1
25  0.0  0.0 -0.932179  1 -1
26  0.0  0.5  0.830140  1  1
27  0.0  0.5 -0.727462  1 -1
28  0.0  1.0  0.440276  1  1
29  0.0  1.0 -0.295483  1 -1
30  0.5 -1.0  0.351158  1 

### Example 2: Aggregating with Averaging
We want to select `x` and `y` and compute the average of `z` over the rows where `a == 1`, grouped by `x` and `y`.

In [5]:
# Build the query for aggregation
query = (
    ds1
    .select(ds1.x, ds1.y, fn.Avg(ds1.z).as_('avg_z'))
    .where(ds1.a == 1)
    .groupby(ds1.x, ds1.y)
    .orderby(ds1.x, ds1.y)
)
print(query)

# Execute the query
result_df = pd.read_sql_query(str(query), connection)
print("Aggregation with Averaging Result:")
print(result_df)

SELECT "x","y",AVG("z") "avg_z" FROM "ds1" WHERE "a"=1 GROUP BY "x","y" ORDER BY "x","y"
Aggregation with Averaging Result:
      x    y     avg_z
0  -1.0 -1.0  0.114859
1  -1.0 -0.5 -0.088892
2  -1.0  0.0 -0.016771
3  -1.0  0.5 -0.055797
4  -1.0  1.0  0.019425
5  -0.5 -1.0  0.026632
6  -0.5 -0.5  0.111958
7  -0.5  0.0 -0.011179
8  -0.5  0.5  0.159505
9  -0.5  1.0  0.090505
10  0.0 -1.0 -0.072554
11  0.0 -0.5 -0.132029
12  0.0  0.0  0.067821
13  0.0  0.5  0.051339
14  0.0  1.0  0.072397
15  0.5 -1.0  0.064653
16  0.5 -0.5  0.116572
17  0.5  0.0 -0.169370
18  0.5  0.5 -0.120340
19  0.5  1.0 -0.043103
20  1.0 -1.0  0.093518
21  1.0 -0.5 -0.115737
22  1.0  0.0  0.097399
23  1.0  0.5  0.041123
24  1.0  1.0 -0.170906
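
One way to read these small averages: for a fixed `a`, each `(x, y)` group holds one row per `b` state, and in the generation loop both rows share the same noise draw, so the Gaussian part cancels and `AVG(z)` returns the noise alone. In the Example 1 output, the two rows at `x = 0.0`, `y = 0.0` are 1.067821 and -0.932179, whose mean is the 0.067821 shown here. The sketch below reproduces that with a single grid point; the noise value `0.05` is made up for illustration.

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ds1 (x REAL, y REAL, z REAL, a INTEGER, b INTEGER)")

# One grid point, one repeat, both states; the noise value is made up
x, y, a = 0.0, 0.0, 1
noise = 0.05
base_z = math.exp(-(x**2 + y**2))  # 1.0 at the origin
conn.executemany(
    "INSERT INTO ds1 VALUES (?, ?, ?, ?, ?)",
    [(x, y, base_z * b + noise, a, b) for b in (1, -1)],
)

# This SQL string is the one PyPika printed for Example 2
sql = 'SELECT "x","y",AVG("z") "avg_z" FROM "ds1" WHERE "a"=1 GROUP BY "x","y" ORDER BY "x","y"'
avg_z = conn.execute(sql).fetchone()[2]

# (base_z + noise) + (-base_z + noise) = 2 * noise, so the mean is the noise
print(abs(avg_z - noise) < 1e-12)
```

If the goal is the mean response of one state, filter on `b` as well (or group by it) so the two signs are not averaged together.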



### Example 3: Combining Slicing, Averaging, and Preserving Order
This is the same query as in Example 2; the focus here is the `.orderby()` clause. When we select `x`, `y`, and `avg(z)` where `a == 1`, we want the rows to come back sorted by `x` and then `y`, so that other functions depending on that order still work.

SQL does not guarantee any particular row order unless the query has an `ORDER BY` clause, and `GROUP BY` alone does not provide one. Calling `.orderby()` makes the order explicit, so the resulting DataFrame can be passed safely to any function that relies on the order of `x` and `y`.


In [6]:
# Build the combined query
query = (
    ds1
    .select(ds1.x, ds1.y, fn.Avg(ds1.z).as_('avg_z'))
    .where(ds1.a == 1)
    .groupby(ds1.x, ds1.y)
    .orderby(ds1.x, ds1.y)  # Ensure the ordering is maintained
)

print(query)
# Execute the query
result_df = pd.read_sql_query(str(query), connection)
print("Combined Selection and Averaging with Order Preserved:")
print(result_df)


# Verify that the ordering is maintained for further processing
print("Is the DataFrame sorted by ['x', 'y']?")
print(result_df.equals(result_df.sort_values(by=['x', 'y'])))

SELECT "x","y",AVG("z") "avg_z" FROM "ds1" WHERE "a"=1 GROUP BY "x","y" ORDER BY "x","y"
Combined Selection and Averaging with Order Preserved:
      x    y     avg_z
0  -1.0 -1.0  0.114859
1  -1.0 -0.5 -0.088892
2  -1.0  0.0 -0.016771
3  -1.0  0.5 -0.055797
4  -1.0  1.0  0.019425
5  -0.5 -1.0  0.026632
6  -0.5 -0.5  0.111958
7  -0.5  0.0 -0.011179
8  -0.5  0.5  0.159505
9  -0.5  1.0  0.090505
10  0.0 -1.0 -0.072554
11  0.0 -0.5 -0.132029
12  0.0  0.0  0.067821
13  0.0  0.5  0.051339
14  0.0  1.0  0.072397
15  0.5 -1.0  0.064653
16  0.5 -0.5  0.116572
17  0.5  0.0 -0.169370
18  0.5  0.5 -0.120340
19  0.5  1.0 -0.043103
20  1.0 -1.0  0.093518
21  1.0 -0.5 -0.115737
22  1.0  0.0  0.097399
23  1.0  0.5  0.041123
24  1.0  1.0 -0.170906
Is the DataFrame sorted by ['x', 'y']?
True
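
To see why the explicit ordering matters for downstream code, here is a minimal standard-library sketch. It inserts rows in scrambled order, runs the SQL string printed above, and then cuts the flat result into a 5 x 5 table, a step that only works if rows arrive sorted by `x` and then `y`. The marker value `z = x + 10 * y` and the helper names `grid`, `keys`, and `table` are made up for this illustration.

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ds1 (x REAL, y REAL, z REAL, a INTEGER, b INTEGER)")

grid = [-1.0, -0.5, 0.0, 0.5, 1.0]
# z is a made-up marker value so each grid point is easy to recognise
rows = [(x, y, x + 10 * y, 1, 1) for x in grid for y in grid]
random.shuffle(rows)  # insert in scrambled order on purpose
conn.executemany("INSERT INTO ds1 VALUES (?, ?, ?, ?, ?)", rows)

# This SQL string is the one PyPika printed for Example 3
sql = 'SELECT "x","y",AVG("z") "avg_z" FROM "ds1" WHERE "a"=1 GROUP BY "x","y" ORDER BY "x","y"'
result = conn.execute(sql).fetchall()

# With ORDER BY x, y the flat result is row-major: x varies slowest, y fastest
keys = [(x, y) for x, y, _ in result]
print(keys == sorted(keys))

# Order-dependent step: table[i][j] is avg_z at (grid[i], grid[j])
n = len(grid)
table = [[result[i * n + j][2] for j in range(n)] for i in range(n)]
print(table[0][4])  # avg_z at x = -1.0, y = 1.0
```

With pandas, the equivalent order-dependent step would be something like `result_df['avg_z'].to_numpy().reshape(5, 5)`, which silently produces a wrong grid if the rows are not sorted.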


### Example 4: Averaging Across `a` and Selecting `b`
We want `x`, `y`, and `avg(z)` averaged over all values of `a`, restricted to the rows where `b == 1`. The result is grouped by `x` and `y` and ordered by the same columns.


In [7]:
# Build the query for averaging across `a` and selecting `b`
query = (
    ds1
    .select(ds1.x, ds1.y, fn.Avg(ds1.z).as_('avg_z'))
    .where(ds1.b == 1)
    .groupby(ds1.x, ds1.y)
    .orderby(ds1.x, ds1.y)
)

# Execute the query
result_df = pd.read_sql_query(str(query), connection)

print(query)
print("Averaging Across 'a' with 'b' Selected Result:")
print(result_df)

SELECT "x","y",AVG("z") "avg_z" FROM "ds1" WHERE "b"=1 GROUP BY "x","y" ORDER BY "x","y"
Averaging Across 'a' with 'b' Selected Result:
      x    y     avg_z
0  -1.0 -1.0  0.104242
1  -1.0 -0.5  0.257561
2  -1.0  0.0  0.352388
3  -1.0  0.5  0.308224
4  -1.0  1.0  0.187247
5  -0.5 -1.0  0.328021
6  -0.5 -0.5  0.618263
7  -0.5  0.0  0.789747
8  -0.5  0.5  0.593065
9  -0.5  1.0  0.356974
10  0.0 -1.0  0.298984
11  0.0 -0.5  0.792378
12  0.0  0.0  0.990404
13  0.0  0.5  0.731540
14  0.0  1.0  0.399185
15  0.5 -1.0  0.355387
16  0.5 -0.5  0.638891
17  0.5  0.0  0.726695
18  0.5  0.5  0.577464
19  0.5  1.0  0.277483
20  1.0 -1.0  0.175300
21  1.0 -0.5  0.303328
22  1.0  0.0  0.374038
23  1.0  0.5  0.305228
24  1.0  1.0  0.114858
