# PyPika Tutorial: Slicing, Averaging, and Preserving Order

This notebook demonstrates how to use PyPika to build SQL queries that involve:
- Basic selection and filtering (slicing).
- Aggregation (averaging).
- Combining selection and averaging while keeping the result ordered, so that downstream functions that depend on row order still work.

Let's get started by installing and importing the necessary packages.

## Step 1: Installation and Imports

First, ensure that you have PyPika installed (inside a notebook cell, prefix the command with `!`):
```bash
pip install pypika
```

Now, import PyPika and any other required packages:

In [1]:
from pypika import Table, functions as fn
import sqlite3  # To simulate the database
import pandas as pd
import numpy as np

## Step 2: Creating a Sample Database

For demonstration purposes, we will create a small in-memory SQLite database with a table named `ds1` and populate it with data.

The dataset (`ds1`) represents a 5 × 5 grid mesh where `x` and `y` each span from -1 to 1, and `z` is a Gaussian response, `exp(-(x**2 + y**2))`, with added noise. Column `a` labels 10 repeat measurements, each with its own noise draw (standard deviation 0.1), and `b` takes the values 1 and -1, which flip the sign of the Gaussian response. Within a given `(x, y, a)` combination, the two `b` states share the same noise value.


In [2]:
# Establish a connection to an in-memory SQLite database
connection = sqlite3.connect(':memory:')
cursor = connection.cursor()

# Create the "ds1" table
cursor.execute('''
CREATE TABLE ds1 (
    x REAL,
    y REAL,
    z REAL,
    a INTEGER,
    b INTEGER
);
''')

# Generate sample data
x_vals = np.linspace(-1, 1, 5)
y_vals = np.linspace(-1, 1, 5)
a_vals = range(1, 11)  # 10 copies with different noise
b_vals = [1, -1]  # Two states flipping the sign of z

sample_data = []
for x in x_vals:
    for y in y_vals:
        base_z = np.exp(-(x**2 + y**2))  # Gaussian response
        for a in a_vals:
            noise = np.random.normal(0, 0.1)  # Adding noise
            for b in b_vals:
                z = base_z * b + noise
                sample_data.append((x, y, z, a, b))

# Insert the sample data into "ds1"
cursor.executemany('INSERT INTO ds1 VALUES (?, ?, ?, ?, ?)', sample_data)
connection.commit()

## Step 3: Slicing and Averaging Using PyPika

Now, let's use PyPika to build SQL queries that allow us to perform slicing (selecting rows based on conditions) and averaging.



### Example 1: Simple Selection
We want to select columns `x`, `y`, `z`, and `a` from `ds1` where `a == 1`.

In [3]:
# Define the table
ds1 = Table('ds1')

# Build the query
query = ds1.select(ds1.x, ds1.y, ds1.z, ds1.a).where(ds1.a == 1)
print(query)
# Execute the query
result_df = pd.read_sql_query(str(query), connection)
print("\nSimple Selection Result:")
print(result_df)

SELECT "x","y","z","a" FROM "ds1" WHERE "a"=1

Simple Selection Result:
      x    y         z  a
0  -1.0 -1.0  0.069362  1
1  -1.0 -1.0 -0.201309  1
2  -1.0 -0.5  0.192130  1
3  -1.0 -0.5 -0.380880  1
4  -1.0  0.0  0.366027  1
5  -1.0  0.0 -0.369732  1
6  -1.0  0.5  0.264043  1
7  -1.0  0.5 -0.308967  1
8  -1.0  1.0  0.202090  1
9  -1.0  1.0 -0.068581  1
10 -0.5 -1.0  0.203297  1
11 -0.5 -1.0 -0.369713  1
12 -0.5 -0.5  0.490341  1
13 -0.5 -0.5 -0.722720  1
14 -0.5  0.0  0.758477  1
15 -0.5  0.0 -0.799125  1
16 -0.5  0.5  0.733358  1
17 -0.5  0.5 -0.479704  1
18 -0.5  1.0  0.295450  1
19 -0.5  1.0 -0.277560  1
20  0.0 -1.0  0.436238  1
21  0.0 -1.0 -0.299521  1
22  0.0 -0.5  0.880419  1
23  0.0 -0.5 -0.677182  1
24  0.0  0.0  0.955376  1
25  0.0  0.0 -1.044624  1
26  0.0  0.5  0.764328  1
27  0.0  0.5 -0.793274  1
28  0.0  1.0  0.375256  1
29  0.0  1.0 -0.360503  1
30  0.5 -1.0  0.149250  1
31  0.5 -1.0 -0.423760  1
32  0.5 -0.5  0.632334  1
33  0.5 -0.5 -0.580727  1
34  0.5  0.0  0.80

### Example 1 (continued): Including Column `b`
The same selection with `b` added shows the two states at each grid point: for `a == 1`, the rows come in pairs with `b = 1` and `b = -1`.


In [4]:
# Define the table
ds1 = Table('ds1')

# Build the query
query = ds1.select(ds1.x, ds1.y, ds1.z, ds1.a, ds1.b).where(ds1.a == 1)
print(query)
# Execute the query
result_df = pd.read_sql_query(str(query), connection)
print("Simple Selection Result:")
print(result_df)

SELECT "x","y","z","a","b" FROM "ds1" WHERE "a"=1
Simple Selection Result:
      x    y         z  a  b
0  -1.0 -1.0  0.069362  1  1
1  -1.0 -1.0 -0.201309  1 -1
2  -1.0 -0.5  0.192130  1  1
3  -1.0 -0.5 -0.380880  1 -1
4  -1.0  0.0  0.366027  1  1
5  -1.0  0.0 -0.369732  1 -1
6  -1.0  0.5  0.264043  1  1
7  -1.0  0.5 -0.308967  1 -1
8  -1.0  1.0  0.202090  1  1
9  -1.0  1.0 -0.068581  1 -1
10 -0.5 -1.0  0.203297  1  1
11 -0.5 -1.0 -0.369713  1 -1
12 -0.5 -0.5  0.490341  1  1
13 -0.5 -0.5 -0.722720  1 -1
14 -0.5  0.0  0.758477  1  1
15 -0.5  0.0 -0.799125  1 -1
16 -0.5  0.5  0.733358  1  1
17 -0.5  0.5 -0.479704  1 -1
18 -0.5  1.0  0.295450  1  1
19 -0.5  1.0 -0.277560  1 -1
20  0.0 -1.0  0.436238  1  1
21  0.0 -1.0 -0.299521  1 -1
22  0.0 -0.5  0.880419  1  1
23  0.0 -0.5 -0.677182  1 -1
24  0.0  0.0  0.955376  1  1
25  0.0  0.0 -1.044624  1 -1
26  0.0  0.5  0.764328  1  1
27  0.0  0.5 -0.793274  1 -1
28  0.0  1.0  0.375256  1  1
29  0.0  1.0 -0.360503  1 -1
30  0.5 -1.0  0.149250  1 

### Example 2: Aggregating with Averaging
We want to select columns `x`, `y`, and calculate the average of `z` where `a == 1`. The result should be grouped by `x` and `y`.

In [5]:
# Build the query for aggregation
query = (
    ds1
    .select(ds1.x, ds1.y, fn.Avg(ds1.z).as_('avg_z'))
    .where(ds1.a == 1)
    .groupby(ds1.x, ds1.y)
    .orderby(ds1.x, ds1.y)
)
print(query)

# Execute the query
result_df = pd.read_sql_query(str(query), connection)
print("Aggregation with Averaging Result:")
print(result_df)

SELECT "x","y",AVG("z") "avg_z" FROM "ds1" WHERE "a"=1 GROUP BY "x","y" ORDER BY "x","y"
Aggregation with Averaging Result:
      x    y     avg_z
0  -1.0 -1.0 -0.065973
1  -1.0 -0.5 -0.094375
2  -1.0  0.0 -0.001853
3  -1.0  0.5 -0.022462
4  -1.0  1.0  0.066755
5  -0.5 -1.0 -0.083208
6  -0.5 -0.5 -0.116190
7  -0.5  0.0 -0.020324
8  -0.5  0.5  0.126827
9  -0.5  1.0  0.008945
10  0.0 -1.0  0.068359
11  0.0 -0.5  0.101619
12  0.0  0.0 -0.044624
13  0.0  0.5 -0.014473
14  0.0  1.0  0.007377
15  0.5 -1.0 -0.137255
16  0.5 -0.5  0.025803
17  0.5  0.0  0.030293
18  0.5  0.5  0.063359
19  0.5  1.0 -0.124204
20  1.0 -1.0 -0.143863
21  1.0 -0.5 -0.096657
22  1.0  0.0  0.120842
23  1.0  0.5 -0.094760
24  1.0  1.0  0.177569
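
The averages above are close to zero rather than Gaussian-shaped. With `a == 1`, each `(x, y)` group holds exactly two rows, one per `b` state, and both share the same noise draw, so the sign-flipped Gaussian term cancels and only the noise survives. A small worked check using the two rows printed for `x = 0`, `y = 0` in Example 1 (the variable names are illustrative):

```python
import math

# The two z values printed in Example 1 for x = 0.0, y = 0.0, a = 1.
z_b_pos = 0.955376   # row with b = 1
z_b_neg = -1.044624  # row with b = -1

# Averaging the pair cancels the sign-flipped Gaussian term.
avg_z = (z_b_pos + z_b_neg) / 2

# Recover the shared noise draw from the b = 1 row: z = base_z * b + noise.
base_z = math.exp(-(0.0**2 + 0.0**2))  # 1.0 at the grid centre
noise = z_b_pos - base_z

print(round(avg_z, 6), round(noise, 6))
```

Both numbers agree with the `avg_z` printed for `x = 0.0`, `y = 0.0` in the table above. To see the Gaussian shape, the filter has to fix `b` as well, which Example 4 does.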



### Example 3: Combining Slicing, Averaging, and Preserving Order
SQL does not guarantee any particular row order unless the query has an `ORDER BY` clause, and `GROUP BY` on its own does not provide one. When we select `x`, `y`, and `avg(z)` where `a == 1`, we therefore call `.orderby()` explicitly so that rows come back sorted by `x`, then by `y`.

Any function that relies on that order, for instance one that reshapes the result back into a grid, can then consume the resulting DataFrame directly.


In [6]:
# Build the combined query
query = (
    ds1
    .select(ds1.x, ds1.y, fn.Avg(ds1.z).as_('avg_z'))
    .where(ds1.a == 1)
    .groupby(ds1.x, ds1.y)
    .orderby(ds1.x, ds1.y)  # Ensure the ordering is maintained
)

print(query)
# Execute the query
result_df = pd.read_sql_query(str(query), connection)
print("Combined Selection and Averaging with Order Preserved:")
print(result_df)


# Verify that the ordering is maintained for further processing
print("Is the DataFrame sorted by ['x', 'y']?")
print(result_df.equals(result_df.sort_values(by=['x', 'y'])))

SELECT "x","y",AVG("z") "avg_z" FROM "ds1" WHERE "a"=1 GROUP BY "x","y" ORDER BY "x","y"
Combined Selection and Averaging with Order Preserved:
      x    y     avg_z
0  -1.0 -1.0 -0.065973
1  -1.0 -0.5 -0.094375
2  -1.0  0.0 -0.001853
3  -1.0  0.5 -0.022462
4  -1.0  1.0  0.066755
5  -0.5 -1.0 -0.083208
6  -0.5 -0.5 -0.116190
7  -0.5  0.0 -0.020324
8  -0.5  0.5  0.126827
9  -0.5  1.0  0.008945
10  0.0 -1.0  0.068359
11  0.0 -0.5  0.101619
12  0.0  0.0 -0.044624
13  0.0  0.5 -0.014473
14  0.0  1.0  0.007377
15  0.5 -1.0 -0.137255
16  0.5 -0.5  0.025803
17  0.5  0.0  0.030293
18  0.5  0.5  0.063359
19  0.5  1.0 -0.124204
20  1.0 -1.0 -0.143863
21  1.0 -0.5 -0.096657
22  1.0  0.0  0.120842
23  1.0  0.5 -0.094760
24  1.0  1.0  0.177569
Is the DataFrame sorted by ['x', 'y']?
True


### Example 4: Averaging Across `a` and Selecting `b`
We now average `z` over all 10 values of `a`, keeping only the rows where `b == 1`. The result is grouped by `x` and `y` and ordered the same way.


In [7]:
# Build the query for averaging across `a` and selecting `b`
query = (
    ds1
    .select(ds1.x, ds1.y, fn.Avg(ds1.z).as_('avg_z'))
    .where(ds1.b == 1)
    .groupby(ds1.x, ds1.y)
    .orderby(ds1.x, ds1.y)
)

# Execute the query
result_df = pd.read_sql_query(str(query), connection)
print("Averaging Across 'a' with 'b' Selected Result:")
print(result_df)

Averaging Across 'a' with 'b' Selected Result:
      x    y     avg_z
0  -1.0 -1.0  0.181229
1  -1.0 -0.5  0.237509
2  -1.0  0.0  0.377482
3  -1.0  0.5  0.261564
4  -1.0  1.0  0.144088
5  -0.5 -1.0  0.231555
6  -0.5 -0.5  0.561442
7  -0.5  0.0  0.776251
8  -0.5  0.5  0.670895
9  -0.5  1.0  0.337831
10  0.0 -1.0  0.317837
11  0.0 -0.5  0.735524
12  0.0  0.0  0.993011
13  0.0  0.5  0.760221
14  0.0  1.0  0.412656
15  0.5 -1.0  0.269312
16  0.5 -0.5  0.603588
17  0.5  0.0  0.816241
18  0.5  0.5  0.568472
19  0.5  1.0  0.286982
20  1.0 -1.0  0.116679
21  1.0 -0.5  0.296672
22  1.0  0.0  0.402449
23  1.0  0.5  0.310698
24  1.0  1.0  0.138830
