# Query engine
This technical document describes the spoc query engine, a set of classes that implements spoc's interface for querying multi-dimensional genomic data.

## Principles

### Composable pieces
Spoc's query engine consists of composable pieces that can be combined to produce an expressive query language. These pieces represent basic operations on genomic data that are easily implemented and understood on their own. This allows a great degree of flexibility, while also allowing predefined recipes that less experienced users can get started with.

### Lazy evaluation
The spoc query engine is designed with lazy evaluation as a guiding principle. Queries are executed only when their results are needed, which minimizes both the amount of data loaded into memory and the computational overhead. To enable this, spoc queries have a construction phase, which specifies the operations to be executed, and an execution phase, which actually executes the query.
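
The two phases can be illustrated with plain Python generators. This is only a sketch of the general idea, not spoc's implementation; `read_rows`, `keep_chrom` and `read_log` are hypothetical names introduced for the example.

```python
from typing import Iterable, Iterator

read_log = []  # records every row that is actually read from the source


def read_rows() -> Iterator[dict]:
    # Generator function: its body runs only when rows are requested.
    for row in ({"chrom": "chr1", "start": 100}, {"chrom": "chr2", "start": 50}):
        read_log.append(row["chrom"])
        yield row


def keep_chrom(rows: Iterable[dict], chrom: str) -> Iterator[dict]:
    # Returns a lazy filter; no row is inspected until it is consumed.
    return (row for row in rows if row["chrom"] == chrom)


# Construction phase: the pipeline is described, but no row has been read.
pipeline = keep_chrom(read_rows(), "chr1")
rows_read_after_construction = len(read_log)

# Execution phase: consuming the pipeline pulls rows through every step.
result = list(pipeline)
```

Building `pipeline` leaves `read_log` empty; the source is read only when `list(pipeline)` is called.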

## Query plans and query steps

The most important ingredient in this query language is a class that implements the `QueryStep` protocol. This protocol serves two purposes:

- It exposes a way to validate the data schema during query building
- It implements adding itself to a query

This way, query steps can be combined into a query plan that specifies the analysis to be executed. Specific examples of query steps are:

- **Snipper**: Implements selecting overlapping contacts or pixels for a set of genomic regions.
- **Transformation**: Transforms one or more columns to add additional columns.
- **Aggregation**: Aggregates data, for example by counting contacts per region.
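
The two purposes of the protocol can be sketched with `typing.Protocol`. The method names `validate` and `add_to_plan`, the `Schema` alias and the `CountPerRegion` step are assumptions made for illustration and are not taken from spoc's source.

```python
from typing import Dict, List, Protocol

Schema = Dict[str, type]  # stand-in for a dataframe schema: column -> dtype


class QueryStepSketch(Protocol):
    """Hypothetical two-method protocol mirroring the two purposes above."""

    def validate(self, schema: Schema) -> None:
        """Raise if the step cannot run on data with this schema."""
        ...

    def add_to_plan(self, plan: List[str]) -> List[str]:
        """Return the plan extended by this step."""
        ...


class CountPerRegion:
    """Toy aggregation step that needs a `region_id` column."""

    def validate(self, schema: Schema) -> None:
        if "region_id" not in schema:
            raise ValueError("CountPerRegion requires a 'region_id' column")

    def add_to_plan(self, plan: List[str]) -> List[str]:
        return plan + ["count rows per region_id"]


def append_step(step: QueryStepSketch, schema: Schema, plan: List[str]) -> List[str]:
    step.validate(schema)          # schema check happens while building
    return step.add_to_plan(plan)  # the step adds itself to the plan


plan = append_step(CountPerRegion(), {"chrom": str, "region_id": int}, [])
```

Because the protocol is structural, `CountPerRegion` does not need to inherit from it; any class with matching methods can serve as a step in this sketch.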

### Input and output of query steps

A query step takes as input a class that implements the `GenomicData` protocol. This protocol allows retrieval of the data schema (a thin wrapper over a pandera dataframe schema) as well as the data itself. The output of a query step is again a class that implements the `GenomicData` protocol, which allows steps to be composed. Specific examples of possible inputs are:

- **Pixels**: Represents input pixels
- **Contacts**: Represents input contacts
- **QueryResult**: The result of a query step
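
Why a shared protocol enables composition can be shown with a small, self-contained sketch. The accessor names `get_schema` and `get_data` and the classes `ToyContacts` and `ToyResult` are hypothetical; spoc's real `GenomicData` protocol may look different.

```python
from typing import Dict, List, Protocol, runtime_checkable

Schema = Dict[str, type]
Rows = List[dict]


@runtime_checkable
class GenomicDataSketch(Protocol):
    """Hypothetical protocol: anything exposing a schema and its data."""

    def get_schema(self) -> Schema: ...
    def get_data(self) -> Rows: ...


class ToyContacts:
    """Stand-in for an input data class."""

    def __init__(self, rows: Rows):
        self._rows = rows

    def get_schema(self) -> Schema:
        return {"chrom_1": str, "start_1": int, "end_1": int}

    def get_data(self) -> Rows:
        return self._rows


class ToyResult:
    """Stand-in for a step's output; it satisfies the same protocol."""

    def __init__(self, source: GenomicDataSketch):
        self._source = source

    def get_schema(self) -> Schema:
        # A step may extend the schema, here with a derived length column.
        return {**self._source.get_schema(), "length_1": int}

    def get_data(self) -> Rows:
        return [
            {**row, "length_1": row["end_1"] - row["start_1"]}
            for row in self._source.get_data()
        ]


contacts = ToyContacts([{"chrom_1": "chr1", "start_1": 100, "end_1": 200}])
first = ToyResult(contacts)   # the output of one step ...
second = ToyResult(first)     # ... is valid input for the next
```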

### Composition of query steps

To specify complex queries, query steps need to be combined. This is done using the `BasicQuery` class. It takes a query plan (a list of `QueryStep` instances) as input and exposes the `query` method. This method takes input data, validates all query steps against it, and adds them to the `QueryResult` instance that it returns.

### Manifestation of results

So far, we have only talked about specifying the query to be executed, but not how to actually execute it. A `QueryResult` has a `load_result()` method that returns the manifested dataframe as a `pd.DataFrame` instance. This is the step that actually executes the specified query.
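
The interplay of `query` and `load_result()` can be sketched with toy classes. `ToyStep`, `ToyQuery` and `ToyQueryResult` are hypothetical stand-ins that only mirror the behavior described above (validate first, execute on load); they are not spoc's classes.

```python
from typing import Callable, List

Rows = List[dict]


class ToyStep:
    """Hypothetical step: a required column plus an operation on the rows."""

    def __init__(self, required: str, func: Callable[[Rows], Rows]):
        self.required = required
        self.func = func

    def validate(self, columns: List[str]) -> None:
        if self.required not in columns:
            raise ValueError(f"missing column: {self.required}")


class ToyQueryResult:
    def __init__(self, rows: Rows, steps: List[ToyStep]):
        self._rows, self._steps = rows, steps

    def load_result(self) -> Rows:
        # Manifestation: only now are the recorded steps applied to the data.
        rows = self._rows
        for step in self._steps:
            rows = step.func(rows)
        return rows


class ToyQuery:
    def __init__(self, query_plan: List[ToyStep]):
        self._plan = query_plan

    def query(self, rows: Rows, columns: List[str]) -> ToyQueryResult:
        # Validate every step up front, then hand the plan to the result.
        for step in self._plan:
            step.validate(columns)
        return ToyQueryResult(rows, list(self._plan))


executed = []  # records which operations have actually run


def keep_chr1(rows: Rows) -> Rows:
    executed.append("keep_chr1")
    return [r for r in rows if r["chrom"] == "chr1"]


data = [{"chrom": "chr1"}, {"chrom": "chr2"}]
result = ToyQuery([ToyStep("chrom", keep_chr1)]).query(data, columns=["chrom"])
ran_before_load = list(executed)  # still empty: nothing has been executed
rows = result.load_result()       # the filter runs here
```

In this sketch a schema problem surfaces when `query` is called, before any operation touches the data, while the work itself is deferred to `load_result()`.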

## Examples

### Selecting a subset of contacts at a single genomic position
In this example, we want to select a subset of genomic contacts at a single location. For this, we first load the required input data:

```python
from spoc.query_engine import Snipper, Anchor, BasicQuery
from spoc.contacts import Contacts
import pandas as pd

contacts = Contacts.from_uri("../tests/test_files/contacts_unlabelled_2d_v2.parquet::2")
```

Then we specify a target region:

```python
target_region = pd.DataFrame({
    "chrom": ['chr1'],
    "start": [100],
    "end": [400],
})
```

First, we want to select all contacts where any of the fragments constituting the contact overlaps the target region. To do this, we use the `Snipper` class and pass it the target region as well as an instance of the `Anchor` class. The `Anchor` dataclass specifies how contacts are filtered for region overlap. It has two attributes, `mode` and `anchors`: `anchors` lists the positions we want to filter on (the default is all positions), and `mode` specifies whether all of these positions or any one of them must overlap. For example, to select all two-way contacts for which any of the positions overlaps, we use `Anchor(mode='ANY', anchors=[1,2])`.

```python
query_plan = [
    Snipper(target_region, anchor_mode=Anchor(mode="ANY", anchors=[1,2]))
]
```

A query plan is a list of query steps that can be passed to the `BasicQuery` class:

```python
query = BasicQuery(query_plan=query_plan)
```

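The `.query` method validates the query plan against the input data and returns a `QueryResult` object. In line with lazy evaluation, the query itself is not executed at this point:

```python
result = query.query(contacts)
result
```

```text
<spoc.query_engine.QueryResult at 0x23d0367eaf0>
```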

Calling `.load_result()` on the `QueryResult` object executes the query and returns a `pd.DataFrame`. The resulting dataframe has additional columns that represent the regions with which the input contacts overlapped.

```python
df = result.load_result()
print(type(df))
df.filter(regex=r"chrom|start|end|id")
```

```text
<class 'pandas.core.frame.DataFrame'>
```


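| | chrom_1 | start_1 | end_1 | chrom_2 | start_2 | end_2 | chrom | start | end | id |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | chr1 | 100 | 200 | chr1 | 1000 | 2000 | chr1 | 100 | 400 | 0 |
| 1 | chr1 | 2000 | 3000 | chr1 | 200 | 300 | chr1 | 100 | 400 | 0 |
| 2 | chr1 | 3000 | 4000 | chr1 | 300 | 400 | chr1 | 100 | 400 | 0 |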


We can also restrict the positions to filter on by passing different anchor parameters. For example, we can select only those contacts whose first position overlaps our target:

```python
query_plan = [
    Snipper(target_region, anchor_mode=Anchor(mode="ANY", anchors=[1]))
]
(
    BasicQuery(query_plan=query_plan)
    .query(contacts)
    .load_result()
    .filter(regex=r"chrom|start|end|id")
)
```

| | chrom_1 | start_1 | end_1 | chrom_2 | start_2 | end_2 | chrom | start | end | id |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | chr1 | 100 | 200 | chr1 | 1000 | 2000 | chr1 | 100 | 400 | 0 |


This time, only the first contact overlaps.

The same functionality is also implemented for the `Pixels` class.
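
The effect of the anchor settings used above can be reproduced in plain Python. This is a minimal sketch, assuming half-open intervals and a simple overlap test; `overlaps` and `matches` are hypothetical helpers, the mode string `"ALL"` is an assumption (only `"ANY"` appears in the examples), and the coordinates are those of the three contacts shown in the first result table.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end)


def overlaps(a: Interval, b: Interval) -> bool:
    # Assumes half-open intervals [start, end) on the same chromosome.
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def matches(contact: List[Interval], region: Interval,
            mode: str, anchors: List[int]) -> bool:
    # `anchors` are 1-based positions within the contact, as in the text above.
    hits = [overlaps(contact[i - 1], region) for i in anchors]
    return any(hits) if mode == "ANY" else all(hits)


region = ("chr1", 100, 400)
contacts = [
    [("chr1", 100, 200), ("chr1", 1000, 2000)],
    [("chr1", 2000, 3000), ("chr1", 200, 300)],
    [("chr1", 3000, 4000), ("chr1", 300, 400)],
]

any_both = [i for i, c in enumerate(contacts) if matches(c, region, "ANY", [1, 2])]
any_first = [i for i, c in enumerate(contacts) if matches(c, region, "ANY", [1])]
all_both = [i for i, c in enumerate(contacts) if matches(c, region, "ALL", [1, 2])]

print(any_both)   # [0, 1, 2]
print(any_first)  # [0]
print(all_both)   # []
```

With `anchors=[1, 2]` and mode `"ANY"` all three contacts match, restricting to the first position leaves only contact 0, and requiring both positions to overlap matches none of them.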

### Selecting a subset of contacts at multiple genomic regions
The `Snipper` class can also select contacts at multiple genomic regions. Here, the behavior of `Snipper` deviates from a simple filter: if a given contact overlaps multiple regions, it is returned once for each of them.

First, we specify the target regions:

```python
target_regions = pd.DataFrame({
    "chrom": ['chr1', 'chr1'],
    "start": [100, 150],
    "end": [400, 200],
})
```

```python
query_plan = [
    Snipper(target_regions, anchor_mode=Anchor(mode="ANY", anchors=[1]))
]
(
    BasicQuery(query_plan=query_plan)
    .query(contacts)
    .load_result()
    .filter(regex=r"chrom|start|end|id")
)
```

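| | chrom_1 | start_1 | end_1 | chrom_2 | start_2 | end_2 | chrom | start | end | id |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | chr1 | 100 | 200 | chr1 | 1000 | 2000 | chr1 | 100 | 400 | 0 |
| 1 | chr1 | 100 | 200 | chr1 | 1000 | 2000 | chr1 | 150 | 200 | 1 |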


In this example, the contact overlapping both regions is duplicated.

The same functionality is also implemented for the `Pixels` class.
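
The duplication follows naturally if the selection is thought of as a join between contacts and regions rather than a filter. The following sketch uses a hypothetical `snip` helper, assumes half-open intervals, and reuses the first-position coordinates from the tables above; it is not how `Snipper` is implemented.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (chrom, start, end)


def overlaps(a: Interval, b: Interval) -> bool:
    # Assumes half-open intervals [start, end) on the same chromosome.
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def snip(first_positions: List[Interval], regions: List[Interval]) -> List[dict]:
    # Emit one row per overlapping (contact, region) pair, like an inner join.
    return [
        {"contact": pos, "region": region, "id": region_id}
        for pos in first_positions
        for region_id, region in enumerate(regions)
        if overlaps(pos, region)
    ]


regions = [("chr1", 100, 400), ("chr1", 150, 200)]
first_positions = [("chr1", 100, 200), ("chr1", 2000, 3000), ("chr1", 3000, 4000)]

rows = snip(first_positions, regions)
for row in rows:
    print(row)
```

The contact at `chr1:100-200` overlaps both regions and therefore appears twice, once with `id` 0 and once with `id` 1.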