# Relation

`pysdql.Relation` is the only entry point to pysdql: users create a `pysdql.Relation` object to start any operation. The class is designed as a task manager rather than a data entity. Its name is simply intended to give users the impression that they are manipulating tables or matrices. `pysdql.Relation` receives data from the user, invokes the appropriate functions depending on the data type, and passes the data on to the real data entity class.

The `pysdql.Relation` class receives data from users in several ways. For example, the name of a relation must be given by the user at construction. In fact, this is the only argument required for construction.

In [1]:
import pysdql

r = pysdql.Relation(name='R')

Requiring users to type the name of the variable twice is inelegant, but it is a last resort. A Python object does not record the names of the variables bound to it, i.e. there is no reliable way to recover a variable's name from the object it refers to.
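
The limitation can be illustrated in plain Python. The sketch below uses a hypothetical stand-in class, `NamedRelation`, which is not part of `pysdql`; it only shows that several variables can refer to one object, so the explicitly passed name is the only dependable one.

```python
class NamedRelation:
    """Hypothetical stand-in for a relation that stores its own name."""

    def __init__(self, name):
        self.name = name


r = NamedRelation(name='R')
alias = r  # a second variable bound to the same object

# Both variables refer to one object, so the object has no single "variable name".
same_object = alias is r

# The only dependable name is the one passed in explicitly.
stored_name = alias.name
```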

## Slicing

By slicing a `pysdql.Relation` object, i.e. `R[_]`, the user can perform selection and projection to extract the rows and columns that satisfy the given conditions. This is implemented by overloading the `__getitem__` method of the `pysdql.Relation` class.
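
As a rough illustration of the idea, a `__getitem__` overload can dispatch on the type of the key it receives. The classes `Column`, `Condition` and `SketchRelation` below are hypothetical and do not reflect the actual `pysdql` internals; the returned tuples are placeholders for whatever the library really builds.

```python
class Column:
    """Hypothetical single-column handle."""

    def __init__(self, name):
        self.name = name


class Condition:
    """Hypothetical row filter."""

    def __init__(self, text):
        self.text = text


class SketchRelation:
    """Hypothetical relation that dispatches on the type of the key."""

    def __init__(self, name):
        self.name = name

    def __getitem__(self, key):
        if isinstance(key, str):
            return Column(key)                    # r['A'] -> one column
        if isinstance(key, list) and all(isinstance(k, str) for k in key):
            return ('projection', tuple(key))     # r[['A', 'B']]
        if isinstance(key, Condition):
            return ('selection', key.text)        # r[condition]
        raise TypeError(f'unsupported key type: {type(key).__name__}')


r = SketchRelation('R')
col = r['A']
proj = r[['A', 'B']]
sel = r[Condition('A > 1')]
```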

### Projection

Users can select columns by passing a list of column-name strings to the slicing operation, i.e. `R[['col1', 'col2', ..., 'coln']]`. The list of columns must be of type `List[str]`.

In [2]:
ab = r[['A', 'B']]

let rmp = sum (<r_k, r_v> in R) { < A=r_k.A, B=r_k.B > } in


If the `pysdql.Relation` is sliced with a single column-name string, i.e. `R['column_name']`, a `ColUnit` object is returned, which represents a single column of the relation.

In [3]:
type(r['A'])

pysdql.core.dtypes.ColumnUnit.ColUnit

Besides `R['col']`, `pysdql` also supports `R.col` to select a single column.

In [4]:
type(r.A)

pysdql.core.dtypes.ColumnUnit.ColUnit
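
One common way to offer attribute-style access alongside item access is a `__getattr__` fallback. Whether `pysdql` does it this way is an assumption; `Column` and `SketchRelation` below are hypothetical names used only for this sketch.

```python
class Column:
    """Hypothetical single-column handle."""

    def __init__(self, name):
        self.name = name


class SketchRelation:
    """Hypothetical relation supporting both r['A'] and r.A."""

    def __init__(self, name):
        self.name = name

    def __getitem__(self, key):
        return Column(key)

    def __getattr__(self, attr):
        # __getattr__ runs only when normal attribute lookup fails,
        # so real attributes such as `name` are unaffected.
        if attr.startswith('__'):
            raise AttributeError(attr)  # leave special-method lookups alone
        return self[attr]


r = SketchRelation('R')
by_attr = r.A
by_item = r['A']
```

A side effect of this design is that a column whose name collides with a real attribute (here `name`) can only be reached through the item form.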

`ColUnit` represents a single column of the relation and is designed for composing conditional expressions, whereas `ColExpr` represents arithmetic operations between columns. For example, `r['A'] * r['B']` represents the multiplication of columns `A` and `B` of the relation `R`. `pysdql` supports addition (`+`), subtraction (`-`), multiplication (`*`) and division (`/`) by overloading the corresponding built-in methods as follows.

| Operation | Built-in Method |
| ------ | ----------- |
| `+` | `__add__` |
| `-` | `__sub__` |
| `*` | `__mul__` |
| `/` | `__truediv__` |

In [5]:
type((r['A'] + r['B']) * r['C'])

pysdql.core.dtypes.ColumnExpr.ColExpr
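
The general technique is that each overloaded operator returns a new expression object instead of computing a value. The following is a minimal sketch with hypothetical classes `Expr`, `Col` and `BinExpr`; it is not the `ColUnit`/`ColExpr` implementation, and the string format is illustrative only.

```python
class Expr:
    """Hypothetical base class: arithmetic builds a larger expression."""

    def __add__(self, other):
        return BinExpr(self, '+', other)

    def __sub__(self, other):
        return BinExpr(self, '-', other)

    def __mul__(self, other):
        return BinExpr(self, '*', other)

    def __truediv__(self, other):
        return BinExpr(self, '/', other)


class Col(Expr):
    """Hypothetical leaf node naming one column."""

    def __init__(self, name):
        self.name = name

    def __str__(self):
        return self.name


class BinExpr(Expr):
    """Hypothetical inner node: left operand, operator, right operand."""

    def __init__(self, left, op, right):
        self.left, self.op, self.right = left, op, right

    def __str__(self):
        return f'({self.left} {self.op} {self.right})'


expr = (Col('A') + Col('B')) * Col('C')
text = str(expr)
```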

When a `ColUnit` is compared with another value, a conditional expression is generated and a `CondUnit` object is returned. For example, `R['A'] > 1` selects the rows in which the value of `A` is greater than `1`. `pysdql` supports `>`, `>=`, `==`, `<=` and `<` by overloading the corresponding built-in methods as follows.

| Operation | Built-in Method |
| ------ | ----------- |
| `>` | `__gt__` |
| `>=` | `__ge__` |
| `==` | `__eq__` |
| `<=` | `__le__` |
| `<` | `__lt__` |

In [6]:
type(r['A'] > 1)

pysdql.core.dtypes.ConditionalUnit.CondUnit

Because the `ColUnit` class overloads `__eq__`, it must also define `__hash__`: a Python class that defines `__eq__` without `__hash__` becomes unhashable. Defining `__hash__` allows a `ColUnit` to appear as a dictionary key, which is needed to support aggregation functions.
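
This is standard Python behaviour and can be checked without `pysdql`. In the sketch below, `EqOnly` and `EqAndHash` are hypothetical classes, and hashing by column name is an assumption made for illustration.

```python
class EqOnly:
    """Defines __eq__ only, so Python sets __hash__ to None."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        return ('eq', self.name, other)  # builds an expression, not a bool


class EqAndHash:
    """Defines both, so instances can be used as dictionary keys."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        return ('eq', self.name, other)

    def __hash__(self):
        return hash(self.name)  # assumption: hash by column name


try:
    {EqOnly('A'): 'sum'}
    eq_only_hashable = True
except TypeError:
    eq_only_hashable = False

# Works: the key is hashable, e.g. to map a column to an aggregation.
agg = {EqAndHash('A'): 'sum'}
```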

`CondUnit` represents a collection of conditional expressions. The user can combine conditions with the logical operators `&` (and), `|` (or) and `~` (not), which are implemented by overloading the corresponding built-in methods as follows. Please be aware that the Python keywords `and`, `or` and `not` are not the same as these operators: they act on the truth value of their operands and cannot be overloaded.

| Operation | Built-in Method |
| ------ | ----------- |
| `&` | `__and__` |
| `\|` | `__or__` |
| `~` | `__invert__` |

In [7]:
type((r['A'] > 1) & (r['B'] > 1))

pysdql.core.dtypes.ConditionalUnit.CondUnit
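
The difference between `&` and the keyword `and` can be seen with a small stand-in class. `Cond` is hypothetical and the generated strings are illustrative only, in particular the `not(...)` form.

```python
class Cond:
    """Hypothetical condition node that records its own text."""

    def __init__(self, text):
        self.text = text

    def __and__(self, other):
        return Cond(f'({self.text} && {other.text})')

    def __or__(self, other):
        return Cond(f'({self.text} || {other.text})')

    def __invert__(self):
        return Cond(f'not({self.text})')


a, b = Cond('A > 1'), Cond('B > 1')

combined = a & b    # calls Cond.__and__ and builds a new condition
shortcut = a and b  # tests the truthiness of `a`, then simply returns `b`
```

With `and`, the first condition is silently dropped, which is why the overloaded operators have to be used instead.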

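A `TypeError` occurs when the conditions are not wrapped in parentheses `( )`, because `&` and `|` bind more tightly than comparison operators: `r['A'] > 1 & r['B'] > 1` is parsed as `r['A'] > (1 & r['B']) > 1`. Therefore, please always parenthesize the conditions on both sides of these operators.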

In [8]:
import traceback

try:
    r['A'] > 1 & r['B'] > 1
except Exception as e:
    traceback.print_exc()

Traceback (most recent call last):
  File "C:\Users\Y\AppData\Local\Temp\ipykernel_5220\513163917.py", line 4, in <cell line: 3>
    r['A'] > 1 & r['B'] > 1
TypeError: unsupported operand type(s) for &: 'int' and 'ColUnit'
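
The way Python groups the unparenthesized expression can be inspected with the standard `ast` module, without running any `pysdql` code. The sketch only parses the source text; the variable names are illustrative.

```python
import ast

source = "r['A'] > 1 & r['B'] > 1"
tree = ast.parse(source, mode='eval').body

# The top-level node is a chained comparison: left > middle > right.
is_chained_compare = isinstance(tree, ast.Compare) and len(tree.ops) == 2

# The middle operand is the bitwise-and `1 & r['B']`, which is what
# triggers the TypeError for 'int' and 'ColUnit'.
middle = tree.comparators[0]
middle_is_bitand = isinstance(middle, ast.BinOp) and isinstance(middle.op, ast.BitAnd)
```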


Reverse methods exist for the arithmetic and logical operators mentioned above. For example, `1 + r['A']` invokes the reverse method `__radd__` instead of `__add__`. The following table shows the reverse methods corresponding to the operators.

| Operation | Built-in Method | Reverse Method |
| ------ | ----------- | ----------- |
| `+` | `__add__` | `__radd__` |
| `-` | `__sub__` | `__rsub__` |
| `*` | `__mul__` | `__rmul__` |
| `/` | `__truediv__` | `__rtruediv__` |
| `&` | `__and__` | `__rand__` |
| `\|` | `__or__` | `__ror__` |
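
A reverse method is tried when the left operand does not know how to handle the right one, e.g. when `int.__add__` returns `NotImplemented` for a column object. The following is a minimal sketch with a hypothetical `Col` class and helper `_text`; note that operand order must be preserved for non-commutative operators such as `-`.

```python
def _text(value):
    """Hypothetical helper: render a column or a plain Python value."""
    return value.text if isinstance(value, Col) else repr(value)


class Col:
    """Hypothetical column expression that records its own text."""

    def __init__(self, text):
        self.text = text

    def __add__(self, other):   # col + other
        return Col(f'({self.text} + {_text(other)})')

    def __radd__(self, other):  # other + col
        return Col(f'({_text(other)} + {self.text})')

    def __sub__(self, other):   # col - other
        return Col(f'({self.text} - {_text(other)})')

    def __rsub__(self, other):  # other - col, operands must not be swapped
        return Col(f'({_text(other)} - {self.text})')


forward = Col('A') + 1
reverse = 1 + Col('A')
reverse_sub = 10 - Col('A')
```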

### Selection

Users can select rows that satisfy conditions by passing a `CondUnit` or `CondExpr` object to the slicing operation. For example, the following cells select the rows in which `A` is greater than `B` and `C` is less than `100`.

In [9]:
s = r[(r['A'] > r['B']) & (r['C'] < 100)]

let rmp = sum (<r_k, r_v> in R) if (r_k.A > r_k.B && r_k.C < 100) then { r_k } in


In [13]:
s = r[(r.A > r.B) & (r.C < 100)]

let rmp = sum (<r_k, r_v> in R) if (r_k.A > r_k.B && r_k.C < 100) then { r_k } in


Take columns `A` and `C` from the rows that satisfy the condition `(r['A'] > r['B']) & (r['C'] < 100)`.

In [10]:
r[(r['A'] > r['B']) & (r['C'] < 100)][['A', 'C']].show()

let rmp = sum (<r_k, r_v> in R) if (r_k.A > r_k.B && r_k.C < 100) then { r_k } in
let tmp = sum (<r_k, r_v> in rmp) { < A=r_k.A, C=r_k.C > } in
tmp


The function `show()` indicates that the result should be printed on the terminal of the SDQL database. It simply repeats the name of the most recently generated relation and returns `None`. `show()` is required at the end of the Python script, since `pysdql` cannot otherwise recognize the end of the queries.
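
The behaviour described above can be mimicked with a small accumulator: each operation appends one `let ... in` line, and `show()` closes the script with the latest name. The class `Script` and its methods `bind` and `show` are hypothetical and only reproduce the shape of the output shown earlier; they are not the `pysdql` implementation.

```python
class Script:
    """Hypothetical accumulator for generated SDQL statements."""

    def __init__(self):
        self.lines = []
        self.latest = None

    def bind(self, name, body):
        # Each operation adds one `let <name> = <body> in` line.
        self.lines.append(f'let {name} = {body} in')
        self.latest = name
        return self  # allow chaining

    def show(self):
        # Close the script by naming the latest relation; returns None.
        self.lines.append(self.latest)


script = Script()
result = (
    script
    .bind('rmp', 'sum (<r_k, r_v> in R) if (r_k.A > r_k.B && r_k.C < 100) then { r_k }')
    .bind('tmp', 'sum (<r_k, r_v> in rmp) { < A=r_k.A, C=r_k.C > }')
    .show()
)
generated = '\n'.join(script.lines)
```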

### Other Conditions

#### isin()

In [11]:
r = pysdql.Relation('R')
s = pysdql.Relation('S')

s[s['A'].isin(['apple', 'banana'])].show()

let smp = sum (<s_k, s_v> in S) if (s_k.A == "banana" || s_k.A == "apple") then { s_k } in
smp
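
The generated query suggests that a membership test is expanded into equality tests joined by `||`. A minimal sketch of such an expansion follows; the helper `isin_condition` and its parameters are hypothetical, and the order of the generated tests may differ from what `pysdql` emits, which does not change the result of a disjunction.

```python
def isin_condition(column, values, row='s_k'):
    """Hypothetical helper: expand a membership test into `==` tests joined by `||`."""
    tests = [f'{row}.{column} == "{value}"' for value in values]
    return '(' + ' || '.join(tests) + ')'


cond = isin_condition('A', ['apple', 'banana'])
```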


#### exists()

In [12]:
r[s.exists()].show()

let rmp = sum (<r_k, r_v> in R) if ((sum (<s_k, s_v> in S) s_v) > 0) then { r_k } in
rmp


## Group By