Tutorial: Data Abstraction
==========================

Objective:

> Learn how to specify a project's data-source requirements independently of the physical data, improving portability while maintaining reproducibility.

Principles:

1. Projects are decoupled from physical data sources through their _schemas_.
2. The runtime platform resolves the project's data requirements using one of its configured _feeds_, provided that one of them can supply the requested schemas.
3. This tutorial workspace is provided with two dummy schemas (see the local [tutorial.py](tutorial.py) module): `tutorial.Dummy` 
4. This tutorial platform is configured with a feed capable of resolving data for these schemas.
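
The principles above can be illustrated with a minimal, library-independent sketch. It is not ForML's implementation: the `Schema`, `Feed` and `resolve` names are hypothetical and only model the idea that a project names a schema while the platform picks a feed able to serve it.

```python
from dataclasses import dataclass
from typing import Callable

Row = dict[str, object]


@dataclass(frozen=True)
class Schema:
    """Logical dataset description: a name plus column names (hypothetical)."""

    name: str
    columns: tuple[str, ...]


class Feed:
    """Platform-side mapping of schemas to physical loaders (hypothetical)."""

    def __init__(self) -> None:
        self._loaders: dict[Schema, Callable[[], list[Row]]] = {}

    def register(self, schema: Schema, loader: Callable[[], list[Row]]) -> None:
        self._loaders[schema] = loader

    def supports(self, schema: Schema) -> bool:
        return schema in self._loaders

    def load(self, schema: Schema) -> list[Row]:
        return self._loaders[schema]()


def resolve(schema: Schema, feeds: list[Feed]) -> list[Row]:
    """Return data from the first configured feed able to serve the schema."""
    for feed in feeds:
        if feed.supports(schema):
            return feed.load(schema)
    raise LookupError(f"No configured feed provides {schema.name}")


# The project only ever refers to the schema ...
DUMMY = Schema("Dummy", ("Title", "Key"))

# ... while each platform decides where the rows physically come from.
feed = Feed()
feed.register(DUMMY, lambda: [{"Title": "alpha", "Key": 27}])

rows = resolve(DUMMY, [feed])
```

In this sketch, moving the project to another platform means registering a different loader for the same schema; the project code itself stays unchanged.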


In [1]:
from datetime import datetime, date

from forml.project import Source
from forml.pipeline.payload import ToPandas

from tutorial import Dummy

Basic Loading of the Dummy Dataset
----------------------------------
When submitting a workflow 

In [2]:
STATEMENT = Dummy

SOURCE = Source.query(STATEMENT)
PIPELINE = ToPandas()

SOURCE.bind(PIPELINE).launcher.apply()

INFO: 2023-05-10 13:40:09,443: lazy: Loading Dummy


Unnamed: 0,Title,Key,Target,Timestamp
0,alpha,27,0.314,2021-05-11 17:12:24
1,beta,11,-1.12,2020-11-03 01:24:56


Using Advanced Query DSL to Refine the Data Requirements
--------------------------------------------------------
The statement passed to `Source.query` can be refined using an advanced [Query DSL Syntax](https://docs.forml.io/en/latest/dsl/query/syntax.html) interpreted by the platform:

* column (expression) projection ([.select()](https://docs.forml.io/en/latest/dsl/query/design.html#forml.io.dsl.Queryable.select) method)
* expression-based filtering ([.where()](https://docs.forml.io/en/latest/dsl/query/design.html#forml.io.dsl.Queryable.where) method)
* aggregation ([.groupby()](https://docs.forml.io/en/latest/dsl/query/design.html#forml.io.dsl.Queryable.groupby) method)
* joining multiple schemas ([.*_join()](https://docs.forml.io/en/latest/dsl/query/design.html#forml.io.dsl.Origin.inner_join) methods)
* and more...
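
To make the first two operations concrete, here is a plain-Python sketch of what projection and filtering mean, applied to the two rows from the basic loading output above. The `select` and `where` helper functions are hypothetical stand-ins written for this illustration; they are not the DSL methods, and a feed is free to evaluate a query differently (for example by delegating it to the underlying data source).

```python
from datetime import datetime

# The two rows shown in the basic loading output above.
ROWS = [
    {"Title": "alpha", "Key": 27, "Target": 0.314,
     "Timestamp": datetime(2021, 5, 11, 17, 12, 24)},
    {"Title": "beta", "Key": 11, "Target": -1.12,
     "Timestamp": datetime(2020, 11, 3, 1, 24, 56)},
]


def select(rows, *columns):
    """Projection: keep only the named columns of every row."""
    return [{name: row[name] for name in columns} for row in rows]


def where(rows, predicate):
    """Filtering: keep only the rows satisfying the predicate."""
    return [row for row in rows if predicate(row)]


# Drop the Target column, then keep rows newer than the start of 2021.
projected = select(ROWS, "Title", "Key", "Timestamp")
recent = where(projected, lambda row: row["Timestamp"] > datetime(2021, 1, 1))
```

Here `projected` still holds both rows but without the `Target` column, while `recent` keeps only the `alpha` row.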

### Exercise: Extend the Basic Dummy Dataset Loading to Select Just the Title, Key and Timestamp Columns

Hints:
* use the [Dummy.select()](https://docs.forml.io/en/latest/dsl/query/design.html#forml.io.dsl.Queryable.select) method
* schema columns are referenced using the syntax of `Schema.ColumnName`

In [3]:
STATEMENT = Dummy.select(
    Dummy.Title,
    Dummy.Key,
    Dummy.Timestamp
)

SOURCE = Source.query(STATEMENT)
PIPELINE = ToPandas()

SOURCE.bind(PIPELINE).launcher.apply()

INFO: 2023-05-10 13:40:15,669: lazy: Loading Dummy


Unnamed: 0,Title,Key,Timestamp
0,alpha,27,2021-05-11 17:12:24
1,beta,11,2020-11-03 01:24:56


### Exercise: Extend the STATEMENT to Keep Only Rows with a Timestamp After 2021-01-01

Hints: 
* use the [Dummy.where()](https://docs.forml.io/en/latest/dsl/query/design.html#forml.io.dsl.Queryable.where) method
* native Python operators and literals (e.g. integers, strings, but also `datetime` instances) can be used directly on Schema columns to compose expressions

In [4]:
STATEMENT = Dummy.select(
    Dummy.Title,
    Dummy.Key,
    Dummy.Timestamp
).where(Dummy.Timestamp > datetime(2021, 1, 1))

SOURCE = Source.query(STATEMENT)
PIPELINE = ToPandas()

SOURCE.bind(PIPELINE).launcher.apply()

INFO: 2023-05-10 13:40:21,014: lazy: Loading Dummy


Unnamed: 0,Title,Key,Timestamp
0,alpha,27,2021-05-11 17:12:24
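
The condition `Dummy.Timestamp > datetime(2021, 1, 1)` used above presumably does not compute a boolean when the statement is declared; it describes a comparison that the platform evaluates later against the actual data. The sketch below shows how Python operator overloading makes this possible. The `Column` and `GreaterThan` classes are hypothetical and are not ForML's actual types.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any


@dataclass(frozen=True)
class GreaterThan:
    """Deferred `left > right` comparison (hypothetical expression node)."""

    left: "Column"
    right: Any

    def evaluate(self, row: dict[str, Any]) -> bool:
        # Only now is the comparison applied to concrete data.
        return row[self.left.name] > self.right


@dataclass(frozen=True)
class Column:
    """Named column reference (hypothetical stand-in for a schema field)."""

    name: str

    def __gt__(self, other: Any) -> GreaterThan:
        # Record the comparison instead of computing it.
        return GreaterThan(self, other)


# Building the condition touches no data; it merely creates an object.
condition = Column("Timestamp") > datetime(2021, 1, 1)

rows = [
    {"Title": "alpha", "Timestamp": datetime(2021, 5, 11, 17, 12, 24)},
    {"Title": "beta", "Timestamp": datetime(2020, 11, 3, 1, 24, 56)},
]
matching = [row["Title"] for row in rows if condition.evaluate(row)]
```

Because the condition is an object rather than a result, whichever component holds the data can inspect it, translate it, or evaluate it.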


### Exercise: Extend the STATEMENT to Join the Dummy Schema with...using the `Dummy.Key == ...`

Hints: 
* use the [Dummy.inner_join()](https://docs.forml.io/en/latest/dsl/query/design.html#forml.io.dsl.Origin.inner_join) method