# Dataform

## Dataform Overview

**Dataform** is a serverless service for data analysts to develop and deploy tables, incremental tables, or views to BigQuery. Dataform offers a web environment for SQL workflow development, connection with GitHub, GitLab, Azure DevOps Services, and Bitbucket, continuous integration, continuous deployment, and workflow execution.

**Dataform** lets you manage data transformation in the Extraction, Loading, and Transformation (ELT) process for data integration. After raw data is extracted from source systems and loaded into BigQuery, Dataform helps you to transform it into a well-defined, tested, and documented suite of data tables.

**Main features**:
- Develop and execute SQL workflows for data transformation.
- Collaborate with team members on SQL workflow development through Git.
- Manage a large number of tables and their dependencies.
- Declare source data and manage table dependencies.
- View a visualization of the dependency tree of your SQL workflow.
- Manage data with SQL code in a central repository.
- Reuse code with JavaScript.
- Test data correctness with quality tests on source and output tables.
- Version control SQL code.
- Document data tables inside SQL code.

**Repository Project**

**1. Types of files:** (definitions and includes go in the `definitions/` and `includes/` folders, respectively)

- **Config files** (`JSON` or `SQLX` files): let you configure your SQL workflows. They contain general configuration, execution schedules, or schema for creating new tables and views.
- **Definitions**: `SQLX` and `JavaScript` files that define new tables, views, and additional SQL operations to run in BigQuery.
- **Includes**: JavaScript files where you can define variables and functions to use in your project.
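
As a minimal sketch of an include, assuming a hypothetical file `includes/constants.js`: the constant, the function name `countryGroup`, and the country lists are illustrative and not part of any Dataform API.

```javascript
// includes/constants.js (hypothetical file name and contents)

// A shared constant, reusable from any file in the repository.
const PROJECT_ID = "my-gcp-project-id";

// Returns a SQL CASE expression that maps a country-code column to a region.
function countryGroup(column) {
  return `CASE
    WHEN ${column} IN ('US', 'CA') THEN 'NA'
    WHEN ${column} IN ('GB', 'FR', 'DE') THEN 'EU'
    ELSE 'Other'
  END`;
}

module.exports = { PROJECT_ID, countryGroup };
```

In a SQLX file, a top-level include is referenced through its file name, for example `${constants.countryGroup("country_code")}`.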

**2. Workflow development and version control**

In Dataform, **workflow development** works like local development with Git: you can pull changes from the repository, commit all or selected changes, and push them to Git branches of the repository.

During workflow development, you can:

- Develop the following SQL workflow actions:
  - Source data declarations
  - Tables and views
  - Incremental tables
  - Table partitions and clusters
  - Dependencies between actions
  - Documentation of tables
  - Custom SQL operations
  - BigQuery labels
  - BigQuery policy tags
  - Dataform tags
  - Data quality tests, called assertions
- Use JavaScript to reuse your Dataform SQL workflow code:
  - Across a file with code encapsulation
  - Across a repository with includes
  - Across repositories with packages

**3. Workflow compilation**

**4. Workflow execution**

- You can schedule Dataform executions in BigQuery in the following ways:
  - Create workflow configurations to schedule executions of compilation results created in release configurations
  - Schedule executions with Cloud Composer
  - Schedule executions with Workflows and Cloud Scheduler

- To debug errors, you can monitor executions in the following ways:
  - View detailed Dataform execution logs
  - View audit logs for Dataform
  - View Cloud Logging logs for Dataform

### Terms

1. **Release configuration**: lets you configure how Dataform compiles the code of your repository. If your repository is connected to a remote Git repository, you can create release configurations from different branches. Dataform pulls code from the remote Git repository before compiling it.
2. **Workflow configuration**: lets you schedule workflow executions.
3. **Development workspace**: an editable copy of the repository in the Google Cloud web UI. It plays the same role as a local Git development branch.
4. **Dataform core package**: the framework that compiles your SQLX and JavaScript code. Its version matters in the same way the Python version matters in a Python project.

## Administer & Control Access

### Setup Repository

https://cloud.google.com/dataform/docs/create-repository

### Connect to Git repository

https://cloud.google.com/dataform/docs/connect-repository

### Configure Dataform Settings


#### `workflow_settings.yaml`
The repository's `workflow_settings.yaml` file stores Dataform workflow settings in `YAML` format.

```yaml
defaultProject: my-gcp-project-id             # BigQuery Google Cloud project ID
defaultDataset: dataform                      # BigQuery dataset in which Dataform creates assets
defaultLocation: asia-southeast1              # default BigQuery dataset region
defaultAssertionDataset: dataform_assertions  # BigQuery dataset in which Dataform creates views with assertion results
vars:
  executionSetting: dev
  environmentName: development
```

See all [configs reference for workflow settings](https://dataform-co.github.io/dataform/docs/configs-reference#workflowsettings)

**Access the properties in Dataform code**

In code, `workflow_settings.yaml` options are available on `dataform.projectConfig` under these names:
- `defaultProject` => `defaultDatabase`.
- `defaultDataset` => `defaultSchema`.
- `defaultAssertionDataset` => `assertionSchema`.
- `projectSuffix` => `databaseSuffix`.
- `datasetSuffix` => `schemaSuffix`.
- `namePrefix` => `tablePrefix`.

Use the `when` function to switch SQL on the value of a variable:
```SQL
${when(dataform.projectConfig.vars.YOUR_VARIABLE === "SET_VALUE", "CONDITION", "ELSE")}
```

```SQL
config { type: "view" }
SELECT ${when(
  dataform.projectConfig.tablePrefix,
  "'table prefix is set!'",
  "'table prefix is not set!'"
)}
```

```SQL
select
  *
from ${ref("data")}
${when(
  dataform.projectConfig.vars.executionSetting === "staging",
  "where mod(farm_fingerprint(id), 10) = 0"
)}
```
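
To see what such a snippet renders to, here is a small sketch that runs outside Dataform. It assumes `when(condition, trueCase, falseCase)` returns `trueCase` if the condition is truthy and `falseCase` (an empty string by default) otherwise. The `projectConfig` object is a hand-written stand-in for `dataform.projectConfig`, and `samplingFilter` is a hypothetical helper name.

```javascript
// Stand-in for Dataform's built-in `when` helper (assumed behaviour).
function when(condition, trueCase, falseCase = "") {
  return condition ? trueCase : falseCase;
}

// Hand-written stand-in for `dataform.projectConfig`; values mirror the YAML above.
const projectConfig = {
  defaultDatabase: "my-gcp-project-id", // from defaultProject
  defaultSchema: "dataform",            // from defaultDataset
  vars: { executionSetting: "dev", environmentName: "development" },
};

// Renders the optional sampling filter for a given configuration.
function samplingFilter(config) {
  return when(
    config.vars.executionSetting === "staging",
    "where mod(farm_fingerprint(id), 10) = 0"
  );
}

console.log(JSON.stringify(samplingFilter(projectConfig))); // prints ""
```

With `executionSetting: dev` the filter is empty, so the query reads the whole table. Only a `staging` value adds the sampling clause.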

### Manage Core Packages

- If you use **only the Dataform core package, with no additional packages**: set the Dataform core version in `workflow_settings.yaml`.

```yaml
dataformCoreVersion: "3.0.0"        # As a best practice, always use the latest available version of the Dataform core framework                
defaultProject: my-gcp-project-id   # BigQuery Google Cloud project ID
defaultDataset: dataform            # BigQuery dataset in which Dataform creates assets
```
- If you use **the Dataform core package plus additional packages**: declare both in `package.json`.

 ```json
 {
   "name": "repository-name",
   "dependencies": {
     "@dataform/core": "3.0.0",
     "dataform-scd": "https://github.com/dataform-co/dataform-scd/archive/0.3.tar.gz"
   }
 }
 ```
   > In this case, remove `dataformCoreVersion` from `workflow_settings.yaml`.

                   

### Control Access

https://cloud.google.com/dataform/docs/required-access

## Development

### Dataform Core (SQLX)

You can use Dataform core for the following purposes:
- Defining tables, views, materialized views, or incremental tables.
- Defining data transformation logic.
- Declaring source data and managing table dependencies.
- Documenting table and column descriptions inside code.
- Reusing functions and variables across different queries.
- Writing data assertions to ensure data consistency.

>You can compile and run Dataform core locally through the Dataform CLI outside of Google Cloud.

A SQLX file consists of a **config block** and a **body**.

#### Config block

In the config block, you can perform the following actions:
- **Specify query metadata**: configure how Dataform materializes queries into BigQuery, for example the output table type, the target database, or labels using the config metadata.
- **Document data**: document your tables and their fields directly.
- **Define data quality tests** (called `assertions`): check for uniqueness, null values, or a custom condition. Assertions run after table creation. You can also define assertions outside the config block, in a separate SQLX file.
> All config properties, and the config block itself, are optional.

```SQL
config {
  type: "table",
  description: "This table joins orders information from OnlineStore & payment information from PaymentApp",
  columns: {
    order_date: "The date when a customer placed their order",
    id: "Order ID as defined by OnlineStore",
    order_status: "The status of an order e.g. sent, delivered",
    customer_id: "Unique customer ID",
    payment_status: "The status of a payment e.g. pending, paid",
    payment_method: "How the customer chose to pay",
    item_count: "The number of items the customer ordered",
    amount: "The amount the customer paid"
  },
  assertions: {
    uniqueKey: ["id"]
  }
}
```
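
To make the `uniqueKey` assertion concrete, the sketch below builds the kind of query such a check boils down to: group by the key columns and return any key that appears more than once, so an empty result means the check passes. This illustrates the idea only and is not the SQL that Dataform actually generates. The function name `uniqueKeySql` and the table name are hypothetical.

```javascript
// Builds a duplicate-detection query for the given key columns.
// An empty result set means every key combination is unique.
function uniqueKeySql(table, keyColumns) {
  const keys = keyColumns.join(", ");
  return [
    `SELECT ${keys}, COUNT(*) AS row_count`,
    `FROM ${table}`,
    `GROUP BY ${keys}`,
    `HAVING COUNT(*) > 1`,
  ].join("\n");
}

console.log(uniqueKeySql("dataform.orders", ["id"]));
```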

#### SQLX body

In the SQLX body, you can perform the following actions:
- **Define a table and its dependencies**: use SQL `SELECT` statements and the `ref` function.

The `ref` function lets you **reference tables defined in your project instead of hard coding** the schema and table name. Dataform uses these references to **build a dependency tree of all the tables** to be created or updated.

```SQL
config { type: "table" }

SELECT
  order_date AS date,
  order_id AS order_id,
  order_status AS order_status,
  SUM(item_count) AS item_count,
  SUM(amount) AS revenue

FROM ${ref("store_clean")}

GROUP BY 1, 2, 3
```

After compilation, the SQL code is:

```SQL
CREATE
OR REPLACE TABLE Dataform.orders AS
SELECT
    order_date AS date,
    order_id AS order_id,
    order_status AS order_status,
    SUM(item_count) AS item_count,
    SUM(amount) AS revenue
FROM
    Dataform_stg.store_clean
GROUP BY
    1,
    2,
    3
```

- **Define additional SQL operations to run in BigQuery**: to execute one or more SQL statements before or after creating a table or view, [specify pre-query and post-query operations](https://cloud.google.com/dataform/docs/custom-sql).


```SQL
SELECT * FROM ...

post_operations {
  GRANT `roles/bigquery.dataViewer` ON TABLE ${self()} TO "group:someusers@dataform.co"
}
```

- **Generate SQL code with a JavaScript block**: define reusable functions to generate repetitive parts of SQL code.

Note: Code defined in a **JavaScript block can be reused only inside the SQLX file where the block is defined**. To reuse code across your entire repository, create **includes**.

```SQL
js {
  const columnName = "foo";
}

SELECT 1 AS ${columnName} FROM "..."
```
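
As a sketch of generating repetitive SQL, the helper below turns a list of column names into matching `SUM(...)` expressions. Inside a SQLX file the function would sit in a `js { ... }` block. The name `sumColumns` and the `orders` table are illustrative choices.

```javascript
// Turns ["item_count", "amount"] into one SUM expression per column.
function sumColumns(columns) {
  return columns.map((c) => `SUM(${c}) AS ${c}`).join(",\n  ");
}

// Interpolate the generated expressions into a query, as a SQLX body would.
const query = `SELECT
  order_date,
  ${sumColumns(["item_count", "amount"])}
FROM orders
GROUP BY 1`;

console.log(query);
```

Adding a metric then means adding one entry to the array rather than another hand-written line of SQL.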

### Concept of workspace

**Compiled graph**

Filter the graph by the following properties:
- Name
- Tag
- Type
    - Assertion
    - Declaration
    - Incremental Table
    - Materialized view
    - Operations
    - Table
    - Unknown
    - View

You can select multiple filters at once. Dataform applies them with the `OR` condition.

**Repository Structure**

- `definitions/`: a directory for asset definitions, in Dataform core or JavaScript.
- `includes/`: an empty directory for scripts and variables that you can reuse across the repository.
- `workflow_settings.yaml` (`dataform.json` in Dataform core versions earlier than 3.0.0): the default Dataform configuration file containing the Google Cloud project ID and BigQuery schema to publish assets in. You can override the default settings to customize them to your needs, but it's not a requirement to begin using Dataform.
- `package.json`: the default Dataform dependencies configuration file with the latest version of @dataform/core. You can use this file to import packages.
- `definitions/sample.sqlx`: a sample SQLX file to help you get started.

### Dataform Tables

https://cloud.google.com/dataform/docs/tables

**1. Types of table**
- `table`: a regular table.
- `incremental`: an incremental table, updated by inserting new records (for example, by date) instead of being rebuilt. It typically includes a `where` clause that selects only the new rows.
- `view`: a view.
    - `materialized`: set `materialized: true` on a view to store the query results (a mix of `table` and `view`). This improves query performance, at the cost of storage and of keeping the stored data refreshed.

> Other values of `type`: `operations`, `declaration`, `assertion`, ...
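
The `where` clause of an incremental table usually applies only on incremental runs. The sketch below shows that pattern as a plain JavaScript function that takes a context object with `ref`, `self`, `incremental`, and `when`, modelled on the helpers of the same names available inside Dataform queries. The stub context, the table names, and the `updated_at` column are assumptions made so the example runs outside Dataform.

```javascript
// Builds the query for an incremental table from a context object.
function incrementalQuery(ctx) {
  const filter = ctx.when(
    ctx.incremental(),
    `WHERE updated_at > (SELECT MAX(updated_at) FROM ${ctx.self()})`
  );
  return `SELECT * FROM ${ctx.ref("events")} ${filter}`.trim();
}

// Stub context standing in for Dataform (assumption, for illustration only).
function stubContext(isIncremental) {
  return {
    ref: (name) => `\`my-gcp-project-id.dataform.${name}\``,
    self: () => "`my-gcp-project-id.dataform.events_incremental`",
    incremental: () => isIncremental,
    when: (cond, yes, no = "") => (cond ? yes : no),
  };
}

console.log(incrementalQuery(stubContext(false))); // full refresh: no WHERE clause
console.log(incrementalQuery(stubContext(true)));  // incremental run: only newer rows
```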

**2. [Partitions and clusters](https://cloud.google.com/dataform/docs/partitions-clusters)**

**3. [Table/Field description](https://cloud.google.com/dataform/docs/document-tables)**

**4. [Assertions](https://cloud.google.com/dataform/docs/assertions)**
- Test and validate output tables. Dataform runs assertions every time it updates your SQL workflow and alerts you if any assertions fail.

**5. [Config additional table settings](https://cloud.google.com/dataform/docs/table-settings)**
- Override default table settings, such as database or schema, and disable table creation, or execute a SQL statement before or after table creation

**6. [Table labels](https://cloud.google.com/dataform/docs/labels)**

**7. [Setting column-level access control](https://cloud.google.com/bigquery/docs/column-level-security-intro)**

#### Create table

**1. `ref`** function: references objects defined in your Dataform SQL workflow and automatically adds them as dependencies, so you don't hard code schema and table names.

- `${ref("database", "schema", "name")}`: `project_id.schema.name`
- `${ref("schema", "name")}`: `default_project_id.schema.name`
- `${ref("name")}`: `default_project_id.default_schema.name`

**2. `resolve`** function: similar to `ref`, but does not add the referenced table as a dependency of this action.
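
The three call shapes listed for `ref` can be sketched as a small resolver that fills in defaults for the missing parts. This only illustrates the naming rule: it does not register dependencies the way the real `ref` does, and the function name `qualifiedName` and the default values are hypothetical.

```javascript
// Defaults that would come from workflow_settings.yaml (illustrative values).
const defaults = { project: "my-gcp-project-id", dataset: "dataform" };

// Accepts ("name"), ("schema", "name") or ("database", "schema", "name").
function qualifiedName(...parts) {
  if (parts.length < 1 || parts.length > 3) {
    throw new Error("expected 1 to 3 arguments");
  }
  const name = parts[parts.length - 1];
  const schema = parts.length >= 2 ? parts[parts.length - 2] : defaults.dataset;
  const database = parts.length === 3 ? parts[0] : defaults.project;
  return `${database}.${schema}.${name}`;
}

console.log(qualifiedName("store_clean"));
console.log(qualifiedName("staging", "store_clean"));
console.log(qualifiedName("other-project", "staging", "store_clean"));
```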


### SQL Workflow

### Variables and Functions

### Workspace compilation

## Execution & Monitoring

### Trigger Execution

### Schedule Execution 

### Monitoring

## Best Practice

## Example