# FugueSQL Operators

The previous section covered FugueSQL syntax. In addition to the standard SQL operations, FugueSQL implements some extra keywords, and more are being added. Each of these keywords has an equivalent method in the programming interface. Some of the operations can be expressed in standard SQL, but others would be awkward to write.

FugueSQL aims to make coding fun and more English-like. Instead of outlining the rules, our goal is to provide an intuitive interface.

In [None]:
# Import
from fugue_sql import FugueSQLWorkflow

# Defining data
data = [
    ["A", "2020-01-01", 10],
    ["A", "2020-01-02", None],
    ["A", "2020-01-03", 30],
    ["B", "2020-01-01", 20],
    ["B", "2020-01-02", None],
    ["B", "2020-01-03", 40]
]
schema = "id:str,date:date,value:double"

## Input and Output Operations

## PRINT

Prints a DataFrame
* DATAFRAME - If not provided, takes the last
* ROWS - Number of rows
* ROWCOUNT - Displays the number of rows in the dataframe. This is expensive for Spark and Dask. In distributed environments, persisting the dataframe before this operation might help.
* TITLE - Title for display.

Usage:

`PRINT [dataframes] [ROWS int] [ROWCOUNT] [TITLE "title"]`

In [None]:
with FugueSQLWorkflow() as dag:
    dag("""
    CREATE [[0,"hello"],[1,"world"]] SCHEMA a:int,b:str
    PRINT ROWS 2 ROWCOUNT TITLE "xyz" 
    """)

## LOAD

Loads a CSV, JSON, or PARQUET file as a DataFrame
* FILE TYPE - Parquet, CSV, JSON
* PARAMS - Passed on to underlying execution engine loading method
* COLUMNS - Columns to grab or schema to load it in as

Usage:

`LOAD [PARQUET|CSV|JSON] path (params) [COLUMNS schema|columns]`

## SAVE (or SAVE AND USE)

Saves a DataFrame as a CSV, JSON, or PARQUET file. SAVE AND USE also returns the saved dataframe, so there is no need to load it back in.

* DATAFRAME - If not provided, takes the last
* PREPARTITION - Partitions for file
* MODE - Overwrite, append, or to (error if exists)
* SINGLE - One file output
* FILE TYPE - Parquet, CSV, JSON
* PARAMS - Passed on to underlying execution engine saving method

Usage:

`SAVE [dataframe] [PREPARTITION statement] [OVERWRITE|APPEND|TO] [SINGLE] [PARQUET|CSV|JSON] path [(params)]`

or 

`SAVE AND USE [dataframe] [PREPARTITION statement] [OVERWRITE|APPEND|TO] [SINGLE] [PARQUET|CSV|JSON] path [(params)]`

In [None]:
# Save and Load example
with FugueSQLWorkflow() as dag:
    dag("""
    CREATE [[0,"1"]] SCHEMA a:int,b:str
    SAVE OVERWRITE "/tmp/f.parquet"
    SAVE OVERWRITE "/tmp/f.csv" (header=true)
    SAVE OVERWRITE "/tmp/f.json"
    SAVE OVERWRITE PARQUET "/tmp/f"
    """)
    dag("""
    LOAD "/tmp/f.parquet" PRINT
    LOAD "/tmp/f.parquet" COLUMNS a PRINT
    LOAD PARQUET "/tmp/f" PRINT
    LOAD "/tmp/f.csv" (header=true) PRINT
    LOAD "/tmp/f.csv" (header=true) COLUMNS a:int,b:str PRINT
    LOAD "/tmp/f.json" PRINT
    LOAD "/tmp/f.json" COLUMNS a:int,b:str PRINT
    """)

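For readers who think in pandas, the CSV part of the example above can be approximated as follows. This is only a rough analogue under the assumption that `(header=true)` and `COLUMNS` map onto the pandas options shown; it is not how Fugue implements SAVE and LOAD, and the temporary path is illustrative.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame([[0, "1"]], columns=["a", "b"])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "f.csv")

    # Roughly: SAVE OVERWRITE "<path>" (header=true)
    df.to_csv(path, index=False, header=True)

    # Roughly: LOAD "<path>" (header=true) COLUMNS a:int,b:str
    # (a schema both selects the columns and sets their types)
    typed = pd.read_csv(path, header=0, dtype={"a": "int64", "b": str})

    # Roughly: LOAD "<path>" (header=true) COLUMNS a
    # (a plain column list only selects columns)
    only_a = pd.read_csv(path, header=0, usecols=["a"])

print(typed)
print(only_a)
```

Without the schema, a CSV reader has to guess the column types, which is why the CSV example passes `COLUMNS a:int,b:str`.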
## Partitioning

Partitioning is an important part of distributed computing. We arrange the data into logical partitions and then perform operations on each one. Partitioning is normally used in conjunction with Fugue extensions.

## PREPARTITION

Partitions a dataframe in preparation for a following operation. Use either NUMBER or BY.
* ALGO - RAND, HASH, or EVEN
* NUMBER - Number of partitions
* BY - What columns to partition on
* PREPARTITION - Partitions for file
* PRESORT - Presort hint. Check PRESORT syntax

Usage:

`TRANSFORM df [RAND|HASH|EVEN] PREPARTITION BY a,b PRESORT c DESC USING count`

or

`TRANSFORM df [RAND|HASH|EVEN] PREPARTITION 100 USING count`

## PRESORT

Defines a presort before another operation. This is never used alone.

* COLUMN - Column in DataFrame
* ORDER - ASC or DESC

Usage:

`PRESORT a DESC, b ASC`

The example below shows how to use PREPARTITION and PRESORT. We need to define a transformer to apply it with.

In [None]:
# Partitioning Example. We need to make a Fugue Transformer first
import pandas as pd

# schema: *, shift:double
def shift(df: pd.DataFrame) -> pd.DataFrame:
    df['shift'] = df['value'].shift()
    return df

with FugueSQLWorkflow() as dag:
    df = dag.df(data, schema)    # data and schema defined at top
    dag("""
    TRANSFORM df PREPARTITION BY id PRESORT date ASC USING shift
    PRINT
    """)

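To make the effect of `PREPARTITION BY id PRESORT date ASC` concrete, here is a plain-pandas sketch of the same logic: split by `id`, sort each group by `date`, then apply the function to each group separately. It is an analogue for intuition only, not what Fugue does internally; the variable names `pdf`, `parts`, and `shifted` are made up for this illustration.

```python
import pandas as pd

data = [
    ["A", "2020-01-01", 10.0],
    ["A", "2020-01-02", None],
    ["A", "2020-01-03", 30.0],
    ["B", "2020-01-01", 20.0],
    ["B", "2020-01-02", None],
    ["B", "2020-01-03", 40.0],
]
pdf = pd.DataFrame(data, columns=["id", "date", "value"])


def shift(df: pd.DataFrame) -> pd.DataFrame:
    # Same logic as the transformer above, on a copy so the input is untouched
    return df.assign(shift=df["value"].shift())


# PREPARTITION BY id -> one group per id
# PRESORT date ASC   -> each group is sorted by date before the function runs
parts = [shift(group.sort_values("date")) for _, group in pdf.groupby("id")]
shifted = pd.concat(parts, ignore_index=True)
print(shifted)
```

Because the function runs once per partition, the first row of every `id` gets a NULL `shift` value instead of inheriting the last value of the previous `id`.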
## Column and Schema Operations

## RENAME COLUMNS

* PARAMS - Pairs of old_name:new_name
* DATAFRAME - If not provided, takes the last

Usage:

`RENAME COLUMNS a:aa, b:bb FROM df`

## ALTER COLUMNS

Changes data type of columns

* PARAMS - Pairs of column_name:dtype
* DATAFRAME - If not provided, takes the last

Usage:

`ALTER COLUMNS a:int, b:int FROM df`

## DROP COLUMNS

Drops columns from DataFrame

* COLUMNS - Column names
* IF EXISTS - Skips columns that do not exist. Without it, a missing column raises an error.
* DATAFRAME - If not provided, takes the last

Usage:

`DROP COLUMNS a, b IF EXISTS FROM df`
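
The three column operations above have close pandas counterparts, sketched below for intuition. The sample frame and the missing column `x` are made up for this illustration, and `int64` is an assumption; the integer width FugueSQL uses for `int` may differ.

```python
import pandas as pd

df = pd.DataFrame({"a": ["1", "2"], "b": ["3", "4"], "c": [0.5, 1.5]})

# Roughly: RENAME COLUMNS a:aa, b:bb FROM df
renamed = df.rename(columns={"a": "aa", "b": "bb"})

# Roughly: ALTER COLUMNS a:int, b:int FROM df
altered = df.astype({"a": "int64", "b": "int64"})

# Roughly: DROP COLUMNS a, x IF EXISTS FROM df
# "x" is not in the frame; errors="ignore" skips it instead of raising
dropped = df.drop(columns=["a", "x"], errors="ignore")

print(renamed.columns.tolist())
print(altered.dtypes)
print(dropped.columns.tolist())
```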

## NULL Handling

## DROP ROWS

Drops rows from DataFrame containing NULLs

* HOW - ALL or ANY. ANY drops a row if any value is NULL; ALL drops a row only if all of its values are NULL.
* NULL - NULL or NULLS. There is no difference.
* ON - Columns to include for operation
* FROM - If not provided, takes the last


Usage:

`DROP ROWS IF ANY NULLS ON a FROM df`
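
The difference between ANY, ALL, and ON is easiest to see on a small frame. The pandas sketch below mirrors the options with `dropna`; it is an analogue for intuition, not Fugue's implementation, and the sample data is made up.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "a": [1.0, None, 3.0, None],
        "b": ["x", "y", None, None],
    }
)

# Roughly: DROP ROWS IF ANY NULLS ON a FROM df  (only column a is checked)
any_on_a = df.dropna(how="any", subset=["a"])

# Roughly: DROP ROWS IF ALL NULLS FROM df  (row must be entirely NULL)
all_null = df.dropna(how="all")

# Roughly: DROP ROWS IF ANY NULLS FROM df  (one NULL is enough)
any_null = df.dropna(how="any")

print(len(any_on_a), len(all_null), len(any_null))
```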


## FILL

Fills NULL values in a DataFrame

* NULL - NULL or NULLS. There is no difference.
* PARAMS - Pairs of column_name:value
* FROM - If not provided, takes the last


Usage:

`FILL NULLS PARAMS a:99, b:-99 FROM df`
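
As a rough pandas analogue, the usage line above behaves like a per-column `fillna`. The sketch assumes that columns not listed in PARAMS are left untouched; the sample frame is made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "a": [1.0, None],
        "b": [None, 2.0],
        "c": [None, "z"],
    }
)

# Roughly: FILL NULLS PARAMS a:99, b:-99 FROM df
filled = df.fillna({"a": 99, "b": -99})

print(filled)
```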


## Sampling

## SAMPLE

Usage:

`SAMPLE [REPLACE] method [SEED int] [FROM dataframe]`

## TAKE

Usage:

`TAKE [int (ROW|ROWS)] [FROM dataframe] [PREPARTITION statement|PRESORT statement] [(NULL|NULLS) (FIRST|LAST)]`
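
The TAKE syntax above does not spell out its semantics. Assuming it keeps the first N rows of each partition after the presort is applied, a pandas analogue might look like the sketch below. Both the FugueSQL statement in the comment and the sample data are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ["A", "2020-01-01", 10.0],
        ["A", "2020-01-03", 30.0],
        ["B", "2020-01-01", 20.0],
        ["B", "2020-01-03", 40.0],
    ],
    columns=["id", "date", "value"],
)

# Assumed reading of:
#   TAKE 1 ROW FROM df PREPARTITION BY id PRESORT date DESC
latest = (
    df.sort_values("date", ascending=False)  # presort: newest first
    .groupby("id")                           # one partition per id
    .head(1)                                 # keep the first row of each
    .sort_values("id")
    .reset_index(drop=True)
)

print(latest)
```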

## ZIP

Usage:

`ZIP dataframes [how] [BY columns] [PRESORT statement]`

## PERSIST, CHECKPOINT, and YIELD

Usage:

`[LAZY] PERSIST [(params)]`

or

`[LAZY] WEAK CHECKPOINT [(params)]`

or

`[LAZY] [STRONG] CHECKPOINT [PREPARTITION statement] [SINGLE] [(params)]`

or

`[LAZY] DETERMINISTIC CHECKPOINT [namespace] [PREPARTITION statement] [SINGLE] [(params)] [YIELD [AS name]]`

or

`YIELD [AS name]`

## BROADCAST

Usage:

`BROADCAST`