# Fugue-sql Syntax


The `fugue-sql` syntax sits somewhere between standard SQL, JSON, and Python. The goals are:
* To be fully compatible with the standard SQL `SELECT` statement
* To minimize syntax overhead, making code as short as possible while keeping it easy to read
* To let users fully describe their compute logic in SQL as opposed to Python

To achieve these goals, `fugue-sql` adds a number of enhancements to standard SQL syntax, which are demonstrated here.

## Hello World

First, we start with the basic `fugue-sql` syntax. We import `fugue_notebook`, which contains a Jupyter notebook extension. Fugue has both a Python interface and a SQL interface, which have equivalent functionality.

The `setup` function in the cell below provides syntax highlighting for `fugue-sql` users. At the moment, syntax highlighting will not work for JupyterLab notebooks.

In [None]:
from fugue_notebook import setup
setup()

In [None]:
%%fsql
CREATE [[0,"hello"],[1,"world"]] SCHEMA number:int,word:str
PRINT

The `CREATE` keyword here is a `fugue-sql` keyword. We'll dive into [extensions](..extensions.ipynb) later and learn more about integrating Python functions into fugue-sql.

## SQL Compliant

All standard SQL keywords are available in `fugue-sql`. In this example, `GROUP BY`, `WHERE`, `SELECT`, `FROM` are all the same as standard SQL.

In [None]:
# Defining data
import pandas as pd
data = pd.DataFrame({"id": ["A","A","A","B","B","B"],
                    "date": ["2020-01-01", "2020-01-02",
                             "2020-01-03", "2020-01-01", 
                             "2020-01-02", "2020-01-03"],
                    "value": [10, None, 30, 20, None, 40]})

In [None]:
%%fsql
SELECT id, date, MIN(value) value
FROM data
WHERE value > 20
GROUP BY id
PRINT

Note that the Pandas DataFrame `data` was accessed inside the SQL expression. DataFrames defined in Python cells are automatically accessible by SQL cells. Other variables need to be passed in through [Jinja templating](../syntax.ipynb). More on this will be shown when we explore how Python and `fugue-sql` interact.

The example above shows the possibility of combining Python and SQL workflows. This is useful if Python needs to connect to other places (AWS S3, Azure Blob Storage, Google Analytics) to retrieve data that is needed for the compute workflow. The data can be loaded in with Python and passed to `%%fsql` cells.
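
For comparison, the same keywords behave the same way in an ordinary SQL engine. The sketch below uses Python's built-in `sqlite3` module rather than Fugue, and the table definition is an assumption that mirrors the `data` DataFrame. The `date` column is left out of the `SELECT` here because it is neither grouped nor aggregated.

```python
import sqlite3

# Same rows as the pandas DataFrame `data` above; None becomes SQL NULL.
rows = [
    ("A", "2020-01-01", 10), ("A", "2020-01-02", None), ("A", "2020-01-03", 30),
    ("B", "2020-01-01", 20), ("B", "2020-01-02", None), ("B", "2020-01-03", 40),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (id TEXT, date TEXT, value INTEGER)")
conn.executemany("INSERT INTO data VALUES (?, ?, ?)", rows)

# WHERE filters rows before grouping; NULL > 20 is not true, so NULL rows are dropped.
result = conn.execute(
    """
    SELECT id, MIN(value) AS value
    FROM data
    WHERE value > 20
    GROUP BY id
    ORDER BY id
    """
).fetchall()
conn.close()

print(result)  # [('A', 30), ('B', 40)]
```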

## Input and Output

Real data work often requires loading data into a `DataFrame` and saving the results. `Fugue` has two keywords for this: `SAVE` and `LOAD`. Using these allows `fugue-sql` users to orchestrate their ETL jobs with SQL logic. A CSV file can be loaded in, transformed, and then saved elsewhere, so full data analysis and transformation workflows can be done in `fugue-sql`.


In [None]:
%%fsql
CREATE [[0,"1"]] SCHEMA a:int,b:str
SAVE OVERWRITE "/tmp/f.parquet"
SAVE OVERWRITE "/tmp/f.csv" (header=true)
SAVE OVERWRITE "/tmp/f.json"
SAVE OVERWRITE PARQUET "/tmp/f"

In [None]:
%%fsql
LOAD "/tmp/f.parquet" PRINT
LOAD "/tmp/f.parquet" COLUMNS a PRINT
LOAD PARQUET "/tmp/f" PRINT
LOAD "/tmp/f.csv" (header=true) PRINT
LOAD "/tmp/f.csv" (header=true) COLUMNS a:int,b:str PRINT
LOAD "/tmp/f.json" PRINT
LOAD "/tmp/f.json" COLUMNS a:int,b:str PRINT

JSON, CSV, and Parquet are the supported file formats. There are plans to support Avro. Notice that parameters such as `(header=true)` can be passed. If running on the default [execution engine](../execution_engine.ipynb), these would be passed on to **Pandas** `read_csv` and `to_csv`. The file extension is used as a hint to choose the appropriate load/save function. If the extension is not present in the filename, the format has to be specified.
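
The extension rule can be pictured with a small sketch. This is not Fugue's actual implementation; `infer_format` and `SUPPORTED_FORMATS` are hypothetical names used only to illustrate the idea that an explicit format wins, and otherwise the extension decides.

```python
import os
from typing import Optional

# Formats named in the text above; the mapping itself is illustrative.
SUPPORTED_FORMATS = {".parquet": "parquet", ".csv": "csv", ".json": "json"}


def infer_format(path: str, explicit: Optional[str] = None) -> str:
    """Return the file format for `path`, preferring an explicit format keyword."""
    if explicit is not None:
        fmt = explicit.lower()
        if fmt not in SUPPORTED_FORMATS.values():
            raise ValueError(f"unsupported format: {explicit}")
        return fmt
    # No explicit format: fall back to the file extension as a hint.
    ext = os.path.splitext(path)[1].lower()
    if ext not in SUPPORTED_FORMATS:
        raise ValueError(f"cannot infer format from {path!r}; specify it explicitly")
    return SUPPORTED_FORMATS[ext]


print(infer_format("/tmp/f.csv"))         # csv
print(infer_format("/tmp/f", "PARQUET"))  # parquet
```

In this sketch, a path such as `/tmp/f` with no explicit format raises an error, which matches the rule that the format must be given when the extension is missing.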

## Variable Assignment

By now it should be clear that `Fugue` extends SQL in order to make it a more complete language. One of the additional features is variable assignment. Along with this, multiple `SELECT` statements can be used in a single cell. This is the equivalent of temp tables or Common Table Expressions (CTEs) in SQL.

In [None]:
df = pd.DataFrame({"number":[0,1],"word":["hello","world"]})

In [None]:
%%fsql
SELECT * FROM df
SAVE OVERWRITE "/tmp/f.csv"(header=true)

a = LOAD "/tmp/f.csv" (header=true)
temp = SELECT * FROM a WHERE number=1
output = SELECT word FROM temp
SAVE OVERWRITE "/tmp/output.csv"(header=true)

new_a = LOAD "/tmp/output.csv"(header=true)
PRINT new_a
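
To make the comparison with CTEs concrete, here is a rough standard-SQL counterpart of the `temp` and `output` assignments above. It is a sketch that runs on Python's built-in `sqlite3` module instead of Fugue; the table `a` is created by hand rather than loaded from a file, and the CTE name `filtered` is an arbitrary stand-in for the `temp` variable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (number INTEGER, word TEXT)")
conn.executemany("INSERT INTO a VALUES (?, ?)", [(0, "hello"), (1, "world")])

# `filtered` plays the role of the fugue-sql variable `temp`.
output = conn.execute(
    """
    WITH filtered AS (SELECT * FROM a WHERE number = 1)
    SELECT word FROM filtered
    """
).fetchall()
conn.close()

print(output)  # [('world',)]
```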

## Execution Engine

So far, we've only dealt with the default [execution engine](../execution_engine.ipynb). If nothing is passed to `%%fsql`, the `NativeExecutionEngine` is used. Similar to the `Fugue` programming interface, the execution engine can easily be changed by passing it to `FugueSQLWorkflow`. In a notebook cell, the engine is named right after `%%fsql`. Below is an example for Spark.

Take note of the output `DataFrame` in the example below. It will be a `SparkDataFrame`.

In [None]:
%%fsql spark
SELECT *, 1 AS one 
FROM df
PRINT

## Anonymity

In `fugue-sql`, one of the simplifications is anonymity. It's optional, but it can often significantly simplify your code and make it more readable.

For a statement that only needs to consume the previous dataframe, a `FROM` keyword is not needed. `PRINT` is the best example. `SAVE` is another example. This can be applied to other keywords as well. In this example we'll use `TAKE`, which simply returns the number of rows specified.

In [None]:
%%fsql
a = SELECT * FROM df
TAKE 2 ROWS PRESORT number DESC          # a is consumed by TAKE
PRINT 
b = SELECT * FROM df
TAKE 2 ROWS FROM b PRESORT number DESC   # equivalent explicit syntax
PRINT
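
As a plain-Python picture of what `TAKE 2 ROWS PRESORT number DESC` does to an unpartitioned dataframe, the sketch below sorts first and then keeps the leading rows. The `take` helper is hypothetical and only mirrors the behavior described above; it is not part of Fugue.

```python
# Rows of `df` from the earlier cell, as (number, word) tuples.
rows = [(0, "hello"), (1, "world")]


def take(rows, n, presort_key, descending=False):
    """Sort rows by `presort_key`, then keep the first `n` (hypothetical helper)."""
    ordered = sorted(rows, key=presort_key, reverse=descending)
    return ordered[:n]


result = take(rows, 2, presort_key=lambda r: r[0], descending=True)
print(result)  # [(1, 'world'), (0, 'hello')]
```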

## Inline Statements

The last enhancement is inline statements. One statement can be nested inside another by wrapping it in `(` `)`. Anonymity and variable assignment often make this unnecessary, but it's good to know that the option exists.

In [None]:
%%fsql
a = CREATE [[0,"hello"], [1,"world"]] SCHEMA number:int,word:str
SELECT *
FROM (TAKE 1 ROW FROM a)
PRINT
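
The closest standard-SQL analogue is a subquery in the `FROM` clause. The sketch below runs on Python's built-in `sqlite3` module rather than Fugue, with `LIMIT 1` standing in for `TAKE 1 ROW`; since no ordering is specified, which row comes back is not guaranteed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (number INTEGER, word TEXT)")
conn.executemany("INSERT INTO a VALUES (?, ?)", [(0, "hello"), (1, "world")])

# The parenthesized SELECT ... LIMIT 1 stands in for `TAKE 1 ROW FROM a`.
result = conn.execute("SELECT * FROM (SELECT * FROM a LIMIT 1)").fetchall()
conn.close()

print(len(result))  # 1
```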

## Passing DataFrames through fugue-sql cells

DataFrames in preceding `fugue-sql` cells cannot be used in later `fugue-sql` cells by default. In order to use them in downstream cells, the DataFrame needs to be yielded with `YIELD DATAFRAME`, as in the example below. This also makes it available in Python cells. For large DataFrames, `YIELD FILE` stores the data in a temporary file location so that it is loaded when used.

In [None]:
%%fsql
a=CREATE [[0,"hello"],[1,"world"]] SCHEMA number:int,word:str
YIELD DATAFRAME AS a

In [None]:
# Using the yielded DataFrame in Python
print(a.as_pandas().head())

In [None]:
%%fsql
b = CREATE [[0,"hello2"],[1,"world2"]] SCHEMA number:int,word2:str

SELECT a.number num, b.word2 
FROM a 
INNER JOIN b
ON a.number = b.number
PRINT

## From notebooks to deployment

While notebooks are good for data exploration and prototyping, some users want to include their `fugue-sql` code in Python scripts. For this, users can use the `fsql` function. Similar to `%%fsql` cells, the execution engine can be defined in the `run` method.

In [None]:
from fugue_sql import fsql

fsql("""
b = CREATE [[0,"hello2"],[1,"world2"]] SCHEMA number:int,word2:str

SELECT a.number num, b.word2 
FROM a 
INNER JOIN b
ON a.number = b.number
PRINT
""").run("spark")

In this tutorial we have gone through how to use standard SQL operations (and more) on top of Pandas, Spark, and Dask. We have also seen enhancements over standard SQL like anonymity and variable assignment.

In a [following section](python.ipynb) we'll look at more ways of integrating Python with `fugue-sql` to extend the capabilities of using SQL.