# FugueSQL

FugueSQL is a SQL interface that runs on top of Pandas, Spark, Dask, and DuckDB. A FugueSQL query is parsed first and then executed by the underlying engine.

In [13]:
import warnings
warnings.filterwarnings(action='ignore')

In [1]:
from fugue_notebook import setup
setup(is_lab=False)

In [2]:
import pandas as pd

df = pd.DataFrame({"col1": ["A","A","A","B","B","B"], "col2": [1,2,3,4,5,6]})
df2 = pd.DataFrame({"col1": ["A", "B"], "col3": [1, 2]})

### Run FugueSQL 

In [3]:
%%fsql
   SELECT df.col1, df.col2, df2.col3
     FROM df
LEFT JOIN df2
       ON df.col1 = df2.col1
    WHERE df.col1 = "A"
    PRINT

  col1  col2  col3
0    A     1     1
1    A     2     1
2    A     3     1


### Using FugueSQL dataframe in Python

In [4]:
%%fsql
SELECT *
  FROM df
 YIELD DATAFRAME AS result

In [5]:
print(type(result))
print(result.native.head())

<class 'fugue.dataframe.pandas_dataframe.PandasDataFrame'>
  col1  col2
0    A     1
1    A     2
2    A     3
3    B     4
4    B     5


### Loading files

In [6]:
%%fsql
df = LOAD "../data/processed.parquet"

new = SELECT *
        FROM df
       YIELD DATAFRAME AS result

In [7]:
print(result.native)

   a  c
0  1  1
1  2  2
2  3  3


FugueSQL also supports common table expressions (CTEs).
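
The notebook does not show a CTE, so as a reminder of the shape of one, here is a minimal sketch in standard SQL. It runs through Python's built-in `sqlite3` module rather than Fugue, and the table contents, the `totals` name, and the `total > 6` filter are made up for illustration. Whether FugueSQL accepts this exact query unchanged is an assumption.

```python
import sqlite3

# Illustration only: SQLite stands in for the engine here, not Fugue.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df (col1 TEXT, col2 INTEGER)")
conn.executemany(
    "INSERT INTO df VALUES (?, ?)",
    [("A", 1), ("A", 2), ("A", 3), ("B", 4), ("B", 5), ("B", 6)],
)

# The WITH clause names an intermediate result (the CTE), which the
# final SELECT then queries like an ordinary table.
query = """
WITH totals AS (
    SELECT col1, SUM(col2) AS total
      FROM df
  GROUP BY col1
)
  SELECT col1, total
    FROM totals
   WHERE total > 6
ORDER BY col1
"""

cte_rows = conn.execute(query).fetchall()
conn.close()
print(cte_rows)
```

Only group `B` passes the filter, because its `col2` values sum to 15 while those of group `A` sum to 6.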

### Using Python code in SQL

In [8]:
df = pd.DataFrame({"col1": ["A","A","A","B","B","B"], "col2": [1,2,3,4,5,6]})

In [9]:
# schema: *+col2:float
def std_dev(df: pd.DataFrame) -> pd.DataFrame:
    # Despite the name, this scales col2 by the maximum of col2 in the frame it receives.
    return df.assign(col2=df['col2']/df['col2'].max())

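The function above is written to handle one group of data at a time: it divides `col2` by the maximum of whatever DataFrame it receives. To apply it per group, we first partition the DataFrame by `col1` with the PREPARTITION keyword, and then run the function on each partition with the TRANSFORM keyword.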

In [10]:
%%fsql
TRANSFORM df PREPARTITION BY col1 USING std_dev
PRINT

  col1      col2
0    A  0.333333
1    A  0.666667
2    A       1.0
3    B  0.666667
4    B  0.833333
5    B       1.0
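
To make the partition-then-transform idea concrete, here is a plain-Python sketch of the same computation on lists of dicts. It is not how Fugue implements `PREPARTITION` and `TRANSFORM`. The helper names `transform_per_partition` and `scale_by_max` are hypothetical and exist only for this illustration.

```python
from collections import defaultdict

# Same data as df above, as a list of row dicts.
rows = [
    {"col1": "A", "col2": 1}, {"col1": "A", "col2": 2}, {"col1": "A", "col2": 3},
    {"col1": "B", "col2": 4}, {"col1": "B", "col2": 5}, {"col1": "B", "col2": 6},
]

def scale_by_max(group):
    """Divide col2 by the largest col2 in this group (same logic as std_dev)."""
    peak = max(r["col2"] for r in group)
    return [{**r, "col2": r["col2"] / peak} for r in group]

def transform_per_partition(rows, key, func):
    """Split rows by `key`, apply `func` to each partition, and concatenate."""
    partitions = defaultdict(list)
    for r in rows:
        partitions[r[key]].append(r)
    out = []
    for k in sorted(partitions):  # sorted only to make the output order stable
        out.extend(func(partitions[k]))
    return out

result = transform_per_partition(rows, "col1", scale_by_max)
for r in result:
    print(r["col1"], round(r["col2"], 6))
```

Because the function only ever sees one partition, the maximum is taken within each `col1` group (3 for `A`, 6 for `B`) rather than over the whole table.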


### Run SQL code using the DuckDB, Spark, or Dask engine

Fugue supports Pandas, Spark, Dask, and DuckDB. For operations on a laptop or single machine, DuckDB may give significant performance improvements over Pandas because it has a query optimizer.

For data that is too large to process on a single machine, Spark or Dask can be used. All we need to do is specify the engine name after `%%fsql` at the top of the cell; the query itself stays the same. For example, to run on DuckDB we can do:

In [11]:
%%fsql duckdb
TRANSFORM df PREPARTITION BY col1 USING std_dev
PRINT

  col1      col2
0    A  0.333333
1    A  0.666667
2    A       1.0
3    B  0.666667
4    B  0.833333
5    B       1.0


In [14]:
%%fsql spark
TRANSFORM df PREPARTITION BY col1 USING std_dev
PRINT

  col1      col2
0    A  0.333333
1    A  0.666667
2    A       1.0
3    B  0.666667
4    B  0.833333
5    B       1.0
