# FugueSQL in 10 Minutes

All questions are welcome in the Slack channel.

[![Slack Status](https://img.shields.io/badge/slack-join_chat-white.svg?logo=slack&style=social)](http://slack.fugue.ai)

This is a short introduction to FugueSQL, geared toward new users. FugueSQL is the SQL interface for [Fugue](https://github.com/fugue-project/fugue). The Fugue project aims to make big data effortless by accelerating iteration speed and providing a simpler interface to distributed computing engines.

This tutorial covers only the SQL interface of Fugue.

Fugue is meant for:
1. Data scientists who need to bring business logic written in Python or Pandas to bigger datasets
2. Data practitioners looking to parallelize existing code with distributed computing
3. Data teams that want to reduce the maintenance and testing of boilerplate Spark code

## Installation

There are two things to install. The first is FugueSQL, which is installed as an optional extra of Fugue:

```
pip install fugue[sql]
```

The second is the FugueSQL notebook extension, which works with both Jupyter Notebook and JupyterLab and provides syntax highlighting. To install the extension, use pip:

```
pip install fugue-jupyter
```

Then register the startup script:

```
fugue-jupyter install startup
```

See [this documentation](https://github.com/fugue-project/fugue-jupyter) for more details.

## Setup

```
from fugue_notebook import setup

setup(is_lab=False)
```


## First Query

```
import pandas as pd

df = pd.DataFrame({"col1": ["A", "A", "A", "B", "B", "B"], "col2": [1, 2, 3, 4, 5, 6]})
df2 = pd.DataFrame({"col1": ["A", "B"], "col3": [1, 2]})
```

```
%%fsql
-- these dataframes are taken from the previous Python cell
   SELECT df.col1, df.col2, df2.col3
     FROM df
LEFT JOIN df2
       ON df.col1 = df2.col1
    WHERE df.col1 = "A"
    PRINT
```

|   | col1 | col2 | col3 |
|---|------|------|------|
| 0 | A    | 1    | 1    |
| 1 | A    | 2    | 1    |
| 2 | A    | 3    | 1    |
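To sanity-check the result of the query above without a FugueSQL installation, the same join can be sketched in standard SQL with Python's built-in `sqlite3` module. This is only an illustration of the query logic, not how FugueSQL executes it. The use of `sqlite3` and the in-memory tables are assumptions that mirror the two dataframes. SQLite expects single quotes for string literals, so `'A'` is used here instead of `"A"`.

```python
import sqlite3

# In-memory database standing in for the two pandas dataframes (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df (col1 TEXT, col2 INTEGER)")
conn.execute("CREATE TABLE df2 (col1 TEXT, col3 INTEGER)")
conn.executemany(
    "INSERT INTO df VALUES (?, ?)",
    [("A", 1), ("A", 2), ("A", 3), ("B", 4), ("B", 5), ("B", 6)],
)
conn.executemany("INSERT INTO df2 VALUES (?, ?)", [("A", 1), ("B", 2)])

# Same join and filter as the FugueSQL cell, without the PRINT keyword.
# ORDER BY is added only to make the row order deterministic.
rows = conn.execute(
    """
       SELECT df.col1, df.col2, df2.col3
         FROM df
    LEFT JOIN df2
           ON df.col1 = df2.col1
        WHERE df.col1 = 'A'
     ORDER BY df.col2
    """
).fetchall()

print(rows)  # [('A', 1, 1), ('A', 2, 1), ('A', 3, 1)]
conn.close()
```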


## Saving and Loading Files

```
df.to_parquet("/tmp/df.parquet")
df2.to_parquet("/tmp/df2.parquet")
```

```
%%fsql
df = LOAD "/tmp/df.parquet"
df2 = LOAD "/tmp/df2.parquet"

new =  SELECT df.col1, df.col2, df2.col3
         FROM df
         LEFT JOIN df2
           ON df.col1 = df2.col1
        WHERE df.col1 = "A"

SAVE OVERWRITE "/tmp/res.parquet"
```



## Variable Assignment

```
%%fsql
df = LOAD "/tmp/df.parquet"

max_vals = SELECT col1, MAX(col2) AS max_val
             FROM df
         GROUP BY col1

   SELECT df.col1,
          df.col2 / max_vals.max_val AS normalized
     FROM df
     JOIN max_vals
       ON df.col1 = max_vals.col1
    PRINT
```
    

|   | col1 | normalized |
|---|------|------------|
| 0 | A    | 0.333333   |
| 1 | A    | 0.666667   |
| 2 | A    | 1.0        |
| 3 | B    | 0.666667   |
| 4 | B    | 0.833333   |
| 5 | B    | 1.0        |
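Assigning `max_vals` and reusing it in the next statement is loosely comparable to a common table expression in standard SQL. The sketch below reproduces the normalization with Python's built-in `sqlite3` module, purely as an illustration. It is not how FugueSQL runs the query, and the in-memory table is an assumption standing in for `/tmp/df.parquet`. One difference worth noting: SQLite performs integer division on integer columns, so the sketch casts `col2` to `REAL` to get fractional results.

```python
import sqlite3

# In-memory table standing in for the parquet file (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df (col1 TEXT, col2 INTEGER)")
conn.executemany(
    "INSERT INTO df VALUES (?, ?)",
    [("A", 1), ("A", 2), ("A", 3), ("B", 4), ("B", 5), ("B", 6)],
)

# The CTE plays the role of the max_vals assignment in the FugueSQL cell.
rows = conn.execute(
    """
    WITH max_vals AS (
        SELECT col1, MAX(col2) AS max_val
          FROM df
      GROUP BY col1
    )
      SELECT df.col1,
             CAST(df.col2 AS REAL) / max_vals.max_val AS normalized
        FROM df
        JOIN max_vals
          ON df.col1 = max_vals.col1
    ORDER BY df.col1, df.col2
    """
).fetchall()

# Round to six decimals for display only.
normalized = [(col1, round(value, 6)) for col1, value in rows]
for row in normalized:
    print(row)
conn.close()
```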


## Execution Engine

```
%%fsql duckdb
LOAD "/tmp/df.parquet"
PRINT
```


|   | col1 | col2 |
|---|------|------|
| 0 | A    | 1    |
| 1 | A    | 2    |
| 2 | A    | 3    |
| 3 | B    | 4    |
| 4 | B    | 5    |
| 5 | B    | 6    |


**Spark**

```
%%fsql spark
LOAD "/tmp/df.parquet"
PRINT
```



|   | col1 | col2 |
|---|------|------|
| 0 | A    | 1    |
| 1 | A    | 2    |
| 2 | A    | 3    |
| 3 | B    | 4    |
| 4 | B    | 5    |
| 5 | B    | 6    |


## Using FugueSQL Tables in Python

## Productionizing SQL

## Built-in Commands

## Invoking Python Code

```
%%fsql
   SELECT df.col1, df.col2, df2.col3
     FROM df
LEFT JOIN df2
       ON df.col1 = df2.col1
    WHERE df.col1 = "A"
    PRINT
```

|   | col1 | col2 | col3 |
|---|------|------|------|
| 0 | A    | 1    | 1    |
| 1 | A    | 2    | 1    |
| 2 | A    | 3    | 1    |


## Distributed Computing Commands