# Using Fugue on Databricks

Many Databricks users rely on the `databricks-connect` library to execute Spark commands on a Databricks cluster instead of a local session. `databricks-connect` replaces the local installation of `pyspark` so that `pyspark` code runs on the cluster, letting users work with the cluster directly from their IDE.

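In this tutorial, we will go through the following steps:
1. Create a cluster
2. Install Fugue
3. Use Fugue with `databricks-connect`
4. Add Spark configuration
5. Use Fugue-sql on the cluster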

For more information, check the [Databricks documentation](https://docs.databricks.com/dev-tools/databricks-connect.html) on how to configure `databricks-connect`. Setting up Databricks varies a bit by cloud provider, but in general you want to create a workspace. A workspace looks like the following:

![Databricks workspace](https://cdn-images-1.medium.com/max/1600/1*YUF7X7cLLse1YDy2dTFh1Q.png)

## Create cluster



## Installing Fugue

![Installing Fugue](https://cdn-images-1.medium.com/max/1600/1*z1AO5S17BxWFE1YGwj8RLQ.png)

## Fugue and databricks-connect

With `databricks-connect`, all `pyspark` code runs on the cluster, even a small test, which slows down iteration and adds compute cost. Fugue helps solve this problem by allowing users to use the default `NativeExecutionEngine` for local development and testing. Users can work with sampled files locally and then move the same code to Spark when ready for larger tests. The snippet below demonstrates this.

In [1]:
import pandas as pd
from fugue import FugueWorkflow
from fugue_spark import SparkExecutionEngine

data = pd.DataFrame({'numbers':[1,2,3,4], 'words':['hello','world','apple','banana']})

# schema: *, reversed:str
def reverse_word(df: pd.DataFrame) -> pd.DataFrame:
    df['reversed'] = df['words'].apply(lambda x: x[::-1])
    return df

with FugueWorkflow() as dag:
    df = dag.df(data)
    df = df.transform(reverse_word)
    df.show()

PandasDataFrame
numbers:long|words:str|reversed:str
------------+---------+------------
1           |hello    |olleh       
2           |world    |dlrow       
3           |apple    |elppa       
4           |banana   |ananab      
Total count: 4
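
The run above uses a tiny in-memory frame, while real local development usually works from a sampled file. As a rough sketch, assuming the full data sits in a CSV file with a header row, reservoir sampling can draw a fixed-size random sample in a single pass without loading the whole file. The name `sample_csv` and its parameters are illustrative and are not part of Fugue or `databricks-connect`.

```python
import csv
import random


def sample_csv(src_path, dst_path, n, seed=0):
    """Copy the header and up to n randomly chosen rows from src_path to dst_path.

    Assumes the first line of the source file is a header row.
    Returns the number of data rows written.
    """
    rng = random.Random(seed)
    reservoir = []
    with open(src_path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        for i, row in enumerate(reader):
            if i < n:
                # Fill the reservoir with the first n rows.
                reservoir.append(row)
            else:
                # Replace an existing entry with probability n / (i + 1).
                j = rng.randint(0, i)
                if j < n:
                    reservoir[j] = row
    with open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(header)
        writer.writerows(reservoir)
    return len(reservoir)
```

The sampled file can then be loaded with pandas and passed to the same workflow that later runs on Spark.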



The example DataFrame has just 4 rows, but with `databricks-connect` we would still need to send it to the Spark cluster to run anything on it. Here, we run the test locally first and confirm that it works. After that, we can bring it to Spark with:

In [2]:
# This Pandas DataFrame gets converted to Spark
data = pd.DataFrame({'numbers':[1,2,3,4], 'words':['hello','world','apple','banana']})

with FugueWorkflow(SparkExecutionEngine) as dag:
    df = dag.df(data)
    df = df.transform(reverse_word)
    df.show()

SparkDataFrame
numbers:long|words:str|reversed:str
------------+---------+------------
1           |hello    |olleh       
2           |world    |dlrow       
3           |apple    |elppa       
4           |banana   |ananab      
Total count: 4



No added work is needed. The `SparkExecutionEngine` imports `pyspark`, meaning that it will import the `databricks-connect` configuration under the hood and use the configured cluster. Fugue works with `databricks-connect` seamlessly, allowing for convenient switching between local development and a remote cluster.
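
One way to make this switch without editing code is to read the target from the environment. The sketch below is only an illustration: the variable name `FUGUE_TARGET` and the helper `pick_engine` are hypothetical and are not read by Fugue or `databricks-connect`.

```python
import os

# Hypothetical variable name chosen for this sketch.
ENGINE_ENV_VAR = "FUGUE_TARGET"


def pick_engine(environ=None):
    """Return "native" for local runs or "spark" for the remote cluster."""
    environ = os.environ if environ is None else environ
    target = environ.get(ENGINE_ENV_VAR, "native").strip().lower()
    if target not in ("native", "spark"):
        raise ValueError(
            f"{ENGINE_ENV_VAR} must be 'native' or 'spark', got {target!r}"
        )
    return target
```

In the workflow above, the result of `pick_engine()` could decide whether `SparkExecutionEngine` is passed to `FugueWorkflow` or the default engine is left in place. Defaulting to `"native"` keeps accidental runs off the cluster.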

## Added Configuration

Most `databricks-connect` users set additional Spark configurations on the cluster. If additional configuration is needed in local code, it can be provided with the following syntax:

In [3]:
from pyspark.sql import SparkSession
from fugue_spark import SparkExecutionEngine

spark_session = (SparkSession
                 .builder
                 .config("spark.executor.cores",4)
                 .config("fugue.dummy","dummy")
                 .getOrCreate())

engine = SparkExecutionEngine(spark_session, {"additional_conf":"abc"})
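
When the list of settings grows, it may be easier to keep them in a dictionary and apply them in a loop. This is a minimal sketch: `apply_conf` and `SPARK_CONF` are hypothetical names, and the helper only assumes that the builder has a `config(key, value)` method that returns the builder, as `SparkSession.builder` does.

```python
def apply_conf(builder, conf):
    """Call builder.config(key, value) for every entry and return the builder."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder


# The same two settings as in the cell above, kept in one place.
SPARK_CONF = {
    "spark.executor.cores": 4,
    "fugue.dummy": "dummy",
}

# Usage, assuming pyspark (or databricks-connect) is installed:
#   from pyspark.sql import SparkSession
#   spark_session = apply_conf(SparkSession.builder, SPARK_CONF).getOrCreate()
```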

## Using Fugue-sql on the Cluster

Because Fugue-sql also runs on the `SparkExecutionEngine`, it can be executed on a remote cluster just as easily.

In [4]:
from fugue_sql import fsql

data = pd.DataFrame({'numbers':[1,2,3,4], 'words':['hello','world','apple','banana']})

fsql("""
      SELECT *
        FROM data
      TRANSFORM USING reverse_word
      PRINT"""
).run(SparkExecutionEngine)

SparkDataFrame
numbers:long|words:str|reversed:str
------------+---------+------------
1           |hello    |olleh       
2           |world    |dlrow       
3           |apple    |elppa       
4           |banana   |ananab      
Total count: 4



DataFrames()

Fugue-sql can also be run from a notebook cell through the magic that `fugue_notebook` sets up:

In [1]:
from fugue_notebook import setup
setup()

In [None]:
%%fsql spark
SELECT *
  FROM data
TRANSFORM USING reverse_word
 PRINT

## Conclusion

Here we have shown the pain points of using `databricks-connect`: it slows down developer productivity and increases compute costs. We can address both by toggling between Fugue's default `NativeExecutionEngine` and the `SparkExecutionEngine`. Fugue's `SparkExecutionEngine` will seamlessly use whatever `pyspark` is configured for the user.

Fugue also allows for additional configuration of the underlying frameworks. We showed the syntax for passing a `SparkSession` to the `SparkExecutionEngine`.