# Execution Engine

The execution engine is the heart of Fugue. It is the layer that unifies the core concepts of distributed computing and separates the underlying computing frameworks from the user's higher-level logic. Normally you don't operate on execution engines directly, but it is good to understand the basics.

In Fugue, **the only dataset type is the schemaed dataframe**, so although there are other important concepts such as `RDD`, we don't touch them. More options can bring more flexibility, but also more confusion. For more background, see [1](https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html) and [2](https://data-flair.training/blogs/apache-spark-rdd-vs-dataframe-vs-dataset/).

However, it is important to understand that you still have full access to the underlying computing framework and can use any of its specific features, including `RDD`. We unify certain things to make them easy and consistent, but we don't block anything else.

## Initialization

Although there are no hard rules for initializing an ExecutionEngine, the general way is to initialize it from configs. The config should describe each component, such as which logger to use, which SQLEngine to use, and other properties.

Here are the best practices for initializing each built-in ExecutionEngine.

```python
from fugue.execution import NativeExecutionEngine

engine = NativeExecutionEngine({"myconfig": "abc"})
assert engine.conf.get_or_throw("myconfig", str) == "abc"
```

```python
from distributed import Client
client = Client()  # without this, Dask is not in distributed mode
from fugue_dask.execution_engine import DaskExecutionEngine

# fugue.dask.dataframe.default.partitions sets the default number of partitions for a new DaskDataFrame
engine = DaskExecutionEngine({"fugue.dask.dataframe.default.partitions": 4})
assert engine.conf.get_or_throw("fugue.dask.dataframe.default.partitions", int) == 4
```

```python
from pyspark.sql import SparkSession
from fugue_spark.execution_engine import SparkExecutionEngine

# This is where you get a Spark session, exactly as you would without Fugue.
# Almost everything can be configured this way, including the running mode (local mode or another mode).
# In my experience, the best approach is to use only this plus spark-defaults.conf to initialize SparkSessions.
spark_session = (SparkSession
                 .builder
                 .config("spark.executor.cores", 4)
                 .config("fugue.dummy", "dummy")
                 .getOrCreate())

engine = SparkExecutionEngine(spark_session, {"additional_conf": "abc"})
assert engine.conf.get_or_throw("spark.executor.cores", int) == 4
assert engine.conf.get_or_throw("fugue.dummy", str) == "dummy"
assert engine.conf.get_or_throw("additional_conf", str) == "abc"
```


A special feature of Fugue is that `engine.conf` is also accessible on workers within all types of Fugue extensions. The engine itself is only accessible on the driver or in driver-side extensions.
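
One way to picture the difference between the config and the engine is serialization: plain config data can be shipped to a worker, while an object holding a live driver-side resource usually cannot. The sketch below uses only the standard library. `DriverOnlyEngine` and `worker_task` are hypothetical names, the lock merely stands in for something like a session handle, and none of this is Fugue's actual mechanism.

```python
import pickle
import threading


class DriverOnlyEngine:
    """Hypothetical stand-in for an execution engine (not a Fugue class)."""

    def __init__(self, conf):
        self.conf = dict(conf)         # plain data, safe to serialize
        self._lock = threading.Lock()  # stands in for a driver-only resource


def worker_task(conf_bytes):
    """Pretend worker: it only ever receives the serialized config."""
    conf = pickle.loads(conf_bytes)
    return conf["myconfig"]


engine = DriverOnlyEngine({"myconfig": "abc"})

# The config survives the round trip to the "worker".
result = worker_task(pickle.dumps(engine.conf))

# The engine as a whole does not: locks cannot be pickled.
try:
    pickle.dumps(engine)
    engine_is_picklable = True
except TypeError:
    engine_is_picklable = False

print(result, engine_is_picklable)
```

Under this assumption, worker-side code should rely only on values read from the config and leave engine calls to driver-side code.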


## Create DataFrame
With an ExecutionEngine, you only need to tell the system that you want to create a DataFrame from certain raw data or a data source. Different ExecutionEngines create different types of DataFrames. `to_df` is the common interface across ExecutionEngines; use it to create DataFrames.

```python
from fugue.execution import NativeExecutionEngine
from fugue_spark.execution_engine import SparkExecutionEngine

engine1 = NativeExecutionEngine()
engine2 = SparkExecutionEngine()  # if spark_session is not provided, it uses the current active session

df1 = engine1.to_df([[0]], "a:int")
df2 = engine2.to_df([[0]], "a:int")

print(type(df1))
print(type(df2))
assert df1.as_array() == df2.as_array()  # both are materialized on the driver, then compared
```

```
<class 'fugue.dataframe.array_dataframe.ArrayDataFrame'>
<class 'fugue_spark.dataframe.SparkDataFrame'>
```
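
The idea of one `to_df` interface producing engine-specific dataframes can be sketched without any framework installed. Everything below is a toy model using hypothetical classes (`LocalToyEngine`, `ChunkedToyEngine`, `ListDataFrame`, `ChunkedDataFrame`). It is not how Fugue implements its engines, and the schema string is stored without being parsed.

```python
from abc import ABC, abstractmethod


class ToyDataFrame(ABC):
    """Hypothetical minimal dataframe: a schema string plus a way to fetch rows."""

    def __init__(self, schema):
        self.schema = schema

    @abstractmethod
    def as_array(self):
        """Return all rows as a list of lists."""


class ListDataFrame(ToyDataFrame):
    """Keeps all rows in a single in-memory list."""

    def __init__(self, rows, schema):
        super().__init__(schema)
        self._rows = [list(r) for r in rows]

    def as_array(self):
        return [list(r) for r in self._rows]


class ChunkedDataFrame(ToyDataFrame):
    """Keeps each row in its own chunk, loosely imitating partitioned storage."""

    def __init__(self, rows, schema):
        super().__init__(schema)
        self._chunks = [[list(r)] for r in rows]

    def as_array(self):
        return [row for chunk in self._chunks for row in chunk]


class ToyEngine(ABC):
    @abstractmethod
    def to_df(self, data, schema):
        """Create this engine's own dataframe type from raw data."""


class LocalToyEngine(ToyEngine):
    def to_df(self, data, schema):
        return ListDataFrame(data, schema)


class ChunkedToyEngine(ToyEngine):
    def to_df(self, data, schema):
        return ChunkedDataFrame(data, schema)


def make_df(engine):
    # Engine-agnostic code: it relies only on the shared to_df interface.
    return engine.to_df([[0]], "a:int")


df1 = make_df(LocalToyEngine())
df2 = make_df(ChunkedToyEngine())
print(type(df1).__name__, type(df2).__name__)
```

The caller in `make_df` never names a concrete dataframe class, which mirrors why the same user logic can run on different engines.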
