# Transformations

Have questions? Chat with us on GitHub or Slack:

[![Homepage](https://img.shields.io/badge/fugue-source--code-red?logo=github)](https://github.com/fugue-project/fugue)
[![Slack Status](https://img.shields.io/badge/slack-join_chat-white.svg?logo=slack&style=social)](http://slack.fugue.ai)

We already saw some Fugue API functions, including `transform()`, `save()`, `load()`, and `show()`. This section covers the other functions available under the Fugue API. The functions shown here do not accept an `engine` argument; they operate on whatever input DataFrame is passed (Pandas, Spark, Dask, or Ray). Details of the individual functions can be found in the [Fugue API documentation](https://fugue.readthedocs.io/en/latest/top_api.html#transformation).

## Setup

In [21]:
import pandas as pd
import fugue.api as fa 
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = pd.DataFrame({"a": [1,2,3], "b": ["Apple", "Banana", "Carrot"]})
sdf = spark.createDataFrame(df)


## Alter Columns

Takes a Fugue schema expression and updates the column types.

In [22]:
fa.alter_columns(sdf, "a:float")

DataFrame[a: float, b: string]

## Drop Columns

Drops columns from a DataFrame. 

In [23]:
fa.drop_columns(df, ["a"])

Unnamed: 0,b
0,Apple
1,Banana
2,Carrot


## Head

Returns the first `n` rows of the DataFrame.

In [24]:
fa.head(df, n=2)

Unnamed: 0,a,b
0,1,Apple
1,2,Banana


## Rename

Takes in a dictionary mapping to rename columns of the DataFrame.

In [25]:
fa.rename(df, {"a": "_a"})

Unnamed: 0,_a,b
0,1,Apple
1,2,Banana
2,3,Carrot


## Select Columns

Takes a list of columns to return.

In [26]:
fa.select_columns(df, ["b"])

Unnamed: 0,b
0,Apple
1,Banana
2,Carrot


## Distinct

Returns distinct rows of a DataFrame.

In [30]:
temp = pd.DataFrame({"a": [1,1]})
fa.distinct(temp)

Unnamed: 0,a
0,1


## Dropna

Drops records with NA values. This function accepts additional keyword arguments; see the Fugue API documentation for the full list.

In [None]:
temp = pd.DataFrame({"a": [None,1]})
fa.dropna(temp)


Other functions in this group include `fillna()`, `sample()`, and `take()`.

## SELECT Query

Before continuing, read [SQLEngine](./execution_engine.ipynb#SQLEngine) to understand the concept. Note that at this abstraction layer there is no [FugueSQL](./sql.ipynb); the SELECT statement must be accepted by the specified SQLEngine.

[FugueSQL](./sql.ipynb) builds on this feature, but it offers much more than that.

In [None]:
from fugue import FugueWorkflow, SqliteEngine
from fugue_spark import SparkExecutionEngine

dag = FugueWorkflow()
a=dag.df([[0,1],[1,2]],"a:long,b:long")
b=dag.df([[1,1],[2,2]],"a:long,c:long")

# See how the dependencies are represented in the select function.
# If you write "SELECT * FROM a" directly, the dependency cannot be identified and an error is thrown.
dag.select("SELECT * FROM",a).show()
dag.select("SELECT * FROM",b,"WHERE c=2").show()
dag.select("SELECT a.*,c FROM",a," AS a INNER JOIN",b," AS b ON a.a=b.a").show()

# Force using SqliteEngine regardless of the ExecutionEngine
dag.select("SELECT a.*,c FROM",a," AS a INNER JOIN",b," AS b ON a.a=b.a", sql_engine=SqliteEngine).show(title="Force using SqliteEngine regardless of the ExecutionEngine")


dag.run()
dag.run(SparkExecutionEngine)

## Lazy Evaluation

Most of the examples above apply the Fugue API functions to Pandas DataFrames. They also work for Spark, Dask, and Ray. Note that the distributed backends are lazy, so we need to call `fa.show()` to trigger execution. For example:

In [None]:
fa.show(fa.drop_columns(sdf, ["a"]))