# X-like Objects

Fugue lets you initialize many of its built-in objects from a variety of inputs. This tutorial walks through each of them.

## Schema

Fugue defines a compact syntax for representing a schema: a list of columns separated by `,`, where each column is written as a name and type pair, `<name>:<type expression>`.

For example: `a:int,b:str` or `a:int,b_array:[int],c_dict:{x:int,y:str}`
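
To make the syntax concrete, here is a minimal sketch that splits a schema expression into its top-level `(name, type expression)` pairs. The helper `split_schema` is hypothetical and exists only for illustration; it is not how Fugue parses schemas, and it assumes that `[...]` and `{...}` are the only nesting delimiters.

```python
def split_schema(expr):
    """Split a schema expression into top-level (name, type expression) pairs.

    Hypothetical helper for illustration only, not part of Fugue.
    """
    pairs, depth, start = [], 0, 0
    # A trailing comma lets the loop flush the last column as well.
    for i, ch in enumerate(expr + ","):
        if ch in "[{":
            depth += 1  # entering a nested type expression
        elif ch in "]}":
            depth -= 1  # leaving a nested type expression
        elif ch == "," and depth == 0:
            # Only commas outside brackets separate columns.
            name, _, type_expr = expr[start:i].partition(":")
            pairs.append((name.strip(), type_expr.strip()))
            start = i + 1
    return pairs


print(split_schema("a:int,b:str"))
print(split_schema("a:int,b_array:[int],c_dict:{x:int,y:str}"))
```

The second call yields three columns, because the commas and colons inside `{x:int,y:str}` belong to the nested type expression rather than to the top-level column list.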

In [None]:
from fugue import Schema

print(Schema("a:int,b:str"))
print(Schema("a:int32,b_array:[int64],c_dict:{x:int,y:string}"))

# get pyarrow schema
schema = Schema(" a : int , b : str") # space is ok
print("pa schema", schema.pa_schema)

# more ways to initialize a Fugue Schema
print(Schema(schema.pa_schema)) # from a pyarrow schema
print(Schema(c=str,d=int)) # pythonic way, using keyword arguments
print(Schema(dict(c=str,d=int))) # pythonic way, using a dict
print(Schema("e:str","f:str")) # from several separate expressions
print(Schema(["e:str","f:str"], ("g",int))) # mix a list of expressions with a (name, type) tuple; notice int in python means long in schema
print(Schema(Schema("a:int","b:str"))) # from another Schema

## Parameters

`ParamDict` is less flexible: like a Python dict, it only accepts a dict or a list of tuples. `ParamDict` is itself a Python dict.

In [None]:
from triad.collections import ParamDict

print(ParamDict())
print(ParamDict(dict(a=1,b="d")))
print(ParamDict([("a",1),("b","d")]))
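
Because `ParamDict` mirrors the constructor forms of the built-in `dict`, the two inputs above can be compared using plain Python. The sketch below uses only the built-in `dict`, not `ParamDict`, so it shows the behavior being mirrored rather than `ParamDict` itself.

```python
# Built-in dict only: the same two constructor forms ParamDict accepts.
from_dict = dict(dict(a=1, b="d"))          # from a dict
from_pairs = dict([("a", 1), ("b", "d")])   # from a list of (key, value) tuples

# Both forms describe the same key/value pairs.
print(from_dict == from_pairs)
print(from_pairs)
```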

## DataFrame

Normally, you should create a dataframe through an [ExecutionEngine](execution_engine.ipynb) or a [FugueWorkflow](dag.ipynb). In general, all execution engines and workflows accept a list or iterable of Python arrays (one array per row), a pandas dataframe, or a Fugue dataframe.

In [None]:
from fugue import ExecutionEngine, FugueWorkflow, NativeExecutionEngine, PandasDataFrame
from fugue_dask import DaskExecutionEngine
import pandas as pd

def construct_df_by_execution_engine(eng: ExecutionEngine):
    eng.to_df([[0]], "a:int", {"x":1}).show(title="from array")
    df = PandasDataFrame([[0]], "a:int")
    eng.to_df(df).show(title="from fugue dataframe")
    eng.to_df(df.as_pandas()).show(title="from pandas dataframe")

construct_df_by_execution_engine(NativeExecutionEngine())
construct_df_by_execution_engine(DaskExecutionEngine())  # notice that the dataframe types change with the engine

print("-----------------------------------")

def construct_df_by_workflow(eng: ExecutionEngine):
    with FugueWorkflow(eng) as dag:
        dag.df([[0]], "a:int", {"x":1}).show(title="from array")
        df = PandasDataFrame([[0]], "a:int")
        dag.df(df).show(title="from fugue dataframe")
        dag.df(df.as_pandas()).show(title="from pandas dataframe")

construct_df_by_workflow(NativeExecutionEngine())
construct_df_by_workflow(DaskExecutionEngine())  # notice that the dataframe types change with the engine

## Partition

In [None]:
from fugue import PartitionSpec

assert PartitionSpec().empty # an empty partition spec means no partitioning operation is needed, so it can serve as the default value
PartitionSpec(num=4)
PartitionSpec(algo="even",num=4,by=["a","b"],presort="c,d desc") # presort "c,d desc" means c ASC, d DESC

# num can be an expression, where ROWCOUNT stands for the row count of the dataframe being partitioned
# for a dataframe with 1000 rows, this asks for 100 even partitions, i.e. 10 rows per partition
PartitionSpec(algo="even",num="ROWCOUNT/10")

PartitionSpec({"num":4, "by":["a","b"]}) # from dict; passing a dict for `partition-like` parameters is common
PartitionSpec('{"num":4}') # from a JSON string

a = PartitionSpec(num=4)
b = PartitionSpec(by=["a"])
c = PartitionSpec(a,b) # combine

p = PartitionSpec(num=4, by=["a"])
PartitionSpec(p, by=["a","b"], algo="even") # override
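
The arithmetic behind the `ROWCOUNT/10` expression above can be sketched in plain Python. The helper `resolve_num` below is hypothetical and is not Fugue's implementation; it assumes the expression contains only numbers, the `ROWCOUNT` keyword, and the operators `+ - * /`.

```python
import ast
import operator

# Hypothetical helper for illustration only; this is not how Fugue evaluates `num`.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _eval(node):
    """Evaluate a tiny arithmetic AST made of numbers and + - * /."""
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
    raise ValueError("unsupported expression")


def resolve_num(expr, row_count):
    """Substitute ROWCOUNT with the row count and evaluate to a partition count."""
    text = str(expr).replace("ROWCOUNT", str(row_count))
    return int(_eval(ast.parse(text, mode="eval").body))


row_count = 1000
num_partitions = resolve_num("ROWCOUNT/10", row_count)
rows_per_partition = row_count // num_partitions
print(num_partitions, rows_per_partition)
```

With 1000 rows, the expression resolves to 100 partitions, so an even split puts 10 rows in each partition.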