# Extensions

The `FugueWorkflow` object creates a Directed Acyclic Graph (DAG) whose nodes are DataFrames connected by extensions. Extensions are pieces of code that create, modify, or output DataFrames. The `transformer` we have been using is one example of an extension. In this section, we'll cover the other types of extensions: `creator`, `processor`, `outputter`, and `cotransformer`. For all extensions, the schema has to be defined. The types of extensions are shown below.

<img src="../images/extensions.svg" width="700">

`outputtransformer` and `outputcotransformer` will be covered in the Advanced section.

We have already seen some of the built-in extensions that come with Fugue. For example, `load` is a `creator` and `save` is an `outputter`. There is also a difference between `Driver side` and `Worker side` extensions, which will be covered in the Advanced section. For now, we'll just look at the syntax and use case for each extension.

## Creator

A `creator` is an extension that takes no DataFrame as input but returns a DataFrame as output. It is used to generate DataFrames. Custom creators can load data from different sources, such as AWS S3 or a database accessed through pyodbc. Similar to the `transformer` in the previous section, a `creator` can be defined with the schema hint comment or with the `@creator` decorator. `pd.DataFrame` is a special output type that does not require a schema. For other output type hints, the schema is unknown, so it needs to be defined.



In [9]:
import pandas as pd
from fugue import FugueWorkflow
from typing import List, Dict, Any

def create_data() -> pd.DataFrame:
    df = pd.DataFrame({'a': [1,2,3], 'b': [2,3,4]})
    return df

# schema: a:int, b:int
def create_data2() -> List[Dict[str, Any]]:
    df = [{'a':1, 'b':2}, {'a':2, 'b':3}]
    return df

with FugueWorkflow() as dag:
    df = dag.create(create_data)
    df2 = dag.create(create_data2)
    df2.show()

IterableDataFrame
a:int|b:int
-----+-----
1    |2    
2    |3    
Total count: 2
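
As a minimal sketch of a custom creator that loads from an external source, the function below parses CSV text and returns rows that match its schema hint. `SAMPLE_CSV` and `create_from_csv` are hypothetical names, and the in-memory string only stands in for a real source such as S3 or a database. The function is called directly here as plain Python; the assumption is that it could be passed to `dag.create` in the same way as `create_data2` above.

```python
import csv
import io
from typing import Any, Dict, List

# Hypothetical stand-in for an external source such as a file on S3.
SAMPLE_CSV = "a,b\n1,2\n2,3\n"

# schema: a:int, b:int
def create_from_csv() -> List[Dict[str, Any]]:
    # A creator takes no DataFrame input; it only produces rows.
    reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
    # csv yields strings, so cast each value to match the schema hint.
    return [{"a": int(row["a"]), "b": int(row["b"])} for row in reader]

rows = create_from_csv()
print(rows)
```

Because the function has no Fugue dependency, it can be checked on its own before it is added to a workflow.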



## Processor

A `processor` is an extension that takes in one or more DataFrames and outputs one DataFrame. As with the `creator`, the schema does not need to be specified for `pd.DataFrame` output because it is already known. It does need to be specified for other output types.

In [15]:
# schema: a:double, b:int
def create_data2() -> List[Dict[str, Any]]:
    df = [{'a':None, 'b':2}, {'a':2, 'b':3}]
    return df

def concat(df1:pd.DataFrame, df2:pd.DataFrame) -> pd.DataFrame:
    return pd.concat([df1,df2]).reset_index(drop=True)

# schema: a:double, b:double
def fillna(df:List[Dict[str,Any]], n=0) -> List[Dict[str,Any]]:
    # replace missing (falsy) values in column a with n, row by row
    for row in df:
        row['a'] = row['a'] or n
    return df

with FugueWorkflow() as dag:
    df = dag.create(create_data2)
    df2 = dag.create(create_data2)
    df3 = dag.process(df, df2, using=concat)
    df3 = dag.process(df3, using=fillna, params={'n': 10})
    df3.show()
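
The row-wise logic of a processor that works on `List[Dict[str, Any]]` can be checked as a plain Python function before it runs inside a workflow. The sketch below uses a hypothetical `fill_missing_a` function and assumes missing values arrive as `None`. It indexes each `row` rather than the list itself, returns the rows, and tests for `None` explicitly so that a legitimate `0` is not overwritten.

```python
from typing import Any, Dict, List

# schema: a:double, b:double
def fill_missing_a(df: List[Dict[str, Any]], n: float = 0) -> List[Dict[str, Any]]:
    # Build new rows instead of mutating the input list.
    # Only None counts as missing, so a value of 0 is preserved.
    return [
        {**row, "a": n if row["a"] is None else row["a"]}
        for row in df
    ]

rows = [{"a": None, "b": 2}, {"a": 2, "b": 3}, {"a": 0, "b": 4}]
filled = fill_missing_a(rows, n=10)
print(filled)
```

The assumption is that a function of this shape could be passed to `dag.process` through `using=` with `params={'n': 10}`, in the same way as the other processors in this section.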