# Pre-process node inputs

You can apply functions to the dataframes of a node's dependencies before they are made available to the node.
Suppose `nodeC` has `nodeA` as a dependency. Once `nodeA` outputs its dataframe, `flypipe` checks whether a preprocess function is set on that dependency; if one exists, it applies the preprocess function and passes the result on to `nodeC`.

Here is the syntax to activate the preprocess:

```python

def my_preprocess_func(df):
    ...
    return df

@node(
    ...
    dependencies=[
        nodeA.preprocess(my_preprocess_func).alias("df_nodeA"),
        nodeB.alias("df_nodeB"),
    ]
)
def nodeC(df_nodeA, df_nodeB):
    ...
    
```

## Dataframes processing and flows between nodes

### Dependencies have no cache and no preprocess
<hr/>

Example:

```python

@node(
    ...
    dependencies=[
        nodeA.alias("df_nodeA"),
        nodeB.alias("df_nodeB"),
    ]
)
def nodeC(df_nodeA, df_nodeB):
    ...
    
```

**Nodes sequence processing:**

1. process nodeA
2. process nodeB
3. pass nodeA and nodeB dataframes to nodeC


### Dependencies have cache and no preprocess
<hr/>

```python

@node(...cache=MyCache(...))
def nodeA(...):
      ...

@node(
    ...
    dependencies=[
        nodeA.alias("df_nodeA"),
        nodeB.alias("df_nodeB"),
    ]
)
def nodeC(df_nodeA, df_nodeB):
    ...
    
```

**Nodes sequence processing:**

1. process nodeA
    - if nodeA cache does not exist:
        - runs nodeA
        - caches (writes) nodeA dataframe
    - if nodeA cache exists:
        - reads cache
2. process nodeB
3. pass both nodeA and nodeB dataframes to nodeC
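
The cache branch in step 1 can be pictured with a small plain-Python model. This is only a sketch of the read-or-compute logic described above: the `cache_store` dict and the `run_with_cache` helper are hypothetical stand-ins, not `flypipe` functions, and lists of dicts replace dataframes so the snippet runs without extra packages.

```python
# Conceptual model only: a dict stands in for a real cache backend, and
# run_with_cache is a hypothetical helper, not part of flypipe.
calls = []        # records every time the node body really runs
cache_store = {}  # stand-in for whatever storage a cache class manages


def node_a():
    calls.append("nodeA")
    return [{"id": 1}, {"id": 2}]


def run_with_cache(name, func):
    """Return the cached rows if present, otherwise run the node and cache them."""
    if name in cache_store:      # cache exists: read it and skip the node
        return cache_store[name]
    rows = func()                # cache does not exist: run the node ...
    cache_store[name] = rows     # ... and write its output to the cache
    return rows


first = run_with_cache("nodeA", node_a)   # miss: runs nodeA, writes cache
second = run_with_cache("nodeA", node_a)  # hit: reads cache only
print(calls)  # ['nodeA']
```

The node body runs once; the second call is served from the cache.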

### Dependencies have cache and preprocess
<hr/>

```python

@node(...cache=MyCache(...))
def nodeA(...):
      ...

@node(
    ...
    dependencies=[
        nodeA.preprocess(my_preprocess_func).alias("df_nodeA"),
        nodeB.preprocess(my_preprocess_func).alias("df_nodeB"),
    ]
)
def nodeC(df_nodeA, df_nodeB):
    ...
    
```

**Nodes sequence processing:**

1. process nodeA
    - if nodeA cache does not exist:
        - runs nodeA
        - caches (writes) nodeA dataframe
    - if nodeA cache exists:
        - reads cache
2. process nodeB
3. apply the preprocessing function to the nodeA and nodeB dataframes, then pass the resulting dataframes on to nodeC
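
Read literally, the sequence above means the cache holds the node output as it was before preprocessing, and the preprocess function runs afterwards on every run, whether the data came from the node or from the cache. The sketch below models that ordering in plain Python. All names (`cache_store`, `only_recent`, `dependency_input`) are hypothetical and only illustrate the order of the steps.

```python
# Hypothetical stand-ins: cache_store plays the cache and only_recent plays
# my_preprocess_func. None of these names come from flypipe itself.
cache_store = {}


def node_a():
    return [{"day": 1}, {"day": 3}, {"day": 4}]


def only_recent(rows):
    return [row for row in rows if row["day"] >= 3]


def dependency_input(name, func, preprocess):
    if name not in cache_store:           # step 1: run and cache, or read cache
        cache_store[name] = func()
    return preprocess(cache_store[name])  # step 3: preprocess after the cache


df_node_a = dependency_input("nodeA", node_a, only_recent)
print(df_node_a)                   # [{'day': 3}, {'day': 4}]
print(len(cache_store["nodeA"]))   # 3
```

The cached rows stay unfiltered, so another node can read the same cache with a different preprocess function, or with none.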

## Preprocess on different dependencies
<hr/>

```python

@node(
    ...
    dependencies=[
        nodeA.preprocess(my_preprocess_func).alias("df_nodeA")
    ]
)
def nodeB(df_nodeA):
    ...


@node(
    ...
    dependencies=[
        nodeA.alias("df_nodeA"),
        nodeB.alias("df_nodeB"),
    ]
)
def nodeC(df_nodeA, df_nodeB):
    ...
    
```

**Nodes sequence processing:**

1. process nodeA (df_nodeA)
2. apply the preprocessing function to df_nodeA (df_nodeA_preprocessed) and pass it on to nodeB
3. process nodeB (df_nodeB)
4. pass df_nodeA and df_nodeB on to nodeC

**_NOTE:_** This example shows that the preprocessing is applied only on the dependency from nodeB to nodeA: df_nodeA_preprocessed is what nodeB receives, while df_nodeA itself is not affected and is passed unchanged to nodeC.
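
The same flow can be traced by hand with ordinary functions. This is a minimal sketch that uses plain lists in place of dataframes; the function bodies are invented for illustration and only the wiring mirrors the example above.

```python
# Plain functions standing in for the nodes; the bodies are made up.
def node_a():
    return [1, 2, 3, 4]


def my_preprocess_func(rows):
    return [value for value in rows if value > 2]


def node_b(df_nodeA):
    return [value * 10 for value in df_nodeA]


def node_c(df_nodeA, df_nodeB):
    return {"from_a": df_nodeA, "from_b": df_nodeB}


df_nodeA = node_a()
df_nodeA_preprocessed = my_preprocess_func(df_nodeA)  # used by nodeB only
df_nodeB = node_b(df_nodeA_preprocessed)
result = node_c(df_nodeA, df_nodeB)                   # nodeC gets the raw df_nodeA
print(result)  # {'from_a': [1, 2, 3, 4], 'from_b': [30, 40]}
```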


## Working example

```python
from flypipe import node
import pandas as pd
from datetime import datetime

@node(type="pandas")
def raw_sales():
    return pd.DataFrame(data={
        "product": ["apple", "banana", "orange"],
        "price": [5.33, 1.2, 7.5],
        "datetime_sale": [datetime(2025, 1, 1, 10, 55, 32), datetime(2025, 1, 3, 13, 15, 22), datetime(2025, 1, 4, 1, 5, 1)]
    })

df = raw_sales.run()
display(df)
```

|   | product | price | datetime_sale       |
|---|---------|-------|---------------------|
| 0 | apple   | 5.33  | 2025-01-01 10:55:32 |
| 1 | banana  | 1.2   | 2025-01-03 13:15:22 |
| 2 | orange  | 7.5   | 2025-01-04 01:05:01 |


### Preprocess function

```python
def cdc_changes(df):
    sales_from_datetime = datetime(2025, 1, 3, 0, 0, 0)
    print(f"==> Getting cdc_changes from {sales_from_datetime}")
    return df[df['datetime_sale'] >= sales_from_datetime]
```

```python
@node(
    type="pandas",
    dependencies=[
        raw_sales.preprocess(cdc_changes).alias("df_raw")
    ]
)
def sales(df_raw):
    return df_raw

df = sales.run()
display(df)
```

```
==> Getting cdc_changes from 2025-01-03 00:00:00
```

|   | product | price | datetime_sale       |
|---|---------|-------|---------------------|
| 1 | banana  | 1.2   | 2025-01-03 13:15:22 |
| 2 | orange  | 7.5   | 2025-01-04 01:05:01 |


## Disabling preprocessing

### All nodes dependencies

```python
from flypipe.mode import PreProcessMode

@node(
    type="pandas",
    dependencies=[
        raw_sales.preprocess(cdc_changes).alias("df_raw")
    ]
)
def sales(df_raw):
    return df_raw

df = sales.run(preprocess=PreProcessMode.DISABLE)
display(df)
```

|   | product | price | datetime_sale       |
|---|---------|-------|---------------------|
| 0 | apple   | 5.33  | 2025-01-01 10:55:32 |
| 1 | banana  | 1.2   | 2025-01-03 13:15:22 |
| 2 | orange  | 7.5   | 2025-01-04 01:05:01 |


### Specific node dependencies

```python
from flypipe.mode import PreProcessMode

@node(
    type="pandas",
    dependencies=[
        raw_sales.preprocess(cdc_changes).alias("df_raw")
    ]
)
def sales(df_raw):
    return df_raw

df = sales.run(preprocess={
    # node: {node_dependency: PreProcessMode.DISABLE}
    sales: {raw_sales: PreProcessMode.DISABLE}
})
display(df)
```

|   | product | price | datetime_sale       |
|---|---------|-------|---------------------|
| 0 | apple   | 5.33  | 2025-01-01 10:55:32 |
| 1 | banana  | 1.2   | 2025-01-03 13:15:22 |
| 2 | orange  | 7.5   | 2025-01-04 01:05:01 |


## Enable preprocess for all dependencies by default

Setting the environment variables `FLYPIPE_DEFAULT_DEPENDENCIES_PREPROCESS_MODULE` and `FLYPIPE_DEFAULT_DEPENDENCIES_PREPROCESS_FUNCTION` tells `flypipe` to apply the given function to all dependencies of all nodes.

**⚠️ Important:** a preprocess defined at the node dependency level takes precedence over the default preprocess function!

For example, if your function import looks like:

`from my_project.utils import global_preprocess`

the environment variables would look like:

```
FLYPIPE_DEFAULT_DEPENDENCIES_PREPROCESS_MODULE=my_project.utils
FLYPIPE_DEFAULT_DEPENDENCIES_PREPROCESS_FUNCTION=global_preprocess
```
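
How a module/function pair like this can be turned into a callable is sketched below with `importlib`. This is an assumption about the general mechanism, not a description of `flypipe`'s internals; `load_default_preprocess` is a hypothetical helper, and the demo points the two variables at the standard library function `copy.copy` so the snippet runs without `my_project` being installed.

```python
import importlib
import os


def load_default_preprocess(environ=os.environ):
    """Resolve the default preprocess function if both variables are set."""
    module_name = environ.get("FLYPIPE_DEFAULT_DEPENDENCIES_PREPROCESS_MODULE")
    function_name = environ.get("FLYPIPE_DEFAULT_DEPENDENCIES_PREPROCESS_FUNCTION")
    if not module_name or not function_name:
        return None
    # Equivalent to `from <module_name> import <function_name>`
    module = importlib.import_module(module_name)
    return getattr(module, function_name)


# Demo values: copy.copy stands in for my_project.utils.global_preprocess.
demo_env = {
    "FLYPIPE_DEFAULT_DEPENDENCIES_PREPROCESS_MODULE": "copy",
    "FLYPIPE_DEFAULT_DEPENDENCIES_PREPROCESS_FUNCTION": "copy",
}
preprocess = load_default_preprocess(demo_env)
print(preprocess([1, 2, 3]))        # [1, 2, 3]
print(load_default_preprocess({}))  # None
```

The module value must be importable from the environment in which the pipeline runs, exactly as it would be in a regular `import` statement.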

**Example**:

```python
import os
from flypipe.config import config_context

@node(
    type="pandas",
    dependencies=[
        raw_sales.alias("df_raw")
    ]
)
def other_sales(df_raw):
    return df_raw


# config_context is used here only to show how the global preprocess works; in production, use environment variables
with config_context(
    default_dependencies_preprocess_module="preprocess_function",
    default_dependencies_preprocess_function="global_preprocess"
):
    df = other_sales.run()
    display(df)
```

```
==> Global Preprocess
```

|   | product | price | datetime_sale       |
|---|---------|-------|---------------------|
| 1 | banana  | 1.2   | 2025-01-03 13:15:22 |
| 2 | orange  | 7.5   | 2025-01-04 01:05:01 |


**⚠️ Important:** a preprocess defined at the node dependency level takes precedence over the default preprocess function!

As you can see below, `flypipe` still uses `cdc_changes` to preprocess the dependency of the `sales` node:

```python
import os
from flypipe.config import config_context

@node(
    type="pandas",
    dependencies=[
        raw_sales.alias("df_raw")
    ]
)
def other_sales(df_raw):
    return df_raw


# config_context is used here only to show how the global preprocess works; in production, use environment variables
with config_context(
    default_dependencies_preprocess_module="preprocess_function",
    default_dependencies_preprocess_function="global_preprocess"
):
    df = sales.run()
    display(df)
```

```
==> Getting cdc_changes from 2025-01-03 00:00:00
```

|   | product | price | datetime_sale       |
|---|---------|-------|---------------------|
| 1 | banana  | 1.2   | 2025-01-03 13:15:22 |
| 2 | orange  | 7.5   | 2025-01-04 01:05:01 |
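
The precedence rule shown by the two runs above can be summarised as a small selection function. This is a hedged model of the documented behaviour only: `choose_preprocess` is a hypothetical name, and the two stand-in functions do nothing but return their input.

```python
# Stand-ins for the functions used in the examples; both are identity functions here.
def cdc_changes(df):
    return df


def global_preprocess(df):
    return df


def choose_preprocess(dependency_level, default):
    """Functions set on the dependency win; the default applies only when none are set."""
    if dependency_level:
        return list(dependency_level)
    return [default] if default is not None else []


sales_choice = choose_preprocess([cdc_changes], global_preprocess)
other_sales_choice = choose_preprocess([], global_preprocess)
print([func.__name__ for func in sales_choice])        # ['cdc_changes']
print([func.__name__ for func in other_sales_choice])  # ['global_preprocess']
```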


## Chaining preprocessing functions

Multiple preprocess functions can be set, i.e. `.preprocess(func1, func2, ...)`.
All preprocess functions are called in the order in which they are defined:
- `.preprocess(func1, func2)` will call `func1`, then the output dataframe from `func1` will be passed to `func2`, and so on.
- `.preprocess(func2, func1)` will call `func2`, then the output dataframe from `func2` will be passed to `func1`, and so on.

**Example**:

```python
def preprocess_1(df):
    datetime_sales = datetime(2025, 1, 3, 0, 0, 0)
    print(f"==> Applying preprocess_1 (filter datetime_sale from `{datetime_sales}`)")
    return df[df['datetime_sale'] >= datetime_sales]

def preprocess_2(df):
    datetime_sales = datetime(2025, 1, 4, 0, 0, 0)
    print(f"==> Applying preprocess_2 (filter datetime_sale from `{datetime_sales}`)")
    return df[df['datetime_sale'] >= datetime_sales]
```

```python
@node(
    type="pandas",
    dependencies=[
        raw_sales.preprocess(preprocess_1, preprocess_2).alias("df_raw")
    ]
)
def chaining(df_raw):
    return df_raw

df = chaining.run()
display(df)
```

```
==> Applying preprocess_1 (filter datetime_sale from `2025-01-03 00:00:00`)
==> Applying preprocess_2 (filter datetime_sale from `2025-01-04 00:00:00`)
```

|   | product | price | datetime_sale       |
|---|---------|-------|---------------------|
| 2 | orange  | 7.5   | 2025-01-04 01:05:01 |


Reversing the order of the preprocess functions reverses the order of the calls:

```python
@node(
    type="pandas",
    dependencies=[
        raw_sales.preprocess(preprocess_2, preprocess_1).alias("df_raw")
    ]
)
def chaining(df_raw):
    return df_raw

df = chaining.run()
display(df)
```

```
==> Applying preprocess_2 (filter datetime_sale from `2025-01-04 00:00:00`)
==> Applying preprocess_1 (filter datetime_sale from `2025-01-03 00:00:00`)
```

|   | product | price | datetime_sale       |
|---|---------|-------|---------------------|
| 2 | orange  | 7.5   | 2025-01-04 01:05:01 |
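
Both orders above return the same row because the two functions are simple filters, and filters commute. Order starts to matter as soon as one function depends on what the other left behind. The sketch below models chaining as left-to-right function composition with `functools.reduce`; `chain`, `from_day_3` and `first_two` are hypothetical names, and plain lists stand in for dataframes.

```python
from functools import reduce


def chain(*funcs):
    """Compose functions left to right: chain(f, g)(x) == g(f(x))."""
    def apply_all(rows):
        return reduce(lambda acc, func: func(acc), funcs, rows)
    return apply_all


def from_day_3(days):
    return [day for day in days if day >= 3]  # a filter


def first_two(days):
    return days[:2]                           # depends on what is left


days = [1, 3, 4]
filter_then_head = chain(from_day_3, first_two)(days)
head_then_filter = chain(first_two, from_day_3)(days)
print(filter_then_head)  # [3, 4]
print(head_then_filter)  # [3]
```

When chaining functions that are not pure filters, pick the order deliberately.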
