# Labrea
https://github.com/8451/labrea

- Q: What is Labrea?
- A: A framework for declarative, functional dataset definitions.

What does that mean? Let's look at an example.

In [1]:
import pandas as pd

stores = pd.read_csv('stores.csv')
transactions = pd.read_csv('transactions.csv')
display(stores)
display(transactions.head(5))

|   | store_id | region |
|---|----------|--------|
| 0 | 1 | North |
| 1 | 2 | North |
| 2 | 3 | South |
| 3 | 4 | South |
| 4 | 5 | East |
| 5 | 6 | East |
| 6 | 7 | West |
| 7 | 8 | West |


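|   | transaction_id | store_id | amount | date |
|---|----------------|----------|--------|------|
| 0 | 1 | 2 | 120.5 | 2023-04-01 |
| 1 | 2 | 4 | 95.75 | 2023-04-01 |
| 2 | 3 | 1 | 210.0 | 2023-04-01 |
| 3 | 4 | 3 | 180.25 | 2023-04-01 |
| 4 | 5 | 5 | 300.5 | 2023-04-01 |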


## The simple approach
We can put everything into one file, but that's not ideal.

In [2]:
import pandas as pd

stores = pd.read_csv('stores.csv')
transactions = pd.read_csv('transactions.csv')
combined_transactions = pd.merge(transactions, stores, on='store_id')

sales_by_region = combined_transactions.groupby('region')['amount'].sum().reset_index()
display(sales_by_region)
top_region = sales_by_region.loc[sales_by_region['amount'].idxmax()]['region']
print(f'Top region is: {top_region}')

sales_in_top_region = combined_transactions[combined_transactions['region'] == top_region]
sales_in_top_region_top_store = sales_in_top_region.groupby('store_id')['amount'].sum().reset_index()
display(sales_in_top_region_top_store)
top_store_id, top_store_amount = sales_in_top_region_top_store.loc[sales_in_top_region_top_store['amount'].idxmax()]
print(f'The top store in the top region is store number {int(top_store_id)} with {top_store_amount} sales')

|   | region | amount |
|---|--------|--------|
| 0 | East | 6152.75 |
| 1 | North | 6032.0 |
| 2 | South | 6279.0 |
| 3 | West | 6203.0 |


Top region is: South


|   | store_id | amount |
|---|----------|--------|
| 0 | 3 | 3331.0 |
| 1 | 4 | 2948.0 |


The top store in the top region is store number 3 with 3331.0 sales


That example had some extra code to help show the data. Let's cut it down to just what's required for our ultimate answer.

In [3]:
import pandas as pd

stores = pd.read_csv('stores.csv')
transactions = pd.read_csv('transactions.csv')
combined_transactions = pd.merge(transactions, stores, on='store_id')

top_region = combined_transactions.groupby('region')['amount'].sum().idxmax()

sales_in_top_region_top_store = combined_transactions[combined_transactions['region'] == top_region].groupby('store_id')['amount'].sum().reset_index()
top_store_id, top_store_amount = sales_in_top_region_top_store.loc[sales_in_top_region_top_store['amount'].idxmax()]

print(f'The top store in the top region is store number {int(top_store_id)} with {top_store_amount} sales')

The top store in the top region is store number 3 with 3331.0 sales


## A little bit better

Our first implementation would be hard to test, so let's rewrite it in a functional style.

In [4]:
import pandas as pd

def get_stores(input_path: str = 'stores.csv') -> pd.DataFrame:
    return pd.read_csv(input_path)

def get_transactions(input_path: str = 'transactions.csv') -> pd.DataFrame:
    return pd.read_csv(input_path)

def combine_transactions(stores: pd.DataFrame, transactions: pd.DataFrame) -> pd.DataFrame:
    return pd.merge(transactions, stores, on='store_id')

def find_top_region(combined_transactions: pd.DataFrame) -> str:
    return combined_transactions.groupby('region')['amount'].sum().idxmax()

def find_top_store_in_top_region(combined_transactions: pd.DataFrame, top_region: str) -> tuple[int, float]:
    sales_in_top_region_top_store = combined_transactions[combined_transactions['region'] == top_region].groupby('store_id')['amount'].sum().reset_index()
    top_store_id, top_store_amount = sales_in_top_region_top_store.loc[sales_in_top_region_top_store['amount'].idxmax()]
    return int(top_store_id), top_store_amount

stores = get_stores()
transactions = get_transactions()
combined_transactions = combine_transactions(stores, transactions)
top_region = find_top_region(combined_transactions)
top_store, top_store_amount = find_top_store_in_top_region(combined_transactions, top_region)

print(f'The top store in the top region is store number {top_store} with {top_store_amount} sales')

The top store in the top region is store number 3 with 3331.0 sales


Our functional code would be much easier to test, but invoking it still required us to chain together a bunch of calls while saving intermediate values.
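To make the testability point concrete, here is a minimal sketch of a unit test for two of the functions above. The fixture data and the `test_find_top_region` name are made up for illustration; the point is only that each function can be exercised with small in-memory DataFrames, without touching any CSV files.

```python
import pandas as pd


def combine_transactions(stores: pd.DataFrame, transactions: pd.DataFrame) -> pd.DataFrame:
    return pd.merge(transactions, stores, on='store_id')


def find_top_region(combined_transactions: pd.DataFrame) -> str:
    return combined_transactions.groupby('region')['amount'].sum().idxmax()


def test_find_top_region() -> None:
    # Hypothetical fixture data, small enough to check by hand.
    stores = pd.DataFrame({'store_id': [1, 2], 'region': ['North', 'South']})
    transactions = pd.DataFrame({
        'transaction_id': [1, 2, 3],
        'store_id': [1, 2, 2],
        'amount': [10.0, 4.0, 5.0],
    })
    combined = combine_transactions(stores, transactions)
    # North totals 10.0 and South totals 9.0, so North should win.
    assert find_top_region(combined) == 'North'


test_find_top_region()
```

A test runner such as pytest would pick up `test_find_top_region` by name; calling it directly, as here, works just as well in a notebook.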

We also have a number of implicit dependencies between these functions. In this simple project it's easy to tell that `combine_transactions()` uses the results from `get_stores()` and `get_transactions()`. Across a much larger codebase, these relationships become harder to trace.

We could make the dependencies explicit by calling base functions directly within the functions where they're used, but then we're redoing work, because every function re-runs the dependencies it shares with others. Let's write a quick decorator to demonstrate that. For our first implementation, we should see each function being called only once.

In [5]:
from functools import wraps
import pandas as pd

def log_func_call(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        print(f'{func.__name__} called')
        return func(*args, **kwargs)
    return wrapper

@log_func_call
def get_stores(input_path: str = 'stores.csv') -> pd.DataFrame:
    return pd.read_csv(input_path)

@log_func_call
def get_transactions(input_path: str = 'transactions.csv') -> pd.DataFrame:
    return pd.read_csv(input_path)

@log_func_call
def combine_transactions(stores: pd.DataFrame, transactions: pd.DataFrame) -> pd.DataFrame:
    return pd.merge(transactions, stores, on='store_id')

@log_func_call
def find_top_region(combined_transactions: pd.DataFrame) -> str:
    return combined_transactions.groupby('region')['amount'].sum().idxmax()

@log_func_call
def find_top_store_in_top_region(combined_transactions: pd.DataFrame, top_region: str) -> tuple[int, float]:
    sales_in_top_region_top_store = combined_transactions[combined_transactions['region'] == top_region].groupby('store_id')['amount'].sum().reset_index()
    top_store_id, top_store_amount = sales_in_top_region_top_store.loc[sales_in_top_region_top_store['amount'].idxmax()]
    return int(top_store_id), top_store_amount

stores = get_stores()
transactions = get_transactions()
combined_transactions = combine_transactions(stores, transactions)
top_region = find_top_region(combined_transactions)
top_store, top_store_amount = find_top_store_in_top_region(combined_transactions, top_region)

print()
print(f'The top store in the top region is store number {top_store} with {top_store_amount} sales')

get_stores called
get_transactions called
combine_transactions called
find_top_region called
find_top_store_in_top_region called

The top store in the top region is store number 3 with 3331.0 sales


Now let's rewrite those functions to explicitly call their dependencies. The relationships are now clearer, and we don't need to chain together our intermediate steps. We're using a declarative style, but now we're duplicating function calls, as the decorator's output shows.

In [6]:
import pandas as pd

@log_func_call
def get_stores(input_path: str = 'stores.csv') -> pd.DataFrame:
    return pd.read_csv(input_path)

@log_func_call
def get_transactions(input_path: str = 'transactions.csv') -> pd.DataFrame:
    return pd.read_csv(input_path)

@log_func_call
def combine_transactions() -> pd.DataFrame:
    return pd.merge(get_transactions(), get_stores(), on='store_id')

@log_func_call
def find_top_region() -> str:
    return combine_transactions().groupby('region')['amount'].sum().idxmax()

@log_func_call
def find_top_store_in_top_region() -> tuple[int, float]:
    combined_transactions = combine_transactions()
    top_region = find_top_region()
    sales_in_top_region_top_store = combined_transactions[combined_transactions['region'] == top_region].groupby('store_id')['amount'].sum().reset_index()
    top_store_id, top_store_amount = sales_in_top_region_top_store.loc[sales_in_top_region_top_store['amount'].idxmax()]
    return int(top_store_id), top_store_amount

top_store, top_store_amount = find_top_store_in_top_region()

print()
print(f'The top store in the top region is store number {top_store} with {top_store_amount} sales')

find_top_store_in_top_region called
combine_transactions called
get_transactions called
get_stores called
find_top_region called
combine_transactions called
get_transactions called
get_stores called

The top store in the top region is store number 3 with 3331.0 sales
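One way to see what removing the duplicated calls involves is to memoize each function by hand. The sketch below does that with the standard library's `functools.cache`, using simplified in-memory stand-ins for the functions above; the data and the `count_calls` helper are made up for illustration. It is plain process-wide memoization, which gets awkward as soon as inputs such as file paths need to vary between runs.

```python
from functools import cache, wraps

# How many times each undecorated function body actually runs.
call_counts: dict[str, int] = {}


def count_calls(func):
    @wraps(func)
    def wrapper():
        call_counts[func.__name__] = call_counts.get(func.__name__, 0) + 1
        return func()
    return wrapper


@cache
@count_calls
def get_stores() -> dict[int, str]:
    # Stand-in for reading stores.csv: store_id -> region
    return {1: 'North', 2: 'North', 3: 'South', 4: 'South'}


@cache
@count_calls
def get_transactions() -> tuple[tuple[int, float], ...]:
    # Stand-in for reading transactions.csv: (store_id, amount)
    return ((1, 10.0), (3, 7.0), (4, 6.0), (2, 1.0))


@cache
@count_calls
def combine_transactions() -> tuple[tuple[int, str, float], ...]:
    stores = get_stores()
    return tuple((sid, stores[sid], amount) for sid, amount in get_transactions())


@cache
@count_calls
def find_top_region() -> str:
    totals: dict[str, float] = {}
    for _, region, amount in combine_transactions():
        totals[region] = totals.get(region, 0.0) + amount
    return max(totals, key=totals.get)


@cache
@count_calls
def find_top_store_in_top_region() -> tuple[int, float]:
    top_region = find_top_region()
    totals: dict[int, float] = {}
    for sid, region, amount in combine_transactions():
        if region == top_region:
            totals[sid] = totals.get(sid, 0.0) + amount
    top_store = max(totals, key=totals.get)
    return top_store, totals[top_store]


result = find_top_store_in_top_region()
print(result)
print(call_counts)
```

Each function body runs once even though `combine_transactions()` is requested twice. The cost is that the cached values live for the whole process and the functions can no longer take arguments such as an input path without complicating the cache.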


## Using Labrea

It would be nice if we could be functional, declarative, and not do unnecessary rework. Labrea can help. We'll use the `@dataset` decorator to annotate our functions. When one dataset depends on another, we'll set the upstream dataset as the default value of the corresponding argument, which explicitly denotes the dependency between our datasets. This is somewhat similar to FastAPI's dependency injection.

You can also see that where datasets need other arguments, we're using Labrea's `Option` class to reference those. We won't be calling those earlier steps explicitly, so instead we'll need to make sure we pass in all of the required values as part of an option config. That config will be passed as an argument to the ultimate dataset that we're calling.

In [8]:
import pandas as pd
from labrea import dataset, Option

@dataset
def get_stores(input_path: str = Option('PATHS.STORES')) -> pd.DataFrame:
    return pd.read_csv(input_path)

@dataset
def get_transactions(input_path: str = Option('PATHS.TRANSACTIONS')) -> pd.DataFrame:
    return pd.read_csv(input_path)

@dataset
def combine_transactions(stores: pd.DataFrame = get_stores, transactions: pd.DataFrame = get_transactions) -> pd.DataFrame:
    return pd.merge(transactions, stores, on='store_id')

@dataset
def find_top_region(combined_transactions: pd.DataFrame = combine_transactions) -> str:
    return combined_transactions.groupby('region')['amount'].sum().idxmax()

@dataset
def find_top_store_in_top_region(combined_transactions: pd.DataFrame = combine_transactions, top_region: str = find_top_region) -> tuple[int, float]:
    sales_in_top_region_top_store = combined_transactions[combined_transactions['region'] == top_region].groupby('store_id')['amount'].sum().reset_index()
    top_store_id, top_store_amount = sales_in_top_region_top_store.loc[sales_in_top_region_top_store['amount'].idxmax()]
    return int(top_store_id), top_store_amount

app_config = {
    'PATHS': {
        'STORES': 'stores.csv',
        'TRANSACTIONS': 'transactions.csv',
    }
}

top_store, top_store_amount = find_top_store_in_top_region(app_config)

print(f'The top store in the top region is store number {top_store} with {top_store_amount} sales')

The top store in the top region is store number 3 with 3331.0 sales
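The dotted option keys in the cell above line up with the nesting of `app_config`: `'PATHS.STORES'` corresponds to `app_config['PATHS']['STORES']`. As a rough mental model only, that lookup can be sketched in plain Python. The `resolve_option` helper below is hypothetical and is not part of Labrea or a description of how `Option` is implemented.

```python
from typing import Any


def resolve_option(config: dict[str, Any], key: str) -> Any:
    """Hypothetical helper: walk a nested dict using a dotted key."""
    value: Any = config
    for part in key.split('.'):
        # A missing level raises KeyError, which surfaces a misconfigured option early.
        value = value[part]
    return value


app_config = {
    'PATHS': {
        'STORES': 'stores.csv',
        'TRANSACTIONS': 'transactions.csv',
    }
}

stores_path = resolve_option(app_config, 'PATHS.STORES')
transactions_path = resolve_option(app_config, 'PATHS.TRANSACTIONS')
print(stores_path, transactions_path)
```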


To demonstrate that Labrea is caching the results, so that we're no longer duplicating work, we can add our logging decorator back in:

In [9]:
import pandas as pd
from labrea import dataset, Option

@dataset
@log_func_call
def get_stores(input_path: str = Option('PATHS.STORES')) -> pd.DataFrame:
    return pd.read_csv(input_path)

@dataset
@log_func_call
def get_transactions(input_path: str = Option('PATHS.TRANSACTIONS')) -> pd.DataFrame:
    return pd.read_csv(input_path)

@dataset
@log_func_call
def combine_transactions(stores: pd.DataFrame = get_stores, transactions: pd.DataFrame = get_transactions) -> pd.DataFrame:
    return pd.merge(transactions, stores, on='store_id')

@dataset
@log_func_call
def find_top_region(combined_transactions: pd.DataFrame = combine_transactions) -> str:
    return combined_transactions.groupby('region')['amount'].sum().idxmax()

@dataset
@log_func_call
def find_top_store_in_top_region(combined_transactions: pd.DataFrame = combine_transactions, top_region: str = find_top_region) -> tuple[int, float]:
    sales_in_top_region_top_store = combined_transactions[combined_transactions['region'] == top_region].groupby('store_id')['amount'].sum().reset_index()
    top_store_id, top_store_amount = sales_in_top_region_top_store.loc[sales_in_top_region_top_store['amount'].idxmax()]
    return int(top_store_id), top_store_amount

app_config = {
    'PATHS': {
        'STORES': 'stores.csv',
        'TRANSACTIONS': 'transactions.csv',
    }
}

top_store, top_store_amount = find_top_store_in_top_region(app_config)

print()
print(f'The top store in the top region is store number {top_store} with {top_store_amount} sales')

get_stores called
get_transactions called
combine_transactions called
find_top_region called
find_top_store_in_top_region called

The top store in the top region is store number 3 with 3331.0 sales
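To build intuition for how dependencies declared as default arguments can be resolved without repeating work, here is a toy model in plain Python. Everything in it is an assumption made for illustration: `ToyDataset`, the in-memory data, and the choice of a cache scoped to one top-level evaluation are all hypothetical, and none of it is Labrea's actual implementation.

```python
import inspect
from typing import Any, Callable, Dict, List, Optional

# Records the order in which function bodies actually run.
calls: List[str] = []


class ToyDataset:
    """Hypothetical stand-in for a dataset; not Labrea's implementation."""

    def __init__(self, func: Callable[..., Any]) -> None:
        self.func = func

    def __call__(self, _cache: Optional[Dict["ToyDataset", Any]] = None) -> Any:
        # A fresh cache is created for each top-level evaluation.
        cache = {} if _cache is None else _cache
        if self not in cache:
            params = inspect.signature(self.func).parameters
            # Any default that is itself a ToyDataset is a declared dependency:
            # evaluate it (or reuse its cached result) and inject it by name.
            kwargs = {
                name: param.default(cache)
                for name, param in params.items()
                if isinstance(param.default, ToyDataset)
            }
            calls.append(self.func.__name__)
            cache[self] = self.func(**kwargs)
        return cache[self]


@ToyDataset
def get_stores():
    return {1: 'North', 2: 'South', 3: 'South'}  # store_id -> region


@ToyDataset
def get_transactions():
    return [(1, 5.0), (2, 4.0), (3, 3.0)]  # (store_id, amount)


@ToyDataset
def combine_transactions(stores=get_stores, transactions=get_transactions):
    return [(sid, stores[sid], amount) for sid, amount in transactions]


@ToyDataset
def find_top_region(combined=combine_transactions):
    totals: Dict[str, float] = {}
    for _, region, amount in combined:
        totals[region] = totals.get(region, 0.0) + amount
    return max(totals, key=totals.get)


@ToyDataset
def find_top_store_in_top_region(combined=combine_transactions, top_region=find_top_region):
    totals: Dict[int, float] = {}
    for sid, region, amount in combined:
        if region == top_region:
            totals[sid] = totals.get(sid, 0.0) + amount
    top_store = max(totals, key=totals.get)
    return top_store, totals[top_store]


result = find_top_store_in_top_region()
print(result)
print(calls)
```

In this toy, `combine_transactions` is requested by two downstream datasets but its body runs once, because the second request finds the result in the shared cache. Calling `find_top_store_in_top_region()` a second time would start with an empty cache and recompute everything.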



So that's Labrea. What we've seen here is a subset of the features of Labrea. It also has:
- Dataset classes
- Abstract/parameterizable datasets
- Pipelines of sequential transformations
- Overloads
- Callbacks

Even if you don't use *this* library, I think a lot of us in the data space could benefit from adopting a declarative style for our data processes.